Generative Recommender with End-to-End Learnable Item Tokenization
TL;DR Summary
ETEGRec integrates item tokenization with generative recommendation training in an end-to-end framework, leveraging dual encoder-decoder architecture and alignment strategies to enhance recommendation accuracy and training stability.
Abstract
Generative recommendation systems have gained increasing attention as an innovative approach that directly generates item identifiers for recommendation tasks. Despite their potential, a major challenge is the effective construction of item identifiers that align well with recommender systems. Current approaches often treat item tokenization and generative recommendation training as separate processes, which can lead to suboptimal performance. To overcome this issue, we introduce ETEGRec, a novel End-To-End Generative Recommender that unifies item tokenization and generative recommendation into a cohesive framework. Built on a dual encoder-decoder architecture, ETEGRec consists of an item tokenizer and a generative recommender. To enable synergistic interaction between these components, we propose a recommendation-oriented alignment strategy, which includes two key optimization objectives: sequence-item alignment and preference-semantic alignment. These objectives tightly couple the learning processes of the item tokenizer and the generative recommender, fostering mutual enhancement. Additionally, we develop an alternating optimization technique to ensure stable and efficient end-to-end training of the entire framework. Extensive experiments demonstrate the superior performance of our approach compared to traditional sequential recommendation models and existing generative recommendation baselines. Our code is available at https://github.com/RUCAIBox/ETEGRec.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Generative Recommender with End-to-End Learnable Item Tokenization." It focuses on improving generative recommendation systems by integrating the process of item tokenization directly into the recommendation training framework.
1.2. Authors
The authors are:
- Enze Liu (Gaoling School of Artificial Intelligence, Renmin University of China)
- Bowen Zheng (Gaoling School of Artificial Intelligence, Renmin University of China)
- Cheng Ling (Kuaishou Technology)
- Lantao Hu (Kuaishou Technology)
- Han Li (Kuaishou Technology)
- Wayne Xin Zhao (Gaoling School of Artificial Intelligence, Renmin University of China) - Corresponding author.
Their affiliations indicate a collaboration between academic institutions (Renmin University of China) and industry (Kuaishou Technology), suggesting a blend of theoretical rigor and practical application in the research.
1.3. Journal/Conference
The paper is published at SIGIR '25: The 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 13-18, 2025, Padua, Italy. SIGIR is a highly reputable and influential conference in the field of information retrieval, often considered a top-tier venue for research in recommender systems, search engines, and related areas. Its acceptance indicates significant academic merit and impact potential.
1.4. Publication Year
The paper is published in 2025. The "Published at (UTC)" timestamp from the abstract (2024-09-09T12:11:53.000Z) indicates it was first made available as a preprint (likely on arXiv) in September 2024, ahead of its formal publication in the SIGIR 2025 proceedings.
1.5. Abstract
This paper introduces ETEGRec, an End-To-End Generative Recommender that addresses the challenge of effective item identifier construction in generative recommendation systems. Existing methods often separate item tokenization and generative recommendation training, leading to suboptimal performance. ETEGRec unifies these two processes within a cohesive framework built on a dual encoder-decoder architecture, comprising an item tokenizer and a generative recommender. To foster synergistic interaction, the authors propose a recommendation-oriented alignment strategy with two key objectives: sequence-item alignment and preference-semantic alignment. These objectives tightly couple the learning processes, promoting mutual enhancement. An alternating optimization technique is developed for stable and efficient end-to-end training. Extensive experiments demonstrate ETEGRec's superior performance compared to traditional sequential recommendation models and existing generative recommendation baselines.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2409.05546
- PDF Link: https://arxiv.org/pdf/2409.05546v3.pdf
- Publication Status: The paper is available as a preprint on arXiv (first posted September 9, 2024; the PDF link above points to version 3) and is slated for official publication at SIGIR 2025.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve lies within the emerging field of generative recommendation systems. These systems approach recommendation by directly generating item identifiers (tokens) rather than just predicting a rank from a fixed set of items. While promising, a major challenge in this paradigm is the effective construction of these item identifiers—how to represent items as tokens in a way that aligns well with the recommendation task.
Why is this problem important?
Current approaches typically treat item tokenization (the process of converting an item into a sequence of tokens) and generative recommendation training as two separate, decoupled processes.
- Suboptimal Tokenization: The item tokenizer might be trained independently, often unaware of the specific optimization objectives of the downstream recommender. This can lead to tokens that are not optimally suited for recommendation.
- Limited Knowledge Fusion: The generative recommender cannot deeply integrate or refine the implicit knowledge encoded in the item representations generated by a pre-trained tokenizer.

These issues hinder the full potential of generative recommendation. The paper's entry point and innovative idea is to address this decoupling by developing an end-to-end generative recommendation framework where item tokenization and autoregressive generation are jointly optimized, allowing them to mutually enhance each other.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Novel End-to-End Generative Recommender (ETEGRec): Introduction of a new framework that achieves mutual enhancement and joint optimization of item tokenization and autoregressive generation. This unifies previously separate processes into a single, cohesive system.
- Recommendation-Oriented Alignment Approach: Design of a novel alignment strategy that facilitates synergistic learning between the item tokenizer and the generative recommender. This strategy includes two key objectives: sequence-item alignment and preference-semantic alignment. These objectives ensure that the learned item tokens are highly relevant to the recommendation task and that the recommender effectively utilizes the semantic information.
- Alternating Optimization Technique: Development of a stable and efficient alternating optimization method for the end-to-end training of the ETEGRec framework, addressing the challenges of jointly optimizing two complex, interacting components.

The key finding is that this end-to-end approach, with its tailored alignment strategies and optimization technique, significantly outperforms traditional sequential recommendation models and existing generative recommendation baselines across various benchmarks. Specifically, it demonstrates superior performance on metrics like Recall@K and NDCG@K, and shows improved generalizability to unseen users. These findings indicate that tightly coupling item tokenization with generative recommendation training is crucial for achieving state-of-the-art performance in this paradigm.
3. Prerequisite Knowledge & Related Work
This section provides foundational concepts and reviews related work to contextualize ETEGRec.
3.1. Foundational Concepts
To understand ETEGRec, a reader should be familiar with the following concepts:
3.1.1. Sequential Recommendation
Sequential recommendation is a subfield of recommender systems that focuses on predicting a user's next interaction (e.g., purchasing an item, watching a video) based on their historical sequence of interactions. Unlike traditional recommendation, which might consider all past interactions equally, sequential recommendation emphasizes the order and temporal dependencies of user behaviors. For example, if a user buys a phone, they are more likely to buy a phone case next than a completely unrelated item. Models in this area aim to capture sequential patterns or dynamic user preferences.
3.1.2. Generative Recommendation
Generative recommendation is an emerging paradigm that frames the recommendation task as a sequence generation problem, similar to how large language models generate text. Instead of predicting a score for each item from a fixed catalog (discriminative approach) or retrieving items, generative recommenders directly generate the identifiers (tokens) of the items to be recommended. This allows for potentially more flexible item representation and generation of novel recommendations.
3.1.3. Item Tokenization
Item tokenization is the process of converting an item (e.g., a product, movie, song) into a sequence of discrete tokens. In natural language processing, words are tokenized into subword units or characters. Similarly, in generative recommendation, an item's unique ID or features are mapped to a sequence of semantic tokens. This allows the generative model, often based on language models, to "understand" and "generate" items in a structured, compositional manner. The length of these token sequences (denoted as $L$) can be fixed or variable, and the tokens themselves are drawn from a codebook or vocabulary.
3.1.4. Transformer Model
The Transformer is a neural network architecture introduced in 2017, primarily known for its effectiveness in sequence-to-sequence tasks, especially in natural language processing (NLP). It relies entirely on attention mechanisms (specifically self-attention) to draw global dependencies between input and output. The core of a Transformer is its encoder-decoder structure.
- Encoder: Processes the input sequence, creating a rich contextual representation for each element. It typically consists of multiple layers, each with a multi-head self-attention mechanism and a feed-forward network.
- Decoder: Generates the output sequence one element at a time, taking the encoder's output and previously generated elements as input. It also has multi-head self-attention (masked to prevent attending to future tokens) and an encoder-decoder attention layer to focus on relevant parts of the encoder's output.

A key component is the self-attention mechanism:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where:
- $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
- $d_k$ is the dimension of the key vectors, used to scale the dot products and stabilize gradients.
- The softmax function normalizes the attention scores.

This allows the model to weigh the importance of different parts of the input sequence when processing each element. Models like T5 (Text-To-Text Transfer Transformer) are variants of the Transformer architecture, designed for a wide range of NLP tasks, including sequence generation.
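To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len) scaled scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of values

# toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 8)
```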
3.1.5. Residual Quantization Variational Autoencoder (RQ-VAE)
A Variational Autoencoder (VAE) is a type of generative model that learns a compressed, continuous latent representation of input data. It consists of an encoder that maps input to a probability distribution in the latent space and a decoder that reconstructs the input from a sample from this latent distribution. The training objective involves a reconstruction loss (to ensure the output is similar to the input) and a KL divergence term (to ensure the latent distribution is close to a prior distribution, usually a standard normal).
Vector Quantization (VQ) is a method to map continuous vector inputs to discrete representations. In VQ-VAE, the encoder maps input to a continuous latent vector, which is then quantized by finding the closest vector in a fixed-size codebook (a set of learnable embedding vectors). The decoder then reconstructs from this quantized vector.
Residual Quantization (RQ) extends VQ by applying quantization in multiple stages. Instead of quantizing the entire vector at once, RQ quantizes a residual error at each stage:
- Quantize the original input vector to get the first codebook vector.
- Calculate the residual (difference) between the original vector and the first codebook vector.
- Quantize this residual using a second codebook.
- Repeat for $L$ levels.

This hierarchical approach allows for a more detailed and expressive quantization using a sequence of $L$ tokens, which is particularly useful for item tokenization as it can represent items with varying levels of granularity from coarse to fine.
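A minimal NumPy sketch of this multi-level residual quantization may help; the codebook sizes and dimensions below are illustrative placeholders, not the paper's configuration:

```python
import numpy as np

def residual_quantize(z, codebooks):
    """z: (d,) latent vector; codebooks: list of L arrays, each of shape (K, d).
    Returns token ids [c_1..c_L] and the quantized reconstruction."""
    tokens, residual, quantized = [], z.copy(), np.zeros_like(z)
    for codebook in codebooks:                            # one quantization level per codebook
        dists = ((residual - codebook) ** 2).sum(axis=1)  # squared distance to each code
        c = int(dists.argmin())                           # nearest code = chosen token
        tokens.append(c)
        quantized += codebook[c]                          # accumulate chosen code embeddings
        residual -= codebook[c]                           # next level quantizes what is left
    return tokens, quantized

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 128)) for _ in range(3)]  # L=3 levels, K=256 codes each
tokens, r_tilde = residual_quantize(rng.normal(size=128), codebooks)
print(tokens)  # e.g. [17, 203, 88]
```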
3.1.6. Kullback-Leibler (KL) Divergence
The Kullback-Leibler (KL) Divergence, often denoted as $D_{KL}(P \| Q)$, is a non-symmetric measure of how one probability distribution $P$ differs from a second, reference probability distribution $Q$. It quantifies the information lost when $Q$ is used to approximate $P$.
For discrete probability distributions $P(x)$ and $Q(x)$ over the same event space $X$:
$
D_{KL}(P \| Q) = \sum_{x \in X} P(x) \log\left(\frac{P(x)}{Q(x)}\right)
$
Where:
- $P(x)$ is the probability of event $x$ in distribution $P$.
- $Q(x)$ is the probability of event $x$ in distribution $Q$.

A lower KL divergence value indicates that the two distributions are more similar. The paper uses a symmetric combination of KL divergences for sequence-item alignment, i.e., $D_{KL}(P \| Q) + D_{KL}(Q \| P)$, to ensure that both distributions are encouraged to be similar to each other.
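For illustration, a short NumPy computation of the KL divergence and the symmetrized sum used later for sequence-item alignment; the distributions are toy values:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float((p * np.log(p / q)).sum())

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl(p, q), kl(q, p))      # asymmetric: the two directions differ
print(kl(p, q) + kl(q, p))     # symmetric sum, as used for L_SIA
```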
3.1.7. InfoNCE Loss (NT-Xent Loss)
InfoNCE (Information Noise-Contrastive Estimation) loss, also known as NT-Xent (Normalized Temperature-scaled Cross-Entropy) loss, is a popular self-supervised learning objective used to learn rich representations by maximizing the agreement between different augmented views of the same data point (positive pairs) while simultaneously pushing apart representations of different data points (negative pairs). It's commonly used in contrastive learning.
The general form for a positive pair $(x_i, x_j)$ within a batch of $N$ samples (where the other samples act as negatives) is:
$
\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(x_i, x_j) / \tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(x_i, x_k) / \tau)}
$
Where:
- $\mathrm{sim}(\cdot, \cdot)$ is a similarity function (e.g., cosine similarity) between two representations.
- $\tau$ is a temperature parameter that controls the steepness of the similarity distribution. A small $\tau$ makes the model more sensitive to small differences, encouraging stronger separation.
- The sum in the denominator includes the positive pair and all negative pairs, effectively performing a softmax over similarity scores.

The paper uses InfoNCE for preference-semantic alignment to align user preference representations with item semantic representations.
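A minimal PyTorch sketch of InfoNCE with in-batch negatives (a generic contrastive loss, not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def info_nce(x, y, tau=0.07):
    """x, y: (B, d) batches where (x[i], y[i]) are positive pairs;
    all other in-batch pairings act as negatives."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.T / tau                     # (B, B) cosine similarities / temperature
    targets = torch.arange(x.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, targets)    # softmax over one positive + B-1 negatives

x, y = torch.randn(32, 128), torch.randn(32, 128)
print(info_nce(x, y).item())
```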
3.2. Previous Works
The paper categorizes related work into Sequential Recommendation and Generative Recommendation.
3.2.1. Traditional Sequential Recommendation
This paradigm focuses on predicting the next item based on a user's historical interaction sequence.
- Markov Chains (e.g., [20]): Early approaches assuming item transitions follow a Markov process.
- Neural Network-based Models:
  - RNN-based (e.g., GRU4Rec [7], [25]): Use Recurrent Neural Networks to model sequential dependencies.
  - CNN-based (e.g., Caser [26]): Apply Convolutional Neural Networks to capture local patterns in sequences.
  - GNN-based (e.g., [1], [33]): Utilize Graph Neural Networks to model complex relationships within interaction graphs.
  - Transformer-based (e.g., SASRec [10], BERT4Rec [22], FMLP-Rec [43], FDSA [38], S3-Rec [42]): These models leverage the Transformer architecture's self-attention mechanism for powerful sequence modeling.
    - SASRec [10]: Uses a unidirectional Transformer decoder.
    - BERT4Rec [22]: Employs bidirectional attention with a masked item prediction task.
    - S3-Rec [42]: Incorporates mutual information maximization for pre-training.
    - FMLP-Rec [43]: An all-MLP architecture for sequential recommendation.
    - FDSA [38]: Models item-level and feature-level sequences with self-attention.
- Textual/Side Feature Exploitation (e.g., [34], [38]): Enhance representations using rich textual features of users and items.
3.2.2. Generative Recommendation
This newer paradigm tokenizes item sequences and uses generative models to predict target item tokens autoregressively. It generally involves two main processes: item tokenization and generative recommendation itself.
3.2.2.1. Item Tokenization
- Parameter-free methods:
  - Co-occurrence matrix (e.g., CID [9], GPTRec [16]): Apply matrix factorization or graph-based clustering on item co-occurrence graphs to derive identifiers. Often simple and efficient but may lack deep collaborative semantics.
  - Clustering of item embeddings (e.g., SEATER [21], EAGER [32]): Group items with similar embeddings to form hierarchical identifiers.
  - Textual metadata (e.g., [2], [6], IDGenRec [24], LlamaRec [36]): Use item titles, descriptions, or other textual features directly as tokens or to derive them.
- Deep learning methods based on multi-level Vector Quantization (VQ):
  - TIGER [19]: Uses RQ-VAE to learn multi-level codebooks for items, deriving semantic IDs from text embeddings.
  - LETTER [30]: Builds upon RQ-VAE by aligning quantized embeddings with collaborative embeddings and introducing code assignment diversity regularization.
  - MMGRec [13], TokenRec [17], Enhanced Generative Recommendation [31]: Further developments using VQ for multi-modal or enhanced generative recommendation.
3.2.2.2. Generative Recommender
- Encoder-decoder architecture (e.g., T5 [18]): Widely used backbone for sequence modeling and generation in generative recommendation, as seen in TIGER, LETTER, CID, and SID.
- Architecture/Objective improvements (e.g., [21]): Studies focusing on adjusting the backbone architecture or learning objectives for better performance.
3.3. Technological Evolution
The field of recommender systems has evolved from basic collaborative filtering and matrix factorization to sophisticated neural network models.
- Early Systems: Simple item-based or user-based collaborative filtering.
- Sequential Models: Introduction of Markov Chains, then RNNs, CNNs, and Transformers to capture the temporal dynamics of user behavior. This marked a shift from static user profiles to dynamic preference modeling.
- Generative Paradigm: Inspired by the success of Large Language Models (LLMs), the latest evolution casts recommendation as a generation task. Instead of predicting a score for each item in the catalog, the system generates the item ID as a sequence of tokens. This opens possibilities for generating novel items or more flexible item representations.

Within the generative paradigm, the evolution has moved from:
- Heuristic Item Tokenization: Simple, pre-defined item IDs or IDs derived from co-occurrence matrices. These were often fixed and lacked deep semantic meaning or adaptability.
- Pre-learned Item Tokenization: Using deep learning models (like VQ-VAE or RQ-VAE) to learn item tokens, but still as a separate pre-processing step. The tokenizer and recommender are decoupled during training.
- End-to-End Learnable Item Tokenization (ETEGRec): This paper represents a crucial step by integrating item tokenization and generative recommendation into a single, jointly optimized framework. This ensures the tokenizer learns representations that are explicitly useful for the recommender, and the recommender can refine its understanding of items based on these adaptively learned tokens.
3.4. Differentiation Analysis
ETEGRec distinguishes itself from previous generative recommendation models primarily by its end-to-end joint optimization of item tokenization and generative recommendation.
The following table, adapted from Table 1 in the paper, summarizes the key differences:
The following are the results from Table 1 of the original paper (the Learning, EL, and IA columns describe item tokenization; the Token Sequence and TI columns describe generative recommendation):

| Methods | Learning | EL | IA | Token Sequence | TI |
|---|---|---|---|---|---|
| GPTRec [16] | Heuristic | × | √ | Pre-processed | × |
| CID [9] | Heuristic | × | × | Pre-processed | × |
| TIGER [19] | Pre-learned | √ | × | Pre-processed | × |
| LETTER [30] | Pre-learned | √ | × | Pre-processed | × |
| ETEGRec | End-to-end | √ | √ | Gradually Refined | √ |
Where:
- EL: Equal Length (item identifiers have the same length).
- IA: Interaction-Aware (tokenization considers user-item interactions).
- TI: Tokenization Integration (item tokenization is integrated into generative recommendation training).
Core Differences and Innovations of ETEGRec:
- End-to-End Learnable Tokenization:
  - Previous: GPTRec and CID use heuristic methods (e.g., co-occurrence matrices), which are parameter-free and efficient but often fail to capture deep semantic relevance. TIGER and LETTER use pre-learned deep neural networks (RQ-VAE) for tokenization, but this is a separate pre-processing step.
  - ETEGRec: It trains the item tokenizer jointly with the generative recommender. This means the tokenization process itself adapts to optimize the recommendation task, addressing the decoupling problem.
- Interaction-Aware Tokenization:
  - Previous: Only GPTRec introduced interaction awareness, primarily through the user-item interaction matrix. TIGER and LETTER (in its main form) were not explicitly interaction-aware in their tokenization.
  - ETEGRec: It explicitly incorporates preference information from user behaviors into the item tokenizer through sequence-item alignment and preference-semantic alignment. This ensures the learned tokens are not just semantically meaningful but also relevant to user preferences derived from sequential interactions.
- Dynamic/Refined Token Sequences:
  - Previous: Most methods use pre-processed token sequences, which remain constant during generative recommendation training. This can lead to monotonous sequence patterns and potential overfitting if the initial tokenization is suboptimal.
  - ETEGRec: With joint optimization, the item tokenizer is continuously updated, leading to gradually refined token semantics and diverse token sequences during model learning. This adaptive nature allows the system to learn better item representations over time.
- Mutual Enhancement: The key innovation is the recommendation-oriented alignment approach (including sequence-item alignment and preference-semantic alignment) and the alternating optimization technique. These explicitly foster a synergistic relationship, allowing the tokenizer to produce better, recommendation-focused tokens, and the recommender to better utilize and refine the knowledge embedded in these tokens.

In essence, ETEGRec moves beyond simply using tokens for recommendation towards actively learning and refining those tokens during the recommendation process itself, guided by explicit recommendation objectives.
4. Methodology
4.1. Principles
The core idea behind ETEGRec is to unify item tokenization and generative recommendation into a single, cohesive framework, allowing for joint optimization and mutual enhancement between these two components. This addresses the limitation of prior generative recommendation systems where item tokenization was treated as a separate, pre-processing step. The theoretical basis is that by aligning the learning objectives of the item tokenizer and the generative recommender through recommendation-oriented alignment strategies, the system can learn more effective and recommendation-aware item representations, leading to superior prediction performance. The intuition is that if the item representations (tokens) are explicitly learned to be useful for the recommendation task, and the recommender is trained to leverage these evolving representations, the overall system will be more powerful than decoupled approaches.
4.2. Core Methodology In-depth (Layer by Layer)
ETEGRec is built on a dual encoder-decoder architecture, consisting of an item tokenizer and a generative recommender. These components are synergistically trained through novel recommendation-oriented alignment strategies and an alternating optimization technique.
4.2.1. Problem Formulation
The task is sequential recommendation: given a user's historical interaction sequence $S = [i_1, i_2, \ldots, i_t]$ drawn from an item set $\mathcal{I}$, the goal is to predict the next item $i_{t+1}$. ETEGRec adopts a generative paradigm, casting this as token sequence generation.
Each item $i$ is represented by multiple tokens $[c_1, \ldots, c_L]$, where $L$ is the fixed identifier length.
The input interaction sequence is tokenized into $X = [c_1^1, \ldots, c_L^1, \ldots, c_1^t, \ldots, c_L^t]$. The objective is to generate the identifier $Y = [c_1^{t+1}, \ldots, c_L^{t+1}]$ of the target item at the $(t+1)$-th step.
Formally, this is a sequence-to-sequence learning problem:
$
P(Y | X) = \prod_{l=1}^{L} P(c_l^{t+1} | X, c_1^{t+1}, \ldots, c_{l-1}^{t+1})
$
Where:
- $Y$: The token sequence representing the target item $i_{t+1}$.
- $X$: The tokenized historical interaction sequence.
- $L$: The fixed length of the item identifier (number of tokens per item).
- $c_l^{t+1}$: The $l$-th token of the target item.
- $P(c_l^{t+1} | X, c_1^{t+1}, \ldots, c_{l-1}^{t+1})$: The probability of generating the $l$-th token of the target item, conditioned on the input sequence and all previously generated tokens for the current target item.

This formulation highlights the autoregressive nature of the generation.
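At inference, the recommender generates the target item's tokens one step at a time. A toy greedy-decoding sketch is shown below; the paper itself evaluates with beam search (beam size 20), and the model here is a random stand-in:

```python
import torch

def greedy_decode(model, history_tokens, L=3, bos_id=0):
    """Generate the L tokens of the next item one step at a time.
    Greedy for clarity; beam search explores several candidates in parallel."""
    generated = [bos_id]
    for _ in range(L):
        logits = model(history_tokens, generated)  # scores over the token vocabulary
        generated.append(int(logits.argmax()))     # keep the most likely next token
    return generated[1:]                           # drop the [BOS] token

# toy stand-in "model": ignores its inputs and returns random vocabulary scores
toy_model = lambda X, Y: torch.randn(256)
print(greedy_decode(toy_model, history_tokens=[5, 9, 2]))
```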
4.2.2. Dual Encoder-Decoder Architecture
The ETEGRec framework comprises two main components: an item tokenizer ($\mathcal{T}$) and a generative recommender ($\mathcal{R}$), both utilizing an encoder-decoder structure.
4.2.2.1. Item Tokenizer
The item tokenizer ($\mathcal{T}$) uses a Residual Quantization Variational Autoencoder (RQ-VAE) to construct multi-level tokens for each item. This creates an $L$-level hierarchical representation, where each item is indexed by $L$ token IDs, organizing items in a tree-structured way and allowing collaborative semantics to be shared.
Token Generation as Residual Quantization
For an item $i$, the item tokenizer takes its contextual or collaborative semantic embedding $z \in \mathbb{R}^{d_z}$ as input, where $d_z$ is the dimension of this semantic embedding.
The tokenizer outputs a sequence of quantized tokens:
$
[c_1, \ldots, c_L] = \mathcal{T}(z)
$
Where $c_l$ is the $l$-th token for item $i$.
The process begins by encoding the semantic embedding $z$ into a latent representation $r$ using a multilayer perceptron (MLP) based encoder:
$
r = \mathrm{Encoder}_T(z)
$
Where $\mathrm{Encoder}_T$ represents the MLP encoder of the tokenizer.
The latent representation $r$ is then quantized into serialized codes (tokens) across $L$ levels using $L$-level codebooks. Each level $l$ has a codebook $\mathcal{C}^l = \{e_k^l\}_{k=1}^{K}$, where $e_k^l \in \mathbb{R}^{d_c}$ are the code embeddings in the $l$-th codebook, $K$ is the codebook size, and $d_c$ is the dimension of the code embeddings.
The residual quantization process is as follows:
$
c_l = \arg\max_k P(k | v_l)
$
$
v_l = v_{l-1} - e_{c_{l-1}}^{l-1}
$
Where:
- $c_l$: The $l$-th assigned token.
- $v_l$: The residual vector at the $l$-th level.
- $v_1$: Initialized with the latent representation $r$.
- $e_{c_{l-1}}^{l-1}$: The code embedding chosen at the $(l-1)$-th level.
- $P(k | v_l)$: The likelihood that the residual $v_l$ is quantized to token $k$. This probability is measured by the distance between $v_l$ and the codebook vectors $e_k^l$:
$
P(k | v_l) = \frac{\exp(-\|v_l - e_k^l\|^2)}{\sum_{j=1}^{K} \exp(-\|v_l - e_j^l\|^2)}
$
This formula uses a softmax-like function over negative squared Euclidean distances to compute the probability. The token corresponding to the highest probability (smallest distance) is selected.
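A tiny NumPy sketch of this distance-based softmax (all values are illustrative):

```python
import numpy as np

def token_probs(v, codebook):
    """P(k | v) = softmax_k(-||v - e_k||^2) over a (K, d) codebook."""
    neg_sq_dists = -((v - codebook) ** 2).sum(axis=1)
    neg_sq_dists -= neg_sq_dists.max()          # stabilize the exponentials
    probs = np.exp(neg_sq_dists)
    return probs / probs.sum()

rng = np.random.default_rng(0)
probs = token_probs(rng.normal(size=16), rng.normal(size=(8, 16)))
print(probs.argmax(), probs.sum())              # chosen token; probabilities sum to 1
```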
Reconstruction Loss
After obtaining the tokens $[c_1, \ldots, c_L]$, the quantized representation $\tilde{r}$ is formed by summing the chosen code embeddings from each level:
$
\tilde{r} = \sum_{l=1}^{L} e_{c_l}^{l} \in \mathbb{R}^{d_c}
$
This is then fed into an MLP decoder to reconstruct the original item semantic embedding:
$
\tilde{z} = \mathrm{Decoder}_T(\tilde{r})
$
Where $\mathrm{Decoder}_T$ represents the MLP decoder of the tokenizer.
The semantic quantization loss ($\mathcal{L}_{\mathrm{SQ}}$) for learning the item tokenizer is a combination of reconstruction loss and RQ loss:
$
\mathcal{L}_{\mathrm{SQ}} = \mathcal{L}_{\mathrm{RECON}} + \mathcal{L}_{\mathrm{RQ}}
$
The reconstruction loss ($\mathcal{L}_{\mathrm{RECON}}$) ensures that the reconstructed semantic embedding $\tilde{z}$ is close to the original input $z$:
$
\mathcal{L}_{\mathrm{RECON}} = \|z - \tilde{z}\|^2
$
This is the squared Euclidean distance between the original and reconstructed embeddings.
The RQ loss ($\mathcal{L}_{\mathrm{RQ}}$) is a standard loss for VQ-VAEs that guides the learning of codebook vectors and the encoder:
$
\mathcal{L}_{\mathrm{RQ}} = \sum_{l=1}^{L} \|\mathrm{sg}[v_l] - e_{c_l}^{l}\|^2 + \beta \|v_l - \mathrm{sg}[e_{c_l}^{l}]\|^2
$
Where:
- $\mathrm{sg}[\cdot]$: The stop-gradient operation. This is crucial for training VQ-VAEs. It prevents gradients from flowing through the quantized codebook vector to the encoder, effectively allowing the encoder to optimize for similarity to the codebook entries without directly updating the codebook entries themselves.
- $\|\mathrm{sg}[v_l] - e_{c_l}^{l}\|^2$: This term trains the codebook vectors to move towards the residual vectors they are chosen to represent. The gradient only flows to $e_{c_l}^{l}$.
- $\beta \|v_l - \mathrm{sg}[e_{c_l}^{l}]\|^2$: This term encourages the encoder's output to be close to the chosen codebook vector. The gradient only flows to $v_l$ (and thus back to the encoder).
- $\beta$: A hyperparameter (typically 0.25) balancing the optimization between the encoder and the codebooks.
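A PyTorch sketch of this loss, with the stop-gradient realized via .detach() as is common in VQ-VAE implementations; variable names are illustrative, not the authors' code:

```python
import torch

def rq_loss(residuals, chosen_codes, beta=0.25):
    """residuals: list of L tensors v_l; chosen_codes: list of L tensors e_{c_l}^l."""
    loss = torch.tensor(0.0)
    for v, e in zip(residuals, chosen_codes):
        codebook_term = ((v.detach() - e) ** 2).sum()  # gradient moves codes toward residuals
        commit_term = ((v - e.detach()) ** 2).sum()    # gradient moves encoder toward codes
        loss = loss + codebook_term + beta * commit_term
    return loss

v = [torch.randn(128, requires_grad=True) for _ in range(3)]
e = [torch.randn(128, requires_grad=True) for _ in range(3)]
print(rq_loss(v, e).item())
```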
4.2.2.2. Generative Recommender
The generative recommender utilizes a Transformer-based encoder-decoder architecture, similar to T5, known for its effectiveness in sequence modeling.
Token-level Seq2Seq Formulation
During training, the item-level user interaction sequence and the target item are first tokenized by the item tokenizer ($\mathcal{T}$) into token sequences $X$ and $Y$, respectively.
The token embeddings $E^X \in \mathbb{R}^{n \times d}$ (where $n$ is the total number of tokens in the input sequence and $d$ is the hidden size) are fed into the generative recommender.
The encoder of the recommender processes the input token embeddings:
$
H^E = \mathrm{Encoder}_R(E^X)
$
Where $H^E$ is the encoded sequence representation, capturing the user's historical preferences.
For decoding, a special start-of-sequence token ([BOS]) is prepended to $Y$, forming $\tilde{Y}$.
The decoder of the recommender takes $H^E$ and $\tilde{Y}$ as input to extract the user preference representation:
$
H^D = \mathrm{Decoder}_R(H^E, \tilde{Y})
$
Where $H^D$ represents the decoder hidden states, implying user preferences over the items.
Recommendation Loss
The decoder hidden states are used to predict the target item token at each step by performing an inner product with the vocabulary embedding matrix $W$. The recommender is optimized using the negative log-likelihood of the target tokens, following the sequence-to-sequence paradigm:
$
\mathcal{L}_{\mathrm{REC}} = -\sum_{j=1}^{L} \log P(Y_j | X, Y_{<j})
$
Where:
- $Y_j$: The $j$-th token of the target item's token sequence.
- $Y_{<j}$: All tokens generated before $Y_j$ for the current target item.
- $P(Y_j | X, Y_{<j})$: The probability of generating the $j$-th token, conditioned on the full input sequence and the already generated target tokens.

This loss drives the generative recommender to autoregressively generate the correct tokens for the target item.
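In implementation terms, this token-level negative log-likelihood is a cross-entropy over the token vocabulary at each decoding step; a hedged PyTorch sketch, with shapes assumed for illustration:

```python
import torch
import torch.nn.functional as F

def rec_loss(logits, target_tokens):
    """logits: (B, L, V) decoder scores over the token vocabulary;
    target_tokens: (B, L) ground-truth token ids of the target item."""
    B, L, V = logits.shape
    # cross-entropy = -log P(Y_j | X, Y_<j), averaged over all B*L target tokens
    return F.cross_entropy(logits.reshape(B * L, V), target_tokens.reshape(B * L))

logits = torch.randn(4, 3, 256)              # e.g. L=3 tokens, vocabulary of 256 codes
targets = torch.randint(0, 256, (4, 3))
print(rec_loss(logits, targets).item())
```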
4.2.3. Recommendation-oriented Alignment
The key innovation of ETEGRec is the recommendation-oriented alignment approach, which ensures the item tokenizer and generative recommender learn synergistically rather than independently.
4.2.3.1. Sequence-Item Alignment
Alignment Hypothesis
This alignment strategy focuses on the relationship between the encoder's sequential states ($H^E$) from the generative recommender and the collaborative embedding ($z$) of the target item. The hypothesis is that the sequence representation, which encodes information about the past interaction sequence, should be highly informative about the future target item. Therefore, when both types of representations are fed into the item tokenizer, they should yield similar tokenization results, i.e., similar token distributions. This alignment serves as a supervision signal to optimize both components.
Alignment Loss
To formalize this, the hidden states $H^E$ from the recommender's encoder (Eq. (11)) are first linearized by a mean pooling operation, followed by an MLP for semantic space transformation:
$
z^E = \mathrm{MLP}(\mathrm{mean\_pool}(H^E))
$
Where:
- $\mathrm{mean\_pool}(\cdot)$: Averages the hidden states across the sequence length, obtaining a single vector representing the entire input sequence.
- $\mathrm{MLP}(\cdot)$: A multilayer perceptron that transforms this pooled representation into a semantic space compatible with the item tokenizer.

Next, the item tokenizer is used to generate token distributions for each level $l$. Let $P_z^l$ and $P_{z^E}^l$ denote the token distributions at the $l$-th level for inputs $z$ (collaborative item embedding) and $z^E$ (encoder's sequence state), respectively. The sequence-item alignment loss ($\mathcal{L}_{\mathrm{SIA}}$) uses a symmetric combination of Kullback-Leibler divergences to enforce similarity between these distributions:
$
\mathcal{L}_{\mathrm{SIA}} = \sum_{l=1}^{L} \left( D_{KL}\big(P_z^l \| P_{z^E}^l\big) + D_{KL}\big(P_{z^E}^l \| P_z^l\big) \right)
$
Where:
- $D_{KL}(\cdot \| \cdot)$: The Kullback-Leibler (KL) divergence.
- $P_z^l$: The token distribution for the $l$-th level derived from the target item's original semantic embedding $z$.
- $P_{z^E}^l$: The token distribution for the $l$-th level derived from the sequence state $z^E$.

The sum of the two KL divergences makes the alignment symmetric, ensuring $P_z^l$ is close to $P_{z^E}^l$ and vice versa. This loss ensures that the encoder learns to produce sequence representations ($z^E$) that semantically align with the actual target items' representations ($z$), making the encoder more informative and preventing the decoder from bypassing it.
4.2.3.2. Preference-Semantic Alignment
Alignment Hypothesis
This strategy aims to connect the decoder's first hidden state (denoted $h^D$, the first column of $H^D$ from Eq. (12)) with the reconstructed semantic embedding ($\tilde{z}$) of the target item (Eq. (7)).
- $h^D$: Represents the sequential user preference learned by the generative recommender after modeling the interaction sequence.
- $\tilde{z}$: Encodes the collaborative semantics of the target item, reconstructed by the item tokenizer.

The hypothesis is that these two representations should be aligned, as the user's preference should reflect the semantics of the item they are likely to interact with next. Unlike the recommendation loss, which uses item tokens, this alignment explicitly involves the reconstructed embedding, thus engaging the tokenizer in the optimization.
Alignment Loss
An InfoNCE loss with in-batch negatives is used for preference-semantic alignment ($\mathcal{L}_{\mathrm{PSA}}$):
$
\mathcal{L}_{\mathrm{PSA}} = -\left( \log \frac{\exp(s(\tilde{z}, h^D)/\tau)}{\sum_{\hat{h} \in \mathcal{B}} \exp(s(\tilde{z}, \hat{h})/\tau)} + \log \frac{\exp(s(h^D, \tilde{z})/\tau)}{\sum_{\hat{z} \in \mathcal{B}} \exp(s(h^D, \hat{z})/\tau)} \right)
$
Where:
- $s(\cdot, \cdot)$: The cosine similarity function, which measures the angular similarity between two vectors.
- $\tau$: A temperature coefficient, a hyperparameter that scales the logits before the softmax, affecting the concentration of the similarity distribution.
- $\mathcal{B}$: Denotes a batch of training instances.
- $\hat{h}$: Negative samples of preference representations within the batch.
- $\hat{z}$: Negative samples of semantic representations within the batch.

This loss is a symmetric InfoNCE (NT-Xent) formulation, maximizing the similarity between the positive pair ($\tilde{z}$, $h^D$) while minimizing similarity with negative pairs (other $\hat{h}$ and $\hat{z}$ in the batch). It serves as an additional enhancement to the recommendation loss by involving the tokenizer and explicitly aligning user preferences with item semantics.
The combination of sequence-item alignment and preference-semantic alignment effectively strengthens the interplay between the item tokenizer and the generative recommender, fostering mutual adaptation and enhancement.
4.2.4. Alternating Optimization
To ensure stable and effective training of the entire ETEGRec framework, an alternating optimization strategy is proposed instead of a straightforward joint optimization of all objectives simultaneously. This method iteratively optimizes the item tokenizer and the generative recommender.
4.2.4.1. Item Tokenizer Optimization
In this phase, the item tokenizer is optimized by considering its intrinsic semantic quantization loss and the two recommendation-oriented alignment losses. All parameters of the generative recommender are kept fixed (frozen).
The overall loss for the item tokenizer ($\mathcal{L}_{\mathrm{IT}}$) is:
$
\mathcal{L}_{\mathrm{IT}} = \mathcal{L}_{\mathrm{SQ}} + \mu \mathcal{L}_{\mathrm{SIA}} + \lambda \mathcal{L}_{\mathrm{PSA}}
$
Where:
- $\mathcal{L}_{\mathrm{SQ}}$: The semantic quantization loss (Equation (8)).
- $\mathcal{L}_{\mathrm{SIA}}$: The sequence-item alignment loss (Equation (15)).
- $\mathcal{L}_{\mathrm{PSA}}$: The preference-semantic alignment loss (Equation (16)).
- $\mu$ and $\lambda$: Hyperparameters that control the weighting of the alignment losses.
4.2.4.2. Generative Recommender Optimization
Conversely, during this phase, the generative recommender is optimized using its primary recommendation loss and the two alignment losses. All parameters of the item tokenizer are kept fixed (frozen).
The overall loss for the generative recommender ($\mathcal{L}_{\mathrm{GR}}$) is:
$
\mathcal{L}_{\mathrm{GR}} = \mathcal{L}_{\mathrm{REC}} + \mu \mathcal{L}_{\mathrm{SIA}} + \lambda \mathcal{L}_{\mathrm{PSA}}
$
Where:
- $\mathcal{L}_{\mathrm{REC}}$: The generative recommendation loss (Equation (13)).
- $\mathcal{L}_{\mathrm{SIA}}$: The sequence-item alignment loss (Equation (15)).
- $\mathcal{L}_{\mathrm{PSA}}$: The preference-semantic alignment loss (Equation (16)).
- $\mu$ and $\lambda$: The same hyperparameters as in item tokenizer optimization.

The training process is divided into multiple cycles, each consisting of a fixed number of epochs ($C$):
- In the first epoch of each cycle, the item tokenizer is optimized based on $\mathcal{L}_{\mathrm{IT}}$. This updates the item representations based on feedback from the recommender and its own reconstruction needs.
- For the remaining $C-1$ epochs of the cycle, the item tokenizer is frozen and the generative recommender is trained based on $\mathcal{L}_{\mathrm{GR}}$. During this period, the item tokens remain fixed.

This alternation continues until the item tokenizer converges. Once converged, the item tokenizer is permanently frozen, and the generative recommender is fully trained to convergence. This strategy addresses the challenge of stable optimization when jointly training two tightly coupled components. A schematic sketch of this schedule follows.
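A schematic Python sketch of the alternating schedule; all names are placeholders and the real training steps and convergence check are elided:

```python
# Schematic only: strings stand in for the models and their weighted objectives.
def train_one_epoch(model, objective):
    print(f"epoch: optimize {model} with {objective}")

def train_etegrec(C=4, max_cycles=3):
    for _ in range(max_cycles):               # repeat until the tokenizer converges
        # phase 1: recommender frozen, tokenizer takes one epoch of L_IT
        train_one_epoch("item tokenizer", "L_SQ + mu*L_SIA + lambda*L_PSA")
        # phase 2: tokenizer frozen, recommender takes C-1 epochs of L_GR
        for _ in range(C - 1):
            train_one_epoch("generative recommender", "L_REC + mu*L_SIA + lambda*L_PSA")
    # after convergence: tokenizer is permanently frozen, recommender finishes training
    train_one_epoch("generative recommender", "L_GR (tokenizer permanently frozen)")

train_etegrec()
```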
4.3. Discussion and Analysis
4.3.1. Comparison with Existing methods
The paper contrasts ETEGRec with typical generative recommendation models based on item tokenization and generative recommendation aspects, as presented in Table 1 (copied and explained in Section 3.4).
Key distinctions highlighted:
- Item Tokenization: Previous methods either use heuristic (e.g., GPTRec, CID) or pre-learned (e.g., TIGER, LETTER) tokenizers. These approaches lead to a decoupling of tokenization and recommendation training. ETEGRec's end-to-end approach allows the tokenizer to be optimized with the recommender.
- Interaction Awareness: Only GPTRec among the baselines was noted for interaction awareness, through user-item matrices. ETEGRec explicitly aligns past user interaction sequences with target items via sequence-item and preference-semantic alignment, integrating rich preference information.
- Token Sequence Evolution: Existing methods rely on pre-processed (constant) token sequences, which can lead to monotonous sequence patterns and overfitting. ETEGRec's joint optimization results in gradually refined semantics and diverse token sequences, with ablation studies confirming the performance contribution.
- Integration of Prior Knowledge: ETEGRec actively integrates and refines prior knowledge from item semantic embeddings during training, rather than isolating the tokenizer.
4.3.2. Complexity Analysis
The paper provides a complexity analysis for ETEGRec.
- Item Tokenization: For a single item, the time complexity involves:
  - Encoder and decoder MLP layers: $O(d^2)$, where $d$ is the model dimension.
  - Codebook lookup operations: $O(L K d_c)$, where $L$ is the number of codebooks (token length), $K$ is the size of each codebook, and $d_c$ is the dimension of code embeddings.
  - Semantic quantization loss calculation: $O(L d_c)$.
  - Total for one item: $O(d^2 + L K d_c)$.
- Generative Recommendation:
  - Sequential preference modeling (Transformer): Primarily self-attention and feed-forward layers, with complexity $O(n^2 d)$, where $n$ is the token sequence length.
  - Loss calculations:
    - $\mathcal{L}_{\mathrm{REC}}$: Included in the generation process.
    - $\mathcal{L}_{\mathrm{SIA}}$: $O(L K d_c)$ (due to codebook lookups for $z$ and $z^E$).
    - $\mathcal{L}_{\mathrm{PSA}}$: $O(B d)$, where $B$ is the number of negative samples (the batch size for in-batch negatives).
- Overall Training Cost: On the order of $O(n^2 d + L K d_c + B d)$ per training sequence. This is stated to be on the same order of magnitude as mainstream generative models like TIGER and LETTER.
- Inference Complexity: The inference complexity is "completely consistent with TIGER" because item tokenization results can be cached in advance, meaning the overhead of tokenization is amortized.
5. Experimental Setup
5.1. Datasets
The experiments were conducted on three subsets of the Amazon 2023 review data [8], which contains user review data from May 1996 to September 2023. These datasets are:
- "Musical Instruments"
- "Video Games"
- "Industrial & Scientific"
To preprocess the data, the authors applied a 5-core filter, meaning only users and items with at least five interaction records were kept. This removes infrequent users and unpopular items, ensuring sufficient interaction data for modeling. User behavior sequences were constructed chronologically, and the maximum item sequence length was uniformly set to 50.
The following are the results from Table 2 of the original paper:
| Dataset | #Users | #Items | #Interactions | Sparsity |
|---|---|---|---|---|
| Instrument | 57,439 | 24,587 | 511,836 | 99.964% |
| Scientific | 50,985 | 25,848 | 412,947 | 99.969% |
| Game | 94,762 | 25,612 | 814,586 | 99.966% |
These datasets were chosen because they are publicly available, widely used in recommender systems research, and represent different domains, allowing for robust validation of the proposed method's generalizability across various types of e-commerce data. Their scale (tens of thousands of users/items, hundreds of thousands of interactions) and high sparsity are typical characteristics of real-world recommendation scenarios, making them effective for validating performance.
5.2. Evaluation Metrics
To evaluate the performance of various methods, the paper employs two widely used metrics for top-$K$ recommendation: Recall@K and Normalized Discounted Cumulative Gain@K (NDCG@K). $K$ is set to 5 and 10.
5.2.1. Recall@K
Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully recommended within the top $K$ items. In the context of sequential recommendation, it quantifies how often the actual next item a user interacts with is present in the list of the top $K$ items recommended by the system. A higher Recall@K indicates that the model is better at identifying relevant items.
Mathematical Formula: $ \mathrm{Recall@K} = \frac{|\mathrm{RecommendedItems}_K \cap \mathrm{RelevantItems}|}{|\mathrm{RelevantItems}|} $
Symbol Explanation:
- $\mathrm{RecommendedItems}_K$: The set of top $K$ items recommended by the system for a given user.
- $\mathrm{RelevantItems}$: The set of actual next items the user interacted with (typically, for sequential recommendation, this is a single item, $i_{t+1}$).
- $|\cdot|$: Denotes the cardinality (number of elements) of a set.

In a "leave-one-out" setting where there is only one relevant item, Recall@K simplifies to 1 if the relevant item is in the top $K$ recommendations, and 0 otherwise.
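Under the leave-one-out protocol, Recall@K therefore reduces to a per-user hit indicator averaged over users; a small Python sketch with toy data:

```python
def recall_at_k(recommended, target, k):
    """recommended: ranked list of item ids; target: the single held-out item."""
    return 1.0 if target in recommended[:k] else 0.0

# average over two toy users: one hit within top-3, one miss
users = [(["a", "b", "c", "d"], "c"), (["x", "y", "z", "w"], "q")]
print(sum(recall_at_k(rec, t, k=3) for rec, t in users) / len(users))  # 0.5
```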
5.2.2. Normalized Discounted Cumulative Gain@K (NDCG@K)
Conceptual Definition: NDCG@K is a measure of ranking quality that takes into account the position of relevant items in the recommended list. It assigns higher scores to relevant items that appear at higher (earlier) ranks. It's normalized to a value between 0 and 1, where 1 represents a perfect ranking. NDCG@K is particularly useful when the relevance of items can vary (e.g., highly relevant vs. marginally relevant), and when the order of recommendations matters.
Mathematical Formula:
$
\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}
$
Where:
$
\mathrm{DCG@K} = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}_j} - 1}{\log_2(j+1)}
$
$
\mathrm{IDCG@K} = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}'_j} - 1}{\log_2(j+1)}
$
Symbol Explanation:
- $\mathrm{DCG@K}$: Discounted Cumulative Gain at rank $K$. It sums the relevance scores of items in the recommended list, with a discount factor applied to items at lower ranks.
- $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at rank $K$. This is the maximum possible DCG value, obtained by ranking all relevant items perfectly (in decreasing order of relevance). It serves as a normalization factor.
- $\mathrm{rel}_j$: The relevance score of the item at position $j$ in the recommended list. In binary relevance (item is either relevant or not), $\mathrm{rel}_j$ is typically 1 if the item is relevant, and 0 otherwise.
- $\mathrm{rel}'_j$: The relevance score of the item at position $j$ in the ideal ranked list. This is usually set up such that the most relevant item has the highest score and is placed at $j = 1$, and so on.
- $\log_2(j+1)$: The logarithmic discount factor. Items at rank $j = 1$ have a discount of $\log_2 2 = 1$, while items at lower ranks have larger discounts, meaning their contribution to DCG is reduced.

The leave-one-out strategy is used for splitting data: the latest interaction for each user is test data, the second latest is validation data, and all others are training data. A full ranking evaluation is performed over the entire item set to avoid sampling bias, and the beam size for generative models is set to 20.
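With binary relevance and a single held-out item, NDCG@K simplifies to $1/\log_2(\mathrm{rank}+1)$ when the item appears at rank $\leq K$ and 0 otherwise; a small sketch:

```python
import math

def ndcg_at_k(recommended, target, k):
    """Binary relevance, one relevant item: IDCG@K = 1, so NDCG is a discounted hit."""
    for j, item in enumerate(recommended[:k], start=1):  # j is the 1-based rank
        if item == target:
            return 1.0 / math.log2(j + 1)
    return 0.0

print(ndcg_at_k(["a", "b", "c"], "b", k=3))  # 1/log2(3) ≈ 0.6309
```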
5.3. Baselines
The paper compares ETEGRec against two categories of baseline models:
5.3.1. Traditional Sequential Recommendation Models
These models aim to predict the next item in a sequence using discriminative approaches.
- Caser [26]: Utilizes horizontal and vertical convolutional filters to capture patterns in user behavior sequences.
- GRU4Rec [7]: An RNN-based sequential recommender employing Gated Recurrent Units (GRUs) for user behavior modeling.
- HGN [15]: Hierarchical Gating Networks designed to capture both long-term and short-term user interests from item sequences.
- SASRec [10]: Adopts a unidirectional Transformer to model user behaviors with self-attention.
- BERT4Rec [22]: Introduces a bidirectional Transformer and a mask prediction task for training, drawing inspiration from BERT in NLP.
- FMLP-Rec [43]: An all-MLP sequential recommender with learnable filters, aiming to reduce behavior noise.
- FDSA [38]: Emphasizes transformation patterns between item features by modeling both item-level and feature-level sequences with self-attention.
- S3-Rec [42]: Incorporates mutual information maximization for pre-training sequential models, learning item and attribute correlations.
5.3.2. Generative Recommendation Models
These models formulate recommendation as a sequence generation task.
- SID [9]: Sequentially encodes item IDs as numerical tokens and uses them as item identifiers for generative recommendation. Lacks semantic information in tokens.
- CID [9]: Integrates collaborative knowledge by generating item identifiers through spectral clustering on item co-occurrence graphs. Still a heuristic approach to tokenization.
- TIGER [19]: Leverages text embeddings to construct semantic IDs for items using RQ-VAE and adopts a generative retrieval paradigm.
- TIGER-SAS [19]: A variant of TIGER that uses item embeddings from a trained SASRec instead of text embeddings to construct semantic IDs, thus incorporating collaborative prior knowledge.
- LETTER [30]: Designs a learnable tokenizer by integrating hierarchical semantics, collaborative signals, and code assignment diversity into the RQ-VAE framework.

These baselines are representative as they cover both traditional discriminative and modern generative approaches, including state-of-the-art models and specific generative methods that ETEGRec directly aims to improve upon.
5.4. Semantic ID Generation
The paper highlights specific details for semantic ID generation using the item tokenizer:
- Item Collaborative Semantic Embeddings: 256-dimensional embeddings are obtained from a trained SASRec [10] model. This provides a strong initial representation for items based on collaborative filtering.
- Item Tokenizer Architecture: A 3-layer MLP is used for both the encoder and decoder within the RQ-VAE.
- Codebook Configuration: The number of codebooks ($L$) is set to 3. Each codebook contains 256 code embeddings, and each embedding has a dimension of 128. This means each item is represented by a sequence of 3 tokens, where each token is chosen from a vocabulary of 256 codes.
- Uniqueness of Semantic IDs: To ensure distinct semantic item IDs, an additional token is appended at the end of the 3 semantic tokens, following the approach in TIGER [19]. This handles cases where different items would otherwise share the same 3-token sequence from the RQ-VAE.
5.5. Implementation Details
- Generative Recommender Backbone: T5 model with 6 encoder and 6 decoder layers.
- Model Dimensions: Hidden size of 128 (matching the item embedding dimension used for all models, noted below); Feed-Forward Network (FFN) dimension of 512.
- Attention Mechanism: Each layer has 4 self-attention heads, each with a dimension of 64.
- Initialization: The item tokenizer is initialized with a pre-trained RQ-VAE.
- Optimizer: AdamW with a weight decay of 0.05 is used for training the entire framework.
- Alternating Optimization Cycles: The number of epochs per cycle ($C$) is tuned as a hyperparameter.
  - Cycle Structure: Each cycle begins with 1 epoch of item tokenizer training (using $\mathcal{L}_{\mathrm{IT}}$).
  - Remaining Epochs: Followed by $C-1$ epochs of generative recommender training (using $\mathcal{L}_{\mathrm{GR}}$), during which the tokenizer is frozen.
  - This process repeats until the item tokenizer converges, after which it is permanently frozen and the generative recommender is trained to full convergence.
- Learning Rates: Tuned separately for the generative recommender and the item tokenizer.
- Hyperparameters: The alignment loss coefficients ($\mu$, $\lambda$) are tuned.
- Baseline Implementations:
  - Traditional models: Implemented using RecBole [39, 40], an open-source recommendation framework.
  - CID, SID, LETTER: Official implementations used.
  - TIGER, TIGER-SAS: Implementation details from the original paper [19] followed.
- Item Embedding Dimension: Set to 128 for all models, except for SID and CID, which retained their default dimension of 768.
6. Results & Analysis
6.1. Core Results Analysis
The ETEGRec model was evaluated against traditional and generative recommendation baselines on three Amazon datasets. The overall performance is presented in Table 3.
The following are the results from Table 3 of the original paper:
| Model | Instrument Recall@5 | Instrument Recall@10 | Instrument NDCG@5 | Instrument NDCG@10 | Scientific Recall@5 | Scientific Recall@10 | Scientific NDCG@5 | Scientific NDCG@10 | Game Recall@5 | Game Recall@10 | Game NDCG@5 | Game NDCG@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Caser | 0.0242 | 0.0392 | 0.0154 | 0.0202 | 0.0172 | 0.0281 | 0.0107 | 0.0142 | 0.0346 | 0.0567 | 0.0221 | 0.0291 |
| GRU4Rec | 0.0345 | 0.0537 | 0.0220 | 0.0281 | 0.0221 | 0.0353 | 0.0144 | 0.0186 | 0.0522 | 0.0831 | 0.0337 | 0.0436 |
| HGN | 0.0319 | 0.0515 | 0.0202 | 0.0265 | 0.0220 | 0.0356 | 0.0138 | 0.0182 | 0.0423 | 0.0694 | 0.0266 | 0.0353 |
| SASRec | 0.0341 | 0.0530 | 0.0217 | 0.0277 | 0.0256 | 0.0406 | 0.0147 | 0.0195 | 0.0517 | 0.0821 | 0.0329 | 0.0426 |
| BERT4Rec | 0.0305 | 0.0483 | 0.0196 | 0.0253 | 0.0180 | 0.0300 | 0.0113 | 0.0151 | 0.0453 | 0.0716 | 0.0294 | 0.0378 |
| FMLP-Rec | 0.0328 | 0.0529 | 0.0206 | 0.0271 | 0.0248 | 0.0388 | 0.0158 | 0.0203 | 0.0535 | 0.0860 | 0.0331 | 0.0435 |
| FDSA | 0.0364 | 0.0557 | 0.0233 | 0.0295 | 0.0261 | 0.0391 | 0.0174 | 0.0216 | 0.0548 | 0.0857 | 0.0353 | 0.0453 |
| S3Rec | 0.0340 | 0.0538 | 0.0218 | 0.0282 | 0.0253 | 0.0410 | 0.0172 | 0.0218 | 0.0533 | 0.0823 | 0.0351 | 0.0444 |
| SID | 0.0319 | 0.0438 | 0.0237 | 0.0275 | 0.0155 | 0.0234 | 0.0103 | 0.0129 | 0.0480 | 0.0693 | 0.0333 | 0.0401 |
| CID | 0.0352 | 0.0507 | 0.0234 | 0.0285 | 0.0192 | 0.0300 | 0.0123 | 0.0158 | 0.0497 | 0.0748 | 0.0343 | 0.0424 |
| TIGER | 0.0368 | 0.0574 | 0.0242 | 0.0308 | 0.0275 | 0.0431 | 0.0181 | 0.0231 | 0.0570 | 0.0895 | 0.0370 | 0.0471 |
| TIGER-SAS | 0.0375 | 0.0576 | 0.0242 | 0.0306 | 0.0272 | 0.0435 | 0.0174 | 0.0227 | 0.0561 | 0.0891 | 0.0363 | 0.0469 |
| LETTER | 0.0372 | 0.0581 | 0.0243 | 0.0310 | 0.0276 | 0.0433 | 0.0179 | 0.0230 | 0.0576 | 0.0901 | 0.0373 | 0.0475 |
| ETEGRec | 0.0402* | 0.0624* | 0.0260* | 0.0331* | 0.0294* | 0.0455* | 0.0190* | 0.0241* | 0.0616* | 0.0947* | 0.0400* | 0.0507* |
Observations:
- Traditional Sequential Models: FDSA generally shows the best performance among traditional models, likely due to its utilization of additional textual features. FMLP-Rec, SASRec, and GRU4Rec also perform competitively, suggesting the effectiveness of various neural architectures in modeling behavior sequences. Caser and BERT4Rec show comparatively lower performance in some cases.
- Generative Recommendation Models: TIGER and TIGER-SAS consistently outperform CID and SID. This highlights the importance of learning meaningful item tokens through RQ-VAE (as in TIGER, TIGER-SAS, and LETTER) compared to numerical IDs (SID) or heuristic co-occurrence-based IDs (CID). SID and CID perform poorly despite using a pretrained T5 model, indicating that the quality of item tokenization is paramount. TIGER-SAS performs similarly to TIGER, suggesting that both collaborative and textual semantics are valuable for item representation, and SASRec-derived embeddings provide good collaborative signals. LETTER generally achieves the best performance among prior generative baselines, attributed to its advanced learnable tokenizer design that integrates hierarchical semantics, collaborative signals, and code assignment diversity.
- ETEGRec's Superiority: ETEGRec consistently achieves the best results across all three datasets and all evaluation metrics (Recall@5, Recall@10, NDCG@5, NDCG@10). The * indicates statistical significance in a paired t-test against the best baseline. This empirically validates the effectiveness of ETEGRec's approach of jointly optimizing item tokenization and generative recommendation through its recommendation-oriented alignment. The improvements are attributed to the mutual enhancement between the item tokenizer and the generative recommender.
6.2. Ablation Study
To understand the contribution of each proposed component, an ablation study was conducted. The results are presented in Table 4.
The following are the results from Table 4 of the original paper:
| Variants | Instrument Recall@5 | Instrument Recall@10 | Instrument NDCG@5 | Instrument NDCG@10 | Scientific Recall@5 | Scientific Recall@10 | Scientific NDCG@5 | Scientific NDCG@10 | Game Recall@5 | Game Recall@10 | Game NDCG@5 | Game NDCG@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ETEGRec | 0.0402 | 0.0624 | 0.0260 | 0.0331 | 0.0294 | 0.0455 | 0.0190 | 0.0241 | 0.0616 | 0.0947 | 0.0400 | 0.0507 |
| w/o L_SIA | 0.0396 | 0.0614 | 0.0255 | 0.0325 | 0.0285 | 0.0446 | 0.0186 | 0.0238 | 0.0590 | 0.0917 | 0.0386 | 0.0491 |
| w/o L_PSA | 0.0389 | 0.0609 | 0.0250 | 0.0321 | 0.0270 | 0.0422 | 0.0174 | 0.0223 | 0.0602 | 0.0933 | 0.0392 | 0.0499 |
| w/o L_SIA & L_PSA | 0.0379 | 0.0601 | 0.0245 | 0.0317 | 0.0269 | 0.0422 | 0.0175 | 0.0224 | 0.0576 | 0.0894 | 0.0375 | 0.0478 |
| w/o AT | 0.0337 | 0.0529 | 0.0215 | 0.0277 | 0.0234 | 0.0375 | 0.0153 | 0.0198 | 0.0514 | 0.0810 | 0.0333 | 0.0428 |
| w/o ETE | 0.0388 | 0.0600 | 0.0252 | 0.0320 | 0.0277 | 0.0431 | 0.0181 | 0.0230 | 0.0569 | 0.0899 | 0.0369 | 0.0475 |
Analysis of Ablation Study Variants:
- w/o L_SIA (without the sequence-item alignment loss): Removing this loss leads to a performance drop across all datasets (e.g., Recall@10 on Instrument drops from 0.0624 to 0.0614). This indicates that aligning the token distributions derived from the encoder's sequential states ($z^E$) and the target item's collaborative embedding ($z$) is crucial. It helps ensure the encoder generates highly informative representations for predicting future interactions.
- w/o L_PSA (without the preference-semantic alignment loss): This variant also shows performance degradation (e.g., Recall@10 on Instrument drops to 0.0609, Scientific to 0.0422). This confirms the effectiveness of preference-semantic alignment in enhancing user preference modeling by explicitly aligning the decoder's hidden states with the reconstructed item semantics.
- w/o L_SIA & L_PSA (without both alignment losses): When both alignment losses are removed, the performance drops further (e.g., Recall@10 on Instrument drops to 0.0601). This demonstrates that both sequence-item alignment and preference-semantic alignment contribute positively to generative recommendation, and their combination provides the best results.
- w/o AT (without alternating training): Directly jointly learning all optimization objectives without alternating training causes a significant performance decline (e.g., Recall@10 on Instrument drops to 0.0529, a much larger drop than removing the alignment losses). This highlights the importance of the alternating optimization strategy for stabilizing the training process: frequent updates to the item tokenizer in a naive joint training setup can disrupt the recommender's training.
- w/o ETE (without end-to-end learning): This variant trains the item tokenizer separately, fixes it, and uses its final item tokens to train a generative recommender (similar to existing baselines like LETTER). It performs worse than the full ETEGRec model (e.g., Recall@10 on Instrument drops to 0.0600, Scientific to 0.0431, Game to 0.0899). This crucial finding indicates that ETEGRec's improvement is not merely from superior initial item identifiers. Instead, the end-to-end optimization process, which continuously refines the item tokenizer and integrates its prior knowledge with the generative recommender, is a core driver of its superior performance. This confirms the paper's central hypothesis about the benefits of tight coupling.
6.3. Further Analysis
6.3.1. Generalizability Evaluation
To assess ETEGRec's ability to generalize, especially to new or unseen users, the authors created a test set of users not present in the training data: on the Instrument and Scientific datasets, the 5% of users with the shortest interaction histories were held out as new users.
The following figure (Figure 2 from the original paper) shows the performance comparison:

Figure 2 is a bar chart comparing the Recall@10 of ETEGRec, TIGER, and LETTER for seen and unseen users on the Instrument and Scientific datasets; ETEGRec outperforms both baselines in every scenario.
As can be seen from the bar chart in Figure 2, ETEGRec consistently outperforms LETTER and TIGER for both seen and unseen users on both datasets. This indicates that the recommendation-oriented alignment in ETEGRec helps in learning more robust user preference models, enabling better generalization capabilities, even for users with limited or no prior training data.
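For concreteness, the seen/unseen split described above could be implemented as in the sketch below, which assumes interaction histories are stored as a dict keyed by user ID; the 5% threshold follows the paper, while the data structure and function name are illustrative.

```python
def split_seen_unseen(user_histories, unseen_frac=0.05):
    """user_histories: dict mapping user_id -> list of interacted item_ids.

    The unseen_frac of users with the shortest histories are held out
    entirely, so the model never sees them during training.
    """
    by_length = sorted(user_histories, key=lambda u: len(user_histories[u]))
    n_unseen = max(1, int(len(by_length) * unseen_frac))
    unseen_ids = set(by_length[:n_unseen])
    seen = {u: h for u, h in user_histories.items() if u not in unseen_ids}
    unseen = {u: h for u, h in user_histories.items() if u in unseen_ids}
    return seen, unseen
```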
6.3.2. Preference-Semantic Representation Visualization
To visually confirm the effectiveness of the preference-semantic alignment objective, t-SNE [28] was used to project the learned preference representations (the decoder's hidden states) and semantic representations (the reconstructed item semantics) into a 2D space. 10 items and 80 corresponding interaction histories were selected from the Instrument and Scientific datasets.
The following figure (Figure 3 from the original paper) shows the visualization:

Figure 3 visualizes the preference and semantic representations: circles denote preference points, stars denote semantic points, and colors distinguish groups, with panel (a) for Instrument and panel (b) for Scientific.
In Figure 3 (a for Instrument, b for Scientific), circles represent preference points and stars represent semantic points. Different colors indicate distinct groups of items/preferences. The visualization clearly shows that preference points (circles) cluster closely around their corresponding target semantic points (stars) of the same color, while these clusters remain well separated from semantic points of other colors. This visual evidence strongly supports that the preference-semantic alignment objective effectively aligns sequential user preferences with the semantics of their target items, demonstrating its successful integration within the model.
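A Figure 3 style plot can be produced with scikit-learn's t-SNE, as in the sketch below. Random arrays stand in for the model's learned preference and semantic representations, and the embedding dimension (128) is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

pref = np.random.randn(80, 128)       # placeholder: 80 preference vectors
sem = np.random.randn(10, 128)        # placeholder: 10 item semantic vectors
labels = np.repeat(np.arange(10), 8)  # 8 interaction histories per item

# Project both sets into the same 2D space so distances are comparable.
xy = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(
    np.vstack([pref, sem]))
pref_xy, sem_xy = xy[:80], xy[80:]

plt.scatter(pref_xy[:, 0], pref_xy[:, 1], c=labels, marker='o', s=20)
plt.scatter(sem_xy[:, 0], sem_xy[:, 1], c=np.arange(10), marker='*', s=200)
plt.title('Preference (circles) vs. semantic (stars) representations')
plt.show()
```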
6.3.3. Hyper-Parameter Analysis
The impact on performance of the two alignment loss coefficients, one weighting the sequence-item alignment loss and the other weighting the preference-semantic alignment loss, was analyzed.
The following figure (Figure 4 from the original paper) shows the performance comparison:

Analysis of the Sequence-Item Alignment Coefficient:
- Figure 4 (left panel) shows the impact of varying this coefficient from 1e-4 to 5e-3.
- Performance (Recall@10 and NDCG@10) generally improves as the coefficient increases toward an optimal point, then starts to decline.
- The optimal setting is dataset-dependent: Instrument and Scientific share one best value, while Game peaks at a different one.
- This indicates that too large a coefficient can interfere with the model's primary learning, so the strength of the sequence-item alignment needs careful balancing.
Analysis of the Preference-Semantic Alignment Coefficient:
- Figure 4 (right panel) shows the impact of varying this coefficient from 0 to 5e-3.
- Similar trends are observed: performance improves up to an optimum and then degrades once the coefficient becomes too large.
- ETEGRec performs best on all three datasets at the same intermediate setting.
- This suggests that preference-semantic alignment also requires careful tuning to avoid over-emphasizing this objective, which could distort the recommender's primary task.

Overall, the hyper-parameter analysis demonstrates that while the alignment losses are beneficial, their coefficients need to be tuned appropriately, as overly strong alignment signals can be detrimental.
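One straightforward way to run such an analysis is a validation sweep over candidate coefficient values, as in the sketch below. The `train` and `evaluate` helpers are hypothetical, the grids only mirror the ranges quoted above, and note that the paper's own analysis varies one coefficient at a time rather than over a full grid.

```python
# Hypothetical sweep over the two alignment coefficients against
# validation Recall@10; train() and evaluate() are placeholders.
sia_grid = [1e-4, 5e-4, 1e-3, 5e-3]      # sequence-item alignment weight
psa_grid = [0, 1e-4, 5e-4, 1e-3, 5e-3]   # preference-semantic alignment weight

best = (None, None, -1.0)
for a in sia_grid:
    for b in psa_grid:
        model = train(sia_coef=a, psa_coef=b)          # hypothetical trainer
        recall = evaluate(model, metric="Recall@10")   # hypothetical evaluator
        if recall > best[2]:
            best = (a, b, recall)
print(f"best: SIA={best[0]}, PSA={best[1]}, Recall@10={best[2]:.4f}")
```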
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces ETEGRec, a novel end-to-end generative recommender that innovatively unifies item tokenization and generative recommendation. Unlike previous approaches that treat these processes as separate, ETEGRec integrates them into a cohesive framework built on a dual encoder-decoder architecture. The core of its success lies in its recommendation-oriented alignment approach, which comprises two key objectives: sequence-item alignment and preference-semantic alignment. These objectives are designed to tightly couple the learning of the item tokenizer and the generative recommender, fostering mutual enhancement. Furthermore, an alternating optimization technique is proposed to ensure stable and efficient end-to-end training of the entire system. Extensive experiments on three benchmark datasets demonstrate that ETEGRec consistently achieves superior performance compared to both traditional sequential recommendation models and existing generative recommendation baselines, validated by comprehensive ablation studies and analyses of generalizability and representation alignment.
7.2. Limitations & Future Work
The authors explicitly mention the following directions for future work:
- Transferability: Applying the joint tokenization method to other generative recommendation architectures. This suggests that the core idea of end-to-end tokenization and alignment is generalizable beyond the specific Transformer-T5 and RQ-VAE combination used in ETEGRec.
- Scaling Effects: Exploring the model's performance and behavior as the model parameters increase, implying an interest in understanding how ETEGRec scales to larger capacities and potentially larger datasets or more complex recommendation scenarios.
7.3. Personal Insights & Critique
7.3.1. Personal Insights
ETEGRec offers several compelling insights and advancements:
- Addressing the Decoupling Problem: The paper's fundamental contribution of unifying item tokenization and generative recommendation is a significant step forward. It intuitively makes sense that item representations should be learned in the context of the task they serve. This moves generative recommendation from a two-stage process (tokenize, then recommend) to a single, integrated learning system.
- Explicit Alignment Objectives: The sequence-item alignment and preference-semantic alignment are cleverly designed. They provide direct supervision signals that force the representations learned by the tokenizer to be relevant to the recommender's objectives, and vice versa. This kind of explicit cross-component alignment is a powerful technique for complex multi-module neural systems.
- Alternating Optimization for Stability: The choice of alternating optimization is a pragmatic solution to a common problem in joint training: instability. Jointly optimizing two potentially conflicting or highly sensitive components can lead to oscillating gradients or training collapse. The cyclical freezing and unfreezing provide a more stable learning environment, allowing each component to adapt to the other gradually.
- Robust Performance Gains: The consistent and statistically significant performance improvements over strong baselines across multiple datasets strongly validate the efficacy of ETEGRec. The ablation studies further reinforce the importance of each proposed component.
- Generalizability: The improved performance on unseen users is particularly valuable. Recommender systems often struggle with cold-start users or items. A more robust preference modeling capability, as suggested by ETEGRec's results, indicates its potential for real-world deployment where user bases are constantly evolving.
7.3.2. Critique and Areas for Improvement
While ETEGRec is a strong paper, some potential issues or areas for further exploration include:
- Dependency on Initial Embeddings: The item tokenizer relies on collaborative semantic embeddings obtained from a trained SASRec model. While TIGER-SAS showed the value of collaborative embeddings, this introduces an external dependency. Could ETEGRec learn these initial semantic embeddings in a truly end-to-end fashion from raw interaction data, removing the need for a separate pre-trained model? This would make the framework even more self-contained.
- Interpretability of Tokens: The tokens generated by the RQ-VAE are discrete codes from learned codebooks. While they encode semantics, their direct interpretability to humans (e.g., "this token means 'electronic music' and 'high energy'") is limited. Further research could explore methods to make these learned tokens more human-interpretable or align them with existing semantic taxonomies.
- Fixed Token Length and Codebook Size: The paper fixes the token length to 3 and the codebook size to 256. While this simplifies design, real-world item semantics might benefit from dynamic token lengths or adaptive codebook sizes. For example, highly complex items might need more tokens, while simple ones require fewer. A minimal sketch of this residual quantization scheme appears at the end of this section.
- Computational Cost of Alternating Optimization: While it provides stability, alternating optimization can be slower than truly parallel joint optimization if not implemented carefully. The convergence criteria for the item tokenizer (before permanent freezing) could also impact overall training time and final performance. Exploring more efficient joint training strategies that maintain stability could be beneficial.
- Negative Transfer Potential: The alignment losses are designed to bring components together. However, if not carefully tuned, there is a risk of negative transfer, where an alignment objective forces a component to learn representations that are suboptimal for its primary task, ultimately hurting overall performance. The hyper-parameter analysis shows this indeed happens when either alignment coefficient becomes very large.
- Generalization Beyond Amazon Data: While Amazon datasets are standard, it would be interesting to see how ETEGRec performs on datasets with different characteristics, such as much sparser cold-start scenarios, or domains with richer metadata (e.g., movies with genre tags, plot summaries, etc.).
- Cold-Item Problem: The current setup primarily addresses cold-start users through generalizability. How would ETEGRec handle cold items (new items with no interaction history), for which the initial collaborative semantic embedding may be difficult to obtain? Integrating metadata more robustly for cold items would be a valuable extension.

The methods and conclusions of ETEGRec are highly transferable. The concept of end-to-end tokenization and component alignment is applicable to any generative model that relies on discrete representations. For example, it could be used in:

- Generative models for image/video recommendation: tokenizing visual features.
- Drug discovery: generating molecular structures, with tokens representing chemical substructures.
- Code generation: tokens representing code snippets or functions.

The paper successfully demonstrates that actively integrating and optimizing the representation learning process with the downstream generative task yields significant performance benefits, opening new avenues for research in generative AI and its applications.
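As referenced in the critique above, ETEGRec's fixed-length identifiers come from residual quantization. Below is a minimal sketch of that scheme under the paper's settings (3 levels with 256 codes each); the random codebooks stand in for the learned RQ-VAE codebooks, and the 64-dimensional embedding is an assumption.

```python
import torch

def residual_quantize(x, codebooks):
    """x: (d,) item embedding; codebooks: list of (num_codes, d) tensors.

    Each level picks the nearest code and passes the residual to the next,
    so every item maps to a fixed-length sequence of discrete tokens.
    """
    tokens, residual = [], x
    for cb in codebooks:
        idx = torch.cdist(residual[None], cb).argmin()  # nearest code id
        tokens.append(int(idx))
        residual = residual - cb[idx]  # quantize what remains at next level
    return tokens

codebooks = [torch.randn(256, 64) for _ in range(3)]  # 3 levels x 256 codes
item_tokens = residual_quantize(torch.randn(64), codebooks)
print(item_tokens)  # e.g. [17, 203, 4] -> a 3-token item identifier
```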