
Generative Recommender with End-to-End Learnable Item Tokenization


TL;DR Summary

ETEGRec integrates item tokenization with generative recommendation training in an end-to-end framework, leveraging dual encoder-decoder architecture and alignment strategies to enhance recommendation accuracy and training stability.

Abstract

Generative recommendation systems have gained increasing attention as an innovative approach that directly generates item identifiers for recommendation tasks. Despite their potential, a major challenge is the effective construction of item identifiers that align well with recommender systems. Current approaches often treat item tokenization and generative recommendation training as separate processes, which can lead to suboptimal performance. To overcome this issue, we introduce ETEGRec, a novel End-To-End Generative Recommender that unifies item tokenization and generative recommendation into a cohesive framework. Built on a dual encoder-decoder architecture, ETEGRec consists of an item tokenizer and a generative recommender. To enable synergistic interaction between these components, we propose a recommendation-oriented alignment strategy, which includes two key optimization objectives: sequence-item alignment and preference-semantic alignment. These objectives tightly couple the learning processes of the item tokenizer and the generative recommender, fostering mutual enhancement. Additionally, we develop an alternating optimization technique to ensure stable and efficient end-to-end training of the entire framework. Extensive experiments demonstrate the superior performance of our approach compared to traditional sequential recommendation models and existing generative recommendation baselines. Our code is available at https://github.com/RUCAIBox/ETEGRec.

In-depth Reading

1. Bibliographic Information

1.1. Title

The central topic of the paper is "Generative Recommender with End-to-End Learnable Item Tokenization." It focuses on improving generative recommendation systems by integrating the process of item tokenization directly into the recommendation training framework.

1.2. Authors

The authors are:

  • Enze Liu (Gaoling School of Artificial Intelligence, Renmin University of China)

  • Bowen Zheng (Gaoling School of Artificial Intelligence, Renmin University of China)

  • Cheng Ling (Kuaishou Technology)

  • Lantao Hu (Kuaishou Technology)

  • Han Li (Kuaishou Technology)

  • Wayne Xin Zhao (Gaoling School of Artificial Intelligence, Renmin University of China) - Corresponding author.

    Their affiliations indicate a collaboration between academic institutions (Renmin University of China) and industry (Kuaishou Technology), suggesting a blend of theoretical rigor and practical application in the research.

1.3. Journal/Conference

The paper is published at SIGIR '25: The 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 13-18, 2025, Padua, Italy. SIGIR is a highly reputable and influential conference in the field of information retrieval, often considered a top-tier venue for research in recommender systems, search engines, and related areas. Its acceptance indicates significant academic merit and impact potential.

1.4. Publication Year

The paper is published in 2025. The Published at (UTC) timestamp from the abstract (2024-09-09T12:11:53.000Z) suggests it was made available as a preprint (likely on arXiv) in late 2024, ahead of its formal publication in the SIGIR 2025 proceedings.

1.5. Abstract

This paper introduces ETEGRec, an End-To-End Generative Recommender that addresses the challenge of effective item identifier construction in generative recommendation systems. Existing methods often separate item tokenization and generative recommendation training, leading to suboptimal performance. ETEGRec unifies these two processes within a cohesive framework built on a dual encoder-decoder architecture, comprising an item tokenizer and a generative recommender. To foster synergistic interaction, the authors propose a recommendation-oriented alignment strategy with two key objectives: sequence-item alignment and preference-semantic alignment. These objectives tightly couple the learning processes, promoting mutual enhancement. An alternating optimization technique is developed for stable and efficient end-to-end training. Extensive experiments demonstrate ETEGRec's superior performance compared to traditional sequential recommendation models and existing generative recommendation baselines.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve lies within the emerging field of generative recommendation systems. These systems approach recommendation by directly generating item identifiers (tokens) rather than just predicting a rank from a fixed set of items. While promising, a major challenge in this paradigm is the effective construction of these item identifiers—how to represent items as tokens in a way that aligns well with the recommendation task.

Why is this problem important? Current approaches typically treat item tokenization (the process of converting an item into a sequence of tokens) and generative recommendation training as two separate, decoupled processes.

  1. Suboptimal Tokenization: The item tokenizer might be trained independently, often unaware of the specific optimization objectives of the downstream recommender. This can lead to tokens that are not optimally suited for recommendation.

  2. Limited Knowledge Fusion: The generative recommender cannot deeply integrate or refine the implicit knowledge encoded in the item representations generated by a pre-trained tokenizer.

    These issues hinder the full potential of generative recommendation. The paper's entry point and innovative idea is to address this decoupling by developing an end-to-end generative recommendation framework where item tokenization and autoregressive generation are jointly optimized, allowing them to mutually enhance each other.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. Novel End-to-End Generative Recommender (ETEGRec): Introduction of a new framework that achieves mutual enhancement and joint optimization of item tokenization and autoregressive generation. This unifies previously separate processes into a single, cohesive system.

  2. Recommendation-Oriented Alignment Approach: Design of a novel alignment strategy that facilitates synergistic learning between the item tokenizer and the generative recommender. This strategy includes two key objectives: sequence-item alignment and preference-semantic alignment. These objectives ensure that the learned item tokens are highly relevant to the recommendation task and that the recommender effectively utilizes the semantic information.

  3. Alternating Optimization Technique: Development of a stable and efficient alternating optimization method for the end-to-end training of the ETEGRec framework, addressing the challenges of jointly optimizing two complex, interacting components.

    The key finding is that this end-to-end approach, with its tailored alignment strategies and optimization technique, significantly outperforms traditional sequential recommendation models and existing generative recommendation baselines across various benchmarks. Specifically, it demonstrates superior performance on metrics like Recall@K and NDCG@K, and shows improved generalizability to unseen users. These findings indicate that tightly coupling item tokenization with generative recommendation training is crucial for achieving state-of-the-art performance in this paradigm.

3. Prerequisite Knowledge & Related Work

This section provides foundational concepts and reviews related work to contextualize ETEGRec.

3.1. Foundational Concepts

To understand ETEGRec, a reader should be familiar with the following concepts:

3.1.1. Sequential Recommendation

Sequential recommendation is a subfield of recommender systems that focuses on predicting a user's next interaction (e.g., purchasing an item, watching a video) based on their historical sequence of interactions. Unlike traditional recommendation, which might consider all past interactions equally, sequential recommendation emphasizes the order and temporal dependencies of user behaviors. For example, if a user buys a phone, they are more likely to buy a phone case next than a completely unrelated item. Models in this area aim to capture sequential patterns or dynamic user preferences.

3.1.2. Generative Recommendation

Generative recommendation is an emerging paradigm that frames the recommendation task as a sequence generation problem, similar to how large language models generate text. Instead of predicting a score for each item from a fixed catalog (discriminative approach) or retrieving items, generative recommenders directly generate the identifiers (tokens) of the items to be recommended. This allows for potentially more flexible item representation and generation of novel recommendations.

3.1.3. Item Tokenization

Item tokenization is the process of converting an item (e.g., a product, movie, song) into a sequence of discrete tokens. In natural language processing, words are tokenized into subword units or characters. Similarly, in generative recommendation, an item's unique ID or features are mapped to a sequence of semantic tokens. This allows the generative model, often based on language models, to "understand" and "generate" items in a structured, compositional manner. The length of these token sequences (denoted as $L$) can be fixed or variable, and the tokens themselves are drawn from a codebook or vocabulary.

3.1.4. Transformer Model

The Transformer is a neural network architecture introduced in 2017, primarily known for its effectiveness in sequence-to-sequence tasks, especially in natural language processing (NLP). It relies entirely on attention mechanisms (specifically self-attention) to draw global dependencies between input and output. The core of a Transformer is its encoder-decoder structure.

  • Encoder: Processes the input sequence, creating a rich contextual representation for each element. It typically consists of multiple layers, each with a multi-head self-attention mechanism and a feed-forward network.

  • Decoder: Generates the output sequence one element at a time, taking the encoder's output and previously generated elements as input. It also has multi-head self-attention (masked to prevent attending to future tokens) and an encoder-decoder attention layer to focus on relevant parts of the encoder's output.

    A key component is the self-attention mechanism: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

  • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.

  • $d_k$ is the dimension of the key vectors; scaling by $\sqrt{d_k}$ keeps the dot products from growing too large and saturating the softmax.

  • The softmax function normalizes the attention scores. This allows the model to weigh the importance of different parts of the input sequence when processing each element. Models like T5 (Text-To-Text Transfer Transformer) are variants of the Transformer architecture, designed for a wide range of NLP tasks, including sequence generation.
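To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single head; the shapes and the toy self-attention example are illustrative only and not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n_query, n_key) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of the value vectors

# Toy self-attention example: 4 tokens with d_k = 8, so Q = K = V = X.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(X, X, X).shape)     # (4, 8)
```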

3.1.5. Residual Quantization Variational Autoencoder (RQ-VAE)

A Variational Autoencoder (VAE) is a type of generative model that learns a compressed, continuous latent representation of input data. It consists of an encoder that maps input to a probability distribution in the latent space and a decoder that reconstructs the input from a sample from this latent distribution. The training objective involves a reconstruction loss (to ensure the output is similar to the input) and a KL divergence term (to ensure the latent distribution is close to a prior distribution, usually a standard normal).

Vector Quantization (VQ) is a method to map continuous vector inputs to discrete representations. In VQ-VAE, the encoder maps input to a continuous latent vector, which is then quantized by finding the closest vector in a fixed-size codebook (a set of learnable embedding vectors). The decoder then reconstructs from this quantized vector.

Residual Quantization (RQ) extends VQ by applying quantization in multiple stages. Instead of quantizing the entire vector at once, RQ quantizes a residual error at each stage.

  1. Quantize the original input vector to get the first codebook vector.
  2. Calculate the residual (difference) between the original vector and the first codebook vector.
  3. Quantize this residual using a second codebook.
  4. Repeat for $L$ levels. This hierarchical approach allows for a more detailed and expressive quantization using a sequence of tokens, which is particularly useful for item tokenization as it can represent items with varying levels of granularity from coarse to fine.
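The multi-stage procedure above can be captured in a few lines. Below is a minimal sketch of the $L$-level residual quantization loop with randomly initialized codebooks; the sizes ($L=3$, $K=256$, dimension 128) mirror the configuration reported later in the paper, but the codebooks themselves are random placeholders.

```python
import numpy as np

def residual_quantize(r, codebooks):
    """Assign one token per level; each level quantizes the residual left by the previous one."""
    tokens, residual = [], r.copy()
    for codebook in codebooks:                      # codebook: (K, d) array of code embeddings
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)                          # nearest code = assigned token
        residual = residual - codebook[idx]         # pass the residual on to the next level
    return tokens

rng = np.random.default_rng(0)
L, K, d = 3, 256, 128                               # 3 levels, 256 codes per level, 128-dim codes
codebooks = [rng.normal(size=(K, d)) for _ in range(L)]
print(residual_quantize(rng.normal(size=d), codebooks))   # e.g. [137, 52, 201]
```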

3.1.6. Kullback-Leibler (KL) Divergence

The Kullback-Leibler (KL) divergence, often denoted as $D_{KL}(P \| Q)$, is a non-symmetric measure of how one probability distribution $P$ differs from a second, reference probability distribution $Q$. It quantifies the information lost when $Q$ is used to approximate $P$. For discrete probability distributions $P(x)$ and $Q(x)$ over the same event space $X$: $ D_{KL}(P \| Q) = \sum_{x \in X} P(x) \log\left(\frac{P(x)}{Q(x)}\right) $ Where:

  • $P(x)$ is the probability of event $x$ in distribution $P$.
  • $Q(x)$ is the probability of event $x$ in distribution $Q$. A lower KL divergence value indicates that the two distributions are more similar. The paper uses a symmetric Kullback-Leibler divergence for sequence-item alignment, $D_{KL}(P \| Q) + D_{KL}(Q \| P)$, to ensure that both distributions are encouraged to be similar to each other.
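As a quick reference, the following sketch computes the KL divergence between two discrete distributions and its symmetric variant used for sequence-item alignment; the probability vectors are illustrative.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def symmetric_kl(p, q):
    """Symmetric form, D_KL(P || Q) + D_KL(Q || P), as used for sequence-item alignment."""
    return kl_divergence(p, q) + kl_divergence(q, p)

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
print(kl_divergence(p, q), symmetric_kl(p, q))
```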

3.1.7. InfoNCE Loss (NT-Xent Loss)

InfoNCE (Information Noise-Contrastive Estimation) loss, also known as NT-Xent (Normalized Temperature-scaled Cross-Entropy) loss, is a popular self-supervised learning objective used to learn rich representations by maximizing the agreement between different augmented views of the same data point (positive pairs) while simultaneously pushing apart representations of different data points (negative pairs). It is commonly used in contrastive learning. The general form for a positive pair $(x_i, x_j)$ within a batch of $N$ samples (where other samples $x_k$ act as negatives) is: $ \mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(x_i, x_j) / \tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(x_i, x_k) / \tau)} $ Where:

  • $\mathrm{sim}(x_i, x_j)$ is a similarity function (e.g., cosine similarity) between two representations.
  • $\tau$ is a temperature parameter that controls the steepness of the similarity distribution. A small $\tau$ makes the model more sensitive to small differences, encouraging stronger separation.
  • The sum in the denominator includes the positive pair and all negative pairs, effectively performing a softmax over similarity scores. The paper uses InfoNCE for preference-semantic alignment to align user preference representations with item semantic representations.
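The sketch below shows the InfoNCE objective with in-batch negatives in PyTorch, where matching rows of the two representation matrices form positive pairs; dimensions and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(x, y, tau=0.07):
    """InfoNCE with in-batch negatives: row i of x should match row i of y."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.t() / tau                  # (N, N) cosine similarities scaled by temperature
    targets = torch.arange(x.size(0))         # positive pairs sit on the diagonal
    return F.cross_entropy(logits, targets)   # softmax over each row, -log prob of the positive

x, y = torch.randn(8, 64), torch.randn(8, 64)
print(info_nce(x, y).item())
```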

3.2. Previous Works

The paper categorizes related work into Sequential Recommendation and Generative Recommendation.

3.2.1. Traditional Sequential Recommendation

This paradigm focuses on predicting the next item based on a user's historical interaction sequence.

  • Markov Chains (e.g., [20]): Early approaches assuming item transitions follow a Markov process.
  • Neural Network-based Models:
    • RNN-based (e.g., GRU4Rec [7], [25]): Use Recurrent Neural Networks to model sequential dependencies.
    • CNN-based (e.g., Caser [26]): Apply Convolutional Neural Networks to capture local patterns in sequences.
    • GNN-based (e.g., [1], [33]): Utilize Graph Neural Networks to model complex relationships within interaction graphs.
    • Transformer-based (e.g., SASRec [10], BERT4Rec [22], FMLP-Rec [43], FDSA [38], S3-Rec [42]): These models leverage the Transformer architecture's self-attention mechanism for powerful sequence modeling.
      • SASRec [10]: Uses unidirectional Transformer decoder.
      • BERT4Rec [22]: Employs bidirectional attention with a masked item prediction task.
      • S3-Rec [42]: Incorporates mutual information maximization for pre-training.
      • FMLP-Rec [43]: An all-MLP architecture for sequential recommendation.
      • FDSA [38]: Models item-level and feature-level sequences with self-attention.
  • Textual/Side Feature Exploitation (e.g., [34], [38]): Enhance representations using rich textual features of users and items.

3.2.2. Generative Recommendation

This newer paradigm tokenizes item sequences and uses generative models to predict target item tokens autoregressively. It generally involves two main processes: item tokenization and generative recommendation itself.

3.2.2.1. Item Tokenization

  • Parameter-free methods:
    • Co-occurrence matrix (e.g., CID [9], GPTRec [16]): Apply matrix factorization or graph-based clustering on item co-occurrence graphs to derive identifiers. Often simple and efficient but may lack deep collaborative semantics.
    • Clustering of item embeddings (e.g., SEATER [21], EAGER [32]): Group items with similar embeddings to form hierarchical identifiers.
    • Textual metadata (e.g., [2], [6], IDGenRec [24], LlamaRec [36]): Use item titles, descriptions, or other textual features directly as tokens or to derive them.
  • Deep learning methods based on multi-level Vector Quantization (VQ):
    • TIGER [19]: Uses RQ-VAE to learn multi-level codebooks for items, deriving semantic IDs from text embeddings.
    • LETTER [30]: Builds upon RQ-VAE by aligning quantized embeddings with collaborative embeddings and introducing code assignment diversity regularization.
    • MMGRec [13], TokenRec [17], Enhanced Generative Recommendation [31]: Further developments using VQ for multi-modal or enhanced generative recommendation.

3.2.2.2. Generative Recommender

  • Encoder-decoder architecture (e.g., T5 [18]): Widely used backbone for sequence modeling and generation in generative recommendation, as seen in TIGER, LETTER, CID, SID.
  • Architecture/Objective improvements (e.g., [21]): Studies focusing on adjusting the backbone architecture or learning objectives for better performance.

3.3. Technological Evolution

The field of recommender systems has evolved from basic collaborative filtering and matrix factorization to sophisticated neural network models.

  1. Early Systems: Simple item-based or user-based collaborative filtering.

  2. Sequential Models: Introduction of Markov Chains, then RNNs, CNNs, and Transformers to capture the temporal dynamics of user behavior. This marked a shift from static user profiles to dynamic preference modeling.

  3. Generative Paradigm: Inspired by the success of Large Language Models (LLMs), the latest evolution casts recommendation as a generation task. Instead of predicting one of $N$ items, the system generates the item ID as a sequence of tokens. This opens possibilities for generating novel items or more flexible item representations.

    Within the generative paradigm, the evolution has moved from:

  • Heuristic Item Tokenization: Simple, pre-defined item IDs or IDs derived from co-occurrence matrices. These were often fixed and lacked deep semantic meaning or adaptability.
  • Pre-learned Item Tokenization: Using deep learning models (like VQ-VAE or RQ-VAE) to learn item tokens, but still as a separate, pre-processing step. The tokenizer and recommender are decoupled during training.
  • End-to-End Learnable Item Tokenization (ETEGRec): This paper represents a crucial step by integrating item tokenization and generative recommendation into a single, jointly optimized framework. This ensures the tokenizer learns representations that are explicitly useful for the recommender, and the recommender can refine its understanding of items based on these adaptively learned tokens.

3.4. Differentiation Analysis

ETEGRec distinguishes itself from previous generative recommendation models primarily by its end-to-end joint optimization of item tokenization and generative recommendation.

The following table, adapted from Table 1 in the paper, summarizes the key differences:

The following are the results from Table 1 of the original paper:

Methods       Item Tokenization            Generative Recommendation
              Learning      EL   IA        Token Sequence       TI
GPTRec [16]   Heuristic     ×    ✓         Pre-processed        ×
CID [9]       Heuristic     ×    ×         Pre-processed        ×
TIGER [19]    Pre-learned   ✓    ×         Pre-processed        ×
LETTER [30]   Pre-learned   ✓    ×         Pre-processed        ×
ETEGRec       End-to-end    ✓    ✓         Gradually refined    ✓

Where:

  • EL: Equal Length (item identifiers have the same length).
  • IA: Interaction-Aware (tokenization considers user-item interactions).
  • TI: Tokenization Integration (item tokenization is integrated into generative recommendation training).

Core Differences and Innovations of ETEGRec:

  1. End-to-End Learnable Tokenization:

    • Previous: GPTRec and CID use heuristic methods (e.g., co-occurrence matrices), which are parameter-free and efficient but often fail to capture deep semantic relevance. TIGER and LETTER use pre-learned deep neural networks (RQ-VAE) for tokenization, but this is a separate pre-processing step.
    • ETEGRec: It trains the item tokenizer jointly with the generative recommender. This means the tokenization process itself adapts to optimize the recommendation task, addressing the decoupling problem.
  2. Interaction-Aware Tokenization:

    • Previous: Only GPTRec introduced interaction awareness, primarily through the user-item interaction matrix. TIGER and LETTER (in its main form) were not explicitly interaction-aware in their tokenization.
    • ETEGRec: It explicitly incorporates preference information from user behaviors into the item tokenizer through sequence-item alignment and preference-semantic alignment. This ensures the learned tokens are not just semantically meaningful but also relevant to user preferences derived from sequential interactions.
  3. Dynamic/Refined Token Sequences:

    • Previous: Most methods use pre-processed token sequences, which remain constant during generative recommendation training. This can lead to monotonous sequence patterns and potential overfitting if the initial tokenization is suboptimal.
    • ETEGRec: With joint optimization, the item tokenizer is continuously updated, leading to gradually refined token semantics and diverse token sequences during model learning. This adaptive nature allows the system to learn better item representations over time.
  4. Mutual Enhancement: The key innovation is the recommendation-oriented alignment approach (including sequence-item alignment and preference-semantic alignment) and the alternating optimization technique. These explicitly foster a synergistic relationship, allowing the tokenizer to produce better, recommendation-focused tokens, and the recommender to better utilize and refine the knowledge embedded in these tokens.

    In essence, ETEGRec moves beyond simply using tokens for recommendation towards actively learning and refining those tokens during the recommendation process itself, guided by explicit recommendation objectives.

4. Methodology

4.1. Principles

The core idea behind ETEGRec is to unify item tokenization and generative recommendation into a single, cohesive framework, allowing for joint optimization and mutual enhancement between these two components. This addresses the limitation of prior generative recommendation systems where item tokenization was treated as a separate, pre-processing step. The theoretical basis is that by aligning the learning objectives of the item tokenizer and the generative recommender through recommendation-oriented alignment strategies, the system can learn more effective and recommendation-aware item representations, leading to superior prediction performance. The intuition is that if the item representations (tokens) are explicitly learned to be useful for the recommendation task, and the recommender is trained to leverage these evolving representations, the overall system will be more powerful than decoupled approaches.

4.2. Core Methodology In-depth (Layer by Layer)

ETEGRec is built on a dual encoder-decoder architecture, consisting of an item tokenizer and a generative recommender. These components are synergistically trained through novel recommendation-oriented alignment strategies and an alternating optimization technique.

4.2.1. Problem Formulation

The task is sequential recommendation: given a user's historical interaction sequence $S = [i_1, i_2, \ldots, i_t]$ over an item set $\mathcal{I}$, the goal is to predict the next item $i_{t+1} \in \mathcal{I}$. ETEGRec adopts a generative paradigm, casting this as token sequence generation. Each item $i$ is represented by multiple tokens $[c_1, \ldots, c_L]$, where $L$ is the fixed identifier length. The input interaction sequence $S$ is tokenized into $X = [c_1^1, c_2^1, \ldots, c_L^t]$. The objective is to generate the identifier of the target item $Y = [c_1^{t+1}, \ldots, c_L^{t+1}]$ at the $(t+1)$-th step. Formally, this is a sequence-to-sequence learning problem: $ P(\boldsymbol{Y} | \boldsymbol{X}) = \prod_{l=1}^{L} P(c_l^{t+1} | \boldsymbol{X}, c_1^{t+1}, \ldots, c_{l-1}^{t+1}) $ Where:

  • $\boldsymbol{Y}$: The token sequence representing the target item $i_{t+1}$.
  • $\boldsymbol{X}$: The tokenized historical interaction sequence.
  • $L$: The fixed length of the item identifier (number of tokens per item).
  • $c_l^{t+1}$: The $l$-th token of the target item.
  • $P(c_l^{t+1} | \boldsymbol{X}, c_1^{t+1}, \ldots, c_{l-1}^{t+1})$: The probability of generating the $l$-th token of the target item, conditioned on the input sequence $\boldsymbol{X}$ and all previously generated tokens for the current target item. This formulation highlights the autoregressive nature of the generation.
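At inference time this factorization is realized by decoding the target identifier token by token. The paper reports beam search with a beam size of 20; the sketch below simplifies this to greedy decoding and assumes a hypothetical `recommender(X, prefix)` callable that returns next-token logits.

```python
import torch

@torch.no_grad()
def generate_item_tokens(recommender, X, L, bos_id=0):
    """Greedily generate the L tokens of the target item, one token at a time.

    `recommender(X, prefix)` is assumed to return next-token logits of shape
    (vocab_size,) given the tokenized history X and the already generated prefix.
    """
    prefix = [bos_id]
    for _ in range(L):
        logits = recommender(X, prefix)        # models P(c_l | X, c_1, ..., c_{l-1})
        prefix.append(int(logits.argmax()))    # greedy choice of the next token
    return prefix[1:]                          # drop the [BOS] token

# Toy stand-in for the recommender: random logits over a vocabulary of 256 codes.
dummy = lambda X, prefix: torch.randn(256)
print(generate_item_tokens(dummy, X=[1, 2, 3], L=3))
```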

4.2.2. Dual Encoder-Decoder Architecture

The ETEGRec framework comprises two main components: an item tokenizer ($\mathcal{T}$) and a generative recommender ($\mathcal{R}$), both utilizing an encoder-decoder structure.

4.2.2.1. Item Tokenizer

The item tokenizer ($\mathcal{T}$) uses a Residual Quantization Variational Autoencoder (RQ-VAE) to construct multi-level tokens for each item. This creates an $L$-level hierarchical representation, where each item is indexed by $L$ token IDs, organizing items in a tree-structured way and allowing collaborative semantics to be shared.

Token Generation as Residual Quantization

For an item $i$, the item tokenizer takes its contextual or collaborative semantic embedding $z \in \mathbb{R}^{d_s}$ as input, where $d_s$ is the dimension of this semantic embedding. The tokenizer outputs a sequence of quantized tokens: $ [c_1, \ldots, c_L] = \mathcal{T}(z) $ Where $c_l$ is the $l$-th token for item $i$.

The process begins by encoding the semantic embedding $z$ into a latent representation $r$ using a multilayer perceptron (MLP) based encoder: $ r = \operatorname{Encoder}_T(z) $ Where $\operatorname{Encoder}_T$ represents the MLP encoder of the tokenizer.

The latent representation $r$ is then quantized into serialized codes (tokens) across $L$ levels using $L$-level codebooks. Each level $l \in \{1, \ldots, L\}$ has a codebook $C_l = \{e_k^l\}_{k=1}^K$, where $e_k^l \in \mathbb{R}^{d_c}$ are the $K$ code embeddings in the $l$-th codebook, and $d_c$ is the dimension of the code embeddings.

The residual quantization process is as follows: $ c_l = \arg\max_k P(k | v_l) $ $ v_l = v_{l-1} - e_{c_{l-1}}^{l-1} $ Where:

  • $c_l$: The $l$-th assigned token.
  • $v_l$: The residual vector at the $l$-th level.
  • $v_1$: Initialized with the latent representation $r$.
  • $e_{c_{l-1}}^{l-1}$: The code embedding chosen at the $(l-1)$-th level.
  • $P(k | v_l)$: The likelihood that the residual $v_l$ is quantized to token $k$. This probability is measured by the distance between $v_l$ and the codebook vectors $e_j^l$: $ P(k | v_l) = \frac{\exp(-\|v_l - e_k^l\|^2)}{\sum_{j=1}^{K} \exp(-\|v_l - e_j^l\|^2)} $ This formula applies a softmax over negative squared Euclidean distances. The token $c_l$ with the highest probability (smallest distance) is selected.
Reconstruction Loss

After obtaining the tokens $[c_1, \ldots, c_L]$, the quantized representation $\tilde{r}$ is formed by summing the chosen code embeddings from each level: $ \tilde{r} = \sum_{l=1}^{L} e_{c_l}^{l} \in \mathbb{R}^{d_c} $ This $\tilde{r}$ is then fed into an MLP decoder to reconstruct the original item semantic embedding: $ \tilde{z} = \operatorname{Decoder}_T(\tilde{r}) $ Where $\operatorname{Decoder}_T$ represents the MLP decoder of the tokenizer.

The semantic quantization loss ($\mathcal{L}_{\mathrm{SQ}}$) for learning the item tokenizer is a combination of reconstruction loss and RQ loss: $ \mathcal{L}_{\mathrm{SQ}} = \mathcal{L}_{\mathrm{RECON}} + \mathcal{L}_{\mathrm{RQ}} $ The reconstruction loss ($\mathcal{L}_{\mathrm{RECON}}$) ensures that the reconstructed semantic embedding $\tilde{z}$ is close to the original input $z$: $ \mathcal{L}_{\mathrm{RECON}} = \|z - \tilde{z}\|^2 $ This is the squared Euclidean distance between the original and reconstructed embeddings.

The RQ loss ($\mathcal{L}_{\mathrm{RQ}}$) is a standard loss for VQ-VAEs that guides the learning of the codebook vectors and the encoder: $ \mathcal{L}_{\mathrm{RQ}} = \sum_{l=1}^{L} \|\mathrm{sg}[v_l] - e_{c_l}^{l}\|^2 + \beta \|v_l - \mathrm{sg}[e_{c_l}^{l}]\|^2 $ Where:

  • $\mathrm{sg}[\cdot]$: The stop-gradient operation. This is crucial for training VQ-VAEs. It prevents gradients from flowing through the quantized codebook vector to the encoder, effectively allowing the encoder to optimize for similarity to the codebook entries without directly updating the codebook entries themselves.
  • $\|\mathrm{sg}[v_l] - e_{c_l}^{l}\|^2$: This term trains the codebook vectors $e_{c_l}^{l}$ to move towards the residual vectors $v_l$ they are chosen to represent. The gradient only flows to $e_{c_l}^{l}$.
  • $\|v_l - \mathrm{sg}[e_{c_l}^{l}]\|^2$: This term encourages the encoder's output $v_l$ to be close to the chosen codebook vector $e_{c_l}^{l}$. The gradient only flows to $v_l$ (and thus back to the encoder).
  • $\beta$: A hyperparameter (typically 0.25) balancing the optimization between the encoder and the codebooks.
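Putting the two terms together, here is a minimal PyTorch sketch of the semantic quantization loss, using `.detach()` as the stop-gradient operator; tensor shapes and the toy inputs are illustrative.

```python
import torch
import torch.nn.functional as F

def semantic_quantization_loss(z, z_rec, residuals, chosen_codes, beta=0.25):
    """L_SQ = ||z - z_rec||^2 + sum_l ( ||sg[v_l] - e_l||^2 + beta * ||v_l - sg[e_l]||^2 ).

    residuals[l] plays the role of v_l and chosen_codes[l] the role of e_{c_l}^l;
    .detach() implements the stop-gradient sg[.].
    """
    recon = F.mse_loss(z_rec, z, reduction="sum")               # reconstruction term
    rq = sum(
        F.mse_loss(e, v.detach(), reduction="sum")              # pulls the codebook entry toward v_l
        + beta * F.mse_loss(v, e.detach(), reduction="sum")     # commits the encoder output to e_l
        for v, e in zip(residuals, chosen_codes)
    )
    return recon + rq

z, z_rec = torch.randn(256), torch.randn(256, requires_grad=True)
residuals = [torch.randn(128, requires_grad=True) for _ in range(3)]
codes = [torch.randn(128, requires_grad=True) for _ in range(3)]
print(semantic_quantization_loss(z, z_rec, residuals, codes).item())
```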

4.2.2.2. Generative Recommender

The generative recommender utilizes a Transformer-based encoder-decoder architecture, similar to T5, known for its effectiveness in sequence modeling.

Token-level Seq2Seq Formulation

During training, the item-level user interaction sequence $S$ and the target item $i_{t+1}$ are first tokenized by the item tokenizer $\mathcal{T}$ into token sequences $X = [c_1^1, c_2^1, \ldots, c_L^t]$ and $Y = [c_1^{t+1}, \ldots, c_L^{t+1}]$, respectively. The token embeddings $E^X \in \mathbb{R}^{|X| \times d_h}$ (where $|X|$ is the total number of tokens in the input sequence, and $d_h$ is the hidden size) are fed into the generative recommender.

The encoder of the recommender processes the input token embeddings: $ H^E = \operatorname{Encoder}_R(E^X) $ Where $H^E \in \mathbb{R}^{|X| \times d_h}$ is the encoded sequence representation, capturing the user's historical preferences.

For decoding, a special start-of-sequence token ([BOS]) is prepended to $Y$, forming $\tilde{Y} = [[\mathrm{BOS}], c_1^{t+1}, \ldots, c_L^{t+1}]$. The decoder of the recommender takes $H^E$ and $\tilde{Y}$ as input to extract the user preference representation: $ H^D = \operatorname{Decoder}_R(H^E, \tilde{Y}) $ Where $H^D \in \mathbb{R}^{(L+1) \times d_h}$ represents the decoder hidden states, implying user preferences over the items.

Recommendation Loss

The decoder hidden states $H^D$ are used to predict the target item token at each step by taking an inner product with the vocabulary embedding matrix $E$. The recommender is optimized using the negative log-likelihood of the target tokens, following the sequence-to-sequence paradigm: $ \mathcal{L}_{\mathrm{REC}} = - \sum_{j=1}^{L} \log P(Y_j | \boldsymbol{X}, Y_{<j}) $ Where:

  • $Y_j$: The $j$-th token of the target item's token sequence.
  • $Y_{<j}$: All tokens generated before $Y_j$ for the current target item.
  • $P(Y_j | \boldsymbol{X}, Y_{<j})$: The probability of generating the $j$-th token, conditioned on the full input sequence $\boldsymbol{X}$ and the already generated target tokens. This loss drives the generative recommender to autoregressively generate the correct tokens for the target item.
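Given per-step logits from the decoder, the recommendation loss reduces to a plain sequence negative log-likelihood; a minimal sketch with toy shapes:

```python
import torch
import torch.nn.functional as F

def recommendation_loss(decoder_logits, target_tokens):
    """L_REC = -sum_j log P(Y_j | X, Y_<j).

    decoder_logits: (L, vocab_size) next-token logits for the L target positions.
    target_tokens:  (L,) ground-truth token ids of the target item.
    """
    log_probs = F.log_softmax(decoder_logits, dim=-1)
    return -log_probs[torch.arange(len(target_tokens)), target_tokens].sum()

logits = torch.randn(3, 256)             # L = 3 target tokens, vocabulary of 256 codes
target = torch.tensor([17, 203, 88])
print(recommendation_loss(logits, target).item())
```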

4.2.3. Recommendation-oriented Alignment

The key innovation of ETEGRec is the recommendation-oriented alignment approach, which ensures the item tokenizer and generative recommender learn synergistically rather than independently.

4.2.3.1. Sequence-Item Alignment

Alignment Hypothesis

This alignment strategy focuses on the relationship between the encoder's sequential states ($H^E$) from the generative recommender and the collaborative embedding ($z$) of the target item. The hypothesis is that the sequence representation $H^E$, which encodes information about the past interaction sequence, should be highly informative about the future target item. Therefore, when both types of representations are fed into the item tokenizer, they should yield similar tokenization results or token distributions. This alignment serves as a supervision signal to optimize both components.

Alignment Loss

To formalize this, the hidden state $H^E$ from the recommender's encoder (Eq. (11)) is first linearized by a mean pooling operation, followed by an MLP for semantic space transformation: $ z^E = \mathrm{MLP}(\mathrm{mean\_pool}(H^E)) $ Where:

  • $\mathrm{mean\_pool}(H^E)$: Averages the hidden states across the sequence length, obtaining a single vector representing the entire input sequence.

  • $\mathrm{MLP}$: A multilayer perceptron that transforms this pooled vector into $z^E$, a representation in a semantic space compatible with the item tokenizer.

    Next, the item tokenizer is used to generate token distributions for each level $l$. Let $P_z^l$ and $P_{z^E}^l$ denote the token distributions at the $l$-th level for inputs $z$ (collaborative item embedding) and $z^E$ (encoder's sequence state), respectively. The sequence-item alignment loss ($\mathcal{L}_{\mathrm{SIA}}$) uses a symmetric Kullback-Leibler divergence to enforce similarity between these distributions: $ \mathcal{L}_{\mathrm{SIA}} = \sum_{l=1}^{L} \left( D_{KL}\big(P_z^l \| P_{z^E}^l\big) + D_{KL}\big(P_{z^E}^l \| P_z^l\big) \right) $ Where:

  • $D_{KL}(\cdot \| \cdot)$: The Kullback-Leibler (KL) divergence.

  • $P_z^l$: The token distribution at the $l$-th level derived from the target item's original semantic embedding $z$.

  • $P_{z^E}^l$: The token distribution at the $l$-th level derived from the sequence state $z^E$. The sum of the two KL divergences makes the alignment symmetric, ensuring $P_z^l$ is close to $P_{z^E}^l$ and vice versa. This loss encourages the encoder to produce sequence representations ($z^E$) that semantically align with the actual target items' representations ($z$), making the encoder more informative and preventing the decoder from bypassing it.
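A minimal sketch of this alignment is given below. For brevity it works directly in the code-embedding space (feeding the target item's latent and the pooled sequence state into the same codebooks) and uses a plain `torch.nn.Linear` as a stand-in for the semantic-space MLP; all shapes and modules are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def token_distribution(v, codebook):
    """P(k | v): softmax over negative squared distances to the codebook entries."""
    return F.softmax(-((codebook - v) ** 2).sum(dim=-1), dim=-1)

def sequence_item_alignment(H_E, mlp, r_item, codebooks):
    """Symmetric KL between level-wise token distributions of the target item latent
    and of z^E = MLP(mean_pool(H^E)), following the residual quantization path."""
    v_seq = mlp(H_E.mean(dim=0))                     # mean pooling over encoded positions
    v_item, loss = r_item, 0.0
    for codebook in codebooks:                       # one codebook per quantization level
        p = token_distribution(v_item, codebook)
        q = token_distribution(v_seq, codebook)
        loss = loss + F.kl_div(q.log(), p, reduction="sum") \
                    + F.kl_div(p.log(), q, reduction="sum")
        v_item = v_item - codebook[p.argmax()]       # next-level residuals
        v_seq = v_seq - codebook[q.argmax()]
    return loss

H_E = torch.randn(10, 128)                           # 10 encoded positions, hidden size 128
mlp = torch.nn.Linear(128, 128)                      # stand-in for the semantic-space MLP
codebooks = [torch.randn(256, 128) for _ in range(3)]
print(sequence_item_alignment(H_E, mlp, torch.randn(128), codebooks).item())
```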

4.2.3.2. Preference-Semantic Alignment

Alignment Hypothesis

This strategy aims to connect the decoder's first hidden state (denoted $h^D$, the first hidden state in $H^D$ from Eq. (12)) with the reconstructed semantic embedding ($\tilde{z}$) of the target item (Eq. (7)).

  • $h^D$: Represents the sequential user preference learned by the generative recommender after modeling the interaction sequence.
  • $\tilde{z}$: Encodes the collaborative semantics of the target item, reconstructed by the item tokenizer. The hypothesis is that these two representations should be aligned, as the user's preference should reflect the semantics of the item they are likely to interact with next. Unlike the recommendation loss, which operates on item tokens, this alignment explicitly involves the reconstructed embedding $\tilde{z}$, thus engaging the tokenizer in the optimization.
Alignment Loss

An InfoNCE loss with in-batch negatives is used for preference-semantic alignment ($\mathcal{L}_{\mathrm{PSA}}$): $ \mathcal{L}_{\mathrm{PSA}} = - \left( \log \frac{\exp(s(\tilde{z}, h^D) / \tau)}{\sum_{\hat{h} \in \mathcal{B}} \exp(s(\tilde{z}, \hat{h}) / \tau)} + \log \frac{\exp(s(h^D, \tilde{z}) / \tau)}{\sum_{\hat{z} \in \mathcal{B}} \exp(s(h^D, \hat{z}) / \tau)} \right) $ Where:

  • $s(\cdot, \cdot)$: The cosine similarity function, which measures the angular similarity between two vectors.
  • $\tau$: A temperature coefficient, a hyperparameter that scales the logits before the softmax, affecting the concentration of the similarity distribution.
  • $\mathcal{B}$: A batch of training instances.
  • $\hat{h} \in \mathcal{B}$: Negative samples of preference representations within the batch.
  • $\hat{z} \in \mathcal{B}$: Negative samples of semantic representations within the batch. This loss is a symmetric InfoNCE (NT-Xent) formulation, maximizing the similarity between the positive pair ($\tilde{z}$, $h^D$) while minimizing similarity with negative pairs (other $\hat{h}$ and $\hat{z}$ in the batch). It serves as an additional enhancement to the recommendation loss by involving the tokenizer and explicitly aligning user preferences with item semantics.
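The sketch below implements this symmetric, in-batch InfoNCE between the reconstructed item semantics and the decoder's preference states; batch size, dimensions, and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def preference_semantic_alignment(z_tilde, h_D, tau=0.07):
    """Symmetric InfoNCE with in-batch negatives between reconstructed item semantics
    z_tilde (B, d) and decoder preference states h_D (B, d)."""
    z_n, h_n = F.normalize(z_tilde, dim=-1), F.normalize(h_D, dim=-1)
    logits = z_n @ h_n.t() / tau                      # (B, B) cosine similarity matrix
    targets = torch.arange(z_tilde.size(0))           # matching pairs lie on the diagonal
    # One direction treats other h's as negatives, the other treats other z's as negatives.
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)

z_tilde, h_D = torch.randn(16, 128), torch.randn(16, 128)
print(preference_semantic_alignment(z_tilde, h_D).item())
```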

The combination of sequence-item alignment and preference-semantic alignment effectively strengthens the interplay between the item tokenizer and the generative recommender, fostering mutual adaptation and enhancement.

4.2.4. Alternating Optimization

To ensure stable and effective training of the entire ETEGRec framework, an alternating optimization strategy is proposed instead of a straightforward joint optimization of all objectives simultaneously. This method iteratively optimizes the item tokenizer and the generative recommender.

4.2.4.1. Item Tokenizer Optimization

In this phase, the item tokenizer is optimized by considering its intrinsic semantic quantization loss and the two recommendation-oriented alignment losses. All parameters of the generative recommender are kept fixed (frozen). The overall loss for the item tokenizer ($\mathcal{L}_{\mathrm{IT}}$) is: $ \mathcal{L}_{\mathrm{IT}} = \mathcal{L}_{\mathrm{SQ}} + \mu \mathcal{L}_{\mathrm{SIA}} + \lambda \mathcal{L}_{\mathrm{PSA}} $ Where:

  • $\mathcal{L}_{\mathrm{SQ}}$: The semantic quantization loss (Equation (8)).
  • $\mathcal{L}_{\mathrm{SIA}}$: The sequence-item alignment loss (Equation (15)).
  • $\mathcal{L}_{\mathrm{PSA}}$: The preference-semantic alignment loss (Equation (16)).
  • $\mu$ and $\lambda$: Hyperparameters that control the weighting of the alignment losses.

4.2.4.2. Generative Recommender Optimization

Conversely, during this phase, the generative recommender is optimized using its primary recommendation loss and the two alignment losses. All parameters of the item tokenizer are kept fixed (frozen). The overall loss for the generative recommender ($\mathcal{L}_{\mathrm{GR}}$) is: $ \mathcal{L}_{\mathrm{GR}} = \mathcal{L}_{\mathrm{REC}} + \mu \mathcal{L}_{\mathrm{SIA}} + \lambda \mathcal{L}_{\mathrm{PSA}} $ Where:

  • $\mathcal{L}_{\mathrm{REC}}$: The generative recommendation loss (Equation (13)).

  • $\mathcal{L}_{\mathrm{SIA}}$: The sequence-item alignment loss (Equation (15)).

  • $\mathcal{L}_{\mathrm{PSA}}$: The preference-semantic alignment loss (Equation (16)).

  • $\mu$ and $\lambda$: The same hyperparameters as in item tokenizer optimization.

    The training process is divided into multiple cycles. Each cycle consists of a fixed number of epochs ($C$).

  • In the first epoch of each cycle, the item tokenizer is optimized based on $\mathcal{L}_{\mathrm{IT}}$. This updates the item representations based on feedback from the recommender and its own reconstruction needs.

  • For the remaining $C-1$ epochs of that cycle, the item tokenizer is frozen, and the generative recommender is trained based on $\mathcal{L}_{\mathrm{GR}}$. During this period, the item tokens remain fixed. This alternation continues until the item tokenizer converges. Once converged, the item tokenizer is permanently frozen, and the generative recommender is fully trained to convergence. This strategy addresses the challenge of stable optimization when jointly training two tightly coupled components.
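The cycle structure can be summarized as a small training skeleton. The sketch below collapses each "epoch" into a single gradient step and uses toy modules and stand-in loss callables in place of $\mathcal{L}_{\mathrm{IT}}$ and $\mathcal{L}_{\mathrm{GR}}$; only the freeze/unfreeze schedule mirrors the description above.

```python
import torch

def set_trainable(module, flag):
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad_(flag)

def alternating_training(tokenizer, recommender, loss_it, loss_gr, num_cycles=3, C=4):
    """Each cycle: one step on the tokenizer objective with the recommender frozen,
    then C-1 steps on the recommender objective with the tokenizer frozen."""
    opt_t = torch.optim.AdamW(tokenizer.parameters(), lr=1e-4, weight_decay=0.05)
    opt_r = torch.optim.AdamW(recommender.parameters(), lr=1e-3, weight_decay=0.05)
    for _ in range(num_cycles):
        set_trainable(recommender, False)                      # tokenizer phase (L_IT)
        opt_t.zero_grad(); loss_it(tokenizer, recommender).backward(); opt_t.step()
        set_trainable(recommender, True)

        set_trainable(tokenizer, False)                        # recommender phase (L_GR)
        for _ in range(C - 1):
            opt_r.zero_grad(); loss_gr(tokenizer, recommender).backward(); opt_r.step()
        set_trainable(tokenizer, True)

# Toy stand-ins: two linear layers and dummy objectives over a shared random input.
tok, rec = torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)
x = torch.randn(4, 8)
loss_it = lambda t, r: (t(x) - r(x).detach()).pow(2).mean()
loss_gr = lambda t, r: (r(t(x).detach()) - x).pow(2).mean()
alternating_training(tok, rec, loss_it, loss_gr)
print("alternating schedule finished")
```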

4.3. Discussion and Analysis

4.3.1. Comparison with Existing methods

The paper contrasts ETEGRec with typical generative recommendation models based on item tokenization and generative recommendation aspects, as presented in Table 1 (copied and explained in Section 3.4).

Key distinctions highlighted:

  • Item Tokenization: Previous methods either use heuristic (e.g., GPTRec, CID) or pre-learned (e.g., TIGER, LETTER) tokenizers. These approaches lead to a decoupling of tokenization and recommendation training. ETEGRec's end-to-end approach allows the tokenizer to be optimized with the recommender.
  • Interaction Awareness: Only GPTRec among baselines was noted for interaction awareness through user-item matrices. ETEGRec explicitly aligns past user interaction sequences with target items via sequence-item and preference-semantic alignment, integrating rich preference information.
  • Token Sequence Evolution: Existing methods rely on pre-processed (constant) token sequences, which can lead to monotonous sequence patterns and overfitting. ETEGRec's joint optimization results in gradually refined semantics and diverse token sequences, with ablation studies confirming its performance contribution.
  • Integration of Prior Knowledge: ETEGRec actively integrates and refines prior knowledge from item semantic embeddings during training, rather than isolating the tokenizer.

4.3.2. Complexity Analysis

The paper provides a complexity analysis for ETEGRec.

  • Item Tokenization: For a single item, the time complexity involves:
    • Encoder and decoder MLP layers: $O(d^2)$, where $d$ is the model dimension.
    • Codebook lookup operations: $O(LKd)$, where $L$ is the number of codebooks (token length), $K$ is the size of each codebook, and $d$ is the dimension of the code embeddings.
    • Semantic quantization loss calculation: $O(d + Ld)$.
    • Total for one item: $O(d^2 + LKd)$.
  • Generative Recommendation:
    • Sequential preference modeling (Transformer): Primarily self-attention and feed-forward layers. The complexity is $O(N^2 d + N d^2)$, where $N$ is the sequence length.
    • Loss calculations:
      • $\mathcal{L}_{\mathrm{REC}}$: Included in the generation process.
      • $\mathcal{L}_{\mathrm{SIA}}$: $O(LKd)$ (due to codebook lookups for $P_z^l$ and $P_{z^E}^l$).
      • $\mathcal{L}_{\mathrm{PSA}}$: $O(Md)$, where $M$ is the number of negative samples (batch size for in-batch negatives).
  • Overall Training Cost: $O(NLKd + N^2 d + N d^2 + Md)$. This is on the same order of magnitude as mainstream generative models such as TIGER and LETTER.
  • Inference Complexity: The inference complexity is "completely consistent with TIGER" because item tokenization results can be cached in advance, meaning the overhead of tokenization is amortized.

5. Experimental Setup

5.1. Datasets

The experiments were conducted on three subsets of the Amazon 2023 review data [8], which contains user review data from May 1996 to September 2023. These datasets are:

  1. "Musical Instruments"

  2. "Video Games"

  3. "Industrial Scientific"

    To preprocess the data, the authors applied a 5-core filter, meaning only users and items with at least five interaction records were kept. This removes infrequent users and unpopular items, ensuring sufficient interaction data for modeling. User behavior sequences were constructed chronologically, and the maximum item sequence length was uniformly set to 50.

The following are the results from Table 2 of the original paper:

Dataset #Users #Items #Interactions Sparsity
Instrument 57,439 24,587 511,836 99.964%
Scientific 50,985 25,848 412,947 99.969%
Game 94,762 25,612 814,586 99.966%

These datasets were chosen because they are publicly available, widely used in recommender systems research, and represent different domains, allowing for robust validation of the proposed method's generalizability across various types of e-commerce data. Their scale (tens of thousands of users/items, hundreds of thousands of interactions) and high sparsity are typical characteristics of real-world recommendation scenarios, making them effective for validating performance.

5.2. Evaluation Metrics

To evaluate the performance of various methods, the paper employs two widely used metrics for top-$K$ recommendation: Recall@K and Normalized Discounted Cumulative Gain@K (NDCG@K), with $K$ set to 5 and 10.

5.2.1. Recall@K

Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully recommended within the top $K$ items. In the context of sequential recommendation, it quantifies how often the actual next item a user interacts with is present in the list of the top $K$ items recommended by the system. A higher Recall@K indicates that the model is better at identifying relevant items.

Mathematical Formula: $ \mathrm{Recall@K} = \frac{|\mathrm{RecommendedItems}_K \cap \mathrm{RelevantItems}|}{|\mathrm{RelevantItems}|} $

Symbol Explanation:

  • $\mathrm{RecommendedItems}_K$: The set of top $K$ items recommended by the system for a given user.
  • $\mathrm{RelevantItems}$: The set of actual next items the user interacted with (for sequential recommendation, typically a single item, $i_{t+1}$).
  • $|\cdot|$: Denotes the cardinality (number of elements) of a set. In a "leave-one-out" setting where there is only one relevant item, Recall@K simplifies to 1 if the relevant item is in the top $K$ recommendations, and 0 otherwise.
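In code, the metric reduces to a set intersection; a minimal sketch with hypothetical item ids:

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of relevant items that appear among the top-k recommendations."""
    return len(set(recommended[:k]) & set(relevant)) / len(relevant)

# Leave-one-out setting: a single relevant next item (item ids are hypothetical).
print(recall_at_k(["B01", "B07", "B03"], ["B07"], k=2))  # 1.0: hit within the top 2
print(recall_at_k(["B01", "B07", "B03"], ["B09"], k=2))  # 0.0: miss
```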

5.2.2. Normalized Discounted Cumulative Gain@K (NDCG@K)

Conceptual Definition: NDCG@K is a measure of ranking quality that takes into account the position of relevant items in the recommended list. It assigns higher scores to relevant items that appear at higher (earlier) ranks. It's normalized to a value between 0 and 1, where 1 represents a perfect ranking. NDCG@K is particularly useful when the relevance of items can vary (e.g., highly relevant vs. marginally relevant), and when the order of recommendations matters.

Mathematical Formula: $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $ Where: $ \mathrm{DCG@K} = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}_j} - 1}{\log_2(j+1)} $ $ \mathrm{IDCG@K} = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}'_j} - 1}{\log_2(j+1)} $

Symbol Explanation:

  • $\mathrm{DCG@K}$: Discounted Cumulative Gain at rank $K$. It sums the relevance scores of items in the recommended list, with a discount factor applied to items at lower ranks.

  • $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at rank $K$. This is the maximum possible DCG value, obtained by ranking all relevant items perfectly (in decreasing order of relevance). It serves as a normalization factor.

  • $\mathrm{rel}_j$: The relevance score of the item at position $j$ in the recommended list. In binary relevance (the item is either relevant or not), $\mathrm{rel}_j$ is typically 1 if the item is relevant, and 0 otherwise.

  • $\mathrm{rel}'_j$: The relevance score of the item at position $j$ in the ideal ranked list. This is set up such that the most relevant item has the highest $\mathrm{rel}'_j$ and is placed at $j=1$, and so on.

  • $\log_2(j+1)$: The logarithmic discount factor. Items at rank $j=1$ have a discount of $\log_2(2)=1$, while items at rank $j=2$ have $\log_2(3) \approx 1.58$, meaning their contribution to DCG is reduced.

    The leave-one-out strategy is used for splitting data: the latest interaction for each user is test data, the second latest is validation data, and all others are training data. A full ranking evaluation is performed over the entire item set to avoid sampling bias, and the beam size for generative models is set to 20.
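For completeness, a minimal sketch of NDCG@K under the binary-relevance, leave-one-out setting used here; the example ranking is illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum((2 ** rel - 1) / math.log2(j + 2) for j, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the ideal DCG (relevances sorted in decreasing order)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Binary relevance with the single relevant item ranked 3rd in the top 5.
print(ndcg_at_k([0, 0, 1, 0, 0], k=5))   # 1 / log2(4) = 0.5
```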

5.3. Baselines

The paper compares ETEGRec against two categories of baseline models:

5.3.1. Traditional Sequential Recommendation Models

These models aim to predict the next item in a sequence using discriminative approaches.

  • Caser [26]: Utilizes horizontal and vertical convolutional filters to capture patterns in user behavior sequences.
  • GRU4Rec [7]: An RNN-based sequential recommender employing Gated Recurrent Units (GRU) for user behavior modeling.
  • HGN [15]: Hierarchical Gating Networks designed to capture both long-term and short-term user interests from item sequences.
  • SASRec [10]: Adopts a unidirectional Transformer to model user behaviors with self-attention.
  • BERT4Rec [22]: Introduces a bidirectional Transformer and a mask prediction task for training, drawing inspiration from BERT in NLP.
  • FMLP-Rec [43]: An all-MLP sequential recommender with learnable filters, aiming to reduce behavior noise.
  • FDSA [38]: Emphasizes transformation patterns between item features by modeling both item-level and feature-level sequences with self-attention.
  • S3-Rec [42]: Incorporates mutual information maximization for pre-training sequential models, learning item and attribute correlations.

5.3.2. Generative Recommendation Models

These models formulate recommendation as a sequence generation task.

  • SID [9]: Sequentially encodes item IDs as numerical tokens and uses them as item identifiers for generative recommendation. Lacks semantic information in tokens.

  • CID [9]: Integrates collaborative knowledge by generating item identifiers through spectral clustering on item co-occurrence graphs. Still a heuristic approach to tokenization.

  • TIGER [19]: Leverages text embeddings to construct semantic IDs for items using RQ-VAE and adopts a generative retrieval paradigm.

  • TIGER-SAS [19]: A variant of TIGER that uses item embeddings from a trained SASRec instead of text embeddings to construct semantic IDs, thus incorporating collaborative prior knowledge.

  • LETTER [30]: Designs a learnable tokenizer by integrating hierarchical semantics, collaborative signals, and code assignment diversity into the RQ-VAE framework.

    These baselines are representative as they cover both traditional discriminative and modern generative approaches, including state-of-the-art models and specific generative methods that ETEGRec directly aims to improve upon.

5.4. Semantic ID Generation

The paper highlights specific details for semantic ID generation using the item tokenizer:

  • Item Collaborative Semantic Embeddings: 256-dimensional embeddings are obtained from a trained SASRec [10] model. This provides a strong initial representation for items based on collaborative filtering.
  • Item Tokenizer Architecture: A 3-layer MLP is used for both the encoder and decoder within the RQ-VAE.
  • Codebook Configuration: The number of codebooks ($L$) is set to 3. Each codebook contains $K = 256$ code embeddings, and each embedding has a dimension of 128. This means each item is represented by a sequence of 3 tokens, where each token is chosen from a vocabulary of 256 codes.
  • Uniqueness of Semantic IDs: To ensure distinct semantic item IDs, an additional token is appended at the end of the 3 semantic tokens, following the approach in TIGER [19]. This handles cases where different items would otherwise share the same $L$-token sequence from the RQ-VAE.

5.5. Implementation Details

  • Generative Recommender Backbone: T5 model with 6 encoder and 6 decoder layers.
  • Model Dimensions: Hidden size dh=128d_h = 128. Feed-Forward Network (FFN) dimension = 512.
  • Attention Mechanism: Each layer has 4 self-attention heads, each with a dimension of 64.
  • Initialization: The item tokenizer is initialized with a pre-trained RQ-VAE.
  • Optimizer: AdamW optimizer with a weight decay of 0.05 is used for training the entire framework.
  • Alternating Optimization Cycles: The number of epochs per cycle ($C$) is tuned in {2, 4}.
    • Cycle Structure: Each cycle begins with 1 epoch of item tokenizer training (using $\mathcal{L}_{\mathrm{IT}}$).
    • Remaining Epochs: Followed by $C-1$ epochs of generative recommender training (using $\mathcal{L}_{\mathrm{GR}}$), during which the tokenizer is frozen.
    • This process repeats until the item tokenizer converges, after which it is permanently frozen, and the generative recommender is trained to full convergence.
  • Learning Rates:
    • Generative recommender: Tuned within {5e-3, 3e-3, 1e-3}.
    • Item tokenizer: Tuned within {5e-4, 1e-4, 5e-5}.
  • Hyperparameters:
    • Alignment loss coefficients ($\mu$, $\lambda$): Tuned within {5e-3, 1e-3, 5e-4, 3e-4, 1e-4}.
  • Baseline Implementations:
    • Traditional models: Implemented using RecBole [39, 40], an open-source recommendation framework.
    • CID, SID, LETTER: Official implementations used.
    • TIGER, TIGER-SAS: Implementation details from the original paper [19] followed.
  • Item Embedding Dimension: Set to 128 for all models, except for SID and CID which retained their default dimension of 768.

6. Results & Analysis

6.1. Core Results Analysis

The ETEGRec model was evaluated against traditional and generative recommendation baselines on three Amazon datasets. The overall performance is presented in Table 3.

The following are the results from Table 3 of the original paper:

| Model | Instr. R@5 | Instr. R@10 | Instr. N@5 | Instr. N@10 | Sci. R@5 | Sci. R@10 | Sci. N@5 | Sci. N@10 | Game R@5 | Game R@10 | Game N@5 | Game N@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Caser | 0.0242 | 0.0392 | 0.0154 | 0.0202 | 0.0172 | 0.0281 | 0.0107 | 0.0142 | 0.0346 | 0.0567 | 0.0221 | 0.0291 |
| GRU4Rec | 0.0345 | 0.0537 | 0.0220 | 0.0281 | 0.0221 | 0.0353 | 0.0144 | 0.0186 | 0.0522 | 0.0831 | 0.0337 | 0.0436 |
| HGN | 0.0319 | 0.0515 | 0.0202 | 0.0265 | 0.0220 | 0.0356 | 0.0138 | 0.0182 | 0.0423 | 0.0694 | 0.0266 | 0.0353 |
| SASRec | 0.0341 | 0.0530 | 0.0217 | 0.0277 | 0.0256 | 0.0406 | 0.0147 | 0.0195 | 0.0517 | 0.0821 | 0.0329 | 0.0426 |
| BERT4Rec | 0.0305 | 0.0483 | 0.0196 | 0.0253 | 0.0180 | 0.0300 | 0.0113 | 0.0151 | 0.0453 | 0.0716 | 0.0294 | 0.0378 |
| FMLP-Rec | 0.0328 | 0.0529 | 0.0206 | 0.0271 | 0.0248 | 0.0388 | 0.0158 | 0.0203 | 0.0535 | 0.0860 | 0.0331 | 0.0435 |
| FDSA | 0.0364 | 0.0557 | 0.0233 | 0.0295 | 0.0261 | 0.0391 | 0.0174 | 0.0216 | 0.0548 | 0.0857 | 0.0353 | 0.0453 |
| S3Rec | 0.0340 | 0.0538 | 0.0218 | 0.0282 | 0.0253 | 0.0410 | 0.0172 | 0.0218 | 0.0533 | 0.0823 | 0.0351 | 0.0444 |
| SID | 0.0319 | 0.0438 | 0.0237 | 0.0275 | 0.0155 | 0.0234 | 0.0103 | 0.0129 | 0.0480 | 0.0693 | 0.0333 | 0.0401 |
| CID | 0.0352 | 0.0507 | 0.0234 | 0.0285 | 0.0192 | 0.0300 | 0.0123 | 0.0158 | 0.0497 | 0.0748 | 0.0343 | 0.0424 |
| TIGER | 0.0368 | 0.0574 | 0.0242 | 0.0308 | 0.0275 | 0.0431 | 0.0181 | 0.0231 | 0.0570 | 0.0895 | 0.0370 | 0.0471 |
| TIGER-SAS | 0.0375 | 0.0576 | 0.0242 | 0.0306 | 0.0272 | 0.0435 | 0.0174 | 0.0227 | 0.0561 | 0.0891 | 0.0363 | 0.0469 |
| LETTER | 0.0372 | 0.0581 | 0.0243 | 0.0310 | 0.0276 | 0.0433 | 0.0179 | 0.0230 | 0.0576 | 0.0901 | 0.0373 | 0.0475 |
| ETEGRec | 0.0402* | 0.0624* | 0.0260* | 0.0331* | 0.0294* | 0.0455* | 0.0190* | 0.0241* | 0.0616* | 0.0947* | 0.0400* | 0.0507* |

(R = Recall, N = NDCG; Instr. = Instrument, Sci. = Scientific.)

Observations:

  1. Traditional Sequential Models: FDSA generally shows the best performance among traditional models, likely due to its utilization of additional textual features. FMLP-Rec, SASRec, and GRU4Rec also perform competitively, suggesting the effectiveness of various neural architectures in modeling behavior sequences. Caser and BERT4Rec show comparatively lower performance in some cases.
  2. Generative Recommendation Models:
    • TIGER and TIGER-SAS consistently outperform CID and SID. This highlights the importance of learning meaningful item tokens through RQ-VAE (as in TIGER/TIGER-SAS/LETTER) compared to numerical IDs (SID) or heuristic co-occurrence-based IDs (CID). SID and CID perform poorly, despite using a pretrained T5 model, indicating that the quality of item tokenization is paramount.
    • TIGER-SAS performs similarly to TIGER, suggesting that both collaborative and textual semantics are valuable for item representation, and SASRec-derived embeddings provide good collaborative signals.
    • LETTER generally achieves the best performance among prior generative baselines. This is attributed to its advanced learnable tokenizer design that integrates hierarchical semantics, collaborative signals, and code assignment diversity.
  3. ETEGRec's Superiority: ETEGRec consistently achieves the best results across all three datasets and all evaluation metrics (Recall@5, Recall@10, NDCG@5, NDCG@10). The * indicates statistical significance ($p < 0.01$ in a paired t-test) compared to the best baseline; a minimal sketch of such a test follows this list. This empirically validates the effectiveness of ETEGRec's approach of jointly optimizing item tokenization and generative recommendation through its recommendation-oriented alignment. The improvements are attributed to the mutual enhancement between the item tokenizer and the generative recommender.
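
As a side note on the significance claim, the sketch below shows how a paired t-test over per-user metric values can be computed with SciPy; the score arrays are invented purely for illustration and are not the paper's data.

```python
from scipy import stats

# Hypothetical per-user Recall@10 scores for ETEGRec and the best baseline (LETTER),
# measured on the same users so the samples are paired.
etegrec_scores = [0.9, 1.0, 0.8, 0.7, 1.0, 0.6, 0.9, 0.8]
letter_scores  = [0.8, 0.9, 0.7, 0.7, 0.9, 0.5, 0.8, 0.7]

# Paired t-test; significance is claimed in the paper when p < 0.01.
t_stat, p_value = stats.ttest_rel(etegrec_scores, letter_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```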

6.2. Ablation Study

To understand the contribution of each proposed component, an ablation study was conducted. The results are presented in Table 4.

The following are the results from Table 4 of the original paper:

| Variants | Instr. R@5 | Instr. R@10 | Instr. N@5 | Instr. N@10 | Sci. R@5 | Sci. R@10 | Sci. N@5 | Sci. N@10 | Game R@5 | Game R@10 | Game N@5 | Game N@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ETEGRec | 0.0402 | 0.0624 | 0.0260 | 0.0331 | 0.0294 | 0.0455 | 0.0190 | 0.0241 | 0.0616 | 0.0947 | 0.0400 | 0.0507 |
| w/o $\mathcal{L}_{\mathrm{SIA}}$ | 0.0396 | 0.0614 | 0.0255 | 0.0325 | 0.0285 | 0.0446 | 0.0186 | 0.0238 | 0.0590 | 0.0917 | 0.0386 | 0.0491 |
| w/o $\mathcal{L}_{\mathrm{PSA}}$ | 0.0389 | 0.0609 | 0.0250 | 0.0321 | 0.0270 | 0.0422 | 0.0174 | 0.0223 | 0.0602 | 0.0933 | 0.0392 | 0.0499 |
| w/o $\mathcal{L}_{\mathrm{SIA}}$ & $\mathcal{L}_{\mathrm{PSA}}$ | 0.0379 | 0.0601 | 0.0245 | 0.0317 | 0.0269 | 0.0422 | 0.0175 | 0.0224 | 0.0576 | 0.0894 | 0.0375 | 0.0478 |
| w/o AT | 0.0337 | 0.0529 | 0.0215 | 0.0277 | 0.0234 | 0.0375 | 0.0153 | 0.0198 | 0.0514 | 0.0810 | 0.0333 | 0.0428 |
| w/o ETE | 0.0388 | 0.0600 | 0.0252 | 0.0320 | 0.0277 | 0.0431 | 0.0181 | 0.0230 | 0.0569 | 0.0899 | 0.0369 | 0.0475 |

(R = Recall, N = NDCG; Instr. = Instrument, Sci. = Scientific.)

Analysis of Ablation Study Variants:

  • w/o $\mathcal{L}_{\mathrm{SIA}}$ (without the Sequence-Item Alignment loss): Removing this loss leads to a performance drop across all datasets (e.g., Recall@10 on Instrument drops from 0.0624 to 0.0614). This indicates that aligning the token distributions derived from the encoder's sequential states ($z^E$) and the target item's collaborative embedding ($z$) is crucial: it helps the encoder produce highly informative representations for predicting future interactions.
  • w/o $\mathcal{L}_{\mathrm{PSA}}$ (without the Preference-Semantic Alignment loss): This variant also shows performance degradation (e.g., Recall@10 drops to 0.0609 on Instrument and to 0.0422 on Scientific). This confirms the effectiveness of preference-semantic alignment in enhancing user preference modeling by explicitly aligning the decoder's hidden states with the reconstructed item semantics.
  • w/o $\mathcal{L}_{\mathrm{SIA}}$ & $\mathcal{L}_{\mathrm{PSA}}$ (without both alignment losses): When both alignment losses are removed, the performance drops further (e.g., Recall@10 on Instrument drops to 0.0601). This demonstrates that both sequence-item alignment and preference-semantic alignment contribute positively to generative recommendation, and their combination provides the best results.
  • w/o AT (without Alternating Training): Directly jointly learning all optimization objectives without alternating training causes a significant performance decline (e.g., Recall@10 on Instrument drops to 0.0529, a much larger drop than removing alignment losses). This highlights the importance of the alternating optimization strategy for stabilizing the training process. Frequent updates to the item tokenizer in a naive joint training setup can disrupt the recommender's training.
  • w/o ETE (without End-To-End learning): This variant represents a scenario where the item tokenizer is trained separately and then fixed, and its final item tokens are used to train a generative recommender (similar to existing baselines like LETTER). It performs worse than the full ETEGRec model (e.g., Recall@10 on Instrument drops to 0.0600, Scientific to 0.0431, Game to 0.0899). This crucial finding indicates that ETEGRec's improvement is not merely from superior initial item identifiers. Instead, the end-to-end optimization process, which continuously refines the item tokenizer and integrates its prior knowledge with the generative recommender, is a core driver of its superior performance. This confirms the paper's central hypothesis about the benefits of tight coupling.

6.3. Further Analysis

6.3.1. Generalizability Evaluation

To assess ETEGRec's ability to generalize, especially to new or unseen users, the authors created a test set with users not present in the training data. For this, 5% of users with the least interaction history were selected as new users on the Instrument and Scientific datasets.
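
A minimal sketch of this evaluation split is shown below, under the assumption that "new" users are simply the least active ones held out from training; the function name and dictionary layout are illustrative, not the authors' preprocessing code.

```python
def split_seen_unseen(user_interactions, unseen_ratio=0.05):
    """Hold out the fraction of users with the fewest interactions as 'unseen' users.

    `user_interactions` maps user_id -> list of interacted item_ids.
    """
    users_by_activity = sorted(user_interactions, key=lambda u: len(user_interactions[u]))
    n_unseen = max(1, int(len(users_by_activity) * unseen_ratio))
    unseen_users = set(users_by_activity[:n_unseen])
    seen_users = set(users_by_activity[n_unseen:])
    return seen_users, unseen_users

# Example usage with toy data (ratio raised so at least one user is held out).
toy = {"u1": ["a", "b", "c"], "u2": ["a"], "u3": ["b", "c", "d", "e"], "u4": ["a", "b"]}
seen, unseen = split_seen_unseen(toy, unseen_ratio=0.25)
print(seen, unseen)   # u2 (fewest interactions) ends up in the unseen split
```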

The following figure (Figure 2 from the original paper) shows the performance comparison:

Figure 2: Performance comparison on seen and unseen users.
(Image description: a bar chart comparing the Recall@10 of ETEGRec, TIGER, and LETTER for seen and unseen users on the Instrument and Scientific datasets; ETEGRec outperforms both baselines in every setting.)

As can be seen from the bar chart in Figure 2, ETEGRec consistently outperforms LETTER and TIGER for both seen and unseen users on both datasets. This indicates that the recommendation-oriented alignment in ETEGRec helps in learning more robust user preference models, enabling better generalization capabilities, even for users with limited or no prior training data.

6.3.2. Preference-Semantic Representation Visualization

To visually confirm the effectiveness of the preference-semantic alignment ($\mathcal{L}_{\mathrm{PSA}}$), t-SNE [28] was used to project the learned preference representations ($h^D$) and semantic representations ($\tilde{z}$) into a 2D space. 10 items and 80 corresponding interaction histories were selected from the Instrument and Scientific datasets. (A minimal sketch of such a projection is given at the end of this subsection.)

The following figure (Figure 3 from the original paper) shows the visualization:

Figure 3: Visualization of preference and semantic representations, where circles denote preference points, stars represent semantic points, and different colors indicate distinct groups
(Image description: panel (a) shows the Instrument dataset and panel (b) the Scientific dataset; circles mark preference points, stars mark semantic points, and colors distinguish groups.)

In Figure 3 (a for Instrument, b for Scientific), circles represent preference points ($h^D$), and stars represent semantic points ($\tilde{z}$). Different colors indicate distinct groups of items/preferences. The visualization clearly shows that preference points (circles) are clustered closely around their corresponding target semantic points (stars) of the same color. Simultaneously, these clusters are well-separated from other semantic points of different colors. This visual evidence strongly supports that the preference-semantic alignment objective effectively aligns sequential user preferences with the semantics of their target items, demonstrating its successful integration within the model.
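
The sketch below illustrates how such a joint t-SNE projection can be produced with scikit-learn and matplotlib. The preference and semantic vectors here are random stand-ins for the model's actual $h^D$ and $\tilde{z}$ outputs, and the group sizes mirror the 10-item, 80-history setup described above.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Hypothetical stand-ins: 80 preference vectors (8 histories per item) and
# 10 semantic vectors, both with the model's hidden size (128).
rng = np.random.default_rng(0)
pref = rng.normal(size=(80, 128))      # decoder preference representations h^D
sem = rng.normal(size=(10, 128))       # reconstructed item semantics z~
groups = np.repeat(np.arange(10), 8)   # which target item each preference belongs to

# Project both sets jointly so distances are comparable in the 2D embedding.
points_2d = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(
    np.vstack([pref, sem]))
pref_2d, sem_2d = points_2d[:80], points_2d[80:]

plt.scatter(pref_2d[:, 0], pref_2d[:, 1], c=groups, marker="o", s=20)        # circles: preferences
plt.scatter(sem_2d[:, 0], sem_2d[:, 1], c=np.arange(10), marker="*", s=200)  # stars: semantics
plt.savefig("preference_semantic_tsne.png")
```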

6.3.3. Hyper-Parameter Analysis

The impact of the alignment loss coefficients $\mu$ (for sequence-item alignment) and $\lambda$ (for preference-semantic alignment) on performance was analyzed.

The following figure (Figure 4 from the original paper) shows the performance comparison:

Figure 4: Performance comparison of different alignment loss coefficients.

Analysis of $\mu$ (Sequence-Item Alignment Coefficient):

  • Figure 4 (left panel) shows the impact of varying $\mu$ from 1e-4 to 5e-3.
  • Performance (Recall@10 and NDCG@10) generally improves as $\mu$ increases to an optimal point, then starts to decline.
  • Optimal results are achieved at $\mu = 3\text{e-}4$ for the Instrument and Scientific datasets, and $\mu = 1\text{e-}3$ for the Game dataset.
  • This indicates that too large a $\mu$ can interfere with model learning, suggesting a careful balance is needed for the sequence-item alignment.

Analysis of $\lambda$ (Preference-Semantic Alignment Coefficient):

  • Figure 4 (right panel) shows the impact of varying $\lambda$ from 0 to 5e-3.

  • Similar trends are observed as with $\mu$: performance improves up to an optimal $\lambda$ and then degrades if $\lambda$ becomes too large.

  • ETEGRec performs best across all three datasets when $\lambda = 1\text{e-}4$.

  • This suggests that preference-semantic alignment also requires careful tuning to avoid over-emphasizing this objective, which could potentially distort the recommender's primary task.

    Overall, the hyper-parameter analysis demonstrates that while the alignment losses are beneficial, their coefficients need to be appropriately tuned to achieve optimal performance, as overly strong alignment signals can be detrimental.
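
To make the role of the two coefficients concrete, the sketch below combines a recommendation loss with the two alignment losses under an assumed weighting scheme $\mathcal{L} = \mathcal{L}_{\mathrm{GR}} + \mu\,\mathcal{L}_{\mathrm{SIA}} + \lambda\,\mathcal{L}_{\mathrm{PSA}}$ and enumerates the reported tuning grid; the exact loss composition in the paper may include additional tokenizer terms, and the numeric loss values are toy placeholders.

```python
import itertools

def combined_loss(rec_loss, sia_loss, psa_loss, mu, lam):
    """Assumed weighting: recommendation loss plus the two alignment losses
    scaled by their coefficients (toy illustration of how mu and lambda act)."""
    return rec_loss + mu * sia_loss + lam * psa_loss

# Tuning grid reported for both coefficients (each is tuned per dataset).
grid = [1e-4, 3e-4, 5e-4, 1e-3, 5e-3]
for mu, lam in itertools.product(grid, grid):
    # With toy loss values, large coefficients let the alignment terms dominate the
    # total objective, mirroring the degradation seen when mu or lambda is too large.
    total = combined_loss(2.31, 0.87, 1.42, mu, lam)
    print(f"mu={mu:g}, lambda={lam:g}, total loss={total:.4f}")
```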

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces ETEGRec, a novel end-to-end generative recommender that innovatively unifies item tokenization and generative recommendation. Unlike previous approaches that treat these processes as separate, ETEGRec integrates them into a cohesive framework built on a dual encoder-decoder architecture. The core of its success lies in its recommendation-oriented alignment approach, which comprises two key objectives: sequence-item alignment and preference-semantic alignment. These objectives are designed to tightly couple the learning of the item tokenizer and the generative recommender, fostering mutual enhancement. Furthermore, an alternating optimization technique is proposed to ensure stable and efficient end-to-end training of the entire system. Extensive experiments on three benchmark datasets demonstrate that ETEGRec consistently achieves superior performance compared to both traditional sequential recommendation models and existing generative recommendation baselines, validated by comprehensive ablation studies and analyses of generalizability and representation alignment.

7.2. Limitations & Future Work

The authors explicitly mention the following directions for future work:

  • Transferability: Applying the joint tokenization method to other generative recommendation architectures. This suggests that the core idea of end-to-end tokenization and alignment is generalizable beyond the specific Transformer-T5 and RQ-VAE combination used in ETEGRec.
  • Scaling Effects: Exploring the performance and behavior of the model when increasing the model parameters, implying an interest in understanding how ETEGRec scales to larger capacities and potentially larger datasets or more complex recommendation scenarios.

7.3. Personal Insights & Critique

7.3.1. Personal Insights

ETEGRec offers several compelling insights and advancements:

  • Addressing the Decoupling Problem: The paper's fundamental contribution of unifying item tokenization and generative recommendation is a significant step forward. It intuitively makes sense that item representations should be learned in the context of the task they serve. This moves generative recommendation from a two-stage process (tokenize, then recommend) to a single, integrated learning system.
  • Explicit Alignment Objectives: The sequence-item alignment and preference-semantic alignment are cleverly designed. They provide direct supervision signals that force the representations learned by the tokenizer to be relevant to the recommender's objectives, and vice-versa. This kind of explicit cross-component alignment is a powerful technique for complex multi-module neural systems.
  • Alternating Optimization for Stability: The choice of alternating optimization is a pragmatic solution to a common problem in joint training: instability. Jointly optimizing two potentially conflicting or highly sensitive components can lead to oscillating gradients or training collapse. The cyclical freezing and unfreezing provide a more stable learning environment, allowing each component to adapt to the other gradually.
  • Robust Performance Gains: The consistent and statistically significant performance improvements over strong baselines across multiple datasets strongly validate the efficacy of ETEGRec. The ablation studies further reinforce the importance of each proposed component.
  • Generalizability: The improved performance on unseen users is particularly valuable. Recommender systems often struggle with cold-start users or items. A more robust preference modeling capability, as suggested by ETEGRec's results, indicates its potential for real-world deployment where user bases are constantly evolving.

7.3.2. Critique and Areas for Improvement

While ETEGRec is a strong paper, some potential issues or areas for further exploration include:

  • Dependency on Initial Embeddings: The item tokenizer relies on collaborative semantic embeddings obtained from a trained SASRec model. While TIGER-SAS showed the value of collaborative embeddings, this introduces an external dependency. Could ETEGRec learn these initial semantic embeddings in a truly end-to-end fashion from raw interaction data, removing the need for a separate pre-trained model? This would make the framework even more self-contained.

  • Interpretability of Tokens: The tokens generated by the RQ-VAE are discrete codes from learned codebooks. While they encode semantics, their direct interpretability to humans (e.g., "this token means 'electronic music' and 'high energy'") is limited. Further research could explore methods to make these learned tokens more human-interpretable or align them with existing semantic taxonomies.

  • Fixed Token Length and Codebook Size: The paper fixes the token length $L$ to 3 and the codebook size $K$ to 256. While this simplifies design, real-world item semantics might benefit from dynamic token lengths or adaptive codebook sizes. For example, highly complex items might need more tokens, while simple ones require fewer.

  • Computational Cost of Alternating Optimization: While it provides stability, alternating optimization can be slower than truly parallel joint optimization if not implemented carefully. The convergence criteria for the item tokenizer (before permanent freezing) could also impact overall training time and final performance. Exploring more efficient joint training strategies that maintain stability could be beneficial.

  • Negative Transfer Potential: The alignment losses are designed to bring components together. However, if not carefully tuned, there's a risk of negative transfer, where an alignment objective might force a component to learn representations that are suboptimal for its primary task, ultimately hurting overall performance. The hyperparameter analysis shows this indeed happens with very large $\mu$ or $\lambda$.

  • Generalization Beyond Amazon Data: While Amazon datasets are standard, it would be interesting to see how ETEGRec performs on datasets with different characteristics, such as much sparser cold-start scenarios, or domains with richer metadata (e.g., movies with genre tags, plot summaries, etc.).

  • Cold-Item Problem: The current setup primarily addresses cold-start users through generalizability. How would ETEGRec handle cold items (new items with no interaction history), where the initial collaborative semantic embedding $z$ might be difficult to obtain? Integrating metadata more robustly for cold items would be a valuable extension.

    The methods and conclusions of ETEGRec are highly transferable. The concept of end-to-end tokenization and component alignment is applicable to any generative model that relies on discrete representations. For example, it could be used in:

  • Generative models for image/video recommendation: tokenizing visual features.

  • Drug discovery: generating molecular structures (tokens representing chemical substructures).

  • Code generation: tokens representing code snippets or functions.

    The paper successfully demonstrates that actively integrating and optimizing the representation learning process with the downstream generative task yields significant performance benefits, opening new avenues for research in generative AI and its applications.
