Pre-training Generative Recommender with Multi-Identifier Item Tokenization

Published: 04/06/2025

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The MTGRec framework enhances generative recommender pre-training through multi-identifier item tokenization, using RQ-VAE for multiple identifier association and a curriculum learning scheme to improve semantic modeling for low-frequency items and token diversity.

Abstract

Generative recommendation autoregressively generates item identifiers to recommend potential items. Existing methods typically adopt a one-to-one mapping strategy, where each item is represented by a single identifier. However, this scheme poses issues, such as suboptimal semantic modeling for low-frequency items and limited diversity in token sequence data. To overcome these limitations, we propose MTGRec, which leverages Multi-identifier item Tokenization to augment token sequence data for Generative Recommender pre-training. Our approach involves two key innovations: multi-identifier item tokenization and curriculum recommender pre-training. For multi-identifier item tokenization, we leverage the RQ-VAE as the tokenizer backbone and treat model checkpoints from adjacent training epochs as semantically relevant tokenizers. This allows each item to be associated with multiple identifiers, enabling a single user interaction sequence to be converted into several token sequences as different data groups. For curriculum recommender pre-training, we introduce a curriculum learning scheme guided by data influence estimation, dynamically adjusting the sampling probability of each data group during recommender pre-training. After pre-training, we fine-tune the model using a single tokenizer to ensure accurate item identification for recommendation. Extensive experiments on three public benchmark datasets demonstrate that MTGRec significantly outperforms both traditional and generative recommendation baselines in terms of effectiveness and scalability.

1. Bibliographic Information

1.1. Title

Pre-training Generative Recommender with Multi-Identifier Item Tokenization

1.2. Authors

Bowen Zheng, Zhongfu Chen, Enze Liu, Zhongrui Ma, Yue Wang, Wayne Xin Zhao, Ji-Rong Wen. Their affiliations primarily include Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China, and Poisson Lab, Huawei, Beijing, China. This indicates a collaboration between academia and industry, common in cutting-edge AI research.

1.3. Journal/Conference

Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25). SIGIR is a premier international conference in the field of information retrieval, including recommender systems. Its publication signifies high-quality, impactful research in the domain.

1.4. Publication Year

2025

1.5. Abstract

Generative recommendation systems aim to recommend items by autoregressively generating their identifiers. The main issue with existing methods is their reliance on a one-to-one mapping between an item and a single identifier. This approach leads to suboptimal semantic modeling, especially for low-frequency items (items that appear rarely in interaction data), and limits the diversity of token sequence data (the sequences of item identifiers used for training).

To address these problems, the paper proposes MTGRec, a framework that uses Multi-identifier item Tokenization to augment token sequence data specifically for Generative Recommender pre-training. MTGRec introduces two core innovations:

Multi-identifier item tokenization: It utilizes RQ-VAE (Residual-Quantized Variational AutoEncoder) as the tokenizer backbone. Instead of a single tokenizer, it treats model checkpoints from adjacent training epochs of the RQ-VAE as semantically relevant tokenizers. This allows each item to be associated with multiple identifiers. Consequently, a single user interaction sequence can be transformed into several token sequences, forming different data groups.
Curriculum recommender pre-training: This involves a curriculum learning scheme that is guided by data influence estimation. This dynamically adjusts the sampling probability of each data group during the recommender's pre-training phase.

After the pre-training phase, the model is fine-tuned using a single tokenizer to ensure accurate item identification for actual recommendations. Extensive experiments on three public benchmark datasets demonstrate that MTGRec significantly outperforms both traditional and existing generative recommendation baselines in terms of effectiveness and scalability.

1.6. Original Source Link

https://arxiv.org/abs/2504.04400 The publication status is a preprint on arXiv, intended for presentation at SIGIR '25.

2. Executive Summary

2.1. Background & Motivation

Core Problem: Generative recommender systems typically use a one-to-one mapping strategy where each item is represented by a single identifier (a sequence of tokens). This rigid tokenization scheme faces two main issues:

Suboptimal Semantic Modeling for Low-Frequency Items: Items that appear infrequently in user interactions (low-frequency or long-tail items) result in their associated tokens also being low-frequency. This lack of sufficient supervision signals (enough examples in training data) makes it difficult for the model to learn their true semantic meaning effectively.
Limited Diversity in Token Sequence Data: The one-to-one mapping restricts the variety of token sequence data that can be generated from user interaction sequences. This lack of diversity can hinder generalization and limit the gains from scaling up the model, in contrast to Large Language Models (LLMs), where increased data diversity and volume often lead to better performance.

Importance: Recommender systems are crucial for various online platforms. Enhancing their ability to recommend low-frequency items can improve user satisfaction by offering more diverse and personalized suggestions beyond popular items. Increasing data diversity is also key for unlocking the full potential of generative models in recommendation, allowing them to scale more effectively and achieve higher performance.

Paper's Entry Point/Innovative Idea: The paper's core idea is to break the one-to-one mapping constraint by associating each item with multiple identifiers. This multi-identifier scheme addresses the limitations by:

Increasing Token Exposure: More tokens per item means low-frequency items gain increased exposure frequency for their tokens, making it easier for the model to learn their semantics. It also promotes token sharing across items.
Enriching Data Diversity: A single user interaction sequence can now be tokenized into multiple token sequences, significantly augmenting the training data's volume and diversity. This enables better model scaling.

2.2. Main Contributions / Findings

The paper proposes MTGRec to overcome the limitations of one-to-one item tokenization in generative recommenders. Its primary contributions are:

Novel Framework for Generative Recommenders: MTGRec introduces a novel framework that leverages multiple item tokenizers for curriculum recommender pre-training. This approach significantly improves the effectiveness and scalability of generative recommendation.
Multi-Identifier Item Tokenization for Data Augmentation: The paper develops a multi-identifier item tokenization approach. This involves using RQ-VAE checkpoints from adjacent training epochs as semantically relevant item tokenizers. This allows each item to have multiple identifiers, thereby augmenting a single user interaction sequence into several token sequences that serve as different data groups for training.
Data Curriculum Scheme based on Influence Estimation: MTGRec introduces a data curriculum scheme for recommender training. This scheme uses first-order gradient approximation to estimate the data influence of each tokenizer's data group, allowing for dynamic adjustment of sampling probabilities during pre-training. This ensures that the model learns more effectively from "useful" data.
Demonstrated Superior Performance and Scalability: Extensive experiments on three public datasets ("Musical Instruments", "Industrial and Scientific", and "Video Games" from Amazon 2023 review dataset) show that MTGRec significantly outperforms both traditional and existing generative recommendation baselines. This superiority is observed in terms of effectiveness (Recall@K, NDCG@K) and also in its scalability with model scale and its improved performance for long-tail items.

These findings address the core problems of suboptimal semantic modeling for low-frequency items and limited data diversity, leading to more robust and powerful generative recommender systems.

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following concepts:

Recommender Systems: Systems designed to predict user preferences and suggest items (products, movies, articles, etc.) that a user might like.
- Sequential Recommendation: A sub-field of recommender systems that focuses on predicting the next item a user will interact with, given their historical sequence of interactions. It aims to capture dynamic user preferences and sequential patterns.
Generative Models: A class of machine learning models that can generate new data instances that resemble the training data. Examples include Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). In recommendation, generative models aim to produce item identifiers.
Item Tokenization: The process of converting an item (e.g., a movie, a product) into a sequence of discrete tokens, similar to how words are tokenized into subword units in natural language processing. This sequence of tokens acts as the item's identifier or semantic representation.
Auto-regressive Generation: A generative process where each token in a sequence is generated conditioned on the previously generated tokens and the input context. This is common in LLMs for text generation and is adapted here for generating item identifiers.
Transformer Architecture: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), which relies heavily on self-attention mechanisms. It has become the backbone for many state-of-the-art models in NLP, including LLMs, and is often used in generative recommenders.
- Encoder-Decoder Architecture: A common Transformer variant where an encoder processes the input sequence and a decoder generates the output sequence. The T5 model used in this paper is an example.
- Decoder-Only Architecture: Another Transformer variant, like GPT, which uses only the decoder part to generate sequences autoregressively.
Residual-Quantized Variational AutoEncoder (RQ-VAE): A specific type of VAE that combines residual quantization with a VAE structure.
- Variational Autoencoder (VAE): A generative model that learns a compressed, continuous latent representation of input data. It consists of an encoder (maps input to latent space) and a decoder (reconstructs input from latent space). The "variational" aspect involves learning the parameters of a probability distribution (e.g., mean and variance) for the latent space, rather than discrete points.
- Quantization: The process of mapping continuous values to a finite set of discrete values (codes or tokens). In RQ-VAE, this is often done using codebooks.
- Residual Quantization: An iterative process where an input is quantized, and then the residual (the difference between the original input and its quantized version) is quantized again, and so on. This allows for representing data with increasing levels of detail or using multiple "layers" of codes.
Curriculum Learning: A training strategy where a model is trained on progressively more difficult or complex data samples. Inspired by human learning, it starts with easier examples to build foundational knowledge, then gradually introduces harder ones.
Data Influence Estimation: Techniques used to quantify how much a specific training data point or group of data points affects the model's parameters or its performance on a validation set. First-order gradient approximation is one such method.
Long-tail Distribution: In many real-world datasets, a small number of items are very popular (head), while a large number of items are rarely observed (tail). This creates a long-tail distribution. Recommending items from the "tail" is challenging due to data sparsity.
Data Sparsity: A common problem in recommender systems where most users have interacted with only a tiny fraction of the available items, leading to very sparse user-item interaction matrices.

3.2. Previous Works

The paper categorizes related work into Sequential Recommendation and Generative Recommendation.

3.2.1. Sequential Recommendation

Traditional sequential recommendation models typically:

Assign a unique ID to each item.
Predict the next item by measuring similarity between user preference (learned from historical interactions) and candidate items.
Often rely on Approximate Nearest Neighbor (ANN) algorithms for efficient retrieval.

Early Studies:

Markov Chains (e.g., [9, 3]): Model item sequences by learning transition probabilities between items. For example, a first-order Markov Chain assumes the next item depends only on the immediate previous item.
- Fusing Similarity Models with Markov Chains for Sparse Sequential Recommendation [9] combines item similarity models with Markov chains to handle data sparsity.

Deep Learning-based Models:

Recurrent Neural Networks (RNNs) (e.g., GRU4Rec [10, 38]): Use Gated Recurrent Units (GRUs) to capture sequential patterns.
Convolutional Neural Networks (CNNs) (e.g., Caser [39]): Apply convolutional filters to extract local patterns from item sequences.
Graph Neural Networks (GNNs) (e.g., [47, 50]): Model item-item transitions or user-item interactions as graphs to capture complex relationships.
Transformers (e.g., SASRec [16], BERT4Rec [36]): Utilize self-attention mechanisms to weigh the importance of different items in a sequence.
- Self-Attention Mechanism: A core component of Transformers. For an input sequence of vectors $X = [x_1, \dots, x_L]$, where $x_i$ is the embedding of the $i$-th item, self-attention calculates an output sequence $Z = [z_1, \dots, z_L]$ where each $z_i$ is a weighted sum of the value projections of all $x_j$. The weights are derived from the pairwise similarity between the query projection of $x_i$ and the key projection of $x_j$. The attention output is typically calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Here, $Q$ (Query), $K$ (Key), and $V$ (Value) are linear transformations of the input embeddings, $d_k$ is the dimension of the key vectors, used for scaling, and $QK^T$ measures the similarity between query and key vectors.
- SASRec [16] uses a unidirectional self-attention network to model user behavior, where each item can only attend to previous items.
- BERT4Rec [36] uses a bi-directional self-attentive model with a mask prediction objective, similar to BERT in NLP, where randomly masked items in a sequence are predicted.
Other Advanced Models:
- HGN [24]: Hierarchical Gating Networks use feature-level and instance-level gating mechanisms.
- FMLP-Rec [60]: An all-MLP model with learnable filters for sequence modeling.
- HSTU [54]: Hierarchical Sequential Transducers for next item prediction, incorporating user actions and timestamps.
- FDSA [55]: Dual-stream self-attention framework modeling item-level and feature-level sequences.
- $S^3$-Rec [59]: Enhances sequential recommendation with self-supervised signals from feature-item correlations.
  
  Many of these approaches primarily use item IDs. Recent trends involve incorporating item content (textual features) for semantic embeddings to improve performance and generalization [11, 13, 20].
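The scaled dot-product attention formula described above can be sketched in plain Python. This is a minimal illustration, not code from any of the cited models: it assumes identity projections (so $Q = K = V = X$) and uses a hypothetical `causal` flag to mimic the unidirectional setting described for SASRec.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors (one vector per position).
    With causal=True, position i only attends to positions j <= i.
    """
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Masked positions get -inf, which softmax turns into weight 0.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0]))])
    return out

# Toy item embeddings; identity projections are an assumption of this sketch.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Z = attention(X, X, X, causal=True)
```

With the causal mask, the first position can only attend to itself, so `Z[0]` equals the first input vector. A real model would apply learned linear projections to obtain $Q$, $K$, and $V$.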

3.2.2. Generative Recommendation

This is an emerging paradigm where each item is represented by a list of tokens (an identifier). The task of next-item prediction becomes autoregressive generation of the target item's identifier. A critical component is item tokenization.

Item Tokenization Methods:

Heuristic Approaches (e.g., [15, 34, 45]): Rely on manually defined rules or techniques.
- Time order [15], item clustering [34, 45], matrix decomposition [15, 27]. These are simple but might miss implicit item relationships.
Text-based Approaches (e.g., [5, 8, 14, 21]): Directly use item attributes (title, description) as identifiers, leveraging pre-trained language models.
- These can suffer from inconsistent length, semantic ambiguity, and lack of collaborative information.
Codebook-based Approaches (e.g., TIGER [32], LETTER [42], [6, 29, 44]): Adopt learnable codebooks to quantize item embeddings into fixed-length, semantically rich identifiers.
- RQ-VAE [53] is a commonly used backbone for this.
- TIGER [32]: Employs RQ-VAE to quantize item embeddings into semantic IDs, using a generative retrieval paradigm.
- LETTER [42]: Extends TIGER by integrating collaborative and diversity regularization into RQ-VAE to improve codebook learning.
- TIGER++ [32]: Enhances TIGER by using representation whitening and exponential moving average (EMA) for better codebook learning and semantic ID quality. Representation whitening [35] aims to decorrelate features and equalize variance. EMA [41] provides more stable updates for codebook learning by averaging past states.
  
  Generative Recommender Models:
Usually employ decoder-only (like GPT [1, 30]) or encoder-decoder (like T5 [31]) architectures.
Some studies enhance these with dual decoders [45] or contrastive learning [34].

3.3. Technological Evolution

The evolution in recommender systems has moved from traditional ID-based collaborative filtering methods, through deep learning models (RNNs, CNNs, Transformers) leveraging item IDs, to content-aware methods using semantic embeddings from pre-trained language models. The latest frontier is Generative Recommendation, which aims to reformulate recommendation as a sequence-to-sequence generation task. This paradigm shifts from retrieving items by their IDs to generating their semantic identifiers.

This paper fits into the codebook-based generative recommendation stream. It builds upon RQ-VAE for item tokenization and Transformer-based models (specifically T5) for generation. It tackles the fundamental limitations of prior generative recommendation methods, particularly the one-to-one mapping constraint.

3.4. Differentiation Analysis

Compared to existing generative recommendation methods (e.g., TIGER, LETTER, TIGER++), MTGRec introduces a fundamental shift in item tokenization.

Core Difference: Previous methods use a strict one-to-one mapping (one item, one identifier). MTGRec proposes a multi-identifier item tokenization (one item, multiple identifiers).
Innovation 1: Multi-Identifier Item Tokenization: Instead of training a single RQ-VAE tokenizer or multiple independent ones, MTGRec intelligently leverages RQ-VAE model checkpoints from adjacent training epochs. This is key because these checkpoints are semantically relevant (they represent slightly different but related semantic views of the items), avoiding the semantic conflicts that would arise from completely independent tokenizers. This augmentation produces multiple data groups from a single user sequence.
Innovation 2: Curriculum Recommender Pre-training: With these multiple semantically relevant data groups, MTGRec doesn't just train uniformly. It introduces a data curriculum scheme guided by data influence estimation. This means it dynamically adjusts the sampling probability of each data group, prioritizing data that is more "useful" for training, thus optimizing the learning process.
Benefits: These innovations lead to more massive and diverse token sequence data for pre-training, which significantly helps in:
- Better Semantic Modeling: Especially for low-frequency items, as their tokens get more exposure and context.
- Improved Scalability: Leveraging larger and more diverse data allows the generative recommender to scale more effectively, leading to better performance.
- Enhanced Generalization: The diverse token sequences help the model learn more robust and generalizable representations.
  
  In essence, MTGRec improves generative recommendation by augmenting the input data generation process and optimizing the training strategy for this augmented data, rather than solely focusing on the generative recommender architecture or single tokenizer improvements like LETTER or TIGER++.

4. Methodology

4.1. Principles

The core idea behind MTGRec is to enrich the training data for generative recommenders by associating each item with multiple identifiers, rather than a single one. This multi-identifier scheme aims to tackle the problems of suboptimal semantic modeling for low-frequency items and limited diversity in token sequence data. The theoretical basis or intuition is that by presenting the same item through slightly varied, yet semantically coherent, token representations, the model can learn more robust and generalized item semantics. Furthermore, increasing the volume and diversity of training data allows generative recommenders to benefit from model scaling trends observed in LLMs.

The framework operates in two main phases:

Multi-Identifier Item Tokenization: This phase generates multiple token sequences for each user interaction sequence. It does so by creating several semantically relevant item tokenizers using RQ-VAE checkpoints from different training epochs.
Curriculum Recommender Pre-training: This phase leverages the augmented data. Instead of uniform sampling, it uses a curriculum learning approach, dynamically adjusting the sampling probabilities of different data groups based on their estimated influence on the validation loss.

Finally, a fine-tuning step ensures precise item identification for deployment.

4.2. Core Methodology In-depth

The overall framework of MTGRec is shown in Figure 1.

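The image is a schematic of the multi-identifier item tokenization process and its application in generative recommender systems. It shows how multi-identifier tokenization via RQ-VAE, the data curriculum, and the generative recommender interact, and how training data is tokenized and processed to optimize recommendation results.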

Figure 1: The overall framework of the proposed approach.

4.2.1. Problem Formulation

Let $V$ denote the set of all items. A user's historical interaction sequence is $S = [v_1, \dots, v_t]$ , ordered chronologically. The goal of sequential recommendation is to predict the next item $v_{t+1}$ .

In generative recommendation, an item tokenizer $T$ maps each item $v$ to a token sequence (its identifier) $[c_1, \dots, c_H] = T(v)$, where $c_h$ is the $h$-th token and $H$ is the identifier length. Thus, the historical sequence $S$ is transformed into $X = T(S) = [c_1^1, c_2^1, \dots, c_{H-1}^t, c_H^t]$, and the target item $v_{t+1}$ becomes $Y = T(v_{t+1}) = [c_1^{t+1}, \dots, c_H^{t+1}]$. The next-item prediction is then reframed as a sequence-to-sequence problem of autoregressively generating $Y$:

$ P(\boldsymbol{Y}|\boldsymbol{X}) = \prod_{h=1}^{H} P(c_h^{t+1} | \boldsymbol{X}, c_1^{t+1}, \dots, c_{h-1}^{t+1}) $

This formula describes the probability of generating the entire target item identifier $\boldsymbol{Y}$ given the input sequence $\boldsymbol{X}$ . It's a product of conditional probabilities, where each token $c_h^{t+1}$ is generated based on the input $\boldsymbol{X}$ and all previously generated tokens $c_1^{t+1}, \dots, c_{h-1}^{t+1}$ for the current target item.
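As a small illustration of this formulation, the sketch below flattens an item sequence into a token sequence and evaluates the autoregressive factorization in log space. The tokenizer table, the item names, and the per-token probabilities are all made up for illustration; a trained recommender would supply the conditional probabilities.

```python
import math

# Hypothetical toy tokenizer: each item maps to an identifier of H = 3 tokens.
TOKENIZER = {
    "item_a": (3, 17, 42),
    "item_b": (3, 20, 7),
    "item_c": (9, 17, 5),
}

def tokenize_sequence(tokenizer, items):
    """Flatten an item sequence S into the token sequence X = T(S)."""
    return [token for item in items for token in tokenizer[item]]

def sequence_log_prob(step_probs):
    """log P(Y|X) = sum over h of log P(c_h | X, c_1..c_{h-1})."""
    return sum(math.log(p) for p in step_probs)

X = tokenize_sequence(TOKENIZER, ["item_a", "item_b"])  # history tokens
Y = list(TOKENIZER["item_c"])                           # target identifier
# Made-up conditional probabilities for the three tokens of Y.
log_p = sequence_log_prob([0.5, 0.8, 0.25])
```

Here `X` has $t \cdot H = 6$ tokens, and `math.exp(log_p)` recovers the product of the three conditional probabilities. Working in log space avoids underflow for longer identifiers.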

4.2.2. Multi-Identifier Item Tokenization

This module aims to tokenize a single user interaction sequence into multiple token sequences by using multiple item tokenizers.

4.2.2.1. Tokenizer Backbone

The paper implements the item tokenizer using a Residual-Quantized Variational AutoEncoder (RQ-VAE) [53].

RQ-VAE Process:

Input: An item's semantic embedding $\boldsymbol{z}$ (e.g., obtained from a pre-trained language model like Sentence-T5).
Encoding: The RQ-VAE encoder maps $\boldsymbol{z}$ to a latent representation $\boldsymbol{r}$ .
Quantization: $\boldsymbol{r}$ is quantized into serialized codes (tokens) from coarse to fine through $H$ quantization levels. Each level $h$ has a codebook $C^h = \{\boldsymbol{e}_k^h\}_{k=1}^K$ , where $\boldsymbol{e}_k^h$ is the $k$ -th code vector in the $h$ -th codebook and $K$ is the codebook size.
Residual Quantization Mechanism: At each level $h$, the closest code vector $\boldsymbol{e}_{c_h}^h$ in the codebook $C^h$ is found for the current residual vector $\boldsymbol{r}_h$. The difference (residual) is then passed to the next level. $ c_h = \underset{k}{\arg\min} ||\boldsymbol{r}_h - \boldsymbol{e}_k^h||_2^2 $ $ \boldsymbol{r}_{h+1} = \boldsymbol{r}_h - \boldsymbol{e}_{c_h}^h $ Here, $c_h$ is the index of the chosen code for level $h$. $\boldsymbol{r}_h$ is the residual vector at level $h$, with $\boldsymbol{r}_1 = \boldsymbol{r}$ (the initial latent representation). The process is sequential: $\boldsymbol{r}_{h+1}$ is the residual left after quantizing $\boldsymbol{r}_h$ and removing the selected code $\boldsymbol{e}_{c_h}^h$.
Reconstruction: The final item quantized representation is $\tilde{\boldsymbol{r}} = \sum_{h=1}^H \boldsymbol{e}_{c_h}^h$ . This $\tilde{\boldsymbol{r}}$ is then decoded to reconstruct the original item embedding $\hat{\boldsymbol{z}}$ .
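The quantization steps above can be sketched as follows. This is a minimal sketch of residual quantization only, assuming the latent vector has already been produced by an encoder; the two small codebooks and the function name `residual_quantize` are illustrative, not taken from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def residual_quantize(r, codebooks):
    """Quantize latent vector r through H codebooks, coarse to fine.

    Returns the code indices [c_1, ..., c_H] and the quantized
    representation, i.e. the sum of the selected code vectors.
    """
    residual = list(r)
    codes = []
    quantized = [0.0] * len(r)
    for codebook in codebooks:
        # c_h = argmin_k ||r_h - e_k^h||^2
        c = min(range(len(codebook)),
                key=lambda k: squared_distance(residual, codebook[k]))
        codes.append(c)
        chosen = codebook[c]
        quantized = [q + e for q, e in zip(quantized, chosen)]
        # r_{h+1} = r_h - e_{c_h}^h
        residual = [x - e for x, e in zip(residual, chosen)]
    return codes, quantized

# Illustrative codebooks with K = 3 codes each and H = 2 levels.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],        # level 1 (coarse)
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],  # level 2 (fine)
]
codes, r_tilde = residual_quantize([1.2, 0.8], codebooks)
```

In this toy case the coarse level picks the code nearest to `[1.2, 0.8]`, and the fine level then corrects the remaining residual of roughly `[0.2, -0.2]`, so the returned `codes` form the item's two-token identifier.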

RQ-VAE Loss Function: The overall loss for the RQ-VAE is $\mathcal{L}_{\mathrm{T}} = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{rq}}$ .

Reconstruction Loss: $\mathcal{L}_{\mathrm{recon}} = ||\boldsymbol{z} - \hat{\boldsymbol{z}}||_2^2$ . This term ensures that the RQ-VAE can accurately reconstruct the input semantic embedding.
Quantization Loss: $\mathcal{L}_{\mathrm{rq}} = \sum_{h=1}^H ||\mathrm{sg}[\boldsymbol{r}_h] - \boldsymbol{e}_{c_h}^h||_2^2 + \beta ||\boldsymbol{r}_h - \mathrm{sg}[\boldsymbol{e}_{c_h}^h]||_2^2$.
- $\mathrm{sg}[\cdot]$ denotes the stop-gradient operation, meaning gradients do not flow through this part during backpropagation.
- The first term $||\mathrm{sg}[\boldsymbol{r}_h] - \boldsymbol{e}_{c_h}^h||_2^2$ encourages the code vectors $\boldsymbol{e}_{c_h}^h$ to move towards the residual vectors $\boldsymbol{r}_h$ (codebook learning).
- The second term $\beta ||\boldsymbol{r}_h - \mathrm{sg}[\boldsymbol{e}_{c_h}^h]||_2^2$ encourages the encoder output $\boldsymbol{r}_h$ to commit to the code vectors $\boldsymbol{e}_{c_h}^h$ (encoder learning).
- $\beta$ is a hyperparameter (typically 0.25) to balance these two objectives.
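To make the loss concrete, the sketch below evaluates its forward value on toy numbers. It relies on the fact that stop-gradient leaves forward values unchanged, so both quantization terms reduce to the same squared distance per level, and it assumes both terms are summed over all levels. The vectors and the function name `rqvae_loss_value` are illustrative assumptions; no gradients are computed here.

```python
def squared_norm(a, b):
    """Squared L2 distance ||a - b||^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rqvae_loss_value(z, z_hat, residuals, chosen_codes, beta=0.25):
    """Forward value of L_T = L_recon + L_rq.

    Stop-gradient only changes which parameters receive gradients, so in
    the forward pass both L_rq terms equal ||r_h - e_{c_h}^h||^2 per level.
    """
    recon = squared_norm(z, z_hat)
    codebook_term = sum(squared_norm(r, e) for r, e in zip(residuals, chosen_codes))
    commitment_term = beta * codebook_term
    return recon + codebook_term + commitment_term

# Toy values: r_2 = r_1 - e_1 = [0.5, 0.0], consistent with residual quantization.
loss = rqvae_loss_value(
    z=[1.0, 2.0], z_hat=[1.0, 1.5],
    residuals=[[1.0, 1.0], [0.5, 0.0]],
    chosen_codes=[[0.5, 1.0], [0.5, 0.5]],
)
```

With these numbers the reconstruction term is 0.25, the codebook term is 0.5, and the commitment term is 0.25 × 0.5 = 0.125. In an autodiff framework the two quantization terms would differ in where gradients flow, which this value-only sketch does not capture.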

4.2.2.2. Semantically Relevant Tokenizers

To obtain multiple item tokenizers that are not extraneous, MTGRec proposes using model checkpoints from adjacent epochs during a single RQ-VAE training process.

These checkpoints are derived from the same initialization parameters and iterative gradient descent, ensuring minimal disparities between their codebooks.
This approach yields token sequences that are related yet distinct, embodying homogeneous knowledge without severe semantic conflicts.
Formally, the set of semantically relevant item tokenizers is: $ \mathcal{T} = \{\mathrm{T}_1, \mathrm{T}_2, \dots, \mathrm{T}_n\} = \{\mathrm{T}_{\phi^{N-n+1}}, \mathrm{T}_{\phi^{N-n+2}}, \dots, \mathrm{T}_{\phi^N}\} $ Here, $\mathcal{T}$ is the set of $n$ item tokenizers. $\mathrm{T}_{\phi^i}$ denotes the RQ-VAE model with parameters $\phi^i$ (a checkpoint from the $i$-th epoch). $N$ is the maximum number of training epochs for the RQ-VAE. The selected tokenizers are from the final $n$ epochs.
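The index arithmetic in this definition is easy to get off by one, so here is a short sketch of which epochs are selected. The function name and the example values of $N$ and $n$ are assumptions for illustration, not settings reported in the paper.

```python
def select_tokenizer_epochs(max_epochs, num_tokenizers):
    """Return the epochs N-n+1, ..., N whose checkpoints serve as tokenizers."""
    if not 1 <= num_tokenizers <= max_epochs:
        raise ValueError("num_tokenizers must be between 1 and max_epochs")
    return list(range(max_epochs - num_tokenizers + 1, max_epochs + 1))

# Illustrative values: N = 100 training epochs, n = 3 tokenizers.
epochs = select_tokenizer_epochs(max_epochs=100, num_tokenizers=3)
```

For these values the selected checkpoints come from epochs 98, 99, and 100, so $\mathrm{T}_1$ corresponds to the earliest of the final $n$ epochs and $\mathrm{T}_n$ to the last one.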

4.2.2.3. Tokenize an Item Sequence to Multiple Token Sequences

With the set of semantically relevant item tokenizers $\mathcal{T}$, a single historical item sequence $S$ and target item $v_{t+1}$ can be tokenized into multiple pairs of token sequences: $ X_1, X_2, \dots, X_n = \mathrm{T}_1(S), \mathrm{T}_2(S), \dots, \mathrm{T}_n(S) $ $ Y_1, Y_2, \dots, Y_n = \mathrm{T}_1(v_{t+1}), \mathrm{T}_2(v_{t+1}), \dots, \mathrm{T}_n(v_{t+1}) $ where $X_i$ and $Y_i$ are the tokenized historical sequence and target item identifier, respectively, generated by tokenizer $\mathrm{T}_i$. During pre-training, instead of using all augmented sequences simultaneously (which could be computationally prohibitive for large $n$), only one token sequence is sampled at a time for model optimization. The selection of which tokenizer's data group to sample is dynamically adjusted using curriculum learning.
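A minimal sketch of this augmentation follows, under the assumption that each tokenizer can be represented as a lookup table from item to identifier. The two tables below are hypothetical stand-ins for adjacent-epoch RQ-VAE checkpoints, chosen so that most tokens agree and a few differ.

```python
import random

# Hypothetical identifiers produced by n = 2 adjacent-epoch tokenizers.
TOKENIZERS = [
    {"item_a": (3, 17, 42), "item_b": (3, 20, 7), "item_c": (9, 17, 5)},
    {"item_a": (3, 17, 40), "item_b": (3, 21, 7), "item_c": (9, 17, 5)},
]

def build_data_groups(tokenizers, history, target):
    """Tokenize one (S, v_{t+1}) pair with every tokenizer: [(X_1, Y_1), ...]."""
    groups = []
    for tok in tokenizers:
        x = [c for item in history for c in tok[item]]
        y = list(tok[target])
        groups.append((x, y))
    return groups

def sample_group(groups, probs, rng):
    """Draw a single (X_i, Y_i) pair according to the sampling probabilities."""
    index = rng.choices(range(len(groups)), weights=probs, k=1)[0]
    return index, groups[index]

groups = build_data_groups(TOKENIZERS, ["item_a", "item_b"], "item_c")
# Equal probabilities here; the curriculum would adjust them between stages.
index, (x, y) = sample_group(groups, probs=[0.5, 0.5], rng=random.Random(0))
```

Only one pair is drawn per training example, which keeps the per-step cost the same as single-tokenizer training while still exposing the model to all $n$ views over time.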

4.2.3. Curriculum Recommender Pre-training

This module addresses how to effectively train the generative recommender using the hybrid data generated by multiple item tokenizers. It employs a curriculum learning strategy that dynamically adjusts sampling probabilities based on data influence.

4.2.3.1. Estimating Data Influence

The concept of "useful" data is quantified by its data influence [7, 28, 48], defined as the contribution of training data to the validation loss. This is estimated using first-order gradient approximation.

Validation Loss Update: The change in validation loss can be approximated by a first-order Taylor expansion: $ \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^{t+1}) = \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) + \nabla \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) \cdot (\boldsymbol{\theta}^{t+1} - \boldsymbol{\theta}^t) $ The update of validation loss is: $ \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^{t+1}) - \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) = \nabla \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) \cdot (\boldsymbol{\theta}^{t+1} - \boldsymbol{\theta}^t) $ Here, $\mathcal{D}_{\mathrm{val}}$ is the held-out validation data, and $\boldsymbol{\theta}^t$ represents the recommender parameters at time step $t$.

Gradient of Validation Data: The validation data is acquired using a leave-one-out strategy (the last item for test, second-to-last for validation). After tokenization by different tokenizers, multiple groups of token sequence data are mixed into $\mathcal{D}_{\mathrm{val}}$. The mean loss and cumulative gradient over all validation data are: $ \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}) = \frac{1}{|\mathcal{D}_{\mathrm{val}}|} \sum_{X, Y \in \mathcal{D}_{\mathrm{val}}} \mathcal{L}(X, Y ; \boldsymbol{\theta}) $ $ \nabla \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}) = \frac{1}{|\mathcal{D}_{\mathrm{val}}|} \sum_{X, Y \in \mathcal{D}_{\mathrm{val}}} \nabla \mathcal{L}(X, Y ; \boldsymbol{\theta}) $ where $X, Y$ is a pair of token sequences for a historical interaction and target item. $\mathcal{L}(\cdot, \cdot ; \boldsymbol{\theta})$ is the negative log-likelihood loss (Eqn. 22).

Adam Gradients of Training Data: The generative recommender is trained using the Adam optimizer [18]. The parameter update $\boldsymbol{\theta}^{t+1} - \boldsymbol{\theta}^t$ can be expressed as: $ \boldsymbol{\theta}^{t+1} - \boldsymbol{\theta}^t = - \eta_t \Gamma(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t) $ where $\eta_t$ is the learning rate at time step $t$. $\Gamma(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t)$ represents the Adam-specific gradient for the training data $\mathcal{D}_{\mathrm{train}}^i$ tokenized by $\mathrm{T}_i$. The Adam update for $\Gamma$ involves first-order momentum $\boldsymbol{m}$ and second-order momentum $\boldsymbol{v}$: $ \Gamma(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t) = \frac{\boldsymbol{m}^{t+1}}{\sqrt{\boldsymbol{v}^{t+1} + \epsilon}} $ $ \boldsymbol{m}^{t+1} = (\beta_1 \boldsymbol{m}^t + (1 - \beta_1) \nabla \mathcal{L}(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t)) / (1 - \beta_1^t) $ $ \boldsymbol{v}^{t+1} = (\beta_2 \boldsymbol{v}^t + (1 - \beta_2) \nabla \mathcal{L}(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t)^2) / (1 - \beta_2^t) $ Here, $\mathcal{D}_{\mathrm{train}}^i$ is the training token sequence data generated by item tokenizer $\mathrm{T}_i$. $\beta_1$ (typically 0.9) and $\beta_2$ (typically 0.999) are Adam hyperparameters. $\epsilon$ is a small constant for numerical stability. $\nabla \mathcal{L}(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t)$ is the gradient of the loss for the data group $\mathcal{D}_{\mathrm{train}}^i$. The paper considers a group of data from each tokenizer as an entirety, calculating its gradient through gradient accumulation.
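A single-step sketch of the Adam direction $\Gamma$ follows, written element-wise to mirror the formulas above as stated in this analysis. The function name and the toy gradient are assumptions; a real optimizer keeps the raw moment estimates across steps, which this one-step illustration does not attempt to reproduce.

```python
import math

def adam_direction(grad, m_prev, v_prev, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Compute Gamma = m / sqrt(v + eps) element-wise for one time step.

    `step` is the 1-based time step t used in the bias-correction terms.
    """
    gamma = []
    for g, m, v in zip(grad, m_prev, v_prev):
        m_hat = (beta1 * m + (1 - beta1) * g) / (1 - beta1 ** step)
        v_hat = (beta2 * v + (1 - beta2) * g * g) / (1 - beta2 ** step)
        gamma.append(m_hat / math.sqrt(v_hat + eps))
    return gamma

# Toy accumulated gradient of one data group, starting from zero moments.
grad = [0.5, -2.0]
gamma = adam_direction(grad, m_prev=[0.0, 0.0], v_prev=[0.0, 0.0], step=1)
```

At the first step with zero moments, the bias correction cancels the decay factors, so each component of `gamma` is close to the sign of the gradient. This shows why the Adam direction can differ noticeably from the raw gradient in scale.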

Calculate Influence: The data influence of each item tokenizer $\mathrm{T}_i$ at time step $t$ is defined as: $ \mathrm{I}(\mathrm{T}_i ; \boldsymbol{\theta}^t) = \eta_t \nabla \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) \cdot \Gamma(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t) $ This formula essentially measures how much the gradient of the validation loss aligns with the Adam-normalized gradient from the training data of tokenizer $\mathrm{T}_i$ . A positive influence implies that the training step (driven by $\mathrm{T}_i$ 's data) is reducing the validation loss. Since training spans multiple time steps, the cumulative influence is calculated: $ \tilde{\mathrm{I}}(\mathrm{T}_i) = \sum_{k=1}^K \mathrm{I}(\mathrm{T}_i ; \boldsymbol{\theta}_k) $ where $\boldsymbol{\theta}_k$ is the model checkpoint at time step $t_k$ , and $K$ is the total number of checkpoints considered for influence estimation.
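The sign convention follows from a first-order Taylor expansion: $\mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^{t+1}) - \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) \approx \nabla \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) \cdot (\boldsymbol{\theta}^{t+1} - \boldsymbol{\theta}^t)$, which equals the negative of the influence once the Adam update is substituted. The sketch below illustrates the dot product and its accumulation over checkpoints; the function names and the toy vectors are hypothetical and stand in for flattened model gradients.

```python
import numpy as np

def influence(eta, grad_val, gamma):
    """First-order estimate of how much one step on a data group lowers validation loss."""
    return eta * float(np.dot(grad_val, gamma))

def cumulative_influence(etas, grad_vals, gammas):
    """Sum the per-checkpoint influence over K checkpoints."""
    return sum(influence(e, g, d) for e, g, d in zip(etas, grad_vals, gammas))

grad_val = np.array([1.0, 0.0])        # toy validation gradient
gamma_aligned = np.array([0.5, 2.0])   # update direction aligned with grad_val
gamma_opposed = np.array([-0.5, 2.0])  # update direction opposed to grad_val

i_pos = influence(0.1, grad_val, gamma_aligned)
i_neg = influence(0.1, grad_val, gamma_opposed)
total = cumulative_influence([0.1, 0.1], [grad_val, grad_val],
                             [gamma_aligned, gamma_opposed])
```

Here the aligned direction gets a positive score and the opposed one a negative score, so the two cancel when accumulated.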

4.2.3.2. Curriculum Pre-training

The training process is divided into multiple stages. At the end of each stage, the data sampling probabilities are updated based on the cumulative data influence.

Cumulative Influence Update: For stage $k$ , the cumulative influence of tokenizer $\mathrm{T}_i$ is updated: $ \tilde{\mathrm{I}}_k(\mathrm{T}_i) = \tilde{\mathrm{I}}_{k-1}(\mathrm{T}_i) + \mathrm{I}(\mathrm{T}_i ; \boldsymbol{\theta}_k) $ Initially, all data groups are sampled with equal probability ( $\tilde{\mathrm{I}}_0(\mathrm{T}_i) = 0$ ).
Sampling Probability Update: The sampling probability $p_i^k$ for tokenizer $\mathrm{T}_i$ in the subsequent stage is calculated using a softmax-like function: $ p_i^k = \frac{e^{\tilde{\mathrm{I}}_k(\mathrm{T}_i) / \tau}}{\sum_{j=1}^n e^{\tilde{\mathrm{I}}_k(\mathrm{T}_j) / \tau}} $ Here, $\tau$ is the temperature coefficient that controls the smoothness of the distribution. A smaller $\tau$ makes the distribution sharper, favoring high-influence tokenizers more strongly. A larger $\tau$ makes it flatter, approaching uniform sampling.
Data Sampling Strategy: For stage $k+1$ , a tokenizer $\mathrm{T}$ is sampled from the set $\mathcal{T}$ according to the probabilities $P(\mathrm{T} = \mathrm{T}_i) = p_i^k$ . Then, the user interaction sequence $S$ and target item $v_{t+1}$ are tokenized using the sampled tokenizer $\mathrm{T}$ to get $X = \mathrm{T}(S)$ and $Y = \mathrm{T}(v_{t+1})$ .
Model Optimization: The sampled token sequence data (X, Y) is then fed into the generative recommender for optimization using the negative log-likelihood loss: $ \mathcal{L}(X, Y) = - \sum_{h=1}^H \log P(c_h^{t+1} | X, c_1^{t+1}, \dots, c_{h-1}^{t+1}) $ This is the standard loss for autoregressive sequence generation, where the model tries to maximize the likelihood of generating the correct target item identifier $Y$ given the historical sequence $X$ .
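The sampling-probability update and the tokenizer sampling step above can be sketched as follows. This is an illustrative sketch only: `sampling_probs`, `sample_tokenizer`, and the influence values are hypothetical, and the maximum is subtracted before exponentiation purely for numerical stability (it does not change the resulting probabilities).

```python
import math
import random

def sampling_probs(cum_influence, tau):
    """Temperature softmax over cumulative influences (max-shifted for stability)."""
    top = max(cum_influence)
    weights = [math.exp((x - top) / tau) for x in cum_influence]
    total = sum(weights)
    return [w / total for w in weights]

def sample_tokenizer(probs, rng):
    """Draw one tokenizer index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

cum = [0.2, 0.1, -0.1]                              # hypothetical cumulative influences
sharp = sampling_probs(cum, tau=0.1)                # small tau: peaked distribution
flat = sampling_probs(cum, tau=10.0)                # large tau: close to uniform
uniform = sampling_probs([0.0, 0.0, 0.0], tau=1.0)  # stage 0: all influences are zero
idx = sample_tokenizer(sharp, random.Random(0))
```

The zero-influence case reproduces the equal-probability initialization, and comparing `sharp` with `flat` shows the effect of $\tau$ on how strongly high-influence tokenizers are favored.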

4.2.4. Fine-tuning and Inference

4.2.4.1. Fine-tuning for Item Identification

After pre-training with multiple item tokenizers, the generative recommender might not have a strict one-to-one mapping between an item and a single identifier (since an item could have been associated with multiple identifiers during pre-training). For practical deployment, accurate item identification is crucial. Therefore, the pre-trained model is fine-tuned using each item tokenizer separately. The model configuration that achieves the best validation performance is then selected for actual deployment and testing. This step re-establishes a clear mapping from each generated identifier to a unique item.

4.2.4.2. Inference

During inference, the objective is to generate the top-K items for recommendation.

Decoding: Beam search is used to decode $K$ token sequences. Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. In autoregressive generation, it keeps track of the $B$ (beam size) most probable partial sequences at each step, until complete sequences are generated.
Mapping: The generated token sequences are then mapped back to their corresponding items.
Efficiency: Unlike some prior works, MTGRec does not use a prefix tree to constrain the search, because such a constraint hinders parallel decoding and reduces efficiency.
Invalid Identifiers: Invalid identifiers (token sequences that do not map to any known item), which occur rarely, are simply ignored.
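The mapping and invalid-identifier handling described above can be illustrated with a small sketch. It assumes the beam search has already produced token sequences sorted by descending score; the function name `top_k_items`, the lookup table, and the toy identifiers are hypothetical.

```python
def top_k_items(beams, identifier_to_item, k):
    """Map beam-search outputs to items, skipping identifiers that match no item.

    beams: token tuples sorted by descending score (hypothetical beam output)
    """
    items = []
    for tokens in beams:
        item = identifier_to_item.get(tuple(tokens))
        if item is None:  # invalid identifier: ignore it
            continue
        items.append(item)
        if len(items) == k:
            break
    return items

identifier_to_item = {(1, 2, 3, 0): "item_a", (4, 5, 6, 0): "item_b"}
beams = [(1, 2, 3, 0), (9, 9, 9, 9), (4, 5, 6, 0)]  # the second beam is invalid
recommended = top_k_items(beams, identifier_to_item, k=2)
```

Because invalid sequences are dropped after decoding rather than prevented during it, the decoding loop itself stays unconstrained and batch-friendly.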

5. Experimental Setup

5.1. Datasets

The experiments were conducted on three subsets of the latest Amazon 2023 review dataset [12]:

"Musical Instruments" (referred to as Instrument)
"Industrial and Scientific" (referred to as Scientific)
"Video Games" (referred to as Game)

These datasets contain user review data spanning from May 1996 to September 2023.

Preprocessing Steps (following [32, 59]):

Low-activity users and items with fewer than five interaction records were filtered out.
Historical item sequences were grouped by users and sorted in chronological order.
A maximum sequence length limit of 20 items was applied.
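The three preprocessing steps above can be sketched as follows. Two details are assumptions not stated in this section: the filter is applied repeatedly until every remaining user and item meets the threshold, and truncation keeps the most recent items. The function name and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

def preprocess(interactions, min_count=5, max_len=20):
    """interactions: iterable of (user, item, timestamp) tuples."""
    data = list(interactions)
    while True:  # repeat until all remaining users and items have >= min_count records
        users = Counter(u for u, _, _ in data)
        items = Counter(i for _, i, _ in data)
        kept = [(u, i, t) for u, i, t in data
                if users[u] >= min_count and items[i] >= min_count]
        if len(kept) == len(data):
            break
        data = kept
    sequences = defaultdict(list)
    for u, i, _ in sorted(data, key=lambda r: r[2]):  # chronological order
        sequences[u].append(i)
    # Assumption: truncation keeps the most recent max_len items.
    return {u: seq[-max_len:] for u, seq in sequences.items()}

toy = [("u1", "a", 1), ("u1", "b", 2), ("u1", "a", 3),
       ("u2", "a", 1), ("u2", "b", 2), ("u3", "c", 1)]
result = preprocess(toy, min_count=2, max_len=2)
```

In the toy run, user `u3` and item `c` fall below the threshold and are removed, and `u1`'s three-item history is cut to its last two items.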

The detailed statistics of the preprocessed datasets are presented in Table 1.

The following are the results from Table 1 of the original paper:

Dataset	#Users	#Items	#Interactions	Sparsity	Avg.len
Instrument	57,439	24,587	511,836	99.964%	8.91
Scientific	50,985	25,848	412,947	99.969%	8.10
Game	94,762	25,612	814,586	99.966%	8.60

Characteristics:

Domain: E-commerce product reviews.
Scale: Moderate number of users, items, and interactions.
Sparsity: All datasets exhibit very high sparsity (over 99.96%), which is typical for real-world recommender systems and highlights the challenge of predicting relevant items.
Average Length (Avg.len): Indicates the average number of interactions per user sequence, around 8-9 items.
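As a quick consistency check, the Sparsity and Avg.len columns can be recomputed from the other columns of Table 1, assuming the usual definitions: sparsity is one minus the fraction of observed user-item pairs, and average length is interactions per user. The helper names are hypothetical.

```python
def sparsity(num_users, num_items, num_interactions):
    """Share of the user-item matrix that has no observed interaction."""
    return 1.0 - num_interactions / (num_users * num_items)

def avg_len(num_users, num_interactions):
    """Average number of interactions per user."""
    return num_interactions / num_users

stats = {  # (#Users, #Items, #Interactions) from Table 1
    "Instrument": (57439, 24587, 511836),
    "Scientific": (50985, 25848, 412947),
    "Game": (94762, 25612, 814586),
}
summary = {name: (round(100 * sparsity(u, i, n), 3), round(avg_len(u, n), 2))
           for name, (u, i, n) in stats.items()}
```

Under these definitions the recomputed values agree with the reported ones (for example, 99.964% and 8.91 for Instrument).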

These datasets are widely used benchmarks in sequential recommendation, making them suitable for validating the general effectiveness and scalability of the proposed method. Their high sparsity also makes them challenging, providing a good test for the model's ability to handle low-frequency items.

5.2. Evaluation Metrics

The model performance in sequential recommendation is evaluated using top-K Recall and Normalized Discounted Cumulative Gain (NDCG). $K$ is set to 5 and 10.

5.2.1. Recall@K

Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved (recommended) among the top $K$ recommendations. It focuses on how many of the actual preferred items appear within the recommended list, without considering their rank.

Mathematical Formula: For a single user $u$ : $ \mathrm{Recall@K}_u = \frac{|\mathrm{RecommendedItems}_u \cap \mathrm{RelevantItems}_u|}{|\mathrm{RelevantItems}_u|} $ And then averaged over all users: $ \mathrm{Recall@K} = \frac{1}{|U|} \sum_{u \in U} \mathrm{Recall@K}_u $ In the leave-one-out setting, where each user has exactly one relevant item, this reduces to the hit rate. Writing $R_u = \mathrm{RecommendedItems}_u$ and $T_u = \mathrm{RelevantItems}_u$ : $ \mathrm{Recall@K} = \frac{\sum_{u \in U} \mathbb{I}(R_u \cap T_u \ne \emptyset)}{\sum_{u \in U} |T_u|} $

Symbol Explanation:

$U$ : The set of all users.
$u$ : A specific user.
$\mathrm{RecommendedItems}_u$ : The set of top $K$ items recommended to user $u$ .
$\mathrm{RelevantItems}_u$ : The set of items that are actually relevant to user $u$ (e.g., the next item they interacted with in the test set). In a leave-one-out scenario, $|\mathrm{RelevantItems}_u|=1$ .
$|\cdot|$ : Denotes the cardinality (number of elements) of a set.
$\cap$ : Set intersection operator.
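A minimal sketch of the per-user and averaged Recall@K defined above; the function names and the toy rankings are hypothetical.

```python
def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of relevant items that appear in the top-k of ranked_items."""
    top_k = set(ranked_items[:k])
    return len(top_k & set(relevant_items)) / len(relevant_items)

def mean_recall_at_k(rankings, targets, k):
    """Average per-user Recall@K; under leave-one-out each target set has one item."""
    return sum(recall_at_k(r, t, k) for r, t in zip(rankings, targets)) / len(rankings)

rankings = [["a", "b", "c"], ["x", "y", "z"]]
targets = [{"b"}, {"z"}]
score = mean_recall_at_k(rankings, targets, k=2)  # user 1 is a hit, user 2 a miss
```

With a single relevant item per user, each per-user value is either 0 or 1, so the average is simply the share of users whose target appears in the top $K$.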

5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)

Conceptual Definition: NDCG@K is a measure of ranking quality, especially useful when items have varying degrees of relevance. It considers the position of relevant items in the recommended list; relevant items appearing higher in the list contribute more to the score. It penalizes relevant items that appear lower. The "Normalized" part ensures that scores are comparable across different query results by dividing by the Ideal DCG.

Mathematical Formula: For a single user $u$ : $ \mathrm{DCG@K}_u = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}(i)} - 1}{\log_2(i+1)} $ $ \mathrm{IDCG@K}_u = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_{\mathrm{ideal}}(i)} - 1}{\log_2(i+1)} $ $ \mathrm{NDCG@K}_u = \frac{\mathrm{DCG@K}_u}{\mathrm{IDCG@K}_u} $ And then averaged over all users: $ \mathrm{NDCG@K} = \frac{1}{|U|} \sum_{u \in U} \mathrm{NDCG@K}_u $

Symbol Explanation:

$U$ : The set of all users.
$u$ : A specific user.
$K$ : The number of top recommendations considered.
$i$ : The rank position of an item in the recommended list.
$\mathrm{rel}(i)$ : The relevance score of the item at rank $i$ in the recommended list. Typically, for implicit feedback, this is 1 if the item is relevant (e.g., the actual next item) and 0 otherwise.
$\mathrm{rel}_{\mathrm{ideal}}(i)$ : The relevance score of the item at rank $i$ in the ideal recommended list (where all relevant items are ranked as high as possible). In leave-one-out evaluation, the ideal list would have relevance 1 at rank 1 and 0 elsewhere.
$\mathrm{DCG@K}_u$ : Discounted Cumulative Gain for user $u$ at rank $K$ .
$\mathrm{IDCG@K}_u$ : Ideal Discounted Cumulative Gain for user $u$ at rank $K$ (the maximum possible DCG for that user).
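A minimal sketch of NDCG@K with binary relevance, where the gain $2^{\mathrm{rel}(i)} - 1$ is 1 for a relevant item and 0 otherwise. The function name and toy inputs are hypothetical.

```python
import math

def ndcg_at_k(ranked_items, relevant_items, k):
    """NDCG@K for one user with binary relevance."""
    relevant = set(relevant_items)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked_items[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)  # ideal list puts all relevant items first
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

at_rank_1 = ndcg_at_k(["b", "a", "c"], {"b"}, k=2)
at_rank_2 = ndcg_at_k(["a", "b", "c"], {"b"}, k=2)
missed = ndcg_at_k(["a", "c", "b"], {"b"}, k=2)
```

In leave-one-out evaluation IDCG is 1, so the score is $1/\log_2(r+1)$ when the target sits at rank $r \le K$ and 0 otherwise; a hit at rank 2 therefore scores $1/\log_2 3 \approx 0.631$.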

Evaluation Strategy:
Leave-one-out strategy: For each user's interaction sequence:
- The final item is used as the test data.
- The second most recent item is used as the validation data.
- All other preceding items are used for training.
Full ranking evaluation: The models predict over the entire item set, rather than a sampled subset, which is a more rigorous comparison.
Beam size: For all generative recommendation models, the beam size for autoregressive decoding is set to 50.
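The leave-one-out split can be sketched for a single user sequence as below. The exact composition of each history is an assumption: here the validation target is predicted from all earlier items, and the test target from all items before it (including the validation item). The function name is hypothetical.

```python
def leave_one_out(sequence):
    """Split one chronologically ordered item sequence into train/valid/test parts."""
    if len(sequence) < 3:
        raise ValueError("need at least three interactions")
    train = sequence[:-2]                  # everything before the validation item
    valid = (sequence[:-2], sequence[-2])  # (history, target)
    test = (sequence[:-1], sequence[-1])   # (history, target)
    return train, valid, test

train, valid, test = leave_one_out([1, 2, 3, 4, 5])
```

The five-interaction filter in preprocessing guarantees that every retained user has enough items for all three parts.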

5.3. Baseline Models

The paper compares MTGRec against two groups of baseline models:

5.3.1. Traditional Sequential Recommendation Models

These models typically use item IDs and collaborative filtering information, with some incorporating item features.

Caser [39]: Leverages convolutional neural networks (CNNs) to capture spatial and positional patterns in user behavior sequences.
HGN [24]: Uses feature-level and instance-level gating mechanisms to model user preferences.
GRU4Rec [10]: Employs Gated Recurrent Units (GRUs) to capture sequential patterns in user interactions.
BERT4Rec [36]: Utilizes a bi-directional self-attentive model with a mask prediction objective for sequence modeling, similar to BERT.
SASRec [16]: Adopts a unidirectional self-attention network for user behavior modeling, a Transformer-based approach.
FMLP-Rec [60]: Proposes an all-MLP model with learnable filters to reduce noise and model user preferences.
HSTU [54]: Hierarchical Sequential Transducers that incorporate user actions and timestamps for next item prediction. It is an ID-based method.
FDSA [55]: Introduces a dual-stream self-attention framework that independently models item-level and feature-level sequences for recommendation. This model is notable for incorporating item textual features.
S³-Rec [59]: Enhances sequential recommendation models by leveraging feature-item correlations as self-supervised signals.

5.3.2. Generative Recommendation Models

These models transform next-item prediction into autoregressive generation of item identifiers.

TIGER [32]: Employs RQ-VAE to quantize item embeddings into semantic IDs (token sequences) and uses a generative retrieval paradigm for sequential recommendation.
LETTER [42]: Extends TIGER by integrating collaborative and diversity regularization into RQ-VAE to improve the quality of item tokenization.
TIGER++ [32]: An improved version of TIGER that employs representation whitening and exponential moving average (EMA) techniques to enhance codebook learning and improve the quality of semantic IDs. The paper states that MTGRec learns its RQ-VAE with the same techniques as TIGER++ for fair comparison.

5.4. Implementation Details

5.4.1. Item Tokenizer (for MTGRec and baselines)

Semantic Embeddings: Sentence-T5 [25] is used to encode textual information associated with each item (e.g., title, description) into its semantic embedding.
RQ-VAE Structure: An RQ-VAE model with 3 codebooks of size 256, plus an extra codebook for collision handling, is used.
- Codebook dimension is set to 128.
- Encoder/Decoder: A deeper MLP with hidden layer sizes [2048, 1024, 512, 256] is used for the encoder and decoder, following previous studies [23, 58].
Enhancements (similar to TIGER++):
- PCA with representation whitening [35] is applied to enhance the quality of item semantic embeddings.
- Exponential moving averages (EMA) [41] are used instead of gradient descent for codebook learning, for stability and effectiveness.
Training: Optimized by Adagrad optimizer for 10K epochs, with a learning rate of 0.001 and a batch size of 2048.
Tokenizer Selection for MTGRec: RQ-VAE checkpoints from the final $n$ epochs are selected as semantically relevant item tokenizers. $n$ is tuned between 5 and 30 with an interval of 5.
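To make the tokenizer configuration concrete, the residual quantization step of an RQ-VAE with 3 codebooks of size 256 and dimension 128 can be sketched as below. This covers only the quantizer: the encoder/decoder MLPs, EMA codebook learning, and the extra collision-handling code are omitted, and the codebooks are random stand-ins rather than learned ones. The function name is hypothetical.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Return one code index per level by greedily quantizing the running residual."""
    residual = z.copy()
    tokens = []
    for codebook in codebooks:  # each codebook has shape (size, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))          # nearest codeword at this level
        tokens.append(idx)
        residual = residual - codebook[idx]  # pass the remainder to the next level
    return tokens, residual

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 128)) for _ in range(3)]  # random, not learned
z = codebooks[0][7].copy()  # toy latent that coincides with a level-1 codeword
tokens, residual = residual_quantize(z, codebooks)
reconstruction = sum(cb[t] for cb, t in zip(codebooks, tokens))
```

Each item's identifier is the resulting index sequence, so two checkpoints of the same RQ-VAE can assign the same item slightly different sequences when their codebooks differ.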

5.4.2. Generative Recommender

Backbone: T5 [31] model.
Model Architecture:
- Model dimension: 128
- Inner dimension: 512
- Attention heads: 4, with a dimension of 64 each.
- Activation function: ReLU.
Model Layers: The number of model layers $L$ (both encoder and decoder layers) is tuned within {1, 2, 3, 4, 5, 6, 7, 8}.
Pre-training:
- Batch size: 256 on each GPU (total 4 GPUs used).
- Epochs: 200 epochs on all datasets.
- Curriculum Learning Schedule:
  - 60 epochs for gradient feature warmup.
  - Sampling probability update every 20 epochs thereafter.
- Temperature coefficient $\tau$ : Tuned in {0.1, 0.3, 1.0, 3.0, 5.0, 10.0}.
- Optimizer: AdamW [18] with a learning rate of 0.005.
- Learning rate scheduler: Cosine scheduler.
Fine-tuning:
- Optimizer: AdamW with a learning rate of 0.0002.
- Learning rate scheduler: Cosine scheduler.
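The curriculum schedule above can be read as a list of epochs at which sampling probabilities are refreshed. The sketch below shows one plausible reading, in which the first update happens right after the 60-epoch warmup and then every 20 epochs; whether the paper counts the boundaries exactly this way is an assumption, as is the global batch size (which presumes plain data parallelism over the 4 GPUs). All names are hypothetical.

```python
def probability_update_epochs(total_epochs=200, warmup_epochs=60, interval=20):
    """Epochs after which sampling probabilities are refreshed (one plausible reading)."""
    return list(range(warmup_epochs, total_epochs, interval))

GLOBAL_BATCH_SIZE = 256 * 4  # per-GPU batch size times number of GPUs (assumption)
updates = probability_update_epochs()
num_stages = len(updates) + 1  # the warmup stage plus one stage per update
```

Under this reading, uniform sampling is used for the first 60 epochs and the distribution is then revised seven times before training ends.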

5.4.3. Baselines Setup

Traditional Models: Implemented using RecBole [56]. Embedding dimension set to 128. Hyperparameters tuned via grid search.
Generative Baselines (TIGER, LETTER, TIGER++): Used the same T5 model architecture as MTGRec. $L$ (number of layers) tuned from 1 to 8.

6. Results & Analysis

6.1. Core Results Analysis

The paper presents a comprehensive comparison of MTGRec against both traditional and generative recommendation baselines on three public datasets. The overall results are summarized in Table 2.

The following are the results from Table 2 of the original paper:

Methods	Instrument				Scientific				Game
Methods	Recall@5	Recall@10	NDCG@5	NDCG@10	Recall@5	Recall@10	NDCG@5	NDCG@10	Recall@5	Recall@10	NDCG@5	NDCG@10
Caser	0.0241	0.0386	0.0151	0.0197	0.0159	0.0257	0.0101	0.0132	0.0330	0.0553	0.0209	0.0281
HGN	0.0321	0.0517	0.0202	0.0265	0.0212	0.0351	0.0131	0.0176	0.0424	0.0687	0.0271	0.0356
GRU4Rec	0.0324	0.0501	0.0209	0.0266	0.0202	0.0338	0.0129	0.0173	0.0499	0.0799	0.0320	0.0416
BERT4Rec	0.0307	0.0485	0.0195	0.0252	0.0186	0.0296	0.0119	0.0155	0.0460	0.0735	0.0298	0.0386
SASRec	0.0333	0.0523	0.0213	0.0274	0.0259	0.0412	0.0150	0.0199	0.0535	0.0847	0.0331	0.0438
FMLP-Rec	0.0339	0.0536	0.0218	0.0282	0.0269	0.0422	0.0155	0.0204	0.0528	0.0857	0.0338	0.0444
HSTU	0.0343	0.0577	0.0191	0.0271	0.0271	0.0429	0.0147	0.0198	0.0578	0.0903	0.0334	0.0442
FDSA	0.0347	0.0545	0.0230	0.0293	0.0262	0.0421	0.0169	0.0213	0.0544	0.0852	0.0361	0.0448
S³-Rec	0.0317	0.0496	0.0199	0.0257	0.0263	0.0418	0.0171	0.0219	0.0485	0.0769	0.0315	0.0406
TIGER	0.0370	0.0564	0.0244	0.0306	0.0264	0.0422	0.0175	0.0226	0.0559	0.0868	0.0366	0.0467
LETTER	0.0372	0.0580	0.0246	0.0313	0.0279	0.0435	0.0182	0.0232	0.0563	0.0877	0.0372	0.0473
TIGER++	0.0380	0.0588	0.0249	0.0316	0.0289	0.0450	0.0190	0.0241	0.0580	0.0914	0.0377	0.0485
MTGRec	0.0413	0.0635	0.0275	0.0346	0.0322	0.0506	0.0212	0.0271	0.0621	0.0956	0.0410	0.0517
Improve	+8.68%	+7.99%	+10.44%	+9.49%	+11.42%	+12.44%	+11.58%	+12.45%	+7.07%	+4.60%	+8.75%	+6.60%
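The last row of the table is consistent with a relative improvement of MTGRec over the best baseline in each column (TIGER++ in every case here). A small check, with a hypothetical helper name, on three of the columns:

```python
def relative_improvement(ours, best_baseline):
    """Relative gain over the best baseline, in percent."""
    return 100.0 * (ours - best_baseline) / best_baseline

instrument_r5 = round(relative_improvement(0.0413, 0.0380), 2)   # Recall@5, Instrument
scientific_r10 = round(relative_improvement(0.0506, 0.0450), 2)  # Recall@10, Scientific
game_r10 = round(relative_improvement(0.0956, 0.0914), 2)        # Recall@10, Game
```

These reproduce the reported +8.68%, +12.44%, and +4.60%.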

Observations:

Traditional vs. Generative Models:
- FDSA, which incorporates item textual features, generally outperforms other traditional ID-based models (Caser, HGN, GRU4Rec, BERT4Rec, SASRec, FMLP-Rec, HSTU). This highlights the benefit of enriching item representations with semantic information beyond just IDs.
- Generative recommendation models (TIGER, LETTER, TIGER++) generally outperform traditional sequential recommendation models. This validates the advantage of the generative paradigm and item identifiers that carry semantics.
Generative Baselines Comparison:
- LETTER and TIGER++ show better performance than TIGER. This is attributed to their improvements in item tokenizer quality (LETTER with collaborative/diversity regularization, TIGER++ with representation whitening/EMA). These techniques enhance the semantic quality of the generated item tokens.
MTGRec's Superiority:
- MTGRec consistently achieves the optimal performance across all three datasets and all evaluation metrics (Recall@5, Recall@10, NDCG@5, NDCG@10).
- It shows substantial improvements over both traditional and generative baseline models. For example, on the Scientific dataset, MTGRec shows improvements of over 11% in Recall@5 and NDCG@5 compared to TIGER++. On Instrument, NDCG@5 improves by over 10%.
- This strong performance validates MTGRec's core hypothesis: multi-identifier item tokenization combined with curriculum recommender pre-training significantly enhances the generative recommender's scalability and effectiveness by providing larger and more diverse sequence data.

6.2. Ablation Study

To understand the contribution of each component of MTGRec, an ablative analysis was performed on the Instrument and Scientific datasets. The variants are compared against the full MTGRec model.

The following are the results from Table 3 of the original paper:

Methods		Instrument				Scientific
Methods		Recall@5	Recall@10	NDCG@5	NDCG@10	Recall@5	Recall@10	NDCG@5	NDCG@10
(0)	MTGRec	0.0413	0.0635	0.0275	0.0346	0.0322	0.0506	0.0212	0.0271
	w/o Data curriculum	0.0406	0.0618	0.0268	0.0338	0.0312	0.0487	0.0205	0.0263
	w/o Relevant tokenizers	0.0350	0.0548	0.0226	0.0290	0.0249	0.0404	0.0158	0.0208
	w/o Pre-training	0.0380	0.0571	0.0247	0.0309	0.0285	0.0443	0.0181	0.0236

Analysis of Variants:

w/o Data curriculum: This variant removes the curriculum learning component; data from different item tokenizers are sampled with equal probability.
- Result: Performance is worse than full MTGRec across all metrics and datasets (e.g., Recall@10 on Instrument drops from 0.0635 to 0.0618).
- Conclusion: The data curriculum scheme, guided by data influence estimation, is effective. Dynamically prioritizing "useful" data groups helps the model learn more efficiently and effectively from the diverse augmented data.
w/o Relevant tokenizers: This variant uses multiple item tokenizers initialized with different random parameters, making them irrelevant and extraneous.
- Result: This variant shows a significant performance degradation, performing even worse than w/o Pre-training (e.g., Recall@10 on Instrument drops from 0.0635 to 0.0548).
- Conclusion: This emphasizes the importance of semantically relevant tokenizers. Using unrelated tokenizers creates semantic conflicts in the training data, leading to model learning collapse. The proposed method of using RQ-VAE checkpoints from adjacent epochs is crucial for maintaining semantic consistency while introducing diversity.
w/o Pre-training: This variant represents a baseline where the generative recommender is trained only on data from a single item tokenizer (equivalent to TIGER++). It does not use the augmented data or the pre-training strategy.
- Result: This variant performs worse than MTGRec (e.g., Recall@10 on Instrument is 0.0571 vs 0.0635 for MTGRec).
- Conclusion: Pre-training on augmented sequence data from multiple semantically relevant item tokenizers is a critical element for MTGRec's effectiveness. The multi-identifier scheme provides the necessary volume and diversity for improved performance.
  
  These ablation studies clearly demonstrate that both multi-identifier item tokenization (with semantically relevant tokenizers) and the data curriculum pre-training scheme are indispensable for MTGRec's superior performance.

6.3. Further Analysis

6.3.1. Performance Comparison w.r.t. Model Scale

This analysis investigates how the number of encoder and decoder layers in the generative recommender (model scale) impacts performance when combined with MTGRec's data augmentation.

Figure 2: Performance Comparison w.r.t. Model Scale. The x-axis coordinates are the number of encoder and decoder layers in the generative recommender, up to 8 layers. The chart compares Recall@10 of MTGRec, TIGER++, and TIGER on the three datasets (Instrument, Scientific, and Game); MTGRec performs best at every layer count and reaches its highest recall on the Game dataset.

Observations:

MTGRec consistently outperforms TIGER and TIGER++ across all model scales (number of layers). This confirms MTGRec's overall effectiveness.
For baseline models (TIGER, TIGER++), performance initially improves with scale at shallow layers, but can degrade due to overfitting once the model becomes too large (e.g., 4 or 5 layers), suggesting that the available data is too limited to fully utilize larger models.
MTGRec generally shows an upward trend in performance as the model scales, indicating that its multi-identifier item tokenization provides more massive and diverse data that larger models can benefit from.
Limitation: The paper acknowledges that this positive correlation is constrained compared to LLMs which scale to 100B parameters. The current method of augmenting data, while effective, might not generate sufficiently diverse data to fully utilize models beyond a certain scale, especially when RQ-VAE checkpoints from very distant epochs might lose semantic relevance. This points to a potential trade-off between data quality (semantic relevance) and quantity/diversity.

6.3.2. Performance Comparison w.r.t. Tokenizer Number

This analysis examines how the number of item tokenizers $n$ (i.e., the number of RQ-VAE checkpoints selected for pre-training) affects recommendation performance at different model scales (3-layer and 6-layer generative recommenders).

Figure 3: Performance Comparison w.r.t. Tokenizer Number on the Instrument and Scientific datasets. The chart shows how the number of tokenizers affects Recall@10 for the 3-layer and 6-layer models; on Instrument, the 3-layer model's recall peaks at 15 tokenizers, while on Scientific the 6-layer model is comparatively stable.

Observations:

Fewer Tokenizers: Using fewer tokenizers (e.g., $n=5$ ) offers marginal advancement. This is because the volume and diversity of augmented sequence data are insufficient for effective deep model optimization.
Excessive Tokenizers: Using an excessively large number of tokenizers also leads to suboptimal performance. This is hypothesized to occur when the epoch intervals between RQ-VAE checkpoints become too large, which weakens the semantic relevance between tokenizers or even introduces semantic conflicts. This highlights a crucial trade-off in MTGRec: balancing data volume and semantic relevance.
Optimal Number: There exists an optimal number of tokenizers for each model scale.
Model Scale Influence: Larger models (e.g., 6-layer) benefit from more extensive and diverse sequence data and thus generally have a higher optimal number of tokenizers than smaller models (e.g., 3-layer). This reinforces the idea that data augmentation is more beneficial for larger-capacity models.

6.3.3. Performance Comparison w.r.t. Temperature Coefficient

The temperature coefficient $\tau$ in Eqn. (18) (sampling probability update) controls the smoothness of the sampling distribution for data groups. This analysis explores its impact.

Figure 4: Performance Comparison w.r.t. Temperature Coefficient on the Instrument and Scientific datasets. The x-axis is the temperature coefficient $\tau$ ; the y-axes show NDCG@10 and Recall@10, both of which vary with $\tau$ .

Observations:

An appropriate value of $\tau$ is crucial for MTGRec's performance.
Smaller $\tau$ : Makes the model more inclined towards high-probability item tokenizers (i.e., those with high cumulative influence). This can lead to over-specialization on a few "best" data groups and neglect potentially useful but lower-influence data.
Larger $\tau$ : Causes the data curriculum to degenerate into uniform sampling, meaning all data groups are sampled with roughly equal probability, regardless of their influence. This negates the benefit of curriculum learning.
Optimal $\tau$ : The optimal values for $\tau$ are found to be 3 (Instrument) and 1 (Scientific). Both extremes (very small or very large $\tau$ ) negatively affect the effectiveness of curriculum pre-training. This demonstrates the importance of tuning $\tau$ to balance exploration of different data groups with exploitation of high-influence ones.

6.3.4. Performance Comparison w.r.t. Long-tail Items

A key motivation for MTGRec is to improve recommendations for long-tail items. This analysis evaluates its performance across different item popularity groups.

Figure 5: Performance Comparison w.r.t. Long-tail Items on the Instrument and Scientific datasets. The bar graph illustrates the number of interactions in the test data for each group, while the line chart displays the improvement ratios for Recall@10 in comparison to TIGER.

Observations:

MTGRec consistently outperforms the baseline model (TIGER) across all item groups (defined by interaction counts in the test data). The line chart shows positive improvement ratios for Recall@10 across all groups.
Significant Improvement for Long-tail Items: MTGRec shows superior performance and more significant improvement for unpopular (long-tail) items, specifically in the group [0, 20) interactions. For example, the improvement ratio in this group is notably higher than for more popular items.
Conclusion: This phenomenon directly supports the paper's hypothesis that multi-identifier item tokenization benefits long-tail items. By associating items with multiple identifiers, their tokens gain increased exposure frequency and incorporate more knowledge from shared tokens across different semantic contexts. This helps the model learn more robust representations for items that are rarely seen in their original one-to-one mapping.

6.3.5. Applying MTGRec on Other Generative Recommendation Methods

The paper claims MTGRec can be seamlessly integrated into other generative recommendation methods that use a trainable item tokenizer. This section verifies its general applicability.

The following are the results from Table 4 of the original paper:

Methods	Instrument		Scientific
Methods	Recall@10	NDCG@10	Recall@10	NDCG@10
TIGER	0.0568	0.0307	0.0423	0.0225
+MTGRec	0.0598	0.0329	0.0465	0.0245
LETTER	0.0580	0.0313	0.0435	0.0232
+MTGRec	0.0614	0.0335	0.0481	0.0255
TIGER++	0.0588	0.0316	0.0450	0.0241
+MTGRec	0.0635	0.0346	0.0506	0.0271

Observations:

The results show that applying +MTGRec (i.e., integrating MTGRec's multi-identifier pre-training strategy) consistently improves the performance of TIGER, LETTER, and TIGER++ across both Instrument and Scientific datasets for both Recall@10 and NDCG@10.
General Applicability: This verifies the general applicability and portability of the MTGRec approach. The idea of generating semantically relevant sequence data from RQ-VAE checkpoints from adjacent epochs creates homogeneous knowledge that can enhance various generative recommender backbones.

6.3.6. Multiple Identifier Difference Analysis

This section analyzes the relevance and differences of item identifiers generated by item tokenizers from different training epochs. Two metrics are used: proportion of items whose first token changed and proportion of items with any token changes.

The following are the results from Table 5 of the original paper:

Intervals	Instrument		Scientific		Game
Intervals	First	Any	First	Any	First	Any
1	0.39%	13.58%	0.27%	11.4%	0.36%	9.36%
5	0.44%	21.26%	0.58%	22.54%	0.58%	21.22%
10	0.51%	29.75%	0.51%	30.68%	0.57%	30.43%
20	0.75%	44.09%	0.71%	47.33%	0.79%	47.42%
30	0.87%	54.94%	0.85%	58.29%	1.14%	59.95%

Observations:

Adjacent Tokenizers (Interval 1): For tokenizers from adjacent epochs, the item identifiers show minimal changes. Only a small percentage of items (e.g., 0.39% on Instrument) have their first token changed, and a moderate percentage (e.g., 13.58% on Instrument) have any token changes. This indicates strong semantic consistency between adjacent checkpoints.
Increasing Interval: As the epoch interval between tokenizers increases, the number of item identifiers that change (both first token and any token) significantly increases. For an interval of 30, around 55-60% of items have any token changes. This suggests that larger intervals can lead to more distinct identifiers, which aligns with the finding in Section 6.3.2 about the trade-off between diversity and semantic relevance.
Preservation of Core Semantics: Despite significant changes in any token, the proportion of changes in the first token stays at or below about 1% even for large intervals (the maximum is 1.14%, on Game at interval 30). This is important because the first token often encodes the most coarse-grained or dominant semantic information of an item. Its stability suggests that the core semantic knowledge is largely preserved across these semantically relevant tokenizers, preventing severe semantic conflicts even as finer details of the identifiers evolve.
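The two metrics in Table 5 can be computed from the identifiers that two tokenizer checkpoints assign to the same items. The sketch below is illustrative: the function name and the toy identifiers are hypothetical, and identifiers are represented as token tuples.

```python
def change_rates(ids_a, ids_b):
    """ids_a, ids_b: dicts mapping item -> token tuple from two tokenizers.

    Returns (share of items whose first token differs,
             share of items with any differing token).
    """
    items = ids_a.keys() & ids_b.keys()
    first = sum(ids_a[i][0] != ids_b[i][0] for i in items)
    any_change = sum(ids_a[i] != ids_b[i] for i in items)
    return first / len(items), any_change / len(items)

epoch_a = {"i1": (3, 17, 200), "i2": (3, 18, 5), "i3": (9, 1, 1), "i4": (9, 2, 7)}
epoch_b = {"i1": (3, 17, 200), "i2": (3, 18, 6), "i3": (8, 1, 1), "i4": (9, 2, 7)}
first_rate, any_rate = change_rates(epoch_a, epoch_b)
```

In this toy case one of four items changes its first token and two of four change somewhere, mirroring the pattern in the table where "Any" grows much faster than "First".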

6.3.7. Efficiency Analysis

This section analyzes the computational efficiency of MTGRec compared to baselines, specifically focusing on training time and epochs for convergence.

The following are the results from Table 6 of the original paper:

Methods	Instrument		Scientific		Game
Methods	Time	Epoch	Time	Epoch	Time	Epoch
TIGER	1.33 h	186	1.04 h	184	2.19 h	253
TIGER++	1.22 h	178	1.02 h	187	2.23 h	264
MTGRec	1.41 h	209	1.21 h	217	2.11 h	248

Observations:

MTGRec's training time is comparable to TIGER and TIGER++. For instance, on Instrument, MTGRec takes 1.41 h compared to TIGER's 1.33 h and TIGER++'s 1.22 h. On Game, MTGRec is slightly faster than TIGER and TIGER++.
The number of epochs for convergence for MTGRec is also in a similar range to the baselines.
Efficiency Conclusion: The multi-identifier pre-training strategy introduced by MTGRec does not introduce excessive training time costs despite augmenting the data and adding a curriculum learning scheme. This is a crucial finding, as it means the significant performance gains of MTGRec do not come at the expense of impractical training duration.
Curriculum Learning Effect: The curriculum learning scheme even appears to accelerate model convergence on the Game dataset (248 epochs for MTGRec vs. 253 for TIGER and 264 for TIGER++), suggesting its efficiency benefits in some cases.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces MTGRec, a novel framework for generative recommender pre-training that addresses the limitations of one-to-one item tokenization. By proposing multi-identifier item tokenization, MTGRec associates each item with multiple semantically relevant identifiers, leveraging RQ-VAE checkpoints from adjacent training epochs. This strategy significantly augments the token sequence data, creating multiple data groups with related yet distinct semantic distributions. To effectively learn from this augmented data, MTGRec incorporates a curriculum recommender pre-training scheme that dynamically adjusts the sampling probabilities of these data groups based on data influence estimation using first-order gradient approximation. Finally, the pre-trained model undergoes fine-tuning with a single tokenizer to ensure precise item identification for deployment. Extensive experiments on three real-world datasets consistently demonstrate MTGRec's superior performance and scalability over both traditional and existing generative recommendation baselines, particularly benefiting long-tail items and without incurring excessive training costs.

7.2. Limitations & Future Work

The authors acknowledge the following limitations and suggest future research directions:

Model Scaling Constraints: While MTGRec shows improved scalability compared to baselines, the positive correlation between model scale and performance is currently constrained, unlike in LLMs that scale to billions of parameters. This limitation might stem from the difficulty of generating sufficiently diverse data while maintaining semantic relevance: RQ-VAE checkpoints separated by too many training epochs yield more varied identifiers but are no longer semantically relevant to one another.
Generalized Recommendation Scenarios: The current work focuses on sequential recommendation. Future work could adapt multi-identifier item tokenization to more generalized scenarios, such as transferable recommendation (applying models across different domains or tasks) and multi-domain recommendation (recommending items from multiple catalogs simultaneously).
Further Model Scaling: The authors plan to investigate the scaling effect when the model parameters are increased to the billion level, pushing the boundaries of generative recommenders.

7.3. Personal Insights & Critique

This paper presents a highly innovative and practical solution to a fundamental limitation in generative recommendation. The idea of using RQ-VAE checkpoints from adjacent epochs is particularly clever; it's a simple yet effective way to generate semantically relevant variations of item representations without needing to train multiple independent tokenizers or complex data augmentation schemes. This method ensures that the augmented data maintains a crucial level of coherence, which is essential for effective learning.
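To make the idea concrete, here is a toy sketch of how one interaction sequence becomes several token sequences. It replaces the RQ-VAE checkpoints with plain lookup tables, which is an assumption made purely for illustration; the item names, code values, and function names are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# A "tokenizer" here is a lookup table from item ID to a tuple of codes.
# In the paper this role is played by an RQ-VAE checkpoint.
Tokenizer = Dict[str, Tuple[int, ...]]

def tokenize_sequence(items: Sequence[str], tokenizer: Tokenizer) -> List[int]:
    """Flatten the per-item identifiers into a single token sequence."""
    tokens: List[int] = []
    for item in items:
        tokens.extend(tokenizer[item])
    return tokens

def build_data_groups(items: Sequence[str],
                      tokenizers: Sequence[Tokenizer]) -> List[List[int]]:
    """Tokenize the same item sequence once per tokenizer (one data group each)."""
    return [tokenize_sequence(items, t) for t in tokenizers]

# Two hypothetical checkpoints from adjacent epochs: most codes agree,
# a few differ, so the identifiers are related but not identical.
ckpt_a: Tokenizer = {"guitar": (3, 17, 42), "strings": (3, 18, 7)}
ckpt_b: Tokenizer = {"guitar": (3, 17, 40), "strings": (3, 21, 7)}

groups = build_data_groups(["guitar", "strings"], [ckpt_a, ckpt_b])
```

Each entry of `groups` is the same user history expressed with a different identifier set, which is the sense in which the method augments the token sequence data without changing the underlying interactions.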

Key Strengths:

Elegant Data Augmentation: The multi-identifier tokenization using RQ-VAE checkpoints is an elegant solution to the data diversity and sparsity problem for long-tail items. It leverages the inherent evolution of a tokenizer during training to create meaningful variations.
Effective Curriculum Learning: The data curriculum scheme based on data influence estimation is well-integrated. It intelligently guides the model towards more impactful data, optimizing the learning process and contributing to overall performance gains.
Strong Experimental Validation: The comprehensive experiments, including ablation studies and analyses on model scale, tokenizer number, temperature coefficient, and long-tail items, provide robust evidence for the proposed method's effectiveness. The demonstration of its applicability to other generative recommendation methods further strengthens its value.
Practical Applicability: The solution does not introduce excessive computational overhead, making it practical for real-world deployment, especially given the fine-tuning step to ensure accurate item identification.

Potential Issues/Areas for Improvement:

Defining "Semantically Relevant": While the paper argues for adjacent epoch checkpoints, the exact boundaries for semantic relevance (e.g., how far apart epochs can be before tokenizers become "extraneous") could be further explored theoretically or empirically. The multiple identifier difference analysis offers some insight, but a more formal definition or a dynamic selection mechanism for $n$ could be beneficial.
Generalizability of Influence Estimation: The first-order gradient approximation for data influence is a heuristic. While effective, its robustness across different recommender architectures or more complex datasets could be investigated. More sophisticated influence estimation techniques might offer further gains.
Scaling beyond Current Limits: The acknowledged limitation regarding extreme model scaling suggests that the current augmentation strategy, while good, might still not match the data demands of truly LLM-scale recommenders. Future work on generating even more diverse and semantically coherent token sequences (perhaps by incorporating external knowledge or more advanced generative techniques) would be crucial for breaking this barrier.
Interpretability of Token Identifiers: While RQ-VAE creates semantic tokens, the interpretability of these multi-identifiers and how specific token changes relate to item features could be an interesting area for deeper analysis.

Transferability and Future Value: The core idea of breaking the one-to-one mapping for item representation is highly transferable. This framework could be applied to:

Other domains: Any domain using item identifiers (e.g., knowledge graph entities, genomic sequences) could potentially benefit from multi-identifier representation to improve robustness and handle long-tail entities.
Different generative tasks: The multi-identifier concept could extend beyond recommendation to other generative tasks where discrete entities need rich, varied representations.
Data-centric AI: This work exemplifies a data-centric approach, where improving the data (augmenting its diversity and quality) is key to unlocking model performance, rather than solely focusing on model architecture.

Overall, MTGRec represents a significant step forward in making generative recommender systems more robust and scalable, particularly for the challenging long-tail problem. Its innovations provide a strong foundation for future research in data augmentation and curriculum learning within generative AI for recommendation.
