Pre-training Generative Recommender with Multi-Identifier Item Tokenization
TL;DR Summary
The MTGRec framework enhances generative recommender pre-training through multi-identifier item tokenization, using RQ-VAE for multiple identifier association and a curriculum learning scheme to improve semantic modeling for low-frequency items and token diversity.
Abstract
Generative recommendation autoregressively generates item identifiers to recommend potential items. Existing methods typically adopt a one-to-one mapping strategy, where each item is represented by a single identifier. However, this scheme poses issues, such as suboptimal semantic modeling for low-frequency items and limited diversity in token sequence data. To overcome these limitations, we propose MTGRec, which leverages Multi-identifier item Tokenization to augment token sequence data for Generative Recommender pre-training. Our approach involves two key innovations: multi-identifier item tokenization and curriculum recommender pre-training. For multi-identifier item tokenization, we leverage the RQ-VAE as the tokenizer backbone and treat model checkpoints from adjacent training epochs as semantically relevant tokenizers. This allows each item to be associated with multiple identifiers, enabling a single user interaction sequence to be converted into several token sequences as different data groups. For curriculum recommender pre-training, we introduce a curriculum learning scheme guided by data influence estimation, dynamically adjusting the sampling probability of each data group during recommender pre-training. After pre-training, we fine-tune the model using a single tokenizer to ensure accurate item identification for recommendation. Extensive experiments on three public benchmark datasets demonstrate that MTGRec significantly outperforms both traditional and generative recommendation baselines in terms of effectiveness and scalability.
1. Bibliographic Information
1.1. Title
Pre-training Generative Recommender with Multi-Identifier Item Tokenization
1.2. Authors
Bowen Zheng, Zhongfu Chen, Enze Liu, Zhongrui Ma, Yue Wang, Wayne Xin Zhao, Ji-Rong Wen. Their affiliations primarily include Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China, and Poisson Lab, Huawei, Beijing, China. This indicates a collaboration between academia and industry, common in cutting-edge AI research.
1.3. Journal/Conference
Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25). SIGIR is a premier international conference in the field of information retrieval, including recommender systems. Its publication signifies high-quality, impactful research in the domain.
1.4. Publication Year
2025
1.5. Abstract
Generative recommendation systems aim to recommend items by autoregressively generating their identifiers. The main issue with existing methods is their reliance on a one-to-one mapping between an item and a single identifier. This approach leads to suboptimal semantic modeling, especially for low-frequency items (items that appear rarely in interaction data), and limits the diversity of token sequence data (the sequences of item identifiers used for training).
To address these problems, the paper proposes MTGRec, a framework that uses Multi-identifier item Tokenization to augment token sequence data specifically for Generative Recommender pre-training. MTGRec introduces two core innovations:
- Multi-identifier item tokenization: It uses RQ-VAE (Residual-Quantized Variational AutoEncoder) as the tokenizer backbone. Instead of relying on a single tokenizer, it treats model checkpoints from adjacent training epochs of the RQ-VAE as semantically relevant tokenizers. This allows each item to be associated with multiple identifiers. Consequently, a single user interaction sequence can be transformed into several token sequences, forming different data groups.
- Curriculum recommender pre-training: A curriculum learning scheme guided by data influence estimation dynamically adjusts the sampling probability of each data group during the recommender's pre-training phase.
After the pre-training phase, the model is fine-tuned using a single tokenizer to ensure accurate item identification for actual recommendations. Extensive experiments on three public benchmark datasets demonstrate that MTGRec significantly outperforms both traditional and existing generative recommendation baselines in terms of effectiveness and scalability.
1.6. Original Source Link
https://arxiv.org/abs/2504.04400 The publication status is a preprint on arXiv, intended for presentation at SIGIR '25.
2. Executive Summary
2.1. Background & Motivation
Core Problem: Generative recommender systems typically use a one-to-one mapping strategy where each item is represented by a single identifier (a sequence of tokens). This rigid tokenization scheme faces two main issues:
- Suboptimal Semantic Modeling for Low-Frequency Items: When an item appears infrequently in user interactions (a low-frequency or long-tail item), its associated tokens are also low-frequency. The resulting lack of supervision signals (too few training examples) makes it difficult for the model to learn the item's semantics effectively.
- Limited Diversity in Token Sequence Data: The one-to-one mapping restricts the variety of token sequence data that can be generated from user interaction sequences. This lack of diversity can hinder generalization and limit performance gains as model size scales, in contrast to Large Language Models (LLMs), where increased data diversity and volume often lead to better performance.
Importance: Recommender systems are crucial for many online platforms. Improving recommendations for low-frequency items can raise user satisfaction by offering more diverse and personalized suggestions beyond popular items. Increasing data diversity is also key to unlocking the full potential of generative models in recommendation, allowing them to scale more effectively and achieve higher performance.
Paper's Entry Point/Innovative Idea: The paper's core idea is to break the one-to-one mapping constraint by associating each item with multiple identifiers. This multi-identifier scheme addresses the limitations by:
- Increasing Token Exposure: With more tokens per item, the tokens of low-frequency items gain exposure frequency, making it easier for the model to learn their semantics. It also promotes token sharing across items.
- Enriching Data Diversity: A single user interaction sequence can now be tokenized into multiple token sequences, significantly augmenting the volume and diversity of the training data. This enables better model scaling.
2.2. Main Contributions / Findings
The paper proposes MTGRec to overcome the limitations of one-to-one item tokenization in generative recommenders. Its primary contributions are:
- Novel Framework for Generative Recommenders: MTGRec introduces a novel framework that leverages multiple item tokenizers for curriculum recommender pre-training. This approach significantly improves the effectiveness and scalability of generative recommendation.
- Multi-Identifier Item Tokenization for Data Augmentation: The paper develops a multi-identifier item tokenization approach that uses RQ-VAE checkpoints from adjacent training epochs as semantically relevant item tokenizers. Each item thus has multiple identifiers, so a single user interaction sequence is augmented into several token sequences that serve as different data groups for training.
- Data Curriculum Scheme based on Influence Estimation: MTGRec introduces a data curriculum scheme for recommender training. This scheme uses first-order gradient approximation to estimate the data influence of each tokenizer's data group, allowing the sampling probabilities to be adjusted dynamically during pre-training so that the model learns more effectively from "useful" data.
- Demonstrated Superior Performance and Scalability: Extensive experiments on three public datasets ("Musical Instruments", "Industrial and Scientific", and "Video Games" from the Amazon 2023 review dataset) show that MTGRec significantly outperforms both traditional and existing generative recommendation baselines. The gains appear in effectiveness (Recall@K, NDCG@K), in scalability with model scale, and in performance on long-tail items.
These findings address the core problems of suboptimal semantic modeling for low-frequency items and limited data diversity, leading to more robust and powerful generative recommender systems.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following concepts:
- Recommender Systems: Systems designed to predict user preferences and suggest items (products, movies, articles, etc.) that a user might like.
- Sequential Recommendation: A sub-field of recommender systems that focuses on predicting the next item a user will interact with, given their historical sequence of interactions. It aims to capture dynamic user preferences and sequential patterns.
- Generative Models: A class of machine learning models that can generate new data instances that resemble the training data. Examples include Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). In recommendation, generative models aim to produce item identifiers.
- Item Tokenization: The process of converting an item (e.g., a movie, a product) into a sequence of discrete tokens, similar to how words are tokenized into subword units in natural language processing. This sequence of tokens acts as the item's identifier or semantic representation.
- Auto-regressive Generation: A generative process where each token in a sequence is generated conditioned on the previously generated tokens and the input context. This is common in LLMs for text generation and is adapted here for generating item identifiers.
- Transformer Architecture: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), which relies heavily on self-attention mechanisms. It has become the backbone for many state-of-the-art models in NLP, including LLMs, and is often used in generative recommenders.
  - Encoder-Decoder Architecture: A common Transformer variant where an encoder processes the input sequence and a decoder generates the output sequence. The T5 model used in this paper is an example.
  - Decoder-Only Architecture: Another Transformer variant, like GPT, which uses only the decoder part to generate sequences autoregressively.
- Residual-Quantized Variational AutoEncoder (RQ-VAE): A specific type of VAE that combines residual quantization with a VAE structure.
  - Variational Autoencoder (VAE): A generative model that learns a compressed, continuous latent representation of input data. It consists of an encoder (maps input to latent space) and a decoder (reconstructs input from latent space). The "variational" aspect involves learning the parameters of a probability distribution (e.g., mean and variance) for the latent space, rather than discrete points.
  - Quantization: The process of mapping continuous values to a finite set of discrete values (codes or tokens). In RQ-VAE, this is often done using codebooks.
  - Residual Quantization: An iterative process where an input is quantized, and then the residual (the difference between the original input and its quantized version) is quantized again, and so on. This allows for representing data with increasing levels of detail or using multiple "layers" of codes.
- Curriculum Learning: A training strategy where a model is trained on progressively more difficult or complex data samples. Inspired by human learning, it starts with easier examples to build foundational knowledge, then gradually introduces harder ones.
- Data Influence Estimation: Techniques used to quantify how much a specific training data point or group of data points affects the model's parameters or its performance on a validation set. First-order gradient approximation is one such method.
- Long-tail Distribution: In many real-world datasets, a small number of items are very popular (head), while a large number of items are rarely observed (tail). This creates a long-tail distribution. Recommending items from the "tail" is challenging due to data sparsity.
- Data Sparsity: A common problem in recommender systems where most users have interacted with only a tiny fraction of the available items, leading to very sparse user-item interaction matrices.
3.2. Previous Works
The paper categorizes related work into Sequential Recommendation and Generative Recommendation.
3.2.1. Sequential Recommendation
Traditional sequential recommendation models typically:
- Assign a unique ID to each item.
- Predict the next item by measuring similarity between user preference (learned from historical interactions) and candidate items.
- Often rely on Approximate Nearest Neighbor (ANN) algorithms for efficient retrieval.
Early Studies:
- Markov Chains (e.g., [9, 3]): Model item sequences by learning transition probabilities between items. For example, a first-order Markov chain assumes the next item depends only on the immediately preceding item. "Fusing Similarity Models with Markov Chains for Sparse Sequential Recommendation" [9] combines item similarity models with Markov chains to handle data sparsity.
Deep Learning-based Models:
- Recurrent Neural Networks (RNNs) (e.g., GRU4Rec [10, 38]): Use Gated Recurrent Units (GRUs) to capture sequential patterns.
- Convolutional Neural Networks (CNNs) (e.g., Caser [39]): Apply convolutional filters to extract local patterns from item sequences.
- Graph Neural Networks (GNNs) (e.g., [47, 50]): Model item-item transitions or user-item interactions as graphs to capture complex relationships.
- Transformers (e.g., SASRec [16], BERT4Rec [36]): Utilize self-attention mechanisms to weigh the importance of different items in a sequence.
  - Self-Attention Mechanism: A core component of Transformers. Given an input sequence of item embeddings, self-attention computes an output sequence in which each output vector is a weighted sum over all positions, with weights derived from the pairwise similarity between positions. The attention output is typically calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Here, $Q$ (Query), $K$ (Key), and $V$ (Value) are linear transformations of the input embeddings, $d_k$ is the dimension of the key vectors (used for scaling), and $QK^T$ measures the similarity between query and key vectors.
  - SASRec [16] uses a unidirectional self-attention network to model user behavior, where each item can only attend to previous items.
  - BERT4Rec [36] uses a bi-directional self-attentive model with a mask prediction objective, similar to BERT in NLP, where randomly masked items in a sequence are predicted.
- Other Advanced Models:
  - HGN [24]: Hierarchical Gating Networks use feature-level and instance-level gating mechanisms.
  - FMLP-Rec [60]: An all-MLP model with learnable filters for sequence modeling.
  - HSTU [54]: Hierarchical Sequential Transducers for next item prediction, incorporating user actions and timestamps.
  - FDSA [55]: Dual-stream self-attention framework modeling item-level and feature-level sequences.
  - $S^3$-Rec [59]: Enhances sequential recommendation with self-supervised signals from feature-item correlations.
Many of these approaches primarily use item IDs. Recent trends involve incorporating item content (textual features) for semantic embeddings to improve performance and generalization [11, 13, 20].
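The scaled dot-product attention formula above, combined with the unidirectional (causal) restriction described for SASRec, can be illustrated with a minimal pure-Python sketch. It is not taken from any of the cited models: the learned projections that produce Q, K, and V are omitted (the toy embeddings are used directly for all three), and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Scaled dot-product attention where position i attends only to positions <= i."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Similarity of query i with keys 0..i, scaled by sqrt(d_k).
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out.append([
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ])
    return out

# Toy input: three 2-dimensional item embeddings used directly as Q, K, and V
# (assumption: learned linear projections are left out for brevity).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Z = causal_attention(X, X, X)
```

Because the first position can only attend to itself, its output equals its own value vector; later positions mix information from everything before them. A bidirectional model in the style of BERT4Rec would drop the `range(i + 1)` restriction and attend over all positions.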
3.2.2. Generative Recommendation
This is an emerging paradigm where each item is represented by a list of tokens (an identifier). The task of next-item prediction becomes autoregressive generation of the target item's identifier. A critical component is item tokenization.
Item Tokenization Methods:
- Heuristic Approaches (e.g., [15, 34, 45]): Rely on manually defined rules or techniques.
  - Examples include time order [15], item clustering [34, 45], and matrix decomposition [15, 27]. These are simple but might miss implicit item relationships.
- Text-based Approaches (e.g., [5, 8, 14, 21]): Directly use item attributes (title, description) as identifiers, leveraging pre-trained language models.
  - These can suffer from inconsistent length, semantic ambiguity, and a lack of collaborative information.
- Codebook-based Approaches (e.g., TIGER [32], LETTER [42], [6, 29, 44]): Adopt learnable codebooks to quantize item embeddings into fixed-length, semantically rich identifiers.
  - RQ-VAE [53] is a commonly used backbone for this.
  - TIGER [32]: Employs RQ-VAE to quantize item embeddings into semantic IDs, using a generative retrieval paradigm.
  - LETTER [42]: Extends TIGER by integrating collaborative and diversity regularization into RQ-VAE to improve codebook learning.
  - TIGER++ [32]: Enhances TIGER by using representation whitening and exponential moving average (EMA) for better codebook learning and semantic ID quality. Representation whitening [35] aims to decorrelate features and equalize variance. EMA [41] provides more stable updates for codebook learning by averaging past states.
Generative Recommender Models:
- Usually employ decoder-only (like GPT [1, 30]) or encoder-decoder (like T5 [31]) architectures.
- Some studies enhance these with dual decoders [45] or contrastive learning [34].
3.3. Technological Evolution
The evolution in recommender systems has moved from traditional ID-based collaborative filtering methods, through deep learning models (RNNs, CNNs, Transformers) leveraging item IDs, to content-aware methods using semantic embeddings from pre-trained language models. The latest frontier is Generative Recommendation, which aims to reformulate recommendation as a sequence-to-sequence generation task. This paradigm shifts from retrieving items by their IDs to generating their semantic identifiers.
This paper fits into the codebook-based generative recommendation stream. It builds upon RQ-VAE for item tokenization and Transformer-based models (specifically T5) for generation. It tackles the fundamental limitations of prior generative recommendation methods, particularly the one-to-one mapping constraint.
3.4. Differentiation Analysis
Compared to existing generative recommendation methods (e.g., TIGER, LETTER, TIGER++), MTGRec introduces a fundamental shift in item tokenization.
- Core Difference: Previous methods use a strict one-to-one mapping (one item, one identifier). MTGRec proposes multi-identifier item tokenization (one item, multiple identifiers).
- Innovation 1: Multi-Identifier Item Tokenization: Instead of training a single RQ-VAE tokenizer or multiple independent ones, MTGRec leverages RQ-VAE model checkpoints from adjacent training epochs. This is key because these checkpoints are semantically relevant (they represent slightly different but related semantic views of the items), avoiding the semantic conflicts that would arise from completely independent tokenizers. This augmentation produces multiple data groups from a single user sequence.
- Innovation 2: Curriculum Recommender Pre-training: With these multiple semantically relevant data groups, MTGRec does not train uniformly. It introduces a data curriculum scheme guided by data influence estimation, dynamically adjusting the sampling probability of each data group to prioritize data that is more "useful" for training.
- Benefits: These innovations lead to larger and more diverse token sequence data for pre-training, which helps in:
  - Better Semantic Modeling: Especially for low-frequency items, as their tokens get more exposure and context.
  - Improved Scalability: Leveraging larger and more diverse data allows the generative recommender to scale more effectively, leading to better performance.
  - Enhanced Generalization: The diverse token sequences help the model learn more robust and generalizable representations.
In essence, MTGRec improves generative recommendation by augmenting the input data generation process and optimizing the training strategy for this augmented data, rather than focusing solely on the generative recommender architecture or on improving a single tokenizer, as LETTER and TIGER++ do.
4. Methodology
4.1. Principles
The core idea behind MTGRec is to enrich the training data for generative recommenders by associating each item with multiple identifiers, rather than a single one. This multi-identifier scheme aims to tackle the problems of suboptimal semantic modeling for low-frequency items and limited diversity in token sequence data. The theoretical basis or intuition is that by presenting the same item through slightly varied, yet semantically coherent, token representations, the model can learn more robust and generalized item semantics. Furthermore, increasing the volume and diversity of training data allows generative recommenders to benefit from model scaling trends observed in LLMs.
The framework operates in two main phases:
- Multi-Identifier Item Tokenization: This phase generates multiple token sequences for each user interaction sequence. It does so by creating several semantically relevant item tokenizers using RQ-VAE checkpoints from different training epochs.
- Curriculum Recommender Pre-training: This phase leverages the augmented data. Instead of uniform sampling, it uses a curriculum learning approach, dynamically adjusting the sampling probabilities of different data groups based on their estimated influence on the validation loss.
Finally, a fine-tuning step ensures precise item identification for deployment.
4.2. Core Methodology In-depth
The overall framework of MTGRec is shown in Figure 1.
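The image is a schematic diagram of the multi-identifier item tokenization process and its application to generative recommendation. It shows how RQ-VAE-based multi-identifier tokenization, the data curriculum, and the generative recommender interact, and how training data is tokenized and processed to optimize recommendation results.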
Figure 1: The overall framework of the proposed approach.
4.2.1. Problem Formulation
Let $\mathcal{V}$ denote the set of all items. A user's historical interaction sequence is $S = [v_1, v_2, \dots, v_t]$, ordered chronologically. The goal of sequential recommendation is to predict the next item $v_{t+1}$.
In generative recommendation, an item tokenizer $\mathrm{T}$ maps each item $v$ to a token sequence (its identifier) $[c_1, \dots, c_H]$, where $c_h$ is the $h$-th token and $H$ is the identifier length.
Thus, the historical sequence $S$ is transformed into $X = \mathrm{T}(S) = [c_1^1, c_2^1, \dots, c_{H-1}^t, c_H^t]$, and the target item becomes $Y = \mathrm{T}(v_{t+1}) = [c_1^{t+1}, \dots, c_H^{t+1}]$. The next-item prediction is then reframed as a sequence-to-sequence problem of autoregressively generating $Y$:
$ P(Y|X) = \prod_{h=1}^{H} P(c_h^{t+1} | X, c_1^{t+1}, \dots, c_{h-1}^{t+1}) $
This formula gives the probability of generating the entire target item identifier $Y$ given the input sequence $X$. It is a product of conditional probabilities, where each token $c_h^{t+1}$ is generated based on the input and all previously generated tokens of the current target item.
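To make the formulation concrete, the following minimal sketch flattens a toy interaction history into a token sequence and evaluates the autoregressive factorization in log space. The identifier table `item_to_tokens` and the per-step probabilities are made-up illustrative values, not outputs of any real tokenizer or model.

```python
import math

# Hypothetical tokenizer output: each item maps to an identifier of H = 3 tokens.
item_to_tokens = {"v1": [3, 17, 42], "v2": [3, 9, 8], "v3": [5, 17, 1]}

def tokenize_sequence(items):
    """Flatten the identifiers of a chronological item sequence into one token sequence X."""
    return [tok for item in items for tok in item_to_tokens[item]]

def sequence_log_prob(step_probs):
    """log P(Y|X) as the sum of per-token conditional log-probabilities."""
    return sum(math.log(p) for p in step_probs)

X = tokenize_sequence(["v1", "v2"])  # history S = [v1, v2]
Y = item_to_tokens["v3"]             # identifier of the target item v3

# Made-up conditional probabilities P(c_h | X, c_<h) for the three target tokens.
step_probs = [0.5, 0.4, 0.9]
log_p = sequence_log_prob(step_probs)
```

Summing log-probabilities is equivalent to multiplying the conditional probabilities and is the numerically safer form used in practice.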
4.2.2. Multi-Identifier Item Tokenization
This module aims to tokenize a single user interaction sequence into multiple token sequences by using multiple item tokenizers.
4.2.2.1. Tokenizer Backbone
The paper implements the item tokenizer using a Residual-Quantized Variational AutoEncoder (RQ-VAE) [53].
RQ-VAE Process:
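- Input: An item's semantic embedding $\boldsymbol{z}$ (e.g., obtained from a pre-trained language model like Sentence-T5).
- Encoding: The RQ-VAE encoder maps $\boldsymbol{z}$ to a latent representation $\boldsymbol{r}$.
- Quantization: $\boldsymbol{r}$ is quantized into serialized codes (tokens) from coarse to fine through $H$ quantization levels. Each level $h$ has a codebook $\mathcal{C}^h = \{\boldsymbol{e}_k^h\}_{k=1}^{K}$, where $\boldsymbol{e}_k^h$ is the $k$-th code vector in the $h$-th codebook and $K$ is the codebook size.
- Residual Quantization Mechanism: At each level $h$, the closest code vector in the codebook is found for the current residual vector $\boldsymbol{r}_h$. The difference (residual) is then passed to the next level. $ c_h = \underset{k}{\arg\min} \|\boldsymbol{r}_h - \boldsymbol{e}_k^h\|_2^2 $ $ \boldsymbol{r}_{h+1} = \boldsymbol{r}_h - \boldsymbol{e}_{c_h}^h $ Here, $c_h$ is the index of the chosen code for level $h$, and $\boldsymbol{r}_h$ is the residual vector at level $h$, with $\boldsymbol{r}_1 = \boldsymbol{r}$ (the initial latent representation). The process is sequential: $\boldsymbol{r}_{h+1}$ is the residual left after quantizing $\boldsymbol{r}_h$ and removing the selected code vector $\boldsymbol{e}_{c_h}^h$.
- Reconstruction: The final quantized item representation is $\tilde{\boldsymbol{r}} = \sum_{h=1}^{H} \boldsymbol{e}_{c_h}^h$. This is then decoded into $\tilde{\boldsymbol{z}}$ to reconstruct the original item embedding $\boldsymbol{z}$.
RQ-VAE Loss Function: The overall loss for the RQ-VAE is $\mathcal{L}_{\mathrm{RQ\text{-}VAE}} = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{rq}}$.
- Reconstruction Loss: $\mathcal{L}_{\mathrm{recon}} = \|\boldsymbol{z} - \tilde{\boldsymbol{z}}\|_2^2$. This term ensures that the RQ-VAE can accurately reconstruct the input semantic embedding.
- Quantization Loss: $\mathcal{L}_{\mathrm{rq}} = \sum_{h=1}^{H} \|\mathrm{sg}[\boldsymbol{r}_h] - \boldsymbol{e}_{c_h}^h\|_2^2 + \beta \|\boldsymbol{r}_h - \mathrm{sg}[\boldsymbol{e}_{c_h}^h]\|_2^2$.
  - $\mathrm{sg}[\cdot]$ denotes the stop-gradient operation, meaning gradients do not flow through this part during backpropagation.
  - The first term encourages the code vectors to move towards the residual vectors (codebook learning).
  - The second term encourages the encoder output to commit to the code vectors (encoder learning).
  - $\beta$ is a hyperparameter (typically 0.25) to balance these two objectives.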
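The residual quantization loop and the value of the quantization loss described above can be sketched in a few lines of pure Python. This is an illustrative toy, not the paper's implementation: the codebooks are hand-picked rather than learned, the encoder and decoder are omitted, and the stop-gradient is ignored because it changes only gradients, not the numeric loss value.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def residual_quantize(r, codebooks):
    """Quantize latent vector r level by level; return (codes, quantized, residuals)."""
    residual = list(r)
    codes, residuals = [], []
    quantized = [0.0] * len(r)
    for codebook in codebooks:
        residuals.append(list(residual))
        # Pick the code vector nearest to the current residual.
        c = min(range(len(codebook)), key=lambda k: sq_dist(residual, codebook[k]))
        codes.append(c)
        quantized = [q + e for q, e in zip(quantized, codebook[c])]
        # Pass what is left over to the next (finer) level.
        residual = [x - e for x, e in zip(residual, codebook[c])]
    return codes, quantized, residuals

def rq_loss_value(residuals, codes, codebooks, beta=0.25):
    """Numeric value of the quantization loss (codebook term + beta * commitment term)."""
    total = 0.0
    for r_h, c, codebook in zip(residuals, codes, codebooks):
        d = sq_dist(r_h, codebook[c])
        total += d + beta * d
    return total

# Hand-picked toy codebooks (assumption: real codebooks are learned during training).
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],    # level 1 (coarse)
    [[0.25, 0.0], [0.0, 0.25]],  # level 2 (fine)
]
codes, quantized, residuals = residual_quantize([0.9, 0.3], codebooks)
loss = rq_loss_value(residuals, codes, codebooks)
```

The list `codes` is the item's identifier: one token per quantization level, ordered from coarse to fine.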
4.2.2.2. Semantically Relevant Tokenizers
To obtain multiple item tokenizers that are not extraneous, MTGRec proposes using model checkpoints from adjacent epochs during a single RQ-VAE training process.
- These checkpoints are derived from the same initialization parameters and iterative gradient descent, ensuring minimal disparities between their codebooks.
- This approach yields token sequences that are related yet distinct, embodying homogeneous knowledge without severe semantic conflicts.
- Formally, the set of semantically relevant item tokenizers is: $ \mathcal{T} = \{\mathrm{T}_1, \mathrm{T}_2, \dots, \mathrm{T}_n\} = \{\mathrm{T}_{\phi^{N-n+1}}, \mathrm{T}_{\phi^{N-n+2}}, \dots, \mathrm{T}_{\phi^N}\} $ Here, $\mathcal{T}$ is the set of item tokenizers. $\mathrm{T}_{\phi^j}$ denotes the RQ-VAE model with parameters $\phi^j$ (a checkpoint from the $j$-th epoch). $N$ is the maximum number of training epochs for the RQ-VAE. The selected tokenizers are from the final $n$ epochs.
4.2.2.3. Tokenize an Item Sequence to Multiple Token Sequences
With the set of semantically relevant item tokenizers $\mathcal{T}$, a single historical item sequence $S$ and target item $v_{t+1}$ can be tokenized into multiple pairs of token sequences:
$
X_1, X_2, \dots, X_n = \mathrm{T}_1(S), \mathrm{T}_2(S), \dots, \mathrm{T}_n(S)
$
$
Y_1, Y_2, \dots, Y_n = \mathrm{T}_1(v_{t+1}), \mathrm{T}_2(v_{t+1}), \dots, \mathrm{T}_n(v_{t+1})
$
where $X_i$ and $Y_i$ are the tokenized historical sequence and target item identifier, respectively, generated by tokenizer $\mathrm{T}_i$.
During pre-training, instead of using all augmented sequences simultaneously (which could be computationally prohibitive for large $n$), only one token sequence is sampled at a time for model optimization. The choice of which tokenizer's data group to sample is dynamically adjusted using curriculum learning.
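The checkpoint selection and multi-sequence tokenization described above can be sketched as follows. The per-epoch identifier tables are made-up stand-ins for what RQ-VAE checkpoints would produce, and the helper names `select_checkpoints` and `tokenize` are hypothetical.

```python
def select_checkpoints(num_epochs, n):
    """Epoch indices of the final n checkpoints: N-n+1, ..., N."""
    return list(range(num_epochs - n + 1, num_epochs + 1))

# Hypothetical identifier tables, one per checkpoint epoch (item -> token list).
# Adjacent checkpoints mostly agree and differ in only a few tokens.
tokenizers = {
    98: {"v1": [3, 17, 42], "v2": [3, 9, 8], "v3": [5, 17, 1]},
    99: {"v1": [3, 17, 40], "v2": [3, 9, 8], "v3": [5, 17, 1]},
    100: {"v1": [3, 17, 40], "v2": [3, 11, 8], "v3": [5, 17, 1]},
}

def tokenize(table, items):
    """Flatten the identifiers of an item sequence under one tokenizer."""
    return [tok for item in items for tok in table[item]]

history, target = ["v1", "v2"], "v3"
epochs = select_checkpoints(num_epochs=100, n=3)

# One (X_i, Y_i) pair per tokenizer: the data groups used for pre-training.
data_groups = [
    (tokenize(tokenizers[e], history), tokenizers[e][target]) for e in epochs
]
```

In this toy setup one interaction sequence yields three related but distinct training pairs, which is the augmentation effect the section describes.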
4.2.3. Curriculum Recommender Pre-training
This module addresses how to effectively train the generative recommender using the hybrid data generated by multiple item tokenizers. It employs a curriculum learning strategy that dynamically adjusts sampling probabilities based on data influence.
4.2.3.1. Estimating Data Influence
The concept of "useful" data is quantified by its data influence [7, 28, 48], defined as the contribution of training data to the validation loss. This is estimated using first-order gradient approximation.
Validation Loss Update: The change in validation loss can be approximated by a first-order Taylor expansion:
$
\mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^{t+1}) \approx \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) + \nabla \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) \cdot (\boldsymbol{\theta}^{t+1} - \boldsymbol{\theta}^t)
$
The update of the validation loss is therefore:
$
\mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^{t+1}) - \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) \approx \nabla \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) \cdot (\boldsymbol{\theta}^{t+1} - \boldsymbol{\theta}^t)
$
Here, $\mathcal{D}_{\mathrm{val}}$ is the held-out validation data, and $\boldsymbol{\theta}^t$ represents the recommender parameters at time step $t$.
Gradient of Validation Data: The validation data is acquired using a leave-one-out strategy (the last item for test, second-to-last for validation). After tokenization by different tokenizers, multiple groups of token sequence data are mixed into $\mathcal{D}_{\mathrm{val}}$.
The mean loss and mean gradient over all validation data are:
$
\mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}) = \frac{1}{|\mathcal{D}_{\mathrm{val}}|} \sum_{X, Y \in \mathcal{D}_{\mathrm{val}}} \mathcal{L}(X, Y ; \boldsymbol{\theta})
$
$
\nabla \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}) = \frac{1}{|\mathcal{D}_{\mathrm{val}}|} \sum_{X, Y \in \mathcal{D}_{\mathrm{val}}} \nabla \mathcal{L}(X, Y ; \boldsymbol{\theta})
$
where $(X, Y)$ is a pair of token sequences for a historical interaction and target item, and $\mathcal{L}(X, Y ; \boldsymbol{\theta})$ is the negative log-likelihood loss (defined in Section 4.2.3.2).
Adam Gradients of Training Data: The generative recommender is trained using the Adam optimizer [18]. The parameter update can be expressed as:
$
\boldsymbol{\theta}^{t+1} - \boldsymbol{\theta}^t = - \eta_t \Gamma(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t)
$
where $\eta_t$ is the learning rate at time step $t$, and $\Gamma(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t)$ represents the Adam-specific gradient for the training data tokenized by $\mathrm{T}_i$.
The Adam update $\Gamma$ involves first-order momentum $\boldsymbol{m}$ and second-order momentum $\boldsymbol{v}$:
$
\Gamma(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t) = \frac{\boldsymbol{m}^{t+1}}{\sqrt{\boldsymbol{v}^{t+1} + \epsilon}}
$
$
\boldsymbol{m}^{t+1} = (\beta_1 \boldsymbol{m}^t + (1 - \beta_1) \nabla \mathcal{L}(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t)) / (1 - \beta_1^t)
$
$
\boldsymbol{v}^{t+1} = (\beta_2 \boldsymbol{v}^t + (1 - \beta_2) \nabla \mathcal{L}(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t)^2) / (1 - \beta_2^t)
$
Here, $\mathcal{D}_{\mathrm{train}}^i$ is the training token sequence data generated by item tokenizer $\mathrm{T}_i$. $\beta_1$ (typically 0.9) and $\beta_2$ (typically 0.999) are Adam hyperparameters. $\epsilon$ is a small constant for numerical stability. $\nabla \mathcal{L}(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t)$ is the gradient of the loss for the data group $\mathcal{D}_{\mathrm{train}}^i$. The paper treats the data group from each tokenizer as a whole, calculating its gradient through gradient accumulation.
Calculate Influence: The data influence of each item tokenizer $\mathrm{T}_i$ at time step $t$ is defined as:
$
\mathrm{I}(\mathrm{T}_i ; \boldsymbol{\theta}^t) = \eta_t \nabla \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) \cdot \Gamma(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t)
$
This formula measures how well the gradient of the validation loss aligns with the Adam-normalized gradient from the training data of tokenizer $\mathrm{T}_i$. A positive influence implies that the training step (driven by $\mathrm{T}_i$'s data) is reducing the validation loss.
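As a short derivation sketch (a restatement of the formulas already given, not an additional result from the paper), substituting the Adam parameter update into the first-order expansion of the validation loss shows where the sign convention comes from:
$
\mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^{t+1}) - \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) \approx \nabla \mathcal{L}(\mathcal{D}_{\mathrm{val}} ; \boldsymbol{\theta}^t) \cdot \left(- \eta_t \Gamma(\mathcal{D}_{\mathrm{train}}^i ; \boldsymbol{\theta}^t)\right) = - \mathrm{I}(\mathrm{T}_i ; \boldsymbol{\theta}^t)
$
Under this approximation, the influence equals the predicted decrease in validation loss from one optimization step on the data group of tokenizer $\mathrm{T}_i$, so a larger positive value marks a more useful data group.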
Since training spans multiple time steps, the cumulative influence is calculated:
$
\tilde{\mathrm{I}}(\mathrm{T}_i) = \sum_{k=1}^K \mathrm{I}(\mathrm{T}_i ; \boldsymbol{\theta}_k)
$
where $\boldsymbol{\theta}_k$ is the model checkpoint at time step $k$, and $K$ is the total number of checkpoints considered for influence estimation.
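The influence computation above can be sketched in pure Python under simplifying assumptions: gradients are plain lists of floats, the Adam state starts at zero, and the gradient values are made up for illustration. The helper names are hypothetical, and a real implementation would operate on model gradients obtained through gradient accumulation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adam_direction(grad, m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction Gamma for one data group (step must be >= 1)."""
    m_new = [(beta1 * mi + (1 - beta1) * g) / (1 - beta1 ** step)
             for mi, g in zip(m, grad)]
    v_new = [(beta2 * vi + (1 - beta2) * g * g) / (1 - beta2 ** step)
             for vi, g in zip(v, grad)]
    return [mi / math.sqrt(vi + eps) for mi, vi in zip(m_new, v_new)]

def influence(val_grad, train_grad, m, v, lr, step):
    """I(T_i; theta^t) = lr * <validation gradient, Adam direction of group i>."""
    return lr * dot(val_grad, adam_direction(train_grad, m, v, step))

# Made-up gradients for a two-parameter model.
val_grad = [1.0, -2.0]
group_a_grad = [0.5, -1.0]   # points the same way as the validation gradient
group_b_grad = [-0.5, 1.0]   # points the opposite way
zeros = [0.0, 0.0]

infl_a = influence(val_grad, group_a_grad, zeros, zeros, lr=0.01, step=1)
infl_b = influence(val_grad, group_b_grad, zeros, zeros, lr=0.01, step=1)

# Cumulative influence sums the per-checkpoint values (here one value repeated over 3 checkpoints).
cumulative_a = sum([infl_a] * 3)
```

Group A's gradient is aligned with the validation gradient, so its influence is positive; group B's is negative, meaning a step on its data is predicted to raise the validation loss.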
4.2.3.2. Curriculum Pre-training
The training process is divided into multiple stages. At the end of each stage, the data sampling probabilities are updated based on the cumulative data influence.
- Cumulative Influence Update: For stage $k$, the cumulative influence of tokenizer $\mathrm{T}_i$ is updated: $ \tilde{\mathrm{I}}_k(\mathrm{T}_i) = \tilde{\mathrm{I}}_{k-1}(\mathrm{T}_i) + \mathrm{I}(\mathrm{T}_i ; \boldsymbol{\theta}_k) $ Initially, all data groups are sampled with equal probability ($1/n$ for each of the $n$ tokenizers).
- Sampling Probability Update: The sampling probability for tokenizer $\mathrm{T}_i$ in the subsequent stage is calculated using a softmax-like function: $ p_i^k = \frac{e^{\tilde{\mathrm{I}}_k(\mathrm{T}_i) / \tau}}{\sum_{j=1}^n e^{\tilde{\mathrm{I}}_k(\mathrm{T}_j) / \tau}} $ Here, $\tau$ is the temperature coefficient that controls the smoothness of the distribution. A smaller $\tau$ makes the distribution sharper, favoring high-influence tokenizers more strongly. A larger $\tau$ makes it flatter, approaching uniform sampling.
- Data Sampling Strategy: In each stage, a tokenizer is sampled from the set of candidate tokenizers according to the current sampling probabilities. The user interaction sequence and the target item are then tokenized with the sampled tokenizer to obtain $X$ and $Y$.
- Model Optimization: The sampled token sequence data $(X, Y)$ is then fed into the generative recommender for optimization using the negative log-likelihood loss: $ \mathcal{L}(X, Y) = - \sum_{h=1}^H \log P(c_h^{t+1} | X, c_1^{t+1}, \dots, c_{h-1}^{t+1}) $ This is the standard loss for autoregressive sequence generation, where the model tries to maximize the likelihood of generating the correct target item identifier given the historical sequence.
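The stage-level update described above can be sketched in a few lines of Python. This is only a minimal illustration of the cumulative update, the softmax-like probability, and the tokenizer draw; the function names and the toy influence values are hypothetical, and the max-subtraction is a common numerical-stability trick rather than something stated in the paper.

```python
import math
import random

def sampling_probabilities(cumulative_influence, tau):
    """Softmax over cumulative influences with temperature tau."""
    scaled = [value / tau for value in cumulative_influence]
    peak = max(scaled)  # subtracting the max does not change the result, only avoids overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def curriculum_stage(cumulative, stage_influence, tau, rng):
    """Add this stage's influence, recompute probabilities, and sample one tokenizer index."""
    updated = [c + s for c, s in zip(cumulative, stage_influence)]
    probs = sampling_probabilities(updated, tau)
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return updated, probs, index

# Toy run with three tokenizers, starting from zero cumulative influence (uniform sampling).
rng = random.Random(0)
cumulative = [0.0, 0.0, 0.0]
uniform = sampling_probabilities(cumulative, tau=1.0)
cumulative, probs, chosen = curriculum_stage(cumulative, [2.0, 1.0, 0.0], tau=1.0, rng=rng)
```

With zero cumulative influence the probabilities are uniform, which matches the initial equal-probability sampling; after one update, the tokenizer with the largest influence receives the largest share.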
4.2.4. Fine-tuning and Inference
4.2.4.1. Fine-tuning for Item Identification
After pre-training with multiple item tokenizers, the generative recommender might not have a strict one-to-one mapping between an item and a single identifier (since an item could have been associated with multiple identifiers during pre-training). For practical deployment, accurate item identification is crucial.
Therefore, the pre-trained model is fine-tuned using each item tokenizer separately. The model configuration that achieves the optimal validation performance is then selected for actual deployment and testing. This step re-establishes a clear mapping from generated identifier to a unique item.
4.2.4.2. Inference
During inference, the objective is to generate the top-K items for recommendation.
- Decoding: Beam search is used to decode token sequences. Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. In autoregressive generation, it keeps only the most probable partial sequences at each step (as many as the beam size) until complete sequences are generated.
- Mapping: The generated token sequences are then mapped back to their corresponding items.
- Efficiency: Unlike some prior works, MTGRec does not use a prefix tree to constrain the search, because doing so would hinder parallel decoding and reduce efficiency.
- Invalid Identifiers: Invalid identifiers (token sequences that do not map to any known item), which occur rarely, are simply ignored.
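The decoding steps above can be illustrated with a small, unconstrained beam search followed by an identifier-to-item lookup that skips invalid identifiers. This is a toy sketch: `toy_model`, the two-token identifiers, and the item names are hypothetical stand-ins for the recommender's decoder and the tokenizer's mapping.

```python
import math

def beam_search(next_token_log_probs, num_tokens, beam_size):
    """Keep the beam_size most probable partial sequences at each step (no prefix tree)."""
    beams = [((), 0.0)]
    for _ in range(num_tokens):
        candidates = []
        for prefix, score in beams:
            for token, log_prob in next_token_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_prob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams

def map_to_items(beams, identifier_to_item, top_k):
    """Map decoded identifiers to items, ignoring identifiers that match no item."""
    items = []
    for identifier, _ in beams:
        item = identifier_to_item.get(identifier)
        if item is not None:
            items.append(item)
    return items[:top_k]

def toy_model(prefix):
    """Hypothetical next-token distribution over a vocabulary of two tokens."""
    if len(prefix) == 0:
        return {0: math.log(0.6), 1: math.log(0.4)}
    return {0: math.log(0.7), 1: math.log(0.3)}

beams = beam_search(toy_model, num_tokens=2, beam_size=3)
identifier_to_item = {(0, 0): "item_A", (0, 1): "item_B"}  # (1, 0) is an invalid identifier
recommended = map_to_items(beams, identifier_to_item, top_k=2)
```

In this toy run the second-ranked beam `(1, 0)` maps to no item and is dropped, so the final list contains the two valid items in score order.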
5. Experimental Setup
5.1. Datasets
The experiments were conducted on three subsets of the latest Amazon 2023 review dataset [12]:
- "Musical Instruments" (referred to as Instrument)
- "Industrial and Scientific" (referred to as Scientific)
- "Video Games" (referred to as Game)

These datasets contain user review data spanning from May 1996 to September 2023.
Preprocessing Steps (following [32, 59]):
- Low-activity users and items with fewer than five interaction records were filtered out.
- Historical item sequences were grouped by users and sorted in chronological order.
- A maximum sequence length limit of 20 items was applied.

The detailed statistics of the preprocessed datasets are presented in Table 1.
The following are the results from Table 1 of the original paper:
| Dataset | #Users | #Items | #Interactions | Sparsity | Avg.len |
| --- | --- | --- | --- | --- | --- |
| Instrument | 57,439 | 24,587 | 511,836 | 99.964% | 8.91 |
| Scientific | 50,985 | 25,848 | 412,947 | 99.969% | 8.10 |
| Game | 94,762 | 25,612 | 814,586 | 99.966% | 8.60 |
Characteristics:
- Domain: E-commerce product reviews.
- Scale: Moderate number of users, items, and interactions.
- Sparsity: All datasets exhibit very high sparsity (over 99.96%), which is typical for real-world recommender systems and highlights the challenge of predicting relevant items.
- Average Length (Avg.len): Indicates the average number of interactions per user sequence, around 8-9 items.

These datasets are widely used benchmarks in sequential recommendation, making them suitable for validating the general effectiveness and scalability of the proposed method. Their high sparsity also makes them challenging, providing a good test of the model's ability to handle low-frequency items.
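The Sparsity and Avg.len columns of Table 1 can be reproduced from the raw counts. The sketch below assumes the usual definitions, sparsity as one minus interactions divided by users times items and average length as interactions per user; the function name `dataset_stats` is hypothetical.

```python
def dataset_stats(num_users, num_items, num_interactions):
    """Return (sparsity in percent, average sequence length) under the usual definitions."""
    density = num_interactions / (num_users * num_items)
    sparsity_pct = round(100.0 * (1.0 - density), 3)
    avg_len = round(num_interactions / num_users, 2)
    return sparsity_pct, avg_len

# Counts taken from Table 1.
instrument = dataset_stats(57439, 24587, 511836)
scientific = dataset_stats(50985, 25848, 412947)
game = dataset_stats(94762, 25612, 814586)
```

Under these assumed definitions the computed values agree with the reported table entries for all three datasets.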
5.2. Evaluation Metrics
The model performance in sequential recommendation is evaluated using top-$K$ Recall and Normalized Discounted Cumulative Gain (NDCG). $K$ is set to 5 and 10.
5.2.1. Recall@K
Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved (recommended) among the top $K$ recommendations. It focuses on how many of the actual preferred items appear within the recommended list, without considering their rank.
Mathematical Formula: For a single user $u$: $ \mathrm{Recall@K}_u = \frac{|R_u \cap T_u|}{|T_u|} $ And then averaged over all users: $ \mathrm{Recall@K} = \frac{1}{|U|} \sum_{u \in U} \mathrm{Recall@K}_u $ Under leave-one-out evaluation, where $|T_u| = 1$, this equals the fraction of users whose target item appears in the top-$K$ list.
Symbol Explanation:
- $U$: The set of all users.
- $u$: A specific user.
- $R_u$: The set of top $K$ items recommended to user $u$.
- $T_u$: The set of items that are actually relevant to user $u$ (e.g., the next item they interacted with in the test set). In a leave-one-out scenario, $|T_u| = 1$.
- $|\cdot|$: Denotes the cardinality (number of elements) of a set.
- $\cap$: Set intersection operator.
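A minimal Python sketch of the per-user and averaged Recall@K defined above follows. The function names, user IDs, and item IDs are hypothetical and chosen only for illustration.

```python
def recall_at_k(recommended, relevant, k):
    """|R_u ∩ T_u| / |T_u| for one user, using the top-k recommended items."""
    relevant = set(relevant)
    hits = len(set(recommended[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(recs_by_user, relevant_by_user, k):
    """Average the per-user Recall@K over all users."""
    scores = [recall_at_k(recs_by_user[u], relevant_by_user[u], k) for u in relevant_by_user]
    return sum(scores) / len(scores)

# Leave-one-out style toy data: each user has exactly one relevant (target) item.
recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
targets = {"u1": ["b"], "u2": ["z"]}
overall = mean_recall_at_k(recs, targets, k=2)
```

Here `u1`'s target appears in the top 2 and `u2`'s does not, so the averaged value is the hit rate over the two users.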
5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)
Conceptual Definition: NDCG@K is a measure of ranking quality, especially useful when items have varying degrees of relevance. It considers the position of relevant items in the recommended list; relevant items appearing higher in the list contribute more to the score. It penalizes relevant items that appear lower. The "Normalized" part ensures that scores are comparable across different query results by dividing by the Ideal DCG.
Mathematical Formula: For a single user $u$: $ \mathrm{DCG@K}_u = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}(i)} - 1}{\log_2(i+1)} $ $ \mathrm{IDCG@K}_u = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_{\mathrm{ideal}}(i)} - 1}{\log_2(i+1)} $ $ \mathrm{NDCG@K}_u = \frac{\mathrm{DCG@K}_u}{\mathrm{IDCG@K}_u} $ And then averaged over all users: $ \mathrm{NDCG@K} = \frac{1}{|U|} \sum_{u \in U} \mathrm{NDCG@K}_u $
Symbol Explanation:
- $U$: The set of all users.
- $u$: A specific user.
- $K$: The number of top recommendations considered.
- $i$: The rank position of an item in the recommended list.
- $\mathrm{rel}(i)$: The relevance score of the item at rank $i$ in the recommended list. Typically, for implicit feedback, this is 1 if the item is relevant (e.g., the actual next item) and 0 otherwise.
- $\mathrm{rel}_{\mathrm{ideal}}(i)$: The relevance score of the item at rank $i$ in the ideal recommended list (where all relevant items are ranked as high as possible). In leave-one-out evaluation, the ideal list would have relevance 1 at rank 1 and 0 elsewhere.
- $\mathrm{DCG@K}_u$: Discounted Cumulative Gain for user $u$ at rank $K$.
- $\mathrm{IDCG@K}_u$: Ideal Discounted Cumulative Gain for user $u$ at rank $K$ (the maximum possible DCG for that user).

Evaluation Strategy:
- Leave-one-out strategy: For each user's interaction sequence:
  - The final item is used as the test data.
  - The second most recent item is used as the validation data.
  - All other preceding items are used for training.
- Full ranking evaluation: The models rank the entire item set rather than a sampled subset, which makes the comparison more rigorous.
- Beam size: For all generative recommendation models, the beam size for autoregressive decoding is set to 50.
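The leave-one-out split and the NDCG@K definition can be sketched together in Python. This is an illustrative example with binary relevance; the function names and item IDs are hypothetical.

```python
import math

def leave_one_out_split(sequence):
    """Split one user's chronological sequence into (training items, validation item, test item)."""
    return sequence[:-2], sequence[-2], sequence[-1]

def dcg_at_k(relevances, k):
    """Sum of (2^rel - 1) / log2(rank + 1) over the first k positions."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(recommended, relevant, k):
    """DCG of the recommended list divided by the DCG of the ideal list."""
    relevant = set(relevant)
    gains = [1 if item in relevant else 0 for item in recommended]
    ideal = [1] * min(len(relevant), k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

train, valid, test = leave_one_out_split(["i1", "i2", "i3", "i4", "i5"])
score = ndcg_at_k(["i9", "i5", "i7"], [test], k=3)  # target item sits at rank 2
```

With a single target item per user, IDCG is 1, so NDCG@K reduces to $1 / \log_2(\text{rank} + 1)$ when the target is within the top $K$ and to 0 otherwise.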
5.3. Baseline Models
The paper compares MTGRec against two groups of baseline models:
5.3.1. Traditional Sequential Recommendation Models
These models typically use item IDs and collaborative filtering information, with some incorporating item features.
- Caser [39]: Leverages convolutional neural networks (CNNs) to capture spatial and positional patterns in user behavior sequences.
- HGN [24]: Uses feature-level and instance-level gating mechanisms to model user preferences.
- GRU4Rec [10]: Employs Gated Recurrent Units (GRUs) to capture sequential patterns in user interactions.
- BERT4Rec [36]: Utilizes a bi-directional self-attentive model with a mask prediction objective for sequence modeling, similar to BERT.
- SASRec [16]: Adopts a unidirectional self-attention network for user behavior modeling, a Transformer-based approach.
- FMLP-Rec [60]: Proposes an all-MLP model with learnable filters to reduce noise and model user preferences.
- HSTU [54]: Hierarchical Sequential Transducers that incorporate user actions and timestamps for next item prediction. It is an ID-based method.
- FDSA [55]: Introduces a dual-stream self-attention framework that independently models item-level and feature-level sequences for recommendation. This model is notable for incorporating item textual features.
- S^3-Rec [59]: Enhances sequential recommendation models by leveraging feature-item correlations as self-supervised signals.
5.3.2. Generative Recommendation Models
These models transform next-item prediction into autoregressive generation of item identifiers.
- TIGER [32]: Employs RQ-VAE to quantize item embeddings into semantic IDs (token sequences) and uses a generative retrieval paradigm for sequential recommendation.
- LETTER [42]: Extends TIGER by integrating collaborative and diversity regularization into RQ-VAE to improve the quality of item tokenization.
- TIGER++ [32]: An improved version of TIGER that employs representation whitening and exponential moving average (EMA) techniques to enhance codebook learning and improve the quality of semantic IDs. The paper states that TIGER++ uses the same techniques as those used to learn the RQ-VAE in MTGRec, for fair comparison.
5.4. Implementation Details
5.4.1. Item Tokenizer (for MTGRec and baselines)
- Semantic Embeddings: Sentence-T5 [25] is used to encode the textual information associated with each item (e.g., title, description) into its semantic embedding.
- RQ-VAE Structure: An RQ-VAE model with 3 codebooks of size 256, plus an extra codebook for collision handling, is used. The codebook dimension is set to 128.
  - Encoder/Decoder: A deeper MLP with hidden layer sizes [2048, 1024, 512, 256] is used for the encoder and decoder, following previous studies [23, 58].
- Enhancements (similar to TIGER++):
  - PCA with representation whitening [35] is applied to enhance the quality of item semantic embeddings.
  - Exponential moving averages (EMA) [41] are used instead of gradient descent for codebook learning, for stability and effectiveness.
- Training: Optimized by the Adagrad optimizer for 10K epochs, with a learning rate of 0.001 and a batch size of 2048.
- Tokenizer Selection for MTGRec: RQ-VAE checkpoints from the final $n$ epochs are selected as semantically relevant item tokenizers. $n$ is tuned between 5 and 30 with an interval of 5.
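To make the role of the stacked codebooks more concrete, the sketch below shows plain residual quantization: each level picks the nearest code to the current residual and passes the remainder to the next level, so a latent vector becomes one token per codebook. It is a simplified assumption-laden illustration, using tiny hand-written codebooks instead of 3 learned codebooks of size 256, omitting the encoder MLP, and not modeling the extra collision-handling codebook; the function names are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry with the smallest squared Euclidean distance."""
    best_index, best_distance = 0, float("inf")
    for index, code in enumerate(codebook):
        distance = sum((a - b) ** 2 for a, b in zip(vector, code))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index

def residual_quantize(latent, codebooks):
    """Assign one token per codebook level by quantizing the running residual."""
    residual = list(latent)
    tokens = []
    for codebook in codebooks:
        index = nearest_code(residual, codebook)
        tokens.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens

# Two toy codebooks with two 2-dimensional codes each (hypothetical values).
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[0.0, 0.0], [1.0, -1.0]],
]
identifier = residual_quantize([11.0, 9.0], codebooks)
```

In this toy case the first level captures the coarse position and the second level captures the leftover offset, which mirrors the coarse-to-fine reading of the token sequence.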
5.4.2. Generative Recommender
- Backbone: T5 [31] model.
- Model Architecture:
  - Model dimension: 128
  - Inner dimension: 512
  - Attention heads: 4, with a dimension of 64 each.
  - Activation function: ReLU.
- Model Layers: The number of model layers (both encoder and decoder layers) is tuned within {1, 2, 3, 4, 5, 6, 7, 8}.
- Pre-training:
  - Batch size: 256 on each GPU (4 GPUs in total).
  - Epochs: 200 epochs on all datasets.
  - Curriculum Learning Schedule:
    - 60 epochs for gradient feature warmup.
    - Sampling probability update every 20 epochs thereafter.
  - Temperature coefficient $\tau$: Tuned in {0.1, 0.3, 1.0, 3.0, 5.0, 10.0}.
  - Optimizer: AdamW [18] with a learning rate of 0.005.
  - Learning rate scheduler: Cosine scheduler.
- Fine-tuning:
  - Optimizer: AdamW with a learning rate of 0.0002.
  - Learning rate scheduler: Cosine scheduler.
5.4.3. Baselines Setup
- Traditional Models: Implemented using RecBole [56]. The embedding dimension is set to 128. Hyperparameters are tuned via grid search.
- Generative Baselines (TIGER, LETTER, TIGER++): Use the same T5 model architecture as MTGRec. The number of layers is tuned from 1 to 8.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents a comprehensive comparison of MTGRec against both traditional and generative recommendation baselines on three public datasets. The overall results are summarized in Table 2.
The following are the results from Table 2 of the original paper:
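| Methods | Instrument Recall@5 | Instrument Recall@10 | Instrument NDCG@5 | Instrument NDCG@10 | Scientific Recall@5 | Scientific Recall@10 | Scientific NDCG@5 | Scientific NDCG@10 | Game Recall@5 | Game Recall@10 | Game NDCG@5 | Game NDCG@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |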
| Caser | 0.0241 | 0.0386 | 0.0151 | 0.0197 | 0.0159 | 0.0257 | 0.0101 | 0.0132 | 0.0330 | 0.0553 | 0.0209 | 0.0281 |
| HGN | 0.0321 | 0.0517 | 0.0202 | 0.0265 | 0.0212 | 0.0351 | 0.0131 | 0.0176 | 0.0424 | 0.0687 | 0.0271 | 0.0356 |
| GRU4Rec | 0.0324 | 0.0501 | 0.0209 | 0.0266 | 0.0202 | 0.0338 | 0.0129 | 0.0173 | 0.0499 | 0.0799 | 0.0320 | 0.0416 |
| BERT4Rec | 0.0307 | 0.0485 | 0.0195 | 0.0252 | 0.0186 | 0.0296 | 0.0119 | 0.0155 | 0.0460 | 0.0735 | 0.0298 | 0.0386 |
| SASRec | 0.0333 | 0.0523 | 0.0213 | 0.0274 | 0.0259 | 0.0412 | 0.0150 | 0.0199 | 0.0535 | 0.0847 | 0.0331 | 0.0438 |
| FMLP-Rec | 0.0339 | 0.0536 | 0.0218 | 0.0282 | 0.0269 | 0.0422 | 0.0155 | 0.0204 | 0.0528 | 0.0857 | 0.0338 | 0.0444 |
| HSTU | 0.0343 | 0.0577 | 0.0191 | 0.0271 | 0.0271 | 0.0429 | 0.0147 | 0.0198 | 0.0578 | 0.0903 | 0.0334 | 0.0442 |
| FDSA | 0.0347 | 0.0545 | 0.0230 | 0.0293 | 0.0262 | 0.0421 | 0.0169 | 0.0213 | 0.0544 | 0.0852 | 0.0361 | 0.0448 |
| S3-Rec | 0.0317 | 0.0496 | 0.0199 | 0.0257 | 0.0263 | 0.0418 | 0.0171 | 0.0219 | 0.0485 | 0.0769 | 0.0315 | 0.0406 |
| TIGER | 0.0370 | 0.0564 | 0.0244 | 0.0306 | 0.0264 | 0.0422 | 0.0175 | 0.0226 | 0.0559 | 0.0868 | 0.0366 | 0.0467 |
| LETTER | 0.0372 | 0.0580 | 0.0246 | 0.0313 | 0.0279 | 0.0435 | 0.0182 | 0.0232 | 0.0563 | 0.0877 | 0.0372 | 0.0473 |
| TIGER++ | 0.0380 | 0.0588 | 0.0249 | 0.0316 | 0.0289 | 0.0450 | 0.0190 | 0.0241 | 0.0580 | 0.0914 | 0.0377 | 0.0485 |
| MTGRec | 0.0413 | 0.0635 | 0.0275 | 0.0346 | 0.0322 | 0.0506 | 0.0212 | 0.0271 | 0.0621 | 0.0956 | 0.0410 | 0.0517 |
| Improvement | +8.68% | +7.99% | +10.44% | +9.49% | +11.42% | +12.44% | +11.58% | +12.45% | +7.07% | +4.60% | +8.75% | +6.60% |
Observations:
- Traditional vs. Generative Models:
  - FDSA, which incorporates item textual features, generally outperforms the traditional ID-based models (Caser, HGN, GRU4Rec, BERT4Rec, SASRec, FMLP-Rec, HSTU). This highlights the benefit of enriching item representations with semantic information beyond just IDs.
  - Generative recommendation models (TIGER, LETTER, TIGER++) generally outperform traditional sequential recommendation models. This validates the advantage of the generative paradigm and of item identifiers that carry semantics.
- Generative Baselines Comparison:
  - LETTER and TIGER++ show better performance than TIGER. This is attributed to their improvements in item tokenizer quality (LETTER with collaborative and diversity regularization, TIGER++ with representation whitening and EMA). These techniques enhance the semantic quality of the generated item tokens.
- MTGRec's Superiority:
  - MTGRec consistently achieves the best performance across all three datasets and all evaluation metrics (Recall@5, Recall@10, NDCG@5, NDCG@10).
  - It shows substantial improvements over both traditional and generative baseline models. For example, on the Scientific dataset, MTGRec improves Recall@5 and NDCG@5 by over 11% compared to TIGER++. On Instrument, NDCG@5 improves by over 10%.
  - This strong performance validates MTGRec's core hypothesis: multi-identifier item tokenization combined with curriculum recommender pre-training significantly enhances the generative recommender's scalability and effectiveness by providing larger and more diverse sequence data.
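The improvement row of Table 2 can be checked with a short calculation, assuming it reports the relative gain of MTGRec over the best baseline for each metric. The sketch below uses the Scientific columns and, for brevity, only the three generative baselines (which are also the strongest baselines on that dataset); the dictionary layout and function name are hypothetical.

```python
def relative_improvement(ours, best_baseline):
    """Relative gain in percent of `ours` over `best_baseline`."""
    return 100.0 * (ours - best_baseline) / best_baseline

# Scientific dataset values copied from Table 2.
baselines = {
    "TIGER":   {"Recall@5": 0.0264, "Recall@10": 0.0422, "NDCG@5": 0.0175, "NDCG@10": 0.0226},
    "LETTER":  {"Recall@5": 0.0279, "Recall@10": 0.0435, "NDCG@5": 0.0182, "NDCG@10": 0.0232},
    "TIGER++": {"Recall@5": 0.0289, "Recall@10": 0.0450, "NDCG@5": 0.0190, "NDCG@10": 0.0241},
}
mtgrec = {"Recall@5": 0.0322, "Recall@10": 0.0506, "NDCG@5": 0.0212, "NDCG@10": 0.0271}

improvements = {}
for metric, value in mtgrec.items():
    best = max(scores[metric] for scores in baselines.values())
    improvements[metric] = round(relative_improvement(value, best), 2)
```

Under this assumption the computed percentages match the Scientific entries of the improvement row.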
6.2. Ablation Study
To understand the contribution of each component of MTGRec, an ablative analysis was performed on the Instrument and Scientific datasets. The variants are compared against the full MTGRec model.
The following are the results from Table 3 of the original paper:
| Methods | Instrument Recall@5 | Instrument Recall@10 | Instrument NDCG@5 | Instrument NDCG@10 | Scientific Recall@5 | Scientific Recall@10 | Scientific NDCG@5 | Scientific NDCG@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (0) MTGRec | 0.0413 | 0.0635 | 0.0275 | 0.0346 | 0.0322 | 0.0506 | 0.0212 | 0.0271 |
| w/o Data curriculum | 0.0406 | 0.0618 | 0.0268 | 0.0338 | 0.0312 | 0.0487 | 0.0205 | 0.0263 |
| w/o Relevant tokenizers | 0.0350 | 0.0548 | 0.0226 | 0.0290 | 0.0249 | 0.0404 | 0.0158 | 0.0208 |
| w/o Pre-training | 0.0380 | 0.0571 | 0.0247 | 0.0309 | 0.0285 | 0.0443 | 0.0181 | 0.0236 |
Analysis of Variants:
- w/o Data curriculum: This variant removes the curriculum learning component; data from different item tokenizers are sampled with equal probability.
  - Result: Performance is worse than the full MTGRec across all metrics and datasets (e.g., Recall@10 on Instrument drops from 0.0635 to 0.0618).
  - Conclusion: The data curriculum scheme, guided by data influence estimation, is effective. Dynamically prioritizing "useful" data groups helps the model learn more efficiently and effectively from the diverse augmented data.
- w/o Relevant tokenizers: This variant uses multiple item tokenizers initialized with different random parameters, making them irrelevant and extraneous to one another.
  - Result: This variant shows a significant performance degradation, performing even worse than w/o Pre-training (e.g., Recall@10 on Instrument drops from 0.0635 to 0.0548, compared with 0.0571 for w/o Pre-training).
  - Conclusion: This emphasizes the importance of semantically relevant tokenizers. Using unrelated tokenizers creates semantic conflicts in the training data, leading to model learning collapse. The proposed method of using RQ-VAE checkpoints from adjacent epochs is crucial for maintaining semantic consistency while introducing diversity.
- w/o Pre-training: This variant represents a baseline where the generative recommender is trained only on data from a single item tokenizer (equivalent to $n = 1$). It does not use the augmented data or the pre-training strategy.
  - Result: This variant performs worse than MTGRec (e.g., Recall@10 on Instrument is 0.0571 vs. 0.0635 for MTGRec).
  - Conclusion: Pre-training on augmented sequence data from multiple semantically relevant item tokenizers is a critical element of MTGRec's effectiveness. The multi-identifier scheme provides the necessary volume and diversity for improved performance.

These ablation studies demonstrate that both multi-identifier item tokenization (with semantically relevant tokenizers) and the data curriculum pre-training scheme contribute to MTGRec's superior performance, with the relevance of the tokenizers having the largest effect.
6.3. Further Analysis
6.3.1. Performance Comparison w.r.t. Model Scale
This analysis investigates how the number of encoder and decoder layers in the generative recommender (model scale) impacts performance when combined with MTGRec's data augmentation.
The image is a chart comparing the Recall@10 performance of MTGRec, TIGER++, and TIGER on three datasets (Instrument, Scientific, and Game) at different numbers of layers. MTGRec outperforms the other models at every layer count and reaches its highest recall on the Game dataset.
Figure 2: Performance Comparison w.r.t. Model Scale. The x-axis coordinates are the number of encoder and decoder layers in the generative recommender, up to 8 layers.
Observations:
- MTGRec consistently outperforms TIGER and TIGER++ across all model scales (number of layers). This confirms MTGRec's overall effectiveness.
- For the baseline models (TIGER, TIGER++), performance initially improves with scale at shallow depths but can degrade due to overfitting once the model becomes too large (e.g., 4 or 5 layers), suggesting that their training data is too limited to fully utilize larger models.
- MTGRec generally shows an upward trend in performance as the model scales, indicating that its multi-identifier item tokenization provides larger and more diverse data that larger models can benefit from.
- Limitation: The paper acknowledges that this positive correlation is constrained compared to LLMs, which scale to 100B parameters. The current method of augmenting data, while effective, might not generate sufficiently diverse data to fully utilize models beyond a certain scale, especially since RQ-VAE checkpoints from very distant epochs might lose semantic relevance. This points to a potential trade-off between data quality (semantic relevance) and quantity/diversity.
6.3.2. Performance Comparison w.r.t. Tokenizer Number
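This analysis examines how the number of item tokenizers ($n$, the number of RQ-VAE checkpoints selected) used for pre-training affects recommendation performance for different model scales (3-layer and 6-layer generative recommenders).
The image is a chart showing the effect of the number of tokenizers on Recall@10 for 3-layer and 6-layer models on the Instrument and Scientific datasets. On the Instrument dataset, the recall of the 3-layer model peaks at 15 tokenizers, while on the Scientific dataset the 6-layer model remains relatively stable.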
Figure 3: Performance Comparison w.r.t. Tokenizer Number on the Instrument and Scientific datasets.
Observations:
- Fewer Tokenizers: Using few tokenizers offers only marginal gains, because the volume and diversity of the augmented sequence data are insufficient for effective deep model optimization.
- Excessive Tokenizers: Using an excessively large number of tokenizers also leads to suboptimal performance. This is hypothesized to occur when the epoch intervals between RQ-VAE checkpoints become too large, which weakens the semantic relevance between tokenizers or even makes them conflict. This highlights a crucial trade-off in MTGRec: balancing data volume and semantic relevance.
- Optimal Number: There exists an optimal number of tokenizers for each model scale.
- Model Scale Influence: Larger models (e.g., 6-layer) benefit from more extensive and diverse sequence data and thus generally have a higher optimal number of tokenizers than smaller models (e.g., 3-layer). This reinforces the idea that data augmentation is more beneficial for larger-capacity models.
6.3.3. Performance Comparison w.r.t. Temperature Coefficient
The temperature coefficient $\tau$ in Eqn. (18) (sampling probability update) controls the smoothness of the sampling distribution for data groups. This analysis explores its impact.
The image is a chart showing the effect of the temperature coefficient $\tau$ on NDCG@10 and Recall@10 on the Instrument and Scientific datasets. The horizontal axis is $\tau$, and the vertical axes show NDCG@10 and Recall@10. Both metrics vary with the temperature coefficient.
Figure 4: Performance Comparison w.r.t. Temperature Coefficient on the Instrument and Scientific datasets.
Observations:
- An appropriate value of $\tau$ is crucial for MTGRec's performance.
- Smaller $\tau$: Makes the model more inclined towards high-probability item tokenizers (i.e., those with high cumulative influence). This can lead to over-specialization on a few "best" data groups and neglect of potentially useful but lower-influence data.
- Larger $\tau$: Causes the data curriculum to degenerate into uniform sampling, meaning all data groups are sampled with roughly equal probability, regardless of their influence. This negates the benefit of curriculum learning.
- Optimal $\tau$: The optimal values for $\tau$ are found to be 3 (Instrument) and 1 (Scientific). Both extremes (very small or very large $\tau$) reduce the effectiveness of curriculum pre-training. This demonstrates the importance of tuning $\tau$ to balance exploration of different data groups with exploitation of high-influence ones.
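The sharpening and flattening effect of $\tau$ can be seen numerically by sweeping the tuned grid over a fixed set of influence scores. In this sketch the three influence values are hypothetical; only the grid of $\tau$ values comes from the implementation details.

```python
import math

def softmax_with_temperature(scores, tau):
    """Softmax of scores / tau, computed with max-subtraction for stability."""
    scaled = [s / tau for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

influences = [1.0, 0.5, 0.0]                 # hypothetical cumulative influences
tau_grid = [0.1, 0.3, 1.0, 3.0, 5.0, 10.0]   # grid reported for tuning

# Probability mass assigned to the highest-influence tokenizer at each temperature.
top_probability = {tau: max(softmax_with_temperature(influences, tau)) for tau in tau_grid}
```

At the smallest temperature almost all probability goes to the highest-influence group, while at the largest temperature the top group's share approaches the uniform value of one third.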
6.3.4. Performance Comparison w.r.t. Long-tail Items
A key motivation for MTGRec is to improve recommendations for long-tail items. This analysis evaluates its performance across different item popularity groups.
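The image is a chart comparing performance on long-tail (low-frequency) items on the Instrument and Scientific datasets. The bar chart on the left shows the number of interactions in the test data for each group, and the line chart on the right shows the ratio of Recall@10 improvement relative to TIGER.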
Figure 5: Performance Comparison w.r.t. Long-tail Items on the Instrument and Scientific datasets. The bar graph illustrates the number of interactions in the test data for each group, while the line chart displays the improvement ratios for Recall in comparison to TIGER.
Observations:
- MTGRec consistently outperforms the baseline model (TIGER) across all item groups (defined by interaction counts in the test data). The line chart shows positive improvement ratios for Recall@10 across all groups.
- Significant Improvement for Long-tail Items: MTGRec shows its largest improvement for unpopular (long-tail) items, specifically the group with [0, 20) interactions, where the improvement ratio is notably higher than for more popular items.
- Conclusion: This phenomenon directly supports the paper's hypothesis that multi-identifier item tokenization benefits long-tail items. By associating items with multiple identifiers, their tokens gain increased exposure frequency and incorporate more knowledge from shared tokens across different semantic contexts. This helps the model learn more robust representations for items that are rarely seen under the original one-to-one mapping.
6.3.5. Applying MTGRec on Other Generative Recommendation Methods
The paper claims MTGRec can be seamlessly integrated into other generative recommendation methods that use a trainable item tokenizer. This section verifies its general applicability.
The following are the results from Table 4 of the original paper:
| Methods | Instrument Recall@10 | Instrument NDCG@10 | Scientific Recall@10 | Scientific NDCG@10 |
| --- | --- | --- | --- | --- |
| TIGER | 0.0568 | 0.0307 | 0.0423 | 0.0225 |
| +MTGRec | 0.0598 | 0.0329 | 0.0465 | 0.0245 |
| LETTER | 0.0580 | 0.0313 | 0.0435 | 0.0232 |
| +MTGRec | 0.0614 | 0.0335 | 0.0481 | 0.0255 |
| TIGER++ | 0.0588 | 0.0316 | 0.0450 | 0.0241 |
| +MTGRec | 0.0635 | 0.0346 | 0.0506 | 0.0271 |
Observations:
- The results show that applying +MTGRec (i.e., integrating MTGRec's multi-identifier pre-training strategy) consistently improves the performance of TIGER, LETTER, and TIGER++ across both the Instrument and Scientific datasets for both Recall@10 and NDCG@10.
- General Applicability: This verifies the general applicability and portability of the MTGRec approach. Generating semantically relevant sequence data from RQ-VAE checkpoints of adjacent epochs yields homogeneous knowledge that can enhance various generative recommender backbones.
6.3.6. Multiple Identifier Difference Analysis
This section analyzes the relevance and differences of item identifiers generated by item tokenizers from different training epochs. Two metrics are used: proportion of items whose first token changed and proportion of items with any token changes.
The following are the results from Table 5 of the original paper:
| Intervals | Instrument First | Instrument Any | Scientific First | Scientific Any | Game First | Game Any |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.39% | 13.58% | 0.27% | 11.4% | 0.36% | 9.36% |
| 5 | 0.44% | 21.26% | 0.58% | 22.54% | 0.58% | 21.22% |
| 10 | 0.51% | 29.75% | 0.51% | 30.68% | 0.57% | 30.43% |
| 20 | 0.75% | 44.09% | 0.71% | 47.33% | 0.79% | 47.42% |
| 30 | 0.87% | 54.94% | 0.85% | 58.29% | 1.14% | 59.95% |
Observations:
- Adjacent Tokenizers (Interval 1): For tokenizers from adjacent epochs, the item identifiers show minimal changes. Only a small percentage of items (e.g., 0.39% on Instrument) have their first token changed, and a moderate percentage (e.g., 13.58% on Instrument) have any token changed. This indicates strong semantic consistency between adjacent checkpoints.
- Increasing Interval: As the epoch interval between tokenizers increases, the share of item identifiers with any token changed rises sharply, while the share with the first token changed rises only slightly. For an interval of 30, around 55-60% of items have at least one token changed. This suggests that larger intervals lead to more distinct identifiers, which aligns with the finding in Section 6.3.2 about the trade-off between diversity and semantic relevance.
- Preservation of Core Semantics: Despite significant changes in any token, the proportion of first-token changes stays at or below roughly 1% even for large intervals (at most 1.14%, on Game at interval 30). This is important because the first token often encodes the most coarse-grained or dominant semantic information of an item. Its stability suggests that the core semantic knowledge is largely preserved across these semantically relevant tokenizers, preventing severe semantic conflicts even as finer details of the identifiers evolve.
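The two statistics in Table 5 can be computed directly from two tokenizers' item-to-identifier mappings. The following Python sketch shows one way to do so; the function name and the four toy items are hypothetical and only illustrate the definitions of "First" and "Any".

```python
def identifier_change_rates(ids_a, ids_b):
    """Return (share of items whose first token differs, share with any token differing)."""
    shared_items = ids_a.keys() & ids_b.keys()
    first_changed = sum(ids_a[item][0] != ids_b[item][0] for item in shared_items)
    any_changed = sum(tuple(ids_a[item]) != tuple(ids_b[item]) for item in shared_items)
    count = len(shared_items)
    return first_changed / count, any_changed / count

# Identifiers assigned by two hypothetical tokenizer checkpoints.
checkpoint_a = {"item1": (3, 7, 1), "item2": (5, 2, 9), "item3": (8, 8, 0), "item4": (1, 4, 6)}
checkpoint_b = {"item1": (3, 7, 1), "item2": (5, 2, 4), "item3": (2, 8, 0), "item4": (1, 4, 6)}
first_rate, any_rate = identifier_change_rates(checkpoint_a, checkpoint_b)
```

In the toy data one item changes its first token and another changes only its last token, so the "Any" rate is twice the "First" rate, the same qualitative pattern as in the table.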
6.3.7. Efficiency Analysis
This section analyzes the computational efficiency of MTGRec compared to baselines, specifically focusing on training time and epochs for convergence.
The following are the results from Table 6 of the original paper:
| Methods | Instrument Time | Instrument Epoch | Scientific Time | Scientific Epoch | Game Time | Game Epoch |
| --- | --- | --- | --- | --- | --- | --- |
| TIGER | 1.33 h | 186 | 1.04 h | 184 | 2.19 h | 253 |
| TIGER++ | 1.22 h | 178 | 1.02 h | 187 | 2.23 h | 264 |
| MTGRec | 1.41 h | 209 | 1.21 h | 217 | 2.11 h | 248 |
Observations:
- MTGRec's training time is comparable to that of TIGER and TIGER++. For instance, on Instrument, MTGRec takes 1.41 h compared to 1.33 h for TIGER and 1.22 h for TIGER++. On Game, MTGRec is slightly faster than both TIGER and TIGER++.
- The number of epochs for convergence for MTGRec is also in a similar range to the baselines.
- Efficiency Conclusion: The multi-identifier pre-training strategy introduced by MTGRec does not introduce excessive training time costs despite augmenting data and adding a curriculum learning layer. This is a crucial finding, as it means the significant performance gains of MTGRec do not come at the expense of impractical training duration.
- Curriculum Learning Effect: The curriculum learning scheme even appears to accelerate model convergence on the Game dataset (248 epochs for MTGRec vs. 253 for TIGER and 264 for TIGER++), suggesting efficiency benefits in some cases.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces MTGRec, a novel framework for generative recommender pre-training that addresses the limitations of one-to-one item tokenization. By proposing multi-identifier item tokenization, MTGRec associates each item with multiple semantically relevant identifiers, leveraging RQ-VAE checkpoints from adjacent training epochs. This strategy significantly augments the token sequence data, creating multiple data groups with related yet distinct semantic distributions. To effectively learn from this augmented data, MTGRec incorporates a curriculum recommender pre-training scheme that dynamically adjusts the sampling probabilities of these data groups based on data influence estimation using first-order gradient approximation. Finally, the pre-trained model undergoes fine-tuning with a single tokenizer to ensure precise item identification for deployment. Extensive experiments on three real-world datasets consistently demonstrate MTGRec's superior performance and scalability over both traditional and existing generative recommendation baselines, particularly benefiting long-tail items and without incurring excessive training costs.
7.2. Limitations & Future Work
The authors acknowledge the following limitations and suggest future research directions:
- Model Scaling Constraints: While MTGRec shows improved scalability compared to baselines, the positive correlation between model scale and performance is currently constrained, unlike in LLMs that scale to billions of parameters. This limitation might stem from the difficulty of generating sufficiently diverse data while maintaining semantic relevance when RQ-VAE checkpoints are separated by too many training epochs.
- Generalized Recommendation Scenarios: The current work focuses on sequential recommendation. Future work could adapt multi-identifier item tokenization to more generalized scenarios, such as transferable recommendation (applying models across different domains or tasks) and multi-domain recommendation (recommending items from multiple catalogs simultaneously).
- Further Model Scaling: The authors plan to investigate the scaling effect when the model parameters are increased to the billion level, pushing the boundaries of generative recommenders.
7.3. Personal Insights & Critique
This paper presents a highly innovative and practical solution to a fundamental limitation in generative recommendation. The idea of using RQ-VAE checkpoints from adjacent epochs is particularly clever; it's a simple yet effective way to generate semantically relevant variations of item representations without needing to train multiple independent tokenizers or complex data augmentation schemes. This method ensures that the augmented data maintains a crucial level of coherence, which is essential for effective learning.
Key Strengths:
- Elegant Data Augmentation: The multi-identifier tokenization using RQ-VAE checkpoints is an elegant solution to the data diversity and sparsity problem for long-tail items. It leverages the inherent evolution of a tokenizer during training to create meaningful variations.
- Effective Curriculum Learning: The data curriculum scheme based on data influence estimation is well-integrated. It intelligently guides the model towards more impactful data, optimizing the learning process and contributing to overall performance gains.
- Strong Experimental Validation: The comprehensive experiments, including ablation studies and analyses on model scale, tokenizer number, temperature coefficient, and long-tail items, provide robust evidence for the proposed method's effectiveness. The demonstration of its applicability to other generative recommendation methods further strengthens its value.
- Practical Applicability: The solution does not introduce excessive computational overhead, making it practical for real-world deployment, especially given the fine-tuning step to ensure accurate item identification.
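To make the curriculum idea more concrete, here is a minimal sketch of how per-group influence scores could be turned into sampling probabilities with a temperature coefficient. The softmax form, the function names, and the influence values are assumptions chosen for illustration; the paper's exact update rule may differ.

```python
import math
import random


def group_sampling_probs(influences, temperature=1.0):
    """Softmax over per-group influence scores (assumed form, not the paper's)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in influences]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_group(probs, rng):
    """Draw one data-group index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical influence scores, one per data group (i.e. per tokenizer checkpoint).
influences = [0.2, 0.5, 0.1]
probs = group_sampling_probs(influences, temperature=0.5)

# Simulate which data group each pre-training batch would be drawn from.
rng = random.Random(0)
counts = [0] * len(probs)
for _ in range(1000):
    counts[sample_group(probs, rng)] += 1
```

A lower temperature concentrates sampling on the highest-influence group, while a very high temperature approaches uniform sampling over all groups.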
Potential Issues/Areas for Improvement:
- Defining "Semantically Relevant": While the paper argues for adjacent epoch checkpoints, the exact boundaries for semantic relevance (e.g., how far apart epochs can be before tokenizers become "extraneous") could be further explored theoretically or empirically. The multiple identifier difference analysis offers some insight, but a more formal definition or a dynamic selection mechanism for checkpoints could be beneficial.
- Generalizability of Influence Estimation: The first-order gradient approximation for data influence is a heuristic. While effective, its robustness across different recommender architectures or more complex datasets could be investigated. More sophisticated influence estimation techniques might offer further gains.
- Scaling beyond Current Limits: The acknowledged limitation regarding extreme model scaling suggests that the current augmentation strategy, while good, might still not match the data demands of truly LLM-scale recommenders. Future work on generating even more diverse and semantically coherent token sequences (perhaps by incorporating external knowledge or more advanced generative techniques) would be crucial for breaking this barrier.
- Interpretability of Token Identifiers: While RQ-VAE creates semantic tokens, the interpretability of these multi-identifiers and how specific token changes relate to item features could be an interesting area for deeper analysis.
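The heuristic nature of a first-order influence estimate can be seen in a toy example. The sketch below uses a generic gradient-alignment approximation (the estimated drop in validation loss after one SGD step on a training group is the learning rate times the dot product of the two gradients) on a one-parameter linear model. This is an illustrative stand-in under stated assumptions, not the paper's estimator, and all data values are hypothetical.

```python
def mse(w, xs, ys):
    """Mean squared error of the 1-D linear model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grad(w, xs, ys):
    """Gradient of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def first_order_influence(w, train, val, lr):
    """First-order estimate of the validation-loss drop after one step on `train`."""
    return lr * mse_grad(w, *train) * mse_grad(w, *val)


w, lr = 0.0, 0.01
val = ([1.0, 2.0], [2.0, 4.0])        # validation data with slope 2
helpful = ([1.0, 3.0], [2.0, 6.0])    # training group consistent with validation
harmful = ([1.0, 3.0], [-2.0, -6.0])  # training group with the opposite slope

est_helpful = first_order_influence(w, helpful, val, lr)
est_harmful = first_order_influence(w, harmful, val, lr)

# Compare the estimate with the actual loss change after one SGD step.
w_after = w - lr * mse_grad(w, *helpful)
actual_helpful = mse(w, *val) - mse(w_after, *val)
```

The estimate and the actual loss change agree closely here because the step is small; with larger steps or highly non-linear models the gap can widen, which is one reason the robustness of such approximations is worth probing.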
Transferability and Future Value:
The core idea of breaking the one-to-one mapping for item representation is highly transferable. This framework could be applied to:
- Other domains: Any domain using item identifiers (e.g., knowledge graph entities, genomic sequences) could potentially benefit from multi-identifier representation to improve robustness and handle long-tail entities.
- Different generative tasks: The multi-identifier concept could extend beyond recommendation to other generative tasks where discrete entities need rich, varied representations.
- Data-centric AI: This work exemplifies a data-centric approach, where improving the data (augmenting its diversity and quality) is key to unlocking model performance, rather than solely focusing on model architecture.
Overall, MTGRec represents a significant step forward in making generative recommender systems more robust and scalable, particularly for the challenging long-tail problem. Its innovations provide a strong foundation for future research in data augmentation and curriculum learning within generative AI for recommendation.