TokenRec: Learning to Tokenize ID for LLM-based Generative Recommendation
TL;DR Summary
TokenRec is introduced as a novel framework for enhancing LLM-based recommendation systems by effectively tokenizing user and item IDs. Featuring the Masked Vector-Quantized Tokenizer and generative retrieval, it captures high-order collaborative knowledge, improving recommendation accuracy, generalization to new/unseen users and items, and inference efficiency.
Abstract
There is a growing interest in utilizing large-scale language models (LLMs) to advance next-generation Recommender Systems (RecSys), driven by their outstanding language understanding and in-context learning capabilities. In this scenario, tokenizing (i.e., indexing) users and items becomes essential for ensuring a seamless alignment of LLMs with recommendations. While several studies have made progress in representing users and items through textual contents or latent representations, challenges remain in efficiently capturing high-order collaborative knowledge into discrete tokens that are compatible with LLMs. Additionally, the majority of existing tokenization approaches often face difficulties in generalizing effectively to new/unseen users or items that were not in the training corpus. To address these challenges, we propose a novel framework called TokenRec, which introduces not only an effective ID tokenization strategy but also an efficient retrieval paradigm for LLM-based recommendations. Specifically, our tokenization strategy, Masked Vector-Quantized (MQ) Tokenizer, involves quantizing the masked user/item representations learned from collaborative filtering into discrete tokens, thus achieving a smooth incorporation of high-order collaborative knowledge and a generalizable tokenization of users and items for LLM-based RecSys. Meanwhile, our generative retrieval paradigm is designed to efficiently recommend top-$K$ items for users, eliminating the need for the time-consuming auto-regressive decoding and beam search processes used by LLMs and thus significantly reducing inference time. Comprehensive experiments validate the effectiveness of the proposed methods, demonstrating that TokenRec outperforms competitive benchmarks, including both traditional recommender systems and emerging LLM-based recommender systems.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is "TokenRec: Learning to Tokenize ID for LLM-based Generative Recommendation".
1.2. Authors
The authors of the paper are:
- Haohao Qu
- Wenqi Fan
- Zihuai Zhao
- Qing Li, Fellow, IEEE
Their affiliations primarily appear to be with The Hong Kong Polytechnic University, based on the author biographies provided.
1.3. Journal/Conference
The paper is published on arXiv.
arXiv is a reputable open-access preprint server for research papers in fields such as physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While it is not a peer-reviewed journal or conference in itself, it is a widely used platform for researchers to disseminate their work quickly and receive feedback before, or in parallel with, formal peer review. Papers on arXiv are typically considered preprints, but many later undergo peer review and are published in top-tier venues.
1.4. Publication Year
The paper was first posted on 2024-06-15, which indicates a publication year of 2024.
1.5. Abstract
This paper introduces TokenRec, a novel framework designed to enhance Large Language Model (LLM)-based Recommender Systems (RecSys). The core problem addressed is the efficient and generalizable tokenization (indexing) of users and items, particularly in a way that captures high-order collaborative knowledge and is compatible with LLMs, while also overcoming challenges in handling new or unseen users/items and the computational inefficiency of traditional LLM inference. TokenRec proposes a Masked Vector-Quantized (MQ) Tokenizer that quantizes masked user/item representations learned from collaborative filtering into discrete tokens. This approach seamlessly integrates collaborative knowledge and offers generalizable tokenization. Additionally, it features a generative retrieval paradigm that efficiently recommends top-$K$ items by generating item representations and retrieving them from a pool, thereby circumventing the time-consuming auto-regressive decoding and beam search processes typically used by LLMs. Comprehensive experiments on four datasets demonstrate that TokenRec outperforms both traditional and emerging LLM-based RecSys benchmarks, showcasing superior recommendation performance and better generalization to new users and items.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2406.10450
- PDF Link: https://arxiv.org/pdf/2406.10450v3.pdf
- Publication Status: The paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The integration of Large Language Models (LLMs) into Recommender Systems (RecSys) has garnered significant interest due to LLMs' advanced language understanding, reasoning, and in-context learning capabilities. However, several critical challenges hinder the effective and efficient application of LLMs in personalized recommendations:
- ID Tokenization Compatibility: LLMs are primarily designed to process natural language tokens. Representing the vast number of discrete user and item IDs (which can easily number in the billions in real-world systems) as LLM-compatible tokens poses a significant challenge. Assigning a unique token to each user/item (known as Independent Indexing (IID)) leads to an unmanageable vocabulary size for LLMs. While methods like using textual descriptions or continuous embeddings exist, they often struggle to capture the rich, high-order collaborative knowledge inherent in user-item interactions.
- Capturing High-Order Collaborative Knowledge: Traditional Collaborative Filtering (CF) methods, especially those leveraging Graph Neural Networks (GNNs), excel at learning complex, high-order relationships from user-item interaction graphs. The challenge is how to effectively embed this intricate collaborative knowledge into the discrete tokens that LLMs can process, without losing crucial information.
- Generalizability to New/Unseen Users/Items (Cold-Start Problem): Existing tokenization approaches often struggle to generalize to users or items that were not part of the training data. This "cold-start" problem is pervasive in RecSys, as new users and items are constantly introduced. Retraining large LLM-based models for every new entry is computationally prohibitive.
- Inference Efficiency: Many LLM-based RecSys rely on auto-regressive decoding and beam search to generate recommendations as textual outputs (e.g., item titles). These processes are computationally intensive and slow, making them impractical for real-time recommendation scenarios, which demand high efficiency. Furthermore, LLMs can suffer from hallucination (generating non-existent items) and context length limitations when provided with extensive interaction histories.

The paper's entry point is to address these challenges by proposing a novel framework that reimagines ID tokenization and recommendation generation for LLM-based RecSys. The innovative idea is to apply Vector Quantization to GNN-learned collaborative representations of users and items, combined with a generative retrieval approach, to create an efficient, generalizable, and LLM-compatible recommendation system.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of LLM-based Recommender Systems:
- Novel ID Tokenization Strategy (Masked Vector-Quantized Tokenizer, MQ-Tokenizer): The paper introduces an effective and generalizable strategy for tokenizing user and item IDs. The MQ-Tokenizer quantizes masked user/item representations, which are initially learned from collaborative filtering (specifically GNNs), into discrete tokens. This approach seamlessly integrates high-order collaborative knowledge into LLM-compatible tokens. It incorporates two novel mechanisms:
  - Masking Operation: Enhances the tokenizer's generalization capability by creating a challenging reconstruction task.
  - K-way Encoder: Performs multi-head feature extraction, paired with a corresponding K-way codebook for robust latent feature quantization.
- Efficient Generative Retrieval Paradigm: TokenRec proposes a generative retrieval mechanism for recommendations. Instead of relying on time-consuming auto-regressive decoding and beam search to generate textual item descriptions, TokenRec generates a generative representation of a user's preference and then efficiently retrieves the top-$K$ items from a pre-computed item pool based on similarity matching. This significantly reduces inference time and mitigates issues like hallucination and context length limitations.
- Enhanced Generalizability to New Users/Items: The proposed MQ-Tokenizer and the overall TokenRec framework demonstrate strong generalization capabilities to new and unseen users and items. By updating only the lightweight GNN component for new entities and keeping the MQ-Tokenizer and LLM backbone frozen, TokenRec effectively addresses the cold-start problem without costly retraining of the entire LLM.
- State-of-the-Art Recommendation Performance: Extensive experiments conducted on four real-world benchmark datasets (Amazon-Beauty, Amazon-Clothing, LastFM, and MovieLens 1M) validate the effectiveness of TokenRec. It consistently outperforms competitive benchmarks, including both traditional recommender systems (e.g., MF, LightGCN) and cutting-edge LLM-based recommender systems (e.g., P5, TIGER, CoLLM).
- Efficiency Gains: The generative retrieval paradigm leads to substantial improvements in inference efficiency, achieving approximately 1000-1400% acceleration compared to existing LLM-based methods.
- Concise Prompts: TokenRec can leverage the collaborative knowledge embedded in user ID tokens to make recommendations with only user ID tokens as input, significantly reducing prompt length and computational cost while circumventing LLM context length limitations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand TokenRec, it's essential to grasp several fundamental concepts from recommender systems and large language models:
- Recommender Systems (RecSys): Systems designed to predict user preferences and suggest items (e.g., movies, products, music) that users might like. They aim to alleviate information overload by filtering irrelevant content and surfacing personalized recommendations.
- Collaborative Filtering (CF): A widely used technique in RecSys that makes recommendations based on the principle that users who agreed in the past (e.g., liked similar items) will agree in the future. It works by identifying users with similar tastes or items with similar appeal.
- User-Item Interaction Matrix: A sparse matrix where rows represent users, columns represent items, and entries indicate interactions (e.g., ratings, purchases, clicks).
- High-Order Collaborative Knowledge: This refers to complex, indirect relationships between users and items that go beyond direct interactions. For example, "users who like item A also like item B, and users who like item B also like item C, so users who like item A might like item C." This knowledge is crucial for understanding nuanced preferences.
- Matrix Factorization (MF): A classic CF technique that decomposes the user-item interaction matrix into two lower-rank matrices: user latent factor matrix and item latent factor matrix. Each user and item is represented by a low-dimensional dense vector (embedding). The predicted interaction score is typically the dot product of the user and item embeddings.
- Graph Neural Networks (GNNs): A class of neural networks designed to operate on graph-structured data. In RecSys, user-item interactions can be modeled as a bipartite graph (users and items as nodes, interactions as edges). GNNs can propagate information across this graph, effectively capturing high-order collaborative signals by aggregating information from neighbors.
- LightGCN: A simplified yet powerful GNN for recommendations that removes feature transformation and non-linear activation from traditional GNNs, focusing solely on neighborhood aggregation for learning user and item embeddings.
- Large Language Models (LLMs): Deep learning models with billions of parameters, pre-trained on massive amounts of text data. They excel at understanding, generating, and reasoning with human language.
- Tokens: The fundamental units of text that LLMs process. These can be words, subwords, or characters. LLMs operate on a fixed vocabulary of tokens.
- Vocabulary Size: The total number of unique tokens an LLM can understand and generate.
- In-context Learning: The ability of LLMs to learn from examples provided in the prompt without explicit fine-tuning.
- Autoregressive Generation: The process by which LLMs generate text one token at a time, predicting the next token based on all previously generated tokens and the input prompt.
- Beam Search: A search algorithm used in sequence generation (like LLM text generation) to explore multiple possible sequences of tokens, aiming to find the most probable output sequence rather than just the single most likely next token at each step. It's more computationally expensive than greedy decoding.
- Hallucination: A phenomenon where LLMs generate plausible-sounding but factually incorrect or non-existent information. In RecSys, this could mean recommending non-existent item IDs or titles.
- Context Length Limitation: LLMs have a maximum number of tokens they can process in a single input (context window). Long user interaction histories can exceed this limit.
- Vector Quantization (VQ): A technique that maps high-dimensional input vectors to a discrete set of codebook vectors (codewords). It involves learning a codebook, where each entry is a codeword (vector), and then representing an input vector by the index of the closest codeword in the codebook. This effectively converts continuous representations into discrete tokens.
- Codebook: A learned collection of discrete vectors (codewords or embeddings), each associated with a unique index (token).
- Quantization: The process of mapping a continuous input vector to a discrete codeword from the codebook.
- Masking: A technique often used in self-supervised learning where parts of the input are intentionally hidden or corrupted, and the model is trained to reconstruct the original input. This forces the model to learn robust and comprehensive representations.
- Metric Learning: A machine learning paradigm focused on learning a distance metric or similarity function from data. In RecSys, it's used to learn representations where similar items (or user preferences) are close in the embedding space, and dissimilar ones are far apart.
- Projection Layer: A neural network layer (often an MLP) used to transform representations from one vector space to another, usually to align different modalities or dimensions.
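To make the Vector Quantization concept above concrete, here is a minimal NumPy sketch of nearest-codeword lookup; the array names (`codebook`, `z`) and sizes are illustrative, not taken from the paper.

```python
import numpy as np

def quantize(z: np.ndarray, codebook: np.ndarray) -> int:
    """Return the index of the codeword closest to z (Euclidean distance)."""
    distances = np.linalg.norm(codebook - z, axis=1)  # distance to every codeword
    return int(np.argmin(distances))                  # discrete token (codeword index)

# Toy example: a codebook with 4 codewords of dimension 3.
codebook = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
z = np.array([0.9, 0.1, 0.0])
print(quantize(z, codebook))  # -> 1, i.e. the second codeword is nearest
```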
3.2. Previous Works
The paper contextualizes its contributions against a backdrop of traditional and LLM-based recommender systems:
- Traditional Collaborative Filtering (CF):
- Matrix Factorization (MF) [5]: A foundational CF method that decomposes user-item interaction matrices into low-rank user and item embeddings. It represents users and items with unique IDs.
- NeuCF [39]: The first deep neural network (DNN)-based model for CF, combining MF with neural networks to learn user and item embeddings.
- LightGCN [6], GTN [7], LTGNN [34]: Representative GNN-based CF methods that capture high-order collaborative knowledge by modeling user-item interactions as graphs. They learn user and item representations through message passing on the interaction graph. These methods provide the initial collaborative representations that TokenRec quantizes.
- Sequential Recommendation Methods:
- SASRec [40]: An attention-based model for sequential recommendations, focusing on a user's recent interactions to predict the next item.
- BERT4Rec [41]: A bidirectional Transformer-based recommender that leverages the masked language model objective from BERT to predict masked items in a user's interaction sequence.
- S³Rec [42], CoSeRec [43]: Sequential recommendation models employing self-supervised learning techniques, often using contrastive learning, to learn robust sequence representations.
- LLM-based Recommender Systems:
- Independent Indexing (IID): A naive approach where each user and item is assigned a unique token. The paper highlights its impracticality due to vocabulary explosion in large-scale systems.
- Textual Title Indexing [11], [13]: Uses item titles and descriptions to represent items, leveraging LLMs' in-vocabulary tokens. While avoiding vocabulary explosion, it may not capture collaborative knowledge effectively.
- P5 [12]: A pioneering framework that unifies diverse recommendation tasks (e.g., rating prediction, sequential recommendation, explanation generation) into a text-to-text generation paradigm using prompt-based pre-training. It uses positional and whole-word embeddings for users/items. The paper mentions its variants P5-RID (Random Indexing) and P5-SID (Sequential Indexing).
- POD [45]: Another LLM-based approach applying positional and whole-word embeddings.
- CID (Collaborative Indexing) [44]: A P5 variant that attempts to capture co-occurrence frequency for item indexing, showing that integrating collaborative knowledge can improve performance over random or sequential indexing.
- TIGER [46]: Uses residual vector quantization to condense textual data into a few semantic IDs for items, which are then used as tokens in a Sequence-to-Sequence Transformer for sequential recommendation. TIGER-G is a variant that incorporates graph-based collaborative knowledge.
- CoLLM [16], LlaRA [15], E4SRec [62]: These methods borrow the concept of soft prompts or use exogenous tokens with continuous embeddings to represent users and items, integrating collaborative embeddings into LLMs. The paper points out that continuous representations can hinder tight alignment with LLMs, whose token processing is inherently discrete.
- META ID [63]: Suggests integrating collaborative knowledge into discrete tokens via clustering item/user representations from skip-gram models, but the paper argues it lacks a robust tokenizer for quantization.
3.3. Technological Evolution
The field of recommender systems has evolved from basic statistical methods to complex deep learning models:
- Early CF (e.g., User-based/Item-based CF): Relied on direct similarity between users or items.
- Matrix Factorization (MF): Introduced latent factors, allowing for more nuanced representations and better scalability.
- Neural Collaborative Filtering (NCF): Integrated deep learning into CF, moving beyond simple dot products to capture non-linear relationships.
- Graph Neural Networks (GNNs) for CF: Leveraged the graph structure of user-item interactions to capture complex, multi-hop (high-order) collaborative relationships, significantly improving representation learning.
- Sequential Recommendation Models: Focused on the temporal order of user interactions, using architectures like RNNs, LSTMs, and Transformers (e.g., SASRec, BERT4Rec) to model dynamic preferences.
- Large Language Models (LLMs) for RecSys: The latest frontier, attempting to harness the powerful language understanding and reasoning abilities of LLMs. This involves rephrasing recommendation as a language task (e.g., text-to-text generation, prompt-based recommendation).
This paper's work fits into the LLM-based RecSys era. It addresses the critical bottlenecks of ID tokenization and inference efficiency, which are major challenges in making LLM-based RecSys practical and performant for real-world scenarios.
3.4. Differentiation Analysis
TokenRec differentiates itself from previous works, especially other LLM-based methods, in several key aspects:
- Tokenization Strategy:
  - Unlike IID (vocabulary explosion) or textual title indexing (limited collaborative knowledge), TokenRec explicitly integrates high-order collaborative knowledge derived from GNNs into discrete, LLM-compatible tokens via a Masked Vector-Quantized (MQ) Tokenizer.
  - Compared to methods using continuous embeddings (e.g., CoLLM, LlaRA), TokenRec generates discrete tokens, addressing the potential misalignment between continuous representations and LLMs' inherently discrete token processing.
  - While TIGER also uses vector quantization to derive semantic IDs from textual data, TokenRec's MQ-Tokenizer is specifically designed to incorporate graph-based collaborative knowledge and includes novel masking and K-way encoder mechanisms for enhanced robustness and generalization, which are not present in TIGER's approach. TokenRec focuses on learning representations directly from collaborative signals rather than just condensing textual information.
  - Unlike META ID, which suggests clustering, TokenRec provides a robust, learnable MQ-Tokenizer with specific design choices (masking, K-way encoder) for effective quantization.
- Recommendation Paradigm: TokenRec adopts a generative retrieval paradigm instead of the auto-regressive decoding and beam search common in P5, CID, POD, and TIGER. This is a fundamental shift that significantly improves inference efficiency and mitigates hallucination and context length limitations. It generates a user's preference representation and then retrieves items, rather than generating item tokens sequentially.
- Generalizability: TokenRec explicitly addresses the cold-start problem for new/unseen users and items. By leveraging a lightweight GNN to learn representations for new entities and keeping the MQ-Tokenizer and LLM backbone frozen, it achieves robust generalization without costly retraining. Other LLM-based methods often experience significant performance drops for unseen users.
- Efficiency: The generative retrieval design makes TokenRec substantially more efficient at inference time compared to methods relying on auto-regressive generation.
4. Methodology
The TokenRec framework proposes a novel approach to integrate Large Language Models (LLMs) into recommender systems by addressing the core challenges of ID tokenization and efficient recommendation generation. It consists of two main modules: the Masked Vector-Quantized (MQ) Tokenizer for users and items, and a Generative Retrieval paradigm for recommendations.
4.1. Notations and Definitions
Let $\mathcal{U}$ be the set of users and $\mathcal{I}$ be the set of items. For a given user $u \in \mathcal{U}$, $\mathcal{I}_u^+$ denotes the set of items that user $u$ has interacted with in their history.
Users and items are embedded into low-dimensional latent vectors, referred to as collaborative representations, denoted as $\mathbf{x}_u \in \mathbb{R}^{d}$ for user $u$ and $\mathbf{x}_i \in \mathbb{R}^{d}$ for item $i$, where $d$ is the dimension of these vectors.
The traditional collaborative filtering (CF) goal is reformulated into a language model paradigm. Given an LLM, a textual prompt $\mathcal{P}$, user tokens $\mathbf{t}_u$, and tokens $\{\mathbf{t}_i : i \in \mathcal{I}_u^+\}$ for interacted items, the LLM aims to generate a representation of items that user $u$ might like:
$$\hat{\mathbf{e}}_u = \mathrm{LLM}\big(\mathcal{P}, \, \mathbf{t}_u, \, \{\mathbf{t}_i : i \in \mathcal{I}_u^+\}\big).$$
Here, $\mathrm{LLM}(\cdot)$ represents the large language model's processing function, $\mathcal{P}$ is the prompt, $\mathbf{t}_u$ are the tokens for user $u$, and $\mathbf{t}_i$ are the tokens for items in user $u$'s interaction history. The interacted items are placed in a non-sequential way to align with CF settings.
4.2. An Overview of the Proposed Framework
The overall framework of TokenRec consists of two main modules, as illustrated in Figure 6 (Figure 2 in the paper):
- Masked Vector-Quantized (MQ) Tokenizer for Users and Items: This module addresses the ID tokenization challenge. It learns specific codebooks and represents users and items with a list of special discrete tokens through encoder and decoder networks. This process aims to seamlessly integrate numerical IDs (users & items) into a natural-language-compatible form, incorporating high-order collaborative knowledge.
- Generative Retrieval for Recommendations: This module focuses on user modeling via an LLM for personalized recommendations. It employs a generative retrieval paradigm to efficiently generate item representations and retrieve the $K$-nearest items from the entire item set, producing a personalized top-$K$ recommendation list.
The overall framework of the proposed TokenRec, which consists of the masked vector-quantized tokenizer with a $K$-way encoder for item ID tokenization and the generative retrieval paradigm for recommendation generation. Note that the item MQ-Tokenizer is detailed while the user MQ-Tokenizer is omitted for simplicity.
The figure is a schematic of the TokenRec framework, covering the Masked Vector-Quantized Tokenizer and the retrieval mechanism for generating recommendations. The left side depicts how high-order collaborative knowledge is extracted via a GNN and how the K-way encoder produces token representations for users and items; the right side shows how the LLM generates recommendations from the user and item tokens, with the generated representation serving as input for the matching score.
4.3. Masked Vector-Quantized Tokenizers for Users and Items
Instead of assigning a specific token to each user and item, which would lead to an explosion in vocabulary size, TokenRec proposes a novel tokenization strategy using vector quantization. This method represents each user and item with a set of discrete indices (tokens). To capture high-order collaborative knowledge, vector quantization is applied to well-trained representations learned from advanced Graph Neural Networks (GNNs). To enhance generalization and overcome noise inherent in cascading tokenization processes, a Masked Vector-Quantized Tokenizer (MQ-Tokenizer) is introduced.
The MQ-Tokenizer comprises three key components:
- A masking operation on the input user/item representations.
- A K-way encoder for multi-head feature extraction, with a corresponding K-way codebook for latent feature quantization.
- A K-to-1 decoder that reconstructs the input representations from the quantized features.

It's important to note that separate MQ-Tokenizers are designed for users (User MQ-Tokenizer) and items (Item MQ-Tokenizer), but they share the same architecture. For simplicity, the paper details the Item MQ-Tokenizer.
4.3.1. Collaborative Knowledge
The primary goal of the MQ-Tokenizer is to embed high-order collaborative knowledge into latent representations through vector quantization. Collaborative knowledge, often derived from user-item interactions, reveals deep behavioral similarities and is critical for accurate recommendations. Graph Neural Networks (GNNs) are highly effective at capturing these high-order collaborative signals on user-item interaction graphs.
The MQ-Tokenizer quantizes these GNN-based collaborative representations into a small number of discrete tokens for each user and item. This means that users/items that are conceptually close in the collaborative latent space (i.e., have similar GNN-learned representations) will likely share similar tokens/indices, thereby aligning LLMs with recommendations by representing users and items with discrete, collaboratively-informed tokens.
4.3.2. Masking Operation
To create a more robust tokenizer with improved generalization capabilities, a masking operation is applied to the collaborative representations. This masking strategy forces the tokenizer to learn a more comprehensive understanding of the representations by reconstructing partially hidden inputs.
An element-wise masking strategy is introduced, with each mask entry sampled from a Bernoulli distribution:
$$\mathbf{m} \sim \mathrm{Bernoulli}(1 - \rho),$$
where $\rho$ is the masking ratio. The Bernoulli distribution assigns a value of 1 (keep) with probability $1 - \rho$ and 0 (mask) with probability $\rho$.
Given the original collaborative representations $\mathbf{x}_u$ for user $u$ and $\mathbf{x}_i$ for item $i$, the masking process is applied as:
$$\tilde{\mathbf{x}}_u = \mathbf{m}_u \odot \mathbf{x}_u, \qquad \tilde{\mathbf{x}}_i = \mathbf{m}_i \odot \mathbf{x}_i,$$
where $\odot$ denotes element-wise multiplication. Here, $\tilde{\mathbf{x}}_u$ and $\tilde{\mathbf{x}}_i$ are the masked representations. The mask is randomly regenerated at each training epoch, creating diverse samples to enhance the tokenizer's generalization.
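A minimal PyTorch-style sketch of this element-wise Bernoulli masking (illustrative only; the tensor names and shapes are assumptions, not the paper's code):

```python
import torch

def mask_representation(x: torch.Tensor, rho: float) -> torch.Tensor:
    """Element-wise masking: each entry is kept with probability 1 - rho.

    x:   collaborative representation, shape (batch, d)
    rho: masking ratio in [0, 1]
    """
    keep = torch.bernoulli(torch.full_like(x, 1.0 - rho))  # 1 = keep, 0 = mask
    return x * keep

# The mask is re-sampled on every call, so invoking this once per training
# epoch yields a fresh mask, as described above.
x = torch.randn(4, 64)          # e.g., 4 item representations of dimension 64
x_masked = mask_representation(x, rho=0.2)
```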
4.3.3. K-way Encoder and Codebook
The masked collaborative representations are then tokenized using a novel K-way vector quantization framework. This involves:
- K-way Encoder: A set of $K$ different encoders, denoted $\{\mathcal{E}^1, \ldots, \mathcal{E}^K\}$, processes the masked item representation $\tilde{\mathbf{x}}_i$ (or user representation $\tilde{\mathbf{x}}_u$) to generate $K$ corresponding latent vectors:
$$\mathbf{z}_i^k = \mathcal{E}^k(\tilde{\mathbf{x}}_i), \quad k = 1, \ldots, K,$$
where $\mathbf{z}_i^k \in \mathbb{R}^{d_z}$ is the $k$-th latent vector for item $i$, and $d_z$ is its dimension. Each encoder can be implemented as a Multilayer Perceptron (MLP) with three hidden layers. The use of $K$ different encoders allows for multiple perspectives or "attentions" to uncover different patterns in the input, improving generalization.
- K-way Codebook: A learnable codebook $\mathcal{C} = \{\mathcal{C}^1, \ldots, \mathcal{C}^K\}$ is developed for items (and similarly for users). Each $\mathcal{C}^k$ represents a sub-codebook associated with the $k$-th encoder. Each sub-codebook contains $L$ codewords (token embeddings), $\mathcal{C}^k = \{\mathbf{c}_l^k\}_{l=1}^{L}$, where $\mathbf{c}_l^k$ is the $l$-th codeword (embedding) in the $k$-th sub-codebook.
- Quantization: The encoded vectors are quantized into discrete tokens (indices) by finding the nearest neighbor in their respective sub-codebooks. For each encoded vector $\mathbf{z}_i^k$ and its corresponding sub-codebook $\mathcal{C}^k$, the token is found by minimizing the Euclidean distance:
$$t_i^k = \arg\min_{l \in \{1, \ldots, L\}} \left\| \mathbf{z}_i^k - \mathbf{c}_l^k \right\|_2.$$
Here, $t_i^k$ is the index (codeword/ID token) of the nearest neighbor in the $k$-th sub-codebook for item $i$. Thus, the discrete ID of item $i$ is tokenized into $K$ discrete codebook tokens $\left[t_i^1, \ldots, t_i^K\right]$ and their corresponding codeword embeddings $\left[\mathbf{c}_{t_i^1}^1, \ldots, \mathbf{c}_{t_i^K}^K\right]$. This process is also applied to tokenize users.
4.3.4. K-to-1 Decoder
After the $K$-way encoder and quantization, a K-to-1 decoder is used for input reconstruction. It takes the $K$ different embeddings corresponding to the selected tokens from the $K$-way codebook and reconstructs the original input representation.
Specifically, for item $i$ and its quantized tokens $[t_i^1, \ldots, t_i^K]$, the decoder $\mathcal{D}$ first performs average pooling on their embeddings and then passes the result through a three-layer MLP to generate the reconstructed representation $\hat{\mathbf{x}}_i$:
$$\hat{\mathbf{x}}_i = \mathcal{D}\left(\frac{1}{K} \sum_{k=1}^{K} \mathbf{c}_{t_i^k}^k\right).$$
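The following PyTorch sketch illustrates the K-way encoding, nearest-codeword quantization, and K-to-1 decoding described above. It is a simplified, assumed implementation: the class names, the two-layer MLP encoders, and the dimensions are illustrative rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class KWayQuantizer(nn.Module):
    def __init__(self, d_in: int, d_latent: int, K: int, L: int):
        super().__init__()
        # K independent encoders (each a small MLP), one per sub-codebook.
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, d_latent), nn.ReLU(),
                          nn.Linear(d_latent, d_latent))
            for _ in range(K)
        )
        # K sub-codebooks, each holding L learnable codewords of size d_latent.
        self.codebooks = nn.Parameter(torch.randn(K, L, d_latent))

    def forward(self, x_masked: torch.Tensor):
        """Return K discrete tokens and their codeword embeddings for a batch."""
        tokens, embeddings = [], []
        for k, encoder in enumerate(self.encoders):
            z_k = encoder(x_masked)                        # (batch, d_latent)
            dists = torch.cdist(z_k, self.codebooks[k])    # (batch, L)
            idx = dists.argmin(dim=-1)                     # nearest codeword index
            tokens.append(idx)
            embeddings.append(self.codebooks[k][idx])      # (batch, d_latent)
        return torch.stack(tokens, dim=1), torch.stack(embeddings, dim=1)

class KTo1Decoder(nn.Module):
    """Reconstruct the original representation from the K selected codewords."""
    def __init__(self, d_latent: int, d_out: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_latent, d_latent), nn.ReLU(),
            nn.Linear(d_latent, d_latent), nn.ReLU(),
            nn.Linear(d_latent, d_out),
        )

    def forward(self, codeword_embs: torch.Tensor) -> torch.Tensor:
        pooled = codeword_embs.mean(dim=1)   # average pooling over the K codewords
        return self.mlp(pooled)              # reconstructed representation

quantizer = KWayQuantizer(d_in=64, d_latent=32, K=4, L=256)
decoder = KTo1Decoder(d_latent=32, d_out=64)
tokens, embs = quantizer(torch.randn(8, 64))   # 8 masked item representations
x_rec = decoder(embs)                          # (8, 64) reconstructed representations
```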
4.3.5. Learning Objective
To train the K-way encoder, codebook, and K-to-1 decoder for both user and item MQ-Tokenizers, a combined learning objective is defined:
- Reconstruction Loss ($\mathcal{L}_{recon}$): This loss encourages the reconstructed representation $\hat{\mathbf{x}}_i$ from the decoder to approximate the original GNN-learned representation $\mathbf{x}_i$:
$$\mathcal{L}_{recon} = \left\| \hat{\mathbf{x}}_i - \mathbf{x}_i \right\|_2^2.$$
A challenge here is that the $\arg\min$ operation in Equation (5) is non-differentiable. To address this, a straight-through gradient estimator is used, which directly passes the gradients of the decoder inputs (selected token embeddings) to the encoder outputs (encoded representations) during backpropagation.
- Codebook Loss ($\mathcal{L}_{code}$): This loss helps update the item's K-way codebook by pulling the selected token's embedding closer to the output of the K-way encoder:
$$\mathcal{L}_{code} = \sum_{k=1}^{K} \left\| \mathrm{sg}\left[\mathbf{z}_i^k\right] - \mathbf{c}_{t_i^k}^k \right\|_2^2.$$
A stop-gradient operator $\mathrm{sg}[\cdot]$ is used to ensure that gradients only flow to the codebook and not to the encoder during this part of the loss calculation; it sets the gradient of its argument to zero during backpropagation.
- Commitment Loss ($\mathcal{L}_{commit}$): This loss prevents the encoded features from fluctuating too frequently between different codewords, ensuring a smooth gradient flow around the $\arg\min$ operation. Unlike the codebook loss, it only applies to the encoder weights:
$$\mathcal{L}_{commit} = \sum_{k=1}^{K} \left\| \mathbf{z}_i^k - \mathrm{sg}\left[\mathbf{c}_{t_i^k}^k\right] \right\|_2^2.$$
The overall optimization objective for the item MQ-Tokenizer is a weighted sum of these losses:
$$\mathcal{L}_{item} = \mathcal{L}_{recon} + \mathcal{L}_{code} + \beta \, \mathcal{L}_{commit},$$
where $\beta$ is a hyper-parameter balancing the commitment loss. A similar objective $\mathcal{L}_{user}$ is defined for the user MQ-Tokenizer.
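These three losses mirror the standard VQ-VAE objective, which can be sketched as follows; the helper name `mq_tokenizer_loss` and the weighting variable `beta` are assumptions used only for illustration.

```python
import torch
import torch.nn.functional as F

def mq_tokenizer_loss(x, z, codewords, decoder, beta: float = 0.25):
    """VQ-VAE-style objective for one (masked) representation batch.

    x:         original collaborative representation, (batch, d)
    z:         K-way encoder outputs, (batch, K, d_latent)
    codewords: selected codeword embeddings, (batch, K, d_latent)
    decoder:   K-to-1 decoder mapping (batch, K, d_latent) -> (batch, d)
    """
    # Straight-through estimator: the forward pass uses the codewords, but
    # gradients flow back to the encoder outputs z as if they were unchanged.
    quantized = z + (codewords - z).detach()

    recon_loss = F.mse_loss(decoder(quantized), x)        # reconstruction
    codebook_loss = F.mse_loss(codewords, z.detach())     # move codewords toward sg[z]
    commitment_loss = F.mse_loss(z, codewords.detach())   # keep z close to sg[codewords]
    return recon_loss + codebook_loss + beta * commitment_loss
```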
4.4. Generative Retrieval for Recommendations
This subsection describes how TokenRec leverages LLMs for recommendations using a generative retrieval paradigm. This involves tokenization & prompting, user modeling via LLM, and generative retrieval.
4.4.1. Tokenization & Prompts
- Tokenization: The MQ-Tokenizers generate out-of-vocabulary (OOV) tokens for user and item IDs. This is crucial because standard LLM vocabularies (e.g., 32,000 tokens for LLaMA) are far too small for millions or billions of users/items. By using $K \times L$ OOV tokens (where $K$ is the number of sub-codebooks and $L$ is the number of tokens per sub-codebook), TokenRec can efficiently tokenize a massive number of users/items; for instance, a small set of OOV tokens can tokenize all 39,387 items in the Clothing dataset. Textual content within prompts (if any) is handled by the LLM's native tokenizer (e.g., SentencePiece).
- Prompts: Prompts guide the LLM. TokenRec designs prompts that use the OOV tokens generated by the MQ-Tokenizers to represent users and items.
  - Prompt 1 (User ID Only): the prompt contains only the user's ID tokens.
  - Prompt 2 (with User's Historical Interactions): the prompt additionally includes the ID tokens of the user's historically interacted items. Here, each OOV token corresponds to one sub-codebook index; e.g., the user token from the second sub-codebook might be the 21st token in that sub-codebook for user_03. Item tokens are denoted similarly. The interactions in $\mathcal{I}_u^+$ are randomly shuffled.
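As a purely illustrative example of how such prompts might be assembled from the OOV tokens, consider the snippet below; the token spelling (`<u_k-l>`, `<i_k-l>`) and the template wording are hypothetical and not the paper's exact templates.

```python
def user_tokens(user_codes):
    """Render a user's K codebook indices as OOV tokens, e.g. <u_2-21>."""
    return " ".join(f"<u_{k}-{l}>" for k, l in enumerate(user_codes, start=1))

def item_tokens(item_codes):
    return " ".join(f"<i_{k}-{l}>" for k, l in enumerate(item_codes, start=1))

# Prompt 1 (user ID only) and Prompt 2 (with shuffled interaction history).
user = user_tokens([5, 21, 7])
history = [item_tokens(c) for c in ([12, 3, 88], [40, 1, 9])]
prompt_1 = f"What items would user {user} be interested in?"
prompt_2 = f"User {user} has interacted with items {', '.join(history)}. What else would they like?"
print(prompt_2)
```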
4.4.2. User Modeling via LLM
This component aims to capture user preferences and generate representations of items a user might like.
The input $X_u$ for user $u$ is formed by a prompt template and the corresponding ID tokens:
$$X_u = \big[\mathcal{P}; \, \mathbf{t}_u; \, \{\mathbf{t}_i : i \in \mathcal{I}_u^+\}\big],$$
where $\mathbf{t}_u$ are the ID tokens from the user MQ-Tokenizer for $u$, and $\mathbf{t}_i$ are the ID tokens from the item MQ-Tokenizer for $u$'s interacted items.
Unlike conventional text-to-text generation (e.g., P5), where LLMs auto-regressively generate item tokens, TokenRec passes $X_u$ through an LLM backbone, denoted $f_{\mathrm{LLM}}$, to generate a hidden representation $\mathbf{h}_u$:
$$\mathbf{h}_u = f_{\mathrm{LLM}}(X_u).$$
This $\mathbf{h}_u$ represents user $u$'s generative preferences for the next items. The LLM4Rec module acts as a powerful query encoder, leveraging LLM capabilities in understanding diverse prompts, interpreting preferences, and generating desired outcomes beyond just text.
4.4.3. Generative Retrieval
This paradigm aims to overcome the inefficiencies and limitations (hallucination, context length) of auto-regressive generation.
The hidden state $\mathbf{h}_u$ from $f_{\mathrm{LLM}}$ is projected into a latent representation $\hat{\mathbf{e}}_u$ via a projection layer $f_{\mathrm{proj}}$:
$$\hat{\mathbf{e}}_u = f_{\mathrm{proj}}(\mathbf{h}_u),$$
where $f_{\mathrm{proj}}$ can be a three-layer MLP. This $\hat{\mathbf{e}}_u$ is the generative representation of the next recommended items for user $u$.
Subsequently, TokenRec retrieves the $K$-nearest items from the entire item set $\mathcal{I}$. This is done by calculating similarity scores between $\hat{\mathbf{e}}_u$ and the GNN-based collaborative representations of all items (stored in a vector database).
The predicted similarity score for user $u$ towards item $i$ is calculated using cosine similarity:
$$\hat{y}_{u,i} = \cos\big(\hat{\mathbf{e}}_u, \mathbf{x}_i\big) = \frac{\hat{\mathbf{e}}_u^{\top} \mathbf{x}_i}{\|\hat{\mathbf{e}}_u\| \, \|\mathbf{x}_i\|}.$$
The top-$K$ items with the highest scores are recommended. This approach offers:
- Efficiency: Bypasses time-consuming auto-regressive decoding and beam search.
- Accuracy: Avoids hallucination by retrieving from a valid item pool.
- Generalizability: Unseen items can be retrieved by simply updating the item representations pool without retraining the LLM.
- Alignment: This two-tower-like structure facilitates alignment between textual query information and collaborative knowledge.
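A hedged sketch of this retrieval step, i.e., cosine similarity between the generative user representation and a bank of GNN item representations followed by top-K selection; the names `user_repr` and `item_bank` are illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(user_repr: torch.Tensor, item_bank: torch.Tensor, k: int = 20):
    """user_repr: (d,) generative representation; item_bank: (num_items, d) item embeddings."""
    scores = F.cosine_similarity(user_repr.unsqueeze(0), item_bank, dim=-1)  # (num_items,)
    top_scores, top_items = torch.topk(scores, k)
    return top_items, top_scores

item_bank = F.normalize(torch.randn(1000, 64), dim=-1)   # stand-in for the item vector database
user_repr = torch.randn(64)
items, scores = retrieve_top_k(user_repr, item_bank, k=20)
```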
4.5. TokenRec's Training and Inference
4.5.1. Training
TokenRec uses a two-step training process to handle the gap between quantization and language processing:
- Step 1. Training the User & Item MQ-Tokenizers: The MQ-Tokenizers for users and items are trained independently to quantize collaborative representations, using the combined losses $\mathcal{L}_{user}$ and $\mathcal{L}_{item}$ (Equations (13) and (14) in the paper).
- Step 2. Tuning the LLM4Rec for Generative Retrieval: After training, the MQ-Tokenizers are frozen. The LLM backbone (e.g., T5), the LLM token embeddings, and the projection layer are then tuned. The objective is to learn to generate user preference representations $\hat{\mathbf{e}}_u$ that are close to positive items and far from negative items in the GNN-learned item representation space $\{\mathbf{x}_i\}$. This is achieved using a pairwise ranking loss:
$$\mathcal{L}_{rank} = \sum_{(u, i^+)} \Big( 1 - s\big(\hat{\mathbf{e}}_u, \mathbf{x}_{i^+}\big) \Big) + \sum_{(u, i^-)} \max\Big(0, \; s\big(\hat{\mathbf{e}}_u, \mathbf{x}_{i^-}\big) - s\big(\hat{\mathbf{e}}_u, \mathbf{x}_{i^+}\big) + \gamma \Big),$$
where:
  - $\hat{\mathbf{e}}_u$: generative item representation of user $u$.
  - $\mathbf{x}_i$: collaborative representation of item $i$.
  - $s(\cdot, \cdot)$: a similarity metric, typically cosine similarity.
  - $(u, i^+)$: a positive pair (user $u$ has interacted with item $i^+$). The loss minimizes $1 - s(\hat{\mathbf{e}}_u, \mathbf{x}_{i^+})$, maximizing similarity.
  - $(u, i^-)$: a negative pair (user $u$ has not interacted with item $i^-$). The loss ensures the similarity between the user's preference and a negative item is at least $\gamma$ less than that of a positive item.
  - $\gamma$: the margin value for negative pairs.
During tuning, a 1:1 negative sampling ratio is used, selecting one random un-interacted item as a negative sample for each positive sample.
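A minimal sketch of such a margin-based pairwise ranking objective with cosine similarity and 1:1 negative sampling; the exact formulation in the paper may differ, and the tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(user_repr, pos_item, neg_item, margin: float = 0.1):
    """Pull the generated user representation toward the positive item and
    push it at least `margin` further away from the sampled negative item."""
    pos_sim = F.cosine_similarity(user_repr, pos_item, dim=-1)
    neg_sim = F.cosine_similarity(user_repr, neg_item, dim=-1)
    positive_term = 1.0 - pos_sim                       # maximize similarity to positives
    negative_term = F.relu(neg_sim - pos_sim + margin)  # hinge on the margin for negatives
    return (positive_term + negative_term).mean()

loss = pairwise_ranking_loss(torch.randn(32, 64), torch.randn(32, 64), torch.randn(32, 64))
```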
4.5.2. Inference
The inference process of TokenRec capitalizes on the generative retrieval framework, offering several advantages over traditional LLM-based approaches:
- Efficient Recommendations: By generating a generative item representation $\hat{\mathbf{e}}_u$ and then performing similarity-based retrieval from a pre-computed item pool, TokenRec bypasses the computationally expensive auto-regressive decoding and beam search processes. This significantly reduces inference costs and enables real-time recommendation.
- Generalizability to New Users and Items: TokenRec can effectively handle cold-start scenarios. When new users or items are introduced, only the lightweight GNN model needs to be updated to learn their collaborative representations and refresh the vector database; the MQ-Tokenizers and the LLM backbone remain frozen. This is due to the robustness provided by the masking and K-way encoder mechanisms in the MQ-Tokenizer. Retraining the GNN is far more efficient than fine-tuning the LLM.
TokenRec's efficiency and generalization capability for new users and items during the inference stage: rather than retraining the MQ-Tokenizers and LLM backbone, which can be computationally expensive and time-consuming, only the GNN needs to be updated to learn representations for new users and items.
The accompanying figure illustrates how the TokenRec framework generalizes to new users and items by updating the vector database: the user pool and item pool are refreshed, the GNN and MQ-Tokenizer produce the corresponding user and item representations, and the top-$K$ items are then recommended.
- Concise Prompts: TokenRec can make recommendations using only user ID tokens (Prompt 1 in Section 4.4.1), without needing to include the user's historical interactions in the prompt. This is possible because collaborative knowledge is already embedded within the user ID tokens via the MQ-Tokenizer. This significantly reduces prompt length, saves computing resources, and helps circumvent the context length limitations of many LLMs.
5. Experimental Setup
5.1. Datasets
The experiments were conducted on four widely used real-world benchmark datasets to evaluate the effectiveness of TokenRec:
- Amazon-Beauty (Beauty): E-commerce user-item interactions related to beauty products from amazon.com.
- Amazon-Clothing (Clothing): E-commerce user-item interactions related to clothing products from amazon.com.
- LastFM: Music artist listening records from users on the Last.fm online music system.
- MovieLens 1M (ML1M): A collection of movie ratings made by MovieLens users.

The basic statistics of these datasets are provided in Table I. The maximum item sequence length was set to 100 to accommodate the input length of the T5 LLM backbone (512 tokens). For training, validation, and testing, a leave-one-out policy was used, where all but the last observation in a user's interaction history form the training set. Users' interaction histories were randomly shuffled to align with collaborative filtering methods (i.e., neglecting sequential patterns).
The following are the results from Table I of the original paper:
| Datasets | #Users | #Items | #Interactions | Density (%) |
| --- | --- | --- | --- | --- |
| LastFM | 1,090 | 3,646 | 37,080 | 0.9330 |
| ML1M | 6,040 | 3,416 | 447,294 | 2.1679 |
| Beauty | 22,363 | 12,101 | 197,861 | 0.0731 |
| Clothing | 23,033 | 39,387 | 278,641 | 0.0307 |
5.1.1. Data Sample Example
While the paper does not provide a concrete example of a data sample (e.g., an actual user interaction entry), we can infer from the dataset descriptions:
- LastFM: A data sample would look like (user_ID, artist_ID), indicating a user listened to a specific artist.
- ML1M: A data sample would be (user_ID, movie_ID, rating), indicating a user rated a movie.
- Amazon-Beauty/Clothing: A data sample would typically be (user_ID, product_ID), indicating a user purchased or interacted with a product.

These datasets are effective for validating the method's performance because they represent diverse recommendation scenarios (music, movies, e-commerce) and vary in scale and density, allowing for a robust evaluation of TokenRec's capabilities.
5.2. Evaluation Metrics
The quality of recommendation results is evaluated using two widely adopted metrics: Hit Ratio at K (HR@K) and Normalized Discounted Cumulative Gain at K (NDCG@K). Higher values for both metrics indicate better recommendation performance. The average metrics over all users in the test set are reported. The value of $K$ is set to 10, 20, and 30, with $K=20$ being the default for ablation studies.
5.2.1. Hit Ratio at K (HR@K)
- Conceptual Definition: HR@K measures the recall of the recommendation list. It quantifies how often the target item (the one the user actually interacted with in the test set) appears within the top $K$ recommended items. If the target item is present in the top $K$ recommendations, it is considered a "hit" for that user. HR@K is the proportion of users for whom a hit occurred. It is a simple, intuitive measure of whether the model successfully suggested any relevant item.
- Mathematical Formula:
$$\mathrm{HR@}K = \frac{\#\{\text{users whose target item appears in their top-}K\text{ recommendations}\}}{\#\{\text{users in the test set}\}}.$$
- Symbol Explanation:
  - Numerator: the count of users for whom the ground-truth item (the item the user interacted with in the test set) is found within the list of the top $K$ items suggested by the recommender system.
  - Denominator: the total number of users in the test set for whom recommendations are being generated.
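Under the leave-one-out protocol (one held-out target item per user), HR@K can be computed as in the small sketch below; function and variable names are illustrative.

```python
def hit_ratio_at_k(ranked_items_per_user, target_item_per_user, k: int = 20) -> float:
    """Fraction of users whose held-out target item appears in their top-k list."""
    hits = sum(
        1 for user, ranked in ranked_items_per_user.items()
        if target_item_per_user[user] in ranked[:k]
    )
    return hits / len(ranked_items_per_user)

ranked = {"u1": [3, 7, 9, 2], "u2": [5, 1, 8, 4]}
targets = {"u1": 9, "u2": 6}
print(hit_ratio_at_k(ranked, targets, k=3))  # -> 0.5 (only u1's target is in the top 3)
```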
5.2.2. Normalized Discounted Cumulative Gain at K (NDCG@K)
- Conceptual Definition: NDCG@K is a measure of ranking quality that accounts for both the relevance of recommended items and their position in the list. It assigns higher scores if more relevant items appear at higher ranks (closer to the top of the list). It normalizes the Discounted Cumulative Gain (DCG) by the Ideal DCG (IDCG), which is the DCG of a perfectly ordered list of relevant items, making it comparable across different queries.
- Mathematical Formula: First, Cumulative Gain (CG):
$$\mathrm{CG@}K = \sum_{p=1}^{K} rel_p.$$
Then, Discounted Cumulative Gain (DCG):
$$\mathrm{DCG@}K = \sum_{p=1}^{K} \frac{rel_p}{\log_2(p+1)}.$$
Finally, Normalized Discounted Cumulative Gain (NDCG):
$$\mathrm{NDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}.$$
- Symbol Explanation:
  - $K$: the number of top recommendations being considered.
  - $p$: the rank (position) of an item in the recommendation list, from 1 to $K$.
  - $rel_p$: the relevance score of the item at rank $p$. For implicit feedback (like in this paper, where interactions are binary), $rel_p$ is typically 1 if the item at rank $p$ is the target item, and 0 otherwise. For explicit feedback (e.g., ratings), $rel_p$ would be the rating score.
  - $\mathrm{DCG@}K$: the Discounted Cumulative Gain for the top $K$ recommendations. It accumulates relevance scores, penalizing items that appear lower in the list by dividing by the logarithm of their rank.
  - $\mathrm{IDCG@}K$: the Ideal Discounted Cumulative Gain, which is the maximum possible DCG for the top $K$ items if the recommendation list were perfectly ordered by relevance. This serves as a normalization factor.
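With binary relevance and a single target item per user, NDCG@K reduces to a rank-discounted hit, as this small illustrative sketch shows.

```python
import math

def ndcg_at_k(ranked_items, target_item, k: int = 20) -> float:
    """Binary relevance: DCG = 1/log2(rank+1) if the target is within the top-k, else 0.
    With a single relevant item, the ideal DCG is 1/log2(2) = 1, so NDCG equals DCG."""
    top_k = ranked_items[:k]
    if target_item not in top_k:
        return 0.0
    rank = top_k.index(target_item) + 1       # 1-based position
    return 1.0 / math.log2(rank + 1)

print(ndcg_at_k([3, 9, 7], target_item=9, k=3))  # rank 2 -> 1/log2(3) ≈ 0.63
```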
5.3. Baselines
TokenRec is compared against a comprehensive set of baselines, including traditional, sequential, and other LLM-based recommender systems:
- Collaborative Filtering (CF) Methods:
- MF [38]: The classic Matrix Factorization method.
- NCF [39]: Neural Collaborative Filtering, a DNN-based CF model.
- LightGCN [6]: A simplified Graph Convolutional Network (GCN) for recommendations.
- GTN [7]: Graph Trend Filtering Network, a GNN-based collaborative filtering variant designed to capture more reliable collaborative signals from potentially noisy interactions.
- LTGNN [34]: Linear-time Graph Neural Networks for scalable recommendations.
- Sequential Recommendation Methods:
- SASRec [40]: Self-Attentive Sequential Recommendation.
- BERT4Rec [41]: Bidirectional Encoder Representations from Transformers for Recommendation.
- S³Rec [42]: Self-Supervised Learning for Sequential Recommendation.
- CoSeRec [43]: Contrastive Self-Supervised Sequential Recommendation.
- LLM-based Recommendation Methods:
- P5-RID (Random Indexing) [12]: P5 framework with randomly assigned item IDs.
- P5-SID (Sequential Indexing) [12]: P5 framework with item IDs indexed sequentially.
- CID [44]: Collaborative Indexing, a P5 variant incorporating co-occurrence frequencies.
- POD [45]: Prompt distillation for efficient LLM-based recommendation.
- TIGER [46]: Recommender systems with generative retrieval using semantic IDs.
- TIGER-G [46]: TIGER variant incorporating graph-based collaborative knowledge.
- CoLLM [16]: Integrating collaborative embeddings into Large Language Models for recommendation.
These baselines are representative of the state-of-the-art in various recommendation paradigms, including traditional methods, sequence-aware models, and recent LLM-integrated approaches. They collectively provide a strong comparative context for TokenRec's performance.
5.4. Hyper-parameter Settings
The experimental setup involved specific hyper-parameter choices:
- Implementation: The model is implemented using Hugging Face (likely for the LLM backbone) and PyTorch.
- MQ-Tokenizer Hyper-parameters:
  - Codebook number $K$ (number of sub-encoders/sub-codebooks): searched over a set of candidate values.
  - Token number $L$ in each sub-codebook: searched over a set of candidate values.
  - Masking ratio $\rho$: searched in the range $\{0.0, 0.1, \ldots, 1.0\}$.
- LLM4Rec Fine-tuning Hyper-parameters:
  - Ratio of negative sampling $\lambda$: fixed at 1:1, meaning one randomly selected un-interacted item (negative sample) for each positive user-item interaction.
  - Margin $\gamma$ in Equation (20): set in the range of 0 to 0.2.
- Optimization:
  - Optimizer: AdamW [47] is used for both the MQ-Tokenizers and the LLM backbone.
  - Batch size: 128.
  - Maximum training epochs: 100.
- Collaborative Representations: The initial high-order collaborative representations for users and items are obtained from LightGCN [6].
- LLM Backbone: A widely used lightweight LLM, T5-small [37], is employed for TokenRec and all LLM-based baselines to ensure a fair comparison.
- Prompting: 11 prompt templates were designed for TokenRec: 10 seen prompts and 1 unseen prompt for evaluation.
- Computational Resources: All experiments were conducted on a single NVIDIA A800 GPU (80 GB).
- Baseline Hyper-parameters: Other default hyper-parameters for baseline methods were set as suggested by their respective papers.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate TokenRec's strong performance across various datasets and evaluation scenarios, outperforming both traditional and LLM-based recommender systems.
6.1.1. Overall Recommendation Performance
The following are the results from Table II of the original paper:

| Model | LastFM HR@10 | LastFM HR@20 | LastFM HR@30 | LastFM NG@10 | LastFM NG@20 | LastFM NG@30 | ML1M HR@10 | ML1M HR@20 | ML1M HR@30 | ML1M NG@10 | ML1M NG@20 | ML1M NG@30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT4Rec | 0.0319 | 0.0461 | 0.0640 | 0.0128 | 0.0234 | 0.0244 | 0.0779 | 0.1255 | 0.1736 | 0.0533 | 0.0486 | 0.0595 |
| SASRec | 0.0345 | 0.0484 | 0.0658 | 0.0142 | 0.0236 | 0.0248 | 0.0785 | 0.1293 | 0.1739 | 0.0367 | 0.052 | 0.0622 |
| S³Rec | 0.0385 | 0.0490 | 0.0689 | 0.0177 | 0.0266 | 0.0266 | 0.0867 | 0.1270 | 0.1811 | 0.0361 | 0.0501 | 0.0601 |
| CoSeRec | 0.0388 | 0.0504 | 0.0720 | 0.0180 | 0.0268 | 0.0278 | 0.0795 | 0.1316 | 0.1804 | 0.0375 | 0.0529 | 0.0652 |
| MF | 0.0239 | 0.0450 | 0.0569 | 0.0114 | 0.0166 | 0.0192 | 0.078 | 0.1272 | 0.1733 | 0.0357 | 0.0503 | 0.0591 |
| NCF | 0.0321 | 0.0462 | 0.0643 | 0.0141 | 0.0252 | 0.0254 | 0.0786 | 0.1273 | 0.1738 | 0.0363 | 0.0504 | 0.0601 |
| LightGCN | 0.0385 | 0.0661 | 0.0982 | 0.0199 | 0.0269 | 0.0336 | 0.0877 | 0.1288 | 0.1813 | 0.0374 | 0.0509 | 0.0604 |
| GTN | 0.0394 | 0.0688 | 0.0963 | 0.0199 | 0.0273 | 0.0331 | 0.0883 | 0.1307 | 0.1826 | 0.0378 | 0.0512 | 0.0677 |
| LTGNN | 0.0471 | 0.076 | 0.0925 | 0.0234 | 0.0318 | 0.0354 | 0.0915 | 0.1387 | 0.1817 | 0.0419 | 0.0570 | 0.0659 |
| P5-RID | 0.0312 | 0.0523 | 0.0706 | 0.0144 | 0.0199 | 0.0238 | 0.0867 | 0.1248 | 0.1811 | 0.0381 | 0.0486 | 0.0662 |
| P5-SID | 0.0375 | 0.0536 | 0.0851 | 0.0224 | 0.0255 | 0.0261 | 0.0892 | 0.1380 | 0.1784 | 0.0422 | 0.0550 | 0.0641 |
| CID | 0.0381 | 0.0552 | 0.0870 | 0.0229 | 0.0260 | 0.0277 | 0.0901 | 0.1294 | 0.1863 | 0.0379 | 0.0525 | 0.0706 |
| POD | 0.0367 | 0.0572 | 0.0747 | 0.0184 | 0.0220 | 0.0273 | 0.0886 | 0.1277 | 0.1846 | 0.0373 | 0.0487 | 0.0668 |
| TIGER | 0.0467 | 0.0749 | 0.0984 | 0.0226 | 0.0306 | 0.0348 | 0.0901 | 0.1382 | 0.1803 | 0.0427 | 0.0562 | 0.0653 |
| TIGER-G | 0.0470 | 0.0767 | 0.0997 | 0.0229 | 0.031 | 0.0355 | 0.0905 | 0.1409 | 0.1824 | 0.0423 | 0.0565 | 0.0651 |
| CoLLM | 0.0483 | 0.0786 | 0.1017 | 0.0234 | 0.0319 | 0.0366 | 0.0923 | 0.1499 | 0.1998 | 0.0456 | 0.0620 | 0.0719 |
| TokenRec (User ID Only) | 0.0505 | 0.0881 | 0.1128 | 0.0251 | 0.0345 | 0.0397 | 0.0964 | 0.1546 | 0.2043 | 0.0493 | 0.0640 | 0.0745 |
| TokenRec (Unseen Prompt) | 0.0514 | 0.0917 | 0.1294 | 0.0252 | 0.0343 | 0.0422 | 0.1012 | 0.1672 | 0.2144 | 0.0532 | 0.0698 | 0.0798 |
| TokenRec | 0.0532 | 0.0936 | 0.1248 | 0.0247 | 0.0348 | 0.0415 | 0.1008 | 0.1677 | 0.2149 | 0.0528 | 0.0697 | 0.0797 |

The following are the results from Table III of the original paper:

| Model | Beauty HR@10 | Beauty HR@20 | Beauty HR@50 | Beauty NG@10 | Beauty NG@20 | Beauty NG@30 | Clothing HR@10 | Clothing HR@20 | Clothing HR@50 | Clothing NG@10 | Clothing NG@20 | Clothing NG@30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT4Rec | 0.0329 | 0.0464 | 0.0637 | 0.0162 | 0.0025 | 0.0255 | 0.0135 | 0.0271 | 0.0248 | 0.0061 | 0.0074 | 0.0079 |
| SASRec | 0.0338 | 0.0472 | 0.0637 | 0.0170 | 0.0213 | 0.0260 | 0.0136 | 0.0221 | 0.0256 | 0.0063 | 0.0076 | 0.0081 |
| S³Rec | 0.0351 | 0.0471 | 0.0664 | 0.0169 | 0.0237 | 0.0278 | 0.0140 | 0.0213 | 0.0256 | 0.0069 | 0.0081 | 0.0086 |
| CoSeRec | 0.0362 | 0.0476 | 0.0680 | 0.0176 | 0.0248 | 0.0280 | 0.0139 | 0.0211 | 0.0251 | 0.0068 | 0.0080 | 0.0085 |
| MF | 0.0127 | 0.0195 | 0.0245 | 0.0063 | 0.0081 | 0.0091 | 0.0116 | 0.0175 | 0.0234 | 0.0074 | 0.0088 | 0.0101 |
| NCF | 0.0315 | 0.0462 | 0.0623 | 0.0160 | 0.0196 | 0.0237 | 0.0119 | 0.0178 | 0.024 | 0.0072 | 0.0090 | 0.0103 |
| LightGCN | 0.0344 | 0.0498 | 0.0630 | 0.0194 | 0.0233 | 0.0261 | 0.0157 | 0.0226 | 0.0279 | 0.0085 | 0.0103 | 0.0114 |
| GTN | 0.0345 | 0.0502 | 0.0635 | 0.0198 | 0.0241 | 0.0268 | 0.0158 | 0.0226 | 0.0282 | 0.0084 | 0.0103 | 0.0111 |
| LTGNN | 0.0385 | 0.0564 | 0.0719 | 0.0207 | 0.0252 | 0.0285 | 0.0155 | 0.0218 | 0.0272 | 0.0082 | 0.0110 | 0.0116 |
| P5-RID | 0.0330 | 0.0511 | 0.0651 | 0.0146 | 0.0200 | 0.0144 | 0.0148 | 0.0225 | 0.0263 | 0.0071 | 0.0086 | 0.0095 |
| P5-SID | 0.0340 | 0.0516 | 0.0672 | 0.0154 | 0.0231 | 0.0176 | 0.0143 | 0.0222 | 0.0258 | 0.0070 | 0.0086 | 0.0091 |
| CID | 0.0341 | 0.0516 | 0.0673 | 0.0165 | 0.0236 | 0.0177 | 0.0146 | 0.0226 | 0.0276 | 0.0070 | 0.0087 | 0.0092 |
| POD | 0.0339 | 0.0498 | 0.0639 | 0.0185 | 0.0222 | 0.0221 | 0.0147 | 0.0225 | 0.0261 | 0.0074 | 0.0087 | 0.0091 |
| TIGER | 0.0372 | 0.0574 | 0.0747 | 0.0193 | 0.0248 | 0.0287 | 0.0147 | 0.0225 | 0.0266 | 0.0072 | 0.0087 | 0.0093 |
| TIGER-G | 0.0382 | 0.0586 | 0.0753 | 0.0195 | 0.0251 | 0.0292 | 0.0147 | 0.0227 | 0.0265 | 0.0073 | 0.0088 | 0.0093 |
| CoLLM | 0.0391 | 0.0606 | 0.0772 | 0.0200 | 0.0259 | 0.0303 | 0.0150 | 0.0218 | 0.0274 | 0.0079 | 0.0091 | 0.0117 |
| TokenRec (User ID Only) | 0.0396 | 0.0599 | 0.0763 | 0.0214 | 0.0265 | 0.0300 | 0.0160 | 0.0228 | 0.0282 | 0.0092 | 0.0109 | 0.0119 |
| TokenRec (Unseen Prompt) | 0.0402 | 0.0622 | 0.0791 | 0.0215 | 0.0270 | 0.0306 | 0.0164 | 0.0233 | 0.0286 | 0.0096 | 0.0111 | 0.0124 |
| TokenRec | 0.0407 | 0.0615 | 0.0782 | 0.0222 | 0.0276 | 0.0303 | 0.0171 | 0.0240 | 0.0291 | 0.0108 | 0.01112 | 0.0130 |

Key Observations from Table II and III:
- TokenRec's Superiority: TokenRec consistently achieves the best performance across all datasets (LastFM, ML1M, Beauty, Clothing) and metrics (HR@K, NDCG@K). For instance, on the LastFM dataset, TokenRec significantly outperforms the strongest baselines by an average of 19.08% on HR@20 and 9.09% on NDCG@20. This highlights the effectiveness of its MQ-Tokenizer for collaborative ID tokenization and its generative retrieval paradigm.
- Effectiveness of User ID Only Input: Even when TokenRec is provided with only user ID tokens (no historical interactions), denoted TokenRec (User ID Only), it still surpasses most baselines in terms of accuracy. This indicates that the MQ-Tokenizer successfully embeds sufficient collaborative knowledge into the user ID tokens themselves, making detailed interaction history in prompts potentially redundant for certain recommendation scenarios. This also helps in circumventing LLM context length limitations.
- Impact of Collaborative Knowledge in LLM-based RecSys:
  - CID generally outperforms P5-RID and P5-SID, suggesting that even basic integration of collaborative knowledge (co-occurrence frequency) into tokenization is beneficial.
  - However, P5 variants and POD often perform worse than GNN-based CF methods (e.g., LightGCN, GTN, LTGNN), implying their struggle to effectively capture deep collaborative information with LLMs alone.
  - CoLLM achieves better results than other P5 variants, benefiting from its explicit incorporation of collaborative embeddings learned from GNNs.
  - TIGER (using semantic IDs) and TIGER-G (further incorporating graph knowledge) also show improved performance over basic LLM-based methods, emphasizing the value of richer item representations.
- GNNs' Strength: GNN-based collaborative filtering methods (LightGCN, GTN, LTGNN) generally outperform traditional CF (MF, NCF) and sequential recommendation methods. This confirms the effectiveness of GNNs in capturing high-order collaborative signals through graph connectivity. TokenRec builds upon this strength by quantizing these GNN-learned representations.
### 6.1.2. Generalizability Evaluation

The following are the results from Table IV of the original paper:

<div class="table-wrapper"><table> <thead> <tr> <td rowspan="2">Dataset</td> <td rowspan="2">Model</td> <td colspan="2">Seen</td> <td colspan="2">Unseen</td> </tr> <tr> <td>HR @ 20</td> <td>NG @ 20</td> <td>HR @ 20</td> <td>NG @ 20</td> </tr> </thead> <tbody> <tr> <td rowspan="6">LastFM</td> <td>P5</td> <td>0.0704</td> <td>0.0320</td> <td>0.0399</td> <td>0.0137</td> </tr> <tr> <td>POD</td> <td>0.0709</td> <td>0.0323</td> <td>0.0401</td> <td>0.0138</td> </tr> <tr> <td>CID</td> <td>0.0697</td> <td>0.0314</td> <td>0.0452</td> <td>0.0196</td> </tr> <tr> <td>TIGER</td> <td>0.0752</td> <td>0.0309</td> <td>0.0695</td> <td>0.0252</td> </tr> <tr> <td>CoLLM</td> <td>0.0812</td> <td>0.0336</td> <td>0.0574</td> <td>0.0235</td> </tr> <tr> <td>TokenRec</td> <td>0.0973</td> <td>0.0353</td> <td>0.0773</td> <td>0.0268</td> </tr> <tr> <td rowspan="6">Beauty</td> <td>P5</td> <td>0.0511</td> <td>0.0236</td> <td>0.0274</td> <td>0.0130</td> </tr> <tr> <td>POD</td> <td>0.0507</td> <td>0.0225</td> <td>0.0269</td> <td>0.0123</td> </tr> <tr> <td>CID</td> <td>0.0523</td> <td>0.0240</td> <td>0.0334</td> <td>0.0146</td> </tr> <tr> <td>TIGER</td> <td>0.0575</td> <td>0.0248</td> <td>0.0548</td> <td>0.0233</td> </tr> <tr> <td>CoLLM</td> <td>0.0612</td> <td>0.0261</td> <td>0.0477</td> <td>0.0195</td> </tr> <tr> <td>TokenRec</td> <td>0.0629</td> <td>0.0289</td> <td>0.0591</td> <td>0.0266</td> </tr> </tbody> </table></div>

This analysis assesses the models' ability to recommend to `unseen users` (the 5% of users with the least interaction history, excluded from training). Only the GNN component is updated for new users/items, while the LLM components remain frozen.

**Key Observations from Table IV:**

* **Significant Drop for Most LLM Baselines:** `P5` and `POD` experience substantial performance degradation (over 40% drops in HR@20 and NDCG@20) for `unseen users`. This highlights their inherent difficulty in generalizing to new entities without explicit fine-tuning.
* **Improved Performance with Collaborative Knowledge:** `CID` and `CoLLM` show relatively better generalization, with performance drops around 20%, due to their inclusion of some form of collaborative knowledge. However, this is still a considerable drop, indicating that current methods struggle with stable ID tokenization for unseen entities.
* **`TokenRec`'s Robust Generalization:** `TokenRec` demonstrates superior generalization. For instance, on Amazon-Beauty, its performance decreases by only 7% on average for unseen users. This strong capability is attributed to the `MQ-Tokenizer`'s robust ID tokenization, achieved through masking and the `K-way encoder`, and to the flexible `generative retrieval` paradigm.
* **`TIGER`'s Generalization:** `TIGER` also exhibits strong generalization (only a 6.10% average decrease on HR@20). This is attributed to its use of `semantic IDs`, which incorporate item-side textual information as additional knowledge, making it effective in cold-start scenarios. This suggests that leveraging rich side information is beneficial for generalization.
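The cold-start protocol described above (refresh only the GNN, keep the tokenizer and the LLM frozen) can be pictured with the following hypothetical sketch. The objects `gnn`, `mq_tokenizer`, and `llm_recommender` and their methods are placeholders assumed for illustration only, not the paper's actual API.

```python
def recommend_for_unseen_user(new_interactions, gnn, mq_tokenizer, llm_recommender, k=20):
    """Hypothetical cold-start flow: only the lightweight GNN is refreshed,
    while the MQ-Tokenizer and the LLM backbone stay frozen."""
    # 1. Extend/update the GNN with the new user's interactions and obtain a
    #    collaborative embedding for that user (the only trainable step).
    user_emb = gnn.embed_new_user(new_interactions)

    # 2. Quantize the embedding with the frozen MQ-Tokenizer to obtain
    #    discrete, LLM-compatible user ID tokens.
    user_tokens = mq_tokenizer.tokenize(user_emb)

    # 3. Query the frozen LLM-based recommender with the new tokens and
    #    retrieve the top-k items.
    return llm_recommender.recommend(user_tokens, top_k=k)
```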
### 6.1.3. Efficiency Evaluation

The following are the results from Table V of the original paper:

<div class="table-wrapper"><table> <thead> <tr> <td>Inference Time</td> <td>LastFM</td> <td>ML1M</td> <td>Beauty</td> <td>Clothing</td> </tr> </thead> <tbody> <tr> <td>P5</td> <td>96.04</td> <td>99.75</td> <td>86.39</td> <td>93.38</td> </tr> <tr> <td>POD</td> <td>96.30</td> <td>101.42</td> <td>87.69</td> <td>94.48</td> </tr> <tr> <td>CID</td> <td>94.96</td> <td>99.42</td> <td>84.87</td> <td>92.02</td> </tr> <tr> <td>TIGER</td> <td>82.57</td> <td>85.98</td> <td>76.11</td> <td>80.68</td> </tr> <tr> <td>TokenRec</td> <td>6.92</td> <td>8.43</td> <td>5.76</td> <td>6.00</td> </tr> <tr> <td>Acceleration*</td> <td>1236.24%</td> <td>1046.41%</td> <td>1354.25%</td> <td>1402.33%</td> </tr> </tbody> </table></div>

This evaluation compares the average inference time per user for Top-20 recommendations among LLM-based methods.

**Key Observations from Table V:**

* **`TokenRec`'s Drastic Efficiency Improvement:** `TokenRec` achieves significantly superior inference efficiency, with accelerations ranging from approximately 1046% to 1402% over the baselines across the four datasets. This massive gain is directly attributed to its `generative retrieval paradigm`, which bypasses the time-consuming `auto-regressive decoding` and `beam search` processes inherent in the other LLM-based methods (P5, POD, CID, TIGER).
* **`TIGER`'s Relative Efficiency:** `TIGER` is more efficient than `P5`, `CID`, and `POD` because it uses `compressed semantic IDs`, requiring fewer tokens for generation than full item titles or descriptions. However, `TokenRec`'s retrieval approach is fundamentally faster than any sequential generation.
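To see why retrieval is so much cheaper than token-by-token decoding, consider the following illustrative sketch: a single forward pass produces a user preference vector, and the top-K items are selected by a similarity search over a precomputed item-embedding pool. This is a generic approximation of the idea, assuming cosine similarity and a NumPy-based search, not the paper's exact implementation.

```python
import numpy as np

def retrieve_top_k(user_pref, item_pool, k=20):
    """Score every candidate item with one matrix product and return the
    indices of the k best-scoring items (no auto-regressive decoding,
    no beam search)."""
    # Cosine similarity between the generated user preference vector and
    # every item embedding in the pool.
    user = user_pref / (np.linalg.norm(user_pref) + 1e-12)
    items = item_pool / (np.linalg.norm(item_pool, axis=1, keepdims=True) + 1e-12)
    scores = items @ user                        # shape: (num_items,)

    # Partial selection of the top-k candidates, then order them by score.
    top_k = np.argpartition(-scores, k)[:k]
    return top_k[np.argsort(-scores[top_k])]

# Toy usage: 10,000 candidate items with 64-dimensional embeddings.
rng = np.random.default_rng(0)
pool = rng.normal(size=(10_000, 64))
pref = rng.normal(size=64)
print(retrieve_top_k(pref, pool, k=20))
```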
## 6.2. Ablation Studies / Parameter Analysis

### 6.2.1. Ablation Studies

The following are the results from Table VI of the original paper:

<div class="table-wrapper"><table> <thead> <tr> <td rowspan="2">Module</td> <td colspan="2">LastFM</td> <td colspan="2">Beauty</td> </tr> <tr> <td>HR @ 20</td> <td>NG @ 20</td> <td>HR @ 20</td> <td>NG @ 20</td> </tr> </thead> <tbody> <tr> <td>Full*</td> <td>0.0936</td> <td>0.0348</td> <td>0.0615</td> <td>0.0276</td> </tr> <tr> <td>w/o Masking</td> <td>0.0848</td> <td>0.0332</td> <td>0.0573</td> <td>0.0253</td> </tr> <tr> <td>w/o K-way</td> <td>0.0820</td> <td>0.0309</td> <td>0.0592</td> <td>0.0250</td> </tr> <tr> <td>w/o HOCK</td> <td>0.0549</td> <td>0.0172</td> <td>0.0407</td> <td>0.0149</td> </tr> <tr> <td>s RQ-VAE</td> <td>0.0831</td> <td>0.0314</td> <td>0.0596</td> <td>0.0253</td> </tr> <tr> <td>s VQ-VAE</td> <td>0.0810</td> <td>0.0308</td> <td>0.0589</td> <td>0.0247</td> </tr> <tr> <td>s K-Means</td> <td>0.0750</td> <td>0.0281</td> <td>0.0567</td> <td>0.0237</td> </tr> </tbody> </table></div>

**Key Observations from Table VI:**

* **Contribution of Each Component:** Each proposed component in `TokenRec` (masking, K-way encoder, high-order collaborative knowledge) contributes positively to the overall performance, as removing any of them leads to a degradation.
* **Importance of Masking and K-way Framework:** Removing the `masking operation` (`w/o Masking`) or the `K-way encoder` (`w/o K-way`) leads to a performance drop. This confirms their role in enhancing generalization (as seen in Table IV) and improving accuracy.
* **Crucial Role of High-Order Collaborative Knowledge (HOCK):** The most significant performance drop occurs when `high-order collaborative knowledge` is removed (`w/o HOCK`). This underscores the critical importance of leveraging GNN-learned representations for effective LLM-based recommendations and for aligning LLMs with personalized preferences. It also implies that the quality of these collaborative embeddings is paramount for the `MQ-Tokenizer`'s success.
* **Effectiveness of MQ-Tokenizer Design:** Comparing `TokenRec` against alternative vector quantization methods:
    * `s RQ-VAE` (Residual Quantized VAE) and `s VQ-VAE` (Vector Quantized VAE) perform worse than the full `TokenRec`, demonstrating the specific design advantages of the `MQ-Tokenizer` for encoding collaborative knowledge.
    * `s K-Means` shows the worst performance among the quantization alternatives, highlighting the need for a more sophisticated, learnable quantization approach than simple clustering for this task.

### 6.2.2. Hyper-parameter Analysis

#### 6.2.2.1. Effect of Masking Ratio ($\rho$)

The effect of the masking ratio $\rho$ in the `MQ-Tokenizer` was investigated. This parameter controls how much of the input representation is masked.
The following figure (Figure 4 from the original paper) shows the performance change of TokenRec w.r.t. HR@20 and NDCG@20.

Key Observations from Figure 4:
- Benefit of Small Masking: Introducing a small masking ratio generally leads to performance improvements in TokenRec. This indicates that a moderate level of masking helps the tokenizer build a more robust and generalized understanding of the representations.
- Optimal Masking Ratio: TokenRec achieves its best performance at a moderate masking ratio; increasing the ratio beyond that point brings no further benefit.
- Degradation with Excessive Masking: Performance degrades significantly when the masking ratio becomes too high. Excessive masking makes the reconstruction task too difficult, hindering the tokenizer's ability to learn meaningful representations (a rough sketch of the masking operation follows below).
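As a rough illustration of what the masking ratio controls, the sketch below randomly zeroes a fraction $\rho$ of the dimensions of a collaborative embedding before it is fed to the tokenizer for reconstruction. The exact masking scheme used by the MQ-Tokenizer may differ; this only shows how $\rho$ trades regularization against reconstruction difficulty.

```python
import numpy as np

def mask_embedding(embedding, rho, rng=None):
    """Randomly mask (zero out) a fraction rho of an embedding's dimensions.
    A small rho acts as a regularizer; a large rho makes reconstruction too hard."""
    rng = rng or np.random.default_rng()
    keep = rng.random(embedding.shape) >= rho    # keep each dimension with probability 1 - rho
    return embedding * keep

# Toy usage: mask roughly 20% of a 64-dimensional GNN embedding.
emb = np.random.default_rng(0).normal(size=64)
masked = mask_embedding(emb, rho=0.2)
```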
#### 6.2.2.2. Effect of Codebook Settings K and L
The study analyzed the impact of the number of sub-codebooks (K) and the number of tokens in each sub-codebook (L) on TokenRec's performance.
The following figure (Figure 5 from the original paper) shows the effect of the number of sub-codebooks (K) and the number of tokens in each sub-codebook (L) under the HR@20 and NDCG@20 metrics.

Key Observations from Figure 5:
- Impact of K (Number of Sub-codebooks):
    - As the codebook depth (number of sub-codebooks K) increases from 1, a progressive improvement in recommendation performance is observed across all datasets. This validates the effectiveness of the `K-way encoder` and `K-way codebook` in capturing diverse patterns and enhancing quantization.
    - However, the improvement becomes marginal once K grows beyond a small value, suggesting a trade-off between effectiveness and efficiency; a small K is therefore a practical choice.
- Impact of L (Tokens per Sub-codebook):
    - The optimal value of L varies with dataset size. For smaller datasets like LastFM/ML1M, an L of 256 often provides a good balance of effectiveness and efficiency.
    - For larger datasets like Amazon-Beauty/Clothing, a larger L (e.g., 512) is often beneficial, indicating that more codebook tokens are needed to represent the greater diversity of users/items.
- Importance of K-way Mechanism: Simply increasing L with a single codebook (i.e., K = 1) does not yield significant performance gains compared to using multiple sub-codebooks. This further emphasizes the effectiveness and necessity of the proposed `K-way mechanism` in the `MQ-Tokenizer`. A simplified sketch of K-way quantization is given below.
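To ground the roles of K and L, the following is a simplified, product-quantization-style sketch of K-way quantization: the input embedding is split into K sub-vectors, and each sub-vector is mapped to its nearest codeword in a sub-codebook of L entries, yielding K discrete tokens per user or item. This is a generic illustration under those assumptions, not the MQ-Tokenizer's exact architecture.

```python
import numpy as np

def k_way_quantize(embedding, codebooks):
    """Quantize an embedding with K sub-codebooks.
    codebooks: array of shape (K, L, d_sub); embedding: shape (K * d_sub,).
    Returns K token indices, one per sub-codebook."""
    k, l, d_sub = codebooks.shape
    sub_vectors = embedding.reshape(k, d_sub)
    tokens = []
    for i in range(k):
        # Nearest codeword (Euclidean distance) in the i-th sub-codebook.
        dists = np.linalg.norm(codebooks[i] - sub_vectors[i], axis=1)
        tokens.append(int(np.argmin(dists)))
    return tokens

# Toy usage: K = 4 sub-codebooks with L = 256 codewords each, 64-dim embeddings.
rng = np.random.default_rng(0)
books = rng.normal(size=(4, 256, 16))
tokens = k_way_quantize(rng.normal(size=64), books)
print(tokens)  # four discrete ID tokens, one per sub-codebook
```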
## 7. Conclusion & Reflections

### 7.1. Conclusion Summary
This paper introduces TokenRec, a novel and comprehensive framework for LLM-based generative recommendations. TokenRec effectively addresses several critical challenges in the field:
- ID Tokenization: It provides a generalizable strategy for tokenizing user and item IDs through its `Masked Vector-Quantized (MQ) Tokenizer`. This tokenizer quantizes masked representations derived from `Graph Neural Networks (GNNs)`, thereby seamlessly incorporating `high-order collaborative knowledge` into LLM-compatible discrete tokens. The `masking operation` and `K-way encoder` mechanisms significantly enhance its robustness and generalization.
- Inference Efficiency: `TokenRec` adopts an innovative `generative retrieval` paradigm that eliminates the need for the time-consuming `auto-regressive decoding` and `beam search` processes typically used by LLMs. Instead, it generates a user's preference representation and retrieves the top-K items through similarity matching, leading to substantial gains in inference speed.
- Generalizability: The framework demonstrates superior generalizability to `new and unseen users and items` (the cold-start problem). By updating only the lightweight GNN component for new entities, while keeping the `MQ-Tokenizers` and LLM backbone frozen, `TokenRec` achieves robust performance without requiring expensive LLM retraining.
- Performance: Extensive experiments on four real-world datasets confirm that `TokenRec` consistently outperforms both traditional and state-of-the-art LLM-based recommender systems across various evaluation metrics, while also demonstrating significant efficiency improvements.
### 7.2. Limitations & Future Work
The paper primarily focuses on overcoming existing limitations in LLM-based RecSys. While it doesn't explicitly list "limitations" of its own method, the challenges it addresses implicitly highlight areas for continued research and improvement in the broader field:
- Reliance on GNNs for Initial Representations: `TokenRec`'s effectiveness is heavily reliant on the quality of the initial `collaborative representations` learned by the chosen GNN (LightGCN in this case). Further research could explore the robustness of `TokenRec` to different GNN architectures or the integration of multi-modal information into these initial representations.
- Prompt Sensitivity: Although `TokenRec` uses pre-defined prompts and shows robustness with unseen prompts, LLMs are generally known to be sensitive to prompt wording. Further research might explore adaptive or user-specific prompt generation strategies.
- Static Codebook during LLM Tuning: The `MQ-Tokenizers` are frozen during the LLM fine-tuning stage. Exploring approaches that allow for more dynamic interaction or co-adaptation between the tokenizer and the LLM backbone could be a direction for future work.
- Scalability of GNN Pre-training: While the GNN update for new users/items is efficient, the initial training of a powerful GNN on massive interaction graphs can still be resource-intensive. Research into more scalable GNN training or zero-shot GNN inference methods could further enhance the overall efficiency pipeline.
- Exploring Deeper LLM Backbones: The paper uses T5-small. Investigating the performance with larger and more powerful LLMs, while managing their computational demands, could be a future research avenue.
- Beyond Binary Interactions: The current setup focuses on implicit binary interactions. Extending `TokenRec` to handle explicit feedback (e.g., ratings) with varying relevance levels could be explored.
### 7.3. Personal Insights & Critique
TokenRec offers a highly impactful and pragmatic solution to some of the most pressing challenges in LLM-based Recommender Systems. Its key innovation lies in intelligently bridging the gap between the discrete nature of LLM tokens and the continuous, complex collaborative knowledge inherent in user-item interactions.
Key Strengths and Inspirations:
- Elegant Solution to Tokenization: The `MQ-Tokenizer` is a clever approach to the `ID tokenization` problem. By quantizing GNN-learned embeddings, it ensures that the discrete tokens carry rich collaborative semantics rather than being arbitrary identifiers or solely text-derived. The `masking` and `K-way encoder` further enhance the robustness and expressiveness of these tokens. This approach could be transferred to other domains where continuous, rich data needs to be discretely represented for LLM consumption, such as bioinformatics (e.g., tokenizing protein sequences based on learned structural embeddings).
- Practical Inference Strategy: The shift from `auto-regressive generation` to `generative retrieval` is a significant practical advancement. For real-time applications, the speedup is critical. This paradigm could inspire similar approaches in other LLM applications where a fixed set of "answers" needs to be efficiently selected rather than freely generated (e.g., LLM-based knowledge retrieval systems that select from a database of facts).
- Strong Generalization: The ability to handle `cold-start` users and items without full LLM retraining is a major advantage for real-world deployment. The idea of efficiently updating only a lightweight component (the GNN) while leveraging pre-trained LLM and tokenizer components is a powerful pattern for maintaining scalability and up-to-dateness in dynamic systems.
- Concise Prompts: The finding that `TokenRec` can perform well with only user ID tokens in the prompt is insightful. It implies that the `MQ-Tokenizer` successfully compresses significant user preference information into these tokens, mitigating the `context length limitation` issue that plagues many LLM applications.
Potential Issues and Areas for Improvement:
- Interpretability of Tokens: While the tokens carry collaborative knowledge, their direct interpretability to humans might be limited compared to textual descriptions. For explainable recommendation systems, further work might be needed to link these tokens back to understandable user preferences or item attributes.
- Dependency on GNN Quality: The performance of `TokenRec` is fundamentally tied to the quality of the initial GNN-learned representations. If the chosen GNN struggles with sparse data or specific graph structures, `TokenRec`'s foundation might be weakened. This suggests that robust and powerful GNNs are a prerequisite.
- Complexity of Multi-Stage Training: While effective, the two-stage training process (MQ-Tokenizer first, then LLM4Rec) introduces some complexity. Exploring end-to-end or more integrated training schemes, possibly with techniques like curriculum learning, could be an area for future research, although this might increase training instability.
- Overhead of Codebook Management: For extremely large item sets, managing and updating the codebook (especially with many tokens per sub-codebook) could still incur overhead, though significantly less than a full vocabulary expansion. Strategies for dynamic codebook updates or hierarchical quantization might be considered.
Overall, `TokenRec` makes a substantial contribution by offering a robust, efficient, and generalizable framework for LLM-based recommendations. It highlights the importance of careful design at the intersection of discrete and continuous representations and provides a strong blueprint for future advancements in this rapidly evolving field.