Learnable Item Tokenization for Generative Recommendation
TL;DR Summary
The paper introduces LETTER, a learnable tokenizer addressing the challenge of transforming recommendation data into LLMs' language space. By integrating hierarchical semantics, collaborative signals, and code assignment diversity, LETTER satisfies the key requirements of effective identifiers, and experiments on three datasets demonstrate its superiority over existing item tokenization methods.
Abstract
Utilizing powerful Large Language Models (LLMs) for generative recommendation has attracted much attention. Nevertheless, a crucial challenge is transforming recommendation data into the language space of LLMs through effective item tokenization. Current approaches, such as ID, textual, and codebook-based identifiers, exhibit shortcomings in encoding semantic information, incorporating collaborative signals, or handling code assignment bias. To address these limitations, we propose LETTER (a LEarnable Tokenizer for generaTivE Recommendation), which integrates hierarchical semantics, collaborative signals, and code assignment diversity to satisfy the essential requirements of identifiers. LETTER incorporates Residual Quantized VAE for semantic regularization, a contrastive alignment loss for collaborative regularization, and a diversity loss to mitigate code assignment bias. We instantiate LETTER on two models and propose a ranking-guided generation loss to augment their ranking ability theoretically. Experiments on three datasets validate the superiority of LETTER, advancing the state-of-the-art in the field of LLM-based generative recommendation.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Learnable Item Tokenization for Generative Recommendation
1.2. Authors
- Wenjie Wang (National University of Singapore, Singapore)
- Jizhi Zhang (University of Science and Technology of China, Hefei, China)
- See-Kiong Ng (National University of Singapore, Singapore)
- Honghui Bao (National University of Singapore, Singapore)
- Xinyu Lin (National University of Singapore, Singapore)
- Fuli Feng (University of Science and Technology of China, Hefei, China)
- Yongqi Li (The Hong Kong Polytechnic University, Hong Kong SAR, China)
- Tat-Seng Chua (National University of Singapore, Singapore)
1.3. Journal/Conference
The paper is published in the Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM 24), October 21-25, 2024, Boise, ID, USA. CIKM is a highly reputable and influential conference in the fields of information retrieval, database management, and knowledge management, making it a significant venue for research in recommender systems.
1.4. Publication Year
2024
1.5. Abstract
The paper addresses a crucial challenge in utilizing Large Language Models (LLMs) for generative recommendation: effectively transforming recommendation data into the language space of LLMs through item tokenization. Existing approaches (ID, textual, and codebook-based identifiers) suffer from limitations such as inefficient semantic encoding, lack of collaborative signals, or code assignment bias. To overcome these, the authors propose LETTER (a LEarnable Tokenizer for generaTivE Recommendation). LETTER integrates hierarchical semantics, collaborative signals, and code assignment diversity, which are essential requirements for effective identifiers. It incorporates Residual Quantized VAE (RQ-VAE) for semantic regularization, a contrastive alignment loss for collaborative regularization, and a diversity loss to mitigate code assignment bias. The authors instantiate LETTER on two generative recommender models and introduce a ranking-guided generation loss to theoretically enhance their ranking ability. Experiments on three datasets demonstrate LETTER's superiority, advancing the state-of-the-art in LLM-based generative recommendation.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2405.07314 (Preprint)
- PDF Link: https://arxiv.org/pdf/2405.07314v3.pdf (Preprint)

The paper is published as a preprint on arXiv and is accepted for CIKM 2024.
2. Executive Summary
2.1. Background & Motivation
The rise of Large Language Models (LLMs) has opened new avenues for generative recommendation, where LLMs are used to directly generate recommended items. A fundamental hurdle in this paradigm is item tokenization, which involves converting discrete item data into a format (a sequence of tokens or identifiers) that LLMs can process. This process bridges the gap between the traditional recommendation domain and the language space of LLMs.
The current landscape of item tokenization approaches presents several critical shortcomings:
- ID Identifiers: These assign unique numerical strings to items. While ensuring uniqueness, they are inefficient at encoding semantic information, making it difficult to generalize to cold-start items (new items with little or no interaction history).
- Textual Identifiers: These leverage item descriptions (e.g., titles, attributes) directly as identifiers.
  - They often lack hierarchical semantics, meaning the token sequence does not progressively encode information from coarse to fine-grained, which is suboptimal for autoregressive generation.
  - They typically lack collaborative signals (information derived from user-item interactions). Items with similar semantics but different user interaction patterns might have very similar textual identifiers, leading to misalignment and making it hard for the recommender to distinguish them based on collaborative preferences.
- Codebook-based Identifiers: These use auto-encoders to map item semantics to hierarchical code sequences. While an improvement in semantics, they still lack collaborative signals in their code sequences and suffer from a significant code assignment bias: certain codes are assigned more frequently than others, leading to an imbalanced distribution and an unfair item generation bias where popular items are more likely to be generated, neglecting less popular but relevant items.

The core problem the paper aims to solve is to develop an effective item tokenization method that can overcome these limitations, thus enabling LLMs to perform generative recommendation more accurately, fairly, and efficiently.
2.2. Main Contributions / Findings
The paper's primary contributions are encapsulated in LETTER, a novel learnable tokenizer designed to address the aforementioned issues:
- Proposal of LETTER: The authors propose LETTER, a LEarnable Tokenizer for generaTivE Recommendation, which is explicitly designed to meet three essential criteria for ideal identifiers:
  - Hierarchical Semantic Integration: Ensures identifiers encode semantics from broad to fine-grained, aligning with autoregressive generation.
  - Collaborative Signal Incorporation: Integrates collaborative signals directly into the token assignment process, so that similar interaction patterns result in similar token sequences.
  - Code Assignment Diversity: Mitigates code assignment bias to ensure fairer item generation.
- Three Regularization Mechanisms: LETTER achieves these objectives through three regularization losses:
  - Semantic Regularization: Leverages Residual Quantized VAE (RQ-VAE) to encode hierarchical item semantics into a code sequence, enabling coarse-to-fine generation and better cold-start generalization.
  - Collaborative Regularization: Introduces a contrastive alignment loss that aligns the quantized semantic embeddings with Collaborative Filtering (CF) embeddings from a well-trained CF model (e.g., LightGCN). This injects collaborative information directly into the token assignment.
  - Diversity Regularization: Implements a diversity loss that encourages a more uniform distribution of code embeddings, thereby alleviating code assignment bias and reducing item generation bias.
- Ranking-Guided Generation Loss: To further enhance the recommendation performance of LLM-based generative models, LETTER proposes a ranking-guided generation loss. This loss modifies the traditional negative log-likelihood loss by introducing an adjustable temperature parameter, which emphasizes penalties for hard-negative samples and theoretically improves the top-K ranking ability.
- Empirical Validation and State-of-the-Art Performance: Extensive experiments on three real-world datasets (Instruments, Beauty, Yelp) demonstrate that LETTER significantly outperforms existing item tokenization methods. It consistently improves the performance of two representative generative recommender models (TIGER and LC-Rec) when integrated, and ablation studies confirm the effectiveness of each proposed regularization component and the ranking-guided generation loss.

In essence, LETTER provides a robust and comprehensive solution for item tokenization in LLM-based generative recommendation, addressing semantic encoding, collaborative information, and fairness issues simultaneously, thereby advancing the state-of-the-art in this emerging field.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Generative Recommendation
Generative recommendation is an emerging paradigm in recommender systems where, instead of predicting a score or ranking from a fixed set of items, the model directly generates the identifiers or attributes of recommended items. This approach often leverages Large Language Models (LLMs) because of their powerful text generation capabilities. Given a user's historical interactions, a generative recommender aims to output a sequence of tokens representing a new, relevant item.
3.1.2. Large Language Models (LLMs)
Large Language Models (LLMs) are deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data. They are capable of understanding, generating, and processing human language with remarkable fluency and coherence. Key characteristics include:
- Autoregressive Generation: LLMs typically generate text token by token, predicting the next token based on the preceding sequence.
- Rich World Knowledge: Acquired during pre-training on diverse internet data.
- Reasoning and Generalization: Ability to perform complex tasks, including inference and generalization to unseen scenarios. In the context of recommendation, LLMs are adapted to generate item identifiers, effectively treating items as words in a vocabulary.
3.1.3. Item Tokenization
Item tokenization is the process of converting an item in a recommender system (e.g., a movie, a product) into a representation that can be understood and processed by an LLM. This usually involves assigning a unique identifier or a sequence of tokens to each item. The effectiveness of item tokenization directly impacts how well an LLM can encode user preferences and generate relevant recommendations.
3.1.4. Residual Quantized Variational Autoencoder (RQ-VAE)
Residual Quantized Variational Autoencoder (RQ-VAE) is a type of Vector Quantized Variational Autoencoder (VQ-VAE) that focuses on generating hierarchical and high-fidelity representations.
- Variational Autoencoder (VAE): A type of generative model that learns a probabilistic mapping from input data to a latent space and then reconstructs the data from that latent space. It consists of an encoder (maps input to latent distribution parameters) and a decoder (samples from latent space to reconstruct input).
- Vector Quantization (VQ): A technique where continuous latent representations are mapped to discrete codebook entries. Each codebook entry is a vector (a code embedding). This makes the latent space discrete, which is beneficial for tasks involving discrete tokens, like language modeling.
- Residual Quantization: In RQ-VAE, quantization is performed iteratively across multiple levels. Instead of quantizing the entire latent vector at once, it quantizes the residual error from the previous quantization step. This hierarchical approach allows the model to capture progressively finer-grained details, leading to more accurate and expressive discrete representations. This is crucial for encoding hierarchical semantics, as required by LETTER.
3.1.5. Collaborative Filtering (CF)
Collaborative Filtering (CF) is a traditional and highly effective technique in recommender systems that predicts a user's preference for items based on the preferences of other users (user-based CF) or the similarity of items themselves (item-based CF). The core idea is that users who agreed in the past tend to agree again in the future, or that items liked by similar users are likely to be liked by the current user. CF embeddings are vector representations of users and items learned through CF models, capturing their interaction patterns.
3.1.6. Contrastive Learning
Contrastive learning is a self-supervised learning paradigm where the model learns representations by pulling "positive" pairs (e.g., different augmentations of the same data point, or semantically similar items) closer together in the embedding space, while pushing "negative" pairs (dissimilar data points) farther apart. This technique is often used to learn powerful and discriminative representations without explicit labels. In LETTER, it's used to align semantic quantized embeddings with CF embeddings.
3.1.7. Ranking Metrics (Recall@K, NDCG@K)
These are standard metrics used to evaluate the performance of recommender systems, particularly in top-K recommendation tasks, where the goal is to recommend a small list of items.
- Recall@K (R@K): Measures the proportion of relevant items that are successfully retrieved (recommended) within the top-K items.
  - Conceptual Definition: Recall@K quantifies how many of the actual relevant items for a user were present in the recommended list of K items. It focuses on the ability of the system to 'recall' or retrieve all relevant items.
  - Mathematical Formula: $ \text{Recall}@K = \frac{\text{Number of relevant items in top-}K \text{ recommendations}}{\text{Total number of relevant items for the user}} $
  - Symbol Explanation:
    - Number of relevant items in top-K recommendations: The count of items that the user actually interacted with (or found relevant) and were also present in the list of the top K items recommended by the system.
    - Total number of relevant items for the user: The total count of items that the user actually interacted with (or found relevant) in the test set.
- Normalized Discounted Cumulative Gain (NDCG@K): A ranking quality metric that considers not only whether relevant items appear in the top-K but also their position in the ranked list. Higher relevance at higher ranks yields a better score.
  - Conceptual Definition: NDCG@K evaluates the quality of the ranked list of recommendations. It assigns higher scores to relevant items that appear earlier in the list and accounts for varying degrees of relevance. It is 'normalized' so that a perfect ranking always achieves an NDCG of 1.0.
  - Mathematical Formula: $ \text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K} $ where $ \text{DCG}@K = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)} $ and $ \text{IDCG}@K = \sum_{i=1}^{K} \frac{2^{rel_i^{\text{ideal}}} - 1}{\log_2(i+1)} $
  - Symbol Explanation:
    - DCG@K: Discounted Cumulative Gain at rank K. It sums the relevance scores of items in the recommended list, discounted logarithmically by their position.
    - IDCG@K: Ideal Discounted Cumulative Gain at rank K, i.e., the DCG score of the ideal ranking (all relevant items ranked highest). It serves as a normalization factor.
    - $rel_i$: The relevance score of the item at position $i$ in the recommended list. For binary relevance (relevant/not relevant), $rel_i$ is typically 1 or 0.
    - $rel_i^{\text{ideal}}$: The relevance score of the item at position $i$ in the ideal (perfect) ranking.
    - $K$: The number of top items considered in the recommendation list.
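To make these two formulas concrete, below is a small, self-contained sketch that computes Recall@K and NDCG@K under binary relevance; the function names and the binary-relevance simplification are illustrative assumptions rather than the paper's evaluation code.

```python
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k recommendations."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance: rel_i = 1 if the item at rank i is relevant, else 0."""
    dcg = sum(1.0 / math.log2(i + 2)            # (2^1 - 1) / log2(rank + 1), ranks are 1-based
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Example: one relevant test item, ranked at position 2 of the top-5 list.
print(recall_at_k(["a", "b", "c", "d", "e"], {"b"}, k=5))  # 1.0
print(ndcg_at_k(["a", "b", "c", "d", "e"], {"b"}, k=5))    # ~0.63
```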
3.2. Previous Works
3.2.1. ID Identifiers
Early approaches to item tokenization for LLMs often relied on ID identifiers, which are unique numerical strings assigned to each item.
- P5-SemiD [14]: This method constructs ID identifiers based on item metadata, such as categories or attributes. For example, all items in the "string" category might get IDs starting with a specific prefix. While this incorporates some semantic information, it is often coarse-grained and might fail to capture fine details or collaborative signals.
- P5-CID [14]: This approach integrates collaborative signals into ID identifiers by using a spectral clustering tree derived from item co-appearance graphs. Items that frequently co-occur in user interactions are grouped, and these groupings inform the ID assignment. This helps in capturing collaborative patterns but relies on a fixed, unlearnable structure, which can be rigid and less adaptable to new items or evolving patterns.

Limitations: ID identifiers, especially purely numerical ones, are inherently poor at encoding rich semantic information [39]. Even with semantic or collaborative enhancements, their fixed or tree-like structures struggle to adapt to new items (cold-start) or dynamically evolving user preferences. The misalignment between semantic and collaborative signals can also hinder effective learning.
3.2.2. Textual Identifiers
Textual identifiers directly use an item's content information, such as titles, attributes, or descriptions, as its token sequence.
- BIGRec [1]: An LLM-based generative recommender model that uses items' titles as textual identifiers.
- P5-TID [14]: Similarly, this method uses item titles as textual identifiers for an LLM-based generative recommender model.

Limitations:

- Non-hierarchical Semantics: The natural language text of a title or description does not inherently encode semantic information hierarchically (from coarse to fine-grained). This makes autoregressive generation less efficient, as the LLM might struggle to align early tokens with broad preferences.
- Lack of Collaborative Signals: Textual identifiers are solely based on content. Items with very similar textual descriptions might have vastly different collaborative signals (e.g., two similar-looking books, one popular, one niche). This misalignment (as depicted in Figure 2) can confuse the LLM, making it difficult to learn user preferences accurately. Injecting collaborative signals into token embeddings after tokenization can lead to collisions if similar textual identifiers need to represent different collaborative patterns.
3.2.3. Codebook-based Identifiers
These methods use auto-encoders to map item features into discrete code sequences, typically leveraging a codebook of learned embeddings.
- TIGER [32]: Introduces codebook-based identifiers by employing RQ-VAE to quantize semantic information into code sequences for LLM-based generative recommendation. This addresses the hierarchical semantics problem.
- LC-Rec [50]: Also uses codebook-based identifiers and integrates auxiliary alignment tasks to connect generated code sequences with natural language, aiming to better utilize knowledge in LLMs.

Limitations:

- Lack of Collaborative Signals: Similar to textual identifiers, existing codebook-based methods primarily focus on semantic encoding and do not explicitly integrate collaborative signals into the assignment of codes within the identifier sequence. They might try to inject collaborative signals into the token embeddings during LLM training, but this still faces the misalignment issue.
- Code Assignment Bias: As highlighted in Figure 3, the assignment of codes to items can be highly imbalanced. Some codes might be used far more frequently than others, leading to an item generation bias where items associated with popular codes are over-recommended, undermining fairness and diversity. LC-Rec attempts to address this with the Sinkhorn-Knopp algorithm for intra-layer code fairness, but LETTER argues this misses the fundamental issue of the code embedding distribution.

The following figure illustrates the misalignment issue between item identifiers and collaborative signals, which is a key motivation for LETTER.
This image is a schematic showing the relationships between items' semantics, identifiers, and interacting users in a recommender system. It distinguishes similar, aligned, misaligned, and dissimilar signals, using examples of titles/descriptions, identifiers, and interacted users to illustrate matches and mismatches between semantic and collaborative signals.
The preceding figure (Figure 2 from the original paper) depicts the misalignment between item identifiers (textual or code-based, derived from semantics) and collaborative signals (derived from user interactions). It shows how items with similar semantics might have similar identifiers but vastly different user interaction patterns, leading to a mismatch. Conversely, items with different semantics but similar collaborative patterns might be forced into similar CF embeddings by a CF model, creating another type of misalignment if identifiers only capture semantics.
The following figure illustrates the code assignment bias problem in existing codebook-based methods.
This image is a chart showing the frequency distributions of target identifiers in the training data and of generated identifiers across code-percentage bins. The blue bars denote the frequency of target identifiers in the training data, while the line denotes the frequency of generated identifiers; frequency decreases as the code percentage increases.
The preceding figure (Figure 3 from the original paper) illustrates the code assignment bias and item generation bias observed in TIGER on the Instruments dataset. It shows that certain codes (lower percentage values on the x-axis) are assigned to items much more frequently in the training data (blue bars), and TIGER tends to generate items associated with these high-frequency codes (orange line), amplifying the bias towards popular items.
3.3. Technological Evolution
The evolution of recommender systems has moved from traditional methods like Collaborative Filtering and Matrix Factorization to deep learning-based approaches, including sequential models (SASRec, BERT4Rec) and graph neural networks (LightGCN). More recently, the advent of LLMs has spurred generative recommendation, shifting from score prediction to item generation. Item tokenization has evolved alongside this, from simple ID identifiers to textual identifiers and then codebook-based identifiers. Each step aimed to better bridge the gap between item data and the LLM's language space, trying to encode more semantic and collaborative information. LETTER fits into this trajectory by addressing the holistic challenges of item tokenization, particularly the concurrent integration of hierarchical semantics, collaborative signals, and code assignment diversity, which were not fully achieved by prior codebook-based or other methods.
3.4. Differentiation Analysis
LETTER distinguishes itself from prior item tokenization methods primarily by its comprehensive approach to identifier design, addressing multiple critical aspects simultaneously:
- Versus ID Identifiers (e.g., P5-SemiD, P5-CID): LETTER goes beyond fixed or pre-structured ID assignments. It uses a learnable tokenizer (RQ-VAE) to dynamically encode rich hierarchical semantics, which ID identifiers largely lack. While P5-CID incorporates collaborative signals, LETTER integrates them directly into the quantization of semantic embeddings, offering a more flexible and adaptable alignment.
- Versus Textual Identifiers (e.g., BIGRec, P5-TID): LETTER's RQ-VAE explicitly builds hierarchical semantics into the token sequence, unlike raw textual descriptions that lack this structure. Crucially, LETTER incorporates collaborative signals into the token assignment, addressing the misalignment issue where semantically similar items can have different collaborative patterns. Textual identifiers struggle with this, often leading to collisions if collaborative signals are only injected into token embeddings post-tokenization.
- Versus Codebook-based Identifiers (e.g., TIGER, LC-Rec): While TIGER also uses RQ-VAE for hierarchical semantics, LETTER significantly extends it by:
  - Explicit Collaborative Regularization: LETTER introduces a dedicated contrastive alignment loss to directly align quantized embeddings with CF embeddings, ensuring that the code sequence itself reflects collaborative signals, not just item semantics. This fundamentally alters the code assignment to better suit collaborative patterns. Previous codebook-based methods primarily focused on semantic encoding or injected collaborative signals only at the token embedding level, not the code assignment level.
  - Diversity Regularization: LETTER explicitly tackles the code assignment bias problem (observed in TIGER and others) with a diversity loss that encourages a more uniform distribution of code embeddings. This directly mitigates item generation bias, an aspect largely overlooked or incompletely addressed by prior codebook-based methods (e.g., LC-Rec uses Sinkhorn-Knopp for intra-layer codes, but LETTER argues this misses the essence of the code embedding distribution).

In summary, LETTER's innovation lies in its holistic framework that systematically addresses the multi-faceted requirements of an ideal identifier, simultaneously optimizing for hierarchical semantics, collaborative signals, and assignment diversity within a learnable tokenizer.
4. Methodology
The core idea of LETTER is to create an ideal identifier for generative recommendation that is both semantically rich and collaboratively informed, while also promoting diversity in item generation. This is achieved by building upon a codebook-based tokenization approach (RQ-VAE) and augmenting it with two novel regularization terms: collaborative regularization and diversity regularization. Additionally, LETTER introduces a ranking-guided generation loss during the LLM training phase to enhance the ranking ability of generative models.
4.1. Principles
LETTER is founded on three essential objectives for an ideal identifier:
- Hierarchical Semantic Integration: The token sequence should encode item semantics hierarchically, moving from coarse-grained to fine-grained details. This aligns with the autoregressive generation process of LLMs, where initial tokens set a broad context and subsequent tokens refine it.
- Collaborative Signal Incorporation: The token assignment for an item should reflect its collaborative signals (user interaction patterns). This means items with similar interaction histories should have similar token sequences, even if their raw semantic descriptions differ slightly. This aims to resolve the misalignment issue between semantics and collaborative patterns.
- High Diversity of Code Assignment: The assignment of codes within the codebook should be diverse and balanced, avoiding concentration on a few codes. This mitigates code assignment bias, which can lead to unfair item generation bias (over-recommending popular items associated with frequently assigned codes).
4.2. Core Methodology In-depth (Layer by Layer)
The overall architecture of LETTER is illustrated in the figure below. It shows the learnable tokenizer at the core, generating identifiers with hierarchical semantics, enhanced by collaborative regularization and diversity regularization.
This image is a schematic of the LETTER framework. The upper-left part outlines diversity regularization, covering code embeddings and their clusters. The central flow shows how the encoder turns the semantic embedding into a quantized embedding and applies semantic regularization via the reconstructed semantic embedding. The right side highlights collaborative regularization, and the diagram marks the roles of the different regularization losses in the system.
The preceding figure (Figure 4 from the original paper) provides an overview of LETTER's architecture. It depicts how semantic regularization ensures hierarchical semantic encoding via RQ-VAE, collaborative regularization aligns the identifier's code sequence with collaborative signals, and diversity regularization alleviates code assignment bias.
4.2.1. Semantic Regularization
To achieve identifiers with hierarchical semantics, LETTER builds its tokenizer based on Residual Quantized VAE (RQ-VAE) [16]. RQ-VAE is chosen for its ability to recursively quantize semantic residuals, naturally producing identifiers that capture semantics from coarse to fine-grained levels.
The process involves two main steps:
4.2.1.1. Semantic Embedding Extraction
Given an item, its content information (e.g., title and description) is first processed to extract a semantic embedding. This is done using a pre-trained semantic extractor, such as LLaMA-7B [41], which yields an initial semantic embedding $s$. This high-dimensional embedding is then compressed into a lower-dimensional latent semantic embedding $z$ through an encoder network:
$
z = \mathrm{Encoder}(s)
$
Here, $\mathrm{Encoder}(\cdot)$ is a neural network that maps the high-dimensional semantic embedding $s$ to the lower-dimensional latent semantic embedding $z$.
4.2.1.2. Semantic Embedding Quantization
The latent semantic embedding $z$ is then quantized into a code sequence of length $L$ using $L$-level codebooks. For each code level $l$, there is a dedicated codebook $Q_l = \{e_{l,i}\}_{i=1}^{N}$, where $e_{l,i}$ is a learnable code embedding (a vector with the same dimension as $z$) and $N$ is the codebook size (number of entries in each codebook).
The residual quantization process is formulated as follows:
$
\begin{cases}
c_l = \arg\min_i \| r_{l-1} - e_{l,i} \|^2, & e_{l,i} \in Q_l, \\
r_l = r_{l-1} - e_{l,c_l}, &
\end{cases}
$ (1)
- Symbol Explanation:
  - $c_l$: The index of the assigned code from the $l$-th level codebook. It is found by selecting the code embedding in $Q_l$ that is closest (in Euclidean distance) to the current semantic residual $r_{l-1}$.
  - $r_{l-1}$: The semantic residual from the previous code level. It represents the part of the semantic information that has not yet been captured by the code embeddings from levels $1$ to $l-1$.
  - $e_{l,i}$: A code embedding (vector) from the $l$-th codebook $Q_l$.
  - $r_l$: The new semantic residual after subtracting the selected code embedding $e_{l,c_l}$ from the previous residual $r_{l-1}$. This residual is then passed to the next code level.

The process starts with $r_0 = z$, meaning the initial semantic residual is the latent semantic embedding itself. After recursively quantizing through all $L$ levels, LETTER obtains the quantized identifier $c = [c_1, \dots, c_L]$ (a sequence of code indices) and the quantized embedding $\hat{z} = \sum_{l=1}^{L} e_{l,c_l}$ (the sum of all selected code embeddings). This quantized embedding is then decoded back to a reconstructed semantic embedding $\hat{s}$ using a decoder network.
The loss for semantic regularization is formulated as:
$
\mathcal{L}_{\mathrm{Sem}} = \mathcal{L}_{\mathrm{Recon}} + \mathcal{L}_{\mathrm{RQ\text{-}VAE}}, \quad \text{where} \quad
\begin{cases}
\mathcal{L}_{\mathrm{Recon}} = \| s - \hat{s} \|^2, \\
\mathcal{L}_{\mathrm{RQ\text{-}VAE}} = \sum_{l=1}^{L} \| \mathrm{sg}[r_{l-1}] - e_{l,c_l} \|^2 + \mu \| r_{l-1} - \mathrm{sg}[e_{l,c_l}] \|^2,
\end{cases}
$ (2)
- Symbol Explanation:
  - $\mathcal{L}_{\mathrm{Sem}}$: The total semantic regularization loss.
  - $\mathcal{L}_{\mathrm{Recon}}$: The reconstruction loss, which measures the squared Euclidean distance (L2 norm) between the original semantic embedding $s$ and its reconstructed semantic embedding $\hat{s}$. This term ensures that the quantized embedding retains the essential semantic information.
  - $\mathcal{L}_{\mathrm{RQ\text{-}VAE}}$: The RQ-VAE-specific loss term, summed over all code levels. This term is a crucial part of VQ-VAE and RQ-VAE training.
  - $\mathrm{sg}[\cdot]$: The stop-gradient operation [42], meaning gradients do not flow through the argument of $\mathrm{sg}$.
  - The first term, $\| \mathrm{sg}[r_{l-1}] - e_{l,c_l} \|^2$: Encourages the selected code embedding $e_{l,c_l}$ to move closer to the residual $r_{l-1}$. The stop-gradient on $r_{l-1}$ means only $e_{l,c_l}$ is updated by this term.
  - The second term, $\mu \| r_{l-1} - \mathrm{sg}[e_{l,c_l}] \|^2$: Encourages the encoder to produce latent semantic embeddings (and subsequent residuals $r_{l-1}$) that are closer to the chosen code embeddings. The stop-gradient on $e_{l,c_l}$ means only $r_{l-1}$ (and thus the encoder) is updated by this term.
  - $\mu$: A hyper-parameter coefficient that balances these two updates, controlling the commitment of the encoder to the code embeddings.

By applying semantic regularization, the code sequence learns to encode hierarchical semantics, which facilitates coarse-grained to fine-grained generation by the LLM and improves cold-start generalization.
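To illustrate how Equations (1) and (2) fit together, here is a minimal PyTorch-style sketch of the residual quantization and the RQ-VAE loss terms. The function name, the straight-through estimator, and the use of mean squared errors are assumptions for illustration, not the authors' released implementation; the reconstruction term $\| s - \hat{s} \|^2$ would be added outside this function after decoding the returned quantized embedding.

```python
import torch
import torch.nn.functional as F

def residual_quantize(z, codebooks, mu=0.25):
    """Sketch of Eq. (1)-(2): residually quantize latent z with L codebooks.

    z:         (batch, dim) latent semantic embeddings from the encoder.
    codebooks: list of L tensors, each (N, dim), the learnable code embeddings.
    Returns the code indices, the quantized embedding, and the RQ-VAE loss.
    """
    residual = z
    quantized = torch.zeros_like(z)
    codes, rq_loss = [], 0.0
    for codebook in codebooks:                          # L quantization levels
        # Nearest code embedding to the current residual (squared Euclidean distance).
        dists = torch.cdist(residual, codebook) ** 2    # (batch, N)
        idx = dists.argmin(dim=-1)                      # c_l
        e = codebook[idx]                               # e_{l, c_l}
        # Codebook term + commitment term with stop-gradient (detach), as in Eq. (2).
        rq_loss = rq_loss + F.mse_loss(e, residual.detach()) \
                          + mu * F.mse_loss(residual, e.detach())
        # Straight-through estimator so gradients still reach the encoder.
        quantized = quantized + residual + (e - residual).detach()
        residual = residual - e.detach()                # r_l = r_{l-1} - e_{l, c_l}
        codes.append(idx)
    return torch.stack(codes, dim=-1), quantized, rq_loss
```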
4.2.2. Collaborative Regularization
To overcome the limitation of existing methods whose code sequences lack collaborative signals, LETTER introduces collaborative regularization. It injects collaborative signals directly into the quantized embedding $\hat{z}$ (and thus, implicitly, the code sequence $c$) by aligning it with CF embeddings using contrastive learning.
Specifically, LETTER utilizes a pre-trained Collaborative Filtering (CF) model (e.g., SASRec [15] or LightGCN [11]) to obtain CF embeddings for all items. Let $h_i$ denote the CF embedding of item $i$. The goal is to make the quantized embedding $\hat{z}_i$ of item $i$ similar to its CF embedding $h_i$.
The collaborative regularization loss $\mathcal{L}_{\mathrm{CF}}$ is formulated as a contrastive alignment loss:
$
\mathcal{L}_{\mathrm{CF}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\langle \hat{z}_i, h_i \rangle)}{\sum_{j=1}^{B} \exp(\langle \hat{z}_i, h_j \rangle)}
$ (3)
- Symbol Explanation:
  - $\mathcal{L}_{\mathrm{CF}}$: The collaborative regularization loss.
  - $B$: The batch size, i.e., the number of items processed in one training step.
  - $\hat{z}_i$: The quantized embedding of item $i$, obtained from semantic embedding quantization.
  - $h_i$: The CF embedding of item $i$, obtained from a pre-trained CF model.
  - $\langle \cdot, \cdot \rangle$: The inner product (dot product) between two vectors, which measures their similarity.
  - The term $\frac{\exp(\langle \hat{z}_i, h_i \rangle)}{\sum_{j=1}^{B}\exp(\langle \hat{z}_i, h_j \rangle)}$ is a softmax-style probability: the numerator is the similarity between item $i$'s quantized embedding and its own CF embedding (the positive pair), and the denominator sums the similarities between $\hat{z}_i$ and all CF embeddings in the batch (the other items acting as negatives). Minimizing the negative log of this probability maximizes the similarity between $\hat{z}_i$ and its corresponding $h_i$ while pushing $\hat{z}_i$ away from the other CF embeddings in the batch.

This collaborative regularization encourages items with similar collaborative interactions to have similar quantized embeddings and, by extension, similar code sequences. This contrasts with methods that only inject collaborative signals into token embeddings after quantization, which can lead to collisions when code sequences are fixed based solely on semantics.
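A minimal PyTorch-style sketch of the in-batch contrastive alignment in Equation (3) follows. It assumes the quantized embeddings and CF embeddings share the same dimension (otherwise a small projection head would be needed) and uses the standard cross-entropy formulation with the diagonal of the similarity matrix as positives.

```python
import torch
import torch.nn.functional as F

def collaborative_alignment_loss(z_hat, h):
    """Sketch of the in-batch contrastive alignment of Eq. (3).

    z_hat: (B, d) quantized embeddings from the LETTER tokenizer.
    h:     (B, d) CF embeddings of the same items from a pre-trained CF model.
    """
    logits = z_hat @ h.t()                                  # (B, B) inner products <z_hat_i, h_j>
    targets = torch.arange(z_hat.size(0), device=z_hat.device)
    # Cross-entropy with the diagonal as positives = InfoNCE-style alignment loss.
    return F.cross_entropy(logits, targets)
```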
4.2.3. Diversity Regularization
To tackle the code assignment bias (where some codes are over-assigned, leading to item generation bias), LETTER introduces diversity regularization. The intuition is that a more uniform distribution of code embeddings in the latent space (as shown in Figure 5(b)) will lead to a more balanced assignment of codes to items, compared to a biased distribution (Figure 5(a)).
This image is a schematic contrasting two cases of code assignment: on the left, a biased code embedding distribution together with the related latent semantic embeddings; on the right, a uniform code embedding distribution with the corresponding code embeddings and assignments.
The preceding figure (Figure 5 from the original paper) contrasts a biased code embedding distribution (a) with a uniform code embedding distribution (b), illustrating how the latter facilitates more balanced code assignments.
The goal is to improve the diversity of code embeddings within each codebook.
The paper describes the diversity loss as follows: for each codebook, the code embeddings are clustered into groups using constrained K-means [3]. The diversity loss then regularizes these clustered code embeddings by:

- Pulling code embeddings from the same cluster closer together.
- Pushing code embeddings from different clusters farther apart.

While the explicit mathematical formula for $\mathcal{L}_{\mathrm{Div}}$ (Equation 4) is not reproduced here, the description indicates a contrastive-style mechanism applied to the clustered code embeddings: for the code embedding assigned to an item (its nearest code), the positive sample is the code embedding of a randomly selected code from the same cluster, and the negatives are all other code embeddings in the codebook. The core idea is to enforce separation between clusters while maintaining coherence within them. A sketch of one such formulation is given below.
4.2.4. Overall Loss
The complete training loss for LETTER is a weighted sum of the three regularization terms:
$
\mathcal{L}_{\mathrm{LETTER}} = \mathcal{L}_{\mathrm{Sem}} + \alpha \mathcal{L}_{\mathrm{CF}} + \beta \mathcal{L}_{\mathrm{Div}}
$ (5)
- Symbol Explanation:
  - $\mathcal{L}_{\mathrm{LETTER}}$: The total loss for training the LETTER tokenizer.
  - $\mathcal{L}_{\mathrm{Sem}}$: The semantic regularization loss (Equation 2).
  - $\mathcal{L}_{\mathrm{CF}}$: The collaborative regularization loss (Equation 3).
  - $\mathcal{L}_{\mathrm{Div}}$: The diversity regularization loss (described in Section 4.2.3).
  - $\alpha$: A hyper-parameter controlling the strength of collaborative regularization.
  - $\beta$: A hyper-parameter controlling the strength of diversity regularization.
4.2.5. Instantiation on LLM-based Generative Recommender Models
4.2.5.1. Training
The training process involves two stages:
- Tokenizer Training: First, the LETTER tokenizer is trained independently on the recommendation items using the overall loss (Equation 5).
- LLM Fine-tuning: Once the LETTER tokenizer is well-trained, it is used to tokenize all items. Each item is indexed into an identifier (a code sequence). User interaction sequences are then translated into sequences of these item identifiers. For a given user, a training sample consists of the identifiers of the historically interacted items and the identifier of the next interacted item.

Ranking-Guided Generation Loss: Existing LLM-based generative recommender models typically optimize LLMs with a generation loss (negative log-likelihood minimization). However, this generation loss might not be optimally aligned with ranking optimization. To address this, LETTER proposes a ranking-guided generation loss $\mathcal{L}_{\mathrm{rank}}$, which modifies the traditional generation loss by introducing an adjustable temperature parameter $\bar{\tau}$ to emphasize penalties for hard-negative samples, thereby enhancing the ranking ability.

The ranking-guided generation loss is defined as:
$
\mathcal{L}_{\mathrm{rank}} = -\sum_{t=1}^{|y|}\log \frac{\exp(p(y_t) / \bar{\tau})}{\sum_{v\in \mathcal{V}}\exp(p(v) / \bar{\tau})}
$ (6)

- Symbol Explanation:
  - $\mathcal{L}_{\mathrm{rank}}$: The ranking-guided generation loss. The summation is over all tokens in the target item identifier $y$.
  - $|y|$: The length of the target item identifier (i.e., the number of tokens).
  - $t$: The index of the current token being predicted in the sequence.
  - $y_t$: The $t$-th token of the target identifier $y$.
  - $p(y_t)$: The unnormalized log-probability (logit) predicted by the generative model for the true token $y_t$.
  - $\bar{\tau}$: The adjustable temperature hyper-parameter. A smaller $\bar{\tau}$ sharpens the probability distribution, making the model more confident about high-logit tokens and penalizing low-logit hard-negative samples more heavily.
  - $\mathcal{V}$: The entire token vocabulary (all possible code indices from the codebooks).
  - The expression $\frac{\exp(p(y_t) / \bar{\tau})}{\sum_{v\in \mathcal{V}}\exp(p(v) / \bar{\tau})}$ is a softmax over the logits, scaled by $\bar{\tau}$, yielding a probability distribution over the vocabulary for the $t$-th token. Minimizing its negative logarithm maximizes the likelihood of generating the true token $y_t$.

The paper provides a theoretical justification for this loss:

PROPOSITION 1. For a given ranking-guided generation loss $\mathcal{L}_{\mathrm{rank}}$ and a parameter $\bar{\tau}$, the following statements hold:

1. Minimizing $\mathcal{L}_{\mathrm{rank}}$ is equivalent to optimizing hard-negative items for users, where a smaller $\bar{\tau}$ intensifies the penalty on hard negatives.
2. The minimization of $\mathcal{L}_{\mathrm{rank}}$ is associated with the optimization of one-way partial AUC [36], which is strongly correlated with ranking metrics such as Recall and NDCG, ultimately leading to an improvement in the top-K ranking ability.

The full proof of Proposition 1 is provided in Appendix 7 of the original paper. It analyzes the gradients of the loss function (Eq. 7, 8) to show how $\bar{\tau}$ influences the weights of negative samples, highlighting that hard-negative samples receive higher weights with smaller $\bar{\tau}$. It then links this hard-negative mining to Distributionally Robust Optimization (DRO) (Eq. 10, 11, 12), which is, in turn, a surrogate for one-way partial AUC (OPAUC) (Eq. 13). Finally, Theorem 1 (Eq. 14) in the Appendix explicitly shows the strong correlation between OPAUC and top-K ranking metrics such as Recall and NDCG.
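A minimal PyTorch-style sketch of Equation (6) follows; the tensor layout (per-token logits for one target identifier) and the plain summation over tokens are assumptions about how the generative model's outputs are arranged.

```python
import torch
import torch.nn.functional as F

def ranking_guided_generation_loss(logits, target_ids, tau=1.0):
    """Sketch of Eq. (6): temperature-scaled negative log-likelihood over identifier tokens.

    logits:     (T, |V|) unnormalized scores for the T tokens of the target identifier.
    target_ids: (T,) ground-truth code indices y_1 ... y_T.
    tau:        temperature; smaller values sharpen the softmax.
    """
    log_probs = F.log_softmax(logits / tau, dim=-1)          # softmax over the vocabulary
    token_nll = -log_probs[torch.arange(target_ids.size(0)), target_ids]
    return token_nll.sum()                                    # sum over the |y| tokens
```

With a temperature below 1, the softmax sharpens, so wrong tokens with high logits (hard negatives) contribute larger penalties, which matches the behavior described in Proposition 1.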
4.2.5.2. Inference
During inference, generative recommender models autoregressively generate the code sequence of the next item. The next token is selected as the token with the highest predicted probability over the token vocabulary $\mathcal{V}$:
$
\hat{y}_t = \arg\max_{v \in \mathcal{V}} P_\theta(v \mid y_{<t}, x)
$
- Symbol Explanation:
  - $\hat{y}_t$: The predicted $t$-th token of the generated identifier.
  - $P_\theta(v \mid y_{<t}, x)$: The probability that token $v$ is the next token, given the preceding tokens $y_{<t}$ and the user's historical interactions $x$. This probability is computed by the generative model with parameters $\theta$.
  - $\mathcal{V}$: The token vocabulary.

To ensure that the generated sequences form valid item identifiers, LETTER employs constrained generation [8] using a Trie (prefix tree) [5]. A Trie allows the model to efficiently find all strictly valid successor tokens at each step, preventing the generation of invalid or non-existent item identifiers.
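To illustrate constrained generation, here is a toy prefix tree (Trie) over code sequences that restricts each decoding step to valid successor tokens; the class, method names, and example identifiers are hypothetical and only sketch the idea.

```python
class IdentifierTrie:
    """Toy prefix tree over valid item identifiers for constrained decoding (a sketch,
    not the paper's implementation)."""

    def __init__(self, identifiers):
        self.root = {}
        for codes in identifiers:              # each identifier is a sequence of code indices
            node = self.root
            for c in codes:
                node = node.setdefault(c, {})

    def allowed_next(self, prefix):
        """Return the set of code indices that can validly follow the given prefix."""
        node = self.root
        for c in prefix:
            node = node.get(c)
            if node is None:
                return set()
        return set(node.keys())


# Example: mask invalid tokens before the argmax at each autoregressive decoding step.
trie = IdentifierTrie([(3, 17, 201, 9), (3, 42, 7, 88)])
print(trie.allowed_next((3,)))                 # {17, 42}
```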
5. Experimental Setup
5.1. Datasets
The experiments are conducted on three real-world recommendation datasets from different domains:
- 1. Instruments:
- Source: Amazon review datasets [29].
- Characteristics: Contains user interactions related to music gears (e.g., musical instruments and accessories).
- Domain: E-commerce, product reviews.
- 2. Beauty:
- Source: Amazon review datasets [29].
- Characteristics: Encompasses user interactions with a wide range of beauty products.
- Domain: E-commerce, product reviews.
- 3. Yelp:
- Source: Popular Yelp platform dataset.
- Characteristics: Comprises business interactions, such as user reviews and ratings for restaurants, shops, and services.
- Domain: Local business reviews, services.
Preprocessing:
- The datasets underwent preprocessing techniques consistent with previous work [15, 32].
- Sparse users and items with fewer than 5 interactions were discarded to ensure sufficient data density.
- A sequential recommendation setting was adopted, where the goal is to predict the next item a user will interact with based on their history.
- The leave-one-out strategy [32, 50] was used for splitting datasets: for each user, the last interaction is used as the test item, the second-to-last as the validation item, and the rest for training.
- For training, the number of items in a user's history was restricted to 20, following [14, 50], to manage sequence length for LLM inputs.
5.2. Evaluation Metrics
The performance of the models is evaluated using two standard top-K ranking metrics, Recall@K (R@K) and NDCG@K (N@K), with K set to 5 and 10. Both metrics are defined in Section 3.1.7: Recall@K measures the proportion of a user's relevant test items that appear in the top-K recommendations, while NDCG@K additionally rewards placing relevant items at higher ranks and is normalized so that a perfect ranking scores 1.0.
5.3. Baselines
LETTER is compared against a comprehensive set of baselines, categorized into traditional recommender models and LLM-based generative recommender models with different item identifier types.
5.3.1. Traditional Recommender Models
These models do not rely on LLMs for generation but are included for a broader comparison of recommendation performance.
- MF [35]: Matrix Factorization decomposes the user-item interaction matrix into lower-dimensional user and item embeddings.
- Caser [40]: Convolutional Sequence Embedding Recommendation employs convolutional neural networks to capture sequential and positional information in user interactions.
- HGN [28]: Hierarchical Gating Networks use hierarchical (feature- and instance-level) gating to learn user and item representations for interaction prediction.
- BERT4Rec [37]: Leverages BERT's pre-trained language representations to capture sequential user-item relationships.
- LightGCN [11]: A lightweight graph convolutional network model that simplifies graph convolutions for recommendation, focusing on high-order connections.
- SASRec [15]: Self-Attentive Sequential Recommendation employs self-attention mechanisms to capture long-term dependencies in user interaction history.
5.3.2. LLM-based Generative Recommender Models
These models utilize LLMs for recommendation, categorized by their item tokenization strategy.
5.3.2.1. ID Identifiers
- P5-SemiD [14]: Assigns item identifiers based on item metadata (e.g., categories, attributes), essentially using semi-structured IDs.
- P5-CID [14]: Incorporates collaborative signals into item identifiers by building a spectral clustering tree from item co-appearance graphs, creating collaboratively informed IDs.
5.3.2.2. Textual Identifiers
- BIGRec [1]: Uses items' titles directly as textual identifiers for LLM-based generative recommendation.
- P5-TID [14]: Similar to BIGRec, it leverages item titles as textual identifiers for an LLM-based generative recommender model.
5.3.2.3. Codebook-based Identifiers
- TIGER [32]: Transformer Index for GEnerative Recommenders introduces codebook-based identifiers via RQ-VAE, quantizing item semantic information into a code sequence for LLM-based generative recommendation. This is one of the backend models LETTER is instantiated upon.
- LC-Rec [50]: Also uses codebook-based identifiers and employs auxiliary alignment tasks to better integrate LLM knowledge by connecting generated code sequences with natural language. This is the other backend model LETTER is instantiated upon.
5.4. Implementation Details
- Backend Models: LETTER is instantiated on two representative LLM-based generative recommender models: TIGER [32] and LC-Rec [50].
  - For TIGER, as official implementations were not released, the authors followed the paper for their own implementation.
  - For LC-Rec, the parameter-efficient fine-tuning (PEFT) technique LoRA [12] was used to fine-tune LLaMA-7B [41].
- Semantic Embedding Extraction: LLaMA-7B [41] was adopted to encode item content information (titles, descriptions) into the initial semantic embeddings, following [50].
- CF Embeddings: 32-dimensional item embeddings were obtained from a SASRec [15] model and used for collaborative regularization.
- Hardware: All experiments were conducted on 4 NVIDIA RTX A5000 GPUs.
5.4.1. LETTER Tokenizer Specifics
- RQ-VAE Structure: A 4-level codebook structure was used for the RQ-VAE ($L = 4$).
- Codebook Size and Dimension: Each codebook comprised 256 code embeddings ($N = 256$), with each embedding having a dimension of 32.
- Diversity Regularization: The number of clusters for constrained K-means was set to 10.
- Tokenizer Training: LETTER was trained for 20,000 epochs.
  - Optimizer: AdamW [27].
  - Learning Rate: .
  - Batch Size: 1,024.
- Hyper-parameters:
  - $\mu$ (RQ-VAE coefficient): Set to 0.25, following [32].
  - $\alpha$ (strength of collaborative regularization): Searched in the range of .
  - $\beta$ (strength of diversity regularization): Searched in the range of .
5.4.2. LLM Fine-tuning
- After LETTER training, the backend generative models (TIGER and LC-Rec) were fine-tuned until convergence based on validation performance.
- Learning Rates:
  - For TIGER: .
  - For LC-Rec: .
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Overall Performance (RQ1)
The following are the results from Table 1 of the original paper, comparing LETTER instantiated on TIGER (LETTER-TIGER) and LC-REC (LETTER-LC-REC) with various baselines across three datasets.
| Model | Instruments |  |  |  | Beauty |  |  |  | Yelp |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | R@5 | R@10 | N@5 | N@10 | R@5 | R@10 | N@5 | N@10 | R@5 | R@10 | N@5 | N@10 |
| MF | 0.0479 | 0.0735 | 0.0330 | 0.0412 | 0.0294 | 0.0474 | 0.0145 | 0.0191 | 0.0220 | 0.0296 | 0.0142 | 0.0177 |
| Caser | 0.0543 | 0.0710 | 0.0355 | 0.0409 | 0.0205 | 0.0347 | 0.0131 | 0.0176 | 0.0150 | 0.0203 | 0.0094 | 0.0118 |
| HGN | 0.0813 | 0.1048 | 0.0668 | 0.0774 | 0.0325 | 0.0512 | 0.0206 | 0.0266 | 0.0186 | 0.0245 | 0.0118 | 0.0147 |
| Bert4Rec | 0.0671 | 0.0822 | 0.0560 | 0.0608 | 0.0203 | 0.0347 | 0.0124 | 0.0170 | 0.0186 | 0.0249 | 0.0119 | 0.0149 |
| LightGCN | 0.0794 | 0.1000 | 0.0662 | 0.0728 | 0.0305 | 0.0511 | 0.0194 | 0.0260 | 0.0248 | 0.0321 | 0.0158 | 0.0196 |
| SASRec | 0.0751 | 0.0947 | 0.0627 | 0.0690 | 0.0380 | 0.0588 | 0.0246 | 0.0313 | 0.0183 | 0.0238 | 0.0117 | 0.0146 |
| BIGRec | 0.0513 | 0.0576 | 0.0470 | 0.0491 | 0.0243 | 0.0299 | 0.0181 | 0.0198 | 0.0154 | 0.0191 | 0.0110 | 0.0127 |
| P5-TID | 0.0000 | 0.0001 | 0.0000 | 0.0000 | 0.0182 | 0.0432 | 0.0132 | 0.0254 | 0.0184 | 0.0251 | 0.0124 | 0.0156 |
| P5-SemiID | 0.0775 | 0.0964 | 0.0669 | 0.0730 | 0.0393 | 0.0584 | 0.0273 | 0.0335 | 0.0202 | 0.0268 | 0.0130 | 0.0163 |
| P5-CID | 0.0809 | 0.0987 | 0.0695 | 0.0751 | 0.0404 | 0.0597 | 0.0284 | 0.0347 | 0.0219 | 0.0284 | 0.0141 | 0.0174 |
| TIGER | 0.0870 | 0.1058 | 0.0737 | 0.0797 | 0.0395 | 0.0610 | 0.0253 | 0.0321 | 0.0262 | 0.0331 | 0.0169 | 0.0207 |
| LETTER-TIGER | 0.0909 | 0.1122 | 0.0763 | 0.0831 | 0.0431 | 0.0672 | 0.0277 | 0.0364 | 0.0286 | 0.0364 | 0.0184 | 0.0227 |
| LC-Rec | 0.0824 | 0.1006 | 0.0712 | 0.0772 | 0.0443 | 0.0642 | 0.0311 | 0.0374 | 0.0230 | 0.0298 | 0.0148 | 0.0184 |
| LETTER-LC-Rec | 0.0913 | 0.1115 | 0.0789 | 0.0854 | 0.0505 | 0.0703 | 0.0355 | 0.0418 | 0.0255 | 0.0326 | 0.0166 | 0.0205 |
Observations from Table 1:
- Comparison of LLM-based Models with ID Identifiers: Between P5-CID and P5-SemiD, P5-CID generally outperforms P5-SemiD. This is attributed to P5-CID leveraging collaborative signals (from item co-appearance graphs) in its identifier assignment, which helps LLMs capture user behavioral patterns. P5-SemiD, which assigns IDs based on coarse item categories, struggles to capture fine-grained semantics and suffers from misalignment between semantic and collaborative signals.
- Comparison with Textual Identifiers: BIGRec and P5-TID (with textual identifiers) generally perform worse than codebook-based and even some ID-identifier methods. This is likely due to the inherent misalignment issue, where similar semantics in text do not necessarily correspond to similar user interactions, hindering the learning of accurate collaborative signals. P5-TID performs particularly poorly on Instruments, indicating that direct textual representation can be ineffective if not carefully handled.
- Superiority of Codebook-based Identifiers (TIGER, LC-Rec): TIGER and LC-Rec (codebook-based methods) generally outperform ID and textual identifier methods in most cases. This suggests that the hierarchical semantics encoded by RQ-VAE-like approaches provide a more effective representation for generative recommendation by distinguishing items through fine-grained details.
- LETTER's Consistent Improvements: The most significant observation is that LETTER consistently and substantially improves the performance of its backend models, TIGER and LC-Rec, across all three datasets and all metrics. For instance, LETTER-TIGER improves R@10 on Instruments from 0.1058 to 0.1122, and LETTER-LC-Rec improves R@10 on Beauty from 0.0642 to 0.0703. This robust improvement validates the core hypothesis of LETTER: integrating collaborative signals into code assignment and enhancing code assignment diversity are crucial for effective item tokenization. The improvements are attributed to:
  - CF Integration: Aligning quantized embeddings with CF embeddings during code assignment addresses the misalignment between semantic and collaborative signals, encouraging similar code sequences for items with similar collaborative patterns.
  - Improved Diversity: Diversity regularization mitigates code assignment bias, leading to a more balanced generation of items and overcoming the item generation bias.
6.2. In-depth Analysis
6.2.1. Ablation Study (RQ2)
To investigate the contribution of each regularization component within LETTER, an ablation study was conducted on TIGER using the Instruments and Beauty datasets.
The following are the results from Table 2 of the original paper:
| Variants | Instruments |  | Beauty |  |
|---|---|---|---|---|
|  | R@10 | N@10 | R@10 | N@10 |
| (0): TIGER | 0.1058 | 0.0797 | 0.0610 | 0.0331 |
| (1): TIGER w/ c. r. | 0.1078 | 0.0810 | 0.0660 | 0.0351 |
| (2): TIGER w/ d. r. | 0.1075 | 0.0809 | 0.0618 | 0.0335 |
| (3): (1) w/ d. r. | 0.1092 | 0.0819 | 0.0672 | 0.0357 |
| (4): LETTER-TIGER | 0.1122 | 0.0831 | 0.0672 | 0.0364 |
Observations from Table 2:
- Effectiveness of Individual Regularizations:
  - TIGER w/ c. r. (incorporating collaborative regularization) improves over base TIGER (0) on both datasets (e.g., R@10 on Instruments: 0.1078 vs. 0.1058). This confirms the value of injecting collaborative signals into the code assignment.
  - TIGER w/ d. r. (incorporating diversity regularization) also improves over base TIGER (0) (e.g., R@10 on Instruments: 0.1075 vs. 0.1058). This validates the effectiveness of enhancing code embedding diversity to mitigate code assignment bias.
- Combined Regularizations: Variant (3) (combining collaborative and diversity regularization) achieves better results than either individual regularization and base TIGER (e.g., R@10 on Instruments: 0.1092). This indicates that jointly considering semantics, collaboration, and diversity in code assignment is more effective than any single aspect.
- Ranking-Guided Generation Loss: Variant (4), LETTER-TIGER (which includes all regularizations and the ranking-guided generation loss), achieves the best performance across all variants (e.g., R@10 on Instruments: 0.1122). This highlights the effectiveness of the ranking-guided generation loss in improving top-K ranking ability by penalizing hard-negative samples more effectively.
6.2.2. Code Assignment Distribution (RQ2)
To ascertain if diversity regularization effectively mitigates code assignment bias, the distribution of the first code in item identifiers was analyzed.
The following figure illustrates the normalized frequency of different code assignment groups.
This image is a comparison chart of the normalized frequency of code assignment groups (ranked by popularity). The left panel shows TIGER and TIGER with diversity regularization; the right panel shows TIGER with collaborative regularization and LETTER, each annotated with the total codebook size and the number of codes actually used.
The preceding figure (Figure 6 from the original paper) compares the normalized frequency distribution of the first code in item identifiers. The left panel compares TIGER (without diversity regularization) and TIGER with diversity regularization. The right panel compares TIGER with collaborative regularization and LETTER (which combines collaborative and diversity regularization). The bars represent the normalized frequency of target identifiers in training data (assigned codes), grouped by popularity.
Observations from Figure 6:
- Diversity Regularization Mitigates Bias: The figures clearly show that incorporating diversity regularization (both TIGER w/ d. r. and LETTER) leads to a smoother, more uniform distribution of code assignments. The peaks observed for TIGER (without diversity regularization) are flattened and the tails are raised, indicating that diversity regularization successfully reduces code assignment bias and promotes a more balanced utilization of codes. This implies a potential reduction in item generation bias.
- Increased Code Utilization: Diversity regularization significantly increases the utilization rate of codes in the first-level codebook. For example, TIGER w/ d. r. uses 180 codes out of 256, compared to TIGER's 148. Similarly, LETTER uses 150 codes, compensating for the drop caused by collaborative regularization.
- Interaction with Collaborative Regularization: While collaborative regularization alone (comparing TIGER to TIGER w/ c. r.) can decrease code utilization (from 148 to 76 used codes on Instruments, mentioned in the text rather than shown in this graph), integrating diversity regularization (as in LETTER) helps recover and maintain high code utilization (150 used codes). This demonstrates that LETTER can simultaneously capture collaborative signals and maintain high code diversity, fulfilling multiple criteria of an ideal identifier.
6.2.3. Code Embedding Distribution (RQ2)
To visually confirm the effect of diversity regularization on the code embedding distribution, the code embeddings from the first-level codebook were visualized using PCA for dimensionality reduction to 3D space.
The following figure illustrates the distribution of code embeddings.
(Figure placeholder: code embedding distributions for LETTER without diversity regularization (a) and with diversity regularization (b). Red points denote code embeddings; darker regions indicate codes assigned to more items.)
The preceding figure (Figure 7 from the original paper) visualizes the 3D code embeddings (after PCA) of the first-level codebook. Figure (a) shows the distribution for LETTER w/o diversity regularization, and Figure (b) shows it for LETTER (with diversity regularization). Darker colors indicate codes assigned to more items.
Observations from Figure 7:
- Uniform Distribution: Comparing Figure (a) (LETTER w/o diversity regularization) to Figure (b) (LETTER), the code embeddings in LETTER are more evenly distributed in the representation space. In (a), there are noticeable clusters and denser regions, suggesting some codes are more central or preferred; in (b), the points spread out more uniformly across the sphere.
- Alleviating Bias: This visual evidence confirms that diversity regularization yields a more diverse distribution of code embeddings. By spreading out the embeddings, it addresses the biased code assignment problem illustrated in Figure 5(a) at its root, ensuring that items are not disproportionately mapped to a few specific code regions.
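As a rough illustration of the visualization procedure (not the authors' code), the following sketch projects first-level code embeddings to 3D with PCA and colors each point by how many items are assigned to it; `code_embeddings` and `assignment_counts` are assumed inputs.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA


def plot_codebook_pca(code_embeddings, assignment_counts):
    """Project first-level code embeddings to 3D via PCA and plot them,
    with darker colors for codes assigned to more items (cf. Figure 7)."""
    coords = PCA(n_components=3).fit_transform(np.asarray(code_embeddings))

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2],
               c=assignment_counts, cmap="Reds")
    ax.set_title("First-level code embeddings (PCA to 3D)")
    plt.show()


# Hypothetical usage:
# plot_codebook_pca(np.random.randn(256, 32), np.random.randint(1, 50, size=256))
```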
6.2.4. Investigation on Collaborative Signals in Identifiers (RQ2)
Two experiments were designed to verify whether LETTER successfully encodes collaborative signals into identifiers.
6.2.4.1. Ranking Experiment
This experiment assesses the ranking performance by using LETTER's quantized embeddings for interaction prediction. The quantized embedding from the trained LETTER tokenizer replaces the item embeddings in a well-trained traditional CF model (SASRec), and its ranking performance is evaluated. An identifier that effectively captures collaborative signals should lead to better ranking performance.
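The following is a minimal PyTorch sketch of this embedding-swap protocol under stated assumptions: the trained SASRec-style model exposes its item embedding table as `item_emb` (a hypothetical attribute name), and the quantized item embedding is taken as the sum of the selected code embeddings across levels, as in RQ-VAE-style tokenizers.

```python
import torch


def quantized_item_embeddings(code_sequences, codebooks):
    """Reconstruct quantized item embeddings by summing the selected code
    embeddings over all levels (standard RQ-VAE-style quantization).

    code_sequences: (num_items, num_levels) long tensor of assigned code indices.
    codebooks: list of (codebook_size, dim) tensors, one per level.
    """
    parts = [codebooks[level][code_sequences[:, level]]
             for level in range(len(codebooks))]
    return torch.stack(parts, dim=0).sum(dim=0)


@torch.no_grad()
def evaluate_with_quantized_embeddings(cf_model, quantized_emb, eval_fn):
    """Replace the item embedding table of a trained CF model (e.g. SASRec)
    with the tokenizer's quantized embeddings, then run the usual ranking
    evaluation. `cf_model.item_emb` is an assumed attribute name."""
    cf_model.item_emb.weight.copy_(quantized_emb)
    return eval_fn(cf_model)
```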
The following are the results from Table 3 of the original paper:
| Dataset | Model | R@5 | R@10 | N@5 | N@10 |
|---|---|---|---|---|---|
| Instruments | TIGER | 0.0050 | 0.0150 | 0.0024 | 0.0049 |
| Instruments | LETTER | 0.0080 | 0.0159 | 0.0038 | 0.0058 |
| Beauty | TIGER | 0.0128 | 0.0213 | 0.0064 | 0.0085 |
| Beauty | LETTER | 0.0175 | 0.0343 | 0.0076 | 0.0118 |
Observations from Table 3:
LETTER significantly outperforms TIGER (i.e., item embeddings quantized by TIGER's purely semantic tokenizer) by a large margin across both datasets and all metrics (e.g., R@10 on Beauty: 0.0343 vs. 0.0213). This strong improvement indicates that LETTER's quantized embeddings, shaped by collaborative regularization, capture collaborative signals far better and are therefore more suitable for interaction prediction.
6.2.4.2. Similarity Experiment
This experiment verifies if items with similar collaborative signals indeed exhibit similar identifiers (code sequences).
- Method: For every item, its most similar item is identified based on similarity derived from pre-trained CF embeddings. The similarity of the code sequences of these two "collaboratively similar" items is then assessed via an overlap degree, and results are averaged over all items (a sketch of this computation follows the observations below).

The following are the results from Table 4 of the original paper:

| Model | Instruments | Beauty |
|---|---|---|
| TIGER | 0.0849 | 0.1135 |
| LETTER | 0.2760 | 0.3312 |
Observations from Table 4:
LETTER achieves much higher code-sequence similarity for collaboratively similar items than TIGER (e.g., 0.2760 vs. 0.0849 on Instruments, and 0.3312 vs. 0.1135 on Beauty). This provides direct evidence that LETTER encodes collaborative signals into the code sequences themselves, producing identifiers that reflect not just semantics but also user interaction patterns, and thereby alleviating the misalignment between semantic and collaborative similarity.
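Below is a hedged NumPy sketch of the similarity check described above; the position-wise overlap definition and the cosine-similarity neighbor search are assumptions, since the paper's exact overlap measure is not reproduced here.

```python
import numpy as np


def code_overlap(codes_a, codes_b):
    """Fraction of identifier positions at which the two code sequences agree
    (one plausible overlap degree; the paper's exact definition may differ)."""
    codes_a, codes_b = np.asarray(codes_a), np.asarray(codes_b)
    return float((codes_a == codes_b).mean())


def avg_overlap_with_cf_neighbors(item_codes, cf_embeddings):
    """For each item, find its most similar item under pre-trained CF embeddings
    (cosine similarity), then average the identifier overlap over all items.

    item_codes: (num_items, num_levels) array of assigned codes.
    cf_embeddings: (num_items, dim) array from a pre-trained CF model (e.g. SASRec).
    """
    normed = cf_embeddings / np.linalg.norm(cf_embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude each item itself
    nearest = sims.argmax(axis=1)
    return float(np.mean([code_overlap(item_codes[i], item_codes[j])
                          for i, j in enumerate(nearest)]))
```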
6.2.5. Hyper-Parameter Analysis (RQ3)
The following figure (Figure 8 from the original paper) shows the performance of LETTER-TIGER over different hyper-parameters on the Instruments dataset.
(Figure placeholder: line charts showing how R@10 and N@10 of LETTER-TIGER vary as each hyper-parameter is changed.)
The preceding figure (Figure 8 from the original paper) illustrates the performance (R@10 and N@10) of LETTER-TIGER as various hyper-parameters are adjusted: identifier length L, codebook size N, strength of collaborative regularization α, strength of diversity regularization β, cluster number K, and temperature τ.
Observations from Figure 8:
- Identifier length L:
  - Performance initially improves as L increases from 2 to 4, suggesting that longer identifiers (up to a point) capture more fine-grained information and are more expressive.
  - However, increasing L beyond 4 (e.g., to 8) degrades performance. This is attributed to error accumulation in autoregressive generation: longer sequences are harder to generate accurately, since an error in an early token propagates.
- Codebook size N:
  - Performance generally improves as N increases (e.g., from 64 to 256). A larger codebook provides more distinct code embeddings, allowing better differentiation between items and richer representations.
  - However, an excessively large N (e.g., 512) can hurt performance, possibly because a very large codebook becomes more susceptible to noise in the items' semantic information, leading to overfitting to meaningless semantics or to sparsity issues.
- Strength of collaborative regularization α:
  - As α increases, performance generally improves up to a moderate value, indicating that a stronger injection of collaborative patterns is beneficial.
  - However, an overly large α (e.g., 0.1) causes a slight drop. This suggests a trade-off: too much emphasis on collaborative regularization can interfere with semantic regularization, leading to suboptimal overall performance; a value like 0.02 strikes a good balance.
- Strength of diversity regularization β:
  - Even a small nonzero β significantly improves performance over no diversity regularization, confirming its effectiveness in enhancing code assignment diversity.
  - However, an excessive amount of diversity signal degrades performance, implying that over-weighting diversity can interfere with the integration of semantic and collaborative signals, as the tokenizer is forced to prioritize diversity over other crucial information.
- Cluster number K (for diversity regularization):
  - The optimal performance is observed at an intermediate value of K; decreasing it to 5 or increasing it to 20 degrades performance.
  - If K is too small, clusters are too coarse and each contains many code embeddings, making it difficult to enforce sufficient closeness within a cluster; if K is too large, clusters become overly fine-grained, so code embeddings within the same cluster can end up overly close and insufficiently discriminative.
- Temperature τ (for the ranking-guided generation loss):
  - Decreasing τ from 1.2 to 0.7 generally improves performance. This is consistent with Proposition 1: a smaller temperature places more emphasis on penalizing hard negatives, strengthening the ranking ability.
  - However, performance drops slightly if τ becomes too small (e.g., 0.6). A very small τ can suppress the possibility that hard-negative samples are positive samples for other users, making the model overly rigid or sensitive to minor differences and thereby harming generalization. Careful tuning of τ is essential.
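For reference, a generic temperature-scaled generation (softmax cross-entropy) loss consistent with the behavior described above can be written as follows; this is a sketch of how τ typically enters such a loss, not necessarily the exact form of LETTER's ranking-guided generation loss.

$$
\mathcal{L}_{\text{gen}}(\tau) \;=\; -\sum_{t=1}^{L} \log \frac{\exp\big(z_{t, y_t}/\tau\big)}{\sum_{v \in \mathcal{V}} \exp\big(z_{t, v}/\tau\big)},
$$

where $z_{t,v}$ is the logit assigned to code $v$ at generation step $t$, $y_t$ is the ground-truth code, and $\mathcal{V}$ is the code vocabulary. A smaller τ sharpens the softmax, so high-scoring incorrect codes (hard negatives) contribute a larger penalty, which is the mechanism behind the trend observed for τ above.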
7. Conclusion & Reflections
7.1. Conclusion Summary
This study rigorously analyzed the essential characteristics of effective item tokenization for LLM-based generative recommendation. The authors introduced LETTER, a novel learnable tokenizer, which addresses the limitations of existing methods by integrating three critical components: hierarchical semantics, collaborative signals, and code assignment diversity. LETTER achieves this through a multi-faceted regularization scheme: semantic regularization using RQ-VAE for hierarchical encoding, collaborative regularization via a contrastive alignment loss to embed CF signals into code sequences, and diversity regularization to mitigate code assignment bias. Furthermore, LETTER incorporates a ranking-guided generation loss to theoretically enhance the top-K ranking ability of generative models. Extensive experiments on three real-world datasets consistently demonstrated LETTER's superior performance, pushing the state-of-the-art in LLM-based generative recommendation.
7.2. Limitations & Future Work
The authors identified several promising directions for future exploration:
- Tokenization with Rich User Behaviors: Future work could incorporate more diverse and complex user behaviors (beyond simple interactions) into the tokenization process, enabling generative recommender models to infer user preferences from a richer set of actions.
- Cross-Domain Item Tokenization: LETTER has the potential to tokenize cross-domain items, allowing generative recommender models to leverage multi-domain user behaviors and items for more comprehensive user preference reasoning and next-item recommendation, e.g., when users interact with items across different categories or platforms.
- Combining User Instructions with Tokens: A further direction is to combine natural-language user instructions with user interaction history tokenized by LETTER, enabling collaborative reasoning that integrates complex natural-language queries with structured item tokens within the generative recommender model's space and thus yielding more personalized recommendations.
7.3. Personal Insights & Critique
This paper presents a highly relevant and well-structured approach to a critical problem in LLM-based generative recommendation. The comprehensive analysis of item tokenization limitations and the systematic design of LETTER to address these are commendable.
Strengths:
- Holistic Approach: LETTER's strength lies in simultaneously tackling hierarchical semantics, collaborative signals, and code assignment diversity. This multi-objective treatment of item tokenization is a significant advancement over prior work that often addressed one or two aspects in isolation.
- Theoretical Justification: The ranking-guided generation loss, with its theoretical connection to hard-negative mining and OPAUC, provides strong grounding for its effectiveness in improving ranking metrics.
- Empirical Validation: Extensive experiments and detailed ablation studies thoroughly validate each component's contribution and LETTER's overall superiority.
- Interpretability: By explicitly defining what an ideal identifier should entail, the paper offers a clear framework for understanding item tokenization in generative recommendation.
Potential Issues/Areas for Improvement:
- Formula for the Diversity Loss: The explicit mathematical formula for the diversity regularization loss is not provided in the main text. While the intuitive description is helpful, a precise formula would improve reproducibility and clarity for researchers seeking to implement or extend this component; the phrasing "which is defined as where..." suggests a missing equation (4), a minor but notable omission in an otherwise rigorous paper.
- Computational Cost: Training RQ-VAE with multiple codebooks, plus LLMs with additional regularization terms and Trie-constrained generation, can be computationally intensive. While the authors mention using 4 GPUs, a more explicit discussion of the computational overhead and scalability for very large item catalogs would be beneficial.
- Generalizability of CF Embeddings: The collaborative regularization relies on CF embeddings from a pre-trained CF model (e.g., SASRec). The quality and robustness of these embeddings directly affect LETTER's performance, yet the paper does not examine how sensitive LETTER is to the choice or quality of the upstream CF model. In practice, maintaining high-quality CF embeddings for new items or evolving user behavior can be challenging.
- Cold-Start Scenarios for CF Embeddings: While semantic regularization helps cold-start items, the collaborative regularization may still struggle for truly cold-start items that lack sufficient interaction data to produce reliable CF embeddings.
- Subjectivity of the "Ideal Identifier": The criteria for an ideal identifier are well defined, but their relative importance may vary across recommendation tasks and datasets. The regularization-strength hyper-parameters (α and β) encode these trade-offs, but a deeper discussion of task-specific tuning considerations would be valuable.
Transferability and Applications:
The methodology proposed in LETTER is highly transferable. The concept of a learnable tokenizer that integrates multiple information sources (semantics, collaborative) and addresses distribution biases is applicable to any domain where discrete data needs to be mapped to a continuous or tokenized space for generative models. This could extend beyond recommendation to areas like:
- Generative molecule design: tokenizing chemical compounds based on structure and desired properties.
- Generative music/art: tokenizing musical notes or art elements based on stylistic and perceptual features.
- Knowledge graph completion: tokenizing entities and relations to facilitate LLM-based knowledge generation.

Overall, LETTER makes a significant contribution by providing a comprehensive and principled solution for item tokenization, which is a cornerstone for the successful deployment of LLMs in generative recommendation. The paper opens exciting avenues for more intelligent and fair LLM-based recommenders.