Learnable Item Tokenization for Generative Recommendation


TL;DR Summary

The paper introduces LETTER, a learnable tokenizer that addresses the challenge of transforming recommendation data into LLMs' language space. By integrating hierarchical semantics, collaborative signals, and code assignment diversity, LETTER demonstrates consistent improvements over existing item tokenization methods in experiments on three datasets.

Abstract

Utilizing powerful Large Language Models (LLMs) for generative recommendation has attracted much attention. Nevertheless, a crucial challenge is transforming recommendation data into the language space of LLMs through effective item tokenization. Current approaches, such as ID, textual, and codebook-based identifiers, exhibit shortcomings in encoding semantic information, incorporating collaborative signals, or handling code assignment bias. To address these limitations, we propose LETTER (a LEarnable Tokenizer for generaTivE Recommendation), which integrates hierarchical semantics, collaborative signals, and code assignment diversity to satisfy the essential requirements of identifiers. LETTER incorporates Residual Quantized VAE for semantic regularization, a contrastive alignment loss for collaborative regularization, and a diversity loss to mitigate code assignment bias. We instantiate LETTER on two models and propose a ranking-guided generation loss to augment their ranking ability theoretically. Experiments on three datasets validate the superiority of LETTER, advancing the state-of-the-art in the field of LLM-based generative recommendation.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Learnable Item Tokenization for Generative Recommendation

1.2. Authors

  • Wenjie Wang (National University of Singapore, Singapore)
  • Jizhi Zhang (University of Science and Technology of China, Hefei, China)
  • See-Kiong Ng (National University of Singapore, Singapore)
  • Honghui Bao (National University of Singapore, Singapore)
  • Xinyu Lin (National University of Singapore, Singapore)
  • Fuli Feng (University of Science and Technology of China, Hefei, China)
  • Yongqi Li (The Hong Kong Polytechnic University, Hong Kong SAR, China)
  • Tat-Seng Chua (National University of Singapore, Singapore)

1.3. Journal/Conference

The paper is published in the Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM 24), October 21-25, 2024, Boise, ID, USA. CIKM is a highly reputable and influential conference in the fields of information retrieval, database management, and knowledge management, making it a significant venue for research in recommender systems.

1.4. Publication Year

2024

1.5. Abstract

The paper addresses a crucial challenge in utilizing Large Language Models (LLMs) for generative recommendation: effectively transforming recommendation data into the language space of LLMs through item tokenization. Existing approaches—ID, textual, and codebook-based identifiers—suffer from limitations such as inefficient semantic encoding, lack of collaborative signals, or code assignment bias. To overcome these, the authors propose LETTER (a LEarnable Tokenizer for generaTivE Recommendation). LETTER integrates hierarchical semantics, collaborative signals, and code assignment diversity, which are essential requirements for effective identifiers. It incorporates Residual Quantized VAE (RQ-VAE) for semantic regularization, a contrastive alignment loss for collaborative regularization, and a diversity loss to mitigate code assignment bias. The authors instantiate LETTER on two generative recommender models and introduce a ranking-guided generation loss to theoretically enhance their ranking ability. Experiments on three datasets demonstrate LETTER's superiority, advancing the state-of-the-art in LLM-based generative recommendation.

  • Original Source Link: https://arxiv.org/abs/2405.07314 (Preprint)

  • PDF Link: https://arxiv.org/pdf/2405.07314v3.pdf (Preprint)

    The paper is published as a preprint on arXiv and is accepted for CIKM 2024.

2. Executive Summary

2.1. Background & Motivation

The rise of Large Language Models (LLMs) has opened new avenues for generative recommendation, where LLMs are used to directly generate recommended items. A fundamental hurdle in this paradigm is item tokenization, which involves converting discrete item data into a format (a sequence of tokens or identifiers) that LLMs can process. This process bridges the gap between the traditional recommendation domain and the language space of LLMs.

The current landscape of item tokenization approaches presents several critical shortcomings:

  1. ID Identifiers: These assign unique numerical strings to items. While ensuring uniqueness, they are inefficient at encoding semantic information, making it difficult to generalize to cold-start items (new items with little or no interaction history).

  2. Textual Identifiers: These leverage item descriptions (e.g., titles, attributes) directly as identifiers.

    • They often lack hierarchical semantics, meaning the token sequence doesn't progressively encode information from coarse to fine-grained, which is suboptimal for autoregressive generation.
    • They typically lack collaborative signals (information derived from user-item interactions). Items with similar semantics but different user interaction patterns might have very similar textual identifiers, leading to misalignment and making it hard for the recommender to distinguish them based on collaborative preferences.
  3. Codebook-based Identifiers: These use auto-encoders to map item semantics to hierarchical code sequences. While an improvement in semantics, they still suffer from the lack of collaborative signals in their code sequences and a significant code assignment bias. This bias means certain codes are assigned more frequently than others, leading to an imbalanced distribution and an unfair item generation bias where popular items are more likely to be generated, neglecting less popular but relevant items.

    The core problem the paper aims to solve is to develop an effective item tokenization method that can overcome these limitations, thus enabling LLMs to perform generative recommendation more accurately, fairly, and efficiently.

2.2. Main Contributions / Findings

The paper's primary contributions are encapsulated in LETTER, a novel learnable tokenizer designed to address the aforementioned issues:

  • Proposal of LETTER: The authors propose LETTER, a LEarnable Tokenizer for generaTivE Recommendation, which is explicitly designed to meet three essential criteria for ideal identifiers:

    1. Hierarchical Semantic Integration: Ensures identifiers encode semantics from broad to fine-grained, aligning with autoregressive generation.
    2. Collaborative Signal Incorporation: Integrates collaborative signals directly into the token assignment process, making similar interaction patterns result in similar token sequences.
    3. Code Assignment Diversity: Mitigates code assignment bias to ensure fairer item generation.
  • Three Regularization Mechanisms: LETTER achieves its objectives through three novel regularization losses:

    1. Semantic Regularization: Leverages Residual Quantized VAE (RQ-VAE) to encode hierarchical item semantics into a code sequence, enabling coarse-to-fine generation and better cold-start generalization.
    2. Collaborative Regularization: Introduces a contrastive alignment loss that aligns the quantized semantic embeddings with Collaborative Filtering (CF) embeddings from a well-trained CF model (e.g., LightGCN). This injects collaborative information directly into the token assignment.
    3. Diversity Regularization: Implements a diversity loss that encourages a more uniform distribution of code embeddings, thereby alleviating code assignment bias and reducing item generation bias.
  • Ranking-Guided Generation Loss: To further enhance the recommendation performance of LLM-based generative models, LETTER proposes a ranking-guided generation loss. This loss modifies the traditional negative log-likelihood loss by introducing an adjustable temperature parameter, which emphasizes penalties for hard-negative samples and theoretically improves the top-K ranking ability.

  • Empirical Validation and State-of-the-Art Performance: Extensive experiments on three real-world datasets (Instruments, Beauty, Yelp) demonstrate that LETTER significantly outperforms existing item tokenization methods. It consistently improves the performance of two representative generative recommender models (TIGER and LC-Rec) when integrated. Ablation studies confirm the effectiveness of each proposed regularization component and the ranking-guided generation loss.

    In essence, LETTER provides a robust and comprehensive solution for item tokenization in LLM-based generative recommendation, addressing semantic encoding, collaborative information, and fairness issues simultaneously, thereby advancing the state-of-the-art in this emerging field.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Generative Recommendation

Generative recommendation is an emerging paradigm in recommender systems where, instead of predicting a score or ranking from a fixed set of items, the model directly generates the identifiers or attributes of recommended items. This approach often leverages Large Language Models (LLMs) because of their powerful text generation capabilities. Given a user's historical interactions, a generative recommender aims to output a sequence of tokens representing a new, relevant item.

3.1.2. Large Language Models (LLMs)

Large Language Models (LLMs) are deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data. They are capable of understanding, generating, and processing human language with remarkable fluency and coherence. Key characteristics include:

  • Autoregressive Generation: LLMs typically generate text token by token, predicting the next token based on the preceding sequence.
  • Rich World Knowledge: Acquired during pre-training on diverse internet data.
  • Reasoning and Generalization: Ability to perform complex tasks, including inference and generalization to unseen scenarios. In the context of recommendation, LLMs are adapted to generate item identifiers, effectively treating items as words in a vocabulary.

3.1.3. Item Tokenization

Item tokenization is the process of converting an item in a recommender system (e.g., a movie, a product) into a representation that can be understood and processed by an LLM. This usually involves assigning a unique identifier or a sequence of tokens to each item. The effectiveness of item tokenization directly impacts how well an LLM can encode user preferences and generate relevant recommendations.

3.1.4. Residual Quantized Variational Autoencoder (RQ-VAE)

Residual Quantized Variational Autoencoder (RQ-VAE) is a type of Vector Quantized Variational Autoencoder (VQ-VAE) that focuses on generating hierarchical and high-fidelity representations.

  • Variational Autoencoder (VAE): A type of generative model that learns a probabilistic mapping from input data to a latent space and then reconstructs the data from that latent space. It consists of an encoder (maps input to latent distribution parameters) and a decoder (samples from latent space to reconstruct input).
  • Vector Quantization (VQ): A technique where continuous latent representations are mapped to discrete codebook entries. Each codebook entry is a vector (code embedding). This process makes the latent space discrete, which is beneficial for tasks involving discrete tokens, like language modeling.
  • Residual Quantization: In RQ-VAE, quantization is performed iteratively across multiple levels. Instead of quantizing the entire latent vector at once, it quantizes the residual error from the previous quantization step. This hierarchical approach allows the model to capture progressively finer-grained details, leading to more accurate and expressive discrete representations. This is crucial for encoding hierarchical semantics as desired by LETTER.

3.1.5. Collaborative Filtering (CF)

Collaborative Filtering (CF) is a traditional and highly effective technique in recommender systems that predicts a user's preference for items based on the preferences of other users (user-based CF) or the similarity of items themselves (item-based CF). The core idea is that users who agreed in the past tend to agree again in the future, or that items liked by similar users are likely to be liked by the current user. CF embeddings are vector representations of users and items learned through CF models, capturing their interaction patterns.

3.1.6. Contrastive Learning

Contrastive learning is a self-supervised learning paradigm where the model learns representations by pulling "positive" pairs (e.g., different augmentations of the same data point, or semantically similar items) closer together in the embedding space, while pushing "negative" pairs (dissimilar data points) farther apart. This technique is often used to learn powerful and discriminative representations without explicit labels. In LETTER, it's used to align semantic quantized embeddings with CF embeddings.

3.1.7. Ranking Metrics (Recall@K, NDCG@K)

These are standard metrics used to evaluate the performance of recommender systems, particularly in top-K recommendation tasks, where the goal is to recommend a small list of K items.

  • Recall@K (R@K): Measures the proportion of relevant items that are successfully retrieved (recommended) within the top K items.
    • Conceptual Definition: Recall@K quantifies how many of the actual relevant items for a user were present in the recommended list of K items. It focuses on the ability of the system to 'recall' or retrieve all relevant items.
    • Mathematical Formula: $ \text{Recall}@K = \frac{\text{Number of relevant items in top-}K \text{ recommendations}}{\text{Total number of relevant items for the user}} $
    • Symbol Explanation:
      • Number of relevant items in top-K recommendations: The count of items that the user actually interacted with (or found relevant) and were also present in the list of the top K items recommended by the system.
      • Total number of relevant items for the user: The total count of items that the user actually interacted with (or found relevant) in the test set.
  • Normalized Discounted Cumulative Gain (NDCG@K): A ranking quality metric that considers not only whether relevant items are in the top K but also their position in the ranked list. Higher relevance at higher ranks yields a better score.
    • Conceptual Definition: NDCG@K evaluates the quality of the ranked list of recommendations. It assigns higher scores to relevant items that appear earlier in the list and accounts for varying degrees of relevance. It's 'normalized' so that a perfect ranking always achieves an NDCG of 1.0.
    • Mathematical Formula: $ \text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K} $ where $ \text{DCG}@K = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)} $ and $ \text{IDCG}@K = \sum_{i=1}^{K} \frac{2^{rel_i^{\text{ideal}}} - 1}{\log_2(i+1)} $
    • Symbol Explanation:
      • DCG@K: Discounted Cumulative Gain at rank K. It sums the relevance scores of items in the recommended list, discounted logarithmically by their position.
      • IDCG@K: Ideal Discounted Cumulative Gain at rank K. This is the DCG score for the ideal ranking (where all relevant items are ranked highest). It serves as a normalization factor.
      • $rel_i$: The relevance score of the item at position $i$ in the recommended list. For binary relevance (relevant/not relevant), $rel_i$ is typically 1 or 0.
      • $rel_i^{\text{ideal}}$: The relevance score of the item at position $i$ in the ideal (perfect) ranking.
      • $K$: The number of top items considered in the recommendation list.
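To make these two metrics concrete, the following is a minimal Python sketch with binary relevance, assuming a ranked recommendation list and a set of ground-truth items per user; function and variable names are illustrative, not from the paper.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the user's relevant items that appear in the top-k list."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / len(relevant_items) if relevant_items else 0.0

def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG@k: DCG of the ranked list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)   # (2^1 - 1) = 1 per hit; rank is 0-based here
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy usage: a single relevant item ranked third.
print(recall_at_k(["a", "b", "c", "d"], {"c"}, k=3))  # 1.0
print(ndcg_at_k(["a", "b", "c", "d"], {"c"}, k=3))    # 1 / log2(4) = 0.5
```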

3.2. Previous Works

3.2.1. ID Identifiers

Early approaches to item tokenization for LLMs often relied on ID identifiers, which are unique numerical strings assigned to each item.

  • P5-SemiD [14]: This method constructs ID identifiers based on item metadata, such as categories or attributes. For example, all items in the "string" category might get IDs starting with a specific prefix. While this incorporates some semantic information, it's often coarse-grained and might fail to capture fine details or collaborative signals.

  • P5-CID [14]: This approach integrates collaborative signals into ID identifiers by using a spectral clustering tree derived from item co-appearance graphs. Items that frequently co-occur in user interactions are grouped, and these groupings inform the ID assignment. This helps in capturing collaborative patterns but relies on a fixed, unlearnable structure, which can be rigid and less adaptable to new items or evolving patterns.

    Limitations: ID identifiers, especially purely numerical ones, are inherently poor at encoding rich semantic information [39]. Even with semantic or collaborative enhancements, their fixed or tree-like structures struggle to adapt to new items (cold-start) or dynamically evolving user preferences. The misalignment between semantic and collaborative signals can also hinder effective learning.

3.2.2. Textual Identifiers

Textual identifiers directly use an item's content information, such as titles, attributes, or descriptions, as its token sequence.

  • BIGRec [1]: An LLM-based generative recommender model that uses items' titles as textual identifiers.
  • P5-TID [14]: Similarly, this method uses item titles as textual identifiers for an LLM-based generative recommender model.

Limitations:

  • Non-hierarchical Semantics: The natural language text of a title or description does not inherently encode semantic information hierarchically (from coarse to fine-grained). This makes autoregressive generation less efficient, as the LLM might struggle to align early tokens with broad preferences.
  • Lack of Collaborative Signals: Textual identifiers are solely based on content. Items with very similar textual descriptions might have vastly different collaborative signals (e.g., two similar-looking books, one popular, one niche). This misalignment (as depicted in Figure 2) can confuse the LLM, making it difficult to learn user preferences accurately. Injecting collaborative signals into token embeddings after tokenization can lead to collisions if similar textual identifiers need to represent different collaborative patterns.

3.2.3. Codebook-based Identifiers

These methods use auto-encoders to map item features into discrete code sequences, typically leveraging a codebook of learned embeddings.

  • TIGER [32]: Introduces codebook-based identifiers by employing RQ-VAE to quantize semantic information into code sequences for LLM-based generative recommendation. This addresses the hierarchical semantics problem.
  • LC-Rec [50]: Also uses codebook-based identifiers and integrates auxiliary alignment tasks to connect generated code sequences with natural language, aiming to better utilize knowledge in LLMs.

Limitations:

  • Lack of Collaborative Signals: Similar to textual identifiers, existing codebook-based methods often primarily focus on semantic encoding and do not explicitly integrate collaborative signals into the assignment of codes within the identifier sequence. They might try to inject collaborative signals into the token embeddings during the LLM training, but this still faces the misalignment issue.

  • Code Assignment Bias: As highlighted in Figure 3, the assignment of codes to items can be highly imbalanced. Some codes might be used far more frequently than others, leading to an item generation bias where items associated with popular codes are over-recommended, undermining fairness and diversity. LC-Rec attempts to address this with the Sinkhorn-Knopp Algorithm for intra-layer code fairness, but LETTER argues it misses the fundamental issue of code embedding distribution.

    The following figure illustrates the misalignment issue between item identifiers and collaborative signals, which is a key motivation for LETTER.

fig 2: A schematic showing the relationships among item semantics, identifiers, and user interactions in a recommender system. It distinguishes items with similar, aligned, misaligned, and differing signals, using examples of titles/descriptions, identifiers, and interacting users to show how semantic and collaborative signals match or mismatch.

The preceding figure (Figure 2 from the original paper) depicts the misalignment between item identifiers (textual or code-based, derived from semantics) and collaborative signals (derived from user interactions). It shows how items with similar semantics might have similar identifiers but vastly different user interaction patterns, leading to a mismatch. Conversely, items with different semantics but similar collaborative patterns might be forced into similar CF embeddings by a CF model, creating another type of misalignment if identifiers only capture semantics.

The following figure illustrates the code assignment bias problem in existing codebook-based methods.

fig 3: A chart showing the frequency distribution of target identifiers in the training data and of generated identifiers across code percentage ranges. The blue bars denote the frequency of target identifiers in the training data, while the line denotes the frequency of generated identifiers; the frequency decreases as the code percentage increases.

The preceding figure (Figure 3 from the original paper) illustrates the code assignment bias and item generation bias observed in TIGER on the Instruments dataset. It shows that certain codes (lower percentage values on the x-axis) are assigned to items much more frequently in the training data (blue bars), and TIGER tends to generate items associated with these high-frequency codes (orange line), amplifying the bias towards popular items.

3.3. Technological Evolution

The evolution of recommender systems has moved from traditional methods like Collaborative Filtering and Matrix Factorization to deep learning-based approaches, including sequential models (SASRec, BERT4Rec) and graph neural networks (LightGCN). More recently, the advent of LLMs has spurred generative recommendation, shifting from score prediction to item generation. Item tokenization has evolved alongside this, from simple ID identifiers to textual identifiers and then codebook-based identifiers. Each step aimed to better bridge the gap between item data and the LLM's language space, trying to encode more semantic and collaborative information. LETTER fits into this trajectory by addressing the holistic challenges of item tokenization, particularly the concurrent integration of hierarchical semantics, collaborative signals, and code assignment diversity, which were not fully achieved by prior codebook-based or other methods.

3.4. Differentiation Analysis

LETTER distinguishes itself from prior item tokenization methods primarily by its comprehensive approach to identifier design, addressing multiple critical aspects simultaneously:

  • Versus ID Identifiers (e.g., P5-SemiD, P5-CID): LETTER goes beyond fixed or pre-structured ID assignments. It uses a learnable tokenizer (RQ-VAE) to dynamically encode rich hierarchical semantics, which ID identifiers largely lack. While P5-CID incorporates collaborative signals, LETTER integrates them directly into the quantization process of semantic embeddings, offering a more flexible and adaptable alignment.
  • Versus Textual Identifiers (e.g., BIGRec, P5-TID): LETTER's RQ-VAE specifically designs hierarchical semantics into the token sequence, unlike raw textual descriptions that lack this structure. Crucially, LETTER explicitly incorporates collaborative signals into the token assignment, addressing the misalignment issue where semantically similar items might have different collaborative patterns. Textual identifiers struggle with this, often leading to collisions if collaborative signals are only injected into token embeddings post-tokenization.
  • Versus Codebook-based Identifiers (e.g., TIGER, LC-Rec): While TIGER also uses RQ-VAE for hierarchical semantics, LETTER significantly extends it by:
    1. Explicit Collaborative Regularization: LETTER introduces a dedicated contrastive alignment loss to directly align quantized embeddings with CF embeddings, ensuring that the code sequence itself reflects collaborative signals, not just item semantics. This fundamentally alters the code assignment to better suit collaborative patterns. Previous codebook-based methods primarily focused on semantic encoding or injected collaborative signals only at the token embedding level, not the code assignment level.

    2. Diversity Regularization: LETTER explicitly tackles the code assignment bias problem (observed in TIGER and others) with a novel diversity loss that aims for a more uniform distribution of code embeddings. This directly mitigates item generation bias, an aspect largely overlooked or incompletely addressed by prior codebook-based methods (e.g., LC-Rec uses Sinkhorn-Knopp for intra-layer codes but LETTER argues it misses the essence of code embedding distribution).

      In summary, LETTER's innovation lies in its holistic framework that systematically addresses the multi-faceted requirements of an ideal identifier, simultaneously optimizing for hierarchical semantics, collaborative signals, and assignment diversity within a learnable tokenizer.

4. Methodology

The core idea of LETTER is to create an ideal identifier for generative recommendation that is both semantically rich and collaboratively informed, while also promoting diversity in item generation. This is achieved by building upon a codebook-based tokenization approach (RQ-VAE) and augmenting it with two novel regularization terms: collaborative regularization and diversity regularization. Additionally, LETTER introduces a ranking-guided generation loss during the LLM training phase to enhance the ranking ability of generative models.

4.1. Principles

LETTER is founded on three essential objectives for an ideal identifier:

  1. Hierarchical Semantic Integration: The token sequence should encode item semantics hierarchically, moving from coarse-grained to fine-grained details. This aligns with the autoregressive generation process of LLMs, where initial tokens set a broad context and subsequent tokens refine it.
  2. Collaborative Signal Incorporation: The token assignment for an item should reflect its collaborative signals (user interaction patterns). This means items with similar interaction histories should have similar token sequences, even if their raw semantic descriptions differ slightly. This aims to resolve the misalignment issue between semantics and collaborative patterns.
  3. High Diversity of Code Assignment: The assignment of codes within the codebook should be diverse and balanced, avoiding concentration on a few codes. This mitigates code assignment bias, which can lead to unfair item generation bias (over-recommending popular items associated with frequently assigned codes).

4.2. Core Methodology In-depth (Layer by Layer)

The overall architecture of LETTER is illustrated in the figure below. It shows the learnable tokenizer at the core, generating identifiers with hierarchical semantics, enhanced by collaborative regularization and diversity regularization.

fig 4: A schematic of the LETTER framework. The upper left outlines diversity regularization, including the relationship between code embeddings and clusters; the central flow shows how the encoder maps semantic embeddings to quantized embeddings and applies semantic regularization via the reconstructed semantic embedding; the right side highlights collaborative regularization. The figure indicates the roles of the regularization losses ($\mathcal{L}_{\mathrm{Div}}$, $\mathcal{L}_{\mathrm{Sem}}$, $\mathcal{L}_{\mathrm{CF}}$) in the overall system.

The preceding figure (Figure 4 from the original paper) provides an overview of LETTER's architecture. It depicts how semantic regularization ensures hierarchical semantic encoding via RQ-VAE, collaborative regularization aligns the identifier's code sequence with collaborative signals, and diversity regularization alleviates code assignment bias.

4.2.1. Semantic Regularization

To achieve identifiers with hierarchical semantics, LETTER builds its tokenizer based on Residual Quantized VAE (RQ-VAE) [16]. RQ-VAE is chosen for its ability to recursively quantize semantic residuals, naturally producing identifiers that capture semantics from coarse to fine-grained levels.

The process involves two main steps:

4.2.1.1. Semantic Embedding Extraction

Given an item, its content information (e.g., titles, descriptions) is first processed to extract a semantic embedding. This is done using a pre-trained semantic extractor, such as LLaMA-7B [41], which yields an initial semantic embedding $s$. This high-dimensional embedding $s$ is then compressed into a lower-dimensional latent semantic embedding $z \in \mathbb{R}^d$ through an encoder network: $ z = \mathrm{Encoder}(s) $ Here, Encoder is a neural network that maps the high-dimensional semantic embedding $s$ to the lower-dimensional latent semantic embedding $z$.

4.2.1.2. Semantic Embedding Quantization

The latent semantic embedding $z$ is then quantized into a code sequence of length $L$ using $L$-level codebooks. For each code level $l \in \{1, \dots, L\}$, there is a dedicated codebook $Q_l = \{\pmb{e}_{l,i}\}_{i=1}^{N}$, where $\pmb{e}_{l,i} \in \mathbb{R}^d$ is a learnable code embedding (a vector of the same dimension as $z$) and $N$ is the codebook size (the number of entries in each codebook).

The residual quantization process is formulated as follows: $ \begin{cases} c_l = \arg\min_i \| \pmb{r}_{l-1} - \pmb{e}_{l,i} \|^2, & \pmb{e}_{l,i} \in Q_l, \\ \pmb{r}_l = \pmb{r}_{l-1} - \pmb{e}_{l,c_l} \end{cases} $ (1)

  • Symbol Explanation:
    • $c_l$: The index of the assigned code from the $l$-th level codebook. It is found by selecting the code embedding $\pmb{e}_{l,i}$ in $Q_l$ that is closest (in Euclidean distance) to the current semantic residual $\pmb{r}_{l-1}$.

    • $\pmb{r}_{l-1}$: The semantic residual from the previous code level. It represents the part of the semantic information that has not yet been captured by the code embeddings from levels 1 to l-1.

    • $\pmb{e}_{l,i} \in Q_l$: A code embedding (vector) from the $l$-th codebook.

    • $\pmb{r}_l$: The new semantic residual after subtracting the selected code embedding $\pmb{e}_{l,c_l}$ from the previous residual $\pmb{r}_{l-1}$. This residual is then passed to the next code level.

    • The process starts with $\pmb{r}_0 = z$, meaning the initial semantic residual is the latent semantic embedding itself.

      After recursively quantizing through all $L$ levels, LETTER obtains the quantized identifier $\tilde{\pmb{r}} = [c_1, c_2, \dots, c_L]$ (a sequence of code indices) and the quantized embedding $\hat{z} = \sum_{l=1}^{L} \pmb{e}_{l,c_l}$ (the sum of all selected code embeddings). This quantized embedding $\hat{z}$ is then decoded back to a reconstructed semantic embedding $\hat{s}$ using a decoder network.
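The residual quantization in Equation (1) can be sketched in a few lines of NumPy. This is an illustrative re-implementation under the assumption that the $L$ codebooks are given as arrays; it is not the authors' code.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize a latent vector z with L codebooks of shape (N, d) each (Eq. 1).

    Returns the code indices [c_1, ..., c_L] and the quantized embedding
    z_hat = sum_l e_{l, c_l}.
    """
    residual = z.copy()                      # r_0 = z
    codes, z_hat = [], np.zeros_like(z)
    for codebook in codebooks:               # level l = 1 .. L
        dists = np.sum((codebook - residual) ** 2, axis=1)
        c_l = int(np.argmin(dists))          # nearest code embedding
        codes.append(c_l)
        z_hat += codebook[c_l]
        residual = residual - codebook[c_l]  # r_l = r_{l-1} - e_{l, c_l}
    return codes, z_hat

# Toy usage with L = 4 codebooks, N = 256 codes, d = 32 (sizes from Section 5.4.1).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 32)) for _ in range(4)]
codes, z_hat = residual_quantize(rng.normal(size=32), codebooks)
print(codes)   # e.g. [c_1, c_2, c_3, c_4]
```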

The loss for semantic regularization is formulated as: $ \mathcal{L}_{\mathrm{Sem}} = \mathcal{L}_{\mathrm{Recon}} + \mathcal{L}_{\mathrm{RQ\text{-}VAE}}, \quad \text{where} \quad \begin{cases} \mathcal{L}_{\mathrm{Recon}} = \| s - \hat{s} \|^2, \\ \mathcal{L}_{\mathrm{RQ\text{-}VAE}} = \sum_{l=1}^{L} \| \mathrm{sg}[\pmb{r}_{l-1}] - \pmb{e}_{l,c_l} \|^2 + \mu \| \pmb{r}_{l-1} - \mathrm{sg}[\pmb{e}_{l,c_l}] \|^2 \end{cases} $ (2)

  • Symbol Explanation:
    • $\mathcal{L}_{\mathrm{Sem}}$: The total semantic regularization loss.
    • $\mathcal{L}_{\mathrm{Recon}}$: The reconstruction loss, which measures the squared Euclidean distance (L2 norm) between the original semantic embedding $s$ and its reconstructed semantic embedding $\hat{s}$. This term ensures that the quantized embedding $\hat{z}$ retains the essential semantic information.
    • $\mathcal{L}_{\mathrm{RQ\text{-}VAE}}$: The RQ-VAE specific loss term, summed over all $L$ code levels. This term is a crucial part of VQ-VAE and RQ-VAE training.
      • $\mathrm{sg}[\cdot]$: The stop-gradient operation [42]. Gradients do not flow through the argument of sg.

      • The first term, $\| \mathrm{sg}[\pmb{r}_{l-1}] - \pmb{e}_{l,c_l} \|^2$: Encourages the code embedding $\pmb{e}_{l,c_l}$ (selected for residual $\pmb{r}_{l-1}$) to move closer to the residual $\pmb{r}_{l-1}$. The stop-gradient on $\pmb{r}_{l-1}$ means only $\pmb{e}_{l,c_l}$ is updated by this term.

      • The second term, $\mu \| \pmb{r}_{l-1} - \mathrm{sg}[\pmb{e}_{l,c_l}] \|^2$: Encourages the encoder to produce latent semantic embeddings $z$ (and subsequent residuals $\pmb{r}_{l-1}$) that are closer to the chosen code embeddings $\pmb{e}_{l,c_l}$. The stop-gradient on $\pmb{e}_{l,c_l}$ means only $\pmb{r}_{l-1}$ (and thus the encoder) is updated by this term.

      • $\mu$: A hyper-parameter coefficient that balances the strength of these two updates, specifically controlling the commitment of the encoder to the code embeddings.

        By applying semantic regularization, the code sequence $\tilde{\pmb{r}}$ learns to encode hierarchical semantics, which facilitates coarse-grained to fine-grained generation by the LLM and improves cold-start generalization.
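As a concrete illustration of Equation (2), the following is a minimal PyTorch sketch that assumes the encoder, decoder, and quantizer have already produced the reconstructed embedding, the per-level residuals $\pmb{r}_{l-1}$, and the selected code embeddings $\pmb{e}_{l,c_l}$; `detach()` stands in for the stop-gradient operator, and argument names are illustrative rather than the authors' API.

```python
import torch
import torch.nn.functional as F

def semantic_loss(s, s_hat, residuals, selected_codes, mu=0.25):
    """Sketch of Eq. (2): L_Sem = L_Recon + L_RQ-VAE.

    residuals[l] holds r_{l-1} and selected_codes[l] holds e_{l, c_l} for each level;
    .detach() plays the role of the stop-gradient sg[.].
    """
    recon = F.mse_loss(s_hat, s, reduction="sum")           # ||s - s_hat||^2
    rq = sum(
        F.mse_loss(e, r.detach(), reduction="sum")          # ||sg[r_{l-1}] - e_{l,c_l}||^2 (moves the code)
        + mu * F.mse_loss(r, e.detach(), reduction="sum")   # mu * ||r_{l-1} - sg[e_{l,c_l}]||^2 (moves the encoder)
        for r, e in zip(residuals, selected_codes)
    )
    return recon + rq
```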

4.2.2. Collaborative Regularization

To overcome the limitation of existing methods that lack collaborative signals in their code sequences, LETTER introduces collaborative regularization. This aims to inject collaborative signals directly into the quantized embedding $\hat{z}$ (and thus, implicitly, the code sequence $\tilde{\pmb{r}}$) by aligning it with CF embeddings using contrastive learning.

Specifically, LETTER utilizes a pre-trained Collaborative Filtering (CF) model (e.g., SASRec [15] or LightGCN [11]) to obtain CF embeddings for all items. Let $\mathbf{h}_i$ denote the CF embedding for item $i$. The goal is to make the quantized embedding $\hat{z}_i$ of item $i$ similar to its CF embedding $\mathbf{h}_i$.

The collaborative regularization loss ($\mathcal{L}_{\mathrm{CF}}$) is formulated as a contrastive alignment loss: $ \mathcal{L}_{\mathrm{CF}} = -\frac{1}{B}\sum_{i=1}^{B}\log \frac{\exp(\langle \hat{z}_i, \mathbf{h}_i\rangle)}{\sum_{j=1}^{B}\exp(\langle \hat{z}_i, \mathbf{h}_j\rangle)} $ (3)

  • Symbol Explanation:
    • $\mathcal{L}_{\mathrm{CF}}$: The collaborative regularization loss.

    • $B$: The batch size, representing the number of items processed in one training step.

    • $\hat{z}_i$: The quantized embedding of item $i$, obtained from semantic embedding quantization.

    • $\mathbf{h}_i$: The CF embedding of item $i$, obtained from a pre-trained CF model.

    • $\langle \cdot, \cdot \rangle$: Denotes the inner product (dot product) between two vectors, which measures their similarity.

    • The term $\frac{\exp(\langle \hat{z}_i, \mathbf{h}_i\rangle)}{\sum_{j=1}^{B}\exp(\langle \hat{z}_i, \mathbf{h}_j\rangle)}$ is a softmax probability, where the numerator is the similarity between item $i$'s quantized embedding and its own CF embedding (the positive pair), and the denominator sums the similarity of $\hat{z}_i$ with all CF embeddings $\mathbf{h}_j$ in the current batch (including $\mathbf{h}_i$ itself; the other items' CF embeddings act as negatives). Minimizing this negative log-likelihood maximizes the similarity between $\hat{z}_i$ and its corresponding $\mathbf{h}_i$, pushing $\hat{z}_i$ away from the other CF embeddings in the batch.

      This collaborative regularization encourages items with similar collaborative interactions to have similar quantized embeddings and, by extension, similar code sequences. This contrasts with methods that only inject collaborative signals into token embeddings after quantization, which can lead to collisions if code sequences are fixed based solely on semantics.
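A minimal PyTorch sketch of this contrastive alignment with in-batch negatives could look as follows; the paper may additionally scale the inner products by a temperature, which this sketch omits, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def collaborative_alignment_loss(z_hat, h):
    """In-batch contrastive alignment of quantized embeddings with CF embeddings (Eq. 3).

    z_hat: (B, d) quantized item embeddings; h: (B, d) CF embeddings of the same items.
    Row i's positive is h[i]; the other CF embeddings in the batch act as negatives.
    """
    logits = z_hat @ h.T                      # (B, B) inner-product similarities
    targets = torch.arange(z_hat.size(0))     # the positive for row i sits at column i
    return F.cross_entropy(logits, targets)   # -1/B * sum_i log softmax(logits)[i, i]

# Toy usage with batch size 8 and 32-dimensional embeddings.
loss = collaborative_alignment_loss(torch.randn(8, 32), torch.randn(8, 32))
```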

4.2.3. Diversity Regularization

To tackle the code assignment bias (where some codes are over-assigned, leading to item generation bias), LETTER introduces diversity regularization. The intuition is that a more uniform distribution of code embeddings in the latent space (as shown in Figure 5(b)) will lead to a more balanced assignment of codes to items, compared to a biased distribution (Figure 5(a)).

fig 5: A schematic contrasting two code assignment scenarios: the left panel shows a biased code embedding distribution relative to the latent semantic embeddings, while the right panel shows a uniform code embedding distribution together with its corresponding code assignments.

The preceding figure (Figure 5 from the original paper) contrasts a biased code embedding distribution (a) with a uniform code embedding distribution (b), illustrating how the latter facilitates more balanced code assignments.

The goal is to improve the diversity of code embeddings within each codebook. The paper describes the diversity loss as follows: for each codebook, the code embeddings are clustered into $K$ groups using constrained K-means [3]. The diversity loss then regularizes these clustered code embeddings by:

  • Pulling code embeddings from the same cluster closer together.

  • Pushing code embeddings from different clusters farther apart.

    While Equation (4) defining $\mathcal{L}_{\mathrm{Div}}$ is not reproduced cleanly here, the accompanying description indicates a contrastive mechanism over the clustered code embeddings: for the code embedding nearest to an item's residual, the positive sample is a randomly selected code embedding from the same cluster, while all other code embeddings in the codebook serve as negatives. The core idea is to enforce separation between clusters while maintaining coherence within them, yielding a more uniform distribution of code embeddings.
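Because the exact formula is not reproduced here, the PyTorch sketch below is only one plausible reading of that description, not the authors' loss: same-cluster code embeddings act as positives and all remaining code embeddings as negatives, with cluster labels assumed to come from (constrained) K-means.

```python
import torch

def diversity_loss(codebook, cluster_ids):
    """Hedged sketch of the diversity regularizer described in Section 4.2.3.

    codebook: (N, d) code embeddings of one level; cluster_ids: (N,) cluster labels.
    For each code, one randomly chosen code from the same cluster is the positive
    and every other code is a negative, pulling clusters together internally while
    pushing different clusters apart.
    """
    n = codebook.size(0)
    sims = codebook @ codebook.T                               # pairwise inner products
    same_cluster = cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)
    same_cluster.fill_diagonal_(False)                         # exclude the code itself
    # Sample one positive per code from its own cluster (fall back to itself if alone).
    pos_idx = torch.tensor([
        torch.multinomial(same_cluster[i].float(), 1).item() if same_cluster[i].any() else i
        for i in range(n)
    ])
    pos_sim = sims[torch.arange(n), pos_idx]
    off_diag = ~torch.eye(n, dtype=torch.bool)                 # negatives: every other code
    denom = torch.logsumexp(sims.masked_fill(~off_diag, float("-inf")), dim=1)
    return -(pos_sim - denom).mean()
```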

4.2.4. Overall Loss

The complete training loss for LETTER is a weighted sum of the three regularization terms: $ \mathcal{L}_{\mathrm{LETTER}} = \mathcal{L}_{\mathrm{Sem}} + \alpha \mathcal{L}_{\mathrm{CF}} + \beta \mathcal{L}_{\mathrm{Div}} $ (5)

  • Symbol Explanation:
    • $\mathcal{L}_{\mathrm{LETTER}}$: The total loss for training the LETTER tokenizer.
    • $\mathcal{L}_{\mathrm{Sem}}$: The semantic regularization loss (Equation 2).
    • $\mathcal{L}_{\mathrm{CF}}$: The collaborative regularization loss (Equation 3).
    • $\mathcal{L}_{\mathrm{Div}}$: The diversity regularization loss (described intuitively in Section 4.2.3).
    • $\alpha$: A hyper-parameter controlling the strength of collaborative regularization.
    • $\beta$: A hyper-parameter controlling the strength of diversity regularization.

4.2.5. Instantiation on LLM-based Generative Recommender Models

4.2.5.1. Training

The training process involves two stages:

  1. Tokenizer Training: First, the LETTER tokenizer is trained independently on the recommendation items using the overall loss $\mathcal{L}_{\mathrm{LETTER}}$ (Equation 5).

  2. LLM Fine-tuning: Once the LETTER tokenizer is well-trained, it is used to tokenize all items. Each item is indexed into an identifier (a code sequence) $\hat{i} = [c_1, c_2, \ldots, c_L]$. User interaction sequences are then translated into sequences of these item identifiers. For a given user, a training sample consists of $x = [\hat{i}_1, \hat{i}_2, \dots, \hat{i}_M]$ (historically interacted items) and $y = \hat{i}_{M+1}$ (the identifier of the next interacted item).

    Ranking-Guided Generation Loss: Existing LLM-based generative recommender models typically optimize LLMs using a generation loss (negative log-likelihood minimization). However, this generation loss might not be optimally aligned with ranking optimization. To address this, LETTER proposes a ranking-guided generation loss ($\mathcal{L}_{\mathrm{rank}}$), which modifies the traditional generation loss by introducing an adjustable temperature parameter $\bar{\tau}$ to emphasize penalties on hard-negative samples, thereby enhancing the ranking ability.

    The ranking-guided generation loss is defined as: $ \mathcal{L}_{\mathrm{rank}} = -\sum_{t=1}^{|y|}\log \left( \frac{\exp(p(y_t) / \bar{\tau})}{\sum_{v\in \mathcal{V}}\exp(p(v) / \bar{\tau})} \right) $ (6)

    • Symbol Explanation:
      • $\mathcal{L}_{\mathrm{rank}}$: The ranking-guided generation loss. The summation is over all tokens $t$ in the target item identifier $y$.

      • $|y|$: The length of the target item identifier $y$ (i.e., the number of tokens).

      • $t$: The index of the current token being predicted in the sequence.

      • $y_t$: The $t$-th token of the target identifier $y$.

      • $p(y_t)$: The unnormalized log-probability (or 'logit') predicted by the generative model for the true token $y_t$.

      • $\bar{\tau}$: The adjustable temperature hyper-parameter. A smaller $\bar{\tau}$ sharpens the probability distribution, making the model more confident about high-logit tokens and penalizing low-logit hard-negative samples more heavily.

      • $\mathcal{V}$: The entire token vocabulary (all possible code indices from the codebooks).

      • The expression $\frac{\exp(p(y_t) / \bar{\tau})}{\sum_{v\in \mathcal{V}}\exp(p(v) / \bar{\tau})}$ is a softmax over the logits, scaled by $\bar{\tau}$, yielding a probability distribution over the vocabulary for the $t$-th token. Minimizing the negative logarithm of this probability maximizes the likelihood of generating the true token $y_t$.

        The paper provides a theoretical justification for this loss:

    PROPOSITION 1. For a given ranking-guided generation loss $\mathcal{L}_{\mathrm{rank}}$ and a parameter $\bar{\tau}$, the following statements hold:

    • Minimizing $\mathcal{L}_{\mathrm{rank}}$ is equivalent to optimizing hard-negative items for users, where a smaller $\bar{\tau}$ intensifies the penalty on hard negatives.

    • The minimization of $\mathcal{L}_{\mathrm{rank}}$ is associated with the optimization of one-way partial AUC [36], which is strongly correlated with ranking metrics such as Recall and NDCG, ultimately leading to an improvement in the top-K ranking ability.

      The full proof of Proposition 1 is provided in Appendix 7 of the original paper. It analyzes the gradients of the loss function (Eq. 7, 8) to show how $\bar{\tau}$ influences the weights of negative samples, highlighting that hard-negative samples receive higher weights with smaller $\bar{\tau}$. It then links this hard-negative mining to Distributionally Robust Optimization (DRO) (Eq. 10, 11, 12), which is, in turn, a surrogate for one-way partial AUC (OPAUC) (Eq. 13). Finally, Theorem 1 (Eq. 14) in the Appendix explicitly shows the strong correlation between OPAUC and top-K ranking metrics such as Recall and NDCG.
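In implementation terms, the ranking-guided generation loss amounts to a temperature-scaled cross-entropy over the token vocabulary. The sketch below assumes per-position logits for a single target identifier and is illustrative rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def ranking_guided_generation_loss(logits, target_tokens, tau=1.0):
    """Sketch of the ranking-guided generation loss (Eq. 6).

    logits: (T, V) unnormalized scores p(v) for each of the T target positions;
    target_tokens: (T,) token indices y_t. Dividing the logits by tau < 1 sharpens
    the softmax, which increases the penalty on hard-negative tokens.
    """
    log_probs = F.log_softmax(logits / tau, dim=-1)   # softmax over the vocabulary
    picked = log_probs[torch.arange(target_tokens.size(0)), target_tokens]
    return -picked.sum()

# Toy usage: a 4-token identifier over a vocabulary of 4 * 256 codes.
logits = torch.randn(4, 1024)
targets = torch.randint(0, 1024, (4,))
loss_sharp = ranking_guided_generation_loss(logits, targets, tau=0.5)
```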

4.2.5.2. Inference

During inference, generative recommender models autoregressively generate the code sequence for the next item. The next token $\hat{y}_t$ is selected by taking the token with the highest predicted probability from the token vocabulary $\mathcal{V}$: $ \hat{y}_t = \arg\max_{v\in \mathcal{V}} P_\theta (v \mid y_{<t}, x) $

  • Symbol Explanation:
    • $\hat{y}_t$: The predicted $t$-th token of the generated identifier.

    • $P_\theta (v \mid y_{<t}, x)$: The probability that token $v$ is the next token, given the preceding tokens $y_{<t}$ and the user's historical interactions $x$. This probability is computed by the generative model with parameters $\theta$.

    • $\mathcal{V}$: The token vocabulary.

      To ensure that the generated sequences form valid item identifiers, LETTER employs constrained generation [8] using a Trie (prefix tree) [5]. A Trie allows the model to efficiently find all strictly valid successor tokens at each step, preventing the generation of invalid or non-existent item identifiers.
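A Trie over valid identifiers can be kept as nested dictionaries; at each decoding step, the set of legal next tokens is the children of the node reached by the already-generated prefix. The sketch below is a minimal illustration of this constraint, not the paper's implementation.

```python
def build_trie(identifiers):
    """Prefix tree over valid item identifiers (each a list of code indices)."""
    root = {}
    for codes in identifiers:
        node = root
        for c in codes:
            node = node.setdefault(c, {})
    return root

def allowed_next_tokens(trie, prefix):
    """Tokens that may legally follow `prefix` so the output stays a valid identifier."""
    node = trie
    for c in prefix:
        node = node.get(c, {})
    return list(node.keys())

# Toy usage: during decoding, restrict argmax/beam search to the allowed set.
trie = build_trie([[3, 17, 201, 5], [3, 17, 9, 44], [7, 2, 2, 0]])
print(allowed_next_tokens(trie, [3, 17]))   # [201, 9]
```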

5. Experimental Setup

5.1. Datasets

The experiments are conducted on three real-world recommendation datasets from different domains:

  • 1. Instruments:
    • Source: Amazon review datasets [29].
    • Characteristics: Contains user interactions related to music gears (e.g., musical instruments and accessories).
    • Domain: E-commerce, product reviews.
  • 2. Beauty:
    • Source: Amazon review datasets [29].
    • Characteristics: Encompasses user interactions with a wide range of beauty products.
    • Domain: E-commerce, product reviews.
  • 3. Yelp:
    • Source: Popular Yelp platform dataset.
    • Characteristics: Comprises business interactions, such as user reviews and ratings for restaurants, shops, and services.
    • Domain: Local business reviews, services.

Preprocessing:

  • The datasets underwent preprocessing techniques consistent with previous work [15, 32].
  • Sparse users and items with fewer than 5 interactions were discarded to ensure sufficient data density.
  • A sequential recommendation setting was adopted, where the goal is to predict the next item a user will interact with based on their history.
  • The leave-one-out strategy [32, 50] was used for splitting datasets, meaning for each user, the last interaction is used as the test item, the second-to-last as the validation item, and the rest for training.
  • For training, the number of items in a user's history was restricted to 20, following [14, 50], to manage sequence length for LLM inputs.
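As a concrete illustration of this protocol, the following sketch performs the leave-one-out split with history truncation to 20 items; function and variable names are hypothetical, and the < 5 interaction filtering is assumed to have happened beforehand.

```python
def leave_one_out_split(interactions, max_history=20):
    """Per-user split following Section 5.1: last item -> test, second-to-last ->
    validation, the rest -> training; histories keep only the most recent items."""
    train, valid, test = {}, {}, {}
    for user, items in interactions.items():   # items are in chronological order
        if len(items) < 3:                     # sparse users are filtered earlier
            continue
        train[user] = items[:-2][-max_history:]
        valid[user] = (items[:-2][-max_history:], items[-2])
        test[user] = (items[:-1][-max_history:], items[-1])
    return train, valid, test

# Toy usage: one user with five chronologically ordered interactions.
train, valid, test = leave_one_out_split({"u1": ["i1", "i2", "i3", "i4", "i5"]})
```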

5.2. Evaluation Metrics

The performance of the models is evaluated using two standard top-K ranking metrics: Recall@K (R@K) and NDCG@K (N@K), with $K$ set to 5 and 10.

  • Recall@K (R@K):

    • Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved (recommended) within the top K items. It focuses on the ability of the system to 'recall' or retrieve all relevant items.
    • Mathematical Formula: $ \text{Recall}@K = \frac{\text{Number of relevant items in top-}K \text{ recommendations}}{\text{Total number of relevant items for the user}} $
    • Symbol Explanation:
      • Number of relevant items in top-K recommendations: The count of items that the user actually interacted with (or found relevant) and were also present in the list of the top K items recommended by the system.
      • Total number of relevant items for the user: The total count of items that the user actually interacted with (or found relevant) in the test set.
  • Normalized Discounted Cumulative Gain (NDCG@K):

    • Conceptual Definition: NDCG@K evaluates the quality of the ranked list of recommendations. It assigns higher scores to relevant items that appear earlier in the list and accounts for varying degrees of relevance. It's 'normalized' so that a perfect ranking always achieves an NDCG of 1.0.
    • Mathematical Formula: $ \text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K} $ where $ \text{DCG}@K = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)} $ and $ \text{IDCG}@K = \sum_{i=1}^{K} \frac{2^{rel_i^{\text{ideal}}} - 1}{\log_2(i+1)} $
    • Symbol Explanation:
      • DCG@K: Discounted Cumulative Gain at rank K. It sums the relevance scores of items in the recommended list, discounted logarithmically by their position.
      • IDCG@K: Ideal Discounted Cumulative Gain at rank K. This is the DCG score for the ideal ranking (where all relevant items are ranked highest). It serves as a normalization factor.
      • $rel_i$: The relevance score of the item at position $i$ in the recommended list. For binary relevance (relevant/not relevant), $rel_i$ is typically 1 or 0.
      • $rel_i^{\text{ideal}}$: The relevance score of the item at position $i$ in the ideal (perfect) ranking.
      • $K$: The number of top items considered in the recommendation list.

5.3. Baselines

LETTER is compared against a comprehensive set of baselines, categorized into traditional recommender models and LLM-based generative recommender models with different item identifier types.

5.3.1. Traditional Recommender Models

These models do not rely on LLMs for generation but are included for a broader comparison of recommendation performance.

  • MF [35]: Matrix Factorization decomposes the user-item interaction matrix into lower-dimensional user and item embeddings.
  • Caser [40]: Convolutional Sequence Embedding Recommendation employs convolutional neural networks to capture sequential and positional information in user interactions.
  • HGN [28]: Hierarchical Gating Networks apply hierarchical (feature-level and instance-level) gating modules to capture users' long- and short-term interests from their interaction sequences.
  • BERT4Rec [37]: Leverages BERT's pre-trained language representations to capture sequential user-item relationships.
  • LightGCN [11]: A lightweight graph convolutional network model that simplifies graph convolutions for recommendation, focusing on high-order connections.
  • SASRec [15]: Self-Attentive Sequential Recommendation employs self-attention mechanisms to capture long-term dependencies in user interaction history.

5.3.2. LLM-based Generative Recommender Models

These models utilize LLMs for recommendation, categorized by their item tokenization strategy.

5.3.2.1. ID Identifiers

  • P5-SemiD [14]: Assigns item identifiers based on item metadata (e.g., categories, attributes), essentially using semi-structured IDs.
  • P5-CID [14]: Incorporates collaborative signals into item identifiers by building a spectral clustering tree from item co-appearance graphs, creating collaboratively-informed IDs.

5.3.2.2. Textual Identifiers

  • BIGRec [1]: Uses items' titles directly as textual identifiers for LLM-based generative recommendation.
  • P5-TID [14]: Similar to BIGRec, it leverages item titles as textual identifiers for an LLM-based generative recommender model.

5.3.2.3. Codebook-based Identifiers

  • TIGER [32]: Transformer Index for GEnerative Recommenders introduces codebook-based identifiers via RQ-VAE, quantizing item semantic information into a code sequence for LLM-based generative recommendation. This is one of the backend models LETTER is instantiated upon.
  • LC-Rec [50]: Leveraging Collaborative Semantics for Recommendation also uses codebook-based identifiers and employs auxiliary alignment tasks to better integrate LLM knowledge by connecting generated code sequences with natural language. This is the other backend model LETTER is instantiated upon.

5.4. Implementation Details

  • Backend Models: LETTER is instantiated on two representative LLM-based generative recommender models: TIGER [32] and LC-Rec [50].
    • For TIGER, as official implementations were not released, the authors followed the paper for their own implementation.
    • For LC-Rec, parameter-efficient fine-tuning (PEFT) technique LoRA [12] was used to fine-tune LLaMA-7B [41].
  • Semantic Embedding Extraction: LLaMA-7B [41] was adopted to encode item content information (titles, descriptions) to obtain the initial semantic embeddings, following [50].
  • CF Embeddings: 32-dimensional item embeddings were obtained from a SASRec [15] model, which were then used for collaborative regularization.
  • Hardware: All experiments were conducted on 4 NVIDIA RTX A5000 GPUs.

5.4.1. LETTER Tokenizer Specifics

  • RQ-VAE Structure: A 4-level codebook structure was used for the RQ-VAE ($L=4$).
  • Codebook Size and Dimension: Each codebook comprised 256 code embeddings ($N=256$), with each embedding having a dimension of 32.
  • Diversity Regularization: The number of clusters $K$ for constrained K-means was set to 10.
  • Tokenizer Training:
    • LETTER was trained for 20,000 epochs.
    • Optimizer: AdamW [27].
    • Learning Rate: $1 \times 10^{-3}$.
    • Batch Size: 1,024.
    • Hyper-parameters:
      • $\mu$ (RQ-VAE coefficient): Set to 0.25, following [32].
      • $\alpha$ (strength of collaborative regularization): Searched over $\{1 \times 10^{-1}, 2 \times 10^{-2}, 1 \times 10^{-2}, 1 \times 10^{-3}\}$.
      • $\beta$ (strength of diversity regularization): Searched over $\{1 \times 10^{-2}, 1 \times 10^{-3}, 1 \times 10^{-4}, 1 \times 10^{-5}\}$.
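For reference, the tokenizer settings listed above can be gathered into a single illustrative configuration; the $\alpha$ and $\beta$ entries are the reported search grids rather than final values, and the dictionary itself is not from the paper.

```python
# Hyper-parameters reported in Section 5.4.1, collected into one illustrative config.
letter_tokenizer_config = {
    "num_levels": 4,            # L: 4-level codebooks
    "codebook_size": 256,       # N: code embeddings per codebook
    "code_dim": 32,             # dimension of each code embedding
    "num_clusters": 10,         # K for constrained K-means in diversity regularization
    "epochs": 20_000,
    "optimizer": "AdamW",
    "learning_rate": 1e-3,
    "batch_size": 1024,
    "mu": 0.25,                 # RQ-VAE commitment coefficient
    "alpha_grid": [1e-1, 2e-2, 1e-2, 1e-3],   # collaborative regularization strength
    "beta_grid": [1e-2, 1e-3, 1e-4, 1e-5],    # diversity regularization strength
}
```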

5.4.2. LLM Fine-tuning

  • After LETTER training, the backend generative models (TIGER and LC-Rec) were fine-tuned for convergence based on validation performance.
  • Learning Rates:
    • For TIGER: $\{1 \times 10^{-3}, 5 \times 10^{-4}\}$.
    • For LC-Rec: $\{1 \times 10^{-4}, 2 \times 10^{-4}, 3 \times 10^{-4}\}$.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Overall Performance (RQ1)

The following are the results from Table 1 of the original paper, comparing LETTER instantiated on TIGER (LETTER-TIGER) and LC-REC (LETTER-LC-REC) with various baselines across three datasets.

Model Instruments Beauty Yelp
R@5 R@10 N@5 N@10 R@5 R@10 N@5 N@10 R@5 R@10 N@5 N@10
MF 0.0479 0.0735 0.0330 0.0412 0.0294 0.0474 0.0145 0.0191 0.0220 0.0296 0.0142 0.0177
Caser 0.0543 0.0710 0.0355 0.0409 0.0205 0.0347 0.0131 0.0176 0.0150 0.0203 0.0094 0.0118
HGN 0.0813 0.1048 0.0668 0.0774 0.0325 0.0512 0.0206 0.0266 0.0186 0.0245 0.0118 0.0147
Bert4Rec 0.0671 0.0822 0.0560 0.0608 0.0203 0.0347 0.0124 0.0170 0.0186 0.0249 0.0119 0.0149
LightGCN 0.0794 0.1000 0.0662 0.0728 0.0305 0.0511 0.0194 0.0260 0.0248 0.0321 0.0158 0.0196
SASRec 0.0751 0.0947 0.0627 0.0690 0.0380 0.0588 0.0246 0.0313 0.0183 0.0238 0.0117 0.0146
BIGRec 0.0513 0.0576 0.0470 0.0491 0.0243 0.0299 0.0181 0.0198 0.0154 0.0191 0.0110 0.0127
P5-TID 0.0000 0.0001 0.0000 0.0000 0.0182 0.0432 0.0132 0.0254 0.0184 0.0251 0.0124 0.0156
P5-SemiID 0.0775 0.0964 0.0669 0.0730 0.0393 0.0584 0.0273 0.0335 0.0202 0.0268 0.0130 0.0163
P5-CID 0.0809 0.0987 0.0695 0.0751 0.0404 0.0597 0.0284 0.0347 0.0219 0.0284 0.0141 0.0174
TIGER 0.0870 0.1058 0.0737 0.0797 0.0395 0.0610 0.0253 0.0321 0.0262 0.0331 0.0169 0.0207
LETTER-TIGER 0.0909 0.1122 0.0763 0.0831 0.0431 0.0672 0.0277 0.0364 0.0286 0.0364 0.0184 0.0227
LC-Rec 0.0824 0.1006 0.0712 0.0772 0.0443 0.0642 0.0311 0.0374 0.0230 0.0298 0.0148 0.0184
LETTER-LC-Rec 0.0913 0.1115 0.0789 0.0854 0.0505 0.0703 0.0355 0.0418 0.0255 0.0326 0.0166 0.0205

Observations from Table 1:

  • Comparison of LLM-based Models with ID Identifiers: Among P5-CID and P5-SemiD, P5-CID generally outperforms P5-SemiD. This is attributed to P5-CID leveraging collaborative signals (from item co-appearance graphs) in its identifier assignment, which helps LLMs capture user behavioral patterns. P5-SemiD, which assigns IDs based on coarse item categories, struggles to capture fine-grained semantics and experiences misalignment between semantic and collaborative signals.
  • Comparison with Textual Identifiers: BIGRec and P5-TID (with textual identifiers) generally perform worse than codebook-based and even some ID identifier methods. This is likely due to the inherent misalignment issue where similar semantics in text don't necessarily correspond to similar user interactions, hindering the learning of accurate collaborative signals. P5-TID shows particularly poor performance on Instruments, indicating that direct textual representation can be ineffective if not carefully handled.
  • Superiority of Codebook-based Identifiers (TIGER, LC-Rec): TIGER and LC-Rec (codebook-based methods) generally outperform ID and textual identifier methods in most cases. This suggests that the hierarchical semantics encoded by RQ-VAE-like approaches provide a more effective representation for generative recommendation by distinguishing items through fine-grained details.
  • LETTER's Consistent Improvements: The most significant observation is that LETTER consistently and substantially improves the performance of its backend models, TIGER and LC-Rec, across all three datasets and all metrics. For instance, LETTER-TIGER improves R@10 on Instruments from 0.1058 to 0.1122, and LETTER-LC-Rec improves R@10 on Beauty from 0.0642 to 0.0703. This robust improvement validates the core hypothesis of LETTER: that integrating collaborative signals into code assignment and enhancing code assignment diversity are crucial for effective item tokenization.
    • The improvements are attributed to:
      1. CF Integration: Aligning quantized embeddings with CF embeddings during code assignment addresses the misalignment between semantic and collaborative signals, encouraging similar code sequences for items with similar collaborative patterns.
      2. Improved Diversity: Diversity regularization mitigates code assignment bias, leading to a more balanced generation of items and overcoming the item generation bias.

6.2. In-depth Analysis

6.2.1. Ablation Study (RQ2)

To investigate the contribution of each regularization component within LETTER, an ablation study was conducted on TIGER using the Instruments and Beauty datasets.

The following are the results from Table 2 of the original paper:

| Variants | Instruments R@10 | Instruments N@10 | Beauty R@10 | Beauty N@10 |
|---|---|---|---|---|
| (0): TIGER | 0.1058 | 0.0797 | 0.0610 | 0.0331 |
| (1): TIGER w/ c. r. | 0.1078 | 0.0810 | 0.0660 | 0.0351 |
| (2): TIGER w/ d. r. | 0.1075 | 0.0809 | 0.0618 | 0.0335 |
| (3): (1) w/ d. r. | 0.1092 | 0.0819 | 0.0672 | 0.0357 |
| (4): LETTER-TIGER | 0.1122 | 0.0831 | 0.0672 | 0.0364 |

Observations from Table 2:

  • Effectiveness of Individual Regularizations:
    • TIGER w/ c. r. (incorporating collaborative regularization) shows improved performance over base TIGER (0) on both datasets (e.g., R@10 on Instruments: 0.1078 vs. 0.1058). This confirms the value of injecting collaborative signals into the code assignment.
    • TIGER w/ d. r. (incorporating diversity regularization) also improves over base TIGER (0) (e.g., R@10 on Instruments: 0.1075 vs. 0.1058). This validates the effectiveness of enhancing code embedding diversity to mitigate code assignment bias.
  • Combined Regularizations:
    • (3): (1) w/ d. r. (combining collaborative and diversity regularization) achieves better results than either individual regularization and base TIGER (e.g., R@10 on Instruments: 0.1092). This indicates that jointly considering semantics, collaboration, and diversity in code assignment is more effective than any single aspect.
  • Ranking-Guided Generation Loss:
    • (4): LETTER-TIGER (which includes all regularizations and the ranking-guided generation loss) achieves the best performance across all variants (e.g., R@10 on Instruments: 0.1122). This highlights the effectiveness of the ranking-guided generation loss in improving top-K ranking ability by penalizing hard-negative samples more effectively.
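
The ablation variants can be read as toggling terms in a weighted sum of the tokenizer's training losses. The sketch below is a hypothetical illustration of that structure ($\alpha$ and $\beta$ correspond to the regularization strengths analyzed in the hyper-parameter study); the actual combination used by LETTER follows the paper's equations rather than this simplified form.

```python
import torch

def tokenizer_loss(l_sem, l_cf, l_div, alpha=0.02, beta=1e-4,
                   use_cf=True, use_div=True):
    """Hypothetical weighted combination of the tokenizer's loss terms.

    l_sem: semantic (RQ-VAE) loss, present in every variant.
    l_cf:  collaborative alignment loss -> toggled on for variant (1).
    l_div: diversity regularization loss -> toggled on for variant (2).
    alpha, beta: regularization strengths (cf. the hyper-parameter study).
    """
    loss = l_sem
    if use_cf:
        loss = loss + alpha * l_cf   # TIGER w/ c. r.
    if use_div:
        loss = loss + beta * l_div   # TIGER w/ d. r.
    return loss                      # both toggles on ~= variants (3)/(4)

# Example with dummy scalar losses.
total = tokenizer_loss(torch.tensor(1.0), torch.tensor(0.5), torch.tensor(2.0))
```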

6.2.2. Code Assignment Distribution (RQ2)

To ascertain if diversity regularization effectively mitigates code assignment bias, the distribution of the first code in item identifiers was analyzed. The following figure illustrates the normalized frequency of different code assignment groups.

fig 6: A comparison chart showing the normalized frequency of different code assignment groups (ranked by popularity). The left panel shows TIGER with and without diversity regularization; the right panel shows TIGER with collaborative regularization and LETTER. Each panel is annotated with the total codebook size and the number of codes actually used.

The preceding figure (Figure 6 from the original paper) compares the normalized frequency distribution of the first code in item identifiers. The left panel compares TIGER (without diversity regularization) and TIGER with diversity regularization. The right panel compares TIGER with collaborative regularization and LETTER (which combines collaborative and diversity regularization). The bars represent the normalized frequency of target identifiers in training data (assigned codes), grouped by popularity.

Observations from Figure 6:

  • Diversity Regularization Mitigates Bias: The figures clearly show that incorporating diversity regularization (both TIGER w/ d. r. and LETTER) leads to a smoother, more uniform distribution of code assignments. The peaks observed in TIGER (without diversity) are flattened, and the tails are raised, indicating that diversity regularization successfully reduces the code assignment bias and promotes a more balanced utilization of codes. This implies a potential reduction in item generation bias.
  • Increased Code Utilization: Diversity regularization significantly increases the utilization rate of codes in the first-level codebook. For example, TIGER w/ d. r. uses 180 of 256 codes, compared to TIGER's 148; similarly, LETTER uses 150 codes, compensating for the drop caused by collaborative regularization (a sketch of how such utilization rates can be computed follows this list).
  • Interaction with Collaborative Regularization: While collaborative regularization alone (comparing TIGER to TIGER w/ c. r.) can reduce code utilization (a drop from 148 to 76 used codes on Instruments, reported in the paper's text rather than shown in this figure), adding diversity regularization (as in LETTER) recovers and maintains high code utilization (150 used codes). This demonstrates that LETTER can simultaneously capture collaborative signals and maintain high code diversity, fulfilling multiple criteria of an ideal identifier.
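
The utilization rates and popularity-ranked frequencies quoted above can be computed with a short script like the following. It is a sketch under the assumption that each item's assigned first-level code is available as an integer array; random data stands in here for the trained tokenizer's output.

```python
import numpy as np

def first_code_stats(first_codes, codebook_size=256):
    """Summarize how items are spread over the first-level codebook.

    first_codes: integer array holding each item's first assigned code.
    Returns the number of distinct codes actually used and the normalized
    frequency of each code, sorted from most to least popular (the quantity
    plotted in the figure above).
    """
    counts = np.bincount(first_codes, minlength=codebook_size)
    used_codes = int((counts > 0).sum())          # e.g., 148/256 for TIGER
    freq = np.sort(counts)[::-1] / counts.sum()   # popularity-ranked frequency
    return used_codes, freq

# Example: 1,000 items tokenized against a 256-entry first-level codebook.
codes = np.random.randint(0, 256, size=1000)
used, freq = first_code_stats(codes)
```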

6.2.3. Code Embedding Distribution (RQ2)

To visually confirm the effect of diversity regularization on the code embedding distribution, the code embeddings from the first-level codebook were visualized using PCA for dimensionality reduction to 3D space.
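
As a rough sketch of how such a visualization can be produced (assuming access to the trained first-level codebook as a matrix; random data is used here as a placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the trained tokenizer's first-level codebook, e.g. 256
# code embeddings of dimension 32; random data stands in for the real codebook.
codebook = np.random.randn(256, 32)

# Reduce the code embeddings to 3D, as done for the visualization below.
coords = PCA(n_components=3).fit_transform(codebook)

# `coords` can then be drawn as a 3D scatter plot (e.g., with matplotlib),
# optionally coloring each point by how many items are assigned to that code.
```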

The following figure illustrates the distribution of code embeddings.

fig 7: A schematic showing the code embedding distribution of LETTER without diversity regularization (a) and with diversity regularization (b). Red dots denote code embeddings, and darker regions indicate higher assignment frequency; comparing the two panels reveals the effect of diversity regularization on the embedding distribution.

The preceding figure (Figure 7 from the original paper) visualizes the 3D code embeddings (after PCA) of the first-level codebook. Figure (a) shows the distribution for LETTER w/o diversity regularization, and Figure (b) shows it for LETTER (with diversity regularization). Darker colors indicate codes assigned to more items.

Observations from Figure 7:

  • Uniform Distribution: Comparing Figure (a) (LETTER w/o diversity regularization) to Figure (b) (LETTER), it is evident that the code embeddings in LETTER are more evenly distributed in the representation space. In (a), there are noticeable clusters and denser regions, suggesting some codes are more central or preferred. In (b), the points are spread out more uniformly across the sphere.
  • Alleviating Bias: This visual evidence validates that diversity regularization is effective in achieving a more diverse distribution of code embeddings. By spreading out the embeddings, it fundamentally addresses the biased code assignment problem illustrated in Figure 5(a), ensuring that items are not disproportionately mapped to a few specific code regions.

6.2.4. Investigation on Collaborative Signals in Identifiers (RQ2)

Two experiments were designed to verify whether LETTER successfully encodes collaborative signals into identifiers.

6.2.4.1. Ranking Experiment

This experiment assesses ranking performance by using LETTER's quantized embeddings for interaction prediction. The quantized embedding $\hat{z}$ from the trained LETTER tokenizer replaces the item embeddings in a well-trained traditional CF model (SASRec), and its ranking performance is evaluated. An identifier that effectively captures collaborative signals should lead to better ranking performance.
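
A minimal sketch of this embedding-swap protocol is shown below. `TinySeqRec` is a toy stand-in for a trained sequential CF model such as SASRec (its mean-pooling encoder is a deliberate simplification of SASRec's self-attention), and the random tensors are placeholders for the tokenizer's quantized embeddings $\hat{z}$ and a real user history.

```python
import torch
import torch.nn as nn

class TinySeqRec(nn.Module):
    """Toy stand-in for a trained sequential CF model such as SASRec."""
    def __init__(self, num_items, dim):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)

    def encode(self, history):
        # Mean-pool the history embeddings as a crude user representation
        # (SASRec would use self-attention here).
        return self.item_emb(history).mean(dim=0)

def rank_with_quantized_embeddings(model, quantized_items, history, k=10):
    """Swap in the tokenizer's quantized item embeddings and rank all items."""
    with torch.no_grad():
        model.item_emb.weight.copy_(quantized_items)  # (num_items, d)
        user_repr = model.encode(history)             # (d,)
        scores = quantized_items @ user_repr          # score every item
    return torch.topk(scores, k=k).indices            # top-k for R@k / N@k

# Dummy example: 1,000 items, 64-dim embeddings, a 5-item user history.
model = TinySeqRec(1000, 64)
quantized = torch.randn(1000, 64)                     # stands in for the \hat{z} matrix
top10 = rank_with_quantized_embeddings(model, quantized,
                                       torch.randint(0, 1000, (5,)))
```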

The following are the results from Table 3 of the original paper:

| Dataset | Model | R@5 | R@10 | N@5 | N@10 |
|---|---|---|---|---|---|
| Instruments | TIGER | 0.0050 | 0.0150 | 0.0024 | 0.0049 |
| Instruments | LETTER | 0.0080 | 0.0159 | 0.0038 | 0.0058 |
| Beauty | TIGER | 0.0128 | 0.0213 | 0.0064 | 0.0085 |
| Beauty | LETTER | 0.0175 | 0.0343 | 0.0076 | 0.0118 |

Observations from Table 3:

  • LETTER significantly outperforms TIGER (i.e., TIGER's original, purely semantic quantized embeddings) by a large margin across both datasets and all metrics (e.g., R@10 on Beauty: 0.0343 vs. 0.0213). This strong improvement indicates that LETTER's quantized embeddings, which are shaped by collaborative regularization, capture collaborative signals far better than TIGER's purely semantic ones, making them more suitable for interaction prediction.

6.2.4.2. Similarity Experiment

This experiment verifies if items with similar collaborative signals indeed exhibit similar identifiers (code sequences).

  • Method: For every item, its most similar item is identified based on similarity derived from pre-trained CF embeddings. Then, the similarity of the code sequence between these two "collaboratively similar" items is assessed using an overlap degree. The averaged results over all items are reported.

    The following are the results from Table 4 of the original paper:

    | Model | Instruments | Beauty |
    |---|---|---|
    | TIGER | 0.0849 | 0.1135 |
    | LETTER | 0.2760 | 0.3312 |

Observations from Table 4:

  • LETTER achieves a much higher code-sequence similarity for collaboratively similar items than TIGER (0.2760 vs. 0.0849 on Instruments, and 0.3312 vs. 0.1135 on Beauty). This provides direct evidence that LETTER successfully incorporates collaborative signals into the code sequences themselves, yielding identifiers that reflect not only semantics but also user interaction patterns, and effectively alleviating the misalignment between semantic and collaborative similarity.
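
For reference, here is a small NumPy sketch of how such an overlap score could be computed, assuming each item's assigned code sequence and its pre-trained CF embedding are available as arrays. Defining the overlap as the fraction of matching code positions is an assumption for illustration; the paper's exact overlap measure may differ.

```python
import numpy as np

def avg_code_overlap(codes, cf_emb):
    """Average code-sequence overlap between each item and its nearest CF neighbor.

    codes:  (num_items, L) integer array of assigned code sequences.
    cf_emb: (num_items, d) pre-trained CF embeddings.
    Overlap is taken as the fraction of matching code positions.
    """
    cf = cf_emb / np.linalg.norm(cf_emb, axis=1, keepdims=True)
    sim = cf @ cf.T
    np.fill_diagonal(sim, -np.inf)                    # exclude the item itself
    nearest = sim.argmax(axis=1)                      # most similar item per item
    overlap = (codes == codes[nearest]).mean(axis=1)  # position-wise match rate
    return overlap.mean()

# Dummy example: 500 items, identifiers of length 4, 32-dim CF embeddings.
score = avg_code_overlap(np.random.randint(0, 256, size=(500, 4)),
                         np.random.randn(500, 32))
```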

6.2.5. Hyper-Parameter Analysis (RQ3)

The following figure (Figure 8 from the original paper) shows the performance of LETTER-TIGER over different hyper-parameters on the Instruments dataset.

fig 5: Line charts showing the effect of different hyper-parameters on the performance metrics (Recall@10 and NDCG@10). The figure contains five subplots, one per hyper-parameter (e.g., $L$, $N$, $\alpha$, $\beta$, $K$), each plotting how Recall@10 and NDCG@10 change as the parameter varies, indicating each parameter's contribution to recommendation quality.

The preceding figure illustrates the performance (R@10 and N@10) of LETTER-TIGER as various hyper-parameters are varied: identifier length ($L$), codebook size ($N$), strength of collaborative regularization ($\alpha$), strength of diversity regularization ($\beta$), cluster number ($K$), and temperature ($\bar{\tau}$).

Observations from Figure 8:

  • Identifier length $L$:
    • Performance initially improves when $L$ increases from 2 to 4. This suggests that longer identifiers (up to a point) can capture more fine-grained information, leading to better expressiveness.
    • However, increasing $L$ beyond 4 (e.g., to 8) degrades performance. This is attributed to error accumulation in autoregressive generation: generating longer sequences accurately is more challenging, as an error in an early token can propagate.
  • Codebook size $N$:
    • Performance generally improves as $N$ increases (e.g., from 64 to 256). A larger codebook provides more distinct code embeddings, allowing better differentiation between items and richer representations.
    • However, an excessively large $N$ (e.g., 512) can hurt performance, possibly because a very large codebook becomes more susceptible to noise in items' semantic information, leading to overfitting on meaningless semantics or sparsity issues.
  • Strength of collaborative regularization $\alpha$:
    • As $\alpha$ increases, performance generally improves, peaking around $\alpha = 0.02$. This indicates that a stronger injection of collaborative patterns is beneficial.
    • However, an overly large $\alpha$ (e.g., 0.1) causes a slight drop. This suggests a trade-off: too much emphasis on collaborative regularization can interfere with semantic regularization, leading to suboptimal overall performance. A value around 0.02 strikes a good balance.
  • Strength of diversity regularization $\beta$:
    • Even a small amount of diversity regularization (e.g., increasing $\beta$ from $1 \times 10^{-5}$ to $1 \times 10^{-4}$) significantly improves performance, confirming its effectiveness in enhancing code assignment diversity.
    • However, an excessive diversity signal (e.g., $\beta = 0.01$) degrades performance, implying that over-emphasizing diversity interferes with the integration of semantic and collaborative signals, as the tokenizer is forced to prioritize diversity over other crucial information.
  • Cluster number $K$ (for diversity regularization):
    • The optimal performance is observed at $K = 10$; deviating from this value (decreasing to 5 or increasing to 20) leads to performance degradation.
    • If $K$ is too large, clusters might contain too many code embeddings, making it difficult to enforce sufficient closeness within clusters; if $K$ is too small, clusters might be too coarse, leading to code embeddings within the same cluster being overly close or not discriminative enough.
  • Temperature $\bar{\tau}$ (for the ranking-guided generation loss):
    • Decreasing $\bar{\tau}$ from 1.2 to 0.7 generally improves performance. This is consistent with Proposition 1: a smaller temperature places more emphasis on penalizing hard negatives, strengthening the ranking ability.
    • However, performance drops slightly if $\bar{\tau}$ becomes too small (e.g., 0.6). A very small $\bar{\tau}$ may excessively suppress hard-negative samples that could be positive samples for other users, making the model too rigid or overly sensitive to minor differences and thereby harming generalization; careful tuning of $\bar{\tau}$ is therefore essential.
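
To illustrate why a smaller temperature emphasizes hard negatives, the toy computation below shows how, under a temperature-scaled softmax, the probability mass among the negatives concentrates on the highest-scoring (hard) negative as the temperature decreases, which in a cross-entropy-style generation loss translates into a larger penalty for that negative. The logits are made up, and this is a generic softmax illustration rather than the paper's exact ranking-guided generation loss.

```python
import torch
import torch.nn.functional as F

# Made-up logits for one target (positive) item and three negatives;
# index 0 is the positive, index 1 is the highest-scoring (hard) negative.
logits = torch.tensor([2.0, 1.8, 0.5, -1.0])

for tau in (1.2, 1.0, 0.7):
    probs = F.softmax(logits / tau, dim=0)
    # Share of the negatives' probability mass claimed by the hard negative:
    # it grows as the temperature shrinks, so the hard negative is penalized
    # more heavily by a cross-entropy-style generation loss.
    hard_share = (probs[1] / probs[1:].sum()).item()
    print(f"tau={tau:.1f}  hard-negative share among negatives={hard_share:.2f}")
```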

7. Conclusion & Reflections

7.1. Conclusion Summary

This study rigorously analyzed the essential characteristics of effective item tokenization for LLM-based generative recommendation. The authors introduced LETTER, a novel learnable tokenizer, which addresses the limitations of existing methods by integrating three critical components: hierarchical semantics, collaborative signals, and code assignment diversity. LETTER achieves this through a multi-faceted regularization scheme: semantic regularization using RQ-VAE for hierarchical encoding, collaborative regularization via a contrastive alignment loss to embed CF signals into code sequences, and diversity regularization to mitigate code assignment bias. Furthermore, LETTER incorporates a ranking-guided generation loss to theoretically enhance the top-K ranking ability of generative models. Extensive experiments on three real-world datasets consistently demonstrated LETTER's superior performance, pushing the state-of-the-art in LLM-based generative recommendation.

7.2. Limitations & Future Work

The authors identified several promising directions for future exploration:

  1. Tokenization with Rich User Behaviors: Future work could explore incorporating more diverse and complex user behaviors (beyond simple interactions) into the tokenization process. This would enable generative recommender models to infer user preferences from a richer set of actions.
  2. Cross-Domain Item Tokenization: LETTER has the potential to tokenize cross-domain items. This would allow generative recommender models to leverage multi-domain user behaviors and items for more comprehensive user preference reasoning and next-item recommendation, addressing scenarios where users interact with items across different categories or platforms.
  3. Combining User Instructions with Tokens: An exciting future direction is to combine natural language user instructions with user interaction history tokenized by LETTER. This could lead to more personalized recommendations by enabling collaborative reasoning that integrates complex natural language queries with structured item tokens within the generative recommender model's space.

7.3. Personal Insights & Critique

This paper presents a highly relevant and well-structured approach to a critical problem in LLM-based generative recommendation. The comprehensive analysis of item tokenization limitations and the systematic design of LETTER to address these are commendable.

Strengths:

  • Holistic Approach: LETTER's strength lies in its ability to simultaneously tackle hierarchical semantics, collaborative signals, and code assignment diversity. This multi-objective optimization for item tokenization is a significant advancement over prior work that often focused on one or two aspects in isolation.
  • Theoretical Justification: The ranking-guided generation loss with its theoretical connection to hard-negative mining and OPAUC provides strong grounding for its effectiveness in improving ranking metrics.
  • Empirical Validation: The extensive experiments and detailed ablation studies thoroughly validate each component's contribution and LETTER's overall superiority.
  • Interpretability: By explicitly defining what an ideal identifier should entail, the paper offers a clear framework for understanding item tokenization in generative recommendation.

Potential Issues/Areas for Improvement:

  • Formula for Diversity Loss: The explicit mathematical formula for the diversity regularization loss ($\mathcal{L}_{\mathrm{Div}}$) is not provided in the main text. While the intuitive description is helpful, a precise formula would enhance reproducibility and clarity for researchers seeking to implement or extend this component. The phrasing "which is defined as where..." suggests a missing equation (4), a minor but notable omission in an otherwise rigorous paper.
  • Computational Cost: Training RQ-VAE with multiple codebooks and LLMs with additional regularization terms, especially with constrained generation using Tries, can be computationally intensive. While the authors mention using 4 GPUs, a more explicit discussion of the computational overhead and scalability for very large item catalogs would be beneficial.
  • Generalizability of CF Embeddings: The collaborative regularization relies on CF embeddings from a pre-trained CF model (e.g., SASRec). The quality and robustness of these CF embeddings directly impact LETTER's performance. The paper does not delve into how sensitive LETTER is to the choice or quality of the upstream CF model. In real-world scenarios, maintaining high-quality CF embeddings for new items or evolving user behavior can be a challenge.
  • Cold-Start Scenarios for CF Embeddings: While LETTER helps cold-start items with semantic regularization, the collaborative regularization might still face challenges for truly cold-start items that lack sufficient interaction data to generate reliable CF embeddings.
  • Subjectivity of "Ideal Identifier": The criteria for an ideal identifier are well-defined, but their relative importance might vary depending on the specific recommendation task or dataset. The hyper-parameters ($\alpha$, $\beta$, $\bar{\tau}$) indicate these trade-offs, but a deeper discussion on task-specific tuning considerations could be valuable.

Transferability and Applications: The methodology proposed in LETTER is highly transferable. The concept of a learnable tokenizer that integrates multiple information sources (semantics, collaborative) and addresses distribution biases is applicable to any domain where discrete data needs to be mapped to a continuous or tokenized space for generative models. This could extend beyond recommendation to areas like:

  • Generative molecule design: Tokenizing chemical compounds based on structure and desired properties.

  • Generative music/art: Tokenizing musical notes or art elements based on stylistic and perceptual features.

  • Knowledge graph completion: Tokenizing entities and relations to facilitate LLM-based knowledge generation.

Overall, LETTER makes a significant contribution by providing a comprehensive and principled solution for item tokenization, which is a cornerstone for the successful deployment of LLMs in generative recommendation. The paper opens exciting avenues for more intelligent and fair LLM-based recommenders.
