
Towards Scalable Semantic Representation for Recommendation


TL;DR Summary

This study introduces the Mixture-of-Codes (MoC) method to address dimensionality compression when integrating LLM embeddings into recommendation systems. By constructing multiple independent codebooks and incorporating a fusion module, MoC significantly enhances the discriminability and dimension robustness of the compressed semantic representations, yielding the best scale-up recommendation performance.

Abstract

With recent advances in large language models (LLMs), there has been an emerging body of research on developing Semantic IDs based on LLMs to enhance the performance of recommendation systems. However, the dimension of these embeddings needs to match that of the ID embedding in recommendation, which is usually much smaller than the original length. Such dimension compression results in inevitable losses in discriminability and dimension robustness of the LLM embeddings, which motivates us to scale up the semantic representation. In this paper, we propose Mixture-of-Codes, which first constructs multiple independent codebooks for LLM representation in the indexing stage, and then utilizes the Semantic Representation along with a fusion module for the downstream recommendation stage. Extensive analysis and experiments demonstrate that our method achieves superior discriminability and dimension robustness scalability, leading to the best scale-up performance in recommendations.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Towards Scalable Semantic Representation for Recommendation

1.2. Authors

Taolin Zhang, Junwei Pan, Jinpeng Wang, Yaohua Zha, Tao Dai, Bin Chen, Ruisheng Luo, Xiaoxiang Deng, Yuan Wang, Ming Yue, Jie Jiang, Shu-Tao Xia

Affiliations: Tsinghua University, China; Tencent Inc, China.

1.3. Journal/Conference

This paper is a preprint, published on arXiv. arXiv is a well-regarded open-access repository for scholarly articles, primarily in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It allows researchers to share their work before formal peer review and publication, making it influential for disseminating cutting-edge research quickly.

1.4. Publication Year

Published at (UTC): 2024-10-12T15:10:56.000Z

1.5. Abstract

The paper addresses the challenge of integrating large language model (LLM) embeddings into recommendation systems. While LLMs offer rich semantic information, their high-dimensional embeddings must often be compressed to match the typically much smaller dimensions of traditional ID embeddings in recommendation systems. This compression leads to a significant loss in the discriminability and dimension robustness of the LLM embeddings. To overcome this, the authors propose a novel approach called Mixture-of-Codes (MoC). MoC operates in two stages: first, an indexing stage where multiple independent codebooks are constructed from LLM representations; second, a downstream recommendation stage where these semantic representations are fused using a dedicated module. Through extensive analysis and experiments, MoC is shown to achieve superior scalability in terms of discriminability and dimension robustness, resulting in enhanced recommendation performance.

Official Source: https://arxiv.org/abs/2410.09560v1
PDF Link: https://arxiv.org/pdf/2410.09560v1.pdf
Publication Status: This is a preprint (v1) on arXiv, indicating it has not yet undergone formal peer review and publication at a conference or in a journal.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the semantic gap and dimension mismatch when attempting to leverage the rich semantic knowledge from Large Language Models (LLMs) to enhance recommendation systems. LLMs produce high-dimensional embeddings (e.g., 4,096 to 16,384 dimensions) that capture nuanced semantic information. However, recommendation systems typically operate with much lower-dimensional ID embeddings (e.g., usually no more than 256 dimensions), largely due to factors like the Interaction Collapse Theory which suggests that excessive dimensions can hinder effective feature interaction.

This disparity necessitates dimension compression of LLM embeddings. Existing methods often project these high-dimensional embeddings or derive Semantic IDs (discrete codes) using techniques like VQ-VAE or RQ-VAE. The challenge is that compressing these rich LLM embeddings into a low-dimensional space for recommendation leads to inevitable information loss, specifically impacting the discriminability (ability to distinguish between items) and dimension robustness (ability to maintain information across varying dimensions) of the resulting semantic representations. This information loss hinders the effective transfer of knowledge from LLMs to recommendation systems, making their application less impactful than desired.

The paper's entry point is this observed failure of current approaches to effectively scale up semantic representations while preserving LLM knowledge. The authors note that a single semantic embedding is insufficient to capture the complexity of high-dimensional LLM embeddings. They aim to design a scalable semantic representation strategy that can overcome this limitation and mitigate information loss during dimension compression.

2.2. Main Contributions / Findings

The primary contributions of this paper can be summarized as follows:

  • Pioneering Scalability Study: The paper initiates a dedicated study into the scalability of semantic representations for transferring knowledge from LLMs to recommendation systems. It rigorously demonstrates that existing baseline methods, such as Multi-Embedding (ME) and Residual Quantization VAE (RQ-VAE), fail to scale effectively in terms of discriminability and dimension robustness.
  • Proposal of Mixture-of-Codes (MoC): The authors propose a novel two-stage approach named Mixture-of-Codes (MoC).
    • In the indexing stage, MoC learns multiple independent discrete codebooks from LLM embeddings, departing from the hierarchical structure of RQ-VAE. This parallel architecture aims to capture complementary information more effectively.
    • In the downstream recommendation stage, MoC employs a dedicated fusion module (a bottleneck network) to implicitly combine the learnable embeddings derived from these multiple codebooks. This fusion mechanism helps in better knowledge transfer and generalization.
  • Demonstrated Superior Scalability and Performance: Extensive experiments on three public datasets (Amazon-Beauty, Amazon-Sports, Amazon-Toys) and across four representative CTR models (DeepFM, DeepIM, AutoInt+, DCNv2) consistently demonstrate MoC's advantages.
    • Discriminability: MoC shows superior and consistently increasing discriminability with higher scaling factors, unlike baselines.

    • Dimension Robustness: MoC exhibits more robust singular spectra, maintaining higher top singular values without suffering from dimensional collapse in its long-tail singular values, thus outperforming ME and RQ-VAE.

    • Recommendation Performance: MoC achieves the best scale-up performance in recommendations, showing significant AUC gains, especially at higher scaling factors (e.g., 7x), while baselines often suffer from performance degradation or only marginal gains. The fusion module is also shown to be crucial for this performance.

      In essence, the paper provides a solution to effectively bridge the high-dimensional semantic space of LLMs with the low-dimensional embedding space of recommendation systems, ensuring that valuable LLM knowledge is transferred without significant loss, leading to improved recommendation accuracy.

3. Prerequisite Knowledge & Related Work

This section provides an overview of the foundational concepts and previous research necessary to understand the "Towards Scalable Semantic Representation for Recommendation" paper.

3.1. Foundational Concepts

3.1.1. Large Language Models (LLMs)

Large Language Models (LLMs) are a class of artificial intelligence models, typically based on transformer architectures, trained on vast amounts of text data. They excel at understanding, generating, and processing human language.

  • Role in Recommendation: LLMs can extract rich semantic information from item descriptions, user reviews, or other textual data associated with items and users. This semantic understanding can be represented as high-dimensional embeddings (vector representations), which can then be used to enhance recommendation system capabilities, such as cold-start recommendations, understanding nuanced user preferences, or generating explanations.

3.1.2. Recommendation Systems (RS)

Recommendation systems are information filtering systems that aim to predict the "rating" or "preference" a user would give to an item. They help users discover items they might like (e.g., movies, products, articles).

  • Core Idea: By analyzing past user-item interactions and item attributes, RS suggest relevant items, typically improving user experience and engagement.
  • Embeddings in RS: Items and users are often represented as low-dimensional embedding vectors. These vectors capture latent features and relationships, and their interactions (e.g., dot product) can predict user preference.

3.1.3. Embeddings

An embedding is a dense vector representation of a discrete variable (like a word, item, or user) in a continuous vector space. The idea is that items with similar properties or meanings will have similar embedding vectors (i.e., their vectors will be close to each other in the vector space).

  • High-dimensional vs. Low-dimensional: LLMs typically produce very high-dimensional embeddings (e.g., 4096 dimensions), capturing a wide range of semantic information. Recommendation systems, for efficiency and due to specific theoretical considerations, often use much lower-dimensional embeddings (e.g., 32 to 256 dimensions).
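As a toy illustration of this gap (not from the paper), the following sketch projects hypothetical 4096-dimensional LLM embeddings down to a 32-dimensional recommendation space with a single learned linear layer; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

llm_dim, rec_dim = 4096, 32              # illustrative: LLM vs. recommendation dimensions
item_llm_emb = torch.randn(8, llm_dim)   # 8 hypothetical items with LLM-derived embeddings

# The simplest way to bridge the dimension gap is a learned linear projection;
# the paper argues this kind of compression loses discriminability.
projector = nn.Linear(llm_dim, rec_dim)
item_rec_emb = projector(item_llm_emb)   # shape: (8, 32)
print(item_rec_emb.shape)
```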

3.1.4. Semantic IDs/Codes

Semantic IDs (or codes) are discrete identifiers derived from continuous, high-dimensional semantic embeddings. Instead of using the full continuous vector, an item is assigned one or more discrete codes that represent its semantic properties.

  • How they are formed: Often, these codes are generated through quantization techniques, where a continuous vector is mapped to the closest discrete code from a predefined set (a codebook). This process essentially clusters items with similar semantic meanings.
  • Benefit: They can reduce storage and computational cost, enable efficient retrieval, and facilitate the integration of LLM semantics into recommendation systems by acting as learnable categorical features.

3.1.5. VQ-VAE (Vector Quantized Variational AutoEncoder)

VQ-VAE is a type of neural network that learns discrete representations (codes) of its input data. It's an autoencoder architecture but with a crucial modification: a vector quantization layer in the latent space.

  • Autoencoder: Consists of an encoder that maps input data to a latent representation, and a decoder that reconstructs the input from this latent representation.
  • Vector Quantization: Instead of a continuous latent space, VQ-VAE uses a discrete set of vectors called a codebook. The encoder's output is quantized by finding the closest vector in the codebook, and this quantized vector is then passed to the decoder.
  • Objective: To learn a compressed, discrete representation of the input while minimizing the reconstruction error. This helps in discovering meaningful discrete features.
  • Quantization Step: Given an encoder output $\mathbf{z} := \mathcal{E}(x) \in \mathbb{R}^{n_z}$, VQ-VAE quantizes this embedding by finding the code $\mathbf{z}_k$ from the codebook $\mathcal{Z} = \{ \mathbf{z}_k \}_{k=1}^K$ that is nearest to $\mathbf{z}$:
$$ \mathbf{z}^{q} = \underset{\mathbf{z}_k \in \mathcal{Z}}{\arg\min} \; \| \mathbf{z} - \mathbf{z}_k \|_2^2 $$
where $\mathbf{z}^{q}$ is the quantized vector and the $\mathbf{z}_k$ are the learnable codebook vectors.
  • Loss Function: The VQ-VAE is trained end-to-end using the loss
$$ \mathcal{L}_{\mathrm{VQ}}(\mathcal{E}, \mathcal{D}, \mathcal{Z}) = \| x - \hat{x} \|^2 + \| \mathrm{sg}[\mathbf{z}^{q}] - \mathbf{z} \|_2^2 + \| \mathrm{sg}[\mathbf{z}] - \mathbf{z}^{q} \|_2^2 $$
Here:
    • $x$: Original input data.
    • $\hat{x}$: Reconstructed output from the decoder, $\hat{x} = \mathcal{D}(\mathbf{z}^{q})$.
    • $\mathcal{L}_{\mathrm{rec}} = \| x - \hat{x} \|^2$: The reconstruction loss (e.g., mean squared error), which ensures the decoder can reconstruct the input from the quantized latent space.
    • $\| \mathrm{sg}[\mathbf{z}^{q}] - \mathbf{z} \|_2^2$: The commitment loss, which encourages the encoder output $\mathbf{z}$ to "commit" to (stay close to) the chosen codebook vector $\mathbf{z}^{q}$. Here sg[.] denotes the stop-gradient operation, meaning gradients do not flow through this term, preventing the encoder from collapsing the codebook.
    • $\| \mathrm{sg}[\mathbf{z}] - \mathbf{z}^{q} \|_2^2$: This term updates the codebook vectors (specifically, the chosen $\mathbf{z}^{q}$) to move closer to the encoder output $\mathbf{z}$. The stop-gradient on $\mathbf{z}$ ensures that only the codebook is updated, not the encoder.
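To make the quantization step and the three loss terms concrete, here is a minimal PyTorch-style sketch (not the authors' code; the encoder, decoder, and all sizes are illustrative stand-ins).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQVAE(nn.Module):
    def __init__(self, input_dim=4096, latent_dim=32, num_codes=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim))
        self.codebook = nn.Embedding(num_codes, latent_dim)   # Z = {z_k}

    def forward(self, x):
        z = self.encoder(x)                                   # z = E(x)
        dists = torch.cdist(z, self.codebook.weight)          # ||z - z_k|| for every code
        ids = dists.argmin(dim=1)                             # Semantic IDs (nearest codes)
        z_q = self.codebook(ids)                              # quantized vectors z^q
        z_q_st = z + (z_q - z).detach()                       # straight-through estimator
        x_hat = self.decoder(z_q_st)
        loss = (F.mse_loss(x_hat, x)                          # reconstruction loss
                + F.mse_loss(z_q.detach(), z)                 # commitment: ||sg[z^q] - z||^2
                + F.mse_loss(z_q, z.detach()))                # codebook update: ||sg[z] - z^q||^2
        return ids, loss

model = VQVAE()
ids, loss = model(torch.randn(16, 4096))
loss.backward()
```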

3.1.6. RQ-VAE (Residual Quantization VAE)

RQ-VAE is an extension of VQ-VAE that uses residual quantization to achieve finer-grained discrete representations. Instead of quantizing the entire latent vector at once, it quantizes the residual (the remaining information) from the previous quantization step.

  • Hierarchical Design: It applies quantization at multiple levels. First, a codebook quantizes the original latent representation. Then, the difference (residual) between the original and the quantized representation is passed to a second codebook for further quantization, and so on.
  • Benefit: This allows for a more detailed and hierarchical encoding of information, potentially capturing more nuances than a single VQ-VAE codebook. Each subsequent level captures increasingly fine-grained details.
  • Quantization Steps:
$$ \mathbf{z}_i^{q} = \underset{\mathbf{z}_k \in \mathcal{Z}_i}{\arg\min} \; \| \mathbf{z}_i - \mathbf{z}_k \|_2^2, \qquad \mathbf{z}_{i+1} = \mathbf{z}_i - \mathbf{z}_i^{q}. $$
Here, $\mathbf{z}_i$ is the input to the $i$-th quantization stage (where $\mathbf{z}_1$ is the encoder output), $\mathbf{z}_i^{q}$ is the quantized output of the $i$-th stage using codebook $\mathcal{Z}_i$, and $\mathbf{z}_{i+1}$ is the residual passed to the next stage.
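A minimal sketch of this residual-quantization loop follows; for brevity the codebooks here are fixed random matrices rather than learned parameters.

```python
import torch

def residual_quantize(z, codebooks):
    """Quantize z with a list of codebooks; each level quantizes the previous residual."""
    ids, residual = [], z
    for codebook in codebooks:                        # codebook: (K, latent_dim)
        dists = torch.cdist(residual, codebook)       # distances to every codeword
        idx = dists.argmin(dim=1)                     # Semantic ID at this level
        z_q = codebook[idx]                           # quantized residual
        ids.append(idx)
        residual = residual - z_q                     # pass remaining detail to next level
    return ids

z = torch.randn(4, 32)                                # encoder outputs for 4 items
codebooks = [torch.randn(256, 32) for _ in range(3)]  # 3 hierarchical levels
level_ids = residual_quantize(z, codebooks)
print([i.tolist() for i in level_ids])
```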

3.1.7. Interaction Collapse Theory

The Interaction Collapse Theory in recommendation systems suggests that increasing the dimensionality of item or user embeddings beyond a certain point can lead to diminishing returns or even performance degradation. This is because higher dimensions can make it harder for interaction models to effectively learn meaningful relationships between features, potentially introducing noise or sparsity. Therefore, recommendation embeddings are usually kept at relatively low dimensions (e.g., 64-256).

3.1.8. Discriminability

In the context of representation learning, discriminability refers to how well a learned representation can differentiate between different classes, items, or concepts. A highly discriminative representation means that distinct items will have distinct representations, making it easier for downstream tasks (like classification or recommendation) to tell them apart.

  • Measurement: Often measured using metrics like Normalized Mutual Information (NMI) in clustering settings, where a higher NMI indicates better discriminative power.

3.1.9. Dimension Robustness

Dimension robustness refers to the ability of a representation to maintain its informative content and structure even when its dimensionality is altered (e.g., compressed or expanded). A robust representation should not suffer from significant information loss or dimensional collapse when its dimensions are changed.

  • Dimensional Collapse: A phenomenon where all learned representations tend to lie on a low-dimensional subspace, or even collapse to a single point, losing much of their discriminative power.
  • Measurement: Can be analyzed by examining the singular spectrum of the representation matrix. A robust representation would have a singular spectrum that decays smoothly, indicating that information is spread across many dimensions, rather than sharply dropping off (indicating collapse).

3.1.10. Mutual Information (MI) and Normalized Mutual Information (NMI)

Mutual Information (MI) is a measure of the mutual dependence between two random variables. It quantifies the amount of information obtained about one random variable by observing the other. In simpler terms, it tells us how much knowing one variable reduces uncertainty about the other.

  • Formula: For discrete random variables $X$ and $Y$ with joint probability $p(x,y)$ and marginal probabilities $p(x)$ and $p(y)$:
$$ I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \left( \frac{p(x,y)}{p(x)\,p(y)} \right) $$
  • Normalized Mutual Information (NMI): NMI is a normalized version of MI, typically used to compare clustering results against ground truth labels. It scales the MI score to a value between 0 and 1, where 1 indicates perfect correlation and 0 indicates no mutual information.
    • Formula: A common normalization is
$$ \mathrm{NMI}(X,Y) = \frac{I(X;Y)}{\sqrt{H(X)\,H(Y)}} $$
where $H(X)$ and $H(Y)$ are the entropies of $X$ and $Y$, respectively.
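In practice, the NMI between a clustering of item representations and downstream labels can be computed with scikit-learn along the following lines (illustrative data; the paper's exact evaluation protocol may differ).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
item_repr = rng.normal(size=(1000, 32))   # semantic representations (illustrative)
labels = rng.integers(0, 10, size=1000)   # downstream supervised labels (illustrative)

# Cluster the representations, then measure agreement with the labels via NMI.
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(item_repr)
print(normalized_mutual_info_score(labels, clusters))
```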

3.1.11. Singular Value Decomposition (SVD) and Singular Spectrum

Singular Value Decomposition (SVD) is a fundamental matrix factorization technique. For any matrix $A$, SVD decomposes it into three matrices: $A = U \Sigma V^T$.

  • Singular Values: The diagonal entries of $\Sigma$ (denoted $\sigma_1, \sigma_2, \dots, \sigma_r$) are the singular values of $A$, usually ordered from largest to smallest. These singular values represent the "strength" or "importance" of the corresponding dimensions (principal components) in the data.
  • Singular Spectrum: The plot of singular values (or their squares) against their index is called the singular spectrum.
  • Interpretation for Robustness: A rapidly decaying singular spectrum indicates that most of the information in the data is captured by a few principal components (low effective rank), suggesting potential dimensional collapse or redundancy. A slow, smooth decay suggests that information is distributed across many dimensions, indicating higher dimension robustness.
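A short sketch of how a singular spectrum can be inspected with NumPy; the matrix here is random and only stands in for a learned semantic-representation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5000, 224))   # e.g., items x (7 codes * 32 dims), illustrative

# Singular values in descending order; a sharp drop in the tail signals dimensional collapse.
singular_values = np.linalg.svd(A - A.mean(axis=0), compute_uv=False)
print(singular_values[:5], singular_values[-5:])
```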

3.1.12. Pearson Correlation Coefficient

The Pearson correlation coefficient (often denoted as rr) is a measure of the linear correlation between two sets of data. It has a value between +1 and −1 inclusive, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.

  • Formula: For two variables $X$ and $Y$ with $n$ observations:
$$ r_{XY} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^n (Y_i - \bar{Y})^2}} $$
Here:
    • $X_i, Y_i$: individual data points.
    • $\bar{X}, \bar{Y}$: means of $X$ and $Y$.
    • $\sum$: summation symbol.

3.2. Previous Works

The paper builds upon and contrasts with several lines of research:

  • LLM-enhanced Recommendation Systems: Early works like those by Hou et al. (2024) and Bao et al. (2023) explored using LLMs to improve recommendations by leveraging their semantic understanding. This often involved simple projection of LLM embeddings into recommendation systems, which the current paper identifies as ineffective due to the semantic gap.
  • Semantic IDs for Recommendation: A key direction has been to derive discrete Semantic IDs from LLM embeddings.
    • VQ-VAE/RQ-VAE based approaches: Papers like Rajput et al. (2024) (TIGER), Singh et al. (2023), Jin et al. (2023) (LMINDEXER), and Zheng et al. (2024) (LC-Rec) have utilized VQ-VAE or RQ-VAE to convert continuous LLM embeddings into discrete codes. These codes then serve as categorical features (Semantic IDs) in downstream recommendation models. The idea is to capture the local structure of the original LLM embedding space.
    • TIGER (Rajput et al., 2024): Uses a hierarchical quantizer (RQ-VAE) to generate item tokens for generative retrieval.
    • LC-Rec (Zheng et al., 2024): Improves TIGER by integrating LLM knowledge and instruction tuning for adaptation to recommender systems.
    • LMINDEXER (Jin et al., 2023): Learns Semantic IDs in a self-supervised manner to obtain document semantic representations and their hierarchical structures.
  • Embedding Scaling in Recommendation:
    • Multi-Embedding (Guo et al., 2023): This work, related to the Interaction Collapse Theory, explores scaling up embeddings in recommendation systems by using multiple embeddings for the same ID. The current paper uses this as a baseline, specifically applying the idea of using $M$ embeddings for a single Semantic ID.
    • RQ-VAE (Lee et al., 2022): While originally for image generation, RQ-VAE's hierarchical quantization mechanism has been adopted by recommendation systems to create multi-level Semantic IDs. The current paper uses it as a baseline to explore scaling with multiple hierarchical codebooks and embeddings.

3.3. Technological Evolution

The evolution starts from basic recommendation systems, then moves to integrating powerful general-purpose models like LLMs. The initial attempts were often direct projections, which proved suboptimal. This led to the idea of Semantic IDs using quantization techniques (like VQ-VAE and RQ-VAE) to bridge the semantic gap more effectively. However, even these Semantic ID approaches faced limitations when trying to scale, especially concerning information loss due to dimension compression. This paper, "Towards Scalable Semantic Representation for Recommendation," represents the next step in this evolution: addressing the scalability challenge of semantic representations derived from LLMs, proposing a new method (MoC) that effectively preserves discriminability and dimension robustness while enabling performance gains as the semantic representation scales.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  • Addressing Scalability: Unlike previous works that focus on generating Semantic IDs, this paper explicitly investigates the scalability of these representations, quantifying it with discriminability and dimension robustness. It reveals that existing methods (ME, RQ-VAE) fail in this regard.

  • Parallel Codebooks vs. Hierarchical/Single Codebook:

    • Multi-Embedding (ME) baseline: Uses a single codebook but multiple embeddings for the same Semantic ID. The paper argues this introduces redundancy and minimal new information.
    • RQ-VAE baseline: Employs hierarchical codebooks. The paper criticizes this for producing Semantic IDs that are highly dependent and where higher-level IDs contain diminishing or less informative details, hindering overall discriminability.
    • MoC Innovation: Proposes multiple independent parallel codebooks. This design aims to capture complementary information from the LLM embedding without the inter-dependency issues of hierarchical approaches or the redundancy of single-codebook multi-embeddings. Each codebook focuses on a distinct aspect of the LLM representation.
  • Implicit Fusion Mechanism: MoC introduces a novel implicit fusion module (a bottleneck network) in the downstream recommendation stage. This is distinct from traditional Mixture-of-Experts approaches that often rely on a gating network for explicit routing. MoC's implicit fusion, trained with task-specific loss, effectively mixes the embeddings from parallel codebooks, leading to better generalization and performance.

  • Comprehensive Metrics: The paper introduces and thoroughly uses specific metrics (NMI for discriminability, singular spectrum for dimension robustness) to quantitatively evaluate scalability, providing a more rigorous analysis than simply reporting recommendation performance.

    In summary, MoC offers a fundamentally different way to structure Semantic IDs and integrate them, moving beyond single-source or hierarchically dependent representations to a more robust, independent, and fused approach that explicitly targets and achieves scalability.

4. Methodology

This section provides a detailed, step-by-step breakdown of the proposed Mixture-of-Codes (MoC) method, integrating explanations of its underlying principles and mathematical formulations.

4.1. Principles

The core principle of the Mixture-of-Codes (MoC) method is to overcome the inherent information loss that occurs when high-dimensional LLM embeddings are compressed to match the lower dimensions typically used in recommendation systems. Instead of relying on a single, compact semantic representation that inevitably sacrifices rich LLM knowledge, MoC aims to scale up the semantic representation effectively. It achieves this by recognizing that multiple, independent, and complementary discrete representations can collectively capture more information than a single one.

The theoretical basis draws inspiration from vector quantization techniques (like VQ-VAE) for learning discrete codes, but critically diverges in how multiple codes are generated and integrated. The intuition is that by training several codebooks in parallel, each can learn to represent different aspects or "views" of the original LLM embedding space. These distinct "codes" then provide a richer, more robust semantic fingerprint for each item. When these multiple code-derived embeddings are later combined via a fusion module, they can collectively offer a more complete picture to the downstream recommendation model, enabling better prediction without suffering from the dimension collapse or redundancy observed in prior scaling attempts.

4.2. Core Methodology In-depth (Layer by Layer)

The Mixture-of-Codes (MoC) approach is structured into two main stages: an indexing stage for constructing multiple independent codebooks and a downstream recommendation stage for fusing the resulting semantic representations.

4.2.1. Preliminaries: VQ-VAE for Semantic ID Generation

Before detailing MoC, it's essential to understand the foundation of Vector Quantized Variational AutoEncoder (VQ-VAE) as it's the base for generating discrete semantic IDs.

Given an input $x$ (e.g., an LLM embedding for an item), a VQ-VAE first uses an encoder $\mathcal{E}$ to map $x$ to a continuous latent representation $\mathbf{z} := \mathcal{E}(x) \in \mathbb{R}^{n_z}$. Then, a codebook $\mathcal{Z} = \{ \mathbf{z}_k \}_{k=1}^K$ is used for quantization. This codebook consists of $K$ learnable code vectors (or codewords), each $\mathbf{z}_k \in \mathbb{R}^{n_z}$. The quantization process finds the codebook vector $\mathbf{z}_k$ that is closest to the encoder output $\mathbf{z}$ in Euclidean distance; the chosen codebook vector becomes the quantized embedding $\mathbf{z}^{q}$. This selection is formulated as:
$$ \mathbf{z}^{q} = \underset{\mathbf{z}_k \in \mathcal{Z}}{\arg\min} \; \| \mathbf{z} - \mathbf{z}_k \|_2^2 $$
The quantized vector $\mathbf{z}^{q}$ is then passed to a decoder $\mathcal{D}$ to reconstruct the original input, yielding $\hat{x} = \mathcal{D}(\mathbf{z}^{q})$. The index $k$ of the chosen codebook vector can be used as the Semantic ID for the input $x$.

The training of the VQ-VAE (the encoder $\mathcal{E}$, decoder $\mathcal{D}$, and codebook $\mathcal{Z}$) is performed end-to-end using the following loss function:
$$ \mathcal{L}_{\mathrm{VQ}}(\mathcal{E}, \mathcal{D}, \mathcal{Z}) = \| x - \hat{x} \|^2 + \| \mathrm{sg}[\mathbf{z}^{q}] - \mathbf{z} \|_2^2 + \| \mathrm{sg}[\mathbf{z}] - \mathbf{z}^{q} \|_2^2, $$
where:

  • $x$: The original high-dimensional LLM embedding of an item.
  • $\hat{x}$: The reconstructed LLM embedding.
  • $\mathcal{L}_{\mathrm{rec}} = \| x - \hat{x} \|^2$: The reconstruction loss (e.g., mean squared error), which ensures that the quantized representation captures enough information to recreate the original input.
  • $\| \mathrm{sg}[\mathbf{z}^{q}] - \mathbf{z} \|_2^2$: The commitment loss, which pulls the encoder output $\mathbf{z}$ towards the selected codebook vector $\mathbf{z}^{q}$. The stop-gradient operation sg[.] prevents gradients from flowing back through $\mathbf{z}^{q}$ to the codebook, ensuring that only the encoder learns to commit to the codebook.
  • $\| \mathrm{sg}[\mathbf{z}] - \mathbf{z}^{q} \|_2^2$: This term is used to update the codebook vectors. It pulls the selected codebook vector $\mathbf{z}^{q}$ towards the encoder output $\mathbf{z}$. The stop-gradient on $\mathbf{z}$ ensures that the encoder itself is not affected by this update, maintaining the stability of the codebook. In practice, as mentioned in the paper, a moving-average update (Van Den Oord et al., 2017) is often used for stable codebook training instead of relying solely on this auxiliary loss.
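For reference, the moving-average (EMA) codebook update mentioned above can be sketched as follows; this is a simplified illustration (it omits, e.g., the Laplace smoothing of cluster counts used in common implementations), not the authors' code.

```python
import torch

def ema_codebook_update(codebook, cluster_size, embed_avg, z, ids, decay=0.99):
    """One exponential-moving-average update of the codebook, given a batch of
    encoder outputs z and their assigned code indices ids."""
    K = codebook.shape[0]
    one_hot = torch.zeros(z.shape[0], K).scatter_(1, ids.unsqueeze(1), 1.0)   # (B, K)
    # Running estimates of how often each code is used and of the summed encoder outputs.
    cluster_size.mul_(decay).add_(one_hot.sum(dim=0), alpha=1 - decay)
    embed_avg.mul_(decay).add_(one_hot.t() @ z, alpha=1 - decay)
    # Each codeword moves toward the running mean of the encoder outputs assigned to it.
    codebook.copy_(embed_avg / cluster_size.clamp(min=1e-5).unsqueeze(1))

K, d = 256, 32
codebook = torch.randn(K, d)
cluster_size = torch.zeros(K)
embed_avg = codebook.clone()
z = torch.randn(64, d)
ids = torch.cdist(z, codebook).argmin(dim=1)
ema_codebook_update(codebook, cluster_size, embed_avg, z, ids)
```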

4.2.2. Mixture-of-Codes (MoC) - Indexing Stage: Multi Codebooks for Vector Quantization

The MoC method differentiates itself by learning multiple independent discrete codebooks in parallel, rather than a single codebook (like basic VQ-VAE) or hierarchical codebooks (like RQ-VAE). This addresses the issue that a single code might not capture enough information, and hierarchical codes might become less informative at higher levels.

The process is as follows:

  1. Encoder Output: An item's LLM embedding $x$ is first passed through an encoder $\mathcal{E}$ to obtain a continuous latent representation $\mathbf{z} = \mathcal{E}(x)$. This is the same as in VQ-VAE.
  2. Parallel Quantization: Instead of one, $N$ independent codebooks $\{ \mathcal{Z}_i \}_{i=1}^N$ are used. Each codebook $\mathcal{Z}_i$ performs an independent quantization of the encoder output $\mathbf{z}$ to produce its own quantized vector $\mathbf{z}_i^{q}$. For each codebook $i \in \{1, \dots, N\}$, the quantization is:
$$ \mathbf{z}_i^{q} = \underset{\mathbf{z}_k \in \mathcal{Z}_i}{\arg\min} \; \| \mathbf{z} - \mathbf{z}_k \|_2^2 $$
This yields $N$ different quantized vectors, one from each codebook.
  3. Average Quantization for Reconstruction: For the purpose of reconstruction, these $N$ quantized vectors are averaged to form a combined quantized representation $\mathbf{z}^{q}$:
$$ \mathbf{z}^{q} = \mathrm{AVG}(\{ \mathbf{z}_i^{q} \}_{i=1}^N) $$
This averaged $\mathbf{z}^{q}$ is then fed into the decoder $\mathcal{D}$ to reconstruct $\hat{x}$.
  4. Training Loss: The overall training loss for this multi-codebook VQ-VAE (MoC in the indexing stage) has the same form as the standard VQ-VAE loss, but is applied to the averaged quantized vector:
$$ \mathcal{L}_{\mathrm{MoC}}(\mathcal{E}, \mathcal{D}, \{ \mathcal{Z}_i \}_{i=1}^N) = \| x - \hat{x} \|^2 + \| \mathrm{sg}[\mathbf{z}^{q}] - \mathbf{z} \|_2^2 + \| \mathrm{sg}[\mathbf{z}] - \mathbf{z}^{q} \|_2^2, $$
where $x$, $\hat{x}$, $\mathbf{z}$, and sg[.] are defined as before. The key difference is that $\mathbf{z}^{q}$ here refers to the averaged quantized vector from the $N$ codebooks. This formulation allows each codebook to independently learn distinct semantic groupings, contributing to a more comprehensive representation.
  5. Semantic ID Extraction: After this indexing stage, each item is assigned $N$ Semantic IDs (one index from each of the $N$ codebooks), denoted $\{ x_{\mathrm{sid}_1}, x_{\mathrm{sid}_2}, \dots, x_{\mathrm{sid}_N} \}$. These discrete IDs are then used in the downstream recommendation stage.
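The parallel quantization and averaging of steps 2-4 can be sketched as follows (a minimal, illustrative implementation, not the authors' code; the reconstruction path through the decoder is omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoCQuantizer(nn.Module):
    """N independent codebooks quantize the same encoder output in parallel."""
    def __init__(self, latent_dim=32, num_codes=256, num_codebooks=7):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(num_codes, latent_dim) for _ in range(num_codebooks)])

    def forward(self, z):
        ids, quantized = [], []
        for codebook in self.codebooks:
            dists = torch.cdist(z, codebook.weight)    # distances to this codebook's codes
            idx = dists.argmin(dim=1)                  # Semantic ID from this codebook
            ids.append(idx)
            quantized.append(codebook(idx))
        z_q = torch.stack(quantized).mean(dim=0)       # z^q = AVG({z_i^q})
        z_q_st = z + (z_q - z).detach()                # straight-through for the encoder
        commit = F.mse_loss(z_q.detach(), z) + F.mse_loss(z_q, z.detach())
        return ids, z_q_st, commit

quantizer = MoCQuantizer()
z = torch.randn(16, 32)                                # encoder outputs
ids, z_q, aux_loss = quantizer(z)                      # N Semantic IDs per item
print(len(ids), z_q.shape)
```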

4.2.3. Mixture-of-Codes (MoC) - Downstream Recommendation Stage: Implicit Fusion

Once the NN Semantic IDs for all items are obtained from the indexing stage, these IDs need to be integrated into a downstream recommendation model. Traditional Mixture-of-Experts approaches often use a gating router to dynamically select and weight experts, but this is impractical here because the codebooks are not trained end-to-end with the recommendation task. Instead, MoC proposes an implicit fusion mechanism using a bottleneck network.

The steps for integration and fusion are:

  1. Embedding Lookup: For each of the $N$ Semantic IDs $\{ x_{\mathrm{sid}_1}, \dots, x_{\mathrm{sid}_N} \}$, a corresponding learnable embedding $e_{\mathrm{sid}_i}$ is looked up from an embedding table associated with that codebook. These Semantic ID embeddings are initialized and then jointly trained with the downstream recommendation model, alongside the embeddings $e_1, \dots, e_n$ of the item's $n$ other original feature attributes (e.g., category, brand).
  2. Concatenation: All item embeddings, i.e., the original feature embeddings and the Semantic ID embeddings, are concatenated into a single vector $e_{\mathrm{concat}}$:
$$ e_{\mathrm{concat}} = \mathrm{CONCAT}(e_1, \dots, e_n, e_{\mathrm{sid}_1}, \dots, e_{\mathrm{sid}_M}) $$
Here, $M$ denotes the total number of Semantic IDs used (which is $N$ in the context of MoC).
  3. Bottleneck Fusion Network: A bottleneck network is applied to $e_{\mathrm{concat}}$. This network first projects the high-dimensional concatenated embedding down to a lower dimension and then projects it back up to the original dimension. This forces information flow and interaction between all concatenated features (original features and Semantic ID embeddings) in a compressed latent space, facilitating implicit fusion. The output is $e_{\mathrm{concat}}'$:
$$ e_{\mathrm{concat}}' = e_{\mathrm{concat}} + e_{\mathrm{concat}} \cdot \mathbf{W}_{\mathrm{down}} \cdot \mathbf{W}_{\mathrm{up}} $$
where:
    • $\mathbf{W}_{\mathrm{down}}$: The down-projection layer (a matrix or MLP) that maps $e_{\mathrm{concat}}$ to a lower-dimensional space.
    • $\mathbf{W}_{\mathrm{up}}$: The up-projection layer (a matrix or MLP) that maps the compressed representation back to the original dimension.
    • The residual connection ($+\,e_{\mathrm{concat}}$) helps with training stability and preserves information.
    • This bottleneck design, with an activation such as GeLU between the projections (as shown in Figure 6), acts as an implicit fusion mechanism, allowing the recommendation model to learn how to best combine the complementary information from the different Semantic ID embeddings.
  4. Splitting and Feature Interaction: The fused concatenated embedding $e_{\mathrm{concat}}'$ is then split back into its individual feature embeddings:
$$ e_1, \dots, e_n, e_{\mathrm{sid}_1}, \dots, e_{\mathrm{sid}_M} = \mathrm{SPLIT}(e_{\mathrm{concat}}') $$
These refined embeddings are then fed into the subsequent feature interaction modules of the chosen downstream recommendation model (e.g., DeepFM, DCNv2) to predict the recommendation outcome.

Figure 6 from the original paper visually represents this overall architecture.

Figure 6: The overall architecture of MoC Fusion. A bottleneck network is adopted for feature fusion in the downstream stage.

The left side of the diagram shows the standard feature interaction process, where individual embeddings $e_i$ are directly fed into an interaction module $I$. The right side illustrates MoC Fusion: after the item attributes $e_1, \dots, e_n$ and the Semantic ID embeddings $e_{\mathrm{sid}_1}, \dots, e_{\mathrm{sid}_M}$ are concatenated, they pass through a bottleneck network ($\mathbf{W}_{\mathrm{down}}$, GeLU, $\mathbf{W}_{\mathrm{up}}$) that performs the implicit fusion. The output is then split back into individual embeddings before entering the feature interaction module $I$, which feeds into a prediction head $F$.

By combining multiple independent codebooks with an implicit fusion mechanism, MoC aims to effectively scale semantic representations, preserve discriminative information, and enhance dimension robustness, leading to improved recommendation performance.

5. Experimental Setup

This section details the experimental setup used to evaluate the proposed Mixture-of-Codes (MoC) method, including the datasets, evaluation metrics, baseline models, and implementation specifics.

5.1. Datasets

The experiments are conducted on three widely-used public datasets from the Amazon review benchmark (He & McAuley, 2016):

  • Amazon-Beauty: Contains reviews and metadata for beauty products.
  • Amazon-Sports: Contains reviews and metadata for sports and outdoors items.
  • Amazon-Toys: Contains reviews and metadata for toys and games.

These datasets are chosen for their diverse domains and extensive user-item interaction data, which are standard for evaluating recommendation systems.

Data Preprocessing:

  • The authors follow the methodology of LMINDEXER (Jin et al., 2023) by filtering out infrequent users and items: only users and items with at least 5 interactions are kept. This helps ensure sufficient data for learning meaningful representations and interactions.
  • Item Textual Descriptions: For each item, textual descriptions (comprising title, brand, and categories) are utilized.
  • LLM Embeddings: To obtain initial LLM embeddings for these textual descriptions, the LLM2Vec (BehnamGhader et al., 2024) framework is employed, with LLaMA3 (Dubey et al., 2024) serving as the backbone large language model. LLaMA3 is a powerful, state-of-the-art LLM known for its strong language understanding capabilities.
  • Data Splits: An 8/1/1 split is used for training, validation, and testing, respectively, across all three datasets. An early stopping strategy based on validation performance is adopted to prevent overfitting.

5.2. Evaluation Metrics

The performance of the proposed MoC method and the baselines is evaluated using several metrics.

5.2.1. Area Under the Receiver Operating Characteristic Curve (AUC)

  • Conceptual Definition: AUC is a performance measure for classification at various threshold settings. It represents the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. In recommendation, it is used to evaluate how well a model discriminates between items a user would like (positive) and items they would not (negative). A higher AUC indicates better discriminatory power.
  • Mathematical Formula: AUC is the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various thresholds:
$$ \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \quad \text{and} \quad \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} $$
where TP, FN, FP, and TN are the counts of true positives, false negatives, false positives, and true negatives, respectively. The AUC is then calculated as the integral of the ROC curve:
$$ \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\big(\mathrm{FPR}^{-1}(x)\big)\, dx $$
In practice, AUC is often approximated using the trapezoidal rule or by counting concordant pairs.
  • Symbol Explanation:
    • TP: Number of true positive predictions.
    • FN: Number of false negative predictions.
    • FP: Number of false positive predictions.
    • TN: Number of true negative predictions.
    • TPR: True Positive Rate, also known as sensitivity or recall.
    • FPR: False Positive Rate.
    • $x$: Variable representing the FPR on the x-axis of the ROC curve.
    • $\mathrm{FPR}^{-1}(x)$: The inverse function of FPR.

5.2.2. Normalized Mutual Information (NMI)

  • Conceptual Definition: NMI is used to measure the discriminability scalability of semantic representations. It quantifies the mutual dependence between two clusterings or partitions of a dataset, normalized to lie between 0 and 1. In this context, it measures the mutual information between the quantized representation $Q(r)$ and the supervised label $Y$ of the downstream task. A higher NMI indicates that the semantic representation is more discriminative, i.e., it better captures information relevant to the labels.
  • Mathematical Formula:
$$ \mathrm{NMI}(U, V) = \frac{I(U;V)}{\sqrt{H(U)\,H(V)}} $$
where:
    • $U$: A set of clusters (e.g., derived from the semantic representation using K-means).
    • $V$: The ground truth labels (e.g., the supervised labels in the downstream task).
    • $I(U;V)$: The Mutual Information between $U$ and $V$.
It is defined as:
$$ I(U;V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(u_i, v_j) \log \left( \frac{P(u_i, v_j)}{P(u_i)\,P(v_j)} \right) $$
    • $H(U)$: The entropy of clustering $U$, defined as $H(U) = -\sum_{i=1}^{|U|} P(u_i) \log P(u_i)$.
    • $H(V)$: The entropy of clustering $V$, defined as $H(V) = -\sum_{j=1}^{|V|} P(v_j) \log P(v_j)$.
    • $P(u_i, v_j)$: Joint probability that an item belongs to cluster $u_i$ and has label $v_j$.
    • $P(u_i)$: Probability that an item belongs to cluster $u_i$.
    • $P(v_j)$: Probability that an item has label $v_j$.
  • Symbol Explanation:
    • $U$: The set of clusters obtained from the semantic representation.
    • $V$: The set of ground truth labels.
    • $I(U;V)$: Mutual Information between $U$ and $V$.
    • $H(U), H(V)$: Entropies of $U$ and $V$.
    • $P(\cdot)$: Probability distribution.
    • $u_i, v_j$: Specific clusters/labels.
    • $\log$: Natural logarithm.

5.2.3. Singular Spectrum for Dimension Robustness

  • Conceptual Definition: The singular spectrum is used to quantify the dimension robustness scalability of semantic representations. By plotting the singular values of the semantic representation matrix, one can observe how much information is captured by each principal component. A robust representation should have high top singular values (indicating strong principal components) without suffering from dimensional collapse (where the long-tail singular values diminish sharply, meaning information is concentrated in very few dimensions).
  • Mathematical Formula: For a matrix $A$ (representing the semantic representations, where rows are items and columns are dimensions), the Singular Value Decomposition (SVD) is given by:
$$ A = U \Sigma V^T $$
where:
    • $U$: An $m \times m$ orthogonal matrix whose columns are the left singular vectors.
    • $\Sigma$: An $m \times n$ diagonal matrix containing the singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ on its diagonal, where $r \le \min(m, n)$ is the rank of $A$; all other entries are zero.
    • $V$: An $n \times n$ orthogonal matrix whose columns are the right singular vectors.
    The singular spectrum is a plot of these singular values $\sigma_k$ against their index $k$.
  • Symbol Explanation:
    • $A$: The matrix of semantic representations.
    • $U$: Left singular vectors.
    • $\Sigma$: Diagonal matrix of singular values.
    • $V^T$: Transpose of the right singular vectors.
    • $\sigma_k$: The $k$-th singular value.
    • $m, n$: Dimensions of the matrix $A$.
    • $r$: Rank of the matrix $A$.

5.2.4. Pearson Correlation Coefficient

  • Conceptual Definition: The Pearson correlation coefficient is used to analyze the correlation between different semantic representations. It measures the linear relationship between two variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear correlation. In this context, it helps quantify how redundant or independent multiple semantic ID embeddings are.
  • Mathematical Formula: For two semantic representations $e_{\mathrm{sid}_i}$ and $e_{\mathrm{sid}_j}$ observed over $n$ samples (items), the Pearson correlation coefficient $r_{ij}$ is calculated as:
$$ r_{ij} = \frac{\sum_{k=1}^{n} \left( e_{\mathrm{sid}_i}^{k} - \bar{e}_{\mathrm{sid}_i} \right)\left( e_{\mathrm{sid}_j}^{k} - \bar{e}_{\mathrm{sid}_j} \right)}{\sqrt{\sum_{k=1}^{n} \left( e_{\mathrm{sid}_i}^{k} - \bar{e}_{\mathrm{sid}_i} \right)^{2}}\,\sqrt{\sum_{k=1}^{n} \left( e_{\mathrm{sid}_j}^{k} - \bar{e}_{\mathrm{sid}_j} \right)^{2}}} $$
The paper specifies using the dot product when the multiplied quantities are embedding vectors, i.e., the inner product between the $k$-th samples of $e_{\mathrm{sid}_i}$ and $e_{\mathrm{sid}_j}$; as written, the formula treats $e_{\mathrm{sid}_i}^{k}$ as a scalar value of the $i$-th semantic representation for the $k$-th item. In either reading, it yields the correlation coefficient between the $n$ observations of $e_{\mathrm{sid}_i}$ and $e_{\mathrm{sid}_j}$.
  • Symbol Explanation:
    • $r_{ij}$: Pearson correlation coefficient between semantic representations $i$ and $j$.
    • $n$: Total number of samples (items).
    • $e_{\mathrm{sid}_i}^{k}$: The value of the $i$-th semantic representation for the $k$-th item.
    • $\bar{e}_{\mathrm{sid}_i}$: The mean value of the $i$-th semantic representation across all $n$ items.
    • $\sum$: Summation over all $n$ samples.
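To illustrate how such a correlation analysis (cf. Figure 7 discussed later) might be carried out, the sketch below computes pairwise Pearson correlations between several semantic ID embedding tables with NumPy; treating each flattened embedding table as one variable is an assumption made for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
M, num_items, dim = 7, 1000, 32
# One learned embedding per Semantic ID, indexed per item (illustrative random values).
sid_embs = [rng.normal(size=(num_items, dim)) for _ in range(M)]

# Flatten each representation over items and dimensions, then take pairwise Pearson
# correlations; high off-diagonal values would indicate redundant representations.
flat = np.stack([e.reshape(-1) for e in sid_embs])   # (M, num_items * dim)
corr = np.corrcoef(flat)                             # (M, M) correlation matrix
print(np.round(corr, 2))
```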

5.3. Baselines

The paper compares MoC against two main baseline approaches for scaling semantic representation:

  1. Single Codebook with Multi-Embeddings (ME):

    • Concept: Inspired by Guo et al. (2023), this method uses a single VQ-VAE codebook to generate one Semantic ID per item. However, instead of using just one embedding for this ID in the recommendation model, it assigns $M$ independent embeddings $\{ e_{\mathrm{sid}}^1, \dots, e_{\mathrm{sid}}^M \}$ to the same Semantic ID.
    • Why representative: It's a straightforward way to "scale up" by increasing the number of embedding vectors, even if they all derive from the same underlying discrete code.
    • Limitation (as shown by paper): Due to deriving from the same Semantic ID, these multiple embeddings primarily introduce redundancy, leading to unstable optimization and ineffective scalability, often with performance degradation.
  2. RQ-VAE (Residual Quantization VAE):

    • Concept: Based on Lee et al. (2022), RQ-VAE applies quantization on residuals at multiple hierarchical levels, generating multiple Semantic IDs $\{ x_{\mathrm{sid}_1}, \dots, x_{\mathrm{sid}_M} \}$, where $x_{\mathrm{sid}_i}$ comes from the $i$-th hierarchical codebook. Each of these IDs gets its own embedding $e_{\mathrm{sid}_i}$.

    • Why representative: RQ-VAE is a common practice for deriving hierarchical Semantic IDs in recent recommendation systems (Rajput et al., 2024; Jin et al., 2023). It represents a natural way to scale up by using more detailed, hierarchically obtained information.

    • Limitation (as shown by paper): The hierarchical nature makes the Semantic IDs highly dependent and entangled. Higher-level IDs often contain diminishing or less informative details, leading to its discriminability not consistently increasing with scaling factor and sometimes hindering generalization.

      In addition to these semantic scaling baselines, the evaluation is conducted on four representative downstream Click-Through Rate (CTR) prediction models, which serve as backbones for integrating the semantic representations:

  • DeepFM (Guo et al., 2017): Combines a Factorization Machine (FM) for low-order feature interactions with a deep neural network (DNN) for high-order interactions.
  • DeepIM (Yu et al., 2020): Deep Interaction Machine, a simple but effective model for high-order feature interactions.
  • AutoInt+ (Song et al., 2019): Uses self-attentive neural networks to automatically learn feature interactions.
  • DCNv2 (Wang et al., 2021): Deep & Cross Network V2, improves upon DCN by better modeling explicit and implicit feature interactions through a cross-network and a deep network.

5.4. Implementation Details

  • Codebook Size: For the VQ-VAE based approaches (MoC, ME, RQ-VAE), the size of each codebook is set to 256. This means each codebook contains 256 distinct code vectors.
  • Latent Representation Dimension: The dimension of the latent representation (the output of the encoder before quantization) is set to 32. This is the dimension of the code vectors in the codebook.
  • Encoder Architecture: The encoder in the indexing stage is a multi-layer perceptron (MLP) with three hidden layers. The layer sizes are 512, 256, and 128, respectively, and ReLU activation is used between layers.
  • Optimizer: The Adam optimizer is used for training all models.
  • Batch Size: The batch size for training is set to 8012.
  • Learning Rate: The learning rate is fixed at 0.001.
  • Trials: All experiments are run across three different random seeds, and the averaged results are reported to ensure robustness and reduce the impact of random initialization.
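As a concrete reference for the downstream fusion module of Section 4.2.3 under these settings (embedding dimension 32), here is a minimal sketch of the bottleneck fusion layer; the GeLU between the two projections follows the Figure 6 description, while the bottleneck width and field counts are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Implicit fusion: concat -> down-project -> GeLU -> up-project -> residual -> split."""
    def __init__(self, num_fields, emb_dim=32, bottleneck_dim=64):  # bottleneck_dim assumed
        super().__init__()
        total = num_fields * emb_dim
        self.down = nn.Linear(total, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, total)
        self.act = nn.GELU()
        self.num_fields, self.emb_dim = num_fields, emb_dim

    def forward(self, field_embs):                    # list of (batch, emb_dim) tensors
        e = torch.cat(field_embs, dim=1)              # e_concat
        e = e + self.up(self.act(self.down(e)))       # residual bottleneck fusion
        return list(e.split(self.emb_dim, dim=1))     # SPLIT back into per-field embeddings

# 3 original item attributes + 7 Semantic ID embeddings, batch of 16 (illustrative).
fields = [torch.randn(16, 32) for _ in range(3 + 7)]
fused = BottleneckFusion(num_fields=10)(fields)
print(len(fused), fused[0].shape)
```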

6. Results & Analysis

This section delves into the experimental results presented in the paper, analyzing the performance of Mixture-of-Codes (MoC) against baseline methods across various metrics and scaling factors.

6.1. Core Results Analysis

6.1.1. Overall Performance (Test AUC)

The following are the results from Table 1 of the original paper:

| Model | Method | Toys 1x | Toys 2x | Toys 3x | Toys 7x | Beauty 1x | Beauty 2x | Beauty 3x | Beauty 7x | Sports 1x | Sports 2x | Sports 3x | Sports 7x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepFM | ME | – | 0.7403 | 0.7397 | 0.7390 | – | 0.6651 | 0.6649 | 0.6638 | – | 0.6942 | 0.6928 | 0.6917 |
| DeepFM | RQ-VAE | 0.7406 | 0.7409 | 0.7405 | 0.7398 | 0.6651 | 0.6676 | 0.6670 | 0.6687 | 0.6931 | 0.6945 | 0.6932 | 0.6937 |
| DeepFM | MoC | – | 0.7408 | 0.7415 | 0.7418 | – | 0.6656 | 0.6674 | 0.6681 | – | 0.6931 | 0.6936 | 0.6953 |
| DeepIM | ME | – | 0.7396 | 0.7404 | 0.7395 | – | 0.6620 | 0.6635 | 0.6637 | – | 0.6907 | 0.6910 | 0.6925 |
| DeepIM | RQ-VAE | 0.7404 | 0.7401 | 0.7403 | 0.7404 | 0.6648 | 0.6651 | 0.6660 | 0.6678 | 0.6931 | 0.6918 | 0.6925 | 0.6938 |
| DeepIM | MoC | – | 0.7401 | 0.7417 | 0.7422 | – | 0.6641 | 0.6668 | 0.6691 | – | 0.6927 | 0.6935 | 0.6942 |
| AutoInt+ | ME | – | 0.7430 | 0.7419 | 0.7414 | – | 0.6648 | 0.6630 | 0.6641 | – | 0.6935 | 0.6930 | 0.6929 |
| AutoInt+ | RQ-VAE | 0.7415 | 0.7430 | 0.7419 | 0.7418 | 0.6630 | 0.6672 | 0.6642 | 0.6677 | 0.6911 | 0.6934 | 0.6933 | 0.6915 |
| AutoInt+ | MoC | – | 0.7414 | 0.7420 | 0.7447 | – | 0.6661 | 0.6651 | 0.6689 | – | 0.6939 | 0.6926 | 0.6927 |
| DCNv2 | ME | – | 0.7445 | 0.7449 | 0.7459 | – | 0.6717 | 0.6716 | 0.6722 | – | 0.6955 | 0.6963 | 0.6976 |
| DCNv2 | RQ-VAE | 0.7445 | 0.7457 | 0.7457 | 0.7469 | 0.6701 | 0.6719 | 0.6720 | 0.6726 | 0.6962 | 0.6965 | 0.6966 | 0.6979 |
| DCNv2 | MoC | – | 0.7462 | 0.7458 | 0.7474 | – | 0.6714 | 0.6730 | 0.6729 | – | 0.6970 | 0.6972 | 0.6989 |

A dash (–) indicates that no value is reported for that setting; ME and MoC results are listed for the 2x, 3x, and 7x scaling factors only.

Analysis of Table 1 (Test AUC results):

  • MoC's Superiority at Higher Scaling Factors: Across all four backbone models (DeepFM, DeepIM, AutoInt+, DCNv2) and all three datasets (Toys, Beauty, Sports), MoC consistently achieves the best performance (highest AUC) at the highest scaling factor (7x). For instance, with DCNv2 on the Toys dataset, MoC achieves 0.7474 AUC, surpassing RQ-VAE's 0.7469. On the Beauty dataset, MoC with DCNv2 reaches 0.6729. On the Sports dataset, MoC with DCNv2 achieves 0.6989.
  • Scalability Law for MoC: The results demonstrate that MoC generally adheres to a scaling law: its performance tends to increase as the scaling factor (number of semantic representations) increases from 1x to 7x. This is evident in most MoC rows, where AUC values generally rise or remain stable with higher factors. This validates the core hypothesis that scaling up semantic representation can be effective if done correctly.
  • ME's Performance Degradation: The Multi-Embedding (ME) baseline often shows performance degradation with increasing scaling factors. For example, DeepFM with ME on Toys sees AUC drop from 0.7403 (2x) to 0.7390 (7x). This confirms the paper's argument that simply adding more embeddings for the same semantic ID leads to redundancy and hinders performance.
  • RQ-VAE's Limited Gains: RQ-VAE generally shows slight performance gains or fluctuations as the scaling factor increases, but these gains are often less pronounced and sometimes less consistent than MoC's. For example, DeepFM with RQ-VAE on Toys goes from 0.7406 (1x) to 0.7398 (7x), showing a slight decrease. On Beauty, it shows some gains, e.g., 0.6651 (1x) to 0.6687 (7x) for DeepFM. This supports the claim that RQ-VAE's hierarchical Semantic IDs might not always provide sufficiently informative additional details.
  • Relative Gains: The paper highlights specific gains: with a 7x scaling factor, MoC surpasses RQ-VAE by 0.20%, 0.18%, 0.29%, and 0.05% (absolute AUC differences, i.e., percentage points) for DeepFM, DeepIM, AutoInt+, and DCNv2, respectively, on the Toys dataset. This indicates that while the absolute AUC differences might seem small, they are consistent and meaningful for recommendation system performance.

6.1.2. Discriminability Scalability

The paper evaluates discriminability using Normalized Mutual Information (NMI) between the semantic representation and the supervised labels.

  • Initial Findings (Figure 2, Figure 3b):

    • Figure 2 from the original paper illustrates the scalability on discriminability of various methods.

      (Figure 2 contains three panels: NMI of the scaled embeddings versus the scaling factor, NMI versus the number of clusters, and a comparison against the RQ-VAE SID-1 embedding.)

    Figure 2: Scalability on discriminability of various methods.

    • The paper observes that ME does not show increased discriminability with scaling; in fact, it may slightly decrease. This is attributed to the redundancy of extra embeddings associated with the same Semantic ID.

    • RQ-VAE also shows inconsistent discriminability gains. Figure 3 from the original paper, specifically part (b), focuses on the NMI of Semantic Representation with 7x scaling factor.

      (Figure 3 contains two panels: (a) NMI of the Semantic IDs and (b) NMI of the semantic representations, each comparing the different methods.)

    Figure 3: Normalized Mutual Information (NMI) of Semantic Representation with 7x scaling factor.

    • Figure 3b shows that the discriminability of MoC (blue line) consistently increases as the number of clusters (which implicitly relates to capturing more distinct features) grows. Its NMI value is higher than ME and RQ-VAE. The paper states that MoC at 1x is comparable to RQ-VAE and much higher than ME, and it continuously improves with larger scaling factors (up to 7x). This confirms MoC's superior discriminability scalability.
    • An interesting observation from Figure 3a (NMI of Semantic ID) is that for RQ-VAE, the lowest level Semantic ID (SID 1) contains more information than the flattened embedding of all Semantic IDs, suggesting higher-level SIDs may hinder generalization. In contrast, MoC performs uniformly well across various Semantic IDs.

6.1.3. Dimension Robustness Scalability

Dimension robustness is assessed by analyzing the singular spectrum of the semantic representations. A robust representation should have high top singular values and a long-tail singular spectrum that doesn't diminish abruptly, indicating that information is well-distributed and not collapsing into a few dimensions.
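
A minimal sketch of this diagnostic is given below: compute the singular values of the (items x dimensions) representation matrix and inspect how quickly the tail decays. The centering step and matrix shapes are assumptions about the setup, not taken from the paper.

```python
# Minimal sketch (illustrative): singular spectrum of a semantic-representation
# matrix as a proxy for dimension robustness.
import numpy as np

def singular_spectrum(representation: np.ndarray) -> np.ndarray:
    """Return singular values of the (num_items, dim) matrix, largest first,
    after centering each dimension."""
    centered = representation - representation.mean(axis=0, keepdims=True)
    return np.linalg.svd(centered, compute_uv=False)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reps = rng.normal(size=(2000, 128))  # hypothetical semantic embeddings
    spectrum = singular_spectrum(reps)
    # A robust representation keeps sizable values well into the tail;
    # dimensional collapse shows up as the tail plunging toward zero.
    print(spectrum[:5], spectrum[-5:])
```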

  • Comparison with Baselines (Figure 4):

    (Figure 4 panels: singular spectra of the semantic representation at 3x, 5x, and 7x scaling factors.)

    Figure 4: Scalability of Dimension Robustness regarding different scaling factors. Each figure presents the singular spectrum of the semantic representation at the given scaling factor.

    • RQ-VAE: Figure 4 shows that RQ-VAE's long-tail singular values do not diminish rapidly, indicating it doesn't suffer from severe dimensional collapse. However, its top singular values are not as large as ME's, suggesting less concentrated signal in the primary components.

    • ME: ME exhibits the largest top singular values, implying strong primary components. However, it suffers from dimensional collapse, as its long-tail singular values diminish suddenly after a certain index (e.g., around index 250 for 5x and 275 for 7x settings). This implies that while ME captures strong primary signals, it quickly loses information in other dimensions.

    • MoC: The paper states that MoC's singular spectrum (not directly shown in Figure 4 but described in Section 4.3 and later shown for MoC Fusion in Figure 10) indicates higher values on the low-index singular values compared to RQ-VAE, while its high-index singular values are robust and do not diminish like ME. This implies MoC balances capturing strong principal components with distributing information robustly across dimensions, preventing collapse.

      This comprehensive analysis leads to Finding 1: Existing methods like ME and RQ-VAE are not scalable semantic representations for recommendation regarding discriminability and dimension robustness. And Finding 2: Our proposed MoC successfully enables scalable Semantic Representation regarding both discriminability and dimension robustness.

6.1.4. Correlation Between Multiple Representations

The correlation between different semantic representations helps understand their independence and potential for redundancy. The Pearson correlation coefficient is used for this.
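
As a concrete illustration, the sketch below computes a K x K Pearson correlation matrix across K representation matrices by flattening each one. This mirrors the kind of heatmap shown in Figure 7, though the exact correlation protocol used in the paper may differ.

```python
# Minimal sketch (illustrative): Pearson correlation between multiple semantic
# representations, e.g., the per-item embeddings produced by different codebooks.
import numpy as np

def representation_correlation(reps: list[np.ndarray]) -> np.ndarray:
    """Given K representation matrices of identical shape (num_items, dim),
    return the K x K matrix of Pearson correlations between their flattened values."""
    flat = np.stack([r.reshape(-1) for r in reps])  # (K, num_items * dim)
    return np.corrcoef(flat)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reps = [rng.normal(size=(500, 16)) for _ in range(7)]  # hypothetical 7x setting
    # Low off-diagonal values indicate complementary rather than redundant codes.
    print(np.round(representation_correlation(reps), 2))
```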

  • Analysis (Figure 7):


    Figure 7: Correlation analysis of different methods.

    • The heatmaps in Figure 7 show the correlation matrices.
    • ME: The correlation matrix for ME (top) shows high correlation values between many different semantic representations (off-diagonal cells are often bright). This confirms that because ME uses multiple embeddings for the same underlying Semantic ID, these embeddings are highly redundant. This strong correlation leads to unstable optimization and ineffective scalability, as argued by the authors.
    • RQ-VAE and MoC: Both RQ-VAE (middle) and MoC (bottom) exhibit low correlation scores in their off-diagonal cells, indicating that their different Semantic IDs are relatively independent. This is desirable as it suggests each Semantic ID (or codebook in MoC) captures distinct or complementary information, which is crucial for effective scaling. MoC's independence is particularly important for its superior performance, as it ensures each parallel codebook contributes unique semantic aspects.

6.1.5. Detailed Comparison with RQ-VAE

To further highlight MoC's advantages, a more detailed comparison with RQ-VAE is provided, focusing on how performance changes when adding Semantic IDs at different levels or in increasing numbers.

  • Adding Single ID at Each Level (Figure 8a):

    (Figure 8 panels: test AUC when adding a single Semantic ID at each level, and test AUC when adding multiple Semantic IDs under increasing scaling factors.)

    Figure 8: More comparison results with RQ-VAE.

    • Figure 8a shows that for RQ-VAE, adding a single Semantic ID from a low level provides a larger performance gain than adding one from a high level. This supports the argument that higher-level Semantic IDs in RQ-VAE are less informative or redundant.
    • In contrast, MoC performs more uniformly across various Semantic IDs, suggesting that each of its parallel codebooks contributes valuable, non-redundant information.
  • Adding Multiple IDs Starting from Lowest Level (Figure 8b):

    • Figure 8b shows that when equipping multiple Semantic IDs starting from the lowest level (i.e., scaling up the number of SIDs), MoC gains significant improvements over RQ-VAE across different scaling factors. This demonstrates MoC's better generalization and effective scalability when more semantic representations are introduced.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Ablation on MoC Fusion

An ablation study is conducted to verify the importance of the implicit fusion module in the downstream stage, specifically the bottleneck network.
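
The paper describes the fusion module as a bottleneck network. Below is a minimal PyTorch-style sketch of such a module, assuming the downstream model concatenates the K semantic-ID embeddings of an item, compresses them through a narrow hidden layer, and projects back to mixed embeddings. The class name, layer sizes, and activation are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (illustrative): bottleneck-style fusion of K semantic-ID embeddings.
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    def __init__(self, num_codes: int, embed_dim: int, bottleneck_dim: int):
        super().__init__()
        in_dim = num_codes * embed_dim
        self.down = nn.Linear(in_dim, bottleneck_dim)  # compress concatenated codes
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, in_dim)    # expand back to original size

    def forward(self, code_embeds: torch.Tensor) -> torch.Tensor:
        # code_embeds: (batch, num_codes, embed_dim)
        x = code_embeds.flatten(start_dim=1)           # (batch, num_codes * embed_dim)
        fused = self.up(self.act(self.down(x)))
        return fused.view_as(code_embeds)              # mixed semantic embeddings

if __name__ == "__main__":
    # Example: 7 parallel codebooks, 16-dim embeddings, 32-dim bottleneck.
    fusion = BottleneckFusion(num_codes=7, embed_dim=16, bottleneck_dim=32)
    out = fusion(torch.randn(4, 7, 16))
    print(out.shape)  # torch.Size([4, 7, 16])
```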

The following are the results from Table 2 of the original paper:

| Method | 2x w/o fusion | 2x w/ fusion | 3x w/o fusion | 3x w/ fusion | 7x w/o fusion | 7x w/ fusion |
| --- | --- | --- | --- | --- | --- | --- |
| RQ-VAE | 0.7409 | 0.7414 | 0.7405 | 0.7407 | 0.7398 | 0.7413 |
| MoC | 0.7409 | 0.7408 | 0.7404 | 0.7415 | 0.7416 | 0.7418 |

Analysis of Table 2 (Ablation on MoC Fusion - Test AUC on Toys with DeepFM):

  • Benefits for Both Methods: The results show that both RQ-VAE and MoC benefit from the fusion module (w/ vs. w/o). For RQ-VAE, integrating the fusion module consistently improves performance across all scaling factors (e.g., from 0.7398 to 0.7413 at 7x). For MoC, the fusion module also generally leads to improvements, especially at higher scaling factors (e.g., from 0.7416 to 0.7418 at 7x). This indicates that explicitly mixing the semantic ID embeddings, even those from RQ-VAE, helps the downstream model.
  • MoC Fusion's Impact: For MoC, while the gain at 2x is negligible, at 3x and 7x, MoC w/ fusion consistently outperforms MoC w/o fusion. This highlights the importance of the implicit fusion mechanism for MoC to fully leverage the multiple independent semantic representations and achieve its best performance, especially as scaling factors increase.

6.2.2. Impact of Fusion on Discriminability and Dimension Robustness

  • Discriminability Scalability with Fusion (Figure 9):


    Figure 9: Discriminability scalability of MoC Fusion.

    • Figure 9 plots NMI against the number of clusters for MoC with and without the fusion module.
    • The plot shows that the fusion module enhances the overall discriminability scalability of MoC. The NMI curve for MoC with fusion (cyan line) is consistently higher than MoC without fusion (black line) across various cluster numbers. This suggests that mixing the features using the bottleneck network helps the semantic representations become more discriminative, allowing them to better distinguish between different classes/items.
  • Dimension Robustness Scalability with Fusion (Figure 10):


    Figure 10: Dimension robustness scalability of MoC Fusion.

    • Figure 10 displays the singular spectrum of MoC with and without the fusion module. The results are striking: the fusion module amplifies the principal components of the semantic representation, resulting in significantly higher top singular values (cyan line for 'w/' fusion is much higher at low indices). This means the fused representation captures stronger, more dominant modes of variation.

    • Additionally, the long-tail part of the singular spectrum for MoC with fusion remains robust, staying close to that of RQ-VAE and not suffering from dimension collapse like ME. This indicates that while the fusion boosts the primary signals, it does not sacrifice the distributed information across other dimensions, leading to a highly dimension-robust representation.

      These ablation results strongly confirm that the fusion module is a critical component of MoC, contributing not only to improved recommendation performance but also enhancing the fundamental properties of discriminability and dimension robustness of the learned semantic representations.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper rigorously investigates the crucial problem of scalable semantic representation for recommendation systems, particularly in the context of leveraging knowledge from Large Language Models (LLMs). It identifies a significant challenge: the inherent information loss and dimension collapse when high-dimensional LLM embeddings are compressed to fit the typically low-dimensional requirements of recommendation systems.

The core contribution is the introduction of Mixture-of-Codes (MoC), a novel two-stage approach. In the indexing stage, MoC builds upon VQ-VAE by constructing multiple independent parallel codebooks to capture diverse and complementary semantic information from LLM embeddings. This approach stands in contrast to existing methods like Multi-Embedding (ME) (which creates redundancy) and RQ-VAE (which suffers from hierarchical dependencies and diminishing returns). In the downstream recommendation stage, MoC employs an implicit fusion module (a bottleneck network) to effectively combine the embeddings derived from these multiple codebooks.
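
To make the contrast with RQ-VAE's residual quantization concrete, the sketch below shows one plausible way N independent codebooks could assign Semantic IDs in parallel. The per-codebook projection, class name, and sizes are assumptions for illustration rather than the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): assigning an item's
# LLM-derived latent to N independent codebooks in parallel, in contrast to
# RQ-VAE's sequential residual quantization.
import torch
import torch.nn as nn

class ParallelCodebooks(nn.Module):
    def __init__(self, latent_dim: int, num_codebooks: int,
                 codebook_size: int, code_dim: int):
        super().__init__()
        # One projection and one codebook per semantic "view" of the latent.
        self.projections = nn.ModuleList(
            [nn.Linear(latent_dim, code_dim) for _ in range(num_codebooks)])
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, code_dim))
             for _ in range(num_codebooks)])

    @torch.no_grad()
    def assign(self, latent: torch.Tensor) -> torch.Tensor:
        # latent: (batch, latent_dim) -> codes: (batch, num_codebooks)
        codes = []
        for proj, book in zip(self.projections, self.codebooks):
            z = proj(latent)                   # (batch, code_dim)
            dists = torch.cdist(z, book)       # (batch, codebook_size)
            codes.append(dists.argmin(dim=1))  # nearest codeword index
        return torch.stack(codes, dim=1)

if __name__ == "__main__":
    quantizer = ParallelCodebooks(latent_dim=64, num_codebooks=7,
                                  codebook_size=256, code_dim=16)
    sids = quantizer.assign(torch.randn(4, 64))
    print(sids.shape)  # torch.Size([4, 7]) -- seven parallel Semantic IDs per item
```

Because each codebook quantizes the latent independently rather than quantizing the residual left over by the previous level, no codebook is forced into a subordinate, less informative role, which is the intuition behind MoC's more uniform contribution across Semantic IDs.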

Through extensive experiments across multiple datasets and four state-of-the-art recommendation models, MoC consistently demonstrates:

  1. Superior Discriminability Scalability: As the number of semantic representations increases, MoC's ability to distinguish between items improves, measured by Normalized Mutual Information (NMI).

  2. Enhanced Dimension Robustness: MoC maintains robust singular spectra, exhibiting higher top singular values without suffering from dimensional collapse in its long-tail components.

  3. Best Scale-up Performance: MoC achieves the highest AUC scores in recommendation tasks, showing consistent performance gains with increasing scaling factors, effectively leveraging the rich semantic knowledge from LLMs.

    The paper successfully shows that by carefully designing how multiple discrete semantic representations are generated and integrated, it is possible to achieve truly scalable semantic representation, leading to significant improvements in recommendation performance.

7.2. Limitations & Future Work

The paper does not explicitly list a "Limitations" section, but some can be inferred from the problem statement and the proposed solution:

  • Computational Cost of Multiple Codebooks: While effective, maintaining and training multiple independent codebooks in the indexing stage might incur higher computational costs and memory usage compared to a single codebook or a simple hierarchical structure, especially as the number of parallel codebooks grows. The paper focuses on scalability of representation, but not necessarily efficiency.

  • Fixed Number of Codebooks: The current MoC model seems to use a fixed number of codebooks (N). Exploring adaptive mechanisms to dynamically determine the optimal number of codebooks based on dataset complexity or LLM embedding characteristics could be a direction.

  • Generality of LLM Embeddings: The paper relies on LLM2Vec with LLaMA3 for LLM embeddings. While robust, the generalizability of MoC to other LLM embedding models or different modalities (e.g., visual features) could be further explored.

  • Fusion Module Complexity: The implicit fusion module is a simple bottleneck network. While effective, more sophisticated fusion mechanisms, perhaps inspired by Mixture-of-Experts (MoE) with specialized routing for discrete codes, could be explored to further optimize interaction between different semantic embeddings.

  • Theoretical Guarantees: While empirical results are strong, providing more theoretical guarantees on why parallel codebooks are inherently more discriminative or dimensionally robust than hierarchical ones could strengthen the work.

    Potential future research directions implicitly suggested or arising from this work include:

  • Investigating more efficient training strategies for multiple codebooks.

  • Developing adaptive methods for determining the optimal scaling factor and number of codebooks.

  • Exploring hybrid fusion mechanisms that combine implicit bottleneck fusion with more explicit, adaptive routing strategies.

  • Applying MoC's principles to other domains where high-dimensional embeddings need to be represented scalably in lower-dimensional systems.

  • Further studying the interplay between the dimensionality of LLM embeddings, the latent dimension of the VQ-VAE, and the overall scalability.

7.3. Personal Insights & Critique

The paper provides a refreshing and rigorous approach to a critical problem in LLM-augmented recommendation systems. The authors' clear articulation of the dimension mismatch and subsequent information loss as a core challenge is highly valuable. The empirical demonstration that naive scaling methods (ME, RQ-VAE) fail to preserve discriminability and robustness is a crucial finding that sets the stage for their contribution.

The Mixture-of-Codes (MoC) proposal is intuitively appealing. The idea of using multiple independent parallel codebooks to capture complementary aspects of a rich LLM embedding, rather than relying on a single or hierarchically entangled representation, makes logical sense. It's akin to having multiple experts, each specializing in a different facet of the item's semantic profile, and then intelligently combining their insights. The implicit fusion module is a clever way to integrate these insights without the complexities of explicit gating often seen in MoE, making it practical for real-world recommendation systems.

A key strength of this paper lies in its quantitative analysis of scalability. By introducing and leveraging metrics like NMI for discriminability and singular spectrum for dimension robustness, the authors move beyond mere performance metrics (like AUC) to deeply understand why their method scales better. The singular spectrum analysis, in particular, provides compelling evidence of MoC's superior ability to distribute information robustly across dimensions without suffering from collapse.

From a critique perspective, while the paper excels at empirical validation, a deeper theoretical dive into why parallel codebooks guarantee more independent information capture compared to hierarchical ones could enhance the understanding. Also, the computational overhead of multiple codebooks and the fusion network, while likely manageable, could be an important consideration for extremely large-scale industrial applications, and a discussion on this would be beneficial.

Overall, this paper offers a significant step forward in effectively integrating the power of LLMs into recommendation systems. Its methodology is sound, and its analysis is thorough, providing clear guidance for future research in scalable semantic representation. The insights gained regarding discriminability and dimension robustness are broadly applicable to other fields dealing with high-dimensional embedding compression.
