
Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs

Published: 11/08/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study introduces MME-SID, a novel framework for sequential recommendation that uses large language models and multimodal embeddings to address embedding collapse and catastrophic forgetting, enhancing recommendation performance through a multimodal residual quantized variational autoencoder (MM-RQ-VAE), code-embedding initialization, and frequency-aware LoRA fine-tuning.

Abstract

Sequential recommendation (SR) aims to capture users’ dynamic interests and sequential patterns based on their historical interactions. Recently, the powerful capabilities of large language models (LLMs) have driven their adoption in SR. However, we identify two critical challenges in existing LLM-based SR methods: 1) embedding collapse when incorporating pre-trained collaborative embeddings and 2) catastrophic forgetting of quantized embeddings when utilizing semantic IDs. These issues dampen the model scalability and lead to suboptimal recommendation performance. Therefore, based on LLMs like Llama3-8B-instruct, we introduce a novel SR framework named MME-SID, which integrates multimodal embeddings and quantized embeddings to mitigate embedding collapse. Additionally, we propose a Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) with maximum mean discrepancy as the reconstruction loss and contrastive learning for alignment, which effectively preserve intra-modal distance information and capture inter-modal correlations, respectively. To further alleviate catastrophic forgetting, we initialize the model with the trained multimodal code embeddings. Finally, we fine-tune the LLM efficiently using LoRA in a multimodal frequency-aware fusion manner. Extensive experiments on three public datasets validate the superior performance of MME-SID thanks to its capability to mitigate embedding collapse and catastrophic forgetting. The implementation code and datasets are publicly available for reproduction.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs

1.2. Authors

Yuhao Wang (City University of Hong Kong), Junwei Pan (Tencent Inc.), Xinhang Li (Tsinghua University), Maolin Wang (City University of Hong Kong), Yuan Wang (Tencent Inc.), Yue Liu (Tencent Inc.), Dapeng Liu (Tencent Inc.), Jie Jiang (Tencent Inc.), Xiangyu Zhao (City University of Hong Kong).

1.3. Journal/Conference

Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25). CIKM is a highly reputable and influential conference in the fields of information retrieval, knowledge management, and database systems. Its long history and rigorous review process make it a significant venue for publishing research in these areas, including recommender systems.

1.4. Publication Year

2025

1.5. Abstract

Sequential recommendation (SR) aims to capture users’ dynamic interests and sequential patterns based on their historical interactions. Recently, the powerful capabilities of large language models (LLMs) have driven their adoption in SR. However, the authors identify two critical challenges in existing LLM-based SR methods: 1) embedding collapse when incorporating pre-trained collaborative embeddings and 2) catastrophic forgetting of quantized embeddings when utilizing semantic IDs. These issues dampen model scalability and lead to suboptimal recommendation performance. To address these, the paper introduces a novel SR framework called MME-SID, based on LLMs like Llama3-8B-instruct. MME-SID integrates multimodal embeddings and quantized embeddings to mitigate embedding collapse. Additionally, it proposes a Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) with maximum mean discrepancy (MMD) as the reconstruction loss and contrastive learning for alignment, which effectively preserves intra-modal distance information and captures inter-modal correlations, respectively. To further alleviate catastrophic forgetting, the model is initialized with the trained multimodal code embeddings. Finally, the LLM is fine-tuned efficiently using LoRA in a multimodal frequency-aware fusion manner. Extensive experiments on three public datasets validate the superior performance of MME-SID due to its capability to mitigate embedding collapse and catastrophic forgetting.

This paper is published in the proceedings of CIKM '25.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is enhancing sequential recommendation (SR) by leveraging the powerful capabilities of Large Language Models (LLMs). SR aims to predict users' next interactions by understanding their dynamic interests and sequential behavior patterns from historical data. This is crucial for web applications like e-commerce and video platforms to drive engagement and profit.

While LLMs have shown great promise for SR due to their ability to comprehend semantic data, the authors identify two critical challenges in existing LLM-based SR (LLM4SR) methods:

  1. Embedding Collapse: This phenomenon, also known as dimensional collapse, occurs when item embeddings, especially pre-trained collaborative embeddings (which represent items based on user-item interaction data), are mapped into the high-dimensional LLM token space. This mapping can cause the embeddings to occupy only a low-dimensional subspace, leading to inefficient use of model capacity and limited scalability. The problem is important because it limits the richness and expressiveness of item representations within the LLM, hindering recommendation accuracy.

  2. Catastrophic Forgetting: This happens when LLM4SR methods utilize semantic IDs (discrete tokens representing item features) for items. Existing approaches often discard the learned code embeddings (the vector representations of these semantic IDs) after training a quantization model and then train new embeddings from scratch for downstream recommendation tasks. This results in a significant loss of previously learned knowledge, particularly the partial order information of distance between item embeddings, thus reducing performance.

    The paper's innovative idea is to address these two challenges simultaneously by carefully integrating multimodal embeddings (collaborative, textual, visual) and semantic IDs in a way that preserves crucial information.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of LLM4SR:

  1. First to Identify and Systematically Address Key Challenges: It is the first work to specifically identify and systematically tackle the embedding collapse and catastrophic forgetting issues within LLM4SR. This provides a new perspective on improving LLM performance in recommendation tasks.

  2. Novel Framework MME-SID: The paper proposes MME-SID, a novel framework that integrates multimodal embeddings and quantized embeddings (derived from semantic IDs) to mitigate embedding collapse. This approach leverages the rich information from different modalities (collaborative, textual, visual) to create more informative and less collapsed item representations for the LLM.

  3. Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE): A new quantization model, MM-RQ-VAE, is introduced. This model is designed to better preserve intra-modal distance information and capture inter-modal correlations. It achieves this through:

    • Using Maximum Mean Discrepancy (MMD) as the reconstruction loss, which explicitly aims to preserve the distribution of distances between original and quantized embeddings.
    • Adopting contrastive learning objectives to align quantized embeddings across different modalities (e.g., collaborative with textual and visual), thus learning meaningful cross-modal relationships.
  4. Mitigating Catastrophic Forgetting through Initialization: To combat catastrophic forgetting, MME-SID initializes the embeddings of multimodal semantic IDs using the trained code embeddings from the MM-RQ-VAE. This ensures that previously learned structural and distance information is retained when the LLM is fine-tuned.

  5. Efficient Multimodal Frequency-Aware Fine-tuning: The framework efficiently fine-tunes the LLM using LoRA and incorporates a multimodal frequency-aware fusion module. This module adaptively weighs the importance of different modalities based on item frequency, leading to better recommendation results, especially for cold or warm items.

    The key conclusion is that MME-SID significantly surpasses existing LLM4SR methods on various performance metrics across three public datasets. This superior performance is directly attributed to its effective strategies for mitigating embedding collapse and catastrophic forgetting, thus truly unleashing the potential of LLMs for recommendation. Additionally, the proposed solution offers advantages in inference efficiency and avoids issues like collision commonly faced by generative retrieval methods using semantic IDs.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Sequential Recommendation (SR)

Sequential Recommendation (SR) is a subfield of recommender systems that focuses on predicting a user's next interaction (e.g., next item purchase, next video watched) based on their historical sequence of interactions. Unlike traditional recommender systems that might only consider static preferences or item popularity, SR models aim to capture dynamic user interests and temporal patterns. For example, if a user buys a camera, they might then be interested in camera lenses or tripods.

3.1.2. Large Language Models (LLMs)

Large Language Models (LLMs) are advanced artificial intelligence models, typically based on the transformer architecture, trained on vast amounts of text data. They are capable of understanding, generating, and processing human-like text, performing tasks such as natural language generation, summarization, translation, and question answering. In the context of recommendation, LLMs can process textual item descriptions, user reviews, and even user prompts to understand item semantics and user preferences more deeply. The paper specifically mentions using Llama3-8B-instruct, which is an instruction-tuned variant of the Llama 3 LLM with 8 billion parameters, designed to follow instructions effectively.

3.1.3. Embedding Collapse (Dimensional Collapse)

Embedding collapse, or dimensional collapse, is a phenomenon observed in deep learning models where the learned embeddings (vector representations of entities like items or users) occupy only a low-dimensional subspace within their intended high-dimensional embedding space. This means that despite being represented by high-dimensional vectors, the embeddings effectively behave as if they are in a much lower-dimensional space, leading to a loss of expressiveness and capacity. In recommender systems, if item embeddings collapse, many distinct items might end up with very similar representations, making it difficult for the model to differentiate between them and provide diverse or accurate recommendations.

3.1.4. Catastrophic Forgetting

Catastrophic forgetting (or catastrophic interference) is a common problem in neural networks where learning new information causes the network to forget previously learned information. In the context of LLM4SR and semantic IDs, this refers to the loss of valuable knowledge encoded in code embeddings (vector representations of discrete semantic IDs) when these embeddings are re-initialized and re-trained from scratch for a downstream LLM-based recommendation task. The paper specifically highlights the loss of partial order information of distance, meaning the relative relationships between item embeddings are forgotten.

3.1.5. Semantic IDs and Quantization

Semantic IDs are discrete tokens or codes used to represent items (or users) by capturing their semantic features. Instead of using a simple unique integer ID, an item might be represented by a sequence of semantic IDs (e.g., "electronics", "smartphone", "high-end"). This allows LLMs, which operate on discrete tokens, to directly process item information. Quantization is the process of converting continuous input values (like dense item embeddings) into discrete representations (like semantic IDs). Residual Quantized Variational Autoencoder (RQ-VAE) is a specific type of quantization model that tokenizes and generates semantic IDs in a hierarchical manner. It works by progressively quantizing the residual error from the previous quantization step, using multiple levels of codebooks. Each codebook contains a set of code embeddings, and the quantization process finds the closest code embedding for a given input or residual.
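To make the hierarchical quantization concrete, below is a minimal PyTorch sketch of the residual quantization step described above. The function name, tensor shapes, and codebook sizes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: map a latent embedding to L semantic IDs via residual quantization.
import torch

def residual_quantize(z, codebooks):
    """z: (d,) latent embedding; codebooks: list of L tensors, each of shape (S, d)."""
    semantic_ids, residual = [], z
    for codebook in codebooks:
        # The nearest code embedding (L2 distance) gives this level's semantic ID.
        dists = torch.cdist(residual.unsqueeze(0), codebook).squeeze(0)   # (S,)
        sid = torch.argmin(dists).item()
        semantic_ids.append(sid)
        residual = residual - codebook[sid]        # carry the residual to the next level
    quantized = z - residual                       # equals the sum of the chosen code embeddings
    return semantic_ids, quantized

# Toy usage: 3 codebook levels of size 256 over 64-dimensional latents.
codebooks = [torch.randn(256, 64) for _ in range(3)]
sids, z_hat = residual_quantize(torch.randn(64), codebooks)
```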

3.1.6. Multimodal Embeddings

Multimodal embeddings refer to vector representations that incorporate information from multiple data modalities, such as collaborative (user-item interaction history), textual (item descriptions, reviews), and visual (item images). By combining these different perspectives, multimodal embeddings can provide a richer, more comprehensive, and robust representation of items, which helps address issues like the cold-start problem (where new items lack sufficient interaction data).

3.1.7. LoRA (Low-Rank Adaptation)

LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique for LLMs and other large neural networks. Instead of fine-tuning all the parameters of a large pre-trained model, LoRA injects trainable low-rank matrices into the transformer architecture's attention layers. This significantly reduces the number of trainable parameters, making fine-tuning much faster and requiring less computational resources, while often achieving performance comparable to full fine-tuning.
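As an illustration, a typical LoRA setup with the Hugging Face peft library looks like the sketch below. The target modules and LoRA hyper-parameters mirror the implementation details reported later in this analysis (Table 4), but the snippet is a generic sketch rather than the authors' training code, and loading the checkpoint requires the corresponding model access.

```python
# Minimal sketch: attach LoRA adapters to a causal LLM so only a small fraction of
# parameters is trainable during fine-tuning.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
lora_cfg = LoraConfig(
    r=8,                    # LoRA rank
    lora_alpha=16,          # scaling factor
    lora_dropout=0.05,
    target_modules=["gate_proj", "down_proj", "up_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # prints the (small) share of trainable weights
```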

3.1.8. Maximum Mean Discrepancy (MMD)

Maximum Mean Discrepancy (MMD) is a statistical distance measure used to quantify the difference between two probability distributions. Unlike measures that compare points directly (like Mean Squared Error), MMD compares the "mean embeddings" of the distributions in a Reproducing Kernel Hilbert Space (RKHS). If two distributions are identical, their MMD will be zero. It's particularly useful when comparing complex distributions where explicit density estimation is difficult. A characteristic kernel is a type of kernel function (e.g., Gaussian kernel) for which MMD can uniquely determine if two distributions are identical.
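A biased empirical estimate of MMD² with a Gaussian kernel can be written as in the sketch below; the bandwidth handling and function names are simplifications assumed for illustration.

```python
# Minimal sketch: empirical (biased) MMD^2 between two sample sets with a Gaussian kernel.
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), evaluated pairwise
    return torch.exp(-torch.cdist(x, y).pow(2) / (2 * sigma ** 2))

def mmd2(p_samples, q_samples, sigma=1.0):
    """p_samples: (n, d), q_samples: (m, d)."""
    k_pp = gaussian_kernel(p_samples, p_samples, sigma).mean()
    k_qq = gaussian_kernel(q_samples, q_samples, sigma).mean()
    k_pq = gaussian_kernel(p_samples, q_samples, sigma).mean()
    return k_pp + k_qq - 2 * k_pq

# Toy usage: distance between two batches of 64-dimensional embeddings.
loss = mmd2(torch.randn(32, 64), torch.randn(32, 64))
```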

3.1.9. Contrastive Learning

Contrastive learning is a self-supervised learning paradigm where a model learns representations by pushing "similar" (positive) samples closer together in the embedding space and pulling "dissimilar" (negative) samples farther apart. InfoNCE loss is a common objective function used in contrastive learning. In the context of multimodal learning, contrastive learning can be used to align representations from different modalities (e.g., making the embedding of an item's image similar to its text description) or to align quantized embeddings from different modalities, as in this paper.
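The sketch below shows a common in-batch formulation of the InfoNCE loss for aligning two modalities of the same items; cosine similarity and the temperature value are assumptions for illustration.

```python
# Minimal sketch: InfoNCE with in-batch negatives (diagonal entries are the positives).
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """anchor, positive: (batch, d) embeddings of the same items in two modalities."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature       # (batch, batch) similarity matrix
    labels = torch.arange(anchor.size(0))              # item i should match item i
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(16, 64), torch.randn(16, 64))
```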

3.2. Previous Works

3.2.1. Traditional Sequential Recommenders

  • SASRec [5]: A seminal work in SR, SASRec uses a self-attention mechanism to capture sequential patterns. It treats a user's interaction history as a sequence and predicts the next item by attending to relevant past items. It primarily relies on collaborative modality (item IDs).

3.2.2. LLM-based Sequential Recommendation (LLM4SR)

  • TALLRec [2]: This work formulates SR as a text generation task and applies instruction tuning on LLMs. Users' historical interactions and item attributes are converted into natural language prompts, and the LLM is trained to generate the recommended items as text.
  • E4SRec [10]: Adopts a linear projection to map pre-trained ID embeddings into the LLM token space. It aims to tackle the out-of-range generation problem, where LLMs might generate item IDs that don't exist in the item set.
  • CoLLM [59] and LLaRA [13]: These works, similar to Concat in the baselines, integrate collaborative embeddings into LLMs for recommendation, often by mapping them into the LLM token embedding space and concatenating them.
  • CTRL [9]: Connects collaborative and language models for CTR (Click-Through Rate) prediction. CTRL-MM (a baseline in this paper) adapts this idea for multimodal SR, explicitly aligning embeddings using InfoNCE loss.
  • MOTOR [56]: Replaces collaborative embeddings with token embeddings of vision and text features and uses a token cross-network for interaction modeling. It also uses semantic IDs for visual and textual embeddings.

3.2.3. Semantic IDs for Recommendation

  • Generative Models using Semantic IDs (TIGER [35], Sun et al. [40], LETTER [43], Zheng et al. [66]): These methods learn to transform item embeddings into semantic IDs, which are then treated as new generative LLM tokens. TIGER uses content information to generate semantic token sequences for SR. However, a common drawback, as highlighted by the current paper, is that these methods typically discard the trained code embeddings after quantization and randomly initialize them for downstream tasks, leading to catastrophic forgetting. They often face issues like collision (multiple items mapping to the same semantic ID sequence) and autoregressive inference latency.
  • Semantic IDs as Auxiliary Information (QARM [30], Zhang et al. [58]): These works use vector quantization and residual quantization to generate quantitative codes as new features to enhance traditional recommender systems. However, their improvements are often limited by the constraints of the traditional model structure.

3.2.4. Multimodal Encoding

  • Individual Vision/Text Encoders or Multimodal Encoders (BEiT3 [42], CLIP [34]): Previous works on multimodal recommendation often combine separate vision and text encoders or use multimodal encoders like BEiT3 or CLIP to process multimodal data. The paper notes two limitations: separate encoders may not share a representation space and thus require additional alignment, and some multimodal encoders (like CLIP's text encoder) have limited capability for long, complex texts.
  • LLM2CLIP [52]: This work enhances the original CLIP model by replacing its text encoder with a more powerful LLM (e.g., Llama3-8B). This allows CLIP to process longer and more complex textual information while still benefiting from its cross-modal alignment capabilities. MME-SID adopts LLM2CLIP for multimodal embedding encoding.

3.3. Technological Evolution

The evolution of recommender systems has moved from traditional collaborative filtering, focusing solely on user-item interactions, to incorporating content-based features, and then to sequential recommendation with models like SASRec that capture temporal dynamics. More recently, the advent of powerful Large Language Models has led to LLM4SR, where LLMs are used to understand and generate recommendations based on natural language and item semantics. The integration of multimodal information (text, visual, collaborative) has been a parallel trend, aiming to enrich item representations and address cold-start issues. This paper sits at the intersection of these trends, specifically addressing the challenges that arise when combining LLMs, multimodal embeddings, and semantic IDs – namely embedding collapse and catastrophic forgetting. It refines the way semantic IDs are utilized to maximize information retention and LLM capacity.

3.4. Differentiation Analysis

MME-SID distinguishes itself from previous LLM4SR methods primarily by its novel approach to mitigate embedding collapse and catastrophic forgetting.

  • Addressing Embedding Collapse:

    • Unlike methods relying solely on linear projections of low-dimensional collaborative embeddings (e.g., E4SRec, Concat), MME-SID leverages a combination of the original collaborative, textual, and visual embeddings and the embeddings of their semantic IDs. This multimodal and multi-representation input is designed to prevent the projected embeddings from collapsing into a low-dimensional subspace within the LLM token space.
    • The paper's MM-RQ-VAE specifically aims to create quantized embeddings that better preserve intra-modal distance information and capture inter-modal correlations, which also contributes to richer, less collapsed representations.
  • Mitigating Catastrophic Forgetting:

    • In contrast to approaches like TIGER-MM, MOTOR, or LETTER that discard trained code embeddings and randomly initialize them for downstream tasks, MME-SID directly initializes the semantic ID embeddings with the trained code embeddings from its MM-RQ-VAE. This crucial step ensures that the valuable partial order information of distance learned during the quantization process is retained, preventing catastrophic forgetting.
  • Enhanced Semantic ID Utilization:

    • MME-SID proposes a more effective way to use semantic IDs. Instead of only generating semantic IDs for generative retrieval (which can suffer from collision and autoregressive latency), MME-SID uses multimodal semantic ID embeddings as part of a rich input to the LLM for direct scoring, allowing it to generate a ranking list on the whole item set flexibly and efficiently.
  • Multimodal Frequency-Aware Fusion:

    • The introduction of a multimodal frequency-aware fusion module is another innovation, allowing MME-SID to adaptively weigh different modalities based on item popularity, which is not commonly seen in other LLM4SR methods.

      In summary, MME-SID goes beyond simply adopting LLMs or using semantic IDs; it fundamentally rethinks how item representations are constructed and fed into LLMs to address specific challenges that limit their potential in sequential recommendation.

4. Methodology

The proposed framework, MME-SID, aims to empower Large Language Models (LLMs) for sequential recommendation (SR) by mitigating embedding collapse and catastrophic forgetting. The overall framework, depicted in Figure 1, consists of two main stages: an Encoding Stage and a Fine-tuning Stage.

4.1. Principles

The core idea of MME-SID is to leverage both multimodal embeddings (collaborative, textual, visual) and semantic IDs with their trained code embeddings to construct more robust and informative item representations. This approach is guided by two main theoretical intuitions:

  1. Combating Embedding Collapse: By incorporating information from multiple modalities and distinct representations (original embeddings and quantized semantic ID embeddings), the model aims to enrich the feature space and prevent item representations from collapsing into a low-dimensional subspace when projected into the LLM token space. The theoretical basis for this is that combining diverse, less correlated feature sources can increase the effective rank of the combined representation.
  2. Mitigating Catastrophic Forgetting: By explicitly initializing the semantic ID embeddings in the LLM with the code embeddings learned during the quantization process, the model retains the valuable distance and structural information captured in the original multimodal embeddings. This prevents the loss of knowledge that occurs when these embeddings are randomly initialized. The use of Maximum Mean Discrepancy (MMD) as a reconstruction loss further strengthens this by ensuring the quantized representations preserve the underlying distribution of distances.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Formulation (Section 2.1)

In sequential recommendation, the goal is to model a user's dynamic interests and sequential patterns. Let $\mathcal{U}$ be the set of users and $\mathcal{I}$ the set of items. For each user $u \in \mathcal{U}$, we have a behavioral item sequence $\{h_u\}$, a target item $x_u$, and a true label $y_u$. A conventional sequential recommender system (SRS) $f_\theta$ takes $\{h_u\}$ as input, and the prediction $\hat{y}$ is obtained as the dot product of the model's output with the target item embedding. The model parameters $\theta$ are typically optimized by minimizing the binary cross-entropy (BCE) loss:

$$ \operatorname*{min}_{\theta} \mathcal{L} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \mathrm{BCE}\left(f_{\theta}\left(\{h_u\}, x_u\right), y_u\right) $$

where $|\mathcal{U}|$ is the total number of users.
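To make the objective concrete, here is a minimal sketch (not the paper's code) of the dot-product scoring and BCE loss for a batch of users; shapes and names are illustrative.

```python
# Minimal sketch: score = dot product of the sequence model output and the target embedding,
# trained with binary cross-entropy against the true label.
import torch
import torch.nn.functional as F

user_state = torch.randn(8, 64)     # f_theta({h_u}) for a batch of 8 users
target_emb = torch.randn(8, 64)     # embeddings of the target items x_u
labels = torch.randint(0, 2, (8,)).float()

logits = (user_state * target_emb).sum(dim=-1)          # \hat{y} before the sigmoid
loss = F.binary_cross_entropy_with_logits(logits, labels)
```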

4.2.2. RQ-VAE (Section 2.2)

Residual Quantized Variational Autoencoder (RQ-VAE) is a model designed to tokenize original embeddings into discrete semantic IDs in a hierarchical manner. An original embedding $\boldsymbol{s}$ is first encoded into a latent semantic embedding $\boldsymbol{z}$, which is then quantized into a sequence of codes (semantic IDs) through $L$-level codebooks. Each level $l = 1, \ldots, L$ has a codebook $C_l = \{CE_j\}_{j=1}^S$, where $CE_j \in \mathbb{R}^d$ are learnable code embeddings and $S$ is the codebook size.

The residual quantization process is formulated as:

$$ SID^l = \arg\min_j \left\| r_{l-1} - CE_j \right\|^2, \qquad r_l = r_{l-1} - CE_{SID^l} $$

Here, $SID^l$ is the semantic ID assigned at the $l$-th level codebook, and $r_{l-1}$ is the residual from the previous level, with $r_0 = \boldsymbol{z}$. The norm $\| \cdot \|$ is the L2 norm, so $SID^l$ is the index $j$ of the code embedding $CE_j$ closest to the current residual $r_{l-1}$. The new residual $r_l$ is obtained by subtracting the chosen code embedding from $r_{l-1}$.

Finally, the semantic IDs of the original embedding are $\{SID^1, \ldots, SID^L\}$, and the quantized embedding $\hat{\boldsymbol{z}}$ is formed by summing the chosen code embeddings: $\hat{\boldsymbol{z}} = \sum_{l=1}^L CE_{SID^l}$. This $\hat{\boldsymbol{z}}$ is then decoded into $\hat{\boldsymbol{s}}$ to reconstruct the original embedding $\boldsymbol{s}$.

The overall loss function for RQ-VAE is:

$$ \mathcal{L} = \mathcal{L}_{\mathrm{Recon}} + \mathcal{L}_{\mathrm{RQ-VAE}} \quad (3) $$
$$ \mathcal{L}_{\mathrm{Recon}} = \left\| \boldsymbol{s} - \hat{\boldsymbol{s}} \right\|^2 \quad (4) $$
$$ \mathcal{L}_{\mathrm{RQ-VAE}} = \sum_{l=1}^L \left\| \mathsf{SG}\left(r_{l-1}\right) - CE_{SID^l} \right\|^2 + \alpha \left\| r_{l-1} - \mathsf{SG}\left(CE_{SID^l}\right) \right\|^2 \quad (5) $$

Here, SG denotes the stop-gradient operation, meaning no gradient is passed through that part of the expression, and $\alpha$ is a hyper-parameter.

  • $\mathcal{L}_{\mathrm{Recon}}$ (Equation 4) is the reconstruction loss, typically mean squared error (MSE), minimizing the difference between the original embedding $\boldsymbol{s}$ and its reconstruction $\hat{\boldsymbol{s}}$.
  • $\mathcal{L}_{\mathrm{RQ-VAE}}$ (Equation 5) is the quantization loss. The first term updates the code embeddings $CE_{SID^l}$ toward the residuals $r_{l-1}$ (the stop gradient on the residual means only the code embeddings receive gradients). The second term, weighted by $\alpha$, is a commitment loss that encourages the residuals $r_{l-1}$ to stay close to their assigned code embeddings, with the code embedding treated as a constant by SG.
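For reference, a minimal sketch of the per-level quantization loss in Equation (5) is given below, with the stop gradient realized via .detach(); the helper name and the omission of the straight-through estimator used in practice are simplifications.

```python
# Minimal sketch: codebook term + alpha-weighted commitment term for one quantization level.
import torch
import torch.nn.functional as F

def rq_vae_level_loss(residual, chosen_code, alpha=1.0):
    codebook_term = F.mse_loss(chosen_code, residual.detach())   # pull codes toward residuals
    commit_term = F.mse_loss(residual, chosen_code.detach())     # keep residuals near their codes
    return codebook_term + alpha * commit_term

loss = rq_vae_level_loss(torch.randn(32, 64), torch.randn(32, 64))
```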

4.2.3. Preliminary Analysis (Section 3)

4.2.3.1. Embedding Collapse (Section 3.1)

The paper theoretically investigates embedding collapse when pre-trained collaborative embeddings are mapped into the LLM token space. Let $A$ and $B$ be matrices. The following rank properties hold:

$$ \operatorname{rank}(A \cdot B) \leq \min\{\operatorname{rank}(A), \operatorname{rank}(B)\}, \qquad \operatorname{rank}(A + B) \leq \operatorname{rank}(A) + \operatorname{rank}(B) $$

Consider the common case where a pre-trained collaborative embedding table $E_c \in \mathbb{R}^{D \times M}$ (with embedding dimension $D$ and $M$ items as columns) is projected into the LLM token space by a linear projection with weight $W \in \mathbb{R}^{D' \times D}$ and bias $b \in \mathbb{R}^{D' \times 1}$. The rank of the projected embedding table satisfies:

$$ \operatorname{rank}(W \cdot E_c + b) \leq \operatorname{rank}(W \cdot E_c) + \operatorname{rank}(b) \leq \min\{\operatorname{rank}(W), \operatorname{rank}(E_c)\} + 1 \leq \operatorname{rank}(E_c) + 1 $$

Since $E_c$ typically has low rank (e.g., $D = 64$ or $128$ in traditional SRS), the projected embedding table remains low-rank in the much higher-dimensional LLM space ($D'$), i.e., it collapses, and the LLM cannot utilize the full dimensionality of its token embedding space.
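The rank argument is easy to check numerically; the sketch below uses the row-major convention (items as rows, the transpose of the notation above) and random matrices as stand-ins.

```python
# Minimal sketch: a linear projection plus bias cannot raise the rank of a low-rank
# embedding table by more than one, regardless of the target dimension.
import numpy as np

M, D, D_llm = 1000, 64, 4096
E_c = np.random.randn(M, D)            # pre-trained collaborative embeddings (rank ~ 64)
W = np.random.randn(D, D_llm)          # linear projection into the LLM token space
b = np.random.randn(1, D_llm)          # bias, broadcast over all items

projected = E_c @ W + b
print(np.linalg.matrix_rank(E_c))          # 64
print(np.linalg.matrix_rank(projected))    # at most 65, despite 4096 available dimensions
```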

4.2.3.2. Catastrophic Forgetting (Section 3.2)

To quantify catastrophic forgetting, the paper uses Kendall's tau ($\tau$) [6] to measure the loss or preservation of distance information. Kendall's tau assesses the concordance (agreement in ranking) between two sets of paired observations.

Given two models, $f_\theta$ and $f_{\theta'}$, that produce pairwise distances between item embeddings (e.g., $\langle e_1, e_2 \rangle$ and $\langle e_1, e_3 \rangle$ for $f_\theta$, and $\langle e'_1, e'_2 \rangle$ and $\langle e'_1, e'_3 \rangle$ for $f_{\theta'}$), Kendall's tau is defined as:

$$ \tau = \frac{\#(\mathrm{concordant~pairs}) - \#(\mathrm{discordant~pairs})}{\#(\mathrm{pairs})} $$

Here, $\#$ denotes a count. A pair of samples is concordant if its relative order is the same under both models (e.g., $\langle e_1, e_2 \rangle < \langle e_1, e_3 \rangle$ and $\langle e'_1, e'_2 \rangle < \langle e'_1, e'_3 \rangle$, or both greater), and discordant if the relative order is reversed.

A preliminary experiment on the Amazon Beauty dataset showed that quantized embeddings from an RQ-VAE trained on SASRec's collaborative embeddings preserved 37.14% of the original distance information ($\tau = 0.3714$). However, when the code embeddings were randomly initialized and fine-tuned on the downstream task, $\tau$ dropped to 0.0550, an information loss of 94.50%. This empirically validates the catastrophic forgetting issue.
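A probe of this kind can be implemented as in the sketch below; the pair sampling and random data are illustrative, not the paper's exact protocol.

```python
# Minimal sketch: how well does quantization preserve the ordering of pairwise distances?
import numpy as np
from scipy.stats import kendalltau

def distance_order_preservation(original, quantized, n_pairs=10000, seed=0):
    """original, quantized: (num_items, d) embedding matrices for the same items."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(original), n_pairs)
    j = rng.integers(0, len(original), n_pairs)
    d_orig = np.linalg.norm(original[i] - original[j], axis=1)
    d_quant = np.linalg.norm(quantized[i] - quantized[j], axis=1)
    tau, _ = kendalltau(d_orig, d_quant)
    return tau

tau = distance_order_preservation(np.random.randn(500, 64), np.random.randn(500, 64))
```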

4.2.4. Encoding Stage (Section 4.2)

The Encoding Stage aims to obtain multimodal embeddings and their semantic IDs.

4.2.4.1. Multimodal Embedding Encoding (Section 4.2.1)

To get multimodal embeddings, the paper uses LLM2CLIP [52] as the multimodal encoder. LLM2CLIP enhances the original CLIP model by replacing its text encoder with a more powerful LLM (like Llama3-8B), enabling it to handle long and complex textual descriptions.

  • LLM2CLIP takes multimodal attributes (e.g., product title, descriptions, images) of items as input.
  • It outputs textual embeddings $E_t \in \mathbb{R}^{D_t \times |\mathcal{I}|}$ and visual embeddings $E_v \in \mathbb{R}^{D_v \times |\mathcal{I}|}$, where $D_t$ and $D_v$ are the embedding sizes and $|\mathcal{I}|$ is the number of items.
  • Separately, a traditional SRS such as SASRec [5] is trained on collaborative data (item IDs only), and its embedding table $E_c \in \mathbb{R}^{D_c \times |\mathcal{I}|}$ is extracted, where $D_c$ is the collaborative embedding size.

4.2.4.2. Multimodal Embedding Quantization (Section 4.2.2)

The paper proposes a Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) to generate multimodal semantic IDs and address the drawbacks of existing methods (MSE reconstruction loss, not capturing inter-modal distinctions). The architecture is shown in Figure 2.

The VLM description for Figure 2: The image is a schematic diagram illustrating the integration process of multimodal embeddings and semantic ID in the MME-SID framework, including the encoding and decoding processes of collaborative embeddings, text embeddings, and visual embeddings, utilizing Llama3-8B and CLIP-ViT for multimodal representation. It also shows the reconstruction of quantized embeddings and their alignment relationships with collaborative and semantic IDs, highlighting the role of maximum mean discrepancy as the reconstruction loss.


For each modality $j \in \{c, t, v\}$ (collaborative, textual, visual):

  • The original embedding $\boldsymbol{s}_j \in E_j$ is encoded into a semantic embedding $\boldsymbol{z}_j$.
  • Through an $L$-level codebook, the semantic IDs $\{SID_j^1, \ldots, SID_j^L\}$, the quantized embedding $\hat{\boldsymbol{z}}_j$, and the decoded quantized embedding $\hat{\boldsymbol{s}}_j$ are generated, following the RQ-VAE process described earlier.

1. MMD as Reconstruction Loss: To explicitly improve the ability of the quantized embeddings $\hat{\boldsymbol{z}}_j$ to preserve information from the original embeddings $\boldsymbol{s}_j$, the paper minimizes the Maximum Mean Discrepancy (MMD) between $\hat{\boldsymbol{s}}_j$ and $\boldsymbol{s}_j$ as the reconstruction loss. MMD [28, 36] measures the distance between two probability distributions $P$ and $Q$:

$$ \mathrm{MMD}_K(P, Q) \triangleq \| \boldsymbol{\mu}_P - \boldsymbol{\mu}_Q \|_{\mathcal{H}_K} $$

Here, $k(\cdot, \cdot)$ is a symmetric positive-definite kernel (a Gaussian kernel is used, which is characteristic), $\mathcal{H}_K$ is its unique reproducing kernel Hilbert space, and $\boldsymbol{\mu}$ denotes the mean embedding of a distribution. With a characteristic kernel, MMD preserves all statistics of the distribution, making it more informative than MSE (which only minimizes point-wise Euclidean distance).

2. Contrastive Learning for Alignment: To capture inter-modal correlations, a contrastive learning objective, specifically the InfoNCE loss, is adopted. It aligns the quantized collaborative embedding $\hat{\boldsymbol{z}}_c$ with the quantized textual embedding $\hat{\boldsymbol{z}}_t$ and the quantized visual embedding $\hat{\boldsymbol{z}}_v$. Since LLM2CLIP has already aligned visual and textual information in a shared embedding space, no direct alignment between $\hat{\boldsymbol{z}}_t$ and $\hat{\boldsymbol{z}}_v$ is needed.

The overall loss function of MM-RQ-VAE is:

$$ \mathcal{L}_{\mathrm{MM-RQ-VAE}} = \mathcal{L}_{\mathrm{Recon}} + \beta \cdot \mathcal{L}_{\mathrm{Align}} + \gamma \cdot \sum_{j \in \{c, t, v\}} \mathcal{L}_{\mathrm{RQ-VAE}} \quad (10) $$
$$ \mathcal{L}_{\mathrm{Recon}} = \sum_{b \subset \mathcal{I}} \sum_{j \in \{c, t, v\}} \mathrm{MMD}_K^2 \left( \mathsf{SG}(\boldsymbol{s}_j^b), \hat{\boldsymbol{s}}_j^b \right) $$
$$ \mathcal{L}_{\mathrm{Align}} = \mathcal{L}_{c-t} + \mathcal{L}_{c-v} $$
$$ \mathcal{L}_{c-t} = - \frac{1}{|\mathcal{I}|} \sum_{i=1}^{|\mathcal{I}|} \log \frac{\exp \left( \langle \hat{\boldsymbol{z}}_c^i, \hat{\boldsymbol{z}}_t^i \rangle / \epsilon \right)}{\exp \left( \langle \hat{\boldsymbol{z}}_c^i, \hat{\boldsymbol{z}}_t^i \rangle / \epsilon \right) + \sum_{i' \neq i} \exp \left( \langle \hat{\boldsymbol{z}}_c^i, \hat{\boldsymbol{z}}_t^{i'} \rangle / \epsilon \right)} $$

And for collaborative-visual alignment:

$$ \mathcal{L}_{c-v} = - \frac{1}{|\mathcal{I}|} \sum_{i=1}^{|\mathcal{I}|} \log \frac{\exp \left( \langle \hat{\boldsymbol{z}}_c^i, \hat{\boldsymbol{z}}_v^i \rangle / \epsilon \right)}{\exp \left( \langle \hat{\boldsymbol{z}}_c^i, \hat{\boldsymbol{z}}_v^i \rangle / \epsilon \right) + \sum_{i' \neq i} \exp \left( \langle \hat{\boldsymbol{z}}_c^i, \hat{\boldsymbol{z}}_v^{i'} \rangle / \epsilon \right)} $$

Where:

  • $\mathcal{L}_{\mathrm{RQ-VAE}}$ is the RQ-VAE loss (Equation 5), summed across modalities.
  • $\langle \cdot, \cdot \rangle$ denotes a similarity metric (e.g., cosine similarity).
  • $b \subset \mathcal{I}$ denotes a batch of samples from the item set.
  • $\mathsf{SG}$ is the stop-gradient operation, applied to the original embeddings $\boldsymbol{s}_j^b$ in the MMD loss so that gradients do not flow back into the multimodal encoders.
  • $\beta$ and $\gamma$ are hyper-parameters that balance the different loss terms.
  • $\epsilon$ is the temperature coefficient of the InfoNCE loss.

4.2.5. Fine-tuning Stage (Section 4.3)

The Fine-tuning Stage focuses on efficiently tuning the LLM for the SR task.

4.2.5.1. Initialization for Catastrophic Forgetting

To address catastrophic forgetting, MME-SID initializes the embeddings of the multimodal semantic IDs, $\mathrm{IDE}_{SID_j^l}$, directly with the code embeddings $CE_{SID_j^l}$ obtained from the trained MM-RQ-VAE. This ensures that the abundant intra-modal information (e.g., distances between behavioral and target item embeddings) learned during the encoding stage is preserved.

4.2.5.2. LLM Input Formulation

The input to the LLM consists of an {Instruction} (specifying the SR task) and a {Behavioral Item Sequence}. Each item in the {Behavioral Item Sequence} is represented as:

$$ f_{\mathrm{MLP}} \left( \left[ W_j \cdot \left( \mathcal{X} \cdot \mathsf{SG}(E_j) \right) + b_j , \; \sum_{l=1}^L E_{SID_j^l} \right] \right), \quad j \in \{c, t, v\} $$

Where:

  • $\mathcal{X}$ is the one-hot vector of the behavioral item sequence.

  • $E_j$ is the original embedding (collaborative $E_c$, textual $E_t$, or visual $E_v$) for modality $j$.

  • $W_j$ and $b_j$ are the weight and bias of a linear projection for each modality $j$.

  • $\mathsf{SG}(E_j)$ applies a stop gradient to the original embeddings, treating them as fixed features.

  • $\sum_{l=1}^L E_{SID_j^l}$ is the sum of the semantic ID embeddings for modality $j$, i.e., the quantized embedding initialized from MM-RQ-VAE.

  • The square brackets [ , ] denote concatenation, combining the linearly projected original embedding and the sum of the semantic ID embeddings for each modality.

  • $f_{\mathrm{MLP}}$ is a multi-layer perceptron (MLP) that converts the concatenated feature into the LLM's token embedding dimension $D_{\mathrm{LLM}}$.

    This input format is designed to simultaneously preserve distance information from the original embeddings and the hierarchical structure of the semantic IDs across modalities, providing a rich, less collapsed, and less forgotten representation to the LLM (see the sketch below).
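Below is a minimal, single-modality (collaborative-only) sketch of this token construction; the latent dimension, the flattened semantic-ID embedding table, and the module names are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch: project the frozen original embedding, sum the semantic ID embeddings,
# concatenate, and map the result into the LLM token dimension with an MLP.
import torch
import torch.nn as nn

D_c, d, D_llm, L, S = 64, 128, 4096, 4, 256
proj_c = nn.Linear(D_c, d)                        # W_c, b_c
sid_emb_c = nn.Embedding(S * L, d)                # would be initialized from MM-RQ-VAE codes
fuse_mlp = nn.Sequential(nn.Linear(2 * d, D_llm), nn.ReLU(), nn.Linear(D_llm, D_llm))

def item_token(e_c, sids_c):
    """e_c: (D_c,) frozen collaborative embedding; sids_c: (L,) flattened semantic IDs."""
    projected = proj_c(e_c.detach())              # stop gradient on the original embedding
    quantized = sid_emb_c(sids_c).sum(dim=0)      # sum of the L code embeddings
    return fuse_mlp(torch.cat([projected, quantized]))

token = item_token(torch.randn(D_c), torch.randint(0, S * L, (L,)))
```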

4.2.5.3. Multimodal Frequency-Aware Fusion

Existing SR models often ignore that the importance of different modalities can vary for cold (infrequently interacted) or warm (frequently interacted) items. To address this, a multimodal frequency-aware fusion module is proposed.

First, the frequency $q_i$ of each item $i$ in the training set is recorded. Given the long-tail distribution of user-item interactions, $q_i$ is log-transformed and min-max normalized:

$$ q_i' = \log\left(q_i + 1\right), \qquad q_i'' = \frac{q_i' - \min\left(q_i'\right)}{\max\left(q_i'\right) - \min\left(q_i'\right)} $$

Here, $q_i'$ is the logarithmic transformation of the frequency (adding 1 to handle zero frequency), and $q_i''$ is the min-max normalized frequency feature.

Next, an MLP $g$ takes $q_i''$ as input and outputs fusion weights $\{w_X, w_c, w_t, w_v\}$ for each target item. Finally, the prediction score $\hat{y}$ for a target item is calculated from the last hidden state of the LLM output, $\boldsymbol{o}_{\mathrm{LLM}}$:

$$ \hat{y} = w_X \odot \left( \boldsymbol{o}_{\mathrm{LLM}} \cdot E_x^\top \right) + \sum_j w_j \odot \left( \boldsymbol{o}_{\mathrm{LLM}} \cdot \left( W_j \cdot \left( \mathcal{X} \cdot \mathsf{SG}(E_j) \right) + b_j \right)^\top \right) $$

Where:

  • $j \in \{c, t, v\}$ indexes the collaborative, textual, and visual modalities.

  • $\odot$ denotes the Hadamard (element-wise) product.

  • $\cdot$ denotes the dot product.

  • $E_x \in \mathbb{R}^{D_{\mathrm{LLM}} \times |\mathcal{I}|}$ is a new embedding table dedicated to target items, introduced to further relieve potential collapse of the target item embeddings.

  • The first term is the score from the LLM output interacting with the new target item embedding.

  • The summation term collects the scores from the LLM output interacting with the linearly projected original embeddings of each modality.

    The BCE loss is then computed from $\hat{y}$ and the true label $y$ to update the LLM. Notably, only a small proportion of all LLM parameters (about 0.19% in the experiments) is updated, efficiently, using LoRA.
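The frequency feature and fusion weights can be sketched as below; the MLP width, the absence of any normalization on the output weights, and the variable names are assumptions for illustration, and the final weighted scoring step is omitted.

```python
# Minimal sketch: log-transform and min-max normalize item frequencies, then map the
# normalized frequency to per-modality fusion weights with a small MLP.
import torch
import torch.nn as nn

freq = torch.tensor([1., 3., 120., 0., 45.])        # raw training-set frequencies q_i
q = torch.log(freq + 1)                              # q_i'
q = (q - q.min()) / (q.max() - q.min())              # q_i'' in [0, 1]

weight_mlp = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 4))
weights = weight_mlp(q.unsqueeze(-1))                # (num_items, 4): w_X, w_c, w_t, w_v
```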

The overall framework of MME-SID is shown in Figure 1.

The VLM description for Figure 1: The image is a schematic diagram of the MME-SID framework, illustrating how the Large Language Model (LLM) integrates multimodal embeddings and quantized embeddings. The diagram shows several linear projections, including collaborative embeddings, textual embeddings, and visual embeddings, utilizing frequency-aware fusion for information integration.

Figure 1: The overall framework of MME-SID.

4.2.6. Discussions (Section 4.4)

The paper highlights several advantages of MME-SID, particularly its potential to improve upon common suboptimal practices of using semantic IDs in generative retrieval or generative recommendation:

  • Flexible Ranking: MME-SID can generate a ranking list over the entire item set and output the top-k most relevant items flexibly. This contrasts with methods like TIGER [35], which can only retrieve the most relevant item autoregressively (code by code), making it less efficient for full-list recommendations.
  • No Collision Issue: MME-SID does not need to handle the collision problem, where multiple items might map to the same sequence of semantic IDs. This is because the multimodal data and its rich representation naturally help discriminate between different items. Existing methods like TIGER require extra computation and storage to ensure unique semantic ID sequences.
  • Higher Inference Efficiency: MME-SID achieves higher inference efficiency. Suppose the LLM token embedding dimension is $D_{\mathrm{LLM}}$, a user has $N$ behavioral items, and each item is encoded into $L$ semantic IDs:
    • Methods like TIGER require an input of dimension $D_{\mathrm{LLM}} \times N \times L$, since each item becomes a sequence of $L$ tokens.
    • MME-SID only requires a $D_{\mathrm{LLM}} \times N$ input, because each item is represented by a single, comprehensive multimodal embedding that is less collapsed, less prone to forgetting, and more informative. This significantly improves inference efficiency.

5. Experimental Setup

5.1. Datasets

The experiments are conducted on three categories of the Amazon 5-core dataset [31]: Beauty, Toys & Games, and Sports & Outdoors.

  • Source and Characteristics: These datasets are collected from Amazon, a large e-commerce platform. They represent user-item interactions where each user and item has at least 5 interactions.

  • Task: The task is to predict whether a user will give a rating higher than 3 to a target item, effectively framing it as a binary classification problem for positive preference.

  • Sparsity: The sparsity metric denotes the proportion of negative samples (label $y = 0$), indicating that positive interactions are rare.

  • Splitting: For each user, the $(N-1)$-th item in their historical sequence serves as the target item for the training set, and the $N$-th item serves as the target item for the test set.

  • Preprocessing: Items lacking a title or image in the original dataset are removed.

    The following are the results from Table 1 of the original paper:

    Category Users Items Interactions Sparsity
    Beauty 22,332 12,086 198,215 99.93%
    Toys & Games 19,121 11,757 165,221 99.93%
    Sports & Outdoors 35,092 18,090 292,007 99.95%

These datasets are well-suited for validating sequential recommendation methods, especially those incorporating multimodal information, due to their rich item attributes and real-world interaction patterns. The high sparsity also presents a realistic challenge for recommendation systems.

5.2. Evaluation Metrics

The performance of the models is evaluated using Hit Ratio (HR@k) and Normalized Discounted Cumulative Gain (nDCG@k) for $k \in \{5, 10, 20\}$. Kendall's tau is also used for the internal analysis of catastrophic forgetting.

5.2.1. Hit Ratio (HR@k)

  • Conceptual Definition: Hit Ratio (HR@k) measures the proportion of users for whom the target item appears within the top-$k$ recommended items. It is a straightforward recall-style measure of how often the system places the relevant item among the top-$k$ choices. A higher HR@k means more users find their target item in the top recommendations.
  • Mathematical Formula: $ \mathrm{HR@k} = \frac{\text{Number of users with hit in top-k}}{\text{Total number of users}} $
  • Symbol Explanation:
    • Number of users with hit in top-k: The count of users for whom the ground-truth item appears in their top-$k$ recommendation list.
    • Total number of users: The total number of users in the evaluation set.

5.2.2. Normalized Discounted Cumulative Gain (nDCG@k)

  • Conceptual Definition: Normalized Discounted Cumulative Gain (nDCG@k) is a ranking quality metric that accounts for the position of the relevant items in the recommendation list. It assigns higher scores to relevant items that appear at higher ranks (closer to the top). nDCG values range from 0 to 1, where 1 represents a perfect ranking. It's normalized to be comparable across different recommendation lists.
  • Mathematical Formula: First, DCG@k (Discounted Cumulative Gain at $k$) is calculated: $ \mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $ Then, nDCG@k is obtained by dividing DCG@k by the Ideal DCG@k (IDCG@k), i.e., the DCG@k of a perfectly ordered list: $ \mathrm{nDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}} $
  • Symbol Explanation:
    • $k$: The number of top recommendations considered.
    • $i$: The rank position in the recommendation list (starting from 1).
    • $\mathrm{rel}_i$: The relevance score of the item at position $i$. In binary relevance scenarios (as here, where a rating > 3 implies relevance), $\mathrm{rel}_i$ is 1 if the item is relevant and 0 otherwise.
    • $\log_2(i+1)$: The discount factor for items at lower ranks.
    • $\mathrm{IDCG@k}$: The maximum possible DCG@k for a given set of relevant items, achieved by ranking all relevant items at the top in decreasing order of relevance.
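Under the binary, single-target relevance setting used here, IDCG@k equals 1 and nDCG@k reduces to a position-discounted hit indicator, as in this illustrative sketch:

```python
# Minimal sketch: nDCG@k with one relevant item per user (binary relevance).
import math

def ndcg_at_k(ranked_lists, targets, k):
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        if t in ranked[:k]:
            rank = ranked.index(t) + 1               # 1-based position of the target
            total += 1.0 / math.log2(rank + 1)       # IDCG is 1 for a single relevant item
    return total / len(targets)

print(ndcg_at_k([[3, 7, 1], [9, 2, 5]], targets=[7, 4], k=3))   # ≈ 0.3155
```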

5.2.3. Kendall's tau ($\tau$)

  • Conceptual Definition: Kendall's tau is a non-parametric statistic measuring the ordinal association between two measured quantities. In this paper, it quantifies the concordance (agreement in ranking order) between the pairwise distances of the original item embeddings and the pairwise distances of their quantized embeddings. A value closer to 1 indicates better preservation of relative distance information.
  • Mathematical Formula (from Section 3.2 of the paper): $ \tau = \frac{\#(\mathrm{concordant~pairs}) - \#(\mathrm{discordant~pairs})}{\#(\mathrm{pairs})} $
  • Symbol Explanation:
    • $\#(\mathrm{concordant~pairs})$: The number of sample pairs whose relative order (e.g., $A < B$ in both lists, or $A > B$ in both lists) is the same in both rankings.
    • $\#(\mathrm{discordant~pairs})$: The number of sample pairs whose relative order is opposite in the two rankings (e.g., $A < B$ in the first list but $A > B$ in the second).
    • $\#(\mathrm{pairs})$: The total number of unique sample pairs that can be formed from the data.

5.3. Baselines

The proposed MME-SID is compared against the following representative baseline methods. Their input formulations for the {Behavioral Item Sequence} are detailed in Table 2. Llama3-8B-instruct is adopted for all LLM-based methods, and RQ-VAE is used to generate semantic IDs where applicable.

The following are the results from Table 2 of the original paper:

Method Input
SASRec $\mathcal{X} \cdot E_c$
E4SRec $W_c \cdot (\mathcal{X} \cdot \mathsf{SG}(E_c)) + b_c$
ME $f_{\mathrm{LP}}([W_c \cdot (\mathcal{X} \cdot \mathsf{SG}(E_c)) + b_c, \; \mathcal{X} \cdot E_c'])$
Concat $[W_j \cdot (\mathcal{X} \cdot \mathsf{SG}(E_j)) + b_j]_{j \in \{c, t, v\}}$
Concat&MLP $\mathrm{MLP}([W_j \cdot (\mathcal{X} \cdot \mathsf{SG}(E_j)) + b_j]_{j \in \{c, t, v\}})$
CTRL-MM $\mathrm{MLP}([W_j \cdot (\mathcal{X} \cdot \mathsf{SG}(E_j)) + b_j]_{j \in \{c, t, v\}})$
TIGER-MM $[SID_j^1, \dots, SID_j^L]_{j \in \{c, t, v\}}$
MOTOR $[SID_t^1, \dots, SID_t^L, SID_v^1, \dots, SID_v^L]$
LETTER $[SID_t^1, \dots, SID_t^L]$

Where:

  • $\mathcal{X}$: One-hot vector of the historical interactions (representing the items in the sequence).
  • $E_c$: Collaborative embedding matrix.
  • $E_c'$: A new, randomly initialized collaborative embedding matrix (for ME).
  • $E_j$: Embedding matrix for modality $j \in \{c, t, v\}$ (collaborative, textual, visual).
  • $W_j, b_j$: Weight and bias of the linear projection for modality $j$.
  • $f_{\mathrm{LP}}$: A linear projection (for ME).
  • $\mathrm{MLP}$: Multi-layer perceptron.
  • $SID_j^l$: Semantic ID at the $l$-th codebook for modality $j$.
  • SG: Stop gradient operation.
  • Square brackets [ ]: Concatenation operation.

5.3.1. Single-Modal (Collaborative Only) Baselines

  • SASRec [5]: The original SASRec model. It uses self-attention to model sequential patterns over item IDs ($E_c$).
  • E4SRec [10]: Adopts a linear projection of the pre-trained ID embeddings ($E_c$) to map them into the LLM token space, aiming to address out-of-range generation.
  • Multi Embedding (ME): A baseline proposed by the authors. It takes both the linear projection of the pre-trained ID embeddings ($E_c$) and a new set of randomly initialized ID embeddings ($E_c'$) as input, testing whether simply adding another embedding table helps.

5.3.2. Multimodal Baselines

  • Concat: Maps the pre-trained collaborative, textual, and visual embeddings ($E_c, E_t, E_v$) into the LLM token embedding space via linear layers and directly concatenates them. It represents approaches like CoLLM [59] and LLaRA [13].
  • Concat&MLP: A typical multimodal fusion method [12, 53]. It concatenates the projected collaborative, textual, and visual embeddings of items, then feeds this into an MLP before passing it to the LLM.
  • CTRL-MM: Adapted from CTRL [9]. It has the same input as Concat&MLP but explicitly aligns the collaborative embedding with textual and visual embeddings using InfoNCE as a contrastive learning loss.
  • TIGER-MM: A multimodal variant adapted from TIGER [35]. It exclusively uses semantic IDs of collaborative, textual, and visual embeddings to perform generative retrieval. It trains an RQ-VAE for each modality separately to generate semantic IDs.
  • MOTOR [56]: Replaces collaborative embeddings with token embeddings of vision and text features. It obtains semantic IDs for visual and textual embeddings and uses SASRec as the traditional downstream multimodal recommendation model.
  • LETTER [43]: Implemented on TIGER as the backbone. It adopts various regularization methods, such as diversity, to achieve better item tokenization for generative recommendation.

5.4. Implementation Details

  • Multimodal Encoding: Product titles and images are used.
  • Embedding Dimensions: $D_c = 64$ for collaborative embeddings; $D_t = D_v = 1280$ for textual and visual embeddings.
  • Kernel for MMD: A Gaussian kernel $k(\boldsymbol{e}, \boldsymbol{e}') = \exp\left(-\frac{\|\boldsymbol{e} - \boldsymbol{e}'\|^2}{2\sigma^2}\right)$ is adopted, which is a characteristic kernel.
  • LLM Backbone: Llama3-8B-instruct, with token embedding dimension $D_{\mathrm{LLM}} = 4096$. The instruction-tuned model is chosen for its stronger instruction-following capability.
  • Hardware: All experiments are conducted on A100 GPUs.
  • Runs: Results are averaged over 3 runs to ensure robustness.
  • Optimizer: AdamW [29] is used.
  • Hyper-parameters:
    • $\alpha = 1$, $\beta = 1\mathrm{e}{-3}$, and $\gamma = 1$ for the MM-RQ-VAE loss.

    • LoRA target modules are [gate_proj, down_proj, up_proj].

    • Only about 0.19% of all LLM parameters are updated during fine-tuning with LoRA.

      The following are the results from Table 4 of the original paper:

      Dataset Beauty Toys & Games Sports & Outdoors
      Training epochs 3 3 2
      Learning rate 3e-4 2e-4 2e-4
      Batch size 16 16 16
      LoRA rank 8 8 8
      LoRA alpha 16 16 16
      LoRA dropout 0.05 0.05 0.05
      Warm-up steps 100 100 200
      Number of codes 256 256 300
      Level of codebooks 4 4 4

6. Results & Analysis

6.1. Core Results Analysis (RQ1)

RQ1: What is the performance of the proposed MME-SID compared with baseline methods?

The overall performance comparison of MME-SID with various baselines on the three Amazon datasets is presented in Table 3. The metrics used are HR@5, HR@10, HR@20, nDCG@5, nDCG@10, and nDCG@20.

The following are the results from Table 3 of the original paper:

| Datasets | Metric | SASRec | E4SRec | ME | Concat | Concat&MLP | CTRL-MM | TIGER-MM | MOTOR | LETTER | Ours-full | Impr. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Beauty | HR@5 | 0.0368 | 0.0545 | 0.0567 | 0.0523 | 0.0581 | 0.0614 | 0.0471 | 0.0226 | 0.0415 | 0.0675* | 9.93% |
| | HR@10 | 0.0578 | 0.0757 | 0.0787 | 0.0757 | 0.0830 | 0.0875 | 0.0668 | 0.0380 | 0.0654 | 0.0955* | 9.14% |
| | HR@20 | 0.0903 | 0.1040 | 0.1046 | 0.1070 | 0.1177 | 0.1224 | 0.0945 | 0.0635 | 0.0833 | 0.1342* | 9.64% |
| | nDCG@5 | 0.0243 | 0.0388 | 0.0402 | 0.0365 | 0.0404 | 0.0430 | 0.0329 | 0.0140 | 0.0262 | 0.0475* | 10.47% |
| | nDCG@10 | 0.0310 | 0.0456 | 0.0473 | 0.0440 | 0.0484 | 0.0515 | 0.0393 | 0.0189 | 0.0351 | 0.0566* | 9.90% |
| | nDCG@20 | 0.0392 | 0.0527 | 0.0538 | 0.0519 | 0.0571 | 0.0602 | 0.0463 | 0.0253 | 0.0408 | 0.0663* | 10.13% |
| Toys & Games | HR@5 | 0.0508 | 0.0593 | 0.0598 | 0.0620 | 0.0623 | 0.0618 | 0.0486 | 0.0168 | 0.0471 | 0.0653* | 4.82% |
| | HR@10 | 0.0713 | 0.0802 | 0.0827 | 0.0846 | 0.0871 | 0.0850 | 0.0667 | 0.0310 | 0.0650 | 0.0909* | 4.36% |
| | HR@20 | 0.1022 | 0.1064 | 0.1120 | 0.1114 | 0.1184 | 0.1179 | 0.0889 | 0.0528 | 0.0852 | 0.1223* | 3.29% |
| | nDCG@5 | 0.0357 | 0.0433 | 0.0435 | 0.0452 | 0.0444 | 0.0429 | 0.0354 | 0.0104 | 0.0343 | 0.0472* | 4.42% |
| | nDCG@10 | 0.0422 | 0.0501 | 0.0509 | 0.0525 | 0.0524 | 0.0503 | 0.0412 | 0.0150 | 0.0399 | 0.0555* | 5.71% |
| | nDCG@20 | 0.0500 | 0.0566 | 0.0582 | 0.0592 | 0.0602 | 0.0586 | 0.0468 | 0.0204 | 0.0449 | 0.0634* | 5.32% |
| Sports & Outdoors | HR@5 | 0.0204 | 0.0316 | 0.0339 | 0.0287 | 0.0292 | 0.0270 | 0.0251 | 0.0154 | 0.0224 | 0.0371* | 9.44% |
| | HR@10 | 0.0327 | 0.0456 | 0.0494 | 0.0431 | 0.0445 | 0.0424 | 0.0376 | 0.0253 | 0.0334 | 0.0541* | 9.51% |
| | HR@20 | 0.0522 | 0.0650 | 0.0718 | 0.0658 | 0.0667 | 0.0652 | 0.0551 | 0.0426 | 0.0503 | 0.0778* | 8.36% |
| | nDCG@5 | 0.0132 | 0.0218 | 0.0234 | 0.0191 | 0.0194 | 0.0181 | 0.0167 | 0.0100 | 0.0149 | 0.0253* | 8.12% |
| | nDCG@10 | 0.0171 | 0.0263 | 0.0285 | 0.0237 | 0.0243 | 0.0230 | 0.0207 | 0.0131 | 0.0186 | 0.0308* | 8.07% |
| | nDCG@20 | 0.0220 | 0.0312 | 0.0341 | 0.0294 | 0.0299 | 0.0287 | 0.0251 | 0.0174 | 0.0226 | 0.0367* | 7.62% |

Observations:

  • LLM-based methods generally outperform traditional SRS: E4SRec, ME, Concat, Concat&MLP, CTRL-MM, and MME-SID (all utilizing LLMs) generally show better performance compared to SASRec (a traditional SRS), indicating the significant potential of LLMs for sequential recommendation.
  • Single-modal LLM-based methods: E4SRec consistently improves upon SASRec. ME (Multi Embedding), which incorporates an additional randomly initialized collaborative embedding table, shows some improvement over E4SRec, but this enhancement is not significant across all datasets (e.g., Beauty and Toys & Games), suggesting that simply adding more single-modal information might not be enough.
  • Suboptimal use of multimodal data: Surprisingly, Concat, Concat&MLP, and CTRL-MM, despite using multimodal data, often perform worse than E4SRec (a single-modal LLM-based method). This suggests that naive concatenation or even simple contrastive alignment (as in CTRL-MM) of multimodal embeddings may not be optimal for LLM4SR, and can even degrade performance if not handled carefully.
  • Semantic ID-only methods struggle: TIGER-MM, MOTOR, and LETTER show comparably poor performance among multimodal methods. These methods primarily rely on using only semantic IDs for generative retrieval. This result challenges the common belief that semantic ID-only approaches are inherently superior without careful consideration of information preservation.
  • MME-SID's Superiority: The proposed MME-SID (labeled as Ours-full) consistently and significantly surpasses all baseline methods across all three datasets and all evaluation metrics.
    • On the Beauty dataset, MME-SID achieves an improvement of 10.47% on nDCG@5 over the best baseline (CTRL-MM).

    • On Toys & Games, it improves by 4.42% on nDCG@5 over Concat (the best baseline for this metric).

    • On Sports & Outdoors, it shows an 8.12% improvement on nDCG@5 over ME (the best baseline for this metric).

    • The * symbol indicates statistical significance ($P$-value $< 0.05$ in a t-test), validating the robustness of MME-SID's performance.

      These results strongly validate the efficacy of MME-SID, particularly its ability to effectively integrate multimodal embeddings and semantic IDs while addressing the challenges of embedding collapse and catastrophic forgetting.

6.2. Alleviating Embedding Collapse (RQ2)

RQ2: Do multimodal embeddings and semantic IDs contribute to alleviating embedding collapse?

To answer this, the paper compares five methods: SASRec, E4SRec, ME, SE-SID-MMD, and MME-SID. SE-SID-MMD is a variant of MME-SID that only uses the collaborative modality as input, specifically $f_{\mathrm{MLP}}\left(\left[ W_c \cdot \left( \mathcal{X} \cdot \mathsf{SG}(E_c) \right) + b_c,\ \sum_{l=1}^{L} E_{SID_c^l} \right]\right)$. Embedding collapse is measured by the singular values of the embedding table, where higher values indicate a lower degree of collapse. The results on the Beauty dataset are shown in Figure 3.

The VLM description for Figure 3: The image is a chart presenting the performance of sequence recommendation (SR) and the measurements of embedding collapse. Figure (a) shows the performance of various methods on the SR task with the y-axis representing nDCG@20; figure (b) illustrates the measurement of embedding collapse, with the x-axis as dimension index and the y-axis as the logarithmic singular values of the embeddings (normalized). These results are derived from the Beauty dataset.

Figure 3: (a) Sequential recommendation performance, where the y-axis is nDCG@20. (b) Embedding collapse measurement, where the x-axis is the dimension index and the y-axis is the logarithm of the singular value (normalized by the maximum value) of the embedding. Both are conducted on the Beauty dataset.

Analysis of Figure 3 (a) - Sequential Recommendation Performance (nDCG@20):

  • SASRec and E4SRec (single-modal, ID-based methods) show the lowest nDCG@20 scores, consistent with the overall performance table.
  • ME (multi-embedding for collaborative ID) performs slightly better than E4SRec.
  • SE-SID-MMD (using semantic IDs for collaborative modality with MMD loss) significantly outperforms SASRec, E4SRec, and ME. This indicates that a better way of handling collaborative embeddings (through semantic IDs and MMD) improves performance even without other modalities.
  • MME-SID (full model with multimodal embeddings and semantic IDs) achieves the highest nDCG@20 score, demonstrating the combined benefit of its proposed mechanisms.

Analysis of Figure 3 (b) - Embedding Collapse Measurement:

  • The y-axis represents the logarithm of singular values (normalized by the maximum value), and the x-axis is the dimension index of the embedding matrix. A flatter curve or higher singular values across dimensions indicate less collapse (a minimal sketch of this diagnostic appears after this list).

  • SASRec and E4SRec show a sharp drop in singular values after the 64th dimension (since their collaborative embedding dimension is $D_c = 64$). This signifies drastic embedding collapse, where effectively only 64 dimensions are meaningfully utilized in a much higher-dimensional LLM space ($D_{\mathrm{LLM}} = 4096$). The paper notes that over 98% of dimensions collapse.

  • ME also exhibits significant collapse, similar to E4SRec, indicating that adding another low-dimensional collaborative embedding table doesn't fundamentally solve the collapse issue.

  • SE-SID-MMD shows a better singular value distribution than SASRec, E4SRec, and ME after the 64th dimension, suggesting that incorporating semantic IDs even for a single modality helps to expand the effective embedding space.

  • MME-SID (input behavioral item embedding): This line shows the highest singular values and the slowest decay among all variants for the input behavioral item embeddings. This indicates that MME-SID effectively alleviates embedding collapse by leveraging multimodal embeddings and semantic IDs, leading to a more expressive and less redundant representation. The embedding space is better utilized.

  • MME-SID (target item): The singular values for the target item embedding $E_x$ in MME-SID also show a healthy distribution. The introduction of a new target item embedding table ($E_x$) is justified here, as it helps prevent collapse specifically for the target item representations used in scoring, which is crucial for overall performance.

    The paper also empirically analyzes the effect of nonlinear mappings (like ReLU) on embedding collapse, finding that they do not significantly improve matrix rank, can degrade recommendation accuracy, and catastrophic forgetting still occurs, likely because nonlinearity disrupts distance information.
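
As referenced in the measurement bullet above, the collapse diagnostic of Figure 3 (b) can be reproduced with a few lines of NumPy; the following is a minimal sketch under the assumption that the learned embedding table is available as a dense matrix (the function and variable names are illustrative).

```python
import numpy as np

def normalized_log_singular_values(embedding_table):
    """Collapse diagnostic as in Figure 3 (b): singular values of the embedding
    matrix, normalized by the largest one, then taken on a log scale.
    A sharp drop after dimension d means only ~d dimensions are effectively used.
    """
    # embedding_table: (num_items, embedding_dim), e.g. dimension 4096 in the LLM token space
    singular_values = np.linalg.svd(embedding_table, compute_uv=False)
    return np.log(singular_values / singular_values.max() + 1e-12)

# Example: a rank-64 table embedded in a 256-dim space collapses after index 64.
low_rank = np.random.randn(2000, 64) @ np.random.randn(64, 256)
curve = normalized_log_singular_values(low_rank)  # plot curve vs. dimension index
```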

Result 1: Solely relying on pre-trained low-dimensional collaborative embeddings in LLM4SR leads to embedding collapse. In contrast, MME-SID effectively alleviates this phenomenon and achieves better performance by adopting multimodal embeddings and semantic IDs.

6.3. MMD-based Reconstruction Loss (RQ3)

RQ3: What is the effect of MMD-based reconstruction loss?

To evaluate the effect of MMD as a reconstruction loss, the paper compares two model variants: SE-SID-MMD and SE-SID-MSE.

  • SE-SID-MMD: Uses MMD as the reconstruction loss in its RQ-VAE.
  • SE-SID-MSE: Uses Mean Squared Error (MSE) as the reconstruction loss in its RQ-VAE. Both variants only utilize the collaborative modality for input. Their SR performance on the Beauty dataset and embedding collapse measurements are shown in Figure 4.

The VLM description for Figure 4: The image is a chart comparing the performance of MMD and MSE in terms of (a) reconstruction loss and (b) embedding collapse. Chart (a) shows the effects of SE-SID-MSE and SE-SID-MMD under different nDCG metrics, while chart (b) presents the corresponding embedding collapse.

Figure 4: Comparison of MMD and MSE as the reconstruction loss on (a) sequential recommendation performance and (b) embedding collapse on Beauty dataset.

Analysis of Figure 4 (a) - Sequential Recommendation Performance:

  • SE-SID-MMD consistently outperforms SE-SID-MSE across all nDCG@k metrics. This suggests that using MMD for reconstruction loss leads to better SR performance.

Analysis of Figure 4 (b) - Embedding Collapse Measurement:

  • The singular value curves for SE-SID-MMD and SE-SID-MSE (blue and violet lines) show comparable degrees of embedding collapse. Although the semantic ID embeddings of SE-SID-MMD might show slightly lower collapse than SE-SID-MSE at certain points, the overall pattern is similar. This implies that the performance gain of SE-SID-MMD is not primarily due to a reduction in embedding collapse at this stage.

Analysis of Catastrophic Forgetting (Supplementary):

  • The paper further analyzes catastrophic forgetting using Kendall's tau ($\tau$). For SE-SID-MMD, the $\tau$ between the quantized collaborative embedding (after MM-RQ-VAE training) and the pre-trained collaborative embedding $E_c$ is 0.4436.

  • For SE-SID-MSE, this τ\tau value is 0.3714.

  • The higher $\tau$ value for SE-SID-MMD ($0.4436 > 0.3714$) indicates that MMD as a reconstruction loss is more effective at preserving the partial order of distance information from the original embeddings during quantization. This better preservation directly contributes to mitigating forgetting and, consequently, to the improved SR performance.

    Result 2: Compared with Mean Squared Error as the reconstruction loss, the Maximum Mean Discrepancy reconstruction loss enables the quantized embedding to better preserve the information (specifically, the partial order of behavioral-target item embedding distance), thereby achieving better recommendation performance. The improvement is attributed to the mitigation of catastrophic forgetting, rather than a significant reduction in embedding collapse at this stage.
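
To make the MMD reconstruction loss concrete, the following is a minimal sketch of a Gaussian-kernel MMD between a batch of reconstructed embeddings and the original embeddings, consistent with the kernel listed in the implementation details; the bandwidth value and the biased estimator are assumptions, not the paper's exact formulation.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """k(e, e') = exp(-||e - e'||^2 / (2 * sigma^2)) for all pairs of rows."""
    sq_dists = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def mmd_loss(reconstructed, original, sigma=1.0):
    """Biased empirical estimate of squared MMD between two embedding batches.

    Used as a reconstruction objective, matching the two distributions tends to
    preserve pairwise distance structure better than per-element MSE.
    """
    k_xx = gaussian_kernel(reconstructed, reconstructed, sigma).mean()
    k_yy = gaussian_kernel(original, original, sigma).mean()
    k_xy = gaussian_kernel(reconstructed, original, sigma).mean()
    return k_xx + k_yy - 2 * k_xy

# Example: a perfectly reconstructed batch has (near-)zero MMD.
e = torch.randn(32, 64)
print(mmd_loss(e, e).item())  # ~0.0
```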

6.4. Embedding Initialization (RQ4)

RQ4: Does using trained code embeddings for initialization mitigate catastrophic forgetting?

To address this, MME-SID is compared with a variant named MME-SID-random, which randomly initializes the embeddings of semantic IDs instead of using the trained code embeddings from MM-RQ-VAE. The performance comparison is shown in Figure 5 (a).

The VLM description for Figure 5: The image is a chart that illustrates the comparison of code embedding initialization (a) and an ablation study on the Beauty dataset (b). The y-axis represents nDCG@k, including performance metrics of methods such as MME-SID and E4SRec, where nDCG@5 and nDCG@20 values reflect the effectiveness of the recommendation system.

Figure 5: (a) Comparison of code embedding initialization, where the y-axis denotes nDCG@k. (b) Ablation study on the Beauty dataset, where the y-axis denotes nDCG@20.

Analysis of Figure 5 (a) - Comparison of Code Embedding Initialization (nDCG@k):

  • MME-SID consistently outperforms MME-SID-random across all nDCG@k metrics. This clearly demonstrates that initializing with trained code embeddings is crucial for better recommendation performance.

Analysis of Catastrophic Forgetting (Supplementary):

  • The Kendall's tau ($\tau$) between the fine-tuned collaborative embedding (from MME-SID-random) and the pre-trained collaborative embedding ($E_c$) is 0.0508. This extremely low value indicates severe catastrophic forgetting when semantic ID embeddings are randomly initialized and trained from scratch on the downstream task.

  • In contrast, for the full MME-SID model, the $\tau$ value after fine-tuning is 0.2727. This significantly higher value (though still lower than before fine-tuning, as some adaptation occurs) demonstrates that MME-SID substantially mitigates catastrophic forgetting by preserving a much larger portion of the original distance information. A minimal sketch of how such a $\tau$ can be computed appears after Result 3 below.

    Result 3: Simply discarding the pre-trained code embeddings and randomly initializing them on downstream tasks leads to catastrophic forgetting. The proposed MME-SID mitigates this phenomenon by initializing with the trained code embeddings, thereby effectively preserving the distance information and achieving superior recommendation performance.
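
The Kendall's tau diagnostic used in Sections 6.3 and 6.4 can be sketched as follows, under the assumption that $\tau$ is computed over pairwise behavioral-to-target item distances before and after quantization or fine-tuning; the pairing scheme and function names are illustrative, not the paper's exact protocol.

```python
import numpy as np
from scipy.stats import kendalltau

def distance_order_tau(emb_before, emb_after, pairs):
    """Kendall's tau between item-pair distances under two embedding tables.

    `pairs` is a list of (behavioral_item, target_item) index pairs; a higher tau
    means the partial order of distances is better preserved (less forgetting).
    The pairing scheme here is an assumption, not the paper's exact protocol.
    """
    d_before = [np.linalg.norm(emb_before[i] - emb_before[j]) for i, j in pairs]
    d_after = [np.linalg.norm(emb_after[i] - emb_after[j]) for i, j in pairs]
    tau, _ = kendalltau(d_before, d_after)
    return tau

# Example: identical tables preserve the ordering perfectly (tau = 1.0).
E = np.random.randn(100, 64)
print(distance_order_tau(E, E, [(0, 1), (2, 3), (4, 5), (6, 7)]))
```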

6.5. Ablation Study

An ablation study is conducted on the Beauty dataset to understand the contribution of different components of MME-SID. The results, measured by nDCG@20, are shown in Figure 5 (b).

Analysis of Figure 5 (b) - Ablation Study (nDCG@20):

  • SE-random: This variant uses only randomly initialized item ID embeddings as input (it has roughly the same parameter count as E4SRec, but its embeddings are randomly initialized rather than pre-trained). It achieves the worst performance, highlighting the importance of properly initialized or pre-trained embeddings and the severe impact of forgetting when random initialization is used without robust mechanisms for learning from scratch.
  • MME-random: This variant has the same number of input parameters as MME-SID but replaces the quantized embedding with a new, randomly initialized embedding table for each modality. It performs worse than MME-SID, indicating that the improvement of MME-SID is not merely due to an increased number of input parameters or multimodal input itself. Instead, the specific mechanism of leveraging intra- and inter-modal correlation learned by MM-RQ-VAE and initializing with its trained code embeddings is crucial.
  • w/o Fusion: This variant removes the multimodal frequency-aware fusion module from MME-SID. The performance drop relative to the full MME-SID model demonstrates the significance of this adaptive fusion mechanism: dynamically weighting modalities based on item frequency is beneficial, especially for handling diverse item characteristics (cold vs. warm items). An illustrative sketch of such a frequency-based gate follows this list.
  • MME-SID (full model): As expected, the full MME-SID model achieves the best performance, validating the collective effectiveness of all its proposed components in addressing embedding collapse, catastrophic forgetting, and adaptively leveraging multimodal information.
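
As noted in the w/o Fusion item above, the fusion module weights modality scores by item frequency. The following toy sketch illustrates one way such a frequency-based gate could look; it is purely illustrative, and the gate form, temperature, and equal split between content modalities are assumptions that do not reproduce the paper's fusion formula.

```python
import torch

def frequency_aware_fusion(scores_per_modality, item_frequency, temperature=1.0):
    """Toy illustration of frequency-aware fusion (not the paper's exact formula).

    scores_per_modality: dict of per-modality score tensors over candidate items,
    e.g. {"collab": (N,), "text": (N,), "visual": (N,)}.
    item_frequency: (N,) interaction counts; cold items lean more on content
    modalities, warm items more on the collaborative modality.
    """
    freq = torch.log1p(item_frequency.float())
    w_collab = torch.sigmoid(freq / temperature)   # increases with item popularity
    w_content = (1.0 - w_collab) / 2.0             # remainder split between text/visual
    return (w_collab * scores_per_modality["collab"]
            + w_content * scores_per_modality["text"]
            + w_content * scores_per_modality["visual"])
```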

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper rigorously identifies and systematically addresses two critical challenges in Large Language Model for Sequential Recommendation (LLM4SR): embedding collapse and catastrophic forgetting. To tackle these issues, the authors propose MME-SID, a novel framework that effectively integrates multimodal embeddings (collaborative, textual, visual) and semantic IDs.

Key components and findings include:

  1. Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE): This new model is central to MME-SID. It uses Maximum Mean Discrepancy (MMD) as a reconstruction loss to explicitly preserve intra-modal distance information and contrastive learning to capture inter-modal correlations.

  2. Mitigation of Catastrophic Forgetting: MME-SID successfully alleviates catastrophic forgetting by initializing the semantic ID embeddings with the trained code embeddings from MM-RQ-VAE, thus retaining valuable pre-learned knowledge.

  3. Alleviation of Embedding Collapse: By combining original multimodal embeddings with their semantic ID embeddings and using a dedicated target item embedding, MME-SID creates richer, less collapsed representations in the LLM's token space.

  4. Efficient Fine-tuning and Adaptive Fusion: The LLM is efficiently fine-tuned using LoRA and incorporates a multimodal frequency-aware fusion module, which adaptively combines modality scores based on item frequency, further boosting performance.

    Extensive experiments on three public Amazon datasets demonstrate the superior recommendation performance of MME-SID over strong baselines, validating its efficacy in tackling the identified challenges and unlocking the full potential of LLMs for sequential recommendation.

7.2. Limitations & Future Work

While the paper does not explicitly dedicate a section to "Limitations & Future Work," some aspects can be inferred from the "Discussions" and the scope of the current work:

  • Computational Cost of MM-RQ-VAE Training: While LoRA makes LLM fine-tuning efficient, the MM-RQ-VAE itself involves training a separate model to learn multimodal semantic IDs. This encoding stage, especially with MMD and contrastive learning, can be computationally intensive and require significant data, which might be a barrier for smaller datasets or resource-constrained environments.

  • Generalizability of MM-RQ-VAE Design: The MM-RQ-VAE is specifically designed for collaborative, textual, and visual modalities. While robust for these, adapting it to new modalities (e.g., audio, sensor data) might require redesign or careful extension of the alignment and reconstruction mechanisms.

  • Hyperparameter Sensitivity: The MM-RQ-VAE loss (Equation 10) involves multiple hyperparameters ($\alpha$, $\beta$, $\gamma$, $\epsilon$, and $\sigma^2$ for the Gaussian kernel). Optimizing these can be complex and dataset-dependent.

  • Interpretability of Fused Embeddings: While the fusion module is frequency-aware, a deeper analysis into why certain modalities are weighted more heavily for cold vs. warm items could provide further insights and potentially lead to more advanced fusion strategies.

  • Scalability of Multimodal Data Handling: For extremely large-scale industrial scenarios with billions of items, the storage and retrieval of multimodal embeddings (text, visual) for all items, even before quantization, can pose engineering challenges.

    Potential future research directions could include:

  • Exploring more advanced quantization techniques or alternative semantic ID generation methods that are even more efficient or robust.

  • Investigating self-supervised pre-training strategies for the multimodal embeddings directly within the LLM rather than relying on external encoders like LLM2CLIP, to potentially achieve end-to-end optimization.

  • Applying MME-SID to more diverse LLM architectures or even multimodal LLMs that inherently handle multiple modalities, to see how the embedding collapse and catastrophic forgetting issues manifest and can be mitigated in those contexts.

  • Developing more nuanced frequency-aware or context-aware fusion mechanisms that also consider user-specific preferences or current session context.

  • Extending the framework to address cold-start users more directly, beyond just cold-start items.

7.3. Personal Insights & Critique

This paper offers a highly rigorous and insightful analysis of two fundamental problems (embedding collapse and catastrophic forgetting) that arise when integrating LLMs into sequential recommendation, particularly when semantic IDs and multimodal embeddings are involved. Its strength lies in its systematic problem identification, clear theoretical grounding (e.g., rank analysis for collapse), and empirical validation with robust metrics (e.g., Kendall's tau for forgetting). The detailed breakdown of how MMD and contrastive learning are integrated into RQ-VAE is particularly illuminating, showcasing a deep understanding of information preservation in quantized representations.

The explicit emphasis on using trained code embeddings for initialization is a critical insight, challenging the prevalent practice of random initialization in semantic ID-based generative models. This simple yet profound change significantly impacts knowledge retention. Furthermore, the practical advantages highlighted in the "Discussions" section—such as flexible ranking, collision avoidance, and higher inference efficiency—demonstrate that MME-SID is not just theoretically sound but also offers compelling real-world benefits over existing generative retrieval methods.

Critiques and Areas for Improvement:

  1. Detailed Explanation of Equation 5: The structure of Equation 5 in the RQ-VAE loss, specifically one of its quantization terms, is somewhat unusual compared to standard `VQ-VAE` or `RQ-VAE` formulations, which typically include a `commitment loss` (e.g., $\| z - \mathsf{SG}(e) \|^2$) to ensure the encoder output commits to the codebook entries. While faithful to the paper, a deeper explanation of this specific term's purpose and derivation would benefit a beginner audience.
  2. Trade-off between Complexity and Performance: While MME-SID achieves superior performance, the encoding stage requires training MM-RQ-VAE with three modalities, MMD loss, and contrastive learning. This adds complexity. A clearer analysis of the marginal gains from each MM-RQ-VAE component (beyond MMD vs. MSE) could be insightful, potentially in a more fine-grained ablation study.
  3. Robustness to Data Quality: The framework heavily relies on the quality of multimodal embeddings generated by LLM2CLIP and SASRec. While LLM2CLIP is powerful, textual and visual data quality can vary greatly in real-world scenarios. An analysis of MME-SID's robustness to noisy or incomplete multimodal data would be valuable.
  4. Long-Term Impact of Forgetting Mitigation: While catastrophic forgetting is mitigated, the $\tau$ value for MME-SID after fine-tuning (0.2727) is still lower than that of the trained MM-RQ-VAE before fine-tuning (0.4436). This suggests some information loss still occurs during LLM adaptation. Future work could explore regularization techniques during LLM fine-tuning to further preserve the learned distance metrics.

Transferability and Applications: The core principles of MME-SID are highly transferable.

  • Beyond Recommendation: The strategy for mitigating embedding collapse (combining multiple feature sources and representations) and catastrophic forgetting (initializing with pre-trained code embeddings) could be applied to other LLM-based tasks that use discrete tokens or quantized representations, especially in domains like knowledge graph completion, multimodal generation, or information retrieval where semantic consistency and information retention are crucial.

  • Different LLM Architectures: The framework could be adapted to other LLMs beyond Llama3-8B-instruct, providing a generalizable approach for enhancing LLM performance in diverse applications.

  • Industrial Applications: The improved inference efficiency and avoidance of collision make MME-SID particularly attractive for industrial-scale recommendation systems, where fast, accurate, and scalable solutions are paramount.

    In conclusion, this paper presents a significant advancement in LLM4SR by tackling critical, previously unaddressed challenges. Its rigorous methodology and strong empirical results make it a valuable contribution to the field, offering both practical solutions and foundational insights for future research.
