Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs
TL;DR Summary
This study introduces MME-SID, a novel framework that empowers large language models for sequential recommendation with multimodal embeddings and semantic IDs, addressing embedding collapse and catastrophic forgetting and enhancing recommendation performance through a multimodal residual quantized variational autoencoder (MM-RQ-VAE).
Abstract
Sequential recommendation (SR) aims to capture users’ dynamic interests and sequential patterns based on their historical interactions. Recently, the powerful capabilities of large language models (LLMs) have driven their adoption in SR. However, we identify two critical challenges in existing LLM-based SR methods: 1) embedding collapse when incorporating pre-trained collaborative embeddings and 2) catastrophic forgetting of quantized embeddings when utilizing semantic IDs. These issues dampen the model scalability and lead to suboptimal recommendation performance. Therefore, based on LLMs like Llama3-8B-instruct, we introduce a novel SR framework named MME-SID, which integrates multimodal embeddings and quantized embeddings to mitigate embedding collapse. Additionally, we propose a Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) with maximum mean discrepancy as the reconstruction loss and contrastive learning for alignment, which effectively preserve intra-modal distance information and capture inter-modal correlations, respectively. To further alleviate catastrophic forgetting, we initialize the model with the trained multimodal code embeddings. Finally, we fine-tune the LLM efficiently using LoRA in a multimodal frequency-aware fusion manner. Extensive experiments on three public datasets validate the superior performance of MME-SID thanks to its capability to mitigate embedding collapse and catastrophic forgetting. The implementation code and datasets are publicly available for reproduction.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs
1.2. Authors
Yuhao Wang (City University of Hong Kong), Junwei Pan (Tencent Inc.), Xinhang Li (Tsinghua University), Maolin Wang (City University of Hong Kong), Yuan Wang (Tencent Inc.), Yue Liu (Tencent Inc.), Dapeng Liu (Tencent Inc.), Jie Jiang (Tencent Inc.), Xiangyu Zhao (City University of Hong Kong).
1.3. Journal/Conference
Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25). CIKM is a highly reputable and influential conference in the fields of information retrieval, knowledge management, and database systems. Its long history and rigorous review process make it a significant venue for publishing research in these areas, including recommender systems.
1.4. Publication Year
2025
1.5. Abstract
Sequential recommendation (SR) aims to capture users’ dynamic interests and sequential patterns based on their historical interactions. Recently, the powerful capabilities of large language models (LLMs) have driven their adoption in SR. However, the authors identify two critical challenges in existing LLM-based SR methods: 1) embedding collapse when incorporating pre-trained collaborative embeddings and 2) catastrophic forgetting of quantized embeddings when utilizing semantic IDs. These issues dampen model scalability and lead to suboptimal recommendation performance. To address these, the paper introduces a novel SR framework called MME-SID, based on LLMs like Llama3-8B-instruct. MME-SID integrates multimodal embeddings and quantized embeddings to mitigate embedding collapse. Additionally, it proposes a Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) with maximum mean discrepancy (MMD) as the reconstruction loss and contrastive learning for alignment, which effectively preserves intra-modal distance information and captures inter-modal correlations, respectively. To further alleviate catastrophic forgetting, the model is initialized with the trained multimodal code embeddings. Finally, the LLM is fine-tuned efficiently using LoRA in a multimodal frequency-aware fusion manner. Extensive experiments on three public datasets validate the superior performance of MME-SID due to its capability to mitigate embedding collapse and catastrophic forgetting.
1.6. Original Source Link
/files/papers/695777a34a1fbc163064c29i/paper.pdf This paper is published in the proceedings of CIKM '25.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is enhancing sequential recommendation (SR) by leveraging the powerful capabilities of Large Language Models (LLMs). SR aims to predict users' next interactions by understanding their dynamic interests and sequential behavior patterns from historical data. This is crucial for web applications like e-commerce and video platforms to drive engagement and profit.
While LLMs have shown great promise for SR due to their ability to comprehend semantic data, the authors identify two critical challenges in existing LLM-based SR (LLM4SR) methods:
- Embedding Collapse: This phenomenon, also known as dimensional collapse, occurs when item embeddings, especially pre-trained collaborative embeddings (which represent items based on user-item interaction data), are mapped into the high-dimensional LLM token space. This mapping can cause the embeddings to occupy only a low-dimensional subspace, leading to inefficient use of model capacity and limited scalability. This matters because it limits the richness and expressiveness of item representations within the LLM, hindering recommendation accuracy.
- Catastrophic Forgetting: This happens when LLM4SR methods utilize semantic IDs (discrete tokens representing item features) for items. Existing approaches often discard the learned code embeddings (the vector representations of these semantic IDs) after training a quantization model and then train new embeddings from scratch for downstream recommendation tasks. This results in a significant loss of previously learned knowledge, particularly the partial order information of distance between item embeddings, thus reducing performance.

The paper's innovative idea is to address these two challenges simultaneously by carefully integrating multimodal embeddings (collaborative, textual, visual) and semantic IDs in a way that preserves crucial information.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of LLM4SR:
- First to Identify and Systematically Address Key Challenges: It is the first work to specifically identify and systematically tackle the embedding collapse and catastrophic forgetting issues within LLM4SR, providing a new perspective on improving LLM performance in recommendation tasks.
- Novel Framework MME-SID: The paper proposes MME-SID, a novel framework that integrates multimodal embeddings and quantized embeddings (derived from semantic IDs) to mitigate embedding collapse. This approach leverages the rich information from different modalities (collaborative, textual, visual) to create more informative and less collapsed item representations for the LLM.
- Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE): A new quantization model, MM-RQ-VAE, is introduced to better preserve intra-modal distance information and capture inter-modal correlations. It achieves this through:
  - Using Maximum Mean Discrepancy (MMD) as the reconstruction loss, which explicitly aims to preserve the distribution of distances between original and quantized embeddings.
  - Adopting contrastive learning objectives to align quantized embeddings across different modalities (e.g., collaborative with textual and visual), thus learning meaningful cross-modal relationships.
- Mitigating Catastrophic Forgetting through Initialization: To combat catastrophic forgetting, MME-SID initializes the embeddings of multimodal semantic IDs using the trained code embeddings from the MM-RQ-VAE. This ensures that previously learned structural and distance information is retained when the LLM is fine-tuned.
- Efficient Multimodal Frequency-Aware Fine-tuning: The framework efficiently fine-tunes the LLM using LoRA and incorporates a multimodal frequency-aware fusion module. This module adaptively weighs the importance of different modalities based on item frequency, leading to better recommendation results, especially for cold or warm items.

The key conclusion is that MME-SID significantly surpasses existing LLM4SR methods on various performance metrics across three public datasets. This superior performance is directly attributed to its effective strategies for mitigating embedding collapse and catastrophic forgetting, thus truly unleashing the potential of LLMs for recommendation. Additionally, the proposed solution offers advantages in inference efficiency and avoids issues like collision commonly faced by generative retrieval methods using semantic IDs.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Sequential Recommendation (SR)
Sequential Recommendation (SR) is a subfield of recommender systems that focuses on predicting a user's next interaction (e.g., next item purchase, next video watched) based on their historical sequence of interactions. Unlike traditional recommender systems that might only consider static preferences or item popularity, SR models aim to capture dynamic user interests and temporal patterns. For example, if a user buys a camera, they might then be interested in camera lenses or tripods.
3.1.2. Large Language Models (LLMs)
Large Language Models (LLMs) are advanced artificial intelligence models, typically based on the transformer architecture, trained on vast amounts of text data. They are capable of understanding, generating, and processing human-like text, performing tasks such as natural language generation, summarization, translation, and question answering. In the context of recommendation, LLMs can process textual item descriptions, user reviews, and even user prompts to understand item semantics and user preferences more deeply. The paper specifically mentions using Llama3-8B-instruct, which is an instruction-tuned variant of the Llama 3 LLM with 8 billion parameters, designed to follow instructions effectively.
3.1.3. Embedding Collapse (Dimensional Collapse)
Embedding collapse, or dimensional collapse, is a phenomenon observed in deep learning models where the learned embeddings (vector representations of entities like items or users) occupy only a low-dimensional subspace within their intended high-dimensional embedding space. This means that despite being represented by high-dimensional vectors, the embeddings effectively behave as if they are in a much lower-dimensional space, leading to a loss of expressiveness and capacity. In recommender systems, if item embeddings collapse, many distinct items might end up with very similar representations, making it difficult for the model to differentiate between them and provide diverse or accurate recommendations.
3.1.4. Catastrophic Forgetting
Catastrophic forgetting (or catastrophic interference) is a common problem in neural networks where learning new information causes the network to forget previously learned information. In the context of LLM4SR and semantic IDs, this refers to the loss of valuable knowledge encoded in code embeddings (vector representations of discrete semantic IDs) when these embeddings are re-initialized and re-trained from scratch for a downstream LLM-based recommendation task. The paper specifically highlights the loss of partial order information of distance, meaning the relative relationships between item embeddings are forgotten.
3.1.5. Semantic IDs and Quantization
Semantic IDs are discrete tokens or codes used to represent items (or users) by capturing their semantic features. Instead of using a simple unique integer ID, an item might be represented by a sequence of semantic IDs (e.g., "electronics", "smartphone", "high-end"). This allows LLMs, which operate on discrete tokens, to directly process item information.
Quantization is the process of converting continuous input values (like dense item embeddings) into discrete representations (like semantic IDs). Residual Quantized Variational Autoencoder (RQ-VAE) is a specific type of quantization model that tokenizes and generates semantic IDs in a hierarchical manner. It works by progressively quantizing the residual error from the previous quantization step, using multiple levels of codebooks. Each codebook contains a set of code embeddings, and the quantization process finds the closest code embedding for a given input or residual.
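To make the hierarchical quantization concrete, here is a minimal NumPy sketch of the residual quantization step; the random codebooks, dimensions, and function names are illustrative assumptions, not the paper's implementation (the 4-level, 256-code setup mirrors the hyper-parameters reported later in Table 4).

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize a latent vector z into L semantic IDs via residual quantization.

    codebooks: list of L arrays, each of shape (K, d) holding K code embeddings.
    Returns the semantic IDs and the quantized embedding (sum of chosen codes).
    """
    residual = z.copy()
    semantic_ids, quantized = [], np.zeros_like(z)
    for codebook in codebooks:
        # Pick the code embedding closest (L2) to the current residual.
        sid = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        semantic_ids.append(sid)
        quantized += codebook[sid]
        # The next level quantizes what this level failed to explain.
        residual = residual - codebook[sid]
    return semantic_ids, quantized

rng = np.random.default_rng(0)
L, K, d = 4, 256, 64  # 4-level codebooks of size 256, matching the reported setup
codebooks = [rng.normal(size=(K, d)) for _ in range(L)]
z = rng.normal(size=d)
sids, z_hat = residual_quantize(z, codebooks)
# With trained (rather than random) codebooks, the residual error shrinks per level.
print(sids, np.linalg.norm(z - z_hat))
```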
3.1.6. Multimodal Embeddings
Multimodal embeddings refer to vector representations that incorporate information from multiple data modalities, such as collaborative (user-item interaction history), textual (item descriptions, reviews), and visual (item images). By combining these different perspectives, multimodal embeddings can provide a richer, more comprehensive, and robust representation of items, which helps address issues like the cold-start problem (where new items lack sufficient interaction data).
3.1.7. LoRA (Low-Rank Adaptation)
LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique for LLMs and other large neural networks. Instead of fine-tuning all the parameters of a large pre-trained model, LoRA injects trainable low-rank matrices into the transformer architecture's attention layers. This significantly reduces the number of trainable parameters, making fine-tuning much faster and requiring less computational resources, while often achieving performance comparable to full fine-tuning.
3.1.8. Maximum Mean Discrepancy (MMD)
Maximum Mean Discrepancy (MMD) is a statistical distance measure used to quantify the difference between two probability distributions. Unlike measures that compare points directly (like Mean Squared Error), MMD compares the "mean embeddings" of the distributions in a Reproducing Kernel Hilbert Space (RKHS). If two distributions are identical, their MMD will be zero. It's particularly useful when comparing complex distributions where explicit density estimation is difficult. A characteristic kernel is a type of kernel function (e.g., Gaussian kernel) for which MMD can uniquely determine if two distributions are identical.
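A minimal sketch of the empirical (biased) MMD estimator with a Gaussian kernel follows; the bandwidth, sample sizes, and names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X ~ P and Y ~ Q."""
    return (gaussian_kernel(X, X, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=(200, 8)), rng.normal(size=(200, 8)))                 # near zero
shifted = mmd2(rng.normal(size=(200, 8)), rng.normal(2.0, 1.0, size=(200, 8)))   # clearly > 0
print(f"{same:.4f} {shifted:.4f}")
```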
3.1.9. Contrastive Learning
Contrastive learning is a self-supervised learning paradigm where a model learns representations by pushing "similar" (positive) samples closer together in the embedding space and pulling "dissimilar" (negative) samples farther apart. InfoNCE loss is a common objective function used in contrastive learning. In the context of multimodal learning, contrastive learning can be used to align representations from different modalities (e.g., making the embedding of an item's image similar to its text description) or to align quantized embeddings from different modalities, as in this paper.
3.2. Previous Works
3.2.1. Traditional Sequential Recommenders
- SASRec [5]: A seminal work in SR, SASRec uses a self-attention mechanism to capture sequential patterns. It treats a user's interaction history as a sequence and predicts the next item by attending to relevant past items. It primarily relies on the collaborative modality (item IDs).
3.2.2. LLM-based Sequential Recommendation (LLM4SR)
- TALLRec [2]: Formulates SR as a text generation task and applies instruction tuning to LLMs. Users' historical interactions and item attributes are converted into natural language prompts, and the LLM is trained to generate the recommended items as text.
- E4SRec [10]: Adopts a linear projection to map pre-trained ID embeddings into the LLM token space. It aims to tackle the out-of-range generation problem, where LLMs might generate item IDs that do not exist in the item set.
- CoLLM [59] and LLaRA [13]: These works, similar to the Concat baseline, integrate collaborative embeddings into LLMs for recommendation, typically by mapping them into the LLM token embedding space and concatenating them.
- CTRL [9]: Connects collaborative and language models for CTR (Click-Through Rate) prediction. CTRL-MM (a baseline in this paper) adapts this idea for multimodal SR, explicitly aligning embeddings using the InfoNCE loss.
- MOTOR [56]: Replaces collaborative embeddings with token embeddings of vision and text features and uses a token cross-network for interaction modeling. It also uses semantic IDs for visual and textual embeddings.
3.2.3. Semantic IDs for Recommendation
- Generative Models using Semantic IDs (TIGER [35], Sun et al. [40], LETTER [43], Zheng et al. [66]): These methods learn to transform item embeddings into semantic IDs, which are then treated as new generative LLM tokens. TIGER uses content information to generate semantic token sequences for SR. However, a common drawback, as highlighted by the current paper, is that these methods typically discard the trained code embeddings after quantization and randomly initialize them for downstream tasks, leading to catastrophic forgetting. They also face issues such as collision (multiple items mapping to the same semantic ID sequence) and autoregressive inference latency.
- Semantic IDs as Auxiliary Information (QARM [30], Zhang et al. [58]): These works use vector quantization and residual quantization to generate quantitative codes as new features to enhance traditional recommender systems. However, their improvements are often limited by the constraints of the traditional model structure.
3.2.4. Multimodal Encoding
- Individual Vision/Text Encoders or Multimodal Encoders (BEiT3 [42], CLIP [34]): Previous works on multimodal recommendation often combine separate vision and text encoders or use multimodal encoders like BEiT3 or CLIP to process multimodal data. The paper notes two limitations: individual encoders may not share the same representation space and therefore require alignment, and some multimodal encoders (like CLIP's text encoder) have limited capability for long, complex texts.
- LLM2CLIP [52]: Enhances the original CLIP model by replacing its text encoder with a more powerful LLM (e.g., Llama3-8B). This allows CLIP to process longer and more complex textual information while retaining its cross-modal alignment capabilities. MME-SID adopts LLM2CLIP for multimodal embedding encoding.
3.3. Technological Evolution
The evolution of recommender systems has moved from traditional collaborative filtering, focusing solely on user-item interactions, to incorporating content-based features, and then to sequential recommendation with models like SASRec that capture temporal dynamics. More recently, the advent of powerful Large Language Models has led to LLM4SR, where LLMs are used to understand and generate recommendations based on natural language and item semantics. The integration of multimodal information (text, visual, collaborative) has been a parallel trend, aiming to enrich item representations and address cold-start issues. This paper sits at the intersection of these trends, specifically addressing the challenges that arise when combining LLMs, multimodal embeddings, and semantic IDs – namely embedding collapse and catastrophic forgetting. It refines the way semantic IDs are utilized to maximize information retention and LLM capacity.
3.4. Differentiation Analysis
MME-SID distinguishes itself from previous LLM4SR methods primarily by its novel approach to mitigating embedding collapse and catastrophic forgetting.

- Addressing Embedding Collapse:
  - Unlike methods relying solely on linear projections of low-dimensional collaborative embeddings (e.g., E4SRec, Concat), MME-SID leverages a combination of the original collaborative, textual, and visual embeddings and the embeddings of their semantic IDs. This multimodal, multi-representation input is designed to prevent the projected embeddings from collapsing into a low-dimensional subspace within the LLM token space.
  - The MM-RQ-VAE specifically aims to create quantized embeddings that better preserve intra-modal distance information and capture inter-modal correlations, which also contributes to richer, less collapsed representations.
- Mitigating Catastrophic Forgetting:
  - In contrast to approaches like TIGER-MM, MOTOR, or LETTER that discard trained code embeddings and randomly initialize them for downstream tasks, MME-SID directly initializes the semantic ID embeddings with the trained code embeddings from its MM-RQ-VAE. This crucial step ensures that the valuable partial order information of distance learned during the quantization process is retained, preventing catastrophic forgetting.
- Enhanced Semantic ID Utilization:
  - MME-SID proposes a more effective way to use semantic IDs. Instead of only generating semantic IDs for generative retrieval (which can suffer from collision and autoregressive latency), MME-SID uses multimodal semantic ID embeddings as part of a rich input to the LLM for direct scoring, allowing it to generate a ranking list over the whole item set flexibly and efficiently.
- Multimodal Frequency-Aware Fusion:
  - The multimodal frequency-aware fusion module is another innovation, allowing MME-SID to adaptively weigh different modalities based on item popularity, which is not common in other LLM4SR methods.

In summary, MME-SID goes beyond simply adopting LLMs or using semantic IDs; it fundamentally rethinks how item representations are constructed and fed into LLMs to address specific challenges that limit their potential in sequential recommendation.
4. Methodology
The proposed framework, MME-SID, aims to empower Large Language Models (LLMs) for sequential recommendation (SR) by mitigating embedding collapse and catastrophic forgetting. The overall framework, depicted in Figure 1, consists of two main stages: an Encoding Stage and a Fine-tuning Stage.
4.1. Principles
The core idea of MME-SID is to leverage both multimodal embeddings (collaborative, textual, visual) and semantic IDs with their trained code embeddings to construct more robust and informative item representations. This approach is guided by two main theoretical intuitions:
- Combating Embedding Collapse: By incorporating information from multiple modalities and distinct representations (original embeddings and quantized semantic ID embeddings), the model enriches the feature space and prevents item representations from collapsing into a low-dimensional subspace when projected into the LLM token space. The theoretical basis is that combining diverse, weakly correlated feature sources can increase the effective rank of the combined representation.
- Mitigating Catastrophic Forgetting: By explicitly initializing the semantic ID embeddings in the LLM with the code embeddings learned during the quantization process, the model retains the valuable distance and structural information captured in the original multimodal embeddings. This prevents the knowledge loss that occurs when these embeddings are randomly initialized. The use of Maximum Mean Discrepancy (MMD) as a reconstruction loss further strengthens this by ensuring that the quantized representations preserve the underlying distribution of distances.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Formulation (Section 2.1)
In sequential recommendation, the goal is to model a user's dynamic interests and sequential patterns.
Let $\mathcal{U}$ be the set of users and $\mathcal{V}$ be the set of items. For each user $u \in \mathcal{U}$, we have a behavioral item sequence $S_u = [v_1, v_2, \ldots, v_{N-1}]$, a target item $v_N$, and a true label $y_u$. A conventional sequential recommender system (SRS) takes $S_u$ as input, and the prediction $\hat{y}_u$ is obtained by multiplying the model's output with the target item embedding through a dot product. The model parameters are typically optimized by minimizing the Binary Cross Entropy (BCE) loss:

$ \mathcal{L}_{\mathrm{BCE}} = -\frac{1}{M} \sum_{u=1}^{M} \left[ y_u \log \hat{y}_u + (1 - y_u) \log (1 - \hat{y}_u) \right] $

where $M$ is the total number of users.
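To make the objective concrete, here is a minimal NumPy sketch of the BCE loss above; the variable names and example scores are illustrative, not from the paper.

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy averaged over users, as in the formulation above."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Hypothetical sigmoid-squashed dot-product scores for three users:
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(bce_loss(y_true, y_pred))  # lower is better
```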
4.2.2. RQ-VAE (Section 2.2)
Residual Quantized Variational Autoencoder (RQ-VAE) is a model designed to tokenize (convert to discrete codes) and generate semantic IDs for original embeddings in a hierarchical manner.
An original embedding $\boldsymbol{s}$ is first encoded into a latent semantic embedding $\boldsymbol{z}$. This is then quantized into a sequence of codes (or semantic IDs) through $L$-level codebooks. Each level $l$ has a codebook $\mathcal{C}^l = \{\boldsymbol{e}^l_k\}_{k=1}^{K}$, where the $\boldsymbol{e}^l_k$ are learnable code embeddings and $K$ is the codebook size.
The residual quantization process is formulated as:

$ SID^l = \arg\min_{k} \left\| \boldsymbol{r}_{l-1} - \boldsymbol{e}^l_k \right\|_2, \qquad \boldsymbol{r}_l = \boldsymbol{r}_{l-1} - CE_{SID^l} $

Here, $SID^l$ is the assigned semantic ID at the $l$-th level codebook, and $\boldsymbol{r}_{l-1}$ is the residual from the previous level, with $\boldsymbol{r}_0 = \boldsymbol{z}$. The L2 norm $\|\cdot\|_2$ indicates that $SID^l$ is chosen as the index of the code embedding $CE_{SID^l} = \boldsymbol{e}^l_{SID^l}$ that is closest to the current residual $\boldsymbol{r}_{l-1}$. The new residual $\boldsymbol{r}_l$ is then calculated by subtracting the chosen code embedding from $\boldsymbol{r}_{l-1}$.
Finally, the semantic IDs for the original embedding are $(SID^1, \ldots, SID^L)$. The quantized embedding is formed by summing the chosen code embeddings:

$ \hat{\boldsymbol{z}} = \sum_{l=1}^{L} CE_{SID^l} $

This is then further decoded into $\hat{\boldsymbol{s}}$ to reconstruct the original embedding $\boldsymbol{s}$.
The overall loss function for RQ-VAE is:

$ \mathcal{L} = \mathcal{L}_{\mathrm{Recon}} + \mathcal{L}_{\mathrm{RQ\text{-}VAE}} \quad (3) $

$ \mathcal{L}_{\mathrm{Recon}} = \left\| \boldsymbol{s} - \hat{\boldsymbol{s}} \right\|^2 \quad (4) $

$ \mathcal{L}_{\mathrm{RQ\text{-}VAE}} = \sum_{l=1}^{L} \left\| \mathrm{SG}(\boldsymbol{r}_{l-1}) - CE_{SID^l} \right\|^2 + \alpha \left\| \boldsymbol{r}_{l-1} - \mathrm{SG}(CE_{SID^l}) \right\|^2 \quad (5) $

Here, SG denotes the stop-gradient operation, meaning no gradient is propagated through that part of the expression, and $\alpha$ is a hyper-parameter.
- $\mathcal{L}_{\mathrm{Recon}}$ (Equation 4) is the reconstruction loss, typically Mean Squared Error (MSE), aiming to minimize the difference between the original embedding $\boldsymbol{s}$ and its reconstructed version $\hat{\boldsymbol{s}}$.
- $\mathcal{L}_{\mathrm{RQ\text{-}VAE}}$ (Equation 5) is the quantization loss. The first term updates the code embeddings to be close to the residuals (gradients flow only into the code embeddings), while the second, commitment-style term encourages the residuals, and hence the encoder, to stay close to the chosen code embeddings (the SG there ensures the code embedding is treated as a constant).
4.2.3. Preliminary Analysis (Section 3)
4.2.3.1. Embedding Collapse (Section 3.1)
The paper theoretically investigates embedding collapse when pre-trained collaborative embeddings are mapped into the LLM token space.
Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times p}$ be matrices. The following properties of matrix rank hold:

$ \mathrm{rank}(\mathbf{A}\mathbf{B}) \le \min\{\mathrm{rank}(\mathbf{A}), \mathrm{rank}(\mathbf{B})\}, \qquad \mathrm{rank}(\mathbf{A} + \mathbf{B}) \le \mathrm{rank}(\mathbf{A}) + \mathrm{rank}(\mathbf{B}) $

Consider a common scenario where a pre-trained collaborative embedding table $\mathbf{E} \in \mathbb{R}^{|\mathcal{V}| \times d}$ (where $|\mathcal{V}|$ is the number of items and $d$ is the embedding dimension) is projected into the LLM token space using a linear projection with weight $\mathbf{W} \in \mathbb{R}^{d \times d_{\mathrm{LLM}}}$ and bias $\boldsymbol{b}$, resulting in a projected embedding. The rank of this projected embedding satisfies:

$ \mathrm{rank}(\mathbf{E}\mathbf{W} + \boldsymbol{b}) \le \mathrm{rank}(\mathbf{E}\mathbf{W}) + 1 \le \min\{\mathrm{rank}(\mathbf{E}), \mathrm{rank}(\mathbf{W})\} + 1 \le d + 1 $

Since $\mathbf{E}$ typically has a low dimension (e.g., $d = 64$ or 128 in traditional SRS), the projected embedding will also be low-rank in the much higher-dimensional LLM space ($d_{\mathrm{LLM}} = 4096$ for Llama3-8B), leading to embedding collapse. This means the LLM cannot utilize the full dimensionality of its token embedding space.
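The bound is easy to verify numerically; the following is a minimal NumPy sketch under illustrative dimensions (the matrix sizes and random seed are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d, d_llm = 1000, 64, 4096

E = rng.normal(size=(n_items, d))        # pre-trained collaborative embeddings, rank <= 64
W = rng.normal(size=(d, d_llm))          # linear projection into the LLM token space
b = rng.normal(size=(d_llm,))            # the bias adds at most one more rank

projected = E @ W + b                    # shape (1000, 4096)
print(np.linalg.matrix_rank(projected))  # 65 = d + 1: the remaining ~4000 dims collapse
```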
4.2.3.2. Catastrophic Forgetting (Section 3.2)
To quantify catastrophic forgetting, the paper uses Kendall's tau ($\tau$) [6] to measure the loss or preservation of distance information. Kendall's tau assesses the concordance (similarity in ranking) between two sets of paired observations.

Given two embedding models $A$ and $B$, each item pair $(i, j)$ yields a distance under each model, e.g., $d^A_{ij}$ under $A$ and $d^B_{ij}$ under $B$. Kendall's tau over these paired distance variables is defined as:

$ \tau = \frac{\#(\mathrm{concordant\ pairs}) - \#(\mathrm{discordant\ pairs})}{\#(\mathrm{pairs})} $

Here, $\#(\cdot)$ denotes the count. A pair of samples is concordant if its relative order is the same under both models, and discordant if the relative order is opposite.

A preliminary experiment on the Amazon Beauty dataset showed that quantized embeddings from an RQ-VAE trained on SASRec's collaborative embeddings preserved a substantial share of the original distance information. However, when code embeddings were randomly initialized and fine-tuned on downstream tasks, $\tau$ dropped to 0.0550, indicating severe information loss. This empirically validates the catastrophic forgetting issue.
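This kind of measurement can be reproduced in a few lines with scipy (assumed available); the embeddings below are random stand-ins purely for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
original = rng.normal(size=(100, 64))                    # stand-in for pre-trained embeddings
quantized = original + 0.3 * rng.normal(size=(100, 64))  # stand-in for quantized embeddings

# Compare the orderings of all pairwise distances under the two embedding sets.
tau, _ = kendalltau(pdist(original), pdist(quantized))
print(f"Kendall's tau: {tau:.4f}")  # 1.0 would mean perfectly preserved distance order
```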
4.2.4. Encoding Stage (Section 4.2)
The Encoding Stage aims to obtain multimodal embeddings and their semantic IDs.
4.2.4.1. Multimodal Embedding Encoding (Section 4.2.1)
To get multimodal embeddings, the paper uses LLM2CLIP [52] as the multimodal encoder. LLM2CLIP enhances the original CLIP model by replacing its text encoder with a more powerful LLM (like Llama3-8B), enabling it to handle long and complex textual descriptions.

- LLM2CLIP takes multimodal attributes (e.g., product title, descriptions, images) of items as input.
- It outputs textual embeddings $\mathbf{E}^t \in \mathbb{R}^{|\mathcal{V}| \times d_t}$ and visual embeddings $\mathbf{E}^v \in \mathbb{R}^{|\mathcal{V}| \times d_v}$, where $d_t$ and $d_v$ are the embedding sizes and $|\mathcal{V}|$ is the number of items.
- Separately, a traditional SRS like SASRec [5] is trained on collaborative data (item IDs only). Its embedding table $\mathbf{E}^c \in \mathbb{R}^{|\mathcal{V}| \times d_c}$ is extracted, where $d_c$ is the collaborative embedding size.
4.2.4.2. Multimodal Embedding Quantization (Section 4.2.2)
The paper proposes a Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) to generate multimodal semantic IDs and address the drawbacks of existing methods (MSE reconstruction loss, not capturing inter-modal distinctions). The architecture is shown in Figure 2.
The VLM description for Figure 2: The image is a schematic diagram illustrating the integration process of multimodal embeddings and semantic ID in the MME-SID framework, including the encoding and decoding processes of collaborative embeddings, text embeddings, and visual embeddings, utilizing Llama3-8B and CLIP-ViT for multimodal representation. It also shows the reconstruction of quantized embeddings and their alignment relationships with collaborative and semantic IDs, highlighting the role of maximum mean discrepancy as the reconstruction loss.
For each modality (collaborative, textual, visual):
- The original embedding $\boldsymbol{s}^m$ is encoded into a semantic embedding $\boldsymbol{z}^m$.
- Through the $L$-level codebooks, the semantic IDs, the quantized embedding $\hat{\boldsymbol{z}}^m$, and the decoded quantized embedding $\hat{\boldsymbol{s}}^m$ are generated, following the RQ-VAE process described earlier.
1. MMD as Reconstruction Loss:
To explicitly improve the ability of the quantized embeddings to preserve information from the original embeddings, the paper proposes minimizing the Maximum Mean Discrepancy (MMD) between $\boldsymbol{s}$ and $\hat{\boldsymbol{s}}$ as the reconstruction loss. MMD [28, 36] measures the distance between two probability distributions $P$ and $Q$:

$ \mathrm{MMD}(P, Q) = \left\| \mu_P - \mu_Q \right\|_{\mathcal{H}} = \left\| \mathbb{E}_{x \sim P}[k(x, \cdot)] - \mathbb{E}_{y \sim Q}[k(y, \cdot)] \right\|_{\mathcal{H}} $

Here, $k(\cdot, \cdot)$ is a symmetric positive-definite kernel (a Gaussian kernel is used, which is a characteristic kernel), $\mathcal{H}$ is its unique reproducing kernel Hilbert space, and $\mu_P$ represents the mean embedding of distribution $P$ in $\mathcal{H}$. MMD with a characteristic kernel preserves all statistics of a distribution, making it more expressive than MSE (which only minimizes pointwise Euclidean distance).
2. Contrastive Learning for Alignment:
To capture inter-modality connections (correlations between different modalities), a contrastive learning objective, specifically the InfoNCE loss, is adopted. This loss aligns the quantized collaborative embedding $\hat{\boldsymbol{z}}^c$ with the quantized textual embedding $\hat{\boldsymbol{z}}^t$ and the quantized visual embedding $\hat{\boldsymbol{z}}^v$. Since LLM2CLIP has already aligned visual and textual information into the same embedding space, direct alignment between $\hat{\boldsymbol{z}}^t$ and $\hat{\boldsymbol{z}}^v$ is not needed.
The overall loss function of MM-RQ-VAE is:

$ \mathcal{L} = \sum_{m \in \{c, t, v\}} \mathrm{MMD}\big(\mathrm{SG}(\boldsymbol{s}^m), \hat{\boldsymbol{s}}^m\big) + \mathcal{L}_{\mathrm{RQ\text{-}VAE}} + \lambda_1 \mathcal{L}_{c\text{-}t} + \lambda_2 \mathcal{L}_{c\text{-}v} $

with the collaborative-textual InfoNCE alignment loss:

$ \mathcal{L}_{c\text{-}t} = -\sum_{i \in \mathcal{B}} \log \frac{\exp\big(\mathrm{sim}(\hat{\boldsymbol{z}}_i^c, \hat{\boldsymbol{z}}_i^t) / \tau\big)}{\sum_{j \in \mathcal{B}} \exp\big(\mathrm{sim}(\hat{\boldsymbol{z}}_i^c, \hat{\boldsymbol{z}}_j^t) / \tau\big)} $

And analogously for collaborative-visual alignment, $\mathcal{L}_{c\text{-}v}$, with $\hat{\boldsymbol{z}}^t$ replaced by $\hat{\boldsymbol{z}}^v$.

Where:
- $\mathcal{L}_{\mathrm{RQ\text{-}VAE}}$ is the RQ-VAE quantization loss (Equation 5), summed across modalities.
- $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity metric (e.g., cosine similarity).
- $\mathcal{B}$ denotes a batch of samples from the item set.
- $\mathrm{SG}$ is the stop-gradient operation, applied to the original embedding in the MMD loss to prevent gradients from flowing back into the multimodal encoders.
- $\lambda_1$ and $\lambda_2$ are hyper-parameters that balance the different loss terms.
- $\tau$ is the temperature coefficient used in the InfoNCE loss for contrastive learning.
4.2.5. Fine-tuning Stage (Section 4.3)
The Fine-tuning Stage focuses on efficiently tuning the LLM for the SR task.
4.2.5.1. Initialization for Catastrophic Forgetting
To address catastrophic forgetting, MME-SID initializes the embeddings of the multimodal semantic IDs directly with the code embeddings obtained from the trained MM-RQ-VAE. This ensures that the abundant intra-modal information (e.g., distances between behavioral and target item embeddings) learned during the encoding stage is preserved.
4.2.5.2. LLM Input Formulation
The input to the LLM consists of an {Instruction} (specifying the SR task) and a {Behavioral Item Sequence}. The representation of each item in the {Behavioral Item Sequence} is formulated as:

$ \mathrm{MLP}\Big(\Big[\, \mathrm{SG}(\boldsymbol{x}\mathbf{E}^m)\mathbf{W}^m + \boldsymbol{b}^m ,\ \sum_{l=1}^{L} CE^m_{SID^l} \,\Big]_{m \in \{c, t, v\}}\Big) $

Where:
- $\boldsymbol{x}$ is the one-hot vector of the behavioral item sequence.
- $\mathbf{E}^m$ is the original embedding table (collaborative $\mathbf{E}^c$, textual $\mathbf{E}^t$, or visual $\mathbf{E}^v$) for modality $m$.
- $\mathbf{W}^m$ and $\boldsymbol{b}^m$ are the weight and bias of a linear projection for each modality $m$.
- $\mathrm{SG}$ applies a stop gradient to the original embeddings, treating them as fixed features.
- $\sum_{l=1}^{L} CE^m_{SID^l}$ is the sum of the semantic ID embeddings for modality $m$, i.e., the quantized embedding initialized from MM-RQ-VAE.
- The square brackets $[\,\cdot\,, \cdot\,]$ denote the concatenation operation, combining the linearly projected original embedding and the sum of semantic ID embeddings for each modality.
- $\mathrm{MLP}$ is a Multi-Layer Perceptron that converts the concatenated feature into the LLM's token embedding dimension $d_{\mathrm{LLM}}$.

This input format is designed to simultaneously preserve distance information from the original embeddings and the hierarchical structure of semantic IDs across modalities, providing a rich, less collapsed, and less forgotten representation to the LLM.
4.2.5.3. Multimodal Frequency-Aware Fusion
Existing SR models often ignore that the importance of different modalities can vary for cold (infrequently interacted) or warm (frequently interacted) items. To address this, a multimodal frequency-aware fusion module is proposed.
First, the frequency $f$ of each item in the training set is recorded. Given the long-tail distribution of user-item interactions, $f$ is transformed and normalized:

$ \tilde{f} = \log(f + 1), \qquad \bar{f} = \frac{\tilde{f} - \min(\tilde{f})}{\max(\tilde{f}) - \min(\tilde{f})} $

$\tilde{f}$ is the logarithmic transformation of the frequency (adding 1 to handle zero frequency), and $\bar{f}$ is the min-max normalized frequency feature. Next, an MLP takes $\bar{f}$ as input and outputs fusion weights $\boldsymbol{w}$ for each target item.
Finally, the prediction score $\hat{y}$ for each target item $v_N$ is calculated using the last hidden state $\boldsymbol{h}$ of the LLM output:

$ \hat{y} = \boldsymbol{h} \cdot \boldsymbol{x}_{v_N}\mathbf{E}' + \sum_{m \in \{c, t, v\}} \big(w^m \odot \boldsymbol{h}\big) \cdot \big(\boldsymbol{x}_{v_N}\mathbf{E}^m \mathbf{W}^m + \boldsymbol{b}^m\big) $
Where:
- $m \in \{c, t, v\}$ refers to the collaborative, textual, and visual modalities.
- $\odot$ denotes the Hadamard product (element-wise multiplication).
- $\cdot$ denotes the dot product.
- $\mathbf{E}'$ is a new embedding table specifically for target items, introduced to further relieve potential collapse in the target item embeddings.
- The first term represents the score from the LLM's output interacting with the new target item embedding.
- The summation term represents the scores from the LLM's output interacting with the linearly projected original embeddings of each modality, weighted by the frequency-aware fusion weights $w^m$.

The BCE loss is then calculated using $\hat{y}$ and the true label $y$ to update the LLM. Notably, only a small proportion of all LLM parameters is updated, efficiently, using LoRA.
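A toy sketch of this frequency-aware scoring follows, under the assumption that the fusion weights are per-modality scalars produced from the normalized frequency; every name and dimension here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W_f = rng.normal(size=(1, 3))  # toy weight-MLP parameters: frequency -> 3 modality weights

def weight_mlp(f):
    """Map a (normalized) frequency feature to one fusion weight per modality."""
    return 1.0 / (1.0 + np.exp(-(f @ W_f)))  # sigmoid keeps weights in (0, 1)

def fusion_score(h, target_new_emb, target_mod_embs, freq):
    """Score a target item: new-table term plus frequency-weighted modality terms."""
    f_norm = np.log(freq + 1.0)                    # log transform for the long tail
    # (min-max normalization over all items is omitted in this single-item demo)
    w = weight_mlp(np.array([f_norm]))             # shape (3,): one weight per modality
    score = h @ target_new_emb                     # term from the new target table E'
    for w_m, e_m in zip(w, target_mod_embs):
        score += (w_m * h) @ e_m                   # frequency-weighted modality terms
    return score

h = rng.normal(size=64)                            # stand-in for the LLM's last hidden state
print(fusion_score(h, rng.normal(size=64),
                   [rng.normal(size=64) for _ in range(3)], freq=12))
```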
The overall framework of MME-SID is shown in Figure 1.
The VLM description for Figure 1: The image is a schematic diagram of the MME-SID framework, illustrating how the Large Language Model (LLM) integrates multimodal embeddings and quantized embeddings. The diagram shows several linear projections, including collaborative embeddings, textual embeddings, and visual embeddings, utilizing frequency-aware fusion for information integration.
4.2.6. Discussions (Section 4.4)
The paper highlights several advantages of MME-SID, particularly its potential to improve upon common suboptimal practices of using semantic IDs in generative retrieval or generative recommendation:
- Flexible Ranking: MME-SID can generate a ranking list over the entire item set and output the top-k most relevant items flexibly. This contrasts with methods like TIGER [35], which can only retrieve the most relevant item autoregressively (code by code), making them less efficient for full-list recommendations.
- No Collision Issue: MME-SID does not need to handle the collision problem, where multiple items map to the same sequence of semantic IDs, because the multimodal data and its rich representation naturally help discriminate between items. Existing methods like TIGER require extra computation and storage to ensure unique semantic ID sequences.
- Higher Inference Efficiency: If an LLM has token embedding dimension $d_{\mathrm{LLM}}$ and a user has $n$ behavioral items, where each item is encoded into $L$ semantic IDs:
  - Methods like TIGER require an input of $n \times L$ token embeddings of dimension $d_{\mathrm{LLM}}$, since each item is expanded into a sequence of $L$ tokens.
  - MME-SID only requires $n$ such token embeddings, because each item is represented by a single, comprehensive multimodal embedding that is less collapsed, less prone to forgetting, and more informative. This significantly improves inference efficiency.
5. Experimental Setup
5.1. Datasets
The experiments are conducted on three categories of the Amazon 5-core dataset [31]: Beauty, Toys & Games, and Sports & Outdoors.
- Source and Characteristics: These datasets are collected from Amazon, a large e-commerce platform. They represent user-item interactions where each user and item has at least 5 interactions.
- Task: The task is to predict whether a user will give a rating higher than 3 to a target item, effectively framing it as a binary classification problem for positive preference.
- Sparsity: The sparsity metric denotes the proportion of negative samples (label $y = 0$), indicating that positive interactions are rare.
- Splitting: For each user, the $(N-1)$-th item in their historical sequence is used as the target item for the training set, and the $N$-th item is used as the target item for the test set.
- Preprocessing: Items lacking a title or image in the original dataset are removed.
The following are the results from Table 1 of the original paper:
| Category | Users | Items | Interactions | Sparsity |
|---|---|---|---|---|
| Beauty | 22,332 | 12,086 | 198,215 | 99.93% |
| Toys & Games | 19,121 | 11,757 | 165,221 | 99.93% |
| Sports & Outdoors | 35,092 | 18,090 | 292,007 | 99.95% |
These datasets are well-suited for validating sequential recommendation methods, especially those incorporating multimodal information, due to their rich item attributes and real-world interaction patterns. The high sparsity also presents a realistic challenge for recommendation systems.
5.2. Evaluation Metrics
The performance of the models is evaluated using Hit Ratio (HR@k) and Normalized Discounted Cumulative Gain (nDCG@k) for $k \in \{5, 10, 20\}$. Kendall's tau ($\tau$) is also used for internal analysis of catastrophic forgetting.
5.2.1. Hit Ratio (HR@k)
- Conceptual Definition:
Hit Ratio (HR@k)measures the proportion of users for whom the target item is present within the top- recommended items. It's a straightforward measure of recall, indicating how often the system successfully recommends the relevant item within the top- choices. A higherHR@kmeans more users find their target item in the top recommendations. - Mathematical Formula: $ \mathrm{HR@k} = \frac{\text{Number of users with hit in top-k}}{\text{Total number of users}} $
- Symbol Explanation:
Number of users with hit in top-k: The count of unique users for whom the ground truth item appears in their generated top- recommendation list.Total number of users: The total number of users in the evaluation set.
5.2.2. Normalized Discounted Cumulative Gain (nDCG@k)
- Conceptual Definition:
Normalized Discounted Cumulative Gain (nDCG@k)is a ranking quality metric that accounts for the position of the relevant items in the recommendation list. It assigns higher scores to relevant items that appear at higher ranks (closer to the top).nDCGvalues range from 0 to 1, where 1 represents a perfect ranking. It's normalized to be comparable across different recommendation lists. - Mathematical Formula:
First,
DCG@k(Discounted Cumulative Gain at ) is calculated: $ \mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $ Then,nDCG@kis calculated by dividingDCG@kby theIdeal DCG@k(IDCG@k), which is theDCG@kfor a perfectly ordered list: $ \mathrm{nDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}} $ - Symbol Explanation:
- $k$: The number of top recommendations considered.
- $i$: The rank position in the recommendation list (starting from 1).
- $\mathrm{rel}_i$: The relevance score of the item at position $i$. In binary relevance scenarios (like this paper, where a rating > 3 implies relevance), $\mathrm{rel}_i$ is 1 if the item is relevant and 0 otherwise.
- $\log_2(i+1)$: The discount factor for items at lower ranks.
- $\mathrm{IDCG@k}$: The maximum possible DCG@k for a given set of relevant items, achieved by ranking all relevant items at the top of the list in decreasing order of relevance.
5.2.3. Kendall's tau ()
- Conceptual Definition:
Kendall's tauis a non-parametric statistic used to measure the ordinal association between two measured quantities. In this paper, it's used to quantify the concordance (similarity in ranking order) between the pairwise distances of original item embeddings and the pairwise distances of theirquantized embeddings. A higherKendall's tauvalue (closer to 1) indicates better preservation of the relative distance information. - Mathematical Formula (from Section 3.2 of the paper): $ \tau = { \frac { # ( { \mathrm { concordant ~ pairs } } ) - # ( { \mathrm { discordant ~ pairs } } ) } { # ( { \mathrm { pairs } } ) } } $
- Symbol Explanation:
- $\#(\mathrm{concordant\ pairs})$: The number of pairs of samples whose relative order is the same in both compared rankings.
- $\#(\mathrm{discordant\ pairs})$: The number of pairs of samples whose relative order is opposite in the two compared rankings.
- $\#(\mathrm{pairs})$: The total number of unique pairs of samples that can be formed from the data.
5.3. Baselines
The proposed MME-SID is compared against the following representative baseline methods. Their input formulations for the {Behavioral Item Sequence} are detailed in Table 2. Llama3-8B-instruct is adopted for all LLM-based methods, and RQ-VAE is used to generate semantic IDs where applicable.
The following are the results from Table 2 of the original paper:
| Method | Input (behavioral item sequence) |
|---|---|
| SASRec | Collaborative ID embeddings $\boldsymbol{x}\mathbf{E}$ |
| E4SRec | Linearly projected pre-trained ID embeddings $\mathrm{SG}(\boldsymbol{x}\mathbf{E})\mathbf{W} + \boldsymbol{b}$ |
| ME | Projected pre-trained ID embeddings plus a new, randomly initialized embedding table $\mathbf{E}^{\mathrm{new}}$ |
| Concat | Concatenation of the linearly projected collaborative, textual, and visual embeddings |
| Concat&MLP | The Concat input fed through an MLP |
| CTRL-MM | Same input as Concat&MLP, with InfoNCE alignment across modalities |
| TIGER-MM | Semantic IDs $SID^{l,m}$ of the collaborative, textual, and visual embeddings only |
| MOTOR | Semantic ID / token embeddings of the textual and visual features |
| LETTER | Semantic IDs produced with regularized item tokenization |
Where:

- $\boldsymbol{x}$: One-hot vector of the historical interaction (representing items in the sequence).
- $\mathbf{E}$: Collaborative embedding matrix.
- $\mathbf{E}^{\mathrm{new}}$: A new, randomly initialized collaborative embedding matrix (for ME).
- $\mathbf{E}^m$: Embedding matrix for modality $m$ (collaborative, textual, visual).
- $\mathbf{W}^m, \boldsymbol{b}^m$: Weight and bias of the linear projection for modality $m$.
- $\mathbf{W}', \boldsymbol{b}'$: An additional linear projection (for ME).
- $\mathrm{MLP}$: Multi-Layer Perceptron.
- $SID^{l,m}$: Semantic ID at the $l$-th codebook for modality $m$.
- $\mathrm{SG}$: Stop gradient operation.
- Square brackets $[\ ]$: Concatenation operation.
5.3.1. Single-Modal (Collaborative Only) Baselines
- SASRec [5]: The original SASRec model, which uses self-attention to model sequential patterns based on item IDs.
- E4SRec [10]: Adopts a linear projection of pre-trained ID embeddings to map them into the LLM token space, aiming to address out-of-range generation.
- Multi Embedding (ME): A baseline proposed by the authors. It takes as input both the linear projection of the pre-trained ID embeddings and a new set of randomly initialized ID embeddings, to test whether simply adding another embedding table helps.
5.3.2. Multimodal Baselines
- Concat: Maps pre-trained collaborative, textual, and visual embeddings to the LLM token embedding space via linear layers, then directly concatenates them. It represents approaches like CoLLM [59] and LLaRA [13].
- Concat&MLP: A typical multimodal fusion method [12, 53]. It concatenates the projected collaborative, textual, and visual embeddings of items, then feeds the result into an MLP before passing it to the LLM.
- CTRL-MM: Adapted from CTRL [9]. It has the same input as Concat&MLP but explicitly aligns the collaborative embedding with the textual and visual embeddings using InfoNCE as a contrastive learning loss.
- TIGER-MM: A multimodal variant adapted from TIGER [35]. It exclusively uses semantic IDs of collaborative, textual, and visual embeddings to perform generative retrieval, training a separate RQ-VAE for each modality to generate the semantic IDs.
- MOTOR [56]: Replaces collaborative embeddings with token embeddings of vision and text features. It obtains semantic IDs for visual and textual embeddings and uses SASRec as the downstream multimodal recommendation model.
- LETTER [43]: Implemented with TIGER as the backbone. It adopts various regularization methods, such as diversity, to achieve better item tokenization for generative recommendation.
5.4. Implementation Details
- Multimodal Encoding: Product titles and images are used.
- Embedding Dimensions: $d_c = 64$ for collaborative embeddings; the textual and visual embedding dimensions $d_t$ and $d_v$ are those produced by the LLM2CLIP encoder.
- Kernel for MMD: A Gaussian kernel is adopted, which is a characteristic kernel.
- LLM Backbone: Llama3-8B-instruct is used, with token embedding dimension $d_{\mathrm{LLM}} = 4096$. This instruction-tuned model is chosen for its better instruction-following capability.
- Hardware: All experiments are conducted on A100 GPUs.
- Runs: Results are averaged over 3 runs to ensure robustness.
- Optimizer: AdamW [29] is used.
- Hyper-parameters:
  - $\alpha$, $\lambda_1$, and $\lambda_2$ balance the MM-RQ-VAE loss terms.
  - LoRA target modules are [gate_proj, down_proj, up_proj].
  - Only a small proportion of the total LLM parameters is updated during fine-tuning with LoRA.

The following are the results from Table 4 of the original paper:
Dataset Beauty Toys & Games Sports & Outdoors Training epochs 3 3 2 Learning rate 3e-4 2e-4 2e-4 Batch size 16 16 16 LoRA rank 8 8 8 LoRA alpha 16 16 16 LoRA dropout 0.05 0.05 0.05 Warm-up steps 100 100 200 Number of codes 256 256 300 Level of codebooks 4 4 4
-
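For reference, the reported LoRA settings map onto a Hugging Face peft configuration roughly as follows; this is a hedged sketch (the paper's actual training code may differ), and the `peft` library is assumed to be installed.

```python
from peft import LoraConfig, TaskType

# LoRA hyper-parameters as reported in Table 4 (identical across the three datasets).
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                    # LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["gate_proj", "down_proj", "up_proj"],  # MLP projections of Llama3
)
# get_peft_model(base_model, lora_config) would then wrap Llama3-8B-instruct so that
# only the low-rank adapters (a small fraction of all parameters) are trainable.
```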
6. Results & Analysis
6.1. Core Results Analysis (RQ1)
RQ1: What is the performance of the proposed MME-SID compared with baseline methods?
The overall performance comparison of MME-SID with various baselines on the three Amazon datasets is presented in Table 3. The metrics used are HR@5, HR@10, HR@20, nDCG@5, nDCG@10, and nDCG@20.
The following are the results from Table 3 of the original paper:
| Datasets | Metric | SASRec | E4SRec | ME | Concat | Concat&MLP | CTRL-MM | TIGER-MM | MOTOR | LETTER | Ours-full | Impr. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | HR@5 | 0.0368 | 0.0545 | 0.0567 | 0.0523 | 0.0581 | 0.0614 | 0.0471 | 0.0226 | 0.0415 | 0.0675* | 9.93% |
| HR@10 | 0.0578 | 0.0757 | 0.0787 | 0.0757 | 0.0830 | 0.0875 | 0.0668 | 0.0380 | 0.0654 | 0.0955* | 9.14% | |
| HR@20 | 0.0903 | 0.1040 | 0.1046 | 0.1070 | 0.1177 | 0.1224 | 0.0945 | 0.0635 | 0.0833 | 0.1342* | 9.64% | |
| nDCG@5 | 0.0243 | 0.0388 | 0.0402 | 0.0365 | 0.0404 | 0.0430 | 0.0329 | 0.0140 | 0.0262 | 0.0475* | 10.47% | |
| nDCG@10 | 0.0310 | 0.0456 | 0.0473 | 0.0440 | 0.0484 | 0.0515 | 0.0393 | 0.0189 | 0.0351 | 0.0566* | 9.90% | |
| nDCG@20 | 0.0392 | 0.0527 | 0.0538 | 0.0519 | 0.0571 | 0.0602 | 0.0463 | 0.0253 | 0.0408 | 0.0663* | 10.13% | |
| Toys & Games | HR@5 | 0.0508 | 0.0593 | 0.0598 | 0.0620 | 0.0623 | 0.0618 | 0.0486 | 0.0168 | 0.0471 | 0.0653* | 4.82% |
| HR@10 | 0.0713 | 0.0802 | 0.0827 | 0.0846 | 0.0871 | 0.0850 | 0.0667 | 0.0310 | 0.0650 | 0.0909* | 4.36% | |
| HR@20 | 0.1022 | 0.1064 | 0.1120 | 0.1114 | 0.1184 | 0.1179 | 0.0889 | 0.0528 | 0.0852 | 0.1223* | 3.29% | |
| nDCG@5 | 0.0357 | 0.0433 | 0.0435 | 0.0452 | 0.0444 | 0.0429 | 0.0354 | 0.0104 | 0.0343 | 0.0472* | 4.42% | |
| nDCG@10 | 0.0422 | 0.0501 | 0.0509 | 0.0525 | 0.0524 | 0.0503 | 0.0412 | 0.0150 | 0.0399 | 0.0555* | 5.71% | |
| nDCG@20 | 0.0500 | 0.0566 | 0.0582 | 0.0592 | 0.0602 | 0.0586 | 0.0468 | 0.0204 | 0.0449 | 0.0634* | 5.32% | |
| Sports & Outdoors | HR@5 | 0.0204 | 0.0316 | 0.0339 | 0.0287 | 0.0292 | 0.0270 | 0.0251 | 0.0154 | 0.0224 | 0.0371* | 9.44% |
| HR@10 | 0.0327 | 0.0456 | 0.0494 | 0.0431 | 0.0445 | 0.0424 | 0.0376 | 0.0253 | 0.0334 | 0.0541* | 9.51% | |
| HR@20 | 0.0522 | 0.0650 | 0.0718 | 0.0658 | 0.0667 | 0.0652 | 0.0551 | 0.0426 | 0.0503 | 0.0778* | 8.36% | |
| nDCG@5 | 0.0132 | 0.0218 | 0.0234 | 0.0191 | 0.0194 | 0.0181 | 0.0167 | 0.0100 | 0.0149 | 0.0253* | 8.12% | |
| nDCG@10 | 0.0171 | 0.0263 | 0.0285 | 0.0237 | 0.0243 | 0.0230 | 0.0207 | 0.0131 | 0.0186 | 0.0308* | 8.07% | |
| nDCG@20 | 0.0220 | 0.0312 | 0.0341 | 0.0294 | 0.0299 | 0.0287 | 0.0251 | 0.0174 | 0.0226 | 0.0367* | 7.62% |
Observations:
- LLM-based methods generally outperform traditional SRS: E4SRec, ME, Concat, Concat&MLP, CTRL-MM, and MME-SID (all utilizing LLMs) generally show better performance than SASRec (a traditional SRS), indicating the significant potential of LLMs for sequential recommendation.
- Single-modal LLM-based methods: E4SRec consistently improves upon SASRec. ME (Multi Embedding), which incorporates an additional randomly initialized collaborative embedding table, shows some improvement over E4SRec, but the enhancement is not significant on all datasets (e.g., Beauty and Toys & Games), suggesting that simply adding more single-modal information is not enough.
- Suboptimal use of multimodal data: Surprisingly, Concat, Concat&MLP, and CTRL-MM, despite using multimodal data, often perform worse than E4SRec (a single-modal LLM-based method). This suggests that naive concatenation, or even simple contrastive alignment (as in CTRL-MM), of multimodal embeddings may not be optimal for LLM4SR and can even degrade performance if not handled carefully.
- Semantic ID-only methods struggle: TIGER-MM, MOTOR, and LETTER show comparably poor performance among the multimodal methods. These methods rely solely on semantic IDs for generative retrieval. This result challenges the common belief that semantic-ID-only approaches are inherently superior without careful consideration of information preservation.
- MME-SID's Superiority: The proposed MME-SID (labeled as Ours-full) consistently and significantly surpasses all baseline methods across all three datasets and all evaluation metrics.
  - On the Beauty dataset, MME-SID achieves a 10.47% improvement on nDCG@5 over the best baseline (CTRL-MM).
  - On Toys & Games, it improves by 4.42% on nDCG@5 over Concat (the best baseline for this metric).
  - On Sports & Outdoors, it shows an 8.12% improvement on nDCG@5 over ME (the best baseline for this metric).
  - The * symbol indicates statistical significance (p-value in a t-test), validating the robustness of MME-SID's performance.

These results strongly validate the efficacy of MME-SID, particularly its ability to effectively integrate multimodal embeddings and semantic IDs while addressing the challenges of embedding collapse and catastrophic forgetting.
6.2. Alleviating Embedding Collapse (RQ2)
RQ2: Do multimodal embeddings and semantic IDs contribute to alleviating embedding collapse?
To answer this, the paper compares five methods: SASRec, E4SRec, ME, SE-SID-MMD, and MME-SID. SE-SID-MMD is a variant of MME-SID that uses only the collaborative modality as input (the projected collaborative embedding concatenated with the sum of its semantic ID embeddings). Embedding collapse is measured via the singular values of the embedding table, where larger singular values across dimensions indicate a lower degree of collapse. The results on the Beauty dataset are shown in Figure 3.
The VLM description for Figure 3: The image is a chart presenting the performance of sequence recommendation (SR) and the measurements of embedding collapse. Figure (a) shows the performance of various methods on the SR task with the y-axis representing nDCG@20; figure (b) illustrates the measurement of embedding collapse, with the x-axis as dimension index and the y-axis as the logarithmic singular values of the embeddings (normalized). These results are derived from the Beauty dataset.

Analysis of Figure 3 (a) - Sequential Recommendation Performance (nDCG@20):
- SASRec and E4SRec (single-modal, ID-based methods) show the lowest nDCG@20 scores, consistent with the overall performance table.
- ME (multi-embedding for collaborative IDs) performs slightly better than E4SRec.
- SE-SID-MMD (using semantic IDs for the collaborative modality with the MMD loss) significantly outperforms SASRec, E4SRec, and ME. This indicates that a better way of handling collaborative embeddings (through semantic IDs and MMD) improves performance even without other modalities.
- MME-SID (the full model with multimodal embeddings and semantic IDs) achieves the highest nDCG@20 score, demonstrating the combined benefit of its proposed mechanisms.
Analysis of Figure 3 (b) - Embedding Collapse Measurement:
- The y-axis represents the logarithm of singular values (normalized by the maximum value), and the x-axis is the dimension index of the embedding matrix. A flatter curve, or higher singular values across dimensions, indicates less collapse.
- SASRec and E4SRec show a sharp drop in singular values after the 64th dimension (their collaborative embedding dimension is 64). This signifies drastic embedding collapse: effectively only 64 dimensions are meaningfully utilized in the much higher-dimensional LLM space ($d_{\mathrm{LLM}} = 4096$), i.e., over 98% of the dimensions collapse.
- ME also exhibits significant collapse, similar to E4SRec, indicating that adding another low-dimensional collaborative embedding table does not fundamentally solve the collapse issue.
- SE-SID-MMD shows a better singular value distribution than SASRec, E4SRec, and ME beyond the 64th dimension, suggesting that incorporating semantic IDs even for a single modality helps expand the effective embedding space.
- MME-SID (input behavioral item embedding): This curve shows the highest singular values and the slowest decay among all variants for the input behavioral item embeddings. This indicates that MME-SID effectively alleviates embedding collapse by leveraging multimodal embeddings and semantic IDs, leading to a more expressive, less redundant representation that better utilizes the embedding space.
- MME-SID (target item): The singular values of the target item embeddings in MME-SID also show a healthy distribution. This justifies the introduction of the new target item embedding table $\mathbf{E}'$, which helps prevent collapse specifically in the target item representations used for scoring, which is crucial for overall performance.

The paper also empirically analyzes the effect of nonlinear mappings (like ReLU) on embedding collapse, finding that they do not significantly improve matrix rank, can degrade recommendation accuracy, and that catastrophic forgetting still occurs, likely because nonlinearity disrupts distance information.
Result 1: Solely relying on pre-trained low-dimensional collaborative embeddings in LLM4SR leads to embedding collapse. In contrast, MME-SID effectively alleviates this phenomenon and achieves better performance by adopting multimodal embeddings and semantic IDs.
6.3. MMD-based Reconstruction Loss (RQ3)
RQ3: What is the effect of MMD-based reconstruction loss?
To evaluate the effect of MMD as a reconstruction loss, the paper compares two model variants: SE-SID-MMD and SE-SID-MSE.
- SE-SID-MMD: Uses MMD as the reconstruction loss in its RQ-VAE.
- SE-SID-MSE: Uses Mean Squared Error (MSE) as the reconstruction loss in its RQ-VAE.

Both variants use only the collaborative modality as input. Their SR performance on the Beauty dataset and embedding collapse measurements are shown in Figure 4.
The VLM description for Figure 4: The image is a chart comparing the performance of MMD and MSE in terms of (a) reconstruction loss and (b) embedding collapse. Chart (a) shows the effects of SE-SID-MSE and SE-SID-MMD under different nDCG metrics, while chart (b) presents the corresponding embedding collapse.

Analysis of Figure 4 (a) - Sequential Recommendation Performance:
- SE-SID-MMD consistently outperforms SE-SID-MSE across all nDCG@k metrics, suggesting that using MMD as the reconstruction loss leads to better SR performance.
Analysis of Figure 4 (b) - Embedding Collapse Measurement:
- The singular value curves for SE-SID-MMD and SE-SID-MSE (the blue and violet lines) show comparable degrees of embedding collapse. Although the semantic ID embeddings of SE-SID-MMD exhibit slightly less collapse than those of SE-SID-MSE at certain points, the overall pattern is similar. This implies that the performance gain of SE-SID-MMD is not primarily due to a reduction in embedding collapse at this stage.
Analysis of Catastrophic Forgetting (Supplementary):
- The paper further analyzes catastrophic forgetting using Kendall's tau (τ). For SE-SID-MMD, the τ between the quantized collaborative embedding (after MM-RQ-VAE training) and the pre-trained collaborative embedding is 0.4436.
- For SE-SID-MSE, this value is 0.3714.
- The higher value for SE-SID-MMD (0.4436 vs. 0.3714) indicates that MMD as a reconstruction loss is more effective at preserving the partial order of distance information from the original embeddings during quantization. This better preservation directly contributes to mitigating forgetting and, consequently, to the improved SR performance.

Result 2: Compared with Mean Squared Error as the reconstruction loss, the Maximum Mean Discrepancy reconstruction loss enables the quantized embedding to better preserve the information (specifically, the partial order of behavioral-target item embedding distance), thereby achieving better recommendation performance. The improvement is attributed to the mitigation of catastrophic forgetting, rather than a significant reduction in embedding collapse at this stage.
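For readers unfamiliar with MMD, a minimal sketch of a Gaussian-kernel MMD loss follows. The paper's Equation (10) may use a different kernel choice or weighting, so treat this as an illustration of the idea rather than the exact objective:

```python
import torch

def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0):
    # Kernel matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2)).
    d2 = torch.cdist(x, y).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0):
    """Biased batch estimate of squared MMD between samples x and y.
    Because the Gaussian kernel depends only on pairwise distances,
    minimizing MMD encourages the reconstruction to match the distance
    structure of the original embeddings, not just pointwise values."""
    k_xx = gaussian_kernel(x, x, sigma).mean()
    k_yy = gaussian_kernel(y, y, sigma).mean()
    k_xy = gaussian_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2 * k_xy

# Usage inside an RQ-VAE training step (names are illustrative):
# recon = decoder(quantize(encoder(emb)))
# loss_recon = mmd_loss(recon, emb)
```

This contrasts with MSE, which penalizes coordinate-wise errors and carries no explicit incentive to preserve the relative ordering of distances between items.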
6.4. Embedding Initialization (RQ4)
RQ4: Does using trained code embeddings for initialization mitigate catastrophic forgetting?
To address this, MME-SID is compared with a variant named MME-SID-random, which randomly initializes the embeddings of semantic IDs instead of using the trained code embeddings from MM-RQ-VAE. The performance comparison is shown in Figure 5 (a).
Figure 5 illustrates (a) the comparison of code embedding initialization and (b) an ablation study on the Beauty dataset. The y-axis is nDCG@k (nDCG@5 and nDCG@20), covering methods such as MME-SID and E4SRec.

Analysis of Figure 5 (a) - Comparison of Code Embedding Initialization (nDCG@k):
- MME-SID consistently outperforms MME-SID-random across all nDCG@k metrics. This clearly demonstrates that initializing with trained code embeddings is crucial for better recommendation performance.
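Concretely, the warm start amounts to an embedding-table copy at model construction time. A minimal PyTorch sketch, with hypothetical shapes and file name:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 256 codes per quantization level, LLM hidden size 4096.
num_codes, dim = 256, 4096
semantic_id_emb = nn.Embedding(num_codes, dim)

# MME-SID-style warm start: copy the code embeddings trained by MM-RQ-VAE
# (the file name is hypothetical) instead of keeping the default random init.
trained_codebook = torch.load("mm_rq_vae_codebook.pt")  # (num_codes, dim)
with torch.no_grad():
    semantic_id_emb.weight.copy_(trained_codebook)
```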
Analysis of Catastrophic Forgetting (Supplementary):
- The Kendall's tau (τ) between the fine-tuned collaborative embedding (from MME-SID-random) and the pre-trained collaborative embedding is 0.0508. This extremely low value indicates severe catastrophic forgetting when semantic ID embeddings are randomly initialized and trained from scratch on the downstream task.
- In contrast, for the full MME-SID model, the τ after fine-tuning is 0.2727. This significantly higher value (though still lower than before fine-tuning, as some adaptation occurs) demonstrates that MME-SID substantially mitigates catastrophic forgetting by preserving a much larger portion of the original distance information.

Result 3: Simply discarding the pre-trained code embeddings and randomly initializing them on downstream tasks leads to catastrophic forgetting. The proposed MME-SID mitigates this phenomenon by initializing with the trained code embeddings, thereby effectively preserving the distance information and achieving superior recommendation performance.
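The Kendall's tau diagnostic used in this and the previous subsection can be approximated as follows. This sketch samples random item pairs rather than the paper's exact behavioral-target pairs, so it is a simplified variant:

```python
import numpy as np
from scipy.stats import kendalltau

def distance_order_tau(emb_before: np.ndarray, emb_after: np.ndarray,
                       n_pairs: int = 5000, seed: int = 0) -> float:
    """Kendall's tau between the pairwise-distance rankings of the same
    items before and after fine-tuning. tau near 1: distance order is
    preserved; tau near 0 (cf. 0.0508 for random init): severe forgetting."""
    rng = np.random.default_rng(seed)
    n = emb_before.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    d_before = np.linalg.norm(emb_before[i] - emb_before[j], axis=1)
    d_after = np.linalg.norm(emb_after[i] - emb_after[j], axis=1)
    tau, _ = kendalltau(d_before, d_after)
    return tau
```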
6.5. Ablation Study
An ablation study is conducted on the Beauty dataset to understand the contribution of different components of MME-SID. The results, measured by nDCG@20, are shown in Figure 5 (b).
Analysis of Figure 5 (b) - Ablation Study (nDCG@20):
- SE-random: This variant uses only randomly initialized item ID embeddings as input (similar parameter count to E4SRec, but with random initialization). It achieves the worst performance, highlighting the importance of properly initialized or pre-trained embeddings and the severe impact of forgetting when random initialization is used without robust mechanisms to learn from scratch.
- MME-random: This variant has the same number of input parameters as MME-SID but replaces the quantized embedding with a new, randomly initialized embedding table for each modality. It performs worse than MME-SID, indicating that the improvement of MME-SID is not merely due to an increased number of input parameters or multimodal input itself. Instead, the specific mechanism of leveraging intra- and inter-modal correlation learned by MM-RQ-VAE and initializing with its trained code embeddings is crucial.
- w/o Fusion: This variant removes the multimodal frequency-aware fusion module from MME-SID. The performance drop relative to the full model demonstrates the significance of this adaptive fusion mechanism: dynamically weighting modalities based on item frequency is beneficial, especially for handling diverse item characteristics (cold vs. warm items). A sketch of such a gate follows this list.
- MME-SID (full model): As expected, the full MME-SID model achieves the best performance, validating the collective effectiveness of all proposed components in addressing embedding collapse, catastrophic forgetting, and adaptively leveraging multimodal information.
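As referenced in the w/o Fusion item above, a frequency-aware gate can be sketched as below. The paper's exact parameterization is not reproduced here; the class name and the choice of a softmax over log-frequency are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FrequencyAwareFusion(nn.Module):
    """Illustrative only: weights per-modality scores by a gate conditioned
    on (log) item frequency, so cold items can lean on textual/visual
    signals and warm items on collaborative signals."""
    def __init__(self, n_modalities: int = 3):
        super().__init__()
        self.gate = nn.Linear(1, n_modalities)

    def forward(self, scores: torch.Tensor, freq: torch.Tensor) -> torch.Tensor:
        # scores: (batch, n_modalities) per-modality item scores
        # freq:   (batch,) raw item interaction counts
        w = torch.softmax(self.gate(torch.log1p(freq).unsqueeze(-1)), dim=-1)
        return (w * scores).sum(dim=-1)  # fused score, shape (batch,)
```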
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper rigorously identifies and systematically addresses two critical challenges in Large Language Model for Sequential Recommendation (LLM4SR): embedding collapse and catastrophic forgetting. To tackle these issues, the authors propose MME-SID, a novel framework that effectively integrates multimodal embeddings (collaborative, textual, visual) and semantic IDs.
Key components and findings include:
- Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE): This new model is central to MME-SID. It uses Maximum Mean Discrepancy (MMD) as a reconstruction loss to explicitly preserve intra-modal distance information, and contrastive learning to capture inter-modal correlations (see the alignment sketch after this summary).
- Mitigation of Catastrophic Forgetting: MME-SID successfully alleviates catastrophic forgetting by initializing the semantic ID embeddings with the trained code embeddings from MM-RQ-VAE, thus retaining valuable pre-learned knowledge.
- Alleviation of Embedding Collapse: By combining original multimodal embeddings with their semantic ID embeddings and using a dedicated target item embedding table, MME-SID creates richer, less collapsed representations in the LLM's token space.
- Efficient Fine-tuning and Adaptive Fusion: The LLM is efficiently fine-tuned using LoRA and incorporates a multimodal frequency-aware fusion module, which adaptively combines modality scores based on item frequency, further boosting performance.

Extensive experiments on three public Amazon datasets demonstrate the superior recommendation performance of MME-SID over strong baselines, validating its efficacy in tackling the identified challenges and unlocking the full potential of LLMs for sequential recommendation.
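As referenced in the MM-RQ-VAE item above, inter-modal contrastive alignment is commonly instantiated as a symmetric InfoNCE loss. The following sketch shows that common form, which may differ in detail from the paper's alignment objective:

```python
import torch
import torch.nn.functional as F

def inter_modal_infonce(z_a: torch.Tensor, z_b: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: the two views of the same item (e.g., its
    collaborative and textual embeddings) are positives; all other items
    in the batch serve as in-batch negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature       # (batch, batch) similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```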
7.2. Limitations & Future Work
While the paper does not explicitly dedicate a section to "Limitations & Future Work," some aspects can be inferred from the "Discussions" and the scope of the current work:
- Computational Cost of MM-RQ-VAE Training: While LoRA makes LLM fine-tuning efficient, the MM-RQ-VAE itself involves training a separate model to learn multimodal semantic IDs. This encoding stage, especially with MMD and contrastive learning, can be computationally intensive and require significant data, which might be a barrier for smaller datasets or resource-constrained environments.
- Generalizability of MM-RQ-VAE Design: The MM-RQ-VAE is specifically designed for collaborative, textual, and visual modalities. While robust for these, adapting it to new modalities (e.g., audio, sensor data) might require redesign or careful extension of the alignment and reconstruction mechanisms.
- Hyperparameter Sensitivity: The MM-RQ-VAE loss (Equation 10) involves multiple hyperparameters (e.g., the loss-weighting coefficients and the bandwidth for the Gaussian kernel). Optimizing these can be complex and dataset-dependent.
- Interpretability of Fused Embeddings: While the fusion module is frequency-aware, a deeper analysis into why certain modalities are weighted more heavily for cold vs. warm items could provide further insights and potentially lead to more advanced fusion strategies.
- Scalability of Multimodal Data Handling: For extremely large-scale industrial scenarios with billions of items, the storage and retrieval of multimodal embeddings (text, visual) for all items, even before quantization, can pose engineering challenges.

Potential future research directions could include:
- Exploring more advanced quantization techniques or alternative semantic ID generation methods that are even more efficient or robust.
- Investigating self-supervised pre-training strategies for the multimodal embeddings directly within the LLM, rather than relying on external encoders like LLM2CLIP, to potentially achieve end-to-end optimization.
- Applying MME-SID to more diverse LLM architectures or even multimodal LLMs that inherently handle multiple modalities, to see how the embedding collapse and catastrophic forgetting issues manifest and can be mitigated in those contexts.
- Developing more nuanced frequency-aware or context-aware fusion mechanisms that also consider user-specific preferences or current session context.
- Extending the framework to address cold-start users more directly, beyond just cold-start items.
7.3. Personal Insights & Critique
This paper offers a highly rigorous and insightful analysis of two fundamental problems (embedding collapse and catastrophic forgetting) that arise when integrating LLMs into sequential recommendation, particularly when semantic IDs and multimodal embeddings are involved. Its strength lies in its systematic problem identification, clear theoretical grounding (e.g., rank analysis for collapse), and empirical validation with robust metrics (e.g., Kendall's tau for forgetting). The detailed breakdown of how MMD and contrastive learning are integrated into RQ-VAE is particularly illuminating, showcasing a deep understanding of information preservation in quantized representations.
The explicit emphasis on using trained code embeddings for initialization is a critical insight, challenging the prevalent practice of random initialization in semantic ID-based generative models. This simple yet profound change significantly impacts knowledge retention. Furthermore, the practical advantages highlighted in the "Discussions" section—such as flexible ranking, collision avoidance, and higher inference efficiency—demonstrate that MME-SID is not just theoretically sound but also offers compelling real-world benefits over existing generative retrieval methods.
Critiques and Areas for Improvement:
- Detailed Explanation of Equation 5: The structure of Equation 5 in the RQ-VAE loss, specifically the commitment term (of the form $\beta\,\|z - \mathsf{SG}(e)\|^2$, where $\mathsf{SG}(\cdot)$ denotes the stop-gradient operator), ensures the encoder output commits to the codebook entries. While faithful to the paper, a deeper explanation of this term's purpose and derivation would benefit a beginner audience; a standard sketch of the quantization losses is given after this list.
- Trade-off between Complexity and Performance: While MME-SID achieves superior performance, the encoding stage requires training MM-RQ-VAE with three modalities, an MMD loss, and contrastive learning. This adds complexity. A clearer analysis of the marginal gains from each MM-RQ-VAE component (beyond MMD vs. MSE) could be insightful, potentially via a more fine-grained ablation study.
- Robustness to Data Quality: The framework heavily relies on the quality of multimodal embeddings generated by LLM2CLIP and SASRec. While LLM2CLIP is powerful, textual and visual data quality can vary greatly in real-world scenarios. An analysis of MME-SID's robustness to noisy or incomplete multimodal data would be valuable.
- Long-Term Impact of Forgetting Mitigation: While catastrophic forgetting is mitigated, the Kendall's tau for MME-SID post-fine-tuning (0.2727) is still lower than that of the pre-fine-tuning MM-RQ-VAE (0.4436). This suggests some information loss still occurs during LLM adaptation. Future work could explore regularization techniques during LLM fine-tuning to further preserve the learned distance metrics.
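As referenced in the Equation 5 item above, the commitment term is standard in VQ/RQ-VAE training. A minimal sketch of the usual quantization losses and the straight-through trick, not the paper's exact Equation 5:

```python
import torch

def rq_codebook_losses(z: torch.Tensor, e: torch.Tensor, beta: float = 0.25):
    """Standard VQ/RQ-VAE quantization losses using stop-gradient (detach):
    - the codebook loss pulls the selected codes e toward encoder outputs z;
    - the commitment loss (the term discussed above) keeps z near its code.
    beta is the conventional commitment weight; the paper's Equation 5 may
    weight the terms differently."""
    codebook_loss = (z.detach() - e).pow(2).mean()
    commitment_loss = beta * (z - e.detach()).pow(2).mean()
    # Straight-through estimator: gradients flow back to the encoder as if
    # the quantization step were the identity.
    z_q = z + (e - z).detach()
    return codebook_loss, commitment_loss, z_q
```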
Transferability and Applications:
The core principles of MME-SID are highly transferable.
- Beyond Recommendation: The strategy for mitigating embedding collapse (combining multiple feature sources and representations) and catastrophic forgetting (initializing with pre-trained code embeddings) could be applied to other LLM-based tasks that use discrete tokens or quantized representations, especially in domains like knowledge graph completion, multimodal generation, or information retrieval, where semantic consistency and information retention are crucial.
- Different LLM Architectures: The framework could be adapted to LLMs beyond Llama3-8B-instruct, providing a generalizable approach for enhancing LLM performance in diverse applications.
- Industrial Applications: The improved inference efficiency and avoidance of collisions make MME-SID particularly attractive for industrial-scale recommendation systems, where fast, accurate, and scalable solutions are paramount.

In conclusion, this paper presents a significant advancement in LLM4SR by tackling critical, previously unaddressed challenges. Its rigorous methodology and strong empirical results make it a valuable contribution to the field, offering both practical solutions and foundational insights for future research.