Multimodal Generative Recommendation for Fusing Semantic and Collaborative Signals
TL;DR Summary
The paper introduces MSCGRec, a generative recommendation system that addresses limitations in current sequential recommenders by integrating multiple semantic modalities and collaborative features. Empirical results show superior performance on three real-world datasets, validating each proposed component through an extensive ablation study.
Abstract
Sequential recommender systems rank relevant items by modeling a user's interaction history and computing the inner product between the resulting user representation and stored item embeddings. To avoid the significant memory overhead of storing large item sets, the generative recommendation paradigm instead models each item as a series of discrete semantic codes. Here, the next item is predicted by an autoregressive model that generates the code sequence corresponding to the predicted item. However, despite promising ranking capabilities on small datasets, these methods have yet to surpass traditional sequential recommenders on large item sets, limiting their adoption in the very scenarios they were designed to address. We identify two key limitations underlying the performance deficit of current generative recommendation approaches: 1) Existing methods mostly focus on the text modality for capturing semantics, while real-world data contains richer information spread across multiple modalities, and 2) the fixation on semantic codes neglects the synergy of collaborative and semantic signals. To address these challenges, we propose MSCGRec, a Multimodal Semantic and Collaborative Generative Recommender. MSCGRec incorporates multiple semantic modalities and introduces a novel self-supervised quantization learning approach for images based on the DINO framework. To fuse collaborative and semantic signals, MSCGRec also extracts collaborative features from sequential recommenders and treats them as a separate modality. Finally, we propose constrained sequence learning that restricts the large output space during training to the set of permissible tokens. We empirically demonstrate on three large real-world datasets that MSCGRec outperforms both sequential and generative recommendation baselines, and provide an extensive ablation study to validate the impact of each component.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of this paper is "Multimodal Generative Recommendation for Fusing Semantic and Collaborative Signals".
1.2. Authors
The authors are listed as "Anonymous authors", indicating that the paper was submitted under a double-blind review process. Therefore, their specific research backgrounds and affiliations are not disclosed in the provided text.
1.3. Journal/Conference
The paper is hosted on openreview.net, a platform for managing submissions and reviews for conferences, particularly those employing double-blind review. The timestamp (Published at (UTC): 2025-10-08T00:00:00.000Z) together with the note "Paper under double-blind review" indicates a new or forthcoming work that is undergoing review for, or has been accepted at, a major 2025 venue.
1.4. Publication Year
The publication timestamp indicates 2025.
1.5. Abstract
The paper addresses limitations in current sequential recommender systems and generative recommendation paradigms. Traditional sequential recommenders suffer from memory overhead for large item sets, while generative recommenders, which model items as discrete semantic codes, have struggled to outperform them on large datasets despite their theoretical advantages. The authors identify two key limitations: 1) over-reliance on the text modality for semantics, neglecting richer multimodal information, and 2) neglecting the synergy between collaborative and semantic signals. To overcome these, they propose MSCGRec, a Multimodal Semantic and Collaborative Generative Recommender. MSCGRec incorporates multiple semantic modalities, introduces a novel self-supervised quantization learning approach for images based on the DINO framework, and fuses collaborative signals by extracting them from sequential recommenders and treating them as a separate modality. Additionally, it features constrained sequence learning to refine the training process by restricting the output space to permissible tokens. Empirical results on three large real-world datasets demonstrate that MSCGRec outperforms both sequential and generative recommendation baselines, validated by an extensive ablation study.
1.6. Original Source Link
The official source link for the paper is https://openreview.net/pdf?id=SdzEu8Cf2t. Its publication status is "Paper under double-blind review," indicating it is in the process of peer review for an academic venue.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inefficiency and performance gap in recommender systems when dealing with large item sets, particularly within the context of sequential recommendation.
Sequential recommender systems model user interaction history to predict the next relevant item. They traditionally rely on item embeddings stored in a large lookup table, which incurs significant memory overhead and computational resources when the number of items is vast. Furthermore, these systems often primarily capture collaborative information (patterns of co-occurrence) and do not fully leverage the semantic attributes of items.
To address the memory challenge, generative recommendation emerged as an alternative. This paradigm represents each item as a series of discrete semantic codes, effectively reducing memory requirements and allowing for information sharing across similar items. The next item is then predicted by an autoregressive model that generates the corresponding code sequence. However, despite these theoretical advantages and promising results on smaller datasets, current generative recommenders have consistently failed to surpass traditional sequential recommenders on large, real-world datasets. This limits their practical adoption in the very scenarios they were designed for.
The paper identifies two specific challenges or gaps in prior research contributing to this performance deficit:

- Limited Modality Focus: Existing generative methods predominantly focus on the text modality to capture semantics. Real-world items, however, possess rich information across multiple modalities (e.g., images, text, audio, video), which are largely underutilized.
- Neglect of Collaborative-Semantic Synergy: There is a fixation on semantic codes alone, overlooking the crucial synergy between collaborative signals (derived from user-item interactions) and semantic signals (derived from item content). Purely semantic approaches may miss valuable implicit user preferences embedded in interaction patterns.

The paper's entry point and innovative idea is a multimodal generative recommender that explicitly fuses semantic information from various modalities with collaborative signals, while also enhancing the learning process itself.
2.2. Main Contributions / Findings
The paper makes several primary contributions to advance the field of generative recommendation:
- Proposal of MSCGRec: The paper introduces MSCGRec, a Multimodal Semantic and Collaborative Generative Recommender. This novel generative recommendation method integrates sequential recommenders to leverage collaborative features, treating them as a distinct modality within the generative framework. This allows MSCGRec to retain the memory efficiency of generative models while incorporating the strong collaborative signals typically found in sequential models.
- Enhanced Image Quantization: MSCGRec improves the quality of code predictions through a novel self-supervised quantization learning scheme for images. Based on the DINO framework, this approach enhances the semantic quality of the derived image codes without relying on paired text data, moving beyond the text-centric view of previous generative recommenders.
- Constrained Sequence Learning: The authors introduce constrained training into the sequence modeling process. This method incorporates the code structure directly into training by restricting the large output space to only permissible tokens (valid code sequences). This prevents the model from wasting capacity on memorizing invalid sequences, improving efficiency and focusing learning on relevant differentiations.
- Novel Positional Embedding: MSCGRec utilizes an adapted positional embedding that distinguishes between positions across items in a sequence and positions within the codes of a single item, providing a more comprehensive understanding of the underlying code structure.
- Empirical Superiority on Large Datasets: Through thorough empirical evaluation on three large-scale, real-world datasets (an order of magnitude larger than those used in prior work), MSCGRec not only outperforms existing generative recommendation baselines but also, for the first time at this scale, sequential recommendation baselines. This addresses the core limitation of previous generative methods and validates their practical applicability to large item sets.
- Handling Missing Modalities: The framework naturally handles missing modalities at an item level, an important feature for real-world scenarios where complete multimodal data is not always available.

In summary, MSCGRec's key conclusion is that by combining diverse semantic modalities with collaborative signals, coupled with improved quantization and training techniques, generative recommendation can surpass traditional sequential methods on large item sets, paving the way for more efficient and effective recommender systems.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand MSCGRec, a few core concepts are essential:
- Recommender Systems (RS): Information filtering systems that predict user preferences for items and suggest items a user might like, e.g., product recommendations on e-commerce sites or movie suggestions on streaming platforms.
- Sequential Recommender Systems: A sub-field of RS that explicitly models the temporal order of user-item interactions. Instead of recommending items based only on overall preferences, they consider the sequence of past interactions to predict the next item a user might engage with. This is particularly useful where order matters (e.g., watching a series, buying related products).
- Item Embeddings: Items (e.g., movies, products) are typically represented as numerical vectors in a high-dimensional space. These embeddings capture the characteristics of and relationships between items; similar items have similar embeddings. Sequential recommenders learn an embedding for each item and store them in a lookup table.
- Memory Overhead: The amount of memory consumed by storing data. In traditional sequential recommenders, storing a unique embedding for every item in a very large catalog (millions or billions of items) leads to immense memory overhead, making the system costly and slow.
- Collaborative Information/Signals: Information derived from user-item interaction patterns. For example, if users who liked item A also liked item B, then A and B have a collaborative relationship. Sequential recommenders excel at capturing this.
- Semantic Information/Signals: Information derived from the content or attributes of an item itself (e.g., text descriptions, images, categories, tags). A product's description or image carries semantic meaning.
- Generative Recommendation: A newer paradigm that addresses the memory overhead of sequential recommenders. Instead of storing unique item embeddings, items are encoded as a series of discrete semantic codes, and the system "generates" the code sequence of the next item. This is inspired by generative language models, which produce sequences of text tokens.
- Discrete Semantic Codes: In generative recommendation, an item's attributes (such as its text description or image) are converted into a sequence of discrete tokens or codes. These codes are "semantic" because they aim to capture the item's inherent meaning. Representing items this way lets information be shared across items with similar code sequences and reduces the storage burden.
- Autoregressive Model: A model that predicts future values based on past values. In generative recommendation, an autoregressive model (often a Transformer-based architecture) predicts the next code in a sequence, conditioned on the previously generated codes and the user's interaction history.
- Residual Quantization (RQ): A technique that compresses a continuous embedding into a hierarchical series of discrete codes. It iteratively finds the closest code vector in a codebook, subtracts it, and quantizes the remaining "residual" at the next level; items that are semantically similar tend to share the initial codes of their sequences. Given an input embedding $r_1$ and a set of codebooks $\{e_k^l\}_{k=1}^{K}$ (where $e_k^l$ is the $k$-th code vector at level $l$, and $K$ is the number of codebook entries per level), RQ computes the discrete codes and residuals for each level as follows:
$ c_{l} = \arg \min_{k}\| r_{l} - e_{k}^{l}\|^{2} $
$ r_{l + 1} = r_{l} - e_{c_{l}}^{l} $
Here, $c_l$ is the index of the code vector in the level-$l$ codebook closest to the current residual $r_l$. The code vector $e_{c_l}^l$ is then subtracted from $r_l$ to obtain the next residual $r_{l+1}$, which is passed to the next quantization level. This continues for $L$ levels, yielding a code sequence $(c_1, \ldots, c_L)$. A reconstruction loss is typically used to train the encoder and decoder, and a regularization term aligns the assigned code embeddings with the intermediate residuals. (A minimal code sketch appears in Section 4.2.4.)
- Self-Supervised Learning (SSL): A paradigm where a model learns representations from unlabeled data by creating supervisory signals from the data itself, e.g., predicting a masked part of an image or text from the unmasked parts.
- DINO Framework: DINO (self-DIstillation with NO labels) is a self-supervised learning framework for computer vision. It trains a student model to match the output of a teacher model on different views of the same image. The teacher is typically an exponential moving average of the student's past weights, making it a more stable target. This allows models to learn powerful visual representations without manual labels (a minimal sketch of the EMA teacher update follows this list).
- Transformer Models (e.g., T5): A neural network architecture that relies on self-attention mechanisms to weigh the importance of different parts of the input sequence. Transformers are highly effective for sequence-to-sequence tasks such as language translation or, in this case, generating sequences of semantic codes. T5 (Text-To-Text Transfer Transformer) is a specific Transformer architecture that frames all NLP tasks as text-to-text problems.
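To make the DINO concept above concrete, here is a minimal PyTorch sketch of the EMA teacher update (illustrative only, not the DINO reference implementation; function and parameter names are assumptions):

```python
import torch

@torch.no_grad()
def update_teacher(teacher: torch.nn.Module, student: torch.nn.Module,
                   momentum: float = 0.996) -> None:
    """DINO-style self-distillation target: the teacher's weights are an
    exponential moving average (EMA) of the student's weights."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```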
3.2. Previous Works
The paper positions MSCGRec within the context of two main lines of research: Sequential Recommendation and Generative Recommendation.
3.2.1. Sequential Recommendation
This field has evolved significantly, moving from simpler models to complex neural networks:
- Markov Assumption Models (e.g., Factorizing Personalized Markov Chains (Rendle et al., 2010); Wang et al., 2015): Early approaches often assumed that the next item in a sequence depends only on the immediately preceding item, simplifying the modeling task.
- Neural Network-based Models:
  - Recurrent Neural Networks (RNNs) (e.g., GRU4Rec (Hidasi et al., 2016a); Li et al., 2017; Liu et al., 2018): These capture temporal dependencies by processing sequences step by step while maintaining an internal state. GRU4Rec uses Gated Recurrent Units (a variant of RNNs) to model user behavior sequences.
  - Convolutional Neural Networks (CNNs) (e.g., Caser (Tang & Wang, 2018)): These apply convolutional filters to learn local patterns in item sequences, useful for capturing short-term dependencies.
  - Transformers (e.g., SASRec (Kang & McAuley, 2018); BERT4Rec (Sun et al., 2019)): These leverage self-attention mechanisms to capture long-range dependencies without recurrence. SASRec uses a decoder-only Transformer architecture; BERT4Rec adapts the bidirectional Transformer (BERT) with a masked prediction objective.
- Attribute-aware and Self-Supervised Models:
  - Zhang et al. (2019) (FDSA): Incorporates item attributes alongside item IDs, modeling both item and attribute transition patterns.
  - Wang et al. (2023): Integrates item attributes in a pre-training stage.
  - Zhou et al. (2020) (S3-Rec): Uses self-supervised learning to capture intrinsic data similarities within sequences.
3.2.2. Generative Recommendation
This is a newer paradigm inspired by large language models, where items are represented as discrete codes:
- Foundational Models (e.g., TIGER (Rajput et al., 2023); Sun et al., 2023): These pioneered the idea of encoding items as unique series of semantically meaningful discrete codes, often obtained by Residual Quantization (RQ) of text embeddings. TIGER is a direct baseline for MSCGRec.
- Integrating Collaborative Signals:
  - LETTER (Wang et al., 2024a): Regularizes semantic codes to be similar to sequential recommendation embeddings.
  - CoST (Zhu et al., 2024): Applies a contrastive loss to capture both semantic information and neighborhood relationships.
  - ETEGRec (Liu et al., 2025): Optimizes the sequence encoder and item tokenizer cyclically, aligning sequence and collaborative item embeddings.
  - Wang et al. (2024b): Uses a two-stream generation architecture to model semantic and collaborative information separately.
- Large Language Model (LLM) Integration: Recent work explores using LLMs within this framework (Qu et al., 2024; Zheng et al., 2024; Paischer et al., 2025).
- Multimodal Generative Recommendation:
  - MQL4GRec (Zhai et al., 2025a): Treats each modality as a separate language and uses modality-alignment losses to encourage a shared vocabulary.
  - Zheng et al. (2025): Uses early fusion with multimodal foundation models.
  - Zhai et al. (2025b): Uses a cross-modal contrastive loss.
  - Further approaches use product quantization to merge codes from multiple modalities, or propose a graph residual quantizer for multimodal and collaborative signals.
3.3. Technological Evolution
The evolution of recommender systems has moved from simpler collaborative filtering methods (which rely purely on user-item interaction data) to content-based methods (which leverage item features), and then to hybrid approaches. Within sequential recommendation, the field progressed from Markov chains to RNNs, CNNs, and then Transformers, largely driven by advancements in natural language processing (NLP).
The generative recommendation paradigm represents a significant shift, borrowing ideas from generative AI (especially language modeling). Initially, these focused on addressing memory overhead by representing items as discrete semantic codes, primarily from text. The evolution then moved towards incorporating collaborative signals into this generative framework and, more recently, extending to multimodal data. This paper sits at the cutting edge of this evolution, pushing multimodal integration and explicit fusion of collaborative and semantic signals.
3.4. Differentiation Analysis
Compared to the main methods in related work, MSCGRec introduces several core differences and innovations:
- Comprehensive Multimodal Integration: Unlike prior generative recommendation methods that predominantly focused on text or treated modalities as separate languages (MQL4GRec), MSCGRec proposes a framework in which multiple semantic modalities (e.g., images, text) are inherently part of the item encoding.
- Novel Image Quantization (RQ-DINO): MSCGRec introduces a self-supervised quantization learning approach for images based on the DINO framework. This is a significant improvement over simply applying Residual Quantization to pre-trained image embeddings or using reconstruction-based objectives: the learned codes capture semantically meaningful information relevant to recommendation rather than full image detail.
- Direct Collaborative Signal Integration as a Modality: Instead of using auxiliary losses to align semantic codes with collaborative embeddings (LETTER, CoST, ETEGRec), MSCGRec treats collaborative features extracted from sequential recommenders as an entirely separate modality within its multimodal encoding. The sequence learning model can thus naturally combine and leverage these distinct signal types without complex alignment strategies.
- Constrained Sequence Learning: MSCGRec introduces constrained training that restricts the output space during training to permissible tokens. This general improvement to generative recommendation addresses shortcut learning and enhances model efficiency by focusing on valid code sequences, a unique contribution compared to other generative models.
- Adapted Positional Embedding: Two distinct relative positional embeddings (across items and within item codes) provide a more nuanced understanding of code structure than standard Transformer positional encoding, especially for multi-level, multimodal codes.
- Performance on Large Datasets: MSCGRec is the first generative recommendation method that demonstrably beats sequential recommendation baselines at large scale, a critical unmet challenge for the generative paradigm. This validates its practical utility in real-world scenarios where previous generative models struggled.
4. Methodology
The MSCGRec (Multimodal Semantic and Collaborative Generative Recommender) method is designed to overcome the limitations of existing generative recommenders by integrating diverse feature modalities, fusing collaborative and semantic signals, and refining the sequence learning process. The overall architecture is schematically presented in Figure 1.
4.1. Principles
The core idea of MSCGRec is to represent each item not just by text-based semantic codes, but by a comprehensive set of codes derived from multiple modalities (e.g., images, text, and importantly, collaborative features). These multimodal codes are then processed by a Transformer-based autoregressive model to predict the next item's code sequence. The theoretical basis lies in the hypothesis that combining rich semantic information from various sources with powerful collaborative signals, all within an efficient code-based representation, can lead to superior recommendation performance, especially on large datasets. The method also relies on self-supervised learning principles for robust image quantization and a refined sequence learning objective to improve training efficiency and effectiveness.
4.2. Core Methodology In-depth (Layer by Layer)
The MSCGRec architecture is composed of three main parts: Multimodal Generative Recommendation framework, Image Quantization, and Sequence Modeling.
4.2.1. Multimodal Generative Recommendation
In quantization-based generative recommendation, an item $i$ is typically described by a series of discrete codes $\mathbf{c}_i = (c_{i,1}, \ldots, c_{i,L})$, where $L$ is the number of code levels. The goal is to predict the code sequence of the next item based on the user's interaction history, represented by the previous items' code sequences $\mathbf{c}_1, \ldots, \mathbf{c}_{i-1}$. The standard log-likelihood loss for this task is:

$ \mathcal{L}_{rec}^{(i)} = -\log p(\mathbf{c}_{i}\mid\mathbf{c}_{1},\ldots,\mathbf{c}_{i - 1}) = -\sum_{l = 1}^{L}\log p(c_{i,l}\mid\mathbf{c}_{1},\ldots,\mathbf{c}_{i - 1},c_{i,< l}) \quad (1) $
Here:

- $\mathcal{L}_{rec}^{(i)}$ is the recommendation loss for predicting the $i$-th item.
- $p(\mathbf{c}_{i}\mid\mathbf{c}_{1},\ldots,\mathbf{c}_{i-1})$ is the probability of predicting the code sequence of the $i$-th item given the history of previous item code sequences.
- The second part of the equation decomposes this probability into an autoregressive product over the code levels $l = 1,\ldots,L$.
- $p(c_{i,l}\mid\mathbf{c}_{1},\ldots,\mathbf{c}_{i-1},c_{i,<l})$ is the probability of the $l$-th code of the $i$-th item, conditioned on the history and on the already predicted codes $c_{i,<l}$ of the current item. This reflects the hierarchical nature of Residual Quantization.

MSCGRec extends this formulation by incorporating multiple modalities: instead of a single series of codes, each item is encoded as a concatenation of code series from different modalities, denoted $\tilde{\mathbf{c}}_i$. In this work, the semantic modalities comprise images (processed as described in the paper's Section 3.2) and text (obtained via standard hierarchical quantization, commonly realized through Residual Quantization).
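As a concrete illustration of the per-item loss in Eq. (1), here is a minimal PyTorch sketch (hypothetical names; the logits are assumed to come from a decoder conditioned on the history and on $c_{i,<l}$ via teacher forcing):

```python
import torch
import torch.nn.functional as F

def next_item_code_loss(logits: torch.Tensor, target_codes: torch.Tensor) -> torch.Tensor:
    """Eq. (1): -sum_l log p(c_{i,l} | history, c_{i,<l}).

    logits:       (L, vocab_size) one row of decoder logits per code level,
                  each conditioned on the history and c_{i,<l} (teacher forcing).
    target_codes: (L,) ground-truth codes (c_{i,1}, ..., c_{i,L}) of the next item.
    """
    return F.cross_entropy(logits, target_codes, reduction="sum")
```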
A key innovation is how collaborative features are integrated:

- MSCGRec extracts item embeddings from a pre-trained sequential recommender (e.g., SASRec).
- These collaborative item embeddings are then processed with Residual Quantization (RQ) to generate a series of discrete codes, effectively treating them as another, separate modality.
- This approach avoids the additional alignment losses used in prior work to fuse collaborative and semantic information, as the multimodal framework combines them naturally.

To ensure uniqueness across items, a separate "collision level" is appended to the codes of each modality. Even if two items have very similar semantic codes across the main levels, the additional collision level guarantees a unique code sequence per item within each modality, as sketched below.
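A minimal Python sketch of this collision handling (hypothetical data layout; not the paper's implementation): items whose main-level codes collide receive distinct codes at the extra level:

```python
from collections import defaultdict

def append_collision_level(code_sequences: dict) -> dict:
    """Append a 'collision level' code so every item maps to a unique sequence."""
    counters = defaultdict(int)           # running index per colliding code prefix
    unique = {}
    for item_id, codes in code_sequences.items():
        key = tuple(codes)
        unique[item_id] = codes + [counters[key]]   # extra disambiguation code
        counters[key] += 1
    return unique

codes = {"itemA": [3, 17, 201], "itemB": [3, 17, 201], "itemC": [5, 9, 88]}
print(append_collision_level(codes))
# {'itemA': [3, 17, 201, 0], 'itemB': [3, 17, 201, 1], 'itemC': [5, 9, 88, 0]}
```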
For decoding the next item, MSCGRec leverages the multimodal encoding for a rich history representation, while the prediction target is the code sequence of a single modality $m_d$. This corresponds to the loss:

$ \mathcal{L}_{rec}^{(i)} = -\log p(\mathbf{c}_{i}^{m_d}\mid\tilde{\mathbf{c}}_{1},\ldots,\tilde{\mathbf{c}}_{i - 1}) \quad (2) $
Here:

- $\mathcal{L}_{rec}^{(i)}$ is the recommendation loss for predicting the $i$-th item.
- $p(\mathbf{c}_{i}^{m_d}\mid\tilde{\mathbf{c}}_{1},\ldots,\tilde{\mathbf{c}}_{i-1})$ is the probability of the code sequence of the decoding modality $m_d$ for the target item, given the history of previous multimodal code sequences $\tilde{\mathbf{c}}_{1},\ldots,\tilde{\mathbf{c}}_{i-1}$.
- Decoding the next item with a single modality during inference is chosen to simplify constrained beam search, making it more efficient than searching across multiple hierarchical structures simultaneously.

MSCGRec is also designed to handle missing modalities. If a modality is unavailable for an item in the user history, its corresponding codes can be replaced with learnable mask tokens. During training, this is simulated by randomly masking a modality for some items, allowing the model to learn robust representations even with incomplete data (a minimal sketch follows below).
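A minimal sketch of this random modality masking (hypothetical names; per the implementation details, one modality per history item is masked with 75% probability when the extension is enabled):

```python
import random

MASK_CODE = -1  # placeholder id for the learnable mask token

def mask_random_modality(item_codes: dict, p: float = 0.75) -> dict:
    """With probability p, replace one randomly chosen modality's codes
    with mask tokens, so the model learns to cope with missing modalities."""
    if random.random() < p:
        modality = random.choice(list(item_codes))
        item_codes = dict(item_codes)  # copy; don't mutate the caller's dict
        item_codes[modality] = [MASK_CODE] * len(item_codes[modality])
    return item_codes
```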
The following figure (Figure 1 from the original paper) shows the schematic overview of MSCGRec:
The figure is a schematic overview of MSCGRec, showing its structure and workflow across three parts: generative recommendation, image quantization, and constrained training. It depicts the relationship between encoder and decoder, the self-supervised quantization learning, and the extraction of collaborative signals.
As can be seen from Figure 1:
- (a) Generative Recommendation: Shows the overall flow, where each item in the history is represented by a joint encoding spanning all modalities, and a sequence model generates the next item's code sequence.
- (b) Image Quantization: Details the self-supervised quantization learning process for images, where the student embedding is encoded via residual quantization.
- (c) Constrained Training: Illustrates the sequence learning process, where optimization occurs over permissible codes only; green nodes indicate correct codes.
4.2.2. Image Quantization
Traditionally, generative recommenders focused on text, using pre-trained text encoders and then Residual Quantization (RQ). For images, RQ has been used in image generation where raw pixels are the input, with a goal to reconstruct the image. However, for recommendation, the objective is to extract semantically meaningful information, not to reconstruct the entire image. To this end, MSCGRec proposes a novel self-supervised quantization learning approach for images, adapting the DINO framework.
The DINO framework performs self-distillation: a student model $g^s$ with projection head $f^s$ is trained to match the output of a teacher model $g^t$ with projection head $f^t$. The teacher is typically an exponential moving average of the student's past iterates. The DINO loss is a cross-entropy (CE) loss:

$ \mathcal{L}_{DINO} = CE(f^{s}(\mathbf{z}^{s}), f^{t}(\mathbf{z}^{t})); \quad \mathbf{z}^{s} = g^{s}(\mathbf{x}),\; \mathbf{z}^{t} = g^{t}(\mathbf{x}) \quad (3) $
Where:

- $\mathcal{L}_{DINO}$ is the DINO loss.
- $CE$ denotes the cross-entropy function.
- $f^s$ and $f^t$ are the projection heads of the student and teacher models, respectively.
- $g^s$ and $g^t$ are the student and teacher backbone models (e.g., Vision Transformers).
- $\mathbf{x}$ is the input image.
- $\mathbf{z}^{s} = g^{s}(\mathbf{x})$ and $\mathbf{z}^{t} = g^{t}(\mathbf{x})$ are the intermediate embeddings produced by the student and teacher backbones.

MSCGRec incorporates quantization directly into this framework by applying Residual Quantization (RQ) to the intermediate embedding of the student model. Crucially, only the student's embedding is quantized. This encourages the student to learn representations whose quantized approximation can still effectively capture the teacher's expressive power. The RQ-DINO loss replaces the student's raw embedding in the cross-entropy with its quantized approximation:
$ \mathcal{L}_{RQ\text{-}DINO} = CE(f^{s}(\hat{\mathbf{z}}_{L}^{s}), f^{t}(\mathbf{z}^{t})); \quad \hat{\mathbf{z}}_{L}^{s} = \sum_{l = 1}^{L} e_{c_{l}}^{l} \quad (4) $
Here:

- $\mathcal{L}_{RQ\text{-}DINO}$ is the modified DINO loss with Residual Quantization.
- $\hat{\mathbf{z}}_{L}^{s}$ is the quantized approximation of the student's embedding, obtained by summing the code vectors of the assigned discrete codes across all $L$ levels of RQ.
- $e_{c_l}^{l}$ denotes the codebook vector at level $l$ that corresponds to the discrete code $c_l$ assigned by RQ.

The overall loss for image quantization in MSCGRec combines the RQ-DINO loss with other established self-supervised learning regularization terms:
$ \mathcal{L}_{RQ\text{-}DINO} + \alpha_{1}\mathcal{L}_{iBOT} + \alpha_{2}\mathcal{L}_{KoLeo} + \alpha_{3}\mathcal{L}_{commit} \quad (5) $
Where:

- $\mathcal{L}_{iBOT}$ refers to the iBOT loss (Zhou et al., 2022), another self-supervised learning loss based on masked image modeling.
- $\mathcal{L}_{KoLeo}$ refers to the KoLeo loss (Sablayrolles et al., 2019), a regularization term that promotes uniformly distributed representations.
- $\mathcal{L}_{commit}$ refers to the code commitment loss (van den Oord et al., 2017), commonly used in vector-quantized variational autoencoders (VQ-VAEs) and RQ to ensure the codebook embeddings are updated towards the encoder outputs.
- $\alpha_1, \alpha_2, \alpha_3$ are hyperparameters controlling the weight of each loss component.
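To make the objective concrete, here is a minimal PyTorch sketch of the RQ-DINO core, i.e., Eq. (4) plus the commitment term (an illustrative sketch under assumptions: it omits DINO's centering and temperature sharpening, and the straight-through gradient trick and all names are standard choices, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def residual_quantize(z, codebooks):
    """Residual quantization of z; returns the quantized embedding (with a
    straight-through gradient back to the encoder) and the commitment loss."""
    residual, quantized = z, torch.zeros_like(z)
    commit = z.new_zeros(())
    for codebook in codebooks:                    # codebook: (K, d) per level
        dist = torch.cdist(residual, codebook)    # (B, K) distances to codes
        e = codebook[dist.argmin(dim=-1)]         # (B, d) closest code vectors
        commit = commit + F.mse_loss(residual, e.detach())
        quantized = quantized + e
        residual = residual - e
    return z + (quantized - z).detach(), commit   # straight-through estimator

def rq_dino_loss(student_head, teacher_head, z_s, z_t, codebooks):
    """Eq. (4): cross-entropy between the teacher's output and the head
    applied to the *quantized* student embedding; no gradient to the teacher."""
    z_hat, commit = residual_quantize(z_s, codebooks)
    p_t = F.softmax(teacher_head(z_t), dim=-1).detach()
    log_p_s = F.log_softmax(student_head(z_hat), dim=-1)
    return -(p_t * log_p_s).sum(dim=-1).mean(), commit
```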
4.2.3. Sequence Modeling
The sequence modeling component in MSCGRec processes the multimodal code sequences to predict the next item. The paper identifies a shortcut learning issue in standard generative recommendation training. When calculating the softmax probability, the model is incentivized to differentiate between correct codes and all possible incorrect codes. This can lead to the model memorizing which code sequences are not assigned to any real item, which is unnecessary since constrained beam search during inference will naturally discard such impermissible code sequences. This memorization consumes model capacity and can lead to overfitting.
The standard softmax loss for item $i$ at code level $l$ is:

$ \mathcal{L}_{rec}^{(i,l)} = -\log \mathrm{softmax}(\mathbf{z})_{c} = -z_{c} + \log \sum_{c^{\prime}\in \mathcal{C}}\exp(z_{c^{\prime}}) \quad (6) $
Here:

- $\mathcal{L}_{rec}^{(i,l)}$ is the loss for predicting the correct code at level $l$ of item $i$.
- $\mathbf{z}$ denotes the predicted logits over all possible codes at position $(i,l)$.
- $c$ denotes the correct code within the set of all possible tokens $\mathcal{C}$.
- The term $-z_c$ aims to maximize the logit of the correct code.
- The term $\log \sum_{c' \in \mathcal{C}} \exp(z_{c'})$ is the log-sum-exp normalization factor, which sums over the logits of all codes in the vocabulary $\mathcal{C}$, including impermissible ones.

To address this, MSCGRec proposes constrained sequence learning: the softmax normalization factor is modified to sum only over the set of permissible next codes. The model thus focuses its learning capacity on distinguishing between valid next codes, rather than memorizing invalid ones.
Formally, let $\mathcal{T}$ be a prefix tree (also known as a trie) representing all observed item code sequences. Given a sequence of codes up to a certain point (a node in the prefix tree), the set of permissible next codes is the set of children of that node. The constrained sequence modeling loss is then defined as:

$ \mathcal{L}_{rec}^{(i,l)} = -z_{c} + \log \sum_{c'\in \operatorname{Ch}(v_{c_{\leq l}};\mathcal{T})}\exp(z_{c'}) \quad (7) $
Here:

- The notation follows Equation (6), but the sum in the normalization term is restricted.
- $\operatorname{Ch}(v_{c_{\leq l}};\mathcal{T})$ denotes the set of children (i.e., permissible next codes) of the node $v_{c_{\leq l}}$ in the prefix tree $\mathcal{T}$, where $v_{c_{\leq l}}$ corresponds to the prefix of the current code sequence up to level $l$.
- This constraint can be precomputed and does not add significant computational overhead during training. The same formulation is applied during constrained beam search at inference time (a minimal sketch of the constrained loss follows below).

Finally, MSCGRec introduces an adapted positional embedding. It addresses a limitation of standard Transformer models like T5, which use logarithmically spaced bins for relative position embeddings; this is not optimal for the structured nature of multimodal codes, where modalities and levels are distinct. MSCGRec uses two types of relative position embeddings:

- One that operates across items in the sequence (e.g., how far apart two items are in the history).
- Another that captures within-item relationships, i.e., the structure of codes within a single item (e.g., the relationship between an image code and a text code of the same item, or between different RQ levels of one modality).

The two embeddings are summed to form the final positional embedding while maintaining the same total number of stored embeddings. This allows MSCGRec to explicitly model relationships between coupled codes of different items and within an item's multimodal structure.
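As referenced above, here is a minimal PyTorch sketch of the constrained loss in Eq. (7) (hypothetical interface; the permissible-code sets are assumed to be precomputed from the trie). Impermissible logits are masked to $-\infty$, so the softmax normalizes only over valid children:

```python
import torch
import torch.nn.functional as F

def constrained_code_loss(logits, target, permissible):
    """Eq. (7): cross-entropy normalized only over permissible next codes.

    logits:      (vocab_size,) decoder logits at one code position.
    target:      index of the correct next code (always permissible).
    permissible: LongTensor of child codes of the current trie node.
    """
    masked = logits.new_full(logits.shape, float("-inf"))
    masked[permissible] = logits[permissible]          # keep valid codes only
    return F.cross_entropy(masked.unsqueeze(0),
                           torch.tensor([target], device=logits.device))
```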
4.2.4. Residual Quantization (RQ)
As a fundamental building block for generative recommenders, Residual Quantization (RQ) is a technique used to compress a continuous embedding into a hierarchical series of discrete codes. This process allows for efficient storage and structured representation.
Given an input embedding $r_1$ (typically the output of an encoder, denoted $\mathbf{z}$ in the paper's appendix) and a set of codebooks $\{e_k^l\}_{k=1}^{K}$ (where $e_k^l$ is the $k$-th learnable code vector at level $l$, and $K$ is the number of entries in each codebook), RQ works iteratively:
1. Code Assignment: For each level $l$ from 1 to $L$, the algorithm finds the code vector in the current level's codebook that is closest to the current residual vector $r_l$; the index of that code vector is $c_l$:
$ c_{l} = \arg \min_{k}\| r_{l} - e_{k}^{l}\|^{2} \quad (8) $
Here:
   - $c_l$ is the discrete code assigned at level $l$.
   - $r_l$ is the residual vector at level $l$; for the first level, $r_1$ is the input embedding.
   - $e_k^l$ is the $k$-th code vector in the codebook of level $l$.
   - $\|\cdot\|^2$ denotes the squared Euclidean distance.
2. Residual Calculation: After selecting the closest code, its vector is subtracted from the current residual to obtain the next residual, which represents the information not captured by the current level's code:
$ r_{l + 1} = r_{l} - e_{c_{l}}^{l} \quad (9) $
This process is repeated for $L$ levels, yielding a sequence of discrete codes $(c_1, \ldots, c_L)$.
To train RQ, a reconstruction loss is typically used: the sum of the assigned code embeddings is passed through a decoder to reconstruct the original input. Additionally, a regularization term (e.g., an $\ell_2$ penalty) is often applied to align the assigned code embeddings with the intermediate residual vectors, ensuring effective codebook usage.
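A minimal NumPy sketch of the encoding loop in Eqs. (8)-(9) (illustrative only, not the paper's implementation):

```python
import numpy as np

def rq_encode(z, codebooks):
    """Assign one discrete code per level and propagate the residual.

    z:         (d,) input embedding.
    codebooks: list of L arrays, each of shape (K, d).
    Returns the code sequence (c_1, ..., c_L).
    """
    codes, residual = [], z
    for codebook in codebooks:                                        # level l = 1..L
        c = int(np.argmin(((residual - codebook) ** 2).sum(axis=1)))  # Eq. (8)
        residual = residual - codebook[c]                             # Eq. (9)
        codes.append(c)
    return codes
```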
5. Experimental Setup
5.1. Datasets
The experiments were conducted on three large real-world datasets:
- Amazon 2023 Review Dataset (Hou et al., 2024): This dataset was used with two specific subsets:
- "Beauty and Personal Care"
- "Sports and Outdoors"
- Characteristics: These subsets are significant because their item sets are approximately an order of magnitude larger than those commonly used in prior work (e.g., Amazon 2014 and 2018 editions). They contain both text descriptions and images for items.
- PixelRec (Cheng et al., 2023):
  - Characteristics: This dataset is specifically image-focused, providing abstract and semantically rich images. For PixelRec, 30% of items do not have a text description, which makes the dataset multimodal but incomplete — a property that is directly relevant to MSCGRec's ability to handle missing modalities.
Preprocessing Steps (applied to all datasets):

- Core filtering: Users and items with fewer than 5 interactions were removed (i.e., 5-core filtering). This is a common practice to filter out sparse data and ensure sufficient interaction history for modeling.
- Amazon-specific preprocessing:
  - Samples with empty or placeholder images were removed.
  - Items were deduplicated by mapping all items with identical images to a shared ID.
- Data Splitting: Train, validation, and test sets were obtained via chronological leave-one-out splitting: for each user, the latest interaction is held out for testing, the second latest for validation, and the rest are used for training, preserving temporal order (see the sketch after this list).
- Target Definition: For the Amazon datasets, each item per training sequence was used as a separate target. For PixelRec, only the last item in a sequence was used as the target.
- Maximum Sequence Length: The maximum item sequence length was set to 20.
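A minimal Python sketch of the chronological leave-one-out split (hypothetical data layout):

```python
def leave_one_out_split(user_interactions: dict):
    """Per user: last item -> test target, second-to-last -> validation target,
    remaining prefix -> training sequence.

    user_interactions: dict mapping user -> list of items sorted by timestamp.
    """
    train, valid, test = {}, {}, {}
    for user, items in user_interactions.items():
        if len(items) < 3:                      # need one item per split
            continue
        train[user] = items[:-2]
        valid[user] = (items[:-2], items[-2])   # (history, held-out target)
        test[user] = (items[:-1], items[-1])
    return train, valid, test
```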
Table 1 of the original paper reports the dataset statistics after preprocessing. The table is not reproduced here; refer to the original paper for the exact user, item, and interaction counts.
These datasets were chosen because their large scale and multimodal nature are ideal for validating MSCGRec's design, particularly its ability to handle large item sets and diverse feature types, where previous generative recommenders have struggled.
5.2. Evaluation Metrics
To evaluate the recommendation performance, the paper uses standard top-K evaluation metrics; per Table 2, $K$ is set to 1, 5, and 10.
- Recall@K:
  - Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully recommended within the top $K$ items. It captures how many of the truly desired items the system managed to retrieve; a higher Recall@K indicates that the recommender is good at identifying relevant items.
  - Mathematical Formula:
$ \mathrm{Recall@K} = \frac{\text{Number of relevant items in top-K recommendations}}{\text{Total number of relevant items}} $
  - Symbol Explanation:
    - Number of relevant items in top-K recommendations: the count of items that are both in the ground truth (items the user actually interacted with) and among the top $K$ items predicted by the recommender.
    - Total number of relevant items: the total count of ground-truth items for that user.
- Normalized Discounted Cumulative Gain (NDCG@K):
  - Conceptual Definition: NDCG@K measures ranking quality. It accounts for the position of relevant items in the recommendation list, giving higher scores to relevant items that appear earlier, and normalizes by the score of a perfect ranking. NDCG ranges from 0 to 1, with 1 being a perfect ranking.
  - Mathematical Formula:
$ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}, \qquad \mathrm{IDCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i^{ideal}} - 1}{\log_2(i+1)}, \qquad \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
  - Symbol Explanation:
    - $\mathrm{rel}_i$: the relevance score of the item at position $i$ in the recommended list; for binary relevance (relevant/not relevant), this is 1 or 0.
    - $\log_2(i+1)$: a discounting factor that reduces the contribution of items further down the list.
    - $\mathrm{rel}_i^{ideal}$: the relevance score of the item at position $i$ in the ideal (perfectly sorted) recommendation list.
    - $\mathrm{DCG@K}$: Discounted Cumulative Gain at position $K$.
    - $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at position $K$ (the highest possible DCG for a given set of relevant items).
- Mean Reciprocal Rank (MRR@K):
  - Conceptual Definition: MRR@K is commonly used when there is a single correct or highly relevant item to retrieve (e.g., in question answering). It measures the reciprocal of the rank of the first relevant item found; if no relevant item appears within the top $K$, the score is 0. The mean reciprocal rank averages these values over all queries, emphasizing early retrieval of a relevant item.
  - Mathematical Formula:
$ \mathrm{MRR@K} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{\mathrm{rank}_q} $
  - Symbol Explanation:
    - $|Q|$: the total number of queries (users or test samples).
    - $\mathrm{rank}_q$: the rank of the first relevant item for query $q$ in the recommendation list, restricted to be $\leq K$; if no relevant item is found within the top $K$, its reciprocal is taken as 0.
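For concreteness, a minimal Python implementation of these three metrics for a single user with binary relevance (illustrative only):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k list."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: 2^rel - 1 reduces to 1 for hits, 0 otherwise;
    position i (0-based) corresponds to rank i + 1, hence log2(i + 2)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg

def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for i, item in enumerate(ranked[:k]):
        if item in relevant:
            return 1.0 / (i + 1)
    return 0.0
```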
5.3. Baselines
The performance of MSCGRec was compared against both ID-based sequential recommendation methods and other generative recommendation baselines.
Sequential Recommendation Baselines:
These models typically rely on learning distinct embeddings for each item ID and model user sequences to predict the next item. They were implemented using the RecBole open-source framework.
- GRU4Rec (Hidasi et al., 2016a): An RNN-based model using Gated Recurrent Units to capture user behavior sequences.
- BERT4Rec (Sun et al., 2019): Employs bidirectional self-attention with a masked prediction objective to model user preference sequences.
- Caser (Tang & Wang, 2018): Utilizes convolutional neural networks with horizontal and vertical filters to capture high-order sequential patterns.
- SASRec (Kang & McAuley, 2018): A Transformer-based model that applies a decoder-only self-attention mechanism to model item correlations within user interaction sequences; known for its strong performance.
- FDSA (Zhang et al., 2019): Incorporates feature-level deeper self-attention networks to model both item and feature transition patterns.

Generative Recommendation Baselines: These models represent items as discrete codes and generate the next item's code sequence.

- TIGER (Rajput et al., 2023): A foundational generative recommendation method that obtains semantic codes by residual quantization of a unimodal embedding. The paper evaluates two variants: TIGERt (text) and TIGERi (images).
- LETTER (Wang et al., 2024a): Incorporates collaborative signals by aligning quantized code embeddings with a sequential recommender's item embeddings. The specific LETTER-TIGER variant was used.
- CoST (Zhu et al., 2024): Proposes a contrastive loss that encourages alignment of semantic embeddings before and after quantization.
- ETEGRec (Liu et al., 2025): Departs from the standard two-step training by cyclically optimizing the sequence encoder and item tokenizer, using alignment losses to keep sequence and collaborative item embeddings aligned.
- MQL4GRec (Zhai et al., 2025a): A recent multimodal generative recommender that uses modality-alignment losses to translate modalities into a unified language.

Implementation details for baselines:

- TIGER and CoST were implemented by the authors of MSCGRec.
- Public codebases were used for the other methods.
5.4. Implementation Details
- Text Embeddings:
  - For the Amazon datasets, LLAMA (Touvron et al., 2023) was used to extract text embeddings.
  - For PixelRec, the author-provided text embeddings were utilized.
- Collaborative Modality: The item embeddings from SASRec (Kang & McAuley, 2018) were used as the collaborative modality.
- Image Encoder Initialization: The image encoder was initialized from a DINO-pretrained ViT-S/14 (Vision Transformer, Small, patch size 14).
- Image Quantization Training:
  - Default DINO hyperparameters were retained, except that the number of small crops was reduced to 4 (from DINOv2, Oquab et al., 2024).
  - Training was performed for 30 epochs.
  - DINOv2 loss weights were retained, and $\alpha_3$ (the weight of the code commitment loss) was set to 0.01 to avoid overly strong interference with representation learning.
- Residual Quantization (RQ):
  - Individual residual quantizers (Zeghidour et al., 2021) were trained for each modality.
  - Each RQ consisted of 3 levels with 256 entries per level.
  - MSCGRec quantizes directly in the embedding space without additional encoder-decoder layers, as no performance benefits were observed with them.
  - An additional code level per modality was added to separate collisions into unique code sequences, following Rajput et al. (2023). Experiments with redistributing collisions into empty leaves (as in Zhai et al., 2025a) did not show improvements, which is attributed to MSCGRec's constrained training.
- Missing Modalities Training (Optional Extension): When enabled, one modality per item in the user history was randomly masked with a probability of 75%.
- Sequence Modeling:
  - A T5 (Raffel et al., 2020) encoder-decoder model was used.
  - Training ran for 25 epochs with early stopping.
  - Model configuration: eight self-attention heads of dimension 64, an MLP size of 2048.
  - Optimization: learning rate of 0.002, batch size of 2048.
- Target Modality: Based on validation performance, the collaborative modality's codes were chosen as the target codes for prediction.
- Output Embedding Table: The output embedding table was untied from the unimodal input code embeddings.
- Inference: Constrained beam search with 20 beams was used (see the sketch after this list).
- Hardware: Models were trained on four A100 GPUs using PyTorch 2 (Ansel et al., 2024).
6. Results & Analysis
6.1. Core Results Analysis
The paper presents a comprehensive comparison of MSCGRec against both sequential recommendation and generative recommendation baselines across various datasets and evaluation metrics.
The following are the results from Table 2 of the original paper:

| Dataset | Metric | GRU4Rec | BERT4Rec | Caser | SASRec | FDSA | TIGERt | TIGERi | LETTER | CoST | ETEGRec | MQL4GRec | MSCGRec | ΔGR | ΔR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | Recall@1 | 0.0046 | 0.0042 | 0.0029 | 0.0035 | 0.0050 | 0.0030 | 0.0045 | 0.0053 | 0.0043 | 0.0054 | 0.0048 | 0.0060 | +11.1% | +11.1% |
| Beauty | Recall@5 | 0.0155 | 0.0146 | 0.0105 | 0.0204 | 0.0169 | 0.0096 | 0.0148 | 0.0168 | 0.0147 | 0.0182 | 0.0148 | 0.0204 | +12.1% | +0.3% |
| Beauty | Recall@10 | 0.0247 | 0.0233 | 0.0174 | 0.0317 | 0.0270 | 0.0147 | 0.0226 | 0.0253 | 0.0231 | 0.0284 | 0.0237 | 0.0316 | +10.9% | - |
| Beauty | NDCG@5 | 0.0100 | 0.0094 | 0.0067 | 0.0122 | 0.0100 | 0.0063 | 0.0096 | 0.0111 | 0.0095 | 0.0118 | 0.0098 | 0.0132 | +11.9% | +8.2% |
| Sports | Recall@1 | 0.0030 | 0.0029 | 0.0026 | 0.0099 | 0.0032 | 0.0013 | 0.0009 | 0.0019 | 0.0009 | 0.0009 | 0.0018 | 0.0015 | +7.9% | - |
| Sports | Recall@5 | 0.0010 | 0.0000 | 0.0000 | 0.0030 | 0.0000 | 0.0030 | 0.0000 | 0.0030 | 0.0019 | 0.0123 | 0.0008 | 0.0018 | +9.5% | - |
| Sports | Recall@10 | 0.0025 | 0.0027 | 0.0014 | 0.0098 | 0.0025 | 0.0031 | 0.0061 | 0.0051 | 0.0009 | 0.0014 | 0.0022 | 0.0060 | +13.2% | - |
| Sports | NDCG@5 | 0.0010 | 0.0000 | 0.0000 | 0.0030 | 0.0000 | 0.0030 | 0.0000 | 0.0030 | 0.0019 | 0.0018 | 0.0008 | 0.0015 | +7.9% | - |
| PixelRec | Recall@1 | 0.0050 | 0.0050 | 0.0039 | 0.0062 | 0.0051 | 0.0052 | 0.0045 | 0.0045 | 0.0044 | 0.0051 | 0.0040 | 0.0053 | +3.9% | - |
| PixelRec | Recall@5 | 0.0150 | 0.032 | 0.0066 | 0.0065 | 0.0029 | 0.017 | 0.0150 | 0.0063 | 0.0071 | 0.0019 | 0.0095 | 0.0184 | +17.1% | +6.8% |
| PixelRec | Recall@10 | 0.0217 | 0.0127 | 0.0022 | 0.0203 | 0.0287 | 0.0203 | 0.9513 | 0.0234 | 0.0211 | 0.0000 | 0.0182 | 0.0234 | +2.1% | - |
| PixelRec | NDCG@5 | 0.0043 | 0.0057 | 0.0055 | 0.0080 | 0.0070 | 0.0060 | 0.0078 | 0.0101 | 0.0175 | 0.0079 | 0.0061 | 0.0073 | +2.09% | - |
Overall Performance:
- MSCGRec consistently achieves superior performance across the three large-scale datasets (Beauty, Sports, PixelRec) and the evaluated metrics (Recall@K, NDCG@K), as highlighted by the bolded MSCGRec entries in the paper's table.
- The ΔGR column reports the percentage improvement of MSCGRec over the best generative recommendation baseline. MSCGRec shows substantial improvements, ranging from +2.09% (PixelRec NDCG@5) to +17.1% (PixelRec Recall@5).
- The ΔR column reports the improvement over all recommendation baselines (sequential and generative). On Beauty and PixelRec, MSCGRec often outperforms even the best sequential recommenders, marking a significant achievement for generative recommendation. For example, on Beauty, MSCGRec improves Recall@1 by +11.1% and NDCG@5 by +8.2% over the best overall baseline.

Comparison with Sequential Recommenders:

- Among sequential recommendation models, SASRec generally shows strong performance, particularly at higher $K$ values. BERT4Rec performs well on PixelRec, while Caser struggles, indicating difficulty in adapting to the complexity of these large datasets.
- Crucially, MSCGRec outperforms SASRec and the other sequential recommenders in most cases, particularly on Recall@1 and NDCG@K for Beauty, and on Recall@5 and NDCG@5 for PixelRec. This is a pivotal finding, as previous generative methods failed to achieve this on large datasets, validating MSCGRec's design. The paper states: "to the best of our knowledge, we are the first work to showcase a generative recommendation method that beats sequential recommendation baselines at this scale."

Comparison with Generative Recommenders:

- The unimodal TIGER models (TIGERt and TIGERi) generally perform worse than MSCGRec, with TIGERi in particular performing poorly on Sports. This suggests that relying on a single modality, or on simple image quantization without the proposed RQ-DINO framework, is insufficient for complex datasets.
- Other advanced generative recommenders (LETTER, CoST, ETEGRec, MQL4GRec) show varied performance but are consistently outperformed by MSCGRec; the ΔGR column quantifies these improvements. For example, on Beauty, MSCGRec improves Recall@5 by +12.1% over ETEGRec, the best generative baseline in that cell.

Specific Dataset Observations:

- On Beauty and PixelRec, MSCGRec shows clear dominance.
- On Sports, MSCGRec generally performs well, though SASRec and ETEGRec are competitive in a few cells. Note that some Sports values in the transcription above appear internally inconsistent (e.g., ETEGRec's 0.0123 on Recall@5 versus MSCGRec's 0.0018, even though MSCGRec is marked best in the paper); these discrepancies likely stem from extraction errors in the transcribed table rather than from the paper's actual results.

The superior performance of MSCGRec validates its core design choices: the integration of multiple semantic modalities, the novel self-supervised image quantization, and the fusion of collaborative signals as a separate modality.
6.2. Data Presentation (Tables)
Table 2 is transcribed in full in Section 6.1 above and is not repeated here.
6.3. Ablation Studies / Parameter Analysis
The paper conducts an extensive ablation study to validate the impact of each component of MSCGRec.
The following are the results from Table 3 of the original paper, which focuses on the Beauty dataset and reports Recall@10 and NDCG@10. Columns 3-6 correspond to (a) Component Ablation, columns 7-9 to (b) Modality Ablation, and columns 10-11 to (c) Image-Only:

| Dataset | Metric | MSCGRec | w/o Pos. Emb. | w/o Const. Train. | w/ Masking | w/o Img | w/o Text | w/o Coll. | RQ-DINO | DINO |
|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | Recall@10 | 0.0315 | 0.0311 | 0.0291 | 0.0312 | 0.0308 | 0.0299 | 0.0275 | 0.0173 | 0.0158 |
| Beauty | NDCG@10 | 0.0168 | 0.0166 | 0.0154 | 0.0166 | 0.0163 | 0.0159 | 0.0146 | 0.0094 | 0.0086 |
6.3.1. Component Ablation (Table 3a)
This section investigates the contribution of the unique components of MSCGRec by removing them one by one, with respect to the full MSCGRec model.
- MSCGRec (Full Model): Serves as the baseline for comparison, achieving 0.0315 Recall@10 and 0.0168 NDCG@10.
- w/o Pos. Emb. (Without Positional Embedding): Removing the adapted positional embedding leads to a slight decrease (0.0311 Recall@10, 0.0166 NDCG@10), indicating that the novel positional embedding distinguishing across-item from within-item code relationships contributes positively to the model's understanding of the code structure.
- w/o Const. Train. (Without Constrained Training): Removing constrained sequence learning results in a more noticeable drop (0.0291 Recall@10, 0.0154 NDCG@10). This demonstrates the efficacy of restricting the model's output space to permissible codes, allowing it to focus its capacity on relevant differentiations rather than memorizing invalid sequences. This component is crucial for performance.
- w/ Masking (With Masking for Missing Modalities): This variant tests the impact of training with missing modalities. Performance (0.0312 Recall@10, 0.0166 NDCG@10) is very close to the full model, indicating that the masking strategy does not substantially alter performance while enabling the model to handle real-world scenarios with incomplete data, highlighting the flexibility and robustness of the multimodal framework.
6.3.2. Modality Ablation (Table 3b)
This section examines the individual contribution of each modality when MSCGRec is trained with the masking extension, to understand the impact of removing specific modalities from the input history.
- w/o Img (without image modality): Removing the image modality results in a slight drop (0.0308 Recall@10, 0.0163 NDCG@10) compared to the full model with masking (0.0312, 0.0166). This suggests that images provide valuable semantic signals, but the model can still perform robustly thanks to the other modalities.
- w/o Text (without text modality): Removing the text modality leads to a similarly modest decrease (0.0299 Recall@10, 0.0159 NDCG@10). This reinforces the idea that MSCGRec effectively leverages shared information across semantic modalities, maintaining performance even when one is absent.
- w/o Coll. (without collaborative modality): Removing the collaborative modality results in the most significant drop (0.0275 Recall@10, 0.0146 NDCG@10). This strongly indicates that the collaborative information, integrated by treating sequential recommender embeddings as a separate modality, is the single strongest contributor to MSCGRec's performance (a sketch of one way to discretize such embeddings follows this list). Even without collaborative features, MSCGRec (with just image and text modalities) still outperforms most of the generative recommendation baselines from Table 2, including LETTER, CoST, and MQL4GRec, though it trails ETEGRec slightly.

These modality ablations underscore the flexibility and resilience of MSCGRec's multimodal framework. The model can learn to leverage redundancy and complementary information across modalities, which is particularly useful for real-world datasets with varying data availability.
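The collaborative modality is built from sequential-recommender item embeddings (the paper uses SASRec, as discussed in Section 7.3). The paper's exact discretization is not reproduced here, but a plausible sketch of greedy residual quantization, which would turn such embeddings into short code sequences alongside the semantic modalities, looks like this (NumPy; codebook training omitted, all names hypothetical):

```python
import numpy as np

def residual_quantize(embs, codebooks):
    """Assign each embedding a short code sequence by greedy residual
    quantization: at every level, pick the nearest codeword, subtract it,
    and quantize what remains.

    embs:      (n_items, d) item embeddings from a sequential recommender
    codebooks: list of (n_codes, d) arrays, one per quantization level
    returns:   (n_items, n_levels) integer codes
    """
    residual = embs.copy()
    codes = []
    for cb in codebooks:
        # nearest codeword per item under squared Euclidean distance
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)
        codes.append(idx)
        residual = residual - cb[idx]
    return np.stack(codes, axis=1)

# Toy usage: 5 items, 8-dim embeddings, two levels of 4 codewords each.
rng = np.random.default_rng(0)
embs = rng.normal(size=(5, 8))
codebooks = [rng.normal(size=(4, 8)) for _ in range(2)]
print(residual_quantize(embs, codebooks))  # (5, 2) array of code ids
```

The resulting code columns can then be appended to each item's semantic codes, which is what lets the generative model consume collaborative signals as "just another modality."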
6.3.3. Image-Only Analysis (Table 3c)
This section specifically investigates the effectiveness of the proposed self-supervised quantization learning for images.
- RQ-DINO (proposed image quantization): Performance when MSCGRec is run with only the image modality, using the proposed RQ-DINO approach for image code generation: 0.0173 Recall@10 and 0.0094 NDCG@10.
- DINO (standard DINO with post-hoc RQ): A common baseline in which a DINO-pretrained model serves as the image encoder and residual quantization is applied post-hoc, i.e., after training the DINO encoder, without integrating RQ into the self-supervised learning loop. This yields lower performance: 0.0158 Recall@10 and 0.0086 NDCG@10.

The comparison clearly demonstrates that the proposed RQ-DINO method improves over the standard post-hoc RQ approach. This suggests that integrating residual quantization directly into the image encoder's self-supervised learning framework is crucial: it allows the quantization process to learn semantically relevant representations and to ignore the "unimportant high frequencies" that a reconstruction-based (or post-hoc) RQ would try to preserve even though they are not useful for recommendation.
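To make the distinction concrete, here is a hypothetical PyTorch sketch of an RQ bottleneck placed inside a DINO-style student, so that the self-distillation objective, rather than pixel reconstruction, shapes the codebooks. The straight-through estimator and the VQ-style codebook loss are my assumptions, not details confirmed by the paper:

```python
import torch
import torch.nn as nn

class RQBottleneck(nn.Module):
    """Residual quantizer placed inside a DINO-style student so the
    self-distillation loss, not pixel reconstruction, shapes the codebooks."""

    def __init__(self, n_levels=3, n_codes=256, dim=384):
        super().__init__()
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(n_codes, dim) * 0.02) for _ in range(n_levels)]
        )

    def forward(self, z):
        residual, quantized = z, torch.zeros_like(z)
        for cb in self.codebooks:
            idx = torch.cdist(residual, cb).argmin(dim=1)  # nearest codeword
            q = cb[idx]
            quantized = quantized + q
            residual = residual - q.detach()
        codebook_loss = (quantized - z.detach()).pow(2).mean()  # pulls codewords to features
        z_q = z + (quantized - z).detach()  # straight-through: encoder gets identity grads
        return z_q, codebook_loss

# In an RQ-DINO-style setup, z_q (not z) would feed the student's projection
# head, so the DINO loss trains encoder and codebooks end-to-end; the post-hoc
# baseline instead fits codebooks to a frozen, fully trained DINO encoder.
student_features = torch.randn(8, 384)
z_q, aux = RQBottleneck()(student_features)
print(z_q.shape, aux.item())
```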
In summary, the ablation studies rigorously validate the contribution of each component of MSCGRec, highlighting the critical role of constrained training, the novel RQ-DINO for images, and especially the powerful integration of collaborative signals as a distinct modality.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces MSCGRec (Multimodal Semantic and Collaborative Generative Recommender), a novel approach that significantly advances the field of generative recommendation. MSCGRec successfully addresses key limitations of prior generative models, particularly their struggles on large datasets and their limited use of diverse data modalities.
The core contributions of MSCGRec are:
- Multimodal integration: It seamlessly incorporates multiple semantic modalities (text and images) alongside collaborative signals, treating the latter as a distinct modality derived from sequential recommenders. This allows for a richer and more comprehensive item representation.
- Innovative image quantization: It proposes RQ-DINO, a novel self-supervised quantization learning framework for images based on DINO. This method ensures that the generated image codes are semantically meaningful for recommendation, moving beyond simple image reconstruction.
- Enhanced sequence learning: MSCGRec introduces constrained sequence learning, which restricts the model's output space during training to permissible tokens only. This prevents shortcut learning and focuses the model's capacity on differentiating valid code sequences. Additionally, an adapted positional embedding improves the model's understanding of the code structure (a sketch of such an embedding follows below).

Empirical evaluations on three large-scale, real-world datasets demonstrate that MSCGRec consistently outperforms both existing generative recommendation baselines and, notably, traditional sequential recommendation baselines. This marks a crucial achievement, as it validates the practical utility of the generative recommendation paradigm for large item sets, where memory and computational efficiency are paramount. The extensive ablation study further confirms the effectiveness and individual contributions of each proposed component. MSCGRec also proves capable of handling missing modalities, a valuable feature for real-world applications.
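As referenced in the third bullet above, a minimal sketch of a positional embedding that separates across-item position (which item in the history) from within-item position (which code inside that item) might look as follows; this is a hypothetical parameterization, not the paper's exact design:

```python
import torch
import torch.nn as nn

class TwoLevelPositionalEmbedding(nn.Module):
    """Sums an across-item position embedding with a within-item code
    position embedding, so the model can tell code boundaries apart."""

    def __init__(self, max_items, codes_per_item, dim):
        super().__init__()
        self.item_pos = nn.Embedding(max_items, dim)       # across-item index
        self.code_pos = nn.Embedding(codes_per_item, dim)  # within-item index
        self.codes_per_item = codes_per_item

    def forward(self, seq_len):
        t = torch.arange(seq_len)
        return self.item_pos(t // self.codes_per_item) + self.code_pos(t % self.codes_per_item)

# A flat sequence of 3 items x 4 codes gets 12 positional vectors that repeat
# the within-item pattern while advancing the across-item index.
pe = TwoLevelPositionalEmbedding(max_items=50, codes_per_item=4, dim=64)
print(pe(12).shape)  # torch.Size([12, 64])
```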
7.2. Limitations & Future Work
The authors acknowledge a limitation regarding the modality ablation studies, stating that "The impact of the modality ablation is inherently dataset-dependent, and the observed effects may differ across various datasets and domains." This implies that the specific contributions of each modality might vary based on the nature of the dataset (e.g., how rich or sparse text/image data is, the strength of collaborative patterns).
For future work, the paper suggests:
- Exploring the generalization of the proposed self-supervised quantization learning to other modalities. For example, the authors mention dino.txt (Jose et al., 2025), which could extend the RQ-DINO approach to text or other sequential data.
7.3. Personal Insights & Critique
This paper presents a significant step forward for generative recommendation. The critical insight that generative models need to explicitly fuse collaborative and semantic signals, rather than relying solely on semantics, is well-supported by the results. Treating collaborative features as just another modality is an elegant solution to this fusion problem, avoiding complex multi-objective loss functions.
The RQ-DINO approach for image quantization is particularly insightful. Shifting from a reconstruction objective to a self-supervised semantic extraction objective directly aligns the quantization process with the goals of a recommender system. This highlights a broader principle: the pre-processing and representation learning stages (like quantization) for specialized AI tasks (like recommendation) should be tailored to the task's specific needs, not just generic data compression or reconstruction.
The constrained sequence learning is a valuable optimization that could benefit many autoregressive generation tasks beyond recommendation. It's a clever way to improve training efficiency by pruning the search space of invalid outputs, preventing shortcut learning and focusing model capacity. This method seems quite generalizable and could be an important contribution to the broader field of sequence generation.
One potential area for deeper exploration could be the interpretability of the generated code sequences. If items are represented by codes, understanding why certain codes are generated might offer insights into user preferences or item similarities that are currently opaque in black-box embedding-based systems. While not a direct limitation of MSCGRec's performance, enhanced interpretability could further boost adoption.
Another unverified assumption concerns the quality and representational power of the SASRec item embeddings chosen for the collaborative modality. While SASRec is a strong baseline, its embeddings may not capture every nuance of the collaborative signal, and further research could explore more advanced sources of collaborative embeddings.
Overall, MSCGRec offers a robust framework that successfully bridges the performance gap between generative and sequential recommenders on large datasets. Its modular design, combining multimodal inputs, specialized quantization, and efficient sequence learning, provides a clear roadmap for future research in generative AI applied to recommender systems.