CEMG: Collaborative-Enhanced Multimodal Generative Recommendation
TL;DR Summary
The CEMG framework enhances multimodal generative recommendations by dynamically integrating visual and textual features with collaborative signals, overcoming key challenges, and significantly outperforming existing methods.
Abstract
Generative recommendation models often struggle with two key challenges: (1) the superficial integration of collaborative signals, and (2) the decoupled fusion of multimodal features. These limitations hinder the creation of a truly holistic item representation. To overcome this, we propose CEMG, a novel Collaborative-Enhanced Multimodal Generative Recommendation framework. Our approach features a Multimodal Fusion Layer that dynamically integrates visual and textual features under the guidance of collaborative signals. Subsequently, a Unified Modality Tokenization stage employs a Residual Quantization VAE (RQ-VAE) to convert this fused representation into discrete semantic codes. Finally, in the End-to-End Generative Recommendation stage, a large language model is fine-tuned to autoregressively generate these item codes. Extensive experiments demonstrate that CEMG significantly outperforms state-of-the-art baselines.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of this paper is CEMG: Collaborative-Enhanced Multimodal Generative Recommendation. It focuses on developing a novel framework for recommender systems that leverages collaborative signals and multimodal data (images and text) to generate personalized recommendations.
1.2. Authors
The authors of the paper are:
- Yuzhen Lin - School of Information Systems and Management, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Hongyi Chen - Samueli School of Engineering, University of California, Los Angeles, CA 90095, USA
- Xuanjing Chen - Columbia Business School, Columbia University, New York, NY 10027, USA
- Shaowen Wang - Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA
- Ivonne Xu - Department of Physics, University of Chicago, Chicago, IL 60637, USA
- Dongming Jiang - Department of Computer Science, Rice University, Houston, TX 77005, USA
1.3. Journal/Conference
This paper was posted on 2025-12-25 (UTC). The original source is an arXiv preprint, which indicates it is a pre-publication version of a scholarly paper that has not yet undergone peer review or been formally accepted by a journal or conference. However, arXiv is a widely respected platform for disseminating research in fields such as computer science and physics.
1.4. Publication Year
The publication year, based on the provided UTC timestamp, is 2025.
1.5. Abstract
The paper addresses two main challenges in generative recommendation models: (1) superficial integration of collaborative signals and (2) decoupled fusion of multimodal features, both of which prevent the creation of comprehensive item representations. To tackle these, the authors propose CEMG, a Collaborative-Enhanced Multimodal Generative Recommendation framework. CEMG introduces a Multimodal Fusion Layer that dynamically integrates visual and textual features, guided by collaborative signals. This fused representation is then converted into discrete semantic codes using a Residual Quantization VAE (RQ-VAE) in a Unified Modality Tokenization stage. Finally, an End-to-End Generative Recommendation stage fine-tunes a large language model (LLM) to autoregressively generate these item codes. Experimental results demonstrate that CEMG significantly outperforms existing state-of-the-art baselines.
1.6. Original Source Link
The official source link is: https://arxiv.org/abs/2512.21543v1. This is a link to the paper on arXiv, a preprint server. Its publication status is a preprint.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inherent limitation of existing generative recommendation models in creating truly holistic item representations. This limitation stems from two specific challenges:
- Superficial Integration of Collaborative Signals: While multimodal content (such as images and text) offers rich semantic descriptions, collaborative signals—patterns derived from collective user behavior—are crucial for personalization. Many generative models only incorporate this information as a supplementary feature or through shallow alignment, failing to capture the complex, high-order relationships that reveal latent user preferences and item-to-item correlations beyond mere content similarity.
- Decoupled Fusion of Multimodal Features: Current frameworks often treat multimodal content and collaborative signals as separate entities, fusing them in a late or disjointed manner. This separation hinders the model from understanding the intricate interplay between an item's intrinsic attributes (what it is) and its contextual role within the user community (how it is perceived). For example, items that are visually different might be functional substitutes, a nuance only detectable through deep, synergistic fusion.

These challenges prevent generative recommendation from fully realizing its potential, particularly in generalizing to new or long-tail items and effectively leveraging multimodal content. The paper's innovative idea is to create a deeply unified item representation by synergistically fusing content semantics with collaborative wisdom, specifically tailored for a powerful generative recommendation engine.
2.2. Main Contributions / Findings
The paper's primary contributions are summarized as follows:
- Novel Framework (CEMG) for Deep Multimodal-Collaborative Fusion: CEMG is proposed as a novel generative recommendation framework that, for the first time, employs a collaborative-guided mechanism to deeply fuse multimodal content with high-order collaborative signals into a unified semantic space for item tokenization.
- Elegant and Effective Architecture: The paper designs a Multimodal Fusion Layer that dynamically enhances item representations by aligning content features with their collaborative context.
- End-to-End Generative Pipeline with LLMs: An end-to-end generative pipeline is developed that leverages the power of Large Language Models (LLMs) for recommendation. This pipeline is further enhanced with a constrained decoding strategy to ensure recommendation validity and efficiency.
- Extensive Experimental Validation: The authors conduct extensive experiments on three benchmark datasets, demonstrating that CEMG significantly outperforms a wide array of state-of-the-art baselines.

The key findings demonstrate that CEMG consistently and significantly outperforms all baseline models across various metrics and datasets. This superiority is attributed to its ability to create a deeply unified semantic representation that synergistically integrates multimodal content with collaborative signals, combined with the power of LLMs for generation. Notably, CEMG shows substantial improvement in handling cold-start items, validating its robust generalization capabilities even with sparse interaction data.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the CEMG framework, a reader should be familiar with several foundational concepts in machine learning and recommender systems:
- Recommender Systems (RS): These are information filtering systems that predict what a user might like. Their goal is to alleviate information overload by personalizing user experiences, typically by suggesting items (products, movies, articles, etc.) that are relevant to a user's preferences.
- Collaborative Filtering (CF): A fundamental technique in recommender systems that makes predictions about a user's interests by collecting preference or taste information from many users (collaborating). The underlying assumption is that if users A and B have similar tastes, and user A liked item X, then user B is likely to like item X as well. Traditional CF methods often rely on user-item interaction matrices.
- Sequential Recommendation: A sub-field of recommender systems that considers the chronological order of user interactions. Instead of just predicting static preferences, sequential recommenders aim to predict the next item a user will interact with, given their past sequence of interactions. This captures dynamic user interests and transitions.
- Multimodal Recommendation: This approach enhances recommendation performance by leveraging auxiliary information from multiple modalities beyond just implicit user-item interactions or item IDs. These modalities often include text (e.g., item descriptions, reviews), images (e.g., product photos), audio, or video. The goal is to create richer item representations that capture various aspects of an item.
- Generative Recommendation: A transformative paradigm that redefines recommendation as a sequence generation task. Instead of predicting a specific item ID from a fixed set (as in traditional discriminative models), generative models represent each item as a sequence of semantic tokens and then learn to generate these tokens for the next recommended item. This allows for greater flexibility, handles cold-start items more effectively, and can produce diverse recommendations.
- Graph Neural Networks (GNNs): A class of neural networks designed to operate on graph-structured data. They learn representations (embeddings) of nodes (e.g., users, items) by iteratively aggregating information from their local neighborhood in the graph. LightGCN, mentioned in the paper, is a simplified yet effective GNN for recommendation that focuses on propagating embeddings over the user-item interaction graph.
- Variational Autoencoder (VAE): A type of generative model that learns a compressed, continuous latent representation (embedding) of the input data. It consists of an encoder that maps the input to a latent space and a decoder that reconstructs the input from a sample in the latent space. VAEs are trained so that the latent space is well-structured and allows for meaningful generation.
- Vector Quantization (VQ) / Residual Quantization VAE (RQ-VAE): Vector Quantization (VQ) maps continuous vectors to discrete, learnable codebook entries (tokens). RQ-VAE is an extension in which quantization is performed iteratively across multiple codebook layers: each layer quantizes the residual (the part of the vector not yet explained by the previous layers), producing a sequence of discrete semantic tokens that compactly represents the original continuous vector. This is crucial for bridging the gap between continuous latent representations and the discrete tokens required by Large Language Models.
- Large Language Models (LLMs): Powerful neural networks, often based on the Transformer architecture, trained on massive amounts of text data. They excel at sequence-to-sequence generation tasks such as language translation, text summarization, and, in this context, generating sequences of item tokens. T5 is the LLM used in this paper.
- Attention Mechanism: A core component of Transformer models. It allows the model to dynamically weigh the importance of different parts of the input sequence when processing each element. The general self-attention mechanism is calculated as:
  $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
  Where:
  - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
  - $QK^T$ calculates the similarity between queries and keys.
  - $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the keys, used to prevent large dot products from pushing the softmax into regions with tiny gradients.
  - softmax normalizes the attention scores.
  - The result is a weighted sum of the values, where the weights are determined by the attention scores.

  In CEMG, a guided attention mechanism is used where collaborative embeddings act as queries to weigh visual and textual features, as illustrated by the sketch below.
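For readers who prefer code, here is a minimal NumPy sketch of the general scaled dot-product attention described above; it is illustrative only (toy shapes and random values), not the paper's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (n_q, n_k) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of the values

# Toy example: one query attending over three key/value vectors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 8))
K = rng.normal(size=(3, 8))
V = rng.normal(size=(3, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (1, 8)
```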
3.2. Previous Works
The paper contextualizes its contributions by discussing several key prior studies:
- Multimodal Recommendation:
  - VBPR [3]: An early work that integrated pre-trained visual features into matrix factorization, a foundational CF technique.
  - ACF [1] and UVCAN [14]: Employed attention mechanisms to dynamically select informative content from multimodal data.
  - MMGCN [27]: Used Graph Neural Networks (GNNs) to model complex relationships and propagate information across a multimodal graph.
  - MISSRec [25] and MMSRec [23]: Explored self-supervised learning and modality-specific modeling for sequential recommendation.

  These works are primarily discriminative approaches, meaning they predict a rating or select from a predefined set of items, which can be computationally expensive and struggle with issues such as false negatives.
- Generative Recommendation:
- Text-based approaches [9]: Simple methods that map items to discrete token sequences, often directly from textual descriptions.
  - VQ-based models (e.g., TIGER [20], LETTER [26]): These models use Vector Quantization (VQ) techniques, often RQ-VAE [8], to learn semantic codes from item features. LETTER notably improved on this by incorporating collaborative signals to align the learned codes.
  - MMGRec [11]: Incorporated multimodal features into the generation pipeline, using graph-based architectures to tokenize fused multimodal information.
- Underlying Technologies:
  - LightGCN [4]: A simplified but effective GNN used in CEMG for collaborative feature encoding.
  - VGG [22]: A widely used Convolutional Neural Network (CNN) for image classification, employed as the Visual Encoder.
  - BERT [24]: A powerful Transformer-based language model for text understanding, used as the Textual Encoder.
  - T5 [19]: A Transformer-based Large Language Model used as the generative backbone in CEMG.
3.3. Technological Evolution
The field of recommender systems has evolved significantly:
- Traditional Collaborative Filtering (CF): Began with matrix factorization and neighborhood-based methods (e.g., Koren et al. [7]). These primarily relied on item IDs and interaction patterns, struggling with cold-start and long-tail items due to their inability to understand item semantics.
- Sequential Recommendation: Introduced the concept of the user session or interaction sequence (e.g., GRU4Rec [5], SASRec [6]), moving beyond static preferences to model dynamic user interests using Recurrent Neural Networks (RNNs) or Transformers, while remaining largely ID-based.
- Multimodal Recommendation: Began integrating rich content features (images, text) to enhance item representations (e.g., VBPR [3], ACF [1]). These improved semantic understanding but often remained in the embed-and-retrieve paradigm, where item embeddings are learned and then a similarity search is performed. Fusion strategies evolved from simple concatenation to attention mechanisms and GNNs.
- Generative Recommendation: The most recent paradigm shift, inspired by Large Language Models. It re-frames recommendation as a sequence-to-sequence generation task, moving from predicting item IDs to generating semantic token sequences that represent items (e.g., TIGER [20], LETTER [26], MMGRec [11]). This offers greater flexibility and addresses cold-start issues by generating novel combinations of features.

CEMG fits into this timeline by pushing the boundaries of generative recommendation. It addresses the current shortcomings of generative models by offering a more sophisticated multimodal and collaborative-signal integration strategy, moving towards a truly holistic item representation before tokenization and LLM-based generation.
3.4. Differentiation Analysis
Compared to the main methods in related work, CEMG introduces several core differences and innovations:
- Deep, Collaborative-Guided Multimodal Fusion: Unlike existing generative models that perform multimodal feature fusion in a decoupled or superficial manner, or that tokenize primarily based on unimodal data (e.g., TIGER), CEMG introduces a novel Multimodal Fusion Layer. This layer dynamically integrates visual and textual features under the explicit guidance of collaborative signals. The collaborative embedding acts as a query to weigh the importance of the different modalities, ensuring that content features are interpreted within their social and behavioral context. This is a significant improvement over LETTER, which aligns quantized representations with collaborative embeddings but does not guide the fusion process itself.
- Unified Semantic Space for Tokenization: By deeply fusing all information sources before tokenization, CEMG creates a truly holistic, unified item representation. This contrasts with methods that tokenize based on unimodal data or use shallow fusion, which can lead to fragmented semantic tokens. The RQ-VAE then converts this rich, unified representation into compact, discrete semantic tokens, ensuring that the generated codes encapsulate both content and collaborative wisdom.
- End-to-End LLM-based Generation with Constrained Decoding: While LLM-based methods like LlamaRec [28] and LLM-ESR [13] use LLMs, they often operate on item titles or raw text, directly prompting the LLM to generate item names or IDs. CEMG, in contrast, leverages a powerful LLM (T5) to autoregressively generate the semantic token sequences of items, which are learned from the deeply fused multimodal and collaborative features. This two-stage approach (tokenization then generation) allows for more structured and controllable generation. Furthermore, CEMG employs a prefix tree (Trie)-based constrained decoding strategy during inference, which is crucial for ensuring that the LLM generates only valid item token sequences, a practical consideration often overlooked or less robustly addressed in simpler LLM-based recommendation approaches.
- Robustness to Cold-Start Items: By embedding collaborative signals into the multimodal fusion and tokenization process, CEMG is designed to generalize better, especially for cold-start items. Even with sparse interaction data, the rich multimodal content, guided by some (possibly limited) collaborative context, can still form meaningful semantic tokens, outperforming ID-based methods and even other content-aware baselines.
4. Methodology
4.1. Principles
The core principle of CEMG is to overcome the limitations of superficial collaborative signal integration and decoupled multimodal feature fusion in generative recommendation by creating a truly holistic item representation. This is achieved through a collaborative-guided attention mechanism that dynamically combines visual and textual features with collaborative embeddings. This unified representation is then transformed into discrete semantic tokens using a Residual Quantization VAE (RQ-VAE). Finally, recommendation is reframed as a conditional language generation task, where a Large Language Model (LLM) autoregressively generates these item tokens, ensuring that the generated recommendations are semantically rich, contextually relevant, and align with user preferences derived from both content and collective behavior.
4.2. Core Methodology In-depth (Layer by Layer)
The CEMG framework comprises three main components: the Multimodal Encoding Layer, Unified Modality Tokenization, and End-to-End Generative Recommendation. The overall architecture is depicted in Figure 1.

Fig. 1: The overall architecture of the CEMG framework. The framework is composed of three main components. The Multimodal Encoding Layer integrates visual ($\mathbf{e}_i^v$), collaborative ($\mathbf{e}_i^c$), and textual ($\mathbf{e}_i^t$) features via the Multimodal Fusion Layer to produce a unified representation $\mathbf{z}_i$. The Unified Modality Tokenization stage, utilizing a Residual Quantization VAE (RQ-VAE), converts $\mathbf{z}_i$ into a discrete sequence of semantic tokens. Finally, the End-to-End Generative Recommendation module takes historical token sequences as input and autoregressively generates the tokens for the next recommended item.
4.2.1. Problem Definition
The problem is formally defined as predicting the top-$K$ items a user $u$ is most likely to interact with next, given their historical interactions $\mathcal{S}_u = \{i_1, i_2, \dots, i_L\}$. Each item $i$ is associated with multimodal content ($v_i$ for image, $t_i$ for text).

Instead of using atomic item IDs, each item $i$ is represented as a sequence of $M$ discrete semantic tokens, denoted as $\mathbf{c}_i = (c_{i,1}, \dots, c_{i,M})$. Each token is an index from a codebook. The recommendation task is thus transformed into generating the token sequence for the next item based on the historical token sequences corresponding to $\mathcal{S}_u$.

Formally, the model aims to learn the probability:
$
P\left(\mathbf{c}_{i_{L+1}} \mid \{\mathbf{c}_{i_j}\}_{j=1}^{L}\right) = \prod_{m=1}^{M} P\left(c_{i_{L+1},m} \mid \{\mathbf{c}_{i_j}\}_{j=1}^{L}, c_{i_{L+1},1}, \dots, c_{i_{L+1},m-1}\right)
$
Where:
- $P(\mathbf{c}_{i_{L+1}} \mid \{\mathbf{c}_{i_j}\}_{j=1}^{L})$ is the probability of generating the entire sequence of tokens for the $(L+1)$-th item, given the user's historical interaction sequence.
- $\prod_{m=1}^{M}$ denotes the product over the $M$ tokens in the sequence.
- $P(c_{i_{L+1},m} \mid \{\mathbf{c}_{i_j}\}_{j=1}^{L}, c_{i_{L+1},1}, \dots, c_{i_{L+1},m-1})$ is the probability of generating the $m$-th token of the $(L+1)$-th item, conditioned on all previous item token sequences in the history (up to $L$ items) and all previously generated tokens for the current item (from $1$ to $m-1$). This reflects the autoregressive nature of the generation process.
- $\mathbf{c}_{i_j}$ represents the token sequence for item $i_j$.
4.2.2. Multimodal Encoding Layer
This layer is responsible for learning a unified, dense representation for each item that combines its multimodal and collaborative characteristics.
4.2.2.1. Multimodal Feature Encoding
For each item $i$, features are extracted from its associated image $v_i$ and text $t_i$ using pre-trained encoders.

- Visual Encoder: A pre-trained VGG network [22] is used to process the image $v_i$ and extract its visual features. VGG is a Convolutional Neural Network (CNN) known for its deep architecture, effective in image recognition tasks.
- Textual Encoder: A pre-trained BERT model [24] is employed to encode the textual description $t_i$. BERT (Bidirectional Encoder Representations from Transformers) is a powerful Transformer-based language model that captures contextual information from text. The embedding of the `[CLS]` token, a special classification token, is taken as the text representation for the entire input sequence.

The raw feature vectors from VGG and BERT are then passed through a Principal Component Analysis (PCA) layer for dimensionality reduction. This yields the final visual and textual embeddings, $\mathbf{e}_i^v$ and $\mathbf{e}_i^t$, respectively, both of dimension $d$ (the reduced dimension).
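As an illustration of this encoding step, the sketch below reduces hypothetical VGG and BERT feature matrices with PCA; the raw feature dimensions (4096, 768) and the reduced dimension d = 128 are assumptions made for the example, not values taken from the paper:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder raw features: in CEMG these would come from a pre-trained VGG
# (e.g., a 4096-d activation) and BERT (the 768-d [CLS] embedding).
num_items = 1000
raw_visual = np.random.randn(num_items, 4096)    # hypothetical VGG features
raw_textual = np.random.randn(num_items, 768)    # hypothetical BERT [CLS] features

d = 128  # reduced embedding dimension (illustrative value)
visual_emb = PCA(n_components=d).fit_transform(raw_visual)    # e_i^v for all items
textual_emb = PCA(n_components=d).fit_transform(raw_textual)  # e_i^t for all items
print(visual_emb.shape, textual_emb.shape)  # (1000, 128) (1000, 128)
```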
4.2.2.2. Collaborative Feature Encoding
To capture collaborative signals (community preferences and interaction patterns), user-item interactions are modeled as a bipartite graph $\mathcal{G} = (\mathcal{U} \cup \mathcal{I}, \mathcal{E})$. Here, $\mathcal{U}$ is the set of users, $\mathcal{I}$ is the set of items, and an edge $(u, i) \in \mathcal{E}$ exists if user $u$ has interacted with item $i$.
LightGCN [4], a simplified yet powerful Graph Neural Network (GNN), is then used to learn user and item embeddings. LightGCN propagates embeddings by aggregating messages from a node's neighborhood over multiple layers, effectively distilling high-order connectivity patterns from the interaction graph. This process yields a collaborative embedding $\mathbf{e}_i^c$ for item $i$.
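A minimal, self-contained sketch of LightGCN-style propagation on a toy bipartite graph is shown below; the normalization, layer count, and embedding sizes are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def lightgcn_propagate(adj_norm, user_emb, item_emb, num_layers=3):
    """LightGCN-style propagation: repeatedly aggregate neighbor embeddings on the
    normalized user-item adjacency matrix and average the layer outputs."""
    E = np.concatenate([user_emb, item_emb], axis=0)   # (|U|+|I|, d)
    layer_sum = E.copy()
    for _ in range(num_layers):
        E = adj_norm @ E                                # one round of neighborhood aggregation
        layer_sum += E
    E_final = layer_sum / (num_layers + 1)              # layer combination (mean)
    return E_final[:len(user_emb)], E_final[len(user_emb):]

# Toy bipartite graph with 3 users and 4 items.
n_u, n_i, d = 3, 4, 8
R = (np.random.rand(n_u, n_i) > 0.5).astype(float)      # user-item interaction matrix
A = np.zeros((n_u + n_i, n_u + n_i))
A[:n_u, n_u:], A[n_u:, :n_u] = R, R.T                    # symmetric bipartite adjacency
deg = A.sum(1, keepdims=True) + 1e-8
A_norm = A / np.sqrt(deg) / np.sqrt(deg.T)               # D^{-1/2} A D^{-1/2}
users, items = lightgcn_propagate(A_norm, np.random.randn(n_u, d), np.random.randn(n_i, d))
print(items.shape)  # (4, 8) -> collaborative item embeddings e_i^c
```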
4.2.2.3. Multimodal Fusion Layer
This is a key innovation where the collaborative embedding acts as a guide to dynamically integrate visual and textual features. The hypothesis is that an item's collaborative context should determine the relative importance of its visual versus textual attributes. A guided attention mechanism is designed for this purpose:
- The collaborative embedding $\mathbf{e}_i^c$ acts as the query.
- The visual embedding $\mathbf{e}_i^v$ and textual embedding $\mathbf{e}_i^t$ serve as both keys and values.

The attention weights are computed as:
$
\alpha_m = \frac{\exp\left( (\mathbf{e}_i^c W_Q)(\mathbf{e}_i^m W_K)^\top / \sqrt{d} \right)}{\sum_{m' \in \{v, t\}} \exp\left( (\mathbf{e}_i^c W_Q)(\mathbf{e}_i^{m'} W_K)^\top / \sqrt{d} \right)}
$
Where:
- $\alpha_m$ is the attention weight for modality $m$ (either visual $v$ or textual $t$).
- $\mathbf{e}_i^c$ is the collaborative embedding for item $i$.
- $\mathbf{e}_i^m$ is the embedding for modality $m$ (either $\mathbf{e}_i^v$ or $\mathbf{e}_i^t$).
- $W_Q$ and $W_K$ are learnable projection matrices that transform the query and key embeddings into a suitable space for dot-product similarity calculation.
- $(\mathbf{e}_i^c W_Q)(\mathbf{e}_i^m W_K)^\top / \sqrt{d}$ is the scaled dot-product similarity between the transformed collaborative query and the transformed modality key.
- $\exp(\cdot)$ is the exponential function, used to ensure positive values.
- The denominator is the softmax normalization term, which ensures the attention weights sum to 1.

The final fused representation $\mathbf{z}_i$ is a concatenation of the weighted multimodal features and the guiding collaborative feature:
$
\mathbf{z}_i = \left[\, \alpha_v \mathbf{e}_i^v + \alpha_t \mathbf{e}_i^t \;;\; \mathbf{e}_i^c \,\right]
$
Where:
- $\alpha_v$ and $\alpha_t$ are the attention weights for the visual and textual modalities, respectively.
- $\mathbf{e}_i^v$ and $\mathbf{e}_i^t$ are the visual and textual embeddings.
- $\alpha_v \mathbf{e}_i^v + \alpha_t \mathbf{e}_i^t$ denotes the element-wise addition of the weighted visual and textual embeddings, creating a combined content representation.
- $\mathbf{e}_i^c$ is the collaborative embedding.
- $[\,\cdot\,;\,\cdot\,]$ denotes concatenation of the combined content representation and the collaborative embedding. This creates a unified vector $\mathbf{z}_i$ that holistically represents item $i$ by integrating content and collaborative signals. A minimal sketch of this fusion step is given below.
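The following toy sketch mirrors the fusion equations above: the collaborative embedding scores each modality, a softmax produces the weights, and the weighted content vector is concatenated with the collaborative embedding. The shapes and projection matrices are placeholders, not the paper's trained parameters:

```python
import numpy as np

def collaborative_guided_fusion(e_c, e_v, e_t, W_q, W_k, d):
    """Collaborative-guided fusion: e_c queries the visual/textual embeddings and the
    softmax-weighted content vector is concatenated with e_c to form z_i."""
    def score(e_m):
        return np.dot(e_c @ W_q, e_m @ W_k) / np.sqrt(d)   # scaled dot-product similarity
    logits = np.array([score(e_v), score(e_t)])
    alpha_v, alpha_t = np.exp(logits) / np.exp(logits).sum()  # softmax over the two modalities
    content = alpha_v * e_v + alpha_t * e_t                   # weighted content mix
    return np.concatenate([content, e_c])                     # z_i = [content ; e_c]

d = 16
rng = np.random.default_rng(1)
e_c, e_v, e_t = (rng.normal(size=d) for _ in range(3))
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
z_i = collaborative_guided_fusion(e_c, e_v, e_t, W_q, W_k, d)
print(z_i.shape)  # (32,) -> fused item representation
```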
4.2.3. Unified Modality Tokenization
The unified representation $\mathbf{z}_i$ for each item is then tokenized into a discrete sequence of semantic tokens using a Residual Quantization Variational Autoencoder (RQ-VAE) [8].
The RQ-VAE architecture consists of three main parts:
- Encoder: Maps the continuous, unified item representation $\mathbf{z}_i$ to a continuous latent vector.
- Residual Quantizer: Approximates the latent vector iteratively across $M$ codebook layers. In each stage $m$:
  - It finds the closest codevector from the $m$-th codebook to the current residual (the part of the latent vector that has not yet been quantized).
  - The index of this closest codevector is selected as the $m$-th token, $c_{i,m}$.
  - This selected codevector is then subtracted from the residual to form the next residual for the subsequent layer. The sequence of selected codebook indices $(c_{i,1}, \dots, c_{i,M})$ becomes the item's semantic token sequence.
- Decoder: Reconstructs the original unified vector $\hat{\mathbf{z}}_i$ from the sum of the selected codevectors (i.e., from the discrete semantic token sequence).

The RQ-VAE is trained by minimizing a composite loss function that ensures both semantic fidelity (accurate reconstruction of $\mathbf{z}_i$) and codebook quality (effective utilization of the discrete codes):
$
\mathcal{L}_{\mathrm{RQ\text{-}VAE}} = \mathcal{L}_{\mathrm{recon}} + \alpha \, \mathcal{L}_{\mathrm{commit}} + \beta \, \mathcal{L}_{\mathrm{div}}
$
Where:
- $\mathcal{L}_{\mathrm{RQ\text{-}VAE}}$ is the total loss for training the RQ-VAE.
- $\mathcal{L}_{\mathrm{recon}} = \| \mathbf{z}_i - \hat{\mathbf{z}}_i \|_2^2$ is the reconstruction loss, a mean squared error (MSE) that measures the difference between the original unified item representation and its reconstruction by the decoder. Minimizing this loss encourages the RQ-VAE to accurately capture the semantic information.
- $\mathcal{L}_{\mathrm{commit}}$ is the VQ commitment loss [17]. This term encourages the output of the encoder (the continuous latent vector) to "commit" to, i.e., stay close to, the chosen codebook entries. It helps stabilize the learning of the codebooks and ensures that the encoder produces latent vectors that are easily quantizable.
- $\mathcal{L}_{\mathrm{div}}$ is a diversity loss [12]. This term promotes the utilization of diverse codes within each codebook, preventing codebook collapse, where only a few codevectors are frequently chosen and the codebook capacity is underutilized.
- $\alpha$ and $\beta$ are balancing hyperparameters that control the relative importance of the quantization (commitment) loss and the diversity loss, respectively, compared to the reconstruction loss. A toy sketch of the residual quantization step follows.
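Below is a toy sketch of greedy residual quantization together with the reconstruction (MSE) term of the loss. The codebook size K = 256 and the dimension are illustrative assumptions; only the use of M = 4 layers follows the paper, and this is not its implementation:

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Greedy residual quantization: at each level pick the nearest codevector to the
    current residual; the reconstruction is the sum of the chosen codevectors."""
    residual, tokens, recon = z.copy(), [], np.zeros_like(z)
    for C in codebooks:                                  # C: (K, d) codebook matrix
        idx = int(np.argmin(((C - residual) ** 2).sum(axis=1)))
        tokens.append(idx)
        recon += C[idx]
        residual = residual - C[idx]
    return tokens, recon

rng = np.random.default_rng(2)
d, M, K = 32, 4, 256                                     # illustrative sizes (M = 4 as in the paper)
codebooks = [rng.normal(size=(K, d)) for _ in range(M)]
z = rng.normal(size=d)                                   # unified item representation z_i
tokens, z_hat = residual_quantize(z, codebooks)
recon_loss = np.mean((z - z_hat) ** 2)                   # L_recon term of the RQ-VAE loss
print(tokens, round(float(recon_loss), 3))               # e.g. four codebook indices and the MSE
```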
4.2.4. End-to-End Generative Recommendation
After the tokenization stage, each item $i$ is represented by its semantic token sequence $\mathbf{c}_i = (c_{i,1}, \dots, c_{i,M})$. The recommendation task is then reframed as a conditional generation problem using a Large Language Model (LLM).
4.2.4.1. Interaction History Prompting
For a user with history $\mathcal{S}_u$, each item is converted into its token sequence. Each token within these sequences is represented by a special symbol (e.g., `<a_12>` for the 12th token from the first codebook layer 'a'). The complete prompt for the LLM is constructed as a sequence of these item tokens, preserving their chronological order. The LLM is then tasked to autoregressively predict the token sequence of the next item, $\mathbf{c}_{i_{L+1}}$.
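A small sketch of how such a prompt could be assembled is given below; the `<a_12>`-style symbol format and the helper functions are assumptions made for illustration, not the paper's exact serialization:

```python
def item_tokens_to_symbols(token_seq):
    """Map an item's codebook indices to symbolic tokens, one symbol space per codebook
    layer ('a', 'b', 'c', ...), e.g. (12, 7, 255, 3) -> <a_12><b_7><c_255><d_3>."""
    layers = "abcdefgh"
    return "".join(f"<{layers[m]}_{idx}>" for m, idx in enumerate(token_seq))

def build_prompt(history_token_seqs):
    """Concatenate the symbolic tokens of historical items in chronological order."""
    return "".join(item_tokens_to_symbols(seq) for seq in history_token_seqs)

history = [(12, 7, 255, 3), (4, 199, 0, 61)]        # hypothetical 4-token item codes
print(build_prompt(history))
# <a_12><b_7><c_255><d_3><a_4><b_199><c_0><d_61>  -> the LLM must generate the next item's tokens
```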
4.2.4.2. Training and Inference
- Training: A powerful LLM serves as the generative backbone; T5 [19] is used in this paper. (The paper frames this as decoder-style generation; T5 is technically a sequence-to-sequence encoder-decoder model, so the history prompt can be passed through the encoder while the decoder autoregressively generates the item tokens.) The model is trained using a standard next-token prediction objective, minimizing the cross-entropy loss between the predicted token probabilities and the ground-truth target tokens:
  $
  \mathcal{L}_{\mathrm{NTP}} = - \sum_{j=1}^{L-1} \sum_{m=1}^{M} \log P \left( c_{i_{j+1},m} \mid \{ \mathbf{c}_{i_k} \}_{k=1}^{j}, c_{i_{j+1},1}, \dots, c_{i_{j+1},m-1} \right)
  $
  Where:
  - $\mathcal{L}_{\mathrm{NTP}}$ is the Next Token Prediction loss.
  - The outer sum iterates through each item in the user's historical sequence, predicting the next item $i_{j+1}$.
  - The inner sum iterates through each of the $M$ tokens within the target item's token sequence $\mathbf{c}_{i_{j+1}}$.
  - $\log P ( c_{i_{j+1},m} \mid \{ \mathbf{c}_{i_k} \}_{k=1}^{j}, c_{i_{j+1},1}, \dots, c_{i_{j+1},m-1} )$ is the log-probability of predicting the $m$-th token of item $i_{j+1}$, conditioned on all preceding item token sequences in the history (up to item $j$) and all previously generated tokens for the current item (from $1$ to $m-1$). This is the standard cross-entropy loss for autoregressive sequence generation.
- Inference: Given a user's history prompt, beam search is used to generate multiple candidate token sequences for the next item. Beam search explores a set of the most probable sequences at each step, rather than just the single most probable one, to find better overall sequences. The score of a candidate sequence $\hat{\mathbf{c}}$ is the sum of its log-probabilities:
  $
  \mathrm{score}(\hat{\mathbf{c}}) = \sum_{m=1}^{M} \log P \left( \hat{c}_{m} \mid \mathrm{prompt}, \hat{c}_{1}, \dots, \hat{c}_{m-1} \right)
  $
  Where:
  - $\mathrm{score}(\hat{\mathbf{c}})$ is the total score for a generated candidate token sequence $\hat{\mathbf{c}}$.
  - The sum calculates the total log-likelihood of generating the sequence, given the initial prompt (user history) and the previously generated tokens within $\hat{\mathbf{c}}$.

  To ensure that only valid item sequences are generated, a prefix tree (Trie)-based constrained decoding strategy is employed (see the sketch below). The Trie contains all valid item token sequences from the entire item catalog. At each generation step, the LLM's output vocabulary is masked to only allow tokens that form a valid prefix according to the Trie, drastically pruning the search space and guaranteeing the validity of the final recommendations by preventing the LLM from "hallucinating" non-existent item token sequences.
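The sketch below shows one plausible way to implement the Trie used for constrained decoding: it stores every catalog item's token sequence and, for any generated prefix, returns the set of tokens allowed next. The item codes are made up for the example:

```python
class TokenTrie:
    """Prefix tree over all valid item token sequences; used to mask the LLM's
    output vocabulary so that only prefixes of real items can be generated."""
    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix):
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return set()          # invalid prefix -> nothing is allowed
        return set(node.keys())       # tokens that extend a valid item code

# Catalog of three items, each identified by a 4-token semantic code.
catalog = [(1, 5, 9, 2), (1, 5, 3, 7), (4, 0, 8, 8)]
trie = TokenTrie(catalog)
print(trie.allowed_next(()))        # {1, 4}  -> valid first tokens
print(trie.allowed_next((1, 5)))    # {9, 3}  -> valid continuations of prefix (1, 5)
```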
5. Experimental Setup
5.1. Datasets
The authors evaluate CEMG on three widely used public datasets:
- Amazon Reviews (Beauty and Sports): These datasets consist of user reviews and interaction data for products from the "Beauty" and "Sports & Outdoors" categories on Amazon.
- Yelp: This dataset contains user reviews and interactions for businesses on the Yelp platform.

For each interaction in these datasets, associated item images and textual descriptions are collected. To ensure data quality, users and items with fewer than 5 interactions are filtered out, which is a common practice in recommender systems research to focus on more active entities and reduce noise.
The statistics of the processed datasets are summarized in the following table (Table 1 from the original paper):
| Attribute | Beauty | Sports | Yelp |
|---|---|---|---|
| #Users | 22,363 | 35,598 | 30,431 |
| #Items | 12,101 | 18,357 | 20,033 |
| #Interactions | 198,502 | 296,337 | 316,942 |
| Avg. Len. | 8.9 | 8.3 | 10.4 |
| Sparsity | 99.93% | 99.95% | 99.95% |
Where:
- #Users: The total number of unique users in the dataset.
- #Items: The total number of unique items in the dataset.
- #Interactions: The total number of user-item interactions recorded.
- Avg. Len.: The average length of user interaction sequences.
- Sparsity: The percentage of empty cells in the user-item interaction matrix, indicating how sparse the interaction data is (a higher percentage means sparser data, making recommendation more challenging).

These datasets are chosen because they are widely recognized benchmarks in the recommendation research community, featuring real-world user behaviors and rich multimodal content (reviews for text, product/business images). Their diversity in domains (e-commerce products, local businesses) and scale makes them effective for validating the generalizability and performance of multimodal generative recommendation methods.
5.2. Evaluation Metrics
The authors adopt a leave-one-out strategy for evaluation: for each user, their last interacted item is used as the ground truth for testing, the second-to-last for validation, and the remaining interactions for training. This setup simulates the real-world scenario of predicting the immediate next interaction.
The performance of all models is evaluated using two common ranking metrics at top-K cutoffs: Hit Rate (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K), with K = 10 reported in the main results.
5.2.1. Hit Rate (HR@K)
- Conceptual Definition: Hit Rate (also known as Recall in this setting) at K measures the proportion of users for whom the ground-truth item (the next item they interacted with) appears within the top K items recommended by the system. It is a simple binary measure: either the item is in the top K or it is not. It reflects how often the model successfully recommends the relevant item within the limited top-K list.
- Mathematical Formula:
$
\mathrm{HR@K} = \frac{\text{Number of users for whom the ground-truth item is in top-K}}{\text{Total number of users}}
$
- Symbol Explanation:
  - Number of users for whom the ground-truth item is in top-K: counts how many times the actual next item for a user is found among the first K recommendations.
  - Total number of users: the total number of users considered in the evaluation set.

A minimal sketch of this metric is shown below.
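A minimal reference implementation of HR@K under the leave-one-out protocol might look as follows (toy ranked lists, not the paper's evaluation code):

```python
def hit_rate_at_k(ranked_lists, ground_truth, k=10):
    """Fraction of users whose ground-truth next item appears in their top-k list."""
    hits = sum(1 for recs, gt in zip(ranked_lists, ground_truth) if gt in recs[:k])
    return hits / len(ground_truth)

# Two users: the first is a hit (item 7 is in the top-3), the second is a miss.
print(hit_rate_at_k([[3, 7, 1], [2, 5, 8]], ground_truth=[7, 9], k=3))  # 0.5
```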
5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)
- Conceptual Definition: NDCG at K is a ranking quality metric that accounts for the position of relevant items. It gives higher scores to relevant items that appear higher in the recommendation list and penalizes relevant items that appear lower. It also normalizes the score to be between 0 and 1, allowing comparison across different users and queries. NDCG is particularly suitable for scenarios where the order of recommended items matters.
- Mathematical Formula:
$
\mathrm{NDCG@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{\mathrm{DCG@K}_u}{\mathrm{IDCG@K}_u}
$
where
$
\mathrm{DCG@K}_u = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}(j)} - 1}{\log_2(j+1)}
$
and $\mathrm{IDCG@K}_u$ is the Ideal DCG (the maximum possible DCG for user $u$), calculated by placing all relevant items at the top of the list:
$
\mathrm{IDCG@K}_u = \sum_{j=1}^{\min(K, |\mathrm{Rel}_u|)} \frac{2^{1} - 1}{\log_2(j+1)}
$
- Symbol Explanation:
  - $|\mathcal{U}|$: The total number of users.
  - $\mathrm{DCG@K}_u$: Discounted Cumulative Gain for user $u$ at cutoff $K$.
  - $\mathrm{IDCG@K}_u$: Ideal Discounted Cumulative Gain for user $u$ at cutoff $K$. This is the maximum possible DCG value if the recommended list were perfectly ordered.
  - $\mathrm{rel}(j)$: The relevance score of the item at rank $j$. For recommendation tasks with implicit feedback, $\mathrm{rel}(j)$ is typically 1 if the item at rank $j$ is the ground-truth next item, and 0 otherwise.
  - $\min(K, |\mathrm{Rel}_u|)$: The minimum of $K$ and the number of relevant items for user $u$. In a leave-one-out setting, $|\mathrm{Rel}_u|$ is typically 1, as sketched below.
5.3. Baselines
CEMG is compared against four categories of baseline models to demonstrate its comprehensive superiority:
- Sequential Methods: These models capture the temporal dependencies in user interactions.
  - GRU4Rec [5]: A foundational session-based recommender that uses Gated Recurrent Units (GRUs) to model sequential user behavior.
  - SASRec [6]: Self-Attentive Sequential Recommendation, a Transformer-based model that applies self-attention to capture long-range dependencies in user sequences.
- Multimodal Methods: These models incorporate various content modalities.
  - MMSRec [23]: Self-supervised Multi-Modal Sequential Recommendation.
  - MISSRec [25]: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation.
- LLM-based Methods: These leverage Large Language Models directly, typically operating on item titles or raw text as input to the LLMs.
  - LlamaRec [28]: A two-stage recommendation approach using LLMs for ranking.
  - LLM-ESR [13]: Large Language Models Enhancement for Long-tailed Sequential Recommendation.
- Generative Methods: These represent the state-of-the-art in generative recommendation using semantic IDs.
  - TIGER [20]: A generative retrieval model that tokenizes items into discrete semantic codes and then generates these codes.
  - LETTER [26]: Learnable Item Tokenization for Generative Recommendation, which improves tokenization by aligning quantized representations with collaborative embeddings.
  - MMGRec [11]: Multimodal Generative Recommendation with Transformer Model, which employs graph-based architectures to tokenize fused multimodal information.
5.4. Implementation Details
The authors provide specific implementation details for the CEMG framework:
- Feature Embedding Dimension: All feature embeddings (visual, textual, collaborative) are projected to a uniform dimension $d$.
- RQ-VAE Configuration:
  - Number of codebook layers ($M$): Set to $M = 4$, meaning each item is represented by a sequence of 4 discrete tokens.
  - Codebook Size ($K$): Each of the $M$ codebooks contains $K$ unique codevectors (discrete symbols).
- Loss Balancing Weights: Based on the parameter analysis, the balancing weights of the RQ-VAE loss were tuned for the quantization (commitment) loss and the diversity loss, respectively.
- Generative LLM Backbone: T5 [19] is employed as the generative LLM backbone.
- Optimizer and Learning Rate: The model is trained using the AdamW optimizer with a fixed learning rate.
- Hardware: Training is performed on NVIDIA A100 GPUs.

A hypothetical configuration sketch summarizing this setup is shown below.
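For orientation, a hypothetical configuration dictionary is sketched below; apart from M = 4 and the named backbone/optimizer/hardware, every numeric value is a placeholder assumption, not a figure reported by the paper:

```python
# Hypothetical configuration sketch mirroring the reported setup; the embedding dimension,
# codebook size, loss weights, and learning rate are placeholders, not confirmed values.
config = {
    "embedding_dim": 128,              # uniform dimension for visual/textual/collaborative features (assumed)
    "rqvae": {
        "num_codebook_layers": 4,      # M = 4 tokens per item (stated in the paper)
        "codebook_size": 256,          # K codevectors per codebook (placeholder)
        "commit_loss_weight": 0.25,    # quantization/commitment weight (placeholder)
        "diversity_loss_weight": 0.01, # diversity weight suggested by the parameter analysis
    },
    "generator": {
        "backbone": "T5",
        "optimizer": "AdamW",
        "learning_rate": 1e-4,         # placeholder value
        "hardware": "NVIDIA A100",
    },
}
```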
6. Results & Analysis
6.1. Core Results Analysis (RQ1)
The main experimental results, comparing CEMG against various baselines on three datasets, are presented in the following table (Table 2 from the original paper).
The following are the results from Table 2 of the original paper:
| Category | Model | Beauty HR@10 | Beauty NDCG@10 | Sports HR@10 | Sports NDCG@10 | Yelp HR@10 | Yelp NDCG@10 |
|---|---|---|---|---|---|---|---|
| Sequential | GRU4Rec | 0.0385 | 0.0116 | 0.0201 | 0.0045 | 0.0288 | 0.0095 |
| | SASRec | 0.0434 | 0.0147 | 0.0232 | 0.0061 | 0.0329 | 0.0121 |
| Multimodal | MISSRec | 0.0577 | 0.0287 | 0.0305 | 0.0118 | 0.0387 | 0.0163 |
| | MMSRec | 0.0581 | 0.0292 | 0.0311 | 0.0124 | 0.0395 | 0.0171 |
| LLM-based | LlamaRec | 0.0492 | 0.0198 | 0.0256 | 0.0083 | 0.0341 | 0.0134 |
| | LLM-ESR | 0.0515 | 0.0214 | 0.0269 | 0.0091 | 0.0353 | 0.0140 |
| Generative | TIGER | 0.0533 | 0.0251 | 0.0281 | 0.0103 | 0.0368 | 0.0151 |
| | LETTER | 0.0552 | 0.0268 | 0.0295 | 0.0111 | 0.0377 | 0.0159 |
| | MMGRec | 0.0571 | 0.0281 | 0.0302 | 0.0119 | 0.0389 | 0.0166 |
| | CEMG | 0.0665 | 0.0348 | 0.0363 | 0.0157 | 0.0458 | 0.0212 |
| | Improvement (%) | +14.46% | +19.18% | +16.72% | +26.61% | +15.95% | +23.98% |
The results clearly demonstrate that CEMG consistently and significantly outperforms all baseline models across all three datasets (Beauty, Sports, Yelp) and both evaluation metrics (HR@10 and NDCG@10). The improvements are reported as statistically significant.
Key Observations:
- Overall Superiority: CEMG achieves the highest scores in all categories, with impressive relative improvements over the best baselines. For example, on the Beauty dataset, CEMG shows a +14.46% improvement in HR@10 and +19.18% in NDCG@10 over the best baseline (MMSRec). On Sports, the NDCG@10 improvement is a remarkable +26.61%.
- Generative Models Perform Well: Among the baselines, generative methods (TIGER, LETTER, MMGRec) generally outperform traditional sequential (GRU4Rec, SASRec) and LLM-based (LlamaRec, LLM-ESR) methods, and are competitive with multimodal methods (MISSRec, MMSRec). This highlights the inherent potential of the generative paradigm in recommendation.
- CEMG's Advantage: CEMG's substantial lead over strong multimodal generative baselines like MMGRec and LETTER validates the effectiveness of its core innovations: the collaborative-guided fusion of multimodal features and the advanced generative architecture. It suggests that a deeper, more unified representation learning process, guided by collaborative signals, is crucial for unlocking the full potential of generative recommendation.
- Importance of Multimodality and Collaboration: While LLM-based methods show some promise, they do not reach the performance of dedicated multimodal generative approaches, indicating that sophisticated fusion and tokenization of multimodal and collaborative signals are more effective than relying solely on raw text as LLM input in this context.
6.2. Ablation Study (RQ2)
To understand the contribution of each key component in CEMG, an ablation study was conducted. Several variants of the model were tested, and the results are shown in Figure 2.

Fig. 2: Ablation study results on three datasets for HR@10 and NDCG@10. Performance drops across all variants demonstrate the contribution of each component.
The variants are:
- `w/o Collab`: Removes the collaborative features $\mathbf{e}_i^c$ from the unified representation, meaning the Multimodal Fusion Layer is no longer guided by collaborative signals and the final concatenated vector does not include $\mathbf{e}_i^c$.
- `w/o Image`: Removes the visual features $\mathbf{e}_i^v$.
- `w/o Text`: Removes the textual features $\mathbf{e}_i^t$.
- `w/o LLM`: Replaces the powerful pre-trained T5 (the text also mentions Llama-3-8B, suggesting an inconsistency between sections of the paper) with a standard 6-layer Transformer decoder trained from scratch, similar to TIGER [20].
Analysis of Results:
- Full CEMG Model is Best: As expected, the full CEMG model consistently achieves the best performance across all datasets and metrics.
- `w/o Collab` Shows a Significant Drop: Removing collaborative features leads to one of the most significant performance drops. This strongly underscores the vital role of collaborative filtering signals, not just as supplementary information, but as a guiding force in fusing multimodal content and creating a holistic item representation, even in a content-rich generative model. It confirms that user behavior patterns provide crucial personalization cues that content alone cannot fully capture.
- `w/o LLM` Is Also Significant: Replacing the powerful pre-trained LLM with a simpler Transformer decoder trained from scratch also results in a substantial performance degradation. This validates the choice of a large, pre-trained LLM such as T5: its reasoning ability, sequence modeling capability, and the world knowledge acquired during pre-training are crucial for accurately generating, autoregressively, the complex semantic token sequences of the next recommended item.
- Importance of Multimodal Features: Removing image or text features (`w/o Image` and `w/o Text`) also leads to noticeable performance drops, albeit generally less severe than `w/o Collab` or `w/o LLM`. This confirms that CEMG effectively utilizes both visual and textual information: each modality contributes unique semantic aspects to the item representation, and their synergistic fusion (especially when guided by collaborative signals) leads to a richer understanding of items. The absence of either modality results in a loss of descriptive power.

In summary, the ablation study clearly demonstrates that all core components of CEMG—collaborative signals, multimodal content (both image and text), and the powerful LLM backbone—are crucial and contribute synergistically to the model's superior performance.
6.3. Efficiency Analysis (RQ3)
The efficiency of CEMG, in terms of training and inference time, is compared against other state-of-the-art generative models. The results are presented in Figure 3.

Fig. 3: Efficiency comparison on the Beauty and Sports datasets. Left axis (bars) shows training time per epoch, broken down by stage. Right axis (line) shows inference speed in users per second (higher is better).
Analysis:
- Training Efficiency:
  - CEMG's training time is composed of two stages: tokenization (training the RQ-VAE) and end-to-end generation (fine-tuning the LLM).
  - While the overall training time for CEMG is noted to be higher than that of TIGER due to the processing of additional modality features, it remains highly competitive. The total training time is comparable to, and even slightly better than, that of LETTER, which requires a complex alignment process. This suggests that CEMG's sophisticated fusion and tokenization pipeline, despite integrating multiple modalities and collaborative signals, does not introduce prohibitive computational overhead during training.
- Inference Efficiency:
  - This is an area where CEMG particularly excels. CEMG achieves significantly lower inference latency (i.e., more users processed per second) than other multimodal generative models such as MMGRec and LETTER.
  - This efficiency is attributed to CEMG's design of generating short, fixed-length semantic token sequences ($M$ tokens per item). This fixed-length, compact representation makes the autoregressive generation process much faster than for models that require more complex generation steps or extensive retrieval processes, which could involve longer sequences or more complex calculations per token.
  - The high inference speed makes CEMG highly practical and suitable for real-world deployment scenarios where quick response times are critical for user experience.

In conclusion, CEMG strikes an effective balance between superior performance and practical computational cost, especially in terms of inference speed, which is a major advantage for production systems.
6.4. Parameter Analysis (RQ4)
The sensitivity of CEMG's performance to four key hyperparameters in the Unified Modality Tokenization stage is investigated. The results, specifically for HR@10, are shown in Figure 4.

Fig. 4: Parameter sensitivity analysis of CEMG on HR@10 for (a) Number of Codebook Layers, (b) Codebook Size, (c) Quantization Loss Weight, and (d) Diversity Loss Weight.
Analysis:
- a) Number of Codebook Layers ($M$):
  - Performance (HR@10) generally improves as $M$ increases from 2 to 4, because more codebook layers allow the RQ-VAE to capture finer-grained semantic details and create richer item representations through additional quantization steps.
  - However, performance plateaus around $M = 4$ and slightly declines for larger $M$: longer token sequences increase the difficulty of the generative task for the LLM, potentially leading to more errors or less stable training.
  - Based on this, $M = 4$ is chosen, offering a good balance between representational power and generation complexity.
- b) Codebook Size ($K$):
  - A larger codebook size generally leads to better performance. This is intuitive, as a larger codebook provides more distinct codevectors and thus greater expressive power for the semantic tokens to represent diverse item features.
  - The performance gain, however, saturates beyond a certain size, indicating that the marginal benefit of adding more codevectors diminishes and that the chosen codebook size offers a good balance between expressiveness and computational cost (larger codebooks require more memory and training time).
- c) Quantization Loss Weight:
  - This hyperparameter balances reconstruction quality against codebook alignment (how closely the encoder output adheres to the codebook entries).
  - Figure 4(c) shows a clear unimodal trend, with performance peaking at an intermediate value.
  - Values that are too low do not sufficiently encourage the encoder output to align with the codebook, leading to poor quantization; values that are too high force the encoder output too rigidly onto the codebook entries, potentially distorting the latent representation and harming reconstruction quality.
  - An appropriately chosen weight therefore balances these objectives.
- d) Diversity Loss Weight:
  - This weight is crucial for preventing codebook collapse, a phenomenon where only a few codevectors are frequently used, leading to inefficient codebook utilization.
  - Performance improves as the weight increases up to 0.01, confirming the benefit of encouraging diverse codebook usage so that all codebook entries are meaningfully utilized and contribute to rich item representations.
  - However, excessively high values distort the semantic space by over-emphasizing diversity, potentially forcing the model to select less optimal codevectors for reconstruction and thus harming performance.

These analyses provide valuable insights into configuring CEMG for optimal performance, demonstrating the careful tuning required for VQ-VAE-based tokenization.
6.5. Performance on Cold-Start Items (RQ5)
Cold-start items are a critical challenge for recommender systems because they have insufficient interaction data for collaborative filtering to be effective. The authors investigate CEMG's performance on items with five or fewer interactions in the training set. The results are presented in the following table (Table 3 from the original paper).
The following are the results from Table 3 of the original paper:
| Model | Beauty HR@10 | Beauty NDCG@10 | Sports HR@10 | Sports NDCG@10 | Yelp HR@10 | Yelp NDCG@10 |
|---|---|---|---|---|---|---|
| SASRec | 0.0112 | 0.0048 | 0.0065 | 0.0027 | 0.0098 | 0.0041 |
| MISSRec | 0.0254 | 0.0115 | 0.0141 | 0.0068 | 0.0185 | 0.0092 |
| MMGRec | 0.0268 | 0.0123 | 0.0153 | 0.0075 | 0.0192 | 0.0099 |
| CEMG | 0.0305 | 0.0153 | 0.0183 | 0.0094 | 0.0231 | 0.0125 |
Analysis:
- CEMG's Superiority: CEMG substantially outperforms all baselines on cold-start items across all three datasets and metrics. This is a critical finding, demonstrating CEMG's robust generalization capabilities.
- Content-Aware Models vs. ID-based: As expected, content-aware models like MISSRec and MMGRec perform significantly better than the purely ID-based SASRec on cold-start items. This highlights the inherent advantage of leveraging item multimodal content when interaction data is scarce, as content provides intrinsic item semantics that are available even for new items. SASRec struggles because it relies primarily on interaction history to learn item embeddings, which is minimal for cold-start items.
- Advantage of Collaborative-Guided Tokenization: CEMG's advanced semantic tokenization provides superior generalization even over other multimodal generative baselines. By learning to generate rich item representations from a collaborative-guided fusion of multimodal content, CEMG remains effective even when explicit interaction signals for a specific item are sparse. The collaborative guidance during fusion helps ensure that the semantic tokens capture not just what the item is (from content) but also how it relates to existing items in the collaborative space, even if its own direct interactions are few. This makes the item representations more robust and transferable, improving recommendations in cold-start scenarios.

This analysis confirms that CEMG effectively addresses a long-standing challenge in recommender systems, making it more practical for real-world applications with constantly evolving item catalogs.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces CEMG, a novel generative recommendation framework that significantly advances the state-of-the-art by addressing two critical limitations in existing models: the superficial integration of collaborative signals and the decoupled fusion of multimodal features. CEMG's core innovation lies in its Multimodal Fusion Layer, which dynamically integrates visual and textual content under the explicit guidance of collaborative signals, creating a deeply unified and holistic item representation. This representation is then converted into discrete semantic codes using a Unified Modality Tokenization module powered by an RQ-VAE. Finally, an End-to-End Generative Recommendation component, leveraging a fine-tuned Large Language Model and constrained decoding, autoregressively generates these item codes to produce personalized recommendations. Extensive experiments on three benchmark datasets confirm that CEMG consistently and significantly outperforms a wide range of state-of-the-art baselines, demonstrating its superior performance, efficiency, and robustness, particularly in handling cold-start items.
7.2. Limitations & Future Work
The authors identify one key limitation and propose future research directions:
- Noisy Multimodal Content: A current limitation is that noisy signals within multimodal content, such as irrelevant image backgrounds or extraneous text, can be inadvertently encoded into the item representations. This noise can compromise the quality of tokenization and subsequently affect recommendation accuracy.
- Future Work - Advanced Decoding Strategies: To further mitigate recommendation errors and potentially address the issue of noisy signals, the authors plan to explore more advanced decoding strategies in future work. This could involve techniques that are more robust to noise, or that leverage additional contextual information during generation to produce even more precise and relevant item token sequences.
7.3. Personal Insights & Critique
The CEMG framework represents a significant step forward in generative recommendation, particularly in its elegant solution to deeply integrate multimodal and collaborative signals.
- Innovation of Collaborative-Guided Fusion: The collaborative-guided multimodal fusion layer is a standout innovation. By using collaborative embeddings as queries to dynamically weigh visual and textual features, the model ensures that content is interpreted in the context of user preferences and item relationships rather than in isolation. This aligns well with human intuition: how an item is perceived collectively often dictates which of its attributes are most salient.
- Bridge between Continuous and Discrete Spaces: The use of RQ-VAE for unified modality tokenization is an effective bridge between rich, continuous latent representations and the discrete semantic tokens required by LLMs. This structured tokenization lets the LLM generate meaningful item codes rather than raw text, which is less controllable and prone to hallucination in a recommendation context. The fixed-length, short token sequences also contribute to the impressive inference efficiency, a critical factor for real-world deployment.
- Addressing Cold-Start: CEMG's superior performance on cold-start items is a testament to its robust item representation learning. With a multimodal foundation guided by even sparse collaborative signals, it can generalize well to new items, a major pain point for traditional ID-based recommenders.
- Potential Issues / Unverified Assumptions:
  - Dependency on Pre-trained Encoders: The model relies heavily on powerful pre-trained encoders (VGG, BERT). The quality of these upstream encoders directly determines the initial multimodal features. If these encoders are not well aligned with the recommendation domain, or if newer, more capable encoders emerge, retraining or updating these components would be necessary. The assumption is that these general-purpose encoders suffice for extracting domain-relevant features.
  - Scalability of the Trie for Constrained Decoding: While prefix tree (Trie)-based constrained decoding ensures validity, the Trie can grow very large for an extremely vast item catalog combined with many codebook layers and a large codebook (the paper uses 4 codebook layers). For a truly massive catalog (millions of items), the Trie might become memory-intensive and could affect lookup speed at inference, though likely less than the LLM generation itself.
  - Complexity of Two-Stage Training: Training involves two distinct stages (RQ-VAE and LLM fine-tuning), which can be more complex to manage and optimize than a single end-to-end model, and errors from the tokenization stage can propagate to the generation stage.
  - Interpretability of Tokens: While semantic tokens are generated, their direct interpretability for humans may still be limited compared to natural language descriptions. Understanding why a particular sequence of codes represents an item, or why certain codes are generated, would require further research into token semantics.
- Transferability: The core idea of collaborative-guided multimodal fusion for item representation learning could transfer to domains beyond e-commerce and local businesses. For instance, in content recommendation (news, articles, videos), where rich multimodal content is available alongside user interactions, this framework could provide significant benefits. The tokenization-plus-LLM-generation pipeline is also quite general and could be adapted for generating other structured outputs.

Overall, CEMG presents a robust and innovative framework that effectively merges multimodal learning, collaborative filtering, and generative AI into a powerful and efficient recommender system. The demonstrated improvements on cold-start items are particularly promising for real-world applicability.