Paper status: completed

CEMG: Collaborative-Enhanced Multimodal Generative Recommendation

Published: 2025-12-25
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The CEMG framework enhances multimodal generative recommendations by dynamically integrating visual and textual features with collaborative signals, overcoming key challenges, and significantly outperforming existing methods.

Abstract

Generative recommendation models often struggle with two key challenges: (1) the superficial integration of collaborative signals, and (2) the decoupled fusion of multimodal features. These limitations hinder the creation of a truly holistic item representation. To overcome this, we propose CEMG, a novel Collaborative-Enhanced Multimodal Generative Recommendation framework. Our approach features a Multimodal Fusion Layer that dynamically integrates visual and textual features under the guidance of collaborative signals. Subsequently, a Unified Modality Tokenization stage employs a Residual Quantization VAE (RQ-VAE) to convert this fused representation into discrete semantic codes. Finally, in the End-to-End Generative Recommendation stage, a large language model is fine-tuned to autoregressively generate these item codes. Extensive experiments demonstrate that CEMG significantly outperforms state-of-the-art baselines.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is CEMG: Collaborative-Enhanced Multimodal Generative Recommendation. It focuses on developing a novel framework for recommender systems that leverages collaborative signals and multimodal data (images and text) to generate personalized recommendations.

1.2. Authors

The authors of the paper are:

  • Yuzhen Lin¹ - School of Information Systems and Management, Carnegie Mellon University, Pittsburgh, PA 15213, USA
  • Hongyi Chen² - Samueli School of Engineering, University of California, Los Angeles, CA 90095, USA
  • Xuanjing Chen³ - Columbia Business School, Columbia University, New York, NY 10027, USA
  • Shaowen Wang⁴ - Henry Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA
  • Ivonne Xu⁵ - Department of Physics, University of Chicago, Chicago, IL 60637, USA
  • Dongming Jiang⁶ - Department of Computer Science, Rice University, Houston, TX 77005, USA

1.3. Journal/Conference

This paper was published at 2025-12-25T07:28:35 (UTC). The original source link points to an arXiv preprint, which indicates it is a pre-publication version of a scholarly paper that has not yet undergone peer review or been formally accepted by a journal or conference. However, arXiv is a widely respected platform for disseminating research in fields like computer science and physics.

1.4. Publication Year

The publication year, based on the provided UTC timestamp, is 2025.

1.5. Abstract

The paper addresses two main challenges in generative recommendation models: (1) superficial integration of collaborative signals and (2) decoupled fusion of multimodal features, both of which prevent the creation of comprehensive item representations. To tackle these, the authors propose CEMG, a Collaborative-Enhanced Multimodal Generative Recommendation framework. CEMG introduces a Multimodal Fusion Layer that dynamically integrates visual and textual features, guided by collaborative signals. This fused representation is then converted into discrete semantic codes using a Residual Quantization VAE (RQ-VAE) in a Unified Modality Tokenization stage. Finally, an End-to-End Generative Recommendation stage fine-tunes a large language model (LLM) to autoregressively generate these item codes. Experimental results demonstrate that CEMG significantly outperforms existing state-of-the-art baselines.

The official source link is: https://arxiv.org/abs/2512.21543v1. This is a link to the paper on arXiv, a preprint server. Its publication status is a preprint.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inherent limitation of existing generative recommendation models in creating truly holistic item representations. This limitation stems from two specific challenges:

  1. Superficial Integration of Collaborative Signals: While multimodal content (like images and text) offers rich semantic descriptions, collaborative signals—patterns derived from collective user behavior—are crucial for personalization. Many generative models only incorporate this information as a supplementary feature or through shallow alignment, failing to capture complex, high-order relationships that reveal latent user preferences and item-to-item correlations beyond mere content similarity.

  2. Decoupled Fusion of Multimodal Features: Current frameworks often treat multimodal content and collaborative signals as separate entities, fusing them in a late or disjointed manner. This separation hinders the model from understanding the intricate interplay between an item's intrinsic attributes (what it is) and its contextual role within the user community (how it is perceived). For example, items that are visually different might be functional substitutes, a nuance only detectable through deep, synergistic fusion.

    These challenges prevent generative recommendation from fully realizing its potential, particularly in generalizing to new or long-tail items and effectively leveraging multimodal content. The paper's innovative idea is to create a deeply unified item representation by synergistically fusing content semantics with collaborative wisdom, specifically tailored for a powerful generative recommendation engine.

2.2. Main Contributions / Findings

The paper's primary contributions are summarized as follows:

  1. Novel Framework (CEMG) for Deep Multimodal-Collaborative Fusion: CEMG is proposed as a novel generative recommendation framework that, for the first time, employs a collaborative-guided mechanism to deeply fuse multimodal content with high-order collaborative signals into a unified semantic space for item tokenization.

  2. Elegant and Effective Architecture: The paper designs a Multimodal Fusion Layer that dynamically enhances item representations by aligning content features with their collaborative context.

  3. End-to-End Generative Pipeline with LLMs: An End-to-End Generative pipeline is developed that leverages the power of Large Language Models (LLMs) for recommendation. This pipeline is further enhanced with a constrained decoding strategy to ensure recommendation validity and efficiency.

  4. Extensive Experimental Validation: The authors conduct extensive experiments on three benchmark datasets, demonstrating that CEMG significantly outperforms a wide array of state-of-the-art baselines.

    The key findings demonstrate that CEMG consistently and significantly outperforms all baseline models across various metrics and datasets. This superiority is attributed to its ability to create a deeply unified semantic representation that synergistically integrates multimodal content with collaborative signals, combined with the power of LLMs for generation. Notably, CEMG shows substantial improvement in handling cold-start items, validating its robust generalization capabilities even with sparse interaction data.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the CEMG framework, a reader should be familiar with several foundational concepts in machine learning and recommender systems:

  • Recommender Systems (RS): These are information filtering systems that predict what a user might like. Their goal is to alleviate information overload by personalizing user experiences, typically by suggesting items (products, movies, articles, etc.) that are relevant to a user's preferences.
  • Collaborative Filtering (CF): A fundamental technique in recommender systems that makes predictions about a user's interest by collecting preferences or taste information from many users (collaborating). The underlying assumption is that if users A and B have similar tastes, and user A liked item X, then user B is likely to like item X as well. Traditional CF methods often rely on user-item interaction matrices.
  • Sequential Recommendation: This is a sub-field of recommender systems that considers the chronological order of user interactions. Instead of just predicting static preferences, sequential recommenders aim to predict the next item a user will interact with, given their past sequence of interactions. This captures dynamic user interests and transitions.
  • Multimodal Recommendation: This approach enhances recommendation performance by leveraging auxiliary information from multiple modalities beyond just implicit user-item interactions or item IDs. These modalities often include text (e.g., item descriptions, reviews), images (e.g., product photos), audio, or video. The goal is to create richer item representations that capture various aspects of an item.
  • Generative Recommendation: A transformative paradigm that redefines recommendation as a sequence generation task. Instead of predicting a specific item ID from a fixed set (as in traditional discriminative models), generative models represent each item as a sequence of semantic tokens and then learn to generate these tokens for the next recommended item. This allows for greater flexibility, handling cold-start items more effectively, and generating diverse recommendations.
  • Graph Neural Networks (GNNs): A class of neural networks designed to operate on graph-structured data. They learn representations (embeddings) of nodes (e.g., users, items) by iteratively aggregating information from their local neighborhood in the graph. LightGCN, mentioned in the paper, is a simplified yet effective GNN for recommendation that focuses on propagating embeddings over the user-item interaction graph.
  • Variational Autoencoder (VAE): A type of generative model that learns a compressed, continuous latent representation (embedding) of input data. It consists of an encoder that maps input to a latent space and a decoder that reconstructs the input from a sample in the latent space. VAEs are trained to ensure the latent space is well-structured and allows for meaningful generation.
  • Vector Quantization (VQ) / Residual Quantization VAE (RQ-VAE): Vector Quantization (VQ) is a technique that maps continuous vectors to discrete, learnable codebook entries (tokens). RQ-VAE is an extension where quantization is performed iteratively, using multiple codebook layers. In each layer, the model quantizes the residual (the part of the vector not yet explained by previous layers' quantizations), creating a sequence of discrete semantic tokens that compactly represent the original continuous vector. This is crucial for bridging the gap between continuous latent representations and discrete tokens required by Large Language Models.
  • Large Language Models (LLMs): Powerful neural networks, often based on the Transformer architecture, trained on massive amounts of text data. They excel at sequence-to-sequence generation tasks, such as language translation, text summarization, and, in this context, generating sequences of item tokens. T5 is a popular LLM used in this paper.
  • Attention Mechanism: A core component of Transformer models. It allows the model to dynamically weigh the importance of different parts of the input sequence when processing each element. The general self-attention mechanism is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
    • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
    • $QK^T$ calculates the similarity between queries and keys.
    • $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the keys, used to prevent large dot products from pushing the softmax into regions with tiny gradients.
    • softmax normalizes the attention scores.
    • The result is a weighted sum of the values, where the weights are determined by the attention scores. In CEMG, a guided attention mechanism is used where collaborative embeddings act as queries to weigh visual and textual features (a minimal sketch follows this list).
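As a quick illustration of the formula above, the following NumPy sketch implements scaled dot-product attention; the array shapes and values are arbitrary and not taken from the paper.

```python
# Minimal NumPy sketch of scaled dot-product attention (illustrative only;
# shapes and values are arbitrary, not taken from the paper).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity between queries and keys, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)    # normalize scores into attention weights
    return weights @ V                    # weighted sum of the values

# Toy example: 2 queries attending over 3 key/value pairs of dimension 4.
Q = np.random.randn(2, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
print(attention(Q, K, V).shape)  # (2, 4)
```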

3.2. Previous Works

The paper contextualizes its contributions by discussing several key prior studies:

  • Multimodal Recommendation:
    • VBPR [3]: An early work that integrated pre-trained visual features into matrix factorization, a foundational CF technique.
    • ACF [1] and UVCAN [14]: Employed attention mechanisms to dynamically select informative content from multimodal data.
    • MMGCN [27]: Used Graph Neural Networks (GNNs) to model complex relationships and propagate information across a multimodal graph.
    • MISSRec [25] and MMSRec [23]: Explored self-supervised learning and modality-specific modeling for sequential recommendation. These works are primarily discriminative approaches, meaning they predict a rating or select from a predefined set of items, which can be computationally expensive and struggle with issues like false-negatives.
  • Generative Recommendation:
    • Text-based approaches [9]: Simple methods that map items to discrete token sequences, often directly from textual descriptions.
    • VQ-based models (e.g., TIGER [20], LETTER [26]): These models use Vector Quantization (VQ) techniques, often RQ-VAE [8], to learn semantic codes from item features. LETTER notably improved this by incorporating collaborative signals to align learned codes.
    • MMGRec [11]: Incorporated multimodal features into the generation pipeline using graph-based architectures to tokenize fused multimodal information.
  • Underlying Technologies:
    • LightGCN [4]: A simplified but effective GNN used in CEMG for collaborative feature encoding.
    • VGG [22]: A widely used Convolutional Neural Network (CNN) for image classification, employed as the Visual Encoder.
    • BERT [24]: A powerful Transformer-based language model for text understanding, used as the Textual Encoder.
    • T5 [19]: A Transformer-based Large Language Model used as the generative backbone in CEMG.

3.3. Technological Evolution

The field of recommender systems has evolved significantly:

  1. Traditional Collaborative Filtering (CF): Began with matrix factorization and neighborhood-based methods (e.g., Koren et al. [7]). These primarily relied on item IDs and interaction patterns, struggling with cold-start and long-tail items due to their inability to understand item semantics.

  2. Sequential Recommendation: Introduced the concept of user session or interaction sequence (e.g., GRU4Rec [5], SASRec [6]), moving beyond static preferences to model dynamic user interests using Recurrent Neural Networks (RNNs) or Transformers. Still largely ID-based.

  3. Multimodal Recommendation: Began integrating rich content features (images, text) to enhance item representations (e.g., VBPR [3], ACF [1]). These improved semantic understanding but often remained in the embed-and-retrieve paradigm, where item embeddings are learned and then similarity search is performed. Fusion strategies evolved from simple concatenation to attention mechanisms and GNNs.

  4. Generative Recommendation: The most recent paradigm shift, inspired by Large Language Models. It re-frames recommendation as a sequence-to-sequence generation task, moving from predicting item IDs to generating semantic token sequences that represent items (e.g., TIGER [20], LETTER [26], MMGRec [11]). This offers greater flexibility and addresses cold-start issues by generating novel combinations of features.

    CEMG fits into this timeline by pushing the boundaries of generative recommendation. It addresses the current shortcomings of generative models by offering a more sophisticated multimodal and collaborative signal integration strategy, moving towards a truly holistic item representation before tokenization and LLM-based generation.

3.4. Differentiation Analysis

Compared to the main methods in related work, CEMG introduces several core differences and innovations:

  • Deep, Collaborative-Guided Multimodal Fusion: Unlike existing generative models that perform multimodal feature fusion in a decoupled or superficial manner, or tokenize primarily based on unimodal data (e.g., TIGER), CEMG introduces a novel Multimodal Fusion Layer. This layer dynamically integrates visual and textual features under the explicit guidance of collaborative signals. The collaborative embedding acts as a query to weigh the importance of different modalities, ensuring that content features are interpreted within their social and behavioral context. This is a significant improvement over LETTER, which aligns quantized representations with collaborative embeddings but doesn't guide the fusion process itself.
  • Unified Semantic Space for Tokenization: By deeply fusing all information sources before tokenization, CEMG creates a truly holistic, unified item representation. This contrasts with methods that tokenize based on unimodal data or use shallow fusion, which can lead to fragmented semantic tokens. The RQ-VAE then converts this rich, unified representation into compact, discrete semantic tokens, ensuring that the generated codes encapsulate both content and collaborative wisdom.
  • End-to-End LLM-based Generation with Constrained Decoding: While LLM-based methods like LlamaRec [28] and LLM-ESR [13] use LLMs, they often operate on item titles or raw text, directly prompting the LLM to generate item names or IDs. CEMG, however, leverages a powerful LLM (T5) to autoregressively generate the semantic token sequences of items, which are learned from the deeply fused multimodal and collaborative features. This two-stage approach (tokenization then generation) allows for more structured and controllable generation. Furthermore, CEMG employs a prefix tree (Trie)-based constrained decoding strategy during inference, which is crucial for ensuring that the LLM generates only valid item token sequences, a practical consideration often overlooked or less robustly addressed in simpler LLM-based recommendation approaches.
  • Robustness to Cold-Start Items: By embedding collaborative signals into the multimodal fusion and tokenization process, CEMG is designed to generalize better, especially for cold-start items. This is because even with sparse interaction data, the rich multimodal content, guided by some collaborative context (even if limited), can still form meaningful semantic tokens, outperforming ID-based methods and even other content-aware baselines.

4. Methodology

4.1. Principles

The core principle of CEMG is to overcome the limitations of superficial collaborative signal integration and decoupled multimodal feature fusion in generative recommendation by creating a truly holistic item representation. This is achieved through a collaborative-guided attention mechanism that dynamically combines visual and textual features with collaborative embeddings. This unified representation is then transformed into discrete semantic tokens using a Residual Quantization VAE (RQ-VAE). Finally, recommendation is reframed as a conditional language generation task, where a Large Language Model (LLM) autoregressively generates these item tokens, ensuring that the generated recommendations are semantically rich, contextually relevant, and align with user preferences derived from both content and collective behavior.

4.2. Core Methodology In-depth (Layer by Layer)

The CEMG framework comprises three main components: the Multimodal Encoding Layer, Unified Modality Tokenization, and End-to-End Generative Recommendation. The overall architecture is depicted in Figure 1.

Fig. 1: The overall architecture of the CEMG framework. The framework is composed of three main components. The Multimodal Encoding Layer integrates visual ($e_i^V$), collaborative ($e_i^C$), and textual ($e_i^T$) features via the Multimodal Fusion Layer to produce a unified representation $\mathbf{x}_i$. The Unified Modality Tokenization stage, utilizing a Residual Quantization VAE (RQ-VAE), converts $\mathbf{x}_i$ into a discrete sequence of semantic tokens. Finally, the End-to-End Generative Recommendation module takes historical token sequences as input and autoregressively generates the tokens for the next recommended item.

4.2.1. Problem Definition

The problem is formally defined as predicting the top-$K$ items a user $u$ is most likely to interact with next, given their historical interactions $S_u = [i_1, i_2, \dots, i_L]$. Each item $i$ is associated with multimodal content ($V_i$ for image, $T_i$ for text). Instead of using atomic item IDs, each item $i$ is represented as a sequence of $M$ discrete semantic tokens, denoted as $\mathbf{c}_i = [c_{i,1}, c_{i,2}, \dots, c_{i,M}]$. Each token $c_{i,m}$ is an index from a codebook. The recommendation task is thus transformed into generating the token sequence $\mathbf{c}_{i_{L+1}}$ for the next item based on the historical token sequences corresponding to $S_u$. Formally, the model aims to learn the probability: $ P(\mathbf{c}_{i_{L+1}} | S_u) = \prod_{m=1}^{M} P(c_{i_{L+1},m} | \{\mathbf{c}_{i_j}\}_{j=1}^{L}, c_{i_{L+1},1}, \dots, c_{i_{L+1},m-1}) $ Where:

  • $P(\mathbf{c}_{i_{L+1}} | S_u)$ is the probability of generating the entire sequence of tokens for the $(L+1)$-th item, given the user's historical interaction sequence $S_u$.
  • $\prod_{m=1}^{M}$ denotes the product over the $M$ tokens in the sequence.
  • $P(c_{i_{L+1},m} | \{\mathbf{c}_{i_j}\}_{j=1}^{L}, c_{i_{L+1},1}, \dots, c_{i_{L+1},m-1})$ is the probability of generating the $m$-th token of the $(L+1)$-th item, conditioned on all previous item token sequences in the history (up to $L$ items) and all previously generated tokens for the current $(L+1)$-th item (from $1$ to $m-1$). This reflects the autoregressive nature of the generation process.
  • $\mathbf{c}_{i_j}$ represents the token sequence for item $i_j$ (a worked expansion follows this list).
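For concreteness, with the $M = 4$ tokens-per-item configuration reported later in Section 5.4, the factorization unrolls to:

```latex
P(\mathbf{c}_{i_{L+1}} \mid S_u)
  = P(c_{i_{L+1},1} \mid \{\mathbf{c}_{i_j}\}_{j=1}^{L})
    \cdot P(c_{i_{L+1},2} \mid \{\mathbf{c}_{i_j}\}_{j=1}^{L},\, c_{i_{L+1},1})
    \cdot P(c_{i_{L+1},3} \mid \{\mathbf{c}_{i_j}\}_{j=1}^{L},\, c_{i_{L+1},1}, c_{i_{L+1},2})
    \cdot P(c_{i_{L+1},4} \mid \{\mathbf{c}_{i_j}\}_{j=1}^{L},\, c_{i_{L+1},1}, c_{i_{L+1},2}, c_{i_{L+1},3}).
```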

4.2.2. Multimodal Encoding Layer

This layer is responsible for learning a unified, dense representation for each item that combines its multimodal and collaborative characteristics.

4.2.2.1. Multimodal Feature Encoding

For each item ii, features are extracted from its associated image and text using pre-trained encoders.

  • Visual Encoder: A pre-trained VGG network [22] is used to process the image ViV_i and extract its visual features. VGG is a Convolutional Neural Network (CNN) known for its deep architecture, effective in image recognition tasks.

  • Textual Encoder: A pre-trained BERT model [24] is employed to encode the textual description TiT_i. BERT (Bidirectional Encoder Representations from Transformers) is a powerful Transformer-based language model that captures contextual information from text. The embedding of the [CLS] token, which is a special classification token, is taken as the text representation for the entire input sequence.

    The raw feature vectors from VGG and BERT are then passed through a Principal Component Analysis (PCA) layer for dimensionality reduction. This yields the final visual and textual embeddings, $\mathbf{e}_i^v \in \mathbb{R}^d$ and $\mathbf{e}_i^t \in \mathbb{R}^d$, respectively, where $d$ is the reduced dimension.
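The sketch below illustrates this reduction step under the assumption that raw visual and textual features have already been extracted offline; the matrices are random placeholders and the sizes are only illustrative, not the authors' exact pipeline.

```python
# Hedged sketch of the dimensionality-reduction step. We assume the raw visual
# and textual features have already been extracted (e.g., VGG activations and
# BERT [CLS] embeddings); the matrices below are random placeholders.
import numpy as np
from sklearn.decomposition import PCA

d = 768                                          # shared embedding dimension used in the paper
num_items = 1000
raw_visual = np.random.randn(num_items, 4096)    # placeholder for VGG image features
raw_textual = np.random.randn(num_items, 768)    # placeholder for BERT [CLS] embeddings

e_v = PCA(n_components=d).fit_transform(raw_visual)    # visual embeddings e_i^v in R^d
e_t = PCA(n_components=d).fit_transform(raw_textual)   # textual embeddings e_i^t in R^d
print(e_v.shape, e_t.shape)                            # (1000, 768) (1000, 768)
```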

4.2.2.2. Collaborative Feature Encoding

To capture collaborative signals (community preferences and interaction patterns), user-item interactions are modeled as a bipartite graph $\mathcal{G} = (\mathcal{U} \cup \mathcal{I}, \mathcal{E})$. Here, $\mathcal{U}$ is the set of users, $\mathcal{I}$ is the set of items, and an edge $(u, i) \in \mathcal{E}$ exists if user $u$ has interacted with item $i$. LightGCN [4], a simplified yet powerful Graph Neural Network (GNN), is then used to learn user and item embeddings. LightGCN propagates embeddings by aggregating messages from a node's neighborhood over multiple layers, effectively distilling high-order connectivity patterns from the interaction graph. This process yields a collaborative embedding $\mathbf{e}_i^c \in \mathbb{R}^d$ for item $i$.
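The following toy sketch illustrates LightGCN-style propagation on a small bipartite graph; dense NumPy is used for readability, real implementations use sparse matrices, and the interaction matrix is randomly generated rather than taken from the datasets.

```python
# Illustrative LightGCN-style propagation over a toy user-item bipartite graph.
import numpy as np

num_users, num_items, d, num_layers = 4, 5, 8, 3
R = (np.random.rand(num_users, num_items) < 0.4).astype(float)   # toy user-item interactions

# Symmetric adjacency of the bipartite graph and its D^{-1/2} A D^{-1/2} normalization.
A = np.block([[np.zeros((num_users, num_users)), R],
              [R.T, np.zeros((num_items, num_items))]])
deg = A.sum(axis=1)
with np.errstate(divide="ignore"):
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

E = np.random.randn(num_users + num_items, d) * 0.01              # initial user/item embeddings
layers = [E]
for _ in range(num_layers):
    layers.append(A_norm @ layers[-1])                            # parameter-free neighborhood aggregation
E_final = np.mean(layers, axis=0)                                 # layer-wise mean pooling, as in LightGCN

item_collab_emb = E_final[num_users:]                             # collaborative embeddings e_i^c per item
print(item_collab_emb.shape)                                      # (5, 8)
```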

4.2.2.3. Multimodal Fusion Layer

This is a key innovation where the collaborative embedding acts as a guide to dynamically integrate visual and textual features. The hypothesis is that an item's collaborative context should determine the relative importance of its visual versus textual attributes. A guided attention mechanism is designed for this purpose:

  • The collaborative embedding $\mathbf{e}_i^c$ acts as the query.

  • The visual embedding $\mathbf{e}_i^v$ and textual embedding $\mathbf{e}_i^t$ serve as both keys and values.

    The attention weights are computed as: $ \alpha_m = \frac{\exp((\mathbf{W}_q \mathbf{e}_i^c)^\top (\mathbf{W}_k \mathbf{e}_i^m))}{\sum_{m' \in \{v,t\}} \exp((\mathbf{W}_q \mathbf{e}_i^c)^\top (\mathbf{W}_k \mathbf{e}_i^{m'}))}, \quad \text{for } m \in \{v, t\} $ Where:

  • $\alpha_m$ is the attention weight for modality $m$ (either visual $v$ or textual $t$).

  • $\mathbf{e}_i^c$ is the collaborative embedding for item $i$.

  • $\mathbf{e}_i^m$ is the embedding for modality $m$ (either $\mathbf{e}_i^v$ or $\mathbf{e}_i^t$).

  • $\mathbf{W}_q, \mathbf{W}_k \in \mathbb{R}^{d \times d}$ are learnable projection matrices that transform the query and key embeddings into a suitable space for dot-product similarity calculation.

  • $(\mathbf{W}_q \mathbf{e}_i^c)^\top (\mathbf{W}_k \mathbf{e}_i^m)$ calculates the dot-product similarity between the transformed collaborative query and the transformed modality key.

  • $\exp(\cdot)$ is the exponential function, used to ensure positive values.

  • The denominator $\sum_{m' \in \{v,t\}} \exp((\mathbf{W}_q \mathbf{e}_i^c)^\top (\mathbf{W}_k \mathbf{e}_i^{m'}))$ is a normalization term (the softmax denominator) that ensures the attention weights sum to 1.

    The final fused representation $\mathbf{x}_i \in \mathbb{R}^{2d}$ is a concatenation of the weighted multimodal features and the guiding collaborative feature: $ \mathbf{x}_i = [\alpha_v \mathbf{e}_i^v \oplus \alpha_t \mathbf{e}_i^t ; \mathbf{e}_i^c] $ Where:

  • $\alpha_v$ and $\alpha_t$ are the attention weights for visual and textual modalities, respectively.

  • $\mathbf{e}_i^v$ and $\mathbf{e}_i^t$ are the visual and textual embeddings.

  • $\oplus$ denotes element-wise addition of the weighted visual and textual embeddings. This creates a combined content representation.

  • $\mathbf{e}_i^c$ is the collaborative embedding.

  • [ ; ] denotes concatenation of the combined content representation and the collaborative embedding. This creates a unified vector $\mathbf{x}_i$ that holistically represents item $i$ by integrating content and collaborative signals (a minimal sketch of this fusion step follows this list).
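The sketch below walks through the two equations above for a single item; the projection matrices W_q and W_k are random placeholders, whereas in CEMG they are learnable parameters.

```python
# A minimal NumPy sketch of the collaborative-guided fusion for a single item.
import numpy as np

d = 8                                              # toy embedding dimension
e_v, e_t, e_c = np.random.randn(d), np.random.randn(d), np.random.randn(d)
W_q, W_k = np.random.randn(d, d), np.random.randn(d, d)

q = W_q @ e_c                                      # collaborative embedding acts as the query
scores = np.array([q @ (W_k @ e_v), q @ (W_k @ e_t)])
scores -= scores.max()                             # stabilize the softmax
alpha_v, alpha_t = np.exp(scores) / np.exp(scores).sum()

content = alpha_v * e_v + alpha_t * e_t            # weighted element-wise combination (the ⊕ step)
x_i = np.concatenate([content, e_c])               # concatenation with e_i^c gives x_i in R^{2d}
print(round(alpha_v + alpha_t, 6), x_i.shape)      # weights sum to 1.0, shape (16,)
```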

4.2.3. Unified Modality Tokenization

The unified representation $\mathbf{x}_i$ for each item is then tokenized into a discrete sequence of semantic tokens using a Residual Quantization Variational Autoencoder (RQ-VAE) [8].

The RQ-VAE architecture consists of three main parts:

  1. Encoder: Maps the continuous, unified item representation $\mathbf{x}_i$ to a continuous latent vector $\mathbf{z}_i$.

  2. Residual Quantizer: Approximates $\mathbf{z}_i$ iteratively across $M$ codebook layers. In each stage $m \in \{1, \dots, M\}$:

    • It finds the closest codevector $\mathbf{b}_{m,k}$ from the $m$-th codebook $\mathcal{C}_m$ to the current residual (the part of $\mathbf{z}_i$ that has not yet been quantized).
    • The index $k$ of this closest codevector is selected as the $m$-th token, $c_{i,m}$.
    • This selected codevector is then subtracted from the residual to form the next residual for the subsequent layer. The sequence of selected codebook indices $[c_{i,1}, \dots, c_{i,M}]$ becomes the item's semantic token sequence $\mathbf{c}_i$.
  3. Decoder: Produces a reconstruction $\hat{\mathbf{x}}_i$ of the original unified vector from the sum of the selected codevectors (i.e., from the discrete semantic token sequence).

    The RQ-VAE is trained by minimizing a composite loss function that ensures both semantic fidelity (accurate reconstruction of $\mathbf{x}_i$) and codebook quality (effective utilization of discrete codes): $ \mathcal{L}_{\mathrm{RQ\text{-}VAE}} = \mathcal{L}_{\mathrm{recon}} + \lambda_q \mathcal{L}_{\mathrm{quant}} + \lambda_d \mathcal{L}_{\mathrm{div}} $ Where:

  • $\mathcal{L}_{\mathrm{RQ\text{-}VAE}}$ is the total loss for training the RQ-VAE.
  • $\mathcal{L}_{\mathrm{recon}} = \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|_2^2$ is the reconstruction loss. This is a mean squared error (MSE) that measures the difference between the original unified item representation $\mathbf{x}_i$ and its reconstruction $\hat{\mathbf{x}}_i$ by the decoder. Minimizing this loss encourages the RQ-VAE to accurately capture the semantic information.
  • $\mathcal{L}_{\mathrm{quant}}$ is the VQ commitment loss [17]. This loss term encourages the output of the encoder (the continuous latent vector $\mathbf{z}_i$) to "commit" to, or stay close to, the chosen codebook entries. It helps stabilize the learning of the codebooks and ensures that the encoder produces latent vectors that are easily quantizable.
  • $\mathcal{L}_{\mathrm{div}}$ is a diversity loss [12]. This term promotes the utilization of diverse codes within each codebook, preventing codebook collapse, where only a few codevectors are frequently chosen and codebook capacity is underutilized.
  • $\lambda_q$ and $\lambda_d$ are balancing hyperparameters that control the relative importance of the quantization loss and diversity loss, respectively, compared to the reconstruction loss (a minimal sketch of the quantization step follows this list).
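The sketch below shows only the residual quantization step described in point 2; the codebooks are random placeholders, whereas the real RQ-VAE learns them jointly with the encoder, decoder, and the losses above.

```python
# Minimal sketch of residual quantization (inference path only).
import numpy as np

M, K, d_latent = 4, 512, 32
codebooks = [np.random.randn(K, d_latent) for _ in range(M)]  # one codebook per level (placeholders)

def residual_quantize(z):
    residual = z.copy()
    tokens, quantized = [], np.zeros_like(z)
    for m in range(M):
        dists = np.linalg.norm(codebooks[m] - residual, axis=1)  # distance to each codevector
        k = int(np.argmin(dists))                                # index k becomes the m-th semantic token
        tokens.append(k)
        quantized += codebooks[m][k]
        residual = residual - codebooks[m][k]                    # pass the residual to the next level
    return tokens, quantized

z_i = np.random.randn(d_latent)             # encoder output for one item
tokens, z_hat = residual_quantize(z_i)
print(tokens, np.linalg.norm(z_i - z_hat))  # 4 discrete codes and the remaining approximation error
```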

4.2.4. End-to-End Generative Recommendation

After the tokenization stage, each item $i$ is represented by its semantic token sequence $\mathbf{c}_i$. The recommendation task is then reframed as a conditional generation problem using a Large Language Model (LLM).

4.2.4.1. Interaction History Prompting

For a user with history $S_u = [i_1, \dots, i_L]$, each item $i_j$ is converted into its token sequence $\mathbf{c}_{i_j}$. Each token within these sequences is represented by a special symbol (e.g., <a_12> for the 12th token from the first codebook layer 'a'). The complete prompt for the LLM is constructed as a sequence of these item tokens, preserving their chronological order. The LLM is then tasked to autoregressively predict the token sequence of the next item, $\mathbf{c}_{i_{L+1}}$.
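A small sketch of this prompt construction follows; the symbol format and level prefixes are illustrative assumptions based on the <a_12> example above, not the authors' exact vocabulary.

```python
# Hedged sketch of interaction-history prompting: each item's M=4 semantic
# tokens are rendered as special symbols like <a_12>, <b_7>, ... and
# concatenated in chronological order.
level_prefix = ["a", "b", "c", "d"]             # one prefix per codebook level (assumed naming)

def item_to_symbols(token_ids):
    return [f"<{level_prefix[m]}_{k}>" for m, k in enumerate(token_ids)]

history = [[12, 301, 7, 45], [88, 2, 410, 9]]   # token sequences of two past items (toy values)
prompt = " ".join(sym for item in history for sym in item_to_symbols(item))
print(prompt)
# <a_12> <b_301> <c_7> <d_45> <a_88> <b_2> <c_410> <d_9>
```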

4.2.4.2. Training and Inference

  • Training: A powerful LLM (T5 [19] is used in this paper; although T5 is a sequence-to-sequence model with an encoder-decoder architecture, the user history is passed through its encoder and the item tokens are generated autoregressively by its decoder) serves as the generative backbone. The model is trained using a standard next-token prediction objective, minimizing the cross-entropy loss between the predicted token probabilities and the ground-truth target tokens: $ \mathcal{L}_{\mathrm{NTP}} = - \sum_{j=1}^{L} \sum_{m=1}^{M} \log P(c_{i_{j+1},m} | \{\mathbf{c}_{i_k}\}_{k=1}^{j}, c_{i_{j+1},1}, \dots, c_{i_{j+1},m-1}) $ Where:

    • $\mathcal{L}_{\mathrm{NTP}}$ is the Next Token Prediction loss.
    • The outer sum iterates through each item $j$ in the user's historical sequence, predicting the next item $i_{j+1}$.
    • The inner sum iterates through each token $m$ within the target item's token sequence $\mathbf{c}_{i_{j+1}}$.
    • $\log P(c_{i_{j+1},m} | \{\mathbf{c}_{i_k}\}_{k=1}^{j}, c_{i_{j+1},1}, \dots, c_{i_{j+1},m-1})$ is the log-probability of predicting the $m$-th token of item $i_{j+1}$, conditioned on all preceding item token sequences in the history (up to item $i_j$) and all previously generated tokens for the current item $i_{j+1}$ (from $1$ to $m-1$). This is the standard cross-entropy loss for autoregressive sequence generation.
  • Inference: Given a user's history prompt, beam search is used to generate multiple candidate token sequences for the next item. Beam search explores a set of the most probable sequences at each step, rather than just the single most probable one, to find better overall sequences. The score of a candidate sequence $\mathbf{c} = [c_1, \dots, c_M]$ is the sum of its log-probabilities: $ \mathrm{Score}(\mathbf{c}) = \sum_{m=1}^{M} \log P(c_m | \mathrm{prompt}, c_1, \dots, c_{m-1}) $ Where:

    • $\mathrm{Score}(\mathbf{c})$ is the total score for a generated candidate token sequence $\mathbf{c}$.

    • The sum calculates the total log-likelihood of generating the sequence, given the initial prompt (user history) and the previously generated tokens within $\mathbf{c}$.

      To ensure that only valid item sequences are generated, a prefix tree (Trie)-based constrained decoding strategy is employed. The Trie contains all valid item token sequences from the entire item catalog. At each generation step, the LLM's output vocabulary is masked to only allow tokens that form a valid prefix according to the Trie, drastically pruning the search space and guaranteeing the validity of the final recommendations by preventing the LLM from "hallucinating" non-existent item token sequences (a minimal sketch of this constraint follows).
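The sketch below illustrates the Trie constraint in isolation: it builds a prefix tree over catalog token sequences and returns the tokens allowed after a given prefix. How these allowed tokens are used to mask the LLM's logits during beam search is omitted, and the catalog symbols are toy values.

```python
# Sketch of Trie-based constrained decoding: at each step, the allowed next
# tokens are restricted to children of the already-generated prefix, so only
# catalog items can ever be produced.

def build_trie(item_token_seqs):
    trie = {}
    for seq in item_token_seqs:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})   # descend, creating child nodes as needed
    return trie

def allowed_next_tokens(trie, prefix):
    node = trie
    for tok in prefix:
        if tok not in node:
            return []                          # invalid prefix: nothing is allowed
        node = node[tok]
    return list(node.keys())

catalog = [["<a_1>", "<b_3>", "<c_9>", "<d_2>"],
           ["<a_1>", "<b_5>", "<c_0>", "<d_7>"]]
trie = build_trie(catalog)
print(allowed_next_tokens(trie, ["<a_1>"]))    # ['<b_3>', '<b_5>']
```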

5. Experimental Setup

5.1. Datasets

The authors evaluate CEMG on three widely used public datasets:

  • Amazon reviews (Beauty and Sports): These datasets consist of user reviews and interaction data for products from the "Beauty" and "Sports & Outdoors" categories on Amazon.

  • Yelp: This dataset contains user reviews and interactions for businesses on the Yelp platform.

    For each interaction in these datasets, associated item images and textual descriptions are collected. To ensure data quality, users and items with fewer than 5 interactions are filtered out, which is a common practice in recommender systems research to focus on more active entities and reduce noise.

The statistics of the processed datasets are summarized in the following table (Table 1 from the original paper):

Attribute Beauty Sports Yelp
#Users 22,363 35,598 30,431
#Items 12,101 18,357 20,033
#Interactions 198,502 296,337 316,942
Avg. Len. 8.9 8.3 10.4
Sparsity 99.93% 99.95% 99.95%

Where:

  • #Users: The total number of unique users in the dataset.

  • #Items: The total number of unique items in the dataset.

  • #Interactions: The total number of user-item interactions recorded.

  • Avg. Len.: The average length of user interaction sequences.

  • Sparsity: The percentage of empty cells in the user-item interaction matrix, indicating how sparse the interaction data is (a higher percentage means more sparse data, making recommendation more challenging).

    These datasets are chosen because they are widely recognized benchmarks in the recommendation research community, featuring real-world user behaviors and rich multimodal content (reviews for text, product/business images). Their diversity in domains (e-commerce products, local businesses) and scale make them effective for validating the generalizability and performance of multimodal generative recommendation methods.

5.2. Evaluation Metrics

The authors adopt a leave-one-out strategy for evaluation: for each user, their last interacted item is used as the ground truth for testing, the second-to-last for validation, and the remaining interactions for training. This setup simulates the real-world scenario of predicting the immediate next interaction.
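A trivial sketch of this split for a single user's sequence, assuming at least three interactions per user (guaranteed here by the 5-interaction filtering described above):

```python
# Leave-one-out split: last item for testing, second-to-last for validation,
# the rest for training.
def leave_one_out(sequence):
    assert len(sequence) >= 3, "need at least 3 interactions"
    return sequence[:-2], sequence[-2], sequence[-1]   # train, valid, test

train, valid, test = leave_one_out(["i1", "i2", "i3", "i4", "i5"])
print(train, valid, test)   # ['i1', 'i2', 'i3'] i4 i5
```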

The performance of all models is evaluated using two common ranking metrics at specific cutoffs ($K=10$ and $K=20$): Hit Rate (HR) and Normalized Discounted Cumulative Gain (NDCG).

5.2.1. Hit Rate (HR@K)

  • Conceptual Definition: Hit Rate (also known as Recall) at $K$ measures the proportion of users for whom the ground-truth item (the next item they interacted with) appears within the top $K$ items recommended by the system. It is a simple binary measure: either the item is in the top $K$ or it is not. It reflects how often the model successfully recommends the relevant item within the limited top-$K$ list.
  • Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of users for whom the ground-truth item is in top-K}}{\text{Total number of users}} $
  • Symbol Explanation:
    • Number of users for whom the ground-truth item is in top-K: Counts how many times the actual next item for a user is found among the first $K$ recommendations.
    • Total number of users: The total number of users considered in the evaluation set.

5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)

  • Conceptual Definition: NDCG at $K$ is a ranking quality metric that accounts for the position of relevant items. It gives higher scores to relevant items that appear higher in the recommendation list and penalizes relevant items that appear lower. It also normalizes the score to be between 0 and 1, allowing comparison across different users and queries. NDCG is particularly suitable for scenarios where the order of recommended items matters.
  • Mathematical Formula: $ \mathrm{NDCG@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{\mathrm{DCG@K}_u}{\mathrm{IDCG@K}_u} $ where $ \mathrm{DCG@K}_u = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}(j)} - 1}{\log_2(j+1)} $ and $\mathrm{IDCG@K}_u$ is the Ideal DCG (the maximum possible DCG for user $u$), calculated by placing all relevant items at the top of the list: $ \mathrm{IDCG@K}_u = \sum_{j=1}^{\min(K, |\mathrm{Rel}_u|)} \frac{2^{1} - 1}{\log_2(j+1)} $
  • Symbol Explanation:
    • $|\mathcal{U}|$: The total number of users.
    • $\mathrm{DCG@K}_u$: Discounted Cumulative Gain for user $u$ at cutoff $K$.
    • $\mathrm{IDCG@K}_u$: Ideal Discounted Cumulative Gain for user $u$ at cutoff $K$. This is the maximum possible DCG value if the recommended list were perfectly ordered.
    • $\mathrm{rel}(j)$: The relevance score of the item at rank $j$. For recommendation tasks with implicit feedback, $\mathrm{rel}(j)$ is typically 1 if the item at rank $j$ is the ground-truth next item, and 0 otherwise.
    • $\min(K, |\mathrm{Rel}_u|)$: The minimum of $K$ and the number of relevant items for user $u$. In a leave-one-out setting, $|\mathrm{Rel}_u|$ is typically 1 (a minimal computation sketch follows this list).
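The following sketch computes both metrics for the leave-one-out setting, where each user has exactly one ground-truth item (relevance 1), so IDCG@K = 1; the recommendation lists are toy values.

```python
# Minimal HR@K / NDCG@K computation for the leave-one-out setting.
import math

def hr_ndcg_at_k(ranked_lists, ground_truths, k=10):
    hits, ndcg = 0.0, 0.0
    for ranked, target in zip(ranked_lists, ground_truths):
        top_k = ranked[:k]
        if target in top_k:
            rank = top_k.index(target) + 1     # 1-based position of the ground-truth item
            hits += 1.0
            ndcg += 1.0 / math.log2(rank + 1)  # (2^1 - 1) / log2(rank + 1), IDCG = 1
    n = len(ground_truths)
    return hits / n, ndcg / n

recs = [["A", "B", "C"], ["D", "E", "F"]]
truth = ["B", "X"]
print(hr_ndcg_at_k(recs, truth, k=3))   # (0.5, 0.315...)
```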

5.3. Baselines

CEMG is compared against four categories of baseline models to demonstrate its comprehensive superiority:

  • Sequential Methods: These models capture the temporal dependencies in user interactions.
    • GRU4Rec [5]: A foundational session-based recommender that uses Gated Recurrent Units (GRUs) to model sequential user behavior.
    • SASRec [6]: Self-Attentive Sequential Recommendation, a Transformer-based model that applies self-attention to capture long-range dependencies in user sequences.
  • Multimodal Methods: These models incorporate various content modalities.
    • MMSRec [23]: Self-supervised Multi-Modal Sequential Recommendation.
    • MISSRec [25]: Pre-training and Transferring Multi-modal Interest-aware Sequence Representation for Recommendation.
  • LLM-based Methods: These leverage Large Language Models directly.
    • LlamaRec [28]: A two-stage recommendation approach using LLMs for ranking.
    • LLM-ESR [13]: Large Language Models Enhancement for Long-tailed Sequential Recommendation. These methods typically operate on item titles or raw text as input to the LLMs.
  • Generative Methods: These represent the state-of-the-art in generative recommendation using semantic IDs.
    • TIGER [20]: A generative retrieval model that tokenizes items into discrete semantic codes and then generates these codes.
    • LETTER [26]: Learnable Item Tokenization for Generative Recommendation, which improves tokenization by aligning quantized representations with collaborative embeddings.
    • MMGRec [11]: Multimodal Generative Recommendation with Transformer Model, which employs graph-based architectures to tokenize fused multimodal information.

5.4. Implementation Details

The authors provide specific implementation details for the CEMG framework:

  • Feature Embedding Dimension: All feature embeddings (visual, textual, collaborative) are projected to a uniform dimension of $d = 768$.
  • RQ-VAE Configuration:
    • Number of codebook layers ($M$): Configured with $M = 4$. This means each item is represented by a sequence of 4 discrete tokens.
    • Codebook Size ($K$): Each of the $M$ codebooks contains $K = 512$ unique codevectors (discrete symbols).
  • Loss Balancing Weights: Based on parameter analysis, the balancing weights for the RQ-VAE loss were set to $\lambda_q = 0.25$ for the quantization loss and $\lambda_d = 0.01$ for the diversity loss.
  • Generative LLM Backbone: T5 [19] is employed as the generative LLM backbone.
  • Optimizer: The model is trained using the AdamW optimizer.
  • Learning Rate: A learning rate of $1 \times 10^{-4}$ is used.
  • Hardware: Training is performed on NVIDIA A100 GPUs (these settings are collected in the sketch after this list).
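For reference, the reported hyperparameters gathered into a single configuration dict; the key names are our own, while the values are those listed above.

```python
# Reported CEMG hyperparameters as a config dict (key names are assumptions).
CEMG_CONFIG = {
    "embedding_dim": 768,        # d: shared dimension for visual/textual/collaborative features
    "codebook_layers": 4,        # M: tokens per item
    "codebook_size": 512,        # K: codevectors per codebook
    "lambda_quant": 0.25,        # weight of the VQ commitment loss
    "lambda_div": 0.01,          # weight of the diversity loss
    "llm_backbone": "T5",
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
}
```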

6. Results & Analysis

6.1. Core Results Analysis (RQ1)

The main experimental results, comparing CEMG against various baselines on three datasets, are presented in the following table (Table 2 from the original paper).

The following are the results from Table 2 of the original paper:

Category Model Beauty Sports Yelp
HR@10 NDCG@10 HR@10 NDCG@10 HR@10 NDCG@10
Sequential GRU4Rec 0.0385 0.0116 0.0201 0.0045 0.0288 0.0095
SASRec 0.0434 0.0147 0.0232 0.0061 0.0329 0.0121
Multimodal MISSRec 0.0577 0.0287 0.0305 0.0118 0.0387 0.0163
MMSRec 0.0581 0.0292 0.0311 0.0124 0.0395 0.0171
LLM-based LlamaRec 0.0492 0.0198 0.0256 0.0083 0.0341 0.0134
LLM-ESR 0.0515 0.0214 0.0269 0.0091 0.0353 0.0140
Generative TIGER 0.0533 0.0251 0.0281 0.0103 0.0368 0.0151
LETTER 0.0552 0.0268 0.0295 0.0111 0.0377 0.0159
MMGRec 0.0571 0.0281 0.0302 0.0119 0.0389 0.0166
CEMG 0.0665 0.0348 0.0363 0.0157 0.0458 0.0212
Improvement (%) +14.46% +19.18% +16.72% +26.61% +15.95% +23.98%

The results clearly demonstrate that CEMG consistently and significantly outperforms all baseline models across all three datasets (Beauty, Sports, Yelp) and both evaluation metrics (HR@10 and NDCG@10). The improvements are statistically significant ($p < 0.05$).

Key Observations:

  • Overall Superiority: CEMG achieves the highest scores in all categories, with impressive relative improvements over the best baselines. For example, on the Beauty dataset, CEMG shows a +14.46% improvement in HR@10 and +19.18% in NDCG@10 over the best baseline (MMSRec). On Sports, the NDCG@10 improvement is a remarkable +26.61%.
  • Generative Models Perform Well: Among the baselines, generative methods (TIGER, LETTER, MMGRec) generally outperform traditional sequential (GRU4Rec, SASRec) and LLM-based (LlamaRec, LLM-ESR) methods, and are competitive with or slightly better than multimodal methods (MISSRec, MMSRec). This highlights the inherent potential of the generative paradigm in recommendation.
  • CEMG's Advantage: CEMG's substantial lead over strong multimodal generative baselines like MMGRec and LETTER validates the effectiveness of its core innovations: the collaborative-guided fusion of multimodal features and the advanced generative architecture. It suggests that a deeper, more unified representation learning process, guided by collaborative signals, is crucial for unlocking the full potential of generative recommendation.
  • Importance of Multimodality and Collaboration: While LLM-based methods show some promise, they don't reach the performance of dedicated multimodal generative approaches, indicating that sophisticated fusion and tokenization of multimodal and collaborative signals are more effective than relying solely on raw text for LLM inputs in this context.

6.2. Ablation Study (RQ2)

To understand the contribution of each key component in CEMG, an ablation study was conducted. Several variants of the model were tested, and the results are shown in Figure 2.

Fig. 2: Ablation study results on three datasets for HR@10 and NDCG@10. Performance drops across all variants demonstrate the contribution of each component.

The variants are:

  • w/o Collab: Removes the collaborative features ($\mathbf{e}_i^C$) from the unified representation, meaning the Multimodal Fusion Layer is no longer guided by collaborative signals and the final concatenated vector $\mathbf{x}_i$ does not include $\mathbf{e}_i^C$.
  • w/o Image: Removes the visual features ($\mathbf{e}_i^V$).
  • w/o Text: Removes the textual features ($\mathbf{e}_i^T$).
  • w/o LLM: Replaces the powerful pre-trained T5 backbone (the paper's text also mentions Llama-3-8B) with a standard 6-layer Transformer decoder trained from scratch, similar to TIGER [20].

Analysis of Results:

  • Full CEMG Model is Best: As expected, the full CEMG model consistently achieves the best performance across all datasets and metrics.

  • w/o Collab Shows Significant Drop: Removing collaborative features leads to one of the most significant performance drops. This strongly underscores the vital role of collaborative filtering signals, not just as supplementary information, but as a guiding force in fusing multimodal content and creating a holistic item representation, even in a content-rich generative model. It confirms that user behavior patterns provide crucial personalization cues that content alone cannot fully capture.

  • w/o LLM Also Significant: Replacing the powerful pre-trained LLM with a simpler Transformer decoder trained from scratch also results in a substantial performance degradation. This validates the choice of using a large, pre-trained LLM like T5. Its advanced reasoning, sequence modeling capabilities, and extensive world knowledge learned during pre-training are crucial for accurately generating, in an autoregressive manner, the complex semantic token sequences for the next recommended item.

  • Importance of Multimodal Features: Removing image or text features (i.e., w/o Image and w/o Text) also leads to noticeable performance drops, albeit generally less severe than w/o Collab or w/o LLM. This confirms that CEMG effectively utilizes both visual and textual information. It suggests that each modality contributes unique semantic aspects to the item representation, and their synergistic fusion (especially when guided by collaborative signals) leads to a richer understanding of items. The absence of either modality results in a loss of descriptive power.

    In summary, the ablation study clearly demonstrates that all core components of CEMG—collaborative signals, multimodal content (both image and text), and the powerful LLM backbone—are crucial and contribute synergistically to the model's superior performance.

6.3. Efficiency Analysis (RQ3)

The efficiency of CEMG, in terms of training and inference time, is compared against other state-of-the-art generative models. The results are presented in Figure 3.

Fig. 3: Efficiency comparison on the Beauty and Sports datasets. Left axis (bars) shows training time per epoch, broken down by stage. Right axis (line) shows inference speed in users per second (higher is better).

Analysis:

  • Training Efficiency:
    • CEMG's training time is composed of two stages: tokenization (training the RQ-VAE) and end-to-end generation (fine-tuning the LLM).
    • While the overall training time for CEMG is noted to be higher than TIGER due to the processing of more modality features, it remains highly competitive.
    • The total training time is comparable to, and even slightly better than, LETTER, which requires a complex alignment process. This suggests that CEMG's sophisticated fusion and tokenization pipeline, despite integrating multiple modalities and collaborative signals, does not introduce prohibitive computational overhead during training.
  • Inference Efficiency:
    • This is an area where CEMG particularly excels. CEMG achieves significantly lower inference latency (higher users per second, meaning faster processing) compared to other multimodal generative models like MMGRec and LETTER.

    • This efficiency is attributed to CEMG's design of generating short, fixed-length semantic token sequences (with $M=4$ tokens). This fixed-length, compact representation makes the autoregressive generation process much faster than models that might require more complex generation steps or extensive retrieval processes, which could involve longer sequences or more complex calculations per token.

    • The high inference speed makes CEMG highly practical and suitable for real-world deployment scenarios where quick response times are critical for user experience.

      In conclusion, CEMG strikes an effective balance between superior performance and practical computational cost, especially in terms of inference speed, which is a major advantage for production systems.

6.4. Parameter Analysis (RQ4)

The sensitivity of CEMG's performance to four key hyperparameters in the Unified Modality Tokenization stage is investigated. The results, specifically for HR@10, are shown in Figure 4.

Fig. 4: Parameter sensitivity analysis of CEMG on HR@10 for (a) Number of Codebook Layers, (b) Codebook Size, (c) Quantization Loss Weight, and (d) Diversity Loss Weight.

Analysis:

  • a) Number of Codebook Layers (M):

    • Performance (HR@10) generally improves as $M$ increases from 2 to 4. This is because a higher number of codebook layers allows the RQ-VAE to capture finer-grained semantic details and create richer item representations through more iterative quantization steps.
    • However, performance plateaus at $M=4$ and then slightly declines at $M=8$. This decline suggests that generating longer sequences (more tokens per item) increases the difficulty of the generative task for the LLM, potentially leading to more errors or less stable training.
    • Based on this, $M=4$ is chosen as the optimal setting, offering a good balance between representational power and generation complexity.
  • b) Codebook Size (K):

    • A larger codebook size $K$ generally leads to better performance. This is intuitive, as a larger codebook provides more distinct codevectors, thus offering greater expressive power and capacity for the semantic tokens to represent diverse item features.
    • The performance gain, however, tends to saturate after $K=512$. This indicates that beyond this size, the marginal benefit of adding more codevectors diminishes, suggesting that $K=512$ offers a good balance between expressiveness and computational complexity (larger codebooks require more memory and training time).
  • c) Quantization Loss Weight ($\lambda_q$):

    • This hyperparameter balances the reconstruction quality with the codebook alignment (how closely the encoder output adheres to the codebook entries).
    • Figure 4(c) shows a clear unimodal trend, with performance peaking around $\lambda_q = 0.25$.
    • Values that are too low might not sufficiently encourage the encoder output to align with the codebook, leading to poor quantization. Values that are too high might force the encoder output too rigidly to codebook entries, potentially distorting the latent representation and harming reconstruction quality.
    • Thus, an optimal $\lambda_q$ ensures a proper balance between these objectives.
  • d) Diversity Loss Weight ($\lambda_d$):

    • This weight is crucial for preventing codebook collapse, a phenomenon where only a few codevectors are frequently used, leading to inefficient codebook utilization.

    • Performance improves as $\lambda_d$ increases up to 0.01, confirming the benefit of encouraging diverse codebook usage. This ensures that all codebook entries are meaningfully utilized and contribute to rich item representations.

    • However, excessively high values of $\lambda_d$ can distort the semantic space by over-emphasizing diversity, potentially forcing the model to select less optimal codevectors for reconstruction and thus harming performance.

      These analyses provide valuable insights into configuring CEMG for optimal performance, demonstrating the careful tuning required for VQ-VAE based tokenization.

6.5. Performance on Cold-Start Items (RQ5)

Cold-start items are a critical challenge for recommender systems because they have insufficient interaction data for collaborative filtering to be effective. The authors investigate CEMG's performance on items with five or fewer interactions in the training set. The results are presented in the following table (Table 3 from the original paper).

The following are the results from Table 3 of the original paper:

Model Beauty Sports Yelp
HR@10 NDCG@10 HR@10 NDCG@10 HR@10 NDCG@10
SASRec 0.0112 0.0048 0.0065 0.0027 0.0098 0.0041
MISSRec 0.0254 0.0115 0.0141 0.0068 0.0185 0.0092
MMGRec 0.0268 0.0123 0.0153 0.0075 0.0192 0.0099
CEMG 0.0305 0.0153 0.0183 0.0094 0.0231 0.0125

Analysis:

  • CEMG's Superiority: CEMG substantially outperforms all baselines on cold-start items across all three datasets and metrics. This is a critical finding, demonstrating CEMG's robust generalization capabilities.

  • Content-Aware Models vs. ID-based: As expected, content-aware models like MISSRec and MMGRec perform significantly better than the purely ID-based SASRec on cold-start items. This highlights the inherent advantage of leveraging item multimodal content when interaction data is scarce, as content provides intrinsic item semantics that are available even for new items. SASRec struggles because it primarily relies on interaction history to learn item embeddings, which is minimal for cold-start items.

  • Advantage of Collaborative-Guided Tokenization: CEMG's advanced semantic tokenization provides superior generalization even over other multimodal generative baselines. By learning to generate rich item representations from a collaborative-guided fusion of multimodal content, CEMG remains effective even when explicit interaction signals for a specific item are sparse. The collaborative guidance during fusion helps ensure that the semantic tokens capture not just what the item is (from content) but also how it relates to existing items in the collaborative space, even if its own direct interactions are few. This makes the item representations more robust and transferable, improving recommendations for cold-start scenarios.

    This analysis confirms that CEMG effectively addresses a long-standing challenge in recommender systems, making it more practical for real-world applications with constantly evolving item catalogs.
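
For concreteness, the following is a minimal sketch (an assumed protocol, not taken from the paper) of how a cold-start evaluation like the one above can be set up: items with at most five training interactions are selected, and HR@10 and NDCG@10 are computed only over test cases whose target item is cold-start. All helper names and toy IDs are hypothetical.

```python
# Rough sketch (assumed protocol) of evaluating HR@10 and NDCG@10 restricted
# to cold-start items, i.e. items with <= 5 interactions in the training set.
import math
from collections import Counter


def cold_start_items(train_interactions, max_count=5):
    counts = Counter(item for _, item in train_interactions)
    return {item for item, c in counts.items() if c <= max_count}


def hr_ndcg_at_k(ranked_lists, ground_truth, cold_items, k=10):
    hits, ndcg, n = 0.0, 0.0, 0
    for user, target in ground_truth.items():
        if target not in cold_items:          # keep only cold-start test cases
            continue
        n += 1
        topk = ranked_lists[user][:k]
        if target in topk:
            hits += 1.0
            ndcg += 1.0 / math.log2(topk.index(target) + 2)  # rank 0 -> log2(2)
    return hits / max(n, 1), ndcg / max(n, 1)


# Example with toy data (hypothetical IDs):
train = [("u1", "i1"), ("u2", "i1"), ("u3", "i2")]
cold = cold_start_items(train)                      # {"i1", "i2"} here
ranked = {"u1": ["i2", "i3", "i1"], "u2": ["i1", "i2"]}
truth = {"u1": "i2", "u2": "i1"}
print(hr_ndcg_at_k(ranked, truth, cold, k=10))      # -> (1.0, 1.0)
```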

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces CEMG, a novel generative recommendation framework that significantly advances the state-of-the-art by addressing two critical limitations in existing models: the superficial integration of collaborative signals and the decoupled fusion of multimodal features. CEMG's core innovation lies in its Multimodal Fusion Layer, which dynamically integrates visual and textual content under the explicit guidance of collaborative signals, creating a deeply unified and holistic item representation. This representation is then converted into discrete semantic codes using a Unified Modality Tokenization module powered by an RQ-VAE. Finally, an End-to-End Generative Recommendation component, leveraging a fine-tuned Large Language Model and constrained decoding, autoregressively generates these item codes to produce personalized recommendations. Extensive experiments on three benchmark datasets confirm that CEMG consistently and significantly outperforms a wide range of state-of-the-art baselines, demonstrating its superior performance, efficiency, and robustness, particularly in handling cold-start items.

7.2. Limitations & Future Work

The authors identify one key limitation and propose future research directions:

  • Noisy Multimodal Content: A current limitation is that noisy signals within multimodal content, such as irrelevant image backgrounds or extraneous text, can be inadvertently encoded into the item representations. This noise can potentially compromise the quality of tokenization and subsequently affect recommendation accuracy.
  • Future Work - Advanced Decoding Strategies: To further mitigate recommendation errors and potentially address the issue of noisy signals, the authors plan to explore more advanced decoding strategies in their future work. This could involve techniques that are more robust to noise, or that can leverage additional contextual information during the generation process to produce even more precise and relevant item token sequences.

7.3. Personal Insights & Critique

The CEMG framework represents a significant step forward in generative recommendation, particularly in its elegant solution to deeply integrate multimodal and collaborative signals.

  • Innovation of Collaborative-Guided Fusion: The collaborative-guided multimodal fusion layer is a standout innovation. By using collaborative embeddings as queries to dynamically weigh visual and textual features, the model ensures that content is interpreted in the context of user preferences and item relationships rather than in isolation. This aligns with human intuition: how an item is perceived collectively often dictates which of its attributes are most salient. (A minimal sketch of one possible instantiation of this fusion layer is given near the end of this subsection.)

  • Bridge between Continuous and Discrete Spaces: The use of RQ-VAE for unified modality tokenization is an effective bridge between the rich, continuous latent representations and the discrete semantic tokens required by LLMs. This structured tokenization allows LLMs to generate meaningful item codes rather than raw text, which can be less controllable and prone to hallucination in a recommendation context. The fixed-length, short token sequences contribute to impressive inference efficiency, a critical factor for real-world deployment.

  • Addressing Cold-Start: CEMG's superior performance on cold-start items is a testament to its robust item representation learning. By having a multimodal foundation guided by even sparse collaborative signals, it can generalize well to new items, which is a major pain point for traditional ID-based recommenders.

  • Potential Issues/Unverified Assumptions:

    1. Dependency on Pre-trained Encoders: The model heavily relies on powerful pre-trained encoders (VGG, BERT). The quality of these upstream encoders directly impacts the initial multimodal features. If these encoders are not perfectly aligned with the recommendation domain, or if new, more advanced encoders emerge, retraining or updating these components would be necessary. The assumption is that these general-purpose encoders are sufficient for extracting domain-relevant features.
    2. Scalability of Trie for Constrained Decoding: While prefix tree (Trie)-based constrained decoding guarantees that only valid item code sequences are generated, the Trie can grow very large for an extremely vast item catalog combined with many codebook layers and a large codebook. The paper uses $M=4$ layers and $K=512$ codes. For a truly massive catalog (millions of items), the Trie may become memory-intensive and could slow prefix lookups, though this cost is likely minor compared to LLM generation itself. (A minimal sketch of Trie-constrained decoding is given at the end of this subsection.)
    3. Complexity of Two-Stage Training: Training involves two distinct stages (RQ-VAE and LLM fine-tuning), which can be more complex to manage and optimize than a single end-to-end model. The potential for error propagation from the tokenization stage to the generation stage exists.
    4. Interpretability of Tokens: While semantic tokens are generated, their direct interpretability for humans might still be limited compared to natural language descriptions. Understanding why a particular sequence of codes represents an item or why certain codes are generated might require further research into token semantics.
  • Transferability: The core idea of collaborative-guided multimodal fusion for item representation learning could be highly transferable to other domains beyond e-commerce or local businesses. For instance, in content recommendation (news, articles, videos), where rich multimodal content is available alongside user interactions, this framework could provide significant benefits. The tokenization and LLM generation pipeline is also quite general and could be adapted for generating other structured outputs in various applications.

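As noted above, here is a minimal sketch of one plausible instantiation of collaborative-guided fusion (an assumption for illustration, not the authors' architecture): the collaborative item embedding serves as the attention query, and the projected visual and textual features serve as keys and values, so the attention weights express how much each modality contributes to the fused representation.

```python
# Minimal sketch (assumed instantiation, not the authors' code) of
# collaborative-guided multimodal fusion: the collaborative embedding queries
# the visual and textual features via cross-attention.
import torch
import torch.nn as nn


class CollaborativeGuidedFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, collab_emb, visual_feat, text_feat):
        # collab_emb: (batch, dim) collaborative item embedding (the query).
        # visual_feat / text_feat: (batch, dim) projected encoder outputs.
        query = collab_emb.unsqueeze(1)                        # (batch, 1, dim)
        modalities = torch.stack([visual_feat, text_feat], 1)  # (batch, 2, dim)
        fused, weights = self.attn(query, modalities, modalities)
        # weights indicate how strongly each modality is attended for this item.
        return self.norm(fused.squeeze(1) + collab_emb), weights


# Toy usage with random features:
fusion = CollaborativeGuidedFusion()
c, v, t = (torch.randn(8, 256) for _ in range(3))
fused, w = fusion(c, v, t)
print(fused.shape, w.shape)   # torch.Size([8, 256]) torch.Size([8, 1, 2])
```

Other choices (e.g. a gating network instead of attention) would serve the same purpose; the key point is that the collaborative signal decides how the modalities are weighted.
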
    Overall, CEMG presents a robust and innovative framework that effectively merges multimodal learning, collaborative filtering, and generative AI to create a powerful and efficient recommender system. The demonstrated improvements on cold-start items are particularly promising for real-world applicability.
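
Finally, a hedged sketch of the Trie-constrained decoding concern raised above: each item is identified by a fixed-length sequence of $M=4$ codes from a $K=512$ codebook, and the Trie only exposes next codes that keep the partial sequence consistent with some real item. How such a lookup is wired into the LLM's decoding loop (e.g. via a prefix-constraint hook) is an implementation detail not reproduced here.

```python
# Hedged sketch of Trie-based constrained decoding over item code sequences.
# Each item maps to a fixed-length sequence of codes; the Trie only permits
# prefixes that extend to a real item in the catalog.


class CodeTrie:
    def __init__(self):
        self.root = {}

    def insert(self, code_seq):
        node = self.root
        for code in code_seq:
            node = node.setdefault(code, {})

    def allowed_next(self, prefix):
        node = self.root
        for code in prefix:
            node = node.get(code)
            if node is None:
                return []          # prefix does not correspond to any item
        return list(node.keys())   # codes that keep the sequence valid


# Toy catalog: three items, each a length-4 code sequence (values < 512).
trie = CodeTrie()
for item_codes in [(17, 3, 250, 8), (17, 3, 250, 9), (42, 511, 0, 1)]:
    trie.insert(item_codes)

print(trie.allowed_next(()))            # [17, 42]
print(trie.allowed_next((17, 3, 250)))  # [8, 9]
```

Memory grows roughly linearly with the number of items times the sequence length $M$, which is what makes very large catalogs a potential concern.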
