Multi-Aspect Cross-modal Quantization for Generative Recommendation
TL;DR Summary
This paper introduces the MACRec model for generative recommendation, integrating multimodal information to improve semantic ID quality. It employs cross-modal quantization to reduce conflict rates and combines implicit and explicit alignments, enhancing the generative model's performance on three benchmark recommendation datasets.
Abstract
Generative Recommendation (GR) has emerged as a new paradigm in recommender systems. This approach relies on quantized representations to discretize item features, modeling users' historical interactions as sequences of discrete tokens. Based on these tokenized sequences, GR predicts the next item by employing next-token prediction methods. The challenges of GR lie in constructing high-quality semantic identifiers (IDs) that are hierarchically organized, minimally conflicting, and conducive to effective generative model training. However, current approaches remain limited in their ability to harness multimodal information and to capture the deep and intricate interactions among diverse modalities, both of which are essential for learning high-quality semantic IDs and for effectively training GR models. To address this, we propose Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), which introduces multimodal information and incorporates it into both semantic ID learning and generative model training from different aspects. Specifically, we first introduce cross-modal quantization during the ID learning process, which effectively reduces conflict rates and thus improves codebook usability through the complementary integration of multimodal information. In addition, to further enhance the generative ability of our GR model, we incorporate multi-aspect cross-modal alignments, including the implicit and explicit alignments. Finally, we conduct extensive experiments on three well-known recommendation datasets to demonstrate the effectiveness of our proposed method.
1. Bibliographic Information
1.1. Title
The central topic of the paper is Multi-Aspect Cross-modal Quantization for Generative Recommendation.
1.2. Authors
The authors and their affiliations are:
- Fuwei Zhang (Institute of Artificial Intelligence, Beihang University)
- Xiaoyu Liu (Institute of Artificial Intelligence, Beihang University)
- Dongbo Xi (Meituan)
- Jishen Yin (Meituan)
- Huan Chen (Meituan)
- Peng Yan (Meituan)
- Fuzhen Zhuang (Institute of Artificial Intelligence, Beihang University; SKLCCSE, School of Computer Science and Engineering, Beihang University)
- Zhao Zhang (SKLCCSE, School of Computer Science and Engineering, Beihang University)
1.3. Journal/Conference
This paper is a preprint, published on arXiv, as indicated by the "Published at (UTC): 2025-11-19T04:55:14.000Z" and the arxiv.org links. arXiv is a widely recognized open-access preprint server for research articles in fields like computer science, mathematics, and physics. It plays a crucial role in rapidly disseminating research findings and facilitating early peer review before formal publication in journals or conferences.
1.4. Publication Year
2025
1.5. Abstract
Generative Recommendation (GR) has emerged as a new paradigm in recommender systems. This approach relies on quantized representations to discretize item features, modeling users' historical interactions as sequences of discrete tokens. Based on these tokenized sequences, GR predicts the next item by employing next-token prediction methods. The challenges of GR lie in constructing high-quality semantic identifiers (IDs) that are hierarchically organized, minimally conflicting, and conducive to effective generative model training. However, current approaches remain limited in their ability to harness multimodal information and to capture the deep and intricate interactions among diverse modalities, both of which are essential for learning high-quality semantic IDs and for effectively training GR models. To address this, the paper proposes Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), which introduces multimodal information and incorporates it into both semantic ID learning and generative model training from different aspects. Specifically, MACRec first introduces cross-modal quantization during the ID learning process, which effectively reduces conflict rates and thus improves codebook usability through the complementary integration of multimodal information. In addition, to further enhance the generative ability of the GR model, it incorporates multi-aspect cross-modal alignments, including the implicit and explicit alignments. Finally, extensive experiments on three well-known recommendation datasets demonstrate the effectiveness of the proposed method.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2511.15122v1
- PDF Link: https://arxiv.org/pdf/2511.15122v1.pdf
- Publication Status: Preprint (version 1) on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the limitation of current Generative Recommendation (GR) approaches in effectively utilizing multimodal information to create high-quality semantic identifiers (IDs) and train robust GR models.
This problem is important because GR is a rapidly evolving paradigm in recommender systems, reformulating recommendation as a next-token prediction task. The success of GR heavily relies on the quality of these discrete semantic IDs, which should be hierarchically organized, minimize conflicts, and facilitate generative model training. However, existing methods often rely on single-modality (textual) embeddings for ID generation, leading to limited semantic discriminability and potential semantic loss in deeper hierarchical structures. For instance, text-based embeddings might group items by brand, overlooking functional differences that image-based embeddings could capture (e.g., different types of instruments from the same brand). Current multimodal GR models also tend to encode modalities separately without deep cross-modal interaction during quantization, leading to hierarchical semantic loss and suboptimal use of complementary information.
The paper's entry point is to explicitly introduce multimodal information and cross-modal interactions at two critical stages: semantic ID learning and generative model training, to overcome the limitations of single-modality or weakly integrated multimodality in existing GR systems.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Novel Cross-modal Quantization Method: MACRec proposes a new cross-modal quantization method that integrates contrastive learning into residual quantization (RQ) and reconstruction. This approach leverages multimodal information to learn hierarchically meaningful semantic IDs for items, addressing the issue of semantic loss and improving codebook usability.
- Multi-Aspect Alignment Strategies for GR Training: To enable the generative model to learn common features from different modalities and enhance its understanding of semantic IDs, MACRec employs multi-aspect alignment strategies. These include implicit alignment in the latent space through contrastive methods, and explicit alignment within the generative task (e.g., predicting visual IDs from text IDs and vice versa, or sequence-level predictions).
- Extensive Experimental Validation: The paper conducts extensive experiments on three widely used recommendation datasets (Musical Instruments; Arts, Crafts and Sewing; Video Games). The findings demonstrate that MACRec significantly outperforms state-of-the-art Generative Recommendation (GR) models, showcasing its effectiveness.
- Improved Codebook Utilization and Reduced Collision Rates: Analysis shows that MACRec effectively reduces the item collision rate during the quantization process and achieves a more balanced code assignment distribution, indicating better utilization of the codebook capacity and superior semantic representation.

These contributions collectively aim to solve the problem of learning high-quality, semantically rich, and non-conflicting item semantic IDs while ensuring effective generative model training by deeply integrating multimodal information from various aspects.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following key concepts:
- Recommender Systems: Systems that suggest items (e.g., products, movies, articles) to users based on their preferences and past behavior. They address information overload by personalizing content.
- Generative Recommendation (GR): A new paradigm in recommender systems that reframes the recommendation task as a sequence generation problem. Instead of predicting a score or ranking items, GR models learn to "generate" the semantic identifier (ID) of the next item a user might interact with, given their historical sequence of interactions. This often leverages techniques from Large Language Models (LLMs).
- Quantized Representations / Discretization: The process of converting continuous data (like numerical embeddings) into a finite set of discrete values or "tokens." In GR, item features (e.g., text descriptions, images) are converted into discrete semantic IDs to enable sequence modeling, similar to how words are treated as tokens in natural language processing.
- Semantic Identifiers (IDs) / Tokens: Discrete, symbolic representations assigned to items after quantization. Each semantic ID is intended to capture the underlying meaning or characteristics of an item. In GR, user interaction histories become sequences of these item semantic IDs.
- Next-Token Prediction: A core task in generative models, especially LLMs. Given a sequence of tokens, the model predicts the most probable next token. In GR, this translates to predicting the semantic ID of the next item in a user's interaction history.
- Residual Quantized Variational AutoEncoder (RQ-VAE): A key component for learning discrete semantic IDs.
  - Vector Quantization (VQ): A method to discretize continuous vectors. It maintains a "codebook" (a set of learnable representative vectors, called "codewords"). Given a continuous input vector, VQ finds the closest codeword in the codebook and replaces the input vector with that codeword, effectively "quantizing" the input.
  - Residual Quantization (RQ): An extension of VQ that improves its efficiency and expressive power. Instead of quantizing the entire vector in one go, RQ applies multiple VQ layers sequentially. Each layer quantizes the residual (the error or remaining information) from the previous layer's quantization. This yields a hierarchical representation that captures progressively finer details: the first layer captures broad features, and subsequent layers refine them.
  - Variational AutoEncoder (VAE): A type of generative model that learns a probabilistic mapping from input data to a latent space and then reconstructs the input from this latent representation. RQ-VAE combines the quantization process with the VAE framework, learning to encode and decode item features through discrete codewords while minimizing reconstruction error.
- Multimodal Information / Modalities: Data that comes from different sources or forms, such as text (item descriptions, reviews) and images (product photos). Multimodal approaches aim to leverage the complementary nature of these different data types to gain a richer understanding than any single modality alone.
- Contrastive Learning: A self-supervised learning paradigm where the model learns representations by contrasting similar and dissimilar examples. It aims to pull "positive pairs" (e.g., different views of the same item, or items belonging to the same semantic cluster) closer in the embedding space while pushing "negative pairs" (unrelated items) further apart.
  - InfoNCE Loss (Noise-Contrastive Estimation): A common loss function used in contrastive learning. It encourages the embedding of an anchor sample to be similar to its positive samples and dissimilar to its negative samples:
    $ \mathcal{L}_{\mathrm{InfoNCE}}(q, \{k_+\}, \{k_-\}) = - \log \frac{\exp(\mathrm{sim}(q, k_+) / \tau)}{\sum_{k \in \{k_+\} \cup \{k_-\}} \exp(\mathrm{sim}(q, k) / \tau)} $
    where:
    - $q$ is the query (anchor) embedding.
    - $k_+$ is a positive sample embedding.
    - $\{k_+\} \cup \{k_-\}$ is the set of all sample embeddings (positive and negative) in the batch.
    - $\mathrm{sim}(\cdot, \cdot)$ is a similarity function, often cosine similarity or dot product.
    - $\tau$ is a temperature hyperparameter that scales the logits; a smaller $\tau$ makes the distribution sharper, emphasizing larger similarities.
    This loss maximizes the agreement between $q$ and $k_+$ relative to the other samples in the batch (see the sketch after this list).
- Sequence-to-Sequence (Seq2Seq) Models: An encoder-decoder architecture commonly used for tasks involving sequences, such as machine translation or text summarization. An encoder processes the input sequence into a latent representation, and a decoder generates an output sequence from this latent representation. Transformer architectures (like T5) are popular Seq2Seq models.
- K-means Clustering: An unsupervised learning algorithm that partitions observations into clusters, where each observation belongs to the cluster with the nearest mean (centroid). It is used here to generate pseudo-labels based on feature similarity.
- LLaMA (Large Language Model Meta AI): A family of open-source large language models developed by Meta AI, used in the paper for extracting textual embeddings.
- ViT (Vision Transformer): A Transformer model applied to image recognition. It treats images as sequences of image patches, which are processed by a standard Transformer encoder. Used in the paper for extracting visual embeddings.
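To make the InfoNCE objective above concrete, here is a minimal PyTorch sketch of an in-batch InfoNCE loss. The batch size, embedding dimension, and temperature are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(queries: torch.Tensor, keys: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """In-batch InfoNCE: queries[i] and keys[i] form a positive pair;
    every other key in the batch serves as a negative for queries[i]."""
    q = F.normalize(queries, dim=-1)   # cosine similarity via normalized dot products
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / tau           # (B, B) similarity matrix, scaled by temperature
    labels = torch.arange(q.size(0), device=q.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Illustrative usage with random embeddings (batch size 8, dimension 64 are arbitrary).
loss = info_nce(torch.randn(8, 64), torch.randn(8, 64), tau=0.1)
```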
3.2. Previous Works
The paper contextualizes its work by discussing various recommender system paradigms:
- Sequential Recommendation: Focuses on modeling user behavior sequences to capture dynamic preferences.
  - Early approaches: GRU4Rec (Hidasi et al. 2015) used Gated Recurrent Units (GRUs) to model sessions. STAMP (Liu et al. 2018) introduced attention mechanisms for short-term preferences. NARM (Li et al. 2017) also used attention for session-based recommendation. These models primarily relied on interaction data (which items were interacted with).
  - Attention-based models: SASRec (Kang and McAuley 2018) brought the Transformer architecture to sequential recommendation, effectively modeling long-range dependencies.
  - Pretrained Language Models (PLMs): BERT4Rec (Sun et al. 2019) adapted BERT for recommendation by using masked item prediction, significantly advancing performance through self-supervised pretraining.
  - Prompt-based methods: P5 (Geng et al. 2022) and M6-Rec (Cui et al. 2022) reformulated recommendation tasks as language modeling problems, enhancing generalization and flexibility through prompts.
- Multi-modal Sequential Recommendation: Enriches sequential representations by incorporating item modalities beyond interaction IDs (e.g., text, images).
  - Deep and graph neural networks such as MMGCN (Wei et al. 2019) and GRCN (Wei et al. 2020) integrate heterogeneous features. Contrastive learning and multimodal pretraining methods such as MMGCL (Yi et al. 2022) and MISSRec (Wang et al. 2023a) further strengthen user interest modeling. VIP5 (Geng et al. 2023) extended prompt-based techniques to multimodal settings.
- Generative Recommendation (GR): This is where MACRec directly positions itself. TIGER (Rajput et al. 2023) was an early work that discretized item sequences into tokens, enabling the generative recommendation paradigm. LC-Rec (Zheng et al. 2024) utilized LLMs' natural language understanding for diverse task-specific fine-tuning. LETTER (Wang et al. 2024) extended TIGER by introducing collaborative filtering embeddings and an additional loss function to improve codebook utilization. Among multimodal GR works, MMGRec (Liu et al. 2024a) used a Graph RQ-VAE to generate item representations by integrating multimodal features with collaborative signals, and MQL4GRec (Zhai et al. 2025), the most direct state-of-the-art baseline, encodes multimodal and cross-domain item information into a unified quantized language.
3.3. Technological Evolution
The field of recommender systems has evolved from basic collaborative filtering and matrix factorization to sophisticated deep learning models. This evolution can be broadly categorized:
- Traditional Collaborative Filtering / Content-Based: Early systems relied on user-item interaction matrices or item metadata.
- Sequential Models: Recognizing the temporal nature of user preferences, models like GRU4Rec and SASRec began to model user histories as sequences.
- Deep Learning Integration: The rise of neural networks led to more powerful representation learning for items and users.
- Pre-trained Models (PLMs/LLMs): Adapting large pre-trained models from NLP (BERT4Rec, P5) and vision (ViT) significantly boosted performance by leveraging rich pre-learned knowledge.
- Multimodal Integration: Moving beyond single-modality data, researchers started incorporating various modalities (text, images, audio, etc.) to capture richer item semantics and user preferences.
- Generative Paradigm: The latest shift, inspired by LLMs, where recommendation becomes a generation task (next-token prediction) rather than a classification or ranking task. This approach offers flexibility and richer contextual understanding.

MACRec fits into this latest stage, building upon multimodal generative recommendation. It aims to refine how multimodal information is processed within the quantization step and during generative model training, addressing the shortcomings of previous multimodal GR efforts.
3.4. Differentiation Analysis
Compared to the main methods in related work, MACRec's core differences and innovations are:
- Holistic Cross-modal Integration during Quantization: Existing multimodal GR models (e.g., MQL4GRec) typically encode each modality separately to obtain modality-specific semantic IDs, without considering cross-modal interactions during the quantization process itself. MACRec introduces cross-modal contrastive learning directly into each layer of residual quantization, actively forcing interaction and alignment between textual and visual residuals. This is a significant improvement over independent quantization, reducing semantic loss and codebook collapse.
- Multi-Aspect Cross-modal Alignment in GR Training: While some multimodal GR models use multimodal features, MACRec further enhances the generative model's ability by incorporating both implicit and explicit alignment mechanisms during GR training. Implicit alignment (contrastive learning) in the latent space of the generative model keeps semantic IDs of the same item from different modalities close. Explicit alignment introduces auxiliary generative tasks (item-level: text ID to visual ID; sequence-level: textual sequence to next visual ID) that directly encourage the generative model to learn cross-modal relationships.
- Improved Semantic ID Quality and Codebook Utilization: By integrating cross-modal contrastive learning early in the ID learning process and then aligning representations during reconstruction, MACRec achieves semantic IDs that are more discriminative and hierarchically meaningful, with lower collision rates (fewer items mapping to the same ID) and better codebook utilization than baselines like MQL4GRec.
- Comprehensive Approach: MACRec offers a more comprehensive framework by addressing multimodal integration at two crucial stages (ID learning and GR training) and from multiple aspects (quantization, reconstruction, implicit alignment, explicit alignment), leading to superior recommendation performance.
4. Methodology
4.1. Principles
The core idea behind MACRec is to enhance Generative Recommendation (GR) by deeply integrating multimodal information throughout the entire process, from semantic ID learning to generative model training. The theoretical basis is that different modalities (e.g., text and images) capture complementary aspects of an item, and by fostering cross-modal interactions at various stages, we can:
- Construct High-Quality Semantic IDs: Overcome the limitations of single-modality quantization by using cross-modal contrastive learning to make semantic IDs more discriminative, less prone to conflicts, and better utilized within the codebook.
- Improve Generative Model Understanding: Train the GR model to inherently understand and leverage the shared and complementary information across modalities through explicit and implicit alignment tasks, thereby improving its ability to predict the next item.

The intuition is that if a model understands an item from both its textual description and its visual appearance, it will have a much richer and more robust representation, leading to more accurate and diverse recommendations.
4.2. Core Methodology In-depth (Layer by Layer)
The MACRec framework is organized into two main modules: cross-modal item quantization for generating discrete semantic IDs, and the training phase of the GR model with multi-aspect alignment. The overall architecture is illustrated in Figure 2.
The following figure (Figure 2 from the original paper) illustrates the overall architecture of MACRec:
(Figure 2, translated caption) The figure is a schematic showing multi-aspect cross-modal quantization applied to generative recommendation. It depicts the cross-modal item quantization and multi-aspect alignment processes, including the encoding of textual and visual features, pseudo-label generation, the quantization procedure, and the alignment mechanisms. By contrasting implicit and explicit alignment, it illustrates how multimodal information is leveraged to improve the generative recommendation model.
As shown in Figure 2, the process begins with multimodal item information (text and image). This information goes through cross-modal item quantization to produce discrete semantic IDs. These IDs are then used to construct Seq2Seq training data for the Generative Recommender (GR) model, which is trained with multi-aspect alignment (including implicit and explicit alignments) to perform next-token prediction.
4.2.1. Cross-modal Item Quantization
The goal of this module is to generate high-quality discrete semantic IDs for items by effectively integrating multimodal information during the quantization process.
4.2.1.1. Dual-modality Pseudo-label Generation
To enable contrastive learning across modalities, the first step involves generating pseudo-labels. For each item $i$, its text and visual information are encoded into continuous embeddings using pre-trained models.

- The text information is encoded into an embedding $\mathbf{t}_i$ using an open-source large language model (e.g., LLaMA).
- The visual content of the item's image is encoded into an embedding $\mathbf{v}_i$ using a Vision Transformer (ViT).

Subsequently, K-means clustering is performed independently on the textual and visual embeddings to partition them into clusters. These cluster assignments serve as pseudo-labels for each modality.

The clustering process is formulated as:
$ \mathcal{C}_{\mathrm{text}} = \mathrm{KMeans}(\{ \mathbf{t}_i \}_{i=1}^N) $
$ \mathcal{C}_{\mathrm{vision}} = \mathrm{KMeans}(\{ \mathbf{v}_i \}_{i=1}^N) $
where:
- $\mathcal{C}_{\mathrm{text}}$ denotes the resulting cluster assignments (pseudo-labels) for the text modality.
- $\mathcal{C}_{\mathrm{vision}}$ denotes the resulting cluster assignments (pseudo-labels) for the vision modality.
- $\{ \mathbf{t}_i \}_{i=1}^N$ is the set of all text embeddings and $\{ \mathbf{v}_i \}_{i=1}^N$ the set of all visual embeddings.
- $\mathrm{KMeans}(\cdot)$ is the K-means clustering function.
- $N$ is the total number of items.
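As a concrete illustration, the following is a minimal sketch of dual-modality pseudo-label generation with scikit-learn's KMeans. The number of clusters (512) follows the implementation details reported later; the embedding arrays and their dimensions are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder embeddings: N items, with LLaMA text embeddings and ViT visual embeddings.
N = 10000
text_emb = np.random.randn(N, 4096).astype(np.float32)    # assumed text embedding dimension
vision_emb = np.random.randn(N, 1024).astype(np.float32)  # assumed visual embedding dimension

# Cluster each modality independently; the cluster indices act as pseudo-labels.
K = 512  # number of K-means clusters reported in the implementation details
text_labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(text_emb)
vision_labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(vision_emb)

# text_labels[i] and vision_labels[i] are the text/visual pseudo-labels of item i,
# later used to select positive pairs for cross-modal contrastive learning.
```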
4.2.1.2. Cross-modal Quantization with Contrastive Learning
The core quantization mechanism is based on Residual-Quantized Variational AutoEncoder (RQ-VAE). RQ-VAE uses multi-layer vector quantization (VQ) where each layer quantizes the residuals from the previous layer.
In MACRec, for a given item, the text embedding $\mathbf{t}$ and visual embedding $\mathbf{v}$ are first processed by an encoder (a multi-layer perceptron, MLP) to obtain latent representations:
- Text latent representation: $\mathbf{z}^t = \mathrm{Encoder}(\mathbf{t})$
- Visual latent representation: $\mathbf{z}^v = \mathrm{Encoder}(\mathbf{v})$

These latent representations serve as the initial residuals for the first VQ layer: $\mathbf{r}_0^t = \mathbf{z}^t$ and $\mathbf{r}_0^v = \mathbf{z}^v$.
At the $l$-th layer of RQ, each modality has its own learnable codebook ($\{\mathbf{e}_{l,k}^t\}_{k=1}^K$ for text and $\{\mathbf{e}_{l,k}^v\}_{k=1}^K$ for vision). The residual of a given modality at layer $l$ is quantized by finding the closest codeword in its respective codebook.

The selection of the closest codeword is given by:
$ c_l^t = \arg\min_k \| \mathbf{r}_l^t - \mathbf{e}_{l,k}^t \|_2 $
$ c_l^v = \arg\min_k \| \mathbf{r}_l^v - \mathbf{e}_{l,k}^v \|_2 $
where:
- $c_l^t$ is the index of the selected codeword for the text modality at layer $l$.
- $c_l^v$ is the index of the selected codeword for the visual modality at layer $l$.
- $\mathbf{r}_l^t$ and $\mathbf{r}_l^v$ are the residual vectors for the text and visual modalities at layer $l$.
- $\mathbf{e}_{l,k}^t$ and $\mathbf{e}_{l,k}^v$ are the $k$-th codewords in the text and visual codebooks at layer $l$.
- $\| \cdot \|_2$ denotes the Euclidean (L2) norm.
- $K$ is the size of the codebook.

After selecting the codeword, the residual for the next layer is calculated by subtracting the selected codeword from the current residual:
$ \mathbf{r}_{l+1}^t = \mathbf{r}_l^t - \mathbf{e}_{l,c_l^t}^t $
$ \mathbf{r}_{l+1}^v = \mathbf{r}_l^v - \mathbf{e}_{l,c_l^v}^v $
where $\mathbf{r}_{l+1}^t$ and $\mathbf{r}_{l+1}^v$ are the residual vectors passed to the $(l+1)$-th layer for text and vision, respectively.
To address the limitations of independent quantization (potential codebook collapse and underutilization of cross-modal complementarity), MACRec introduces cross-modal contrastive learning at each RQ layer, leveraging the multimodal pseudo-labels generated earlier. Specifically, visual pseudo-labels enhance textual residual representations, and textual pseudo-labels optimize visual residual representations.

The InfoNCE losses for the $l$-th layer are defined as:
$ \mathcal{L}_{\mathrm{con}}^{l, v \to t} = - \frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left( \langle \mathbf{r}_i^t, \mathbf{r}_{i,pos}^t \rangle / \tau \right)}{\sum_{j=1}^B \exp \left( \langle \mathbf{r}_i^t, \mathbf{r}_j^t \rangle / \tau \right)} $
$ \mathcal{L}_{\mathrm{con}}^{l, t \to v} = - \frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left( \langle \mathbf{r}_i^v, \mathbf{r}_{i,pos}^v \rangle / \tau \right)}{\sum_{j=1}^B \exp \left( \langle \mathbf{r}_i^v, \mathbf{r}_j^v \rangle / \tau \right)} $
$ \mathcal{L}_{\mathrm{con}}^{l} = \mathcal{L}_{\mathrm{con}}^{l, t \to v} + \mathcal{L}_{\mathrm{con}}^{l, v \to t} $
where:
- $\mathcal{L}_{\mathrm{con}}^{l, v \to t}$ is the contrastive loss for text residuals at layer $l$, guided by visual pseudo-labels.
- $\mathcal{L}_{\mathrm{con}}^{l, t \to v}$ is the contrastive loss for visual residuals at layer $l$, guided by textual pseudo-labels.
- $B$ is the batch size and $i$ indexes an item in the batch.
- $\mathbf{r}_i^t$ and $\mathbf{r}_i^v$ are the text and visual residuals of item $i$ at layer $l$.
- $\mathbf{r}_{i,pos}^t$ is a positive sample for the text residual of item $i$: a text residual from another item in the batch that shares the same visual pseudo-label as item $i$.
- $\mathbf{r}_{i,pos}^v$ is a positive sample for the visual residual of item $i$: a visual residual from another item in the batch that shares the same textual pseudo-label as item $i$.
- $\langle \cdot, \cdot \rangle$ denotes the inner product (dot product), used as a similarity measure.
- $\tau$ is the temperature parameter for the contrastive loss.
- The denominator sums over all items $j$ in the batch, which act as negative samples for the anchor when they do not share the same pseudo-label.
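The following is a minimal PyTorch sketch of a single residual quantization step together with a pseudo-label-guided contrastive term. It assumes the codebook is a plain embedding matrix and treats all in-batch items sharing the other modality's pseudo-label as positives (a supervised-contrastive style variant that averages over all positives); it illustrates the mechanism rather than reproducing the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rq_step(residual, codebook):
    """One VQ layer: pick the nearest codeword, return its index, the codeword, and the next residual."""
    dists = torch.cdist(residual, codebook)        # (B, K) Euclidean distances
    idx = dists.argmin(dim=-1)                     # selected codeword index per item
    quantized = codebook[idx]                      # (B, d) selected codewords
    return idx, quantized, residual - quantized    # next-layer residual

def cross_modal_con_loss(text_residuals, vision_pseudo_labels, tau=0.1):
    """Contrastive term on text residuals whose positives are in-batch items
    sharing the same *visual* pseudo-label."""
    z = F.normalize(text_residuals, dim=-1)
    logits = z @ z.t() / tau
    same = vision_pseudo_labels[:, None] == vision_pseudo_labels[None, :]
    pos = (same & ~torch.eye(z.size(0), dtype=torch.bool, device=z.device)).float()
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    has_pos = pos.sum(-1) > 0                      # anchors with at least one positive
    per_anchor = -(log_prob * pos).sum(-1)[has_pos] / pos.sum(-1)[has_pos]
    return per_anchor.mean()
```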
4.2.1.3. Cross-modal Reconstruction Alignment
After quantization, the discrete semantic IDs are represented by summing the corresponding codeword vectors from each layer. For $L$ layers of codebooks, the quantized representations for an item are:
$ \hat{\mathbf{z}}^t = \sum_{l=0}^{L-1} \mathbf{e}_{l,c_l^t}^t $
$ \hat{\mathbf{z}}^v = \sum_{l=0}^{L-1} \mathbf{e}_{l,c_l^v}^v $
where:
- $\hat{\mathbf{z}}^t$ is the final quantized embedding for the text modality.
- $\hat{\mathbf{z}}^v$ is the final quantized embedding for the visual modality.
- $\mathbf{e}_{l,c_l^t}^t$ and $\mathbf{e}_{l,c_l^v}^v$ are the codewords selected for text and vision at layer $l$.
- $L$ is the total number of RQ layers.
To further refine codebook representations and balance codebook utilization, MACRec introduces another alignment loss based on contrastive learning. This loss encourages bidirectional alignment between the quantized representations of different modalities for the same item.

The alignment loss is formulated as:
$ \mathcal{L}_{\mathrm{align}}^{t \to v} = - \frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left( \langle \hat{\mathbf{z}}_i^t, \hat{\mathbf{z}}_i^v \rangle / \tau \right)}{\sum_{j=1}^B \exp \left( \langle \hat{\mathbf{z}}_i^t, \hat{\mathbf{z}}_j^v \rangle / \tau \right)} $
$ \mathcal{L}_{\mathrm{align}}^{v \to t} = - \frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left( \langle \hat{\mathbf{z}}_i^v, \hat{\mathbf{z}}_i^t \rangle / \tau \right)}{\sum_{j=1}^B \exp \left( \langle \hat{\mathbf{z}}_i^v, \hat{\mathbf{z}}_j^t \rangle / \tau \right)} $
$ \mathcal{L}_{\mathrm{align}} = \mathcal{L}_{\mathrm{align}}^{t \to v} + \mathcal{L}_{\mathrm{align}}^{v \to t} $
where:
- $\mathcal{L}_{\mathrm{align}}^{t \to v}$ is the alignment loss from text to vision.
- $\mathcal{L}_{\mathrm{align}}^{v \to t}$ is the alignment loss from vision to text.
- $\hat{\mathbf{z}}_i^t$ and $\hat{\mathbf{z}}_i^v$ are the quantized embeddings for the text and vision modalities of item $i$.
- The other symbols ($B$, $\langle \cdot, \cdot \rangle$, $\tau$) are as defined for the InfoNCE loss.

This loss ensures that the quantized representations of the same item across modalities are similar.

Similar to the RQ-VAE architecture, the quantized representations are decoded and reconstructed separately for each modality:
- Decoded textual embedding: $\hat{\mathbf{t}} = \mathrm{Decoder}(\hat{\mathbf{z}}^t)$
- Decoded visual embedding: $\hat{\mathbf{v}} = \mathrm{Decoder}(\hat{\mathbf{z}}^v)$

The reconstruction losses are calculated as the squared L2 norm between the original and decoded embeddings:
$ \mathcal{L}_{\mathrm{recon}}^t = \| \mathbf{t} - \hat{\mathbf{t}} \|_2^2 $
$ \mathcal{L}_{\mathrm{recon}}^v = \| \mathbf{v} - \hat{\mathbf{v}} \|_2^2 $
where:
- $\mathcal{L}_{\mathrm{recon}}^t$ and $\mathcal{L}_{\mathrm{recon}}^v$ are the reconstruction losses for the text and visual modalities.
- $\mathbf{t}$ and $\mathbf{v}$ are the original text and visual embeddings.
- $\hat{\mathbf{t}}$ and $\hat{\mathbf{v}}$ are the reconstructed text and visual embeddings.

The RQ-VAE training also includes a quantization loss (often called the codebook or commitment loss) to ensure the codebook vectors are updated appropriately and the encoder output commits to the codebook. It is applied to each modality $m \in \{t, v\}$:
$ \mathcal{L}_{\mathrm{rq}}^m = \sum_{l=0}^{L-1} \left( \| \mathrm{sg}[ \mathbf{r}_l^m ] - \mathbf{e}_{l,c_l^m}^m \|_2^2 + \alpha \| \mathbf{r}_l^m - \mathrm{sg}[ \mathbf{e}_{l,c_l^m}^m ] \|_2^2 \right) $
where:
- $\mathcal{L}_{\mathrm{rq}}^m$ is the residual quantization loss for modality $m$.
- $\mathrm{sg}[\cdot]$ is the stop-gradient operation, which prevents gradients from flowing through its argument.
- The first term updates the codebook vectors ($\mathbf{e}_{l,c_l^m}^m$) to move towards the encoder's residuals.
- The second term is a commitment loss that pulls the encoder's output ($\mathbf{r}_l^m$) closer to the codebook vector (by passing gradients to the encoder).
- $\alpha$ is a loss coefficient (hyperparameter).
- Superscript $m$ denotes the modality (text $t$ or vision $v$).

The total RQ-VAE loss combines the reconstruction and quantization losses for both modalities:
$ \mathcal{L}_{\mathrm{RQ\text{-}VAE}} = \mathcal{L}_{\mathrm{recon}}^t + \mathcal{L}_{\mathrm{recon}}^v + \mathcal{L}_{\mathrm{rq}}^t + \mathcal{L}_{\mathrm{rq}}^v $
Finally, the overall training objective for learning the semantic identifiers (IDs), denoted $\mathcal{L}_{\mathrm{ID}}$, integrates all these components:
$ \mathcal{L}_{\mathrm{ID}} = \mathcal{L}_{\mathrm{RQ\text{-}VAE}} + \lambda_{\mathrm{con}}^l \sum_{l=0}^{L-1} \mathcal{L}_{\mathrm{con}}^l + \lambda_{\mathrm{align}} \mathcal{L}_{\mathrm{align}} $
where:
- $\mathcal{L}_{\mathrm{ID}}$ is the total loss for learning semantic IDs.
- $\mathcal{L}_{\mathrm{RQ\text{-}VAE}}$ is the combined reconstruction and RQ loss.
- $\lambda_{\mathrm{con}}^l$ and $\lambda_{\mathrm{align}}$ are trade-off hyperparameters that balance the contribution of the cross-modal contrastive loss (summed over all layers) and the cross-modal reconstruction alignment loss, respectively.

For cases where conflicts occur among item IDs (multiple items mapping to the same ID), MACRec adopts the same conflict resolution strategy as Zhai et al., which reassigns codewords based on the distance between items and the codebook.
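Putting these pieces together, here is a hedged sketch of how the quantization loss (with stop-gradient) and the total ID-learning objective could be assembled. The commitment coefficient and the contrastive weight are placeholders; only the alignment weight (0.001) matches the reported implementation details.

```python
import torch

def rq_vae_quant_loss(residual, quantized, alpha=0.25):
    """Codebook + commitment loss for one RQ layer, using detach() as stop-gradient.
    alpha = 0.25 is a common default, not a value from the paper."""
    codebook_loss = ((residual.detach() - quantized) ** 2).sum(-1).mean()
    commit_loss = ((residual - quantized.detach()) ** 2).sum(-1).mean()
    return codebook_loss + alpha * commit_loss

def total_id_loss(recon_t, recon_v, rq_t, rq_v, con_per_layer, align,
                  lambda_con=0.001, lambda_align=0.001):
    """L_ID = L_RQ-VAE + lambda_con * sum_l L_con^l + lambda_align * L_align."""
    rq_vae = recon_t + recon_v + rq_t + rq_v
    return rq_vae + lambda_con * sum(con_per_layer) + lambda_align * align
```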
4.2.2. Generative Recommendation with Multi-aspect Alignment
Once the RQ-VAE model is trained, it provides discrete semantic IDs for both text and images, each a short sequence of codeword indices (one per RQ layer). These sequences of semantic IDs are then used to construct Seq2Seq training data for the Generative Recommender (GR) model, which is typically a Transformer-based encoder-decoder architecture (like T5). The GR model is trained for next-token prediction. To further optimize information sharing and interaction across different modalities during this GR training phase, MACRec designs implicit alignment and explicit alignment mechanisms.
4.2.2.1. Implicit Alignment for Cross-modal Semantic IDs
This mechanism aims to ensure that the GR model recognizes the commonality between semantic IDs of different modalities that belong to the same item. It aligns them at the latent space level after encoding.
Specifically, the textual semantic ID (t-sid) and visual semantic ID (v-sid) of an item are encoded into latent representations using the encoder of the GR model. Mean Pooling is applied to obtain a single vector representation for the sequence.
The encoding process is:
$ \mathbf{e}^t = \mathrm{MeanPool}(\mathrm{T5\text{-}Encoder}(t\text{-}sid)) $
$ \mathbf{e}^v = \mathrm{MeanPool}(\mathrm{T5\text{-}Encoder}(v\text{-}sid)) $
where:
- $\mathbf{e}^t$ is the pooled latent representation for the textual semantic ID.
- $\mathbf{e}^v$ is the pooled latent representation for the visual semantic ID.
- $\mathrm{T5\text{-}Encoder}$ is the encoder component of the T5 model (used as the GR backbone).
- $t\text{-}sid$ and $v\text{-}sid$ are the textual and visual semantic ID sequences of an item.
- $\mathrm{MeanPool}(\cdot)$ averages the token embeddings from the encoder's output to obtain a single vector.

These latent representations are then aligned using contrastive learning, similar to the InfoNCE loss:
$ \mathcal{L}_{\mathrm{implicit}}^{t \to v} = - \frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left( \langle \mathbf{e}_i^t, \mathbf{e}_i^v \rangle / \tau \right)}{\sum_{j=1}^B \exp \left( \langle \mathbf{e}_i^t, \mathbf{e}_j^v \rangle / \tau \right)} $
$ \mathcal{L}_{\mathrm{implicit}}^{v \to t} = - \frac{1}{B} \sum_{i=1}^B \log \frac{\exp \left( \langle \mathbf{e}_i^v, \mathbf{e}_i^t \rangle / \tau \right)}{\sum_{j=1}^B \exp \left( \langle \mathbf{e}_i^v, \mathbf{e}_j^t \rangle / \tau \right)} $
$ \mathcal{L}_{\mathrm{implicit}} = \mathcal{L}_{\mathrm{implicit}}^{t \to v} + \mathcal{L}_{\mathrm{implicit}}^{v \to t} $
where:
- $\mathcal{L}_{\mathrm{implicit}}^{t \to v}$ and $\mathcal{L}_{\mathrm{implicit}}^{v \to t}$ are the bidirectional implicit alignment losses in the latent space.
- $\mathbf{e}_i^t$ and $\mathbf{e}_i^v$ are the latent representations of the textual and visual semantic IDs of item $i$.
- $B$, $\langle \cdot, \cdot \rangle$, and $\tau$ are the batch size, inner product, and temperature parameter, respectively.

This loss pulls the latent representations of an item's text and image semantic IDs closer together.
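A minimal sketch of the mask-aware mean pooling over T5 encoder outputs is shown below, using the Hugging Face `transformers` T5EncoderModel. The "t5-small" checkpoint is purely illustrative, and in practice the semantic-ID tokens would come from an extended vocabulary rather than the stock tokenizer.

```python
import torch
from transformers import T5EncoderModel

def mean_pool(hidden_states, attention_mask):
    """Average the encoder token states, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-6)

encoder = T5EncoderModel.from_pretrained("t5-small")  # illustrative backbone choice

def encode_sid(sid_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """sid_ids: (B, len) tensor of semantic-ID token ids; returns (B, d) pooled vectors."""
    out = encoder(input_ids=sid_ids, attention_mask=attention_mask)
    return mean_pool(out.last_hidden_state, attention_mask)
```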
4.2.2.2. Explicit Alignment with Different Generation Tasks
Inspired by prior work (Zhai et al.), MACRec also proposes explicit alignment strategies by designing additional generation tasks during the GR training. These tasks directly encourage cross-modal understanding:
- Item-level alignment: The GR model is trained to generate an item's visual semantic ID when given its textual semantic ID as input, and vice versa (generate the textual semantic ID from the visual semantic ID). This forces a direct mapping between the modalities for individual items.
- Sequence-level alignment: The GR model is trained to predict the visual semantic ID of the next recommended item given a historical sequence of textual semantic IDs, and similarly to predict the textual semantic ID of the next item given a historical sequence of visual semantic IDs. These tasks integrate cross-modal understanding into the sequential prediction context.

These additional explicit alignment tasks are incorporated into the sequential recommendation training alongside the primary next-token prediction tasks; a sketch of how such training pairs could be constructed follows.
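The sketch below shows one way the item-level and sequence-level explicit-alignment examples might be constructed as (input, target) Seq2Seq pairs. The token format (e.g., strings like "<t_12>") is an illustrative assumption, not the paper's exact serialization.

```python
def build_explicit_alignment_pairs(history_t_sids, history_v_sids, next_t_sid, next_v_sid):
    """Return (input_tokens, target_tokens) pairs for the auxiliary alignment tasks.
    Each *_sid is a list of semantic-ID tokens, e.g. ["<t_12>", "<t_87>", "<t_3>", "<t_201>"]."""
    pairs = []
    # Item-level alignment: map one modality's ID to the other's, per item.
    for t_sid, v_sid in zip(history_t_sids, history_v_sids):
        pairs.append((t_sid, v_sid))      # text ID -> visual ID
        pairs.append((v_sid, t_sid))      # visual ID -> text ID
    # Sequence-level alignment: cross-modal next-item prediction.
    flat_t = [tok for sid in history_t_sids for tok in sid]
    flat_v = [tok for sid in history_v_sids for tok in sid]
    pairs.append((flat_t, next_v_sid))    # textual history -> next visual ID
    pairs.append((flat_v, next_t_sid))    # visual history -> next textual ID
    return pairs
```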
4.2.2.3. Training Objectives and Inference
For multimodal GR, there are two main recommendation tasks:
- Predicting the textual semantic ID of the next item based on the historical sequence of item textual semantic IDs.
- Predicting the visual semantic ID of the next item based on the historical sequence of item visual semantic IDs.

By integrating the aforementioned alignment strategies, the final training objective for the GR model is formulated as:
$ \mathcal{L}_{\mathrm{rec}} = - \sum_{t=1}^{|y|} \log P_{\theta} (y_t \mid y_{<t}, x) + \lambda_{\mathrm{implicit}} \mathcal{L}_{\mathrm{implicit}} $
where:
- $\mathcal{L}_{\mathrm{rec}}$ is the total recommendation loss.
- $- \sum_{t=1}^{|y|} \log P_{\theta} (y_t \mid y_{<t}, x)$ is the standard next-token prediction loss (negative log-likelihood) for generating the target sequence $y$ given the input context $x$; $P_{\theta}(y_t \mid y_{<t}, x)$ is the probability of predicting the $t$-th token given the previous tokens and input $x$, parameterized by $\theta$.
- $\lambda_{\mathrm{implicit}}$ is a hyperparameter controlling the weight of the implicit alignment loss.
- $\mathcal{L}_{\mathrm{implicit}}$ is the implicit alignment loss computed in the latent space. (Note: the explicit alignment tasks are incorporated by modifying the input/output pairs for the standard next-token prediction loss, effectively expanding the scope of that term rather than adding a separate term to the loss function.)
During the inference stage, MACRec generates multiple candidate semantic IDs for different modalities using constrained beam search (as in Rajput et al. 2023). Finally, the results from both modalities are ensembled by averaging their scores to obtain the final recommendation.
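As an illustration of the final ensembling step, the following sketch averages candidate scores across the two modality-specific beam searches, assuming each generated semantic ID has already been mapped back to its item and scored; the paper's exact fusion procedure may differ in detail.

```python
def ensemble_recommendations(text_candidates: dict, vision_candidates: dict, top_k: int = 10):
    """Average beam-search scores per item across modalities and return the top-k item ids.
    Each argument maps item_id -> score from constrained beam search on that modality."""
    items = set(text_candidates) | set(vision_candidates)
    fused = {}
    for item in items:
        scores = [c[item] for c in (text_candidates, vision_candidates) if item in c]
        fused[item] = sum(scores) / len(scores)
    return sorted(fused, key=fused.get, reverse=True)[:top_k]
```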
5. Experimental Setup
5.1. Datasets
The experiments were conducted on three real-world recommendation datasets derived from the Amazon Product Reviews dataset, covering user reviews and item metadata from May 1996 to October 2018. The datasets represent three distinct product categories:
- Musical Instruments (Instruments)
- Arts, Crafts and Sewing (Arts)
- Video Games (Games)

These datasets are commonly used in recommendation research and are effective for validating the performance of sequential and multimodal recommendation methods due to their inherent sequence structure (user interaction history) and rich item metadata (textual descriptions and images).
The following are the results from Table 1 of the original paper:
| Datasets | #Users | #Items | #Interactions | Sparsity | Avg. len |
|---|---|---|---|---|---|
| Instruments | 17112 | 6250 | 136226 | 99.87% | 7.96 |
| Arts | 22171 | 9416 | 174079 | 99.92% | 7.85 |
| Games | 42259 | 13839 | 373514 | 99.94% | 8.84 |
where:
- #Users: The number of unique users in the dataset.
- #Items: The number of unique items in the dataset.
- #Interactions: The total number of user-item interactions recorded.
- Sparsity: A measure of how few interactions exist compared to all possible interactions, calculated as $1 - \frac{\#\mathrm{Interactions}}{\#\mathrm{Users} \times \#\mathrm{Items}}$. High sparsity (close to 100%) is typical for recommendation datasets.
- Avg. len: The average length of user interaction sequences.
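As a quick sanity check of the sparsity column, the value for the Instruments dataset can be recomputed directly from the counts in Table 1:

```python
# Sparsity for Instruments: 1 - #Interactions / (#Users * #Items)
users, items, interactions = 17112, 6250, 136226
sparsity = 1 - interactions / (users * items)
print(f"{sparsity:.2%}")  # ~99.87%, matching Table 1
```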
5.2. Evaluation Metrics
To assess recommendation effectiveness, the paper adopts two standard top-$K$ evaluation metrics: Hit Rate (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K), with $K$ set to 1, 5, and 10. A leave-one-out evaluation protocol is used, where the last item in a user's sequence is held out for testing, and full ranking assessments are performed across the entire item collection (rather than sampling negative items).
- Hit Rate at K (HR@K)
  - Conceptual Definition: HR@K measures how often the target item (the item the user actually interacted with next) appears within the top-$K$ recommended items. It is a recall-oriented metric indicating whether the recommender "hit" the relevant item in its top suggestions.
  - Mathematical Formula:
    $ \mathrm{HR@K} = \frac{\text{Number of users for whom the target item is in the top-}K\text{ recommendations}}{\text{Total number of users}} $
  - Symbol Explanation:
    - Numerator: the count of users for whom the ground-truth next item is found among the top $K$ items ranked by the recommender.
    - Denominator: the total number of users considered in the evaluation.
    - $K$: the size of the recommendation list (e.g., 1, 5, or 10).
- Normalized Discounted Cumulative Gain at K (NDCG@K)
  - Conceptual Definition: NDCG@K measures ranking quality. It considers not only whether relevant items are in the top-$K$ list but also their positions: relevant items placed at higher ranks (closer to the top) contribute more to the score. It is normalized to values between 0 and 1 by dividing by the Ideal DCG (IDCG), the DCG of a perfect ranking.
  - Mathematical Formula:
    $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $, with
    $ \mathrm{DCG@K} = \sum_{i=1}^K \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $ and $ \mathrm{IDCG@K} = \sum_{i=1}^K \frac{2^{\mathrm{rel}_i^{\mathrm{ideal}}} - 1}{\log_2(i+1)} $
  - Symbol Explanation:
    - $\mathrm{DCG@K}$: Discounted Cumulative Gain at rank $K$.
    - $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at rank $K$, the maximum possible DCG, obtained by ranking all relevant items perfectly.
    - $K$: the maximum rank at which items are considered.
    - $\mathrm{rel}_i$: the relevance score of the item at position $i$ in the recommended list. With binary relevance, $\mathrm{rel}_i$ is 1 if the item at rank $i$ is the target item and 0 otherwise.
    - $\mathrm{rel}_i^{\mathrm{ideal}}$: the relevance score at position $i$ in the ideal (perfectly sorted) list. Under leave-one-out evaluation there is only one relevant item, so $\mathrm{IDCG@K}$ equals 1.
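Under the leave-one-out protocol with a single ground-truth item, both metrics reduce to simple functions of the target's rank; a minimal sketch:

```python
import math

def hr_at_k(rank: int, k: int) -> float:
    """rank is the 1-based position of the ground-truth item in the full ranking."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank: int, k: int) -> float:
    """With one relevant item, IDCG@K = 1, so NDCG@K = 1/log2(rank+1) when the item is in the top-K."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

# Example: target ranked 3rd -> HR@5 = 1.0, NDCG@5 = 0.5
print(hr_at_k(3, 5), ndcg_at_k(3, 5))
```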
5.3. Baselines
To evaluate MACRec, the authors compare it against a diverse set of representative recent methods:
- Traditional Sequential Recommendation Models:
  - BERT4Rec (Sun et al. 2019): A self-supervised sequential recommender based on BERT that uses masked item prediction.
  - SASRec (Kang and McAuley 2018): A self-attentive sequential recommender that applies the Transformer encoder to model sequential interactions.
  - FDSA (Zhang et al. 2019): A feature-level deeper self-attention network for sequential recommendation.
  - S3-Rec (Zhou et al. 2020): Self-supervised learning for sequential recommendation with mutual information maximization.
- Multimodal Sequential Recommendation Models:
  - MISSRec (Wang et al. 2023a): Pre-training and transferring multi-modal interest-aware sequence representations for recommendation.
  - P5 (Geng et al. 2022): Recommendation as Language Processing, a unified pretrain, personalized prompt & predict paradigm. While not strictly multimodal in its original form, it is a strong LLM-based baseline.
  - VIP5 (Geng et al. 2023): Extends P5 to multimodal settings, positioning it as a multimodal foundation model for recommendation.
- Generative Recommendation (GR) Models:
  - TIGER (Rajput et al. 2023): An early generative retrieval model that discretizes item sequences into tokens for generative recommendation.
  - MQL4GRec (Zhai et al. 2025): A multimodal generative recommendation model that encodes multimodal and cross-domain item information into a unified quantized language. This is the most direct and advanced multimodal GR baseline against which MACRec aims to show significant improvements.
5.4. Implementation Details
- Fair Comparison: For MQL4GRec, the authors did not use its pre-training on millions of samples from additional categories, to ensure a fair comparison under similar data conditions.
- Feature Extraction: Text features are obtained using LLaMA (Touvron et al. 2023); image features are obtained using ViT-L/14 (Vision Transformer Large, patch size 14) (Dosovitskiy et al. 2020).
- RQ-VAE Configuration:
  - Codebook size: 256.
  - Number of RQ layers: 4.
  - Optimizer: AdamW.
  - Batch size: 1024.
  - Learning rate: 0.001.
  - Number of K-means clusters: 512 (for pseudo-label generation).
- GR Model Backbone:
  - T5 (Text-to-Text Transfer Transformer) is used as the backbone.
  - Encoder and decoder each have 4 Transformer layers.
  - 6 attention heads per layer, with an attention head dimension of 64.
- Hyperparameters for MACRec Losses:
  - Layer-wise contrastive weights: set so that the contrastive loss is applied only to the 2nd and 3rd RQ layers.
  - Alignment loss weight: 0.001.
  - Implicit alignment loss weight: 0.01.
  - Temperature parameter: 0.1 (for all contrastive losses).
- Runs: Results are averaged over five random seeds to ensure robustness.
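For reference, the reported settings above can be collected into a single configuration sketch; only values stated in this summary are included, and the grouping of keys is an assumption.

```python
MACREC_CONFIG = {
    "rq_vae": {
        "codebook_size": 256,
        "num_rq_layers": 4,
        "optimizer": "AdamW",
        "batch_size": 1024,
        "learning_rate": 1e-3,
        "kmeans_clusters": 512,   # for pseudo-label generation
    },
    "gr_backbone": {              # T5-style encoder-decoder
        "encoder_layers": 4,
        "decoder_layers": 4,
        "attention_heads": 6,
        "head_dim": 64,
    },
    "loss_weights": {
        "lambda_align": 1e-3,     # cross-modal reconstruction alignment
        "lambda_implicit": 1e-2,  # implicit alignment during GR training
        "temperature": 0.1,       # shared by all contrastive losses
    },
}
```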
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that MACRec consistently achieves superior performance across all three datasets when compared to a wide range of state-of-the-art baselines, including traditional sequential models and other generative recommendation approaches.
The following are the results from Table 2 of the original paper:
(Columns are grouped in the original paper as Sequential Rec.: BERT4Rec, SASRec, FDSA, S3-Rec; Multimodal Seq. Rec.: MISSRec, P5-CID, VIP5; Generative Rec.: TIGER, MQL4GRec, MACRec.)

| Dataset | Metric | BERT4Rec | SASRec | FDSA | S3-Rec | MISSRec | P5-CID | VIP5 | TIGER | MQL4GRec | MACRec |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Instruments | HR@1 | 0.0450 | 0.0318 | 0.0530 | 0.0339 | 0.0723 | 0.0512 | 0.0737 | 0.0754 | 0.0763 | 0.0819* |
| | HR@5 | 0.0856 | 0.0946 | 0.0987 | 0.0937 | 0.1089 | 0.0839 | 0.0892 | 0.1007 | 0.1058 | 0.1110* |
| | HR@10 | 0.1081 | 0.1233 | 0.1249 | 0.1123 | 0.1361 | 0.1119 | 0.1071 | 0.1221 | 0.1291 | 0.1363* |
| | NDCG@5 | 0.0667 | 0.0654 | 0.0775 | 0.0693 | 0.0797 | 0.0678 | 0.0815 | 0.0882 | 0.0902 | 0.0965* |
| | NDCG@10 | 0.0739 | 0.0746 | 0.0859 | 0.0743 | 0.0880 | 0.0704 | 0.0872 | 0.0950 | 0.0997 | 0.1046* |
| Arts | HR@1 | 0.0289 | 0.0212 | 0.0380 | 0.0172 | 0.0479 | 0.0421 | 0.0474 | 0.0532 | 0.0626 | 0.0685* |
| | HR@5 | 0.0697 | 0.0951 | 0.0832 | 0.0739 | 0.1021 | 0.0713 | 0.0704 | 0.0894 | 0.1167 | 0.1254* |
| | HR@10 | 0.0922 | 0.1250 | 0.1190 | 0.1030 | 0.1321 | 0.0994 | 0.0959 | 0.1167 | 0.1254 | 0.1329* |
| | NDCG@5 | 0.0502 | 0.0610 | 0.0583 | 0.0511 | 0.0699 | 0.0607 | 0.0586 | 0.0718 | 0.0816 | 0.0868* |
| | NDCG@10 | 0.0575 | 0.0706 | 0.0695 | 0.0630 | 0.0815 | 0.0662 | 0.0635 | 0.0806 | 0.0898 | 0.0953* |
| Games | HR@1 | 0.0115 | 0.0069 | 0.0163 | 0.0136 | 0.0201 | 0.0169 | 0.0173 | 0.0166 | 0.0200 | 0.0208* |
| | HR@5 | 0.0426 | 0.0587 | 0.0614 | 0.0527 | 0.0674 | 0.0532 | 0.0480 | 0.0523 | 0.0645 | 0.0671* |
| | HR@10 | 0.0725 | 0.0985 | 0.0988 | 0.0903 | 0.1048 | 0.0824 | 0.0758 | 0.0857 | 0.1007 | 0.1078* |
| | NDCG@5 | 0.0270 | 0.0333 | 0.0389 | 0.0351 | 0.0385 | 0.0331 | 0.0328 | 0.0345 | 0.0421 | 0.0435* |
| | NDCG@10 | 0.0366 | 0.0461 | 0.0509 | 0.0468 | 0.0499 | 0.0454 | 0.0418 | 0.0453 | 0.0538 | 0.0565* |
Key observations from Table 2:
- Overall Superiority: MACRec consistently achieves the best performance across all three datasets (Instruments, Arts, Games) and all evaluation metrics (HR@1, HR@5, HR@10, NDCG@5, NDCG@10). The asterisk (*) indicates a statistically significant improvement over the best baseline, confirming its robust advantage.
- Advantage over Multimodal Generative Baselines: MACRec significantly outperforms MQL4GRec, the closest state-of-the-art multimodal generative recommendation model. This indicates that MACRec's cross-modal quantization for semantic ID learning and its multi-aspect alignment training strategy are highly effective. For example, on the Instruments dataset, MACRec improves HR@10 from MQL4GRec's 0.1291 to 0.1363, and NDCG@10 from 0.0997 to 0.1046.
- Improved NDCG: Compared to multimodal sequential recommendation models (such as MISSRec, P5-CID, VIP5), MACRec shows remarkable improvement, especially in NDCG, which is more sensitive to ranking quality. This suggests that MACRec's multimodal generative recommendation framework more accurately places relevant and highly preferred items at higher ranks.
- Generative Paradigm Effectiveness: Generally, Generative Recommendation models (TIGER, MQL4GRec, MACRec) tend to outperform traditional sequential recommendation models (BERT4Rec, SASRec, FDSA, S3-Rec) and even some multimodal sequential models, highlighting the potential of the generative paradigm. MACRec pushes this paradigm further by effectively integrating multimodal information.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Ablation Study (RQ2)
An ablation study was conducted to understand the impact of different components of MACRec on its performance. The study evaluated the HR@10 metric on the three datasets by removing specific loss alignment strategies.
The following are the results from Table 3 of the original paper:
| Model | Instruments | Arts | Games |
|---|---|---|---|
| MACRec | 0.1363 | 0.1329 | 0.1078 |
| w/o Lcon | 0.1289 | 0.1283 | 0.1018 |
| w/o Lalign | 0.1310 | 0.1301 | 0.1026 |
| w/o Limplicit | 0.1312 | 0.1296 | 0.1042 |
| w/o Explicit Alignment | 0.1296 | 0.1299 | 0.1037 |
Observations from Table 3:
- Effectiveness of All Modules: Removing any of the proposed alignment modules (Lcon, Lalign, Limplicit, Explicit Alignment) leads to a discernible performance degradation across all three datasets, confirming that each component contributes positively to MACRec's overall effectiveness.
- Dominant Impact of Cross-modal Contrastive Quantization (Lcon): The largest performance drop occurs when Lcon (the cross-modal contrastive loss during quantization) is removed; for example, HR@10 on Instruments drops from 0.1363 to 0.1289. This highlights the critical role of MACRec's contrastive-learning-based cross-modal quantization in creating high-quality semantic IDs by integrating multimodal information early in the ID learning process.
- Importance of Alignment Losses: Both implicit alignment (Limplicit) and explicit alignment contribute significantly. Removing either also leads to noticeable drops, underscoring their importance in enhancing the generative model's understanding of semantic IDs and enabling the learning of shared features across modalities during GR training.
6.2.2. Item Collision Analysis (RQ3)
The paper investigates the item collision rate during the quantization process, which refers to how often multiple distinct items are assigned the exact same semantic ID. A lower collision rate indicates better semantic discriminability and codebook utilization.
The following are the results from Table 4 of the original paper:
| Dataset | Text: MQL4GRec | Text: MACRec | Image: MQL4GRec | Image: MACRec |
|---|---|---|---|---|
| Instruments | 3.23 | 2.76 | 3.71 | 2.38 |
| Arts | 5.15 | 4.24 | 5.71 | 3.29 |
| Games | 25.24 | 2.91 | 26.10 | 3.51 |

Values are item ID collision rates (%); lower is better.
Observations from Table 4:
- Reduced Collision Rates: MACRec consistently achieves significantly lower item ID collision rates for both text and image modalities across all datasets compared to MQL4GRec. For example, on the Games dataset, MACRec reduces the text collision rate from 25.24% (MQL4GRec) to 2.91% and the image collision rate from 26.10% to 3.51%.
- Enhanced Codebook Usability: This reduction in collision rate strongly suggests that MACRec effectively leverages the complementarity between modalities during quantization. By guiding the ID learning process with cross-modal contrastive learning, MACRec enables a more balanced and distinct allocation of semantic IDs, minimizing ambiguity and improving codebook usability. A sketch of how such a collision rate can be computed follows.
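For clarity, a collision here means distinct items sharing an identical full semantic ID. Below is a minimal way to compute a collision rate from a mapping of item to ID tuple; the paper's exact definition (e.g., whether measured before conflict resolution) may differ slightly.

```python
from collections import Counter

def collision_rate(item_to_sid: dict) -> float:
    """Percentage of items whose full semantic ID (tuple of codeword indices)
    is shared with at least one other item."""
    counts = Counter(tuple(sid) for sid in item_to_sid.values())
    colliding = sum(c for c in counts.values() if c > 1)
    return 100.0 * colliding / len(item_to_sid)

# Example: three items, two of which share the same 4-level ID -> ~66.7%
print(collision_rate({"a": (1, 5, 9, 2), "b": (1, 5, 9, 2), "c": (7, 0, 3, 4)}))
```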
6.2.3. Code Assignment Distribution (RQ4)
The distribution of codewords (individual semantic IDs) to items is crucial for effective codebook utilization. An ideal distribution would be relatively even, indicating that most codewords are used and no single codeword is over-represented (which would suggest a lack of discriminative power).
The following figure (Figure 4 from the original paper) shows the code assignment distribution on the 2nd RQ layer:
(Figure 4, translated caption) The figure is a bar chart comparing the code assignment of MQL4GRec and MACRec across bucket indices (16 codewords per bucket). The horizontal axis is the bucket index and the vertical axis is the number of items, with the text and image modalities shown separately.
As shown in Figure 4, which visualizes the code assignment distribution on the 2nd RQ layer, MACRec demonstrates a more uniform distribution of items across codewords compared to MQL4GRec. The red bars represent text semantic IDs, and blue bars represent visual semantic IDs. MQL4GRec shows a more skewed distribution, with some codewords being assigned a very high number of items while many others are underutilized. In contrast, MACRec's distribution appears flatter and more spread out. This indicates that MACRec's cross-modal quantization approach leads to better codebook utilization by encouraging a diverse and balanced assignment of semantic IDs, reflecting its superior semantic representation capabilities.
6.2.4. Parameter Analysis (RQ5)
The paper also analyzes the impact of key hyperparameters on MACRec's performance, specifically on the Instruments dataset, using HR@10 and NDCG@10 as metrics.
The following figure (Figure 3 from the original paper) shows the performance of MACRec over different hyper-parameters on Instruments:
(Figure 3, translated caption) The figure shows MACRec's performance under different hyperparameters, including codebook size, semantic ID length, the starting layer for the contrastive loss, and the weights of the contrastive-style losses. In each sub-plot, HR@10 and NDCG@10 are shown as bars in different colors, with the corresponding values annotated, illustrating how each hyperparameter setting affects recommendation performance.
Observations from Figure 3:
- Codebook Size: Both very small and very large codebook sizes degrade performance. A small codebook limits the quantization space and the model's ability to capture diverse semantic associations, while an overly large codebook dilutes token exposure, making it harder to learn robust representations because many codewords are used only sparsely.
- Semantic ID Length: Extremely short semantic IDs fail to capture comprehensive semantics, losing information, while excessively long semantic IDs complicate learning by expanding the generation space and making next-token prediction harder. An intermediate length balances expressiveness and learnability.
- Starting Layer for Lcon: The cross-modal contrastive loss is most effective when applied starting from the third RQ layer; applying it from earlier or later layers yields suboptimal results. This suggests that contrastive learning is most beneficial in later RQ layers, where fine-grained semantic residuals can leverage cross-modal signals to compensate for semantic loss without interfering with the coarser-grained quantization of earlier layers.
- Weights of Contrastive Losses: Each of the three contrastive-style loss weights has an optimal value. Higher weights strengthen modality fusion and alignment but, if set too high, can over-constrain the model or introduce noise; lower weights lead to insufficient cross-modal interaction, limiting the benefits of multimodality. Careful tuning is therefore needed to balance multimodal integration.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), a novel framework designed to address the limitations of existing Generative Recommendation (GR) methods in effectively integrating multimodal information. The core idea of MACRec is to enhance cross-modal alignment and interaction at two critical stages: semantic ID learning and generative model training.
Specifically, MACRec proposes a cross-modal quantization method that incorporates contrastive learning into residual quantization (RQ) and reconstruction. This approach facilitates the construction of semantic IDs that are not only hierarchically meaningful but also more discriminative and less prone to conflicts, thereby improving codebook usability. Furthermore, MACRec integrates multi-aspect cross-modal alignments (both implicit alignment in the latent space and explicit alignment through generative tasks) during the GR model's training process. These alignments enhance the model's understanding of sequential multimodal information.
Extensive experiments conducted on three well-known recommendation datasets demonstrate the superior performance of MACRec compared to state-of-the-art GR models. Additional analyses confirm MACRec's advantages in reducing item collision rates and achieving more balanced code assignment distributions, validating its effectiveness in utilizing codebook capacity and representing item semantics.
7.2. Limitations & Future Work
The paper primarily focuses on presenting the solution and its effectiveness, and does not explicitly list a "Limitations" or "Future Work" section. However, based on the context and common challenges in the field, potential areas could include:
- Computational Cost: RQ-VAE with multiple layers and separate processing for multiple modalities (text and image in this case) can be computationally intensive, especially during ID learning and when the item catalog is very large.
- Scalability: While the method shows strong performance on the datasets used, scaling to industrial-level recommendation systems with millions or billions of items and more complex multimodal data (e.g., audio, video, user-generated content) might introduce new challenges.
- Dependency on Pre-trained Models: The quality of the initial text and image embeddings relies heavily on powerful pre-trained models like LLaMA and ViT. MACRec's performance could be sensitive to the choice and capabilities of these foundational models.
- Interpretability of Semantic IDs: While MACRec aims for "hierarchically meaningful semantic IDs," the direct interpretability of these discrete token sequences for human understanding or debugging might still be limited, unlike natural language.
- Generalization to Other Multimodal Data: The current work focuses on text and images. Extending MACRec to other modalities (e.g., audio or video for music/movie recommendation) would require careful adaptation of the cross-modal quantization and alignment strategies.

Future research directions could explore:

- Developing more efficient quantization mechanisms for multimodal data to reduce computational overhead.
- Investigating methods for dynamically adjusting semantic ID length or codebook size based on item complexity or dataset characteristics.
- Exploring adaptive weighting strategies for the various alignment losses instead of fixed hyperparameters.
- Applying MACRec to more diverse multimodal recommendation scenarios or other generative tasks beyond next-item prediction.
- Enhancing the interpretability of the generated semantic IDs to provide more transparent recommendation explanations.

7.3. Personal Insights & Critique
This paper presents a strong and well-reasoned approach to integrating multimodal information into Generative Recommendation. The rigorous integration at both the ID learning and GR training stages, using a combination of cross-modal contrastive learning and multi-aspect alignment, is a significant step forward. The empirical results clearly validate the effectiveness of MACRec, especially the substantial reduction in item collision rates and the improved code assignment distribution, which are crucial indicators of high-quality semantic representations.

Inspirations & Transferability:

- The idea of applying cross-modal contrastive learning directly within the residual quantization process is highly innovative. This concept could transfer to other domains requiring robust discrete representations of multimodal data, such as multimodal retrieval or multimodal content generation.
- The multi-aspect alignment strategy (implicit and explicit) for generative models is a generalizable framework. It could be adapted to any Seq2Seq model that needs to learn cross-modal relationships from discrete tokens, even outside of recommendation, for example in multimodal dialogue systems or multimodal captioning, where different modalities need to be aligned at both the latent and task-specific levels.
- The benefits of carefully constructed semantic IDs for generative retrieval are evident, emphasizing the importance of the quantization stage in any generative AI application that relies on discrete tokens.

Potential Issues & Areas for Improvement:

- Hyperparameter Sensitivity: The paper relies on several critical hyperparameters (e.g., codebook size, ID length, loss weights, the starting layer for Lcon, temperature). Optimal performance depends heavily on careful tuning, which can be time-consuming. Further work could explore adaptive or meta-learning approaches for hyperparameter optimization.
- Complexity of Pseudo-label Generation: The initial K-means clustering for pseudo-label generation is a heuristic step whose quality directly influences the subsequent contrastive learning. Investigating more sophisticated or adaptive pseudo-labeling strategies, perhaps adjusted dynamically during training, could be beneficial.
- Scaling of Cross-modal Contrastive Loss: While effective, the cross-modal contrastive loss in the RQ layers requires careful management. The choice to apply it only to later layers suggests a balance to prevent over-constraining the initial, coarser representations; the optimal balance might vary significantly across datasets or modalities.
- Computational Cost of Inference: Although the model is generative, the constrained beam search and the subsequent ensemble across modalities can add latency during inference, which is a critical factor in real-time recommendation systems.
- Lack of User-Modality Preference: The model implicitly aligns modalities but does not explicitly learn whether a user weighs visual or textual aspects more strongly for certain item types. Incorporating user-specific modality preferences could be a valuable extension.

Overall, MACRec provides a solid foundation for future research in multimodal generative recommendation by offering a systematic and effective way to harness the rich information embedded in diverse modalities.