
Multi-Aspect Cross-modal Quantization for Generative Recommendation

Published: 2025-11-19

TL;DR Summary

This paper introduces the MACRec model for generative recommendation, integrating multimodal information to improve semantic ID quality. It employs cross-modal quantization to reduce conflict rates and combines implicit and explicit alignments, enhancing the generative model's performance.

Abstract

Generative Recommendation (GR) has emerged as a new paradigm in recommender systems. This approach relies on quantized representations to discretize item features, modeling users' historical interactions as sequences of discrete tokens. Based on these tokenized sequences, GR predicts the next item by employing next-token prediction methods. The challenges of GR lie in constructing high-quality semantic identifiers (IDs) that are hierarchically organized, minimally conflicting, and conducive to effective generative model training. However, current approaches remain limited in their ability to harness multimodal information and to capture the deep and intricate interactions among diverse modalities, both of which are essential for learning high-quality semantic IDs and for effectively training GR models. To address this, we propose Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), which introduces multimodal information and incorporates it into both semantic ID learning and generative model training from different aspects. Specifically, we first introduce cross-modal quantization during the ID learning process, which effectively reduces conflict rates and thus improves codebook usability through the complementary integration of multimodal information. In addition, to further enhance the generative ability of our GR model, we incorporate multi-aspect cross-modal alignments, including the implicit and explicit alignments. Finally, we conduct extensive experiments on three well-known recommendation datasets to demonstrate the effectiveness of our proposed method.


In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of the paper is Multi-Aspect Cross-modal Quantization for Generative Recommendation.

1.2. Authors

The authors and their affiliations are:

  • Fuwei Zhang (Institute of Artificial Intelligence, Beihang University)
  • Xiaoyu Liu (Institute of Artificial Intelligence, Beihang University)
  • Dongbo Xi (Meituan)
  • Jishen Yin (Meituan)
  • Huan Chen (Meituan)
  • Peng Yan (Meituan)
  • Fuzhen Zhuang (Institute of Artificial Intelligence, Beihang University; SKLCCSE, School of Computer Science and Engineering, Beihang University)
  • Zhao Zhang (SKLCCSE, School of Computer Science and Engineering, Beihang University)

1.3. Journal/Conference

This paper is a preprint, published on arXiv, as indicated by the "Published at (UTC): 2025-11-19T04:55:14.000Z" and the arxiv.org links. arXiv is a widely recognized open-access preprint server for research articles in fields like computer science, mathematics, and physics. It plays a crucial role in rapidly disseminating research findings and facilitating early peer review before formal publication in journals or conferences.

1.4. Publication Year

2025

1.5. Abstract

Generative Recommendation (GR) has emerged as a new paradigm in recommender systems. This approach relies on quantized representations to discretize item features, modeling users' historical interactions as sequences of discrete tokens. Based on these tokenized sequences, GR predicts the next item by employing next-token prediction methods. The challenges of GR lie in constructing high-quality semantic identifiers (IDs) that are hierarchically organized, minimally conflicting, and conducive to effective generative model training. However, current approaches remain limited in their ability to harness multimodal information and to capture the deep and intricate interactions among diverse modalities, both of which are essential for learning high-quality semantic IDs and for effectively training GR models. To address this, the paper proposes Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), which introduces multimodal information and incorporates it into both semantic ID learning and generative model training from different aspects. Specifically, MACRec first introduces cross-modal quantization during the ID learning process, which effectively reduces conflict rates and thus improves codebook usability through the complementary integration of multimodal information. In addition, to further enhance the generative ability of the GR model, it incorporates multi-aspect cross-modal alignments, including the implicit and explicit alignments. Finally, extensive experiments on three well-known recommendation datasets demonstrate the effectiveness of the proposed method.

  • Original Source Link: https://arxiv.org/abs/2511.15122v1
  • PDF Link: https://arxiv.org/pdf/2511.15122v1.pdf
  • Publication Status: Preprint (version 1) on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is the limitation of current Generative Recommendation (GR) approaches in effectively utilizing multimodal information to create high-quality semantic identifiers (IDs) and train robust GR models.

This problem is important because GR is a rapidly evolving paradigm in recommender systems, reformulating recommendation as a next-token prediction task. The success of GR heavily relies on the quality of these discrete semantic IDs, which should be hierarchically organized, minimize conflicts, and facilitate generative model training. However, existing methods often rely on single-modality (textual) embeddings for ID generation, leading to limited semantic discriminability and potential semantic loss in deeper hierarchical structures. For instance, text-based embeddings might group items by brand, overlooking functional differences that image-based embeddings could capture (e.g., different types of instruments from the same brand). Current multimodal GR models also tend to encode modalities separately without deep cross-modal interaction during quantization, leading to hierarchical semantic loss and suboptimal use of complementary information.

The paper's entry point is to explicitly introduce multimodal information and cross-modal interactions at two critical stages: semantic ID learning and generative model training, to overcome the limitations of single-modality or weakly integrated multimodality in existing GR systems.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  1. Novel Cross-modal Quantization Method: MACRec proposes a new cross-modal quantization method that integrates contrastive learning into residual quantization (RQ) and reconstruction. This approach leverages multimodal information to learn hierarchically meaningful semantic IDs for items, addressing the issue of semantic loss and improving codebook usability.

  2. Multi-Aspect Alignment Strategies for GR Training: To enable the generative model to learn common features from different modalities and enhance its understanding of semantic IDs, MACRec employs multi-aspect alignment strategies. These include:

    • Implicit alignment in the latent space through contrastive methods.
    • Explicit alignment within the generative task (e.g., predicting visual IDs from text IDs and vice-versa, or sequence-level predictions).
  3. Extensive Experimental Validation: The paper conducts extensive experiments on three widely used recommendation datasets (Musical Instruments, Arts, Crafts and Sewing, Video Games). The findings demonstrate that MACRec significantly outperforms state-of-the-art Generative Recommendation (GR) models, showcasing its effectiveness.

  4. Improved Codebook Utilization and Reduced Collision Rates: Analysis shows that MACRec effectively reduces the item collision rate during the quantization process and achieves a more balanced code assignment distribution, indicating better utilization of the codebook capacity and superior semantic representation.

    These contributions collectively aim to solve the problem of learning high-quality, semantically rich, and non-conflicting item semantic IDs while ensuring effective generative model training by deeply integrating multimodal information from various aspects.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following key concepts:

  • Recommender Systems: Systems that suggest items (e.g., products, movies, articles) to users based on their preferences and past behavior. They address information overload by personalizing content.

  • Generative Recommendation (GR): A new paradigm in recommender systems that reframes the recommendation task as a sequence generation problem. Instead of predicting a score or ranking items, GR models learn to "generate" the semantic identifier (ID) of the next item a user might interact with, given their historical sequence of interactions. This often leverages techniques from Large Language Models (LLMs).

  • Quantized Representations / Discretization: The process of converting continuous data (like numerical embeddings) into a finite set of discrete values or "tokens." In GR, item features (e.g., text descriptions, images) are converted into discrete semantic IDs to enable sequence modeling, similar to how words are treated as tokens in natural language processing.

  • Semantic Identifiers (IDs) / Tokens: These are discrete, symbolic representations assigned to items after quantization. Each semantic ID is intended to capture the underlying meaning or characteristics of an item. In GR, user interaction histories become sequences of these item semantic IDs.

  • Next-Token Prediction: A core task in generative models, especially LLMs. Given a sequence of tokens, the model predicts the most probable next token in the sequence. In GR, this translates to predicting the semantic ID of the next item in a user's interaction history.

  • Residual Quantized Variational AutoEncoder (RQ-VAE): A key component for learning discrete semantic IDs.

    • Vector Quantization (VQ): A method to discretize continuous vectors. It works by maintaining a "codebook" (a set of learnable representative vectors, called "codewords"). When a continuous input vector comes in, VQ finds the closest codeword in the codebook and replaces the input vector with that codeword. This effectively "quantizes" the input.
    • Residual Quantization (RQ): An extension of VQ that improves its efficiency and expressive power. Instead of quantizing the entire vector in one go, RQ applies multiple VQ layers sequentially. Each layer quantizes the residual (the error or remaining information) from the previous layer's quantization. This allows for a hierarchical representation, capturing progressively finer details. For example, the first layer might capture broad features, and subsequent layers refine them.
    • Variational AutoEncoder (VAE): A type of generative model that learns a probabilistic mapping from input data to a latent space and then reconstructs the input from this latent representation. RQ-VAE combines the quantization process with the VAE framework, learning to encode and decode item features through discrete codewords while minimizing reconstruction error.
  • Multimodal Information / Modalities: Data that comes from different sources or forms, such as text (item descriptions, reviews) and images (product photos). Multimodal approaches aim to leverage the complementary nature of these different data types to gain a richer understanding than any single modality alone.

  • Contrastive Learning: A self-supervised learning paradigm where the model learns representations by contrasting similar and dissimilar examples. It aims to pull "positive pairs" (e.g., different views of the same item, or items belonging to the same semantic cluster) closer in the embedding space while pushing "negative pairs" (unrelated items) further apart.

    • InfoNCE Loss (Noise-Contrastive Estimation): A common loss function used in contrastive learning. It encourages the embedding of an anchor sample to be similar to its positive samples and dissimilar to its negative samples (a minimal sketch appears at the end of this list): $ \mathcal{L}_{\mathrm{InfoNCE}}(q, \{k_+\}, \{k_-\}) = - \log \frac{\exp(\mathrm{sim}(q, k_+) / \tau)}{\sum_{k \in \{k_+\} \cup \{k_-\}} \exp(\mathrm{sim}(q, k) / \tau)} $ where:
    • $q$ is the query (anchor) embedding.
    • $k_+$ is a positive sample embedding.
    • $k$ ranges over all sample embeddings (positive and negative) in the batch.
    • $\mathrm{sim}(\cdot, \cdot)$ is a similarity function, often cosine similarity or dot product.
    • $\tau$ is a temperature hyperparameter that scales the logits. A smaller $\tau$ makes the distribution sharper, emphasizing larger similarities. This loss maximizes the agreement between $q$ and $k_+$ relative to other samples in the batch.
  • Sequence-to-Sequence (Seq2Seq) Models: An encoder-decoder architecture commonly used for tasks involving sequences, such as machine translation or text summarization. An encoder processes the input sequence into a latent representation, and a decoder generates an output sequence from this latent representation. Transformer architectures (like T5) are popular Seq2Seq models.

  • K-means Clustering: An unsupervised learning algorithm used to partition $N$ observations into $K$ clusters, where each observation belongs to the cluster with the nearest mean (centroid). It's used here to generate pseudo-labels based on feature similarity.

  • LLaMA (Large Language Model Meta AI): A family of open-source large language models developed by Meta AI. Used in the paper for extracting textual embeddings.

  • ViT (Vision Transformer): A Transformer model applied to image recognition. It treats images as sequences of image patches, which are then processed by a standard Transformer encoder. Used in the paper for extracting visual embeddings.
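
As a concrete reference for the InfoNCE entry above, here is a minimal PyTorch sketch of the loss under the common simplification that each anchor has exactly one in-batch positive; the pairing layout and temperature value are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, k_pos: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """q, k_pos: [B, d] paired embeddings; row i of k_pos is the positive for row i of q."""
    q = F.normalize(q, dim=-1)
    k_pos = F.normalize(k_pos, dim=-1)
    logits = q @ k_pos.t() / tau                        # [B, B] cosine similarities scaled by tau
    labels = torch.arange(q.size(0), device=q.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)              # -log softmax at the positive entry

# Example: 8 anchor/positive pairs in a 64-dimensional embedding space.
loss = info_nce(torch.randn(8, 64), torch.randn(8, 64))
```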

3.2. Previous Works

The paper contextualizes its work by discussing various recommender system paradigms:

  • Sequential Recommendation: Focuses on modeling user behavior sequences to capture dynamic preferences.

    • Early approaches: GRU4Rec (Hidasi et al. 2015) used Gated Recurrent Units (GRUs) to model sessions. STAMP (Liu et al. 2018) introduced attention mechanisms for short-term preferences. NARM (Li et al. 2017) also used attention for session-based recommendations. These models primarily relied on interaction data (which items were interacted with).
    • Attention-based models: SASRec (Kang and McAuley 2018) brought the Transformer architecture to sequential recommendation, effectively modeling long-range dependencies.
    • Pretrained Language Models (PLMs): BERT4Rec (Sun et al. 2019) adapted BERT for recommendation by using masked item prediction, significantly advancing performance through self-supervised pretraining.
    • Prompt-based methods: P5 (Geng et al. 2022) and M6-Rec (Cui et al. 2022) reformulated recommendation tasks as language modeling problems, enhancing generalization and flexibility by using prompts.
  • Multi-modal Sequential Recommendation: Enriches sequential representations by incorporating item modalities beyond just interaction IDs (e.g., text, images).

    • Approaches include deep and graph neural networks like MMGCN (Wei et al. 2019) and GRCN (Wei et al. 2020) to integrate heterogeneous features.
    • Contrastive learning and multimodal pretraining methods such as MMGCL (Yi et al. 2022) and MISSRec (Wang et al. 2023a) further strengthen user interest modeling.
    • VIP5 (Geng et al. 2023) extended prompt-based techniques to multimodal settings.
  • Generative Recommendation (GR): This is where MACRec directly positions itself.

    • TIGER (Rajput et al. 2023) was an early work that discretized item sequences into tokens, enabling the generative recommendation paradigm.
    • LC-Rec (Zheng et al. 2024) utilized LLMs' natural language understanding for diverse task-specific fine-tuning.
    • LETTER (Wang et al. 2024) extended TIGER by introducing collaborative filtering embeddings and an additional loss function to improve codebook utilization.
    • Multimodal GR works: MMGRec (Liu et al. 2024a) used a Graph RQ-VAE to generate item representations by integrating multimodal features with collaborative signals. MQL4GRec (Zhai et al. 2025) is presented as the most direct state-of-the-art baseline, encoding multimodal and cross-domain item information into a unified quantized language.

3.3. Technological Evolution

The field of recommender systems has evolved from basic collaborative filtering and matrix factorization to sophisticated deep learning models. This evolution can be broadly categorized:

  1. Traditional Collaborative Filtering/Content-Based: Early systems relied on user-item interaction matrices or item metadata.

  2. Sequential Models: Recognizing the temporal nature of user preferences, models like GRU4Rec and SASRec began to model user histories as sequences.

  3. Deep Learning Integration: The rise of neural networks led to more powerful representation learning for items and users.

  4. Pre-trained Models (PLMs/LLMs): Adapting large pre-trained models from NLP (BERT4Rec, P5) and Vision (ViT) has significantly boosted performance by leveraging rich pre-learned knowledge.

  5. Multimodal Integration: Moving beyond single-modality data, researchers started incorporating various modalities (text, images, audio, etc.) to capture richer item semantics and user preferences.

  6. Generative Paradigm: The latest shift, inspired by LLMs, where recommendation becomes a generation task (next-token prediction) rather than a classification or ranking task. This approach offers flexibility and richer contextual understanding.

    MACRec fits into the latest stage, building upon multimodal generative recommendation. It aims to refine how multimodal information is processed within the quantization step and during the generative model training to address the shortcomings of previous multimodal GR efforts.

3.4. Differentiation Analysis

Compared to the main methods in related work, MACRec's core differences and innovations are:

  • Holistic Cross-modal Integration during Quantization: Existing multimodal GR models (e.g., MQL4GRec) typically encode each modality separately to obtain semantic IDs for different modalities. They do not consider cross-modal interactions during the quantization process itself. MACRec introduces cross-modal contrastive learning directly into each layer of residual quantization, actively forcing interaction and alignment between textual and visual residuals. This is a significant improvement over independent quantization, reducing semantic loss and codebook collapse.
  • Multi-Aspect Cross-modal Alignment in GR Training: While some multimodal GR models use multimodal features, MACRec further enhances the generative model's ability by incorporating both implicit and explicit alignment mechanisms during the GR training phase.
    • Implicit alignment (contrastive learning) in the latent space of the generative model ensures that semantic IDs from different modalities for the same item are close.
    • Explicit alignment introduces auxiliary generative tasks (item-level: text ID to visual ID, sequence-level: textual sequence to next visual ID) that directly encourage the generative model to learn cross-modal relationships.
  • Improved Semantic ID Quality and Codebook Utilization: By integrating cross-modal contrastive learning early in the ID learning process and then aligning representations during reconstruction, MACRec achieves semantic IDs that are more discriminative, hierarchically meaningful, and suffer from lower collision rates (fewer items mapping to the same ID) and better codebook utilization compared to baselines like MQL4GRec.
  • Comprehensive Approach: MACRec offers a more comprehensive framework by addressing multimodal integration at two crucial stages (ID learning and GR training) and from multiple aspects (quantization, reconstruction, implicit alignment, explicit alignment), leading to superior recommendation performance.

4. Methodology

4.1. Principles

The core idea behind MACRec is to enhance Generative Recommendation (GR) by deeply integrating multimodal information throughout the entire process, from semantic ID learning to generative model training. The theoretical basis is that different modalities (e.g., text and images) capture complementary aspects of an item, and by fostering cross-modal interactions at various stages, we can:

  1. Construct High-Quality Semantic IDs: Overcome the limitations of single-modality quantization by using cross-modal contrastive learning to make semantic IDs more discriminative, less prone to conflicts, and better utilized within the codebook.

  2. Improve Generative Model Understanding: Train the GR model to inherently understand and leverage the shared and complementary information across modalities through explicit and implicit alignment tasks, thereby improving its ability to predict the next item.

    The intuition is that if a model understands an item from both its textual description and its visual appearance, it will have a much richer and more robust representation, leading to more accurate and diverse recommendations.

4.2. Core Methodology In-depth (Layer by Layer)

The MACRec framework is organized into two main modules: cross-modal item quantization for generating discrete semantic IDs, and the training phase of the GR model with multi-aspect alignment. The overall architecture is illustrated in Figure 2.

The following figure (Figure 2 from the original paper) illustrates the overall architecture of MACRec:

Figure 2 (schematic): the application of multi-aspect cross-modal quantization in generative recommendation. The figure depicts the cross-modal item quantization and multi-aspect alignment processes, covering the encoding of textual and visual features, pseudo-label generation, the quantization procedure, and the alignment mechanisms. By contrasting implicit and explicit alignment, it shows how multimodal information is leveraged to improve the generative recommendation model.

As shown in Figure 2, the process begins with multimodal item information (text and image). This information goes through cross-modal item quantization to produce discrete semantic IDs. These IDs are then used to construct Seq2Seq training data for the Generative Recommender (GR) model, which is trained with multi-aspect alignment (including implicit and explicit alignments) to perform next-token prediction.

4.2.1. Cross-modal Item Quantization

The goal of this module is to generate high-quality discrete semantic IDs for items by effectively integrating multimodal information during the quantization process.

4.2.1.1. Dual-modality Pseudo-label Generation

To enable contrastive learning across modalities, the first step involves generating pseudo-labels. For each item ii, its text and visual information are encoded into continuous embeddings using pre-trained models.

  • The text information is encoded into an embedding $\mathbf{t}_i$ using an open-source large language model (e.g., LLaMA).

  • The visual content of the item's image is encoded into an embedding $\mathbf{v}_i$ using a Vision Transformer (ViT).

    Subsequently, K-means clustering is performed independently on these textual and visual embeddings to partition them into $K$ clusters. These cluster assignments serve as pseudo-labels for each modality.

The clustering process is formulated as:
$ \mathcal{C}_{\mathrm{text}} = \mathrm{KMeans}(\{ \mathbf{t}_i \}_{i=1}^N) $
$ \mathcal{C}_{\mathrm{vision}} = \mathrm{KMeans}(\{ \mathbf{v}_i \}_{i=1}^N) $
where:

  • $\mathcal{C}_{\mathrm{text}}$ denotes the resulting cluster assignments (pseudo-labels) for the text modality.
  • $\mathcal{C}_{\mathrm{vision}}$ denotes the resulting cluster assignments (pseudo-labels) for the vision modality.
  • $\{ \mathbf{t}_i \}_{i=1}^N$ represents the set of all $N$ text embeddings for all items.
  • $\{ \mathbf{v}_i \}_{i=1}^N$ represents the set of all $N$ visual embeddings for all items.
  • $\mathrm{KMeans}(\cdot)$ is the K-means clustering function.
  • $N$ is the total number of items.
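
A minimal sketch of this pseudo-label step, assuming scikit-learn's KMeans and random stand-ins for the LLaMA and ViT embeddings; the item count and embedding dimensions below are placeholders, and only $K = 512$ matches the paper's reported setting.

```python
import numpy as np
from sklearn.cluster import KMeans

N, K = 5_000, 512                        # number of items (stub) and clusters (paper setting)
text_emb = np.random.randn(N, 128)       # stand-in for LLaMA text embeddings t_i
vis_emb = np.random.randn(N, 128)        # stand-in for ViT image embeddings v_i

# Independent clustering per modality; each item's cluster index is its pseudo-label.
C_text = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(text_emb)
C_vision = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(vis_emb)
```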

4.2.1.2. Cross-modal Quantization with Contrastive Learning

The core quantization mechanism is based on Residual-Quantized Variational AutoEncoder (RQ-VAE). RQ-VAE uses multi-layer vector quantization (VQ) where each layer quantizes the residuals from the previous layer.

In MACRec, for a given item, both the text and visual embeddings are first processed by an encoder (composed of a multi-layer perceptron (MLP)) to obtain latent representations:

  • Text latent representation: $\mathbf{z}^t = \mathrm{T-Encoder}(\mathbf{t})$
  • Visual latent representation: $\mathbf{z}^v = \mathrm{V-Encoder}(\mathbf{v})$

    These latent representations $\mathbf{z}^t$ and $\mathbf{z}^v$ serve as the initial residuals for the first VQ layer: $\mathbf{r}_0^t = \mathbf{z}^t$ and $\mathbf{r}_0^v = \mathbf{z}^v$.

At the $l$-th layer of RQ, each modality has its own learnable codebook $C_l^{v/t} = \{ \mathbf{e}_{l,k}^{v/t} \}_{k=0}^M$. The residual for a given modality at layer $l$ is quantized by finding the closest codeword in its respective codebook.

The selection of the closest codeword is given by:
$ c_l^t = \arg\min_k \| \mathbf{r}_l^t - \mathbf{e}_{l,k}^t \|_2 $
$ c_l^v = \arg\min_k \| \mathbf{r}_l^v - \mathbf{e}_{l,k}^v \|_2 $
where:

  • $c_l^t$ is the index of the selected codeword for the text modality at layer $l$.

  • $c_l^v$ is the index of the selected codeword for the visual modality at layer $l$.

  • $\mathbf{r}_l^t$ is the residual vector for the text modality at layer $l$.

  • $\mathbf{r}_l^v$ is the residual vector for the visual modality at layer $l$.

  • $\mathbf{e}_{l,k}^t$ is the $k$-th codeword in the text codebook at layer $l$.

  • $\mathbf{e}_{l,k}^v$ is the $k$-th codeword in the visual codebook at layer $l$.

  • $\| \cdot \|_2$ denotes the Euclidean (L2) norm.

  • $M$ is the size of the codebook.

    After selecting the codeword, the residual for the next layer is calculated by subtracting the selected codeword from the current residual:
$ \mathbf{r}_{l+1}^t = \mathbf{r}_l^t - \mathbf{e}_{l,c_l^t}^t $
$ \mathbf{r}_{l+1}^v = \mathbf{r}_l^v - \mathbf{e}_{l,c_l^v}^v $
where:

  • $\mathbf{r}_{l+1}^t$ and $\mathbf{r}_{l+1}^v$ are the residual vectors passed to the $(l+1)$-th layer for text and vision, respectively.
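
A short sketch of one residual-quantization step for a single modality, illustrating the codeword selection and residual update above; the tensor shapes and the randomly initialized codebooks are illustrative only.

```python
import torch

def rq_step(residual: torch.Tensor, codebook: torch.Tensor):
    """residual: [B, d]; codebook: [M, d]. Returns codeword indices and the next residual."""
    dists = torch.cdist(residual, codebook)    # [B, M] Euclidean distances to every codeword
    idx = dists.argmin(dim=-1)                 # c_l = argmin_k ||r_l - e_{l,k}||_2
    next_residual = residual - codebook[idx]   # r_{l+1} = r_l - e_{l,c_l}
    return idx, next_residual

# Running L layers in sequence yields the L-token semantic ID of each item.
B, d, M, L = 4, 32, 256, 4
codebooks = [torch.randn(M, d) for _ in range(L)]
r = torch.randn(B, d)                          # r_0 = encoder output z
semantic_id = []
for layer in range(L):
    idx, r = rq_step(r, codebooks[layer])
    semantic_id.append(idx)
```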

    To address the limitations of independent quantization (potential codebook collapse and underutilization of cross-modal complementarity), MACRec introduces cross-modal contrastive learning at each RQ layer. This is done by leveraging the multimodal pseudo-labels generated earlier. Specifically, visual pseudo-labels enhance textual residual representations, and textual pseudo-labels optimize visual residual representations.

The InfoNCE loss for the $l$-th layer is defined as:
$ \mathcal{L}_{\mathrm{con}}^{l, v \to t} = - \frac{1}{B} \sum_{i=1}^B \log \left( \frac{\exp \left( \langle \mathbf{r}_i^t, \mathbf{r}_{i,pos}^t \rangle / \tau \right)}{\sum_{j=1}^B \exp \left( \langle \mathbf{r}_i^t, \mathbf{r}_j^t \rangle / \tau \right)} \right) $
$ \mathcal{L}_{\mathrm{con}}^{l, t \to v} = - \frac{1}{B} \sum_{i=1}^B \log \left( \frac{\exp \left( \langle \mathbf{r}_i^v, \mathbf{r}_{i,pos}^v \rangle / \tau \right)}{\sum_{j=1}^B \exp \left( \langle \mathbf{r}_i^v, \mathbf{r}_j^v \rangle / \tau \right)} \right) $
$ \mathcal{L}_{\mathrm{con}}^{l} = \mathcal{L}_{\mathrm{con}}^{l, t \to v} + \mathcal{L}_{\mathrm{con}}^{l, v \to t} $
where:

  • $\mathcal{L}_{\mathrm{con}}^{l, v \to t}$ is the contrastive loss for text residuals, guided by visual pseudo-labels.
  • $\mathcal{L}_{\mathrm{con}}^{l, t \to v}$ is the contrastive loss for visual residuals, guided by textual pseudo-labels.
  • $B$ is the batch size.
  • $i$ indexes an item in the batch.
  • $\mathbf{r}_i^t$ is the text residual for item $i$ at layer $l$.
  • $\mathbf{r}_i^v$ is the visual residual for item $i$ at layer $l$.
  • $\mathbf{r}_{i,pos}^t$ represents a positive sample for the text residual of item $i$: a text residual from another item in the batch that shares the same visual pseudo-label $\mathcal{C}_{\mathrm{vision}}$ as item $i$.
  • $\mathbf{r}_{i,pos}^v$ represents a positive sample for the visual residual of item $i$: a visual residual from another item in the batch that shares the same textual pseudo-label $\mathcal{C}_{\mathrm{text}}$ as item $i$.
  • $\langle \cdot, \cdot \rangle$ denotes the inner product (dot product), used here as a similarity measure.
  • $\tau$ is the temperature parameter for the contrastive loss.
  • $\sum_{j=1}^B \exp(\cdot)$ sums over all items in the batch, which act as negative samples for the anchor when they do not share the same pseudo-label.
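
The distinctive part of this loss is how positives are chosen: an anchor's positives are other in-batch residuals of the same modality whose items share the other modality's pseudo-label. The sketch below illustrates one way this could be implemented; picking only the first in-batch positive per anchor is an illustrative simplification, not the paper's stated procedure.

```python
import torch
import torch.nn.functional as F

def cross_modal_con(res: torch.Tensor, other_modal_labels: torch.Tensor, tau: float = 0.1):
    """res: [B, d] residuals of one modality; other_modal_labels: [B] pseudo-labels from the other modality."""
    B = res.size(0)
    res = F.normalize(res, dim=-1)
    sim = res @ res.t() / tau                                              # [B, B] in-batch similarities
    same = other_modal_labels.unsqueeze(0) == other_modal_labels.unsqueeze(1)
    same = same & ~torch.eye(B, dtype=torch.bool, device=res.device)       # exclude the anchor itself
    has_pos = same.any(dim=1)                                              # anchors with an in-batch positive
    pos_idx = same.float().argmax(dim=1)                                   # first positive (simplification)
    log_prob = sim.log_softmax(dim=1)                                      # denominator runs over the whole batch
    loss = -log_prob[torch.arange(B), pos_idx]
    return loss[has_pos].mean()

# L_con^l would then combine both directions, e.g.
# cross_modal_con(text_residuals, vision_pseudo_labels) + cross_modal_con(vis_residuals, text_pseudo_labels)
```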

4.2.1.3. Cross-modal Reconstruction Alignment

After quantization, the discrete semantic IDs are represented by summing the corresponding codeword vectors from each layer. For $L$ layers of codebooks, the quantized representation for an item is:
$ \hat{\mathbf{z}}^t = \sum_{l=0}^{L-1} \mathbf{e}_{l,c_l^t}^t $
$ \hat{\mathbf{z}}^v = \sum_{l=0}^{L-1} \mathbf{e}_{l,c_l^v}^v $
where:

  • $\hat{\mathbf{z}}^t$ is the final reconstructed quantized embedding for the text modality.

  • $\hat{\mathbf{z}}^v$ is the final reconstructed quantized embedding for the visual modality.

  • $\mathbf{e}_{l,c_l^t}^t$ is the selected codeword for text at layer $l$.

  • $\mathbf{e}_{l,c_l^v}^v$ is the selected codeword for vision at layer $l$.

  • $L$ is the total number of RQ layers.

    To further refine codebook representations and balance codebook utilization, MACRec introduces another alignment loss based on contrastive learning. This loss encourages bidirectional alignment between the quantized representations of different modalities for the same item.

The alignment loss is formulated as:
$ \mathcal{L}_{\mathrm{align}}^{t \to v} = - \frac{1}{B} \sum_{i=1}^B \log \left( \frac{\exp \left( \langle \hat{\mathbf{z}}_i^t, \hat{\mathbf{z}}_i^v \rangle / \tau \right)}{\sum_{j=1}^B \exp \left( \langle \hat{\mathbf{z}}_i^t, \hat{\mathbf{z}}_j^v \rangle / \tau \right)} \right) $
$ \mathcal{L}_{\mathrm{align}}^{v \to t} = - \frac{1}{B} \sum_{i=1}^B \log \left( \frac{\exp \left( \langle \hat{\mathbf{z}}_i^v, \hat{\mathbf{z}}_i^t \rangle / \tau \right)}{\sum_{j=1}^B \exp \left( \langle \hat{\mathbf{z}}_i^v, \hat{\mathbf{z}}_j^t \rangle / \tau \right)} \right) $
$ \mathcal{L}_{\mathrm{align}} = \mathcal{L}_{\mathrm{align}}^{t \to v} + \mathcal{L}_{\mathrm{align}}^{v \to t} $
where:

  • $\mathcal{L}_{\mathrm{align}}^{t \to v}$ is the alignment loss from text to vision.

  • $\mathcal{L}_{\mathrm{align}}^{v \to t}$ is the alignment loss from vision to text.

  • $\hat{\mathbf{z}}_i^t$ and $\hat{\mathbf{z}}_i^v$ are the quantized embeddings for the text and vision modalities of item $i$.

  • The other symbols ($B$, $\langle \cdot, \cdot \rangle$, $\tau$) are as defined for the InfoNCE loss. This loss ensures that the quantized representations of the same item across modalities are similar.

    Similar to the RQ-VAE architecture, the quantized representations are decoded and reconstructed separately for each modality:

  • Decoded textual embedding: $\hat{\mathbf{t}} = \mathrm{T-Decoder}(\hat{\mathbf{z}}^t)$

  • Decoded visual embedding: $\hat{\mathbf{v}} = \mathrm{V-Decoder}(\hat{\mathbf{z}}^v)$

    The reconstruction losses are calculated as the squared L2 norm between the original and decoded embeddings:
$ \mathcal{L}_{\mathrm{recon}}^t = \| \mathbf{t} - \hat{\mathbf{t}} \|_2^2 $
$ \mathcal{L}_{\mathrm{recon}}^v = \| \mathbf{v} - \hat{\mathbf{v}} \|_2^2 $
where:

  • $\mathcal{L}_{\mathrm{recon}}^t$ is the reconstruction loss for the text modality.

  • $\mathcal{L}_{\mathrm{recon}}^v$ is the reconstruction loss for the visual modality.

  • $\mathbf{t}$ and $\mathbf{v}$ are the original text and visual embeddings, respectively.

  • $\hat{\mathbf{t}}$ and $\hat{\mathbf{v}}$ are the reconstructed text and visual embeddings.

    The RQ-VAE training also includes a quantization loss (often called codebook loss or commitment loss) to ensure the codebook vectors are updated appropriately and the encoder output commits to the codebook. This is typically applied to each modality $m$:
$ \mathcal{L}_{\mathrm{rq}}^m = \sum_{l=0}^{L-1} \left( \| \mathbf{sg} [ \mathbf{r}_l^m ] - \mathbf{e}_{l,c_l^m}^m \|_2^2 + \alpha \| \mathbf{r}_l^m - \mathbf{sg} [ \mathbf{e}_{l,c_l^m}^m ] \|_2^2 \right) $
where:

  • $\mathcal{L}_{\mathrm{rq}}^m$ is the residual quantization loss for modality $m$.

  • $\mathbf{sg}[\cdot]$ represents the stop-gradient operation, which prevents gradients from flowing through its argument.

  • The first term $\| \mathbf{sg} [ \mathbf{r}_l^m ] - \mathbf{e}_{l,c_l^m}^m \|_2^2$ updates the codebook vectors ($\mathbf{e}_{l,c_l^m}^m$) to move towards the encoder's output (the residuals $\mathbf{r}_l^m$).

  • The second term $\alpha \| \mathbf{r}_l^m - \mathbf{sg} [ \mathbf{e}_{l,c_l^m}^m ] \|_2^2$ is a commitment loss that pulls the encoder's output ($\mathbf{r}_l^m$) closer to the codebook vector (by passing gradients to the encoder).

  • $\alpha$ is a loss coefficient (hyperparameter).

  • The superscript $m$ denotes the modality (text $t$ or vision $v$).
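
A minimal sketch of this codebook/commitment loss for one modality at one layer, with the stop-gradient realized by `.detach()`; the commitment coefficient value and the use of a mean rather than a sum over dimensions are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def rq_layer_loss(residual: torch.Tensor, codeword: torch.Tensor, alpha: float = 0.25):
    """residual r_l^m: [B, d] encoder-side vectors; codeword e_{l,c_l^m}: [B, d] selected codebook entries."""
    codebook_loss = F.mse_loss(codeword, residual.detach())   # sg[r] - e: moves codewords toward residuals
    commit_loss = F.mse_loss(residual, codeword.detach())     # r - sg[e]: pulls the encoder toward the codebook
    return codebook_loss + alpha * commit_loss                # alpha = 0.25 is a conventional choice, not from the paper
```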

    The total RQ-VAE loss combines the reconstruction and quantization losses for both modalities:
$ \mathcal{L}_{\mathrm{RQ-VAE}} = \mathcal{L}_{\mathrm{recon}}^t + \mathcal{L}_{\mathrm{recon}}^v + \mathcal{L}_{\mathrm{rq}}^t + \mathcal{L}_{\mathrm{rq}}^v $

Finally, the overall training objective for learning the semantic identifiers (IDs), denoted as $\mathcal{L}_{\mathrm{ID}}$, integrates all of these components:
$ \mathcal{L}_{\mathrm{ID}} = \mathcal{L}_{\mathrm{RQ-VAE}} + \sum_{l=0}^{L-1} \lambda_{\mathrm{con}}^l \mathcal{L}_{\mathrm{con}}^l + \lambda_{\mathrm{align}} \mathcal{L}_{\mathrm{align}} $
where:

  • $\mathcal{L}_{\mathrm{ID}}$ is the total loss for learning semantic IDs.

  • $\mathcal{L}_{\mathrm{RQ-VAE}}$ is the combined reconstruction and RQ loss.

  • $\lambda_{\mathrm{con}}^l$ and $\lambda_{\mathrm{align}}$ are trade-off hyperparameters that balance the contribution of the layer-wise cross-modal contrastive losses $\mathcal{L}_{\mathrm{con}}^l$ and the cross-modal reconstruction alignment loss $\mathcal{L}_{\mathrm{align}}$, respectively.
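
Putting the pieces together, the ID-learning objective can be assembled as below; the per-layer weights follow the values reported in the implementation details, and the individual loss terms are assumed to come from sketches like the ones above.

```python
# Per-layer contrastive weights from the reported setup: 0 for layers 0-1, 0.1 for layers 2-3.
lambda_con = [0.0, 0.0, 0.1, 0.1]
lambda_align = 0.001

def total_id_loss(l_rqvae, l_con_per_layer, l_align):
    """l_rqvae, l_align: scalar losses; l_con_per_layer: one scalar loss per RQ layer."""
    weighted_con = sum(w * l for w, l in zip(lambda_con, l_con_per_layer))
    return l_rqvae + weighted_con + lambda_align * l_align
```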

    For cases where conflicts occur among certain item IDs (multiple items mapping to the same ID), MACRec adopts the same conflict resolution strategy as proposed by Zhai et al., which involves reassigning codewords based on the distance between items and the codebook.

4.2.2. Generative Recommendation with Multi-aspect Alignment

Once the RQ-VAE model is trained, it provides discrete semantic IDs for both text and images. For instance, a text ID might be <a_1><b_2><c_3> and a visual ID might be <A_1><B_2><C_3>. These sequences of semantic IDs are then used to construct Seq2Seq training data for the Generative Recommender (GR) model, which is typically a Transformer-based encoder-decoder architecture (such as T5). The GR model is trained for next-token prediction. To further optimize information sharing and interaction across different modalities during this GR training phase, MACRec designs implicit alignment and explicit alignment mechanisms.

4.2.2.1. Implicit Alignment for Cross-modal Semantic IDs

This mechanism aims to ensure that the GR model recognizes the commonality between semantic IDs of different modalities that belong to the same item. It aligns them at the latent space level after encoding. Specifically, the textual semantic ID (t-sid) and visual semantic ID (v-sid) of an item are encoded into latent representations using the encoder of the GR model. Mean Pooling is applied to obtain a single vector representation for the sequence.

The encoding process is:
$ \mathbf{e}^t = \mathrm{MeanPool}(\mathrm{T5-Encoder}(t\text{-sid})) $
$ \mathbf{e}^v = \mathrm{MeanPool}(\mathrm{T5-Encoder}(v\text{-sid})) $
where:

  • $\mathbf{e}^t$ is the pooled latent representation for the textual semantic ID.

  • $\mathbf{e}^v$ is the pooled latent representation for the visual semantic ID.

  • $\mathrm{T5-Encoder}(\cdot)$ is the encoder component of the T5 model (used as the GR backbone).

  • $t\text{-sid}$ represents the textual semantic ID sequence for an item.

  • $v\text{-sid}$ represents the visual semantic ID sequence for an item.

  • $\mathrm{MeanPool}(\cdot)$ computes the average of the token embeddings from the encoder's output to obtain a single vector.

    These latent representations are then aligned using contrastive learning, similar to the InfoNCE loss:
$ \mathcal{L}_{\mathrm{implicit}}^{t \to v} = - \frac{1}{B} \sum_{i=1}^B \log \left( \frac{\exp \left( \langle \mathbf{e}_i^t, \mathbf{e}_i^v \rangle / \tau \right)}{\sum_{j=1}^B \exp \left( \langle \mathbf{e}_i^t, \mathbf{e}_j^v \rangle / \tau \right)} \right) $
$ \mathcal{L}_{\mathrm{implicit}}^{v \to t} = - \frac{1}{B} \sum_{i=1}^B \log \left( \frac{\exp \left( \langle \mathbf{e}_i^v, \mathbf{e}_i^t \rangle / \tau \right)}{\sum_{j=1}^B \exp \left( \langle \mathbf{e}_i^v, \mathbf{e}_j^t \rangle / \tau \right)} \right) $
$ \mathcal{L}_{\mathrm{implicit}} = \mathcal{L}_{\mathrm{implicit}}^{t \to v} + \mathcal{L}_{\mathrm{implicit}}^{v \to t} $
where:

  • $\mathcal{L}_{\mathrm{implicit}}^{t \to v}$ and $\mathcal{L}_{\mathrm{implicit}}^{v \to t}$ are the bidirectional implicit alignment losses for the latent space representations.

  • $\mathbf{e}_i^t$ and $\mathbf{e}_i^v$ are the latent representations for the textual and visual semantic IDs of item $i$.

  • $B$, $\langle \cdot, \cdot \rangle$, and $\tau$ are the batch size, inner product, and temperature parameter, respectively. This loss pulls the latent representations of an item's text and image semantic IDs closer together.
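
A sketch of this step: a masked mean pool over the encoder outputs followed by a symmetric contrastive term. The encoder outputs are taken as given here and stand in for the T5 encoder the paper uses.

```python
import torch
import torch.nn.functional as F

def masked_mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """hidden: [B, T, d] encoder outputs; mask: [B, T] with 1 for real tokens, 0 for padding."""
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

def implicit_align(e_t: torch.Tensor, e_v: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """e_t, e_v: [B, d] pooled representations of an item's text and visual semantic IDs."""
    e_t, e_v = F.normalize(e_t, dim=-1), F.normalize(e_v, dim=-1)
    logits = e_t @ e_v.t() / tau
    labels = torch.arange(e_t.size(0), device=e_t.device)
    # Symmetric loss: text-to-vision plus vision-to-text.
    return F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)
```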

4.2.2.2. Explicit Alignment with Different Generation Tasks

Inspired by prior work (Zhai et al.), MACRec also proposes explicit alignment strategies by designing additional generation tasks during the GR training. These tasks directly encourage cross-modal understanding:

  • Item-level alignment: The GR model is trained to generate an item's visual semantic ID when given its textual semantic ID as input, and vice versa (generate textual semantic ID from visual semantic ID). This forces a direct mapping between the modalities for individual items.

  • Sequence-level alignment: The GR model is trained to predict the visual semantic ID of the next recommended item given a historical sequence of textual semantic IDs. Similarly, it predicts the textual semantic ID of the next item given a historical sequence of visual semantic IDs. These tasks integrate cross-modal understanding into the sequential prediction context.

    These additional explicit alignment tasks are incorporated into the sequential recommendation training alongside the primary next-token prediction tasks.
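
To make the task construction concrete, the snippet below sketches how the explicit-alignment examples could sit alongside the main recommendation data as extra seq2seq pairs; the token strings and dictionary layout are illustrative, not the paper's actual data format.

```python
# Hypothetical semantic-ID strings for one item and one user history.
item_text_id = "<a_1><b_2><c_3>"
item_vis_id = "<A_1><B_2><C_3>"
history_text_ids = ["<a_7><b_4><c_9>", "<a_2><b_8><c_1>"]
next_item_vis_id = "<A_5><B_3><C_6>"

explicit_alignment_examples = [
    # Item-level alignment: translate one modality's ID into the other's, in both directions.
    {"input": item_text_id, "target": item_vis_id},
    {"input": item_vis_id, "target": item_text_id},
    # Sequence-level alignment: textual history -> visual ID of the next item (and symmetrically).
    {"input": " ".join(history_text_ids), "target": next_item_vis_id},
]
```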

4.2.2.3. Training Objectives and Inference

For multimodal GR, there are two main recommendation tasks:

  1. Predicting the textual semantic ID of the next item based on the historical sequence of item textual semantic IDs.

  2. Predicting the visual semantic ID of the next item based on the historical sequence of item visual semantic IDs.

    By integrating the aforementioned alignment strategies, the final training objective for the GR model is formulated as:
$ \mathcal{L}_{\mathrm{rec}} = - \sum_{t=1}^{|y|} \log P_{\theta} (y_t \mid y_{<t}, x) + \lambda_{\mathrm{implicit}} \mathcal{L}_{\mathrm{implicit}} $
where:

  • $\mathcal{L}_{\mathrm{rec}}$ is the total recommendation loss.
  • $- \sum_{t=1}^{|y|} \log P_{\theta} (y_t \mid y_{<t}, x)$ is the standard next-token prediction loss (negative log-likelihood) for generating the target sequence $y$ given the input context $x$. $P_{\theta}(y_t \mid y_{<t}, x)$ is the probability of predicting the $t$-th token $y_t$ given the previous tokens $y_{<t}$ and input $x$, parameterized by $\theta$.
  • $\lambda_{\mathrm{implicit}}$ is a hyperparameter that controls the weight of the implicit alignment loss.
  • $\mathcal{L}_{\mathrm{implicit}}$ is the implicit alignment loss calculated in the latent space. (Note: the explicit alignment tasks are incorporated by modifying the input/output pairs for the standard next-token prediction loss, effectively expanding the scope of $(y_t \mid y_{<t}, x)$ rather than adding a separate term to the loss function.)

During the inference stage, MACRec generates multiple candidate semantic IDs for different modalities using constrained beam search (as in Rajput et al. 2023). Finally, the results from both modalities are ensembled by averaging their scores to obtain the final recommendation.
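
A small sketch of the score-averaging ensemble at inference time; how the per-modality beam scores are normalized and how items surfaced by only one modality are treated are assumptions here, since the paper only states that scores are averaged.

```python
from collections import defaultdict

def ensemble(text_candidates, vis_candidates, top_k=10):
    """Each argument is a list of (item_id, score) pairs from one modality's constrained beam search."""
    scores = defaultdict(list)
    for item, score in list(text_candidates) + list(vis_candidates):
        scores[item].append(score)
    fused = {item: sum(s) / len(s) for item, s in scores.items()}   # average across modalities
    return sorted(fused, key=fused.get, reverse=True)[:top_k]

# Example with two overlapping candidate lists.
print(ensemble([(1, 0.9), (2, 0.4)], [(1, 0.7), (3, 0.6)], top_k=2))   # -> [1, 3]
```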

5. Experimental Setup

5.1. Datasets

The experiments were conducted on three real-world recommendation datasets derived from the Amazon Product Reviews dataset, covering user reviews and item metadata from May 1996 to October 2018. The datasets represent three distinct product categories:

  • Musical Instruments

  • Arts, Crafts and Sewing (Arts)

  • Video Games (Games)

    These datasets are commonly used in recommendation research and are effective for validating the performance of sequential and multimodal recommendation methods due to their inherent sequence structure (user interaction history) and rich item metadata (textual descriptions and images).

The following are the results from Table 1 of the original paper:

Datasets #Users #Items #Interactions Sparsity Avg. len
Instruments 17112 6250 136226 99.87% 7.96
Arts 22171 9416 174079 99.92% 7.85
Games 42259 13839 373514 99.94% 8.84

where:

  • #Users: The number of unique users in the dataset.
  • #Items: The number of unique items in the dataset.
  • #Interactions: The total number of user-item interactions recorded.
  • Sparsity: A measure of how few interactions exist compared to all possible interactions, calculated as $1 - \frac{\text{\#Interactions}}{\text{\#Users} \times \text{\#Items}}$. High sparsity (close to 100%) is typical for recommendation datasets.
  • Avg. len: The average length of user interaction sequences.
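
As a quick sanity check of the sparsity definition, the Instruments row can be reproduced directly:

```python
# Sparsity for the Instruments dataset from Table 1.
users, items, interactions = 17112, 6250, 136226
sparsity = 1 - interactions / (users * items)
print(f"{sparsity:.2%}")   # -> 99.87%, matching the table
```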

5.2. Evaluation Metrics

To assess recommendation effectiveness, the paper adopts two standard top-$K$ evaluation metrics: Hit Rate (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K). $K$ is set to 1, 5, and 10. A leave-one-out evaluation protocol is used, where the last item in a user's sequence is held out for testing, and full ranking assessments are performed across the entire item collection (rather than sampling negative items).

  1. Hit Rate at K (HR@K)

    • Conceptual Definition: HR@K measures how often the target item (the item the user actually interacted with next) is present within the top-$K$ recommended items. It's a recall-oriented metric that indicates whether the recommender system "hit" the relevant item in its top suggestions.
    • Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of users for whom the target item is in top-K recommendations}}{\text{Total number of users}} $
    • Symbol Explanation:
      • Number of users for whom the target item is in top-K recommendations: The count of unique users for whom the ground-truth next item is found among the top $K$ items ranked by the recommender system.
      • Total number of users: The total number of users considered in the evaluation.
      • $K$: The size of the recommendation list (e.g., 1, 5, or 10).
  2. Normalized Discounted Cumulative Gain at K (NDCG@K)

    • Conceptual Definition: NDCG@K is a measure of ranking quality. It considers not only whether relevant items are in the top-$K$ list but also their position in the list. Higher-relevance items placed at higher ranks (closer to the top) contribute more to the NDCG score. It's "normalized" to values between 0 and 1 by dividing by the Ideal DCG (IDCG), which is the DCG of a perfect ranking.
    • Mathematical Formula:
      $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
      where
      $ \mathrm{DCG@K} = \sum_{i=1}^K \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $
      $ \mathrm{IDCG@K} = \sum_{i=1}^K \frac{2^{\mathrm{rel}_{i_{\text{ideal}}}} - 1}{\log_2(i+1)} $
    • Symbol Explanation:
      • $\mathrm{DCG@K}$: Discounted Cumulative Gain at rank $K$.
      • $\mathrm{IDCG@K}$: Ideal Discounted Cumulative Gain at rank $K$. This is the maximum possible DCG achievable for a given query, obtained by ranking all relevant items perfectly.
      • $K$: The maximum rank at which to consider items.
      • $\mathrm{rel}_i$: The relevance score of the item at position $i$ in the recommended list. For recommendation tasks where relevance is binary (relevant/not relevant, or 1/0 for the target item), $\mathrm{rel}_i$ is 1 if the item at rank $i$ is the target item, and 0 otherwise.
      • $\mathrm{rel}_{i_{\text{ideal}}}$: The relevance score of the item at position $i$ in the ideal (perfectly sorted by relevance) recommendation list. For leave-one-out evaluation, there is only one relevant item (the ground-truth next item), so its $\mathrm{rel}$ is 1 and all others are 0. Thus $\mathrm{IDCG@K} = \frac{2^1 - 1}{\log_2(1+1)} = 1$.
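
A compact sketch of both metrics under the leave-one-out protocol, where each user has a single ground-truth item; the toy input at the end is purely illustrative.

```python
import math

def hr_ndcg_at_k(ranked_lists, targets, k=10):
    """ranked_lists: per-user item rankings; targets: the held-out ground-truth item per user."""
    hits, ndcg = 0.0, 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = ranked[:k]
        if target in top_k:
            hits += 1
            rank = top_k.index(target)          # 0-based position of the target
            ndcg += 1.0 / math.log2(rank + 2)   # IDCG = 1 when there is a single relevant item
    n = len(targets)
    return hits / n, ndcg / n

# Two users, K = 5: the first target is ranked third, the second is missed entirely.
print(hr_ndcg_at_k([[3, 7, 9], [1, 2, 8]], targets=[9, 4], k=5))   # -> (0.5, 0.25)
```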

5.3. Baselines

To evaluate MACRec, the authors compare it against a diverse set of representative recent methods:

  • Traditional Sequential Recommendation Models:

    • BERT4Rec (Sun et al. 2019): A self-supervised sequential recommender based on BERT that uses masked item prediction.
    • SASRec (Kang and McAuley 2018): A self-attentive sequential recommender that applies the Transformer encoder to model sequential interactions.
    • FDSA (Zhang et al. 2019): Feature-level Deeper Self-Attention Network for sequential recommendation.
    • S3-Rec (Zhou et al. 2020): Self-supervised learning for sequential recommendation with mutual information maximization.
  • Multimodal Sequential Recommendation Models:

    • MISSRec (Wang et al. 2023a): Pre-training and transferring multi-modal interest-aware sequence representation for recommendation.
    • P5 (Geng et al. 2022): Recommendation as Language Processing, a unified pretrain, personalized prompt & predict paradigm. While not strictly multimodal in its original form, it's a strong LLM-based baseline.
    • VIP5 (Geng et al. 2023): Extends P5 to multimodal settings, positioning it as a multimodal foundation model for recommendation.
  • Generative Recommendation (GR) Models:

    • TIGER (Rajput et al. 2023): An early generative retrieval model that discretizes item sequences into tokens for generative recommendation.
    • MQL4GRec (Zhai et al. 2025): A multimodal generative recommendation model that encodes multimodal and cross-domain item information into a unified quantized language. This is the most direct and advanced multimodal GR baseline against which MACRec aims to show significant improvements.

5.4. Implementation Details

  • Fair Comparison: For MQL4GRec, the authors did not utilize pre-training on millions of additional-category datasets to ensure a fair comparison under similar data conditions.
  • Feature Extraction:
    • Text features are obtained using LLaMA (Touvron et al. 2023).
    • Image features are obtained using ViT-L/14 (Vision Transformer Large, patch size 14) (Dosovitskiy et al. 2020).
  • RQ-VAE Configuration:
    • Codebook size $M$: 256.
    • Number of RQ layers $L$: 4.
    • Optimizer: AdamW.
    • Batch size: 1024.
    • Learning rate: 0.001.
    • Number of K-means clusters $K$: 512 (for pseudo-label generation).
  • GR Model Backbone: T5 (Text-to-Text Transfer Transformer) is used as the backbone.
    • Encoder and decoder each have 4 Transformer layers.
    • 6 attention heads per layer.
    • Attention head dimension: 64.
  • Hyperparameters for MACRec Losses:
    • Layer-wise contrastive weights $\lambda_{\mathrm{con}}^l$: $\lambda_{\mathrm{con}}^{0,1} = 0$, $\lambda_{\mathrm{con}}^{2,3} = 0.1$. That is, the contrastive loss is applied only to the last two RQ layers (layer indices 2 and 3).
    • Alignment loss weight $\lambda_{\mathrm{align}}$: 0.001.
    • Implicit alignment loss weight $\lambda_{\mathrm{implicit}}$: 0.01.
    • Temperature parameter $\tau$: 0.1 (for all contrastive losses).
  • Runs: Results are averaged over five random seeds to ensure robustness.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that MACRec consistently achieves superior performance across all three datasets when compared to a wide range of state-of-the-art baselines, including traditional sequential models and other generative recommendation approaches.

The following are the results from Table 2 of the original paper:

Models are grouped as Sequential Rec. (BERT4Rec, SASRec, FDSA, S3-Rec), Multimodal Seq. Rec. (MISSRec, P5-CID, VIP5), and Generative Rec. (TIGER, MQL4GRec, MACRec).

| Dataset | Metric | BERT4Rec | SASRec | FDSA | S3-Rec | MISSRec | P5-CID | VIP5 | TIGER | MQL4GRec | MACRec |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Instruments | HR@1 | 0.0450 | 0.0318 | 0.0530 | 0.0339 | 0.0723 | 0.0512 | 0.0737 | 0.0754 | 0.0763 | 0.0819* |
| Instruments | HR@5 | 0.0856 | 0.0946 | 0.0987 | 0.0937 | 0.1089 | 0.0839 | 0.0892 | 0.1007 | 0.1058 | 0.1110* |
| Instruments | HR@10 | 0.1081 | 0.1233 | 0.1249 | 0.1123 | 0.1361 | 0.1119 | 0.1071 | 0.1221 | 0.1291 | 0.1363* |
| Instruments | NDCG@5 | 0.0667 | 0.0654 | 0.0775 | 0.0693 | 0.0797 | 0.0678 | 0.0815 | 0.0882 | 0.0902 | 0.0965* |
| Instruments | NDCG@10 | 0.0739 | 0.0746 | 0.0859 | 0.0743 | 0.0880 | 0.0704 | 0.0872 | 0.0950 | 0.0997 | 0.1046* |
| Arts | HR@1 | 0.0289 | 0.0212 | 0.0380 | 0.0172 | 0.0479 | 0.0421 | 0.0474 | 0.0532 | 0.0626 | 0.0685* |
| Arts | HR@5 | 0.0697 | 0.0951 | 0.0832 | 0.0739 | 0.1021 | 0.0713 | 0.0704 | 0.0894 | 0.1167 | 0.1254* |
| Arts | HR@10 | 0.0922 | 0.1250 | 0.1190 | 0.1030 | 0.1321 | 0.0994 | 0.0959 | 0.1167 | 0.1254 | 0.1329* |
| Arts | NDCG@5 | 0.0502 | 0.0610 | 0.0583 | 0.0511 | 0.0699 | 0.0607 | 0.0586 | 0.0718 | 0.0816 | 0.0868* |
| Arts | NDCG@10 | 0.0575 | 0.0706 | 0.0695 | 0.0630 | 0.0815 | 0.0662 | 0.0635 | 0.0806 | 0.0898 | 0.0953* |
| Games | HR@1 | 0.0115 | 0.0069 | 0.0163 | 0.0136 | 0.0201 | 0.0169 | 0.0173 | 0.0166 | 0.0200 | 0.0208* |
| Games | HR@5 | 0.0426 | 0.0587 | 0.0614 | 0.0527 | 0.0674 | 0.0532 | 0.0480 | 0.0523 | 0.0645 | 0.0671* |
| Games | HR@10 | 0.0725 | 0.0985 | 0.0988 | 0.0903 | 0.1048 | 0.0824 | 0.0758 | 0.0857 | 0.1007 | 0.1078* |
| Games | NDCG@5 | 0.0270 | 0.0333 | 0.0389 | 0.0351 | 0.0385 | 0.0331 | 0.0328 | 0.0345 | 0.0421 | 0.0435* |
| Games | NDCG@10 | 0.0366 | 0.0461 | 0.0509 | 0.0468 | 0.0499 | 0.0454 | 0.0418 | 0.0453 | 0.0538 | 0.0565* |

Key observations from Table 2:

  1. Overall Superiority: MACRec consistently achieves the best performance across all three datasets (Instruments, Arts, Games) and all evaluation metrics (HR@1, HR@5, HR@10, NDCG@5, NDCG@10). The asterisk (*) indicates statistical significance ($p < 0.05$) against the best baseline, confirming its robust advantage.
  2. Advantage over Multimodal Generative Baselines: MACRec significantly outperforms MQL4GRec, which is the closest state-of-the-art multimodal generative recommendation model. This indicates that MACRec's novel cross-modal quantization for semantic ID learning and its multi-aspect alignment training strategy are highly effective in enhancing recommendation performance. For example, on the Instruments dataset, MACRec improves HR@10 from MQL4GRec's 0.1291 to 0.1363, and NDCG@10 from 0.0997 to 0.1046.
  3. Improved NDCG: Compared to traditional multimodal sequential recommendation models (like MISSRec, P5-CID, VIP5), MACRec shows remarkable improvement, especially in NDCG. NDCG is a more sensitive metric to ranking quality. This suggests that MACRec's multimodal generative recommendation framework can more accurately recommend items that are not only relevant but also highly preferred by users and placed at higher ranks.
  4. Generative Paradigm Effectiveness: Generally, Generative Recommendation models (TIGER, MQL4GRec, MACRec) tend to outperform traditional sequential recommendation models (BERT4Rec, SASRec, FDSA, S3-Rec) and even some multimodal sequential models, highlighting the potential of the generative paradigm. MACRec pushes this paradigm further by effectively integrating multimodal information.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Ablation Study (RQ2)

An ablation study was conducted to understand the impact of different components of MACRec on its performance. The study evaluated the HR@10 metric on the three datasets by removing specific loss alignment strategies.

The following are the results from Table 3 of the original paper:

Model Instruments Arts Games
MACRec 0.1363 0.1329 0.1078
w/o Lcon 0.1289 0.1283 0.1018
w/o Lalign 0.1310 0.1301 0.1026
w/o Limplicit 0.1312 0.1296 0.1042
w/o Explicit Alignment 0.1296 0.1299 0.1037

Observations from Table 3:

  1. Effectiveness of All Modules: Removing any of the proposed alignment modules (Lcon, Lalign, Limplicit, Explicit Alignment) leads to a discernible performance degradation across all three datasets. This confirms that each component contributes positively to MACRec's overall effectiveness.
  2. Dominant Impact of Cross-modal Contrastive Quantization (Lcon): The largest performance drop is observed when Lcon (the cross-modal contrastive loss during quantization) is removed. For example, HR@10 on Instruments drops from 0.1363 to 0.1289. This highlights the critical role of MACRec's contrastive learning-based cross-modal quantization approach in creating high-quality semantic IDs by integrating multimodal information early in the ID learning process.
  3. Importance of Alignment Losses: Both implicit alignment (Limplicit) and explicit alignment strategies contribute significantly. Removing Limplicit or Explicit Alignment also leads to noticeable drops, underscoring their importance in enhancing the generative model's understanding of semantic IDs and enabling the learning of shared features across modalities during GR training.

6.2.2. Item Collision Analysis (RQ3)

The paper investigates the item collision rate during the quantization process, which refers to how often multiple distinct items are assigned the exact same semantic ID. A lower collision rate indicates better semantic discriminability and codebook utilization.

The following are the results from Table 4 of the original paper:

| Dataset | Text: MQL4GRec (%) | Text: MACRec (%) | Image: MQL4GRec (%) | Image: MACRec (%) |
|---|---|---|---|---|
| Instruments | 3.23 | 2.76 | 3.71 | 2.38 |
| Arts | 5.15 | 4.24 | 5.71 | 3.29 |
| Games | 25.24 | 2.91 | 26.10 | 3.51 |

Observations from Table 4:

  1. Reduced Collision Rates: MACRec consistently achieves significantly lower item ID collision rates for both text and image modalities across all datasets compared to MQL4GRec. For example, on the Games dataset, MACRec reduces the text collision rate from 25.24% (MQL4GRec) to 2.91% and the image collision rate from 26.10% to 3.51%.
  2. Enhanced Codebook Usability: This reduction in collision rate strongly suggests that MACRec effectively leverages the complementarity between different modalities during quantization. By guiding the ID learning process with cross-modal contrastive learning, MACRec enables a more balanced and distinct allocation of semantic IDs, thereby minimizing ambiguity and improving codebook usability.

6.2.3. Code Assignment Distribution (RQ4)

The distribution of codewords (individual semantic IDs) to items is crucial for effective codebook utilization. An ideal distribution would be relatively even, indicating that most codewords are used and no single codeword is over-represented (which would suggest a lack of discriminative power).

The following figure (Figure 4 from the original paper) shows the code assignment distribution on the 2nd RQ layer:

Figure 4: Code assignment distribution on the 2nd RQ layer. The chart compares how MQL4GRec and MACRec assign codes across bucket indices (16 codes per bucket, x-axis) against the number of items (y-axis), shown separately for the text and image modalities.

As shown in Figure 4, which visualizes the code assignment distribution on the 2nd RQ layer, MACRec demonstrates a more uniform distribution of items across codewords compared to MQL4GRec. The red bars represent text semantic IDs, and blue bars represent visual semantic IDs. MQL4GRec shows a more skewed distribution, with some codewords being assigned a very high number of items while many others are underutilized. In contrast, MACRec's distribution appears flatter and more spread out. This indicates that MACRec's cross-modal quantization approach leads to better codebook utilization by encouraging a diverse and balanced assignment of semantic IDs, reflecting its superior semantic representation capabilities.

6.2.4. Parameter Analysis (RQ5)

The paper also analyzes the impact of key hyperparameters on MACRec's performance, specifically on the Instruments dataset, using HR@10 and NDCG@10 as metrics.

The following figure (Figure 3 from the original paper) shows the performance of MACRec over different hyper-parameters on Instruments:

Figure 3: Performance of MACRec over different hyper-parameters on Instruments. (The figure shows HR@10 and NDCG@10, rendered as bar charts, under varying codebook size, semantic ID length, the starting RQ layer for $\mathcal{L}_{\mathrm{con}}$, and the weights of $\mathcal{L}_{\mathrm{con}}$, $\mathcal{L}_{\mathrm{align}}$, and $\mathcal{L}_{\mathrm{implicit}}$; each subplot reports the concrete values under the corresponding setting.)

Observations from Figure 3:

  1. Codebook Size: Both very small and very large codebook sizes degrade performance. A small codebook limits the quantization space and the model's ability to capture diverse semantic associations. An overly large codebook can dilute token exposure, making it harder for the model to learn robust representations due to sparse usage of many codewords.
  2. Semantic ID Length: Similarly, extremely short semantic IDs fail to capture comprehensive semantics, leading to a loss of information. Conversely, excessively long semantic IDs complicate learning by expanding the generation space, making the next-token prediction task more challenging and reducing performance. There is an optimal length that balances expressiveness and learnability.
  3. Starting Layer for Lcon: The cross-modal contrastive loss (Lconl\mathcal{L}_{\mathrm{con}}^l) is most effective when applied starting from the third RQ layer. Applying it from earlier layers (0th or 1st) or later layers yields suboptimal results. This suggests that contrastive learning is most beneficial in later RQ layers where fine-grained semantic residuals are processed, allowing these layers to leverage cross-modal signals to compensate for semantic loss without interfering with the coarser-grained quantization of earlier layers.
  4. Weights of Contrastive Losses: Each of the three contrastive losses ($\mathcal{L}_{\mathrm{con}}^l$, $\mathcal{L}_{\mathrm{align}}$, $\mathcal{L}_{\mathrm{implicit}}$) has an optimal weight ($\lambda_{\mathrm{con}}^l$, $\lambda_{\mathrm{align}}$, and $\lambda_{\mathrm{implicit}}$, respectively). Higher weights strengthen modality fusion and alignment but, if set too high, can over-constrain the model or introduce noise; lower weights lead to insufficient cross-modal interaction and forfeit the benefits of multimodality. Careful tuning is therefore needed to strike the right balance for effective multimodal integration; a minimal sketch of such layer-wise loss weighting follows this list.
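To make the role of these per-layer weights concrete, the following is a minimal sketch of a symmetric InfoNCE-style cross-modal contrastive loss computed per RQ layer and combined with layer-specific weights (e.g., zero weights on early layers). The loss form, temperature, and weight values are illustrative assumptions, not the paper's exact implementation, which may also define positives differently (e.g., via pseudo-labels).

```python
import torch
import torch.nn.functional as F

def info_nce(text_z, image_z, temperature=0.07):
    """Symmetric InfoNCE between batch-aligned text and image embeddings of shape (B, d).

    Diagonal pairs (the same item seen in both modalities) are treated as positives.
    """
    text_z = F.normalize(text_z, dim=-1)
    image_z = F.normalize(image_z, dim=-1)
    logits = text_z @ image_z.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(text_z.size(0), device=text_z.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def layered_contrastive_loss(text_layers, image_layers, layer_weights):
    """text_layers / image_layers: lists with one (B, d) residual embedding per RQ layer.

    layer_weights: e.g. [0.0, 0.0, 0.1, 0.1] activates the loss only on later layers.
    """
    return sum(w * info_nce(t, v)
               for w, t, v in zip(layer_weights, text_layers, image_layers) if w > 0)
```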

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), a novel framework designed to address the limitations of existing Generative Recommendation (GR) methods in effectively integrating multimodal information. The core idea of MACRec is to enhance cross-modal alignment and interaction at two critical stages: semantic ID learning and generative model training.

Specifically, MACRec proposes a cross-modal quantization method that incorporates contrastive learning into residual quantization (RQ) and reconstruction. This approach facilitates the construction of semantic IDs that are not only hierarchically meaningful but also more discriminative and less prone to conflicts, thereby improving codebook usability. Furthermore, MACRec integrates multi-aspect cross-modal alignments (both implicit alignment in the latent space and explicit alignment through generative tasks) during the GR model's training process. These alignments enhance the model's understanding of sequential multimodal information.

Extensive experiments conducted on three well-known recommendation datasets demonstrate the superior performance of MACRec compared to state-of-the-art GR models. Additional analyses confirm MACRec's advantages in reducing item collision rates and achieving more balanced code assignment distributions, validating its effectiveness in utilizing codebook capacity and representing item semantics.

7.2. Limitations & Future Work

The paper primarily focuses on presenting the solution and its effectiveness, and does not explicitly list a "Limitations" or "Future Work" section. However, based on the context and common challenges in the field, potential areas could include:

  • Computational Cost: RQ-VAE with multiple layers and separate processing for multiple modalities (text and image in this case) can be computationally intensive, especially during ID learning and if the item catalog is very large.

  • Scalability: While the method shows strong performance on the datasets used, scaling to industrial-level recommendation systems with millions or billions of items and complex multimodal data (e.g., audio, video, user-generated content) might introduce new challenges.

  • Dependency on Pre-trained Models: The quality of initial text and image embeddings heavily relies on powerful pre-trained models like LLaMA and ViT. The performance of MACRec could be sensitive to the choice and capabilities of these foundational models.

  • Interpretability of Semantic IDs: While MACRec aims for "hierarchically meaningful semantic IDs," the direct interpretability of these discrete token sequences for human understanding or debugging might still be limited, unlike natural language.

  • Generalization to Other Multimodal Data: The current work focuses on text and images. Extending MACRec to incorporate other modalities (e.g., audio, video for music/movie recommendation) would require careful adaptation of the cross-modal quantization and alignment strategies.

Future research directions could explore:

  • Developing more efficient quantization mechanisms for multimodal data to reduce computational overhead.

  • Investigating methods for dynamically adjusting semantic ID length or codebook size based on item complexity or dataset characteristics.

  • Exploring adaptive weighting strategies for the various alignment losses instead of fixed hyperparameters.

  • Applying MACRec to more diverse multimodal recommendation scenarios or other generative tasks beyond next-item prediction.

  • Enhancing the interpretability of the generated semantic IDs to provide more transparent recommendation explanations.

7.3. Personal Insights & Critique

This paper presents a strong and well-reasoned approach to integrating multimodal information into Generative Recommendation. The rigorous integration at both the ID learning and GR training stages, using a combination of cross-modal contrastive learning and multi-aspect alignment, is a significant step forward. The empirical results clearly validate the effectiveness of MACRec, especially the substantial reduction in item collision rates and the improved code assignment distribution, which are crucial indicators of high-quality semantic representations.

Inspirations & Transferability:

  • The idea of applying cross-modal contrastive learning directly within the residual quantization process is highly innovative. This concept could be transferred to other domains requiring robust discrete representations of multimodal data, such as multimodal retrieval or multimodal content generation.

  • The multi-aspect alignment strategy (implicit and explicit) for generative models is a generalizable framework. It could be adapted for any Seq2Seq model that needs to learn cross-modal relationships from discrete tokens, even outside of recommendation, for example, in multimodal dialogue systems or multimodal captioning where different modalities need to be aligned at both latent and task-specific levels.

  • The benefits of carefully constructed semantic IDs for generative retrieval are evident. This emphasizes the importance of the quantization stage in any generative AI application that relies on discrete tokens.

Potential Issues & Areas for Improvement:

  • Hyperparameter Sensitivity: The paper details several critical hyperparameters (e.g., codebook size, ID length, loss weights, starting layer for Lcon, temperature). Optimal performance relies heavily on their careful tuning, which can be a time-consuming process. Further work could explore adaptive or meta-learning approaches for hyperparameter optimization.

  • Complexity of Pseudo-label Generation: The initial K-means clustering for pseudo-label generation is a heuristic step, and its quality directly influences the subsequent contrastive learning. Investigating more sophisticated or adaptive pseudo-labeling strategies, perhaps dynamically adjusted during training, could be beneficial; a brief sketch of the basic K-means pseudo-labeling step appears after this list.

  • Scaling of Cross-modal Contrastive Loss: While effective, the cross-modal contrastive loss in RQ layers requires careful management. The choice to apply it only to later layers (λcon0,1=0,λcon2,3=0.1\lambda_{\mathrm{con}}^{0,1} = 0, \lambda_{\mathrm{con}}^{2,3} = 0.1) suggests a balance to prevent over-constraining initial, coarser representations. The optimal balance might vary significantly across different datasets or modalities.

  • Computational Cost of Inference: While the model is generative, the constrained beam search and subsequent ensemble across modalities can add latency during inference, which is a critical factor in real-time recommendation systems.

  • Lack of User-Modality Preference: The model implicitly aligns modalities. It doesn't explicitly learn if a user prefers visual or textual aspects more strongly for certain item types. Incorporating user-specific modality preferences could be a valuable extension.
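For context on the pseudo-label point above, here is a minimal sketch of producing K-means pseudo-labels from pre-trained item embeddings for use in contrastive learning. The clustering library, number of clusters, and the choice to cluster pre-trained embeddings are assumptions for illustration rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_pseudo_labels(item_embeddings, n_clusters=64, seed=0):
    """Cluster pre-trained item embeddings of shape (N, d) and return one pseudo-label per item.

    Items sharing a label can then be treated as positives (or excluded as negatives)
    when constructing contrastive pairs.
    """
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return km.fit_predict(np.asarray(item_embeddings))

# Toy example: 1,000 items with 32-dimensional embeddings.
labels = kmeans_pseudo_labels(np.random.rand(1000, 32), n_clusters=16)
print(labels.shape)  # (1000,): one cluster id per item
```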

Overall, MACRec provides a solid foundation for future research in multimodal generative recommendation by offering a systematic and effective way to harness the rich information embedded in diverse modalities.
