Multimodal Quantitative Language for Generative Recommendation
TL;DR Summary
This paper introduces MQL4GRec, a method that converts items from various domains and modalities into a unified 'quantitative language' to enable recommendation knowledge transfer and alleviate cold-start issues, achieving NDCG improvements of up to 14.82% over baseline models.
Abstract
Generative recommendation has emerged as a promising paradigm aiming at directly generating the identifiers of the target candidates. Most existing methods attempt to leverage prior knowledge embedded in Pre-trained Language Models (PLMs) to improve the recommendation performance. However, they often fail to accommodate the differences between the general linguistic knowledge of PLMs and the specific needs of recommendation systems. Moreover, they rarely consider the complementary knowledge between the multimodal information of items, which represents the multi-faceted preferences of users. To facilitate efficient recommendation knowledge transfer, we propose a novel approach called Multimodal Quantitative Language for Generative Recommendation (MQL4GRec). Our key idea is to transform items from different domains and modalities into a unified language, which can serve as a bridge for transferring recommendation knowledge. Specifically, we first introduce quantitative translators to convert the text and image content of items from various domains into a new and concise language, known as quantitative language, with all items sharing the same vocabulary. Then, we design a series of quantitative language generation tasks to enrich quantitative language with semantic information and prior knowledge. Finally, we achieve the transfer of recommendation knowledge from different domains and modalities to the recommendation task through pre-training and fine-tuning. We evaluate the effectiveness of MQL4GRec through extensive experiments and comparisons with existing methods, achieving improvements over the baseline by 11.18%, 14.82%, and 7.95% on the NDCG metric across three different datasets, respectively.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Multimodal Quantitative Language for Generative Recommendation
1.2. Authors
Jianyang Zhai, Zi-Feng Mai, Chang-Dong Wang, Feidiao Yang, Xiawu Zheng, Hui Li, and Yonghong Tian. The authors represent a collaboration between several prestigious Chinese institutions: Sun Yat-sen University, Pengcheng Laboratory, Xiamen University, and Peking University.
1.3. Journal/Conference
This paper was published as a preprint on arXiv in 2025. While not yet appearing in a peer-reviewed venue at the time of this analysis, the author affiliations and the technical rigor suggest it is intended for a top-tier artificial intelligence or data mining conference (such as SIGIR, KDD, or RecSys).
1.4. Publication Year
2025
1.5. Abstract
Generative recommendation is a new paradigm that directly predicts item identifiers. Existing methods often rely on Pre-trained Language Models (PLMs), but these models struggle with the gap between general language knowledge and specific recommendation needs. Furthermore, they rarely utilize multimodal information (like images). This paper proposes MQL4GRec, which translates items from different domains and modalities into a unified quantitative language. By using quantitative translators and a series of generation tasks (pre-training and fine-tuning), the model transfers recommendation knowledge effectively. Experiments show improvements of up to 14.82% on the NDCG metric compared to baselines.
1.6. Original Source Link
- Original Source (arXiv): https://arxiv.org/abs/2504.05314
- PDF Link: https://arxiv.org/pdf/2504.05314v1.pdf
2. Executive Summary
2.1. Background & Motivation
Traditional recommendation systems often rely on ID-based Recommendation (IDRec), where every user and item is assigned a unique, meaningless ID. However, this approach suffers from the cold start problem (it cannot recommend new items with no history) and lacks transferability (knowledge from one store cannot easily be used in another).
Recent Generative Recommendation attempts to solve this by treating recommendation as a language generation task, using models like GPT or T5. However, two major gaps remain:
- Semantic Gap: The knowledge inside a language model (e.g., how to write a poem) is different from the logic of a recommender (e.g., people who buy milk often buy bread).
- Modality Gap: Most models focus on text, ignoring images and other visual cues that heavily influence user choices.
The motivation of this paper is to create a "bridge"—a shared language that represents both text and images across different domains—allowing the model to learn universal recommendation patterns.
2.2. Main Contributions / Findings
- Proposed MQL4GRec: A novel framework that converts multimodal item content (text and images) into a unified quantitative language.
- Quantitative Translators: Introduced a method using RQ-VAE to compress complex features into discrete tokens that act as item identifiers.
- Knowledge Transfer Tasks: Designed specific generation tasks (symmetric, asymmetric, and alignment) to teach the model how different modalities and domains relate to user preferences.
- Empirical Success: Achieved significant performance gains across multiple Amazon datasets, showing that pre-training on source domains improves performance on target (unseen) domains.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice needs to be familiar with the following:
- Generative Recommendation: Instead of ranking a list of candidates, the model "writes" the ID of the item it wants to recommend, similar to how a chatbot generates the next word.
- Pre-trained Language Models (PLMs): Models like LLaMA or T5 that have been trained on massive amounts of text. They are used here as the "brain" of the recommender.
- Vector Quantization (VQ): A process of mapping continuous numbers (vectors) to a fixed set of discrete categories (a "codebook"). Think of it like rounding a specific color (like "light navy blue") to the nearest standard crayon in a box of 64 (see the sketch after this list).
- Multimodal Learning: AI that can process more than one type of data, such as reading a product description (text) while "looking" at its photo (image).
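To make the VQ idea concrete, here is a minimal NumPy sketch of the nearest-codeword lookup described above; the codebook size, dimensions, and function name are illustrative choices, not taken from the paper.

```python
import numpy as np

def vector_quantize(x, codebook):
    """Map a continuous vector to its nearest codeword (the "crayon box" analogy).

    x:        (d,) continuous feature vector
    codebook: (K, d) matrix of K codewords
    Returns the index of the nearest codeword and the quantized vector.
    """
    # Squared Euclidean distance from x to every codeword.
    dists = np.sum((codebook - x) ** 2, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# Toy usage: a 4-dimensional feature and a codebook of 64 entries.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 4))
x = rng.normal(size=4)
idx, _ = vector_quantize(x, codebook)
print(f"vector mapped to codeword #{idx}")
```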
3.2. Previous Works
- P5 (Geng et al., 2022): A landmark work that treated all recommendation tasks (ranking, explanation, etc.) as a sequence-to-sequence problem using natural language.
- TIGER (Rajput et al., 2023): The first to use RQ-VAE to create "semantic IDs." It turned item embeddings into a series of discrete codes, but primarily focused on one modality.
- CLIP (Radford et al., 2021): A model that learns to associate images with text. The authors use CLIP's image encoder in this paper to extract visual features.
3.3. Technological Evolution
The field has moved from Collaborative Filtering (finding similar users) → Deep Learning (using neural networks over IDs) → Generative Recommendation (using language models to predict items). This paper represents the latest stage: Multimodal Generative Recommendation, where the model learns from every available piece of information (text, image, and cross-domain history).
4. Methodology
4.1. Principles
The core idea is to treat items not as IDs, but as "sentences" in a new Quantitative Language. If we can describe a "Guitar" and a "Piano" using the same vocabulary of tokens, the model can learn the relationship between them regardless of which category they belong to.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Quantitative Translator (Item to Tokens)
The first step is converting an item's raw content into the quantitative language.
- Encoding: The item's text (title/description) is passed through a frozen LLaMA encoder to obtain a text representation, and the item's image is passed through a ViT (Vision Transformer) to obtain a visual representation.
- Residual-Quantized Variational AutoEncoder (RQ-VAE): An RQ-VAE is trained to act as a "translator," mapping each continuous representation into a tuple of codewords. With $H$ quantization levels, level $i$ operates on the residual $r_i$ (where $r_1$ is the RQ-VAE encoder's output):

  $$c_i = \arg\min_{k} \left\| r_i - v_k^i \right\|_2^2, \qquad r_{i+1} = r_i - v_{c_i}^i .$$

  Here, $c_i$ is the codeword selected at level $i$, $r_i$ is the residual (what's left over) from the previous level, and $v_k^i$ are learnable vectors in the level-$i$ codebook.
- Vocabulary Construction: Text tokens receive lowercase prefixes and image tokens receive uppercase prefixes, so the text and image codes of all items form one shared dictionary.
The training loss for this translator is:

$$\mathcal{L}(h) = \| h - \hat{h} \|_2^2 + \sum_{i=1}^{H} \left( \left\| \mathrm{sg}[r_i] - v_{c_i}^i \right\|_2^2 + \beta \left\| r_i - \mathrm{sg}[v_{c_i}^i] \right\|_2^2 \right),$$

where $h$ is the original representation, $\hat{h}$ is the reconstructed one, and $\mathrm{sg}[\cdot]$ is the stop-gradient operator (which prevents gradients from flowing back into specific parts of the network during optimization).
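As a concrete illustration of the quantization step (a sketch, not the authors' released code), the following NumPy snippet walks one latent vector through $H$ codebook levels; the codebook sizes, level count, and names are assumptions.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize a latent vector z into H codeword indices, level by level.

    z:         (d,) latent vector produced by the RQ-VAE encoder
    codebooks: list of H arrays, each (K, d) -- one codebook per level
    Returns the codeword tuple (c_1, ..., c_H) and the final residual.
    """
    indices = []
    r = z.copy()                      # r_1 = encoder output
    for level_codebook in codebooks:
        dists = np.sum((level_codebook - r) ** 2, axis=1)
        c = int(np.argmin(dists))     # c_i = argmin_k ||r_i - v_k^i||^2
        indices.append(c)
        r = r - level_codebook[c]     # r_{i+1} = r_i - v_{c_i}^i
    return indices, r

# Toy usage: 3 levels, 256 codewords per level, 32-dim latent.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 32)) for _ in range(3)]
z = rng.normal(size=32)
codes, residual = residual_quantize(z, codebooks)
print("quantitative-language tokens:", codes)
```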
4.2.2. Handling Collisions
Sometimes, two different items might end up with the same code. To fix this, the authors calculate the distance from the residual vector to the codebook vectors and reallocate tokens based on a sorted list of proximity. This ensures every item has a unique "name" in the quantitative language.
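Below is a minimal sketch of one way such distance-based reallocation could work, assuming collisions are resolved at the last quantization level; the function and its arguments are hypothetical illustrations, not the paper's exact procedure.

```python
import numpy as np

def assign_unique_codes(residuals, last_codebook, base_codes):
    """Resolve collisions by reassigning the last-level token.

    residuals:     (N, d) residual vectors entering the last quantization level
    last_codebook: (K, d) last-level codebook
    base_codes:    list of N tuples -- code prefixes (c_1, ..., c_{H-1}) per item
    Returns a dict item_index -> full, unique code tuple.
    """
    assigned, taken = {}, set()
    for i, (r, prefix) in enumerate(zip(residuals, base_codes)):
        dists = np.sum((last_codebook - r) ** 2, axis=1)
        # Candidate last tokens, sorted from nearest to farthest codeword.
        for c in np.argsort(dists):
            code = (*prefix, int(c))
            if code not in taken:     # pick the closest codeword not yet used
                assigned[i] = code
                taken.add(code)
                break
    return assigned
```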
4.2.3. Quantitative Language Generation (QLG) Tasks
Once items are expressed as quantitative-language tokens, the model is trained on three types of tasks (illustrated in Figure 2):
- Next Item Generation (NIG): Given the token sequence of the user's history, predict the next item's tokens within the same modality (text → text, or image → image).
- Asymmetric Item Generation (AIG): Predict the next item's text tokens based on the user's image history, or vice versa. This forces the model to understand the link between how an item looks and how it is described.
- Quantitative Language Alignment (QLA): Directly translate an item's text tokens into its image tokens. This ensures the "Quantitative Language" is semantically grounded across modalities (see the sketch after this list for how such training pairs can be laid out).
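To make the three task families concrete, here is a small sketch of how (source, target) token sequences could be assembled from one user's history; the token strings, prompt layout, and function name are illustrative assumptions rather than the paper's exact format.

```python
def build_qlg_examples(history_text, history_image, target_text, target_image):
    """Build (source, target) token sequences for the three QLG task families.

    history_text / history_image: per interacted item, its text / image tokens,
                                  e.g. ["a12", "b7", "c3"] and ["A4", "B19", "C3"]
    target_text / target_image:   token lists of the item to predict
    """
    flat = lambda seqs: [tok for item in seqs for tok in item]
    examples = []
    # Next Item Generation (NIG): same modality in, same modality out.
    examples.append((flat(history_text),  target_text))
    examples.append((flat(history_image), target_image))
    # Asymmetric Item Generation (AIG): cross-modal history -> target.
    examples.append((flat(history_text),  target_image))
    examples.append((flat(history_image), target_text))
    # Quantitative Language Alignment (QLA): one item's text tokens -> its image tokens.
    examples.append((target_text, target_image))
    return examples
```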
The following figure (Figure 2 from the original paper) shows the overall framework:
This figure is an overview of the MQL4GRec framework. It shows how item content from different domains and modalities is converted into a unified quantitative language, the multiple quantitative translators and quantitative language generation tasks involved, and the knowledge-transfer steps carried out through pre-training and fine-tuning.
4.2.4. Training & Re-ranking
The model uses a standard sequence-to-sequence loss:

$$\mathcal{L} = - \sum_{j=1}^{|Y|} \log P\left(Y_j \mid X, Y_{<j}\right),$$

where $X$ is the input sequence and $Y$ is the token sequence of the target item.
Finally, for prediction, the model generates two candidate lists (one from the text task, one from the image task) and combines them with a re-ranking formula that gives higher priority to items appearing in both the visual and textual prediction lists (a sketch of one possible fusion scheme follows below).
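The paper's exact re-ranking formula is not reproduced here; the sketch below shows one plausible weighted score-fusion scheme that matches the stated intuition (items present in both lists accumulate score from both). The `alpha` weight and the function name are assumptions.

```python
def fuse_candidate_lists(text_ranked, image_ranked, alpha=0.5, k=10):
    """Merge two ranked candidate lists into one top-k recommendation list.

    text_ranked / image_ranked: lists of (item_id, score) pairs, best first,
                                e.g. produced by beam search over quantitative tokens
    alpha: weight on the text list's scores (assumed hyperparameter)
    """
    fused = {}
    for item, score in text_ranked:
        fused[item] = fused.get(item, 0.0) + alpha * score
    for item, score in image_ranked:
        fused[item] = fused.get(item, 0.0) + (1.0 - alpha) * score
    # Items present in both lists accumulate both contributions, so they rise.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]
```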
5. Experimental Setup
5.1. Datasets
The authors used the Amazon Product Reviews dataset, divided into source and target domains:
- Pre-training (source domains): Pet Supplies, Cell Phones, Automotive, Tools, Toys, and Sports.
- Fine-tuning (target domains): Musical Instruments, Arts/Crafts, and Video Games.
The following are the statistics from Table 6 of the original paper:
| Dataset | #Users | #Items | #Interactions | Sparsity | Avg. len |
|---|---|---|---|---|---|
| Pet | 183,697 | 31,986 | 1,571,284 | 99.97% | 8.55 |
| Cell | 123,885 | 38,298 | 873,966 | 99.98% | 7.05 |
| Instruments | 17,112 | 6,250 | 136,226 | 99.87% | 7.96 |
| Arts | 22,171 | 9,416 | 174,079 | 99.92% | 7.85 |
| Games | 42,259 | 13,839 | 373,514 | 99.94% | 8.84 |
5.2. Evaluation Metrics
The paper uses three standard metrics:
- HR@K (Hit Ratio):
  - Definition: Measures the percentage of users for whom the actual target item appears within the top-K recommended items.
  - Formula: $\mathrm{HR@K} = \frac{1}{|U|} \sum_{u \in U} \mathbb{1}\left(\text{target item of } u \text{ is in the top } K\right)$
  - Symbols: $|U|$ is the total number of users; the indicator $\mathbb{1}(\cdot)$ is 1 if the item is in the top $K$, else 0.
- Recall@K: Functionally similar to HR@K in this context.
- NDCG@K (Normalized Discounted Cumulative Gain):
  - Definition: Not only checks whether the item is in the list, but also rewards the model more if the correct item is at the very top (rank 1) compared to the bottom (rank 10).
  - Formula: $\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$, where $\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)}$.
  - Symbols: $rel_i$ is the relevance of the item at rank $i$; $\mathrm{IDCG@K}$ is the ideal DCG obtained if the ranking were perfect.

A small computation sketch for these metrics follows below.
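A minimal sketch of how HR@K and NDCG@K could be computed, assuming (as is typical for these benchmarks, though not spelled out here) a single held-out target item per user, so relevance is binary and IDCG@K equals 1.

```python
import math

def hr_and_ndcg_at_k(ranked_lists, targets, k=10):
    """Compute HR@K and NDCG@K for single-target (leave-one-out style) evaluation.

    ranked_lists: per-user list of recommended item ids, best first
    targets:      per-user held-out ground-truth item id
    """
    hits, ndcg = 0.0, 0.0
    for recs, target in zip(ranked_lists, targets):
        if target in recs[:k]:
            rank = recs.index(target) + 1        # 1-based position of the hit
            hits += 1.0
            ndcg += 1.0 / math.log2(rank + 1)    # (2^1 - 1) / log2(rank + 1); IDCG = 1
    n = len(targets)
    return hits / n, ndcg / n

# Toy usage: two users; the first user's target is ranked 3rd, the second misses.
print(hr_and_ndcg_at_k([["i3", "i7", "i1"], ["i9", "i2"]], ["i1", "i5"], k=10))
```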
5.3. Baselines
The model is compared against:
- Traditional sequential models: SASRec (self-attention based), BERT4Rec (bidirectional).
- Multimodal: MISSRec, VIP5.
- Generative retrieval: TIGER, P5-CID.
6. Results & Analysis
6.1. Core Results Analysis
MQL4GRec consistently outperformed all baselines. For example, on the Instruments dataset, it improved NDCG@10 by 11.58% over the best baseline. A key takeaway is that VIP5 (a multimodal baseline) actually struggled, likely because it tried to feed raw images into a text model. In contrast, MQL4GRec's "Quantitative Language" made the images digestible for the model.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Dataset | Metric | SASRec | VQ-Rec | MISSRec | TIGER | MQL4GRec | Improv. |
|---|---|---|---|---|---|---|---|
| Instruments | HR@10 | 0.1233 | 0.1357 | 0.1361 | 0.1221 | 0.1375 | +1.03% |
| Instruments | NDCG@10 | 0.0746 | 0.0891 | 0.0880 | 0.0950 | 0.1060 | +11.58% |
| Arts | HR@10 | 0.1250 | 0.1386 | 0.1321 | 0.1167 | 0.1327 | - |
| Arts | NDCG@10 | 0.0706 | 0.0844 | 0.0815 | 0.0806 | 0.0950 | +12.56% |
| Games | NDCG@5 | 0.0333 | 0.0242 | 0.0385 | 0.0345 | 0.0421 | +8.23% |
| Games | NDCG@10 | 0.0461 | 0.0329 | 0.0499 | 0.0453 | 0.0548 | +7.66% |
6.3. Ablation Studies
- Collision Handling: The authors' method of reallocating tokens based on residual distance proved superior to TIGER's method (adding a random index), which can break the semantic meaning of IDs.
- Pre-training Impact: Figure 3 shows that for "Instruments" and "Arts," more pre-training data leads to better performance. For "Games," however, the trend was flatter, suggesting a domain gap or potential overfitting.
7. Conclusion & Reflections
7.1. Conclusion Summary
MQL4GRec represents a significant advance in making recommendation models "universal." By converting different modalities and categories into a shared "Quantitative Language," it allows for effective knowledge transfer. The model doesn't just learn what one user likes; it learns how item appearances and descriptions relate to purchase behavior across the entire Amazon ecosystem.
7.2. Limitations & Future Work
- Inference Speed: Like most generative models, it relies on beam search to decode identifiers, which is slower at serving time than the simple inner-product scoring used by traditional retrieval models.
- Missing Content: The model relies on having a title or image. If an item has neither, the model cannot generate a Quantitative Language ID for it.
- Zero-Shot performance: While it showed some ability to recommend items without any fine-tuning (Zero-Shot), the performance was still very low, suggesting a need for larger models or more diverse pre-training.
7.3. Personal Insights & Critique
The "Quantitative Language" approach is brilliant because it treats the modality gap as a translation problem. In typical AI, we struggle to make sense of how a vector for "Blue" (image) relates to a vector for "Blue" (text). By forcing them into the same discrete vocabulary, this paper creates a mathematical Rosetta Stone for products.
However, a potential issue is Vocabulary Saturation. As the number of items grows to millions or billions, will 4 levels of codebooks be enough? The paper addresses collisions for thousands of items, but a real-world platform like Amazon might require a much deeper or more complex "Language" structure to avoid naming conflicts. Additionally, the drop in performance on the "Games" dataset during heavy pre-training (Figure 4) suggests that some domains are "too unique" and might actually be harmed by learning patterns from unrelated domains like "Pet Supplies."