
MULTIMODAL QUANTITATIVE LANGUAGE FOR GENERATIVE RECOMMENDATION

Published: 02/20/2025

TL;DR Summary

The MQL4GRec framework addresses limitations in generative recommendation systems by transforming item text and images into a unified 'quantitative language', enabling effective cross-modal knowledge transfer. Experiments show significant NDCG improvements, with gains of up to 14.82% across three datasets.

Abstract

Generative recommendation has emerged as a promising paradigm aiming at directly generating the identifiers of the target candidates. Most existing methods attempt to leverage prior knowledge embedded in Pre-trained Language Models (PLMs) to improve the recommendation performance. However, they often fail to accommodate the differences between the general linguistic knowledge of PLMs and the specific needs of recommendation systems. Moreover, they rarely consider the complementary knowledge between the multimodal information of items, which represents the multi-faceted preferences of users. To facilitate efficient recommendation knowledge transfer, we propose a novel approach called Multimodal Quantitative Language for Generative Recommendation (MQL4GRec). Our key idea is to transform items from different domains and modalities into a unified language, which can serve as a bridge for transferring recommendation knowledge. Specifically, we first introduce quantitative translators to convert the text and image content of items from various domains into a new and concise language, known as quantitative language, with all items sharing the same vocabulary. Then, we design a series of quantitative language generation tasks to enrich quantitative language with semantic information and prior knowledge. Finally, we achieve the transfer of recommendation knowledge from different domains and modalities to the recommendation task through pre-training and fine-tuning. We evaluate the effectiveness of MQL4GRec through extensive experiments and comparisons with existing methods, achieving improvements over the baseline by 11.18%, 14.82%, and 7.95% on the NDCG metric across three different datasets, respectively.


In-depth Reading


1. Bibliographic Information

1.1. Title

Multimodal Quantitative Language for Generative Recommendation

1.2. Authors

Jianyang Zhai, Zi-Feng Mai, Chang-Dong Wang, Feidiao Yang, Xiawu Zheng, Hui Li, Yonghong Tian

1.3. Journal/Conference

Published at (UTC): 2025-02-20. Note: While the specific venue name is not explicitly listed in the metadata provided, the paper structure and citation style are consistent with top-tier computer science conferences (e.g., AAAI, SIGIR, or ACM Multimedia).

1.4. Publication Year

2025

1.5. Abstract

This paper addresses the limitations of existing Generative Recommendation systems, which often struggle to bridge the gap between pre-trained language models (PLMs) and recommendation tasks, and fail to fully utilize multimodal information (text and images). The authors propose MQL4GRec (Multimodal Quantitative Language for Generative Recommendation). The core idea is to transform items' text and image content into a unified "quantitative language" (a sequence of discrete tokens) using a technique called Quantitative Translators. This allows different modalities to share a common vocabulary. The system then employs specific generation tasks (predicting the next item, cross-modal prediction) to transfer knowledge. Experiments on Amazon datasets show significant improvements (up to 14.82% in NDCG) over state-of-the-art baselines.


2. Executive Summary

2.1. Background & Motivation

  • The Problem: Traditional recommendation systems rely on unique IDs for items (IDRec), which suffer from the "cold start" problem (new items have no history) and lack transferability. Recently, Generative Recommendation has emerged, where models directly generate the ID of the target item.
  • Current Gaps:
    1. Modality Gap: Existing methods try to use Pre-trained Language Models (PLMs) like T5 or LLaMA. However, these models understand natural language, not arbitrary item IDs or raw pixel data.
    2. Multimodal Neglect: Items often have images and text. Most models either ignore images or fail to integrate them effectively with text, missing out on capturing a user's multi-faceted preferences.
  • The Innovation: Instead of using meaningless IDs or raw complex data, this paper proposes converting everything (images and text) into a unified, concise "language" of discrete tokens. This acts as a bridge, allowing a single model to "speak" both image and text dialects to recommend items.

2.2. Main Contributions / Findings

  1. MQL4GRec Framework: A novel approach that translates multimodal item content into a Quantitative Language using a shared vocabulary. This breaks down barriers between different domains and data types.

  2. Quantitative Generation Tasks: The authors design specific training tasks—such as generating text tokens from image inputs (Asymmetric Generation)—to force the model to learn deep semantic connections between modalities.

  3. Collision Handling Strategy: A refined method to handle cases where different items map to the same token sequence, ensuring more accurate identification.

  4. Performance: The method outperforms strong baselines (including TIGER and VIP5) on three Amazon datasets, showing gains of 11.18%, 14.82%, and 7.95% in NDCG metrics.

    The following figure (Figure 1 from the original paper) illustrates the core concept: transforming diverse inputs (Movies, Arts) into a unified code sequence to bridge the knowledge gap.

    Figure 1: Illustration of our MQL4GRec. We translate items from different domains and modalities into a new unified language, which can then serve as a bridge for transferring recommendation knowledge. (The figure shows example items from the Arts and Movies categories expressed in the shared quantitative language.)

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand MQL4GRec, one must grasp three core concepts:

  1. Generative Recommendation:

    • Traditional: The model calculates a score for every possible item and ranks them.
    • Generative: The model treats the user's history as a sentence and "writes" the name (or ID) of the next item, just like ChatGPT writes the next word. The output is a sequence of tokens.
  2. Vector Quantization (VQ) & RQ-VAE:

    • Concept: Computers usually represent images/text as continuous vectors (lists of decimal numbers like [0.12, -0.98, ...]). VQ converts these into a list of discrete integers (codes like [45, 12, 99]) from a fixed "codebook" (dictionary).
    • Residual Quantization (RQ): A hierarchical version. Imagine trying to locate a point on a map.
      • Level 1 Code: "North-West" (Coarse approximation).
      • Level 2 Code: "Top-Left corner of North-West" (Refines the error from Level 1).
      • Level 3 Code: "Specific building" (Further refinement).
    • RQ-VAE: A neural network that learns to compress data into these hierarchical codes (Semantic IDs) and reconstruct the original data from them.
  3. Transformer Architecture:

    • The backbone of modern NLP (like BERT, GPT). It uses an "Attention" mechanism to weigh the importance of different parts of the input sequence when generating output. MQL4GRec uses this to process the sequences of quantitative tokens.

3.2. Previous Works

  • ID-based Sequential Recommendation: Models like SASRec (Kang & McAuley, 2018) and BERT4Rec (Sun et al., 2019) use unique integer IDs. They are fast but cannot handle new items well (no pre-trained knowledge).
  • Generative Recommendation:
    • P5 (Geng et al., 2022): Converts recommendation into text tasks using T5. Uses raw item names or integer IDs.
    • TIGER (Rajput et al., 2023): A direct predecessor. It uses RQ-VAE to create "Semantic IDs" from item embeddings.
    • Difference: TIGER creates IDs from a single representation. MQL4GRec creates IDs from both text and images separately and aligns them.
  • Multimodal Recommendation:
    • VIP5 (Geng et al., 2023): Adds images to the P5 framework.
    • Difference: VIP5 struggles because visual features and PLM linguistic knowledge are hard to align. MQL4GRec solves this by converting images into the same token language as text.

3.3. Differentiation Analysis

The key innovation of MQL4GRec compared to TIGER and VIP5 is the Unified Quantitative Language.

  • Vs. TIGER: TIGER quantizes a generic item embedding. MQL4GRec quantizes Text and Images independently into a shared vocabulary, allowing the model to explicitly learn cross-modal relationships (e.g., predicting the "text code" of an item based on the "image codes" of history items).
  • Vs. VIP5: MQL4GRec discretizes continuous visual features, making them compatible with the token-based generation process of Large Language Models.

4. Methodology

4.1. Principles

The core philosophy is "Translation." The raw content of an item (pixels in an image, words in a description) is too complex and noisy for direct recommendation. MQL4GRec acts as a translator that converts these complex signals into a concise, unified language (a sequence of integers). Once everything is in this "Quantitative Language," standard sequence-to-sequence models can easily learn patterns and transfer knowledge between domains.

The overall framework is shown below (Figure 2 from the original paper). The left side shows the "Translation" (Quantization) process, and the right side shows the "Generation" (Recommendation) tasks.

Figure 2: The overall framework of MQL4GRec. We regard the quantizer as a translator, converting item content from different domains and modalities into a unified quantitative language, thus bridging the gap between them (left). Subsequently, we design a series of quantitative language generation tasks to facilitate the transfer of recommendation knowledge through pre-training and fine-tuning (right). (The figure also depicts the quantitative language generation tasks and the pre-training/fine-tuning procedure, in which a Transformer encoder-decoder generates the target token sequence, together with the embedding and quantitative language alignment components.)

4.2. Core Methodology In-depth

4.2.1. The Quantitative Translator (RQ-VAE)

The first step is to convert items into tokens. The authors use Residual-Quantized Variational AutoEncoder (RQ-VAE).

1. Encoding: For an item, we extract features using a pre-trained encoder.

  • Text: LLaMA is used to encode the title/description into a vector $h$.
  • Image: ViT (Vision Transformer) is used to encode the item image into a vector $h$.

2. Residual Quantization: The goal is to represent the continuous vector $z$ (derived from $h$) as a sequence of discrete codes. The system uses a codebook $\mathcal{C}$ containing vectors $\mathbf{v}_k$.

The quantization happens in $L$ levels (iterations). In each level, the model finds the codebook vector closest to the current "residual" (the remaining information not yet explained).

  • Step 1 (Level 1): Find the closest code $c_1$ for the input $z$: $$c_{1} = \underset{k}{\arg\min} \left\| z - \mathbf{v}_{k}^{1} \right\|_{2}^{2}$$ Here, $z$ is the input vector, $\mathbf{v}_{k}^{1}$ is the $k$-th vector in the level-1 codebook, and $\|\cdot\|_2^2$ is the squared Euclidean distance.

  • Step 2 (Calculate Residual): Calculate what information is left over ($\mathbf{r}_2$): $$\mathbf{r}_{2} = z - \mathbf{v}_{c_{1}}^{1}$$ $\mathbf{r}_2$ represents the "error" after the first approximation.

  • Step 3 (General Loop): For levels $i = 1$ to $L$, the code for level $i$ is the one closest to the current residual $\mathbf{r}_i$: $$c_{i} = \underset{k}{\arg\min} \left\| \mathbf{r}_{i} - \mathbf{v}_{k}^{i} \right\|_{2}^{2}$$ Then update the residual for the next level: $$\mathbf{r}_{i+1} = \mathbf{r}_{i} - \mathbf{v}_{c_{i}}^{i}$$ (Note: for the first step, $\mathbf{r}_1 = z$.)

3. Reconstruction: The quantized representation $\hat{z}$ is the sum of all selected code vectors: $$\hat{z} = \sum_{i=1}^{L} \mathbf{v}_{c_{i}}^{i}$$

4. Loss Function: The RQ-VAE is trained to minimize the difference between the original and reconstructed vector, plus a commitment loss that keeps the codebook stable: $$\mathcal{L}(h) = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{rqvae}}$$ $$\mathcal{L}_{\mathrm{recon}} = \| h - \hat{h} \|_{2}^{2}$$ $$\mathcal{L}_{\mathrm{rqvae}} = \sum_{i=1}^{L} \left( \| \mathrm{sg}[\mathbf{r}_{i}] - \mathbf{v}_{c_{i}}^{i} \|_{2}^{2} + \beta \| \mathbf{r}_{i} - \mathrm{sg}[\mathbf{v}_{c_{i}}^{i}] \|_{2}^{2} \right)$$

  • Symbol Explanation:
    • $h$: Original input representation.
    • $\hat{h}$: Reconstructed output from the decoder.
    • $\mathrm{sg}[\cdot]$: Stop-gradient operator (prevents backpropagation through that term, stabilizing training).
    • $\beta$: A hyperparameter weighting the commitment loss.

5. Vocabulary Construction: To distinguish between Text codes and Image codes in the unified language:

  • Text Codes: Prefixed with lowercase letters (e.g., <a_2>, <b_3>).
  • Image Codes: Prefixed with uppercase letters (e.g., <A_1>, <B_4>). The total vocabulary size is $2 \times L \times K$ (2 modalities, $L$ levels, $K$ codes per level).
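
To make steps 2-5 concrete, here is a minimal Python sketch of greedy residual quantization followed by token naming. It assumes pre-extracted feature vectors and randomly initialized codebooks; in MQL4GRec the codebooks are learned jointly by the RQ-VAE, so this illustrates the indexing logic rather than the authors' implementation.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Greedy residual quantization: at each level, pick the codebook entry
    closest to the current residual, then subtract it (c_i and r_{i+1} above)."""
    codes, residual = [], z.copy()
    for level_codebook in codebooks:                       # L levels, each of shape (K, d)
        dists = np.sum((level_codebook - residual) ** 2, axis=1)
        k = int(np.argmin(dists))                          # c_i = argmin_k ||r_i - v_k^i||^2
        codes.append(k)
        residual = residual - level_codebook[k]            # r_{i+1} = r_i - v_{c_i}^i
    return codes

def to_tokens(codes, modality):
    """Map level-wise code indices to shared-vocabulary tokens: lowercase
    prefixes for text codes, uppercase for image codes (illustrative naming)."""
    prefixes = "abcdefgh" if modality == "text" else "ABCDEFGH"
    return [f"<{prefixes[i]}_{k}>" for i, k in enumerate(codes)]

rng = np.random.default_rng(0)
L, K, d = 3, 256, 32                                       # levels, codes per level, feature dim
codebooks = [rng.normal(size=(K, d)) for _ in range(L)]    # stand-in for learned codebooks
text_feature = rng.normal(size=d)                          # stand-in for an LLaMA text embedding

codes = residual_quantize(text_feature, codebooks)
print(to_tokens(codes, "text"))                            # e.g. ['<a_17>', '<b_203>', '<c_9>']
```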

4.2.2. Handling Collisions

Sometimes, two different items map to the exact same sequence of codes (a "collision"), which makes them indistinguishable to the model.

Method:

  1. Identify $N$ colliding items.
  2. Calculate the distance $D$ between each item's residual vector and the code vectors at the final level $L$.
  3. Sort the items based on this distance.
  4. Reallocation: Assign the nearest code to the first item. If the best code is taken by another item, assign the next nearest code. If the last level runs out of codes, backtrack to the second-to-last level and reallocate.
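
The reallocation step can be sketched as follows; this is a simplified, hypothetical version that only re-assigns codes within the final level (the backtracking to earlier levels described in step 4 is omitted), with illustrative shapes and names.

```python
import numpy as np

def reallocate_last_level(residuals, last_codebook):
    """Re-assign final-level codes for N colliding items by nearest-available distance.
    residuals: (N, d) final-level residuals of the colliding items.
    last_codebook: (K, d) code vectors of the last quantization level."""
    dists = np.linalg.norm(residuals[:, None, :] - last_codebook[None, :, :], axis=-1)  # (N, K)
    order = np.argsort(dists.min(axis=1))        # items with the closest match are served first
    taken, assignment = set(), {}
    for item in order:
        for code in np.argsort(dists[item]):     # nearest code first, then next nearest
            if int(code) not in taken:
                taken.add(int(code))
                assignment[int(item)] = int(code)
                break
    return assignment

rng = np.random.default_rng(1)
print(reallocate_last_level(rng.normal(size=(4, 8)), rng.normal(size=(16, 8))))
```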

4.2.3. Quantitative Language Generation Tasks

The paper treats recommendation as a sequence generation problem. The input is a sequence of user history items (in Quantitative Language), and the output is the target item.

They design three types of tasks to train the model:

  1. Next Item Generation (NIG):

    • Subtask 1 (Text): Input History → Output: the target item's text tokens.
    • Subtask 2 (Image): Input History → Output: the target item's image tokens.
  2. Asymmetric Item Generation (AIG):

    • This forces cross-modal learning.
    • Text-to-Image: Input History (text tokens) → Output: target item (image tokens).
    • Image-to-Text: Input History (image tokens) → Output: target item (text tokens).
  3. Quantitative Language Alignment (QLA):

    • Explicit translation between modalities for the same item.
    • Input: Item X (text tokens) → Output: Item X (image tokens).
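
The sketch below shows how these task families can be instantiated as (source, target) token-sequence pairs for a seq2seq model. Item IDs and token values are toy placeholders; the real tokens come from the two quantitative translators.

```python
# Hypothetical per-item token sequences (3 levels per modality).
text_tokens  = {"i1": ["<a_3>", "<b_7>", "<c_1>"], "i2": ["<a_9>", "<b_2>", "<c_5>"], "i3": ["<a_4>", "<b_8>", "<c_6>"]}
image_tokens = {"i1": ["<A_5>", "<B_1>", "<C_2>"], "i2": ["<A_7>", "<B_3>", "<C_9>"], "i3": ["<A_2>", "<B_6>", "<C_4>"]}

def flatten(history, table):
    return [tok for item in history for tok in table[item]]

def build_examples(history, target):
    """Return (source, target) token sequences for the three task families."""
    return {
        "NIG_text":  (flatten(history, text_tokens),  text_tokens[target]),   # next item, text -> text
        "NIG_image": (flatten(history, image_tokens), image_tokens[target]),  # next item, image -> image
        "AIG_t2v":   (flatten(history, text_tokens),  image_tokens[target]),  # asymmetric: text history -> image target
        "AIG_v2t":   (flatten(history, image_tokens), text_tokens[target]),   # asymmetric: image history -> text target
        "QLA":       (text_tokens[target], image_tokens[target]),             # align one item's text and image tokens
    }

for name, (src, tgt) in build_examples(["i1", "i2"], "i3").items():
    print(name, src, "->", tgt)
```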

4.2.4. Training and Re-ranking

Training Objective: Standard negative log-likelihood (NLL) for sequence generation: $$\mathcal{L}_{\theta} = - \sum_{j=1}^{|\mathbf{Y}|} \log P_{\theta} \left( \mathbf{Y}_{j} \mid \mathbf{Y}_{<j}, \mathbf{X} \right)$$

  • $\theta$: Model parameters.
  • $\mathbf{X}$: Input sequence (user history).
  • $\mathbf{Y}$: Target sequence (target item tokens).
  • $\mathbf{Y}_{j}$: The $j$-th token to predict.
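
As a toy illustration of this objective, the sketch below sums the negative log-probability of each target token given the preceding ones; it uses random numbers in place of a real Transformer's per-step distributions.

```python
import numpy as np

def sequence_nll(step_probs, target_ids):
    """Negative log-likelihood of a target token sequence, summed over positions.
    step_probs[j] is the model's distribution over the vocabulary at step j,
    conditioned on the input X and the previous target tokens Y_{<j}."""
    return -sum(np.log(step_probs[j][t]) for j, t in enumerate(target_ids))

rng = np.random.default_rng(2)
logits = rng.normal(size=(3, 8))                                     # 3 target tokens, vocab of 8 (toy values)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax per step
print(sequence_nll(probs, target_ids=[5, 1, 3]))
```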

Re-ranking (Inference): The model generates two recommendation lists: one based on text prediction ($R_t$) and one based on image prediction ($R_v$). To combine them, the authors use a simple heuristic score boost: $$s(x) = \begin{cases} (s_{t}(x) + s_{v}(x))/2 + 1 & x \in R_{t}, x \in R_{v} \\ s_{t}(x) & x \in R_{t} \\ s_{v}(x) & x \in R_{v} \end{cases}$$

  • Logic: If an item appears in both the text-generated list and the image-generated list, it is highly likely to be the correct recommendation. Therefore, its score is averaged and boosted by adding 1 (a significant boost since scores are usually probabilities or log-probs).
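
A minimal sketch of this fusion rule, assuming each list is represented as a dict mapping item to its generation score (the score values here are illustrative):

```python
def fuse_rankings(text_scores, image_scores):
    """Merge the text-based and image-based candidate lists: items appearing in
    both lists receive the averaged score plus a bonus of 1."""
    fused = {}
    for item in set(text_scores) | set(image_scores):
        if item in text_scores and item in image_scores:
            fused[item] = (text_scores[item] + image_scores[item]) / 2 + 1.0
        elif item in text_scores:
            fused[item] = text_scores[item]
        else:
            fused[item] = image_scores[item]
    return sorted(fused, key=fused.get, reverse=True)

R_t = {"item_a": 0.62, "item_b": 0.40, "item_c": 0.31}   # toy text-based scores
R_v = {"item_a": 0.55, "item_d": 0.48}                   # toy image-based scores
print(fuse_rankings(R_t, R_v))                           # item_a is boosted: it appears in both lists
```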

5. Experimental Setup

5.1. Datasets

The authors use the Amazon Product Reviews dataset.

  • Source Domains (for Pre-training): 6 large categories including "Pet Supplies", "Cell Phones", "Automotive", "Tools", "Toys", "Sports".
  • Target Domains (for Evaluation): 3 categories:
    1. Musical Instruments (17k users, 6k items)

    2. Arts Crafts and Sewing (22k users, 9k items)

    3. Video Games (42k users, 13k items)

      Data Processing: Unpopular items (<5 interactions) are filtered. Maximum sequence length is 20.
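
A rough sketch of this preprocessing, assuming an interaction table with hypothetical user_id, item_id, and timestamp columns; the paper's exact filtering procedure (e.g., whether it is applied iteratively) may differ.

```python
import pandas as pd

def build_sequences(interactions: pd.DataFrame, min_count: int = 5, max_len: int = 20):
    """Drop unpopular items (< min_count interactions) and build per-user
    chronological item sequences truncated to the most recent max_len entries."""
    df = interactions.copy()
    df = df[df.groupby("item_id")["item_id"].transform("size") >= min_count]
    df = df.sort_values(["user_id", "timestamp"])
    return (
        df.groupby("user_id")["item_id"]
          .apply(lambda items: list(items)[-max_len:])
          .to_dict()
    )
```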

5.2. Evaluation Metrics

  1. Hit Ratio (HR@K):

    • Definition: Measures whether the ground-truth target item is present in the top-K recommended items. It is a binary metric (1 if present, 0 if not) averaged over all test users.
    • Formula: $$\text{HR@K} = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}(\text{target}_u \in R_{u, K})$$
    • Symbols: $U$ is the set of users, $\mathbb{I}(\cdot)$ is the indicator function (1 if true, 0 if false), and $R_{u,K}$ is the top-K recommendation list for user $u$.
  2. Normalized Discounted Cumulative Gain (NDCG@K):

    • Definition: Measures the quality of the ranking. It rewards the model more if the correct item is ranked higher (e.g., rank 1 is much better than rank 10).
    • Formula: $$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}, \qquad \text{DCG@K} = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)}$$
    • Symbols: $rel_i$ is the relevance of the item at position $i$ (1 for the target item, 0 otherwise). IDCG is the ideal DCG (where the target is ranked first).
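
Since each test case has a single ground-truth item, both metrics reduce to simple per-user computations, as in this minimal sketch:

```python
import math

def hr_at_k(target, ranked_list, k):
    """Hit Ratio@K: 1 if the ground-truth item appears in the top-K list, else 0."""
    return 1.0 if target in ranked_list[:k] else 0.0

def ndcg_at_k(target, ranked_list, k):
    """NDCG@K with one relevant item: 1/log2(rank+1) if the target is in the top-K
    (IDCG = 1 because the ideal ranking places the target at rank 1)."""
    for rank, item in enumerate(ranked_list[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

ranked = ["item_c", "item_a", "item_b"]
print(hr_at_k("item_a", ranked, 3), ndcg_at_k("item_a", ranked, 3))   # 1.0, ~0.63
```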

5.3. Baselines

The method is compared against 10 baselines:

  • Sequential (ID-based): GRU4Rec, BERT4Rec, SASRec, FDSA, S3-Rec.
  • Multimodal/Transferable: VQ-Rec, MISSRec.
  • Generative: P5-CID (Text-to-Text), VIP5 (Multimodal P5), TIGER (Generative with RQ-VAE).
  • Why these? TIGER is the direct competitor using VQ. VIP5 is the direct competitor using Multimodal Generative techniques.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that MQL4GRec consistently outperforms all baselines across all three datasets.

The following are the results from Table 1 of the original paper:

(Selected baselines only.)

| Dataset | Metric | SASRec | FDSA | VQ-Rec | VIP5 | TIGER | MQL4GRec | Improv. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Instruments | HR@1 | 0.0318 | 0.0530 | 0.0502 | 0.0737 | 0.0754 | 0.0833 | +10.48% |
| Instruments | HR@5 | 0.0946 | 0.0987 | 0.1062 | 0.0892 | 0.1007 | 0.1150 | +2.39% |
| Instruments | HR@10 | 0.1233 | 0.1249 | 0.1357 | 0.1071 | 0.1221 | 0.1375 | +1.03% |
| Instruments | NDCG@5 | 0.0654 | 0.0750 | 0.0796 | 0.0815 | 0.0882 | 0.0977 | +10.77% |
| Instruments | NDCG@10 | 0.0746 | 0.0859 | 0.0891 | 0.0872 | 0.0950 | 0.1060 | +11.58% |
| Arts | HR@1 | 0.0212 | 0.0380 | 0.0408 | 0.0474 | 0.0532 | 0.0672 | +26.32% |
| Arts | HR@5 | 0.0951 | 0.0832 | 0.1038 | 0.0704 | 0.0894 | 0.1037 | - |
| Arts | HR@10 | 0.1250 | 0.1190 | 0.1386 | 0.0859 | 0.1167 | 0.1327 | - |
| Arts | NDCG@5 | 0.0610 | 0.0583 | 0.0732 | 0.0586 | 0.0718 | 0.0857 | +17.08% |
| Arts | NDCG@10 | 0.0706 | 0.0695 | 0.0844 | 0.0635 | 0.0806 | 0.0950 | +12.56% |
| Games | HR@1 | 0.0069 | 0.0163 | 0.0075 | 0.0173 | 0.0166 | 0.0203 | +1.00% |
| Games | HR@5 | 0.0587 | 0.0614 | 0.0408 | 0.0480 | 0.0523 | 0.0637 | - |
| Games | HR@10 | 0.0985 | 0.0988 | 0.0679 | 0.0758 | 0.0857 | 0.1033 | - |
| Games | NDCG@5 | 0.0333 | 0.0389 | 0.0242 | 0.0328 | 0.0345 | 0.0421 | +8.23% |
| Games | NDCG@10 | 0.0461 | 0.0509 | 0.0329 | 0.0418 | 0.0453 | 0.0548 | +7.66% |

Key Observations:

  1. Beating TIGER: MQL4GRec consistently beats TIGER (the strongest baseline). This proves that adding auxiliary content (images/text) and using separate codebooks for them is better than just quantizing a single ID embedding.
  2. Failure of VIP5: Interestingly, VIP5 often performs worse than ID-based methods. The authors suggest this is due to the "modal gap"—the visual encoder's features don't align well with the PLM's text-based knowledge. MQL4GRec solves this by converting the image to tokens, effectively turning the image into a "foreign language" that the PLM can learn to speak.
  3. Significant Gains: The improvements in NDCG (ranking quality) are very high (>10% in Instruments and Arts). This implies MQL4GRec is very good at putting the correct item at the top of the list.

6.2. Ablation Studies

6.2.1. Handling Collisions

The authors compared their distance-based reallocation method against TIGER's method (adding an index suffix).

The following are the results from Table 2 of the original paper:

| Methods | Instruments HR@10 | Instruments NDCG@10 | Arts HR@10 | Arts NDCG@10 | Games HR@10 | Games NDCG@10 |
| --- | --- | --- | --- | --- | --- | --- |
| TIGER | 0.1221 | 0.0950 | 0.1167 | 0.0806 | 0.0857 | 0.0453 |
| TIGER w/o user | 0.1216 | 0.0958 | 0.1159 | 0.0810 | 0.0863 | 0.0464 |
| Handling Collisions (Ours) | 0.1277 | 0.0987 | 0.1163 | 0.0844 | 0.0885 | 0.0473 |

Analysis: The proposed method yields better results. TIGER's approach of appending an index suffix creates a "semantically unrelated distribution" (the index values 1, 2, 3 have no relation to the item's content), whereas MQL4GRec preserves semantic meaning by reassigning the next-closest semantic code.

6.2.2. Impact of Generation Tasks

They tested adding the Asymmetric (AIG) and Alignment (QLA) tasks. The results (Table 3 in paper) show that adding AIG significantly boosts performance. This confirms that forcing the model to predict Image tokens from Text input (and vice versa) helps it learn better representations.

6.3. Pre-training Analysis

The authors analyzed the effect of dataset size and epochs.

  • Dataset Size (Figure 3): Increasing the amount of text pre-training data helps consistently. However, for the "Games" dataset, too much multimodal pre-training actually hurt performance slightly. This suggests a domain gap or conflict between the general pre-training data and the specific "Games" domain preferences.

  • Zero-Shot: The model has weak but non-zero capabilities to recommend items in new domains without fine-tuning, showing some transferability.

    The following figure (Figure 3 from the original paper) shows the dataset size impact:

    Figure 3: The impact of varying amounts of pre-training data on recommendation performance. (The chart plots HR@10 on the Instruments, Arts, and Games domains for the NIG and QLG settings under different amounts of pre-training data; as the amount of data increases, QLG outperforms NIG.)

7. Conclusion & Reflections

7.1. Conclusion Summary

MQL4GRec successfully demonstrates that Unified Quantitative Language is a powerful paradigm for multimodal recommendation. By treating images and text as discrete tokens in a shared vocabulary, the model can leverage standard generative PLM architectures to perform complex, cross-modal recommendation tasks. The approach effectively transfers knowledge from data-rich source domains to target domains, achieving state-of-the-art results.

7.2. Limitations & Future Work

  • Inference Speed: The authors admit that, like most generative models using beam search, MQL4GRec is slower at inference time than simple dot-product models (like SASRec).
  • Missing Modalities: The paper assumes all items have both text and images. The scenario where some items are missing one modality was not studied.
  • Domain Conflict: The "Games" dataset showed signs of negative transfer (performance drop with more pre-training), indicating that simply adding more data isn't always better if the domains conflict.

7.3. Personal Insights & Critique

  • Innovation: The idea of using two separate RQ-VAEs and aligning them via generative tasks is clever. It effectively solves the "modality gap" by forcing the visual data to conform to the discrete, sequential nature of language models.
  • Critique on Re-ranking: The re-ranking formula (adding $+1$ when an item appears in both lists) is a heuristic. While effective, it feels a bit arbitrary. A learnable fusion layer or a joint probability calculation might be more theoretically sound, though likely more expensive.
  • Scalability: The vocabulary size grows with $L \times K$. For very large datasets, the codebook might need to be huge to avoid too many collisions, which could make the model heavy.
  • Application: This "Quantization as Translation" approach could be applied beyond recommendation—for example, in multimodal search or describing products in e-commerce automatically.
