DiffCL: A Diffusion-Based Contrastive Learning Framework with Semantic Alignment for Multimodal Recommendations


TL;DR Summary

DiffCL is a diffusion-based contrastive learning framework for multimodal recommendation that reduces noise via diffusion-generated contrastive views, aligns cross-modal semantics with stable ID embeddings, and alleviates data sparsity using an item-item graph, thereby enhancing recommendation accuracy.

Abstract

Multimodal recommendation systems integrate diverse multimodal information into the feature representations of both items and users, thereby enabling a more comprehensive modeling of user preferences. However, existing methods are hindered by data sparsity and the inherent noise within multimodal data, which impedes the accurate capture of users' interest preferences. Additionally, discrepancies in the semantic representations of items across different modalities can adversely impact the prediction accuracy of recommendation models. To address these challenges, we introduce a novel diffusion-based contrastive learning framework (DiffCL) for multimodal recommendation. DiffCL employs a diffusion model to generate contrastive views that effectively mitigate the impact of noise during the contrastive learning phase. Furthermore, it improves semantic consistency across modalities by aligning distinct visual and textual semantic information through stable ID embeddings. Finally, the introduction of the Item-Item Graph enhances multimodal feature representations, thereby alleviating the adverse effects of data sparsity on the overall system performance. We conduct extensive experiments on three public datasets, and the results demonstrate the superiority and effectiveness of the DiffCL.

In-depth Reading

1. Bibliographic Information

1.1. Title

DiffCL: A Diffusion-Based Contrastive Learning Framework with Semantic Alignment for Multimodal Recommendations

1.2. Authors

Qiya Song, Jiajun Hu, Lin Xiao, Bin Sun (Member, IEEE), Xieping Gao, Shutao Li (Fellow, IEEE)

1.3. Journal/Conference

This paper was published as a preprint on arXiv. The listed publication date is January 2, 2025. arXiv is a widely respected open-access preprint server for research articles, particularly in fields like computer science, physics, mathematics, and others. It allows researchers to share their work rapidly before, or in parallel with, peer review processes in traditional journals or conferences.

1.4. Publication Year

2025

1.5. Abstract

This paper introduces DiffCL, a novel diffusion-based contrastive learning framework for multimodal recommendation systems. The core challenges addressed are data sparsity and inherent noise within multimodal data, which impede accurate user preference modeling, and semantic discrepancies across modalities that can reduce prediction accuracy. DiffCL employs a diffusion model to generate robust contrastive views, thereby mitigating noise during contrastive learning. It enhances semantic consistency by aligning distinct visual and textual semantic information through stable ID embeddings. Additionally, an Item-Item Graph is incorporated to improve multimodal feature representations and alleviate data sparsity. Extensive experiments on three public datasets demonstrate the superiority and effectiveness of the DiffCL framework.

Official Source: https://arxiv.org/abs/2501.01066v1 PDF Link: https://arxiv.org/pdf/2501.01066v1.pdf Publication Status: This is a preprint available on arXiv.

2. Executive Summary

2.1. Background & Motivation

The landscape of online platforms increasingly relies on recommender systems (RSs) to filter information and suggest items aligned with user preferences. Initially, RSs primarily used user interaction data, but as user needs grew more complex, multimodal recommender systems (MRSs) emerged, integrating diverse multimodal information (e.g., visual, textual) to capture user preferences more comprehensively. This integration, while powerful, faces two significant challenges:

  1. Data Sparsity and Inherent Noise: In real-world scenarios, user-item interaction data is often sparse, making it difficult for models to accurately infer preferences. Furthermore, multimodal data inherently contains noise, which can degrade the quality of learned representations. Existing methods, particularly self-supervised learning (SSL) techniques that generate contrastive views (different augmented versions of the same data), often use simple augmentation strategies like edge dropout or adding random noise. These methods can inadvertently introduce or amplify irrelevant noise, hindering accurate user interest capture.

  2. Semantic Discrepancies Across Modalities: Different modalities (e.g., images and text for an item) often represent the same item with distinct semantic characteristics. Discrepancies in these representations can lead to inconsistencies when fused, negatively impacting the recommendation model's prediction accuracy.

    The core problem the paper aims to solve is improving the accuracy and robustness of multimodal recommendations by effectively handling multimodal noise, enhancing semantic consistency across modalities, and mitigating data sparsity.

2.2. Main Contributions / Findings

The paper introduces a novel framework, DiffCL, which makes several key contributions:

  1. Novel Diffusion-Based Contrastive Learning Framework (DiffCL): The paper proposes DiffCL for multimodal recommendations, enhancing the semantic representation of items by introducing Item-Item graphs to mitigate the effects of data sparsity. This provides a comprehensive solution addressing multiple challenges simultaneously.
  2. Diffusion Model for Contrastive View Generation: DiffCL leverages a diffusion model to generate contrastive views during the graph contrastive learning phase. Unlike previous methods that rely on simple random noise or dropout, this approach effectively reduces the impact of noisy information, leading to more robust and higher-quality augmented data for self-supervised learning.
  3. ID-Guided Semantic Alignment: The framework utilizes stable ID embeddings (unique identifiers for users/items) to guide the semantic alignment process across different modalities (visual and textual). This ensures consistency in semantic representations, allowing for more effective complementary learning between modalities and preventing discrepancies from degrading performance.
  4. Item-Item Graph for Feature Enhancement: DiffCL introduces an Item-Item Graph to enhance multimodal feature representations. This helps in capturing latent relationships among items and further alleviates the adverse effects of data sparsity on overall system performance.
  5. Empirical Validation: Extensive experiments conducted on three public datasets (Baby, Video, Sports) demonstrate the superiority and effectiveness of DiffCL compared to state-of-the-art general and multimodal recommendation models. This validates the practical utility and performance gains achieved by the proposed methods.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

Multimodal Recommender Systems (MRSs)

Multimodal Recommender Systems are advanced recommendation systems that integrate information from multiple distinct data sources, or modalities, to provide more accurate and personalized recommendations. Traditional recommender systems often rely solely on user-item interaction data (e.g., purchases, clicks) or user/item metadata (e.g., item categories, user demographics). MRSs go a step further by incorporating rich, diverse data types like images, text descriptions, video content, and audio. The goal is to capture a more holistic understanding of user preferences and item characteristics, which is particularly useful when interaction data is sparse or when implicit feedback needs to be enriched with explicit content understanding.

Graph Neural Networks (GNNs)

Graph Neural Networks (GNNs) are a class of neural networks designed to process data that can be represented as graphs. In a graph, data points are nodes (or vertices), and the relationships between them are edges. GNNs are particularly well-suited for recommender systems because user-item interactions naturally form a bipartite graph (users connected to items they interacted with). The fundamental idea behind GNNs is message passing, where each node iteratively aggregates information from its neighbors to update its own representation. This process allows nodes to incorporate structural information from the graph and learn representations that reflect their neighborhood context. The abstract structure of a GNN involves three main functions:

  1. Message Passing: Neighbors send information (messages) to a central node. The message for node $k$ at layer $l$ is typically a function of its neighbors' feature vectors: $ m_k^{(l)} = \sum_{j \in \mathcal{N}(k)} M^{(l)}(h_j^{(l-1)}, h_k^{(l-1)}, e_{jk}) $ where:
    • $m_k^{(l)}$: The aggregated message for node $k$ at layer $l$.
    • $\mathcal{N}(k)$: The set of neighbors of node $k$.
    • $M^{(l)}$: A message function at layer $l$ that transforms the features of node $j$ ($h_j^{(l-1)}$), node $k$ ($h_k^{(l-1)}$), and the edge features $e_{jk}$.
    • $h_j^{(l-1)}$ and $h_k^{(l-1)}$: Feature vectors of nodes $j$ and $k$ from the previous layer $(l-1)$.
    • $e_{jk}$: Features of the edge connecting nodes $j$ and $k$.
  2. Aggregation: The central node collects all incoming messages from its neighbors. This usually involves a permutation-invariant function (such as sum, mean, or max) to combine the messages. The paper refers to an accumulation function as part of this step.
  3. Update: The node's own representation (feature vector) is updated using its previous state and the aggregated messages: $ h_k^{(l)} = U^{(l)}(h_k^{(l-1)}, m_k^{(l)}) $ where:
    • $h_k^{(l)}$: The updated feature vector for node $k$ at layer $l$.
    • $U^{(l)}$: An update function (e.g., a neural network, summation, or mean). This iterative process allows GNNs to capture increasingly higher-order relationships and dependencies in the graph; a minimal code sketch of one such layer follows.
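To make the message-passing cycle concrete, the following minimal sketch (plain NumPy, not taken from the paper) implements one layer with sum aggregation and a tanh update; the weight matrices `W_msg` and `W_upd` and the toy graph are purely illustrative.

```python
import numpy as np

def gnn_layer(h, adj, W_msg, W_upd):
    """One generic message-passing layer.
    h: [N, d] node features; adj: [N, N] binary adjacency matrix;
    W_msg, W_upd: [d, d] illustrative weight matrices."""
    messages = adj @ (h @ W_msg)           # sum messages from neighbors (message + aggregation)
    return np.tanh(h @ W_upd + messages)   # update each node from its own state and the messages

# Toy usage: 4 nodes on a ring, 8-dimensional features.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
h1 = gnn_layer(h, adj, 0.1 * rng.normal(size=(8, 8)), 0.1 * rng.normal(size=(8, 8)))
```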

Contrastive Learning (CL)

Contrastive Learning is a self-supervised learning (SSL) paradigm that aims to learn robust feature representations without explicit labels. The core idea is to maximize the similarity between different augmented "views" of the same data instance (called positive pairs) while simultaneously minimizing the similarity between views of different instances (called negative pairs). This encourages the model to learn features that are invariant to certain augmentations but discriminative across different instances. A common loss function used in contrastive learning is InfoNCE loss, which is a variant of the Noise Contrastive Estimation (NCE) loss. For a given anchor (e.g., a user embedding), a positive sample (another view of the same user), and a set of negative samples (views of other users), InfoNCE pushes the positive pair closer and pulls negative pairs farther apart in the embedding space.

Diffusion Models (DMs)

Diffusion Models (DMs) are a class of generative models that have achieved remarkable success in generating high-quality synthetic data, especially images and audio. They operate through a two-stage process:

  1. Forward Diffusion Process: This process gradually adds Gaussian noise to data points over a series of time steps. Starting from an original data sample $x_0$, at each step $t$ a small amount of noise is added, transforming $x_{t-1}$ into $x_t$. After many steps, the data $x_T$ becomes pure noise, indistinguishable from a standard Gaussian distribution. This process is typically fixed and not learned. $ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I}) $ where $\mathcal{N}$ denotes a Gaussian distribution, and $\beta_t$ is a hyperparameter controlling the noise scale at step $t$.
  2. Reverse Diffusion Process: This is the learned process. A neural network is trained to reverse the forward process, gradually denoising the noisy data $x_t$ back to the original data $x_0$. The model learns to predict the noise added at each step or to directly predict the original data. By sampling from the pure noise $x_T$ and applying the learned reverse steps, the model can generate new data samples. $ p_{\theta}(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t) = \mathcal{N}(\boldsymbol{x}_{t-1}; \mu_{\theta}(\boldsymbol{x}_t, t), \boldsymbol{\Sigma}_{\theta}(\boldsymbol{x}_t, t)) $ Here, $\mu_{\theta}$ and $\boldsymbol{\Sigma}_{\theta}$ are the mean and variance of the Gaussian distribution, which are learned by a neural network parameterized by $\theta$.

Bayesian Personalized Ranking (BPR)

Bayesian Personalized Ranking (BPR) is a pairwise ranking loss function commonly used in recommender systems, particularly for implicit feedback scenarios (where only positive interactions are observed and non-interactions are ambiguous). Instead of predicting a score for each item, BPR optimizes the relative ranking of items. The core idea is that a user should prefer an item they have interacted with (positive item) over an item they have not interacted with (negative item). For a given user $u$, a positive item $p$, and a negative item $n$, the BPR loss is defined as: $ \mathcal{L}_{BPR} = \sum_{(u, p, n) \in D} - \log(\sigma(\hat{y}_{u,p} - \hat{y}_{u,n})) $ where:

  • $D$: The training set of triplets (u, p, n).
  • $\hat{y}_{u,p}$: The predicted score for user $u$ and positive item $p$.
  • $\hat{y}_{u,n}$: The predicted score for user $u$ and negative item $n$.
  • $\sigma$: The sigmoid activation function, which maps any real value to the range (0, 1). The objective is to maximize $\hat{y}_{u,p} - \hat{y}_{u,n}$, i.e., the score of the positive item should be higher than that of the negative item. The negative log-sigmoid ensures the loss is minimized when this condition is met; a minimal sketch of this loss is given below.
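As a concrete illustration, the sketch below (NumPy, batch form rather than the explicit sum over D) computes the BPR loss from sampled (user, positive, negative) embedding triplets; the softplus identity is used for numerical stability.

```python
import numpy as np

def bpr_loss(user_emb, pos_emb, neg_emb):
    """user_emb, pos_emb, neg_emb: [B, d] embeddings for a batch of (u, p, n) triplets."""
    pos_scores = np.sum(user_emb * pos_emb, axis=1)   # \hat{y}_{u,p}
    neg_scores = np.sum(user_emb * neg_emb, axis=1)   # \hat{y}_{u,n}
    # -log(sigmoid(x)) == log(1 + exp(-x)) == logaddexp(0, -x)
    return np.mean(np.logaddexp(0.0, -(pos_scores - neg_scores)))
```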

3.2. Previous Works

The paper contextualizes its contributions by reviewing existing recommender systems (RSs) and multimodal recommender systems (MRSs).

  • Traditional RSs: Early systems like Matrix Factorization (MF) (S. Rendle, "Factorization machines," 2010) and basic user interaction-based models relied heavily on explicit feedback or historical interaction patterns. These methods, while foundational, struggle with data sparsity and cannot capture complex user preferences or item nuances due to their reliance on unimodal data. For instance, BPR (S. Rendle et al., "Bpr: Bayesian personalized ranking from implicit feedback," 2012) focuses on implicit feedback but still operates within a unimodal context.
  • Deep Learning in RSs: With advancements in deep learning (DL), RSs began leveraging DL to learn underlying features and complex nonlinear correlations. Works like Neural Collaborative Filtering (NCF) (X. He et al., "Neural collaborative filtering," 2017) improved modeling capabilities by replacing simple matrix factorization with neural architectures. ACNE (J. Chen et al., "Self-training enhanced: Network embedding and overlapping community detection with adversarial learning," 2022) and CRL (J. Chen et al., "Crl: Collaborative representation learning by coordinating topic modeling and network embeddings," 2022) demonstrate the application of deep learning for network embeddings and collaborative learning.
  • Early MRSs: To overcome the limitations of unimodal systems, MRSs started integrating diverse information. VBPR (R. He and J. McAuley, "VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback," 2016, not cited but a seminal work in this area) was one of the first to incorporate visual modal information into matrix factorization, directly addressing data sparsity. CKE (F. Zhang et al., "Collaborative knowledge base embedding for recommender systems," 2016) combined image and text features with knowledge graphs. These methods primarily focused on integrating multimodal data rather than comprehensively modeling interactions across modalities.
  • GNN-based RSs: Graph Neural Networks (GNNs) gained prominence for their ability to capture higher-order features and collaborative signals by aggregating information from neighboring nodes in user-item interaction graphs. NGCF (X. Wang et al., "Neural graph collaborative filtering," 2019) fused GCN with matrix factorization. LightGCN (X. He et al., "LightGCN: Simplifying and Powering Graph Convolution Networks for Recommendation," 2020) simplified GCNs for recommendation by removing non-linearities and weight matrices. While effective, GNN-based methods (MMGCN, DualGNN, MGCN) still require substantial high-quality interaction data and are susceptible to multimodal noise. JMPGCF (K. Liu et al., "Joint multi-grained preference modeling for sequential recommendation," 2022) highlights the challenge of prevalence in GCNs.
  • Self-Supervised Learning (SSL) for RSs: SSL emerged to address data sparsity by generating supervision signals from unlabeled data. NCL (Z. Lin et al., "Improving graph collaborative filtering with neighborhood-enriched contrastive learning," 2022) and HCCF (L. Xia et al., "Hypergraph contrastive collaborative filtering," 2022) combine SSL with collaborative filtering. SGL (J. Wu et al., "Self-supervised graph learning for recommendation," 2021) uses dropout techniques to create contrastive views. However, these methods often focus only on interactive data augmentation or use simple random augmentations (e.g., MMGCL, SLMRec, MMSSL), which can introduce noise irrelevant to recommendations.
  • Diffusion Models (DMs) in RSs: Recently, DMs, successful in generative tasks, have been explored for RSs. PDRec (H. Ma et al., "Plugin diffusion model for sequential recommendation," 2023) and DreamRec (Z. Yang et al., "Generate what you prefer: Reshaping sequential recommendation via guided diffusion," 2024) use DMs for sequential recommendations and item space exploration. DiffRec (W. Wang et al., "Diffusion recommender model," 2023) uses DMs for collaborative information generation. LD4MRec (P. Yu et al., "Ld4mrec: Simplifying and powering diffusion model for multimedia recommendation," 2023) and DiffMM (Y. Jiang et al., "Diffmm: Multi-modal diffusion model for recommendation," 2024) combine DMs with multimodal or cross-modal information for user representation and collaborative signal modeling.

3.3. Technological Evolution

The evolution of recommender systems can be broadly traced as follows:

  1. Early RSs (Pre-2010s): Dominated by traditional methods like Collaborative Filtering (CF) (user-based, item-based) and Matrix Factorization (MF). These methods relied heavily on explicit ratings or implicit interactions, struggling with cold-start and data sparsity issues.

  2. Deep Learning Era (2010s onwards): The advent of deep learning brought models like NCF, which replaced linear MF with neural networks, enabling the capture of complex non-linear relationships.

  3. Multimodal Integration (Mid-2010s onwards): Recognizing the limitations of unimodal data, researchers began incorporating richer multimodal information (visual, textual) using techniques like VBPR and CKE. This marked a shift towards more comprehensive item and user representations.

  4. Graph Neural Networks (Late 2010s onwards): The inherent graph structure of user-item interactions led to the widespread adoption of GNNs (NGCF, LightGCN). These models excel at capturing higher-order collaborative signals through message passing on interaction graphs.

  5. Self-Supervised Learning (Early 2020s onwards): To address data sparsity and reduce reliance on massive labeled data, SSL techniques were integrated into RSs. Contrastive Learning became a popular choice, generating supervision from augmented views of data (SGL).

  6. Advanced Generative Models (Mid-2020s onwards): The success of generative models like Diffusion Models in computer vision spurred their application in RSs, not just for data generation but also for robust representation learning and collaborative signal enhancement (DiffRec, DiffMM).

    This paper's work fits into the most recent wave, specifically at the intersection of Multimodal RSs, GNNs, Self-Supervised Learning, and Diffusion Models.

3.4. Differentiation Analysis

Compared to the main methods in related work, DiffCL introduces several core differences and innovations:

  • Robust Contrastive View Generation: Previous SSL methods for RSs (e.g., SGL, MMGCL, SLMRec) often rely on simpler augmentation strategies like edge dropout, node dropping, or adding random Gaussian noise. As shown in Figure 1 of the paper, these methods can introduce irrelevant noise or fail to generate semantically rich, distinct views. DiffCL differentiates itself by employing a diffusion model to generate contrastive views. Diffusion models, with their powerful generative capabilities, can produce high-quality, semantically meaningful augmented views by systematically adding and then denoising noise. This approach is claimed to "effectively mitigate the impact of noise introduced during self-supervised learning," offering a more sophisticated and robust augmentation strategy than simple random perturbations.

  • ID-Guided Semantic Alignment: Many multimodal recommendation methods fuse or align modal features (MMGCN, MGCN). However, DiffCL proposes a novel cross-modal alignment method that uses stable ID features as guidance. The uniqueness and stability of ID embeddings (for users and items) provide a consistent reference point to align visual and textual semantic information. This is distinct from alignment methods that might disrupt historical interaction information or suffer from inconsistencies due to varying modal distributions. By parameterizing modal features with Gaussian distributions and aligning their means and variances to that of the ID modality, DiffCL aims for a more principled and stable semantic consistency across modalities.

  • Integrated Data Sparsity Mitigation: While SSL generally helps with data sparsity, DiffCL explicitly addresses it further by introducing an Item-Item Graph for feature enhancement. This graph, constructed based on multimodal feature similarities, helps uncover latent relationships among items that might be missing in sparse user-item interaction data. This complementary mechanism, combined with the robust contrastive learning, provides a stronger foundation for item representations.

  • Holistic Framework: DiffCL integrates these distinct components—GNNs for higher-order feature capture, diffusion models for robust CL augmentation, ID-guided alignment for semantic consistency, and the Item-Item graph for sparsity mitigation—into a single, coherent framework. This holistic approach is designed to tackle the multifaceted challenges of multimodal recommendation more effectively than methods focusing on individual aspects.

    Fig. 1. Two methods for constructing graph contrastive learning views: (a) edge dropout, where some edges are randomly selected and removed from the graph according to a predefined dropout rate; (b) adding random (uniform or Gaussian) noise to the feature embeddings produced by the graph encoder.

The following figure (Figure 1 from the original paper) shows two methods for constructing graph contrastive learning. (a) illustrates edge dropout where random edges are removed from the graph based on a predefined rate. (b) depicts adding random uniform or Gaussian noise to feature embeddings after processing by a Graph Encoder. This highlights simpler augmentation techniques that DiffCL aims to improve upon.

4. Methodology

4.1. Principles

The DiffCL framework is built upon several core principles to address the challenges in multimodal recommendation systems:

  1. Comprehensive Preference Modeling: Integrate diverse multimodal information (visual, textual, and ID) alongside user-item interaction data to build a richer understanding of user preferences and item characteristics.
  2. Robust Feature Learning with Self-Supervised Learning: Utilize self-supervised learning (SSL) with contrastive learning to learn robust feature representations, particularly to combat data sparsity.
  3. Noise Mitigation through Diffusion Models: Employ diffusion models as a sophisticated augmentation strategy for contrastive learning. This moves beyond simple random perturbations to generate high-quality, semantically meaningful contrastive views that effectively reduce the impact of noise inherent in multimodal data and during augmentation.
  4. Semantic Consistency through ID-Guided Alignment: Address semantic discrepancies across modalities by using stable and unique ID embeddings as a reliable reference point to align visual and textual semantic information. This ensures that fused features maintain consistency and complementarity.
  5. Enhanced Item Representation via Item-Item Graph: Further alleviate data sparsity and enrich item features by explicitly modeling latent relationships between items through a modality-aware Item-Item Graph.
  6. Optimized Ranking Objective: Employ Bayesian Personalized Ranking (BPR) loss to optimize the model for accurate item ranking, combined with contrastive learning and semantic alignment losses for a holistic training objective.

4.2. Core Methodology In-depth (Layer by Layer)

The DiffCL framework consists of several interconnected components: a Graph Encoder for initial feature processing, Diffusion Graph Contrastive Learning for robust self-supervision, Multimodal Feature Enhancement via an Item-Item Graph, and ID-guided Multimodal Semantic Alignment. Finally, all components are optimized through a combined loss function. The detailed workflow is illustrated in Figure 2.

Fig. 2. Overview of the DiffCL model architecture: multimodal user-item interaction features are fed into a multi-layer GCN graph encoder; a diffusion process constructs contrastive views; and separate modules perform cross-modal semantic alignment and multimodal feature enhancement. The figure also shows the diffusion transition probabilities $p(x_t \mid x_{t-1})$ and $p(x_{t-1} \mid \tilde{x}_t)$.

The following figure (Figure 2 from the original paper) provides an overview of the DiffCL model architecture. It depicts how raw multimodal features are processed by a multi-layer GCN graph encoder to capture preference cues. The diffusion contrastive learning introduces a diffusion model to construct contrast views. Stable ID embeddings guide semantic alignment, and the Item-Item graph enhances multimodal feature representations.

4.2.1. Problem Formulation

Given a set of users $U = \{u_1, u_2, \dots, u_{|U|}\}$ and a set of items $I = \{i_1, i_2, \dots, i_{|I|}\}$, the goal of the multimodal recommender system is to predict a score $\hat{y}_{u,i}$ indicating user $u$'s preference for item $i$. This score is obtained by computing the inner product of the user embedding $e_u$ and the item embedding $e_i$: $ \hat{y}_{u,i} = e_u \cdot e_i^T $ The system processes original user-item ID interactions to obtain ID embeddings ($E_{id}$), and leverages visual ($E_v$) and textual ($E_t$) modal features obtained through respective encoders. After several enhancement and alignment steps, the final user embedding $e_u$ and item embedding $e_i$ are derived for prediction.

4.2.2. Graph Encoder

The Graph Encoder component is responsible for capturing higher-order features from different modalities by processing user-item heterogeneous graphs using Graph Convolutional Networks (GCNs).

First, raw visual information is extracted using a pre-trained ResNet50 model and raw textual information using a pre-trained BERT model. These models produce initial embeddings for the items.

Next, based on the raw interaction data and the multimodal information of items, three distinct user-item graphs are constructed: $G = \{G_m \mid G_v, G_t, G_{id}\}$.

  • An interaction matrix $J$ is formed where $J_{ui} = 1$ if an interaction exists between user $u$ and item $i$, and $J_{ui} = 0$ otherwise.

  • Each $G_m = \{n, e\}$ represents a user-item graph for a specific modality $m \in \{v, t, id\}$. Here, $n$ is the set of nodes (users and items) and $e$ is the set of edges (interactions).

    The GCN then processes these graphs. The feature embeddings $E_m^{(l)}$ after the $l$-th layer of a GCN are calculated as follows: $ E_m^{(l)} = \sum_{i \in N_u} \frac{1}{\sqrt{|N_u|}\sqrt{|N_i|}} E_m^{(l-1)} $ where:

  • $N_i$: The set of single-hop neighbors of item $i$ in graph $G_m$.

  • $N_u$: The set of single-hop neighbors of user $u$ in graph $G_m$.

  • $E_m^{(l-1)}$: Feature embeddings from the previous layer $(l-1)$ for modality $m$.

  • The term $\frac{1}{\sqrt{|N_u|}\sqrt{|N_i|}}$ is a normalization factor based on node degrees, similar to LightGCN's propagation rule.

    The final embedded features for a specific modality $m$ ($E_m$) are obtained by summing the representations from all $L$ layers, including the initial feature extraction ($E_m^{(0)}$): $ E_m = \sum_{l=0}^L E_m^{(l)} $ where $E_m^{(0)}$ is the initial feature after extraction (e.g., from ResNet50 or BERT). The resulting $E_m$ combines user and item embeddings for that modality: $ E_m = \left[ e_m^u \quad e_m^i \right] $ where $e_m^u$ and $e_m^i$ denote user and item embeddings, respectively, for modality $m$. A brief code sketch of this propagation appears below.
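The propagation and layer-sum above follow the LightGCN style. Below is a minimal dense-matrix sketch under that reading (a real implementation would use sparse operations); `A_norm` is assumed to be the symmetrically normalized adjacency over the combined user-and-item node set.

```python
import numpy as np

def graph_encoder(E0, A_norm, num_layers=3):
    """E0: [N, d] initial embeddings E_m^{(0)} for one modality (users and items stacked);
    A_norm: [N, N] adjacency with entries 1 / (sqrt(|N_u|) * sqrt(|N_i|)) on observed edges."""
    E, E_sum = E0, E0.copy()
    for _ in range(num_layers):
        E = A_norm @ E        # one propagation step: E_m^{(l)} from E_m^{(l-1)}
        E_sum = E_sum + E     # accumulate E_m = sum_{l=0}^{L} E_m^{(l)}
    return E_sum
```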

4.2.3. Diffusion Graph Contrastive Learning

This component introduces a diffusion model (DM) into the graph contrastive learning phase to generate two similar yet distinct contrastive views. This approach aims to enhance item and user representations by mitigating noise more effectively than simpler augmentation techniques.

4.2.3.1. Graph Diffusion Forward Process

The forward process gradually adds Gaussian noise to the input embeddings. The higher-order feature embeddings ($E_m$) obtained from the Graph Encoder are the starting point. Consider the visual modality embeddings $E_v = \left[ e_v^u \quad e_v^i \right]$ as an example. The diffusion process is initialized with $x_0 = \left[ e_v^u \quad e_v^i \right]^T$. At each time step $t$, Gaussian noise is gradually added to transform $x_{t-1}$ into $x_t$. The transition probability from $x_{t-1}$ to $x_t$ is given by: $ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I}) $ where:

  • $\mathcal{N}$: Represents a Gaussian distribution.
  • $\beta_t$: A noise scale parameter that controls the amount of Gaussian noise added at time step $t$. As $t$ increases, $x_t$ converges to a standard Gaussian distribution. Since independent Gaussian noise is additive, $x_t$ can be sampled directly from $x_0$: $ q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\gamma}_t}\, \mathbf{x}_0, (1 - \bar{\gamma}_t) \mathbf{I}) $ Here, $\gamma_t$ and $\bar{\gamma}_t$ are parameters controlling the total noise added from $x_0$ to $x_t$: $ \gamma_t = 1 - \beta_t, \qquad \bar{\gamma}_t = \prod_{t'=1}^{t} \gamma_{t'} $ Using these parameters, $x_t$ can be re-parameterized as: $ x_t = \sqrt{\bar{\gamma}_t}\, x_0 + \sqrt{1 - \bar{\gamma}_t}\, \varepsilon $ where $\varepsilon \sim \mathcal{N}(0, \mathbf{I})$ (standard Gaussian noise). A linear noise scheduler is used to control the amount of noise in $x_{0:T}$: $ 1 - \bar{\gamma}_t = s \cdot \left[ \gamma_{\min} + \frac{t - 1}{T - 1} (\gamma_{\max} - \gamma_{\min}) \right] $ where:
  • $t \in \{1, 2, \dots, T\}$: The time step in the diffusion process.
  • $s$: A noise scale hyperparameter, $s \in [0, 1]$.
  • $\gamma_{\min}$ and $\gamma_{\max}$: The minimum and maximum limits of the additive noise, respectively. A code sketch of this forward sampling step follows.
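The closed-form sampling and the linear noise scheduler can be written in a few lines. The sketch below is a hypothetical NumPy rendering of those two equations; the values of `s`, `gamma_min`, and `gamma_max` are illustrative, not the paper's settings.

```python
import numpy as np

def forward_diffuse(x0, t, T=100, s=0.1, gamma_min=1e-4, gamma_max=2e-2, rng=None):
    """Sample x_t directly from x_0 (graph-encoder embeddings) at step t in {1, ..., T}."""
    rng = rng or np.random.default_rng()
    # Linear scheduler: 1 - gamma_bar_t = s * [gamma_min + (t-1)/(T-1) * (gamma_max - gamma_min)]
    one_minus_gamma_bar = s * (gamma_min + (t - 1) / (T - 1) * (gamma_max - gamma_min))
    gamma_bar = 1.0 - one_minus_gamma_bar
    eps = rng.normal(size=x0.shape)                    # standard Gaussian noise
    return np.sqrt(gamma_bar) * x0 + np.sqrt(one_minus_gamma_bar) * eps
```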

4.2.3.2. Graph Diffusion Reverse Process

The reverse process aims to remove the noise added during the forward process and recover the original $x_0$. This learned process generates a pseudo-feature similar to the original input. Starting from $x_t$, it gradually denoises to recover $x_{t-1}$. The mathematical expression for the reverse process is: $ p_{\theta}(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t) = \mathcal{N}(\boldsymbol{x}_{t-1}; \mu_{\theta}(\boldsymbol{x}_t, t), \boldsymbol{\Sigma}_{\theta}(\boldsymbol{x}_t, t)) $ where:

  • $\mu_{\theta}(\boldsymbol{x}_t, t)$: The predicted mean of the Gaussian distribution for the next state, parameterized by a neural network with learnable parameters $\theta$.
  • $\boldsymbol{\Sigma}_{\theta}(\boldsymbol{x}_t, t)$: The predicted variance of the Gaussian distribution for the next state, also parameterized by the neural network.

4.2.3.3. Graph Contrastive Learning

After processing with the Graph Encoder, the visual (or textual) representation is $E_v$ (or $E_t$). By setting $x_0 = E_v$, the diffusion model (forward and reverse processes) is used to generate two distinct contrastive views, $E_v^1$ and $E_v^2$, which are similar to $E_v$ but with controlled inconsistencies. The same procedure applies to $E_t$ to generate $E_t^1$ and $E_t^2$. Graph contrastive learning is then performed using the InfoNCE loss. For the visual modality, the user-level contrastive loss $\mathcal{L}_u^v$ and the item-level contrastive loss $\mathcal{L}_i^v$ are defined as: $ \mathcal{L}_u^v = \sum_{u_1 \in U} - \log \frac{\exp \big( s(e_{u_1,v}^1, e_{u_1,v}^2) / \tau \big)}{\sum_{u_2 \in U} \exp \big( s(e_{u_1,v}^1, e_{u_2,v}^2) / \tau \big)} $ $ \mathcal{L}_i^v = \sum_{i_1 \in I} - \log \frac{\exp \big( s(e_{i_1,v}^1, e_{i_1,v}^2) / \tau \big)}{\sum_{i_2 \in I} \exp \big( s(e_{i_1,v}^1, e_{i_2,v}^2) / \tau \big)} $ where:

  • $s(\cdot)$: The cosine similarity function.
  • $\tau$: A temperature hyperparameter controlling the sharpness of the distribution and the convergence rate.
  • $e_{u_1,v}^1, e_{u_1,v}^2$: The two contrastive views of user $u_1$'s embedding in the visual modality.
  • $e_{i_1,v}^1, e_{i_1,v}^2$: The two contrastive views of item $i_1$'s embedding in the visual modality. The total contrastive learning loss for the visual modality is: $ \mathcal{L}_{cl}^v = \mathcal{L}_u^v + \mathcal{L}_i^v $ Similarly, for the textual modality: $ \mathcal{L}_{cl}^t = \mathcal{L}_u^t + \mathcal{L}_i^t $ The final graph contrastive learning loss is a weighted sum of the visual and textual losses: $ \mathcal{L}_{cl} = \lambda_{\mathrm{cl}} (\mathcal{L}_{cl}^v + \mathcal{L}_{cl}^t) $ where $\lambda_{\mathrm{cl}}$ is a hyperparameter controlling the contribution of this loss (a code sketch appears below).
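For clarity, here is a minimal PyTorch-style sketch of the cross-view InfoNCE objective applied to the two diffusion-generated views; tensor names are illustrative, and the anchor/positive/negative arrangement follows the standard formulation described above.

```python
import torch
import torch.nn.functional as F

def infonce(view1, view2, tau=0.4):
    """view1, view2: [N, d] contrastive views; row i of view2 is the positive for row i of view1,
    all other rows serve as negatives."""
    z1, z2 = F.normalize(view1, dim=1), F.normalize(view2, dim=1)
    logits = (z1 @ z2.t()) / tau                             # scaled cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)      # positives on the diagonal
    return F.cross_entropy(logits, labels)

def modality_cl_loss(user_v1, user_v2, item_v1, item_v2, tau=0.4):
    """L_cl^m = user-level term + item-level term for one modality."""
    return infonce(user_v1, user_v2, tau) + infonce(item_v1, item_v2, tau)
```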

4.2.4. Multimodal Feature Enhancement and Alignment

4.2.4.1. Multimodal Feature Enhancement (Item-Item Graph)

To capture semantic connections among items and alleviate data sparsity, an Item-Item Graph (I-I graph) is constructed for each modality (visual and textual). The similarity score $S_{i,j}^m$ between items $i$ and $j$ for a specific modality $m \in \{v, t\}$ is calculated using the cosine similarity of their original features $f_i^m$ and $f_j^m$: $ S_{i,j}^m = \frac{ (f_i^m)^\top f_j^m }{ \|f_i^m\| \|f_j^m\| } $ To reduce the impact of redundant data, only the top $K$ neighbors (with the highest similarity scores) of each item are retained, setting their similarity to 1 and all others to 0. The paper fixes $K = 10$. $ S_{i,j}^m = \begin{cases} 1 & \text{if } S_{i,j}^m \in \operatorname{top-}K(S_i^m) \\ 0 & \text{otherwise} \end{cases} $ The resulting similarity matrix $S^m$ is then normalized to $\widehat{S}^m$: $ \widehat{S}^m = (D^m)^{-\frac{1}{2}} S^m (D^m)^{-\frac{1}{2}} $ where $D^m$ is a diagonal degree matrix of $S^m$, with diagonal elements $ D_{ii}^m = \sum_j S_{i,j}^m $ This normalization yields a symmetric, stable adjacency matrix for subsequent aggregation. Finally, multi-layer neighbor information is aggregated based on this modality-aware adjacency matrix to enhance item embeddings: $ A_m^{(l)} = \sum_{j \in N_i} \widehat{S}_{i,j}^m A_{j,m}^{(l-1)} $ where $j$ is a first-order neighbor of $i$ and $A_{j,m}$ denotes the embedding of item $j$ in modality $m$. The final item embeddings $e_m^i$ are then enhanced by adding $A_m^{(l)}$: $ E_m = \left[ e_m^u \quad e_m^i + A_m^{(l)} \right] $ That is, the item embeddings $e_m^i$ from the Graph Encoder are augmented with the information aggregated from the Item-Item Graph (see the construction sketch below).
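The construction of the modality-aware I-I graph follows directly from these formulas. Below is a small NumPy sketch (not the paper's code); note that with cosine similarity the item itself is among its own top-K matches, and the paper's exact handling of self-edges may differ.

```python
import numpy as np

def build_item_item_graph(features, K=10):
    """features: [num_items, d] raw modality features (e.g., ResNet50 or BERT outputs).
    Returns the normalized adjacency (D^m)^(-1/2) S^m (D^m)^(-1/2)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                                    # cosine similarities S_{i,j}^m
    S = np.zeros_like(sim)
    topk = np.argsort(-sim, axis=1)[:, :K]           # indices of the K most similar items per row
    np.put_along_axis(S, topk, 1.0, axis=1)          # keep top-K edges with weight 1; all others 0
    deg = S.sum(axis=1)                              # D_ii^m = sum_j S_{i,j}^m
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    return d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
```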

4.2.4.2. Multimodal Feature Fusion

Visual and textual features, which are complementary, are fused at the feature level to comprehensively capture user preferences. The fused feature representation $E_{vt}$ is calculated as a weighted sum: $ E_{vt} = \mu \times E_v + (1 - \mu) \times E_t $ where:

  • $E_v$: The enhanced visual features.
  • $E_t$: The enhanced textual features.
  • $\mu$: A trainable parameter (initialized to 0.5) that controls the weighting between the visual and textual modalities. The ID modality features are not fused at this stage; their inherent uniqueness and stability make them better suited for semantic alignment and the final score calculation.

4.2.4.3. Multimodal Semantic Alignment (ID-guided)

To address the inconsistent feature distributions across modalities and prevent noisy information from adversely affecting predictions, DiffCL proposes a cross-modal alignment method guided by stable ID features. The final ID modality feature $E_{id}$, visual modality feature $E_v$, and textual modality feature $E_t$ are parameterized as Gaussian distributions: $ E_{id} \sim N(\mu_{id}, \sigma_{id}^2) $ $ E_v \sim N(\mu_v, \sigma_v^2), \quad E_t \sim N(\mu_t, \sigma_t^2) $ where:

  • $N(\mu, \sigma^2)$: A Gaussian distribution with mean $\mu$ and variance $\sigma^2$.
  • $\mu_{id}, \sigma_{id}^2$: Mean and variance of the ID modality's Gaussian distribution.
  • $\mu_v, \sigma_v^2$: Mean and variance of the visual modality's Gaussian distribution.
  • $\mu_t, \sigma_t^2$: Mean and variance of the textual modality's Gaussian distribution. The alignment loss is then calculated by measuring the distance between the ID modality distribution and the visual/textual modality distributions, comparing their respective means and standard deviations: $ \mathcal{L}_{align_1} = |\mu_{id} - \mu_v| + |\sigma_{id} - \sigma_v|, \qquad \mathcal{L}_{align_2} = |\mu_{id} - \mu_t| + |\sigma_{id} - \sigma_t| $ The total alignment loss is a weighted sum: $ \mathcal{L}_{align} = \lambda_{align} (\mathcal{L}_{align_1} + \mathcal{L}_{align_2}) $ where $\lambda_{align}$ is a hyperparameter that balances this loss (a minimal sketch follows).
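A simple way to realize this alignment is to match the empirical per-dimension mean and standard deviation of the visual/textual embeddings to those of the ID embeddings. The sketch below makes that assumption (the paper's exact parameterization of the Gaussians may differ) and averages the distances over embedding dimensions.

```python
import torch

def id_guided_alignment(E_id, E_v, E_t, lam_align=0.4):
    """E_id, E_v, E_t: [N, d] final ID / visual / textual embeddings."""
    def stats(E):
        return E.mean(dim=0), E.std(dim=0)            # empirical mean and std per dimension
    mu_id, sd_id = stats(E_id)
    mu_v, sd_v = stats(E_v)
    mu_t, sd_t = stats(E_t)
    loss_v = (mu_id - mu_v).abs().mean() + (sd_id - sd_v).abs().mean()   # L_align_1
    loss_t = (mu_id - mu_t).abs().mean() + (sd_id - sd_t).abs().mean()   # L_align_2
    return lam_align * (loss_v + loss_t)
```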

4.2.5. Model Optimization

The overall model is optimized by combining the Bayesian Personalized Ranking (BPR) loss, the diffusion graph contrastive learning loss, the cross-modal alignment loss, and a regularization loss.

The BPR loss is calculated from triplets (u, p, n), where user $u$ prefers positive item $p$ over negative item $n$: $ \mathcal{L}_{BPR} = \sum_{(u, p, n) \in D} - \log ( \sigma ( y_{u,p} - y_{u,n} ) ) $ where:

  • $D$: The set of triplets (u, p, n).
  • $\sigma$: The sigmoid function.
  • $y_{u,p}$ and $y_{u,n}$: The predicted scores, which combine the fused multimodal features ($E_{vt}$) and the ID features ($E_{id}$): $ y_{u,p} = (e_{vt}^u)^T \cdot e_{vt}^p + (e_{id}^u)^T \cdot e_{id}^p $ $ y_{u,n} = (e_{vt}^u)^T \cdot e_{vt}^n + (e_{id}^u)^T \cdot e_{id}^n $ Here, $e_{vt}^u$ and $e_{vt}^p$ are the fused multimodal embeddings of user $u$ and item $p$, and $e_{id}^u$ and $e_{id}^p$ are their respective ID embeddings.

The total loss function $\mathcal{L}$ for DiffCL is: $ \mathcal{L} = \lambda_{\mathrm{cl}} \mathcal{L}_{\mathrm{cl}} + \mathcal{L}_{align} + \mathcal{L}_{\mathrm{BPR}} + \mathcal{L}_{\mathrm{E}} $ where:

  • $\lambda_{\mathrm{cl}}$: Weight of the diffusion graph contrastive learning loss ($\mathcal{L}_{\mathrm{cl}}$).
  • $\mathcal{L}_{align}$: The cross-modal alignment loss.
  • $\mathcal{L}_{\mathrm{BPR}}$: The Bayesian Personalized Ranking loss.
  • $\mathcal{L}_{\mathrm{E}}$: The regularization loss, which applies L2 regularization to the visual and textual embeddings to prevent overfitting: $ \mathcal{L}_E = \lambda_E ( \|E_v\|_2^2 + \|E_t\|_2^2 ) $ where $\lambda_E$ is a hyperparameter regulating the impact of the L2 regularization. A sketch of the combined objective follows.
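Putting the pieces together, the scoring function and the combined objective can be sketched as follows (PyTorch-style pseudocode under the assumptions above; the lambda values shown are the Baby-dataset settings reported in Section 5.4 and are only illustrative).

```python
import torch

def predict_score(e_vt_u, e_vt_i, e_id_u, e_id_i):
    """y_{u,i} = fused-multimodal inner product + ID inner product."""
    return (e_vt_u * e_vt_i).sum(-1) + (e_id_u * e_id_i).sum(-1)

def diffcl_total_loss(l_cl, l_align, l_bpr, E_v, E_t, lam_cl=0.1, lam_E=0.7):
    """Combine the contrastive, alignment, BPR, and L2 regularization terms."""
    l_reg = lam_E * (E_v.pow(2).sum() + E_t.pow(2).sum())   # L_E
    return lam_cl * l_cl + l_align + l_bpr + l_reg
```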

5. Experimental Setup

5.1. Datasets

The experiments were conducted on three public datasets derived from the Amazon review dataset, which is widely used in Multimodal Recommender Systems (MRSs). These datasets contain user interaction information, item descriptions (text), item images, and other relevant metadata. To ensure data quality, a 5-core filtering process was applied to the raw data, meaning only users and items with at least 5 interactions were retained. Before model training, item visual features and textual features were pre-extracted using state-of-the-art models:

  • Visual Features: Extracted using ResNet50 (K. He et al., "Deep residual learning for image recognition," 2016), yielding 4096-dimensional embeddings.

  • Textual Features: Extracted using BERT (J. Devlin et al., "Bert: Pre-training of deep bidirectional transformers for language understanding," 2018), yielding 384-dimensional embeddings.

    The three datasets are:

  1. Baby: A dataset related to baby products.

  2. Video: A dataset related to video products.

  3. Sports: A dataset related to sports and outdoors products.

    The training, validation, and test sets were split in an 8:1:1 ratio.

The following are the results from [Table I] of the original paper, showing the specific data distribution of the experimental datasets:

Dataset #User #Item #Interaction Sparsity
Baby 19,445 7,050 160,792 99.88%
Sports 35,598 18,357 296,337 99.96%
Video 24,303 10,672 231,780 99.91%

Example of data sample: While the paper does not explicitly provide an example of a data sample (e.g., a specific image or text description), one can infer that for an item in the "Baby" dataset, there would be an image of a baby product and its textual description (e.g., brand, model, features, user reviews if processed). The BERT model would process the text, and ResNet50 would process the image.

These datasets were chosen because they are standard benchmarks in the MRS domain, representing diverse product categories with varying levels of data sparsity, thus allowing for a robust evaluation of the proposed method's effectiveness.

5.2. Evaluation Metrics

The performance of the DiffCL framework and baseline models was evaluated using two widely accepted ranking metrics in recommender systems: Recall@K and Normalized Discounted Cumulative Gain at K (NDCG@K). The evaluation was performed for $K = 10$ and $K = 20$.

Recall@K

  1. Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top KK recommendations. It assesses the model's ability to "recall" or find a significant portion of all items a user would like, among the limited set of items presented to them. A higher Recall@K indicates that the recommender system is effective at identifying and including a larger fraction of a user's true preferences in its top recommendations.
  2. Mathematical Formula: $ \text{Recall@K} = \frac{1}{|U|} \sum_{u \in U} \frac{|\text{Relevant}_u \cap \text{Recommended}_{u,K}|}{|\text{Relevant}_u|} $
  3. Symbol Explanation:
    • $|U|$: The total number of users in the test set.
    • $u$: A specific user.
    • $\text{Relevant}_u$: The set of items that user $u$ has actually interacted with (ground-truth relevant items) in the test set.
    • $\text{Recommended}_{u,K}$: The set of top $K$ items recommended by the system for user $u$.
    • $|\cdot|$: Denotes the cardinality (number of elements) of a set.
    • $\cap$: Represents the intersection of two sets. The formula calculates the average recall across all users. For each user, recall is the number of relevant items found in the top $K$ recommendations divided by the total number of relevant items for that user.

Normalized Discounted Cumulative Gain at K (NDCG@K)

  1. Conceptual Definition: NDCG@K is a measure of ranking quality that accounts for the position of relevant items in the recommendation list. It assigns higher scores to relevant items that appear earlier in the list and penalizes relevant items that appear later. It also considers the varying degrees of relevance (though often simplified to binary relevance in recommender systems). NDCG@K is "normalized" to a value between 0 and 1, where 1 indicates a perfect ranking (all relevant items are at the top, ordered by relevance). It provides a more nuanced evaluation than Recall@K by considering the order of recommendations.
  2. Mathematical Formula: $ \text{NDCG@K} = \frac{1}{|U|} \sum_{u \in U} \frac{\text{DCG@K}_u}{\text{IDCG@K}_u} $ where DCG@K for a user $u$ is calculated as: $ \text{DCG@K}_u = \sum_{j=1}^{K} \frac{\text{rel}_j}{\log_2(j+1)} $ and IDCG@K (the ideal DCG) for user $u$ is: $ \text{IDCG@K}_u = \sum_{j=1}^{K} \frac{\text{rel}_{j, \text{ideal}}}{\log_2(j+1)} $
  3. Symbol Explanation:
    • $|U|$: The total number of users in the test set.
    • $u$: A specific user.
    • $K$: The number of top recommendations considered.
    • $\text{rel}_j$: The relevance score of the item at position $j$ in the recommended list. For binary relevance (an item is either relevant or not), $\text{rel}_j = 1$ if the item is relevant and 0 otherwise.
    • $\log_2(j+1)$: A logarithmic discount factor, giving less weight to relevant items that appear at lower ranks.
    • $\text{rel}_{j, \text{ideal}}$: The relevance score of the item at position $j$ in the ideal (perfectly sorted) recommendation list for user $u$. IDCG@K represents the maximum possible DCG for user $u$ given the set of relevant items. The NDCG@K for each user is DCG@K divided by IDCG@K, averaged across all users; a minimal sketch of both metrics follows.
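For reference, both metrics can be computed per user as in the short sketch below (binary relevance assumed); the reported numbers are the averages of these per-user values over the test set.

```python
import numpy as np

def recall_ndcg_at_k(ranked_items, relevant_items, K=20):
    """ranked_items: item ids in recommended order for one user;
    relevant_items: set of that user's ground-truth test items."""
    hits = [1.0 if item in relevant_items else 0.0 for item in ranked_items[:K]]
    recall = sum(hits) / max(len(relevant_items), 1)
    dcg = sum(h / np.log2(j + 2) for j, h in enumerate(hits))        # j is 0-based, so log2(j+2)
    idcg = sum(1.0 / np.log2(j + 2) for j in range(min(len(relevant_items), K)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return recall, ndcg
```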

5.3. Baselines

The DiffCL framework was compared against a selection of representative state-of-the-art recommendation models, including both general and multimodal approaches. These baselines were implemented using the MMRec framework (H. Zhou et al., "A comprehensive survey on multimodal recommender systems: Taxonomy, evaluation, and future directions," 2023) to ensure fair comparison.

(a) General Recommendation Methods:

  • BPR (S. Rendle et al., "Bpr: Bayesian personalized ranking from implicit feedback," 2012): A foundational recommendation algorithm for implicit feedback. It models user preferences by optimizing pairwise rankings, aiming for positive items to be ranked higher than negative items. It randomly selects negative samples during training to improve generalization.
  • LightGCN (X. He et al., "LightGCN: Simplifying and Powering Graph Convolution Networks for Recommendation," 2020): A lightweight Graph Convolutional Network (GCN) based recommendation framework. It simplifies the traditional GCN architecture by removing unnecessary components like feature transformations and non-linear activation functions, focusing solely on neighborhood aggregation to capture collaborative signals. This simplification improves training efficiency and often achieves better recommendation performance.

(b) Multimodal Recommendation Methods:

  • VBPR (R. He and J. McAuley, "VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback," 2016): An extension of BPR that was one of the first to incorporate visual features of items into the recommendation process. It enhances item representations with visual information, thereby improving RS performance in multimodal scenarios, particularly in addressing data sparsity.

  • MMGCN (Y. Wei et al., "Mmgcn: Multi-modal graph convolution network for personalized recommendation of micro-video," 2019): This method uses a graph structure to model complex relationships between users and items. It specifically designs mechanisms to integrate information from various modalities, ensuring that visual, textual, or other modal data effectively complement user-item interactions.

  • DualGNN (Q. Wang et al., "Dualgnn: Dual graph neural network for multimedia recommendation," 2021): This approach employs a Dual Graph Neural Network to simultaneously model relationships within user-item graphs and potentially other auxiliary graphs. It aims to capture multi-level relational information to enhance recommendation accuracy and personalization.

  • SLMRec (Z. Tao et al., "Self-supervised learning for multimedia recommendation," 2022): A method that leverages self-supervised learning (SSL) for multimedia recommendation. It designs SSL tasks to generate implicit supervision signals (labels) and uses a contrastive learning strategy to optimize the model by constructing positive and negative sample pairs from multimodal data.

  • BM3 (X. Zhou et al., "Bootstrap latent representations for multi-modal recommendation," 2022): This method focuses on simplifying self-supervision tasks in multimodal recommender systems. It likely employs bootstrapping techniques to learn robust latent representations from diverse modal information.

  • MGCN (P. Yu et al., "Multi-view graph convolutional network for multimedia recommendation," 2023): Based on GCNs, this model purifies modal features using item information and incorporates a behavior-aware fuser that adaptively learns to combine different modal features.

  • DiffMM (Y. Jiang et al., "Diffmm: Multi-modal diffusion model for recommendation," 2024): A method based on diffusion models that enhances user representations by combining cross-modal contrastive learning with modality-aware graph diffusion models. It aims to better model collaborative signals and align multimodal feature information for more accurate recommendations.

  • Freedom (X. Zhou and Z. Shen, "A tale of two graphs: Freezing and denoising graph structures for multimodal recommendation," 2023): This method operates on user-item (U-I) and item-item (I-I) graphs, proposing a degree-sensitive edge pruning method to remove potentially noisy edges, thereby improving graph quality for multimodal recommendation.

    These baselines are representative because they cover different foundational approaches (MF-based, GNN-based, SSL-based, Diffusion-based) and address multimodal integration in various ways, allowing for a comprehensive comparison of DiffCL's novel contributions.

5.4. Details

To ensure fair evaluation, all comparative baselines were implemented using the MMRec framework, and a grid search was performed to identify their optimal hyperparameter settings.

For DiffCL, the following hyperparameters were set:

  • Optimizer: Adam optimizer.

  • Learning Rate: 0.001.

  • Dropout Rate: 0.5.

  • Temperature ($\tau$) in Graph Contrastive Learning: 0.4.

    The weights for the different loss components ($\lambda_{\mathrm{cl}}$, $\lambda_{align}$, $\lambda_E$) were tuned specifically for each dataset:

  • Baby Dataset: $\lambda_{\mathrm{cl}} = 0.1$, $\lambda_{align} = 0.4$, $\lambda_E = 0.7$.

  • Video Dataset: $\lambda_{\mathrm{cl}} = 0.01$, $\lambda_{align} = 1.0$, $\lambda_E = 1.0$.

  • Sports Dataset: $\lambda_{\mathrm{cl}} = 0.7$, $\lambda_{align} = 0.4$, $\lambda_E = 0.9$.

    These specific values highlight that the optimal balance between contrastive learning, semantic alignment, BPR, and regularization losses can vary significantly depending on the characteristics of the dataset.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results, summarized in Table II, demonstrate the superior performance of DiffCL across all three datasets (Baby, Video, Sports) compared to both general and state-of-the-art multimodal recommendation models.

The following are the results from [Table II] of the original paper:

Datasets Baby Video Sports
Model R@10 R@20 N@10 N@20 R@10 R@20 N@10 N@20 R@10 R@20 N@10 N@20
BPR 0.0268 0.0441 0.0144 0.0188 0.0722 0.1106 0.0386 0.0486 0.0306 0.0465 0.0169 0.0210
LightGCN 0.0402 0.0644 0.0274 0.0375 0.0873 0.1351 0.0475 0.0599 0.0423 0.0642 0.0229 0.0285
DiffCL 0.0641 0.0987 0.0343 0.0433 0.1421 0.2069 0.0804 0.0974 0.0754 0.1095 0.0421 0.0509
Improv. 32.50% 42.70% 14.23% 5.60% 59.45% 50.55% 62.11% 56.43% 64.78% 60.59% 65.06% 64.91%
VBPR 0.0397 0.0665 0.0210 0.0279 0.1198 0.1796 0.0647 0.0802 0.0509 0.0765 0.0274 0.0340
MMGCN 0.0397 0.0641 0.0206 0.0269 0.0843 0.1323 0.0440 0.0565 0.0380 0.0610 0.0206 0.0266
DualGNN 0.0518 0.0820 0.0273 0.0350 0.1200 0.1807 0.0656 0.0814 0.0583 0.0865 0.0320 0.0393
SLMRec 0.0529 0.0775 0.0290 0.0353 0.1187 0.1767 0.0642 0.0792 0.0663 0.0990 0.0365 0.0450
BM3 0.0539 0.0848 0.0283 0.0362 0.1166 0.1772 0.0636 0.0793 0.0632 0.0940 0.0346 0.0426
MGCN 0.0608 0.0927 0.0333 0.0415 0.1345 0.1997 0.0740 0.0910 0.0713 0.1060 0.0392 0.0489
Freedom 0.0622 0.0948 0.0330 0.0414 0.1226 0.1858 0.0662 0.0827 0.0722 0.1062 0.0394 0.0484
DiffMM 0.0619 0.0947 0.0326 0.0394 – – – – 0.0683 0.1019 0.0374 0.0455
DiffCL 0.0641 0.0987 0.0343 0.0433 0.1421 0.2069 0.0804 0.0974 0.0754 0.1095 0.0421 0.0509
Improv. 3.05% 4.11% 3.93% 4.58% 5.65% 3.60% 8.64% 7.03% 4.43% 3.11% 6.85% 5.16%

Comparison with General Recommendation Models (BPR, LightGCN):

  • DiffCL significantly outperforms BPR and LightGCN across all datasets and metrics. This highlights the crucial role of multimodal information and advanced feature learning techniques in improving recommendation accuracy.
  • The improvement is particularly pronounced on the Sports dataset, where DiffCL achieves an improvement of 64.78% in R@10 and 60.59% in R@20 over LightGCN (the best general model). Similarly substantial gains are observed in NDCG@10 (65.06%) and NDCG@20 (64.91%).
  • On the Video dataset, DiffCL shows comparably large improvements over LightGCN, with R@10 improving by 59.45% and NDCG@10 by 62.11%.
  • The Baby dataset sees a smaller, but still significant, improvement (e.g., 32.50% in R@10), suggesting that while multimodal information is beneficial, its impact can vary by domain; users of baby products may weigh other factors (e.g., brand trust, safety) more heavily than purely visual/textual characteristics.

Comparison with Multimodal Recommendation Models:

  • DiffCL consistently demonstrates superior performance even when compared to the best performing multimodal baselines (e.g., MGCN, Freedom, DiffMM).
  • On the Video dataset, DiffCL achieves R@10 of 0.1421, R@20 of 0.2069, N@10 of 0.0804, and N@20 of 0.0974. This represents an improvement of 5.65% in R@10 and 8.64% in N@10 over MGCN, the best multimodal baseline on this dataset.
  • On Sports, DiffCL improves R@10 by 4.43% and N@10 by 6.85% over Freedom (the previous best).
  • On Baby, the improvements are 3.05% in R@10 and 3.93% in N@10 over Freedom. The consistent superiority of DiffCL across diverse datasets and metrics validates the effectiveness of its core components: the diffusion model for robust contrastive view generation, the Item-Item Graph for data augmentation, and ID-guided inter-modal alignment. The overall findings indicate that by addressing multimodal noise, semantic inconsistencies, and data sparsity more effectively, DiffCL significantly enhances recommendation performance.

6.2. Ablation Studies / Parameter Analysis (RQ2)

To verify the effectiveness of each component, several variants of DiffCL were tested by removing or combining different modules.

The following are the results from [Table III] of the original paper, showing the performance comparison of different variants:

Variant            Metric  Baby    Video   Sports
DiffCL_baseline    R@20    0.0854  0.1907  0.0956
                   N@20    0.0364  0.0856  0.0428
DiffCL_diff        R@20    0.0925  0.1978  0.1095
                   N@20    0.0396  0.0895  0.0509
DiffCL_align       R@20    0.0907  0.1965  0.0960
                   N@20    0.0392  0.0893  0.0428
DiffCL_h           R@20    0.0986  0.1921  0.1099
                   N@20    0.0430  0.0872  0.0494
DiffCL_diff+align  R@20    0.0911  0.1904  0.1093
                   N@20    0.0403  0.0866  0.0506
DiffCL_diff+h      R@20    0.0986  0.1940  0.1102
                   N@20    0.0430  0.0885  0.0495
DiffCL_align+h     R@20    0.0993  0.1968  0.1114
                   N@20    0.0432  0.0896  0.0496
DiffCL             R@20    0.0987  0.2069  0.1095
                   N@20    0.0433  0.0974  0.0509

The variants evaluated were:

  • DiffCL_baseline: The model without any of the proposed components (diffusion graph contrastive learning, ID-guided cross-modal semantic alignment, or feature enhancement via the Item-Item graph). This serves as the basic LightGCN-like backbone model.
  • DiffCL_diff: Retains only the diffusion graph contrastive learning task.
  • DiffCL_align: Retains only the ID-guided cross-modal semantic alignment task.
  • DiffCL_h: Retains only the feature enhancement task (i.e., the Item-Item graph).
  • DiffCL_diff+align: Retains both diffusion graph contrastive learning and ID-guided cross-modal semantic alignment.
  • DiffCL_diff+h: Retains both diffusion graph contrastive learning and feature enhancement.
  • DiffCL_align+h: Retains both ID-guided cross-modal semantic alignment and feature enhancement.
  • DiffCL: The full proposed framework with all components.

Analysis of Ablation Study:

  1. Effectiveness of Individual Components: Comparing DiffCL_baseline with DiffCL_diff, DiffCL_align, and DiffCL_h shows that each component individually contributes to improving performance. For example, on the Baby dataset, DiffCL_baseline has R@20 of 0.0854, while DiffCL_diff (0.0925), DiffCL_align (0.0907), and DiffCL_h (0.0986) all show improvements. This confirms the validity of each proposed module.
  2. Impact of Diffusion Graph Contrastive Learning (DiffCL_diff): The diffusion graph contrastive learning module (DiffCL_diff) consistently improves results over the baseline, especially on Sports (R@20 from 0.0956 to 0.1095, N@20 from 0.0428 to 0.0509). This highlights the effectiveness of using diffusion models for robust view generation, reducing noise, and learning better representations.
  3. Impact of ID-Guided Semantic Alignment (DiffCL_align): The ID-guided semantic alignment module (DiffCL_align) also provides improvements, demonstrating its role in fostering semantic consistency across modalities. Its impact varies by dataset, suggesting that the degree of semantic discrepancy might differ.
  4. Impact of Item-Item Graph Feature Enhancement (DiffCL_h): The feature enhancement module (DiffCL_h) shows significant gains, particularly on Baby (R@20 from 0.0854 to 0.0986, N@20 from 0.0364 to 0.0430). This confirms that explicitly modeling item-item relationships and augmenting item embeddings effectively addresses data sparsity and enriches representations (a minimal sketch of such an item-item graph is given after this list).
  5. Combinations of Components: The models combining two components (DiffCL_diff+align, DiffCL_diff+h, DiffCL_align+h) generally outperform single-component variants. For instance, DiffCL_align+h on Baby achieves R@20 of 0.0993, which is higher than DiffCL_align (0.0907) and DiffCL_h (0.0986). This indicates that the components are complementary.
  6. Full Model (DiffCL): The full DiffCL model, incorporating all three contributions, delivers the strongest overall results. On Video it is clearly the best variant, with R@20 of 0.2069 and N@20 of 0.0974; on Baby and Sports it matches the best N@20 while remaining within a small margin of the strongest two-component variants on R@20. This supports the claim that the synergistic combination of diffusion-based contrastive learning, ID-guided semantic alignment, and Item-Item graph enhancement is crucial for achieving state-of-the-art results in multimodal recommendation.
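
Item 4 above refers to the Item-Item graph used for feature enhancement. As a minimal sketch of how such a graph is commonly constructed, assuming a kNN graph built from frozen modality features and a simple residual propagation step (in the spirit of FREEDOM-style item-item graphs; the helper names, `k`, and the number of propagation layers are illustrative choices, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def knn_item_graph(modal_feats: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Build a row-normalized kNN adjacency from item modality features.

    modal_feats: (num_items, dim) frozen visual or textual features.
    Returns a dense (num_items, num_items) adjacency; kept sparse in practice.
    """
    normed = F.normalize(modal_feats, dim=1)
    sim = normed @ normed.T                                  # cosine similarity
    topk_vals, topk_idx = sim.topk(k, dim=1)
    adj = torch.zeros_like(sim).scatter_(1, topk_idx, topk_vals)
    return adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-8)  # row-normalize

def enhance_item_embeddings(item_emb: torch.Tensor, adj: torch.Tensor,
                            num_layers: int = 1) -> torch.Tensor:
    """Propagate item embeddings over the item-item graph and add them back residually."""
    h = item_emb
    for _ in range(num_layers):
        h = adj @ h
    return item_emb + h

# Toy usage: 100 items with frozen 64-d modality features and 32-d ID embeddings.
feats = torch.randn(100, 64)
emb = torch.randn(100, 32)
enhanced = enhance_item_embeddings(emb, knn_item_graph(feats, k=5))
```

The point of the construction is simply that item-item similarity derived from multimodal features lets sparsely interacted items borrow signal from their neighbors, which is consistent with the gains observed on the sparser datasets.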

6.3. Hyperparameter Effects (RQ3)

The paper also investigated the impact of the key hyperparameters that control the weights of the different loss components: $\lambda_{diff}$ (for diffusion graph contrastive learning), $\lambda_{align}$ (for multimodal semantic alignment), and $\lambda_E$ (for L2 regularization). These parameters were varied within the range $\{0.01, 0.1, 0.2, \dots, 1.0\}$.
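
For orientation, these weights scale the corresponding auxiliary losses in the overall training objective. A plausible form consistent with the weights named here (the paper's exact equation may differ) is:

```latex
\mathcal{L}_{total} \;=\; \mathcal{L}_{BPR}
\;+\; \lambda_{diff}\,\mathcal{L}_{diff}
\;+\; \lambda_{align}\,\mathcal{L}_{align}
\;+\; \lambda_{E}\,\lVert \Theta \rVert_2^2 ,
```

where $\mathcal{L}_{BPR}$ is the ranking loss on user-item interactions, $\mathcal{L}_{diff}$ the diffusion graph contrastive loss, $\mathcal{L}_{align}$ the ID-guided alignment loss, and $\Theta$ the trainable parameters.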

The findings are presented in Figures 3, 4, and 5.

Fig. 3 (from the original paper): line charts of Recall@20 and NDCG@20 on the Baby, Sports, and Video datasets under varying $\lambda_{diff}$; Sports scores highest overall and peaks at moderate $\lambda_{diff}$ values.

Figure 3 (from the original paper) shows the performance of DiffCL under various $\lambda_{diff}$ settings. The x-axis represents values of $\lambda_{diff}$, while the y-axis shows Recall@20 and NDCG@20 for the Baby, Sports, and Video datasets.

Analysis of $\lambda_{diff}$:

  • Optimal Range: For Baby and Sports, the performance (both R@20 and N@20) tends to peak around $\lambda_{diff} = 0.1$ or $0.2$ and then slightly decreases or stabilizes.

  • Video Dataset: The Video dataset shows a more pronounced peak around $\lambda_{diff} = 0.1$, indicating that a moderate weighting of the diffusion contrastive learning loss is most effective.

  • Significance: This suggests that while diffusion-based contrastive learning is beneficial, an excessively high weight might lead to over-regularization or over-emphasis on self-supervision, potentially distracting from the primary BPR objective. A simplified sketch of the diffusion-based contrastive step that this weight controls is given below.
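
As a rough illustration of the loss that $\lambda_{diff}$ weights: a diffusion-style augmentation adds scheduled Gaussian noise to embeddings (forward process), a small denoiser estimates that noise (reverse process), and the denoised embeddings serve as a second view for an InfoNCE contrast against the originals. The sketch below is a simplified, assumption-based rendering rather than the paper's implementation; the denoiser architecture, the linear noise schedule, the one-step reconstruction, and the temperature are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingDenoiser(nn.Module):
    """Tiny MLP that predicts the noise added to an embedding (illustrative only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the (normalized) diffusion step by simple concatenation.
        return self.net(torch.cat([x_t, t], dim=1))

def diffuse(x0: torch.Tensor, t: torch.Tensor):
    """Forward process with a toy linear schedule: more noise as t -> 1."""
    noise = torch.randn_like(x0)
    alpha = 1.0 - t                               # keep-signal coefficient
    x_t = alpha.sqrt() * x0 + (1.0 - alpha).sqrt() * noise
    return x_t, noise, alpha

def info_nce(view_a: torch.Tensor, view_b: torch.Tensor, temperature: float = 0.2):
    """Contrast matching rows of the two views (positives) against all other rows."""
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.T / temperature
    labels = torch.arange(a.size(0))
    return F.cross_entropy(logits, labels)

# Toy usage: noise the embeddings, denoise them, and contrast with the originals.
emb = torch.randn(256, 64, requires_grad=True)    # stand-in item/user embeddings
denoiser = EmbeddingDenoiser(64)
t = torch.rand(256, 1) * 0.5                      # sampled diffusion steps in (0, 0.5)
x_t, eps, alpha = diffuse(emb, t)
eps_pred = denoiser(x_t, t)
x_hat = (x_t - (1.0 - alpha).sqrt() * eps_pred) / alpha.sqrt()  # one-step estimate
loss_diff = info_nce(emb, x_hat) + F.mse_loss(eps_pred, eps)
loss_diff.backward()
```

The intuition for preferring such views over random dropout or uniform noise is that the positive pair is a denoised version of the same embedding, so the contrast is less likely to be dominated by the injected perturbation itself.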

    Fig. 4 (from the original paper): line charts of Recall@20 and NDCG@20 on the Baby, Sports, and Video datasets under varying $\lambda_{align}$.

Figure 4 (from the original paper) shows the performance of DiffCL under various $\lambda_{align}$ settings. The x-axis represents values of $\lambda_{align}$, while the y-axis shows Recall@20 and NDCG@20 for the Baby, Sports, and Video datasets.

Analysis of $\lambda_{align}$:

  • Optimal Range: For Baby and Sports, the performance generally increases as $\lambda_{align}$ increases, peaking around 0.4 to 0.6 before stabilizing or slightly declining.

  • Video Dataset: The Video dataset shows a strong positive correlation, with performance continuing to rise and peaking at $\lambda_{align} = 1.0$.

  • Significance: This indicates that ID-guided semantic alignment is a crucial component, and a stronger emphasis on aligning modal distributions (higher $\lambda_{align}$) is generally beneficial, especially for datasets like Video where semantic discrepancies might be more pronounced. A minimal sketch of one way such an alignment term can be formulated follows.
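
To make this term more concrete: the critique in Section 7.3 describes the mechanism as aligning Gaussian distributions of modal features to ID features. One hedged way to realize that idea, assuming batch-level Gaussian statistics and a KL-divergence penalty (the function names and the choice to detach the ID statistics are illustrative, not taken from the paper):

```python
import torch

def gaussian_kl(mu_p, var_p, mu_q, var_q, eps: float = 1e-6):
    """KL( N(mu_p, diag var_p) || N(mu_q, diag var_q) ), summed over dimensions."""
    var_p, var_q = var_p + eps, var_q + eps
    return 0.5 * (torch.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0).sum()

def id_guided_alignment_loss(id_emb, visual_emb, text_emb):
    """Pull the batch-level Gaussian statistics of each modality toward those of
    the ID embeddings, which act as a stable, modality-agnostic anchor."""
    id_emb = id_emb.detach()                     # treat ID statistics as the fixed anchor
    mu_id, var_id = id_emb.mean(dim=0), id_emb.var(dim=0)
    loss = 0.0
    for modal_emb in (visual_emb, text_emb):
        loss = loss + gaussian_kl(modal_emb.mean(dim=0), modal_emb.var(dim=0),
                                  mu_id, var_id)
    return loss

# Toy usage for a batch of 128 items with 64-dimensional embeddings.
id_e, vis_e, txt_e = (torch.randn(128, 64) for _ in range(3))
align_loss = id_guided_alignment_loss(id_e, vis_e, txt_e)
```

Whatever the exact formulation, $\lambda_{align}$ simply scales this term in the total objective, which is why larger values push the visual and textual representations closer to the ID space.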

    Fig. 5 (from the original paper): the performance of DiffCL under various $\lambda_E$ settings, measured by Recall@20 and NDCG@20 on the Baby, Sports, and Video datasets.

Figure 5 (from the original paper) shows the performance of DiffCL under various $\lambda_E$ settings. The x-axis represents values of $\lambda_E$, while the y-axis shows Recall@20 and NDCG@20 for the Baby, Sports, and Video datasets.

Analysis of $\lambda_E$:

  • General Trend: For all datasets, very small values of $\lambda_E$ (e.g., 0.01) lead to lower performance, suggesting that some L2 regularization is necessary to prevent overfitting.
  • Optimal Range: Performance tends to peak at moderate-to-high values of $\lambda_E$ (around 0.7 for Baby, 0.9 for Sports, and 1.0 for Video).
  • Significance: This confirms the importance of regularization for model stability and generalization. However, excessively high L2 regularization can also penalize useful features, leading to underfitting.

Overall Hyperparameter Insights: The experiments reveal that the optimal hyperparameter settings (specifically, the loss weights) are dataset-dependent. This emphasizes the need for careful tuning for each specific application. The findings highlight that balancing the contributions of contrastive learning, semantic alignment, BPR (implicit primary task), and regularization is critical for maximizing DiffCL's performance. The ability to find an optimal balance showcases the robustness and adaptability of the framework.
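
In practice this tuning can be carried out with a small grid search over the range reported above. The sketch below is generic scaffolding, not the paper's tuning code: `train_and_eval` is a placeholder that should run a full training pass and return validation Recall@20, and the exhaustive three-way product (1,331 runs) would typically be replaced by tuning one weight at a time while holding the others fixed.

```python
import random
from itertools import product

def train_and_eval(lambda_diff: float, lambda_align: float, lambda_e: float) -> float:
    """Stand-in for a full training run; should return validation Recall@20.
    Replace the body with actual model training and evaluation."""
    return random.random()

# The grid from the sensitivity study: {0.01, 0.1, 0.2, ..., 1.0}.
grid = [0.01] + [round(0.1 * i, 1) for i in range(1, 11)]

best_weights, best_score = None, float("-inf")
for lam_d, lam_a, lam_e in product(grid, grid, grid):
    score = train_and_eval(lam_d, lam_a, lam_e)
    if score > best_score:
        best_weights, best_score = (lam_d, lam_a, lam_e), score

print("best (lambda_diff, lambda_align, lambda_E):", best_weights,
      "Recall@20:", best_score)
```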

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduced DiffCL, a novel diffusion-based contrastive learning framework for multimodal recommendation. The framework addresses key challenges in MRSs, including data sparsity, multimodal noise, and semantic discrepancies across modalities. DiffCL innovates by leveraging a diffusion model to generate high-quality contrastive views, effectively mitigating noise during self-supervised learning. It further enhances semantic consistency by aligning diverse visual and textual semantic information using stable ID embeddings. Additionally, the integration of an Item-Item Graph significantly improves multimodal feature representations and alleviates the adverse effects of data sparsity. Comprehensive experiments on three real-world datasets consistently demonstrated the superiority and effectiveness of DiffCL against various state-of-the-art baselines.

7.2. Limitations & Future Work

The authors acknowledge that while DiffCL shows promising results, there's room for further research and optimization. The main future research direction proposed is:

  • Optimizing Diffusion Model Integration and Multi-Perspective Data Augmentation: The current work applies diffusion models to generate contrastive views within a specific stage. Future research aims to optimize the integration of diffusion models throughout the recommender system, extending their application beyond specific stages. By fully leveraging the powerful generative capabilities of diffusion models, the authors intend to perform data augmentation from multiple perspectives (not just for contrastive views) to achieve even superior recommendation outcomes. This suggests exploring DMs for synthetic interaction generation, user preference evolution, or even generating entire item descriptions/images to enrich sparse modalities.

7.3. Personal Insights & Critique

Personal Insights:

  • Novelty of Diffusion for Augmentation: The most compelling aspect of DiffCL is its innovative use of diffusion models for contrastive view generation. Instead of simple random noise or dropout, DMs offer a principled way to create diverse yet semantically consistent augmentations, which is a significant step forward for self-supervised learning in MRSs. This can inspire other applications where robust data augmentation is critical but traditional methods fall short.
  • Importance of ID-Guided Alignment: The ID-guided semantic alignment is a clever solution to a common problem in multimodal fusion. By anchoring alignment to stable ID embeddings, the framework provides a robust mechanism to ensure cross-modal consistency without sacrificing the unique contributions of each modality. This approach could be highly transferable to other multimodal learning tasks where a stable, modality-agnostic representation is available.
  • Holistic Problem-Solving: DiffCL doesn't just tackle one problem but strategically integrates solutions for noise mitigation, semantic alignment, and data sparsity within a unified framework. This holistic approach makes the model more robust and effective in real-world scenarios.

Critique / Areas for Improvement:

  • Computational Cost of Diffusion Models: While diffusion models offer high-quality generation, they are notoriously computationally intensive, especially during the reverse (denoising) process, which involves multiple iterative steps. The paper does not discuss the computational overhead of training DiffCL compared to baselines, particularly concerning the diffusion component. For large-scale recommender systems, this could be a significant practical limitation.
  • Sensitivity to Hyperparameters: The experiments show that the optimal weights ($\lambda_{diff}$, $\lambda_{align}$, $\lambda_E$) vary significantly across datasets. While this is common, it implies a potentially high hyperparameter tuning burden for new datasets or domains, which could limit its out-of-the-box applicability. Further research into adaptive weighting mechanisms or more robust tuning strategies might be beneficial.
  • Theoretical Justification for ID-Guided Alignment: While ID embeddings are "stable," a deeper theoretical analysis of why aligning Gaussian distributions of modal features to ID features is optimal, beyond empirical results, could strengthen the argument. For instance, what if the ID embedding itself is not the true "semantic centroid" of an item across all its multimodal representations?
  • Explainability of Diffusion-Based Views: While DMs generate "high-quality" views, it would be interesting to explore the interpretability of these generated contrastive views. How do they differ from simple random noise, and what specific semantic perturbations do they introduce that are beneficial for contrastive learning? Visualizing these augmented embeddings could provide more insight.
  • Scalability and Online Deployment: The paper focuses on offline evaluation. Discussions on the scalability of DiffCL for large-scale online recommendation systems (e.g., handling billions of items and users, real-time inference) would be valuable. The iterative nature of diffusion models might pose challenges for low-latency predictions.
