Modality-Independent Graph Neural Networks with Global Transformers for Multimodal Recommendation

Published: 12/19/2024

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study presents modality-independent GNNs to enhance multimodal recommendation performance by utilizing separate GNNs for different modalities. A sampling-based global transformer effectively integrates global information, addressing limitations of existing methods, with supe

Abstract

Multimodal recommendation systems can learn users' preferences from existing user-item interactions as well as the semantics of multimodal data associated with items. Many existing methods model this through a multimodal user-item graph, approaching multimodal recommendation as a graph learning task. Graph Neural Networks (GNNs) have shown promising performance in this domain. Prior research has capitalized on GNNs' capability to capture neighborhood information within certain receptive fields (typically denoted by the number of hops, $K$ ) to enrich user and item semantics. We observe that the optimal receptive fields for GNNs can vary across different modalities. In this paper, we propose GNNs with Modality-Independent Receptive Fields, which employ separate GNNs with independent receptive fields for different modalities to enhance performance. Our results indicate that the optimal $K$ for certain modalities on specific datasets can be as low as 1 or 2, which may restrict the GNNs' capacity to capture global information. To address this, we introduce a Sampling-based Global Transformer, which utilizes uniform global sampling to effectively integrate global information for GNNs. We conduct comprehensive experiments that demonstrate the superiority of our approach over existing methods. Our code is publicly available at https://github.com/CrawlScript/MIG-GT.

1. Bibliographic Information

1.1. Title

Modality-Independent Graph Neural Networks with Global Transformers for Multimodal Recommendation

1.2. Authors

Jun Hu, Bryan Hooi, Bingsheng He (all from School of Computing, National University of Singapore), and Yinwei Wei (School of Software, Shandong University). Bryan Hooi is specifically marked with an asterisk, indicating he might be the corresponding author or a key contributor. Their affiliations suggest a strong background in computer science, particularly in areas like graph neural networks, machine learning, and recommendation systems.

1.3. Journal/Conference

Published as a preprint on arXiv. The paper cites various top-tier conferences such as AAAI, ACM SIGIR, ACM MM, ICLR, NeurIPS, and WWW in its references, indicating the authors are targeting or have published in highly reputable venues in artificial intelligence, machine learning, and multimedia information retrieval.

1.4. Publication Year

2024

1.5. Abstract

Multimodal recommendation systems leverage user-item interactions and item-associated multimodal data (e.g., text, images) to understand user preferences. Many existing methods model this as a graph learning task using a multimodal user-item graph, with Graph Neural Networks (GNNs) showing promising results. A common approach is to use GNNs to capture neighborhood information within a certain receptive field (number of hops, $K$ ) to enrich user and item semantics. The authors observe that the optimal receptive field $K$ for GNNs can vary across different modalities. To address this, they propose Modality-Independent Receptive Fields (MIRF), which employs separate GNNs with independent $K$ values for each modality. Furthermore, recognizing that small optimal $K$ values (e.g., 1 or 2) can restrict GNNs' capacity to capture global information, they introduce a Sampling-based Global Transformer (SGT). This SGT uses uniform global sampling to efficiently integrate global context into the GNNs. Comprehensive experiments demonstrate the superiority of their approach over existing methods.

1.6. Original Source Link

https://arxiv.org/abs/2412.13994v1 This is a preprint publication on arXiv. PDF Link: https://arxiv.org/pdf/2412.13994v1.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is enhancing the performance of multimodal recommendation systems. Traditional recommendation systems primarily rely on historical user-item interactions. However, with the explosion of rich multimodal data (like text, images, videos) associated with items, there's a significant opportunity to improve recommendations by leveraging this semantic information.

The problem is important because more comprehensive and accurate recommendations directly translate to better user experience in applications like e-commerce and micro-video platforms, leading to increased user engagement and satisfaction.

Specific Challenges and Gaps in Prior Research:

Fixed Receptive Fields in GNNs: Many state-of-the-art multimodal recommendation systems model user-item interactions as graphs and employ Graph Neural Networks (GNNs) for representation learning. GNNs aggregate information from a node's local neighborhood within a specified number of hops, known as the receptive field $K$ . Prior research typically applies a single, uniform $K$ across all modalities (e.g., text, visual, learnable embeddings). The authors observe that different modalities might benefit from different neighborhood sizes, implying that a one-size-fits-all $K$ might be suboptimal.
Limited Global Information Capture in GNNs: When the optimal receptive field $K$ for certain modalities is small (e.g., 1 or 2 hops), GNNs are inherently limited in capturing global information from the entire graph. This can lead to local optima or a lack of understanding of broader item relationships or user preferences that exist beyond immediate neighbors.
Computational Cost of Global Models (Transformers): While Transformers excel at capturing global dependencies, their quadratic complexity with respect to the number of nodes makes them computationally prohibitive for large-scale graphs commonly found in recommendation systems.

Paper's Entry Point or Innovative Idea: The paper's innovation stems from two key observations:

Modality-Dependent Locality: Different modalities inherently capture information at different scales or levels of abstraction. For example, visual features might be highly localized, while textual features might require broader context. This suggests that the "reach" of information propagation (receptive field $K$ ) should be customized per modality.
Efficient Global Context Integration: There's a need to bridge the gap between GNNs' local aggregation and the desire for global context without incurring the prohibitive computational cost of full Transformers. The idea is to achieve global awareness through a computationally efficient sampling mechanism.

2.2. Main Contributions / Findings

The paper makes the following primary contributions:

Modality-Independent Receptive Fields (MIRF): The authors propose applying separate GNNs for each modality (learnable embedding, text, visual), each with its own independent receptive field $K$ . This allows each modality to optimally leverage neighborhood information according to its unique characteristics. Their empirical results (e.g., Figure 1) demonstrate that optimal $K$ values indeed vary across modalities (e.g., $K=3$ for learnable embeddings and text, $K=2$ for visual features on Amazon Baby dataset), validating the necessity of this approach.
Sampling-based Global Transformer (SGT): To address the limitation of GNNs in capturing global information, especially when optimal $K$ is small, they introduce an efficient Sampling-based Global Transformer. Instead of computing attention scores for all node pairs, SGT uniformly samples a small number of global nodes and computes attention only between the target node and these sampled nodes. This module effectively integrates global context while maintaining computational efficiency.
Transformer Unsmooth Regularization (TUR): To mitigate the potential "smoothing" effect caused by the Transformer's self-attention on sampled nodes, which could make node representations indistinguishable, they propose a Transformer Unsmooth Regularization ( $\mathcal{L}_{TUR}$ ). This regularization encourages a node's graph neighbors to score the node's own Transformer output above the outputs of the sampled global nodes, helping preserve local structure.
Comprehensive Experimental Validation: The proposed framework, Modality-Independent Graph Neural Networks with Global Transformers (MIG-GT), is extensively evaluated on three public Amazon datasets (Baby, Sports, Clothing). It consistently outperforms state-of-the-art (SOTA) baselines, including those employing complex denoising mechanisms or explicit item-item relation modeling. The paper also demonstrates the training efficiency of MIG-GT and its compatibility with contrastive learning.

Key Conclusions or Findings:

The optimal receptive field for GNNs in multimodal recommendation is indeed modality-dependent.
Integrating global information is crucial, especially when local receptive fields are small.
A sampling-based approach effectively makes Transformers applicable for global context integration in large-scale graph recommendation, bypassing quadratic complexity.
The proposed Transformer Unsmooth Regularization is beneficial for maintaining distinct node representations in the SGT.
MIG-GT achieves state-of-the-art performance with high efficiency, without relying on complex denoising or explicit item-item graph construction commonly found in other SOTA models.

3.1. Foundational Concepts

3.1.1. Recommendation Systems

Recommendation systems are information filtering systems that predict user preferences for items. They aim to suggest items (e.g., products, movies, articles) that users are most likely to enjoy or interact with.

Collaborative Filtering (CF): A widely used technique that makes predictions based on the preferences or behaviors of other similar users (user-based CF) or items (item-based CF). Matrix Factorization (MF) is a popular CF technique that decomposes the user-item interaction matrix into lower-dimensional user and item embedding matrices.
Implicit Feedback: In many recommendation scenarios, users don't explicitly rate items (e.g., giving a 5-star rating). Instead, interactions like clicks, views, purchases, or watch time are considered implicit feedback, indicating a positive preference. This type of feedback is common in large-scale systems.
Multimodal Data: Data that includes information from multiple modalities, such as text (e.g., item descriptions), images (e.g., product photos), audio, and video. Leveraging multimodal data provides a richer, more comprehensive understanding of items and users, which can lead to better recommendations.

3.1.2. Graph Neural Networks (GNNs)

Graph Neural Networks are a class of deep learning methods designed to operate on graph-structured data. They extend the concepts of neural networks to handle non-Euclidean data by aggregating information from a node's neighbors.

Nodes (Vertices) and Edges: A graph consists of nodes (e.g., users, items) and edges (e.g., user-item interactions, item-item relationships) that connect them.
Message Passing: The core mechanism of GNNs. Each node updates its representation by aggregating messages from its neighbors and combining them with its own previous representation. This process is often iterated for several hops.
Receptive Field (Number of Hops, $K$ ): In GNNs, the receptive field refers to the extent of the neighborhood from which a node gathers information. If a GNN layer processes information from 1-hop neighbors, and $K$ layers are stacked, a node's final representation incorporates information up to $K$ hops away. A larger $K$ allows a node to capture more global information but can also lead to issues like over-smoothing, where distinct node representations become too similar.
Graph Convolutional Networks (GCNs): A specific type of GNN that adapts convolutional operations to graphs. LightGCN is a simplified GCN for recommendation that removes non-linear activation functions and feature transformations to streamline message passing.
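
The notion of a $K$ -hop receptive field can be illustrated with a small sketch. It is not taken from the paper: the helper name `nodes_within_k_hops` and the 5-node path graph are hypothetical, chosen only to show how the set of nodes that can influence a target grows with $K$ .

```python
import numpy as np

def nodes_within_k_hops(A, source, K):
    """Return indices of nodes reachable from `source` in at most K hops."""
    reach = np.zeros(A.shape[0], dtype=bool)
    reach[source] = True
    for _ in range(K):
        # A node joins the set if it has at least one already-reached neighbor.
        reach = reach | (A[reach].sum(axis=0) > 0)
    return np.flatnonzero(reach)

# Hypothetical toy graph: a path 0 - 1 - 2 - 3 - 4.
path = np.zeros((5, 5), dtype=int)
for i in range(4):
    path[i, i + 1] = path[i + 1, i] = 1

one_hop = nodes_within_k_hops(path, source=0, K=1)
two_hop = nodes_within_k_hops(path, source=0, K=2)
four_hop = nodes_within_k_hops(path, source=0, K=4)
```

On this path graph, node 0 only sees the whole graph once $K$ reaches 4, which is the kind of limitation a small $K$ imposes on global information.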

3.1.3. Transformers and Self-Attention

Transformers are neural network architectures introduced in "Attention Is All You Need" (Vaswani et al., 2017) that have revolutionized natural language processing and other fields.

Self-Attention Mechanism: The core component of Transformers. It allows each element in a sequence (or node in a graph, if adapted) to weigh the importance of all other elements when computing its own representation. This mechanism enables Transformers to capture long-range dependencies effectively.
Query, Key, Value (Q, K, V): In self-attention, input representations are transformed into three different vectors:
- Query ( $Q$ ): Represents what each element is "looking for."
- Key ( $K$ ): Represents what each element "offers."
- Value ( $V$ ): Represents the content of each element. The attention score between a query and a key determines how much the corresponding value contributes to the output.
Attention Formula: The fundamental scaled dot-product attention mechanism is given by: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$ , $K$ , $V$ are matrices of query, key, and value vectors respectively, $d_k$ is the dimension of the key vectors (dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension and saturating the softmax, which would yield very small gradients), and $\mathrm{softmax}$ normalizes the attention weights.
Global Information Capture: Transformers naturally excel at capturing global dependencies because each element can attend to all other elements, unlike GNNs which are typically restricted to local neighborhoods.
Computational Complexity: A major drawback of standard Transformers is their quadratic time and space complexity with respect to the input sequence length (or number of nodes in a graph), due to the $QK^T$ matrix multiplication. This makes them impractical for very large inputs.
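
As a minimal NumPy sketch of the attention formula above (function names and the random toy inputs are illustrative, not from the paper), the score matrix has one entry per query-key pair, which is where the quadratic cost comes from:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v). Returns (output, weights)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, m): one score per query-key pair
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy inputs: 4 elements, key dimension 8, value dimension 5.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 5))
out, weights = scaled_dot_product_attention(Q, K, V)
```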

3.1.4. Bayesian Personalized Ranking (BPR) Loss

BPR is a widely used pairwise ranking loss function for implicit feedback recommendation. Its objective is to maximize the difference between the predicted preference of a user for an interacted item (positive sample) and a non-interacted item (negative sample).

Principle: For a given user, BPR assumes that the user prefers any interacted item over any non-interacted item.
Optimization: It minimizes the negative log-likelihood of this pairwise preference.
Formula: The BPR loss is defined as: $ \mathcal{L}_{BPR} = - \sum_{u \in U} \sum_{i \in I_u^+} \sum_{j \in I \setminus I_u^+} \log \sigma \left( \hat{r}_{ui} - \hat{r}_{uj} \right) $ where:
- $U$ is the set of all users.
- $I_u^+$ is the set of items interacted with by user $u$ .
- $I \setminus I_u^+$ is the set of items not interacted with by user $u$ .
- $\hat{r}_{ui}$ is the predicted preference score of user $u$ for item $i$ .
- $\hat{r}_{uj}$ is the predicted preference score of user $u$ for item $j$ .
- $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function.
The paper uses a slightly different form, sampling the negative item $v_k$ directly from the graph: $ \mathcal{L}_{BPR} = - \sum_{B_{ij}=1} \mathbb{E}_{v_k \sim p(v)} \log \sigma \big( \tilde{u}_i^\prime \tilde{v}_j - \tilde{u}_i^\prime \tilde{v}_k \big) $ Here, $B_{ij}=1$ indicates an observed interaction between user $i$ and item $j$ , $\tilde{u}_i^\prime \tilde{v}_j$ is the preference score (dot product of embeddings), and $v_k \sim p(v)$ means a negative item $v_k$ is sampled from the item distribution.
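
The sampled form of the BPR loss can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: negatives are drawn uniformly from all items (one possible choice of $p(v)$ ), observed interactions are not filtered out of the negatives, and all names are hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def bpr_loss(user_emb, item_emb, pos_pairs, rng):
    """pos_pairs: (n, 2) array of observed (user index, item index) pairs."""
    users, pos_items = pos_pairs[:, 0], pos_pairs[:, 1]
    # Assumption: one uniformly sampled negative item per observed pair.
    neg_items = rng.integers(0, item_emb.shape[0], size=len(pos_pairs))
    pos_scores = np.einsum("nd,nd->n", user_emb[users], item_emb[pos_items])
    neg_scores = np.einsum("nd,nd->n", user_emb[users], item_emb[neg_items])
    return -np.sum(log_sigmoid(pos_scores - neg_scores))

rng = np.random.default_rng(0)
user_emb = rng.normal(size=(3, 4))
item_emb = rng.normal(size=(5, 4))
pos_pairs = np.array([[0, 1], [1, 3], [2, 0]])
loss = bpr_loss(user_emb, item_emb, pos_pairs, rng)
```

With all-zero embeddings every score difference is 0, so each pair contributes $\log 2$ ; training lowers the loss by ranking observed items above sampled ones.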

3.2. Previous Works

The paper categorizes related work into three main areas:

3.2.1. Graph Neural Networks for Recommendation

This area focuses on using GNNs to model user-item interactions as bipartite graphs to learn user and item embeddings.

GCMC (van den Berg, Kipf, and Welling 2017): One of the early works applying Graph Convolutional Networks (GCNs) to matrix completion for recommendation, essentially building an autoencoder on the user-item graph.
PinSage (Ying et al. 2018): Utilized GNNs with sampling strategies to handle large-scale datasets, specifically for Pinterest's recommendation system.
NGCF (Wang et al. 2019): Designed GNNs to explicitly capture high-order connectivity in the user-item interaction graph, propagating embeddings across multiple hops to enrich representations.
LightGCN (He et al. 2020): A highly influential and simplified GCN for recommendation. It argues that the most critical components of GCNs for recommendation are neighbor aggregation and feature propagation, while non-linear transformations and activation functions might introduce noise or complexity without significant benefits.
- LightGCN Propagation: A generic GNN layer computes the layer- $(k+1)$ embedding of a node $u$ (or $i$ ) by aggregating over its neighbors: $ e_u^{(k+1)} = \mathrm{AGG}\left(e_u^{(k)}, \{e_v^{(k)} \mid v \in \mathcal{N}_u\}\right) $ Specifically, for LightGCN: $ e_u^{(k+1)} = \sum_{v \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u||\mathcal{N}_v|}} e_v^{(k)} $ The final embedding is a weighted sum of embeddings from all layers: $E = \sum_{k=0}^K \alpha_k E^{(k)}$ .
UltraGCN (Mao et al. 2021): Proposed an ultra-simplified GCN for recommendation that bypasses explicit GNN operations for message passing, instead using a constraint-based loss function to implicitly enforce neighborhood aggregation.
ApeGNN (Zhang et al. 2023): Focuses on adaptively aggregating information based on local graph structures, allowing for more diverse pattern capture.
MGDN (Hu et al. 2024): Mentioned as a generalization of LightGCN, offering flexible control over balancing self-information and neighbor information. This is the base GNN model used in the proposed MIG-GT framework. The propagation rule in MGDN (Equation 3 in the paper) explicitly shows how it blends self-information and neighbor information across $K$ hops.
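
The LightGCN propagation rule listed above can be sketched in a few lines of NumPy. This is a dense toy version for illustration only; the function name is hypothetical, and the uniform layer weights $\alpha_k = 1/(K+1)$ are the default choice reported for LightGCN, used here as an assumption.

```python
import numpy as np

def lightgcn_embeddings(A, E0, num_layers, alphas=None):
    """A: (n, n) symmetric adjacency without self-loops, E0: (n, d) layer-0 embeddings."""
    deg = A.sum(axis=1)
    # 1/sqrt(degree), with isolated nodes mapped to 0 to avoid division by zero.
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1)), 0.0)
    A_norm = inv_sqrt[:, None] * A * inv_sqrt[None, :]
    if alphas is None:
        # Assumption: uniform layer weights.
        alphas = np.full(num_layers + 1, 1.0 / (num_layers + 1))
    layers = [E0]
    for _ in range(num_layers):
        layers.append(A_norm @ layers[-1])   # no weights, no non-linearity
    return sum(a * E for a, E in zip(alphas, layers))

# Toy graph: one user (node 0) connected to one item (node 1).
A_toy = np.array([[0.0, 1.0], [1.0, 0.0]])
E0_toy = np.eye(2)
E_final = lightgcn_embeddings(A_toy, E0_toy, num_layers=1)
```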

3.2.2. Graph Transformers

These methods attempt to apply the power of Transformers to graph data, often facing the challenge of quadratic complexity.

SGFormer (Wu et al. 2023a) and Polynormer (Deng, Yue, and Zhang 2024): These works address the complexity issue by removing the softmax normalization in attention, reducing complexity to linear. They also combine Graph Transformer outputs with GNN models, differing in their fusion strategies. The current paper replaces its Sampling-based Global Transformer (SGT) with these variants in an ablation study.

3.2.3. Multimodal Recommendation

This field focuses on integrating multimodal data (e.g., text, images) into recommendation systems.

Early Approaches (He and McAuley 2016b, Liu, Wu, and Wang 2017): Extended Bayesian Personalized Ranking (BPR) by incorporating visual features, e.g., VBPR.
VECF (Chen et al. 2019): Used pre-trained CNNs (VGG) for image feature extraction and employed region-specific attention for item visual features.
GNN-based Multimodal Models:
- MMGCN (Wei et al. 2019): A foundational work that builds modality-aware graphs and applies separate GNNs for each modality. The learned modality-specific features are then aggregated. This model is a direct predecessor to the Modality-Independent Receptive Fields (MIRF) concept, but MMGCN typically uses a fixed $K$ for all GNNs.
- GRCN (Wei et al. 2020): Refined user-item graph structures by sieving out misleading connections, focusing on denoising the interaction graph.
- DualGNN (Wang et al. 2023): Introduced a user co-occurrence graph and a feature preference module to capture dynamic multimodal item features.
- SLMRec (Tao et al. 2023): A recent self-supervised learning approach for multimedia recommendation.
- LATTICE (Zhang et al. 2021): Performed modality-aware structure learning, constructing item-item graphs for each modality and combining them.
- FREEDOM (Zhou and Shen 2023): Simplified previous approaches by freezing the item-item graph structure and denoising the user-item interaction graph. This model is presented as a strong SOTA baseline.

3.3. Technological Evolution

The field of recommendation systems has evolved from:

Traditional Collaborative Filtering (CF) and Matrix Factorization (MF): Relying solely on user-item interaction data.
Multimodal CF: Incorporating auxiliary information like visual features (e.g., VBPR), text, etc., alongside CF, typically by concatenating features or adding them to the scoring function.
GNN-based Recommendation: Modeling user-item interactions as graphs and using GNNs to capture higher-order relationships and propagate information (LightGCN, NGCF).
GNN-based Multimodal Recommendation: Combining GNNs with multimodal data, often by applying GNNs to modality-specific graphs or features (MMGCN, GRCN, DualGNN, LATTICE, FREEDOM). These often focus on denoising or explicitly modeling item-item relationships.
GNNs with Global Context (e.g., Transformers): The latest trend is to integrate the strengths of GNNs (local aggregation) with the global context capabilities of Transformers, while addressing the computational challenges of Transformers on large graphs. This paper's work (MIG-GT) fits directly into this cutting-edge trend.

3.4. Differentiation Analysis

Compared to the main methods in related work, MIG-GT offers several core differences and innovations:

Modality-Specific Receptive Fields: Unlike most GNN-based multimodal recommendation systems (e.g., MMGCN, GRCN, LATTICE, FREEDOM) that typically use a uniform number of GNN layers (and thus a uniform receptive field $K$ ) for all modalities, MIG-GT explicitly recognizes and exploits the modality-dependent nature of optimal receptive fields. This allows each modality to learn representations at its most effective structural scale.
Efficient Global Context Integration: While some Graph Transformers (e.g., SGFormer, Polynormer) aim to reduce quadratic complexity, MIG-GT introduces a novel Sampling-based Global Transformer (SGT). This module directly tackles the scalability issue by uniformly sampling a small subset of global nodes for attention computation, offering a practical and efficient way to infuse global information into GNN-learned representations. This is a crucial differentiator from full graph Transformers.
Simplicity and Effectiveness without Complex Graph Construction/Denoising: Many recent SOTA multimodal recommendation models (e.g., LATTICE, FREEDOM) achieve high performance by explicitly modeling complex item-item relationships or employing sophisticated denoising mechanisms for interaction graphs. MIG-GT, in contrast, applies its innovations directly to the original user-item interaction graph without such complexities. Its superior or matching performance demonstrates that modality-independent receptive fields and efficient global information integration are powerful and perhaps more fundamental improvements.
Addressing the GNN Smoothing Problem in a Global Context: The Transformer Unsmooth Regularization (TUR) is a novel contribution designed specifically to counteract the potential over-smoothing or indistinguishability of representations that could arise from the SGT's global attention mechanism, ensuring that local distinctiveness is preserved.

4. Methodology

4.1. Principles

The core idea behind the Modality-Independent Graph Neural Networks with Global Transformers (MIG-GT) framework is twofold:

Modality-Specific Locality: Recognize that different data modalities (e.g., visual, textual, learnable embeddings) might benefit from aggregating information from different neighborhood sizes (receptive fields or number of hops, $K$ ) in a graph. Instead of using a fixed $K$ for all modalities, the system should allow each modality to have its own independent optimal $K$ .
Efficient Global Context: Address the limitation of GNNs, which are primarily designed for local information aggregation, in capturing global context, especially when optimal local receptive fields are small. This global context should be integrated efficiently, overcoming the computational bottlenecks of standard Transformers on large graphs.

4.2. Core Methodology In-depth (Layer by Layer)

The MIG-GT framework processes multimodal user-item graphs to generate enhanced user and item representations. It consists of two main components: Modality-Independent Receptive Fields (MIRF) and Sampling-based Global Transformer (SGT).

The overall framework is illustrated in Figure 3.

Figure 3: Overall Framework of Modality-Independent Graph Neural Networks with Global Transformers (MIG-GT). The figure is a schematic of the overall framework combining graph neural networks with a global Transformer. It shows the multimodal user-item graph, the per-modality graphs (embedding, text, visual) and their feature extraction; message passing uses $MGDN(hop=K)$ , and a global-sampling-based Transformer is introduced to integrate global information.

4.2.1. Problem Definition

The paper defines the recommendation task as predicting unobserved user preferences over items.

Let $U$ be the set of users and $V$ be the set of items.
An interaction matrix $B \in \mathbb{R}^{|U| \times |V|}$ indicates observed interactions ( $B_{ij}=1$ ) or non-interactions ( $B_{ij}=0$ ).
Each item $v_j \in V$ is associated with multimodal data (text and image features).
The goal is to learn $d$ -dimensional user and item representations (embeddings) such that their dot product, $\tilde{u}_i^\prime \tilde{v}_j$ , reflects user $u_i$ 's preference for item $v_j$ .

4.2.2. Multimodal User-Item Graph

The system models users and items as vertices in a single homogeneous graph $\mathcal{G}$ .

The total number of vertices is $|N| = |U| + |V|$ .
The first $|U|$ vertices represent users, and the subsequent $|V|$ vertices represent items.
Observed user-item interactions form the edges of the graph.
The adjacency matrix $A \in \{0, 1\}^{|N| \times |N|}$ is defined as: $A = \begin{pmatrix} 0 & B \\ B^\prime & 0 \end{pmatrix}$ where $B^\prime$ is the transpose of $B$ . This matrix represents the connectivity between users and items (and vice-versa).
Multimodal Features:
- Each item vertex has a text feature vector and a visual feature vector, extracted using pre-trained models.
- Each user vertex is assigned a $d$ -dimensional learnable embedding, which is initialized randomly and optimized during training.
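
A small sketch of how the block adjacency matrix $A$ can be assembled from the interaction matrix $B$ is given below. The helper name and the toy $B$ are hypothetical, and a dense array is used only for readability; a real implementation would more plausibly use a sparse format.

```python
import numpy as np

def build_bipartite_adjacency(B):
    """Stack B and its transpose into the (|U|+|V|) x (|U|+|V|) adjacency matrix."""
    n_users, n_items = B.shape
    n = n_users + n_items
    A = np.zeros((n, n), dtype=B.dtype)
    A[:n_users, n_users:] = B      # user -> item edges (upper-right block)
    A[n_users:, :n_users] = B.T    # item -> user edges (lower-left block)
    return A

# Toy interactions: 2 users, 3 items.
B = np.array([[1, 0, 1],
              [0, 1, 0]])
A = build_bipartite_adjacency(B)
```

Item $v_j$ sits at vertex index $j + |U|$ , so user 0 interacting with item 0 appears as an edge between vertices 0 and 2 in this toy case.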

4.2.3. Modality-Independent Receptive Fields (MIRF)

This component applies separate GNNs for different modalities, each with its own determined receptive field $K^{(M)}$ .

4.2.3.1. Feature Encoding

For each modality $M \in \{\text{Learnable Embedding (E), Text (T), Visual (V)}\}$ , raw vertex features are denoted as $X^{(M)} \in \mathbb{R}^{|N| \times d^{(M)}}$ , where $d^{(M)}$ is the specific dimension for modality $M$ .
To prepare for message passing, these raw features are encoded into a common $d$ -dimensional space using a Multilayer Perceptron (MLP): $\tilde{X}^{(M)} = \mathrm{MLP}(X^{(M)}) \in \mathbb{R}^{|N| \times d}$
- Exception for Learnable Embeddings: For the learnable embedding modality, $X^{(E)} \in \mathbb{R}^{|N| \times d}$ is already in the target dimension and is directly optimized, so no MLP is needed: $\tilde{X}^{(E)} = X^{(E)}$ .
- Handling Missing Features: For modalities where certain vertex types lack features (e.g., user vertices typically don't have text or visual features), their corresponding encoded feature vectors in $\tilde{X}^{(M)}$ are simply assigned zero vectors of dimension $d$ .
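
The encoding step for a feature modality such as text can be sketched as follows. The MLP architecture is not specified in this summary, so the single hidden layer with ReLU is an assumption, as are all names and sizes. Note that zero rows are written explicitly for users rather than passing zero inputs through the MLP, since bias terms would otherwise make them non-zero.

```python
import numpy as np

def mlp(X, W1, b1, W2, b2):
    # Assumed architecture: one hidden layer with ReLU.
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def encode_modality(X_items, n_users, params):
    """Encode item features to d dimensions and pad user rows with zeros."""
    encoded_items = mlp(X_items, *params)
    d = encoded_items.shape[1]
    # Users come first in the vertex ordering and have no text/visual features.
    return np.vstack([np.zeros((n_users, d)), encoded_items])

rng = np.random.default_rng(0)
n_users, n_items, d_raw, d_hidden, d = 2, 3, 6, 8, 4
params = (rng.normal(size=(d_raw, d_hidden)), np.zeros(d_hidden),
          rng.normal(size=(d_hidden, d)), np.zeros(d))
X_text_items = rng.normal(size=(n_items, d_raw))
X_text_encoded = encode_modality(X_text_items, n_users, params)
```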

4.2.3.2. Message Propagation with MGDN

The paper uses MGDN (Hu et al. 2024) for message propagation. MGDN generalizes LightGCN and propagates features without additional transformations in its layers.

Normalized Adjacency Matrix: A normalized adjacency matrix $\hat{A}$ is computed, which is shared across all modalities: $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ where:
- $\tilde{A} = A + I$ is the adjacency matrix $A$ with self-loops added (identity matrix $I$ ).
- $\tilde{D}$ is the degree matrix of $\tilde{A}$ , where $\tilde{D}_{ii} = \sum_{j=0}^{|N|} \tilde{A}_{ij}$ (sum of row $i$ in $\tilde{A}$ ).
MGDN Propagation Formula: MGDN learns vertex representations by incorporating neighbor information within $K^{(M)}$ hops. The final output $\mathcal{Z}^{(M)}$ for modality $M$ is computed as: $\mathcal{Z}^{(M)} = f_{MGDN}(\tilde{X}^{(M)}, A) = \left(\beta^{K^{(M)}} \hat{A}^{K^{(M)}} + \sum_{k=0}^{K^{(M)}-1} \alpha \beta^k \hat{A}^k \right) \tilde{X}^{(M)} / \Gamma$ where:
- $\tilde{X}^{(M)}$ is the initial encoded feature matrix for modality $M$ .
- $K^{(M)}$ is the modality-independent receptive field (number of hops) for modality $M$ .
- $\alpha$ and $\beta$ are hyperparameters that control the influence of the initial features (self-information) and propagated features (neighbor information), respectively.
- $\Gamma$ is a normalization term to ensure that the sum of coefficients for $\hat{A}^k \tilde{X}^{(M)}$ is 1.0: $\Gamma = \beta^{K^{(M)}} + \sum_{k=0}^{K^{(M)}-1} \alpha \beta^k$
Step-wise Computation for Efficiency: For practical computation, $\mathcal{Z}^{(M)}$ is calculated iteratively:
1. Initialize the 0-hop representation: $H^{(M, 0)} = \tilde{X}^{(M)}$
2. For $k=1, \dots, K^{(M)}$ , update the representation by mixing propagated and initial features: $H^{(M, k)} = \beta \hat{A} H^{(M, k-1)} + \alpha H^{(M, 0)}$
3. The final modality-specific representation is the $K^{(M)}$ -hop representation, normalized by $\Gamma$ : $\mathcal{Z}^{(M)} = H^{(M, K^{(M)})}/\Gamma$
Receptive Field Selection: The optimal modality-independent receptive fields $K^{(E)}$ , $K^{(T)}$ , and $K^{(V)}$ are selected via grid search on the validation set, as empirically shown to be feasible and consistent with test set performance.
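
The following sketch checks numerically that the step-wise computation reproduces the closed-form MGDN expression. It follows the formulas as stated in this section; the function names, the toy graph, and the values of $\alpha$ and $\beta$ are illustrative assumptions, and dense matrices are used only for brevity.

```python
import numpy as np

def normalized_adjacency(A):
    # A_hat = D^{-1/2} (A + I) D^{-1/2}; self-loops guarantee degree >= 1.
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def mgdn_gamma(K, alpha, beta):
    return beta ** K + sum(alpha * beta ** k for k in range(K))

def mgdn_iterative(X, A_hat, K, alpha, beta):
    H0 = X
    H = H0
    for _ in range(K):
        H = beta * (A_hat @ H) + alpha * H0   # mix propagated and initial features
    return H / mgdn_gamma(K, alpha, beta)

def mgdn_closed_form(X, A_hat, K, alpha, beta):
    P = beta ** K * np.linalg.matrix_power(A_hat, K)
    for k in range(K):
        P = P + alpha * beta ** k * np.linalg.matrix_power(A_hat, k)
    return P @ X / mgdn_gamma(K, alpha, beta)

# Toy bipartite graph: 2 users (vertices 0-1), 3 items (vertices 2-4).
A = np.array([[0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0],
              [1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0]], dtype=float)
A_hat = normalized_adjacency(A)
X = np.random.default_rng(0).normal(size=(5, 4))
Z_iter = mgdn_iterative(X, A_hat, K=3, alpha=0.1, beta=0.9)
Z_closed = mgdn_closed_form(X, A_hat, K=3, alpha=0.1, beta=0.9)
```

The iterative version needs only $K^{(M)}$ matrix-feature products and never forms powers of $\hat{A}$ . As a side observation from the geometric sum, $\Gamma = 1$ whenever $\alpha = 1 - \beta$ .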

4.2.3.3. Multimodal Representation Pooling

After obtaining modality-independent vertex representations $\mathcal{Z}^{(E)}$ , $\mathcal{Z}^{(T)}$ , and $\mathcal{Z}^{(V)}$ , they are combined using sum-pooling to form the overall multimodal vertex representations $Z \in \mathbb{R}^{|N| \times d}$ : $Z = \mathcal{Z}^{(E)} + \mathcal{Z}^{(T)} + \mathcal{Z}^{(V)}$ Each row $z_i$ in $Z$ represents the multimodal embedding for the $i$ -th vertex (user or item).

4.2.4. Sampling-Based Global Transformer (SGT)

This module addresses the GNNs' limitation in capturing global information by efficiently integrating global context using a simplified Transformer.

4.2.4.1. Global Sampling

To avoid the quadratic complexity of a full Transformer, for each vertex $z_i$ , a small number of vertices are uniformly sampled from the entire graph.

For a target vertex $z_i$ , a matrix $S_i \in \mathbb{R}^{(C+1) \times d}$ is constructed.
The first row of $S_i$ is $s_{i1} = z_i$ (the target vertex's representation).
The subsequent $C$ rows, $s_{ij}$ (for $1 < j \le C+1$ ), are representations $z_k$ sampled uniformly from $Z$ , i.e., $k \sim \mathrm{Uniform}(1, |N|)$ .
This sampling is done independently for each target vertex $z_i$ and at each training step, ensuring exposure to diverse global contexts.
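
The construction of $S_i$ can be sketched as below. The summary does not say whether sampling is with or without replacement or whether the target itself may be drawn, so this sketch assumes sampling with replacement over all vertices; names and sizes are hypothetical, and indices are 0-based.

```python
import numpy as np

def build_sampled_sets(Z, C, rng):
    """Return S of shape (n, C+1, d): row 0 is the target, rows 1..C are samples."""
    n, d = Z.shape
    # Assumption: uniform sampling with replacement, drawn afresh on every call.
    sampled_idx = rng.integers(0, n, size=(n, C))
    S = np.empty((n, C + 1, d), dtype=Z.dtype)
    S[:, 0] = Z                # s_{i1} = z_i
    S[:, 1:] = Z[sampled_idx]  # C uniformly sampled vertex representations
    return S, sampled_idx

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 4))    # toy multimodal representations, |N| = 6, d = 4
S, sampled_idx = build_sampled_sets(Z, C=3, rng=rng)
```

With this layout, attention is computed over $(C+1)^2$ pairs per target vertex, so the total number of attention scores is $|N|(C+1)^2$ instead of $|N|^2$ .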

4.2.4.2. Simplified Transformer

A simplified Transformer performs self-attention on $S_i$ to enrich the semantics of these representations, resulting in $T_i \in \mathbb{R}^{(C+1) \times d}$ : $T_i = (1 - \gamma) \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) \mathcal{V} + \gamma S_i$ where:

$\mathrm{softmax}$ denotes row-wise softmax normalization.
$\gamma$ is a hyperparameter ( $0 \le \gamma \le 1$ ) controlling the residual connection, allowing flexible integration of the Transformer's output with the original input $S_i$ .
$Q = S_i W^{(Q)}$ is the Query matrix.
$K = S_i W^{(K)}$ is the Key matrix.
$\mathcal{V} = S_i$ is the Value matrix (the paper states $\mathcal{V} = S_i$ , implying no separate linear transformation for Value, unlike standard Transformers, making it "simplified").
$W^{(Q)}, W^{(K)} \in \mathbb{R}^{d \times d_{att}}$ are learnable weight matrices, where $d_{att}$ is the attention dimension (hyperparameter). The $\sqrt{d}$ in the denominator is used for scaling the dot product, where $d$ is the embedding dimension.
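The formula for $T_i$ can be traced with a small NumPy sketch for a single target vertex. The helper names, the random weights, and the toy sizes ($C=3$, $d=4$, $d_{att}=2$) are illustrative assumptions; only the structure of the computation follows the definitions above.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def simplified_transformer(S_i, W_q, W_k, gamma):
    """T_i = (1 - gamma) * softmax(Q K^T / sqrt(d)) V + gamma * S_i, with V = S_i."""
    d = S_i.shape[1]
    Q = S_i @ W_q                               # (C+1, d_att)
    K = S_i @ W_k                               # (C+1, d_att)
    attn = softmax_rows(Q @ K.T / np.sqrt(d))   # (C+1, C+1), each row sums to 1
    return (1.0 - gamma) * (attn @ S_i) + gamma * S_i

rng = np.random.default_rng(0)
C, d, d_att = 3, 4, 2                           # toy sizes (assumed)
S_i = rng.normal(size=(C + 1, d))
W_q = rng.normal(size=(d, d_att))
W_k = rng.normal(size=(d, d_att))
T_i = simplified_transformer(S_i, W_q, W_k, gamma=0.8)
print(T_i.shape)                                # (4, 4)
```

With $\gamma = 1$ the output equals $S_i$ exactly, and with $\gamma = 0$ every output row is a convex combination of the rows of $S_i$, which is why $\gamma$ acts as a dial between the original and the attention-mixed representations.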

4.2.4.3. Final Vertex Representation

The first row of $T_i$ , $T_{i1}$ , which corresponds to the target vertex $z_i$ , is extracted as its final, globally-enriched representation $\tilde{z}_i$ .

The final vertex representation matrix is $\tilde{Z} \in \mathbb{R}^{|N| \times d}$ , where $\tilde{z}_i = T_{i1}$ .
User and item representations are then denoted as $\tilde{u}_i = \tilde{z}_i$ and $\tilde{v}_j = \tilde{z}_{j+|U|}$ , respectively.
The other rows $\{T_{ij} \mid j > 1\}$ (corresponding to sampled global vertices) are not used as final representations but contribute to the Transformer Unsmooth Regularization.

4.2.4.4. Transformer Unsmooth Regularization (TUR)

To prevent the self-attention mechanism from making the representations in $T_i$ too similar to one another (over-smoothing), Transformer Unsmooth Regularization ( $\mathcal{L}_{TUR}$ ) is introduced. It uses a graph neighbor of the target vertex as an anchor and encourages the Transformer's output for the target, $T_{i1}$ , to remain distinguishable from its outputs for the uniformly sampled vertices, $T_{ij}$ with $j > 1$ .

For each vertex $i$ , a neighbor $k$ is sampled (i.e., $A_{ik}=1$ ).
The regularization loss is applied: $\mathcal{L}_{TUR} = - \sum_{A_{ik}=1} \log \left( \frac{\exp(\tilde{z}_k^\prime T_{i1})}{\sum_{j=1}^{C+1} \exp(\tilde{z}_k^\prime T_{ij})} \right)$ where:
- $A_{ik}=1$ indicates that vertex $k$ is a neighbor of vertex $i$ .
- $\tilde{z}_k$ is the final representation of the neighbor $k$ .
- $T_{i1}$ is the enriched representation of the target vertex $i$ from $T_i$ .
- $T_{ij}$ are the enriched representations of all $C+1$ vertices in $T_i$ (target vertex $i$ and $C$ sampled global vertices).

This loss is a softmax cross-entropy over the $C+1$ rows of $T_i$ : it pushes the dot product between the neighbor $\tilde{z}_k$ and the target's enriched representation $T_{i1}$ up relative to the dot products with $T_{ij}$ for $j>1$ . As a result, the target's output stays close to its true neighbors and distinct from the randomly sampled vertices after passing through the global sampling Transformer.
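For a single (target, neighbor) pair, the TUR term reduces to a negative log-softmax over $C+1$ dot products. The sketch below assumes NumPy and a hypothetical helper name `tur_loss_single`; the full loss would sum this quantity over all pairs with $A_{ik}=1$.

```python
import numpy as np

def tur_loss_single(z_k, T_i):
    """-log softmax(T_i @ z_k)[0]: the target row (index 0) should score highest."""
    logits = T_i @ z_k                         # (C+1,) dot products with the neighbor
    logits = logits - logits.max()             # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]

# If every row of T_i is identical (fully smoothed), the loss equals log(C+1).
smoothed = np.ones((4, 3))
print(tur_loss_single(np.array([0.5, -1.0, 2.0]), smoothed))

# If only the target row aligns with the neighbor, the loss is close to zero.
distinct = np.array([[10.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
print(tur_loss_single(np.array([1.0, 0.0]), distinct))
```

The first case shows why the term counteracts smoothing: when the Transformer outputs collapse to the same vector, the loss sits at its uninformative value $\log(C+1)$ and can only decrease by separating the target row from the sampled rows.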

4.2.5. Model Optimization

The model is optimized using the Adam optimizer with a combined loss function: $\mathcal{L}_{rec} = \mathcal{L}_{rank}(\tilde{Z}) + \mathcal{L}_{TUR}(\tilde{Z}, T) + \Psi_{L2} \mathcal{L}_{L2}(\tilde{Z})$ where:

$\mathcal{L}_{rank}(\tilde{Z})$ is the ranking loss. The paper adopts the popular Bayesian Personalized Ranking (BPR) loss: $\mathcal{L}_{BPR} = - \sum_{B_{ij}=1} \mathbb{E}_{v_k \sim p(v)} \log \sigma \big( \tilde{u}_i^\prime \tilde{v}_j - \tilde{u}_i^\prime \tilde{v}_k \big)$ Here:
- $B_{ij}=1$ means user $i$ has interacted with item $j$ (positive sample).
- $\mathbb{E}_{v_k \sim p(v)}$ denotes taking the expectation over a negative item $v_k$ sampled from the graph's item distribution.
- $\sigma$ is the sigmoid activation function.
- $\tilde{u}_i^\prime \tilde{v}_j$ represents the preference score for a positive pair.
- $\tilde{u}_i^\prime \tilde{v}_k$ represents the preference score for a negative pair.
The BPR loss aims to maximize the difference between the scores of positive (interacted) items and negative (non-interacted) items for each user.
$\mathcal{L}_{TUR}(\tilde{Z}, T)$ is the Transformer Unsmooth Regularization loss defined above.
$\mathcal{L}_{L2}(\tilde{Z})$ is the standard L2 regularization applied to the final vertex representations $\tilde{Z}$ , with $\Psi_{L2}$ as its coefficient. L2 regularization helps prevent overfitting by penalizing representations with large norms.
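The pieces of the objective can be illustrated with plain Python. This sketch evaluates the BPR term for one (user, positive item, negative item) triple and then combines three already-computed loss values as in the formula above; the function names and the toy vectors are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bpr_loss(user, pos_item, neg_item):
    """-log sigmoid(u.v_pos - u.v_neg) for one sampled triple."""
    margin = dot(user, pos_item) - dot(user, neg_item)
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def total_loss(rank, tur, l2, psi_l2):
    """L_rec = L_rank + L_TUR + Psi_L2 * L_L2."""
    return rank + tur + psi_l2 * l2

u = [1.0, 0.5]
v_pos = [2.0, 0.0]   # scored higher than the negative item by this user
v_neg = [0.0, 1.0]
print(bpr_loss(u, v_pos, v_neg))   # small, since the positive item already wins
print(bpr_loss(u, v_neg, v_pos))   # larger, since the ranking is reversed
```

When the two scores are equal the BPR term is exactly $\log 2$, and it shrinks toward zero as the positive item's score pulls ahead.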

5. Experimental Setup

5.1. Datasets

The experiments are conducted on three public datasets derived from the Amazon review datasets (He and McAuley 2016a), which are commonly used in multimodal recommendation research.

Baby: This dataset contains user reviews and product information for baby products.
Sports: This dataset contains user reviews and product information for sports and outdoors items.
Clothing: This dataset contains user reviews and product information for clothing, shoes, and jewelry.

Characteristics and Preprocessing:

Source: Amazon review datasets.
Filtering: All datasets were filtered using a 5-core threshold for both products and users. This means that only users who had at least 5 interactions and items that had at least 5 interactions were retained, ensuring a minimum level of activity and relevance.
Multimodal Features:
- Visual Features: 4,096-dimensional embeddings for images, extracted using pre-trained Convolutional Neural Networks (CNNs).
- Text Features: 384-dimensional embeddings, derived from item titles, descriptions, categories, and brands using sentence-transformers (a type of pre-trained language model).
Data Split:
- 80% of known user interactions for training.
- 10% for validation.
- 10% for testing.
Random Seeds: The reported performance is the mean result obtained using five different random seeds, which reduces the influence of run-to-run randomness on the reported numbers.
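The 5-core filtering step can be sketched as follows. The summary does not say whether the filter was applied once or repeated until stable; the iterative variant below is an assumption, and the function name and toy data (with `k=2` for brevity) are hypothetical.

```python
from collections import Counter

def k_core_filter(interactions, k=5):
    """Repeatedly drop users/items with fewer than k interactions until nothing changes."""
    current = set(interactions)
    while True:
        user_counts = Counter(u for u, _ in current)
        item_counts = Counter(i for _, i in current)
        kept = {(u, i) for u, i in current
                if user_counts[u] >= k and item_counts[i] >= k}
        if len(kept) == len(current):
            return kept
        current = kept

toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"),
       ("u3", "a"), ("u3", "c"), ("u4", "c")]
print(sorted(k_core_filter(toy, k=2)))
```

The toy example shows why iteration matters: removing `u4` leaves item `c` with a single interaction, which in turn removes `u3`, so a single pass would not reach a consistent core.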

The following are the results from Table 1 of the original paper:

Dataset	Users	Items	Interactions	Sparsity
Baby	19,445	7,050	160,792	99.88%
Sports	35,598	18,357	296,337	99.95%
Clothing	39,387	23,033	278,677	99.97%
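The sparsity column can be reproduced from the other three columns, assuming the usual definition of sparsity as one minus the fraction of observed user-item pairs (the summary does not state the definition explicitly).

```python
def sparsity(num_users, num_items, num_interactions):
    """Fraction of the user-item matrix with no observed interaction."""
    return 1.0 - num_interactions / (num_users * num_items)

# (users, items, interactions) copied from the table above.
stats = {
    "Baby": (19445, 7050, 160792),
    "Sports": (35598, 18357, 296337),
    "Clothing": (39387, 23033, 278677),
}
for name, (users, items, interactions) in stats.items():
    print(f"{name}: {100 * sparsity(users, items, interactions):.2f}%")
```

Under that definition the computed values round to the percentages listed in the table.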

5.2. Evaluation Metrics

The performance of the recommendation models is evaluated using two widely used ranking metrics: Recall (R@K) and Normalized Discounted Cumulative Gain (NDCG@K, abbreviated N@K in the result tables). Here $K$ is the number of top-ranked recommendations considered, not the number of GNN hops denoted by $K$ in the methodology. The paper reports results for $K=10$ and $K=20$ .

5.2.1. Recall (R@K)

Conceptual Definition: Recall@K measures the proportion of relevant items (items the user interacted with in the test set) that are successfully retrieved within the top $K$ recommendations. It focuses on how many of the truly relevant items were found by the recommender system. A higher Recall@K indicates that the system is better at retrieving a larger fraction of the relevant items.
Mathematical Formula: $\mathrm{Recall@K} = \frac{1}{|U_{test}|} \sum_{u \in U_{test}} \frac{|\mathrm{Rel}_u \cap \mathrm{Rec}_u@K|}{|\mathrm{Rel}_u|}$
Symbol Explanation:
- $|U_{test}|$ is the number of users in the test set.
- $\mathrm{Rel}_u$ is the set of relevant items for user $u$ (i.e., items user $u$ interacted with in the test set).
- $\mathrm{Rec}_u@K$ is the set of top $K$ items recommended to user $u$ .
- $|\cdot|$ denotes the cardinality of a set.
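A minimal sketch of the Recall@K formula in plain Python follows. The function names and the toy rankings are illustrative assumptions, and users without test items are skipped to avoid division by zero, which is a common convention rather than something stated in the summary.

```python
def recall_at_k(ranked_items, relevant_items, k):
    """|Rel_u ∩ Rec_u@K| / |Rel_u| for one user."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = len(relevant.intersection(ranked_items[:k]))
    return hits / len(relevant)

def mean_recall_at_k(rankings, ground_truth, k):
    """Average the per-user recall over users that have at least one test item."""
    users = [u for u in ground_truth if ground_truth[u]]
    return sum(recall_at_k(rankings[u], ground_truth[u], k) for u in users) / len(users)

rankings = {"u1": ["a", "b", "c", "d"], "u2": ["x", "y", "z", "w"]}
ground_truth = {"u1": {"a", "d"}, "u2": {"y"}}
print(mean_recall_at_k(rankings, ground_truth, k=2))  # u1: 1/2, u2: 1/1 -> 0.75
```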

5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)

Conceptual Definition: NDCG@K is a measure of ranking quality that considers the position of relevant items in the recommendation list. It assigns higher scores to relevant items that appear at higher ranks (closer to the top) in the recommendation list. It is normalized by the ideal DCG (IDCG) to account for varying numbers of relevant items per user. A higher NDCG@K indicates better ranking, with relevant items placed more prominently.
Mathematical Formula: First, Discounted Cumulative Gain (DCG@K) is calculated: $\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}$ . Then, Ideal DCG (IDCG@K) is the DCG of the ideal list, in which items are sorted by relevance in descending order: $\mathrm{IDCG@K} = \sum_{i=1}^{\min(K, |\mathrm{Rel}_u|)} \frac{2^{\mathrm{rel}_{i_{ideal}}} - 1}{\log_2(i+1)}$ . Finally, $\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$ , averaged over test users. For implicit feedback, $\mathrm{rel}_i$ is usually binary (1 if relevant, 0 if not), so $2^{\mathrm{rel}_i} - 1 = \mathrm{rel}_i$ and the formula simplifies to: $\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{\mathrm{rel}_i}{\log_2(i+1)}$
Symbol Explanation:
- $K$ is the number of top recommendations considered.
- $\mathrm{rel}_i$ is the relevance score of the item at position $i$ in the recommended list (typically 1 if relevant, 0 if not).
- $\mathrm{rel}_{i_{ideal}}$ is the relevance score of the item at position $i$ in the ideal recommendation list (where all relevant items are ranked at the top).
- $\log_2(i+1)$ is the logarithmic discount factor, giving less weight to relevant items at lower ranks.
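With binary relevance, NDCG@K for one user can be computed as in the sketch below. The function names and the toy ranking are assumptions for illustration; the arithmetic follows the simplified DCG formula and an ideal list that places all relevant items first.

```python
import math

def dcg_at_k(ranked_items, relevant, k):
    """Sum of 1 / log2(position + 1) over relevant items in the top k (positions start at 1)."""
    return sum(1.0 / math.log2(pos + 1)
               for pos, item in enumerate(ranked_items[:k], start=1)
               if item in relevant)

def ndcg_at_k(ranked_items, relevant_items, k):
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    ideal_hits = min(k, len(relevant))   # the ideal list puts every relevant item first
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg_at_k(ranked_items, relevant, k) / idcg

# Relevant items at ranks 1 and 3: DCG = 1 + 1/2, IDCG = 1 + 1/log2(3).
print(ndcg_at_k(["a", "b", "c"], {"a", "c"}, k=3))
```

Swapping `b` and `c` in the ranking would move the second hit up to rank 2 and raise the score to 1.0, which is the position sensitivity that Recall@K lacks.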

5.3. Baselines

The proposed MIG-GT method is compared against a comprehensive set of baselines, categorized into two groups:

5.3.1. Non-Multimodal Baselines (Interaction-only)

These models rely solely on user-item interaction data.

MF (Koren, Bell, and Volinsky 2009): Matrix Factorization, a classic collaborative filtering technique.
LightGCN (He et al. 2020): A simplified and highly effective Graph Convolutional Network for recommendation, optimized for implicit feedback.
ApeGNN (Zhang et al. 2023): A GNN that adaptively aggregates information based on local structures.
MGDN (Hu et al. 2024): The base GNN model used in MIG-GT, which generalizes LightGCN and offers flexible controls for self- and neighbor information.

5.3.2. Multimodal Baselines (Interaction + Multimodal Data)

These models leverage both user-item interactions and item-associated multimodal data.

VBPR (He and McAuley 2016b): Visual Bayesian Personalized Ranking, an extension of BPR that incorporates visual features.
MMGCN (Wei et al. 2019): Multi-modal Graph Convolution Network, an early GNN-based multimodal model that applies separate GNNs to modality-aware graphs.
GRCN (Wei et al. 2020): Graph-Refined Convolutional Network, which focuses on refining user-item graph structures by sieving out noisy connections.
DualGNN (Wang et al. 2023): Dual Graph Neural Network, introducing a user co-occurrence graph and a feature preference module.
SLMRec (Tao et al. 2023): Self-Supervised Learning for Multimedia Recommendation, a recent self-supervised learning approach.
LATTICE (Zhang et al. 2021): Mining Latent Structures for Multimedia Recommendation, which performs modality-aware structure learning to obtain item-item graphs.
FREEDOM (Zhou and Shen 2023): A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal Recommendation, a state-of-the-art model that explicitly models item-item relationships and applies denoising.

All baselines utilize BPR as their ranking loss, ensuring a fair comparison on the learning objective.

5.4. Parameter Settings

Modality-Independent Receptive Fields ( $K^{(M)}$ ): Searched within a range, specifically $K^{(M)} \le 4$ .
Transformer Residual Hyperparameter ( $\gamma$ ): Searched within the range [0.8, 0.9].
Learning Rate: Searched from $\{1 \times 10^{-2}, 1 \times 10^{-3}\}$ .
L2 Regularization Coefficient ( $\Psi_{L2}$ ): Searched from $\{1 \times 10^{-4}, 1 \times 10^{-5}\}$ .
Optimization: Adam optimizer (Kingma and Ba 2015).
Hardware: Linux system with two Intel Xeon E5-2690 v4 CPUs, 128GB RAM, and a GeForce GTX 1080 Ti GPU (11GB).
Implementation: PyTorch and DGL (Deep Graph Library).
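The search space above can be enumerated with `itertools.product`. Several details here are assumptions rather than reported facts: the key names are hypothetical, $\gamma$ is discretized to the two endpoints of its range, hop counts are taken as 1 to 4, and a full joint grid is shown even though the summary does not say whether all combinations were tried.

```python
from itertools import product

search_space = {
    "k_emb": [1, 2, 3, 4],      # K^(E), assumed to range over 1..4
    "k_text": [1, 2, 3, 4],     # K^(T)
    "k_visual": [1, 2, 3, 4],   # K^(V)
    "gamma": [0.8, 0.9],        # assumption: only the endpoints of [0.8, 0.9]
    "lr": [1e-2, 1e-3],
    "l2": [1e-4, 1e-5],
}

# One dict per candidate configuration.
configs = [dict(zip(search_space, values))
           for values in product(*search_space.values())]
print(len(configs))             # 4 * 4 * 4 * 2 * 2 * 2 = 512
```

The hop counts alone contribute 64 of the combinations, which illustrates how the cost of the grid grows with each additional modality.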

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate the effectiveness and superiority of the proposed MIG-GT framework across three Amazon datasets (Baby, Sports, and Clothing).

The following are the results from Table 2 of the original paper:

Method	Multimodal	GNN	Baby				Sports				Clothing
			R@10	R@20	N@10	N@20	R@10	R@20	N@10	N@20	R@10	R@20	N@10	N@20
MF	✗	✗	0.0357	0.0575	0.0192	0.0249	0.0432	0.0653	0.0241	0.0298	0.0206	0.0303	0.0114	0.0138
LightGCN	✗	✓	0.0479	0.0754	0.0257	0.0328	0.0569	0.0864	0.0311	0.0387	0.0361	0.0544	0.0197	0.0243
ApeGNN	✗	✓	0.0501	0.0775	0.0267	0.0338	0.0608	0.0892	0.0333	0.0407	0.0378	0.0538	0.0204	0.0244
MGDN	✗	✓	0.0495	0.0783	0.0272	0.0346	0.0614	0.0932	0.0340	0.0422	0.0362	0.0551	0.0199	0.0247
VBPR	✓	✗	0.0423	0.0663	0.0223	0.0284	0.0558	0.0856	0.0307	0.0384	0.0281	0.0415	0.0158	0.0192
MMGCN	✓	✓	0.0421	0.0660	0.0220	0.0282	0.0401	0.0636	0.0209	0.0270	0.0227	0.0361	0.0154	0.0154
GRCN	✓	✓	0.0532	0.0824	0.0282	0.0358	0.0599	0.0919	0.0330	0.0413	0.0421	0.057	0.0224	0.0284
DualGNN	✓	✓	0.0513	0.0803	0.0278	0.0352	0.0588	0.0899	0.0324	0.0404	0.0452	0.0675	0.0242	0.0298
SLMRec	✓	✓	0.0521	0.0772	0.0289	0.0354	0.0663	0.0990	0.0365	0.0450	0.0442	0.0659	0.0241	0.0296
LATTICE	✓	✓	0.0547	0.0850	0.0292	0.0370	0.0620	0.0953	0.0335	0.0421	0.0492	0.0733	0.0268	0.0330
FREEDOM	✓	✓	0.0627	0.0992	0.0330	0.0424	0.0717	0.1089	0.0385	0.0481	0.0626	0.0932	0.0338	0.0416
MIG-GT	✓	✓	0.0665	0.1021	0.0361	0.0452	0.0753	0.1130	0.0414	0.0511	0.0636	0.0934	0.0347	0.0422
Improv.			6.06%	2.92%	9.39%	6.6%	5.02%	3.76%	7.53%	6.24%	1.6%	0.21%	2.66%	1.44%

Key Observations from Table 2:

Impact of Multimodal Data: Methods utilizing multimodal data (marked with ✓ in the 'Multimodal' column) generally outperform those relying solely on user-item interactions. For instance, VBPR (multimodal) improves over MF (non-multimodal) on every dataset and metric, highlighting the value of rich item semantics. This is consistent with a well-established finding in multimodal recommendation.
Efficacy of GNNs: GNN-based methods (marked with ✓ in the 'GNN' column) generally show better performance than non-GNN methods, in both multimodal and non-multimodal settings. LightGCN and MGDN outperform MF. Similarly, multimodal GNNs like GRCN, DualGNN, LATTICE, and FREEDOM generally surpass VBPR. MMGCN is an exception: it trails VBPR in R@10 on all three datasets. Overall, this reinforces the strength of GNNs in capturing complex interaction patterns.
Improvements in Multimodal GNNs: More advanced multimodal GNNs (GRCN, DualGNN, SLMRec) that consider nuances like noisy interactions or feature preferences achieve better performance than earlier methods like MMGCN. This suggests that there's considerable room for improvement by tailoring GNNs to specific challenges in multimodal recommendation.
Modeling Item-Item Relationships: Models like LATTICE and FREEDOM, which explicitly learn or utilize item-item relationships (often by building dedicated graphs or denoising mechanisms), achieve strong performance, often surpassing other multimodal GNNs. FREEDOM stands out as the SOTA baseline in most cases, demonstrating the power of explicit item-item modeling and denoising.
MIG-GT's Superiority: MIG-GT consistently outperforms all baselines, including the SOTA FREEDOM.
- On Baby and Sports datasets, MIG-GT shows substantial improvements over FREEDOM (e.g., 6.06% and 5.02% in R@10, 9.39% and 7.53% in N@10 respectively).
- On Clothing, MIG-GT still outperforms FREEDOM, albeit with smaller margins (1.6% in R@10, 2.66% in N@10).
- A notable aspect is that MIG-GT achieves this without relying on complex denoising or explicit item-item relation modeling, which are key components of FREEDOM and LATTICE. This suggests that modality-independent receptive fields and the sampling-based global transformer are effective on their own, without such additional machinery.
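The percentages quoted above are consistent with a relative improvement of MIG-GT over the strongest baseline (FREEDOM). The short check below uses values from Table 2; the function name is an illustrative assumption.

```python
def relative_improvement(ours, best_baseline):
    """Percentage gain of `ours` over `best_baseline`."""
    return 100.0 * (ours - best_baseline) / best_baseline

# (MIG-GT, FREEDOM) pairs taken from Table 2.
pairs = {
    "Baby R@10": (0.0665, 0.0627),
    "Baby N@10": (0.0361, 0.0330),
    "Sports R@10": (0.0753, 0.0717),
    "Clothing R@10": (0.0636, 0.0626),
}
for name, (ours, baseline) in pairs.items():
    print(f"{name}: {relative_improvement(ours, baseline):.2f}%")
```

Because the gains are relative, a difference of 0.0038 in R@10 on Baby corresponds to about 6%, while the absolute scores themselves remain small on these sparse datasets.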

6.2. Ablation Studies / Parameter Analysis

6.2.1. Impact and Selection of Modality-Independent Receptive Fields (MIRF)

The paper initially illustrates the concept of modality-dependent optimal $K$ values using Figure 1 in the introduction.

Figure 1: Performance of GNNs (measured by NDCG@20) on Amazon Baby with features of different modalities at varying receptive fields (number of hops, $K$ ). "Emb" stands for learnable embeddings. The optimal $K$ is modality-dependent: Emb and Text perform best at $K = 3$ , while Visual performs best at $K = 2$ .

As shown in Figure 1, the best-performing $K$ on the Amazon Baby dataset differs by modality, which supports the core hypothesis that optimal receptive fields vary across modalities.

Further detailed analysis is presented using heatmaps (Figure 4 and Figure 5) to visualize the interaction between different $K^{(M)}$ values and their impact on performance (NDCG@20).

The following are the results from Figure 4 of the original paper:

Figure 4: Heatmaps showing the NDCG@20 scores for different combinations of $K^{(T)}$ and $K^{(V)}$ , with validation-set scores on the left and test-set scores on the right; color intensity indicates the score.

Figure 4 presents heatmaps for NDCG@20 scores when $K^{(E)}$ is fixed at 4, and $K^{(T)}$ and $K^{(V)}$ vary from 1 to 4.

Validation vs. Test Consistency: The left heatmap (Figure 4a) shows validation performance, and the right (Figure 4b) shows test performance. The patterns of performance variation are largely consistent between the validation and test sets. This consistency indicates that selecting optimal $K^{(M)}$ values using grid search on a validation set is a reliable strategy.
Optimal Combinations: Different combinations of $K^{(T)}$ and $K^{(V)}$ yield varying performance, with an optimal region emerging that is not simply a uniform $K$ across all modalities.

The following are the results from Figure 5 of the original paper:

Figure 5: Heatmaps showing the NDCG@20 scores for different combinations of $K^{(E)}$ and $K^{(V)}$ , with validation-set scores on the left and test-set scores on the right; color intensity indicates the score.

Figure 5 presents heatmaps for NDCG@20 scores when $K^{(T)}$ is fixed at 2, and $K^{(E)}$ and $K^{(V)}$ vary from 1 to 4.

Similar to Figure 4, the heatmaps in Figure 5 also confirm the consistency between validation (Figure 5a) and test (Figure 5b) performance patterns, further validating the feasibility of validation-based hyperparameter tuning for MIRF.
The results clearly show that the optimal configuration for $K^{(M)}$ values often involves different values for different modalities, diverging from a uniform $K$ setting. This directly supports the paper's core hypothesis regarding the benefit of modality-independent receptive fields.

6.2.2. Impact of Sampling-Based Global Transformers (SGT)

To assess the effectiveness of the Sampling-based Global Transformer (SGT), an ablation study is performed by comparing MIG-GT with MIG, a variant that removes SGT but retains the MIRF components. The SOTA method FREEDOM is included for context.

The following are the results from Figure 6 of the original paper:

Figure 6: Impact of Sampling-based Global Transformers. Bar charts compare MIG-GT, MIG, and FREEDOM on the Baby, Sports, and Clothing datasets in terms of Recall@10, Recall@20, NDCG@10, and NDCG@20.

MIG vs. FREEDOM: Even without SGT, MIG (which only uses MIRF) already outperforms FREEDOM on the Baby and Sports datasets, highlighting the standalone effectiveness of modality-independent receptive fields. On Clothing, MIG is slightly outperformed by FREEDOM.
MIG-GT vs. MIG: MIG-GT consistently enhances the performance of MIG across all datasets. This indicates that the SGT module significantly contributes to the overall performance improvement by effectively integrating global information. On Clothing, MIG-GT closes the gap and surpasses FREEDOM, demonstrating SGT's crucial role when MIRF alone might not be sufficient.

The paper also compares MIG-GT with variants that replace SGT with other Graph Transformer methods like SGFormer and Polynormer.

The following are the results from Table 3 of the original paper:

	Baby		Sports		Clothing
Method	R @20	N@20	R @20	N@20	R @20	N@20
MIG-SGFormer	0.0863	0.0376	0.0887	0.0392	0.0827	0.0363
MIG-Polynormer	0.0997	0.0436	0.1048	0.0461	0.0864	0.0386
MIG-GT	0.1021	0.0452	0.1130	0.0511	0.0934	0.0422

As shown in Table 3, MIG-GT outperforms MIG-SGFormer and MIG-Polynormer across all datasets and metrics. This demonstrates the superior effectiveness of the proposed sampling-based approach for global context integration specifically within the recommendation domain.

6.2.3. Impact of Number of Global Samples ( $C$ ) for SGT

The study investigates how the number of global samples $C$ in SGT affects performance.

The following are the results from Figure 7 of the original paper:

Figure 7: Impact of Number of Global Samples ( $C$ ) for SGT. Panel (a) shows Recall@20 and panel (b) shows NDCG@20 on the Baby, Sports, and Clothing datasets.

Figure 7 illustrates the impact of varying $C$ from 5 to 25 on R@20 and N@20.

Performance Improvement: An increase in $C$ from 5 to 10 generally leads to performance improvements across all datasets for both R@20 and N@20.
Diminishing Returns: Beyond $C=10$ or $C=15$ , the performance increments become less significant or plateau, and in some cases, might even slightly decrease (e.g., N@20 on Sports with $C=25$ ).
Efficiency: The results show that a relatively small number of global samples (e.g., $C=10$ or $C=20$ ) is sufficient for SGT to achieve significant performance improvements, confirming the efficiency and scalability of the sampling-based approach.

6.2.4. Training Efficiency of MIG-GT

Training efficiency is a crucial aspect for recommendation systems.

The following are the results from Figure 8 of the original paper:

Figure 8: Test performance (NDCG@20) during training.

Figure 8 compares the training efficiency of MIG-GT against the SOTA FREEDOM by plotting NDCG@20 against training time (in seconds).

Faster Convergence: MIG-GT demonstrates superior training efficiency. On Baby and Sports datasets, MIG-GT achieves FREEDOM's final performance much earlier in the training process and then continues to surpass it.
Comparable Performance, Faster Optimal: On the Clothing dataset, while MIG-GT's final performance is comparable to FREEDOM's, it reaches its optimal results significantly faster.
Reason for Efficiency: The paper attributes this efficiency to avoiding complex denoising mechanisms over item-item relations that models like FREEDOM employ. MIG-GT's approach of modality-independent GNNs and sampled global transformers seems to be more direct and computationally lighter.

6.2.5. Comparison with Contrastive Learning (CL)-Based Methods

The paper also explores the compatibility of MIG-GT with Contrastive Learning (CL) by integrating a typical CL loss (InfoNCE) to create MIG-GT-CL.

The following are the results from Table 4 of the original paper:

	Baby		Sports		Clothing
Method	R@20	N@20	R@20	N@20	R@20	N@20
MMSSL	0.0971	0.0420	0.1013	0.0474	0.0797	0.0359
MGCN	0.0964	0.0427	0.1106	0.0496	0.0945	0.0428
LGMRec	0.1002	0.0440	0.1068	0.0480	0.0828	0.0371
MIG-GT	0.1021	0.0452	0.1130	0.0511	0.0934	0.0422
MIG-GT-CL	0.1022	0.0451	0.1120	0.0505	0.0946	0.0428

In its original form, MIG-GT outperforms MMSSL and LGMRec on all three datasets and MGCN on Baby and Sports; on Clothing, MGCN is slightly ahead (0.0945 vs. 0.0934 in R@20, 0.0428 vs. 0.0422 in N@20).
When a simple InfoNCE CL loss is added (MIG-GT-CL), performance improves on Clothing (R@20 of 0.0946, N@20 of 0.0428), where it matches or marginally exceeds MGCN, stays essentially unchanged on Baby, and drops slightly on Sports (R@20 from 0.1130 to 0.1120). This indicates that MIG-GT is compatible with additional self-supervised signals such as CL, although the benefit is dataset-dependent.

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully addresses key limitations in existing GNN-based multimodal recommendation systems by introducing the Modality-Independent Graph Neural Networks with Global Transformers (MIG-GT) framework. The paper empirically demonstrated that the optimal receptive field (number of hops, $K$ ) for GNNs varies across different modalities, leading to the proposal of Modality-Independent Receptive Fields (MIRF). Furthermore, to overcome the challenge of GNNs' limited global information capture, especially when optimal $K$ values are small, the Sampling-based Global Transformer (SGT) was introduced. SGT efficiently integrates global context by performing self-attention on a small, uniformly sampled subset of global nodes. The effectiveness of this global sampling approach and the necessity of Transformer Unsmooth Regularization (TUR) were validated through comprehensive experiments. MIG-GT consistently achieved state-of-the-art performance on three Amazon datasets, demonstrating improved accuracy and training efficiency compared to existing methods, even those employing more complex graph construction or denoising strategies.

7.2. Limitations & Future Work

The paper doesn't explicitly list limitations or future work sections. However, some aspects can be inferred:

Hyperparameter Search for $K^{(M)}$ : The current method relies on grid search for determining the optimal $K^{(M)}$ for each modality, which can be computationally intensive as the number of modalities or potential $K$ values increases. An adaptive or learnable mechanism for determining $K^{(M)}$ would be beneficial.
Sampling Strategy for SGT: The Sampling-based Global Transformer uses uniform global sampling. While effective and efficient, more sophisticated sampling strategies (e.g., importance sampling, biased sampling towards influential nodes) could potentially capture even richer global context or reduce variance.
Generalization of TUR: The Transformer Unsmooth Regularization is designed to prevent smoothing in the context of the SGT. Its generalizability to other global attention mechanisms or GNN architectures might be explored.
Computational Cost of Sampling: While efficient, the need to sample $C$ nodes for every node at every training step could still be a bottleneck for extremely large graphs and larger $C$ values. Further optimizations in sampling or attention computation could be explored.
Explicit Item-Item Relations: The paper explicitly notes that MIG-GT outperforms SOTA models like FREEDOM without relying on explicit item-item relations or complex denoising. While this highlights MIG-GT's strength, exploring how MIRF and SGT could complement explicit item-item modeling might lead to even further gains.

7.3. Personal Insights & Critique

This paper presents a very insightful and practical approach to enhancing multimodal recommendation systems. The observation about modality-dependent receptive fields for GNNs is intuitive yet often overlooked, and its empirical validation is compelling. It pushes the boundaries of GNN design beyond a one-size-fits-all $K$ .

The Sampling-based Global Transformer is a clever solution to the long-standing challenge of integrating global context into graph models without sacrificing scalability. The idea of using a small, uniform sample to approximate global attention is elegant and highly effective, making Transformers feasible for large-scale recommendation graphs. The Transformer Unsmooth Regularization is also a well-thought-out addition, acknowledging and mitigating a potential side effect of the global attention mechanism.

Inspirations and Applications:

Adaptive GNN Architectures: The concept of modality-independent $K$ can be extended to other heterogeneous graph tasks or scenarios where different feature types might require varying propagation depths.
Efficient Global Context for Large Graphs: The sampling strategy for Transformers could be applied to various large-scale graph learning problems beyond recommendation, wherever global context is desired but full attention is prohibitive (e.g., fraud detection, social network analysis).
Hybrid Models: MIG-GT successfully combines GNNs (local) and Transformers (global) in a synergistic manner. This hybrid approach represents a promising direction for other graph-based machine learning problems.

Potential Issues or Areas for Improvement:

Interpretability: While effective, the interaction between modality-independent GNNs and the sampled global Transformer might be complex to interpret. Further work could focus on understanding why certain $K$ values are optimal for specific modalities or which global nodes are most influential via the SGT.
Sensitivity to Sampling: While the paper shows robustness for $C$ values between 10-20, the quality and representativeness of uniform sampling can vary. For highly skewed graphs or specific tasks, a more adaptive or learned sampling strategy could offer further improvements.
Theoretical Guarantees: The effectiveness of TUR and the choice of $\gamma$ are empirically validated. Providing stronger theoretical motivations or bounds for these components could enhance the model's robustness and understanding.
Generalization to New Modalities: The framework is designed for text and visual modalities. Evaluating its extensibility and performance with other modalities (e.g., audio, video segments) could provide further insights.

Overall, MIG-GT is a significant step forward in multimodal recommendation, offering a robust, efficient, and highly effective framework by thoughtfully addressing both local and global information aggregation challenges.
