Multimodal fusion framework based on knowledge graph for personalized recommendation

Published: 01/01/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This work proposes Multi-KG4Rec, a multimodal fusion framework leveraging fine-grained modal interactions in knowledge graphs to enhance personalized recommendations, demonstrating superior efficiency on real-world datasets.

Abstract

Expert Systems With Applications 268 (2025) 126308. Available online 1 January 2025. ISSN 0957-4174. © 2025 Elsevier Ltd. Journal homepage: www.elsevier.com/locate/eswa

Multimodal fusion framework based on knowledge graph for personalized recommendation

Jingjing Wang (a), Haoran Xie (b, *), Siyu Zhang (a), S. Joe Qin (b), Xiaohui Tao (c), Fu Lee Wang (d), Xiaoliang Xu (a)

(a) Hangzhou Dianzi University, 1158 2nd Ave, Qiantang District, Hangzhou 310005, Zhejiang, China; (b) Lingnan University, 8 Castle Peak Road, Tuen Mun, New Territories, Hong Kong SAR; (c) University of Southern Queensland, Springfield 4300, Queensland, Australia; (d) Hong Kong Metropolitan University, 30 Good Shepherd Street, Ho Man Tin, Kowloon, Hong Kong SAR

Keywords: Knowledge graphs; Multimodal fusion framework; Recommender system

Abstract (excerpt, truncated in the source): "Knowledge Graphs (KGs), which contain a wealth of knowledge …"

In-depth Reading

1. Bibliographic Information

1.1. Title

Multimodal fusion framework based on knowledge graph for personalized recommendation

1.2. Authors

Jingjing Wang, Haoran Xie, Siyu Zhang, S. Joe Qin, Xiaohui Tao, Fu Lee Wang, Xiaoliang Xu

1.3. Journal/Conference

The paper was published in Expert Systems With Applications (Elsevier), volume 268, article 126308, as shown in the journal header reproduced with the abstract. The author affiliations (e.g., Hangzhou Dianzi University, Lingnan University, University of Southern Queensland) and the CRediT authorship contribution statement further confirm that it is a peer-reviewed journal publication.

1.4. Publication Year

The paper was published in 2025: the journal header states that it became available online on 1 January 2025, in volume 268 (2025) of Expert Systems With Applications.

1.5. Abstract

The paper addresses limitations in existing Multimodal Knowledge Graph (MKG)-based recommendation systems, which primarily use multimodal information as auxiliary data for reasoning relationships between entities, often overlooking the direct interactions between modalities. To overcome this, the authors propose Multi-KG4Rec, a multimodal fusion framework based on Knowledge Graphs (KGs) for personalized recommendation. The framework systematically analyzes shortcomings in current multimodal graph construction. It introduces a modal fusion module to extract user modal preferences at a fine-grained level. Extensive experiments conducted on two real-world datasets (MovieLens and Amazon-Books) from different domains demonstrate the efficiency and effectiveness of Multi-KG4Rec.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve revolves around enhancing the representational quality and personalization capabilities of recommender systems by more effectively integrating rich multimodal information.

  • Core Problem: Traditional Knowledge Graph (KG)-based recommender systems effectively use KGs as a knowledge-driven tool for high-quality representations, but they often represent attribution information as pure symbols, limiting their ability to understand real-world scenarios rich in images and text. While Multimodal Knowledge Graphs (MKGs) have been proposed to address this by incorporating text and visual content, existing MKG-based methods suffer from two significant limitations:

    1. Lack of a Unified MKG Architecture: Existing MKG methods are typically categorized into feature-based and entity-based approaches.
      • Feature-based methods treat multimodal information as auxiliary data for entities, enriching representations but often overlooking interactions between different modalities. They also impose strict constraints on MKG completeness.
      • Entity-based methods consider multimodal information as supplementary nodes, but these are often limited to attribute entities and struggle with sparsity for item entities (e.g., few items share identical posters or text).
    2. Ineffective Multimodal Fusion: Current fusion techniques, like concatenation or weighted sums, struggle to effectively leverage multimodal information, especially for capturing subtle correlations within or across modalities (e.g., visual style across different movies with varying textual descriptions). This makes it challenging to extract fine-grained personalized multimodal preferences.
  • Why this problem is important: Personalization in recommender systems is crucial for user satisfaction and engagement. Real-world items are inherently multimodal (e.g., movies have posters, descriptions, genres). Ignoring or inadequately fusing this rich information leads to less accurate recommendations and a poorer understanding of user preferences. Addressing these architectural and fusion limitations can significantly improve the model's ability to understand the real world and provide more precise, personalized recommendations.

  • Paper's Entry Point/Innovative Idea: The paper's innovation lies in proposing a unified Multimodal fusion framework based on Knowledge Graph for personalized Recommendation (Multi-KG4Rec) that tackles these architectural and fusion challenges. It introduces a novel MKG construction by dividing the multimodal graph into several single-modal graphs, representing each entity by its modality feature, which helps avoid node sparsity and allows for coarse-grained preference extraction. Furthermore, it employs a sophisticated fine-grained modal fusion module using pre-trained models (like CLIP and Large Language Models - LLMs) for initial feature generation, followed by a graph neural network to align features with graph structures, and finally a cross multi-head attention module between text and visual transformers for deep multimodal interaction.

2.2. Main Contributions / Findings

The paper makes several key contributions:

  • Unified Multimodal Architecture: Multi-KG4Rec proposes a unified multimodal graph architecture that overcomes the strict limitations of feature-based methods by dividing the multimodal KG into several single-modal graphs. This approach also addresses node sparsity issues inherent in entity-based methods by representing each entity with its modality feature, enabling coarse-grained user preference extraction without being constrained by explicit entity connections.
  • Leveraging LLMs and Fine-Grained Multimodal Fusion: The framework employs a large pre-trained vision-language model (CLIP) to generate initial multimodal features, which are then integrated with graph-structured information using a Graph Neural Network (GNN). Crucially, it introduces a multimodal fusion module that applies cross multi-head attention between text and visual transformers. This module extracts users' personalized multimodal preferences at a fine-grained level, capturing subtle interactions between modalities that previous methods overlooked.
  • Extensive Experimental Validation: The effectiveness of Multi-KG4Rec is demonstrated through extensive experiments on two real-world datasets from different domains: MovieLens and Amazon-Books. The results show that Multi-KG4Rec consistently outperforms various strong baselines, including collaborative filtering, KG-based, and MKG-based methods, across standard evaluation metrics like Recall@k, MRR@k, and NDCG@k.
  • Empirical Insights on Modality Effectiveness: The analysis reveals that incorporating multimodal features significantly boosts performance, with visual modalities often being more impactful than text for user decisions. The case study further highlights that different users exhibit distinct fine-grained preferences for visual versus text modalities, validating the necessity of the proposed fine-grained fusion approach.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following core concepts:

  • Recommender Systems (RSs): Systems designed to predict user preferences and suggest items (e.g., movies, products, articles) that users are most likely to enjoy. They address information overload by filtering relevant items.
  • Knowledge Graphs (KGs): A KG is a structured representation of information that describes entities (real-world objects, events, concepts) and their semantic relationships. It typically consists of nodes (entities) and edges (relationships), forming triples in the format (head entity, relation, tail entity) or (h, r, t). KGs provide rich, explicit knowledge that can enhance recommender systems by linking items to their attributes and related concepts, thus offering more context than traditional collaborative filtering methods.
  • Multimodal Knowledge Graphs (MKGs): An extension of KGs that integrates information from multiple modalities, such as text, images, and potentially audio or video. In MKGs, entities or their attributes can be enriched with features derived from these different data types. For example, a movie entity might have a textual description, an image (poster), and a genre. MKGs aim to provide a more comprehensive understanding of entities by combining symbolic knowledge with perceptual data.
  • Graph Neural Networks (GNNs): A class of neural networks designed to operate directly on graph-structured data. GNNs learn representations (embeddings) of nodes by aggregating information from their neighbors, iteratively refining these representations. They are powerful for capturing relational dependencies and structural patterns in graphs.
    • Graph Attention Networks (GATs): A type of GNN that incorporates an attention mechanism. Instead of assigning equal weight to all neighbors, GATs learn different weights for different neighbors based on their features, allowing them to selectively focus on more important neighbors during aggregation.
  • Transformers: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), primarily known for its success in natural language processing. The core idea of a Transformer is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element. This enables Transformers to capture long-range dependencies effectively.
    • Self-Attention: A mechanism that allows a model to weigh the importance of different words in an input sentence when encoding a specific word. For each word, it calculates a score of how much it should "attend" to other words in the sentence.
    • Multi-head Attention: An extension of self-attention where the attention mechanism is applied multiple times in parallel, using different learned linear projections (query, key, value) for each "head." The outputs from these heads are then concatenated and linearly transformed, allowing the model to capture diverse types of relationships or focus on different aspects of the information.
    • Cross-Attention: A variant of the attention mechanism used when interacting between two different sequences (e.g., text and image features). One sequence provides the query (e.g., text features) and the other provides the key and value (e.g., visual features), allowing the first sequence to attend to relevant parts of the second.
  • Contrastive Learning: A machine learning paradigm where the model learns by contrasting positive pairs (similar samples) with negative pairs (dissimilar samples). The goal is to bring representations of positive pairs closer together while pushing negative pairs further apart in the embedding space.
  • CLIP (Contrastive Language-Image Pre-training): A pre-trained multimodal model developed by OpenAI. CLIP learns to align images and text descriptions by being trained on a large dataset of image-text pairs using contrastive learning. It consists of an image encoder and a text encoder, which project images and texts into a shared embedding space. This allows CLIP to understand visual concepts described in natural language and generate high-quality, aligned multimodal features.
  • TransR: A knowledge graph embedding model that learns representations of entities and relations. TransR projects entities into relation-specific spaces before performing the translation. This allows it to capture different aspects of entity relationships more effectively than simpler models like TransE, since entity embeddings can vary depending on the relation. The core idea is that $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ should hold in the relation space.
  • Bayesian Personalized Ranking (BPR) Loss: A widely used pairwise ranking loss function for implicit feedback recommendation. BPR aims to maximize the difference between the predicted scores of observed (positive) items and unobserved (negative) items for a given user. It assumes that a user prefers an interacted item over a non-interacted item.
    • Formula: The BPR loss for a single triplet (u, i, j) (user $u$ prefers item $i$ over item $j$) is typically defined as $\mathcal{L}_{\mathrm{BPR}} = -\ln \sigma(\hat{y}_{ui} - \hat{y}_{uj})$, where $\hat{y}_{ui}$ and $\hat{y}_{uj}$ are the predicted scores for user $u$ on items $i$ and $j$, respectively, and $\sigma(\cdot)$ is the sigmoid function. The total loss sums over all training triplets.
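A minimal PyTorch sketch of this pairwise objective (variable names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """BPR: -ln sigmoid(y_ui - y_uj), averaged over a batch of (u, i, j) triplets."""
    # logsigmoid is numerically more stable than log(sigmoid(x))
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Toy usage: predicted scores for four (user, positive item, negative item) triplets
pos = torch.tensor([2.1, 0.3, 1.5, 0.9])
neg = torch.tensor([1.0, 0.8, 0.2, 1.1])
print(bpr_loss(pos, neg))
```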

3.2. Previous Works

The paper categorizes previous works into KG-based methods, MKG-based methods, and Multimodal fusion methods.

  • KG-based methods: These methods construct a heterogeneous graph involving users, items, and item attributes, then propagate relationships to generate representations.
    • KGCN (Wang, Zhao et al., 2019): Incorporates GCN (Graph Convolutional Network) and KG methods to learn relationships between entities. It uses a fixed number of neighbors as the receptive field.
    • KGAT (Wang, He et al., 2019): An innovative collaborative KG method that propagates features on the collaborative KG through a GCN layer to encode high-order relationships between users and items. It integrates TransR and GNN to generate entity representations.
    • These methods are noted for reasoning relationships but often ignore rich knowledge from text and visual information.
    • Meta-path-based methods (Hu et al., 2018; Zhao et al., 2017): Rely on manual path definition for feature engineering.
  • MKG-based methods: These integrate multimodal entity nodes from text and visual modalities.
    • Feature-based methods: Treat multimodal information as auxiliary data for entities.
      • CKE (Zhang et al., 2016): Divides MKG into a bipartite graph, textual content, and visual content. It uses TransR for structural representations, and denoising autoencoders for multimodal content. It is highlighted for not using GNN to aggregate high-order neighbor information and lacking sufficient attention on inter-modal interaction.
      • DKN (Wang et al., 2018): Proposed a CNN framework to integrate high-order relational reasoning with text semantics. It explored high-order relationships under textual modalities but not other modalities.
      • CMCKG (Cao et al., 2022): Utilizes original KG for structural representations and converts textual descriptions into new KG nodes. Employs contrastive learning to enhance consistency between representations.
    • Entity-based methods: Consider multimodal information as newly added supplementary nodes.
      • MKGAT (Sun et al., 2020): An entity-based method where only attribute entities contain multimodal information, which is transferred as newly added nodes. This paper identifies it as a representative feature-based method in its experimental comparisons, implying its multimodal integration approach is more aligned with augmenting entity features rather than distinct modal entities.
      • MMKGV (Liu, Li et al., 2022): Integrated multimodality information as relationship triplets within a knowledge graph.
  • Multimodal fusion methods: Categorized into coarse-grained, fine-grained, and combined attention.
    • Coarse-grained attention: Focuses on modality correlation at a high level. Examples include DUALGRAPH (Li, Feng et al., 2023), UVCAN (Liu, Chen et al., 2019), MCPTR (Liu, Ma et al., 2022), CMBF (Chen et al., 2021). These often use co-attention or cross-attention at the modality level.
    • Fine-grained attention: Focuses on detailed correlations. Examples include POG (Chen et al., 2019), NOR (Lin et al., 2019), EFRM (Hou et al., 2019), MMRec (Wu et al., 2021). These often use self-attention or candidate-aware attention for specific attributes or elements.
    • Combined attention: Balances fine-grained and coarse-grained. Examples include NOVA (Liu et al., 2021), NRPA (Liu, Wu et al., 2019), VLSNR (Han et al., 2022), MARank (Yu et al., 2019). This paper positions its method in this category, using pre-trained models for fine-grained alignment and then cross multi-head attention.

3.3. Technological Evolution

The field has evolved from basic collaborative filtering to KG-based recommenders that leverage structured knowledge. The limitation of symbolic KGs led to the integration of multimodal information, giving rise to MKGs. Early MKG methods either treated multimodal data as auxiliary features (feature-based) or added them as new nodes (entity-based). However, these often struggled with complete MKG architectures, data sparsity, and effective fusion of interactions between modalities. The rise of powerful pre-trained models (like CLIP for multimodal alignment) and advanced attention mechanisms (like Transformers) opened new avenues for more sophisticated multimodal integration. This paper fits into this evolution by addressing the architectural shortcomings of previous MKGs and leveraging Transformers for fine-grained cross-modal fusion, aiming for a more unified and effective MKG framework.

3.4. Differentiation Analysis

Multi-KG4Rec differentiates itself from previous methods primarily in its approach to MKG architecture and multimodal fusion:

  • Unified MKG Architecture:
    • Vs. Feature-based: Existing feature-based methods (e.g., CKE, CMCKG, sometimes MKGAT in comparison) treat multimodal information as auxiliary to the main entity, which can overlook direct interactions between modalities. Multi-KG4Rec mitigates this by conceptually dividing the multimodal KG into several single-modal graphs. Each entity within a modality (e.g., visual entity, text entity) is represented by its specific modality feature. This allows for dedicated processing of each modality before fusion, potentially capturing intra-modal correlations more effectively and then inter-modal interactions.
    • Vs. Entity-based: Existing entity-based methods (e.g., MKGAT) often add multimodal information as new supplementary nodes, which face severe sparsity issues, especially for item entities (as few items share identical visual/text content). Multi-KG4Rec's single-modal graph approach, combined with pre-trained models, avoids this explicit node sparsity by focusing on rich feature representations for existing items rather than creating new sparse attribute nodes for every unique piece of multimodal content.
  • Fine-Grained Multimodal Fusion:
    • Vs. Simple Fusion (Concatenation/Weighted Sums): Many prior methods use simple concatenation or weighted sums for multimodal fusion, which are less effective at capturing complex, subtle interactions between modalities. Multi-KG4Rec employs a cross multi-head attention module between a text transformer and a visual transformer. This allows for a much more dynamic and fine-grained interaction where text features can attend to visual features, and vice versa, enabling the model to learn deep correlations and extract personalized preferences that are highly modality-specific.

    • Leveraging LLMs and GNNs: Unlike methods that rely on simpler encoders, Multi-KG4Rec utilizes large pre-trained multimodal models such as CLIP to generate initial, high-quality, aligned multimodal features. It then integrates these features with graph-structured information using a GNN, which is particularly suited for propagating information over complex graph structures, bridging the gap between rich pre-trained features and graph topology. This is highlighted as a novel way of combining the strengths of large pre-trained multimodal encoders with graph-structured data.

      In essence, Multi-KG4Rec addresses the architectural fragmentation and superficial multimodal fusion of previous MKG methods by offering a more integrated, feature-rich, and interactively fused approach, driven by advanced pre-trained multimodal models and attention mechanisms.

4. Methodology

4.1. Principles

The core idea behind Multi-KG4Rec is to construct a flexible Multimodal Knowledge Graph (MKG) architecture that effectively captures fine-grained interactions between different modalities (text and visual) for personalized recommendations. This is achieved by:

  1. Modular MKG Construction: Instead of rigid feature-based or entity-based MKG designs, Multi-KG4Rec divides the MKG into several single-modal graphs. This allows for independent representation learning within each modality, capturing modality-specific features while avoiding issues like node sparsity for item entities.
  2. Leveraging Pre-trained Multimodal Encoders: Utilizing powerful pre-trained multimodal models (like CLIP) to generate initial, aligned, and rich entity features from both text and visual content, ensuring that the multimodal information is well-represented from the outset.
  3. Fine-Grained Cross-Modal Fusion: Employing Transformer-based cross multi-head attention to deeply fuse information between modalities. This mechanism allows for dynamic interaction and attention between text and visual features, enabling the extraction of subtle, personalized multimodal preferences.
  4. Knowledge-Aware Propagation: Integrating a Graph Neural Network (GNN) layer that is knowledge-aware to propagate these fused multimodal features across the KG structure. This ensures that high-order relational information is incorporated into the final user and item representations.
  5. Personalized Prediction: Combining these rich, fused, and propagated representations of users and items to predict user interest in potential items, optimized via a Bayesian Personalized Ranking (BPR) loss.

4.2. Core Methodology In-depth

The Multi-KG4Rec framework consists of an Embedding module, a Multimodal fusion module, an Information propagation module, and a Prediction component, all optimized through a unified Optimizer. The overall architecture is illustrated in Figure 2.

Fig. 2. The architecture of the proposed Multi-KG4Rec. The figure shows the full pipeline from single-modal inputs to cross-modal interaction via the image-text interaction module, through the information propagation module, and finally to prediction, reflecting the multimodal fusion design.

4.2.1. Embedding Module

The embedding module is responsible for initializing and optimizing the representations of entities in the Knowledge Graph.

4.2.1.1. Entity Embedding

Given a triplet (h, r, t) in the KG $\mathcal{G}$, entities initially have an ID that is embedded as a structural feature via a lookup table. For multimodal entities (items and their attribute nodes that possess visual and text content), the paper uses CLIP, a pre-trained visual-text model, to align image-text pairs and generate initial entity features. Specifically, the visual and textual descriptions corresponding to entities are fed into CLIP. The outputs from CLIP's two encoders (one for visual, one for text) are projected into a shared embedding space, and the output of the last layer serves as the feature, with a dimensionality of 512. This ensures that text and visual features are intrinsically aligned from the start.
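As a concrete illustration, the snippet below extracts aligned 512-dimensional visual and text features with CLIP. The ViT-B/32 checkpoint (whose projection dimension is 512) and the Hugging Face transformers API are assumptions for this sketch; the paper only states that CLIP's last-layer outputs are used.

```python
# Minimal sketch: CLIP features for one entity (e.g., a movie with a poster and a description).
# The checkpoint and API are assumptions; the paper only specifies CLIP with 512-d outputs.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("poster.jpg")                          # hypothetical poster file
text = "A science-fiction thriller about shared dreams."  # hypothetical description

with torch.no_grad():
    img_inputs = processor(images=image, return_tensors="pt")
    txt_inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    visual_feat = model.get_image_features(**img_inputs)  # shape (1, 512)
    text_feat = model.get_text_features(**txt_inputs)     # shape (1, 512)
```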

4.2.1.2. Embedding Optimization

To further optimize these entity features and capture their structural relationships within the KG, TransR is adopted. TransR models relations as translations in relation-specific spaces. Nodes and edges in $\mathcal{G}$ are converted into triplets (h, r, t), and the optimization objective is that, in the relation space, the projected head entity plus the relation embedding is approximately equal to the projected tail entity. The embeddings of the head entity $h$, tail entity $t$, and relation $r$ are denoted as $\mathbf{e}_h, \mathbf{e}_t \in \mathbb{R}^d$ and $\mathbf{e}_r \in \mathbb{R}^k$, respectively; $\mathbf{e}_h^r$ and $\mathbf{e}_t^r$ denote the projected representations of $\mathbf{e}_h$ and $\mathbf{e}_t$ in the space of relation $r$. For a given triplet (h, r, t), the objective score $g(h, r, t)$ is: $g(h, r, t) = \left\| \mathbf{W}_r \mathbf{e}_h + \mathbf{e}_r - \mathbf{W}_r \mathbf{e}_t \right\|_2^2$ Here:

  • g(h, r, t): The score indicating the likelihood that the triplet (h, r, t) is true. A lower score implies a higher likelihood.

  • eh,et\mathbf{e}_h, \mathbf{e}_t: The original embeddings of the head and tail entities, respectively, in the entity space.

  • er\mathbf{e}_r: The embedding of the relation rr.

  • WrRk×d\mathbf{W}_r \in \mathbb{R}^{k \times d}: A transformation matrix that projects entities from the entity space (dimension dd) to the relation rr space (dimension kk).

  • 22\|\cdot\|_2^2: The squared L2L_2 norm, measuring the Euclidean distance.

    The training of TransR distinguishes between positive triplets (existing in $\mathcal{G}$) and negative triplets (not existing in $\mathcal{G}$) using a pairwise ranking loss. For a positive triplet (h, r, t) and a sampled negative triplet (h, r, t'), the loss $\mathcal{L}_{\mathrm{KG}}$ is defined as (a code sketch follows the symbol list below): $\mathcal{L}_{\mathrm{KG}} = \sum_{(h, r, t, t') \in T} -\ln \sigma\left( g(h, r, t') - g(h, r, t) \right)$ Here:

  • LKG\mathcal{L}_{\mathrm{KG}}: The knowledge graph embedding loss.

  • TT: The set of training samples, each containing a positive triplet (h, r, t) and a corresponding negative triplet (h, r, t').

  • g(h, r, t'): The score for the negative triplet.

  • g(h, r, t): The score for the positive triplet.

  • σ()\sigma(\cdot): The sigmoid function, which squashes its input to a range between 0 and 1. This loss encourages the score of positive triplets to be lower than that of negative triplets, meaning g(h,r,t)>g(h,r,t)g(h, r, t') > g(h, r, t) is preferred, making σ(g(h,r,t)g(h,r,t))\sigma(g(h, r, t') - g(h, r, t)) close to 1, and ln()-\ln(\cdot) close to 0.
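A minimal PyTorch sketch of the TransR score and the pairwise KG loss above (a toy version with a single shared projection matrix; in the paper $\mathbf{W}_r$ is relation-specific, and all names are illustrative):

```python
import torch
import torch.nn.functional as F

def transr_score(e_h, e_r, e_t, W_r):
    """g(h, r, t) = || W_r e_h + e_r - W_r e_t ||_2^2  (lower means more plausible)."""
    proj_h = e_h @ W_r.T          # project heads into the k-dim relation space
    proj_t = e_t @ W_r.T          # project tails into the k-dim relation space
    return ((proj_h + e_r - proj_t) ** 2).sum(dim=-1)

def kg_loss(pos_score, neg_score):
    """L_KG = -ln sigmoid(g(h,r,t') - g(h,r,t)); pushes positive triplets to lower scores."""
    return -F.logsigmoid(neg_score - pos_score).mean()

# Toy usage with entity dimension d = 64 and relation dimension k = 32
d, k, batch = 64, 32, 8
e_h, e_t, e_t_neg = (torch.randn(batch, d) for _ in range(3))
e_r, W_r = torch.randn(batch, k), torch.randn(k, d)
loss = kg_loss(transr_score(e_h, e_r, e_t, W_r), transr_score(e_h, e_r, e_t_neg, W_r))
```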

4.2.2. Multimodal Fusion Module

This module is designed to fuse modality information at a fine-grained level, capturing complex interactions between text and visual features. It comprises a text transformer, a visual transformer, and a multi-head attention layer for cross-modal interaction. The transformers aim to extract dependencies within a single modality, while the multi-head attention layer performs the actual multimodal fusion.

First, for an item $v_i$, its high-order neighbors are collected to form an input sequence $S_i$. The paper uses a breadth-first search (BFS) approach to gather neighbors up to a certain count $n$, sorted by distance. This sequence includes the item itself, user neighbors, and attribute entity neighbors (e.g., $S_i = \{v_i, u_m^1, e_p^1, e_q^2, u_n^2, e_k^2\}$, where $v_i$ is the item, $u_m^1$ is a 1st-order user neighbor, $e_p^1$ is a 1st-order attribute neighbor, $e_q^2$ is a 2nd-order attribute neighbor, and so on).

Given the neighbor set $S_i$, the visual features and text features corresponding to these neighbors are denoted as $\mathbf{x}_v \in \mathbb{R}^{n \times d}$ and $\mathbf{x}_t \in \mathbb{R}^{n \times d}$, respectively, where $n$ is the sequence length (number of neighbors) and $d$ is the embedding dimension.
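A small sketch of the neighbor collection step (the exact ordering and tie-breaking rules are assumptions; the graph and entity names are illustrative):

```python
from collections import deque

def bfs_neighbor_sequence(graph: dict, item: str, n: int) -> list:
    """Collect up to n neighbors of `item` (including the item itself),
    ordered by hop distance, to form the input sequence S_i."""
    seq, visited, queue = [item], {item}, deque(graph.get(item, []))
    while queue and len(seq) < n:
        node = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        seq.append(node)
        queue.extend(graph.get(node, []))  # enqueue the next hop
    return seq

# Toy KG adjacency: item v_i linked to users and attribute entities
graph = {"v_i": ["u_m", "e_p"], "u_m": ["e_q"], "e_p": ["u_n", "e_k"]}
print(bfs_neighbor_sequence(graph, "v_i", n=6))  # ['v_i', 'u_m', 'e_p', 'e_q', 'u_n', 'e_k']
```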

The process of multi-head attention (MHA) for cross-modal fusion, as shown in Figure 3(a), involves transforming these features into queries, keys, and values. Taking the visual branch as an example: $Q_v^{(i)}, K_v^{(i)}, V_v^{(i)} = \mathbf{x}_v \mathbf{W}_Q^{v(i)}, \mathbf{x}_v \mathbf{W}_K^{v(i)}, \mathbf{x}_v \mathbf{W}_V^{v(i)}$ Here:

  • Qv(i),Kv(i),Vv(i)Q_v^{(i)}, K_v^{(i)}, V_v^{(i)}: The query, key, and value matrices for the ii-th head of the visual modality.

  • xv\mathbf{x}_v: The input visual features for the neighbors in SiS_i.

  • WQv(i),WKv(i),WVv(i)Rd×dh\mathbf{W}_Q^{v(i)}, \mathbf{W}_K^{v(i)}, \mathbf{W}_V^{v(i)} \in \mathbb{R}^{d \times d_h}: Learnable weight matrices for projecting the input features into queries, keys, and values for the ii-th head.

  • dh=d/Hd_h = d/H: The dimension of each head, where HH is the total number of heads.

    The text modality features $\mathbf{x}_t$ undergo similar transformations to generate $Q_t^{(i)}, K_t^{(i)}, V_t^{(i)}$.

For cross-modal attention, the keys and values from both visual and text modalities are concatenated. For example, the output of the $i$-th visual head, denoted $\mathrm{head}_i^{M_v}$, is calculated as: $\mathrm{head}_i^{M_v} = \mathrm{Attn}\left( Q_v^{(i)}, \mathrm{concat}\left(K_v^{(i)}, K_t^{(i)}\right), \mathrm{concat}\left(V_v^{(i)}, V_t^{(i)}\right) \right)$ Here:

  • $\mathrm{head}_i^{M_v}$: The output of the $i$-th attention head for the visual modality, having attended to both visual and text key-value pairs.
  • $\mathrm{Attn}(\cdot, \cdot, \cdot)$: The scaled dot-product attention function, typically defined as $\mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right)V$.
  • $Q_v^{(i)}$: The query from the visual modality.
  • $\mathrm{concat}(K_v^{(i)}, K_t^{(i)})$: The concatenation of keys from the visual and text modalities.
  • $\mathrm{concat}(V_v^{(i)}, V_t^{(i)})$: The concatenation of values from the visual and text modalities. This operation lets the visual query attend to relevant information in both modalities. The symmetric computation is performed for the text branch, with $Q_t^{(i)}$ as the query attending to the same concatenated keys and values.

After computing all $H$ heads for a modality, their outputs are concatenated to form the final multi-head attention output for that modality: $\mathbf{head}^v = \mathrm{concat}\left( \mathrm{head}_1^{M_v}, \dots, \mathrm{head}_H^{M_v} \right) W_o^v$ Here:

  • headv\mathbf{head}^v: The final output from the multi-head attention layer for the visual modality.

  • concat()\mathrm{concat}(\cdot): Concatenation operation.

  • WovW_o^v: A learnable linear projection matrix that combines the concatenated head outputs back to the original embedding dimension dd.

    Following the multi-head attention layer, a Feedforward Neural Network (FFN) is applied. The FFN consists of two linear layers with a ReLU activation in between, and it operates on the output of the attention mechanism (after layer normalization and residual connections). For the visual modality, the FFN calculation is (a code sketch of the whole fusion block follows the symbol list below): $\mathrm{FFN}(\mathbf{x})^v = \mathrm{ReLU}\left( \mathbf{x}_v W_1^v + \mathbf{b}_1^v \right) W_2^v + \mathbf{b}_2^v$ Here:

  • FFN(x)v\mathrm{FFN}(\mathbf{x})^v: The output of the feedforward network for the visual modality.

  • xv\mathbf{x}_v: The input to the FFN (which is the output from the multi-head attention layer for the visual modality, potentially with residual connections and layer normalization).

  • ReLU()\mathrm{ReLU}(\cdot): The Rectified Linear Unit activation function, ReLU(z)=max(0,z)\mathrm{ReLU}(z) = \max(0, z).

  • W1vRd×dmW_1^v \in \mathbb{R}^{d \times d_m}, W2vRdm×dW_2^v \in \mathbb{R}^{d_m \times d}: Learnable weight matrices for the two linear layers.

  • b1v,b2v\mathbf{b}_1^v, \mathbf{b}_2^v: Learnable bias vectors.

  • dmd_m: The hidden dimension of the FFN.
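To make the fusion step concrete, here is a compact PyTorch sketch of one cross-modal block (residual connections and layer normalization, which the text mentions, are omitted for brevity; all module and variable names are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Sketch of one cross-modal block: each modality's queries attend to the
    concatenated keys/values of both modalities, followed by a per-modality FFN."""
    def __init__(self, d: int = 64, heads: int = 8, d_ffn: int = 256):
        super().__init__()
        assert d % heads == 0
        self.h, self.dh = heads, d // heads
        self.qkv_v = nn.Linear(d, 3 * d)   # visual Q, K, V projections
        self.qkv_t = nn.Linear(d, 3 * d)   # text Q, K, V projections
        self.out_v, self.out_t = nn.Linear(d, d), nn.Linear(d, d)
        self.ffn_v = nn.Sequential(nn.Linear(d, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d))
        self.ffn_t = nn.Sequential(nn.Linear(d, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d))

    def _split(self, x):                    # (B, n, d) -> (B, h, n, dh)
        B, n, _ = x.shape
        return x.view(B, n, self.h, self.dh).transpose(1, 2)

    def forward(self, x_v, x_t):            # both (B, n, d)
        q_v, k_v, v_v = map(self._split, self.qkv_v(x_v).chunk(3, dim=-1))
        q_t, k_t, v_t = map(self._split, self.qkv_t(x_t).chunk(3, dim=-1))
        k = torch.cat([k_v, k_t], dim=2)    # keys from both modalities
        v = torch.cat([v_v, v_t], dim=2)    # values from both modalities

        def attend(q):                      # scaled dot-product attention per modality
            w = F.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)
            out = w @ v                     # (B, h, n, dh)
            return out.transpose(1, 2).reshape(x_v.size(0), -1, self.h * self.dh)

        h_v = self.out_v(attend(q_v))
        h_t = self.out_t(attend(q_t))
        return self.ffn_v(h_v), self.ffn_t(h_t)

# Toy usage: batch of 2 items, n = 6 neighbors each, embedding dimension d = 64
x_v, x_t = torch.randn(2, 6, 64), torch.randn(2, 6, 64)
h_v, h_t = CrossModalAttention()(x_v, x_t)
```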

4.2.3. Information Propagation Module

After obtaining the multimodal information, a knowledge-aware graph attention layer is applied to propagate this information to higher-order neighbors, as shown in Figure 3(b).

Fig. 3. Illustration of modal interaction and high-order information propagation. The figure shows the structure and workflow of the cross-modal attention module and the Bi-Interaction aggregator, which realize inter-modal interaction and high-order information propagation.

For a given entity $h$, $\mathcal{N}_h$ denotes the set of triplets where $h$ is the head entity: $\mathcal{N}_h = \{(h, r, t) \mid (h, r, t) \in \mathcal{G}\}$. The neighbor information is aggregated as follows: $\mathbf{e}_{\mathcal{N}_h} = \sum_{(h, r, t) \in \mathcal{N}_h} \pi(h, r, t)\, \mathbf{e}_t$ Here:

  • $\mathbf{e}_{\mathcal{N}_h}$: The aggregated embedding representing the neighborhood of entity $h$.

  • $\mathbf{e}_t$: The embedding of the tail entity $t$ from a triplet (h, r, t).

  • $\pi(h, r, t)$: An attention coefficient that controls how much information flows from entity $t$ to entity $h$ through their relation $r$.

    The attention coefficient $\pi(h, r, t)$ is defined as: $\pi(h, r, t) = \left( \mathbf{W}_r \mathbf{e}_t \right)^\top \tanh\left( \mathbf{W}_r \mathbf{e}_h + \mathbf{e}_r \right)$ Here:

  • π(h,r,t)\pi(h, r, t): The raw attention score between head hh and tail tt through relation rr.

  • Wr\mathbf{W}_r: A trainable weight matrix that transforms entity embeddings.

  • et\mathbf{e}_t: Embedding of the tail entity.

  • tanh()\tanh(\cdot): The hyperbolic tangent activation function.

  • Wreh+er\mathbf{W}_r \mathbf{e}_h + \mathbf{e}_r: Represents the transformed head entity embedding combined with the relation embedding.

  • ()(\cdot)^\top: Transpose operation. The coefficients for all triplets connected to hh are then normalized using the softmax function (not explicitly shown in the formula but implied by standard graph attention mechanisms).

Finally, the original embedding of $h$ ($\mathbf{e}_h$) and the aggregated neighborhood embedding ($\mathbf{e}_{\mathcal{N}_h}$) are combined using a Bi-Interaction mechanism (a code sketch follows the symbol list below): $f_{\mathrm{Bi-Interaction}} = \mathrm{LeakyReLU}\left( \mathbf{W}_1 \left( \mathbf{e}_h + \mathbf{e}_{\mathcal{N}_h} \right) \right) + \mathrm{LeakyReLU}\left( \mathbf{W}_2 \left( \mathbf{e}_h \odot \mathbf{e}_{\mathcal{N}_h} \right) \right)$ Here:

  • $f_{\mathrm{Bi-Interaction}}$: The final combined representation of entity $h$ after considering its own embedding and its aggregated neighborhood.
  • $\mathrm{LeakyReLU}(\cdot)$: The Leaky Rectified Linear Unit activation function, $\mathrm{LeakyReLU}(z) = \max(0, z) + \alpha \min(0, z)$, where $\alpha$ is a small positive slope.
  • $\mathbf{W}_1, \mathbf{W}_2$: Learnable weight matrices.
  • $\mathbf{e}_h + \mathbf{e}_{\mathcal{N}_h}$: The element-wise sum, capturing linear (additive) interactions.
  • $\mathbf{e}_h \odot \mathbf{e}_{\mathcal{N}_h}$: The element-wise product, capturing non-linear (element-wise multiplicative) interactions. The Bi-Interaction component helps capture both additive and multiplicative relationships between the entity's own features and its neighborhood context. This process is typically repeated for $L$ layers to capture higher-order connectivity.
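A toy PyTorch sketch of one propagation step for a single head entity, combining the attentive aggregation and the Bi-Interaction aggregator (the softmax normalization over $\mathcal{N}_h$ is made explicit; a single shared $\mathbf{W}_r$ is used for simplicity, and all names are illustrative):

```python
import torch
import torch.nn.functional as F

def propagate(e_h, e_r, e_t, W_r, W1, W2, alpha: float = 0.2):
    """One knowledge-aware aggregation step for a single head entity h.
    e_t: (m, d) tail embeddings of the m triplets in N_h; e_r: (m, k) relation embeddings."""
    proj_h = e_h @ W_r.T                                  # (k,)
    proj_t = e_t @ W_r.T                                  # (m, k)
    scores = (proj_t * torch.tanh(proj_h + e_r)).sum(-1)  # pi(h, r, t), shape (m,)
    attn = F.softmax(scores, dim=0)                       # normalize over N_h
    e_nh = (attn.unsqueeze(-1) * e_t).sum(0)              # aggregated neighborhood, (d,)
    # Bi-Interaction aggregator: additive term + element-wise product term
    return (F.leaky_relu((e_h + e_nh) @ W1.T, alpha)
            + F.leaky_relu((e_h * e_nh) @ W2.T, alpha))

# Toy usage: d = 64, relation space k = 32, m = 5 neighboring triplets
d, k, m = 64, 32, 5
out = propagate(torch.randn(d), torch.randn(m, k), torch.randn(m, d),
                torch.randn(k, d), torch.randn(d, d), torch.randn(d, d))
```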

4.2.4. Prediction

After the information propagation module, we obtain representations for user $u$ and item $i$ from each layer $l$ of the GNN (e.g., $e_u^{(1)}, \ldots, e_u^{(l)}$ and $e_i^{(1)}, \ldots, e_i^{(l)}$). A layer-aggregation mechanism (Xu et al., 2018) concatenates these representations into unified vectors: $\mathbf{e}_u^* = \mathbf{e}_u^{(0)} \,\|\, \cdots \,\|\, \mathbf{e}_u^{(l)}, \quad \mathbf{e}_i^* = \mathbf{e}_i^{(0)} \,\|\, \cdots \,\|\, \mathbf{e}_i^{(l)}$ Here:

  • $\mathbf{e}_u^*, \mathbf{e}_i^*$: The aggregated representations for user $u$ and item $i$, respectively.

  • $\mathbf{e}_u^{(0)}, \ldots, \mathbf{e}_u^{(l)}$: User representations from the initial layer (0) up to the $l$-th layer.

  • $\mathbf{e}_i^{(0)}, \ldots, \mathbf{e}_i^{(l)}$: Item representations from the initial layer (0) up to the $l$-th layer.

  • $\|$: Denotes the concatenation operation.

    Then, the user and item representations from the visual modality ($\mathbf{e}_u^{v(*)}, \mathbf{e}_i^{v(*)}$) are concatenated with those from the text modality ($\mathbf{e}_u^{t(*)}, \mathbf{e}_i^{t(*)}$) to obtain the final user and item representations: $\mathbf{e}_u = \mathbf{e}_u^{v(*)} \,\|\, \mathbf{e}_u^{t(*)}, \quad \mathbf{e}_i = \mathbf{e}_i^{v(*)} \,\|\, \mathbf{e}_i^{t(*)}$ Here:

  • $\mathbf{e}_u, \mathbf{e}_i$: The final, comprehensive embeddings for user $u$ and item $i$, incorporating both visual and textual multimodal information. The predicted score $\hat{y}(u, i)$ for user $u$ and item $i$ can then be calculated, typically as the inner product of the final embeddings: $\hat{y}(u, i) = \mathbf{e}_u^\top \mathbf{e}_i$.
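A brief sketch of this layer aggregation and scoring step (the layer dimensions follow the reported settings; whether the initial embedding is included in the concatenation exactly as shown is an assumption, and all names are illustrative):

```python
import torch

def final_embedding(layer_embs_v, layer_embs_t):
    """Concatenate the per-layer GNN outputs of each modality, then concatenate
    the two modalities to obtain the final representation e_u (or e_i)."""
    e_star_v = torch.cat(layer_embs_v, dim=-1)   # e^{v(*)}
    e_star_t = torch.cat(layer_embs_t, dim=-1)   # e^{t(*)}
    return torch.cat([e_star_v, e_star_t], dim=-1)

def score(e_u, e_i):
    """Predicted preference: inner product of the final user and item embeddings."""
    return (e_u * e_i).sum(dim=-1)

# Toy usage: GNN layer output dimensions 64/32/16, as in the paper's settings
dims = [64, 32, 16]
u_v, u_t = [torch.randn(1, d) for d in dims], [torch.randn(1, d) for d in dims]
i_v, i_t = [torch.randn(1, d) for d in dims], [torch.randn(1, d) for d in dims]
print(score(final_embedding(u_v, u_t), final_embedding(i_v, i_t)))
```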

4.2.5. Optimizer

To train the recommendation model, the Bayesian Personalized Ranking (BPR) loss is used to optimize the parameters based on the prediction loss. The BPR loss aims to maximize the difference between the predicted scores of an observed (positive) item and an unobserved (negative) item for a given user: $\mathcal{L}_{\mathrm{CF}} = \sum_{(u, i) \in \mathcal{R}^+, (u, j) \in \mathcal{R}^-} -\ln \sigma\left( \hat{y}(u, i) - \hat{y}(u, j) \right) + \lambda \|\theta\|_2^2$ Here:

  • LCF\mathcal{L}_{\mathrm{CF}}: The collaborative filtering loss, based on BPR.

  • R+\mathcal{R}^+: The set of observed (positive) user-item interactions.

  • R\mathcal{R}^-: The set of sampled unobserved (negative) user-item interactions.

  • y^(u,i)\hat{y}(u, i): The predicted score for user uu and positive item ii.

  • y^(u,j)\hat{y}(u, j): The predicted score for user uu and negative item jj.

  • σ()\sigma(\cdot): The sigmoid function.

  • λ\lambda: The regularization coefficient.

  • θ22\|\theta\|_2^2: The L2L_2 regularization term for all trainable model parameters θ\theta, used to prevent overfitting.

    The final overall loss function combines the KG embedding loss ($\mathcal{L}_{\mathrm{KG}}$) and the collaborative filtering loss ($\mathcal{L}_{\mathrm{CF}}$), as sketched in the code after the symbol list below: $\mathcal{L} = \mathcal{L}_{\mathrm{KG}} + \mathcal{L}_{\mathrm{CF}}$ Here:

  • L\mathcal{L}: The total loss to be minimized during training.

  • LKG\mathcal{L}_{\mathrm{KG}}: The TransR knowledge graph embedding loss, as defined in Section 4.2.1.2.

  • LCF\mathcal{L}_{\mathrm{CF}}: The BPR collaborative filtering loss, as defined above.
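A minimal sketch of one training step under this combined objective (joint optimization with Adam is an assumption; some KGAT-style models alternate the two losses instead):

```python
import torch

def training_step(l_kg: torch.Tensor, l_cf: torch.Tensor,
                  optimizer: torch.optim.Optimizer) -> float:
    """One gradient step on L = L_KG + L_CF (joint optimization assumed)."""
    loss = l_kg + l_cf
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with stand-in differentiable losses over a dummy parameter
params = [torch.nn.Parameter(torch.randn(4, 8))]
opt = torch.optim.Adam(params, lr=1e-3)
l_kg = (params[0] ** 2).mean()         # stand-in for the TransR loss
l_cf = (params[0].sum() - 1.0) ** 2    # stand-in for the BPR loss
print(training_step(l_kg, l_cf, opt))
```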

5. Experimental Setup

5.1. Datasets

The experiments were conducted on two real-world datasets from different domains: MovieLens and Amazon-Books.

  • MovieLens:

    • Source: MovieLens-1M dataset, a widely used benchmark for recommender systems.
    • Characteristics: Contains user IDs, item (movie) IDs, and ratings on a scale from 1 to 5.
    • Processing: All ratings were converted to binary: 1 for a rating of 1, and 0 for all other ratings. This implies that only ratings of 1 were considered positive interactions.
    • Multimodal Enrichment: A knowledge graph was constructed by linking items in the dataset to entities in Freebase. Corresponding movie posters and text descriptions were retrieved from IMDb to serve as visual and textual multimodal information for the entities.
  • Amazon-Books:

    • Source: A subset of user reviews from Amazon's e-commerce website.
    • Processing: Users with fewer than 10 interactions were filtered out, following the method in Wang, He et al. (2019), to ensure sufficient historical data per user.
    • Multimodal Enrichment: Multimodal information (likely book covers and descriptions) was collected using the same methodology as for the MovieLens dataset.
  • Data Statistics: The following are the results from Table 1 of the original paper:

    Dataset        #Interactions  #Items  #Users  Sparsity  #Entities  #Relations  #Triplets
    MovieLens      834,268        3,589   6,040   96.15%    60,406     51          273,547
    Amazon-Books   332,834        18,932  24,047  99.92%    44,935     23          192,388
  • Rationale for Dataset Choice: These datasets are widely recognized and used in the recommender systems community. MovieLens provides a classic movie recommendation scenario, while Amazon-Books offers an e-commerce context. Both allow for the integration of structured knowledge (KGs) and rich multimodal content (posters/covers, descriptions), making them suitable for evaluating multimodal KG-based recommender systems. The high sparsity levels (96.15% for MovieLens, 99.92% for Amazon-Books) highlight the challenge of recommending in sparse interaction environments, where KGs and multimodal data can provide crucial auxiliary information.

5.2. Evaluation Metrics

To measure the quality of the recommended sequences, three commonly used metrics are employed: Recall@k, MRR@k, and NDCG@k. The default value for kk is 20. When using these metrics, items the user has already interacted with are treated as positive, and others as candidates. The top kk ranked items are selected as recommendations.

  1. Recall@k:

    • Conceptual Definition: Recall@k measures the proportion of relevant items (i.e., items a user actually interacted with in the test set) that are successfully retrieved within the top kk recommendations. It focuses on how many of the truly relevant items the recommender system managed to "remember" or find.
    • Mathematical Formula: $ \mathrm{Recall@k} = \frac{1}{|U|} \sum_{u \in U} \frac{|\mathrm{Rel}_u \cap \mathrm{Rec}_u^k|}{|\mathrm{Rel}_u|} $
    • Symbol Explanation:
      • UU: The set of all users in the test set.
      • |\cdot|: The cardinality (number of elements) of a set.
      • Relu\mathrm{Rel}_u: The set of relevant items for user uu in the test set (items uu actually interacted with).
      • Recuk\mathrm{Rec}_u^k: The set of top kk items recommended to user uu.
      • \cap: Set intersection.
      • uU\sum_{u \in U}: Summation over all users.
  2. Mean Reciprocal Rank (MRR@k):

    • Conceptual Definition: MRR@k evaluates the ranking quality, especially when there is only one or very few correct answers. For each query (user), it finds the rank of the first relevant item. The reciprocal of this rank is taken (1/rank), and then these reciprocal ranks are averaged across all queries. A higher MRR indicates that the first relevant item appears earlier in the recommendation list. It is sensitive to ranking positions.
    • Mathematical Formula: $ \mathrm{MRR@k} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\mathrm{rank}_u} $
    • Symbol Explanation:
      • UU: The set of all users in the test set.
      • |\cdot|: The cardinality of a set.
      • ranku\mathrm{rank}_u: The rank of the first relevant item in the recommendation list for user uu, up to rank kk. If no relevant item is found within the top kk, the reciprocal rank is 0.
  3. Normalized Discounted Cumulative Gain (NDCG@k):

    • Conceptual Definition: NDCG@k is a measure of ranking quality that considers the graded relevance of items (though often binary in recommendation). It assigns higher scores to relevant items that appear earlier in the list and discounts the value of relevant items as their position decreases. It normalizes the score by the ideal DCG (perfect ranking) to make scores comparable across different users or queries. It is sensitive to ranking positions.
    • Mathematical Formula: $\mathrm{NDCG@k} = \frac{1}{|U|} \sum_{u \in U} \frac{\mathrm{DCG@k}_u}{\mathrm{IDCG@k}_u}$, where $\mathrm{DCG@k}_u = \sum_{j=1}^{k} \frac{2^{\mathrm{rel}(j)} - 1}{\log_2(j+1)}$ and $\mathrm{IDCG@k}_u$ is the DCG of the ideal ranking for user $u$. (A code sketch of all three metrics follows this list.)
    • Symbol Explanation:
      • UU: The set of all users in the test set.
      • |\cdot|: The cardinality of a set.
      • DCG@ku\mathrm{DCG@k}_u: Discounted Cumulative Gain for user uu up to rank kk.
      • IDCG@ku\mathrm{IDCG@k}_u: Ideal Discounted Cumulative Gain for user uu up to rank kk, which is the DCG value if all relevant items were ranked perfectly at the top.
      • jj: The position of an item in the recommendation list.
      • rel(j)\mathrm{rel}(j): The relevance score of the item at position jj. In binary relevance scenarios (relevant=1, not relevant=0), it's 1 if the item is relevant, 0 otherwise.
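A small self-contained sketch of these three metrics for a single user with binary relevance (averaging the per-user values over all users gives the reported numbers; function and variable names are illustrative):

```python
import numpy as np

def recall_mrr_ndcg_at_k(ranked_items, relevant, k=20):
    """Compute Recall@k, MRR@k, and NDCG@k for one user with binary relevance."""
    topk = ranked_items[:k]
    hits = [1.0 if item in relevant else 0.0 for item in topk]
    recall = sum(hits) / max(len(relevant), 1)
    # reciprocal rank of the first hit within the top k (0 if there is none)
    mrr = next((1.0 / (idx + 1) for idx, h in enumerate(hits) if h), 0.0)
    dcg = sum(h / np.log2(idx + 2) for idx, h in enumerate(hits))
    idcg = sum(1.0 / np.log2(idx + 2) for idx in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return recall, mrr, ndcg

# Toy usage: 'b' and 'd' are the user's relevant items, list length k = 3
print(recall_mrr_ndcg_at_k(["a", "b", "c", "d"], {"b", "d"}, k=3))
```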

5.3. Baselines

The Multi-KG4Rec model was compared against several baselines, categorized into collaborative filtering, knowledge graph-based, and multimodal methods that incorporate knowledge graphs.

  • Collaborative Filtering Methods: These methods typically rely on user-item interaction patterns.
    • SpectralCF (Zheng et al., 2018): A spectral collaborative filtering model that applies a convolutional model in the spectral domain space based on the bipartite graph of user-item interactions. It aims to reveal deep connections and alleviate the cold-start problem.
    • ConvNCF (He et al., 2018): Neural Collaborative Filtering model that uses element-wise products to capture pairwise correlations among dimensions within the embedding space.
  • Knowledge Graph-based Approaches: These integrate KGs to enrich recommendations.
    • KGAT (Wang, He et al., 2019): Integrates TransR and GNN to generate entity representations and propagates features on the collaborative KG to encode high-order relationships.
    • KGCN (Wang, Zhao et al., 2019): Utilizes GNNs to learn entity relationships by aggregating information from a fixed number of neighbors as the receptive field.
    • CKE (Zhang et al., 2016): Integrates structural information (via TransR), textual data (via stacked denoising autoencoders), and image data (via stacked convolutional auto-encoders) to enhance recommendation quality.
  • Multimodal Methods Incorporating Knowledge Graphs: These are specifically designed to handle multimodal information alongside KGs.
    • MKGAT (Sun et al., 2020): A multimodal graph attention mechanism designed to solve entity information aggregation and entity relationship reasoning, identified as a representative feature-based method.

      These baselines are representative as they cover various paradigms: pure collaborative filtering, KG-enhanced methods, and multimodal KG methods, allowing for a comprehensive evaluation of Multi-KG4Rec's advancements.

5.4. Parameter Settings

  • Data Split: The interaction data was randomly split into 8:1:1 for training, validation, and testing, respectively.
  • Initialization: Model parameters were initialized using the Xavier initializer.
  • Optimizer: Adam optimizer was used for model optimization.
  • Hyperparameters:
    • Mini-batch sizes: Searched within $\{1024, 5120, 10240\}$.
    • Learning rates: Searched within $\{0.0001, 0.0005, 0.001, 0.005, 0.01\}$.
    • Regularization coefficient $\lambda$ (for $L_2$ regularization): Searched within $\{10^{-5}, 10^{-4}, \dots, 10^{-1}\}$.
  • Multimodal Features:
    • CLIP model: Utilized for visual and text entities, extracting 512-dimensional features from its last layer.
    • Dimension Reduction: These 512-dimensional features were then reduced to 64 dimensions via a non-linear transformation with a LeakyReLU activation function.
  • Multimodal Fusion Module:
    • Blocks: Stacked 3 blocks (layers).
    • Attention Heads: Each block used 8 attention heads.
  • Information Propagation Module:
    • Layers: 3 layers of the knowledge-aware graph neural network were used to encode high-order connectivity.
    • Output Dimensions: The output dimensions of the three GNN layers were $\{64, 32, 16\}$, i.e., the dimension is progressively reduced through the layers.
  • Implementation: The Multi-KG4Rec model was implemented in PyTorch.
  • Hardware: All experiments were conducted on a Windows PC equipped with an RTX 3090 GPU.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that Multi-KG4Rec consistently outperforms all baseline models on both the MovieLens and Amazon-Books datasets across all three evaluation metrics (Recall@k, MRR@k, NDCG@k).

The following are the results from Table 2 of the original paper:

Models         MovieLens                    Amazon-Books
               Recall   MRR      NDCG       Recall   MRR      NDCG
SpectralCF     0.2199   0.3714   0.2082     0.1327   0.0541   0.0602
ConvNCF        0.1815   0.3405   0.1794     0.0404   0.0148   0.0175
KGAT           0.2489   0.3941   0.2303     0.1431   0.0553   0.0702
KGCN           0.2268   0.3783   0.2165     0.1418   0.0528   0.0677
CKE            0.2217   0.3754   0.2128     0.1324   0.0491   0.0612
MKGAT          0.2513   0.3963   0.2311     0.1477   0.0560   0.0707
Multi-KG4Rec   0.2552   0.4077   0.2383     0.1498   0.0572   0.0727
Improv.        1.55%    2.88%    3.12%      1.42%    2.83%    2.14%

Key findings from this comparison:

  • Multi-KG4Rec's Superiority: Multi-KG4Rec achieves the best performance across all metrics on both datasets. Compared to MKGAT (the strongest MKG-based baseline), Multi-KG4Rec improves NDCG@20 by 3.12% on MovieLens and 2.14% on Amazon-Books, with corresponding gains in Recall@20 and MRR@20. This strongly validates the effectiveness of its proposed unified architecture and fine-grained multimodal fusion module. The authors attribute this to Multi-KG4Rec's comprehensive perspective on user-item interactions, which is crucial for personalized recommendations.
  • KG-based vs. CF-based Methods: KG-based methods (CKE, KGAT, KGCN, MKGAT) generally outperform collaborative filtering-based methods (SpectralCF, ConvNCF). This underscores the value of Knowledge Graphs in providing auxiliary information and enabling relational reasoning, especially in sparse data environments. KGs help GNNs encode relationships between attribute entities, alleviating data sparsity and cold-start issues, and enhancing understanding of user-item relationships.
  • KGAT vs. KGCN: KGAT outperforms KGCN. The paper suggests that while KGCN aims for a broader receptive field, it might introduce more noise, whereas KGAT's collaborative KG approach, propagating features through GCN layers, is more effective.
  • CKE's Limitations: CKE performs the worst among KG-based methods. This is attributed to its lack of GNN-based high-order neighbor information aggregation and insufficient attention to interactions between different modalities, despite also dividing text and images into separate modes. This highlights the importance of GNN propagation and advanced fusion mechanisms, which Multi-KG4Rec addresses.

6.2. Modality Effectiveness Analyses

To understand the impact of different modalities, an analysis was conducted by comparing MKGAT and Multi-KG4Rec under various modality configurations on the MovieLens dataset.

The following are the results from Table 3 of the original paper:

Models         MKGAT                        Multi-KG4Rec
               Recall   MRR      NDCG       Recall   MRR      NDCG
w/o t&v        0.2453   0.3907   0.2251     0.2489   0.3941   0.2303
w/o v          0.2477   0.3949   0.2272     0.2518   0.4014   0.2327
Improv.        1.00%    1.07%    0.93%      1.16%    1.85%    1.04%
w/o t          0.2479   0.3951   0.2285     0.2531   0.4016   0.2340
Improv.        1.06%    1.13%    1.51%      1.69%    1.90%    1.61%
Multi-KG4Rec   0.2488   0.3963   0.2311     0.2542   0.4033   0.2371
Improv.        1.42%    1.43%    2.67%      2.13%    2.33%    2.95%

Note: The "w/o t&v" values in the Multi-KG4Rec column are identical to the KGAT row of Table 2, indicating that Multi-KG4Rec without any multimodal features reduces to a KG-only model comparable to KGAT; this row therefore serves as the base performance without fusion. The last data row reports the full model (both modalities) for each column, and each "Improv." row gives the improvement of the preceding configuration over the "w/o t&v" baseline within the same column.

Key observations:

  • Multimodal Benefits: Models incorporating multimodal features (visual and text) consistently achieve superior performance compared to those relying on a single modality or no multimodal information (w/o t&v). This confirms that rich, diverse item characteristics from multiple perspectives enhance the model's understanding of user intent.
  • Visual Modality's Dominance: Under single-modal conditions, the visual-only variant ("w/o t", i.e., KG structure plus visual features but no text) is generally more effective than the text-only variant ("w/o v", i.e., KG structure plus text features but no visuals). This aligns with findings in other multimodal models, suggesting that images often convey more information, or carry greater weight in user decision-making, than text content in certain domains.
  • Multi-KG4Rec's Expressive Power: Multi-KG4Rec achieves stronger performance than MKGAT across all comparable settings (e.g., full multimodal, w/o v, w/o t). This suggests that Multi-KG4Rec has superior expressive power to perceive implicit relationships between images and texts, primarily due to its sophisticated multimodal fusion module that extracts cross-modal information at a fine-grained level.

6.3. Ablation Study

An ablation study was conducted to further analyze the effectiveness of the Bi-Transformer (bi-directional attention mechanism) within the multimodal fusion module. This involved comparing the full Multi-KG4Rec model with variants where only unidirectional attention was activated.

The following are the results from Table 4 of the original paper:

Dataset        MovieLens                    Amazon-Books
               Recall   MRR      NDCG       Recall   MRR      NDCG
w/o t&v        0.2453   0.3907   0.2251     0.1473   0.0566   0.0716
Bi-Trans12v    0.2437   0.3917   0.2244     0.1428   0.0514   0.0674
Bi-Transu2t    0.2444   0.3944   0.2227     0.1436   0.0521   0.0662
Multi-KG4Rec   0.2552   0.4077   0.2383     0.1498   0.0572   0.0727

Note: "w/o t&v" denotes the variant with the multimodal fusion module disabled. The variant labels are rendered ambiguously in the source ("Bi-Trans12v" and "Bi-Transu2t"); based on the authors' description ("text-to-image attention activated" and "image-to-text attention activated"), they correspond to unidirectional cross-attention variants in which only text-to-visual attention (text queries attend to visual keys/values) or only visual-to-text attention (visual queries attend to text keys/values) is active, respectively.

Key findings from the ablation study:

  • Impact of Multimodal Fusion: The results show that disabling the multimodal fusion module (w/o t&v) leads to a significant performance drop compared to the full Multi-KG4Rec model across all metrics and datasets. This reinforces the previous conclusion about the critical role of multimodal information integration.
  • Importance of Bi-directional Attention: Both unidirectional variants (Bi-Trans12v and Bi-Transu2t) perform worse than the full Multi-KG4Rec model. This indicates that a bi-directional cross-attention mechanism is crucial. Unidirectional transformers consider correlations only from one side, potentially losing vital information from the other modality. For instance, Bi-Trans12v (text-to-visual) might capture how text features relate to visual features but miss how visual features might inform text understanding, and vice versa for Bi-Transu2t.
  • Robustness and Enhanced Interaction: The bi-directional transformer (the full Multi-KG4Rec model) offers several advantages: it first extracts salient features from each modality independently, and the cross-modality attention module then dynamically adjusts the weights of those features. This mechanism strengthens the interaction between modalities, improving overall performance and robustness to noisy or irrelevant information within a single modality.

6.4. Case Study

A case study was conducted to visually validate the significance of modalities in influencing user preferences. Two users (user u3238u_{3238} from MovieLens and user u927u_{927} from Amazon-Books) were selected, and 10 items they interacted with were gathered. The attention mechanism was used to compute correlation scores between user-item pairs, where higher scores indicate a greater impact of that item's modality on user preferences.
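
As a rough illustration of how such per-modality scores can be obtained, the sketch below softmax-normalizes dot products between a user embedding and each modality embedding of an item. The scoring function and all names here are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def modality_attention_scores(user_emb, visual_emb, text_emb):
    """Toy per-modality attention for a user-item pair.

    Scores are user-modality dot products, softmax-normalized so they can be
    read as relative weights (as visualized in Fig. 4).
    """
    scores = torch.stack([user_emb @ visual_emb, user_emb @ text_emb])
    return F.softmax(scores, dim=0)  # [weight_visual, weight_text]

user = torch.randn(64)
weights = modality_attention_scores(user, torch.randn(64), torch.randn(64))
print({"visual": weights[0].item(), "text": weights[1].item()})
```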

(Figure description: a multimodal preference diagram showing, for users $u_{3238}$ and $u_{927}$, entities and preference weights under the visual and text modalities; some visual nodes correspond to movie posters and the text nodes carry keyword tags, illustrating the fine-grained interaction of multimodal information in capturing user preferences.)

Fig. 4. Attention distribution for u3238u_{3238} and u927u_{927}.

Key insights from the case study (Figure 4):

  • Personalized Modal Preferences: The visualization clearly shows that different users exhibit varying preferences for visual and text modalities. User u3238u_{3238} from the MovieLens dataset shows a significantly higher attention score for the visual modality compared to the text modality. In contrast, user u927u_{927} from the Amazon-Books dataset exhibits the opposite trend, with higher attention to text.
  • Rationale for Fine-Grained Fusion: This observation validates the rationale and necessity of discussing and implementing modal fusion at a fine-grained level. User u3238u_{3238} might prioritize movie posters when choosing a movie, while user u927u_{927} might focus more on book descriptions or reviews.
  • Qualitative Analysis of Preferences: Further qualitative analysis revealed specific preferences: u3238u_{3238} tends to prefer posters with "scary elements," while u927u_{927} tends towards "romantic-themed books." This demonstrates that the Multi-KG4Rec model can not only identify which modality a user prefers but also what specific characteristics within that modality appeal to them, showcasing the effectiveness of the fine-grained fusion.

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully proposed Multi-KG4Rec, a novel personalized recommendation framework that leverages a Knowledge Graph and a sophisticated multimodal fusion approach. The framework's core innovation lies in its ability to effectively learn potential relationships and interactions between textual and visual modalities at a fine-grained level using a Bi-Transformer module. This is complemented by a GNN layer that propagates high-order information throughout the KG. Extensive experiments on two real-world datasets, MovieLens and Amazon-Books, conclusively demonstrated the efficiency and effectiveness of Multi-KG4Rec, showing superior performance over various strong baselines. The research highlighted the importance of multimodal information and the necessity of fine-grained cross-modal fusion for capturing diverse user preferences.
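
For intuition about the knowledge-aware propagation summarized above, the sketch below shows one simplified propagation step over an entity's one-hop KG neighbors. The relation-aware attention scoring is a common KGAT-style choice assumed here for illustration; it is not the paper's exact GNN layer, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def propagate(entity_emb, neighbor_emb, relation_emb, W):
    """One illustrative knowledge-aware propagation step for a single entity."""
    # Score how relevant each (relation, neighbor) pair is to the entity.
    scores = (neighbor_emb * relation_emb) @ entity_emb          # (num_neighbors,)
    alpha = F.softmax(scores, dim=0).unsqueeze(1)                # (num_neighbors, 1)
    message = (alpha * neighbor_emb).sum(dim=0)                  # aggregated neighbor info
    return F.leaky_relu(W @ (entity_emb + message))              # updated entity embedding

dim = 64
W = torch.randn(dim, dim) * 0.1
entity = torch.randn(dim)
neighbors = torch.randn(5, dim)   # 5 one-hop neighbors in the KG
relations = torch.randn(5, dim)   # their connecting relations
updated = propagate(entity, neighbors, relations, W)
```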

7.2. Limitations & Future Work

The authors suggest a specific direction for future work:

  • Additional Modalities: The paper suggests that web pages can serve as an additional, valuable modality to offer more contextual information for items. They note a relative scarcity of research in this area.

  • Future Research Goal: The authors aim to collect such web page datasets and design models to validate their hypothesis regarding the utility of web page content for recommender systems. This implies a current limitation is the exclusion of other potentially rich modalities.

Implicit limitations, though not explicitly framed as such by the authors, can also be inferred:

  • Computational Complexity: Fine-grained attention mechanisms and Transformers, especially with large sequence lengths (neighbors), can be computationally intensive, potentially affecting real-time performance for very large-scale systems. The paper notes this as a general concern for fine-grained attention but doesn't specifically address it for Multi-KG4Rec itself.

  • Scalability of KG construction: Building and maintaining MKGs for vast item catalogs, particularly for retrieving and aligning multimodal content (like IMDb or Freebase links), can be complex and resource-intensive.

  • Generalizability of CLIP: While CLIP is powerful, its effectiveness relies on its pre-training data. Its performance might vary for highly specialized or niche domains not well-represented in its training.

7.3. Personal Insights & Critique

This paper presents a strong contribution to the field of multimodal recommender systems by addressing critical architectural and fusion shortcomings.

  • Strengths and Innovations:

    • Unified Architecture: The idea of dividing the MKG into single-modal graphs before fusion is an elegant way to handle multimodal data without falling into the pitfalls of entity sparsity or over-simplification.
    • Sophisticated Fusion: The use of a pre-trained vision-language model (CLIP) for initial features, combined with a Bi-Transformer cross-modal multi-head attention module, represents a state-of-the-art approach to multimodal fusion that moves beyond simple concatenation. This genuinely enables the "fine-grained" preference extraction that many previous works claimed but struggled to achieve.
    • Clear Validation: The experimental setup across two diverse datasets, comprehensive baseline comparisons, and detailed ablation studies provide robust evidence for the model's effectiveness. The case study is particularly insightful, offering qualitative validation of personalized modal preferences.
  • Potential Areas for Improvement/Critique:

    • Defining "Single-Modal Graphs": While the concept of "dividing the multimodal graph into several single modal graphs" is mentioned, the exact implementation details of how these single-modal graphs are structured and how they interact before the multimodal fusion module could be further elaborated. Does this imply separate GNN layers for each modality before fusion, or is it purely a conceptual distinction in how features are handled?
    • Cold-Start Scenarios: While KGs generally help with cold-start, the paper mentions that for "cold-start nodes, more high-level information will be introduced to enhance their representations" when constructing SiS_i. A dedicated analysis or experiment on cold-start performance would further highlight this benefit.
    • Computational Cost of Bi-Transformer: Although noted generally for fine-grained attention, a more specific discussion or analysis of the computational overhead introduced by the Bi-Transformer within Multi-KG4Rec and potential strategies for optimization (e.g., knowledge distillation, pruning) for large-scale deployment would be valuable.
    • Interpretability of Bi-Transformer: While the case study provides a visual interpretation of attention, a deeper dive into which specific visual features or text phrases influence decisions could offer even greater interpretability, especially for the nuanced interactions within the Bi-Transformer.
  • Transferability and Applicability: The methods proposed in Multi-KG4Rec are highly transferable.

    • The multimodal fusion module could be adapted for any domain where items have rich visual and textual descriptions (e.g., fashion, real estate, travel, news recommendations).
    • The knowledge-aware propagation could be applied to other graph-structured data fusion tasks beyond recommendation.
    • The framework for leveraging pre-trained multimodal models with GNNs could inspire similar architectures in other domains, such as multimodal question answering or knowledge base completion with rich media. The authors' suggestion of integrating web pages as another modality further reinforces the adaptability of this framework to incorporate diverse information sources.
