Multimodal fusion framework based on knowledge graph for personalized recommendation
TL;DR Summary
This work proposes Multi-KG4Rec, a multimodal fusion framework leveraging fine-grained modal interactions in knowledge graphs to enhance personalized recommendations, demonstrating superior efficiency on real-world datasets.
Abstract
Expert Systems With Applications 268 (2025) 126308, available online 1 January 2025.
Authors: Jingjing Wang (a), Haoran Xie (b,*), Siyu Zhang (a), S. Joe Qin (b), Xiaohui Tao (c), Fu Lee Wang (d), Xiaoliang Xu (a).
Affiliations: (a) Hangzhou Dianzi University, Hangzhou, Zhejiang, China; (b) Lingnan University, Tuen Mun, Hong Kong SAR; (c) University of Southern Queensland, Springfield, Queensland, Australia; (d) Hong Kong Metropolitan University, Ho Man Tin, Kowloon, Hong Kong SAR.
Keywords: Knowledge graphs; Multimodal fusion framework; Recommender system.
Abstract (as extracted, truncated): "Knowledge Graphs (KGs), which contain a wealth of knowledge …"
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Multimodal fusion framework based on knowledge graph for personalized recommendation
1.2. Authors
Jingjing Wang, Haoran Xie, Siyu Zhang, S. Joe Qin, Xiaohui Tao, Fu Lee Wang, Xiaoliang Xu
1.3. Journal/Conference
According to the journal front matter extracted with the paper, it was published in Expert Systems With Applications (Elsevier), volume 268, article 126308. The author affiliations (Hangzhou Dianzi University, Lingnan University, University of Southern Queensland, Hong Kong Metropolitan University) and the CRediT authorship contribution statement are consistent with a peer-reviewed journal article.
1.4. Publication Year
According to the journal front matter, the paper was published in 2025 (Expert Systems With Applications 268 (2025) 126308, available online 1 January 2025).
1.5. Abstract
The paper addresses limitations in existing Multimodal Knowledge Graph (MKG)-based recommendation systems, which primarily use multimodal information as auxiliary data for reasoning relationships between entities, often overlooking the direct interactions between modalities. To overcome this, the authors propose Multi-KG4Rec, a multimodal fusion framework based on Knowledge Graphs (KGs) for personalized recommendation. The framework systematically analyzes shortcomings in current multimodal graph construction. It introduces a modal fusion module to extract user modal preferences at a fine-grained level. Extensive experiments conducted on two real-world datasets (MovieLens and Amazon-Books) from different domains demonstrate the efficiency and effectiveness of Multi-KG4Rec.
1.6. Original Source Link
/files/papers/690dd0087a8fb0eb524e6845/paper.pdf — a local link to the PDF of the paper as provided with this analysis.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve revolves around enhancing the representational quality and personalization capabilities of recommender systems by more effectively integrating rich multimodal information.
- Core Problem: Traditional Knowledge Graph (KG)-based recommender systems effectively use KGs as a knowledge-driven tool for high-quality representations, but they often represent attribute information as pure symbols, limiting their ability to understand real-world scenarios rich in images and text. While Multimodal Knowledge Graphs (MKGs) have been proposed to address this by incorporating text and visual content, existing MKG-based methods suffer from two significant limitations:
  - Lack of a Unified MKG Architecture: Existing MKG methods are typically categorized into feature-based and entity-based approaches. Feature-based methods treat multimodal information as auxiliary data for entities, enriching representations but often overlooking interactions between different modalities; they also impose strict constraints on MKG completeness. Entity-based methods consider multimodal information as supplementary nodes, but these are often limited to attribute entities and struggle with sparsity for item entities (e.g., few items share identical posters or text).
  - Ineffective Multimodal Fusion: Current fusion techniques, such as concatenation or weighted sums, struggle to leverage multimodal information effectively, especially for capturing subtle correlations within or across modalities (e.g., a shared visual style across movies with very different textual descriptions). This makes it challenging to extract fine-grained personalized multimodal preferences.
- Why this problem is important: Personalization in recommender systems is crucial for user satisfaction and engagement. Real-world items are inherently multimodal (e.g., movies have posters, descriptions, genres). Ignoring or inadequately fusing this rich information leads to less accurate recommendations and a poorer understanding of user preferences. Addressing these architectural and fusion limitations can significantly improve the model's ability to understand the real world and provide more precise, personalized recommendations.
- Paper's Entry Point/Innovative Idea: The paper proposes a unified Multimodal fusion framework based on Knowledge Graph for personalized Recommendation (Multi-KG4Rec) that tackles these architectural and fusion challenges. It introduces a novel MKG construction that divides the multimodal graph into several single-modal graphs, representing each entity by its modality feature, which avoids node sparsity and allows coarse-grained preference extraction. Furthermore, it employs a fine-grained modal fusion module that uses pre-trained models (such as CLIP) for initial feature generation, a graph neural network to align features with the graph structure, and a cross multi-head attention module between text and visual transformers for deep multimodal interaction.
2.2. Main Contributions / Findings
The paper makes several key contributions:
- Unified Multimodal Architecture: Multi-KG4Rec proposes a unified multimodal graph architecture that overcomes the strict limitations of feature-based methods by dividing the multimodal KG into several single-modal graphs. This also addresses the node sparsity inherent in entity-based methods by representing each entity with its modality feature, enabling coarse-grained user preference extraction without being constrained by explicit entity connections.
- Leveraging Pre-trained Models and Fine-Grained Multimodal Fusion: The framework employs a pre-trained multimodal model (specifically, CLIP) to generate initial multimodal features, which are then integrated with graph-structured information using a Graph Neural Network (GNN). Crucially, it introduces a multimodal fusion module that uses a cross multi-head attention module between text and visual transformers. This module extracts users' personalized multimodal preferences at a fine-grained level, capturing subtle interactions between modalities that previous methods overlooked.
- Extensive Experimental Validation: The effectiveness of Multi-KG4Rec is demonstrated through extensive experiments on two real-world datasets from different domains, MovieLens and Amazon-Books. The results show that Multi-KG4Rec consistently outperforms strong baselines, including collaborative filtering, KG-based, and MKG-based methods, across standard evaluation metrics (Recall@k, MRR@k, and NDCG@k).
- Empirical Insights on Modality Effectiveness: The analysis reveals that incorporating multimodal features significantly boosts performance, with the visual modality often being more influential than text for user decisions. A case study further shows that different users exhibit distinct fine-grained preferences for visual versus text modalities, validating the necessity of the proposed fine-grained fusion approach.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following core concepts:
- Recommender Systems (RSs): Systems designed to predict user preferences and suggest items (e.g., movies, products, articles) that users are most likely to enjoy. They address information overload by filtering relevant items.
- Knowledge Graphs (KGs): A KG is a structured representation of information that describes entities (real-world objects, events, concepts) and their semantic relationships. It typically consists of nodes (entities) and edges (relationships), forming triples of the form (head entity, relation, tail entity), or (h, r, t). KGs provide rich, explicit knowledge that can enhance recommender systems by linking items to their attributes and related concepts, offering more context than traditional collaborative filtering methods.
- Multimodal Knowledge Graphs (MKGs): An extension of KGs that integrates information from multiple modalities, such as text, images, and potentially audio or video. In MKGs, entities or their attributes can be enriched with features derived from these data types; for example, a movie entity might have a textual description, a poster image, and a genre. MKGs aim to provide a more comprehensive understanding of entities by combining symbolic knowledge with perceptual data.
- Graph Neural Networks (GNNs): A class of neural networks designed to operate directly on graph-structured data. GNNs learn node representations (embeddings) by aggregating information from neighbors and iteratively refining these representations, which makes them powerful for capturing relational dependencies and structural patterns in graphs.
  - Graph Attention Networks (GATs): A type of GNN that incorporates an attention mechanism. Instead of assigning equal weight to all neighbors, GATs learn different weights for different neighbors based on their features, allowing them to focus selectively on more important neighbors during aggregation.
- Transformers: A neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017), primarily known for its success in natural language processing. The core idea of a Transformer is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element, enabling it to capture long-range dependencies effectively.
  - Self-Attention: A mechanism that lets a model weigh the importance of the other elements of an input sequence when encoding a specific element. For each element, it computes a score indicating how much it should "attend" to every other element.
  - Multi-head Attention: An extension of self-attention in which the attention mechanism is applied multiple times in parallel, using different learned linear projections (query, key, value) for each "head". The head outputs are concatenated and linearly transformed, allowing the model to capture diverse types of relationships or focus on different aspects of the information.
  - Cross-Attention: A variant of the attention mechanism used to interact between two different sequences (e.g., text and image features). One sequence provides the query (e.g., text features) and the other provides the key and value (e.g., visual features), allowing the first sequence to attend to relevant parts of the second.
- Contrastive Learning: A machine learning paradigm in which the model learns by contrasting positive pairs (similar samples) with negative pairs (dissimilar samples). The goal is to pull representations of positive pairs closer together while pushing negative pairs apart in the embedding space.
- CLIP (Contrastive Language-Image Pre-training): A pre-trained multimodal model developed by OpenAI. CLIP learns to align images and text descriptions by training on a large dataset of image-text pairs with contrastive learning. It consists of an image encoder and a text encoder that project images and texts into a shared embedding space, allowing CLIP to understand visual concepts described in natural language and to produce high-quality, aligned multimodal features.
- TransR: A knowledge graph embedding model that learns representations of entities and relations. TransR projects entities into relation-specific spaces before performing translation, which lets it capture different aspects of entity relationships more effectively than simpler models such as TransE. The core idea is that $\mathbf{e}_h^r + \mathbf{e}_r \approx \mathbf{e}_t^r$ should hold in the relation space, where $\mathbf{e}_h^r$ and $\mathbf{e}_t^r$ are the projected head and tail embeddings.
- Bayesian Personalized Ranking (BPR) Loss: A widely used pairwise ranking loss for implicit-feedback recommendation. BPR aims to maximize the difference between the predicted scores of observed (positive) items and unobserved (negative) items for a given user, assuming that a user prefers an interacted item over a non-interacted one.
  - Formula: The BPR loss for a single triplet (u, i, j) (user $u$ prefers item $i$ over item $j$) is typically defined as $\mathcal{L}_{BPR} = -\ln \sigma(\hat{y}_{ui} - \hat{y}_{uj})$, where $\hat{y}_{ui}$ is the predicted score for user $u$ and item $i$, $\hat{y}_{uj}$ is the predicted score for user $u$ and item $j$, and $\sigma$ is the sigmoid function. The total loss sums over all training triplets.
3.2. Previous Works
The paper categorizes previous works into KG-based methods, MKG-based methods, and Multimodal fusion methods.
- KG-based methods: These methods construct a heterogeneous graph involving users, items, and item attributes, then propagate relationships over it to generate representations.
  - KGCN (Wang, Zhao et al., 2019): Combines GCN (Graph Convolutional Network) and KG methods to learn relationships between entities, using a fixed number of neighbors as the receptive field.
  - KGAT (Wang, He et al., 2019): A collaborative KG method that propagates features over the collaborative KG through GCN layers to encode high-order relationships between users and items. It integrates TransR and a GNN to generate entity representations.
  - Meta-path-based methods (Hu et al., 2018; Zhao et al., 2017): Rely on manually defined paths for feature engineering.
  - These methods reason over relationships but largely ignore the rich knowledge contained in text and visual information.
- MKG-based methods: These integrate multimodal entity nodes from text and visual modalities.
  - Feature-based methods: Treat multimodal information as auxiliary data for entities.
    - CKE (Zhang et al., 2016): Divides the MKG into a bipartite graph, textual content, and visual content. It uses TransR for structural representations and denoising autoencoders for multimodal content. It does not use a GNN to aggregate high-order neighbor information and pays insufficient attention to inter-modal interaction.
    - DKN (Wang et al., 2018): A CNN framework that integrates high-order relational reasoning with text semantics; it explores high-order relationships only under the textual modality.
    - CMCKG (Cao et al., 2022): Uses the original KG for structural representations and converts textual descriptions into new KG nodes, employing contrastive learning to enhance consistency between representations.
  - Entity-based methods: Consider multimodal information as newly added supplementary nodes.
    - MKGAT (Sun et al., 2020): Adds multimodal information as new nodes, but only attribute entities carry multimodal content. In the experimental comparison, this paper treats it as a representative feature-based method, since its multimodal integration is closer to augmenting entity features than to introducing distinct modal entities.
    - MMKGV (Liu, Li et al., 2022): Integrates multimodal information as relationship triplets within a knowledge graph.
- Multimodal fusion methods: Categorized into coarse-grained, fine-grained, and combined attention.
  - Coarse-grained attention: Focuses on modality correlation at a high level, typically using co-attention or cross-attention at the modality level. Examples include DUALGRAPH (Li, Feng et al., 2023), UVCAN (Liu, Chen et al., 2019), MCPTR (Liu, Ma et al., 2022), and CMBF (Chen et al., 2021).
  - Fine-grained attention: Focuses on detailed correlations, often using self-attention or candidate-aware attention over specific attributes or elements. Examples include POG (Chen et al., 2019), NOR (Lin et al., 2019), EFRM (Hou et al., 2019), and MMRec (Wu et al., 2021).
  - Combined attention: Balances fine-grained and coarse-grained attention. Examples include NOVA (Liu et al., 2021), NRPA (Liu, Wu et al., 2019), VLSNR (Han et al., 2022), and MARank (Yu et al., 2019). This paper positions its method in this category: pre-trained models provide fine-grained alignment, followed by cross multi-head attention.
3.3. Technological Evolution
The field has evolved from basic collaborative filtering to KG-based recommenders that leverage structured knowledge. The limitation of symbolic KGs led to the integration of multimodal information, giving rise to MKGs. Early MKG methods either treated multimodal data as auxiliary features (feature-based) or added them as new nodes (entity-based). However, these often struggled with complete MKG architectures, data sparsity, and effective fusion of interactions between modalities. The rise of powerful pre-trained models (like CLIP for multimodal alignment) and advanced attention mechanisms (like Transformers) opened new avenues for more sophisticated multimodal integration. This paper fits into this evolution by addressing the architectural shortcomings of previous MKGs and leveraging Transformers for fine-grained cross-modal fusion, aiming for a more unified and effective MKG framework.
3.4. Differentiation Analysis
Multi-KG4Rec differentiates itself from previous methods primarily in its approach to MKG architecture and multimodal fusion:
- Unified MKG Architecture:
  - Vs. Feature-based: Existing feature-based methods (e.g., CKE, CMCKG, and in this comparison MKGAT) treat multimodal information as auxiliary to the main entity, which can overlook direct interactions between modalities. Multi-KG4Rec mitigates this by conceptually dividing the multimodal KG into several single-modal graphs, with each entity within a modality (e.g., a visual entity or a text entity) represented by its modality-specific feature. This allows dedicated processing of each modality before fusion, capturing intra-modal correlations first and inter-modal interactions afterwards.
  - Vs. Entity-based: Existing entity-based methods (e.g., MKGAT) add multimodal information as new supplementary nodes, which suffer from severe sparsity, especially for item entities (few items share identical visual or text content). Multi-KG4Rec's single-modal graph approach, combined with pre-trained models, avoids this node sparsity by building rich feature representations for existing items rather than creating new, sparsely connected attribute nodes for every unique piece of multimodal content.
- Fine-Grained Multimodal Fusion:
  - Vs. Simple Fusion (Concatenation/Weighted Sums): Many prior methods use simple concatenation or weighted sums, which capture complex, subtle cross-modal interactions poorly. Multi-KG4Rec employs a cross multi-head attention module between a text transformer and a visual transformer, allowing text features to attend to visual features and vice versa, so the model can learn deep correlations and extract modality-specific personalized preferences.
  - Leveraging Pre-trained Models and GNNs: Unlike methods that rely on simpler encoders, Multi-KG4Rec uses pre-trained multimodal models such as CLIP to generate initial, high-quality, aligned multimodal features, and then integrates these features with graph-structured information through a GNN, which propagates information over the graph topology. This combination bridges rich pre-trained features with graph structure.

In essence, Multi-KG4Rec addresses the architectural fragmentation and superficial multimodal fusion of previous MKG methods by offering a more integrated, feature-rich, and interactively fused approach, driven by pre-trained multimodal models and attention mechanisms.
4. Methodology
4.1. Principles
The core idea behind Multi-KG4Rec is to construct a flexible Multimodal Knowledge Graph (MKG) architecture that effectively captures fine-grained interactions between different modalities (text and visual) for personalized recommendations. This is achieved by:
- Modular MKG Construction: Instead of rigid feature-based or entity-based MKG designs, Multi-KG4Rec divides the MKG into several single-modal graphs. This allows independent representation learning within each modality, capturing modality-specific features while avoiding problems such as node sparsity for item entities.
- Leveraging Pre-trained Multimodal Encoders: Powerful pre-trained multimodal models (such as CLIP) generate initial, aligned, and rich entity features from both text and visual content, so the multimodal information is well represented from the outset.
- Fine-Grained Cross-Modal Fusion: Transformer-based cross multi-head attention deeply fuses information between modalities. This mechanism allows dynamic interaction and attention between text and visual features, enabling the extraction of subtle, personalized multimodal preferences.
- Knowledge-Aware Propagation: A knowledge-aware Graph Neural Network (GNN) layer propagates the fused multimodal features across the KG structure, so high-order relational information is incorporated into the final user and item representations.
- Personalized Prediction: The rich, fused, and propagated user and item representations are combined to predict user interest in candidate items, optimized via a Bayesian Personalized Ranking (BPR) loss.
4.2. Core Methodology In-depth
The Multi-KG4Rec framework consists of an Embedding module, a Multimodal fusion module, an Information propagation module, and a Prediction component, all optimized through a unified Optimizer. The overall architecture is illustrated in Figure 2.

This figure is the Figure 2 schematic of the Multi-KG4Rec architecture, showing the full pipeline from single-modal inputs through cross-modal interaction and the information propagation module to the final prediction, reflecting the design of the image-text interaction module and the multimodal fusion.
Fig. 2. The architecture of the proposed Multi-KG4Rec.
4.2.1. Embedding Module
The embedding module is responsible for initializing and optimizing the representations of entities in the Knowledge Graph.
4.2.1.1. Entity Embedding
Given a triplet (h, r, t) in the KG $\mathcal{G}$, each entity initially has an ID that is embedded as a structural feature via a lookup table. For multimodal entities (items and their attribute nodes that possess visual and text content), the paper uses CLIP, a pre-trained multimodal visual-text model, to align image-text pairs and generate initial entity features.
Specifically, visual and textual descriptions corresponding to entities are fed into CLIP. The outputs from CLIP's two encoders (one for visual, one for text) are projected into a shared embedding space. The output of the last layer serves as the feature, with a dimensionality of 512. This process ensures that text and visual features are intrinsically aligned from the start.
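The paper does not include implementation code; the following is a minimal sketch of how 512-dimensional aligned features could be extracted with the Hugging Face `transformers` CLIP interface. The model name and file paths are illustrative assumptions, not the authors' exact setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: ViT-B/32 CLIP, whose shared projection space is 512-dimensional,
# matching the 512-d entity features described in the paper.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def encode_entity(image_path: str, description: str):
    """Return aligned (visual, text) 512-d features for one multimodal entity."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[description], images=[image],
                       return_tensors="pt", padding=True)
    visual = model.get_image_features(pixel_values=inputs["pixel_values"])   # (1, 512)
    text = model.get_text_features(input_ids=inputs["input_ids"],
                                   attention_mask=inputs["attention_mask"])  # (1, 512)
    return visual.squeeze(0), text.squeeze(0)

# Hypothetical usage for one movie entity (file name and synopsis are made up):
# e_v, e_t = encode_entity("poster.jpg", "A detective hunts a serial killer in 1990s Los Angeles.")
```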
4.2.1.2. Embedding Optimization
To further optimize these entity features and capture their structural relationships within the KG, TransR is adopted. TransR models relations as translations in relation-specific spaces.
Nodes and edges in $\mathcal{G}$ are converted into triplets (h, r, t). The optimization objective of TransR is that, in the relation space, the projected head entity plus the relation embedding should be approximately equal to the projected tail entity.

The embeddings of the head entity $h$, tail entity $t$, and relation $r$ are denoted as $\mathbf{e}_h$, $\mathbf{e}_t$, and $\mathbf{e}_r$, respectively; $\mathbf{e}_h^r$ and $\mathbf{e}_t^r$ denote the projected representations of $\mathbf{e}_h$ and $\mathbf{e}_t$ in the space of relation $r$.

For a given triplet (h, r, t), the objective score $g(h, r, t)$ is:

$$ g(h, r, t) = \left\| \mathbf{W}_r \mathbf{e}_h + \mathbf{e}_r - \mathbf{W}_r \mathbf{e}_t \right\|_2^2 $$

Here:

- $g(h, r, t)$: The score indicating the likelihood that the triplet (h, r, t) is true; a lower score implies a higher likelihood.
- $\mathbf{e}_h, \mathbf{e}_t$: The original embeddings of the head and tail entities, respectively, in the entity space.
- $\mathbf{e}_r$: The embedding of the relation $r$.
- $\mathbf{W}_r$: A transformation matrix that projects entities from the entity space (dimension $d$) to the relation space (dimension $d_r$).
- $\|\cdot\|_2^2$: The squared $L_2$ norm, measuring the Euclidean distance.
The training of TransR distinguishes between positive triplets (existing in $\mathcal{G}$) and negative triplets (not existing in $\mathcal{G}$) using a pairwise ranking loss. For a positive triplet (h, r, t) and a sampled negative triplet (h, r, t'), the loss is defined as:

$$ \mathcal{L}_{KG} = \sum_{(h, r, t, t') \in \mathcal{T}} -\ln \sigma\big( g(h, r, t') - g(h, r, t) \big) $$

Here:

- $\mathcal{L}_{KG}$: The knowledge graph embedding loss.
- $\mathcal{T}$: The set of training samples, each containing a positive triplet (h, r, t) and a corresponding negative triplet (h, r, t').
- $g(h, r, t')$: The score of the negative triplet.
- $g(h, r, t)$: The score of the positive triplet.
- $\sigma$: The sigmoid function, which squashes its input into the range (0, 1). This loss encourages the score of positive triplets to be lower than that of negative triplets, i.e., $g(h, r, t) < g(h, r, t')$, which drives $\sigma\big(g(h, r, t') - g(h, r, t)\big)$ toward 1 and the loss toward 0. A PyTorch sketch of this embedding objective follows below.
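Below is a minimal PyTorch sketch of the TransR-style score and pairwise ranking loss described above. Tensor shapes and the batched projection setup are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def transr_score(e_h, e_r, e_t, W_r):
    """g(h, r, t) = || W_r e_h + e_r - W_r e_t ||_2^2 (lower means more plausible)."""
    # e_h, e_t: (batch, d); e_r: (batch, d_r); W_r: (batch, d_r, d)
    h_r = torch.bmm(W_r, e_h.unsqueeze(-1)).squeeze(-1)   # project head into the relation space
    t_r = torch.bmm(W_r, e_t.unsqueeze(-1)).squeeze(-1)   # project tail into the relation space
    return ((h_r + e_r - t_r) ** 2).sum(dim=-1)

def kg_embedding_loss(e_h, e_r, e_t_pos, e_t_neg, W_r):
    """Pairwise loss: push positive-triplet scores below negative-triplet scores."""
    g_pos = transr_score(e_h, e_r, e_t_pos, W_r)
    g_neg = transr_score(e_h, e_r, e_t_neg, W_r)
    # -ln sigma(g(h,r,t') - g(h,r,t)) == softplus(-(g_neg - g_pos)), averaged over the batch
    return F.softplus(-(g_neg - g_pos)).mean()
```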
4.2.2. Multimodal Fusion Module
This module is designed to fuse modality information at a fine-grained level, capturing complex interactions between text and visual features. It comprises a text transformer, a visual transformer, and a multi-head attention layer for cross-modal interaction. The transformers aim to extract dependencies within a single modality, while the multi-head attention layer performs the actual multimodal fusion.
First, for an item $i$, its high-order neighbors are collected to form an input sequence $\mathcal{N}_i$. The paper uses a breadth-first search (BFS) to gather neighbors up to a fixed count $n$, sorted by distance. This sequence includes the item itself, its user neighbors, and its attribute-entity neighbors (e.g., the item, a 1st-order user neighbor, a 1st-order attribute neighbor, a 2nd-order attribute neighbor, and so on).

Given the neighbor set $\mathcal{N}_i$, the visual and text features of these neighbors are denoted as $X_v \in \mathbb{R}^{n \times d}$ and $X_t \in \mathbb{R}^{n \times d}$, respectively, where $n$ is the sequence length (number of neighbors) and $d$ is the embedding dimension. A BFS sketch for building this neighbor sequence is given below.
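As a rough illustration of the neighbor-sequence construction, here is a hedged sketch of a distance-ordered BFS over a simple adjacency dictionary. The data structure and the `max_neighbors` cap are assumptions, not the authors' exact procedure.

```python
from collections import deque

def bfs_neighbor_sequence(graph: dict, item, max_neighbors: int):
    """Collect up to `max_neighbors` entities around `item`, ordered by hop distance.

    `graph` maps an entity id to a list of neighboring entity ids
    (users, attribute entities, ...); the item itself is placed first.
    """
    sequence = [item]
    visited = {item}
    queue = deque([item])
    while queue and len(sequence) < max_neighbors:
        node = queue.popleft()
        for nbr in graph.get(node, []):
            if nbr in visited:
                continue
            visited.add(nbr)
            sequence.append(nbr)        # appended in BFS order, i.e., sorted by distance
            queue.append(nbr)
            if len(sequence) >= max_neighbors:
                break
    return sequence

# Hypothetical usage:
# seq = bfs_neighbor_sequence({"i1": ["u7", "a3"], "a3": ["a9"]}, "i1", max_neighbors=4)
```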
The multi-head attention (MHA) used for cross-modal fusion, shown in Figure 3(a), first transforms these features into queries, keys, and values. Taking the $m$-th visual head as an example:

$$ Q_v^m = X_v W_Q^m, \qquad K_v^m = X_v W_K^m, \qquad V_v^m = X_v W_V^m $$

Here:

- $Q_v^m, K_v^m, V_v^m$: The query, key, and value matrices for the $m$-th head of the visual modality.
- $X_v$: The input visual features of the neighbors in $\mathcal{N}_i$.
- $W_Q^m, W_K^m, W_V^m \in \mathbb{R}^{d \times d_h}$: Learnable weight matrices that project the input features into queries, keys, and values for the $m$-th head.
- $d_h = d / H$: The dimension of each head, where $H$ is the total number of heads.

The text modality features undergo analogous transformations to generate $Q_t^m$, $K_t^m$, and $V_t^m$.
For cross-modal attention, the keys and values from the visual and text modalities are concatenated. For example, the output of the $m$-th visual head, denoted $\mathrm{head}_v^m$, is:

$$ \mathrm{head}_v^m = \mathrm{Attention}\big( Q_v^m, \; [K_v^m; K_t^m], \; [V_v^m; V_t^m] \big) $$

Here:

- $\mathrm{head}_v^m$: The output of the $m$-th attention head for the visual modality, which has attended to both visual and text key-value pairs.
- $\mathrm{Attention}(Q, K, V)$: The scaled dot-product attention function, typically defined as $\mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_h}} \right) V$.
- $Q_v^m$: The query from the visual modality.
- $[K_v^m; K_t^m]$: The concatenation of keys from the visual and text modalities.
- $[V_v^m; V_t^m]$: The concatenation of values from the visual and text modalities. This operation lets the visual query attend to relevant information in both modalities. A symmetric computation is performed for the text modality, using $Q_t^m$ as the query together with the same concatenated keys and values.
After computing all heads for a modality, their outputs are concatenated to form the final multi-head attention output for that modality:

$$ \mathrm{MHA}_v = \big[ \mathrm{head}_v^1; \ldots; \mathrm{head}_v^H \big] W_O $$

Here:

- $\mathrm{MHA}_v$: The final output of the multi-head attention layer for the visual modality.
- $[\,\cdot\,;\,\cdot\,]$: The concatenation operation.
- $W_O \in \mathbb{R}^{d \times d}$: A learnable linear projection that maps the concatenated head outputs back to the original embedding dimension $d$.
Following the multi-head attention layer, a Feedforward Neural Network (FFN) is applied. The FFN consists of two linear layers with a ReLU activation and operates on the output of the attention mechanism (after layer normalization and residual connections). For the visual modality, the FFN is computed as:

$$ \mathrm{FFN}(Z_v) = \mathrm{ReLU}(Z_v W_1 + b_1) \, W_2 + b_2 $$

Here:

- $\mathrm{FFN}(Z_v)$: The output of the feedforward network for the visual modality.
- $Z_v$: The input to the FFN, i.e., the output of the multi-head attention layer for the visual modality (with residual connection and layer normalization).
- $\mathrm{ReLU}$: The Rectified Linear Unit activation, $\mathrm{ReLU}(x) = \max(0, x)$.
- $W_1 \in \mathbb{R}^{d \times d_f}$, $W_2 \in \mathbb{R}^{d_f \times d}$: Learnable weight matrices of the two linear layers.
- $b_1, b_2$: Learnable bias vectors.
- $d_f$: The hidden dimension of the FFN.

A sketch of this cross-modal attention block is given below.
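The following is a minimal, self-contained sketch of one cross-modal block as described above: each modality's queries attend over the concatenated visual+text keys and values, followed by residual connections, layer normalization, and an FFN. The hyperparameters and exact module structure are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One fusion block: each modality queries the concatenation of both modalities."""

    def __init__(self, d: int = 64, heads: int = 8, d_ff: int = 256):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(d, heads, batch_first=True)
        self.attn_t = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_v1, self.norm_v2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.norm_t1, self.norm_t2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn_v = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.ffn_t = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def forward(self, x_v, x_t):
        # x_v, x_t: (batch, n_neighbors, d) visual / text features of the neighbor sequence
        kv = torch.cat([x_v, x_t], dim=1)            # concatenated keys/values of both modalities
        z_v, _ = self.attn_v(x_v, kv, kv)            # visual queries attend to both modalities
        z_t, _ = self.attn_t(x_t, kv, kv)            # text queries attend to both modalities
        z_v = self.norm_v1(x_v + z_v)                # residual + layer norm
        z_t = self.norm_t1(x_t + z_t)
        z_v = self.norm_v2(z_v + self.ffn_v(z_v))    # position-wise feedforward
        z_t = self.norm_t2(z_t + self.ffn_t(z_t))
        return z_v, z_t

# Hypothetical usage with 3 stacked blocks and 8 heads, as in the paper's settings:
# blocks = nn.ModuleList(CrossModalBlock() for _ in range(3))
```

Concatenating the key/value sequences before standard multi-head attention is equivalent to concatenating the per-head keys and values in the formula above, which is why the sketch can reuse `nn.MultiheadAttention`.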
4.2.3. Information Propagation Module
After obtaining the multimodal information, a knowledge-aware graph attention layer is applied to propagate this information to higher-order neighbors, as shown in Figure 3(b).

This figure is the Figure 3 schematic, showing the structure and workflow of the cross-modal attention module and the Bi-Interaction aggregator, which realize inter-modal interaction and high-order information propagation.
Fig. 3. Illustration about modal interaction and high-order information propagation.
For a given entity $h$, $\mathcal{N}_h = \{ (h, r, t) \mid (h, r, t) \in \mathcal{G} \}$ denotes the set of triples in which $h$ is the head entity. The neighbor information is aggregated as:

$$ \mathbf{e}_{\mathcal{N}_h} = \sum_{(h, r, t) \in \mathcal{N}_h} \pi(h, r, t) \, \mathbf{e}_t $$

Here:

- $\mathbf{e}_{\mathcal{N}_h}$: The aggregated embedding representing the neighborhood of entity $h$.
- $\mathbf{e}_t$: The embedding of the tail entity $t$ from a triplet (h, r, t).
- $\pi(h, r, t)$: An attention coefficient that controls how much information flows from entity $t$ to entity $h$ through relation $r$.

The attention coefficient is defined as:

$$ \pi(h, r, t) = \big( \mathbf{W}_r \mathbf{e}_t \big)^{\top} \tanh\big( \mathbf{W}_r \mathbf{e}_h + \mathbf{e}_r \big) $$

Here:

- $\pi(h, r, t)$: The raw attention score between head $h$ and tail $t$ through relation $r$.
- $\mathbf{W}_r$: A trainable weight matrix that transforms entity embeddings into the space of relation $r$.
- $\mathbf{W}_r \mathbf{e}_t$: The transformed embedding of the tail entity.
- $\tanh$: The hyperbolic tangent activation function.
- $\mathbf{W}_r \mathbf{e}_h + \mathbf{e}_r$: The transformed head entity embedding combined with the relation embedding.
- $(\cdot)^{\top}$: The transpose operation. The coefficients over all triplets connected to $h$ are then normalized with a softmax (not written out explicitly, but implied by standard graph attention mechanisms).
Finally, the original embedding of $h$ ($\mathbf{e}_h$) and the aggregated neighborhood embedding ($\mathbf{e}_{\mathcal{N}_h}$) are combined using a Bi-Interaction mechanism:

$$ \mathbf{e}_h' = \mathrm{LeakyReLU}\big( \mathbf{W}_1 (\mathbf{e}_h + \mathbf{e}_{\mathcal{N}_h}) \big) + \mathrm{LeakyReLU}\big( \mathbf{W}_2 (\mathbf{e}_h \odot \mathbf{e}_{\mathcal{N}_h}) \big) $$

Here:

- $\mathbf{e}_h'$: The combined representation of entity $h$ after considering its own embedding and its aggregated neighborhood.
- $\mathrm{LeakyReLU}$: The Leaky Rectified Linear Unit activation, $\mathrm{LeakyReLU}(x) = \max(0, x) + \alpha \min(0, x)$, where $\alpha$ is a small positive slope.
- $\mathbf{W}_1, \mathbf{W}_2$: Learnable weight matrices.
- $\mathbf{e}_h + \mathbf{e}_{\mathcal{N}_h}$: The element-wise sum, capturing linear interactions.
- $\mathbf{e}_h \odot \mathbf{e}_{\mathcal{N}_h}$: The element-wise product, capturing multiplicative (non-linear) interactions.

The Bi-Interaction component captures both additive and multiplicative relationships between the entity's own features and its neighborhood context. This process is repeated for $L$ layers to capture higher-order connectivity; a sketch of one propagation layer is given below.
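Here is a minimal sketch of one knowledge-aware propagation layer (attention over outgoing triples followed by Bi-Interaction). It assumes a flat tensor of triples and relation-specific projection matrices; it illustrates the equations above rather than reproducing the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAwareLayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.W1 = nn.Linear(d, d, bias=False)   # weight for the element-wise sum branch
        self.W2 = nn.Linear(d, d, bias=False)   # weight for the element-wise product branch

    def forward(self, entity_emb, rel_emb, W_r, heads, rels, tails):
        # entity_emb: (n_entities, d); rel_emb: (n_relations, d_r); W_r: (n_relations, d_r, d)
        # heads, rels, tails: (n_triples,) index tensors describing the KG triples.
        e_h, e_t = entity_emb[heads], entity_emb[tails]
        Wr = W_r[rels]                                          # (n_triples, d_r, d)
        h_r = torch.bmm(Wr, e_h.unsqueeze(-1)).squeeze(-1)
        t_r = torch.bmm(Wr, e_t.unsqueeze(-1)).squeeze(-1)
        pi = (t_r * torch.tanh(h_r + rel_emb[rels])).sum(-1)    # raw attention pi(h, r, t)
        # softmax over the triples sharing the same head entity
        pi = torch.exp(pi - pi.max())
        denom = torch.zeros(entity_emb.size(0), device=pi.device).index_add_(0, heads, pi)
        alpha = pi / (denom[heads] + 1e-10)
        # weighted sum of tail embeddings -> neighborhood representation e_N(h)
        e_nbr = torch.zeros_like(entity_emb).index_add_(0, heads, alpha.unsqueeze(-1) * e_t)
        # Bi-Interaction aggregator: additive and multiplicative branches
        return F.leaky_relu(self.W1(entity_emb + e_nbr)) + F.leaky_relu(self.W2(entity_emb * e_nbr))
```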
4.2.4. Prediction
After the information propagation module, we obtain user and item representations from each layer of the GNN (e.g., $\mathbf{e}_u^{(l)}$ and $\mathbf{e}_i^{(l)}$ for layer $l$). A layer-aggregation mechanism (Xu et al., 2018) concatenates these representations into unified vectors:

$$ \mathbf{e}_u^{*} = \mathbf{e}_u^{(0)} \,\Vert\, \cdots \,\Vert\, \mathbf{e}_u^{(L)}, \qquad \mathbf{e}_i^{*} = \mathbf{e}_i^{(0)} \,\Vert\, \cdots \,\Vert\, \mathbf{e}_i^{(L)} $$

Here:

- $\mathbf{e}_u^{*}, \mathbf{e}_i^{*}$: The aggregated representations of user $u$ and item $i$, respectively.
- $\mathbf{e}_u^{(0)}, \ldots, \mathbf{e}_u^{(L)}$: User representations from the initial layer (0) up to the $L$-th layer.
- $\mathbf{e}_i^{(0)}, \ldots, \mathbf{e}_i^{(L)}$: Item representations from the initial layer (0) up to the $L$-th layer.
- $\Vert$: The concatenation operation.

Then, the user and item representations from the visual modality ($\mathbf{e}_u^{v}, \mathbf{e}_i^{v}$) are concatenated with those from the text modality ($\mathbf{e}_u^{t}, \mathbf{e}_i^{t}$) to obtain the final user and item representations:

$$ \mathbf{e}_u = \mathbf{e}_u^{v} \,\Vert\, \mathbf{e}_u^{t}, \qquad \mathbf{e}_i = \mathbf{e}_i^{v} \,\Vert\, \mathbf{e}_i^{t} $$

Here:

- $\mathbf{e}_u, \mathbf{e}_i$: The final, comprehensive embeddings of user $u$ and item $i$, incorporating both visual and textual multimodal information. The predicted score for user $u$ and item $i$ is then computed, typically as the inner product of their final embeddings: $\hat{y}_{ui} = \mathbf{e}_u^{\top} \mathbf{e}_i$. A short sketch of this prediction step follows.
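A short sketch of the prediction step under the notation above (layer concatenation, modality concatenation, inner-product score); the shapes and function boundaries are assumptions.

```python
import torch

def predict_scores(user_layers_v, item_layers_v, user_layers_t, item_layers_t):
    """user_layers_* / item_layers_*: lists of (n_users, d_l) / (n_items, d_l) tensors,
    one per GNN layer (layer 0 .. L), for the visual (v) and text (t) modalities."""
    e_u_v = torch.cat(user_layers_v, dim=-1)   # layer aggregation, visual modality
    e_i_v = torch.cat(item_layers_v, dim=-1)
    e_u_t = torch.cat(user_layers_t, dim=-1)   # layer aggregation, text modality
    e_i_t = torch.cat(item_layers_t, dim=-1)
    e_u = torch.cat([e_u_v, e_u_t], dim=-1)    # final user representation
    e_i = torch.cat([e_i_v, e_i_t], dim=-1)    # final item representation
    return e_u @ e_i.T                          # \hat{y}_{ui} for every user-item pair
```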
4.2.5. Optimizer
To train the recommendation model, the Bayesian Personalized Ranking (BPR) loss is used to optimize the parameters. BPR aims to maximize the gap between the predicted score of an observed (positive) item and that of an unobserved (negative) item for a given user:

$$ \mathcal{L}_{CF} = \sum_{(u, i) \in \mathcal{O}^{+}, \, (u, j) \in \mathcal{O}^{-}} -\ln \sigma\big( \hat{y}_{ui} - \hat{y}_{uj} \big) + \lambda \lVert \Theta \rVert_2^2 $$

Here:

- $\mathcal{L}_{CF}$: The collaborative filtering loss, based on BPR.
- $\mathcal{O}^{+}$: The set of observed (positive) user-item interactions.
- $\mathcal{O}^{-}$: The set of sampled unobserved (negative) user-item interactions.
- $\hat{y}_{ui}$: The predicted score for user $u$ and positive item $i$.
- $\hat{y}_{uj}$: The predicted score for user $u$ and negative item $j$.
- $\sigma$: The sigmoid function.
- $\lambda$: The regularization coefficient.
- $\lVert \Theta \rVert_2^2$: The $L_2$ regularization term over all trainable model parameters $\Theta$, used to prevent overfitting.
The final overall loss combines the KG embedding loss ($\mathcal{L}_{KG}$) and the collaborative filtering loss ($\mathcal{L}_{CF}$):

$$ \mathcal{L} = \mathcal{L}_{KG} + \mathcal{L}_{CF} $$

Here:

- $\mathcal{L}$: The total loss minimized during training.
- $\mathcal{L}_{KG}$: The TransR knowledge graph embedding loss, as defined in Section 4.2.1.2.
- $\mathcal{L}_{CF}$: The BPR collaborative filtering loss, as defined above.

A minimal PyTorch sketch of these training objectives follows below.
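A minimal PyTorch sketch of the BPR objective and the combined loss described above. How the regularization term is applied is an assumption; the paper may instead fold it into the optimizer's weight decay.

```python
import torch
import torch.nn.functional as F

def bpr_loss(scores_pos, scores_neg):
    """-ln sigma(y_ui - y_uj), averaged over sampled (u, i, j) triples."""
    return F.softplus(-(scores_pos - scores_neg)).mean()

def total_loss(scores_pos, scores_neg, g_pos, g_neg, params, reg_lambda=1e-5):
    """L = L_KG + L_CF, with an L2 penalty on the trainable parameters."""
    l_cf = bpr_loss(scores_pos, scores_neg)
    l_kg = F.softplus(-(g_neg - g_pos)).mean()          # TransR pairwise ranking loss
    l_reg = sum((p ** 2).sum() for p in params)         # ||Theta||_2^2
    return l_kg + l_cf + reg_lambda * l_reg
```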
5. Experimental Setup
5.1. Datasets
The experiments were conducted on two real-world datasets from different domains: MovieLens and Amazon-Books.
-
MovieLens:
- Source: MovieLens-1M dataset, a widely used benchmark for recommender systems.
- Characteristics: Contains user IDs, item (movie) IDs, and ratings on a scale from 1 to 5.
- Processing: All ratings were converted to binary: 1 for a rating of 1, and 0 for all other ratings. This implies that only ratings of 1 were considered positive interactions.
- Multimodal Enrichment: A knowledge graph was constructed by linking items in the dataset to entities in Freebase. Corresponding movie posters and text descriptions were retrieved from IMDb to serve as visual and textual multimodal information for the entities.
-
Amazon-Books:
- Source: A subset of user reviews from Amazon's e-commerce website.
- Processing: Users with fewer than 10 interactions were filtered out, following the method in Wang, He et al. (2019), to ensure sufficient historical data per user.
- Multimodal Enrichment: Multimodal information (likely book covers and descriptions) was collected using the same methodology as for the MovieLens dataset.
-
Data Statistics: The following are the results from Table 1 of the original paper:
| Dataset | #Interactions | #Items | #Users | Sparsity | #Entities | #Relations | #Triplets |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MovieLens | 834,268 | 3,589 | 6,040 | 96.15% | 60,406 | 51 | 273,547 |
| Amazon-Books | 332,834 | 18,932 | 24,047 | 99.92% | 44,935 | 23 | 192,388 |
-
Rationale for Dataset Choice: These datasets are widely recognized and used in the recommender systems community. MovieLens provides a classic movie recommendation scenario, while Amazon-Books offers an e-commerce context. Both allow the integration of structured knowledge (KGs) and rich multimodal content (posters/covers, descriptions), making them suitable for evaluating multimodal KG-based recommender systems. The high sparsity levels (96.15% for MovieLens, 99.92% for Amazon-Books) highlight the challenge of recommending in sparse interaction environments, where KGs and multimodal data provide crucial auxiliary information. A small preprocessing sketch follows.
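As an illustration of the described Amazon-Books preprocessing (keeping only users with at least 10 interactions), here is a hedged pandas sketch; the column names are assumptions.

```python
import pandas as pd

def filter_sparse_users(interactions: pd.DataFrame, min_interactions: int = 10) -> pd.DataFrame:
    """Drop users with fewer than `min_interactions` interactions.

    `interactions` is assumed to have columns ['user_id', 'item_id'].
    """
    counts = interactions.groupby("user_id")["item_id"].transform("count")
    return interactions[counts >= min_interactions].reset_index(drop=True)
```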
5.2. Evaluation Metrics
To measure the quality of the recommended lists, three commonly used metrics are employed: Recall@k, MRR@k, and NDCG@k, with $k = 20$ by default. Items the user has interacted with in the test set are treated as positives and all other items as candidates; the top-$k$ ranked items are returned as recommendations.
-
Recall@k:
- Conceptual Definition: Recall@k measures the proportion of relevant items (i.e., items a user actually interacted with in the test set) that are successfully retrieved within the top-$k$ recommendations. It reflects how many of the truly relevant items the recommender system managed to find.
- Mathematical Formula:
  $$ \mathrm{Recall@k} = \frac{1}{|U|} \sum_{u \in U} \frac{|\mathrm{Rel}_u \cap \mathrm{Rec}_u^k|}{|\mathrm{Rel}_u|} $$
- Symbol Explanation:
  - $U$: The set of all users in the test set.
  - $|\cdot|$: The cardinality (number of elements) of a set.
  - $\mathrm{Rel}_u$: The set of relevant items for user $u$ in the test set (items actually interacted with).
  - $\mathrm{Rec}_u^k$: The set of top-$k$ items recommended to user $u$.
  - $\cap$: Set intersection.
  - $\sum_{u \in U}$: Summation over all users.
-
Mean Reciprocal Rank (MRR@k):
- Conceptual Definition: MRR@k evaluates ranking quality, especially when there are only one or a few correct answers. For each user, it finds the rank of the first relevant item, takes the reciprocal of that rank (1/rank), and averages these reciprocal ranks across all users. A higher MRR indicates that the first relevant item appears earlier in the recommendation list, so the metric is sensitive to ranking position.
- Mathematical Formula:
  $$ \mathrm{MRR@k} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\mathrm{rank}_u} $$
- Symbol Explanation:
  - $U$: The set of all users in the test set.
  - $|\cdot|$: The cardinality of a set.
  - $\mathrm{rank}_u$: The rank of the first relevant item in the recommendation list for user $u$, up to rank $k$. If no relevant item appears in the top-$k$, the reciprocal rank is 0.
-
Normalized Discounted Cumulative Gain (NDCG@k):
- Conceptual Definition: NDCG@k is a measure of ranking quality that considers the graded relevance of items (often binary in recommendation). It assigns higher scores to relevant items that appear earlier in the list and discounts relevant items as their position decreases. The score is normalized by the ideal DCG (a perfect ranking), making it comparable across users. It is sensitive to ranking positions.
- Mathematical Formula:
  $$ \mathrm{NDCG@k} = \frac{1}{|U|} \sum_{u \in U} \frac{\mathrm{DCG@k}_u}{\mathrm{IDCG@k}_u} $$
  where
  $$ \mathrm{DCG@k}_u = \sum_{j=1}^{k} \frac{2^{\mathrm{rel}(j)} - 1}{\log_2(j + 1)} $$
  and $\mathrm{IDCG@k}_u$ is the DCG of the ideal ranking for user $u$.
- Symbol Explanation:
  - $U$: The set of all users in the test set.
  - $|\cdot|$: The cardinality of a set.
  - $\mathrm{DCG@k}_u$: Discounted Cumulative Gain for user $u$ up to rank $k$.
  - $\mathrm{IDCG@k}_u$: Ideal Discounted Cumulative Gain for user $u$ up to rank $k$, i.e., the DCG obtained if all relevant items were ranked at the top.
  - $j$: The position of an item in the recommendation list.
  - $\mathrm{rel}(j)$: The relevance of the item at position $j$. In binary relevance scenarios it is 1 if the item is relevant and 0 otherwise.

A code sketch of these metrics follows below.
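A compact sketch of the three metrics for a single user with binary relevance, matching the formulas above; the input conventions (a ranked list and a set of relevant items) are assumptions. Averaging the per-user values over all users yields the reported metrics.

```python
import math

def metrics_at_k(recommended: list, relevant: set, k: int = 20):
    """Return (recall@k, reciprocal_rank@k, ndcg@k) for one user."""
    top_k = recommended[:k]
    hits = [1 if item in relevant else 0 for item in top_k]

    recall = sum(hits) / max(len(relevant), 1)

    rr = 0.0
    for rank, hit in enumerate(hits, start=1):          # reciprocal rank of the first hit
        if hit:
            rr = 1.0 / rank
            break

    dcg = sum((2 ** h - 1) / math.log2(j + 1) for j, h in enumerate(hits, start=1))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(j + 1) for j in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return recall, rr, ndcg

# Example: metrics_at_k(["a", "b", "c"], {"b", "z"}, k=3) -> (0.5, 0.5, ~0.39)
```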
5.3. Baselines
The Multi-KG4Rec model was compared against several baselines, categorized into collaborative filtering, knowledge graph-based, and multimodal methods that incorporate knowledge graphs.
- Collaborative Filtering Methods: These methods rely on user-item interaction patterns.
  - SpectralCF (Zheng et al., 2018): A spectral collaborative filtering model that applies a convolutional model in the spectral domain of the user-item bipartite graph. It aims to reveal deep connections and alleviate the cold-start problem.
  - ConvNCF (He et al., 2018): A neural collaborative filtering model that uses element-wise products to capture pairwise correlations among dimensions within the embedding space.
- Knowledge Graph-based Approaches: These integrate KGs to enrich recommendations.
  - KGAT (Wang, He et al., 2019): Integrates TransR and a GNN to generate entity representations and propagates features over the collaborative KG to encode high-order relationships.
  - KGCN (Wang, Zhao et al., 2019): Uses GNNs to learn entity relationships by aggregating information from a fixed number of neighbors as the receptive field.
  - CKE (Zhang et al., 2016): Integrates structural information (via TransR), textual data (via stacked denoising autoencoders), and image data (via stacked convolutional autoencoders) to enhance recommendation quality.
- Multimodal Methods Incorporating Knowledge Graphs: These are specifically designed to handle multimodal information alongside KGs.
  - MKGAT (Sun et al., 2020): A multimodal graph attention mechanism designed for entity information aggregation and entity relationship reasoning, treated here as a representative feature-based method.

These baselines cover the main paradigms — pure collaborative filtering, KG-enhanced methods, and multimodal KG methods — allowing a comprehensive evaluation of Multi-KG4Rec's advancements.
5.4. Parameter Settings
- Data Split: The interaction data was randomly split 8:1:1 into training, validation, and test sets.
- Initialization: Model parameters were initialized with the Xavier initializer.
- Optimizer: The Adam optimizer was used for model optimization.
- Hyperparameters:
  - Mini-batch size: searched over a range of candidate values.
  - Learning rate: searched over a range of candidate values.
  - Regularization coefficient: tuned over a set of candidate values for the $L_2$ regularization term.
- Multimodal Features:
  - CLIP model: used to encode visual and text entities, extracting 512-dimensional features from its last layer.
  - Dimension reduction: the 512-dimensional features were reduced to 64 dimensions via a non-linear transformation with a LeakyReLU activation (see the sketch after this list).
- Multimodal Fusion Module:
  - Blocks: 3 stacked blocks (layers).
  - Attention heads: 8 attention heads per block.
- Information Propagation Module:
  - Layers: 3 layers of the knowledge-aware graph neural network were used to encode high-order connectivity.
  - Output dimensions: the output dimension of the GNN layers decreases progressively through the layers.
- Implementation: Multi-KG4Rec was implemented in PyTorch.
- Hardware: All experiments were conducted on a Windows PC equipped with an RTX 3090 GPU.
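A one-liner sketch of the described 512-to-64 non-linear projection of CLIP features; the exact layer structure is an assumption.

```python
import torch.nn as nn

# Project 512-d CLIP features down to the 64-d embedding space used by the model.
clip_projection = nn.Sequential(
    nn.Linear(512, 64),
    nn.LeakyReLU(),
)
# Usage: entity_feat_64 = clip_projection(entity_feat_512)  # (batch, 512) -> (batch, 64)
```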
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that Multi-KG4Rec consistently outperforms all baseline models on both the MovieLens and Amazon-Books datasets across all three evaluation metrics (Recall@k, MRR@k, NDCG@k).
The following are the results from Table 2 of the original paper:
| Models | MovieLens Recall | MovieLens MRR | MovieLens NDCG | Amazon-Books Recall | Amazon-Books MRR | Amazon-Books NDCG |
| --- | --- | --- | --- | --- | --- | --- |
| SpectralCF | 0.2199 | 0.3714 | 0.2082 | 0.1327 | 0.0541 | 0.0602 |
| ConvNCF | 0.1815 | 0.3405 | 0.1794 | 0.0404 | 0.0148 | 0.0175 |
| KGAT | 0.2489 | 0.3941 | 0.2303 | 0.1431 | 0.0553 | 0.0702 |
| KGCN | 0.2268 | 0.3783 | 0.2165 | 0.1418 | 0.0528 | 0.0677 |
| CKE | 0.2217 | 0.3754 | 0.2128 | 0.1324 | 0.0491 | 0.0612 |
| MKGAT | 0.2513 | 0.3963 | 0.2311 | 0.1477 | 0.0560 | 0.0707 |
| Multi-KG4Rec | 0.2552 | 0.4077 | 0.2383 | 0.1498 | 0.0572 | 0.0727 |
| Improv. | 1.55% | 2.88% | 3.12% | 1.42% | 2.83% | 2.14% |
Key findings from this comparison:
- Multi-KG4Rec's Superiority: Multi-KG4Rec achieves the best performance across all metrics on both datasets. Compared to MKGAT (the strongest MKG-based baseline), it improves by up to 3.12% on MovieLens and up to 2.83% on Amazon-Books (per the Improv. row). This validates the effectiveness of the proposed unified architecture and fine-grained multimodal fusion module; the authors attribute it to Multi-KG4Rec's comprehensive view of user-item interactions, which is crucial for personalized recommendation.
- KG-based vs. CF-based Methods: KG-based methods (CKE, KGAT, KGCN, MKGAT) generally outperform collaborative filtering methods (SpectralCF, ConvNCF). This underscores the value of Knowledge Graphs in providing auxiliary information and relational reasoning, especially under sparse data: KGs help GNNs encode relationships between attribute entities, alleviating data sparsity and cold-start issues and enhancing the understanding of user-item relationships.
- KGAT vs. KGCN: KGAT outperforms KGCN. The paper suggests that while KGCN aims for a broader receptive field, it may introduce more noise, whereas KGAT's collaborative-KG approach, which propagates features through GCN layers, is more effective.
- CKE's Limitations: CKE performs worst among the KG-based methods. This is attributed to its lack of GNN-based high-order neighbor aggregation and its insufficient attention to interactions between modalities, even though it also separates text and images into distinct modes. This highlights the importance of GNN propagation and advanced fusion mechanisms, both of which Multi-KG4Rec provides.
6.2. Modality Effectiveness Analyses
To understand the impact of different modalities, an analysis was conducted by comparing MKGAT and Multi-KG4Rec under various modality configurations on the MovieLens dataset.
The following are the results from Table 3 of the original paper:
| Setting | MKGAT Recall | MKGAT MRR | MKGAT NDCG | Multi-KG4Rec Recall | Multi-KG4Rec MRR | Multi-KG4Rec NDCG |
| --- | --- | --- | --- | --- | --- | --- |
| w/o t&v | 0.2453 | 0.3907 | 0.2251 | 0.2489 | 0.3941 | 0.2303 |
| w/o v | 0.2477 | 0.3949 | 0.2272 | 0.2518 | 0.4014 | 0.2327 |
| Improv. | 1.00% | 1.07% | 0.93% | 1.16% | 1.85% | 1.04% |
| w/o t | 0.2479 | 0.3951 | 0.2285 | 0.2531 | 0.4016 | 0.2340 |
| Improv. | 1.06% | 1.13% | 1.51% | 1.69% | 1.90% | 1.61% |
| Multi-KG4Rec | 0.2488 | 0.3963 | 0.2311 | 0.2542 | 0.4033 | 0.2371 |
| Improv. | 1.42% | 1.43% | 2.67% | 2.13% | 2.33% | 2.95% |
Note: The w/o t&v row for Multi-KG4Rec matches the KGAT result from Table 2, indicating that Multi-KG4Rec without multimodal features reduces to a KG-only model comparable to KGAT; this establishes the base performance without fusion. The final "Multi-KG4Rec" row reports the full models with both modalities, and each Improv. row gives the improvement over the w/o t&v baseline within the same model columns.
Key observations:
- Multimodal Benefits: Models incorporating both visual and text features consistently outperform those relying on a single modality or no multimodal information (w/o t&v). Rich, diverse item characteristics from multiple perspectives improve the model's understanding of user intent.
- Visual Modality's Dominance: Under single-modal conditions, the visual modality (the w/o t setting, i.e., KG structure plus visual information only) is generally more effective than the text modality (the w/o v setting, i.e., without visual information). This aligns with findings from other multimodal models and suggests that, in these domains, images convey more decision-relevant information, or carry more weight in user decisions, than text content.
- Multi-KG4Rec's Expressive Power: Multi-KG4Rec outperforms MKGAT under all comparable settings (full multimodal, w/o v, and w/o t), suggesting superior ability to perceive implicit relationships between images and texts, primarily due to its multimodal fusion module, which extracts cross-modal information at a fine-grained level.
6.3. Ablation Study
An ablation study was conducted to further analyze the effectiveness of the Bi-Transformer (bi-directional attention mechanism) within the multimodal fusion module. This involved comparing the full Multi-KG4Rec model with variants where only unidirectional attention was activated.
The following are the results from Table 4 of the original paper:
| Variant | MovieLens Recall | MovieLens MRR | MovieLens NDCG | Amazon-Books Recall | Amazon-Books MRR | Amazon-Books NDCG |
| --- | --- | --- | --- | --- | --- | --- |
| w/o t&v | 0.2453 | 0.3907 | 0.2251 | 0.1473 | 0.0566 | 0.0716 |
| Bi-Trans12v | 0.2437 | 0.3917 | 0.2244 | 0.1428 | 0.0514 | 0.0674 |
| Bi-Transu2t | 0.2444 | 0.3944 | 0.2227 | 0.1436 | 0.0521 | 0.0662 |
| Multi-KG4Rec | 0.2552 | 0.4077 | 0.2383 | 0.1498 | 0.0572 | 0.0727 |
Note: w/o t&v denotes that the multimodal fusion module is disabled. The variant names as printed are ambiguous, but based on the authors' descriptions ("text-to-image attention activated" and "image-to-text attention activated"), Bi-Trans12v most likely denotes the unidirectional text-to-visual variant (text queries attend to visual keys/values), and Bi-Transu2t the visual-to-text variant (visual queries attend to text keys/values).
Key findings from the ablation study:
- Impact of Multimodal Fusion: Disabling the multimodal fusion module (w/o t&v) causes a clear performance drop relative to the full Multi-KG4Rec model across all metrics and datasets, reinforcing the critical role of multimodal information integration.
- Importance of Bi-directional Attention: Both unidirectional variants (Bi-Trans12v and Bi-Transu2t) perform worse than the full model, indicating that bi-directional cross-attention is crucial. A unidirectional transformer considers correlations from only one side and can lose vital information from the other modality: text-to-visual attention may capture how text relates to visual features but miss how visual features inform text understanding, and vice versa.
- Robustness and Enhanced Interaction: The bi-directional transformer in the full Multi-KG4Rec model independently extracts significant features from each modality, and the cross-modality attention module then dynamically adjusts the weights of these features. This strengthens the interaction between modalities, improving both overall performance and robustness against noisy or irrelevant information within a single modality.
6.4. Case Study
A case study was conducted to visually validate how modalities influence user preferences. Two users ($u_{3238}$ from MovieLens and $u_{927}$ from Amazon-Books) were selected, and 10 items each of them interacted with were gathered. The attention mechanism was used to compute correlation scores between user-item pairs, where higher scores indicate a greater impact of that item's modality on the user's preferences.

This figure illustrates multimodal preferences, showing the entities and preference weights of the two users u_3238 and u_927 under the visual and text modalities; some visual nodes correspond to movie posters, and text nodes carry keyword tags, reflecting the fine-grained interaction of multimodal information in capturing user preferences.
Fig. 4. Attention distribution for $u_{3238}$ and $u_{927}$.
Key insights from the case study (Figure 4):
- Personalized Modal Preferences: The visualization clearly shows that different users exhibit different preferences over the visual and text modalities. User $u_{3238}$ from MovieLens has a markedly higher attention score on the visual modality than on the text modality, whereas user $u_{927}$ from Amazon-Books shows the opposite trend, attending more to text.
- Rationale for Fine-Grained Fusion: This observation validates the necessity of performing modal fusion at a fine-grained level: $u_{3238}$ may prioritize movie posters when choosing a movie, while $u_{927}$ may focus more on book descriptions or reviews.
- Qualitative Analysis of Preferences: Further qualitative analysis reveals specific preferences: $u_{3238}$ tends to prefer posters with "scary elements", while $u_{927}$ leans toward "romantic-themed books". This shows that Multi-KG4Rec can identify not only which modality a user prefers but also which characteristics within that modality appeal to them, demonstrating the effectiveness of the fine-grained fusion.
7. Conclusion & Reflections
7.1. Conclusion Summary
This study successfully proposed Multi-KG4Rec, a novel personalized recommendation framework that leverages a Knowledge Graph and a sophisticated multimodal fusion approach. The framework's core innovation lies in its ability to effectively learn potential relationships and interactions between textual and visual modalities at a fine-grained level using a Bi-Transformer module. This is complemented by a GNN layer that propagates high-order information throughout the KG. Extensive experiments on two real-world datasets, MovieLens and Amazon-Books, conclusively demonstrated the efficiency and effectiveness of Multi-KG4Rec, showing superior performance over various strong baselines. The research highlighted the importance of multimodal information and the necessity of fine-grained cross-modal fusion for capturing diverse user preferences.
7.2. Limitations & Future Work
The authors suggest a specific direction for future work:
-
Additional Modalities: The paper suggests that web pages can serve as an additional, valuable modality to offer more contextual information for items. They note a relative scarcity of research in this area.
-
Future Research Goal: The authors aim to collect such web page datasets and design models to validate their hypothesis regarding the utility of web page content for recommender systems. This implies a current limitation is the exclusion of other potentially rich modalities.
Implicit limitations, though not explicitly stated as "limitations" by the authors, could be inferred:
-
Computational Complexity: Fine-grained attention mechanisms and
Transformers, especially with long neighbor sequences, can be computationally intensive, potentially affecting real-time performance in very large-scale systems. The paper notes this as a general concern for fine-grained attention but does not specifically analyze it for Multi-KG4Rec itself.
Scalability of
KG construction: Building and maintaining MKGs for vast item catalogs, particularly retrieving and aligning multimodal content (e.g., IMDb or Freebase links), can be complex and resource-intensive.
Generalizability of
CLIP: While CLIP is powerful, its effectiveness relies on its pre-training data; performance may vary for highly specialized or niche domains that are not well represented in that data.
7.3. Personal Insights & Critique
This paper presents a strong contribution to the field of multimodal recommender systems by addressing critical architectural and fusion shortcomings.
-
Strengths and Innovations:
- Unified Architecture: The idea of dividing the
MKG into single-modal graphs before fusion is an elegant way to handle multimodal data without falling into the pitfalls of entity sparsity or over-simplification.
pre-trained LLMs(CLIP) for initial features combined with aBi-Transformercross multi-head attentionmodule represents a state-of-the-art approach to multimodal fusion, moving beyond simple concatenation. This truly enables the "fine-grained" preference extraction that many previous works claimed but struggled to achieve. - Clear Validation: The experimental setup across two diverse datasets, comprehensive baseline comparisons, and detailed ablation studies provide robust evidence for the model's effectiveness. The case study is particularly insightful, offering qualitative validation of personalized modal preferences.
- Unified Architecture: The idea of dividing the
-
Potential Areas for Improvement/Critique:
- Defining "Single-Modal Graphs": While the concept of "dividing the multimodal graph into several single modal graphs" is mentioned, the exact implementation details of how these single-modal graphs are structured and how they interact before the
multimodal fusion modulecould be further elaborated. Does this imply separateGNNlayers for each modality before fusion, or is it purely a conceptual distinction in how features are handled? - Cold-Start Scenarios: While
KGsgenerally help withcold-start, the paper mentions that for "cold-start nodes, more high-level information will be introduced to enhance their representations" when constructing . A dedicated analysis or experiment on cold-start performance would further highlight this benefit. - Computational Cost of
Bi-Transformer: Although noted generally for fine-grained attention, a more specific discussion or analysis of the computational overhead introduced by theBi-TransformerwithinMulti-KG4Recand potential strategies for optimization (e.g., knowledge distillation, pruning) for large-scale deployment would be valuable. - Interpretability of
Bi-Transformer: While the case study provides a visual interpretation of attention, a deeper dive into which specific visual features or text phrases influence decisions could offer even greater interpretability, especially for the nuanced interactions within theBi-Transformer.
- Defining "Single-Modal Graphs": While the concept of "dividing the multimodal graph into several single modal graphs" is mentioned, the exact implementation details of how these single-modal graphs are structured and how they interact before the
-
Transferability and Applicability: The methods proposed in
Multi-KG4Recare highly transferable.- The
multimodal fusion modulecould be adapted for any domain where items have rich visual and textual descriptions (e.g., fashion, real estate, travel, news recommendations). - The
knowledge-aware propagationcould be applied to other graph-structured data fusion tasks beyond recommendation. - The framework for leveraging
pre-trained multimodal modelswithGNNscould inspire similar architectures in other domains, such as multimodal question answering or knowledge base completion with rich media. The authors' suggestion of integrating web pages as another modality further reinforces the adaptability of this framework to incorporate diverse information sources.
- The