MEGCF: Multimodal Entity Graph Collaborative Filtering for Personalized Recommendation
TL;DR Summary
The MEGCF model addresses the mismatch between multimodal feature extraction and user interest modeling by extracting semantic entities, constructing a user-item-entity interaction graph, and employing a sentiment-weighted graph convolutional network to improve recommendation accuracy.
Abstract
In most E-commerce platforms, whether the displayed items trigger the user's interest largely depends on their most eye-catching multimodal content. Consequently, increasing efforts focus on modeling multimodal user preference, and the pressing paradigm is to incorporate complete multimodal deep features of the items into the recommendation module. However, the existing studies ignore the mismatch problem between multimodal feature extraction (MFE) and user interest modeling (UIM). That is, MFE and UIM have different emphases. Specifically, MFE is migrated from and adapted to upstream tasks such…
In-depth Reading
1. Bibliographic Information
1.1. Title
MEGCF: Multimodal Entity Graph Collaborative Filtering for Personalized Recommendation
1.2. Authors
Kang Liu, Feng Xue, Dan Guo, Le Wu, Shujie Li, and Richang Hong.
1.3. Journal/Conference
ACM Transactions on Information Systems (TOIS), Volume 41, Issue 2, Article 30. Comment: TOIS is a premier and highly prestigious journal in the field of information retrieval and recommender systems, known for publishing rigorous and impactful research.
1.4. Publication Year
2023 (Published online: June 13, 2022; Issue date: March 2023).
1.5. Abstract
This paper addresses a critical issue in multimodal recommendation systems: the mismatch problem between Multimodal Feature Extraction (MFE) and User Interest Modeling (UIM). Existing methods typically extract complete deep features from items (images/text) using models pre-trained on upstream tasks (like classification). However, these features often contain significant noise irrelevant to user preferences (e.g., background clutter in an image). The authors propose MEGCF (Multimodal Entity Graph Collaborative Filtering) to solve this. MEGCF transforms MFE into a user-oriented process by extracting specific "semantic entities" (e.g., "jacket" instead of the whole image vector) and constructing a Collaborative Multimodal Interaction Graph. It employs a novel Sentiment-weighted Symmetric Linear Graph Convolution Network (GCN) to capture high-order semantic correlations and collaborative signals. Extensive experiments on three datasets demonstrate MEGCF's superiority over state-of-the-art baselines.
1.6. Original Source Link
ACM Digital Library PDF (official publisher link).
2. Executive Summary
2.1. Background & Motivation
In modern E-commerce, users are heavily influenced by multimodal content (images, titles, reviews). Recommendation systems have evolved from simple ID-based Collaborative Filtering (CF) to Multimodal Recommendation, which incorporates visual and textual features to alleviate data sparsity (cold-start problems).
However, the authors identify a critical gap:
- The Mismatch Problem: Most existing methods use a two-step approach:
- MFE (Multimodal Feature Extraction): Uses deep neural networks (like CNNs) pre-trained on general tasks (e.g., ImageNet classification) to generate dense vectors for items. This is content-oriented.
- UIM (User Interest Modeling): Uses these vectors to predict user preference. This is user-oriented.
- The Conflict: The pre-trained features capture what the object is (including background, angle, and lighting), while the user cares about specific attributes (e.g., the style of the collar, the specific object). Direct incorporation introduces "preference-independent multimodal noise" that contaminates the recommendation model.
The following figure (Figure 1 from the original paper) illustrates this mismatch: Visual Feature Extraction (VFE) captures the entire image content (including background), whereas user preference is driven by specific entities like the "jacket" or "white hat."
The figure is a schematic showing how a user's preference matches the semantically rich entities in an item (jacket, white hat, and jeans), and the mismatch between traditional visual feature extraction (VFE) and user interest modeling.
2.2. Main Contributions / Findings
- Concept Innovation: The paper proposes to shift from using "complete deep features" to "semantic entities." By extracting specific objects (visual entities) and keywords (textual entities), the model filters out noise and focuses on elements that trigger user interest.
- Methodological Novelty (MEGCF):
- Constructs a Collaborative Multimodal Interaction Graph that explicitly links Users, Items, and Semantic Entities.
- Develops a Symmetric Linear GCN architecture. One branch captures standard collaborative signals, and the other captures multimodal semantic correlations.
- Introduces a Review-based Sentiment Weighting Strategy. Instead of learning attention weights solely from sparse interactions (like GAT), it uses sentiment analysis of reviews to determine an item's "quality," using this to weight message propagation in the graph.
- Performance: MEGCF achieves state-of-the-art results on three real-world datasets (Amazon Beauty, Amazon Art, and Taobao), outperforming strong baselines like LightGCN, MMGCN, and GRCN.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand MEGCF, a beginner needs to grasp these concepts:
- Collaborative Filtering (CF): A technique that predicts what a user will like based on the behavior of similar users. Intuitively, if User A and User B both bought items X and Y, and User A then buys Z, User B is also likely to like Z.
- Graph Convolutional Network (GCN): A deep learning method for graph data. In a recommendation graph (User-Item), a GCN updates a node's representation (embedding) by aggregating information from its neighbors.
- High-order Connectivity: A standard CF looks at direct interactions (1-hop). GCNs can look deeper (2-hop, 3-hop). For example, the path User 1 → Item A → User 2 → Item B suggests that User 1 might also like Item B.
- Linear GCN (LightGCN): Traditional GCNs use non-linear activation functions (like ReLU) between layers. Recent research (LightGCN) showed that for recommendation these non-linearities are unnecessary and even harmful. Linear GCNs simplify each layer to a weighted aggregation of neighbor embeddings, which is faster and often more accurate (see the toy sketch after this list).
- Multimodal Semantic Entities: Instead of representing an image as a single vector of numbers (embedding), the system identifies distinct objects within it (e.g., "sunglasses," "beach"). These discrete labels are "entities."
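To make the linear-GCN idea concrete, here is a toy sketch (not from the paper): one LightGCN-style layer on a three-node graph, where each embedding is replaced by a degree-normalized average of its neighbors, with no weight matrices and no activations.

```python
import numpy as np

# Toy LightGCN-style layer: embeddings are updated by a degree-normalized
# neighbor average, with no weight matrices and no non-linear activations.
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)   # tiny interaction graph (3 nodes)
deg = adj.sum(axis=1)                      # node degrees: [2, 1, 1]
norm = adj / np.sqrt(np.outer(deg, deg))   # 1/sqrt(d_u * d_i) normalization
emb = np.random.rand(3, 4)                 # 3 nodes, 4-dimensional embeddings
emb_next = norm @ emb                      # one linear GCN propagation layer
```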
3.2. Previous Works
- Separated Frameworks (SF): Methods like VBPR (Visual Bayesian Personalized Ranking) extract visual features using a pre-trained CNN and then feed them into a Matrix Factorization model. The feature extraction and recommendation training are separate.
- End-to-End Frameworks (EF): Methods like DVBPR try to train the image processor and the recommender jointly. While theoretically better, they are computationally expensive and hard to train on sparse data.
- GCN-based CF:
- NGCF: Uses standard non-linear GCNs.
- LightGCN: The state-of-the-art CF model that simplifies NGCF by removing non-linearities. MEGCF builds upon the Linear GCN architecture.
- Multimodal GCNs:
- MMGCN: Uses separate GCNs for each modality (audio, video, text).
- GRCN: Refines the interaction graph using multimodal similarity.
3.3. Differentiation Analysis
MEGCF differs from the above in two key ways:
- Input Data: Unlike MMGCN or GRCN, which use dense vectors (deep features) that contain noise, MEGCF extracts discrete Semantic Entities to build a graph.
- Weighting Mechanism: Unlike GAT (Graph Attention Networks), which learns weights from sparse interactions, MEGCF uses Sentiment Analysis of reviews to pre-calculate weights, injecting explicit "quality" signals into the graph structure.
The following figure (Figure 2 from the original paper) illustrates how "Multimodal Semantic Correlation" works. Even if two users have no directly overlapping items, they are connected through paths over shared semantic entities, revealing their common preference.
Fig. 2. Illustration of multimodal semantic correlation, where $u$, $i$, and $e$ denote the user, item, and semantic entity, respectively, and $C_{ab}$ denotes the semantic correlation between $a$ and $b$.
4. Methodology
4.1. Principles
The core philosophy of MEGCF is "De-noising through Semantics."
- Extraction: Convert raw, noisy multimodal data into clean, meaningful entities.
- Connection: Use these entities as bridges in a graph to connect items that share semantic features.
- Propagation: Use a graph neural network to spread user preferences across these bridges, weighted by how positively users feel about the items (sentiment).
The overall architecture is shown below (Figure 3 from the original paper):
Fig. 3. Illustration of the proposed MEGCF. The target user and item are $u_1$ and $i_1$, MSE denotes multimodal semantic entity, and $L$ is the maximum number of graph convolution layers.
4.2. Core Methodology In-depth
4.2.1. Multimodal Semantic Entity Extraction
The first step is to extract meaningful "tags" from images and text.
- Visual Entities: The authors use a PNASNet model pre-trained on ImageNet. For each item image, they take the top-ranked categories from the classification output (e.g., "Jersey," "Backpack") as visual entities (a code sketch follows this list).
- Textual Entities:
- Titles: Words are directly used as entities after removing stop words.
- Reviews: Since reviews are long and noisy, the SGRank algorithm (a keyword extraction method) is used to identify key terms.
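To illustrate the visual side, here is a minimal sketch of taking an ImageNet classifier's top-ranked labels as entities. The paper uses PNASNet; torchvision's ResNet-50 stands in here, and `TOP_K` is a hypothetical cutoff, so this is a sketch of the idea rather than the authors' pipeline.

```python
import torch
from torchvision import models, transforms
from PIL import Image

TOP_K = 5  # hypothetical cutoff for the top-ranked categories

# Stand-in for the paper's PNASNet: any ImageNet classifier works here.
weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval()
labels = weights.meta["categories"]        # the 1,000 ImageNet class names

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def visual_entities(image_path: str) -> list[str]:
    """Return the top-K predicted categories as the item's visual entities."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = model(x).softmax(dim=-1).squeeze(0)
    return [labels[i] for i in probs.topk(TOP_K).indices.tolist()]
```

The textual side is analogous: keep title words after stop-word removal, and keep SGRank keyterms from reviews.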
4.2.2. Collaborative Multimodal Interaction Graph Construction
MEGCF builds a unified graph by merging two bipartite graphs:
- User-Item Graph: Standard interactions, where an edge exists between user $u$ and item $i$ if $u$ interacted with $i$.
- Item-Entity Graph: Connections produced by extraction, where an edge exists between item $i$ and entity $e$ if $e$ appears in $i$'s image, title, or reviews.

The final graph merges the two into a single tripartite User-Item-Entity structure, sketched below.
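A minimal sketch of the merged graph as a single symmetric adjacency matrix, assuming nodes are ordered `[users | items | entities]`; `build_graph` is a hypothetical helper, not the authors' code.

```python
import numpy as np
import scipy.sparse as sp

def build_graph(n_users, n_items, n_entities, interactions, item_entities):
    """Symmetric adjacency of the tripartite user-item-entity graph.

    interactions:  (user_idx, item_idx) pairs  -- the user-item edges
    item_entities: (item_idx, entity_idx) pairs -- the item-entity edges
    """
    n = n_users + n_items + n_entities
    rows, cols = [], []
    for u, i in interactions:              # user-item edges (both directions)
        rows += [u, n_users + i]
        cols += [n_users + i, u]
    for i, e in item_entities:             # item-entity edges (both directions)
        rows += [n_users + i, n_users + n_items + e]
        cols += [n_users + n_items + e, n_users + i]
    data = np.ones(len(rows), dtype=np.float32)
    return sp.csr_matrix((data, (rows, cols)), shape=(n, n))
```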
4.2.3. Review-based Sentiment Extraction & Weighting
This is a unique feature of MEGCF. It assumes that items with positive reviews are "higher quality" and should have more influence in the graph propagation.
Sentiment Score Calculation: For an item $i$, let $R_i$ be its set of reviews. The authors use a pre-trained sentiment analysis model (SENTA), denoted $f_s(\cdot)$, to score each review. The item's sentiment score is the average
$$s_i = \frac{1}{|R_i|} \sum_{r \in R_i} f_s(r),$$
where $|R_i|$ is the number of reviews. This score is used to weight the item's influence.
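A minimal sketch of this averaging, where `score_review` is a hypothetical stand-in for the SENTA scorer (returning a positivity score per review string):

```python
def item_sentiment(reviews: list[str], score_review) -> float:
    """Average review sentiment for one item; 0.0 if it has no reviews."""
    if not reviews:
        return 0.0
    return sum(score_review(r) for r in reviews) / len(reviews)
```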
4.2.4. Sentiment-weighted Symmetric Linear GCN
MEGCF employs two parallel Linear GCN modules (Symmetric) to generate embeddings.
Module 1: LS-GCN-1 (Capturing CF Signals)
This module operates on the user-item graph. It propagates information between users and items. At layer $l+1$, the embedding of item $i$ is aggregated from its neighbor users $\mathcal{N}_i$, and symmetrically for user $u$.

The Formula (Item Update):
$$e_i^{(l+1)} = \sum_{u \in \mathcal{N}_i} w_i \cdot \frac{1}{\left(|\mathcal{N}_u|\,|\mathcal{N}_i|\right)^{\gamma}} \, e_u^{(l)}$$

The Formula (User Update):
$$e_u^{(l+1)} = \sum_{i \in \mathcal{N}_u} w_i \cdot \frac{1}{\left(|\mathcal{N}_u|\,|\mathcal{N}_i|\right)^{\gamma}} \, e_i^{(l)}$$

Symbol Explanation:
- $e^{(l)}$: Embedding from the previous layer.
- $w_i = \dfrac{s_i + \epsilon}{\frac{1}{N}\sum_{j=1}^{N}(s_j + \epsilon)}$: Sentiment Weighting Term.
  - $s_i$: Sentiment score of item $i$.
  - $\epsilon$: A smoothing parameter (set to 0.1).
  - $N$: Total number of items.
  - This term normalizes the sentiment score across all items, giving higher weight to items with better reviews.
- $\frac{1}{(|\mathcal{N}_u|\,|\mathcal{N}_i|)^{\gamma}}$: Popularity-aware Graph Laplacian Norm.
  - $|\mathcal{N}_u|$, $|\mathcal{N}_i|$: The degree (number of neighbors) of a node.
  - $\gamma$: A hyperparameter to adjust sensitivity to popularity (node degree). Standard GCN sets $\gamma = 0.5$.
Module 2: LS-GCN-2 (Capturing Multimodal Semantic Correlation)
This module operates on the full graph (User-Item-Entity). It incorporates the semantic entities into the embeddings. The embeddings here are denoted with a star ($e^{*}$).
User Update (via Items):
$$e_u^{*(l+1)} = \sum_{i \in \mathcal{N}_u} w_i \cdot \Lambda_{ui} \, e_i^{*(l)}$$

Entity Update (via Items):
$$e_e^{*(l+1)} = \sum_{i \in \mathcal{N}_e} w_i \cdot \Lambda_{ei} \, e_i^{*(l)}$$

Item Update (via Users AND Entities): Since items connect to both users and entities, they aggregate from both types of neighbors:
$$e_i^{*(l+1)} = \sum_{u \in \mathcal{N}_i^{U}} w_i \cdot \Lambda_{ui} \, e_u^{*(l)} + \sum_{e \in \mathcal{N}_i^{E}} w_i \cdot \Lambda_{ie} \, e_e^{*(l)}$$
(Note: For clarity, the repeated sentiment and Laplacian terms are abbreviated as $w_i$ and $\Lambda$; the original paper writes them out fully in Eq. (7). The logic is simply summing the weighted contributions from user neighbors and entity neighbors.)
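A minimal sketch of one sentiment-weighted linear propagation step in matrix form, under two simplifying assumptions not stated by the paper: messages originating from item nodes are scaled by the item's normalized sentiment weight, and `gamma = 0.5` recovers the standard symmetric norm.

```python
import numpy as np
import scipy.sparse as sp

def propagate(adj: sp.csr_matrix, emb: np.ndarray, s: np.ndarray,
              n_users: int, eps: float = 0.1, gamma: float = 0.5) -> np.ndarray:
    """One linear GCN layer with sentiment-weighted item messages.

    adj: symmetric adjacency (users first, then items); emb: node embeddings;
    s: per-item sentiment scores s_i.
    """
    deg = np.asarray(adj.sum(axis=1)).ravel()
    deg[deg == 0] = 1.0                          # guard isolated nodes
    d_inv = sp.diags(deg ** -gamma)              # popularity-aware D^-gamma
    norm_adj = d_inv @ adj @ d_inv               # Laplacian-style normalization
    w_item = (s + eps) / np.mean(s + eps)        # normalized sentiment weight
    w = np.concatenate([np.ones(n_users), w_item])
    return norm_adj @ (emb * w[:, None])         # scale each source node's message
```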
4.2.5. Prediction & Optimization
After $L$ layers, the final representation for a user is the last-layer embedding from each module: $e_u^{(L)}$ and $e_u^{*(L)}$ (and likewise for items).
Prediction Score: The preference score is the sum of inner products from both modules:
$$\hat{y}_{ui} = {e_u^{(L)}}^{\top} e_i^{(L)} + {e_u^{*(L)}}^{\top} e_i^{*(L)}$$
Loss Function (BPR Loss): The model uses Bayesian Personalized Ranking (BPR) loss, which encourages the score of an observed interaction $\hat{y}_{ui}$ to be higher than that of an unobserved one $\hat{y}_{uj}$. The total loss is the sum of the losses from both modules:
$$\mathcal{L} = \sum_{(u,i,j)} -\ln \sigma\!\left(\hat{y}_{ui} - \hat{y}_{uj}\right) + \lambda \lVert \Theta \rVert_2^2$$
(where $\sigma(\cdot)$ is the sigmoid function and $\lambda \lVert \Theta \rVert_2^2$ is the $L_2$ regularization term).
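A minimal sketch of the two-module score and the BPR objective, assuming `e_u`/`e_i` are last-layer embeddings from LS-GCN-1 and `eu_star`/`ei_star` from LS-GCN-2:

```python
import torch
import torch.nn.functional as F

def score(e_u, e_i, eu_star, ei_star):
    """Sum of the inner products from both GCN modules."""
    return (e_u * e_i).sum(-1) + (eu_star * ei_star).sum(-1)

def bpr_loss(pos_score, neg_score, params, reg=1e-4):
    """-log sigmoid(pos - neg), averaged, plus L2 regularization."""
    loss = -F.logsigmoid(pos_score - neg_score).mean()
    l2 = sum(p.pow(2).sum() for p in params)
    return loss + reg * l2
```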
5. Experimental Setup
5.1. Datasets
The authors use three real-world datasets.
- Amazon Beauty: E-commerce data containing images, titles, and reviews.
- Amazon Art ("Arts_crafts_and_Sewing"): Similar to Beauty but sparser.
- Taobao: A fashion collocation dataset from a Tianchi competition. Crucially, this dataset has no reviews, so the sentiment weighting component of MEGCF is disabled for Taobao experiments.
Data Statistics:
- Beauty: ~15k Users, ~8.6k Items, ~1k Visual Entities, ~11k Textual Entities.
- Art: ~25k Users, ~9k Items, ~962 Visual Entities.
- Taobao: ~12k Users, ~8.7k Items.
5.2. Evaluation Metrics
- Hit Ratio (HR@k):
  - Concept: Measures whether the test item is present in the top-$k$ recommended list. It answers "Did the user see the relevant item?"
  - Formula: $\mathrm{HR@}k = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \mathbb{1}\left(\mathrm{rank}_u \le k\right)$, where $\mathbb{1}(\cdot)$ is 1 if true, 0 otherwise.
- Normalized Discounted Cumulative Gain (NDCG@k):
  - Concept: Measures the quality of the ranking. It rewards the algorithm more if the relevant item appears higher up in the list (e.g., position 1 is much better than position 10).
  - Formula: $\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$, where $\mathrm{DCG@}k = \sum_{p=1}^{k} \frac{2^{rel_p} - 1}{\log_2(p+1)}$ and IDCG@k is the DCG of the ideal ranking (both metrics are sketched after this list).
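A minimal sketch of both metrics for a leave-one-out protocol, where `rank` is the 1-based position of the held-out item in the recommendation list (with a single relevant item, IDCG = 1):

```python
import math

def hr_at_k(rank: int, k: int) -> float:
    """1 if the held-out item appears in the top-k list, else 0."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank: int, k: int) -> float:
    """Positional discount for a single relevant item: 1/log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0
```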
5.3. Baselines
MEGCF is compared against a comprehensive set of baselines:
- Traditional: BPRMF, SVD++.
- GCN-based: NGCF (Non-linear), LightGCN (Linear, SOTA for pure CF).
- Multimodal: VBPR (Visual MF), CKE (Knowledge+Visual), MMGCN (Multimodal GCN), GRCN (Graph-refined Multimodal).
6. Results & Analysis
6.1. Core Results Analysis
The following are the results from Table 2 of the original paper. This table compares the performance of MEGCF against all baselines across three datasets.
| Metric | Model | Beauty k=5 | Beauty k=10 | Beauty k=20 | Art k=5 | Art k=10 | Art k=20 | Taobao k=5 | Taobao k=10 | Taobao k=20 |
|---|---|---|---|---|---|---|---|---|---|---|
| HR@k | BPRMF | 0.4274 | 0.5173 | 0.6231 | 0.6333 | 0.7052 | 0.7829 | 0.3215 | 0.4049 | 0.5155 |
| | SVD++ | 0.4584 | 0.5520 | 0.6659 | 0.6530 | 0.7425 | 0.8285 | 0.3374 | 0.4293 | 0.5466 |
| | VBPR | 0.4722 | 0.5670 | 0.6665 | 0.6699 | 0.7464 | 0.8262 | 0.3464 | 0.4364 | 0.5512 |
| | CKE | 0.4810 | 0.5894 | 0.6950 | 0.6719 | 0.7632 | 0.8461 | 0.3560 | 0.4550 | 0.5789 |
| | NGCF | 0.4853 | 0.5820 | 0.6810 | 0.6742 | 0.7541 | 0.8287 | 0.3575 | 0.4593 | 0.5841 |
| | MMGCN | 0.4934 | 0.6067 | 0.7166 | 0.6769 | 0.7702 | 0.8546 | 0.3649 | 0.4695 | 0.5902 |
| | LightGCN | 0.5002 | 0.6063 | 0.7178 | 0.6814 | 0.7639 | 0.8329 | 0.3848 | 0.4893 | 0.6237 |
| | GRCN | 0.5087 | 0.6204 | 0.7241 | 0.6905 | 0.7743 | 0.8532 | 0.3865 | 0.4996 | 0.6375 |
| | **MEGCF** | **0.5439** | **0.6464** | **0.7448** | **0.7116** | **0.7902** | **0.8651** | **0.4045** | **0.5212** | **0.6516** |
| NDCG@k | BPRMF | 0.3343 | 0.3634 | 0.3900 | 0.5597 | 0.5829 | 0.6025 | 0.2465 | 0.2733 | 0.3011 |
| | SVD++ | 0.3592 | 0.3895 | 0.4157 | 0.5627 | 0.5916 | 0.6134 | 0.2523 | 0.2819 | 0.3114 |
| | VBPR | 0.3665 | 0.3973 | 0.4224 | 0.5830 | 0.6078 | 0.6280 | 0.2639 | 0.2928 | 0.3216 |
| | CKE | 0.3650 | 0.4002 | 0.4269 | 0.5739 | 0.6030 | 0.6245 | 0.2622 | 0.2941 | 0.3253 |
| | NGCF | 0.3776 | 0.4089 | 0.4339 | 0.5882 | 0.6141 | 0.6330 | 0.2658 | 0.2986 | 0.3301 |
| | MMGCN | 0.3714 | 0.4081 | 0.4359 | 0.5643 | 0.5945 | 0.6159 | 0.2709 | 0.3047 | 0.3351 |
| | LightGCN | 0.3807 | 0.4152 | 0.4435 | 0.5886 | 0.6153 | 0.6340 | 0.2840 | 0.3176 | 0.3515 |
| | GRCN | 0.3910 | 0.4272 | 0.4533 | 0.5937 | 0.6208 | 0.6407 | 0.2861 | 0.3225 | 0.3573 |
| | **MEGCF** | **0.4257** | **0.4590** | **0.4838** | **0.6144** | **0.6398** | **0.6588** | **0.3020** | **0.3397** | **0.3726** |
Analysis of Results:
- Superiority: MEGCF consistently achieves the best performance across all datasets and metrics. It provides an average improvement of 4.40% over the strongest baseline, GRCN.
- Ranking Quality: The improvement in NDCG is notably higher than in HR (e.g., 8.87% improvement in NDCG@5 on Beauty vs GRCN). This suggests that modeling semantic correlations helps the model place the "right" items higher in the list, likely because it captures fine-grained user intents better than broad visual features.
- Impact of Linearity: LightGCN outperforms NGCF, confirming that removing non-linearities is beneficial for recommendation. MEGCF adopts this linear design, contributing to its success.
6.2. Ablation Studies
6.2.1. Importance of Modalities
The authors tested variants without visual (w/o V) or textual (w/o T) entities.
- Result: Removing either modality drops performance.
- Conclusion: Both visual and textual semantics contribute independently to understanding user interest.
The following figure (Figure 5 from the original paper) shows the performance drop when removing modalities. "w/o V&T" (bottom line) performs the worst, proving the value of multimodal data.
Fig. 5. Effect of modality-specific semantic correlation on MEGCF.
6.2.2. Importance of Symmetric GCN Structure
Tested variants:
- w/o g2: Uses only the user-item GCN (g1).
- w/o g1: Uses only the multimodal entity GCN (g2).
- Result: MEGCF (combining both) is superior.
- Insight: The standard interaction graph (g1) provides the base collaborative signal, while the entity graph (g2) provides semantic augmentation. They are complementary.
6.2.3. Sentiment Weighting vs. Attention
The authors compared their review-based sentiment weighting against a standard Graph Attention Network (GAT).
- Finding: MEGCF's sentiment weighting outperforms GAT.
- Reasoning: GAT tries to learn weights from interaction data, which is sparse. MEGCF instead injects external knowledge (sentiment from reviews) into the weights, providing a robust signal about item quality that cannot be learned from clicks alone.
The following figure (Figure 8 from the original paper) shows training trends. MEGCF (red line) consistently stays above the GAT variant (MEGCF_gat, yellow line).
Fig. 8. Performance trends on different training epochs on the Beauty dataset.
6.3. Case Study
To visualize the benefit, the authors examined a specific user in the Taobao dataset.
- Observation: The user interacted with items containing "backpack" entities.
- Result: MEGCF successfully identified "backpack" as a high-preference entity for this user, even though the items looked visually different. This confirms the model learns a preference for the semantic entity (the object) rather than just visual patterns.
The following figure (Figure 10 from the original paper) illustrates this case.
The figure is a schematic showing the relationships among the target user, interacted items, and visual semantic entities in MEGCF; different dashed lines denote the user-entity preference computation and the item-item similarity computation.
7. Conclusion & Reflections
7.1. Conclusion Summary
MEGCF successfully addresses the mismatch problem in multimodal recommendation. By extracting semantic entities (transforming content-oriented features to user-oriented interests) and leveraging explicit sentiment signals from reviews, it builds a robust graph-based recommender. The Symmetric Linear GCN architecture effectively fuses collaborative signals with high-order multimodal semantic correlations, achieving state-of-the-art results.
7.2. Limitations & Future Work
- Entity Extraction Accuracy: The current model relies on off-the-shelf pre-trained models (like PNASNet trained on ImageNet). The categories are limited (1000 classes), and errors in extraction (misidentifying objects) can propagate noise into the graph.
- Future Direction: The authors suggest using larger pre-training datasets or Contrastive Learning to improve feature representation. They also mention exploring Causal Inference to better distinguish which multimodal features truly cause a user interaction.
7.3. Personal Insights & Critique
- Integration of "Static" & "Dynamic" Signals: One of the paper's smartest moves is using review sentiment as a static edge weight. While Attention mechanisms (dynamic weights) are popular, they are data-hungry. Hard-coding "quality" via sentiment is a computationally efficient and effective way to inject domain knowledge into the graph structure.
- Interpretability: The entity-based approach offers excellent interpretability. Unlike dense vectors where we don't know why a user likes an item, MEGCF can explicitly show "User U likes Item I because it contains Entity E (Backpack)." This is valuable for explainable AI in e-commerce.
- Potential Issue: The reliance on ImageNet categories is a bottleneck. In fashion (Taobao), specific styles (e.g., "vintage," "bohemian") might matter more than broad object labels ("shirt"). A specialized fashion-detector would likely boost MEGCF's performance significantly.