Comprehending Knowledge Graphs with Large Language Models for Recommender Systems
TL;DR Summary
CoLaKG employs large language models to enrich knowledge graphs by addressing missing facts, capturing high-order connections, and preserving semantic information, significantly enhancing recommendation results on multiple real-world datasets.
Abstract
In recent years, the introduction of knowledge graphs (KGs) has significantly advanced recommender systems by facilitating the discovery of potential associations between items. However, existing methods still face several limitations. First, most KGs suffer from missing facts or limited scopes. Second, existing methods convert textual information in KGs into IDs, resulting in the loss of natural semantic connections between different items. Third, existing methods struggle to capture high-order connections in the global KG. To address these limitations, we propose a novel method called CoLaKG, which leverages large language models (LLMs) to improve KG-based recommendations. The extensive knowledge and remarkable reasoning capabilities of LLMs enable our method to supplement missing facts in KGs, and their powerful text understanding abilities allow for better utilization of semantic information. Specifically, CoLaKG extracts useful information from KGs at both local and global levels. By employing the item-centered subgraph extraction and prompt engineering, it can accurately understand the local information. In addition, through the semantic-based retrieval module, each item is enriched by related items from the entire knowledge graph, effectively harnessing global information. Furthermore, the local and global information are effectively integrated into the recommendation model through a representation fusion module and a retrieval-augmented representation learning module, respectively. Extensive experiments on four real-world datasets demonstrate the superiority of our method.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Comprehending Knowledge Graphs with Large Language Models for Recommender Systems
- Authors: Ziqiang Cui, Yunpeng Weng, Xing Tang, Fuyuan Lyu, Dugang Liu, Xiuqiang He, and Chen Ma.
- Affiliations: The authors are affiliated with City University of Hong Kong, Tencent, McGill University & MILA, Shenzhen University, and Shenzhen Technology University. This indicates a collaboration between academic institutions and industry (Tencent), suggesting the research has both theoretical grounding and practical relevance.
- Journal/Conference: The paper is submitted to the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25). SIGIR is a premier, top-tier international conference in the field of information retrieval, known for its rigorous peer-review process and high impact.
- Publication Year: 2025 (as per the submission details).
- Abstract: The paper proposes a novel method, CoLaKG, to enhance knowledge graph (KG)-based recommender systems using Large Language Models (LLMs). It addresses three key limitations of existing methods: (1) KGs often have missing facts; (2) converting text to IDs loses semantic meaning; and (3) capturing high-order (distant) connections in a KG is difficult. CoLaKG leverages the knowledge and reasoning of LLMs to supplement KGs and understand their semantics. It comprehends KGs at both a local level (by analyzing item-centered subgraphs via prompts) and a global level (by retrieving semantically related items from the entire KG). This information is then integrated into a recommendation model through representation fusion and a retrieval-augmented learning module. Experiments on four datasets show that CoLaKG significantly outperforms existing methods.
- Original Source Link:
- ArXiv: https://arxiv.org/abs/2410.12229v3
- PDF: https://arxiv.org/pdf/2410.12229v3.pdf
- The paper is a preprint on ArXiv, submitted for publication at SIGIR '25.
2. Executive Summary
-
Background & Motivation (Why): Recommender systems are essential for navigating information overload, but they often struggle with data sparsity (not enough user interaction data). To overcome this, many systems incorporate external knowledge using Knowledge Graphs (KGs), which connect items based on their real-world attributes (e.g., a movie's director, genre, actors). However, existing KG-based methods have critical flaws:
- Incomplete Knowledge: KGs are often manually built and incomplete. If a movie is missing its "genre" attribute, it loses a potential connection to other movies of the same genre.
- Semantic Loss: Traditional methods convert textual information (like "horror" and "thriller") into distinct numerical IDs. This process discards the rich semantic relationship between the words, treating them as completely unrelated.
- Limited Scope: These methods, often based on Graph Neural Networks (GNNs), struggle to capture connections between items that are many "hops" away from each other in the KG, a problem known as over-smoothing. The paper proposes to use Large Language Models (LLMs) to address these gaps. LLMs possess vast world knowledge and a deep understanding of text, making them ideal for "filling in the blanks" in a KG and recognizing semantic similarities that ID-based methods miss.
-
Main Contributions / Findings (What): The paper introduces CoLaKG (Comprehending Knowledge Graphs with Large Language Models), a novel framework with several key contributions:
- LLM-Powered KG Comprehension: It is the first to use LLMs to systematically comprehend KGs for recommendations by understanding both their structure and semantics. This helps mitigate the issues of missing facts and semantic loss.
- Dual Local and Global Perspective: CoLaKG analyzes KGs from two viewpoints:
- Local: It prompts an LLM with an item's immediate neighborhood (1-hop and 2-hop connections) to generate a rich, semantically aware description.
- Global: It uses the LLM-generated descriptions to find semantically similar items across the entire KG, even if they are structurally distant. This is achieved via a retrieval mechanism.
- Efficient Decoupled Architecture: The LLM-based comprehension is performed offline as a pre-processing step. The resulting semantic embeddings are then fed into a standard recommendation model. This means the LLM is not needed during the live recommendation (inference) phase, making the system efficient and practical for real-world deployment.
- Superior Performance: Extensive experiments show that CoLaKG significantly outperforms traditional, KG-based, and other LLM-based recommendation models on four diverse datasets.
3. Prerequisite Knowledge & Related Work
-
Foundational Concepts:
- Recommender Systems: Systems that predict a user's interest in an item and suggest items they are likely to enjoy. They are widely used on platforms like Netflix, Amazon, and Spotify.
- Collaborative Filtering (CF): A classic recommendation technique that works on the principle "users who liked similar items in the past will like similar items in the future." It relies solely on the user-item interaction matrix but suffers from data sparsity when there are few interactions.
- Knowledge Graph (KG): A structured representation of facts in the form of a graph. It consists of nodes (entities, e.g., "Titanic," "Leonardo DiCaprio") and edges (relations, e.g., "starred_in"). In recommendations, KGs provide rich side information about items, helping to connect them in meaningful ways (e.g., "Movie A" and "Movie B" are connected because they share the same director).
- Graph Neural Networks (GNNs): A class of neural networks designed to work with graph-structured data. They operate by passing "messages" between connected nodes and aggregating information from their neighbors. In KG-based recommendation, GNNs propagate information across the KG to learn better item and user representations. However, stacking too many GNN layers can lead to the over-smoothing problem, where all node representations become indistinguishable.
- Large Language Models (LLMs): Massive neural networks (like GPT-4 or DeepSeek) trained on vast amounts of text data. They excel at understanding natural language, reasoning, and possess extensive "world knowledge," which can be used to infer missing information or understand semantic nuances.
- Embeddings: Numerical vector representations of entities like users, items, or words. These vectors capture the semantic meaning of the entity, such that similar entities have similar vectors.
-
Previous Works: The paper positions itself within the evolution of knowledge-aware recommendation:
- Embedding-based Methods (e.g., CKE): These methods learn embeddings for entities and relations in the KG and integrate them into the recommendation model. They primarily use the KG's structure but often ignore the textual semantics.
- Path-based Methods (e.g., PER, MCRec): These methods explicitly define and use "meta-paths" (e.g., User → Item → Actor → Item) to capture high-order relationships. Their main drawback is the need for manual, domain-specific design of these paths.
- GNN-based Methods (e.g., KGAT, KGIN): These have become the standard. They automatically learn how to propagate information through the KG. However, as noted, they suffer from ID-based semantic loss and have difficulty capturing global connections. Some recent methods (KGCL, KGRec) use contrastive learning to reduce noise but still operate on the ID-based graph.
- LLMs for Recommendation (e.g., RLMRec, KAR): Recent works have started using LLMs for recommendations. Some use LLMs to generate item descriptions, while others use them as the recommender itself (zero-shot recommendation). However, these methods often don't leverage the structured, task-specific knowledge available in a KG.
-
Differentiation:
CoLaKG is distinct from prior art in a crucial way. Instead of just using the KG's structure or using an LLM in isolation, it creates a synergy: it uses the LLM to "read" and "understand" the KG.
- Compared to GNN-based methods, CoLaKG directly processes text, avoiding semantic loss, and uses its inherent knowledge to complete the KG. Its retrieval mechanism for global context bypasses the over-smoothing limitations of deep GNNs.
- Compared to other LLM-based methods, CoLaKG grounds the LLM's reasoning in a task-relevant KG, mitigating the risk of LLM "hallucinations" (generating plausible but incorrect information) and focusing its power on the specific recommendation task.
4. Methodology (Core Technology & Implementation)
The CoLaKG framework is a two-stage process, as illustrated in Figure 2.

Stage 1: KG Comprehension with LLMs (Offline)
This stage generates semantic embeddings for items and users.
-
4.1.1 Local KG Comprehension (for each item): The goal is to generate a rich semantic representation for each item using its local neighborhood in the KG.
-
Subgraph Extraction: For an item $v$, the model extracts its 1-hop and 2-hop neighbors from the KG.
- 1-hop: All triples $(v, r, e)$ where $v$ is the head entity (e.g., (Titanic, has_genre, Romance)).
- 2-hop: To avoid an explosion of nodes, it randomly samples triples connected to the 1-hop neighbors (e.g., if "James Cameron" is a 1-hop neighbor of "Titanic", a 2-hop triple might be (James Cameron, directed, Avatar)).
-
Prompt Engineering: The extracted triples are converted into natural language and formatted into a prompt for the LLM. As shown in Figure 3, the prompt instructs the LLM to act as an expert, summarize the item's attributes based on the provided 1-hop and 2-hop information, and infer any related details.
(Figure 3: The prompt template used for local KG comprehension; the example shows 1-hop and 2-hop relation information for movie recommendation, such as director and actor associations.)
-
LLM-based Generation: The LLM processes the prompt to generate a comprehensive textual summary of the item. This process can be formulated as $C_v = \mathrm{LLM}(P, D_v, D'_v)$, where:
- $C_v$: The textual comprehension generated by the LLM for item $v$.
- $P$: The system prompt/instruction.
- $D_v$: Text representation of the 1-hop triples.
- $D'_v$: Text representation of the sampled 2-hop triples.
-
Semantic Embedding: The textual summary $C_v$ is converted into a dense vector embedding $s_v$ using a pre-trained sentence embedding model (such as SimCSE).
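A minimal Python sketch of this offline comprehension step is shown below. The helper names `build_item_prompt`, `call_llm` (a stand-in for any LLM API client), and `embed_text` (a stand-in for a sentence encoder such as SimCSE) are assumptions for illustration, and the prompt wording only approximates the template shown in Figure 3.

```python
import random

def build_item_prompt(item_name, one_hop, two_hop, num_two_hop=5):
    """Render 1-hop triples and sampled 2-hop triples into a natural-language prompt.
    `one_hop` and `two_hop` are lists of (head, relation, tail) tuples."""
    sampled = random.sample(two_hop, min(num_two_hop, len(two_hop)))
    one_hop_text = "; ".join(f"({h}, {r}, {t})" for h, r, t in one_hop)
    two_hop_text = "; ".join(f"({h}, {r}, {t})" for h, r, t in sampled)
    return (
        "You are a domain expert. Based on the knowledge-graph facts below, "
        "summarize the item's attributes and infer any related details.\n"
        f"Item: {item_name}\n"
        f"1-hop facts: {one_hop_text}\n"
        f"2-hop facts: {two_hop_text}\n"
        "Summary:"
    )

def comprehend_item(item_name, one_hop, two_hop, call_llm, embed_text):
    """C_v = LLM(P, D_v, D'_v) followed by s_v = Emb(C_v).
    `call_llm` and `embed_text` are caller-supplied callables (LLM client, sentence encoder)."""
    prompt = build_item_prompt(item_name, one_hop, two_hop)
    summary = call_llm(prompt)    # offline LLM call producing the textual comprehension C_v
    return embed_text(summary)    # dense semantic embedding s_v
```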
-
4.1.2 Retrieval-based Global KG Utilization: This component captures long-range semantic relationships across the entire KG.
- Similarity Calculation: Using the semantic embeddings of all items, the model computes the cosine similarity between every pair of items. This creates a fully connected semantic item-item graph whose edge weights represent semantic similarity.
- Neighbor Retrieval: For each item $v$, the model retrieves the top-$k$ most semantically similar items, forming its neighbor set $N_k(v)$. These neighbors can be structurally distant in the original KG but are semantically close.
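A small numpy sketch of this retrieval step follows. The function name and the brute-force cosine-similarity computation (rather than an approximate nearest-neighbor index) are assumptions made here for clarity.

```python
import numpy as np

def retrieve_semantic_neighbors(item_embs: np.ndarray, k: int = 10) -> np.ndarray:
    """item_embs: (num_items, dim) matrix of semantic embeddings s_v.
    Returns a (num_items, k) index array: the top-k most similar items N_k(v)
    for every item v, excluding the item itself."""
    normed = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    sim = normed @ normed.T            # pairwise cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)     # an item should not retrieve itself
    return np.argsort(-sim, axis=1)[:, :k]
```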
-
4.2 User Preference Comprehension: A similar process is applied to understand user preferences.
- For a user $u$, the model gathers all items they have interacted with and their corresponding 1-hop KG information.
- This information is concatenated into a single text block and fed to the LLM with a prompt asking it to summarize the user's tastes.
- The LLM's output is converted into a user semantic embedding $s_u$.
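By analogy with the item-level step, a rough sketch of the user-side comprehension is given below. The helper names, the prompt text, and the truncation to `max_items` interactions are assumptions, not the paper's exact template.

```python
def comprehend_user(user_items, item_facts, call_llm, embed_text, max_items=30):
    """Build a user prompt from interacted items plus their 1-hop KG facts,
    then return the user semantic embedding s_u.
    `item_facts` maps an item name to a list of (head, relation, tail) triples."""
    lines = []
    for item in user_items[:max_items]:
        facts = "; ".join(f"{r}: {t}" for _, r, t in item_facts.get(item, []))
        lines.append(f"- {item} ({facts})")
    prompt = (
        "Summarize this user's preferences based on the items they interacted with "
        "and those items' attributes:\n" + "\n".join(lines)
    )
    return embed_text(call_llm(prompt))   # user semantic embedding s_u
```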
Stage 2: Retrieval-Augmented Representation Learning (Online Training)
This stage integrates the pre-computed semantic embeddings into the recommendation model.
-
4.3.1 Cross-Modal Representation Alignment: The model needs to combine the traditional ID-based embeddings with the new LLM-derived semantic embeddings ($s_v$ and $s_u$).
- Alignment: Since these embeddings come from different "modalities" and may have different dimensions, a learnable adapter network (a linear projection layer with an ELU activation function) maps the semantic embeddings into the same space as the ID embeddings, i.e., $\hat{s} = \mathrm{ELU}(W s)$.
  - $W$: The learnable weight matrices for alignment.
  - $\hat{s}$: The projected semantic embeddings.
- Fusion: The ID and projected semantic embeddings are fused using simple mean pooling to create initial user and item representations that contain both collaborative and semantic signals.
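A minimal PyTorch sketch of the adapter and the mean-pooling fusion is shown below; the class name and the embedding dimensions (768 for the semantic side, 64 for the ID side) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SemanticAdapter(nn.Module):
    """Projects frozen LLM semantic embeddings into the ID-embedding space
    (linear layer + ELU), then fuses the two views by mean pooling."""
    def __init__(self, sem_dim: int = 768, id_dim: int = 64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(sem_dim, id_dim), nn.ELU())

    def forward(self, id_emb: torch.Tensor, sem_emb: torch.Tensor) -> torch.Tensor:
        projected = self.proj(sem_emb)      # aligned semantic embedding
        return (id_emb + projected) / 2.0   # mean-pooling fusion of ID and semantic views
```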
-
4.3.2 Item Representation Augmentation with Retrieved Neighbors: The item representations are further enhanced using the globally retrieved neighbors $N_k(v)$.
- Attention Mechanism: An attention mechanism calculates the importance of each neighbor to the central item $v$. The attention scores are computed based on their semantic embeddings, ensuring the aggregation is guided by semantic relevance.
- Aggregation: The final augmented item representation is a combination of the item's own fused embedding and a weighted average of its neighbors' embeddings.
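A sketch of this semantic-attention-guided aggregation is given below. The equal self/neighbor weighting via `alpha` and the dot-product attention form are assumptions; the paper's exact combination may differ.

```python
import torch
import torch.nn.functional as F

def augment_items(item_repr, sem_emb, neighbor_idx, alpha=0.5):
    """item_repr:    (N, d)   fused ID+semantic item representations
    sem_emb:      (N, d_s) frozen semantic embeddings s_v (drive the attention)
    neighbor_idx: (N, k)   indices of the retrieved neighbors N_k(v)"""
    neigh_sem = sem_emb[neighbor_idx]                          # (N, k, d_s)
    scores = torch.einsum("nd,nkd->nk", sem_emb, neigh_sem)    # semantic relevance scores
    attn = F.softmax(scores, dim=1)                            # (N, k) attention weights
    neigh_repr = item_repr[neighbor_idx]                       # (N, k, d)
    aggregated = torch.einsum("nk,nkd->nd", attn, neigh_repr)  # weighted neighbor average
    return alpha * item_repr + (1.0 - alpha) * aggregated
```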
-
4.4-4.5 User-Item Modeling and Training:
- Recommendation Backbone: The final user representations and augmented item representations are used as the initial embeddings for a standard recommendation model. The paper uses LightGCN for its simplicity and effectiveness. LightGCN then performs several layers of message passing on the user-item interaction graph to capture collaborative filtering signals.
- Prediction: The final prediction score is the inner product of the final user and item embeddings learned by LightGCN.
- Training: The model is trained end-to-end (except for the fixed LLM-generated embeddings) using the Bayesian Personalized Ranking (BPR) loss, a standard pairwise loss function that aims to rank observed (positive) items higher than unobserved (negative) items for a given user.
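The scoring and BPR objective on top of the backbone can be summarized by the short sketch below; the LightGCN propagation itself is omitted and the function names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def score(user_emb: torch.Tensor, item_emb: torch.Tensor) -> torch.Tensor:
    """Predicted preference: inner product of the final user and item embeddings."""
    return (user_emb * item_emb).sum(dim=-1)

def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
    """Bayesian Personalized Ranking: push the observed (positive) item's score
    above a sampled unobserved (negative) item's score for the same user."""
    diff = score(user_emb, pos_item_emb) - score(user_emb, neg_item_emb)
    return -F.logsigmoid(diff).mean()
```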
5. Experimental Setup
-
Datasets: Experiments were conducted on four real-world datasets from different domains to ensure generalizability. The statistics are transcribed from Table 1 below.
- MovieLens-1M: A popular benchmark for movie recommendations.
- Last-FM: A music recommendation dataset based on user listening habits.
- MIND: A large-scale news recommendation dataset.
- Fund: An industrial dataset for financial fund recommendation.
(Manual Transcription of Table 1) Table 1: Dataset statistics.
| Statistics | MovieLens | Last-FM | MIND | Funds |
| --- | --- | --- | --- | --- |
| # Users | 6,040 | 1,859 | 44,603 | 209,999 |
| # Items | 3,260 | 2,813 | 15,174 | 5,701 |
| # Interactions | 998,539 | 86,608 | 1,285,064 | 1,225,318 |
| Knowledge Graph | | | | |
| # Entities | 12,068 | 9,614 | 32,810 | 8,111 |
| # Relations | 12 | 2 | 14 | 12 |
| # Triples | 62,958 | 118,500 | 307,140 | 65,697 |
-
Evaluation Metrics: The performance of top-N recommendation is evaluated using two standard metrics, Recall@k and NDCG@k, for k = 10 and 20.
- Recall@k:
  - Conceptual Definition: This metric measures the proportion of relevant items (those the user actually interacted with in the test set) that are found in the top-k recommended items. It answers the question: "Out of all the items the user liked, how many did we manage to recommend in the top-k list?"
  - Mathematical Formula: For a single user, $\mathrm{Recall@}k = \frac{|\text{Recommended Items@}k \,\cap\, \text{Relevant Items}|}{|\text{Relevant Items}|}$. The final value is averaged over all users.
  - Symbol Explanation: Recommended Items@k is the set of top-k items suggested by the model. Relevant Items is the set of items from the test set that the user has interacted with.
- Normalized Discounted Cumulative Gain (NDCG@k):
  - Conceptual Definition: NDCG@k is a more sophisticated metric that evaluates the quality of the ranking. It rewards models for placing relevant items higher up in the top-k list: a relevant item at position 1 is more valuable than one at position 10. The score is normalized so that a perfect ranking yields a score of 1.
  - Mathematical Formula: $\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$ and $\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$.
  - Symbol Explanation: $i$ is the rank position; $rel_i$ is the relevance of the item at position $i$ (1 if relevant, 0 otherwise). DCG@k is the Discounted Cumulative Gain. IDCG@k is the Ideal DCG, which is the DCG of a perfect ranking where all relevant items are placed at the top.
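Both metrics can be computed per user as in the short sketch below (binary relevance, log2 position discount); the function names are illustrative, not taken from the paper.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the user's test (relevant) items that appear in the top-k recommendations."""
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / max(len(relevant_items), 1)

def ndcg_at_k(ranked_items, relevant_items, k):
    """DCG of the produced ranking divided by the DCG of an ideal ranking."""
    relevant = set(relevant_items)
    dcg = sum(1.0 / math.log2(pos + 2)          # positions are 0-indexed here
              for pos, item in enumerate(ranked_items[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0
```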
-
Baselines:
CoLaKG is compared against a comprehensive set of 12 baselines from three categories:
- Classical Methods: BPR-MF, NFM, LightGCN. These rely mostly on collaborative filtering signals.
- KG-enhanced Methods: CKE, RippleNet, KGAT, KGIN, KGCL, KGRec. These represent the state-of-the-art in using KGs for recommendation.
- LLM-based Methods: RLMRec, KAR, CLLM4Rec. These represent alternative ways of leveraging LLMs in recommendation.
6. Results & Analysis
-
Core Results: The main results, transcribed from Table 2, show the performance of CoLaKG against all baselines.

(Manual Transcription of Table 2)

| Model | ML R@10 | ML N@10 | ML R@20 | ML N@20 | Last-FM R@10 | Last-FM N@10 | Last-FM R@20 | Last-FM N@20 | MIND R@10 | MIND N@10 | MIND R@20 | MIND N@20 | Funds R@10 | Funds N@10 | Funds R@20 | Funds N@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BPR-MF | 0.1257 | 0.3100 | 0.2048 | 0.3062 | 0.1307 | 0.1352 | 0.1971 | 0.1685 | 0.0315 | 0.0238 | 0.0537 | 0.0310 | 0.4514 | 0.3402 | 0.5806 | 0.3809 |
| NFM | 0.1346 | 0.3558 | 0.2129 | 0.3379 | 0.2246 | 0.2327 | 0.3273 | 0.2830 | 0.0495 | 0.0356 | 0.0802 | 0.0458 | 0.4388 | 0.3187 | 0.5756 | 0.3651 |
| LightGCN | 0.1598 | 0.3901 | 0.2512 | 0.3769 | 0.2589 | 0.2799 | 0.3642 | 0.3321 | 0.0624 | 0.0492 | 0.0998 | 0.0609 | 0.4992 | 0.3778 | 0.6353 | 0.4204 |
| CKE | 0.1524 | 0.3783 | 0.2373 | 0.3609 | 0.2342 | 0.2545 | 0.3266 | 0.3001 | 0.0526 | 0.0417 | 0.0822 | 0.0510 | 0.4926 | 0.3702 | 0.6294 | 0.4130 |
| RippleNet | 0.1415 | 0.3669 | 0.2201 | 0.3423 | 0.2267 | 0.2341 | 0.3248 | 0.2861 | 0.0472 | 0.0364 | 0.0785 | 0.0451 | 0.4764 | 0.3591 | 0.6124 | 0.4003 |
| KGAT | 0.1536 | 0.3782 | 0.2451 | 0.3661 | 0.2470 | 0.2595 | 0.3433 | 0.3075 | 0.0594 | 0.0456 | 0.0955 | 0.0571 | 0.5037 | 0.3751 | 0.6418 | 0.4182 |
| KGIN | 0.1631 | 0.3959 | 0.2562 | 0.3831 | 0.2562 | 0.2742 | 0.3611 | 0.3215 | 0.0640 | 0.0518 | 0.1022 | 0.0639 | 0.5079 | 0.3857 | 0.6428 | 0.4259 |
| KGCL | 0.1554 | 0.3797 | 0.2465 | 0.3677 | 0.2599 | 0.2763 | 0.3652 | 0.3284 | 0.0671 | 0.0543 | 0.1059 | 0.0670 | 0.5071 | 0.3877 | 0.6355 | 0.4273 |
| KGRec | 0.1640 | 0.3968 | 0.2571 | 0.3842 | 0.2571 | 0.2748 | 0.3617 | 0.3251 | 0.0627 | 0.0506 | 0.1003 | 0.0625 | 0.5104 | 0.3913 | 0.6467 | 0.4304 |
| RLMRec | 0.1613 | 0.3920 | 0.2524 | 0.3787 | 0.2597 | 0.2812 | 0.3651 | 0.3335 | 0.0619 | 0.0486 | 0.0990 | 0.0602 | 0.4988 | 0.3784 | 0.6351 | 0.4210 |
| KAR | 0.1582 | 0.3869 | 0.2511 | 0.3722 | 0.2532 | 0.2770 | 0.3612 | 0.3324 | 0.0615 | 0.0480 | 0.1002 | 0.0613 | 0.5033 | 0.3812 | 0.6312 | 0.4175 |
| CLLM4Rec | 0.1563 | 0.3841 | 0.2433 | 0.3637 | 0.2571 | 0.2793 | 0.3642 | 0.3268 | 0.0631 | 0.0494 | 0.1012 | 0.0628 | 0.4996 | 0.3791 | 0.6273 | 0.4103 |
| CoLaKG | 0.1699 | 0.4130 | 0.2642 | 0.3974 | 0.2738 | 0.2948 | 0.3803 | 0.3471 | 0.0698 | 0.0562 | 0.1087 | 0.0684 | 0.5273 | 0.4012 | 0.6524 | 0.4392 |

Analysis:
- Consistent Superiority: CoLaKG achieves the best performance on all metrics across all four datasets, demonstrating its effectiveness and robustness.
- Value of KGs: KG-enhanced methods like KGIN and KGRec generally outperform classical CF methods like LightGCN, confirming that KGs provide valuable side information.
- Superiority over LLM Baselines: CoLaKG significantly outperforms other LLM-based methods (RLMRec, KAR, CLLM4Rec). This suggests that CoLaKG's approach of using LLMs to deeply comprehend a task-specific KG is more effective than using LLMs in a more general way (e.g., just for generating item profiles).
-
Validation of Generalizability: Table 3 shows that the semantic information generated by CoLaKG can be plugged into various recommendation backbones (BPR-MF, NFM, LightGCN) and consistently improves their performance. This highlights that the core contribution of CoLaKG, the LLM-powered KG comprehension, is a versatile module that can enhance a wide range of models.

(Manual Transcription of Table 3)

| Model | MovieLens R@20 | MovieLens N@20 | Last-FM R@20 | Last-FM N@20 | MIND R@20 | MIND N@20 |
| --- | --- | --- | --- | --- | --- | --- |
| BPR-MF | 0.2048 | 0.3062 | 0.1971 | 0.1685 | 0.0537 | 0.0310 |
| BPR-MF+Ours | 0.2213 | 0.3255 | 0.2104 | 0.1812 | 0.0609 | 0.3986 |
| NFM | 0.2129 | 0.3379 | 0.3273 | 0.2830 | 0.0802 | 0.0458 |
| NFM+Ours | 0.2285 | 0.3527 | 0.3478 | 0.2996 | 0.0859 | 0.0487 |
| LightGCN | 0.2512 | 0.3769 | 0.3642 | 0.3321 | 0.0998 | 0.0609 |
| LightGCN+Ours | 0.2642 | 0.3974 | 0.3803 | 0.3471 | 0.1087 | 0.0684 |
Ablation Study: This study dissects the
CoLaKGmodel to verify the contribution of each component.(Manual Transcription of Table 4)
| Dataset | Metric | w/o sv | w/o su | w/o Nk(v) | w/o D'v | CoLaKG |
| --- | --- | --- | --- | --- | --- | --- |
| ML | R@20 | 0.2553 | 0.2613 | 0.2603 | 0.2628 | 0.2642 |
| ML | N@20 | 0.3811 | 0.3948 | 0.3902 | 0.3960 | 0.3974 |
| Last-FM | R@20 | 0.3628 | 0.3785 | 0.3725 | 0.3789 | 0.3803 |
| Last-FM | N@20 | 0.3278 | 0.3465 | 0.3403 | 0.3459 | 0.3471 |
| MIND | R@20 | 0.1043 | 0.1048 | 0.1064 | 0.1076 | 0.1087 |
| MIND | N@20 | 0.0640 | 0.0658 | 0.0662 | 0.0671 | 0.0684 |
| Funds | R@20 | 0.6382 | 0.6481 | 0.6455 | 0.6499 | 0.6524 |
| Funds | N@20 | 0.4247 | 0.4351 | 0.4305 | 0.4378 | 0.4392 |

Analysis:
- w/o sv (no item semantic embedding): Removing the LLM-generated item embeddings causes the largest performance drop, confirming that local KG comprehension is the most critical component.
- w/o su (no user semantic embedding): Performance also drops, showing that using the LLM to model user preferences is beneficial.
- w/o Nk(v) (no neighbor augmentation): Removing the retrieval-based global neighbor aggregation leads to a significant drop, proving the importance of capturing global semantic information beyond the local KG subgraph.
- w/o D'v (no 2-hop KG info): Removing the 2-hop relations from the prompt causes a slight performance decrease, suggesting that providing richer context (even if sampled) helps the LLM generate better summaries.
-
Hyperparameter Study: Figure 4 investigates the impact of two key hyperparameters: the number of retrieved neighbors $k$ and the number of sampled 2-hop neighbors.

Analysis:
- The performance generally improves as $k$ increases from 0, peaking at an intermediate value that depends on the dataset. This shows that retrieving global neighbors is beneficial, but retrieving too many can introduce noise and slightly degrade performance. The $k=0$ case is equivalent to the w/o $N_k(v)$ ablation and performs worst.
- The impact of the number of sampled 2-hop neighbors is also positive, showing that including more 2-hop information helps, but the gains diminish after a certain point.
7. Conclusion & Reflections
-
Conclusion Summary: The paper successfully introduces CoLaKG, a novel and effective framework that leverages LLMs to overcome fundamental limitations in KG-based recommender systems. By using LLMs to comprehend KGs at both local (subgraph analysis) and global (semantic retrieval) levels, CoLaKG can reason about missing facts, understand textual semantics, and capture long-range item relationships. Its decoupled architecture ensures efficiency, making it a powerful and practical solution. The strong empirical results across four diverse datasets validate the superiority of this approach.
-
Limitations & Future Work (Author-Stated and Inferred):
- Computational Cost: While inference is efficient, the initial offline stage of querying the LLM for every item and user can be computationally expensive and time-consuming, especially for datasets with millions of items.
- Dependence on LLM Quality: The performance of CoLaKG is inherently tied to the capabilities of the chosen LLM and text embedding model. A less powerful LLM might generate lower-quality summaries.
- Prompt Sensitivity: The design of the prompts is crucial. The current prompts may not be optimal for all types of KGs or domains, and future work could explore automated prompt engineering.
- Static Nature: The current framework processes the KG in a static, offline manner. In real-world systems where new items and users appear constantly (the "cold-start" problem), a mechanism for dynamically updating the semantic embeddings would be needed.
-
Personal Insights & Critique:
- Novelty and Impact: The core idea of using an LLM as a "reasoning engine" to interpret a symbolic knowledge base (the KG) is highly innovative and powerful. It bridges the gap between structured knowledge and the unstructured understanding of LLMs. The decoupled design is a major strength, making it immediately more practical than methods requiring LLMs at inference time.
- Transferability: This approach is highly transferable. It could be applied to any recommendation domain where items have rich textual attributes and relational data, such as e-commerce (product specifications), academic paper recommendation (authors, topics, venues), or even social network recommendations.
- Potential Improvements:
- More Sophisticated Sampling: Instead of random sampling for 2-hop neighbors, a more intelligent, attention-based sampling method could be used to select more relevant 2-hop information to include in the prompt.
- Advanced Fusion: The fusion of ID and semantic embeddings currently uses simple mean pooling. A more complex, attention-based fusion gate could learn to dynamically weigh the importance of collaborative vs. semantic information for different items or users.
- End-to-End Fine-tuning: While costly, exploring methods to fine-tune a smaller, domain-specific language model as part of the end-to-end training loop could potentially yield even better performance.