Knowledge-aware Graph Neural Networks with Label Smoothness Regularization for Recommender Systems
TL;DR Summary
KGNN-LS models user-specific KG relations with label smoothness regularization, enabling end-to-end training for better personalized recommendations, notably improving cold-start performance and scalability, outperforming state-of-the-art baselines across datasets.
Abstract
Knowledge graphs capture structured information and relations between a set of entities or items. As such knowledge graphs represent an attractive source of information that could help improve recommender systems. However, existing approaches in this domain rely on manual feature engineering and do not allow for an end-to-end training. Here we propose Knowledge-aware Graph Neural Networks with Label Smoothness regularization (KGNN-LS) to provide better recommendations. Conceptually, our approach computes user-specific item embeddings by first applying a trainable function that identifies important knowledge graph relationships for a given user. This way we transform the knowledge graph into a user-specific weighted graph and then apply a graph neural network to compute personalized item embeddings. To provide better inductive bias, we rely on label smoothness assumption, which posits that adjacent items in the knowledge graph are likely to have similar user relevance labels/scores. Label smoothness provides regularization over the edge weights and we prove that it is equivalent to a label propagation scheme on a graph. We also develop an efficient implementation that shows strong scalability with respect to the knowledge graph size. Experiments on four datasets show that our method outperforms state of the art baselines. KGNN-LS also achieves strong performance in cold-start scenarios where user-item interactions are sparse.
In-depth Reading
1. Bibliographic Information
1.1. Title
Knowledge-aware Graph Neural Networks with Label Smoothness Regularization for Recommender Systems
1.2. Authors
- Hongwei Wang (Stanford University)
- Fuzheng Zhang (Meituan-Dianping Group)
- Mengdi Zhang (Meituan-Dianping Group)
- Jure Leskovec (Stanford University)
- Miao Zhao (Hong Kong Polytechnic University)
- Wenjie Li (Hong Kong Polytechnic University)
- Zhongyuan Wang (Meituan-Dianping Group)
1.3. Journal/Conference
Published at The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '19), August 4-8, 2019, Anchorage, AK, USA. KDD is a highly reputable and influential conference in the fields of data mining, data science, and knowledge discovery, known for publishing cutting-edge research.
1.4. Publication Year
2019
1.5. Abstract
The paper addresses the challenge of leveraging knowledge graphs (KGs) to improve recommender systems, specifically tackling limitations of existing approaches such as reliance on manual feature engineering and lack of end-to-end training. It proposes a novel model called Knowledge-aware Graph Neural Networks with Label Smoothness regularization (KGNN-LS). Conceptually, KGNN-LS computes user-specific item embeddings by first applying a trainable function to identify important KG relationships for a given user, transforming the KG into a user-specific weighted graph. A graph neural network (GNN) is then applied to this personalized graph to compute personalized item embeddings. To provide better inductive bias and regularize the edge weights, the model relies on a label smoothness assumption, which posits that adjacent items in the KG are likely to have similar user relevance labels/scores. The paper proves that this label smoothness regularization is equivalent to a label propagation scheme on a graph. An efficient implementation is also developed, demonstrating strong scalability. Experiments on four datasets show that KGNN-LS outperforms state-of-the-art baselines, especially in cold-start scenarios where user-item interactions are sparse.
1.6. Original Source Link
https://arxiv.org/abs/1905.04413 — the paper was published at KDD '19; this link points to the arXiv preprint of the same work.
1.7. PDF Link
https://arxiv.org/pdf/1905.04413v3.pdf
2. Executive Summary
2.1. Background & Motivation
Recommender systems are crucial for combating information overload in various internet applications. Traditional recommender systems, primarily based on collaborative filtering (CF), often face significant challenges:
- Cold-start problem: Difficulty in recommending brand-new items, or recommending to new users, due to insufficient interaction data.
- Sparsity issue: Limited user-item interaction data, making it hard to learn robust preferences.

To alleviate these issues, researchers have explored incorporating auxiliary information sources such as user/item profiles or social networks. Knowledge graphs (KGs) stand out as a particularly attractive source. KGs are heterogeneous graphs that capture structured information and relations between entities (e.g., items, their properties, and characteristics). They provide rich connectivity information, revealing semantic relatedness between items that might not be apparent from interaction data alone.

The core challenge lies in effectively utilizing KGs to capture user-specific item-item relatedness. Existing KG-aware recommender systems, categorized into path-based, embedding-based, and hybrid methods, often suffer from:
- Manual feature engineering: Requiring significant human effort to design features.
- Lack of end-to-end training: Components are often trained separately, limiting joint optimization.
- Poor scalability: Struggling with large-scale KGs.
While Graph Neural Networks (GNNs) have shown promise in representation learning on graphs by aggregating information from local neighborhoods, their application in recommender systems has largely been confined to homogeneous bipartite user-item interaction graphs or user-/item-similarity graphs. Extending GNNs to heterogeneous KGs, which involve various entity and relation types, remained an open question. This paper aims to bridge this gap by developing a GNN architecture tailored for heterogeneous KGs in recommender systems.
2.2. Main Contributions / Findings
The paper introduces Knowledge-aware Graph Neural Networks with Label Smoothness regularization (KGNN-LS) to address the aforementioned challenges and improve recommendation quality. Its main contributions are:
- Novel KGNN-LS Architecture: The paper proposes a framework that extends GNNs to heterogeneous knowledge graphs for recommender systems, simultaneously capturing semantic relationships between items and personalized user preferences.
- User-Specific Weighted Graph Transformation: It introduces a trainable and personalized relation scoring function $s_u(r)$ that dynamically transforms the static knowledge graph into a user-specific weighted graph. This weighted graph reflects both the intrinsic semantic information of the KG and the individual user's interests, enabling the GNN to compute personalized item embeddings.
- Integration of Graph Neural Networks for Personalized Embeddings: By applying a GNN on the user-specific weighted graph, KGNN-LS computes personalized item embeddings. These embeddings capture each item's local KG structure in a user-personalized way, aggregating features from neighbors up to $L$ hops away.
- Label Smoothness Regularization: To address the overfitting risk introduced by learnable edge weights (which are supervised only by sparse user-item interactions), the paper develops a novel label smoothness regularization technique. It rests on the assumption that adjacent entities in the KG are likely to have similar user relevance labels.
- Theoretical Proof of Label Propagation Equivalence: The paper formally proves that the proposed label smoothness regularization is equivalent to a label propagation scheme on a graph, providing a strong theoretical foundation for its use as a regularizer. A leave-one-out loss function is designed for label propagation to provide additional supervised signals for learning the edge scoring function.
- Unified End-to-End Training Framework: KGNN-LS integrates the knowledge-aware GNN and label smoothness regularization into a single, end-to-end trainable loss function, allowing gradients to flow from the prediction function back through the GNN and the relation scoring function.
- Empirical Validation and Scalability: Extensive experiments on four real-world datasets (MovieLens-20M, Book-Crossing, Last.FM, and Dianping-Food) demonstrate that KGNN-LS significantly outperforms state-of-the-art baselines in both top-K recommendation (measured by Recall@K) and click-through rate (CTR) prediction (measured by AUC). The method also shows strong performance in cold-start scenarios and exhibits strong scalability with respect to the KG size.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand KGNN-LS, several fundamental concepts are essential for a beginner:
- Recommender Systems (RS): Systems designed to predict user preferences for items. They help users discover relevant content or products from a vast array of choices.
  - Collaborative Filtering (CF): A widely used technique in RS that makes recommendations based on the preferences of similar users or the characteristics of similar items. For example, if user A likes movies X and Y, and user B likes movie X, CF might recommend movie Y to user B.
  - Cold-start Problem: A common challenge in RS where the system struggles to make recommendations for new users or new items because there is insufficient interaction data (ratings, clicks, purchases) to infer their preferences or characteristics.
  - Sparsity: The situation where the user-item interaction matrix contains very few observed interactions compared to the total possible interactions, making it difficult for CF algorithms to find reliable patterns.
- Knowledge Graphs (KGs): Structured representations of knowledge that describe real-world entities and their relationships.
  - Entities: Real-world objects, concepts, or abstract ideas (e.g., "The Silence of the Lambs" (a movie), "Anthony Hopkins" (an actor)). In recommender systems, items (like movies, books, music) are often entities.
  - Relations: The types of connections between entities (e.g., "film.film.star", "book.book.author").
  - Triples: The fundamental unit of a KG, typically represented as (h, r, t), where $h$ is the head entity, $r$ is the relation, and $t$ is the tail entity. For example, (The Silence of the Lambs, film.film.star, Anthony Hopkins).
  - Heterogeneous Graph: A graph where nodes and/or edges can be of different types. KGs are heterogeneous because they have different types of entities and relations.
- Graph Neural Networks (GNNs): A class of neural networks designed to operate on graph-structured data. They generalize convolutional operations (like those in Convolutional Neural Networks (CNNs) for images) to arbitrary graph structures.
  - Representation Learning: The process of automatically discovering meaningful features or embeddings from raw data. In GNNs, this means learning vector representations (embeddings) for nodes that capture their structural and feature information.
  - Message Passing / Aggregation: The core mechanism of GNNs. Each node updates its representation by aggregating information from its immediate neighbors and its own previous representation. This process is often repeated for multiple layers, allowing information to propagate across longer distances in the graph.
  - Graph Convolutional Networks (GCNs): A popular type of GNN. A common GCN layer updates the features of node $v$ at layer $l$ using information from its neighbors $\mathcal{N}(v)$: $ h_v^{(l+1)} = \sigma \left( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{\deg(u)\deg(v)}} W^{(l)} h_u^{(l)} \right) $ where $\deg(u)$ is the degree of node $u$, $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is an activation function. This can be written in matrix form as: $ \mathbf{H}^{(l+1)} = \sigma \left( \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(l)} \mathbf{W}^{(l)} \right) $ Here, $\mathbf{H}^{(l)}$ is the matrix of node embeddings at layer $l$, $\mathbf{A}$ is the adjacency matrix, $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ (with $\mathbf{I}$ being the identity matrix for self-loops), and $\tilde{\mathbf{D}}$ is the diagonal degree matrix of $\tilde{\mathbf{A}}$. (A small numeric sketch of this layer is given right after this list.)
- Label Smoothness / Label Propagation:
  - Semi-supervised Learning: A type of machine learning that uses a small amount of labeled data and a large amount of unlabeled data during training.
  - Label Smoothness Assumption: A common assumption in graph-based semi-supervised learning stating that if two nodes are close in a graph (e.g., connected by an edge), they are likely to have similar labels.
  - Label Propagation Algorithm: An iterative algorithm that propagates labels from labeled nodes to unlabeled nodes through the graph, based on the smoothness assumption. It seeks to assign labels to unlabeled nodes such that the labels vary smoothly across the graph, respecting the initial labels of the labeled nodes.
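To make the GCN layer above concrete, here is a minimal NumPy sketch of the symmetrically normalized propagation rule with a ReLU activation; the adjacency matrix, features, and dimensions are toy values chosen purely for illustration.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = relu(D^-1/2 (A + I) D^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    deg = A_tilde.sum(axis=1)                     # node degrees of A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))      # D^-1/2
    A_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)        # aggregate, transform, ReLU

# Toy example: 4 nodes, 3-dim input features, 2-dim output.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.rand(4, 3)
W = np.random.rand(3, 2)
print(gcn_layer(A, H, W).shape)  # (4, 2)
```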
3.2. Previous Works
The paper categorizes and discusses previous works relevant to KG-aware recommender systems and GNNs.
- KG-aware Recommender Systems:
  - Embedding-based methods: These methods first pre-process a KG using knowledge graph embedding (KGE) algorithms to learn low-dimensional vector representations (embeddings) for entities and relations. These embeddings are then incorporated into recommendation models. (A toy scoring sketch for the TransE model follows this list.)
    - TransE [2]: A prominent KGE model that assumes relations act as translations in the embedding space, i.e., $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ for a triple (h, r, t). Its objective is to minimize the distance between $\mathbf{h} + \mathbf{r}$ and $\mathbf{t}$.
    - CKE [34] (Collaborative Knowledge Base Embedding): Combines collaborative filtering with structural, textual, and visual knowledge from KGs. It learns unified embeddings by jointly optimizing a CF loss and a KGE loss (e.g., TransR or TransE).
    - Critique: While flexible, KGE algorithms primarily focus on rigorous semantic relatedness (suitable for link prediction) rather than direct recommendation, and often lack end-to-end training.
  - Path-based methods: These methods explore various patterns of connections (meta-paths or meta-graphs) among items in a KG to extract additional features for recommendations.
    - PER [33] (Personalized Entity Recommendation): Treats the KG as a heterogeneous information network and extracts meta-path based features to represent connectivity between users and items. For instance, a meta-path like "user-movie-director-movie" suggests a user's interest in movies by the same director.
    - Critique: Relies heavily on manually designed meta-paths, which are difficult to tune and generalize.
  - Hybrid methods: Combine aspects of embedding-based and path-based methods, or integrate KGs into other recommendation paradigms.
    - RippleNet [24]: A memory-network-like approach that propagates users' preferences on the KG. It uses a user's historical preferences to activate relevant entities in the KG, iteratively expanding "ripples" to find candidate items. It learns user embeddings by aggregating the embeddings of related entities in the KG, weighted by their relevance to the user's preferences.
    - Wang et al. [28] (Knowledge Graph Convolutional Networks for Recommender Systems): This work is a direct precursor and inspiration for KGNN-LS. It also uses GCNs on KGs for recommendation and introduces user-specific relation scoring. However, KGNN-LS identifies that simply applying GCNs without proper regularization (like label smoothness) is prone to overfitting and can lead to performance degradation.
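To make the TransE translation assumption concrete (see the embedding-based bullet above), the following toy sketch scores a triple by the distance between $\mathbf{h} + \mathbf{r}$ and $\mathbf{t}$; the embeddings here are random placeholders rather than learned vectors.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score: smaller ||h + r - t|| means a more plausible triple."""
    return np.linalg.norm(h + r - t, ord=2)

dim = 8
h, r, t = (np.random.randn(dim) for _ in range(3))
print(transe_score(h, r, t))       # distance for a random (implausible) triple
print(transe_score(h, r, h + r))   # ~0 for a perfectly "translated" tail
```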
- Graph Neural Networks in Recommender Systems:
  - PinSage [32]: Applies GNNs to a large-scale bipartite (pin-board) graph at Pinterest for item recommendation, using a specialized GCN variant for inductive representation learning.
  - Monti et al. [14] and Berg et al. [19] (Graph Convolutional Matrix Completion): Model recommender systems as matrix completion tasks and design GNNs for representation learning on user-item bipartite graphs.
  - Wu et al. [31]: Uses GNNs on user/item structure graphs to learn representations.
  - Critique: These approaches are primarily designed for homogeneous bipartite graphs or similarity graphs. The challenge of extending GNNs to heterogeneous KGs, with their diverse entity and relation types, remained largely unaddressed prior to works like Wang et al. [28] and KGNN-LS.
- Semi-supervised Learning on Graphs:
  - Prior work on label smoothness and label propagation in semi-supervised learning.
  - Methods where edge weights are fixed [1, 37, 38] vs. parameterized and learnable [10, 21, 35]. KGNN-LS falls into the latter category, making the regularization aspect even more critical.
3.3. Technological Evolution
The evolution of recommender systems has moved from basic collaborative filtering (CF) and matrix factorization techniques to incorporating richer side information.
- Early CF models: Relied solely on user-item interaction data, struggling with sparsity and cold-start.
- Hybrid models: Integrated additional features like user/item profiles, content features, or contextual information to alleviate CF limitations.
- Knowledge Graph Integration: Recognizing the rich semantic information in KGs, methods emerged to use them:
  - Embedding-based: Pre-learning KGEs and feeding them into RS.
  - Path-based: Manually designing meta-paths to capture specific types of item-item relationships.
  - Hybrid KG-RS: More integrated approaches like RippleNet.
- Graph Neural Networks (GNNs): A parallel development in graph representation learning. Initially applied to homogeneous graphs (e.g., social networks, citation networks), then adapted for bipartite graphs in RS.
- GNNs on Heterogeneous KGs for RS: The current paper's work (KGNN-LS) represents an important step in directly applying GNNs to heterogeneous KGs for recommendation, aiming for end-to-end training and personalized aggregation, while explicitly addressing the challenges of learnable edge weights through principled regularization.
3.4. Differentiation Analysis
Compared to the main methods in related work, KGNN-LS presents several core innovations:
-
Compared to traditional KG-aware RS (embedding-based, path-based):
- KGNN-LS offers end-to-end training, allowing all components (user/item embeddings, relation scoring, GNN weights) to be jointly optimized for the recommendation task, unlike many embedding-based methods that pre-train KGEs.
- It automates the discovery of important KG relationships via a trainable relation scoring function $s_u(r)$, moving beyond the manual meta-path engineering required by path-based methods like PER. This user-specific and relation-specific weighting is a key differentiation.
-
Compared to
Wang et al. [28](KGCNs for RS):- While
Wang et al. [28] also applied GCNs to KGs with user-specific relation scoring, KGNN-LS identifies and addresses a critical limitation: the overfitting risk associated with learnable edge weights that are only weakly supervised.
label smoothness regularizationis the primary differentiating factor, providing a stronginductive biasand an additionalsupervised signal(via leave-one-out loss) for learning these crucial edge weights, leading to better generalization.
- While
-
Compared to GNNs for homogeneous/bipartite graphs in RS (e.g.,
PinSage,GC-MC):KGNN-LSspecifically designs its architecture to handle theheterogeneityof KGs (diverse entity and relation types), rather than just homogeneous or bipartite graphs. The user-specific relation scoring function is central to this adaptation.
-
Compared to
RippleNet[24]:-
Both are hybrid methods utilizing KGs.
RippleNet propagates user preferences by expanding "ripples" on the KG, akin to a memory network. KGNN-LS, on the other hand, explicitly constructs a user-specific weighted adjacency matrix and applies a GNN for feature aggregation, directly learning personalized item embeddings that reflect the local KG structure. The label smoothness regularization is also unique to KGNN-LS.

In essence,
KGNN-LS innovates by creating a truly personalized GNN architecture for heterogeneous KGs, where edge weights are not only user-specific but also robustly learned and regularized by a principled label smoothness assumption, addressing the critical issue of overfitting that arises from the flexibility of learnable graph structures.
-
4. Methodology
4.1. Principles
The core principle of KGNN-LS is to integrate rich, structured knowledge from knowledge graphs (KGs) into recommender systems using Graph Neural Networks (GNNs), while ensuring the learned representations are both personalized and robust. The method operates on two main ideas:
-
Personalized Graph Transformation and Feature Propagation: The heterogeneous KG is first transformed into a user-specific weighted graph. This transformation is achieved by a trainable function that learns the importance of different relations for a given user. A GNN then operates on this personalized graph to aggregate features from an item's neighborhood, generating user-specific item embeddings that capture local KG structure and user preferences. This can be viewed as feature propagation on the KG.
Label Smoothness Regularization for Edge Weights: To prevent
overfitting due to the high flexibility of learnable edge weights (which are only sparsely constrained by user-item interactions), KGNN-LS introduces label smoothness regularization. This regularization enforces an inductive bias that adjacent entities in the KG should have similar relevance scores for a given user. The paper proves that this regularization is equivalent to a label propagation scheme, providing a principled way to constrain and supervise the learning of edge weights. This can be seen as label propagation on the KG.

These two ideas are unified into a single, end-to-end trainable loss function, allowing the model to learn personalized item representations that are informed by both feature aggregation and label consistency across the KG.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Formulation
Let $\mathcal{U} = \{u_1, u_2, \cdots\}$ be the set of users and $\mathcal{V} = \{v_1, v_2, \cdots\}$ be the set of items. The user-item interaction matrix $\mathbf{Y} \in \{0, 1\}^{|\mathcal{U}| \times |\mathcal{V}|}$ captures implicit feedback, where $y_{uv} = 1$ indicates that user $u$ has interacted with item $v$, and $y_{uv} = 0$ otherwise (or the interaction is unobserved). A knowledge graph $\mathcal{G} = \{(h, r, t)\}$ is also available, composed of head entities $h \in \mathcal{E}$, relations $r \in \mathcal{R}$, and tail entities $t \in \mathcal{E}$, where $\mathcal{E}$ is the set of all entities and $\mathcal{R}$ is the set of all relations. Items are a subset of entities (i.e., $\mathcal{V} \subseteq \mathcal{E}$). The task is to predict the potential interest of user $u$ in item $v$, denoted as $\hat{y}_{uv} = \mathcal{F}(u, v \,|\, \Theta, \mathbf{Y}, \mathcal{G})$, where $\Theta$ represents the model parameters and $\mathbf{E}$ represents the raw entity features.
4.2.2. Knowledge-aware Graph Neural Networks
The first stage involves transforming the heterogeneous KG into a user-personalized weighted graph and then applying a GNN.
1. User-Specific Relation Scoring Function:
To make the KG user-specific, a trainable scoring function $s_u(r)$ is used to quantify the importance of a relation $r$ for a given user $u$:
$
s_u(r) = g(\mathbf{u}, \mathbf{r})
$
Here, $\mathbf{u}$ is the feature vector (embedding) of user $u$, and $\mathbf{r}$ is the feature vector (embedding) of relation type $r$. $g$ is a differentiable function, typically an inner product or a multilayer perceptron (MLP). This function learns which types of relations are more relevant to a user's preferences. For example, some users might care more about the "director" relation for movies, while others might prioritize "lead actor."
2. User-Specific Adjacency Matrix Construction:
Given $s_u(\cdot)$, the knowledge graph is transformed into a user-specific adjacency matrix $\mathbf{A}_u \in \mathbb{R}^{|\mathcal{E}| \times |\mathcal{E}|}$. The entry $A_u^{ij}$ represents the weight of the edge between entity $e_i$ and entity $e_j$ for user $u$:
$
A_u^{ij} = s_u(r_{e_i, e_j})
$
where $r_{e_i, e_j}$ is the relation type between $e_i$ and $e_j$ in $\mathcal{G}$. If no relation exists between $e_i$ and $e_j$, then $A_u^{ij} = 0$. This matrix effectively personalizes the graph structure based on user preferences.
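As a rough illustration of steps 1 and 2, the sketch below assembles a user-specific adjacency matrix from a list of KG triples, assuming the inner-product form of $g$ that the paper uses in its experiments; the toy triples, embedding sizes, and entity indexing are made up for illustration and are not the paper's data pipeline.

```python
import numpy as np

def build_user_adjacency(triples, user_emb, rel_emb, num_entities):
    """A_u[i, j] = s_u(r_ij) = <u, r> for every (head, relation, tail) triple."""
    A_u = np.zeros((num_entities, num_entities))
    for head, rel, tail in triples:
        score = float(user_emb @ rel_emb[rel])   # s_u(r) as an inner product
        A_u[head, tail] = score
        A_u[tail, head] = score                  # treat the KG as undirected
    return A_u

# Toy KG: 5 entities, 2 relation types, 3 triples.
triples = [(0, 0, 1), (1, 1, 2), (3, 0, 4)]
rel_emb = np.random.rand(2, 16)
user_emb = np.random.rand(16)
A_u = build_user_adjacency(triples, user_emb, rel_emb, num_entities=5)
print(A_u.round(2))
```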
3. Graph Neural Network for Entity Embeddings: Entities in the KG are initialized with raw features, forming an initial entity representation matrix $\mathbf{E} \in \mathbb{R}^{|\mathcal{E}| \times d_0}$, where $d_0$ is the dimension of the raw entity features. This initial matrix is denoted as $\mathbf{H}_0 = \mathbf{E}$. The GNN then iteratively updates these entity representations layer by layer. The layer-wise forward propagation is defined as: $ \mathbf{H}_{l+1} = \sigma \left( \mathbf{D}_u^{-1/2} \mathbf{A}_u \mathbf{D}_u^{-1/2} \mathbf{H}_l \mathbf{W}_l \right), \quad l = 0, 1, \cdots, L-1. $ Let's break down the symbols in this equation:
- $\mathbf{H}_l$: The matrix of hidden representations of entities at layer $l$. Each row corresponds to an entity's embedding.
- $\mathbf{H}_0 = \mathbf{E}$: The initial entity feature matrix.
- $\mathbf{A}_u$: The user-specific adjacency matrix constructed as described above. To ensure that an entity's own previous representation is included in its update, self-connections are added: $\mathbf{A}_u \leftarrow \mathbf{A}_u + \mathbf{I}$, where $\mathbf{I}$ is the identity matrix.
- $\mathbf{D}_u$: A diagonal degree matrix with entries $D_u^{ii} = \sum_j A_u^{ij}$. The product $\mathbf{D}_u^{-1/2} \mathbf{A}_u \mathbf{D}_u^{-1/2}$ normalizes the adjacency matrix, akin to symmetric normalization in GCNs, to prevent exploding or vanishing gradients and stabilize the representations.
- $\mathbf{W}_l$: A layer-specific trainable weight matrix that transforms the aggregated features from the previous layer.
- $\sigma$: A non-linear activation function (ReLU for non-final layers, tanh for the final layer), which introduces non-linearity into the model.
- $L$: The total number of GNN layers. A multi-layer GNN allows information to propagate across longer distances in the KG, capturing higher-order neighborhood information.

The final output of the GNN is $\mathbf{H}_L$, which contains the final, user-specific representations for all entities. The representation of a specific item $v$ for user $u$ is given by $\mathbf{v}_u = \mathbf{H}_L[v]$, the $v$-th row of $\mathbf{H}_L$. Note that $\mathbf{v}_u$ is user-specific because $\mathbf{A}_u$ is user-specific.
4. Prediction Function:
Finally, the predicted engagement probability of user with item is calculated by:
$
\hat{y}_{uv} = f(\mathbf{u}, \mathbf{v}_u)
$
where $\mathbf{u}$ is the user embedding and $\mathbf{v}_u$ is the personalized item embedding. $f$ is a differentiable prediction function, such as an inner product or a multilayer perceptron.
The entire system, from relation scoring to final prediction, is end-to-end trainable: gradients can flow through $f$, the GNN layers (updating the $\mathbf{W}_l$), and the relation scoring function $g$ (updating the user and relation embeddings).
4.2.3. Label Smoothness Regularization
The learnable edge weights in $\mathbf{A}_u$ provide flexibility but also increase the risk of overfitting, especially since the supervised signal comes only from sparse user-item interactions. To address this, label smoothness (LS) regularization is introduced.
1. Label Smoothness Assumption and Energy Function:
The core idea is that adjacent entities in the KG should have similar relevance labels for a given user. Let $l_u: \mathcal{E} \to \mathbb{R}$ be a real-valued label function for user $u$. For observed items $v \in \mathcal{V}$, $l_u(v) = y_{uv}$ (i.e., 1 if interacted, 0 otherwise). For other entities (non-items or unobserved items), their labels are initially unknown.
The energy function quantifies the "roughness" of the labels across the graph:
$
E(l_u, \mathbf{A}_u) = \frac{1}{2} \sum_{e_i \in \mathcal{E}, \, e_j \in \mathcal{E}} A_u^{ij} \left( l_u(e_i) - l_u(e_j) \right)^2
$
This function penalizes large differences in labels between connected entities, weighted by the strength of their connection $A_u^{ij}$. Minimizing $E$ encourages label smoothness.
2. Harmonic Property of Minimum-Energy Label Function (Theorem 1):
The paper proves that the label function $l_u^*$ that minimizes $E(l_u, \mathbf{A}_u)$ while respecting the fixed labels of observed items ($l_u(v) = y_{uv}$ for $v \in \mathcal{V}$) is harmonic: the label of any non-item entity is the weighted average of its neighbors' labels.
Theorem 1. The minimum-energy label function $ l_u^* = \operatorname*{argmin}_{l_u: \, l_u(v) = y_{uv}, \, \forall v \in \mathcal{V}} E(l_u, \mathbf{A}_u) $ w.r.t. Eq. (2) is harmonic, i.e., it satisfies $ l_u^*(e_i) = \frac{1}{D_u^{ii}} \sum_{e_j \in \mathcal{E}} A_u^{ij} \, l_u^*(e_j), \quad \forall e_i \in \mathcal{E} \setminus \mathcal{V}. $
Proof: To find the minimum of $E(l_u, \mathbf{A}_u)$, take the partial derivative with respect to $l_u(e_i)$ for any non-item entity $e_i \in \mathcal{E} \setminus \mathcal{V}$ (the labels of items are fixed): $ \frac{\partial E(l_u, \mathbf{A}_u)}{\partial l_u(e_i)} = \frac{1}{2} \sum_{j} A_u^{ij} \cdot 2 \left( l_u(e_i) - l_u(e_j) \right) - \frac{1}{2} \sum_{j} A_u^{ji} \cdot 2 \left( l_u(e_j) - l_u(e_i) \right). $ Since $\mathbf{A}_u$ is symmetric ($A_u^{ij} = A_u^{ji}$), this simplifies to $ \frac{\partial E(l_u, \mathbf{A}_u)}{\partial l_u(e_i)} = \sum_{j} A_u^{ij} \left( l_u(e_i) - l_u(e_j) \right). $ Setting the derivative to zero at the minimizer $l_u^*$: $ l_u^*(e_i) \sum_{j} A_u^{ij} - \sum_{j} A_u^{ij} \, l_u^*(e_j) = 0. $ Since $\sum_j A_u^{ij} = D_u^{ii}$ (the degree of $e_i$ in $\mathbf{A}_u$), we get $ l_u^*(e_i) = \frac{1}{D_u^{ii}} \sum_{j} A_u^{ij} \, l_u^*(e_j), \quad \forall e_i \in \mathcal{E} \setminus \mathcal{V}. $ This confirms the harmonic property: the label of a non-item entity is the weighted average of its neighbors' labels.
3. Equivalence to Label Propagation (Theorem 2): The harmonic property implies that repeating a specific label propagation scheme will converge to this minimum-energy label function.
Theorem 2. Repeating the following two steps: (1) propagate labels for all entities: $l_u(\mathcal{E}) \leftarrow \mathbf{D}_u^{-1} \mathbf{A}_u \, l_u(\mathcal{E})$, where $l_u(\mathcal{E})$ is the vector of labels for all entities; (2) reset the labels of all items to their initial labels: $l_u(\mathcal{V}) \leftarrow \mathbf{Y}[u, \mathcal{V}]^\top$, where $l_u(\mathcal{V})$ is the vector of labels for all items and $\mathbf{Y}[u, \mathcal{V}] = [y_{uv_1}, y_{uv_2}, \cdots]$ are the initial labels; will lead to $l_u \to l_u^*$.
Proof:
Let $l_u(\mathcal{E})$ be partitioned into $l_u(\mathcal{V})$ (labels for items) and $l_u(\mathcal{E} \setminus \mathcal{V})$ (labels for non-item/unlabeled entities). Since $l_u(\mathcal{V})$ is fixed, we are only interested in $l_u(\mathcal{E} \setminus \mathcal{V})$.
Define the transition matrix $\mathbf{P} = \mathbf{D}_u^{-1} \mathbf{A}_u$, which is row-normalized.
Partition $\mathbf{P}$ according to the partition of $l_u(\mathcal{E})$:
$
\mathbf{P} = \left[ \begin{array}{cc} \mathbf{P}_{VV} & \mathbf{P}_{VE} \\ \mathbf{P}_{EV} & \mathbf{P}_{EE} \end{array} \right],
$
where $\mathbf{P}_{VV}$ maps items to items, $\mathbf{P}_{VE}$ maps items to non-items, $\mathbf{P}_{EV}$ maps non-items to items, and $\mathbf{P}_{EE}$ maps non-items to non-items.
The label propagation scheme can be written as:
$
\begin{array}{rcl} l_u(\mathcal{V})^{(t+1)} & = & \mathbf{P}_{VV} \, l_u(\mathcal{V})^{(t)} + \mathbf{P}_{VE} \, l_u(\mathcal{E} \setminus \mathcal{V})^{(t)} \\ l_u(\mathcal{E} \setminus \mathcal{V})^{(t+1)} & = & \mathbf{P}_{EV} \, l_u(\mathcal{V})^{(t)} + \mathbf{P}_{EE} \, l_u(\mathcal{E} \setminus \mathcal{V})^{(t)} \end{array}
$
Due to step (2), $l_u(\mathcal{V})$ is always reset to its initial value $\mathbf{Y}[u, \mathcal{V}]^\top$. So we focus on $l_u(\mathcal{E} \setminus \mathcal{V})$:
$
l_u(\mathcal{E} \setminus \mathcal{V})^{(t+1)} = \mathbf{P}_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top + \mathbf{P}_{EE} \, l_u(\mathcal{E} \setminus \mathcal{V})^{(t)}.
$
This is an iterative equation. Unrolling it $n$ times:
$
l_u(\mathcal{E} \setminus \mathcal{V})^{(n)} = (\mathbf{P}_{EE})^n \, l_u(\mathcal{E} \setminus \mathcal{V})^{(0)} + \left( \sum_{i=0}^{n-1} (\mathbf{P}_{EE})^i \right) \mathbf{P}_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top.
$
The paper's own proof indexes the geometric sum from $i = 1$ to $n$ with exponent $i - 1$, a purely cosmetic difference in indexing convention; the convergence argument is the same. Following the paper's equation directly:
$
l_u(\mathcal{E} \setminus \mathcal{V}) = \operatorname*{lim}_{n \to \infty} \left[ (\mathbf{P}_{EE})^n \, l_u^{(0)}(\mathcal{E} \setminus \mathcal{V}) + \left( \sum_{i=1}^{n} (\mathbf{P}_{EE})^{i-1} \right) \mathbf{P}_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top \right].
$
To show convergence, we first prove that $\lim_{n \to \infty} (\mathbf{P}_{EE})^n = \mathbf{0}$.
Since $\mathbf{P}$ is row-normalized, $\sum_j \mathbf{P}[i, j] = 1$ for every row $i$.
For $\mathbf{P}_{EE}$, the sub-matrix corresponding to non-item entities, each row sum is strictly less than 1 whenever the corresponding non-item entity has at least one connection to an item entity. If no non-item entity were connected to any item, $\mathbf{P}_{EE}$ would remain row-normalized with row sums of 1 and convergence to zero would not follow; in practice the problem formulation guarantees that items exist, so some probability mass always flows out toward the fixed labels.
The paper states:
$
\exists \, \epsilon < 1 \ \text{such that} \ \sum_j \mathbf{P}_{EE}[i, j] \leq \epsilon
$
for every row index $i$. This condition holds if there is at least one path from a non-item entity to an item entity (a fixed label), implying that some "probability mass" flows out of the subgraph in each step.
Then, it follows:
$
\begin{array}{rcl}
\sum_j (\mathbf{P}_{EE})^n[i, j] & = & \sum_j \left( (\mathbf{P}_{EE})^{n-1} \mathbf{P}_{EE} \right)[i, j] \\
& = & \sum_j \sum_k (\mathbf{P}_{EE})^{n-1}[i, k] \, \mathbf{P}_{EE}[k, j] \\
& = & \sum_k (\mathbf{P}_{EE})^{n-1}[i, k] \sum_j \mathbf{P}_{EE}[k, j] \\
& \leq & \epsilon \sum_k (\mathbf{P}_{EE})^{n-1}[i, k]
\end{array}
$
By induction, this implies $\sum_j (\mathbf{P}_{EE})^n[i, j] \leq \epsilon^n$.
As $n \to \infty$, $\epsilon^n \to 0$ (since $\epsilon < 1$), so $(\mathbf{P}_{EE})^n \to \mathbf{0}$ (the zero matrix).
This means the term $(\mathbf{P}_{EE})^n \, l_u^{(0)}(\mathcal{E} \setminus \mathcal{V})$ converges to $\mathbf{0}$, so the initial value does not affect the convergence point.
Therefore, the expression for $l_u(\mathcal{E} \setminus \mathcal{V})$ becomes:
$
l_u(\mathcal{E} \setminus \mathcal{V}) = \operatorname*{lim}_{n \to \infty} \left( \sum_{i=1}^{n} (\mathbf{P}_{EE})^{i-1} \right) \mathbf{P}_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top.
$
Let $\mathbf{T} = \sum_{i=0}^{\infty} (\mathbf{P}_{EE})^i$. This is a geometric series of matrices.
For such a series to converge, the spectral radius of $\mathbf{P}_{EE}$ must be less than 1, which is implied by the row-sum bound $\sum_j \mathbf{P}_{EE}[i, j] \leq \epsilon < 1$.
The sum converges to $(\mathbf{I} - \mathbf{P}_{EE})^{-1}$.
To show this:
$
\mathbf{T} - \mathbf{T} \mathbf{P}_{EE} = \sum_{i=0}^{\infty} (\mathbf{P}_{EE})^i - \sum_{i=0}^{\infty} (\mathbf{P}_{EE})^{i} \mathbf{P}_{EE} = \sum_{i=0}^{\infty} (\mathbf{P}_{EE})^i - \sum_{i=1}^{\infty} (\mathbf{P}_{EE})^i = (\mathbf{P}_{EE})^0 = \mathbf{I}.
$
Thus, $\mathbf{T} = (\mathbf{I} - \mathbf{P}_{EE})^{-1}$.
Substituting this back, we get the unique fixed point:
$
l_u(\mathcal{E} \setminus \mathcal{V}) = (\mathbf{I} - \mathbf{P}_{EE})^{-1} \mathbf{P}_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top.
$
Combining this with the fixed labels for items, the converged label vector $l_u^*(\mathcal{E})$ is:
$
l_u^*(\mathcal{E}) = \left[ \begin{array}{c} \mathbf{Y}[u, \mathcal{V}]^\top \\ (\mathbf{I} - \mathbf{P}_{EE})^{-1} \mathbf{P}_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top \end{array} \right].
$
This concludes the proof that the iterative label propagation scheme converges to the minimum-energy label function.
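As a small numeric sanity check of Theorem 2 (under a toy symmetric adjacency matrix, not the paper's data), the sketch below runs the propagate-and-reset iteration and confirms that it reaches the closed-form fixed point $(\mathbf{I} - \mathbf{P}_{EE})^{-1} \mathbf{P}_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top$ derived above.

```python
import numpy as np

np.random.seed(0)
n_items, n_other = 3, 4                      # labeled items first, then other entities
n = n_items + n_other
A = np.random.rand(n, n); A = (A + A.T) / 2  # toy symmetric user-specific adjacency
P = A / A.sum(axis=1, keepdims=True)         # row-normalized transition matrix D^-1 A
y_items = np.array([1.0, 0.0, 1.0])          # observed labels for the items

# Iterative scheme: propagate, then reset item labels to their true values.
labels = np.zeros(n)
labels[:n_items] = y_items
for _ in range(200):
    labels = P @ labels
    labels[:n_items] = y_items

# Closed-form fixed point for the unlabeled block.
P_EV, P_EE = P[n_items:, :n_items], P[n_items:, n_items:]
closed_form = np.linalg.solve(np.eye(n_other) - P_EE, P_EV @ y_items)
print(np.allclose(labels[n_items:], closed_form))  # True: both give the same labels
```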
4. Leave-One-Out Loss for Regularization:
The minimum-energy label function $l_u^*$ by itself does not provide a signal for updating $\mathbf{A}_u$, because the labels of items are clamped to their true relevance values $y_{uv}$, while the true labels of the remaining entities $\mathcal{E} \setminus \mathcal{V}$ are unknown.
To overcome this, a leave-one-out loss is proposed. For each item $v$ that user $u$ has interacted with ($y_{uv} = 1$), we temporarily "hide" its true label (treat it as unlabeled) and use the label propagation scheme (Theorem 2) to predict its label, $\hat{l}_u(v)$, based on the remaining labeled items and all other entities. The difference between the true label $y_{uv}$ and the predicted label $\hat{l}_u(v)$ then serves as a supervised signal to regularize $\mathbf{A}_u$.
The regularization term is defined as:
$
R(\mathbf{A}) = \sum_u R(\mathbf{A}_u) = \sum_u \sum_v J\left( y_{uv}, \hat{l}_u(v) \right)
$
Here, $J$ is the cross-entropy loss function. This regularization encourages the learned edge weights to be such that label propagation effectively predicts the true labels of held-out items. In essence, it "tightens" the rubber bands (edges) in the KG so that labels propagate more effectively.
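Continuing the same toy setup, here is a minimal sketch of the leave-one-out regularization: each interacted item is hidden in turn, its label is re-predicted by label propagation from the remaining labels, and the cross-entropy against the true label accumulates into $R(\mathbf{A}_u)$. Function names and shapes are illustrative, not the paper's code.

```python
import numpy as np

def propagate_labels(P, labels, clamp_idx, clamp_val, n_iter=100):
    """Run the propagate-and-reset scheme, clamping only the indices in clamp_idx."""
    l = labels.copy()
    for _ in range(n_iter):
        l = P @ l
        l[clamp_idx] = clamp_val
    return l

def label_smoothness_loss(P, item_idx, y_items, eps=1e-8):
    """Leave-one-out loss R(A_u): hide each item in turn and re-predict its label."""
    labels = np.zeros(P.shape[0])
    labels[item_idx] = y_items
    loss = 0.0
    for k, v in enumerate(item_idx):
        keep = [i for i in item_idx if i != v]          # clamp every item except the held-out one
        keep_val = np.array([labels[i] for i in keep])
        pred = propagate_labels(P, labels, keep, keep_val)[v]
        pred = np.clip(pred, eps, 1 - eps)
        y = y_items[k]
        loss += -(y * np.log(pred) + (1 - y) * np.log(1 - pred))  # cross-entropy J
    return loss

# Toy example: random symmetric adjacency, three items with labels [1, 0, 1].
np.random.seed(0)
A = np.random.rand(7, 7); A = (A + A.T) / 2
P = A / A.sum(axis=1, keepdims=True)
print(label_smoothness_loss(P, item_idx=[0, 1, 2], y_items=np.array([1.0, 0.0, 1.0])))
```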
4.2.4. The Unified Loss Function
KGNN-LS combines the knowledge-aware GNN (for feature propagation) and label smoothness regularization (for label propagation) into a single objective function:
$
\operatorname*{min}_{\mathbf{W}, \mathbf{A}} \mathcal{L} = \operatorname*{min}_{\mathbf{W}, \mathbf{A}} \sum_{u, v} J\left( y_{uv}, \hat{y}_{uv} \right) + \lambda R(\mathbf{A}) + \gamma \|\mathcal{F}\|_2^2
$
Let's explain each term:
- $\sum_{u,v} J(y_{uv}, \hat{y}_{uv})$: The primary recommendation loss. It measures the discrepancy between the true user-item interaction labels $y_{uv}$ and the predicted probabilities $\hat{y}_{uv}$ from the GNN. $J$ is typically binary cross-entropy for implicit feedback. This term drives the learning of the GNN weights $\mathbf{W}$ and indirectly influences $\mathbf{A}$ through the user-specific embeddings $\mathbf{v}_u$.
- $\lambda R(\mathbf{A})$: The label smoothness regularization term. It encourages the edge weights (and thus the underlying user and relation embeddings that form them) to support effective label propagation, enforcing the smoothness assumption. $\lambda$ is a hyperparameter balancing its importance.
- $\gamma \|\mathcal{F}\|_2^2$: An L2 regularizer (weight decay) on the model parameters $\mathcal{F}$, which include $\mathbf{W}$ and the embeddings of users and relations. It helps prevent overfitting by penalizing large parameter values. $\gamma$ is a hyperparameter.

This unified loss function allows for end-to-end training, where both the personalized item representations and the personalized graph structure are learned jointly and robustly.
4.2.5. Discussion (Intuition from Figure 2)
The paper provides an intuitive analogy with a physical equilibrium model (Figure 2, not provided, but described in text).
- Without KG: Items are loosely connected, mainly through collaborative filtering effects.
- With KG: Edges in the KG act like rubber bands imposing constraints.
  - GNN effect: A single GNN layer mixes an entity's features with those of its immediate neighbors. If an item is positively rated, its neighbors are also "pulled up" (assigned higher relevance). Deeper GNNs (larger $L$) propagate this "pull" further, exploring long-distance interests.
  - Personalization: The "strength" of these rubber bands (the edge weights $\mathbf{A}_u$) is user-specific and relation-specific, so the influence of KG paths is tailored to each user.
- Label Smoothness Regularization effect: If an edge weight (rubber band strength) is too weak, the positive signal from an observed item might not propagate effectively to its neighbors. The label smoothness regularization (via the leave-one-out loss) identifies such cases. By "holding out" an observed positive item and trying to predict its label from its neighbors, the regularization effectively tightens the rubber bands (increases edge weights) along paths that help recover the true label. This encourages the model to pull up other potentially relevant, unobserved items connected via these strengthened paths.

The following figure from the paper provides a visual illustration of the label smoothness regularization:
(Figure 2 from the paper: a schematic illustrating how the knowledge graph and label smoothness regularization shape the decision boundary of a user's interests under different settings. Colored dots and arrows depict positive and negative samples and the forces acting on them, showing how the KG and label smoothness help adjust the boundary to improve recommendations.)
The illustration shows how the label smoothness regularization helps. If a positive sample (blue circle) is held out, the model tries to predict its label. Because the true label is 1 and another positive sample (upper right pink circle) has the largest label value, the regularization term will increase the weights of edges (arrows) between them, allowing the "label flow" to predict the held-out sample correctly. This effectively strengthens connections (rubber bands) and encourages the model to raise the predicted relevance of other connected, unobserved positive items (the two upper pink items).
5. Experimental Setup
5.1. Datasets
The experiments were conducted on four real-world datasets spanning different recommendation domains: movies, books, music, and restaurants.
- MovieLens-20M:
- Domain: Movie recommendations.
- Source: Public benchmark dataset.
- Characteristics: Approximately 20 million explicit ratings (1 to 5).
- KG Source:
Satori (commercial KG by Microsoft).
- Book-Crossing:
- Domain: Book recommendations.
- Source: Public dataset.
- Characteristics: 1 million ratings (0 to 10).
- KG Source:
Satori.
- Last.FM:
- Domain: Music recommendations.
- Source: Public dataset.
- Characteristics: Musician listening information from 2 thousand users.
- KG Source:
Satori.
- Dianping-Food:
-
Domain: Restaurant recommendations.
-
Source: From Meituan-Dianping Group (a Chinese group buying website).
-
Characteristics: Over 10 million interactions (clicking, buying, adding to favorites) between ~2 million users and ~1 thousand restaurants.
-
KG Source:
Meituan Brain (internal KG by Meituan-Dianping Group).

Data Preprocessing:
-
- Implicit Feedback Conversion: For MovieLens-20M, Book-Crossing, and Last.FM (which contain explicit ratings or listening counts), interactions were transformed into
implicit feedback. For MovieLens-20M, a rating of 4 or higher was treated as positive (marked as 1). No rating threshold was set for Book-Crossing and Last.FM due to their sparsity, so any interaction is treated as positive.
unwatched/uninteracted itemswere marked as 0 (negative samples) to balance the dataset. - KG Construction (for public datasets):
-
A subset of triples with a confidence level greater than 0.9 was extracted from
Satori. -
Item IDs (movies, books, musicians) were matched by name with
tail entitiesin specific relations (e.g.,film.film.name). -
Items with multiple matches or no matches were excluded.
-
These matched item IDs were then used to collect all relevant triples from the Satori sub-KG where they acted as
head entities.The following are the results from Table 2 of the original paper:
| | Movie | Book | Music | Restaurant |
| --- | --- | --- | --- | --- |
| # users | 138,159 | 19,676 | 1,872 | 2,298,698 |
| # items | 16,954 | 20,003 | 3,846 | 1,362 |
| # interactions | 13,501,622 | 172,576 | 42,346 | 23,416,418 |
| # entities | 102,569 | 25,787 | 9,366 | 28,115 |
| # relations | 32 | 18 | 60 | 7 |
| # KG triples | 499,474 | 60,787 | 15,518 | 160,519 |
-
The datasets were chosen to represent diverse domains and scales, allowing for comprehensive validation of the method's performance across different complexities of user-item interactions and knowledge graph structures. The Dianping-Food dataset is particularly notable for its large number of users but relatively small number of items, which can emphasize the utility of KG information for differentiation.
The connection between the KG and user-item interactions is further validated by Figure 3.
(Figure 3 from the paper: the probability distribution of the shortest-path distance in the KG between two randomly sampled items in each dataset, comparing item pairs that share at least one common user with pairs that share none.)
This figure shows the probability distribution of the shortest path distance between two randomly sampled items in the KG. It distinguishes between item pairs that have no common user in the dataset and those that have at least one common user. The results empirically demonstrate that items with common users are significantly more likely to be closer in the KG (e.g., within 2 hops) compared to items without common users. This finding supports the core motivation to use KG proximity to infer item relatedness and justifies the label smoothness regularization.
5.2. Evaluation Metrics
The paper evaluates the proposed method using two common types of metrics in recommender systems: Recall@K for top-K recommendation and AUC for click-through rate (CTR) prediction.
5.2.1. Recall@K
- Conceptual Definition:
Recall@K measures the proportion of relevant items (i.e., items a user has interacted with or is expected to interact with) that are successfully recommended within the top $K$ items. It is a common metric for evaluating how well a recommender system retrieves all relevant items for a user; a higher Recall@K indicates that the system is better at finding and presenting the items the user likes. - Mathematical Formula: $ \mathrm{Recall@K} = \frac{1}{|\mathcal{U}_{test}|} \sum_{u \in \mathcal{U}_{test}} \frac{|\text{Recommended}_{u,K} \cap \text{Relevant}_u|}{|\text{Relevant}_u|} $
- Symbol Explanation:
- $|\mathcal{U}_{test}|$: The total number of users in the test set.
- $\text{Recommended}_{u,K}$: The set of top-$K$ items recommended by the system for user $u$.
- $\text{Relevant}_u$: The set of items that user $u$ actually interacted with (or found relevant) in the test set.
- $\cap$: The intersection operator, selecting items present in both sets.
- $|\cdot|$: The cardinality (number of elements) of a set. (A short reference implementation of both metrics is given after the AUC description below.)
5.2.2. AUC (Area Under the ROC Curve)
- Conceptual Definition:
AUC (Area Under the Receiver Operating Characteristic Curve) is a performance metric for binary classification problems (like CTR prediction, where the task is to predict whether a user will click on or interact with an item). It represents the probability that a randomly chosen positive instance (e.g., an item a user will click) is ranked higher than a randomly chosen negative instance (e.g., an item a user will not click) by the model. An AUC of 1.0 indicates perfect classification, while 0.5 indicates performance no better than random guessing. - Mathematical Formula: $ \mathrm{AUC} = \frac{\sum_{i \in \text{Positive}} \sum_{j \in \text{Negative}} \left[ \mathbf{1}(\mathrm{score}(i) > \mathrm{score}(j)) + 0.5 \cdot \mathbf{1}(\mathrm{score}(i) = \mathrm{score}(j)) \right]}{|\text{Positive}| \cdot |\text{Negative}|} $
- Symbol Explanation:
- $\text{Positive}$: The set of positive samples (e.g., items clicked by a user).
- $\text{Negative}$: The set of negative samples (e.g., items not clicked by a user).
- $\mathrm{score}(i)$: The predicted score (e.g., click probability) of item $i$.
- $\mathbf{1}(\cdot)$: The indicator function, which returns 1 if the condition inside is true and 0 otherwise.
- $|\text{Positive}|$: The number of positive samples.
- $|\text{Negative}|$: The number of negative samples.
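Both metrics can be computed directly from the definitions above; the following is a small, unoptimized reference sketch with made-up inputs.

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of a user's relevant items that appear in the top-K recommendations."""
    top_k = set(recommended[:k])
    return len(top_k & set(relevant)) / len(relevant) if relevant else 0.0

def auc(scores, labels):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    hits = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return hits / (len(pos) * len(neg))

print(recall_at_k(recommended=[5, 2, 9, 1], relevant=[2, 7, 9], k=3))  # 2/3
print(auc(scores=[0.9, 0.8, 0.3, 0.2], labels=[1, 0, 1, 0]))          # 0.75
```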
5.3. Baselines
The proposed KGNN-LS model was compared against several state-of-the-art baselines, including both KG-free and KG-aware methods.
-
KG-Free Baselines:
SVD [12] (Singular Value Decomposition): A classic collaborative filtering (CF) model that uses matrix factorization to decompose the user-item interaction matrix into lower-rank user and item latent factor matrices. The paper uses an unbiased version in which the predicted engagement probability is modeled as an inner product of user and item embeddings: $\hat{y}_{uv} = \mathbf{u}^\top \mathbf{v}$.
LibFM [16] (Factorization Machines): A widely used feature-based factorization model for CTR prediction. It models interactions between variables using a factorization approach. For this paper, the user ID and item ID are concatenated as input.
-
KG-Aware Baselines:
-
LibFM + TransE: An extension of LibFM in which entity representations learned by TransE [2] are attached to each user-item pair, incorporating structural knowledge from the KG directly into LibFM's feature set.
PER[33] (Personalized Entity Recommendation): Apath-based method. It treats the KG as aheterogeneous information networkand extractsmeta-path based featuresto represent connectivity between users and items. The paper specifies manually designed meta-paths for each dataset (e.g., "user-movie-director-movie" for MovieLens-20M). -
CKE[34] (Collaborative Knowledge Base Embedding): Anembedding-based methodthat combinescollaborative filteringwith structural knowledge from KGs. It jointly learns user/item embeddings from CF and entity/relation embeddings from KGs. -
RippleNet[24]: Ahybrid methodthat uses amemory-network-likeapproach to propagate user preferences on the KG for recommendation. It learns user embeddings by iteratively expanding "ripples" of user preferences over the KG.Hyper-parameter Settings: General settings: functions and (in
KGNN-LS) are set as inner products. The activation $\sigma$ is ReLU for non-last layers and tanh for the last layer. To handle varying neighbor sizes in KGs efficiently, a fixed-size set of $S$ neighbors is uniformly sampled for each entity. Hyperparameters for KGNN-LS are determined by optimizing R@10 on a validation set. The search spaces for $S$, $d$ (dimension of hidden layers), $L$ (number of layers), $\lambda$ (label smoothness regularizer weight), $\gamma$ (L2 regularizer weight), and $\eta$ (learning rate) are provided in Appendix B. For baselines, specific dimensions and learning rates are mentioned in Section 5.2.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that KGNN-LS significantly outperforms state-of-the-art baselines in both top-K recommendation and CTR prediction across all four datasets.
The following are the results from Table 3 of the original paper:
| Model | MovieLens-20M | | | | Book-Crossing | | | | Last.FM | | | | Dianping-Food | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | R@2 | R@10 | R@50 | R@100 | R@2 | R@10 | R@50 | R@100 | R@2 | R@10 | R@50 | R@100 | R@2 | R@10 | R@50 | R@100 |
| SVD | 0.036 | 0.124 | 0.277 | 0.401 | 0.027 | 0.046 | 0.077 | 0.109 | 0.029 | 0.098 | 0.240 | 0.332 | 0.039 | 0.152 | 0.329 | 0.451 |
| LibFM | 0.039 | 0.121 | 0.271 | 0.388 | 0.033 | 0.062 | 0.092 | 0.124 | 0.030 | 0.103 | 0.263 | 0.330 | 0.043 | 0.156 | 0.332 | 0.448 |
| LibFM + TransE | 0.041 | 0.125 | 0.280 | 0.396 | 0.037 | 0.064 | 0.097 | 0.130 | 0.032 | 0.102 | 0.259 | 0.326 | 0.044 | 0.161 | 0.343 | 0.455 |
| PER | 0.022 | 0.077 | 0.160 | 0.243 | 0.022 | 0.041 | 0.064 | 0.070 | 0.014 | 0.052 | 0.116 | 0.176 | 0.023 | 0.102 | 0.256 | 0.354 |
| CKE | 0.034 | 0.107 | 0.244 | 0.322 | 0.028 | 0.051 | 0.079 | 0.112 | 0.023 | 0.070 | 0.180 | 0.296 | 0.034 | 0.138 | 0.305 | 0.437 |
| RippleNet | 0.045 | 0.130 | 0.278 | 0.447 | 0.036 | 0.074 | 0.107 | 0.127 | 0.032 | 0.101 | 0.242 | 0.336 | 0.040 | 0.155 | 0.328 | 0.440 |
| KGNN-LS | 0.043 | 0.155 | 0.321 | 0.458 | 0.045 | 0.082 | 0.117 | 0.149 | 0.044 | 0.122 | 0.277 | 0.370 | 0.047 | 0.170 | 0.340 | 0.487 |
Table 3: Recall@K in Top-K Recommendation
- KGNN-LS consistently achieves the highest Recall@K values across all values of K (2, 10, 50, 100) and all four datasets.
- For example, on MovieLens-20M, KGNN-LS achieves an R@100 of 0.458, surpassing RippleNet (0.447) and the other baselines.
- On Last.FM, KGNN-LS reaches an R@100 of 0.370, significantly higher than RippleNet (0.336) and LibFM + TransE (0.326).
- The PER baseline, a path-based method, performs notably worse than most others, highlighting the difficulty of manual meta-path engineering.
- CKE, an embedding-based method, also performs less competitively, suggesting that pre-trained KG embeddings might not fully align with the recommendation task without deeper integration.

The following are the results from Table 4 of the original paper:
| Model | Movie | Book | Music | Restaurant |
| --- | --- | --- | --- | --- |
| SVD | 0.963 | 0.672 | 0.769 | 0.838 |
| LibFM | 0.959 | 0.691 | 0.778 | 0.837 |
| LibFM + TransE | 0.966 | 0.698 | 0.777 | 0.839 |
| PER | 0.832 | 0.617 | 0.633 | 0.746 |
| CKE | 0.924 | 0.677 | 0.744 | 0.802 |
| RippleNet | 0.960 | 0.727 | 0.770 | 0.833 |
| KGNN-LS | 0.979 | 0.744 | 0.803 | 0.850 |
Table 4: AUC in CTR Prediction
- KGNN-LS consistently achieves the highest AUC scores across all datasets, demonstrating its superior performance in predicting user-item interactions.
- The average AUC gains over the baselines are substantial: 5.1% on MovieLens-20M, 6.9% on Book-Crossing, 8.3% on Last.FM, and 4.3% on Dianping-Food. These figures are averages over all baselines rather than gains over the single best baseline; measured against the strongest baseline alone, the gain on MovieLens-20M, for instance, is about 1.3% ((0.979 - 0.966) / 0.966). Either way, the results underscore the model's effectiveness.
- The strong performance of LibFM + TransE compared to LibFM alone indicates the value of incorporating KG embeddings, but KGNN-LS goes further by integrating GNNs and label smoothness.

Daily Performance Stability: The following figure from the paper shows the daily AUC of all methods on Dianping-Food in September 2018.
(Figure 4 from the paper: daily AUC of the compared methods on the Dianping-Food dataset during September 2018, covering LibFM+TransE, PER, CKE, RippleNet, and KGNN-LS; KGNN-LS performs best overall.)
Figure 4 illustrates the daily AUC performance of KGNN-LS and baselines on the Dianping-Food dataset over a month. KGNN-LS consistently maintains the highest AUC curve, indicating superior performance over time. Moreover, its curve shows low variance, suggesting that KGNN-LS is robust and stable in practical, real-world deployment scenarios, an important factor for recommender systems.
6.2. Effectiveness of LS Regularization
The paper conducts an ablation study to verify the effectiveness of the label smoothness (LS) regularization.
The following figure from the paper shows the effectiveness of LS regularization on Last.FM.
(Figure 5 from the paper: the effect of the LS regularization weight λ on R@10 for Last.FM under different hidden dimensions d, showing how R@10 varies with λ and across values of d.)
Figure 5 plots R@10 on the Last.FM dataset as the LS regularization weight $\lambda$ varies from 0 to 5, for different hidden layer dimensions $d$.
- When $\lambda = 0$, the LS regularization term is effectively removed. This scenario represents a KGNN without the label smoothness component, similar to the approach of Wang et al. [28].
- The results clearly show that KGNN-LS with a non-zero $\lambda$ consistently outperforms the $\lambda = 0$ case. This empirically validates the claim that LS regularization is crucial for assisting the learning of edge weights in the KG and achieving better generalization.
- The optimal $\lambda$ typically falls in a moderate range, around 0.1 to 1.0 in most cases; an overly large $\lambda$ can overwhelm the overall loss, distorting gradients and leading to sub-optimal performance.
6.3. Results in Cold-Start Scenarios
One of the primary motivations for using KGs in recommender systems is to alleviate the sparsity and cold-start issues. The paper evaluates KGNN-LS's performance in cold-start scenarios by varying the size of the training set on MovieLens-20M.
The following are the results from Table 5 of the original paper:
| r | 20% | 40% | 60% | 80% | 100% |
| --- | --- | --- | --- | --- | --- |
| SVD | 0.882 | 0.913 | 0.938 | 0.955 | 0.963 |
| LibFM | 0.902 | 0.923 | 0.938 | 0.950 | 0.959 |
| LibFM+TransE | 0.914 | 0.935 | 0.949 | 0.960 | 0.966 |
| PER | 0.802 | 0.814 | 0.821 | 0.828 | 0.832 |
| CKE | 0.898 | 0.910 | 0.916 | 0.921 | 0.924 |
| RippleNet | 0.921 | 0.937 | 0.947 | 0.955 | 0.960 |
| KGNN-LS | 0.961 | 0.970 | 0.974 | 0.977 | 0.979 |
Table 5: AUC with varying training set ratio $r$ on MovieLens-20M
- The table shows the AUC as the training data ratio $r$ decreases from 100% (full training data) to 20% (a highly sparse scenario).
- As expected, the performance of all models degrades with less training data.
- However, KGNN-LS demonstrates remarkable resilience. When $r = 20\%$, the AUC decrease for baselines ranges from 2.8% (CKE) to 8.4% (SVD) compared to their performance with 100% of the training data. In contrast, KGNN-LS experiences only a 1.8% decrease in AUC (from 0.979 to 0.961).
- This strong performance in cold-start scenarios highlights KGNN-LS's ability to effectively leverage the rich structural and semantic information from the KG to compensate for sparse user-item interactions, a significant advantage for practical recommender systems.
6.4. Hyper-parameters Sensitivity
The paper also investigates the impact of key hyperparameters on KGNN-LS's performance.
6.4.1. Number of GNN Layers ($L$)
The following are the results from Table 6 of the original paper:
| L | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| MovieLens-20M | 0.155 | 0.146 | 0.122 | 0.011 |
| Book-Crossing | 0.077 | 0.082 | 0.043 | 0.008 |
| Last.FM | 0.122 | 0.106 | 0.105 | 0.057 |
| Dianping-Food | 0.165 | 0.170 | 0.061 | 0.036 |
Table 6: R@10 with varying number of layers L
- The table shows R@10 for different numbers of GNN layers ($L = 1$ to $4$).
- KGNN-LS generally performs best with a small number of layers, typically $L = 1$ or $L = 2$. For instance, on MovieLens-20M, $L = 1$ yields the best R@10, while on Book-Crossing and Dianping-Food, $L = 2$ is optimal.
- A significant drop in performance is observed for larger $L$ (e.g., $L = 4$). This is a common phenomenon in GNNs known as over-smoothing: increasing the number of layers causes node representations to become too similar to those of their distant neighbors, making them indistinguishable. In the context of KGs, a larger $L$ mixes too many entity embeddings, leading to over-smoothed representation learning and hindering the model's ability to capture fine-grained distinctions and personalized interests.
6.4.2. Dimension of Hidden Layers ($d$)
The following are the results from Table 7 of the original paper:
| d | 4 | 8 | 16 | 32 | 64 | 128 |
| --- | --- | --- | --- | --- | --- | --- |
| MovieLens-20M | 0.134 | 0.141 | 0.143 | 0.155 | 0.155 | 0.151 |
| Book-Crossing | 0.065 | 0.073 | 0.077 | 0.081 | 0.082 | 0.080 |
| Last.FM | 0.111 | 0.116 | 0.122 | 0.109 | 0.102 | 0.107 |
| Dianping-Food | 0.155 | 0.170 | 0.167 | 0.166 | 0.163 | 0.161 |
Table 7: R@10 with varying dimension of hidden layers d
- The table shows R@10 for different dimensions of the hidden layers ($d = 4$ to $128$).
- Performance generally improves as $d$ increases initially (e.g., from 4 up to around 32 or 64), as higher dimensions allow the model to capture more complex features and increase its capacity.
- However, performance starts to drop if $d$ becomes too large (e.g., $d = 128$ on MovieLens-20M and Book-Crossing, or from $d = 32$ onwards on Last.FM and Dianping-Food). This suggests that too large a dimension can lead to overfitting on the training data, harming generalization to unseen data.
- The optimal dimension is typically found in the range of roughly 8 to 64, balancing model capacity against overfitting.
6.5. Running Time Analysis
The scalability of a recommender system is critical for real-world applications dealing with large KGs. The paper evaluates the running time of KGNN-LS and baselines as the KG size increases.
The following figure from the paper shows the running time of all methods w.r.t. KG size on MovieLens-20M.
(Figure 6 from the paper: running time of the compared methods on MovieLens-20M as the number of KG triples grows; KGNN-LS is more efficient than methods such as RippleNet at larger KG sizes.)
Figure 6 presents the running time of various methods on MovieLens-20M as the number of KG triples is scaled up (by extracting more triples from Satori).
- The figure shows that KGNN-LS exhibits strong scalability. Its running time increases gracefully with the KG size, indicating efficient handling of larger knowledge graphs.
- Notably, KGNN-LS appears more efficient than RippleNet as the KG size grows, especially given RippleNet's memory-network-like iterative propagation.
- While absolute running times are influenced by hardware and specific configurations, the trend clearly demonstrates KGNN-LS's ability to handle increasing KG sizes without prohibitive computational costs, a crucial practical advantage for deploying KG-aware recommender systems.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Knowledge-aware Graph Neural Networks with Label Smoothness regularization (KGNN-LS), a novel and effective model for recommender systems. KGNN-LS addresses key limitations of previous KG-aware recommendation approaches by offering an end-to-end trainable framework that avoids manual feature engineering and enhances scalability. Its core innovation lies in:
- User-specific personalized graph construction: Dynamically transforming a heterogeneous knowledge graph into a user-specific weighted graph using a trainable relation scoring function.
- Graph Neural Network for personalized embeddings: Applying a GNN on this personalized graph to generate rich, user-specific item embeddings by aggregating local neighborhood information.
- Label Smoothness regularization: Introducing a principled regularization technique that leverages the label smoothness assumption, effectively constraining the learnable edge weights and proving its equivalence to label propagation. This regularization mitigates overfitting and provides a stronger inductive bias.

Extensive experiments on four diverse real-world datasets consistently show that KGNN-LS significantly outperforms state-of-the-art baselines in both top-K recommendation and CTR prediction. Furthermore, KGNN-LS demonstrates robust performance in cold-start scenarios with sparse user-item interactions and exhibits strong scalability with respect to increasing knowledge graph sizes.
7.2. Limitations & Future Work
The authors identify several promising directions for future research:
- Broader application of LS regularization: The paper notes that LS regularization was specifically proposed for recommendation with KGs. Future work could examine its effectiveness on other graph tasks, such as link prediction (predicting missing edges in a graph) and node classification (assigning labels to nodes based on their features and connections).
- Theoretical relationship between feature propagation and label propagation: The paper highlights that the GNN component performs feature propagation while LS regularization implicitly implements label propagation. Investigating the theoretical connections and interplay between these two propagation mechanisms within a unified framework is suggested as a promising research direction.
7.3. Personal Insights & Critique
The KGNN-LS paper presents a highly innovative and practical solution for integrating knowledge graphs into recommender systems. My personal insights and critique include:
Innovations and Strengths:
-
Principled Regularization: The
label smoothness regularization is a standout feature. It is not just an ad-hoc penalty: it is theoretically grounded in label propagation and directly addresses the overfitting challenge posed by learnable, user-specific graph structures. This provides a strong inductive bias that adjacent items in a semantically meaningful graph should have similar relevance, which is intuitively appealing.
User-Specific GNNs: The dynamic construction of a user-specific weighted graph, combined with GNNs, is a powerful way to personalize recommendations. This moves beyond static KG embeddings or generic graph convolutions by explicitly modeling how each user perceives the importance of different relations.
-
End-to-End Trainability: The unified loss function enables all components to be jointly optimized, which is crucial for maximizing performance and adaptability, as opposed to multi-stage approaches.
-
Cold-Start Robustness: The demonstrated effectiveness in cold-start scenarios is a major advantage. By leveraging the rich information in KGs,
KGNN-LScan make meaningful recommendations even when explicit user-item interactions are scarce. -
Scalability: The efficient implementation and good scalability are critical for real-world deployment on large-scale platforms.
Potential Issues and Areas for Improvement/Future Exploration:
-
Complexity of the scoring function $g$: The user-specific relation scoring function is defined generally as a differentiable function $g$ (e.g., inner product or MLP). The choice of $g$ could significantly impact the model's ability to capture complex user preferences over relations. Exploring more sophisticated, interpretable, or dynamic forms of $g$ could be beneficial.
Computational Cost of Label Propagation: While the overall system shows good scalability, the
label propagationstep for regularization involves iterative matrix operations or explicit computation of the inverse which can be computationally intensive for very large KGs and many users if not carefully approximated or optimized. The paper mentions "efficient implementation," implying this is handled, but specific details on how this is achieved for the regularization term's calculation might be worth exploring further. -
Interpretability of : While the model learns relation importance, a deeper analysis into why certain relations are important for specific users (e.g., through attention mechanisms or more transparent functions) could enhance interpretability, which is increasingly valued in recommender systems.
-
Feature Engineering for : The initial entity features are crucial. The paper mentions "raw entity features." The quality and type of these features (e.g., textual descriptions, categorical attributes) can greatly influence the GNN's ability to learn meaningful representations. Investigating optimal ways to obtain or pre-process these initial features could be valuable.
-
Dynamic KGs: Most KGs are dynamic, with new entities and relations constantly being added. The current model assumes a relatively static KG structure during training. Adapting
KGNN-LSto handle dynamically evolving KGs efficiently would be a practical extension.Transferability and Applicability: The core ideas of
KGNN-LSare highly transferable. -
Beyond Recommendations: The concept of a user-specific graph transformation combined with GNNs and principled graph-based regularization could be applied to other personalized graph-based tasks, such as personalized
knowledge graph completion,personalized link prediction(e.g., predicting personalized social connections based on common interests derived from a KG), orpersonalized entity linking. -
Different Domains: The method is general enough to be applied to any domain where structured knowledge is available and personalization is desired, provided a suitable KG can be constructed or accessed. Examples include personalized search, educational content recommendation, or drug discovery (personalized medicine, where KGs can represent drug-protein interactions, and user preferences might be patient-specific biological profiles).
Overall,
KGNN-LS offers a robust and theoretically sound framework that significantly advances the state of the art in KG-aware recommender systems, providing a solid foundation for future research in personalized graph machine learning.