
Knowledge-aware Graph Neural Networks with Label Smoothness Regularization for Recommender Systems

Published: 05/11/2019

TL;DR Summary

KGNN-LS learns user-specific weights for KG relations and regularizes them with label smoothness, enabling end-to-end training for personalized recommendations; it notably improves cold-start performance, scales to large knowledge graphs, and outperforms state-of-the-art baselines across four datasets.

Abstract

Knowledge graphs capture structured information and relations between a set of entities or items. As such knowledge graphs represent an attractive source of information that could help improve recommender systems. However, existing approaches in this domain rely on manual feature engineering and do not allow for an end-to-end training. Here we propose Knowledge-aware Graph Neural Networks with Label Smoothness regularization (KGNN-LS) to provide better recommendations. Conceptually, our approach computes user-specific item embeddings by first applying a trainable function that identifies important knowledge graph relationships for a given user. This way we transform the knowledge graph into a user-specific weighted graph and then apply a graph neural network to compute personalized item embeddings. To provide better inductive bias, we rely on label smoothness assumption, which posits that adjacent items in the knowledge graph are likely to have similar user relevance labels/scores. Label smoothness provides regularization over the edge weights and we prove that it is equivalent to a label propagation scheme on a graph. We also develop an efficient implementation that shows strong scalability with respect to the knowledge graph size. Experiments on four datasets show that our method outperforms state of the art baselines. KGNN-LS also achieves strong performance in cold-start scenarios where user-item interactions are sparse.

In-depth Reading

1. Bibliographic Information

1.1. Title

Knowledge-aware Graph Neural Networks with Label Smoothness Regularization for Recommender Systems

1.2. Authors

  • Hongwei Wang (Stanford University)
  • Fuzheng Zhang (Meituan-Dianping Group)
  • Mengdi Zhang (Meituan-Dianping Group)
  • Jure Leskovec (Stanford University)
  • Miao Zhao (Hong Kong Polytechnic University)
  • Wenjie Li (Hong Kong Polytechnic University)
  • Zhongyuan Wang (Meituan-Dianping Group)

1.3. Journal/Conference

Published at The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '19), August 4-8, 2019, Anchorage, AK, USA. KDD is a highly reputable and influential conference in the fields of data mining, data science, and knowledge discovery, known for publishing cutting-edge research.

1.4. Publication Year

2019

1.5. Abstract

The paper addresses the challenge of leveraging knowledge graphs (KGs) to improve recommender systems, specifically tackling limitations of existing approaches such as reliance on manual feature engineering and lack of end-to-end training. It proposes a novel model called Knowledge-aware Graph Neural Networks with Label Smoothness regularization (KGNN-LS). Conceptually, KGNN-LS computes user-specific item embeddings by first applying a trainable function to identify important KG relationships for a given user, transforming the KG into a user-specific weighted graph. A graph neural network (GNN) is then applied to this personalized graph to compute personalized item embeddings. To provide better inductive bias and regularize the edge weights, the model relies on a label smoothness assumption, which posits that adjacent items in the KG are likely to have similar user relevance labels/scores. The paper proves that this label smoothness regularization is equivalent to a label propagation scheme on a graph. An efficient implementation is also developed, demonstrating strong scalability. Experiments on four datasets show that KGNN-LS outperforms state-of-the-art baselines, especially in cold-start scenarios where user-item interactions are sparse.

1.6. Original Source

The paper was published at KDD '19; the links below point to the officially recognized arXiv preprint.

  • Abstract page: https://arxiv.org/abs/1905.04413
  • PDF: https://arxiv.org/pdf/1905.04413v3.pdf

2. Executive Summary

2.1. Background & Motivation

Recommender systems are crucial for combating information overload in various internet applications. Traditional recommender systems, primarily based on collaborative filtering (CF), often face significant challenges:

  • Cold-start problem: Difficulty in recommending brand new items or to new users due to insufficient interaction data.

  • Sparsity issue: Limited user-item interaction data, making it hard to learn robust preferences.

    To alleviate these issues, researchers have explored incorporating auxiliary information sources like user/item profiles or social networks. Knowledge graphs (KGs) stand out as a particularly attractive source. KGs are heterogeneous graphs that capture structured information and relations between entities (e.g., items, their properties, and characteristics). They provide rich connectivity information, revealing semantic relatedness between items that might not be apparent from interaction data alone.

The core challenge lies in effectively utilizing KGs to capture user-specific item-item relatedness. Existing KG-aware recommender systems, categorized into path-based, embedding-based, and hybrid methods, often suffer from:

  • Manual feature engineering: Requiring significant human effort to design features.

  • Lack of end-to-end training: Components are often trained separately, limiting joint optimization.

  • Poor scalability: Struggling with large-scale KGs.

    While Graph Neural Networks (GNNs) have shown promise in representation learning on graphs by aggregating information from local neighborhoods, their application in recommender systems has largely been confined to homogeneous bipartite user-item interaction graphs or user-/item-similarity graphs. Extending GNNs to heterogeneous KGs, which involve various entity types and relation types, remained an open question. This paper aims to bridge this gap by developing a GNN architecture tailored for heterogeneous KGs in recommender systems.

2.2. Main Contributions / Findings

The paper introduces Knowledge-aware Graph Neural Networks with Label Smoothness regularization (KGNN-LS) to address the aforementioned challenges and improve recommendation quality. Its main contributions are:

  • Novel KGNN-LS Architecture: The paper proposes a novel framework that extends GNNs to heterogeneous knowledge graphs for recommender systems. It simultaneously captures semantic relationships between items and personalized user preferences.
  • User-Specific Weighted Graph Transformation: It introduces a trainable and personalized relation scoring function $s_u(r)$ that dynamically transforms the static knowledge graph into a user-specific weighted graph. This weighted graph reflects both the intrinsic semantic information of the KG and the individual user's interests, enabling the GNN to compute personalized item embeddings.
  • Integration of Graph Neural Networks for Personalized Embeddings: By applying a GNN on the user-specific weighted graph, KGNN-LS computes personalized item embeddings. These embeddings effectively capture each item's local KG structure in a user-personalized way, aggregating features from neighbors up to $L$ hops away.
  • Label Smoothness Regularization: To address the overfitting risk introduced by learnable edge weights (which are only sparsely supervised by user-item interactions), the paper develops a novel label smoothness regularization technique. This regularization operates on the assumption that adjacent entities in the KG are likely to have similar user relevance labels.
  • Theoretical Proof of Label Propagation Equivalence: The paper formally proves that the proposed label smoothness regularization is equivalent to a label propagation scheme on a graph. This provides a strong theoretical foundation for its use as a regularization mechanism. A leave-one-out loss function is designed for label propagation to provide additional supervised signals for learning the edge scoring function.
  • Unified End-to-End Training Framework: KGNN-LS integrates the knowledge-aware GNN and label smoothness regularization into a single, end-to-end trainable loss function, allowing gradients to flow from the prediction function back through the GNN and the relation scoring function.
  • Empirical Validation and Scalability: Extensive experiments on four real-world datasets (MovieLens-20M, Book-Crossing, Last.FM, and Dianping-Food) demonstrate that KGNN-LS significantly outperforms state-of-the-art baselines in both top-K recommendation (measured by Recall@K) and click-through rate (CTR) prediction (measured by AUC). The method also shows strong performance in cold-start scenarios and exhibits strong scalability with respect to the KG size.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand KGNN-LS, several fundamental concepts are essential for a beginner:

  • Recommender Systems (RS): Systems designed to predict user preferences for items. They help users discover relevant content or products from a vast array of choices.

    • Collaborative Filtering (CF): A widely used technique in RS that makes recommendations based on the preferences of similar users or the characteristics of similar items. For example, if user A likes movies X and Y, and user B likes movie X, CF might recommend movie Y to user B.
    • Cold-start Problem: A common challenge in RS where the system struggles to make recommendations for new users or new items because there is insufficient interaction data (ratings, clicks, purchases) to infer their preferences or characteristics.
    • Sparsity: Refers to the situation where the user-item interaction matrix contains very few observed interactions compared to the total possible interactions. This makes it difficult for CF algorithms to find reliable patterns.
  • Knowledge Graphs (KGs): Structured representations of knowledge that describe real-world entities and their relationships.

    • Entities: Real-world objects, concepts, or abstract ideas (e.g., "The Silence of the Lambs" (a movie), "Anthony Hopkins" (an actor), "director" (a property)). In recommender systems, items (like movies, books, music) are often entities.
    • Relations: The types of connections between entities (e.g., "film.film.star", "book.book.author").
    • Triples: The fundamental unit of a KG, typically represented as $(h, r, t)$, where $h$ is the head entity, $r$ is the relation, and $t$ is the tail entity. For example, (The Silence of the Lambs, film.film.star, Anthony Hopkins).
    • Heterogeneous Graph: A graph where nodes and/or edges can be of different types. KGs are heterogeneous because they have different types of entities and relations.
  • Graph Neural Networks (GNNs): A class of neural networks designed to operate on graph-structured data. They generalize convolutional operations (like those in Convolutional Neural Networks (CNNs) for images) to arbitrary graph structures.

    • Representation Learning: The process of automatically discovering meaningful features or embeddings from raw data. In GNNs, this means learning vector representations (embeddings) for nodes that capture their structural and feature information.
    • Message Passing / Aggregation: The core mechanism of GNNs. Each node updates its representation by aggregating information from its immediate neighbors and its own previous representation. This process is often repeated for multiple layers, allowing information to propagate across longer distances in the graph.
    • Graph Convolutional Networks (GCNs): A popular type of GNN. A common GCN layer updates the features $h_v^{(l)}$ of node $v$ at layer $l$ using information from its neighbors $\mathcal{N}(v)$: $ h_v^{(l+1)} = \sigma \left( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{\deg(u)\deg(v)}} W^{(l)} h_u^{(l)} \right) $, where $\deg(u)$ is the degree of node $u$, $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is an activation function. In matrix form: $ \mathbf{H}^{(l+1)} = \sigma \left( \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(l)} \mathbf{W}^{(l)} \right) $, where $\mathbf{H}^{(l)}$ is the matrix of node embeddings at layer $l$, $\mathbf{A}$ is the adjacency matrix, $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ (with $\mathbf{I}$ the identity matrix adding self-loops), and $\tilde{\mathbf{D}}$ is the diagonal degree matrix of $\tilde{\mathbf{A}}$ (a minimal code sketch follows this list).
  • Label Smoothness / Label Propagation:

    • Semi-supervised Learning: A type of machine learning that uses a small amount of labeled data and a large amount of unlabeled data during training.
    • Label Smoothness Assumption: A common assumption in graph-based semi-supervised learning stating that if two nodes are close in a graph (e.g., connected by an edge), they are likely to have similar labels.
    • Label Propagation Algorithm: An iterative algorithm that propagates labels from labeled nodes to unlabeled nodes through the graph, based on the smoothness assumption. It seeks to assign labels to unlabeled nodes such that the labels vary smoothly across the graph, respecting the initial labels of the labeled nodes.
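The following is a minimal NumPy sketch of one symmetrically normalized GCN layer, as referenced in the list above. The graph, features, and weights are small hypothetical examples rather than anything from the paper.

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One GCN layer: H' = sigma(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])             # add self-loops
    deg = A_tilde.sum(axis=1)                    # degrees of the augmented graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))     # D^{-1/2}
    return activation(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)

# Toy example: 4 nodes on a path graph, 3-dim input features, 2-dim output.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H0 = np.random.rand(4, 3)
W0 = np.random.rand(3, 2)
H1 = gcn_layer(A, H0, W0)                        # shape (4, 2)
```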

3.2. Previous Works

The paper categorizes and discusses previous works relevant to KG-aware recommender systems and GNNs.

  • KG-aware Recommender Systems:

    • Embedding-based methods: These methods first pre-process a KG using knowledge graph embedding (KGE) algorithms to learn low-dimensional vector representations (embeddings) for entities and relations. These embeddings are then incorporated into recommendation models.
      • TransE [2]: A prominent KGE model that assumes relations act as translations in the embedding space, i.e., $h + r \approx t$ for a triple (h, r, t). Its objective is to minimize the distance between $h + r$ and $t$.
      • CKE [34] (Collaborative Knowledge Base Embedding): Combines collaborative filtering with structural, textual, and visual knowledge from KGs. It learns unified embeddings by jointly optimizing CF loss and KGE loss (e.g., TransR or TransE).
      • Critique: While flexible, KGE algorithms primarily focus on rigorous semantic relatedness (suitable for link prediction) rather than direct recommendation, and often lack end-to-end training.
    • Path-based methods: These methods explore various patterns of connections (meta-paths or meta-graphs) among items in a KG to extract additional features for recommendations.
      • PER [33] (Personalized Entity Recommendation): Treats the KG as a heterogeneous information network and extracts meta-path based features to represent connectivity between users and items. For instance, a meta-path like "user-movie-director-movie" suggests a user's interest in movies by the same director.
      • Critique: Rely heavily on manually designed meta-paths, which are difficult to tune and generalize.
    • Hybrid methods: Combine aspects of embedding-based and path-based methods, or integrate KGs into other recommendation paradigms.
      • RippleNet [24]: A memory-network-like approach that propagates users' preferences on the KG. It uses a user's historical preferences to activate relevant entities in the KG, iteratively expanding "ripples" to find candidate items. It learns user embeddings by aggregating the embeddings of related entities in the KG, weighted by their relevance to the user's preferences.
      • Wang et al. [28] (Knowledge Graph Convolutional Networks for Recommender Systems): This work is a direct precursor and inspiration for KGNN-LS. It also uses GCNs on KGs for recommendation and introduces user-specific relation scoring. However, KGNN-LS identifies that simply applying GCNs without proper regularization (like label smoothness) is prone to overfitting and can lead to performance degradation.
  • Graph Neural Networks in Recommender Systems:

    • PinSage [32]: Applies GNNs to a large-scale bipartite graph (pin-board) in Pinterest for item recommendation. It uses a specialized GCN variant for inductive representation learning.
    • Monti et al. [14] and Berg et al. [19] (Graph Convolutional Matrix Completion): Model recommender systems as matrix completion tasks and design GNNs for representation learning on user-item bipartite graphs.
    • Wu et al. [31]: Uses GNNs on user/item structure graphs to learn representations.
    • Critique: These approaches are primarily designed for homogeneous bipartite graphs or similarity graphs. The challenge of extending GNNs to heterogeneous KGs, with their diverse entity and relation types, remained largely unaddressed prior to works like Wang et al. [28] and KGNN-LS.
  • Semi-supervised Learning on Graphs:

    • Prior work on label smoothness and label propagation in semi-supervised learning.
    • Methods where edge weights are fixed [1, 37, 38] vs. parameterized and learnable [10, 21, 35]. KGNN-LS falls into the latter category, making the regularization aspect even more critical.

3.3. Technological Evolution

The evolution of recommender systems has moved from basic collaborative filtering (CF) and matrix factorization techniques to incorporating richer side information.

  1. Early CF models: Relied solely on user-item interaction data, struggling with sparsity and cold-start.
  2. Hybrid models: Integrated additional features like user/item profiles, content features, or contextual information to alleviate CF limitations.
  3. Knowledge Graph Integration: Recognizing the rich semantic information in KGs, methods emerged to use them:
    • Embedding-based: Pre-learning KGEs and feeding them into RS.
    • Path-based: Manually designing meta-paths to capture specific types of item-item relationships.
    • Hybrid KG-RS: More integrated approaches like RippleNet.
  4. Graph Neural Networks (GNNs): A parallel development in graph representation learning. Initially applied to homogeneous graphs (e.g., social networks, citation networks), then adapted for bipartite graphs in RS.
  5. GNNs on Heterogeneous KGs for RS: The current paper's work (KGNN-LS) represents an important step in directly applying GNNs to heterogeneous KGs for recommendation, aiming for end-to-end training and personalized aggregation, while explicitly addressing the challenges of learnable edge weights through principled regularization.

3.4. Differentiation Analysis

Compared to the main methods in related work, KGNN-LS presents several core innovations:

  • Compared to traditional KG-aware RS (embedding-based, path-based):

    • KGNN-LS offers end-to-end training, allowing all components (user/item embeddings, relation scoring, GNN weights) to be jointly optimized based on the recommendation task, unlike many embedding-based methods that pre-train KGEs.
    • It automates the discovery of important KG relationships via a trainable function $s_u(r)$, moving beyond the manual meta-path engineering required by path-based methods like PER. This user-specific and relation-specific weighting is a key differentiation.
  • Compared to Wang et al. [28] (KGCNs for RS):

    • While Wang et al. [28] also applied GCNs to KGs with user-specific relation scoring, KGNN-LS identifies and addresses a critical limitation: the overfitting risk associated with learnable edge weights that are only weakly supervised.
    • The introduction of label smoothness regularization is the primary differentiating factor, providing a strong inductive bias and an additional supervised signal (via leave-one-out loss) for learning these crucial edge weights, leading to better generalization.
  • Compared to GNNs for homogeneous/bipartite graphs in RS (e.g., PinSage, GC-MC):

    • KGNN-LS specifically designs its architecture to handle the heterogeneity of KGs (diverse entity and relation types), rather than just homogeneous or bipartite graphs. The user-specific relation scoring function is central to this adaptation.
  • Compared to RippleNet [24]:

    • Both are hybrid methods utilizing KGs. RippleNet propagates user preferences by expanding "ripples" on the KG, akin to a memory network. KGNN-LS, on the other hand, explicitly constructs a user-specific weighted adjacency matrix and applies a GNN for feature aggregation, directly learning personalized item embeddings that reflect the local KG structure. The label smoothness regularization is also unique to KGNN-LS.

      In essence, KGNN-LS innovates by creating a truly personalized GNN architecture for heterogeneous KGs, where edge weights are not only user-specific but also robustly learned and regularized by a principled label smoothness assumption, addressing the critical issue of overfitting that arises from the flexibility of learnable graph structures.

4. Methodology

4.1. Principles

The core principle of KGNN-LS is to integrate rich, structured knowledge from knowledge graphs (KGs) into recommender systems using Graph Neural Networks (GNNs), while ensuring the learned representations are both personalized and robust. The method operates on two main ideas:

  1. Personalized Graph Transformation and Feature Propagation: The heterogeneous KG is first transformed into a user-specific weighted graph. This transformation is achieved by a trainable function that learns the importance of different relations for a given user. A GNN then operates on this personalized graph to aggregate features from an item's neighborhood, generating user-specific item embeddings that capture local KG structure and user preferences. This can be viewed as feature propagation on the KG.

  2. Label Smoothness Regularization for Edge Weights: To prevent overfitting due to the high flexibility of learnable edge weights (which are only sparsely constrained by user-item interactions), KGNN-LS introduces label smoothness regularization. This regularization enforces an inductive bias that adjacent entities in the KG should have similar relevance scores for a given user. The paper proves that this regularization is equivalent to a label propagation scheme, providing a principled way to constrain and supervise the learning of edge weights. This can be seen as label propagation on the KG.

    These two ideas are unified into a single, end-to-end trainable loss function, allowing the model to learn personalized item representations that are informed by both feature aggregation and label consistency across the KG.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Formulation

Let $\mathcal{U}$ be the set of users and $\mathcal{V}$ the set of items. The user-item interaction matrix $\mathbf{Y}$ captures implicit feedback, where $y_{uv} = 1$ indicates that user $u$ has interacted with item $v$, and $y_{uv} = 0$ otherwise (or the interaction is unobserved). A knowledge graph $\mathcal{G} = \{(h, r, t)\}$ is also available, composed of head entities $h \in \mathcal{E}$, relations $r \in \mathcal{R}$, and tail entities $t \in \mathcal{E}$, where $\mathcal{E}$ is the set of all entities and $\mathcal{R}$ is the set of all relations. Items are a subset of entities ($\mathcal{V} \subseteq \mathcal{E}$). The task is to predict the potential interest of user $u$ in item $v$ via a prediction function $\hat{y}_{uv} = \mathcal{F}(u, v \mid \Theta, \mathbf{Y}, \mathcal{G})$, where $\Theta$ denotes the model parameters and $\mathbf{Y}$ the interaction data.
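As a concrete (entirely hypothetical) illustration of these inputs, the implicit-feedback matrix and the KG triples might be represented as follows; the entity IDs, relation names, and sizes are invented for illustration only.

```python
import numpy as np

# Implicit feedback Y: rows = users, columns = items; 1 = observed interaction.
Y = np.array([[1, 0, 1, 0],   # user 0 interacted with items 0 and 2
              [0, 1, 0, 0]])  # user 1 interacted with item 1

# Knowledge graph G as (head, relation, tail) triples over entity IDs.
# Items are a subset of entities: here entity IDs 0-3 are the items.
kg_triples = [
    (0, "film.film.star",     4),   # item 0 stars entity 4 (an actor)
    (2, "film.film.star",     4),   # item 2 shares that actor with item 0
    (1, "film.film.director", 5),   # item 1 is directed by entity 5
    (3, "film.film.genre",    6),   # item 3 belongs to genre entity 6
]
```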

4.2.2. Knowledge-aware Graph Neural Networks

The first stage involves transforming the heterogeneous KG into a user-personalized weighted graph and then applying a GNN.

1. User-Specific Relation Scoring Function: To make the KG user-specific, a function $s_u(r)$ quantifies the importance of relation $r$ for a given user $u$: $ s_u(r) = g(\mathbf{u}, \mathbf{r}) $, where $\mathbf{u}$ is the feature vector (embedding) of user $u$ and $\mathbf{r}$ is the feature vector (embedding) of relation type $r$. $g$ is a differentiable function, typically an inner product or a multilayer perceptron (MLP). This function learns which types of relations are more relevant to a user's preferences. For example, some users might care more about the "director" relation for movies, while others might prioritize "lead actor."

2. User-Specific Adjacency Matrix Construction: Given $s_u(r)$, the knowledge graph $\mathcal{G}$ is transformed into a user-specific adjacency matrix $\mathbf{A}_u \in \mathbb{R}^{|\mathcal{E}| \times |\mathcal{E}|}$. The entry $A_u^{ij}$ is the weight of the edge between entities $e_i$ and $e_j$ for user $u$: $ A_u^{ij} = s_u(r_{e_i, e_j}) $, where $r_{e_i, e_j}$ is the relation between $e_i$ and $e_j$ in $\mathcal{G}$. If no relation exists between $e_i$ and $e_j$, then $A_u^{ij} = 0$. This matrix effectively personalizes the graph structure based on user preferences.

3. Graph Neural Network for Entity Embeddings: Entities in the KG are initialized with raw features, forming an initial entity representation matrix $\mathbf{E} \in \mathbb{R}^{|\mathcal{E}| \times d_0}$, where $d_0$ is the dimension of the raw entity features; this initial matrix is denoted $\mathbf{H}_0 = \mathbf{E}$. The GNN then iteratively updates the entity representations layer by layer. The layer-wise forward propagation is: $ \mathbf{H}_{l+1} = \sigma \left( \mathbf{D}_u^{-1/2} \mathbf{A}_u \mathbf{D}_u^{-1/2} \mathbf{H}_l \mathbf{W}_l \right), \quad l = 0, 1, \cdots, L-1. $ The symbols in this equation are:

  • $\mathbf{H}_l \in \mathbb{R}^{|\mathcal{E}| \times d_l}$: The matrix of hidden entity representations at layer $l$; each row is an entity's embedding.

  • $\mathbf{H}_0 = \mathbf{E}$: The initial entity feature matrix.

  • $\mathbf{A}_u$: The user-specific adjacency matrix constructed as described above. To ensure that an entity's own previous representation is included in its update, self-connections are added: $\mathbf{A}_u \gets \mathbf{A}_u + \mathbf{I}$, where $\mathbf{I}$ is the identity matrix.

  • $\mathbf{D}_u$: A diagonal degree matrix with entries $D_u^{ii} = \sum_j A_u^{ij}$. It is used for normalization: $\mathbf{D}_u^{-1/2} \mathbf{A}_u \mathbf{D}_u^{-1/2}$ symmetrically normalizes the adjacency matrix, as in standard GCNs, to prevent exploding or vanishing gradients and to stabilize the representations.

  • $\mathbf{W}_l \in \mathbb{R}^{d_l \times d_{l+1}}$: A layer-specific trainable weight matrix that transforms the aggregated features from the previous layer.

  • $\sigma$: A non-linear activation function (e.g., ReLU for non-final layers, tanh for the final layer), which introduces non-linearity into the model.

  • $L$: The total number of GNN layers. A multi-layer GNN allows information to propagate across longer distances in the KG, capturing higher-order neighborhood information.

    The final output of the GNN is $\mathbf{H}_L \in \mathbb{R}^{|\mathcal{E}| \times d_L}$, which contains the final, user-specific representations of all entities. The representation of a specific item $v$ for user $u$ is $\mathbf{v}_u$, the $v$-th row of $\mathbf{H}_L$. Note that $\mathbf{v}_u$ is user-specific because $\mathbf{A}_u$ is user-specific.

4. Prediction Function: Finally, the predicted engagement probability of user $u$ with item $v$ is $ \hat{y}_{uv} = f(\mathbf{u}, \mathbf{v}_u) $, where $\mathbf{u}$ is the user embedding and $\mathbf{v}_u$ is the personalized item embedding. $f$ is a differentiable prediction function, such as an inner product or a multilayer perceptron. The entire pipeline, from relation scoring to final prediction, is end-to-end trainable: gradients flow through $f$, the GNN layers (updating $\mathbf{W}_l$), and the relation scoring function $g$ (updating the user and relation embeddings).
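The pipeline above (relation scoring, user-specific adjacency, normalized propagation, prediction) can be sketched in dense NumPy form as below. This is a simplified illustration under assumed shapes, not the paper's implementation (which samples a fixed-size set of neighbors per entity for scalability); the exp() used to keep edge weights positive and the sigmoid prediction are convenient choices for the sketch only.

```python
import numpy as np

def kgnn_forward(u_emb, rel_emb, rel_index, E0, W_list, item_id):
    """User-specific GNN forward pass (dense sketch for a small KG).

    u_emb     : (d,)           user embedding
    rel_emb   : (n_rel, d)     relation embeddings
    rel_index : (n_ent, n_ent) rel_index[i, j] = relation id between entities
                               i and j, or -1 if there is no edge
    E0        : (n_ent, d0)    raw entity features (H_0)
    W_list    : weight matrices chaining d0 -> ... -> d (last dim = d assumed)
    item_id   : index of the candidate item (an entity)
    """
    n_ent = E0.shape[0]
    # s_u(r): inner product of user and relation embeddings; exp() keeps the
    # resulting edge weights positive (a convenient choice for this sketch).
    scores = np.exp(rel_emb @ u_emb)                        # (n_rel,)
    A_u = np.where(rel_index >= 0, scores[rel_index], 0.0)  # user-specific adjacency
    A_u = A_u + np.eye(n_ent)                               # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_u.sum(axis=1)))
    A_hat = D_inv_sqrt @ A_u @ D_inv_sqrt                   # symmetric normalization

    H = E0
    for l, W in enumerate(W_list):
        H = A_hat @ H @ W
        H = np.tanh(H) if l == len(W_list) - 1 else np.maximum(H, 0.0)  # tanh last, ReLU otherwise

    v_u = H[item_id]                              # personalized item embedding (row of H_L)
    return 1.0 / (1.0 + np.exp(-(u_emb @ v_u)))   # f = inner product + sigmoid
```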

4.2.3. Label Smoothness Regularization

The learnable edge weights in $\mathbf{A}_u$ provide flexibility but also increase the risk of overfitting, especially since the supervised signal comes only from sparse user-item interactions. To address this, label smoothness (LS) regularization is introduced.

1. Label Smoothness Assumption and Energy Function: The core idea is that adjacent entities in the KG should have similar relevance labels for a given user. Let $l_u: \mathcal{E} \to \mathbb{R}$ be a real-valued label function for user $u$. For observed items, $l_u(v) = y_{uv}$ (i.e., 1 if interacted, 0 otherwise); for other entities (non-items or unobserved items), the labels are initially unknown. The energy function $E$ quantifies the "roughness" of the labels across the graph: $ E(l_u, \mathbf{A}_u) = \frac{1}{2} \sum_{e_i \in \mathcal{E}, e_j \in \mathcal{E}} A_u^{ij} \left( l_u(e_i) - l_u(e_j) \right)^2. $ This function penalizes large label differences between connected entities, weighted by the strength of their connection $A_u^{ij}$. Minimizing $E$ encourages label smoothness.

2. Harmonic Property of the Minimum-Energy Label Function (Theorem 1): The paper proves that the label function $l_u^*$ that minimizes $E(l_u, \mathbf{A}_u)$ while respecting the fixed labels of observed items ($l_u(v) = y_{uv}$ for $v \in \mathcal{V}$) is harmonic: the label of any non-item entity is the weighted average of its neighbors' labels.

Theorem 1. The minimum-energy label function $ l_u^* = \operatorname*{argmin}_{l_u: \, l_u(v) = y_{uv}, \forall v \in \mathcal{V}} E(l_u, \mathbf{A}_u) $ w.r.t. Eq. (2) is harmonic, i.e., $l_u^*$ satisfies $ l_u^*(e_i) = \frac{1}{D_u^{ii}} \sum_{e_j \in \mathcal{E}} A_u^{ij} \, l_u^*(e_j), \quad \forall e_i \in \mathcal{E} \setminus \mathcal{V}. $

Proof: The energy function is $E(l_u, \mathbf{A}_u) = \frac{1}{2} \sum_{i,j} A_u^{ij} \left( l_u(e_i) - l_u(e_j) \right)^2$. To find the minimum, take the partial derivative of $E$ with respect to $l_u(e_i)$ for any non-item entity $e_i \in \mathcal{E} \setminus \mathcal{V}$ (the labels of items are fixed): $ \frac{\partial E(l_u, \mathbf{A}_u)}{\partial l_u(e_i)} = \sum_j A_u^{ij} \left( l_u(e_i) - l_u(e_j) \right) + \sum_j A_u^{ji} \left( l_u(e_i) - l_u(e_j) \right) = 2 \sum_j A_u^{ij} \left( l_u(e_i) - l_u(e_j) \right), $ using the symmetry $A_u^{ij} = A_u^{ji}$ (or an effective symmetric matrix). Setting the derivative to zero at $l_u = l_u^*$ gives $ \sum_j A_u^{ij} \left( l_u^*(e_i) - l_u^*(e_j) \right) = 0 \;\Rightarrow\; l_u^*(e_i) \sum_j A_u^{ij} = \sum_j A_u^{ij} \, l_u^*(e_j). $ Since $D_u^{ii} = \sum_j A_u^{ij}$ (the degree of $e_i$ in $\mathbf{A}_u$), we obtain $ l_u^*(e_i) = \frac{1}{D_u^{ii}} \sum_j A_u^{ij} \, l_u^*(e_j), \quad \forall e_i \in \mathcal{E} \setminus \mathcal{V}. $ This confirms the harmonic property: the label of a non-item entity is the weighted average of its neighbors' labels.

3. Equivalence to Label Propagation (Theorem 2): The harmonic property implies that repeating a specific label propagation scheme will converge to this minimum-energy label function.

Theorem 2. Repeat the following two steps: (1) Propagate labels for all entities: $l_u(\mathcal{E}) \gets \mathbf{D}_u^{-1} \mathbf{A}_u \, l_u(\mathcal{E})$, where $l_u(\mathcal{E})$ is the vector of labels of all entities; (2) Reset the labels of all items to their initial values: $l_u(\mathcal{V}) \gets \mathbf{Y}[u, \mathcal{V}]^\top$, where $l_u(\mathcal{V})$ is the vector of labels of all items and $\mathbf{Y}[u, \mathcal{V}] = [y_{uv_1}, y_{uv_2}, \cdots]$ are the initial labels. Then $l_u$ converges to $l_u^*$.

Proof: Partition $l_u(\mathcal{E})$ into $l_u(\mathcal{V})$ (labels of items) and $l_u(\mathcal{E} \setminus \mathcal{V})$ (labels of non-item entities). Since $l_u(\mathcal{V})$ is fixed, we only need to track $l_u(\mathcal{E} \setminus \mathcal{V})$. Define the row-normalized transition matrix $\mathbf{P} = \mathbf{D}_u^{-1} \mathbf{A}_u$ and partition it accordingly: $ \mathbf{P} = \begin{bmatrix} \mathbf{P}_{VV} & \mathbf{P}_{VE} \\ \mathbf{P}_{EV} & \mathbf{P}_{EE} \end{bmatrix}, $ where $\mathbf{P}_{VV}$ maps items to items, $\mathbf{P}_{VE}$ items to non-items, $\mathbf{P}_{EV}$ non-items to items, and $\mathbf{P}_{EE}$ non-items to non-items. One step of label propagation is $ l_u(\mathcal{V})^{(t+1)} = \mathbf{P}_{VV} \, l_u(\mathcal{V})^{(t)} + \mathbf{P}_{VE} \, l_u(\mathcal{E} \setminus \mathcal{V})^{(t)}, \qquad l_u(\mathcal{E} \setminus \mathcal{V})^{(t+1)} = \mathbf{P}_{EV} \, l_u(\mathcal{V})^{(t)} + \mathbf{P}_{EE} \, l_u(\mathcal{E} \setminus \mathcal{V})^{(t)}. $ Because of step (2), $l_u(\mathcal{V})$ is always reset to its initial value $\mathbf{Y}[u, \mathcal{V}]^\top$, so we focus on $ l_u(\mathcal{E} \setminus \mathcal{V})^{(t+1)} = \mathbf{P}_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top + \mathbf{P}_{EE} \, l_u(\mathcal{E} \setminus \mathcal{V})^{(t)}. $ Unrolling this recursion $n$ times gives $ l_u(\mathcal{E} \setminus \mathcal{V})^{(n)} = (\mathbf{P}_{EE})^n \, l_u(\mathcal{E} \setminus \mathcal{V})^{(0)} + \left( \sum_{i=0}^{n-1} (\mathbf{P}_{EE})^i \right) \mathbf{P}_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top. $ To show convergence, we prove that $\lim_{n \to \infty} (\mathbf{P}_{EE})^n = \mathbf{0}$. Since $\mathbf{P}$ is row-normalized and probability mass flows from the non-item block to the items (every non-item entity is connected, directly or through other entities, to some item with a fixed label), there exists $\epsilon < 1$ such that $\sum_j \mathbf{P}_{EE}[i, j] \leq \epsilon$ for every row $i$ (in the simplest case, every non-item entity has at least one edge to an item).

Then $ \sum_j (\mathbf{P}_{EE})^n[i, j] = \sum_j \sum_k (\mathbf{P}_{EE})^{n-1}[i, k] \, \mathbf{P}_{EE}[k, j] = \sum_k (\mathbf{P}_{EE})^{n-1}[i, k] \sum_j \mathbf{P}_{EE}[k, j] \leq \epsilon \sum_k (\mathbf{P}_{EE})^{n-1}[i, k], $ and by induction $\sum_j (\mathbf{P}_{EE})^n[i, j] \leq \epsilon^n \to 0$ as $n \to \infty$. Hence $\lim_{n \to \infty} (\mathbf{P}_{EE})^n = \mathbf{0}$: the term $(\mathbf{P}_{EE})^n \, l_u(\mathcal{E} \setminus \mathcal{V})^{(0)}$ vanishes and the initial value does not affect the limit. Therefore $ l_u(\mathcal{E} \setminus \mathcal{V}) = \lim_{n \to \infty} \left( \sum_{i=0}^{n-1} (\mathbf{P}_{EE})^i \right) \mathbf{P}_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top. $ Let $\mathbf{T} = \sum_{i=0}^{\infty} (\mathbf{P}_{EE})^i$, a geometric series of matrices; it converges because the spectral radius of $\mathbf{P}_{EE}$ is less than 1 (implied by the row sums being at most $\epsilon < 1$). Since $ \mathbf{T} - \mathbf{T} \mathbf{P}_{EE} = \sum_{i=0}^{\infty} (\mathbf{P}_{EE})^i - \sum_{i=1}^{\infty} (\mathbf{P}_{EE})^i = (\mathbf{P}_{EE})^0 = \mathbf{I}, $ we have $\mathbf{T} = (\mathbf{I} - \mathbf{P}_{EE})^{-1}$. Substituting back gives the unique fixed point $ l_u(\mathcal{E} \setminus \mathcal{V}) = (\mathbf{I} - \mathbf{P}_{EE})^{-1} \mathbf{P}_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top, $ and, combined with the fixed item labels, the converged label function is $ l_u^*(\mathcal{E}) = \begin{bmatrix} \mathbf{Y}[u, \mathcal{V}]^\top \\ (\mathbf{I} - \mathbf{P}_{EE})^{-1} \mathbf{P}_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top \end{bmatrix}. $ This shows that the iterative label propagation scheme converges to the minimum-energy label function (a small numeric check follows).
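As a sanity check of Theorem 2, the tiny NumPy experiment below (with a made-up symmetric adjacency matrix) iterates "propagate, then reset item labels" and compares the result with the closed form $(\mathbf{I} - \mathbf{P}_{EE})^{-1} \mathbf{P}_{EV} \mathbf{Y}[u, \mathcal{V}]^\top$.

```python
import numpy as np

# Hypothetical symmetric user-specific adjacency over 5 entities;
# entities 0 and 1 are items with fixed labels, entities 2-4 are non-items.
A_u = np.array([[0, 0, 1, 0, 0],
                [0, 0, 0, 1, 0],
                [1, 0, 0, 1, 1],
                [0, 1, 1, 0, 1],
                [0, 0, 1, 1, 0]], dtype=float)
y_items = np.array([1.0, 0.0])                    # observed labels of the two items
items, others = [0, 1], [2, 3, 4]

P = A_u / A_u.sum(axis=1, keepdims=True)          # row-normalized D^{-1} A_u

# Iterative scheme: propagate labels, then reset the item labels.
l = np.zeros(5)
l[items] = y_items
for _ in range(200):
    l = P @ l
    l[items] = y_items

# Closed form for the non-item entities: (I - P_EE)^{-1} P_EV y.
P_EE = P[np.ix_(others, others)]
P_EV = P[np.ix_(others, items)]
closed = np.linalg.solve(np.eye(len(others)) - P_EE, P_EV @ y_items)

print(l[others], closed)                          # the two agree to numerical precision
```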

4. Leave-One-Out Loss for Regularization: The minimum-energy label function $l_u^*$ by itself provides no signal for updating $\mathbf{A}_u$, because $l_u^*(\mathcal{V})$ is fixed to the true relevance labels and the true labels of $l_u^*(\mathcal{E} \setminus \mathcal{V})$ are unknown. To overcome this, a leave-one-out loss is proposed. For each item $v$ that user $u$ has interacted with ($y_{uv} = 1$), we temporarily hide its true label (treat it as unlabeled) and use the label propagation scheme of Theorem 2 to predict its label $\hat{l}_u(v)$ from the remaining labeled items and all other entities. The difference between the true label $y_{uv}$ and the predicted label $\hat{l}_u(v)$ then serves as a supervised signal to regularize $\mathbf{A}_u$. The regularization term is $ R(\mathbf{A}) = \sum_u R(\mathbf{A}_u) = \sum_u \sum_v J(y_{uv}, \hat{l}_u(v)), $ where $J$ is the cross-entropy loss. This regularization encourages the learned edge weights $\mathbf{A}_u$ to be such that label propagation recovers the true labels of held-out items; intuitively, it "tightens" the rubber bands (edges) in the KG so that labels propagate more effectively.
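A hedged sketch of this leave-one-out signal is shown below: for each positive item, hide its label, run label propagation on the remaining labels, and score the propagated prediction with cross-entropy. The dense propagation loop and the function names are illustrative assumptions only; the paper's implementation is differentiable and operates on sampled fixed-size neighborhoods for scalability.

```python
import numpy as np

def propagate_labels(A_u, fixed_ids, fixed_labels, n_steps=50):
    """Label propagation; labels of fixed_ids are reset after every step."""
    fixed_ids = np.asarray(fixed_ids, dtype=int)
    fixed_labels = np.asarray(fixed_labels, dtype=float)
    P = A_u / A_u.sum(axis=1, keepdims=True)
    l = np.zeros(A_u.shape[0])
    l[fixed_ids] = fixed_labels
    for _ in range(n_steps):
        l = P @ l
        l[fixed_ids] = fixed_labels
    return l

def label_smoothness_loss(A_u, item_ids, item_labels, eps=1e-8):
    """Leave-one-out cross-entropy R(A_u) for a single user."""
    loss = 0.0
    for k, v in enumerate(item_ids):
        keep = [i for i in range(len(item_ids)) if i != k]          # hold out item v
        l_hat = propagate_labels(A_u,
                                 [item_ids[i] for i in keep],
                                 [item_labels[i] for i in keep])
        p = np.clip(l_hat[v], eps, 1 - eps)                          # propagated label of v
        y = item_labels[k]
        loss += -(y * np.log(p) + (1 - y) * np.log(1 - p))           # cross-entropy J
    return loss
```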

4.2.4. The Unified Loss Function

KGNN-LS combines the knowledge-aware GNN (feature propagation) and label smoothness regularization (label propagation) into a single objective: $ \min_{\mathbf{W}, \mathbf{A}} \mathcal{L} = \min_{\mathbf{W}, \mathbf{A}} \sum_{u,v} J(y_{uv}, \hat{y}_{uv}) + \lambda R(\mathbf{A}) + \gamma \|\mathcal{F}\|_2^2. $ The terms are as follows:

  • $\sum_{u,v} J(y_{uv}, \hat{y}_{uv})$: The primary recommendation loss. It measures the discrepancy between the true user-item interaction labels $y_{uv}$ and the predicted probabilities $\hat{y}_{uv}$ from the GNN; $J$ is typically binary cross-entropy for implicit feedback. This term drives the learning of the GNN weights $\mathbf{W}$ and indirectly influences $\mathbf{A}$ through the user-specific embeddings $\mathbf{v}_u$.

  • $\lambda R(\mathbf{A})$: The label smoothness regularization term. It encourages the edge weights $\mathbf{A}$ (and thus the underlying user/relation embeddings that produce $\mathbf{A}$) to support effective label propagation, enforcing the smoothness assumption. $\lambda$ is a hyperparameter balancing its importance.

  • $\gamma \|\mathcal{F}\|_2^2$: An L2 regularizer (weight decay) on the model parameters $\mathcal{F}$ (which include $\mathbf{W}$ and the user/relation embeddings). It helps prevent overfitting by penalizing large parameter values; $\gamma$ is a hyperparameter.

    This unified loss function allows for end-to-end training where both personalized item representations and the personalized graph structure are learned jointly and robustly.
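Schematically, the three terms could be assembled as in the sketch below; `ls_reg` stands for the leave-one-out regularizer $R(\mathbf{A})$ from the previous subsection, and the function name and default hyperparameter values are illustrative assumptions only.

```python
import numpy as np

def total_loss(y_true, y_pred, ls_reg, params, lambda_=0.5, gamma=1e-7):
    """L = sum_{u,v} J(y_uv, y_hat_uv) + lambda * R(A) + gamma * ||F||_2^2 (schematic)."""
    eps = 1e-8
    p = np.clip(y_pred, eps, 1 - eps)
    ce = -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))  # binary cross-entropy
    l2 = sum(np.sum(w ** 2) for w in params)                         # L2 penalty on parameters
    return ce + lambda_ * ls_reg + gamma * l2
```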

4.2.5. Discussion (Intuition from Figure 2)

The paper provides an intuitive analogy with a physical equilibrium model (Figure 2 of the paper; see the illustration below).

  • Without KG: Items are loosely connected, mainly through collaborative filtering effects.

  • With KG: Edges in the KG act like rubber bands imposing constraints.

    • GNN effect: A single GNN layer mixes an entity's features with those of its immediate neighbors. If an item is positively rated, its neighbors are also "pulled up" (assigned higher relevance). Deeper GNNs ($L > 1$) propagate this pull further, exploring long-distance interests.
    • Personalization: The "strength" of these rubber bands (the edge weights $s_u(r)$) is user-specific and relation-specific, so the influence of KG paths is tailored to each user.
  • Label Smoothness Regularization effect: If an edge weight (rubber-band strength) is too weak, the positive signal from an observed item may not propagate effectively to its neighbors. The label smoothness regularization (via the leave-one-out loss) identifies such cases: by holding out an observed positive item and trying to predict its label from its neighbors, the regularization tightens the rubber bands (increases the edge weights) along paths that help recover the true label. This encourages the model to pull up other potentially relevant, unobserved items connected via these strengthened paths.

    The following figure from the paper provides a visual illustration of the label smoothness regularization:

    Figure (from the paper): a schematic showing, under different scenarios, how the KG and label smoothness (LS) regularization affect the decision boundary of a user's interest; colored dots and arrows denote positive/negative samples and the directions of the forces acting on them, illustrating how the KG and LS regularization adjust the boundary to improve recommendations.

The illustration shows how the label smoothness regularization helps. If a positive sample (blue circle) is held out, the model tries to predict its label. Because the true label is 1 and another positive sample (the upper-right pink circle) has the largest label value, the regularization term $R(\mathbf{A})$ increases the weights of the edges (arrows) between them, allowing the "label flow" to predict the held-out sample correctly. This effectively strengthens the connections (rubber bands) and encourages the model to raise the predicted relevance of the other connected, unobserved positive items (the two upper pink items).

5. Experimental Setup

5.1. Datasets

The experiments were conducted on four real-world datasets spanning different recommendation domains: movies, books, music, and restaurants.

  • MovieLens-20M:
    • Domain: Movie recommendations.
    • Source: Public benchmark dataset.
    • Characteristics: Approximately 20 million explicit ratings (1 to 5).
    • KG Source: Satori (commercial KG by Microsoft).
  • Book-Crossing:
    • Domain: Book recommendations.
    • Source: Public dataset.
    • Characteristics: 1 million ratings (0 to 10).
    • KG Source: Satori.
  • Last.FM:
    • Domain: Music recommendations.
    • Source: Public dataset.
    • Characteristics: Musician listening information from 2 thousand users.
    • KG Source: Satori.
  • Dianping-Food:
    • Domain: Restaurant recommendations.

    • Source: From Meituan-Dianping Group (a Chinese group buying website).

    • Characteristics: Over 10 million interactions (clicking, buying, adding to favorites) between ~2 million users and ~1 thousand restaurants.

    • KG Source: Meituan Brain (internal KG by Meituan-Dianping Group).

      Data Preprocessing:

  • Implicit Feedback Conversion: For MovieLens-20M, Book-Crossing, and Last.FM (which contain explicit ratings or listening counts), interactions were transformed into implicit feedback. A rating $\geq 4$ was considered positive for MovieLens-20M (marked as 1); no threshold was set for Book-Crossing and Last.FM due to sparsity, so any interaction counts as positive.
  • Negative Sampling: For each user, an equal number of randomly sampled uninteracted items were marked as 0 (negative samples) to balance the dataset (a short preprocessing sketch follows this list).
  • KG Construction (for public datasets):
    • A subset of triples with confidence level greater than 0.9 was extracted from Satori.

    • Item IDs (movies, books, musicians) were matched by name with tail entities in specific relations (e.g., film.film.name).

    • Items with multiple matches or no matches were excluded.

    • These matched item IDs were then used to collect all relevant triples from the Satori sub-KG where they acted as head entities.
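The sketch below, referenced above, shows one plausible way to implement the implicit-feedback conversion and negative sampling; the threshold, function name, and data layout are assumptions for illustration, not the paper's actual preprocessing code.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_implicit_with_negatives(ratings, threshold=4):
    """ratings: list of (user_id, item_id, rating) tuples.

    Ratings >= threshold become positive examples (label 1); for each user, an
    equal number of items the user never rated is sampled as negatives (label 0).
    """
    all_items = {i for _, i, _ in ratings}
    seen, positives = {}, {}
    for u, i, r in ratings:
        seen.setdefault(u, set()).add(i)
        if r >= threshold:
            positives.setdefault(u, set()).add(i)

    examples = []
    for u, pos in positives.items():
        candidates = list(all_items - seen[u])                        # never-rated items
        n_neg = min(len(pos), len(candidates))
        negatives = rng.choice(candidates, size=n_neg, replace=False) if n_neg else []
        examples += [(u, i, 1) for i in pos]
        examples += [(u, int(i), 0) for i in negatives]
    return examples
```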

      The following are the results from Table 2 of the original paper:

      Movie Book Music Restaurant
      # users 138,159 19,676 1,872 2,298,698
      # items 16,954 20,003 3,846 1,362
      # interactions 13,501,622 172,576 42,346 23,416,418
      # entities 102,569 25,787 9,366 28,115
      # relations 32 18 60 7
      # KG triples 499,474 60,787 15,518 160,519

The datasets were chosen to represent diverse domains and scales, allowing for comprehensive validation of the method's performance across different complexities of user-item interactions and knowledge graph structures. The Dianping-Food dataset is particularly notable for its large number of users but relatively small number of items, which can emphasize the utility of KG information for differentiation.

The connection between the KG and user-item interactions is further validated by Figure 3.

Figure 3: Probability distribution of the shortest-path distance between two randomly sampled items in the KG when (1) they have no common user in the dataset, and (2) they have at least one common user.

This figure shows the probability distribution of the shortest path distance between two randomly sampled items in the KG. It distinguishes between item pairs that have no common user in the dataset and those that have at least one common user. The results empirically demonstrate that items with common users are significantly more likely to be closer in the KG (e.g., within 2 hops) compared to items without common users. This finding supports the core motivation to use KG proximity to infer item relatedness and justifies the label smoothness regularization.

5.2. Evaluation Metrics

The paper evaluates the proposed method using two common types of metrics in recommender systems: Recall@K for top-K recommendation and AUC for click-through rate (CTR) prediction.

5.2.1. Recall@K

  • Conceptual Definition: Recall@K measures the proportion of relevant items (i.e., items a user has interacted with or is expected to interact with) that are successfully recommended within the top KK items. It is a common metric for evaluating the effectiveness of a recommender system in retrieving all relevant items for a user. A higher Recall@K indicates that the system is better at finding and presenting the items that the user likes.
  • Mathematical Formula: $ \mathrm{Recall@K} = \frac{1}{|\mathcal{U}_{test}|} \sum_{u \in \mathcal{U}_{test}} \frac{|\text{Recommended}_{u,K} \cap \text{Relevant}_u|}{|\text{Relevant}_u|} $
  • Symbol Explanation:
    • Utest|\mathcal{U}_{test}|: The total number of users in the test set.
    • Recommendedu,K\text{Recommended}_{u,K}: The set of top KK items recommended by the system for user uu.
    • Relevantu\text{Relevant}_u: The set of items that user uu actually interacted with (or found relevant) in the test set.
    • \cap: The intersection operator, meaning items present in both sets.
    • |\cdot|: The cardinality (number of elements) of a set.
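A minimal per-user helper matching the formula above (the averaging over test users is omitted for brevity); the example numbers are made up.

```python
def recall_at_k(recommended, relevant, k):
    """recommended: ranked list of item ids; relevant: set of ground-truth items."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)

# Example: 2 of the 3 relevant items appear in the top-5 recommendations.
print(recall_at_k([7, 3, 9, 1, 4, 8], {3, 4, 5}, k=5))   # -> 0.666...
```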

5.2.2. AUC (Area Under the ROC Curve)

  • Conceptual Definition: AUC (Area Under the Receiver Operating Characteristic Curve) is a performance metric for binary classification problems (like CTR prediction, where the task is to predict if a user will click/interact with an item). It represents the probability that a randomly chosen positive instance (e.g., an item a user will click) will be ranked higher than a randomly chosen negative instance (e.g., an item a user will not click) by the model. An AUC of 1.0 indicates perfect classification, while 0.5 indicates performance no better than random guessing.
  • Mathematical Formula: $ \mathrm{AUC} = \frac{\sum_{i \in \text{Positive}} \sum_{j \in \text{Negative}} \left[ \mathbf{1}(\mathrm{score}(i) > \mathrm{score}(j)) + 0.5 \cdot \mathbf{1}(\mathrm{score}(i) = \mathrm{score}(j)) \right]}{|\text{Positive}| \cdot |\text{Negative}|} $
  • Symbol Explanation:
    • Positive\text{Positive}: The set of positive samples (e.g., items clicked by a user).
    • Negative\text{Negative}: The set of negative samples (e.g., items not clicked by a user).
    • score(i)\mathrm{score}(i): The predicted score (e.g., click probability) of item ii.
    • 1()\mathbf{1}(\cdot): The indicator function, which returns 1 if the condition inside is true, and 0 otherwise.
    • Positive|\text{Positive}|: The number of positive samples.
    • Negative|\text{Negative}|: The number of negative samples.
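The pairwise definition above translates directly into a brute-force check (O(|Positive|·|Negative|)); real evaluations typically use a rank-based formula or an off-the-shelf routine such as sklearn's roc_auc_score. The numbers below are made up.

```python
def auc(scores_pos, scores_neg):
    """Pairwise AUC: P(random positive is scored higher than random negative)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

print(auc([0.9, 0.8, 0.4], [0.5, 0.3]))   # -> 5/6 = 0.833...
```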

5.3. Baselines

The proposed KGNN-LS model was compared against several state-of-the-art baselines, including both KG-free and KG-aware methods.

  • KG-Free Baselines:

    • SVD [12] (Singular Value Decomposition): A classic collaborative filtering (CF) model that uses matrix factorization to decompose the user-item interaction matrix into lower-rank user and item latent factor matrices. The paper uses an unbiased version in which the predicted engaging probability is modeled as an inner product of user and item embeddings: $\hat{y}_{uv} = \mathbf{u}^\top \mathbf{v}$.
    • LibFM [16] (Factorization Machines with LibFM): A widely used feature-based factorization model for CTR prediction. It models interactions between variables using a factorization approach. For this paper, user ID and item ID are concatenated as input.
  • KG-Aware Baselines:

    • LibFM + TransE: An extension of LibFM where entity representations learned by TransE [2] are attached to each user-item pair. This incorporates structural knowledge from the KG directly into LibFM's feature set.

    • PER [33] (Personalized Entity Recommendation): A path-based method. It treats the KG as a heterogeneous information network and extracts meta-path based features to represent connectivity between users and items. The paper specifies manually designed meta-paths for each dataset (e.g., "user-movie-director-movie" for MovieLens-20M).

    • CKE [34] (Collaborative Knowledge Base Embedding): An embedding-based method that combines collaborative filtering with structural knowledge from KGs. It jointly learns user/item embeddings from CF and entity/relation embeddings from KGs.

    • RippleNet [24]: A hybrid method that uses a memory-network-like approach to propagate user preferences on the KG for recommendation. It learns user embeddings by iteratively expanding "ripples" of user preferences over the KG.

      Hyper-parameter Settings: In KGNN-LS, both $g$ and $f$ are set to the inner product, and the activation $\sigma$ is ReLU for all non-final layers and tanh for the final layer. To handle varying neighborhood sizes in the KG efficiently, a fixed-size set of $S$ neighbors is uniformly sampled for each entity. Hyperparameters for KGNN-LS are chosen by optimizing R@10 on a validation set; the search spaces for $S$, $d$ (hidden-layer dimension), $L$ (number of layers), $\lambda$ (label smoothness regularizer weight), $\gamma$ (L2 regularizer weight), and $\eta$ (learning rate) are given in Appendix B of the paper. For the baselines, the specific dimensions and learning rates are listed in Section 5.2 of the paper.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that KGNN-LS significantly outperforms state-of-the-art baselines in both top-K recommendation and CTR prediction across all four datasets.

The following are the results from Table 3 of the original paper:

Model MovieLens-20M Book-Crossing Last.FM Dianping-Food
R@2 R@10 R@50 R@100 R@2 R@10 R@50 R@100 R@2 R@10 R@50 R@100 R@2 R@10 R@50 R@100
SVD 0.036 0.124 0.277 0.401 0.027 0.046 0.077 0.109 0.029 0.098 0.240 0.332 0.039 0.152 0.329 0.451
LibFM 0.039 0.121 0.271 0.388 0.033 0.062 0.092 0.124 0.030 0.103 0.263 0.330 0.043 0.156 0.332 0.448
LibFM + TransE 0.041 0.125 0.280 0.396 0.037 0.064 0.097 0.130 0.032 0.102 0.259 0.326 0.044 0.161 0.343 0.455
PER 0.022 0.077 0.160 0.243 0.022 0.041 0.064 0.070 0.014 0.052 0.116 0.176 0.023 0.102 0.256 0.354
CKE 0.034 0.107 0.244 0.322 0.028 0.051 0.079 0.112 0.023 0.070 0.180 0.296 0.034 0.138 0.305 0.437
RippleNet 0.045 0.130 0.278 0.447 0.036 0.074 0.107 0.127 0.032 0.101 0.242 0.336 0.040 0.155 0.328 0.440
KGNN-LS 0.043 0.155 0.321 0.458 0.045 0.082 0.117 0.149 0.044 0.122 0.277 0.370 0.047 0.170 0.340 0.487

Table 3: Recall@K in Top-K Recommendation

  • KGNN-LS consistently achieves the highest Recall@K values across different $K$ (2, 10, 50, 100) and all four datasets.

  • For example, on MovieLens-20M, KGNN-LS achieves R@100 of 0.458, surpassing RippleNet (0.447) and other baselines.

  • On Last.FM, KGNN-LS reaches R@100 of 0.370, significantly higher than RippleNet (0.336) and LibFM + TransE (0.326).

  • The PER baseline, a path-based method, performs notably worse than most others, highlighting the difficulty of manual meta-path engineering.

  • CKE, an embedding-based method, also performs less competitively, suggesting that pre-trained KG embeddings might not fully align with the recommendation task without deeper integration.

    The following are the results from Table 4 of the original paper:

    Model Movie Book Music Restaurant
    SVD 0.963 0.672 0.769 0.838
    LibFM 0.959 0.691 0.778 0.837
    LibFM + TransE 0.966 0.698 0.777 0.839
    PER 0.832 0.617 0.633 0.746
    CKE 0.924 0.677 0.744 0.802
    RippleNet 0.960 0.727 0.770 0.833
    KGNN-LS 0.979 0.744 0.803 0.850

Table 4: AUC in CTR Prediction

  • KGNN-LS consistently achieves the highest AUC scores across all datasets, demonstrating its superior performance in predicting user-item interactions.

  • The average AUC gains over the baselines are substantial: 5.1% on MovieLens-20M, 6.9% on Book-Crossing, 8.3% on Last.FM, and 4.3% on Dianping-Food. These figures appear to be relative improvements, (KGNN-LS AUC − baseline AUC) / baseline AUC, averaged over all baselines; with the Table 4 numbers this reproduces, for example, the 5.1% figure on MovieLens-20M (see the short calculation at the end of this subsection).

  • The strong performance of LibFM + TransE compared to LibFM alone indicates the value of incorporating KG embeddings, but KGNN-LS goes further by integrating GNNs and label smoothness.

    Daily Performance Stability: The following figure from the paper shows the daily AUC of all methods on Dianping-Food in September 2018.

Figure 4: Daily AUC of all methods on Dianping-Food in September 2018. The figure compares LibFM+TransE, PER, CKE, RippleNet, and KGNN-LS; KGNN-LS performs best overall.

Figure 4 illustrates the daily AUC performance of KGNN-LS and baselines on the Dianping-Food dataset over a month. KGNN-LS consistently maintains the highest AUC curve, indicating superior performance over time. Moreover, its curve shows low variance, suggesting that KGNN-LS is robust and stable in practical, real-world deployment scenarios, an important factor for recommender systems.
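Returning to the average-gain claim above, the short calculation below averages the per-baseline relative AUC improvements for the MovieLens-20M column of Table 4; it reproduces the reported 5.1%.

```python
baselines = [0.963, 0.959, 0.966, 0.832, 0.924, 0.960]   # Table 4, Movie column
kgnn_ls = 0.979
gains = [(kgnn_ls - b) / b for b in baselines]
print(sum(gains) / len(gains))                           # ~0.051, i.e. the reported 5.1%
```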

6.2. Effectiveness of LS Regularization

The paper conducts an ablation study to verify the effectiveness of the label smoothness (LS) regularization. The following figure from the paper shows the effectiveness of LS regularization on Last.FM.

Figure 5: Effectiveness of LS regularization on Last.FM. The figure shows how R@10 varies with the LS regularization weight λ for different hidden dimensions d, highlighting the performance differences across d.

Figure 5 plots R@10 on the Last.FM dataset as the LS regularization weight $\lambda$ varies from 0 to 5, for different hidden-layer dimensions ($d = 4, 8, 16$).

  • When $\lambda = 0$, the LS regularization term is removed entirely. This scenario corresponds to a KGNN without the label smoothness component, similar to the approach by Wang et al. [28].
  • The results clearly show that KGNN-LS with a non-zero $\lambda$ consistently outperforms the $\lambda = 0$ case. This empirically validates the claim that LS regularization is crucial for assisting the learning of edge weights in the KG and achieving better generalization.
  • The optimal $\lambda$ typically lies in a moderate range, such as 0.1 to 1.0 in most cases: a $\lambda$ that is too large lets the LS term dominate the overall loss, distorting the gradients and leading to sub-optimal performance. A sketch of how $\lambda$ enters the training objective follows this list.
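The role of $\lambda$ is easiest to see in the combined objective. Below is a minimal PyTorch-style sketch, assuming a binary cross-entropy recommendation loss and treating the LS term as a precomputed penalty; the function and argument names are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def total_loss(y_true, y_pred, ls_penalty, params, lam, gamma=1e-7):
    """Recommendation loss + lambda * label-smoothness penalty + L2 weight decay.

    ls_penalty stands in for the label smoothness regularizer, i.e. the error
    made when each observed label is held out and re-predicted from its KG
    neighbors via label propagation (treated as a black box here).
    """
    base = F.binary_cross_entropy(y_pred, y_true)   # CTR-style base loss
    l2 = sum((p ** 2).sum() for p in params)        # weight decay term
    return base + lam * ls_penalty + gamma * l2

# lam = 0 recovers a plain KGNN; an overly large lam lets the LS term
# dominate the gradients, which matches the degradation seen in Figure 5.
```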

6.3. Results in Cold-Start Scenarios

One of the primary motivations for using KGs in recommender systems is to alleviate the sparsity and cold-start issues. The paper evaluates KGNN-LS's performance in cold-start scenarios by varying the size of the training set on MovieLens-20M.

The following are the results from Table 5 of the original paper:

| Model \ r | 20% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|
| SVD | 0.882 | 0.913 | 0.938 | 0.955 | 0.963 |
| LibFM | 0.902 | 0.923 | 0.938 | 0.950 | 0.959 |
| LibFM+TransE | 0.914 | 0.935 | 0.949 | 0.960 | 0.966 |
| PER | 0.802 | 0.814 | 0.821 | 0.828 | 0.832 |
| CKE | 0.898 | 0.910 | 0.916 | 0.921 | 0.924 |
| RippleNet | 0.921 | 0.937 | 0.947 | 0.955 | 0.960 |
| KGNN-LS | 0.961 | 0.970 | 0.974 | 0.977 | 0.979 |

Table 5: AUC with varying training set ratio ($r$) on MovieLens-20M

  • The table shows the AUC performance as the training data ratio $r$ decreases from 100% (full training data) to 20% (highly sparse scenario).
  • As expected, the performance of all models degrades with less training data.
  • However, KGNN-LS demonstrates remarkable resilience. When $r = 20\%$, the AUC decrease for baselines ranges from 2.8% (CKE) to 8.4% (SVD) relative to their performance with 100% of the training data, whereas KGNN-LS loses only 1.8% (from 0.979 to 0.961); these drops are reproduced in the snippet after this list.
  • This strong performance in cold-start scenarios highlights KGNN-LS's ability to effectively leverage the rich structural and semantic information from the KG to compensate for sparse user-item interactions, which is a significant advantage for practical recommender systems.
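The resilience claim can be checked directly from Table 5. The snippet below computes the relative AUC drop of each model when the training ratio shrinks from 100% to 20%; it is a worked check, not code from the paper.

```python
# AUC at r = 100% and r = 20% from Table 5 (MovieLens-20M).
auc_100 = {"SVD": 0.963, "LibFM": 0.959, "LibFM+TransE": 0.966, "PER": 0.832,
           "CKE": 0.924, "RippleNet": 0.960, "KGNN-LS": 0.979}
auc_20 = {"SVD": 0.882, "LibFM": 0.902, "LibFM+TransE": 0.914, "PER": 0.802,
          "CKE": 0.898, "RippleNet": 0.921, "KGNN-LS": 0.961}

for model, full in auc_100.items():
    drop = (full - auc_20[model]) / full
    print(f"{model}: {drop:.1%}")
# SVD drops ~8.4% and CKE ~2.8%, while KGNN-LS drops only ~1.8%.
```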

6.4. Hyper-parameters Sensitivity

The paper also investigates the impact of key hyperparameters on KGNN-LS's performance.

6.4.1. Number of GNN Layers ($L$)

The following are the results from Table 6 of the original paper:

| Dataset \ L | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| MovieLens-20M | 0.155 | 0.146 | 0.122 | 0.011 |
| Book-Crossing | 0.077 | 0.082 | 0.043 | 0.008 |
| Last.FM | 0.122 | 0.106 | 0.105 | 0.057 |
| Dianping-Food | 0.165 | 0.170 | 0.061 | 0.036 |

Table 6: R@10 with varying number of layers L

  • The table shows R@10 for different numbers of GNN layers ($L = 1$ to $4$).
  • KGNN-LS generally performs best with a small number of layers, typically $L = 1$ or $L = 2$. For instance, on MovieLens-20M and Last.FM, $L = 1$ yields the best R@10, while on Book-Crossing and Dianping-Food, $L = 2$ is optimal.
  • A significant drop in performance is observed for larger $L$ (e.g., $L = 4$). This is the common GNN phenomenon of over-smoothing: stacking more layers makes node representations increasingly similar to those of their ever more distant neighbors until they become indistinguishable. In the context of KGs, a larger $L$ mixes too many entity embeddings, over-smoothing the learned representations and hindering the model's ability to capture fine-grained distinctions and personalized interests. A toy illustration of over-smoothing follows this list.
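A toy, paper-independent illustration of over-smoothing: repeatedly averaging each node's features with its neighbors' drives all node representations toward the same vector. The adjacency matrix and feature sizes below are arbitrary.

```python
import numpy as np

# A small path graph on 5 nodes, with self-loops added before normalization.
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(5)
P = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalized mean aggregator

rng = np.random.default_rng(0)
X = rng.random((5, 8))                         # random initial node features
for layer in range(1, 9):
    X = P @ X                                  # one round of neighborhood averaging
    spread = X.std(axis=0).mean()              # how different the nodes still are
    print(layer, round(float(spread), 4))
# The spread shrinks toward zero: after many layers the node embeddings are
# nearly identical, mirroring the drop in R@10 at L = 4 in Table 6.
```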

6.4.2. Dimension of Hidden Layers ($d$)

The following are the results from Table 7 of the original paper:

| Dataset \ d | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|
| MovieLens-20M | 0.134 | 0.141 | 0.143 | 0.155 | 0.155 | 0.151 |
| Book-Crossing | 0.065 | 0.073 | 0.077 | 0.081 | 0.082 | 0.080 |
| Last.FM | 0.111 | 0.116 | 0.122 | 0.109 | 0.102 | 0.107 |
| Dianping-Food | 0.155 | 0.170 | 0.167 | 0.166 | 0.163 | 0.161 |

Table 7: R@10 with varying dimension of hidden layers d

  • The table shows R@10 for different dimensions of the hidden layers ($d$).
  • Performance generally improves as $d$ increases at first (e.g., from $d = 4$ up to $d = 32$ or $d = 64$), since higher dimensions give the model more capacity to capture complex features.
  • However, performance drops when $d$ becomes too large: at $d = 128$ for MovieLens-20M and Book-Crossing, from $d = 32$ onward for Last.FM, and from $d = 16$ onward for Dianping-Food. This suggests that an overly large dimension leads to overfitting on the training data and harms generalization to unseen data.
  • The optimal dimension typically lies in the range $d = 8 \sim 64$, balancing model capacity against overfitting.

6.5. Running Time Analysis

The scalability of a recommender system is critical for real-world applications dealing with large KGs. The paper evaluates the running time of KGNN-LS and baselines as the KG size increases. The following figure from the paper shows the running time of all methods w.r.t. KG size on MovieLens-20M.

Figure 6: Running time of all methods w.r.t. KG size on MovieLens-20M. The figure shows how running time grows with the number of KG triples on MovieLens-20M; KGNN-LS scales more efficiently than RippleNet and the other methods at larger KG sizes.

Figure 6 presents the running time of various methods on MovieLens-20M as the number of KG triples is scaled up (by extracting more triples from Satori).

  • The figure shows that KGNN-LS exhibits strong scalability. Its running time increases gracefully with the KG size, indicating efficient handling of larger knowledge graphs.
  • Notably, KGNN-LS appears more efficient than RippleNet as the KG grows, which is consistent with RippleNet's memory-network-style propagation over multi-hop neighborhoods becoming expensive on large graphs.
  • While absolute running times depend on hardware and configuration, the trend clearly demonstrates that KGNN-LS handles increasing KG sizes without prohibitive computational cost; a sketch of the kind of neighbor sampling that keeps per-item cost bounded follows this list. This is a crucial practical advantage for deploying KG-aware recommender systems.
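Part of this scalability plausibly comes from fixed-size neighborhood sampling, as used in KGCN-style models: each entity aggregates from at most S sampled neighbors per layer, so the work per item depends on S and L rather than on the total number of triples. The sketch below is an illustration under that assumption, with invented function names, not the authors' implementation.

```python
import random

def sample_receptive_field(item, kg_neighbors, num_layers, sample_size):
    """Collect the entities an L-layer GNN touches for one item when at most
    `sample_size` neighbors are sampled per entity per hop."""
    frontier, visited = {item}, {item}
    for _ in range(num_layers):
        next_frontier = set()
        for entity in frontier:
            neighbors = kg_neighbors.get(entity, [])
            k = min(sample_size, len(neighbors))
            next_frontier.update(random.sample(neighbors, k))
        visited |= next_frontier
        frontier = next_frontier
    return visited

# With L layers and S sampled neighbors per entity, the receptive field holds at
# most 1 + S + S**2 + ... + S**L entities, independent of the KG's total size.
```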

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Knowledge-aware Graph Neural Networks with Label Smoothness regularization (KGNN-LS), a novel and effective model for recommender systems. KGNN-LS addresses key limitations of previous KG-aware recommendation approaches by offering an end-to-end trainable framework that avoids manual feature engineering and enhances scalability. Its core innovation lies in:

  1. User-specific personalized graph construction: Dynamically transforming a heterogeneous knowledge graph into a user-specific weighted graph using a trainable relation scoring function.

  2. Graph Neural Network for personalized embeddings: Applying a GNN on this personalized graph to generate rich, user-specific item embeddings by aggregating local neighborhood information.

  3. Label Smoothness regularization: Introducing a principled regularization technique that leverages the label smoothness assumption, effectively constraining the learnable edge weights, and proving its equivalence to label propagation (recalled in the equation below). This regularization mitigates overfitting and provides a stronger inductive bias.
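For reference, one standard way to write the label smoothness assumption behind point 3, consistent with how this analysis describes the model (the notation here is mine and may differ cosmetically from the paper's), is as an energy over the user-specific edge weights:

$$
E(\mathbf{A}_u, \mathbf{l}_u) \;=\; \frac{1}{2} \sum_{e_i, e_j} A_u(e_i, e_j)\, \big( l_u(e_i) - l_u(e_j) \big)^2 .
$$

Minimizing this energy over the unobserved labels recovers the label propagation scheme referenced above; the regularizer then penalizes, for each observed label, the gap between that label and the value propagation predicts when the label is held out.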

    Extensive experiments on four diverse real-world datasets consistently show that KGNN-LS significantly outperforms state-of-the-art baselines in both top-K recommendation and CTR prediction. Furthermore, KGNN-LS demonstrates robust performance in cold-start scenarios with sparse user-item interactions and exhibits strong scalability with respect to increasing knowledge graph sizes.

7.2. Limitations & Future Work

The authors identify several promising directions for future research:

  • Broader application of LS regularization: The paper notes that LS regularization was specifically proposed for recommendation with KGs. Future work could examine its effectiveness on other graph tasks, such as link prediction (predicting missing edges in a graph) and node classification (assigning labels to nodes based on their features and connections).
  • Theoretical relationship between feature propagation and label propagation: The paper highlights that the GNN component performs feature propagation while LS regularization implicitly implements label propagation. Investigating the theoretical connections and interplay between these two propagation mechanisms within a unified framework is suggested as a promising research direction.

7.3. Personal Insights & Critique

The KGNN-LS paper presents a highly innovative and practical solution for integrating knowledge graphs into recommender systems. My personal insights and critique include:

Innovations and Strengths:

  • Principled Regularization: The label smoothness regularization is a standout feature. It's not just an ad-hoc penalty but is theoretically grounded in label propagation and directly addresses the overfitting challenge posed by learnable, user-specific graph structures. This provides a strong inductive bias that adjacent items in a semantically meaningful graph should have similar relevance, which is intuitively appealing.

  • User-Specific GNNs: The dynamic construction of a user-specific weighted graph, combined with GNNs, is a powerful way to personalize recommendations. This moves beyond static KG embeddings or generic graph convolutions by explicitly modeling how each user perceives the importance of different relations.

  • End-to-End Trainability: The unified loss function enables all components to be jointly optimized, which is crucial for maximizing performance and adaptability, as opposed to multi-stage approaches.

  • Cold-Start Robustness: The demonstrated effectiveness in cold-start scenarios is a major advantage. By leveraging the rich information in KGs, KGNN-LS can make meaningful recommendations even when explicit user-item interactions are scarce.

  • Scalability: The efficient implementation and good scalability are critical for real-world deployment on large-scale platforms.

    Potential Issues and Areas for Improvement/Future Exploration:

  • Complexity of the $g$ function: The user-specific relation scoring function $g(\mathbf{u}, \mathbf{r})$ is defined generally as a differentiable function (e.g., inner product or MLP). The choice of $g$ could significantly impact the model's ability to capture complex user preferences for relations. Exploring more sophisticated, interpretable, or dynamic forms for $g$ could be beneficial.

  • Computational Cost of Label Propagation: While the overall system shows good scalability, the label propagation step behind the regularizer requires either iterative matrix-vector updates or the explicit inverse $(\mathbf{I} - \mathbf{P}_{EE})^{-1}$, which can be expensive for very large KGs and many users unless it is approximated or carefully optimized. The paper's "efficient implementation" presumably handles this, but the details of how the regularization term is computed at scale would be worth examining further; a sketch of an iterative alternative to the explicit inverse appears after this list.

  • Interpretability of $s_u(r)$: While the model learns relation importance, a deeper analysis into why certain relations are important for specific users (e.g., through attention mechanisms or more transparent $g$ functions) could enhance interpretability, which is increasingly valued in recommender systems.

  • Feature Engineering for $\mathbf{E}$: The initial entity features $\mathbf{E}$ are crucial. The paper mentions "raw entity features." The quality and type of these features (e.g., textual descriptions, categorical attributes) can greatly influence the GNN's ability to learn meaningful representations. Investigating optimal ways to obtain or pre-process these initial features could be valuable.

  • Dynamic KGs: Most KGs are dynamic, with new entities and relations constantly being added. The current model assumes a relatively static KG structure during training. Adapting KGNN-LS to handle dynamically evolving KGs efficiently would be a practical extension.
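On the label propagation cost raised above: the closed-form inverse need not be materialized, because the same fixed point can be approached by iterating the propagation update, which is how such schemes are typically implemented at scale. A minimal sketch under that assumption (matrix names follow the partitioned notation used above; this is not the paper's code):

```python
import numpy as np

def propagate_labels(P_EE, P_EL, y_L, num_iters=20):
    """Approximate l_E = (I - P_EE)^{-1} P_EL y_L by fixed-point iteration,
    avoiding the explicit matrix inverse."""
    l_E = np.zeros(P_EE.shape[0])
    for _ in range(num_iters):
        l_E = P_EE @ l_E + P_EL @ y_L   # one label propagation sweep
    return l_E

# The iteration converges whenever the spectral radius of P_EE is below 1,
# which typically holds when every entity can reach a labeled item in the graph.
```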

    Transferability and Applicability: The core ideas of KGNN-LS are highly transferable.

  • Beyond Recommendations: The concept of a user-specific graph transformation combined with GNNs and principled graph-based regularization could be applied to other personalized graph-based tasks, such as personalized knowledge graph completion, personalized link prediction (e.g., predicting personalized social connections based on common interests derived from a KG), or personalized entity linking.

  • Different Domains: The method is general enough to be applied to any domain where structured knowledge is available and personalization is desired, provided a suitable KG can be constructed or accessed. Examples include personalized search, educational content recommendation, or drug discovery (personalized medicine, where KGs can represent drug-protein interactions, and user preferences might be patient-specific biological profiles).

    Overall, KGNN-LS offers a robust and theoretically sound framework that significantly advances the state-of-the-art in KG-aware recommender systems, providing a solid foundation for future research in personalized graph machine learning.
