
Understanding Negative Sampling in Knowledge Graph Embedding

Published: 01/31/2021
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper discusses negative sampling in knowledge graph embedding, highlighting its importance in training. It categorizes negative sampling methods into static, dynamic, and custom clustering approaches, offering new insights to enhance knowledge representation in recommendation and other downstream tasks.

Abstract

Knowledge graph embedding (KGE) is to project entities and relations of a knowledge graph (KG) into a low-dimensional vector space, which has made steady progress in recent years. Conventional KGE methods, especially translational distance-based models, are trained through discriminating positive samples from negative ones. Most KGs store only positive samples for space efficiency. Negative sampling thus plays a crucial role in encoding triples of a KG. The quality of generated negative samples has a direct impact on the performance of learnt knowledge representation in a myriad of downstream tasks, such as recommendation, link prediction and node classification. We summarize current negative sampling approaches in KGE into three categories, static distribution-based, dynamic distribution-based and custom cluster-based respectively. Based on this categorization we discuss the most prevalent existing approaches and their characteristics. It is a hope that this review can provide some guidelines for new thoughts about negative sampling in KGE.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The paper's title, "Understanding Negative Sampling in Knowledge Graph Embedding", states its central topic.

1.2. Authors

The authors of the paper are:

  • Jing Qian (Affiliated with Department of Intelligent Science, School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China, and Department of Computer Science, University of Liverpool, United Kingdom)
  • Gangmin Li (Affiliated with Department of Intelligent Science, School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China)
  • Katie Atkinson (Affiliated with Department of Computer Science, University of Liverpool, United Kingdom)
  • Yong Yue (Affiliated with Department of Intelligent Science, School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China)

1.3. Journal/Conference

The paper was published in the International Journal of Artificial Intelligence and Applications (IJAIA), Vol.12, No.1, January 2021. This is a peer-reviewed journal in the field of artificial intelligence, indicating its relevance and contribution to academic discourse in AI.

1.4. Publication Year

The paper was published on January 31, 2021.

1.5. Abstract

Knowledge graph embedding (KGE) involves projecting entities and relations of a knowledge graph (KG) into a low-dimensional vector space. Many KGE methods, particularly translational distance-based models, are trained by distinguishing positive (true) samples from negative (false) ones. Since most KGs only store positive samples for efficiency, negative sampling is crucial for encoding triples. The quality of these generated negative samples directly influences the performance of the learned knowledge representation in downstream tasks such as recommendation, link prediction, and node classification. The paper categorizes current negative sampling approaches in KGE into three types: static distribution-based, dynamic distribution-based, and custom cluster-based. It discusses prevalent approaches and their characteristics within this categorization, aiming to guide future research in negative sampling for KGE.

The original source link is /files/papers/69354df00a9b802059199f26/paper.pdf. Given the publication details, it is an officially published paper.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: Knowledge graphs (KGs) store facts as (head entity, relation, tail entity) triples. To leverage machine learning techniques, these symbolic representations need to be converted into numerical vector spaces, a process called knowledge graph embedding (KGE). Many KGE models, especially translational distance-based models, rely on distinguishing true facts (positive samples) from false ones (negative samples) during training. However, KGs typically only store positive facts for storage efficiency. The challenge lies in effectively generating negative samples that are informative enough to train robust KGE models without introducing misleading or trivial information.
  • Problem Importance: The quality of negative samples directly impacts the learned knowledge representations. Poorly generated negative samples can lead to zero loss problems (where the model easily distinguishes true from false, learning little), false negatives (where a generated negative sample is actually a true but unobserved fact), or vanishing gradients, all of which hinder model training and lead to suboptimal performance in critical downstream tasks like link prediction (predicting missing links in a KG), recommendation systems, and node classification.
  • Challenges/Gaps: While KGE research has largely focused on developing novel embedding models and scoring functions, the crucial aspect of negative sampling has been relatively underexplored and often relies on simple random sampling. Existing surveys on KGE mention negative sampling only briefly. There's a need for a systematic overview and a deeper understanding of various negative sampling strategies and their implications.
  • Paper's Entry Point: This paper addresses this gap by providing the first systematic and exhaustive overview of existing negative sampling approaches in KGE. It categorizes them based on their sample source and discusses their characteristics, aiming to shed light on this under-appreciated but critical component of KGE training.

2.2. Main Contributions / Findings

  • Primary Contributions:
    • Comprehensive Categorization: The paper systematically categorizes existing negative sampling approaches in KGE into three main types: static distribution-based, dynamic distribution-based, and custom cluster-based. This provides a structured framework for understanding the landscape of negative sampling.
    • Detailed Review of Approaches: It discusses the most prevalent methods within each category, explaining their mechanisms, advantages, and disadvantages.
    • Highlighting Importance: The paper strongly argues for the critical role of negative sampling in KGE training, emphasizing that it is often overlooked compared to the development of new embedding models.
    • Guidance for Future Research: By summarizing and categorizing existing methods, the paper aims to provide guidelines and stimulate new thoughts for future research in negative sampling, suggesting areas for improvement and further exploration.
  • Key Conclusions/Findings:
    • Negative sampling is a vital, yet under-explored, component of KGE model training, significantly impacting the quality of learned representations.
    • Simple uniform sampling often produces low-quality negatives, leading to issues like zero loss and false negatives.
    • More sophisticated methods, such as Bernoulli sampling, GAN-based sampling, Markov chain Monte Carlo negative sampling (MCNS), and custom cluster-based sampling, attempt to generate higher-quality, more informative negative samples by considering relation properties, dynamic distributions, or entity similarities.
    • There is a clear trend towards more dynamic and context-aware negative sampling strategies to overcome the limitations of static random sampling.
    • The paper implicitly concludes that focusing solely on embedding model architectures without considering the quality of negative samples can lead to suboptimal KGE performance.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Knowledge Graph (KG): A knowledge graph (KG) is a structured representation of information that models real-world entities and their relationships in a graph format. Nodes in the graph represent entities (e.g., "London," "United Kingdom," "Barack Obama"), and edges represent relations (e.g., "locatedIn," "Gender," "PresidentOf") between these entities. Facts in a KG are typically expressed as triples in the form (head entity, relation, tail entity) or (subject, predicate, object). For example, (London, locatedIn, UnitedKingdom). KGs like Freebase, NELL, and YAGO are widely used.
  • Knowledge Graph Embedding (KGE): Knowledge graph embedding (KGE) is the process of mapping entities and relations from a symbolic knowledge graph into a low-dimensional continuous vector space (i.e., numerical vectors or "embeddings"). The goal is to capture the semantic meanings and structural properties of the KG within these vector representations, making them amenable to machine learning algorithms. For instance, "London" might be represented as a vector $\mathbf{h}_{\text{London}}$, and "locatedIn" as a vector $\mathbf{r}_{\text{locatedIn}}$.
  • Translational Distance-based Models: These are a popular family of KGE models where relations are conceptualized as "translations" in the embedding space. The core idea is that if a triple (h, r, t) is true, then the embedding of the head entity $\mathbf{h}$ plus the embedding of the relation $\mathbf{r}$ should be close to the embedding of the tail entity $\mathbf{t}$. Mathematically, this is often expressed as $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$. Examples include TransE, TransH, TransR.
  • Scoring Function: In KGE models, a scoring function $f_r(h, t)$ (or $f(h, r, t)$) is a mathematical function that takes a triple (h, r, t) as input and outputs a score indicating its plausibility or truthfulness. A higher score typically means the triple is more likely to be true. Models are trained to assign high scores to positive (true) triples and low scores to negative (false) triples.
  • Positive Samples: These are the ground-truth facts directly observed and stored in the knowledge graph. They represent true relationships between entities.
  • Negative Samples: These are false facts generated by corrupting positive samples. For a positive triple (h, r, t), a negative sample might be (h', r, t) (replacing the head), (h, r, t') (replacing the tail), or (h, r', t) (replacing the relation), where $h'$ or $t'$ are other entities, or $r'$ is another relation, such that the corrupted triple is unlikely to be true. These are crucial for teaching the model what is not true.
  • Noise Contrastive Estimation (NCE): Noise Contrastive Estimation (NCE) is a statistical method used for estimating parameters of unnormalized probability models (models where the normalization constant, or partition function, is hard to compute). Instead of directly estimating the probability distribution, NCE rephrases the problem as a binary classification task: distinguishing between "true" data samples (positive samples) and "noise" samples (negative samples) drawn from a known noise distribution. This significantly reduces computational complexity, especially for large vocabularies or continuous spaces.
  • Negative Sampling (in word2vec context): Negative sampling is a simplification of NCE popularized by word2vec. In word2vec, it's used to efficiently train word embeddings. Instead of updating all weights for every word in the vocabulary, it only updates weights for the positive word in context and a small number of randomly sampled negative words. This speeds up training considerably. KGE adopts this concept for triples.
  • Open World Assumption (OWA) vs. Closed World Assumption (CWA):
    • Closed World Assumption (CWA): States that any fact not explicitly present in the KG is considered false. This assumes KGs are complete, which is rarely true for real-world KGs.
    • Open World Assumption (OWA): States that facts not explicitly present in the KG are simply unknown and could be either true or false. This is a more realistic assumption for incomplete KGs and is generally preferred in KGE. The paper notes that KGE models are typically trained under OWA due to the inherent incompleteness of KGs.
  • Zero Loss Problem: This occurs when the generated negative samples are too obviously false or easily distinguishable from positive samples. In such cases, the KGE model quickly learns to assign very high scores to positives and very low scores to negatives, leading to a loss value that approaches zero too rapidly. When the loss is near zero, the model's parameters receive very small updates (due to small gradients), effectively stopping meaningful learning and preventing the model from capturing subtle semantic differences.
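
To make the margin-based training setup and the zero loss problem concrete, here is a minimal, self-contained sketch (not from the paper; the embeddings are random toy values) using a TransE-style score $f_r(h,t) = -\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert$ and a hinge (margin ranking) loss. An "obviously false" negative is scored so far below the positive that the hinge clips to zero and contributes no gradient, whereas a near-miss negative still yields a useful training signal:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, margin = 8, 1.0

# Toy embeddings (illustrative only): head, relation, tail of a positive triple.
h, r = rng.normal(size=dim), rng.normal(size=dim)
t = h + r + 0.05 * rng.normal(size=dim)          # make (h, r, t) nearly translational

def transe_score(h, r, t):
    """Higher is more plausible: negative L2 distance between h + r and t."""
    return -np.linalg.norm(h + r - t)

def margin_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: drops to zero once the positive outscores the negative by the margin."""
    return max(0.0, margin + neg_score - pos_score)

easy_neg_t = 10.0 * rng.normal(size=dim)         # an "obviously false" tail, far away
hard_neg_t = t + 0.1 * rng.normal(size=dim)      # a near-miss tail, close to the true one

pos = transe_score(h, r, t)
print("easy negative loss:", margin_loss(pos, transe_score(h, r, easy_neg_t)))  # ~0 -> no gradient
print("hard negative loss:", margin_loss(pos, transe_score(h, r, hard_neg_t)))  # > 0 -> informative
```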

3.2. Previous Works

The paper primarily focuses on negative sampling techniques and their interaction with KGE models rather than introducing a new KGE model. However, it references numerous KGE models as context or as discriminators within GAN-based negative sampling frameworks.

  • TransE [5]: A foundational translational distance-based KGE model. It models relations as translations in the embedding space: $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ for a true triple (h, r, t). It uses uniform sampling for negatives.
  • TransH [12]: An improvement over TransE that addresses its limitations in handling 1-to-N, N-to-1, and N-to-N relations. It projects entities onto a hyperplane specific to each relation before applying the translation operation. It introduced Bernoulli sampling.
  • TransR [13], TransD [14], TransG [15]: Further variants of TransE that embed entities and relations into different spaces or use dynamic mapping matrices to handle more complex relation types.
  • RESCAL [6]: An early semantic matching-based KGE model that uses matrix factorization. It models relations as matrices $\mathbf{M}_r$, where the plausibility of (h, r, t) is measured by $\mathbf{h}^\top \mathbf{M}_r \mathbf{t}$.
  • DistMult [31]: Simplifies RESCAL by constraining $\mathbf{M}_r$ to be a diagonal matrix, which can effectively model symmetric relations.
  • ComplEx [23]: Extends DistMult to complex number space to better handle antisymmetric relations.
  • KBGAN [16]: The first work to apply Generative Adversarial Networks (GANs) to negative sampling in KGE. It uses one KGE model as a generator to create negatives and another KGE model as a discriminator.
  • IGAN [17]: Another GAN-based approach, which uses a two-layer fully-connected neural network as its generator, unlike KBGAN which uses KGE models as generators.
  • MCNS (Markov chain Monte Carlo Negative Sampling) [22]: Proposes a theoretically derived negative sampling distribution that is positively but sub-linearly correlated with the positive sampling distribution. It uses depth first search (DFS) and Metropolis-Hastings algorithm for efficient sampling.
  • RotatE [49]: While not solely a negative sampling method, it proposes a self-adversarial sampling approach, which is an alternative to GAN-based methods, avoiding reinforcement learning and improving efficiency.
  • CKRL [20] and NKRL [21]: These works focus on confidence-aware knowledge representation learning, particularly dealing with noisy KGs. NKRL extends CKRL by introducing a confidence-aware negative sampling method to generate plausible negatives and reduce false detection of noises.

3.3. Technological Evolution

The field of KGE has evolved from discrete representations to continuous vector spaces, mirroring trends in natural language processing (NLP) with word embeddings.

  • Early KGE (pre-2013): Focused on symbolic logic and rule-based systems, or matrix factorization methods like RESCAL. These often struggled with scalability and capturing rich semantics.

  • Translational Models (2013 onwards): The advent of TransE [5] (inspired by word2vec [4]) marked a significant shift. It introduced the idea of relations as translations in a vector space, offering simplicity and efficiency. This led to a wave of variants (TransH, TransR, TransD, TransG) that addressed TransE's limitations, such as handling different relation types and modeling complexity.

  • Semantic Matching Models (mid-2010s): Alongside translational models, semantic matching models (e.g., DistMult, ComplEx) gained prominence, focusing on modeling interactions between entity and relation embeddings using dot products or other similarity measures.

  • Neural Network-based Models (late 2010s): The deep learning revolution brought neural networks into KGE, leading to models like ConvE (using 2D convolutions), R-GCN (Graph Convolutional Networks), and KG-BERT (Transformer-based models), integrating KGE with advanced neural architectures.

  • Focus on Negative Sampling (2018 onwards): While negative sampling was always present (e.g., uniform sampling in TransE), its critical importance became more recognized. Papers like KBGAN and IGAN introduced Generative Adversarial Networks (GANs) to dynamically generate higher-quality negatives. MCNS provided a theoretical foundation for optimal negative sampling distributions. Custom cluster-based methods emerged to refine the search space for negatives.

    This paper fits into the latter part of this timeline, specifically addressing the growing realization that while embedding models have advanced, the quality of training data (especially negative samples) is equally crucial for robust KGE.

3.4. Differentiation Analysis

Compared to the primary focus of most KGE research (developing novel embedding models), this paper distinguishes itself by not proposing a new KGE model. Instead, its core innovation lies in:

  • Meta-analysis of Negative Sampling: It provides a comprehensive, structured review of negative sampling methods, a crucial component that directly influences KGE model training but is often treated as a secondary concern. Most KGE papers either briefly mention their chosen negative sampling strategy (e.g., uniform sampling) or introduce a new one as part of their model without a broader discussion of the field.
  • Categorization Scheme: The paper introduces a novel categorization (static, dynamic, custom cluster-based) that helps organize and understand the diverse landscape of negative sampling techniques, providing a clear framework for comparison and future development.
  • Emphasis on Impact: It explicitly highlights how different negative sampling strategies address issues like zero loss, false negatives, and training instability, which are often implicit or secondary considerations in papers primarily focused on embedding model architecture.
  • Call to Action: By exhaustively reviewing existing methods and pointing out their strengths and weaknesses, the paper serves as a foundational resource and a call for more dedicated research into negative sampling, positioning it as an independent and vital area of KGE research rather than just a utility for KGE models.

4. Methodology

The paper's "methodology" is not a novel KGE model or a new negative sampling algorithm, but rather a systematic approach to categorize and analyze existing negative sampling techniques. It proposes a three-category classification based on the source or generation mechanism of the negative samples. Within each category, it discusses prominent examples, highlighting their principles, advantages, and drawbacks.

4.1. Principles

The core principle behind negative sampling in KGE is Noise Contrastive Estimation (NCE): transforming the task of learning a complex, unnormalized probability distribution over triples into a simpler binary classification task. The model is trained to discriminate between positive samples (true facts from the KG) and negative samples (generated false facts). The quality of these generated negative samples is paramount. If negatives are too easy to distinguish, the model learns little (zero loss problem). If they are too hard or are false negatives (actually true facts), the model can be misled.

The paper's categorization framework is based on how the "noise samples" (negatives) are generated:

  1. Static Distribution-Based Sampling: Negatives are drawn from a fixed, predefined probability distribution, usually based on global statistics of the KG. These distributions do not change during training.
  2. Dynamic Distribution-Based Sampling: Negatives are generated from a distribution that adapts or evolves during the training process, often leveraging feedback from the KGE model itself.
  3. Custom Cluster-Based Sampling: Negatives are sampled from a restricted set of candidate entities or relations that are pre-selected or clustered based on certain criteria (e.g., semantic similarity, entity types), rather than from the entire entity set.

4.2. Core Methodology In-depth (Layer by Layer)

The paper begins by defining the general context for negative sampling. For a standard KG, let $\mathbb{E}$ be the set of entities and $\mathbb{R}$ be the set of relations. $\mathbb{D}^+$ represents the set of positive (true) triples $\tau^+ = (h, r, t)$. Negative triples, denoted $\tau^-$, belong to the set $\mathbb{D}^-$. The paper formally defines $\mathbb{D}^-$ as follows:

$$
\begin{aligned}
\tau^- \in \mathbb{D}^-, \quad
\mathbb{D}^- = {} & \{ (h', r, t) \mid h' \in \mathbb{E} \wedge h' \neq h \wedge (h, r, t) \in \mathbb{D}^+ \} \\
{} \cup {} & \{ (h, r, t') \mid t' \in \mathbb{E} \wedge t' \neq t \wedge (h, r, t) \in \mathbb{D}^+ \} \\
{} \cup {} & \{ (h, r', t) \mid r' \in \mathbb{D} \wedge r' \neq r \wedge (h, r, t) \in \mathbb{D}^+ \}
\end{aligned}
$$

Here:

  • $\tau^- \in \mathbb{D}^-$: A negative triple $\tau^-$ is an element of the set of all possible negative triples $\mathbb{D}^-$.
  • The definition for $\mathbb{D}^-$ is a union of three types of corrupted triples:
    • The first set describes triples where the head entity $h$ of a positive triple $(h, r, t) \in \mathbb{D}^+$ is replaced by another entity $h'$ from the entity set $\mathbb{E}$, such that $h' \neq h$. The relation $r$ and tail entity $t$ remain the same.

    • The second set describes triples where the tail entity $t$ of a positive triple $(h, r, t) \in \mathbb{D}^+$ is replaced by another entity $t'$ from the entity set $\mathbb{E}$, such that $t' \neq t$. The head entity $h$ and relation $r$ remain the same.

    • The third set describes triples where the relation $r$ of a positive triple $(h, r, t) \in \mathbb{D}^+$ is replaced by another relation $r'$ from the relation set $\mathbb{R}$ (note: the paper states $r' \in \mathbb{D}$, which is likely a typo and should be $r' \in \mathbb{R}$), such that $r' \neq r$. The head entity $h$ and tail entity $t$ remain the same.

      KGE models then use a scoring function $f_r(h, t)$ to measure the plausibility of a triple, aiming to assign higher scores to positive triples and lower scores to negative ones. The goal of negative sampling methods is to provide effective $\tau^-$ for this training (a short code sketch of this corruption scheme follows).
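
As an illustration of this definition (a sketch over assumed toy data, not code from the paper), the snippet below enumerates the three corruption types for one positive triple and checks each candidate against $\mathbb{D}^+$; this membership test is exactly what a "filtered" setting uses to discard false negatives:

```python
from itertools import chain

# Toy KG (assumed for illustration): positive triples D+, entity set E, relation set R.
D_pos = {("London", "locatedIn", "UnitedKingdom"),
         ("Liverpool", "locatedIn", "UnitedKingdom"),
         ("DonaldTrump", "Gender", "Male")}
E = {h for h, _, _ in D_pos} | {t for _, _, t in D_pos}
R = {r for _, r, _ in D_pos}

def corruptions(triple):
    """Enumerate D^- candidates for one positive triple: corrupt the head, tail, or relation."""
    h, r, t = triple
    head_corrupted = ((h2, r, t) for h2 in E if h2 != h)
    tail_corrupted = ((h, r, t2) for t2 in E if t2 != t)
    rel_corrupted = ((h, r2, t) for r2 in R if r2 != r)
    return chain(head_corrupted, tail_corrupted, rel_corrupted)

for candidate in corruptions(("London", "locatedIn", "UnitedKingdom")):
    label = "false negative (still in D+)" if candidate in D_pos else "usable negative"
    print(candidate, "->", label)
```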

4.2.1. Static Distribution-Based Sampling

These methods use predefined, unchanging probability distributions to generate negative samples. They are simple and efficient but can suffer from issues like zero loss or false negatives because they don't adapt during training.

4.2.1.1. Uniform Sampling

  • Principle: The simplest approach. For a positive triple (h, r, t), either the head $h$ or the tail $t$ is replaced by an entity randomly sampled from the entire entity set $\mathbb{E}$ with uniform probability.
  • Mechanism: To generate a negative triple from (h, r, t):
    1. Randomly choose to corrupt either the head or the tail (e.g., with 0.5 probability for each).
    2. If corrupting the head: Sample $h' \in \mathbb{E}$ uniformly such that $h' \neq h$. The negative triple becomes (h', r, t).
    3. If corrupting the tail: Sample $t' \in \mathbb{E}$ uniformly such that $t' \neq t$. The negative triple becomes (h, r, t') (see the sketch at the end of this subsection).
  • Characteristics:
    • Pros: Easy to implement, computationally inexpensive.
    • Cons: Often generates low-quality negatives that are too obviously false (e.g., (London, locatedIn, apple)), leading to the zero loss problem. It also has a high probability of generating false negatives (e.g., replacing "DonaldTrump" in (DonaldTrump, Gender, Male) with "JoeBiden" yields (JoeBiden, Gender, Male), which is also true).
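
A minimal sketch of uniform negative sampling (toy data assumed, not taken from the paper). The optional filter against known positives shows how easily the unfiltered version can emit a false negative such as (JoeBiden, Gender, Male):

```python
import random

random.seed(0)

def uniform_corrupt(triple, entities, known_positives=None):
    """Replace the head or the tail (0.5 probability each) with a uniformly sampled entity.

    If `known_positives` is given, resample whenever the corruption happens to be a known
    true triple (the 'filtered' variant); otherwise false negatives can slip through.
    """
    h, r, t = triple
    while True:
        if random.random() < 0.5:
            candidate = (random.choice([e for e in entities if e != h]), r, t)
        else:
            candidate = (h, r, random.choice([e for e in entities if e != t]))
        if known_positives is None or candidate not in known_positives:
            return candidate

# Toy example (assumed): corrupting the head of (DonaldTrump, Gender, Male) can yield
# (JoeBiden, Gender, Male), which is true in reality -- a false negative.
entities = ["DonaldTrump", "JoeBiden", "Male", "Female", "London"]
positives = {("DonaldTrump", "Gender", "Male"), ("JoeBiden", "Gender", "Male")}
print(uniform_corrupt(("DonaldTrump", "Gender", "Male"), entities))
print(uniform_corrupt(("DonaldTrump", "Gender", "Male"), entities, known_positives=positives))
```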

4.2.1.2. Bernoulli Sampling

  • Principle: Introduced in TransH [12], this method aims to mitigate false negatives by adapting the probability of corrupting the head or tail based on the mapping properties of the relation $r$. For 1-to-N relations (one head maps to many tails), it is better to corrupt the head, since a randomly chosen replacement tail could easily be another valid tail. For N-to-1 relations (many heads map to one tail), it is better to corrupt the tail, for the symmetric reason.
  • Mechanism: For each relation $r$, two statistics are calculated (a short sketch at the end of this subsection shows how they are computed and used):
    • tph (tails per head): Average number of tail entities per head entity for relation $r$.
    • hpt (heads per tail): Average number of head entities per tail entity for relation $r$.
    • The probability of replacing the head entity for a given relation $r$ is calculated as $\frac{\text{tph}}{\text{tph} + \text{hpt}}$.
    • The probability of replacing the tail entity for a given relation $r$ is calculated as $\frac{\text{hpt}}{\text{tph} + \text{hpt}}$.
  • Characteristics:
    • Pros: Reduces false negatives by making more context-aware corruption choices. For many-to-one relations like "Gender," it preferentially replaces the tail, reducing the chance of generating another true (person, Gender, Male) fact.
    • Cons: Still a static distribution. It doesn't consider the current state of the model or semantic similarity beyond relation type.
  • Improvement in Bernoulli sampling (Zhang et al. [45]):
    • Principle: Extends Bernoulli sampling by also considering relation replacement.
    • Mechanism: Introduces a probability $\alpha$ for relation replacement, calculated as $\alpha = \frac{r_{\text{num}}}{r_{\text{num}} + e_{\text{num}}}$, where:
      • $r_{\text{num}}$ is the total number of relations in the KG.
      • $e_{\text{num}}$ is the total number of entities in the KG.
      • The remaining probability $1 - \alpha$ is then distributed between head and tail entity replacement according to the standard Bernoulli distribution based on tph and hpt.
    • Characteristics: Enhances the model's ability in relation link prediction by allowing for relation corruption.
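
The sketch below (toy data assumed, not taken from the paper) computes tph and hpt from a set of positive triples and uses them to decide whether to corrupt the head or the tail, i.e., the basic Bernoulli scheme from TransH; Zhang et al.'s relation-replacement probability $\alpha$ could be layered on top in the same way:

```python
import random
from collections import defaultdict

random.seed(0)

def bernoulli_stats(triples):
    """Per relation, compute tph (average tails per head) and hpt (average heads per tail)."""
    tails_of = defaultdict(lambda: defaultdict(set))
    heads_of = defaultdict(lambda: defaultdict(set))
    for h, r, t in triples:
        tails_of[r][h].add(t)
        heads_of[r][t].add(h)
    stats = {}
    for r in tails_of:
        tph = sum(len(ts) for ts in tails_of[r].values()) / len(tails_of[r])
        hpt = sum(len(hs) for hs in heads_of[r].values()) / len(heads_of[r])
        stats[r] = (tph, hpt)
    return stats

def bernoulli_corrupt(triple, entities, stats):
    """Corrupt the head with probability tph / (tph + hpt), otherwise corrupt the tail."""
    h, r, t = triple
    tph, hpt = stats[r]
    if random.random() < tph / (tph + hpt):
        return (random.choice([e for e in entities if e != h]), r, t)
    return (h, r, random.choice([e for e in entities if e != t]))

# "Gender" is N-to-1 (hpt > tph), so the tail is preferentially replaced.
triples = [("A", "Gender", "Male"), ("B", "Gender", "Male"), ("C", "Gender", "Female")]
entities = ["A", "B", "C", "Male", "Female"]
stats = bernoulli_stats(triples)
print(stats["Gender"])                                        # (tph=1.0, hpt=1.5)
print(bernoulli_corrupt(("A", "Gender", "Male"), entities, stats))
```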

4.2.1.3. Probabilistic Sampling (PNS)

  • Principle: Proposed by Kanojia et al. [46] to address skewed data in KGs, where some relations have very few training examples. For such sparse relations, uniform or Bernoulli sampling might struggle to generate useful negatives. PNS introduces a tuning parameter $\beta$ (train bias) to complement generated negatives with "early-listed possible instances" to speed up training.
  • Mechanism: The paper describes it as bringing in a tuning parameter $\beta$ (train bias) that determines the probability by which generated negative examples are complemented with early-listed possible instances. While the exact formula for incorporating $\beta$ isn't provided in this review paper, the concept implies prioritizing certain candidates for negative samples, likely based on some pre-computed plausibility or difficulty measure, especially for sparse relations.
  • Characteristics:
    • Pros: Addresses the issue of skewed data and sparse relations, potentially speeding up training for these cases.
    • Cons: The exact definition of "early-listed possible instances" and how β\beta precisely modifies the sampling process is not detailed here, implying a need for further consultation of the original PNS paper. It still operates on a predefined bias, making it static.

4.2.2. Dynamic Distribution-Based Sampling

These methods dynamically adjust the negative sampling distribution during training, often guided by the current state of the KGE model itself. This allows for generating high-quality negatives that are "harder" for the model to distinguish.

4.2.2.1. KBGAN

  • Principle: KBGAN [16] is the first to leverage Generative Adversarial Networks (GANs) for negative sampling in KGE. A GAN consists of two neural networks: a generator and a discriminator that are trained adversarially. In KBGAN, the generator's role is to produce realistic-looking negative samples that can fool the discriminator, and the discriminator's role (which is the target KGE model) is to distinguish between real (positive) and fake (negative) samples.
  • Mechanism:
    1. Generator (G): Takes a positive triple (h, r, t). It's typically another KGE model (e.g., DistMult, ComplEx). It samples from a candidate set of uniformly sampled negatives and assigns probabilities to them, then selects the "best" negative (highest probability or hardest) to pass to the discriminator.
    2. Discriminator (D): This is the target KGE model (e.g., TransE, TransD) being trained. It receives both positive triples from $\mathbb{D}^+$ and negative triples from the generator. Its objective is to correctly classify them and learn knowledge representations by minimizing a marginal loss function (e.g., hinge loss) between positive and negative scores.
    3. Adversarial Training: The generator is trained to maximize the discriminator's error (make negatives look more plausible). The discriminator is trained to minimize its error (better distinguish positives from negatives). This dynamic interplay forces the generator to produce high-quality, challenging negatives, which in turn pushes the discriminator (KGE model) to learn more robust embeddings.
  • Characteristics:
    • Pros: Generates high-quality negatives that are difficult for the KGE model to distinguish, leading to more robust embeddings.
    • Cons: GANs are notoriously difficult to train, prone to training instability (e.g., mode collapse) and require reinforcement learning. Pre-training the generator and discriminator can add significant computational cost.
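
A much-simplified, illustrative sketch of the adversarial sampling loop described above (not the authors' implementation; the embeddings, scoring functions, and candidate-set size are toy assumptions). The generator scores a uniformly drawn candidate set and samples one negative from the resulting softmax; the discriminator's margin loss on that negative would drive its own update, while the discriminator's score of the negative serves as the REINFORCE-style reward for the generator:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ent, dim, margin = 50, 16, 1.0

# Toy embedding tables (assumed): one for the generator, one for the discriminator.
ent_g, rel_g = rng.normal(size=(n_ent, dim)), rng.normal(size=dim)
ent_d, rel_d = rng.normal(size=(n_ent, dim)), rng.normal(size=dim)

def distmult_score(h, r, t):          # generator scoring, DistMult-style
    return float(np.sum(h * r * t))

def transe_score(h, r, t):            # discriminator scoring, TransE-style
    return -float(np.linalg.norm(h + r - t))

h_id, t_id = 0, 1                     # a toy positive triple (h, r, t)

# 1. Generator: score a uniformly drawn candidate set and sample one "hard" negative tail.
cand_ids = rng.choice(n_ent, size=8, replace=False)
gen_scores = np.array([distmult_score(ent_g[h_id], rel_g, ent_g[c]) for c in cand_ids])
probs = np.exp(gen_scores - gen_scores.max())
probs /= probs.sum()
neg_id = int(rng.choice(cand_ids, p=probs))

# 2. Discriminator: margin (hinge) loss between the positive and the generated negative.
pos = transe_score(ent_d[h_id], rel_d, ent_d[t_id])
neg = transe_score(ent_d[h_id], rel_d, ent_d[neg_id])
disc_loss = max(0.0, margin + neg - pos)

# 3. The generator would then be updated by policy gradient (REINFORCE), using the
#    discriminator's score of the sampled negative as its reward signal.
reward = neg
print(f"sampled negative tail={neg_id}  discriminator loss={disc_loss:.3f}  reward={reward:.3f}")
```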

4.2.2.2. IGAN

  • Principle: IGAN [17] also uses a GAN-based framework but differentiates itself in the generator's architecture. Instead of using another KGE model as the generator (like KBGAN), IGAN employs a neural network to generate negative samples.

  • Mechanism:

    1. Generator (G): A two-layer fully-connected neural network. It takes the embedding vectors of a positive triple (h, r, t) as input.
      • The input vectors are passed through the neural network, followed by a non-linear ReLU activation function.
      • A softmax function is applied to the output to calculate a probability distribution over the entire entity set $\mathbb{E}$ (not just a small candidate set as in KBGAN) for replacing the head or tail. This allows for more flexible and less constrained negative generation.
    2. Discriminator (D): Still the target KGE model, similar to KBGAN. It measures the plausibility of received triples using its scoring function.
    3. Adversarial Training: Similar to KBGAN, the generator is trained to produce negatives that receive high plausibility scores from the discriminator, while the discriminator learns to assign low scores to these generated negatives and high scores to true positives.
  • Characteristics:

    • Pros: Dynamically selects high-quality negatives using a dedicated neural network generator, potentially more flexible than using another KGE model.
    • Cons: High computational complexity due to calculating probability distributions over the entire entity set. Shares the general difficulties of GAN training (instability, model collapse, need for reinforcement learning). Requires pre-training.
  • Comparison between GAN-based and self-adversarial sampling:

    • The paper briefly mentions self-adversarial sampling (e.g., in RotatE [49]) as an alternative to GAN-based approaches.
    • Challenge of GANs: GANs are hard to optimize because they involve training a discrete negative sample generator and the embedding model simultaneously, often requiring reinforcement learning, which incurs high computational costs and risks training instability.
    • Self-adversarial sampling (RotatE): Avoids reinforcement learning by using the self-scoring function of the current embedding model to sample negatives. It introduces a temperature parameter $\alpha$ for sampling. This makes it easier to operate and often outperforms GAN-based methods in efficiency and effectiveness (see the sketch after this comparison).
    • Conclusion: GAN-based sampling, despite its potential for high-quality negatives, struggles with efficiency and stability, often requiring pre-training. Self-adversarial sampling offers a more practical alternative.
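
To make the self-adversarial alternative concrete, here is a minimal sketch (toy values assumed; the weighting follows the RotatE-style self-adversarial loss but is not copied from any implementation). Each of the $k$ uniformly drawn negatives is weighted by a softmax over the current model's scores scaled by the temperature $\alpha$, so harder negatives dominate the loss without a separate generator or reinforcement learning:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, alpha, gamma, k = 16, 1.0, 6.0, 4      # embedding size, temperature, margin, #negatives

h, r, t = rng.normal(size=dim), rng.normal(size=dim), rng.normal(size=dim)
neg_tails = rng.normal(size=(k, dim))       # k uniformly sampled corrupted tails

def score(h, r, t):
    """TransE-style plausibility with margin gamma: higher means more plausible."""
    return gamma - np.linalg.norm(h + r - t)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

neg_scores = np.array([score(h, r, nt) for nt in neg_tails])

# Self-adversarial weights: softmax over alpha * score of each sampled negative.
weights = np.exp(alpha * neg_scores)
weights /= weights.sum()

# Weighted logistic loss: high-scoring ("hard") negatives dominate the negative term.
loss = -np.log(sigmoid(score(h, r, t))) - np.sum(weights * np.log(sigmoid(-neg_scores)))
print("weights:", np.round(weights, 3), " loss:", round(float(loss), 3))
```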

4.2.2.3. MCNS (Markov chain Monte Carlo Negative Sampling)

  • Principle: MCNS [22] offers a theoretically grounded approach to dynamic negative sampling. It derives an effective negative sampling distribution that should be positively but sub-linearly correlated with the positive sampling distribution. This means that entities that frequently appear in positive triples (and are thus "important") should also be more likely to be sampled as negatives, but not disproportionately so.
  • Mechanism:
    1. Sampled NCE Framework: MCNS operates within a Sampled Noise Contrastive Estimation (NCE) framework.
    2. Positive Sampling Distribution Estimation: It uses self-contrast approximation to estimate the underlying positive sampling distribution.
    3. Negative Sample Generation: The depth first search (DFS) algorithm is applied to traverse the graph to obtain a Markov chain for the last node (e.g., the tail entity). Negative samples are then generated from this Markov chain.
    4. Speed-up: The Metropolis-Hastings algorithm [50] is used to accelerate the negative sampling process.
    5. Model Update: Embedding vectors are updated by minimizing a hinge loss function after inputting the positive sample and its generated negative counterpart into the encoder of the framework.
  • Characteristics:
    • Pros: Provides a pioneering theoretical derivation for an effective negative sampling distribution, not limited to KGE but applicable to general graph representation learning. Exhibits strong performance in downstream tasks and high efficiency.
    • Cons: The complexity of implementing DFS for Markov chain generation and Metropolis-Hastings might be higher than simpler methods.
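
The Metropolis-Hastings step can be illustrated generically (a hedged sketch of the idea only, not the MCNS algorithm; the target distribution, its sub-linear exponent, and the uniform proposal are all assumptions). Candidates from a simple proposal are accepted or rejected so that the chain of sampled negatives converges to an unnormalized target that is positively but sub-linearly correlated with an estimated positive distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
n_ent = 20

# Assumed stand-in for the positive sampling distribution over candidate tails,
# e.g. as estimated by self-contrast approximation from the current model's scores.
p_pos = rng.dirichlet(np.ones(n_ent))

def target_unnorm(e):
    """Unnormalized target: sub-linearly correlated with p_pos (exponent assumed)."""
    return p_pos[e] ** 0.75

def mh_negative_chain(n_samples, current=0):
    """Metropolis-Hastings: uniform proposal, accept/reject against the unnormalized target."""
    samples = []
    for _ in range(n_samples):
        proposal = int(rng.integers(n_ent))               # symmetric (uniform) proposal
        accept = min(1.0, target_unnorm(proposal) / target_unnorm(current))
        if rng.random() < accept:
            current = proposal
        samples.append(current)
    return samples

chain = mh_negative_chain(1000)
print("most frequently sampled negative tails:",
      np.bincount(chain, minlength=n_ent).argsort()[-3:])
```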

4.2.3. Custom Cluster-Based Sampling

These methods pre-process the entity set into clusters or groups based on certain criteria. Negative samples are then drawn from within these restricted clusters, rather than the entire entity set $\mathbb{E}$, to ensure they are "plausible" or "harder" negatives.

4.2.3.1. TransE-SNS (Similarity-based Negative Sampling)

  • Principle: TransE-SNS [18] is based on the idea that valid negatives should be semantically similar to the original entity they replace. If two entities are close in the embedding space, they are considered similar. Corrupting a triple by replacing an entity with a similar but incorrect one generates a "harder" negative.
  • Mechanism:
    1. Clustering: The K-Means clustering algorithm [52] is applied to the entity embeddings to divide all entities into a number of groups (clusters) based on their similarity in the vector space.
    2. Negative Generation: For a positive triple (h, r, t):
      • If corrupting the head: A replacement entity $h'$ is uniformly sampled only from the cluster to which the original head entity $h$ belongs.
      • If corrupting the tail: A replacement entity $t'$ is uniformly sampled only from the cluster to which the original tail entity $t$ belongs (see the sketch at the end of this subsection).
  • Characteristics:
    • Pros: Generates high-quality negatives that are semantically related to the original entities, making them harder to distinguish and more informative for training. Enhances the performance of KGE models like TransE.
    • Cons: Requires an initial clustering step, which adds computational overhead. The quality of negative samples depends heavily on the quality of initial entity embeddings and the clustering algorithm. KGs are dynamic, so clusters might need frequent re-computation.
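
A brief sketch of this cluster-restricted sampling (illustrative only; random vectors stand in for trained entity embeddings, and scikit-learn's KMeans is one possible clustering choice, not necessarily the one used in TransE-SNS):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n_ent, dim, k = 100, 16, 5

entity_emb = rng.normal(size=(n_ent, dim))          # stand-in for learned entity embeddings
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(entity_emb)

# Group entity ids by cluster once, then sample replacements from the same cluster.
clusters = {c: np.flatnonzero(labels == c) for c in range(k)}

def sns_corrupt_tail(h_id, r_id, t_id):
    """Replace the tail with a different entity drawn from the tail's own cluster."""
    pool = clusters[labels[t_id]]
    pool = pool[pool != t_id]
    return (h_id, r_id, int(rng.choice(pool)))

print(sns_corrupt_tail(0, 0, 1))
```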

4.2.3.2. NSCaching (Negative Sampling Caching)

  • Principle: NSCaching [19] is motivated by the observation that high-quality negative samples (i.e., "hard" negatives) tend to receive relatively high plausibility scores from the scoring function. Instead of generating these hard negatives from scratch every time, NSCaching maintains a cache of previously identified high-plausibility negatives and samples from this cache. It is seen as a distilled version of GAN-based strategies.
  • Mechanism:
    1. Cache Maintenance: NSCaching stores a collection of negative triples that have been identified as "helpful" or "rare" (meaning they received high plausibility scores from the current KGE model).
    2. Sampling from Cache: During training, negative samples are drawn uniformly from this maintained cache.
    3. Cache Update (Importance Sampling): The cache is dynamically updated using importance sampling. This means that new high-plausibility negatives encountered during training are added to the cache, and older, less useful ones might be replaced or removed. This ensures the cache always contains challenging negatives.
  • Characteristics:
    • Pros: Efficient, as it samples from a pre-identified set of hard negatives rather than generating them from a large space. Avoids the complexities, training instability, and model collapse issues associated with GANs. Performs better than GAN-based approaches in efficiency and effectiveness.
    • Cons: Requires careful design of the cache update mechanism and criteria for "high-plausibility" negatives. The initial filling of the cache and its continuous maintenance add some overhead.
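
The caching idea can be sketched as follows (a hedged illustration, not the authors' code; the scoring function, cache size, and refresh rule are assumptions, and the refresh here simply keeps the top-scoring candidates, whereas NSCaching updates the cache via importance sampling). A small cache holds the highest-scoring corrupted tails seen so far, negatives are drawn uniformly from it, and it is periodically refreshed with the current model's scores:

```python
import numpy as np

rng = np.random.default_rng(4)
n_ent, dim, cache_size, pool_size = 200, 16, 10, 50

ent, rel = rng.normal(size=(n_ent, dim)), rng.normal(size=dim)

def score(h_id, t_id):
    """Current model's plausibility of (h, r, t); TransE-style, higher is more plausible."""
    return -float(np.linalg.norm(ent[h_id] + rel - ent[t_id]))

def refresh_cache(h_id, cache):
    """Re-score the old cache plus a random candidate pool, keep the top-scoring tails."""
    candidates = np.unique(np.concatenate([cache, rng.integers(n_ent, size=pool_size)]))
    scores = np.array([score(h_id, int(c)) for c in candidates])
    return candidates[np.argsort(scores)[-cache_size:]]

def sample_negative(cache):
    """Negatives are drawn uniformly from the cache of high-scoring ('hard') tails."""
    return int(rng.choice(cache))

cache = rng.integers(n_ent, size=cache_size)        # initial cache: random corrupted tails
for _ in range(3):                                  # interleaved with normal training steps
    cache = refresh_cache(h_id=0, cache=cache)
    print("hard negative tail:", sample_negative(cache))
```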

4.2.4. Other Novel Approaches

4.2.4.1. Confidence-Aware Negative Sampling (in NKRL)

  • Principle: NKRL [21] (building on CKRL [20]) addresses the problem of noisy KGs, where some "positive" triples might actually be false. It introduces confidence-aware negative sampling to simultaneously detect noise in the KG and generate plausible negatives. The core idea is to assign a confidence score not just to positive triples, but also to negative ones, guiding the sampling process towards more informative negatives.
  • Mechanism:
    1. Noise Detection & Negative Sampling: Unlike traditional methods that assume all positive triples are true, NKRL detects potential noises within $\mathbb{D}^+$ while generating negatives.
    2. Negative Triple Confidence: It proposes a concept of negative triple confidence, which quantifies the plausibility of a generated negative triple. This confidence is used to select "plausible negatives" for training.
    3. Triple Quality Function Modification: NKRL modifies the triple quality function (originally defined in CKRL) to improve noise detection and reduce false detection problems. While the exact formula for confidence and quality function is not provided in this review, it implies a scoring mechanism that rates the "truthiness" of all triples, including generated negatives.
  • Characteristics:
    • Pros: Addresses the practical problem of noisy KGs, making KGE more robust to real-world data imperfections. Generates plausible negatives by measuring their quality with confidence scores.
    • Cons: Increases complexity by simultaneously handling noise detection and negative sampling. The effectiveness depends on the accuracy of the confidence-aware mechanism. The paper notes its similarity to self-adversarial sampling in RotatE in using the current embedding model's self-scoring function for sampling.

5. Experimental Setup

The paper is a review and categorization of existing negative sampling techniques, not a presentation of a new experimental method or model. Therefore, it does not describe its own experimental setup, datasets, evaluation metrics, or baselines. Instead, it discusses the typical evaluation contexts and findings from the original papers that introduced the negative sampling methods.

To provide a beginner-friendly overview, I will outline the general experimental setup and common evaluation metrics used in the KGE field, which are relevant to evaluating the impact of negative sampling.

5.1. Datasets

KGE models, and thus negative sampling techniques, are typically evaluated on benchmark datasets that represent different domains and scales of knowledge graphs. The paper mentions two such datasets in the context of Probabilistic Sampling:

  • WN18: (WordNet 18) A subset of WordNet, a lexical database of English. It contains relationships between words (e.g., hypernymy, hyponymy).
    • Characteristics: Relatively small, but semantically rich. Often used for initial evaluation. It's known to have inverse relations where (h,r,t) implies (t,r',h) (e.g., _hypernym and _hyponym), which some KGE models struggle with.
    • Data Sample Example (Conceptual):
      • (dog, _hypernym, canine)
      • (car, _has_part, wheel)
  • FB15K: (Freebase 15k) A subset of Freebase, a large collaborative knowledge base. It contains general facts about people, places, and things.
    • Characteristics: Larger and more diverse than WN18, containing a wider variety of entities and relations. It's often used to evaluate the scalability and generalizability of KGE models.
    • Data Sample Example (Conceptual):
      • (Barack Obama, _born_in, Honolulu)

      • (London, _located_in, United Kingdom)

        These datasets are chosen because they are widely recognized benchmarks in KGE research, allowing for fair comparison across different models and methods. They represent common challenges in KG data, such as varying relation types and density.

5.2. Evaluation Metrics

The primary task for evaluating KGE models (and indirectly, the negative sampling methods used to train them) is link prediction, which aims to predict missing entities or relations in triples. The paper mentions Mean Rank and Hits@N as common metrics, specifically citing link prediction and triple classification as downstream tasks.

  • Link Prediction: Given a corrupted triple (e.g., (h, r, ?), (?, r, t), or (h, ?, t)), the model predicts the missing component. This usually involves ranking all candidate entities/relations based on their plausibility score from the KGE model.

5.2.2.1. Mean Rank (MR)

  • Conceptual Definition: Mean Rank (MR) measures the average rank of correct entities/relations in a ranked list generated by the KGE model. For a true triple (h, r, t), the model generates scores for all possible candidate tail entities (if predicting $t$) or head entities (if predicting $h$). The rank of the correct entity among all candidates is then determined. A lower Mean Rank indicates better performance, as the correct entity is ranked higher on average.
  • Mathematical Formula: $\text{MR} = \frac{1}{|\mathbb{D}_{\text{test}}|} \sum_{(h,r,t) \in \mathbb{D}_{\text{test}}} \left( \text{rank}(t \mid h,r) + \text{rank}(h \mid r,t) \right) / 2$
  • Symbol Explanation:
    • $|\mathbb{D}_{\text{test}}|$: The number of triples in the test set.
    • $\text{rank}(t \mid h,r)$: The rank of the true tail entity $t$ when predicting $t$ given (h, r), among all candidate tail entities.
    • $\text{rank}(h \mid r,t)$: The rank of the true head entity $h$ when predicting $h$ given (r, t), among all candidate head entities.
    • The division by 2 averages the ranks for head and tail prediction tasks.

5.2.2.2. Mean Reciprocal Rank (MRR)

  • Conceptual Definition: Mean Reciprocal Rank (MRR) is the average of the reciprocals of the ranks of the correct entities. If the correct entity is ranked first, its reciprocal rank is 1; if it's ranked second, it's 1/2, and so on. MRR emphasizes getting the correct entity ranked very highly. A higher MRR indicates better performance.
  • Mathematical Formula: $\text{MRR} = \frac{1}{|\mathbb{D}_{\text{test}}|} \sum_{(h,r,t) \in \mathbb{D}_{\text{test}}} \left( \frac{1}{\text{rank}(t \mid h,r)} + \frac{1}{\text{rank}(h \mid r,t)} \right) / 2$
  • Symbol Explanation:
    • $|\mathbb{D}_{\text{test}}|$: The number of triples in the test set.
    • $\text{rank}(t \mid h,r)$: The rank of the true tail entity $t$ when predicting $t$ given (h, r).
    • $\text{rank}(h \mid r,t)$: The rank of the true head entity $h$ when predicting $h$ given (r, t).
    • The division by 2 averages the reciprocal ranks for head and tail prediction tasks.

5.2.2.3. Hits@N

  • Conceptual Definition: Hits@N (e.g., Hits@1, Hits@3, Hits@10) measures the proportion of correct entities that appear in the top N ranks of the predicted list. It indicates how often the correct answer is among the top few predictions. A higher Hits@N indicates better performance.
  • Mathematical Formula: $\text{Hits@N} = \frac{1}{|\mathbb{D}_{\text{test}}|} \sum_{(h,r,t) \in \mathbb{D}_{\text{test}}} \left( \mathbb{I}(\text{rank}(t \mid h,r) \le N) + \mathbb{I}(\text{rank}(h \mid r,t) \le N) \right) / 2$
  • Symbol Explanation:
    • $|\mathbb{D}_{\text{test}}|$: The number of triples in the test set.
    • $\mathbb{I}(\cdot)$: The indicator function, which is 1 if the condition inside is true, and 0 otherwise.
    • $\text{rank}(t \mid h,r)$: The rank of the true tail entity $t$ when predicting $t$ given (h, r).
    • $\text{rank}(h \mid r,t)$: The rank of the true head entity $h$ when predicting $h$ given (r, t).
    • $N$: The threshold for the top ranks (e.g., 1, 3, 10).
    • The division by 2 averages the Hits@N for head and tail prediction tasks.
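
As a compact illustration of these three metrics, the sketch below computes MR, MRR, and Hits@N from a list of ranks (head- and tail-prediction queries pooled together; the rank values are assumed toy numbers):

```python
import numpy as np

def mean_rank(ranks):
    """MR: average rank of the correct entity (lower is better)."""
    return float(np.mean(ranks))

def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank (higher is better)."""
    return float(np.mean(1.0 / np.asarray(ranks, dtype=float)))

def hits_at_n(ranks, n):
    """Hits@N: fraction of queries whose correct entity is ranked within the top N."""
    return float(np.mean(np.asarray(ranks) <= n))

ranks = [1, 3, 2, 15, 7, 1, 120, 4]   # assumed toy ranks from head/tail prediction queries
print("MR:", mean_rank(ranks),
      " MRR:", round(mean_reciprocal_rank(ranks), 3),
      " Hits@10:", hits_at_n(ranks, 10))
```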

5.3. Baselines

Since this paper reviews negative sampling methods, the "baselines" are typically other negative sampling approaches, often the simpler ones like uniform sampling or Bernoulli sampling. When a new negative sampling technique is proposed (e.g., in the original papers of KBGAN or MCNS), it's usually compared against these more basic methods, as well as against other advanced negative sampling methods if they exist at that time. For example, Probabilistic Negative Sampling (PNS) is evaluated against Bernoulli sampling. Self-adversarial sampling (e.g., RotatE) is compared against KBGAN.

The choice of baselines is representative because:

  1. Uniform and Bernoulli sampling are standard, widely used, and computationally cheap methods, providing a basic performance reference.
  2. Advanced methods (e.g., GAN-based) serve as strong baselines for newer, more sophisticated approaches, demonstrating progress in generating hard negatives.
  3. Comparison against different negative sampling methods, applied to the same KGE model, allows researchers to isolate and quantify the impact of the sampling strategy itself on the overall KGE performance.

6. Results & Analysis

As a review paper, this work does not present its own experimental results in the form of tables or figures. Instead, it synthesizes the findings and characteristics of various negative sampling approaches as reported in their original publications. The analysis focuses on how each method addresses the limitations of simpler approaches and its impact on KGE model performance and training efficiency.

6.1. Core Results Analysis

The paper's core analysis revolves around the strengths and weaknesses of each categorized negative sampling approach, drawing conclusions about their effectiveness and efficiency.

  • Static Distribution-Based Sampling:

    • Uniform Sampling: While earliest and easiest, it's critiqued for generating low-quality negatives (too easy to distinguish), leading to zero loss problems and false negatives. The paper implicitly states that this approach, though common, often contributes little to effective training.
    • Bernoulli Sampling: An improvement over uniform sampling, it alleviates false negatives by considering relation types (1-to-N, N-to-1). By adapting the corruption probability based on tph and hpt, it generates more plausible negatives. Zhang et al.'s extension further enhances relation link prediction by incorporating relation replacement.
    • Probabilistic Sampling (PNS): Shown to be beneficial for skewed data and sparse relations. Kanojia et al. [46] found that TransR-PNS achieved significant position gains (190 and 47 positions in Mean Rank on WN18 and FB15K respectively) compared to TransR with Bernoulli sampling, demonstrating its effectiveness in challenging data scenarios. This highlights the importance of specific sampling strategies for different data characteristics.
  • Dynamic Distribution-Based Sampling:

    • GAN-based (KBGAN, IGAN): These methods represent a significant leap towards high-quality negatives by dynamically generating "hard" samples that challenge the KGE model. KBGAN demonstrates that combining different KGE models as generator-discriminator pairs can lead to better performance than baselines. IGAN further refines the generator using a neural network.
    • Challenges of GANs: The paper critically notes the potential risks associated with GANs: training instability and model collapse. It also points out the high computational cost and the need for reinforcement learning and pre-training, which are significant drawbacks.
    • Self-Adversarial Sampling (e.g., RotatE): Presented as a more efficient and stable alternative to GANs, avoiding reinforcement learning and often outperforming KBGAN in link prediction. This suggests that simpler, self-guided dynamic approaches can be more practical and effective.
    • MCNS: Positioned as a pioneering approach due to its theoretical derivation of an effective negative sampling distribution. Experiments show MCNS outperforms all baselines in downstream tasks and achieves high efficiency. This validates the importance of a theoretically sound distribution for negative sampling. Its generic nature makes it applicable beyond KGE.
  • Custom Cluster-Based Sampling:

    • TransE-SNS: By sampling from semantically similar clusters (e.g., using K-Means), TransE-SNS generates valid negatives that are challenging for the model. Experiments showed it enhances the ability of TransE in link prediction and triple classification. This indicates that localizing the search space for negatives based on learned semantics can be highly effective.

    • NSCaching: Presented as an efficient and effective alternative, a "distilled version" of GANs, that avoids GAN's drawbacks. By caching high-plausibility negatives and using importance sampling for updates, NSCaching achieves better performance than GAN-based approaches in terms of efficiency and effectiveness. This points towards the value of maintaining and sampling from a curated set of hard negatives.

      In summary, the analysis highlights a clear progression from simple, static negative sampling (prone to zero loss and false negatives) to more sophisticated, dynamic, and context-aware methods. These advanced approaches aim to generate high-quality, hard negatives that force KGE models to learn more robust and fine-grained representations. While GAN-based methods offer theoretical advantages, their practical challenges have led to the exploration of more stable and efficient alternatives like self-adversarial sampling and NSCaching. The paper advocates for continued research in this area, emphasizing its direct impact on KGE model performance.

6.2. Data Presentation (Tables)

The paper does not contain any tables of experimental results from its own research. It synthesizes findings from other papers.

6.3. Ablation Studies / Parameter Analysis

As a review paper, it does not conduct its own ablation studies or parameter analysis. However, it implicitly discusses the impact of certain parameters or components from other works:

  • The tuning parameter $\beta$ in Probabilistic Sampling is mentioned as influencing the complementation of negative examples, implying its role in controlling sampling bias for skewed data.
  • The temperature $\alpha$ in self-adversarial sampling (RotatE) is mentioned as a parameter for controlling the sampling distribution, indicating its importance in balancing the hardness of negatives.

  • The number of negative samples per positive triple is briefly mentioned with a reference [23] that suggests "fifty negative samples per positive is a good choice for balancing accuracy and duration," highlighting a critical hyper-parameter for training efficiency and effectiveness.

    These discussions, while not explicit ablation studies within this paper, demonstrate the authors' awareness of how different components and parameters influence the performance of negative sampling techniques.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper provides a comprehensive review of negative sampling techniques within the field of knowledge graph embedding (KGE). It highlights that while KGE has seen significant advancements in model architectures, the crucial role of negative sampling, inspired by techniques from natural language processing like word2vec, has been largely overlooked or under-explored. The authors propose a novel categorization of existing negative sampling approaches into static distribution-based, dynamic distribution-based, and custom cluster-based methods. Through this structured analysis, the paper demonstrates how different strategies address challenges such as the zero loss problem, false negatives, and training instability, ultimately impacting the quality of learned knowledge representations and performance in downstream tasks like link prediction. The review emphasizes that sophisticated negative sampling is essential for training robust and effective KGE models.

7.2. Limitations & Future Work

The paper, being a review, does not present its own limitations or future work in the context of a specific proposed method. However, it implicitly points out limitations of existing negative sampling techniques and suggests areas for future research within the field:

  • Limitations of Static Sampling: Uniform sampling and Bernoulli sampling are limited by their inability to adapt to the learning process, often generating low-quality or false negative samples.
  • Limitations of Dynamic Sampling (GAN-based): While powerful, GAN-based methods suffer from training instability, model collapse, high computational costs, and the need for complex reinforcement learning.
  • Limitations of Custom Cluster-Based Sampling: These methods rely on initial clustering or pre-computed criteria, which can add overhead and may struggle with the dynamic nature and rapid updates of real-world KGs, requiring continuous re-computation of clusters.
  • Need for Further Exploration: The paper explicitly states that negative sampling is "under explored and needs more attention and efforts."
  • Future Research Suggestions:
    • Comparative Studies: The authors hope their review will lead to subsequent work comparing the discussed approaches through KG downstream applications (like link prediction on benchmark datasets) to identify shortages of current negative sampling methods.
    • Novel Strategy Development: This comparative analysis should pave the way for proposing new strategies for negative sampling after a full understanding of existing ones. This implies a need for methods that combine the benefits of dynamic and cluster-based approaches while overcoming their respective challenges in stability, efficiency, and adaptability.
    • Theoretical Foundations: The work by Yang et al. [22] on deriving a general form for an effective negative sampling distribution is highlighted as pioneering significance, suggesting that more theoretical work in this area could yield general, robust solutions.

7.3. Personal Insights & Critique

This paper serves as an excellent foundational resource for anyone delving into KGE, especially beyond the embedding model architectures themselves. It effectively underscores the "hidden" but critical role of negative sampling.

  • Inspirations:

    • Holistic View: It inspires a more holistic view of KGE model development, emphasizing that innovations in negative sampling can be as impactful as, if not more so than, novel scoring functions. A robust KGE system requires strong components across the board, and negative sampling is clearly one such component.
    • Bridging Theory and Practice: The paper's discussion of MCNS highlights the value of theoretically derived sampling distributions. This suggests that future advancements might come from deeper mathematical understandings of optimal sampling rather than purely heuristic approaches.
    • Efficiency vs. Quality Trade-off: The analysis of GANs versus self-adversarial sampling and NSCaching provides a practical lesson in machine learning: sometimes simpler, more stable alternatives (even if conceptually less ambitious) can yield better real-world performance due to ease of training and efficiency.
    • Addressing Data Imperfections: The mention of confidence-aware negative sampling for noisy KGs is particularly insightful, demonstrating that negative sampling can also play a role in making KGE models more robust to real-world data imperfections, which are ubiquitous.
  • Potential Issues / Unverified Assumptions / Areas for Improvement:

    • Lack of Quantitative Synthesis: As a review, the paper focuses on qualitative descriptions. While it mentions some performance gains (e.g., for PNS), it lacks a direct comparative table that would visually summarize the performance differences (e.g., MRR, Hits@N) of various negative sampling methods across common KGE models and datasets. Such a table, even if compiled from existing literature, would greatly enhance the paper's utility for quickly comparing methods.
    • Practical Guidance for Selection: While it categorizes and discusses, it could offer more explicit guidelines on when to choose a particular negative sampling method given specific KG characteristics (e.g., density, noisiness, relation types, computational resources).
    • Interplay with KGE Models: The paper mentions that GAN-based frameworks are independent of the specific form of the discriminator. However, the performance of a negative sampling method can still be highly dependent on the underlying KGE model it's paired with. A deeper analysis of these interactions (e.g., which sampling methods work best with translational vs. semantic matching models) would be valuable.
    • Definition of "Quality": The concept of "high-quality negatives" is central. While implied to mean "hard but true-false," a more formal or operational definition across different methods could strengthen the analysis.
  • Transferability:

    • The insights into negative sampling are highly transferable to other domains involving embedding learning where discriminating positive from negative instances is crucial. This includes recommendation systems (e.g., item-item similarity learning), natural language processing (e.g., context-word prediction), and general graph representation learning beyond KGs.
    • The dynamic distribution-based strategies, especially MCNS (which is already noted as generic), and custom cluster-based ideas, could be adapted to any domain where the quality of generated "noise" significantly impacts model training and where semantic similarity or contextual relevance can be leveraged for better noise generation.
