Understanding Negative Sampling in Knowledge Graph Embedding
TL;DR Summary
This paper discusses negative sampling in knowledge graph embedding, highlighting its importance in training. It categorizes negative sampling methods into static, dynamic, and custom clustering approaches, offering new insights to enhance knowledge representation in recommendation and other downstream tasks.
Abstract
Knowledge graph embedding (KGE) is to project entities and relations of a knowledge graph (KG) into a low-dimensional vector space, which has made steady progress in recent years. Conventional KGE methods, especially translational distance-based models, are trained through discriminating positive samples from negative ones. Most KGs store only positive samples for space efficiency. Negative sampling thus plays a crucial role in encoding triples of a KG. The quality of generated negative samples has a direct impact on the performance of learnt knowledge representation in a myriad of downstream tasks, such as recommendation, link prediction and node classification. We summarize current negative sampling approaches in KGE into three categories, static distribution-based, dynamic distribution-based and custom cluster-based respectively. Based on this categorization we discuss the most prevalent existing approaches and their characteristics. It is a hope that this review can provide some guidelines for new thoughts about negative sampling in KGE.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Understanding Negative Sampling in Knowledge Graph Embedding".
1.2. Authors
The authors of the paper are:
- Jing Qian (Affiliated with Department of Intelligent Science, School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China, and Department of Computer Science, University of Liverpool, United Kingdom)
- Gangmin Li (Affiliated with Department of Intelligent Science, School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China)
- Katie Atkinson (Affiliated with Department of Computer Science, University of Liverpool, United Kingdom)
- Yong Yue (Affiliated with Department of Intelligent Science, School of Advanced Technology, Xi'an Jiaotong-Liverpool University, China)
1.3. Journal/Conference
The paper was published in the International Journal of Artificial Intelligence and Applications (IJAIA), Vol.12, No.1, January 2021. This is a peer-reviewed journal in the field of artificial intelligence, indicating its relevance and contribution to academic discourse in AI.
1.4. Publication Year
The paper was published on January 31, 2021.
1.5. Abstract
Knowledge graph embedding (KGE) involves projecting entities and relations of a knowledge graph (KG) into a low-dimensional vector space. Many KGE methods, particularly translational distance-based models, are trained by distinguishing positive (true) samples from negative (false) ones. Since most KGs only store positive samples for efficiency, negative sampling is crucial for encoding triples. The quality of these generated negative samples directly influences the performance of the learned knowledge representation in downstream tasks such as recommendation, link prediction, and node classification. The paper categorizes current negative sampling approaches in KGE into three types: static distribution-based, dynamic distribution-based, and custom cluster-based. It discusses prevalent approaches and their characteristics within this categorization, aiming to guide future research in negative sampling for KGE.
1.6. Original Source Link
The original source link is /files/papers/69354df00a9b802059199f26/paper.pdf. Given the publication details, it is an officially published paper.
2. Executive Summary
2.1. Background & Motivation
- Core Problem: Knowledge graphs (KGs) store facts as (head entity, relation, tail entity) triples. To leverage machine learning techniques, these symbolic representations need to be converted into numerical vector spaces, a process called knowledge graph embedding (KGE). Many KGE models, especially translational distance-based models, rely on distinguishing true facts (positive samples) from false ones (negative samples) during training. However, KGs typically only store positive facts for storage efficiency. The challenge lies in effectively generating negative samples that are informative enough to train robust KGE models without introducing misleading or trivial information.
- Problem Importance: The quality of negative samples directly impacts the learned knowledge representations. Poorly generated negative samples can lead to zero loss problems (where the model easily distinguishes true from false, learning little), false negatives (where a generated negative sample is actually a true but unobserved fact), or vanishing gradients, all of which hinder model training and lead to suboptimal performance in critical downstream tasks like link prediction (predicting missing links in a KG), recommendation systems, and node classification.
- Challenges/Gaps: While KGE research has largely focused on developing novel embedding models and scoring functions, the crucial aspect of negative sampling has been relatively underexplored and often relies on simple random sampling. Existing surveys on KGE mention negative sampling only briefly. There is a need for a systematic overview and a deeper understanding of various negative sampling strategies and their implications.
- Paper's Entry Point: This paper addresses this gap by providing the first systematic and exhaustive overview of existing negative sampling approaches in KGE. It categorizes them based on their sample source and discusses their characteristics, aiming to shed light on this under-appreciated but critical component of KGE training.
2.2. Main Contributions / Findings
- Primary Contributions:
- Comprehensive Categorization: The paper systematically categorizes existing negative sampling approaches in KGE into three main types: static distribution-based, dynamic distribution-based, and custom cluster-based. This provides a structured framework for understanding the landscape of negative sampling.
- Detailed Review of Approaches: It discusses the most prevalent methods within each category, explaining their mechanisms, advantages, and disadvantages.
- Highlighting Importance: The paper strongly argues for the critical role of negative sampling in KGE training, emphasizing that it is often overlooked compared to the development of new embedding models.
- Guidance for Future Research: By summarizing and categorizing existing methods, the paper aims to provide guidelines and stimulate new thoughts for future research in negative sampling, suggesting areas for improvement and further exploration.
- Key Conclusions/Findings:
- Negative sampling is a vital, yet under-explored, component of KGE model training, significantly impacting the quality of learned representations.
- Simple uniform sampling often produces low-quality negatives, leading to issues like zero loss and false negatives.
- More sophisticated methods, such as Bernoulli sampling, GAN-based sampling, Markov chain Monte Carlo negative sampling (MCNS), and custom cluster-based sampling, attempt to generate higher-quality, more informative negative samples by considering relation properties, dynamic distributions, or entity similarities.
- There is a clear trend towards more dynamic and context-aware negative sampling strategies to overcome the limitations of static random sampling.
- The paper implicitly concludes that focusing solely on embedding model architectures without considering the quality of negative samples can lead to suboptimal KGE performance.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Knowledge Graph (KG): A knowledge graph (KG) is a structured representation of information that models real-world entities and their relationships in a graph format. Nodes in the graph represent entities (e.g., "London," "United Kingdom," "Barack Obama"), and edges represent relations (e.g., "locatedIn," "Gender," "PresidentOf") between these entities. Facts in a KG are typically expressed as triples in the form (head entity, relation, tail entity) or (subject, predicate, object). For example, (London, locatedIn, UnitedKingdom). KGs like Freebase, NELL, and YAGO are widely used.
- Knowledge Graph Embedding (KGE): Knowledge graph embedding (KGE) is the process of mapping entities and relations from a symbolic knowledge graph into a low-dimensional continuous vector space (i.e., numerical vectors or "embeddings"). The goal is to capture the semantic meanings and structural properties of the KG within these vector representations, making them amenable to machine learning algorithms. For instance, "London" and "locatedIn" would each be represented by a low-dimensional real-valued vector.
- Translational Distance-based Models: These are a popular family of KGE models where relations are conceptualized as "translations" in the embedding space. The core idea is that if a triple (h, r, t) is true, then the embedding of the head entity $\mathbf{h}$ plus the embedding of the relation $\mathbf{r}$ should be close to the embedding of the tail entity $\mathbf{t}$. Mathematically, this is often expressed as $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$. Examples include TransE, TransH, TransR.
- Scoring Function: In KGE models, a scoring function f(h, r, t) is a mathematical function that takes a triple (h, r, t) as input and outputs a score indicating its plausibility or truthfulness. A higher score typically means the triple is more likely to be true. Models are trained to assign high scores to positive (true) triples and low scores to negative (false) triples.
- Positive Samples: These are the ground-truth facts directly observed and stored in the knowledge graph. They represent true relationships between entities.
- Negative Samples: These are false facts generated by corrupting positive samples. For a positive triple (h, r, t), a negative sample might be (h', r, t) (replacing the head), (h, r, t') (replacing the tail), or (h, r', t) (replacing the relation), where h' or t' are other entities and r' is another relation, such that the corrupted triple is unlikely to be true. These are crucial for teaching the model what is not true.
- Noise Contrastive Estimation (NCE): Noise Contrastive Estimation (NCE) is a statistical method for estimating parameters of unnormalized probability models (models where the normalization constant, or partition function, is hard to compute). Instead of directly estimating the probability distribution, NCE rephrases the problem as a binary classification task: distinguishing between "true" data samples (positive samples) and "noise" samples (negative samples) drawn from a known noise distribution. This significantly reduces computational complexity, especially for large vocabularies or continuous spaces.
- Negative Sampling (in the word2vec context): Negative sampling is a simplification of NCE popularized by word2vec, where it is used to train word embeddings efficiently. Instead of updating all weights for every word in the vocabulary, it only updates the weights for the positive word in context and a small number of randomly sampled negative words, which speeds up training considerably. KGE adopts this concept for triples.
- Open World Assumption (OWA) vs. Closed World Assumption (CWA):
  - Closed World Assumption (CWA): States that any fact not explicitly present in the KG is considered false. This assumes KGs are complete, which is rarely true for real-world KGs.
  - Open World Assumption (OWA): States that facts not explicitly present in the KG are simply unknown and could be either true or false. This is a more realistic assumption for incomplete KGs and is generally preferred in KGE. The paper notes that KGE models are typically trained under OWA due to the inherent incompleteness of KGs.
- Zero Loss Problem: This occurs when the generated negative samples are too obviously false or easily distinguishable from positive samples. In such cases, the KGE model quickly learns to assign very high scores to positives and very low scores to negatives, leading to a loss value that approaches zero too rapidly. When the loss is near zero, the model's parameters receive very small updates (due to small gradients), effectively stopping meaningful learning and preventing the model from capturing subtle semantic differences.
3.2. Previous Works
The paper primarily focuses on negative sampling techniques and their interaction with KGE models rather than introducing a new KGE model. However, it references numerous KGE models as context or as discriminators within GAN-based negative sampling frameworks.
- TransE [5]: A foundational translational distance-based KGE model. It models relations as translations in the embedding space: $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ for a true triple (h, r, t). It uses uniform sampling for negatives.
- TransH [12]: An improvement over TransE that addresses its limitations in handling 1-to-N, N-to-1, and N-to-N relations. It projects entities onto a hyperplane specific to each relation before applying the translation operation. It introduced Bernoulli sampling.
- TransR [13], TransD [14], TransG [15]: Further variants of TransE that embed entities and relations into different spaces or use dynamic mapping matrices to handle more complex relation types.
- RESCAL [6]: An early semantic matching-based KGE model that uses matrix factorization. It models each relation as a matrix $\mathbf{M}_r$, and the plausibility of (h, r, t) is measured by the bilinear form $\mathbf{h}^{\top} \mathbf{M}_r \mathbf{t}$.
- DistMult [31]: Simplifies RESCAL by constraining $\mathbf{M}_r$ to be a diagonal matrix, which can effectively model symmetric relations.
- ComplEx [23]: Extends DistMult to the complex number space to better handle antisymmetric relations.
- KBGAN [16]: The first work to apply Generative Adversarial Networks (GANs) to negative sampling in KGE. It uses one KGE model as a generator to create negatives and another KGE model as a discriminator.
- IGAN [17]: Another GAN-based approach, which uses a two-layer fully-connected neural network as its generator, unlike KBGAN which uses KGE models as generators.
- MCNS (Markov chain Monte Carlo Negative Sampling) [22]: Proposes a theoretically derived negative sampling distribution that is positively but sub-linearly correlated with the positive sampling distribution. It uses depth first search (DFS) and the Metropolis-Hastings algorithm for efficient sampling.
- RotatE [49]: While not solely a negative sampling method, it proposes a self-adversarial sampling approach as an alternative to GAN-based methods, avoiding reinforcement learning and improving efficiency.
- CKRL [20] and NKRL [21]: These works focus on confidence-aware knowledge representation learning, particularly dealing with noisy KGs. NKRL extends CKRL by introducing a confidence-aware negative sampling method to generate plausible negatives and reduce false detection of noises.
3.3. Technological Evolution
The field of KGE has evolved from discrete representations to continuous vector spaces, mirroring trends in natural language processing (NLP) with word embeddings.
- Early KGE (pre-2013): Focused on symbolic logic and rule-based systems, or matrix factorization methods like RESCAL. These often struggled with scalability and capturing rich semantics.
- Translational Models (2013 onwards): The advent of TransE [5] (inspired by word2vec [4]) marked a significant shift. It introduced the idea of relations as translations in a vector space, offering simplicity and efficiency. This led to a wave of variants (TransH, TransR, TransD, TransG) that addressed TransE's limitations, such as handling different relation types and modeling complexity.
- Semantic Matching Models (mid-2010s): Alongside translational models, semantic matching models (e.g., DistMult, ComplEx) gained prominence, focusing on modeling interactions between entity and relation embeddings using dot products or other similarity measures.
- Neural Network-based Models (late 2010s): The deep learning revolution brought neural networks into KGE, leading to models like ConvE (using 2D convolutions), R-GCN (Graph Convolutional Networks), and KG-BERT (Transformer-based models), integrating KGE with advanced neural architectures.
- Focus on Negative Sampling (2018 onwards): While negative sampling was always present (e.g., uniform sampling in TransE), its critical importance became more recognized. Papers like KBGAN and IGAN introduced Generative Adversarial Networks (GANs) to dynamically generate higher-quality negatives. MCNS provided a theoretical foundation for optimal negative sampling distributions. Custom cluster-based methods emerged to refine the search space for negatives.

This paper fits into the latter part of this timeline, specifically addressing the growing realization that while embedding models have advanced, the quality of training data (especially negative samples) is equally crucial for robust KGE.
3.4. Differentiation Analysis
Compared to the primary focus of most KGE research (developing novel embedding models), this paper distinguishes itself by not proposing a new KGE model. Instead, its core innovation lies in:
- Meta-analysis of Negative Sampling: It provides a comprehensive, structured review of negative sampling methods, a crucial component that directly influences KGE model training but is often treated as a secondary concern. Most KGE papers either briefly mention their chosen negative sampling strategy (e.g., uniform sampling) or introduce a new one as part of their model without a broader discussion of the field.
- Categorization Scheme: The paper introduces a novel categorization (static, dynamic, custom cluster-based) that helps organize and understand the diverse landscape of negative sampling techniques, providing a clear framework for comparison and future development.
- Emphasis on Impact: It explicitly highlights how different negative sampling strategies address issues like zero loss, false negatives, and training instability, which are often implicit or secondary considerations in papers primarily focused on embedding model architecture.
- Call to Action: By exhaustively reviewing existing methods and pointing out their strengths and weaknesses, the paper serves as a foundational resource and a call for more dedicated research into negative sampling, positioning it as an independent and vital area of KGE research rather than just a utility for KGE models.
4. Methodology
The paper's "methodology" is not a novel KGE model or a new negative sampling algorithm, but rather a systematic approach to categorize and analyze existing negative sampling techniques. It proposes a three-category classification based on the source or generation mechanism of the negative samples. Within each category, it discusses prominent examples, highlighting their principles, advantages, and drawbacks.
4.1. Principles
The core principle behind negative sampling in KGE is Noise Contrastive Estimation (NCE): transforming the task of learning a complex, unnormalized probability distribution over triples into a simpler binary classification task. The model is trained to discriminate between positive samples (true facts from the KG) and negative samples (generated false facts). The quality of these generated negative samples is paramount. If negatives are too easy to distinguish, the model learns little (zero loss problem). If they are too hard or are false negatives (actually true facts), the model can be misled.
The paper's categorization framework is based on how the "noise samples" (negatives) are generated:
- Static Distribution-Based Sampling: Negatives are drawn from a fixed, predefined probability distribution, usually based on global statistics of the KG. These distributions do not change during training.
- Dynamic Distribution-Based Sampling: Negatives are generated from a distribution that adapts or evolves during the training process, often leveraging feedback from the KGE model itself.
- Custom Cluster-Based Sampling: Negatives are sampled from a restricted set of candidate entities or relations that are pre-selected or clustered based on certain criteria (e.g., semantic similarity, entity types), rather than from the entire entity set.
4.2. Core Methodology In-depth (Layer by Layer)
The paper begins by defining the general context for negative sampling. For a standard KG, let $\mathcal{E}$ be the set of entities and $\mathcal{R}$ be the set of relations. $\mathbb{D}^+$ represents the set of positive (true) triples $(h, r, t)$. Negative triples, denoted $(h', r', t')$, belong to the set $\mathbb{D}^-$, which the paper formally defines as:

$ \mathbb{D}^- = \{(h', r, t) \mid h' \in \mathcal{E}, h' \neq h\} \cup \{(h, r, t') \mid t' \in \mathcal{E}, t' \neq t\} \cup \{(h, r', t) \mid r' \in \mathcal{R}, r' \neq r\}, \quad (h, r, t) \in \mathbb{D}^+ $

Here:
- $(h', r', t') \in \mathbb{D}^-$: a negative triple is an element of the set of all possible negative triples $\mathbb{D}^-$.
- The definition of $\mathbb{D}^-$ is a union of three types of corrupted triples:
  - The first set describes triples where the head entity h of a positive triple is replaced by another entity $h'$ from the entity set $\mathcal{E}$, such that $h' \neq h$. The relation and tail entity remain the same.
  - The second set describes triples where the tail entity t of a positive triple is replaced by another entity $t'$ from the entity set $\mathcal{E}$, such that $t' \neq t$. The head entity and relation remain the same.
  - The third set describes triples where the relation r of a positive triple is replaced by another relation $r'$ (the paper writes the candidate set as the entity set, which is likely a typo and should be $r' \in \mathcal{R}$), such that $r' \neq r$. The head and tail entities remain the same.
KGE models then use a scoring function to measure the plausibility of a triple, aiming to assign higher scores to positive triples and lower scores to negative ones. The goal of negative sampling methods is to provide effective negative samples for this training.
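For reference, a common concrete instance of this training objective (a standard margin-based ranking loss of the kind used by TransE-style models, stated here for orientation rather than quoted from the paper) is:

$ \mathcal{L} = \sum_{(h,r,t) \in \mathbb{D}^+} \sum_{(h',r',t') \in \mathbb{D}^-} \max\left(0,\ \gamma - f(h,r,t) + f(h',r',t')\right) $

where $\gamma > 0$ is a margin hyperparameter and $f$ is the scoring function. The loss becomes zero once every positive triple outscores its sampled negatives by at least $\gamma$, which is exactly the zero loss situation described earlier when negatives are too easy.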
4.2.1. Static Distribution-Based Sampling
These methods use predefined, unchanging probability distributions to generate negative samples. They are simple and efficient but can suffer from issues like zero loss or false negatives because they don't adapt during training.
4.2.1.1. Uniform Sampling
- Principle: The simplest approach. For a positive triple (h, r, t), either the head or the tail is replaced by an entity randomly sampled from the entire entity set $\mathcal{E}$ with uniform probability.
- Mechanism: To generate a negative triple from (h, r, t) (a code sketch follows this list):
  - Randomly choose to corrupt either the head or the tail (e.g., with 0.5 probability for each).
  - If corrupting the head: sample $h' \in \mathcal{E}$ uniformly such that $h' \neq h$. The negative triple becomes (h', r, t).
  - If corrupting the tail: sample $t' \in \mathcal{E}$ uniformly such that $t' \neq t$. The negative triple becomes (h, r, t').
- Characteristics:
  - Pros: Easy to implement, computationally inexpensive.
  - Cons: Often generates low-quality negatives that are too obviously false (e.g., (London, locatedIn, apple)), leading to the zero loss problem. It also has a high probability of generating false negatives (e.g., replacing "DonaldTrump" in (DonaldTrump, Gender, Male) with "JoeBiden" yields (JoeBiden, Gender, Male), which is also true).
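As an illustration only (not code from the paper; the function name and triple format are assumptions), uniform corruption can be sketched as:

```python
import random

def uniform_negative_sample(triple, entities):
    """Corrupt a positive (head, relation, tail) triple by uniformly replacing
    either the head or the tail with a randomly chosen entity."""
    h, r, t = triple
    if random.random() < 0.5:             # corrupt the head with probability 0.5
        h_new = random.choice(entities)
        while h_new == h:                 # keep the replacement different from the original
            h_new = random.choice(entities)
        return (h_new, r, t)
    t_new = random.choice(entities)       # otherwise corrupt the tail
    while t_new == t:
        t_new = random.choice(entities)
    return (h, r, t_new)

# Toy usage: the sampler knows nothing about plausibility, so it can easily
# produce nonsense like (London, locatedIn, apple) or a false negative.
entities = ["London", "UnitedKingdom", "Paris", "apple"]
print(uniform_negative_sample(("London", "locatedIn", "UnitedKingdom"), entities))
```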
4.2.1.2. Bernoulli Sampling
- Principle: Introduced in TransH [12], this method aims to mitigate false negatives by adapting the probability of corrupting the head or tail based on the mapping properties of the relation r. For 1-to-N relations (one head maps to many tails), it is better to corrupt the head; for N-to-1 relations (many heads map to one tail), it is better to corrupt the tail, since corrupting the "many" side would frequently produce triples that are actually true.
- Mechanism: For each relation r, two statistics are calculated (a code sketch of this computation follows below):
  - tph (tails per head): the average number of tail entities per head entity for relation r.
  - hpt (heads per tail): the average number of head entities per tail entity for relation r.
  - The probability of replacing the head entity for a given relation is $\frac{tph}{tph + hpt}$.
  - The probability of replacing the tail entity for a given relation is $\frac{hpt}{tph + hpt}$.
- Characteristics:
  - Pros: Reduces false negatives by making more context-aware corruption choices. For many-to-one relations like "Gender," it preferentially replaces the tail, reducing the chance of generating another true (person, Gender, Male) fact.
  - Cons: Still a static distribution. It doesn't consider the current state of the model or semantic similarity beyond relation type.
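A minimal sketch of how the per-relation statistics and the head-replacement probability could be computed (illustrative only; the helper name and data layout are assumptions, not the TransH implementation):

```python
from collections import defaultdict

def bernoulli_head_replacement_prob(triples):
    """For each relation, estimate tph (average tails per head) and hpt (average
    heads per tail) and return the probability of corrupting the head,
    tph / (tph + hpt); the tail is then corrupted with the remaining probability."""
    tails_of = defaultdict(set)   # (relation, head) -> distinct tails
    heads_of = defaultdict(set)   # (relation, tail) -> distinct heads
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)

    tph_counts = defaultdict(list)
    hpt_counts = defaultdict(list)
    for (r, _h), ts in tails_of.items():
        tph_counts[r].append(len(ts))
    for (r, _t), hs in heads_of.items():
        hpt_counts[r].append(len(hs))

    prob_replace_head = {}
    for r in tph_counts:
        tph = sum(tph_counts[r]) / len(tph_counts[r])
        hpt = sum(hpt_counts[r]) / len(hpt_counts[r])
        prob_replace_head[r] = tph / (tph + hpt)
    return prob_replace_head
```

During training, a coin biased by `prob_replace_head[r]` then decides whether the head or the tail of a positive triple with relation r is corrupted.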
- Improvement in Bernoulli sampling (Zhang et al. [45]):
- Principle: Extends Bernoulli sampling by also considering relation replacement.
- Mechanism: Introduces a probability $\alpha$ for relation replacement, calculated as:
  $ \alpha = \frac{r_{\text{num}}}{r_{\text{num}} + e_{\text{num}}} $
  where:
  - $r_{\text{num}}$ is the total number of relations in the KG.
  - $e_{\text{num}}$ is the total number of entities in the KG.
  - The remaining probability $1 - \alpha$ is then distributed between head and tail entity replacement according to the standard Bernoulli distribution based on tph and hpt.
- Characteristics: Enhances the model's ability in relation link prediction by allowing for relation corruption (a sketch combining this with the Bernoulli probabilities follows below).
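Continuing the previous sketch (again hypothetical code, reusing the assumed `prob_replace_head` mapping), the extension can be expressed as choosing relation corruption with probability $\alpha$ and otherwise falling back to the Bernoulli head/tail choice:

```python
import random

def corrupt_with_relation_replacement(triple, entities, relations, prob_replace_head):
    """Extended Bernoulli scheme: with probability alpha = |R| / (|R| + |E|)
    the relation is corrupted; otherwise the head or tail is corrupted
    according to the Bernoulli probability for that relation."""
    h, r, t = triple
    alpha = len(relations) / (len(relations) + len(entities))
    if random.random() < alpha:                                  # corrupt the relation
        r_new = random.choice([x for x in relations if x != r])
        return (h, r_new, t)
    if random.random() < prob_replace_head[r]:                   # corrupt the head
        h_new = random.choice([x for x in entities if x != h])
        return (h_new, r, t)
    t_new = random.choice([x for x in entities if x != t])       # corrupt the tail
    return (h, r, t_new)
```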
4.2.1.3. Probabilistic Sampling (PNS)
- Principle: Proposed by Kanojia et al. [46] to address skewed data in KGs, where some relations have very few training examples. For such sparse relations, uniform or Bernoulli sampling might struggle to generate useful negatives. PNS introduces a tuning parameter (train bias) to complement generated negatives with "early-listed possible instances" to speed up training.
- Mechanism: The paper describes it as bringing in a tuning parameter (train bias) that determines the probability by which generated negative examples are complemented with early-listed possible instances. While the exact formula for incorporating the train bias isn't provided in this review paper, the concept implies prioritizing certain candidates as negative samples, likely based on some pre-computed plausibility or difficulty measure, especially for sparse relations.
- Characteristics:
  - Pros: Addresses the issue of skewed data and sparse relations, potentially speeding up training in these cases.
  - Cons: The exact definition of "early-listed possible instances" and how precisely the train bias modifies the sampling process are not detailed here, requiring consultation of the original PNS paper. It still operates on a predefined bias, making it static.
4.2.2. Dynamic Distribution-Based Sampling
These methods dynamically adjust the negative sampling distribution during training, often guided by the current state of the KGE model itself. This allows for generating high-quality negatives that are "harder" for the model to distinguish.
4.2.2.1. KBGAN
- Principle: KBGAN [16] is the first to leverage Generative Adversarial Networks (GANs) for negative sampling in KGE. A GAN consists of two neural networks: a generator and a discriminator that are trained adversarially. In KBGAN, the generator's role is to produce realistic-looking negative samples that can fool the discriminator, and the discriminator's role (which is the target KGE model) is to distinguish between real (positive) and fake (negative) samples.
- Mechanism (a code sketch follows this list):
  - Generator (G): Takes a positive triple (h, r, t). It is typically another KGE model (e.g., DistMult, ComplEx). It samples from a candidate set of uniformly sampled negatives and assigns probabilities to them, then selects the "best" negative (highest probability or hardest) to pass to the discriminator.
  - Discriminator (D): This is the target KGE model (e.g., TransE, TransD) being trained. It receives both positive triples from $\mathbb{D}^+$ and negative triples from the generator. Its objective is to correctly classify them and learn knowledge representations by minimizing a margin-based loss function (e.g., hinge loss) between positive and negative scores.
  - Adversarial Training: The generator is trained to maximize the discriminator's error (make negatives look more plausible). The discriminator is trained to minimize its error (better distinguish positives from negatives). This dynamic interplay forces the generator to produce high-quality, challenging negatives, which in turn pushes the discriminator (KGE model) to learn more robust embeddings.
- Characteristics:
  - Pros: Generates high-quality negatives that are difficult for the KGE model to distinguish, leading to more robust embeddings.
  - Cons: GANs are notoriously difficult to train, prone to training instability (e.g., mode collapse), and require reinforcement learning. Pre-training the generator and discriminator can add significant computational cost.
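The candidate-then-softmax selection step can be illustrated with a framework-free sketch (the `generator_score` callable, candidate-set size, and single-sample selection are assumptions; KBGAN's actual generator is a KGE model trained with a policy-gradient reward from the discriminator):

```python
import math
import random

def kbgan_style_negative(positive, entities, generator_score, num_candidates=20):
    """Draw a small set of uniformly corrupted triples, score them with the
    generator, turn the scores into a softmax distribution, and sample one
    'hard' negative from it to feed to the discriminator."""
    h, r, t = positive
    candidates = []
    for _ in range(num_candidates):
        if random.random() < 0.5:
            candidates.append((random.choice(entities), r, t))   # corrupt head
        else:
            candidates.append((h, r, random.choice(entities)))   # corrupt tail
    scores = [generator_score(c) for c in candidates]
    max_s = max(scores)                                          # numerical stability
    weights = [math.exp(s - max_s) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # In KBGAN the generator is then updated with a REINFORCE-style reward
    # based on the discriminator's score for the sampled negative.
    return random.choices(candidates, weights=probs, k=1)[0]
```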
4.2.2.2. IGAN
- Principle: IGAN [17] also uses a GAN-based framework but differentiates itself in the generator's architecture. Instead of using another KGE model as the generator (like KBGAN), IGAN employs a neural network to generate negative samples.
- Mechanism:
  - Generator (G): A two-layer fully-connected neural network. It takes the embedding vectors of a positive triple (h, r, t) as input.
    - The input vectors are passed through the neural network, followed by a non-linear ReLU activation function.
    - A softmax function is applied to the output to calculate a probability distribution over the entire entity set (not just a small candidate set as in KBGAN) for replacing the head or tail. This allows for more flexible and less constrained negative generation.
  - Discriminator (D): Still the target KGE model, similar to KBGAN. It measures the plausibility of received triples using its scoring function.
  - Adversarial Training: Similar to KBGAN, the generator is trained to produce negatives that receive high plausibility scores from the discriminator, while the discriminator learns to assign low scores to these generated negatives and high scores to true positives.
- Characteristics:
  - Pros: Dynamically selects high-quality negatives using a dedicated neural network generator, potentially more flexible than using another KGE model.
  - Cons: High computational complexity due to calculating probability distributions over the entire entity set. Shares the general difficulties of GAN training (instability, model collapse, need for reinforcement learning). Requires pre-training.
- Comparison between GAN-based and self-adversarial sampling:
  - The paper briefly mentions self-adversarial sampling (e.g., in RotatE [49]) as an alternative to GAN-based approaches (see the sketch after this list).
  - Challenge of GANs: GANs are hard to optimize because they involve training a discrete negative sample generator and the embedding model simultaneously, often requiring reinforcement learning, which incurs high computational costs and risks training instability.
  - Self-adversarial sampling (RotatE): Avoids reinforcement learning by using the self-scoring function of the current embedding model to sample negatives. It introduces a temperature parameter for sampling. This makes it easier to operate and often outperforms GAN-based methods in efficiency and effectiveness.
  - Conclusion: GAN-based sampling, despite its potential for high-quality negatives, struggles with efficiency and stability, often requiring pre-training. Self-adversarial sampling offers a more practical alternative.
4.2.2.3. MCNS (Markov chain Monte Carlo Negative Sampling)
- Principle: MCNS [22] offers a theoretically grounded approach to dynamic negative sampling. It derives an effective negative sampling distribution that should be positively but sub-linearly correlated with the positive sampling distribution. This means that entities that frequently appear in positive triples (and are thus "important") should also be more likely to be sampled as negatives, but not disproportionately so.
- Mechanism (a generic sketch of the Metropolis-Hastings step follows this list):
  - Sampled NCE Framework: MCNS operates within a Sampled Noise Contrastive Estimation (NCE) framework.
  - Positive Sampling Distribution Estimation: It uses self-contrast approximation to estimate the underlying positive sampling distribution.
  - Negative Sample Generation: The depth first search (DFS) algorithm is applied to traverse the graph to obtain a Markov chain for the last node (e.g., the tail entity). Negative samples are then generated from this Markov chain.
  - Speed-up: The Metropolis-Hastings algorithm [50] is used to accelerate the negative sampling process.
  - Model Update: Embedding vectors are updated by minimizing a hinge loss function after inputting the positive sample and its generated negative counterpart into the encoder of the framework.
- Characteristics:
  - Pros: Provides a pioneering theoretical derivation for an effective negative sampling distribution, not limited to KGE but applicable to general graph representation learning. Exhibits strong performance in downstream tasks and high efficiency.
  - Cons: The complexity of implementing DFS for Markov chain generation and Metropolis-Hastings may be higher than that of simpler methods.
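MCNS's full sampler (DFS traversal plus Metropolis-Hastings over the derived distribution) is beyond the scope of this review, but the accelerating Metropolis-Hastings step it relies on can be illustrated generically, assuming a symmetric proposal such as jumping to a random graph neighbour:

```python
import random

def metropolis_hastings_step(current, propose, target_density):
    """One generic Metropolis-Hastings step: propose a candidate node and accept
    it with probability min(1, target(candidate) / target(current)); otherwise
    keep the current node. Repeated steps yield a Markov chain whose samples
    approximate the target negative-sampling distribution."""
    candidate = propose(current)
    accept_prob = min(1.0, target_density(candidate) / max(target_density(current), 1e-12))
    return candidate if random.random() < accept_prob else current
```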
4.2.3. Custom Cluster-Based Sampling
These methods pre-process the entity set into clusters or groups based on certain criteria. Negative samples are then drawn from within these restricted clusters, rather than from the entire entity set $\mathcal{E}$, to ensure they are "plausible" or "harder" negatives.
4.2.3.1. TransE-SNS (Similarity-based Negative Sampling)
- Principle: TransE-SNS [18] is based on the idea that valid negatives should be semantically similar to the original entity they replace. If two entities are close in the embedding space, they are considered similar. Corrupting a triple by replacing an entity with a similar but incorrect one generates a "harder" negative.
- Mechanism (a code sketch follows this list):
  - Clustering: The K-Means clustering algorithm [52] is applied to the entity embeddings to divide all entities into a number of groups (clusters) based on their similarity in the vector space.
  - Negative Generation: For a positive triple (h, r, t):
    - If corrupting the head: a replacement entity is uniformly sampled only from the cluster to which the original head entity belongs.
    - If corrupting the tail: a replacement entity is uniformly sampled only from the cluster to which the original tail entity belongs.
- Characteristics:
  - Pros: Generates high-quality negatives that are semantically related to the original entities, making them harder to distinguish and more informative for training. Enhances the performance of KGE models like TransE.
  - Cons: Requires an initial clustering step, which adds computational overhead. The quality of negative samples depends heavily on the quality of the initial entity embeddings and the clustering algorithm. KGs are dynamic, so clusters might need frequent re-computation.
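A rough sketch of the cluster-then-sample idea using scikit-learn's K-Means (the number of clusters, helper names, and index-based entity representation are assumptions, not the TransE-SNS code):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_entity_clusters(entity_embeddings, num_clusters=50, seed=0):
    """Cluster entity embeddings with K-Means and return, for each cluster id,
    the indices of the entities it contains (the candidate negatives)."""
    labels = KMeans(n_clusters=num_clusters, random_state=seed).fit_predict(entity_embeddings)
    members = {c: np.where(labels == c)[0] for c in range(num_clusters)}
    return labels, members

def sample_similar_negative(entity_idx, labels, members, rng=np.random):
    """Sample a replacement entity uniformly from the cluster of the original entity."""
    candidates = members[labels[entity_idx]]
    choice = rng.choice(candidates)
    while choice == entity_idx and len(candidates) > 1:
        choice = rng.choice(candidates)
    return int(choice)
```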
4.2.3.2. NSCaching (Negative Sampling Caching)
- Principle: NSCaching [19] is motivated by the observation that high-quality negative samples (i.e., "hard" negatives) tend to receive relatively high plausibility scores from the scoring function. Instead of generating these hard negatives from scratch every time, NSCaching maintains a cache of previously identified high-plausibility negatives and samples from this cache. It can be seen as a distilled version of GAN-based strategies.
- Mechanism (a simplified code sketch follows this list):
  - Cache Maintenance: NSCaching stores a collection of negative triples that have been identified as "helpful" or "rare" (meaning they received high plausibility scores from the current KGE model).
  - Sampling from Cache: During training, negative samples are drawn uniformly from this maintained cache.
  - Cache Update (Importance Sampling): The cache is dynamically updated using importance sampling. This means that new high-plausibility negatives encountered during training are added to the cache, and older, less useful ones may be replaced or removed. This ensures the cache always contains challenging negatives.
- Characteristics:
  - Pros: Efficient, as it samples from a pre-identified set of hard negatives rather than generating them from a large space. Avoids the complexities, training instability, and model collapse issues associated with GANs. Performs better than GAN-based approaches in efficiency and effectiveness.
  - Cons: Requires careful design of the cache update mechanism and of the criteria for "high-plausibility" negatives. The initial filling of the cache and its continuous maintenance add some overhead.
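An intentionally simplified sketch of the cache idea (keeping the top-scoring negatives instead of the paper's importance-sampling update; names and cache size are assumptions):

```python
import random

def update_cache(cache, candidate_negatives, score_fn, cache_size=50):
    """Merge newly sampled corrupted triples with the existing cache and keep
    the highest-scoring (hardest) ones. NSCaching itself uses importance
    sampling for this update; retaining the top-scoring entries is a
    simplification of that idea."""
    pool = list(cache) + list(candidate_negatives)
    pool.sort(key=score_fn, reverse=True)     # higher score = more plausible negative
    return pool[:cache_size]

def sample_from_cache(cache):
    """Negatives used for training are drawn uniformly from the cache."""
    return random.choice(cache)
```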
4.2.4. Other Novel Approaches
4.2.4.1. Confidence-Aware Negative Sampling (in NKRL)
- Principle: NKRL [21] (building on CKRL [20]) addresses the problem of noisy KGs, where some "positive" triples might actually be false. It introduces confidence-aware negative sampling to simultaneously detect noise in the KG and generate plausible negatives. The core idea is to assign a confidence score not just to positive triples, but also to negative ones, guiding the sampling process towards more informative negatives.
- Mechanism:
  - Noise Detection & Negative Sampling: Unlike traditional methods that assume all positive triples are true, NKRL detects potential noises within $\mathbb{D}^+$ while generating negatives.
  - Negative Triple Confidence: It proposes a concept of negative triple confidence, which quantifies the plausibility of a generated negative triple. This confidence is used to select "plausible negatives" for training.
  - Triple Quality Function Modification: NKRL modifies the triple quality function (originally defined in CKRL) to improve noise detection and reduce false detection problems. While the exact formulas for the confidence and quality functions are not provided in this review, they imply a scoring mechanism that rates the "truthiness" of all triples, including generated negatives.
- Characteristics:
  - Pros: Addresses the practical problem of noisy KGs, making KGE more robust to real-world data imperfections. Generates plausible negatives by measuring their quality with confidence scores.
  - Cons: Increases complexity by simultaneously handling noise detection and negative sampling. The effectiveness depends on the accuracy of the confidence-aware mechanism. The paper notes its similarity to self-adversarial sampling in RotatE, in that both use the current embedding model's self-scoring function for sampling.
5. Experimental Setup
The paper is a review and categorization of existing negative sampling techniques, not a presentation of a new experimental method or model. Therefore, it does not describe its own experimental setup, datasets, evaluation metrics, or baselines. Instead, it discusses the typical evaluation contexts and findings from the original papers that introduced the negative sampling methods.
To provide a beginner-friendly overview, I will outline the general experimental setup and common evaluation metrics used in the KGE field, which are relevant to evaluating the impact of negative sampling.
5.1. Datasets
KGE models, and thus negative sampling techniques, are typically evaluated on benchmark datasets that represent different domains and scales of knowledge graphs. The paper mentions two such datasets in the context of Probabilistic Sampling:
- WN18 (WordNet 18): A subset of WordNet, a lexical database of English. It contains relationships between words (e.g., hypernymy, hyponymy).
  - Characteristics: Relatively small, but semantically rich. Often used for initial evaluation. It is known to contain inverse relations where (h, r, t) implies (t, r', h) (e.g., _hypernym and _hyponym), which some KGE models struggle with.
  - Data Sample Example (Conceptual):
    - (dog, _hypernym, canine)
    - (car, _has_part, wheel)
- FB15K (Freebase 15k): A subset of Freebase, a large collaborative knowledge base. It contains general facts about people, places, and things.
  - Characteristics: Larger and more diverse than WN18, containing a wider variety of entities and relations. It is often used to evaluate the scalability and generalizability of KGE models.
  - Data Sample Example (Conceptual):
    - (Barack Obama, _born_in, Honolulu)
    - (London, _located_in, United Kingdom)

These datasets are chosen because they are widely recognized benchmarks in KGE research, allowing for fair comparison across different models and methods. They represent common challenges in KG data, such as varying relation types and density.
5.2. Evaluation Metrics
The primary task for evaluating KGE models (and indirectly, the negative sampling methods used to train them) is link prediction, which aims to predict missing entities or relations in triples. The paper mentions Mean Rank and Hits@N as common metrics, specifically citing link prediction and triple classification as downstream tasks.
- Link Prediction: Given a corrupted triple (e.g., (h, r, ?), (?, r, t), or (h, ?, t)), the model predicts the missing component. This usually involves ranking all candidate entities/relations based on their plausibility score from the KGE model.
5.2.2.1. Mean Rank (MR)
- Conceptual Definition: Mean Rank (MR) measures the average rank of correct entities/relations in a ranked list generated by the KGE model. For a true triple (h, r, t), the model generates scores for all possible candidate tail entities (if predicting t) or head entities (if predicting h). The rank of the correct entity among all candidates is then determined. A lower Mean Rank indicates better performance, as the correct entity is ranked higher on average.
- Mathematical Formula: $ \text{MR} = \frac{1}{|\mathbb{D}_{\text{test}}|} \sum_{(h,r,t) \in \mathbb{D}_{\text{test}}} \left( \text{rank}(t \mid h,r) + \text{rank}(h \mid r,t) \right) / 2 $
- Symbol Explanation:
  - $|\mathbb{D}_{\text{test}}|$: The number of triples in the test set.
  - $\text{rank}(t \mid h,r)$: The rank of the true tail entity when predicting t given (h, r), among all candidate tail entities.
  - $\text{rank}(h \mid r,t)$: The rank of the true head entity when predicting h given (r, t), among all candidate head entities.
  - The division by 2 averages the ranks for the head and tail prediction tasks.
5.2.2.2. Mean Reciprocal Rank (MRR)
- Conceptual Definition: Mean Reciprocal Rank (MRR) is the average of the reciprocals of the ranks of the correct entities. If the correct entity is ranked first, its reciprocal rank is 1; if it is ranked second, it is 1/2, and so on. MRR emphasizes getting the correct entity ranked very highly. A higher MRR indicates better performance.
- Mathematical Formula: $ \text{MRR} = \frac{1}{|\mathbb{D}_{\text{test}}|} \sum_{(h,r,t) \in \mathbb{D}_{\text{test}}} \left( \frac{1}{\text{rank}(t \mid h,r)} + \frac{1}{\text{rank}(h \mid r,t)} \right) / 2 $
- Symbol Explanation:
  - $|\mathbb{D}_{\text{test}}|$: The number of triples in the test set.
  - $\text{rank}(t \mid h,r)$: The rank of the true tail entity when predicting t given (h, r).
  - $\text{rank}(h \mid r,t)$: The rank of the true head entity when predicting h given (r, t).
  - The division by 2 averages the reciprocal ranks for the head and tail prediction tasks.
5.2.2.3. Hits@N
- Conceptual Definition: Hits@N (e.g., Hits@1, Hits@3, Hits@10) measures the proportion of correct entities that appear in the top N ranks of the predicted list. It indicates how often the correct answer is among the top few predictions. A higher Hits@N indicates better performance.
- Mathematical Formula: $ \text{Hits@N} = \frac{1}{|\mathbb{D}_{\text{test}}|} \sum_{(h,r,t) \in \mathbb{D}_{\text{test}}} \left( \mathbb{I}(\text{rank}(t \mid h,r) \le N) + \mathbb{I}(\text{rank}(h \mid r,t) \le N) \right) / 2 $
- Symbol Explanation:
  - $|\mathbb{D}_{\text{test}}|$: The number of triples in the test set.
  - $\mathbb{I}(\cdot)$: The indicator function, which is 1 if the condition inside is true, and 0 otherwise.
  - $\text{rank}(t \mid h,r)$: The rank of the true tail entity when predicting t given (h, r).
  - $\text{rank}(h \mid r,t)$: The rank of the true head entity when predicting h given (r, t).
  - $N$: The threshold for the top ranks (e.g., 1, 3, 10).
  - The division by 2 averages the Hits@N for the head and tail prediction tasks.
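All three metrics can be computed from the ranks produced during evaluation; a small self-contained sketch (pooling head and tail ranks into one list is a bookkeeping assumption, not part of the metric definitions):

```python
def link_prediction_metrics(ranks, ns=(1, 3, 10)):
    """Compute Mean Rank, Mean Reciprocal Rank, and Hits@N from a list of ranks
    of the correct entity over all test queries."""
    total = len(ranks)
    mr = sum(ranks) / total
    mrr = sum(1.0 / r for r in ranks) / total
    hits = {n: sum(1 for r in ranks if r <= n) / total for n in ns}
    return mr, mrr, hits

# Example: ranks of the true entity over five test queries
print(link_prediction_metrics([1, 4, 2, 15, 7]))
```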
5.3. Baselines
Since this paper reviews negative sampling methods, the "baselines" are typically other negative sampling approaches, often the simpler ones like uniform sampling or Bernoulli sampling. When a new negative sampling technique is proposed (e.g., in the original papers of KBGAN or MCNS), it's usually compared against these more basic methods, as well as against other advanced negative sampling methods if they exist at that time. For example, Probabilistic Negative Sampling (PNS) is evaluated against Bernoulli sampling. Self-adversarial sampling (e.g., RotatE) is compared against KBGAN.
The choice of baselines is representative because:
- Uniform and Bernoulli sampling are standard, widely used, and computationally cheap methods, providing a basic performance reference.
- Advanced methods (e.g., GAN-based) serve as strong baselines for newer, more sophisticated approaches, demonstrating progress in generating hard negatives.
- Comparing different negative sampling methods applied to the same KGE model allows researchers to isolate and quantify the impact of the sampling strategy itself on overall KGE performance.
6. Results & Analysis
As a review paper, this work does not present its own experimental results in the form of tables or figures. Instead, it synthesizes the findings and characteristics of various negative sampling approaches as reported in their original publications. The analysis focuses on how each method addresses the limitations of simpler approaches and its impact on KGE model performance and training efficiency.
6.1. Core Results Analysis
The paper's core analysis revolves around the strengths and weaknesses of each categorized negative sampling approach, drawing conclusions about their effectiveness and efficiency.
- Static Distribution-Based Sampling:
  - Uniform Sampling: While the earliest and easiest, it is critiqued for generating low-quality negatives (too easy to distinguish), leading to zero loss problems and false negatives. The paper implicitly states that this approach, though common, often contributes little to effective training.
  - Bernoulli Sampling: An improvement over uniform sampling, it alleviates false negatives by considering relation types (1-to-N, N-to-1). By adapting the corruption probability based on tph and hpt, it generates more plausible negatives. Zhang et al.'s extension further enhances relation link prediction by incorporating relation replacement.
  - Probabilistic Sampling (PNS): Shown to be beneficial for skewed data and sparse relations. Kanojia et al. [46] found that TransR-PNS achieved significant position gains (190 and 47 positions in Mean Rank on WN18 and FB15K respectively) compared to TransR with Bernoulli sampling, demonstrating its effectiveness in challenging data scenarios. This highlights the importance of specific sampling strategies for different data characteristics.
- Dynamic Distribution-Based Sampling:
  - GAN-based (KBGAN, IGAN): These methods represent a significant leap towards high-quality negatives by dynamically generating "hard" samples that challenge the KGE model. KBGAN demonstrates that combining different KGE models as generator-discriminator pairs can lead to better performance than baselines. IGAN further refines the generator using a neural network.
  - Challenges of GANs: The paper critically notes the potential risks associated with GANs: training instability and model collapse. It also points out the high computational cost and the need for reinforcement learning and pre-training, which are significant drawbacks.
  - Self-Adversarial Sampling (e.g., RotatE): Presented as a more efficient and stable alternative to GANs, avoiding reinforcement learning and often outperforming KBGAN in link prediction. This suggests that simpler, self-guided dynamic approaches can be more practical and effective.
  - MCNS: Positioned as a pioneering approach due to its theoretical derivation of an effective negative sampling distribution. Experiments show MCNS outperforms all baselines in downstream tasks and achieves high efficiency. This validates the importance of a theoretically sound distribution for negative sampling. Its generic nature makes it applicable beyond KGE.
- Custom Cluster-Based Sampling:
  - TransE-SNS: By sampling from semantically similar clusters (e.g., using K-Means), TransE-SNS generates valid negatives that are challenging for the model. Experiments showed it enhances the ability of TransE in link prediction and triple classification. This indicates that localizing the search space for negatives based on learned semantics can be highly effective.
  - NSCaching: Presented as an efficient and effective alternative, a "distilled version" of GANs that avoids GANs' drawbacks. By caching high-plausibility negatives and using importance sampling for updates, NSCaching achieves better performance than GAN-based approaches in terms of efficiency and effectiveness. This points towards the value of maintaining and sampling from a curated set of hard negatives.

In summary, the analysis highlights a clear progression from simple, static negative sampling (prone to zero loss and false negatives) to more sophisticated, dynamic, and context-aware methods. These advanced approaches aim to generate high-quality, hard negatives that force KGE models to learn more robust and fine-grained representations. While GAN-based methods offer theoretical advantages, their practical challenges have led to the exploration of more stable and efficient alternatives like self-adversarial sampling and NSCaching. The paper advocates for continued research in this area, emphasizing its direct impact on KGE model performance.
6.2. Data Presentation (Tables)
The paper does not contain any tables of experimental results from its own research. It synthesizes findings from other papers.
6.3. Ablation Studies / Parameter Analysis
As a review paper, it does not conduct its own ablation studies or parameter analysis. However, it implicitly discusses the impact of certain parameters or components from other works:
- The tuning parameter $\beta$ in self-adversarial sampling (RotatE) is mentioned as controlling the sampling distribution, indicating its importance in balancing the hardness of negatives.
- The number of negative samples per positive triple is briefly mentioned with a reference [23] suggesting that "fifty negative samples per positive is a good choice for balancing accuracy and duration," highlighting a critical hyper-parameter for training efficiency and effectiveness.

These discussions, while not explicit ablation studies within this paper, demonstrate the authors' awareness of how different components and parameters influence the performance of negative sampling techniques.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper provides a comprehensive review of negative sampling techniques within the field of knowledge graph embedding (KGE). It highlights that while KGE has seen significant advancements in model architectures, the crucial role of negative sampling, inspired by techniques from natural language processing like word2vec, has been largely overlooked or under-explored. The authors propose a novel categorization of existing negative sampling approaches into static distribution-based, dynamic distribution-based, and custom cluster-based methods. Through this structured analysis, the paper demonstrates how different strategies address challenges such as the zero loss problem, false negatives, and training instability, ultimately impacting the quality of learned knowledge representations and performance in downstream tasks like link prediction. The review emphasizes that sophisticated negative sampling is essential for training robust and effective KGE models.
7.2. Limitations & Future Work
The paper, being a review, does not present its own limitations or future work in the context of a specific proposed method. However, it implicitly points out limitations of existing negative sampling techniques and suggests areas for future research within the field:
- Limitations of Static Sampling: Uniform sampling and Bernoulli sampling are limited by their inability to adapt to the learning process, often generating low-quality or false negative samples.
- Limitations of Dynamic Sampling (GAN-based): While powerful, GAN-based methods suffer from training instability, model collapse, high computational costs, and the need for complex reinforcement learning.
- Limitations of Custom Cluster-Based Sampling: These methods rely on initial clustering or pre-computed criteria, which can add overhead and may struggle with the dynamic nature and rapid updates of real-world KGs, requiring continuous re-computation of clusters.
- Need for Further Exploration: The paper explicitly states that negative sampling is "under explored and needs more attention and efforts."
- Future Research Suggestions:
  - Comparative Studies: The authors hope their review will lead to subsequent work comparing the discussed approaches through KG downstream applications (like link prediction on benchmark datasets) to identify the shortages of current negative sampling methods.
  - Novel Strategy Development: This comparative analysis should pave the way for proposing new strategies for negative sampling after a full understanding of existing ones. This implies a need for methods that combine the benefits of dynamic and cluster-based approaches while overcoming their respective challenges in stability, efficiency, and adaptability.
  - Theoretical Foundations: The work by Yang et al. [22] on deriving a general form for an effective negative sampling distribution is highlighted as being of pioneering significance, suggesting that more theoretical work in this area could yield general, robust solutions.
7.3. Personal Insights & Critique
This paper serves as an excellent foundational resource for anyone delving into KGE, especially beyond the embedding model architectures themselves. It effectively underscores the "hidden" but critical role of negative sampling.
- Inspirations:
  - Holistic View: It inspires a more holistic view of KGE model development, emphasizing that innovations in negative sampling can be as impactful as, if not more so than, novel scoring functions. A robust KGE system requires strong components across the board, and negative sampling is clearly one such component.
  - Bridging Theory and Practice: The paper's discussion of MCNS highlights the value of theoretically derived sampling distributions. This suggests that future advancements might come from deeper mathematical understandings of optimal sampling rather than purely heuristic approaches.
  - Efficiency vs. Quality Trade-off: The analysis of GANs versus self-adversarial sampling and NSCaching provides a practical lesson in machine learning: sometimes simpler, more stable alternatives (even if conceptually less ambitious) can yield better real-world performance due to ease of training and efficiency.
  - Addressing Data Imperfections: The mention of confidence-aware negative sampling for noisy KGs is particularly insightful, demonstrating that negative sampling can also play a role in making KGE models more robust to real-world data imperfections, which are ubiquitous.
- Potential Issues / Unverified Assumptions / Areas for Improvement:
- Lack of Quantitative Synthesis: As a review, the paper focuses on qualitative descriptions. While it mentions some performance gains (e.g., for PNS), it lacks a direct comparative table that would visually summarize the performance differences (e.g., MRR, Hits@N) of various negative sampling methods across common KGE models and datasets. Such a table, even if compiled from existing literature, would greatly enhance the paper's utility for quickly comparing methods.
- Practical Guidance for Selection: While it categorizes and discusses, it could offer more explicit guidelines on when to choose a particular negative sampling method given specific KG characteristics (e.g., density, noisiness, relation types, computational resources).
- Interplay with KGE Models: The paper mentions that GAN-based frameworks are independent of the specific form of the discriminator. However, the performance of a negative sampling method can still be highly dependent on the underlying KGE model it's paired with. A deeper analysis of these interactions (e.g., which sampling methods work best with translational vs. semantic matching models) would be valuable.
- Definition of "Quality": The concept of "high-quality negatives" is central. While implied to mean "hard but true-false," a more formal or operational definition across different methods could strengthen the analysis.
- Transferability:
  - The insights into negative sampling are highly transferable to other domains involving embedding learning where discriminating positive from negative instances is crucial. This includes recommendation systems (e.g., item-item similarity learning), natural language processing (e.g., context-word prediction), and general graph representation learning beyond KGs.
  - The dynamic distribution-based strategies, especially MCNS (which is already noted as generic), and the custom cluster-based ideas could be adapted to any domain where the quality of generated "noise" significantly impacts model training and where semantic similarity or contextual relevance can be leveraged for better noise generation.