Self-supervised Graph Learning for Recommendation
TL;DR Summary
This paper introduces Self-supervised Graph Learning (SGL) to address two limitations of GCN-based recommender systems: popularity bias that hurts long-tail items and sensitivity to noisy interactions. SGL supplements supervised training with a contrastive self-supervised task over multiple augmented views of each node, improving long-tail recommendation accuracy and robustness to noise.
Abstract
Representation learning on user-item graph for recommendation has evolved from using single ID or interaction history to exploiting higher-order neighbors. This leads to the success of graph convolution networks (GCNs) for recommendation such as PinSage and LightGCN. Despite effectiveness, we argue that they suffer from two limitations: (1) high-degree nodes exert larger impact on the representation learning, deteriorating the recommendations of low-degree (long-tail) items; and (2) representations are vulnerable to noisy interactions, as the neighborhood aggregation scheme further enlarges the impact of observed edges. In this work, we explore self-supervised learning on user-item graph, so as to improve the accuracy and robustness of GCNs for recommendation. The idea is to supplement the classical supervised task of recommendation with an auxiliary self-supervised task, which reinforces node representation learning via self-discrimination. Specifically, we generate multiple views of a node, maximizing the agreement between different views of the same node compared to that of other nodes. We devise three operators to generate the views -- node dropout, edge dropout, and random walk -- that change the graph structure in different manners. We term this new learning paradigm as \textit{Self-supervised Graph Learning} (SGL), implementing it on the state-of-the-art model LightGCN. Through theoretical analyses, we find that SGL has the ability of automatically mining hard negatives. Empirical studies on three benchmark datasets demonstrate the effectiveness of SGL, which improves the recommendation accuracy, especially on long-tail items, and the robustness against interaction noises. Our implementations are available at \url{https://github.com/wujcan/SGL}.
In-depth Reading
1. Bibliographic Information
1.1. Title
Self-supervised Graph Learning for Recommendation
1.2. Authors
- Jiancan Wu (University of Science and Technology of China)
- Xiang Wang (National University of Singapore)
- Fuli Feng (National University of Singapore)
- Xiangnan He (University of Science and Technology of China)
- Liang Chen (Sun Yat-sen University)
- Jianxun Lian (Microsoft Research Asia)
- Xing Xie (Microsoft Research Asia)
1.3. Journal/Conference
SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
- Note: SIGIR is considered one of the most prestigious and influential academic conferences in the field of information retrieval and recommender systems.
1.4. Publication Year
2021
1.5. Abstract
This paper addresses two key limitations in Graph Convolutional Network (GCN)-based recommender systems: the bias towards high-degree (popular) items which hurts long-tail recommendation, and vulnerability to noisy interaction data. To solve these, the authors propose Self-supervised Graph Learning (SGL). This framework augments the standard supervised recommendation task with an auxiliary self-supervised task. By generating multiple "views" of nodes through graph structure augmentation (dropping nodes, dropping edges, or random walks) and maximizing the agreement between these views (contrastive learning), SGL improves both accuracy and robustness. The authors implement SGL on the state-of-the-art LightGCN model and demonstrate through theory and experiments that SGL facilitates automatic mining of "hard negative" examples, leading to better long-tail performance and noise resistance.
1.6. Original Source Link
- Link: https://arxiv.org/abs/2010.10783
- Status: Published in SIGIR '21.
2. Executive Summary
2.1. Background & Motivation
In the field of recommender systems, methods have evolved from simple matrix factorization to Graph Convolutional Networks (GCNs). GCNs model the data as a user-item bipartite graph, where users and items are nodes and interactions (clicks, purchases) are edges. GCNs like LightGCN are currently state-of-the-art because they effectively aggregate information from high-order neighbors (e.g., a user's friends' purchases).
However, the authors identify three critical limitations in current GCN-based models:
- Sparse Supervision Signal: The number of observed interactions is tiny compared to the total possible user-item pairs, making it hard to learn high-quality representations solely from supervised labels.
- Skewed Data Distribution (Long-tail Issue): Real-world data follows a power-law distribution. High-degree nodes (popular items and active users) dominate the learning process because they appear frequently during graph aggregation. This biases the model towards popular items and hurts the recommendation of low-degree (long-tail) items.
- Noise in Interactions: Implicit feedback (such as clicks) often contains false positives (e.g., accidental clicks). GCNs can amplify this noise because they aggregate information along edges; a noisy edge propagates wrong information to its neighbors.
The motivation is to introduce Self-Supervised Learning (SSL)—a technique successful in Computer Vision (CV) and NLP—into graph-based recommendation to extract extra signals from the unlabeled data structure itself, thereby mitigating these issues.
2.2. Main Contributions & Findings
- SGL Framework: The authors propose Self-supervised Graph Learning (SGL), a model-agnostic framework that supplements the main recommendation task with a self-supervised auxiliary task based on node self-discrimination.
- Graph Augmentation Operators: They devise three specific operators to generate different views of the graph structure: Node Dropout, Edge Dropout, and Random Walk.
- Theoretical Insight: The paper provides a theoretical analysis showing that the proposed self-supervised loss (InfoNCE) inherently performs hard negative mining. This means the model automatically focuses on distinguishing items that are very similar and hard to tell apart, which accelerates training and improves discrimination.
- Experimental Success: SGL significantly outperforms state-of-the-art baselines (like LightGCN) on three benchmark datasets. It shows particular improvement on long-tail items and demonstrates strong robustness against noisy data.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner needs to grasp the following concepts:
- User-Item Bipartite Graph: A graph structure where nodes are divided into two sets, users ($\mathcal{U}$) and items ($\mathcal{I}$). Edges exist only between a user and an item, representing an interaction (e.g., user $u$ bought item $i$).
- Graph Convolutional Network (GCN): A neural network designed for graphs. The core operation is message passing or neighborhood aggregation: a node updates its representation (embedding) by combining its own features with the features of its neighbors.
- LightGCN: A simplified GCN specifically for recommendation. It removes the non-linear activations and feature transformation matrices found in standard GCNs, keeping only neighbor aggregation. It is the "backbone" model SGL is built upon (see the propagation sketch after this list).
- Self-Supervised Learning (SSL): A learning paradigm where the model generates its own labels from the data. A common approach is Contrastive Learning.
- Pretext Task: The auxiliary task the model solves to learn features (e.g., predicting if two images are modified versions of the same original).
- Contrastive Learning: The goal is to learn an embedding space where similar sample pairs (positive pairs) are close together, and dissimilar pairs (negative pairs) are far apart.
- InfoNCE Loss: A specific loss function used to maximize the mutual information between positive pairs while minimizing it for negative pairs. It looks like a softmax classification loss.
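To make the neighborhood-aggregation idea concrete, here is a minimal NumPy sketch of LightGCN-style propagation. It assumes a pre-computed, symmetrically normalized adjacency matrix over the combined user/item node set; the names `norm_adj` and `ego_emb` are illustrative, not the authors' code.

```python
import numpy as np

def lightgcn_propagate(norm_adj: np.ndarray, ego_emb: np.ndarray, n_layers: int = 3) -> np.ndarray:
    """Sketch of LightGCN-style propagation: aggregate neighbors, then average all layers."""
    # norm_adj would be D^{-1/2} A D^{-1/2} over the (|U|+|I|) x (|U|+|I|) interaction graph.
    layer_embs = [ego_emb]
    emb = ego_emb
    for _ in range(n_layers):
        emb = norm_adj @ emb          # neighborhood aggregation only: no transform, no activation
        layer_embs.append(emb)
    return np.mean(layer_embs, axis=0)  # final embedding = mean over layers 0..L, as in LightGCN
```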
3.2. Previous Works
The authors position their work against several key technologies:
- GCN-based Recommendation:
- NGCF (Neural Graph Collaborative Filtering): Encoding high-order connectivity.
- LightGCN: State-of-the-art, simplified GCN.
- Gap: These models rely purely on supervised learning from sparse interactions and suffer from degree bias.
- SSL in CV/NLP:
- BERT: Masks tokens in text to learn context.
- SimCLR / MoCo: Uses data augmentation (cropping, coloring) on images and contrastive loss to learn visual representations.
- Gap: CV/NLP methods assume data samples are independent (i.e., one image doesn't depend on another). In graphs, nodes are inherently connected, so standard augmentations don't apply directly.
- SSL in Graph Learning:
- DGI (Deep Graph Infomax): Contrasts node embeddings with a global graph summary.
- Gap: These often focus on general graphs or node classification, not the specific bipartite structure and collaborative filtering needs of recommendation.
3.3. Differentiation Analysis
SGL differs from prior work in two main ways:
- Augmentation Design: Instead of manipulating node features (which are often just simple IDs in recommendation), SGL manipulates the graph structure (adjacency matrix) to create views.
- Multi-task Learning: SGL does not just pre-train; it optimizes the self-supervised task jointly with the main recommendation task, allowing them to mutually enhance each other.
4. Methodology
4.1. Principles
The core idea of SGL is Multi-Task Learning.
- Main Task (Supervised): Predict whether a user will interact with an item, trained on the known interactions.
- Auxiliary Task (Self-Supervised): "Self-Discrimination." The model creates two slightly different versions ("views") of the user-item graph. For a specific user node $u$, its representation in view 1 ($\mathbf{z}'_u$) and view 2 ($\mathbf{z}''_u$) should be very similar (positive pair), while $\mathbf{z}'_u$ should differ from the representations of all other nodes (negative pairs).

By forcing the model to recognize the "same" node despite changes in the graph structure, it learns robust features that are not over-reliant on specific noisy edges or on popularity bias.
Figure 1 in the original paper illustrates this framework. The top path is the standard LightGCN training. The bottom path creates augmented views for contrastive learning.
The following figure (Figure 1 from the original paper) shows the system architecture:
This figure is an overview schematic of the Self-supervised Graph Learning (SGL) framework. The upper path shows the workflow of the main supervised learning task, while the lower path shows the workflow of the self-supervised task built on structure-changed views of the graph.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Step 1: Data Augmentation on Graph Structure
The input is the user-item graph $\mathcal{G}$. The goal is to generate two augmented subgraphs (view 1 and view 2). Since recommendation data usually lacks rich features (only IDs), the authors augment the structure (the adjacency matrix).
They propose three operators. Let $\mathbf{Z}^{(l)}$ be the node representations at layer $l$. The general augmentation is defined as: $ \mathbf { Z } _ { 1 } ^ { ( l ) } = H ( \mathbf { Z } _ { 1 } ^ { ( l - 1 ) } , s _ { 1 } ( \mathcal { G } ) ) , \quad \mathbf { Z } _ { 2 } ^ { ( l ) } = H ( \mathbf { Z } _ { 2 } ^ { ( l - 1 ) } , s _ { 2 } ( \mathcal { G } ) ) $ Here, $H$ is the graph convolution function (e.g., LightGCN's propagation), and $s_1(\cdot)$, $s_2(\cdot)$ are stochastic functions that modify the graph $\mathcal{G}$.
Operator 1: Node Dropout (ND). This operator randomly discards nodes with probability $\rho$. When a node is dropped, all edges connected to it are removed. The masking vectors $\mathbf{M}', \mathbf{M}'' \in \{0, 1\}^{|\mathcal{V}|}$ determine which nodes are kept: $ s _ { 1 } ( \boldsymbol { \mathcal { G } } ) = ( \mathbf { M } ^ { \prime } \odot \boldsymbol { \mathcal { V } } , \boldsymbol { \mathcal { E } } ) , \quad s _ { 2 } ( \boldsymbol { \mathcal { G } } ) = ( \mathbf { M } ^ { \prime \prime } \odot \boldsymbol { \mathcal { V } } , \boldsymbol { \mathcal { E } } ) $
- Intuition: This simulates missing data and forces the model to recognize a user even if some historical behaviors (nodes) are missing.
Operator 2: Edge Dropout (ED). This operator focuses on the connections: it randomly removes edges with probability $\rho$. The masking vectors $\mathbf{M}_1, \mathbf{M}_2 \in \{0, 1\}^{|\mathcal{E}|}$ apply to the edge set: $ s _ { 1 } ( \boldsymbol { \mathcal { G } } ) = ( \boldsymbol { \mathcal { V } } , \mathbf { M } _ { 1 } \odot \boldsymbol { \mathcal { E } } ) , \quad s _ { 2 } ( \boldsymbol { \mathcal { G } } ) = ( \boldsymbol { \mathcal { V } } , \mathbf { M } _ { 2 } \odot \boldsymbol { \mathcal { E } } ) $
- Intuition: This turns out to be the most effective operator empirically. It captures useful local structure patterns and makes the model more robust to noisy interactions (since noisy edges may be dropped).
Operator 3: Random Walk (RW). In ND and ED, the sampled subgraph is shared by all layers within one epoch. In RW, a different subgraph is generated for each layer of the GCN. Using the edge-dropout logic but with a layer-specific mask: $ s _ { 1 } ( \boldsymbol { \mathcal { G } } ) = ( \boldsymbol { \mathcal { V } } , \mathbf { M } _ { 1 } ^ { ( l ) } \odot \boldsymbol { \mathcal { E } } ) , \quad s _ { 2 } ( \boldsymbol { \mathcal { G } } ) = ( \boldsymbol { \mathcal { V } } , \mathbf { M } _ { 2 } ^ { ( l ) } \odot \boldsymbol { \mathcal { E } } ) $
- Intuition: This allows different information-flow paths at different layers, effectively constructing an individual subgraph for each node, analogous to a random walk.
The following figure (Figure 2 from the original paper) illustrates the difference between Edge Dropout (static structure across layers) and Random Walk (dynamic structure across layers):
This figure is a schematic of high-order connectivity in a three-layer GCN, with Edge Dropout on the left and Random Walk on the right. Under Random Walk, the graph structure keeps changing across layers, so a third-order path can exist between a user node and an item node that does not exist under Edge Dropout.
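As a concrete illustration of these structural augmentations, below is a minimal sketch of Edge Dropout on an interaction list stored as (user, item) pairs; the array layout and the toy `edges` data are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def edge_dropout(edges: np.ndarray, drop_ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Keep each (user, item) interaction independently with probability 1 - drop_ratio."""
    keep = rng.random(len(edges)) >= drop_ratio
    return edges[keep]

rng = np.random.default_rng(0)
edges = np.array([[0, 0], [0, 2], [1, 1], [2, 0], [2, 2]])  # toy (user, item) pairs
# Two independent masks yield the two structural views s1(G) and s2(G).
view1_edges = edge_dropout(edges, drop_ratio=0.2, rng=rng)
view2_edges = edge_dropout(edges, drop_ratio=0.2, rng=rng)
# Node Dropout would instead sample a node mask and drop every edge touching a dropped node.
# Random Walk would redraw an independent edge mask at every GCN layer.
```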
4.2.2. Step 2: Contrastive Learning (The Auxiliary Task)
After augmentation, we run the LightGCN encoder on the two views to get node embeddings. For a specific user $u$, we obtain two vectors: $\mathbf{z}'_u$ (from view 1) and $\mathbf{z}''_u$ (from view 2).
- Positive Pair: $(\mathbf{z}'_u, \mathbf{z}''_u)$. These represent the same user, so they should be similar.
- Negative Pairs: $(\mathbf{z}'_u, \mathbf{z}''_v)$ for all $v \neq u$. These are different users, so they should be dissimilar.
The authors adopt the InfoNCE loss to maximize agreement between positives and minimize it for negatives: $ \mathcal { L } _ { s s l } ^ { u s e r } = \sum _ { u \in \mathcal { U } } - \log \frac { \exp ( s ( \mathbf { z } _ { u } ^ { \prime } , \mathbf { z } _ { u } ^ { \prime \prime } ) / \tau ) } { \sum _ { v \in \mathcal { U } } \exp ( s ( \mathbf { z } _ { u } ^ { \prime } , \mathbf { z } _ { v } ^ { \prime \prime } ) / \tau ) } $
- $s(\cdot, \cdot)$: the cosine similarity function.
- $\tau$: a hyper-parameter called the temperature. It is crucial for mining hard negatives (explained in Section 4.3).
- The denominator sums over all users in the batch (or the full dataset) as negative samples.
The total SSL loss combines user and item losses: $ \mathcal { L } _ { s s l } = \mathcal { L } _ { s s l } ^ { u s e r } + \mathcal { L } _ { s s l } ^ { i t e m } $
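A minimal NumPy sketch of the InfoNCE objective above, assuming the two views' embeddings are stacked row-aligned so that row $u$ of `z1` and `z2` correspond to the same node, with the other in-batch rows acting as negatives. This is illustrative code, not the authors' released implementation.

```python
import numpy as np

def info_nce(z1: np.ndarray, z2: np.ndarray, tau: float = 0.2) -> float:
    """InfoNCE over one node type: row u of z1 and z2 are the two views of node u."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # L2-normalize so the dot product
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)  # equals cosine similarity
    logits = z1 @ z2.T / tau          # [N, N] pairwise similarities, scaled by temperature
    pos = np.diag(logits)             # positives sit on the diagonal (same node, two views)
    loss = -pos + np.log(np.exp(logits).sum(axis=1))  # -log softmax of the positive entry
    return float(loss.mean())

# Total auxiliary loss mirrors L_ssl above: user-side term + item-side term, e.g.
# ssl_loss = info_nce(z_user_v1, z_user_v2) + info_nce(z_item_v1, z_item_v2)
```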
4.2.3. Step 3: Joint Optimization
The final objective function combines the main supervised task (using BPR Loss, which creates pairwise rankings for observed interactions) and the SSL task: $ \mathcal { L } = \mathcal { L } _ { m a i n } + \lambda _ { 1 } \mathcal { L } _ { s s l } + \lambda _ { 2 } | \Theta | _ { 2 } ^ { 2 } $
- $\mathcal{L}_{main}$: the Bayesian Personalized Ranking (BPR) loss.
- $\lambda_1$: controls the weight of the self-supervised task.
- $\lambda_2$: the regularization weight.
- $\Theta$: the model parameters.
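A hedged sketch of how the joint objective could be assembled, with BPR as the main loss and the `info_nce` function from the previous step; the specific $\lambda$ values shown are placeholders, not the tuned settings from the paper.

```python
import numpy as np

def bpr_loss(pos_scores: np.ndarray, neg_scores: np.ndarray) -> float:
    """Pairwise BPR: an observed item should score higher than a sampled unobserved one."""
    # -log sigmoid(pos - neg), written with logaddexp for numerical stability
    return float(np.mean(np.logaddexp(0.0, -(pos_scores - neg_scores))))

def joint_loss(pos_scores, neg_scores, ssl_loss, params, lambda_1=0.1, lambda_2=1e-4):
    """L = L_main + lambda_1 * L_ssl + lambda_2 * ||Theta||_2^2 (placeholder lambda values)."""
    reg = sum(float(np.sum(p ** 2)) for p in params)
    return bpr_loss(pos_scores, neg_scores) + lambda_1 * ssl_loss + lambda_2 * reg
```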
4.3. Theoretical Analysis: Hard Negative Mining
The authors provide a gradient analysis to explain why SGL works, focusing on the role of the temperature parameter $\tau$.
They derive the gradient of the SSL loss with respect to the anchor representation $\mathbf{z}'_u$. The gradient contribution from a negative node $v$ is proportional to a term $g(x)$, where $x = s(\mathbf{z}'_u, \mathbf{z}''_v)$ is the similarity between the anchor node and the negative node:
$ g ( x ) = \sqrt { 1 - x ^ { 2 } } \exp \left( \frac { x } { \tau } \right) $
- Hard Negatives: nodes whose representations are very similar to the anchor (high $x$, e.g., $x$ close to 1). These are "hard" to distinguish.
- Easy Negatives: nodes that are dissimilar to the anchor (low or negative $x$).
Key Insight:
The paper plots this function $g(x)$ for different values of $\tau$.
- If $\tau$ is large (e.g., 1), $g(x)$ is nearly flat: hard and easy negatives contribute almost equally to the gradient.
- If $\tau$ is small (e.g., 0.1), $g(x)$ explodes for high $x$ (hard negatives) and vanishes for low $x$.

This proves that by setting a small $\tau$, the SGL loss automatically amplifies the gradient signal from hard negatives, forcing the model to work harder to distinguish highly similar nodes and yielding more discriminative embeddings.
The following figure (Figure 3 from the original paper) visualizes the function $g(x)$ and how $\tau$ affects the gradient magnitude for hard vs. easy negatives:
This figure plots the function $g(x)$ and related quantities. With a large $\tau$ the curve of $g(x)$ is nearly flat (panel a), whereas with a small $\tau$ it varies sharply (panel b). Panels c and d show how the location of the maximum and its (log) value change with $\tau$. These relationships help explain the behavior of self-supervised graph learning.
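The hard-negative effect can be checked numerically in a few lines: evaluating $g(x)$ at some representative similarity values shows how a small $\tau$ concentrates the gradient on the hardest negative. The chosen $x$ values here are illustrative only.

```python
import numpy as np

def g(x: np.ndarray, tau: float) -> np.ndarray:
    """Gradient-magnitude proxy from the paper: g(x) = sqrt(1 - x^2) * exp(x / tau)."""
    return np.sqrt(1.0 - x ** 2) * np.exp(x / tau)

x = np.array([-0.5, 0.0, 0.5, 0.9])   # similarity of a negative node to the anchor
for tau in (1.0, 0.1):
    contrib = g(x, tau)
    print(tau, np.round(contrib / contrib.max(), 4))
# tau = 1.0: all four negatives contribute on the same order of magnitude.
# tau = 0.1: the hardest negative (x = 0.9) dominates the others by several orders of magnitude.
```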
5. Experimental Setup
5.1. Datasets
The authors use three benchmark datasets.
- Yelp2018: Local business reviews.
- Amazon-Book: A large-scale book recommendation dataset.
- Alibaba-iFashion: A fashion outfit dataset. This dataset is notably sparse (very few interactions per user/item).
Statistics:
- Yelp2018: 31k users, 38k items, 1.5M interactions. Density: 0.0013.
- Amazon-Book: 52k users, 91k items, 2.9M interactions. Density: 0.0006.
- Alibaba-iFashion: 300k users, 81k items, 1.6M interactions. Density: 0.00007.
5.2. Evaluation Metrics
The paper uses two standard metrics for Top-K recommendation (with $K = 20$).
- Recall@K
  - Definition: Measures coverage. Out of all the items a user actually liked (in the test set), what fraction did the model recommend in the top K?
  - Formula: $ \text{Recall@K} = \frac{|\mathcal{R}_u \cap \mathcal{T}_u|}{|\mathcal{T}_u|} $
  - Symbols: $\mathcal{R}_u$ is the set of top-K recommended items for user $u$; $\mathcal{T}_u$ is the set of ground-truth items user $u$ interacted with in the test set.
- NDCG@K (Normalized Discounted Cumulative Gain)
  - Definition: Measures ranking quality. It gives higher scores when the correct items are ranked higher in the list (position matters).
  - Formula: $ \text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}, \quad \text{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)} $
  - Symbols: $rel_i$ is the relevance of the item at position $i$ (1 if interacted with, 0 otherwise); IDCG@K is the DCG@K of the ideal (perfect) ranking.
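For reference, a minimal per-user implementation of both metrics with binary relevance; `ranked_items` and `ground_truth` are hypothetical inputs (a ranked recommendation list and the user's test items).

```python
import numpy as np

def recall_at_k(ranked_items, ground_truth, k=20):
    """Fraction of a user's test items that appear in the top-k recommendation list."""
    hits = len(set(ranked_items[:k]) & set(ground_truth))
    return hits / len(ground_truth)

def ndcg_at_k(ranked_items, ground_truth, k=20):
    """Binary-relevance NDCG: position-discounted hits, normalized by the ideal ranking."""
    gt = set(ground_truth)
    dcg = sum(1.0 / np.log2(i + 2) for i, item in enumerate(ranked_items[:k]) if item in gt)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(gt), k)))
    return dcg / idcg

# Hypothetical user: test items {3, 7, 9}; the model ranks [3, 5, 8, 7] in its top 4.
print(recall_at_k([3, 5, 8, 7], [3, 7, 9], k=4))  # 2/3
print(ndcg_at_k([3, 5, 8, 7], [3, 7, 9], k=4))    # ~0.67
```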
5.3. Baselines
SGL is compared against:
- NGCF: An early GCN-based collaborative filtering model that encodes high-order connectivity via feature-interaction message passing.
- LightGCN: The backbone of SGL, representing state-of-the-art pure GCN.
- Mult-VAE: A variational auto-encoder based collaborative filtering method (strong non-GCN baseline).
- DNN+SSL: A state-of-the-art method applying SSL to standard Deep Neural Networks (using feature masking/dropout) rather than graphs.
6. Results & Analysis
6.1. Core Results Analysis
The experiments assess effectiveness (RQ1) across all datasets. The paper tests three variants of SGL: SGL-ND (Node Dropout), SGL-ED (Edge Dropout), and SGL-RW (Random Walk).
Key Findings:
- Superiority: All SGL variants generally outperform the baseline LightGCN, showing that the auxiliary self-supervised task provides valuable extra signal.
- Best Variant: SGL-ED (Edge Dropout) is consistently the strongest performer, suggesting that capturing local structure patterns by masking edges is the most effective augmentation for recommendation graphs.
- Robustness to Sparsity: The improvement is most significant on Amazon-Book and Alibaba-iFashion, which are sparser than Yelp2018, confirming that SGL helps most when supervised data is scarce.
The following are the results from Table 3 of the original paper. This table compares LightGCN with the three SGL variants across different layer depths.
| #Layer | Method | Yelp2018 Recall | Yelp2018 NDCG | Amazon-Book Recall | Amazon-Book NDCG | Alibaba-iFashion Recall | Alibaba-iFashion NDCG |
|---|---|---|---|---|---|---|---|
| 1 Layer | LightGCN | 0.0631 | 0.0515 | 0.0384 | 0.0298 | 0.0990 | 0.0454 |
| | SGL-ND | 0.0643 | 0.0529 | 0.0432 | 0.0334 | 0.1133 | 0.0539 |
| | SGL-ED | 0.0637 | 0.0526 | 0.0451 | 0.0353 | 0.1125 | 0.0536 |
| | SGL-RW | 0.0637 | 0.0526 | 0.0451 | 0.0353 | 0.1125 | 0.0536 |
| 2 Layers | LightGCN | 0.0622 | 0.0504 | 0.0411 | 0.0315 | 0.1066 | 0.0505 |
| | SGL-ND | 0.0658 | 0.0538 | 0.0427 | 0.0335 | 0.1106 | 0.0526 |
| | SGL-ED | 0.0668 | 0.0549 | 0.0468 | 0.0371 | 0.1091 | 0.0520 |
| | SGL-RW | 0.0644 | 0.0530 | 0.0453 | 0.0358 | 0.1091 | 0.0521 |
| 3 Layers | LightGCN | 0.0639 | 0.0525 | 0.0410 | 0.0318 | 0.1078 | 0.0507 |
| | SGL-ND | 0.0644 | 0.0528 | 0.0440 | 0.0346 | 0.1126 | 0.0536 |
| | SGL-ED | **0.0675** | **0.0555** | **0.0478** | **0.0379** | 0.1126 | 0.0538 |
| | SGL-RW | 0.0667 | 0.0547 | 0.0457 | 0.0356 | **0.1139** | **0.0539** |
6.2. Long-tail Recommendation Analysis
The authors group items by popularity (degree) into 10 groups. Group 10 contains the most popular items.
- Observation: In LightGCN, the contribution to total Recall is dominated by Group 10 (the most popular items).
- SGL Result: SGL reduces the dominance of Group 10 and significantly improves recall on the lower groups (long-tail items).
- Conclusion: SGL effectively mitigates the popularity bias inherent in GCNs.
The following figure (Figure 4 from the original paper) shows the performance decomposition across popularity groups, highlighting SGL's gain in the long-tail:
This figure shows Recall decomposed over item-popularity groups (GroupID) for SGL variants and LightGCN on the three datasets (Yelp2018, Amazon-Book, and Alibaba-iFashion). SGL-ED performs notably better on the long-tail items.
6.3. Efficiency and Robustness
Training Efficiency: SGL converges much faster than LightGCN.
- On Yelp2018, SGL reaches peak performance around epoch 20, whereas LightGCN requires hundreds of epochs.
- Why? Hard negative mining (the theoretical insight above) provides stronger gradients, accelerating optimization.
The following figure (Figure 5 from the original paper) illustrates the faster convergence of SGL compared to LightGCN:

Robustness to Noise: The authors deliberately added fake edges (noise) to the training set (ratios of 0.05 to 0.2).
- Result: While both models degrade, SGL degrades much more slowly than LightGCN.
- Significance: The augmented views help the model identify structural invariance: core patterns that persist even when some random noisy edges are added or dropped.
The following figure (Figure 6 from the original paper) shows the performance degradation curves as noise increases:
This figure shows model performance under different noise ratios: Yelp2018 on the left and Amazon-Book on the right. Bars denote Recall and lines denote the percentage of performance degradation.
6.4. Parameter Analysis (Temperature $\tau$)
- Impact: The temperature $\tau$ controls the strength of hard negative mining.
- Finding:
- Too large a $\tau$ (e.g., 1.0) hurts performance: the gradients become too uniform, with no focus on hard negatives.
- Too small a $\tau$ (e.g., 0.1) also hurts performance: the gradients become unstable and dominated by a few outliers.
- Optimal: an intermediate value (e.g., around 0.2) works best.
The following figure (Figure 7 from the original paper) shows the sensitivity of performance to different values:
This figure shows how Recall on Yelp2018 (left) and Amazon-Book (right) changes over training epochs for different values of $\tau$. Curves in different colors correspond to different $\tau$ values, reflecting the effect of tuning the temperature.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces Self-Supervised Learning to Graph Neural Networks for recommendation. By proposing SGL, the authors show that creating auxiliary tasks based on graph structure augmentation (Edge Dropout being the most effective) can solve critical issues in RecSys: data sparsity, popularity bias, and noise. The theoretical link established between the InfoNCE loss temperature parameter and hard negative mining provides a solid justification for why the method works so well and converges so quickly.
7.2. Limitations & Future Work
- Limitations: The authors note that the current augmentation operators are stochastic (random). Randomly dropping edges might inadvertently break crucial structural information (e.g., a user's only link to a specific category).
- Future Work: They suggest exploring:
- Counterfactual Learning: To identify influential data points rather than dropping randomly.
- Pre-training: Developing a model that captures universal user patterns transferable across domains.
7.3. Personal Insights & Critique
- Innovation: The most impressive part is not just applying SSL, but the theoretical derivation connecting $\tau$ to gradient magnitude. This moves the paper beyond "empirical alchemy" toward a fundamental understanding of the mechanics of contrastive learning.
- Simplicity: The method is elegant because it requires no extra model parameters (just augmentations and a loss function).
- Applicability: SGL is model-agnostic. While implemented on LightGCN, the concept of "Edge Dropout + Contrastive Loss" can be easily grafted onto almost any graph-based recommendation model (e.g., GraphSAGE, GAT).
- Critique: While efficient in epochs, the time complexity per epoch increases because the graph must be augmented (rebuilt) and propagated twice more (for the two views). The paper claims the overall training time is acceptable due to faster convergence, but for massive industrial graphs, generating augmented adjacency matrices on the fly could still be an engineering bottleneck.