Disentangled Self-Supervision in Sequential Recommenders

Published: 08/20/2020

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces a latent self-supervised and disentangled sequence-to-sequence training strategy to address myopic predictions and lack of diversity in traditional sequential recommenders, showing significant performance improvements on real and synthetic datasets.

Abstract

To learn a sequential recommender, the existing methods typically adopt the sequence-to-item (seq2item) training strategy, which supervises a sequence model with a user’s next behavior as the label and the user’s past behaviors as the input. The seq2item strategy, however, is myopic and usually produces non-diverse recommendation lists. In this paper, we study the problem of mining extra signals for supervision by looking at the longer-term future. There exist two challenges: i) reconstructing a future sequence containing many behaviors is exponentially harder than reconstructing a single next behavior, which can lead to difficulty in convergence, and ii) the sequence of all future behaviors can involve many intentions, not all of which may be predictable from the sequence of earlier behaviors. To address these challenges, we propose a sequence-to-sequence (seq2seq) training strategy based on latent self-supervision and disentanglement. Specifically, we perform self-supervision in the latent space, i.e., reconstructing the representation of the future sequence as a whole, instead of reconstructing the items in the future sequence individually. We also disentangle the intentions behind any given sequence of behaviors and construct seq2seq training samples using only pairs of sub-sequences that involve a shared intention. Results on real-world benchmarks and synthetic data demonstrate the improvement brought by seq2seq training.

1. Bibliographic Information

1.1. Title

Disentangled Self-Supervision in Sequential Recommenders

1.2. Authors

Jianxin Ma (Tsinghua University, Beijing, China; Alibaba Group, China)
Chang Zhou (Alibaba Group, China)
Hongxia Yang (Alibaba Group, China)
Peng Cui (Tsinghua University, Beijing, China)
Xin Wang (Tsinghua University, Beijing, China; Key Laboratory of Pervasive Computing, Ministry of Education, China)
Wenwu Zhu (Tsinghua University, Beijing, China; Key Laboratory of Pervasive Computing, Ministry of Education, China)

1.3. Journal/Conference

Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '20), August 23–27, 2020, Virtual Event, CA, USA. KDD (Knowledge Discovery and Data Mining) is one of the premier conferences in the fields of data mining, data science, and big data. It is highly reputable and influential, attracting top-tier research and applications globally.

1.4. Publication Year

2020

1.5. Abstract

This paper addresses the limitations of the conventional sequence-to-item (seq2item) training strategy in sequential recommenders, which is often myopic and results in non-diverse recommendations. The authors propose to leverage longer-term future signals for supervision by introducing a novel sequence-to-sequence (seq2seq) training strategy. The core challenges of reconstructing a multi-behavior future sequence (difficulty in convergence) and handling multiple user intentions within that sequence (low signal-to-noise ratio) are tackled through two main ideas: i) latent self-supervision, where the model reconstructs the representation of the entire future sequence in a latent space instead of individual items, and ii) intention disentanglement, which identifies and utilizes only sub-sequences sharing a common intention for seq2seq training. Experimental results on real-world and synthetic datasets demonstrate that this seq2seq training strategy, when complementing seq2item training, leads to improved recommendation performance.

1.6. Original Source Link

/files/papers/6950ed9888e29060a51c8504/paper.pdf (a PDF copy of the paper; according to the bibliographic information above, the paper was officially published at KDD '20).

2. Executive Summary

2.1. Background & Motivation

The central problem addressed by this paper lies within sequential recommender systems. These systems aim to predict a user's next action (e.g., clicking an item) based on their historical sequence of interactions. The existing, standard training paradigm, known as sequence-to-item (seq2item) training, involves feeding a model a sequence of past behaviors and supervising it to predict only the immediately following single item.

This seq2item strategy, however, suffers from two critical limitations:

Myopia and Lack of Diversity: By focusing solely on the next immediate item, the model becomes short-sighted. It tends to over-emphasize recent behavior and may fall into recommending highly similar items consecutively, leading to non-diverse recommendation lists. For instance, if a user clicks five shirts and then one pair of trousers, the seq2item approach heavily reinforces "shirt" as the next recommendation, even if a broader recommendation of both shirts and trousers would be more desirable for a diverse top-k list.
Vulnerability to Irrelevant Immediate Behaviors: Users often have diverse and constantly evolving intentions. The very next behavior in a sequence might be an outlier, driven by momentary curiosity or an unrelated impulse, and thus irrelevant to the preceding sequence of behaviors. Training solely on such potentially noisy next-item labels can lead to a low signal-to-noise ratio in supervision, making the model less robust.

To overcome these challenges, the paper seeks to mine extra signals for supervision by looking at the longer-term future of user interactions, rather than just the immediate next item. However, this introduces new significant challenges:
Difficulty in Reconstructing Future Sequences: Reconstructing an entire sequence of many future behaviors is exponentially harder than predicting a single item. This complexity can hinder model convergence and efficiency, especially if individual items in the future sequence are reconstructed one by one.
Multiple User Intentions in Future Sequences: A longer future sequence might encapsulate several evolving user intentions. Not all of these future intentions might be predictable or relevant to the earlier input sequence. Without a mechanism to discern relevant intentions, the signal-to-noise ratio would again be low.

The paper's innovative idea is to propose a sequence-to-sequence (seq2seq) training strategy to complement the standard seq2item approach. This seq2seq strategy is designed to address the aforementioned challenges by:

Performing self-supervision in the latent space to ease the reconstruction task.
Employing intention disentanglement to filter out irrelevant future intentions and construct more meaningful training samples.

2.2. Main Contributions / Findings

The paper makes several key contributions to the field of sequential recommender systems:

Novel seq2seq Training Strategy: The authors propose a new seq2seq training strategy that extracts additional supervision signals from a user's longer-term future interactions, moving beyond the traditional focus on only the next immediate behavior. This strategy runs in parallel to and complements the standard seq2item training.
Latent Self-Supervision for Convergence: To mitigate the difficulty of reconstructing entire future sequences, the paper introduces self-supervision in the latent space. Instead of reconstructing individual items, the model learns to predict a compact representation (a single vector) of the entire future sequence. This "distilled pseudo behavior" summarizes the main intention of the future sequence, simplifying the learning task and boosting convergence.
Intention Disentanglement for Sample Selection: A key innovation is the design of a sequence encoder that can infer and disentangle multiple latent intentions reflected by a sequence of behaviors. The encoder produces multiple representations for a given sequence, each corresponding to a distinct latent category. This disentanglement is then crucial for constructing seq2seq training samples: only pairs of sub-sequences (an earlier input sequence and a future target sequence) that share a common intention (i.e., belong to the same latent category) are used for training. This mechanism improves the signal-to-noise ratio by filtering out irrelevant future intentions.
Empirical Demonstration of Efficacy: Through extensive experiments on real-world benchmark datasets (Amazon Beauty, Steam, MovieLens-1M, MovieLens-20M) and synthetic data corrupted with noise, the paper empirically demonstrates that the proposed seq2seq training strategy consistently improves recommendation performance across various metrics (Recall, NDCG, MRR). The results also show enhanced robustness to noise in training data.

3.1. Foundational Concepts

Sequential Recommender Systems: These systems aim to predict what a user will interact with next, based on their ordered history of past interactions (e.g., clicks, purchases, views). Unlike traditional recommender systems that might only consider overall preferences, sequential recommenders explicitly model the temporal dynamics and order of user behavior.
Deep Learning (Deep Sequential Models): A subfield of machine learning that uses artificial neural networks with multiple layers (deep neural networks) to learn representations of data with multiple levels of abstraction. In the context of sequential recommenders, deep sequential models like Recurrent Neural Networks (RNNs) and Transformer networks are used to capture complex patterns and dependencies within user behavior sequences.
- Recurrent Neural Networks (RNNs): A class of neural networks designed to process sequential data. They have internal memory that allows them to use information from previous steps in the sequence to influence the processing of the current step. GRU4Rec is a prominent example of an RNN-based recommender.
- Self-Attention Networks (Transformers): A neural network architecture that relies heavily on the attention mechanism to weigh the importance of different parts of the input sequence when processing each element. Transformers are highly effective for sequential tasks, overcoming some limitations of RNNs regarding long-range dependencies and parallelization. SASRec and BERT4Rec are examples of Transformer-based recommenders.
Self-Supervised Learning (SSL): A paradigm in machine learning where a model learns representations from unlabeled data by solving a "pretext task." The pretext task is designed such that solving it requires the model to understand certain aspects of the data, thereby learning useful features or representations without explicit human labels. For example, predicting a masked word in a sentence (like BERT's Cloze task) or predicting future frames in a video from past frames are pretext tasks. The learned representations can then be fine-tuned for downstream tasks like recommendation.
Disentangled Representation Learning: This aims to learn representations where different, independent explanatory factors of variation in the data are captured by distinct, independent dimensions or parts of the learned representation. For example, in an image of a face, disentangled representations might separately encode factors like "gender," "age," and "expression." In recommender systems, this means learning separate representations for a user's distinct intentions (e.g., "shopping for work clothes" vs. "browsing for leisure items").
Contrastive Learning: A method used in self-supervised learning where the model learns by contrasting positive pairs (similar data points) with negative pairs (dissimilar data points). The goal is to make the representations of positive pairs closer in the latent space and those of negative pairs farther apart. The softmax function is often used in contrastive losses to normalize scores and convert them into probabilities.
Softmax Function: A function that takes a vector of arbitrary real-valued scores and transforms them into a probability distribution. Each element of the output vector is a probability value between 0 and 1, and all elements sum up to 1. It is commonly used in the output layer of multi-class classification neural networks. $ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} $ where $z_i$ is the $i$ -th element of the input vector, and $C$ is the total number of classes.
Layer Normalization: A technique used in neural networks to normalize the inputs across the features of an individual sample within a layer. Unlike Batch Normalization, which normalizes across the batch dimension, Layer Normalization normalizes across the feature dimension, making it suitable for recurrent networks and Transformers where batch sizes might vary or sequence lengths differ. It helps stabilize training and speed up convergence. $ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta $ where $\mu$ is the mean of the input features, $\sigma$ is the standard deviation, $\gamma$ and $\beta$ are learnable scale and shift parameters, and $\odot$ denotes element-wise multiplication.
Dot Product and Cosine Similarity: Both are measures of similarity between two vectors.
- Dot Product: $\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^n a_i b_i$ . A larger dot product typically indicates greater similarity, especially if vector magnitudes are similar.
- Cosine Similarity: $\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$ . This measures the cosine of the angle between two vectors, ranging from -1 (opposite) to 1 (identical direction), regardless of their magnitudes. It's often preferred when magnitude differences should not influence similarity.
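The softmax, layer normalization, and similarity formulas above can be checked numerically with a minimal pure-Python sketch. The function names are illustrative, and the small `eps` term inside `layer_norm` is an added assumption for numerical stability that does not appear in the formula as written.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)  # subtracting the max does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one sample across its features, then scale and shift."""
    d = len(x)
    gamma = gamma or [1.0] * d  # learnable scale, identity here
    beta = beta or [0.0] * d    # learnable shift, zero here
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    sigma = math.sqrt(var + eps)  # eps: assumption, avoids division by zero
    return [g * (v - mu) / sigma + b for v, g, b in zip(x, gamma, beta)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

probs = softmax([2.0, 1.0, 0.1])
normed = layer_norm([1.0, 2.0, 3.0, 4.0])
a, b = [1.0, 2.0], [2.0, 4.0]  # same direction, different magnitude
print(probs, normed, dot(a, b), cosine_similarity(a, b))
```

The vectors `a` and `b` point in the same direction, so their cosine similarity is 1 (up to rounding) even though the dot product grows with their magnitudes.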

3.2. Previous Works

The paper contextualizes its contributions by discussing several categories of prior research:

Early Recommender Systems (Collaborative Filtering & Matrix Factorization):
- Collaborative Filtering (CF): Techniques that make recommendations by collecting preferences from many users and finding users with similar tastes (user-based CF) or items that are frequently interacted with together (item-based CF). Examples include BPR-MF (Bayesian Personalized Ranking - Matrix Factorization).
- Matrix Factorization (MF): A popular technique within CF that decomposes the user-item interaction matrix into two lower-dimensional matrices: a user latent factor matrix and an item latent factor matrix. The dot product of a user's latent vector and an item's latent vector estimates the user's preference for that item.
- Neural Collaborative Filtering (NCF): Extends MF by replacing the simple dot product with a neural network to model user-item interactions, allowing for more complex non-linear relationships.
- Relevance to current paper: These methods often neglect the sequential nature of user interactions, treating them as independent events.
Markov Chains for Sequential Recommendation:
- These models explicitly consider the order of interactions, assuming that a user's next action primarily depends on their immediate past action(s). First-order Markov chains consider only the last action, while higher-order Markov chains consider a fixed window of past actions.
- Factorized Personalized Markov Chains (FPMC): A prominent model that combines matrix factorization with a first-order Markov chain to capture both general user preferences and sequential transitions.
- Relevance to current paper: While addressing sequence, they often have limited capacity to capture complex, long-range dependencies and evolving patterns.
Deep Sequential Recommenders:
- These leverage the expressive power of deep neural networks to model sequential user behaviors.
- GRU4Rec / GRU4Rec+: Based on Gated Recurrent Units (GRUs), a type of RNN, these models process sequences of items to predict the next item, effectively capturing temporal dependencies. GRU4Rec introduced session-parallel mini-batch training; GRU4Rec+ improved on it with new ranking loss functions and an improved negative sampling strategy.
- Caser: Uses Convolutional Neural Networks (CNNs) to model sequential patterns, where horizontal filters capture union-level patterns spanning several consecutive items and vertical filters capture point-level patterns as weighted sums over the embeddings of past items.
- SASRec (Self-Attentive Sequential Recommendation): A groundbreaking model that adapts the Transformer encoder for sequential recommendation. It uses the self-attention mechanism to allow the model to weigh the importance of all previous items in a user's sequence when predicting the next item, effectively capturing long-range dependencies without relying on recurrence. SASRec demonstrated that a Transformer-based architecture can achieve state-of-the-art performance. The core self-attention mechanism, as introduced in "Attention Is All You Need" (Vaswani et al., 2017), is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings, and $d_k$ is the dimension of the key vectors. This allows the model to selectively focus on relevant past items.
- BERT4Rec: Further advances Transformer-based sequential recommendation by adopting the BERT (Bidirectional Encoder Representations from Transformers) pre-training objective. Instead of predicting only the next item (unidirectional), BERT4Rec uses a Cloze task (similar to masked language modeling) where random items in a sequence are masked, and the model is trained to predict them using bidirectional context. This allows it to learn rich contextual representations.
- Relevance to current paper: These models typically follow the sequence-to-item training strategy, in which the label is a single item (the next item, or a masked item in BERT4Rec's case) predicted in the data space. This paper argues that such supervision is myopic.
Disentangling a User's Mixed Intentions:
- Research in this area focuses on separating the underlying independent factors that explain observed data.
- Methods often involve Variational Autoencoders (VAEs) with regularization terms (e.g., beta-VAE) to minimize mutual information between different parts of the latent representation, thereby encouraging disentanglement.
- Other approaches involve mixture data models or methods for disentangling intentions in graph data (e.g., social networks).
- Relevance to current paper: This paper also employs disentanglement, but for a specific purpose: to identify shared intentions between input and target sequences for seq2seq sample selection, rather than just learning general disentangled representations.
Self-Supervision and Contrastive Learning:
- Beyond the specific Cloze task used by BERT, self-supervised learning encompasses a wide range of pretext tasks such as predicting parts of an object, recovering original order, or discriminating relationships.
- Contrastive Predictive Coding (CPC): A notable self-supervised approach that learns representations by predicting future observations in a latent space. It uses a contrastive loss to distinguish positive future samples from negative samples.
- Relevance to current paper: This paper explicitly draws inspiration from CPC by performing self-supervision in the latent space. However, it extends CPC by incorporating disentangled representation learning to handle multiple intentions, which CPC in its general form does not typically address.
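The scaled dot-product attention formula quoted in the SASRec entry above can be illustrated with a small pure-Python sketch. The `causal` flag is an illustrative addition: it masks out later positions so that each step attends only to itself and earlier items, as left-to-right recommenders such as SASRec do. The toy matrices are made up for demonstration.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # positions after i get -inf, i.e. zero attention weight
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Toy self-attention: three items with 2-dimensional embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(X, X, X, causal=True)
print(w)
```

With the causal mask, the first position can only attend to itself, so its output equals its own value vector.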

3.3. Technological Evolution

The evolution of recommender systems, particularly sequential ones, can be viewed as a progression from static preference modeling to dynamic, context-aware, and increasingly sophisticated representation learning:

Static Preference Models (Early 2000s): Initial systems focused on Collaborative Filtering (CF) and Matrix Factorization (MF), primarily capturing general user preferences without considering the order of interactions.
Sequential Behavior Models (Mid-2000s to Early 2010s): The recognition of the importance of sequence led to models like Markov Chains (FPMC), which explicitly modeled transitions between items. However, these were often limited to short-term dependencies and lacked the capacity for complex patterns.
Deep Learning for Sequences (Mid-2010s onwards): The advent of deep learning, particularly Recurrent Neural Networks (RNNs) like GRU (e.g., GRU4Rec), significantly enhanced the ability to model long-range dependencies and complex non-linear patterns in sequences.
Attention-based and Bidirectional Models (Late 2010s): Self-attention networks (Transformers) revolutionized sequence modeling (e.g., SASRec) by allowing parallel processing and more effective capture of global dependencies. Bidirectional models like BERT4Rec further improved representations by allowing context from both past and future in pre-training.
Self-Supervised Learning and Disentanglement (Late 2010s onwards): The current paper represents a frontier by integrating self-supervised learning (for mining signals from unlabeled data) and disentangled representation learning (for handling multiple user intentions). It builds upon Transformer architectures but innovates in the training strategy by moving towards latent seq2seq supervision and smart sample selection.

This paper fits into this timeline by building on the success of Transformer-based models like SASRec and BERT4Rec, but fundamentally alters their training objective. Instead of merely predicting the next item or masking items in a sequence (both forms of seq2item or Cloze tasks in data space), it proposes to self-supervise by predicting representations of future sequences in a latent space, combined with a novel disentanglement mechanism for improving the quality of these seq2seq signals.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper's approach offers several core differences and innovations:

From Seq2Item to Latent Seq2Seq Training: The most significant departure is from the prevalent sequence-to-item (seq2item) paradigm (used by GRU4Rec, SASRec, BERT4Rec) to a novel sequence-to-sequence (seq2seq) training strategy. While seq2item only supervises the prediction of the immediate next item, this paper's seq2seq strategy explicitly leverages the entire longer-term future sequence for supervision.
Self-Supervision in Latent Space vs. Data Space: Existing deep sequential recommenders often perform prediction tasks directly in the data space (e.g., predicting the exact item ID, as in SASRec, or reconstructing masked item IDs, as in BERT4Rec). This paper proposes self-supervision in the latent space, where the model predicts a compact representation of the future sequence, rather than its individual items. This is inspired by Contrastive Predictive Coding (CPC) but specifically adapted for recommendation. This abstract prediction task is less prone to the "exponentially harder" problem of item-by-item reconstruction of long sequences.
Disentanglement for Seq2Seq Sample Selection: While disentangled representation learning has been explored in various contexts, this paper applies it uniquely to filter seq2seq training samples. Previous disentanglement efforts often aim to learn interpretable or independent representations. Here, the intention disentanglement layer explicitly aims to identify shared intentions between the input and target sequences. Only when such a shared intention is detected is a seq2seq sample considered "high confidence" and used for training. This addresses the "low signal-to-noise ratio" problem arising from multiple, potentially irrelevant, intentions in long future sequences.
Complementary Training Objective: Unlike methods that replace seq2item or Cloze objectives, this paper's seq2seq strategy complements the standard seq2item training. This indicates a hybrid approach that aims to retain the benefits of immediate prediction while adding longer-term awareness.
Implicit Disentanglement Regularization: The paper notes that its contrastive seq2seq loss function naturally encourages disentanglement, as the score for a positive case based on one latent intention must be higher than scores based on other, different latent intentions (from the same input or other samples in the batch). This removes the need for explicit regularization terms typically used in disentangled learning, simplifying the model.

In essence, this work innovates by redefining what is predicted (a latent sequence representation vs. an item ID), how it's predicted (latent self-supervision), and when these longer-term predictions are used (filtered by disentangled intentions), all within a framework that enhances existing seq2item models.

4. Methodology

4.1. Principles

The core idea behind the proposed methodology is to address the limitations of myopic sequence-to-item (seq2item) training in sequential recommenders by introducing a complementary sequence-to-sequence (seq2seq) training strategy. This seq2seq strategy aims to mine additional, longer-term supervision signals from a user's future behavior. The two main principles guiding this approach are:

Latent Self-Supervision: Instead of trying to reconstruct every individual item in a potentially long and complex future sequence (which is "exponentially harder" and can lead to convergence difficulties), the model performs self-supervision in the latent space. This means it learns to predict a concise, high-level representation (a single vector) that summarizes the overall intention or meaning of the entire future sequence, given the earlier sequence. This simplifies the prediction task and distills the most relevant information.
Intention Disentanglement for Relevant Sample Selection: User behavior sequences, especially long ones, can involve multiple evolving intentions. Not all intentions in a future sequence might be predictable or relevant to the input sequence. To address this low signal-to-noise ratio, the model is equipped with an intention disentanglement layer. This layer learns to extract multiple, distinct latent intention representations from any given sequence. Crucially, these disentangled intentions are then used to intelligently select seq2seq training samples: only pairs of input and future sequences that are inferred to share a common underlying intention (i.e., belong to the same latent category) are used for seq2seq training. This ensures that the model learns from meaningful and aligned long-term signals.

By combining these two principles, the proposed seq2seq strategy provides a robust and efficient way to leverage future behavioral signals, enhancing the existing seq2item training.

4.2. Core Methodology In-depth (Layer by Layer)

The proposed methodology integrates a novel seq2seq self-supervision loss with a specially designed disentangled sequence encoder, operating in parallel with the traditional seq2item loss.

4.2.1. Notations and Problem Formulation

Let's first clarify the key notations used throughout the paper, as summarized in Table 1:

The input data consists of $N$ user click sequences, denoted as $\{ \mathbf{x}^{(u)} \}_{u=1}^{N}$ . For a specific user $u$ , their sequence is $\mathbf{x}^{(u)} = [x_1^{(u)}, x_2^{(u)}, \ldots, x_{T_u}^{(u)}]$ , where $T_u$ is the length of the sequence, and $x_t^{(u)} \in \{1, 2, \ldots, M\}$ is the index of the item clicked at time $t$ . $M$ is the total number of unique items.

The model consists of a sequence encoder $\phi_{\theta}(\cdot)$ and an item embedding table $\mathbf{H} \in \mathbb{R}^{M \times D}$ , where $\theta$ encompasses all trainable parameters, including $\mathbf{H}$ . $D$ is the dimensionality of the latent representations. Each item $i$ has a representation $\mathbf{H}_{i,:} \in \mathbb{R}^D$ .

The encoder $\phi_{\theta}(\cdot)$ is designed to output $K$ different $D$ -dimensional vectors, $\{ \phi_{\theta}^{(k)}(\cdot) \}_{k=1}^{K}$ , where each vector $\phi_{\theta}^{(k)}(\cdot)$ represents the user's intention under the $k$ -th latent category.

The primary task remains candidate generation for sequential recommendation: predicting the next item(s) a user $u$ is likely to click, based on their observed sequence $\mathbf{x}^{(u)}$ .

Traditional Sequence-to-Item (seq2item) Training: The standard training approach uses a truncated past sequence $\mathbf{x}_{1:t}^{(u)} = [x_1^{(u)}, \ldots, x_t^{(u)}]$ as input to predict the immediate next item $x_{t+1}^{(u)}$ . The common loss function is a negative log-likelihood: $ \mathcal{L}_{s2i}(\boldsymbol{\theta}) = \sum_u \sum_t \mathcal{L}_{s2i}(\boldsymbol{\theta}, u, t) $ $ \mathcal{L}_{s2i}(\boldsymbol{\theta}, u, t) = - \ln p_{\theta}(x_{t+1}^{(u)} \mid x_1^{(u)}, x_2^{(u)}, \ldots, x_t^{(u)}) $ where $p_{\theta}(\cdot)$ is typically proportional to the similarity between the sequence representation and the item representation in the vector space.
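As a rough illustration of the seq2item loss for a single $(u, t)$ pair, the sketch below assumes that $p_{\theta}$ is a softmax over dot products between one sequence representation and every row of the item table. The full softmax, the function name, and the toy numbers are assumptions for illustration; an actual implementation may use a different similarity or a sampled softmax. Items are 0-indexed in the code, whereas the text indexes them from 1.

```python
import math

def seq2item_loss(seq_repr, item_embeddings, next_item):
    """-ln p(next_item | sequence), with p a softmax over dot-product scores."""
    scores = [sum(s * h for s, h in zip(seq_repr, row)) for row in item_embeddings]
    m = max(scores)
    # log of the softmax normalizer, computed stably
    log_z = m + math.log(sum(math.exp(sc - m) for sc in scores))
    return log_z - scores[next_item]

H = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy item table with M=3, D=2
seq_repr = [2.0, 0.0]                       # stand-in for the encoder output
loss_match = seq2item_loss(seq_repr, H, next_item=0)
loss_miss = seq2item_loss(seq_repr, H, next_item=2)
print(loss_match, loss_miss)
```

The loss is small when the sequence representation points toward the embedding of the true next item and large when it points away from it.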

Notation	Description
$N$	the number of sequences, aka. the number of users
$M$	the number of items
$D$	the dimensionality of the latent representations
$K$	the number of disentangled user intentions
$\mathbf{x}^{(u)}$	the sequence of items clicked by the $u$-th user, i.e. $\mathbf{x}^{(u)} = [x_1^{(u)}, x_2^{(u)}, \ldots, x_{T_u}^{(u)}]$
$x_t^{(u)}$	the $t$-th click in the $u$-th user's sequence $\mathbf{x}^{(u)}$, where $x_t^{(u)} \in \{1, 2, \ldots, M\}$
$T_u$	the length of the $u$-th user's click sequence $\mathbf{x}^{(u)}$
$\theta$	parameters of the sequence encoder
$\mathbf{H} \in \mathbb{R}^{M \times D}$	the item embedding table, included in $\theta$
$\mathbf{H}_{i,:} \in \mathbb{R}^D$	the $i$-th item's representation, i.e. the $i$-th row of $\mathbf{H}$
$\mathbf{H}_{x_t^{(u)},:} \in \mathbb{R}^D$	the representation of item $x_t^{(u)}$, i.e. the $x_t^{(u)}$-th row of $\mathbf{H}$
$\phi_{\theta}(\cdot)$	the sequence encoder, which outputs $K$ vectors
$\phi_{\theta}^{(k)}(\mathbf{x}^{(u)})$	the representation of user $u$'s intention under the $k$-th latent category, where $1 \leq k \leq K$
$\lambda \in [0, 1]$	the threshold for selecting sequence-to-sequence training samples of high confidence
$\mathcal{B}$	a mini-batch of sequences for training

4.2.2. Sequence-to-Sequence Self-Supervision

The paper's core contribution is the seq2seq training strategy, which runs in parallel with the standard seq2item training. During each mini-batch gradient descent step, both losses are minimized.

Mini-batch Construction: A mini-batch $\mathcal{B}$ is formed by uniformly sampling (u, t) pairs from the training set, where $1 \leq u \leq N$ and $1 \leq t \leq T_u - 1$ . Each pair defines an earlier sequence $\mathbf{x}_{1:t}^{(u)} = [x_1^{(u)}, \ldots, x_t^{(u)}]$ and a corresponding future sequence $\mathbf{x}_{t+1:T_u}^{(u)} = [x_{t+1}^{(u)}, \ldots, x_{T_u}^{(u)}]$ . For the seq2seq target, the paper uses a reversed version of the future sequence, denoted as $\mathbf{x}_{T_u:t+1}^{(u)} = [x_{T_u}^{(u)}, x_{T_u-1}^{(u)}, \ldots, x_{t+1}^{(u)}]$ . This reversal is specifically for the latent encoding of the target sequence, allowing items closer to the split point $t$ to potentially gain more weight, which aligns with the typical recency bias in sequential recommendation.
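The mini-batch construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names are hypothetical, users are 0-indexed in the code while $t$ stays 1-based as in the text, and sampling without replacement is an assumption (the text only says the pairs are sampled uniformly).

```python
import random

def sample_pairs(sequences, batch_size, rng):
    """Uniformly sample (u, t) pairs with 1 <= t <= T_u - 1."""
    candidates = [(u, t) for u, seq in enumerate(sequences) for t in range(1, len(seq))]
    return rng.sample(candidates, min(batch_size, len(candidates)))

def split_at(seq, t):
    """Return the earlier sequence x_{1:t} and the reversed future x_{T_u:t+1}."""
    earlier = seq[:t]
    future_reversed = seq[t:][::-1]
    return earlier, future_reversed

sequences = [[5, 3, 8, 2], [7, 1, 4]]  # toy click sequences of item ids
earlier, future_reversed = split_at(sequences[0], 2)
batch = sample_pairs(sequences, batch_size=3, rng=random.Random(0))
print(earlier, future_reversed, batch)
```

For the sequence `[5, 3, 8, 2]` split at $t = 2$, the earlier sequence is `[5, 3]` and the reversed future sequence is `[2, 8]`, so the item right after the split point comes last in the target.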

Sequence-to-sequence (seq2seq) Loss: For a given training sample (u, t) and a latent category $k$ , the seq2seq loss $\mathcal{L}_{s2s}(\boldsymbol{\theta}, u, t, k)$ is defined as the negative log-likelihood of predicting the latent representation of the future sequence $\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{T_u:t+1}^{(u)})$ given the latent representation of the earlier sequence $\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)})$ . This is framed as a contrastive learning task. The model tries to make the representation of the true future sequence (for category $k$ ) similar to the representation of the input sequence (for category $k$ ), while making it dissimilar to representations of other future sequences (from other categories or other samples in the mini-batch).

The seq2seq loss for a specific user $u$ , time step $t$ , and latent category $k$ is: $ \begin{array}{r l} & \mathcal{L}_{s2s}(\boldsymbol{\theta}, u, t, k) = - \ln p_{\theta}\left( \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{T_u:t+1}^{(u)}) \mid \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)}) \right) = \\ & - \ln \frac{ \exp\left( \frac{1}{\sqrt{D}} \, \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{T_u:t+1}^{(u)}) \cdot \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)}) \right) }{ \sum_{(u^{\prime}, t^{\prime}) \in \mathcal{B}} \sum_{k^{\prime}=1}^{K} \exp\left( \frac{1}{\sqrt{D}} \, \boldsymbol{\phi}_{\theta}^{(k^{\prime})}(\mathbf{x}_{T_{u^{\prime}}:t^{\prime}+1}^{(u^{\prime})}) \cdot \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)}) \right) } \end{array} $ Here:

$\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{T_u:t+1}^{(u)})$ is the $k$ -th latent intention representation of the reversed future sequence for user $u$ starting from $t+1$ .
$\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)})$ is the $k$ -th latent intention representation of the earlier sequence for user $u$ up to time $t$ .
The dot product $\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{T_u:t+1}^{(u)}) \cdot \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)})$ measures the similarity between the predicted and target latent representations for the same intention category $k$ . This is the positive pair.
The scaling factor $\frac{1}{\sqrt{D}}$ is applied to the dot product scores, which is a common practice in Transformers and helps stabilize convergence, especially when the last layer is LayerNormalization.
The denominator is a sum over all possible future sequence representations $\boldsymbol{\phi}_{\theta}^{(k^{\prime})}(\mathbf{x}_{T_{u^{\prime}}:t^{\prime}+1}^{(u^{\prime})})$ from all latent categories $k^{\prime}$ and all samples $(u^{\prime}, t^{\prime})$ within the current mini-batch $\mathcal{B}$ . These serve as negative samples, forcing the model to distinguish the true match from other irrelevant latent representations. This softmax is normalized over the mini-batch, which is a common technique in contrastive learning to approximate normalization over the entire training set.
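
To make the structure of this loss concrete, the following is a minimal pure-Python sketch of the per-(sample, category) computation over one mini-batch. It assumes the intention vectors have already been produced by the encoder; the function names `dot` and `seq2seq_losses` and the toy vectors are hypothetical, and a real implementation would use a tensor library.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def seq2seq_losses(past, future):
    """Per-(sample, category) seq2seq losses for one mini-batch.

    past[n][k]   : k-th intention vector of sample n's earlier sequence
    future[n][k] : k-th intention vector of sample n's reversed future sequence
    Returns losses[n][k].
    """
    n_samples, n_categories = len(past), len(past[0])
    scale = 1.0 / math.sqrt(len(past[0][0]))
    # Every future vector in the batch, over all categories, is a candidate.
    candidates = [future[m][j] for m in range(n_samples) for j in range(n_categories)]
    losses = []
    for n in range(n_samples):
        row = []
        for k in range(n_categories):
            logits = [scale * dot(c, past[n][k]) for c in candidates]
            peak = max(logits)  # subtract the max for a stable log-sum-exp
            log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
            positive = scale * dot(future[n][k], past[n][k])
            row.append(log_denominator - positive)
        losses.append(row)
    return losses


# Toy batch: 2 samples, K = 2 categories, D = 2 dimensions (made-up numbers).
past = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]]
future = [[[2.0, 0.0], [0.0, 2.0]], [[-2.0, 0.0], [0.0, -2.0]]]
losses = seq2seq_losses(past, future)
```

Because the positive pair is itself one of the candidates in the denominator, every loss value is non-negative.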

Selecting Samples of High Confidence for seq2seq Training: A crucial aspect of the seq2seq strategy is to only learn from "high confidence" samples. Not all future sequences are genuinely predictable from earlier behaviors, especially if intentions shift. To filter out irrelevant samples, the seq2seq loss is only computed for samples (u, t, k) where the current value of the loss (before optimization) is below a certain threshold. This threshold is dynamically determined.

The total seq2seq loss for a mini-batch $\mathcal{B}$ is: $ \mathcal{L}_{s2s}(\boldsymbol{\theta}, \mathcal{B}) = \sum_{(u, t) \in \mathcal{B}} \sum_{k=1}^{K} \mathcal{L}_{s2s}(\boldsymbol{\theta}, u, t, k) \cdot \mathbb{I}[ \mathcal{L}_{s2s}(\boldsymbol{\theta}, u, t, k) \leq \tau ] $ where:

$\mathbb{I}[\cdot]$ is an indicator function, which is 1 if the condition inside the bracket is true, and 0 otherwise.
$\tau$ is a dynamic threshold. It is set to the $\lceil \lambda \cdot |\mathcal{B}| \cdot K \rceil$ -th smallest value among all computed seq2seq losses $\{ \mathcal{L}_{s2s}(\boldsymbol{\theta}, u, t, k) \}$ within the current mini-batch.
$\lambda \in [0, 1]$ is a hyper-parameter. It controls the percentage of highest-confidence seq2seq samples that are kept for training. For example, if $\lambda = 0.1$ , only the top 10% of samples (those with the smallest current loss values, implying the model is already somewhat confident in the prediction) are used. This mechanism helps to focus learning on the most relevant and predictable seq2seq relationships.
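
The selection rule can be illustrated with a short sketch. It assumes the per-triple losses have already been computed and flattened into a list; the function name `select_high_confidence` and the loss values are made up for illustration.

```python
import math


def select_high_confidence(losses, lam):
    """Keep only the ceil(lam * len(losses)) smallest seq2seq losses.

    losses : flat list with one value per (u, t, k) triple in the mini-batch
    lam    : fraction of triples to keep, in [0, 1]
    Returns (tau, masked_total); tau is None when nothing is kept.
    """
    n_keep = math.ceil(lam * len(losses))
    if n_keep == 0:
        return None, 0.0
    tau = sorted(losses)[n_keep - 1]  # the n_keep-th smallest loss
    masked_total = sum(v for v in losses if v <= tau)
    return tau, masked_total


batch_losses = [2.0, 0.5, 1.5, 3.0, 0.75, 2.5]  # made-up values, |B| * K = 6
tau, total = select_high_confidence(batch_losses, lam=0.5)
print(tau, total)  # 1.5 2.75
```

With $\lambda = 0.5$ and six triples, the three smallest losses (0.5, 0.75, 1.5) pass the threshold and the rest contribute nothing to the mini-batch loss.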

Sequence-to-Item (seq2item) Loss: The traditional seq2item training is maintained as it's crucial for learning a proper encoder quickly and aligning item and sequence representations. The paper adapts the seq2item loss to leverage the $K$ disentangled intention representations produced by the encoder. Instead of just one sequence representation, the model now considers all $K$ intention representations when predicting the next item, and scores the true next item by its maximum similarity over these intentions.

The total seq2item loss for a mini-batch $\mathcal{B}$ is: $ \mathcal{L}_{s2i}(\boldsymbol{\theta}, \mathcal{B}) = \sum_{(u, t) \in \mathcal{B}} \mathcal{L}_{s2i}(\boldsymbol{\theta}, u, t) $ $ \begin{array}{r l} & \mathcal{L}_{s2i}(\boldsymbol{\theta}, u, t) = - \ln p_{\theta}\left( \mathbf{h}_{t+1}^{(u)} \mid \boldsymbol{\phi}_{\theta}(\mathbf{x}_{1:t}^{(u)}) \right) = \\ & - \ln \frac{ \max_{k \in \{1, 2, \ldots, K\}} \exp\left( \frac{1}{\sqrt{D}} \, \mathbf{h}_{t+1}^{(u)} \cdot \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)}) \right) }{ \sum_{(u^{\prime}, t^{\prime}) \in \mathcal{B}} \sum_{k^{\prime}=1}^{K} \exp\left( \frac{1}{\sqrt{D}} \, \mathbf{h}_{t^{\prime}+1}^{(u^{\prime})} \cdot \boldsymbol{\phi}_{\theta}^{(k^{\prime})}(\mathbf{x}_{1:t}^{(u)}) \right) } \end{array} $ Here:

$\mathbf{h}_{t+1}^{(u)} \in \mathbb{R}^D$ is the embedding of the actual next item $x_{t+1}^{(u)}$ (i.e., row $x_{t+1}^{(u)}$ of the item embedding table $\mathbf{H}$ ). This is the positive item.
$\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)})$ is the $k$ -th latent intention representation of the earlier sequence.
The numerator calculates the similarity between the true next item's embedding and the most relevant of the $K$ intention representations from the input sequence (selected by max). This max operation implies that if any of the disentangled intentions can predict the next item, it contributes to the positive score.
The denominator sums over the next-item embeddings $\mathbf{h}_{t^{\prime}+1}^{(u^{\prime})}$ of all samples in the mini-batch (including the positive item itself), each scored against all $K$ latent intention representations of the current input sequence $\mathbf{x}_{1:t}^{(u)}$ . This again forms a contrastive loss in which the next items of the other samples act as negatives, so the model learns to distinguish the true next item from them based on the learned sequence representations.
The scaling factor $\frac{1}{\sqrt{D}}$ is used, similar to the seq2seq loss.
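
A minimal pure-Python sketch of this loss follows, assuming the item embeddings and intention vectors are already available. The names `dot` and `seq2item_losses` and the toy numbers are hypothetical; this illustrates the formula and is not the authors' implementation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def seq2item_losses(next_items, past):
    """Per-sample seq2item losses for one mini-batch.

    next_items[n] : embedding of the true next item of sample n
    past[n][k]    : k-th intention vector of sample n's earlier sequence
    """
    scale = 1.0 / math.sqrt(len(next_items[0]))
    losses = []
    for n, intentions in enumerate(past):
        # scores[m][k]: item of sample m scored against intention k of sample n
        scores = [[scale * dot(h, phi) for phi in intentions] for h in next_items]
        flat = [s for row in scores for s in row]
        peak = max(flat)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in flat))
        positive = max(scores[n])  # best-matching intention for the true item
        losses.append(log_denominator - positive)
    return losses


# Toy batch: 2 samples, K = 2 categories, D = 2 dimensions (made-up numbers).
next_items = [[1.0, 0.0], [0.0, 1.0]]
past = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
item_losses = seq2item_losses(next_items, past)
```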

Overall Training Loss: The final objective during training is to minimize the sum of both seq2item and seq2seq losses for each mini-batch: $ \mathcal{L}(\boldsymbol{\theta}, \mathcal{B}) = \mathcal{L}_{s2i}(\boldsymbol{\theta}, \mathcal{B}) + \mathcal{L}_{s2s}(\boldsymbol{\theta}, \mathcal{B}) $ This combined loss ensures that the model maintains strong short-term predictive power while also learning from longer-term future signals.

4.2.3. Disentangled Sequence Encoding

The paper utilizes a modified SASRec encoder as its base, augmented with a custom intention-disentanglement layer.

Base Encoder (SASRec): The SASRec encoder [25] is a Transformer-based model that uses multi-head self-attention. For an input sequence $\mathbf{x}_{1:t}^{(u)} = [x_1^{(u)}, x_2^{(u)}, \ldots, x_t^{(u)}]$ , it first obtains item embeddings (from the shared item embedding table $\mathbf{H}$ ) and adds trainable position embeddings to them. After passing through several self-attention blocks, the single-head SASRec encoder outputs a sequence of $D$ -dimensional vectors $[ \mathbf{z}_1^{(u)}, \mathbf{z}_2^{(u)}, \ldots, \mathbf{z}_t^{(u)} ]$ . Each $\mathbf{z}_i^{(u)} \in \mathbb{R}^D$ can be interpreted as the latent intention of the user when clicking item $x_i^{(u)}$ , considering its context within the sequence.

The paper notes that a simple multi-head SASRec doesn't inherently lead to disentanglement, as its heads often focus on similar aspects (e.g., the latest click). Therefore, a specific intention-disentanglement layer is appended.

Intention Clustering: The disentanglement layer starts by associating each item's latent intention $\mathbf{z}_i^{(u)}$ with one of $K$ predefined prototypical intention representations, denoted as $\{ \mathbf{c}_k \in \mathbb{R}^D : 1 \leq k \leq K \}$ . These prototypes $\mathbf{c}_k$ are learnable parameters. The association is quantified by a probability-like attention weight $p_{k|i}$ : $ p_{k|i} = \frac{ \exp\left( \frac{1}{\sqrt{D}} \, \mathrm{LayerNorm}_1(\mathbf{z}_i^{(u)}) \cdot \mathrm{LayerNorm}_2(\mathbf{c}_k) \right) }{ \sum_{k^{\prime}=1}^{K} \exp\left( \frac{1}{\sqrt{D}} \, \mathrm{LayerNorm}_1(\mathbf{z}_i^{(u)}) \cdot \mathrm{LayerNorm}_2(\mathbf{c}_{k^{\prime}}) \right) } $ where:

$i = 1, 2, \ldots, t$ (positions in the sequence).
$k = 1, 2, \ldots, K$ (latent categories).
$\mathrm{LayerNorm}_1(\cdot)$ and $\mathrm{LayerNorm}_2(\cdot)$ are distinct LayerNormalization layers. Normalizing the vectors before the dot product effectively makes this a cosine similarity measurement.
The scaling factor $\frac{1}{\sqrt{D}}$ is applied.
This equation essentially calculates a softmax over the similarities between the normalized item intention $\mathbf{z}_i^{(u)}$ and each of the $K$ prototypes $\mathbf{c}_k$ . This $p_{k|i}$ indicates how likely the intention at position $i$ belongs to the $k$ -th latent category. The use of cosine similarity is noted to be more robust against mode collapse (where prototypes might be ignored) than simple dot products.
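
The clustering step can be sketched as follows. This is a simplified illustration: the `layer_norm` helper omits the learnable gain and bias that a real LayerNormalization layer has, and all function names and toy vectors are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def layer_norm(v, eps=1e-6):
    """LayerNorm without learnable gain and bias (a simplifying assumption)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def intention_clustering(z, prototypes):
    """Return p[i][k], the soft assignment of position i to category k."""
    scale = 1.0 / math.sqrt(len(z[0]))
    normed_prototypes = [layer_norm(c) for c in prototypes]
    assignments = []
    for z_i in z:
        normed_z = layer_norm(z_i)
        logits = [scale * dot(normed_z, c) for c in normed_prototypes]
        assignments.append(softmax(logits))
    return assignments


# Toy example: t = 2 positions, K = 2 prototypes, D = 4 (made-up numbers).
prototypes = [[1.0, 0.0, 0.0, -1.0], [-1.0, 0.0, 0.0, 1.0]]
z = [[2.0, 0.0, 0.0, -2.0], [-3.0, 0.0, 0.0, 3.0]]
p = intention_clustering(z, prototypes)
```

In this toy case the first position points in the direction of the first prototype and the second position in the direction of the second, so each is assigned almost entirely to its matching category regardless of vector length.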

Intention Weighting: Beyond categorizing intentions, it's also important to weigh their significance for predicting future behaviors. A second attention mechanism computes a weight $p_i$ that indicates how important the primary intention at position $i$ is for predicting the user's future intentions: $ p_i = \frac{ \exp\left( \frac{1}{\sqrt{D}} \, \mathbf{key}_i \cdot \mathbf{query} \right) }{ \sum_{i^{\prime}=1}^{t} \exp\left( \frac{1}{\sqrt{D}} \, \mathbf{key}_{i^{\prime}} \cdot \mathbf{query} \right) } $ The key and query vectors are derived as follows: $ \mathbf{key}_i = \widetilde{\mathbf{key}}_i + \mathrm{ReLU}(\mathbf{W}^{\top} \widetilde{\mathbf{key}}_i + \mathbf{b}) $ $ \widetilde{\mathbf{key}}_i = \mathrm{LayerNorm}_3 ( \boldsymbol{\alpha}_i + \mathbf{z}_i^{(u)} ) $ $ \mathbf{query} = \mathrm{LayerNorm}_4 ( \boldsymbol{\alpha}_t + \mathbf{z}_t^{(u)} + \mathbf{b}^{\prime} ) $ where:

$i = 1, 2, \ldots, t$ .
$\mathbf{W} \in \mathbb{R}^{D \times D}$ , $\mathbf{b} \in \mathbb{R}^D$ , $\mathbf{b}^{\prime} \in \mathbb{R}^D$ are trainable parameters.
$\boldsymbol{\alpha}_i \in \mathbb{R}^D$ are position embeddings specific to this disentanglement layer (separate from SASRec's position embeddings).
$\mathrm{LayerNorm}_3(\cdot)$ and $\mathrm{LayerNorm}_4(\cdot)$ are distinct LayerNormalization layers.
The query is constructed based on the latest position embedding $\boldsymbol{\alpha}_t$ and the latest item intention $\mathbf{z}_t^{(u)}$ , plus a bias $\mathbf{b}^{\prime}$ . This implies a recency bias and a focus on intentions similar to the most recent one.
The key for each position $i$ is derived from its position embedding $\boldsymbol{\alpha}_i$ and item intention $\mathbf{z}_i^{(u)}$ , processed through a non-linear transformation (ReLU with $\mathbf{W}, \mathbf{b}$ ).
The softmax over these key-query dot products (scaled by $\frac{1}{\sqrt{D}}$ ) gives the attention weights $p_i$ , indicating the importance of each position's intention.
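
The key/query construction above can be sketched in pure Python. As a simplification, `layer_norm` here has no learnable gain or bias, and the parameters `W`, `b`, `b_prime`, and `alpha` are passed in as plain lists; all names and toy values are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def layer_norm(v, eps=1e-6):
    """LayerNorm without learnable gain and bias (a simplifying assumption)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def intention_weights(z, alpha, W, b, b_prime):
    """Return the importance weights p_i for positions 1..t.

    z, alpha   : t vectors of size D (item intentions, position embeddings)
    W          : D x D matrix as nested lists
    b, b_prime : size-D bias vectors
    """
    t, dim = len(z), len(z[0])
    scale = 1.0 / math.sqrt(dim)
    keys = []
    for i in range(t):
        base = layer_norm([a + x for a, x in zip(alpha[i], z[i])])
        # (W^T base)_j = sum_d W[d][j] * base[d]
        projected = [sum(W[d][j] * base[d] for d in range(dim)) + b[j] for j in range(dim)]
        # Residual connection around the ReLU branch.
        keys.append([k + max(0.0, p) for k, p in zip(base, projected)])
    query = layer_norm([a + x + c for a, x, c in zip(alpha[t - 1], z[t - 1], b_prime)])
    return softmax([scale * dot(key, query) for key in keys])


# Toy example: t = 3 positions, D = 3 (made-up numbers, identity W, zero biases).
z = [[1.0, 0.0, -1.0], [0.5, 0.5, -1.0], [1.0, -1.0, 0.0]]
alpha = [[0.0, 0.0, 0.0] for _ in z]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = intention_weights(z, alpha, W, [0.0] * 3, [0.0] * 3)
```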

Intention Aggregation: Finally, the $K$ disentangled output representations of the sequence encoder, $\{ \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)}) \}_{k=1}^{K}$ , are computed by aggregating all item intentions $\mathbf{z}_i^{(u)}$ according to both their category-specific attention $p_{k|i}$ and their overall importance $p_i$ : $ \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)}) = \mathrm{LayerNorm}_5 \left( \boldsymbol{\beta}_k + \sum_{i=1}^{t} p_{k|i} \cdot p_i \cdot \mathbf{z}_i^{(u)} \right) $ where:

$k = 1, 2, \ldots, K$ .
$\mathrm{LayerNorm}_5(\cdot)$ is another LayerNormalization layer.
$\boldsymbol{\beta}_k \in \mathbb{R}^D$ is a learnable bias vector for each latent category $k$ . The paper mentions using two sets of such biases: one for encoding the input sequence $\mathbf{x}_{1:t}^{(u)}$ and another for encoding the reversed future sequence $\mathbf{x}_{T_u:t+1}^{(u)}$ , acknowledging their different roles.
This sum is a weighted sum of all item intentions $\mathbf{z}_i^{(u)}$ (the weights $p_{k|i} \cdot p_i$ do not sum to one for a fixed $k$ , so it is not a normalized average). The weights ensure that each disentangled representation $\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)})$ predominantly reflects intentions belonging to category $k$ that are also deemed important for future prediction.
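
The aggregation step can be sketched as below, taking the two sets of attention weights as given inputs. As a simplification, `layer_norm` has no learnable gain or bias; the function names and toy values are hypothetical.

```python
import math


def layer_norm(v, eps=1e-6):
    """LayerNorm without learnable gain and bias (a simplifying assumption)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def aggregate_intentions(z, p_cat, p_pos, beta):
    """Return the K output vectors phi[k].

    z     : t item-intention vectors of size D
    p_cat : p_cat[i][k], category assignment of position i
    p_pos : p_pos[i], importance of position i
    beta  : K bias vectors of size D
    """
    dim = len(z[0])
    outputs = []
    for k, bias in enumerate(beta):
        pooled = list(bias)
        for i, z_i in enumerate(z):
            weight = p_cat[i][k] * p_pos[i]
            for d in range(dim):
                pooled[d] += weight * z_i[d]
        outputs.append(layer_norm(pooled))
    return outputs


# Toy example: t = 2 positions, K = 2 categories, D = 3 (made-up numbers).
z = [[1.0, 0.0, -1.0], [0.0, 2.0, -2.0]]
p_cat = [[1.0, 0.0], [0.0, 1.0]]  # position 1 -> category 1, position 2 -> category 2
p_pos = [0.5, 0.5]
beta = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
phi = aggregate_intentions(z, p_cat, p_pos, beta)
```

With hard assignments as in this toy input, each output vector depends only on the positions assigned to its own category.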

Encouraging Disentanglement: The paper argues that no explicit regularization term is needed to encourage disentanglement (e.g., minimizing mutual information between representations). This is because the contrastive nature of both seq2seq (Eq. 3) and seq2item (Eq. 6) loss functions naturally drives disentanglement. In both losses, the score of a positive sample for a particular intention $\boldsymbol{\phi}_{\theta}^{(k)}(\cdot)$ is compared against the scores from all other K-1 intentions (and other negative samples). To maximize the likelihood of the positive case for a specific part $k$ , the model is implicitly forced to make $\boldsymbol{\phi}_{\theta}^{(k)}(\cdot)$ capture distinct information from $\boldsymbol{\phi}_{\theta}^{(k^{\prime})}(\cdot)$ for $k' \neq k$ . If the different $\boldsymbol{\phi}_{\theta}^{(k)}(\cdot)$ were entangled and carried redundant information, they would not be able to effectively distinguish positive cases associated with one specific intention from those associated with others.

This comprehensive methodology, integrating latent self-supervision and disentangled intention modeling, aims to provide a more robust and insightful way to learn from user behavior sequences.

5. Experimental Setup

5.1. Datasets

The experiments are conducted on four real-world public datasets, which have been pre-processed and widely used in previous sequential recommendation research, particularly by SASRec and BERT4Rec.

Amazon Beauty:
- Source: Amazon product reviews, specifically from the "Beauty" category.
- Scale: 40,226 users and 54,542 items.
- Characteristics: Relatively sparse dataset with shorter sequences.
- Average Sequence Length: 8.8
Steam:
- Source: User interactions (e.g., game purchases, playtimes) on the Steam gaming platform.
- Scale: 281,428 users and 13,044 items.
- Characteristics: Large number of users, but fewer items compared to Beauty. Also features relatively short sequences.
- Average Sequence Length: 12.4
MovieLens-1M:
- Source: Movie ratings data from the MovieLens platform.
- Scale: 6,040 users and 3,416 items.
- Characteristics: Fewer users and items but significantly longer user interaction sequences, reflecting more extensive viewing histories.
- Average Sequence Length: 163.5
MovieLens-20M:
- Source: A larger version of the MovieLens dataset with 20 million ratings.
- Scale: 138,493 users and 26,744 items.
- Characteristics: More users and items than MovieLens-1M, also characterized by long user sequences.
- Average Sequence Length: 144.4

Data Splitting: Following the common practice in sequential recommendation research [25, 49], the datasets are split as follows for each user:

The last item of each user's sequence is reserved for testing.
The second-to-last item of each user's sequence is reserved for validation.
All remaining items (from the beginning up to the third-to-last) are used for training. Items within a sequence are strictly ordered by their timestamps, with the last position corresponding to the most recent click.
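
The leave-one-out split described above amounts to the following sketch. The function name `leave_one_out_split` and the toy sequence are hypothetical.

```python
def leave_one_out_split(sequence):
    """Split one user's chronologically ordered sequence.

    Returns (train_items, validation_item, test_item).
    """
    if len(sequence) < 3:
        raise ValueError("need at least three interactions to split")
    return sequence[:-2], sequence[-2], sequence[-1]


clicks = [10, 42, 7, 99, 3]  # hypothetical item IDs, oldest first
train, valid, test = leave_one_out_split(clicks)
print(train, valid, test)  # [10, 42, 7] 99 3
```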

These datasets are chosen because they are standard benchmarks in sequential recommendation, allowing for fair comparison with state-of-the-art methods and covering a range of sequence lengths and sparsity levels. They are effective for validating the method's performance across different user behavior patterns.

5.2. Evaluation Metrics

The performance of all methods is evaluated using three widely accepted metrics in recommendation systems: Recall, Normalized Discounted Cumulative Gain (NDCG), and Mean Reciprocal Rank (MRR). Higher values for all these metrics indicate better recommendation performance.

The evaluation setup follows the BERT4Rec protocol: each ground-truth item in the test set is paired with 100 negative items randomly sampled according to their popularity. The task then becomes ranking the single ground-truth item among these 101 candidates. This is a common practice to make evaluation tractable for large item sets.

5.2.1. Recall@K

Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved by the recommender system within its top $K$ recommendations. It focuses on the ability of the system to find the relevant items, regardless of their precise ranking within the top $K$ . In the context of next-item prediction where there's usually only one ground-truth next item, Recall@K indicates whether the true next item is present in the top $K$ recommended items.
Mathematical Formula: $ \text{Recall@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \mathbb{I}(\text{next\_item}_u \in \text{Recommended}_u(K)) $
Symbol Explanation:
- $|\mathcal{U}|$ : The total number of users in the test set.
- $u \in \mathcal{U}$ : A specific user in the test set.
- $\text{next\_item}_u$ : The actual next item clicked by user $u$ (the ground-truth item for evaluation). In this paper's setup, there is one such item per user in the test set.
- $\text{Recommended}_u(K)$ : The set of the top $K$ items recommended by the system for user $u$ .
- $\mathbb{I}(\cdot)$ : The indicator function. It returns 1 if the condition inside the parenthesis is true (i.e., $\text{next\_item}_u$ is found among the top $K$ recommendations), and 0 otherwise.
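
Under the sampled-candidate setup, Recall@K reduces to checking whether the ground-truth item's rank among its candidates is at most $K$ . A minimal sketch follows; the function names, the tie-handling rule (ties count against the ground-truth item), and the example ranks are assumptions for illustration.

```python
def rank_of_target(target_score, negative_scores):
    """1-based rank of the ground-truth item among itself and its negatives.

    Ties are counted against the ground-truth item (a pessimistic assumption).
    """
    return 1 + sum(1 for s in negative_scores if s >= target_score)


def recall_at_k(ranks, k):
    """Fraction of users whose ground-truth item is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


ranks = [1, 3, 12, 6]  # hypothetical ranks for four test users
print(recall_at_k(ranks, 1))   # 0.25
print(recall_at_k(ranks, 5))   # 0.5
print(recall_at_k(ranks, 10))  # 0.75
```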

5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)

Conceptual Definition: NDCG@K is a measure of ranking quality. It evaluates how well a recommender system ranks relevant items, giving higher scores to relevant items that appear earlier in the recommendation list (i.e., at higher ranks). It "discounts" the value of relevant items as their position in the list decreases. NDCG is commonly used in information retrieval and recommendation to account for the graded relevance of items. For next-item prediction, the relevance is binary (1 for the true next item, 0 for others).
Mathematical Formula: $ \text{NDCG@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{\text{DCG@K}_u}{\text{IDCG@K}_u} $ where: $ \text{DCG@K}_u = \sum_{j=1}^{K} \frac{2^{\text{rel}_j}-1}{\log_2(j+1)} $ $ \text{IDCG@K}_u = \sum_{j=1}^{K} \frac{2^{\text{rel}^{ideal}_j}-1}{\log_2(j+1)} $
Symbol Explanation:
- $|\mathcal{U}|$ : The total number of users in the test set.
- $u \in \mathcal{U}$ : A specific user in the test set.
- $j$ : The rank (position) in the recommendation list, from 1 to $K$ .
- $\text{rel}_j$ : The relevance score of the item at rank $j$ in the actual recommendation list for user $u$ . For binary relevance (relevant/irrelevant), it's typically 1 if the item at rank $j$ is the ground-truth next item, and 0 otherwise.
- $\text{DCG@K}_u$ : Discounted Cumulative Gain for user $u$ at cutoff $K$ . It sums the relevance scores, discounted logarithmically by their rank.
- $\text{IDCG@K}_u$ : Ideal Discounted Cumulative Gain for user $u$ at cutoff $K$ . This is the maximum possible DCG for user $u$ , achieved if all relevant items were perfectly ranked at the top. For a single ground-truth item, IDCG@K would be $\frac{2^1-1}{\log_2(1+1)} = 1$ (if $K \ge 1$ ).
- $\text{rel}^{ideal}_j$ : The ideal relevance score at rank $j$ . For a single ground-truth item, $\text{rel}^{ideal}_1 = 1$ , and $\text{rel}^{ideal}_j = 0$ for $j > 1$ .
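
With a single ground-truth item per user, IDCG@K equals 1 and the per-user NDCG@K simplifies to $1 / \log_2(\text{rank}_u + 1)$ when the item is within the top $K$ , and 0 otherwise. The sketch below uses that simplification; the function name and example ranks are hypothetical.

```python
import math


def ndcg_at_k(ranks, k):
    """NDCG@k when each user has exactly one relevant item.

    ranks : 1-based rank of the ground-truth item for each user
    """
    gains = [1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks]
    return sum(gains) / len(gains)


ranks = [1, 3, 12]  # hypothetical ranks for three test users
# Rank 1 contributes 1, rank 3 contributes 1 / log2(4) = 0.5, rank 12 contributes 0.
score = ndcg_at_k(ranks, 10)
```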

5.2.3. Mean Reciprocal Rank (MRR)

Conceptual Definition: MRR measures the average of the reciprocal ranks of the first relevant item in a list of recommendations. If the first relevant item is at rank 1, its reciprocal rank is 1. If it's at rank 2, its reciprocal rank is 1/2, and so on. MRR is particularly useful when only one relevant item is expected, and placing it high in the list is critical.
Mathematical Formula: $ \text{MRR} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{\text{rank}_u} $
Symbol Explanation:
- $|\mathcal{U}|$ : The total number of users in the test set.
- $u \in \mathcal{U}$ : A specific user in the test set.
- $\text{rank}_u$ : The rank (position) of the first relevant item for user $u$ in the recommendation list. If the relevant item is not found, its rank is often considered $\infty$ , making $1/\text{rank}_u = 0$ .
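
MRR can be computed directly from the same per-user ranks. In the 101-candidate setup every ground-truth item has a finite rank, so the sketch below only handles a missing rank (`None`) for completeness; the function name and example ranks are hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over users; a rank of None contributes 0."""
    reciprocal = [1.0 / r if r is not None else 0.0 for r in ranks]
    return sum(reciprocal) / len(reciprocal)


ranks = [1, 2, 4]  # hypothetical ranks for three test users
mrr = mean_reciprocal_rank(ranks)  # (1 + 0.5 + 0.25) / 3
```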

5.3. Baselines

The paper compares its approach against a comprehensive set of representative baselines, spanning different generations and paradigms of recommender systems:

POP (Popularity-based): A naive but strong baseline that recommends the most globally popular items to all users. This helps determine if more complex models are genuinely learning personalized preferences or just popular items.
BPR-MF (Bayesian Personalized Ranking - Matrix Factorization): A classic and widely used collaborative filtering algorithm based on matrix factorization. It optimizes for pairwise ranking, aiming to rank observed (positive) items higher than unobserved (negative) items. It primarily captures general user preferences.
NCF (Neural Collaborative Filtering): An influential deep learning-based collaborative filtering model that replaces the inner product in matrix factorization with a neural network to model user-item interactions, allowing for more complex, non-linear relationships.
FPMC (Factorized Personalized Markov Chains): A sequential recommender that combines matrix factorization with a first-order Markov chain. It models both general user preferences and sequential transitions (predicting the next item based on the last item).
GRU4Rec / GRU4Rec+: Recurrent Neural Network (RNN) based sequential recommenders. GRU4Rec uses Gated Recurrent Units to model sequences. GRU4Rec+ is an improved version that trains the same kind of model with a different loss function and negative-sampling strategy. These models capture temporal dependencies.
Caser (Convolutional Sequence Embedding Recommendation): A sequential recommender that uses Convolutional Neural Networks (CNNs) to extract local and general sequential patterns from user interaction sequences.
SASRec (Self-Attentive Sequential Recommendation): A state-of-the-art sequential recommender based on the Transformer architecture. It uses self-attention to capture long-range dependencies in user sequences, allowing it to weigh the importance of all previous items when predicting the next one. This model serves as the foundation for the proposed encoder in this paper.
BERT4Rec (Sequential Recommendation with Bidirectional Encoder Representations from Transformer): A leading deep sequential recommender that adapts the BERT pre-training objective (masked item prediction using bidirectional context) to train a bidirectional Transformer encoder for sequential recommendations. It captures rich contextual representations by considering both past and future items.

These baselines represent a progression from non-sequential to sequential, and from traditional methods to deep learning approaches, including the current state-of-the-art Transformer-based models. This diverse set allows for a robust evaluation of the proposed method's novelty and effectiveness.

5.4. Implementation and Hyper-parameters

Framework: The model is implemented using TensorFlow.
Initialization: Parameters are initialized using TensorFlow's default initialization methods.
Optimizer: Adam optimizer is used for mini-batch gradient descent.
Learning Rate: Fixed at 0.001.
Mini-batch Size: 128 sequences per batch.
Base Encoder: The single-head implementation of SASRec is used as the foundational component for the sequence encoder.
Maximum Sequence Length:
- MovieLens-1M and MovieLens-20M: Capped at 200.
- Amazon Beauty and Steam: Capped at 50. These caps align with configurations used by SASRec and BERT4Rec to manage computational complexity and focus on recent history for datasets with very long sequences.
Hyper-parameter Tuning: Other hyper-parameters are tuned using random search.
- Dimensionality of Item Embeddings ( $D$ ): Chosen from $\{16, 32, 64, 128, 256\}$ .
- Number of Self-Attention Blocks (in SASRec part): Chosen from $\{1, 2, 3\}$ .
- Lambda ( $\lambda$ ): The threshold hyper-parameter for seq2seq sample selection, chosen from $\{0.05, 0.10, \ldots, 1.0\}$ .
- Number of Latent Categories ( $K$ ): Chosen from $\{1, 2, \ldots, 8\}$ .
- Dropout Rate: Chosen from $\{0, 0.1, 0.2, \ldots, 0.9\}$ .
- L2 Regularization Term: Selected from $\{0, 0.0001, 0.001, \ldots, 1\}$ .
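
The search space above can be written down as a small configuration sketch. The expansions of the elided ranges are assumptions (a step of 0.05 for $\lambda$ , a step of 0.1 for dropout, and powers of ten for the L2 term), and the dictionary keys and `sample_config` helper are hypothetical names.

```python
import random

# Candidate values per hyper-parameter; elided ranges are filled in by assumption.
SEARCH_SPACE = {
    "embedding_dim": [16, 32, 64, 128, 256],
    "num_blocks": [1, 2, 3],
    "lambda": [round(0.05 * i, 2) for i in range(1, 21)],   # 0.05, 0.10, ..., 1.0
    "num_categories": list(range(1, 9)),                    # 1, 2, ..., 8
    "dropout": [round(0.1 * i, 1) for i in range(10)],      # 0, 0.1, ..., 0.9
    "l2": [0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0],               # assumed powers of ten
}


def sample_config(rng):
    """Draw one configuration uniformly at random from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


config = sample_config(random.Random(0))
```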

6. Results & Analysis

6.1. Core Results Analysis

The empirical results demonstrate that the proposed approach, which combines traditional seq2item training with the novel disentangled latent seq2seq training, consistently outperforms all baseline models across various metrics and datasets.

The following figure (Figure 2 from the original paper) shows the recommendation performance in terms of Recall@1, Recall@5, and Recall@10. These metrics measure how well a method can retrieve the relevant items with a limited budget.

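The image is a chart showing the recall of different recommendation methods on several datasets (Beauty, Steam, ML-1m, ML-20m). It is divided into three panels showing recall at the top 1, top 5, and top 10 recommended positions. The comparison across methods shows the advantage of the proposed method.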

The next figure (Figure 3 from the original paper) shows the recommendation performance in terms of NDCG@5, NDCG@10, and MRR. These metrics measure how well a method can rank the relevant items before the irrelevant ones.

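The image is a bar chart showing the performance of different recommendation methods on several datasets (such as Beauty and Steam), including results for normalized discounted cumulative gain (NDCG@5, NDCG@10) and mean reciprocal rank (MRR). The data show that the proposed method outperforms the other baselines on multiple evaluation metrics.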

Key Observations:

Consistent Outperformance: Across all four datasets (Beauty, Steam, MovieLens-1M, MovieLens-20M) and all evaluated metrics (Recall@1, @5, @10; NDCG@5, @10; MRR), the proposed method achieves the highest performance.
Significant Gains on Shorter Sequences: The improvement is particularly pronounced on the Beauty and Steam datasets. For these datasets, which have relatively shorter average sequence lengths (8.8 and 12.4, respectively), the relative improvement over the strongest baselines often exceeds 35%. This suggests that for sequences where intentions might be clearer or less prone to very long-term shifts, the latent seq2seq signals combined with disentanglement are highly effective.
Modest Gains on Longer Sequences: On the MovieLens-1M and MovieLens-20M datasets, which feature much longer average sequence lengths (163.5 and 144.4, respectively), the relative improvement is around 5%. The authors attribute this to the inherent challenge of disentangling intentions within very long and complex sequences. Such sequences might contain many evolving or intertwined intentions, making it harder for the model to isolate and leverage specific shared intentions for seq2seq training. This indicates a potential area for future improvement.

In summary, the seq2seq training strategy successfully extracts additional supervision signals that complement seq2item training, leading to superior recommendation performance, especially on datasets with more manageable sequence lengths for intention disentanglement.

6.2. Robustness to Synthetic Noises

To evaluate the robustness of the proposed seq2seq training strategy, the authors conducted experiments where the training data was artificially corrupted. This involved randomly replacing a portion of observed clicks in the training set with uniformly sampled items. The experiment was performed on the Beauty dataset, with the corruption level (percentage of corrupted data) ranging from 10% to 50%.
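
One way to produce such corrupted training data is sketched below. The source does not spell out the exact procedure, so this version makes an assumption: each click is replaced independently with probability `rate`, which corrupts roughly that fraction of clicks, and the replacement may occasionally equal the original item. The function name `corrupt_clicks` and the toy data are hypothetical.

```python
import random


def corrupt_clicks(sequences, item_ids, rate, seed=0):
    """Replace each click with a uniformly sampled item with probability `rate`."""
    rng = random.Random(seed)
    corrupted = []
    for seq in sequences:
        corrupted.append(
            [rng.choice(item_ids) if rng.random() < rate else item for item in seq]
        )
    return corrupted


item_ids = list(range(100))              # hypothetical item vocabulary
train = [[1, 2, 3, 4], [5, 6, 7]]        # hypothetical training sequences
noisy = corrupt_clicks(train, item_ids, rate=0.2)
```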

The following figure (Figure 4 from the original paper) illustrates the performance drop, showing the relative performance (with noisy data / with clean data) for Recall@5, NDCG@5, and MRR.

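The image is a chart of performance drop, showing how the seq2item and seq2seq strategies perform under different percentages of corrupted training data. The left panel shows the drop in Recall@5, the middle panel the change in NDCG@5, and the right panel the drop in MRR. The results show that the seq2seq strategy suffers a smaller performance drop when the training data is corrupted.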

Analysis: The figure compares two variants: one that optimizes only the seq2item loss (labeled seq2item in the plot) and the proposed method that optimizes both seq2item and seq2seq losses (labeled seq2seq).

Improved Robustness: The curves show that the recommendation performance of the model using the seq2seq training strategy (seq2seq) drops more slowly and maintains a higher relative performance than the seq2item-only variant (seq2item), especially when the corruption level is relatively modest (e.g., up to 20% noise).
Reasoning: This indicates that by mining additional supervision signals from the longer-term future and selectively learning from seq2seq samples of high confidence (through the $\lambda$ threshold and intention disentanglement), the proposed strategy is more resilient to noisy or irrelevant immediate behaviors in the training data. The longer-term signals, when filtered for shared intentions, provide a more stable and reliable learning objective, making the model less vulnerable to individual erroneous data points.
Limitations at High Noise: At very high corruption levels (e.g., 40-50%), the performance drop becomes significant for both methods, suggesting that beyond a certain point, the integrity of the data is too compromised for either strategy to fully recover. However, the seq2seq approach still demonstrates a comparative advantage in the more realistic range of moderate noise.

6.3. Ablation Studies / Parameter Analysis

6.3.1. Ablation Study

An ablation study was conducted on the Beauty dataset to understand the contribution of different components of the proposed method.

The following are the results from Table 2 of the original paper:

Variants of Our Method	Recall@1	Recall@5	Recall@10	NDCG@5	NDCG@10	MRR
(1) Remove seq2seq training	0.1358	0.3002	0.3891	0.2369	0.2675	0.2420
(2) Individually reconstruct all items in a future sequence	0.1071	0.2709	0.3744	0.1916	0.2251	0.1992
(3) Individually reconstruct the next three items	0.1202	0.2914	0.3898	0.2084	0.2403	0.2139
(0) Default	0.1522	0.3225	0.4171	0.2404	0.2709	0.2448

Analysis:

Variant (1) - Removing seq2seq training: This variant corresponds to using only the seq2item loss, essentially turning the model into a standard Transformer-based sequential recommender with the multi-intention encoder. Performance drops on every metric compared to the Default (full model), but unevenly: the drop is clear on Recall (e.g., Recall@1 falls from 0.1522 to 0.1358, about 12% in relative terms) and small on the ranking metrics (e.g., NDCG@5 from 0.2404 to 0.2369 and MRR from 0.2448 to 0.2420, roughly 1 to 1.5%). This supports the positive contribution of the proposed seq2seq training strategy in leveraging additional supervision signals from the longer-term future.
Variant (2) - Individually reconstructing all items in a future sequence: This variant attempts to perform seq2seq training by explicitly predicting every individual item in the entire future sequence, instead of predicting a single latent representation of the sequence. This approach performs even worse than Variant (1) (e.g., Recall@1 drops further to 0.1071, NDCG@5 to 0.1916). This result strongly supports the paper's first challenge: "reconstructing a future sequence containing many behaviors is exponentially harder." It also validates the design choice of latent self-supervision, demonstrating that predicting a distilled representation of the future sequence is far more effective than trying to reconstruct each item individually, which might include many irrelevant signals.
Variant (3) - Individually reconstructing the next three items: This is a compromise between predicting all future items and just the next one, trying to capture slightly longer-term but still explicit signals. It performs worse than Variant (1) on every metric except Recall@10, where it is marginally higher (0.3898 vs. 0.3891); for example, Recall@1 is 0.1202 and NDCG@5 is 0.2084. It is better than Variant (2) on all metrics. This further reinforces the finding that explicit item-by-item reconstruction of even a few future items is less effective than the latent representation approach, likely due to the presence of irrelevant items or intentions within those explicitly chosen future items.

The ablation study indicates that both the seq2seq loss component and latent self-supervision (predicting representations rather than individual items) contribute to the performance of the proposed method: removing the former, or replacing the latter with item-level reconstruction, lowers both Recall@1 and NDCG@5. The intention disentanglement aspect, which enables the selection of high-confidence seq2seq samples, is supported only implicitly, through the Default model outperforming the naive reconstruction variants, since none of the three variants isolates it.
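The difference between latent and item-level reconstruction can be illustrated with a deliberately simplified sketch. It assumes that each past sub-sequence and each future sub-sequence has already been encoded into a single vector, and it scores every past vector against all future vectors in the mini-batch with a softmax, so that only one target per sequence is predicted instead of one target per future item. The function names, the dot-product scoring, and the use of in-batch negatives only are assumptions made for illustration; the paper's actual loss operates per intention and is not reproduced here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def latent_seq2seq_loss(past_reprs, future_reprs):
    """Average in-batch softmax loss over latent representations.

    past_reprs[i] should score its own future_reprs[i] higher than the
    future representations of the other sequences in the mini-batch.
    """
    n = len(past_reprs)
    total = 0.0
    for i in range(n):
        logits = [dot(past_reprs[i], future_reprs[j]) for j in range(n)]
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]  # negative log-probability of the true pair
    return total / n


# Toy mini-batch of two sequences with 2-dimensional representations.
past = [[1.0, 0.0], [0.0, 1.0]]
aligned_future = [[1.0, 0.0], [0.0, 1.0]]
mismatched_future = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = latent_seq2seq_loss(past, aligned_future)
mismatched_loss = latent_seq2seq_loss(past, mismatched_future)
print(aligned_loss < mismatched_loss)
```

The loss is lower when each past representation matches its own future representation, which is the behavior the latent objective is meant to reward. The number of prediction targets stays at one per sequence regardless of how many items the future sub-sequence contains.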

6.3.2. Hyper-parameter Sensitivity

The paper investigates the impact of the hyper-parameter $\lambda \in [0, 1]$, which controls the threshold for deciding whether a seq2seq sample is of "high confidence" and is therefore used for self-supervised training. Setting $\lambda = 0$ is equivalent to not using seq2seq training, while $\lambda = 1$ selects all seq2seq samples for training.

The following figure (Figure 5 from the original paper) illustrates this impact.

Figure 5: Impact of the threshold hyper-parameter $\lambda \in [0, 1]$ on recommendation performance.

Analysis: The figure plots Recall@1 and NDCG@10 against different values of $\lambda$ .

$\lambda = 0$ Baseline: When $\lambda = 0$ , no seq2seq samples are used for training, making this scenario equivalent to removing seq2seq training (Variant 1 in the ablation study). The performance at $\lambda=0$ is indeed lower than the optimal point, confirming the value of seq2seq training.
Optimal $\lambda$ : The performance generally increases as $\lambda$ increases from 0, reaches a peak somewhere in the middle (e.g., around $\lambda = 0.6$ to 0.8 for Recall@1), and then starts to decline as $\lambda$ approaches 1.
Impact of $\lambda$ :
- Too Strict ( $\lambda$ too small): A very small $\lambda$ means that only a tiny fraction of the most confident seq2seq samples is used. This limits the additional supervision signal and leads to suboptimal performance, because the model learns too little from the longer-term future.
- Too Loose ( $\lambda$ too large, approaching 1): A large $\lambda$ means almost all seq2seq samples are used, including those with low confidence or involving intentions irrelevant to the input sequence. This introduces too much noise and irrelevant signals into the training process, hindering effective learning and causing performance degradation. This validates the paper's second challenge regarding multiple intentions and the need for filtering.
Dataset Dependency: The authors mention that while the trend is similar across datasets, the optimal $\lambda$ value may vary, highlighting the importance of tuning this hyper-parameter for specific applications.

This sensitivity analysis demonstrates that the careful selection of seq2seq training samples using the $\lambda$ threshold, enabled by the intention disentanglement mechanism, is crucial for maximizing the benefits of the proposed strategy.
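A minimal sketch of such a selection step is given below. It assumes that every candidate seq2seq sample carries a scalar confidence score and that $\lambda$ acts as the fraction of highest-confidence samples to keep. This reading is consistent with the two endpoints described above ($\lambda = 0$ keeps nothing, $\lambda = 1$ keeps everything), but the exact rule used in the paper is not spelled out in this analysis, so the function name `select_confident_samples` and the top-fraction rule are assumptions.

```python
def select_confident_samples(confidences, lam):
    """Return indices of the samples kept for seq2seq training.

    Assumption: lam is the fraction of highest-confidence samples to keep,
    so lam = 0 keeps no sample and lam = 1 keeps every sample.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    keep = round(lam * len(confidences))
    # Rank sample indices from most to least confident.
    ranked = sorted(range(len(confidences)),
                    key=lambda i: confidences[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical confidence scores for four candidate seq2seq samples.
scores = [0.9, 0.2, 0.6, 0.4]

none_kept = select_confident_samples(scores, 0.0)
half_kept = select_confident_samples(scores, 0.5)
all_kept = select_confident_samples(scores, 1.0)
print(none_kept, half_kept, all_kept)
```

With this rule, raising $\lambda$ admits progressively less confident samples, which matches the trade-off described above: more supervision signal at first, then more noise as $\lambda$ approaches 1.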

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper presents a novel and effective approach to enhance sequential recommender systems by overcoming the myopia and limitations of traditional sequence-to-item (seq2item) training. The core contribution is a sequence-to-sequence (seq2seq) training strategy that leverages longer-term future user behaviors for additional supervision. The method introduces two key innovations:

Latent Self-Supervision: Instead of the challenging task of reconstructing individual items in a future sequence, the model learns to predict a compact, high-level latent representation of the entire future sequence, significantly easing convergence and distilling relevant information.
Intention Disentanglement and Sample Selection: A disentangled sequence encoder is designed to capture multiple distinct user intentions. This disentanglement is then strategically used to select seq2seq training samples, ensuring that only pairs of sub-sequences with shared intentions are utilized, thereby improving the signal-to-noise ratio in supervision.

Extensive experiments on real-world datasets (Amazon Beauty, Steam, MovieLens-1M, MovieLens-20M) and synthetic noisy data consistently demonstrate that this seq2seq training strategy, when combined with standard seq2item training, leads to superior recommendation performance and increased robustness to noise.

7.2. Limitations & Future Work

The authors acknowledge the following limitations and propose future research directions:

Computational Cost of seq2seq Training: While the paper successfully demonstrates the performance benefits, seq2seq training, especially with contrastive losses over mini-batches and multiple disentangled heads, can be computationally intensive. Future work aims to reduce this cost via an "engineering-efficient framework" [59].
Performance on Long Sequences: The paper notes that the performance gains on datasets with very long average sequence lengths (MovieLens-1M and MovieLens-20M) are less impressive compared to those with shorter sequences. This suggests that effectively disentangling intentions and extracting relevant long-term signals from extremely long and complex user histories remains a challenge. Improving the model's ability to handle and benefit from such long sequences is a promising direction.

7.3. Personal Insights & Critique

The paper offers several valuable insights and represents a significant step forward in sequential recommendation:

Beyond Immediate Prediction: The fundamental shift from seq2item to latent seq2seq is conceptually powerful. It intuitively addresses the real-world limitation that users' long-term interests are not always perfectly aligned with their very next click. The idea of learning a "summary" representation of the future is an elegant way to tackle the complexity of many future items.
Novel Application of Disentanglement: The use of intention disentanglement not just for learning interpretable representations, but specifically for filtering training samples in a self-supervised context, is a highly innovative aspect. This directly addresses the signal-to-noise ratio problem that arises when mining broad future signals. It intelligently selects what to learn from, making the longer-term signals more actionable.
Implicit Disentanglement through Contrastive Loss: The observation that the contrastive nature of the softmax loss implicitly encourages disentanglement, removing the need for explicit regularization terms, is a neat theoretical and practical finding. It simplifies model design while achieving the desired property.
Enhanced Robustness: The demonstrated robustness to noisy training data is a crucial practical advantage. Real-world user behavior data is inherently noisy, and a model that can learn more reliably from such data is highly valuable.

Critique and Potential Areas for Improvement:

Scalability of $K$ Intentions: While the paper uses $K$ (number of latent categories) up to 8, in highly dynamic and diverse e-commerce scenarios, users might exhibit many more than 8 distinct intentions over a long period. Scaling $K$ and ensuring that all categories are actively learned without mode collapse could become challenging. Further research into dynamic $K$ or hierarchical disentanglement could be beneficial.
Subjectivity of Lambda: The $\lambda$ hyper-parameter, while effective, introduces a manual thresholding mechanism. Can this selection of "high confidence" seq2seq samples be made more adaptive or learned dynamically based on some uncertainty estimation?
Interpretability of Latent Categories: While the paper aims to disentangle intentions, a deeper dive into the actual interpretability of the learned prototypical intention representations $\mathbf{c}_k$ and how they align with human-understandable categories (e.g., "work items," "hobby items") could provide valuable insights. This could be explored through case studies or qualitative analysis.
Handling Very Long Sequences: The weaker performance on MovieLens datasets with extremely long sequences suggests that the current disentanglement and aggregation mechanisms might struggle to effectively summarize or differentiate intentions over hundreds of items. Perhaps a hierarchical or multi-scale attention mechanism within the disentanglement layer could better capture very long-term intention shifts or recurring patterns.
Computational Cost for Very Large Scale: While mentioned as future work, the current approach with its mini-batch softmax over $K$ options for all samples in the batch can still be costly for recommender systems with billions of items and millions of users, especially if deployed in real-time. Exploring approximate nearest neighbor search or more efficient negative sampling strategies within the contrastive loss could be vital for practical deployment.

Overall, this paper provides a robust framework for improving sequential recommendation by incorporating longer-term future signals and a sophisticated disentanglement mechanism. Its insights into latent self-supervision and intention-aware sample selection are highly transferable and could inspire future research in various sequence modeling tasks beyond recommendation.
