Disentangled Self-Supervision in Sequential Recommenders
TL;DR Summary
The paper introduces a latent self-supervised and disentangled sequence-to-sequence training strategy to address myopic predictions and lack of diversity in traditional sequential recommenders, showing significant performance improvements on real and synthetic datasets.
Abstract
To learn a sequential recommender, the existing methods typically adopt the sequence-to-item (seq2item) training strategy, which supervises a sequence model with a user’s next behavior as the label and the user’s past behaviors as the input. The seq2item strategy, however, is myopic and usually produces non-diverse recommendation lists. In this paper, we study the problem of mining extra signals for supervision by looking at the longer-term future. There exist two challenges: i) reconstructing a future sequence containing many behaviors is exponentially harder than reconstructing a single next behavior, which can lead to difficulty in convergence, and ii) the sequence of all future behaviors can involve many intentions, not all of which may be predictable from the sequence of earlier behaviors. To address these challenges, we propose a sequence-to-sequence (seq2seq) training strategy based on latent self-supervision and disentanglement. Specifically, we perform self-supervision in the latent space, i.e., reconstructing the representation of the future sequence as a whole, instead of reconstructing the items in the future sequence individually. We also disentangle the intentions behind any given sequence of behaviors and construct seq2seq training samples using only pairs of sub-sequences that involve a shared intention. Results on real-world benchmarks and synthetic data demonstrate the improvement brought by seq2seq training.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Disentangled Self-Supervision in Sequential Recommenders
1.2. Authors
- Jianxin Ma (Tsinghua University, Beijing, China; Alibaba Group, China)
- Chang Zhou (Alibaba Group, China)
- Hongxia Yang (Alibaba Group, China)
- Peng Cui (Tsinghua University, Beijing, China)
- Xin Wang (Tsinghua University, Beijing, China; Key Laboratory of Pervasive Computing, Ministry of Education, China)
- Wenwu Zhu (Tsinghua University, Beijing, China; Key Laboratory of Pervasive Computing, Ministry of Education, China)
1.3. Journal/Conference
Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '20), August 23–27, 2020, Virtual Event, CA, USA. KDD (Knowledge Discovery and Data Mining) is one of the premier conferences in the fields of data mining, data science, and big data. It is highly reputable and influential, attracting top-tier research and applications globally.
1.4. Publication Year
2020
1.5. Abstract
This paper addresses the limitations of the conventional sequence-to-item (seq2item) training strategy in sequential recommenders, which is often myopic and results in non-diverse recommendations. The authors propose to leverage longer-term future signals for supervision by introducing a novel sequence-to-sequence (seq2seq) training strategy. The core challenges of reconstructing a multi-behavior future sequence (difficulty in convergence) and handling multiple user intentions within that sequence (low signal-to-noise ratio) are tackled through two main ideas: i) latent self-supervision, where the model reconstructs the representation of the entire future sequence in a latent space instead of individual items, and ii) intention disentanglement, which identifies and utilizes only sub-sequences sharing a common intention for seq2seq training. Experimental results on real-world and synthetic datasets demonstrate that this seq2seq training strategy, when complementing seq2item training, leads to improved recommendation performance.
1.6. Original Source Link
/files/papers/6950ed9888e29060a51c8504/paper.pdf (This link indicates a PDF file often hosted by the conference proceedings or institution, and based on the provided metadata, it is officially published at KDD '20).
2. Executive Summary
2.1. Background & Motivation
The central problem addressed by this paper lies within sequential recommender systems. These systems aim to predict a user's next action (e.g., clicking an item) based on their historical sequence of interactions. The existing, standard training paradigm, known as sequence-to-item (seq2item) training, involves feeding a model a sequence of past behaviors and supervising it to predict only the immediately following single item.
This seq2item strategy, however, suffers from two critical limitations:
- Myopia and Lack of Diversity: By focusing solely on the next immediate item, the model becomes short-sighted. It tends to over-emphasize recent behavior and may fall into recommending highly similar items consecutively, leading to non-diverse recommendation lists. For instance, if a user clicks five shirts and then one pair of trousers, the seq2item approach heavily reinforces "shirt" as the next recommendation, even if a broader recommendation of both shirts and trousers would be more desirable for a diverse top-k list.
- Vulnerability to Irrelevant Immediate Behaviors: Users often have diverse and constantly evolving intentions. The very next behavior in a sequence might be an outlier, driven by momentary curiosity or an unrelated impulse, and thus irrelevant to the preceding sequence of behaviors. Training solely on such potentially noisy next-item labels can lead to a low signal-to-noise ratio in supervision, making the model less robust.

To overcome these challenges, the paper seeks to mine extra signals for supervision by looking at the longer-term future of user interactions, rather than just the immediate next item. However, this introduces new significant challenges:
- Difficulty in Reconstructing Future Sequences: Reconstructing an entire sequence of many future behaviors is exponentially harder than predicting a single item. This complexity can hinder model convergence and efficiency, especially if individual items in the future sequence are reconstructed one by one.
- Multiple User Intentions in Future Sequences: A longer future sequence might encapsulate several evolving user intentions. Not all of these future intentions might be predictable or relevant to the earlier input sequence. Without a mechanism to discern relevant intentions, the signal-to-noise ratio would again be low.

The paper's innovative idea is to propose a sequence-to-sequence (seq2seq) training strategy to complement the standard seq2item approach. This seq2seq strategy is designed to address the aforementioned challenges by:
- Performing self-supervision in the latent space to ease the reconstruction task.
- Employing intention disentanglement to filter out irrelevant future intentions and construct more meaningful training samples.
2.2. Main Contributions / Findings
The paper makes several key contributions to the field of sequential recommender systems:
- Novel seq2seq Training Strategy: The authors propose a new seq2seq training strategy that extracts additional supervision signals from a user's longer-term future interactions, moving beyond the traditional focus on only the next immediate behavior. This strategy runs in parallel to and complements the standard seq2item training.
- Latent Self-Supervision for Convergence: To mitigate the difficulty of reconstructing entire future sequences, the paper introduces self-supervision in the latent space. Instead of reconstructing individual items, the model learns to predict a compact representation (a single vector) of the entire future sequence. This "distilled pseudo behavior" summarizes the main intention of the future sequence, simplifying the learning task and boosting convergence.
- Intention Disentanglement for Sample Selection: A key innovation is the design of a sequence encoder that can infer and disentangle multiple latent intentions reflected by a sequence of behaviors. The encoder produces multiple representations for a given sequence, each corresponding to a distinct latent category. This disentanglement is then crucial for constructing seq2seq training samples: only pairs of sub-sequences (an earlier input sequence and a future target sequence) that share a common intention (i.e., belong to the same latent category) are used for training. This mechanism improves the signal-to-noise ratio by filtering out irrelevant future intentions.
- Empirical Demonstration of Efficacy: Through extensive experiments on real-world benchmark datasets (Amazon Beauty, Steam, MovieLens-1M, MovieLens-20M) and synthetic data corrupted with noise, the paper empirically demonstrates that the proposed seq2seq training strategy consistently improves recommendation performance across various metrics (Recall, NDCG, MRR). The results also show enhanced robustness to noise in training data.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
-
Sequential Recommender Systems: These systems aim to predict what a user will interact with next, based on their ordered history of past interactions (e.g., clicks, purchases, views). Unlike traditional recommender systems that might only consider overall preferences, sequential recommenders explicitly model the temporal dynamics and order of user behavior.
-
Deep Learning (Deep Sequential Models): A subfield of machine learning that uses artificial neural networks with multiple layers (deep neural networks) to learn representations of data with multiple levels of abstraction. In the context of sequential recommenders, deep sequential models like Recurrent Neural Networks (RNNs) and Transformer networks are used to capture complex patterns and dependencies within user behavior sequences.
- Recurrent Neural Networks (RNNs): A class of neural networks designed to process sequential data. They have internal memory that allows them to use information from previous steps in the sequence to influence the processing of the current step.
GRU4Rec is a prominent example of an RNN-based recommender.
- Self-Attention Networks (Transformers): A neural network architecture that relies heavily on the attention mechanism to weigh the importance of different parts of the input sequence when processing each element. Transformers are highly effective for sequential tasks, overcoming some limitations of RNNs regarding long-range dependencies and parallelization. SASRec and BERT4Rec are examples of Transformer-based recommenders.
-
Self-Supervised Learning (SSL): A paradigm in machine learning where a model learns representations from unlabeled data by solving a "pretext task." The pretext task is designed such that solving it requires the model to understand certain aspects of the data, thereby learning useful features or representations without explicit human labels. For example, predicting a masked word in a sentence (like BERT's
Cloze task) or predicting future frames in a video from past frames are pretext tasks. The learned representations can then be fine-tuned for downstream tasks like recommendation.
-
Disentangled Representation Learning: This aims to learn representations where different, independent explanatory factors of variation in the data are captured by distinct, independent dimensions or parts of the learned representation. For example, in an image of a face, disentangled representations might separately encode factors like "gender," "age," and "expression." In recommender systems, this means learning separate representations for a user's distinct intentions (e.g., "shopping for work clothes" vs. "browsing for leisure items").
-
Contrastive Learning: A method used in self-supervised learning where the model learns by contrasting positive pairs (similar data points) with negative pairs (dissimilar data points). The goal is to make the representations of positive pairs closer in the latent space and those of negative pairs farther apart. The
softmax function is often used in contrastive losses to normalize scores and convert them into probabilities.
-
Softmax Function: A function that takes a vector of arbitrary real-valued scores and transforms them into a probability distribution. Each element of the output vector is a probability value between 0 and 1, and all elements sum up to 1. It is commonly used in the output layer of multi-class classification neural networks. $ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} $ where $z_i$ is the $i$-th element of the input vector $\mathbf{z}$, and $C$ is the total number of classes.
-
Layer Normalization: A technique used in neural networks to normalize the inputs across the features of an individual sample within a layer. Unlike Batch Normalization, which normalizes across the batch dimension, Layer Normalization normalizes across the feature dimension, making it suitable for recurrent networks and Transformers where batch sizes might vary or sequence lengths differ. It helps stabilize training and speed up convergence. $ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta $ where $\mu$ is the mean of the input features, $\sigma$ is the standard deviation, $\gamma$ and $\beta$ are learnable scale and shift parameters, and $\odot$ denotes element-wise multiplication.
-
Dot Product and Cosine Similarity: Both are measures of similarity between two vectors.
- Dot Product: $\mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i b_i$. A larger dot product typically indicates greater similarity, especially if vector magnitudes are similar.
- Cosine Similarity: $\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$. This measures the cosine of the angle between two vectors, ranging from -1 (opposite) to 1 (identical direction), regardless of their magnitudes. It's often preferred when magnitude differences should not influence similarity.
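To make these building blocks concrete, here is a small, self-contained NumPy sketch (our own illustrative code, not from the paper) of a numerically stable softmax, a dot product, and a cosine similarity:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the output sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cosine_similarity(a, b):
    # Dot product normalized by the vector magnitudes; value lies in [-1, 1].
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # ~[0.659, 0.242, 0.099]
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(np.dot(a, b))             # 28.0 (unnormalized similarity)
print(cosine_similarity(a, b))  # 1.0 (identical direction)
```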
3.2. Previous Works
The paper contextualizes its contributions by discussing several categories of prior research:
-
Early Recommender Systems (Collaborative Filtering & Matrix Factorization):
- Collaborative Filtering (CF): Techniques that make recommendations by collecting preferences from many users and finding users with similar tastes (user-based CF) or items that are frequently interacted with together (item-based CF). Examples include
BPR-MF (Bayesian Personalized Ranking - Matrix Factorization).
- Matrix Factorization (MF): A popular technique within CF that decomposes the user-item interaction matrix into two lower-dimensional matrices: a user latent factor matrix and an item latent factor matrix. The dot product of a user's latent vector and an item's latent vector estimates the user's preference for that item.
- Neural Collaborative Filtering (NCF): Extends MF by replacing the simple dot product with a neural network to model user-item interactions, allowing for more complex non-linear relationships.
- Relevance to current paper: These methods often neglect the sequential nature of user interactions, treating them as independent events.
-
Markov Chains for Sequential Recommendation:
- These models explicitly consider the order of interactions, assuming that a user's next action primarily depends on their immediate past action(s).
First-order Markov chains consider only the last action, while higher-order Markov chains consider a fixed window of past actions.
- Factorized Personalized Markov Chains (FPMC): A prominent model that combines matrix factorization with a first-order Markov chain to capture both general user preferences and sequential transitions.
- Relevance to current paper: While addressing sequence order, these models often have limited capacity to capture complex, long-range dependencies and evolving patterns.
-
Deep Sequential Recommenders:
- These leverage the expressive power of deep neural networks to model sequential user behaviors.
- GRU4Rec / GRU4Rec+: Based on Gated Recurrent Units (GRUs), a type of RNN, these models process sequences of items to predict the next item, effectively capturing temporal dependencies. GRU4Rec introduced techniques such as session-parallel mini-batch training, and GRU4Rec+ further improved the loss functions and sampling strategy.
- Caser: Uses Convolutional Neural Networks (CNNs) to model sequential patterns, where horizontal convolutions capture general sequential features and vertical convolutions capture local sequential patterns.
- SASRec (Self-Attentive Sequential Recommendation): A groundbreaking model that adapts the Transformer encoder for sequential recommendation. It uses the self-attention mechanism to allow the model to weigh the importance of all previous items in a user's sequence when predicting the next item, effectively capturing long-range dependencies without relying on recurrence. SASRec demonstrated that a Transformer-based architecture can achieve state-of-the-art performance. The core self-attention mechanism, as introduced in "Attention Is All You Need" (Vaswani et al., 2017), is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings, and $d_k$ is the dimension of the key vectors. This allows the model to selectively focus on relevant past items (a short sketch of this computation is given right after this group of methods).
- BERT4Rec: Further advances Transformer-based sequential recommendation by adopting the BERT (Bidirectional Encoder Representations from Transformers) pre-training objective. Instead of predicting only the next item (unidirectional), BERT4Rec uses a Cloze task (similar to masked language modeling) where random items in a sequence are masked, and the model is trained to predict them using bidirectional context. This allows it to learn rich contextual representations.
- Relevance to current paper: These models typically follow a sequence-to-item training strategy, where the future sequence is explicitly reconstructed item by item (or the next item is predicted). This paper argues this is myopic.
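As a quick illustration of the scaled dot-product attention used by these Transformer-based recommenders, the following is a toy single-head NumPy sketch (our own simplification; it omits the causal masking, multi-head projections, and feed-forward layers of the real SASRec/BERT4Rec implementations):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

t, d = 5, 8                                         # toy sequence length and dimension
X = np.random.randn(t, d)                           # item embeddings + position embeddings
out = scaled_dot_product_attention(X, X, X)         # self-attention: Q = K = V = X
print(out.shape)                                    # (5, 8)
```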
-
Disentangling a User's Mixed Intentions:
- Research in this area focuses on separating the underlying independent factors that explain observed data.
- Methods often involve
Variational Autoencoders (VAEs) with regularization terms (e.g., beta-VAE) to minimize mutual information between different parts of the latent representation, thereby encouraging disentanglement.
- Other approaches involve mixture data models or methods for disentangling intentions in graph data (e.g., social networks).
- Relevance to current paper: This paper also employs disentanglement, but for a specific purpose: to identify shared intentions between input and target sequences for seq2seq sample selection, rather than just learning general disentangled representations.
-
Self-Supervision and Contrastive Learning:
- Beyond the specific
Cloze task used by BERT, self-supervised learning encompasses a wide range of pretext tasks, such as predicting parts of an object, recovering the original order, or discriminating relationships.
- Contrastive Predictive Coding (CPC): A notable self-supervised approach that learns representations by predicting future observations in a latent space. It uses a contrastive loss to distinguish positive future samples from negative samples.
- Relevance to current paper: This paper explicitly draws inspiration from CPC by performing self-supervision in the latent space. However, it extends CPC by incorporating disentangled representation learning to handle multiple intentions, which CPC in its general form does not typically address.
3.3. Technological Evolution
The evolution of recommender systems, particularly sequential ones, can be viewed as a progression from static preference modeling to dynamic, context-aware, and increasingly sophisticated representation learning:
-
Static Preference Models (Early 2000s): Initial systems focused on
Collaborative Filtering (CF) and Matrix Factorization (MF), primarily capturing general user preferences without considering the order of interactions.
Sequential Behavior Models (Mid-2000s to Early 2010s): The recognition of the importance of sequence led to models like
Markov Chains (e.g., FPMC), which explicitly modeled transitions between items. However, these were often limited to short-term dependencies and lacked the capacity for complex patterns.
Deep Learning for Sequences (Mid-2010s onwards): The advent of deep learning, particularly
Recurrent Neural Networks (RNNs) like the GRU (e.g., GRU4Rec), significantly enhanced the ability to model long-range dependencies and complex non-linear patterns in sequences.
Attention-based and Bidirectional Models (Late 2010s):
Self-attention networks (Transformers) revolutionized sequence modeling (e.g., SASRec) by allowing parallel processing and more effective capture of global dependencies. Bidirectional models like BERT4Rec further improved representations by allowing context from both past and future in pre-training.
Self-Supervised Learning and Disentanglement (Late 2010s onwards): The current paper represents a frontier by integrating
self-supervised learning (for mining signals from unlabeled data) and disentangled representation learning (for handling multiple user intentions). It builds upon Transformer architectures but innovates in the training strategy by moving towards latent seq2seq supervision and smart sample selection.

This paper fits into this timeline by building on the success of Transformer-based models like SASRec and BERT4Rec, but fundamentally alters their training objective. Instead of merely predicting the next item or masking items in a sequence (both forms of seq2item or Cloze tasks in data space), it proposes to self-supervise by predicting representations of future sequences in a latent space, combined with a novel disentanglement mechanism for improving the quality of these seq2seq signals.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's approach offers several core differences and innovations:
- From Seq2Item to Latent Seq2Seq Training: The most significant departure is from the prevalent sequence-to-item (seq2item) paradigm (used by GRU4Rec, SASRec, and BERT4Rec) to a novel sequence-to-sequence (seq2seq) training strategy. While seq2item only supervises the prediction of the immediate next item, this paper's seq2seq strategy explicitly leverages the entire longer-term future sequence for supervision.
- Self-Supervision in Latent Space vs. Data Space: Existing deep sequential recommenders often perform prediction tasks directly in the data space (e.g., predicting the exact item ID, as in SASRec, or reconstructing masked item IDs, as in BERT4Rec). This paper proposes self-supervision in the latent space, where the model predicts a compact representation of the future sequence, rather than its individual items. This is inspired by Contrastive Predictive Coding (CPC) but specifically adapted for recommendation. This abstract prediction task is less prone to the "exponentially harder" problem of item-by-item reconstruction of long sequences.
- Disentanglement for Seq2Seq Sample Selection: While disentangled representation learning has been explored in various contexts, this paper applies it uniquely to filter seq2seq training samples. Previous disentanglement efforts often aim to learn interpretable or independent representations. Here, the intention disentanglement layer explicitly aims to identify shared intentions between the input and target sequences. Only when such a shared intention is detected is a seq2seq sample considered "high confidence" and used for training. This addresses the "low signal-to-noise ratio" problem arising from multiple, potentially irrelevant, intentions in long future sequences.
- Complementary Training Objective: Unlike methods that replace seq2item or Cloze objectives, this paper's seq2seq strategy complements the standard seq2item training. This indicates a hybrid approach that aims to retain the benefits of immediate prediction while adding longer-term awareness.
- Implicit Disentanglement Regularization: The paper notes that its contrastive seq2seq loss function naturally encourages disentanglement, as the score for a positive case based on one latent intention must be higher than scores based on other, different latent intentions (from the same input or other samples in the batch). This removes the need for explicit regularization terms typically used in disentangled learning, simplifying the model.

In essence, this work innovates by redefining what is predicted (a latent sequence representation vs. an item ID), how it's predicted (latent self-supervision), and when these longer-term predictions are used (filtered by disentangled intentions), all within a framework that enhances existing seq2item models.
4. Methodology
4.1. Principles
The core idea behind the proposed methodology is to address the limitations of myopic sequence-to-item (seq2item) training in sequential recommenders by introducing a complementary sequence-to-sequence (seq2seq) training strategy. This seq2seq strategy aims to mine additional, longer-term supervision signals from a user's future behavior. The two main principles guiding this approach are:
-
Latent Self-Supervision: Instead of trying to reconstruct every individual item in a potentially long and complex future sequence (which is "exponentially harder" and can lead to convergence difficulties), the model performs self-supervision in the latent space. This means it learns to predict a concise, high-level representation (a single vector) that summarizes the overall intention or meaning of the entire future sequence, given the earlier sequence. This simplifies the prediction task and distills the most relevant information.
- Intention Disentanglement for Relevant Sample Selection: User behavior sequences, especially long ones, can involve multiple evolving intentions. Not all intentions in a future sequence might be predictable or relevant to the input sequence. To address this low signal-to-noise ratio, the model is equipped with an intention disentanglement layer. This layer learns to extract multiple, distinct latent intention representations from any given sequence. Crucially, these disentangled intentions are then used to intelligently select seq2seq training samples: only pairs of input and future sequences that are inferred to share a common underlying intention (i.e., belong to the same latent category) are used for seq2seq training. This ensures that the model learns from meaningful and aligned long-term signals.

By combining these two principles, the proposed seq2seq strategy provides a robust and efficient way to leverage future behavioral signals, enhancing the existing seq2item training.
4.2. Core Methodology In-depth (Layer by Layer)
The proposed methodology integrates a novel seq2seq self-supervision loss with a specially designed disentangled sequence encoder, operating in parallel with the traditional seq2item loss.
4.2.1. Notations and Problem Formulation
Let's first clarify the key notations used throughout the paper, as summarized in Table 1:
The input data consists of user click sequences, denoted as $\{\mathbf{x}^{(u)}\}_{u=1}^{N}$. For a specific user $u$, the sequence is $\mathbf{x}^{(u)} = [x_1^{(u)}, x_2^{(u)}, \ldots, x_{T_u}^{(u)}]$, where $T_u$ is the length of the sequence and $x_t^{(u)} \in \{1, 2, \ldots, M\}$ is the index of the item clicked at time $t$. $M$ is the total number of unique items and $N$ is the number of users.
The model consists of a sequence encoder $\boldsymbol{\phi}_{\theta}(\cdot)$ and an item embedding table $\mathbf{H} \in \mathbb{R}^{M \times D}$, where $\boldsymbol{\theta}$ encompasses all trainable parameters, including $\mathbf{H}$. $D$ is the dimensionality of the latent representations. Each item $i$ has a representation $\mathbf{H}_{i,:} \in \mathbb{R}^{D}$.
The encoder is designed to output $K$ different $D$-dimensional vectors, $\boldsymbol{\phi}_{\theta}^{(1)}(\cdot), \ldots, \boldsymbol{\phi}_{\theta}^{(K)}(\cdot)$, where each vector $\boldsymbol{\phi}_{\theta}^{(k)}(\cdot)$ represents the user's intention under the $k$-th latent category.
The primary task remains candidate generation for sequential recommendation: predicting the next item(s) a user is likely to click, based on their observed sequence $\mathbf{x}^{(u)}$.
Traditional Sequence-to-Item (seq2item) Training: The standard training approach uses a truncated past sequence $\mathbf{x}^{(u)}_{1:t} = [x_1^{(u)}, \ldots, x_t^{(u)}]$ as input to predict the immediate next item $x_{t+1}^{(u)}$. The common loss function is a negative log-likelihood: $ \mathcal{L}_{s2i}(\boldsymbol{\theta}) = \sum_u \sum_t \mathcal{L}_{s2i}(\boldsymbol{\theta}, u, t) $ with $ \mathcal{L}_{s2i}(\boldsymbol{\theta}, u, t) = - \ln p_{\theta}\big(x_{t+1}^{(u)} \mid x_1^{(u)}, x_2^{(u)}, \ldots, x_t^{(u)}\big) $ where $p_{\theta}(\cdot)$ is typically proportional to the similarity between the sequence representation and the item representation in the vector space.
| Notation | Description |
|---|---|
| $N$ | the number of sequences, aka. the number of users |
| $M$ | the number of items |
| $D$ | the dimensionality of the latent representations |
| $K$ | the number of disentangled user intentions |
| $\mathbf{x}^{(u)} = [x_1^{(u)}, x_2^{(u)}, \ldots, x_{T_u}^{(u)}]$ | the sequence of items clicked by the $u$-th user, where $x_t^{(u)} \in \{1, 2, \ldots, M\}$ is the $t$-th click |
| $T_u$ | the length of the $u$-th user's click sequence $\mathbf{x}^{(u)}$ |
| $\boldsymbol{\theta}$ | parameters of the sequence encoder |
| $\mathbf{H} \in \mathbb{R}^{M \times D}$ | the item embedding table, included in $\boldsymbol{\theta}$ |
| $\mathbf{H}_{i,:} \in \mathbb{R}^{D}$ | the $i$-th item's representation, i.e., the $i$-th row of $\mathbf{H}$ |
| $\mathbf{h}_t^{(u)} \in \mathbb{R}^{D}$ | the representation of item $x_t^{(u)}$, i.e., row $x_t^{(u)}$ of $\mathbf{H}$ |
| $\boldsymbol{\phi}_{\theta}^{(k)}(\cdot) \in \mathbb{R}^{D}$ | the representation of user $u$'s intention under the $k$-th latent category, where $1 \leq k \leq K$ |
| $\lambda \in [0, 1]$ | the threshold for selecting sequence-to-sequence training samples of high confidence |
| $\mathcal{B}$ | a mini-batch of sequences for training |
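As a rough illustration of the classic seq2item objective above, the following NumPy sketch (hypothetical helper names; the real implementation uses a sampled softmax over mini-batches rather than the full item set) computes the negative log-likelihood of the next item given a sequence representation:

```python
import numpy as np

def seq2item_nll(seq_repr, item_table, next_item):
    """-ln p(next item | past), with p proportional to the similarity between
    the sequence representation (D,) and every row of the (M, D) item table."""
    scores = item_table @ seq_repr                   # (M,) similarity scores
    scores -= scores.max()                           # numerical stability
    log_prob = scores - np.log(np.exp(scores).sum())
    return -log_prob[next_item]

M, D = 1000, 16                                      # toy item count and dimension
loss = seq2item_nll(np.random.randn(D), np.random.randn(M, D), next_item=42)
print(loss)
```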
4.2.2. Sequence-to-Sequence Self-Supervision
The paper's core contribution is the seq2seq training strategy, which runs in parallel with the standard seq2item training. During each mini-batch gradient descent step, both losses are minimized.
Mini-batch Construction: A mini-batch $\mathcal{B}$ is formed by uniformly sampling (u, t) pairs from the training set, where $1 \leq u \leq N$ and $1 \leq t < T_u$. Each pair defines an earlier sequence $\mathbf{x}^{(u)}_{1:t} = [x_1^{(u)}, \ldots, x_t^{(u)}]$ and a corresponding future sequence $[x_{t+1}^{(u)}, \ldots, x_{T_u}^{(u)}]$. For the seq2seq target, the paper uses a reversed version of the future sequence, denoted as $\mathbf{x}^{(u)}_{T_u:t+1} = [x_{T_u}^{(u)}, \ldots, x_{t+1}^{(u)}]$. This reversal is specifically for the latent encoding of the target sequence, allowing items closer to the split point to potentially gain more weight, which aligns with the typical recency bias in sequential recommendation.
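A minimal sketch of this sample construction might look as follows (the function and variable names are our own; the paper does not spell out the data pipeline at this level of detail):

```python
import random

def sample_seq2seq_pair(sequences):
    """Sample one (u, t) pair and return the earlier sub-sequence plus the
    reversed future sub-sequence used as the seq2seq target."""
    u = random.randrange(len(sequences))
    seq = sequences[u]                      # x^(u), ordered by timestamp
    t = random.randrange(1, len(seq))       # split point, 1 <= t < T_u
    past = seq[:t]                          # x^(u)_{1:t}
    future_reversed = seq[t:][::-1]         # x^(u)_{T_u:t+1} (reversed order)
    return u, t, past, future_reversed

sequences = [[3, 7, 7, 2, 9, 4], [1, 5, 8]]  # toy click sequences of item ids
print(sample_seq2seq_pair(sequences))
```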
Sequence-to-sequence (seq2seq) Loss:
For a given training sample (u, t) and a latent category $k$, the seq2seq loss is defined as the negative log-likelihood of predicting the latent representation of the future sequence $\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{T_u:t+1}^{(u)})$ given the latent representation of the earlier sequence $\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)})$. This is framed as a contrastive learning task. The model tries to make the representation of the true future sequence (for category $k$) similar to the representation of the input sequence (for the same category $k$), while making it dissimilar to representations of other future sequences (from other categories or other samples in the mini-batch).
The seq2seq loss for a specific user $u$, time step $t$, and latent category $k$ is:
$
\mathcal{L}_{s2s}(\boldsymbol{\theta}, u, t, k) = - \ln p_{\theta}\big(\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{T_u:t+1}^{(u)}) \mid \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)})\big) = - \ln \frac{ \exp\left( \frac{1}{\sqrt{D}} \, \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{T_u:t+1}^{(u)}) \cdot \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)}) \right) }{ \sum_{(u', t') \in \mathcal{B}} \sum_{k'=1}^{K} \exp\left( \frac{1}{\sqrt{D}} \, \boldsymbol{\phi}_{\theta}^{(k')}(\mathbf{x}_{T_{u'}:t'+1}^{(u')}) \cdot \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)}) \right) }
$
Here:
- $\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{T_u:t+1}^{(u)})$ is the $k$-th latent intention representation of the reversed future sequence of user $u$, starting from time step $t+1$.
- $\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)})$ is the $k$-th latent intention representation of the earlier sequence of user $u$ up to time $t$.
- The dot product measures the similarity between the predicted and target latent representations for the same intention category $k$. This is the positive pair.
- The scaling factor $\frac{1}{\sqrt{D}}$ is applied to the dot-product scores, which is a common practice in Transformers and helps stabilize convergence, especially when the last layer is LayerNormalization.
- The denominator is a sum over all future-sequence representations from all latent categories $k'$ and all samples $(u', t')$ within the current mini-batch $\mathcal{B}$. These serve as negative samples, forcing the model to distinguish the true match from other irrelevant latent representations. This softmax is normalized over the mini-batch, which is a common technique in contrastive learning to approximate normalization over the entire training set.
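The in-batch contrastive loss above can be sketched as follows, assuming the disentangled encoder outputs are already available as arrays of shape (B, K, D) for the earlier and reversed future sub-sequences (a NumPy illustration of the math only; the actual model computes this on TensorFlow tensors with gradients):

```python
import numpy as np

def seq2seq_contrastive_loss(past, future):
    """past, future: arrays of shape (B, K, D) holding the K disentangled
    intention representations of the earlier sub-sequence and of the
    (reversed) future sub-sequence for each sample in the mini-batch.
    Returns the per-(sample, intention) losses, shape (B, K)."""
    B, K, D = past.shape
    # Similarity between every past intention and every future intention in the batch.
    logits = np.einsum('bkd,cjd->bkcj', past, future) / np.sqrt(D)   # (B, K, B, K)
    logits = logits.reshape(B, K, B * K)
    # Log-softmax over all B*K candidate future representations (one positive, rest negatives).
    logits -= logits.max(axis=-1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # The positive for (b, k) is the same sample's future representation under the same k.
    pos_index = np.arange(B)[:, None] * K + np.arange(K)[None, :]    # (B, K)
    return -np.take_along_axis(log_prob, pos_index[..., None], axis=-1)[..., 0]

B, K, D = 4, 3, 8
losses = seq2seq_contrastive_loss(np.random.randn(B, K, D), np.random.randn(B, K, D))
print(losses.shape)  # (4, 3)
```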
Selecting Samples of High Confidence for seq2seq Training:
A crucial aspect of the seq2seq strategy is to only learn from "high confidence" samples. Not all future sequences are genuinely predictable from earlier behaviors, especially if intentions shift. To filter out irrelevant samples, the seq2seq loss is only computed for samples (u, t, k) where the current value of the loss (before optimization) is below a certain threshold. This threshold is dynamically determined.
The total seq2seq loss for a mini-batch is:
$
\mathcal{L}_{s2s}(\boldsymbol{\theta}, \mathcal{B}) = \sum_{(u, t) \in \mathcal{B}} \sum_{k=1}^{K} \mathcal{L}_{s2s}(\boldsymbol{\theta}, u, t, k) \cdot \mathbb{I}\big[ \mathcal{L}_{s2s}(\boldsymbol{\theta}, u, t, k) \leq \tau \big]
$
where:
- $\mathbb{I}[\cdot]$ is an indicator function, which is 1 if the condition inside the bracket is true, and 0 otherwise.
- $\tau$ is a dynamic threshold. It is set to the $\lceil \lambda \cdot |\mathcal{B}| \cdot K \rceil$-th smallest value among all computed seq2seq losses within the current mini-batch.
- $\lambda \in [0, 1]$ is a hyper-parameter. It controls the percentage of highest-confidence seq2seq samples that are kept for training. For example, if $\lambda = 0.1$, only the top 10% of samples (those with the smallest current loss values, implying the model is already somewhat confident in the prediction) are used. This mechanism helps to focus learning on the most relevant and predictable seq2seq relationships.
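Continuing the sketch above, the confidence-based filtering can be illustrated as follows (our own illustrative code; in practice the indicator mask would be treated as a constant during back-propagation):

```python
import numpy as np

def filter_high_confidence(losses, lam):
    """Keep only the lam fraction of (sample, intention) pairs whose current
    seq2seq loss is smallest, i.e. the pairs the model is already most
    confident about. losses: (B, K) array; lam in [0, 1]."""
    flat = losses.ravel()
    n_keep = max(1, int(np.ceil(lam * flat.size)))
    tau = np.sort(flat)[n_keep - 1]          # the n_keep-th smallest loss value
    mask = losses <= tau                     # indicator I[L_s2s <= tau]
    return (losses * mask).sum(), mask

losses = np.random.rand(4, 3)                # e.g. the output of the previous sketch
total_s2s, mask = filter_high_confidence(losses, lam=0.1)
print(total_s2s, int(mask.sum()))            # only ~10% of the 12 pairs contribute
```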
Sequence-to-Item (seq2item) Loss:
The traditional seq2item training is maintained as it's crucial for learning a proper encoder quickly and aligning item and sequence representations. The paper adapts the seq2item loss to leverage the multiple disentangled intentions from the encoder. Instead of just one sequence representation, the model now considers all intention representations when predicting the next item. It takes the maximum similarity score among all intentions.
The total seq2item loss for a mini-batch is:
$
\mathcal{L}_{s2i}(\boldsymbol{\theta}, \mathcal{B}) = \sum_{(u, t) \in \mathcal{B}} \mathcal{L}_{s2i}(\boldsymbol{\theta}, u, t)
$
$
\mathcal{L}_{s2i}(\boldsymbol{\theta}, u, t) = - \ln p_{\theta}\big(\mathbf{h}_{t+1}^{(u)} \mid \boldsymbol{\phi}_{\theta}(\mathbf{x}_{1:t}^{(u)})\big) = - \ln \frac{ \max_{k \in \{1, 2, \ldots, K\}} \exp\left( \frac{1}{\sqrt{D}} \, \mathbf{h}_{t+1}^{(u)} \cdot \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)}) \right) }{ \sum_{(u', t') \in \mathcal{B}} \sum_{k'=1}^{K} \exp\left( \frac{1}{\sqrt{D}} \, \mathbf{h}_{t'+1}^{(u')} \cdot \boldsymbol{\phi}_{\theta}^{(k')}(\mathbf{x}_{1:t}^{(u)}) \right) }
$
Here:
- $\mathbf{h}_{t+1}^{(u)}$ is the embedding of the actual next item $x_{t+1}^{(u)}$ (i.e., row $x_{t+1}^{(u)}$ of the item embedding table $\mathbf{H}$). This is the positive item.
- $\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)})$ is the $k$-th latent intention representation of the earlier sequence.
- The numerator calculates the similarity between the true next item's embedding and the most relevant of the $K$ intention representations from the input sequence (selected by max). This max operation implies that if any of the disentangled intentions can predict the next item, it contributes to the positive score.
- The denominator is a summation over all next-item embeddings $\mathbf{h}_{t'+1}^{(u')}$ in the mini-batch and all $K$ latent intention representations of the input sequence. This again forms a contrastive loss where the model learns to distinguish the true next item from other items (negative samples) based on the learned sequence representations.
- The scaling factor $\frac{1}{\sqrt{D}}$ is used, similar to the seq2seq loss.
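A matching sketch of this disentangled seq2item loss, with the max over the K intentions in the numerator and in-batch items as negatives in the denominator (illustrative NumPy only, not the paper's TensorFlow code):

```python
import numpy as np

def seq2item_loss(past, next_item_emb):
    """past: (B, K, D) disentangled intention representations of each input
    sequence. next_item_emb: (B, D) embedding of each sample's true next item.
    Returns the per-sample losses, shape (B,)."""
    B, K, D = past.shape
    # scores[b, c, k]: similarity between sequence b's k-th intention and the
    # next item of sample c (next items of other samples act as negatives).
    scores = np.einsum('bkd,cd->bck', past, next_item_emb) / np.sqrt(D)  # (B, B, K)
    # Numerator: the best-matching intention for sample b's own next item.
    pos = scores[np.arange(B), np.arange(B), :].max(axis=-1)             # (B,)
    # Denominator: log-sum-exp over all in-batch items and all K intentions.
    flat = scores.reshape(B, -1)
    m = flat.max(axis=-1)
    lse = m + np.log(np.exp(flat - m[:, None]).sum(axis=-1))
    return -(pos - lse)

B, K, D = 4, 3, 8
print(seq2item_loss(np.random.randn(B, K, D), np.random.randn(B, D)).shape)  # (4,)
```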
Overall Training Loss:
The final objective during training is to minimize the sum of both seq2item and seq2seq losses for each mini-batch:
$
\mathcal{L}(\boldsymbol{\theta}, \mathcal{B}) = \mathcal{L}_{s2i}(\boldsymbol{\theta}, \mathcal{B}) + \mathcal{L}_{s2s}(\boldsymbol{\theta}, \mathcal{B})
$
This combined loss ensures that the model maintains strong short-term predictive power while also learning from longer-term future signals.
4.2.3. Disentangled Sequence Encoding
The paper utilizes a modified SASRec encoder as its base, augmented with a custom intention-disentanglement layer.
Base Encoder (SASRec):
The SASRec encoder [25] is a Transformer-based model that uses multi-head self-attention. For an input sequence $\mathbf{x}^{(u)}_{1:t}$, it first obtains item embeddings (from the shared item embedding table $\mathbf{H}$) and adds trainable position embeddings to them. After passing through several self-attention blocks, the single-head SASRec encoder outputs a sequence of $D$-dimensional vectors $[\mathbf{z}_1^{(u)}, \mathbf{z}_2^{(u)}, \ldots, \mathbf{z}_t^{(u)}]$. Each $\mathbf{z}_i^{(u)}$ can be interpreted as the latent intention of the user when clicking item $x_i^{(u)}$, considering its context within the sequence.
The paper notes that a simple multi-head SASRec doesn't inherently lead to disentanglement, as its heads often focus on similar aspects (e.g., the latest click). Therefore, a specific intention-disentanglement layer is appended.
Intention Clustering:
The disentanglement layer starts by associating each item's latent intention $\mathbf{z}_i^{(u)}$ with one of $K$ predefined prototypical intention representations, denoted as $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_K \in \mathbb{R}^{D}$. These prototypes are learnable parameters. The association is quantified by a probability-like attention weight $p_{k|i}$:
$
p_{k|i} = \frac{ \exp\Big( \frac{1}{\sqrt{D}} \, \mathrm{LayerNorm}_1(\mathbf{z}_i^{(u)}) \cdot \mathrm{LayerNorm}_2(\mathbf{c}_k) \Big) }{ \sum_{k'=1}^{K} \exp\Big( \frac{1}{\sqrt{D}} \, \mathrm{LayerNorm}_1(\mathbf{z}_i^{(u)}) \cdot \mathrm{LayerNorm}_2(\mathbf{c}_{k'}) \Big) }
$
where:
- $1 \leq i \leq t$ (positions in the sequence).
- $1 \leq k \leq K$ (latent categories).
- $\mathrm{LayerNorm}_1$ and $\mathrm{LayerNorm}_2$ are distinct LayerNormalization layers. Normalizing the vectors before the dot product effectively makes this a cosine-similarity measurement.
- The scaling factor $\frac{1}{\sqrt{D}}$ is applied.
- This equation essentially calculates a softmax over the similarities between the normalized item intention $\mathbf{z}_i^{(u)}$ and each of the $K$ prototypes $\mathbf{c}_k$. It indicates how likely the intention at position $i$ belongs to the $k$-th latent category. The use of cosine similarity is noted to be more robust against mode collapse (where some prototypes might be ignored) than simple dot products.
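The intention-clustering step can be sketched as follows (a simplified NumPy illustration; the learnable scale and shift of the LayerNorm layers are omitted, and the helper names are our own):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Per-vector layer normalization; learnable scale/shift omitted for brevity.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def intention_clustering(z, prototypes):
    """z: (t, D) per-position intentions from the base encoder.
    prototypes: (K, D) learnable prototype vectors c_1..c_K.
    Returns p[k|i], shape (t, K)."""
    D = z.shape[-1]
    # LayerNorm on both sides makes the scaled dot product behave like a cosine similarity.
    logits = layer_norm(z) @ layer_norm(prototypes).T / np.sqrt(D)   # (t, K)
    logits -= logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)                          # softmax over K

t, D, K = 6, 8, 4
p_k_given_i = intention_clustering(np.random.randn(t, D), np.random.randn(K, D))
print(p_k_given_i.sum(axis=-1))   # each row sums to 1
```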
Intention Weighting:
Beyond categorizing intentions, it's also important to weigh their significance for predicting future behaviors. A second attention mechanism computes a weight $p_i$ that indicates how important the primary intention at position $i$ is for predicting the user's future intentions:
$
p_i = \frac{ \exp\left( \frac{1}{\sqrt{D}} \, \mathbf{key}_i \cdot \mathbf{query} \right) }{ \sum_{i'=1}^{t} \exp\left( \frac{1}{\sqrt{D}} \, \mathbf{key}_{i'} \cdot \mathbf{query} \right) }
$
The key and query vectors are derived as follows:
$
\mathbf{key}_i = \widetilde{\mathbf{key}}_i + \mathrm{ReLU}(\mathbf{W}^{\top} \widetilde{\mathbf{key}}_i + \mathbf{b})
$
$
\widetilde{\mathbf{key}}_i = \mathrm{LayerNorm}_3 ( \boldsymbol{\alpha}_i + \mathbf{z}_i^{(u)} )
$
$
\mathrm{query} = \mathrm{LayerNorm}_4 ( \boldsymbol{\alpha}_t + \mathbf{z}_t^{(u)} + \mathbf{b}^{\prime} )
$
where:
- $1 \leq i \leq t$.
- $\mathbf{W} \in \mathbb{R}^{D \times D}$, $\mathbf{b} \in \mathbb{R}^{D}$, and $\mathbf{b}' \in \mathbb{R}^{D}$ are trainable parameters.
- $\boldsymbol{\alpha}_1, \boldsymbol{\alpha}_2, \ldots, \boldsymbol{\alpha}_t \in \mathbb{R}^{D}$ are position embeddings specific to this disentanglement layer (separate from SASRec's position embeddings).
- $\mathrm{LayerNorm}_3$ and $\mathrm{LayerNorm}_4$ are distinct LayerNormalization layers.
- The query is constructed based on the latest position embedding $\boldsymbol{\alpha}_t$ and the latest item intention $\mathbf{z}_t^{(u)}$, plus a bias $\mathbf{b}'$. This implies a recency bias and a focus on intentions similar to the most recent one.
- The key for each position $i$ is derived from its position embedding $\boldsymbol{\alpha}_i$ and item intention $\mathbf{z}_i^{(u)}$, processed through a non-linear transformation (ReLU with $\mathbf{W}$ and $\mathbf{b}$).
- The softmax over these key-query dot products (scaled by $\frac{1}{\sqrt{D}}$) gives the attention weights $p_i$, indicating the importance of each position's intention.
Intention Aggregation: Finally, the disentangled output representations of the sequence encoder, $\boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)})$ for $1 \leq k \leq K$, are computed by aggregating all item intentions according to both their category-specific attention $p_{k|i}$ and their overall importance $p_i$: $ \boldsymbol{\phi}_{\theta}^{(k)}(\mathbf{x}_{1:t}^{(u)}) = \mathrm{LayerNorm}_5 \left( \boldsymbol{\beta}_k + \sum_{i=1}^{t} p_{k|i} \cdot p_i \cdot \mathbf{z}_i^{(u)} \right) $ where:
- $1 \leq k \leq K$.
- $\mathrm{LayerNorm}_5$ is another LayerNormalization layer.
- $\boldsymbol{\beta}_k \in \mathbb{R}^{D}$ is a learnable bias vector for each latent category $k$. The paper mentions using two sets of such biases: one for encoding the input sequence $\mathbf{x}_{1:t}^{(u)}$ and another for encoding the reversed future sequence $\mathbf{x}_{T_u:t+1}^{(u)}$, acknowledging their different roles.
- This sum essentially creates a weighted average of all item intentions $\mathbf{z}_i^{(u)}$, where the weights $p_{k|i} \cdot p_i$ ensure that each disentangled representation $\boldsymbol{\phi}_{\theta}^{(k)}$ predominantly reflects intentions belonging to category $k$ that are also deemed important for future prediction.
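Putting the weighting and aggregation steps together, here is a simplified sketch of the disentanglement layer (illustrative only; parameter shapes follow the description above, LayerNorm is shown without its learnable scale/shift, and the helper names are our own):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def disentangled_encode(z, pos_emb, p_k_given_i, W, b, b_prime, beta):
    """z: (t, D) per-position intentions; pos_emb: (t, D) position embeddings of
    this layer; p_k_given_i: (t, K) clustering weights; W: (D, D), b, b_prime: (D,),
    beta: (K, D) learnable parameters. Returns the K sequence representations, (K, D)."""
    t, D = z.shape
    # Intention weighting: attend from the most recent position to all positions.
    key_tilde = layer_norm(pos_emb + z)                               # (t, D)
    key = key_tilde + np.maximum(0.0, key_tilde @ W + b)              # residual + ReLU transform
    query = layer_norm(pos_emb[-1] + z[-1] + b_prime)                 # (D,)
    att = key @ query / np.sqrt(D)
    att -= att.max()
    p_i = np.exp(att) / np.exp(att).sum()                             # (t,)
    # Intention aggregation: weighted sum per latent category, then LayerNorm.
    weights = p_k_given_i * p_i[:, None]                              # (t, K)
    return layer_norm(beta + weights.T @ z)                           # (K, D)

t, D, K = 6, 8, 4
rng = np.random.default_rng(0)
p_kc = rng.random((t, K)); p_kc /= p_kc.sum(axis=1, keepdims=True)    # toy p[k|i]
phi = disentangled_encode(rng.standard_normal((t, D)), rng.standard_normal((t, D)),
                          p_kc, 0.1 * rng.standard_normal((D, D)),
                          np.zeros(D), np.zeros(D), np.zeros((K, D)))
print(phi.shape)  # (4, 8)
```

In a full implementation, p_k_given_i would come from the intention-clustering sketch above, and separate bias sets would be used for the input and the reversed future sequences, as noted in the text.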
Encouraging Disentanglement:
The paper argues that no explicit regularization term is needed to encourage disentanglement (e.g., minimizing mutual information between representations). This is because the contrastive nature of both the seq2seq (Eq. 3) and seq2item (Eq. 6) loss functions naturally drives disentanglement. In both losses, the score of a positive sample for a particular intention $k$ is compared against the scores from all other $K-1$ intentions (and other negative samples). To maximize the likelihood of the positive case for a specific part $\boldsymbol{\phi}_{\theta}^{(k)}$, the model is implicitly forced to make $\boldsymbol{\phi}_{\theta}^{(k)}$ capture distinct information from $\boldsymbol{\phi}_{\theta}^{(k')}$ for $k' \neq k$. If the different $\boldsymbol{\phi}_{\theta}^{(k)}$ were entangled and carried redundant information, they would not be able to effectively distinguish positive cases associated with one specific intention from those associated with others.
This comprehensive methodology, integrating latent self-supervision and disentangled intention modeling, aims to provide a more robust and insightful way to learn from user behavior sequences.
5. Experimental Setup
5.1. Datasets
The experiments are conducted on four real-world public datasets, which have been pre-processed and widely used in previous sequential recommendation research, particularly by SASRec and BERT4Rec.
-
Amazon Beauty:
- Source: Amazon product reviews, specifically from the "Beauty" category.
- Scale: 40,226 users and 54,542 items.
- Characteristics: Relatively sparse dataset with shorter sequences.
- Average Sequence Length: 8.8
-
Steam:
- Source: User interactions (e.g., game purchases, playtimes) on the Steam gaming platform.
- Scale: 281,428 users and 13,044 items.
- Characteristics: Large number of users, but fewer items compared to Beauty. Also features relatively short sequences.
- Average Sequence Length: 12.4
-
MovieLens-1M:
- Source: Movie ratings data from the MovieLens platform.
- Scale: 6,040 users and 3,416 items.
- Characteristics: Fewer users and items but significantly longer user interaction sequences, reflecting more extensive viewing histories.
- Average Sequence Length: 163.5
-
MovieLens-20M:
- Source: A larger version of the MovieLens dataset with 20 million ratings.
- Scale: 138,493 users and 26,744 items.
- Characteristics: More users and items than MovieLens-1M, also characterized by long user sequences.
- Average Sequence Length: 144.4
Data Splitting: Following the common practice in sequential recommendation research [25, 49], the datasets are split as follows for each user:
- The last item of each user's sequence is reserved for testing.
- The second-to-last item of each user's sequence is reserved for validation.
- All remaining items (from the beginning up to the third-to-last) are used for training. Items within a sequence are strictly ordered by their timestamps, with the last position corresponding to the most recent click.
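A minimal sketch of this leave-one-out split (our own illustration of the standard SASRec/BERT4Rec-style protocol described above):

```python
def leave_one_out_split(sequence):
    """Last item -> test, second-to-last -> validation, the rest -> training."""
    assert len(sequence) >= 3
    return sequence[:-2], sequence[-2], sequence[-1]

train, valid, test = leave_one_out_split([10, 4, 7, 7, 2, 9])
print(train, valid, test)   # [10, 4, 7, 7] 2 9
```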
These datasets are chosen because they are standard benchmarks in sequential recommendation, allowing for fair comparison with state-of-the-art methods and covering a range of sequence lengths and sparsity levels. They are effective for validating the method's performance across different user behavior patterns.
5.2. Evaluation Metrics
The performance of all methods is evaluated using three widely accepted metrics in recommendation systems: Recall, Normalized Discounted Cumulative Gain (NDCG), and Mean Reciprocal Rank (MRR). Higher values for all these metrics indicate better recommendation performance.
The evaluation setup follows BERT4Rec's advice: for each ground-truth item in the test set, it is paired with 100 negative items randomly sampled according to their popularity. The task then becomes identifying the single ground-truth item among these 101 items. This is a common practice to make evaluation tractable for large item sets.
5.2.1. Recall@K
-
Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved by the recommender system within its top recommendations. It focuses on the ability of the system to find the relevant items, regardless of their precise ranking within the top . In the context of next-item prediction where there's usually only one ground-truth next item, Recall@K indicates whether the true next item is present in the top recommended items.
-
Mathematical Formula: $ \text{Recall@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \mathbb{I}(\text{next\_item}_u \in \text{Recommended}_u(K)) $
-
Symbol Explanation:
- $|\mathcal{U}|$: The total number of users in the test set.
- $u$: A specific user in the test set.
- $\text{next\_item}_u$: The actual next item clicked by user $u$ (the ground-truth item for evaluation). In this paper's setup, there is one such item per user in the test set.
- $\text{Recommended}_u(K)$: The set of the top $K$ items recommended by the system for user $u$.
- $\mathbb{I}(\cdot)$: The indicator function. It returns 1 if the condition inside the parentheses is true (i.e., the ground-truth item is found among the top $K$ recommendations), and 0 otherwise.
5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)
-
Conceptual Definition: NDCG@K is a measure of ranking quality. It evaluates how well a recommender system ranks relevant items, giving higher scores to relevant items that appear earlier in the recommendation list (i.e., at higher ranks). It "discounts" the value of relevant items as their position in the list decreases. NDCG is commonly used in information retrieval and recommendation to account for the graded relevance of items. For next-item prediction, the relevance is binary (1 for the true next item, 0 for others).
-
Mathematical Formula: $ \text{NDCG@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{\text{DCG@K}_u}{\text{IDCG@K}_u} $ where: $ \text{DCG@K}_u = \sum_{j=1}^{K} \frac{2^{\text{rel}_j}-1}{\log_2(j+1)} $ and $ \text{IDCG@K}_u = \sum_{j=1}^{K} \frac{2^{\text{rel}_j^{ideal}}-1}{\log_2(j+1)} $
-
Symbol Explanation:
- $|\mathcal{U}|$: The total number of users in the test set.
- $u$: A specific user in the test set.
- $j$: The rank (position) in the recommendation list, from 1 to $K$.
- $\text{rel}_j$: The relevance score of the item at rank $j$ in the actual recommendation list for user $u$. For binary relevance (relevant/irrelevant), it is 1 if the item at rank $j$ is the ground-truth next item, and 0 otherwise.
- $\text{DCG@K}_u$: Discounted Cumulative Gain for user $u$ at cutoff $K$. It sums the relevance scores, discounted logarithmically by their rank.
- $\text{IDCG@K}_u$: Ideal Discounted Cumulative Gain for user $u$ at cutoff $K$. This is the maximum possible DCG for user $u$, achieved if all relevant items were perfectly ranked at the top. For a single ground-truth item, $\text{IDCG@K}_u = \frac{2^1 - 1}{\log_2(1+1)} = 1$ (if $K \geq 1$).
- $\text{rel}_j^{ideal}$: The ideal relevance score at rank $j$. For a single ground-truth item, $\text{rel}_1^{ideal} = 1$, and $\text{rel}_j^{ideal} = 0$ for $j > 1$.
5.2.3. Mean Reciprocal Rank (MRR)
-
Conceptual Definition: MRR measures the average of the reciprocal ranks of the first relevant item in a list of recommendations. If the first relevant item is at rank 1, its reciprocal rank is 1. If it's at rank 2, its reciprocal rank is 1/2, and so on. MRR is particularly useful when only one relevant item is expected, and placing it high in the list is critical.
-
Mathematical Formula: $ \text{MRR} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{\text{rank}_u} $
-
Symbol Explanation:
- $|\mathcal{U}|$: The total number of users in the test set.
- $u$: A specific user in the test set.
- $\text{rank}_u$: The rank (position) of the first relevant item for user $u$ in the recommendation list. If the relevant item is not found, its rank is often considered infinite, making $\frac{1}{\text{rank}_u} = 0$.
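For concreteness, the per-user versions of these three metrics under the paper's 1-positive-plus-100-negatives protocol can be sketched as follows (illustrative code; with a single binary-relevant item, NDCG@K reduces to 1/log2(rank+1)):

```python
import numpy as np

def rank_of_true_item(scores, true_idx):
    # 1-based rank of the ground-truth item among the 101 scored candidates.
    return int((scores > scores[true_idx]).sum()) + 1

def recall_at_k(rank, k):
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k):
    # With a single binary-relevant item, DCG = 1/log2(rank+1) and IDCG = 1.
    return 1.0 / np.log2(rank + 1) if rank <= k else 0.0

def reciprocal_rank(rank):
    return 1.0 / rank

# Toy example: index 0 is the ground-truth item, indices 1..100 are sampled negatives.
scores = np.random.randn(101)
r = rank_of_true_item(scores, true_idx=0)
print(r, recall_at_k(r, 10), ndcg_at_k(r, 10), reciprocal_rank(r))
# Per-user values like these are then averaged over all test users.
```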
5.3. Baselines
The paper compares its approach against a comprehensive set of representative baselines, spanning different generations and paradigms of recommender systems:
-
POP (Popularity-based): A naive but strong baseline that recommends the most globally popular items to all users. This helps determine if more complex models are genuinely learning personalized preferences or just popular items.
-
BPR-MF (Bayesian Personalized Ranking - Matrix Factorization): A classic and widely used collaborative filtering algorithm based on matrix factorization. It optimizes for pairwise ranking, aiming to rank observed (positive) items higher than unobserved (negative) items. It primarily captures general user preferences.
-
NCF (Neural Collaborative Filtering): An influential deep learning-based collaborative filtering model that replaces the inner product in matrix factorization with a neural network to model user-item interactions, allowing for more complex, non-linear relationships.
-
FPMC (Factorized Personalized Markov Chains): A sequential recommender that combines matrix factorization with a first-order Markov chain. It models both general user preferences and sequential transitions (predicting the next item based on the last item).
-
GRU4Rec / GRU4Rec+: Recurrent Neural Network (RNN) based sequential recommenders.
GRU4Rec uses Gated Recurrent Units to model sequences. GRU4Rec+ is an improved version, often with better training strategies or architectures. These models capture temporal dependencies.
-
Caser (Convolutional Sequence Embedding Recommendation): A sequential recommender that uses Convolutional Neural Networks (CNNs) to extract local and general sequential patterns from user interaction sequences.
-
SASRec (Self-Attentive Sequential Recommendation): A state-of-the-art sequential recommender based on the Transformer architecture. It uses self-attention to capture long-range dependencies in user sequences, allowing it to weigh the importance of all previous items when predicting the next one. This model serves as the foundation for the proposed encoder in this paper.
-
BERT4Rec (Sequential Recommendation with Bidirectional Encoder Representations from Transformer): A leading deep sequential recommender that adapts the BERT pre-training objective (masked item prediction using bidirectional context) to train a bidirectional Transformer encoder for sequential recommendations. It captures rich contextual representations by considering both past and future items.
These baselines represent a progression from non-sequential to sequential, and from traditional methods to deep learning approaches, including the current state-of-the-art Transformer-based models. This diverse set allows for a robust evaluation of the proposed method's novelty and effectiveness.
5.4. Implementation and Hyper-parameters
- Framework: The model is implemented using
TensorFlow.
- Initialization: Parameters are initialized using TensorFlow's default initialization methods.
- Optimizer:
The Adam optimizer is used for mini-batch gradient descent.
- Learning Rate: Fixed at 0.001.
- Mini-batch Size: 128 sequences per batch.
- Base Encoder: The single-head implementation of
SASRec is used as the foundational component for the sequence encoder.
- Maximum Sequence Length:
- MovieLens-1M and MovieLens-20M: Capped at 200.
- Amazon Beauty and Steam: Capped at 50.
These caps align with configurations used by
SASRec and BERT4Rec to manage computational complexity and focus on recent history for datasets with very long sequences.
- Hyper-parameter Tuning: Other hyper-parameters are tuned using
random search. The tuned hyper-parameters include:
- Dimensionality of the item embeddings ($D$).
- Number of self-attention blocks (in the SASRec part).
- $\lambda$: the threshold hyper-parameter for seq2seq sample selection.
- Number of latent categories ($K$).
- Dropout rate.
- L2 regularization term.
6. Results & Analysis
6.1. Core Results Analysis
The empirical results demonstrate that the proposed approach, which combines traditional seq2item training with the novel disentangled latent seq2seq training, consistently outperforms all baseline models across various metrics and datasets.
The following figure (Figure 2 from the original paper) shows the recommendation performance in terms of Recall@1, Recall@5, and Recall@10. These metrics measure how well a method can retrieve the relevant items with a limited budget.
The figure is a chart showing the Recall of the different recommendation methods on multiple datasets (Beauty, Steam, ML-1m, ML-20m). It is divided into three parts, reporting Recall at the top-1, top-5, and top-10 positions. The comparison shows the advantage of the proposed method.
The next figure (Figure 3 from the original paper) shows the recommendation performance in terms of NDCG@5, NDCG@10, and MRR. These metrics measure how well a method can rank the relevant items before the irrelevant ones.

Key Observations:
-
Consistent Outperformance: Across all four datasets (Beauty, Steam, MovieLens-1M, MovieLens-20M) and all evaluated metrics (Recall@1, @5, @10; NDCG@5, @10; MRR), the proposed method achieves the highest performance.
-
Significant Gains on Shorter Sequences: The improvement is particularly pronounced on the
Beauty and Steam datasets. For these datasets, which have relatively shorter average sequence lengths (8.8 and 12.4, respectively), the relative improvement over the strongest baselines often exceeds 35%. This suggests that for sequences where intentions might be clearer or less prone to very long-term shifts, the latent seq2seq signals combined with disentanglement are highly effective.
Modest Gains on Longer Sequences: On the
MovieLens-1MandMovieLens-20Mdatasets, which feature much longer average sequence lengths (163.5 and 144.4, respectively), the relative improvement is around 5%. The authors attribute this to the inherent challenge ofdisentangling intentionswithin very long and complex sequences. Such sequences might contain many evolving or intertwined intentions, making it harder for the model to isolate and leverage specific shared intentions forseq2seqtraining. This indicates a potential area for future improvement.In summary, the
seq2seqtraining strategy successfully extracts additional supervision signals that complementseq2itemtraining, leading to superior recommendation performance, especially on datasets with more manageable sequence lengths for intention disentanglement.
6.2. Robustness to Synthetic Noises
To evaluate the robustness of the proposed seq2seq training strategy, the authors conducted experiments where the training data was artificially corrupted. This involved randomly replacing a portion of observed clicks in the training set with uniformly sampled items. The experiment was performed on the Beauty dataset, with the corruption level (percentage of corrupted data) ranging from 10% to 50%.
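A sketch of this corruption protocol (our own illustrative helper; the paper simply replaces a fraction of training clicks with uniformly sampled items):

```python
import random

def corrupt_sequences(sequences, num_items, corruption_level, seed=0):
    """Randomly replace a fraction of the clicks in the training sequences with
    uniformly sampled item ids."""
    rng = random.Random(seed)
    corrupted = []
    for seq in sequences:
        corrupted.append([
            rng.randrange(num_items) if rng.random() < corruption_level else item
            for item in seq
        ])
    return corrupted

clean = [[3, 7, 2, 9], [1, 5, 8, 8, 0]]
print(corrupt_sequences(clean, num_items=10, corruption_level=0.3))
```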
The following figure (Figure 4 from the original paper) illustrates the performance drop, showing the relative performance (with noisy data / with clean data) for Recall@5, NDCG@5, and MRR.

Analysis:
The figure compares two variants: one that optimizes only the seq2item loss (labeled seq2item in the plot) and the proposed method that optimizes both seq2item and seq2seq losses (labeled seq2seq).
- Improved Robustness: The curves show that the recommendation performance of the model using the seq2seq training strategy (seq2seq) drops more slowly and maintains a higher relative performance compared to the seq2item-only variant (seq2item), especially when the corruption level is relatively modest (e.g., up to 20% noise).
- Reasoning: This indicates that by mining additional supervision signals from the longer-term future and selectively learning from seq2seq samples of high confidence (through the threshold $\lambda$ and intention disentanglement), the proposed strategy is more resilient to noisy or irrelevant immediate behaviors in the training data. The longer-term signals, when filtered for shared intentions, provide a more stable and reliable learning objective, making the model less vulnerable to individual erroneous data points.
- Limitations at High Noise: At very high corruption levels (e.g., 40-50%), the performance drop becomes significant for both methods, suggesting that beyond a certain point, the integrity of the data is too compromised for either strategy to fully recover. However, the seq2seq approach still demonstrates a comparative advantage in the more realistic range of moderate noise.
6.3. Ablation Studies / Parameter Analysis
6.3.1. Ablation Study
An ablation study was conducted on the Beauty dataset to understand the contribution of different components of the proposed method.
The following are the results from Table 2 of the original paper:
| Variants of Our Method | Recall@1 | Recall@5 | Recall@10 | NDCG@5 | NDCG@10 | MRR |
|---|---|---|---|---|---|---|
| (1) Remove seq2seq training | 0.1358 | 0.3002 | 0.3891 | 0.2369 | 0.2675 | 0.2420 |
| (2) Individually reconstruct all items in a future sequence | 0.1071 | 0.2709 | 0.3744 | 0.1916 | 0.2251 | 0.1992 |
| (3) Individually reconstruct the next three items | 0.1202 | 0.2914 | 0.3898 | 0.2084 | 0.2403 | 0.2139 |
| (0) Default | 0.1522 | 0.3225 | 0.4171 | 0.2404 | 0.2709 | 0.2448 |
Analysis:
- Variant (1) - Removing seq2seq training: This variant uses only the seq2item loss, essentially turning the model into a standard Transformer-based sequential recommender with the multi-intention encoder. A noticeable drop across all metrics (e.g., Recall@1 falls from 0.1522 to 0.1358, NDCG@5 from 0.2404 to 0.2369) is observed compared to the Default (full model). This directly confirms the contribution of the proposed seq2seq training strategy in leveraging additional supervision signals from the longer-term future.
- Variant (2) - Individually reconstructing all items in a future sequence: This variant performs seq2seq training by explicitly predicting every individual item in the entire future sequence, instead of predicting a single latent representation of the sequence. It performs even worse than Variant (1) (e.g., Recall@1 drops further to 0.1071, NDCG@5 to 0.1916). This result supports the paper's first challenge, that "reconstructing a future sequence containing many behaviors is exponentially harder," and validates the design choice of latent self-supervision: predicting a distilled representation of the future sequence is far more effective than reconstructing each item individually, which may include many irrelevant signals.
- Variant (3) - Individually reconstructing the next three items: This is a compromise between predicting all future items and just the next one, capturing slightly longer-term but still explicit signals. It also performs worse than Variant (1) (e.g., Recall@1 at 0.1202, NDCG@5 at 0.2084), although slightly better than Variant (2). This reinforces the finding that explicit item-by-item reconstruction of even a few future items is less effective than the latent representation approach, likely because of irrelevant items or intentions among those future items.

The ablation study demonstrates that both the seq2seq loss component and the latent self-supervision (predicting representations rather than individual items) are crucial to the superior performance of the proposed method. The intention disentanglement aspect, by enabling the selection of high-confidence seq2seq samples, is implicitly validated by the success of the Default model over the naive reconstruction variants. A minimal code sketch of such a latent objective follows this analysis.
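To make the latent alternative concrete, the sketch below shows one way a latent seq2seq objective can be written as an in-batch contrastive loss over future-sequence representations rather than a per-item cross-entropy. The normalization, temperature, and function names are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def latent_seq2seq_loss(pred_future: torch.Tensor, true_future: torch.Tensor,
                        temperature: float = 0.1) -> torch.Tensor:
    """In-batch contrastive loss in the latent space (a sketch of the idea only).

    pred_future: (B, d) representations predicted from the past sub-sequences.
    true_future: (B, d) representations encoded from the actual future
                 sub-sequences.  Row i of the two tensors belongs to the same
                 user; every other row in the batch serves as a negative.
    """
    pred = F.normalize(pred_future, dim=-1)
    true = F.normalize(true_future, dim=-1)
    logits = pred @ true.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with a batch of 4 users and 64-dimensional representations.
loss = latent_seq2seq_loss(torch.randn(4, 64), torch.randn(4, 64))
print(loss.item())
```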
6.3.2. Hyper-parameter Sensitivity
The paper investigates the impact of the critical hyper-parameter $\lambda$, which sets the threshold for deciding whether a seq2seq sample is considered "high confidence" and therefore used for self-supervised training.
The following figure (Figure 5 from the original paper) illustrates the impact of the threshold hyper-parameter $\lambda \in [0, 1]$: $\lambda = 0$ is equivalent to not using seq2seq training, while $\lambda = 1$ selects all seq2seq samples for training. (A minimal sketch of this sample selection is given below the figure.)
![Figure 5: Impact of the threshold hyper-parameter $\lambda \in [0, 1]$, which determines whether a seq2seq sample is of high confidence and thus whether to use the sample for self-supervised training. $\lambda = 0$ is equivalent to not using seq2seq training, while $\lambda = 1$ selects all seq2seq samples for training.](/files/papers/6950ed9888e29060a51c8504/images/5.jpg)
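The snippet below is a minimal sketch of such threshold-based sample selection, assuming a per-sample confidence score has already been computed by the disentangled encoder. Keeping the top-$\lambda$ fraction of the batch is one way to reproduce the boundary behaviour described above ($\lambda = 0$ keeps nothing, $\lambda = 1$ keeps everything); the paper's exact selection rule may differ.

```python
import torch

def select_seq2seq_samples(confidence: torch.Tensor, lam: float) -> torch.Tensor:
    """Boolean mask over a batch of candidate seq2seq samples.

    confidence: (batch,) scores, e.g. how likely the input and the future
    sub-sequence are to share an intention.  This sketch keeps the top-`lam`
    fraction of the batch ranked by confidence (an assumption, not the paper's
    exact criterion): lam = 0 keeps no samples, lam = 1 keeps all of them.
    """
    batch_size = confidence.numel()
    n_keep = int(round(lam * batch_size))
    mask = torch.zeros(batch_size, dtype=torch.bool)
    if n_keep > 0:
        mask[confidence.topk(n_keep).indices] = True
    return mask

# Toy usage: with lam = 0.5, the two most confident of four samples are kept.
conf = torch.tensor([0.9, 0.2, 0.7, 0.4])
print(select_seq2seq_samples(conf, lam=0.5))  # tensor([ True, False,  True, False])
```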
Analysis:
The figure plots Recall@1 and NDCG@10 against different values of $\lambda$.
- Baseline: When $\lambda = 0$, no seq2seq samples are used for training, making this scenario equivalent to removing seq2seq training (Variant 1 in the ablation study). The performance at $\lambda = 0$ is indeed lower than at the optimal point, confirming the value of seq2seq training.
- Optimal $\lambda$: The performance generally increases as $\lambda$ grows from 0, peaks at an intermediate value (e.g., around 0.8 for Recall@1), and then declines as $\lambda$ approaches 1.
- Impact of $\lambda$:
  - Too Strict ($\lambda$ too small): A very small $\lambda$ means only a tiny fraction of the most confident seq2seq samples are used. This limits the additional supervision signals, leading to suboptimal performance because the model does not learn enough from the longer-term future.
  - Too Loose ($\lambda$ too large, approaching 1): A large $\lambda$ means almost all seq2seq samples are used, including those with low confidence or with intentions irrelevant to the input sequence. This introduces too much noise and too many irrelevant signals into training, hindering effective learning and degrading performance. This validates the paper's second challenge regarding multiple intentions and the need for filtering.
- Dataset Dependency: The authors mention that while the trend is similar across datasets, the optimal $\lambda$ may vary, highlighting the importance of tuning this hyper-parameter for specific applications.

This sensitivity analysis demonstrates that careful selection of seq2seq training samples via the threshold $\lambda$, enabled by the intention disentanglement mechanism, is crucial for maximizing the benefits of the proposed strategy.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper presents a novel and effective approach to enhance sequential recommender systems by overcoming the myopia and limitations of traditional sequence-to-item (seq2item) training. The core contribution is a sequence-to-sequence (seq2seq) training strategy that leverages longer-term future user behaviors for additional supervision. The method introduces two key innovations:
- Latent Self-Supervision: Instead of the challenging task of reconstructing individual items in a future sequence, the model learns to predict a compact, high-level latent representation of the entire future sequence, significantly easing convergence and distilling the relevant information.
- Intention Disentanglement and Sample Selection: A disentangled sequence encoder is designed to capture multiple distinct user intentions. This disentanglement is then used to select seq2seq training samples, ensuring that only pairs of sub-sequences with shared intentions are utilized, thereby improving the signal-to-noise ratio of the supervision. (A minimal sketch of prototype-based intention assignment is given after this summary.)

Extensive experiments on real-world datasets (Amazon Beauty, Steam, MovieLens-1M, MovieLens-20M) and on synthetically corrupted data consistently demonstrate that this seq2seq training strategy, when combined with standard seq2item training, leads to superior recommendation performance and increased robustness to noise.
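As a rough illustration of the second point, the sketch below shows prototype-based soft assignment of sequence representations to a set of latent intentions. The learnable prototypes, cosine-similarity routing, temperature, and dimensions are assumptions for illustration, not the paper's exact encoder.

```python
import torch
import torch.nn.functional as F

def intention_assignment(seq_repr: torch.Tensor, prototypes: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Soft assignment of (sub-)sequence representations to K prototypical intentions.

    seq_repr:   (B, d) representations of sub-sequences.
    prototypes: (K, d) learnable prototype vectors, one per latent intention.
    Returns a (B, K) probability matrix; the argmax can be read as the dominant
    intention of each sequence.  A sketch of the general idea only.
    """
    sim = F.normalize(seq_repr, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    return F.softmax(sim / temperature, dim=-1)

# Toy usage: 4 sequences, 64-dim representations, 8 latent intentions.
probs = intention_assignment(torch.randn(4, 64), torch.randn(8, 64))
print(probs.argmax(dim=-1))  # dominant intention index per sequence
```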
7.2. Limitations & Future Work
The authors acknowledge the following limitations and propose future research directions:
- Computational Cost of seq2seq Training: While the paper successfully demonstrates the performance benefits, seq2seq training, especially with contrastive losses over mini-batches and multiple disentangled heads, can be computationally intensive. Future work aims to reduce this cost via an "engineering-efficient framework" [59].
- Performance on Long Sequences: The paper notes that the gains on datasets with very long average sequence lengths (MovieLens-1M and MovieLens-20M) are less impressive than those on shorter sequences. This suggests that effectively disentangling intentions and extracting relevant long-term signals from extremely long and complex user histories remains a challenge, and improving the model's ability to handle and benefit from such sequences is a promising direction.
7.3. Personal Insights & Critique
The paper offers several valuable insights and represents a significant step forward in sequential recommendation:
- Beyond Immediate Prediction: The fundamental shift from seq2item to latent seq2seq is conceptually powerful. It addresses the real-world limitation that users' long-term interests are not always perfectly aligned with their very next click, and learning a "summary" representation of the future is an elegant way to tame the complexity of many future items.
- Novel Application of Disentanglement: Using intention disentanglement not just for learning interpretable representations but specifically for filtering training samples in a self-supervised context is a highly innovative aspect. It directly addresses the signal-to-noise problem that arises when mining broad future signals, intelligently selecting what to learn from and making the longer-term signals more actionable.
- Implicit Disentanglement through Contrastive Loss: The observation that the contrastive nature of the softmax loss implicitly encourages disentanglement, removing the need for explicit regularization terms, is a neat theoretical and practical finding. It simplifies model design while achieving the desired property.
- Enhanced Robustness: The demonstrated robustness to noisy training data is a crucial practical advantage. Real-world user behavior data is inherently noisy, and a model that learns more reliably from such data is highly valuable.
Critique and Potential Areas for Improvement:
- Scalability of Intentions: While the paper uses at most 8 latent intention categories, users in highly dynamic and diverse e-commerce scenarios might exhibit many more distinct intentions over a long period. Scaling the number of categories while ensuring they are all actively learned without mode collapse could become challenging; further research into dynamic or hierarchical disentanglement could be beneficial.
- Subjectivity of $\lambda$: The threshold hyper-parameter, while effective, introduces a manual selection mechanism. Could the selection of "high confidence" seq2seq samples be made more adaptive, or learned dynamically from some form of uncertainty estimation?
- Interpretability of Latent Categories: While the paper aims to disentangle intentions, a deeper dive into the actual interpretability of the learned prototypical intention representations, and how they align with human-understandable categories (e.g., "work items," "hobby items"), could provide valuable insights. This could be explored through case studies or qualitative analysis.
- Handling Very Long Sequences: The weaker performance on the MovieLens datasets with extremely long sequences suggests that the current disentanglement and aggregation mechanisms may struggle to summarize or differentiate intentions over hundreds of items. A hierarchical or multi-scale attention mechanism within the disentanglement layer might better capture very long-term intention shifts or recurring patterns.
- Computational Cost at Very Large Scale: Although listed as future work, the mini-batch softmax over the candidate set for every sample in the batch can still be costly for recommender systems with billions of items and millions of users, especially in real-time deployment. Approximate nearest neighbor search or more efficient negative sampling strategies within the contrastive loss could be vital for practical deployment.

Overall, this paper provides a robust framework for improving sequential recommendation by incorporating longer-term future signals and a sophisticated disentanglement mechanism. Its insights into latent self-supervision and intention-aware sample selection are highly transferable and could inspire future research in sequence modeling tasks beyond recommendation.