DiffuRec: A Diffusion Model for Sequential Recommendation
TL;DR Summary
This paper introduces DiffuRec, the first diffusion model for sequential recommendation, representing item embeddings as distributions to better reflect multiple user interests and diverse item features. The method leverages noise addition for uncertainty injection and reconstructs the target item representation with a Transformer-based Approximator, outperforming strong baselines by a large margin on four datasets.
Abstract
Mainstream solutions to Sequential Recommendation (SR) represent items with fixed vectors. These vectors have limited capability in capturing items' latent aspects and users' diverse preferences. As a new generative paradigm, Diffusion models have achieved excellent performance in areas like computer vision and natural language processing. To our understanding, its unique merit in representation generation well fits the problem setting of sequential recommendation. In this paper, we make the very first attempt to adapt Diffusion model to SR and propose DiffuRec, for item representation construction and uncertainty injection. Rather than modeling item representations as fixed vectors, we represent them as distributions in DiffuRec, which reflect user's multiple interests and item's various aspects adaptively. In diffusion phase, DiffuRec corrupts the target item embedding into a Gaussian distribution via noise adding, which is further applied for sequential item distribution representation generation and uncertainty injection. Afterward, the item representation is fed into an Approximator for target item representation reconstruction. In reverse phase, based on user's historical interaction behaviors, we reverse a Gaussian noise into the target item representation, then apply a rounding operation for target item prediction. Experiments over four datasets show that DiffuRec outperforms strong baselines by a large margin.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
DiffuRec: A Diffusion Model for Sequential Recommendation
1.2. Authors
ZIHAO LI and CHENLIANG LI (Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China), AIXIN SUN (Nanyang Technological University, Singapore).
1.3. Journal/Conference
ACM Transactions on Information Systems (ACM Trans. Inf. Syst. 37, 4, Article 111). This is a highly reputable journal in the field of information systems and computer science, known for publishing high-quality research in areas like information retrieval, recommender systems, and data mining.
1.4. Publication Year
2023
1.5. Abstract
Mainstream solutions for Sequential Recommendation (SR) represent items using fixed vectors, which have limited capability in capturing items' latent aspects and users' diverse preferences. This paper proposes DiffuRec, a novel approach that adapts the Diffusion model to SR for the first time. DiffuRec addresses the limitations of fixed vectors by representing item embeddings as distributions, reflecting multiple user interests and item aspects adaptively. In the diffusion phase, the target item embedding is corrupted into a Gaussian distribution through noise addition, which then guides the generation of sequential item distribution representations and injects uncertainty. An Approximator (a Transformer model) is used to reconstruct the target item representation. In the reverse phase, based on historical interactions, Gaussian noise is iteratively denoised into the target item representation, followed by a rounding operation for target item prediction. Experimental results on four datasets demonstrate that DiffuRec significantly outperforms strong baselines.
1.6. Original Source Link
- Original Source: https://arxiv.org/abs/2304.00686 (Preprint)
- PDF Link: https://arxiv.org/pdf/2304.00686v4.pdf (Preprint, version 4)
- Publication Status: The paper is officially published in ACM Trans. Inf. Syst., as indicated by the ACM Reference Format in the paper: "ACM Trans. Inf. Syst. 37, 4, Article 111 (August 2023)". The arXiv link is a preprint version that was updated prior to formal publication.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inherent limitation of fixed vector representations for items in Sequential Recommendation (SR). Traditional SR models, including Transformer-based and GNN-based approaches, typically represent items as static embedding vectors.
This problem is important because:
- Multiple Latent Aspects: Real-world items often possess multiple intrinsic aspects (e.g., a movie having themes of 'Love' and 'War'). A single fixed vector struggles to encode this multi-faceted nature comprehensively, especially when these aspects are difficult to pre-define or annotate.
- Multiple User Interests: User preferences are dynamic, diverse, and context-dependent. A user's interest in an item's aspects can shift over time or based on their current context. A static item representation shared across all users and contexts is insufficient to capture these diverse and evolving interests.
- Uncertainty: User intentions and preferences are inherently uncertain. Current recommender systems are also expected to offer diversity, novelty, and serendipity, which often involves modeling this uncertainty. Representing user interests as a fixed point in an embedding space does not account for it.
- Guidance of the Target Item: The target item (the next item a user will interact with) contains crucial information about the user's current intention. Incorporating this signal effectively during representation learning can significantly improve recommendation quality, but doing so is computationally challenging and often only applicable at the ranking stage.

Prior research has attempted to address some of these issues through attention mechanisms, dynamic routing for multi-interest modeling, or Variational Auto-Encoders (VAEs) for distribution representation and uncertainty injection. However, multi-interest models often require a heuristically pre-defined number of interests, limiting flexibility, while VAEs suffer from representation capacity issues and posterior distribution approximation collapse.
The paper's entry point or innovative idea is to leverage the unique merit of Diffusion models – their strong capability in distribution generation and diversity representation – and adapt them for the first time to the Sequential Recommendation problem. The authors posit that Diffusion models are a natural fit for modeling item representations as distributions, thereby addressing the limitations of fixed vectors and existing probabilistic models in a unified framework.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- First Application of Diffusion Models to SR: It presents the first successful attempt to adapt Diffusion models for Sequential Recommendation, bridging a significant gap between Diffusion models (predominantly used in CV and NLP) and recommender systems.
- Distributional Item Representation: Instead of fixed vectors, DiffuRec models items and user preferences as distributions. This allows for adaptive representation of items' multiple latent aspects and users' diverse intentions, inherently capturing more complex information.
- Target Item Guidance and Uncertainty Injection: The proposed diffusion process fuses the target item embedding into the representation generation of historically interacted items. This introduces the user's current preference as an auxiliary signal, while also injecting uncertainty through noise, leading to more robust and discriminative item embedding learning.
- Novel Model Architecture: DiffuRec introduces a Transformer-based Approximator for reconstructing target item representations during training and iteratively denoising during inference. It also devises a rounding operation to map the continuous reversed representation back to discrete item indices for prediction.
- Superior Performance: Extensive experiments on four real-world datasets demonstrate that DiffuRec consistently and significantly outperforms nine strong baselines across various metrics.
- Empirical Analysis: The paper provides in-depth ablation studies, hyperparameter analysis, comparisons with adversarial training, performance analysis on sequences of varying lengths and items of different popularity, and visualization of uncertainty and diversity, elucidating the efficacy of its components.

The key finding is that Diffusion models are highly effective for Sequential Recommendation: modeling item representations and user preferences as distributions, injecting beneficial uncertainty, and leveraging target item guidance lead to substantial performance gains and greater flexibility and diversity in recommendations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand DiffuRec, a novice reader should be familiar with the following concepts:
3.1.1. Sequential Recommendation (SR)
Sequential Recommendation is a task in recommender systems where the goal is to predict the next item a user will interact with, given their historical sequence of interactions. It focuses on modeling the user's interest evolution over time.
- Example: If a user watched movies A, B, and C in that order, an SR model would predict movie D as their next likely watch.
3.1.2. Item Embeddings
In recommender systems and natural language processing (NLP), items (like movies, products, words) are often represented as dense, low-dimensional vectors called embeddings. These vectors capture the semantic and latent features of the items, allowing for mathematical operations (like dot products or cosine similarity) to measure their relatedness.
- Example: Movies with similar genres or actors would have embedding vectors that are close to each other in the embedding space.
3.1.3. Gaussian Distribution
A Gaussian distribution, also known as a normal distribution or bell curve, is a common probability distribution that is symmetric around its mean. It is defined by two parameters: its mean ($\mu$) and variance ($\sigma^2$).
- Formula: The probability density function of a univariate Gaussian distribution is:
$
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$
Where:
  - $x$: the random variable
  - $\mu$: the mean (average value)
  - $\sigma^2$: the variance (spread of the distribution)
  - $\pi$: mathematical constant (approximately 3.14159)
  - $e$: Euler's number (approximately 2.71828)
- Importance: Gaussian noise is often used in machine learning for adding randomness or modeling uncertainty, particularly in diffusion models and VAEs.
3.1.4. Diffusion Models (Denoising Diffusion Probabilistic Models - DDPM)
Diffusion models are a class of generative models that learn to generate data by reversing a diffusion process. This process gradually adds Gaussian noise to data until it becomes pure noise, then the model learns to reverse this process, step-by-step, to reconstruct the original data from noise.
- Forward (Diffusion) Process: This is a fixed Markov chain that gradually adds Gaussian noise to an input data point $\mathbf{x}_0$ over $T$ time steps:
$
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1}, \beta_t\mathbf{I})
$
Where:
  - $\mathbf{x}_0$: the original data point.
  - $\mathbf{x}_t$: the noisy data point at time step $t$.
  - $\beta_t$: the variance schedule (a small, increasing value that controls the amount of noise added at each step).
  - $\mathbf{I}$: identity matrix.
  - $\mathcal{N}(\cdot;\mu,\Sigma)$: Gaussian distribution with mean $\mu$ and covariance $\Sigma$.
  A key property is that $\mathbf{x}_t$ can be sampled directly from $\mathbf{x}_0$ at any step $t$:
$
q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})
$
Where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$.
- Reverse (Denoising) Process: This is a learned Markov chain that starts from pure Gaussian noise and iteratively removes noise to generate a sample from the data distribution. The goal is to learn $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$:
$
p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t))
$
A neural network (often called an Approximator or denoiser) is trained to predict the noise added at each step, or to directly predict $\mathbf{x}_0$ from $\mathbf{x}_t$.
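To make the closed-form forward step above concrete, here is a minimal NumPy sketch (illustrative only; the linear schedule, array shapes, and function names are assumptions, not code from the paper):

```python
import numpy as np

def make_linear_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Illustrative linear variance schedule beta_1..beta_T."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)          # bar{alpha}_t = prod_{s<=t} alpha_s
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng=None):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(x0.shape)      # epsilon ~ N(0, I)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# Example: corrupt a 128-d embedding at step t=10 of T=32
betas, alpha_bars = make_linear_schedule(T=32)
x_t = q_sample(np.random.randn(128), t=10, alpha_bars=alpha_bars)
```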
3.1.5. Transformer Architecture
The Transformer is a neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017). It revolutionized sequence modeling tasks, particularly in NLP, by solely relying on self-attention mechanisms rather than recurrent neural networks (RNNs) or convolutional neural networks (CNNs).
- Self-Attention: A mechanism that allows a model to weigh the importance of different parts of the input sequence when processing each element. It computes attention scores between an element and all other elements in the sequence.
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where:
- $Q$: Query matrix (derived from the input embedding).
- $K$: Key matrix (derived from the input embedding).
- $V$: Value matrix (derived from the input embedding).
- $d_k$: dimension of the keys.
- $\mathrm{softmax}$: normalization function that converts scores into attention weights.
- Multi-Head Attention: Multiple self-attention layers are run in parallel, allowing the model to focus on different parts of the input sequence simultaneously and capture diverse relationships.
- Positional Encoding: Since Transformers do not have inherent sequential processing like RNNs, positional encodings are added to the input embeddings to provide information about the relative or absolute position of tokens in the sequence.
- Feed-Forward Networks: After attention, each position's output is passed through a separate, identical feed-forward neural network.
- Layer Normalization & Residual Connections: Used to stabilize training and enable deeper models.
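As a concrete illustration of the attention formula above, here is a minimal NumPy sketch of single-head scaled dot-product attention (the helper names and shapes are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) pairwise attention scores
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted combination of values

# Example: a sequence of 5 items with 8-dimensional projections
n, d = 5, 8
out = scaled_dot_product_attention(np.random.randn(n, d),
                                   np.random.randn(n, d),
                                   np.random.randn(n, d))   # shape (5, 8)
```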
3.1.6. Cross-Entropy Loss
Cross-entropy loss is commonly used in classification problems to measure the difference between two probability distributions: the true distribution (target labels) and the predicted distribution (model outputs). For next-item prediction, it often measures how well the model predicts the correct next item among all possible items.
- Formula (for a single sample in multi-class classification):
$
L_{CE} = - \sum_{c=1}^{C} y_c \log(\hat{y}_c)
$
Where:
  - $C$: number of classes (items).
  - $y_c$: a binary indicator (0 or 1) of whether class $c$ is the correct class (1 for the true target item, 0 otherwise).
  - $\hat{y}_c$: the predicted probability that the sample belongs to class $c$.
- In recommender systems: The predicted probability for an item is often derived from the softmax of the dot product (or similarity score) between the user's representation and the item's embedding.
3.1.7. Inner Product (Dot Product)
The inner product of two vectors is a scalar value that measures their similarity. In recommender systems, it's widely used to calculate the relevance or compatibility between a user's preference vector and an item's embedding vector. A higher inner product generally indicates higher predicted preference.
- Formula: For two $d$-dimensional vectors $\mathbf{a}$ and $\mathbf{b}$, their inner product is:
$
\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^d a_i b_i
$
3.1.8. t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE is a non-linear dimensionality reduction technique. It is particularly well-suited for visualizing high-dimensional data in a low-dimensional space (e.g., 2D or 3D) while preserving the local structure of the data points. Points that are close in the high-dimensional space tend to be close in the low-dimensional visualization.
3.2. Previous Works
The paper discusses previous work in three main categories:
3.2.1. Sequential Recommendation (SR)
- Early Approaches (Markov Chains): Markov Chains (e.g., [33, 38, 62]) modeled user interactions under the assumption that the next item depends only on the previous one.
  - Background: These models are simple but limited by the Markov assumption, which struggles with long-term dependencies.
- Deep Sequential Neural Networks:
  - GRU4Rec [19]: A pioneering work using Gated Recurrent Units (GRUs) to model sequential information, trained with session-parallel mini-batches.
    - Background: GRUs are a type of Recurrent Neural Network (RNN) designed to capture dependencies in sequences, addressing the vanishing/exploding gradient problem of vanilla RNNs.
  - Caser [44]: Applied Convolutional Neural Networks (CNNs) to the recent K adjacent items (treated as an "image") to capture sequential patterns.
    - Background: CNNs are effective at capturing local patterns and are typically used in image processing, but can be adapted for sequence data by treating subsequences as local windows.
  - SASRec [25]: A pioneering Transformer-based model using uni-directional masked self-attention to learn transition patterns.
    - Background: SASRec adapted the Transformer encoder for autoregressive sequence modeling, where masking prevents attending to future items.
  - BERT4Rec [42]: Adopted a bi-directional Transformer encoder and a cloze task (predicting masked items) to leverage both past and future context.
    - Background: Inspired by BERT in NLP, it uses a masked language model objective, allowing context from both directions to inform predictions.
- Graph Neural Networks (GNNs) for SR:
  - SR-GNN [51]: Introduced GNNs to model explicit relationships between items.
    - Background: GNNs are neural networks that operate on graph-structured data, enabling the capture of high-order relationships and information propagation between nodes (items).
  - Other GNN-based methods [8, 11, 14]: Incorporated time intervals, collaborative filtering signals, and different graph constructions (user-item and item-item transition graphs).
3.2.2. Multi-Interest Modeling
These methods acknowledge that users have multiple, dynamic interests.
- DIN (Deep Interest Network) [60]: Designed a local activation unit to adaptively learn user interest representations from historical sequences, allowing different parts of the sequence to contribute differently to predictions.
  - Background: Uses an attention mechanism to dynamically weigh historical items based on the candidate item, creating an interest representation specific to that candidate.
- MIND (Multi-Interest Network with Dynamic Routing) [27]: Used a multi-interest extractor layer via capsule dynamic routing to extract multiple user interests.
  - Background: Capsule networks group neurons into "capsules" that represent different entities or properties, and dynamic routing allows these capsules to activate others.
- ComiRec [4]: Leveraged both dynamic routing and attention mechanisms for multi-interest modeling.
- TiMiRec [48]: A more recent multi-interest model that designs an auxiliary loss function incorporating the target item as a supervised signal for interest distribution generation.
  - Background: This is a key difference from ComiRec; interest learning is explicitly guided by the ground-truth next item.
- Limitation: Many of these methods require a pre-defined number of interests (or capsules), which can be heuristic and constrain model flexibility.
3.2.3. Representation Distribution and Uncertainty Modeling
These approaches aim to model user preferences or item aspects as distributions rather than fixed points, often to inject uncertainty.
- Multi-VAE [29]: A pioneering work applying Variational Auto-Encoders (VAEs) to collaborative filtering.
  - Background: VAEs are generative models that learn a probabilistic latent space. They map input data to a distribution (typically Gaussian) in the latent space and then decode from that latent distribution back to data. They introduce uncertainty by sampling from these latent distributions.
- SVAE [36]: Combined recurrent neural networks with VAEs to capture temporal patterns for sequential recommendation.
- ACVAE [53]: Used an Adversarial Variational Bayes (AVB) framework to enhance the representation of latent variables and address the posterior distribution approximation issues suffered by conventional VAEs.
  - Background: AVB uses an adversarial network to improve the quality of posterior approximation in VAEs, addressing issues like posterior collapse, where the latent distribution becomes trivial.
- Distribution-based Transform [9]: Modeled item representations as elliptical Gaussian distributions, using conventional item embeddings as means and introducing a stochastic vector for covariance representation and uncertainty injection; used Wasserstein Distance as a loss function.
- STOSA (Stochastic Self-Attention) [10]: Replaced the inner product in the self-attention module with Wasserstein Distance to measure correlation between item distributions, injecting uncertainty via stochastic Gaussian distributions.
- Limitations: VAEs often struggle with representation degeneration and posterior collapse issues, limiting their effectiveness.
3.3. Technological Evolution
The evolution of Sequential Recommendation has mirrored general advancements in deep learning.
- Early Statistical Models: Started with Markov Chains, simple but limited in capturing complex dependencies.
- Recurrent Neural Networks (RNNs): GRU4Rec brought RNNs (specifically GRUs) to SR, enabling better capture of sequential patterns.
- Convolutional Neural Networks (CNNs): Caser showed CNNs could capture local sequential patterns.
- Attention Mechanisms and Transformers: SASRec and BERT4Rec adopted the Transformer architecture, demonstrating the power of self-attention for long-range dependencies and complex pattern learning, and became state-of-the-art.
- Graph Neural Networks (GNNs): SR-GNN and others leveraged GNNs to model high-order relationships and item transitions more explicitly.
- Multi-Interest and Distributional Modeling: Recognizing the dynamic and diverse nature of user preferences, DIN, MIND, ComiRec, TiMiRec, and VAE-based methods (like Multi-VAE, ACVAE, STOSA) emerged to model multiple interests and uncertainty with probabilistic representations.
- Diffusion Models (DiffuRec): DiffuRec marks the latest step by introducing Diffusion models from generative AI to SR, aiming to overcome the limitations of previous distributional models and fixed-vector representations by leveraging diffusion models' inherent strengths in distribution generation and diversity representation.
3.4. Differentiation Analysis
DiffuRec differentiates itself from previous methods primarily in its approach to item representation and uncertainty injection:
- Fixed Vectors vs. Distributions: Unlike mainstream SR methods (GRU4Rec, Caser, SASRec, BERT4Rec, GNN-based) that represent items as fixed vectors, DiffuRec models them as distributions. This allows it to adaptively capture multiple latent aspects of items and diverse user interests, which fixed vectors inherently struggle with.
- Superiority over VAEs: While VAE-based methods (Multi-VAE, SVAE, ACVAE) also model latent variables as distributions and inject uncertainty, they often suffer from representation degeneration and posterior collapse. DiffuRec leverages the more robust distribution-generation capabilities of Diffusion models, which are known for producing high-quality samples and avoiding these VAE pitfalls. STOSA is closer, as it also uses Gaussian distributions for items, but DiffuRec integrates diffusion over the target item directly.
- Enhanced Multi-Interest Modeling: Compared to multi-interest models (DIN, MIND, ComiRec, TiMiRec), DiffuRec does not require heuristically pre-defining a fixed number of interests. Instead, it implicitly models multiple latent aspects and diverse intentions through distributional representations and the stochasticity of the diffusion process. TiMiRec uses the target item as a supervised signal, similar to DiffuRec's target item guidance, but DiffuRec's mechanism is integrated within a more powerful generative framework.
- Unique Uncertainty Injection and Target Item Guidance: DiffuRec's diffusion process not only injects uncertainty but also directly incorporates the target item embedding's information (even when noised) to guide the representation generation of historical items. This uncertainty is not just pure noise but is target-item-aware, providing a more discriminative and robust learning signal. This is a novel way to fuse information compared to explicit adversarial perturbations or VAE sampling.
- Continuous-to-Discrete Mapping: DiffuRec introduces a rounding operation to map the reversed continuous representation back to discrete item indices, a practical necessity for recommendation tasks where the output must be a specific item from a catalog.
4. Methodology
4.1. Principles
The core idea of DiffuRec is to move beyond fixed vector representations for items in Sequential Recommendation (SR) and instead model them as distributions. This approach is motivated by the limitations of fixed vectors in capturing multiple latent aspects of items and diverse user interests, as well as the inherent uncertainty in user preferences. The theoretical basis lies in Diffusion Models, a class of generative models capable of generating diverse data samples by progressively denoising Gaussian noise.
The intuition is that if an item's representation is a distribution rather than a single point, it can inherently encompass various latent aspects. Similarly, a user's evolving preferences can be reflected by how these item distributions are interactively adjusted. The diffusion process provides a robust mechanism to:
- Generate distributional representations: By gradually adding noise to a target item, DiffuRec learns to represent it as a Gaussian distribution, and this process is then leveraged to generate distributional representations for historical items.
- Inject uncertainty: The stochastic nature of the diffusion and reverse processes naturally introduces uncertainty into the model, making it more robust and capable of generating diverse recommendations.
- Incorporate target item guidance: The diffusion process is designed such that the noised target item embedding directly influences the representations of historical items, acting as an auxiliary signal to guide the model toward understanding the user's current intention.
4.2. Core Methodology In-depth (Layer by Layer)
DiffuRec is built upon the Diffusion Model framework, consisting of a diffusion (training) phase and a reverse (inference) phase, orchestrated by a Transformer-based Approximator.
4.2.1. Overview of DiffuRec
The overall architecture of DiffuRec is visualized in the figure below:
This schematic shows DiffuRec's two-phase workflow for sequential recommendation: the diffusion (training) phase and the reverse (inference) phase. The target item embedding is corrupted with Gaussian noise to produce distributional representations that adapt to users' diverse interests and items' multiple aspects.
Figure 2. Architecture of DiffuRec. The figure on the left is the Approximator, a Transformer backbone for the training and reverse phases, respectively.
At a high level, the process is as follows:
- Item Embeddings: Each item is initially represented by a static embedding vector.
- Diffusion (Training) Process:
  - The embedding of the target item (denoted as $\mathbf{e}_{n+1}$) is chosen as the starting point $\mathbf{x}_0$. Gaussian noise is gradually added to $\mathbf{x}_0$ over $s$ steps to create a noised target item representation $\mathbf{x}_s$, which is treated as a distributional representation.
  - The historical item embeddings from the user's sequence are then adjusted using this noised target item representation and a step embedding. This adjustment creates a set of distributional representations for the historical items, and is crucial for exploiting the guidance of the target item.
  - These adjusted historical item representations are fed into a Transformer-based Approximator.
  - The Approximator is trained to reconstruct the original target item representation $\hat{\mathbf{x}}_0$ from $\mathbf{x}_s$. The cross-entropy loss is used to minimize the difference between $\hat{\mathbf{x}}_0$ and the actual target item embedding.
- Reverse (Inference) Process:
  - Starts with pure Gaussian noise $\mathbf{x}_t$ (where $t$ is the maximum diffusion step), representing an uncertain initial state for the target item.
  - The model iteratively denoises $\mathbf{x}_t$ back to $\mathbf{x}_0$. In each step, the current noisy representation is used to adjust the historical item representations, which are then fed into the well-trained Approximator to estimate $\hat{\mathbf{x}}_0$.
  - This estimated $\hat{\mathbf{x}}_0$ and the current $\mathbf{x}_t$ are used to calculate the next, less noisy representation $\mathbf{x}_{t-1}$. This process continues until $\mathbf{x}_0$ is reached.
  - A rounding function is then applied to map the continuous $\mathbf{x}_0$ into discrete item indices for the final prediction.
4.2.2. Approximator
The Approximator is the core neural network responsible for reconstructing the target item representation. Following practices in NLP Diffusion models, DiffuRec utilizes a Transformer as the backbone for the Approximator, given its proven effectiveness in sequential dependency modeling.
The Approximator takes as input a sequence of adjusted historical item representations and outputs a reconstructed target item representation $\hat{\mathbf{x}}_0$.
The input sequence to the Transformer consists of the adjusted representations $\mathbf{Z}_x = [\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n]$:
$
\hat{\mathbf{x}}_0 = f_\theta(\mathbf{Z}_x) = \mathrm{Transformer}\left(\left[\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n\right]\right)
$
Each adjusted historical item representation is constructed as follows:
$
\mathbf{z}_i = \mathbf{e}_i + \lambda_i \odot \left(\mathbf{x} + \mathbf{d}\right)
$
Where:
- $\mathbf{e}_i$: The static embedding vector of the $i$-th historical item in the sequence.
- $\lambda_i$: A scalar sampled from a Gaussian distribution whose mean and variance are both set to a hyperparameter $\lambda$. This controls the extent of noise injection for each historical item in the diffusion phase, and introduces stochasticity to model user interest evolution in the reverse phase.
- $\odot$: The element-wise product (Hadamard product).
- $\mathbf{x}$: The noised target item embedding $\mathbf{x}_s$ in the diffusion phase; in the reverse phase, it is the current reversed target item representation $\mathbf{x}_t$ (iterating from $t$ down to 1).
- $\mathbf{d}$: The embedding of the corresponding diffusion/reverse step, created following the standard Transformer positional encoding approach [47]. This step embedding injects information about the current diffusion or reverse step into the model.

The Transformer itself maintains its standard architecture, including multi-head self-attention, feed-forward networks with ReLU activation, layer normalization, dropout, and residual connections. Multiple Transformer blocks are stacked to enhance representation ability. Finally, the representation corresponding to the last item in the sequence (after passing through the Transformer layers) is used as the reconstructed target item representation $\hat{\mathbf{x}}_0$.
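A minimal NumPy sketch of this adjustment step, assuming each $\lambda_i$ is drawn from a Gaussian whose mean and variance both equal the hyperparameter (function name and shapes are illustrative, not the paper's code):

```python
import numpy as np

def adjust_history(e_hist, x, d, lam, rng=None):
    """Build z_i = e_i + lambda_i * (x + d) for every historical item.

    e_hist : (n, dim) static embeddings of the n historical items
    x      : (dim,)   noised (diffusion) or reversed (inference) target representation
    d      : (dim,)   step embedding of the current diffusion/reverse step
    lam    : hyperparameter; lambda_i ~ N(lam, lam) is assumed here
    """
    rng = np.random.default_rng() if rng is None else rng
    n = e_hist.shape[0]
    lambdas = rng.normal(loc=lam, scale=np.sqrt(lam), size=(n, 1))  # one scalar per item
    return e_hist + lambdas * (x + d)                               # shape (n, dim)

# Example with assumed sizes: 10 historical items, 128-d embeddings
z = adjust_history(np.random.randn(10, 128), np.random.randn(128),
                   np.random.randn(128), lam=0.001)
```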
4.2.3. Diffusion Phase
The diffusion phase (or training phase) is where the model learns to reconstruct the original target item embedding from its noisy version.
The key component here is the noise schedule $\beta_s$, which dictates how much Gaussian noise is added at each step $s$. DiffuRec utilizes a truncated linear schedule for generation.
The formula for the truncated linear schedule is:
$
\beta_s = \begin{cases}
\frac{a}{t}s + \frac{b}{s}, & \text{if } \beta_s \leq \tau \\
\frac{1}{10}(\frac{a}{t}s + \frac{b}{s}), & \text{otherwise}
\end{cases}
$
Where:
- $a, b$: Hyperparameters controlling the range of $\beta_s$.
- $t$: Maximum diffusion step.
- $s$: Current diffusion step.
- $\tau$: A truncation threshold, set to 1 in the paper. If the calculated $\beta_s$ exceeds $\tau$, it is scaled down by a factor of 10.

During training, for each target item, a diffusion step $s$ is randomly sampled from a uniform distribution over the range $[1, t]$, i.e., $s = \lfloor s' \rfloor,\ s' \sim U(0, t)$.
The noised target item representation $\mathbf{x}_s$ is then generated from the original target item embedding $\mathbf{e}_{n+1}$ (which is first mapped to $\mathbf{x}_0$) following the Markov chain property of diffusion models.
Specifically, DiffuRec first derives $\mathbf{x}_0$ from the target item embedding $\mathbf{e}_{n+1}$ with one step of diffusion:
$
q(\mathbf{x}_0 \mid \mathbf{e}_{n+1}) = \mathcal{N}(\mathbf{x}_0; \sqrt{\alpha_0}\,\mathbf{e}_{n+1}, (1 - \alpha_0)\mathbf{I})
$
Then, the noised target item representation $\mathbf{x}_s$ is generated from this $\mathbf{x}_0$ using the reparameterization trick:
$
\mathbf{x}_s = \sqrt{\bar{\alpha}_s}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_s}\epsilon
$
Where:
- $\epsilon$: A random noise vector sampled from a standard Gaussian distribution.
- $\alpha_s = 1 - \beta_s$.
- $\bar{\alpha}_s = \prod_{i=1}^{s}\alpha_i$.

The full diffusion (training) process is outlined in Algorithm 2:
Algorithm 2: Diffusion or Training
1: Input:
2: Historical Sequence $S = [i_1, i_2, \ldots, i_n]$;
3: Learning Epochs $E$;
4: Maximum Diffusion Steps $t$;
5: Schedule $\beta$: Linear;
6: Hyperparameter $\lambda$ (mean and variance) for $\lambda_i$ sampling;
7: Approximator $f_\theta$;
8: Output:
9: Well-trained Approximator $f_{\theta^*}$;
10: while not converged do
11:   $s = \lfloor s' \rfloor,\ s' \sim U(0, t)$; // Diffusion step sampling
12:   $\mathbf{x}_s = \sqrt{\bar{\alpha}_s}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_s}\epsilon$; // Diffusion (where $\mathbf{x}_0$ is derived from $\mathbf{e}_{n+1}$ as above)
13:   $\mathbf{z}_i = \mathbf{e}_i + \lambda_i \odot (\mathbf{x}_s + \mathbf{d})$; // Distribution representation
14:   $\hat{\mathbf{x}}_0 = f_\theta([\mathbf{z}_1, \ldots, \mathbf{z}_n])$; // Reconstruction
15:   parameter update: minimize $\mathcal{L}_{CE}$ w.r.t. $\theta$;
16: end while
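The following is a hedged NumPy sketch of one training step of Algorithm 2. It assumes an `approximator` callable, a precomputed `step_embed` table of step embeddings, and treats the target embedding directly as $\mathbf{x}_0$ (a simplification of the one-step derivation above); it is not the authors' implementation:

```python
import numpy as np

def diffusion_training_step(e_hist, e_target, item_table, step_embed,
                            alpha_bars, approximator, lam=0.001, rng=None):
    """One illustrative step of Algorithm 2.

    approximator : callable mapping the adjusted sequence (n, dim) -> (dim,)
    item_table   : (num_items, dim) embedding table; row 0 is assumed to be the target
    step_embed   : (t_max, dim) table of diffusion-step embeddings
    """
    rng = np.random.default_rng() if rng is None else rng
    t_max = len(alpha_bars)

    s = int(rng.uniform(0, t_max))                                    # step sampling
    eps = rng.standard_normal(e_target.shape)
    x_s = np.sqrt(alpha_bars[s]) * e_target + np.sqrt(1 - alpha_bars[s]) * eps  # diffusion

    lambdas = rng.normal(lam, np.sqrt(lam), size=(e_hist.shape[0], 1))
    z = e_hist + lambdas * (x_s + step_embed[s])                      # distribution repr.

    x0_hat = approximator(z)                                          # reconstruction
    logits = item_table @ x0_hat                                      # inner products
    probs = np.exp(logits - logits.max()); probs /= probs.sum()       # softmax
    loss = -np.log(probs[0])                                          # cross-entropy (target at index 0)
    return loss
```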
4.2.4. Reverse Phase
The reverse phase (or inference phase) is where the trained DiffuRec model is used to predict the target item. It starts with a pure Gaussian noise and iteratively denoises it into the target item representation.
The goal is to recover $\mathbf{x}_0$ iteratively, starting from a standard Gaussian noise $\mathbf{x}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
At each reverse step, the well-trained Approximator is used to estimate $\hat{\mathbf{x}}_0$ from the current noisy representation $\mathbf{x}_t$ and the historical sequence. Then, this estimated $\hat{\mathbf{x}}_0$ is used to calculate the mean and variance for sampling $\mathbf{x}_{t-1}$.
The mean and variance for sampling are given by:
$
\tilde{\mu}_t(\mathbf{x}_t, \hat{\mathbf{x}}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\hat{\mathbf{x}}_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\mathbf{x}_t
$
$
\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\beta_t
$
Using the reparameterization trick, $\mathbf{x}_{t-1}$ is sampled as:
$
\mathbf{x}_{t-1} = \tilde{\mu}_t(\mathbf{x}_t, \hat{\mathbf{x}}_0) + \tilde{\beta}_t\,\epsilon'
$
Where $\epsilon'$ is another standard Gaussian noise vector.
This process is repeated iteratively from step $t$ down to step 1, until the final denoised representation $\mathbf{x}_0$ is obtained.
The full reverse (inference) process is outlined in Algorithm 1:
Algorithm 1: Reverse or Inference
1: Input:
2: Historical Sequence $S = [i_1, i_2, \ldots, i_n]$;
3: Target Item Representation $\mathbf{x}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$;
4: Total Reverse Steps $t$;
5: Schedule $\beta$: Linear;
6: Hyperparameter $\lambda$ (mean and variance) for $\lambda_i$ sampling;
7: Approximator $f_\theta$;
8: Output:
9: Predicted Target Item $i_{n+1}$;
10: while $t \geq 1$ do
11:   $\mathbf{z}_i = \mathbf{e}_i + \lambda_i \odot (\mathbf{x}_t + \mathbf{d})$; // Distribution representation
12:   $\hat{\mathbf{x}}_0 = f_\theta([\mathbf{z}_1, \ldots, \mathbf{z}_n])$; // Reconstruction
13:   $\mathbf{x}_{t-1} = \tilde{\mu}_t(\mathbf{x}_t, \hat{\mathbf{x}}_0) + \tilde{\beta}_t\,\epsilon'$; // Reverse
14:   $t \leftarrow t - 1$; // Iteration
15: end while
16: $i_{n+1} = \mathrm{Rounding}(\mathbf{x}_0)$
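A minimal NumPy sketch of the reverse loop in Algorithm 1, following the mean/variance equations above (the `approximator` callable, schedule arrays, and `step_embed` table are assumptions of this sketch):

```python
import numpy as np

def reverse_inference(e_hist, step_embed, alphas, alpha_bars, betas,
                      approximator, lam=0.001, rng=None):
    """Iteratively denoise a random x_t back to an estimate of x_0."""
    rng = np.random.default_rng() if rng is None else rng
    dim = e_hist.shape[1]
    x = rng.standard_normal(dim)                        # x_t ~ N(0, I)

    for t in range(len(betas) - 1, 0, -1):
        lambdas = rng.normal(lam, np.sqrt(lam), size=(e_hist.shape[0], 1))
        z = e_hist + lambdas * (x + step_embed[t])      # distribution representation
        x0_hat = approximator(z)                        # reconstruction

        coef0 = np.sqrt(alpha_bars[t - 1]) * betas[t] / (1 - alpha_bars[t])
        coeft = np.sqrt(alphas[t]) * (1 - alpha_bars[t - 1]) / (1 - alpha_bars[t])
        mu = coef0 * x0_hat + coeft * x                 # posterior mean
        beta_tilde = (1 - alpha_bars[t - 1]) / (1 - alpha_bars[t]) * betas[t]
        x = mu + beta_tilde * rng.standard_normal(dim)  # sample x_{t-1}
    return x                                            # approximately x_0
```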
4.2.5. Loss Function and Rounding
Loss Function:
Unlike typical Diffusion models that often use a mean-squared error (MSE) loss (e.g., to predict the added noise $\epsilon$), DiffuRec employs cross-entropy loss for model optimization. This choice is motivated by two reasons:
- Discrete Nature of Item Embeddings: Item embeddings in recommender systems are often static and represent discrete entities in a continuous space, making MSE less stable. Cross-entropy is more appropriate for predicting one out of many discrete items.
- Relevance Calculation: In sequential recommendation, the inner product is a common way to calculate relevance between item representations, which aligns well with a cross-entropy loss used for ranking.

The cross-entropy loss is calculated as follows. First, a softmax function is applied to the inner products between the reconstructed target item representation and all candidate item embeddings to obtain predicted probabilities:
$
\hat{y} = \frac{\exp\left(\hat{\mathbf{x}}_0 \cdot \mathbf{e}_{n+1}\right)}{\sum_{i \in \mathcal{J}} \exp\left(\hat{\mathbf{x}}_0 \cdot \mathbf{e}_i\right)}
$
Then, the cross-entropy loss is computed over all users:
$
\mathcal{L}_{CE} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} - \log \hat{y}_u
$
Where:
- $\hat{y}$: The predicted probability for the true target item.
- $\hat{\mathbf{x}}_0$: The reconstructed target item representation from the Approximator.
- $\mathbf{e}_{n+1}$: The embedding of the actual target item.
- $\mathcal{J}$: The set of all candidate items (including the target item and negative samples).
- $\mathbf{e}_i$: The embedding of a candidate item $i$.
- $\cdot$: Inner product operation.
- $\mathcal{U}$: The set of all users.
Rounding Operation:
After the reverse phase produces the continuous representation $\mathbf{x}_0$, a rounding operation is needed to map it to a discrete item index for the final recommendation. This is achieved by calculating the inner product between $\mathbf{x}_0$ and all candidate item embeddings $\mathbf{e}_i$, and selecting the item with the maximal inner product score.
$
\mathrm{Rounding}(\mathbf{x}_0) = \underset{i \in \mathcal{I}}{\mathrm{arg}\max} \quad \mathbf{x}_0 \cdot \mathbf{e}_i^T
$
Where $\mathcal{I}$ is the set of all items. This inner product is also used to rank the candidate items for the final recommendation list.
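A small sketch of this rounding and ranking step, assuming a NumPy item-embedding table (names and shapes are illustrative):

```python
import numpy as np

def rounding(x0, item_table):
    """Map the continuous x_0 to a discrete item via maximal inner product,
    and return the full ranking used to build the top-K recommendation list."""
    scores = item_table @ x0            # inner product with every candidate embedding
    ranking = np.argsort(-scores)       # item indices sorted by decreasing score
    return ranking[0], ranking          # predicted item index, full ranked list

# Example: a catalog of 1,000 items with 128-d embeddings
items = np.random.randn(1000, 128)
pred, ranked = rounding(np.random.randn(128), items)
```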
4.2.6. Discussion on Uncertainty and Adversarial Training
DiffuRec inherently injects uncertainty through two main mechanisms:
- Corrupted/Reversed Target Item Representation: The forward diffusion process adds Gaussian noise to the target item, and the reverse (inference) process starts from Gaussian noise. This means $\mathbf{x}_s$ and $\mathbf{x}_t$ are inherently stochastic, representing distributions rather than fixed points.
- Gaussian Sampling for $\lambda_i$: The parameter $\lambda_i$ used to adjust the historical item embeddings (Equation 13) is sampled from a Gaussian distribution. This explicitly introduces stochasticity and uncertainty into the item distribution representation generation.

The paper discusses that this injected uncertainty (small perturbations) can improve model robustness and alleviate over-fitting. To analyze this, DiffuRec is compared against adversarial training, which also aims to improve robustness by adding adversarial perturbations.
Adversarial Training Formulation:
Adversarial training transforms the optimization into a min-max problem. First, effective perturbations are found that maximize the loss, making the model vulnerable. Then, the model is trained to minimize both the original loss and the additional loss with these adversarial perturbations.
The min-max optimization problem is formalized as:
$
\Theta^*, \Delta^* = \underset{\Theta}{\arg\min}\ \underset{\Delta, \|\Delta\|\leq\epsilon}{\max}\ \mathcal{L}(\mathcal{D} \mid \Theta) + \gamma \mathcal{L}(\mathcal{D} \mid \Theta + \Delta)
$
Where:
- $\Theta$: Learnable parameters of the model.
- $\Delta$: Perturbations added to embeddings.
- $\epsilon$: Regulates the magnitude of perturbations.
- $\gamma$: Balances the adversarial regularizer and the main loss.
- $\|\cdot\|$: norm.
- $\mathcal{L}(\mathcal{D} \mid \Theta)$: The original loss function (e.g., cross-entropy) on dataset $\mathcal{D}$.
- $\mathcal{L}(\mathcal{D} \mid \Theta + \Delta)$: The loss function with adversarial perturbations.

Finding the exact $\Delta^*$ is intractable. Following [13], adversarial perturbations are approximated by moving in the gradient direction of the objective function:
$
\Delta_{adv} = \epsilon \frac{\Gamma}{\|\Gamma\|} \quad \mathrm{where} \quad \Gamma = \frac{\partial \mathcal{L}(\mathcal{D} \mid \Theta + \Delta^t)}{\partial \Delta^t}
$
Here, $\Delta^t$ are the perturbations in the $t$-th training epoch, initialized as zeros and updated using $\Gamma$. The final loss function for adversarial training becomes:
$
\mathcal{L}^*(\mathcal{D} \mid \Theta) = \mathcal{L}(\mathcal{D} \mid \Theta) + \gamma \mathcal{L}(\mathcal{D} \mid \Theta + \Delta_{adv})
$
The paper argues that DiffuRec's injected uncertainty is superior because it is not merely noise added for robustness: it also carries auxiliary information from the target item representation, acting as a supervised signal that helps capture user intentions, which adversarial training typically lacks.
5. Experimental Setup
5.1. Datasets
The experiments were conducted on four real-world datasets widely used for sequential recommendation tasks.
- Amazon Beauty: A subset of the Amazon review dataset, focusing on beauty products.
- Amazon Toys: Another subset of the Amazon review dataset, focusing on toys.
- Movielens-1M: A benchmark dataset containing one million movie ratings from 6,000 users on 4,000 movies.
- Steam: A dataset collected from a large online video game distribution platform, containing interactions with over 40,000 games and rich external information.
Data Preprocessing:
- All reviews/ratings are treated as implicit feedback (user-item interactions).
- Interactions are chronologically ordered by timestamps.
- Unpopular items and inactive users (fewer than 5 interactions) are filtered out.
- A leave-one-out strategy is used for evaluation: the most recent interaction is for testing, the penultimate for validation, and earlier ones for training.
- Maximum sequence length is set to 200 for Movielens-1M and 50 for the other three datasets.

The following are the results from Table 1 of the original paper:
| Dataset | # Sequences | # Items | # Actions | Avg. length | Sparsity |
|---|---|---|---|---|---|
| Beauty | 22,363 | 12,101 | 198,502 | 8.53 | 99.93% |
| Toys | 19,412 | 11,924 | 167,597 | 8.63 | 99.93% |
| Movielens-1M | 6,040 | 3,416 | 999,611 | 165.50 | 95.16% |
| Steam | 281,428 | 13,044 | 3,485,022 | 12.40 | 99.90% |
These datasets were chosen because they represent a broad spectrum of real-world scenarios, varying in size, sequence length, and sparsity, which allows for a comprehensive validation of the method's performance under different conditions.
5.2. Evaluation Metrics
All models are evaluated using Hit Rate (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K). The experiments report results for $K \in \{5, 10, 20\}$. Evaluation is performed by ranking all candidate items (no sampling of negative items).
5.2.1. Hit Rate (HR@K)
- Conceptual Definition:
HR@Kmeasures the proportion of users for whom the target item (the item they actually interacted with next) is present within the top- items recommended by the system. It indicates whether the relevant item was found within the candidate list, regardless of its position. - Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of users with hit in top-K}}{\text{Total number of users}} $ A hit occurs if the true next item is in the top- recommended list.
- Symbol Explanation:
- Number of users with hit in top-K: The count of users for whom the ground-truth next item is found among the top $K$ recommended items.
- Total number of users: The total number of users in the evaluation set.
5.2.2. Normalized Discounted Cumulative Gain (NDCG@K)
-
Conceptual Definition:
NDCG@Kis a ranking quality metric that considers not only whether a relevant item is in the top- list but also its position. Higher positions for relevant items contribute more to the score. It's especially useful for scenarios where the order of recommended items matters. -
Mathematical Formula: First,
Discounted Cumulative Gain (DCG@K)is calculated: $ \mathrm{DCG@K} = \sum_{j=1}^{K} \frac{2^{rel_j} - 1}{\log_2(j+1)} $ Then,NDCG@Kis calculated by normalizingDCG@Kby theIdeal DCG (IDCG@K): $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $ Where:- is the maximum possible
DCG@Kif all relevant items were perfectly ranked at the top. - For
implicit feedback(next-item predictionwhere relevance is binary), is 1 if the item at rank is the true next item, and 0 otherwise. Thus, the formula simplifies to if the item at rank is the target, and 0 otherwise.
- is the maximum possible
-
Symbol Explanation:
- $K$: The number of top items in the recommendation list being considered.
- $j$: The rank of an item in the recommendation list.
- $rel_j$: The relevance score of the item at rank $j$. For binary relevance (hit/no hit), $rel_j = 1$ if the item at rank $j$ is the target item, and $rel_j = 0$ otherwise.
- $\log_2(j+1)$: The logarithmic discount factor, which reduces the contribution of relevant items as their rank increases.
- $\mathrm{DCG@K}$: The Discounted Cumulative Gain up to rank $K$.
- $\mathrm{IDCG@K}$: The Ideal Discounted Cumulative Gain up to rank $K$, i.e., the DCG achieved by a perfect ranking where all relevant items are placed at the top.
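As a worked illustration of the two metrics, here is a small Python sketch computing HR@K and NDCG@K for single-target evaluation (the item ids and example data are made up):

```python
import numpy as np

def hr_and_ndcg_at_k(ranked_items, target_item, k=10):
    """Compute HR@K and NDCG@K for one user with a single ground-truth item.

    ranked_items : item ids sorted by decreasing predicted score
    target_item  : the true next item
    """
    topk = list(ranked_items[:k])
    if target_item not in topk:
        return 0.0, 0.0                       # no hit
    rank = topk.index(target_item) + 1        # 1-based position j
    hr = 1.0
    ndcg = 1.0 / np.log2(rank + 1)            # IDCG = 1 for a single relevant item
    return hr, ndcg

# Example: average over two users
pairs = [([3, 7, 1], 7), ([5, 2, 9], 4)]
hits, ndcgs = zip(*[hr_and_ndcg_at_k(r, t, k=10) for r, t in pairs])
print(np.mean(hits), np.mean(ndcgs))
```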
5.3. Baselines
DiffuRec is compared against nine baselines from three categories:
5.3.1. Conventional Sequential Neural Models
- GRU4Rec [19]: Uses a GRU for sequential modeling.
- Caser [44]: Employs horizontal and vertical CNNs to capture recent sub-sequence behaviors.
- SASRec [25]: A Transformer-based model with uni-directional masking for self-attention.
- BERT4Rec [42]: Uses a bi-directional Transformer and a cloze task for sequential recommendation.
5.3.2. Multi-Interest Models
- ComiRec [4]: Adopts an attention mechanism and dynamic routing for multi-interest extraction.
- TiMiRec [48]: A recent multi-interest model that uses an auxiliary loss function with the target item as a supervised signal.
5.3.3. VAE and Uncertainty Models
- SVAE [36]: Combines a GRU with a Variational Autoencoder for next-item prediction.
- ACVAE [53]: Uses the Adversarial Variational Bayes (AVB) framework to enhance latent variable representations.
- STOSA [10]: Develops stochastic self-attention with Wasserstein Distance and Gaussian distributions for uncertainty injection.
5.4. Implementation Details
- Optimizer: Adam with an initial learning rate of 0.001.
- Parameter Initialization: Transformer parameters initialized with the Xavier normal distribution.
- Model Architecture: Number of Transformer blocks set to 4.
- Dimensions: Both the embedding dimension and all hidden state sizes fixed at 128.
- Batch Size: 1024.
- Dropout Rate: 0.1 for Transformer blocks, 0.3 for item embeddings across all datasets.
- Parameter $\lambda$: The scalars $\lambda_i$ are sampled from a Gaussian distribution with both mean and variance set to $\lambda$.
- Total Reverse Steps ($t$): Set to 32. Further analysis is conducted for other values.
- Noise Schedule ($\beta$): Truncated linear schedule (Equation 14) is primarily used, with comparisons against sqrt, cosine, and linear schedules.
- Experiment Repetitions: Each method's experiments are run five times, and averaged results are reported.
- Statistical Significance: Student's t-test is used for statistical significance.
6. Results & Analysis
6.1. Core Results Analysis (RQ1)
The following are the results from Table 2 of the original paper:
| Dataset | Metric | GRU4Rec | Caser | SASRec | BERT4Rec | ComiRec | TiMiRec | SVAE | ACVAE | STOSA | DiffuRec | % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | HR@5 | 1.0112 | 1.6188 | 3.2688 | 2.1326 | 2.0495 | 1.9044 | 0.9943 | 2.4672 | 3.5457 | 5.5758* | 57.26% |
| HR@10 | 1.9370 | 2.8166 | 6.2648 | 3.7160 | 4.4545 | 3.3434 | 1.9745 | 3.8832 | 6.2048 | 7.9068* | 26.21% | |
| HR@20 | 3.8531 | 4.4048 | 8.9791 | 5.7922 | 7.6968 | 5.1674 | 3.1552 | 6.1224 | 9.5939 | 11.1098* | 15.80% | |
| NDCG@5 | 0.6084 | 0.9758 | 2.3989 | 1.3207 | 1.0503 | 1.2438 | 0.6702 | 1.6858 | 2.5554 | 4.0047* | 56.72% | |
| NDCG@10 | 0.9029 | 1.3602 | 3.2305 | 1.8291 | 1.8306 | 1.7044 | 0.9863 | 2.1389 | 3.2085 | 4.7494* | 47.02% | |
| NDCG@20 | 1.3804 | 1.7595 | 3.6563 | 2.3541 | 2.6451 | 2.1627 | 1.2867 | 2.7020 | 3.7609 | 5.5566* | 47.75% | |
| Toys | HR@5 | 1.1009 | 0.9622 | 4.5333 | 1.9260 | 2.3026 | 1.1631 | 0.9109 | 2.1897 | 4.2236 | 5.5650* | 22.76% |
| HR@10 | 1.8553 | 1.8317 | 6.5496 | 2.9312 | 4.2901 | 1.8169 | 1.3683 | 3.0749 | 6.9393 | 7.4587* | 7.48% | |
| HR@20 | 3.1827 | 2.9500 | 9.2263 | 4.5889 | 6.9357 | 2.7156 | 1.9239 | 4.4061 | 9.5096 | 9.8417* | 3.49% | |
| NDCG@5 | 0.6983 | 0.5707 | 3.0105 | 1.1630 | 1.1571 | 0.7051 | 0.5580 | 1.5604 | 3.1017 | 4.1667* | 34.34% | |
| NDCG@10 | 0.9396 | 0.8510 | 3.7533 | 1.4870 | 1.7953 | 0.9123 | 0.7063 | 1.8452 | 3.8806 | 4.7724* | 22.98% | |
| NDCG@20 | 1.2724 | 1.1293 | 4.3323 | 1.9038 | 2.4631 | 1.1374 | 0.8446 | 2.1814 | 4.3789 | 5.3684* | 22.60% | |
| Movielens-1M | HR@5 | 5.1139 | 7.1401 | 9.3812 | 13.6393 | 6.1073 | 16.2176 | 1.4869 | 12.7167 | 7.0495 | 17.9659* | 10.78% |
| HR@10 | 10.1664 | 13.3792 | 16.8941 | 20.5675 | 12.0406 | 23.7142 | 2.7189 | 19.9313 | 14.3941 | 26.2647* | 10.76% | |
| HR@20 | 18.6995 | 22.5507 | 28.318 | 29.9479 | 21.0094 | 33.2293 | 5.0326 | 28.9722 | 24.9871 | 36.7870* | 10.71% | |
| NDCG@5 | 3.0529 | 4.1550 | 5.3165 | 8.8922 | 3.5214 | 10.8796 | 0.9587 | 8.2287 | 3.7174 | 12.1150* | 11.36% | |
| NDCG@10 | 4.6754 | 6.1400 | 7.7277 | 11.1251 | 5.4076 | 13.3059 | 1.2302 | 10.5417 | 6.0771 | 14.7909* | 11.16% | |
| NDCG@20 | 6.8228 | 8.4304 | 10.5946 | 13.4763 | 7.6502 | 15.7019 | 1.8251 | 12.8210 | 8.7241 | 17.4386* | 11.06% | |
| Steam | HR@5 | 3.0124 | 3.6053 | 4.7428 | 4.7391 | 2.2872 | 6.0155 | 3.2384 | 5.5825 | 4.8546 | 6.6742* | 10.95% |
| HR@10 | 5.4257 | 6.4940 | 8.3763 | 7.9448 | 5.4358 | 9.6697 | 5.8275 | 9.2783 | 8.5870 | 10.7520* | 11.19% | |
| HR@20 | 9.2319 | 10.9633 | 13.6060 | 12.7332 | 10.3663 | 14.8884 | 7.9753 | 14.4846 | 14.1107 | 16.6507* | 11.84% | |
| NDCG@5 | 1.8293 | 2.1586 | 2.8842 | 2.9708 | 1.0965 | 3.8721 | 1.8836 | 3.5429 | 2.9220 | 4.2902* | 10.80% | |
| NDCG@10 | 2.6033 | 3.0846 | 4.0489 | 4.0002 | 2.1053 | 5.0446 | 2.6881 | 4.7290 | 4.1191 | 5.5981* | 10.97% | |
| NDCG@20 | 3.5572 | 4.2073 | 5.3630 | 5.2027 | 3.3434 | 6.3569 | 3.2323 | 6.0374 | 5.5072 | 7.0810* | 11.39% |
- DiffuRec consistently outperforms all baselines across all four datasets (Amazon Beauty, Amazon Toys, Movielens-1M, and Steam) and all six metrics (HR@K and NDCG@K for K = 5, 10, 20). This strong performance confirms the effectiveness of modeling item representations as distributions and incorporating uncertainty via a diffusion model.
- The performance gains are substantial, with improvements up to 57.26% / 56.72% (HR/NDCG) on Amazon Beauty and 22.76% / 34.34% on Amazon Toys. On the larger Movielens-1M and Steam datasets, improvements are still significant, around 10-11%. This suggests DiffuRec's unified framework effectively addresses the challenges of multiple latent aspects, multiple user interests, uncertainty, and target item guidance.
- Conventional SR Models: GRU4Rec and Caser show the lowest performance, likely due to their limited ability to capture long-term dependencies compared to Transformer-based models. SASRec and BERT4Rec perform better, especially SASRec, highlighting the effectiveness of the Transformer architecture for sequential recommendation.
- Multi-Interest Models: ComiRec and TiMiRec do not consistently outperform SASRec and BERT4Rec. TiMiRec performs better than ComiRec on the Movielens-1M and Steam datasets, which are more complex and have longer sequences. This is attributed to TiMiRec's use of the target item as an auxiliary supervised signal, a concept that DiffuRec also leverages in a more integrated manner.
- VAE and Uncertainty Models: SVAE shows the weakest performance among this group. ACVAE and STOSA achieve competitive results. STOSA, in particular, demonstrates the value of distributional representation learning and uncertainty injection, aligning with DiffuRec's core principles and providing strong support for the direction of this research. DiffuRec's superior performance over STOSA suggests that the diffusion framework provides a more robust and effective way to achieve these benefits than Wasserstein-distance-based self-attention.
6.2. Ablation Study (RQ2)
The following are the results from Table 3 of the original paper:
| Dataset | Ablation | HR@5 | HR@10 | HR@20 | NDCG@5 | NDCG@10 | NDCG@20 |
|---|---|---|---|---|---|---|---|
| Beauty | w GRU | 3.1773 | 4.6685 | 6.9000 | 2.2253 | 2.7075 | 3.2682 |
| w R* | 1.3036 | 2.2902 | 3.4768 | 0.7964 | 1.1130 | 1.4108 | |
| DiffuRec | 5.5758 | 7.9068 | 11.1098 | 4.0047 | 4.7494 | 5.5566 | |
| Toys | w GRU | 2.3895 | 3.4150 | 4.9407 | 1.7577 | 2.0849 | 2.4700 |
| w R* | 0.1194 | 0.4133 | 0.7480 | 0.0567 | 0.1510 | 0.2343 | |
| DiffuRec | 5.5650 | 7.4587 | 9.8417 | 4.1667 | 4.7724 | 5.3684 | |
| Movielens-1M | w GRU | 16.6016 | 24.6094 | 34.9772 | 10.9553 | 13.5348 | 16.1403 |
| w R* | 4.0690 | 6.0872 | 9.1688 | 2.7787 | 3.4297 | 4.1978 | |
| DiffuRec | 17.9659 | 26.2647 | 36.7870 | 12.1150 | 14.7909 | 17.4386 | |
| Steam | w GRU | 6.2832 | 10.1723 | 15.8169 | 4.0056 | 5.2548 | 6.6727 |
| w R* | 1.4873 | 2.5977 | 4.3086 | 0.9022 | 1.2582 | 1.6877 | |
| DiffuRec | 6.6742 | 10.7520 | 16.6507 | 4.2902 | 5.5981 | 7.0810 |
6.2.1. Approximator Type (w GRU vs. DiffuRec)
- When the Transformer Approximator is replaced with a GRU (w GRU), DiffuRec's performance drops, but it still achieves strong results on Movielens-1M and Steam. This suggests that the core diffusion mechanism is effective even with a simpler backbone, and the Transformer enhances its capabilities. The observation that a GRU can still yield good results with diffusion is consistent with CV, where simpler architectures like U-Net work well with Diffusion models.
6.2.2. Rounding Operation ( vs. DiffuRec)
- Replacing the inner-product-based rounding operation (Equation 19) with the variant used in [12] (w R*) leads to a drastic performance drop (e.g., HR/NDCG reduced by 68-95% across datasets). This strongly validates that using the inner product for both loss calculation and final ranking is crucial and aligns with typical practice in sequential recommendation. The alternative rounding function might not be suitable for this domain.
6.2.3. Noise Schedule (w Cosine, w L, w Sqrt vs. DiffuRec)
The performance impact of different noise schedules is visualized in the following figure:
This chart shows DiffuRec's recommendation performance (HR@K and NDCG@K) on the four datasets (Amazon Beauty, Amazon Toys, Movielens-1M, Steam) under different noise schedules, with separate line styles for the cosine, linear, sqrt, and truncated linear schedules.
Figure 3. Performance of DiffuRec with different schedule.
The plot showing values for different noise schedules is presented below:
This chart shows how the $\bar{\beta}_t$ values change over diffusion steps under different schedules, with four curves: linear (blue), cosine (orange), sqrt (green), and truncated linear (red), to help understand how each schedule shapes the diffusion process.
Figure 4. $\bar{\beta}_t$ throughout diffusion under different schedules.
- Figure 3 shows that the Truncated Linear schedule generally performs best on Amazon Beauty and Amazon Toys, while the Linear schedule can sometimes achieve sub-optimal (second-best) performance on the other datasets.
- Figure 4 illustrates that the Truncated Linear schedule provides a near-linear sharp drop in the middle of the process and subtle changes near the extremes. The authors hypothesize that this sharp corruption (adding more noise) during training is beneficial for denoising ability, especially for one-step estimation in the reverse process.
- The ablation indicates that while noise schedules have some impact, they do not cause large fluctuations, suggesting that the core diffusion framework's benefits are relatively robust to this choice, as noted in previous diffusion model literature [17].
6.3. Impact of Hyper-parameter Setting (RQ3)
The sensitivity of DiffuRec's performance to the hyperparameter $\lambda$ (controlling the mean and variance of the $\lambda_i$ sampling distribution) is presented below:
This chart shows, for the Amazon Beauty and Movielens-1M datasets, how different values of the parameter $\lambda$ (0.001, 0.01, 0.1, and 1) affect HR@K and NDCG@K, with bars for HR@5/10/20 and NDCG@5/10/20.
Figure 5. Parameter sensitivity of $\lambda$.
- Figure 5 demonstrates that DiffuRec is sensitive to the value of $\lambda$. A small value of $\lambda$ generally yields the best performance on both the Amazon Beauty and Movielens-1M datasets.
- Increasing $\lambda$ to 0.1 or 1.0 leads to a sharp drop in HR and NDCG, particularly noticeable on Movielens-1M.
- Analysis: This suggests that a very large $\lambda$ introduces excessive noise into the historical interaction sequences. This excessive noise corrupts the original information, making it difficult for the model to precisely understand the user's interests and hindering effective representation generation. A smaller $\lambda$ allows for controlled stochasticity and uncertainty injection without overwhelming the meaningful signals.
6.4. Compared with Adversarial Training (RQ4)
The comparison between DiffuRec and adversarial training is shown in the following figure:
This chart compares adversarial training and DiffuRec on four datasets (Amazon Beauty, Amazon Toys, Movielens-1M, and Steam); the x-axis lists the metrics (HR@5, HR@10, HR@20, NDCG@5, NDCG@10, NDCG@20) and the y-axis the corresponding scores, with blue bars for adversarial training and orange bars for DiffuRec.
Figure 6. Performance of adversarial training and DiffuRec.
- Figure 6 shows that an adversarial Transformer (using BERT4Rec as the backbone) performs better than the standard BERT4Rec baseline. Specifically, the adversarial Transformer improves BERT4Rec by at least 2.57% / 10.23% (HR/NDCG) on Amazon Beauty and 8.05% / 4.89% on Movielens-1M. This validates that injecting perturbations through adversarial training can indeed enhance model robustness and performance.
- However, DiffuRec significantly outperforms the adversarial Transformer across all datasets.
- Analysis: The authors explain this superiority by arguing that DiffuRec's uncertainty injection is more sophisticated. The perturbations in DiffuRec (from the diffusion and reverse phases) are not merely pure noise for robustness. Instead, they are infused with auxiliary information from the target item representation, which acts as a supervised signal to help the model capture the user's current intentions. This target-item-aware uncertainty injection is a key advantage that differentiates DiffuRec from generic adversarial training methods.
6.5. Performance on Varying Lengths of Sequences and Items with Different Popularity (RQ5)
The distribution of sequence length and item interaction times for Amazon Beauty and Movielens-1M datasets is shown below:
This bar chart shows the distribution of sequence lengths (blue, left) and item interaction counts (orange, right) for the Amazon Beauty and Movielens-1M datasets, reflecting the characteristics of user behavior in the two datasets.
Figure 7. Frequency histogram of sequence length and item interaction times for Amazon Beauty and Movielens-1M dataset. The maximum length and maximum interactions are 204/431 and 2341/3428 respectively for Amazon Beauty and Movielens-1M datasets.
6.5.1. Head and Long-tail Items
The performance on head (top 20% most frequent) and long-tail items is shown in the following figure:
This chart compares DiffuRec with other methods on head and long-tail items for the Amazon Beauty and Movielens-1M datasets, reporting HR@10 and NDCG@10; DiffuRec outperforms the compared methods on both item groups.
Figure 8. Performance on the head and long-tail items.
- Figure 8 reveals that for all models, performance on head items is significantly better than on long-tail items. This is a common challenge in recommender systems due to data sparsity for less popular items.
- The performance gap between head and long-tail items is smaller on Movielens-1M than on Amazon Beauty. This might be because Movielens-1M has fewer total items, leading to a less pronounced distinction between head and long-tail items in terms of data density.
- Despite this challenge, DiffuRec consistently outperforms all baselines for both head and long-tail items on the HR@10 and NDCG@10 metrics, demonstrating robust performance across item popularity levels. Its superiority in NDCG implies better ranking of relevant items, even less popular ones.
6.5.2. Different Sequence Length
The performance of different baselines on varied sequence lengths is depicted in the following figure:
The figure compares DiffuRec with other recommenders (e.g., SASRec, TimiRec, STOSA) under different sequence lengths on the Amazon Beauty and Movielens-1M datasets; the x-axis is sequence length and the y-axes are HR@10 and NDCG@10.
Figure 9. Performance on different length of sequence.
- Figure 9 shows contrasting trends between the Amazon Beauty and Movielens-1M datasets regarding sequence length:
  - On Amazon Beauty, DiffuRec performs better on longer sequences.
  - On Movielens-1M, performance generally declines as sequence length increases for all models, including DiffuRec.
- Analysis: The authors point out that 80% of sequences in Amazon Beauty are shorter than 10, while for Movielens-1M, 80% are longer than 37. This suggests that very short sequences might not provide enough information for accurate recommendations, while very long sequences pose a challenge for all models in capturing relevant long-term dependencies and filtering out noise.
- Regardless of these trends, DiffuRec consistently maintains the best performance across all sequence length categories when compared to other baselines. This indicates that DiffuRec is robust to varying sequence lengths, adapting well to both shorter and longer user interaction histories.
6.6. Convergence and Efficiency (RQ6)
6.6.1. Convergence
The loss curves on the training process for Amazon Beauty and Movielens-1M datasets are shown below:
The figure shows the training-loss curves of SASRec (left) and DiffuRec (right) on the Amazon Beauty and Movielens-1M datasets, with training epochs on the x-axis and loss on the y-axis; DiffuRec's loss decreases more sharply during training.
Figure 10. Curve of training loss on Amazon Beauty and Movielens-1M datasets.
- Figure 10 compares the convergence speed of DiffuRec with SASRec (a representative Transformer-based baseline).
- On Amazon Beauty, both DiffuRec and SASRec converge around 150 epochs.
- On Movielens-1M, DiffuRec demonstrates faster convergence (around 100 epochs) compared to SASRec (around 250 epochs).
- Analysis: This faster convergence on Movielens-1M (which has a much longer average sequence length, 165.50 vs. 8.53 for Amazon Beauty) is attributed to DiffuRec's ability to induce the user's current interest by extracting information from the target item representation as auxiliary information. This target item guidance accelerates model learning, especially when dealing with longer and more complex sequences where SASRec might struggle to efficiently capture long-term dependencies.
6.6.2. Efficiency
The inference time and average training time on the Amazon Beauty dataset are shown below:
The figure shows timing results under different diffusion steps: the line chart on the left compares the running time of SASRec, BERT4Rec, and other methods as the number of steps varies, and the bar chart on the right shows the total time consumption of DiffuRec and the compared methods at a fixed number of steps.
Figure 11. Inference time (in seconds for one sample) on Amazon Beauty dataset (left). The points in the inset illustrate performance (Recall@K and NDCG@K) at different reverse steps. Average training time (in seconds for one batch) on Amazon Beauty dataset (right).
The following are the results from Table 4 of the original paper:
| Model | GRU4Rec | SASRec | BERT4Rec | ACVAE | STOSA | DiffuRec |
|---|---|---|---|---|---|---|
| Complexity | $O(nd)$ | $O(d^2 + n^2)$ | $O(d^2 + n^2)$ | $O(nd)$ | $O(d^2 + n^2)$ | $O(n^2 + dn)$ |
where $d$ is the representation dimension and $n$ is the sequence length.
- Time Complexity (Table 4):
  - GRU4Rec and ACVAE (both GRU-based) have a complexity of $O(nd)$, linear in the sequence length $n$.
  - SASRec, BERT4Rec, and STOSA (all Transformer-based) have a complexity of $O(d^2 + n^2)$, due to the self-attention mechanism's quadratic dependency on sequence length.
  - DiffuRec has a complexity of $O(n^2 + dn)$. This aligns with Transformer-based models, given its Transformer Approximator. The $dn$ term likely comes from operations involving the item embeddings and step embeddings within the diffusion process.
- Inference Time (Figure 11, left):
  - With a very small total number of diffusion steps, DiffuRec's inference time is comparable to other Transformer-based models (SASRec, BERT4Rec).
  - As the number of reverse steps increases, DiffuRec's inference time for a single sample grows steadily, which is expected because the reverse phase calls the Approximator once per step (see the sketch after this list).
  - The inset in Figure 11 (left) shows that performance (Recall@K and NDCG@K) fluctuates with the number of reverse steps; the full analysis of this fluctuation is provided in Figure 12.
  - GRU4Rec and ACVAE have generally faster inference times due to their lower time complexity. STOSA is slightly slower than pure Transformer-based models due to Wasserstein distance calculations.
  - The paper suggests that a moderate number of reverse steps (e.g., 32 or 64) offers a good balance between representation quality and inference speed.
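This per-step cost can be made concrete with a minimal sketch of a generic denoising loop, assuming (as in standard diffusion samplers, not necessarily the paper's exact update rule) that the Approximator is called once per reverse step; all names are illustrative.

```python
import torch

@torch.no_grad()
def reverse_denoise(approximator, history, num_steps: int, d: int) -> torch.Tensor:
    """Illustrative reverse phase: start from pure Gaussian noise and iteratively
    refine it into a target item representation.

    approximator: a callable (e.g., a Transformer) predicting x_0 from (history, x_t, step)
    history:      the encoded historical interaction sequence
    num_steps:    number of reverse steps; per-sample cost grows linearly with this
    d:            representation dimension
    """
    x_t = torch.randn(d)  # start from N(0, I)
    for step in reversed(range(num_steps)):
        x0_hat = approximator(history, x_t, step)    # one network call per step
        noise = torch.randn(d) if step > 0 else 0.0  # no noise at the final step
        x_t = x0_hat + 0.1 * noise                   # simplified update rule
    return x_t
```

Doubling `num_steps` roughly doubles the number of Approximator calls, which matches the growth of per-sample inference time in Figure 11 (left).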
- Training Time (Figure 11, right):
  - GRU4Rec has significantly lower training time per batch due to its simpler and more lightweight architecture.
  - SASRec, ACVAE, STOSA, DiffuRec, and BERT4Rec have very similar training times, ranging from 14s to 18s per batch.
  - Analysis: This indicates that the training process of DiffuRec is not a bottleneck compared to other Transformer-based or VAE-based models, despite its sophisticated diffusion mechanism. The main computational overhead is concentrated in the iterative inference phase.

The performance (Recall@K and NDCG@K) at different reverse steps on the Amazon Beauty and Movielens-1M datasets is shown below:
The figure shows how HR@K and NDCG@K change with the number of reverse steps on the Amazon Beauty and Movielens-1M datasets: top-left, HR for Amazon Beauty; top-right, NDCG for Amazon Beauty; bottom-left, HR for Movielens-1M; bottom-right, NDCG for Movielens-1M.
Figure 12. Recall@K and NDCG@K at different reverse steps on Amazon Beauty and Movielens-1M datasets.
- Figure 12 shows how Recall@K and NDCG@K vary with the number of reverse steps.
- Generally, increasing the number of reverse steps initially improves performance as the denoising process becomes more refined, but beyond a certain point the gains diminish or even slightly fluctuate.
- For Amazon Beauty, performance peaks around 64 steps for HR@10 and NDCG@10. For Movielens-1M, it is around 32-64 steps. This confirms the earlier point about balancing quality and speed.
6.7. Visualization of Uncertainty and Diversity (RQ7)
6.7.1. Uncertainty
A t-SNE plot of the reconstructed representation from the reverse phase is shown below:
The figure is a t-SNE plot of the reconstructed $\mathbf{x}_0$ representations from the reverse phase. The left panel is Amazon Beauty, where purple points denote the same item reversed via 100 different Gaussian noises and yellow points denote other items reversed from random sequences; the right panel shows the same for Movielens-1M.
Figure 13. A t-SNE plot of the reconstructed representation from reverse phase. The purple points represent the same item but reversed by 100 different Gaussian noise for a specific historical sequence. The yellow points are the others each of which is reversed from a random sequence in Amazon Beauty and Movielens-1M datasets respectively.
- Figure 13 visualizes the uncertainty captured by DiffuRec using t-SNE.
- Yellow Points: The yellow points (reversed from random sequences) are dispersed uniformly throughout the 2D space, indicating high diversity across different user contexts.
- Purple Points: The purple points represent 100 different reconstructed target item representations for the same historical sequence, each generated from a different Gaussian noise initialization in the reverse phase. These points are clustered relatively closely but still exhibit distinct deviations from each other.
- Analysis: This visualization confirms that DiffuRec does not produce a single deterministic output for a given sequence, unlike most traditional models. Instead, it generates a distribution of plausible target item representations for the same input sequence, reflecting the inherent uncertainty in user preferences. This stochasticity is a key feature, allowing for diversified retrieval results by running the reverse process multiple times in parallel, which can be beneficial in the retrieval stage of recommender systems. For instance, reversing 100 different Gaussian noises for an exemplar sequence resulted in 643 and 82 unique items in the top-20 lists for Amazon Beauty and Movielens-1M, respectively. (A reproduction sketch of the plot follows this list.)
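The plot itself is straightforward to reproduce once the reconstructed representations are collected; below is a minimal sketch assuming they are already available as NumPy arrays (array names, shapes, and counts are placeholders).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder inputs: 100 reconstructions of the same target item (one per Gaussian
# noise) and reconstructions obtained from random other sequences.
same_item = np.random.randn(100, 64)
random_items = np.random.randn(500, 64)

# Project all reconstructed representations into 2D with t-SNE.
points_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([same_item, random_items])
)

plt.scatter(points_2d[100:, 0], points_2d[100:, 1], c="gold", s=8, label="random sequences")
plt.scatter(points_2d[:100, 0], points_2d[:100, 1], c="purple", s=8, label="same item, 100 noises")
plt.legend()
plt.show()
```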
6.7.2. Diversity
The categories of recommended items for a sampled sequence from Amazon Toys dataset are shown below:
| Model | Categories of Top-5 Items | Target Item Category |
|---|---|---|
| SASRec | Kids' Electronics, Electronic Toys, Learning & Education, Toy Remote Control & Play Vehicles, Play Vehicles | Electronic Toys |
| TimiRec | Toy Remote Control & Play Vehicles, Electronic Toys, Play Vehicles, Remote & App Controlled Vehicle Parts, Play Trains & Railway Sets | Electronic Toys |
| STOSA | Learning & Education, Playsets & Vehicles, Board Games, Train Sets, Wind-up Toys, Electronic Toys, Helicopter Kits, Remote & App Controlled Vehicles | Electronic Toys |
| DiffuRec | Card Games, Electronic Software & Books | Electronic Toys |
- Table 5 illustrates the diversity of recommendation results by showing the categories of the top-5 recommended items for a single user sequence from Amazon Toys. The target item category is 'Electronic Toys'.
- SASRec and TimiRec tend to recommend items from very similar and narrow categories (e.g., 'Electronic Toys', 'Vehicles'). This is especially true for TimiRec, which explicitly leverages the target item as a supervised signal, potentially constraining its recommendations to highly similar items.
- In contrast, STOSA and DiffuRec recommend a wider range of categories, moving beyond just 'Electronic Toys'. For example, DiffuRec suggests 'Card Games' and 'Electronic Software & Books'.
- Analysis: This demonstrates that the uncertainty injection inherent in STOSA and, more effectively, in DiffuRec leads to more diverse recommendation results. By modeling items and preferences as distributions and leveraging stochasticity, DiffuRec can explore a broader range of potentially relevant items, fostering greater flexibility and potentially increasing serendipity in recommendations, which is a desired characteristic of modern recommender systems. (The top-k lists themselves come from the rounding step sketched after this list.)
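For context on where these top-k lists come from: the final prediction is obtained by a rounding operation that ranks all candidate items by inner product with the reconstructed representation. A minimal sketch of that ranking step, with illustrative tensor names:

```python
import torch

def round_to_items(x0_hat: torch.Tensor, item_embeddings: torch.Tensor, k: int = 20) -> torch.Tensor:
    """Rank all candidate items against the reconstructed target representation.

    x0_hat:          (d,) reconstructed representation from the reverse phase
    item_embeddings: (num_items, d) embedding table of all candidate items
    """
    scores = item_embeddings @ x0_hat        # inner-product score for every candidate item
    return torch.topk(scores, k=k).indices   # indices of the k highest-scoring items

# Running the reverse phase several times with different Gaussian noises and rounding
# each reconstruction yields the diversified top-k lists discussed above.
```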
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper presents DiffuRec, a pioneering application of Diffusion models to Sequential Recommendation (SR). The core innovation is to replace the traditional fixed vector representation of items with distributional representations, which more effectively capture multiple latent aspects of items and diverse user interests. DiffuRec leverages the diffusion process to inject uncertainty and incorporate target item guidance into the representation learning. The model consists of a Transformer-based Approximator that learns to reconstruct target item representations during training and iteratively denoise Gaussian noise during inference, followed by a rounding operation for discrete item prediction. Extensive experiments on four real-world datasets demonstrate DiffuRec's significant superiority over strong baselines, confirming the effectiveness of its diffusion-based approach to SR. Ablation studies and detailed analyses further elucidate the importance of its components and its robustness across varying sequence lengths and item popularity.
7.2. Limitations & Future Work
The authors highlight several potential directions for future research:
- Other Diffusion Model Paradigms: Exploring different ways of adapting Diffusion models to recommendation tasks beyond the current DiffuRec framework.
- Alternative Recommendation Scenarios: Applying Diffusion models to other recommendation settings such as session-based recommendation (where user identity might be implicit or shorter sequences are common) or click-through rate (CTR) prediction.
7.3. Personal Insights & Critique
DiffuRec presents a highly innovative and promising direction for recommender systems. The core insight that Diffusion models are well-suited for distribution generation and diversity representation in SR is powerful.
Inspirations:
- Bridging Generative AI and Recommender Systems: This paper successfully imports a state-of-the-art generative paradigm (Diffusion models) from CV/NLP into recommender systems. This opens up a vast new area of research, suggesting that many other advanced generative models could be adapted to recommendation to address long-standing challenges like cold-start, data sparsity, and diversity.
- Unified Uncertainty and Multi-Interest Modeling: DiffuRec provides a more elegant and unified framework for handling user interest diversity and uncertainty compared to prior piecemeal solutions (e.g., explicit multi-interest networks or VAEs with their known issues). The idea of target-item-aware uncertainty injection is particularly compelling, as it grounds the stochasticity in a meaningful, supervised signal.
- Robustness and Serendipity: The demonstrated robustness across datasets and the increased diversity in recommendations (as shown in the t-SNE and category analysis) are highly desirable qualities for modern recommender systems, moving beyond simple accuracy to user satisfaction.
Potential Issues/Areas for Improvement:
- Inference Speed: While training time is comparable, the iterative reverse process makes inference increasingly slow as the number of steps grows. Though the authors suggest a moderate number of steps for balance, for real-time recommendation systems with extremely low latency requirements this could still be a bottleneck. Further research into faster diffusion sampling techniques (e.g., DDIM, ODE solvers) is critical for practical deployment.
- Complexity of Model Interpretation: Diffusion models, like many deep generative models, are complex. Interpreting why certain recommendations are made or how latent aspects are modeled within the distributions might be challenging. Future work could focus on explainable AI (XAI) for diffusion-based recommender systems.
- Hyperparameter Sensitivity: The sensitivity to λ indicates that careful hyperparameter tuning is necessary, and large values can degrade performance significantly. Robustness to hyperparameters could be an area of improvement.
- Scalability to Extremely Large Item Sets: While impressive on the tested datasets, sequential recommendation with cross-entropy loss and inner product ranking over all candidate items (as done in this paper) can be computationally intensive for systems with millions of items. Exploring sampling strategies or more efficient retrieval mechanisms for the rounding operation might be necessary for industrial-scale applications.
- Generalization to Other Data Types: The current work focuses on implicit feedback. It would be interesting to see how DiffuRec performs with explicit feedback (ratings) or side information (item features, user demographics).

Overall, DiffuRec represents a significant advancement, demonstrating the untapped potential of Diffusion models in recommender systems and setting a new benchmark for modeling dynamic user preferences and uncertainty.
Similar papers
Recommended via semantic vector search.