De-collapsing User Intent: Adaptive Diffusion Augmentation with Mixture-of-Experts for Sequential Recommendation
TL;DR Summary
The ADARec framework addresses data sparsity in sequential recommendation by using the step-wise denoising trajectory of a diffusion model together with a Mixture-of-Experts architecture that decouples coarse- and fine-grained user intents, effectively reconstructing intent hierarchies. Experiments show it outperforms existing methods on standard benchmarks and, by a wider margin, on sparse sequences.
Abstract
Sequential recommendation (SR) aims to predict users’ next action based on their historical behavior, and is widely adopted by a number of platforms. The performance of SR models relies on rich interaction data. However, in real-world scenarios, many users only have a few historical interactions, leading to the problem of data sparsity. Data sparsity not only leads to model overfitting on sparse sequences, but also hinders the model’s ability to capture the underlying hierarchy of user intents. This results in misinterpreting the user’s true intents and recommending irrelevant items. Existing data augmentation methods attempt to mitigate overfitting by generating relevant and varied data. However, they overlook the problem of reconstructing the user’s intent hierarchy, which is lost in sparse data. Consequently, the augmented data often fails to align with the user’s true intents, potentially leading to misguided recommendations. To address this, we propose the Adaptive Diffusion Augmentation for Recommendation (ADARec) framework. Critically, instead of using a diffusion model as a black-box generator, we use its entire step-wise denoising trajectory to reconstruct a user’s intent hierarchy from a single sparse sequence. To ensure both efficiency and effectiveness, our framework adaptively determines the required augmentation depth for each sequence and employs a specialized mixture-of-experts architecture to decouple coarse- and fine-grained intents. Experiments show ADARec outperforms state-of-the-art methods by 2-9% on standard benchmarks and 3-17% on sparse sequences, demonstrating its ability to reconstruct hierarchical intent representations from sparse data.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
De-collapsing User Intent: Adaptive Diffusion Augmentation with Mixture-of-Experts for Sequential Recommendation
1.2. Authors
Xiaoxi Cui, Chao Zhao, Yurong Cheng, Xiangmin Zhou. Affiliations: Beijing Institute of Technology (Xiaoxi Cui, Chao Zhao, Yurong Cheng), RMIT University (Xiangmin Zhou). Their research backgrounds appear to be in recommendation systems, machine learning, and data mining, given the nature of the paper.
1.3. Journal/Conference
The paper is listed with references to Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025) and The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25). Given the publication year (implied 2025 by references) and the nature of the content, it is likely intended for a top-tier conference in artificial intelligence or information retrieval. Both SIGIR and AAAI are highly reputable and influential venues in their respective fields, known for publishing cutting-edge research in areas like recommender systems, machine learning, and data mining.
1.4. Publication Year
The references indicate publication in 2025.
1.5. Abstract
Sequential recommendation (SR) aims to predict user actions based on historical interactions. However, data sparsity, a common problem where users have few historical interactions, leads to model overfitting and an inability to capture hierarchical user intents. Existing data augmentation methods attempt to mitigate overfitting but often fail to reconstruct the user's intent hierarchy, leading to misguided recommendations. To address this, the paper proposes Adaptive Diffusion Augmentation for Recommendation (ADARec). ADARec utilizes the entire step-wise denoising trajectory of a diffusion model to reconstruct a user's intent hierarchy from a single sparse sequence. To ensure efficiency and effectiveness, ADARec adaptively determines the required augmentation depth for each sequence and employs a mixture-of-experts (MoE) architecture to decouple coarse- and fine-grained intents. Experiments show ADARec outperforms state-of-the-art methods by 2-9% on standard benchmarks and 3-17% on sparse sequences, demonstrating its ability to reconstruct hierarchical intent representations from sparse data.
1.6. Original Source Link
/files/papers/694a32027a7e7809d937f471/paper.pdf This link points to a PDF, suggesting it's likely a pre-print or conference paper. Its publication status is likely either submitted to or accepted by one of the mentioned 2025 conferences.
2. Executive Summary
2.1. Background & Motivation
The core problem ADARec aims to solve is data sparsity in sequential recommendation (SR). SR models predict users' next actions based on their past interactions, which is crucial for many online platforms. However, in real-world scenarios, most users have very few interactions, leading to sparse data. This data sparsity creates two major issues:
- Model Overfitting: With limited data, SR models tend to overfit to the few available interactions, making them less generalizable.
- Hindered Hierarchical Intent Modeling: Crucially, sparse data collapses the user's intent hierarchy. For example, a user interacting with an "energy bar" and "hiking boot" might have a coarse-grained intent of "hiking" and a fine-grained intent of "buying a water bottle." However, sparse data might lead models to simply categorize these as "snacks" and "shoes," misinterpreting the true underlying intents and leading to irrelevant recommendations.

Existing data augmentation methods try to generate more data to combat overfitting. However, they often focus on generating items that are either too similar (limited semantic clustering) or too diverse (deviating from true intent), without explicitly addressing the reconstruction of the user's intent hierarchy that is lost in sparse data. This oversight means augmented data might not align with true user intents, leading to misguided recommendations.
The paper's entry point or innovative idea is to use the step-wise denoising trajectory of a diffusion model to reconstruct the intent hierarchy. They draw inspiration from the Information Bottleneck (IB) principle, suggesting that by progressively adding noise and then denoising, a model can learn to discard fine-grained details and capture more fundamental, coarse-grained patterns, thus revealing the underlying intent hierarchy.
2.2. Main Contributions / Findings
The paper makes the following primary contributions:

- Proposed ADARec Framework: They introduce Adaptive Diffusion Augmentation for Recommendation (ADARec), a novel framework designed to construct a comprehensive intent hierarchy even from sparse user sequences.
- Hierarchical Diffusion Augmentation (HDA) Module: They propose HDA, which leverages the denoising trajectory of a diffusion process to generate a rich, multi-level hierarchy of user intents, spanning from fine-grained to coarse-grained levels. This is a key departure from prior diffusion-based augmentation that only used the final output.
- Specialized Modules for Efficiency and Effectiveness: They introduce two specialized modules:
  - An Adaptive Depth Controller (ADC) that intelligently determines the optimal diffusion depth for each user sequence, enhancing efficiency.
  - A Hierarchical Parsing Mixture-of-Experts (HP-MoE) module that processes different levels of the intent hierarchy by decoupling coarse- and fine-grained intents and routing them to specialized experts.
- Extensive Experimental Validation: Through comprehensive experiments on real-world datasets, they demonstrate that ADARec significantly outperforms state-of-the-art baselines.

The key findings are:

- ADARec consistently and significantly outperforms state-of-the-art sequential recommendation models, with improvements of 2-9% on standard benchmarks.
- The performance gains are even more pronounced on sparse sequences, reaching 3-17% improvement over strong competitors. This highlights its effectiveness in reconstructing hierarchical intent from limited data.
- The t-SNE visualization confirms that ADARec successfully learns well-separated intent clusters from sparse data, unlike baselines where intents often collapse.
- All proposed components (ADC, HDA, HP-MoE, Content-Aware Router, Contrastive Loss) are shown to be indispensable for ADARec's performance through ablation studies.
- The framework can effectively enhance various intent-based backbone models when integrated, demonstrating its generalizability.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the ADARec framework, a beginner should understand several key concepts:
- Sequential Recommendation (SR): This is a subfield of recommender systems where the goal is to predict the next item a user will interact with, based on their ordered sequence of past interactions. Unlike traditional recommender systems that might only consider a user's overall preferences, SR models emphasize the sequential and temporal patterns in user behavior. For example, if a user buys a camera, then a lens, the next recommendation might be a tripod, following a logical sequence of purchases.
- Data Sparsity: This is a pervasive problem in recommender systems. It refers to the situation where there are very few recorded interactions between users and items compared to the total possible interactions. In the context of SR, data sparsity often means that many users have very short interaction histories (e.g., only 1-5 items). This makes it difficult for models to learn meaningful patterns and user preferences.
- User Intent: This refers to the underlying goal or motivation a user has when interacting with items. The paper distinguishes between:
  - Fine-grained intent: Specific, immediate goals (e.g., "buy a specific brand of water bottle").
  - Coarse-grained intent: Broader, more abstract goals (e.g., "prepare for a hiking trip").
  In sparse data, the distinction between these intents often collapses, making it hard for models to understand the user's broader motivations.
- Diffusion Models: These are a class of generative models that have gained significant attention, particularly for image generation. They operate in two main phases:
  - Forward (Noising) Process: This process gradually adds Gaussian noise to data, progressively transforming it into pure random noise. This is like blurring an image until it's just static. Each step in this process represents a different level of noise.
  - Reverse (Denoising) Process: This is the generative part. A neural network learns to reverse the noising process, gradually removing noise from random data to reconstruct a clear data sample. In ADARec, this denoising trajectory (the sequence of outputs at each denoising step) is used to represent different levels of user intent. The intuition is that at high noise levels, the model must infer very abstract, coarse-grained information, while at low noise levels, it recovers fine-grained details.
- Information Bottleneck (IB) Principle: Proposed by Tishby et al. (2000), this principle suggests that to learn a concise representation of relevant information, a system should compress its input by removing irrelevant details while preserving as much information as possible about a target variable. In ADARec, adding noise (the forward process) creates an information bottleneck. The denoising process is then forced to extract the essential message (user intent) by discarding less important, specific details (fine-grained noise) and keeping only the most fundamental and general patterns (coarse-grained intents).
- Mixture-of-Experts (MoE): This is a machine learning technique where multiple "expert" sub-networks are trained, and a "gating" network learns to combine or select the outputs of these experts based on the input. Each expert specializes in a different aspect of the problem. In ADARec, HP-MoE uses separate experts for fine-grained and coarse-grained intents, with a gating mechanism to dynamically route and combine their outputs.
- Gumbel-Softmax Estimator: This is a technique used to make discrete sampling operations differentiable, allowing gradient-based optimization in neural networks. When a model needs to make a discrete choice (e.g., selecting a specific diffusion depth from a set of integers), Gumbel-Softmax provides a continuous approximation during training that can be backpropagated through, while still allowing a discrete choice (e.g., argmax) to be made during inference.
- Transformer: Introduced by Vaswani et al. (2017), Transformers are a neural network architecture widely used in sequential data processing, particularly in natural language processing and now sequential recommendation. Their key innovation is the self-attention mechanism, which allows the model to weigh the importance of different elements in an input sequence when processing each element, capturing long-range dependencies effectively. ADARec uses a Transformer within its fine-grained expert.
- Cross-entropy Loss ($\mathcal{L}_{CE}$): A common loss function used in classification tasks, particularly for next-item prediction in SR. It measures the difference between the predicted probability distribution over items and the true one-hot distribution of the actual next item. A lower cross-entropy indicates better prediction accuracy. The formula is generally:
  $ \mathcal{L}_{CE} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) $
  where $y_i$ is the true probability (1 for the correct class, 0 otherwise) and $\hat{y}_i$ is the predicted probability for class $i$.
- Mean Squared Error (MSE) Loss ($\mathcal{L}_{MSE}$): Another common loss function, used to quantify the difference between predicted continuous values and actual continuous values. In ADARec, it's used to train the denoising network by measuring the squared difference between the predicted noise and the actual noise added. The formula is:
  $ \mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 $
  where $y_i$ is the true value and $\hat{y}_i$ is the predicted value.
- Contrastive Loss (InfoNCE-based): This type of loss aims to pull positive pairs (similar samples) closer together in the embedding space while pushing negative pairs (dissimilar samples) further apart. InfoNCE (Information Noise-Contrastive Estimation) is a popular form of contrastive loss often used in self-supervised learning. In ADARec, it's used to make the coarse-grained expert learn invariant user essence by contrasting coarse-grained representations of the same user with those of other users. A common InfoNCE formula is:
  $ \mathcal{L}_{InfoNCE} = -\log \frac{\exp(\mathrm{sim}(\mathbf{z}, \mathbf{z}^+) / \tau)}{\sum_{\mathbf{z}^- \in \mathcal{N}} \exp(\mathrm{sim}(\mathbf{z}, \mathbf{z}^-) / \tau) + \exp(\mathrm{sim}(\mathbf{z}, \mathbf{z}^+) / \tau)} $
  where $\mathrm{sim}(\cdot,\cdot)$ is a similarity function (e.g., cosine similarity), $\mathbf{z}$ is an anchor, $\mathbf{z}^+$ is a positive sample, $\mathcal{N}$ is a set of negative samples, and $\tau$ is a temperature parameter.
3.2. Previous Works
The paper frames its work against two main lines of previous research: sequential recommendation models and data augmentation methods.
- Traditional Sequential Recommendation Models:
  - Markov Chains: Early methods like Markov Chains (He and McAuley 2016; Rendle, Freudenthaler, and Schmidt-Thieme 2010) predicted the next item based on the immediately preceding one(s).
  - Deep Learning Models:
    - RNNs (Hidasi et al. 2015): Recurrent Neural Networks captured temporal dependencies.
    - Transformers (SASRec (Kang and McAuley 2018), BERT4Rec (Sun et al. 2019)): These models, leveraging self-attention, became state-of-the-art for capturing long-range dependencies in sequences.
    - GNNs (Chang et al. 2021; Li et al. 2023b): Graph Neural Networks model item transitions as graph structures.
    - MLPs (Li et al. 2022; Zhou et al. 2022): Multi-Layer Perceptrons have also been adapted for SR.
  - Self-Supervised Learning (SSL): Recent SR works incorporate SSL objectives (Xie et al. 2022; Liu et al. 2021; Qiu et al. 2022; Yang et al. 2023) to learn robust representations by creating auxiliary prediction tasks on the input sequence (e.g., masking items and predicting them).
  - Intent Learning: More recently, SR has moved towards explicitly modeling user intents, using techniques like multi-interest modeling (Li et al. 2019; Cen et al. 2020), disentanglement (Ma et al. 2020; Li et al. 2024), and end-to-end clustering (ICLRec (Chen et al. 2022), IOCRec (Li et al. 2023a), ELCRec (Liu et al. 2024)). The paper specifically uses ELCRec as its backbone encoder, which is designed to learn explicit clusters representing user intents.
- Data Augmentation Methods for SR:
  - Heuristic Augmentations: Simple techniques like masking and cropping (Xie et al. 2022; Liu et al. 2021) generate variations of existing sequences.
  - Diffusion-based Augmentation: PDRec (Ma et al. 2024), GlobalDiff (Luo, Li, and Lin 2025), and DiffDiv (Cai et al. 2025) use diffusion models to generate new, generalized user preferences or augmented item sequences. However, the paper criticizes them for only utilizing the final output of the diffusion process, thus overlooking the intent hierarchy. BASRec (Dang et al. 2025) focuses on balancing relevance and diversity in generated data through representation-level mixup.
  - Feature Denoising: Other diffusion-based approaches denoise feature representations from sparse interactions (Cui et al. 2025a,b).
3.3. Technological Evolution
Sequential recommendation has evolved significantly:
- Early Statistical Models: Markov Chains represented the initial phase, focusing on local item-to-item transitions.
- Deep Learning Revolution: RNNs, CNNs (e.g., Caser), and later Transformers (SASRec, BERT4Rec) brought sophisticated sequence modeling capabilities, capturing longer-range dependencies and complex patterns.
- Self-Supervised Learning Era: To combat data sparsity and learn more robust representations without extensive labeling, SSL techniques emerged, often using tasks like masked item prediction or contrastive learning.
- Intent-Aware Recommendation: Recognizing that users often have multiple, dynamic underlying goals, multi-interest and intent disentanglement models became prominent, aiming to explicitly identify and leverage these intents. ELCRec is an example of end-to-end clustering for intent learning.
- Generative Augmentation: The recent advent of powerful generative models like diffusion models opened new avenues for data augmentation, moving beyond simple heuristics to synthesize more complex and varied data.

ADARec fits into the latest stage of this evolution, specifically at the intersection of intent-aware recommendation and generative augmentation. It aims to overcome the limitations of prior data augmentation (which often lacked intent hierarchy awareness) by leveraging diffusion models in a novel way to explicitly reconstruct the intent hierarchy from sparse data, rather than just generating more data points.
3.4. Differentiation Analysis
Compared to the main methods in related work, ADARec introduces several core differences and innovations:
- Hierarchical Intent Reconstruction via Diffusion Trajectory:
  - Existing Augmentation: Prior diffusion-based augmentation methods (PDRec, GlobalDiff, DiffDiv) primarily use the final output of the diffusion model (e.g., a denoised representation or generated item) as augmented data. They treat the diffusion model as a "black-box generator" for single data points.
  - ADARec's Innovation: ADARec critically utilizes the entire step-wise denoising trajectory of the diffusion model. It argues that each step in this trajectory, from high noise to low noise, inherently represents a different level of abstraction, effectively reconstructing a user's intent hierarchy from coarse-grained (high noise, abstract) to fine-grained (low noise, specific). This is a fundamental conceptual shift.
- Adaptive Augmentation Depth:
  - Existing Augmentation: Most augmentation methods apply a fixed augmentation strategy or depth to all sequences.
  - ADARec's Innovation: It introduces an Adaptive Depth Controller (ADC) that dynamically determines the optimal diffusion depth for each individual user sequence. This ensures customized and efficient enhancement, avoiding unnecessary computation for sequences that might not require deep augmentation.
- Specialized Hierarchical Parsing with Mixture-of-Experts:
  - Existing Intent Models/Augmentation: Many intent-based models use a single network or multiple parallel branches to capture interests, but typically don't explicitly parse a generated hierarchy of intents. Existing augmentation methods typically produce a single augmented representation.
  - ADARec's Innovation: It designs a dedicated Hierarchical Parsing Mixture-of-Experts (HP-MoE) module. This MoE architecture is specifically tailored to decouple and process the coarse-grained and fine-grained intent representations derived from the diffusion trajectory. It uses functionally decoupled experts (a Transformer for fine-grained intents, a separate module with a contrastive loss for coarse-grained intents) and a content-aware routing mechanism, which is novel for handling diffusion-generated hierarchies.
- Addressing the "Collapsed Intent" Problem:
  - Existing Methods: While some intent-based models aim to learn distinct intents, they struggle significantly with sparse data where the intent hierarchy has collapsed (as visualized in Figures 1 and 2 of the original paper).
  - ADARec's Advantage: By explicitly reconstructing this hierarchy through the HDA module and effectively parsing it with HP-MoE, ADARec directly tackles the fundamental issue of incomplete intent hierarchy representation resulting from data sparsity, leading to more accurate and less misguided recommendations.
4. Methodology
4.1. Principles
The core idea behind ADARec is to overcome the data sparsity problem in sequential recommendation by reconstructing a user's intent hierarchy from their sparse interaction history. It achieves this by:
- Leveraging the Denoising Trajectory of a Diffusion Model: Inspired by the Information Bottleneck principle, the diffusion model's denoising process is used to progressively reveal user intents at different levels of abstraction. At higher noise levels (more abstract denoising), the model is forced to capture coarse-grained intents, while at lower noise levels, it recovers fine-grained details. The sequence of intermediate denoised representations thus forms a natural intent hierarchy.
- Adaptive Augmentation: Recognizing that not all user sequences require the same level of augmentation, an Adaptive Depth Controller (ADC) dynamically determines the optimal diffusion depth for each sequence, balancing efficiency and effectiveness.
- Specialized Intent Parsing: A Hierarchical Parsing Mixture-of-Experts (HP-MoE) module is designed to intelligently decouple and process these coarse-grained and fine-grained intents using specialized experts and a content-aware routing mechanism, leading to a robust final user representation.
4.2. Core Methodology In-depth (Layer by Layer)
The ADARec framework processes user sequences through a sequential pipeline, starting with embedding, followed by adaptive depth determination, hierarchical augmentation, and finally, expert-driven parsing and fusion. The overall architecture is illustrated in Figure 3 of the original paper.
4.2.1. Sequence Embedding
First, a user's historical interaction sequence is transformed into a numerical representation. ADARec adopts ELCRec (Liu et al. 2024) as its backbone encoder to obtain the sequence embedding $\mathbf{E}_u^0$. This embedding represents the initial state of the user's history. Additionally, the length of the sequence is encoded into a vector $\mathbf{e}_{l_u}$ using a Multi-Layer Perceptron (MLP). Both $\mathbf{E}_u^0$ and $\mathbf{e}_{l_u}$ serve as inputs to the subsequent Adaptive Depth Controller.
4.2.2. Adaptive Depth Controller (ADC)
The Adaptive Depth Controller (ADC) is responsible for determining the optimal augmentation depth $T_u$ for each user sequence $S_u$. This is crucial for customizing the enhancement and managing computational costs.

The ADC takes the user's original sequence embedding $\mathbf{E}_u^0 \in \mathbb{R}^{n \times d}$ (where $n$ is the sequence length and $d$ is the embedding dimension) and the embedding vector $\mathbf{e}_{l_u}$ of the normalized sequence length as input. These inputs are concatenated and then fed into a Gated Recurrent Unit (GRU) network to capture temporal dependencies. The output of the GRU is then pooled (e.g., using mean pooling) to produce a fixed-size representation $\mathbf{r}_u$. This is then passed through a linear layer to generate logits over possible diffusion depths. The possible depths range from 0 to a predefined maximum depth $T_{max}$.
The calculation for generating the logits is:

$ \mathbf{z}_u = \mathbf{W} \cdot \mathrm{Pool}\big(\mathrm{GRU}([\mathbf{E}_u^0 ; \mathbf{e}_{l_u}])\big) + \mathbf{b} $

Here:

- $\mathbf{E}_u^0$: The initial sequence embedding of user $u$.
- $\mathbf{e}_{l_u}$: The embedding vector representing the normalized length of sequence $S_u$.
- $[\cdot ; \cdot]$: Concatenation of the sequence embedding and the length embedding.
- $\mathrm{GRU}(\cdot)$: A Gated Recurrent Unit network that processes the concatenated input to capture sequential information.
- $\mathrm{Pool}(\cdot)$: An aggregation function, such as mean pooling, applied to the output of the GRU to obtain a fixed-size context vector $\mathbf{r}_u$.
- $\mathbf{W}$: A learnable weight matrix.
- $\mathbf{b}$: A learnable bias vector.
- $\mathbf{z}_u$: A vector of logits of dimension $T_{max}+1$, representing the scores for each possible diffusion depth.

To select a discrete depth in a differentiable manner, the Gumbel-Softmax estimator is applied to the logits. This allows for backpropagation through the discrete selection process:

$ \mathbf{p}_u = \mathrm{softmax}\big((\mathbf{z}_u + \mathbf{g}) / \tau_g\big), \quad \mathbf{g} \sim \mathrm{Gumbel}(0, 1) $

Where:

- $\mathbf{z}_u$: The logits vector computed previously.
- $\mathbf{g}$: A vector of i.i.d. (independent and identically distributed) samples drawn from a Gumbel(0, 1) distribution. This introduces randomness during training to encourage exploration.
- $\tau_g$: The Gumbel-Softmax temperature. A higher temperature makes the distribution flatter (more uniform), while a lower temperature makes it sharper (closer to one-hot).
- $\mathrm{softmax}(\cdot)$: The softmax function, which converts the logits into a probability distribution $\mathbf{p}_u$ over the possible depths.

During the forward pass (inference), the actual depth $T_u$ is selected by taking the argmax of $\mathbf{p}_u$ (the depth with the highest probability). During the backward pass (training), the continuous Gumbel-Softmax probabilities are used for gradient calculation.
To encourage exploration of different diffusion depths and prevent the ADC from prematurely converging to shallow, "safe" depths, an auxiliary exploration loss is introduced. This loss is the negative entropy of the ADC's probability distribution $\mathbf{p}_u$:

$ \mathcal{L}_{explore} = \sum_{t=0}^{T_{max}} p_u^{(t)} \log p_u^{(t)} $

Here:

- $p_u^{(t)}$: The probability assigned by Gumbel-Softmax to the $t$-th diffusion depth for user $u$.

This entropy-based loss encourages the ADC to maintain uncertainty and explore a wider range of depths, leading to more robust training for the subsequent HDA and HP-MoE modules.
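To make the ADC mechanics more concrete, here is a minimal PyTorch sketch of such a depth controller, assuming a GRU encoder, mean pooling, and PyTorch's built-in `gumbel_softmax`. The class name, dimensions, and the way the length embedding is injected are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveDepthController(nn.Module):
    """Sketch: GRU over [sequence embedding; length embedding], mean pooling,
    a linear head over T_max+1 depths, and Gumbel-Softmax depth sampling."""
    def __init__(self, emb_dim: int, hidden_dim: int, t_max: int, tau: float = 0.5):
        super().__init__()
        self.len_mlp = nn.Sequential(nn.Linear(1, emb_dim), nn.ReLU())
        self.gru = nn.GRU(2 * emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, t_max + 1)
        self.tau = tau

    def forward(self, seq_emb: torch.Tensor, seq_len: torch.Tensor):
        # seq_emb: (B, n, d); seq_len: (B,) normalized sequence lengths (float).
        len_emb = self.len_mlp(seq_len.unsqueeze(-1))                  # (B, d)
        len_emb = len_emb.unsqueeze(1).expand(-1, seq_emb.size(1), -1)
        h, _ = self.gru(torch.cat([seq_emb, len_emb], dim=-1))         # (B, n, hidden)
        context = h.mean(dim=1)                                        # mean pooling
        logits = self.head(context)                                    # (B, T_max+1)
        # Soft (differentiable) depth distribution for training,
        # hard argmax depth for the inference path.
        probs = F.gumbel_softmax(logits, tau=self.tau, hard=False)
        depth = probs.argmax(dim=-1)                                   # (B,)
        # Exploration loss: negative entropy of the depth distribution.
        explore_loss = (probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
        return depth, probs, explore_loss
```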
4.2.3. Hierarchical Diffusion Augmentation (HDA)
The Hierarchical Diffusion Augmentation (HDA) module is the core component for generating the user intent hierarchy. Unlike other diffusion-based augmentation methods that use only the final output, HDA leverages the entire denoising trajectory.
Intuition behind Constructing Hierarchical Intent via Denoising Trajectory:
The HDA module's design is inspired by the observation that the denoising process of a diffusion model naturally progresses from abstract to concrete information.
- At high noise levels (i.e., when $t$ is large, approaching $T_u$), the input is heavily corrupted with noise. The denoising network is forced to extract only the most abstract, fundamental information to reconstruct the original signal. This process yields a coarse-grained intent. The information retained from the original sequence is significantly compressed compared to the total information, yet it maintains predictive relevance for the next item.
- At low noise levels (i.e., when $t$ is small, approaching 0), the input contains more of the original signal. The denoising network can recover more specific behavioral details, forming a fine-grained intent. In this case, the information is almost fully preserved.

Thus, the entire denoising trajectory inherently forms a user intent hierarchy, where higher $t$ values correspond to coarse-grained intents and lower $t$ values to fine-grained intents.
The process of HDA:
Given a user $u$'s sequence embedding $\mathbf{E}_u^0$ and the ADC-predicted depth $T_u$, the diffusion's forward process (noising) progressively adds Gaussian noise to generate noisy representations $\{\mathbf{E}_u^t\}_{t=1}^{T_u}$.

The forward process is defined as:

$ \mathbf{E}_u^t = \sqrt{\bar{\alpha}_t}\, \mathbf{E}_u^0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $

Here:

- $\mathbf{E}_u^t$: The noisy representation of the sequence embedding at diffusion step $t$.
- $\mathbf{E}_u^0$: The original sequence embedding.
- $\bar{\alpha}_t$: A predefined noise schedule parameter (e.g., from DDPM's linear variance schedule), which controls the amount of original signal preserved at step $t$. $\bar{\alpha}_t$ decreases as $t$ increases, meaning more noise is added.
- $\boldsymbol{\epsilon}$: A random noise vector sampled from a standard normal distribution.

The core of HDA lies in the reverse process (denoising), which is executed by a trainable denoising network. This network learns to predict the noise added at each step, allowing for the reconstruction of the original signal. The denoising network $\boldsymbol{\epsilon}_\theta$ predicts the noise given the noisy input $\mathbf{E}_u^t$ and the current step $t$. The denoised representation is then estimated as:

$ \hat{\mathbf{E}}_u^t = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{E}_u^t - \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}_\theta(\mathbf{E}_u^t, t) \right) $

Where:

- $\hat{\mathbf{E}}_u^t$: The estimated denoised representation at diffusion step $t$.
- $\boldsymbol{\epsilon}_\theta(\cdot)$: The denoising network (parameterized by $\theta$) that predicts the noise component from the noisy representation and the current diffusion step. For simplicity and efficiency, the denoising network is implemented as a simple Multi-Layer Perceptron (MLP).

To train the denoising network $\boldsymbol{\epsilon}_\theta$, it is optimized using a standard Mean Squared Error (MSE) loss between the predicted noise and the true noise:

$ \mathcal{L}_{diff} = \mathbb{E}_{t, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{E}_u^t, t) \right\|_2^2 \right] $

Here:

- $\boldsymbol{\epsilon}_\theta(\mathbf{E}_u^t, t)$: The noise predicted by the denoising network.
- $\boldsymbol{\epsilon}$: The actual noise added at diffusion step $t$.

The MSE loss ensures that the denoising network accurately learns to remove the noise.

This iterative denoising process, from the highest noise level (approaching $T_u$) down to the lowest noise level (approaching 0), generates a sequence of denoised representations that forms the intent hierarchy $\{\hat{\mathbf{E}}_u^t\}_{t=0}^{T_u}$. In this hierarchy, representations with higher $t$ (closer to $T_u$) capture coarse-grained intents, while representations with lower $t$ (closer to 0) capture fine-grained intents. The output of the HDA module is this entire structured set.
4.2.4. Hierarchical Parsing MoE (HP-MoE)
The Hierarchical Parsing Mixture-of-Experts (HP-MoE) module is designed to efficiently parse the intent hierarchy generated by HDA, intelligently decoupling and processing fine-grained and coarse-grained intents.
Functionally Decoupled Experts:
The HP-MoE employs two specialized experts:

- Fine-grained Expert ($E_{fine}$): This expert is implemented as a Transformer network. It is trained solely using the recommendation loss, which optimizes for next-item prediction. This direct supervision compels $E_{fine}$ to capture the subtle, transient, and specific details within the user's interaction sequence that are crucial for accurate immediate recommendations.
- Coarse-grained Expert ($E_{coarse}$): This expert is designed to extract the user's coarse-grained intent. It is regularized by an exclusive contrastive loss, $\mathcal{L}_{cont}$. This InfoNCE-based loss forces $E_{coarse}$ to learn the invariant, underlying essence of the user's preferences by pulling together representations from the user's own coarse-grained hierarchy while pushing them away from the coarse-grained intents of other users in the same batch.

The contrastive loss for the coarse-grained expert is defined as:

$ \mathcal{L}_{cont} = -\sum_{t \in \mathcal{T}_{coarse}} \log \frac{\exp\big(\mathrm{sim}(\mathbf{c}_u^t, \mathbf{c}_u^{t+1}) / \tau\big)}{\sum_{\mathbf{c}^- \in \mathcal{N}_u^t} \exp\big(\mathrm{sim}(\mathbf{c}_u^t, \mathbf{c}^-) / \tau\big)} $

Here:

- $\mathrm{sim}(\cdot, \cdot)$: The cosine similarity function, used to measure the similarity between two intent representations.
- $\mathbf{c}_u^t$: A coarse-grained representation of user $u$ at diffusion step $t$. This is an output from $E_{coarse}$ when processing $\hat{\mathbf{E}}_u^t$.
- $\mathbf{c}_u^{t+1}$: The positive sample. This is a coarse-grained representation from the next higher diffusion step for the same user $u$, implying a related but more abstract intent.
- $\mathcal{N}_u^t$: A set of negative samples. It contains $\mathbf{c}_u^{t+1}$ (the positive sample) and a collection of coarse-grained representations from other users within the same training batch. This ensures that the expert differentiates the current user's intent from others.
- $\tau$: A temperature parameter for the InfoNCE loss, controlling the sharpness of the probability distribution.

The summation index set $\mathcal{T}_{coarse}$ covers the upper (more coarse-grained) half of the generated hierarchy, focusing the contrastive learning on the most abstract levels.
Content-Aware Routing and Gated Fusion:
To synthesize a unified user representation from the intent hierarchy $\{\hat{\mathbf{E}}_u^t\}_{t=0}^{T_u}$, a two-level gated fusion mechanism is employed.

First, at each hierarchy level $t$, a content-aware routing mechanism determines how much each expert contributes. This gate's probability, $g_t$, considers both the noise level embedding (representing the current level of abstraction) and the pooled content representation of the denoised sequence at that level:

$ g_t = \sigma\big(\mathrm{MLP}([\mathrm{Pool}(\hat{\mathbf{E}}_u^t) ; \mathbf{e}_t])\big) $

Where:

- $\mathrm{Pool}(\hat{\mathbf{E}}_u^t)$: An aggregation (e.g., mean pooling) of the denoised representation at step $t$, capturing its content.
- $\mathbf{e}_t$: An embedding vector corresponding to the diffusion step $t$. This provides positional information about the hierarchy level.
- $[\cdot ; \cdot]$: Concatenation of the pooled content and the step embedding.
- $\mathrm{MLP}(\cdot)$: A Multi-Layer Perceptron.
- $\sigma(\cdot)$: The sigmoid activation function, which squashes the output to a value between 0 and 1, representing the gate's probability $g_t$. A high $g_t$ means more weight for the fine-grained expert, and $1 - g_t$ for the coarse-grained expert.

Next, the outputs from each expert are aggregated across the entire hierarchy, weighted by their respective gate probabilities, to form consolidated fine-grained and coarse-grained representations:

$ \mathbf{h}_u^{fine} = \frac{\sum_{t} g_t \cdot E_{fine}(\hat{\mathbf{E}}_u^t)}{\sum_{t} g_t + \delta}, \qquad \mathbf{h}_u^{coarse} = \frac{\sum_{t} (1 - g_t) \cdot E_{coarse}(\hat{\mathbf{E}}_u^t)}{\sum_{t} (1 - g_t) + \delta} $

Here:

- $E_{fine}(\hat{\mathbf{E}}_u^t)$: The output of the fine-grained expert for the denoised representation at step $t$.
- $E_{coarse}(\hat{\mathbf{E}}_u^t)$: The output of the coarse-grained expert for the denoised representation at step $t$.
- $\delta$: A small constant added to the denominator to prevent division by zero.

These equations essentially compute a weighted average of expert outputs, where $g_t$ determines the contribution of $E_{fine}$ and $1 - g_t$ determines the contribution of $E_{coarse}$ at each hierarchy level.

Finally, to produce the ultimate user representation $\mathbf{h}_u$, a final gating vector $\mathbf{g}_u$ is computed from the aggregated coarse-grained representation $\mathbf{h}_u^{coarse}$. This gate adaptively combines $\mathbf{h}_u^{fine}$ and $\mathbf{h}_u^{coarse}$, balancing stable long-term themes (coarse-grained) against transient specifics (fine-grained):

$ \mathbf{g}_u = \sigma(\mathbf{W}_g \mathbf{h}_u^{coarse} + \mathbf{b}_g), \qquad \mathbf{h}_u = \mathbf{g}_u \odot \mathbf{h}_u^{fine} + (1 - \mathbf{g}_u) \odot \mathbf{h}_u^{coarse} $

Where:

- $\mathbf{W}_g$: A learnable weight matrix for the final gate.
- $\mathbf{b}_g$: A learnable bias vector for the final gate.
- $\odot$: The element-wise multiplication (Hadamard product) operator.

This final gating mechanism allows the model to dynamically adjust the influence of fine-grained and coarse-grained intents based on the user's overall coarse-grained preferences.
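Below is a minimal PyTorch sketch of the two-level gated fusion, assuming per-level expert outputs are already computed. Using the averaged expert outputs as a stand-in for the pooled content, and all module names, are illustrative simplifications rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedHierarchyFusion(nn.Module):
    """Sketch: a per-level content-aware gate splits weight between the two experts,
    then a final gate (driven by the coarse aggregate) mixes the two aggregates."""
    def __init__(self, dim: int, t_max: int):
        super().__init__()
        self.step_emb = nn.Embedding(t_max + 1, dim)
        self.route = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.final_gate = nn.Linear(dim, dim)

    def forward(self, fine_outs, coarse_outs, levels, eps: float = 1e-8):
        # fine_outs / coarse_outs: lists of (B, d) expert outputs, one per hierarchy level.
        fine_sum = coarse_sum = g_sum = 0.0
        for f, c, t in zip(fine_outs, coarse_outs, levels):
            content = (f + c) / 2                                       # stand-in for pooled content
            step = self.step_emb(torch.full((f.size(0),), t, dtype=torch.long))
            g = torch.sigmoid(self.route(torch.cat([content, step], dim=-1)))  # (B, 1)
            fine_sum, coarse_sum = fine_sum + g * f, coarse_sum + (1 - g) * c
            g_sum = g_sum + g
        h_fine = fine_sum / (g_sum + eps)
        h_coarse = coarse_sum / (len(levels) - g_sum + eps)
        gate = torch.sigmoid(self.final_gate(h_coarse))                 # (B, d)
        return gate * h_fine + (1 - gate) * h_coarse
```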
Prediction Layer:
The fused user representation $\mathbf{h}_u$ is then passed to a prediction layer. This layer computes probability scores for all candidate items in the item set $\mathcal{V}$, which are then used for making recommendations.
4.2.5. Training Objective
The ADARec framework is trained end-to-end using a composite objective function that combines multiple losses:

$ \mathcal{L} = \mathcal{L}_{rec} + \lambda_1 \mathcal{L}_{diff} + \lambda_2 \mathcal{L}_{cont} + \lambda_3 \mathcal{L}_{explore} $

Here:

- $\mathcal{L}$: The total loss to be minimized during training.
- $\mathcal{L}_{rec}$: The primary recommendation loss, calculated using the standard cross-entropy loss. For a given user $u$, it measures the discrepancy between the predicted probabilities and the true next item: $p(i)$ is 1 for the true next item and 0 for all other items in the item set $\mathcal{V}$, and $\hat{p}_u(i)$ is the predicted probability score for item $i$ for user $u$.
- $\mathcal{L}_{diff}$: The denoising loss, which trains the HDA module (specifically the denoising network) to accurately reconstruct the original sequence embedding by predicting noise.
- $\mathcal{L}_{cont}$: The contrastive loss, which acts as a key regularizer for the coarse-grained expert in HP-MoE, enforcing the functional decoupling of intents.
- $\mathcal{L}_{explore}$: The exploration loss, an entropy-based regularizer for the ADC, preventing it from prematurely converging to shallow diffusion depths and encouraging broader exploration.
- $\lambda_1$, $\lambda_2$, $\lambda_3$: Hyperparameters that balance the contribution of each auxiliary loss component to the total objective.

A rough sketch of how these terms can be combined is shown below.
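The following small sketch illustrates how the prediction layer and the composite objective could fit together. The dot-product scoring, the default `lam*` values, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def recommendation_loss(user_repr, item_embeddings, next_item):
    """Score all candidate items with a dot product and apply cross-entropy
    against the ground-truth next item (the standard next-item objective)."""
    logits = user_repr @ item_embeddings.T            # (B, |V|)
    return F.cross_entropy(logits, next_item)

def total_loss(l_rec, l_diff, l_cont, l_explore, lam1=0.1, lam2=0.1, lam3=0.01):
    """Composite objective: recommendation loss plus weighted auxiliary losses."""
    return l_rec + lam1 * l_diff + lam2 * l_cont + lam3 * l_explore

# Toy usage with random tensors (4 users, 100 items, 64-dim embeddings).
user_repr = torch.randn(4, 64)
item_emb = torch.randn(100, 64)
targets = torch.randint(0, 100, (4,))
l_rec = recommendation_loss(user_repr, item_emb, targets)
print(total_loss(l_rec, torch.tensor(0.2), torch.tensor(0.5), torch.tensor(-1.0)))
```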
5. Experimental Setup
5.1. Datasets
The experiments are conducted on four public benchmark datasets, commonly used in sequential recommendation research:
- Beauty: An Amazon review dataset, typically containing interactions related to beauty products.
- Sports: An Amazon review dataset, containing interactions related to sports and outdoor equipment.
- Toys: An Amazon review dataset, focusing on toys and games.
- Yelp: A dataset from the Yelp platform, encompassing various business categories (e.g., restaurants, shops), providing a more diverse set of user intents.

These datasets are typically characterized by sparse user interactions and varying degrees of item diversity. The paper states it strictly adheres to the datasets and pipelines released by ELCRec (Liu et al. 2024) to ensure fairness and reproducibility. These datasets are effective for validating the method's performance because they represent real-world user behavior, including the presence of data sparsity and complex user intent patterns, which are the main challenges ADARec aims to address.
5.2. Evaluation Metrics
The performance of sequential recommendation models is typically evaluated using ranking-based metrics that assess how well the model ranks the true next item among a list of candidates. The paper uses Hit Rate (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K).
- Hit Rate (HR@K):
  - Conceptual Definition: Hit Rate measures the proportion of users for whom the true next item appears in the top-K recommended items. It indicates whether the correct item was found within the recommendation list, regardless of its position.
  - Mathematical Formula:
    $ \mathrm{HR@K} = \frac{\text{Number of users for whom the true next item is in the top-K recommendations}}{\text{Total number of users}} $
    More formally, for a set of users $U$:
    $ \mathrm{HR@K} = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}(\text{true\_item}_u \in \text{top-K\_list}_u) $
  - Symbol Explanation:
    - $U$: The set of all users in the evaluation.
    - $|U|$: The total number of users.
    - $\mathbb{I}(\cdot)$: An indicator function that returns 1 if the condition inside is true, and 0 otherwise.
    - $\text{true\_item}_u$: The actual next item interacted with by user $u$.
    - $\text{top-K\_list}_u$: The list of top K items recommended to user $u$.
- Normalized Discounted Cumulative Gain (NDCG@K):
  - Conceptual Definition: NDCG is a ranking quality metric that accounts for the position of the relevant items in the recommendation list. It assigns higher scores to relevant items that appear higher in the list and penalizes relevant items that appear lower. It is "normalized" to range from 0 to 1, where 1 represents a perfect ranking.
  - Mathematical Formula: First, Discounted Cumulative Gain (DCG@K) is calculated:
    $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $
    Then, NDCG@K is DCG@K divided by the Ideal DCG (IDCG@K), which is the maximum possible DCG if all relevant items were perfectly ranked:
    $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
  - Symbol Explanation:
    - $K$: The number of top items in the recommendation list being considered.
    - $i$: The rank of an item in the recommendation list (starting from 1).
    - $\mathrm{rel}_i$: The relevance score of the item at rank $i$. For next-item prediction, relevance is typically binary (1 if it's the true next item, 0 otherwise).
    - $\mathrm{IDCG@K}$: The Ideal Discounted Cumulative Gain for the given user, calculated by assuming the true next item is at rank 1, and all other items are irrelevant (or ranked perfectly if multiple relevant items were possible, which is less common in next-item prediction for SR).

The paper evaluates HR@5, NDCG@5, HR@20, and NDCG@20, meaning they consider the top 5 and top 20 recommendations.
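For clarity, here is a small NumPy sketch of HR@K and NDCG@K under the single-ground-truth, binary-relevance setting used in next-item prediction (in which IDCG@K is 1). The function name and array layout are illustrative assumptions.

```python
import numpy as np

def hr_and_ndcg_at_k(ranked_items: np.ndarray, true_items: np.ndarray, k: int):
    """ranked_items: (num_users, num_candidates) item ids sorted by predicted score;
    true_items: (num_users,) ground-truth next item per user.
    Returns (HR@K, NDCG@K) under binary relevance with one relevant item per user."""
    hits, ndcg = 0.0, 0.0
    for ranked, true in zip(ranked_items, true_items):
        top_k = ranked[:k]
        if true in top_k:
            rank = int(np.where(top_k == true)[0][0]) + 1   # 1-based rank of the hit
            hits += 1.0
            ndcg += 1.0 / np.log2(rank + 1)                 # IDCG = 1 for one relevant item
    n = len(true_items)
    return hits / n, ndcg / n

# Toy usage: 2 users, 5 candidates each.
ranked = np.array([[3, 1, 4, 0, 2], [2, 0, 1, 4, 3]])
truth = np.array([4, 9])   # the second user's true item is not in the candidate list
print(hr_and_ndcg_at_k(ranked, truth, k=3))   # -> (0.5, ~0.25)
```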
5.3. Baselines
ADARec is benchmarked against eleven diverse baselines, covering a wide spectrum of sequential recommendation models and data augmentation techniques:
- Early Deep Learning Models:
  - Caser (Tang and Wang 2018): A Convolutional Sequence Embedding model that uses CNNs to capture local sequential patterns.
- Transformer-based Models:
  - SASRec (Kang and McAuley 2018): A pioneering self-attentive sequential recommendation model using the Transformer architecture.
  - BERT4Rec (Sun et al. 2019): Adapts BERT's masked language model objective for sequential recommendation with bidirectional encoder representations.
- Intent-Learning Methods (often SSL-based): These models explicitly try to capture different user intents.
  - IOCRec (Li et al. 2023a): Multi-Intention Oriented Contrastive Learning for Sequential Recommendation.
  - ICLRec (Chen et al. 2022): Intent Contrastive Learning for Sequential Recommendation.
  - ELCRec (Liu et al. 2024): End-to-end Learnable Clustering for Intent Learning in Recommendation. This model also serves as ADARec's backbone encoder.
- Data Augmentation Frameworks (Diffusion-based): These are the most direct competitors, also using diffusion models for augmentation.
  - PDRec (Ma et al. 2024): Plug-In Diffusion Model for Sequential Recommendation.
  - BASRec (Dang et al. 2025): Augmenting Sequential Recommendation with Balanced Relevance and Diversity.
  - GlobalDiff (Luo, Li, and Lin 2025): Enhancing Sequential Recommendation with Global Diffusion.
  - DiffDiv (Cai et al. 2025): Unleashing the Potential of Diffusion Models Towards Diversified Sequential Recommendations.

The performance data for the first nine models (Caser through ELCRec) are adopted from ELCRec (Liu et al. 2024) for consistency. BASRec, PDRec, GlobalDiff, and DiffDiv are re-implemented within ADARec's framework for fair comparison, suggesting their performance numbers are directly obtained under the same experimental conditions as ADARec.
5.4. Implementation Details
- Optimizer: The Adam optimizer is used with weight decay.
- Sequence Length: The maximum user sequence length is capped at 50.
- Hyperparameters:
  - Gumbel-Softmax temperature: Fixed at 0.5.
  - InfoNCE temperature: Set to 0.07.
  - Maximum diffusion depth: Tuned via grid search.
  - Weights of the denoising and contrastive losses: Selected via grid search.
  - Exploration loss weight: Searched over a small range.
- Computing Environment: Implemented on a server equipped with NVIDIA RTX 3090 GPUs.

These settings are collected in the illustrative configuration sketch below.
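For reference, the reported settings can be gathered into a small illustrative configuration; fields set to `None` correspond to values tuned by grid search whose exact numbers are not restated in this summary, and the dictionary itself is hypothetical rather than part of the released code.

```python
# Illustrative training configuration assembled from the reported settings.
ADAREC_CONFIG = {
    "optimizer": "Adam",
    "weight_decay": None,          # value not restated in this summary
    "max_seq_len": 50,
    "gumbel_softmax_tau": 0.5,
    "infonce_tau": 0.07,
    "max_diffusion_depth": None,   # tuned via grid search
    "lambda_diff": None,           # grid-searched loss weight
    "lambda_cont": None,           # grid-searched loss weight
    "lambda_explore": None,        # grid-searched loss weight
}
```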
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Overall Performance (RQ1)
The overall performance results, presented in Table 1, demonstrate ADARec's consistent and significant outperformance across all benchmarks and metrics compared to a wide range of baselines. The "Improv." column highlights the percentage improvement over the strongest competitor in each category.
The following are the results from Table 1 of the original paper:
| Dataset | Metric | Caser (2018) | SASRec (2018) | BERT4Rec (2019) | IOCRec (2023) | ICLRec (2022) | ELCRec (2024) | PDRec (2024) | BASRec (2025) | GDiff (2025) | DDiv (2025) | ADARec (Ours) | Improv. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Beauty | H@5 | 0.0251 | 0.0374 | 0.0360 | 0.0408 | 0.0495 | 0.0529 | 0.0569 | 0.0551 | 0.0563 | 0.0575 | 0.0600* | +4.35% |
| N@5 | 0.0145 | 0.0241 | 0.0216 | 0.0245 | 0.0326 | 0.0355 | 0.0380 | 0.0385 | 0.0382 | 0.0375 | 0.0403* | +4.68% | |
| H@20 | 0.0643 | 0.0901 | 0.0984 | 0.0916 | 0.1072 | 0.1079 | 0.1145 | 0.1135 | 0.1129 | 0.1152 | 0.1187* | +3.04% | |
| N@20 | 0.0298 | 0.0387 | 0.0391 | 0.0444 | 0.0491 | 0.0509 | 0.0541 | 0.0539 | 0.0531 | 0.0548 | 0.0568* | +3.65% | |
| Sports | H@5 | 0.0114 | 0.0206 | 0.0217 | 0.0246 | 0.0263 | 0.0286 | 0.0325 | 0.0328 | 0.0321 | 0.0315 | 0.0341* | +3.96% |
| N@5 | 0.0076 | 0.0135 | 0.0143 | 0.0162 | 0.0173 | 0.0185 | 0.0218 | 0.0220 | 0.0215 | 0.0211 | 0.0228* | +3.64% | |
| H@20 | 0.0399 | 0.0497 | 0.0604 | 0.0641 | 0.0630 | 0.0648 | 0.0701 | 0.0728 | 0.0719 | 0.0714 | 0.0745* | +2.33% | |
| N@20 | 0.0178 | 0.0216 | 0.0251 | 0.0280 | 0.0276 | 0.0286 | 0.0328 | 0.0331 | 0.0325 | 0.0318 | 0.0341* | +3.02% | |
| Toys | H@5 | 0.0166 | 0.0463 | 0.0274 | 0.0311 | 0.0586 | 0.0585 | 0.0635 | 0.0648 | 0.0642 | 0.0645 | 0.0691* | +6.64% |
| N@5 | 0.0107 | 0.0306 | 0.0174 | 0.0197 | 0.0397 | 0.0403 | 0.0433 | 0.0431 | 0.0435 | 0.0428 | 0.0474* | +9.08% | |
| H@20 | 0.0420 | 0.0941 | 0.0688 | 0.0781 | 0.1130 | 0.1138 | 0.1230 | 0.1245 | 0.1219 | 0.1225 | 0.1298* | +4.26% | |
| N@20 | 0.0179 | 0.0441 | 0.0291 | 0.0330 | 0.0550 | 0.0560 | 0.0615 | 0.0618 | 0.0613 | 0.0605 | 0.0646* | +4.53% | |
| Yelp | H@5 | 0.0142 | 0.0160 | 0.0196 | 0.0222 | 0.0233 | 0.0248 | 0.0253 | 0.0255 | 0.0251 | 0.0258 | 0.0264* | +2.33% |
| N@5 | 0.0080 | 0.0101 | 0.0121 | 0.0137 | 0.0146 | 0.0153 | 0.0159 | 0.0153 | 0.0161 | 0.0155 | 0.0167* | +2.83% | |
| H@20 | 0.0406 | 0.0443 | 0.0564 | 0.0640 | 0.0645 | 0.0667 | 0.0688 | 0.0681 | 0.0685 | 0.0691 | 0.0712* | +3.04% | |
| N@20 | 0.0156 | 0.0179 | 0.0223 | 0.0263 | 0.0261 | 0.0275 | 0.0280 | 0.0275 | 0.0278 | 0.0283 | 0.0291* | +2.83% |
Analysis:

- ADARec consistently achieves the highest HR@K and NDCG@K scores across all four datasets (Beauty, Sports, Toys, Yelp) and both K values (5 and 20). The asterisks indicate statistical significance.
- The improvement is notable across different types of baselines, including older models (Caser, SASRec), Transformer-based models (BERT4Rec), intent-learning models (IOCRec, ICLRec, ELCRec), and especially other diffusion-based augmentation methods (PDRec, BASRec, GlobalDiff, DiffDiv). This validates ADARec's novel approach over existing diffusion-based augmentation methods that only use the final output.
- Datasets like Toys show particularly strong relative improvements (e.g., +9.08% for N@5), indicating that ADARec is highly effective where intent reconstruction might be particularly challenging or beneficial.
- The Yelp dataset shows relatively modest improvements (e.g., +2.33% for H@5). The authors attribute this to Yelp being a very comprehensive dataset, which makes the clear disentanglement of user intents into distinct units more challenging even for intent-based recommendation methods. This suggests that while ADARec excels at reconstructing hierarchy, inherent complexities in dataset structure can still pose challenges.
6.1.2. Performance on Extremely Sparse Sequences
ADARec's strength is particularly evident when dealing with extremely short user histories, which are highly prone to data sparsity.
The following are the results from Table 2 of the original paper:
| Dataset | Metric | BASRec (2025) | GlobalDiff (2025) | DiffDiv (2025) | ADARec (Ours) | Improv. |
| --- | --- | --- | --- | --- | --- | --- |
| Beauty | H@5 | 0.0441 | 0.0428 | 0.0455 | 0.0531* | +16.70% |
| N@5 | 0.0285 | 0.0272 | 0.0295 | 0.0353* | +19.66% | |
| H@20 | 0.0910 | 0.0897 | 0.0925 | 0.1066* | +15.24% | |
| N@20 | 0.0421 | 0.0409 | 0.0427 | 0.0504* | +17.48% | |
| H@5 | 0.0261 | 0.0256 | 0.0265 | 0.0305* | +15.09% | |
| Sports | N@5 | 0.0180 | 0.0174 | 0.0179 | 0.0199* | +10.56% |
| H@20 | 0.0592 | 0.0585 | 0.0590 | 0.0659* | +11.32% | |
| N@20 | 0.0270 | 0.0266 | 0.0269 | 0.0298* | +10.37% | |
| H@5 | 0.0557 | 0.0550 | 0.0560 | 0.0618* | +10.36% | |
| Toys | N@5 | 0.0389 | 0.0387 | 0.0392 | 0.0431* | +9.95% |
| H@20 | 0.1051 | 0.1043 | 0.1050 | 0.1169* | +11.23% | |
| N@20 | 0.0530 | 0.0518 | 0.0522 | 0.0586* | +10.57% | |
| H@5 | 0.0194 | 0.0189 | 0.0190 | 0.0200* | +3.09% | |
| Yelp | N@5 | 0.0121 | 0.0112 | 0.0119 | 0.0124* | +2.48% |
| H@20 | 0.0528 | 0.0506 | 0.0522 | 0.0559* | +5.87% | |
| N@20 | 0.0217 | 0.0204 | 0.0209 | 0.0225* | +3.69% | |
Analysis:
- On sparse sequences, especially for Beauty and Sports, ADARec achieves relative improvements over the strongest competitor that exceed 15% (e.g., +19.66% for N@5 on Beauty). This is a strong indicator of ADARec's ability to handle data sparsity.
- The large improvements on sparse data confirm the hypothesis that ADARec's ability to reconstruct coarse-grained intent representations from minimal item histories prevents it from being misled by noise or overfitting to sparse, misleading fine-grained details. This leads to more accurate and generalized recommendations for users with limited data.
- The Yelp dataset still shows relatively smaller improvements, consistent with the overall results, reinforcing the idea that its inherent complexity makes intent disentanglement harder.
6.1.3. Performance with Different Intent-Aware Prediction Layers
To demonstrate the generalizability of ADARec's augmentation framework, the authors integrate it with several advanced intent-aware models used as prediction layers.
The following are the results from Table 3 of the original paper:
| Model | Beauty HR@20 | Beauty NDCG@20 | Toys HR@20 | Toys NDCG@20 |
| --- | --- | --- | --- | --- |
| IOCRec (2023) | 0.0916 | 0.0444 | 0.0781 | 0.0330 |
| ADARec w/ IOCRec | 0.1053 | 0.0501 | 0.1098 | 0.0515 |
| ICLRec (2022) | 0.1072 | 0.0491 | 0.1130 | 0.0550 |
| ADARec w/ ICLRec | 0.1165 | 0.0558 | 0.1271 | 0.0628 |
| ELCRec (2024) | 0.1079 | 0.0509 | 0.1138 | 0.0560 |
| ADARec w/ ELCRec | 0.1187 | 0.0568 | 0.1298 | 0.0646 |
Analysis:
- For all three intent-aware models (IOCRec, ICLRec, ELCRec), integrating ADARec's augmentation framework consistently leads to performance improvements in both HR@20 and NDCG@20 on both the Beauty and Toys datasets.
- This demonstrates that ADARec is not just an isolated model but a valuable augmentation framework that can effectively generate richer and more structured hierarchical user representations. These representations, in turn, provide enhanced semantic information that can be leveraged by various downstream intent-based prediction models, improving their performance.
6.2. Ablation Study of Key Components (RQ2)
An ablation study was conducted to understand the individual contribution of each component within ADARec.
The following are the results from Table 4 of the original paper:
| ADC | HDA | HP-MoE | C-A Router | Lcont | Beauty HR@20 | Beauty NDCG@20 | Sports HR@20 | Sports NDCG@20 | Toys HR@20 | Toys NDCG@20 | Yelp HR@20 | Yelp NDCG@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | ✓ | ✓ | ✓ | ✓ | 0.1187 | 0.0568 | 0.0745 | 0.0341 | 0.1298 | 0.0646 | 0.0712 | 0.0291 |
| X | ✓ | ✓ | ✓ | ✓ | 0.1141 | 0.0545 | 0.0715 | 0.0328 | 0.1246 | 0.0621 | 0.0690 | 0.0281 |
| ✓ | X | ✓ | ✓ | ✓ | 0.1105 | 0.0528 | 0.0681 | 0.0310 | 0.1189 | 0.0592 | 0.0672 | 0.0276 |
| ✓ | ✓ | X | X | X | 0.1075 | 0.0505 | 0.0645 | 0.0283 | 0.1132 | 0.0557 | 0.0661 | 0.0273 |
| X | ✓ | ✓ | X | X | 0.1128 | 0.0539 | 0.0702 | 0.0321 | 0.1225 | 0.0610 | 0.0683 | 0.0279 |
| ✓ | ✓ | ✓ | X | ✓ | 0.1172 | 0.0561 | 0.0736 | 0.0337 | 0.1281 | 0.0638 | 0.0705 | 0.0288 |
| ✓ | ✓ | ✓ | ✓ | X | 0.1158 | 0.0553 | 0.0725 | 0.0332 | 0.1260 | 0.0627 | 0.0697 | 0.0284 |
Analysis:

- Removing HDA (row 3, HDA=X): This leads to the most severe performance degradation across all datasets. For example, on Beauty, HR@20 drops from 0.1187 to 0.1105. This highlights the foundational role of HDA in generating the intent hierarchy, which is the core of ADARec. Without it, the model loses its ability to reconstruct the multi-grained intents.
- Removing ADC (row 2, ADC=X): This also causes significant performance drops (e.g., Beauty HR@20 from 0.1187 to 0.1141). This validates the necessity of dynamically allocating augmentation resources. A fixed diffusion depth is less effective than an adaptively chosen one.
- Removing HP-MoE (row 4, HP-MoE=X): This variant uses ADC and HDA but replaces HP-MoE with a simpler aggregation strategy for parsing. The performance drops substantially (e.g., Beauty HR@20 to 0.1075), remaining roughly comparable to strong baselines like ELCRec (0.1079 in Table 1). This critically shows that the generated hierarchy and the specialized HP-MoE are effective only in tandem: without the MoE to process the structured hierarchical input provided by HDA, the benefits of HDA are largely lost.
- Removing the Content-Aware Router (row 6, C-A Router=X): This results in smaller but consistent performance drops (e.g., Beauty HR@20 from 0.1187 to 0.1172). This confirms its vital role in the nuanced routing and weighting of expert outputs based on content and noise level.
- Removing the Contrastive Loss (row 7, Lcont=X): This also leads to smaller but consistent performance drops (e.g., Beauty HR@20 from 0.1187 to 0.1158). This validates the importance of the contrastive loss in enforcing functional decoupling between the fine-grained and coarse-grained experts, ensuring they learn distinct yet complementary aspects of user intent.

The ablation study confirms that all components of ADARec are indispensable and contribute positively to its overall superior performance.
6.3. Efficiency Analysis (RQ3)
The efficiency of ADARec is analyzed by comparing its training time and inference latency with other models.
The following are the results from Table 5 of the original paper:
| Model | Beauty Train (s) | Beauty Infer (ms) | Sports Train (s) | Sports Infer (ms) | Toys Train (s) | Toys Infer (ms) | Yelp Train (s) | Yelp Infer (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SASRec | 2.15 | 3.1 | 6.51 | 9.5 | 4.32 | 6.8 | 5.25 | 7.9 |
| ELCRec | 3.81 | 5.9 | 11.28 | 16.1 | 7.88 | 11.5 | 9.00 | 13.2 |
| DiffDiv | 7.55 | 14.8 | 20.10 | 31.5 | 15.60 | 22.4 | 18.23 | 26.1 |
| GlobalDiff | 8.90 | 21.2 | 23.50 | 45.8 | 18.95 | 33.6 | 21.75 | 38.5 |
| Ours | 6.21 | 10.5 | 17.95 | 24.9 | 13.52 | 18.7 | 15.88 | 19.3 |
Analysis:
- Training Time: ADARec's training time (e.g., 6.21s for Beauty) is higher than that of non-diffusion baselines like SASRec (2.15s) and ELCRec (3.81s). This is expected due to the complexity of the diffusion process and the additional modules (ADC, HP-MoE). However, the authors argue that this is a justifiable one-time offline cost, which is heavily compensated by the significant improvements in recommendation quality, especially for sparse data.
- Inference Latency: Similarly, ADARec's inference latency (e.g., 10.5ms for Beauty) is higher than that of SASRec (3.1ms) and ELCRec (5.9ms). However, ADARec is generally more efficient than other diffusion-based methods like DiffDiv (14.8ms) and GlobalDiff (21.2ms). This efficiency gain stems from ADARec's lightweight architecture and, critically, its Adaptive Depth Controller (ADC), which dynamically determines the optimal, often shallower, diffusion depth required for each sequence, avoiding unnecessary computation during inference.
6.4. Hyperparameter Sensitivity (RQ4)
6.4.1. Analysis of the Auxiliary Loss Weights
The stability of ADARec is assessed by analyzing its performance sensitivity to the weights of the two auxiliary losses: the exploration loss weight and the denoising loss weight.
Figure 4 of the original paper shows the stability analysis for the auxiliary loss weights.
Analysis:
- The plots show HR@20 performance across various values of the exploration and denoising loss weights for the Beauty, Sports, and Toys datasets.
- Optimal Points and Robustness: The performance curves peak at particular settings of the two weights and, crucially, are relatively flat around those optima. This indicates that ADARec is robust to minor deviations in these hyperparameters. This stability implies that the model is not overly sensitive to the exact tuning of these weights, making it easier to apply in practice.
6.4.2. Analysis of Number of Experts in HP-MoE
The paper investigates the impact of increasing the number of experts within the Hierarchical Parsing Mixture-of-Experts (HP-MoE) module on performance and training time.
Figure 5 of the original paper plots performance (HR@20) and training time cost against the number of experts.
Analysis:
- Marginal Gains vs. Cost: As the number of experts in HP-MoE increases beyond two, the HR@20 gains become only marginal. For example, moving from 2 experts to 3 or 4 experts yields minimal improvements.
- Significant Cost Increase: However, increasing the number of experts significantly increases the training time. This is evident from the lower part of the plots, where training time rises notably with more experts.
- Justification for Two Experts: This result strongly supports ADARec's design choice of employing only two experts (fine-grained and coarse-grained). The authors explain that the Adaptive Depth Controller (ADC) typically selects a relatively small actual diffusion depth for sparse sequences, which inherently limits the number of intent levels that can be meaningfully reconstructed. Therefore, adding more experts cannot uncover more independent intent levels from this limited input information, leading to redundancy and increased computational overhead without tangible benefits. The two-expert configuration provides the optimal balance between efficiency and effectiveness.
6.5. Visualization of Intent Representations (RQ5)
To qualitatively assess ADARec's ability to construct hierarchical user intents from sparse data, t-SNE visualization is used to project the learned user intent representations into a 2D space.
Figure 6 of the original paper shows the t-SNE visualization of the learned user intent representations on the Yelp dataset.
Analysis:
- Baseline (DiffDiv), Figure 6(a): The t-SNE visualization for DiffDiv, a strong diffusion-based baseline, shows that most user intents collapse into an indistinguishable central mass. The clusters are not clearly separated, indicating that the model struggles to differentiate distinct user intents from sparse data. This visually confirms the paper's argument that existing data augmentation methods, by failing to reconstruct the intent hierarchy, do not effectively capture the true underlying intents.
- ADARec, Figure 6(b): In contrast, ADARec's visualization reveals clear, well-separated intent clusters. This visual evidence strongly suggests that ADARec successfully identifies and reconstructs distinct user intent hierarchies even from highly sparse data on the Yelp dataset. This capability is a direct result of its hierarchical diffusion augmentation and specialized parsing, allowing it to effectively de-collapse user intents.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces ADARec, an innovative framework designed to tackle the critical problem of data sparsity in sequential recommendation by reconstructing hierarchical user intents. The core of ADARec lies in its novel utilization of the step-wise denoising trajectory of a diffusion model to generate a rich intent hierarchy, moving from coarse-grained to fine-grained intents. This approach departs from traditional diffusion-based augmentation methods that treat the diffusion model as a black-box generator for single data points.
To enhance both efficiency and effectiveness, ADARec incorporates an Adaptive Depth Controller (ADC) that dynamically determines the optimal augmentation depth for each user sequence and a Hierarchical Parsing Mixture-of-Experts (HP-MoE) module. The HP-MoE intelligently decouples and processes fine-grained and coarse-grained intents through specialized experts, leveraging an asymmetric contrastive objective for robust representation learning.
Extensive experiments on real-world datasets demonstrate ADARec's superior performance, outperforming state-of-the-art baselines by 2-9% on standard benchmarks and achieving even more significant gains (3-17%) on sparse sequences. Visualizations confirm ADARec's ability to learn well-separated intent clusters, providing strong evidence for its effectiveness in de-collapsing user intents from sparse data.
7.2. Limitations & Future Work
The paper does not explicitly dedicate a section to limitations and future work. However, based on the context and experimental results, several points can be inferred:
Inferred Limitations:
- Computational Cost: While ADARec is more efficient than other diffusion-based methods, its training time and inference latency are still higher than non-diffusion baselines (SASRec, ELCRec). This might limit its applicability in extremely latency-sensitive or resource-constrained environments, despite augmentation being described as a "one-time offline cost."
- Complexity on Diverse Datasets: The relatively modest improvements on the Yelp dataset suggest that for highly diverse and complex user intent landscapes, even ADARec can struggle to fully disentangle and represent user intents. The inherent ambiguity of intent on such platforms may be difficult to resolve completely.
- Reliance on Backbone Encoder: ADARec uses ELCRec as its backbone encoder, so the quality of its initial sequence embeddings depends on the performance and inductive biases of that underlying model. If the backbone struggles with certain types of sequences, those weaknesses may propagate to ADARec.
- Predefined Max Depth: The maximum diffusion depth is a predefined hyperparameter. While the ADC adaptively chooses a depth up to this maximum, an inappropriate setting (too small or too large) could limit the expressiveness of the intent hierarchy or introduce unnecessary computation.
Potential Future Work (Inferred):
- Adaptive Maximum Depth: Instead of a fixed maximum diffusion depth, future work could explore determining the maximum plausible depth dynamically, based on dataset characteristics or user behavior patterns.
- More Sophisticated Denoising Networks: The current denoising network is a simple MLP. Investigating more powerful generative architectures (e.g., Transformers or more complex UNet-like structures) for the HDA module might further improve the quality of the reconstructed intent hierarchy.
- Cross-Domain Generalization: Exploring how ADARec's principles of hierarchical intent reconstruction can be adapted to cross-domain recommendation or cold-start scenarios, where data sparsity is even more severe.
- Explainability: Given that ADARec aims to de-collapse user intents, future work could focus on making these hierarchical intent representations more interpretable, allowing a deeper understanding of why specific recommendations are made.
- Real-time Adaptation: While the ADC provides adaptive depth, further research could explore more real-time adaptive mechanisms for the diffusion process itself, potentially adjusting noise schedules or denoising steps based on real-time user context or dynamic environmental changes.
7.3. Personal Insights & Critique
This paper presents a truly innovative approach to tackling data sparsity in sequential recommendation, moving beyond simple augmentation to a deeper understanding of user behavior. The core idea of leveraging the denoising trajectory of a diffusion model to uncover hierarchical user intents is highly insightful and opens up new avenues for research in generative models for recommender systems.
Inspirations & Transferability:
- The concept of using the intermediate states of a generative process to capture different levels of abstraction is highly transferable. It could be applied to other domains where sparse data or collapsed representations are an issue, such as natural language processing (e.g., recovering hierarchical topics from short text snippets), image recognition (e.g., fine-grained vs. coarse-grained image features), or medical diagnosis (e.g., inferring broader health conditions from sparse symptom data).
- The Adaptive Depth Controller (ADC) is a practical innovation that balances effectiveness with efficiency. This adaptive-computation paradigm could be useful in other deep learning architectures where certain layers or modules are computationally expensive but not always necessary for every input.
- The Mixture-of-Experts (MoE) design, coupled with content-aware routing and an asymmetric contrastive loss, provides a powerful blueprint for handling multi-granularity information. It could inspire new ways to design MoE systems for tasks requiring multi-level feature extraction or multi-task learning.
Potential Issues & Areas for Improvement:
- Interpretability of Latent Intents: While the t-SNE visualizations show distinct clusters, the semantic meaning of each coarse-grained or fine-grained intent learned by the model remains somewhat opaque. Further work in interpretable AI could try to align these learned intent clusters with human-understandable categories.
- Cold-Start Problem: While ADARec excels on sparse sequences, the ultimate cold-start problem (users with no history) is a different beast. It would be interesting to see how ADARec could be extended or integrated with meta-learning or transfer learning techniques to generate initial intent hierarchies for completely new users.
- Scalability for Extremely Large Item Sets: The prediction layer still computes scores for all candidate items. For systems with millions or billions of items, this can be computationally intensive. Investigating efficient sampling strategies or hierarchical softmax approaches in conjunction with ADARec would be valuable (see the sketch after this list).
- Robustness to Noisy Interactions: Real-world interaction sequences can contain noise (e.g., accidental clicks, irrelevant browsing). While diffusion models are robust to noise to some extent, the impact of significant noise in the input sequence itself on the quality of the reconstructed intent hierarchy could be further investigated.
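To illustrate the scalability point above, here is a minimal, hypothetical sketch of sampled scoring: instead of scoring every item in the catalogue, the model scores the positive item plus a random sample of negatives. The function name and sampling scheme are illustrative, not ADARec's prediction layer.

```python
import torch

def sampled_scores(user_repr: torch.Tensor,       # (B, D) user/intent representations
                   item_embeddings: torch.Tensor, # (num_items, D) item embedding table
                   positive_ids: torch.Tensor,    # (B,) ground-truth item ids
                   num_negatives: int = 100) -> torch.Tensor:
    """Score only the positive item plus sampled negatives, avoiding a full
    pass over the item catalogue. Illustrative sketch for training-time use;
    uniform sampling may repeat items or hit the positive by chance."""
    num_items = item_embeddings.size(0)
    neg_ids = torch.randint(0, num_items, (user_repr.size(0), num_negatives),
                            device=user_repr.device)
    candidate_ids = torch.cat([positive_ids.unsqueeze(1), neg_ids], dim=1)  # (B, 1+N)
    candidates = item_embeddings[candidate_ids]                             # (B, 1+N, D)
    return torch.einsum("bd,bnd->bn", user_repr, candidates)                # (B, 1+N)
```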
Overall, ADARec makes a significant contribution by addressing a fundamental challenge in sequential recommendation with a theoretically sound and empirically effective solution. Its novel integration of diffusion models for hierarchical intent reconstruction sets a new standard for data augmentation in this field.