
De-collapsing User Intent: Adaptive Diffusion Augmentation with Mixture-of-Experts for Sequential Recommendation


TL;DR Summary

The ADARec framework addresses data sparsity in sequential recommendation by using a diffusion model's step-wise denoising trajectory together with a Mixture-of-Experts architecture to decouple coarse- and fine-grained user intents, effectively reconstructing intent hierarchies. Experiments show it outperforms existing methods on standard benchmarks and, by an even larger margin, on sparse sequences.

Abstract

Sequential recommendation (SR) aims to predict users’ next action based on their historical behavior, and is widely adopted by a number of platforms. The performance of SR models relies on rich interaction data. However, in real-world scenarios, many users only have a few historical interactions, leading to the problem of data sparsity. Data sparsity not only leads to model overfitting on sparse sequences, but also hinders the model’s ability to capture the underlying hierarchy of user intents. This results in misinterpreting the user’s true intents and recommending irrelevant items. Existing data augmentation methods attempt to mitigate overfitting by generating relevant and varied data. However, they overlook the problem of reconstructing the user’s intent hierarchy, which is lost in sparse data. Consequently, the augmented data often fails to align with the user’s true intents, potentially leading to misguided recommendations. To address this, we propose the Adaptive Diffusion Augmentation for Recommendation (ADARec) framework. Critically, instead of using a diffusion model as a black-box generator, we use its entire step-wise denoising trajectory to reconstruct a user’s intent hierarchy from a single sparse sequence. To ensure both efficiency and effectiveness, our framework adaptively determines the required augmentation depth for each sequence and employs a specialized mixture-of-experts architecture to decouple coarse- and fine-grained intents. Experiments show ADARec outperforms state-of-the-art methods by 2-9% on standard benchmarks and 3-17% on sparse sequences, demonstrating its ability to reconstruct hierarchical intent representations from sparse data.


In-depth Reading


1. Bibliographic Information

1.1. Title

De-collapsing User Intent: Adaptive Diffusion Augmentation with Mixture-of-Experts for Sequential Recommendation

1.2. Authors

Xiaoxi Cui, Chao Zhao, Yurong Cheng, Xiangmin Zhou. Affiliations: Beijing Institute of Technology (Xiaoxi Cui, Chao Zhao, Yurong Cheng), RMIT University (Xiangmin Zhou). Their research backgrounds appear to be in recommendation systems, machine learning, and data mining, given the nature of the paper.

1.3. Journal/Conference

The paper does not explicitly state its venue, but its references include the Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025) and the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25). Given the implied 2025 publication year and the nature of the content, it is likely intended for a top-tier conference in artificial intelligence or information retrieval. Both SIGIR and AAAI are highly reputable and influential venues, known for publishing cutting-edge research in recommender systems, machine learning, and data mining.

1.4. Publication Year

The references indicate publication in 2025.

1.5. Abstract

Sequential recommendation (SR) aims to predict user actions based on historical interactions. However, data sparsity, a common problem where users have few historical interactions, leads to model overfitting and an inability to capture hierarchical user intents. Existing data augmentation methods attempt to mitigate overfitting but often fail to reconstruct the user's intent hierarchy, leading to misguided recommendations. To address this, the paper proposes Adaptive Diffusion Augmentation for Recommendation (ADARec). ADARec utilizes the entire step-wise denoising trajectory of a diffusion model to reconstruct a user's intent hierarchy from a single sparse sequence. To ensure efficiency and effectiveness, ADARec adaptively determines the required augmentation depth for each sequence and employs a mixture-of-experts (MoE) architecture to decouple coarse- and fine-grained intents. Experiments show ADARec outperforms state-of-the-art methods by 2-9% on standard benchmarks and 3-17% on sparse sequences, demonstrating its ability to reconstruct hierarchical intent representations from sparse data.

The PDF link (/files/papers/694a32027a7e7809d937f471/paper.pdf) suggests this is a pre-print or conference paper; its publication status is likely either submitted to or accepted by one of the 2025 conferences mentioned above.

2. Executive Summary

2.1. Background & Motivation

The core problem ADARec aims to solve is data sparsity in sequential recommendation (SR). SR models predict users' next actions based on their past interactions, which is crucial for many online platforms. However, in real-world scenarios, most users have very few interactions, leading to sparse data. This data sparsity creates two major issues:

  1. Model Overfitting: With limited data, SR models tend to overfit to the few available interactions, making them less generalizable.

  2. Hinders Hierarchical Intent Modeling: Crucially, sparse data collapses the user's intent hierarchy. For example, a user interacting with an "energy bar" and "hiking boot" might have a coarse-grained intent of "hiking" and a fine-grained intent of "buying a water bottle." However, sparse data might lead models to simply categorize these as "snacks" and "shoes," misinterpreting the true underlying intents and leading to irrelevant recommendations.

    Existing data augmentation methods try to generate more data to combat overfitting. However, they often focus on generating items that are either too similar (limited semantic clustering) or too diverse (deviating from true intent), without explicitly addressing the reconstruction of the user's intent hierarchy that is lost in sparse data. This oversight means augmented data might not align with true user intents, leading to misguided recommendations.

The paper's entry point or innovative idea is to use the step-wise denoising trajectory of a diffusion model to reconstruct the intent hierarchy. They draw inspiration from the Information Bottleneck (IB) principle, suggesting that by progressively adding noise and then denoising, a model can learn to discard fine-grained details and capture more fundamental, coarse-grained patterns, thus revealing the underlying intent hierarchy.

2.2. Main Contributions / Findings

The paper makes the following primary contributions:

  1. Proposed ADARec Framework: They introduce Adaptive Diffusion Augmentation for Recommendation (ADARec), a novel framework designed to construct a comprehensive intent hierarchy even from sparse user sequences.

  2. Hierarchical Diffusion Augmentation (HDA) Module: They propose HDA, which leverages the denoising trajectory of a diffusion process to generate a rich, multi-level hierarchy of user intents, spanning from fine-grained to coarse-grained levels. This is a key departure from prior diffusion-based augmentation that only used the final output.

  3. Specialized Modules for Efficiency and Effectiveness: They introduce two specialized modules:

    • An Adaptive Depth Controller (ADC) that intelligently determines the optimal diffusion depth for each user sequence, enhancing efficiency.
    • A Hierarchical Parsing Mixture-of-Experts (HP-MoE) module that processes different levels of the intent hierarchy by decoupling coarse- and fine-grained intents and routing them to specialized experts.
  4. Extensive Experimental Validation: Through comprehensive experiments on real-world datasets, they demonstrate that ADARec significantly outperforms state-of-the-art baselines.

    The key findings are:

  • ADARec consistently and significantly outperforms state-of-the-art sequential recommendation models, with improvements of 2-9% on standard benchmarks.
  • The performance gains are even more pronounced on sparse sequences (user history length $\leq 5$), reaching 3-17% improvement over strong competitors. This highlights its effectiveness in reconstructing hierarchical intent from limited data.
  • The t-SNE visualization confirms that ADARec successfully learns well-separated intent clusters from sparse data, unlike baselines where intents often collapse.
  • All proposed components (ADC, HDA, HP-MoE, Content-Aware Router, Contrastive Loss) are shown to be indispensable for ADARec's performance through ablation studies.
  • The framework can effectively enhance various intent-based backbone models when integrated, demonstrating its generalizability.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the ADARec framework, a beginner should understand several key concepts:

  • Sequential Recommendation (SR): This is a subfield of recommender systems where the goal is to predict the next item a user will interact with, based on their ordered sequence of past interactions. Unlike traditional recommender systems that might only consider a user's overall preferences, SR models emphasize the sequential and temporal patterns in user behavior. For example, if a user buys a camera, then a lens, the next recommendation might be a tripod, following a logical sequence of purchases.

  • Data Sparsity: This is a pervasive problem in recommender systems. It refers to the situation where there are very few recorded interactions between users and items compared to the total possible interactions. In the context of SR, data sparsity often means that many users have very short interaction histories (e.g., only 1-5 items). This makes it difficult for models to learn meaningful patterns and user preferences.

  • User Intent: This refers to the underlying goal or motivation a user has when interacting with items. The paper distinguishes between:

    • Fine-grained intent: Specific, immediate goals (e.g., "buy a specific brand of water bottle").
    • Coarse-grained intent: Broader, more abstract goals (e.g., "prepare for a hiking trip"). In sparse data, the distinction between these intents often collapses, making it hard for models to understand the user's broader motivations.
  • Diffusion Models: These are a class of generative models that have gained significant attention, particularly for image generation. They operate in two main phases:

    • Forward (Noising) Process: This process gradually adds Gaussian noise to data, progressively transforming it into pure random noise. This is like blurring an image until it's just static. Each step in this process represents a different level of noise.
    • Reverse (Denoising) Process: This is the generative part. A neural network learns to reverse the noising process, gradually removing noise from random data to reconstruct a clear data sample. In ADARec, this denoising trajectory (the sequence of outputs at each denoising step) is used to represent different levels of user intent. The intuition is that at high noise levels, the model must infer very abstract, coarse-grained information, while at low noise levels, it recovers fine-grained details.
  • Information Bottleneck (IB) Principle: Proposed by Tishby et al. (2000), this principle suggests that to learn a concise representation of relevant information, a system should compress its input by removing irrelevant details while preserving as much information as possible about a target variable. In ADARec, adding noise (the forward process) creates an information bottleneck. The denoising process is then forced to extract the essential message (user intent) by discarding less important, specific details (fine-grained noise) and keeping only the most fundamental and general patterns (coarse-grained intents).

  • Mixture-of-Experts (MoE): This is a machine learning technique where multiple "expert" sub-networks are trained, and a "gating" network learns to combine or select the outputs of these experts based on the input. Each expert specializes in a different aspect of the problem. In ADARec, HP-MoE uses separate experts for fine-grained and coarse-grained intents, with a gating mechanism to dynamically route and combine their outputs.

  • Gumbel-Softmax Estimator: This is a technique used to make discrete sampling operations differentiable, allowing gradient-based optimization in neural networks. When a model needs to make a discrete choice (e.g., selecting a specific diffusion depth from a set of integers), Gumbel-Softmax provides a continuous approximation during training that can be backpropagated through, while still allowing a discrete choice (e.g., argmax) to be made during inference.

  • Transformer: Introduced by Vaswani et al. (2017), Transformers are a neural network architecture widely used in sequential data processing, particularly in natural language processing and now sequential recommendation. Their key innovation is the self-attention mechanism, which allows the model to weigh the importance of different elements in an input sequence when processing each element, capturing long-range dependencies effectively. ADARec uses a Transformer within its fine-grained expert.

  • Cross-entropy Loss ($\mathcal{L}_{rec}$): A common loss function used in classification tasks, particularly for next-item prediction in SR. It measures the difference between the predicted probability distribution over items and the true one-hot distribution of the actual next item. A lower cross-entropy indicates better prediction accuracy. The general formula is $\mathcal{L}_{CE} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i)$, where $y_i$ is the true probability (1 for the correct class, 0 otherwise) and $\hat{y}_i$ is the predicted probability for class $i$.

  • Mean Squared Error (MSE) Loss ($\mathcal{L}_{deno}$): Another common loss function, used to quantify the difference between predicted and actual continuous values. In ADARec, it is used to train the denoising network by measuring the squared difference between the predicted noise and the actual noise added. The formula is $\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$, where $y_i$ is the true value and $\hat{y}_i$ is the predicted value.

  • Contrastive Loss (InfoNCE-based, $\mathcal{L}_{cont}$): This type of loss aims to pull positive pairs (similar samples) closer together in the embedding space while pushing negative pairs (dissimilar samples) further apart. InfoNCE (Information Noise-Contrastive Estimation) is a popular form of contrastive loss often used in self-supervised learning. In ADARec, it is used to make the coarse-grained expert learn the invariant user essence by contrasting coarse-grained representations of the same user with those of other users. A common InfoNCE formula is $\mathcal{L}_{InfoNCE} = -\log \frac{\exp(\mathrm{sim}(\mathbf{z}, \mathbf{z}^+) / \tau)}{\exp(\mathrm{sim}(\mathbf{z}, \mathbf{z}^+) / \tau) + \sum_{\mathbf{z}^- \in \mathcal{N}} \exp(\mathrm{sim}(\mathbf{z}, \mathbf{z}^-) / \tau)}$, where $\mathrm{sim}$ is a similarity function (e.g., cosine similarity), $\mathbf{z}$ is an anchor, $\mathbf{z}^+$ is a positive sample, $\mathcal{N}$ is a set of negative samples, and $\tau$ is a temperature parameter.
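
To make the InfoNCE objective concrete, the following is a minimal PyTorch sketch of a batched InfoNCE loss with in-batch negatives. The function name, shapes, and temperature are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Batched InfoNCE with in-batch negatives.

    anchor, positive: [batch, dim] embeddings; row i of `positive` is the
    positive for row i of `anchor`, and every other row acts as a negative.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # [batch, batch] cosine similarities
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)                # -log softmax of the diagonal entries

# toy usage: two noisy views of the same 8 user embeddings
z = torch.randn(8, 64)
loss = info_nce(z, z + 0.1 * torch.randn(8, 64))
```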

3.2. Previous Works

The paper frames its work against two main lines of previous research: sequential recommendation models and data augmentation methods.

  • Traditional Sequential Recommendation Models:

    • Markov Chains: Early methods like Markov Chains (He and McAuley 2016; Rendle, Freudenthaler, and Schmidt-Thieme 2010) predicted the next item based on the immediately preceding one(s).
    • Deep Learning Models:
      • RNNs (Hidasi et al. 2015): Recurrent Neural Networks captured temporal dependencies.
      • Transformers (SASRec (Kang and McAuley 2018), BERT4Rec (Sun et al. 2019)): These models, leveraging self-attention, became state-of-the-art for capturing long-range dependencies in sequences.
      • GNNs (Chang et al. 2021; Li et al. 2023b): Graph Neural Networks model item transitions as graph structures.
      • MLPs (Li et al. 2022; Zhou et al. 2022): Multi-Layer Perceptrons have also been adapted for SR.
    • Self-Supervised Learning (SSL): Recent SR works incorporate SSL objectives (Xie et al. 2022; Liu et al. 2021; Qiu et al. 2022; Yang et al. 2023) to learn robust representations by creating auxiliary prediction tasks on the input sequence (e.g., masking items and predicting them).
    • Intent Learning: More recently, SR has moved towards explicitly modeling user intents, using techniques like multi-interest modeling (Li et al. 2019; Cen et al. 2020), disentanglement (Ma et al. 2020; Li et al. 2024), and end-to-end clustering (ICLRec (Chen et al. 2022), IOCRec (Li et al. 2023a), ELCRec (Liu et al. 2024)). The paper specifically uses ELCRec as its backbone encoder, which is designed to learn explicit clusters representing user intents.
  • Data Augmentation Methods for SR:

    • Heuristic Augmentations: Simple techniques like masking and cropping (Xie et al. 2022; Liu et al. 2021) generate variations of existing sequences.
    • Diffusion-based Augmentation:
      • PDRec (Ma et al. 2024), GlobalDiff (Luo, Li, and Lin 2025), DiffDiv (Cai et al. 2025): These methods use diffusion models to generate new, generalized user preferences or augmented item sequences. However, the paper criticizes them for only utilizing the final output of the diffusion process, thus overlooking the intent hierarchy.
      • BASRec (Dang et al. 2025): This method focuses on balancing relevance and diversity in generated data through representation-level mixup.
    • Feature Denoising: Other diffusion-based approaches denoise feature representations from sparse interactions (Cui et al. 2025a,b).

3.3. Technological Evolution

Sequential recommendation has evolved significantly:

  1. Early Statistical Models: Markov Chains represented the initial phase, focusing on local item-to-item transitions.

  2. Deep Learning Revolution: RNNs, CNNs (e.g., Caser), and later Transformers (SASRec, BERT4Rec) brought sophisticated sequence modeling capabilities, capturing longer-range dependencies and complex patterns.

  3. Self-Supervised Learning Era: To combat data sparsity and learn more robust representations without extensive labeling, SSL techniques emerged, often using tasks like masked item prediction or contrastive learning.

  4. Intent-Aware Recommendation: Recognizing that users often have multiple, dynamic underlying goals, multi-interest and intent disentanglement models became prominent, aiming to explicitly identify and leverage these intents. ELCRec is an example of end-to-end clustering for intent learning.

  5. Generative Augmentation: The recent advent of powerful generative models like diffusion models opened new avenues for data augmentation, moving beyond simple heuristics to synthesize more complex and varied data.

    ADARec fits into the latest stage of this evolution, specifically at the intersection of intent-aware recommendation and generative augmentation. It aims to overcome the limitations of prior data augmentation (which often lacked intent hierarchy awareness) by leveraging diffusion models in a novel way to explicitly reconstruct the intent hierarchy from sparse data, rather than just generating more data points.

3.4. Differentiation Analysis

Compared to the main methods in related work, ADARec introduces several core differences and innovations:

  • Hierarchical Intent Reconstruction via Diffusion Trajectory:

    • Existing Augmentation: Prior diffusion-based augmentation methods (PDRec, GlobalDiff, DiffDiv) primarily use the final output of the diffusion model (e.g., a denoised representation or generated item) as augmented data. They treat the diffusion model as a "black-box generator" for single data points.
    • ADARec's Innovation: ADARec critically utilizes the entire step-wise denoising trajectory of the diffusion model. It argues that each step in this trajectory, from high noise to low noise, inherently represents a different level of abstraction, effectively reconstructing a user's intent hierarchy from coarse-grained (high noise, abstract) to fine-grained (low noise, specific). This is a fundamental conceptual shift.
  • Adaptive Augmentation Depth:

    • Existing Augmentation: Most augmentation methods apply a fixed augmentation strategy or depth to all sequences.
    • ADARec's Innovation: It introduces an Adaptive Depth Controller (ADC) that dynamically determines the optimal diffusion depth ($T_u$) for each individual user sequence. This ensures customized and efficient enhancement, avoiding unnecessary computation for sequences that might not require deep augmentation.
  • Specialized Hierarchical Parsing with Mixture-of-Experts:

    • Existing Intent Models/Augmentation: Many intent-based models use a single network or multiple parallel branches to capture interests, but typically don't explicitly parse a generated hierarchy of intents. Existing augmentation methods typically produce a single augmented representation.
    • ADARec's Innovation: It designs a dedicated Hierarchical Parsing Mixture-of-Experts (HP-MoE) module. This MoE architecture is specifically tailored to decouple and process the coarse-grained and fine-grained intent representations derived from the diffusion trajectory. It uses functionally decoupled experts (a Transformer for fine-grained, a separate module with contrastive loss for coarse-grained) and a content-aware routing mechanism, which is novel for handling diffusion-generated hierarchies.
  • Addressing the "Collapsed Intent" Problem:

    • Existing Methods: While some intent-based models aim to learn distinct intents, they struggle significantly with sparse data where the intent hierarchy has collapsed (as visualized in Figure 1 and 2).
    • ADARec's Advantage: By explicitly reconstructing this hierarchy through the HDA module and effectively parsing it with HP-MoE, ADARec directly tackles the fundamental issue of incomplete intent hierarchy representation resulting from data sparsity, leading to more accurate and less misguided recommendations.

4. Methodology

4.1. Principles

The core idea behind ADARec is to overcome the data sparsity problem in sequential recommendation by reconstructing a user's intent hierarchy from their sparse interaction history. It achieves this by:

  1. Leveraging the Denoising Trajectory of a Diffusion Model: Inspired by the Information Bottleneck principle, the diffusion model's denoising process is used to progressively reveal user intents at different levels of abstraction. At higher noise levels (more abstract denoising), the model is forced to capture coarse-grained intents, while at lower noise levels, it recovers fine-grained details. The sequence of intermediate denoised representations thus forms a natural intent hierarchy.
  2. Adaptive Augmentation: Recognizing that not all user sequences require the same level of augmentation, an Adaptive Depth Controller (ADC) dynamically determines the optimal diffusion depth for each sequence, balancing efficiency and effectiveness.
  3. Specialized Intent Parsing: A Hierarchical Parsing Mixture-of-Experts (HP-MoE) module is designed to intelligently decouple and process these coarse-grained and fine-grained intents using specialized experts and a content-aware routing mechanism, leading to a robust final user representation.

4.2. Core Methodology In-depth (Layer by Layer)

The ADARec framework processes user sequences through a sequential pipeline, starting with embedding, followed by adaptive depth determination, hierarchical augmentation, and finally, expert-driven parsing and fusion. The overall architecture is illustrated in Figure 3 of the original paper.

4.2.1. Sequence Embedding

First, a user's historical interaction sequence $S_u$ is transformed into a numerical representation. ADARec adopts ELCRec (Liu et al. 2024) as its backbone encoder to obtain the sequence embedding $\mathbf{E}_u^0$. This embedding represents the initial state of the user's history. Additionally, the length of the sequence is encoded into a vector $\mathbf{v}_{l_u}$ using a Multi-Layer Perceptron (MLP). Both $\mathbf{E}_u^0$ and $\mathbf{v}_{l_u}$ serve as inputs to the subsequent Adaptive Depth Controller.

4.2.2. Adaptive Depth Controller (ADC)

The Adaptive Depth Controller (ADC) is responsible for determining the optimal augmentation depth $T_u$ for each user sequence $S_u$. This is crucial for customizing the enhancement and managing computational costs.

The ADC takes the user's original sequence embedding $\mathbf{E}_u^0 \in \mathbb{R}^{L \times D}$ (where $L$ is the sequence length and $D$ is the embedding dimension) and the embedding vector of the normalized sequence length $\mathbf{v}_{l_u}$ as input. These inputs are concatenated and then fed into a Gated Recurrent Unit (GRU) network to capture temporal dependencies. The output of the GRU is then pooled (e.g., using mean pooling) to produce a fixed-size representation $\mathbf{h}_c$, which is passed through a linear layer to generate logits over possible diffusion depths. The possible depths range from 0 to a predefined maximum depth $T_{max}$.

The calculation for generating the logits is: $\mathbf{h}_c = \mathrm{Pool}(\mathrm{GRU}([\mathbf{E}_u^0; \mathbf{v}_{l_u}]))$, $\mathbf{logits}_u = \mathbf{W}_{\sigma} \mathbf{h}_c + b_{\sigma}$. Here:

  • $\mathbf{E}_u^0$: The initial sequence embedding of user $u$.

  • $\mathbf{v}_{l_u}$: The embedding vector representing the normalized length of sequence $S_u$.

  • $[\mathbf{E}_u^0; \mathbf{v}_{l_u}]$: Concatenation of the sequence embedding and the length embedding.

  • $\mathrm{GRU}(\cdot)$: A Gated Recurrent Unit network that processes the concatenated input to capture sequential information.

  • $\mathrm{Pool}(\cdot)$: An aggregation function, such as mean pooling, applied to the output of the GRU to obtain a fixed-size context vector $\mathbf{h}_c$.

  • $\mathbf{W}_{\sigma}$: A learnable weight matrix.

  • $b_{\sigma}$: A learnable bias vector.

  • $\mathbf{logits}_u$: A vector of logits whose dimension is $T_{max} + 1$, representing the scores for each possible diffusion depth.

    To select a discrete depth $T_u$ in a differentiable manner, the Gumbel-Softmax estimator is applied to the logits, allowing backpropagation through the discrete selection process: $\mathbf{p}_u = \mathrm{Softmax}((\mathbf{logits}_u + \mathbf{g}) / \tau_{gs})$. Where:

  • $\mathbf{logits}_u$: The logits vector computed previously.

  • $\mathbf{g}$: A vector of i.i.d. (independent and identically distributed) samples drawn from a $\mathrm{Gumbel}(0,1)$ distribution. This introduces randomness during training to encourage exploration.

  • $\tau_{gs}$: The Gumbel-Softmax temperature. A higher temperature makes the distribution flatter (more uniform), while a lower temperature makes it sharper (closer to one-hot).

  • $\mathrm{Softmax}(\cdot)$: The softmax function, which converts the logits into a probability distribution $\mathbf{p}_u$ over the possible depths. During the forward pass (inference), the actual depth $T_u$ is selected by taking the argmax of $\mathbf{p}_u$ (the depth with the highest probability). During the backward pass (training), the continuous Gumbel-Softmax probabilities $\mathbf{p}_u$ are used for gradient calculation.

To encourage exploration of different diffusion depths and prevent the ADC from prematurely converging to shallow, "safe" depths, an auxiliary exploration loss based on the entropy of the ADC's probability distribution $\mathbf{p}_u$ is introduced: $\mathcal{L}_{expl} = - \sum_{i=0}^{T_{max}} p_{u,i} \log p_{u,i}$. Here:

  • $p_{u,i}$: The probability assigned by Gumbel-Softmax to the $i$-th diffusion depth for user $u$. This entropy-based regularizer encourages the ADC to maintain uncertainty and explore a wider range of depths, leading to more robust training for the subsequent HDA and HP-MoE modules.
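
As a sketch of how the ADC ties these pieces together, the following PyTorch module follows the equations above (GRU over the concatenated embeddings, mean pooling, a linear head over depths, Gumbel-Softmax selection, and an entropy-based exploration term). All dimensions, module names, and the sign convention of the exploration term (chosen so that minimizing it favors a more spread-out depth distribution) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveDepthController(nn.Module):
    """Sketch of the ADC: GRU over [E_u^0 ; v_lu], mean pooling, logits over depths 0..T_max."""

    def __init__(self, dim, t_max, tau_gs=0.5):
        super().__init__()
        self.len_embed = nn.Linear(1, dim)          # MLP embedding of the normalized length
        self.gru = nn.GRU(2 * dim, dim, batch_first=True)
        self.to_logits = nn.Linear(dim, t_max + 1)  # one score per depth in {0, ..., T_max}
        self.tau_gs = tau_gs

    def forward(self, E0, norm_len):
        # E0: [B, L, D] sequence embeddings; norm_len: [B, 1] normalized sequence lengths.
        v_len = self.len_embed(norm_len).unsqueeze(1).expand(-1, E0.size(1), -1)
        h, _ = self.gru(torch.cat([E0, v_len], dim=-1))
        h_c = h.mean(dim=1)                         # mean pooling over sequence positions
        logits = self.to_logits(h_c)

        # Gumbel-Softmax: hard one-hot in the forward pass, soft gradients in the backward pass.
        p_hard = F.gumbel_softmax(logits, tau=self.tau_gs, hard=True)
        T_u = p_hard.argmax(dim=-1)                 # discrete depth per sequence

        # Exploration regularizer (negative entropy): adding it to the total loss
        # rewards a higher-entropy, more exploratory depth distribution.
        p_soft = F.softmax(logits, dim=-1)
        expl_loss = (p_soft * p_soft.clamp_min(1e-9).log()).sum(dim=-1).mean()
        return T_u, expl_loss

# toy usage: batch of 4 sequences of length 50, dim 64, T_max = 8
adc = AdaptiveDepthController(dim=64, t_max=8)
T_u, l_expl = adc(torch.randn(4, 50, 64), torch.rand(4, 1))
```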

4.2.3. Hierarchical Diffusion Augmentation (HDA)

The Hierarchical Diffusion Augmentation (HDA) module is the core component for generating the user intent hierarchy. Unlike other diffusion-based augmentation methods that use only the final output, HDA leverages the entire denoising trajectory.

Intuition behind Constructing Hierarchical Intent via Denoising Trajectory: The HDA module's design is inspired by the observation that the denoising process of a diffusion model naturally progresses from abstract to concrete information.

  • At high noise levels (i.e., when $k$ is large, approaching $T_u$), the input $\mathbf{E}_u^k$ is heavily corrupted with noise. The denoising network $\epsilon_\theta(\mathbf{E}_u^k, k)$ is forced to extract only the most abstract, fundamental information to reconstruct the original signal. This process yields a coarse-grained intent. The information retained from the original sequence, $I(\hat{\mathbf{E}}_u^k; \mathbf{E}_u^0)$, is significantly compressed compared to the total information $H(\mathbf{E}_u^0)$, yet it maintains predictive relevance for the next item $Y_{next}$.

  • At low noise levels (i.e., when $k$ is small, approaching 0), the input $\mathbf{E}_u^k$ contains more of the original signal. The denoising network can recover more specific behavioral details, forming a fine-grained intent. In this case, the information is almost fully preserved: $I(\hat{\mathbf{E}}_u^k; \mathbf{E}_u^0) \approx H(\mathbf{E}_u^0)$.

    Thus, the entire denoising trajectory $\{\hat{\mathbf{E}}_u^{T_u}, \hat{\mathbf{E}}_u^{T_u-1}, \ldots, \hat{\mathbf{E}}_u^0\}$ inherently forms a user intent hierarchy, where higher $k$ values correspond to coarse-grained intents and lower $k$ values to fine-grained intents.

The process of HDA: Given a user $u$'s sequence embedding $\mathbf{E}_u^0$ and the ADC-predicted depth $T_u$, the diffusion's forward process (noising) progressively adds Gaussian noise to generate noisy representations $\{\mathbf{E}_u^k\}_{k=0}^{T_u}$. The forward process is defined as: $\mathbf{E}_u^k = \sqrt{\bar{\alpha}_k} \mathbf{E}_u^0 + \sqrt{1 - \bar{\alpha}_k} \boldsymbol{\epsilon}_k$, with $\boldsymbol{\epsilon}_k \sim \mathcal{N}(0, \mathbf{I})$. Here:

  • $\mathbf{E}_u^k$: The noisy representation of the sequence embedding at diffusion step $k$.

  • $\mathbf{E}_u^0$: The original sequence embedding.

  • $\bar{\alpha}_k$: A predefined noise schedule parameter (e.g., from DDPM's linear variance schedule), which controls the amount of original signal preserved at step $k$. $\bar{\alpha}_k$ decreases as $k$ increases, meaning more noise is added.

  • $\boldsymbol{\epsilon}_k$: A random noise vector sampled from a standard normal distribution $\mathcal{N}(0, \mathbf{I})$.

    The core of HDA lies in the reverse process (denoising), which is executed by a trainable denoising network $D_\theta$. This network learns to predict the noise added at each step, allowing the original signal to be reconstructed. Given the noisy input $\mathbf{E}_u^k$ and the current step $k$, the network predicts the noise $\boldsymbol{\epsilon}_k$, and the denoised representation is estimated as: $\hat{\mathbf{E}}_u^k = \frac{\mathbf{E}_u^k - \sqrt{1 - \bar{\alpha}_k}\, \epsilon_\theta(\mathbf{E}_u^k, k)}{\sqrt{\bar{\alpha}_k}}$. Where:

  • $\hat{\mathbf{E}}_u^k$: The estimated denoised representation at diffusion step $k$.

  • $\epsilon_\theta(\mathbf{E}_u^k, k)$: The denoising network (parameterized by $\theta$) that predicts the noise component $\boldsymbol{\epsilon}_k$ from the noisy representation $\mathbf{E}_u^k$ and the current diffusion step $k$. For simplicity and efficiency, the denoising network $D_\theta$ is implemented as a simple Multi-Layer Perceptron (MLP).

To train the denoising network $D_\theta$, it is optimized using a standard Mean Squared Error (MSE) loss between the predicted noise and the true noise: $\mathcal{L}_{deno} = ||\epsilon_\theta(\mathbf{E}_u^k, k) - \boldsymbol{\epsilon}_k||^2$. Here:

  • $\epsilon_\theta(\mathbf{E}_u^k, k)$: The noise predicted by the denoising network.
  • $\boldsymbol{\epsilon}_k$: The actual noise added at diffusion step $k$. The MSE loss ensures that the denoising network accurately learns to remove the noise.

This iterative denoising process, from the highest noise level (at $\mathbf{E}_u^{T_u}$) down to the lowest (approaching $\mathbf{E}_u^0$), generates a sequence of denoised representations that forms the intent hierarchy: $A_u = \{\hat{\mathbf{E}}_u^{T_u}, \hat{\mathbf{E}}_u^{T_u-1}, \ldots, \hat{\mathbf{E}}_u^0\}$. In this hierarchy, representations with higher $k$ (closer to $T_u$) capture coarse-grained intents, while representations with lower $k$ (closer to 0) capture fine-grained intents. The output of the HDA module is this entire structured set $A_u$.
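
A minimal sketch of the HDA forward noising and trajectory generation, using the closed-form DDPM-style equations above; the linear variance schedule, the MLP denoiser layout, and the per-step loss accumulation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalDiffusionAug(nn.Module):
    """Sketch of HDA: noise E_u^0 at each step k, denoise it back, keep the whole trajectory."""

    def __init__(self, dim, t_max, beta_start=1e-4, beta_end=0.02):
        super().__init__()
        betas = torch.linspace(beta_start, beta_end, t_max)       # assumed linear variance schedule
        self.register_buffer("alpha_bar", torch.cumprod(1.0 - betas, dim=0))
        self.step_embed = nn.Embedding(t_max, dim)
        self.denoiser = nn.Sequential(                            # epsilon_theta as a simple MLP
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, E0, depth):
        # E0: [B, L, D]; depth: integer T_u chosen by the ADC for this sequence.
        trajectory, deno_loss = [E0], E0.new_zeros(())
        for k in range(depth):
            a_bar = self.alpha_bar[k]
            eps = torch.randn_like(E0)
            # Forward (noising) step: closed-form sample of E_u^k from E_u^0.
            Ek = a_bar.sqrt() * E0 + (1 - a_bar).sqrt() * eps
            # Reverse step: predict the injected noise and recover \hat{E}_u^k.
            k_emb = self.step_embed.weight[k].expand_as(Ek)
            eps_hat = self.denoiser(torch.cat([Ek, k_emb], dim=-1))
            Ek_hat = (Ek - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
            trajectory.append(Ek_hat)
            deno_loss = deno_loss + ((eps_hat - eps) ** 2).mean() # L_deno (MSE on the noise)
        return trajectory, deno_loss                              # higher index -> coarser intent

# toy usage
hda = HierarchicalDiffusionAug(dim=64, t_max=8)
traj, l_deno = hda(torch.randn(4, 50, 64), depth=5)
```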

4.2.4. Hierarchical Parsing MoE (HP-MoE)

The Hierarchical Parsing Mixture-of-Experts (HP-MoE) module is designed to efficiently parse the intent hierarchy $A_u$ generated by HDA, intelligently decoupling and processing fine-grained and coarse-grained intents.

Functionally Decoupled Experts: The HP-MoE employs two specialized experts:

  1. Fine-grained Expert ($E_{fin}$): This expert is implemented as a Transformer network. It is trained solely using the recommendation loss $\mathcal{L}_{rec}$, which optimizes for next-item prediction. This direct supervision compels $E_{fin}$ to capture the subtle, transient, and specific details within the user's interaction sequence that are crucial for accurate immediate recommendations.

  2. Coarse-grained Expert ($E_{coar}$): This expert is designed to extract the user's coarse-grained intent. It is regularized by an exclusive contrastive loss, $\mathcal{L}_{cont}$. This InfoNCE-based loss forces $E_{coar}$ to learn the invariant, underlying essence of the user's preferences by pulling together representations from the user's own coarse-grained hierarchy while pushing them away from the coarse-grained intents of other users in the same batch.

    The contrastive loss for the coarse-grained expert is defined as: $\mathcal{L}_{cont} = \sum_{k = \lfloor T_u / 2 \rfloor}^{T_u - 1} - \log \frac{\exp(\mathrm{sim}(\mathbf{z}_{u,k}, \mathbf{z}_{u,k+1}) / \tau)}{\sum_{\mathbf{z}_j \in \mathcal{B}_k} \exp(\mathrm{sim}(\mathbf{z}_{u,k}, \mathbf{z}_j) / \tau)}$. Here:

  • $\mathrm{sim}(\cdot, \cdot)$: The cosine similarity function, used to measure the similarity between two intent representations.
  • $\mathbf{z}_{u,k}$: A coarse-grained representation of user $u$ at diffusion step $k$, i.e., the output of $E_{coar}$ when processing $\hat{\mathbf{E}}_u^k$.
  • $\mathbf{z}_{u,k+1}$: The positive sample. This is a coarse-grained representation from the next higher diffusion step $k+1$ for the same user $u$, implying a related but more abstract intent.
  • $\mathcal{B}_k$: The candidate set in the denominator. It contains the positive sample $\mathbf{z}_{u,k+1}$ together with coarse-grained representations from other users within the same training batch, which serve as negatives. This ensures that the expert differentiates the current user's intent from others.
  • $\tau$: A temperature parameter for the InfoNCE loss, controlling the sharpness of the probability distribution. The summation runs from $\lfloor T_u / 2 \rfloor$ to $T_u - 1$, focusing the contrastive learning on the upper (more coarse-grained) half of the generated hierarchy.
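
Unlike the generic InfoNCE sketch given earlier, this hierarchy-level loss takes the same user's representation one level coarser ($k+1$) as the positive and only sums over the upper half of the trajectory. A minimal sketch, assuming a shared depth $T$ across the batch and stacked coarse-expert outputs, is shown below; names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def hierarchy_contrastive_loss(z, tau=0.07):
    """Sketch of L_cont over the coarse half of the hierarchy.

    z: [B, T+1, D] coarse-expert outputs for a batch of users,
       ordered from k = 0 (finest) to k = T (coarsest).
    """
    B, t_plus_1, _ = z.shape
    T = t_plus_1 - 1
    z = F.normalize(z, dim=-1)
    loss = z.new_zeros(())
    for k in range(T // 2, T):                       # upper (coarser) half of the hierarchy
        anchor = z[:, k]                             # [B, D]
        pos = z[:, k + 1]                            # positives: same user, one level coarser
        logits = anchor @ pos.t() / tau              # [B, B]; off-diagonal entries are in-batch negatives
        labels = torch.arange(B, device=z.device)
        loss = loss + F.cross_entropy(logits, labels)
    return loss

# toy usage: batch of 8 users, hierarchy depth T = 6, dim 64
l_cont = hierarchy_contrastive_loss(torch.randn(8, 7, 64))
```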

Content-Aware Routing and Gated Fusion: To synthesize a unified user representation from the intent hierarchy $A_u$, a two-level gated fusion mechanism is employed.

First, at each hierarchy level $k$, a content-aware routing mechanism determines how much each expert contributes. The gate probability $g_k$ considers both the noise-level embedding $\mathbf{emb}(k)$ (representing the current level of abstraction) and the pooled content representation of the denoised sequence at that level, $\mathrm{Pool}(\hat{\mathbf{E}}_u^k)$: $g_k = \mathrm{Sigmoid}(\mathrm{MLP}([\mathrm{Pool}(\hat{\mathbf{E}}_u^k); \mathbf{emb}(k)]))$. Where:

  • $\mathrm{Pool}(\hat{\mathbf{E}}_u^k)$: An aggregation (e.g., mean pooling) of the denoised representation at step $k$, capturing its content.

  • $\mathbf{emb}(k)$: An embedding vector corresponding to the diffusion step $k$. This provides positional information about the hierarchy level.

  • $[\cdot; \cdot]$: Concatenation of the pooled content and the step embedding.

  • $\mathrm{MLP}(\cdot)$: A Multi-Layer Perceptron.

  • $\mathrm{Sigmoid}(\cdot)$: The sigmoid activation function, which squashes the output to a value between 0 and 1, yielding the gate probability $g_k$. A high $g_k$ gives more weight to the fine-grained expert, while $1-g_k$ weights the coarse-grained expert.

    Next, the outputs from each expert are aggregated across the entire hierarchy, weighted by their respective gate probabilities, to form consolidated fine-grained and coarse-grained representations: $\mathbf{z}_{fin} = \frac{\sum_{k=0}^{T_u} g_k \cdot E_{fin}(\hat{\mathbf{E}}_u^k)}{\sum_{k=0}^{T_u} g_k + \epsilon}$ and $\mathbf{z}_{coar} = \frac{\sum_{k=0}^{T_u} (1 - g_k) \cdot E_{coar}(\hat{\mathbf{E}}_u^k)}{\sum_{k=0}^{T_u} (1 - g_k) + \epsilon}$. Here:

  • $E_{fin}(\hat{\mathbf{E}}_u^k)$: The output of the fine-grained expert for the denoised representation at step $k$.

  • $E_{coar}(\hat{\mathbf{E}}_u^k)$: The output of the coarse-grained expert for the denoised representation at step $k$.

  • $\epsilon$: A small constant added to the denominator to prevent division by zero. These equations compute a weighted average of expert outputs, where $g_k$ determines the contribution of $E_{fin}$ and $(1-g_k)$ the contribution of $E_{coar}$ at each hierarchy level.

Finally, to produce the ultimate user representation $\mathbf{h}_u$, a final gating vector $\mathbf{gate}_u$ is computed from the aggregated coarse-grained representation $\mathbf{z}_{coar}$. This gate adaptively combines $\mathbf{z}_{fin}$ and $\mathbf{z}_{coar}$, balancing stable long-term themes (coarse-grained) against transient specifics (fine-grained): $\mathbf{gate}_u = \mathrm{Sigmoid}(\mathbf{W}_g \mathbf{z}_{coar} + \mathbf{b}_g)$ and $\mathbf{h}_u = \mathbf{gate}_u \odot \mathbf{z}_{fin} + (1 - \mathbf{gate}_u) \odot \mathbf{z}_{coar}$. Where:

  • $\mathbf{W}_g$: A learnable weight matrix for the final gate.
  • $\mathbf{b}_g$: A learnable bias vector for the final gate.
  • $\odot$: The element-wise multiplication (Hadamard product) operator. This final gating mechanism allows the model to dynamically adjust the influence of fine-grained and coarse-grained intents based on the user's overall coarse-grained preferences.

Prediction Layer: The fused user representation $\mathbf{h}_u$ is then passed to a prediction layer, which computes probability scores $\mathbf{y}_u$ over all candidate items in the item set $\mathcal{I}$; these scores are used for making recommendations.
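
The routing, aggregation, final gating, and scoring steps can be sketched for a single user's trajectory as below. The expert internals (a one-layer Transformer for the fine-grained expert, an MLP stand-in for the coarse-grained expert), dimensions, and the dot-product prediction layer are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class HPMoEFusion(nn.Module):
    """Sketch of HP-MoE routing and fusion for one user's denoising trajectory."""

    def __init__(self, dim, t_max, n_items):
        super().__init__()
        self.fine_expert = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=1)
        self.coarse_expert = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.step_embed = nn.Embedding(t_max + 1, dim)
        self.router = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.final_gate = nn.Linear(dim, dim)
        self.item_embed = nn.Embedding(n_items, dim)

    def forward(self, trajectory):
        # trajectory: list of [L, D] denoised sequence representations, index k = 0..T_u.
        fine_sum = coarse_sum = 0.0
        g_sum = one_minus_g_sum = 1e-8                      # epsilon keeps the division safe
        for k, Ek_hat in enumerate(trajectory):
            pooled = Ek_hat.mean(dim=0)                     # content summary of level k
            g_k = torch.sigmoid(self.router(torch.cat([pooled, self.step_embed.weight[k]])))
            fine_out = self.fine_expert(Ek_hat.unsqueeze(0)).mean(dim=1).squeeze(0)
            coarse_out = self.coarse_expert(pooled)
            fine_sum = fine_sum + g_k * fine_out            # weighted by g_k
            coarse_sum = coarse_sum + (1 - g_k) * coarse_out
            g_sum = g_sum + g_k
            one_minus_g_sum = one_minus_g_sum + (1 - g_k)
        z_fin, z_coar = fine_sum / g_sum, coarse_sum / one_minus_g_sum
        gate = torch.sigmoid(self.final_gate(z_coar))       # element-wise gate from z_coar
        h_u = gate * z_fin + (1 - gate) * z_coar            # fused user representation
        return h_u @ self.item_embed.weight.t()             # scores over all candidate items

# toy usage: depth T_u = 5, sequence length 50, dim 64, 1000 items
moe = HPMoEFusion(dim=64, t_max=8, n_items=1000)
scores = moe([torch.randn(50, 64) for _ in range(6)])
```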

4.2.5. Training Objective

The ADARec framework is trained end-to-end using a composite objective function that combines multiple losses: $\mathcal{L}_{tot} = \mathcal{L}_{rec} + \lambda_{deno} \mathcal{L}_{deno} + \lambda_{cont} \mathcal{L}_{cont} + \lambda_{expl} \mathcal{L}_{expl}$. Here:

  • $\mathcal{L}_{tot}$: The total loss to be minimized during training.
  • $\mathcal{L}_{rec}$: The primary recommendation loss, computed as a standard cross-entropy loss. For a given user $u$, it measures the discrepancy between the predicted probabilities $\mathbf{y}_u$ and the true next item: $\mathcal{L}_{rec} = - \sum_{i \in \mathcal{I}} p(i) \log(\mathbf{y}_{u,i})$, where $p(i)$ is 1 for the true next item $i$ and 0 for all other items in the item set $\mathcal{I}$, and $\mathbf{y}_{u,i}$ is the predicted probability score of item $i$ for user $u$.
  • $\mathcal{L}_{deno}$: The denoising loss, which trains the HDA module (specifically the denoising network $D_\theta$) to accurately reconstruct the original sequence embedding by predicting the added noise.
  • $\mathcal{L}_{cont}$: The contrastive loss, which acts as a key regularizer for the coarse-grained expert in HP-MoE, enforcing the functional decoupling of intents.
  • $\mathcal{L}_{expl}$: The exploration loss, an entropy-based regularizer for the ADC, preventing it from prematurely converging to shallow diffusion depths and encouraging broader exploration.
  • $\lambda_{deno}$, $\lambda_{cont}$, $\lambda_{expl}$: Hyperparameters that balance the contribution of each auxiliary loss component to the total objective.
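
Assembling the composite objective is then a weighted sum, as in this small sketch; the weights shown sit inside the paper's reported search ranges, and the four sub-losses refer to the sketches above.

```python
import torch

def total_loss(rec_loss, deno_loss, cont_loss, expl_loss,
               lam_deno=0.1, lam_cont=0.05, lam_expl=0.01):
    """L_tot = L_rec + lambda_deno * L_deno + lambda_cont * L_cont + lambda_expl * L_expl."""
    return rec_loss + lam_deno * deno_loss + lam_cont * cont_loss + lam_expl * expl_loss

# toy usage with scalar placeholders for the four sub-losses
l_tot = total_loss(torch.tensor(2.3), torch.tensor(0.8),
                   torch.tensor(1.1), torch.tensor(-1.5))
```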

5. Experimental Setup

5.1. Datasets

The experiments are conducted on four public benchmark datasets, commonly used in sequential recommendation research:

  • Beauty: An Amazon review dataset, typically containing interactions related to beauty products.

  • Sports: An Amazon review dataset, containing interactions related to sports and outdoor equipment.

  • Toys: An Amazon review dataset, focusing on toys and games.

  • Yelp: A dataset from the Yelp platform, encompassing various business categories (e.g., restaurants, shops), providing a more diverse set of user intents.

    These datasets are typically characterized by sparse user interactions and varying degrees of item diversity. The paper states it strictly adheres to the datasets and pipelines released by ELCRec (Liu et al. 2024) to ensure fairness and reproducibility. These datasets are effective for validating the method's performance because they represent real-world user behavior, including the presence of data sparsity and complex user intent patterns, which are the main challenges ADARec aims to address.

5.2. Evaluation Metrics

The performance of sequential recommendation models is typically evaluated using ranking-based metrics that assess how well the model ranks the true next item among a list of candidates. The paper uses Hit Rate (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K).

  1. Hit Rate (HR@K):

    • Conceptual Definition: Hit Rate measures the proportion of users for whom the true next item appears in the top-K recommended items. It indicates whether the correct item was found within the recommendation list, regardless of its position.
    • Mathematical Formula: $\mathrm{HR@K} = \frac{\text{Number of users for whom the true next item is in the top-K recommendations}}{\text{Total number of users}}$. More formally, for a set of users $U$: $\mathrm{HR@K} = \frac{1}{|U|} \sum_{u \in U} \mathbb{I}(\text{true\_item}_u \in \text{top-K\_list}_u)$
    • Symbol Explanation:
      • $U$: The set of all users in the evaluation.
      • $|U|$: The total number of users.
      • $\mathbb{I}(\cdot)$: An indicator function that returns 1 if the condition inside is true, and 0 otherwise.
      • $\text{true\_item}_u$: The actual next item interacted with by user $u$.
      • $\text{top-K\_list}_u$: The list of top-K items recommended to user $u$.
  2. Normalized Discounted Cumulative Gain (NDCG@K):

    • Conceptual Definition: NDCG is a ranking quality metric that accounts for the position of the relevant items in the recommendation list. It assigns higher scores to relevant items that appear higher in the list and penalizes relevant items that appear lower. It is "normalized" to range from 0 to 1, where 1 represents a perfect ranking.
    • Mathematical Formula: First, Discounted Cumulative Gain (DCG@K) is calculated: $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)} $ Then, NDCG@K is DCG@K divided by the Ideal DCG (IDCG@K), which is the maximum possible DCG if all relevant items were perfectly ranked: $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
    • Symbol Explanation:
      • $K$: The number of top items in the recommendation list being considered.

      • $i$: The rank of an item in the recommendation list (starting from 1).

      • $\mathrm{rel}_i$: The relevance score of the item at rank $i$. For next-item prediction, relevance is typically binary (1 if it is the true next item, 0 otherwise).

      • $\mathrm{IDCG@K}$: The Ideal Discounted Cumulative Gain for the given user, calculated by assuming the true next item is at rank 1 and all other items are irrelevant (or ranked perfectly if multiple relevant items were possible, which is less common in next-item prediction for SR).

        The paper evaluates HR@5, NDCG@5, HR@20, and NDCG@20, meaning they consider the top 5 and top 20 recommendations.
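
For next-item prediction with a single relevant item per user, both metrics reduce to simple rank checks. The sketch below assumes a score matrix with one row per user and the index of the true next item; it is an illustrative computation, not the paper's evaluation code.

```python
import torch

def hr_ndcg_at_k(scores, true_items, k=20):
    """HR@K and NDCG@K for next-item prediction (one relevant item per user).

    scores:     [num_users, num_items] predicted scores
    true_items: [num_users] index of the actual next item
    """
    topk = scores.topk(k, dim=-1).indices                    # [num_users, k] ranked item ids
    hits = (topk == true_items.unsqueeze(-1))                # [num_users, k] boolean hit matrix
    hr = hits.any(dim=-1).float().mean()
    # With binary relevance, DCG@K = 1 / log2(rank + 1) at the hit position and IDCG@K = 1.
    ranks = hits.float().argmax(dim=-1) + 1                  # 1-based rank of the hit (masked below if no hit)
    dcg = hits.any(dim=-1).float() / torch.log2(ranks.float() + 1)
    return hr.item(), dcg.mean().item()

# toy usage: 4 users, 100 candidate items
hr, ndcg = hr_ndcg_at_k(torch.randn(4, 100), torch.randint(0, 100, (4,)), k=20)
```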

5.3. Baselines

ADARec is benchmarked against eleven diverse baselines, covering a wide spectrum of sequential recommendation models and data augmentation techniques:

  • Early Deep Learning Models:
    • Caser (Tang and Wang 2018): A Convolutional Sequence Embedding model that uses CNNs to capture local sequential patterns.
  • Transformer-based Models:
    • SASRec (Kang and McAuley 2018): A pioneering self-attentive sequential recommendation model using Transformer architecture.
    • BERT4Rec (Sun et al. 2019): Adapts BERT's masked language model objective for sequential recommendation with bidirectional encoder representations.
  • Intent-Learning Methods (often SSL-based): These models explicitly try to capture different user intents.
    • IOCRec (Li et al. 2023a): Multi-Intention Oriented Contrastive Learning for Sequential Recommendation.
    • ICLRec (Chen et al. 2022): Intent Contrastive Learning for Sequential Recommendation.
    • ELCRec (Liu et al. 2024): End-to-end Learnable Clustering for Intent Learning in Recommendation. This model also serves as ADARec's backbone encoder.
  • Data Augmentation Frameworks (Diffusion-based): These are the most direct competitors, also using diffusion models for augmentation.
    • PDRec (Ma et al. 2024): Plug-In Diffusion Model for Sequential Recommendation.

    • BASRec (Dang et al. 2025): Augmenting Sequential Recommendation with Balanced Relevance and Diversity.

    • GlobalDiff (Luo, Li, and Lin 2025): Enhancing Sequential Recommendation with Global Diffusion.

    • DiffDiv (Cai et al. 2025): Unleashing the Potential of Diffusion Models Towards Diversified Sequential Recommendations.

      The performance data for the first nine models (Caser through ELCRec) are adopted from ELCRec (Liu et al. 2024) for consistency. BASRec, PDRec, GlobalDiff, and DiffDiv are re-implemented within ADARec's framework for fair comparison, suggesting their performance numbers are directly obtained under the same experimental conditions as ADARec.

5.4. Implementation Details

  • Optimizer: Adam optimizer is used with a weight decay of $1 \times 10^{-4}$.
  • Sequence Length: The maximum user sequence length is capped at 50.
  • Hyperparameters:
    • Gumbel-Softmax temperature ($\tau_{gs}$): Fixed at 0.5.
    • InfoNCE temperature ($\tau$): Set to 0.07.
    • Maximum diffusion depth ($T_{max}$): Tuned via grid search over $\{4, 6, 8, 10, 12\}$.
    • Loss weights ($\lambda_{deno}$, $\lambda_{cont}$): Selected from $\{0.01, 0.03, 0.05, 0.1, 0.15\}$.
    • Exploration loss weight ($\lambda_{expl}$): Searched in $\{0.001, 0.003, 0.01, 0.03, 0.1\}$.
  • Computing Environment: Implemented on a server equipped with 20GB NVIDIA RTX 3090 GPUs.
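
A small sketch of the corresponding training configuration, assuming PyTorch's Adam; the grids mirror the settings listed above, while the learning rate is an illustrative assumption not reported in this summary.

```python
import torch

# Hyperparameter grids from the implementation details above.
T_MAX_GRID = [4, 6, 8, 10, 12]                      # maximum diffusion depth T_max
LAMBDA_GRID = [0.01, 0.03, 0.05, 0.1, 0.15]         # searched for lambda_deno and lambda_cont
LAMBDA_EXPL_GRID = [0.001, 0.003, 0.01, 0.03, 0.1]  # searched for lambda_expl

def build_optimizer(model, lr=1e-3):
    """Adam with the stated weight decay of 1e-4 (the learning rate here is assumed)."""
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)

# toy usage
opt = build_optimizer(torch.nn.Linear(64, 64))
```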

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Overall Performance (RQ1)

The overall performance results, presented in Table 1, demonstrate ADARec's consistent and significant outperformance across all benchmarks and metrics compared to a wide range of baselines. The "Improv." column highlights the percentage improvement over the strongest competitor in each category.

The following are the results from Table 1 of the original paper:

Dataset Metric Caser (2018) SASRec (2018) BERT4Rec (2019) IOCRec (2023) ICLRec (2022) ELCRec (2024) PDRec (2024) BASRec (2025) GDiff (2025) DDiv (2025) ADARec (Ours) Improv.
Beauty H@5 0.0251 0.0374 0.0360 0.0408 0.0495 0.0529 0.0569 0.0551 0.0563 0.0575 0.0600* +4.35%
N@5 0.0145 0.0241 0.0216 0.0245 0.0326 0.0355 0.0380 0.0385 0.0382 0.0375 0.0403* +4.68%
H@20 0.0643 0.0901 0.0984 0.0916 0.1072 0.1079 0.1145 0.1135 0.1129 0.1152 0.1187* +3.04%
N@20 0.0298 0.0387 0.0391 0.0444 0.0491 0.0509 0.0541 0.0539 0.0531 0.0548 0.0568* +3.65%
Sports H@5 0.0114 0.0206 0.0217 0.0246 0.0263 0.0286 0.0325 0.0328 0.0321 0.0315 0.0341* +3.96%
N@5 0.0076 0.0135 0.0143 0.0162 0.0173 0.0185 0.0218 0.0220 0.0215 0.0211 0.0228* +3.64%
H@20 0.0399 0.0497 0.0604 0.0641 0.0630 0.0648 0.0701 0.0728 0.0719 0.0714 0.0745* +2.33%
N@20 0.0178 0.0216 0.0251 0.0280 0.0276 0.0286 0.0328 0.0331 0.0325 0.0318 0.0341* +3.02%
Toys H@5 0.0166 0.0463 0.0274 0.0311 0.0586 0.0585 0.0635 0.0648 0.0642 0.0645 0.0691* +6.64%
N@5 0.0107 0.0306 0.0174 0.0197 0.0397 0.0403 0.0433 0.0431 0.0435 0.0428 0.0474* +9.08%
H@20 0.0420 0.0941 0.0688 0.0781 0.1130 0.1138 0.1230 0.1245 0.1219 0.1225 0.1298* +4.26%
N@20 0.0179 0.0441 0.0291 0.0330 0.0550 0.0560 0.0615 0.0618 0.0613 0.0605 0.0646* +4.53%
Yelp H@5 0.0142 0.0160 0.0196 0.0222 0.0233 0.0248 0.0253 0.0255 0.0251 0.0258 0.0264* +2.33%
N@5 0.0080 0.0101 0.0121 0.0137 0.0146 0.0153 0.0159 0.0153 0.0161 0.0155 0.0167* +2.83%
H@20 0.0406 0.0443 0.0564 0.0640 0.0645 0.0667 0.0688 0.0681 0.0685 0.0691 0.0712* +3.04%
N@20 0.0156 0.0179 0.0223 0.0263 0.0261 0.0275 0.0280 0.0275 0.0278 0.0283 0.0291* +2.83%

Analysis:

  • ADARec consistently achieves the highest HR@K and NDCG@K scores across all four datasets (Beauty, Sports, Toys, Yelp) and both KK values (5 and 20). The asterisks indicate statistical significance.
  • The improvement is notable across different types of baselines, including older models (Caser, SASRec), Transformer-based models (BERT4Rec), intent-learning models (IOCRec, ICLRec, ELCRec), and especially other diffusion-based augmentation methods (PDRec, BASRec, GlobalDiff, DiffDiv). This validates ADARec's novel approach over existing diffusion-based augmentation methods that only use the final output.
  • Datasets like Toys show particularly strong relative improvements (e.g., +9.08% for N@5), indicating that ADARec is highly effective where intent reconstruction might be particularly challenging or beneficial.
  • The Yelp dataset shows relatively modest improvements (e.g., +2.33% for H@5). The authors attribute this to Yelp being a very comprehensive dataset, which makes the clear disentanglement of user intents into distinct units more challenging even for intent-based recommendation methods. This suggests that while ADARec excels at reconstructing hierarchy, inherent complexities in dataset structure can still pose challenges.

6.1.2. Performance on Extremely Sparse Sequences

ADARec's strength is particularly evident when dealing with extremely short sequences (user history length $\leq 5$), which are highly prone to data sparsity.

The following are the results from Table 2 of the original paper:

Dataset Metric BASRec (2025) GlobalDiff (2025) DiffDiv (2025) ADARec (Ours) Improv.
Beauty H@5 0.0441 0.0428 0.0455 0.0531* +16.70%
N@5 0.0285 0.0272 0.0295 0.0353* +19.66%
H@20 0.0910 0.0897 0.0925 0.1066* +15.24%
N@20 0.0421 0.0409 0.0427 0.0504* +17.48%
Sports H@5 0.0261 0.0256 0.0265 0.0305* +15.09%
N@5 0.0180 0.0174 0.0179 0.0199* +10.56%
H@20 0.0592 0.0585 0.0590 0.0659* +11.32%
N@20 0.0270 0.0266 0.0269 0.0298* +10.37%
Toys H@5 0.0557 0.0550 0.0560 0.0618* +10.36%
N@5 0.0389 0.0387 0.0392 0.0431* +9.95%
H@20 0.1051 0.1043 0.1050 0.1169* +11.23%
N@20 0.0530 0.0518 0.0522 0.0586* +10.57%
Yelp H@5 0.0194 0.0189 0.0190 0.0200* +3.09%
N@5 0.0121 0.0112 0.0119 0.0124* +2.48%
H@20 0.0528 0.0506 0.0522 0.0559* +5.87%
N@20 0.0217 0.0204 0.0209 0.0225* +3.69%

Analysis:

  • On sparse datasets, especially Beauty and Sports, ADARec achieves relative improvements over the strongest competitor that exceed 15% (e.g., +19.66% for N@5 on Beauty). This is a strong indicator of ADARec's ability to handle data sparsity.
  • The large improvements on sparse data confirm the hypothesis that ADARec's ability to reconstruct coarse-grained intent representations from minimal item histories prevents it from being misled by noise or overfitting to sparse, misleading fine-grained details. This leads to more accurate and generalized recommendations for users with limited data.
  • The Yelp dataset still shows relatively smaller improvements, consistent with the overall results, reinforcing the idea that its inherent complexity makes intent disentanglement harder.

6.1.3. Performance with Different Intent-Aware Prediction Layers

To demonstrate the generalizability of ADARec's augmentation framework, the authors integrate it with several advanced intent-aware models used as prediction layers.

The following are the results from Table 3 of the original paper:

Model Beauty Toys
HR@20 NDCG@20 HR@20 NDCG@20
IOCRec (2023) 0.0916 0.0444 0.0781 0.0330
ADARec w/ IOCRec 0.1053 0.0501 0.1098 0.0515
ICLRec (2022) 0.1072 0.0491 0.1130 0.0550
ADARec w/ ICLRec 0.1165 0.0558 0.1271 0.0628
ELCRec (2024) 0.1079 0.0509 0.1138 0.0560
ADARec w/ ELCRec 0.1187 0.0568 0.1298 0.0646

Analysis:

  • For all three intent-aware models (IOCRec, ICLRec, ELCRec), integrating ADARec's augmentation framework consistently leads to performance improvements in both HR@20 and NDCG@20 on both Beauty and Toys datasets.
  • This demonstrates that ADARec is not just an isolated model but a valuable augmentation framework that can effectively generate richer and more structured hierarchical user representations. These representations, in turn, provide enhanced semantic information that can be leveraged by various downstream intent-based prediction models, improving their performance.

6.2. Ablation Study of Key Components (RQ2)

An ablation study was conducted to understand the individual contribution of each component within ADARec.

The following are the results from Table 4 of the original paper:

ADC HDA HP-MoE C-A Router Lcont Beauty Sports Toys Yelp
HR@20 NDCG@20 HR@20 NDCG@20 HR@20 NDCG@20 HR@20 NDCG@20
0.1187 0.0568 0.0745 0.0341 0.1298 0.0646 0.0712 0.0291
X 0.1141 0.0545 0.0715 0.0328 0.1246 0.0621 0.0690 0.0281
X 0.1105 0.0528 0.0681 0.0310 0.1189 0.0592 0.0672 0.0276
X X X 0.1075 0.0505 0.0645 0.0283 0.1132 0.0557 0.0661 0.0273
X X X 0.1128 0.0539 0.0702 0.0321 0.1225 0.0610 0.0683 0.0279
X 0.1172 0.0561 0.0736 0.0337 0.1281 0.0638 0.0705 0.0288
X 0.1158 0.0553 0.0725 0.0332 0.1260 0.0627 0.0697 0.0284

Analysis:

  • Removing HDA (row 3, HDA=X): This leads to the most severe performance degradation across all datasets. For example, on Beauty, HR@20 drops from 0.1187 to 0.1105. This highlights the foundational role of HDA in generating the intent hierarchy, which is the core of ADARec. Without it, the model loses its ability to reconstruct the multi-grained intents.

  • Removing ADC (row 2, ADC=X): This also causes significant performance drops (e.g., Beauty HR@20 from 0.1187 to 0.1141). This validates the necessity of dynamically allocating augmentation resources. A fixed diffusion depth is less effective than an adaptively chosen one.

  • Removing HP-MoE (row 4, HP-MoE=X): This variant uses ADC and HDA but no HP-MoE for parsing (it implies a simpler aggregation strategy or direct use of a single representation, or ELCRec itself without the MoE part). The performance drops substantially (e.g., Beauty HR@20 to 0.1075). This variant is noted to still outperform strong baselines like ELCRec, demonstrating the inherent strength of the ADC and HDA modules. However, the drop from the full ADARec confirms the necessity of the specialized HP-MoE architecture for effective hierarchical parsing.

  • HP-MoE variant (row 4, ADC=✓, HDA=✓, HP-MoE=X): This configuration represents using ADC and HDA to generate the hierarchy, but then lacks the HP-MoE for parsing. The performance (e.g., Beauty HR@20 0.1075) is comparable to ELCRec (0.1079 from Table 1). This critically shows that the HP-MoE's effectiveness is contingent on the structured hierarchical input provided by HDA. Without the specialized MoE to process it, the benefits of HDA are largely lost.

  • Removing the Content-Aware Router (row 6): This results in smaller but consistent performance drops (e.g., Beauty HR@20 from 0.1187 to 0.1172), confirming its role in routing and weighting the expert outputs according to content and noise level.

  • Removing the Contrastive Loss $\mathcal{L}_{cont}$ (row 7): This also leads to smaller but consistent performance drops (e.g., Beauty HR@20 from 0.1187 to 0.1158). This validates the importance of $\mathcal{L}_{cont}$ in enforcing functional decoupling between the fine-grained and coarse-grained experts, ensuring they learn distinct yet complementary aspects of user intent (a rough sketch of such a contrastive term follows this list).

    The ablation study confirms that all components of ADARec are indispensable and contribute positively to its overall superior performance.
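
The exact form of the asymmetric contrastive objective is not reproduced in this analysis. As a rough illustration of a term relating the two experts' outputs, the following symmetric InfoNCE-style PyTorch sketch pulls each user's fine- and coarse-grained representations together while pushing different users apart; the function name, temperature, and in-batch negatives are assumptions, and the paper's actual loss is asymmetric.

```python
import torch
import torch.nn.functional as F

def decoupling_contrastive_loss(fine, coarse, temperature: float = 0.2):
    """Illustrative stand-in for L_cont: align each user's fine- and
    coarse-grained expert outputs while separating different users."""
    fine = F.normalize(fine, dim=-1)      # [B, D]
    coarse = F.normalize(coarse, dim=-1)  # [B, D]
    logits = fine @ coarse.t() / temperature          # pairwise similarities
    targets = torch.arange(fine.size(0), device=fine.device)
    return F.cross_entropy(logits, targets)           # match user i with user i
```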

6.3. Efficiency Analysis (RQ3)

The efficiency of ADARec is analyzed by comparing its training time and inference latency with other models.

The following are the results from Table 5 of the original paper:

| Model | Beauty Train (s) | Beauty Infer (ms) | Sports Train (s) | Sports Infer (ms) | Toys Train (s) | Toys Infer (ms) | Yelp Train (s) | Yelp Infer (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SASRec | 2.15 | 3.1 | 6.51 | 9.5 | 4.32 | 6.8 | 5.25 | 7.9 |
| ELCRec | 3.81 | 5.9 | 11.28 | 16.1 | 7.88 | 11.5 | 9.00 | 13.2 |
| DiffDiv | 7.55 | 14.8 | 20.10 | 31.5 | 15.60 | 22.4 | 18.23 | 26.1 |
| GlobalDiff | 8.90 | 21.2 | 23.50 | 45.8 | 18.95 | 33.6 | 21.75 | 38.5 |
| Ours | 6.21 | 10.5 | 17.95 | 24.9 | 13.52 | 18.7 | 15.88 | 19.3 |

Analysis:

  • Training Time: ADARec's training time (e.g., 6.21s on Beauty) is higher than that of non-diffusion baselines such as SASRec (2.15s) and ELCRec (3.81s). This is expected given the diffusion process and the additional modules (ADC, HP-MoE). The authors argue that this is a justifiable one-time offline cost that is outweighed by the significant improvements in recommendation quality, especially on sparse data.
  • Inference Latency: Similarly, ADARec's inference latency (e.g., 10.5ms on Beauty) is higher than that of SASRec (3.1ms) and ELCRec (5.9ms), but lower than that of other diffusion-based methods such as DiffDiv (14.8ms) and GlobalDiff (21.2ms). This efficiency gain stems from ADARec's lightweight architecture and, critically, from its Adaptive Depth Controller (ADC), which dynamically determines the often shallow diffusion depth required for each sequence and thus avoids unnecessary denoising steps at inference time.
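
As a simplified picture of where these inference savings come from, here is a minimal sketch of adaptive-depth reverse diffusion for a single sequence, assuming a hypothetical `depth_controller` that maps an embedding to a per-sequence depth and a `denoiser(x, t)` network; the real ADC, noise schedule, and batching are not specified here.

```python
import torch

@torch.no_grad()
def adaptive_denoise(x_noisy, denoiser, depth_controller, t_max: int):
    """Sketch: run only as many reverse-diffusion steps as the controller
    deems necessary for this sequence, instead of always running t_max steps."""
    # Per-sequence depth T_u <= t_max (ADC-style idea), for a single sequence.
    t_u = int(depth_controller(x_noisy).clamp(1, t_max).item())
    trajectory = []                       # intermediate states ~ coarse-to-fine intents
    x = x_noisy
    for t in reversed(range(1, t_u + 1)):
        x = denoiser(x, t)                # one reverse step; scheduler details omitted
        trajectory.append(x)
    return trajectory                     # shallower T_u => fewer denoiser calls
```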

6.4. Hyperparameter Sensitivity (RQ4)

6.4.1. Analysis of $\lambda_{expl}$ and $\lambda_{deno}$

The stability of ADARec is assessed by analyzing its performance sensitivity to the weights of the auxiliary losses, $\lambda_{expl}$ (exploration loss) and $\lambda_{deno}$ (denoising loss).

The following figure (Figure 4 from the original paper) shows the stability analysis for auxiliary loss weights:

Figure 4: Stability analysis for auxiliary loss weights.

Analysis:

  • The plots show HR@20 performance across various values of $\lambda_{expl}$ and $\lambda_{deno}$ on the Beauty, Sports, and Toys datasets.
  • Optimal Points: The model generally performs best when $\lambda_{expl} = 0.01$ and $\lambda_{deno} = 0.1$.
  • Robustness: Crucially, the performance curves are relatively flat around these optimal points. This indicates that ADARec is robust to minor deviations in these hyperparameters. This stability implies that the model is not overly sensitive to the exact tuning of these weights, making it easier to apply in practice.
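
For reference, a hedged sketch of how these weights could enter the overall training objective follows; the exact set of loss terms, their names, and the weight on the contrastive term are assumptions based on this analysis rather than the paper's definition.

```python
# Near-optimal values reported in the sensitivity analysis above.
lambda_expl, lambda_deno = 0.01, 0.1

def total_loss(l_rec, l_deno, l_expl, l_cont, lambda_cont=0.1):
    """Assumed composition: recommendation loss plus weighted auxiliary terms.
    The flat curves around (0.01, 0.1) suggest small perturbations of these
    weights should not move the optimum much."""
    return l_rec + lambda_deno * l_deno + lambda_expl * l_expl + lambda_cont * l_cont
```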

6.4.2. Analysis of Number of Experts in HP-MoE

The paper investigates the impact of increasing the number of experts within the Hierarchical Parsing Mixture-of-Experts (HP-MoE) module on performance and training time.

The following figure (Figure 5 from the original paper) shows the analysis of performance (HR@20) and training time cost vs. number of experts:

Figure 5: Analysis of performance (HR@20) and training time cost vs. number of experts.

Analysis:

  • Marginal Gains vs. Cost: Increasing the number of experts in HP-MoE beyond two yields only marginal gains in HR@20; for example, moving from two experts to three or four produces minimal improvement.
  • Significant Cost Increase: However, increasing the number of experts significantly increases the training time. This is evident from the lower part of the plots, where training time rises notably with more experts.
  • Justification for Two Experts: This result strongly supports ADARec's design choice of employing only two experts (fine-grained and coarse-grained). The authors explain that the Adaptive Depth Controller (ADC) typically selects a relatively small actual diffusion depth ($T_u$) for sparse sequences, which inherently limits the number of intent levels that can be meaningfully reconstructed. Adding more experts therefore cannot uncover additional independent intent levels from this limited input, leading to redundancy and extra computational overhead without tangible benefit. The two-expert configuration provides the best balance between efficiency and effectiveness.
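
To illustrate the two-expert design discussed above, here is a small PyTorch sketch of a fine/coarse expert pair with a content-aware gate that also conditions on the noise level (i.e., which diffusion step produced the input). The layer sizes, MLP experts, and learned noise-level embedding are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TwoExpertParser(nn.Module):
    """Sketch of an HP-MoE-style parser: one fine-grained and one
    coarse-grained expert, mixed by a gate that sees both the input
    representation and its noise level."""
    def __init__(self, dim: int, n_levels: int):
        super().__init__()
        self.fine_expert = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.coarse_expert = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.noise_emb = nn.Embedding(n_levels, dim)   # embeds the diffusion step index
        self.gate = nn.Linear(2 * dim, 2)              # content + noise level -> 2 weights

    def forward(self, h, level):
        # h: [B, D] representation from the denoising trajectory; level: [B] step index.
        g = torch.softmax(self.gate(torch.cat([h, self.noise_emb(level)], dim=-1)), dim=-1)
        experts = torch.stack([self.fine_expert(h), self.coarse_expert(h)], dim=-1)  # [B, D, 2]
        return (experts * g.unsqueeze(1)).sum(dim=-1)  # gate-weighted combination
```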

6.5. Visualization of Intent Representations (RQ5)

To qualitatively assess ADARec's ability to construct hierarchical user intents from sparse data, t-SNE visualization is used to project the learned user intent representations into a 2D space.

The following figure (Figure 6 from the original paper) shows t-SNE visualization of the learned user intent representations on the Yelp dataset:

Figure 6: t-SNE visualization of the learned user intent representations on the Yelp dataset. (a) A strong baseline like DiffDiv struggles to form distinct clusters. (b) Our method, ADARec, successfully learns well-separated intent clusters.

Analysis:

  • Baseline (DiffDiv) - Figure 6(a): The t-SNE visualization for DiffDiv, a strong diffusion-based baseline, shows that most user intents collapse into an indistinguishable central mass. The clusters are not clearly separated, indicating that the model struggles to differentiate distinct user intents from sparse data. This visually confirms the paper's argument that existing data augmentation methods, by failing to reconstruct the intent hierarchy, do not effectively capture the true underlying intents.
  • ADARec - Figure 6(b): In contrast, ADARec's visualization reveals clear, well-separated intent clusters. This visual evidence strongly suggests that ADARec successfully identifies and reconstructs distinct user intent hierarchies even from highly sparse data (sequence length $\leq 5$ on the Yelp dataset). This capability is a direct result of its hierarchical diffusion augmentation and specialized parsing, allowing it to effectively de-collapse user intents.
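
For readers who want to produce this kind of plot for their own model, a minimal sketch follows, using synthetic stand-in vectors in place of the learned intent representations (which are not available here); scikit-learn and matplotlib are assumed to be installed.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Synthetic stand-in data: replace with the model's learned user intent vectors.
rng = np.random.default_rng(0)
intents = np.concatenate([rng.normal(loc=i, scale=0.3, size=(200, 64)) for i in range(4)])
labels = np.repeat(np.arange(4), 200)   # e.g., dominant intent cluster per user

coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(intents)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
plt.title("User intent representations (t-SNE)")
plt.show()
```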

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces ADARec, an innovative framework designed to tackle the critical problem of data sparsity in sequential recommendation by reconstructing hierarchical user intents. The core of ADARec lies in its novel utilization of the step-wise denoising trajectory of a diffusion model to generate a rich intent hierarchy, moving from coarse-grained to fine-grained intents. This approach departs from traditional diffusion-based augmentation methods that treat the diffusion model as a black-box generator for single data points.

To enhance both efficiency and effectiveness, ADARec incorporates an Adaptive Depth Controller (ADC) that dynamically determines the optimal augmentation depth for each user sequence and a Hierarchical Parsing Mixture-of-Experts (HP-MoE) module. The HP-MoE intelligently decouples and processes fine-grained and coarse-grained intents through specialized experts, leveraging an asymmetric contrastive objective for robust representation learning.

Extensive experiments on real-world datasets demonstrate ADARec's superior performance, outperforming state-of-the-art baselines by 2-9% on standard benchmarks and achieving even more significant gains (3-17%) on sparse sequences. Visualizations confirm ADARec's ability to learn well-separated intent clusters, providing strong evidence for its effectiveness in de-collapsing user intents from sparse data.

7.2. Limitations & Future Work

The paper does not explicitly dedicate a section to limitations and future work. However, based on the context and experimental results, several points can be inferred:

Inferred Limitations:

  • Computational Cost: While ADARec is more efficient than other diffusion-based methods, its training time and inference latency are still higher than non-diffusion baselines (SASRec, ELCRec). This might limit its applicability in extremely latency-sensitive or resource-constrained environments, despite being a "one-time offline cost."
  • Complexity on Diverse Datasets: The relatively modest improvements on the Yelp dataset suggest that for highly diverse and complex user intent landscapes, even ADARec can face challenges in fully disentangling and representing user intents effectively. The inherent ambiguity in intent on such platforms might be difficult to fully resolve.
  • Reliance on Backbone Encoder: ADARec uses ELCRec as its backbone encoder. The quality of ADARec's initial sequence embeddings is thus dependent on the performance and inductive biases of this underlying model. If the backbone struggles with certain types of sequences or biases, it might propagate to ADARec.
  • Predefined Max Depth: The maximum diffusion depth ($T_{max}$) is a predefined hyperparameter. While ADC adaptively chooses a depth up to $T_{max}$, an inappropriate $T_{max}$ (too small or too large) could limit the expressiveness of the intent hierarchy or introduce unnecessary computation.

Potential Future Work (Inferred):

  • Adaptive $T_{max}$: Instead of a fixed $T_{max}$, future work could explore dynamic determination of the maximum plausible diffusion depth based on dataset characteristics or user behavior patterns.
  • More Sophisticated Denoising Networks: The current denoising network is a simple MLP. Investigating more powerful generative architectures (e.g., Transformers or more complex UNet-like structures) for the HDA module might further improve the quality of the intent hierarchy reconstruction.
  • Cross-Domain Generalization: Exploring how ADARec's principles of hierarchical intent reconstruction can be adapted for cross-domain recommendation or cold-start scenarios where data sparsity is even more severe.
  • Explainability: Given that ADARec aims to de-collapse user intents, future work could focus on making these hierarchical intent representations more interpretable, allowing for a deeper understanding of why specific recommendations are made.
  • Real-time Adaptation: While ADC provides adaptive depth, further research could explore more real-time adaptive mechanisms for the diffusion process itself, potentially adjusting noise schedules or denoising steps based on real-time user context or dynamic environmental changes.

7.3. Personal Insights & Critique

This paper presents a truly innovative approach to tackling data sparsity in sequential recommendation, moving beyond simple augmentation to a deeper understanding of user behavior. The core idea of leveraging the denoising trajectory of a diffusion model to uncover hierarchical user intents is highly insightful and opens up new avenues for research in generative models for recommender systems.

Inspirations & Transferability:

  • The concept of using the intermediate states of a generative process to capture different levels of abstraction is highly transferable. This could be applied to other domains where sparse data or collapsed representations are an issue, such as natural language processing (e.g., understanding hierarchical topics from short text snippets), image recognition (e.g., fine-grained vs. coarse-grained image features), or medical diagnosis (e.g., inferring broader health conditions from sparse symptom data).
  • The Adaptive Depth Controller (ADC) is a practical innovation that balances effectiveness with efficiency. This adaptive computation paradigm could be useful in other deep learning architectures where certain layers or modules might be computationally expensive but not always necessary for every input.
  • The Mixture-of-Experts (MoE) design, coupled with content-aware routing and an asymmetric contrastive loss, provides a powerful blueprint for handling multi-granularity information. This could inspire new ways to design MoE systems for tasks requiring multi-level feature extraction or multi-task learning.

Potential Issues & Areas for Improvement:

  • Interpretability of Latent Intents: While the t-SNE visualization shows distinct clusters, the semantic meaning of each coarse-grained or fine-grained intent learned by the model remains somewhat opaque. Further work into interpretable AI could try to align these learned intent clusters with human-understandable categories.

  • Cold-Start Problem: While ADARec excels at sparse sequences, the ultimate cold-start problem (users with no history) is a different beast. It would be interesting to see how ADARec could be extended or integrated with meta-learning or transfer learning techniques to generate initial intent hierarchies for completely new users.

  • Scalability for Extremely Large Item Sets: The prediction layer still computes scores for "all candidate items." For systems with millions or billions of items, this can be computationally intensive. Investigating efficient sampling strategies or hierarchical softmax approaches in conjunction with ADARec would be valuable.

  • Robustness to Noisy Interactions: Real-world interaction sequences can contain noise (e.g., accidental clicks, irrelevant browsing). While diffusion models are robust to noise to some extent, the impact of significant noise in the input sequence itself on the quality of the reconstructed intent hierarchy could be further investigated.

    Overall, ADARec makes a significant contribution by addressing a fundamental challenge in sequential recommendation with a theoretically sound and empirically effective solution. Its novel integration of diffusion models for hierarchical intent reconstruction sets a new standard for data augmentation in this field.
