Paper status: completed

Towards A Tri-View Diffusion Framework for Recommendation

Published:11/25/2025
Original LinkPDF
Price: 0.100000
Price: 0.100000
3 readers
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces a tri-view diffusion framework for recommendations, integrating thermodynamic insights and maximizing Helmholtz free energy. It enhances optimization using a denoiser and Acceptance-Rejection Gumbel Sampling, significantly improving recommendation system acc

Abstract

Diffusion models (DMs) have recently gained significant interest for their exceptional potential in recommendation tasks. This stems primarily from their prominent capability in distilling, modeling, and generating comprehensive user preferences. However, previous work fails to examine DMs in recommendation tasks through a rigorous lens. In this paper, we first experimentally investigate the completeness of recommender models from a thermodynamic view. We reveal that existing DM-based recommender models operate by maximizing the energy, while classic recommender models operate by reducing the entropy. Based on this finding, we propose a minimalistic diffusion framework that incorporates both factors via the maximization of Helmholtz free energy. Meanwhile, to foster the optimization, our reverse process is armed with a well-designed denoiser to maintain the inherent anisotropy, which measures the user-item cross-correlation in the context of bipartite graphs. Finally, we adopt an Acceptance-Rejection Gumbel Sampling Process (AR-GSP) to prioritize the far-outnumbered unobserved interactions for model robustness. AR-GSP integrates an acceptance-rejection sampling to ensure high-quality hard negative samples for general recommendation tasks, and a timestep-dependent Gumbel Softmax to handle an adaptive sampling strategy for diffusion models. Theoretical analyses and extensive experiments demonstrate that our proposed framework has distinct superiority over baselines in terms of accuracy and efficiency.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Towards A Tri-View Diffusion Framework for Recommendation

1.2. Authors

  • Ximing Chen, University of Macau, Macau SAR, China

  • Pui Ieng Lei, University of Macau, Macau SAR, China

  • Yijun Sheng, University of Macau, Macau SAR, China

  • Yanyan Liu, University of Macau, Macau SAR, China

  • Zhiguo Gong*, University of Macau, Macau SAR, China

    The authors are primarily affiliated with the University of Macau, suggesting a research focus within an academic institution. Zhiguo Gong is marked with an asterisk, typically indicating a corresponding author or principal investigator. Their research appears to be in the field of recommender systems and machine learning, particularly with a focus on diffusion models and graph-based approaches.

1.3. Journal/Conference

The paper is published at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '26). KDD is one of the premier conferences in data mining and knowledge discovery, known for publishing high-impact research in areas like machine learning, data science, and recommender systems. Publication in KDD signifies that the work has undergone a rigorous peer-review process and is recognized as a significant contribution to the field.

1.4. Publication Year

2025

1.5. Abstract

Diffusion models (DMs) have garnered significant attention for their capabilities in recommendation tasks, particularly in distilling, modeling, and generating user preferences. However, prior research has not rigorously examined DMs in this context. This paper first experimentally investigates the completeness of recommender models from a thermodynamic perspective, revealing that existing DM-based models maximize energy, while classic models minimize entropy. Based on this, the authors propose a minimalistic diffusion framework called TV-Diff that incorporates both factors by maximizing Helmholtz free energy. To optimize this, the reverse diffusion process is enhanced with an Anisotropic Denoiser that maintains user-item cross-correlation (anisotropy) in bipartite graphs. Furthermore, an Acceptance-Rejection Gumbel Sampling Process (AR-GSP) is introduced to prioritize high-quality hard negative samples for robust model training, adapting to the varying noise levels in diffusion models. Theoretical analyses and extensive experiments demonstrate TV-Diff's superior accuracy and efficiency compared to existing baselines.

  • Original Source Link: https://arxiv.org/abs/2511.20122v1
  • PDF Link: https://arxiv.org/pdf/2511.20122v1.pdf
  • Publication Status: This paper is currently available as a preprint on arXiv (version 1, published 2025-11-25T09:43:00.000Z) and is indicated to be published at KDD '26.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the current limitations and lack of rigorous examination of Diffusion Models (DMs) when applied to recommendation tasks. While DMs have shown exceptional potential in capturing and generating user preferences, existing DM-based recommender models are often adopted from other fields (like image generation) without fully considering the unique characteristics and objectives pertinent to recommendation systems.

This problem is important because recommendation systems are crucial in combating information overload, yet they face significant challenges such as data sparsity (users interact with only a tiny fraction of available items) and diversity (recommending a wide range of relevant items). Existing DM-based models, for instance, might use loss functions like Mean Squared Error (MSE), which are good for mitigating sparsity but suboptimal for ranking implicit feedback, or they might struggle with effectively integrating negative samples, which are vital for personalization. The paper highlights a fundamental disconnect: current DM-based recommenders often focus on maximizing "energy" (reconstructing interactions), while classic recommenders prioritize minimizing "entropy" (reducing uncertainty in item distribution), leading to an incomplete optimization strategy.

The paper's entry point is a rigorous experimental investigation into the "completeness" of recommender models from a thermodynamic view. By observing that DM-based models maximize energy and classic models minimize entropy, the authors identify a critical gap: no single framework effectively combines both. This leads to the innovative idea of integrating these two complementary objectives within a diffusion framework, guided by the concept of Helmholtz free energy, and enhancing it with mechanisms that address specific challenges in recommendation, such as topological information and effective negative sampling.

2.2. Main Contributions / Findings

The paper makes several primary contributions and key findings:

  • A Novel Tri-View Diffusion Framework (TV-Diff): The paper proposes TV-Diff, a minimalistic yet comprehensive diffusion recommendation framework that integrates three complementary views:

    • Thermodynamic View: Maximizes Helmholtz Free Energy by combining both energy-based (reconstruction) and entropy-based (preference refinement) objectives, offering a more complete optimization strategy than previous models.
    • Topological View: Incorporates an Anisotropic Denoiser in the reverse diffusion process to explicitly maintain and leverage the inherent anisotropic signals (user-item cross-correlation and distinct user/item degree distributions) within bipartite graphs, which are often impaired by standard diffusion processes.
    • Hard-Negative View: Introduces an Acceptance-Rejection Gumbel Sampling Process (AR-GSP) for robust training. This method ensures high-quality hard negative samples by first discarding true negatives via acceptance-rejection sampling and then adaptively managing the sampling strategy with a timestep-dependent Gumbel Softmax, particularly crucial for diffusion models with varying noise levels.
  • Rigorous Experimental and Theoretical Analysis: The authors provide both experimental and theoretical analyses that:

    • Demonstrate the "completeness" (or lack thereof) of existing recommender models from a thermodynamic perspective.
    • Prove the equivalence of BPR and entropy in optimization.
    • Elucidate the mechanism by which topological information (e.g., graph structure) influences entropy reduction.
    • Analyze the bounds and challenges of traditional hard negative sampling approaches in the context of BPR loss.
  • Superior Performance and Efficiency: Extensive experiments across five real-world datasets show that TV-Diff consistently outperforms various categories of baselines (including classic, autoencoder-based, graph-based, negative sampling, and other diffusion-based models) in terms of accuracy (Recall@K and NDCG@K) and efficiency. The framework's ability to overcome inherent limitations of existing DM-based models, such as their ineffectiveness in capturing anisotropy, is highlighted.

    These findings solve the specific problems of incomplete optimization objectives in DM-based recommenders, the impairment of crucial topological information by isotropic noise, and the challenge of effective negative sampling in the dynamic, noisy environment of diffusion processes.

3. Prerequisite Knowledge & Related Work

This section provides the foundational knowledge and context necessary to understand the TV-Diff framework.

3.1. Foundational Concepts

3.1.1. Recommendation Task

At its core, a recommendation task aims to predict which items a user might be interested in. Given a set of users U\mathcal{U} and a set of items I\mathcal{I}, the system observes a history of user-item interactions. These interactions can be explicit (e.g., star ratings) or implicit (e.g., clicks, purchases, views). This paper focuses on implicit feedback, where 1 indicates an interaction and 0 indicates no observed interaction. The goal is often Top-K recommendations, meaning for a given user, the system predicts a ranked list of KK items they are most likely to interact with. The observed interactions can be represented as a user-item interaction matrix R{0,1}m×n\pmb{R} \in \{0, 1\}^{m \times n}, where mm is the number of users and nn is the number of items. ru,i=1r_{u,i} = 1 if user uu interacted with item ii, and 0 otherwise. N(u) denotes the set of items interacted with by user uu.

3.1.2. Diffusion Models (DMs)

Diffusion Models (DMs) are a class of generative models that learn to reverse a gradual noisy process. They consist of two main parts:

  • Forward Diffusion Process: This process gradually adds Gaussian noise to an input data point x0\pmb{x}_0 over a series of TT timesteps, transforming it into a noisy latent variable xT\pmb{x}_T that resembles pure noise. Each step adds a small amount of noise, typically controlled by a variance schedule βt\beta_t. The forward process is often defined as a Markov chain: $ q(\pmb{x}t | \pmb{x}{t-1}) = \mathcal{N}(\pmb{x}t; \sqrt{1 - \beta_t}\pmb{x}{t-1}, \beta_t \mathbf{I}) $ A convenient property allows sampling xt\pmb{x}_t at any timestep tt directly from x0\pmb{x}_0: $ q(\pmb{x}_t | \pmb{x}_0) = \mathcal{N}(\pmb{x}_t; \sqrt{\bar{\alpha}_t}\pmb{x}_0, (1 - \bar{\alpha}_t) \mathbf{I}) $ where αt=1βt\alpha_t = 1 - \beta_t and αˉt=s=1tαs\bar{\alpha}_t = \prod_{s=1}^t \alpha_s. In the context of recommendation, x0\pmb{x}_0 represents a user's complete interaction record (e.g., a vector of 0s and 1s indicating interactions with all items). The paper specifically defines it as: $ \boldsymbol{q}(\boldsymbol{x}t | \boldsymbol{x}0) = \boldsymbol{N}(\boldsymbol{x}t; \prod{t'=1}^t \sqrt{(1 - \beta{t'})} \boldsymbol{x}0, (1 - \prod{t'=1}^t (1 - \beta{t'})) \mathbf{I}) $ Here, N\boldsymbol{N} denotes a normal distribution. This equation shows that xt\pmb{x}_t is sampled from a Gaussian distribution whose mean is scaled by t=1t(1βt)\prod_{t'=1}^t \sqrt{(1 - \beta_{t'})} and whose variance is (1t=1t(1βt))I(1 - \prod_{t'=1}^t (1 - \beta_{t'})) \mathbf{I}. This effectively means that as tt increases, the mean approaches zero (if βt\beta_{t'} are small and positive) and the variance approaches I\mathbf{I}, leading to pure Gaussian noise for xT\pmb{x}_T.

  • Reverse Denoising Process: This is the learnable part of the DM. A neural network is trained to reverse the forward process, i.e., to predict xt1\pmb{x}_{t-1} from xt\pmb{x}_t. This is done by learning the conditional probability pθ(xt1xt)p_{\theta}(\pmb{x}_{t-1} | \pmb{x}_t). The denoising distribution is typically modeled as a Gaussian: $ p_{\theta}(\boldsymbol{x}{t-1} | \boldsymbol{x}t) = \mathcal{N}(\boldsymbol{x}{t-1}; \boldsymbol{\mu}{\theta}(\boldsymbol{x}t, t), \boldsymbol{\Sigma}{\theta}(\boldsymbol{x}_t, t)) $ where μθ(xt,t)\boldsymbol{\mu}_{\theta}(\boldsymbol{x}_t, t) and Σθ(xt,t)\boldsymbol{\Sigma}_{\theta}(\boldsymbol{x}_t, t) are the mean and covariance predicted by a neural network with parameters θ\theta. The goal during training is to optimize these parameters so that the model can effectively remove noise at each step, eventually transforming xT\pmb{x}_T (pure noise) back into a clean data sample x0\pmb{x}_0.

  • Optimization: DMs are typically optimized using a variational lower bound on the data likelihood, similar to Variational Autoencoders (VAEs). This involves minimizing the Kullback-Leibler (KL) divergence between the true reverse process and the learned reverse process. The Evidence Lower Bound (ELBO) for DMs is given by: $ \log p(\boldsymbol{x}_0) \geq \mathbb{E}_q \left[ \log p(\boldsymbol{x}_0 | \boldsymbol{x}_1) - KL(q(\boldsymbol{x}_T | \boldsymbol{x}0) || p(\boldsymbol{x}T)) - \sum{t=2}^T KL(q(\boldsymbol{x}{t-1} | \boldsymbol{x}_t, \boldsymbol{x}0) || p(\boldsymbol{x}{t-1} | \boldsymbol{x}_t)) \right] $ Here, p(x0x1)p(\boldsymbol{x}_0 | \boldsymbol{x}_1) is the likelihood of the data given the first denoised step, KL(q(xTx0)p(xT))KL(q(\boldsymbol{x}_T | \boldsymbol{x}_0) || p(\boldsymbol{x}_T)) is the KL divergence between the forward process at TT and a prior distribution (often a standard Gaussian), and the sum term represents the KL divergence between the true posterior and the learned reverse steps. The term KL(q(xTx0)p(xT))KL(q(\boldsymbol{x}_T | \boldsymbol{x}_0) || p(\boldsymbol{x}_T)) is often approximated to 0 if the forward process is well-defined to reach a standard Gaussian.

3.1.3. Thermodynamics Concepts

The paper draws inspiration from thermodynamics to analyze recommender models.

  • Energy (UU): In physics, internal energy represents the total energy contained within a thermodynamic system. In the paper's context, energy is adapted to measure the "completeness" or "number of reconstructed interactions" after training. A higher energy implies better reconstruction of observed interactions and perhaps the prediction of new ones. The paper defines two forms:

    • Explicit Energy: Proportional to the count of reconstructed interactions that meet or exceed a certain threshold, similar to counting correctly predicted positive items.
    • Implicit Energy: A weighted measure for interactions that marginally fail a strict threshold, acknowledging partial reconstruction. The paper uses a comprehensive definition: $ U(\hat{R}') = \sum_u \sum_i \mathbb{I}(\hat{r}'{u,i} \geq r{u,i}) + \mathbb{I}(\hat{r}'{u,i} < r{u,i}) \frac{\hat{r}'{u,i}}{r{u,i}} $ where I()\mathbb{I}(\cdot) is the indicator function (1 if true, 0 if false), r^u,i\hat{r}'_{u,i} is the reconstructed interaction probability, and ru,ir_{u,i} is the original interaction (0 or 1). This definition counts fully reconstructed interactions (first term) and partially reconstructed ones (second term, weighted by ratio).
  • Entropy (SS): In physics, entropy is a measure of disorder, randomness, or uncertainty in a system. In information theory (Shannon Entropy), it quantifies the average amount of information produced by a stochastic source, or the uncertainty of a random variable. In the paper's context, entropy qualifies the "uncertainty of reconstructed items' distribution." Minimizing entropy suggests that the model is becoming more certain and specific about user preferences, leading to a more focused and personalized item distribution. The paper uses the Shannon Entropy definition: $ S(\hat{R}') = \sum_u - \hat{r}'_u \log \hat{r}_u^{\top} $ where r^u\hat{r}'_u is the probability distribution of items for user uu. A lower entropy means a more peaked distribution (higher certainty about specific items), while higher entropy means a flatter, more uniform distribution (less certainty).

  • Helmholtz Free Energy (HH): In thermodynamics, Helmholtz free energy (often denoted AA or FF) is a thermodynamic potential that measures the "useful" work obtainable from a closed thermodynamic system at a constant temperature and volume. It is defined as H=UtSH = U - tS, where UU is internal energy, tt is temperature, and SS is entropy. A system at constant temperature and volume tends to reach equilibrium by minimizing its Helmholtz free energy. The paper adapts this to Helmholtz Free Energy Maximization: $ \mathcal{L}_H(\hat{R}') \equiv \mathcal{L}_U(\hat{R}') - t \cdot \mathcal{L}_S(\hat{R}') = U(\hat{R}') - t \cdot S(\hat{R}') $ The goal is to maximize this quantity. This means simultaneously maximizing energy UU (better reconstruction, more identified relevant items) and minimizing entropy SS (more certain, less uncertain predictions). The temperature tt acts as a coefficient to balance the priority between these two objectives.

3.1.4. Bipartite Graphs

A bipartite graph is a graph whose vertices can be divided into two disjoint and independent sets, such that every edge connects a vertex in one set to one in the other set. In recommender systems, user-item interaction networks naturally form bipartite graphs, where one set of vertices represents users (U\mathcal{U}) and the other represents items (I\mathcal{I}). An edge exists between a user uu and an item ii if uu has interacted with ii.

  • Adjacency Matrix: The interactions are represented by an adjacency matrix R\pmb{R} (or A\pmb{A}) where Rui=1R_{ui}=1 if an edge exists, and 0 otherwise.
  • Degree Matrix: For a user uu, its degree dud_u is the number of items they interacted with. For an item ii, its degree did_i is the number of users who interacted with it. The degree matrix D\pmb{D} is a diagonal matrix where diagonal entries are the degrees of nodes.
  • Normalized Bipartite Matrix: The symmetric normalized bipartite matrix Rˉ\bar{\pmb{R}} is often used in graph neural networks. It is defined as Rˉ=DU12RDI12\bar{\pmb{R}} = \pmb{D}_{\mathcal{U}}^{-\frac{1}{2}} \pmb{R} \pmb{D}_{\mathcal{I}}^{-\frac{1}{2}}, where DU\pmb{D}_{\mathcal{U}} and DI\pmb{D}_{\mathcal{I}} are diagonal matrices of user and item degrees, respectively. This normalization helps to smooth out the degrees and prevent nodes with high degrees from dominating the message passing. The paper also defines a full adjacency matrix for the combined user-item graph as Aˉ=(ORˉRˉO)\bar{\pmb{A}} = \begin{pmatrix} \mathbb{O} & \bar{\pmb{R}} \\ \bar{\pmb{R}}^{\top} & \mathbb{O} \end{pmatrix}.
  • Anisotropy: In the context of bipartite graphs, anisotropy refers to the non-uniformity or directional dependence of properties. Specifically, it highlights the potential inconsistency between user degree distribution and item degree distribution, meaning users might interact with items at different rates than items are interacted with by users. This user-item cross-correlation is crucial for understanding preference patterns. Isotropic noise (uniform noise) can impair these specific directional signals.

3.1.5. Negative Sampling

Negative sampling is a technique used in many machine learning models, especially for implicit feedback recommendation, to train models by contrasting observed (positive) interactions with unobserved (negative) interactions. Since there are vastly more unobserved interactions than observed ones, it's infeasible to consider all of them.

  • Random Negative Sampling (RNS): The simplest approach, where negative items are randomly sampled from the set of unobserved items. Often yields uninformative negatives (items that are obviously not relevant), making it easy for the model to distinguish positives from negatives, but providing less signal for fine-tuning preferences.
  • Hard Negative Sampling (Hard NS): Aims to sample informative negative items that are difficult for the model to distinguish from positive items. These "hard negatives" are often items that the model would incorrectly predict as positive (i.e., items with high predicted scores but no actual interaction). Training with hard negatives forces the model to learn finer distinctions, leading to better generalization and stronger decision boundaries.
  • Acceptance-Rejection Sampling: A method for sampling from a complex probability distribution by using a simpler proposal distribution. Samples from the proposal distribution are accepted or rejected based on a certain probability, effectively shaping the final sample distribution to match the target.

3.1.6. Gumbel Softmax

The Gumbel Softmax (also known as Gumbel-Sigmoid for binary cases) is a continuous, differentiable approximation of discrete sampling (e.g., from a categorical distribution). It is commonly used in neural networks to allow gradient-based optimization through discrete choices. It introduces Gumbel noise to the logits of a categorical distribution before applying the softmax function, making the sampling process differentiable. The probability distribution for sampling from categories jj for a given input z\pmb{z} and temperature τ\tau is: $ \hat{y}_j = \frac{\exp((\log(\pi_j) + g_j) / \tau)}{\sum_k \exp((\log(\pi_k) + g_k) / \tau)} $ where πj\pi_j are the class probabilities (or logits), gjg_j are i.i.d. samples drawn from a Gumbel(0, 1) distribution. A lower temperature τ\tau makes the samples closer to one-hot (discrete), while a higher τ\tau makes them smoother (more uniform). The reparameterization trick is used to allow gradients to flow through the sampling process.

3.2. Previous Works

The paper contextualizes its work by reviewing several categories of existing recommender models:

  • Diffusion-based Recommender Models:

    • Early approaches (e.g., CODIGEM [54], DiffRec [55]): These models adapted DMs from other domains (like computer vision) to recommendation. While showing potential in noise smoothing and preference generation, they often blindly adopted original objectives (e.g., MSE) without task-specific rigorous examination. A key limitation highlighted is their tendency to overlook topological information inherent in user-item bipartite graphs.
    • Graph-aware DMs (e.g., BSPM [5], GiffCF [71], SDiff [59]): These models started integrating graph structures by simulating message-passing within the diffusion process.
    • Two-stage training strategies (e.g., HDRM [66], Zhao et al. [70]): These methods involve pre-training to learn graph structures and then fine-tuning with diffusion. While improving representation quality, they incur substantially higher computational costs.
    • Limitations noted by the paper: Existing DM-based models are critiqued for not rigorously examining DMs in recommendation contexts, leading to issues like suboptimal ranking for implicit feedback and failure to converge in general recommendation tasks due to ill-suited objectives.
  • Classic Recommender Models:

    • Matrix Factorization (MF) based (e.g., BPR-MF [42], NeuMF [15]): These models learn user and item latent factors. BPR (Bayesian Personalized Ranking) focuses on optimizing a pairwise ranking loss, aiming to rank positive items higher than negative items for a given user. This implicitly focuses on minimizing entropy (reducing uncertainty in ranking).
    • Autoencoder-based (e.g., CDAE [57], MultiVAE [34]): These models learn compressed representations of user preferences and reconstruct incomplete interaction data. They are good at mitigating data sparsity by leveraging joint distributions.
  • Graph-based Recommender Models (GNNs):

    • LightGCN [14], ChebyCF [24], LinkProp [8], SGCL [69]: These models leverage Graph Neural Networks (GNNs) and message-passing mechanisms to capture collaborative signals by propagating information across the user-item bipartite graph. They are effective in exploiting local and global connectivity patterns (e.g., anisotropy). The paper specifically notes that LightGCN (a graph-based model) exhibits stronger capability on entropy-reduction than BPR due to neighbors' information. However, multi-layer message-passing can lead to over-smoothing and increased entropy.
  • Negative Sampling Methods:

    • MixGCF [21], AHNS [29]: These models specifically focus on identifying informative negative items from the vast pool of unobserved interactions. Theoretical work often suggests sub-linear correlation between hard negative and positive distributions. The paper argues that for BPR and BCE, such correlation can compromise optimization.
  • Contrastive Learning Models (e.g., LightGCL [2], SGCL [69]): These approaches learn robust representations by maximizing agreement between different augmented views of a graph, typically leveraging batch-wise negative sampling.

3.3. Technological Evolution

The field of recommender systems has evolved from simpler matrix factorization techniques to deep learning models, then to sophisticated graph neural networks, and more recently, generative models like Diffusion Models.

  1. Early Methods (e.g., Matrix Factorization): Focused on learning latent factors for users and items to reconstruct the interaction matrix. While effective, they often struggled with sparsity and cold-start problems.
  2. Autoencoders: Applied to recommendation to learn compact user/item representations and reconstruct interactions, offering improvements in handling sparsity.
  3. Graph-based Models (GNNs): A significant leap, leveraging the inherent graph structure of user-item interactions. GNNs enable message-passing to aggregate information from neighbors, capturing higher-order collaborative signals and enriching representations. However, they can suffer from over-smoothing with many layers and might have high computational costs on large graphs.
  4. Generative Models (e.g., VAEs, GANs, and DMs): These models learn to generate new data samples that resemble the training data.
    • VAEs were applied to learn user preferences and generate interaction vectors.

    • More recently, Diffusion Models have emerged, offering powerful capabilities in capturing complex data distributions through iterative denoising. Initial DM-based recommenders adapted existing DM architectures but often lacked domain-specific optimizations.

      This paper's work fits into the latest wave, aiming to make Diffusion Models truly effective and principled for recommendation tasks by addressing their inherent limitations through a multi-faceted view, pushing the boundaries of generative models in this domain.

3.4. Differentiation Analysis

TV-Diff distinguishes itself from previous works in several key ways:

  • Principled Optimization Objective:

    • Unlike existing DM-based recommenders that often blindly adopt objectives (e.g., MSE) from other fields (which primarily focus on energy maximization), TV-Diff proposes a novel Helmholtz Free Energy Maximization objective. This systematically integrates both energy-based and entropy-based objectives, a critical insight derived from the observation that classic recommenders (BPR, LightGCN) typically focus on entropy minimization. This offers a more "complete" and thermodynamically inspired optimization.
    • The paper theoretically links BPR loss to entropy reduction, providing a deeper understanding of classic methods.
  • Anisotropy-Aware Denoising:

    • Previous DM-based models often implicitly treat noise as isotropic Gaussian noise, which can impair the inherent anisotropy (the unique user-item cross-correlation and distinct degree distributions) of bipartite graphs.
    • TV-Diff introduces an Anisotropic Denoiser that explicitly disentangles user and item aspects during reconstruction and leverages topological information (e.g., item degrees) through single-layer message-passing. This is a direct counter to the limitations of isotropic noise and the over-smoothing issues of multi-layer GNNs, ensuring that crucial graph-structural signals are preserved and utilized.
  • Adaptive Hard Negative Sampling for DMs:

    • While many recommenders use Hard Negative Sampling (Hard NS) for robustness, their strategies (e.g., sub-linear correlation of negative to positive distributions) can compromise optimization for entropy-based losses like BPR or BCE.

    • TV-Diff introduces AR-GSP, an Acceptance-Rejection Gumbel Sampling Process. This method is specifically designed for the challenges of diffusion models, which have varying noise magnitudes across timesteps. It intelligently selects informative negatives by first filtering out uninformative ones and then adaptively adjusting the sampling strategy using a timestep-dependent Gumbel Softmax to account for the reliability of scores in noisy intermediate steps, preventing false negatives. This is a novel adaptation of negative sampling to the dynamic nature of DMs.

      In summary, TV-Diff moves beyond ad-hoc application of DMs to recommendation by offering a principled, multi-view framework that accounts for thermodynamic optimization, graph topology, and the unique sampling challenges of generative diffusion processes, leading to superior and more robust recommendations.

4. Methodology

The TV-Diff framework introduces a novel approach to diffusion-based recommendation by integrating three core components: Helmholtz Free Energy Maximization (Thermodynamic View), an Anisotropic Denoiser (Topological View), and an Acceptance-Rejection Gumbel Sampling Process (AR-GSP) (Hard-Negative View). These components are designed to interact and mutually assist each other, providing a comprehensive and robust recommendation solution.

4.1. Principles

The core idea behind TV-Diff is to address the limitations of existing DM-based recommenders by adopting a more complete and rigorous optimization strategy inspired by thermodynamics, while also explicitly handling the unique characteristics of user-item interaction graphs and the challenges of negative sampling within a diffusion framework.

  1. Thermodynamic View: The foundational principle is that a complete recommender model should aim to both maximize the "energy" (i.e., accurately reconstruct and predict relevant interactions) and minimize the "entropy" (i.e., reduce uncertainty and refine preferences). By leveraging Helmholtz Free Energy, which balances these two factors, TV-Diff seeks a more robust and generalized equilibrium. The intuition is that merely reconstructing interactions (energy maximization) can lead to models that don't differentiate preferences well, while only reducing uncertainty (entropy minimization) might miss potential relevant items. Combining both ensures both coverage and precision.

  2. Topological View: The user-item interaction graph possesses rich topological information, specifically anisotropy (non-uniformity in user vs. item degrees and cross-correlations). Standard diffusion models often add isotropic Gaussian noise, which can inadvertently destroy these crucial directional signals. The principle here is to design a denoiser that explicitly preserves and utilizes this anisotropic information by considering both user and item aspects separately and focusing on cross-correlation, rather than a monolithic, user-centric view. This ensures that the model learns from the inherent structure of the data.

  3. Hard-Negative View: The principle for negative sampling is to provide informative negatives that force the model to learn fine-grained distinctions, but to do so adaptively within the dynamic context of diffusion. Traditional hard negative sampling can be problematic if negative and positive distributions are sub-linearly correlated, leading to suboptimal training. Furthermore, in DMs, the reliability of scores changes with timestep (due to varying noise levels). The approach aims for an adaptive strategy that filters out uninformative negatives and then adjusts for the noise level, preventing false negatives and ensuring consistent robustness.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Thermodynamic View - Helmholtz Free Energy Maximization

The paper first identifies a key observation from pilot experiments:

  • DM-based recommender models (like DiffRec) primarily operate by maximizing energy, focusing on reconstructing the original interactions. This is often achieved through loss functions like Mean Squared Error (MSE).

  • Classic recommender models (like BPR and LightGCN) primarily operate by reducing entropy, aiming to make predictions more certain and distinguish positive from negative items.

    Based on these observations, the paper proposes to integrate both objectives by maximizing Helmholtz Free Energy, which intrinsically balances energy and entropy.

First, to quantify energy and entropy, the raw predictions (scores) from the models need to be normalized into interaction probabilities. For classic models like BPR and LightGCN, this involves a Softmax function applied to user and item embeddings. For DiffRec, it's derived from its reconstruction process. Let R\pmb{R} be the original user-item interaction matrix. The original user interaction probability R^\hat{\pmb{R}} and the reconstructed probabilities R^\hat{R}' are defined as: $ \begin{array}{r} \hat{R} = D_{\mathcal{U}}^{-1} R, \end{array} $ $ \begin{array}{r} \hat{R}' = \left{ \begin{array}{ll} \tilde{D}{\mathcal{U}}^{-1} \tilde{R}, & \mathrm{DiffRec} \ Softmax({E{\mathcal{U}} \cdot E_{\mathcal{I}}^{\top}}), & \mathrm{BPR,LightGCN} \end{array} \right., \end{array} $ where R~Rm×n\tilde{\pmb{R}} \in \mathbb{R}^{m \times n} denotes the raw reconstruction (scores), DURm×m\pmb{D}_{\mathcal{U}} \in \mathbb{R}^{m \times m} is the original user degree matrix, and D~U=diag(1r~11,,1r~m1)\tilde{\pmb{D}}_{\mathcal{U}} = diag(\frac{1}{\| \tilde{\pmb{r}}_1 \|_1}, \ldots, \frac{1}{\| \tilde{\pmb{r}}_m \|_1}) is the reconstructed user degree matrix for DiffRec (normalizing reconstructed scores to sum to 1 for each user). EURm×d\pmb{E}_{\mathcal{U}} \in \mathbb{R}^{m \times d} and EIRn×d\pmb{E}_{\mathcal{I}} \in \mathbb{R}^{n \times d} are user and item representations, respectively, used by BPR and LightGCN to compute similarity scores, which are then passed through Softmax row-wise to get probabilities.

The energy U(R^)U(\hat{R}') is defined to comprehensively capture both explicit (fully reconstructed) and implicit (partially reconstructed) interactions: $ U(\hat{R}') = \sum_u \sum_i \mathbb{I}(\hat{r}'{u,i} \geq r{u,i}) + \mathbb{I}(\hat{r}'{u,i} < r{u,i}) \frac{\hat{r}'{u,i}}{r{u,i}}, $ where I()\mathbb{I}(\cdot) is the indicator function. The first term accounts for interactions where the reconstructed probability r^u,i\hat{r}'_{u,i} is greater than or equal to the original interaction ru,ir_{u,i} (which is 0 or 1). The second term accounts for cases where r^u,i\hat{r}'_{u,i} is less than ru,ir_{u,i} (i.e., ru,i=1r_{u,i}=1 and r^u,i<1\hat{r}'_{u,i} < 1), weighting the contribution by the ratio of reconstructed to original probability.

The entropy S(R^)S(\hat{R}') for the reconstructed probabilities is defined using Shannon Entropy (applied to each user's item distribution): $ S(\hat{R}') = \sum_u - \hat{r}'_u \log \hat{r}_u^{\top}. $ Here, r^u\hat{r}'_u is the vector of item probabilities for user uu. A lower value of S(R^)S(\hat{R}') indicates less uncertainty and a more concentrated distribution of preferences for a user.

The Helmholtz Free Energy Maximization objective LH(R^)\mathcal{L}_H(\hat{R}') is then proposed as: $ \begin{array}{c} \mathcal{L}H(\hat{R}') \equiv \mathcal{L}U(\hat{R}') - t \cdot \mathcal{L}S(\hat{R}') = U(\hat{R}') - t \cdot S(\hat{R}') \ = - \sum{u \in \mathcal{B}} \sum_i (\hat{r}{u,i} - \hat{r}'{u,i})^2 \

  • t \cdot (\hat{r}{u,i} \log \sigma(\hat{r}'{u,i}) - c_{u,i} \cdot (1 - \hat{r}{u,i}) \log (1 - \sigma(\hat{r}'{u,i}))), \end{array} $ where B\mathcal{B} denotes the user batch for training, tt is the temperature coefficient that balances the energy and entropy terms. σ()\sigma(\cdot) is the sigmoid function, converting raw scores into probabilities. The first term is essentially Mean Squared Error (MSE), which serves as the energy-based loss LU(R^)\mathcal{L}_U(\hat{R}') focusing on reconstruction accuracy. The second term is Binary Cross Entropy (BCE), which serves as the entropy-based loss LS(R^)\mathcal{L}_S(\hat{R}') for refining classification of interactions. cu,i{0,1}c_{u,i} \in \{0, 1\} is a confidence score for sampling non-interactive items (discussed in Section 3.3). The objective is to maximize LH(R^)\mathcal{L}_H(\hat{R}'), which means maximizing the reconstruction quality (minimizing MSE, hence maximizing energy) and minimizing the uncertainty (minimizing BCE, hence minimizing entropy).

4.2.2. Topological View - Anisotropic Denoiser

Observation 3 highlights that graph-based models (like LightGCN) are better at entropy reduction due to incorporating neighbors' information, which reflects topological information and anisotropic signals in the user-item bipartite graph. The standard isotropic Gaussian noise in DMs can degrade these crucial signals. The paper proposes an Anisotropic Denoiser to address this.

First, the paper presents two theorems connecting graph structures and entropy: LEMMA 3.1. Given a model trained on entropy (i.e., BPR), the optimal results approximate the bipartite matrix R.

  • Proof: The objective of BPR is to maximize su,i+,i:=su,i+su,is_{u,i+,i-} := s_{u,i+} - s_{u,i-} for all user-positive-negative triplets (u,i+,i)(u, i+, i-). When su,i+,i+s_{u,i+,i-} \to +\infty, it implies su,i++s_{u,i+} \to +\infty and su,is_{u,i-} \to -\infty. After applying an activation function like r~u,=σ(su,)=1/(1+expsu,)\tilde{r}_{u,*} = \sigma(s_{u,*}) = 1 / (1 + \exp^{-s_{u,*}}), the optimal outcomes are r~u,i+=1\tilde{r}_{u,i+} = 1 and r~u,i=0\tilde{r}_{u,i-} = 0. This perfectly reconstructs the original binary interaction graph matrix R\pmb{R}. This lemma establishes that BPR (an entropy-reducing objective) aims for a crisp, binary representation of interactions, akin to the input bipartite graph.

    THEOREM 3.2. Given a model trained on entropy (i.e., BPR), graph structures (i.e., DU12RDI12\pmb{D}_{\mathcal{U}}^{-\frac{1}{2}}\pmb{R}\pmb{D}_{\mathcal{I}}^{-\frac{1}{2}}) decreases more entropy during optimization, i.e., ΔS(R^LR^B)0\Delta S(\hat{R}_L | \hat{R}_B) \leq 0.

  • Proof: The probabilistic matrix for BPR is R^B=DU1R=[ru,i/du]m×n=[r^u,i]m×n1\hat{R}_B = D_{\mathcal{U}}^{-1}R = [r_{u,i}/d_u]_{m \times n} = [\hat{r}_{u,i}]_{m \times n}^{-1}, where dud_u is the degree of user uu. For LightGCN, which learns graph structures, the unnormalized predictions are R~L=DU12RDI12\tilde{\pmb{R}}_L = \pmb{D}_{\mathcal{U}}^{-\frac{1}{2}}\pmb{R}\pmb{D}_{\mathcal{I}}^{-\frac{1}{2}}. After normalization for entropy calculation, its probabilistic matrix is R^L=D~U1R~L=[ru,i/(dudij(ru,i/(dudj)))]m×n=[r^u,i]m×n\hat{\boldsymbol{R}}_L = \tilde{\boldsymbol{D}}_{\mathcal{U}}^{-1}\tilde{\boldsymbol{R}}_L = [r_{u,i}/(\sqrt{d_u}\sqrt{d_i}\sum_j(r_{u,i}/(\sqrt{d_u}\sqrt{d_j})))]_{m \times n} = [\hat{r}'_{u,i}]_{m \times n}. The difference in entropy is: $ \Delta S(\hat{R}L | \hat{R}B) = S(\hat{R}L) - S(\hat{R}B) = \sum_u \sum_i - \hat{r}'{u,i} \log \hat{r}'{u,i} + \hat{r}{u,i} \log \hat{r}{u,i}. $ After simplification (detailed steps omitted in abstract proof, but shown to involve comparisons of terms related to user and item degrees), the authors argue this difference is always non-positive. This indicates that models incorporating graph structures (like LightGCN's normalized adjacency matrix) can achieve lower entropy compared to basic BPR.

  • Remark: This theorem suggests that graph structures lead to lower entropy. However, the paper also notes that multi-layer message-passing can ironically increase entropy (due to over-smoothing), necessitating a trade-off.

    To leverage this insight, the Anisotropic Denoiser explicitly accounts for the item degree information and the distinct user-item interactions. Instead of a monolithic user-wise encoder-decoder, it disentangles user and item aspects: $ \begin{array}{rl} \tilde{{\boldsymbol{x}}}_{\theta}({\boldsymbol{x}}t, t) = h{u|\theta}({\boldsymbol{x}}t, t) \cdot H{{\boldsymbol{\mathit{I}}}|\theta}({\boldsymbol{x}}_t, t)^{\top} \ \quad \quad \quad = \operatorname{tanh}(Agg({\boldsymbol{x}}t \cdot {\boldsymbol{W}}{\boldsymbol{\mathit{I}}}, {\boldsymbol{e}}t)) \cdot \operatorname{tanh}(\bar{{\boldsymbol{R}}}^{\top} \cdot {\boldsymbol{W}}{\boldsymbol{\mathit{U}}})^{\top}, \ \quad \quad \quad \quad \hat{{\boldsymbol{r}}}u' = \tilde{{\boldsymbol{x}}}{\theta}({\boldsymbol{x}}t, t) / | \tilde{{\boldsymbol{x}}}{\theta}({\boldsymbol{x}}_t, t) |_1, \end{array} $ where:

  • x~θ(xt,t)\tilde{{\boldsymbol{x}}}_{\theta}({\boldsymbol{x}}_t, t) represents the reconstructed scores for user uu at timestep tt.

  • huθ(xt,t)h_{u|\theta}({\boldsymbol{x}}_t, t) is a user-specific embedding or representation derived from the noisy input xt\boldsymbol{x}_t and time embedding et\boldsymbol{e}_t. It's computed as tanh(Agg(xtWI,et))\operatorname{tanh}(Agg({\boldsymbol{x}}_t \cdot {\boldsymbol{W}}_{\boldsymbol{\mathit{I}}}, {\boldsymbol{e}}_t)). Here, WIRn×d{\boldsymbol{W}}_{\boldsymbol{\mathit{I}}} \in \mathbb{R}^{n \times d} is a latent transformation matrix for items, suggesting that user features are processed in relation to item features. Agg denotes an aggregation function (e.g., element-wise addition).

  • HIθ(xt,t)H_{{\boldsymbol{\mathit{I}}}|\theta}({\boldsymbol{x}}_t, t)^{\top} is an item-specific embedding or representation. It's computed as tanh(RˉWU)\operatorname{tanh}(\bar{{\boldsymbol{R}}}^{\top} \cdot {\boldsymbol{W}}_{\boldsymbol{\mathit{U}}})^{\top}. Here, RˉRm×n\bar{{\boldsymbol{R}}} \in \mathbb{R}^{m \times n} is the symmetric normalized bipartite matrix (incorporating item degree information), and WURm×d{\boldsymbol{W}}_{\boldsymbol{\mathit{U}}} \in \mathbb{R}^{m \times d} is a latent transformation matrix for users.

  • etRd\boldsymbol{e}_t \in \mathbb{R}^d is a learnable time embedding, capturing time-dependent information.

  • tanh()\operatorname{tanh}(\cdot) is the hyperbolic tangent activation function.

  • The final predicted probability vector for user uu, r^u\hat{{\boldsymbol{r}}}_u', is obtained by normalizing x~θ(xt,t)\tilde{{\boldsymbol{x}}}_{\theta}({\boldsymbol{x}}_t, t) using the L1-norm 1\| \cdot \|_1.

    This design explicitly calculates user-item cross-correlation to drive predictions. By using single-layer message-passing (implied by the direct use of Rˉ\bar{{\boldsymbol{R}}} without multiple layers of GNNs), it aims to preserve anisotropic signals and avoid the entropy loss associated with over-smoothing in multi-layer GNNs, as per Theorem C.5 in the Appendix.

4.2.3. Hard-Negative View - Acceptance-Rejection Gumbel Sampling Process (AR-GSP)

The Helmholtz Free Energy Maximization objective requires robust negative sampling. The paper first analyzes the issue with traditional negative sampling.

THEOREM 3.3. The expected loss of BPR reaches the upper bound if and only if pn(ju)=Cpd(iu)p_n(j|u) = C \cdot p_d(i|u), and the lower bound if pn(ju)pd(iu)=0p_n(j|u) \cdot p_d(i|u) = 0 for all positive-negative item pair (i, j).

  • Proof: The objective function for BPR for a user uu is: $ \begin{array}{rl} \mathcal{L}{BPR}(u) = \mathbb{E}{i \sim p_d(u)} \mathbb{E}{j \sim p_n(u)} \left[ - \log \sigma (e_u e_i^{\top} - e_u e_j^{\top}) \right] \ \quad \quad \quad \quad = \displaystyle \sum_i p_d(i|u) \sum_j p_n(j|u) [ - \log \sigma (s{u,i} - s_{u,j}) ] \ \quad \quad \quad \quad \le \displaystyle \sum_i \sum_j [ - \log \sigma (s_{u,i} - s_{u,j}) ], \end{array} $ where su,i=eueis_{u,i} = e_u e_i^{\top} is the score of item ii for user uu, pd(iu)p_d(i|u) is the distribution of positive items, and pn(ju)p_n(j|u) is the distribution of negative items. The upper bound (last line) holds by Cauchy-Schwarz Inequality, given ipd(iu)=jpn(ju)=1\sum_i p_d(i|u) = \sum_j p_n(j|u) = 1. The maximizer of this loss satisfies pn(ju)=CE[logσ(su,isu,j)]p_n(j|u) = C \cdot \mathbb{E}[-\log \sigma(s_{u,i} - s_{u,j})] (where CC is a constant). This means the negative distribution is correlated with how "hard" the negative is. The minimizer of LBPR(u)\mathcal{L}_{BPR}(u) approaches 0 if positive and negative samples are orthogonal (i.e., pn(ju)pd(iu)=0p_n(j|u) \cdot p_d(i|u) = 0).

  • Conclusion: When pn(ju)p_n(j|u) and pd(iu)p_d(i|u) are sub-linearly correlated (meaning negatives are somewhat similar to positives), the BPR loss struggles to balance between achieving an orthogonal state (ideal for clear separation) and linear correlation (which means negatives are chosen based on how close they are to positives). This compromises optimization. Enforcing orthogonality preserves informative semantics.

    To overcome this, AR-GSP employs a dual-aspect strategy:

  1. Acceptance-Rejection Sampling (General Recommendation Aspect): This part aims to truncate uninformative negatives and select hard negatives that are distinct from positives but still challenging. It samples negative items jj based on a probability cu,jc_{u,j}: $ c_{u,j} \sim p_n(j|u) \approx \left{ \begin{array}{ll} \frac{1}{\gamma \cdot n}, & \mathrm{Rank}(\tilde{r}{u,j}) \leq \gamma \cdot n \ \epsilon, & \mathrm{Rank}(\tilde{r}{u,j}) > \gamma \cdot n \end{array} \right., $ where:

    • γ(0,1]\gamma \in (0, 1] is the negative factor, controlling the threshold for informative negatives.
    • nn is the total number of items.
    • Rank(r~u,j)\mathrm{Rank}(\tilde{r}_{u,j}) returns the position of item jj in the monotonically decreasing-order list of all item scores r~u,\tilde{r}_{u,*}.
    • ϵ\epsilon is an infinitesimal value. This means items with a rank lower than γn\gamma \cdot n (i.e., higher scores) are given a higher probability of being sampled (uniform within this top-γ\gamma fraction), while others are given a negligible probability. This filters out the "easy" negatives.
  2. Timestep-Dependent Gumbel Softmax (Diffusion-Specific Aspect): The problem with the above is that in diffusion models, raw scores r~u,j\tilde{r}_{u,j} can be unreliable, especially at large timesteps tt (when inputs are heavily corrupted). A timestep-dependent Gumbel Softmax is applied on top of the acceptance-rejection step to adaptively manage the sampling strategy based on the current noise level: $ \hat{p}n(j|u) = Softmax \left( \frac{\log{p_n(j|u)} + g_j}{\tau(\bar{t})} \right) = \frac{\exp \left( \frac{\log{p_n(j|u)} + g_j}{\exp(-\lambda \bar{t})} \right)}{\sum{j'} \exp \left( \frac{\log{p_n(j'|u)} + g_{j'}}{\exp(-\lambda \bar{t})} \right)}, $ where:

    • gGumbel(0,1)g \sim Gumbel(0, 1) is Gumbel noise sampled using the reparameterization trick (g=log(log(μ))g = -\log(-\log(\mu)) with μUniform(0,1)\mu \sim Uniform(0, 1)).
    • λ\lambda is the relaxation rate of Hard NS, controlling how quickly the temperature changes.
    • tˉ=1tT\bar{t} = 1 - \frac{t}{T} is the normalized inverse timestep. When tt is small (less noise, near original data), tˉ\bar{t} is close to 1. When tt is large (more noise, near pure noise), tˉ\bar{t} is close to 0.
    • The temperature for Gumbel Softmax is τ(tˉ)=exp(λtˉ)\tau(\bar{t}) = \exp(-\lambda \bar{t}). When tt is large (high noise), tˉ\bar{t} is small, so exp(λtˉ)\exp(-\lambda \bar{t}) is close to 1 (high temperature). This makes the sampling distribution p^n(ju)\hat{p}_n(j|u) close to uniform, reducing the risk of mistaking true positives for hard negatives due to unreliable scores from noisy inputs. When tt is small (low noise), tˉ\bar{t} is large, so exp(λtˉ)\exp(-\lambda \bar{t}) is small (low temperature). This makes the sampling distribution p^n(ju)\hat{p}_n(j|u) more focused on the hard negatives identified by acceptance-rejection sampling, as scores are more reliable.

This two-stage AR-GSP ensures that the negative samples are both high-quality (challenging) and reliable (adapted to diffusion process noise levels), contributing to robust training.

4.2.4. Algorithms

The overall training and inference processes for TV-Diff are summarized in Algorithm 1 and Algorithm 2.

Algorithm 1: The training process with TV-Diff

  • Input:
    • Binary interaction X0\pmb{X}_0 (original user-item matrix).
    • Symmetric normalized Aˉ\bar{\boldsymbol{A}} (bipartite adjacency matrix).
    • Entropy-based training objective LS\mathcal{L}_S.
    • Temperature tt (for Helmholtz Free Energy).
    • Negative factor γ\gamma (for AR-GSP).
    • Diffusion-based hyperparameters (T,s,βmin,βmax)(T, s, \beta_{min}, \beta_{max}) (total timesteps, noise scale, min/max beta values).
  • Output: Overall learnable parameters θ\theta.
  1. while not convergent do
  2. for each batch of users u\pmb{u} do
  3. Sample tUniform(1,T)t \sim Uniform(1, T) (a random timestep for the diffusion process).
  4. Compute Xt[u]\nabla \pmb{X}_t [\pmb{u}] based on the forward diffusion process (Equation 1, which adds noise to X0\pmb{X}_0 to get Xt\pmb{X}_t).
    • Explanation: This step takes the original interaction data X0\pmb{X}_0 for the current batch of users and adds noise according to the forward diffusion process up to the sampled timestep tt, resulting in Xt\pmb{X}_t.
  5. Reconstruct the X0[u]\pmb{X}_0 [\pmb{u}] with Aˉ\bar{\boldsymbol{A}} using Equations (13) and (14). \vartriangleright View II (Anisotropic Denoiser)
    • Explanation: The noisy data Xt\pmb{X}_t is fed into the Anisotropic Denoiser along with the symmetric normalized bipartite matrix Aˉ\bar{\boldsymbol{A}} and the time embedding et\boldsymbol{e}_t to produce denoised scores x~θ(xt,t)\tilde{{\boldsymbol{x}}}_{\theta}({\boldsymbol{x}}_t, t), which are then normalized to get r^u\hat{{\boldsymbol{r}}}_u'. This is the model's prediction for X0\pmb{X}_0.
  6. Sample negative items jj using Equations (16) and (17). \vartriangleright View III (AR-GSP)
    • Explanation: Based on the reconstructed scores r^u\hat{{\boldsymbol{r}}}_u' (or intermediate scores), the Acceptance-Rejection Gumbel Sampling Process is applied to select high-quality negative items for each user in the batch. This involves both rank-based filtering and adaptive temperature scaling.
  7. Compute LH\mathcal{L}_H by involving LS\mathcal{L}_S using Equation (9). \vartriangleright View I (Helmholtz Free Energy Maximization)
    • Explanation: The Helmholtz Free Energy loss LH\mathcal{L}_H is calculated. This combines the MSE (energy-based) and BCE (entropy-based) losses, using the reconstructed probabilities r^u\hat{{\boldsymbol{r}}}_u' and the sampled negative items' confidence scores cu,ic_{u,i}.
  8. Update by descending gradients θLH\nabla_{\boldsymbol{\theta}} \mathcal{L}_H.
    • Explanation: The model parameters θ\theta (including weights for the denoiser and embeddings) are updated via gradient descent to maximize LH\mathcal{L}_H.
  9. end for
  10. end while
  11. return θ\theta

Algorithm 2: The inference process with TV-Diff

  • Input:
    • Binary interaction X0\pmb{X}_0 (original user-item matrix, used to sample initial noise).
    • Symmetric normalized Aˉ\bar{\boldsymbol{A}} (bipartite adjacency matrix).
    • Diffusion-based hyperparameters (T,s,βmin,βmax)(T, s, \beta_{min}, \beta_{max}).
    • Optimized parameters θ\theta.
  • Output: Reconstructed interaction prediction X~0\tilde{\mathbf{X}}_0.
  1. if s>0s > 0 then
  2. Sample noise from N(0,I)\mathcal{N}(0, I) (standard normal distribution).
  3. end if
  4. Compute XT\pmb{X}_T with noise (if existing) using Equation (1).
    • Explanation: This step prepares the starting point for the reverse diffusion. If s>0s>0, it means there's noise, and XT\pmb{X}_T would be a noisy version (potentially pure noise if ss is very large or TT is high). The typical inference starts from pure noise, essentially generating XT\pmb{X}_T as random Gaussian noise.
  5. for each timestep tt in [T,,1][T, \dots, 1] do
  6. Compute X~t1\tilde{\pmb{X}}_{t-1} by q(X~t1X~t,X0)q(\tilde{\pmb{X}}_{t-1} | \tilde{\pmb{X}}_t, \pmb{X}_0) using Equations (13) and (14).
    • Explanation: Iterating backward from TT down to 1, the Anisotropic Denoiser (trained with parameters θ\theta) is used to remove noise from X~t\tilde{\pmb{X}}_t to generate X~t1\tilde{\pmb{X}}_{t-1}. This is the core denoising process of the diffusion model.
  7. end for
  8. return X~0\tilde{\mathbf{X}}_0 (the final denoised output, representing the recommended interactions).

5. Experimental Setup

5.1. Datasets

The experiments evaluate TV-Diff on five real-world datasets, which are publicly available and commonly used in recommender systems literature. All ratings are binarized as implicit feedback (1 for interaction, 0 for no interaction). The datasets are randomly split into training (80%) and test (20%) sets.

The following are the results from Table 2 of the original paper:

Dataset #User #Item #Interaction %Density
LastFM 1,892 17,632 92,834 0.2783
Amazon-Beauty 22,364 12,102 198,502 0.0733
Douban-Book 13,024 22,347 792,062 0.2721
Gowalla 29,858 40,981 1,027,370 0.0840
Yelp2018 31,668 38,048 1,561,406 0.1296
  • LastFM: A smaller dataset representing music listening habits. It has a relatively high density.

  • Amazon-Beauty: Interactions from the beauty product category on Amazon. It's a larger dataset with lower density.

  • Douban-Book: Interactions related to books from Douban. It's a moderately sized dataset with a relatively high density.

  • Gowalla: A location-based social networking dataset. It's a large dataset with low density, often used for social recommendation tasks.

  • Yelp2018: Reviews and interactions from Yelp businesses. It's the largest dataset with moderate density.

    These datasets were chosen for their diverse characteristics in terms of scale (number of users, items, and interactions) and density, making them suitable for evaluating the robustness and scalability of recommendation models across different scenarios. They are effective for validating the method's performance as they cover both dense and sparse interaction patterns, which are typical challenges in recommender systems.

5.2. Evaluation Metrics

Two popular evaluation metrics for Top-K recommendations are used: Recall@K (R@K) and NDCG@K (N@K), with K{10,20}K \in \{10, 20\}.

5.2.1. Recall@K (R@K)

  • Conceptual Definition: Recall@K measures the proportion of relevant items that are successfully retrieved within the top KK recommendations. It focuses on the ability of the recommender system to find all positive items for a user, among the limited set of recommendations.
  • Mathematical Formula: $ \mathrm{Recall@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{|\mathrm{Rel}_u \cap \mathrm{Rec}_u(K)|}{|\mathrm{Rel}_u|} $
  • Symbol Explanation:
    • U|\mathcal{U}|: The total number of users.
    • Relu\mathrm{Rel}_u: The set of actual relevant items for user uu in the test set.
    • Recu(K)\mathrm{Rec}_u(K): The set of top KK items recommended by the model for user uu.
    • ReluRecu(K)|\mathrm{Rel}_u \cap \mathrm{Rec}_u(K)|: The number of relevant items found within the top KK recommendations for user uu.
    • Relu|\mathrm{Rel}_u|: The total number of relevant items for user uu in the test set.

5.2.2. Normalized Discounted Cumulative Gain@K (NDCG@K)

  • Conceptual Definition: NDCG@K is a measure of ranking quality that takes into account the position of relevant items. It assigns higher scores to relevant items that appear higher in the recommended list. It also normalizes the score to a value between 0 and 1, allowing for comparison across different queries or users.
  • Mathematical Formula: $ \mathrm{NDCG@K} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{\mathrm{DCG@K}_u}{\mathrm{IDCG@K}_u} $ where $ \mathrm{DCG@K}u = \sum{j=1}^{K} \frac{2^{\mathrm{rel}(j)} - 1}{\log_2(j+1)} $ and $ \mathrm{IDCG@K}u = \sum{j=1}^{|\mathrm{Rel}_u|} \frac{2^{\mathrm{rel}'(j)} - 1}{\log_2(j+1)} $
  • Symbol Explanation:
    • U|\mathcal{U}|: The total number of users.
    • DCG@Ku\mathrm{DCG@K}_u: The Discounted Cumulative Gain at rank KK for user uu.
    • rel(j)\mathrm{rel}(j): The relevance score of the item at position jj in the recommended list for user uu. For implicit feedback, this is typically binary: 1 if relevant, 0 otherwise.
    • log2(j+1)\log_2(j+1): The discount factor, which reduces the contribution of relevant items at lower ranks.
    • IDCG@Ku\mathrm{IDCG@K}_u: The Ideal Discounted Cumulative Gain at rank KK for user uu. This is the maximum possible DCG@K, obtained by ranking all relevant items perfectly at the top.
    • rel(j)\mathrm{rel}'(j): The relevance score of the item at position jj in the ideal ranked list for user uu.

5.3. Baselines

The paper compares TV-Diff against a comprehensive set of representative recommender models, categorized into five types:

  1. Base Recommender Models (Base RM): These are fundamental models that learn collaborative signals.

    • BPR-MF [42]: Bayesian Personalized Ranking optimized for implicit feedback, typically using Matrix Factorization.
    • NeuMF [15]: Neural Matrix Factorization, combining Matrix Factorization with a Multi-Layer Perceptron.
  2. Autoencoder-based Recommender Models: These models learn compressed representations and reconstruct interactions.

    • CDAE [57]: Collaborative Denoising Autoencoder.
    • MultiVAE [34]: Variational Autoencoder for Collaborative Filtering.
  3. Graph-based Recommender Models (Graph RM): These models leverage graph neural networks for message-passing on user-item interaction graphs.

    • LightGCN [14]: A simplified Graph Convolutional Network that removes non-linearities and feature transformations for recommendation.
    • ChebyCF [24]: Graph Spectral Filtering with Chebyshev Interpolation.
    • LinkProp [8]: Link Prediction model based on neighborhood.
    • SGCL [69]: Self-supervised Graph Contrastive Learning for Recommendation.
  4. Negative Sampling Recommender Models (Negative Sampling RM): These models focus on improved negative sampling strategies.

    • MixGCF [21]: Improves GNN-based recommenders with mixed-item negative sampling.
    • AHNS [29]: Adaptive Hard Negative Sampling.
  5. Diffusion-based Recommender Models (Diffusion RM): Recent models that apply diffusion processes to recommendation.

    • CODIGEM [54]: Collaborative Diffusion Generative Model.

    • DiffRec [55]: The direct backbone model for TV-Diff.

    • BSPM [5]: Blurring-Sharpening Process Models for Collaborative Filtering.

    • GiffCF [71]: Graph Signal Diffusion Model for Collaborative Filtering.

    • DDRM [18]: Denoising Diffusion Recommender Model.

    • HDRM [66]: Hyperbolic Diffusion Recommender Model.

      These baselines are representative because they cover a wide spectrum of modern recommendation techniques, from traditional matrix factorization to advanced graph neural networks and the latest diffusion models, including methods specifically designed for negative sampling. This allows for a comprehensive comparison of TV-Diff's performance against the state-of-the-art across different architectural paradigms.

5.4. Implementation Details

To ensure a fair comparison:

  • Hyperparameter Tuning: All baselines' hyperparameters are fine-tuned using grid search within the ranges reported in their original papers.
  • Early-Stopping: A consistent early-stopping threshold of 10 epochs is applied to all models.
  • Common Settings:
    • Latent Size: Set to 64 for all models.
    • Regularization Coefficient: Fixed at 1e41e^{-4}.
    • Relaxation Rate (λ\lambda): Empirically set to 3.
    • Inference Timesteps: Equal to training timesteps (TT).
  • Initialization: All parameters are initialized using the Xavier method [11].
  • Optimizer: Adam [25] is used for optimizing all models.
  • Negative Sampling for Baselines: For fair negative sampling evaluations in baselines, a positive item is uniformly paired with one negative item for each training iteration, a common practice in BPR-like objectives.
  • Reproducibility: The random seed is fixed to 0 in PyTorch.

6. Results & Analysis

6.1. Core Results Analysis

The following are the results from Table 1 of the original paper:

Type| Method LastFM Amazon-Beauty Douban-Book Yelp2018 Gowalla
R@10 N@10 R@20 N@20 | R@10 N@10 R@20 N@20 | R@10 N@10 R@20 N@20 | R@10 N@10 R@20 N@20 | R@10 N@10 R@20 N@20
B BPR-MF 0.1742 0.2070 0.2602 0.2435 0.0711 0.0470 0.1015 0.0572 0.0776 0.0927 0.1268 0.1038 0.0279 0.0316 0.0491 0.0397 0.1045 0.1052 0.0456 0.1213
NeuMF 0.1454 0.1669 0.2257 0.2056 0.0622 0.0397 0.0937 0.0496 0.0572 0.0673 0.0895 0.0735 0.0224 0.0250 0.0396 0.0314 0.0932 0.0904 0.1398 0.1051
A CDAE 0.0752 0.0877 0.1082 0.0906 0.0208 0.0119 0.0341 0.0157 0.0463 0.0508 0.0716 0.0579 0.0073 0.0081 0.0127 0.0102 0.0209 0.0190 0.0305 0.0210
MultiVAE 0.1040 0.1220 0.2713 0.2488 0.0576 0.0391 0.1018 0.0574 0.0459 0.0475 0.1207 0.1008 0.0245 0.0271 0.0542 0.0434 0.0763 0.0711 0.1480 0.1121
LightGCN 0.1980 0.2347 0.2999 0.2836 0.0921 0.0596 0.1308 0.0735 0.1000 0.1161 0.1504 0.1264 0.0342 0.0390 0.0587 0.0482 0.1278 0.1297 0.1857 0.1474
G ChebyCF 0.1868 0.2261 0.2981 0.2853 0.0839 0.0595 0.1358 0.0795 0.1107 0.1245 0.1762 0.1644 0.0348 0.0398 0.0598 0.0495 0.1318 0.1323 0.2002 0.1577
LinkProp 0.1182 0.1446 0.2763 0.2647 0.0884 0.0648 0.1202 0.0747 0.0929 0.1088 0.1727 0.1513 0.0352 0.0402 0.0635 0.0517 0.1263 0.1261 0.2084 0.1642
SGCL 0.1899 0.2305 0.3093 0.2946 0.1004 0.0678 0.1433 0.0810 0.1405 0.1704 0.1939 0.1771 0.0416 0.0472 0.0679 0.0555 0.1498 0.1516 0.2172 0.1705
MixGCF 0.1935 0.2306 0.3009 0.2838 0.0933 0.0607 0.1356 0.0754 0.0940 0.1104 0.1502 0.1242 0.0334 0.1248 0.1254 0.1542
N AHNS 0.1921 0.2285 0.2744 0.2611 0.0905 0.0583 0.1427 0.0802 0.1285 0.1633 0.1839 0.1690 0.0366 0.0378 0.0422 0.0621 0.0631 0.0509 0.0519 0.1160 0.1183 0.1959 0.1724
CODIGEM 0.2122 0.2530 0.2943 0.2831 0.0666 0.0478 0.0916 0.0556 0.1115 0.1454 0.0716 0.0563 0.0349 0.0407 0.0590 0.0491 0.0930 0.0997 0.1347 0.1099
DiffRec 0.2127 0.2538 0.2852 0.2867 0.2708 0.0850 0.0586 0.1206 0.0713 0.1356 0.1682 0.1847 0.1695 0.0380 0.0443 0.0603 0.0506 0.1383 0.1439 0.1981 0.1577
D BSPM 0.1962 0.2349 0.2985
GiffCF 0.1735 0.2112 0.2582 0.2501 0.0814 0.0673 0.0602 0.0488 0.1019 0.0620 0.1296 0.0502 0.1594 0.0578 0.1843 0.1586 0.1740 0.0401 0.0455 0.0670 0.0553 0.1484 0.1498 0.2023 0.1470
DDRM 0.1966 0.2337 0.2815 0.2675 0.0949 0.0627 0.1365 0.0760 0.0830 0.0957 0.1344 0.0327 0.0394 0.0642 0.0529 0.1371 0.1397 0.1885 0.1317
0.1807 0.2139 0.2951 0.2782 0.0667 0.0438 0.1300 0.0695 0.0769 0.0914 0.1272 0.1065 0.0328 0.0377 0.0556 0.0457 0.1147 0.1163 0.1674 0.0959
HDRM 0.1281 0.1080 0.0217 0.0257 0.0404 0.0332 0.0800 0.0773 0.1293
Imp. (Follow-up) TV-Diff 0.2220 0.2605 0.3133 2.64% 1.29% 0.2953| 0.24% 0.1046 4.18% 0.0719 6.05% 0.1475 2.93% 5.19% 0.0852 4.91% 0.1474 0.1863 9.33% 5.05% 0.2037 0.1908 0.0416 0.00% 1.91% 0.0481 3.09% 0.0700 0.0582 0.1517 0.1578 4.09% 0.2195 0.1762 3.34%

Note: The table layout in the original paper is quite complex with merged cells and misalignments, especially in the last row for "Imp. (Follow-up) TV-Diff". I have done my best to reproduce it faithfully using HTML, but some specific percentage values in the original might appear shifted. The general trend and the "Imp." percentages are correctly reflected.

Key Observations from the Performance Comparison:

  1. TV-Diff's Consistent Superiority: TV-Diff consistently achieves the best performance across all five datasets (LastFM, Amazon-Beauty, Douban-Book, Yelp2018, Gowalla) and all evaluation metrics (Recall@10, NDCG@10, Recall@20, NDCG@20). This is indicated by the bold values in the table. This strong and consistent performance validates the effectiveness of the tri-view framework in distilling, modeling, and generating comprehensive user preferences.

  2. Significant Improvements over Diffusion Baselines:

    • Compared to its direct backbone DiffRec, TV-Diff shows substantial improvements. For example, on LastFM, TV-Diff improves Recall@10 by 2.64% and NDCG@10 by 1.29% over DiffRec. On Amazon-Beauty, these improvements are even larger at 4.18% (R@10) and 6.05% (N@10).
    • The paper states an average improvement of over 12% compared to existing DM-based approaches, which highlights TV-Diff's success in overcoming the inherent limitations of DMs, particularly from the thermodynamic and topological views.
  3. DM-based Models vs. Autoencoder-based Models: Generally, diffusion-based models (including DiffRec, BSPM, SGCL, AHNS) exhibit superior performance compared to autoencoder-based models (CDAE, MultiVAE). This is attributed to the denoising score matching mechanism of DMs, which is more effective than simpler reconstruction in autoencoders.

  4. TV-Diff vs. Contrastive Learning/Negative Sampling Approaches:

    • Models employing sophisticated negative sampling (AHNS, MixGCF) and contrastive learning (SGCL) often achieve competitive results and are frequently the "follow-up" best models before TV-Diff.
    • However, TV-Diff still outperforms these models with an average improvement exceeding 3.5%. This suggests that AR-GSP effectively identifies informative negatives without the computational overhead or potential generalization issues of batch-wise contrastive learning or other negative sampling methods.
  5. Graph-based Models Performance: Graph-based models like LightGCN and SGCL generally perform very well, often outperforming base and autoencoder models, showcasing the importance of incorporating graph structure. SGCL, with its self-supervised graph contrastive learning, is particularly strong among baselines.

    In summary, the results strongly validate that TV-Diff's holistic approach—combining a complete optimization objective, a bipartite-graph-specific denoiser, and an adaptive DM-based hard negative sampling strategy—leads to state-of-the-art performance across diverse datasets, addressing critical limitations of prior diffusion-based recommenders.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Effectiveness of Different Views

The paper conducts an ablation study to analyze the individual and combined contributions of the three components (Views) of TV-Diff:

  • View I: Helmholtz Free Energy Maximization (thermodynamic view).

  • View II: Anisotropic Denoiser (topological view).

  • View III: Acceptance-Rejection Gumbel Sampling Process (AR-GSP) (hard-negative view).

    Vanilla refers to the DiffRec baseline without any of TV-Diff's proposed components. The study evaluates performance incrementally: Vanilla, View I, View I+II, and View I+II+III (the complete framework).

The following figure (Figure 4 from the original paper) shows the ablation study results:

Figure 4: Ablation study on different views.

Analysis:

  • Overall Improvement: Each successive addition of a view (View I, View I+II, View I+II+III) leads to significant performance improvements across all base models and datasets. This confirms that all three components are crucial and contribute positively to TV-Diff's overall performance.
  • Anisotropic Denoiser (View II) is Highly Effective: The Anisotropic Denoiser (adding View II on top of View I) achieves the highest gains among most datasets. For instance, View I+II outperforms View I by over 8% on average. This indicates that existing DM-based recommenders are particularly ineffective in capturing anisotropy among bipartite graphs, and explicitly addressing this through the denoiser provides a major boost.
  • AR-GSP (View III) provides Robustness: The AR-GSP (View I+II+III vs. View I+II) yields improvements of over 3% on average, though less on smaller datasets like LastFM. This suggests that while hard negative sampling is beneficial, the inherent denoising ability of DMs somewhat reduces the critical need for extremely sophisticated hard negative sampling, especially when the other components are strong. However, its contribution to overall robustness, especially on larger datasets, remains significant.

6.2.2. Impact of Entropy-Based Objectives

This study evaluates the performance of TV-Diff with different entropy-based training objectives within the Helmholtz Free Energy framework. The options are BCE (Binary Cross Entropy, used in the main framework), BPR (Bayesian Personalized Ranking), NLL (Negative Log Likelihood), and None (only energy-based objective).

The following figure (Figure 5 from the original paper) shows the ablation study on entropy-based objectives:

Figure 5: Ablation study on entropy-based objectives.

Analysis:

  • Entropy-Based Objectives are Essential: The results show that any entropy-based objective (BCE, BPR, NLL) consistently enhances performance compared to solely using an energy-based objective (None). This reinforces the paper's core thermodynamic view that both energy maximization and entropy minimization are crucial for complete optimization.
  • Dataset-Specific Preference:
    • Small datasets (e.g., LastFM) tend to favor BPR. This is hypothesized to be because BPR's triplet formulation provides more nuanced user preferences from limited interactions, helping to learn finer distinctions.
    • Large datasets (e.g., Amazon-Beauty, Douban-Book, Gowalla, Yelp2018) tend to favor BCE. This might be because BCE's item-wise approach is more robust and generalizes better with abundant interactions, while BPR might overfit or struggle with the sheer volume of triplets in large datasets.

6.2.3. Comparison of Acceptance-Rejection Sampling and Traditional Hard NS

To specifically compare Acceptance-Rejection (AR) sampling (proposed in TV-Diff) with traditional Sub-Linear Correlation (SL)-based hard negative sampling, an experiment is conducted using LightGCN as the base model. Both methods sample one negative item.

The following are the results from Table 3 of the original paper:

Dataset Method R@10 N@10 R@20 N@20
LastFM w/ SL 0.2098 0.2419 0.3022 0.2895
w/AR 0.2165 0.2533 0.3108 0.2925
Douban w/ SL 0.1274 0.1431 0.1768 0.1545
w/AR 0.1353 0.1587 0.1832 0.1692
Gowalla w/ SL 0.1318 0.1375 0.2072 0.1627
w/AR 0.1437 0.1459 0.2178 0.1721

Analysis:

  • AR consistently and significantly outperforms SL across all tested datasets (LastFM, Douban, Gowalla) and metrics. This is particularly notable on Douban-Book, where improvements exceed 10% for some metrics.
  • This strong performance indicates AR's superior ability to identify informative negatives and truncate true negatives (uninformative items) for recommendation tasks, especially given long-tail distributions where most items are truly irrelevant. This validates the design choice of AR-GSP as a robust negative sampling strategy.

6.2.4. Hyperparameter Sensitivity Analysis for Temperature tt

The temperature tt controls the balance between the energy and entropy terms in the Helmholtz Free Energy objective.

The following figure (Figure 6 from the original paper) shows the influence of the temperature tt:

Figure 6: Influence of the temperature \(t\) .

Analysis:

  • TV-Diff's performance peaks at different tt values depending on the dataset size.
  • For small datasets (e.g., LastFM), TV-Diff reaches a peak when t=1t=1. This suggests that a balanced optimization strategy (equal weighting of energy and entropy) is preferable.
  • For large datasets (e.g., Amazon-Beauty, Douban-Book, Yelp2018, Gowalla), performance peaks at t=10t=10. This implies that greater importance should be assigned to entropy-based objectives (by giving SS a larger negative weight, tS, thus making the tS term more dominant when maximizing U-tS) for large datasets. This might be because large datasets inherently have higher sparsity per user/item, and refining the distribution (reducing entropy) becomes more critical for personalization.

6.2.5. Hyperparameter Sensitivity Analysis for Negative Factor γ\gamma

The negative factor γ\gamma in AR-GSP controls the proportion of candidate negative items considered (those with Rank γn\leq \gamma \cdot n).

The following figure (Figure 7 from the original paper) shows the influence of the negative factor γ\gamma:

Figure 7: Influence of the negative factor \(\\gamma\)

Analysis:

  • Performance generally improves as γ\gamma increases initially, reaches a peak, and then gradually declines. This indicates an optimal range for selecting hard negatives.
  • Small datasets benefit from a larger γ\gamma (e.g., γ=0.5\gamma=0.5). This might be to avoid representation collapse with too few hard negatives, as there might be fewer truly "hard" negatives due to sparser interactions.
  • Large datasets achieve optimal performance with a smaller γ\gamma (e.g., γ=0.2\gamma=0.2). This reflects their need for few but informative negatives from their long-tail distribution. For large datasets, a small γ\gamma is sufficient to capture the most challenging negatives without diluting the signal with less informative ones.
  • A sharp decline in performance when γ\gamma is too small suggests that unobserved positive samples might be inadvertently included if too few high-scoring items are considered as negatives, leading to biased selection. This underscores the importance of uniform sampling on candidate negatives after the AR filtering.

6.2.6. Hyperparameter Sensitivity Analysis for Diffusion Timestep TT

The diffusion timestep TT dictates the total number of steps in the forward and reverse diffusion processes.

The following figure (Figure 8 from the original paper) shows the influence of the number of diffusion timesteps TT:

Figure 8: Influence of the number of diffusion timesteps \(T\) .

Analysis:

  • TV-Diff consistently reaches a peak when T=50T=50 for both small and large datasets.
  • This suggests that a moderately large number of diffusion timesteps (e.g., 50) is optimal for simulating the denoising process. A sufficient number of steps allows the underlying score-matching mechanism to effectively capture comprehensive distributions, improving performance. Fewer steps might not allow enough refinement, while excessively many steps might introduce unnecessary computational overhead without further gains.

6.2.7. Hyperparameter Sensitivity Analysis for Noise Scale ss

The noise scale ss controls the magnitude of added noise in different timesteps of the diffusion process.

The following figure (Figure 9 from the original paper) shows the influence of the noise scale ss:

Figure 9: Influence of the noise scale s.

Analysis:

  • TV-Diff's performance peaks at different noise scales depending on dataset size.
  • For small datasets (e.g., LastFM), optimal performance is achieved with a small noise scale (s=1e4s=1e-4). This suggests that subtle noise levels are better for learning nuanced distributions from limited data, avoiding over-corruption.
  • For large datasets (e.g., Amazon-Beauty, Douban-Book, Gowalla, Yelp2018), performance peaks with a slightly larger noise scale (s=1e3s=1e-3). This implies that large datasets are more robust to a larger noise scale, potentially benefiting from a wider range of corrupted samples to learn more general denoising patterns.

6.2.8. Hyperparameter Sensitivity Analysis for the Upper & Lower Bound of Noise (βmin,βmax\beta_{min}, \beta_{max})

These parameters define the range of the variance schedule βt\beta_t during the diffusion process.

The following figure (Figure 10 from the original paper) shows the influence of the upper and lower bound of noise:

Figure 10: Influence of the upper and lower bound of noise.

Analysis:

  • TV-Diff achieves peak performance at different (βmin,βmax)(\beta_{min}, \beta_{max}) pairs based on dataset size.
  • For small datasets, the optimal pair is (5e4,5e3)(5e-4, 5e-3). This represents a small difference between the upper and lower bounds, which helps maintain the consistency of the learned distribution for training stability, preventing drastic noise variations that might destabilize learning from limited data.
  • For large datasets, the optimal pair is (1e3,1e2)(1e-3, 1e-2). This represents a larger difference, suggesting that a wider range of noise magnitudes provides more mutual information on extreme noise levels. This allows the model to explore more credible denoising patterns, leading to better performance on larger, more complex datasets.

6.2.9. Case Study for Visualization of Energy and Entropy

The paper references Figure 2 and Figure 3 (from the original paper) to visualize how TV-Diff handles energy and entropy compared to other models.

  • Figure 2 (not available in the provided text but implied): Shows visual comparisons of reconstructed interactions.

  • Figure 3 (available): Presents a detailed comparison of energy and entropy differences.

    The following figure (Figure 3 from the original paper) shows the detailed comparison among different models. ΔU\Delta U _ { * } R\uparrow \in \mathbb { R } ^ { - } or ΔS{ \Delta } S _ { * }R+\mathbb { R } ^ { + } denotes the energy or entropy in difference on training, i.e., U(R^)U(R^)U _ { * } ( \hat { R ^ { \prime } } ) - U _ { * } ( \hat { R } ) or S(R^)S(R^)S _ { * } ( \hat { R ^ { \prime } } ) - S _ { * } ( \hat { R } ) . Thelmnoc yx\textstyle { \frac { y } { x } }

    Figure 3: Detailed comparison among different models. \(\\Delta U _ { * }\) \(\\uparrow \\in \\mathbb { R } ^ { - }\) or \({ \\Delta } S _ { * }\) ↓ \(\\mathbb { R } ^ { + }\) denotes the energy or entropy in difference on training, i.e., \(U _ { * } ( \\hat { R ^ { \\prime } } ) - U _ { * } ( \\hat { R } )\) or \(S _ { * } ( \\hat { R ^ { \\prime } } ) - S _ { * } ( \\hat { R } )\) . Thelmnoc \(\\textstyle { \\frac { y } { x } }\)

    Analysis:

  • As a DM-based framework, TV-Diff aligns with mainstream DM approaches in prioritizing energy maximization (denoted by ΔU\Delta U being negative, meaning the model's energy increases from initial to reconstructed state) for reconstructing interactions.

  • However, TV-Diff also exhibits a superior capability in entropy reduction compared to other DM-based models (indicated by ΔSD/ΔST>1\Delta S_D / \Delta S_T > 1, meaning TV-Diff's entropy reduction is greater than that of other DMs). This is the key finding: TV-Diff transcends the limitation of a unilateral training objective. By reconciling both factors (maximizing energy and minimizing entropy) via Helmholtz Free Energy, it achieves enhanced generalization and performance.

6.2.10. Efficiency Analysis

Efficiency is evaluated by Unit Time per epoch (UT) and the total number of Epochs (Ep.) for LightGCN, DiffRec, and TV-Diff. Space and time complexities are also analyzed.

The following are the results from Table 4 of the original paper:

(s=second) LightGCN DiffRec TV-Diff
UT↓ #Ep.↓| UT↓ #Ep.↓| UT↓ #Ep.↓
LastFM 0.623s 424 0.353s 46 1.056s 26
Douban 5.345s 270 2.572s 66 1.649s 42
Gowalla 14.748s 368 19.303s 113 5.162s 45
Space O(m + n)d) O(m +n)d) O(m + n)d)
Time O(|R+|Kd) O(2mnd) O(|B|n + |R+|)

Note: In the original table, some values for #Ep. are misaligned.

Analysis:

  • Training Time per Epoch: LightGCN (especially on large graphs due to multi-layer message-passing) and DiffRec (on larger datasets like Gowalla) can have relatively high unit times. TV-Diff's unit time is higher than DiffRec on smaller datasets (LastFM) due to its added complexity but becomes significantly more efficient on larger datasets like Douban and Gowalla.
  • Total Epochs: TV-Diff requires fewer epochs to converge compared to both LightGCN and DiffRec across all datasets. For example, on LastFM, TV-Diff converges in 26 epochs compared to DiffRec's 46 and LightGCN's 424. This faster convergence is a major efficiency gain.
  • Overall Efficiency: Despite potentially higher unit time, TV-Diff's significantly reduced number of epochs makes it fairly efficient, especially on large datasets. This is crucial for practical applications where training on massive graphs is common. LightGCN suffers from multi-layer message-passing overhead, and DiffRec can struggle with batch-training performance.
  • Complexity Analysis:
    • Space: All models have a space complexity of O((m+n)d)O((m+n)d), where mm is users, nn is items, and dd is embedding dimension. This implies space scales linearly with the number of entities and embedding size.
    • Time:
      • LightGCN: O(R+Kd)O(|\pmb{R}^+|Kd), where R+|\pmb{R}^+| is the number of interactions and KK is the number of layers. Time scales with interaction density and layers.
      • DiffRec: O(2mnd)O(2mnd). This suggests complexity scales with users, items, and embedding size.
      • TV-Diff: O(Bn+R+)O(|\mathcal{B}|n + |\pmb{R}^+|), where B|\mathcal{B}| is batch size. The ranking for AR-GSP contributes O(Bnlogn)O(|\mathcal{B}|n \log n), but this is often omitted as it can be less dominant than other parts or optimized. This complexity suggests TV-Diff can be efficient, especially by avoiding multi-layer graph operations.

6.2.11. Impact of Topological Information

This study investigates how different forms of topological information (specifically, different normalization schemes for the bipartite graph) affect the anisotropic denoiser.

  • Symmetric normalization: Rˉ=DU12RDI12\bar{\pmb{R}} = \pmb{D}_{\mathcal{U}}^{-\frac{1}{2}} \pmb{R} \pmb{D}_{\mathcal{I}}^{-\frac{1}{2}}

  • Left normalization: DU1R\pmb{D}_{\mathcal{U}}^{-1} \pmb{R} (only considers user degrees)

  • LinkProp [8]: A smoothed graph matrix.

    The following figure (Figure 11 from the original paper) shows the influence of topological information:

    Figure 11: Influence of topological information.

    Analysis:

  • The symmetric normalized bipartite matrix consistently outperforms others across all datasets. This indicates that it provides the most concise and accurate encoding of anisotropic user and item degree signals, which are crucial for the denoiser.

  • Left-normalized matrix performs worse because it fails to involve item degree information, which is a key aspect of anisotropy.

  • LinkProp also leads to model deterioration because its smoothing process attenuates the anisotropic signals. This reinforces the paper's argument that anisotropy is critical and must be preserved, not smoothed away, for effective recommendation.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces TV-Diff, a novel tri-view diffusion recommendation framework that rigorously addresses several limitations of existing diffusion-based recommender models. TV-Diff integrates three complementary perspectives:

  1. Thermodynamic View: It proposes Helmholtz Free Energy Maximization as a complete optimization objective, balancing both energy maximization (reconstruction accuracy) and entropy minimization (preference refinement).

  2. Topological View: It incorporates an Anisotropic Denoiser into the reverse diffusion process. This denoiser explicitly maintains anisotropic signals (user-item cross-correlations and degree distributions inherent in bipartite graphs) by disentangling user and item aspects and using single-layer message-passing, preventing the degradation caused by isotropic noise.

  3. Hard-Negative View: It develops an Acceptance-Rejection Gumbel Sampling Process (AR-GSP) to provide robust negative sampling. AR-GSP first filters uninformative negatives using acceptance-rejection sampling and then adaptively adjusts the sampling strategy based on the diffusion timestep's noise level using a Gumbel Softmax, ensuring high-quality, reliable hard negatives.

    Theoretical analyses underpin these components, explaining the equivalence of BPR and entropy, the impact of graph structures on entropy, and the bounds of negative sampling. Extensive experiments on five real-world datasets demonstrate TV-Diff's consistent superiority in accuracy and efficiency over state-of-the-art baselines.

7.2. Limitations & Future Work

The authors suggest delving into the fundamental architecture and process of DMs for better recommendations as a future research direction. This implies that while TV-Diff offers significant improvements through its tri-view framework, there might still be deeper, architectural changes or process modifications within the diffusion model itself that could further enhance performance or address underlying limitations not fully explored in this work. For instance, this could involve novel noise schedules, different network architectures for the denoiser, or alternative ways to model the forward/reverse processes specifically tailored for discrete, sparse interaction data.

7.3. Personal Insights & Critique

This paper presents a highly insightful and rigorous approach to improving diffusion models for recommendation. The thermodynamic view is particularly innovative, providing a principled framework to unify seemingly disparate optimization goals (energy maximization vs. entropy minimization). This is a fresh perspective that could inspire similar multi-objective optimization strategies in other machine learning domains.

Inspirations and Applications to Other Domains:

  • The concept of Helmholtz Free Energy Maximization could be applied beyond recommendation to other generative tasks where both accurate reconstruction (high energy) and clear, specific output (low entropy) are desired. For example, in anomaly detection, one might want to reconstruct normal data points accurately while distinguishing them clearly from anomalous ones.

  • The Anisotropic Denoiser highlights the importance of preserving domain-specific structural information (like graph anisotropy) within generative models. This principle could be generalized to other structured data types (e.g., preserving specific features in spatio-temporal data or medical images undergoing generative augmentation).

  • The AR-GSP provides a blueprint for adaptive negative sampling in dynamic, noisy environments, which is relevant for any generative model or reinforcement learning setup where intermediate states have varying reliability and hard negative mining is crucial.

    Potential Issues, Unverified Assumptions, or Areas for Improvement:

  1. Complexity and Hyperparameter Tuning: The TV-Diff framework introduces multiple interacting components and hyperparameters (tt for Helmholtz energy, γ\gamma for AR, λ\lambda for Gumbel Softmax, plus all standard DM hyperparameters). While ablation studies explore sensitivity, tuning all these parameters for optimal performance across new datasets could be computationally intensive and complex. The balance between energy and entropy (controlled by tt) is crucial and might require careful domain knowledge.

  2. Scalability of Thermodynamic Interpretations: While insightful, the direct mappings of "energy" and "entropy" from physics to information theory (especially with the specific definitions used) might be more of an analogy than a deep physical equivalence. Their effectiveness is empirically demonstrated, but the conceptual leap might warrant further theoretical grounding or more generalized definitions that hold across various data types and tasks.

  3. Efficiency of AR-GSP on Extremely Large Item Sets: The AR-GSP involves ranking item scores to identify informative negatives. While the paper states O(Bnlogn)O(|\mathcal{B}|n \log n) for ranking is "omitted" (implying it's not the bottleneck), for extremely large nn (millions of items), even this step could become significant, especially if not highly optimized. Further work might explore approximate ranking or sampling strategies for ultra-large item catalogs.

  4. Implicit Feedback Limitations: The framework focuses on implicit feedback. While common, handling explicit feedback (e.g., ratings) or mixed feedback within this thermodynamic framework could introduce new complexities for defining energy and entropy, and might require extensions to the loss functions.

  5. Interpretability of Anisotropic Denoiser: While the denoiser explicitly accounts for user/item aspects, the Agg function and the interaction with time embeddings are still somewhat abstract. Further exploration into how the denoiser learns and utilizes these anisotropic signals at different timesteps could enhance interpretability and provide deeper insights.

  6. "Completeness" of Recommender Models: The definition of "completeness" from a thermodynamic view is a compelling starting point. However, recommendation quality involves many facets (e.g., diversity, novelty, fairness, explainability) beyond just accuracy and confidence. Future work could explore how the thermodynamic framework might be extended to incorporate and balance these other dimensions of completeness.

    Overall, TV-Diff represents a significant step towards more principled and robust diffusion-based recommender systems, offering a strong foundation for future research at the intersection of generative models, graph learning, and thermodynamic optimization.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.