1. Bibliographic Information

1.1. Title

Collaborative Diffusion Model for Recommender System

1.2. Authors

Gyuseok Lee: Pohang University of Science and Technology (POSTECH), South Korea.
Yaochen Zhu: University of Virginia, USA.
Hwanjo Yu (Corresponding Author): Pohang University of Science and Technology (POSTECH), South Korea.
Yao Zhou: Google LLC, USA.
Jundong Li: University of Virginia, USA.

1.3. Journal/Conference

The paper was published in the Companion Proceedings of the ACM Web Conference 2025 (WWW Companion '25). The ACM Web Conference (formerly WWW) is a premier, high-impact venue (ranked A* in CORE) focusing on the future of the World Wide Web, including information retrieval and recommender systems.

1.4. Publication Year

Published on May 8, 2025.

1.5. Abstract

Existing diffusion-based recommender systems (DR) face a critical trade-off: injecting noise enhances generative capacity but erodes personalized user information. Furthermore, most DR models focus solely on user-item interactions, neglecting rich item-side information like reviews. To solve this, the authors propose CDiff4Rec (Collaborative Diffusion model for Recommender System). This model generates "pseudo-users" from item features and identifies personalized neighbors from both real and pseudo-users. By integrating these collaborative signals during the diffusion process, the model effectively reconstructs nuanced user preferences. Tests on three datasets show CDiff4Rec outperforms existing baselines by mitigating information loss.

1.6. Original Source Link

The paper can be accessed via ACM Digital Library. It is an officially published conference paper.

2. Executive Summary

2.1. Background & Motivation

Recommender systems aim to help users navigate massive amounts of information. While Collaborative Filtering (CF)—which uses the behavior of similar users to make predictions—is standard, deep generative models like Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) have been used to capture complex patterns. However, VAEs often suffer from "posterior collapse" (the model ignores the latent variables), and GANs are notoriously unstable to train.

Recently, Diffusion Models have emerged as a powerful alternative. They work by adding noise to data and then learning to "denoise" it. In recommendation, this means taking a user's noisy interaction history and reconstructing their "true" preferences.

The Problem:

Personalization vs. Generation Trade-off: If you add too much noise, you lose the specific things that make a user unique. If you add too little, the model can't generate new, high-quality recommendations.
Underutilization of Item Content: Most diffusion models for recommendation only look at who clicked what (user-oriented), ignoring why they clicked it (item-side information like reviews).

2.2. Main Contributions / Findings

Pseudo-User Concept: The authors treat item features (words from reviews) as "pseudo-users." For example, if the word "action" appears in several movies, "action" becomes a pseudo-user who has "interacted" with those movies.
Collaborative Signal Integration: The model identifies the top- $K$ most similar real users and pseudo-users for every user. These signals are used to guide the denoising process.
CDiff4Rec Model: A novel architecture that blends a user's own reconstructed signal with the signals from their "real" and "pseudo" neighbors.
Key Findings: The model significantly improves accuracy (measured by Recall and NDCG) without a major increase in computational time. It proves that item-side content, when framed as collaborative behavior, effectively fills the gaps left by noisy user data.

3.1. Foundational Concepts

3.1.1. Collaborative Filtering (CF)

This is a technique used by recommender systems to make automatic predictions about the interests of a user by collecting preferences from many users. The underlying assumption is that if User A has the same opinion as User B on one issue, A is more likely to have B's opinion on a different issue.

3.1.2. Diffusion Models

A class of generative models that involve two processes:

Forward Process (Diffusion): Gradually adding Gaussian noise to the data until it becomes complete noise.
Backward Process (Reverse): Learning a neural network to reverse the noise addition, step-by-step, to recover the original data. In recommendation, the "data" is the user's item interaction vector.
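
The forward process can be illustrated with a minimal sketch. It assumes the standard closed-form Gaussian step used by DDPM-style models, $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$; the linear schedule, the number of steps, and the toy interaction vector are illustrative assumptions, not values from the paper.

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) with the closed-form Gaussian forward step."""
    alpha_bar = np.cumprod(1.0 - betas)[t]   # share of the signal kept up to step t
    eps = rng.standard_normal(x0.shape)      # Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)          # hypothetical linear noise schedule
x0 = np.array([1.0, 0.0, 0.0, 1.0, 0.0])      # toy interaction vector over 5 items
x_early = forward_noise(x0, 0, betas, rng)    # almost identical to x0
x_late = forward_noise(x0, 99, betas, rng)    # visibly corrupted
```

The later the timestep, the smaller $\bar{\alpha}_t$ becomes, so less of the user's original interaction pattern survives. This is the personalization loss that the paper sets out to mitigate.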

3.1.3. TF-IDF (Term Frequency-Inverse Document Frequency)

A numerical statistic that reflects how important a word is to a document (or item) in a collection.

TF: How often a word appears in a specific item's reviews.
IDF: How rare the word is across all items. High TF-IDF scores are given to words that are frequent in one item but rare in others, making them good identifiers of that item's unique characteristics.

3.2. Previous Works

DiffRec (2023): The first model to apply diffusion to Collaborative Filtering. It showed that denoising corrupted interactions could lead to better recommendations than VAEs. However, it still struggled with the noise-personalization trade-off and didn't use item features.
MultiVAE (2018): A popular Variational Autoencoder approach for recommendation. It uses a "bottleneck" to learn a compressed representation of users but can lose detail.
Ease (2019): A linear model that uses neighborhood information effectively. The authors of CDiff4Rec use this as a baseline to show that their diffusion approach is superior to simple linear neighborhood models.

3.3. Technological Evolution

Recommendation shifted from simple Matrix Factorization (finding latent factors) to Deep Learning (MLPs), then to Graph Neural Networks (learning from the user-item graph), and now to Generative Models (VAEs, GANs, and Diffusion). Diffusion-based recommenders are presented as the current frontier of this line of work, with robustness to noisy and sparse interaction data cited as their main advantage over earlier generative methods.

4. Methodology

4.1. Principles

The core intuition of CDiff4Rec is that a user's preference is not just an isolated signal. It is a combination of:

Their own past behavior.
The behavior of similar Real Users.
The characteristics of the items they like, represented by Pseudo-Users.

By aggregating these three signals, the model can reconstruct a user's preference even if the diffusion process has added significant noise to their original data.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Stage 1: Pseudo-User Generation from Item Reviews

The authors extract words from item reviews to create features. Each feature $f$ (word) is treated as a pseudo-user.

Feature Matrix: Let $M \in \mathbb{R}^{|\mathcal{F}| \times |\mathcal{I}|}$ be the matrix where rows are features and columns are items.
TF-IDF & Normalization: For each feature $f$ , they compute its TF-IDF score across items. They then apply Min-Max Normalization to scale the values between 0 and 1.
Pseudo-User Vector: The result is a vector $\mathbf{m}_{pu} \in [0, 1]^{|\mathcal{I}|}$ , which represents the "interaction history" of that word. If the word "battery" appears strongly in the reviews of five specific phones, the pseudo-user "battery" has high interaction values for those phones.
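
The pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the smoothed IDF variant, the per-feature (row-wise) min-max scaling, and the toy word counts are all assumptions, and `build_pseudo_users` is a hypothetical helper name.

```python
import numpy as np

def build_pseudo_users(counts):
    """counts: (num_features, num_items) raw word counts from item reviews.
    Returns a matrix of the same shape with each row scaled to [0, 1]."""
    counts = np.asarray(counts, dtype=float)
    num_items = counts.shape[1]
    # Term frequency: share of each item's review words taken by the feature.
    item_totals = counts.sum(axis=0, keepdims=True)
    tf = counts / np.maximum(item_totals, 1.0)
    # Smoothed inverse document frequency: rarer features get larger weights.
    doc_freq = (counts > 0).sum(axis=1, keepdims=True)
    idf = np.log((1.0 + num_items) / (1.0 + doc_freq)) + 1.0
    tfidf = tf * idf
    # Min-max normalization per feature, guarding against constant rows.
    lo = tfidf.min(axis=1, keepdims=True)
    hi = tfidf.max(axis=1, keepdims=True)
    return (tfidf - lo) / np.where(hi > lo, hi - lo, 1.0)

counts = np.array([[3, 0, 1, 0],    # "battery"
                   [1, 1, 1, 1],    # "good"
                   [0, 2, 0, 0]])   # "camera"
M = build_pseudo_users(counts)       # each row is one pseudo-user vector
```

Each row of `M` can then be handled like a user's interaction vector, which is what lets the later stages treat words and real users uniformly.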

4.2.2. Stage 2: Personalized Top-K Neighbors Identification

For every real user $u$ , the model finds the most similar real users and pseudo-users.

Similarity Metric: They use Cosine Distance $\phi$ .
Neighbor Sets: They precompute two sets:
- $\mathcal{U}_{ru}^u$ : The top- $K$ real users most similar to user $u$ .
- $\mathcal{U}_{pu}^u$ : The top- $K$ pseudo-users (review words) most similar to user $u$ .
  
  The formulas for selection are: $ \mathcal{U}_{ru}^u = \{ ru \mid \underset{ru \in \mathcal{U}}{\arg\mathrm{sort}} \, \phi(\mathbf{r}_u, \mathbf{r}_{ru})[: K] \} $ $ \mathcal{U}_{pu}^u = \{ \hat{p}u \mid \underset{\hat{p}u \in \mathcal{F}}{\arg\mathrm{sort}} \, \phi(\mathbf{r}_u, \mathbf{m}_{\hat{p}u})[: K] \} $ Where:

$\mathbf{r}_u$ is the interaction history of user $u$ .
$\mathbf{r}_{ru}$ is the interaction history of a potential real neighbor.
$\mathbf{m}_{\hat{p}u}$ is the interaction vector of a potential pseudo-user.
[: K] denotes taking the first $K$ results after sorting by ascending distance, i.e., the $K$ most similar candidates.
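
The selection step can be sketched as below. It is a minimal illustration on toy data: the helper names are hypothetical, and excluding the user from their own real-neighbor list is an assumption that the text above does not state explicitly.

```python
import numpy as np

def cosine_distance(query, candidates):
    """Cosine distance between one vector and each row of a matrix."""
    q_norm = np.linalg.norm(query)
    c_norm = np.linalg.norm(candidates, axis=1)
    sim = candidates @ query / np.maximum(q_norm * c_norm, 1e-12)
    return 1.0 - sim

def top_k_neighbors(query, candidates, k, exclude=None):
    """Indices of the k rows of `candidates` closest to `query`."""
    dist = cosine_distance(query, candidates)
    if exclude is not None:
        dist[exclude] = np.inf            # assumption: a user is not their own neighbor
    return np.argsort(dist, kind="stable")[:k]

R = np.array([[1, 1, 0, 0],               # user 0
              [1, 1, 1, 0],               # user 1
              [0, 0, 1, 1],               # user 2
              [1, 0, 0, 0]], dtype=float) # user 3
M = np.array([[1.0, 0.8, 0.0, 0.0],       # pseudo-user 0
              [0.0, 0.0, 0.9, 1.0]])      # pseudo-user 1
real_nb = top_k_neighbors(R[0], R, k=2, exclude=0)
pseudo_nb = top_k_neighbors(R[0], M, k=1)
```

Because both neighbor sets depend only on the observed interaction and feature matrices, they can be computed once before training.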

4.2.3. Stage 3: Collaborative Signal Aggregation

This is the heart of the model. During the diffusion process, the standard DR model predicts a reconstructed interaction vector $\hat{\mathbf{r}}_u$ . CDiff4Rec modifies this by incorporating the neighbors.

The Aggregation Formula: $ \hat{\mathbf{r}}_u^{\prime} = \alpha \hat{\mathbf{r}}_u + \beta \sum_{ru_i \in \mathcal{U}_{ru}^u} a_{ru_i} \hat{\mathbf{r}}_{ru_i} + \gamma \sum_{pu_j \in \mathcal{U}_{pu}^u} a_{pu_j} \hat{\mathbf{m}}_{pu_j} $ Where:

$\hat{\mathbf{r}}_u^{\prime}$ is the final fine-grained preference representation.
$\alpha, \beta, \gamma$ are weights that sum to 1, controlling the influence of the user, real neighbors, and pseudo neighbors.
$\hat{\mathbf{r}}_{ru_i}$ and $\hat{\mathbf{m}}_{pu_j}$ are the model's predictions for the neighbors.
$a_{ru_i}$ and $a_{pu_j}$ are Attention Scores determining how much weight to give to each real neighbor and each pseudo neighbor, respectively.

The authors propose three ways to calculate the attention score $a_{ru_i}$ :

Average Pooling: $a_{ru_i} = \frac{1}{K}$ .
Behavior Similarity: Uses the precomputed cosine distance. $ a_{ru_i} = \frac{\exp(-\phi(\mathbf{r}_u, \mathbf{r}_{ru_i}))}{\sum_{ru_i \in \mathcal{U}_{ru}^u} \exp(-\phi(\mathbf{r}_u, \mathbf{r}_{ru_i}))} $
Parametric Modeling: Learns weights using a neural network. $ a_{ru_i} = \frac{\exp((\mathbf{W}_q^T \hat{\mathbf{r}}_u)^T (\mathbf{W}_k^T \hat{\mathbf{r}}_{ru_i}))}{\sum_{ru_i \in \mathcal{U}_{ru}^u} \exp((\mathbf{W}_q^T \hat{\mathbf{r}}_u)^T (\mathbf{W}_k^T \hat{\mathbf{r}}_{ru_i}))} $ Symbols: $\mathbf{W}_q, \mathbf{W}_k$ are learnable weight matrices.
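
A minimal sketch of the aggregation with behavior-similarity attention follows. The prediction arrays stand in for outputs of the denoising network, and the values of `alpha`, `beta`, `gamma` and all toy inputs are illustrative assumptions rather than settings from the paper.

```python
import numpy as np

def softmax_neg_distance(dist):
    """Behavior-similarity attention: softmax over negative distances."""
    dist = np.asarray(dist, dtype=float)
    w = np.exp(-(dist - dist.min()))      # shift for numerical stability
    return w / w.sum()

def aggregate(r_hat_u, r_hat_real, m_hat_pseudo, dist_real, dist_pseudo,
              alpha, beta, gamma):
    """Blend a user's own prediction with real and pseudo neighbor predictions."""
    a_real = softmax_neg_distance(dist_real)      # one weight per real neighbor
    a_pseudo = softmax_neg_distance(dist_pseudo)  # one weight per pseudo neighbor
    return (alpha * r_hat_u
            + beta * (a_real @ r_hat_real)
            + gamma * (a_pseudo @ m_hat_pseudo))

r_hat_u = np.array([0.9, 0.1, 0.0])               # user's own reconstruction
r_hat_real = np.array([[0.8, 0.4, 0.0],           # predictions for 2 real neighbors
                       [0.6, 0.0, 0.2]])
m_hat_pseudo = np.array([[1.0, 0.0, 0.5]])        # prediction for 1 pseudo neighbor
r_agg = aggregate(r_hat_u, r_hat_real, m_hat_pseudo,
                  dist_real=[0.2, 0.2], dist_pseudo=[0.1],
                  alpha=0.6, beta=0.2, gamma=0.2)
```

With equal distances the two real neighbors receive weight 0.5 each, so this case coincides with average pooling; unequal distances shift weight toward the closer neighbor.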

The following figure (Figure 1 from the original paper) shows the system architecture:

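Figure 1: The overview of the proposed Collaborative Diffusion Model for Recommender System (CDiff4Rec). The image is a schematic diagram showing how the proposed collaborative diffusion model (CDiff4Rec) works in a recommender system. It depicts the recommender, the predicted information, and the processing of preference signals from real users and pseudo-users. The formula $r_u = x_0$ denotes the user's initial ratings.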

4.2.4. Objective Function

The model is trained by minimizing the difference between the aggregated prediction and the ground truth (the actual items the user liked). $ \mathcal{L}_t = \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \left[ C \| \hat{\mathbf{r}}_u^{\prime} - \mathbf{r}_u \|_2^2 \right] $ Where:

$C$ is a scaling constant based on the noise schedule.
$\mathbf{x}_t$ is the noisy input at timestep $t$ .
$\mathbf{r}_u$ is the original, noise-free interaction history.
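
For a single sampled noisy input, the term inside the expectation reduces to a weighted squared error. The sketch below evaluates it on toy vectors; the value of `c` and the inputs are illustrative assumptions, and in practice the expectation would be approximated by averaging over sampled timesteps and noise draws.

```python
import numpy as np

def weighted_reconstruction_loss(r_agg, r_true, c):
    """C * ||r_agg - r_true||_2^2 for one sampled noisy input."""
    diff = np.asarray(r_agg, dtype=float) - np.asarray(r_true, dtype=float)
    return c * float(diff @ diff)

# Aggregated prediction vs. the user's noise-free interaction vector.
loss = weighted_reconstruction_loss([0.88, 0.10, 0.12], [1.0, 0.0, 0.0], c=1.0)
```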

5. Experimental Setup

5.1. Datasets

The authors used three public datasets:

Yelp: Business reviews and ratings.
Amazon-Game (AM-Game): User reviews and ratings for video games.
Citeulike-t: A dataset of users and the research papers they bookmarked, including titles and abstracts as features.

The following are the statistics from Table 1 of the original paper:

Dataset	#Users	#Items	#Interactions	Sparsity (%)
Yelp	26,695	20,220	942,328	99.83
AM-Game	2,343	1,700	39,263	99.01
Citeulike-t	7,947	25,975	132,275	99.94
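
The sparsity column can be reproduced from the other three columns, assuming the usual definition: the percentage of user-item pairs with no recorded interaction. A short check:

```python
def sparsity_pct(users, items, interactions):
    """Percentage of user-item pairs with no recorded interaction."""
    return 100.0 * (1.0 - interactions / (users * items))

yelp = round(sparsity_pct(26_695, 20_220, 942_328), 2)
am_game = round(sparsity_pct(2_343, 1_700, 39_263), 2)
citeulike = round(sparsity_pct(7_947, 25_975, 132_275), 2)
```

All three values agree with the table, so each user has interacted with well under 1% of the catalog.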

5.2. Evaluation Metrics

5.2.1. Recall@K (R@K)

Conceptual Definition: Measures the proportion of relevant items (items the user actually interacted with in the test set) that are successfully captured in the top $K$ recommendations.
Mathematical Formula: $ \mathrm{Recall@K} = \frac{|\mathrm{Rel} \cap \mathrm{Rec}_K|}{|\mathrm{Rel}|} $
Symbol Explanation: $\mathrm{Rel}$ is the set of relevant items for a user; $\mathrm{Rec}_K$ is the set of top $K$ items recommended by the model.
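
A minimal sketch of the formula for a single user, using hypothetical item identifiers:

```python
def recall_at_k(ranked_items, relevant, k):
    """Fraction of the relevant set that appears in the top-k ranked items."""
    if not relevant:
        return 0.0
    hits = len(set(ranked_items[:k]) & set(relevant))
    return hits / len(relevant)

# Top-3 list contains one of the three relevant items ("b").
score = recall_at_k(["a", "b", "c", "d"], {"b", "d", "e"}, k=3)
```

The reported R@K is typically the mean of this per-user value over all test users.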

5.2.2. NDCG@K (N@K)

Conceptual Definition: Normalized Discounted Cumulative Gain. It measures the quality of the ranking. It gives higher scores if the relevant items are at the very top of the list rather than near the bottom.
Mathematical Formula: $ \mathrm{DCG@K} = \sum_{i=1}^K \frac{2^{rel_i} - 1}{\log_2(i+1)} ; \quad \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
Symbol Explanation: $rel_i$ is the relevance of the item at rank $i$ ; $\mathrm{IDCG}$ is the Ideal DCG (the score if the list was perfectly sorted).
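
The sketch below applies the formula to one user, assuming binary relevance (1 if the item is in the user's test set, else 0), which is the common setting for implicit-feedback data. Item identifiers are hypothetical.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (rank = index + 1)."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_items, relevant, k):
    """NDCG@k with binary relevance."""
    gains = [1 if item in relevant else 0 for item in ranked_items]
    ideal = [1] * min(len(relevant), k)   # best case: relevant items fill the top
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# One relevant item ("b") at rank 2 out of three relevant items overall.
score = ndcg_at_k(["a", "b", "c", "d"], {"b", "d", "e"}, k=3)
perfect = ndcg_at_k(["b", "d", "e"], {"b", "d", "e"}, k=3)
```

Unlike Recall@K, moving "b" from rank 2 to rank 1 would raise the score, because the logarithmic discount rewards hits near the top of the list.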

5.3. Baselines

Traditional: BPRMF (Bayesian Personalized Ranking) and LightGCN (Graph Convolution).
Generative: MultiVAE (Variational Autoencoder) and DiffRec (Base Diffusion Model).
Content/Neighbor-aware: ConVAE (VAE with content) and Ease (Neighborhood-based).

6. Results & Analysis

6.1. Core Results Analysis

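The full CDiff4Rec variant (real and pseudo neighbors combined) outperformed every baseline in the results shown. Specifically:

It beat the base DiffRec, which suggests that neighbor information helps recover preference signal lost to noise.
It outperformed Ease, which suggests that the generative diffusion process adds value beyond a linear neighborhood model.

The "Real & Pseudo" version performed best overall, indicating that item content (pseudo-users) and user behavior (real users) are complementary. The single-source variants are less consistent: on Yelp, the real-user-only variant trails Ease on both metrics, and on AM-Game the pseudo-user-only variant edges out the full model on N@20 (0.1054 vs. 0.1052).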

The following are the results from Table 2 of the original paper:

Dataset	Model	R@20	N@20
Yelp	LightGCN	0.1047	0.0553
	BPRMF	0.0918	0.0485
	Ease	0.1099	0.0601
	MultiVAE	0.1056	0.0548
	ConVAE	0.1082	0.0564
	DiffRec	0.1045	0.0563
	Ours (+Real users)	0.1090	0.0590
	Ours (+Pseudo-users)	0.1074	0.0584
	Ours (+Real & Pseudo users)	0.1145***	0.0622*
AM-Game	LightGCN	0.1899	0.0854
	BPRMF	0.1839	0.0844
	Ease	0.2108	0.1012
	MultiVAE	0.2181	0.0995
	ConVAE	0.2192	0.0997
	DiffRec	0.2193	0.1034
	Ours (+Real users)	0.2204	0.1033
	Ours (+Pseudo-users)	0.2250	0.1054*
	Ours (+Real & Pseudo users)	0.2255*	0.1052

(Note: Asterisks denote statistical significance levels)

6.2. Accuracy and Efficiency Analysis

The authors compared the execution time and accuracy of DiffRec vs. CDiff4Rec. Despite the extra neighborhood calculations, the time increase was modest (e.g., from 0:37:47 to 0:41:45 on Yelp, roughly 10% longer), while accuracy improved on every reported metric.

The following are the results from Table 3 of the original paper:

Dataset	Model	R@10	R@100	N@10	N@100	Wall time
Yelp	DiffRec	0.0643	0.2818	0.0437	0.0982	0:37:47
Yelp	CDiff4Rec	0.0722	0.2966	0.0490	0.1057	0:41:45
AM-Game	DiffRec	0.1365	0.4689	0.0771	0.1478	0:25:13
AM-Game	CDiff4Rec	0.1442	0.4813	0.0806	0.1514	0:25:54

6.3. Hyperparameter Study

The authors varied the number of neighbors ( $K$ ) and pseudo-users. They found that Top-20 neighbors and 1,000 pseudo-users generally provided the best balance of speed and performance.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper presents CDiff4Rec, a diffusion model that effectively uses collaborative signals from both real users and item-review-based pseudo-users. This dual-source neighborhood approach compensates for the information loss that occurs when diffusion models add noise to user data. The model achieves state-of-the-art results on several datasets while remaining computationally efficient.

7.2. Limitations & Future Work

Static Features: The pseudo-users are based on TF-IDF, which is a static word-count method. Future work could use Large Language Models (LLMs) to create more semantic pseudo-user embeddings.
Cold Start: While the model uses item content, it still relies on behavior similarity for neighbor identification. This might struggle with brand-new users who have zero history.

7.3. Personal Insights & Critique

Innovation: The idea of "Pseudo-Users" is quite clever. It essentially converts a Content-Based Filtering problem into a Collaborative Filtering format. This allows the diffusion model to treat words exactly like users, meaning the same architecture can handle both content and behavior without needing complex multi-modal fusion layers.
Rigor: The authors were thorough in testing three different attention mechanisms, and their results indicate that "Behavior Similarity" (cosine) is often better than the more complex "Parametric" (neural) attention for this task. This is a helpful finding for practitioners looking for simplicity.
Critique: The paper is only 5 pages (a "Companion" proceedings paper), so it lacks a deep dive into the specific items the pseudo-users helped identify. A qualitative analysis (e.g., showing which words helped which users) would have made the findings even more convincing.
