- Title: SPARC: Soft Probabilistic Adaptive multi-interest Retrieval Model via Codebooks for recommender system
- Authors: Jialiang Shi, Yaguang Dou, Tian Qi
- Affiliations: Shanghai Dewu Information Group Co., Ltd., Shanghai, China
- Journal/Conference: Submitted to the ACM International Conference on Web Search and Data Mining (WSDM) 2025. WSDM is a top-tier venue for information retrieval, data mining, and web search.
- Publication Year: 2025 (as per the submission). The version referenced is v2 of the preprint.
- Abstract: The paper addresses key challenges in multi-interest retrieval for recommender systems (RS). Current methods suffer from using static, predefined interests and an over-exploitative online inference strategy that neglects novel interests. The proposed solution, SPARC, is a retrieval framework that uses a Residual Quantized Variational Autoencoder (RQ-VAE) to create a dynamic, discrete interest space (a "codebook"). Crucially, this RQ-VAE is trained end-to-end with the main recommendation model, allowing the interests to evolve based on user feedback. For online inference, SPARC introduces a probabilistic "soft-search" mechanism that predicts a user's interest distribution and explores multiple potential interests, shifting from "passive matching" to "proactive exploration". The method showed significant gains in online A/B tests on a large industrial platform (+0.9% view duration, +22.7% new content discovery) and offline experiments on the Amazon Books dataset.
- Original Source Link:
2. Executive Summary
- Background & Motivation (Why):
Modern recommender systems rely on a multi-stage architecture where the initial retrieval stage is critical, as it sets the upper bound on recommendation quality. While multi-interest models have improved recommendations by representing users with multiple interest vectors, they face three major hurdles:
- Static Interests: Interests are often defined using external, fixed knowledge (e.g., human-defined categories or clusters from pre-trained models). These static representations cannot adapt to a user's evolving tastes in real-time.
- Over-Exploitation: Online, these models tend to perform a "hard search," focusing only on a user's most dominant historical interests. This leads to repetitive recommendations and fails to uncover new or niche (long-tail) interests, harming user discovery experience.
- Cold-Start Problem: It is difficult to model multiple interests for new users or for users with very few interactions.
- Main Contributions / Findings (What):
The paper introduces SPARC, a novel retrieval framework that addresses these problems with two key innovations:
- Behavior-Aware Dynamic Interest Space: SPARC redefines "interests" as learnable, discrete codes within a codebook. It is the first work to jointly train a Residual Quantized VAE (RQ-VAE) with a large-scale industrial retrieval model in an end-to-end manner. This allows the interest codebook to be dynamically optimized based on user interaction signals (e.g., clicks), bridging the gap between representation learning and the recommendation task. The interests are no longer static but become "behavior-aware prototypes."
- Probabilistic Interest Exploration: SPARC includes a probabilistic module that predicts a user's interest distribution across the entire learned codebook. This enables a "soft-search" online strategy where the system proactively explores multiple high-probability interests in parallel. This shifts the retrieval paradigm from passively matching known interests to proactively exploring potential new ones, significantly improving recommendation diversity and novelty.
3. Background & Related Work
This section explains the foundational concepts needed to understand the paper and situates SPARC within the existing literature.
4. Methodology (Core Technology & Implementation)
This section provides a detailed breakdown of the SPARC framework, as illustrated in Figure 1.
Figure 1: Overview of the SPARC framework, showing (a) the RQ-VAE with the two-tower model, (b) the representation alignment losses, (c) the interest disentanglement loss, and (d) the probabilistic interest model, covering both the training and inference flow.
The framework is composed of four interconnected parts: (a) the main RQ-VAE and Two-Tower model, (b) representation alignment losses, (c) an interest disentanglement loss, and (d) a probabilistic interest model for online serving.
- 4.1 Overall Architecture:
- Item Tower: Encodes raw item features into a dense embedding vector $z$.
- Residual Quantization (RQ-VAE): Takes the item embedding $z$ and quantizes it into a sequence of discrete codes using a three-level codebook, and also reconstructs an approximation $z_{recon}$ of the original vector. The codebook is a learnable part of the model.
- User Tower: Generates a user embedding $u$ from user features and historical interactions. Critically, it uses the item's quantized representation (e.g., $z_{recon}$) to attend to the user's behavior sequence, producing a context-aware user vector.
- Probabilistic Interest Module: A separate tower trained to predict the user's interest distribution over the first-level codebook; it is used during online inference.
- Prediction: The final click-through rate (pCTR) is predicted from the similarity (dot product) between the user embedding $u$ and the original item embedding $z$.
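Before the component-level details, a skeletal PyTorch view of how these four parts connect may help. It is a sketch under assumptions, not the authors' implementation: layer shapes are arbitrary, concatenation stands in for the attention of Section 4.3, and the quantizer is the one sketched in Section 4.2 below.

```python
import torch
import torch.nn as nn

class SPARCSkeleton(nn.Module):
    def __init__(self, quantizer, dim=64, num_level1_codes=256):
        super().__init__()
        self.item_tower = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.user_tower = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.interest_tower = nn.Linear(dim, num_level1_codes)  # P(interest | u) head
        self.quantizer = quantizer                              # RQ-VAE module (Section 4.2)

    def forward(self, item_feats, user_feats):
        z = self.item_tower(item_feats)                     # item embedding z
        z_recon, codes, rq_loss = self.quantizer(z)         # discrete codes + reconstruction
        # The user tower conditions on the quantized item view (interest-aware).
        u = self.user_tower(torch.cat([user_feats, z_recon], dim=-1))
        logit = (u * z).sum(-1)                             # dot-product pCTR logit
        interest_logits = self.interest_tower(user_feats)   # used for online soft-search
        return logit, interest_logits, rq_loss
```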
- 4.2 End-to-End Residual Quantization:
This is the core of SPARC's interest definition. The model uses an RQ-VAE with $M=3$ levels. For an item embedding $z$:
- Level 1: Find the nearest codeword $e_0$ in the first codebook $C_0$. The first residual is $r_1 = z - e_0$.
- Level 2: Find the nearest codeword $e_1$ in the second codebook $C_1$ to the residual $r_1$. The new residual is $r_2 = r_1 - e_1$.
- Level 3: Find the nearest codeword $e_2$ in the third codebook $C_2$ to the residual $r_2$.
The discrete representation of the item is the tuple of code indices $(idx_0, idx_1, idx_2)$, and the reconstructed vector is $z_{recon} = e_0 + e_1 + e_2$. The first-level code $e_0$ captures the item's core interest, while subsequent codes capture finer details.
The RQ-VAE is trained with a loss $\mathcal{L}_{rqvae}$ that is part of the total model loss:
$$\mathcal{L}_{rqvae} = \|z - z_{recon}\|_2^2 + \beta \sum_{k=0}^{2} \|\mathrm{sg}(r_k) - e_k\|_2^2$$
- $\|z - z_{recon}\|_2^2$: The reconstruction loss, which pushes the reconstructed vector to be close to the original.
- $\|\mathrm{sg}(r_k) - e_k\|_2^2$: The codebook commitment loss, where $\mathrm{sg}(\cdot)$ is the stop-gradient operator that prevents gradients from flowing back into the encoder through this term. It encourages the chosen codewords $e_k$ to stay close to their corresponding residuals $r_k$, effectively "committing" the encoder to the codebook.
- Because $\mathcal{L}_{rqvae}$ is part of the total loss, gradients from the main recommendation task can flow back and update the codebooks, making them "behavior-aware".
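To make the mechanics concrete, here is a minimal PyTorch sketch of the three-level quantization and $\mathcal{L}_{rqvae}$. It is an illustration, not the paper's code: the commitment weight `beta` and the straight-through estimator detail are assumptions borrowed from standard VQ-VAE practice.

```python
import torch
import torch.nn.functional as F

def residual_quantize(z, codebooks, beta=0.25):
    """Three-level residual quantization with L_rqvae (a sketch).

    z: (batch, dim) item embeddings.
    codebooks: list of nn.Embedding, one per residual level.
    beta: assumed commitment weight (a common VQ-VAE default).
    """
    residual, parts, indices, commit = z, [], [], 0.0
    for codebook in codebooks:
        dists = torch.cdist(residual, codebook.weight)       # (batch, num_codes)
        idx = dists.argmin(dim=-1)                           # nearest codeword
        e = codebook(idx)
        commit = commit + F.mse_loss(e, residual.detach())   # ||sg(r_k) - e_k||^2
        parts.append(e)
        indices.append(idx)
        residual = residual - e                              # next-level residual
    quantized = torch.stack(parts).sum(dim=0)                # e0 + e1 + e2
    loss = F.mse_loss(quantized, z) + beta * commit          # L_rqvae
    # Straight-through estimator: the forward pass uses the quantized vector,
    # while downstream gradients flow back into z (and hence the item tower).
    z_recon = z + (quantized - z).detach()
    return z_recon, torch.stack(indices, dim=-1), loss

# Example usage (sizes assumed): three codebooks of 256 codes, dim 64.
# codebooks = torch.nn.ModuleList(torch.nn.Embedding(256, 64) for _ in range(3))
# z_recon, codes, l_rq = residual_quantize(torch.randn(32, 64), codebooks)
```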
- 4.3 Probabilistic Interest-Aware User Representation:
The User Tower uses an attention mechanism in which the item's quantized representation acts as the query and the user's historical behavior sequence provides the keys and values. To make the model robust, a Mixed Cross-Feature Strategy is used during training:
- 50% of the time, only the coarse, first-level codeword $e_0$ is used as the query.
- 50% of the time, the full reconstructed vector $z_{recon}$ is used as the query.
This trains the User Tower to be effective with both coarse and fine-grained interest signals.
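A minimal sketch of this mixed-query strategy, assuming a standard multi-head attention layer over the behavior sequence; the per-example coin flip, tensor shapes, and module choice are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

def mixed_query(e0, z_recon, p_coarse=0.5):
    """Per example: use the coarse level-1 codeword e0 as the attention query
    with probability p_coarse, otherwise the full reconstruction z_recon."""
    coarse = torch.rand(e0.shape[0], 1) < p_coarse       # (batch, 1) coin flips
    return torch.where(coarse, e0, z_recon)

# Example: build a context-aware user vector from the behavior sequence.
dim, batch, seq_len = 64, 8, 50
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
e0, z_recon = torch.randn(batch, dim), torch.randn(batch, dim)
behaviors = torch.randn(batch, seq_len, dim)             # embedded historical items
q = mixed_query(e0, z_recon).unsqueeze(1)                # (batch, 1, dim) query
user_vec, _ = attn(q, behaviors, behaviors)              # keys/values = history
```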
- 4.4 Multi-Task Optimization Framework for Disentanglement and Alignment:
The total loss function is a weighted sum of components designed to learn high-quality, disentangled interest representations (a sketch of the auxiliary losses follows the list below):
$$\mathcal{L}_{total} = \mathcal{L}_{BCE} + \lambda_{bpr}\mathcal{L}_{BPR\_shuffle} + \lambda_{interest}\mathcal{L}_{interest} + \lambda_{ui}\mathcal{L}_{CL\_ui} + \lambda_{ii}\mathcal{L}_{CL\_ii} + \lambda_{rq}\mathcal{L}_{rqvae}$$
- $\mathcal{L}_{BCE}$: The primary binary cross-entropy loss for the click prediction task.
- $\mathcal{L}_{BPR\_shuffle}$ (Interest Disentanglement Loss): This novel BPR loss enhances the distinguishability of interest codes. For a positive user-item pair $(u, i)$, it generates a positive user vector $u_q$ using the item's core interest code, then samples core interest codes from other items in the batch to generate negative user vectors $u'_{q,j}$. The loss pushes the score of the positive pair $(u_q, z)$ above the scores of the negative pairs $(u'_{q,j}, z)$, forcing the user tower to be sensitive to the input interest code.
$$\mathcal{L}_{BPR\_shuffle} = -\sum_{(u,i)\in D}\sum_{j=1}^{J} \log\sigma\left(u_q^\top z - (u'_{q,j})^\top z\right)$$
- $\mathcal{L}_{CL\_ii}$ and $\mathcal{L}_{CL\_ui}$ (Representation Alignment Losses): These are two in-batch contrastive losses.
  - $\mathcal{L}_{CL\_ii}$ (Item-Item CL) pulls the original item embedding $z$ toward its own reconstruction $z_{recon}$, relative to the reconstructions of other items in the batch:
$$\mathcal{L}_{CL\_ii} = -\log \frac{\exp(\langle z, z_{recon}\rangle/\tau)}{\sum_{j\in \text{batch}}\exp(\langle z, z_{j,recon}\rangle/\tau)}$$
  - $\mathcal{L}_{CL\_ui}$ (User-Item CL) aligns the user embedding $u$ with the positive item's reconstruction $z_{recon}$, relative to other items' reconstructions:
$$\mathcal{L}_{CL\_ui} = -\log \frac{\exp(\langle u, z_{recon}\rangle/\tau)}{\sum_{j\in \text{batch}}\exp(\langle u, z_{j,recon}\rangle/\tau)}$$
- $\mathcal{L}_{interest}$ (Interest Supervision Loss): This loss trains the separate Interest Probability Tower to predict the first-level interest code of the positive item, given user features.
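The sketch below shows plausible implementations of these auxiliary losses, under assumptions the summary leaves open: InfoNCE with in-batch negatives for both alignment losses, in-batch shuffling of level-1 codes for the $J$ BPR negatives, and cross-entropy for the interest supervision. `user_tower_fn` and the other names are placeholders.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive(anchor, positive, tau=0.1):
    """InfoNCE over the batch; covers both L_CL_ii (anchor = z) and
    L_CL_ui (anchor = u), with positive = z_recon. tau is assumed."""
    logits = anchor @ positive.t() / tau                 # (batch, batch) similarities
    labels = torch.arange(anchor.shape[0], device=anchor.device)
    return F.cross_entropy(logits, labels)               # diagonal entries are positives

def bpr_shuffle_loss(user_tower_fn, user_feats, e0, z, num_neg=4):
    """L_BPR_shuffle: a user vector built with the item's own core code should
    score higher against z than user vectors built with other items' codes."""
    u_pos = user_tower_fn(user_feats, e0)                # (batch, dim)
    pos_score = (u_pos * z).sum(-1)
    loss = 0.0
    for _ in range(num_neg):                             # J shuffled negatives
        perm = torch.randperm(e0.shape[0])
        u_neg = user_tower_fn(user_feats, e0[perm])      # wrong core interest code
        loss = loss - F.logsigmoid(pos_score - (u_neg * z).sum(-1)).mean()
    return loss / num_neg

def interest_supervision_loss(interest_logits, level1_idx):
    """L_interest: cross-entropy of the Interest Probability Tower's logits
    over the level-1 codes vs. the positive item's actual level-1 code."""
    return F.cross_entropy(interest_logits, level1_idx)
```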
- 4.5 Online Serving with Probabilistic Exploration:
This "soft-search" mechanism revolutionizes online inference:
- Phase 1: Interest Prediction: For an incoming user request, the pre-trained Interest Probability Tower computes a probability distribution $P(\text{interest} \mid u)$ over the 256 first-level interest codes.
- Phase 2: Parallel Retrieval:
  - The top-K most probable interest codes are selected (e.g., $K=5$).
  - For each of the K codes, the corresponding codeword vector $e_{0,k}$ is used to generate a distinct user embedding $u_k$.
  - These K user vectors are used to run K parallel ANN searches over the item index.
  - The results are aggregated and de-duplicated to form the final candidate set.
This allows the system to proactively explore multiple potential interests, rather than just the single most dominant one, enhancing discovery and diversity.
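A minimal sketch of the two-phase soft-search at serving time; brute-force dot-product scoring stands in for the production ANN index, and `interest_tower_fn` / `user_tower_fn` are illustrative placeholders.

```python
import torch

def soft_search(user_feats, interest_tower_fn, user_tower_fn,
                level1_codebook, item_index, k_interests=5, per_search=100):
    """Phase 1: predict P(interest | u) over the level-1 codes.
    Phase 2: one interest-conditioned user vector and one search per code."""
    probs = interest_tower_fn(user_feats).softmax(-1)    # over the 256 level-1 codes
    top_codes = probs.topk(k_interests).indices          # K most probable interests
    candidates = []
    for idx in top_codes:
        e0_k = level1_codebook[idx]                      # codeword for interest k
        u_k = user_tower_fn(user_feats, e0_k)            # interest-conditioned user vector
        scores = item_index @ u_k                        # brute-force stand-in for ANN
        candidates.append(scores.topk(per_search).indices)
    # Aggregate and de-duplicate the K result lists into one candidate set.
    return torch.unique(torch.cat(candidates))
```

Note the exploration-cost knob: a larger `k_interests` widens discovery but multiplies the number of ANN lookups per request.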
5. Experimental Setup
- Datasets:
- Amazon Books: A public 5-core dataset where each user and item has at least five interactions. Positive feedback is defined as a rating of 4.0 or higher.
- Industrial Dataset: A large-scale proprietary dataset from a commercial platform with tens of millions of daily active users.
The table below (transcribed from the paper) summarizes the dataset statistics.
Table 1: Statistics of datasets.

| Dataset | # users | # items | # interactions |
| --- | --- | --- | --- |
| Amazon Books | 459,133 | 313,966 | 8,898,041 |
| Industrial Dataset | 358,920,221 | 43,156,789 | 4,732,456,754 |
- Evaluation Metrics (a combined computation sketch follows this list):
- Recall@K:
  - Definition: Measures the proportion of test items that are found within the top-K recommended items. It evaluates the model's ability to retrieve relevant items.
  - Formula:
$$\text{Recall@K} = \frac{|\{\text{relevant items}\} \cap \{\text{top-K recommended items}\}|}{|\{\text{relevant items}\}|}$$
  - Symbols: In this paper's setting (one positive item per user), Recall@K is 1 if the positive item appears in the top-K and 0 otherwise, averaged over all users.
- NDCG@K (Normalized Discounted Cumulative Gain@K):
  - Definition: A ranking-quality metric that evaluates how well the relevant items are ranked, giving higher scores for placing relevant items near the top of the list.
  - Formula:
$$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}, \quad \text{where } \text{DCG@K} = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)}$$
  - Symbols: $rel_i$ is the relevance of the item at rank $i$ (1 if it is the positive item, 0 otherwise), and IDCG@K is the DCG of a perfect ranking.
- MRR (Mean Reciprocal Rank):
  - Definition: The average of the reciprocal rank at which the first relevant item is found; it heavily rewards placing the correct item very high in the list.
  - Formula:
$$\text{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_i}$$
  - Symbols: $|Q|$ is the number of users (queries), and $rank_i$ is the rank of the first correct item for the $i$-th user.
- Coverage@K:
  - Definition: The percentage of unique items from the entire corpus that appear in at least one user's top-K recommendation list; a measure of catalog diversity.
  - Formula:
$$\text{Coverage@K} = \frac{\left|\bigcup_{u\in U} \text{TopK}(u)\right|}{|I|}$$
  - Symbols: $U$ is the set of all users, $\text{TopK}(u)$ is the top-K list for user $u$, and $I$ is the set of all items.
- ILD@K (Intra-List Diversity@K):
  - Definition: The average pairwise dissimilarity of items within a single recommendation list; higher ILD means the items recommended to a user are more diverse.
  - Formula:
$$\text{ILD@K} = \frac{1}{|U|}\sum_{u\in U}\frac{\sum_{i,j\in \text{TopK}(u),\, i\neq j}\left(1 - \cos(v_i, v_j)\right)}{K(K-1)}$$
  - Symbols: $v_i$ and $v_j$ are the embedding vectors of items $i$ and $j$.
- PV500: An online business metric measuring the number of new content pieces that reach 500 Page Views (PVs) within 24 hours. It directly reflects the system's ability to discover and promote new content.
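As referenced above, here is a compact NumPy sketch of the offline metrics in the one-positive-per-user setting (PV500 is an online counter and cannot be reproduced offline); variable names are illustrative.

```python
import numpy as np

def recall_ndcg_mrr_at_k(ranks, k):
    """ranks: 1-based rank of each user's single positive item."""
    ranks = np.asarray(ranks, dtype=float)
    hit = ranks <= k
    recall = hit.mean()                                         # Recall@K
    ndcg = np.where(hit, 1.0 / np.log2(ranks + 1), 0.0).mean()  # IDCG@K = 1 here
    mrr = (1.0 / ranks).mean()                                  # MRR
    return recall, ndcg, mrr

def coverage_at_k(topk_lists, num_items):
    """Fraction of the corpus appearing in at least one user's top-K list."""
    return len(set().union(*map(set, topk_lists))) / num_items

def ild_at_k(topk_embs):
    """Mean pairwise cosine distance within each user's top-K list.
    topk_embs: (num_users, K, dim) embeddings of the recommended items."""
    e = topk_embs / np.linalg.norm(topk_embs, axis=-1, keepdims=True)
    sims = e @ e.transpose(0, 2, 1)                  # (U, K, K) cosine matrix
    k = e.shape[1]
    off_diag = sims.sum(axis=(1, 2)) - k             # drop the unit diagonal
    return (1.0 - off_diag / (k * (k - 1))).mean()

# e.g. recall_ndcg_mrr_at_k([1, 3, 80], k=50) -> (0.667, 0.5, 0.449) approximately
```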
- Baselines:
- Two-Tower (YouTubeDNN): A classic two-tower model with mean pooling.
- MIND: A multi-interest model using a capsule network with dynamic routing.
- ComiRec: A state-of-the-art multi-interest model using self-attention.
- SPARC-Hard: A variant of SPARC that uses a "hard-search" strategy (deterministic selection of top interests) instead of the probabilistic soft-search.
- SPARC-Static: A variant whose codebook is pre-computed with K-Means and kept fixed during training, mimicking the traditional two-stage approach.
6. Results & Analysis
The experiments systematically validate SPARC's effectiveness across multiple dimensions.
- RQ1: Overall Performance Comparison
Table 2 shows that SPARC significantly outperforms all baselines, including the strong multi-interest model ComiRec, across all standard retrieval metrics. For instance, it achieves a +5.54% relative improvement in Recall@50 and +5.73% in NDCG@50 over ComiRec, confirming the overall superiority of the SPARC framework.
Table 2: Overall performance comparison on the Amazon Books dataset.

| Model | Recall@20 | NDCG@20 | Recall@50 | NDCG@50 | MRR |
| --- | --- | --- | --- | --- | --- |
| Two-Tower | 0.1852 | 0.1033 | 0.3015 | 0.1246 | 0.0812 |
| MIND | 0.2014 | 0.1168 | 0.3321 | 0.1395 | 0.0925 |
| ComiRec | 0.2088 | 0.1215 | 0.3413 | 0.1448 | 0.0967 |
| SPARC-Static | 0.1985 | 0.1132 | 0.3276 | 0.1361 | 0.0901 |
| SPARC-Hard | 0.2075 | 0.1201 | 0.3398 | 0.1432 | 0.0954 |
| SPARC (ours) | 0.2216 | 0.1294 | 0.3602 | 0.1531 | 0.1038 |
| Improv. vs runner-up | +6.13% | +6.50% | +5.54% | +5.73% | +7.34% |
- RQ2: Analysis of Novelty and Long-Tail Discovery
The results strongly support the claim that SPARC excels at interest discovery.
  - Long-tail Performance (Table 3): SPARC shows a remarkable +24.2% relative improvement in Recall@50 for tail items (the least popular 20%) compared to ComiRec. This helps explain its strong online PV500 result (+22.7%): the model is much better at surfacing and promoting new and niche content.
  - Diversity and Coverage (Table 4): SPARC also leads significantly in Coverage@50 (+15.7%) and ILD@50 (+5.67%), indicating that it recommends a wider variety of items and constructs more diverse recommendation lists for users.
Table 3: Recall@50 performance on items of different popularity levels.

| Model | Head (Top 20%) | Torso (20%-60%) | Tail (Bottom 20%) |
| --- | --- | --- | --- |
| Two-Tower | 0.5512 | 0.2843 | 0.0411 |
| MIND | 0.5489 | 0.3155 | 0.0586 |
| ComiRec | 0.5501 | 0.3248 | 0.0632 |
| SPARC (ours) | 0.5623 | 0.3477 | 0.0785 |
| Improv. vs runner-up | +2.01% | +7.05% | +24.2% |
Table 4: Performance on novelty and diversity metrics.

| Model | Coverage@50 | ILD@50 |
| --- | --- | --- |
| Two-Tower | 0.085 | 0.682 |
| MIND | 0.102 | 0.725 |
| ComiRec | 0.108 | 0.741 |
| SPARC (ours) | 0.125 | 0.783 |
| Improv. vs runner-up | +15.7% | +5.67% |
- RQ3: Ablation Study
The ablation results in Table 2 demonstrate the value of SPARC's key components:
  - SPARC vs. SPARC-Hard: The full SPARC model outperforms SPARC-Hard, confirming the effectiveness of the probabilistic "soft-search" mechanism, which yields a more exploratory and effective user representation than a deterministic "hard-search".
  - SPARC vs. SPARC-Static: The large gap between SPARC and SPARC-Static (which even falls below the multi-interest baselines) underscores the importance of training the dynamic codebook end-to-end. A static, pre-trained codebook is suboptimal because it is not aligned with the downstream recommendation task.
- RQ4: Analysis for Cold-Start Users
Table 5 shows that SPARC's advantage is largest for users with sparse interaction histories (length [5, 10]), where it achieves a +11.84% relative improvement in NDCG@50 over ComiRec. This suggests that SPARC's probabilistic exploration mechanism handles uncertainty well and generalizes to users whose preferences are not yet well established.
Table 5: NDCG@50 performance on user groups with different activity levels.

| Model | Len: [5, 10] (Sparse) | Len: [11, 20] (Medium) | Len: [21, 50] (Active) |
| --- | --- | --- | --- |
| Two-Tower | 0.0825 | 0.1198 | 0.1451 |
| ComiRec | 0.0988 | 0.1385 | 0.1663 |
| SPARC (ours) | 0.1105 | 0.1492 | 0.1741 |
| Improv. vs runner-up | +11.84% | +7.73% | +4.69% |
7. Conclusion & Reflections
- Conclusion Summary:
The paper introduces SPARC, an innovative end-to-end retrieval framework that tackles fundamental limitations in multi-interest modeling. By jointly training an RQ-VAE with the retrieval model, it creates a dynamic, "behavior-aware" discrete interest space. Its probabilistic "soft-search" mechanism transforms online inference from passive matching to proactive exploration. Comprehensive online and offline experiments validate that SPARC significantly improves recommendation accuracy, diversity, and novelty, particularly for long-tail content and cold-start users.
- Limitations & Future Work:
The authors do not explicitly list limitations, but we can infer some:
- Training Complexity: The multi-task learning framework involves multiple loss functions with corresponding weighting hyperparameters (λs). Tuning these weights can be complex and computationally expensive.
- Inference Overhead: The "soft-search" strategy requires K parallel ANN searches, which increases computational and latency costs during online serving compared to a single search. A trade-off between the exploration gain (from a larger K) and system cost must be carefully managed.
- Interpretability of Codes: While the discrete codes are more structured than continuous embeddings, directly interpreting what real-world concept each codeword represents (e.g., "vintage fashion," "sci-fi books") remains a challenge that requires further qualitative analysis.
Future work could explore more advanced quantization techniques, apply the framework to other domains like multimodal recommendation, or develop adaptive methods to dynamically choose the number of exploration paths (K) based on user uncertainty.
- Personal Insights & Critique:
SPARC represents a significant step forward in the design of retrieval systems. Its core contribution—the end-to-end integration of a learnable, discrete representation space (the codebook) with the recommendation task—is a powerful paradigm. It elegantly solves the "semantic-behavior gap" that plagued previous two-stage approaches.
The "soft-search" mechanism is also a highly practical and impactful innovation. It provides a theoretically grounded way to balance the exploration-exploitation dilemma at the retrieval stage, which is directly responsible for the impressive gains in novelty and long-tail discovery. The strong performance on cold-start users further highlights the robustness of this probabilistic approach.
Overall, SPARC is a well-designed and thoroughly validated framework that not only pushes the state-of-the-art in multi-interest retrieval but also offers a practical blueprint for building more intelligent and exploratory industrial recommender systems.