- Title: SPARC: Soft Probabilistic Adaptive multi-interest Retrieval Model via Codebooks for recommender system
- Authors: Jialiang Shi, Yaguang Dou, Tian Qi
- Affiliations: Shanghai Dewu Information Group Co., Ltd., Shanghai, China
- Journal/Conference: Submitted to the ACM International Conference on Web Search and Data Mining (WSDM) 2025. WSDM is a top-tier venue for information retrieval, data mining, and web search.
- Publication Year: 2025 (as per the submission). The version referenced is v2 of the preprint.
- Abstract: The paper addresses key challenges in multi-interest retrieval for recommender systems (RS). Current methods suffer from using static, predefined interests and an over-exploitative online inference strategy that neglects novel interests. The proposed solution, SPARC, is a retrieval framework that uses a Residual Quantized Variational Autoencoder (RQ-VAE) to create a dynamic, discrete interest space (a "codebook"). Crucially, this RQ-VAE is trained end-to-end with the main recommendation model, allowing the interests to evolve based on user feedback. For online inference, SPARC introduces a probabilistic "soft-search" mechanism that predicts a user's interest distribution and explores multiple potential interests, shifting from "passive matching" to "proactive exploration". The method showed significant gains in online A/B tests on a large industrial platform (+0.9% view duration, +22.7% new content discovery) and offline experiments on the Amazon Books dataset.
- Original Source Link:
2. Executive Summary
- Background & Motivation (Why):
Modern recommender systems rely on a multi-stage architecture where the initial retrieval stage is critical, as it sets the upper bound on recommendation quality. While multi-interest models have improved recommendations by representing users with multiple interest vectors, they face three major hurdles:
- Static Interests: Interests are often defined using external, fixed knowledge (e.g., human-defined categories or clusters from pre-trained models). These static representations cannot adapt to a user's evolving tastes in real-time.
- Over-Exploitation: Online, these models tend to perform a "hard search," focusing only on a user's most dominant historical interests. This leads to repetitive recommendations and fails to uncover new or niche (long-tail) interests, harming user discovery experience.
- Cold-Start Problem: It is difficult to model multiple interests for new users or for users with very few interactions.
- Main Contributions / Findings (What):
The paper introduces SPARC, a novel retrieval framework that addresses these problems with two key innovations:
- Behavior-Aware Dynamic Interest Space: SPARC redefines "interests" as learnable, discrete codes within a codebook. It is the first work to jointly train a Residual Quantized VAE (RQ-VAE) with a large-scale industrial retrieval model in an end-to-end manner. This allows the interest codebook to be dynamically optimized based on user interaction signals (e.g., clicks), bridging the gap between representation learning and the recommendation task. The interests are no longer static but become "behavior-aware prototypes."
- Probabilistic Interest Exploration: SPARC includes a probabilistic module that predicts a user's interest distribution across the entire learned codebook. This enables a "soft-search" online strategy where the system proactively explores multiple high-probability interests in parallel. This shifts the retrieval paradigm from passively matching known interests to proactively exploring potential new ones, significantly improving recommendation diversity and novelty.
3. Background & Related Work
This section explains the foundational concepts needed to understand the paper and situates SPARC within the existing literature.
4. Methodology (Core Technology & Implementation)
This section provides a detailed breakdown of the SPARC framework, as illustrated in Figure 1.
Figure 1: Overview of the SPARC framework, showing (a) the RQ-VAE with the two-tower model, (b) the representation alignment losses, (c) the interest disentanglement loss, and (d) the probabilistic interest model, covering both the training and inference flow.
The framework is composed of four interconnected parts: (a) the main RQ-VAE and Two-Tower model, (b) representation alignment losses, (c) an interest disentanglement loss, and (d) a probabilistic interest model for online serving.
- 4.1 Overall Architecture:
- Item Tower: Encodes raw item features into a dense embedding vector $z$.
- Residual Quantization (RQ-VAE): Takes the item embedding $z$ and quantizes it into a sequence of discrete codes using a three-level codebook, and also reconstructs an approximation $z_{recon}$ of the original vector. The codebook is a learnable part of the model.
- User Tower: Generates a user embedding $u$ from user features and historical interactions. Critically, it uses the item's quantized representation (e.g., $z_{recon}$) to attend to the user's behavior sequence, producing a context-aware user vector.
- Probabilistic Interest Module: A separate tower trained to predict the user's interest distribution over the first-level codebook; it is used during online inference.
- Prediction: The final click-through rate (pCTR) is predicted from the similarity (dot product) between the user embedding $u$ and the original item embedding $z$.
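Before the component-level details, a skeletal PyTorch view of how these four parts connect may help. It is a sketch under assumptions, not the authors' implementation: layer shapes are arbitrary, concatenation stands in for the attention of Section 4.3, and the quantizer is the one sketched in Section 4.2 below.

```python
import torch
import torch.nn as nn

class SPARCSkeleton(nn.Module):
    def __init__(self, quantizer, dim=64, num_level1_codes=256):
        super().__init__()
        self.item_tower = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.user_tower = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.interest_tower = nn.Linear(dim, num_level1_codes)  # P(interest | u) head
        self.quantizer = quantizer                              # RQ-VAE module (Section 4.2)

    def forward(self, item_feats, user_feats):
        z = self.item_tower(item_feats)                     # item embedding z
        z_recon, codes, rq_loss = self.quantizer(z)         # discrete codes + reconstruction
        # The user tower conditions on the quantized item view (interest-aware).
        u = self.user_tower(torch.cat([user_feats, z_recon], dim=-1))
        logit = (u * z).sum(-1)                             # dot-product pCTR logit
        interest_logits = self.interest_tower(user_feats)   # used for online soft-search
        return logit, interest_logits, rq_loss
```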
- 4.2 End-to-End Residual Quantization:
This is the core of SPARC's interest definition. The model uses an RQ-VAE with $M=3$ levels. For an item embedding $z$:
- Level 1: Find the nearest codeword $e_0$ in the first codebook $C_0$. The first residual is $r_1 = z - e_0$.
- Level 2: Find the nearest codeword $e_1$ in the second codebook $C_1$ to the residual $r_1$. The new residual is $r_2 = r_1 - e_1$.
- Level 3: Find the nearest codeword $e_2$ in the third codebook $C_2$ to the residual $r_2$.
The discrete representation of the item is the tuple of code indices $(idx_0, idx_1, idx_2)$, and the reconstructed vector is $z_{recon} = e_0 + e_1 + e_2$. The first-level code $e_0$ captures the item's core interest, while subsequent codes capture finer details.
The RQ-VAE is trained with a loss $\mathcal{L}_{rqvae}$ that is part of the total model loss:
$$\mathcal{L}_{rqvae} = \|z - z_{recon}\|_2^2 + \beta \sum_{k=0}^{2} \|\mathrm{sg}(r_k) - e_k\|_2^2$$
- $\|z - z_{recon}\|_2^2$: The reconstruction loss, which pushes the reconstructed vector to be close to the original.
- $\|\mathrm{sg}(r_k) - e_k\|_2^2$: The codebook commitment loss, where $\mathrm{sg}(\cdot)$ is the stop-gradient operator that prevents gradients from flowing back into the encoder through this term. It encourages the chosen codewords $e_k$ to stay close to their corresponding residuals $r_k$, effectively "committing" the encoder to the codebook.
- Because $\mathcal{L}_{rqvae}$ is part of the total loss, gradients from the main recommendation task can flow back and update the codebooks, making them "behavior-aware".
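To make the mechanics concrete, here is a minimal PyTorch sketch of the three-level quantization and $\mathcal{L}_{rqvae}$. It is an illustration, not the paper's code: the commitment weight `beta` and the straight-through estimator detail are assumptions borrowed from standard VQ-VAE practice.

```python
import torch
import torch.nn.functional as F

def residual_quantize(z, codebooks, beta=0.25):
    """Three-level residual quantization with L_rqvae (a sketch).

    z: (batch, dim) item embeddings.
    codebooks: list of nn.Embedding, one per residual level.
    beta: assumed commitment weight (a common VQ-VAE default).
    """
    residual, parts, indices, commit = z, [], [], 0.0
    for codebook in codebooks:
        dists = torch.cdist(residual, codebook.weight)       # (batch, num_codes)
        idx = dists.argmin(dim=-1)                           # nearest codeword
        e = codebook(idx)
        commit = commit + F.mse_loss(e, residual.detach())   # ||sg(r_k) - e_k||^2
        parts.append(e)
        indices.append(idx)
        residual = residual - e                              # next-level residual
    quantized = torch.stack(parts).sum(dim=0)                # e0 + e1 + e2
    loss = F.mse_loss(quantized, z) + beta * commit          # L_rqvae
    # Straight-through estimator: the forward pass uses the quantized vector,
    # while downstream gradients flow back into z (and hence the item tower).
    z_recon = z + (quantized - z).detach()
    return z_recon, torch.stack(indices, dim=-1), loss

# Example usage (sizes assumed): three codebooks of 256 codes, dim 64.
# codebooks = torch.nn.ModuleList(torch.nn.Embedding(256, 64) for _ in range(3))
# z_recon, codes, l_rq = residual_quantize(torch.randn(32, 64), codebooks)
```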
- 4.3 Probabilistic Interest-Aware User Representation:
The User Tower uses an attention mechanism in which the item's quantized representation acts as the query and the user's historical behavior sequence provides the keys and values. To make the model robust, a Mixed Cross-Feature Strategy is used during training:
- 50% of the time, only the coarse, first-level codeword $e_0$ is used as the query.
- 50% of the time, the full reconstructed vector $z_{recon}$ is used as the query.
This trains the User Tower to be effective with both coarse and fine-grained interest signals.
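A minimal sketch of this mixed-query strategy, assuming a standard multi-head attention layer over the behavior sequence; the per-example coin flip, tensor shapes, and module choice are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

def mixed_query(e0, z_recon, p_coarse=0.5):
    """Per example: use the coarse level-1 codeword e0 as the attention query
    with probability p_coarse, otherwise the full reconstruction z_recon."""
    coarse = torch.rand(e0.shape[0], 1) < p_coarse       # (batch, 1) coin flips
    return torch.where(coarse, e0, z_recon)

# Example: build a context-aware user vector from the behavior sequence.
dim, batch, seq_len = 64, 8, 50
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
e0, z_recon = torch.randn(batch, dim), torch.randn(batch, dim)
behaviors = torch.randn(batch, seq_len, dim)             # embedded historical items
q = mixed_query(e0, z_recon).unsqueeze(1)                # (batch, 1, dim) query
user_vec, _ = attn(q, behaviors, behaviors)              # keys/values = history
```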
- 4.4 Multi-Task Optimization Framework for Disentanglement and Alignment:
The total loss function is a weighted sum of components designed to learn high-quality, disentangled interest representations (a sketch of the auxiliary losses follows the list below):
$$\mathcal{L}_{total} = \mathcal{L}_{BCE} + \lambda_{bpr}\mathcal{L}_{BPR\_shuffle} + \lambda_{interest}\mathcal{L}_{interest} + \lambda_{ui}\mathcal{L}_{CL\_ui} + \lambda_{ii}\mathcal{L}_{CL\_ii} + \lambda_{rq}\mathcal{L}_{rqvae}$$
- $\mathcal{L}_{BCE}$: The primary binary cross-entropy loss for the click prediction task.
- $\mathcal{L}_{BPR\_shuffle}$ (Interest Disentanglement Loss): This novel BPR loss enhances the distinguishability of interest codes. For a positive user-item pair $(u, i)$, it generates a positive user vector $u_q$ using the item's core interest code, then samples core interest codes from other items in the batch to generate negative user vectors $u'_{q,j}$. The loss pushes the score of the positive pair $(u_q, z)$ above the scores of the negative pairs $(u'_{q,j}, z)$, forcing the user tower to be sensitive to the input interest code.
$$\mathcal{L}_{BPR\_shuffle} = -\sum_{(u,i)\in D}\sum_{j=1}^{J} \log\sigma\left(u_q^\top z - (u'_{q,j})^\top z\right)$$
- $\mathcal{L}_{CL\_ii}$ and $\mathcal{L}_{CL\_ui}$ (Representation Alignment Losses): These are two in-batch contrastive losses.
  - $\mathcal{L}_{CL\_ii}$ (Item-Item CL) pulls the original item embedding $z$ toward its own reconstruction $z_{recon}$, relative to the reconstructions of other items in the batch:
$$\mathcal{L}_{CL\_ii} = -\log \frac{\exp(\langle z, z_{recon}\rangle/\tau)}{\sum_{j\in \text{batch}}\exp(\langle z, z_{j,recon}\rangle/\tau)}$$
  - $\mathcal{L}_{CL\_ui}$ (User-Item CL) aligns the user embedding $u$ with the positive item's reconstruction $z_{recon}$, relative to other items' reconstructions:
$$\mathcal{L}_{CL\_ui} = -\log \frac{\exp(\langle u, z_{recon}\rangle/\tau)}{\sum_{j\in \text{batch}}\exp(\langle u, z_{j,recon}\rangle/\tau)}$$
- $\mathcal{L}_{interest}$ (Interest Supervision Loss): This loss trains the separate Interest Probability Tower to predict the first-level interest code of the positive item, given user features.
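The sketch below shows plausible implementations of these auxiliary losses, under assumptions the summary leaves open: InfoNCE with in-batch negatives for both alignment losses, in-batch shuffling of level-1 codes for the $J$ BPR negatives, and cross-entropy for the interest supervision. `user_tower_fn` and the other names are placeholders.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive(anchor, positive, tau=0.1):
    """InfoNCE over the batch; covers both L_CL_ii (anchor = z) and
    L_CL_ui (anchor = u), with positive = z_recon. tau is assumed."""
    logits = anchor @ positive.t() / tau                 # (batch, batch) similarities
    labels = torch.arange(anchor.shape[0], device=anchor.device)
    return F.cross_entropy(logits, labels)               # diagonal entries are positives

def bpr_shuffle_loss(user_tower_fn, user_feats, e0, z, num_neg=4):
    """L_BPR_shuffle: a user vector built with the item's own core code should
    score higher against z than user vectors built with other items' codes."""
    u_pos = user_tower_fn(user_feats, e0)                # (batch, dim)
    pos_score = (u_pos * z).sum(-1)
    loss = 0.0
    for _ in range(num_neg):                             # J shuffled negatives
        perm = torch.randperm(e0.shape[0])
        u_neg = user_tower_fn(user_feats, e0[perm])      # wrong core interest code
        loss = loss - F.logsigmoid(pos_score - (u_neg * z).sum(-1)).mean()
    return loss / num_neg

def interest_supervision_loss(interest_logits, level1_idx):
    """L_interest: cross-entropy of the Interest Probability Tower's logits
    over the level-1 codes vs. the positive item's actual level-1 code."""
    return F.cross_entropy(interest_logits, level1_idx)
```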
- 4.5 Online Serving with Probabilistic Exploration:
This "soft-search" mechanism revolutionizes online inference:
- Phase 1: Interest Prediction: For an incoming user request, the pre-trained Interest Probability Tower computes a probability distribution $P(\text{interest} \mid u)$ over the 256 first-level interest codes.
- Phase 2: Parallel Retrieval:
  - The top-K most probable interest codes are selected (e.g., $K=5$).
  - For each of the K codes, the corresponding codeword vector $e_{0,k}$ is used to generate a distinct user embedding $u_k$.
  - These K user vectors are used to run K parallel ANN searches over the item index.
  - The results are aggregated and de-duplicated to form the final candidate set.
This allows the system to proactively explore multiple potential interests, rather than just the single most dominant one, enhancing discovery and diversity.
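A minimal sketch of the two-phase soft-search at serving time; brute-force dot-product scoring stands in for the production ANN index, and `interest_tower_fn` / `user_tower_fn` are illustrative placeholders.

```python
import torch

def soft_search(user_feats, interest_tower_fn, user_tower_fn,
                level1_codebook, item_index, k_interests=5, per_search=100):
    """Phase 1: predict P(interest | u) over the level-1 codes.
    Phase 2: one interest-conditioned user vector and one search per code."""
    probs = interest_tower_fn(user_feats).softmax(-1)    # over the 256 level-1 codes
    top_codes = probs.topk(k_interests).indices          # K most probable interests
    candidates = []
    for idx in top_codes:
        e0_k = level1_codebook[idx]                      # codeword for interest k
        u_k = user_tower_fn(user_feats, e0_k)            # interest-conditioned user vector
        scores = item_index @ u_k                        # brute-force stand-in for ANN
        candidates.append(scores.topk(per_search).indices)
    # Aggregate and de-duplicate the K result lists into one candidate set.
    return torch.unique(torch.cat(candidates))
```

Note the exploration-cost knob: a larger `k_interests` widens discovery but multiplies the number of ANN lookups per request.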
5. Experimental Setup
- Datasets:
- Amazon Books: A public 5-core dataset where each user and item has at least five interactions. Positive feedback is defined as a rating of 4.0 or higher.
- Industrial Dataset: A large-scale proprietary dataset from a commercial platform with tens of millions of daily active users.
The table below (transcribed from the paper) summarizes the dataset statistics.
Table 1: Statistics of datasets.

| Dataset | # users | # items | # interactions |
| --- | --- | --- | --- |
| Amazon Books | 459,133 | 313,966 | 8,898,041 |
| Industrial Dataset | 358,920,221 | 43,156,789 | 4,732,456,754 |
- Evaluation Metrics (a combined computation sketch follows this list):
- Recall@K:
  - Definition: Measures the proportion of test items that are found within the top-K recommended items. It evaluates the model's ability to retrieve relevant items.
  - Formula:
$$\text{Recall@K} = \frac{|\{\text{relevant items}\} \cap \{\text{top-K recommended items}\}|}{|\{\text{relevant items}\}|}$$
  - Symbols: In this paper's setting (one positive item per user), Recall@K is 1 if the positive item appears in the top-K and 0 otherwise, averaged over all users.
- NDCG@K (Normalized Discounted Cumulative Gain@K):
  - Definition: A ranking-quality metric that evaluates how well the relevant items are ranked, giving higher scores for placing relevant items near the top of the list.
  - Formula:
$$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}, \quad \text{where } \text{DCG@K} = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)}$$
  - Symbols: $rel_i$ is the relevance of the item at rank $i$ (1 if it is the positive item, 0 otherwise), and IDCG@K is the DCG of a perfect ranking.
- MRR (Mean Reciprocal Rank):
  - Definition: The average of the reciprocal rank at which the first relevant item is found; it heavily rewards placing the correct item very high in the list.
  - Formula:
$$\text{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_i}$$
  - Symbols: $|Q|$ is the number of users (queries), and $rank_i$ is the rank of the first correct item for the $i$-th user.
- Coverage@K:
  - Definition: The percentage of unique items from the entire corpus that appear in at least one user's top-K recommendation list; a measure of catalog diversity.
  - Formula:
$$\text{Coverage@K} = \frac{\left|\bigcup_{u\in U} \text{TopK}(u)\right|}{|I|}$$
  - Symbols: $U$ is the set of all users, $\text{TopK}(u)$ is the top-K list for user $u$, and $I$ is the set of all items.
- ILD@K (Intra-List Diversity@K):
  - Definition: The average pairwise dissimilarity of items within a single recommendation list; higher ILD means the items recommended to a user are more diverse.
  - Formula:
$$\text{ILD@K} = \frac{1}{|U|}\sum_{u\in U}\frac{\sum_{i,j\in \text{TopK}(u),\, i\neq j}\left(1 - \cos(v_i, v_j)\right)}{K(K-1)}$$
  - Symbols: $v_i$ and $v_j$ are the embedding vectors of items $i$ and $j$.
- PV500: An online business metric measuring the number of new content pieces that reach 500 Page Views (PVs) within 24 hours. It directly reflects the system's ability to discover and promote new content.
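As referenced above, here is a compact NumPy sketch of the offline metrics in the one-positive-per-user setting (PV500 is an online counter and cannot be reproduced offline); variable names are illustrative.

```python
import numpy as np

def recall_ndcg_mrr_at_k(ranks, k):
    """ranks: 1-based rank of each user's single positive item."""
    ranks = np.asarray(ranks, dtype=float)
    hit = ranks <= k
    recall = hit.mean()                                         # Recall@K
    ndcg = np.where(hit, 1.0 / np.log2(ranks + 1), 0.0).mean()  # IDCG@K = 1 here
    mrr = (1.0 / ranks).mean()                                  # MRR
    return recall, ndcg, mrr

def coverage_at_k(topk_lists, num_items):
    """Fraction of the corpus appearing in at least one user's top-K list."""
    return len(set().union(*map(set, topk_lists))) / num_items

def ild_at_k(topk_embs):
    """Mean pairwise cosine distance within each user's top-K list.
    topk_embs: (num_users, K, dim) embeddings of the recommended items."""
    e = topk_embs / np.linalg.norm(topk_embs, axis=-1, keepdims=True)
    sims = e @ e.transpose(0, 2, 1)                  # (U, K, K) cosine matrix
    k = e.shape[1]
    off_diag = sims.sum(axis=(1, 2)) - k             # drop the unit diagonal
    return (1.0 - off_diag / (k * (k - 1))).mean()

# e.g. recall_ndcg_mrr_at_k([1, 3, 80], k=50) -> (0.667, 0.5, 0.449) approximately
```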
- Baselines:
- Two-Tower (YouTubeDNN): A classic two-tower model with mean pooling.
- MIND: A multi-interest model using a capsule network with dynamic routing.
- ComiRec: A state-of-the-art multi-interest model using self-attention.
- SPARC-Hard: A variant of SPARC that uses a "hard-search" strategy (deterministic selection of top interests) instead of the probabilistic soft-search.
- SPARC-Static: A variant whose codebook is pre-computed with K-Means and kept fixed during training, mimicking the traditional two-stage approach.
6. Results & Analysis
The experiments systematically validate SPARC's effectiveness across multiple dimensions.
- RQ1: Overall Performance Comparison
Table 2 shows that SPARC significantly outperforms all baselines, including the strong multi-interest model ComiRec, across all standard retrieval metrics. For instance, it achieves a +5.54% relative improvement in Recall@50 and +5.73% in NDCG@50 over ComiRec, confirming the overall superiority of the SPARC framework.
Table 2: Overall performance comparison on the Amazon Books dataset.

| Model | Recall@20 | NDCG@20 | Recall@50 | NDCG@50 | MRR |
| --- | --- | --- | --- | --- | --- |
| Two-Tower | 0.1852 | 0.1033 | 0.3015 | 0.1246 | 0.0812 |
| MIND | 0.2014 | 0.1168 | 0.3321 | 0.1395 | 0.0925 |
| ComiRec | 0.2088 | 0.1215 | 0.3413 | 0.1448 | 0.0967 |
| SPARC-Static | 0.1985 | 0.1132 | 0.3276 | 0.1361 | 0.0901 |
| SPARC-Hard | 0.2075 | 0.1201 | 0.3398 | 0.1432 | 0.0954 |
| SPARC (ours) | 0.2216 | 0.1294 | 0.3602 | 0.1531 | 0.1038 |
| Improv. vs runner-up | +6.13% | +6.50% | +5.54% | +5.73% | +7.34% |
- RQ2: Analysis of Novelty and Long-Tail Discovery
The results strongly support the claim that SPARC excels at interest discovery.
  - Long-tail Performance (Table 3): SPARC shows a remarkable +24.2% relative improvement in Recall@50 for tail items (the least popular 20%) compared to ComiRec. This helps explain its strong online PV500 result (+22.7%): the model is much better at surfacing and promoting new and niche content.
  - Diversity and Coverage (Table 4): SPARC also leads significantly in Coverage@50 (+15.7%) and ILD@50 (+5.67%), indicating that it recommends a wider variety of items and constructs more diverse recommendation lists for users.
Table 3: Recall@50 performance on items of different popularity levels.

| Model | Head (Top 20%) | Torso (20%-60%) | Tail (Bottom 20%) |
| --- | --- | --- | --- |
| Two-Tower | 0.5512 | 0.2843 | 0.0411 |
| MIND | 0.5489 | 0.3155 | 0.0586 |
| ComiRec | 0.5501 | 0.3248 | 0.0632 |
| SPARC (ours) | 0.5623 | 0.3477 | 0.0785 |
| Improv. vs runner-up | +2.01% | +7.05% | +24.2% |
Table 4: Performance on novelty and diversity metrics.

| Model | Coverage@50 | ILD@50 |
| --- | --- | --- |
| Two-Tower | 0.085 | 0.682 |
| MIND | 0.102 | 0.725 |
| ComiRec | 0.108 | 0.741 |
| SPARC (ours) | 0.125 | 0.783 |
| Improv. vs runner-up | +15.7% | +5.67% |
- RQ3: Ablation Study
The ablation results in Table 2 demonstrate the value of SPARC's key components:
  - SPARC vs. SPARC-Hard: The full SPARC model outperforms SPARC-Hard, confirming the effectiveness of the probabilistic "soft-search" mechanism, which yields a more exploratory and effective user representation than a deterministic "hard-search".
  - SPARC vs. SPARC-Static: The large gap between SPARC and SPARC-Static (which even falls below the multi-interest baselines) underscores the importance of training the dynamic codebook end-to-end. A static, pre-trained codebook is suboptimal because it is not aligned with the downstream recommendation task.
- RQ4: Analysis for Cold-Start Users
Table 5 shows that SPARC's advantage is largest for users with sparse interaction histories (length [5, 10]), where it achieves a +11.84% relative improvement in NDCG@50 over ComiRec. This suggests that SPARC's probabilistic exploration mechanism handles uncertainty well and generalizes to users whose preferences are not yet well established.
Table 5: NDCG@50 performance on user groups with different activity levels.

| Model | Len: [5, 10] (Sparse) | Len: [11, 20] (Medium) | Len: [21, 50] (Active) |
| --- | --- | --- | --- |
| Two-Tower | 0.0825 | 0.1198 | 0.1451 |
| ComiRec | 0.0988 | 0.1385 | 0.1663 |
| SPARC (ours) | 0.1105 | 0.1492 | 0.1741 |
| Improv. vs runner-up | +11.84% | +7.73% | +4.69% |
7. Conclusion & Reflections
- Conclusion Summary:
The paper introduces SPARC, an innovative end-to-end retrieval framework that tackles fundamental limitations in multi-interest modeling. By jointly training an RQ-VAE with the retrieval model, it creates a dynamic, "behavior-aware" discrete interest space. Its probabilistic "soft-search" mechanism transforms online inference from passive matching to proactive exploration. Comprehensive online and offline experiments validate that SPARC significantly improves recommendation accuracy, diversity, and novelty, particularly for long-tail content and cold-start users.
- Limitations & Future Work:
The authors do not explicitly list limitations, but we can infer some:
- Training Complexity: The multi-task learning framework involves multiple loss functions with corresponding weighting hyperparameters (λs). Tuning these weights can be complex and computationally expensive.
- Inference Overhead: The "soft-search" strategy requires K parallel ANN searches, which increases computational and latency costs during online serving compared to a single search. A trade-off between the exploration gain (from a larger K) and system cost must be carefully managed.
- Interpretability of Codes: While the discrete codes are more structured than continuous embeddings, directly interpreting what real-world concept each codeword represents (e.g., "vintage fashion," "sci-fi books") remains a challenge that requires further qualitative analysis.
Future work could explore more advanced quantization techniques, apply the framework to other domains like multimodal recommendation, or develop adaptive methods to dynamically choose the number of exploration paths (K) based on user uncertainty.
- Personal Insights & Critique:
SPARC represents a significant step forward in the design of retrieval systems. Its core contribution—the end-to-end integration of a learnable, discrete representation space (the codebook) with the recommendation task—is a powerful paradigm. It elegantly solves the "semantic-behavior gap" that plagued previous two-stage approaches.
The "soft-search" mechanism is also a highly practical and impactful innovation. It provides a theoretically grounded way to balance the exploration-exploitation dilemma at the retrieval stage, which is directly responsible for the impressive gains in novelty and long-tail discovery. The strong performance on cold-start users further highlights the robustness of this probabilistic approach.
Overall, SPARC is a well-designed and thoroughly validated framework that not only pushes the state-of-the-art in multi-interest retrieval but also offers a practical blueprint for building more intelligent and exploratory industrial recommender systems.