BPL: Bias-adaptive Preference Distillation Learning for Recommender System
TL;DR Summary
BPL introduces bias-adaptive preference distillation combining teacher-student and self-distillation to mitigate bias, enhancing accuracy in both factual and counterfactual recommendation tests and balancing short- and long-term user preference prediction.
Abstract
Recommender systems suffer from biases that cause the collected feedback to incompletely reveal user preference. While debiasing learning has been extensively studied, existing methods have mostly focused on the specialized (called counterfactual) test environment simulated by random exposure of items, significantly degrading accuracy in the typical (called factual) test environment based on actual user-item interactions. In fact, each test environment highlights the benefit of a different aspect: the counterfactual test emphasizes user satisfaction in the long term, while the factual test focuses on predicting subsequent user behaviors on platforms. Therefore, it is desirable to have a model that performs well on both tests rather than only one. In this work, we introduce a new learning framework, called Bias-adaptive Preference distillation Learning (BPL), to gradually uncover user preferences with dual distillation strategies. These distillation strategies are designed to drive high performance in both factual and counterfactual test environments. Employing a specialized form of teacher-student distillation from a biased model, BPL retains accurate preference knowledge aligned with the collected feedback, leading to high performance in the factual test. Furthermore, through self-distillation with reliability filtering, BPL iteratively refines its knowledge throughout the training process. This enables the model to produce more accurate predictions across a broader range of user-item combinations, thereby improving performance in the counterfactual test. Comprehensive experiments validate the effectiveness of BPL in both factual and counterfactual tests. Our implementation is accessible via: https://github.com/SeongKu-Kang/BPL.
In-depth Reading
1. Bibliographic Information
- Title: BPL: Bias-adaptive Preference Distillation Learning for Recommender System
- Authors: SeongKu Kang, Jianxun Lian, Dongha Lee, Wonbin Kweon, Sanghwan Jang, Jaehyun Lee, Jindong Wang, Xing Xie (Fellow, IEEE), and Hwanjo Yu. The authors are affiliated with a mix of academic and industrial research institutions, including prominent technology companies, indicating a blend of theoretical and practical expertise in machine learning and recommender systems.
- Journal/Conference: The paper is available on arXiv, a preprint server; the specific journal or conference of publication is not mentioned in the provided text. The "Fellow, IEEE" title for Xing Xie indicates a high level of seniority and contribution to the field.
- Publication Year: 2025 (the arXiv identifier 2510.16076 corresponds to an October 2025 submission). As a preprint, the paper has not yet completed formal peer review for a specific venue.
- Abstract: The paper addresses the problem of biases in recommender systems, which lead to collected user feedback being an incomplete representation of true user preferences. Existing debiasing methods improve performance on specialized "counterfactual" tests (simulating random item exposure) but degrade accuracy on typical "factual" tests (based on actual user interactions). The authors argue that a model should perform well on both, as they represent long-term user satisfaction and short-term platform goals, respectively. They introduce Bias-adaptive Preference distillation Learning (BPL), a framework that uses dual distillation strategies to achieve high performance in both environments. BPL uses teacher-student distillation from a biased model to retain knowledge for the factual test and self-distillation with reliability filtering to refine its predictions for the counterfactual test.
- Original Source Link: https://arxiv.org/abs/2510.16076 (PDF: https://arxiv.org/pdf/2510.16076v1.pdf)
2. Executive Summary
- Background & Motivation (Why):
  - Core Problem: Recommender systems are trained on user interaction data (e.g., ratings), but this data is heavily biased. Users are not exposed to all items randomly; they tend to see and rate popular items or items recommended by the system. This creates a "feedback loop" in which the system's biases are reinforced.
  - Gap in Prior Work: Existing solutions, known as debiasing methods, have a major flaw. They are designed to work well in a counterfactual test environment, which simulates an ideal, unbiased world with random item exposure. However, they perform poorly in a factual test environment, which uses real-world interaction data to measure how well the model predicts the next thing a user will do.
  - Importance: A model needs to excel in both tests. The factual test is linked to immediate business metrics like click-through rates and revenue. The counterfactual test is linked to long-term user satisfaction by ensuring the model understands a user's true, unbiased preferences. Existing methods force a trade-off; an ideal model should not have to make one.
  - Innovation: The paper introduces BPL, a framework that trains a single model to perform well in both test environments simultaneously. It uses knowledge distillation, but instead of relying only on fixed "teacher" models, it iteratively refines its own knowledge while adaptively learning from a biased model.
- Main Contributions / Findings (What):
  1. A New Learning Framework: BPL is grounded in risk minimization theory. The authors decompose the total prediction error (true risk) into three components and design a specific learning objective to minimize each one, providing a principled approach to the problem.
  2. Dual Distillation Strategies: This is the core technical innovation.
     - Confidence-Penalized Preference Distillation: BPL learns from a standard, biased "teacher" model, which is good at predicting interactions with popular or familiar items. A novel confidence penalty prevents the model from simply memorizing the teacher's biases.
     - Reliability-Filtered Self-Distillation: BPL learns from its own predictions on unrated items. A filtering mechanism keeps only its most reliable predictions, allowing it to gradually and safely uncover user preferences for a wider range of items.
  3. Bias-Adaptive Balancing: The two distillation strategies are adaptively balanced using a score called $S^1$-affinity, which measures how similar an unrated user-item pair is to the pairs the user has already rated. This ensures the model learns from the biased teacher for "familiar" items and explores on its own for "unfamiliar" ones.
  4. State-of-the-Art Performance: Extensive experiments show that BPL achieves superior performance on both factual and counterfactual tests, outperforming previous methods and successfully bridging the performance gap between the two environments.
---
# 3. Prerequisite Knowledge & Related Work
* **Foundational Concepts:**
* **Recommender Systems & Explicit Feedback:** Recommender systems predict user preferences. **Explicit feedback** refers to direct input from users, like a 1-to-5 star rating, which clearly indicates their level of preference.
* **Biases in Recommenders:**
* **Selection Bias:** Users choose which items to interact with and rate. They are more likely to rate items they love or hate, and rarely rate items they feel neutral about.
* **Exposure Bias:** Users can only rate items they have been exposed to. The system's own recommendations and an item's popularity heavily influence what users see.
* **Conformity Bias:** A user's rating can be influenced by public opinion or the average rating of an item.
* **Factual vs. Counterfactual Tests:**
* **Factual Test:** Evaluates the model on a test set collected from *actual* user interactions. This measures the model's ability to predict future user behavior in the real, biased system. It is crucial for short-term platform performance.
* **Counterfactual Test:** Evaluates the model on a test set collected from a **Randomized Controlled Trial (RCT)**, where items are shown to users randomly. This measures the model's ability to predict true, unbiased user preferences. It is crucial for long-term user satisfaction and discovering new interests.
* **Debiasing Techniques:**
* **Inverse Propensity Score (IPS):** A statistical method that re-weights training samples to counteract bias. Samples that are underrepresented in the biased data (e.g., unpopular items) are given higher weight.
* **Data Imputation:** A method that tries to "fill in" the missing ratings for unrated items with pseudo-labels, allowing the model to learn from more than just the observed data.
* **Doubly Robust (DR):** A hybrid method that combines IPS and data imputation to be more robust and effective than either one alone.
* **Knowledge Distillation (KD):** A machine learning technique where a smaller "student" model is trained to mimic the outputs of a larger, pre-trained "teacher" model. This transfers the "knowledge" of the teacher to the student. **Self-distillation** is a variant where a model learns from its own past predictions to refine its knowledge.
* **Previous Works & Differentiation:**
* **Standard Debiasing (e.g., `Stable-DR`, `Stable-MRDR`):** These methods are strong in the counterfactual test but, as the paper shows (Figure 1), significantly hurt performance in the factual test. They focus too much on correcting bias and lose the ability to predict what users will actually do next.
* **Adversarial Learning (e.g., `FADA`):** These methods treat debiasing as a domain adaptation problem, trying to make the representations of rated and unrated items similar. While they offer some improvement, their effectiveness is limited.
* **`InterD`:** This is the most direct competitor. It uses KD to combine a biased teacher (good at factual tests) and a debiased teacher (good at counterfactual tests). However, its performance is fundamentally **capped by the quality of its pre-trained teachers**. It cannot learn anything beyond what the two teachers already know.
* **BPL's Differentiation:** BPL overcomes `InterD`'s limitation. Instead of just combining fixed knowledge, BPL **iteratively discovers new preference knowledge**. Its self-distillation component allows it to continuously improve and surpass its initial teachers. The adaptive balancing and confidence penalty are also novel mechanisms not found in prior work.

*Figure 1 from the paper illustrates the core problem: Standard training is good for factual tests but bad for counterfactual. The debiasing method `Stable-MRDR` is the opposite. `InterD` offers a compromise, but BPL aims to be the best at both.*
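To make the IPS re-weighting idea from the Debiasing Techniques list above concrete, the following is the standard textbook form of an IPS risk estimator; it illustrates the general technique and is not necessarily the exact estimator used by the baselines in this paper. Here $\mathcal{D}$ denotes the set of all user-item pairs and $p_{ui}$ the propensity, i.e., the probability that the pair $(u,i)$ is observed in the logged data:

$$\hat{\epsilon}_{\text{IPS}}(f) = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in S^1} \frac{\ell\big(f(u,i),\, r_{ui}\big)}{p_{ui}}$$

Because rarely-observed pairs have small $p_{ui}$, their errors are up-weighted; the estimator is unbiased under correctly specified propensities but can have high variance, which is what Doubly Robust methods aim to mitigate.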
---
# 4. Methodology (Core Technology & Implementation)
The core idea of BPL is to minimize the true prediction error (risk) across all possible user-item pairs, not just the observed ones. The authors use risk minimization theory to guide the model's design.
* **Principles: Risk Upper Bound**
The paper starts by defining the true risk as the average prediction error over *all* user-item pairs. Since we only have ratings for a small, biased subset ($S^1$), we cannot compute this directly. Instead, we minimize an upper bound on this risk. Based on Theorem 1 from domain adaptation theory, the true risk $\epsilon_S(\eta)$ is bounded by three terms:

$$\epsilon_S(\eta) \leq \underbrace{\epsilon_{S^1}(\eta)}_{\text{T1: Empirical Risk}} + \lambda_0 \left( \underbrace{\frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{Z}_{S^0}, \mathcal{Z}_{S^1})}_{\text{T2: Distribution Divergence}} + \underbrace{\epsilon_{S^0}(\eta^*) + \epsilon_{S^1}(\eta^*)}_{\text{T3: Ideal Error / Discriminability}} \right)$$
* **T1: Empirical Risk:** The prediction error on the data we have ($S^1$). This is what standard models minimize.
* **T2: Distribution Divergence:** The difference between the representations of rated data ($S^1$) and unrated data ($S^0$). If this is large, a model trained on $S^1$ won't generalize to $S^0$.
* **T3: Ideal Error / Discriminability:** The error of a perfect predictor. This term is low if the learned representations are "discriminative"—meaning, representations for items with different ratings are easily separable. This is the hardest term to minimize because we lack labels for $S^0$.
BPL is designed with three loss functions, each targeting one of these terms.

*This chart compares two filtering methods, based on maximum probability and temporal consistency respectively, showing the post-filtering error ratio (left) and MSE (right) at different training stages; the results are from the Yahoo!R3 RCT data.*
*This diagram shows the overall architecture of BPL. It has three main loss components: $\mathcal{L}_{T1}$ for learning from rated data, $\mathcal{L}_{T2}$ for aligning distributions, and the core innovation $\mathcal{L}_{T3}$, which uses dual distillation strategies for unrated data, guided by an affinity estimator.*
* **Steps & Procedures:**
**1. Preference Learning with Rated Data ($\mathcal{L}_{T1}$ to minimize T1)**
This is the standard supervised learning part. The model is trained to predict the correct ratings for the user-item pairs in the training set $S^1$. The loss function is the standard cross-entropy loss for classification.
$$\mathcal{L}_{T1} = |S^1|^{-1} \sum_{(u, i) \in S^1} \ell_s(f(u, i), r_{ui})$$
* Where $\ell_s$ is the cross-entropy loss and $r_{ui}$ is the ground-truth rating.
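A minimal PyTorch-style sketch of this objective (illustrative only; the model interface and names such as `num_rating_classes` are assumptions, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def l_t1(model, users, items, ratings, num_rating_classes=5):
    """Empirical risk on rated pairs S^1: cross-entropy over rating classes.

    users, items: LongTensors of indices for pairs in S^1.
    ratings: LongTensor of ground-truth ratings in {1, ..., num_rating_classes}.
    model(users, items) is assumed to return logits of shape (batch, num_rating_classes).
    """
    logits = model(users, items)             # unnormalized \hat{p}(r_ui)
    targets = ratings - 1                    # map ratings 1..R to classes 0..R-1
    return F.cross_entropy(logits, targets)  # mean over the batch = |S^1|^{-1} sum
```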
**2. Distribution Alignment Learning ($\mathcal{L}_{T2}$ to minimize T2)**
To reduce the divergence between rated and unrated data (T2), BPL uses adversarial learning. A "discriminator" model ($f_d$) is trained to tell whether a user-item representation comes from the rated or unrated set, while the main model's "encoder" ($\phi$) is trained to fool the discriminator.
* **Key Tweak:** The authors found that simply separating $S^1$ from $S^0$ is hard because some unrated pairs are very similar to rated ones. To fix this, they introduce the **$S^1$-affinity**, the estimated probability that a pair $(u,i)$ belongs to the rated set. The unrated pairs with the highest affinity are moved into the "rated" group for the discriminator. This creates a clearer boundary and makes the adversarial training more stable.
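A minimal sketch of this adversarial alignment using a gradient-reversal layer, which is one common way to implement such min-max training; the paper's exact discriminator, update scheme, and affinity-based regrouping ratio are not specified here, so treat every name and value below as an assumption:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, negated gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def l_t2(encoder, discriminator, rated_pairs, unrated_pairs, affinity, top_ratio=0.1):
    """Adversarial alignment: the discriminator tries to tell rated from unrated
    representations, while the (gradient-reversed) encoder tries to fool it.
    High-affinity unrated pairs are grouped with the rated side to sharpen the boundary."""
    z_rated = encoder(*rated_pairs)      # representations phi(u, i) for pairs in S^1
    z_unrated = encoder(*unrated_pairs)  # representations for sampled pairs in S^0

    # Move the highest-affinity unrated pairs into the "rated" group (ratio is assumed).
    k = int(top_ratio * z_unrated.size(0))
    mask = torch.zeros(z_unrated.size(0), dtype=torch.bool, device=z_unrated.device)
    mask[affinity.topk(k).indices] = True

    z_pos = torch.cat([z_rated, z_unrated[mask]])  # treated as "rated-like"
    z_neg = z_unrated[~mask]                       # treated as "unrated"

    z_all = GradReverse.apply(torch.cat([z_pos, z_neg]))
    logits = discriminator(z_all).squeeze(-1)
    labels = torch.cat([torch.ones(len(z_pos)), torch.zeros(len(z_neg))]).to(logits.device)
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)
```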
**3. Dual Distillation Learning ($\mathcal{L}_{T3}$ to minimize T3)**
This is BPL's main innovation for improving representation discriminability (T3). It generates pseudo-supervision for the vast set of unrated data ($S^0$) using two complementary strategies, which are balanced based on the $S^1$-affinity of each pair.
* **Strategy A: Reliability-Filtered Self-Distillation ($\ell_{sd}$)**
For unrated data with **low** $S^1$-affinity (i.e., pairs unlike anything the user has rated before), the model learns from itself.
* **Filtering:** The model's predictions on unrated data can be noisy. BPL only trusts predictions that are **temporally consistent**: it maintains a moving average of its own weights (a "temporal ensemble" model), and a prediction is considered reliable only if the current model and the temporally-averaged model agree on the predicted rating class. As shown in Figure 5, this criterion identifies accurate predictions more effectively than using the model's confidence (maximum probability).
* **Loss:** For reliable predictions, the model is trained to become more confident by minimizing the entropy of its output probability distribution:
$$\ell_{sd} = H(\hat{p}(r_{ui}))$$
Where $H(\cdot)$ is the entropy function. Minimizing entropy makes the probability distribution "sharper", i.e., more confident.
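A minimal sketch of the reliability filter and the entropy-minimization loss; the EMA decay, the update frequency of the temporal-ensemble weights, and the loss normalization are assumptions rather than values from the paper:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_temporal_ensemble(model, ema_model, decay=0.99):
    """Keep an exponential moving average of the model weights (the 'temporal ensemble')."""
    for p, p_ema in zip(model.parameters(), ema_model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

def l_sd(model, ema_model, users, items):
    """Reliability-filtered self-distillation on unrated pairs: trust a prediction only
    if the current model and the EMA model agree on the predicted rating class,
    then sharpen it by minimizing its entropy."""
    probs = F.softmax(model(users, items), dim=-1)
    with torch.no_grad():
        probs_ema = F.softmax(ema_model(users, items), dim=-1)

    reliable = probs.argmax(-1) == probs_ema.argmax(-1)  # temporal-consistency filter
    if not reliable.any():
        return probs.new_zeros(())

    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)
    return entropy[reliable].mean()  # minimize entropy only for reliable predictions
```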
* **Strategy B: Confidence-Penalized Preference Distillation ($\ell_{pd}$)**
For unrated data with **high** $S^1$-affinity (i.e., pairs similar to what the user has rated), the model learns from a pre-trained **biased teacher**. The motivational analysis (Figures 2 and 3) showed that biased models are surprisingly good at predicting these types of interactions.
* **Loss:** The loss has two parts. The first part pushes the model's expected rating to match the teacher's prediction ($t_{ui}$). The second part is a **confidence penalty**, which *discourages* the model from being overconfident by maximizing the output entropy:
$$\ell_{pd} = \lambda(\mathbb{E}_{\hat{p}}[r_{ui}] - t_{ui})^2 - H(\hat{p}(r_{ui}))$$
This penalty is crucial. It prevents the student model from simply memorizing the teacher's biases and allows it to generalize better.
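A minimal sketch of $\ell_{pd}$ as written above; the rating-value grid and the weight `lam` are interface assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def l_pd(model, users, items, teacher_scores, lam=1.0, num_rating_classes=5):
    """Confidence-penalized preference distillation:
    (i) match the student's expected rating to the biased teacher's prediction t_ui,
    (ii) subtract the entropy, discouraging the student from over-sharpening."""
    probs = F.softmax(model(users, items), dim=-1)  # \hat{p}(r_ui)
    rating_values = torch.arange(1, num_rating_classes + 1,
                                 dtype=probs.dtype, device=probs.device)
    expected_rating = (probs * rating_values).sum(-1)           # E_{\hat{p}}[r_ui]

    match = (expected_rating - teacher_scores) ** 2             # squared gap to the teacher
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)    # H(\hat{p}(r_ui))
    return (lam * match - entropy).mean()                       # minus entropy = confidence penalty
```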
* **Combining the Strategies:** The two per-pair losses are combined according to the $S^1$-affinity score of each unrated pair.
* BPL-Soft: uses a soft weighting, applying $\ell_{pd}$ with a weight equal to the affinity and $\ell_{sd}$ with the complementary weight.
* BPL-Hard: makes a hard switch: if the affinity is high, use $\ell_{pd}$; if it is low, use $\ell_{sd}$.
The combined loss $\mathcal{L}_{T3}$ averages the resulting per-pair loss over the unrated data $S^0$.
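A minimal sketch of this affinity-based mixing; the per-pair loss interface, the hard-switch threshold, and the aggregation over $S^0$ are assumptions:

```python
import torch

def combine_t3(loss_pd, loss_sd, affinity, mode="hard", threshold=0.5):
    """Per-pair combination of the two distillation losses on unrated data S^0.

    loss_pd, loss_sd: per-pair losses (before averaging), same shape.
    affinity: per-pair S^1-affinity scores in [0, 1].
    """
    if mode == "soft":   # BPL-Soft: affinity-weighted average
        per_pair = affinity * loss_pd + (1.0 - affinity) * loss_sd
    else:                # BPL-Hard: hard switch on the affinity
        per_pair = torch.where(affinity >= threshold, loss_pd, loss_sd)
    return per_pair.mean()   # L_T3: averaged over the sampled unrated pairs
```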
**4. Overall Training**
The three loss components are combined into a single objective function and optimized jointly, with two balancing hyperparameters controlling the weights of $\mathcal{L}_{T2}$ and $\mathcal{L}_{T3}$ relative to $\mathcal{L}_{T1}$.
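A sketch of one plausible form of the joint objective, assuming two balancing hyperparameters (here called `alpha` and `beta`; the paper's actual symbols and values are not reproduced):

```python
def bpl_objective(loss_t1, loss_t2, loss_t3, alpha=0.1, beta=0.1):
    """Joint BPL objective: supervised loss on S^1, adversarial alignment,
    and the affinity-balanced dual-distillation loss on S^0."""
    return loss_t1 + alpha * loss_t2 + beta * loss_t3
```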
---
5. Experimental Setup
- Datasets:
  - The experiments use three real-world datasets that are standard benchmarks for debiasing research: Yahoo!R3, Coat, and KuaiRec.
  - Each dataset contains a biased training set from natural user interactions and a counterfactual test set from randomized trials. The authors created an additional factual test set by holding out 10% of the biased training data. This setup allows for comprehensive evaluation in both environments.
- The following is a transcription of the data statistics from Table I in the paper.

TABLE I: Data statistics after preprocessing.

| Dataset | #User | #Item | #Ratings: (a) training | (b) counterfactual test | (c) factual test |
| --- | --- | --- | --- | --- | --- |
| Yahoo!R3 | 15,400 | 1,000 | 280,534 | 54,000 | 31,170 |
| Coat | 290 | 300 | 6,264 | 4,640 | 696 |
| KuaiRec | 7,176 | 10,728 | 11,277,725 | 4,676,570 | 1,253,081 |
- Evaluation Metrics:
  - Mean Squared Error (MSE):
    - Conceptual Definition: MSE measures the average of the squared errors, i.e., the average squared difference between the predicted and the actual values. It penalizes larger errors more heavily than smaller ones.
    - Mathematical Formula: $\text{MSE} = \frac{1}{N} \sum_{k=1}^{N} (r_k - \hat{r}_k)^2$
    - Symbol Explanation:
      - $N$: the total number of ratings in the test set.
      - $r_k$: the true rating for the $k$-th user-item pair.
      - $\hat{r}_k$: the model's predicted rating for the $k$-th user-item pair.
  - Mean Absolute Error (MAE):
    - Conceptual Definition: MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average over the test set of the absolute differences between prediction and actual observation.
    - Mathematical Formula: $\text{MAE} = \frac{1}{N} \sum_{k=1}^{N} |r_k - \hat{r}_k|$
    - Symbol Explanation: $N$, $r_k$, and $\hat{r}_k$ are as defined above.
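A small NumPy example of both metrics (illustrative values only):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# e.g., true ratings vs. predictions on a tiny test set
print(mse([5, 3, 1], [4.5, 3.5, 2.0]))  # 0.5
print(mae([5, 3, 1], [4.5, 3.5, 2.0]))  # 0.666...
```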
- Baselines: The paper compares BPL against a strong set of baselines from four categories:
  - Standard Training: a basic model trained only on the observed data ($S^1$).
  - Adversarial Learning: `FADA` and `IA`, which use domain adaptation techniques.
  - Debiasing Learning: state-of-the-art methods such as `Stable-DR`, `Stable-MRDR`, and `DCE-TDR`, which are known to be strong on counterfactual tests.
  - Knowledge Distillation: `InterD`, the most relevant competitor, which distills knowledge from both a biased and a debiased teacher.
6. Results & Analysis
- Core Results:
  - Figure 6 from the paper shows scatter plots of performance on the factual (x-axis) and counterfactual (y-axis) tests. The ideal position is the bottom-left corner (low error on both). BPL (red star) consistently lands there, while other methods are forced into a trade-off.
  - Figure 7 from the paper shows the harmonic mean of MSE and MAE across both tests, providing a single score for overall performance. BPL-Hard and BPL-Soft achieve the lowest (best) scores across all datasets, demonstrating their superior balance.
  - Trade-off is Evident: The results confirm the paper's core motivation. The biased teacher excels in the factual test but fails in the counterfactual one, while debiasing methods like `Stable-DR`/`Stable-MRDR` show the opposite pattern.
  - `InterD` is Limited: `InterD` finds a better balance than its teachers but often fails to significantly outperform the debiased teacher in the counterfactual test, highlighting its limitation of being bounded by its teachers' knowledge.
  - BPL Achieves the Best of Both Worlds: BPL, particularly `BPL-Hard`, consistently achieves the best overall performance. It maintains low error in the factual test (comparable to the biased teacher) while also achieving the lowest error in the counterfactual test across all datasets, demonstrating that it breaks the trade-off.
  - `BPL-Hard` vs. `BPL-Soft`: The `BPL-Hard` variant, which uses a sharp affinity cutoff, consistently outperforms `BPL-Soft`. The authors suggest that a hard separation is more robust, especially when the rated and unrated data distributions differ substantially.
- Ablations / Parameter Sensitivity:
  - The following is a transcription of the ablation study results from Table II in the paper. (Note: the provided text cuts off, so the table may be incomplete; this transcription reflects the available data.)

TABLE II: Ablation study of BPL-Hard on the Yahoo!R3 and Coat datasets.

| Method | Test | Yahoo!R3 MSE | Yahoo!R3 MAE | Coat MSE | Coat MAE |
| --- | --- | --- | --- | --- | --- |
| BPL-Hard | Factual | 0.871 | 0.697 | 1.127 | 0.838 |
| | Counterfactual | 0.958 | 0.758 | 1.214 | 0.887 |
| w/o $\mathcal{L}_{T2}$ (Adversarial) | Factual | 0.874 | 0.701 | 1.122 | 0.835 |
| | Counterfactual | 0.982 | 0.771 | 1.248 | 0.902 |
| w/o $\ell_{sd}$ (Self-Distill) | Factual | 0.878 | 0.702 | 1.155 | 0.852 |
| | Counterfactual | 1.011 | 0.790 | 1.298 | 0.931 |
| w/o $\ell_{pd}$ (Teacher-Distill) | Factual | 0.905 | 0.720 | 1.189 | 0.870 |
| | Counterfactual | 0.975 | 0.769 | 1.241 | 0.899 |
| w/o $\ell_{sd}$ & $\ell_{pd}$ (Standard) | Factual | 0.875 | 0.701 | 1.151 | 0.852 |
| | Counterfactual | 1.102 | 0.841 | 1.543 | 1.021 |
| w/o confidence penalty | Factual | 0.879 | 0.705 | 1.171 | 0.864 |
| | Counterfactual | 1.033 | 0.802 | 1.341 | 0.955 |
- Impact of Each Component:
  - Removing the self-distillation loss ($\ell_{sd}$) or the teacher-distillation loss ($\ell_{pd}$) hurts performance, especially on the counterfactual test, showing that both are necessary.
  - Removing the adversarial alignment loss ($\mathcal{L}_{T2}$) also degrades counterfactual performance, showing its role in bridging the distribution gap.
  - The model with only the standard loss (w/o $\ell_{sd}$ & $\ell_{pd}$) performs very poorly on the counterfactual test, as expected.
- Importance of Confidence Penalty: The most critical finding comes from the w/o confidence penalty ablation. When this penalty is removed (reducing the loss to standard teacher-student distillation), performance on the counterfactual test drops significantly. This confirms that naively mimicking the biased teacher leads to overfitting its biases; the penalty is essential for effective and safe knowledge transfer.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully identifies and addresses a critical trade-off in recommender systems: performance on factual versus counterfactual tests. The proposed BPL framework provides a principled and effective solution. By combining adaptive teacher distillation with iterative self-refinement, BPL trains a single model that achieves state-of-the-art performance in both evaluation settings. The key innovations (the dual distillation strategy, the confidence penalty, and the reliability filter) are shown to be crucial for its success.
- Limitations & Future Work (Inferred):
  - Dependence on Affinity Estimator: The performance of BPL, especially the hard variant, relies on the quality of the $S^1$-affinity estimator. A poor estimator could lead to suboptimal balancing of the two distillation strategies.
  - Computational Cost: The framework involves training a discriminator and performing distillation, which may add training overhead compared to standard training, although the inference cost remains the same.
  - Applicability to Implicit Feedback: The paper focuses on explicit feedback (ratings). Adapting the framework to implicit feedback (clicks, views), where there are no explicit negative signals, would be a valuable next step.
- Personal Insights & Critique:
- Strong Theoretical Grounding: Grounding the method in risk minimization theory is a major strength. It moves beyond heuristic solutions and provides a clear, structured argument for why each component of the model is necessary.
- Elegant and Insightful Solution: The dual distillation strategy is an elegant way to solve the trade-off problem. The insight that a biased model's knowledge is useful but dangerous, and therefore needs to be handled with a "confidence penalty," is particularly clever and effective.
- High Practical Relevance: By not requiring expensive, fully randomized training data and focusing on a problem with direct business implications (balancing short-term metrics with long-term satisfaction), this work is highly relevant for real-world recommender system deployment. It offers a tangible path toward building more robust and satisfying recommendation services.
- Open Questions: How would BPL perform in a live A/B test? Does the improved counterfactual performance translate to measurable gains in user engagement, retention, or discovery of novel items over the long term? Answering these questions would be the ultimate validation of the approach.