
Mitigating Exposure Bias in Online Learning to Rank Recommendation: A Novel Reward Model for Cascading Bandits

Published: 10/20/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study addresses exposure bias in recommender systems, introducing an Exposure-Aware reward model for online learning-to-rank via Linear Cascading Bandits. It adjusts item utility based on position in the recommendation list, improving exposure fairness while maintaining accuracy and outperforming current baselines.

Abstract

Exposure bias is a well-known issue in recommender systems where items and suppliers are not equally represented in the recommendation results. This bias becomes particularly problematic over time as a few items are repeatedly over-represented in recommendation lists, leading to a feedback loop that further amplifies this bias. Although extensive research has addressed this issue in model-based or neighborhood-based recommendation algorithms, less attention has been paid to online recommendation models, such as those based on top-K contextual bandits. In this paper, we study exposure bias in a class of well-known contextual bandit algorithms known as Linear Cascading Bandits. We propose an Exposure-Aware reward model that mitigates exposure bias by controlling the utility assigned to the items based on their exposure in the recommendation list. Our experiments with real-world datasets show that our proposed model improves exposure fairness while maintaining recommendation accuracy and outperforms current baselines.


In-depth Reading


1. Bibliographic Information

1.1. Title

Mitigating Exposure Bias in Online Learning to Rank Recommendation: A Novel Reward Model for Cascading Bandits

1.2. Authors

  • Masoud Mansoury: Delft University of Technology, Delft, The Netherlands.
  • Bamshad Mobasher: DePaul University, Chicago, USA.
  • Herke van Hoof: University of Amsterdam, Amsterdam, The Netherlands.

1.3. Journal/Conference

CIKM 2024 (Proceedings of the 33rd ACM International Conference on Information and Knowledge Management). CIKM is a highly reputable, top-tier international conference in the fields of information retrieval, knowledge management, and database systems.

1.4. Publication Year

2024

1.5. Abstract

The paper addresses exposure bias in recommender systems, a phenomenon where a small subset of items receives disproportionate visibility while the majority are ignored. While existing research focuses on static scenarios, this paper investigates exposure bias in online learning-to-rank systems, specifically Linear Cascading Bandits (CB). The authors analyze how standard CB algorithms fail to mitigate this bias over time. They propose a novel Exposure-Aware (EA) reward model that adjusts the utility (reward) assigned to items based on their position in the recommendation list. Experiments on real-world datasets demonstrate that this model significantly improves exposure fairness without compromising recommendation accuracy, outperforming existing baselines.

DOI: https://doi.org/10.1145/3627673.3679763. Status: Officially Published.

2. Executive Summary

2.1. Background & Motivation

  • The Core Problem: In recommender systems (like Netflix or Amazon), "exposure bias" is a critical issue. Systems tend to repeatedly show popular items to users, creating a feedback loop (the "rich get richer"). This hurts suppliers of niche items (economic unfairness), limits user discovery of diverse content, and biases the data collected for future training.
  • The Gap: Most existing solutions fix this bias in "static" settings (analyzing a single batch of recommendations). However, modern systems use Online Learning to Rank, where the model updates continuously based on real-time user clicks. There is a lack of research on how to fix exposure bias in these dynamic, interactive environments, specifically within the framework of Contextual Bandits.
  • The Innovation: The authors identify that current bandit algorithms treat all "clicks" equally. They hypothesize that a click on an item at the bottom of a list is more valuable than a click at the top, and an ignored item at the top is a stronger negative signal than an ignored item at the bottom. Leveraging this intuition, they propose a new way to calculate rewards.

2.2. Main Contributions / Findings

  1. Analytical Insight: The paper demonstrates that standard Linear Cascading Bandits, despite having built-in exploration mechanisms, fail to achieve long-term exposure fairness.
  2. New Model (EACB): They propose the Exposure-Aware Cascading Bandit (EACB). This model introduces a position-dependent reward function: it gives higher rewards for clicks on lower-ranked items and higher penalties for ignored items at higher ranks.
  3. Theoretical Guarantee: The authors prove a high-probability upper regret bound for their model, ensuring that the pursuit of fairness does not catastrophically harm the system's learning speed or accuracy.
  4. Empirical Success: Experiments on MovieLens and Yahoo Music datasets show that EACB improves fairness metrics (Gini index) significantly compared to baselines while maintaining high click rates.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Recommender Systems (RS): Algorithms designed to suggest items (movies, products) to users.
  • Exposure Bias: The unequal distribution of attention. If Item A is shown 1,000 times and Item B is shown 0 times, the system has high exposure bias.
  • Contextual Bandits: An online machine learning framework. Imagine a gambler at a row of slot machines (bandits).
    • Context: Before pulling a lever, the gambler sees information (e.g., it's a sunny day, the user is 25 years old).
    • Action: The gambler chooses a machine (recommends an item).
    • Reward: The machine pays out (user clicks) or doesn't.
    • Goal: Maximize total reward over time by balancing Exploration (trying new machines to see if they pay out) and Exploitation (sticking to the machine known to pay out best).
  • Cascading Bandits (CB): A specific type of bandit for Learning to Rank. Instead of choosing one item, the algorithm chooses a list of $K$ items.
    • Cascade Assumption: The user scans the list from top (position 1) to bottom. They click the first item they like and stop looking. If they click item 3, it implies they saw and rejected items 1 and 2.
  • Upper Confidence Bound (UCB): A strategy to handle the Exploration-Exploitation trade-off. When estimating how good an item is, the model calculates a "score" plus an "uncertainty bonus."
    • $\text{Score} = \text{Estimated Quality} + \text{Uncertainty}$.
    • If we know little about an item, Uncertainty is high, so the Score is high, and we pick it (Exploration). If we know it's good, Estimated Quality is high, so we pick it (Exploitation).
  • Regret: The difference between the reward the algorithm actually got and the reward it could have gotten if it knew the optimal items to show. Lower regret is better.
  • Gini Index: A statistical measure of inequality. In this paper, it measures how evenly "exposure" is distributed among all available items: 0 indicates perfect equality, while 1 indicates maximal inequality (one item gets all the exposure). (Note: the paper reports 1 - Gini, so higher is better; a toy computation follows this list.)
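To make the 1 - Gini measure concrete, here is a minimal Python sketch (my own illustration, not from the paper) that computes it over a vector of per-item exposure counts using the standard discrete Gini formula; the paper's exact estimator may differ.

```python
import numpy as np

def one_minus_gini(exposures):
    """Return 1 - Gini of an exposure vector (1.0 = perfectly even exposure)."""
    x = np.sort(np.asarray(exposures, dtype=float))  # ascending
    n = len(x)
    if x.sum() == 0:
        return 1.0  # no exposure at all: trivially even
    cum = np.cumsum(x)
    gini = (n + 1 - 2 * cum.sum() / cum[-1]) / n     # discrete Gini on sorted data
    return 1.0 - gini

print(one_minus_gini([1000] + [0] * 9))  # one item gets everything -> ~0.1 (very uneven)
print(one_minus_gini([100] * 10))        # perfectly even exposure -> 1.0
```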

3.2. Previous Works

  • Static Fairness: Previous work addressed fairness in collaborative filtering or matrix factorization models, often via re-ranking (post-processing) or regularization (adding a fairness penalty to the loss function).
  • Online Learning to Rank: Algorithms like CascadeKL-UCB and CascadeLinUCB (Zong et al.) established the baseline for using bandits in ranking. They aim to minimize regret (maximize clicks) but do not explicitly care if some items are never shown.
  • Fairness in Bandits: Some recent works like EARSLinUCB (Jeunen & Goethals) try to fix fairness by shuffling the final list (randomization). FRMLinUCB (Wang et al.) adds a fairness term to the optimization objective.

3.3. Differentiation Analysis

  • vs. Randomization (EARSLinUCB): EARSLinUCB shuffles items after the model ranks them. The authors argue this is a "band-aid" solution.
  • vs. Optimization Constraints (FRMLinUCB): FRMLinUCB changes the objective function.
  • EACB (This Paper): Intervenes at the Reward level. By changing how the model interprets a click (feedback), it fundamentally alters the learning trajectory. It teaches the model that "finding a hidden gem at rank 10" is more valuable than "confirming a popular hit at rank 1," incentivizing the model to promote diverse items naturally.

4. Methodology

4.1. Principles

The core intuition is derived from the Position Bias phenomenon.

  1. Top items are privileged: Users click them partly just because they are at the top. Therefore, a click at position 1 is "less impressive" than a click at position 10.

  2. Top items are scrutinized: If a user sees an item at position 1 and skips it, that is a very strong negative signal. If they skip an item at position 10, it matters less (they might be tired or just satisfied with item 9).

    Standard Cascading Bandits assign a binary reward: 1 for a click, 0 for a skip. The proposed Exposure-Aware Cascading Bandit (EACB) modifies this to weighted values based on position.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. The Standard Cascading Bandit Model

First, we must understand the baseline.

  • $\mathcal{I} = \{1, ..., m\}$ is the set of all items.
  • In each round $t$, the system recommends a list $\mathcal{L}_t$ of size $K$.
  • The user scans from position 1 to $K$.
  • $C_t$ denotes the position of the clicked item. If no click occurs, $C_t = K + 1$.
  • Feedback:
    • If the user clicks the item at position $C_t$, then the items at positions $1, ..., C_t - 1$ are "ignored" (value 0), and the item at $C_t$ is "liked" (value 1).

    • Items after $C_t$ are unobserved.

      The preference weight (feedback) $\mathbf{w}_t$ for the item at position $k$ in the list is defined as: $$\mathbf{w}_t(\mathcal{L}(k)) = \mathbb{1}(C_t = k), \quad \text{where } k \in [1, ..., \min\{K, C_t\}]$$

  • Explanation: $\mathbb{1}(C_t = k)$ is an indicator function: it is 1 if the clicked position is $k$, and 0 otherwise. This is the standard "unweighted" feedback (a small sketch of this feedback rule follows below).
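As a quick illustration of the cascade feedback rule, the following toy Python function (my own, not from the paper) maps a click position to the binary feedback over the examined prefix, with None marking unobserved positions:

```python
def cascade_feedback(click_pos, K):
    """Standard cascading feedback; click_pos is 1-based, K + 1 means no click."""
    feedback = []
    for k in range(1, K + 1):
        if k == click_pos:
            feedback.append(1)      # clicked item
        elif k < click_pos:
            feedback.append(0)      # examined but skipped
        else:
            feedback.append(None)   # below the click: unobserved
    return feedback

print(cascade_feedback(3, 5))  # [0, 0, 1, None, None]
print(cascade_feedback(6, 5))  # no click: [0, 0, 0, 0, 0]
```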

4.2.2. The Exposure-Aware Reward Model

The authors propose modifying $\mathbf{w}_t$ to include position-based weights, defining a new feedback vector $\mathbf{w}_t^{EA}$.

The Proposed Formula: $$\mathbf{w}_t^{EA}(\mathcal{L}(k)) = \mathcal{F}_{t,k} \times \mathbb{1}[C_t = k] - \gamma \mathcal{F}_{t,k} \times \mathbb{1}[C_t < k]$$

Symbol Explanation:

  • $\mathbf{w}_t^{EA}(\mathcal{L}(k))$: The modified reward/penalty for the item at position $k$ in the list $\mathcal{L}$.

  • $\mathcal{F}_{t,k}$: The Exposure-Aware Weight Function. This is a number that depends on the position $k$; it determines how much exposure or importance position $k$ carries.

  • $\mathbb{1}[C_t = k]$: Is this the clicked item? (Returns 1 if yes, 0 if no.)

  • $\mathbb{1}[C_t < k]$: Intended to flag items that were examined but not clicked. A note on the notation: under the cascade assumption, if $C_t = 3$, items 1 and 2 were examined and skipped, item 3 was clicked, and items 4 and below are unobserved; updates only apply to $k \in [1, ..., \min\{K, C_t\}]$. Within that range the skipped items satisfy $k < C_t$, so the condition $C_t < k$ as printed in Eq. 6 never fires and appears to be a typo for $C_t > k$ (or simply "examined but not clicked"). Algorithm 1 (lines 18-22) makes the intent unambiguous: if the item is clicked ($k = C_t$), it receives the positive reward $\mathcal{F}_{t,k}$; if it was examined but not clicked ($k < C_t$), it receives the penalty $-\gamma \mathcal{F}_{t,k}$.
  • $\gamma$: A hyperparameter controlling the severity of the penalty for ignored items.

Weight Functions ($\mathcal{F}_{t,k}$): The paper tests three functions to define the importance of position $k$ (a small illustrative sketch follows this list).

  1. Logarithmic: $\log(1 + k)$. (Standard exposure drop-off.)
  2. Exponential (RBP): $\beta^{k-1}$. (Based on Rank-Biased Precision; models user patience.)
  3. Linear: $\beta \times k$.
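As a rough illustration of how these pieces combine, here is a small Python sketch (my own, not the authors' code) of the weight functions and the exposure-aware feedback over the examined prefix; the parameter values (beta, gamma) and the direction of the RBP weight are assumptions for illustration only.

```python
import math

def weight(k, scheme="log", beta=0.8):
    """Illustrative exposure-aware weight F_{t,k} for 1-based position k."""
    if scheme == "log":
        return math.log(1 + k)      # logarithmic
    if scheme == "rbp":
        return beta ** (k - 1)      # exponential / RBP-style (beta value is an assumption)
    if scheme == "linear":
        return beta * k             # linear
    raise ValueError(scheme)

def ea_feedback(click_pos, K, scheme="log", gamma=0.01):
    """Exposure-aware feedback over the examined prefix (intent of Eq. 6):
    +F_{t,k} at the clicked position, -gamma * F_{t,k} for examined-but-skipped items."""
    out = {}
    for k in range(1, min(K, click_pos) + 1):
        f = weight(k, scheme)
        out[k] = f if k == click_pos else -gamma * f
    return out

print(ea_feedback(click_pos=3, K=10, scheme="log", gamma=0.01))
# {1: -0.0069..., 2: -0.0109..., 3: 1.3862...}  (a deeper click earns a larger reward)
```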

Visualizing the Reward (Figure 1): The following figure (Figure 1 from the original paper) illustrates how rewards differ. In standard CB (blue line), the reward is always 1.0 regardless of position. In EACB (orange/green/red lines), the reward increases as the position index increases (clicks deeper in the list are rewarded more).

Figure 1 (from the original paper): Reward distribution of CB and EACB for different weight functions when a click is observed at varying positions in the list; the three weight functions (logarithmic, exponential, linear) produce different trends as the click position varies.

4.2.3. The EACB Learning Algorithm (Algorithm 1)

The authors integrate this reward model into the LinUCB (Linear Upper Confidence Bound) framework.

Step 1: Initialization

  • Initialize the matrix $\mathbf{M} = \lambda \mathbf{I}$ (identity matrix scaled by $\lambda$) and the vector $\mathbf{B} = \mathbf{0}$. These store the historical data for the ridge regression.

Step 2: Estimation (Ridge Regression). In each round $t$, estimate the user preference vector $\hat{\theta}_t$ using the closed-form ridge regression solution: $$\hat{\theta}_t = (X_t^T X_t + \lambda I)^{-1} X_t^T \hat{W}_t$$

  • In the algorithm loop (Line 4), the paper computes this incrementally as: $\hat{\theta}_t \leftarrow \sigma^{-2} \mathbf{M}_{t-1}^{-1} \mathbf{B}_{t-1}$

Step 3: Item Selection (UCB). Calculate the Upper Confidence Bound score for every item $i$: $$U_t(i) = \hat{\theta}_t x_i^T + \alpha \sqrt{x_i \mathbf{M}_{t-1}^{-1} x_i^T}$$

  • $\hat{\theta}_t x_i^T$: Exploitation term (estimated attractiveness).

  • $\sqrt{\cdot}$: Exploration term (uncertainty/variance).

  • $\alpha$: Exploration parameter.

    Select the top $K$ items with the highest $U_t(i)$ scores to form the list $\mathcal{L}_t$ (a short sketch of this step follows below).
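A minimal NumPy sketch of Steps 2-3, assuming item features are stored as rows of a matrix X; the parameter values and variable names here are my own and not taken from the paper's implementation.

```python
import numpy as np

def select_list(X, M, B, K, alpha=0.25, sigma=1.0):
    """Rank items by a LinUCB-style score U_t(i) and return the top-K indices.

    X: (m, d) item feature matrix; M: (d, d) accumulated matrix; B: (d,) reward vector.
    """
    M_inv = np.linalg.inv(M)
    theta_hat = (sigma ** -2) * M_inv @ B                      # Step 2: ridge estimate
    exploit = X @ theta_hat                                    # estimated attractiveness
    explore = np.sqrt(np.einsum("ij,jk,ik->i", X, M_inv, X))   # per-item uncertainty
    return np.argsort(-(exploit + alpha * explore))[:K]        # Step 3: top-K by UCB score

# Toy usage at initialization (M = lambda * I, B = 0):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
print(select_list(X, np.eye(10), np.zeros(10), K=5))
```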

Step 4: Observe and Update

  • Display the list and observe the click position $C_t$.
  • Loop through the examined positions $k$ from 1 to $\min\{K, C_t\}$:
    • Update the covariance matrix: $\mathbf{M}_t \leftarrow \mathbf{M}_{t-1} + \sigma^{-2} x_{\mathcal{L}(k)}^T x_{\mathcal{L}(k)}$
    • Update the reward vector $\mathbf{B}$ (the crucial step; see the sketch after this list):
      • If the item was clicked ($C_t = k$): $\mathbf{B} \leftarrow \mathbf{B} + x_{\mathcal{L}(k)} \times \mathcal{F}_{t,k}$ (add the feature vector scaled by the position weight).
      • If the item was examined but not clicked ($C_t \neq k$): $\mathbf{B} \leftarrow \mathbf{B} - x_{\mathcal{L}(k)} \times (\gamma \times \mathcal{F}_{t,k})$ (subtract the feature vector scaled by the penalized position weight).
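Continuing the earlier sketches (NumPy imported and weight() defined above), here is a hedged sketch of this Step 4 update as described in this analysis; it is not the authors' code.

```python
def update(M, B, X, ranking, click_pos, scheme="log", gamma=0.01, sigma=1.0):
    """Exposure-aware update of (M, B) after observing the click position.

    ranking: indices of the K recommended items (top to bottom);
    click_pos: 1-based click position, or K + 1 if no click was observed.
    """
    K = len(ranking)
    for k in range(1, min(K, click_pos) + 1):          # examined prefix only
        x = X[ranking[k - 1]]
        M = M + (sigma ** -2) * np.outer(x, x)         # covariance update
        f = weight(k, scheme)                          # position weight F_{t,k}
        if k == click_pos:
            B = B + x * f                              # clicked: position-weighted reward
        else:
            B = B - x * (gamma * f)                    # examined but skipped: penalty
    return M, B
```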

4.2.4. Theoretical Guarantee (Regret Bound)

The paper provides a theoretical analysis (Theorem 1) proving that the "regret" (cumulative loss of clicks) grows at a rate of $O(\sqrt{n})$.

  • Meaning: Even with the position-based modifications, the algorithm is guaranteed to eventually learn the optimal ranking. The regret grows more slowly than the number of rounds $n$, so the average error per round goes to zero.
  • Condition: This holds if the penalty parameter $\gamma$ is chosen such that $\gamma \leq \frac{1}{\mathcal{F}_{t,k}} - 1$.

5. Experimental Setup

5.1. Datasets

The experiments use two real-world datasets derived from user ratings (simulated environment).

  1. MovieLens 1M:
    • Content: Movie ratings.
    • Scale: 6,000 users, 4,000 movies, 1 million ratings.
    • Preprocessing: Used the 1,000 most active users. Ratings 4-5 are treated as "likes" (1), others as 0.
  2. Yahoo Music:
    • Content: Song ratings.

    • Scale: 1.8 million users, 136,000 songs.

    • Preprocessing: Used top 1,000 active users and top 1,000 rated songs.

      Simulation Logic: Since live user interaction is not available, the authors use an off-policy simulation. They hold out 50% of the data for testing; if the algorithm recommends Item X and the test set says the user likes Item X, a "click" is simulated (a toy version of this logic is sketched below).
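A toy Python sketch of that simulated click under the cascade assumption (my own simplification; the set-based liked_items structure is an assumption, not from the paper):

```python
def simulate_click(ranking, liked_items):
    """Return the 1-based position of the first recommended item the held-out
    test data marks as liked, or len(ranking) + 1 if no such item exists."""
    for pos, item in enumerate(ranking, start=1):
        if item in liked_items:
            return pos                   # user clicks the first liked item and stops
    return len(ranking) + 1              # no click observed

# Toy example: the test set says this user likes items 7 and 42.
print(simulate_click([3, 42, 9, 7, 1], liked_items={7, 42}))  # -> 2
```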

5.2. Evaluation Metrics

5.2.1. Accuracy Metrics

  1. Average Number of Clicks ($\overline{clicks}$):

    • Definition: The total clicks received divided by the total opportunities. Measures pure utility.
    • Formula: $$\overline{clicks} = \frac{\#clicks}{|\mathcal{U}| \times n}$$
    • Symbols: $|\mathcal{U}|$ is the number of users, $n$ is the number of rounds.
  2. n-step-regret:

    • Definition: The cumulative difference between the clicks the optimal model would get and what the current model got.
    • Formula: $$R(n) = \mathbb{E}\left[\sum_{t=1}^{n} \mathcal{R}(\mathcal{L}^*, \mathbf{w}_t) - \mathcal{R}(\mathcal{L}_t, \mathbf{w}_t)\right]$$

5.2.2. Exposure Fairness Metrics

The paper measures fairness using the Gini Index (conceptually). They define four specific metrics based on whether they consider the item's position and its merit (quality).

  • Equality: Everyone should get equal exposure.

  • Equity: Exposure should be proportional to merit (good items get more, bad items get less).

    The metrics are reported as 1 - Gini, so 1.0 is fairest, 0.0 is unfair.

  • Equality of Binary Exposure (Equality(B)): Just counts whether an item appeared in the list, ignoring position.

  • Equality of Position-based Exposure (Equality(P)): Counts appearances weighted by position ($\frac{1}{\log_2(1+k)}$), so top spots count more.

  • Equity of Binary Exposure (Equity(B)): Like Equality(B), but normalized by the item's "true relevance" score.

  • Equity of Position-based Exposure (Equity(P)): Like Equality(P), but normalized by relevance (a toy computation of these measures follows below).
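Continuing the earlier sketch (one_minus_gini defined above), here is a rough illustration of how position-based exposure could be accumulated and turned into Equality(P)/Equity(P)-style scores; dividing exposure by a relevance score is my reading of the "normalized by merit" idea, not necessarily the paper's exact normalization.

```python
import math
import numpy as np

def add_position_exposure(exposure, ranking):
    """Add position-based exposure 1 / log2(1 + k) for each recommended item."""
    for k, item in enumerate(ranking, start=1):
        exposure[item] += 1.0 / math.log2(1 + k)
    return exposure

exposure = np.zeros(5)                              # toy catalogue of 5 items
add_position_exposure(exposure, [0, 1, 2])          # two recommendation rounds
add_position_exposure(exposure, [0, 1, 3])

relevance = np.array([0.9, 0.8, 0.5, 0.4, 0.3])     # assumed "true" merit scores
print(one_minus_gini(exposure))                     # Equality(P)-style score
print(one_minus_gini(exposure / relevance))         # Equity(P)-style score
```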

5.3. Baselines

  1. LinUCB: The standard Linear Cascading Bandit (no fairness modification).
  2. EARSLinUCB (Exposure-Aware Arm Selection): A post-processing method that shuffles the final list to give lower-ranked items a chance.
  3. FRMLinUCB (Fairness Regret Minimization): Adds a fairness constraint to the optimization objective itself.

6. Results & Analysis

6.1. RQ1: Impact of Exploration on Fairness

The authors first checked whether simply increasing the exploration parameter $\alpha$ in the standard LinUCB would fix fairness (since exploration forces the model to try new things).

Analysis: The following figure (Figure 2 from the original paper) shows the results on MovieLens.

  • Left Plot: Shows exploration decreases over time (as expected).

  • Plots c-f (Fairness): Fairness improves initially but then flattens.

  • Conclusion: Simply increasing $\alpha$ (exploration) does not lead to sustained fairness. It hurts accuracy (clicks go down) without solving the long-term exposure bias problem. This proves a specific intervention like EACB is needed.

    Figure 2: The effect of varying the degree of exploration with $\alpha \in \{0.25, 0.75, 1, 2, 5\}$ on the performance of LinUCB in terms of clicks and exposure bias on the MovieLens dataset for $d = 10$ and $K = 10$. (a) Average exploration across all users; (b) $n$-step-regret as in Eq. 4; (c-f) fairness metrics computed on the accumulated exposure values at each round.

6.2. RQ2: Comparison with Baselines

The proposed EACB model was compared against baselines. The tables below show the results.

Analysis of Table 3 ($K=5$): The following are the results from Table 3 of the original paper. Note that EALinUCB (ours) consistently achieves higher scores (bolded in the original analysis) across almost all fairness metrics compared to LinUCB and the other baselines, often with statistical significance ($\dagger$).

| Method | F | ML clicks | ML Equality(B) | ML Equality(P) | ML Equity(B) | ML Equity(P) | Yahoo clicks | Yahoo Equality(B) | Yahoo Equality(P) | Yahoo Equity(B) | Yahoo Equity(P) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LinUCB | - | 0.2166 | 0.329 | 0.3081 | 0.0506 | 0.0476 | 0.1802 | 0.3602 | 0.3316 | 0.3559 | 0.3278 |
| EARSLinUCB | - | 0.2165 | 0.3257 | 0.3257 | 0.0495 | 0.0495 | 0.1807 | 0.3591 | 0.3591 | 0.3556 | 0.3556 |
| FRMLinUCB | - | 0.2005 | 0.3334 | 0.4586 | 0.0504 | 0.0505 | 0.1721 | 0.3514 | 0.42 | 0.3514 | 0.3501 |
| EALinUCB (ours) | Log | 0.2105 | 0.3799 | 0.524 | 0.055 | 0.0546 | 0.1772 | 0.3662 | 0.5396 | 0.3599 | 0.3641 |
| EALinUCB (ours) | RBP | 0.2166 | 0.329 | 0.472 | 0.0506 | 0.0515 | 0.1814 | 0.365 | 0.5812 | 0.358 | 0.362 |
| EALinUCB (ours) | Linear | 0.2069 | 0.392† | 0.5142 | 0.0545 | 0.0535 | 0.1706 | 0.3661 | 0.5879 | 0.3563 | 0.3582 |

(ML = MovieLens; Yahoo = Yahoo Music.)

Analysis of Table 4 ($K=10$): The following are the results from Table 4 of the original paper. Similar to the $K=5$ case, EALinUCB shows superior performance on the fairness metrics (Equality(P), Equity(P)) while maintaining competitive click rates.

| Method | F | ML clicks | ML Equality(B) | ML Equality(P) | ML Equity(B) | ML Equity(P) | Yahoo clicks | Yahoo Equality(B) | Yahoo Equality(P) | Yahoo Equity(B) | Yahoo Equity(P) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LinUCB | - | 0.3722 | 0.3974 | 0.3656 | 0.0555 | 0.0507 | 0.3154 | 0.4469 | 0.404 | 0.4429 | 0.4002 |
| EARSLinUCB | - | 0.3721 | 0.3974 | 0.3848 | 0.0562 | 0.0541 | 0.3157 | 0.4467 | 0.4467 | 0.4429 | 0.4429 |
| FRMLinUCB | - | 0.3574 | 0.4029 | 0.4157 | 0.0572 | 0.055 | 0.3085 | 0.4517 | 0.4596 | 0.4517 | 0.4498 |
| EALinUCB (ours) | Log | 0.3605 | 0.494 | 0.514 | 0.065 | 0.065 | 0.3073 | 0.495 | 0.5174 | 0.485 | 0.473† |
| EALinUCB (ours) | RBP | 0.3722 | 0.4074 | 0.434 | 0.0555 | 0.0557 | 0.3171 | 0.4489 | 0.553 | 0.4614 | 0.4585 |
| EALinUCB (ours) | Linear | 0.3585 | 0.498 | 0.5062 | 0.0601 | 0.0604 | 0.3008 | 0.4552 | 0.5312 | 0.461 | 0.4542 |

(ML = MovieLens; Yahoo = Yahoo Music.)

Key Observation:

  • EqualityP (Position-based Equality) sees massive gains. E.g., in Table 3 MovieLens, LinUCB is 0.3081, while EALinUCB (Log) is 0.524. This means items are being shown in decent positions much more evenly.
  • Clicks: Surprisingly, the number of clicks often increases or stays very close to the baseline. This suggests that by showing diverse items, the model finds "hidden gems" that users actually like, rather than just spamming popular items.

Visualizing Improvement Over Time: The following figure (Figure 3 from the original paper) plots fairness over 50,000 rounds. EALinUCB (colored lines) maintains a much higher fairness level than LinUCB (blue line) as time progresses.

Figure 3: Comparison of LinUCB and EALinUCB with three weight functions in terms of Equality(P) per round for $d = 10$, $K = 10$, $\alpha = 0.25$, and $\gamma = 0$. At each round $t$, Equality(P) is computed over the accumulated exposure up to round $t$.

Exposure Redistribution: The following figure (Figure 4 from the original paper) is very insightful.

  • X-axis: Items sorted by how much exposure LinUCB gave them (High to Low).

  • Y-axis: How much more exposure EALinUCB gave them ($+100\%$ means double, $-100\%$ means zero).

  • Result: The curve goes up from left to right. This means EALinUCB takes exposure away from the over-exposed items on the left (negative Y values) and gives it to the under-exposed items on the right (positive Y values).

  • Red dots (Clicks): The red dots on the right side indicate that these newly promoted items are actually getting clicks—validating that they are relevant "hidden gems."

    Figure 4: Exposure analysis of our EALinUCB with three different weight functions for $d = 10$ and $K = 10$, on the MovieLens and Yahoo Music datasets. The colorbar shows the percentage increase/decrease in clicks. Items are sorted by their exposure ($E^{(P)}$) under LinUCB in descending order from left to right.

6.3. RQ3: Parameter Sensitivity ($\gamma$)

The parameter $\gamma$ controls how much we punish ignored items.

Analysis: The following figure (Figure 5 from the original paper) shows the trade-off.

  • A small $\gamma$ (e.g., 0.005) improves fairness without hurting clicks.

  • A large $\gamma$ (e.g., 0.2) hurts both clicks and fairness.

  • Reason: If you punish every ignored item too strictly, the model thinks everything is bad (since most items are ignored most of the time). It becomes too pessimistic to recommend anything effectively.

    Figure 5 (from the original paper): The trade-off between clicks and fairness for the baseline (LinUCB) and the logarithmic, exponential, and linear weight functions; the x-axis is the number of clicks and the y-axis is the fairness metric.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully addresses the "rich-get-richer" exposure bias in online learning-to-rank systems. By introducing the Exposure-Aware Cascading Bandit (EACB), the authors provide a mechanism that rewards "hard-earned" clicks (at the bottom of the list) more than "easy" clicks (at the top). This simple but effective change in the reward signal significantly balances item exposure. The method is theoretically grounded with a regret bound and empirically validated to outperform state-of-the-art baselines in fairness while maintaining recommendation accuracy.

7.2. Limitations & Future Work

  • Simulator Reliance: The authors acknowledge that evaluation is done on simulated environments constructed from static datasets (MovieLens, Yahoo). While standard in bandit literature, this doesn't fully capture real-time human psychology (e.g., boredom, changing tastes).
  • Linear Assumption: The model assumes user preferences are a linear combination of features (Linear Cascading Bandit). Real-world preferences are often non-linear.
  • Hyperparameter Sensitivity: The performance depends on careful tuning of γ\gamma (penalty factor).

7.3. Personal Insights & Critique

  • Elegant Simplicity: The beauty of this paper lies in its simplicity. Instead of complex optimization constraints (like FRMLinUCB) or post-hoc shuffling (EARSLinUCB), it fixes the root cause—the feedback signal itself. It "teaches" the model to value discovery.
  • Psychological Alignment: The weighting scheme aligns well with human behavior. We do put more effort into finding things at the bottom of a list, so those clicks should count for more.
  • Transferability: This logic could easily be applied to other areas, such as search engine ranking or ad bidding, where position bias is prevalent.
  • Critique on Formulas: The mathematical notation for the penalty term ($\mathbb{1}[C_t < k]$) in the main text (Eq. 6) was slightly confusing regarding the index bounds, though the algorithmic implementation clarified the intent. Future readers should focus on the Algorithm 1 pseudo-code for implementation details.
