Bandits with Ranking Feedback
TL;DR Summary
This paper introduces a novel variant of multi-armed bandits that provides ranking feedback instead of numerical rewards, suitable for scenarios like human preferences. It explores no-regret algorithms in stochastic and adversarial settings, proving the impossibility of logarithmic instance-dependent regret, providing two algorithms (DREE and R-LPE) that match the new bounds, and showing that sublinear regret is impossible under adversarial rewards.
Abstract
In this paper, we introduce a novel variation of multi-armed bandits called bandits with ranking feedback. Unlike traditional bandits, this variation provides feedback to the learner that allows them to rank the arms based on previous pulls, without quantifying numerically the difference in performance. This type of feedback is well-suited for scenarios where the arms’ values cannot be precisely measured using metrics such as monetary scores, probabilities, or occurrences. Common examples include human preferences in matchmaking problems. Furthermore, its investigation answers the theoretical question on how numerical rewards are crucial in bandit settings. In particular, we study the problem of designing no-regret algorithms with ranking feedback both in the stochastic and adversarial settings. We show that, with stochastic rewards, differently from what happens with non-ranking feedback, no algorithm can suffer a logarithmic regret in the time horizon T in the instance-dependent case. Furthermore, we provide two algorithms. The first, namely DREE, guarantees a superlogarithmic regret in T in the instance-dependent case thus matching our lower bound, while the second, namely R-LPE, guarantees a regret of Ō(√T) in the instance-independent case. Remarkably, we show that no algorithm can have an optimal regret bound in both instance-dependent and instance-independent cases. Finally, we prove that no algorithm can achieve a sublinear regret when the rewards are adversarial.
In-depth Reading
1. Bibliographic Information
1.1. Title
Bandits with Ranking Feedback
1.2. Authors
- Davide Maran* (Politecnico di Milano)
- Francesco Bacchiocchi* (Politecnico di Milano)
- Francesco Emanuele Stradi* (Politecnico di Milano)
- Matteo Castiglioni (Politecnico di Milano)
- Nicola Gatti (Politecnico di Milano)
- Marcello Restelli (Politecnico di Milano) * Equal contribution
1.3. Journal/Conference
NeurIPS 2023 (Advances in Neural Information Processing Systems). Comment: NeurIPS is one of the most prestigious and influential top-tier conferences in the fields of machine learning and artificial intelligence.
1.4. Publication Year
2024 (Published in proceedings for the 2023 conference year).
1.5. Abstract
This paper introduces a novel variant of the Multi-Armed Bandit (MAB) problem called Bandits with Ranking Feedback. In this setting, the learner does not observe the numerical value of the rewards but only receives a ranking of the arms based on their empirical means from previous pulls. This models scenarios where precise quantification of rewards is impossible (e.g., human preferences) or hidden (e.g., privacy in advertising). Key contributions include:
- Stochastic Setting: Proving that unlike standard bandits, logarithmic regret is impossible in the instance-dependent case.
- Algorithms: Proposing DREE (matching the super-logarithmic lower bound) and R-LPE (achieving instance-independent regret for Gaussian rewards).
- Trade-off: Proving no algorithm can simultaneously be optimal for both instance-dependent and instance-independent regret.
- Adversarial Setting: Proving that no sublinear regret is achievable without statistical assumptions.
1.6. Original Source Link
2. Executive Summary
2.1. Background & Motivation
The Multi-Armed Bandit (MAB) problem is a classic framework for decision-making under uncertainty, balancing exploration (trying new options) and exploitation (sticking to the best known option).
- The Gap: Traditional MAB algorithms rely on observing numerical rewards (e.g., receiving a reward of "0.8" or "5 dollars"). However, in many real-world applications, numerical values are unavailable or unreliable.
- Human-in-the-loop: Humans are notoriously bad at assigning consistent numerical scores (e.g., "Rate this movie 7.2/10") but are excellent at ranking (e.g., "I liked movie A better than movie B").
- Privacy/Security: In online advertising, a platform might see which ad gets more clicks (ranking) but not the actual revenue (numerical value) because advertisers hide sensitive financial data.
- The Innovation: The paper formalizes Bandits with Ranking Feedback. Unlike "Dueling Bandits" (which compare two arms at a time), here the learner receives a full ranking of all arms based on their historical empirical means after every pull.
- Core Question: How crucial is numerical information? Does removing it prevent us from achieving the efficient learning rates (regret bounds) seen in standard bandits?
2.2. Main Contributions / Findings
- Theoretical Hardness: The authors prove that the lack of numerical magnitude makes the problem strictly harder. In the stochastic setting, it is theoretically impossible to achieve the logarithmic instance-dependent regret that is standard in classic bandits; the lower bound is super-logarithmic.
- Algorithmic Solutions:
- DREE (Dynamical Ranking Exploration-Exploitation): An algorithm designed for the instance-dependent case. It achieves a super-logarithmic regret bound (e.g., $\log(T)^{1+\delta}$ for any $\delta > 0$), matching the theoretical lower bound.
- R-LPE (Ranking Logarithmic Phased Elimination): An algorithm designed for the instance-independent case (worst-case gaps). It achieves near-optimal $\tilde{O}(\sqrt{T})$ regret for Gaussian rewards, utilizing techniques involving discretized Brownian motions.
- Impossibility Results:
- Trade-off: There is a fundamental conflict: an algorithm cannot have both "good" instance-dependent regret (near logarithmic) and "good" instance-independent regret (sublinear) at the same time.
- Adversarial Failure: If rewards are adversarial (not drawn from a fixed distribution), no algorithm can learn effectively (sublinear regret is impossible). This implies some statistical structure (stochasticity) is strictly necessary for learning from rankings.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Multi-Armed Bandit (MAB): A gambler stands before slot machines (arms). Each arm provides a reward drawn from an unknown probability distribution. The goal is to maximize total reward over rounds.
- Regret ($R_T$): The performance metric. It is the difference between the total reward of the optimal arm (if we knew it beforehand) and the total reward actually collected by the algorithm.
- Instance-Dependent Regret: Depends on the specific differences (gaps, $\Delta_i$) between the arm means. Ideally $O(\log T)$ in standard bandits.
- Instance-Independent Regret: A worst-case bound valid for any valid bandit instance. Ideally $O(\sqrt{T})$.
- Sublinear Regret: If $R_T / T \to 0$ as $T \to \infty$ (i.e., $R_T = o(T)$), the algorithm is "learning" and converging to the optimal strategy. Linear regret ($R_T = \Theta(T)$) means the algorithm failed to learn.
- Random Walk & Drift: A stochastic process where the position at time $t$ is the sum of $t$ independent steps. If the steps have a non-zero mean, the walk has a "drift". This concept is crucial because the cumulative reward of an arm behaves like a random walk.
- Ranking vs. Dueling:
- Dueling Bandits: The learner pulls two arms and sees which one won in that specific round.
- Ranking Feedback (This Paper): The learner pulls one arm. The system updates the empirical average of that arm. Then, the learner sees a ranking of all arms based on their current averages. This feedback is history-dependent and highly correlated over time.
3.2. Previous Works
- Standard Bandits: Algorithms like UCB1 (Upper Confidence Bound) and Thompson Sampling rely on numerical estimates to build confidence intervals. These are not directly applicable here because the learner cannot calculate the variance or magnitude of rewards.
- Explore-Then-Commit (ETC): A simple strategy: explore all arms uniformly for a fixed number of rounds, then pick the best one forever. This works with ranking feedback but yields a suboptimal regret of order $T^{2/3}$.
- Preference Learning & Dueling Bandits: Works by Yue et al. (2012) and others study learning from pairwise comparisons. The key difference is that Dueling Bandits observe instantaneous preferences, while this paper observes rankings of aggregated histories (empirical means).
3.3. Differentiation Analysis
The paper's setting is unique because the feedback is aggregated and relative.
- Unlike standard bandits, you don't know how much better arm A is than B.
- Unlike dueling bandits, the ranking depends on the entire history of pulls, meaning the feedback received at time $t$ is heavily correlated with the feedback at time $t-1$. This breaks many standard martingale-based proof techniques, requiring novel analysis using Brownian motion approximations.
4. Methodology
4.1. Principles
The core challenge is that without numerical values, the learner cannot estimate "confidence bounds" (how uncertain am I about this arm?).
- For Instance-Dependent Learning (DREE): The strategy forces exploration based on a deterministic function $f(t)$. Since we cannot measure uncertainty, we explore "enough" to ensure that, eventually, the law of large numbers guarantees the true ranking emerges.
- For Instance-Independent Learning (R-LPE): The strategy uses an elimination tournament. We group arms, pull them until we are statistically confident (using Brownian motion properties) that the losers are truly worse, and then discard them.
4.2. Model Formulation
At each round $t = 1, \dots, T$:
- The learner chooses an arm $i_t$.
- The environment draws a hidden reward $r_t(i_t)$ from the (unknown) distribution associated with that arm.
- The system updates the empirical mean for arm $i_t$: $ \hat{r}_t(i) := \frac{\sum_{j \in \mathcal{W}_t(i)} r_j(i)}{Z_i(t)} $ where $\mathcal{W}_t(i)$ is the set of rounds in which arm $i$ was pulled, and $Z_i(t)$ is the number of those pulls.
- Feedback: The learner observes a ranking $\mathcal{R}_t$ such that if arm $i$ is ranked higher than arm $j$, then $\hat{r}_t(i) \ge \hat{r}_t(j)$.
- Goal: Minimize the Cumulative Expected Regret: $ R_T(\pi) = \mathbb{E}\left[ \sum_{t=1}^T (r_t(i^*) - r_t(i_t)) \right] $ where $i^*$ is the arm with the highest expected mean.
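To make the interaction protocol concrete, here is a minimal Python sketch of a ranking-feedback environment with Gaussian rewards. The class name and interface are illustrative, not the authors' code; the true means and rewards stay hidden from the learner, which only ever sees the returned ranking.

```python
import numpy as np

class RankingBanditEnv:
    """Ranking-feedback bandit: the learner never observes rewards,
    only a ranking of the arms by their current empirical means."""

    def __init__(self, means, sigma=1.0, seed=0):
        self.means = np.asarray(means, dtype=float)  # true (hidden) arm means
        self.sigma = sigma                           # reward standard deviation
        self.rng = np.random.default_rng(seed)
        n = len(self.means)
        self.sums = np.zeros(n)                      # hidden cumulative rewards
        self.counts = np.zeros(n)                    # Z_i(t): pulls per arm
        self.regret = 0.0                            # cumulative pseudo-regret (evaluation only)

    def pull(self, i):
        r = self.rng.normal(self.means[i], self.sigma)   # hidden numerical reward
        self.sums[i] += r
        self.counts[i] += 1
        self.regret += self.means.max() - self.means[i]  # accumulate the gap Delta_i
        # Feedback R_t: arm indices sorted by empirical mean, best first;
        # arms never pulled are placed last.
        emp = np.where(self.counts > 0,
                       self.sums / np.maximum(self.counts, 1), -np.inf)
        return np.argsort(-emp)
```

The array returned by `pull` is the ranking $\mathcal{R}_t$ and is the only information the algorithms sketched below may use.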
4.3. Algorithm 1: DREE (Dynamical Ranking Exploration-Exploitation)
This algorithm is designed to match the instance-dependent lower bound. It does not require knowledge of the horizon $T$.
4.3.1. The Exploration Function
The algorithm relies on a user-defined function $f(t)$ that determines the minimum number of times each arm must have been pulled by time $t$. The theoretical analysis suggests a super-logarithmic function, such as:
$
f(t) = \log(t)^{1+\delta} \quad \text{where } \delta > 0
$
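For a sense of scale (an illustrative calculation, not taken from the paper), at $T = 10^6$ and with natural logarithms, the super-logarithmic exploration budget per arm is still tiny compared to $\sqrt{T}$:
$
\log(10^6) \approx 13.8, \qquad \log(10^6)^{1.25} \approx 26.6, \qquad \log(10^6)^{2} \approx 191, \qquad \sqrt{10^6} = 1000,
$
corresponding to $\delta = 0.25$ and $\delta = 1$ respectively; smaller $\delta$ keeps the forced exploration closer to the (unattainable) logarithmic rate.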
4.3.2. Algorithmic Flow
For each time step $t$ from 1 to $T$:
- Initialization: If $t \le n$ (where $n$ is the number of arms), pull each arm once to initialize.
- Forced Exploration: Check whether any arm $i$ has been pulled fewer than $f(t)$ times (i.e., $Z_i(t) < f(t)$). If yes, play that arm $i$ (ties broken arbitrarily).
- Exploitation: If all arms have been pulled at least $f(t)$ times, observe the previous ranking and play the arm currently ranked first.
- Update: Receive the new ranking $\mathcal{R}_t$ based on the updated empirical means.
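A minimal Python sketch of this loop, assuming the RankingBanditEnv interface sketched in Section 4.2 (names and default values are illustrative):

```python
import math

def dree(env, n, T, delta=0.5):
    """DREE sketch: force every arm up to f(t) = log(t)^(1+delta) pulls,
    otherwise exploit the arm currently ranked first."""
    counts = [0] * n
    ranking = list(range(n))                   # placeholder before any feedback
    for t in range(1, T + 1):
        if t <= n:
            arm = t - 1                        # initialization: pull each arm once
        else:
            f_t = math.log(t) ** (1 + delta)   # super-logarithmic exploration budget
            under = [i for i in range(n) if counts[i] < f_t]
            arm = under[0] if under else ranking[0]  # forced exploration, else exploit
        ranking = env.pull(arm)                # feedback: updated ranking
        counts[arm] += 1
    return ranking
```

Note that $T$ is only used here to stop the simulation; the decision rule itself never uses the horizon, consistent with the description above.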
4.3.3. Theoretical Guarantee (Theorem 2)
If rewards are 1-subgaussian and $f(t)$ is super-logarithmic, the regret is bounded by:
$
R_T \le (1 + f(T)) \sum_{i=1}^n \Delta_i + \log(T) \sum_{i=1}^n C(f, \Delta_i)
$
where $\Delta_i$ is the suboptimality gap of arm $i$ and $C(f, \Delta_i)$ is a constant depending on $f$ and the gap.
- Key Insight: The term $f(T)$ dominates. Since $f(T) = \log(T)^{1+\delta}$ is super-logarithmic, the regret is slightly worse than the standard logarithmic rate of classic bandits, but it matches the proven lower bound for this setting.
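Instantiating the bound with $f(t) = \log(t)^{1+\delta}$ makes this explicit: for any fixed instance (so that the gap-dependent constants $C(f, \Delta_i)$ are fixed),
$
R_T = O\!\left(\log(T)^{1+\delta}\right) \quad \text{as } T \to \infty, \qquad \text{since } \log(T) \sum_{i=1}^n C(f, \Delta_i) = o\!\left(\log(T)^{1+\delta}\right).
$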
4.4. Algorithm 2: R-LPE (Ranking Logarithmic Phased Elimination)
This algorithm achieves near-optimal instance-independent regret of $\tilde{O}(\sqrt{T})$ for Gaussian rewards. It requires knowledge of the horizon $T$.
4.4.1. Core Concepts
- Active Set ($S$): The set of arms currently considered candidates for the best arm. Initially $S$ contains all $n$ arms.
- Loggrid ($\mathcal{L}$): A set of logarithmically spaced checkpoints (pull counts). The algorithm only checks for eliminations when the number of pulls hits a value in this grid (see the sketch after this list): $ \mathcal{L} = LG(1/2, 1, T) \approx \{ \lfloor T^{1/2 + \frac{j}{2\log T}} \rfloor : j = 0, \dots, \log T \} $ This grid spans from roughly $\sqrt{T}$ to $T$.
- Fair Timestep: A timestep $t$ is "fair" if all currently active arms in $S$ have been pulled exactly the same number of times.
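A possible Python reading of this checkpoint grid (illustrative; the exact constants of $LG(1/2, 1, T)$ follow the paper's definition, which is only approximated here):

```python
import math

def loggrid(T):
    """Log-spaced pull-count checkpoints between roughly sqrt(T) and T."""
    logT = math.log(T)
    # exponents 1/2 + j/(2 log T) sweep from 1/2 (j = 0) towards 1 (j = log T)
    return sorted({int(T ** (0.5 + j / (2 * logT))) for j in range(int(logT) + 1)})
```

For $T = 10^6$ this yields about 14 checkpoints, starting around 1000 and increasing geometrically.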
4.4.2. Filtering Condition (Definition 3)
An arm $i$ is kept in the active set only if it performs "well enough" against all other survivors. Specifically, at a fair timestep $t$, the new active set keeps arm $i$ if: $ \forall j \in S, \quad \sum_{\tau = 1,\ \tau \text{ fair}}^{t} \mathbb{I}\{ \mathcal{R}_\tau(i) > \mathcal{R}_\tau(j) \} \ge \zeta $
- Interpretation: Arm $i$ must have been ranked higher than arm $j$ at least $\zeta$ times over the fair timesteps of the history. If $i$ loses to some $j$ too often, $i$ is eliminated.
- Threshold $\zeta$: The threshold is dynamic, defined as $\zeta = T^{2\alpha}$, where $\alpha$ depends on the current number of pulls (see the elimination check below).
4.4.3. Algorithmic Flow
- Initialize: Active set $S = \{1, \dots, n\}$, Loggrid $\mathcal{L}$ as defined above.
- Loop for $t = 1$ to $T$:
  - Selection Strategy: Play the arm in $S$ that has been pulled the minimum number of times so far ($i_t \in \arg\min_{i \in S} Z_i(t)$). This enforces the "fairness" of pulls among survivors.
  - Update: Update the pull count and observe the new ranking $\mathcal{R}_t$.
  - Elimination Check:
    - Check whether the minimum number of pulls $\min_{i \in S} Z_i(t)$ is in the Loggrid $\mathcal{L}$.
    - If yes (checkpoint):
      - Calculate the threshold exponent: $ \alpha = \frac{\log(\min_{i \in S} Z_i(t))}{\log(T)} - \frac{1}{2} $
      - Update the active set by applying the filtering condition: $ S = \mathcal{F}_t(T^{2\alpha}) $
      - (Arms not in the new $S$ are permanently eliminated.)
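A simplified Python sketch of this flow, again assuming the RankingBanditEnv interface from Section 4.2. Tie handling, the exact grid constants, and corner cases are simplified relative to the paper's full pseudocode.

```python
import math
import numpy as np

def rlpe(env, n, T):
    """R-LPE sketch: keep pulls balanced over an active set and, at
    log-spaced checkpoints, eliminate arms that were too rarely ranked
    above some surviving competitor at 'fair' timesteps."""
    logT = math.log(T)
    grid = {int(T ** (0.5 + j / (2 * logT))) for j in range(int(logT) + 1)}
    active = list(range(n))
    counts = np.zeros(n, dtype=int)
    wins = np.zeros((n, n), dtype=int)   # wins[i, j]: fair timesteps with i ranked above j

    for t in range(1, T + 1):
        arm = min(active, key=lambda i: counts[i])    # least-pulled active arm
        ranking = env.pull(arm)                       # feedback: ranking of all arms
        counts[arm] += 1
        if len({int(counts[i]) for i in active}) == 1:  # fair timestep: equal pull counts
            pos = {int(a): p for p, a in enumerate(ranking)}
            for i in active:
                for j in active:
                    if i != j and pos[i] < pos[j]:    # i currently ranked above j
                        wins[i, j] += 1
        m = int(min(counts[i] for i in active))
        if m in grid and len(active) > 1:             # elimination checkpoint
            alpha = math.log(m) / logT - 0.5
            zeta = T ** (2 * alpha)                   # dynamic threshold T^(2*alpha)
            survivors = [i for i in active
                         if all(wins[i, j] >= zeta for j in active if j != i)]
            # simplification not in the paper: never let the active set become empty
            active = survivors or [int(ranking[0])]
    return active
```

The elimination step mirrors Definition 3: an arm survives only if, against every other survivor, it was ranked higher at least $\zeta$ times over the fair timesteps recorded so far.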
4.4.4. Theoretical Guarantee (Theorem 4)
For Gaussian rewards, R-LPE achieves: $ R_T \le 62 n^4 \log(T)^2 \sqrt{T} $
- Key Insight: This breaks the $O(T^{2/3})$ barrier of the baseline Explore-Then-Commit algorithm. The proof relies on modeling the difference in empirical means as a discretized Brownian motion and using properties of the "occupation time" (how long a Brownian motion with drift stays positive).
5. Experimental Setup
5.1. Datasets
The authors use synthetic environments to strictly control the properties of the reward distributions.
- Distribution: Gaussian rewards with unit variance ($\sigma^2 = 1$).
- Time Horizon ($T$): a fixed number of rounds per run.
- Scenarios:
  - Small Gaps: instances whose minimum suboptimality gap $\Delta_{\min}$ is small, so the arms are harder to distinguish.
  - Large Gaps: instances whose minimum suboptimality gap $\Delta_{\min}$ is large, so the arms are easier to distinguish.
5.2. Evaluation Metrics
- Cumulative Regret ($R_T$):
  - Definition: The total loss incurred by not playing the optimal arm at every step.
  - Formula: $ R_T = \sum_{t=1}^T (r_t(i^*) - r_t(i_t)) $ (Note: in the stochastic setting, the paper plots the expected regret, i.e., the sum of the gaps $\Delta_{i_t}$.)
5.3. Baselines
- EC (Explore-Then-Commit): A standard algorithm that explores all arms uniformly for a fixed number of rounds and then commits to the arm with the highest empirical rank. This serves as the "naive" benchmark (a sketch follows below).
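A minimal sketch of the EC baseline under the same environment interface. The exploration length is a tunable knob; the $T^{2/3}/n$ choice below is an assumed standard tuning, not taken from the paper.

```python
def explore_then_commit(env, n, T, explore_per_arm=None):
    """EC sketch: pull every arm the same number of times, then commit
    to the arm ranked first for the remaining rounds."""
    if explore_per_arm is None:
        explore_per_arm = max(1, int(T ** (2 / 3) / n))  # assumed tuning
    ranking = list(range(n))
    t = 0
    for i in range(n):                        # uniform exploration phase
        for _ in range(explore_per_arm):
            if t >= T:
                return ranking
            ranking = env.pull(i)
            t += 1
    best = ranking[0]                         # commit to the top-ranked arm
    for _ in range(T - t):
        env.pull(best)
    return ranking
```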
6. Results & Analysis
6.1. Core Results Analysis
The following analysis refers to the experimental results presented in Figures 1-4 of the original paper.
6.1.1. Performance in Small Gap Settings (Hard)
The following figure (Figure 1 from the original paper) shows the cumulative regret when the gaps are small.
(Figure 1: cumulative regret over time; the curves show EC, DREE with different parameter values, and R-LPE in different colors.)
- Analysis:
- DREE (Orange/Green curves): Shows nearly linear growth (poor performance). This confirms the theoretical sensitivity: when the gaps are small, the gap-dependent constants $C(f, \Delta_i)$ in DREE's bound explode, causing huge regret.
- R-LPE (Purple curve): Shows sublinear growth (good performance), significantly outperforming DREE and EC. This validates its instance-independent property: it handles small gaps robustly without exploding regret.
- EC (Blue curve): Performs poorly, growing almost linearly, confirming that the naive strategy is inefficient compared to R-LPE's $\tilde{O}(\sqrt{T})$ guarantee.
6.1.2. Performance in Large Gap Settings (Easy)
The following figure (Figure 3 from the original paper) shows the cumulative regret when the gaps are large.
(Figure 3: cumulative regret over time for EC, DREE with different values of $\delta$, and R-LPE, reflecting how the algorithms compare in this large-gap instance.)
- Analysis:
- DREE: Now performs the best (lowest regret). With large gaps, the gap-dependent constants are small and the asymptotic $f(T)$ behavior dominates, leading to very efficient learning.
- Parameter $\delta$: The plot shows that smaller $\delta$ leads to better performance, as the exploration function $f(t) = \log(t)^{1+\delta}$ grows more slowly, staying closer to the optimal logarithmic rate.
- R-LPE: Performs worse than DREE here. Its conservative elimination strategy (designed for the worst case) is too slow when the optimal arm is obviously better.
6.2. The Trade-off Verification
The experiments empirically confirm the theoretical "Trade-off" (Theorem 3):
- DREE is superior when gaps are large (Instance-Dependent regime).
- R-LPE is superior when gaps are small (Instance-Independent regime).
- No single algorithm wins in both cases, supporting the impossibility result that one cannot simultaneously optimize for both regimes in the ranking feedback setting.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper rigorously establishes that magnitude matters. Removing numerical rewards and relying only on ranking feedback fundamentally changes the hardness of the Multi-Armed Bandit problem.
- Harder Learning: Logarithmic regret is theoretically impossible; the best achievable is super-logarithmic.
- New Algorithms: The authors provide DREE (for easy/instance-dependent cases) and R-LPE (for hard/instance-independent cases), both with strong theoretical guarantees matching the new hardness limits.
- Separation: There is a proven dichotomy: you must choose between optimizing for the "easy" case or the "worst" case; you cannot have an algorithm that adapts perfectly to both (unlike in standard bandits).
7.2. Limitations & Future Work
- Gaussian Assumption: The proof for R-LPE relies on properties of Brownian motion, strictly limiting its theoretical guarantee to Gaussian rewards. Generalizing this to other distributions (e.g., Bernoulli) is an open challenge.
- Computational Complexity: R-LPE involves complex history tracking and filtering conditions, which might be computationally heavier than simple index-based policies like UCB.
- Adversarial Setting: The paper proves a negative result (no sublinear regret) but does not explore if slight relaxations (e.g., limited adversarial changes) could allow for learning.
7.3. Personal Insights & Critique
- Value of Information: This paper is a beautiful theoretical case study on the "Value of Information." It quantifies exactly how much "learning power" is lost when we lose the scale of rewards. The shift from logarithmic to super-logarithmic regret is a precise price tag on numerical magnitude.
- Brownian Motion Technique: The use of Brownian motion discretization and occupation times (Takács' theorem) to analyze the "ranking" of empirical means is a highly innovative technical approach. It bridges continuous stochastic processes and discrete bandit algorithms in a way that could be applied to other "relative feedback" problems.
- Practical Relevance: The setting is highly relevant for RLHF (Reinforcement Learning from Human Feedback). As LLMs rely more on ranking outputs (Model A is better than Model B), understanding the fundamental limits of learning from rankings (as opposed to scores) is crucial for efficient model alignment.