Bandits with Ranking Feedback
TL;DR Summary
This paper introduces a novel variant of multi-armed bandits that provides ranking feedback instead of numerical rewards, suitable for scenarios like human preferences. It explores no-regret algorithms in stochastic and adversarial settings, proving the impossibility of logarithmic instance-dependent regret, providing two algorithms (DREE and R-LPE) that match the new bounds, and showing that sublinear regret is impossible under adversarial rewards.
Abstract
In this paper, we introduce a novel variation of multi-armed bandits called bandits with ranking feedback. Unlike traditional bandits, this variation provides feedback to the learner that allows them to rank the arms based on previous pulls, without quantifying numerically the difference in performance. This type of feedback is well-suited for scenarios where the arms’ values cannot be precisely measured using metrics such as monetary scores, probabilities, or occurrences. Common examples include human preferences in matchmaking problems. Furthermore, its investigation answers the theoretical question on how numerical rewards are crucial in bandit settings. In particular, we study the problem of designing no-regret algorithms with ranking feedback both in the stochastic and adversarial settings. We show that, with stochastic rewards, differently from what happens with non-ranking feedback, no algorithm can suffer a logarithmic regret in the time horizon T in the instance-dependent case. Furthermore, we provide two algorithms. The first, namely DREE, guarantees a superlogarithmic regret in T in the instance-dependent case thus matching our lower bound, while the second, namely R-LPE, guarantees a regret of Ō(√T) in the instance-independent case. Remarkably, we show that no algorithm can have an optimal regret bound in both instance-dependent and instance-independent cases. Finally, we prove that no algorithm can achieve a sublinear regret when the rewards are adversarial.
In-depth Reading
1. Bibliographic Information
1.1. Title
Bandits with Ranking Feedback
1.2. Authors
- Davide Maran* (Politecnico di Milano)
- Francesco Bacchiocchi* (Politecnico di Milano)
- Francesco Emanuele Stradi* (Politecnico di Milano)
- Matteo Castiglioni (Politecnico di Milano)
- Nicola Gatti (Politecnico di Milano)
- Marcello Restelli (Politecnico di Milano) * Equal contribution
1.3. Journal/Conference
NeurIPS 2023 (Advances in Neural Information Processing Systems). Comment: NeurIPS is one of the most prestigious and influential top-tier conferences in the fields of machine learning and artificial intelligence.
1.4. Publication Year
2024 (Published in proceedings for the 2023 conference year).
1.5. Abstract
This paper introduces a novel variant of the Multi-Armed Bandit (MAB) problem called Bandits with Ranking Feedback. In this setting, the learner does not observe the numerical value of the rewards but only receives a ranking of the arms based on their empirical means from previous pulls. This models scenarios where precise quantification of rewards is impossible (e.g., human preferences) or hidden (e.g., privacy in advertising). Key contributions include:
- Stochastic Setting: Proving that unlike standard bandits, logarithmic regret is impossible in the instance-dependent case.
- Algorithms: Proposing DREE (matching the super-logarithmic lower bound) and R-LPE (achieving instance-independent regret for Gaussian rewards).
- Trade-off: Proving no algorithm can simultaneously be optimal for both instance-dependent and instance-independent regret.
- Adversarial Setting: Proving that no sublinear regret is achievable without statistical assumptions.
1.6. Original Source Link
2. Executive Summary
2.1. Background & Motivation
The Multi-Armed Bandit (MAB) problem is a classic framework for decision-making under uncertainty, balancing exploration (trying new options) and exploitation (sticking to the best known option).
- The Gap: Traditional MAB algorithms rely on observing numerical rewards (e.g., receiving a reward of "0.8" or "5 dollars"). However, in many real-world applications, numerical values are unavailable or unreliable.
- Human-in-the-loop: Humans are notoriously bad at assigning consistent numerical scores (e.g., "Rate this movie 7.2/10") but are excellent at ranking (e.g., "I liked movie A better than movie B").
- Privacy/Security: In online advertising, a platform might see which ad gets more clicks (ranking) but not the actual revenue (numerical value) because advertisers hide sensitive financial data.
- The Innovation: The paper formalizes Bandits with Ranking Feedback. Unlike "Dueling Bandits" (which compare two arms at a time), here the learner receives a full ranking of all arms based on their historical empirical means after every pull.
- Core Question: How crucial is numerical information? Does removing it prevent us from achieving the efficient learning rates (regret bounds) seen in standard bandits?
2.2. Main Contributions / Findings
- Theoretical Hardness: The authors prove that the lack of numerical magnitude makes the problem strictly harder. In the stochastic setting, it is theoretically impossible to achieve the logarithmic instance-dependent regret that is standard in classic bandits; the lower bound is super-logarithmic.
- Algorithmic Solutions:
- DREE (Dynamical Ranking Exploration-Exploitation): An algorithm designed for the instance-dependent case. It achieves a super-logarithmic regret bound (e.g., $\log(T)^{1+\delta}$ for any $\delta > 0$), matching the theoretical lower bound.
- R-LPE (Ranking Logarithmic Phased Elimination): An algorithm designed for the instance-independent case (worst-case gaps). It achieves near-optimal $\tilde{O}(\sqrt{T})$ regret for Gaussian rewards, utilizing techniques involving discretized Brownian motions.
- Impossibility Results:
- Trade-off: There is a fundamental conflict: an algorithm cannot have both "good" instance-dependent regret (near logarithmic) and "good" instance-independent regret (sublinear) at the same time.
- Adversarial Failure: If rewards are adversarial (not drawn from a fixed distribution), no algorithm can learn effectively (sublinear regret is impossible). This implies some statistical structure (stochasticity) is strictly necessary for learning from rankings.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Multi-Armed Bandit (MAB): A gambler stands before slot machines (arms). Each arm provides a reward drawn from an unknown probability distribution. The goal is to maximize total reward over rounds.
- Regret ($R_T$): The performance metric. It is the difference between the total reward of the optimal arm (if we knew it beforehand) and the total reward actually collected by the algorithm.
- Instance-Dependent Regret: Depends on the specific differences (gaps, $\Delta_i$) between the arm means. Ideally $O(\log T)$ in standard bandits.
- Instance-Independent Regret: A worst-case bound valid for any valid bandit instance. Ideally $O(\sqrt{T})$.
- Sublinear Regret: If $R_T / T \to 0$ as $T \to \infty$ (i.e., $R_T = o(T)$), the algorithm is "learning" and converging to the optimal strategy. Linear regret ($R_T = \Theta(T)$) means the algorithm failed to learn.
- Random Walk & Drift: A stochastic process where the position at time $t$ is the sum of $t$ independent steps. If the steps have a non-zero mean, the walk has a "drift". This concept is crucial because the cumulative reward of an arm behaves like a random walk.
- Ranking vs. Dueling:
- Dueling Bandits: The learner pulls two arms and sees which one won in that specific round.
- Ranking Feedback (This Paper): The learner pulls one arm. The system updates the empirical average of that arm. Then, the learner sees a ranking of all arms based on their current averages. This feedback is history-dependent and highly correlated over time.
3.2. Previous Works
- Standard Bandits: Algorithms like UCB1 (Upper Confidence Bound) and Thompson Sampling rely on numerical estimates to build confidence intervals. These are not directly applicable here because the learner cannot calculate the variance or magnitude of rewards.
- Explore-Then-Commit (ETC): A simple strategy: explore all arms uniformly for a fixed number of rounds, then pick the best one forever. This works with ranking feedback but yields a suboptimal regret of order $T^{2/3}$.
- Preference Learning & Dueling Bandits: Works by Yue et al. (2012) and others study learning from pairwise comparisons. The key difference is that Dueling Bandits observe instantaneous preferences, while this paper observes rankings of aggregated histories (empirical means).
3.3. Differentiation Analysis
The paper's setting is unique because the feedback is aggregated and relative.
- Unlike standard bandits, you don't know how much better arm A is than B.
- Unlike dueling bandits, the ranking depends on the entire history of pulls, meaning the feedback received at time $t$ is heavily correlated with the feedback at time $t-1$. This breaks many standard martingale-based proof techniques, requiring novel analysis using Brownian motion approximations.
4. Methodology
4.1. Principles
The core challenge is that without numerical values, the learner cannot estimate "confidence bounds" (how uncertain am I about this arm?).
- For Instance-Dependent Learning (DREE): The strategy forces exploration based on a deterministic function $f(t)$. Since we cannot measure uncertainty, we explore "enough" to ensure that, eventually, the law of large numbers guarantees the true ranking emerges.
- For Instance-Independent Learning (R-LPE): The strategy uses an elimination tournament. We group arms, pull them until we are statistically confident (using Brownian motion properties) that the losers are truly worse, and then discard them.
4.2. Model Formulation
At each round $t = 1, \dots, T$:
- The learner chooses an arm $i_t$.
- The environment draws a hidden reward $r_t(i_t)$ from the (unknown) distribution associated with that arm.
- The system updates the empirical mean for arm $i_t$: $ \hat{r}_t(i) := \frac{\sum_{j \in \mathcal{W}_t(i)} r_j(i)}{Z_i(t)} $ where $\mathcal{W}_t(i)$ is the set of rounds in which arm $i$ was pulled, and $Z_i(t)$ is the number of those pulls.
- Feedback: The learner observes a ranking $\mathcal{R}_t$ such that if arm $i$ is ranked higher than arm $j$, then $\hat{r}_t(i) \ge \hat{r}_t(j)$.
- Goal: Minimize the Cumulative Expected Regret: $ R_T(\pi) = \mathbb{E}\left[ \sum_{t=1}^T (r_t(i^*) - r_t(i_t)) \right] $ where $i^*$ is the arm with the highest expected mean.
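To make the interaction protocol concrete, here is a minimal Python sketch of a ranking-feedback environment with Gaussian rewards. The class name and interface are illustrative, not the authors' code; the true means and rewards stay hidden from the learner, which only ever sees the returned ranking.

```python
import numpy as np

class RankingBanditEnv:
    """Ranking-feedback bandit: the learner never observes rewards,
    only a ranking of the arms by their current empirical means."""

    def __init__(self, means, sigma=1.0, seed=0):
        self.means = np.asarray(means, dtype=float)  # true (hidden) arm means
        self.sigma = sigma                           # reward standard deviation
        self.rng = np.random.default_rng(seed)
        n = len(self.means)
        self.sums = np.zeros(n)                      # hidden cumulative rewards
        self.counts = np.zeros(n)                    # Z_i(t): pulls per arm
        self.regret = 0.0                            # cumulative pseudo-regret (evaluation only)

    def pull(self, i):
        r = self.rng.normal(self.means[i], self.sigma)   # hidden numerical reward
        self.sums[i] += r
        self.counts[i] += 1
        self.regret += self.means.max() - self.means[i]  # accumulate the gap Delta_i
        # Feedback R_t: arm indices sorted by empirical mean, best first;
        # arms never pulled are placed last.
        emp = np.where(self.counts > 0,
                       self.sums / np.maximum(self.counts, 1), -np.inf)
        return np.argsort(-emp)
```

The array returned by `pull` is the ranking $\mathcal{R}_t$ and is the only information the algorithms sketched below may use.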
4.3. Algorithm 1: DREE (Dynamical Ranking Exploration-Exploitation)
This algorithm is designed to match the instance-dependent lower bound. It does not require knowledge of the horizon $T$.
4.3.1. The Exploration Function
The algorithm relies on a user-defined function $f(t)$ that determines the minimum number of times each arm must have been pulled by time $t$. The theoretical analysis suggests a super-logarithmic function, such as:
$
f(t) = \log(t)^{1+\delta} \quad \text{where } \delta > 0
$
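For a sense of scale (an illustrative calculation, not taken from the paper), at $T = 10^6$ and with natural logarithms, the super-logarithmic exploration budget per arm is still tiny compared to $\sqrt{T}$:
$
\log(10^6) \approx 13.8, \qquad \log(10^6)^{1.25} \approx 26.6, \qquad \log(10^6)^{2} \approx 191, \qquad \sqrt{10^6} = 1000,
$
corresponding to $\delta = 0.25$ and $\delta = 1$ respectively; smaller $\delta$ keeps the forced exploration closer to the (unattainable) logarithmic rate.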
4.3.2. Algorithmic Flow
For each time step $t$ from 1 to $T$:
- Initialization: If $t \le n$ (where $n$ is the number of arms), pull each arm once to initialize.
- Forced Exploration: Check whether any arm $i$ has been pulled fewer than $f(t)$ times (i.e., $Z_i(t) < f(t)$). If yes, play that arm $i$ (ties broken arbitrarily).
- Exploitation: If all arms have been pulled at least $f(t)$ times, observe the previous ranking and play the arm currently ranked first.
- Update: Receive the new ranking $\mathcal{R}_t$ based on the updated empirical means.
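A minimal Python sketch of this loop, assuming the RankingBanditEnv interface sketched in Section 4.2 (names and default values are illustrative):

```python
import math

def dree(env, n, T, delta=0.5):
    """DREE sketch: force every arm up to f(t) = log(t)^(1+delta) pulls,
    otherwise exploit the arm currently ranked first."""
    counts = [0] * n
    ranking = list(range(n))                   # placeholder before any feedback
    for t in range(1, T + 1):
        if t <= n:
            arm = t - 1                        # initialization: pull each arm once
        else:
            f_t = math.log(t) ** (1 + delta)   # super-logarithmic exploration budget
            under = [i for i in range(n) if counts[i] < f_t]
            arm = under[0] if under else ranking[0]  # forced exploration, else exploit
        ranking = env.pull(arm)                # feedback: updated ranking
        counts[arm] += 1
    return ranking
```

Note that $T$ is only used here to stop the simulation; the decision rule itself never uses the horizon, consistent with the description above.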
4.3.3. Theoretical Guarantee (Theorem 2)
If rewards are 1-subgaussian and $f(t)$ is super-logarithmic, the regret is bounded by:
$
R_T \le (1 + f(T)) \sum_{i=1}^n \Delta_i + \log(T) \sum_{i=1}^n C(f, \Delta_i)
$
where $\Delta_i$ is the suboptimality gap of arm $i$ and $C(f, \Delta_i)$ is a constant depending on $f$ and the gap.
- Key Insight: The term $f(T)$ dominates. Since $f(T) = \log(T)^{1+\delta}$ is super-logarithmic, the regret is slightly worse than the standard logarithmic rate of classic bandits, but it matches the proven lower bound for this setting.
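Instantiating the bound with $f(t) = \log(t)^{1+\delta}$ makes this explicit: for any fixed instance (so that the gap-dependent constants $C(f, \Delta_i)$ are fixed),
$
R_T = O\!\left(\log(T)^{1+\delta}\right) \quad \text{as } T \to \infty, \qquad \text{since } \log(T) \sum_{i=1}^n C(f, \Delta_i) = o\!\left(\log(T)^{1+\delta}\right).
$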
4.4. Algorithm 2: R-LPE (Ranking Logarithmic Phased Elimination)
This algorithm achieves near-optimal instance-independent regret of $\tilde{O}(\sqrt{T})$ for Gaussian rewards. It requires knowledge of the horizon $T$.
4.4.1. Core Concepts
- Active Set ($S$): The set of arms currently considered candidates for the best arm. Initially $S$ contains all $n$ arms.
- Loggrid ($\mathcal{L}$): A set of logarithmically spaced checkpoints (pull counts). The algorithm only checks for eliminations when the number of pulls hits a value in this grid (see the sketch after this list): $ \mathcal{L} = LG(1/2, 1, T) \approx \{ \lfloor T^{1/2 + \frac{j}{2\log T}} \rfloor : j = 0, \dots, \log T \} $ This grid spans from roughly $\sqrt{T}$ to $T$.
- Fair Timestep: A timestep $t$ is "fair" if all currently active arms in $S$ have been pulled exactly the same number of times.
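A possible Python reading of this checkpoint grid (illustrative; the exact constants of $LG(1/2, 1, T)$ follow the paper's definition, which is only approximated here):

```python
import math

def loggrid(T):
    """Log-spaced pull-count checkpoints between roughly sqrt(T) and T."""
    logT = math.log(T)
    # exponents 1/2 + j/(2 log T) sweep from 1/2 (j = 0) towards 1 (j = log T)
    return sorted({int(T ** (0.5 + j / (2 * logT))) for j in range(int(logT) + 1)})
```

For $T = 10^6$ this yields about 14 checkpoints, starting around 1000 and increasing geometrically.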
4.4.2. Filtering Condition (Definition 3)
An arm $i$ is kept in the active set only if it performs "well enough" against all other survivors. Specifically, at a fair timestep $t$, the new active set keeps arm $i$ if: $ \forall j \in S, \quad \sum_{\tau = 1,\ \tau \text{ fair}}^{t} \mathbb{I}\{ \mathcal{R}_\tau(i) > \mathcal{R}_\tau(j) \} \ge \zeta $
- Interpretation: Arm $i$ must have been ranked higher than arm $j$ at least $\zeta$ times over the fair timesteps of the history. If $i$ loses to some $j$ too often, $i$ is eliminated.
- Threshold $\zeta$: The threshold is dynamic, defined as $\zeta = T^{2\alpha}$, where $\alpha$ depends on the current number of pulls (see the elimination check below).
4.4.3. Algorithmic Flow
- Initialize: Active set $S = \{1, \dots, n\}$, Loggrid $\mathcal{L}$ as defined above.
- Loop for $t = 1$ to $T$:
  - Selection Strategy: Play the arm in $S$ that has been pulled the minimum number of times so far ($i_t \in \arg\min_{i \in S} Z_i(t)$). This enforces the "fairness" of pulls among survivors.
  - Update: Update the pull count and observe the new ranking $\mathcal{R}_t$.
  - Elimination Check:
    - Check whether the minimum number of pulls $\min_{i \in S} Z_i(t)$ is in the Loggrid $\mathcal{L}$.
    - If yes (checkpoint):
      - Calculate the threshold exponent: $ \alpha = \frac{\log(\min_{i \in S} Z_i(t))}{\log(T)} - \frac{1}{2} $
      - Update the active set by applying the filtering condition: $ S = \mathcal{F}_t(T^{2\alpha}) $
      - (Arms not in the new $S$ are permanently eliminated.)
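A simplified Python sketch of this flow, again assuming the RankingBanditEnv interface from Section 4.2. Tie handling, the exact grid constants, and corner cases are simplified relative to the paper's full pseudocode.

```python
import math
import numpy as np

def rlpe(env, n, T):
    """R-LPE sketch: keep pulls balanced over an active set and, at
    log-spaced checkpoints, eliminate arms that were too rarely ranked
    above some surviving competitor at 'fair' timesteps."""
    logT = math.log(T)
    grid = {int(T ** (0.5 + j / (2 * logT))) for j in range(int(logT) + 1)}
    active = list(range(n))
    counts = np.zeros(n, dtype=int)
    wins = np.zeros((n, n), dtype=int)   # wins[i, j]: fair timesteps with i ranked above j

    for t in range(1, T + 1):
        arm = min(active, key=lambda i: counts[i])    # least-pulled active arm
        ranking = env.pull(arm)                       # feedback: ranking of all arms
        counts[arm] += 1
        if len({int(counts[i]) for i in active}) == 1:  # fair timestep: equal pull counts
            pos = {int(a): p for p, a in enumerate(ranking)}
            for i in active:
                for j in active:
                    if i != j and pos[i] < pos[j]:    # i currently ranked above j
                        wins[i, j] += 1
        m = int(min(counts[i] for i in active))
        if m in grid and len(active) > 1:             # elimination checkpoint
            alpha = math.log(m) / logT - 0.5
            zeta = T ** (2 * alpha)                   # dynamic threshold T^(2*alpha)
            survivors = [i for i in active
                         if all(wins[i, j] >= zeta for j in active if j != i)]
            # simplification not in the paper: never let the active set become empty
            active = survivors or [int(ranking[0])]
    return active
```

The elimination step mirrors Definition 3: an arm survives only if, against every other survivor, it was ranked higher at least $\zeta$ times over the fair timesteps recorded so far.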
4.4.4. Theoretical Guarantee (Theorem 4)
For Gaussian rewards, R-LPE achieves: $ R_T \le 62 n^4 \log(T)^2 \sqrt{T} $
- Key Insight: This breaks the $O(T^{2/3})$ barrier of the baseline Explore-Then-Commit algorithm. The proof relies on modeling the difference in empirical means as a discretized Brownian motion and using properties of the "occupation time" (how long a Brownian motion with drift stays positive).
5. Experimental Setup
5.1. Datasets
The authors use synthetic environments to strictly control the properties of the reward distributions.
- Distribution: Gaussian rewards with unit variance ($\sigma^2 = 1$).
- Time Horizon ($T$): a fixed number of rounds per run.
- Scenarios:
  - Small Gaps: instances whose minimum suboptimality gap $\Delta_{\min}$ is small, so the arms are harder to distinguish.
  - Large Gaps: instances whose minimum suboptimality gap $\Delta_{\min}$ is large, so the arms are easier to distinguish.
5.2. Evaluation Metrics
- Cumulative Regret ($R_T$):
  - Definition: The total loss incurred by not playing the optimal arm at every step.
  - Formula: $ R_T = \sum_{t=1}^T (r_t(i^*) - r_t(i_t)) $ (Note: in the stochastic setting, the paper plots the expected regret, i.e., the sum of the gaps $\Delta_{i_t}$.)
5.3. Baselines
- EC (Explore-Then-Commit): A standard algorithm that explores all arms uniformly for a fixed number of rounds and then commits to the arm with the highest empirical rank. This serves as the "naive" benchmark (a sketch follows below).
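A minimal sketch of the EC baseline under the same environment interface. The exploration length is a tunable knob; the $T^{2/3}/n$ choice below is an assumed standard tuning, not taken from the paper.

```python
def explore_then_commit(env, n, T, explore_per_arm=None):
    """EC sketch: pull every arm the same number of times, then commit
    to the arm ranked first for the remaining rounds."""
    if explore_per_arm is None:
        explore_per_arm = max(1, int(T ** (2 / 3) / n))  # assumed tuning
    ranking = list(range(n))
    t = 0
    for i in range(n):                        # uniform exploration phase
        for _ in range(explore_per_arm):
            if t >= T:
                return ranking
            ranking = env.pull(i)
            t += 1
    best = ranking[0]                         # commit to the top-ranked arm
    for _ in range(T - t):
        env.pull(best)
    return ranking
```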
6. Results & Analysis
6.1. Core Results Analysis
The following analysis refers to the experimental results presented in Figures 1-4 of the original paper.
6.1.1. Performance in Small Gap Settings (Hard)
The following figure (Figure 1 from the original paper) shows the cumulative regret when the gaps are small.
(Figure 1: cumulative regret over time; the curves show EC, DREE with different parameter values, and R-LPE in different colors.)
- Analysis:
- DREE (Orange/Green curves): Shows nearly linear growth (poor performance). This confirms the theoretical sensitivity: when the gaps are small, the gap-dependent constants $C(f, \Delta_i)$ in DREE's bound explode, causing huge regret.
- R-LPE (Purple curve): Shows sublinear growth (good performance), significantly outperforming DREE and EC. This validates its instance-independent property: it handles small gaps robustly without exploding regret.
- EC (Blue curve): Performs poorly, growing almost linearly, confirming that the naive strategy is inefficient compared to R-LPE's $\tilde{O}(\sqrt{T})$ guarantee.
6.1.2. Performance in Large Gap Settings (Easy)
The following figure (Figure 3 from the original paper) shows the cumulative regret when the gaps are large.
(Figure 3: cumulative regret over time for EC, DREE with different values of $\delta$, and R-LPE, reflecting how the algorithms compare in this large-gap instance.)
- Analysis:
- DREE: Now performs the best (lowest regret). With large gaps, the gap-dependent constants are small and the asymptotic $f(T)$ behavior dominates, leading to very efficient learning.
- Parameter $\delta$: The plot shows that smaller $\delta$ leads to better performance, as the exploration function $f(t) = \log(t)^{1+\delta}$ grows more slowly, staying closer to the optimal logarithmic rate.
- R-LPE: Performs worse than DREE here. Its conservative elimination strategy (designed for the worst case) is too slow when the optimal arm is obviously better.
6.2. The Trade-off Verification
The experiments empirically confirm the theoretical "Trade-off" (Theorem 3):
- DREE is superior when gaps are large (Instance-Dependent regime).
- R-LPE is superior when gaps are small (Instance-Independent regime).
- No single algorithm wins in both cases, supporting the impossibility result that one cannot simultaneously optimize for both regimes in the ranking feedback setting.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper rigorously establishes that magnitude matters. Removing numerical rewards and relying only on ranking feedback fundamentally changes the hardness of the Multi-Armed Bandit problem.
- Harder Learning: Logarithmic regret is theoretically impossible; the best achievable is super-logarithmic.
- New Algorithms: The authors provide DREE (for easy/instance-dependent cases) and R-LPE (for hard/instance-independent cases), both with strong theoretical guarantees matching the new hardness limits.
- Separation: There is a proven dichotomy: you must choose between optimizing for the "easy" case or the "worst" case; you cannot have an algorithm that adapts perfectly to both (unlike in standard bandits).
7.2. Limitations & Future Work
- Gaussian Assumption: The proof for R-LPE relies on properties of Brownian motion, strictly limiting its theoretical guarantee to Gaussian rewards. Generalizing this to other distributions (e.g., Bernoulli) is an open challenge.
- Computational Complexity: R-LPE involves complex history tracking and filtering conditions, which might be computationally heavier than simple index-based policies like UCB.
- Adversarial Setting: The paper proves a negative result (no sublinear regret) but does not explore if slight relaxations (e.g., limited adversarial changes) could allow for learning.
7.3. Personal Insights & Critique
- Value of Information: This paper is a beautiful theoretical case study on the "Value of Information." It quantifies exactly how much "learning power" is lost when we lose the scale of rewards. The shift from logarithmic to super-logarithmic regret is a precise price tag on numerical magnitude.
- Brownian Motion Technique: The use of Brownian motion discretization and occupation times (Takács' theorem) to analyze the "ranking" of empirical means is a highly innovative technical approach. It bridges continuous stochastic processes and discrete bandit algorithms in a way that could be applied to other "relative feedback" problems.
- Practical Relevance: The setting is highly relevant for RLHF (Reinforcement Learning from Human Feedback). As LLMs rely more on ranking outputs (Model A is better than Model B), understanding the fundamental limits of learning from rankings (as opposed to scores) is crucial for efficient model alignment.