Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives


TL;DR Summary

This paper reevaluates the Bradley-Terry model in preference-based reward modeling, establishes a theoretical foundation for convergence using deep neural networks, argues it's not strictly necessary for downstream optimization, proposes an alternative upper-bound (classification-based) algorithm, and validates these approaches empirically across more than 12,000 experimental setups spanning 6 base LLMs and 2 datasets.

Abstract

The Bradley-Terry (BT) model is a common and successful practice in reward modeling for Large Language Model (LLM) alignment. However, it remains unclear why this model -- originally developed for multi-player stochastic game matching -- can be adopted to convert pairwise response comparisons to reward values and make predictions. Especially given the fact that only a limited number of prompt-response pairs are sparsely compared with others. In this paper, we first revisit the foundations of using BT models in reward modeling, and establish the convergence rate of BT reward models based on deep neural networks using embeddings, providing a theoretical foundation for their use. Despite theoretically sound, we argue that the BT model is not a necessary choice from the perspective of downstream optimization. This is because a reward model only needs to preserve the correct ranking predictions through a monotonic transformation of the true reward. We highlight the critical concept of order consistency in reward modeling and demonstrate that the BT model possesses this property. Consequently, we propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers, as an alternative order-consistent reward modeling objective. To offer practical insights, we empirically evaluate the performance of these different reward modeling approaches across more than 12,000 experimental setups, using 6 base LLMs, 2 datasets, and diverse annotation designs that vary in quantity, quality, and pairing choices in preference annotations.


1. Bibliographic Information

1.1. Title

Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives

1.2. Authors

  • Hao Sun: University of Cambridge, Cambridge, UK.
  • Yunyi Shen: Massachusetts Institute of Technology, Cambridge, MA, USA.
  • Jean-Francois Ton: ByteDance Research, London, UK.

1.3. Journal/Conference

This paper is a preprint published on arXiv, with a listed publication date of November 7, 2024. As a preprint, it has not yet undergone formal peer review for a specific journal or conference, but arXiv is a widely respected platform for disseminating cutting-edge research in fields like AI and machine learning.

1.4. Publication Year

2024

1.5. Abstract

The Bradley-Terry (BT) model is widely used in reward modeling for Large Language Model (LLM) alignment, despite its original development for multi-player stochastic game matching. The paper investigates the theoretical justification for its application in converting sparse pairwise response comparisons to reward values. It first establishes a theoretical foundation by demonstrating the convergence rate of BT reward models based on deep neural networks using embeddings. While theoretically sound, the authors argue that the BT model is not strictly necessary for downstream optimization, as reward models only need to preserve correct ranking predictions through a monotonic transformation of the true reward. This highlights the critical concept of order consistency, which the BT model possesses. Consequently, the paper proposes a simpler, alternative order-consistent reward modeling objective using an upper-bound algorithm compatible with off-the-shelf binary classifiers. To provide practical insights, the study conducts extensive empirical evaluations across over 12,000 experimental setups, utilizing 6 base LLMs, 2 datasets, and various annotation designs (quantity, quality, pairing choices) to compare these different reward modeling approaches.

Paper Link: https://arxiv.org/abs/2411.04991v2 | PDF Link: https://arxiv.org/pdf/2411.04991v2.pdf | Publication Status: Preprint (arXiv).

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the unclear theoretical justification and necessity of using the Bradley-Terry (BT) model in reward modeling for Large Language Model (LLM) alignment. While the BT model has shown empirical success in Reinforcement Learning from Human Feedback (RLHF), its origins are in multi-player stochastic game matching, raising questions about its suitability for converting sparse pairwise response comparisons to continuous reward values in the LLM context.

This problem is crucial because LLM alignment is fundamental for the safe, effective, and ethical deployment of LLMs across diverse applications. Reward modeling is a key component of RLHF, which is a dominant strategy for aligning LLMs with human preferences. Gaps exist in prior research regarding:

  1. Theoretical Soundness: Why does the BT model work, especially with extremely sparse comparison data characteristic of LLM alignment, where a given prompt-response pair is compared only a few times?

  2. Necessity: Is the BT model, with its specific anti-symmetric structure and log-likelihood loss, a necessary choice, or can simpler, more flexible alternatives achieve similar or better performance?

  3. Data Format: The conventional BT model assumes randomized pairwise comparisons. Is restricting comparisons to responses from the same prompt (a common practice) optimal, or would cross-prompt comparisons lead to more effective reward modeling?

    The paper's innovative entry point is to thoroughly revisit the foundations of BT models in this context, provide theoretical underpinnings, and then use these insights to explore and validate alternative, simpler, and potentially more efficient approaches.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  1. Formal Analysis and Justification of BT Model Application: It provides a comprehensive analysis of the BT model's application in LLM alignment, distinguishing it from its traditional use in multi-player arenas. It formally justifies the use of BT models in LLM reward modeling by analyzing the underlying assumptions (existence of deterministic utility values, deterministic comparisons, logistic difference assumption) that lead to BT-type models.
  2. First Asymptotic Theory for Neural Network-Based BT Regression: The paper introduces the first asymptotic theory for neural network-based BT regression in preference-based reward modeling. It establishes risk bounds for BT model reward estimation in the context of LLM alignment, proving that with a sufficient number of annotated comparisons in the embedding space, the BT reward model estimation converges to the true reward value up to an additive constant.
  3. Introduction of Order Consistency and Classification-Based Alternative: It proposes order consistency as a core objective for reward modeling, arguing that learning a reward function up to a monotonic transformation is sufficient for downstream LLM optimization. It demonstrates that both the BT model and a novel classification-based approach inherently possess this property. The classification-based alternative offers greater flexibility and compatibility with off-the-shelf binary classifiers, interpreting reward scores as logits of marginalized winning probabilities.
  4. Theoretical Superiority of Cross-Prompt Comparisons: The paper theoretically analyzes and highlights the superiority of cross-prompt annotations in LLM alignment. It demonstrates that cross-prompt comparisons increase the expected differences between prompt-response pairs, thereby improving annotation quality under broad conditions (e.g., Gaussian utility distributions and concave annotation quality functions).
  5. Extensive Empirical Validation: The study conducts a massive empirical evaluation across over 12,000 experimental setups. These experiments involve 6 base LLMs, 2 datasets (Anthropic-Harmless, Anthropic-Helpful), 3 response sampling methods, 6 annotation noise levels, 3 reward model implementations (BT-MLP, CLF-MLP, CLF-LGB), 4 annotation availability scenarios, and 5 random seeds. The findings show that:
    • Classification-based reward models generally achieve better performance and greater robustness to annotation noise and quantity compared to BT models.

    • Cross-prompt comparisons significantly outperform same-prompt comparisons, especially when responses for a single prompt lack diversity, providing a more stable and less diversity-dependent performance.

      These findings clarify the theoretical foundations of BT models, validate simpler alternatives, and provide practical guidance on optimal annotation strategies for effective LLM alignment.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following fundamental concepts:

  • Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data to understand, generate, and process human-like language. They are capable of various tasks like translation, summarization, question answering, and creative writing. Examples include GPT-3, LLaMA, and Gemma.

  • LLM Alignment: The process of ensuring that LLMs behave in a way that is consistent with human values, intentions, and preferences. This involves making them helpful, harmless, and honest, and preventing them from generating toxic, biased, or undesirable content.

  • Reinforcement Learning from Human Feedback (RLHF): A machine learning paradigm used to align LLMs. It involves three main steps:

    1. Collecting Human Preferences: Humans compare multiple responses generated by an LLM for a given prompt and indicate which one they prefer.
    2. Training a Reward Model (RM): These human preferences are used to train a separate model, called a reward model, to predict a numerical score (reward) for any given prompt-response pair. A higher score indicates a more preferred response.
    3. Fine-tuning the LLM with RL: The trained reward model is then used to fine-tune the original LLM using a reinforcement learning algorithm (like PPO). The LLM learns to generate responses that maximize the reward predicted by the RM.
  • Reward Modeling: The specific task within RLHF of training a model to predict a scalar reward for a given prompt-response pair. This reward acts as a proxy for human preference. The goal is to learn a reward function $R(x, y)$ such that higher values of $R$ correspond to more desirable responses $y$ for a prompt $x$.

  • Pairwise Comparisons: A method of data collection where annotators (humans or other AI models) are presented with two options (e.g., two different responses to the same prompt) and asked to choose which one they prefer. This generates comparative data, rather than absolute scores. This is the primary data format for training reward models in RLHF.

  • Bradley-Terry (BT) Model: A statistical model for predicting outcomes of paired comparisons. Originally developed for ranking sports teams, it posits that the probability of item $i$ being preferred over item $j$ depends on their latent utilities or abilities.

    • In its simplest form, the probability of $i$ beating $j$ is given by $P(i \succ j) = \frac{u_i}{u_i + u_j}$, where $u_i$ and $u_j$ are the positive utility scores (abilities) of items $i$ and $j$.
    • Often, these utilities are transformed to log-utilities or scores $r_i = \log(u_i)$, leading to the softmax formulation: $ P(i \succ j) = \frac{\exp(r_i)}{\exp(r_i) + \exp(r_j)} = \mathrm{softmax}(r_i, r_j) $ The sigmoid function $\sigma(z) = \frac{1}{1+e^{-z}}$ is closely related, as $\mathrm{softmax}(r_i, r_j) = \sigma(r_i - r_j)$. This means the probability of preferring $i$ over $j$ depends only on the difference in their scores (a numeric sketch appears at the end of this list).
  • Luce-Shephard Choice Rule: A mathematical model in psychology that describes how individuals choose among multiple alternatives. It states that the probability of choosing an option from a set of alternatives is proportional to its utility. The Bradley-Terry model is a specific instance or specialization of the Luce-Shephard choice rule for pairwise comparisons.

  • Logistic Regression: A statistical model used for binary classification. It models the probability of a binary outcome (e.g., prefer $i$ or $j$) using a sigmoid function applied to a linear combination of input features. The output is a probability between 0 and 1. The BT model, when viewed as predicting the probability of one item being preferred over another based on feature differences, can be seen as a form of logistic regression.

  • Neural Networks (MLPs): Multilayer Perceptrons (MLPs) are a class of artificial neural networks. They consist of an input layer, one or more hidden layers, and an output layer. Each layer is composed of nodes (neurons) that are fully connected to the nodes in the previous and subsequent layers. MLPs use activation functions (e.g., ReLU, sigmoid) to introduce non-linearity, allowing them to learn complex patterns. In this paper, MLPs are used as the backbone for the reward models to map embeddings to scores.

  • Embeddings: Numerical representations of text (or other data types) in a high-dimensional vector space. Words, sentences, or entire documents can be converted into embeddings where semantically similar items are mapped to closer points in the vector space. Sentence embeddings or prompt-response pair embeddings are crucial for allowing a neural network-based reward model to generalize to unseen inputs.

  • Cross-Entropy Loss: A widely used loss function in classification tasks, particularly when the model outputs probabilities (like softmax). For a binary classification problem, it's also known as binary cross-entropy. It measures the difference between the true probability distribution (e.g., 1 for preferred, 0 for not preferred) and the predicted probability distribution. The goal during training is to minimize this loss.

    • For a single instance where the true label is $y \in \{0, 1\}$ and the predicted probability is $\hat{y} \in [0, 1]$, the binary cross-entropy loss is: $ L_{CE} = -(y \log(\hat{y}) + (1-y) \log(1-\hat{y})) $
  • Truncated KL Risk: A variant of Kullback-Leibler (KL) divergence (also known as KL risk or relative entropy), which measures how one probability distribution diverges from a second, expected probability distribution. Truncated KL risk is used in theoretical analysis to address potential divergence issues when probabilities are very close to 0 or 1, making the logarithm term unbounded. It essentially caps the maximum contribution of very small probabilities to the loss, ensuring the risk remains bounded for theoretical analysis.

  • Hölder Smooth Function: A function class used in mathematical analysis, particularly in non-parametric statistics and approximation theory. A function is Hölder smooth if its derivatives up to a certain order (characterized by $\beta$) satisfy a specific Lipschitz-like continuity condition. This property implies a certain level of "smoothness" and regularity, which is often assumed for functions that neural networks are trying to approximate in order to establish convergence rates.

  • Gumbel Distribution: A type of extreme value distribution. It describes the distribution of the maximum (or minimum) of a number of samples of various distributions. In the context of choice models, if the noise or bias terms associated with options are independently Gumbel distributed with the same scale parameter, then the differences between these noise terms follow a logistic distribution, which then leads to the Bradley-Terry model's softmax probability form.

  • Gaussian Distribution (Normal Distribution): A common continuous probability distribution characterized by its bell-shaped curve. It is defined by its mean ($\mu$) and standard deviation ($\sigma$). If the noise or bias terms are Gaussian distributed, then their differences are also Gaussian, leading to the Thurstonian model (which uses the cumulative distribution function (CDF) of the Gaussian, $\Phi(\cdot)$, instead of the sigmoid).

  • Best-of-N (BoN) Sampling: An evaluation strategy in RLHF. After an LLM is fine-tuned (or a reward model is trained), for a given prompt, the LLM generates $N$ different responses. The reward model (or human annotators) then scores each of these $N$ responses, and the response with the highest score is selected as the "best." This Best-of-N response is then used to evaluate the overall performance of the LLM or reward model, often by comparing its score to a golden reward model.

  • Proximal Policy Optimization (PPO): A popular reinforcement learning algorithm often used in the final stage of RLHF to fine-tune the LLM policy. It is a policy-gradient method that tries to update the policy (the LLM's behavior) in small, stable steps to maximize the expected reward, while preventing large, disruptive changes to the policy.
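
The following minimal Python sketch ties together two of the concepts above: the equivalence between the two-way softmax and the sigmoid of a score difference in the BT model, and the binary cross-entropy loss used to fit it. The scores are toy values, not taken from the paper.

```python
import math

def bt_preference_prob(r_i: float, r_j: float) -> float:
    """P(i beats j) under the Bradley-Terry model: sigmoid of the score difference."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))

def softmax_pair(r_i: float, r_j: float) -> float:
    """Equivalent two-way softmax formulation of the same probability."""
    return math.exp(r_i) / (math.exp(r_i) + math.exp(r_j))

def binary_cross_entropy(y: int, p: float) -> float:
    """Binary cross-entropy for a single comparison with label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Toy scores (hypothetical): response i has higher latent utility than response j.
r_i, r_j = 1.2, 0.4
p = bt_preference_prob(r_i, r_j)
assert abs(p - softmax_pair(r_i, r_j)) < 1e-12   # sigmoid(r_i - r_j) == softmax(r_i, r_j)
print(f"P(i > j) = {p:.3f}, BCE if i was preferred = {binary_cross_entropy(1, p):.3f}")
```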

3.2. Previous Works

The paper extensively references prior studies, both in the application of BT models in RLHF and in the theoretical underpinnings of BT regression and non-pairwise data.

  • BT Models in RLHF (Seminal Applications):

    • Christiano et al. (2017): This seminal work introduced Deep Reinforcement Learning from Human Preferences, providing the initial motivation for using Bradley-Terry-type models in RLHF. They frame it as a specialization of the Luce-Shephard choice rule, where pairwise comparisons estimate utility scores.
    • Stiennon et al. (2020) and Ouyang et al. (2022): These works further established the BT model as a successful practice in LLM alignment, demonstrating its effectiveness in improving LLM quality when direct evaluation is difficult. The paper highlights that these works often applied the model directly without deep theoretical justification for its use in sparse comparison scenarios.
  • Challenges to BT Models in RLHF:

    • Azar et al. (2023), Munos et al. (2023), Tang et al. (2024): These works have discussed challenges, such as the BT model's inability to capture non-transitive preferences (Munos et al., 2023) or potential overfitting issues in direct preference optimization (DPO) when preferences are deterministic (Azar et al., 2023). This paper distinguishes itself by focusing on the underlying assumptions for BT's correctness rather than solely on failure modes.
    • Zhao et al. (2023): Also discusses challenges in direct preference optimization.
  • Bradley-Terry Model for Parameter Estimation and Prediction (Traditional Statistical View):

    • Bradley and Terry (1952): The original paper, laying the foundation for using pairwise comparisons to estimate player abilities.
    • Ford Jr (1957): Established identifiability and asymptotic theory for the classical BT model.
    • Simons and Yao (1999): Studied BT asymptotics when the number of players diverged and each played many competitions.
    • Han et al. (2020): Investigated the sparse comparison setting (more relevant to LLM alignment), proposing methods for consistent estimation with fewer comparisons, specifically $O(N \log^3(N))$ comparisons for $N$ scores. This work highlights the inherent challenge for LLM reward modeling with its extremely sparse $N/2$ comparisons.
    • Wu et al. (2022): Compared asymptotic theory under different identifiability assumptions.
    • Springall (1973): Explored extensions of the BT model to regression settings, assuming abilities (rewards) could be predicted from a linear combination of covariates (features).
    • De Soete and Winsberg (1993): Used spline functions for more complex non-linear models in BT regression.
    • Bockenholt (1988): Framed the Bradley-Terry model as a special logistic regression problem, opening theoretical investigation from this perspective.
    • Chen and Pouzo (2012): Developed theory for general nonparametric conditional moment models, including logistic regression.
    • Schmidt-Hieber (2020): Studied asymptotic theory for deep neural networks in nonparametric regression.
    • Bos and Schmidt-Hieber (2022): Extended this to nonparametric logistic regression using deep neural networks. This paper builds on their framework but notes that their theories cannot be directly applied due to the Siamese structure (shared network for two inputs) and the specific softmax output of BT models (predicting preference based on a difference of scores from a single network).
  • Non-Pairwise Data and Cross-Prompt Comparisons in RLHF:

    • Liu et al. (2022), Sun and van der Schaar (2024), Ethayarajh et al. (2024) (KTO): Explored alignment from non-pairwise data. KTO (Ethayarajh et al., 2024) is noted for its roots in prospect theory.
    • Yin et al. (2024) (RPO): Proposed Relative Preference Optimization (RPO), involving response comparisons among both identical and similar questions (cross-prompt). The current paper aligns with RPO's insight but differs in motivation (why not restrict to same-prompt?) and implementation (focus on annotation efficiency and a two-stage reward model approach, not direct alignment).
    • The paper contrasts its approach with direct alignment methods (e.g., Zheng et al., 2024; Liu et al., 2024; Zhao et al., 2023; Azar et al., 2023; Ethayarajh et al., 2024), arguing that explicit reward models (as in two-stage methods) offer advantages for inference-time optimization.
  • Representation Learning in Reward Modeling:

    • Yang et al. (2024a): Highlighted how generation tasks can regularize learned embeddings and improve reward modeling.
    • Zhang et al. (2024): Used generative verifiers to construct reward predictions via next-token prediction, leveraging LLM generative ability. The current paper notes its research is orthogonal but complementary to this direction.

3.3. Technological Evolution

The field of LLM alignment has rapidly evolved, primarily driven by the success of RLHF.

  1. Early LLMs: Initial LLMs focused on pre-training on massive text corpora, leading to impressive language generation capabilities but often lacking alignment with human preferences or safety guidelines.
  2. Emergence of RLHF: Pioneering works like Christiano et al. (2017) and Stiennon et al. (2020) introduced RLHF as a powerful paradigm. This involved collecting human preference data and training a reward model (often using the Bradley-Terry formulation) to provide a scalar feedback signal for reinforcement learning algorithms (like PPO) to fine-tune the LLM.
  3. BT Model Dominance: The empirical success of BT models in systems like ChatGPT (Ouyang et al., 2022) solidified its status as a de facto standard for reward modeling. However, its theoretical foundations in this specific, sparse comparison context remained less explored.
  4. Alternative Alignment Methods: Alongside RLHF, direct preference optimization (DPO) (Rafailov et al., 2023) emerged, directly optimizing the LLM policy using preference data without an explicit reward model. Other methods explored non-pairwise data or specific ways of structuring preferences (e.g., KTO, RPO).
  5. This paper's Contribution: The current paper fits into this timeline by critically examining the foundational BT model that underpins much of RLHF. It provides missing theoretical justifications for its use in sparse data settings (connecting it to nonparametric logistic regression with deep neural networks). More importantly, it challenges the necessity of the BT model by proposing a simpler, order-consistent classification objective, suggesting that the core requirement is preserving rank order rather than precise probability calibration. It also pushes the boundaries of annotation design by advocating for cross-prompt comparisons, moving beyond the traditional same-prompt pairing. This work aims to refine and optimize the data collection and modeling aspects of the reward modeling stage, which is a crucial intermediate step in many RLHF pipelines.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper's approach offers several key differentiators and innovations:

  • Theoretical Justification for BT in Sparse LLM Context: Previous works (e.g., Christiano et al., Stiennon et al., Ouyang et al.) successfully applied the BT model empirically, but the theoretical soundness for sparse, high-dimensional data (prompt-response pairs represented by embeddings) was underexplored. This paper provides the first asymptotic theory for neural network-based BT regression in this specific context, establishing convergence rates. This moves beyond simply acknowledging empirical success to providing formal guarantees.

  • Emphasis on Order Consistency as a Core Objective: Unlike BT models, which aim to accurately model the probability of preference, this paper argues that for downstream LLM optimization, only order consistency (preserving the correct ranking) is strictly necessary. This is a crucial philosophical shift, implying that exact probability calibration (a harder problem) might be an overspecification. This insight leads to simpler, more flexible alternatives.

  • Novel Classification-Based Reward Modeling Objective: Building on the order consistency principle, the paper proposes a straightforward classification-based algorithm as an alternative to the BT model. This method relaxes the anti-symmetry constraint inherent in BT by training a standard binary classifier on individual prompt-response pairs. The output logits from this classifier then serve as reward scores. This is simpler to implement, allows for off-the-shelf classifiers (like LightGBM), and empirically demonstrates superior performance and robustness in many settings.

  • Theoretical and Empirical Advocacy for Cross-Prompt Comparisons: Traditional BT model applications (like sports ranking) often assume randomized pairwise comparisons. In RLHF, however, same-prompt comparisons are prevalent. This paper theoretically proves that cross-prompt comparisons (comparing responses from different prompts) can lead to higher utility diversity and improved annotation quality, thus being more beneficial. This is a direct challenge to a common practice and provides practical guidance for data collection strategies, differentiating it from RPO (Yin et al., 2024) which also uses cross-prompt comparisons but with different motivations and implementations.

  • Extensive and Controlled Empirical Validation: The paper conducts one of the most comprehensive empirical evaluations in this domain, with over 12,000 runs across varied LLMs, datasets, annotation qualities, quantities, and pairing choices. This systematic ablation study provides strong evidence for its claims and offers practical insights into the robustness and efficiency of different reward modeling approaches.

    In essence, the paper provides both a deeper theoretical understanding of current practices and a practical, empirically validated alternative that is simpler, more robust, and guides more efficient data collection.

4. Methodology

4.1. Principles

The core idea of the methodology is multi-faceted:

  1. Revisit and Validate BT Foundations: The paper first seeks to understand why the Bradley-Terry (BT) model works in LLM reward modeling, especially given the sparse comparison data. This involves establishing theoretical guarantees for neural network-based BT regression, connecting it to nonparametric logistic regression.
  2. Deconstruct Necessity from Sufficiency: It then argues that the BT model, while theoretically sound, might not be necessary because downstream LLM optimization primarily requires a reward function that preserves the correct ranking of responses (i.e., order consistency), not necessarily perfectly calibrated preference probabilities. A monotonic transformation of the true reward is sufficient.
  3. Propose Order-Consistent Alternatives: Based on the order consistency principle, the paper develops a simpler classification-based reward modeling objective that can leverage off-the-shelf binary classifiers, thereby offering more flexibility and potentially better empirical performance.
  4. Optimize Annotation Strategy: The methodology also investigates the preference annotation process, particularly challenging the conventional wisdom of same-prompt comparisons. It theoretically and empirically explores the benefits of cross-prompt comparisons to improve annotation quality and data utility.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Rethinking the Usage of BT Models in LLM Alignment (Section 2.2)

The paper begins by dissecting the implicit assumptions made when applying the BT model to human preference annotations in LLM alignment.

Assumption 1 (Existence of Deterministic Oracle Utility Values) The (log-)utility value $r_{x,y}$ of any response $y$ given prompt $x$ exists and is deterministic.

  • Explanation: This foundational assumption posits that for any given prompt-response pair, there's an inherent, fixed, and objective quality score (utility) that the reward model tries to learn. This score doesn't change based on who is annotating or how it's compared.

Assumption 2 (Deterministic Comparisons) For annotator $A$, the annotation result is deterministic and depends on the comparison of their biased evaluation of the utility values of both responses $y_1$ and $y_2$, such that: $ \mathbb{1}(y_1 \succ y_2 | x, A) = \mathbb{1}(r_{x,y_1} + b(x, y_1, A) > r_{x,y_2} + b(x, y_2, A)) $

  • Explanation: This assumes that an annotator $A$ doesn't randomly pick a preference. Instead, they form a biased evaluation ($r_{x,y} + b(x, y, A)$) for each response. The preference decision is then deterministic: if the biased evaluation of $y_1$ is higher than that of $y_2$, then $y_1$ is preferred. The randomness in annotations across different annotators (or even for the same annotator across different occasions) comes from the bias term $b(x, y, A)$, which is specific to the annotator and the prompt-response pair.

Assumption 3 (Logistic Difference Assumption) The difference $b(x, y_1, A) - b(x, y_2, A)$ is sampled i.i.d. from a standard logistic distribution for all $x, y$: $ P(b(x, y_1, A) - b(x, y_2, A) \leq t | A) = \frac{1}{1+e^{-t}} $

  • Explanation: This assumption specifies the distribution of the difference in annotator biases. If this difference follows a logistic distribution, it mathematically leads directly to the Bradley-Terry model's softmax form for preference probabilities. The i.i.d. (independent and identically distributed) nature means this bias difference is consistent across all comparisons.
  • Remark 1 (Transitive property of difference): The paper notes that this assumption is consistent with transitivity if all annotator biases are independently Gumbel distributed with the same scale parameter. The difference of two i.i.d. Gumbel variables with the same scale parameter is a logistic distribution.

Proposition 2 (Modeling Annotations under Logistic Difference Assumption) Under the above assumptions, the probability of preferring $y_1$ over $y_2$ given prompt $x$ is: $ P(y_1 \succ y_2 | x) = P \left( r_{x,y_1} - r_{x,y_2} > b(x, y_1, A) - b(x, y_2, A) \right) = \frac{1}{1 + e^{-(r_{x,y_1} - r_{x,y_2})}} $

  • Explanation: This is the core Bradley-Terry probability model. It shows that the probability of $y_1$ being preferred over $y_2$ is a sigmoid function of the difference between their true underlying utility values ($r_{x,y_1} - r_{x,y_2}$). This equation is fundamental to how BT models translate utility differences into preference probabilities.

    The paper also considers an alternative assumption for robustness:

Assumption 4 (Gaussian Difference Assumption) The difference $b(x, y_1, A) - b(x, y_2, A)$ is sampled from a standard Gaussian distribution: $ b(x, y_1, A) - b(x, y_2, A) \sim \mathcal{N}(0, 1) $

  • Explanation: Similar to Assumption 3, but here the bias differences follow a Gaussian distribution.

Proposition 3 (Modeling Annotations under Gaussian Difference Assumption) Under Assumption 1, Assumption 2, and Assumption 4, the probability of preferring $y_1$ over $y_2$ given prompt $x$ is: $ P(y_1 \succ y_2 | x) = P \left( r_{x,y_1} - r_{x,y_2} > b(x, y_1, A) - b(x, y_2, A) \right) = \Phi(r_{x,y_1} - r_{x,y_2}) $ where $\Phi$ is the CDF of the standard Gaussian distribution.

  • Explanation: This is the Thurstonian model. It's structurally similar to the BT model but uses the Gaussian CDF instead of the sigmoid function because of the Gaussian noise assumption. The paper notes that BT is often preferred for empirical fit and mathematical convenience.
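
To make Propositions 2 and 3 concrete, the sketch below maps the same utility gap through the sigmoid (BT, logistic-difference assumption) and the standard Gaussian CDF (Thurstonian, Gaussian-difference assumption). The utility gaps are illustrative values, not from the paper.

```python
import math
from statistics import NormalDist

def bt_prob(delta_r: float) -> float:
    """P(y1 > y2 | x) under the logistic-difference assumption (Bradley-Terry)."""
    return 1.0 / (1.0 + math.exp(-delta_r))

def thurstone_prob(delta_r: float) -> float:
    """P(y1 > y2 | x) under the Gaussian-difference assumption (Thurstonian model)."""
    return NormalDist(mu=0.0, sigma=1.0).cdf(delta_r)

# Hypothetical utility gaps r_{x,y1} - r_{x,y2}; both models give 0.5 at zero gap
# and saturate toward 1 as the gap grows.
for delta in [0.0, 0.5, 1.0, 2.0]:
    print(f"delta_r={delta:>4}:  BT={bt_prob(delta):.3f}  Thurstone={thurstone_prob(delta):.3f}")
```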

4.2.2. BT Regression: How BT Model Works with Sparse Comparisons (Section 2.3)

The key challenge in LLM reward modeling is the sparsity of comparisons (e.g., $N/2$ comparisons for $N$ prompt-response pairs, far below theoretical bounds for consistent estimation in classical BT). To address this, the paper notes that BT regression variants are used. Instead of estimating a unique score for each prompt-response pair, the score $r(x,y)$ is modeled as a function of features or covariates of the pair. For LLMs, this means using embeddings of the prompt-response pair.

  • Regression Approach: The reward $r(x,y)$ is expressed as $g(\Psi(x,y))$, where $\Psi(x,y)$ is an embedding of the prompt-response pair $(x,y)$, and $g$ is an unknown function. This allows generalization to unseen pairs.
  • Neural Networks: In practice, MLPs are commonly used to implement $g$, mapping embeddings to reward scores. The paper's theoretical analysis focuses on this MLP-based BT regression; a minimal sketch follows below.
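
A minimal PyTorch sketch of this setup, assuming the embedding function $\Psi$ is computed elsewhere; the embedding dimension and hidden widths are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MLPRewardModel(nn.Module):
    """Maps a prompt-response embedding Psi(x, y) to a scalar reward g(Psi(x, y))."""
    def __init__(self, embed_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, psi: torch.Tensor) -> torch.Tensor:
        return self.net(psi).squeeze(-1)  # one scalar reward per embedding

# Hypothetical batch of prompt-response embeddings (dimensions are illustrative).
model = MLPRewardModel()
psi = torch.randn(4, 768)
rewards = model(psi)   # generalizes to unseen (x, y) pairs via the embedding space
```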

4.2.3. Asymptotic Theory on MLP-based BT Regression in Reward Modeling (Section 2.4 and Appendix A)

This section provides the theoretical justification for using MLPs with BT loss for learning reward functions.

Data Format: The dataset for preference-based LLM alignment is $\mathcal{D}_{\mathrm{pref}} = \{ (x_i, y_{1,i}, y_{2,i}, h_i) \}_{i \in [n]}$.

  • $x_i$: the prompt.

  • $y_{1,i}, y_{2,i}$: responses sampled from the LLM.

  • $h_i$: the human-annotated preference, 1 if $y_{1,i}$ is preferred and -1 otherwise.

    Reward Model Formulation:

  1. Embedding Function: Assume a known embedding function $\Psi(\cdot, \cdot) : \mathcal{X} \times \mathcal{Y} \mapsto [0, 1]^d$, which maps a prompt-response pair to a $d$-dimensional vector.

  2. Reward Function: An unknown function $g : \mathbb{R}^d \mapsto \mathbb{R}$ exists such that the reward can be written as $r(x,y) = g(\Psi(x,y))$. The goal is to learn $g$.

  3. Predicted Probability: The reward model is denoted $\hat{r}_\theta$. The predicted probability that $(x_1, y_1)$ is better than $(x_2, y_2)$ is $\sigma(\hat{r}(\Psi(x_1, y_1)) - \hat{r}(\Psi(x_2, y_2)))$. This is equivalent to $\mathrm{softmax}(\hat{r}(\Psi(x_1, y_1)), \hat{r}(\Psi(x_2, y_2)))$, which turns preference prediction into a classification problem.

    Loss Function: Cross-entropy loss is used for training. Let $\Psi_1^{(i)}$ and $\Psi_2^{(i)}$ be the embeddings for the $i$-th pair, let $\pmb{p}_0^{(i)}$ be the true preference probability, and let $\hat{\pmb{p}}^{(i)}$ be the model's predicted probability vector. The preference label $\pmb{h}^{(i)}$ is $(1, 0)$ if the first response is preferred, and $(0, 1)$ otherwise. The cross-entropy loss is: $ \widetilde{\mathcal{L}}_{\mathrm{CE}}(\pmb{p}) = - \frac{1}{n} \sum_{i=1}^{n} (\pmb{h}^{(i)})^{\top} \log(\pmb{p}^{(i)}) $ The estimator $\hat{\pmb{p}}$ is the one that minimizes this loss over the function class $\mathcal{F}_\theta$: $ \hat{\pmb{p}} = \arg\min_{\pmb{p} \in \mathcal{F}_\theta} \widetilde{\mathcal{L}}_{\mathrm{CE}}(\pmb{p}) $ The gap between the fitted network and the global minimum over the class is denoted: $ \Delta_n(\pmb{p}_0, \hat{\pmb{p}}) = \mathbb{E} \left[ \widetilde{\mathcal{L}}_{\mathrm{CE}}(\hat{\pmb{p}}) - \min_{\pmb{p} \in \mathcal{F}_\theta} \widetilde{\mathcal{L}}_{\mathrm{CE}}(\pmb{p}) \right] $
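
A minimal sketch of this cross-entropy objective, written in the equivalent sigmoid-of-difference form; rewards and labels are toy values, not from the paper.

```python
import torch
import torch.nn.functional as F

def bt_cross_entropy(r1: torch.Tensor, r2: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Pairwise cross-entropy loss for BT reward modeling.

    r1, r2: rewards r_hat(Psi_1), r_hat(Psi_2) for each annotated pair.
    h: preference labels, +1 if the first response is preferred, -1 otherwise.
    Equivalent to -(h_vec)^T log softmax(r1, r2) averaged over the batch.
    """
    return -F.logsigmoid(h * (r1 - r2)).mean()

# Toy example with hypothetical rewards and labels.
r1 = torch.tensor([1.0, 0.2, -0.3])
r2 = torch.tensor([0.1, 0.5, -0.1])
h = torch.tensor([1.0, -1.0, 1.0])
print(bt_cross_entropy(r1, r2, h))
```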

Truncated KL Risk (Definition 12): To overcome divergence problems with the KL risk when probabilities are near 0 or 1, the paper uses a $B$-truncated KL risk: $ R_B(\pmb{p}_0, \hat{\pmb{p}}) = \mathbb{E} \left[ \pmb{p}_0^{\top} \min \left( B, \log \frac{\pmb{p}_0}{\hat{\pmb{p}}} \right) \right] $

  • Explanation: This metric measures the expected difference between the true preference probabilities ($\pmb{p}_0$) and the model's predicted probabilities ($\hat{\pmb{p}}$), but with a cap $B$ on the logarithmic term. This prevents extremely small predicted probabilities from causing infinite risk, making the theoretical analysis tractable.
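
A small numeric sketch of the $B$-truncated KL risk in Definition 12; the probability values are toy numbers, not from the paper.

```python
import numpy as np

def truncated_kl_risk(p0: np.ndarray, p_hat: np.ndarray, B: float) -> float:
    """Monte Carlo estimate of the B-truncated KL risk R_B(p0, p_hat).

    p0, p_hat: arrays of shape (n, 2) holding true and predicted preference
    probabilities for n comparisons (each row sums to 1).
    """
    log_ratio = np.log(p0 / p_hat)       # elementwise log(p0 / p_hat)
    capped = np.minimum(B, log_ratio)    # cap each contribution at B
    return float(np.mean(np.sum(p0 * capped, axis=1)))

# Toy probabilities (hypothetical): predictions slightly off from the truth.
p0    = np.array([[0.9, 0.1], [0.6, 0.4]])
p_hat = np.array([[0.8, 0.2], [0.5, 0.5]])
print(truncated_kl_risk(p0, p_hat, B=10.0))
```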

    Assumptions on the Model and True Function:

  • Assumption 6 (MLP reward model): The neural network class $\mathcal{F}_\theta$ consists of MLPs with ReLU activations (i.e., $\max(x, 0)$ as the activation function) with depth $L$ and width vector $\pmb{m}$. The network parameters have norm restrictions (bounded $\ell_\infty$ norm for weights and biases) and sparsity (bounded $\ell_0$ norm for weights and biases, i.e., a limited number of non-zero parameters).

    • The function class is $\mathcal{F}(L, \pmb{m}, s)$, and the softmax output version is $\mathcal{F}_\sigma(L, \pmb{m}, s)$, where $p(\Psi_1, \Psi_2) = \mathrm{softmax}(f(\Psi_1), f(\Psi_2))$ for $f \in \mathcal{F}(L, \pmb{m}, s)$.
  • Definition 13 (Small value bound by Bos and Schmidt-Hieber (2022)): A function class $\mathcal{H}$ is $\alpha$-small value bounded ($\alpha$-SVB) if the probability of any component $p_k$ of the probability vector $\pmb{p}$ being less than or equal to $t$ is bounded by $C t^{\alpha}$:

    • $ \mathbb{P}(p_k(\Psi_1, \Psi_2) \leq t) \leq C t^{\alpha}, \quad \text{for all } t \in (0, 1] \text{ and all } k = 0, 1 $
    • Explanation: This ensures that the true preference probabilities are not too close to 0 or 1, which helps avoid issues with the logarithm in KL divergence. It essentially means that extremely decisive preferences are not overwhelmingly common in the data distribution.
  • Definition 14 (Hölder smooth function): A function class that captures smoothness. The ball of $\beta$-Hölder functions with radius $Q$ has bounded sums of $\ell_\infty$ norms of derivatives up to order $\beta$.

  • Assumption 7 (Class of true utility function): The true reward functions are $\beta$-Hölder smooth, and the probability induced by the softmax is $\alpha$-SVB.

Theorem 20 (Truncated KL risk bound, formal version from Appendix A) Suppose the probability of preference induced by the true utility function is in $\mathcal{G}_\alpha(\beta, Q, C)$ (i.e., $\beta$-Hölder smooth and $\alpha$-SVB) for $\alpha \in [0, 1]$. Let $\phi_n := 2^{\frac{(1+\alpha)\beta + (3+\alpha)d}{(1+\alpha)\beta + d}} n^{-\frac{(1+\alpha)\beta}{(1+\alpha)\beta + d}}$. Let $\hat{\pmb{p}}$ be an estimator from the family $\mathcal{F}_\sigma(L, \pmb{m}, s)$ (MLP with ReLU activations, norm restriction, and sparsity) satisfying:

  1. $A(d, \beta) \log_2(n) \leq L \lesssim n \phi_n$ (the depth $L$ scales appropriately).
  2. $\min_i m_i \gtrsim n \phi_n$ (the minimum width scales appropriately).
  3. $s \asymp n \phi_n \log(n)$ (the sparsity $s$ scales appropriately). For sufficiently large $n$, there exist constants $C', C''$ such that when $\Delta_n(\hat{\pmb{p}}, \pmb{p}_0) \leq C'' B \phi_n L \log^2(n)$, then: $ R_B(\pmb{p}_0, \hat{\pmb{p}}) \leq C' B \phi_n L \log^2(n) $
  • Explanation: This is the main theoretical result. It states that under reasonable smoothness and regularity assumptions on the true reward function and the MLP architecture, the truncated KL risk of the BT reward model converges to zero as the number of data points $n$ increases. The convergence rate is determined by $\phi_n$, which depends on the embedding dimension $d$, the smoothness $\beta$, and the small-value bound $\alpha$. It implies that MLP-based BT regression can indeed learn the true preference probabilities. The term $\Delta_n$ represents the optimization error (how far the learned network is from the global minimum for the given architecture); if this error is also small enough, the overall risk is bounded.

Corollary 7 (Connecting probability to reward) Given Lemma 21 (which relates the Hellinger distance to the truncated KL risk) and Remark 22 (which links the Hellinger distance to absolute probability differences): $ \left| p_0(\Psi_1, \Psi_2) - \hat{p}(\Psi_1, \Psi_2) \right| \lesssim \left| \sqrt{p_0} + \sqrt{\hat{p}} \right| \sqrt{\phi_n L} \log(n) \to 0 $ Then, by the mean value theorem relating probability differences to logit differences (which are reward differences for BT): $ \left| r(\Psi_1) - r(\Psi_2) - (\hat{r}(\Psi_1) - \hat{r}(\Psi_2)) \right| \lesssim \frac{\left| \sqrt{p_0} + \sqrt{\hat{p}} \right|}{\tilde{p}(1 - \tilde{p})} \sqrt{\phi_n L} \log(n) \to 0 $ where $\tilde{p}$ is a probability between $p_0$ and $\hat{p}$.

  • Explanation: This corollary directly connects the convergence of predicted probabilities to the convergence of estimated reward differences. It shows that if the model accurately predicts preference probabilities, it also accurately estimates the difference between the true reward values. The estimation is "up to an additive constant" because $r(\Psi_1) - r(\Psi_2)$ is what is learned, not the absolute values of $r(\Psi_1)$ or $r(\Psi_2)$. The condition that $\tilde{p}(1 - \tilde{p})$ should not be too small suggests that comparisons should not be made between pairs with extremely different rewards, as the logit function (inverse sigmoid) diverges at probabilities close to 0 or 1.

4.2.4. Rethinking Reward Modeling Objectives in LLM Alignment (Section 3)

The paper argues that BT models are not necessary because LLM optimization only requires a reward function that preserves the correct ranking (order), not perfectly calibrated preference probabilities.

Assumption 5 (Imperfect Preference Annotation in Approximating True Scores) Let $\Delta r := | r(x_1, y_1) - r(x_2, y_2) |$ be the true utility difference. An annotator function $h(x_1, x_2, y_1, y_2)$ provides feedback that probabilistically aligns with the oracle utility $r(x, y)$. The probability of the annotator being correct is $\xi(\Delta r)$: $ \mathbb{P} \bigg( h(x_1, x_2, y_1, y_2) (r(x_1, y_1) - r(x_2, y_2)) > 0 \bigg| \Delta r \bigg) = \xi(\Delta r) $ where $\xi(\cdot)$ is any monotonically increasing function mapping to $[0.5, 1]$.

  • Explanation: This assumption models the reality of human annotation noise. It states that annotators are more likely to make correct judgments when the difference in true utility ($\Delta r$) between two responses is larger, and less likely when the scores are very similar (approaching a 50% chance as $\Delta r \to 0$). Both the BT and Thurstonian models satisfy this, as their probability functions (sigmoid and Gaussian CDF) are monotonic.

Definition 8 (Order Consistency) The objective over an ordering model $\hat{H}$ is defined as the probability that the reward model's ordering agrees with the annotation: $ \mathcal{L}_{\mathrm{oc}}(\hat{r}) = \mathbb{E}_{x_1, x_2, y_1, y_2, h} \mathbb{I} \left[ h = \hat{H} \right] $

  • Explanation: This is a simple classification accuracy metric. It measures how often the model's predicted preference direction ($\hat{H}$) matches the observed human annotation ($h$). The goal is to maximize this quantity.

Proposition 9 (Lower bound on population level order consistency, formal version from Appendix B.1) Suppose a learned model $\hat{H}$ achieves the objective in equation 17 (Definition 8) up to $1 - \delta\epsilon$ error for small $0 < \delta < 1$ and $\epsilon < 3/20$, i.e., $\mathbb{E}_{x_1, x_2, y_1, y_2, h} \mathbb{1} \left[ h = \hat{H} \right] \geq 1 - \delta\epsilon$. Then, with probability at least $1 - \delta$ over $\Delta r$, for any given $\Delta r$, the order consistency of $\hat{r}$ with respect to the oracle utility $r$ is bounded below by: $ \mathbb{E}_{x_1, x_2, y_1, y_2} \left[ \mathbb{1} \left( \hat{H} \cdot \left[ r(x_1, y_1) - r(x_2, y_2) \right] \geq 0 \right) \bigg| \Delta r \right] \geq (1 - \epsilon) \cdot \xi^2(\Delta r) + \epsilon \cdot (1 - \xi(\Delta r))^2 $ Further, if with probability at least $1 - \kappa$ we have $\xi(\Delta r) \geq \sqrt{\epsilon^2 + 1 - 3\epsilon} + \epsilon$, then: $ \mathbb{E}_{x_1, x_2, y_1, y_2 \sim \ell(x)} \left[ \mathbb{1} \left( \hat{H} \cdot \left[ r(x_1, y_1) - r(x_2, y_2) \right] > 0 \right) \right] \geq 1 - 4\epsilon - \kappa - \delta $

  • Explanation: This proposition is crucial. It shows that by optimizing the observable order consistency with respect to noisy human annotations (achieving accuracy at least $1 - \delta\epsilon$), we also ensure high order consistency with the true underlying oracle utility function. The lower bound depends on the annotation quality function $\xi(\Delta r)$ and the error rate $\epsilon$. This theoretical link justifies pursuing simple classification accuracy as a valid objective for reward modeling.

The BT Model as a Choice (Section 3.2) The BT model explicitly enforces order consistency by modeling preference probability as a sigmoid of the reward difference: $ P(h=1) = \sigma ( \hat { r } _ { \mathrm { B T } } ( x _ { 1 } , y _ { 1 } ) - \hat { r } _ { \mathrm { B T } } ( x _ { 2 } , y _ { 2 } ) ) $ It is trained with binary cross-entropy loss: $ \mathcal { L } _ { \mathrm { B T } } = \mathbb { E } \left[ \mathbb { 1 } _ { h = 1 } \log \sigma ( \hat { r } _ { \mathrm { B T } } ( x _ { 1 } , y _ { 1 } ) - \hat { r } _ { \mathrm { B T } } ( x _ { 2 } , y _ { 2 } ) ) + \mathbb { 1 } _ { h = - 1 } \log ( 1 - \sigma ( \hat { r } _ { \mathrm { B T } } ( x _ { 1 } , y _ { 1 } ) - \hat { r } _ { \mathrm { B T } } ( x _ { 2 } , y _ { 2 } ) ) ) \right] $

  • Explanation: The BT model directly optimizes for predicting the probability of preference based on reward differences. Its Siamese architecture (where the same network processes both inputs and their outputs are differenced) ensures anti-symmetry (preferring A over B implies not preferring B over A).
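
A minimal sketch of the Siamese structure described above, assuming embeddings are available elsewhere; it checks that anti-symmetry holds by construction (the dimensions are illustrative, not the paper's configuration).

```python
import torch
import torch.nn as nn

# Shared ("Siamese") scoring network: the same parameters score both sides of a comparison.
score = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def pref_prob(psi_a: torch.Tensor, psi_b: torch.Tensor) -> torch.Tensor:
    """P(a preferred over b) = sigmoid(r(a) - r(b)) with a shared scorer."""
    return torch.sigmoid(score(psi_a) - score(psi_b)).squeeze(-1)

psi_1, psi_2 = torch.randn(3, 768), torch.randn(3, 768)   # hypothetical embeddings
p_12 = pref_prob(psi_1, psi_2)
p_21 = pref_prob(psi_2, psi_1)
assert torch.allclose(p_12 + p_21, torch.ones_like(p_12))  # anti-symmetry by construction
```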

Relaxing the Anti-Symmetry Constraint: a Classification-based Method (Section 3.3) The paper proposes a simpler classification-based method by relaxing the explicit anti-symmetry constraint of BT. Instead of modeling $P(i \succ j)$, it focuses on learning the marginal probability $P(i \text{ wins})$. Let $\hat{H}_{\mathrm{clf}}(x, y)$ be a classifier output. Order consistency can then be thought of as achieving $h = \hat{H}_{\mathrm{clf}}(x_1, y_1)$ and $-h = \hat{H}_{\mathrm{clf}}(x_2, y_2)$. A union bound of this order consistency leads to an objective function for a simple binary classifier: $ \mathcal{L}_{\mathrm{oc}} \leq \mathcal{L}_{\mathrm{clf}} := \mathbb{E}(h = \hat{H}_{\mathrm{clf}}(x_1, y_1)) + \mathbb{E}(-h = \hat{H}_{\mathrm{clf}}(x_2, y_2)) $

  • Explanation: This means we can train a standard binary classifier by treating each prompt-response pair in the preference data independently. If $(x_1, y_1)$ is preferred over $(x_2, y_2)$, then $(x_1, y_1)$ is a "positive" example and $(x_2, y_2)$ is a "negative" example. The logit output of this classifier can then be used as the reward score. This approach is more flexible as it allows using off-the-shelf binary classifiers (e.g., LightGBM) without the strict Siamese architecture or anti-symmetry constraint; a minimal sketch follows after this list.
  • Proposition 24 (Classification reward, from Appendix B.2): This proposition formally connects the logit of the marginal probability of winning (as learned by a classification model) to the BT reward. If the data come from a BT model and $s_i := \mathrm{logit}\, \mathbb{P}(i \text{ wins})$, then $s_i \geq r_i - C$ for a constant $C$. This means the classification reward provides a lower bound for the true reward (up to an additive constant), making it a valid proxy: $ s_i := \mathrm{logit}\, \mathbb{P}(i \text{ wins}) \geq r_i - \log \mathbb{E}[\exp(r_j)] $ where $\log \mathbb{E}[\exp(r_j)]$ is a constant.
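
Below is a minimal sketch of the classification-based alternative, using synthetic embeddings and a logistic-regression stand-in for the off-the-shelf classifier (the paper's experiments use MLP and LightGBM classifiers); the logit of the predicted winning probability serves as the reward.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical preference data: embeddings of chosen and rejected prompt-response pairs.
rng = np.random.default_rng(0)
psi_chosen = rng.normal(size=(500, 32))
psi_rejected = rng.normal(size=(500, 32)) - 0.3

# Each annotated pair contributes one positive and one negative example
# to a plain binary classifier; no Siamese structure is needed.
X = np.vstack([psi_chosen, psi_rejected])
y = np.concatenate([np.ones(500), np.zeros(500)])
clf = LogisticRegression(max_iter=1000).fit(X, y)

def clf_reward(psi: np.ndarray) -> np.ndarray:
    """Reward = logit of the marginalized winning probability, s_i = logit P(i wins)."""
    p_win = clf.predict_proba(psi)[:, 1]
    return np.log(p_win / (1.0 - p_win))

print(clf_reward(psi_chosen[:3]))
```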

4.2.5. Rethinking the Preference Annotation Process for Reward Modeling (Section 4 and Appendix C)

The paper investigates whether cross-prompt comparisons are beneficial.

Annotation Quality Function: The probability of an annotator being correct is $\xi(\Delta r) = \sigma(\beta \Delta r)$, where $\beta$ controls annotation quality.

  • $\beta \to 0$: random annotations ($\xi \to 0.5$).
  • $\beta \to \infty$: perfect annotations ($\xi \to 1$).

Example 1 (Annotation Quality under Gaussian Score, from Appendix C.1) When responses $y_1, y_2$ for a prompt $x$ are sampled from an LLM $\ell$, and their utilities $r(x, y)$ are Gaussian distributed, $r(x, y) \sim \mathcal{N}(\mu_x, \sigma_x^2)$, then the annotation quality for that prompt is: $ \mathcal{Q}_{\mathrm{pair}}(x) = \mathbb{E}_{y_1, y_2 | x} [\tau_x] = \mathbb{E}_{y_1, y_2 | x} \left[ \sigma(\beta | r(x, y_1) - r(x, y_2) |) \right] $ where $\tau_x = \sigma(\beta | r(x, y_1) - r(x, y_2) |)$ is a random variable whose probability density function is: $ f_{\tau_x | x}(t) = \frac{1}{\sqrt{\pi \beta^2 \sigma_x^2}} \exp \left( - \frac{\left( \log \left( \frac{t}{1-t} \right) \right)^2}{4 \beta^2 \sigma_x^2} \right) \cdot \frac{1}{t(1-t)} $

  • Explanation: This derivation quantifies the expected annotation quality for pairs from the same prompt. It shows that quality depends on both the annotator's ability ($\beta$) and the diversity of response utilities ($\sigma_x^2$). Higher $\beta$ or higher $\sigma_x^2$ leads to better annotation quality.

Proposition 10 (Cross-Prompt Comparisons Increase Utility Diversity, formal version from Appendix C.2) When responses $y_1 \sim \ell(x_1)$ and $y_2 \sim \ell(x_2)$ have Gaussian distributed utilities $r_{x,y} \sim \mathcal{N}(\mu_x, \sigma_x^2)$, the expected absolute reward difference for cross-prompt comparisons is greater than or equal to that for same-prompt comparisons: $ \mathbb{E}_x \mathbb{E}_{y_1, y_2 | x} \left[ | r_{x, y_1} - r_{x, y_2} | \right] \leq \mathbb{E}_{x_1, x_2} \mathbb{E}_{y_1 | x_1, y_2 | x_2} \left[ | r_{x_1, y_1} - r_{x_2, y_2} | \right] $

  • Explanation: This proposition provides a theoretical basis for cross-prompt comparisons. It states that, on average, the reward differences between responses from different prompts are larger than those between responses from the same prompt. Larger reward differences generally lead to easier and more reliable annotations (due to Assumption 5).

Theorem 11 (Cross-Prompt Annotation Improves Annotation Quality, formal version from Appendix C.3) When responses $y_1, y_2 \sim \ell(x)$ have utilities sampled from a location-scale family with a unimodal and symmetric density $f$, and $\xi : \mathbb{R}_+ \to [1/2, 1]$ is a first-order differentiable, monotonically increasing, and concave function (representing annotation quality), then: $ \mathbb{E}_x [\mathcal{Q}_{\mathrm{pair}}(x)] = \mathbb{E}_x \mathbb{E}_{y_1, y_2 | x} \left[ \xi(| r_{x, y_1} - r_{x, y_2} |) \right] \leq \mathbb{E}_{x_1, x_2} \mathbb{E}_{y_1 | x_1, y_2 | x_2} \left[ \xi(| r_{x_1, y_1} - r_{x_2, y_2} |) \right] := \mathbb{E}_{x_1, x_2} \big[ \mathcal{Q}_{\mathrm{cross\text{-}prompt}}(x_1, x_2) \big] $

  • Explanation: This theorem generalizes Proposition 10 to annotation quality itself, not just reward differences. It rigorously proves that cross-prompt annotations yield higher expected annotation quality than same-prompt annotations under broad and realistic conditions for utility distributions and annotator behavior. This is a strong theoretical argument for adopting cross-prompt comparison strategies in RLHF.
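
A Monte Carlo sketch of Proposition 10 and Theorem 11 under Gaussian utilities with $\xi(\Delta r) = \sigma(\beta \Delta r)$; the per-prompt means, spreads, and $\beta$ are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_pairs = 200, 50_000
beta = 1.0                                    # annotation quality parameter (illustrative)
mu = rng.normal(0.0, 1.0, n_prompts)          # per-prompt mean utility mu_x (hypothetical)
sigma = np.full(n_prompts, 0.5)               # per-prompt utility spread sigma_x (hypothetical)

def sample_utility(prompt_idx: np.ndarray) -> np.ndarray:
    """Draw r(x, y) ~ N(mu_x, sigma_x^2) for each prompt index."""
    return rng.normal(mu[prompt_idx], sigma[prompt_idx])

# Same-prompt pairs: both responses come from the same prompt's utility distribution.
x = rng.integers(n_prompts, size=n_pairs)
gap_same = np.abs(sample_utility(x) - sample_utility(x))

# Cross-prompt pairs: responses come from two independently chosen prompts.
x1, x2 = rng.integers(n_prompts, size=n_pairs), rng.integers(n_prompts, size=n_pairs)
gap_cross = np.abs(sample_utility(x1) - sample_utility(x2))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print("E|dr|     same vs cross:", gap_same.mean(), gap_cross.mean())
print("E[xi(dr)] same vs cross:", sigmoid(beta * gap_same).mean(), sigmoid(beta * gap_cross).mean())
```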

5. Experimental Setup

The experiments are designed to validate the theoretical insights and methodological contributions, focusing on reproducibility, controllability, and computational efficiency.

5.1. Datasets

The primary datasets used are:

  • Anthropic-Harmless (Bai et al., 2022a): Contains 41,876 training prompts and 2,273 test prompts. This dataset focuses on aligning LLMs to be harmless, meaning they should avoid generating toxic, biased, or unsafe content.

  • Anthropic-Helpful (Bai et al., 2022a): Contains 42,846 training prompts and 2,292 test prompts. This dataset focuses on aligning LLMs to be helpful, meaning they should generate informative, relevant, and useful responses.

    Additionally, results from two other datasets are presented in Appendix F:

  • Helpsteer dataset (Wang et al., 2023)

  • UltraFeedback dataset (Cui et al., 2023)

    Data Generation and Annotation:

  • For each of the 6 base LLMs (see below), 10 responses are generated for each training prompt to serve as candidates for reward model training.

  • For each test prompt, 500 responses are generated to be scored by golden reward models for testing via Best-of-N sampling.

  • Simulated Preference Annotation: To simulate human annotation, the authors use existing open-source golden reward models (Yang et al., 2024b; Dong et al., 2023, 2024) as "annotators." This ensures reproducibility and affordability.

    • Annotation Noise: A realistic imperfect annotation process is simulated using cognitive bottleneck models (Stewart et al., 2005; Guest et al., 2016). The probability of a correct label depends on the true reward difference $\Delta r = |r(x, y_1) - r(x, y_2)|$ via a sigmoid function: $ \mathbb{P} \bigg( h(x, y_1, y_2) (r(x, y_1) - r(x, y_2)) > 0 \bigg| \Delta r \bigg) = \xi(\Delta r) = \sigma(\beta \Delta r) $ Here, $\beta$ is the annotation quality parameter (a simulation sketch appears at the end of this subsection).
      • When $\beta = 0$, annotations are random ($\xi = 0.5$).
      • When $\beta \to \infty$, annotations are perfect ($\xi \to 1$).
      • The default $\beta = 1$ is used unless explicitly varied to study the impact of annotation quality (ranging over $\beta \in [0.5, 0.7, 1.0, 3.0, 5.0, 10.0]$, corresponding to error rates from 5% to 38%).
  • Annotation Quantity: Experiments vary the number of annotations used for training: [5000, 10000, 20000, 40000].

  • Pairing Choices:

    • Same-Prompt Comparison (Standard): Annotators compare response pairs generated from the same prompt.
    • Cross-Prompt Comparison (X-Prompt): Annotators compare two prompt-response pairs that are randomly selected from the dataset, potentially from different prompts.
    • Synthetic Setups for Cross-Prompt Analysis:
      • Similar Comparison: For a single prompt, two responses with similar quality (middle-ranked by a golden reward model) are paired to simulate lack of diversity.

      • Diversified Comparison: For a single prompt, responses with very different qualities (highest and lowest-scored by a golden reward model) are paired to simulate high diversity.

        These datasets were chosen because they are extensively studied in reward modeling, and open-source golden reward models are available, enabling controlled and reproducible experiments.
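To make the simulated annotation protocol above concrete, here is a minimal Python sketch of drawing a noisy preference label from golden reward scores, following the $\sigma(\beta \Delta r)$ rule; the function name and NumPy usage are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def simulate_preference_label(r1, r2, beta=1.0, rng=None):
    """Draw a noisy preference label between two prompt-response pairs.

    r1, r2 : golden-reward scores of the two items being compared
             (same-prompt or cross-prompt, depending on how the pair was sampled).
    beta   : annotation quality; beta = 0 gives random labels (xi = 0.5),
             beta -> infinity gives perfect labels (xi -> 1).
    Returns +1 if the annotator prefers the first item, -1 otherwise.
    """
    rng = rng or np.random.default_rng()
    delta_r = abs(r1 - r2)
    p_correct = 1.0 / (1.0 + np.exp(-beta * delta_r))  # xi(Delta r) = sigma(beta * Delta r)
    true_label = 1 if r1 > r2 else -1
    return true_label if rng.random() < p_correct else -true_label
```

In this sketch, same-prompt and cross-prompt annotation differ only in how the pair of scores `(r1, r2)` is sampled (two responses to one prompt vs. two randomly chosen prompt-response pairs); the labeling rule itself is unchanged.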

5.2. Evaluation Metrics

The effectiveness of different reward models is evaluated using Best-of-N (BoN) sampling with $N=500$.

  • Best-of-N (BoN) Sampling:

    1. Conceptual Definition: BoN is a common strategy to evaluate the quality of a generative LLM or the effectiveness of a reward model. For a given prompt, the LLM generates $N$ diverse responses. The trained reward model then assigns a score to each of these $N$ responses. The response with the highest score according to the reward model is selected as the "best" and is used as the final output. The quality of this "best" response (as judged by an independent golden reward model) reflects the performance of the trained reward model.
    2. Mathematical Formula: While BoN itself is a sampling strategy, the metric derived from it for evaluation is typically the Golden Reward Value of the selected best response. If $S_k = \hat{r}(x, y_k)$ is the score of the $k$-th generated response $y_k$ for a prompt $x$ according to the trained reward model $\hat{r}$, and $R^*(x, y)$ is the score of the golden reward model, then the BoN evaluation for a prompt $x$ is: $ \text{BoN\_Score}(x) = R^*(x, y_{\text{best}}) \quad \text{where } y_{\text{best}} = \arg\max_{k \in \{1, \dots, N\}} \hat{r}(x, y_k) $ A minimal code sketch of this evaluation loop is given at the end of this subsection.
    3. Symbol Explanation:
      • $N$: The number of responses generated by the LLM for each prompt (here, $N=500$).
      • $\hat{r}(x, y_k)$: The score predicted by the trained reward model for response $y_k$ to prompt $x$.
      • $y_{\text{best}}$: The response among the $N$ generations that received the highest score from the trained reward model.
      • $R^*(x, y_{\text{best}})$: The score assigned by an independent golden reward model (a high-quality, pre-trained reward model acting as ground truth) to the selected best response. This is the ultimate metric for evaluating the performance of the trained reward model.
  • Golden Reward Value Improvement: The paper reports results as "improvements over base models" or "relative performance gains achieved through the reward models." This means comparing the BoN_Score obtained using the trained reward model against a baseline score (e.g., the average score of responses from the base LLM without reward model selection).

    The choice of BoN is justified by:

  1. Performance: Empirical studies show BoN often achieves better performance than PPO (Dong et al., 2023; Yuan et al., 2023; Gao et al., 2023; Coste et al., 2023).
  2. Stability and Reduced Engineering Overhead: BoN requires no hyperparameter tuning and is more stable than PPO, leading to more consistent and interpretable results (Ivison et al., 2024; Xu et al., 2024).
  3. Computational Efficiency and Reproducibility: BoN reuses the same $N$ generations at test time, which is far more efficient than PPO fine-tuning, which would be computationally prohibitive across the more than 12,000 experimental setups.
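As referenced above, the BoN evaluation reduces to a short selection-and-scoring loop. The sketch below treats the trained and golden reward models as assumed callables and illustrates the procedure rather than reproducing the paper's actual implementation.

```python
import numpy as np

def best_of_n_eval(prompt, responses, trained_rm, golden_rm):
    """Best-of-N evaluation for a single prompt.

    trained_rm(prompt, response) -> score from the reward model under evaluation
    golden_rm(prompt, response)  -> score from the golden (ground-truth) reward model
    """
    scores = [trained_rm(prompt, y) for y in responses]   # N = len(responses), e.g. 500
    y_best = responses[int(np.argmax(scores))]            # response picked by the trained RM
    return golden_rm(prompt, y_best)                      # its quality judged by the golden RM

def golden_reward_improvement(prompts, gens, trained_rm, golden_rm):
    """Average golden reward of BoN selections minus the base model's average golden reward."""
    bon = np.mean([best_of_n_eval(x, ys, trained_rm, golden_rm) for x, ys in zip(prompts, gens)])
    base = np.mean([golden_rm(x, y) for x, ys in zip(prompts, gens) for y in ys])
    return bon - base
```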

5.3. Baselines

The paper compares its proposed classification-based reward models against the widely used Bradley-Terry (BT) model.

  • Base LLMs: The experiments use a diverse set of base LLMs for generating responses:

    • Gemma2b
    • Gemma7b
    • LLaMA3-8b
    • Gemma2b-SFT (Supervised Fine-Tuned version of Gemma2b)
    • Gemma7b-SFT (Supervised Fine-Tuned version of Gemma7b)
    • LLaMA3-8b-SFT (Supervised Fine-Tuned version of LLaMA3-8b)
    These base models cover different scales and levels of initial alignment (raw vs. SFT-ed), providing a comprehensive evaluation context.
  • Reward Model Implementations:

    • BT-MLP: The traditional Bradley-Terry model implemented with an MLP backbone. It requires a Siamese structure to ensure anti-symmetry. This serves as the primary baseline for the BT model.
    • CLF-MLP: The proposed classification-based reward model implemented with an MLP backbone. This allows for a fair comparison of the objective function, isolating gains from the classification objective itself while keeping the neural network architecture consistent.
    • CLF-LGB: The proposed classification-based reward model implemented with LightGBM (Ke et al., 2017), an off-the-shelf gradient boosting decision tree algorithm. This demonstrates the flexibility and potential benefits of using non-neural-network classifiers with the order consistency objective (a minimal sketch contrasting the two training objectives is given at the end of this subsection).
  • Embeddings: Gemma2b is used to generate embeddings for reward modeling across all experiments. This standardizes the representation learning component, ensuring that differences in reward model performance are attributed to the reward modeling objective and architecture, rather than varying embedding quality.

  • Hyper-parameters: To ensure a fair comparison and isolate the source of gains, identical default hyperparameter settings are used for LightGBM (e.g., objective='binary', metric='binary_logloss') and for all MLP configurations (for both BT and Classification models).
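As referenced in the CLF-LGB item above, the following PyTorch-style sketch contrasts the two training objectives on precomputed embeddings: a Bradley-Terry pairwise loss with Siamese scoring, and the classification objective that labels chosen embeddings as positives and rejected ones as negatives. The module names and shapes are assumptions; an off-the-shelf classifier such as LightGBM can stand in for the MLP in the classification case, as done for CLF-LGB.

```python
import torch
import torch.nn.functional as F

def bt_pairwise_loss(reward_mlp, emb_chosen, emb_rejected):
    """Bradley-Terry objective: -log sigma(r(chosen) - r(rejected)).
    The same MLP scores both sides (Siamese structure, anti-symmetric by construction)."""
    r_c = reward_mlp(emb_chosen).squeeze(-1)
    r_r = reward_mlp(emb_rejected).squeeze(-1)
    return -F.logsigmoid(r_c - r_r).mean()

def clf_loss(classifier_mlp, emb_chosen, emb_rejected):
    """Classification objective: chosen embeddings are positives, rejected ones negatives.
    The classifier's logit (or predicted probability) is then used as the reward score."""
    x = torch.cat([emb_chosen, emb_rejected], dim=0)
    y = torch.cat([torch.ones(len(emb_chosen)), torch.zeros(len(emb_rejected))])
    logits = classifier_mlp(x).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logits, y)
```

At inference time, both approaches score a single prompt-response embedding: the BT model outputs its scalar reward directly, while the classifier's logit (or predicted probability) serves as the reward.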

6. Results & Analysis

The experimental results aim to validate the order consistency framework, compare classification-based and BT models, and assess the impact of annotation quality, quantity, and cross-prompt comparisons.

6.1. Comparing The Bradley-Terry and Classification Objectives

The following figure (Figure 1 from the original paper) compares the performance of BT-MLP, CLF-MLP, and CLF-LGB across different base LLMs on the Harmless and Helpful datasets, measuring Golden Reward Value Improvement via Best-of-N (BoN) sampling with $N=500$.

Figure 1: Comparison between BT and Classification reward models. In general, the classification reward models achieve better performance than the BT reward models, with the added flexibility of using off-the-shelf classifiers beyond MLPs. Error bars are given by 5 runs with different seeds.

Analysis:

  • Overall Superiority of Classification Models: For almost all base LLMs and both Harmless and Helpful datasets, the classification-based reward models (CLF-MLP and CLF-LGB) consistently achieve higher Golden Reward Value Improvement compared to the BT-MLP model.
  • Flexibility Benefit: CLF-LGB (LightGBM) often performs as well as, or even better than, CLF-MLP, highlighting that the classification objective is compatible with diverse off-the-shelf classifiers and can leverage their strengths. This supports the paper's argument for greater flexibility beyond MLPs and the Siamese structure required by BT-MLP.
  • Consistency Across Models and Tasks: The trend holds across various Gemma and LLaMA3-8b models (both raw and SFT-ed versions) and for both Harmlessness and Helpfulness alignment tasks. This suggests the classification objective is a robust alternative.
  • Takeaways: The classification-based reward models are shown to be more effective than BT reward models in general, offering superior performance and the practical advantage of being able to integrate various machine learning algorithms beyond specialized MLPs.

6.2. How Annotation Qualities and Quantities Affect Performance

6.2.1. Changing Annotation Quality

The following figure (Figure 2 from the original paper) shows the impact of varying annotation quality (controlled by $\beta$) on the performance of the reward models on the Harmless and Helpful datasets, with a fixed quantity of 40,000 annotations. The Annotation Error Rate (AER) decreases as $\beta$ increases.

Figure 2: Changing the annotation quality. Dataset: Harmless, Helpful.

Analysis:

  • Robustness to Annotation Noise: As annotation error rates increase (i.e., $\beta$ decreases), the classification models (CLF-MLP, CLF-LGB) exhibit greater robustness. Their performance drops are generally smaller and more graceful compared to BT-MLP.

  • BT's Advantage in High-Quality Data: When annotation quality is very high (low error rates, e.g., $\beta=10.0$, where the AER is around 5%), the BT-MLP model sometimes outperforms the classification models. This suggests that the BT model's emphasis on calibrated probabilities might be beneficial when the input signal is almost perfectly reliable.

  • Overall Practicality: In realistic scenarios where human annotation often contains significant noise (error rates between 5% and 38% are common), the classification models provide a more robust and reliable solution.

  • Takeaways: While BT models excel under near-perfect annotation quality, classification models demonstrate superior robustness and maintain better performance as annotation error rates increase, making them more practically useful in noisy human feedback environments.

    Further results on annotation quality under different availability scenarios are provided in Appendix E (Figures 8 and 9). These corroborate the findings, showing that the CLF methods are generally more stable across various annotation counts when quality varies; for example, at lower annotation counts (e.g., 5,000), the CLF methods are more stable on the Harmless dataset (Figure 8).

6.2.2. Changing Annotation Quantity

The following figure (Figure 3 from the original paper) illustrates how varying annotation quantities (5000, 10000, 20000, 40000) affect the reward models' performance, holding $\beta=1$ (a moderate annotation quality) constant.

Figure 3: Changing the annotation quantity. Dataset: Harmless, Helpful.

Analysis:

  • Consistent Superiority of Classification Models: Across all annotation quantities and for both datasets, the classification models (CLF-MLP, CLF-LGB) consistently outperform the BT-MLP model.

  • Scalability with Data: As the number of annotations increases, all models generally show improved performance. However, classification models demonstrate more consistent and often steeper improvements, indicating better scalability and data efficiency.

  • Performance at Low Data Regimes: Even with fewer annotations (e.g., 5000), CLF models maintain a significant lead over BT-MLP, suggesting they are more effective in data-scarce scenarios.

  • Takeaways: Classification models consistently outperform BT models across varying annotation quantities, offering superior performance and more reliable improvements as more data becomes available. This further strengthens their position as a compelling alternative for reward modeling.

    Additional results on annotation availability under different annotation qualities are provided in Appendix E (Figures 10 and 11), generally supporting these conclusions.

6.3. LLM Alignment with Preference Annotations Between Different Prompts

This section investigates the impact of cross-prompt comparisons versus same-prompt comparisons on reward model performance. The default setting uses $\beta=1$ and 40,000 annotations.

6.3.1. Cross-Prompt vs. Same-Prompt Comparisons

The following figure (Figure 4 from the original paper) compares the performance of reward models trained with same-prompt annotations (unshaded bars) and cross-prompt annotations (shaded bars) across various base models and reward modeling methods.

Figure 4: Results comparing cross-prompt comparison based annotations. Preference annotations on cross-prompt comparisons outperform same-prompt comparisons.

Analysis:

  • Significant Improvement with Cross-Prompt Comparisons: For nearly all base LLMs and reward modeling methods, cross-prompt annotations (shaded bars) consistently and substantially outperform same-prompt annotations (unshaded bars) in terms of Golden Reward Value Improvement. This is a strong empirical validation of the theoretical findings in Section 4 that cross-prompt comparisons can improve annotation quality.

  • Generalizability of Improvement: The benefits of cross-prompt annotations are observed across both Harmless and Helpful tasks, and across BT-MLP, CLF-MLP, and CLF-LGB models. This suggests that the improvement is not specific to a particular model or task but is a general advantage of the annotation design.

  • Takeaways: Cross-prompt comparisons are a highly effective annotation strategy that leads to significant improvements in reward model performance compared to conventional same-prompt comparisons.

    Additional figures reporting cross-prompt annotation results on changing annotation availability and annotation quality are provided in Appendix E (Figures 12 to 15). These further confirm the benefits of cross-prompt comparisons across diverse experimental conditions.

6.3.2. Impact of Response Diversity

To understand why cross-prompt comparisons are so effective, the paper introduces two synthetic setups (Figures 5 and 6):

  • Similar Comparison: Responses for a single prompt have similar quality (lack diversity).

  • Diversified Comparison: Responses for a single prompt have different qualities (high diversity).

    The following figure (Figure 5 from the original paper) shows results for Similar and Diversified Comparison setups.

    Figure 5: Results comparing cross-prompt comparison-based annotations on synthetically generated similar or diversified comparison pairs. Cross-prompt comparison significantly improves the performance of reward modeling with same-prompt response pairs lacking diversity. Error bars are from 5 runs with different seeds.

Analysis:

  • Critical Role in Low Diversity Scenarios: In the Similar Comparison setting (left side of the plots), same-prompt comparisons (unshaded bars) often fail to produce informative reward models, resulting in near-zero or even negative Golden Reward Value Improvement. In stark contrast, cross-prompt annotations (shaded bars) in this setting achieve substantial performance gains. This strongly suggests that cross-prompt comparisons are essential when same-prompt responses lack diversity and are hard to distinguish.
  • Benefits Even with High Diversity: In the Diversified Comparison setting (right side of the plots), where same-prompt responses already have high diversity, the need for cross-prompt comparisons diminishes. While cross-prompt annotations still perform well, their advantage over same-prompt annotations is smaller, and sometimes performance is comparable. Importantly, they do not have an evident negative impact.
  • Stability and Lower Dependence on Diversity: Comparing results across Similar and Diversified settings, and with Figure 4, cross-prompt annotations offer more stable performance with less dependence on the inherent diversity of responses for a single prompt. This makes them a generally reliable strategy.
  • Takeaways: Cross-prompt comparisons are crucial when same-prompt responses lack diversity, as they significantly enhance reward model performance. Their benefits are less pronounced but still present (or at least not detrimental) in high-diversity scenarios, making them a generally robust strategy.

6.3.3. Correlation with Averaged Absolute Score Difference

The following figure (Figure 6 from the original paper) examines the relationship between the averaged absolute score difference in pairwise annotations (x-axis) and the Golden Reward Value Improvement from cross-prompt annotations (y-axis).

Figure 6: Comparing the averaged absolute difference in scores in pairwise annotations (x-axis) and improvements achieved by using cross-prompt annotations (y-axis). The two variables are highly correlated.

Analysis:

  • Strong Positive Correlation: The scatter plots clearly show a strong positive correlation between the averaged absolute reward difference in the annotated pairs and the Golden Reward Value Improvement achieved by the cross-prompt annotations. This empirically validates the theoretical intuition that larger differences between compared items make annotations more informative and lead to better reward models.
  • Practical Implication for Random Sampling: The paper notes that averaged reward differences in the synthetic Similar cases and Random settings are similar. This implies that in practice, when randomly selecting two responses for a single prompt to be annotated, these pairs are likely to face the challenge of high response similarity (low diversity). This is precisely the scenario where cross-prompt annotation can be most effectively applied to improve performance.
  • Takeaways: The effectiveness of cross-prompt annotations is directly linked to their ability to increase the expected difference between compared responses. Since same-prompt random sampling often yields low diversity, cross-prompt annotations provide a crucial mechanism to boost annotation quality and reward model performance in practical RLHF settings.

6.4. Additional Results on Helpsteer and UltraFeedback (Appendix F)

The paper also presents additional results on the Helpsteer and UltraFeedback datasets in Appendix F, which consistently align with the findings from Anthropic-Harmless and Anthropic-Helpful.

The following are the results from Figure 16 of the original paper, comparing BT and Classification reward models on Helpsteer and UltraFeedback datasets:

Figure 16: Comparison between BT and Classification reward models. In general, the classification reward models achieve better performance than the BT reward models, with the added flexibility of using off-the-shelf classifiers beyond MLPs. Error bars are given by 5 runs with different seeds.

Analysis: This figure shows the classification reward models (CLF-MLP, CLF-LGB) generally achieve better performance than BT-MLP on both Helpsteer and UltraFeedback datasets across different base LLMs. This further reinforces the main conclusion from Figure 1.

The following are the results from Figure 17 of the original paper, showing annotation quality changes on Helpsteer and UltraFeedback datasets:

Figure 17: Changing the annotation quality. Dataset: Helpsteer, UltraFeedback.

Analysis: Similar to Figure 2, classification models (CLF-MLP, CLF-LGB) demonstrate greater robustness to increasing annotation error rates (decreasing $\beta$) on Helpsteer and UltraFeedback. BT-MLP shows higher performance under very high-quality annotations (low error rates), but the CLF methods are generally more robust.

The following are the results from Figure 18 of the original paper, showing annotation quantity changes on Helpsteer and UltraFeedback datasets:

Figure 18: Changing the annotation quantity. Dataset: Helpsteer, UltraFeedback.

Analysis: Similar to Figure 3, classification models consistently outperform BT models across varying annotation quantities on these additional datasets, providing more consistent improvements as data increases.

The following are the results from Figure 19 of the original paper, showing cross-prompt comparison results on Helpsteer and UltraFeedback datasets:

Figure 19: Results comparing cross-prompt comparison based annotations. Preference annotations on cross-prompt comparisons outperform same-prompt comparisons.

Analysis: Similar to Figure 4, cross-prompt comparisons (shaded bars) consistently outperform same-prompt comparisons (unshaded bars) on both Helpsteer and UltraFeedback datasets, reaffirming the benefits of this annotation strategy.

The following are the results from Figure 20 of the original paper, showing cross-prompt comparison results on synthetic similar or diversified comparison pairs on Helpsteer and UltraFeedback datasets:

Figure 20: Results comparing cross-prompt comparison-based annotations on synthetically generated similar or diversified comparison pairs. Cross-prompt comparison significantly improves the performance of reward modeling with same-prompt response pairs lacking diversity. Error bars are from 5 runs with different seeds.

Analysis: Similar to Figure 5, cross-prompt comparisons significantly improve reward modeling performance for same-prompt response pairs lacking diversity (in Similar Comparison settings) on these additional datasets. The benefit is less pronounced in Diversified Comparison settings but still present or neutral.

Conclusive Remark on Experiments: The comprehensive experiments consistently demonstrate that classification-based reward models offer significant advantages over traditional BT models in terms of performance, flexibility (using off-the-shelf classifiers), and robustness to varying annotation quality and quantity. Furthermore, cross-prompt comparisons emerge as a superior annotation strategy, particularly beneficial when same-prompt responses lack diversity—a common issue in practical settings. These findings provide strong empirical evidence for the paper's theoretical arguments and practical recommendations.

6.5. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Usage | LLM Arena | BT Reward Modeling |
| --- | --- | --- |
| Objective | Direct Parameter Estimation | Model Parameterization |
| Prediction | Not needed | Needed |
| Number of Comparisons | Sufficient (1.7M for N = 130) | Very Sparse (N/2) |
| Usage of BT Model | Classical BT | BT-Regression |
| Requirement | At least O(N log(N)) Comparisons | Covariates |

Analysis: This table clearly differentiates the traditional application of the BT model in LLM Arenas (e.g., ranking chatbots) from its use in BT Reward Modeling for LLM alignment.

  • Objective: In LLM Arena, the goal is to estimate a fixed ability score for each LLM (player). In Reward Modeling, the goal is to learn a function that maps any prompt-response pair to a reward score.
  • Prediction: Arena models don't need to predict scores for new players, only rank existing ones. Reward models must generalize to predict scores for unseen prompt-response pairs.
  • Comparison Density: Arena settings typically have a massive number of comparisons per player (e.g., 1.7 million comparisons for 130 models), providing sufficient data. Reward modeling, however, deals with very sparse comparisons (e.g., $N/2$ comparisons for $N$ prompt-response pairs), a key challenge addressed by the paper.
  • BT Variant: This sparsity necessitates BT-Regression in reward modeling, where covariates (embeddings) are used to predict scores, unlike the classical BT model in arenas.
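To illustrate the distinction summarized in Table 1, a minimal sketch (an illustrative assumption, not the paper's code) of the two parameterizations might look as follows: classical BT learns one ability parameter per player, whereas BT-Regression predicts a score from covariates (embeddings) and can therefore generalize to unseen prompt-response pairs.

```python
import torch
import torch.nn as nn

class ClassicalBT(nn.Module):
    """Classical BT (LLM Arena): one free ability parameter per player; no generalization needed."""
    def __init__(self, num_players):
        super().__init__()
        self.ability = nn.Parameter(torch.zeros(num_players))

    def forward(self, player_idx):
        return self.ability[player_idx]

class BTRegression(nn.Module):
    """BT-Regression (reward modeling): the score is a function of covariates (embeddings),
    so it can predict rewards for prompt-response pairs never seen during training."""
    def __init__(self, emb_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, embedding):
        return self.net(embedding).squeeze(-1)
```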

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper rigorously re-examines the Bradley-Terry (BT) model in preference-based reward modeling for Large Language Model (LLM) alignment. It provides the first asymptotic theory for neural network-based BT regression in this context, formally establishing its convergence rate. However, it argues that the BT model's precise probability calibration is not strictly necessary for downstream LLM optimization; rather, order consistency (preserving correct ranking) is sufficient. Building on this insight, the paper proposes a simpler, classification-based reward modeling objective that is order-consistent and compatible with off-the-shelf binary classifiers. Furthermore, it theoretically demonstrates the superiority of cross-prompt comparisons in annotations, showing they increase utility diversity and improve annotation quality. Extensive empirical evaluations across over 12,000 setups confirm these findings: classification-based models generally outperform BT models and are more robust to annotation noise and quantity, while cross-prompt annotations significantly enhance reward model performance, especially when same-prompt responses lack diversity.

7.2. Limitations & Future Work

The authors implicitly highlight areas for future work rather than explicitly listing limitations as a dedicated section.

  • Representation Learning: In Section 5.4, the paper notes that it separates reward model learning from representation learning. Future work could explore combining their order consistency objectives with advances in generative reward models (Yang et al., 2024a) or generative verifiers (Zhang et al., 2024) that leverage LLM generation abilities to improve embeddings or reward predictions. This suggests that the gains from reward modeling objectives could be further amplified by better embedding representations.
  • Beyond Two-Stage Methods: While the paper champions two-stage methods (explicit reward model followed by RL) for inference-time optimization, it acknowledges the existence of direct alignment methods (e.g., DPO, KTO, RPO). Future research could investigate how the order consistency principle might be adapted or integrated into these direct methods, or explore hybrid approaches.
  • Understanding Annotation Biases: The paper uses a simulated annotation process with a cognitive bottleneck model ($\sigma(\beta \Delta r)$). While realistic, actual human annotation processes might involve more complex biases or non-transitive preferences that are not fully captured by this model. Future work could explore reward modeling objectives that are more robust to a wider range of annotation imperfections.
  • Dynamic Preference Adjustment: The paper references works like Rewards-in-Context (Yang et al., 2024b), which aim for multi-objective alignment with dynamic preference adjustment. Integrating dynamic or context-dependent aspects into the order consistency objective could be a promising direction.

7.3. Personal Insights & Critique

This paper provides a highly valuable contribution to the RLHF literature by systematically re-evaluating the Bradley-Terry model, which has been a cornerstone of LLM alignment but often applied without deep theoretical scrutiny in the specific context of sparse, high-dimensional textual data.

Strengths and Inspirations:

  1. Theoretical Rigor Meets Practicality: The paper successfully bridges theoretical analysis (providing asymptotic theory for BT regression with deep neural networks) with practical implications (proposing simpler, more effective alternatives and annotation strategies). This dual approach is highly commendable.
  2. Challenging Assumptions: The shift in perspective from calibrated probabilities to order consistency as the core requirement for reward models is a profound insight. It simplifies the problem and opens the door for more flexible and robust machine learning algorithms. This concept can be transferable to any ranking or preference-based system where the absolute magnitude of scores is less critical than their relative ordering for optimization.
  3. Actionable Data Collection Insights: The theoretical and empirical validation of cross-prompt comparisons is a significant practical takeaway. It directly informs how preference data should be collected to maximize its utility, potentially leading to more efficient and higher-quality RLHF pipelines with less data. This is particularly relevant given the high cost of human annotation.
  4. Extensive Experiments: The sheer scale and systematic nature of the empirical evaluation (12,000+ setups) provide high confidence in the findings and offer a rich benchmark for future research.

Potential Issues/Areas for Improvement:

  1. Complexity of Theoretical Conditions: While impressive, the theoretical conditions (e.g., Hölder smoothness, small value bounds, specific scaling of MLP depth/width/sparsity) required for the truncated KL risk bound (Theorem 20) are quite stringent. It might be challenging to verify if real-world LLM embedding spaces and true reward functions strictly adhere to these assumptions. Further work could explore robustness to violations of these assumptions.

  2. Generalizability of "Additive Constant": The convergence of reward differences up to an additive constant is a common result for preference models. While sufficient for ranking, it means the absolute scale of the reward function is arbitrary. For certain RL algorithms or value-based agents, the scale of rewards can sometimes influence learning dynamics. However, for BoN sampling (as used here) and many policy optimization methods, this is indeed not an issue.

  3. Simulated Annotators: Relying on golden reward models as annotators, while ensuring reproducibility, might not fully capture the nuances, inconsistencies, and emergent biases of real human annotators. Further validation with large-scale human-annotated datasets would strengthen the empirical claims, especially regarding the classification model's robustness to human noise.

  4. Efficiency of Cross-Prompt Annotation in Practice: While theoretically beneficial, implementing cross-prompt annotation efficiently in a human labeling pipeline might introduce new logistical challenges (e.g., cognitive load for annotators if prompts are too disparate, interface design). The paper's analysis focuses on the statistical quality of the resulting data, but practical human factors also play a role.

    Overall, this paper makes a substantial contribution by providing a deeper understanding of reward modeling foundations and offering practical, empirically validated alternatives that can make LLM alignment more efficient and effective. Its emphasis on order consistency and cross-prompt comparisons are particularly insightful and actionable for the field.
