Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives
TL;DR Summary
This paper reevaluates the Bradley-Terry model in preference-based reward modeling, establishes a theoretical foundation for its convergence when implemented with deep neural networks over embeddings, argues that it is not strictly necessary for downstream optimization, proposes an alternative upper-bound (classification-based) algorithm, and validates these approaches empirically across more than 12,000 experimental setups.
Abstract
The Bradley-Terry (BT) model is a common and successful practice in reward modeling for Large Language Model (LLM) alignment. However, it remains unclear why this model -- originally developed for multi-player stochastic game matching -- can be adopted to convert pairwise response comparisons to reward values and make predictions, especially given the fact that only a limited number of prompt-response pairs are sparsely compared with others. In this paper, we first revisit the foundations of using BT models in reward modeling, and establish the convergence rate of BT reward models based on deep neural networks using embeddings, providing a theoretical foundation for their use. Although theoretically sound, we argue that the BT model is not a necessary choice from the perspective of downstream optimization. This is because a reward model only needs to preserve the correct ranking predictions through a monotonic transformation of the true reward. We highlight the critical concept of order consistency in reward modeling and demonstrate that the BT model possesses this property. Consequently, we propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers, as an alternative order-consistent reward modeling objective. To offer practical insights, we empirically evaluate the performance of these different reward modeling approaches across more than 12,000 experimental setups, using base LLMs, datasets, and diverse annotation designs that vary in quantity, quality, and pairing choices in preference annotations.
In-depth Reading
1. Bibliographic Information
1.1. Title
Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives
1.2. Authors
- Hao Sun: University of Cambridge, Cambridge, UK.
- Yunyi Shen: Massachusetts Institute of Technology, Cambridge, MA, USA.
- Jean-Francois Ton: ByteDance Research, London, UK.
1.3. Journal/Conference
This paper is a preprint published on arXiv, posted on 2024-11-07. As a preprint, it has not yet undergone formal peer review for a specific journal or conference, but arXiv is a widely used platform for disseminating cutting-edge research in fields like AI and machine learning.
1.4. Publication Year
2024
1.5. Abstract
The Bradley-Terry (BT) model is widely used in reward modeling for Large Language Model (LLM) alignment, despite its original development for multi-player stochastic game matching. The paper investigates the theoretical justification for its application in converting sparse pairwise response comparisons to reward values. It first establishes a theoretical foundation by demonstrating the convergence rate of BT reward models based on deep neural networks using embeddings. While theoretically sound, the authors argue that the BT model is not strictly necessary for downstream optimization, as reward models only need to preserve correct ranking predictions through a monotonic transformation of the true reward. This highlights the critical concept of order consistency, which the BT model possesses. Consequently, the paper proposes a simpler, alternative order-consistent reward modeling objective using an upper-bound algorithm compatible with off-the-shelf binary classifiers. To provide practical insights, the study conducts extensive empirical evaluations across over 12,000 experimental setups, utilizing 6 base LLMs, 2 datasets, and various annotation designs (quantity, quality, pairing choices) to compare these different reward modeling approaches.
1.6. Original Source Link
https://arxiv.org/abs/2411.04991v2 PDF Link: https://arxiv.org/pdf/2411.04991v2.pdf Publication Status: Preprint (arXiv).
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the unclear theoretical justification and necessity of using the Bradley-Terry (BT) model in reward modeling for Large Language Model (LLM) alignment. While the BT model has shown empirical success in Reinforcement Learning from Human Feedback (RLHF), its origins are in multi-player stochastic game matching, raising questions about its suitability for converting sparse pairwise response comparisons to continuous reward values in the LLM context.
This problem is crucial because LLM alignment is fundamental for the safe, effective, and ethical deployment of LLMs across diverse applications. Reward modeling is a key component of RLHF, which is a dominant strategy for aligning LLMs with human preferences. Gaps exist in prior research regarding:
- Theoretical Soundness: Why does the BT model work, especially with extremely sparse comparison data characteristic of LLM alignment, where a given prompt-response pair is compared only a few times?
- Necessity: Is the BT model, with its specific anti-symmetric structure and log-likelihood loss, a necessary choice, or can simpler, more flexible alternatives achieve similar or better performance?
- Data Format: The conventional BT model assumes randomized pairwise comparisons. Is restricting comparisons to responses from the same prompt (a common practice) optimal, or would cross-prompt comparisons lead to more effective reward modeling?

The paper's innovative entry point is to thoroughly revisit the foundations of BT models in this context, provide theoretical underpinnings, and then use these insights to explore and validate alternative, simpler, and potentially more efficient approaches.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Formal Analysis and Justification of BT Model Application: It provides a comprehensive analysis of the BT model's application in LLM alignment, distinguishing it from its traditional use in multi-player arenas. It formally justifies the use of BT models in LLM reward modeling by analyzing the underlying assumptions (existence of deterministic utility values, deterministic comparisons, logistic difference assumption) that lead to BT-type models.
- First Asymptotic Theory for Neural Network-Based BT Regression: The paper introduces the first asymptotic theory for neural network-based BT regression in preference-based reward modeling. It establishes risk bounds for BT model reward estimation in the context of LLM alignment, proving that with a sufficient number of annotated comparisons in the embedding space, the BT reward model estimation converges to the true reward value up to an additive constant.
- Introduction of Order Consistency and a Classification-Based Alternative: It proposes order consistency as a core objective for reward modeling, arguing that learning a reward function up to a monotonic transformation is sufficient for downstream LLM optimization. It demonstrates that both the BT model and a novel classification-based approach inherently possess this property. The classification-based alternative offers greater flexibility and compatibility with off-the-shelf binary classifiers, interpreting reward scores as logits of marginalized winning probabilities.
- Theoretical Superiority of Cross-Prompt Comparisons: The paper theoretically analyzes and highlights the superiority of cross-prompt annotations in LLM alignment. It demonstrates that cross-prompt comparisons increase the expected differences between prompt-response pairs, thereby improving annotation quality under broad conditions (e.g., Gaussian utility distributions and concave annotation quality functions).
- Extensive Empirical Validation: The study conducts a massive empirical evaluation across over 12,000 experimental setups. These experiments involve 6 base LLMs, 2 datasets (Anthropic-Harmless, Anthropic-Helpful), 3 response sampling methods, 6 annotation noise levels, 3 reward model implementations (BT-MLP, CLF-MLP, CLF-LGB), 4 annotation availability scenarios, and 5 random seeds. The findings show that:
  - Classification-based reward models generally achieve better performance and greater robustness to annotation noise and quantity compared to BT models.
  - Cross-prompt comparisons significantly outperform same-prompt comparisons, especially when responses for a single prompt lack diversity, providing a more stable and less diversity-dependent performance.

These findings clarify the theoretical foundations of BT models, offer valid alternatives, and provide practical guidance on optimal annotation strategies for effective LLM alignment.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following fundamental concepts:
- Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data to understand, generate, and process human-like language. They are capable of various tasks like translation, summarization, question answering, and creative writing. Examples include GPT-3, LLaMA, and Gemma.
- LLM Alignment: The process of ensuring that LLMs behave in a way that is consistent with human values, intentions, and preferences. This involves making them helpful, harmless, and honest, and preventing them from generating toxic, biased, or undesirable content.
- Reinforcement Learning from Human Feedback (RLHF): A machine learning paradigm used to align LLMs. It involves three main steps:
  - Collecting Human Preferences: Humans compare multiple responses generated by an LLM for a given prompt and indicate which one they prefer.
  - Training a Reward Model (RM): These human preferences are used to train a separate model, called a reward model, to predict a numerical score (reward) for any given prompt-response pair. A higher score indicates a more preferred response.
  - Fine-tuning the LLM with RL: The trained reward model is then used to fine-tune the original LLM using a reinforcement learning algorithm (like PPO). The LLM learns to generate responses that maximize the reward predicted by the RM.
- Reward Modeling: The specific task within RLHF of training a model to predict a scalar reward for a given prompt-response pair. This reward acts as a proxy for human preference. The goal is to learn a reward function $R(x, y)$ such that higher values correspond to more desirable responses $y$ for a prompt $x$.
- Pairwise Comparisons: A method of data collection where annotators (humans or other AI models) are presented with two options (e.g., two different responses to the same prompt) and asked to choose which one they prefer. This generates comparative data, rather than absolute scores. This is the primary data format for training reward models in RLHF.
- Bradley-Terry (BT) Model: A statistical model for predicting outcomes of paired comparisons. Originally developed for ranking sports teams, it posits that the probability of item $i$ being preferred over item $j$ depends on their latent utilities or abilities.
  - In its simplest form, the probability of $i$ beating $j$ is $P(i \succ j) = \frac{u_i}{u_i + u_j}$, where $u_i$ and $u_j$ are the positive utility scores (abilities) of items $i$ and $j$.
  - Often, these utilities are transformed to log-utilities or scores $r_i = \log(u_i)$, leading to the softmax formulation: $ P(i \succ j) = \frac{\exp(r_i)}{\exp(r_i) + \exp(r_j)} = \mathrm{softmax}(r_i, r_j) $. The sigmoid function is closely related, as $P(i \succ j) = \sigma(r_i - r_j)$. This means the probability of preferring $i$ over $j$ depends only on the difference in their scores (a small numerical sketch follows this list).
- Luce-Shephard Choice Rule: A mathematical model in psychology that describes how individuals choose among multiple alternatives. It states that the probability of choosing an option from a set of alternatives is proportional to its utility. The Bradley-Terry model is a specific instance or specialization of the Luce-Shephard choice rule for pairwise comparisons.
- Logistic Regression: A statistical model used for binary classification. It models the probability of a binary outcome (e.g., prefer $y_1$ or $y_2$) using a sigmoid function applied to a linear combination of input features. The output is a probability between 0 and 1. The BT model, when viewed as predicting the probability of one item being preferred over another based on feature differences, can be seen as a form of logistic regression.
- Neural Networks (MLPs): Multilayer Perceptrons (MLPs) are a class of artificial neural networks. They consist of an input layer, one or more hidden layers, and an output layer. Each layer is composed of nodes (neurons) that are fully connected to the nodes in the previous and subsequent layers. MLPs use activation functions (e.g., ReLU, sigmoid) to introduce non-linearity, allowing them to learn complex patterns. In this paper, MLPs are used as the backbone for the reward models to map embeddings to scores.
- Embeddings: Numerical representations of text (or other data types) in a high-dimensional vector space. Words, sentences, or entire documents can be converted into embeddings where semantically similar items are mapped to closer points in the vector space. Sentence embeddings or prompt-response pair embeddings are crucial for allowing a neural network-based reward model to generalize to unseen inputs.
- Cross-Entropy Loss: A widely used loss function in classification tasks, particularly when the model outputs probabilities (like softmax). For a binary classification problem, it is also known as binary cross-entropy. It measures the difference between the true probability distribution (e.g., 1 for preferred, 0 for not preferred) and the predicted probability distribution. The goal during training is to minimize this loss.
  - For a single instance where the true label is $y$ and the predicted probability is $\hat{y}$, the binary cross-entropy loss is: $ L_{CE} = -(y \log(\hat{y}) + (1-y) \log(1-\hat{y})) $
- Truncated KL Risk: A variant of Kullback-Leibler (KL) divergence (also known as KL risk or relative entropy), which measures how one probability distribution diverges from a second, expected probability distribution. Truncated KL risk is used in theoretical analysis to address potential divergence issues when probabilities are very close to 0 or 1, which make the logarithm term unbounded. It essentially caps the maximum contribution of very small probabilities to the loss, ensuring the risk remains bounded for theoretical analysis.
- Hölder Smooth Function: A function class used in mathematical analysis, particularly in non-parametric statistics and approximation theory. A function is Hölder smooth if its derivatives up to a certain order (characterized by a smoothness index $\beta$) satisfy a specific Lipschitz-like continuity condition. This property implies a certain level of "smoothness" and regularity, which is often assumed for functions that neural networks are trying to approximate in order to establish convergence rates.
- Gumbel Distribution: A type of extreme value distribution. It describes the distribution of the maximum (or minimum) of a number of samples of various distributions. In the context of choice models, if the noise or bias terms associated with options are independently Gumbel distributed with the same scale parameter, then the differences between these noise terms follow a logistic distribution, which then leads to the Bradley-Terry model's softmax probability form.
- Gaussian Distribution (Normal Distribution): A common continuous probability distribution characterized by its bell-shaped curve. It is defined by its mean ($\mu$) and standard deviation ($\sigma$). If the noise or bias terms are Gaussian distributed, then their differences are also Gaussian, leading to the Thurstonian model (which uses the cumulative distribution function (CDF) of the Gaussian, $\Phi$, instead of the sigmoid).
- Best-of-N (BoN) Sampling: An evaluation strategy in RLHF. After an LLM is fine-tuned (or a reward model is trained), for a given prompt, the LLM generates $N$ different responses. The reward model (or human annotators) then scores each of these responses, and the response with the highest score is selected as the "best." This Best-of-N response is then used to evaluate the overall performance of the LLM or reward model, often by comparing its score to a golden reward model.
- Proximal Policy Optimization (PPO): A popular reinforcement learning algorithm often used in the final stage of RLHF to fine-tune the LLM policy. It is a policy-gradient method that tries to update the policy (the LLM's behavior) in small, stable steps to maximize the expected reward, while preventing large, disruptive changes to the policy.
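To make the two choice models above concrete, here is a small numerical sketch (not from the paper) that evaluates the Bradley-Terry and Thurstonian preference probabilities for a pair of hypothetical scores; the score values and the SciPy dependency are illustrative assumptions.

```python
# Illustrative sketch: preference probabilities implied by the BT (logistic) and
# Thurstonian (Gaussian) assumptions for two hypothetical reward scores.
import numpy as np
from scipy.stats import norm

def bt_preference_prob(r_i: float, r_j: float) -> float:
    """Bradley-Terry: sigmoid of the score difference, P(i beats j)."""
    return 1.0 / (1.0 + np.exp(-(r_i - r_j)))

def thurstone_preference_prob(r_i: float, r_j: float) -> float:
    """Thurstonian: Gaussian CDF of the score difference."""
    return norm.cdf(r_i - r_j)

# Hypothetical utility scores for two responses to the same prompt.
r_a, r_b = 1.2, 0.4
print(bt_preference_prob(r_a, r_b))         # ~0.69
print(thurstone_preference_prob(r_a, r_b))  # ~0.79
```

Both functions are monotone in the score difference, which is the property the paper later isolates as order consistency.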
3.2. Previous Works
The paper extensively references prior studies, both in the application of BT models in RLHF and in the theoretical underpinnings of BT regression and non-pairwise data.
- BT Models in RLHF (Seminal Applications):
  - Christiano et al. (2017): This seminal work introduced Deep Reinforcement Learning from Human Preferences, providing the initial motivation for using Bradley-Terry-type models in RLHF. They frame it as a specialization of the Luce-Shephard choice rule, where pairwise comparisons estimate utility scores.
  - Stiennon et al. (2020) and Ouyang et al. (2022): These works further established the BT model as a successful practice in LLM alignment, demonstrating its effectiveness in improving LLM quality when direct evaluation is difficult. The paper highlights that these works often applied the model directly, without deep theoretical justification for its use in sparse comparison scenarios.
- Challenges to BT Models in RLHF:
  - Azar et al. (2023), Munos et al. (2023), Tang et al. (2024): These works have discussed challenges, such as the BT model's inability to capture non-transitive preferences (Munos et al., 2023) or potential overfitting issues in direct preference optimization (DPO) when preferences are deterministic (Azar et al., 2023). This paper distinguishes itself by focusing on the underlying assumptions for BT's correctness rather than solely on failure modes.
  - Zhao et al. (2023): Also discusses challenges in direct preference optimization.
- Bradley-Terry Model for Parameter Estimation and Prediction (Traditional Statistical View):
  - Bradley and Terry (1952): The original paper, laying the foundation for using pairwise comparisons to estimate player abilities.
  - Ford Jr (1957): Established identifiability and asymptotic theory for the classical BT model.
  - Simons and Yao (1999): Studied BT asymptotics when the number of players diverges and each plays many competitions.
  - Han et al. (2020): Investigated the sparse comparison setting (more relevant to LLM alignment), proposing methods for consistent estimation of scores with fewer comparisons. This work highlights the inherent challenge for LLM reward modeling with its extremely sparse comparisons.
  - Wu et al. (2022): Compared asymptotic theory under different identifiability assumptions.
  - Springall (1973): Explored extensions of the BT model to regression settings, assuming abilities (rewards) could be predicted from a linear combination of covariates (features).
  - De Soete and Winsberg (1993): Used spline functions for more complex non-linear models in BT regression.
  - Bockenholt (1988): Framed the Bradley-Terry model as a special logistic regression problem, opening theoretical investigation from this perspective.
  - Chen and Pouzo (2012): Developed theory for general nonparametric conditional moment models, including logistic regression.
  - Schmidt-Hieber (2020): Studied asymptotic theory for deep neural networks in nonparametric regression.
  - Bos and Schmidt-Hieber (2022): Extended this to nonparametric logistic regression using deep neural networks. This paper builds on their framework but notes that their theories cannot be directly applied because of the Siamese structure (a shared network for two inputs) and the specific softmax output of BT models (predicting preference from a difference of scores produced by a single network).
- Non-Pairwise Data and Cross-Prompt Comparisons in RLHF:
  - Liu et al. (2022), Sun and van der Schaar (2024), Ethayarajh et al. (2024) (KTO): Explored alignment from non-pairwise data. KTO (Ethayarajh et al., 2024) is noted for its roots in prospect theory.
  - Yin et al. (2024) (RPO): Proposed Relative Preference Optimization (RPO), involving response comparisons among both identical and similar questions (cross-prompt). The current paper aligns with RPO's insight but differs in motivation (why restrict comparisons to the same prompt at all?) and implementation (a focus on annotation efficiency and a two-stage reward model approach, rather than direct alignment).
  - The paper contrasts its approach with direct alignment methods (e.g., Zheng et al., 2024; Liu et al., 2024; Zhao et al., 2023; Azar et al., 2023; Ethayarajh et al., 2024), arguing that explicit reward models (as in two-stage methods) offer advantages for inference-time optimization.
- Representation Learning in Reward Modeling:
  - Yang et al. (2024a): Highlighted how generation tasks can regularize learned embeddings and improve reward modeling.
  - Zhang et al. (2024): Used generative verifiers to construct reward predictions via next-token prediction, leveraging LLM generative ability. The current paper notes its research is orthogonal but complementary to this direction.
3.3. Technological Evolution
The field of LLM alignment has rapidly evolved, primarily driven by the success of RLHF.
- Early LLMs: Initial LLMs focused on pre-training on massive text corpora, leading to impressive language generation capabilities but often lacking alignment with human preferences or safety guidelines.
- Emergence of RLHF: Pioneering works like Christiano et al. (2017) and Stiennon et al. (2020) introduced RLHF as a powerful paradigm. This involved collecting human preference data and training a reward model (often using the Bradley-Terry formulation) to provide a scalar feedback signal for reinforcement learning algorithms (like PPO) to fine-tune the LLM.
- BT Model Dominance: The empirical success of BT models in systems like ChatGPT (Ouyang et al., 2022) solidified its status as a de facto standard for reward modeling. However, its theoretical foundations in this specific, sparse-comparison context remained less explored.
- Alternative Alignment Methods: Alongside RLHF, direct preference optimization (DPO) (Rafailov et al., 2023) emerged, directly optimizing the LLM policy using preference data without an explicit reward model. Other methods explored non-pairwise data or specific ways of structuring preferences (e.g., KTO, RPO).
- This Paper's Contribution: The current paper fits into this timeline by critically examining the foundational BT model that underpins much of RLHF. It provides missing theoretical justifications for its use in sparse data settings (connecting it to nonparametric logistic regression with deep neural networks). More importantly, it challenges the necessity of the BT model by proposing a simpler, order-consistent classification objective, suggesting that the core requirement is preserving rank order rather than precise probability calibration. It also pushes the boundaries of annotation design by advocating for cross-prompt comparisons, moving beyond traditional same-prompt pairing. This work aims to refine and optimize the data collection and modeling aspects of the reward modeling stage, which is a crucial intermediate step in many RLHF pipelines.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's approach offers several key differentiators and innovations:
- Theoretical Justification for BT in the Sparse LLM Context: Previous works (e.g., Christiano et al., Stiennon et al., Ouyang et al.) successfully applied the BT model empirically, but the theoretical soundness for sparse, high-dimensional data (prompt-response pairs represented by embeddings) was underexplored. This paper provides the first asymptotic theory for neural network-based BT regression in this specific context, establishing convergence rates. This moves beyond simply acknowledging empirical success to providing formal guarantees.
- Emphasis on Order Consistency as a Core Objective: Unlike BT models, which aim to accurately model the probability of preference, this paper argues that for downstream LLM optimization, only order consistency (preserving the correct ranking) is strictly necessary. This is a crucial philosophical shift, implying that exact probability calibration (a harder problem) might be an overspecification. This insight leads to simpler, more flexible alternatives.
- Novel Classification-Based Reward Modeling Objective: Building on the order consistency principle, the paper proposes a straightforward classification-based algorithm as an alternative to the BT model. This method relaxes the anti-symmetry constraint inherent in BT by training a standard binary classifier on individual prompt-response pairs. The output logits from this classifier then serve as reward scores. This is simpler to implement, allows for off-the-shelf classifiers (like LightGBM), and empirically demonstrates superior performance and robustness in many settings.
- Theoretical and Empirical Advocacy for Cross-Prompt Comparisons: Traditional BT model applications (like sports ranking) often assume randomized pairwise comparisons. In RLHF, however, same-prompt comparisons are prevalent. This paper theoretically proves that cross-prompt comparisons (comparing responses from different prompts) can lead to higher utility diversity and improved annotation quality, and are thus more beneficial. This is a direct challenge to a common practice and provides practical guidance for data collection strategies, differentiating it from RPO (Yin et al., 2024), which also uses cross-prompt comparisons but with different motivations and implementations.
- Extensive and Controlled Empirical Validation: The paper conducts one of the most comprehensive empirical evaluations in this domain, with over 12,000 runs across varied LLMs, datasets, annotation qualities, quantities, and pairing choices. This systematic ablation study provides strong evidence for its claims and offers practical insights into the robustness and efficiency of different reward modeling approaches.

In essence, the paper provides both a deeper theoretical understanding of current practices and a practical, empirically validated alternative that is simpler, more robust, and guides more efficient data collection.
4. Methodology
4.1. Principles
The core idea of the methodology is multi-faceted:
- Revisit and Validate BT Foundations: The paper first seeks to understand why the Bradley-Terry (BT) model works in LLM reward modeling, especially given the sparse comparison data. This involves establishing theoretical guarantees for neural network-based BT regression, connecting it to nonparametric logistic regression.
- Deconstruct Necessity from Sufficiency: It then argues that the BT model, while theoretically sound, might not be necessary, because downstream LLM optimization primarily requires a reward function that preserves the correct ranking of responses (i.e., order consistency), not necessarily perfectly calibrated preference probabilities. A monotonic transformation of the true reward is sufficient.
- Propose Order-Consistent Alternatives: Based on the order consistency principle, the paper develops a simpler classification-based reward modeling objective that can leverage off-the-shelf binary classifiers, thereby offering more flexibility and potentially better empirical performance.
- Optimize the Annotation Strategy: The methodology also investigates the preference annotation process, particularly challenging the conventional wisdom of same-prompt comparisons. It theoretically and empirically explores the benefits of cross-prompt comparisons to improve annotation quality and data utility.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Rethinking the Usage of BT Models in LLM Alignment (Section 2.2)
The paper begins by dissecting the implicit assumptions made when applying the BT model to human preference annotations in LLM alignment.
Assumption 1 (Existence of Deterministic Oracle Utility Values) The (log-)utility value $r_{x,y}$ of any response $y$ given prompt $x$ exists and is deterministic.
- Explanation: This foundational assumption posits that for any given prompt-response pair, there's an inherent, fixed, and objective quality score (utility) that the reward model tries to learn. This score doesn't change based on who is annotating or how it's compared.
Assumption 2 (Deterministic Comparisons) For annotator $A$, the annotation result is deterministic and depends on the comparison of their biased evaluation of the utility values of both responses $y_1$ and $y_2$, such that: $ \mathbb{1}(y_1 \succ y_2 | x, A) = \mathbb{1}(r_{x,y_1} + b(x, y_1, A) > r_{x,y_2} + b(x, y_2, A)) $
- Explanation: This assumes that an annotator doesn't randomly pick a preference. Instead, they form a biased evaluation ($r_{x,y} + b(x, y, A)$) for each response. The preference decision is then deterministic: if the biased evaluation of $y_1$ is higher than that of $y_2$, then $y_1$ is preferred. The randomness in annotations across different annotators (or even for the same annotator across different occasions) comes from the bias term $b(x, y, A)$, which is specific to the annotator and the prompt-response pair.
Assumption 3 (Logistic Difference Assumption)
The difference $b(x, y_1, A) - b(x, y_2, A)$ is sampled i.i.d. from a standard logistic distribution for all $x, y_1, y_2$.
$
P(b(x, y_1, A) - b(x, y_2, A) \leq t | A) = \frac{1}{1+e^{-t}}
$
- Explanation: This assumption specifies the distribution of the difference in annotator biases. If this difference follows a logistic distribution, it mathematically leads directly to the Bradley-Terry model's softmax form for preference probabilities. The i.i.d. (independent and identically distributed) nature means this bias difference is consistent across all comparisons.
- Remark 1 (Transitive property of difference): The paper notes that this assumption is consistent with transitivity if all annotator biases are independently Gumbel distributed with the same scale parameter. The difference of two i.i.d. Gumbel variables with the same scale parameter follows a logistic distribution.
Proposition 2 (Modeling Annotations under Logistic Difference Assumption) Under the above assumptions, the probability of preferring $y_1$ over $y_2$ given prompt $x$ is: $ P(y_1 \succ y_2 | x) = P \left( r_{x,y_1} - r_{x,y_2} > b(x, y_1, A) - b(x, y_2, A) \right) = \frac{1}{1 + e^{-(r_{x,y_1} - r_{x,y_2})}} $
- Explanation: This is the core Bradley-Terry probability model. It shows that the probability of $y_1$ being preferred over $y_2$ is a sigmoid function of the difference between their true underlying utility values ($r_{x,y_1} - r_{x,y_2}$). This equation is fundamental to how BT models translate utility differences into preference probabilities.

The paper also considers an alternative assumption for robustness:
Assumption 4 (Gaussian Difference Assumption)
The difference $b(x, y_1, A) - b(x, y_2, A)$ is sampled from a standard Gaussian distribution:
$
b(x, y_1, A) - b(x, y_2, A) \sim \mathcal{N}(0, 1)
$
- Explanation: Similar to Assumption 3, but here the bias differences follow a
Gaussian distribution.
Proposition 3 (Modeling Annotations under Gaussian Difference Assumption)
Under Assumption 1, Assumption 2, and Assumption 4, the probability of preferring $y_1$ over $y_2$ given prompt $x$ is:
$
P(y_1 \succ y_2 | x) = P \left( r_{x,y_1} - r_{x,y_2} > b(x, y_1, A) - b(x, y_2, A) \right) = \Phi(r_{x,y_1} - r_{x,y_2})
$
where $\Phi$ is the CDF of the standard Gaussian distribution.
- Explanation: This is the Thurstonian model. It is structurally similar to the BT model but uses the Gaussian CDF instead of the sigmoid function because of the Gaussian noise assumption. The paper notes that BT is often preferred for empirical fit and mathematical convenience.
4.2.2. BT Regression: How BT Model Works with Sparse Comparisons (Section 2.3)
The key challenge in LLM reward modeling is the sparsity of comparisons: each prompt-response pair is typically compared only a handful of times, far below the theoretical requirements for consistent estimation in the classical BT setting. To address this, the paper notes that BT regression variants are used. Instead of estimating a unique score for each prompt-response pair, the score r(x,y) is modeled as a function of features or covariates of the pair. For LLMs, this means using embeddings of the prompt-response pair.
- Regression Approach: The reward r(x,y) is expressed as $f(\psi(x, y))$, where $\psi(x, y)$ is an embedding of the prompt-response pair $(x, y)$ and $f$ is an unknown function. This allows generalization to unseen pairs.
- Neural Networks: In practice, MLPs are commonly used to implement $f$, mapping embeddings to reward scores (a minimal sketch follows). The paper's theoretical analysis focuses on this MLP-based BT regression.
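As a concrete illustration of this regression view, here is a minimal sketch (my own, not the authors' code) of an MLP reward head over precomputed prompt-response embeddings; the embedding dimension, hidden sizes, and PyTorch usage are assumptions.

```python
# Minimal sketch: an MLP head f that maps a fixed prompt-response embedding psi(x, y)
# to a scalar reward, r(x, y) = f(psi(x, y)).
import torch
import torch.nn as nn

class MLPRewardHead(nn.Module):
    def __init__(self, embed_dim: int = 2048, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar reward output
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        # embedding: (batch, embed_dim) -> reward: (batch,)
        return self.net(embedding).squeeze(-1)

# Usage with a placeholder embedding of a prompt-response pair.
reward_model = MLPRewardHead(embed_dim=2048)
psi_xy = torch.randn(4, 2048)      # stand-in for real sentence embeddings
rewards = reward_model(psi_xy)     # shape (4,)
```

Because the score is a function of the embedding rather than a free parameter per pair, the model can generalize to prompt-response pairs never seen in any comparison.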
4.2.3. Asymptotic Theory on MLP-based BT Regression in Reward Modeling (Section 2.4 and Appendix A)
This section provides the theoretical justification for using MLPs with BT loss for learning reward functions.
Data Format: The dataset for preference-based LLM alignment consists of tuples of a prompt, two responses, and a preference annotation:
- $x$: the prompt.
- $y_1, y_2$: responses sampled from the LLM.
- $h$: the human-annotated preference, 1 if $y_1$ is preferred, -1 otherwise.

Reward Model Formulation:
- Embedding Function: Assume a known embedding function $\psi$, which maps a prompt-response pair $(x, y)$ to a $d$-dimensional vector.
- Reward Function: An unknown reward function exists such that $r(x, y) = f(\psi(x, y))$ for some function $f$. The goal is to learn $f$.
- Predicted Probability: The reward model is denoted $\hat{r}$. The predicted probability that $y_1$ is better than $y_2$ is $\sigma(\hat{r}(x, y_1) - \hat{r}(x, y_2))$. This is equivalent to predicting which response the annotator prefers, which is a classification problem.

Loss Function: Cross-entropy loss is used for training (a small numerical example follows this block). Let $\Psi_1^{(i)}$ and $\Psi_2^{(i)}$ be the embeddings for the $i$-th compared pair. Let $\pmb{p}_0$ be the true preference probability and $\pmb{p}$ be the model's predicted probability vector. The preference label $\pmb{h}^{(i)}$ is the one-hot vector $(1, 0)^{\top}$ if the first response is preferred, and $(0, 1)^{\top}$ otherwise. The cross-entropy loss is:
$
\widetilde{\mathcal{L}}_{\mathrm{CE}}(\pmb{p}) = -\frac{1}{n} \sum_{i=1}^{n} (\pmb{h}^{(i)})^{\top} \log(\pmb{p}^{(i)})
$
The estimator is the one that minimizes this loss over the function class $\mathcal{F}_{\theta}$:
$
\hat{\pmb{p}} = \arg\min_{\pmb{p} \in \mathcal{F}_{\theta}} \widetilde{\mathcal{L}}_{\mathrm{CE}}(\pmb{p})
$
The difference between the fitted NN and the global minimum is denoted as:
$
\Delta_n(\pmb{p}_0, \hat{\pmb{p}}) = \mathbb{E}\left[ \widetilde{\mathcal{L}}_{\mathrm{CE}}(\hat{\pmb{p}}) - \min_{\pmb{p} \in \mathcal{F}_{\theta}} \widetilde{\mathcal{L}}_{\mathrm{CE}}(\pmb{p}) \right]
$
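The following is a literal NumPy rendering of the cross-entropy objective above, with one-hot annotation vectors and predicted probability vectors; the toy values are illustrative only.

```python
# Sketch: h_onehot are one-hot preference labels, p_pred are predicted probability
# vectors for (first wins, second wins); returns (1/n) * sum_i of -h_i . log(p_i).
import numpy as np

def cross_entropy_loss(h_onehot: np.ndarray, p_pred: np.ndarray) -> float:
    eps = 1e-12                                # numerical guard against log(0)
    return float(-np.mean(np.sum(h_onehot * np.log(p_pred + eps), axis=1)))

h = np.array([[1, 0], [0, 1], [1, 0]])         # first, second, first preferred
p = np.array([[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]])
print(cross_entropy_loss(h, p))                # ~ (1/3)(-log 0.7 - log 0.6 - log 0.9)
```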
Truncated KL Risk (Definition 12): To overcome divergence problems with KL risk when probabilities are near 0 or 1, the paper uses a $B$-truncated KL risk: $ R_B(\pmb{p}_0, \hat{\pmb{p}}) = \mathbb{E}\left[ \pmb{p}_0^{\top} \min\left( B, \log \frac{\pmb{p}_0}{\hat{\pmb{p}}} \right) \right] $
- Explanation: This metric measures the expected discrepancy between the true preference probabilities ($\pmb{p}_0$) and the model's predicted probabilities ($\hat{\pmb{p}}$), but with a cap $B$ on the logarithmic term. This prevents extremely small predicted probabilities from causing infinite risk, making the theoretical analysis tractable.
Assumptions on the Model and True Function:
- Assumption 6 (MLP reward model): The neural network is an MLP with ReLU activations (i.e., $\sigma(x) = \max(0, x)$ is used as the activation function), with a given depth and width vector. The network parameters have norm restrictions (bounded norm for the weights and biases) and sparsity (a bounded number of non-zero parameters). The resulting function class of reward networks is denoted $\mathcal{F}_{\theta}$; its softmax output version maps a pair of embeddings $(\Psi_1, \Psi_2)$ to a predicted preference probability vector $(p_0, p_1)$.
- Definition 13 (Small value bound by Bos and Schmidt-Hieber (2022)): A function class is $\alpha$-small value bounded ($\alpha$-SVB) if the probability of any component of the probability vector being less than or equal to $t$ is bounded by $C t^{\alpha}$:
  - $ \mathbb{P}(p_k(\Psi_1, \Psi_2) \leq t) \leq C t^{\alpha}, \quad \text{for all } t \in (0, 1] \text{ and all } k = 0, 1 $
  - Explanation: This ensures that the true preference probabilities are not too close to 0 or 1, which helps avoid issues with the logarithm in KL divergence. It essentially means that extremely decisive preferences are not overwhelmingly common in the data distribution.
- Definition 14 (Hölder smooth function): A function class that captures smoothness. The ball of $\beta$-Hölder functions with a given radius has bounded sums of norms of derivatives up to order $\beta$.
- Assumption 7 (Class of true utility function): The true reward functions are $\beta$-Hölder smooth, and the probability induced by softmax is $\alpha$-SVB.
Theorem 20 (Truncated KL risk bound, formal version from Appendix A) Suppose the probability of preference induced by the true utility function is $\beta$-Hölder smooth and $\alpha$-SVB. Let $\phi_n$ denote the resulting nonparametric rate, which depends on $n$, the embedding dimension $d$, the smoothness $\beta$, and the small-value bound $\alpha$. Let $\hat{\pmb{p}}$ be an estimator from the family of MLPs with ReLU activations, norm restriction, and sparsity satisfying:
- the depth $L$ scales appropriately with $n$;
- the minimum width scales appropriately with $n$;
- the sparsity scales appropriately with $n$.
For sufficiently large $n$, there exist constants $C, C'$ such that when the optimization error satisfies $\Delta_n(\pmb{p}_0, \hat{\pmb{p}}) \leq C B \phi_n L \log^2(n)$, then:
$
R_B(\pmb{p}_0, \hat{\pmb{p}}) \leq C' B \phi_n L \log^2(n)
$
- Explanation: This is the main theoretical result. It states that, under reasonable smoothness and regularity assumptions on the true reward function and the MLP architecture, the truncated KL risk of the BT reward model converges to zero as the number of data points $n$ increases. The convergence rate is determined by $\phi_n$, which depends on the embedding dimension $d$, the smoothness $\beta$, and the small-value bound $\alpha$. It implies that MLP-based BT regression can indeed learn the true preference probabilities. The term $\Delta_n(\pmb{p}_0, \hat{\pmb{p}})$ represents the optimization error (how far the learned NN is from the global minimum for the given architecture); if this error is also small enough, the overall risk is bounded.
Corollary 7 (Connecting probability to reward)
Given Lemma 21 (which relates Hellinger distance to truncated KL risk) and Remark 22 (which links Hellinger distance to absolute probability differences):
$
\left| p_0(\Psi_1, \Psi_2) - \hat{p}(\Psi_1, \Psi_2) \right| \lesssim \left| \sqrt{p_0} + \sqrt{\hat{p}} \right| \sqrt{\phi_n L} \log(n) \to 0
$
Then, by the mean value theorem relating probability differences to logit differences (which are reward differences for BT):
$
\left| r(\Psi_1) - r(\Psi_2) - (\hat{r}(\Psi_1) - \hat{r}(\Psi_2)) \right| \lesssim \frac{\left| \sqrt{p_0} + \sqrt{\hat{p}} \right|}{\tilde{p}(1 - \tilde{p})} \sqrt{\phi_n L} \log(n) \to 0
$
where $\tilde{p}$ is a probability between $p_0$ and $\hat{p}$.
- Explanation: This corollary directly connects the convergence of predicted probabilities to the convergence of estimated reward differences. It shows that if the model accurately predicts preference probabilities, it also accurately estimates the difference between the true reward values. This estimation is "up to an additive constant" because the reward difference $r(\Psi_1) - r(\Psi_2)$ is what's learned, not the absolute values of $r(\Psi_1)$ or $r(\Psi_2)$. The condition that $\tilde{p}(1 - \tilde{p})$ should not be too small suggests that comparisons should not be made between pairs with extremely different rewards, as the logit function (inverse sigmoid) diverges at probabilities close to 0 or 1.
4.2.4. Rethinking Reward Modeling Objectives in LLM Alignment (Section 3)
The paper argues that BT models are not necessary because LLM optimization only requires a reward function that preserves the correct ranking (order), not perfectly calibrated preference probabilities.
Assumption 5 (Imperfect Preference Annotation in Approximating True Scores)
Let $\Delta r = r(x_1, y_1) - r(x_2, y_2)$ be the true utility difference. An annotator function $h(x_1, x_2, y_1, y_2)$ provides feedback that probabilistically aligns with the oracle utility r(x,y). The probability of the annotator being correct is $\xi(\Delta r)$:
$
\mathbb{P}\bigg( h(x_1, x_2, y_1, y_2)\,(r(x_1, y_1) - r(x_2, y_2)) > 0 \,\bigg|\, \Delta r \bigg) = \xi(\Delta r)
$
where $\xi$ is a monotonically increasing function mapping the utility difference to [0.5, 1].
- Explanation: This assumption models the reality of human annotation noise. It states that annotators are more likely to make correct judgments when the difference in true utility ($\Delta r$) between two responses is larger, and less likely when scores are very similar (approaching a 50% chance as $\Delta r \to 0$). Both the BT and Thurstonian models satisfy this, as their probability functions (sigmoid and Gaussian CDF) are monotonic.
Definition 8 (Order Consistency)
The order-consistency objective for an ordering model $\hat{H}$ is defined as the probability that the reward model's ordering agrees with the annotation:
$
\mathcal{L}_{\mathrm{oc}}(\hat{r}) = \mathbb{E}_{x_1, x_2, y_1, y_2, h}\, \mathbb{I}\left[ h = \hat{H} \right]
$
- Explanation: This is a simple classification accuracy metric. It measures how often the model's predicted preference direction ($\hat{H}$) matches the observed human annotation ($h$). The goal is to maximize this.
Proposition 9 (Lower bound on population level order consistency, formal version from Appendix B.1)
Suppose a learned model $\hat{r}$ achieves the objective of Definition 8 (equation 17 in the paper) up to error $\epsilon$, for small $\epsilon$ and $\delta$, i.e., $\mathcal{L}_{\mathrm{oc}}(\hat{r}) \geq 1 - \epsilon$ with probability at least $1 - \delta$.
Then, with probability at least $1 - \delta$ over the training data, for any given $\Delta r$, the order consistency of $\hat{H}$ with respect to the oracle utility is bounded below by:
$
\mathbb{E}_{x_1, x_2, y_1, y_2} \left[ \mathbb{1}\left( \hat{H} \cdot \left[ r(x_1, y_1) - r(x_2, y_2) \right] \geq 0 \right) \,\Big|\, \Delta r \right] \geq (1 - \epsilon) \cdot \xi^2(\Delta r) + \epsilon \cdot (1 - \xi(\Delta r))^2
$
Further, if with probability at least $1 - \kappa$ over sampled prompt-response pairs the annotation quality $\xi(\Delta r)$ is sufficiently close to 1, then:
$
\mathbb{E}_{x_1, x_2, y_1, y_2 \sim \ell(x)} \left[ \mathbb{1}\left( \hat{H} \cdot \left[ r(x_1, y_1) - r(x_2, y_2) \right] > 0 \right) \right] \geq 1 - 4\epsilon - \kappa - \delta
$
- Explanation: This proposition is crucial. It shows that by optimizing the observable order consistency with respect to noisy human annotations (achieving high accuracy $1 - \epsilon$), we can also ensure a high order consistency with the true underlying oracle utility function. The lower bound depends on the annotation quality function $\xi$ and the error rate $\epsilon$. This theoretical link justifies pursuing simple classification accuracy as a valid objective for reward modeling (a direct computation of this metric is sketched below).
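For concreteness, the empirical order-consistency objective can be computed directly as the agreement rate between the model's ordering and the annotations. This NumPy sketch (mine, not the paper's code) assumes annotations encoded as ±1.

```python
# Sketch: empirical order consistency = fraction of annotated pairs whose ordering
# the reward model reproduces.
import numpy as np

def order_consistency(r_hat_1: np.ndarray, r_hat_2: np.ndarray, h: np.ndarray) -> float:
    """
    r_hat_1, r_hat_2: model rewards for the two compared prompt-response pairs.
    h: annotations in {+1, -1}, +1 meaning the first pair was preferred.
    """
    h_hat = np.where(r_hat_1 >= r_hat_2, 1, -1)   # model's predicted ordering
    return float((h_hat == h).mean())
```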
The BT Model as a Choice (Section 3.2)
The BT model explicitly enforces order consistency by modeling preference probability as a sigmoid of the reward difference:
$
P(h=1) = \sigma\left( \hat{r}_{\mathrm{BT}}(x_1, y_1) - \hat{r}_{\mathrm{BT}}(x_2, y_2) \right)
$
It is trained with binary cross-entropy loss:
$
\mathcal{L}_{\mathrm{BT}} = \mathbb{E}\left[ \mathbb{1}_{h=1} \log \sigma\big( \hat{r}_{\mathrm{BT}}(x_1, y_1) - \hat{r}_{\mathrm{BT}}(x_2, y_2) \big) + \mathbb{1}_{h=-1} \log\big( 1 - \sigma( \hat{r}_{\mathrm{BT}}(x_1, y_1) - \hat{r}_{\mathrm{BT}}(x_2, y_2) ) \big) \right]
$
- Explanation: The BT model directly optimizes for predicting the probability of preference based on reward differences. Its Siamese architecture (where the same network processes both inputs and their outputs are differenced) ensures anti-symmetry (preferring A over B implies not preferring B over A). A minimal training sketch follows.
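Below is a minimal PyTorch training sketch of this BT objective, assuming an MLP reward head over precomputed embeddings and batches already split into preferred and rejected sides; the architecture and optimizer settings are illustrative, not the paper's configuration.

```python
# Sketch: one training step of the Siamese BT loss over embedding pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

def bt_training_step(emb_win: torch.Tensor, emb_lose: torch.Tensor) -> float:
    # The same network scores both sides (Siamese structure); the loss depends only
    # on the score difference, which enforces anti-symmetry of the prediction.
    r_win = reward_model(emb_win).squeeze(-1)
    r_lose = reward_model(emb_lose).squeeze(-1)
    loss = -F.logsigmoid(r_win - r_lose).mean()   # binary cross-entropy for label "win"
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One step on a toy batch of embedding pairs (preferred vs. rejected).
loss_value = bt_training_step(torch.randn(8, 2048), torch.randn(8, 2048))
```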
Relaxing the Anti-Symmetry Constraint: a Classification-based Method (Section 3.3)
The paper proposes a simpler classification-based method by relaxing the explicit anti-symmetry constraint of BT. Instead of modeling the pairwise preference probability directly, it focuses on learning the marginal probability that a given prompt-response pair wins its comparison.
Let $\hat{H}_{\mathrm{clf}}(x, y)$ be a classifier output. The order consistency can be thought of as achieving both $h = \hat{H}_{\mathrm{clf}}(x_1, y_1)$ and $-h = \hat{H}_{\mathrm{clf}}(x_2, y_2)$. A union bound of this order consistency leads to an objective function for a simple binary classifier:
$
\mathcal{L}_{\mathrm{oc}} \leq \mathcal{L}_{\mathrm{clf}} := \mathbb{E}\big( h = \hat{H}_{\mathrm{clf}}(x_1, y_1) \big) + \mathbb{E}\big( -h = \hat{H}_{\mathrm{clf}}(x_2, y_2) \big)
$
- Explanation: This means we can train a standard binary classifier by treating each prompt-response pair in the preference data independently. If $(x_1, y_1)$ is preferred over $(x_2, y_2)$, then $(x_1, y_1)$ is a "positive" example and $(x_2, y_2)$ is a "negative" example. The logit output of this classifier can then be used as the reward score (a sketch follows). This approach is more flexible, as it allows using off-the-shelf binary classifiers (e.g., LightGBM) without the strict Siamese architecture or anti-symmetry.
- Proposition 24 (Classification reward, from Appendix B.2): This proposition formally connects the logit of the marginal probability of winning (as learned by a classification model) to the BT reward. If the data come from a BT model, the classification score satisfies $ s_i := \mathrm{logit}\, \mathbb{P}(i \text{ wins}) \geq r_i - \log \mathbb{E}[\exp(r_j)] $, where $\log \mathbb{E}[\exp(r_j)]$ is a constant. This means the classification reward provides a lower bound for the true reward (up to an additive constant), making it a valid proxy.
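A hedged sketch of this classification-based recipe: each annotated pair yields one positive and one negative example, an off-the-shelf binary classifier is fit on individual embeddings, and its logit is used as the reward. The LightGBM settings here are illustrative defaults, not the paper's exact configuration.

```python
# Sketch: classification-based reward (CLF) from preference data.
import numpy as np
from lightgbm import LGBMClassifier

def fit_clf_reward(emb_win: np.ndarray, emb_lose: np.ndarray) -> LGBMClassifier:
    # Winners become label-1 examples, losers label-0 examples.
    X = np.concatenate([emb_win, emb_lose], axis=0)
    y = np.concatenate([np.ones(len(emb_win)), np.zeros(len(emb_lose))])
    clf = LGBMClassifier(objective="binary")
    clf.fit(X, y)
    return clf

def clf_reward(clf: LGBMClassifier, emb: np.ndarray) -> np.ndarray:
    p_win = clf.predict_proba(emb)[:, 1]          # marginal probability of "winning"
    p_win = np.clip(p_win, 1e-6, 1 - 1e-6)
    return np.log(p_win / (1 - p_win))            # logit used as the reward score
```

Because the classifier only sees individual prompt-response pairs, any binary classifier can be plugged in; no Siamese network or explicit anti-symmetry constraint is needed.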
4.2.5. Rethinking the Preference Annotation Process for Reward Modeling (Section 4 and Appendix C)
The paper investigates whether cross-prompt comparisons are beneficial.
Annotation Quality Function: The probability of an annotator being correct is $\xi(\Delta r) = \sigma(\beta \Delta r)$, where $\beta$ controls annotation quality.
- $\beta = 0$: Random annotations ($\xi = 0.5$).
- $\beta \to \infty$: Perfect annotations ($\xi \to 1$).
Example 1 (Annotation Quality under Gaussian Score, from Appendix C.1)
When responses for a prompt $x$ are sampled from an LLM $\ell(x)$, and their utilities $r(x, y)$ are Gaussian distributed with prompt-dependent variance $\sigma_x^2$, then the annotation quality for that prompt is:
$
\mathcal{Q}_{\mathrm{pair}}(x) = \mathbb{E}_{y_1, y_2 | x}[\tau_x] = \mathbb{E}_{y_1, y_2 | x}\left[ \sigma\big( \beta |r(x, y_1) - r(x, y_2)| \big) \right]
$
where $\tau_x = \sigma(\beta |r(x, y_1) - r(x, y_2)|)$ is a random variable whose probability density function is:
$
f_{\tau_x | x}(t) = \frac{1}{\sqrt{\pi \beta^2 \sigma_x^2}} \exp\left( -\frac{\left( \log\left( \frac{t}{1-t} \right) \right)^2}{4 \beta^2 \sigma_x^2} \right) \cdot \frac{1}{t(1-t)}
$
- Explanation: This derivation quantifies the expected annotation quality for pairs from the same prompt. It shows that quality depends on both the annotator's ability ($\beta$) and the diversity of response utilities ($\sigma_x$). Higher $\beta$ or higher $\sigma_x$ leads to better annotation quality.
Proposition 10 (Cross-Prompt Comparisons Increase Utility Diversity, formal version from Appendix C.2)
When responses $y_1$ and $y_2$ have Gaussian distributed utilities, the expected absolute reward difference for cross-prompt comparisons is greater than or equal to that for same-prompt comparisons:
$
\mathbb{E}_{x} \mathbb{E}_{y_1, y_2 | x}\left[ | r_{x, y_1} - r_{x, y_2} | \right] \leq \mathbb{E}_{x_1, x_2} \mathbb{E}_{y_1 | x_1, y_2 | x_2}\left[ | r_{x_1, y_1} - r_{x_2, y_2} | \right]
$
- Explanation: This proposition provides a theoretical basis for cross-prompt comparisons. It states that, on average, the reward differences between responses from different prompts are larger than those between responses from the same prompt. Larger reward differences generally lead to easier and more reliable annotations (due to Assumption 5).
Theorem 11 (Cross-Prompt Annotation Improves Annotation Quality, formal version from Appendix C.3)
When responses have utilities sampled from a location-scale family with a unimodal and symmetric density, and $\xi$ is a first-order differentiable, monotone increasing, and concave function (representing annotation quality), then:
$
\mathbb{E}_{x}[\mathcal{Q}_{\mathrm{pair}}(x)] = \mathbb{E}_{x} \mathbb{E}_{y_1, y_2 \mid x}\left[ \xi(| r_{x, y_1} - r_{x, y_2} |) \right] \le \mathbb{E}_{x_1, x_2} \mathbb{E}_{y_1 \mid x_1, y_2 \mid x_2}\left[ \xi(| r_{x_1, y_1} - r_{x_2, y_2} |) \right] := \mathbb{E}_{x_1, x_2}\big[ \mathcal{Q}_{\mathrm{cross\text{-}prompt}}(x_1, x_2) \big]
$
- Explanation: This theorem generalizes Proposition 10 to annotation quality itself, not just reward differences. It rigorously proves that cross-prompt annotations yield higher expected annotation quality than same-prompt annotations under broad and realistic conditions on utility distributions and annotator behavior. This is a strong theoretical argument for adopting cross-prompt comparison strategies in RLHF. A small simulation illustrating this effect follows.
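The following is a small Monte Carlo illustration (my own sketch, under the stated Gaussian assumption) of why cross-prompt pairs have larger expected utility gaps than same-prompt pairs; the prompt-mean spread and noise scale are arbitrary choices.

```python
# Sketch: compare the expected |reward gap| for same-prompt vs. cross-prompt pairs
# when utilities are Gaussian around prompt-level means mu_x.
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_samples, sigma = 1000, 10000, 1.0
mu = rng.normal(0.0, 2.0, size=n_prompts)            # prompt-level mean utilities

# Same-prompt: both responses share the same prompt mean mu_x.
x = rng.integers(n_prompts, size=n_samples)
same = np.abs(rng.normal(mu[x], sigma) - rng.normal(mu[x], sigma))

# Cross-prompt: the two responses come from independently drawn prompts.
x1 = rng.integers(n_prompts, size=n_samples)
x2 = rng.integers(n_prompts, size=n_samples)
cross = np.abs(rng.normal(mu[x1], sigma) - rng.normal(mu[x2], sigma))

print(same.mean(), cross.mean())   # expected: cross-prompt gap >= same-prompt gap
```

Combined with a concave annotation quality function (Theorem 11), the larger gap translates into higher expected annotation quality.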
5. Experimental Setup
The experiments are designed to validate the theoretical insights and methodological contributions, focusing on reproducibility, controllability, and computational efficiency.
5.1. Datasets
The primary datasets used are:
- Anthropic-Harmless (Bai et al., 2022a): Contains 41,876 training prompts and 2,273 test prompts. This dataset focuses on aligning LLMs to be harmless, meaning they should avoid generating toxic, biased, or unsafe content.
- Anthropic-Helpful (Bai et al., 2022a): Contains 42,846 training prompts and 2,292 test prompts. This dataset focuses on aligning LLMs to be helpful, meaning they should generate informative, relevant, and useful responses.

Additionally, results from two other datasets are presented in Appendix F:
- Helpsteer dataset (Wang et al., 2023)
- UltraFeedback dataset (Cui et al., 2023)

Data Generation and Annotation:
- For each of the 6 base LLMs (see below), 10 responses are generated for each training prompt to serve as candidates for reward model training.
- For each test prompt, 500 responses are generated to be scored by golden reward models for testing via Best-of-N sampling.
- Simulated Preference Annotation: To simulate human annotation, the authors use existing open-source golden reward models (Yang et al., 2024b; Dong et al., 2023, 2024) as "annotators." This ensures reproducibility and affordability.
  - Annotation Noise: A realistic imperfect annotation process is simulated using cognitive bottleneck models (Stewart et al., 2005; Guest et al., 2016). The probability of a correct label depends on the true reward difference $\Delta r = |r(x, y_1) - r(x, y_2)|$ via a sigmoid function: $ \mathbb{P}\big( h(x, y_1, y_2)\,(r(x, y_1) - r(x, y_2)) > 0 \,\big|\, \Delta r \big) = \xi(\Delta r) = \sigma(\beta \Delta r) $. Here, $\beta$ is the annotation quality parameter (a minimal simulation sketch appears at the end of this subsection).
    - When $\beta = 0$, annotations are random ($\xi = 0.5$).
    - When $\beta \to \infty$, annotations are perfect ($\xi \to 1$).
    - A default value of $\beta$ is used unless explicitly varied to study the impact of annotation quality (with $\beta$ values corresponding to error rates from 5% to 38%).
- Annotation Quantity: Experiments vary the number of annotations used for training: [5000, 10000, 20000, 40000].
- Pairing Choices:
  - Same-Prompt Comparison (Standard): Annotators compare response pairs generated from the same prompt.
  - Cross-Prompt Comparison (X-Prompt): Annotators compare two prompt-response pairs that are randomly selected from the dataset, potentially from different prompts.
  - Synthetic Setups for Cross-Prompt Analysis:
    - Similar Comparison: For a single prompt, two responses with similar quality (middle-ranked by a golden reward model) are paired to simulate lack of diversity.
    - Diversified Comparison: For a single prompt, responses with very different qualities (highest- and lowest-scored by a golden reward model) are paired to simulate high diversity.

These datasets were chosen because they are extensively studied in reward modeling, and open-source golden reward models are available, enabling controlled and reproducible experiments.
5.2. Evaluation Metrics
The effectiveness of different reward models is evaluated using Best-of-N (BoN) sampling over the 500 responses generated for each test prompt.
- Best-of-N (BoN) Sampling:
  - Conceptual Definition: BoN is a common strategy to evaluate the quality of a generative LLM or the effectiveness of a reward model. For a given prompt, the LLM generates $N$ diverse responses. The trained reward model then assigns a score to each of these responses. The response with the highest score according to the reward model is selected as the "best" and is used as the final output. The quality of this "best" response (as judged by an independent golden reward model) reflects the performance of the trained reward model.
  - Mathematical Formula: While BoN itself is a sampling strategy, the metric derived from it for evaluation is typically the Golden Reward Value of the selected best response. If $\hat{r}(x, y_k)$ is the score of the $k$-th generated response for a prompt $x$ according to the trained reward model, and $R^*$ is the golden reward model, then the BoN evaluation for a prompt is: $ \text{BoN\_Score}(x) = R^*(x, y_{\text{best}}) \quad \text{where } y_{\text{best}} = \arg\max_{k \in \{1, \dots, N\}} \hat{r}(x, y_k) $
  - Symbol Explanation:
    - $N$: The number of responses generated by the LLM for each prompt (here, the 500 responses generated per test prompt).
    - $\hat{r}(x, y_k)$: The score predicted by the trained reward model for response $y_k$ to prompt $x$.
    - $y_{\text{best}}$: The response among the $N$ generations that received the highest score from the trained reward model.
    - $R^*(x, y_{\text{best}})$: The score assigned by an independent golden reward model (a high-quality, pre-trained reward model acting as ground truth) to the selected best response. This is the ultimate metric for evaluating the performance of the trained reward model.
- Golden Reward Value Improvement: The paper reports results as "improvements over base models" or "relative performance gains achieved through the reward models." This means comparing the BoN score obtained using the trained reward model against a baseline score (e.g., the average score of responses from the base LLM without reward model selection). A short sketch of this evaluation loop follows this subsection.

The choice of BoN is justified by:
- Performance: Empirical studies show BoN often achieves better performance than PPO (Dong et al., 2023; Yuan et al., 2023; Gao et al., 2023; Coste et al., 2023).
- Stability and Reduced Engineering Overhead: BoN requires no hyperparameter tuning and is more stable than PPO, leading to more consistent and interpretable results (Ivison et al., 2024; Xu et al., 2024).
- Computational Efficiency and Reproducibility: BoN's reuse of generations at test time is more efficient than the computationally prohibitive PPO fine-tuning across 12,000 experimental setups.
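An illustrative sketch of the BoN evaluation loop described above, assuming the trained and golden reward scores for the $N$ responses of each test prompt are stored as matrices; it is not the authors' evaluation script.

```python
# Sketch: Best-of-N evaluation. For each test prompt, pick the response ranked highest
# by the trained reward model and report the golden reward model's score of that pick.
import numpy as np

def best_of_n_score(trained_scores: np.ndarray, golden_scores: np.ndarray) -> float:
    """
    trained_scores, golden_scores: arrays of shape (n_prompts, N) holding, for each
    test prompt, the trained RM's and the golden RM's scores of the N sampled responses.
    """
    best_idx = trained_scores.argmax(axis=1)                    # y_best per prompt
    picked = golden_scores[np.arange(len(best_idx)), best_idx]  # R*(x, y_best)
    return float(picked.mean())                                 # average golden reward of picks

def bon_improvement(trained_scores: np.ndarray, golden_scores: np.ndarray) -> float:
    # Improvement over a no-selection baseline: the average golden score of all responses.
    return best_of_n_score(trained_scores, golden_scores) - float(golden_scores.mean())
```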
5.3. Baselines
The paper compares its proposed classification-based reward models against the widely used Bradley-Terry (BT) model.
-
Base LLMs: The experiments use a diverse set of
base LLMsfor generating responses:- Gemma2b
- Gemma7b
- LLaMA3-8b
- Gemma2b-SFT (Supervised Fine-Tuned version of Gemma2b)
- Gemma7b-SFT (Supervised Fine-Tuned version of Gemma7b)
- LLaMA3-8b-SFT (Supervised Fine-Tuned version of LLaMA3-8b) These base models cover different scales and levels of initial alignment (raw vs. SFT-ed), providing a comprehensive evaluation context.
-
Reward Model Implementations:
- BT-MLP: The traditional
Bradley-Terry modelimplemented with anMLP backbone. It requires aSiamese structureto ensureanti-symmetry. This serves as the primary baseline for theBT model. - CLF-MLP: The proposed
classification-based reward modelimplemented with anMLP backbone. This allows for a fair comparison of the objective function, isolating gains from theclassification objectiveitself while keeping the neural network architecture consistent. - CLF-LGB: The proposed
classification-based reward modelimplemented withLightGBM(Ke et al., 2017), anoff-the-shelf gradient boosting decision treealgorithm. This demonstrates the flexibility and potential benefits of using non-neural network classifiers with theorder consistency objective.
- BT-MLP: The traditional
- Embeddings: Gemma2b is used to generate embeddings for reward modeling across all experiments. This standardizes the representation learning component, ensuring that differences in reward model performance are attributed to the reward modeling objective and architecture rather than varying embedding quality.
- Hyper-parameters: To ensure a fair comparison and isolate the source of gains, identical default hyperparameter settings are used for LightGBM (e.g., metric='binary_logloss') and for all MLP configurations (for both BT and Classification models). A minimal sketch of the two training objectives follows this list.
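To make the objective-level comparison concrete, here is a minimal sketch of the two training recipes on precomputed embeddings: a Siamese MLP trained with the pairwise BT (logistic) loss, and an off-the-shelf LightGBM classifier trained with `binary_logloss` on chosen-vs-rejected labels, whose predicted probability serves as the reward. The class and function names and the exact labeling scheme are illustrative assumptions, not the paper's released code.

```python
import numpy as np
import torch.nn as nn
import lightgbm as lgb

# --- BT-MLP: one MLP scores both sides of a comparison (Siamese), pairwise BT loss ---
class RewardMLP(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, emb):                      # emb: (batch, dim) prompt-response embedding
        return self.net(emb).squeeze(-1)

def bt_loss(model, emb_chosen, emb_rejected):
    # -log sigmoid(r(chosen) - r(rejected)); sharing one network ensures anti-symmetry.
    return -nn.functional.logsigmoid(model(emb_chosen) - model(emb_rejected)).mean()

# --- CLF-LGB: off-the-shelf binary classifier as an order-consistent reward model ---
# Assumed labeling: chosen pairs get label 1, rejected pairs label 0; the classifier's
# predicted probability is then used as the reward score at test time.
def train_clf_lgb(emb_chosen, emb_rejected):
    X = np.vstack([emb_chosen, emb_rejected])
    y = np.concatenate([np.ones(len(emb_chosen)), np.zeros(len(emb_rejected))])
    clf = lgb.LGBMClassifier(objective="binary", metric="binary_logloss")
    clf.fit(X, y)
    return lambda emb: clf.predict_proba(emb)[:, 1]   # reward proxy for new embeddings
```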
6. Results & Analysis
The experimental results aim to validate the order consistency framework, compare classification-based and BT models, and assess the impact of annotation quality, quantity, and cross-prompt comparisons.
6.1. Comparing The Bradley-Terry and Classification Objectives
The following figure (Figure 1 from the original paper) compares the performance of BT-MLP, CLF-MLP, and CLF-LGB across different base LLMs on the Harmless and Helpful datasets, measuring Golden Reward Value Improvement via Best-of-N (BoN) sampling.
Figure description: A comparison chart showing the Golden Reward Value Improvement of different base models on the two tasks (Harmless and Helpful). The x-axis shows the base model and the y-axis shows the reward value improvement; BT-MLP, CLF-MLP, and CLF-LGB are compared, and the CLF methods outperform the BT method across most settings.
Original caption: Figure 1: Comparison between BT and Classification reward models. In general, the classification reward models achieve better performance than the BT reward models, with the added flexibility of using off-the-shelf classifiers beyond MLPs. Error bars are given by 5 runs with different seeds.
Analysis:
- Overall Superiority of Classification Models: For almost all base LLMs and both the Harmless and Helpful datasets, the classification-based reward models (CLF-MLP and CLF-LGB) consistently achieve higher Golden Reward Value Improvement than the BT-MLP model.
- Flexibility Benefit: CLF-LGB (LightGBM) often performs as well as, or better than, CLF-MLP, showing that the classification objective is compatible with diverse off-the-shelf classifiers and can leverage their strengths. This supports the paper's argument for greater flexibility beyond MLPs and the Siamese structure required by BT-MLP.
- Consistency Across Models and Tasks: The trend holds across the various Gemma and LLaMA3-8b models (both raw and SFT-ed versions) and for both the Harmlessness and Helpfulness alignment tasks, suggesting the classification objective is a robust alternative.
- Takeaways: The classification-based reward models are generally more effective than BT reward models, offering superior performance and the practical advantage of integrating various machine learning algorithms beyond specialized MLPs.
6.2. How Annotation Qualities and Quantities Affect Performance
6.2.1. Changing Annotation Quality
The following figure (Figure 2 from the original paper) shows the impact of varying annotation quality, controlled by the annotation-quality parameter, on the performance of the reward models on the Harmless and Helpful datasets, with a fixed quantity of 40,000 annotations. The Annotation Error Rate (AER) decreases as this parameter increases.
Figure description: A chart comparing the reward models under different annotation error rates across five experimental settings (Gemma2b, Gemma2b-SFT, Gemma7b, LLaMA3-8b, and LLaMA3-8b-SFT); the y-axis is the golden reward value and the x-axis is the annotation error rate.
Original caption: Figure 2: Changing the annotation quality. Dataset: Harmless, Helpful.
Analysis:
- Robustness to Annotation Noise: As annotation error rates increase (i.e., the quality parameter decreases, moving left on the x-axis), the classification models (CLF-MLP, CLF-LGB) exhibit greater robustness: their performance drops are generally smaller and more graceful than those of BT-MLP.
- BT's Advantage in High-Quality Data: When annotation quality is very high (low error rates, e.g., AER around 5%), the BT-MLP model sometimes outperforms the classification models. This suggests that the BT model's emphasis on calibrated probabilities may be beneficial when the input signal is almost perfectly reliable.
- Overall Practicality: In realistic scenarios where human annotation often contains significant noise (error rates between 5% and 38% are common), the classification models provide a more robust and reliable solution.
- Takeaways: While BT models excel under near-perfect annotation quality, classification models demonstrate superior robustness and maintain better performance as annotation error rates increase, making them more practically useful in noisy human-feedback environments (see the sketch of a simple noise model at the end of this subsection).
Further results on annotation quality under different availability scenarios are provided in Appendix E (Figures 8 and 9). These corroborate the findings, showing CLF methods are generally more stable across various annotation counts when quality varies; for example, at lower annotation counts (e.g., 5,000), CLF methods are more stable on the Harmless dataset (Figure 8).
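For intuition on how a single quality parameter maps to an annotation error rate, here is a minimal sketch of a logistic (BT-style) annotator simulator. The logistic form and the parameter name `beta` are assumptions made for illustration; the paper's exact simulator may differ in detail.

```python
import numpy as np

def simulate_preference(r_a, r_b, beta, rng=None):
    """Simulate one preference label for responses with golden rewards r_a and r_b.

    The annotator prefers A with probability sigmoid(beta * (r_a - r_b)):
    larger beta -> fewer flipped labels (lower error rate); beta -> 0 -> random labels.
    """
    rng = rng or np.random.default_rng()
    p_prefer_a = 1.0 / (1.0 + np.exp(-beta * (r_a - r_b)))
    return rng.random() < p_prefer_a        # True means "A preferred"

def expected_error_rate(reward_gaps, beta):
    """Expected annotation error rate for a set of |r_a - r_b| gaps under this model:
    an error occurs with probability sigmoid(-beta * gap) for each pair."""
    gaps = np.abs(np.asarray(reward_gaps, dtype=float))
    return float(np.mean(1.0 / (1.0 + np.exp(beta * gaps))))
```

Under a model of this form, widening the reward gap between compared responses lowers the error rate at any fixed quality level, which is the mechanism the cross-prompt analysis in Section 6.3 exploits.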
6.2.2. Changing Annotation Quantity
The following figure (Figure 3 from the original paper) illustrates how varying annotation quantities (5,000, 10,000, 20,000, 40,000) affect the reward models' performance, holding the annotation quality parameter constant at a moderate level.
Figure description: A chart showing how the number of training annotations affects the reward value for each model (including Gemma2b and LLaMA3-8b). It compares BT-MLP, CLF-MLP, and CLF-LGB; the x-axis is the number of training annotations and the y-axis is the golden reward value.
Original caption: Figure 3: Changing the annotation quantity. Dataset: Harmless, Helpful.
Analysis:
- Consistent Superiority of Classification Models: Across all annotation quantities and for both datasets, the classification models (CLF-MLP, CLF-LGB) consistently outperform the BT-MLP model.
- Scalability with Data: As the number of annotations increases, all models generally improve. However, classification models show more consistent and often steeper improvements, indicating better scalability and data efficiency.
- Performance in Low-Data Regimes: Even with fewer annotations (e.g., 5,000), CLF models maintain a significant lead over BT-MLP, suggesting they are more effective in data-scarce scenarios.
- Takeaways: Classification models consistently outperform BT models across varying annotation quantities, offering superior performance and more reliable improvements as more data becomes available. This further strengthens their position as a compelling alternative for reward modeling.
Additional results on annotation availability under different annotation qualities are provided in Appendix E (Figures 10 and 11), generally supporting these conclusions.
6.3. LLM Alignment with Preference Annotations Between Different Prompts
This section investigates the impact of cross-prompt comparisons versus same-prompt comparisons on reward model performance. The default setting uses a moderate annotation quality and 40,000 annotations.
6.3.1. Cross-Prompt vs. Same-Prompt Comparisons
The following figure (Figure 4 from the original paper) compares the performance of reward models trained with same-prompt annotations (unshaded bars) and cross-prompt annotations (shaded bars) across various base models and reward modeling methods.
Figure description: A chart showing the golden reward value improvement of different base models under random comparisons on the Harmless and Helpful tasks. By comparing BT-MLP, CLF-MLP, and related models on base models such as Gemma2b-SFT and LLaMA3-8b, it illustrates the effect of the different annotation designs.
Original caption: Figure 4: Results comparing cross-prompt comparison based annotations. Preference annotations on cross-prompt comparisons outperform same-prompt comparisons.
Analysis:
- Significant Improvement with Cross-Prompt Comparisons: For nearly all base LLMs and reward modeling methods, cross-prompt annotations (shaded bars) consistently and substantially outperform same-prompt annotations (unshaded bars) in terms of Golden Reward Value Improvement. This is strong empirical validation of the theoretical findings in Section 4 that cross-prompt comparisons can improve annotation quality.
- Generalizability of Improvement: The benefits of cross-prompt annotations are observed across both the Harmless and Helpful tasks and across the BT-MLP, CLF-MLP, and CLF-LGB models, suggesting the improvement is not specific to a particular model or task but a general advantage of the annotation design.
- Takeaways: Cross-prompt comparisons are a highly effective annotation strategy that leads to significant improvements in reward model performance compared to conventional same-prompt comparisons (a sketch of how the two pairing schemes differ is given below).
Additional figures reporting cross-prompt annotation results under changing annotation availability and quality are provided in Appendix E (Figures 12 to 15). These further confirm the benefits of cross-prompt comparisons across diverse experimental conditions.
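The pairing logic behind the two annotation designs can be sketched as follows; this is a simplified illustration in which `generations` maps each prompt to its list of generated responses, and the function names are hypothetical.

```python
import random

def same_prompt_pairs(generations, num_pairs, seed=0):
    """Conventional design: each comparison pairs two responses to the same prompt."""
    rng = random.Random(seed)
    prompts = list(generations)
    pairs = []
    for _ in range(num_pairs):
        x = rng.choice(prompts)
        y1, y2 = rng.sample(generations[x], 2)
        pairs.append(((x, y1), (x, y2)))
    return pairs

def cross_prompt_pairs(generations, num_pairs, seed=0):
    """Cross-prompt design: each comparison pairs responses to two different prompts,
    which tends to widen the expected reward gap between the two sides."""
    rng = random.Random(seed)
    prompts = list(generations)
    pairs = []
    for _ in range(num_pairs):
        x1, x2 = rng.sample(prompts, 2)
        pairs.append(((x1, rng.choice(generations[x1])),
                      (x2, rng.choice(generations[x2]))))
    return pairs
```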
6.3.2. Impact of Response Diversity
To understand why cross-prompt comparisons are so effective, the paper introduces two synthetic setups (Figures 5 and 6):
- Similar Comparison: Responses for a single prompt have similar quality (lack diversity).
- Diversified Comparison: Responses for a single prompt have different qualities (high diversity).
The following figure (Figure 5 from the original paper) shows results for the Similar and Diversified Comparison setups.
Figure description: A chart showing the golden reward value improvement of the BT-MLP and CLF-MLP models under the Similar and Diversified Comparison settings. The results indicate that cross-prompt comparison significantly improves reward modeling when same-prompt response pairs lack diversity. Error bars are based on 5 runs with different seeds.
Original caption: Figure 5: Results comparing cross-prompt comparison-based annotations on synthetically generated similar or diversified comparison pairs. Cross-prompt comparison significantly improves the performance of reward modeling with same-prompt response pairs lacking diversity. Error bars are from 5 runs with different seeds.
Analysis:
- Critical Role in Low-Diversity Scenarios: In the Similar Comparison setting (left side of the plots), same-prompt comparisons (unshaded bars) often fail to produce informative reward models, resulting in near-zero or even negative Golden Reward Value Improvement. In stark contrast, cross-prompt annotations (shaded bars) achieve substantial gains in this setting, strongly suggesting that cross-prompt comparisons are essential when same-prompt responses lack diversity and are hard to distinguish.
- Benefits Even with High Diversity: In the Diversified Comparison setting (right side of the plots), where same-prompt responses already have high diversity, the need for cross-prompt comparisons diminishes. Cross-prompt annotations still perform well, but their advantage over same-prompt annotations is smaller and sometimes performance is comparable; importantly, they show no evident negative impact.
- Stability and Lower Dependence on Diversity: Comparing results across the Similar and Diversified settings, and with Figure 4, cross-prompt annotations offer more stable performance with less dependence on the inherent diversity of responses for a single prompt, making them a generally reliable strategy.
- Takeaways: Cross-prompt comparisons are crucial when same-prompt responses lack diversity, as they significantly enhance reward model performance. Their benefits are less pronounced but still present (or at least not detrimental) in high-diversity scenarios, making them a generally robust strategy.
6.3.3. Correlation with Averaged Absolute Score Difference
The following figure (Figure 6 from the original paper) examines the relationship between the averaged absolute score difference in pairwise annotations (x-axis) and the Golden Reward Value Improvement from cross-prompt annotations (y-axis).
Figure description: A scatter plot showing the relationship between the averaged absolute reward difference (x-axis) and the golden reward value improvement (y-axis) in the Harmless and Helpful settings. Points and lines in different colors distinguish the correlations for the different models.
Original caption: Figure 6: Comparing the averaged absolute difference in scores in pairwise annotations (x-axis) and improvements achieved by using cross-prompt annotations (y-axis). The two variables are highly correlated.
Analysis:
- Strong Positive Correlation: The scatter plots show a strong positive correlation between the averaged absolute reward difference in the annotated pairs and the Golden Reward Value Improvement achieved by the cross-prompt annotations. This empirically validates the theoretical intuition that larger differences between compared items make annotations more informative and lead to better reward models (a short sketch of this correlation check is given below).
- Practical Implication for Random Sampling: The paper notes that averaged reward differences in the synthetic Similar cases and the Random settings are similar. In practice, randomly selecting two responses to the same prompt for annotation is therefore likely to yield highly similar (low-diversity) pairs, which is precisely the scenario where cross-prompt annotation is most effective.
- Takeaways: The effectiveness of cross-prompt annotations is directly linked to their ability to increase the expected difference between compared responses. Since same-prompt random sampling often yields low diversity, cross-prompt annotations provide a crucial mechanism to boost annotation quality and reward model performance in practical RLHF settings.
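The correlation reported in Figure 6 is straightforward to reproduce from per-setup summaries. A minimal sketch follows; the arrays are placeholders with purely illustrative numbers, not the paper's data.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-setup summaries:
#   avg_abs_gap[i]  - averaged |r(y1) - r(y2)| over the annotated pairs of setup i
#   improvement[i]  - Golden Reward Value Improvement from cross-prompt annotations in setup i
avg_abs_gap = np.array([0.12, 0.25, 0.31, 0.44, 0.58, 0.71])   # illustrative values only
improvement = np.array([0.01, 0.05, 0.06, 0.09, 0.13, 0.15])   # illustrative values only

r, p_value = pearsonr(avg_abs_gap, improvement)
print(f"Pearson correlation: {r:.3f} (p = {p_value:.3g})")
```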
6.4. Additional Results on Helpsteer and UltraFeedback (Appendix F)
The paper also presents additional results on the Helpsteer and UltraFeedback datasets in Appendix F, which consistently align with the findings from Anthropic-Harmless and Anthropic-Helpful.
The following are the results from Figure 16 of the original paper, comparing BT and Classification reward models on Helpsteer and UltraFeedback datasets:
Figure description: A chart showing the golden reward value improvement of the BT-MLP, CLF-MLP, and CLF-LGB reward models on the Helpsteer and UltraFeedback datasets under different base models. The classification reward models generally outperform the BT model; error bars indicate repeated runs.
Original caption: Figure 16: Comparison between BT and Classification reward models. In general, the classification reward models achieve better performance than the BT reward models, with the added flexibility of using off-the-shelf classifiers beyond MLPs. Error bars are given by 5 runs with different seeds.
Analysis:
This figure shows the classification reward models (CLF-MLP, CLF-LGB) generally achieve better performance than BT-MLP on both Helpsteer and UltraFeedback datasets across different base LLMs. This further reinforces the main conclusion from Figure 1.
The following are the results from Figure 17 of the original paper, showing annotation quality changes on Helpsteer and UltraFeedback datasets:
Figure description: Six subplots showing the golden reward value of the different models under different annotation error rates. Subplots are labeled Gemma2b-SFT, Gemma7b-SFT, and LLaMA3-8b-SFT; the x-axis is the annotation error rate and the y-axis is the golden reward value. Different lines represent the BT-MLP, CLF-MLP, and CLF-LGB models.
Original caption: Figure 17: Changing the annotation quality. Dataset: Helpsteer, UltraFeedback.
Analysis:
Similar to Figure 2, the classification models (CLF-MLP, CLF-LGB) demonstrate greater robustness to increasing annotation error rates (decreasing annotation quality) on Helpsteer and UltraFeedback. BT-MLP performs better under very-high-quality annotations (low error rates), but the CLF methods are generally more robust.
The following are the results from Figure 18 of the original paper, showing annotation quantity changes on Helpsteer and UltraFeedback datasets:
Figure description: A chart showing how the golden reward value of the different reward models (BT-MLP, CLF-MLP, CLF-LGB) changes with the number of training annotations for the Gemma2b-SFT, Gemma7b-SFT, and LLaMA3-8b-SFT base models. The trends indicate that the number of training annotations affects the models differently.
Original caption: Figure 18: Changing the annotation quality. Dataset: Helpsteer, UltraFeedback.
Analysis:
Similar to Figure 3, classification models consistently outperform BT models across varying annotation quantities on these additional datasets, providing more consistent improvements as data increases.
The following are the results from Figure 19 of the original paper, showing cross-prompt comparison results on Helpsteer and UltraFeedback datasets:
Figure description: A chart showing, for different base models (Gemma2b-SFT, Gemma7b-SFT, and LLaMA3-8b-SFT), how random comparisons with BT-MLP, CLF-MLP, and CLF-LGB affect the golden reward value improvement. The results show that cross-prompt comparison annotations outperform same-prompt comparison annotations.
Original caption: Figure 19: Results comparing cross-prompt comparison based annotations. Preference annotations on cross-prompt comparisons outperform same-prompt comparisons.
Analysis:
Similar to Figure 4, cross-prompt comparisons (shaded bars) consistently outperform same-prompt comparisons (unshaded bars) on both Helpsteer and UltraFeedback datasets, reaffirming the benefits of this annotation strategy.
The following are the results from Figure 20 of the original paper, showing cross-prompt comparison results on synthetic similar or diversified comparison pairs on Helpsteer and UltraFeedback datasets:
Figure description: A chart showing the golden reward value improvement of the different base models under the Similar and Diversified Comparison settings. Consistent with the earlier results, cross-prompt comparison significantly improves reward modeling for same-prompt response pairs. The data come from multiple experimental settings and compare the effects of different classifiers.
Original caption: Figure 20: Results comparing cross-prompt comparison-based annotations on synthetically generated similar or diversified comparison pairs. Cross-prompt comparison significantly improves the performance of reward modeling with same-prompt response pairs lacking diversity. Error bars are from 5 runs with different seeds.
Analysis:
Similar to Figure 5, cross-prompt comparisons significantly improve reward modeling performance for same-prompt response pairs lacking diversity (in Similar Comparison settings) on these additional datasets. The benefit is less pronounced in Diversified Comparison settings but still present or neutral.
Conclusive Remark on Experiments:
The comprehensive experiments consistently demonstrate that classification-based reward models offer significant advantages over traditional BT models in terms of performance, flexibility (using off-the-shelf classifiers), and robustness to varying annotation quality and quantity. Furthermore, cross-prompt comparisons emerge as a superior annotation strategy, particularly beneficial when same-prompt responses lack diversity—a common issue in practical settings. These findings provide strong empirical evidence for the paper's theoretical arguments and practical recommendations.
6.5. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Usage | LLM Arena | BT Reward Modeling |
| --- | --- | --- |
| Objective | Direct Parameter Estimation | Model Parameterization |
| Prediction | Not needed | Needed |
| Number of Comparisons | Sufficient (1.7M for N = 130) | Very Sparse (N/2) |
| Usage of BT Model | Classical BT | BT-Regression |
| Requirement | At least O(N log(N)) Comparisons | Covariates |
Analysis: This table clearly differentiates the traditional application of the BT model in LLM Arenas (e.g., ranking chatbots) from its use in BT Reward Modeling for LLM alignment.
- Objective: In the LLM Arena, the goal is to estimate a fixed ability score for each LLM (player). In Reward Modeling, the goal is to learn a function that maps any prompt-response pair to a reward score.
- Prediction: Arena models don't need to predict scores for new players, only rank existing ones. Reward models must generalize to predict scores for unseen prompt-response pairs.
- Comparison Density: Arena settings typically have a massive number of comparisons per player (e.g., 1.7 million for 130 models), providing sufficient data. Reward modeling, in contrast, deals with very sparse comparisons (roughly N/2 comparisons for N prompt-response pairs), a key challenge addressed by the paper.
- BT Variant: This sparsity necessitates BT-Regression in reward modeling, where covariates (embeddings) are used to predict scores, unlike the classical BT model in arenas. A sketch contrasting the two is given below.
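To make the distinction in Table 1 concrete, here is a minimal sketch contrasting classical BT (one free ability parameter per player, no prediction for new players) with BT-regression (scores predicted from embeddings, so unseen prompt-response pairs can be scored). The classes are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ClassicalBT(nn.Module):
    """LLM-Arena-style BT: one learnable ability score per player, no covariates."""
    def __init__(self, num_players):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(num_players))

    def loss(self, winners, losers):                     # index tensors of compared players
        return -nn.functional.logsigmoid(self.theta[winners] - self.theta[losers]).mean()

class BTRegression(nn.Module):
    """Reward-modeling-style BT: scores are a function of embeddings (covariates),
    so the model can score prompt-response pairs never seen during training."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def loss(self, emb_win, emb_lose):                   # embeddings of compared pairs
        return -nn.functional.logsigmoid(
            self.net(emb_win).squeeze(-1) - self.net(emb_lose).squeeze(-1)).mean()
```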
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper rigorously re-examines the Bradley-Terry (BT) model in preference-based reward modeling for Large Language Model (LLM) alignment. It provides the first asymptotic theory for neural network-based BT regression in this context, formally establishing its convergence rate. However, it argues that the BT model's precise probability calibration is not strictly necessary for downstream LLM optimization; rather, order consistency (preserving correct ranking) is sufficient. Building on this insight, the paper proposes a simpler, classification-based reward modeling objective that is order-consistent and compatible with off-the-shelf binary classifiers. Furthermore, it theoretically demonstrates the superiority of cross-prompt comparisons in annotations, showing they increase utility diversity and improve annotation quality. Extensive empirical evaluations across over 12,000 setups confirm these findings: classification-based models generally outperform BT models and are more robust to annotation noise and quantity, while cross-prompt annotations significantly enhance reward model performance, especially when same-prompt responses lack diversity.
7.2. Limitations & Future Work
The authors implicitly highlight areas for future work rather than explicitly listing limitations as a dedicated section.
- Representation Learning: In Section 5.4, the paper notes that it separates reward model learning from representation learning. Future work could explore combining the order consistency objectives with advances in generative reward models (Yang et al., 2024a) or generative verifiers (Zhang et al., 2024) that leverage LLM generation abilities to improve embeddings or reward predictions, suggesting that the gains from reward modeling objectives could be further amplified by better embedding representations.
- Beyond Two-Stage Methods: While the paper champions two-stage methods (an explicit reward model followed by RL or inference-time optimization), it acknowledges the existence of direct alignment methods (e.g., DPO, KTO, RPO). Future research could investigate how the order consistency principle might be adapted or integrated into these direct methods, or explore hybrid approaches.
- Understanding Annotation Biases: The paper uses a simulated annotation process with a cognitive bottleneck model. While realistic, actual human annotation may involve more complex biases or non-transitive preferences not fully captured by this model. Future work could explore reward modeling objectives that are more robust to a wider range of annotation imperfections.
- Dynamic Preference Adjustment: The paper references works like Rewards-in-Context (Yang et al., 2024b), which aim for multi-objective alignment with dynamic preference adjustment. Integrating dynamic or context-dependent aspects into the order consistency objective could be a promising direction.
7.3. Personal Insights & Critique
This paper provides a highly valuable contribution to the RLHF literature by systematically re-evaluating the Bradley-Terry model, which has been a cornerstone of LLM alignment but often applied without deep theoretical scrutiny in the specific context of sparse, high-dimensional textual data.
Strengths and Inspirations:
- Theoretical Rigor Meets Practicality: The paper successfully bridges theoretical analysis (providing asymptotic theory for BT regression with deep neural networks) with practical implications (proposing simpler, more effective alternatives and annotation strategies). This dual approach is highly commendable.
- Challenging Assumptions: The shift in perspective from calibrated probabilities to order consistency as the core requirement for reward models is a profound insight. It simplifies the problem and opens the door to more flexible and robust machine learning algorithms. The concept transfers to any ranking or preference-based system where the absolute magnitude of scores matters less than their relative ordering for optimization.
- Actionable Data Collection Insights: The theoretical and empirical validation of cross-prompt comparisons is a significant practical takeaway. It directly informs how preference data should be collected to maximize its utility, potentially leading to more efficient and higher-quality RLHF pipelines with less data. This is particularly relevant given the high cost of human annotation.
Potential Issues/Areas for Improvement:
- Complexity of Theoretical Conditions: While impressive, the theoretical conditions (e.g., Hölder smoothness, small value bounds, specific scaling of MLP depth/width/sparsity) required for the truncated KL risk bound (Theorem 20) are quite stringent. It may be hard to verify whether real-world LLM embedding spaces and true reward functions strictly satisfy these assumptions; further work could explore robustness to their violation.
- Generalizability of the "Additive Constant": Convergence of reward differences up to an additive constant is a common result for preference models. While sufficient for ranking, it means the absolute scale of the reward function is arbitrary. For certain RL algorithms or value-based agents, reward scale can influence learning dynamics; for BoN sampling (as used here) and many policy optimization methods, this is not an issue.
- Simulated Annotators: Relying on golden reward models as annotators, while ensuring reproducibility, may not fully capture the nuances, inconsistencies, and emergent biases of real human annotators. Further validation on large-scale human-annotated datasets would strengthen the empirical claims, especially regarding the classification models' robustness to human noise.
- Efficiency of Cross-Prompt Annotation in Practice: While theoretically beneficial, implementing cross-prompt annotation efficiently in a human labeling pipeline may introduce new logistical challenges (e.g., annotator cognitive load when prompts are very disparate, interface design). The paper's analysis focuses on the statistical quality of the resulting data, but practical human factors also play a role.
Overall, this paper makes a substantial contribution by providing a deeper understanding of reward modeling foundations and offering practical, empirically validated alternatives that can make LLM alignment more efficient and effective. Its emphasis on order consistency and cross-prompt comparisons is particularly insightful and actionable for the field.