LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
TL;DR Summary
LLaDA 1.5 uses Variance-Reduced Preference Optimization to reduce ELBO estimator variance, improving human preference alignment and outperforming prior models on math, code, and alignment benchmarks.
Abstract
While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort in aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA, and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor consistently and significantly across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 demonstrates a highly competitive mathematical performance compared to strong language MDMs and ARMs. Project page: https://ml-gsai.github.io/LLaDA-1.5-Demo/.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models. It indicates that the paper introduces an improved version (1.5) of a large language diffusion model (LLaDA) by applying a novel technique called Variance-Reduced Preference Optimization (VRPO).
1.2. Authors
- Fengqi Zhu (Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)
- Rongzhen Wang (Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)
- Shen Nie (Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)
- Xiaolu Zhang (Ant Group)
- Chunwei Wu (Ant Group)
- Jun Hu (Ant Group)
- Jun Zhou (Ant Group)
- Jianfei Chen (Tsinghua University)
- Yankai Lin (Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)
- Ji-Rong Wen (Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)
- Chongxuan Li (Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)
The authors are primarily affiliated with the Gaoling School of AI at Renmin University of China, Beijing Key Laboratory of Research on Large Models and Intelligent Governance, Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE, Tsinghua University, and Ant Group. Their backgrounds appear to span artificial intelligence, machine learning, and large language models, specifically focusing on diffusion models and their applications.
1.3. Journal/Conference
The paper is published on arXiv (arXiv preprint arXiv:2505.19223). arXiv is a well-known open-access archive for scholarly articles, primarily in physics, mathematics, computer science, and related fields. Papers on arXiv are typically preprints, meaning they have not yet undergone formal peer review, though many eventually get published in peer-reviewed conferences or journals. As a preprint, it allows for rapid dissemination of research findings.
1.4. Publication Year
2025 (May 25, 2025, Version 2)
1.5. Abstract
Masked Diffusion Models (MDMs), like LLaDA, are a promising paradigm for language modeling, but aligning them with human preferences through reinforcement learning (RL) has received limited attention. A major challenge is the high variance observed in Evidence Lower Bound (ELBO)-based likelihood estimates, which are crucial for preference optimization. To tackle this, the authors propose Variance-Reduced Preference Optimization (VRPO). This framework provides a formal analysis of the ELBO estimator's variance and derives bounds for both the bias and variance of preference optimization gradients. Based on this theoretical foundation, VRPO introduces unbiased variance reduction techniques, such as optimal Monte Carlo budget allocation and antithetic sampling, which significantly enhance the alignment performance of MDMs. When applied to LLaDA, the resulting model, LLaDA 1.5, consistently and significantly surpasses its Supervised Fine-Tuning (SFT)-only predecessor across various benchmarks: mathematical tasks (GSM8K +4.7), code generation (HumanEval +3.0, MBPP +1.8), and alignment tasks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 achieves highly competitive mathematical performance compared to other strong language MDMs and autoregressive models (ARMs).
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2505.19223
- PDF Link: https://arxiv.org/pdf/2505.19223v2.pdf
- Publication Status: This paper is a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The field of large language models (LLMs) has seen significant advancements, with two primary paradigms emerging: Autoregressive Models (ARMs) and Masked Diffusion Models (MDMs). While ARMs have achieved remarkable success and extensive research has gone into aligning them with human preferences (e.g., via Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO)), MDMs are a newer, promising paradigm that have shown competitive or superior performance in language modeling at various scales. However, the alignment of these MDMs with human preferences, which is crucial for developing helpful and safe models, remains relatively underexplored.
The core problem this paper aims to solve is the high variance associated with Evidence Lower Bound (ELBO)-based likelihood estimates in MDMs during preference optimization. Unlike ARMs, where log-likelihoods are often directly computable, MDMs rely on ELBOs as approximations, which introduces nested expectations and requires Monte Carlo sampling. This sampling, while necessary, leads to significant variance in the estimated preference score (a linear combination of ELBO terms), which in turn propagates to the loss and gradient of the preference optimization objective. This high variance hinders stable and effective training, making it difficult to align MDMs with human preferences reliably across diverse tasks.
The paper's entry point is to systematically study this variance problem in the context of DPO, a popular and empirically strong alignment method. By formally analyzing the bias and variance introduced by ELBO approximations, the authors identify the score-estimator variance as the dominant factor affecting optimization stability. This insight drives the development of principled variance reduction strategies specifically tailored for MDMs.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Formal Analysis of Variance and Bias: It provides a rigorous theoretical framework that formally analyzes the bias and variance of the DPO loss and gradient when using ELBO-based likelihood estimates in MDMs. This analysis reveals that these errors are primarily governed by the variance of the score estimator (a linear combination of ELBO terms). This theoretical grounding offers crucial insights into the challenges of aligning MDMs.
- Introduction of Variance-Reduced Preference Optimization (VRPO): Based on the theoretical findings, the paper proposes VRPO, a novel framework that integrates multiple unbiased variance reduction techniques:
  - Increased Sampling Budget: strategically increasing the number of Monte Carlo samples used for ELBO estimation.
  - Optimal Allocation: distributing the sampling budget optimally across diffusion timesteps by setting $n_t = n$ (the number of sampled timesteps equals the total budget) and $n_{y_t} = 1$ (one masked sample per timestep).
  - Antithetic Sampling: sharing the same sampled timesteps and masked data between the current policy ($\pi_\theta$) and reference policy ($\pi_{\mathrm{ref}}$) ELBO estimates for the same input.
  These methods are proven to reduce variance without introducing bias.
- Empirical Validation and LLaDA 1.5: The effectiveness of VRPO is demonstrated by applying it to LLaDA, an 8B-parameter masked diffusion language model. The resulting model, LLaDA 1.5, consistently and significantly outperforms its Supervised Fine-Tuning (SFT)-only predecessor (LLaDA Instruct) across a wide range of benchmarks:
  - Mathematical Tasks: GSM8K (+4.7), Math (+0.4), GPQA (+3.6).
  - Code Generation: HumanEval (+3.0), MBPP (+1.8).
  - Alignment Tasks: IFEval (+4.0), Arena-Hard (+4.3), AlignBench (+0.5), MTBench (+0.1).
  These improvements establish LLaDA 1.5 as a highly competitive model, even against strong ARMs, particularly in mathematical performance.
- Generalizability of Techniques: The paper discusses how the proposed variance reduction techniques and analysis can be extended beyond DPO to other Reinforcement Learning (RL)-based alignment algorithms, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), making them broadly applicable to MDM alignment.

The key conclusions are that systematically addressing the high variance in ELBO estimates is critical for effective MDM alignment, and the VRPO framework provides a theoretically sound and empirically validated solution to achieve this.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a solid grasp of several core concepts in machine learning, particularly in natural language processing (NLP) and generative models, is essential.
3.1.1. Large Language Models (LLMs)
Large Language Models (LLMs) are deep learning models, typically based on the Transformer architecture, that are trained on massive amounts of text data. They learn to predict the next word in a sequence, enabling them to generate human-like text, answer questions, summarize documents, and perform many other language-related tasks.
3.1.2. Autoregressive Models (ARMs)
Autoregressive Models (ARMs) are a common type of LLM where text is generated token-by-token in a sequential manner. Each token's generation is conditioned on all previously generated tokens. This process makes their log-likelihood (the probability of observing a sequence of tokens) directly computable. Examples include GPT-series models and LLaMA.
- How they work: To generate "Hello world!", an ARM first predicts "Hello", then predicts "world!" given "Hello".
- Likelihood: The probability of a sequence $y$ given a prompt $x$ is typically factorized as:
$
P(y|x) = \prod_{i=1}^L P(y_i \mid x, y_1, \dots, y_{i-1})
$
The log-likelihood is then $\log P(y|x) = \sum_{i=1}^L \log P(y_i \mid x, y_1, \dots, y_{i-1})$.
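As a concrete illustration, here is a minimal PyTorch sketch of this sum of per-token log-probabilities (the tensor shapes and toy usage are assumptions for illustration, not tied to any specific model):

```python
import torch
import torch.nn.functional as F

def sequence_log_likelihood(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities log P(y_i | x, y_<i).

    logits:     (L, V) next-token logits; row i is the model's prediction
                for position i of the response, conditioned on x and y_<i.
    target_ids: (L,)   token ids of the response y.
    """
    log_probs = F.log_softmax(logits, dim=-1)                  # (L, V)
    token_lp = log_probs.gather(-1, target_ids.unsqueeze(-1))  # (L, 1)
    return token_lp.sum()

# Toy usage with random logits standing in for a causal LM's output.
L, V = 5, 100
print(sequence_log_likelihood(torch.randn(L, V), torch.randint(0, V, (L,))))
```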
3.1.3. Diffusion Models
Diffusion Models are a class of generative models that learn to reverse a gradual noise process. In the context of images, they learn to transform noisy data back into a clean image. For discrete data like text, they operate by progressively masking out information and then learning to denoise or unmask it.
3.1.4. Masked Diffusion Models (MDMs)
Masked Diffusion Models (MDMs) adapt the diffusion paradigm to discrete data like language. Instead of adding Gaussian noise, they mask out tokens in a sequence.
- Forward Process: Starting from an original, clean text sequence, tokens are progressively masked (replaced with a special [MASK] token) over a series of timesteps. As time progresses, more tokens are masked, eventually leading to a fully masked sequence. This is a controlled process, often defined by a masking schedule.
- Reverse Process: The model learns to reverse this process. Given a partially masked sequence at a certain timestep, the model predicts the original, unmasked tokens. By iteratively denoising (unmasking) the sequence from a fully masked state back to a clean state, it generates new text.
- Difference from ARMs: MDMs are non-autoregressive or partially autoregressive, meaning they can predict multiple masked tokens simultaneously or in parallel, which can lead to faster generation. However, their log-likelihood is generally intractable to compute directly, which is a key challenge addressed in this paper.
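For intuition, here is a small, self-contained sketch (pure Python, not the paper's implementation) of the forward masking process at noise level t, where each token is independently replaced by [MASK] with probability t:

```python
import random

MASK = "[MASK]"

def forward_mask(tokens, t, seed=None):
    """Corrupt a token sequence: each token is masked independently with probability t."""
    rng = random.Random(seed)
    return [MASK if rng.random() < t else tok for tok in tokens]

y = ["The", "cat", "sat", "on", "the", "mat"]
for t in (0.25, 0.5, 0.9):
    print(t, forward_mask(y, t, seed=0))
```

As t grows toward 1, more of the sequence is replaced by [MASK]; the model is trained to invert this corruption.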
3.1.5. Evidence Lower Bound (ELBO)
The Evidence Lower Bound (ELBO) is a variational approximation used in probabilistic models, especially when the true log-likelihood of the data is intractable to compute directly. It provides a lower bound on the true log-likelihood, meaning $\mathcal{B}_\pi(y|x) \leq \log \pi(y|x)$. Optimizing the ELBO is a common strategy to train such models.
- In MDMs, the ELBO often involves expectations over the diffusion process (different timesteps) and masked data configurations within each timestep.
3.1.6. Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) is a technique used to align LLMs with human preferences and values. It typically involves three steps:
- Supervised Fine-Tuning (SFT): A base LLM is fine-tuned on a dataset of high-quality human-written demonstrations to make it follow instructions.
- Reward Model Training: A separate reward model (RM) is trained to predict human preferences. Humans provide feedback by ranking multiple model-generated responses to a given prompt. The RM learns to assign higher scores to preferred responses.
- Reinforcement Learning (RL): The SFT model is then further fine-tuned using Proximal Policy Optimization (PPO) or similar RL algorithms, with the reward signal provided by the trained RM. The goal is to maximize the RM's score while staying close to the SFT policy to prevent mode collapse or drift.
3.1.7. Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) is a simplified and often more stable alternative to traditional RLHF. Instead of explicitly training a separate reward model and then using RL, DPO directly optimizes the language model's policy to satisfy human preferences by minimizing a single objective function. This objective is derived from the Bradley-Terry model of preferences and directly relates the model's log-likelihood ratios to human preferences. It avoids the complexities of RM training and RL training stability issues.
3.1.8. Monte Carlo Methods
Monte Carlo methods are computational algorithms that rely on repeated random sampling to obtain numerical results. They are often used when direct analytical solutions are intractable.
- Estimation: To estimate an expectation $\mathbb{E}[f(X)]$, one can draw samples $X_1, \dots, X_n$ from the distribution of $X$ and compute the sample average $\frac{1}{n} \sum_{i=1}^n f(X_i)$.
- Bias: An estimator is unbiased if its expected value equals the true parameter it is estimating.
- Variance: The variance of an estimator measures how much its estimates vary from the true parameter across different sets of samples. High variance means estimates are unstable and noisy.
3.1.9. Variance Reduction Techniques
These are methods used in Monte Carlo simulations to reduce the variance of an estimator without changing its expected value (i.e., maintaining unbiasedness). This makes the estimates more precise and reliable for a given number of samples.
- Antithetic Sampling (Antithetic Variates): A technique where pairs of random variables are generated that are negatively correlated. By averaging estimates from these negatively correlated samples, the overall variance of the estimator can be significantly reduced. For example, if you need to estimate $\mathbb{E}[f(X)]$, and $f(X)$ and $f(X')$ are negatively correlated, then $\frac{1}{2}\left(f(X) + f(X')\right)$ might have lower variance than just $f(X)$.
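A quick NumPy demonstration of the effect (a toy example with $f(u) = e^u$ and the classic antithetic pair $(U, 1-U)$; the numbers are illustrative, not from the paper):

```python
import numpy as np

# Antithetic variates demo: estimate E[exp(U)] for U ~ Uniform(0, 1).
# U and 1 - U are negatively correlated, so averaging f(U) and f(1 - U)
# gives a lower-variance estimator than two independent draws.
rng = np.random.default_rng(0)
f = np.exp

n, reps = 1_000, 2_000
plain, antithetic = [], []
for _ in range(reps):
    u = rng.uniform(size=n)
    v = rng.uniform(size=n)
    plain.append(0.5 * (f(u) + f(v)).mean())             # 2n independent samples
    antithetic.append(0.5 * (f(u) + f(1.0 - u)).mean())  # n antithetic pairs

print("true value          :", np.e - 1)
print("variance, plain MC  :", np.var(plain))
print("variance, antithetic:", np.var(antithetic))
```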
3.2. Previous Works
The paper builds upon and differentiates itself from existing research in several areas:
3.2.1. Masked Diffusion Models (MDMs) in Language
- Foundational Diffusion: The concept originated with Sohl-Dickstein et al. (2015) and Austin et al. (2021a).
- Discrete Diffusion: Campbell et al. (2022) extended diffusion to discrete data; Hoogeboom et al. (2021) and Meng et al. (2022) further explored this direction.
- Language MDMs: Several works, including Sahoo et al. (2024), showed MDMs achieving comparable or superior performance to ARMs at small scales, optimizing the ELBO or simplified variants.
- Scaling MDMs: Subsequent works, including Gong et al. (2024) and LLaDA, demonstrated excellent scalability, achieving competitive results with state-of-the-art ARMs like LLaMA 3 (Dubey et al., 2024).
3.2.2. Alignment of LLMs (ARMs)
- Traditional RLHF: Ziegler et al. (2019) and Ouyang et al. (2022) established the two-stage process of reward modeling followed by RL (e.g., PPO) for ARMs.
- Direct Preference Optimization (DPO): Rafailov et al. (2023) introduced DPO, a simplified yet effective alternative that directly optimizes a policy to satisfy preferences without an explicit reward model. Its strong empirical performance is noted by Grattafiori et al. (2024).
3.2.3. Alignment of MDMs (Existing Efforts)
The paper acknowledges several emerging works on MDM alignment:
- Zekri and Boullé (2025): a general policy-gradient method leveraging the denoising distribution.
- Borso et al. (2025): a DPO variant for discrete diffusion that views token-level denoising steps as actions, validated on small-scale binary sequences.
- Zhao et al. (2025), Yang et al. (2025), Tang et al. (2025): GRPO-based methods treating token-level steps as actions to enhance reasoning.
- Huang et al. (2025): a GRPO variant viewing intermediate diffusion steps as RL trajectories, focusing on reasoning and code generation.
- Gong et al. (2025): a GRPO-based algorithm for code generation with a coupled-sampling variance-reduction technique.
3.2.4. Variance Reduction Techniques
The paper draws inspiration from broader fields:
- Monte Carlo Methods: classic techniques like control variates and stratified sampling (Kroese et al., 2013). The paper specifically adapts antithetic variates.
- Doubly Stochastic Optimization: related to the nested expectations in ELBOs, drawing from works such as Titsias and Lázaro-Gredilla (2014) and Gower et al. (2020).
- Variational Inference: connections to importance weighted variational inference (Burda et al., 2016; Huang and Courville, 2019), where outer bias is reduced by inner variance reduction.
3.3. Technological Evolution
The evolution of language models has progressed from rule-based systems to statistical models, then to deep learning models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTMs), and most recently to Transformer-based models. Within the Transformer era, Autoregressive Models (ARMs) have dominated, with significant strides in pre-training scale and subsequent alignment via RLHF or DPO.
Masked Diffusion Models (MDMs) represent a newer wave, offering a different generative paradigm—often parallel generation—which holds promise for efficiency and robustness. However, their internal mechanisms, particularly the use of ELBOs for likelihood approximation, introduce unique challenges when adapting existing alignment methodologies designed for ARMs. Most prior work on MDM alignment (as cited in 3.2.3) attempts to adapt existing methods without a deep dive into the specific statistical challenges posed by MDMs' ELBO approximations or focuses on specialized tasks.
This paper's work fits into this timeline by addressing a critical missing piece: a systematic theoretical and empirical study of preference optimization for MDMs on general tasks, specifically tackling the high variance issue inherent in their ELBO-based likelihood estimates. It aims to bring the robustness and alignment capabilities seen in ARMs to the MDM paradigm.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's core differences and innovations are:
- Focus on ELBO Variance in MDM Alignment: Unlike prior MDM alignment works that mostly adapt existing frameworks (like DPO or GRPO) to MDMs and introduce various likelihood approximation methods, this paper identifies and formally analyzes the high variance in ELBO-based likelihood estimates as the central challenge. This provides a fundamental understanding of the problem specific to MDMs.
- Theoretical Grounding for Variance Reduction: The paper provides a rigorous theoretical foundation by analyzing the bias and variance of the DPO loss and gradient in terms of the score-estimator variance. This systematic analysis leads directly to the design of unbiased variance reduction strategies (VRPO), rather than relying solely on empirical tuning or heuristic adaptations.
- Comprehensive General Task Alignment: Many existing MDM alignment efforts focus on specialized tasks like reasoning or code generation. This paper explicitly aims for broader alignment tasks and validates VRPO on a comprehensive suite of benchmarks covering mathematics, code, and general alignment, demonstrating its effectiveness for general-purpose language MDMs.
- Principled Variance Reduction Techniques: The VRPO framework introduces specific, theoretically motivated techniques: optimal Monte Carlo budget allocation for ELBO estimation (setting $n_t = n$, $n_{y_t} = 1$) and antithetic sampling between current and reference policies. These are presented as unbiased and shown to be effective, distinguishing them from potentially biased or less effective ad-hoc solutions. The coupled-sampling variance-reduction technique of Gong et al. (2025) is concurrent and complementary, indicating an active research area to which this paper also contributes with a DPO-specific focus.
- LLaDA 1.5 as a Strong Baseline: By applying VRPO to a large-scale MDM (LLaDA 8B), the paper demonstrates state-of-the-art performance for MDMs across diverse tasks, effectively establishing a new benchmark and methodology for aligning large language diffusion models.

In essence, while others have adapted alignment methods to MDMs, this paper systematically addresses the intrinsic statistical challenges of MDMs' likelihood approximations within the alignment process, leading to a more robust and effective solution.
4. Methodology
4.1. Principles
The core principle behind the proposed Variance-Reduced Preference Optimization (VRPO) is to address the instability and inefficiency of aligning Masked Diffusion Models (MDMs) with human preferences due to high variance in their Evidence Lower Bound (ELBO)-based likelihood estimates. Direct Preference Optimization (DPO) is chosen as the alignment framework for its simplicity and strong empirical performance.
The fundamental intuition is that the exact log-likelihood required by DPO is intractable for MDMs. This necessitates approximating it with the ELBO, which introduces Monte Carlo sampling and, consequently, stochasticity. This stochasticity leads to:
- Bias: Due to the non-linearity of the sigmoid function in the DPO loss, the expected value of the estimated loss does not equal the loss computed from the true (unestimated) ELBOs.
- High Variance: The estimated preference score (a linear combination of ELBOs) exhibits high variance, which directly translates to high variance in the DPO loss and its gradient. This noisy gradient makes the optimization process unstable and slow.

The paper formally analyzes these issues and discovers that both the bias and variance of the estimated DPO loss and its gradient are directly bounded by the variance of the score estimator. This crucial insight dictates the core strategy: to effectively align MDMs, one must reduce the variance of this score estimator. VRPO then proposes specific, unbiased variance reduction techniques to achieve this, aiming for a more stable and efficient optimization process.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology section first reviews existing alignment methods and MDM specifics, then details how the ELBO is incorporated into DPO, leading to the identification of the variance problem, and finally presents the VRPO solution.
4.2.1. Alignment Methods
The paper first outlines traditional Reinforcement Learning (RL) alignment methods and Direct Preference Optimization (DPO).
4.2.1.1. Reward Modeling
In traditional two-stage RL alignment, the first step involves training a reward model $r_\phi$ using a dataset of human preferences. The dataset $\mathcal{D}$ consists of triplets $(x, y_w, y_l)$, where $x$ is a prompt, $y_w$ is the human-preferred response, and $y_l$ is the less preferred response. The reward model is trained to output higher scores for preferred responses by minimizing the following objective based on the Bradley-Terry formulation:
$
\mathcal{L}_{\mathrm{Reward}}(\phi) \triangleq -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]
$
Here:
- $\phi$ represents the parameters of the reward model.
- $\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}$ denotes the expectation over the preference dataset $\mathcal{D}$.
- $\sigma$ is the sigmoid function, defined as $\sigma(z) = 1/(1 + e^{-z})$. Its role is to map the difference in reward scores to a probability-like value between 0 and 1.
- $r_\phi(x, y_w)$ is the reward score assigned by the model to the preferred response $y_w$ given prompt $x$.
- $r_\phi(x, y_l)$ is the reward score assigned by the model to the less preferred response $y_l$ given prompt $x$.

The objective encourages $r_\phi(x, y_w)$ to be significantly greater than $r_\phi(x, y_l)$.
4.2.1.2. Reinforcement Learning (RL)
In the second stage of traditional RL alignment, a language model policy $\pi_\theta$ (which defines the probability $\pi_\theta(y|x)$ of generating response $y$ given prompt $x$) is optimized via RL to maximize the following objective:
$
\operatorname*{max}_{\pi_\theta} \ \mathbb{E}_{x \sim \mathcal{D}, \ y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right] - \beta \, \mathbb{D}_{\mathrm{KL}} \left( \pi_\theta(\cdot|x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot|x) \right)
$
Here:
- $\theta$ represents the parameters of the language model policy $\pi_\theta$.
- $\mathbb{E}_{x \sim \mathcal{D}, \ y \sim \pi_\theta(\cdot|x)}$ denotes the expectation over prompts $x$ from the dataset and responses $y$ sampled from the current policy $\pi_\theta$.
- $r_\phi(x, y)$ is the reward provided by the trained reward model for generating response $y$ to prompt $x$.
- $\beta$ is a coefficient that controls the strength of the regularization term.
- $\mathbb{D}_{\mathrm{KL}}\left( \pi_\theta(\cdot|x) \Vert \pi_{\mathrm{ref}}(\cdot|x) \right)$ is the Kullback-Leibler (KL) divergence between the current policy and a fixed reference policy $\pi_{\mathrm{ref}}$. The KL divergence measures how one probability distribution diverges from a second, expected probability distribution. This term prevents the optimized policy from drifting too far from the original (often Supervised Fine-Tuned (SFT)) model, which is crucial for maintaining general capabilities and avoiding mode collapse.
- $\pi_{\mathrm{ref}}$ is typically a frozen SFT model.

For autoregressive models (ARMs), both sampling responses from $\pi_\theta$ and evaluating their log-likelihoods are straightforward.
4.2.1.3. Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) simplifies the two-stage RLHF process by directly optimizing a policy without an explicit reward model. The DPO objective minimizes the following loss:
$
\mathcal{L}_{\mathrm{DPO}}(\theta) \triangleq \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \ell_{\mathrm{DPO}}(x, y_w, y_l; \theta) \right]
$
where
$
\ell_{\mathrm{DPO}}(x, y_w, y_l; \theta) \triangleq - \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)
$
Here:
- $\theta$ represents the parameters of the language model policy $\pi_\theta$.
- $\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}$ denotes the expectation over the preference dataset $\mathcal{D}$.
- $\ell_{\mathrm{DPO}}$ is the per-sample DPO loss.
- $\sigma$ is the sigmoid function.
- $\beta$ is the coefficient controlling regularization strength, similar to the RL objective.
- $\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ represents the log-likelihood ratio between the current policy and the reference policy for a given response $y$ and prompt $x$. This term implicitly measures how much the current policy prefers a response compared to the reference policy.

The DPO loss directly encourages the model to increase the log-likelihood ratio for preferred responses ($y_w$) and decrease it for less preferred responses ($y_l$).
4.2.2. Masked Diffusion Models (MDMs)
4.2.2.1. MDM Formulation
The paper describes Masked Diffusion Models (MDMs) as defining a model distribution through a forward-reverse framework.

- Forward Process: Starting from an original input $y$ at $t = 0$, tokens are progressively masked with a probability that increases over time. This leads to a fully masked sequence at $t = 1$. Given a prompt $x$, the forward process is formulated as:
$
q(y_t | t, y, x) = \prod_{i=1}^L q(y_t^i | t, y^i, x), \quad \text{where} \quad q(y_t^i | t, y^i, x) = \begin{cases} 1-t, & y_t^i = y^i, \\ t, & y_t^i = \mathbf{M}. \end{cases}
$
Here:
- $y$ is the original full response (sequence of $L$ tokens).
- $y_t$ is the corrupted sequence at timestep $t$.
- $y^i$ is the $i$-th token of the response $y$.
- $y_t^i$ is the $i$-th token of the corrupted sequence $y_t$.
- $\mathbf{M}$ denotes the mask token.
- $t \in [0, 1]$ is the masking probability (or noise level), increasing from 0 to 1. If $y_t^i = y^i$, the token is unmasked; if $y_t^i = \mathbf{M}$, the token is masked.

- Reverse Process: The model learns to denoise the sequence. For timesteps $0 \leq s < t \leq 1$, the reverse process is defined as:
$
q(y_s | s, t, y_t, x) = \prod_{i=1}^L q(y_s^i | s, t, y_t, x), \quad \text{where} \quad q(y_s^i | s, t, y_t, x) = \begin{cases} \frac{t-s}{t} p_\theta(y^i | y_t, x), & y_t^i = \mathbf{M} \wedge y_s^i \neq \mathbf{M}, \\ \frac{s}{t}, & y_t^i = \mathbf{M} \wedge y_s^i = \mathbf{M}, \\ 1, & y_t^i \neq \mathbf{M} \wedge y_s^i = y_t^i, \\ 0, & \mathrm{otherwise}. \end{cases}
$
Here:
- $y_s$ is the sequence at a previous (less noisy) timestep $s < t$.
- $p_\theta(y^i | y_t, x)$ is the mask prediction model (the neural network) that learns to predict the original token $y^i$ given the masked sequence $y_t$ and prompt $x$.

The reverse process iteratively unmasks tokens, moving from $t = 1$ (fully masked) to $t = 0$ (original sequence).
4.2.2.2. Likelihood Estimation in MDMs
A key challenge for MDMs is that the exact log-likelihood $\log \pi(y|x)$ is intractable. Instead, it is typically approximated by its Evidence Lower Bound (ELBO). The paper uses the following ELBO formulation (Equation (14) from Appendix A, noted to have lower variance than Equation (12)):
$
\mathcal{B}_{\pi}(y|x) \triangleq \mathbb{E}_{l \sim \mathcal{U}(\{1, 2, \ldots, L\})} \, \mathbb{E}_{y_l \sim q(y_l|l, y, x)} \left[ \ell_{\pi}^{\prime}(y_l, l, y|x) \right]
$
where
$
\ell_{\pi}^{\prime}(y_l, l, y|x) \triangleq \frac{L}{l} \sum_{i=1}^L \mathbf{1}[y_l^i = \mathbf{M}] \log p_{\theta}(y^i | y_l, x)
$
Here:
- $\mathcal{B}_{\pi}(y|x)$ is the ELBO approximation of the log-likelihood of response $y$ given prompt $x$ under policy $\pi$.
- $l$ is the number of masked tokens, uniformly sampled from $\{1, 2, \ldots, L\}$ (where $L$ is the sequence length). This ensures that exactly $l$ tokens are masked.
- $\mathcal{U}(\cdot)$ denotes a uniform distribution over the specified set.
- $y_l$ is the sequence obtained by masking $l$ tokens (without replacement).
- $\ell_{\pi}^{\prime}$ is the per-step loss of the mask prediction model.
- $\mathbf{1}[y_l^i = \mathbf{M}]$ is an indicator function, which is 1 if the $i$-th token in $y_l$ is masked, and 0 otherwise.
- $\log p_\theta(y^i | y_l, x)$ is the log-probability predicted by the model for the original token $y^i$, given the masked sequence $y_l$ and prompt $x$. This is essentially the model's ability to denoise or predict the masked token.

The ELBO is an expectation over sampled numbers of masked tokens ($l$) and specific masked configurations ($y_l$). For a well-trained model, the bias of the ELBO relative to the exact likelihood is considered negligible.
4.2.2.3. Monte Carlo Approximation of ELBO
Computing $\mathcal{B}_{\pi}(y|x)$ exactly is intractable due to the double expectations. In practice, it is approximated using a doubly Monte Carlo method.
- Let $n_t$ be the number of samples for timesteps (or, in the chosen formulation, numbers of masked tokens).
- Let $n_{y_t}$ be the number of samples for masked data configurations per timestep (or per number of masked tokens $l$).

The sampling process involves:
$
S_t \triangleq \{ t^{(j)} \}_{j=1}^{n_t} \stackrel{\mathrm{i.i.d.}}{\sim} \mathcal{U}[0,1] \quad \mathrm{and} \quad S_{y_{t^{(j)}}|y} \triangleq \{ y_{t^{(j)}}^{(k)} \}_{k=1}^{n_{y_t}} \stackrel{\mathrm{i.i.d.}}{\sim} q(y_t|t^{(j)}, y), \quad j=1, \ldots, n_t.
$
In the specific formulation used (Equation (14)), this corresponds to sampling $n_t$ values of $l$ and, for each $l$, sampling $n_{y_t}$ masked configurations. The ELBO estimator is then:
$
\widehat{\mathcal{B}}_\pi(y) \triangleq \frac{1}{n_t} \sum_{j=1}^{n_t} \frac{1}{n_{y_t}} \sum_{k=1}^{n_{y_t}} \ell_\pi(y_{t^{(j)}}^{(k)}, t^{(j)}, y)
$
Here:
- $\widehat{\mathcal{B}}_\pi(y)$ is the estimated ELBO.
- $t^{(j)}$ represents the $j$-th sampled timestep (or sampled number of masked tokens $l^{(j)}$).
- $y_{t^{(j)}}^{(k)}$ represents the $k$-th sampled masked data configuration for the $j$-th timestep (or for $l^{(j)}$).
- $\ell_\pi$ is the per-step loss as defined above.

The total number of mask-prediction loss computations is $n = n_t \times n_{y_t}$. This estimator is unbiased for the true ELBO $\mathcal{B}_\pi(y)$, meaning $\mathbb{E}\big[\widehat{\mathcal{B}}_\pi(y)\big] = \mathcal{B}_\pi(y)$. However, due to practical computational constraints, $n$ is typically small, leading to significant variance in the estimate.
4.2.3. Substituting Likelihoods with ELBOs in DPO
The first step of the proposed VRPO framework is to adapt the DPO loss (from Equation (3)) by replacing the intractable log-likelihoods $\log \pi(y|x)$ with their ELBO approximations $\mathcal{B}_\pi(y|x)$.

The modified DPO loss, referred to as the ELBO-based DPO loss, is:
$
\ell_{\mathrm{DPO-E}}(y_w, y_l; \theta) \triangleq - \log \sigma \left( \beta \left( \mathcal{B}_{\pi_\theta}(y_w) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_w) \right) - \beta \left( \mathcal{B}_{\pi_\theta}(y_l) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_l) \right) \right)
$
The term in the argument of the sigmoid function (shown in red in the original paper) is defined as the ELBO-based preference score, denoted $s_\theta(y_w, y_l)$:
$
s_\theta(y_w, y_l) \triangleq \beta \left( \mathcal{B}_{\pi_\theta}(y_w) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_w) \right) - \beta \left( \mathcal{B}_{\pi_\theta}(y_l) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_l) \right)
$
This loss intuitively encourages the current model $\pi_\theta$ to assign a higher ELBO to the preferred response $y_w$ relative to the reference model $\pi_{\mathrm{ref}}$, and a lower ELBO to the less preferred response $y_l$ relative to the reference model.

In practice, each of the four ELBO terms in $s_\theta(y_w, y_l)$ must be estimated using the Monte Carlo method (as in Equation (6)). This leads to the estimated ELBO-based DPO loss:
$
\widehat{\ell}_{\mathrm{DPO-E}}(y_w, y_l; \theta) \triangleq - \log \sigma \left( \beta \left( \widehat{\mathcal{B}}_{\pi_\theta}(y_w) - \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y_w) \right) - \beta \left( \widehat{\mathcal{B}}_{\pi_\theta}(y_l) - \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y_l) \right) \right)
$
The term inside the log sigmoid function is the estimated preference score, denoted $\hat{s}_\theta(y_w, y_l)$. The stochastic sampling involved in estimating this score introduces randomness into the estimated loss, making it a random variable.

Problem Identified: This introduces two issues:
- Bias: Due to the non-linearity of the log sigmoid function, even if $\hat{s}_\theta$ is an unbiased estimator for $s_\theta$ (meaning $\mathbb{E}[\hat{s}_\theta] = s_\theta$), $\widehat{\ell}_{\mathrm{DPO-E}}$ is not an unbiased estimator for $\ell_{\mathrm{DPO-E}}$. That is, $\mathbb{E}[\widehat{\ell}_{\mathrm{DPO-E}}] \neq \ell_{\mathrm{DPO-E}}$. This is illustrated in Figure 2 (a).
- Variance: The stochastic sampling also introduces high variance into both the estimated loss and its gradient. This high variance makes optimization unstable.

The paper formally proves that the preference score estimator is indeed an unbiased estimator for the true preference score.

Proposition 3 (Unbiasedness of preference score estimator): The preference score estimator $\hat{s}_\theta(y_w, y_l)$ defined in Eq. (8) is an unbiased estimator of the true preference score $s_\theta(y_w, y_l)$ defined in Eq. (7):
$
\mathbb{E}_{S_{\hat{s}|y_w, y_l}} \left[ \hat{s}_\theta(y_w, y_l) \right] = s_\theta(y_w, y_l)
$
Here:
- $\mathbb{E}_{S_{\hat{s}|y_w, y_l}}$ denotes the expectation over the stochastic sampling ($S_{\hat{s}|y_w, y_l}$) involved in the estimation of $\hat{s}_\theta$, given the preference pair $(y_w, y_l)$.
- $\hat{s}_\theta(y_w, y_l)$ is the estimated preference score.
- $s_\theta(y_w, y_l)$ is the true (ELBO-based) preference score.

This proposition confirms that, on average, the estimated score is correct, but due to the non-linearity of the log sigmoid, the average of the loss calculated from the estimated score is still biased.
4.2.4. Variance-Reduced Preference Optimization (VRPO)
The core insight of VRPO is that the bias and variance of the estimated DPO loss (and its gradient) are directly governed by the variance of the score estimator, $\mathbb{V}\big[\hat{s}_\theta(y_w, y_l)\big]$. Therefore, reducing this variance will simultaneously mitigate these errors.

The following theorem quantifies this relationship.

Theorem 1: Given a pair of preference data $(y_w, y_l)$, the bias and variance of $\widehat{\ell}_{\mathrm{DPO-E}}$ over stochastic sampling in the score estimation can be bounded as:
$
\mathbb{E}_{S_{\hat{s}|y_w, y_l}} \left[ \left| \ell_{\mathrm{DPO-E}}(y_w, y_l; \theta) - \widehat{\ell}_{\mathrm{DPO-E}}(y_w, y_l; \theta) \right| \right] \leq \sqrt{\mathbb{V}_{S_{\hat{s}|y_w, y_l}} \big[ \hat{s}_\theta(y_w, y_l) \big]}
$
$
\mathbb{V}_{S_{\hat{s}|y_w, y_l}} \left[ \widehat{\ell}_{\mathrm{DPO-E}}(y_w, y_l; \theta) \right] \leq 4 \, \mathbb{E}_{y_w, y_l} \left[ \mathbb{V}_{S_{\hat{s}|y_w, y_l}} \big[ \hat{s}_\theta(y_w, y_l) \big] \right]
$
Here:
- The first inequality bounds the absolute bias (the expected absolute difference between the true and estimated loss) by the square root of the variance of the estimated score.
- The second inequality bounds the variance of the estimated loss by four times the expected variance of the estimated score.

This theorem, illustrated conceptually in Figure 2 (b), shows a direct relationship: reducing the variance of the score estimator $\mathbb{V}[\hat{s}_\theta]$ will reduce both the bias and variance of the DPO loss.

The paper then breaks down the variance of the score estimator to understand how to reduce it. The variance of $\hat{s}_\theta(y_w, y_l)$ (omitting subscripts for brevity) can be expanded as:
$
\mathbb{V} \big[ \hat{s}_\theta (y_w, y_l) \big] = \beta^2 \sum_{y \in \{y_w, y_l\}} \Big[ \mathbb{V} \widehat{\mathcal{B}}_{\pi_\theta}(y) + \mathbb{V} \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y) - 2 \, \mathrm{Corr} \Big( \widehat{\mathcal{B}}_{\pi_\theta}(y), \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y) \Big) \sqrt{\mathbb{V} \widehat{\mathcal{B}}_{\pi_\theta}(y) \, \mathbb{V} \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y)} \Big]
$
This decomposition highlights two strategies for variance reduction:
- Decreasing the variance of each individual ELBO estimation ($\mathbb{V} \widehat{\mathcal{B}}_{\pi_\theta}(y)$ and $\mathbb{V} \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y)$).
- Increasing the correlation between the ELBO estimates of the current policy and the reference policy for the same input ($\mathrm{Corr}(\widehat{\mathcal{B}}_{\pi_\theta}(y), \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y))$).

VRPO integrates three principled techniques to implement these strategies:
4.2.4.1. Technique 1: Sampling Budget
The first technique is to increase the total number of samples $n = n_t \times n_{y_t}$ used to estimate each ELBO $\widehat{\mathcal{B}}_\pi(y)$. More samples inherently lead to lower variance. This increases computational cost proportionally (e.g., with $n = 8$, roughly 8x more FLOPs than a single-sample estimate).
4.2.4.2. Technique 2: Optimal Allocation
The second technique, applied when a fixed total budget $n$ is available, is to optimally allocate this budget. Specifically, it involves setting $n_t = n$ and $n_{y_t} = 1$. This means sampling $n$ distinct timesteps (or numbers of masked tokens $l$) and only one masked data configuration per timestep.

Proposition 1 (Reduce the ELBO variance): Given a total budget of $n = n_t \times n_{y_t}$ masked samples for estimating $\widehat{\mathcal{B}}_\pi(y)$, we have:
(i)
$
\mathbb{V} \widehat{\mathcal{B}}_\pi(y) = \Theta\left(\frac{1}{n}\right)
$
(ii) $\mathbb{V} \widehat{\mathcal{B}}_\pi(y)$ is minimized when $n_t = n$ and $n_{y_t} = 1$.

Here:
- $\Theta(1/n)$ indicates that the variance of the ELBO estimator is inversely proportional to the total number of samples $n$. This formally justifies increasing the sampling budget.
- The second part states that, to minimize this variance for a fixed $n$, one should prioritize sampling more timesteps ($n_t$) over more masked data configurations per timestep ($n_{y_t}$). This is because the variance of the ELBO estimator (from Lemma 5 in Appendix B.1.5) is given by:
$
\mathbb{V} \widehat{\mathcal{B}}_\pi(y) = \frac{1}{n_t} V_t + \frac{1}{n_t n_{y_t}} V_{y_t}
$
where $V_t$ is the variance across timesteps and $V_{y_t}$ is the variance due to masked data at each step. For a fixed budget $n = n_t n_{y_t}$, the second term always equals $V_{y_t}/n$, while the first term $V_t/n_t$ is smallest when $n_t$ is as large as possible, i.e., $n_t = n$ and $n_{y_t} = 1$, giving a total variance of $(V_t + V_{y_t})/n$. This strategy effectively spreads the budget across the most significant source of variability (timesteps). This technique incurs no additional computational cost, as it only redistributes existing samples.
4.2.4.3. Technique 3: Antithetic Sampling
The third technique involves sharing the same sampled timesteps and masked data between the ELBO estimates of the current policy ($\pi_\theta$) and the reference policy ($\pi_{\mathrm{ref}}$) for the same input $y_w$ or $y_l$.

Proposition 2 (Antithetic Sampling): When $\mathrm{Corr}\big( \widehat{\mathcal{B}}_{\pi_\theta}(y), \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y) \big) > 0$ (i.e., the ELBO estimates for the two policies are positively correlated) and the Monte Carlo samples $S_t$ and $S_{y_t|y}$ are shared between $\widehat{\mathcal{B}}_{\pi_\theta}(y)$ and $\widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y)$, sharing Monte Carlo samples yields lower $\mathbb{V}\big[\hat{s}_\theta(y_w, y_l)\big]$ than using independent samples.

Here:
- $\mathrm{Corr}(\cdot, \cdot)$ denotes the correlation coefficient. The proposition states that if the ELBO estimates from the current and reference policies are positively correlated (which is common, especially since $\pi_{\mathrm{ref}}$ is typically an SFT model close to $\pi_\theta$), reusing the same random samples for both estimates will lead to a lower variance in their difference, and thus in the overall score estimator. This is a classic antithetic variates technique and provides a free lunch, as it reuses existing samples without additional computation.
The conceptual flow of VRPO is illustrated in Figure 4:
This image is a schematic of Figure 4 from the paper, illustrating the analysis flow of the proposed VRPO method. It covers the control of bias and variance, the identification of two variance-reduction strategies, and the final VRPO scheme comprising the sampling budget, optimal allocation, and antithetic sampling.
Figure 4: Illustration of the analysis process. This diagram outlines the conceptual flow that leads to the proposed VRPO method. Gray boxes represent theoretical analyses, and the blue box highlights the final sampling strategy. Starting from a bias and variance analysis of the estimated loss and gradient, we identify the score-estimator variance as a dominant controller. These theoretical findings collectively motivate the design of the VRPO algorithm, which is equipped with provable properties (dashed lines): unbiasedness and guaranteed variance reduction.
The VRPO algorithm is further illustrated in Figure 3:
This image is a schematic of Figure 3, comparing VRPO without optimal allocation and antithetic sampling (left) against VRPO with these strategies (right). The VRPO on the right allocates the sampling budget across timesteps, samples only one masked sequence per timestep, and shares Monte Carlo samples between paired ELBOs when computing the loss.
Figure 3: Illustration of VRPO. We compare VRPO (right) with VRPO without optimal allocation and antithetic sampling (left). VRPO allocates the sampling budget across timesteps to sample only one masked data per timestep (indicated by red arrows) and shares Monte Carlo samples between paired ELBOs (highlighted with the red annotations above the blocks).
Proposition 4 (Unbiasedness of VRPO): The paper confirms that all these variance reduction techniques (sampling budget, optimal allocation, and antithetic sampling) do not introduce bias into the preference score estimator, maintaining its unbiasedness for the true preference score.
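Putting the three techniques together, the following PyTorch-style sketch shows how a VRPO preference score could be estimated: the budget is spent as $n_t = n$, $n_{y_t} = 1$, and the same masked views are reused for $\pi_\theta$ and $\pi_{\mathrm{ref}}$ (antithetic sampling). The function names and model interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical [MASK] token id

def shared_masked_views(y, n):
    """Draw n masked views of y (n_t = n, n_yt = 1), reused by both policies."""
    L = y.numel()
    views = []
    for _ in range(n):
        l = int(torch.randint(1, L + 1, (1,)))   # number of masked tokens
        idx = torch.randperm(L)[:l]
        y_masked = y.clone()
        y_masked[idx] = MASK_ID
        views.append((y_masked, idx, l))
    return views

def elbo_hat(log_prob_fn, y, views):
    """Average the per-view mask-prediction losses (Eq. (6) with n_yt = 1)."""
    L = y.numel()
    terms = []
    for y_masked, idx, l in views:
        logp = log_prob_fn(y_masked)             # (L, V) log-probs from the policy
        terms.append((L / l) * logp[idx, y[idx]].sum())
    return torch.stack(terms).mean()

def vrpo_dpo_loss(policy_fn, ref_fn, y_w, y_l, beta=0.1, n=8):
    # In practice the reference ELBO would be computed under torch.no_grad().
    score = 0.0
    for y, sign in ((y_w, +1.0), (y_l, -1.0)):
        views = shared_masked_views(y, n)        # shared: antithetic sampling
        score = score + sign * beta * (elbo_hat(policy_fn, y, views)
                                       - elbo_hat(ref_fn, y, views))
    return -F.logsigmoid(score)
```

Sharing `views` between the two `elbo_hat` calls is precisely what Proposition 2 exploits: the noise in the two estimates largely cancels when their difference is taken.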
4.2.4.4. Deferred Analysis of Estimated Gradient
The paper also extends this analysis to the gradient of the DPO loss.

Assumption 1 (Bounded gradient of per-step mask prediction loss): The gradient of the per-step mask prediction loss (Eq. (4)) is bounded, i.e., there exists a constant $C$ such that $\| \nabla_\theta \ell_{\pi_\theta}(y_t, t, y) \|_2 \leq C$ for all $\theta$ in the model parameter space, $t$ in $[0, 1]$, and masked data $y_t$.

This assumption is practically reasonable for neural networks.

Corollary 1 (Bounded gradient of preference score estimator): Under Assumption 1, the gradient of the preference score estimator is bounded, i.e., there exists a constant $\tilde{C}$ such that $\| \nabla_\theta \hat{s}_\theta(y_w, y_l) \|_2 \leq \tilde{C}$ for all $\theta$ in the model parameter space and $(y_w, y_l)$ in the data space. This is derived from the linearity of gradients and the boundedness of the individual loss gradients.

Theorem 4: Suppose Assumption 1 holds. Then, there exists a constant $\tilde{C}$ such that, given a pair of preference data $(y_w, y_l)$, the bias and variance of $\nabla_\theta \widehat{\ell}_{\mathrm{DPO-E}}$ can be bounded as:
$
\mathbb{E}_{S_{\hat{s}|y_w, y_l}} \left[ \left\| \nabla_\theta \ell_{\mathrm{DPO-E}}(y_w, y_l; \theta) - \nabla_\theta \widehat{\ell}_{\mathrm{DPO-E}}(y_w, y_l; \theta) \right\|_2 \right] \leq \frac{\tilde{C}}{4} \sqrt{\mathbb{V}_{S_{\hat{s}|y_w, y_l}} \big[ \hat{s}_\theta(y_w, y_l) \big]} + \sqrt{\mathrm{tr} \, \mathbb{V}_{S_{\hat{s}|y_w, y_l}} \big[ \nabla_\theta \hat{s}_\theta(y_w, y_l) \big]}
$
and
$
\mathrm{tr} \, \mathbb{V}_{S_{\hat{s}|y_w, y_l}} \left[ \nabla_\theta \widehat{\ell}_{\mathrm{DPO-E}}(y_w, y_l; \theta) \right] \leq \frac{\tilde{C}^2}{8} \mathbb{V}_{S_{\hat{s}|y_w, y_l}} \big[ \hat{s}_\theta(y_w, y_l) \big] + \mathrm{tr} \, \mathbb{V}_{S_{\hat{s}|y_w, y_l}} \big[ \nabla_\theta \hat{s}_\theta(y_w, y_l) \big]
$
Here:
- The left-hand side of the first inequality is the expected absolute bias of the gradient estimator.
- $\mathrm{tr} \, \mathbb{V}[\cdot]$ is the trace of the covariance matrix of the gradient estimator (a measure of its total variance).
- $\nabla_\theta \ell_{\mathrm{DPO-E}}$ is the true gradient of the DPO loss.
- $\nabla_\theta \widehat{\ell}_{\mathrm{DPO-E}}$ is the estimated gradient of the DPO loss.

This theorem demonstrates that the bias and variance of the estimated gradient are also bounded by the variance of the score estimator ($\mathbb{V}[\hat{s}_\theta]$) and the variance of its gradient ($\mathrm{tr} \, \mathbb{V}[\nabla_\theta \hat{s}_\theta]$).

Proposition 5 (Sampling budget and allocation for gradient variance): Let $\widehat{\mathcal{B}}_\pi(y)$ be estimated using a total of $n = n_t \times n_{y_t}$ masked samples. Then we have:
(i)
$
\mathbb{V} \big[ \nabla_\theta \widehat{\mathcal{B}}_\pi(y) \big] = \Theta\left(\frac{1}{n}\right)
$
(ii) $\mathbb{V} \big[ \nabla_\theta \widehat{\mathcal{B}}_\pi(y) \big]$ is minimized when $n_t = n$ and $n_{y_t} = 1$ with a fixed $n$.

This proposition, analogous to Proposition 1, confirms that increasing the sampling budget and applying optimal allocation also effectively reduce the variance of the gradient of the ELBO estimator, and consequently the variance of the overall DPO gradient.
4.2.5. Extension to Other Alignment Methods
The techniques and analysis of VRPO are broadly applicable beyond DPO. Other algorithms like PPO and GRPO also rely on estimating likelihood terms (e.g., $\log \pi_\theta(y|x)$) or likelihood-ratio terms (e.g., $\pi_\theta(y|x) / \pi_{\mathrm{old}}(y|x)$), which, when applied to MDMs, would similarly involve ELBO-based estimations. The variance reduction strategies can be directly applied to these ELBO estimations, often with simpler analysis because these methods do not involve the outer non-linear log sigmoid function that complicates DPO's theoretical guarantees.
5. Experimental Setup
5.1. Datasets
The authors trained LLaDA 8B Instruct using VRPO on 350K preference pairs for one epoch to produce LLaDA 1.5.
- Source: The data was collected internally at scale by the authors, implying proprietary or custom datasets.
- Processing: The data underwent several steps:
  - Filtering out low-quality samples.
  - Removing duplicates via similarity matching.
  - Using a reward model to rank data, suggesting that some form of preference or quality scoring was applied to select high-quality pairs.
  - Replacing some responses with outputs from advanced LLMs, potentially to further enhance quality or diversity.
- Characteristics and Domain: The dataset is described as high-quality and diverse, covering a wide range of topics:
  - Creative writing: ~35%
  - Knowledge Q&A: ~18%
  - NLP tasks: ~16%
  - Mathematics tasks: ~14%
  - Recommendation tasks: ~7%
  - Code generation: ~5%
  - Reasoning tasks: ~3%
  - A small portion of safety and other tasks.
- Choice Rationale: These datasets are crucial for validating the general capabilities of the proposed method across various domains where LLMs are typically evaluated, ensuring that VRPO improves alignment across a broad spectrum of human preferences.
5.2. Evaluation Metrics
For every evaluation metric, a conceptual definition is provided, along with its mathematical formula where applicable.
5.2.1. Mathematics & Scientific Reasoning
These benchmarks assess the model's ability to solve mathematical problems and answer scientific questions.
5.2.1.1. GSM8K (Grade School Math 8K)
- Conceptual Definition: GSM8K evaluates a model's ability to solve grade school level math word problems. The problems often require multiple steps of reasoning to arrive at the correct numerical answer.
- Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Problems}} $
- Symbol Explanation:
  - Number of Correct Answers: the count of problems where the model's final answer matches the ground truth.
  - Total Number of Problems: the total number of math problems in the GSM8K dataset.

5.2.1.2. Math
- Conceptual Definition: The MATH dataset contains a broad set of mathematics problems from various school levels (e.g., algebra, geometry, number theory). It is designed to be challenging and requires advanced mathematical reasoning.
- Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Problems}} $
- Symbol Explanation:
  - Number of Correct Answers: the count of problems where the model's final answer matches the ground truth.
  - Total Number of Problems: the total number of math problems in the MATH dataset.

5.2.1.3. GPQA (Graduate-level Google-Proof Q&A)
- Conceptual Definition: GPQA is a challenging question-answering benchmark comprising graduate-level questions from biology, physics, and chemistry. It is designed to be "Google-proof," meaning answers are not easily found with simple web searches and require deep scientific understanding and reasoning.
- Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Questions}} $
- Symbol Explanation:
  - Number of Correct Answers: the count of questions where the model's answer is factually correct.
  - Total Number of Questions: the total number of questions in the GPQA dataset.

5.2.2. Code Generation
These benchmarks evaluate the model's ability to generate functional code.

5.2.2.1. HumanEval
- Conceptual Definition: HumanEval assesses a model's ability to generate Python code that solves specific programming problems. Each problem comes with a function signature and docstring, and the model must complete the function body. Solutions are evaluated by executing test cases.
- Mathematical Formula (Pass@k): The typical metric is Pass@k, which estimates the probability that at least one of $k$ generated samples passes the unit tests: $ \mathrm{Pass@}k = \frac{\sum_{i=1}^{N} \mathbf{1}[\text{at least one of } k \text{ samples for problem } i \text{ passes}]}{N} $
- Symbol Explanation:
  - $N$: the total number of problems in the HumanEval benchmark.
  - $\mathbf{1}[\cdot]$: an indicator function that is 1 if the condition is true, and 0 otherwise.
  - $k$: the number of code samples generated by the model for each problem.

5.2.2.2. MBPP (Mostly Basic Python Programs)
- Conceptual Definition: MBPP is a dataset of Python programming problems designed to evaluate the capability of models to generate short, correct Python programs from natural language descriptions.
- Mathematical Formula (Pass@k): Similar to HumanEval, Pass@k is used: $ \mathrm{Pass@}k = \frac{\sum_{i=1}^{N} \mathbf{1}[\text{at least one of } k \text{ samples for problem } i \text{ passes}]}{N} $
- Symbol Explanation:
  - $N$: the total number of problems in the MBPP benchmark.
  - $\mathbf{1}[\cdot]$: an indicator function.
  - $k$: the number of code samples generated by the model for each problem.

5.2.3. Alignment Tasks
These benchmarks evaluate how well the model adheres to instructions, avoids harmful content, and generally aligns with human preferences in open-ended conversations.

5.2.3.1. IFEval (Instruction-Following Evaluation)
- Conceptual Definition: IFEval measures a model's ability to follow complex instructions, including constraints on output format, content, and style. It assesses whether the model can generate responses that adhere to all specified directives.
- Evaluation Method: Typically involves automated checks or LLM-as-a-judge evaluation to determine adherence to instructions. The reported score usually represents the percentage of instructions correctly followed.

5.2.3.2. Arena-Hard
- Conceptual Definition: Arena-Hard is a benchmark for evaluating challenging instruction-following and safety scenarios where models are prone to failure. It focuses on adversarial or difficult prompts to push the limits of model alignment.
- Evaluation Method: Involves LLM-as-a-judge evaluation, where a powerful reference LLM (e.g., GPT-4) is used to rate the quality of responses, often in terms of win rate against other models or a direct quality score.

5.2.3.3. AlignBench
- Conceptual Definition: AlignBench is specifically designed for benchmarking the Chinese-language alignment of large language models. It covers various aspects of helpfulness, harmlessness, and adherence to cultural norms in Chinese language contexts.
- Evaluation Method: Uses an LLM-as-a-judge approach, where a powerful LLM evaluates the quality, alignment, and helpfulness of responses to Chinese prompts.

5.2.3.4. MTBench
- Conceptual Definition: MTBench is a multi-turn benchmark for evaluating conversational AI models. It consists of multi-turn prompts covering various categories (e.g., writing, reasoning, roleplay) and assesses a model's ability to maintain coherence, provide relevant information, and follow instructions over multiple turns.
- Evaluation Method: Relies on LLM-as-a-judge (e.g., GPT-4) to score the quality of model responses in a multi-turn dialogue. The scores are often averaged across prompts and turns.

5.2.4. LLM-as-a-Judge
For metrics like Arena-Hard, AlignBench, and MTBench, the paper mentions that results are obtained via LLM-as-a-judge scoring using the gpt-4-32k API.
- Conceptual Definition: LLM-as-a-judge is an evaluation paradigm where a powerful, larger language model (the "judge" LLM) is used to rate or rank the outputs of other LLMs. The judge LLM is given the prompt, the generated responses from different models, and sometimes a rubric, and then it produces a score or a preference ranking. This method attempts to automate human-like evaluation for complex, open-ended generation tasks where traditional metrics are insufficient.
5.3. Baselines
The paper compares LLaDA 1.5 against two baselines:
- LLaDA 8B Instruct: This is the Supervised Fine-Tuned (SFT)-only predecessor model. It represents the base model before any preference optimization is applied.
- LLaDA DPO: This baseline applies naive DPO to LLaDA. It uses a minimal sampling configuration and no antithetic sampling. This effectively represents a direct application of DPO to MDMs without the proposed variance reduction techniques, serving as a direct comparison to highlight the impact of VRPO.
5.4. Computational Cost
- VRPO Sampling Budget: A default sampling budget of $n = 8$ is used for VRPO.
- Overhead: This results in approximately an 8-fold increase in computation compared to methods without Monte Carlo estimation (e.g., ARMs) or with a minimal budget.
- Affordability: Despite this increase, the overall cost remains modest, less than 0.5% of the pre-training cost of LLaDA, making the overhead practically acceptable.
- GPU Hours: Training consumed approximately 405 H100 GPU hours for 8 Monte Carlo samples.
5.5. Implementation Details
- Packing Strategy: VRPO is implemented using a packing strategy where multiple preference data samples are combined into a single sequence. An attention mask is used to ensure tokens from different samples cannot attend to each other (a minimal sketch of such a mask follows this list).
- Padding: All sequences are padded to a fixed length of 4096 tokens with [EOS] (End-Of-Sequence) tokens, matching LLaDA's pre-training context length. These padded [EOS] tokens are excluded from loss calculation.
- Model Architecture (LLaDA): A Transformer Encoder-based masked diffusion model with 8 billion parameters. It follows the LLaMA architecture: RMSNorm (Zhang and Sennrich, 2019) for normalization, RoPE (Su et al., 2024) for positional encoding, and SwiGLU (Shazeer, 2020) as the activation function. The detailed architecture is shown in Table 3. The following are the results from Table 3 of the original paper:

| | LLaDA |
|---|---|
| Layers | 32 |
| Model dimension | 4096 |
| Attention heads | 32 |
| Vocabulary size | 126,464 |
| FFN dimension | 12,288 |
| Key/Value heads | 32 |
| Total parameters | 8.02 B |
| Non-embedding parameters | 6.98 B |

- Training Configuration:
  - Epochs: 1 epoch.
  - Batch Size: 64.
  - Optimizer: AdamW with weight decay of 0.01, $\beta_1$ of 0.9, and $\beta_2$ of 0.95.
  - Learning Rate Schedule: 15 warmup steps to the maximum learning rate, followed by cosine decay.
  - DPO Loss Coefficient: $\beta$.
  - MDMs SFT Loss: complemented with a 0.05-weighted MDMs SFT loss for training stability.
  - Reference Policy: $\pi_{\mathrm{ref}}$ is initialized with LLaDA Instruct.
  - Hyperparameter Tuning: no hyperparameter search was performed due to hardware constraints.
5.6. Evaluation Details
- Sampling Strategies for MDMs: MDMs benefit from various inference sampling strategies to enhance sample quality. The paper employs:
Diffusion Sampling: The standard reverse process.Diffusion Semi-Autoregressive Sampling: Generates tokens in blocks, with each block generated by the diffusion process, and blocks generated autoregressively.Low-Confidence Remasking: Remasks predicted tokens that have the lowest confidence scores during inference, and then re-predicts them.
- [EOS] Token Confidence: A critical observation was that LLaDA SFT models tend to generate excessive [EOS] tokens, leading to incomplete content. To address this, the confidence score for the [EOS] token was set to zero during inference, which improved the HumanEval score from 47.6 to 49.4 for LLaDA. This setting was adopted for evaluation.
- Optimal Inference Configurations: For fair comparison, both LLaDA and LLaDA 1.5 were evaluated using diffusion sampling and semi-autoregressive sampling, reporting the best results found by tuning the answer length ({64, 128, 256, 512, 1024}) and the block length for semi-autoregressive sampling ({8, 16, 32, 64, 128}). These optimal configurations are detailed in Table 6.
- LLM-as-a-Judge Evaluation: MTBench, AlignBench, and Arena-Hard benchmarks use the gpt-4-32k API for scoring.
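As a rough illustration of low-confidence remasking combined with the zeroed [EOS] confidence, consider the following sketch of a single reverse step. Function and argument names are hypothetical, and the actual LLaDA sampler differs in details such as the unmasking schedule:

```python
import torch
import torch.nn.functional as F

def remask_step(logits, x_t, mask_id, eos_id, num_to_keep):
    """One illustrative low-confidence remasking step for an MDM.

    logits: (L, V) model predictions for the partially masked sequence x_t.
    Predict every masked position, keep only the num_to_keep most confident
    predictions, and leave the rest masked. Zeroing the [EOS] confidence
    (as described above) prevents early termination via spurious [EOS] tokens.
    """
    probs = F.softmax(logits, dim=-1)
    pred = probs.argmax(dim=-1)                       # greedy token prediction per position
    conf = probs.max(dim=-1).values                   # confidence of each prediction
    conf = torch.where(pred == eos_id, torch.zeros_like(conf), conf)   # zero [EOS] confidence

    masked = (x_t == mask_id)
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))      # only masked slots compete

    keep = conf.topk(min(num_to_keep, int(masked.sum()))).indices
    x_next = x_t.clone()
    x_next[keep] = pred[keep]                         # unmask the most confident positions
    return x_next                                     # remaining positions stay masked
```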
5.7. Calculation of Variances (for Ablation Studies)
For the ablation studies (Table 2), the variances of the score estimator, loss, and gradient were estimated:
- Data Points: 128 preference data samples.
- Batch Size: Processed with a batch size of 16.
- Independent Calculations: 8 independent calculations were performed for each data point.
- Model Checkpoints: The reference policy π_ref was initialized with LLaDA, and the policy π_θ was a model checkpoint from the VRPO training process.
- Gradient Variance Proxy: Due to the computational cost of storing full gradients for large models, the gradients of the up-projection layer within the Feed-Forward Network module of the first transformer block were used as a proxy for the full gradient variance (see the sketch below).
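A simple sketch of how such repeated-estimate variances could be computed; the `estimator` callable is a placeholder for the score, loss, or layer-gradient proxy, and this is not the authors' code:

```python
import torch

def empirical_variance(estimator, data_points, n_repeats=8):
    """Average per-data-point variance of a stochastic scalar estimator.

    For each data point, recompute the estimate n_repeats times with fresh
    Monte Carlo randomness, take the unbiased variance across repeats, and
    average over data points. `estimator(x)` is a placeholder returning a
    scalar tensor, e.g. an ELBO-based score, the DPO loss, or the norm of
    the chosen layer's gradient used as a proxy.
    """
    per_point_vars = []
    for x in data_points:
        estimates = torch.stack([estimator(x) for _ in range(n_repeats)])
        per_point_vars.append(estimates.var(unbiased=True))
    return torch.stack(per_point_vars).mean()
```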
6. Results & Analysis
6.1. Core Results Analysis
The experimental results consistently demonstrate the effectiveness of VRPO and the superior performance of LLaDA 1.5 over its SFT-only predecessor (LLaDA Instruct) and a naive DPO baseline.
The following are the main benchmark results from Table 1 of the original paper:
| | LLaDA 8B Instruct | LLaDA DPO | LLaDA 1.5 8B |
|---|---|---|---|
| Post-training | SFT | SFT + naive DPO | SFT + VRPO (Ours) |
| Mathematics & Science | |||
| GSM8K | 78.6 | 80.7 (+2.1) | 83.3 (+4.7) |
| Math | 42.2 | 41.6 (-0.6) | 42.6 (+0.4) |
| GPQA | 33.3 | 34.3 (+1.0) | 36.9 (+3.6) |
| Code | |||
| HumanEval | 49.4 | 48.2 (-1.2) | 52.4 (+3.0) |
| MBPP | 41.0 | 41.4 (+0.4) | 42.8 (+1.8) |
| Alignment Tasks | |||
| IFEval | 62.2 | 62.0 (-0.2) | 66.2 (+4.0) |
| Arena-Hard | 10.0 | 11.9 (+1.9) | 14.3 (+4.3) |
| AlignBench | 5.4 | 5.8 (+0.4) | 5.9 (+0.5) |
| MTbench | 7.2 | 7.1 (-0.1) | 7.3 (+0.1) |
Key Observations from Table 1:
- Consistent Outperformance: LLaDA 1.5 (trained with VRPO) consistently outperforms both LLaDA 8B Instruct (SFT-only) and LLaDA DPO (naive DPO) across all evaluated benchmarks. This strong and uniform improvement underscores the effectiveness of the VRPO framework.
- Significant Gains: The improvements are particularly significant on:
  - GSM8K (+4.7 over SFT-only, +2.6 over naive DPO).
  - GPQA (+3.6 over SFT-only, +2.6 over naive DPO).
  - HumanEval (+3.0 over SFT-only, +4.2 over naive DPO).
  - IFEval (+4.0 over SFT-only, +4.2 over naive DPO).
  - Arena-Hard (+4.3 over SFT-only, +2.4 over naive DPO).
- Naive DPO Limitations: LLaDA DPO (naive DPO without VRPO's techniques) shows mixed results, even degrading performance on Math, HumanEval, IFEval, and MTBench compared to the SFT-only baseline. This highlights the critical issue of high variance in MDM alignment and validates the paper's central motivation: simply applying DPO is not sufficient for MDMs.
- Mathematical Prowess: LLaDA 1.5 demonstrates strong mathematical performance. As shown in the right panel of Figure 1, it achieves competitive results compared to other strong language MDMs and ARMs, and notably attains the highest four-shot score on GSM8K and the highest zero-shot score on Math among the compared models.
(Figure 1 image: the left radar chart shows that LLaDA 1.5 improves consistently and significantly over LLaDA SFT across multiple benchmarks; the right bar chart compares the mathematical performance (GSM8K and Math) of several language models, with LLaDA 1.5 performing strongly.)
Figure 1: Benchmark results. The left panel shows that LLaDA 1.5 improves LLaDA consistently and significantly on various benchmarks. The right panel demonstrates that LLaDA 1.5 has a highly competitive mathematical performance compared to strong language MDMs and ARMs.
6.2. Ablation Studies
The paper conducts detailed ablation studies to assess the impact of each variance reduction technique within VRPO. These studies manipulate the sampling budget (`n = n_t × n_{y_t}`), the allocation strategy (how the budget is split between the number of timesteps `n_t` and the number of masked samples per timestep `n_{y_t}`), and the use of antithetic sampling. The results confirm the theoretical analyses presented in Section 3.
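To make these knobs concrete, the sketch below shows how an ELBO-based log-likelihood ratio could be estimated with `n_t` timesteps, `n_{y_t}` masked samples per timestep, and antithetic sampling realized by sharing the same draws between the policy and the reference model. The helpers `sample_mask` and `elbo_term` are assumptions for illustration, not the paper's code:

```python
import torch

def elbo_estimate(model, prompt, response, timesteps, masks, elbo_term):
    """Average the per-draw ELBO terms over the shared Monte Carlo samples."""
    terms = [elbo_term(model, prompt, response, t, m)
             for t, m in zip(timesteps, masks)]
    return torch.stack(terms).mean()

def vrpo_log_ratio(policy, ref, prompt, response, n_t, n_yt, sample_mask, elbo_term):
    """Sketch of a variance-reduced log-likelihood-ratio estimate for DPO on MDMs.

    n_t timesteps with n_yt masked samples each give a total budget n = n_t * n_yt
    (optimal allocation corresponds to n_yt = 1). Antithetic sampling is realized
    by evaluating the policy and the reference model on the *same* timesteps and
    masked sequences, so their estimation noise largely cancels in the difference.
    """
    timesteps = torch.rand(n_t).repeat_interleave(n_yt)      # t ~ U(0, 1), reused n_yt times
    masks = [sample_mask(response, t) for t in timesteps]    # one random masking per draw

    elbo_policy = elbo_estimate(policy, prompt, response, timesteps, masks, elbo_term)
    elbo_ref = elbo_estimate(ref, prompt, response, timesteps, masks, elbo_term)
    return elbo_policy - elbo_ref                            # shared randomness => reduced variance
```

Disabling antithetic sampling would amount to drawing independent timesteps and masks for the policy and reference ELBOs instead of sharing them.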
The following are the results from Table 2 of the original paper:
| | Base | Budget | Budget | Allocation | Allocation | Antithetic |
|---|---|---|---|---|---|---|
| # Timesteps `n_t` | 4 | 1 | 8 | 1 | 2 | 4 |
| # Masked samples `n_{y_t}` | 1 | 1 | 1 | 4 | 2 | 1 |
| Antithetic sampling | ✓ | ✓ | ✓ | ✓ | ✓ | X |
| Variances | ||||||
| Var of score estimator | 2.2 | 44.0 | 1.0 | 7.3 | 4.7 | 2183.7 |
| Var of loss | 3.1e-3 | 8.7e-2 | 2.6e-3 | 3.2e-2 | 7.3e-3 | 62.0 |
| Var of gradient | 2.5 | 13.0 | 1.6 | 4.7 | 2.5 | 10.6 |
| Mathematics & Science | ||||||
| GSM8K | 82.8 | 80.1 | 83.3 | 81.4 | 82.3 | 82.0 |
| Math | 42.3 | 41.7 | 42.6 | 41.9 | 42.4 | 42.4 |
| GPQA | 36.4 | 34.3 | 36.9 | 34.9 | 36.4 | 35.9 |
| Code | ||||||
| HumanEval | 51.2 | 50.6 | 52.4 | 48.2 | 48.8 | 47.0 |
| MBPP | 42.8 | 40.6 | 42.8 | 40.8 | 41.0 | 41.2 |
| Alignment Tasks | ||||||
| IFEval | 66.1 | 63.9 | 66.2 | 64.8 | 66.2 | 65.8 |
| Arena-Hard | 13.9 | 13.5 | 14.3 | 13.8 | 13.4 | 15.6 |
| AlignBench | 5.9 | 5.6 | 5.9 | 5.8 | 5.9 | 5.9 |
| MTbench | 7.4 | 7.0 | 7.3 | 7.0 | 7.2 | 7.2 |
Key Observations from Table 2:
- Impact of Score Estimator Variance: The results strongly support Theorem 1. Lower variance of the score estimator generally leads to lower variance in both the loss and the gradient, as well as improved task performance. For example, reducing the score-estimator variance from 44.0 to 1.0 (by increasing the budget from `n = 1` to `n = 8`) correlates with a significant jump in GSM8K accuracy from 80.1 to 83.3.
- Increasing Sampling Budget: Increasing the sampling budget `n` consistently reduces estimator variance and improves performance. Comparing "Budget" with `n = 1` (naive configuration) to "Budget" with `n = 8` (optimal allocation), the score-estimator variance drops from 44.0 to 1.0, and GSM8K improves from 80.1 to 83.3. This validates Proposition 1 (i).
- Optimal Allocation: Under a fixed sampling budget, optimal allocation (e.g., "Base" with `n_t = 4`, `n_{y_t} = 1`) generally yields lower variance and better results than repeating multiple masked samples per timestep (e.g., "Allocation" with `n_t = 1`, `n_{y_t} = 4`). For a budget of `n = 4`, "Base" has a score-estimator variance of 2.2, while "Allocation" has 7.3. The "Base" configuration performs better on most tasks, supporting Proposition 1 (ii).
- Antithetic Sampling: Antithetic sampling (✓ vs. ✗ in the last column) leads to notable decreases in variance. Comparing "Base" (antithetic ✓, variance 2.2) to "Antithetic" (antithetic ✗, variance 2183.7), the variance reduction is dramatic. This confirms Proposition 2.
  - Performance vs. Variance Trade-off: Interestingly, while antithetic sampling drastically reduces variance, the translation into downstream benchmark improvements is not always as dramatic. For instance, "Antithetic" (without antithetic sampling) achieves a higher Arena-Hard score (15.6) than "Base" (13.9), despite significantly higher variance. The authors hypothesize that disabling antithetic sampling might expose the model to a broader diversity of data patterns, which could benefit certain downstream tasks, suggesting a complex interplay between optimization stability and generalization.
6.3. Loss Curves
Figure 5 illustrates the training loss dynamics for different configurations, providing a visual complement to the quantitative results in Table 2.
Figure 5: Loss curves under different variance reduction strategies. Top: w/ antithetic sampling; bottom: w/o antithetic sampling. The curve labeled "w/o antithetic sampling, `n_t = 1`, `n_{y_t} = 1`" corresponds to the training loss of the naive DPO baseline reported in Table 1; all other curves come from the ablation study in Table 2, obtained by varying the number of timesteps `n_t`, the number of masked samples `n_{y_t}`, and whether antithetic sampling is applied. We present two panels because the loss magnitudes differ substantially across settings. For visual clarity, all curves are smoothed with an exponential moving average with coefficient 0.3.
Key Observations from Figure 5:
- Smoother and Lower Variability: Configurations with variance reduction strategies applied (e.g., a higher budget `n`, optimal allocation, antithetic sampling) result in smoother loss curves with substantially lower variability. This directly demonstrates the improved stability of the optimization dynamics of MDMs.
- Faster Convergence and Lower Final Loss: The smoother curves also tend to show a faster decrease in loss and often reach a lower final loss value. This is consistent with the theoretical expectation that reduced gradient variance leads to more efficient and effective optimization.
- Impact of Antithetic Sampling: The two panels (with/without antithetic sampling) clearly show the large difference in loss magnitude, especially in the initial spikes. Disabling antithetic sampling drastically increases the loss variability, making training much noisier.
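For completeness, the exponential moving average used to smooth the plotted curves (coefficient 0.3) can be written as the simple recurrence below; this is a generic smoothing sketch, assuming the coefficient weights the newest value:

```python
def ema_smooth(values, alpha=0.3):
    """Exponential moving average: s_0 = x_0, s_i = alpha * x_i + (1 - alpha) * s_{i-1}."""
    smoothed, prev = [], None
    for x in values:
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed
```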
6.4. Sampling Strategies Ablation
The paper further explores the generality of VRPO by evaluating LLaDA and LLaDA 1.5 across different sampling strategies for MDM inference.
The following are the results from Table 4 of the original paper:
| | LLaDA 8B Instruct | LLaDA 1.5 8B |
|---|---|---|
| GSM8K | ||
| Diffusion Sampling | 53.2 | 55.7 |
| Low-Confidence Remasking | 69.4 | 70.3 |
| Semi-Autoregressive Sampling | 78.6 | 83.3 |
| HumanEval | ||
| Diffusion Sampling | 12.2 | 17.1 |
| Low-Confidence Remasking | 49.4 | 47.0 |
| Semi-Autoregressive Sampling | 47.6 | 52.4 |
| IFEval | ||
| Diffusion Sampling | 55.2 | 59.4 |
| Low-Confidence Remasking | 62.2 | 60.1 |
| Semi-Autoregressive Sampling | 61.7 | 66.2 |
Key Observations from Table 4:
- LLaDA 1.5 consistently shows performance gains over LLaDA 8B Instruct across most sampling strategies.
- The optimal sampling strategy varies by task (e.g., Semi-Autoregressive Sampling is best for GSM8K and IFEval for both models, while Low-Confidence Remasking is strong on HumanEval for LLaDA SFT, but Semi-Autoregressive Sampling is best for LLaDA 1.5). This indicates that the choice of inference strategy is crucial and task-dependent for MDMs.
- The improvements from VRPO are not tied to a single inference method but are generally applicable, demonstrating the robustness of the alignment technique.
6.5. Training Randomness
To assess the stability and reliability of VRPO, the authors retrained LLaDA using VRPO with two additional random seeds, resulting in three independent runs.
The following are the results from Table 5 of the original paper:
| Task | LLaDA | LLaDA 1.5 |
|---|---|---|
| GSM8K | 78.6 | 82.9 ± 0.6 (95% CI: [81.4, 84.3]) |
| Math | 42.2 | 43.0 ± 0.3 (95% CI: [42.2, 43.8]) |
| GPQA | 33.3 | 35.7 ± 1.0 (95% CI: [33.1, 38.3]) |
| HumanEval | 49.4 | 52.0 ± 0.7 (95% CI: [50.3, 53.7]) |
| MBPP | 41.0 | 42.3 ± 0.8 (95% CI: [40.4, 44.1]) |
| IFEval | 62.2 | 65.1 ± 0.9 (95% CI: [62.8, 67.4]) |
Key Observations from Table 5:
- LLaDA 1.5 consistently outperforms LLaDA (SFT-only) across all benchmarks with higher mean scores.
- The standard deviations of LLaDA 1.5's performance are small (e.g., 0.6 for GSM8K, 0.3 for Math), indicating stable performance across training runs with different random seeds. This suggests that VRPO is a robust optimization method, not overly sensitive to initialization or stochasticity.
- For most tasks, the 95% confidence intervals for LLaDA 1.5 lie entirely above the corresponding LLaDA means, providing strong statistical evidence of consistent improvements and supporting the reliability and effectiveness of VRPO (a sketch of the interval computation follows).
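As an illustration of how such intervals can be obtained from three runs, the sketch below computes a mean, sample standard deviation, and a t-based 95% confidence interval. The paper does not state its exact procedure, and the three scores below are hypothetical:

```python
import statistics
from scipy import stats

def summarize_runs(scores):
    """Mean, sample standard deviation, and 95% t-based confidence interval."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)                         # unbiased (n - 1) standard deviation
    half = stats.t.ppf(0.975, df=len(scores) - 1) * std / len(scores) ** 0.5
    return mean, std, (mean - half, mean + half)

# Three hypothetical GSM8K scores averaging 82.9 with std 0.6:
print(summarize_runs([82.3, 82.9, 83.5]))   # ~ (82.9, 0.6, (81.4, 84.4))
```

With these made-up scores, the t-interval comes out close to the reported [81.4, 84.3], so a t-based construction is a plausible reading, though the authors may have used a different method.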
6.6. Inference Configurations
The following are the results from Table 6 of the original paper:
| | LLaDA 8B Instruct | | LLaDA 1.5 8B | |
|---|---|---|---|---|
| | Block length | Answer length | Block length | Answer length |
| GSM8K | 8 | 256 | 16 | 256 |
| Math | 64 | 512 | 128 | 1024 |
| GPQA | 64 | 64 | 16 | 256 |
| HumanEval | 512 | 512 | 32 | 512 |
| MBPP | 256 | 256 | 32 | 512 |
| IFEval | 512 | 512 | 32 | 512 |
| Arena-Hard | 128 | 1024 | 128 | 1024 |
| AlignBench | 32 | 512 | 32 | 512 |
| MTbench | 32 | 512 | 16 | 256 |
Key Observations from Table 6:
- The optimal inference configurations (block length and answer length) vary significantly across benchmarks for both LLaDA and LLaDA 1.5. This highlights the importance of inference tuning for MDMs to achieve optimal performance on specific tasks.
- For tasks like GSM8K, GPQA, HumanEval, MBPP, and IFEval, LLaDA 1.5 tends to prefer smaller block lengths or larger answer lengths compared to LLaDA Instruct, suggesting a more efficient or flexible generation process due to improved alignment. A block length smaller than the answer length indicates the use of diffusion semi-autoregressive sampling.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper rigorously investigates the challenges of aligning Masked Diffusion Models (MDMs) with human preferences, identifying the high variance and bias in Evidence Lower Bound (ELBO)-based likelihood estimates as a critical bottleneck. To address this, the authors propose Variance-Reduced Preference Optimization (VRPO), a framework grounded in theoretical analysis of the score-estimator variance that drives these errors. VRPO introduces practical, unbiased variance reduction strategies, including increasing the Monte Carlo sampling budget, implementing optimal allocation of samples across timesteps, and employing antithetic sampling between policy and reference model ELBOs. Empirical validation with LLaDA 1.5—an 8B-parameter MDM—demonstrates consistent and significant performance improvements over its SFT-only predecessor across diverse benchmarks encompassing mathematical reasoning, code generation, and general alignment tasks. These results confirm VRPO's effectiveness at a large scale, establishing a robust foundation for the future development of language MDMs.
7.2. Limitations & Future Work
7.2.1. Limitations
- Misuse Risks: The Ethics Statement explicitly acknowledges that, despite focusing on alignment for helpfulness, misuse risks remain. Models might still generate discriminatory, biased, or harmful content. While preference data curation and filtering are employed, these risks are inherent to LLMs and require ongoing vigilance.
- Computational Cost of Sampling: While deemed acceptable (less than 0.5% of the pre-training cost for `n = 8`), increasing the sampling budget for ELBOs increases the computational overhead by a factor of `n`. For even larger budgets or more frequent sampling, this could become a more significant constraint.
- Generalization vs. Variance Reduction Trade-off (Antithetic Sampling): The ablation study notes that while antithetic sampling drastically reduces variance, its direct impact on downstream benchmark performance is not always a proportional improvement. This suggests other factors (like data diversity exposure) might influence generalization, a complex aspect not fully controlled.
7.2.2. Future Work
- Extension to Broader RL-based Alignment Algorithms: The paper explicitly states that its variance reduction techniques and analysis are not limited to DPO but can be extended to other RL-based alignment algorithms such as PPO and GRPO. This indicates a promising avenue for generalizing VRPO across the entire RLHF ecosystem for MDMs.
- Further Enhancement of MDMs: The work lays the groundwork for future research to further enhance MDMs' performance, implying continued exploration of architectural improvements, training techniques, and alignment strategies.
- Exploring Generalization Factors: The observation about antithetic sampling's effect on downstream performance suggests a need to better understand the interplay between variance reduction, optimization stability, and factors that contribute to model generalization.
7.3. Personal Insights & Critique
This paper provides a crucial theoretical and empirical advancement for Masked Diffusion Models in the context of preference optimization.
- Theoretical Rigor for Practical Problems: A key strength is its principled approach. Instead of just empirically trying different methods, the authors first perform a formal analysis to understand why MDM alignment is difficult (high variance in ELBO estimates). This theoretical grounding (e.g., Theorem 1) is highly valuable as it directs the solution towards fundamental issues rather than superficial fixes. This mindset of understanding the root cause before prescribing a solution is a strong example for research.
- "Free Lunch" Techniques: The identification of optimal budget allocation and antithetic sampling as techniques that reduce variance without additional computational cost is particularly elegant. In a field constantly battling computational constraints, such "free lunches" are immensely valuable.
- Importance of the Reference Model in DPO: The reliance on log-likelihood ratios and the interaction between the current policy and the reference policy are central to DPO. The paper's explicit strategy of antithetic sampling, which leverages the correlation between these two policies, reflects a nuanced understanding of DPO's mechanics for MDMs.
- Potential for Broader Impact: The claim that VRPO's techniques are transferable to PPO and GRPO for MDMs is significant. If validated, this framework could become a standard component of the RLHF pipeline for MDMs, accelerating their development and adoption.
- Critique on Generalization: The observation regarding antithetic sampling and downstream performance is thought-provoking. While variance reduction is typically seen as universally beneficial for optimization, the suggestion that some noise (or rather, the more diverse sampling patterns introduced by not using antithetic sampling) might aid generalization on certain benchmarks opens an interesting research question. It implies a subtle balance between purely reducing gradient noise and ensuring sufficient exploration of the data manifold for robust generalization. Future work could explore adaptive variance reduction or targeted noise injection if empirical generalization benefits are consistently observed.
- Comparison to ARMs: The consistent outperformance of LLaDA 1.5 over LLaDA SFT-only and the naive LLaDA DPO baseline, coupled with its competitiveness against strong ARMs, strongly suggests that MDMs, when properly aligned, can be viable alternatives to ARMs, potentially offering unique advantages such as parallel generation or robustness. This work significantly closes the alignment gap for MDMs.