LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

Published: 05/26/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

LLaDA 1.5 uses Variance-Reduced Preference Optimization to reduce ELBO estimator variance, improving human preference alignment and outperforming prior models on math, code, and alignment benchmarks.

Abstract

While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort in aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA, and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor consistently and significantly across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 demonstrates a highly competitive mathematical performance compared to strong language MDMs and ARMs. Project page: https://ml-gsai.github.io/LLaDA-1.5-Demo/.

In-depth Reading

1. Bibliographic Information

1.1. Title

The title of the paper is LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models. It indicates that the paper introduces an improved version (1.5) of a large language diffusion model (LLaDA) by applying a novel technique called Variance-Reduced Preference Optimization (VRPO).

1.2. Authors

  • Fengqi Zhu (Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)

  • Rongzhen Wang (Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)

  • Shen Nie (Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)

  • Xiaolu Zhang (Ant Group)

  • Chunwei Wu (Ant Group)

  • Jun Hu (Ant Group)

  • Jun Zhou (Ant Group)

  • Jianfei Chen (Tsinghua University)

  • Yankai Lin (Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)

  • Ji-Rong Wen (Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)

  • Chongxuan Li (Gaoling School of AI, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)

    The authors are primarily affiliated with the Gaoling School of AI at Renmin University of China, Beijing Key Laboratory of Research on Large Models and Intelligent Governance, Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE, Tsinghua University, and Ant Group. Their backgrounds appear to span artificial intelligence, machine learning, and large language models, specifically focusing on diffusion models and their applications.

1.3. Journal/Conference

The paper is published on arXiv (arXiv preprint arXiv:2505.19223). arXiv is a well-known open-access archive for scholarly articles, primarily in physics, mathematics, computer science, and related fields. Papers on arXiv are typically preprints, meaning they have not yet undergone formal peer review, though many eventually get published in peer-reviewed conferences or journals. As a preprint, it allows for rapid dissemination of research findings.

1.4. Publication Year

2025 (May 25, 2025, Version 2)

1.5. Abstract

Masked Diffusion Models (MDMs), like LLaDA, are a promising paradigm for language modeling, but aligning them with human preferences through reinforcement learning (RL) has received limited attention. A major challenge is the high variance observed in Evidence Lower Bound (ELBO)-based likelihood estimates, which are crucial for preference optimization. To tackle this, the authors propose Variance-Reduced Preference Optimization (VRPO). This framework provides a formal analysis of the ELBO estimator's variance and derives bounds for both the bias and variance of preference optimization gradients. Based on this theoretical foundation, VRPO introduces unbiased variance reduction techniques, such as optimal Monte Carlo budget allocation and antithetic sampling, which significantly enhance the alignment performance of MDMs. When applied to LLaDA, the resulting model, LLaDA 1.5, consistently and significantly surpasses its Supervised Fine-Tuning (SFT)-only predecessor across various benchmarks: mathematical tasks (GSM8K +4.7), code generation (HumanEval +3.0, MBPP +1.8), and alignment tasks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 achieves highly competitive mathematical performance compared to other strong language MDMs and autoregressive models (ARMs).

  • Original Source Link: https://arxiv.org/abs/2505.19223
  • PDF Link: https://arxiv.org/pdf/2505.19223v2.pdf
  • Publication Status: This paper is a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The field of large language models (LLMs) has seen significant advancements, with two primary paradigms emerging: Autoregressive Models (ARMs) and Masked Diffusion Models (MDMs). While ARMs have achieved remarkable success and extensive research has gone into aligning them with human preferences (e.g., via Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO)), MDMs are a newer, promising paradigm that have shown competitive or superior performance in language modeling at various scales. However, the alignment of these MDMs with human preferences, which is crucial for developing helpful and safe models, remains relatively underexplored.

The core problem this paper aims to solve is the high variance associated with Evidence Lower Bound (ELBO)-based likelihood estimates in MDMs during preference optimization. Unlike ARMs, where log-likelihoods are often directly computable, MDMs rely on ELBOs as approximations, which introduces nested expectations and requires Monte Carlo sampling. This sampling, while necessary, leads to significant variance in the estimated preference score (a linear combination of ELBO terms), which in turn propagates to the loss and gradient of the preference optimization objective. This high variance hinders stable and effective training, making it difficult to align MDMs with human preferences reliably across diverse tasks.

The paper's entry point is to systematically study this variance problem in the context of DPO, a popular and empirically strong alignment method. By formally analyzing the bias and variance introduced by ELBO approximations, the authors identify the score-estimator variance as the dominant factor affecting optimization stability. This insight drives the development of principled variance reduction strategies specifically tailored for MDMs.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Formal Analysis of Variance and Bias: It provides a rigorous theoretical framework that formally analyzes the bias and variance of the DPO loss and gradient when using ELBO-based likelihood estimates in MDMs. This analysis reveals that these errors are primarily governed by the variance of the score estimator (a linear combination of ELBO terms). This theoretical grounding offers crucial insights into the challenges of aligning MDMs.

  • Introduction of Variance-Reduced Preference Optimization (VRPO): Based on the theoretical findings, the paper proposes VRPO, a novel framework that integrates multiple unbiased variance reduction techniques. These techniques include:

    1. Increased Sampling Budget: Strategically increasing the number of Monte Carlo samples used for ELBO estimation.
    2. Optimal Allocation: Distributing the sampling budget optimally across diffusion timesteps by setting $n_t = n$ (the number of sampled timesteps equals the total budget) and $n_{y_t} = 1$ (one masked sample per timestep).
    3. Antithetic Sampling: Sharing the same sampled timesteps and masked data between the current policy ($\pi_\theta$) and reference policy ($\pi_{\mathrm{ref}}$) ELBO estimates for the same input. These methods are proven to reduce variance without introducing bias.
  • Empirical Validation and LLaDA 1.5: The effectiveness of VRPO is demonstrated by applying it to LLaDA, an 8B-parameter masked diffusion language model. The resulting model, LLaDA 1.5, consistently and significantly outperforms its Supervised Fine-Tuning (SFT)-only predecessor (LLaDA Instruct) across a wide range of benchmarks:

    • Mathematical Tasks: GSM8K (+4.7), Math (+0.4), GPQA (+3.6).
    • Code Generation: HumanEval (+3.0), MBPP (+1.8).
    • Alignment Tasks: IFEval (+4.0), Arena-Hard (+4.3), AlignBench (+0.5), MTBench (+0.1). These improvements establish LLaDA 1.5 as a highly competitive model, even against strong ARMs, particularly in mathematical performance.
  • Generalizability of Techniques: The paper discusses how the proposed variance reduction techniques and analysis can be extended beyond DPO to other Reinforcement Learning (RL)-based alignment algorithms, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), making them broadly applicable to MDM alignment.

    The key conclusions are that systematically addressing the high variance in ELBO estimates is critical for effective MDM alignment, and the VRPO framework provides a theoretically sound and empirically validated solution to achieve this.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a solid grasp of several core concepts in machine learning, particularly in natural language processing (NLP) and generative models, is essential.

3.1.1. Large Language Models (LLMs)

Large Language Models (LLMs) are deep learning models, typically based on the Transformer architecture, that are trained on massive amounts of text data. They learn to predict the next word in a sequence, enabling them to generate human-like text, answer questions, summarize documents, and perform many other language-related tasks.

3.1.2. Autoregressive Models (ARMs)

Autoregressive Models (ARMs) are a common type of LLM where text is generated token-by-token in a sequential manner. Each token's generation is conditioned on all previously generated tokens. This process makes their log-likelihood (the probability of observing a sequence of tokens) directly computable. Examples include GPT-series models and LLaMA.

  • How they work: To generate "Hello world!", an ARM first predicts "Hello", then predicts "world!" given "Hello".
  • Likelihood: The probability of a sequence $y = (y_1, y_2, \dots, y_L)$ given a prompt $x$ is typically calculated as: $ P(y|x) = \prod_{i=1}^L P(y_i \mid x, y_1, \dots, y_{i-1}) $ The log-likelihood is then $\log P(y|x)$; a toy computation is sketched below.
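
To make the factorization concrete, here is a minimal Python sketch; the per-step distributions in `next_token_probs` are hypothetical stand-ins for a real model's softmax outputs, not any actual API.

```python
import math

def arm_log_likelihood(tokens, next_token_probs):
    # next_token_probs[i] maps candidate tokens to P(y_i | x, y_1, ..., y_{i-1});
    # the sequence log-likelihood is the sum of per-token log-probabilities.
    return sum(math.log(next_token_probs[i][tok]) for i, tok in enumerate(tokens))

probs = [{"Hello": 0.6, "Hi": 0.4}, {"world!": 0.7, "there!": 0.3}]
print(arm_log_likelihood(["Hello", "world!"], probs))  # log(0.6) + log(0.7)
```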

3.1.3. Diffusion Models

Diffusion Models are a class of generative models that learn to reverse a gradual noise process. In the context of images, they learn to transform noisy data back into a clean image. For discrete data like text, they operate by progressively masking out information and then learning to denoise or unmask it.

3.1.4. Masked Diffusion Models (MDMs)

Masked Diffusion Models (MDMs) adapt the diffusion paradigm to discrete data like language. Instead of adding Gaussian noise, they mask out tokens in a sequence.

  • Forward Process: Starting from an original, clean text sequence, tokens are progressively masked (replaced with a special [MASK] token) over a series of timesteps. As time progresses, more tokens are masked, eventually leading to a fully masked sequence. This is a controlled process, often defined by a masking schedule.
  • Reverse Process: The model learns to reverse this process. Given a partially masked sequence at a certain timestep, the model predicts the original, unmasked tokens. By iteratively denoising (unmasking) the sequence from a fully masked state back to a clean state, it generates new text.
  • Difference from ARMs: MDMs are non-autoregressive or partially autoregressive, meaning they can predict multiple masked tokens simultaneously or in parallel, which can lead to faster generation. However, their log-likelihood is generally intractable to compute directly, which is a key challenge addressed in this paper.

3.1.5. Evidence Lower Bound (ELBO)

The Evidence Lower Bound (ELBO) is a variational approximation used in probabilistic models, especially when the true log-likelihood of the data is intractable to compute directly. It provides a lower bound on the true log-likelihood, meaning $\mathcal{B} \le \log P(y|x)$. Optimizing the ELBO is a common strategy to train such models.

  • In MDMs, the ELBO often involves expectations over the diffusion process (different timesteps) and masked data configurations within each timestep.

3.1.6. Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is a technique used to align LLMs with human preferences and values. It typically involves three steps:

  1. Supervised Fine-Tuning (SFT): A base LLM is fine-tuned on a dataset of high-quality human-written demonstrations to make it follow instructions.
  2. Reward Model Training: A separate reward model (RM) is trained to predict human preferences. Humans provide feedback by ranking multiple model-generated responses to a given prompt. The RM learns to assign higher scores to preferred responses.
  3. Reinforcement Learning (RL): The SFT model is then further fine-tuned using Proximal Policy Optimization (PPO) or similar RL algorithms, with the reward signal provided by the trained RM. The goal is to maximize the RM's score while staying close to the SFT policy to prevent mode collapse or drift.

3.1.7. Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) is a simplified and often more stable alternative to traditional RLHF. Instead of explicitly training a separate reward model and then using RL, DPO directly optimizes the language model's policy to satisfy human preferences by minimizing a single objective function. This objective is derived from the Bradley-Terry model of preferences and directly relates the model's log-likelihood ratios to human preferences. It avoids the complexities of RM training and RL training stability issues.

3.1.8. Monte Carlo Methods

Monte Carlo methods are computational algorithms that rely on repeated random sampling to obtain numerical results. They are often used when direct analytical solutions are intractable.

  • Estimation: To estimate an expectation $\mathbb{E}[f(X)]$, one can draw $N$ samples $X_1, \dots, X_N$ from the distribution of $X$ and compute the sample average $\frac{1}{N}\sum_{i=1}^N f(X_i)$.
  • Bias: An estimator is unbiased if its expected value equals the true parameter it's estimating.
  • Variance: The variance of an estimator measures how much its estimates vary from the true parameter across different sets of samples. High variance means estimates are unstable and noisy.

3.1.9. Variance Reduction Techniques

These are methods used in Monte Carlo simulations to reduce the variance of an estimator without changing its expected value (i.e., maintaining unbiasedness). This makes the estimates more precise and reliable for a given number of samples.

  • Antithetic Sampling (Antithetic Variates): A technique where pairs of random variables are generated that are negatively correlated. By averaging estimates from these negatively correlated samples, the overall variance of the estimator can be significantly reduced. For example, if you need to estimate $\mathbb{E}[f(X)]$, and $f(X)$ and $f(X')$ are negatively correlated, then $\frac{f(X) + f(X')}{2}$ can have lower variance than an average of two independent draws at the same cost. A toy demonstration follows this list.
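
As a concrete illustration (not taken from the paper), the following NumPy sketch estimates $\mathbb{E}[e^U]$ for $U \sim \mathcal{U}(0,1)$ and compares plain Monte Carlo against antithetic pairs $(u, 1-u)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def plain_estimate(n):
    # n independent uniforms, plain Monte Carlo average of f(u) = exp(u).
    u = rng.uniform(size=n)
    return np.exp(u).mean()

def antithetic_estimate(n):
    # n/2 antithetic pairs (u, 1 - u); exp(u) and exp(1 - u) are negatively
    # correlated, so averaging the pairs lowers the estimator's variance.
    u = rng.uniform(size=n // 2)
    return 0.5 * (np.exp(u) + np.exp(1.0 - u)).mean()

plain = [plain_estimate(100) for _ in range(2000)]
anti = [antithetic_estimate(100) for _ in range(2000)]
print(np.var(plain), np.var(anti))  # the antithetic variance is markedly smaller
```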

3.2. Previous Works

The paper builds upon and differentiates itself from existing research in several areas:

3.2.1. Masked Diffusion Models (MDMs) in Language

  • Foundational Diffusion: The concept originated with Sohl-Dickstein et al. (2015) and Austin et al. (2021a) for continuous data.
  • Discrete Diffusion: Campbell et al. (2022) extended to discrete data, Hoogeboom et al. (2021) and Meng et al. (2022) further explored this.
  • Language MDMs: Lou et al. (2023), Ou et al. (2024), Sahoo et al. (2024), Shi et al. (2024) showed MDMs achieving comparable or superior performance to ARMs at small scales, optimizing the ELBO or simplified variants.
  • Scaling MDMs: Nie et al. (2024), Gong et al. (2024), Nie et al. (2025) (LLaDA) demonstrated excellent scalability, achieving competitive results with state-of-the-art ARMs like LLaMA 3 (Dubey et al., 2024).

3.2.2. Alignment of LLMs (ARMs)

  • Traditional RLHF: Ziegler et al. (2019) and Ouyang et al. (2022) established the two-stage process of reward modeling followed by RL (e.g., PPO) for ARMs.
  • Direct Preference Optimization (DPO): Rafailov et al. (2023) introduced DPO, a simplified yet effective alternative that directly optimizes a policy to satisfy preferences without an explicit reward model. Its strong empirical performance is noted by Grattafiori et al. (2024).

3.2.3. Alignment of MDMs (Existing Efforts)

The paper acknowledges several emerging works on MDM alignment:

  • Zekri and Boullé (2025): General policy-gradient method leveraging denoising distribution.
  • Borso et al. (2025): DPO variant for discrete diffusion by viewing token steps as actions, validated on small-scale binary sequences.
  • Zhao et al. (2025), Yang et al. (2025), Tang et al. (2025): GRPO-based methods treating token steps as actions to enhance reasoning.
  • Huang et al. (2025): GRPO variant viewing intermediate diffusion steps as RL trajectories, focusing on reasoning and code generation.
  • Gong et al. (2025): GRPO-based algorithm for code generation with a coupled-sampling variance-reduction technique.

3.2.4. Variance Reduction Techniques

The paper draws inspiration from broader fields:

  • Monte Carlo Methods: Classic techniques like control variates and stratified sampling (Kroese et al., 2013). The paper specifically adapts antithetic variates.
  • Doubly Stochastic Optimization: Related to nested expectations in ELBOs, drawing from Dai et al. (2014), Titsias and Lázaro-Gredilla (2014), Gower et al. (2020), Kim et al. (2024).
  • Variational Inference: Connections to importance weighted variational inference (Burda et al., 2016; Huang and Courville, 2019), where outer bias is reduced by inner variance reduction.

3.3. Technological Evolution

The evolution of language models has progressed from rule-based systems to statistical models, then to deep learning models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTMs), and most recently to Transformer-based models. Within the Transformer era, Autoregressive Models (ARMs) have dominated, with significant strides in pre-training scale and subsequent alignment via RLHF or DPO.

Masked Diffusion Models (MDMs) represent a newer wave, offering a different generative paradigm—often parallel generation—which holds promise for efficiency and robustness. However, their internal mechanisms, particularly the use of ELBOs for likelihood approximation, introduce unique challenges when adapting existing alignment methodologies designed for ARMs. Most prior work on MDM alignment (as cited in 3.2.3) attempts to adapt existing methods without a deep dive into the specific statistical challenges posed by MDMs' ELBO approximations or focuses on specialized tasks.

This paper's work fits into this timeline by addressing a critical missing piece: a systematic theoretical and empirical study of preference optimization for MDMs on general tasks, specifically tackling the high variance issue inherent in their ELBO-based likelihood estimates. It aims to bring the robustness and alignment capabilities seen in ARMs to the MDM paradigm.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper's core differences and innovations are:

  • Focus on ELBO Variance in MDM Alignment: Unlike prior MDM alignment works that mostly adapt existing frameworks (like DPO or GRPO) to MDMs and introduce various likelihood approximation methods, this paper identifies and formally analyzes the high variance in ELBO-based likelihood estimates as the central challenge. This provides a fundamental understanding of the problem specific to MDMs.

  • Theoretical Grounding for Variance Reduction: The paper provides a rigorous theoretical foundation by analyzing the bias and variance of the DPO loss and gradient in terms of the score-estimator variance. This systematic analysis leads directly to the design of unbiased variance reduction strategies (VRPO), rather than relying solely on empirical tuning or heuristic adaptations.

  • Comprehensive General Task Alignment: Many existing MDM alignment efforts focus on specialized tasks like reasoning or code generation. This paper explicitly aims for broader alignment tasks and validates VRPO on a comprehensive suite of benchmarks covering mathematics, code, and general alignment, demonstrating its effectiveness for general-purpose language MDMs.

  • Principled Variance Reduction Techniques: The VRPO framework introduces specific, theoretically motivated techniques:

    • Optimal Monte Carlo budget allocation for ELBO estimation (setting $n_t = n$, $n_{y_t} = 1$).
    • Antithetic sampling between current and reference policies. These are presented as unbiased and shown to be effective, distinguishing them from potentially biased or less effective ad-hoc solutions. The coupled-sampling variance-reduction technique mentioned by Gong et al. (2025) is concurrent and complementary, indicating an active research area that this paper also contributes to with a DPO-specific focus.
  • LLaDA 1.5 as a Strong Baseline: By applying VRPO to a large-scale MDM (LLaDA 8B), the paper demonstrates state-of-the-art performance for MDMs across diverse tasks, effectively establishing a new benchmark and methodology for aligning large language diffusion models.

    In essence, while others have adapted alignment methods to MDMs, this paper systematically addresses the intrinsic statistical challenges of MDMs' likelihood approximations within the alignment process, leading to a more robust and effective solution.

4. Methodology

4.1. Principles

The core principle behind the proposed Variance-Reduced Preference Optimization (VRPO) is to address the instability and inefficiency of aligning Masked Diffusion Models (MDMs) with human preferences due to high variance in their Evidence Lower Bound (ELBO)-based likelihood estimates. Direct Preference Optimization (DPO) is chosen as the alignment framework for its simplicity and strong empirical performance.

The fundamental intuition is that the exact log-likelihood required by DPO is intractable for MDMs. This necessitates approximating it with the ELBO, which introduces Monte Carlo sampling and, consequently, stochasticity. This stochasticity leads to:

  1. Bias: Due to the non-linearity of the sigmoid function in the DPO loss, the expected value of the estimated loss does not equal the loss computed from the true (unestimated) ELBOs.

  2. High Variance: The estimated preference score (a linear combination of ELBOs) exhibits high variance, which directly translates to high variance in the DPO loss and its gradient. This noisy gradient makes the optimization process unstable and slow.

    The paper formally analyzes these issues and discovers that both the bias and variance of the estimated DPO loss and its gradient are directly bounded by the variance of the score estimator. This crucial insight dictates the core strategy: to effectively align MDMs, one must reduce the variance of this score estimator. VRPO then proposes specific, unbiased variance reduction techniques to achieve this, aiming for a more stable and efficient optimization process.

4.2. Core Methodology In-depth (Layer by Layer)

The methodology section first reviews existing alignment methods and MDM specifics, then details how the ELBO is incorporated into DPO, leading to the identification of the variance problem, and finally presents the VRPO solution.

4.2.1. Alignment Methods

The paper first outlines traditional Reinforcement Learning (RL) alignment methods and Direct Preference Optimization (DPO).

4.2.1.1. Reward Modeling

In traditional two-stage RL alignment, the first step involves training a reward model ($r_\phi$) using a dataset of human preferences. The dataset $\mathcal{D}$ consists of triplets $(x, y_w, y_l)$, where $x$ is a prompt, $y_w$ is the human-preferred response, and $y_l$ is the less preferred response. The reward model $r_\phi$ is trained to output higher scores for preferred responses by minimizing the following objective based on the Bradley-Terry formulation:

$ \mathcal{L}_{\mathrm{Reward}}(\phi) \triangleq -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right] $

Here:

  • $\phi$ represents the parameters of the reward model.
  • $\mathbb{E}_{(\cdot) \sim \mathcal{D}}$ denotes the expectation over the preference dataset $\mathcal{D}$.
  • $\sigma(\cdot)$ is the sigmoid function, defined as $\sigma(z) = \frac{1}{1+e^{-z}}$. Its role is to map the difference in reward scores to a probability-like value between 0 and 1.
  • $r_\phi(x, y_w)$ is the reward score assigned by the model to the preferred response $y_w$ given prompt $x$.
  • $r_\phi(x, y_l)$ is the reward score assigned by the model to the less preferred response $y_l$ given prompt $x$. The objective encourages $r_\phi(x, y_w)$ to be significantly greater than $r_\phi(x, y_l)$; a minimal code sketch of this loss follows the list.
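
For concreteness, a minimal PyTorch sketch of the Bradley-Terry reward-modeling loss above; the score tensors are assumed to come from a reward model evaluated on a batch of preference pairs (this is an illustrative sketch, not the paper's code).

```python
import torch
import torch.nn.functional as F

def reward_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    # r_w, r_l: reward-model scores r_phi(x, y_w) and r_phi(x, y_l) for a batch.
    # Minimizing -log sigmoid(r_w - r_l) pushes preferred responses toward
    # higher scores than dispreferred ones.
    return -F.logsigmoid(r_w - r_l).mean()

# Toy scores for two preference pairs.
print(reward_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9])))
```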

4.2.1.2. Reinforcement Learning (RL)

In the second stage of traditional RL alignment, a language model policy $\pi_\theta(y|x)$ (which defines the probability of generating response $y$ given prompt $x$) is optimized via RL to maximize a specific objective:

$ \operatorname*{max}_{\pi_\theta} \ \mathbb{E}_{x \sim \mathcal{D}, \ y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right] - \beta \, \mathbb{D}_{\mathrm{KL}} \left( \pi_\theta(\cdot|x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot|x) \right) $

Here:

  • $\theta$ represents the parameters of the language model policy ($\pi_\theta$).

  • $\mathbb{E}_{x \sim \mathcal{D}, \ y \sim \pi_\theta(\cdot|x)}$ denotes the expectation over prompts $x$ from the dataset and responses $y$ sampled from the current policy $\pi_\theta$.

  • $r_\phi(x, y)$ is the reward provided by the trained reward model for generating response $y$ to prompt $x$.

  • $\beta$ is a coefficient that controls the strength of the regularization term.

  • $\mathbb{D}_{\mathrm{KL}}(\pi_\theta(\cdot|x) \Vert \pi_{\mathrm{ref}}(\cdot|x))$ is the Kullback-Leibler (KL) divergence between the current policy $\pi_\theta$ and a fixed reference policy $\pi_{\mathrm{ref}}$. The KL divergence measures how one probability distribution diverges from another. This term prevents the optimized policy from drifting too far from the original (often Supervised Fine-Tuned (SFT)) model, which is crucial for maintaining general capabilities and avoiding mode collapse.

  • $\pi_{\mathrm{ref}}$ is typically a frozen SFT model.

    For autoregressive models (ARMs), both sampling responses from $\pi_\theta$ and evaluating their log-likelihoods are straightforward.

4.2.1.3. Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) simplifies the two-stage RLHF process by directly optimizing a policy without an explicit reward model. The DPO objective minimizes the following loss:

$ \mathcal{L}_{\mathrm{DPO}}(\theta) \triangleq \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \ell_{\mathrm{DPO}}(x, y_w, y_l; \theta) \right] $ where $ \ell_{\mathrm{DPO}}(x, y_w, y_l; \theta) \triangleq - \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) $

Here:

  • $\theta$ represents the parameters of the language model policy ($\pi_\theta$).
  • $\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}$ denotes the expectation over the preference dataset $\mathcal{D}$.
  • $\ell_{\mathrm{DPO}}$ is the per-sample DPO loss.
  • $\sigma(\cdot)$ is the sigmoid function.
  • $\beta$ is the coefficient controlling regularization strength, similar to the RL objective.
  • $\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ represents the log-likelihood ratio between the current policy and the reference policy for a given response $y$ and prompt $x$. This term implicitly measures how much the current policy prefers a response compared to the reference policy. The DPO loss directly encourages the model to increase the log-likelihood ratio for preferred responses ($y_w$) and decrease it for less preferred responses ($y_l$); a code sketch follows this list.
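
A minimal PyTorch sketch of the per-sample DPO loss, assuming the four sequence log-likelihoods are already available (exact sums of token log-probabilities for ARMs; the paper later replaces them with ELBO estimates for MDMs). The values below are purely illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # logp_* are log pi_theta(y|x); ref_logp_* are log pi_ref(y|x).
    # The implicit reward is beta times the log-likelihood ratio.
    score = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
    return -F.logsigmoid(score).mean()

print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-13.0]), torch.tensor([-14.0])))
```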

4.2.2. Masked Diffusion Models (MDMs)

4.2.2.1. MDM Formulation

The paper describes Masked Diffusion Models (MDMs) as defining a model distribution through a forward-reverse framework.

  • Forward Process: Starting from an original input $y$ at $t=0$, tokens are progressively masked with a probability that increases over time. This leads to a fully masked sequence at $t=1$. Given a prompt $x$, the forward process is formulated as: $ q(y_t | t, y, x) = \prod_{i=1}^L q(y_t^i | t, y^i, x) $ where $ q(y_t^i | t, y^i, x) = \begin{cases} 1-t, & y_t^i = y^i, \\ t, & y_t^i = \mathbf{M}, \end{cases} $ Here:

    • $y$ is the original full response (sequence of tokens).
    • $y_t$ is the corrupted sequence at timestep $t \in [0,1]$.
    • $y^i$ is the $i$-th token of the response $y$.
    • $y_t^i$ is the $i$-th token of the corrupted sequence $y_t$.
    • $\mathbf{M}$ denotes the mask token.
    • $t$ is the masking probability (or noise level), increasing from 0 to 1. If $y_t^i = y^i$, the token is unmasked; if $y_t^i = \mathbf{M}$, the token is masked.
  • Reverse Process: The model learns to denoise the sequence. For timesteps $0 \le s < t \le 1$, the reverse process is defined as: $ q(y_s | s, t, y_t, x) = \prod_{i=1}^L q(y_s^i | s, t, y_t, x) $ where $ q(y_s^i | s, t, y_t, x) = \begin{cases} \frac{t-s}{t} p_\theta(y^i \mid y_t, x), & y_t^i = \mathbf{M} \wedge y_s^i \neq \mathbf{M}, \\ \frac{s}{t}, & y_t^i = \mathbf{M} \wedge y_s^i = \mathbf{M}, \\ 1, & y_t^i \neq \mathbf{M} \wedge y_s^i = y_t^i, \\ 0, & \mathrm{otherwise}, \end{cases} $ Here:

    • $y_s$ is the sequence at a previous timestep $s$.
    • $p_\theta$ is the mask prediction model (the neural network) that learns to predict the original token $y^i$ given the masked sequence $y_t$ and prompt $x$. The reverse process iteratively unmasks tokens, moving from $t=1$ (fully masked) to $t=0$ (original sequence). A sketch of the forward masking step follows this list.
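
To illustrate the forward process, here is a small PyTorch sketch that masks each token independently with probability $t$; `MASK_ID` is a hypothetical id reserved for the [MASK] token, not the actual LLaDA vocabulary value.

```python
import torch

MASK_ID = 0  # hypothetical id for the [MASK] token

def forward_mask(y: torch.Tensor, t: float) -> torch.Tensor:
    # Forward process of an MDM: each token is independently replaced by the
    # mask token with probability t and kept with probability 1 - t.
    keep = torch.rand(y.shape) >= t
    return torch.where(keep, y, torch.full_like(y, MASK_ID))

y = torch.tensor([5, 17, 42, 8, 23])
print(forward_mask(y, t=0.3))  # roughly 30% of tokens replaced by MASK_ID
```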

4.2.2.2. Likelihood Estimation in MDMs

A key challenge for MDMs is that the exact log-likelihood $\log \pi(y|x)$ is intractable. Instead, it is typically approximated by its Evidence Lower Bound (ELBO). The paper uses the following ELBO formulation (Equation (14) from Appendix A, noted to have lower variance than Equation (12)):

$ \mathcal{B}_{\pi}(y|x) \triangleq \mathbb{E}_{l \sim \mathcal{U}(\{1, 2, \ldots, L\})} \, \mathbb{E}_{y_l \sim q(y_l|l, y, x)} \, \ell_{\pi}^{\prime}(y_l, l, y|x) $ where $ \ell_{\pi}^{\prime}(y_l, l, y|x) \triangleq \frac{L}{l} \sum_{i=1}^L \mathbf{1}[y_l^i = \mathbf{M}] \log p_{\theta}(y^i \mid y_l, x) $

Here:

  • $\mathcal{B}_\pi(y|x)$ is the ELBO approximation of the log-likelihood of response $y$ given prompt $x$ under policy $\pi$.
  • $l$ is the number of masked tokens, uniformly sampled from $\{1, 2, \ldots, L\}$ (where $L$ is the sequence length). This ensures that exactly $l$ tokens are masked.
  • $\mathcal{U}(\{\dots\})$ denotes a uniform distribution over the specified set.
  • $y_l$ is the sequence obtained by masking $l$ tokens (without replacement).
  • $\ell_{\pi}^{\prime}$ is the per-step loss of the mask prediction model.
  • $\mathbf{1}[y_l^i = \mathbf{M}]$ is an indicator function, which is 1 if the $i$-th token in $y_l$ is masked, and 0 otherwise.
  • $\log p_{\theta}(y^i \mid y_l, x)$ is the log-probability predicted by the model for the original token $y^i$, given the masked sequence $y_l$ and prompt $x$. This is essentially the model's ability to denoise or predict the masked token. The ELBO is an expectation over sampled numbers of masked tokens ($l$) and specific masked configurations ($y_l$). For a well-trained model, the bias of the ELBO relative to the exact likelihood is considered negligible. A code sketch of one inner term appears below.
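The following sketch computes one inner term $\ell_{\pi}^{\prime}(y_l, l, y|x)$ under assumed interfaces: `mask_predictor(y_masked)` stands in for a forward pass of the mask-prediction network and is a hypothetical callable, not the LLaDA API.

```python
import torch

def elbo_inner_term(y: torch.Tensor, mask_id: int, mask_predictor, l: int):
    # y: [L] original response tokens. Exactly l positions are masked without
    # replacement; the masked-token log-probabilities are reweighted by L / l.
    L = y.numel()
    pos = torch.randperm(L)[:l]            # positions to mask
    y_masked = y.clone()
    y_masked[pos] = mask_id
    log_probs = mask_predictor(y_masked)   # assumed shape [L, V]: log p_theta(. | y_l, x)
    return (L / l) * log_probs[pos, y[pos]].sum()
```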

4.2.2.3. Monte Carlo Approximation of ELBO

Computing $\mathcal{B}_\pi(y)$ exactly is intractable due to the double expectations. In practice, it is approximated using a doubly Monte Carlo method.

  • Let $n_t$ be the number of samples for timesteps (or numbers of masked tokens $l$ in the chosen formulation).
  • Let $n_{y_t}$ be the number of samples for masked data configurations per timestep (or per number of masked tokens $l$). The sampling process involves: $ S_t \triangleq \{ t^{(j)} \}_{j=1}^{n_t} \stackrel{\mathrm{i.i.d.}}{\sim} \mathcal{U}[0,1] \quad \mathrm{and} \quad S_{y_{t^{(j)}}|y} \triangleq \{ y_{t^{(j)}}^{(k)} \}_{k=1}^{n_{y_t}} \stackrel{\mathrm{i.i.d.}}{\sim} q(y_t|t^{(j)}, y), \quad j=1, \ldots, n_t. $ In the specific formulation used (Equation (14) with $l$), this corresponds to sampling $l$ values, and for each $l$, sampling $y_l$ configurations. The ELBO estimator is then:

$ \widehat{\mathcal{B}}_\pi(y) \triangleq \frac{1}{n_t} \sum_{j=1}^{n_t} \frac{1}{n_{y_t}} \sum_{k=1}^{n_{y_t}} \ell_\pi(y_{t^{(j)}}^{(k)}, t^{(j)}, y) $

Here:

  • $\widehat{\mathcal{B}}_\pi(y)$ is the estimated ELBO.
  • $t^{(j)}$ represents the $j$-th sampled timestep (or sampled number of masked tokens $l^{(j)}$).
  • $y_{t^{(j)}}^{(k)}$ represents the $k$-th sampled masked data configuration for the $j$-th timestep (or for $l^{(j)}$).
  • $\ell_\pi(\cdot)$ is the per-step loss as defined above. The total number of mask-prediction loss computations is $n = n_t \times n_{y_t}$. This estimator is unbiased for the true ELBO $\mathcal{B}_\pi(y)$, meaning $\mathbb{E}[\widehat{\mathcal{B}}_\pi(y)] = \mathcal{B}_\pi(y)$. However, due to practical computational constraints, $n$ is typically small, leading to significant variance in the estimate. A sketch of this doubly Monte Carlo estimator follows this list.
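
A minimal sketch of the doubly Monte Carlo estimator, assuming a `per_step_loss(y_masked, t)` callable that returns the per-step mask-prediction loss $\ell_\pi(y_t, t, y)$ (a hypothetical interface, not the paper's code):

```python
import torch

def elbo_doubly_mc(y: torch.Tensor, mask_id: int, per_step_loss, n_t: int = 8, n_yt: int = 1):
    # Draw n_t timesteps i.i.d. from U[0, 1]; for each, draw n_yt masked
    # configurations and average the per-step losses over all n_t * n_yt terms.
    total = 0.0
    for _ in range(n_t):
        t = float(torch.rand(1))                     # t^(j) ~ U[0, 1]
        for _ in range(n_yt):
            keep = torch.rand(y.shape) >= t          # mask each token w.p. t
            y_masked = torch.where(keep, y, torch.full_like(y, mask_id))
            total += per_step_loss(y_masked, t)
    return total / (n_t * n_yt)
```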

4.2.3. Substituting Likelihoods with ELBOs in DPO

The first step of the proposed VRPO framework is to adapt the DPO loss (from Equation (3)) by replacing the intractable log-likelihoods $\log \pi(y|x)$ with their ELBO approximations $\mathcal{B}_\pi(y|x)$. The modified DPO loss, referred to as the ELBO-based DPO loss, is:

$ \ell_{\mathrm{DPO-E}}(y_w, y_l; \theta) \triangleq - \log \sigma \left( \beta \left( \mathcal{B}_{\pi_\theta}(y_w) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_w) \right) - \beta \left( \mathcal{B}_{\pi_\theta}(y_l) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_l) \right) \right) $

Here:

  • The term in the argument of the sigmoid function (shown in red in the original paper) is defined as the ELBO-based preference score, denoted $s_\theta(y_w, y_l)$: $ s_\theta(y_w, y_l) = \beta \left( \mathcal{B}_{\pi_\theta}(y_w) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_w) \right) - \beta \left( \mathcal{B}_{\pi_\theta}(y_l) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_l) \right) $ This loss intuitively encourages the current model $\pi_\theta$ to assign a higher ELBO to the preferred response $y_w$ relative to the reference model $\pi_{\mathrm{ref}}$, and a lower ELBO to the less preferred response $y_l$ relative to the reference model.

In practice, each of the four ELBO terms in $s_\theta(y_w, y_l)$ must be estimated using the Monte Carlo method (as in Equation (6)). This leads to the estimated ELBO-based DPO loss:

$ \widehat{\ell}_{\mathrm{DPO-E}}(y_w, y_l; \theta) \triangleq - \log \sigma \left( \beta \left( \widehat{\mathcal{B}}_{\pi_\theta}(y_w) - \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y_w) \right) - \beta \left( \widehat{\mathcal{B}}_{\pi_\theta}(y_l) - \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y_l) \right) \right) $

The term inside the log sigmoid function is the estimated preference score, denoted $\hat{s}_\theta(y_w, y_l)$. The stochastic sampling involved in estimating this score introduces randomness into the estimated loss, making it a random variable. A minimal code sketch of this estimated loss is given below.
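
For clarity, a PyTorch sketch of the estimated ELBO-based DPO loss; each argument is assumed to be an ELBO estimate $\widehat{\mathcal{B}}$ produced, for example, by a doubly Monte Carlo estimator like the one sketched earlier. The numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_e_loss(elbo_theta_w, elbo_ref_w, elbo_theta_l, elbo_ref_l, beta=0.1):
    # The estimated preference score s_hat replaces the exact log-likelihood
    # ratios of standard DPO with differences of ELBO estimates.
    s_hat = beta * (elbo_theta_w - elbo_ref_w) - beta * (elbo_theta_l - elbo_ref_l)
    return -F.logsigmoid(s_hat)

print(dpo_e_loss(torch.tensor(-110.0), torch.tensor(-115.0),
                 torch.tensor(-120.0), torch.tensor(-118.0)))
```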

Problem Identified: This introduces two issues:

  1. Bias: Due to the non-linearity of the log sigmoid function, even if $\hat{s}_\theta$ is an unbiased estimator for $s_\theta$ (meaning $\mathbb{E}[\hat{s}_\theta] = s_\theta$), $\widehat{\ell}_{\mathrm{DPO-E}}$ is not an unbiased estimator for $\ell_{\mathrm{DPO-E}}$. That is, $\mathbb{E}[\widehat{\ell}_{\mathrm{DPO-E}}] \neq \ell_{\mathrm{DPO-E}}$. This is illustrated in Figure 2 (a).

  2. Variance: The stochastic sampling also introduces high variance into both the estimated loss and its gradient. This high variance makes optimization unstable.

    The paper formally proves that the preference score estimator $\hat{s}_\theta(y_w, y_l)$ is indeed an unbiased estimator for the true preference score $s_\theta(y_w, y_l)$. Proposition 3 (Unbiasedness of preference score estimator): The preference score estimator defined in Eq. (8) is an unbiased estimator of the true preference score defined in Eq. (7): $ \mathbb{E}_{S_{\hat{s}|y_w, y_l}} \left[ \hat{s}_\theta(y_w, y_l) \right] = s_\theta(y_w, y_l) $ Here:

  • $\mathbb{E}_{S_{\hat{s}|y_w, y_l}}$ denotes the expectation over the stochastic sampling ($S$) involved in the estimation of $\hat{s}_\theta$, given the preference pair $(y_w, y_l)$.
  • $\hat{s}_\theta(y_w, y_l)$ is the estimated preference score.
  • $s_\theta(y_w, y_l)$ is the true (ELBO-based) preference score. This proposition confirms that, on average, the estimated score is correct; but due to the non-linearity of the log sigmoid, the average of the loss computed from the estimated score is still biased.

4.2.4. Variance-Reduced Preference Optimization (VRPO)

The core insight of VRPO is that the bias and variance of the estimated DPO loss (and its gradient) are directly governed by the variance of the score estimator ($\hat{s}_\theta$). Therefore, reducing $\mathbb{V}[\hat{s}_\theta]$ will simultaneously mitigate these errors.

The following theorem quantifies this relationship: Theorem 1: Given a pair of preference data $y_w, y_l$, the bias and variance of $\widehat{\ell}_{\mathrm{DPO-E}}(y_w, y_l; \theta)$ over stochastic sampling in the score estimation can be bounded as: $ \mathbb{E}_{S_{\hat{s}|y_w, y_l}} \Big[ \Big| \ell_{\mathrm{DPO-E}}(y_w, y_l; \theta) - \widehat{\ell}_{\mathrm{DPO-E}}(y_w, y_l; \theta) \Big| \Big] \leq \sqrt{\mathbb{V}_{S_{\hat{s}|y_w, y_l}} \big[ \hat{s}_\theta(y_w, y_l) \big]} $ and $ \mathbb{V}_{S_{\hat{s}|y_w, y_l}} \Big[ \widehat{\ell}_{\mathrm{DPO-E}}(y_w, y_l; \theta) \Big] \leq 4 \, \mathbb{E}_{y_w, y_l} \Big[ \mathbb{V}_{S_{\hat{s}|y_w, y_l}} \big[ \hat{s}_\theta(y_w, y_l) \big] \Big] $ Here:

  • The first inequality bounds the absolute bias (the expected absolute difference between the true and estimated loss) by the square root of the variance of the estimated score.
  • The second inequality bounds the variance of the estimated loss by four times the expected variance of the estimated score. This theorem, illustrated conceptually in Figure 2 (b), shows a direct relationship: reducing the variance of the score estimator ($\mathbb{V}[\hat{s}_\theta]$) will reduce both the bias and variance of the DPO loss.

The paper then breaks down the variance of the score estimator to understand how to reduce it. The variance of $\hat{s}_\theta(y_w, y_l)$ (omitting subscripts for brevity) can be expanded as: $ \mathbb{V} \hat{s}_\theta (y_w, y_l) = \beta^2 \sum_{y \in \{y_w, y_l\}} \Big[ \mathbb{V} \widehat{\mathcal{B}}_{\pi_\theta}(y) + \mathbb{V} \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y) - 2 \, \mathrm{Corr} \Big( \widehat{\mathcal{B}}_{\pi_\theta}(y), \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y) \Big) \sqrt{\mathbb{V} \widehat{\mathcal{B}}_{\pi_\theta}(y) \, \mathbb{V} \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y)} \Big] $ This decomposition highlights two strategies for variance reduction:

  1. Decreasing the variance of each individual ELBO estimation ($\mathbb{V} \widehat{\mathcal{B}}_\pi(y)$).

  2. Increasing the correlation between the ELBO estimates of the current policy and the reference policy for the same input $y$ ($\mathrm{Corr}(\widehat{\mathcal{B}}_{\pi_\theta}(y), \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y))$).

    VRPO integrates three principled techniques to implement these strategies:

4.2.4.1. Technique 1: Sampling Budget

The first technique is to increase the total number of samples $n = n_t \times n_{y_t}$ used to estimate each ELBO $\widehat{\mathcal{B}}_\pi(y)$. More samples inherently lead to lower variance, at a proportionally higher computational cost (e.g., $n=8$ requires 8x the mask-prediction FLOPs of $n=1$).

4.2.4.2. Technique 2: Optimal Allocation

The second technique, applied when a fixed total budget $n$ is available, is to allocate this budget optimally. Specifically, it involves setting $n_t = n$ and $n_{y_t} = 1$. This means sampling $n$ distinct timesteps (or numbers of masked tokens $l$) and only one masked data configuration per timestep.

Proposition 1 (Reduce the ELBO variance): Given a total budget of $n = n_t \times n_{y_t}$ masked samples for estimating $\widehat{\mathcal{B}}_\pi(y)$, we have: (i) $ \mathbb{V} \widehat{\mathcal{B}}_\pi(y) = \Theta\left(\frac{1}{n}\right) $ (ii) $\mathbb{V} \widehat{\mathcal{B}}_\pi(y)$ is minimized when $n_t = n, n_{y_t} = 1$. Here:

  • $\Theta\left(\frac{1}{n}\right)$ indicates that the variance of the ELBO estimator is inversely proportional to the total number of samples $n$. This formally justifies increasing the sampling budget.
  • The second part states that to minimize this variance for a fixed $n$, one should prioritize sampling more timesteps ($n_t$) over more masked data configurations per timestep ($n_{y_t}$). This is because the variance of the ELBO estimator (from Lemma 5 in Appendix B.1.5) is given by: $ \mathbb{V} \widehat{\mathcal{B}}_\pi(y) = \frac{1}{n_t} V_t + \frac{1}{n_t n_{y_t}} V_{y_t} $ where $V_t$ is the variance across timesteps and $V_{y_t}$ is the variance due to masked data at each step. With a fixed budget $n = n_t n_{y_t}$, the second term equals $\frac{V_{y_t}}{n}$ regardless of the allocation, while the first term $\frac{V_t}{n_t}$ is minimized by making $n_t$ as large as possible, i.e., $n_t = n$ and $n_{y_t} = 1$, giving a total variance of $\frac{1}{n}(V_t + V_{y_t})$. This strategy effectively spreads the budget across the dominant source of variability (timesteps) and incurs no additional computational cost, as it only redistributes existing samples. A small numeric check follows this list.
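
A toy numeric check of Proposition 1(ii), using hypothetical variance components $V_t$ and $V_{y_t}$ (illustrative values, not measurements from the paper):

```python
# With V = V_t / n_t + V_yt / (n_t * n_yt) and a fixed budget n = n_t * n_yt,
# the variance is minimized by n_t = n, n_yt = 1.
V_t, V_yt, n = 2.0, 1.0, 8  # hypothetical variance components and budget

for n_t in (1, 2, 4, 8):
    n_yt = n // n_t
    var = V_t / n_t + V_yt / (n_t * n_yt)
    print(f"n_t={n_t}, n_yt={n_yt}: variance = {var:.3f}")
# The variance decreases as n_t grows, reaching (V_t + V_yt) / n at n_t = n.
```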

4.2.4.3. Technique 3: Antithetic Sampling

The third technique involves sharing the same sampled timesteps and masked data between the ELBO estimates of the current policy ($\pi_\theta$) and the reference policy ($\pi_{\mathrm{ref}}$) for the same input $y_w$ or $y_l$.

Proposition 2 (Antithetic Sampling): When $\mathrm{Corr}\left(\widehat{\mathcal{B}}_{\pi_\theta}(y), \widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y)\right) > 0$ (i.e., the ELBO estimates for the two policies are positively correlated) and the Monte Carlo samples $S_t$ and $\{S_{y_{t^{(j)}}|y}\}_{j=1}^{n_t}$ are shared between $\widehat{\mathcal{B}}_{\pi_\theta}(y)$ and $\widehat{\mathcal{B}}_{\pi_{\mathrm{ref}}}(y)$, sharing Monte Carlo samples yields lower $\mathbb{V} \hat{s}_\theta(y_w, y_l)$ than using independent samples. Here:

  • $\mathrm{Corr}(\cdot, \cdot)$ denotes the correlation coefficient. The proposition states that if the ELBO estimates from the current and reference policies are positively correlated (which is common, especially since $\pi_{\mathrm{ref}}$ is typically an SFT model close to $\pi_\theta$), reusing the same random samples for both estimates leads to lower variance in their difference, and thus in the overall score estimator. This is a classic antithetic-variates-style technique and provides a free lunch, as it reuses existing samples without additional computation; a code sketch follows this list.
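
A sketch of shared-sample (antithetic) estimation of a paired ELBO, under assumed interfaces: `loss_theta` and `loss_ref` stand for the per-step mask-prediction losses of $\pi_\theta$ and $\pi_{\mathrm{ref}}$ and are hypothetical callables, not the paper's implementation.

```python
import torch

def shared_sample_elbo_pair(y: torch.Tensor, mask_id: int, loss_theta, loss_ref, n_t: int = 8):
    # The same timesteps and masked sequences are reused for both policies
    # (n_yt = 1, per optimal allocation), so the two ELBO estimates are
    # positively correlated and the variance of their difference shrinks.
    b_theta, b_ref = 0.0, 0.0
    for _ in range(n_t):
        t = float(torch.rand(1))
        keep = torch.rand(y.shape) >= t
        y_masked = torch.where(keep, y, torch.full_like(y, mask_id))
        b_theta += loss_theta(y_masked, t)   # same (t, y_t) fed to both policies
        b_ref += loss_ref(y_masked, t)
    return b_theta / n_t, b_ref / n_t
```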

The conceptual flow of VRPO is illustrated in Figure 4:

Figure 4: Illustration of the analysis process. This diagram outlines the conceptual flow that leads to the proposed VRPO method. Gray boxes represent theoretical analyses, and the blue box highlights the final sampling strategy. Starting from a bias and variance analysis of the estimated loss and gradient, we identify the score-estimator variance as a dominant controller. These theoretical findings collectively motivate the design of the VRPO algorithm, which is equipped with provable properties (dashed lines): unbiasedness and guaranteed variance reduction.

The VRPO algorithm is further illustrated in Figure 3:

Figure 3: Illustration of VRPO. We compare VRPO (right) with VRPO without optimal allocation and antithetic sampling (left). VRPO allocates the sampling budget across timesteps to sample only one masked data per timestep (indicated by red arrows) and shares Monte Carlo samples between paired ELBOs (highlighted with the red annotations above the blocks).

Proposition 4 (Unbiasedness of VRPO): The paper confirms that all these variance reduction techniques (sampling budget, optimal allocation, and antithetic sampling) do not introduce bias into the preference score estimator, maintaining its unbiasedness for the true preference score.

4.2.4.4. Deferred Analysis of Estimated Gradient

The paper also extends this analysis to the gradient of the DPO loss. Assumption 1 (Bounded gradient of per-step mask prediction loss): The gradient of the per-step mask prediction loss $\ell_{\pi_\theta}(y_t, t, y)$ (Eq. (4)) is bounded, i.e., there exists a constant $0 \leq C < \infty$ such that $\|\nabla_\theta \ell_{\pi_\theta}(y_t, t, y)\|_2 \leq C$ for all $\theta$ in the model parameter space, $y$ in $\mathcal{D}$, and $t \in [0,1]$. This assumption is practically reasonable for neural networks.

Corollary 1 (Bounded gradient of preference score estimator): Under Assumption 1, the gradient of the preference score estimator $\hat{s}_\theta(y_w, y_l)$ is bounded, i.e., there exists a constant $0 \leq \tilde{C} < \infty$ such that $\|\nabla_\theta \hat{s}_\theta(y_w, y_l)\|_2 \leq \tilde{C}$ for all $\theta$ in the model parameter space and $(y_w, y_l)$ in $\mathcal{D}$. This follows from the linearity of gradients and the boundedness of the individual loss gradients.

Theorem 4: Suppose Assumption 1 holds. Then, there exists a constant $0 \leq \tilde{C} < \infty$ such that, given a pair of preference data $y_w, y_l$, the bias and variance of $\nabla_\theta \widehat{\ell}_{\mathrm{DPO-E}}$ can be bounded as: $ \mathbb{E}_{S_{\hat{s}|y_w, y_l}} \big[ \| \nabla_\theta \ell_{\mathrm{DPO-E}}(y_w, y_l; \theta) - \nabla_\theta \widehat{\ell}_{\mathrm{DPO-E}}(y_w, y_l; \theta) \|_2 \big] \leq \frac{\tilde{C}}{4} \sqrt{\mathbb{V}_{S_{\hat{s}|y_w, y_l}} \hat{s}_\theta(y_w, y_l)} + \sqrt{\mathrm{tr}\, \mathbb{V}_{S_{\hat{s}|y_w, y_l}} \nabla_\theta \hat{s}_\theta(y_w, y_l)} $ and $ \mathrm{tr}\, \mathbb{V}_{S_{\hat{s}|y_w, y_l}} \left[ \nabla_\theta \widehat{\ell}_{\mathrm{DPO-E}}(y_w, y_l; \theta) \right] \leq \frac{\tilde{C}^2}{8} \mathbb{V}_{S_{\hat{s}|y_w, y_l}} \hat{s}_\theta(y_w, y_l) + \mathrm{tr}\, \mathbb{V}_{S_{\hat{s}|y_w, y_l}} \nabla_\theta \hat{s}_\theta(y_w, y_l) $ Here:

  • $\mathbb{E}_{S_{\hat{s}|y_w, y_l}} [\dots]$ is the expected absolute bias of the gradient estimator.
  • $\mathrm{tr}\, \mathbb{V}_{S_{\hat{s}|y_w, y_l}} [\dots]$ is the trace of the covariance matrix of the gradient estimator (a measure of its total variance).
  • $\nabla_\theta \ell_{\mathrm{DPO-E}}$ is the true gradient of the DPO loss.
  • $\nabla_\theta \widehat{\ell}_{\mathrm{DPO-E}}$ is the estimated gradient of the DPO loss. This theorem demonstrates that the bias and variance of the estimated gradient are also bounded by the variance of the score estimator ($\mathbb{V} \hat{s}_\theta$) and the variance of its gradient ($\mathbb{V} \nabla_\theta \hat{s}_\theta$).

Proposition 5 (Sampling budget and allocation for gradient variance): Let $\widehat{\mathcal{B}}_\pi(y)$ be estimated using a total of $n = n_t \times n_{y_t}$ masked samples. Then we have: (i) $ \mathbb{V} \nabla_\theta \widehat{\mathcal{B}}_\pi(y) = \Theta\left(\frac{1}{n}\right) $ (ii) $\mathbb{V} \nabla_\theta \widehat{\mathcal{B}}_\pi(y)$ is minimized when $n_t = n$ and $n_{y_t} = 1$ with a fixed $n$. This proposition, analogous to Proposition 1, confirms that increasing the sampling budget and applying optimal allocation also effectively reduce the variance of the gradient of the ELBO estimator, and consequently the variance of the overall DPO gradient.

4.2.5. Extension to Other Alignment Methods

The techniques and analysis of VRPO are broadly applicable beyond DPO. Other algorithms like PPO and GRPO also rely on estimating likelihood terms ($\pi(y|x)$) or likelihood-ratio terms ($\frac{\pi_1(y|x)}{\pi_2(y|x)}$), which, when applied to MDMs, would similarly involve ELBO-based estimations. The variance reduction strategies can be directly applied to these ELBO estimations, often with simpler analysis because these methods do not involve the outer non-linear log sigmoid function that complicates DPO's theoretical guarantees.

5. Experimental Setup

5.1. Datasets

The authors trained LLaDA 8B Instruct using VRPO on 350K preference pairs for one epoch to produce LLaDA 1.5.

  • Source: The data was collected internally at scale by the authors, implying proprietary or custom datasets.
  • Processing: The data underwent several steps:
    1. Filtering out low-quality samples.
    2. Removing duplicates via similarity matching.
    3. Using a reward model to rank data, suggesting that some form of preference or quality scoring was applied to select high-quality pairs.
    4. Replacing some responses with outputs from advanced LLMs, potentially to further enhance quality or diversity.
  • Characteristics and Domain: The dataset is described as high-quality and diverse, covering a wide range of topics:
    • Creative writing: ~35%
    • Knowledge Q&A: ~18%
    • NLP tasks: ~16%
    • Mathematics tasks: ~14%
    • Recommendation tasks: ~7%
    • Code generation: ~5%
    • Reasoning tasks: ~3%
    • A small portion of safety and other tasks.
  • Choice Rationale: These datasets are crucial for validating the general capabilities of the proposed method across various domains where LLMs are typically evaluated, ensuring that VRPO improves alignment across a broad spectrum of human preferences.

5.2. Evaluation Metrics

For every evaluation metric, a conceptual definition is provided, along with its mathematical formula where applicable.

5.2.1. Mathematics & Scientific Reasoning

These benchmarks assess the model's ability to solve mathematical problems and answer scientific questions.

5.2.1.1. GSM8K (Grade School Math 8K)

  • Conceptual Definition: GSM8K evaluates a model's ability to solve grade school level math word problems. The problems often require multiple steps of reasoning to arrive at the correct numerical answer.
  • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Problems}} $
  • Symbol Explanation:
    • Number of Correct Answers: The count of problems where the model's final answer matches the ground truth.
    • Total Number of Problems: The total number of math problems in the GSM8K dataset.

5.2.1.2. Math

  • Conceptual Definition: The MATH dataset contains a broad set of mathematics problems from various school levels (e.g., algebra, geometry, number theory). It's designed to be challenging and requires advanced mathematical reasoning.
  • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Problems}} $
  • Symbol Explanation:
    • Number of Correct Answers: The count of problems where the model's final answer matches the ground truth.
    • Total Number of Problems: The total number of math problems in the MATH dataset.

5.2.1.3. GPQA (Graduate-level Google-Proof Q&A)

  • Conceptual Definition: GPQA is a challenging question-answering benchmark comprising graduate-level questions from biology, physics, and chemistry. It's designed to be "Google-proof," meaning answers are not easily found with simple web searches and require deep scientific understanding and reasoning.
  • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Questions}} $
  • Symbol Explanation:
    • Number of Correct Answers: The count of questions where the model's answer is factually correct.
    • Total Number of Questions: The total number of questions in the GPQA dataset.

5.2.2. Code Generation

These benchmarks evaluate the model's ability to generate functional code.

5.2.2.1. HumanEval

  • Conceptual Definition: HumanEval assesses a model's ability to generate Python code that solves specific programming problems. Each problem comes with a function signature and docstrings, and the model must complete the function body. Solutions are evaluated by executing test cases.
  • Mathematical Formula (Pass@k): The typical metric is Pass@k, which estimates the probability that at least one of kk generated samples passes the unit tests. $ \mathrm{Pass@k} = \frac{\sum_{i=1}^{N} \mathbf{1}[\text{at least one of k samples for problem i passes}]}{N} $
  • Symbol Explanation:
    • $N$: The total number of problems in the HumanEval benchmark.
    • $\mathbf{1}[\text{condition}]$: An indicator function that is 1 if the condition is true, and 0 otherwise.
    • $k$: The number of code samples generated by the model for each problem. A minimal sketch of this empirical Pass@k computation follows this list.
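
A minimal sketch of the empirical Pass@k computation as written above (the nested boolean lists are hypothetical per-problem unit-test outcomes):

```python
def pass_at_k(passed_per_problem):
    # passed_per_problem[i] holds the pass/fail outcomes of the k samples
    # generated for problem i; a problem counts if any sample passes.
    return sum(any(samples) for samples in passed_per_problem) / len(passed_per_problem)

print(pass_at_k([[True, False], [False, False], [True, True]]))  # 2/3
```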

5.2.2.2. MBPP (Mostly Basic Python Programs)

  • Conceptual Definition: MBPP is a dataset of Python programming problems designed to evaluate the capability of models to generate short, correct Python programs from natural language descriptions.
  • Mathematical Formula (Pass@k): Similar to HumanEval, Pass@k is used. $ \mathrm{Pass@k} = \frac{\sum_{i=1}^{N} \mathbf{1}[\text{at least one of k samples for problem i passes}]}{N} $
  • Symbol Explanation:
    • $N$: The total number of problems in the MBPP benchmark.
    • $\mathbf{1}[\text{condition}]$: An indicator function.
    • $k$: The number of code samples generated by the model for each problem.

5.2.3. Alignment Tasks

These benchmarks evaluate how well the model adheres to instructions, avoids harmful content, and generally aligns with human preferences in open-ended conversations.

5.2.3.1. IFEval (Instruction-Following Evaluation)

  • Conceptual Definition: IFEval measures a model's ability to follow verifiable instructions, i.e., prompts with explicit, checkable constraints on output format, length, content, and style (e.g., "answer in exactly three bullet points"). It assesses whether the model's response satisfies all specified directives.
  • Evaluation Method: Adherence is verified with rule-based, programmatic checks rather than subjective judgment. The reported score is the percentage of instructions (or prompts) whose constraints are correctly satisfied.

5.2.3.2. Arena-Hard

  • Conceptual Definition: Arena-Hard is a benchmark of challenging, open-ended user prompts curated from real Chatbot Arena conversations, selected because they are difficult and discriminative for instruction-following models.
  • Evaluation Method: Uses LLM-as-a-judge evaluation, where a powerful reference LLM (e.g., GPT-4) compares each model's response against a fixed baseline model's response; the reported score is typically the resulting win rate.

5.2.3.3. AlignBench

  • Conceptual Definition: AlignBench is specifically designed for benchmarking Chinese alignment of large language models. It covers various aspects of helpfulness, harmlessness, and adherence to cultural norms in Chinese language contexts.
  • Evaluation Method: Uses an LLM-as-a-judge approach, where a powerful LLM evaluates the quality, alignment, and helpfulness of responses to Chinese prompts.

5.2.3.4. MTBench

  • Conceptual Definition: MTBench is a multi-turn benchmark for evaluating conversational AI models. It consists of multi-turn prompts covering various categories (e.g., writing, reasoning, roleplay) and assesses a model's ability to maintain coherence, provide relevant information, and follow instructions over multiple turns.
  • Evaluation Method: Relies on LLM-as-a-judge (e.g., GPT-4) to score the quality of model responses in a multi-turn dialogue. The scores are often averaged across prompts and turns.

5.2.4. LLM-as-a-Judge

For metrics like Arena-Hard, AlignBench, and MTBench, the paper mentions that results are obtained via LLM-as-a-judge scoring using the gpt-4-32k API.

  • Conceptual Definition: LLM-as-a-judge is an evaluation paradigm where a powerful, larger language model (the "judge" LLM) is used to rate or rank the outputs of other LLMs. The judge LLM is given the prompt, the generated responses from different models, and sometimes a rubric, and then it produces a score or a preference ranking. This method attempts to automate human-like evaluation for complex, open-ended generation tasks where traditional metrics are insufficient.
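A minimal sketch of the paradigm is given below. The prompt template, the `query_judge` callable, and the 1-10 scale are illustrative assumptions; they are not the paper's actual judging pipeline (which relies on the gpt-4-32k API):

```python
import re
from typing import Callable

JUDGE_TEMPLATE = """You are an impartial judge. Rate the assistant's response to the
user prompt on a 1-10 scale for helpfulness, correctness, and instruction adherence.
Reply with "Rating: <score>" only.

[User Prompt]
{prompt}

[Assistant Response]
{response}
"""

def judge_response(prompt: str, response: str,
                   query_judge: Callable[[str], str]) -> float:
    """Score one model response with an external judge LLM.

    `query_judge` is a hypothetical callable that sends a text prompt to the
    judge model (e.g., a GPT-4-class model) and returns its raw text reply.
    """
    reply = query_judge(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    match = re.search(r"Rating:\s*([0-9]+(?:\.[0-9]+)?)", reply)
    if match is None:
        raise ValueError(f"Could not parse a rating from: {reply!r}")
    return float(match.group(1))
```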

5.3. Baselines

The paper compares LLaDA 1.5 against two baselines:

  • LLaDA 8B Instruct: This is the Supervised Fine-Tuned (SFT)-only predecessor model. It represents the base model before any preference optimization is applied.
  • LLaDA DPO: This baseline applies naive DPO to LLaDA. It uses a minimal sampling configuration ($n_t = 1$, $n_{y_t} = 1$) and does not use antithetic sampling. This amounts to a direct application of DPO to MDMs without the proposed variance reduction techniques, serving as a direct point of comparison that isolates the impact of VRPO.

5.4. Computational Cost

  • VRPO Sampling Budget: A default sampling budget of $n = 8$ is used for VRPO.
  • Overhead: This results in approximately an 8-fold increase in computation compared to methods without Monte Carlo estimation (e.g., ARMs) or with a minimal budget ($n = 1$).
  • Affordability: Despite this increase, the overall cost remains modest, less than 0.5% of the pre-training cost of LLaDA, making the overhead practically acceptable.
  • GPU Hours: Training consumed approximately 405 H100 GPU hours for 8 Monte Carlo samples.

5.5. Implementation Details

  • Packing Strategy: VRPO is implemented using a packing strategy where multiple preference data samples are combined into a single sequence. An attention mask is used to ensure tokens from different samples cannot attend to each other (a minimal sketch follows this list).

  • Padding: All sequences are padded to a fixed length of 4096 tokens with [EOS] (End-Of-Sequence) tokens, matching LLaDA's pre-training context length. These padded [EOS] tokens are excluded from loss calculation.

  • Model Architecture (LLaDA):

    • Transformer Encoder-based masked diffusion model with 8 billion parameters.
    • Follows LLaMA architecture: RMSNorm (Zhang and Sennrich, 2019) for normalization, RoPE (Su et al., 2024) for positional encoding, and SwiGLU (Shazeer, 2020) as the activation function. The detailed architecture is shown in Table 3: The following are the results from Table 3 of the original paper:

| | LLaDA |
| :--- | :--- |
| Layers | 32 |
| Model dimension | 4096 |
| Attention heads | 32 |
| Vocabulary size | 126,464 |
| FFN dimension | 12,288 |
| Key/Value heads | 32 |
| Total parameters | 8.02 B |
| Non-embedding parameters | 6.98 B |
  • Training Configuration:

    • Epochs: 1 epoch.
    • Batch Size: 64.
    • Optimizer: AdamW with weight decay of 0.01, $\beta_1 = 0.9$, and $\beta_2 = 0.95$.
    • Learning Rate Schedule: 15 warmup steps to a maximum learning rate of $5 \times 10^{-7}$, followed by cosine decay.
    • DPO Loss Coefficient: $\beta = 0.2$.
    • MDM SFT Loss: The DPO objective is complemented with a 0.05-weighted MDM SFT loss for training stability.
    • Reference Policy: $\pi_{\mathrm{ref}}$ is initialized with LLaDA Instruct.
    • Hyperparameter Tuning: No hyperparameter search was performed due to hardware constraints.
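The packing and padding strategy described at the start of this section can be sketched as follows. The token-ID layout, the boolean mask convention, and the function name are assumptions for illustration; only the 4096-token context length and the use of [EOS] padding excluded from the loss come from the description above:

```python
import torch
from typing import List, Tuple

def pack_sequences(samples: List[List[int]], eos_id: int,
                   max_len: int = 4096) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Pack several tokenized samples into one fixed-length training sequence.

    Returns:
      input_ids: (max_len,) packed tokens, padded with [EOS]
      attn_mask: (max_len, max_len) True where attention is allowed, i.e. only
                 between tokens that belong to the same original sample
      loss_mask: (max_len,) True for real tokens, False for padding [EOS]
    """
    input_ids = torch.full((max_len,), eos_id, dtype=torch.long)
    segment_ids = torch.full((max_len,), -1, dtype=torch.long)  # -1 marks padding

    pos = 0
    for seg, sample in enumerate(samples):
        n = min(len(sample), max_len - pos)
        input_ids[pos:pos + n] = torch.tensor(sample[:n], dtype=torch.long)
        segment_ids[pos:pos + n] = seg
        pos += n

    # Block-diagonal mask: a token may attend only to tokens of its own sample.
    same_segment = segment_ids.unsqueeze(0) == segment_ids.unsqueeze(1)
    real = segment_ids != -1
    attn_mask = same_segment & real.unsqueeze(0) & real.unsqueeze(1)
    attn_mask |= torch.eye(max_len, dtype=torch.bool)  # keep padding rows non-empty
    loss_mask = real  # padded [EOS] positions are excluded from the loss
    return input_ids, attn_mask, loss_mask

# Hypothetical usage with two short samples and eos_id = 0.
ids, attn, loss = pack_sequences([[5, 6, 7], [8, 9]], eos_id=0, max_len=8)
```

Because the MDM is a bidirectional (encoder-style) model, the mask is block-diagonal rather than causal: each token may attend to every token of its own sample and to nothing else.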

5.6. Evaluation Details

  • Sampling Strategies for MDMs: MDMs benefit from various inference sampling strategies to enhance sample quality. The paper employs:
    1. Diffusion Sampling: The standard reverse process.
    2. Diffusion Semi-Autoregressive Sampling: Generates tokens in blocks, with each block generated by the diffusion process, and blocks generated autoregressively.
    3. Low-Confidence Remasking: Remasks predicted tokens that have the lowest confidence scores during inference, and then re-predicts them.
  • [EOS] Token Confidence: A critical observation was that LLaDA SFT models tend to generate excessive [EOS] tokens, leading to incomplete content. To address this, the confidence score of predicted [EOS] tokens was set to zero during inference, which improved LLaDA's HumanEval score from 47.6 to 49.4. This setting was adopted for evaluation (a minimal sketch of the remasking step with this adjustment follows the list).
  • Optimal Inference Configurations: For fair comparison, both LLaDA and LLaDA 1.5 were evaluated using diffusion sampling and semi-autoregressive sampling, reporting the best results found by tuning answer length ({64, 128, 256, 512, 1024}) and block length for semi-autoregressive sampling ({8, 16, 32, 64, 128}). These optimal configurations are detailed in Table 6.
  • LLM-as-a-Judge Evaluation: MTBench, AlignBench, and Arena-Hard benchmarks use the gpt-4-32k API for scoring.
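As a concrete illustration of the low-confidence remasking strategy and the [EOS]-confidence adjustment described above, the sketch below performs a single reverse step. The function signature, the number of tokens committed per step, and the use of greedy per-position predictions are simplifying assumptions, not LLaDA's exact inference code:

```python
import torch

def low_confidence_remask_step(logits: torch.Tensor, current_ids: torch.Tensor,
                               mask_id: int, eos_id: int,
                               num_to_unmask: int) -> torch.Tensor:
    """One reverse step with low-confidence remasking.

    logits:      (seq_len, vocab) model predictions for the current sequence
    current_ids: (seq_len,) token ids; still-masked positions equal mask_id
    Only the num_to_unmask most confident predictions among masked positions are
    committed; the remaining positions stay masked and are re-predicted later.
    Assumes num_to_unmask does not exceed the number of masked positions.
    """
    probs = torch.softmax(logits.float(), dim=-1)
    confidence, predicted = probs.max(dim=-1)

    # [EOS] adjustment described above: give predicted [EOS] tokens zero
    # confidence so they are kept masked instead of being committed early.
    confidence = torch.where(predicted == eos_id,
                             torch.zeros_like(confidence), confidence)

    masked = current_ids == mask_id
    confidence = confidence.masked_fill(~masked, float("-inf"))

    new_ids = current_ids.clone()
    commit = confidence.argsort(descending=True)[:num_to_unmask]
    new_ids[commit] = predicted[commit]
    return new_ids
```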

5.7. Calculation of Variances (for Ablation Studies)

For the ablation studies (Table 2), the variances of the score estimator, loss, and gradient were estimated:

  • Data Points: 128 preference data samples.
  • Batch Size: Processed with a batch size of 16.
  • Independent Calculations: 8 independent calculations were performed for each data point.
  • Model Checkpoint: $\pi_{\mathrm{ref}}$ was initialized with LLaDA, and $\pi_\theta$ was a model checkpoint from the VRPO training process.
  • Gradient Variance Proxy: Due to the computational cost of storing full gradients for large models, the gradients of the up-projection layer within the Feed-Forward Network module of the first transformer block were used as a proxy for the full gradient variance.
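A minimal sketch of this variance-estimation protocol is shown below; `estimate` is a hypothetical stand-in for one stochastic evaluation of the quantity of interest (score estimator, loss, or proxy gradient) on a given data point:

```python
import torch
from typing import Callable, List

def average_estimator_variance(estimate: Callable[[int], torch.Tensor],
                               num_datapoints: int = 128,
                               num_repeats: int = 8) -> torch.Tensor:
    """Empirical variance of a stochastic estimator.

    For each data point, `estimate` is evaluated `num_repeats` times with fresh
    randomness (timesteps and masked samples), the unbiased sample variance is
    taken across the repeats, and the result is averaged over data points
    (the aggregation step is our assumption).
    """
    per_point: List[torch.Tensor] = []
    for i in range(num_datapoints):
        repeats = torch.stack([estimate(i) for _ in range(num_repeats)])
        per_point.append(repeats.var(dim=0, unbiased=True))
    return torch.stack(per_point).mean()

# Hypothetical usage: a noisy scalar estimator with standard deviation 1.
print(average_estimator_variance(lambda i: torch.randn(()), num_datapoints=4))
```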

6. Results & Analysis

6.1. Core Results Analysis

The experimental results consistently demonstrate the effectiveness of VRPO and the superior performance of LLaDA 1.5 over its SFT-only predecessor (LLaDA Instruct) and a naive DPO baseline.

The main benchmark results are presented in Table 1: The following are the results from Table 1 of the original paper:

| | LLaDA 8B Instruct | LLaDA DPO | LLaDA 1.5 8B |
| :--- | :--- | :--- | :--- |
| Post-training | SFT | SFT + naive DPO | SFT + VRPO (Ours) |
| **Mathematics & Science** | | | |
| GSM8K | 78.6 | 80.7 (+2.1) | 83.3 (+4.7) |
| Math | 42.2 | 41.6 (-0.6) | 42.6 (+0.4) |
| GPQA | 33.3 | 34.3 (+1.0) | 36.9 (+3.6) |
| **Code** | | | |
| HumanEval | 49.4 | 48.2 (-1.2) | 52.4 (+3.0) |
| MBPP | 41.0 | 41.4 (+0.4) | 42.8 (+1.8) |
| **Alignment Tasks** | | | |
| IFEval | 62.2 | 62.0 (-0.2) | 66.2 (+4.0) |
| Arena-Hard | 10.0 | 11.9 (+1.9) | 14.3 (+4.3) |
| AlignBench | 5.4 | 5.8 (+0.4) | 5.9 (+0.5) |
| MTbench | 7.2 | 7.1 (-0.1) | 7.3 (+0.1) |

Key Observations from Table 1:

  • Consistent Outperformance: LLaDA 1.5 (trained with VRPO) consistently outperforms both LLaDA 8B Instruct (SFT-only) and LLaDA DPO (naive DPO) across all evaluated benchmarks. This strong and uniform improvement underscores the effectiveness of the VRPO framework.

  • Significant Gains: The improvements are particularly significant in:

    • GSM8K (+4.7 compared to SFT-only, +2.6 compared to naive DPO).
    • GPQA (+3.6 compared to SFT-only, +2.6 compared to naive DPO).
    • HumanEval (+3.0 compared to SFT-only, +4.2 compared to naive DPO).
    • IFEval (+4.0 compared to SFT-only, +4.2 compared to naive DPO).
    • Arena-Hard (+4.3 compared to SFT-only, +2.4 compared to naive DPO).
  • Naive DPO Limitations: LLaDA DPO (naive DPO without VRPO's techniques) shows mixed results, even degrading performance on Math, HumanEval, IFEval, and MTBench compared to the SFT-only baseline. This highlights the critical issue of high variance in MDM alignment and validates the paper's central motivation. Simple application of DPO is not sufficient for MDMs.

  • Mathematical Prowess: LLaDA 1.5 demonstrates strong mathematical performance. As also shown in the right panel of Figure 1, it achieves competitive results compared to other strong language MDMs and ARMs, and notably achieves the highest four-shot score on GSM8K and the highest zero-shot score on Math among the compared models.


Figure 1: Benchmark results. The left panel shows that LLaDA 1.5 improves LLaDA consistently and significantly on various benchmarks. The right panel demonstrates that LLaDA 1.5 has a highly competitive mathematical performance compared to strong language MDMs and ARMs.

6.2. Ablation Studies

The paper conducts detailed ablation studies to assess the impact of each variance reduction technique within VRPO. These studies manipulate the sampling budget ($n$), the allocation strategy ($n_t$ vs. $n_{y_t}$), and the use of antithetic sampling. The results confirm the theoretical analyses presented in Section 3. The following are the results from Table 2 of the original paper:

| | Base | Budget | Budget | Allocation | Allocation | Antithetic |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| # Timesteps $n_t$ | 4 | 1 | 8 | 1 | 2 | 4 |
| # Masked samples $n_{y_t}$ | 1 | 1 | 1 | 4 | 2 | 1 |
| Antithetic sampling | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| **Variances** | | | | | | |
| Var of score estimator | 2.2 | 44.0 | 1.0 | 7.3 | 4.7 | 2183.7 |
| Var of loss | 3.1e-3 | 8.7e-2 | 2.6e-3 | 3.2e-2 | 7.3e-3 | 62.0 |
| Var of gradient | 2.5 | 13.0 | 1.6 | 4.7 | 2.5 | 10.6 |
| **Mathematics & Science** | | | | | | |
| GSM8K | 82.8 | 80.1 | 83.3 | 81.4 | 82.3 | 82.0 |
| Math | 42.3 | 41.7 | 42.6 | 41.9 | 42.4 | 42.4 |
| GPQA | 36.4 | 34.3 | 36.9 | 34.9 | 36.4 | 35.9 |
| **Code** | | | | | | |
| HumanEval | 51.2 | 50.6 | 52.4 | 48.2 | 48.8 | 47.0 |
| MBPP | 42.8 | 40.6 | 42.8 | 40.8 | 41.0 | 41.2 |
| **Alignment Tasks** | | | | | | |
| IFEval | 66.1 | 63.9 | 66.2 | 64.8 | 66.2 | 65.8 |
| Arena-Hard | 13.9 | 13.5 | 14.3 | 13.8 | 13.4 | 15.6 |
| AlignBench | 5.9 | 5.6 | 5.9 | 5.8 | 5.9 | 5.9 |
| MTbench | 7.4 | 7.0 | 7.3 | 7.0 | 7.2 | 7.2 |

Key Observations from Table 2:

  • Impact of Score Estimator Variance: The results strongly support Theorem 1. Lower variance of the score estimator ($\mathbb{V} \hat{s}_\theta$) generally leads to lower variance in both the loss and the gradient, as well as improved task performance. For example, reducing $\mathbb{V} \hat{s}_\theta$ from 44.0 to 1.0 (by increasing the budget from $n=1$ to $n=8$) correlates with a significant jump in GSM8K accuracy from 80.1 to 83.3.
  • Increasing Sampling Budget: Increasing the sampling budget ($n = n_t \times n_{y_t}$) consistently reduces estimator variance and improves performance. Comparing "Budget $n_t=1$, $n_{y_t}=1$" (naive DPO, $n=1$) to "Budget $n_t=8$, $n_{y_t}=1$" (optimal allocation, $n=8$), $\mathbb{V} \hat{s}_\theta$ drops from 44.0 to 1.0, and GSM8K improves from 80.1 to 83.3. This validates Proposition 1 (i).
  • Optimal Allocation: Under a fixed sampling budget, optimal allocation (e.g., "Base" with $n_t=4$, $n_{y_t}=1$) generally yields lower variance and better results than drawing multiple masked samples per timestep (e.g., "Allocation $n_t=1$, $n_{y_t}=4$"). For a budget of $n=4$, "Base" ($n_t=4$, $n_{y_t}=1$) has $\mathbb{V} \hat{s}_\theta = 2.2$, while "Allocation $n_t=1$, $n_{y_t}=4$" has $\mathbb{V} \hat{s}_\theta = 7.3$. The "Base" configuration performs better on most tasks, supporting Proposition 1 (ii).
  • Antithetic Sampling: Antithetic sampling (✓ vs. ✗ in the last column) leads to notable decreases in variance. Comparing "Base" (antithetic ✓, $\mathbb{V} \hat{s}_\theta = 2.2$) to "Antithetic" (antithetic ✗, $\mathbb{V} \hat{s}_\theta = 2183.7$), the variance reduction is dramatic. This confirms Proposition 2; a minimal sketch of the shared-noise scheme is given after this list.
    • Performance vs. Variance Trade-off: Interestingly, while antithetic sampling drastically reduces variance, the direct translation to downstream benchmark improvements is not always as dramatic. For instance, "Antithetic" (without antithetic sampling) achieves a higher Arena-Hard score (15.6) than "Base" (13.9), despite significantly higher variance. The authors hypothesize that disabling antithetic sampling might expose the model to a broader diversity of data patterns, which could benefit certain downstream tasks, suggesting a complex interplay between optimization stability and generalization.
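In the spirit of the two strategies just discussed (spending the whole budget on $n_t$ timesteps with $n_{y_t}=1$, and sharing the sampled timesteps and masked sequences between the policy and reference ELBO estimates), a minimal sketch is given below. The `elbo_term` and `sample_t_and_mask` callables and their signatures are hypothetical placeholders, not the paper's implementation:

```python
import torch
from typing import Callable, List, Tuple

# Hypothetical interfaces: `elbo_term` returns the per-sample ELBO term of a model
# for response y at timestep t with masked version y_t; `sample_t_and_mask` draws
# one (t, y_t) pair for y. Neither is the paper's actual implementation.
ElboTerm = Callable[[torch.nn.Module, torch.Tensor, float, torch.Tensor], torch.Tensor]
Sampler = Callable[[torch.Tensor], Tuple[float, torch.Tensor]]

def vr_score_estimate(policy: torch.nn.Module, reference: torch.nn.Module,
                      y: torch.Tensor, elbo_term: ElboTerm,
                      sample_t_and_mask: Sampler, n_t: int = 8) -> torch.Tensor:
    """Variance-reduced estimate of the ELBO-based score log pi_theta(y) - log pi_ref(y).

    * The whole budget is spent on n_t distinct timesteps, with a single masked
      sample per timestep (n_{y_t} = 1).
    * The same (t, y_t) draws are reused for the policy and the reference model,
      so their shared sampling noise largely cancels in the difference.
    """
    terms: List[torch.Tensor] = []
    for _ in range(n_t):
        t, y_t = sample_t_and_mask(y)                  # one timestep + one masked sample
        policy_term = elbo_term(policy, y, t, y_t)
        with torch.no_grad():                          # reference model is frozen
            ref_term = elbo_term(reference, y, t, y_t)  # same (t, y_t): shared noise
        terms.append(policy_term - ref_term)
    return torch.stack(terms).mean()
```

Because the same draws enter both ELBO estimates, their common randomness cancels in the difference; this is the mechanism behind the large drop in score-estimator variance observed in Table 2.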

6.3. Loss Curves

Figure 5 illustrates the training loss dynamics for different configurations, providing a visual complement to the quantitative results in Table 2.

Figure 5: Loss curves under different variance reduction strategies. Top: w/ antithetic sampling; bottom: w/o antithetic sampling. The curve labeled "w/o antithetic sampling, $n_t = 1$, $n_{y_t} = 1$" corresponds to the training loss of the naive DPO baseline reported in Table 1; all other curves come from the ablation study in Table 2, obtained by varying the number of timesteps $n_t$, the number of masked samples $n_{y_t}$, and whether antithetic sampling is applied. We present two panels because the loss magnitudes differ substantially across settings. For visual clarity, all curves are smoothed with an exponential moving average with coefficient 0.3.

Key Observations from Figure 5:

  • Smoother and Lower Variability: Configurations with variance reduction strategies applied (e.g., higher ntn_t, optimal allocation, antithetic sampling) result in smoother loss curves with substantially lower variability. This directly demonstrates the improved stability of the optimization dynamics of MDMs.
  • Faster Convergence and Lower Final Loss: The smoother curves also tend to show a faster decrease in loss and often lead to a lower final loss value. This is consistent with the theoretical expectation that reduced gradient variance leads to more efficient and effective optimization.
  • Impact of Antithetic Sampling: The two panels (with/without antithetic sampling) clearly show the large magnitude difference in loss, especially the initial spikes. Disabling antithetic sampling drastically increases the loss variability, making training much noisier.

6.4. Sampling Strategies Ablation

The paper further explores the generality of VRPO by evaluating LLaDA and LLaDA 1.5 across different sampling strategies for MDM inference. The following are the results from Table 4 of the original paper:

| | LLaDA 8B Instruct | LLaDA 1.5 8B |
| :--- | :--- | :--- |
| **GSM8K** | | |
| Diffusion Sampling | 53.2 | 55.7 |
| Low-Confidence Remasking | 69.4 | 70.3 |
| Semi-Autoregressive Sampling | 78.6 | 83.3 |
| **HumanEval** | | |
| Diffusion Sampling | 12.2 | 17.1 |
| Low-Confidence Remasking | 49.4 | 47.0 |
| Semi-Autoregressive Sampling | 47.6 | 52.4 |
| **IFEval** | | |
| Diffusion Sampling | 55.2 | 59.4 |
| Low-Confidence Remasking | 62.2 | 60.1 |
| Semi-Autoregressive Sampling | 61.7 | 66.2 |

Key Observations from Table 4:

  • LLaDA 1.5 consistently shows performance gains over LLaDA 8B Instruct across most sampling strategies.
  • The optimal sampling strategies vary by task (e.g., Semi-Autoregressive Sampling is best for GSM8K and IFEval for both models, while Low-Confidence Remasking is strong for HumanEval for LLaDA SFT, but Semi-Autoregressive Sampling is best for LLaDA 1.5). This indicates that the choice of inference strategy is crucial and task-dependent for MDMs.
  • The improvements from VRPO are not tied to a single inference method but are generally applicable, demonstrating the robustness of the alignment technique.

6.5. Training Randomness

To assess the stability and reliability of VRPO, the authors retrained LLaDA using VRPO with two additional random seeds, resulting in three independent runs. The following are the results from Table 5 of the original paper:

| Task | LLaDA | LLaDA 1.5 |
| :--- | :--- | :--- |
| GSM8K | 78.6 | 82.9 ± 0.6 (95% CI: [81.4, 84.3]) |
| Math | 42.2 | 43.0 ± 0.3 (95% CI: [42.2, 43.8]) |
| GPQA | 33.3 | 35.7 ± 1.0 (95% CI: [33.1, 38.3]) |
| HumanEval | 49.4 | 52.0 ± 0.7 (95% CI: [50.3, 53.7]) |
| MBPP | 41.0 | 42.3 ± 0.8 (95% CI: [40.4, 44.1]) |
| IFEval | 62.2 | 65.1 ± 0.9 (95% CI: [62.8, 67.4]) |

Key Observations from Table 5:

  • LLaDA 1.5 consistently outperforms LLaDA (SFT-only) across all benchmarks with higher mean scores.
  • The standard deviations for LLaDA 1.5 performance are small (e.g., 0.6 for GSM8K, 0.3 for Math), indicating stable performance across different training runs (random seeds). This suggests that VRPO is a robust optimization method, not overly sensitive to initialization or stochasticity.
  • For several tasks (e.g., GSM8K, HumanEval, and IFEval), the 95% confidence intervals for LLaDA 1.5 lie entirely above the corresponding LLaDA scores, providing statistical evidence of consistent improvements and supporting the reliability and effectiveness of VRPO (a short check of the interval computation follows).
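As a quick sanity check on how such intervals are typically formed from three runs, the sketch below computes a two-sided t-based 95% confidence interval; treating the reported ±0.6 as the sample standard deviation (and the use of a t-interval itself) is our assumption:

```python
import math
from scipy import stats

def t_confidence_interval(mean: float, std: float, n: int, level: float = 0.95):
    """Two-sided t-based confidence interval for the mean of n independent runs."""
    half_width = stats.t.ppf(0.5 + level / 2, df=n - 1) * std / math.sqrt(n)
    return mean - half_width, mean + half_width

# GSM8K over three seeds: mean 82.9, std 0.6 -> approximately the reported
# [81.4, 84.3], up to rounding of the published mean and standard deviation.
print(t_confidence_interval(82.9, 0.6, 3))  # ≈ (81.4, 84.4)
```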

6.6. Inference Configurations

The following are the results from Table 6 of the original paper:

| | LLaDA 8B Instruct | | LLaDA 1.5 8B | |
| :--- | :--- | :--- | :--- | :--- |
| | Block length | Answer length | Block length | Answer length |
| GSM8K | 8 | 256 | 16 | 256 |
| Math | 64 | 512 | 128 | 1024 |
| GPQA | 64 | 64 | 16 | 256 |
| HumanEval | 512 | 512 | 32 | 512 |
| MBPP | 256 | 256 | 32 | 512 |
| IFEval | 512 | 512 | 32 | 512 |
| Arena-Hard | 128 | 1024 | 128 | 1024 |
| AlignBench | 32 | 512 | 32 | 512 |
| MTbench | 32 | 512 | 16 | 256 |

Key Observations from Table 6:

  • The optimal inference configurations (block length and answer length) vary significantly across different benchmarks for both LLaDA and LLaDA 1.5. This highlights the importance of inference tuning for MDMs to achieve optimal performance on specific tasks.
  • For tasks like GSM8K, GPQA, HumanEval, MBPP, and IFEval, LLaDA 1.5 tends to prefer smaller block lengths or larger answer lengths compared to LLaDA Instruct, suggesting a more efficient or flexible generation process due to improved alignment. A block length smaller than the answer length indicates the use of diffusion semi-autoregressive sampling.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper rigorously investigates the challenges of aligning Masked Diffusion Models (MDMs) with human preferences, identifying the high variance and bias in Evidence Lower Bound (ELBO)-based likelihood estimates as a critical bottleneck. To address this, the authors propose Variance-Reduced Preference Optimization (VRPO), a framework grounded in theoretical analysis of the score-estimator variance that drives these errors. VRPO introduces practical, unbiased variance reduction strategies, including increasing the Monte Carlo sampling budget, implementing optimal allocation of samples across timesteps, and employing antithetic sampling between policy and reference model ELBOs. Empirical validation with LLaDA 1.5—an 8B-parameter MDM—demonstrates consistent and significant performance improvements over its SFT-only predecessor across diverse benchmarks encompassing mathematical reasoning, code generation, and general alignment tasks. These results confirm VRPO's effectiveness at a large scale, establishing a robust foundation for the future development of language MDMs.

7.2. Limitations & Future Work

7.2.1. Limitations

  • Misuse Risks: The Ethics Statement explicitly acknowledges that despite focusing on alignment for helpfulness, misuse risks remain. Models might still generate discriminatory, biased, or harmful content. While preference data curation and filtering are employed, these risks are inherent to LLMs and require ongoing vigilance.
  • Computational Cost of Sampling: While deemed acceptable (less than 0.5% of the pre-training cost for $n=8$), increasing the sampling budget for ELBO estimation increases computational overhead by a factor of $n$. For even larger budgets or more frequent sampling, this could become a more significant constraint.
  • Generalization vs. Variance Reduction Trade-off (Antithetic Sampling): The ablation study notes that while antithetic sampling drastically reduces variance, its direct impact on downstream benchmark performance is not always a proportional improvement. This suggests other factors (like data diversity exposure) might influence generalization, a complex aspect not fully controlled.

7.2.2. Future Work

  • Extension to Broader RL-based Alignment Algorithms: The paper explicitly states that its variance reduction techniques and analysis are not limited to DPO but can be extended to other RL-based alignment algorithms such as PPO and GRPO. This indicates a promising avenue for further research to generalize VRPO across the entire RLHF ecosystem for MDMs.
  • Further Enhancement of MDMs: The work lays the groundwork for future research to further enhance MDMs' performance, implying continued exploration of architectural improvements, training techniques, and alignment strategies.
  • Exploring Generalization Factors: The observation about antithetic sampling's effect on downstream performance suggests a need to better understand the interplay between variance reduction, optimization stability, and factors that contribute to model generalization.

7.3. Personal Insights & Critique

This paper provides a crucial theoretical and empirical advancement for Masked Diffusion Models in the context of preference optimization.

  • Theoretical Rigor for Practical Problems: A key strength is its principled approach. Instead of empirically trying different methods, the authors first perform a formal analysis to understand why MDM alignment is difficult (high variance in ELBO estimates). This theoretical grounding (Theorem 1 and the accompanying propositions) is highly valuable because it directs the solution toward fundamental issues rather than superficial fixes. This mindset of understanding the root cause before prescribing a solution is a strong example for research.
  • "Free Lunch" Techniques: The identification of optimal budget allocation and antithetic sampling as techniques that reduce variance without additional computational cost is particularly elegant. In a field constantly battling with computational constraints, such "free lunches" are immensely valuable.
  • Importance of Reference Model in DPO: The reliance on log-likelihood ratios and the interaction between the current policy and reference policy are central to DPO. The paper's explicit strategy of antithetic sampling leveraging the correlation between these two policies highlights a nuanced understanding of DPO's mechanics for MDMs.
  • Potential for Broader Impact: The claim that VRPO's techniques are transferable to PPO and GRPO for MDMs is significant. If validated, it means this framework could become a standard component in the RLHF pipeline for all MDMs, accelerating their development and adoption.
  • Critique on Generalization: The observation regarding antithetic sampling and downstream performance is thought-provoking. While variance reduction is typically seen as universally beneficial for optimization, the suggestion that some noise (or rather, the diverse sampling patterns introduced by not using antithetic sampling) might aid generalization on certain benchmarks opens up an interesting research question. It implies that there might be a subtle balance to strike between purely reducing gradient noise and ensuring sufficient exploration of the data manifold for robust generalization. This could be an area for future work, perhaps exploring adaptive variance reduction or targeted noise injection if empirical generalization benefits are consistently observed.
  • Comparison to ARMs: The consistent outperformance of LLaDA 1.5 over LLaDA SFT-only and the naive LLaDA DPO baseline, coupled with its competitiveness against strong ARMs, strongly suggests that MDMs, when properly aligned, can be viable alternatives to ARMs, potentially offering unique advantages in aspects like parallel generation or robustness. This work significantly closes the alignment gap for MDMs.
