Large Language Diffusion Models

Published: 02/14/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

LLaDA, a diffusion-based large language model, uses a forward masking process and a reverse generation process, parameterized by a Transformer that predicts masked tokens while optimizing a likelihood lower bound. It matches autoregressive baselines across diverse tasks and excels at in-context learning, demonstrating the promise of diffusion models for language modeling at scale.

Abstract

The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings show the promise of diffusion models for language modeling at scale and challenge the common assumption that core LLM capabilities discussed above inherently depend on ARMs. Project page and codes: https://ml-gsai.github.io/LLaDA-demo/.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Large Language Diffusion Models

1.2. Authors

The paper lists a team of authors with affiliations primarily from the Gaoling School of Artificial Intelligence, Renmin University of China, and Ant Group. Key authors include Shen Nie, Fengqi Zhu, and Chongxuan Li.

1.3. Journal/Conference

This paper is published as a preprint on arXiv (arXiv:2502.09992). This indicates it has not yet undergone formal peer review for a journal or conference, but it is publicly available for discussion and feedback within the academic community.

1.4. Publication Year

2025

1.5. Abstract

The abstract introduces LLaDA (Large Language Diffusion with mAsking), a novel diffusion model designed to challenge the prevailing assumption that Large Language Models (LLMs) rely solely on autoregressive models (ARMs). LLaDA is trained from scratch using a pre-training and supervised fine-tuning (SFT) paradigm. Its core mechanism involves a forward data masking process and a reverse generation process, where a Transformer predicts masked tokens. This approach offers a principled generative method for probabilistic inference by optimizing a likelihood lower bound. Across various benchmarks, including general tasks, math, and code, LLaDA 8B demonstrates strong scalability and performance comparable to ARM baselines. Notably, LLaDA 8B performs competitively with LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following capabilities in multi-turn dialogues. Furthermore, it effectively addresses the reversal curse, even outperforming GPT-4o in a reversal poem completion task. The findings suggest that core LLM capabilities are not inherently dependent on ARMs, highlighting the promise of diffusion models for large-scale language modeling.

Preprint: https://arxiv.org/abs/2502.09992

PDF: https://arxiv.org/pdf/2502.09992v3.pdf

2. Executive Summary

2.1. Background & Motivation

The field of Large Language Models (LLMs) has been overwhelmingly dominated by autoregressive models (ARMs), often referred to as the "next-token prediction" paradigm. These models generate text sequentially, token by token, from left to right. While this approach has achieved remarkable success, leading to powerful LLMs, a fundamental question remains: are the core capabilities of LLMs—such as scalability, in-context learning, and instruction-following—inherently tied to the autoregressive nature of these models, or do they stem from broader generative modeling principles?

The paper argues that the success of LLMs is more attributable to the underlying generative modeling principles (i.e., maximizing the likelihood of the data distribution) combined with the power of Transformers, large model sizes, and extensive data, rather than the specific autoregressive formulation. This perspective is supported by the success of diffusion models (which are also generative) in other domains like computer vision (e.g., diffusion transformers for image generation). Furthermore, ARMs exhibit certain inherent limitations, such as difficulties with reversal reasoning tasks, due to their unidirectional generation process.

The core problem this paper aims to solve is to investigate whether diffusion models, a different class of generative models, can achieve the same, or even superior, core capabilities as ARMs when scaled to the size and data typical of modern LLMs. This challenges a deeply ingrained assumption in the field and seeks to open new avenues for LLM development beyond the autoregressive paradigm.

2.2. Main Contributions / Findings

The paper introduces LLaDA (Large Language Diffusion with mAsking), a novel diffusion model for language, and presents several key contributions and findings:

  • Challenging the Autoregressive Hegemony: LLaDA demonstrates that core LLM capabilities, such as scalability, in-context learning, and instruction-following, can emerge from generative modeling principles beyond the autoregressive paradigm. This fundamentally questions the necessity of autoregression for building powerful language models.

  • Novel Model Architecture (LLaDA): LLaDA employs a masked diffusion model (MDM) approach, featuring a forward data masking process and a reverse generation process. It uses a Transformer to predict masked tokens and optimizes a likelihood lower bound, providing a principled generative approach for probabilistic inference.

  • Unprecedented Scale for Diffusion Language Models: LLaDA is scaled to an 8 billion (8B) parameter model, trained from scratch on 2.3 trillion (T) tokens. This represents the largest diffusion model applied to language to date, pushing the boundaries of diffusion model capabilities in Natural Language Processing (NLP).

  • Strong Scalability: LLaDA exhibits impressive scalability, matching the overall performance of autoregressive model (ARM) baselines (which were trained on the same data) across diverse tasks like MMLU and GSM8K, as its computational budget increases.

  • Competitive In-Context Learning: The pre-trained LLaDA 8B Base model performs comparably to strong LLMs like LLaMA3 8B Base on 15 standard zero/few-shot learning tasks, often surpassing LLaMA2 7B Base. This demonstrates its effective in-context learning capability without specific fine-tuning.

  • Impressive Instruction-Following: After Supervised Fine-Tuning (SFT), LLaDA 8B Instruct shows strong instruction-following abilities, including in complex scenarios like multi-turn dialogue, marking a significant achievement for a non-autoregressive model.

  • Addressing the Reversal Curse: LLaDA effectively overcomes the reversal curse, a known limitation of ARMs where models struggle with tasks requiring reasoning in a reverse direction (e.g., inferring the preceding element given the succeeding one). LLaDA maintains consistent performance across forward and reversal tasks and remarkably outperforms GPT-4o in a reversal poem completion task.

  • Principled Generative Objective: Unlike some prior discrete diffusion models that used heuristic objectives, LLaDA is built upon a strong theoretical foundation, optimizing a variational lower bound of its log-likelihood, which is crucial for its scalability and generative capabilities.

    These findings open up a new research direction for language modeling, suggesting that diffusion models could offer a powerful alternative or complementary paradigm to autoregressive models for developing next-generation LLMs.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following core concepts:

  • Large Language Models (LLMs): These are advanced neural network models, typically based on the Transformer architecture, trained on vast amounts of text data. Their goal is to understand, generate, and process human language. They exhibit emergent properties like in-context learning (learning from examples within the prompt) and instruction-following (obeying commands given in natural language). Examples include GPT-3, LLaMA, and GPT-4.

  • Generative Modeling: This is a branch of machine learning focused on creating models that can generate new data samples that resemble the training data. For LLMs, this means generating human-like text. The ultimate goal is to capture the underlying probability distribution of the real data, $p_{\mathrm{data}}(x)$.

    • Maximum Likelihood Estimation (MLE): A common technique for training generative models. It involves finding the model parameters $\theta$ that maximize the probability (likelihood) of observing the training data $x$ under the model's distribution $p_{\theta}(x)$. This is often expressed as maximizing $\mathbb{E}_{p_{\mathrm{data}}(x)} \log p_{\theta}(x)$.
    • KL Divergence (Kullback-Leibler Divergence): A measure of how one probability distribution $p$ diverges from a second, expected probability distribution $q$. Minimizing the KL divergence between the data distribution $p_{\mathrm{data}}(x)$ and the model distribution $p_{\theta}(x)$, i.e., $\mathrm{KL}(p_{\mathrm{data}}(x) \| p_{\theta}(x))$, is equivalent to MLE.
  • Autoregressive Models (ARMs): These are a specific type of generative model that predict the next token in a sequence based on all preceding tokens. They process information strictly from left-to-right.

    • Next-token prediction: The primary training objective for ARMs. Given a sequence $x = (x^1, x^2, \ldots, x^L)$, the model learns to predict $x^i$ given $(x^1, \ldots, x^{i-1})$.
    • Likelihood Factorization: The probability of a sequence $x$ is factorized as a product of conditional probabilities: $p_{\theta}(x) = p_{\theta}(x^1) \prod_{i=2}^{L} p_{\theta}(x^i \mid x^1, \ldots, x^{i-1})$. This formula (Equation 2 in the paper) defines how ARMs model the probability of an entire sequence.
  • Diffusion Models: A class of generative models that learn to reverse a diffusion process.

    • Forward Diffusion Process: This process gradually adds noise (or masks tokens, in discrete cases) to data until it becomes pure noise (or fully masked). It's typically fixed and does not involve learning.
    • Reverse Diffusion Process: This is the learning part. The model learns to iteratively denoise (or unmask) the data, starting from noise (or fully masked data), to reconstruct a clear sample from the data distribution.
    • Masked Diffusion Models (MDMs): A variant of diffusion models specifically for discrete data like text. Instead of adding continuous noise, they progressively mask (replace with a special [MASK] token) parts of the input. The model's task is to predict the original tokens that were masked.
  • Transformers: The foundational neural network architecture for modern LLMs.

    • Attention Mechanism: The core component of Transformers, allowing the model to weigh the importance of different parts of the input sequence when processing each token. Unlike recurrent neural networks, Transformers can process all tokens in parallel.
    • Self-Attention: A type of attention where the mechanism attends to other positions in the same sequence to compute a representation for each position.
    • Causal Mask: A specific type of mask used in Transformers for autoregressive models. It prevents a token from attending to future tokens in the sequence, enforcing the left-to-right generation constraint. LLaDA explicitly states it does not use a causal mask.
  • Pre-training and Supervised Fine-Tuning (SFT): A common paradigm for training LLMs.

    • Pre-training: The initial phase where a model is trained on a massive, diverse text corpus using an unsupervised objective (e.g., next-token prediction for ARMs or mask prediction for MDMs). This allows the model to learn general language understanding and generation capabilities.
    • Supervised Fine-Tuning (SFT): After pre-training, the model is further trained on a smaller dataset of (prompt, response) pairs. This phase teaches the model to follow instructions and align its outputs with human preferences.
  • In-context Learning: The ability of an LLM to learn new tasks or adapt to new instructions by observing a few examples provided directly within the input prompt, without requiring any model weight updates.

  • Instruction-Following: The ability of an LLM to understand and execute commands or instructions given in natural language, often after SFT.

  • Reversal Curse: A phenomenon observed in ARMs where models trained on "A is B" facts struggle to infer "B is A" without explicit training on the reverse relationship. This highlights a limitation in their generalization capabilities, often attributed to their unidirectional processing.

3.2. Previous Works

The paper contextualizes LLaDA against a backdrop of autoregressive models (ARMs) and prior attempts at diffusion models for language.

  • Autoregressive Language Models (ARMs): The dominant paradigm, exemplified by models like GPT [2] and LLaMA [6, 21]. These models define the probability of a sequence $x$ as $p_{\theta}(x) = p_{\theta}(x^1) \prod_{i=2}^{L} p_{\theta}(x^i \mid x^1, \ldots, x^{i-1})$. This "next-token prediction" approach has led to impressive LLMs, but their unidirectional nature is a potential limitation. The paper directly challenges this assumption.

  • BERT (Bidirectional Encoder Representations from Transformers) [22]: While not a generative ARM, BERT is a seminal Transformer-based model that introduced the masked language modeling (MLM) objective. BERT trains a Transformer encoder to predict randomly masked tokens in a sequence, allowing it to learn bidirectional contexts. This is a precursor to masked diffusion models as it involves masking and prediction. However, BERT was primarily used for language understanding tasks (e.g., classification, question answering) rather than generative modeling of full sequences. Key differences from LLaDA include BERT's fixed masking ratio and its lack of a principled maximum likelihood generative framework.

  • Diffusion Models in Vision: The success of diffusion models in generating high-quality images [38, 39, 40] and the emergence of diffusion transformers [9, 10] in computer vision provided inspiration. These models demonstrate the power of diffusion as a generative paradigm at scale.

  • Discrete Diffusion Models for Language:

    • Early Attempts (Continuous State Spaces) [41-51]: Some works tried to adapt continuous diffusion models by embedding discrete text data into continuous spaces. These often faced challenges with scalability and computational cost. The paper notes that a 1B model might require 64x the compute of an ARM for comparable performance [57].
    • Discrete Process Variants [58-71]: Other approaches developed discrete diffusion directly, replacing continuous noise with discrete operations like masking.
      • Austin et al. [16] were pioneers in discrete diffusion for language.
      • Lou et al. [17] showed masked diffusion could achieve perplexity comparable to GPT-2 scale ARMs.
      • Shi et al. [18], Sahoo et al. [19], Ou et al. [20] established theoretical foundations for masked diffusion models (MDMs), particularly linking the training objective to maximum likelihood estimation. These theoretical results are crucial for LLaDA's design.
      • Nie et al. [27] explored scaling laws for MDMs and their application to tasks like question answering at GPT-2 scale. This work suggested MDMs might require more computation than ARMs for the same likelihood.
  • MaskGIT (Masked Generative Image Transformer) [23]: A parallel work in image generation that uses a masked generative Transformer. MaskGIT uses a heuristic training objective and introduces low-confidence remasking for sampling, an idea adopted by LLaDA.

3.3. Technological Evolution

The evolution of LLMs has primarily been driven by the Transformer architecture and the autoregressive paradigm. Initially, recurrent neural networks (RNNs) and LSTMs handled sequential data, but their inability to parallelize computations limited scalability. The Transformer [7] revolutionized this by introducing the attention mechanism, allowing parallel processing and enabling models to scale to billions of parameters. This led to the rise of pre-trained language models like BERT (for understanding) and GPT (for generation).

The GPT series firmly established the autoregressive "next-token prediction" approach as the standard for generative LLMs. Subsequent LLMs like LLaMA, PaLM, Claude, and GPT-4 largely followed this recipe, focusing on scaling model size, data, and computational resources. This paradigm has been incredibly successful, yielding models with impressive in-context learning and instruction-following capabilities.

However, researchers have continually explored alternative generative frameworks. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) were prominent in other domains, but diffusion models have recently emerged as state-of-the-art in image generation. The work by Austin et al. [16] and others began applying discrete diffusion to text, but scaling them to LLM levels remained a significant challenge.

LLaDA fits into this evolution by being the first work to successfully scale a masked diffusion model to 8 billion parameters and demonstrate its competitive performance against leading ARMs. It represents a critical point in the technological timeline, suggesting a potential paradigm shift or diversification in LLM development, moving beyond the sole reliance on autoregressive generation.

3.4. Differentiation Analysis

LLaDA differentiates itself from ARMs and previous diffusion models in several key ways:

  • Generative Mechanism:

    • ARMs: Generate text token-by-token, strictly from left-to-right. This imposes a causal (unidirectional) dependency.
    • LLaDA (MDM): Generates text by iteratively predicting and unmasking tokens across the entire sequence simultaneously. It uses a Transformer without a causal mask, allowing it to learn bidirectional dependencies. This bidirectional modeling is a core difference, enabling capabilities like reversal reasoning.
  • Theoretical Foundation and Scale:

    • Previous Discrete Diffusion for Language: Many prior works either used heuristic training objectives or struggled to scale effectively. Nie et al. [27] noted that MDMs might require significantly more computation than ARMs for comparable likelihood.
    • LLaDA: Builds on strong theoretical foundations [18, 19, 20] that link its training objective (Eq. 3) to a likelihood lower bound, making it a principled generative model. Crucially, it scales this principled approach to an unprecedented 8B parameters for a diffusion language model, demonstrating scalability comparable to ARMs on downstream tasks, not just likelihood.
  • Core LLM Capabilities:

    • ARMs: In-context learning and instruction-following are emergent properties.
    • LLaDA: Demonstrates that these capabilities are not exclusive to ARMs but arise from fundamental generative modeling principles. LLaDA achieves competitive in-context learning and impressive instruction-following post-SFT, challenging the common assumption.
  • Addressing Limitations:

    • ARMs: Suffer from the reversal curse due to their unidirectional nature.

    • LLaDA: Its bidirectional modeling inherent in the masked diffusion approach naturally addresses the reversal curse, showing consistent performance in forward and reverse tasks and even outperforming GPT-4o in specific scenarios.

      In essence, LLaDA represents a robust, theoretically grounded, and highly scalable diffusion model that directly competes with and, in some aspects, surpasses state-of-the-art ARMs, thereby opening a new frontier for LLM research and development.

4. Methodology

4.1. Principles

The core idea behind LLaDA is to adapt masked diffusion models (MDMs) for large-scale language generation. Unlike autoregressive models (ARMs) that predict the next token sequentially, LLaDA leverages a process inspired by diffusion models in other domains. It defines a generative process through two main stages: a forward process that gradually masks tokens in the input data, and a reverse process where a neural network learns to unmask these tokens to reconstruct the original data. This reverse process is parameterized by a Transformer that predicts all masked tokens simultaneously, allowing for bidirectional context learning.

The theoretical basis for LLaDA lies in optimizing a variational lower bound of the log-likelihood of the data. This means that, despite its non-autoregressive nature, LLaDA is a principled generative model aiming to capture the true data distribution $p_{\mathrm{data}}(x_0)$. The intuition is that by learning to denoise (or unmask) text from a progressively masked version, the model implicitly learns the underlying structure and dependencies of language, similar to how ARMs learn through next-token prediction. The key difference is the bidirectional learning and generation, which removes the left-to-right constraint of ARMs.

4.2. Core Methodology In-depth (Layer by Layer)

LLaDA defines its model distribution $p_{\theta}(x_0)$ through a forward masking process and a reverse generation process.

4.2.1. Probabilistic Formulation

The probabilistic formulation of LLaDA is central to its operation and distinguishes it from ARMs.

4.2.1.1. Forward Process

The forward process (or masking process) gradually introduces mask tokens into an original data sample $x_0$. It is indexed by a time variable $t \in [0, 1]$.

  • At $t=0$, the data $x_0$ is fully observed (no masks).

  • As $t$ increases from 0 to 1, tokens are progressively masked.

  • At $t=1$, the sequence $x_1$ is fully masked.

    This process is defined by a conditional distribution $q_{t|0}(x_t \mid x_0)$, which is factorized across all tokens in the sequence:
    $$q_{t|0}(x_t \mid x_0) = \prod_{i=1}^{L} q_{t|0}(x_t^i \mid x_0^i)$$
    where $L$ is the sequence length. This means each token $x_0^i$ is masked independently. The conditional probability for each token $x_0^i$ to become $x_t^i$ at time $t$ is given by:
    $$q_{t|0}(x_t^i \mid x_0^i) = \begin{cases} 1-t, & x_t^i = x_0^i, \\ t, & x_t^i = \mathbf{M}. \end{cases}$$
    Here, $x_0^i$ is the $i$-th token of the original sequence, $x_t^i$ is the $i$-th token of the masked sequence at time $t$, and $\mathbf{M}$ denotes the special mask token. Each token either remains unchanged with probability $1-t$ or is replaced by $\mathbf{M}$ with probability $t$, so as $t$ increases, more tokens are expected to be masked.
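To make the independent masking concrete, here is a minimal sketch (not the authors' code) of one forward-process draw at noise level $t$; the tensor shapes and the `MASK_ID` constant are illustrative assumptions:

```python
import torch

MASK_ID = 126_335  # hypothetical ID reserved for the [MASK] token

def forward_mask(x0: torch.Tensor, t: float) -> torch.Tensor:
    """Independently replace each token of x0 with MASK_ID with probability t.

    x0: LongTensor of shape (batch, seq_len) holding token IDs.
    t:  masking level in [0, 1]; t=0 leaves x0 untouched, t=1 masks everything.
    """
    mask = torch.rand_like(x0, dtype=torch.float) < t
    return torch.where(mask, torch.full_like(x0, MASK_ID), x0)

# Example: mask roughly 30% of the tokens of a toy batch.
x0 = torch.randint(0, 1000, (2, 8))
xt = forward_mask(x0, t=0.3)
```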

4.2.1.2. Reverse Process

The reverse process is what LLaDA learns. It aims to recover the original data distribution by iteratively unmasking tokens, moving from a fully masked state ($t=1$) back to an unmasked state ($t=0$). The approximate reverse process is parameterized by a mask predictor $p_{\theta}(\cdot \mid x_t)$.

The conditional distribution for the reverse process, moving from time $t$ to $s$ (where $0 \leq s < t \leq 1$), is also factorized across tokens:
$$q_{s|t}(x_s \mid x_t) = \prod_{i=1}^{L} q_{s|t}(x_s^i \mid x_t)$$
The conditional probability for each token transition is defined as:
$$q_{s|t}(x_s^i \mid x_t) = \begin{cases} 1, & x_t^i \neq \mathbf{M},\ x_s^i = x_t^i, \\ \dfrac{s}{t}, & x_t^i = \mathbf{M},\ x_s^i = \mathbf{M}, \\ \dfrac{t-s}{t}\, q_{0|t}(x_s^i \mid x_t), & x_t^i = \mathbf{M},\ x_s^i \neq \mathbf{M}, \\ 0, & \text{otherwise}. \end{cases}$$
This formula describes how tokens are updated during a reverse step.

  • If a token $x_t^i$ is not masked, it remains unchanged ($x_s^i = x_t^i$).

  • If a token $x_t^i$ is masked, it either remains masked with probability $\frac{s}{t}$, or it is unmasked to a value $x_s^i \neq \mathbf{M}$ with probability $\frac{t-s}{t}$ according to the distribution $q_{0|t}(x_s^i \mid x_t)$.

    The key function to estimate is $q_{0|t}(x_s^i \mid x_t)$, which predicts the original token if it was masked in $x_t$. The paper notes that this can be reparameterized as:
    $$q_{0|t}(x_s^i \mid x_t) = p_{\mathrm{data}}(x_0^i \mid x_t^{\mathrm{UM}}), \quad \forall i \ \text{such that} \ x_t^i = \mathbf{M},$$
    where $x_t^{\mathrm{UM}}$ refers to the unmasked tokens in $x_t$. This means that estimating the data prediction function is equivalent to estimating conditional distributions on clean data, which is time-invariant. Consequently, the time $t$ does not need to be an input to the parametric model (the Transformer).

4.2.1.3. Mask Predictor and Training Objective

LLaDA uses a Transformer as its mask predictor, denoted $p_{\theta}(\cdot \mid x_t)$. This Transformer takes the partially masked sequence $x_t$ as input and simultaneously predicts all masked tokens. Crucially, unlike ARMs, this Transformer does not use a causal mask, allowing it to leverage bidirectional context.

The mask predictor is trained by optimizing a cross-entropy loss computed only on the masked tokens:
$$\mathcal{L}(\theta) \triangleq - \mathbb{E}_{t, x_0, x_t} \left[ \frac{1}{t} \sum_{i=1}^{L} \mathbf{1}[x_t^i = \mathbf{M}] \log p_{\theta}(x_0^i \mid x_t) \right]$$
Here:

  • $\mathcal{L}(\theta)$ is the loss function for model parameters $\theta$.

  • $\mathbb{E}_{t, x_0, x_t}$ denotes the expectation over $t$ (uniformly sampled from $[0, 1]$), $x_0$ (sampled from the data), and $x_t$ (sampled from the forward process given $x_0$).

  • $L$ is the sequence length.

  • $\mathbf{1}[x_t^i = \mathbf{M}]$ is an indicator function, which is 1 if the $i$-th token in $x_t$ is masked, and 0 otherwise. This ensures the loss is calculated only for masked tokens.

  • $\log p_{\theta}(x_0^i \mid x_t)$ is the log-probability assigned by the model to the original token $x_0^i$ given the masked sequence $x_t$.

  • The term $\frac{1}{t}$ is a weighting factor, which is crucial for the theoretical link to maximum likelihood estimation.

    This loss function $\mathcal{L}(\theta)$ has been proven to be an upper bound on the negative log-likelihood of the model distribution $p_{\theta}(x_0)$:
    $$- \mathbb{E}_{x_0 \sim p_{\mathrm{data}}(x_0)} \left[ \log p_{\theta}(x_0) \right] \leq \mathcal{L}(\theta)$$
    This theoretical guarantee makes LLaDA a principled generative model capable of in-context learning and instruction-following, similar to ARMs.
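The following sketch shows how this objective can be estimated with one Monte Carlo sample per sequence; `mask_predictor` (a Transformer without a causal mask that maps token IDs to per-position vocabulary logits) and `mask_id` are assumed interfaces, not the authors' API:

```python
import torch
import torch.nn.functional as F

def llada_loss(mask_predictor, x0, mask_id):
    """Monte Carlo estimate of Eq. (3): 1/t-weighted cross-entropy on masked tokens."""
    b, L = x0.shape
    t = torch.rand(b, 1).clamp(min=1e-3)                 # one noise level per sequence, t in (0, 1]
    is_masked = torch.rand(b, L) < t                     # independent masking with probability t
    xt = torch.where(is_masked, torch.full_like(x0, mask_id), x0)
    logits = mask_predictor(xt)                          # (b, L, vocab), bidirectional attention
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, x0.unsqueeze(-1)).squeeze(-1)     # log p_theta(x0^i | x_t)
    per_seq = -(token_logp * is_masked).sum(dim=1) / t.squeeze(1)  # masked positions only, weight 1/t
    return per_seq.mean()
```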

4.2.2. Pre-training

The pre-training phase for LLaDA follows standard practices in LLM development but adapted for the MDM framework.

  • Model Architecture: LLaDA utilizes a Transformer [7] as its mask predictor. Unlike ARMs, it does not use a causal mask because its formulation allows it to see the entire input for predictions. The model uses RMSNorm [105] for stability, SwiGLU [106] as the activation function, and RoPE [107] for positional encoding. (Refer to Table 5 in Experimental Setup for detailed architectural configurations).
  • Data: The model was pre-trained on a massive dataset of 2.3 trillion (T) tokens, similar to LLMs (e.g., [25, 26]). This dataset comprises various sources (web, books, academic papers, social media, encyclopedias, math, code) and languages (11% Chinese, 61% English, 28% code). Low-quality content was filtered using manual rules and LLM-based approaches.
  • Training Process:
    • For each training sequence $x_0$, a time step $t$ is randomly sampled uniformly from $[0, 1]$.
    • Each token in $x_0$ is then independently masked with probability $t$ to create the partially masked sequence $x_t$ (as illustrated in Figure 2a).
    • The model (Transformer) takes $x_t$ as input and predicts the original tokens for all masked positions.
    • The cross-entropy loss (Eq. 3) is estimated using the Monte Carlo method for stochastic gradient descent training.
    • To enhance robustness to variable-length inputs, 1% of the pre-training data sequences are set to a random length uniformly sampled from [1, 4096]. The fixed sequence length for training is 4096 tokens.
    • Computational Cost: The pre-training consumed 0.13 million H800 GPU hours, which is comparable to ARMs of the same scale and dataset size.
  • Optimization:
    • An AdamW optimizer [29] with a weight decay of 0.1 was used.
    • A Warmup-Stable-Decay learning rate scheduler [28] was employed (a minimal sketch of this schedule follows this list):
      • Linear increase from 0 to $4 \times 10^{-4}$ over the first 2000 iterations.
      • Maintained at $4 \times 10^{-4}$ until 1.2T tokens were processed.
      • Decayed to $1 \times 10^{-4}$ and held constant for the next 0.8T tokens.
      • Linearly reduced from $1 \times 10^{-4}$ to $1 \times 10^{-5}$ for the final 0.3T tokens.
    • Batch size: 1280 global, 4 local per GPU.
    • The 8B model was trained once without hyperparameter tuning.
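A minimal sketch of this Warmup-Stable-Decay schedule as a piecewise function of training progress; the thresholds and learning-rate values are those reported above, while the function itself is illustrative:

```python
def llada8b_lr(cur_iter: int, tokens_processed_T: float, warmup_iters: int = 2000) -> float:
    """Warmup-Stable-Decay schedule as described above (sketch).

    cur_iter:            current optimizer step, used only for the initial warmup.
    tokens_processed_T:  training tokens seen so far, in trillions.
    """
    if cur_iter < warmup_iters:                    # linear warmup from 0 to 4e-4
        return 4e-4 * cur_iter / warmup_iters
    if tokens_processed_T <= 1.2:                  # stable phase until 1.2T tokens
        return 4e-4
    if tokens_processed_T <= 2.0:                  # held at 1e-4 for the next 0.8T tokens
        return 1e-4
    frac = min((tokens_processed_T - 2.0) / 0.3, 1.0)   # final 0.3T tokens
    return 1e-4 + frac * (1e-5 - 1e-4)             # linear decay from 1e-4 to 1e-5
```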

4.2.3. Supervised Fine-Tuning (SFT)

After pre-training, LLaDA undergoes Supervised Fine-Tuning (SFT) to improve its instruction-following capabilities. This phase adapts the model to generate responses $r_0$ given a prompt $p_0$, effectively modeling the conditional distribution $p_{\theta}(r_0 \mid p_0)$.

  • Data: The SFT dataset consists of 4.5 million (prompt, response) pairs, including 1 million human-annotated and 3.5 million synthetic samples. The data spans various domains like code, mathematics, and general instruction-following.
  • Masking Strategy for SFT:
    • For an input pair $(p_0, r_0)$, the prompt $p_0$ remains unchanged (unmasked).
    • Only the tokens in the response $r_0$ are independently masked with probability $t$ to form $r_t$ (Figure 2b).
    • The concatenated sequence of $p_0$ and $r_t$ is fed to the pre-trained mask predictor.
  • SFT Loss Function: The loss is computed only on the masked tokens within the response: $- \mathbb{E}_{t, p_0, r_0, r_t} \left[ \frac{1}{t} \sum_{i=1}^{L'} \mathbf{1}[r_t^i = \mathbf{M}] \log p_{\theta}(r_0^i \mid p_0, r_t) \right]$, where $L'$ is the length of the response $r_0$. This objective is fully compatible with pre-training, as $(p_0, r_0)$ can be seen as clean data $x_0$ and $(p_0, r_t)$ as its masked version $x_t$, with masks only in the response part (a minimal sketch of this objective follows this list).
  • Handling Response Lengths:
    • $|EOS|$ (End Of Sequence) tokens are appended to shorter pairs in a mini-batch to ensure uniform sequence lengths.
    • These $|EOS|$ tokens are treated as normal tokens during training (masked and included in the loss) but removed during sampling. This allows LLaDA to automatically control response length.
  • Multi-turn Dialogue: A multi-turn dialogue $(p_0^0, r_0^0, p_0^1, r_0^1, \ldots, p_0^{n-1}, r_0^{n-1})$ is treated as $n$ single-turn pairs: $(p_0^0, r_0^0)$, $(p_0^0 r_0^0 p_0^1, r_0^1)$, etc. This allows LLaDA to learn multi-turn dialogue capabilities.
  • Optimization:
    • An AdamW optimizer with a weight decay of 0.1 was used.
    • Learning rate scheduler: Linear increase from 0 to $2.5 \times 10^{-5}$ over the first 50 iterations, held constant, then linearly reduced to $2.5 \times 10^{-6}$ for the final 10% of iterations.
    • Global batch size: 256, local batch size: 2 per GPU.
    • The SFT experiment was executed once without hyperparameter tuning.
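A minimal sketch of the SFT objective: identical to the pre-training loss except that masking, and therefore the loss, is restricted to response positions while the prompt stays clean. `mask_predictor` and `mask_id` are the same assumed interfaces as in the pre-training sketch:

```python
import torch
import torch.nn.functional as F

def llada_sft_loss(mask_predictor, prompt, response, mask_id):
    """1/t-weighted cross-entropy on masked response tokens, prompt left unmasked."""
    b, Lp = prompt.shape
    _, Lr = response.shape
    t = torch.rand(b, 1).clamp(min=1e-3)
    is_masked = torch.rand(b, Lr) < t
    rt = torch.where(is_masked, torch.full_like(response, mask_id), response)
    xt = torch.cat([prompt, rt], dim=1)              # (b, Lp + Lr); prompt stays clean
    logits = mask_predictor(xt)[:, Lp:, :]           # keep only response positions
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, response.unsqueeze(-1)).squeeze(-1)
    per_seq = -(token_logp * is_masked).sum(dim=1) / t.squeeze(1)
    return per_seq.mean()
```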

4.2.4. Inference

LLaDA can generate new text and evaluate likelihood in a diffusion manner, different from left-to-right autoregressive generation.

4.2.4.1. Reverse Generation Process (Diffusion Sampling)

  • Starting Point: Given a prompt $p_0$, the process begins with a fully masked response sequence $r_1$ of a specified length $L$ (which is a hyperparameter).
  • Iterative Unmasking: The reverse process is discretized into $N$ sampling steps, moving from $t=1$ down to $t=0$. At each step from time $t$ to $s$ (where $s = t - 1/N$):
    1. The model's mask predictor (Transformer) takes the prompt $p_0$ and the current partially masked response $r_t$ as input.
    2. It predicts the original tokens for all masked positions simultaneously.
    3. Remasking Strategy: A fraction $\frac{s}{t}$ of the predicted tokens are then remasked to obtain $r_s$. This ensures the transition aligns with the forward process.
      • Purely Random Remasking: In principle, this remasking should be purely random.
      • Low-Confidence Remasking: Inspired by annealed sampling in LLMs [4, 30] and MaskGIT [23], LLaDA adopts a low-confidence remasking strategy. Here, the fraction $\frac{s}{t}$ of predicted tokens with the lowest confidence (i.e., the lowest probability assigned by the model to its prediction) is chosen to be remasked. This helps improve sample quality. (See Algorithm 5 in Appendix A.3.)
  • Termination: The process continues until $t=0$ (fully unmasked). Tokens appearing after an $|EOS|$ token are discarded.
  • Flexibility: The total number of sampling steps $N$ is a hyperparameter, offering a trade-off between efficiency and sample quality. The generation length $L$ is also a hyperparameter, but the model is insensitive to it due to variable-length training (Appendix B.5).
  • Flexible Sampling beyond Pure Diffusion: LLaDA can also support autoregressive and block diffusion [31] sampling directly, without further training. However, pure diffusion sampling typically yields the best performance.

4.2.4.2. Conditional Log-likelihood Evaluation

For evaluating the conditional likelihood of a response $r_0$ given a prompt $p_0$, LLaDA uses an equivalent form of the loss function (Eq. 14 in Appendix A.2) that exhibits lower variance than directly using Eq. (5). This form is:
$$- \mathbb{E}_{l, r_0, r_l} \left[ \frac{L}{l} \sum_{i=1}^{L} \mathbf{1}[r_l^i = \mathbf{M}] \log p_{\theta}(r_0^i \mid p_0, r_l) \right]$$
Here:

  • $L$ is the sequence length of $r_0$.
  • $l$ is uniformly sampled from $\{1, 2, \ldots, L\}$ (the number of tokens to mask).
  • $r_l$ is obtained by uniformly sampling $l$ tokens from $r_0$ without replacement for masking. This deterministically masks $l$ tokens, reducing variance compared to the probabilistic masking at level $t$.
  • This can be seen as a conditional version of the pre-training objective.
  • Monte Carlo estimation is used to approximate this expectation.

4.2.4.3. Connection to Any-order Autoregressive Models (AO-ARM)

The paper notes that the training objective (Eq. 12) is equivalent to that of Any-order Autoregressive Models (AO-ARMs) [59, 101, 102], which characterize the joint distribution by considering all possible orders of variable dependencies. An AO-ARM minimizes the expected negative log-likelihood over a uniform distribution of all orders $\pi$:
$$- \mathbb{E}_{x_0, \pi \sim U_{\pi}} \left[ \sum_{i=1}^{L} \log p_{\theta}\big(x_0^{\pi(i)} \mid x_0^{\pi(<i)}; \pi\big) \right]$$
This connection intuitively explains LLaDA's bidirectional reasoning capabilities, as it implicitly learns dependencies regardless of token order.

4.2.4.4. Classifier-free Guidance (CFG)

LLaDA is compatible with Classifier-free Guidance (CFG) [37, 27], a technique to improve generation quality by balancing adherence to the prompt (alignment) and diversity. CFG modifies the mask predictor for inference:
$$\tilde{p}_{\theta}(r_0 \mid p_0, r_t) \propto \frac{p_{\theta}(r_0 \mid p_0, r_t)^{1+w}}{p_{\theta}(r_0 \mid m, r_t)^{w}}$$
Here:

  • $\tilde{p}_{\theta}$ is the modified probability distribution used for sampling.
  • $p_{\theta}(r_0 \mid p_0, r_t)$ is the original model prediction given the prompt $p_0$.
  • $p_{\theta}(r_0 \mid m, r_t)$ is the model prediction given a mask sequence $m$ (of the same length as $p_0$), representing an "unconditional" generation.
  • $w$ is a tunable hyperparameter controlling the strength of the guidance from the prompt. The paper states that CFG is not used in the main results to ensure a fair comparison with ARMs, but shows that it consistently improves LLaDA's performance. A minimal sketch of the guidance rule follows this list.
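In log space the rule above amounts to $(1+w)\log p_{\theta}(\cdot \mid p_0, r_t) - w \log p_{\theta}(\cdot \mid m, r_t)$ followed by renormalization. A minimal sketch, assuming per-position log-probabilities from two forward passes of the mask predictor (one with the prompt, one with the prompt replaced by mask tokens):

```python
import torch
import torch.nn.functional as F

def cfg_log_probs(logp_cond: torch.Tensor, logp_uncond: torch.Tensor, w: float) -> torch.Tensor:
    """Classifier-free guidance sketch for the mask predictor.

    logp_cond:   log p_theta(. | p_0, r_t), shape (..., vocab), prompt provided.
    logp_uncond: log p_theta(. | m,  r_t), same shape, prompt replaced by mask tokens.
    w:           guidance strength; w = 0 recovers the conditional model.
    """
    guided = (1.0 + w) * logp_cond - w * logp_uncond
    return F.log_softmax(guided, dim=-1)   # renormalize over the vocabulary
```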

4.2.5. Algorithms

The paper provides detailed algorithms for training and inference.

Algorithm 1 Pre-training of LLaDA

Require: mask predictor $p_{\theta}$, data distribution $p_{\mathrm{data}}$

1: repeat
2:   $x_0 \sim p_{\mathrm{data}}$  # with a probability of 1%, the sequence length of $x_0$ follows $U[1, 4096]$
3:   $t \sim U(0, 1]$
4:   $x_t \sim q_{t|0}(x_t \mid x_0)$  # $q_{t|0}$ is defined in Eq. (7)
5:   Calculate $\mathcal{L} = - \frac{1}{t \cdot L} \sum_{i=1}^{L} \mathbf{1}[x_t^i = \mathbf{M}] \log p_{\theta}(x_0^i \mid x_t)$  # $L$ is the sequence length of $x_0$
6:   Calculate $\nabla_{\theta} \mathcal{L}$ and run the optimizer.
7: until converged
8: Return $p_{\theta}$

  • Explanation: This algorithm outlines the pre-training loop. It repeatedly samples an original sequence $x_0$ and a masking ratio $t$, applies the forward masking process to get $x_t$, computes the cross-entropy loss (Eq. 3) for masked tokens, and updates the model parameters using stochastic gradient descent. The special handling for variable sequence lengths is also noted.

Algorithm 2 Supervised Fine-Tuning of LLaDA

Require: mask predictor $p_{\theta}$, pair data distribution $p_{\mathrm{data}}$

1: repeat
2:   $p_0, r_0 \sim p_{\mathrm{data}}$  # see Appendix B.1 for details about the SFT data
3:   $t \sim U(0, 1]$
4:   $r_t \sim q_{t|0}(r_t \mid r_0)$  # $q_{t|0}$ is defined in Eq. (7)
5:   Calculate $\mathcal{L} = - \frac{1}{t \cdot L'} \sum_{i=1}^{L'} \mathbf{1}[r_t^i = \mathbf{M}] \log p_{\theta}(r_0^i \mid p_0, r_t)$  # $L'$ is the sequence length of $r_0$
6:   Calculate $\nabla_{\theta} \mathcal{L}$ and run the optimizer.
7: until converged
8: Return $p_{\theta}$

  • Explanation: This algorithm details the SFT loop. It samples (prompt, response) pairs $(p_0, r_0)$, applies masking only to the response $r_0$ (to get $r_t$), then computes the cross-entropy loss (Eq. 5) using the prompt $p_0$ and the masked response $r_t$ to predict the original response tokens. Model parameters are updated based on this loss.

Algorithm 3 Conditional Log-likelihood Evaluation of LLaDA

Require: mask predictor $p_{\theta}$, prompt $p_0$, response $r_0$, the number of Monte Carlo estimations $n_{mc}$

1: log_likelihood $= 0$
2: for $i \gets 1$ to $n_{mc}$ do
3:   $l \sim \{1, 2, \ldots, L\}$  # $L$ is the sequence length of $r_0$
4:   Obtain $r_l$ by uniformly sampling $l$ tokens from $r_0$ without replacement for masking
5:   log_likelihood $=$ log_likelihood $+ \frac{L}{l} \sum_{i=1}^{L} \mathbf{1}[r_l^i = \mathbf{M}] \log p_{\theta}(r_0^i \mid p_0, r_l)$
6: end for
7: log_likelihood $=$ log_likelihood $/ n_{mc}$
8: Return log_likelihood

  • Explanation: This algorithm calculates the conditional log-likelihood of a response $r_0$ given a prompt $p_0$. It uses Monte Carlo estimation with $n_{mc}$ samples. In each sample, $l$ tokens are randomly chosen from $r_0$ and masked to create $r_l$. The model then predicts these masked tokens, and their log-probabilities are aggregated, weighted by $L/l$. The final result is the average over all Monte Carlo samples. This uses the lower-variance formulation from Eq. (6).
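A minimal sketch of Algorithm 3 under the same assumed `mask_predictor` interface as the earlier sketches (single-example tensors of shape (1, length)):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def conditional_log_likelihood(mask_predictor, prompt, response, mask_id, n_mc=128):
    """Monte Carlo estimate of log p_theta(r_0 | p_0) via the low-variance form (Eq. 6)."""
    Lp, Lr = prompt.shape[1], response.shape[1]
    total = 0.0
    for _ in range(n_mc):
        l = torch.randint(1, Lr + 1, (1,)).item()     # number of response tokens to mask
        idx = torch.randperm(Lr)[:l]                  # mask l positions without replacement
        rl = response.clone()
        rl[:, idx] = mask_id
        logits = mask_predictor(torch.cat([prompt, rl], dim=1))[:, Lp:, :]
        logp = F.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, response.unsqueeze(-1)).squeeze(-1)
        total += (Lr / l) * token_logp[:, idx].sum().item()   # weight by L/l, masked positions only
    return total / n_mc
```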

Algorithm 4 Reverse Generation Process (Random Remasking)

Require: mask predictor $p_{\theta}$, prompt $p_0$, answer length $L$, sampling steps $N$

1: Set $r_1$ to a fully masked sequence of length $L$.
2: for $t \gets 1$ down to $\frac{1}{N}$ step $\frac{1}{N}$ do
3:   $s = t - \frac{1}{N}$
4:   $r_0 = \arg\max_{r_0} p_{\theta}(r_0 \mid p_0, r_t)$  # greedy sampling when predicting masked tokens
5:   for $i \gets 1$ to $L$ do
6:     if $r_t^i \neq \mathbf{M}$ then
7:       $r_0^i = r_t^i$
8:     else
9:       with probability $\frac{s}{t}$, set $r_0^i = \mathbf{M}$
10:    end if
11:  end for
12:  $r_s = r_0$
13: end for
14: Return $r_0$

  • Explanation: This algorithm describes pure diffusion sampling with random remasking. It starts from a fully masked sequence $r_1$. In each step, it predicts the original tokens $r_0$ from the current masked sequence $r_t$ and the prompt $p_0$, then randomly remasks a fraction $\frac{s}{t}$ of the newly predicted tokens back to $\mathbf{M}$. Tokens that were already unmasked in $r_t$ remain unmasked. This iterative process gradually refines the sequence until $t=0$.

Algorithm 5 Reverse Generation Process (Low-confidence Remasking)

Require: mask predictor $p_{\theta}$, prompt $p_0$, answer length $L$, sampling steps $N$

1: Set $r_1$ to a fully masked sequence of length $L$.
2: for $t \gets 1$ down to $\frac{1}{N}$ step $\frac{1}{N}$ do
3:   $s = t - \frac{1}{N}$
4:   for $i \gets 1$ to $L$ do
5:     if $r_t^i \neq \mathbf{M}$ then
6:       $r_0^i = r_t^i$, $c^i = 1$
7:     else
8:       $r_0^i = \arg\max_{r_0^i} p_{\theta}(r_0^i \mid p_0, r_t)$
9:       $c^i = p_{\theta}(r_0^i \mid p_0, r_t)_{r_0^i}$
10:    end if
11:  end for
12:  $n_{un} = \lfloor L(1-s) \rfloor$  # the number of unmasked tokens at timestep $s$
13:  for $i \gets 1$ to $L$ do
14:    if $c^i \in \operatorname{Lowest-}(L - n_{un})(\{c^i\}_1^L)$ then
15:      $r_0^i = \mathbf{M}$  # the $L - n_{un}$ positions with the least confidence are remasked
16:    end if
17:  end for
18:  $r_s = r_0$
19: end for
20: Return $r_0$

  • Explanation: This algorithm is similar to Algorithm 4 but uses low-confidence remasking. Instead of remasking at random, it computes a confidence score $c^i$ (the predicted probability of the chosen token) for each position and remasks the positions with the lowest confidence. The number of tokens to remask is $L - n_{un}$, where $n_{un} = \lfloor L(1-s) \rfloor$ is the target number of unmasked tokens at timestep $s$. This heuristic retains high-confidence predictions while re-evaluating low-confidence ones in subsequent steps.
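A minimal sketch of this low-confidence remasking sampler under the same assumed interface; at each step every masked position is predicted greedily and the lowest-confidence predictions are remasked so that exactly $\lfloor L(1-s) \rfloor$ tokens remain unmasked at timestep $s$:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_low_confidence(mask_predictor, prompt, answer_len, n_steps, mask_id):
    """Reverse generation with low-confidence remasking (sketch of Algorithm 5)."""
    Lp = prompt.shape[1]
    r = torch.full((1, answer_len), mask_id, dtype=torch.long)     # r_1: fully masked response
    for step in range(n_steps):
        s = 1.0 - (step + 1) / n_steps                             # next timestep
        logits = mask_predictor(torch.cat([prompt, r], dim=1))[:, Lp:, :]
        probs = F.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)                             # greedy prediction + confidence
        conf = torch.where(r != mask_id, torch.ones_like(conf), conf)  # keep tokens already unmasked
        r0 = torch.where(r != mask_id, r, pred)
        n_unmask = int(answer_len * (1.0 - s))                     # target unmasked count at time s
        n_remask = answer_len - n_unmask
        if n_remask > 0:
            remask_idx = conf.topk(n_remask, largest=False).indices
            r0[0, remask_idx[0]] = mask_id                         # remask lowest-confidence positions
        r = r0
    return r
```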

5. Experimental Setup

5.1. Datasets

The paper describes the datasets used for both pre-training and Supervised Fine-Tuning (SFT).

  • Pre-training Corpus:

    • Scale: Comprises 2.3 trillion (T) tokens, a standard scale for modern LLMs.
    • Sources: Diverse publicly available sources, including:
      • Web data
      • Books
      • Academic papers
      • Social media
      • Encyclopedias
      • Mathematics
      • Code
    • Languages: Approximately 11% Chinese, 61% English, and 28% code.
    • Cleaning: Involved PDF text extraction, deduplication, and harmful content filtering. A BERT model fine-tuned for automated data quality annotation was used to select higher-quality samples.
    • Processing: Documents were concatenated and segmented into fixed-length sequences (4096 tokens). As mentioned in the methodology, 1% of pre-training data had sequence lengths randomly sampled from [1, 4096] to improve variable-length handling.
  • Supervised Fine-Tuning (SFT) Dataset:

    • Scale: Consists of 4.5 million (M) (prompt, response) pairs.
    • Composition:
      • 1 million human-annotated samples.
      • 3.5 million synthetic samples, generated using methods similar to those in Xu et al. [103] and Wei et al. [104].
    • Domains: Spans multiple domains, including code, mathematics, and instruction-following.
    • Processing:
      • $|EOS|$ tokens are appended to the end of shorter pairs within each mini-batch to ensure uniform sequence lengths. These are treated as normal tokens during training but removed during sampling.

      • Multi-turn dialogues are partitioned into single-turn pairs for training (e.g., a 3-turn dialogue (P0, R0, P1, R1, P2, R2) is treated as three pairs: (P0, R0), (P0 R0 P1, R1), (P0 R0 P1 R1 P2, R2)).

        The paper does not provide explicit examples of raw data samples, but the SFT examples in the case studies (e.g., multi-turn dialogue in Table 3, Table 15, Table 16, and poem completion in Table 17) implicitly show the (prompt, response) format. For instance:

Multi-turn Dialogue Example from Table 3:
User: Could you please share with me what the initial two lines of the renowned poem 'The Road Not Taken'?
LLaDA: Certainly! The initial two lines of the renowned poem "The Road Not Taken" by Robert Frost are: "Two roads diverged in a yellow wood, And sorry I could not travel both." These lines set the stage for the poem, introducing the idea of a speaker facing a choice between two paths.

These datasets are standard for LLM training and are effective for validating general language capabilities, mathematical reasoning, code generation, and multilingual understanding.

5.2. Evaluation Metrics

The paper evaluates LLaDA across a comprehensive suite of benchmarks, covering general tasks, mathematics & science, code, and Chinese language understanding. The evaluation protocols follow existing LLM assessment practices, primarily using conditional likelihood estimation and conditional generation.

Here's a breakdown of the metrics, with conceptual definitions and, where necessary, supplemental formulas:

5.2.1. General Tasks

  • MMLU (Massive Multitask Language Understanding) [110]:

    • Concept: Measures an LLM's knowledge and problem-solving abilities across 57 diverse subjects, ranging from humanities and social sciences to STEM fields. It uses multiple-choice questions.
    • Metric: Accuracy.
    • Formula: $\text{Accuracy} = \dfrac{\text{Number of correct predictions}}{\text{Total number of predictions}}$
      • Symbols:
        • Number of correct predictions: The count of questions where the model selects the correct multiple-choice option.
        • Total number of predictions: The total number of multiple-choice questions in the dataset.
  • BBH (Big-Bench Hard) [111]:

    • Concept: A challenging subset of the Big-Bench benchmark, designed to evaluate LLMs on complex reasoning tasks that often require multi-step inference, common sense, or domain-specific knowledge, areas where LLMs typically struggle.
    • Metric: Accuracy or Exact Match (EM) depending on the subtask.
    • Formula: Same as MMLU for Accuracy. For EM, it's 1 if the generated answer is identical to the reference, 0 otherwise.
  • ARC-C (AI2 Reasoning Challenge - Challenge Set) [112]:

    • Concept: A challenging dataset of science questions designed to test models' ability to answer questions that require multi-hop reasoning, common sense, and background knowledge, not just surface-level text understanding. The "Challenge Set" specifically includes questions that are difficult for information retrieval systems.
    • Metric: Accuracy.
    • Formula: Same as MMLU.
  • Hellaswag [113]:

    • Concept: A large-scale adversarial dataset for commonsense reasoning. Given a sentence, the model must choose the most plausible continuation from several options. It's designed to be difficult for models that rely on superficial cues.
    • Metric: Accuracy.
    • Formula: Same as MMLU.
  • TruthfulQA [114]:

    • Concept: A benchmark that measures whether language models are truthful in generating answers to questions, even when common human misconceptions or falsehoods might lead them astray.
    • Metric: Truthfulness (measured by BLEURT or human evaluation, but often simplified to Accuracy for multiple-choice tasks). For this paper, it's likely Accuracy on multiple-choice format.
    • Formula: Same as MMLU.
  • WinoGrande [115]:

    • Concept: An adversarial Winograd Schema Challenge dataset for commonsense reasoning. It tests pronoun resolution, where the model must identify the antecedent of a pronoun in a sentence based on common sense. Designed to avoid statistical biases.
    • Metric: Accuracy.
    • Formula: Same as MMLU.
  • PIQA (Physical Commonsense in Natural Language) [116]:

    • Concept: A dataset designed to evaluate LLMs on physical commonsense knowledge. Given a goal (e.g., "wash the dishes"), the model must choose the most plausible plan of action from two options.
    • Metric: Accuracy.
    • Formula: Same as MMLU.

5.2.2. Mathematics & Science

  • GSM8K (Grade School Math 8K) [117]:

    • Concept: A dataset of 8,500 grade school math word problems. It requires multi-step reasoning to derive the correct numerical answer.
    • Metric: Exact Match (EM) Accuracy of the final numerical answer.
    • Formula: $\text{EM Accuracy} = \dfrac{\text{Number of problems with exactly correct final answer}}{\text{Total number of problems}}$
      • Symbols:
        • Number of problems with exactly correct final answer: Count of problems where the model's generated final answer (after extraction) perfectly matches the ground truth.
        • Total number of problems: Total count of problems in the GSM8K dataset.
  • Math [118]:

    • Concept: A dataset for mathematical problem-solving across various difficulty levels and mathematical domains (e.g., algebra, geometry, number theory). Problems are typically text-based and require detailed reasoning.
    • Metric: Exact Match (EM) Accuracy of the final numerical or symbolic answer.
    • Formula: Same as GSM8K.
  • GPQA (Graduate-level Google-Proof Q&A Benchmark) [119]:

    • Concept: A highly challenging Q&A dataset composed of graduate-level questions from various scientific fields, designed to be "Google-proof" (i.e., not easily answerable by simple web search). Requires deep understanding and reasoning.
    • Metric: Accuracy.
    • Formula: Same as MMLU.

5.2.3. Code Generation

  • HumanEval [120]:

    • Concept: A dataset for code generation. It consists of handwritten programming problems, each with a function signature, docstring, and unit tests. Models must generate the function body that passes the tests.
    • Metric: Pass@k. Measures the percentage of problems where at least one of $k$ generated solutions passes all unit tests. The paper reports Pass@1 (i.e., $k=1$) unless stated otherwise.
    • Formula (Pass@k): $\text{Pass@k} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[ \max_{j=1}^{k} \big(\text{tests\_passed}(C_{i,j})\big) \ge \text{total\_tests}(P_i) \right]$
      • Symbols:
        • $N$: Total number of problems.
        • $\mathbf{1}[\cdot]$: Indicator function (1 if the condition is true, 0 otherwise).
        • $\text{tests\_passed}(C_{i,j})$: Number of unit tests passed by the $j$-th candidate solution for problem $i$.
        • $\text{total\_tests}(P_i)$: Total number of unit tests for problem $i$.
        • For Pass@1 ($k=1$) the maximum over candidates disappears; a minimal scoring sketch is given after this metric list.
  • HumanEval-FIM (Fill-in-the-Middle) [121]:

    • Concept: A variant of HumanEval where models are given a code snippet with a missing middle part, and they must fill in the blank. This tests the model's ability to reason bidirectionally within code.
    • Metric: Pass@k (typically Pass@1).
    • Formula: Same as HumanEval.
  • MBPP (Mostly Basic Python Problems) [122]:

    • Concept: A dataset of approximately 1,000 Python programming problems, often simpler than HumanEval, designed to evaluate basic programming skills. Each problem includes a prompt, a solution, and three unit tests.
    • Metric: Pass@k (typically Pass@1).
    • Formula: Same as HumanEval.
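A minimal sketch of Pass@1 scoring for these code benchmarks; the `candidate` and `tests` fields are hypothetical, and a real harness would sandbox execution rather than call `exec` directly:

```python
def pass_at_1(problems):
    """Fraction of problems whose single generated solution passes all unit tests.

    Each problem is a dict with hypothetical fields:
      'candidate' - generated Python source defining the target function,
      'tests'     - Python source whose assert statements exercise that function.
    """
    solved = 0
    for prob in problems:
        env = {}
        try:
            exec(prob["candidate"], env)   # define the candidate solution
            exec(prob["tests"], env)       # assertions raise on failure
            solved += 1
        except Exception:
            pass                           # any failure counts as unsolved
    return solved / len(problems)
```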

5.2.4. Chinese Understanding

  • CMMLU (Chinese Massive Multitask Language Understanding) [123]:

    • Concept: The Chinese counterpart to MMLU, designed to evaluate LLMs on a wide range of knowledge and reasoning tasks in Chinese across diverse subjects.
    • Metric: Accuracy.
    • Formula: Same as MMLU.
  • C-Eval [124]:

    • Concept: A comprehensive Chinese evaluation suite for foundation models, covering various academic disciplines and difficulty levels. It is designed to be challenging and representative of real-world Chinese knowledge and reasoning.
    • Metric: Accuracy.
    • Formula: Same as MMLU.

5.2.5. Evaluation Protocol

  • Base Models:
    • Conditional likelihood estimation: Used for MMLU, CMMLU, C-Eval, ARC-C, Hellaswag, TruthfulQA, WinoGrande, PIQA, and GPQA. For single-token likelihoods (e.g., MMLU), one Monte Carlo (MC) estimate is used. For others, 128 MC samples are used (based on Eq. 6).
    • Conditional generation: Used for BBH, GSM8K, Math, HumanEval, HumanEval-FIM, and MBPP.
  • Instruct Models: All benchmarks are evaluated using conditional generation.
  • Evaluation Frameworks:
    • lm-evaluation-harness [125]: Used for some base model evaluations, except for HumanEval-FIM.
    • Internal evaluation library: Used for HumanEval-FIM on base models and all evaluations on instruct models, as lm-evaluation-harness showed greater deviation from LLaMA3 reported results.
  • Generation Parameters:
    • Pure diffusion sampling with low-confidence remasking is applied to both Base and Instruct models.
    • Base models: Generation length and sampling steps are set to 1024.
    • Instruct models: Sampling steps equal to answer length, which varies per task (e.g., 3 for MMLU, 64 for GPQA, 256 for MBPP, 512 for HumanEval/GSM8K/Math/ARC-C).
    • For HumanEval, MBPP, GSM8K, Math, and GPQA on Instruct models, the confidence of the $|EOS|$ token is set to zero during sampling to prevent premature termination, especially due to heavy padding in the SFT data.

5.3. Baselines

The paper compares LLaDA against several state-of-the-art autoregressive LLMs of similar scale.

  • Self-constructed ARM Baselines:
    • ARM 1B and ARM 7B: These were trained by the authors with identical architectures, data, and configurations as LLaDA (where possible) to provide a direct comparison, especially for scalability analysis. The 7B ARM baseline was previously trained.
  • Leading LLMs:
    • LLaMA3 8B [6]: A highly performant, open-source LLM from Meta, representing a strong competitor in the 8B parameter range.

    • LLaMA2 7B [21]: The predecessor to LLaMA3, also an open-source model.

    • Qwen2 7B [25]: An ARM from Alibaba Cloud, known for its strong performance across various tasks.

    • Qwen2.5 7B [26]: An improved version of Qwen2.

    • Mistral 7B [33]: A highly efficient and performant 7B ARM.

    • Deepseek 7B [32]: Another competitive open-source LLM.

    • Gemma2 9B [26] (for Instruct model comparison): Google's LLM suite.

    • GPT-4o (for reversal curse comparison): OpenAI's state-of-the-art proprietary model.

      These baselines were chosen because they are widely recognized, publicly available (mostly open-source), and are of a comparable parameter scale (7B-9B), allowing for a fair assessment of LLaDA's capabilities against established ARMs. Some representative LLMs (LLaMA3 8B, LLaMA2 7B) were re-evaluated using the authors' internal implementation for more direct comparison.

5.4. Model Architecture Details

The Transformer architecture employed for LLaDA and the ARM baselines is similar to that of LLaMA [6, 21] with specific enhancements:

  • RMSNorm [105]: Root Mean Square Layer Normalization is used to stabilize training.

  • SwiGLU [106]: Swish-Gated Linear Unit is used as the activation function within the Feed-Forward Networks (FFNs) to enhance non-linearity.

  • RoPE [107]: Rotary Position Embedding is integrated for more expressive positional encoding, which helps models handle long sequences.

    The paper provides a detailed architectural comparison in Table 5:

The following are the results from Table 5 of the original paper:

Our ARM Baseline 1B LLaDA 1B Our ARM Baseline 7B LLaDA 8B LLaMA3 8B
Layers 22 22 28 32 32
Model dimension 2048 2048 4096 4096 4096
Attention heads 32 32 32 32 32
Vocabulary size 126,464 126,464 126,464 126,464 128,000
FFN dimension 5634 5634 13,440 12,288 14,336
Key/Value heads 4 4 8 32 8
Total parameters 1.49 B 1.49 B 6.83 B 8.02 B 8.03 B
Non-embedding parameters 0.97 B 0.97 B 5.80 B 6.98 B 6.98 B
  • Key Differences:
    • For the 1B models, LLaDA and the ARM baseline have identical architectures.
    • For larger models, LLaDA 8B uses 32 Key/Value heads compared to LLaMA3 8B's 8. This is because LLaDA uses vanilla multi-head attention and is incompatible with KV caching (which grouped query attention [24] is designed to optimize). The larger number of Key/Value heads means LLaDA's attention layers carry more parameters, so its FFN dimension is reduced (12,288 vs 14,336 in LLaMA3) to keep the total parameter count comparable (see the sketch after this list).
    • The vocabulary size also differs slightly (126,464 for LLaDA vs 128,000 for LLaMA3) due to a tokenizer adapted on LLaDA's specific data.
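A quick back-of-the-envelope check of this trade-off, using only the numbers in Table 5, shows why the two 8B models end up with nearly identical non-embedding parameter counts despite different Key/Value-head and FFN settings. This is a rough estimate that ignores normalization layers and output-head details.

```python
def attn_params(d_model: int, n_heads: int, n_kv_heads: int) -> int:
    """Approximate parameter count of one attention layer (no biases)."""
    head_dim = d_model // n_heads
    q = d_model * n_heads * head_dim           # query projection
    kv = 2 * d_model * n_kv_heads * head_dim   # key and value projections
    out = n_heads * head_dim * d_model         # output projection
    return q + kv + out

def swiglu_ffn_params(d_model: int, d_ffn: int) -> int:
    """Gate, up, and down projections of a SwiGLU FFN."""
    return 3 * d_model * d_ffn

# LLaDA 8B: vanilla MHA (32 KV heads), FFN dim 12,288, 32 layers
llada = 32 * (attn_params(4096, 32, 32) + swiglu_ffn_params(4096, 12_288))
# LLaMA3 8B: grouped-query attention (8 KV heads), FFN dim 14,336, 32 layers
llama3 = 32 * (attn_params(4096, 32, 8) + swiglu_ffn_params(4096, 14_336))

print(f"LLaDA 8B  ~{llada / 1e9:.2f}B non-embedding parameters")   # ~6.98B
print(f"LLaMA3 8B ~{llama3 / 1e9:.2f}B non-embedding parameters")  # ~6.98B
```

Both estimates come out to roughly the 6.98 B non-embedding parameters reported in Table 5: the extra Key/Value projections in LLaDA's vanilla multi-head attention are compensated by its smaller FFN dimension.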

5.5. Training Details

  • Optimizer: AdamW optimizer [29] with a weight decay of 0.1 for all LLaDA and ARM baselines.
  • Learning Rate Scheduler: Warmup-Stable-Decay [28].
    • LLaDA 1B & 8B, ARM 1B: Max learning rate $4 \times 10^{-4}$, batch size 1280.
    • ARM 7B: Max learning rate $4.2 \times 10^{-4}$, batch size 4224 (selected via grid search).
    • LLaDA 8B specific schedule: Detailed in Section 4.2.2 (linear warmup, stable, then two decay steps).
  • Computational Cost Calculation: The training FLOPs (Floating Point Operations) shown in Figure 3 are calculated using the widely used 6ND formulation [108, 109], where:
    • $N$: Number of non-embedding parameters.
    • $D$: Total number of training tokens.
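As a sanity check on the compute scale reported later, the 6ND rule can be applied directly to the Table 5 and Table 1 numbers; the figures below are rough estimates, not the authors' exact accounting.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard 6ND estimate of pre-training compute."""
    return 6 * n_params * n_tokens

# LLaDA 8B: ~6.98B non-embedding parameters (Table 5), 2.3T tokens (Table 1)
print(f"LLaDA 8B full run: {training_flops(6.98e9, 2.3e12):.2e} FLOPs")       # ~9.6e22
# Last checkpoint in Table 18: 1975.52B tokens
print(f"At 1975.52B tokens: {training_flops(6.98e9, 1.97552e12):.2e} FLOPs")  # ~8.3e22
```

The second estimate matches the 8.27e22 FLOPs listed for the final checkpoint in Table 18, and the full run is consistent with the roughly $10^{23}$-FLOP scale discussed in Section 6.1.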

6. Results & Analysis

6.1. Scalability of LLaDA on Language Tasks

The paper first investigates LLaDA's scalability by comparing it against ARM baselines trained on the same data. The evaluation focuses on six standard tasks: MMLU, CMMLU, ARC-C, PIQA, GSM8K, and HumanEval. The pre-training computational cost (FLOPs) serves as the unified scaling metric.

The following are the results from Figure 3 of the original paper:

Figure 3: Scalability of LLaDA. Performance of LLaDA and the self-constructed ARM baselines, trained on the same data, is plotted against increasing pre-training compute (FLOPs) for six tasks; the x-axis is FLOPs and the y-axis is each task's evaluation metric. LLaDA exhibits strong scalability, with overall performance comparable to the autoregressive baselines.

As shown in Figure 3, LLaDA demonstrates impressive scalability, with its overall performance trend being highly competitive with ARMs.

  • Stronger Scalability on Specific Tasks: On tasks like MMLU (Massive Multitask Language Understanding) and GSM8K (Grade School Math 8K), LLaDA exhibits even stronger scalability, often surpassing the ARM baselines as FLOPs increase.

  • Narrowing Gap: For tasks where ARMs initially perform better, such as PIQA (Physical Commonsense in Natural Language), the performance gap between LLaDA and ARMs narrows considerably as the models scale up.

  • Hypothesis for Performance Gain: The authors hypothesize that LLaDA's gains on certain benchmarks stem from its architectural difference: ARMs optimize only left-to-right conditional probabilities, while LLaDA is trained to consider multiple conditioning directions (due to its masked diffusion nature and lack of causal mask). This offers greater flexibility and potentially leads to better generalization, especially for tasks requiring bidirectional reasoning. This hypothesis is supported by LLaDA's strong performance on reversal reasoning tasks.

    The paper also addresses previous findings by Nie et al. [27], which suggested that MDMs might require 16 times more computation than ARMs for the same likelihood. The current work clarifies that likelihood is an indirect metric for downstream performance, and diffusion models optimize a bound of the likelihood, making direct comparisons complex. Furthermore, LLaDA extends the scaling range to $10^{23}$ FLOPs, demonstrating capabilities at a much larger scale than previously explored.

The detailed numerical results corresponding to Figure 3 are provided in Table 18 and Table 19 in the Appendix; they are transcribed below for completeness.

The following are the results from Table 18 of the original paper:

Model Training Tokens FLOPs MMLU CMMLU ARC-C PIQA GSM8K HumanEval
LLaDA 1B 37.75B 2.20e20 25.52 25.95 25.17 59.41 1.82 0.00
LLaDA 1B 88.08B 5.13e20 27.11 26.52 26.96 61.86 3.03 1.83
LLaDA 1B 138.41B 8.06e20 29.32 27.13 30.20 63.38 2.35 0.00
LLaDA 1B 239.08B 1.39e21 31.48 30.77 27.99 63.11 3.26 1.22
LLaDA 1B 352.32B 2.05e21 35.86 34.35 31.31 65.34 3.64 3.05
LLaDA 1B 461.37B 2.69e21 31.86 30.98 30.12 65.51 2.35 0.61
LLaDA 8B 62.91B 2.63e21 32.22 28.50 30.20 63.82 3.87 2.44
LLaDA 8B 125.83B 5.27e21 33.39 33.90 34.64 66.54 8.72 3.66
LLaDA 8B 251.66B 1.05e22 42.84 40.59 40.10 69.04 15.31 3.66
LLaDA 8B 377.49B 1.58e22 45.11 43.99 39.25 68.61 25.40 9.76
LLaDA 8B 503.32B 2.11e22 43.57 41.38 42.06 70.24 27.52 9.76
LLaDA 8B 629.14B 2.63e22 48.80 47.13 42.24 72.09 30.10 12.80
LLaDA 8B 679.48B 2.85e22 49.61 48.19 41.30 70.84 26.31 8.54
LLaDA 8B 792.72B 3.31e22 50.88 49.01 42.58 70.51 31.99 6.10
LLaDA 8B 981.47B 4.11e22 49.47 48.10 40.27 71.38 - 6.10
LLaDA 8B 1107.30B 4.64e22 51.13 47.57 41.13 69.26 36.69 10.37
LLaDA 8B 1233.13B 5.16e22 50.52 49.72 45.05 71.49 38.97 9.76
LLaDA 8B 1358.95B 5.69e22 54.61 53.97 49.40 74.05 48.14 17.68
LLaDA 8B 1547.70B 6.48e22 57.38 56.04 49.49 74.59 53.30 20.73
LLaDA 8B 1975.52B 8.27e22 58.52 57.87 50.68 75.35 - 19.51

The following are the results from Table 19 of the original paper:

Model Training Tokens FLOPs MMLU CMMLU ARC-C PIQA GSM8K HumanEval
ARM 1B 37.75B 2.20e20 25.47 25.38 30.20 67.36 2.20 4.88
ARM 1B 88.08B 5.13e20 24.67 25.23 33.96 70.02 7.51 10.37
ARM 1B 138.41B 8.06e20 29.25 27.48 33.45 70.29 8.34 9.76
ARM 7B 17.30B 6.02e20 26.92 25.18 21.02 57.18 1.29 1.22
ARM 7B 34.60B 1.20e21 25.83 25.38 24.07 62.84 1.59 2.44
ARM 7B 86.50B 3.01e21 24.41 24.90 25.42 71.11 2.88 7.93
ARM 7B 173.02B 6.02e21 26.20 24.78 26.10 74.27 6.67 9.15
ARM 7B 207.62B 7.23e21 30.36 28.86 31.86 74.48 8.57 12.80
ARM 7B 224.92B 7.83e21 29.49 32.26 31.19 74.37 8.95 8.54
ARM 7B 242.22B 8.43e21 33.62 31.38 34.92 75.41 10.84 9.15
ARM 7B 259.52B 9.03e21 34.11 34.20 32.88 75.19 9.33 10.98
ARM 7B 311.43B 1.08e22 35.66 35.49 36.61 75.14 11.30 10.37
ARM 7B 363.33B 1.26e22 34.54 37.67 34.58 76.55 12.28 14.02
ARM 7B 415.24B 1.45e22 35.37 38.37 35.25 76.39 14.40 12.80
ARM 7B 449.84B 1.57e22 39.51 39.24 34.92 76.82 14.94 14.63
ARM 7B 519.09B 1.81e22 40.30 40.69 37.29 77.15 14.03 14.63
ARM 7B 778.57B 2.71e22 43.33 43.50 38.31 77.53 17.59 14.63
ARM 7B 1038.09B 3.61e22 45.06 46.12 41.69 77.86 20.02 15.85
ARM 7B 1384.12B 4.82e22 47.63 48.18 47.80 76.93 22.82 15.24
ARM 7B 2076.18B 7.23e22 47.68 50.85 44.07 77.37 24.79 14.63
ARM 7B 2214.59B 7.71e22 49.26 52.08 53.56 77.69 27.37 17.07

6.2. Benchmark Results (Pre-trained)

To evaluate in-context learning and instruction-following, LLaDA 8B Base is compared against existing LLMs of similar scale. The evaluation protocols were standardized, and some representative LLMs (LLaMA3 8B, LLaMA2 7B) were re-evaluated using the authors' implementation for direct comparison.

The following are the results from Table 1 of the original paper:

LLaDA 8B* LLaMA3 8B* LLaMA2 7B* Qwen2 7B† Qwen2.5 7B† Mistral 7B† Deepseek 7B‡
Model Diffusion AR AR AR AR AR AR
Training tokens 2.3T 15T 2T 7T 18T - 2T
General Tasks
MMLU 65.9 (5) 65.4 (5) 45.9 (5) 70.3 (5) 74.2 (5) 64.2 (5) 48.2 (5)
BBH 49.7 (3) 62.1 (3) 39.4 (3) 62.3 (3) 70.4 (3) 56.1 (3) 39.5 (3)
ARC-C 45.9 (0) 53.1 (0) 46.3 (0) 60.6 (25) 63.7 (25) 60.0 (25) 48.1 (0)
Hellaswag 70.5 (0) 79.1 (0) 76.0 (0) 80.7 (10) 80.2 (10) 83.3 (10) 75.4 (0)
TruthfulQA 46.1 (0) 44.0 (0) 39.0 (0) 54.2 (0) 56.4 (0) 42.2 (0) -
WinoGrande 74.8 (5) 77.3 (5) 72.5 (5) 77.0 (5) 75.9 (5) 78.4 (5) 70.5 (0)
PIQA 73.6 (0) 80.6 (0) 79.1 (0) - - - 79.2 (0)
Mathematics & Science
GSM8K 70.3 (4) 48.7 (4) 13.1 (4) 80.2 (4) 85.4 (4) 36.2 (4) 17.4 (8)
Math 31.4 (4) 16.0 (4) 4.3 (4) 43.5 (4) 49.8 (4) 10.2 (4) 6.0 (4)
GPQA 25.2 (5) 25.9 (5) 25.7 (5) 30.8 (5) 36.4 (5) 24.7 (5) -
Code
HumanEval 35.4 (0) 34.8 (0) 12.8 (0) 51.2 (0) 57.9 (0) 29.3 (0) 26.2 (0)
HumanEval-FIM 73.8 (2) 73.3 (2) 26.9 (2)
MBPP 40.0 (4) 48.8 (4) 23.2 (4) 64.2 (0) 74.9 (0) 51.1 (0) 39.0 (3)
Chinese
CMMLU 69.9 (5) 50.7 (5) 32.5 (5) 83.9 (5) 47.2 (5)
C-Eval 70.5 (5) 51.7 (5) 34.0 (5) 83.2 (5) 45.0 (5)
  • Overall Performance: LLaDA 8B Base, trained on 2.3T tokens, demonstrates remarkable performance. It surpasses LLaMA2 7B Base (trained on 2T tokens) on nearly all tasks. More importantly, it is overall competitive with LLaMA3 8B Base (trained on 15T tokens), despite LLaMA3 being trained on significantly more data.
  • Strengths in Math and Chinese: LLaDA shows distinct advantages in mathematics (GSM8K, Math) and Chinese language understanding (CMMLU, C-Eval). For example, on GSM8K, LLaDA scores 70.3 compared to LLaMA3's 48.7 and LLaMA2's 13.1. Similarly, for CMMLU, LLaDA achieves 69.9, significantly higher than LLaMA3's 50.7 and LLaMA2's 32.5.
  • Reasons for Performance Differences: The authors conjecture that differences in data quality and distribution (especially given the closed-source nature of some LLM datasets) contribute to LLaDA's varying performance across tasks.
  • Ruling out Data Leakage: To ensure LLaDA's superior performance on GSM8K is not due to data leakage, the authors tested it on iGSM [34], an unseen, synthetic GSM8K-like task. The results (Table 12) confirm LLaDA's advantages on unseen mathematical problems, further validating its mathematical reasoning capabilities.

6.3. Benchmark Results (Post-trained)

The SFT (Supervised Fine-Tuning) procedure further enhances LLaDA's abilities. This section compares LLaDA 8B Instruct with other post-trained LLMs, noting that most competitors also employ reinforcement learning (RL) alignment, which LLaDA currently lacks.

The following are the results from Table 2 of the original paper:

LLaDA 8B* LLaMA3 8B* LLaMA2 7B* Qwen2 7B† Qwen2.5 7B† Gemma2 9B† Deepseek 7B‡
Model Diffusion AR AR AR AR AR AR
Training tokens 2.3T 15T 2T 7T 18T 8T 2T
Post-training SFT SFT+RL SFT+RL SFT+RL SFT+RL SFT+RL SFT+RL
Alignment pairs 4.5M - - 0.5M + - 1M + 0.15M - 1.5M + -
General Tasks
MMLU 65.5 (5) 68.4 (5) 44.1 (5) - 49.4 (0)
MMLU-pro 37.0 (0) 41.9 (0) 4.6 (0) 44.1 (5) 56.3 (5) 52.1 (5)
Hellaswag 74.6 (0) 75.5 (0) 51.5 (0) 68.5 (-)
ARC-C 88.5 (0) 82.4 (0) 57.3 (0) - - 49.4 (-)
Mathematics & Science
GSM8K 69.4 (4) 78.3 (4) 29.0 (4) 85.7 (0) 91.6 (0) 76.7 (0) 63.0 (0)
Math 31.9 (0) 29.6 (0) 3.8 (0) 52.9 (0) 75.5 (0) 44.3 (0) 15.8 (0)
GPQA 33.3 (5) 31.9 (5) 28.4 (5) 34.3 (0) 36.4 (0) 32.8 (0)
Code
HumanEval 49.4 (0) 59.8 (0) 16.5 (0) 79.9 (0) 84.8 (0) 68.9 (0) 48.2 (-)
MBPP 41.0 (4) 57.6 (4) 20.6 (4) 67.2 (-) 79.2 (0) 74.9 (0) 35.2 (-)
  • SFT Impact: SFT generally improves LLaDA's performance across most downstream tasks. For instance, ARC-C jumps from 45.9 to 88.5, and HumanEval from 35.4 to 49.4.
  • Declines: A few metrics, such as MMLU, show slight declines (65.9 to 65.5), which the authors attribute to suboptimal quality of the SFT data.
  • Comparison with RL-aligned Models: Despite only using SFT (while other models like LLaMA3, Qwen2, Gemma2, Deepseek incorporate SFT + Reinforcement Learning (RL) alignment), LLaDA 8B Instruct is competitive. The gaps with LLaMA3 8B Instruct are noticeable (e.g., GSM8K: 69.4 vs 78.3; HumanEval: 49.4 vs 59.8), but often remain small, especially considering the lack of RL alignment in LLaDA. This demonstrates the impressive instruction-following abilities achieved purely through SFT with a diffusion model.
  • Sampling Strategy Nuances: The paper notes that all results in Section 3 (including Tables 1 and 2) are based on pure diffusion sampling. For LLaDA 8B Instruct, block diffusion style sampling (a semi-autoregressive variant of diffusion sampling) actually performs better on GSM8K (78.6 vs 69.4) and Math (42.2 vs 31.9) than the default pure diffusion sampling. This is attributed to extensive |EOS| token padding in the SFT data, which can cause premature termination under the low-confidence remasking used by default pure diffusion (a sketch of block diffusion decoding follows this list).
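For reference, here is a minimal sketch of the block diffusion (semi-autoregressive) decoding mentioned in the last bullet: the response is split into blocks of length L′ that are generated left to right, while positions inside the current block are revealed by confidence, as in ordinary diffusion sampling. All names are illustrative stand-ins, and committed tokens are never re-masked here, which is a simplification of the remasking schedule described in the paper.

```python
import torch

MASK_ID, VOCAB = 0, 32                       # toy ids and vocabulary size

def mask_predictor(x: torch.Tensor) -> torch.Tensor:
    """Stand-in for the Transformer mask predictor: logits for every position."""
    return torch.randn(x.shape[0], x.shape[1], VOCAB)

@torch.no_grad()
def block_diffusion_sample(prompt: torch.Tensor, gen_length: int = 32,
                           block_len: int = 8, steps_per_block: int = 4) -> torch.Tensor:
    x = torch.cat([prompt, torch.full((1, gen_length), MASK_ID)], dim=1)
    for start in range(prompt.shape[1], x.shape[1], block_len):   # left-to-right blocks
        block = slice(start, start + block_len)
        for _ in range(steps_per_block):                          # diffusion steps in block
            probs = mask_predictor(x)[:, block, :].softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            conf = conf.masked_fill(x[:, block] != MASK_ID, -1.0)   # keep committed tokens
            idx = conf.topk(block_len // steps_per_block, dim=-1).indices
            x[:, block].scatter_(1, idx, pred.gather(1, idx))       # reveal most confident
    return x

print(block_diffusion_sample(torch.randint(1, VOCAB, (1, 4))))
```

With block_len equal to the full generation length this roughly reduces to pure diffusion sampling, and with block_len = 1 it degenerates to a left-to-right, autoregressive-style decode, which matches the spectrum of strategies ablated in Section 6.6.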

6.4. Reversal Reasoning and Analyses

The reversal curse [15] is a known limitation of ARMs, where models struggle to generalize from "A is B" to "B is A". LLaDA's bidirectional modeling is hypothesized to address this. The authors evaluate this on a poem completion task using 496 Chinese poem sentence pairs, where models must generate either the subsequent (forward) or preceding (reversal) line.

The following are the results from Table 4 of the original paper:

Forward Reversal
GPT-4o (2024-08-06) 82.7 34.3
Qwen2.5-7B Instruct 75.9 38.0
LLaDA-8B Instruct 51.8 45.6
  • Addressing the Reversal Curse: As shown in Table 4, LLaDA effectively addresses the reversal curse. It demonstrates consistent zero-shot performance across both forward (51.8) and reversal (45.6) tasks, with a small performance gap.

  • Comparison with ARMs: In contrast, GPT-4o and Qwen2.5-7B Instruct exhibit significant performance gaps between forward and reversal tasks. GPT-4o scores 82.7 on forward but drops to 34.3 on reversal. Qwen2.5 scores 75.9 on forward and 38.0 on reversal.

  • Superiority in Reversal Task: While GPT-4o and Qwen2.5 perform strongly on the forward generation task (benefiting from larger datasets and computational resources), LLaDA outperforms both by a large margin in the reversal task (45.6 vs 34.3 for GPT-4o and 38.0 for Qwen2.5).

  • Intuition: The authors suggest that LLaDA's inherently bidirectional modeling, which treats tokens uniformly without a left-to-right inductive bias, leads to this balanced performance. The connection to Any-order Autoregressive Models (AO-ARMs) (Eq. 15) in Appendix A.2 further supports this, indicating that LLaDA implicitly learns dependencies irrespective of token order (a sketch of this objective appears after the example below).

    An example answer from the poem completion task in Appendix B.9 (the Chinese prompt did not survive extraction): 梅须逊雪三分白，雪却输梅一段香。 (Plum blossoms yield three points of whiteness to snow, but snow loses a segment of fragrance to plum blossoms.)
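For readers unfamiliar with the referenced connection, a hedged sketch of the any-order autoregressive objective is given below, assuming the standard AO-ARM formulation (the paper's Eq. 15 may differ in notation): the expected log-likelihood is taken over uniformly random generation orders $\pi$ of the $L$ tokens, so no single direction is privileged.

$$
\mathbb{E}_{\pi \sim \mathrm{Uniform}(S_L)} \left[ \sum_{i=1}^{L} \log p_\theta\left( x_{\pi(i)} \mid x_{\pi(<i)} \right) \right]
$$

Because every permutation is weighted equally, predicting "A is B" and "B is A" are symmetric under this objective, which gives an intuition for the balanced forward/reversal scores above.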

6.5. Ablation on Classifier-free Guidance

Classifier-free guidance (CFG) [37, 27] is a technique used in diffusion models to improve generation quality by balancing alignment with prompts and sample diversity. While not used in main results for fair comparison with ARMs, the paper shows LLaDA's compatibility and benefits from CFG.

The following are the results from Table 6 of the original paper:

ARC-C Hellaswag TruthfulQA WinoGrande GPQA PIQA
w/o CFG 45.9 70.5 46.1 74.8 25.2 73.6
w/ CFG 47.9 72.5 46.4 74.8 26.1 74.4

As seen in Table 6, applying CFG consistently improves the performance of LLaDA 8B Base across the tested benchmarks (ARC-C, HellaSwag, TruthfulQA, WinoGrande, GPQA, PIQA). For instance, ARC-C increases from 45.9 to 47.9, and HellaSwag from 70.5 to 72.5. This confirms that CFG is a valuable plug-and-play technique for LLaDA to further enhance its generation quality.
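A minimal sketch of how such guidance is typically applied at each denoising step is shown below; the combination rule and guidance weight w follow a common formulation for masked diffusion guidance and are assumptions here, not necessarily the paper's exact implementation. The unconditional prediction is usually obtained by masking out (or dropping) the prompt.

```python
import torch

def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
               w: float = 0.5) -> torch.Tensor:
    """Classifier-free guidance in log space:
    log p~(x0 | prompt, xt) is proportional to
    (1 + w) * log p(x0 | prompt, xt) - w * log p(x0 | xt)."""
    return (1.0 + w) * cond_logits - w * uncond_logits

# Toy shapes: batch 1, 8 response positions, vocabulary of 32
cond = torch.randn(1, 8, 32)     # mask-predictor logits with the prompt present
uncond = torch.randn(1, 8, 32)   # mask-predictor logits with the prompt masked out
guided = cfg_logits(cond, uncond, w=0.5)
print(guided.shape)              # torch.Size([1, 8, 32])
```

Setting w = 0 recovers the unguided model, while larger w pushes predictions toward tokens favored by the prompt, trading diversity for prompt alignment.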

6.6. Ablation on Sampling Strategies

LLaDA supports flexible sampling strategies, including autoregressive, block diffusion, and pure diffusion sampling. Ablation studies are conducted on LLaDA 8B Base and LLaDA 8B Instruct to evaluate these.

The following are the results from Table 7 of the original paper:

BBH GSM8K Math HumanEval MBPP
Autoregressive 38.1 63.1 23.6 18.3 33.4
Block Diffusion L′= 2 37.3 62.6 25.2 14.6 33.6
L′ = 4 40.0 65.7 26.6 15.9 36.0
L′ = 8 42.0 68.2 27.7 19.5 39.2
L′ = 32 45.7 68.6 29.7 29.9 37.4
Block Diffusion LLaDA L′= 2 48.0 70.0 30.8 26.2 40.0
L′ = 4 48.5 70.3 31.3 27.4 38.8
L′ = 8 48.6 70.2 30.9 31.1 39.0
L′ = 32 48.3 70.3 31.2 32.3 40.0
Pure Diffusion 49.7 70.3 31.4 35.4 40.0

Table 7 (for LLaDA 8B Base) shows:

  • Block Length Impact: For block diffusion sampling, performance generally improves as the block length L′ increases (e.g., BBH from 37.3 at L′ = 2 to 45.7 at L′ = 32).

  • Block Diffusion vs. Autoregressive: Block diffusion sampling consistently outperforms autoregressive sampling.

  • Block Diffusion LLaDA vs. Standard Block Diffusion: Block diffusion LLaDA (also called semi-autoregressive remasking) further improves upon standard block diffusion sampling. For example, on BBH with L′ = 32, Block Diffusion scores 45.7, while Block Diffusion LLaDA scores 48.3.

  • Pure Diffusion: Pure diffusion sampling achieves the best overall performance (e.g., BBH 49.7, GSM8K 70.3, HumanEval 35.4).

    The following are the results from Table 8 of the original paper:

    GSM8K Math HumanEval MBPP GPQA MMLU-Pro ARC-C
    Autoregressive 0 9.5 0 0 0 0 84.4
    Block Diffusion 24.6 23.5 17.1 21.2 29.3 32.5 88.1
    Block Diffusion LLaDA 77.5 42.2 46.3 34.2 31.3 34.8 85.4
    Pure Diffusion 69.4 31.9 49.4 41.0 33.3 37.0 88.5

Table 8 (for LLaDA 8B Instruct):

  • Autoregressive Sampling for Instruct Model: Autoregressive sampling performs very poorly for the Instruct model, often scoring 0 or near 0. This is because SFT data (being complete sentences) causes the model to generate a full sentence and immediately output |EOS|, leading to extremely short or empty generations.

  • Block Diffusion Variants: Block diffusion and Block Diffusion LLaDA significantly improve over autoregressive. Block Diffusion LLaDA shows stronger results on GSM8K (77.5), Math (42.2), and HumanEval (46.3).

  • Pure Diffusion: Pure diffusion sampling still achieves the best overall performance across most tasks (e.g., HumanEval 49.4, MBPP 41.0, ARC-C 88.5).

  • |EOS| Token Handling: For Instruct models, the paper notes a specific adjustment: setting the confidence score of the |EOS| token to zero during pure diffusion sampling for HumanEval, MBPP, GSM8K, Math, and GPQA. This prevents the model from prematurely ending generations due to heavy |EOS| padding in the SFT data, which would otherwise degrade performance. It also explains why block diffusion LLaDA sampling scored higher on GSM8K and Math (78.6 and 42.2) than the pure diffusion results reported in Table 2 (69.4 and 31.9), which were obtained without this |EOS| adjustment.

    The following are the results from Table 9 of the original paper:

    Remasking Strategy BBH GSM8K Math HumanEval MBPP
    Random Remasking 32.1 21.3 9.2 11.6 21.0
    Low-confidence Remasking 45.0 70.0 30.3 32.9 40.2

Table 9 (Comparison of Remasking Strategies):

  • Low-confidence remasking consistently and significantly outperforms random remasking across all tested benchmarks for LLaDA 8B Base. For example, on BBH, low-confidence yields 45.0 compared to random's 32.1. On GSM8K, the difference is even more stark: 70.0 vs 21.3. This highlights the importance of intelligently deciding which tokens to re-mask, similar to annealed sampling strategies.
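The difference between the two strategies in Table 9 comes down to which positions get re-masked between denoising steps. A minimal sketch of the two selection rules (illustrative names, not the authors' code):

```python
import torch

def remask_indices(conf: torch.Tensor, n_remask: int, rule: str) -> torch.Tensor:
    """Choose which response positions to re-mask before the next step.

    conf holds the model's confidence in its current prediction at each position.
    """
    if rule == "random":
        # random remasking: any subset is equally likely, ignoring confidence
        return torch.randperm(conf.shape[-1])[:n_remask]
    # low-confidence remasking: re-mask the least confident predictions
    return conf.topk(n_remask, largest=False).indices

conf = torch.rand(16)                       # toy confidences for 16 response tokens
print(remask_indices(conf, 4, "random"))
print(remask_indices(conf, 4, "low-confidence"))
```

Keeping the high-confidence predictions and re-drawing only the uncertain ones is what produces the large gains reported in Table 9.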

6.7. Ablation on Generated Length

The generation length is a user-specified hyperparameter in LLaDA. This ablation studies its impact on performance using LLaDA 8B Base.

The following are the results from Table 10 of the original paper:

Length BBH GSM8K Math HumanEval MBPP
256 45.0 70.0 30.3 32.9 40.2
512 50.4 70.8 30.9 32.9 39.2
1024 49.7 70.3 31.4 35.4 40.0

Table 10 demonstrates that LLaDA's performance is relatively insensitive to the generation length hyperparameter. While there are minor fluctuations (e.g., BBH peaking at 512 length), the results remain robust across lengths of 256, 512, and 1024. This is a practical advantage, as it reduces the need for extensive tuning of this parameter and indicates that the variable-length training (Section 4.2.2) is effective.

6.8. Analysis of Sampling Efficiency

The paper analyzes LLaDA's sampling speed (throughput) and memory consumption compared to LLaMA3 8B Base.

The following are the results from Figure 5 of the original paper:

Figure 5: Analysis of Sampling Efficiency. The generation length for LLaDA is set to 256, with sampling steps set to 32, 64, 128, and 256 across the panels, corresponding to decoding 8, 4, 2, and 1 tokens per forward pass. The panels plot GSM8K, Math, HumanEval, and MBPP performance against throughput for LLaDA 8B and LLaMA3 8B, illustrating the trade-off between sampling efficiency and generation quality.

Figure 5 (Analysis of Sampling Efficiency):

  • Quality-Speed Trade-off: LLaDA enables a flexible trade-off between generation quality and sampling speed by adjusting the number of sampling steps. Fewer steps mean faster generation but potentially lower quality. The figures show performance (y-axis) vs. throughput (x-axis) for different numbers of sampling steps (32, 64, 128, 256, corresponding to 8, 4, 2, 1 tokens per forward pass respectively).

  • Throughput vs. LLaMA3:

    • On GSM8K and Math, LLaDA 8B Base achieves comparable performance to LLaMA3 8B Base while delivering 1.5 to 1.8 times higher throughput. This is notable because LLaMA3 uses KV Cache (an inference optimization) while LLaDA operates without such techniques.
    • On HumanEval, LLaDA performs comparably to LLaMA3 when their throughputs are matched.
    • On MBPP, LLaDA lags behind LLaMA3.
  • KV Cache Benefit: The benefit of KV caching for LLaMA3 is weaker on HumanEval due to its relatively short prompt lengths (average 132 tokens), compared to GSM8K (894), Math (680), and MBPP (628).

    The following are the results from Table 11 of the original paper:

    Input Length Output Length LLaDA 8B (GB) LLaMA3 8B w/o KV-Cache (GB) LLaMA3 8B w/ KV-Cache (GB)
    512 512 17.03 16.70 16.32
    1024 17.53 17.49 16.43
    2048 18.52 20.00 16.73
    1024 512 17.53 17.16 16.36
    1024 18.01 18.26 16.41
    2048 19.02 21.39 16.74

Table 11 (Memory Consumption):

  • LLaDA's memory usage (in GB) is comparable to LLaMA3 8B without KV-Cache.

  • It is slightly higher than LLaMA3 with KV-Cache.

  • LLaDA's memory usage remains constant regardless of the number of sampling steps, which is a characteristic of its non-autoregressive nature (it processes the entire sequence in parallel for each step).

    The authors emphasize that the primary goal of this study is not to achieve faster generation than ARMs, but to demonstrate the promise of diffusion models at scale. They acknowledge that further research on efficiency-oriented optimizations (e.g., architectural innovations, distillation, sampling algorithm design) is a promising future direction for LLaDA.

6.9. Evaluation on iGSM Dataset

To further validate LLaDA's mathematical capabilities and rule out any data leakage concerns (especially given GSM8K performance), the models were tested on iGSM [34], an infinite, synthetic GSM8K-like dataset. iGSM generates problems with varying solution steps based on specific rules, ensuring the dataset is unseen during training.

The following are the results from Table 12 of the original paper:

4 steps 5 steps 6 steps
LLaMA3 8B Base 38.0 35.0 34.0
LLaDA 8B Base 64.0 41.0 44.0

Table 12 shows LLaDA 8B Base demonstrating significant and consistent advantages over LLaMA3 8B Base on these unseen mathematical problems.

  • For 4-step problems, LLaDA scores 64.0 compared to LLaMA3's 38.0.

  • For 5-step problems, LLaDA scores 41.0 compared to LLaMA3's 35.0.

  • For 6-step problems, LLaDA scores 44.0 compared to LLaMA3's 34.0.

    These results align with the GSM8K findings in Table 1 and provide strong evidence that LLaDA possesses robust mathematical reasoning capabilities that generalize well to novel problems.

6.10. Case Studies

The paper includes several case studies to showcase LLaDA 8B Instruct's instruction-following abilities and the visualization of its sampling process.

The following are the results from Table 13 of the original paper:

Table 13: Visualization of the Block Diffusion LLaDA Sampling Process. In the response of LLaDA, darker colors indicate tokens predicted in the later stages of sampling, while lighter colors correspond to earlier predictions. The visualized example is a prompt about the benefits of regular exercise for physical and mental health, generated step by step.

Figure 6 (Visualization of the Block Diffusion LLaDA Sampling Process, corresponding to Table 13 of the paper): This image illustrates the iterative generation process. Darker colors indicate tokens predicted in later stages of sampling, while lighter colors correspond to earlier predictions. The example shows LLaDA progressively filling in masked tokens and refining its output over successive sampling steps.

The following are the results from Table 14 of the original paper:

Table 14: Visualization of the Multi-turn Dialogue. The random remasking strategy is employed. In the responses of LLaDA, darker colors indicate tokens predicted in the later stages of sampling, while lighter colors correspond to earlier predictions.

Figure 7 (Visualization of the Multi-turn Dialogue with Random Remasking, corresponding to Table 14 of the paper): Similar to Figure 6, this image visualizes the multi-turn dialogue generation process using random remasking. It showcases LLaDA's ability to maintain context across turns and generate coherent responses, with color intensity indicating the prediction stage.

  • Multi-turn Dialogue Example (Table 3 excerpt; see also Figures 6 and 7 above):
    User: Could you please share with me what the initial two lines of the renowned poem 'The Road Not Taken'?
    LLaDA: Certainly! The initial two lines of the renowned poem "The Road Not Taken" by Robert Frost are: "Two roads diverged in a yellow wood, And sorry I could not travel both." These lines set the stage for the poem, introducing the idea of a speaker facing a choice between two paths.
    User: Please help me translate into Chinese.
    LLaDA: Sure! The first two lines of "The Road Not Taken" by Robert Frost can be translated into Chinese as: "", "" (Note: The Chinese text is missing in the OCR, but the model attempts the translation.)
    User: Please translate into German.
    LLaDA: Sure! The first two lines of "The Road Not Taken" by Robert Frost can be translated into German as: "Zwei Wege trennten sich im gelben Wald, und leider konnte ich nicht beide tun."
    User: Please also write a poem about life choices. I require 5 sentences, and the first word of each sentence starts with C as the first letter.
    LLaDA: Certainly! Here's a poem about life choices, with each sentence starting with the letter "C": Choices cascade through life's journey, Creating connections with each decision, Careful consideration leads to better paths, Courageous choices shape our destiny, Choosing wisely

This example from Table 3 highlights:

  • Coherent and Fluent Text Generation: LLaDA generates natural-sounding, extended text in a non-autoregressive manner.

  • Multi-turn Dialogue Capability: It effectively retains conversation history and produces contextually appropriate responses.

  • Multilingualism: Demonstrated by the ability to translate text into Chinese and German.

  • Creative Constraints: LLaDA can follow complex instructions, such as generating a poem with specific structural and alphabetical constraints.

    The following are the results from Table 15 of the original paper:

    Daily life
    User: What's a piece of advice you've received that you've found helpful?
    LLaDA: One helpful piece of advice I've received is to always be open to learning and growth. I've found that being willing to explore new ideas and experiences has helped me become more resilient and adaptable. By embracing a growth mindset, I've been able to identify
    Daily life
    User: Please recommend me three famous movies
    LLaDA: Certainly! Here are three famous movies that have achieved widespread popularity and critical acclaim: 1. **The Shawshank Redemption** (1994) - Inspired by a novella and directed by Frank Darabont, this film tells the story of Andy Dufresne, a man convicted of murdering his wife, dealing with his harsh life in Shawshank Prison and forming an unlikely friendship with a fellow inmate. 2. **The Godfather** (1972) - Directed by Francis Ford Coppola, this film is based on Mario Puzo's novel of the same name. It follows the story of the Corleone family, particularly Michael Corleone, who rises to power and takes over the family's criminal empire. 3. **The Empire Strikes Back** (1980) - Directed by George Lucas, this film continues the story of a young Skywalker in the Star Wars saga. It follows Luke Skywalker as he helps the Rebel Alliance fight against the Galactic Empire and destroy the powerful weapon, the Death Star. All of these films are celebrated for their cinematic achievements, powerful storytelling, and memorable characters.
    Translation
    User: Please help me translate into Chinese: 'What is now proved was once only imagined'
    LLaDA: (The Chinese response is missing in the OCR.)
    Translation
    User: Please help me translate into German: 'What is now proved was once only imagined'
    LLaDA: Wer heute bestätigt ist, war einst nur immaginiert.
    Math
    User: Emma decides to run 4 sprints 5 times a week. She runs 75 meters each sprint. How many total meters does she run a week?
    LLaDA: Emma runs 4 sprints each time, so she runs a total of 4 x 5 = 20 sprints a week. Each sprint is 75 meters long, so Emma runs a total of 20 x 75 = 1500 meters a week. The final result is 1500
    Code
    User: I am writing python code, I have an array, please help me sort it from largest to smallest
    LLaDA: Certainly! You can sort an array from the largest to smallest in Python using the 'sort()' method or the 'sorted()' function. Here's an example: ### Using 'sort()' method: my_array = [3, 1, 4, 1, 5, 2]

Table 15 (Single-turn Dialogue Cases): Showcases LLaDA's ability to handle various types of single-turn prompts including:

  • General knowledge/advice (What's a piece of advice...).

  • Recommendations (Please recommend me three famous movies).

  • Translations (Please help me translate into Chinese/German).

  • Math word problems (Emma decides to run 4 sprints...).

  • Code generation (I am writing python code...).

    The following are the results from Table 16 of the original paper:

    Daily life
    User: How do I prepare for a job interview?
    LLaDA: To prepare for a job interview, you should research the company and the job, practice common interview questions, and dress appropriately.
    User: What are some common interview questions?
    LLaDA: Some common interview questions include "Tell me about yourself" and "Why do you want to work for us?"
    Daily life
    User: How do I start learning programming?
    LLaDA: To learn programming, you can start by choosing a programming language that you are interested in. Some popular languages include Python, Java, and C++.
    User: Which programming language should I learn first?
    LLaDA: It depends on your interests and goals. Python is a great language for beginners for its simplicity and versatility.
    User: How long does it take to learn Python?
    LLaDA: It can take anywhere from a few weeks to several months to learn Python.
    User: What are some good Python projects for beginners?
    LLaDA: Some good Python projects for beginners include building a simple calculator, a game, and a web application.
    Math
    User: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
    LLaDA: Natalia sold clips to 48 friends in April. She sold half as many clips in May, so she sold 48/2 = 24 clips in May. Altogether, Natalia sold 48 + 24 = 72 clips in April and May. The final result is 72.
    User: Natalia sold three times as many items in June as she did in May. How many clips did Natalia sell altogether in these three months?
    LLaDA: Natalia sold clips to 48 friends in April. She sold half as many clips in May, so she sold 48/2 = 24 clips in May. Natalia sold three times as many items in June as she did in May,

Table 16 (Multi-turn Dialogue Cases): Reinforces LLaDA's ability to engage in extended multi-turn conversations, demonstrating context awareness and helpful responses in both general advice and multi-step math problems.

The following are the results from Table 17 of the original paper:

(Two poem reversal completion examples, each showing the user prompt with responses from GPT-4o and LLaDA; the original Chinese poem text is not recoverable from the extracted source.)
Table 17 (Poem Reversal Completion Cases): This table provides concrete examples where LLaDA 8B Instruct succeeds in reversal poem completion while GPT-4o fails. The specific content of the poems is difficult to interpret from the OCR text, but the qualitative result supports LLaDA's claimed advantage in reversal reasoning.

Overall, these case studies provide strong qualitative evidence of LLaDA's capabilities, especially its instruction-following and multi-turn dialogue skills, which are particularly impressive for a non-autoregressive model.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces LLaDA, a groundbreaking diffusion language model trained from scratch at an unprecedented scale of 8 billion parameters. LLaDA successfully demonstrates strong capabilities in scalability, in-context learning, and instruction-following, achieving performance comparable to, and in some cases surpassing, leading autoregressive LLMs such as LLaMA3. Crucially, LLaDA challenges the long-held assumption that core LLM capabilities are inherently dependent on autoregressive models by providing a robust, theoretically grounded alternative. Its unique bidirectional modeling approach enables it to effectively address limitations like the reversal curse, where it outperforms even GPT-4o. These findings mark a significant scientific innovation, opening new paradigms for language modeling and offering fresh insights into the fundamental mechanisms underpinning LLM intelligence.

7.2. Limitations & Future Work

The authors candidly acknowledge several limitations of the current work and outline promising avenues for future research:

  • Fixed Generation Length: The generation length is currently a user-specified hyperparameter. While LLaDA is shown to be insensitive to this, an adaptive generation length mechanism would be more efficient and user-friendly.
  • Computational Constraints for Direct Comparison: Due to significant computational resources required, direct comparisons with ARMs (e.g., training an 8B ARM baseline on the exact same 2.3T dataset) were not fully realized. At larger scales, the ARM baselines used were of slightly different sizes or pre-trained on different data volumes, making some comparisons less perfectly controlled.
  • Lack of Specialized Architectural Optimizations: LLaDA does not yet incorporate specialized architectural designs or optimizations common in ARMs, such as grouped query attention [24] or KV caching [24], which are crucial for inference efficiency in ARMs.
  • Sampling Algorithm Efficiency: While LLaDA offers a flexible quality-speed trade-off in sampling, more efficient and controllable sampling algorithms [31, 78-87] for MDMs are still nascent and require further development.
  • Absence of Reinforcement Learning (RL) Alignment: LLaDA has not yet undergone alignment with reinforcement learning [90, 91], which is critical for fully optimizing LLM performance and aligning with human intent (e.g., in chat models).
  • Scaling and Data Volume: Despite its unprecedented scale for a diffusion language model, LLaDA's model size and training data volume remain smaller than those of the absolute leading ARM counterparts (e.g., LLaMA3 on 15T tokens, or proprietary models like GPT-4). Further scaling is needed to fully evaluate its ultimate potential.
  • Multi-modal Data: LLaDA's capability to process multi-modal data (e.g., text and images) remains unexplored.
  • Integration with Advanced Techniques: Its impact on prompt tuning techniques [96] and integration into agent-based systems [97, 98] is not yet fully understood.
  • Post-training Investigation: A systematic investigation into advanced post-training methods (e.g., O1-like systems [99, 100] for improved reasoning) for diffusion language models is needed.

7.3. Personal Insights & Critique

This paper is a significant milestone in language modeling research. For a long time, the autoregressive paradigm has been the de facto standard, almost synonymous with Large Language Models. LLaDA offers compelling evidence that this assumption might be too narrow. The demonstration of a diffusion model achieving competitive performance at 8B parameters on diverse LLM benchmarks, especially in areas like reversal reasoning, is a strong validation of the underlying generative principles being more fundamental than the specific sequential generation mechanism.

Inspirations:

  • The work opens up entirely new architectural and training considerations for LLMs. If diffusion models can truly parallelize generation and learn bidirectional contexts effectively, it could lead to more efficient inference strategies (e.g., generating entire responses in fewer, larger steps) and models inherently robust to tasks requiring non-sequential reasoning.

  • The reversal curse finding is particularly insightful. It suggests that the bidirectional nature of masked diffusion inherently addresses a limitation that ARMs struggle with, even with massive scale. This might point towards hybrid architectures or training methods that combine the strengths of both paradigms.

  • The potential for diffusion models to generate text in a more "holistic" or "canvas-filling" manner, rather than strict left-to-right, could lead to novel applications in creative writing, controlled generation (e.g., filling in blanks within a larger context), or even interactive human-AI co-creation tools where parts of a text can be iteratively refined.

    Critique / Areas for Improvement:

  • Computational Parity for Scalability Claims: While the scalability figure (Figure 3) is compelling, the authors' admission that "direct comparisons between LLaDA and ARMs—such as training on identical datasets—were restricted to a computational budget of less than $10^{23}$ FLOPs" and that larger scale ARM baselines were "of slightly different sizes" is a critical point. This makes it harder to definitively conclude that LLaDA scales as efficiently as ARMs at the extreme end of the current LLM scale. Further investment in training a truly identical 8B ARM baseline on 2.3T tokens would strengthen this claim.

  • Inference Efficiency: The sampling speed analysis (Figure 5 and Table 11) indicates that while LLaDA offers a trade-off, it doesn't fundamentally surpass ARMs in raw throughput without KV caching. The absence of KV caching and other optimizations in LLaDA is a practical limitation for real-world deployment, and it remains to be seen how much diffusion models can close this gap. Given that Transformer models for MDMs don't use causal masks, they cannot benefit from KV caching in the same way ARMs do. This suggests that novel inference optimization techniques specific to diffusion models will be crucial.

  • Data Quality Impact: The authors note that SFT sometimes led to declines in performance (e.g., MMLU) due to "suboptimal quality of the SFT data." This highlights that diffusion models are likely just as sensitive to data quality as ARMs, and careful data curation remains paramount.

  • Long-term Dependencies and Consistency: While bidirectional modeling is an advantage, diffusion models typically perform multiple passes over the entire sequence. It would be interesting to see further analysis on how well LLaDA maintains long-range coherence and consistency across these passes, especially for very long text generations.

    Despite these points, LLaDA is a truly exciting development that shakes up the foundational assumptions of LLM design. It paves the way for a more diverse landscape of generative language models and encourages a deeper theoretical understanding of what truly makes LLMs powerful.
