Large Language Diffusion Models
TL;DR Summary
LLaDA, a diffusion-based large language model, uses a forward masking process and a reverse generation process, parameterized by a Transformer that predicts masked tokens while optimizing a likelihood lower bound. It matches autoregressive baselines across diverse tasks and excels at in-context learning, demonstrating diffusion models' promise for scalable language modeling.
Abstract
The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings show the promise of diffusion models for language modeling at scale and challenge the common assumption that core LLM capabilities discussed above inherently depend on ARMs. Project page and codes: https://ml-gsai.github.io/LLaDA-demo/.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Large Language Diffusion Models
1.2. Authors
The paper lists a team of authors with affiliations primarily from the Gaoling School of Artificial Intelligence, Renmin University of China, and Ant Group. Key authors include Shen Nie, Fengqi Zhu, and Chongxuan Li.
1.3. Journal/Conference
This paper is published as a preprint on arXiv (arXiv:2502.09992). This indicates it has not yet undergone formal peer review for a journal or conference, but it is publicly available for discussion and feedback within the academic community.
1.4. Publication Year
2025
1.5. Abstract
The abstract introduces LLaDA (Large Language Diffusion with mAsking), a novel diffusion model designed to challenge the prevailing assumption that Large Language Models (LLMs) rely solely on autoregressive models (ARMs). LLaDA is trained from scratch using a pre-training and supervised fine-tuning (SFT) paradigm. Its core mechanism involves a forward data masking process and a reverse generation process, where a Transformer predicts masked tokens. This approach offers a principled generative method for probabilistic inference by optimizing a likelihood lower bound. Across various benchmarks, including general tasks, math, and code, LLaDA 8B demonstrates strong scalability and performance comparable to ARM baselines. Notably, LLaDA 8B performs competitively with LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following capabilities in multi-turn dialogues. Furthermore, it effectively addresses the reversal curse, even outperforming GPT-4o in a reversal poem completion task. The findings suggest that core LLM capabilities are not inherently dependent on ARMs, highlighting the promise of diffusion models for large-scale language modeling.
1.6. Original Source Link
https://arxiv.org/abs/2502.09992 This is a preprint.
1.7. PDF Link
https://arxiv.org/pdf/2502.09992v3.pdf
2. Executive Summary
2.1. Background & Motivation
The field of Large Language Models (LLMs) has been overwhelmingly dominated by autoregressive models (ARMs), often referred to as the "next-token prediction" paradigm. These models generate text sequentially, token by token, from left to right. While this approach has achieved remarkable success, leading to powerful LLMs, a fundamental question remains: are the core capabilities of LLMs—such as scalability, in-context learning, and instruction-following—inherently tied to the autoregressive nature of these models, or do they stem from broader generative modeling principles?
The paper argues that the success of LLMs is more attributable to the underlying generative modeling principles (i.e., maximizing the likelihood of the data distribution) combined with the power of Transformers, large model sizes, and extensive data, rather than the specific autoregressive formulation. This perspective is supported by the success of diffusion models (which are also generative) in other domains like computer vision (e.g., diffusion transformers for image generation). Furthermore, ARMs exhibit certain inherent limitations, such as difficulties with reversal reasoning tasks, due to their unidirectional generation process.
The core problem this paper aims to solve is to investigate whether diffusion models, a different class of generative models, can achieve the same, or even superior, core capabilities as ARMs when scaled to the size and data typical of modern LLMs. This challenges a deeply ingrained assumption in the field and seeks to open new avenues for LLM development beyond the autoregressive paradigm.
2.2. Main Contributions / Findings
The paper introduces LLaDA (Large Language Diffusion with mAsking), a novel diffusion model for language, and presents several key contributions and findings:
- Challenging the Autoregressive Hegemony: LLaDA demonstrates that core LLM capabilities, such as scalability, in-context learning, and instruction-following, can emerge from generative modeling principles beyond the autoregressive paradigm. This fundamentally questions the necessity of autoregression for building powerful language models.
- Novel Model Architecture (LLaDA): LLaDA employs a masked diffusion model (MDM) approach, featuring a forward data masking process and a reverse generation process. It uses a Transformer to predict masked tokens and optimizes a likelihood lower bound, providing a principled generative approach for probabilistic inference.
- Unprecedented Scale for Diffusion Language Models: LLaDA is scaled to an 8 billion (8B) parameter model, trained from scratch on 2.3 trillion (T) tokens. This represents the largest diffusion model applied to language to date, pushing the boundaries of diffusion model capabilities in Natural Language Processing (NLP).
- Strong Scalability: LLaDA exhibits impressive scalability, matching the overall performance of autoregressive model (ARM) baselines (trained on the same data) across diverse tasks like MMLU and GSM8K as its computational budget increases.
- Competitive In-Context Learning: The pre-trained LLaDA 8B Base model performs comparably to strong LLMs like LLaMA3 8B Base on 15 standard zero/few-shot learning tasks, often surpassing LLaMA2 7B Base. This demonstrates effective in-context learning without task-specific fine-tuning.
- Impressive Instruction-Following: After supervised fine-tuning (SFT), LLaDA 8B Instruct shows strong instruction-following abilities, including in complex scenarios like multi-turn dialogue, marking a significant achievement for a non-autoregressive model.
- Addressing the Reversal Curse: LLaDA effectively overcomes the reversal curse, a known limitation of ARMs in which models struggle with tasks requiring reasoning in the reverse direction (e.g., inferring the preceding element given the succeeding one). LLaDA maintains consistent performance across forward and reversal tasks and remarkably outperforms GPT-4o in a reversal poem completion task.
- Principled Generative Objective: Unlike some prior discrete diffusion models that used heuristic objectives, LLaDA is built upon a strong theoretical foundation, optimizing a variational lower bound of its log-likelihood, which is crucial for its scalability and generative capabilities.

These findings open up a new research direction for language modeling, suggesting that diffusion models could offer a powerful alternative or complementary paradigm to autoregressive models for developing next-generation LLMs.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following core concepts:
- Large Language Models (LLMs): Advanced neural network models, typically based on the Transformer architecture, trained on vast amounts of text data. Their goal is to understand, generate, and process human language. They exhibit emergent properties like in-context learning (learning from examples within the prompt) and instruction-following (obeying commands given in natural language). Examples include GPT-3, LLaMA, and GPT-4.
- Generative Modeling: A branch of machine learning focused on building models that can generate new data samples resembling the training data. For LLMs, this means generating human-like text. The ultimate goal is to capture the underlying probability distribution of the real data, $p_{\text{data}}(x)$.
  - Maximum Likelihood Estimation (MLE): A common technique for training generative models. It involves finding the model parameters $\theta$ that maximize the probability (likelihood) of observing the training data under the model's distribution $p_\theta(x)$. This is often expressed as maximizing $\mathbb{E}_{p_{\text{data}}(x)}[\log p_\theta(x)]$.
  - KL Divergence (Kullback-Leibler Divergence): A measure of how one probability distribution diverges from a second, expected probability distribution. Minimizing the KL divergence between the data distribution and the model distribution, i.e., $\min_\theta \mathrm{KL}(p_{\text{data}}(x) \,\|\, p_\theta(x))$, is equivalent to MLE.
- Autoregressive Models (ARMs): A specific type of generative model that predicts the next token in a sequence based on all preceding tokens, processing information strictly from left to right.
  - Next-token prediction: The primary training objective for ARMs. Given a sequence $x = (x^1, \dots, x^L)$, the model learns to predict $x^i$ given $x^1, \dots, x^{i-1}$.
  - Likelihood Factorization: The probability of a sequence is factorized as a product of conditional probabilities:
    $$p_\theta(x) = p_\theta(x^1) \prod_{i=2}^{L} p_\theta(x^i \mid x^1, \dots, x^{i-1})$$
    This formula (Equation 2 in the paper) defines how ARMs model the probability of an entire sequence (a code sketch contrasting this objective with masked prediction follows this list).
- Diffusion Models: A class of generative models that learn to reverse a diffusion process.
  - Forward Diffusion Process: Gradually adds noise (or masks tokens, in discrete cases) to data until it becomes pure noise (or fully masked). It is typically fixed and does not involve learning.
  - Reverse Diffusion Process: The learned part. The model learns to iteratively denoise (or unmask) the data, starting from noise (or fully masked data), to reconstruct a clean sample from the data distribution.
  - Masked Diffusion Models (MDMs): A variant of diffusion models specifically for discrete data like text. Instead of adding continuous noise, they progressively mask (replace with a special [MASK] token) parts of the input. The model's task is to predict the original tokens that were masked.
- Transformers: The foundational neural network architecture for modern LLMs.
  - Attention Mechanism: The core component of Transformers, allowing the model to weigh the importance of different parts of the input sequence when processing each token. Unlike recurrent neural networks, Transformers can process all tokens in parallel.
  - Self-Attention: A type of attention in which the mechanism attends to other positions in the same sequence to compute a representation for each position.
  - Causal Mask: A specific type of mask used in Transformers for autoregressive models. It prevents a token from attending to future tokens in the sequence, enforcing the left-to-right generation constraint. LLaDA explicitly does not use a causal mask.
- Pre-training and Supervised Fine-Tuning (SFT): A common paradigm for training LLMs.
  - Pre-training: The initial phase, in which a model is trained on a massive, diverse text corpus using an unsupervised objective (e.g., next-token prediction for ARMs or mask prediction for MDMs). This allows the model to learn general language understanding and generation capabilities.
  - Supervised Fine-Tuning (SFT): After pre-training, the model is further trained on a smaller dataset of (prompt, response) pairs. This phase teaches the model to follow instructions and align its outputs with human preferences.
- In-context Learning: The ability of an LLM to learn new tasks or adapt to new instructions by observing a few examples provided directly within the input prompt, without any model weight updates.
- Instruction-Following: The ability of an LLM to understand and execute commands or instructions given in natural language, often after SFT.
- Reversal Curse: A phenomenon observed in ARMs where models trained on "A is B" facts struggle to infer "B is A" without explicit training on the reverse relationship. This highlights a limitation in their generalization capabilities, often attributed to their unidirectional processing.
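To make the contrast between the two training objectives concrete, here is a minimal PyTorch sketch (the vocabulary size, mask id, and `model` interface are illustrative assumptions, not LLaDA's implementation):

```python
import torch
import torch.nn.functional as F

VOCAB, MASK_ID = 1000, 999            # hypothetical vocabulary size and [MASK] id

def arm_next_token_loss(model, x0):
    """ARM objective: predict token i from tokens < i (model applies a causal mask)."""
    logits = model(x0[:, :-1])                       # (B, L-1, VOCAB)
    return F.cross_entropy(logits.reshape(-1, VOCAB), x0[:, 1:].reshape(-1))

def mdm_masked_loss(model, x0):
    """MDM objective: mask tokens at a random ratio t, score only masked positions."""
    t = torch.rand(x0.shape[0], 1)
    masked = torch.rand(x0.shape, dtype=torch.float) < t
    xt = torch.where(masked, torch.full_like(x0, MASK_ID), x0)
    logits = model(xt)                               # bidirectional model, no causal mask
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), x0.reshape(-1), reduction="none")
    return (loss.reshape(x0.shape) * masked).sum() / masked.sum().clamp_min(1)
```

The only structural difference is where the supervision comes from: the ARM loss scores every position against its right neighbor, while the masked loss scores only the positions that were hidden, using context from both sides.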
3.2. Previous Works
The paper contextualizes LLaDA against a backdrop of autoregressive models (ARMs) and prior attempts at diffusion models for language.
- Autoregressive Language Models (ARMs): The dominant paradigm, exemplified by models like GPT [2] and LLaMA [6, 21]. These models define the probability of a sequence as $p_\theta(x) = p_\theta(x^1) \prod_{i=2}^{L} p_\theta(x^i \mid x^1, \dots, x^{i-1})$. This "next-token prediction" approach has led to impressive LLMs, but their unidirectional nature is a potential limitation. The paper directly challenges this assumption.
- BERT (Bidirectional Encoder Representations from Transformers) [22]: While not a generative ARM, BERT is a seminal Transformer-based model that introduced the masked language modeling (MLM) objective. BERT trains a Transformer encoder to predict randomly masked tokens in a sequence, allowing it to learn bidirectional contexts. This is a precursor to masked diffusion models, as it involves masking and prediction. However, BERT was primarily used for language understanding tasks (e.g., classification, question answering) rather than generative modeling of full sequences. Key differences from LLaDA include BERT's fixed masking ratio and its lack of a principled maximum-likelihood generative framework.
- Diffusion Models in Vision: The success of diffusion models in generating high-quality images [38, 39, 40] and the emergence of diffusion transformers [9, 10] in computer vision provided inspiration. These models demonstrate the power of diffusion as a generative paradigm at scale.
- Discrete Diffusion Models for Language:
  - Early Attempts (Continuous State Spaces) [41-51]: Some works tried to adapt continuous diffusion models by embedding discrete text data into continuous spaces. These often faced challenges with scalability and computational cost; the paper notes that a 1B model might require 64x the compute of an ARM for comparable performance [57].
  - Discrete Process Variants [58-71]: Other approaches developed discrete diffusion directly, replacing continuous noise with discrete operations like masking. Austin et al. [16] were pioneers in discrete diffusion for language. Lou et al. [17] showed masked diffusion could achieve perplexity comparable to GPT-2-scale ARMs. Shi et al. [18], Sahoo et al. [19], and Ou et al. [20] established theoretical foundations for masked diffusion models (MDMs), particularly linking the training objective to maximum likelihood estimation; these theoretical results are crucial for LLaDA's design. Nie et al. [27] explored scaling laws for MDMs and their application to tasks like question answering at GPT-2 scale, suggesting that MDMs might require more computation than ARMs for the same likelihood.
- MaskGIT (Masked Generative Image Transformer) [23]: A parallel line of work in image generation that uses a masked generative Transformer. MaskGIT uses a heuristic training objective and introduces low-confidence remasking for sampling, an idea adopted by LLaDA.
3.3. Technological Evolution
The evolution of LLMs has primarily been driven by the Transformer architecture and the autoregressive paradigm. Initially, recurrent neural networks (RNNs) and LSTMs handled sequential data, but their inability to parallelize computations limited scalability. The Transformer [7] revolutionized this by introducing the attention mechanism, allowing parallel processing and enabling models to scale to billions of parameters. This led to the rise of pre-trained language models like BERT (for understanding) and GPT (for generation).
The GPT series firmly established the autoregressive "next-token prediction" approach as the standard for generative LLMs. Subsequent LLMs like LLaMA, PaLM, Claude, and GPT-4 largely followed this recipe, focusing on scaling model size, data, and computational resources. This paradigm has been incredibly successful, yielding models with impressive in-context learning and instruction-following capabilities.
However, researchers have continually explored alternative generative frameworks. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) were prominent in other domains, but diffusion models have recently emerged as state-of-the-art in image generation. The work by Austin et al. [16] and others began applying discrete diffusion to text, but scaling them to LLM levels remained a significant challenge.
LLaDA fits into this evolution by being the first work to successfully scale a masked diffusion model to 8 billion parameters and demonstrate its competitive performance against leading ARMs. It represents a critical point in the technological timeline, suggesting a potential paradigm shift or diversification in LLM development, moving beyond the sole reliance on autoregressive generation.
3.4. Differentiation Analysis
LLaDA differentiates itself from ARMs and previous diffusion models in several key ways:
- Generative Mechanism:
  - ARMs: Generate text token by token, strictly from left to right. This imposes a causal (unidirectional) dependency.
  - LLaDA (MDM): Generates text by iteratively predicting and unmasking tokens across the entire sequence simultaneously. It uses a Transformer without a causal mask, allowing it to learn bidirectional dependencies. This bidirectional modeling is a core difference, enabling capabilities like reversal reasoning.
- Theoretical Foundation and Scale:
  - Previous Discrete Diffusion for Language: Many prior works either used heuristic training objectives or struggled to scale effectively. Nie et al. [27] noted that MDMs might require significantly more computation than ARMs for comparable likelihood.
  - LLaDA: Builds on strong theoretical foundations [18, 19, 20] that link its training objective (Eq. 3) to a likelihood lower bound, making it a principled generative model. Crucially, it scales this principled approach to an unprecedented 8B parameters for a diffusion language model, demonstrating scalability comparable to ARMs on downstream tasks, not just likelihood.
- Core LLM Capabilities:
  - ARMs: In-context learning and instruction-following are emergent properties.
  - LLaDA: Demonstrates that these capabilities are not exclusive to ARMs but arise from fundamental generative modeling principles. LLaDA achieves competitive in-context learning and impressive instruction-following post-SFT, challenging the common assumption.
- Addressing Limitations:
  - ARMs: Suffer from the reversal curse due to their unidirectional nature.
  - LLaDA: Its bidirectional modeling, inherent in the masked diffusion approach, naturally addresses the reversal curse, showing consistent performance on forward and reverse tasks and even outperforming GPT-4o in specific scenarios.

In essence, LLaDA represents a robust, theoretically grounded, and highly scalable diffusion model that directly competes with, and in some aspects surpasses, state-of-the-art ARMs, thereby opening a new frontier for LLM research and development.
4. Methodology
4.1. Principles
The core idea behind LLaDA is to adapt masked diffusion models (MDMs) for large-scale language generation. Unlike autoregressive models (ARMs) that predict the next token sequentially, LLaDA leverages a process inspired by diffusion models in other domains. It defines a generative process through two main stages: a forward process that gradually masks tokens in the input data, and a reverse process where a neural network learns to unmask these tokens to reconstruct the original data. This reverse process is parameterized by a Transformer that predicts all masked tokens simultaneously, allowing for bidirectional context learning.
The theoretical basis for LLaDA lies in optimizing a variational lower bound of the log-likelihood of the data. This means that, despite its non-autoregressive nature, LLaDA is a principled generative model aiming to capture the true data distribution $p_{\text{data}}(x_0)$. The intuition is that by learning to denoise (or unmask) text from a progressively masked version, the model implicitly learns the underlying structure and dependencies of language, similar to how ARMs learn through next-token prediction. The key difference is the bidirectional learning and generation, which removes the left-to-right constraint of ARMs.
4.2. Core Methodology In-depth (Layer by Layer)
LLaDA defines its model distribution $p_\theta(x_0)$ through a forward masking process and a reverse generation process.
4.2.1. Probabilistic Formulation
The probabilistic formulation of LLaDA is central to its operation and distinguishes it from ARMs.
4.2.1.1. Forward Process
The forward process (or masking process) gradually introduces mask tokens into an original data sample $x_0$. It is indexed by a time variable $t \in [0, 1]$.

- At $t = 0$, the data $x_0$ is fully observed (no masks).
- As $t$ increases from 0 to 1, tokens are progressively masked.
- At $t = 1$, the sequence $x_1$ is fully masked.

This process is defined by a conditional distribution $q_{t|0}(x_t \mid x_0)$, which is factorized across all tokens in the sequence:
$$q_{t|0}(x_t \mid x_0) = \prod_{i=1}^{L} q_{t|0}(x_t^i \mid x_0^i),$$
where $L$ is the sequence length. This means each token is masked independently. The conditional probability for each token $x_0^i$ to become $x_t^i$ at time $t$ is given by:
$$q_{t|0}(x_t^i \mid x_0^i) = \begin{cases} 1 - t, & x_t^i = x_0^i, \\ t, & x_t^i = \text{M}, \end{cases}$$
where $x_0^i$ is the $i$-th token of the original sequence, $x_t^i$ is the $i$-th token of the masked sequence at time $t$, and M represents the special mask token. This formula indicates that each token either remains unchanged with probability $1 - t$ or is replaced by the mask token with probability $t$. This linear masking probability means that as $t$ increases, more tokens are expected to be masked.
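To make the forward process concrete, here is a minimal PyTorch sketch; the tensor shapes and the mask-token id are illustrative assumptions, not LLaDA's actual implementation:

```python
import torch

def forward_mask(x0: torch.Tensor, t: torch.Tensor, mask_id: int) -> torch.Tensor:
    """Sample x_t ~ q_{t|0}(. | x_0): each token of x0 is independently replaced
    by the mask token with probability t (one t per sequence)."""
    # x0: (batch, seq_len) token ids; t: (batch,) masking ratios in [0, 1]
    masked = torch.rand(x0.shape, device=x0.device) < t.unsqueeze(1)
    return torch.where(masked, torch.full_like(x0, mask_id), x0)

# usage: t ~ U[0, 1] per sequence, as in pre-training
x0 = torch.randint(0, 1000, (2, 16))
xt = forward_mask(x0, torch.rand(2), mask_id=1000)
```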
4.2.1.2. Reverse Process
The reverse process is what LLaDA learns. It aims to recover the original data distribution by iteratively unmasking tokens, moving from a fully masked state ($t = 1$) back to an unmasked state ($t = 0$). The approximate reverse process is parameterized by a mask predictor $p_\theta$.

The conditional distribution for the reverse process, moving from time $t$ to $s$ (where $0 \le s < t \le 1$), is also factorized across tokens:
$$q_{s|t}(x_s \mid x_t) = \prod_{i=1}^{L} q_{s|t}(x_s^i \mid x_t).$$
The conditional probability for each token transition is defined as:
$$q_{s|t}(x_s^i \mid x_t) = \begin{cases} 1, & x_t^i \ne \text{M},\ x_s^i = x_t^i, \\ \dfrac{s}{t}, & x_t^i = \text{M},\ x_s^i = \text{M}, \\ \dfrac{t - s}{t}\, q_{0|t}(x_s^i \mid x_t), & x_t^i = \text{M},\ x_s^i \ne \text{M}, \\ 0, & \text{otherwise}. \end{cases}$$
This formula describes how tokens are updated during a reverse step.

- If a token is not masked, it remains unchanged ($x_s^i = x_t^i$).
- If a token is masked, it either remains masked with probability $s/t$, or it is unmasked to its original value with probability $(t-s)/t$ according to the distribution $q_{0|t}(x_s^i \mid x_t)$.

The key function to estimate is $q_{0|t}(x_s^i \mid x_t)$, which predicts the original token if it was masked in $x_t$. The paper notes that this can be reparameterized as
$$q_{0|t}(x_s^i \mid x_t) = p_{\text{data}}(x_0^i \mid x_t^{\text{UM}}),$$
where $x_t^{\text{UM}}$ refers to the unmasked tokens in $x_t$. This means that estimating the data prediction function is equivalent to estimating conditional distributions on clean data, which is time-invariant. Consequently, the time $t$ does not need to be an input to the parametric model (the Transformer).
4.2.1.3. Mask Predictor and Training Objective
LLaDA uses a Transformer as its mask predictor, denoted $p_\theta(\cdot \mid x_t)$. This Transformer takes the partially masked sequence $x_t$ as input and simultaneously predicts all masked tokens. Crucially, unlike ARMs, this Transformer does not use a causal mask, allowing it to leverage bidirectional context.

The mask predictor is trained by optimizing a cross-entropy loss computed only on the masked tokens (Equation 3 in the paper):
$$\mathcal{L}(\theta) \triangleq -\mathbb{E}_{t,\, x_0,\, x_t}\left[ \frac{1}{t} \sum_{i=1}^{L} \mathbf{1}[x_t^i = \text{M}] \log p_\theta(x_0^i \mid x_t) \right]$$

Here:

- $\mathcal{L}(\theta)$ is the loss function for model parameters $\theta$.
- $\mathbb{E}_{t, x_0, x_t}$ denotes the expectation over $t$ (uniformly sampled from $[0, 1]$), $x_0$ (sampled from the data), and $x_t$ (sampled from the forward process given $x_0$).
- $L$ is the sequence length.
- $\mathbf{1}[x_t^i = \text{M}]$ is an indicator function, which is 1 if the $i$-th token in $x_t$ is masked and 0 otherwise. This ensures the loss is calculated only for masked tokens.
- $\log p_\theta(x_0^i \mid x_t)$ is the log-probability assigned by the model to the original token $x_0^i$ given the masked sequence $x_t$.
- The term $1/t$ is a weighting factor, which is crucial for the theoretical link to maximum likelihood estimation.

This loss function has been proven to be an upper bound on the negative log-likelihood of the model distribution:
$$-\mathbb{E}_{p_{\text{data}}(x_0)}\left[\log p_\theta(x_0)\right] \le \mathcal{L}(\theta).$$
This theoretical guarantee makes LLaDA a principled generative model capable of in-context learning and instruction-following, similar to ARMs.
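A minimal sketch of a Monte Carlo estimate of this loss, assuming a generic bidirectional model that maps token ids to per-position logits (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def llada_pretrain_loss(model, x0, mask_id):
    """Monte Carlo estimate of Eq. (3):
    -E[ (1/t) * sum_i 1[x_t^i = M] * log p_theta(x_0^i | x_t) ]."""
    b, L = x0.shape
    t = torch.rand(b, device=x0.device)                           # t ~ U[0, 1] per sequence
    masked = torch.rand(b, L, device=x0.device) < t.unsqueeze(1)  # each token masked w.p. t
    xt = torch.where(masked, torch.full_like(x0, mask_id), x0)

    logits = model(xt)                                            # (b, L, vocab); no causal mask
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, x0.unsqueeze(-1)).squeeze(-1)    # log p_theta(x_0^i | x_t)

    per_seq = (masked * token_logp).sum(dim=1) / t.clamp_min(1e-3)
    return -per_seq.mean()
```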
4.2.2. Pre-training
The pre-training phase for LLaDA follows standard practices in LLM development but adapted for the MDM framework.
- Model Architecture: LLaDA utilizes a Transformer [7] as its mask predictor. Unlike ARMs, it does not use a causal mask, because its formulation allows it to see the entire input when making predictions. The model uses RMSNorm [105] for stability, SwiGLU [106] as the activation function, and RoPE [107] for positional encoding. (Refer to Table 5 in the Experimental Setup for detailed architectural configurations.)
- Data: The model was pre-trained on a massive dataset of 2.3 trillion (T) tokens, comparable to other LLMs (e.g., [25, 26]). This dataset comprises diverse sources (web, books, academic papers, social media, encyclopedias, math, code) and languages (approximately 11% Chinese, 61% English, 28% code). Low-quality content was filtered using manual rules and LLM-based approaches.
- Training Process:
  - For each training sequence $x_0$, a time step $t$ is randomly sampled uniformly from $[0, 1]$.
  - Each token in $x_0$ is then independently masked with probability $t$ to create the partially masked sequence $x_t$ (as illustrated in Figure 2a).
  - The model (Transformer) takes $x_t$ as input and predicts the original tokens for all masked positions.
  - The cross-entropy loss (Eq. 3) is estimated using the Monte Carlo method for stochastic gradient descent training.
  - To enhance robustness to variable-length inputs, 1% of the pre-training sequences are set to a random length uniformly sampled from $[1, 4096]$; the fixed sequence length for training is 4096 tokens.
  - Computational Cost: Pre-training consumed 0.13 million H800 GPU hours, comparable to ARMs of the same scale and dataset size.
- Optimization:
  - An AdamW optimizer [29] with a weight decay of 0.1 was used.
  - A Warmup-Stable-Decay learning rate scheduler [28] was employed: linear warmup from 0 to the peak learning rate over the first 2000 iterations; held at the peak rate until 1.2T tokens were processed; decayed to a lower constant rate for the next 0.8T tokens; then linearly reduced further for the final 0.3T tokens.
  - Batch size: 1280 global, 4 local per GPU.
  - The 8B model was trained once without hyperparameter tuning.
4.2.3. Supervised Fine-Tuning (SFT)
After pre-training, LLaDA undergoes supervised fine-tuning (SFT) to improve its instruction-following capabilities. This phase adapts the model to generate a response $r_0$ given a prompt $p_0$, effectively modeling the conditional distribution $p_\theta(r_0 \mid p_0)$.

- Data: The SFT dataset consists of 4.5 million (prompt, response) pairs, including 1 million human-annotated and 3.5 million synthetic samples. The data spans various domains such as code, mathematics, and general instruction-following.
- Masking Strategy for SFT:
  - For an input pair $(p_0, r_0)$, the prompt $p_0$ remains unchanged (unmasked).
  - Only the tokens in the response $r_0$ are independently masked with probability $t$ to form $r_t$ (Figure 2b).
  - The concatenated sequence of $p_0$ and $r_t$ is fed to the pre-trained mask predictor.
- SFT Loss Function: The loss is computed only on the masked tokens within the response (see the sketch at the end of this subsection):
  $$-\mathbb{E}_{t,\, p_0,\, r_0,\, r_t}\left[ \frac{1}{t} \sum_{i=1}^{L'} \mathbf{1}[r_t^i = \text{M}] \log p_\theta(r_0^i \mid p_0, r_t) \right],$$
  where $L'$ is the length of the response $r_0$. This objective is fully compatible with pre-training, since $p_0$ concatenated with $r_0$ can be seen as clean data $x_0$, and $p_0$ concatenated with $r_t$ as its masked version $x_t$, with masks restricted to the response part.
- Handling Response Lengths:
  - EOS (end-of-sequence) tokens are appended to shorter pairs in a mini-batch to ensure uniform sequence lengths.
  - These EOS tokens are treated as normal tokens during training (masked and included in the loss) but removed during sampling. This allows LLaDA to automatically control response length.
- Multi-turn Dialogue: A multi-turn dialogue is treated as a set of single-turn pairs whose prompts carry the preceding turns: $(p_0^0, r_0^0)$, $(p_0^0 r_0^0 p_0^1, r_0^1)$, and so on. This allows LLaDA to learn multi-turn dialogue capabilities.
- Optimization:
  - An AdamW optimizer with a weight decay of 0.1 was used.
  - Learning rate schedule: linear warmup from 0 to the peak rate over the first 50 iterations, held constant, then linearly reduced to a lower rate for the final 10% of iterations.
  - Global batch size: 256; local batch size: 2 per GPU.
  - The SFT experiment was executed once without hyperparameter tuning.
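For the SFT variant, a minimal sketch under the same assumptions as the earlier pre-training sketch (a bidirectional model returning per-position logits; tensor shapes and the mask-token id are illustrative):

```python
import torch
import torch.nn.functional as F

def llada_sft_loss(model, prompt, response, mask_id):
    """SFT variant of the masked-diffusion loss (Eq. 5): the prompt stays clean;
    only response tokens are masked and scored."""
    b, Lr = response.shape
    t = torch.rand(b, device=response.device)
    masked = torch.rand(b, Lr, device=response.device) < t.unsqueeze(1)
    r_t = torch.where(masked, torch.full_like(response, mask_id), response)

    x = torch.cat([prompt, r_t], dim=1)           # prompt || masked response
    logits = model(x)[:, prompt.shape[1]:, :]     # logits over response positions
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, response.unsqueeze(-1)).squeeze(-1)

    per_seq = (masked * token_logp).sum(dim=1) / t.clamp_min(1e-3)
    return -per_seq.mean()
```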
4.2.4. Inference
LLaDA can generate new text and evaluate likelihood in a diffusion manner, different from left-to-right autoregressive generation.
4.2.4.1. Reverse Generation Process (Diffusion Sampling)
- Starting Point: Given a prompt $p_0$, the process begins with a fully masked response sequence $r_1$ of a specified length (which is a hyperparameter).
- Iterative Unmasking: The reverse process is discretized into a number of sampling steps, moving from $t = 1$ down to $t = 0$. At each step from time $t$ to $s$ (where $s < t$):
  - The model's mask predictor (Transformer) takes the prompt $p_0$ and the current partially masked response $r_t$ as input.
  - It predicts the original tokens for all masked positions simultaneously.
  - Remasking Strategy: A fraction of the predicted tokens are then remasked to obtain $r_s$. This ensures the transition aligns with the forward process.
    - Purely Random Remasking: In principle, this remasking should be purely random.
    - Low-Confidence Remasking: Inspired by annealed sampling in LLMs [4, 30] and MaskGIT [23], LLaDA adopts a low-confidence remasking strategy: the predicted tokens with the lowest confidence (i.e., the lowest probability assigned by the model to its prediction) are chosen to be remasked. This helps improve sample quality. (See Algorithm 5 in Appendix A.3.)
- Termination: The process continues until $t = 0$ (fully unmasked). Tokens appearing after an EOS token are discarded.
- Flexibility: The total number of sampling steps is a hyperparameter, offering a trade-off between efficiency and sample quality. Generation length is also a hyperparameter, but the model is insensitive to it due to variable-length training (Appendix B.5).
- Flexible Sampling beyond Pure Diffusion: LLaDA can also support autoregressive and block diffusion [31] sampling directly, without further training. However, pure diffusion sampling typically yields the best performance.
4.2.4.2. Conditional Log-likelihood Evaluation
For evaluating the conditional likelihood of a response $r_0$ given a prompt $p_0$, LLaDA uses an equivalent form of the loss function (Eq. 14 in Appendix A.2) that exhibits lower variance than directly using Eq. (5). This form is:
$$-\mathbb{E}_{l,\, r_0,\, r_l}\left[ \frac{L}{l} \sum_{i=1}^{L} \mathbf{1}[r_l^i = \text{M}] \log p_\theta(r_0^i \mid p_0, r_l) \right]$$
Here:

- $L$ is the sequence length of $r_0$.
- $l$ is uniformly sampled from $\{1, 2, \dots, L\}$ (the number of tokens to mask).
- $r_l$ is obtained by uniformly sampling $l$ tokens from $r_0$ without replacement for masking. This deterministically masks exactly $l$ tokens, reducing variance compared to the probabilistic masking of Eq. (5).
- This can be seen as a conditional version of the pre-training objective. Monte Carlo estimation is used to approximate this expectation.
4.2.4.3. Connection to Any-order Autoregressive Models (AO-ARM)
The paper notes that the training objective (Eq. 12) is equivalent to that of Any-order Autoregressive Models (AO-ARMs) [59, 101, 102], which characterize the joint distribution by considering all possible orders of variable dependencies. An AO-ARM minimizes the expected negative log-likelihood over a uniform distribution of all orders $\pi$:
$$-\mathbb{E}_{x_0,\, \pi}\left[ \sum_{i=1}^{L} \log p_\theta\!\left(x_0^{\pi(i)} \mid x_0^{\pi(<i)};\, \pi\right) \right]$$
This connection intuitively explains LLaDA's bidirectional reasoning capabilities, as it implicitly learns dependencies regardless of token order.
4.2.4.4. Classifier-free Guidance (CFG)
LLaDA is compatible with classifier-free guidance (CFG) [37, 27], a technique to improve generation quality by balancing adherence to a prompt (alignment) and diversity. CFG modifies the mask predictor for inference:
$$\tilde{p}_\theta(r_0 \mid p_0, r_t) \propto \frac{p_\theta(r_0 \mid p_0, r_t)^{1+w}}{p_\theta(r_0 \mid m, r_t)^{w}}$$
Here:

- $\tilde{p}_\theta(r_0 \mid p_0, r_t)$ is the modified probability distribution used for sampling.
- $p_\theta(r_0 \mid p_0, r_t)$ is the original model prediction given the prompt $p_0$.
- $p_\theta(r_0 \mid m, r_t)$ is the model prediction given a mask sequence $m$ (same length as $p_0$), representing an "unconditional" generation.
- $w$ is a tunable hyperparameter controlling the strength of the guidance from the prompt.

The paper states that CFG is not used in the main results to ensure a fair comparison with ARMs, but shows that it consistently improves LLaDA's performance.
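A small sketch of how such guidance can be applied at inference, assuming the same generic mask predictor as in the earlier sketches (the combination is done in log space; this is an illustration, not the paper's exact implementation):

```python
import torch

@torch.no_grad()
def cfg_log_probs(model, prompt, r_t, mask_id, w=1.0):
    """Classifier-free guidance for the mask predictor (unnormalized log-probs):
    log p~ = (1 + w) * log p(. | p_0, r_t) - w * log p(. | m, r_t),
    where m is a fully masked prompt of the same length as p_0."""
    Lp = prompt.shape[1]
    cond = torch.log_softmax(model(torch.cat([prompt, r_t], dim=1))[:, Lp:, :], dim=-1)
    uncond_prompt = torch.full_like(prompt, mask_id)
    uncond = torch.log_softmax(model(torch.cat([uncond_prompt, r_t], dim=1))[:, Lp:, :], dim=-1)
    return (1 + w) * cond - w * uncond
```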
4.2.5. Algorithms
The paper provides detailed algorithms for training and inference.
Algorithm 1 Pre-training of LLaDA
Require: mask predictor $p_\theta$, data distribution $p_{\text{data}}$
1: repeat
2: $x_0 \sim p_{\text{data}}$  # with a probability of 1%, the sequence length of $x_0$ follows U[1, 4096]
3: $t \sim U[0, 1]$
4: $x_t \sim q_{t|0}(x_t \mid x_0)$  # $q_{t|0}$ is defined in Eq. (7)
5: Calculate $\hat{\mathcal{L}} = -\frac{1}{t L} \sum_{i=1}^{L} \mathbf{1}[x_t^i = \text{M}] \log p_\theta(x_0^i \mid x_t)$  # $L$ is the sequence length of $x_0$
6: Calculate $\nabla_\theta \hat{\mathcal{L}}$ and run optimizer.
7: until Converged
8: Return $p_\theta$
- Explanation: This algorithm outlines the pre-training loop. It repeatedly samples an original sequence ($x_0$), a masking ratio ($t$), applies the forward masking process to get $x_t$, computes the cross-entropy loss (Eq. 3) on the masked tokens, and updates the model parameters using stochastic gradient descent. The special handling of variable sequence lengths (1% of sequences with random length) is also noted.
Algorithm 2 Supervised Fine-Tuning of LLaDA
Require: mask predictor $p_\theta$, pair data distribution $p_{\text{data}}(p_0, r_0)$
1: repeat
2: $(p_0, r_0) \sim p_{\text{data}}(p_0, r_0)$  # please refer to Appendix B.1 for details about the SFT data
3: $t \sim U[0, 1]$
4: $r_t \sim q_{t|0}(r_t \mid r_0)$  # $q_{t|0}$ is defined in Eq. (7)
5: Calculate $\hat{\mathcal{L}} = -\frac{1}{t L'} \sum_{i=1}^{L'} \mathbf{1}[r_t^i = \text{M}] \log p_\theta(r_0^i \mid p_0, r_t)$  # $L'$ is the sequence length of $r_0$
6: Calculate $\nabla_\theta \hat{\mathcal{L}}$ and run optimizer.
7: until Converged
8: Return $p_\theta$
- Explanation: This algorithm details the SFT loop. It samples (prompt, response) pairs $(p_0, r_0)$, applies masking only to the response (to get $r_t$), then computes the cross-entropy loss (Eq. 5) using the prompt and the masked response to predict the original response tokens. Model parameters are updated based on this loss.
Algorithm 3 Conditional Log-likelihood Evaluation of LLaDA
Require: mask predictor $p_\theta$, prompt $p_0$, response $r_0$, the number of Monte Carlo estimations $N_{\text{MC}}$
1: log_likelihood $\leftarrow 0$
2: for $n \leftarrow 1$ to $N_{\text{MC}}$ do
3: $l \sim U\{1, 2, \dots, L\}$  # $L$ is the sequence length of $r_0$
4: Obtain $r_l$ by uniformly sampling $l$ tokens from $r_0$ without replacement for masking
5: log_likelihood $\leftarrow$ log_likelihood $+ \frac{L}{l} \sum_{i=1}^{L} \mathbf{1}[r_l^i = \text{M}] \log p_\theta(r_0^i \mid p_0, r_l)$
6: end for
7: log_likelihood $\leftarrow$ log_likelihood $/ N_{\text{MC}}$
8: Return log_likelihood
- Explanation: This algorithm calculates the conditional log-likelihood of a response $r_0$ given a prompt $p_0$. It uses Monte Carlo estimation with $N_{\text{MC}}$ samples. In each sample, $l$ tokens are randomly chosen from $r_0$ and masked to create $r_l$. The model then predicts these masked tokens, and their log-probabilities are aggregated, weighted by $L/l$. The final result is the average over all Monte Carlo samples. This uses the lower-variance formulation from Eq. (6).
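A compact sketch of this estimator, under the same generic-model assumption as the earlier sketches (1-D prompt/response id tensors are assumed for simplicity):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def conditional_log_likelihood(model, prompt, response, mask_id, n_mc=128):
    """Monte Carlo estimate of log p_theta(r_0 | p_0) via the lower-variance
    estimator (Eq. 6 / Algorithm 3): mask exactly l ~ U{1..L} response tokens."""
    L = response.shape[0]            # response: (L,) token ids; prompt: (Lp,) token ids
    total = 0.0
    for _ in range(n_mc):
        l = torch.randint(1, L + 1, (1,)).item()
        idx = torch.randperm(L)[:l]                      # positions to mask
        r_l = response.clone()
        r_l[idx] = mask_id
        x = torch.cat([prompt, r_l]).unsqueeze(0)        # (1, Lp + L)
        logits = model(x)[0, prompt.shape[0]:, :]        # logits over response positions
        logp = F.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, response.unsqueeze(-1)).squeeze(-1)
        total += (L / l) * token_logp[idx].sum().item()
    return total / n_mc
```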
Algorithm 4 Reverse Generation Process (Random Remasking)
Require: mask predictor $p_\theta$, prompt $p_0$, answer length $L$, sampling steps $N$
1: Set $r_t$ to a fully masked sequence of length $L$ (i.e., $t = 1$).
2: for $t \leftarrow 1$ down to $\frac{1}{N}$, with step $\frac{1}{N}$ do
3: $s \leftarrow t - \frac{1}{N}$
4: $r_0 \leftarrow \arg\max_{r_0} p_\theta(r_0 \mid p_0, r_t)$  # we employ greedy sampling when predicting masked tokens
5: for $i \leftarrow 1$ to $L$ do
6: if $r_t^i \ne \text{M}$ then
7: $r_0^i \leftarrow r_t^i$
8: else
9: with probability $\frac{s}{t}$, $r_0^i$ is set to M
10: end if
11: end for
12: $r_s \leftarrow r_0$
13: end for
14: Return $r_0$
- Explanation: This algorithm describes pure diffusion sampling using random remasking. It starts with a fully masked sequence $r_1$. In each step, it predicts the original tokens ($r_0$) from the current masked sequence $r_t$ and the prompt $p_0$. Then it remasks, uniformly at random, a fraction $s/t$ of the newly predicted tokens back to M; tokens that were already unmasked in $r_t$ remain unmasked. This iterative process gradually refines the sequence until $t = 0$.
Algorithm 5 Reverse Generation Process (Low-confidence Remasking)
Require: mask predictor $p_\theta$, prompt $p_0$, answer length $L$, sampling steps $N$
1: Set $r_t$ to a fully masked sequence of length $L$ (i.e., $t = 1$).
2: for $t \leftarrow 1$ down to $\frac{1}{N}$, with step $\frac{1}{N}$ do
3: $s \leftarrow t - \frac{1}{N}$
4: for $i \leftarrow 1$ to $L$ do
5: if $r_t^i \ne \text{M}$ then
6: $r_0^i \leftarrow r_t^i$, $c_i \leftarrow 1$
7: else
8: $r_0^i \leftarrow \arg\max_{r_0^i} p_\theta(r_0^i \mid p_0, r_t)$
9: $c_i \leftarrow p_\theta(r_0^i \mid p_0, r_t)$
10: end if
11: end for
12: $n_{\text{un}} \leftarrow \lfloor L(1 - s) \rfloor$  # the number of unmasked tokens is $\lfloor L(1-s) \rfloor$ at timestep $s$
13: for $i \leftarrow 1$ to $L$ do
14: if $c_i$ is among the $L - n_{\text{un}}$ lowest confidence scores then
15: $r_0^i \leftarrow \text{M}$  # the positions with the least confidence are selected for remasking
16: end if
17: end for
18: $r_s \leftarrow r_0$
19: end for
20: Return $r_0$
- Explanation: This algorithm is similar to Algorithm 4 but uses low-confidence remasking. Instead of remasking at random, it records a confidence score $c_i$ (the predicted probability of the chosen token) for each position and remasks the positions with the lowest confidence, so that the number of unmasked tokens matches the target count $\lfloor L(1 - s) \rfloor$ for timestep $s$. This heuristic retains high-confidence predictions while re-evaluating low-confidence ones in subsequent steps.
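A condensed sketch of this sampler, under the same generic-model assumption (greedy prediction plus confidence-based remasking; batch size 1 for simplicity):

```python
import torch

@torch.no_grad()
def sample_low_confidence(model, prompt, answer_len, steps, mask_id):
    """Sketch of Algorithm 5: start fully masked, greedily fill all masked
    positions each step, then remask the lowest-confidence predictions so
    that floor(L * (1 - s)) tokens stay unmasked at timestep s."""
    L = answer_len
    r = torch.full((1, L), mask_id, dtype=torch.long)
    for step in range(steps, 0, -1):
        t, s = step / steps, (step - 1) / steps
        logits = model(torch.cat([prompt, r], dim=1))[:, prompt.shape[1]:, :]
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)                  # greedy prediction + confidence
        was_masked = r.eq(mask_id)
        r0 = torch.where(was_masked, pred, r)           # keep already-unmasked tokens
        conf = torch.where(was_masked, conf, torch.ones_like(conf))
        n_unmasked = int(L * (1 - s))                   # target unmasked count at time s
        n_remask = L - n_unmasked
        if n_remask > 0:
            remask_idx = conf.topk(n_remask, largest=False).indices
            r0.scatter_(1, remask_idx, mask_id)
        r = r0
    return r
```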
5. Experimental Setup
5.1. Datasets
The paper describes the datasets used for both pre-training and Supervised Fine-Tuning (SFT).
- Pre-training Corpus:
  - Scale: Comprises 2.3 trillion (T) tokens, a standard scale for modern LLMs.
  - Sources: Diverse publicly available sources, including:
    - Web data
    - Books
    - Academic papers
    - Social media
    - Encyclopedias
    - Mathematics
    - Code
  - Languages: Approximately 11% Chinese, 61% English, and 28% code.
  - Cleaning: Involved PDF text extraction, deduplication, and harmful-content filtering. A BERT model fine-tuned for automated data quality annotation was used to select higher-quality samples.
  - Processing: Documents were concatenated and segmented into fixed-length sequences (4096 tokens). As mentioned in the methodology, 1% of the pre-training data had sequence lengths randomly sampled from [1, 4096] to improve variable-length handling.
- Supervised Fine-Tuning (SFT) Dataset:
  - Scale: Consists of 4.5 million (M) (prompt, response) pairs.
  - Composition: 1 million human-annotated samples and 3.5 million synthetic samples, generated using methods similar to those in Xu et al. [103] and Wei et al. [104].
  - Domains: Spans multiple domains, including code, mathematics, and instruction-following.
  - Processing:
    - EOS tokens are appended to the end of shorter pairs within each mini-batch to ensure uniform sequence lengths. These are treated as normal tokens during training but removed during sampling.
    - Multi-turn dialogues are partitioned into single-turn pairs for training (e.g., a 3-turn dialogue (P0, R0, P1, R1, P2, R2) is treated as three pairs: (P0, R0), (P0 R0 P1, R1), (P0 R0 P1 R1 P2, R2)); see the sketch at the end of this subsection.

The paper does not provide explicit examples of raw data samples, but the SFT examples in the case studies (e.g., multi-turn dialogue in Tables 3, 15, and 16, and poem completion in Table 17) implicitly show the (prompt, response) format. For instance:
Multi-turn Dialogue Example from Table 3:
User: Could you please share with me what the initial two lines of the renowned poem 'The Road Not Taken'?
LLaDA: Certainly! The initial two lines of the renowned poem "The Road Not Taken" by Robert Frost are: "Two roads diverged in a yellow wood, And sorry I could not travel both." These lines set the stage for the poem, introducing the idea of a speaker facing a choice between two paths.
These datasets are standard for LLM training and are effective for validating general language capabilities, mathematical reasoning, code generation, and multilingual understanding.
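A tiny sketch of the multi-turn partitioning described above (plain strings stand in for token sequences; the helper name is hypothetical):

```python
def split_multiturn(dialogue):
    """Partition a multi-turn dialogue [(P0, R0), (P1, R1), ...] into single-turn
    SFT pairs whose prompt carries the full preceding history:
    (P0, R0), (P0 R0 P1, R1), (P0 R0 P1 R1 P2, R2), ..."""
    pairs, history = [], ""
    for prompt, response in dialogue:
        pairs.append((history + prompt, response))
        history += prompt + response
    return pairs

# usage: a 3-turn dialogue becomes three (prompt, response) training pairs
print(split_multiturn([("P0", "R0"), ("P1", "R1"), ("P2", "R2")]))
# [('P0', 'R0'), ('P0R0P1', 'R1'), ('P0R0P1R1P2', 'R2')]
```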
5.2. Evaluation Metrics
The paper evaluates LLaDA across a comprehensive suite of benchmarks, covering general tasks, mathematics & science, code, and Chinese language understanding. The evaluation protocols follow existing LLM assessment practices, primarily using conditional likelihood estimation and conditional generation.
Here's a breakdown of the metrics, with conceptual definitions and, where necessary, supplemental formulas:
5.2.1. General Tasks
- MMLU (Massive Multitask Language Understanding) [110]:
  - Concept: Measures an LLM's knowledge and problem-solving abilities across 57 diverse subjects, ranging from humanities and social sciences to STEM fields. It uses multiple-choice questions.
  - Metric: Accuracy.
  - Formula: $\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$
  - Symbols:
    - Number of correct predictions: The count of questions where the model selects the correct multiple-choice option.
    - Total number of predictions: The total number of multiple-choice questions in the dataset.
- BBH (Big-Bench Hard) [111]:
  - Concept: A challenging subset of the Big-Bench benchmark, designed to evaluate LLMs on complex reasoning tasks that often require multi-step inference, common sense, or domain-specific knowledge, areas where LLMs typically struggle.
  - Metric: Accuracy or Exact Match (EM), depending on the subtask.
  - Formula: Same as MMLU for Accuracy. For EM, the score is 1 if the generated answer is identical to the reference and 0 otherwise.
- ARC-C (AI2 Reasoning Challenge - Challenge Set) [112]:
  - Concept: A challenging dataset of science questions designed to test models' ability to answer questions that require multi-hop reasoning, common sense, and background knowledge, not just surface-level text understanding. The "Challenge Set" specifically includes questions that are difficult for information-retrieval systems.
  - Metric: Accuracy.
  - Formula: Same as MMLU.
- Hellaswag [113]:
  - Concept: A large-scale adversarial dataset for commonsense reasoning. Given a sentence, the model must choose the most plausible continuation from several options. It is designed to be difficult for models that rely on superficial cues.
  - Metric: Accuracy.
  - Formula: Same as MMLU.
- TruthfulQA [114]:
  - Concept: A benchmark that measures whether language models are truthful in generating answers to questions, even when common human misconceptions or falsehoods might lead them astray.
  - Metric: Truthfulness (measured by BLEURT or human evaluation, but often simplified to Accuracy for multiple-choice tasks). For this paper, it is likely Accuracy on the multiple-choice format.
  - Formula: Same as MMLU.
- WinoGrande [115]:
  - Concept: An adversarial Winograd Schema Challenge dataset for commonsense reasoning. It tests pronoun resolution, where the model must identify the antecedent of a pronoun in a sentence based on common sense. Designed to avoid statistical biases.
  - Metric: Accuracy.
  - Formula: Same as MMLU.
- PIQA (Physical Commonsense in Natural Language) [116]:
  - Concept: A dataset designed to evaluate LLMs on physical commonsense knowledge. Given a goal (e.g., "wash the dishes"), the model must choose the most plausible plan of action from two options.
  - Metric: Accuracy.
  - Formula: Same as MMLU.
5.2.2. Mathematics & Science
- GSM8K (Grade School Math 8K) [117]:
  - Concept: A dataset of 8,500 grade-school math word problems. It requires multi-step reasoning to derive the correct numerical answer.
  - Metric: Exact Match (EM) accuracy of the final numerical answer.
  - Formula: $\text{EM} = \frac{\text{Number of problems with an exactly correct final answer}}{\text{Total number of problems}}$
  - Symbols:
    - Number of problems with an exactly correct final answer: Count of problems where the model's generated final answer (after extraction) perfectly matches the ground truth.
    - Total number of problems: Total count of problems in the GSM8K dataset.
- Math [118]:
  - Concept: A dataset for mathematical problem-solving across various difficulty levels and mathematical domains (e.g., algebra, geometry, number theory). Problems are typically text-based and require detailed reasoning.
  - Metric: Exact Match (EM) accuracy of the final numerical or symbolic answer.
  - Formula: Same as GSM8K.
- GPQA (Graduate-level Google-Proof Q&A Benchmark) [119]:
  - Concept: A highly challenging Q&A dataset composed of graduate-level questions from various scientific fields, designed to be "Google-proof" (i.e., not easily answerable by simple web search). Requires deep understanding and reasoning.
  - Metric: Accuracy.
  - Formula: Same as MMLU.
5.2.3. Code Generation
- HumanEval [120]:
  - Concept: A dataset for code generation. It consists of handwritten programming problems, each with a function signature, docstring, and unit tests. Models must generate the function body that passes the tests.
  - Metric: Pass@k, the percentage of problems where at least one of $k$ generated solutions passes all unit tests. The paper reports Pass@1 (i.e., $k = 1$ unless specified otherwise).
  - Formula (Pass@k): $\text{Pass@}k = \frac{1}{N} \sum_{j=1}^{N} \mathbf{1}\left[\exists\, i \in \{1, \dots, k\} : c_{j,i} = T_j\right]$
  - Symbols:
    - $N$: Total number of problems.
    - $\mathbf{1}[\cdot]$: Indicator function (1 if the condition is true, 0 otherwise).
    - $c_{j,i}$: Number of unit tests passed by the $i$-th candidate solution for problem $j$.
    - $T_j$: Total number of unit tests for problem $j$.
    - For Pass@1 ($k = 1$) this simplifies to the fraction of problems whose single generated solution passes all unit tests.
- HumanEval-FIM (Fill-in-the-Middle) [121]:
  - Concept: A variant of HumanEval where models are given a code snippet with a missing middle part, and they must fill in the blank. This tests the model's ability to reason bidirectionally within code.
  - Metric: Pass@k (typically Pass@1).
  - Formula: Same as HumanEval.
- MBPP (Mostly Basic Python Problems) [122]:
  - Concept: A dataset of approximately 1,000 Python programming problems, often simpler than HumanEval, designed to evaluate basic programming skills. Each problem includes a prompt, a solution, and three unit tests.
  - Metric: Pass@k (typically Pass@1).
  - Formula: Same as HumanEval.
5.2.4. Chinese Understanding
- CMMLU (Chinese Massive Multitask Language Understanding) [123]:
  - Concept: The Chinese counterpart to MMLU, designed to evaluate LLMs on a wide range of knowledge and reasoning tasks in Chinese across diverse subjects.
  - Metric: Accuracy.
  - Formula: Same as MMLU.
- C-Eval [124]:
  - Concept: A comprehensive Chinese evaluation suite for foundation models, covering various academic disciplines and difficulty levels. It is designed to be challenging and representative of real-world Chinese knowledge and reasoning.
  - Metric: Accuracy.
  - Formula: Same as MMLU.
5.2.5. Evaluation Protocol
- Base Models:
  - Conditional likelihood estimation: Used for MMLU, CMMLU, C-Eval, ARC-C, Hellaswag, TruthfulQA, WinoGrande, PIQA, and GPQA. For single-token likelihoods (e.g., MMLU), one Monte Carlo (MC) estimate is used; for the others, 128 MC samples are used (based on Eq. 6).
  - Conditional generation: Used for BBH, GSM8K, Math, HumanEval, HumanEval-FIM, and MBPP.
- Instruct Models: All benchmarks are evaluated using conditional generation.
- Evaluation Frameworks:
  - lm-evaluation-harness [125]: Used for most base-model evaluations, except HumanEval-FIM.
  - Internal evaluation library: Used for HumanEval-FIM on base models and for all evaluations of instruct models, as lm-evaluation-harness showed greater deviation from LLaMA3's reported results.
- Generation Parameters:
  - Pure diffusion sampling with low-confidence remasking is applied to both Base and Instruct models.
  - Base models: Generation length and sampling steps are set to 1024.
  - Instruct models: Sampling steps equal the answer length, which varies per task (e.g., 3 for MMLU, 64 for GPQA, 256 for MBPP, 512 for HumanEval/GSM8K/Math/ARC-C).
  - For HumanEval, MBPP, GSM8K, Math, and GPQA with Instruct models, the confidence of the EOS token is set to zero during sampling to prevent premature termination, especially given the heavy EOS padding in the SFT data.
5.3. Baselines
The paper compares LLaDA against several state-of-the-art autoregressive LLMs of similar scale.
- Self-constructed ARM Baselines: ARM 1B and ARM 7B were trained by the authors with identical architectures, data, and configurations as LLaDA (where possible) to provide a direct comparison, especially for the scalability analysis. The 7B ARM baseline had been trained previously.
- Leading LLMs:
  - LLaMA3 8B [6]: A highly performant, open-source LLM from Meta, representing a strong competitor in the 8B parameter range.
  - LLaMA2 7B [21]: The predecessor to LLaMA3, also an open-source model.
  - Qwen2 7B [25]: An ARM from Alibaba Cloud, known for its strong performance across various tasks.
  - Qwen2.5 7B [26]: An improved version of Qwen2.
  - Mistral 7B [33]: A highly efficient and performant 7B ARM.
  - Deepseek 7B [32]: Another competitive open-source LLM.
  - Gemma2 9B [26] (for Instruct-model comparison): Google's LLM suite.
  - GPT-4o (for the reversal-curse comparison): OpenAI's state-of-the-art proprietary model.

These baselines were chosen because they are widely recognized, publicly available (mostly open-source), and of comparable parameter scale (7B-9B), allowing for a fair assessment of LLaDA's capabilities against established ARMs. Some representative LLMs (LLaMA3 8B, LLaMA2 7B) were re-evaluated using the authors' internal implementation for more direct comparison.
5.4. Model Architecture Details
The Transformer architecture employed for LLaDA and the ARM baselines is similar to that of LLaMA [6, 21] with specific enhancements:
- RMSNorm [105]: Root Mean Square Layer Normalization is used to stabilize training.
- SwiGLU [106]: The Swish-Gated Linear Unit is used as the activation function within the feed-forward networks (FFNs) to enhance non-linearity.
- RoPE [107]: Rotary Position Embedding is integrated for more expressive positional encoding, which helps models handle long sequences.

The paper provides a detailed architectural comparison in Table 5.
The following are the results from Table 5 of the original paper:
| | Our ARM Baseline 1B | LLaDA 1B | Our ARM Baseline 7B | LLaDA 8B | LLaMA3 8B |
| Layers | 22 | 22 | 28 | 32 | 32 |
| Model dimension | 2048 | 2048 | 4096 | 4096 | 4096 |
| Attention heads | 32 | 32 | 32 | 32 | 32 |
| Vocabulary size | 126,464 | 126,464 | 126,464 | 126,464 | 128,000 |
| FFN dimension | 5634 | 5634 | 13,440 | 12,288 | 14,336 |
| Key/Value heads | 4 | 4 | 8 | 32 | 8 |
| Total parameters | 1.49 B | 1.49 B | 6.83 B | 8.02 B | 8.03 B |
| Non-embedding parameters | 0.97 B | 0.97 B | 5.80 B | 6.98 B | 6.98 B |
- Key Differences:
  - For the 1B models, LLaDA and the ARM baseline have identical architectures.
  - For the larger models, LLaDA 8B uses 32 key/value heads compared to LLaMA3 8B's 8. This is because LLaDA uses vanilla multi-head attention and is incompatible with KV caching (which grouped-query attention [24] is designed to exploit). The larger number of key/value heads in LLaDA means its attention layers have more parameters, so the FFN dimension is slightly smaller (12,288 vs. 14,336 in LLaMA3) to keep the total parameter count comparable.
  - The vocabulary size also differs slightly (126,464 for LLaDA vs. 128,000 for LLaMA3) due to a tokenizer adapted to LLaDA's specific data.
5.5. Training Details
- Optimizer: AdamW [29] with a weight decay of 0.1 for all LLaDA and ARM baselines.
- Learning Rate Scheduler: Warmup-Stable-Decay [28].
  - LLaDA 1B & 8B, ARM 1B: batch size 1280.
  - ARM 7B: batch size 4224, with the maximum learning rate selected via grid search.
  - LLaDA 8B specific schedule: Detailed in Section 4.2.2 (linear warmup, stable phase, then two decay steps).
- Computational Cost Calculation: The training FLOPs (floating point operations) shown in Figure 3 are calculated using the widely used $6ND$ formulation [108, 109], where:
  - $N$: Number of non-embedding parameters.
  - $D$: Total number of training tokens.
6. Results & Analysis
6.1. Scalability of LLaDA on Language Tasks
The paper first investigates LLaDA's scalability by comparing it against ARM baselines trained on the same data. The evaluation focuses on six standard tasks: MMLU, CMMLU, ARC-C, PIQA, GSM8K, and HumanEval. The pre-training computational cost (FLOPs) serves as the unified scaling metric.
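As a quick sanity check of how the FLOPs axis is obtained, the 6ND rule from Section 5.5, combined with the 6.98B non-embedding parameters from Table 5, reproduces the largest LLaDA 8B entry of Table 18:

```python
# Training FLOPs via the 6*N*D rule (Section 5.5), checked against the last
# LLaDA 8B row of Table 18: N = 6.98e9 non-embedding params, D = 1975.52e9 tokens.
N = 6.98e9
D = 1975.52e9
print(f"{6 * N * D:.2e}")  # ~8.27e+22, matching the 8.27e22 FLOPs reported in Table 18
```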
The following are the results from Figure 3 of the original paper:
Figure 3 (chart): performance of LLaDA and the autoregressive baseline on six tasks as pre-training FLOPs increase. The horizontal axis is FLOPs and the vertical axes are the evaluation metrics for each task; LLaDA shows strong scalability, with overall performance comparable to the autoregressive models.
As shown in Figure 3, LLaDA demonstrates impressive scalability, with its overall performance trend being highly competitive with ARMs.
- Stronger Scalability on Specific Tasks: On tasks like MMLU (Massive Multitask Language Understanding) and GSM8K (Grade School Math 8K), LLaDA exhibits even stronger scalability, often surpassing the ARM baselines as FLOPs increase.
- Narrowing Gap: For tasks where ARMs initially perform better, such as PIQA (Physical Commonsense in Natural Language), the performance gap between LLaDA and ARMs narrows considerably as the models scale up.
- Hypothesis for Performance Gain: The authors hypothesize that LLaDA's gains on certain benchmarks stem from its architectural difference: ARMs optimize only left-to-right conditional probabilities, while LLaDA is trained to consider multiple conditioning directions (due to its masked diffusion nature and lack of a causal mask). This offers greater flexibility and potentially leads to better generalization, especially for tasks requiring bidirectional reasoning. This hypothesis is supported by LLaDA's strong performance on reversal reasoning tasks.

The paper also addresses previous findings by Nie et al. [27], which suggested that MDMs might require 16 times more computation than ARMs for the same likelihood. The current work clarifies that likelihood is an indirect metric for downstream performance, and that diffusion models optimize a bound on the likelihood, making direct comparisons complex. Furthermore, LLaDA extends the scaling range to roughly $10^{23}$ FLOPs, demonstrating capabilities at a much larger scale than previously explored.
The detailed numerical results corresponding to Figure 3 are provided in Table 18 and Table 19 in the Appendix. I will transcribe them here for completeness.
The following are the results from Table 18 of the original paper:
| Model | Training Tokens | FLOPs | MMLU | CMMLU | ARC-C | PIQA | GSM8K | HumanEval |
| LLaDA 1B | 37.75B | 2.20e20 | 25.52 | 25.95 | 25.17 | 59.41 | 1.82 | 0.00 |
| LLaDA 1B | 88.08B | 5.13e20 | 27.11 | 26.52 | 26.96 | 61.86 | 3.03 | 1.83 |
| LLaDA 1B | 138.41B | 8.06e20 | 29.32 | 27.13 | 30.20 | 63.38 | 2.35 | 0.00 |
| LLaDA 1B | 239.08B | 1.39e21 | 31.48 | 30.77 | 27.99 | 63.11 | 3.26 | 1.22 |
| LLaDA 1B | 352.32B | 2.05e21 | 35.86 | 34.35 | 31.31 | 65.34 | 3.64 | 3.05 |
| LLaDA 1B | 461.37B | 2.69e21 | 31.86 | 30.98 | 30.12 | 65.51 | 2.35 | 0.61 |
| LLaDA 8B | 62.91B | 2.63e21 | 32.22 | 28.50 | 30.20 | 63.82 | 3.87 | 2.44 |
| LLaDA 8B | 125.83B | 5.27e21 | 33.39 | 33.90 | 34.64 | 66.54 | 8.72 | 3.66 |
| LLaDA 8B | 251.66B | 1.05e22 | 42.84 | 40.59 | 40.10 | 69.04 | 15.31 | 3.66 |
| LLaDA 8B | 377.49B | 1.58e22 | 45.11 | 43.99 | 39.25 | 68.61 | 25.40 | 9.76 |
| LLaDA 8B | 503.32B | 2.11e22 | 43.57 | 41.38 | 42.06 | 70.24 | 27.52 | 9.76 |
| LLaDA 8B | 629.14B | 2.63e22 | 48.80 | 47.13 | 42.24 | 72.09 | 30.10 | 12.80 |
| LLaDA 8B | 679.48B | 2.85e22 | 49.61 | 48.19 | 41.30 | 70.84 | 26.31 | 8.54 |
| LLaDA 8B | 792.72B | 3.31e22 | 50.88 | 49.01 | 42.58 | 70.51 | 31.99 | 6.10 |
| LLaDA 8B | 981.47B | 4.11e22 | 49.47 | 48.10 | 40.27 | 71.38 | - | 6.10 |
| LLaDA 8B | 1107.30B | 4.64e22 | 51.13 | 47.57 | 41.13 | 69.26 | 36.69 | 10.37 |
| LLaDA 8B | 1233.13B | 5.16e22 | 50.52 | 49.72 | 45.05 | 71.49 | 38.97 | 9.76 |
| LLaDA 8B | 1358.95B | 5.69e22 | 54.61 | 53.97 | 49.40 | 74.05 | 48.14 | 17.68 |
| LLaDA 8B | 1547.70B | 6.48e22 | 57.38 | 56.04 | 49.49 | 74.59 | 53.30 | 20.73 |
| LLaDA 8B | 1975.52B | 8.27e22 | 58.52 | 57.87 | 50.68 | 75.35 | - | 19.51 |
The following are the results from Table 19 of the original paper:
| Model | Training Tokens | FLOPs | MMLU | CMMLU | ARC-C | PIQA | GSM8K | HumanEval |
| ARM 1B | 37.75B | 2.20e20 | 25.47 | 25.38 | 30.20 | 67.36 | 2.20 | 4.88 |
| ARM 1B | 88.08B | 5.13e20 | 24.67 | 25.23 | 33.96 | 70.02 | 7.51 | 10.37 |
| ARM 1B | 138.41B | 8.06e20 | 29.25 | 27.48 | 33.45 | 70.29 | 8.34 | 9.76 |
| ARM 7B | 17.30B | 6.02e20 | 26.92 | 25.18 | 21.02 | 57.18 | 1.29 | 1.22 |
| ARM 7B | 34.60B | 1.20e21 | 25.83 | 25.38 | 24.07 | 62.84 | 1.59 | 2.44 |
| ARM 7B | 86.50B | 3.01e21 | 24.41 | 24.90 | 25.42 | 71.11 | 2.88 | 7.93 |
| ARM 7B | 173.02B | 6.02e21 | 26.20 | 24.78 | 26.10 | 74.27 | 6.67 | 9.15 |
| ARM 7B | 207.62B | 7.23e21 | 30.36 | 28.86 | 31.86 | 74.48 | 8.57 | 12.80 |
| ARM 7B | 224.92B | 7.83e21 | 29.49 | 32.26 | 31.19 | 74.37 | 8.95 | 8.54 |
| ARM 7B | 242.22B | 8.43e21 | 33.62 | 31.38 | 34.92 | 75.41 | 10.84 | 9.15 |
| ARM 7B | 259.52B | 9.03e21 | 34.11 | 34.20 | 32.88 | 75.19 | 9.33 | 10.98 |
| ARM 7B | 311.43B | 1.08e22 | 35.66 | 35.49 | 36.61 | 75.14 | 11.30 | 10.37 |
| ARM 7B | 363.33B | 1.26e22 | 34.54 | 37.67 | 34.58 | 76.55 | 12.28 | 14.02 |
| ARM 7B | 415.24B | 1.45e22 | 35.37 | 38.37 | 35.25 | 76.39 | 14.40 | 12.80 |
| ARM 7B | 449.84B | 1.57e22 | 39.51 | 39.24 | 34.92 | 76.82 | 14.94 | 14.63 |
| ARM 7B | 519.09B | 1.81e22 | 40.30 | 40.69 | 37.29 | 77.15 | 14.03 | 14.63 |
| ARM 7B | 778.57B | 2.71e22 | 43.33 | 43.50 | 38.31 | 77.53 | 17.59 | 14.63 |
| ARM 7B | 1038.09B | 3.61e22 | 45.06 | 46.12 | 41.69 | 77.86 | 20.02 | 15.85 |
| ARM 7B | 1384.12B | 4.82e22 | 47.63 | 48.18 | 47.80 | 76.93 | 22.82 | 15.24 |
| ARM 7B | 2076.18B | 7.23e22 | 47.68 | 50.85 | 44.07 | 77.37 | 24.79 | 14.63 |
| ARM 7B | 2214.59B | 7.71e22 | 49.26 | 52.08 | 53.56 | 77.69 | 27.37 | 17.07 |
6.2. Benchmark Results (Pre-trained)
To evaluate in-context learning and instruction-following, LLaDA 8B Base is compared against existing LLMs of similar scale. The evaluation protocols were standardized, and some representative LLMs (LLaMA3 8B, LLaMA2 7B) were re-evaluated using the authors' implementation for direct comparison.
The following are the results from Table 1 of the original paper:
| | LLaDA 8B* | LLaMA3 8B* | LLaMA2 7B* | Qwen2 7B† | Qwen2.5 7B† | Mistral 7B† | Deepseek 7B‡ |
| Model | Diffusion | AR | AR | AR | AR | AR | AR |
| Training tokens | 2.3T | 15T | 2T | 7T | 18T | - | 2T |
| General Tasks | |||||||
| MMLU | 65.9 (5) | 65.4 (5) | 45.9 (5) | 70.3 (5) | 74.2 (5) | 64.2 (5) | 48.2 (5) |
| BBH | 49.7 (3) | 62.1 (3) | 39.4 (3) | 62.3 (3) | 70.4 (3) | 56.1 (3) | 39.5 (3) |
| ARC-C | 45.9 (0) | 53.1 (0) | 46.3 (0) | 60.6 (25) | 63.7 (25) | 60.0 (25) | 48.1 (0) |
| Hellaswag | 70.5 (0) | 79.1 (0) | 76.0 (0) | 80.7 (10) | 80.2 (10) | 83.3 (10) | 75.4 (0) |
| TruthfulQA | 46.1 (0) | 44.0 (0) | 39.0 (0) | 54.2 (0) | 56.4 (0) | 42.2 (0) | - |
| WinoGrande | 74.8 (5) | 77.3 (5) | 72.5 (5) | 77.0 (5) | 75.9 (5) | 78.4 (5) | 70.5 (0) |
| PIQA | 73.6 (0) | 80.6 (0) | 79.1 (0) | - | - | - | 79.2 (0) |
| Mathematics & Science | |||||||
| GSM8K | 70.3 (4) | 48.7 (4) | 13.1 (4) | 80.2 (4) | 85.4 (4) | 36.2 (4) | 17.4 (8) |
| Math | 31.4 (4) | 16.0 (4) | 4.3 (4) | 43.5 (4) | 49.8 (4) | 10.2 (4) | 6.0 (4) |
| GPQA | 25.2 (5) | 25.9 (5) | 25.7 (5) | 30.8 (5) | 36.4 (5) | 24.7 (5) | - |
| Code | |||||||
| HumanEval | 35.4 (0) | 34.8 (0) | 12.8 (0) | 51.2 (0) | 57.9 (0) | 29.3 (0) | 26.2 (0) |
| HumanEval-FIM | 73.8 (2) | 73.3 (2) | 26.9 (2) | ||||
| MBPP | 40.0 (4) | 48.8 (4) | 23.2 (4) | 64.2 (0) | 74.9 (0) | 51.1 (0) | 39.0 (3) |
| Chinese | |||||||
| CMMLU | 69.9 (5) | 50.7 (5) | 32.5 (5) | 83.9 (5) | 47.2 (5) | ||
| C-Eval | 70.5 (5) | 51.7 (5) | 34.0 (5) | 83.2 (5) | 45.0 (5) | ||
- Overall Performance: LLaDA 8B Base, trained on 2.3T tokens, demonstrates remarkable performance. It surpasses LLaMA2 7B Base (trained on 2T tokens) on nearly all tasks. More importantly, it is overall competitive with LLaMA3 8B Base (trained on 15T tokens), despite LLaMA3 being trained on significantly more data.
- Strengths in Math and Chinese: LLaDA shows distinct advantages in mathematics (GSM8K, Math) and Chinese language understanding (CMMLU, C-Eval). For example, on GSM8K, LLaDA scores 70.3 compared to LLaMA3's 48.7 and LLaMA2's 13.1. Similarly, on CMMLU, LLaDA achieves 69.9, significantly higher than LLaMA3's 50.7 and LLaMA2's 32.5.
- Reasons for Performance Differences: The authors conjecture that differences in data quality and distribution (especially given the closed-source nature of some LLM datasets) contribute to LLaDA's varying performance across tasks.
- Ruling out Data Leakage: To ensure LLaDA's superior performance on GSM8K is not due to data leakage, the authors tested it on iGSM [34], an unseen, synthetic GSM8K-like task. The results (Table 12) confirm LLaDA's advantages on unseen mathematical problems, further validating its mathematical reasoning capabilities.
6.3. Benchmark Results (Post-trained)
The SFT (Supervised Fine-Tuning) procedure further enhances LLaDA's abilities. This section compares LLaDA 8B Instruct with other post-trained LLMs, noting that most competitors also employ reinforcement learning (RL) alignment, which LLaDA currently lacks.
The following are the results from Table 2 of the original paper:
| | LLaDA 8B* | LLaMA3 8B* | LLaMA2 7B* | Qwen2 7B† | Qwen2.5 7B† | Gemma2 9B† | Deepseek 7B‡ |
| Model | Diffusion | AR | AR | AR | AR | AR | AR |
| Training tokens | 2.3T | 15T | 2T | 7T | 18T | 8T | 2T |
| Post-training | SFT | SFT+RL | SFT+RL | SFT+RL | SFT+RL | SFT+RL | SFT+RL |
| Alignment pairs | 4.5M | - | - | 0.5M + - | 1M + 0.15M | - | 1.5M + - |
| General Tasks |||||||
| MMLU | 65.5 (5) | 68.4 (5) | 44.1 (5) | - | 49.4 (0) | ||
| MMLU-pro | 37.0 (0) | 41.9 (0) | 4.6 (0) | 44.1 (5) | 56.3 (5) | 52.1 (5) | |
| Hellaswag | 74.6 (0) | 75.5 (0) | 51.5 (0) | 68.5 (-) | |||
| ARC-C | 88.5 (0) | 82.4 (0) | 57.3 (0) | - | - | 49.4 (-) | |
| Mathematics & Science | |||||||
| GSM8K | 69.4 (4) | 78.3 (4) | 29.0 (4) | 85.7 (0) | 91.6 (0) | 76.7 (0) | 63.0 (0) |
| Math | 31.9 (0) | 29.6 (0) | 3.8 (0) | 52.9 (0) | 75.5 (0) | 44.3 (0) | 15.8 (0) |
| GPQA | 33.3 (5) | 31.9 (5) | 28.4 (5) | 34.3 (0) | 36.4 (0) | 32.8 (0) | |
| Code | |||||||
| HumanEval | 49.4 (0) | 59.8 (0) | 16.5 (0) | 79.9 (0) | 84.8 (0) | 68.9 (0) | 48.2 (-) |
| MBPP | 41.0 (4) | 57.6 (4) | 20.6 (4) | 67.2 (-) | 79.2 (0) | 74.9 (0) | 35.2 (-) |
- SFT Impact: SFT generally improves LLaDA's performance across most downstream tasks. For instance, ARC-C jumps from 45.9 to 88.5, and HumanEval from 35.4 to 49.4.
- Declines: A few metrics, such as MMLU, show slight declines (65.9 to 65.5), which the authors attribute to the suboptimal quality of the SFT data.
- Comparison with RL-aligned Models: Despite using only SFT (while other models such as LLaMA3, Qwen2, Gemma2, and Deepseek incorporate SFT + Reinforcement Learning (RL) alignment), LLaDA 8B Instruct is competitive. The gaps with LLaMA3 8B Instruct are noticeable (e.g., GSM8K: 69.4 vs 78.3; HumanEval: 49.4 vs 59.8) but often remain small, especially considering the lack of RL alignment in LLaDA. This demonstrates the impressive instruction-following abilities achieved purely through SFT with a diffusion model.
- Sampling Strategy Nuances: The paper notes that all results in this section (including Tables 1 and 2) are based on pure diffusion sampling. For LLaDA 8B Instruct, block diffusion style sampling (a variant of diffusion sampling) actually performs better on GSM8K (78.6 vs 69.4) and Math (42.2 vs 31.9) than the default pure diffusion sampling. This is attributed to the heavy EOS padding in the SFT data, which can cause premature termination under the low-confidence remasking used by default in pure diffusion sampling.
6.4. Reversal Reasoning and Analyses
The reversal curse [15] is a known limitation of ARMs, where models struggle to generalize from "A is B" to "B is A". LLaDA's bidirectional modeling is hypothesized to address this. The authors evaluate this on a poem completion task using 496 Chinese poem sentence pairs, where models must generate either the subsequent (forward) or preceding (reversal) line.
The following are the results from Table 4 of the original paper:
| | Forward | Reversal |
| GPT-4o (2024-08-06) | 82.7 | 34.3 |
| Qwen2.5-7B Instruct | 75.9 | 38.0 |
| LLaDA-8B Instruct | 51.8 | 45.6 |
- Addressing the Reversal Curse: As shown in Table 4, LLaDA effectively addresses the reversal curse. It demonstrates consistent zero-shot performance across both the forward (51.8) and reversal (45.6) tasks, with only a small gap between them.
- Comparison with ARMs: In contrast, GPT-4o and Qwen2.5-7B Instruct exhibit significant gaps between forward and reversal performance. GPT-4o scores 82.7 on forward but drops to 34.3 on reversal; Qwen2.5 scores 75.9 on forward and 38.0 on reversal.
- Superiority in the Reversal Task: While GPT-4o and Qwen2.5 perform strongly on forward generation (benefiting from larger datasets and computational resources), LLaDA outperforms both by a clear margin on the reversal task (45.6 vs 34.3 for GPT-4o and 38.0 for Qwen2.5).
- Intuition: The authors suggest that LLaDA's inherently bidirectional nature, which treats tokens uniformly without a left-to-right inductive bias, leads to this balanced performance. The connection to Any-order Autoregressive Models (AO-ARMs) (Eq. 15) in Appendix A.2 further supports this, indicating that LLaDA implicitly learns dependencies irrespective of token order; a schematic form of this objective is sketched below.

An example of a poem completion task from Appendix B.9 (the prompt line is not recoverable from the OCR): Answer: 梅须逊雪三分白，雪却输梅一段香。(Plum blossoms yield three points of whiteness to snow, but snow loses a segment of fragrance to plum blossoms.)
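For reference, the AO-ARM view mentioned above can be written schematically. This is a hedged sketch of the objective (the precise statement is Eq. 15 in Appendix A.2 of the paper), with σ denoting a uniformly random permutation of the L token positions:

```latex
% Hedged schematic of the any-order autoregressive objective: in expectation,
% every token is predicted given tokens that precede it under a random
% ordering, so no single left-to-right order is privileged.
\[
\mathbb{E}_{\sigma}\!\left[
    \sum_{i=1}^{L}
    \log p_\theta\!\left( x_0^{\,\sigma(i)} \,\middle|\, x_0^{\,\sigma(<i)} \right)
\right]
\]
```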
6.5. Ablation on Classifier-free Guidance
Classifier-free guidance (CFG) [37, 27] is a technique used in diffusion models to improve generation quality by balancing alignment with the prompt against sample diversity. Although CFG is not used in the main results (to keep the comparison with ARMs fair), the paper shows that LLaDA is compatible with, and benefits from, CFG.
The following are the results from Table 6 of the original paper:
| | ARC-C | Hellaswag | TruthfulQA | WinoGrande | GPQA | PIQA |
| w/o CFG | 45.9 | 70.5 | 46.1 | 74.8 | 25.2 | 73.6 |
| w/ CFG | 47.9 | 72.5 | 46.4 | 74.8 | 26.1 | 74.4 |
As seen in Table 6, applying CFG consistently improves the performance of LLaDA 8B Base across the tested benchmarks (ARC-C, HellaSwag, TruthfulQA, WinoGrande, GPQA, PIQA). For instance, ARC-C increases from 45.9 to 47.9, and HellaSwag from 70.5 to 72.5. This confirms that CFG is a valuable plug-and-play technique for LLaDA to further enhance its generation quality.
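For intuition, CFG typically runs the mask predictor twice per step, once with the real prompt and once with the prompt replaced by mask tokens, and extrapolates toward the prompt-conditioned prediction. Below is a minimal sketch in logit space; the names (`cond_logits`, `uncond_logits`, guidance scale `w`) are illustrative and not the paper's actual code:

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               w: float = 1.0) -> torch.Tensor:
    """Classifier-free guidance combined in logit space.

    cond_logits:   logits from a pass with the real prompt, [seq_len, vocab].
    uncond_logits: logits from a pass with the prompt masked out.
    w:             guidance scale; w = 0 recovers the plain conditional prediction.
    """
    # Equivalent, up to normalization, to p_cond^(1+w) / p_uncond^w.
    return (1.0 + w) * cond_logits - w * uncond_logits
```

With w = 0 this reduces to ordinary conditional sampling, which corresponds to the "w/o CFG" row reported in Table 6.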
6.6. Ablation on Sampling Strategies
LLaDA supports flexible sampling strategies, including autoregressive, block diffusion, and pure diffusion sampling. Ablation studies are conducted on LLaDA 8B Base and LLaDA 8B Instruct to evaluate these.
The following are the results from Table 7 of the original paper:
| Sampling Strategy | Block Length | BBH | GSM8K | Math | HumanEval | MBPP |
| Autoregressive | - | 38.1 | 63.1 | 23.6 | 18.3 | 33.4 |
| Block Diffusion | L′ = 2 | 37.3 | 62.6 | 25.2 | 14.6 | 33.6 |
| Block Diffusion | L′ = 4 | 40.0 | 65.7 | 26.6 | 15.9 | 36.0 |
| Block Diffusion | L′ = 8 | 42.0 | 68.2 | 27.7 | 19.5 | 39.2 |
| Block Diffusion | L′ = 32 | 45.7 | 68.6 | 29.7 | 29.9 | 37.4 |
| Block Diffusion LLaDA | L′ = 2 | 48.0 | 70.0 | 30.8 | 26.2 | 40.0 |
| Block Diffusion LLaDA | L′ = 4 | 48.5 | 70.3 | 31.3 | 27.4 | 38.8 |
| Block Diffusion LLaDA | L′ = 8 | 48.6 | 70.2 | 30.9 | 31.1 | 39.0 |
| Block Diffusion LLaDA | L′ = 32 | 48.3 | 70.3 | 31.2 | 32.3 | 40.0 |
| Pure Diffusion | - | 49.7 | 70.3 | 31.4 | 35.4 | 40.0 |
Table 7 (for LLaDA 8B Base) shows:
- Block Length Impact: For block diffusion sampling, performance generally improves as the block length L′ increases (e.g., BBH rises from 37.3 at L′ = 2 to 45.7 at L′ = 32).
- Block Diffusion vs. Autoregressive: Block diffusion sampling consistently outperforms autoregressive sampling.
- Block Diffusion LLaDA vs. Standard Block Diffusion: Block diffusion LLaDA (also called semi-autoregressive remasking) further improves upon standard block diffusion sampling. For example, on BBH at L′ = 32, block diffusion scores 45.7 while block diffusion LLaDA scores 48.3.
- Pure Diffusion: Pure diffusion sampling achieves the best overall performance (e.g., BBH 49.7, GSM8K 70.3, HumanEval 35.4).

The following are the results from Table 8 of the original paper:
| | GSM8K | Math | HumanEval | MBPP | GPQA | MMLU-Pro | ARC-C |
| Autoregressive | 0 | 9.5 | 0 | 0 | 0 | 0 | 84.4 |
| Block Diffusion | 24.6 | 23.5 | 17.1 | 21.2 | 29.3 | 32.5 | 88.1 |
| Block Diffusion LLaDA | 77.5 | 42.2 | 46.3 | 34.2 | 31.3 | 34.8 | 85.4 |
| Pure Diffusion | 69.4 | 31.9 | 49.4 | 41.0 | 33.3 | 37.0 | 88.5 |
Table 8 (for LLaDA 8B Instruct):
- Autoregressive Sampling for the Instruct Model: Autoregressive sampling performs very poorly for the Instruct model, often scoring 0 or near 0. Because the SFT data consists of complete sentences padded with EOS tokens, the model tends to generate a full sentence and immediately output EOS, leading to extremely short or empty generations.
- Block Diffusion Variants: Block diffusion and block diffusion LLaDA significantly improve over autoregressive sampling. Block diffusion LLaDA shows stronger results on GSM8K (77.5), Math (42.2), and HumanEval (46.3).
- Pure Diffusion: Pure diffusion sampling still achieves the best overall performance on most tasks (e.g., HumanEval 49.4, MBPP 41.0, ARC-C 88.5).
- EOS Confidence Adjustment: For Instruct models, the paper notes a specific adjustment: the confidence score of the EOS token is set to zero during pure diffusion sampling for HumanEval, MBPP, GSM8K, Math, and GPQA. This prevents the model from prematurely ending generations due to the heavy EOS padding in the SFT data, which would otherwise degrade performance. It also explains why block diffusion LLaDA performed slightly better on GSM8K and Math (78.6 and 42.2) than pure diffusion without this EOS adjustment (69.4 and 31.9).

The following are the results from Table 9 of the original paper:
| | BBH | GSM8K | Math | HumanEval | MBPP |
| Random Remasking | 32.1 | 21.3 | 9.2 | 11.6 | 21.0 |
| Low-confidence Remasking | 45.0 | 70.0 | 30.3 | 32.9 | 40.2 |
Table 9 (Comparison of Remasking Strategies):
Low-confidence remasking consistently and significantly outperforms random remasking across all tested benchmarks for LLaDA 8B Base. For example, on BBH, low-confidence remasking yields 45.0 compared to random remasking's 32.1; on GSM8K the difference is even starker: 70.0 vs 21.3. This highlights the importance of intelligently deciding which tokens to re-mask, similar in spirit to annealed sampling strategies.
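To make the two remasking strategies concrete, here is a minimal sketch of one reverse step with low-confidence remasking, assuming a mask predictor `model` that returns per-position logits. The function, argument names, and the optional EOS-confidence flag are illustrative rather than the paper's implementation:

```python
import torch

@torch.no_grad()
def diffusion_step(model, x, mask_id, n_to_unmask,
                   eos_id=None, zero_eos_confidence=False):
    """One reverse-diffusion step with low-confidence remasking.

    x: [batch, seq_len] token ids, where masked positions hold `mask_id`.
    model(x) is assumed to return logits of shape [batch, seq_len, vocab].
    """
    logits = model(x)
    probs = logits.softmax(dim=-1)
    confidence, prediction = probs.max(dim=-1)        # both [batch, seq_len]

    still_masked = x == mask_id
    # Already-unmasked positions never compete: give them the lowest score.
    confidence = torch.where(still_masked, confidence,
                             torch.full_like(confidence, -1.0))

    # Optional Instruct-model adjustment described above: zero the confidence
    # of predicted EOS tokens so heavy EOS padding in SFT data cannot
    # prematurely truncate the generation.
    if zero_eos_confidence and eos_id is not None:
        confidence = torch.where(prediction == eos_id,
                                 torch.zeros_like(confidence), confidence)

    # Low-confidence remasking: commit only the n_to_unmask most confident
    # predictions; everything else stays masked for later steps.
    topk = confidence.topk(n_to_unmask, dim=-1).indices
    unmask = torch.zeros_like(still_masked)
    unmask.scatter_(1, topk, torch.ones_like(topk, dtype=torch.bool))
    unmask &= still_masked
    return torch.where(unmask, prediction, x)
```

Pure diffusion sampling applies such a step repeatedly over the full answer, while block diffusion applies it block by block from left to right; random remasking would replace the top-k selection with a uniformly random choice of masked positions.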
6.7. Ablation on Generated Length
The generation length is a user-specified hyperparameter in LLaDA. This ablation studies its impact on performance using LLaDA 8B Base.
The following are the results from Table 10 of the original paper:
| Length | BBH | GSM8K | Math | HumanEval | MBPP |
| 256 | 45.0 | 70.0 | 30.3 | 32.9 | 40.2 |
| 512 | 50.4 | 70.8 | 30.9 | 32.9 | 39.2 |
| 1024 | 49.7 | 70.3 | 31.4 | 35.4 | 40.0 |
Table 10 demonstrates that LLaDA's performance is relatively insensitive to the generation length hyperparameter. While there are minor fluctuations (e.g., BBH peaking at 512 length), the results remain robust across lengths of 256, 512, and 1024. This is a practical advantage, as it reduces the need for extensive tuning of this parameter and indicates that the variable-length training (Section 4.2.2) is effective.
6.8. Analysis of Sampling Efficiency
The paper analyzes LLaDA's sampling speed (throughput) and memory consumption compared to LLaMA3 8B Base.
The following are the results from Figure 5 of the original paper:
The figure consists of multi-panel line plots comparing LLaDA 8B and LLaMA3 8B on GSM8K, Math, HumanEval, and MBPP at different throughputs, illustrating the trade-off between sampling efficiency and generation speed.
Figure 5 (Analysis of Sampling Efficiency):
- Quality-Speed Trade-off: LLaDA enables a flexible trade-off between generation quality and sampling speed by adjusting the number of sampling steps. Fewer steps mean faster generation but potentially lower quality. The figure plots performance (y-axis) against throughput (x-axis) for different numbers of sampling steps (32, 64, 128, 256, corresponding to 8, 4, 2, and 1 tokens per forward pass, respectively); a small numerical illustration of this arithmetic appears at the end of this subsection.
- Throughput vs. LLaMA3:
  - On GSM8K and Math, LLaDA 8B Base achieves comparable performance to LLaMA3 8B Base while delivering 1.5 to 1.8 times higher throughput. This is notable because LLaMA3 uses a KV cache (an inference optimization) while LLaDA operates without such techniques.
  - On HumanEval, LLaDA performs comparably to LLaMA3 when their throughputs are matched.
  - On MBPP, LLaDA lags behind LLaMA3.
- KV Cache Benefit: The benefit of KV caching for LLaMA3 is weaker on HumanEval because of its relatively short prompts (average 132 tokens), compared to GSM8K (894), Math (680), and MBPP (628).

The following are the results from Table 11 of the original paper:
| Input Length | Output Length | LLaDA 8B | LLaMA3 8B w/o KV-Cache | LLaMA3 8B w/ KV-Cache |
| 512 | 512 | 17.03 | 16.70 | 16.32 |
| 512 | 1024 | 17.53 | 17.49 | 16.43 |
| 512 | 2048 | 18.52 | 20.00 | 16.73 |
| 1024 | 512 | 17.53 | 17.16 | 16.36 |
| 1024 | 1024 | 18.01 | 18.26 | 16.41 |
| 1024 | 2048 | 19.02 | 21.39 | 16.74 |
Table 11 (Memory Consumption):
- LLaDA's memory usage (in GB) is comparable to LLaMA3 8B without a KV cache.
- It is slightly higher than LLaMA3 with a KV cache.
- LLaDA's memory usage remains constant regardless of the number of sampling steps, a consequence of its non-autoregressive nature (it processes the entire sequence in parallel at each step).

The authors emphasize that the primary goal of this study is not to achieve faster generation than ARMs, but to demonstrate the promise of diffusion models at scale. They acknowledge that efficiency-oriented research (e.g., architectural innovations, distillation, and sampling algorithm design) is a promising future direction for LLaDA.
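As a small numerical illustration of the quality-speed trade-off discussed above (assuming the 256-token answer length implied by the reported step counts; the snippet is illustrative only):

```python
ANSWER_LENGTH = 256  # generation length implied by the reported step counts

for steps in (256, 128, 64, 32):
    tokens_per_pass = ANSWER_LENGTH // steps   # 1, 2, 4, 8 tokens per forward pass
    print(f"{steps:3d} sampling steps -> {steps} forward passes, "
          f"{tokens_per_pass} token(s) unmasked per pass")
```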
6.9. Evaluation on iGSM Dataset
To further validate LLaDA's mathematical capabilities and rule out any data leakage concerns (especially given GSM8K performance), the models were tested on iGSM [34], an infinite, synthetic GSM8K-like dataset. iGSM generates problems with varying solution steps based on specific rules, ensuring the dataset is unseen during training.
The following are the results from Table 12 of the original paper:
| | 4 steps | 5 steps | 6 steps |
| LLaMA3 8B Base | 38.0 | 35.0 | 34.0 |
| LLaDA 8B Base | 64.0 | 41.0 | 44.0 |
Table 12 shows LLaDA 8B Base demonstrating significant and consistent advantages over LLaMA3 8B Base on these unseen mathematical problems.
- For 4-step problems, LLaDA scores 64.0 compared to LLaMA3's 38.0.
- For 5-step problems, LLaDA scores 41.0 compared to LLaMA3's 35.0.
- For 6-step problems, LLaDA scores 44.0 compared to LLaMA3's 34.0.

These results align with the GSM8K findings in Table 1 and provide strong evidence that LLaDA possesses robust mathematical reasoning capabilities that generalize well to novel problems.
6.10. Case Studies
The paper includes several case studies to showcase LLaDA 8B Instruct's instruction-following abilities and the visualization of its sampling process.
The following are the results from Table 13 of the original paper:
The image shows LLaDA's sampling process for a prompt about the benefits of regular exercise for physical and mental health. Color intensity indicates the prediction stage: lighter tokens were predicted earlier, darker tokens later, illustrating the model's step-by-step generation mechanism.
Figure 6 (Visualization of the Block Diffusion LLaDA Sampling Process): This image illustrates the iterative generation process. Darker colors indicate tokens predicted in later sampling stages, while lighter colors correspond to earlier predictions. The example shows LLaDA progressively filling in masked tokens and refining its response to the prompt.
The following are the results from Table 14 of the original paper:
The image is a diagram from the paper showing a multi-turn dialogue generated with the random remasking strategy. In LLaDA's responses, darker colors indicate tokens predicted later in sampling and lighter colors indicate earlier predictions.
Figure 7 (Visualization of the Multi-turn Dialogue with Random Remasking): Similar to Figure 6, this image visualizes the multi-turn dialogue generation process using random remasking. It showcases LLaDA's ability to maintain context across turns and generate coherent responses, with color intensity indicating the prediction stage.
- Multi-turn Dialogue Example (Table 3 excerpt; see also Figures 6 and 7):

User: Could you please share with me what the initial two lines of the renowned poem 'The Road Not Taken'?
LLaDA: Certainly! The initial two lines of the renowned poem "The Road Not Taken" by Robert Frost are: "Two roads diverged in a yellow wood, And sorry I could not travel both." These lines set the stage for the poem, introducing the idea of a speaker facing a choice between two paths.
User: Please help me translate into Chinese.
LLaDA: Sure! The first two lines of "The Road Not Taken" by Robert Frost can be translated into Chinese as: [the Chinese text is missing in the OCR, but the model attempts the translation].
User: Please translate into German.
LLaDA: Sure! The first two lines of "The Road Not Taken" by Robert Frost can be translated into German as: "Zwei Wege trennten sich im gelben Wald, und leider konnte ich nicht beide tun."
User: Please also write a poem about life choices. I require 5 sentences, and the first word of each sentence starts with C as the first letter.
LLaDA: Certainly! Here's a poem about life choices, with each sentence starting with the letter "C": Choices cascade through life's journey, Creating connections with each decision, Careful consideration leads to better paths, Courageous choices shape our destiny, Choosing wisely...
This example from Table 3 highlights:
- Coherent and Fluent Text Generation: LLaDA generates natural-sounding, extended text in a non-autoregressive manner.
- Multi-turn Dialogue Capability: It effectively retains conversation history and produces contextually appropriate responses.
- Multilingualism: Demonstrated by the ability to translate text into Chinese and German.
- Creative Constraints: LLaDA can follow complex instructions, such as generating a poem with specific structural and alphabetical constraints.

The following are the results from Table 15 of the original paper:
Daily life
User: What's a piece of advice you've received that you've found helpful?
LLaDA: One helpful piece of advice I've received is to always be open to learning and growth. I've found that being willing to explore new ideas and experiences has helped me become more resilient and adaptable. By embracing a growth mindset, I've been able to identify...

Daily life
User: Please recommend me three famous movies
LLaDA: Certainly! Here are three famous movies that have achieved widespread popularity and critical acclaim: 1. **The Shawshank Redemption** (1994) - Inspired by a novella and directed by Frank Darabont, this film tells the story of Andy Dufresne, a man convicted of murdering his wife, dealing with his harsh life in Shawshank Prison and forming an unlikely friendship with a fellow inmate. 2. **The Godfather** (1972) - Directed by Francis Ford Coppola, this film is based on Mario Puzo's novel of the same name. It follows the story of the Corleone family, particularly Michael Corleone, who rises to power and takes over the family's criminal empire. 3. **The Empire Strikes Back** (1980) - Directed by George Lucas, this film continues the story of a young Skywalker in the Star Wars saga. It follows Luke Skywalker as he helps the Rebel Alliance fight against the Galactic Empire and destroy the powerful weapon, the Death Star. All of these films are celebrated for their cinematic achievements, powerful storytelling, and memorable characters.

Translation
User: Please help me translate into Chinese: 'What is now proved was once only imagined'
LLaDA: [the Chinese translation is missing in the OCR]

Translation
User: Please help me translate into German: 'What is now proved was once only imagined'
LLaDA: Wer heute bestätigt ist, war einst nur immaginiert.

Math
User: Emma decides to run 4 sprints 5 times a week. She runs 75 meters each sprint. How many total meters does she run a week?
LLaDA: Emma runs 4 sprints each time, so she runs a total of 4 x 5 = 20 sprints a week. Each sprint is 75 meters long, so Emma runs a total of 20 x 75 = 1500 meters a week. The final result is 1500.

Code
User: I am writing python code, I have an array, please help me sort it from largest to smallest
LLaDA: Certainly! You can sort an array from the largest to smallest in Python using the 'sort()' method or the 'sorted()' function. Here's an example: ### Using 'sort()' method: my_array = [3, 1, 4, 1, 5, 2]...
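The Python answer in the final case above is cut off in the source. For reference only, and not as the model's verbatim output, sorting a list from largest to smallest in Python looks like this:

```python
my_array = [3, 1, 4, 1, 5, 2]

# In place, using list.sort with reverse=True:
my_array.sort(reverse=True)

# Or non-destructively, using sorted():
largest_first = sorted([3, 1, 4, 1, 5, 2], reverse=True)

print(my_array)       # [5, 4, 3, 2, 1, 1]
print(largest_first)  # [5, 4, 3, 2, 1, 1]
```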
Table 15 (Single-turn Dialogue Cases): Showcases LLaDA's ability to handle various types of single-turn prompts including:
- General knowledge/advice ("What's a piece of advice...").
- Recommendations ("Please recommend me three famous movies").
- Translations ("Please help me translate into Chinese/German").
- Math word problems ("Emma decides to run 4 sprints...").
- Code generation ("I am writing python code...").

The following are the results from Table 16 of the original paper:
Daily life
User: How do I prepare for a job interview?
LLaDA: To prepare for a job interview, you should research the company and the job, practice common interview questions, and dress appropriately.
User: What are some common interview questions?
LLaDA: Some common interview questions include "Tell me about yourself" and "Why do you want to work for us?"

Daily life
User: How do I start learning programming?
LLaDA: To learn programming, you can start by choosing a programming language that you are interested in. Some popular languages include Python, Java, and C++.
User: Which programming language should I learn first?
LLaDA: It depends on your interests and goals. Python is a great language for beginners for its simplicity and versatility.
User: How long does it take to learn Python?
LLaDA: It can take anywhere from a few weeks to several months to learn Python.
User: What are some good Python projects for beginners?
LLaDA: Some good Python projects for beginners include building a simple calculator, a game, and a web application.

Math
User: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
LLaDA: Natalia sold clips to 48 friends in April. She sold half as many clips in May, so she sold 48/2 = 24 clips in May. Altogether, Natalia sold 48 + 24 = 72 clips in April and May. The final result is 72.
User: Natalia sold three times as many items in June as she did in May. How many clips did Natalia sell altogether in these three months?
LLaDA: Natalia sold clips to 48 friends in April. She sold half as many clips in May, so she sold 48/2 = 24 clips in May. Natalia sold three times as many items in June as she did in May, ...
Table 16 (Multi-turn Dialogue Cases): Reinforces LLaDA's ability to engage in extended multi-turn conversations, demonstrating context awareness and helpful responses in both general advice and multi-step math problems.
Table 17 (Poem Reversal Completion Cases): This table provides concrete examples where LLaDA 8B Instruct succeeds in reversal poem completion while GPT-4o fails. The specific content of the poems is difficult to interpret from the OCR text, but the qualitative result supports LLaDA's claimed advantage in reversal reasoning.
Overall, these case studies provide strong qualitative evidence of LLaDA's capabilities, especially its instruction-following and multi-turn dialogue skills, which are particularly impressive for a non-autoregressive model.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces LLaDA, a groundbreaking diffusion language model trained from scratch at an unprecedented scale of 8 billion parameters. LLaDA successfully demonstrates strong capabilities in scalability, in-context learning, and instruction-following, achieving performance comparable to, and in some cases surpassing, leading autoregressive LLMs such as LLaMA3. Crucially, LLaDA challenges the long-held assumption that core LLM capabilities are inherently dependent on autoregressive models by providing a robust, theoretically grounded alternative. Its unique bidirectional modeling approach enables it to effectively address limitations like the reversal curse, where it outperforms even GPT-4o. These findings mark a significant scientific innovation, opening new paradigms for language modeling and offering fresh insights into the fundamental mechanisms underpinning LLM intelligence.
7.2. Limitations & Future Work
The authors candidly acknowledge several limitations of the current work and outline promising avenues for future research:
- Fixed Generation Length: The generation length is currently a user-specified hyperparameter. While LLaDA is shown to be insensitive to this choice, an adaptive generation length mechanism would be more efficient and user-friendly.
- Computational Constraints for Direct Comparison: Because of the significant computational resources required, direct comparisons with ARMs (e.g., training an 8B ARM baseline on the exact same 2.3T-token dataset) were not fully realized. At larger scales, the ARM baselines used were of slightly different sizes or pre-trained on different data volumes, making some comparisons less perfectly controlled.
- Lack of Specialized Architectural Optimizations: LLaDA does not yet incorporate specialized architectural designs or optimizations common in ARMs, such as grouped query attention [24] or KV caching [24], which are crucial for inference efficiency in ARMs.
- Sampling Algorithm Efficiency: While LLaDA offers a flexible quality-speed trade-off in sampling, more efficient and controllable sampling algorithms [31, 78-87] for MDMs are still nascent and require further development.
- Absence of Reinforcement Learning (RL) Alignment: LLaDA has not yet undergone alignment with reinforcement learning [90, 91], which is critical for fully optimizing LLM performance and aligning with human intent (e.g., in chat models).
- Scaling and Data Volume: Despite its unprecedented scale for a diffusion language model, LLaDA's model size and training data volume remain smaller than those of the leading ARM counterparts (e.g., LLaMA3 on 15T tokens, or proprietary models like GPT-4). Further scaling is needed to fully evaluate its ultimate potential.
- Multi-modal Data: LLaDA's capability to process multi-modal data (e.g., text and images) remains unexplored.
- Integration with Advanced Techniques: Its impact on prompt tuning techniques [96] and its integration into agent-based systems [97, 98] are not yet fully understood.
- Post-training Investigation: A systematic investigation into advanced post-training methods (e.g., O1-like systems [99, 100] for improved reasoning) for diffusion language models is needed.
7.3. Personal Insights & Critique
This paper is a significant milestone in language modeling research. For a long time, the autoregressive paradigm has been the de facto standard, almost synonymous with Large Language Models. LLaDA offers compelling evidence that this assumption might be too narrow. The demonstration of a diffusion model achieving competitive performance at 8B parameters on diverse LLM benchmarks, especially in areas like reversal reasoning, is a strong validation of the underlying generative principles being more fundamental than the specific sequential generation mechanism.
Inspirations:
- The work opens up entirely new architectural and training considerations for LLMs. If diffusion models can truly parallelize generation and learn bidirectional context effectively, this could lead to more efficient inference strategies (e.g., generating entire responses in fewer, larger steps) and to models inherently robust to tasks requiring non-sequential reasoning.
- The reversal curse finding is particularly insightful. It suggests that the bidirectional nature of masked diffusion inherently addresses a limitation that ARMs struggle with even at massive scale. This might point towards hybrid architectures or training methods that combine the strengths of both paradigms.
- The potential for diffusion models to generate text in a more "holistic" or "canvas-filling" manner, rather than strictly left-to-right, could lead to novel applications in creative writing, controlled generation (e.g., filling in blanks within a larger context), or even interactive human-AI co-creation tools where parts of a text can be iteratively refined.

Critique / Areas for Improvement:
- Computational Parity for Scalability Claims: While the scalability results (Figure 3) are compelling, the authors acknowledge that direct comparisons between LLaDA and ARMs, such as training on identical datasets, were restricted to a limited computational budget, and that the larger-scale ARM baselines were of slightly different sizes. This makes it harder to definitively conclude that LLaDA scales as efficiently as ARMs at the extreme end of the current LLM scale. Training a truly identical 8B ARM baseline on 2.3T tokens would strengthen the claim.
- Inference Efficiency: The sampling speed analysis (Figure 5 and Table 11) indicates that while LLaDA offers a flexible trade-off, it does not fundamentally surpass ARMs in raw throughput while operating without KV caching. The absence of KV caching and related optimizations in LLaDA is a practical limitation for real-world deployment, and it remains to be seen how much diffusion models can close this gap. Because Transformers used for MDMs do not use causal masks, they cannot benefit from KV caching in the same way ARMs do, which suggests that novel inference optimization techniques specific to diffusion models will be crucial.
- Data Quality Impact: The authors note that SFT sometimes led to declines in performance (e.g., MMLU) due to the suboptimal quality of the SFT data. This highlights that diffusion models are likely just as sensitive to data quality as ARMs, and careful data curation remains paramount.
- Long-term Dependencies and Consistency: While bidirectional modeling is an advantage, diffusion models typically perform multiple passes over the entire sequence. Further analysis of how well LLaDA maintains long-range coherence and consistency across these passes, especially for very long generations, would be valuable.

Despite these points, LLaDA is a truly exciting development that shakes up foundational assumptions of LLM design. It paves the way for a more diverse landscape of generative language models and encourages a deeper theoretical understanding of what truly makes LLMs powerful.