OneFlowSeq: Achieving One-Step Generation for Diffusion Language Models via Lightweight Distillation

Published: 10/08/2025

TL;DR Summary

OneFlowSeq distills a multi-step diffusion teacher into a single-step generator using MeanFlow supervision and Jacobian-vector product signals, greatly accelerating inference and improving performance with 1600× fewer trainable parameters.

Abstract

Under review as a conference paper at ICLR 2026. "OneFlowSeq: Achieving One-Step Generation for Diffusion Language Models via Lightweight Distillation." Anonymous authors, paper under double-blind review. Abstract: Autoregressive models dominate Seq2Seq generation but suffer from slow, error-prone token-by-token decoding. Diffusion language models (DLMs) enable parallel refinement and global coherence, yet their iterative denoising requires hundreds of steps, limiting practicality. To address this issue, we propose OneFlowSeq, a novel framework that distills a powerful multi-step diffusion teacher (LLaDA-8B-Instruct) into a one-step generator via MeanFlow-based supervision and parameter-efficient prompt tuning. OneFlowSeq introduces a Jacobian-vector product signal that provides richer guidance than conventional distillation, allowing the student model to not only match the 128-step teacher model in terms of …

Mind Map

In-depth Reading

English Analysis

Bibliographic Information

  • Title: OneFlowSeq: Achieving One-Step Generation for Diffusion Language Models via Lightweight Distillation
  • Authors: Anonymous authors (Paper under double-blind review)
  • Journal/Conference: Published at OpenReview. OpenReview is an online platform for peer review, typically used for submitting and discussing research papers for conferences, often prior to or during the review process. This indicates the paper is undergoing peer review for a major conference (likely ICLR given the ethics statement referencing ICLR Code of Ethics), suggesting it is a serious academic submission, though its final acceptance status might be pending.
  • Publication Year: 2025
  • Abstract: The paper addresses the limitations of sequence-to-sequence (Seq2Seq) generation. While autoregressive (AR) models are dominant, they suffer from slow, token-by-token decoding. Diffusion language models (DLMs) offer parallel refinement and global coherence but are impractical due to their hundreds of iterative denoising steps. To solve this, the authors propose OneFlowSeq, a novel framework that distills a powerful multi-step diffusion teacher model (LLaDA-8B-Instruct) into a one-step generator. This distillation is achieved through MeanFlow-based supervision and parameter-efficient prompt tuning. A key innovation is the introduction of a Jacobian-vector product (JVP) signal, which provides richer guidance than conventional distillation. This signal enables the student's one-step generation to match the quality of the 128-step teacher. Experiments across paraphrasing, text simplification, and question generation benchmarks demonstrate that OneFlowSeq achieves state-of-the-art performance, reduces trainable parameters by 1600×, and offers inference speeds orders of magnitude faster than both AR and multi-step diffusion baselines. This work aims to establish one-step diffusion as a practical and scalable paradigm for Seq2Seq generation.
  • Original Source Link: https://openreview.net/forum?id=P7OzWxOUHK
  • PDF Link: https://openreview.net/pdf?id=P7OzWxOUHK
  • Publication Status: The paper is posted on OpenReview, indicating it is under double-blind review for an upcoming conference (likely ICLR 2026, given the header in the PDF).

Executive Summary

Background & Motivation (Why)

The paper addresses critical challenges in sequence-to-sequence (Seq2Seq) text generation, a fundamental task in natural language processing.

  • Problem with Autoregressive (AR) models: While dominant, AR models (like GPT-series, LLaMA) generate text token-by-token. This sequential nature leads to slow inference latency that scales linearly with sequence length, and their unidirectional context can hinder global planning, potentially leading to inconsistencies.
  • Problem with Diffusion Language Models (DLMs): DLMs offer an alternative by supporting parallel generation and holistic refinement, which is beneficial for global coherence. However, their practical application is severely limited by their iterative denoising process, often requiring hundreds or thousands of steps, making inference prohibitively slow.
  • Gaps in prior one-step diffusion work: Existing attempts at one-step diffusion, such as MeanFlow and DLM-One, face significant drawbacks. MeanFlow provides stable one-step generation but is prohibitively expensive to train large models from scratch. DLM-One accelerates inference by distilling multi-step diffusion into a single pass but requires retraining billions of parameters and relies on adversarial stabilization, shifting the inefficiency from inference to training. This creates a deadlock where one-step diffusion is either elegant but costly to train, or fast but bloated in training cost, lacking a scalable, resource-friendly solution for Seq2Seq generation.

Main Contributions / Findings (What)

The paper introduces OneFlowSeq to overcome these limitations, offering a novel framework with several key contributions:

  • Novel Framework for One-Step DLM Generation: OneFlowSeq distills a powerful multi-step diffusion teacher (LLaDA-8B-Instruct) into a one-step generator. This framework uniquely combines the theoretical stability of MeanFlow with a parameter-efficient distillation strategy.
  • Jacobian-Vector Product (JVP) Supervision Signal: A core innovation is the use of a JVP signal, which encodes not just the first-order direction but also the second-order dynamics of the teacher model. This provides richer and more stable guidance to the student model compared to conventional distillation, enabling high-fidelity one-step generation.
  • Parameter-Efficient Prompt Tuning: Instead of retraining large models, OneFlowSeq employs a lightweight, trainable Soft Prompt Module (~5M parameters) to distill the teacher's dynamics. This approach freezes the large teacher model, drastically reducing training costs and memory usage.
  • State-of-the-Art Performance: OneFlowSeq achieves state-of-the-art results on challenging Seq2Seq benchmarks including paraphrasing (QQP), text simplification (Wiki-Auto), and question generation (Quasar-T). It matches or surpasses the quality of its 128-step teacher in one-step generation.
  • Revolutionary Efficiency Gains: The framework reduces trainable parameters by approximately 1600× compared to full-model distillation of an 8B model. Furthermore, it delivers inference speeds orders of magnitude faster than both AR and multi-step diffusion baselines (over 160× faster than its teacher). This effectively resolves the long-standing trade-off between performance and inference cost.
  • Practical and Scalable Paradigm: OneFlowSeq transforms one-step diffusion from a theoretical curiosity into a practical, scalable, and resource-efficient paradigm for real-world NLP applications.

Prerequisite Knowledge & Related Work

This section outlines the foundational concepts and previous research critical to understanding OneFlowSeq.

Foundational Concepts

  • Autoregressive (AR) Models: These are language models that predict the next token in a sequence based on all preceding tokens. They generate text sequentially, token by token. Examples include GPT-2, LLaMA.
    • Seq2Seq Generation: A common paradigm in NLP where a model takes an input sequence and produces an output sequence. Examples include machine translation, summarization, and paraphrasing. AR models are widely used for Seq2Seq tasks.
  • Diffusion Models: Originally developed for image generation, diffusion models learn to reverse a gradual noising process. They start with random noise and progressively refine it into a coherent data sample (e.g., an image or text).
    • Diffusion Language Models (DLMs): An adaptation of diffusion models for text generation. Instead of continuous pixel values, they operate on token representations. They enable parallel generation (generating multiple parts of the sequence simultaneously) and global coherence (considering the entire sequence context during generation).
    • Iterative Denoising: The process in diffusion models where the model applies a denoising step repeatedly to gradually transform noise into a clear sample. This iterative nature is typically slow.
  • One-Step Diffusion: A goal in diffusion modeling to generate a high-quality sample in a single forward pass, bypassing the iterative denoising process. This promises significant speedups.
  • MeanFlow: A method that reformulates diffusion dynamics by predicting the average velocity of a flow over a time interval, rather than instantaneous velocity. This allows for stable one-step generation without distillation.
    • MeanFlow Identity: A key mathematical relationship that connects the average velocity $u$ over an interval $[r, t]$ to the instantaneous velocity $v$ at time $t$ and the derivative of $u$ with respect to $t$. This identity is central to OneFlowSeq's distillation.
  • Score Distillation: A technique where a smaller "student" model learns to mimic the "score" (gradient of the log-probability density) predicted by a larger "teacher" model. In the context of diffusion, this aims to compress the teacher's generation capabilities into the student.
  • Parameter-Efficient Fine-Tuning (PEFT): A family of techniques designed to adapt large pre-trained models to downstream tasks by tuning only a small subset of parameters, rather than the entire model. This significantly reduces computational cost and memory footprint.
    • Prompt Tuning (Prefix-tuning): A specific PEFT method where a small, trainable sequence of "soft prompt" vectors is prepended to the input embeddings (or injected into attention layers) of a frozen pre-trained model. The base model's parameters remain fixed, and only the prompt vectors are updated during training, steering the model's behavior.
  • Jacobian-Vector Product (JVP): In calculus, the Jacobian matrix contains all first-order partial derivatives of a vector-valued function. A JVP is the product of the Jacobian of a function with a vector. In machine learning, it can be efficiently computed using automatic differentiation libraries (e.g., torch.func.jvp), often corresponding to a single forward-backward pass. It provides information about the directional derivative of a function (a minimal usage sketch appears after this list).
  • Stop-Gradient Operator (sg(.)): A mechanism in deep learning frameworks that prevents gradients from flowing backward through a particular part of the computational graph during backpropagation. This is used in distillation to ensure that the teacher model's parameters are not updated and gradients only affect the student.
  • Logit Space: In machine learning, logits are the raw, unnormalized outputs of a model's final layer before applying a softmax function to obtain probabilities. Operating in logit space can be more stable for numerical computations, especially in generative models where probability distributions are involved.
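
To make the JVP operation concrete, the following is a minimal sketch using PyTorch's torch.func.jvp (available in PyTorch 2.x); the toy function f and the chosen tangent vector are illustrative assumptions, not anything from the paper.

```python
import torch
from torch.func import jvp

# Toy vector-valued function f: R^3 -> R^2 (illustrative only).
def f(x):
    return torch.stack([x.pow(2).sum(), x.sin().sum()])

x = torch.randn(3)  # primal input
v = torch.randn(3)  # tangent (direction) vector

# jvp returns (f(x), J_f(x) @ v): the function value and the directional
# derivative of f at x along v, computed with forward-mode autodiff.
value, directional_derivative = jvp(f, (x,), (v,))
print(value.shape, directional_derivative.shape)  # torch.Size([2]) torch.Size([2])
```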

Previous Works

  • LLaDA (Large Language Diffusion with mAsking): This is the teacher model used in OneFlowSeq. It's described as a Masked Diffusion Model (MDM) trained from scratch on 2.3 trillion tokens. Its generative process involves a forward masking procedure (corrupting a clean sequence by replacing tokens with [MASK]) and a learned reverse process (a bidirectional Transformer predicting original tokens from the corrupted sequence). While powerful, its iterative denoising is slow.
  • MeanFlow (Geng et al., 2025): A method that offers a principled reformulation of diffusion dynamics via average velocity, enabling stable one-step generation without distillation. However, its main drawback is the prohibitive cost of training large models from scratch using this method.
  • DLM-One (Chen et al., 2025): This work demonstrates that score distillation can compress multi-step language diffusion into a single forward pass, achieving significant inference acceleration. Its limitation, as highlighted by OneFlowSeq, is that it requires retraining billions of parameters and relies on adversarial stabilization, leading to training inefficiency.
  • DiffuSeq (Gong et al., 2023a) and DiffuSeq-v2 (Gong et al., 2023b): Early diffusion models for Seq2Seq generation. DiffuSeq introduced parallel decoding on continuous embeddings, addressing error propagation. DiffuSeq-v2 improved efficiency. These are multi-step diffusion models and serve as baselines.
  • Knowledge Distillation (Hinton et al., 2015): A general technique where a smaller student model learns from a larger, more complex teacher model. OneFlowSeq builds on this principle but applies it specifically to diffusion models with MeanFlow and PEFT.
  • Consistency Models (Song et al., 2023), Flow Matching (Lipman et al., 2023), Rectified Flows (Esser et al., 2024): These are approaches that reformulate the diffusion process to learn more direct generation trajectories, enabling faster sampling. They often require intensive training themselves. OneFlowSeq leverages the MeanFlow identity, which is related to Flow Matching.

Technological Evolution

The field of text generation has seen a progression from traditional statistical methods to neural network-based approaches. Autoregressive (AR) models like RNNs, LSTMs, and later Transformers (e.g., GPT, LLaMA) became dominant due to their ability to model complex language patterns. However, their sequential nature posed limitations regarding speed and global coherence. Diffusion Models emerged as a powerful alternative, initially in computer vision, offering parallel generation and holistic refinement. Adapting these to language tasks led to Diffusion Language Models (DLMs). While DLMs promised advantages, their iterative nature made them slow, thus sparking interest in one-step generation techniques. Early one-step methods like MeanFlow (training from scratch) and DLM-One (full-model distillation) presented their own challenges. OneFlowSeq represents the next step in this evolution, aiming to combine the best aspects: the theoretical stability of MeanFlow and the practical efficiency of PEFT-based distillation, to make one-step DLMs viable for real-world Seq2Seq applications.

Differentiation

OneFlowSeq differentiates itself from previous works primarily in its synergistic combination of MeanFlow's theoretical stability with parameter-efficient distillation:

  • vs. MeanFlow: While MeanFlow offers stable one-step generation, it requires training large models from scratch, which is computationally prohibitive for large language models. OneFlowSeq distills the MeanFlow dynamics from a pre-trained, frozen multi-step teacher, making it practical for large language models without retraining their billions of parameters.
  • vs. DLM-One: DLM-One also performs distillation but retrains billions of parameters of a student model and relies on adversarial stabilization, leading to significant training costs and complexity. OneFlowSeq, in contrast, uses parameter-efficient prompt tuning, training only a minuscule Soft Prompt Module (around 5 million parameters), a roughly 1600× reduction in trainable parameters for an 8B teacher model. This makes training significantly more efficient and accessible.
  • Novelty of JVP Signal: OneFlowSeq introduces a unique Jacobian-vector product (JVP) signal in its distillation objective. This JVP term provides a richer, second-order guidance, encoding not just the teacher's instantaneous direction but also its dynamic curvature. This deeper signal allows the student to achieve higher fidelity in one-step generation than conventional distillation methods which typically rely only on first-order matching.
  • Efficiency Frontier: By combining these elements, OneFlowSeq pushes the efficiency frontier, offering state-of-the-art generation quality at dramatically reduced training and inference costs, addressing the core limitations of both AR and prior DLM approaches.

Methodology (Core Technology & Implementation Details)

The OneFlowSeq framework is designed to distill the capabilities of a powerful, multi-step Diffusion Language Model (DLM) into a highly efficient one-step generator. This is achieved by leveraging the MeanFlow identity within a parameter-efficient distillation strategy.

Principles

The core idea is to transform a complex, iterative denoising process into a single, direct generation step. This transformation relies on:

  1. Teacher-Student Distillation: A large, pre-trained multi-step DLM (the teacher) provides the "knowledge" of high-quality generation. A smaller, lightweight component (the student) learns to mimic this knowledge.
  2. MeanFlow Dynamics: Instead of learning instantaneous velocity fields, the student learns to predict the average velocity over a time interval, as formulated by the MeanFlow identity. This mathematically allows for stable one-step prediction.
  3. Jacobian-Vector Product (JVP) Signal: To provide richer guidance, the distillation objective incorporates a JVP term. This term captures the second-order dynamics (how the velocity field itself changes), leading to a more accurate and stable approximation of the teacher's behavior.
  4. Parameter-Efficient Fine-Tuning (PEFT): To make the distillation practical for large models, only a tiny Soft Prompt Module (the student) is trained, while the vast majority of the teacher model's parameters remain frozen. This dramatically reduces computational costs.

Steps & Procedures

1. The LLaDA Teacher: A Masked Diffusion Model (MDM)

The foundation of OneFlowSeq is a frozen, pre-trained LLaDA-8B-Instruct model. LLaDA stands for "Large Language Diffusion with mAsking" and is an 8-billion parameter model.

  • Nature: LLaDA is a Masked Diffusion Model (MDM). Unlike autoregressive models that predict the next token, MDMs corrupt input sequences by masking tokens and then learn to predict the original tokens.
  • Generative Process:
    • Forward Masking Procedure: A clean sequence $\mathbf{x}_0$ is corrupted into $\mathbf{x}_t$ by independently replacing each token with a [MASK] token with probability $t \in [0, 1]$. Here, $t$ represents the "time" or corruption level.
    • Learned Reverse Process: A bidirectional Transformer model, parameterized by $\theta$, learns to predict the original tokens from the corrupted sequence $\mathbf{x}_t$.
  • Training Objective for LLaDA (Teacher), a minimal sketch of which follows this list: $$\mathcal{L}(\theta) = - \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_t} \left[ \sum_{i=1}^{L} \mathbb{I}[x_t^i = \mathrm{MASK}] \log p_{\theta}(x_0^i \mid \mathbf{x}_t) \right]$$
    • $\mathcal{L}(\theta)$: The loss function minimized to train the LLaDA model's parameters $\theta$.
    • $\mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_t}$: Expectation over randomly sampled time steps $t$, clean sequences $\mathbf{x}_0$, and their corrupted versions $\mathbf{x}_t$.
    • $\sum_{i=1}^{L}$: Summation over all $L$ tokens in the sequence.
    • $\mathbb{I}[x_t^i = \mathrm{MASK}]$: An indicator function that equals 1 if the $i$-th token of $\mathbf{x}_t$ is [MASK] and 0 otherwise, so the loss is computed only on masked tokens.
    • $\log p_{\theta}(x_0^i \mid \mathbf{x}_t)$: The log-probability that the model $p_{\theta}$ assigns to the original $i$-th token $x_0^i$ given the corrupted sequence $\mathbf{x}_t$.
  • Limitation: Despite its power, LLaDA's iterative denoising process is slow, motivating the need for one-step generation.
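
As a concrete illustration of this masked-token objective, here is a minimal PyTorch-style sketch; the mask_predictor callable, the MASK_ID constant, and the tensor shapes are assumptions for illustration rather than the authors' code, and the sketch only mirrors the loss as transcribed above.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical [MASK] token id

def llada_masked_loss(mask_predictor, x0, t):
    """Cross-entropy on masked positions only (the indicator in the loss above).

    x0: LongTensor [B, L] of clean token ids.
    t:  FloatTensor [B] of corruption levels in [0, 1].
    mask_predictor(x_t) is assumed to return logits of shape [B, L, V].
    """
    # Forward masking: each token is replaced by [MASK] with probability t.
    mask = torch.rand(x0.shape) < t[:, None]
    x_t = torch.where(mask, torch.full_like(x0, MASK_ID), x0)

    logits = mask_predictor(x_t)                          # [B, L, V]
    nll = F.cross_entropy(logits.transpose(1, 2), x0,     # per-token NLL, [B, L]
                          reduction="none")
    # Average only over masked positions, as the indicator dictates.
    return (nll * mask.float()).sum() / mask.float().sum().clamp(min=1)
```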

2. The MeanFlow Method for One-Step Generation

MeanFlow is a theoretical framework that enables one-step generation by predicting an average velocity.

  • Average Velocity $u$: Instead of learning an instantaneous velocity field $v(\mathbf{z}_t, t)$ that describes the change at a specific moment, MeanFlow learns an average velocity $u(\mathbf{z}_t, r, t)$ over a time interval $[r, t]$: $$u(\mathbf{z}_t, r, t) = \frac{1}{t-r} \int_r^t v(\mathbf{z}_\tau, \tau)\, d\tau$$
    • $u(\mathbf{z}_t, r, t)$: The average velocity field.
    • $\mathbf{z}_t$: The current state (e.g., the logit representation of the tokens) at time $t$.
    • $r, t$: The start and end times of the interval over which the average velocity is computed ($0 \leq r < t \leq 1$).
    • $v(\mathbf{z}_\tau, \tau)$: The instantaneous velocity field at time $\tau$.
  • MeanFlow Identity: This crucial identity relates the average velocity $u$ to the instantaneous velocity $v$ and the total derivative of $u$ (a short derivation is given after this list): $$u(\mathbf{z}_t, r, t) = v(\mathbf{z}_t, t) - (t-r) \frac{d}{dt} u(\mathbf{z}_t, r, t)$$
    • $\frac{d}{dt} u(\mathbf{z}_t, r, t)$: The total derivative of the average velocity $u$ with respect to time $t$. This term accounts for how the average velocity itself changes as the interval endpoint $t$ changes.
  • Total Derivative (JVP Form): The total derivative $\frac{d}{dt} u(\mathbf{z}_t, r, t)$ expands as $$\frac{d}{dt} u(\mathbf{z}_t, r, t) = \frac{\partial u}{\partial \mathbf{z}}(\mathbf{z}_t, r, t) \cdot v(\mathbf{z}_t, t) + \frac{\partial u}{\partial t}(\mathbf{z}_t, r, t),$$ which is effectively a Jacobian-vector product (JVP) along the tangent vector $(v, 0, 1)$. It can be computed efficiently using automatic differentiation libraries (e.g., torch.func.jvp) in a single forward-backward pass.
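
The identity above can be checked with a standard one-line derivation (not reproduced from the paper): differentiating the definition $(t-r)\,u(\mathbf{z}_t, r, t) = \int_r^t v(\mathbf{z}_\tau, \tau)\, d\tau$ with respect to $t$ (holding $r$ fixed) gives $$u(\mathbf{z}_t, r, t) + (t-r)\,\frac{d}{dt} u(\mathbf{z}_t, r, t) = v(\mathbf{z}_t, t),$$ which rearranges directly into the MeanFlow identity.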

3. The OneFlowSeq Framework: Distillation with PEFT

OneFlowSeq combines the MeanFlow identity with PEFT to efficiently distill the teacher.

  • Framework Overview (Figure 1):

    Figure 1: An overview of the OneFlowSeq framework. The frozen teacher model (top) is a multi-step generator that passes through a mask predictor and re-masking module to produce step-by-step instantaneous velocities, providing the instantaneous velocity target. The student model (bottom) is a one-step generator in which a soft prompt module guides the mask predictor to predict the average velocity in a single step; a distillation loss between the two velocities enables efficient one-step generation.

    The framework involves a frozen LLaDA-8B-Instruct teacher providing an instantaneous velocity target. A lightweight Soft Prompt Module (the student) guides the frozen backbone to predict the average velocity in a single step. The student is trained solely by minimizing a distillation loss between the two velocities.

  • Architectural Design:

    • Frozen Teacher: The LLaDA-8B-Instruct model's parameters are completely frozen.
    • Soft Prompt Module (Student): This is the only trainable component. It is a small multi-layer perceptron (MLP) called the Prompt Network.
      • Input: The Prompt Network takes sinusoidal time embeddings of $r$ and $t$ as input.
      • Output: It maps these embeddings to a sequence of $k$ soft prompt vectors.
      • Injection: These soft prompt vectors are injected as prefixes into the Key (K) and Value (V) sequences of every self-attention layer within the frozen LLaDA Transformer backbone. This prefix-tuning approach allows the small prompt module to steer the behavior of the large model without modifying its core weights.
      • Parameter Count: Only ~5M parameters are trainable, a tiny fraction of the 8B teacher. A minimal sketch of such a module is given after this list.
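
The sketch below illustrates such a prompt module in PyTorch. The prompt length k = 32, embedding dimension d = 4096, and 2-layer MLP with hidden width 32 follow the implementation details reported later in this analysis, but the sinusoidal-embedding helper, the time-embedding width, and the module interface are illustrative assumptions rather than the authors' code.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t, dim=128):
    """Standard sinusoidal embedding of a scalar time t in [0, 1]; t has shape [B]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = t[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)      # [B, dim]

class PromptNetwork(nn.Module):
    """2-layer MLP mapping (r, t) embeddings to k soft prompt vectors of size d.

    The k vectors are injected as key/value prefixes in every self-attention
    layer of the frozen backbone (prefix tuning); only this module is trained.
    """
    def __init__(self, k=32, d=4096, time_dim=128, hidden=32):
        super().__init__()
        self.k, self.d = k, d
        self.mlp = nn.Sequential(
            nn.Linear(2 * time_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, k * d),
        )

    def forward(self, r, t):
        emb = torch.cat([sinusoidal_embedding(r), sinusoidal_embedding(t)], dim=-1)
        return self.mlp(emb).view(-1, self.k, self.d)           # [B, k, d]

# With these sizes the module has roughly 32 * 32 * 4096 ≈ 4.3M parameters,
# on the same order as the ~5M trainable parameters reported in the paper.
```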

4. Training Objective and Procedure

The core of OneFlowSeq's training is a distillation objective based on the MeanFlow identity, enhanced by the JVP signal.

  • Representation Space: All velocity-related operations (instantaneous and average) are defined in a continuous logit space. The teacher's instantaneous velocity $v_{\mathrm{teacher}}(\mathbf{x}_t, t)$ and the student's average velocity $u_{\theta}(\mathbf{x}_t, r, t)$ both operate on the corrupted token sequence $\mathbf{x}_t$ and produce outputs in $\mathbb{R}^{L \times V}$ (logit space for sequence length $L$ and vocabulary size $V$). This ensures that vector operations such as subtraction and integration in the MeanFlow identity are well-defined.
  • Discrete Diffusion Note: Although the MeanFlow identity (Eq. 6) is derived for continuous state spaces, its application to discrete token sequences is justified. The velocity field is interpreted as governing the expected denoising trajectory in the space of token probabilities, with time $t$ representing the masking probability. Appendix A provides a rigorous justification of its validity under discrete masking diffusion (logit-space formulation), showing that the expected velocity field induced by discrete masking satisfies the MeanFlow identity.
  • Distillation Objective: The training objective transfers the teacher's dynamics to the student using the rich signal from the MeanFlow identity and its JVP term: $$\mathcal{L}_{\mathrm{distill}}(\theta) = \mathbb{E}_{\mathbf{x}_0, r, t} \left\| u_{\theta}(\mathbf{x}_t, r, t) - \mathrm{sg} \Big( v_{\mathrm{teacher}}(\mathbf{x}_t, t) - (t-r) \frac{d}{dt} u_{\theta}(\mathbf{x}_t, r, t) \Big) \right\|^2$$
    • $\mathcal{L}_{\mathrm{distill}}(\theta)$: The distillation loss minimized by updating the student's parameters $\theta$.
    • $\mathbb{E}_{\mathbf{x}_0, r, t}$: Expectation over clean sequences $\mathbf{x}_0$ and sampled time intervals $[r, t]$.
    • $u_{\theta}(\mathbf{x}_t, r, t)$: The student model's prediction of the average velocity.
    • $\mathrm{sg}(\cdot)$: The stop-gradient operator. This is crucial: it treats the target (everything inside sg) as a constant during backpropagation, preventing gradients from flowing into the frozen teacher and preventing the JVP computation from incurring costly higher-order derivatives. Gradients only update the student's parameters $\theta$ to match this target.
    • $v_{\mathrm{teacher}}(\mathbf{x}_t, t)$: The instantaneous velocity output by the frozen teacher model.
    • $(t-r) \frac{d}{dt} u_{\theta}(\mathbf{x}_t, r, t)$: The JVP term, computed exactly and efficiently via automatic differentiation (e.g., torch.func.jvp) as the directional derivative of the student network $u_{\theta}$ with respect to its inputs $(\mathbf{x}_t, r, t)$ along the tangent vector $(v_{\mathrm{teacher}}(\mathbf{x}_t, t), 0, 1)$. This term encourages the student to obey the MeanFlow identity, providing a self-consistency signal.
  • Algorithm 1: OneFlowSeq Training (Discrete Diffusion via Masking Corruption); a PyTorch-style sketch of one such step follows the implementation details below:
    1. Loop: For each training step:
    2. Sample Data: Sample a clean sequence $\mathbf{x}_0$ from the dataset.
    3. Sample Times: Sample $r \sim \mathrm{Uniform}(0, 0.8)$ and $t \sim \mathrm{Uniform}(r + 0.1, 1.0)$. These ranges keep $(t-r)$ small on average, aiding stability (as explained in Appendix B.1).
    4. Corrupt Sequence: Generate $\mathbf{x}_t$ by masking $\mathbf{x}_0$ with probability $t$ (discrete diffusion corruption).
    5. Compute Student Output: Compute $u_{\theta}(\mathbf{x}_t, r, t)$ using the student's Soft Prompt Module and the frozen LLaDA backbone.
    6. Compute Teacher Output: Compute $v_{\mathrm{teacher}}(\mathbf{x}_t, t)$ using the frozen LLaDA backbone.
    7. Compute JVP Term: Calculate $\frac{d}{dt} u_{\theta}(\mathbf{x}_t, r, t)$ using automatic differentiation.
    8. Calculate Loss & Update: Calculate the distillation loss $\mathcal{L}_{\mathrm{distill}}(\theta)$ (Eq. 9) and update the student's parameters $\theta$.
  • Implementation Details:
    • Prompt Network: A 2-layer MLP with a hidden dimension of 32, mapping time embeddings to a $k \times d$ output, where $k = 32$ (prompt length) and $d = 4096$ (embedding dimension).
    • Optimizer: AdamW.
    • Learning Rate: $\eta = 5 \times 10^{-4}$.
    • Weight Decay: 0.01.
    • Batch Size: 32.
    • Training Steps: 80,000 steps.
    • Hardware: 8 NVIDIA A100 GPUs.
    • Parameters: Only ~5M parameters trained.
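
Putting the objective and Algorithm 1 together, here is a hedged PyTorch-style sketch of a single training step. student_u (the prompt-tuned backbone predicting the average velocity) and teacher_v (the frozen teacher's instantaneous velocity in logit space) are assumed callables standing in for components the paper does not spell out, and for readability the JVP below differentiates only with respect to t; the full tangent (v, 0, 1) would also differentiate through the continuous logit input along the teacher velocity, as described above.

```python
import torch
from torch.func import jvp

def distill_step(student_u, teacher_v, mask_fn, x0, optimizer):
    """One OneFlowSeq-style training step (sketch of Algorithm 1 / Eq. 9).

    student_u(x_t, r, t) -> average-velocity logits       [B, L, V]
    teacher_v(x_t, t)    -> instantaneous-velocity logits [B, L, V] (frozen)
    mask_fn(x0, t)       -> corrupted sequence x_t (masking with prob. t)
    """
    B = x0.shape[0]
    r = torch.rand(B) * 0.8                      # r ~ U(0, 0.8)
    t = r + 0.1 + torch.rand(B) * (0.9 - r)      # t ~ U(r + 0.1, 1.0)
    x_t = mask_fn(x0, t)

    with torch.no_grad():
        v = teacher_v(x_t, t)                    # frozen teacher target

    # Student prediction (reverse-mode graph for the parameter update).
    u = student_u(x_t, r, t)

    # d/dt u along the flow via forward-mode AD; it is used only inside the
    # stop-gradient target, so no higher-order derivatives are required.
    _, dudt = jvp(lambda tt: student_u(x_t, r, tt), (t,), (torch.ones_like(t),))

    target = (v - (t - r).view(B, 1, 1) * dudt).detach()   # sg(.) in Eq. 9
    loss = ((u - target) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```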

5. Inference

All inference updates are performed in the logit space.

  • Starting Point: Inference begins with an initial logit tensor $\mathbf{z}_1$, representing a maximally corrupted sequence (e.g., a zero tensor or logits corresponding to a uniform distribution over the vocabulary).
  • Multi-step (K-NFE) Inference: For $K$ steps, a partition $1 = t_K > t_{K-1} > \dots > t_0 = 0$ is defined, and the logit tensor is iteratively updated: $$\mathbf{z}_{t_{i-1}} = \mathbf{z}_{t_i} - (t_i - t_{i-1})\, u_{\theta}(\mathbf{x}_{t_i}, t_{i-1}, t_i), \quad i = K, \ldots, 1$$
    • $\mathbf{z}_{t_{i-1}}$: Logits at the next (less corrupted) time step.
    • $\mathbf{z}_{t_i}$: Current logits at time $t_i$.
    • $\mathbf{x}_{t_i}$: Token sequence obtained by decoding $\mathbf{z}_{t_i}$ at each intermediate step (or by directly feeding continuous embeddings).
    • $u_{\theta}(\mathbf{x}_{t_i}, t_{i-1}, t_i)$: The student model's predicted average velocity for the interval $[t_{i-1}, t_i]$.
  • Single-step (1-NFE) Inference: The special case $K = 1$, i.e., $(r, t) = (0, 1)$. The final logits are computed directly (a minimal sketch follows this list): $$\mathbf{z}_0 = \mathbf{z}_1 - u_{\theta}(\mathbf{x}_1, 0, 1)$$
    • $\mathbf{z}_0$: The final, denoised logits.
    • $\mathbf{z}_1$: The initial, maximally corrupted logits.
    • $\mathbf{x}_1$: The fully masked input token sequence.
  • Final Decoding: The resulting token sequence $\mathbf{x}_0$ is obtained by decoding $\mathbf{z}_0$, typically via an argmax over the vocabulary dimension at each position.
  • Correctness and Error Propagation: Appendix B.3 details the correctness of single-step generation and analyzes its error propagation, showing that the error is bounded and non-accumulating, unlike iterative solvers.
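
The single-step rule above is simple enough to show end to end. In this minimal sketch, student_u, the MASK_ID constant, and the handling of the conditioning prompt are illustrative assumptions; the real system operates on the frozen LLaDA backbone steered by the trained soft prompts.

```python
import torch

MASK_ID = 0  # hypothetical [MASK] token id

@torch.no_grad()
def one_step_generate(student_u, prompt_ids, gen_len, vocab_size):
    """Single-step (1-NFE) inference: z_0 = z_1 - u_theta(x_1, 0, 1)."""
    B = prompt_ids.shape[0]
    # x_1: the conditioning prompt followed by fully masked target positions.
    x1 = torch.cat([prompt_ids,
                    torch.full((B, gen_len), MASK_ID, dtype=torch.long)], dim=1)
    # z_1: maximally corrupted logits (zeros, i.e. a uniform distribution).
    z1 = torch.zeros(B, x1.shape[1], vocab_size)

    u = student_u(x1, torch.zeros(B), torch.ones(B))   # average velocity on [0, 1]
    z0 = z1 - u                                        # one-step update in logit space
    # Decode only the generated positions via argmax over the vocabulary.
    return z0[:, prompt_ids.shape[1]:].argmax(dim=-1)
```

A K-step variant would simply loop the multi-step update above, re-decoding the token sequence between steps.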

Theoretical Justification (Appendix B)

The appendices provide deeper theoretical insights into the method's stability and correctness.

  • Operator View (Fixed-Point and Convergence): The MeanFlow identity ($u = T[u]$) is viewed as a fixed-point equation. The operator $T$ is shown to be a contraction mapping if $(t-r) L_t < 1$, where $L_t$ is a Lipschitz constant related to the smoothness of $u$. This ensures a unique fixed point and geometric convergence for iterative applications of $T$. The distillation objective is interpreted as a proximal update towards this fixed point, explaining training stability.
  • Sobolev-like Training Perspective: The loss $\mathcal{L}(u_\theta) = \mathbb{E}\|\Delta(u_\theta)\|^2$ is interpreted as a Sobolev-like regularizer. It penalizes not just the value mismatch $(u - v)$ but also the flow-aligned derivative error $(t-r) \frac{d}{dt} u$, effectively enforcing smoothness and curvature fidelity in the learned flow field. This second-order information from the JVP is crucial for high-quality one-step generation in discrete spaces.
  • One-Step Sampling Correctness: For one-step sampling, the logit error is bounded by $\|\mathbf{z}_0^\theta - \mathbf{z}_0^\star\| \leq \sup \|\varepsilon\|$, where $\varepsilon$ is the pointwise approximation error. Errors are therefore first-order in $\varepsilon$ and non-accumulating, supporting the empirical observation of high quality in a single step.
  • Limit $r \to t$ (Reduction to Flow Matching): As $r \to t$, the MeanFlow identity rigorously reduces to the standard Flow Matching objective, in which the model learns to predict the instantaneous velocity $v_{\mathrm{teacher}}(\mathbf{z}_t, t)$. The JVP-augmented loss thus generalizes Flow Matching, adding derivative supervision for better global consistency over larger intervals (see the limiting form sketched below).
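
For intuition on the last point (a straightforward limiting argument, not reproduced verbatim from the appendix): as $r \to t$ the factor $(t-r)$ vanishes, so the stop-gradient target in the distillation loss collapses to the teacher's instantaneous velocity and the objective reduces to $$\mathcal{L} \to \mathbb{E}\left\| u_{\theta}(\mathbf{x}_t, t, t) - \mathrm{sg}\big(v_{\mathrm{teacher}}(\mathbf{x}_t, t)\big) \right\|^2,$$ i.e., a standard Flow Matching regression onto the instantaneous velocity.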

Experimental Setup

Datasets

The effectiveness of OneFlowSeq is evaluated across three widely-used Seq2Seq benchmarks, covering different tasks:

  1. Paraphrasing (PP) on Quora Question Pairs (QQP):

    • Origin: The dataset consists of question pairs from Quora, a popular question-and-answer website.
    • Characteristics: The task is to determine if two questions are semantically equivalent (paraphrases). For generation, the goal is to transform one question into a semantically equivalent but syntactically different one.
    • Domain: General knowledge, natural language understanding.
    • Example (Hypothetical, based on task):
      • Source: "What are the best ways to learn machine learning on your own?"
      • Target (Paraphrase): "How can I effectively self-study machine learning?"
    • Justification: QQP is a standard benchmark for paraphrase generation, allowing evaluation of semantic preservation and stylistic variation.
  2. Text Simplification (TS) on Wiki-Auto:

    • Origin: Derived from Wikipedia and Simple English Wikipedia.
    • Characteristics: Contains pairs of complex sentences and their simplified counterparts. The task involves reducing linguistic complexity while retaining the original meaning.
    • Domain: Encyclopedic knowledge, general text.
    • Example (Hypothetical, based on task):
      • Source: "The unprecedented proliferation of digital technologies has ushered in an era of ubiquitous information access, fundamentally transforming socioeconomic paradigms."
      • Target (Simplified): "Many new digital tools have made it easier to find information everywhere, changing how societies and economies work."
    • Justification: Wiki-Auto is a common dataset for text simplification, testing the model's ability to reduce complexity while maintaining factual accuracy.
  3. Question Generation (QG) on Quasar-T:

    • Origin: A large-scale dataset for question answering by search and reading.
    • Characteristics: Provides contexts (e.g., passages of text) from which questions need to be generated. The generated questions should be answerable from the provided context.
    • Domain: General knowledge, reading comprehension.
    • Example (Hypothetical, based on task):
      • Source Context: "The Eiffel Tower, located in Paris, France, was designed by Gustave Eiffel. It was constructed from 1887 to 1889 as the entrance arch for the 1889 World's Fair."
      • Target (Question): "Who designed the Eiffel Tower?"
    • Justification: Quasar-T is a challenging dataset for question generation, assessing a model's ability to comprehend text and formulate relevant questions.

Evaluation Metrics

The models are evaluated across generation quality, diversity, and efficiency.

  1. BLEU (Bilingual Evaluation Understudy):

    • Conceptual Definition: Measures the similarity between generated text and one or more reference texts. It counts the number of n-gram matches (sequences of nn words) between the candidate and reference translations, with a penalty for brevity. Higher scores indicate better quality.
    • Mathematical Formula: $$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad \mathrm{BP} = \min\left(1, \; e^{\,1 - \mathrm{ref\_len}/\mathrm{cand\_len}}\right)$$
    • Symbol Explanation:
      • $N$: Maximum n-gram order (typically 4, i.e., unigrams through 4-grams).
      • $w_n$: Weight for each n-gram order (often $1/N$).
      • $p_n$: Clipped n-gram precision, i.e., the count of matching n-grams in the candidate divided by the total number of n-grams in the candidate.
      • $\mathrm{BP}$: Brevity penalty, which penalizes short candidates that might otherwise achieve high precision by matching only a few words.
      • $\mathrm{ref\_len}$: Length of the closest reference sentence (or the sum of reference lengths).
      • $\mathrm{cand\_len}$: Length of the candidate sentence.
  2. ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence):

    • Conceptual Definition: Measures the overlap between the generated text and reference text based on the longest common subsequence (LCS). It's particularly useful for summarization tasks and focuses on recall, meaning how much of the reference information is captured in the generated text. The paper reports the F1 score for ROUGE-L. Higher scores indicate better quality.
    • Mathematical Formula (F1 score): $$\mathrm{ROUGE\text{-}L_{F1}} = \frac{(1 + \beta^2)\, R \cdot P}{R + \beta^2 P}, \qquad R = \frac{\mathrm{LCS}(\text{candidate}, \text{reference})}{\mathrm{length}(\text{reference})}, \qquad P = \frac{\mathrm{LCS}(\text{candidate}, \text{reference})}{\mathrm{length}(\text{candidate})}$$ where $R$ is recall, $P$ is precision, and $\beta$ weights recall against precision.
    • Symbol Explanation:
      • $\mathrm{LCS}(\text{candidate}, \text{reference})$: Length of the longest common subsequence between the candidate and reference texts; a subsequence need not be contiguous.
      • $\mathrm{length}(\text{reference})$: Length of the reference text.
      • $\mathrm{length}(\text{candidate})$: Length of the candidate (generated) text.
      • $\beta$: Weighting of recall vs. precision; typically $\beta = 1$ for the F1 score.
  3. BERTScore:

    • Conceptual Definition: A metric for evaluating text generation that leverages contextual embeddings from pre-trained BERT models. It measures semantic similarity by computing cosine similarity between BERT embeddings of tokens in the candidate and reference sentences, allowing for flexible matching and capturing nuances beyond exact n-gram overlaps. The paper reports the F1 score. Higher scores indicate better quality.
    • Mathematical Formula (simplified, F1): $$\mathrm{BERTScore_{F1}} = 2 \cdot \frac{\mathrm{BERTScore_P} \cdot \mathrm{BERTScore_R}}{\mathrm{BERTScore_P} + \mathrm{BERTScore_R}}$$ where $$\mathrm{BERTScore_P} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_i \in \hat{x}} \max_{x_j \in x} \cos\big(\mathbf{e}(\hat{x}_i), \mathbf{e}(x_j)\big), \qquad \mathrm{BERTScore_R} = \frac{1}{|x|} \sum_{x_j \in x} \max_{\hat{x}_i \in \hat{x}} \cos\big(\mathbf{e}(x_j), \mathbf{e}(\hat{x}_i)\big)$$
    • Symbol Explanation:
      • $\hat{x}$: The candidate (generated) sentence; $x$: the reference sentence.
      • $\hat{x}_i$, $x_j$: Individual tokens in the candidate and reference sentences, respectively.
      • $\mathbf{e}(t)$: The contextual embedding of token $t$ produced by a pre-trained encoder (roberta-large in this paper).
      • $\cos(\cdot, \cdot)$: Cosine similarity between two vectors.
      • $\max$: The maximum similarity over all possible token alignments.
      • $\mathrm{BERTScore_P}$: Precision; each candidate token is matched to its most similar reference token.
      • $\mathrm{BERTScore_R}$: Recall; each reference token is matched to its most similar candidate token.
  4. Dist-1:

    • Conceptual Definition: Measures the diversity of generated text within a sample. It calculates the number of unique unigrams (single words) in the generated output, normalized by the total number of words. Higher Dist-1 indicates greater lexical diversity.
    • Mathematical Formula (for a set of generated sentences $S$): $$\mathrm{Dist\text{-}1} = \frac{\text{Number of unique unigrams in } S}{\text{Total number of unigrams in } S}$$
    • Symbol Explanation:
      • $S$: The set of all generated sentences for a given input.
      • Number of unique unigrams in S: The count of distinct individual words appearing across all generated sentences.
      • Total number of unigrams in S: The total count of all words (including repetitions) appearing across all generated sentences.
  5. Div-4:

    • Conceptual Definition: Similar to Dist-1, but measures diversity based on unique 4-grams (sequences of four words). A higher Div-4 implies more diverse sentence structures and longer unique phrases in the generated text.
    • Mathematical Formula (for a set of generated sentences $S$): $$\mathrm{Div\text{-}4} = \frac{\text{Number of unique 4-grams in } S}{\text{Total number of 4-grams in } S}$$
    • Symbol Explanation:
      • $S$: The set of all generated sentences for a given input.
      • Number of unique 4-grams in S: The count of distinct sequences of four words appearing across all generated sentences.
      • Total number of 4-grams in S: The total count of all 4-grams (including repetitions) appearing across all generated sentences.
  6. Self-BLEU:

    • Conceptual Definition: Measures the inter-sample similarity or lack of diversity across multiple generated outputs for the same input. For each source input, multiple outputs are generated. Self-BLEU computes the average pairwise BLEU-4 score between these generated outputs. A lower Self-BLEU score indicates higher diversity among the generated samples for a given input, which is desirable.
    • Mathematical Formula (for $K$ samples $s_1, \dots, s_K$ generated for a given input): $$\mathrm{Self\text{-}BLEU} = \frac{1}{K(K-1)} \sum_{i=1}^{K} \sum_{j \neq i} \mathrm{BLEU}(s_i, s_j)$$
    • Symbol Explanation:
      • $K$: The number of samples generated for each source input (here, $K = 5$).
      • $s_i, s_j$: Individual generated samples.
      • $\mathrm{BLEU}(s_i, s_j)$: The BLEU score (typically BLEU-4) computed between generated sample $s_i$ and $s_j$ (with $s_j$ serving as the reference for $s_i$). A small illustrative implementation of the diversity metrics appears after this list.
  7. Wall-Clock Time:

    • Conceptual Definition: A direct measure of inference efficiency, representing the actual time taken to process a sample or batch of samples. It's reported in seconds per sample.
    • Nuance: The paper notes that latency is highly implementation-dependent. For autoregressive models, latency is reported for single-request (batch size 1) generation. For parallel-decoding diffusion models like OneFlowSeq, it reports amortized latency from a high-throughput scenario (batch size of 256) for a fair comparison.
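
To make the n-gram-based metrics concrete, here is a small, self-contained Python sketch of distinct-n (Dist-1 / Div-4) and Self-BLEU as defined above; it uses NLTK's sentence-level BLEU purely for illustration and is not the paper's evaluation code (the tokenization and smoothing choices are assumptions).

```python
from itertools import permutations
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def distinct_n(sentences, n):
    """Dist-1 / Div-4 style metric: unique n-grams divided by total n-grams."""
    ngrams = []
    for s in sentences:
        toks = s.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def self_bleu(samples):
    """Average BLEU over ordered pairs (i != j); lower means more diverse."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([samples[j].split()], samples[i].split(),
                            smoothing_function=smooth)
              for i, j in permutations(range(len(samples)), 2)]
    return sum(scores) / max(len(scores), 1)

samples = ["how can i self study machine learning",
           "what is the best way to learn machine learning on my own"]
print(distinct_n(samples, 1), distinct_n(samples, 4), self_bleu(samples))
```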

Baselines

OneFlowSeq is compared against a diverse set of baselines:

  1. Fine-tuned Autoregressive Models:
    • GPT-2 (fine-tuned): A large, Transformer-based autoregressive model fine-tuned for the specific Seq2Seq tasks. Represents a strong, standard AR baseline.
    • LLaMA2-7B (reference): A 7-billion parameter version of the LLaMA2 family, another powerful autoregressive model, also fine-tuned. Represents a more recent and larger AR baseline.
  2. Multi-step Diffusion Models:
    • DiffuSeq (MBR=1): An early diffusion model for Seq2Seq generation, using Minimum Bayes Risk (MBR) decoding with 1 candidate.
    • DiffuSeq (MBR=10): DiffuSeq with MBR decoding over 10 candidates, which typically improves quality at the cost of speed.
    • DiffuSeq-v2 (MBR=2): An accelerated version of DiffuSeq, using MBR with 2 candidates.
    • LLaDA-8B-Instruct (Teacher): The multi-step teacher model from which OneFlowSeq is distilled. It uses 128 sampling steps, chosen for the best quality-efficiency trade-off. This serves as the upper bound for quality that OneFlowSeq aims to match.
  3. Prior One-step Distillation Model:
    • DLM-One: A prior method for one-step diffusion language modeling via score distillation. The authors re-implemented it and distilled it on the same LLaDA-8B-Instruct teacher to ensure a fair comparison of distillation effectiveness.

Implementation Details

  • All models are trained using the official training sets of each benchmark.
  • Results are reported on the corresponding test sets.
  • Hyperparameters are detailed in Appendix C; the key settings are transcribed in the hyperparameter ablation (Table 5) later in this analysis.
  • All experiments are conducted on 8 NVIDIA A100 GPUs.
  • For LLaDA-8B-Instruct, the number of sampling steps is fixed to 128, optimized for quality-efficiency trade-off.

Results & Analysis

The experimental results demonstrate that OneFlowSeq achieves state-of-the-art performance in terms of quality while offering unprecedented efficiency for one-step generation in diffusion language models.

Main Comparisons and Analyses

The primary results comparing OneFlowSeq against various baselines and its teacher are presented in Table 1.

(Table 1: Main results on paraphrasing (PP), text simplification (TS), and question generation (QG) benchmarks. Arrows indicate whether higher (↑) or lower (↓) values are better. Best results in each category are highlighted in bold.)

| Task | Model | BLEU (↑) | ROUGE-L (↑) | BertScore (↑) | Dist-1 (↑) | Self-BLEU (↓) | Div-4 (↑) | Wall-Clock Time (↓, s/sample) |
|---|---|---|---|---|---|---|---|---|
| PP (QQP) | GPT-2 (fine-tuned) | 0.2059 | 0.5415 | 0.8363 | 0.9819 | 0.2625 | 0.5020 | ~0.08 |
| | LLaMA2-7B (reference) | 0.3271 | 0.6470 | 0.8770 | 0.8927 | — | — | ~1.35 |
| | DiffuSeq (MBR=1) | 0.1829 | 0.5299 | 0.7932 | 0.9747 | 0.2732 | 0.8641 | 14.94 |
| | DiffuSeq (MBR=10) | 0.2413 | 0.5880 | 0.8365 | 0.9807 | 0.2964 | 0.8712 | ~20.0 |
| | DiffuSeq-v2 (MBR=2) | 0.2115 | 0.5651 | 0.8036 | 0.9782 | 0.2798 | 0.8698 | ~0.0025 |
| | LLaDA-8B-Instruct (Teacher) | 0.4972 | 0.7(23) | 0.9050 | 0.9(01) | 0.1871 | 0.9021 | 0.05 |
| | DLM-One | 0.1688 | 0.5265 | 0.7851 | 0.9671 | 0.3418 | 0.6256 | 0.03 |
| | OneFlowSeq (Ours) | 0.4802 | 0.7067 | 0.9210 | 0.9932 | 0.1988 | 0.8973 | 0.0003 |
| TS (Wiki-Auto) | GPT-2 (fine-tuned) | 0.2693 | 0.5111 | 0.7882 | 0.9464 | 0.4042 | 0.4876 | ~0.08 |
| | LLaMA2-7B (reference) | 0.3928 | 0.7012 | 0.8910 | 0.9614 | 0.3218 | 0.7(68) | ~1.35 |
| | DiffuSeq (MBR=1) | 0.2929 | 0.5313 | 0.7781 | 0.9(72) | 0.4642 | 0.6304 | 14.94 |
| | DiffuSeq (MBR=10) | 0.3622 | 0.5849 | 0.8126 | 0.9(64) | 0.4812 | 0.6621 | ~20.0 |
| | DiffuSeq-v2 (MBR=2) | 0.3272 | 0.5431 | 0.7923 | 0.9321 | 0.4685 | 0.6430 | ~0.0025 |
| | LLaDA-8B-Instruct (Teacher) | 0.5412 | 0.7291 | 0.8940 | 0.9214 | 0.3019 | 0.8364 | 0.05 |
| | DLM-One | 0.2927 | 0.5299 | 0.7565 | 0.8924 | 0.3956 | 0.4098 | 0.03 |
| | OneFlowSeq (Ours) | 0.5213 | 0.7328 | 0.8831 | 0.9298 | 0.3118 | 0.8216 | 0.0003 |
| QG (Quasar-T) | GPT-2 (fine-tuned) | 0.1110 | 0.3215 | 0.6346 | 0.9670 | 0.2910 | 0.8086 | ~0.08 |
| | LLaMA2-7B (reference) | 0.2321 | 0.3881 | 0.7132 | 0.9421 | 0.2412 | 0.8(21) | ~1.35 |
| | DiffuSeq (MBR=1) | 0.1512 | 0.3468 | 0.5871 | 0.1(41) | 0.2789 | 0.8103 | 14.94 |
| | DiffuSeq (MBR=10) | 0.1731 | 0.3665 | 0.6123 | 0.9056 | 0.2912 | 0.8242 | ~20.0 |
| | DiffuSeq-v2 (MBR=2) | 0.1592 | 0.3512 | 0.6010 | 0.9198 | 0.2819 | 0.8203 | ~0.0025 |
| | LLaDA-8B-Instruct (Teacher) | 0.3021 | 0.5319 | 0.7832 | 0.9616 | 0.1834 | 0.8957 | 0.05 |
| | DLM-One | 0.1512 | 0.3257 | 0.5683 | 0.9666 | 0.1966 | 0.3798 | 0.03 |
| | OneFlowSeq (Ours) | 0.2981 | 0.5412 | 0.8012 | 0.9691 | 0.1991 | 0.8785 | 0.0003 |

(Cells shown as, e.g., 0.7(23) keep the source transcription where a digit is not recoverable; "—" marks values missing from the transcription.)

Key Observations from Table 1:

  • Quality Matching Teacher Performance: OneFlowSeq (Ours) achieves generation quality metrics (BLEU, ROUGE-L, BertScore) highly comparable to its multi-step teacher, LLaDA-8B-Instruct. For example, in Paraphrasing (PP), OneFlowSeq's BLEU is 0.4802 compared to the teacher's 0.4972, and its BertScore (0.9210) even surpasses the teacher's (0.9050). Similar trends are observed in Text Simplification (TS) and Question Generation (QG). This validates the effectiveness of the MeanFlow-based distillation and the rich JVP signal.

  • Superior to Other Baselines: OneFlowSeq consistently outperforms strong autoregressive baselines like LLaMA2-7B and the previous one-step diffusion method DLM-One across all quality metrics. DLM-One, despite being a one-step method, performs poorly, even trailing AR models and showing very low diversity (e.g., Div-4 of 0.6256 on QQP vs. OneFlowSeq's 0.8973).

  • Dramatic Speedup: The most striking result is the Wall-Clock Time. OneFlowSeq operates at an astonishing 0.0003 seconds per sample. This makes it:

    • ~166x faster than its teacher (LLaDA-8B-Instruct at 0.05s).
    • ~4500x faster than LLaMA2-7B (1.35s).
    • ~267x faster than GPT-2 (0.08s).
    • Orders of magnitude faster than DiffuSeq variants (14.94s - 20.0s). This effectively resolves the speed bottleneck of both iterative diffusion models and autoregressive generation.
  • Strong Diversity: OneFlowSeq maintains competitive diversity metrics (Dist-1, Div-4, Self-BLEU) compared to the teacher, without evidence of mode collapse, which can be a concern for highly compressed models. For instance, its Self-BLEU scores are very close to the teacher's, indicating similar inter-sample diversity.

    These results lead to two significant conclusions:

  1. OneFlowSeq successfully captures the essence of a complex, iterative diffusion process in a single forward pass without critical loss of quality, validating the MeanFlow-based distillation framework.
  2. The immense speedup transforms one-step diffusion into a practical and deployable solution, overcoming a major barrier for diffusion language models.

Efficiency and Distillation Analyses

Convergence Speed and Distillation Effectiveness

The paper further analyzes the training dynamics by comparing OneFlowSeq's convergence against a re-implemented DLM-One using the same LLaDA-8B-Instruct teacher.

Figure 2: Validation BLEU score on the QQP dataset versus training steps. To ensure a fair comparison, the authors implemented a strong baseline, DLM-One (LLaDA), by applying the score distillation method from DLM-One to the same frozen LLaDA-8B-Instruct teacher.

Key Observations from Figure 2:

  • Faster Convergence: OneFlowSeq converges dramatically faster. It achieves a rapid performance increase, reaching a BLEU score of 0.38 by 40,000 training steps.

  • Superior Final Performance: At 40,000 steps, OneFlowSeq already surpasses the fully converged performance of DLM-One (0.19 BLEU at 100,000 steps). OneFlowSeq ultimately converges to a BLEU score of 0.49, more than double that of DLM-One.

  • Stable Training: The training dynamics of OneFlowSeq appear much more stable, as indicated by the smoother curve.

    This analysis strongly suggests that the MeanFlow-based objective, particularly with the JVP signal, provides a significantly more effective and efficient learning signal for distillation, allowing the student model to learn a much stronger generative capability.

Training Resource Efficiency

A core advantage of OneFlowSeq is its exceptional training efficiency, detailed in Table 2.

(Table 2: Comparison of training resource overhead. Our PEFT-based approach offers orders-of-magnitude improvements in efficiency over both standard diffusion model training and full-model distillation.)

| Resource Dimension | DiffuSeq / v2 | DLM-One | OneFlowSeq (Ours) | Advantage (vs. DiffuSeq / vs. DLM-One) |
|---|---|---|---|---|
| Training Paradigm | Full Model Training | Full Model Distill. | PEFT Distill. | — |
| Trainable Parameters | ~91 Million | ~8 Billion | ~5 Million | ~18× / ~1600× |
| Peak Training VRAM | ~45 GB | > 60 GB | < 30 GB | > 1.5× / > 2× |
| Parameter Storage (FP16) | ~182 MB | ~16 GB | < 20 MB | ~9× / ~800× |

Key Observations from Table 2:

  • Trainable Parameters: OneFlowSeq updates only ~5 Million parameters. This is a staggering ~1600x reduction compared to DLM-One (which trains an 8B model) and ~18x compared to DiffuSeq (91M parameters). This is due to its PEFT-based approach (prompt tuning) on a frozen backbone.

  • Peak Training VRAM: This parameter reduction translates directly into VRAM savings. OneFlowSeq uses < 30 GB of peak VRAM, more than 2× less than DLM-One (> 60 GB). Even compared to DiffuSeq (~45 GB), it offers significant savings.

  • Parameter Storage: The checkpoint size is also drastically reduced. OneFlowSeq's < 20 MB of storage is ~800× smaller than DLM-One (~16 GB) and ~9× smaller than DiffuSeq (~182 MB).

    These results underscore the practicality and scalability of OneFlowSeq, making state-of-the-art one-step generation accessible even with limited computational resources.

Performance Scaling Analysis

The paper investigates the scalability and robustness of OneFlowSeq along two dimensions: multi-step inference performance and prompt capacity scaling.

Multi-Step Inference Performance

Figure 3: Performance (BLEU on QQP) as a function of the number of inference steps (NFE). OneFlowSeq establishes a new state-of-the-art performance frontier, starting at a much higher quality and maintaining its advantage as the number of steps increases.

Key Observations from Figure 3:

  • Superior Baseline: At just a single inference step (1-NFE), OneFlowSeq achieves a BLEU score of 0.488 on QQP, which is more than 2.5x higher than DiffuSeq-v2 (0.195) and DLM-One (0.179). This establishes a fundamentally higher performance baseline from the outset.

  • Effective Scaling with Steps: OneFlowSeq scales more effectively with additional inference steps. Its BLEU score rises to 0.527 at 128 steps, showing that while it's exceptional at one-step, it can still benefit from (and perform well in) a few-step setting.

  • Limited Scaling for Baselines: In contrast, DLM-One shows almost no improvement with added computation, and DiffuSeq-v2 improves but remains significantly below OneFlowSeq at all step counts.

    This analysis confirms OneFlowSeq as not only the best one-step generator but also a state-of-the-art few-step sampler that operates on a consistently higher quality-efficiency frontier.

Scalability with Prompt Capacity

Key Observations from Figure 4:

  • Monotonic Improvement: As the prompt length (number of soft prompt vectors) increases from 8 to 64, performance across all three quality metrics (BLEU, ROUGE-L, BertScore) monotonically improves.
  • Significant Gains from Small Modules: For instance, the BLEU score rises from 0.461 to 0.492 as prompt length scales. This demonstrates that scaling only a small fraction of parameters (~1.3M to ~10M for prompt length 8 to 64) yields notable gains.
  • Parameter Efficiency: This highlights the strong parameter efficiency of the PEFT framework. Even with a minuscule trainable component, increasing its capacity leads to better generation quality at minimal additional cost.

Ablation Study

An ablation study on the QQP dataset investigates the contribution of OneFlowSeq's core components, particularly the MeanFlow identity and the JVP term.

(Table 3: Ablation study on QQP. We compare our model with its teacher and three distillation variants, showing the importance of the MeanFlow identity and precise JVP computation.)

| Model | BLEU (↑) | ROUGE-L (↑) | BertScore (↑) | Self-BLEU (↓) | Div-4 (↑) |
|---|---|---|---|---|---|
| LLaDA-8B-Instruct (Teacher) | 0.4672 | 0.7123 | 0.9150 | 0.1871 | 0.9021 |
| Flow Matching Distill. | 0.3145 | 0.6218 | 0.8691 | 0.2512 | 0.8176 |
| w/o JVP Signal | 0.3516 | 0.65 | 0.8854 | 0.2345 | 0.8419 |
| w/ Finite Difference | 0.4258 | 0.6895 | 0.9088 | 0.2093 | 0.8804 |
| OneFlowSeq (Ours) | 0.4702 | 0.7067 | 0.9210 | 0.1988 | 0.8973 |

Key Observations from Table 3:

  • Importance of JVP Signal: The w/o JVP Signal variant (trained only on the teacher's instantaneous velocity, removing the second-order dynamics) shows a dramatic drop of nearly 12 BLEU points (0.3516 vs. OneFlowSeq's 0.4702). This decisively confirms that the Jacobian-vector product term, providing a second-order, self-consistency signal, is the cornerstone of OneFlowSeq's high-fidelity distillation.
  • Precision of Analytical JVP: The w/ Finite Difference variant, which approximates the JVP numerically, performs better than w/o JVP Signal but still significantly underperforms the full OneFlowSeq model (0.4258 BLEU vs. 0.4702 BLEU). This demonstrates that the precise, analytical derivative computed via JVP provides a more stable and accurate learning target than its numerical approximation, which can introduce noise.
  • Superiority of MeanFlow: OneFlowSeq significantly surpasses the Flow Matching Distill. baseline (0.3145 BLEU vs. 0.4702 BLEU). This underscores the superiority of the MeanFlow-based identity, which explicitly accounts for average velocities over intervals, over standard Flow Matching which focuses on instantaneous velocities.

Qualitative Analysis and Case Studies (Appendix E)

The paper includes qualitative examples to illustrate model behavior, both successes and failures.

Case Study 1: Paraphrasing (QQP) - Success Case

  • Source Question: "What are the best ways to learn machine learning on your own?"

  • Reference (Gold): the reference string printed in the paper is garbled ("What heos ive esty r hin"); a plausible reconstruction is "What are the most effective methods for self-study in machine learning?"

  • LLaMA2-7B (Baseline): "How can I learn machine learning by myself, and what are the best ways?" (Clunky, repetitive)

  • LLaDA-8B-Instruct (Teacher, 128-step): "What are the top resources for self-teaching machine learning?" (Fluent, natural)

  • OneFlowSeq (Ours, 1-step): "What is the most effective way to self-study machine learning?" (Fluent, natural)

    Analysis: OneFlowSeq successfully generates a high-quality, fluent paraphrase that precisely preserves the semantic core. It uses effective synonyms and phrases, performing on par with the multi-step teacher. This highlights that the distillation successfully transfers nuanced generative capabilities without discernible quality loss.

Case Study 2: Question Generation (Quasar-T) - Error Analysis

  • Source Context: "Penicillin, the first true antibiotic, was discovered accidentally by Scottish physician Alexander Fleming in 1928. While studying Staphylococcus bacteria, he noticed that a mold of the Penicillium genus had contaminated one of his culture plates and that the bacteria surrounding the mold had been destroyed."

  • Reference (Gold): "Who discovered the first antibiotic?"

  • LLaMA2-7B (Baseline): "What did Alexander Fleming find in 1928 on a culture plate?" (Overly specific, misses the main point)

  • LLaDA-8B-Instruct (Teacher, 128-step): "Who is credited with the accidental discovery of penicillin in 1928?" (Comprehensive, high-quality)

  • OneFlowSeq (Ours, 1-step): "Who discovered penicillin?" (Good, relevant, but lacks detail)

    Analysis: OneFlowSeq correctly identifies the main subject. However, it oversimplifies, omitting the important context that penicillin was "the first true antibiotic" or the year of discovery. This is identified as a typical limitation of the one-step generation approach: while capturing primary information, it may sometimes omit secondary details. This represents a known trade-off for the immense speedup, highlighting an area for future improvement where the richness of the multi-step teacher's output is not fully captured.

Ablation Study on Key Hyperparameters (Appendix D)

The paper includes an ablation study on key hyperparameters to validate robustness and justify choices.

(Table 5: Ablation study on key hyperparameters evaluated on the QQP test set. The default configuration used in the main experiments is marked "(default)". Performance is reported using BLEU, ROUGE-L, and BertScore.)

Hyperparameter                            Value               BLEU (↑)   ROUGE-L (↑)   BertScore (↑)
Learning Rate (η)                         1 × 10⁻⁴            0.452      0.691         0.915
                                          2 × 10⁻⁴            0.468      0.702         0.918
                                          5 × 10⁻⁴ (default)  0.480      0.707         0.921
                                          1 × 10⁻³            0.471      0.704         0.919
Prompt Length (k)                         8                   0.461      0.705         0.912
                                          16                  0.473      0.706         0.918
                                          32 (default)        0.480      0.707         0.921
                                          64                  0.492      0.718         0.923
Weight of JVP Term ((t − r) multiplier)   0.0 (w/o JVP)       0.352      0.658         0.885
                                          0.5                 0.465      0.699         0.916
                                          1.0 (default)       0.480      0.707         0.921
                                          2.0                 0.473      0.701         0.917

Analysis of Results from Table 5:

  • Learning Rate: The default learning rate of 5 × 10⁻⁴ is confirmed as a "sweet spot." Lower rates lead to suboptimal convergence, while higher rates introduce instability and degrade performance.
  • Prompt Length: Performance monotonically improves with increasing prompt length. A length of 64 yields the best results (BLEU 0.492), slightly surpassing the default 32 (BLEU 0.480). This highlights that while 32 is a strong balance, further scaling of the tiny PEFT module can yield additional gains, demonstrating scalability.
  • Weight of JVP Term: This ablation reinforces the critical role of the JVP signal. Removing it (0.0) causes a severe performance drop. Deviating from the default weight of 1.0 (either 0.5 or 2.0) also harms performance, indicating that the theoretically grounded formulation of the MeanFlow identity, in which the JVP term's weight is implicitly 1, provides the most stable and effective learning signal (the weighted form is written out below).
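Reading the "(t − r) multiplier" as a scalar weight $\lambda$ on the JVP term, the ablated training target can be written as follows (notation assumed for this summary; $\lambda = 1$ recovers the MeanFlow identity used in the main model, and $\lambda = 0$ drops the JVP term entirely):

$$
u_\theta(x_t, r, t) \leftarrow v(x_t, t) - \lambda\,(t - r)\,\frac{\mathrm{d}}{\mathrm{d}t}\,u_\theta(x_t, r, t),
\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\,u_\theta = v\,\partial_{x_t} u_\theta + \partial_t u_\theta,
$$

where the total derivative is exactly the Jacobian-vector product computed during training, which explains why re-weighting it away from 1 breaks the self-consistency the identity is built on.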

Conclusion & Personal Thoughts

Conclusion Summary

OneFlowSeq presents a significant advancement in the field of sequence-to-sequence generation by effectively resolving the long-standing trade-off between generation quality and inference speed in Diffusion Language Models (DLMs). By combining the theoretical stability of the MeanFlow identity with a highly parameter-efficient distillation strategy, OneFlowSeq achieves one-step generation quality that matches or even surpasses its 128-step teacher (LLaDA-8B-Instruct). The core innovation lies in leveraging a rich Jacobian-vector product (JVP) supervision signal, which provides deeper, second-order dynamic guidance to the student model. This, coupled with the training of only a tiny Soft Prompt Module via prefix-tuning, leads to state-of-the-art results on paraphrasing, text simplification, and question generation benchmarks. The framework demonstrates revolutionary efficiency, reducing trainable parameters by $1600\times$ and accelerating inference by over $160\times$ compared to its teacher, making it orders of magnitude faster than both autoregressive and multi-step diffusion baselines. OneFlowSeq successfully establishes one-step diffusion as a practical, scalable, and resource-efficient paradigm for widespread adoption in real-world NLP applications.

Limitations & Future Work

The paper implicitly discusses some limitations and points towards future work:

  • Omission of Secondary Details in One-Step Generation: As highlighted in the Question Generation case study, while OneFlowSeq excels at capturing primary information in a single step, it might occasionally omit secondary but important details from the source context. This is presented as a known trade-off for the immense speedup and an area for future improvement to enhance the "richness" of one-step outputs.
  • Generalizability of PEFT: While prompt tuning is highly efficient, its generalizability across vastly different tasks or domains without re-distillation for each might be a consideration. The current work focuses on distillation for Seq2Seq tasks.
  • Complexity of JVP Computation: While analytical JVP computation is efficient via autodiff, it still requires understanding and correctly implementing the derivative terms within the MeanFlow identity, which can be more complex than simpler $L_2$ distillation objectives.

Personal Insights & Critique

The OneFlowSeq paper presents a compelling solution to one of the most pressing issues in generative AI: computational efficiency without sacrificing quality. The core idea of marrying the MeanFlow theoretical framework with the practical benefits of PEFT and a sophisticated JVP distillation signal is genuinely novel and impactful.

Novelty: The explicit use of the Jacobian-vector product (JVP) signal for distillation, particularly in the context of the MeanFlow identity, stands out as a key innovation. This moves beyond simpler score-matching or direct output-matching objectives, offering a richer, self-consistent learning signal that accounts for the dynamics of the flow field. The theoretical backing in Appendix B provides strong justification for its stability and effectiveness.

Robustness and Scalability: The ablation studies thoroughly demonstrate the critical role of the JVP term and the benefits of accurate (analytical) computation. The performance scaling analyses further highlight the robustness of the PEFT approach, showing that even small increases in prompt capacity yield tangible quality improvements. The ability to distill an 8B parameter model into a 5M parameter module with minimal VRAM and storage footprint is a testament to the method's resource efficiency.

Potential Impact: This work has the potential to significantly accelerate the adoption of diffusion models in NLP. The inference speedup by orders of magnitude means DLMs could finally compete with or even surpass AR models in latency-critical applications, while retaining their benefits of parallel generation and global coherence. This could open doors for new applications in real-time content generation, interactive dialogue systems, and large-scale document processing.

Open Questions/Future Directions:

  1. Complexity of One-Step Outputs: While the paper acknowledges the occasional oversimplification in one-step generation (e.g., in QG), further research could explore techniques to imbue more nuanced details and richness into single-step outputs without sacrificing speed. Perhaps a multi-stage single-step approach, where an initial generation is followed by a single refinement step using a different specialized one-step model, could be explored.

  2. Beyond Seq2Seq: The framework is currently applied to Seq2Seq tasks. Investigating its applicability to other generative tasks like unconditional text generation, code generation, or even multimodal generation could be a fruitful avenue.

  3. Teacher Model Size and Performance: The paper uses LLaDA-8B-Instruct. It would be interesting to see how OneFlowSeq performs when distilling from even larger teachers (e.g., 70B parameters) and if the ~1600x parameter reduction ratio holds.

  4. Dynamic Prompt Length: Could the optimal prompt length be dynamically determined or adaptively scaled during inference based on the complexity of the input or desired output quality?

  5. Interpretability of JVP: While the JVP term is mathematically justified, further work could explore what specific linguistic or semantic features this "second-order dynamic guidance" encourages the student to learn, potentially offering insights into the underlying generative process.

    Overall, OneFlowSeq is a meticulously crafted and highly promising piece of research that pushes the boundaries of efficient diffusion-based language generation.
