Black-Box On-Policy Distillation of Large Language Models
TL;DR Summary
This study introduces Generative Adversarial Distillation (GAD) for extracting knowledge from a teacher LLM in a black-box setting. By framing a minimax game, it trains a discriminator that evolves with the student, offering on-policy feedback and outperforming traditional sequence-level knowledge distillation (SeqKD).
Abstract
Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model's text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM's, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Black-Box On-Policy Distillation of Large Language Models
The title clearly states the paper's core research area: distilling (or compressing) Large Language Models (LLMs). It highlights two key constraints and one methodological approach:
- Black-Box: The distillation process assumes no access to the internal workings (parameters or logits) of the powerful "teacher" model, only its text outputs.
- On-Policy: The "student" model learns from its own generated outputs, not just by mimicking the teacher's pre-recorded responses.
- Distillation: The overall goal is to transfer knowledge from a large teacher model to a smaller student model.
1.2. Authors
Zewen Chi, Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, Furu Wei.
The authors are affiliated with Microsoft Research. This is a leading industrial research lab known for significant contributions to the field of AI and Natural Language Processing, including the development of influential models and architectures. The authors have a strong publication record in top-tier conferences, indicating a high level of expertise in the domain.
1.3. Journal/Conference
The paper is available as a preprint on arXiv (dated November 2025); no publication venue is specified. The references to top-tier conferences like ICLR and NeurIPS suggest that this work targets a highly competitive and influential venue within the machine learning community.
1.4. Publication Year
2025 (as listed on the preprint). The paper was submitted to arXiv on November 13, 2025.
1.5. Abstract
The abstract introduces a new method called Generative Adversarial Distillation (GAD) for training smaller Large Language Models (LLMs) using a proprietary, "black-box" teacher model. The core idea is to frame the distillation process as a minimax game, similar to a Generative Adversarial Network (GAN). The student LLM acts as a generator, trying to produce text that is indistinguishable from the teacher's. A discriminator model is trained simultaneously to tell the difference between student and teacher responses. This discriminator effectively becomes an on-policy reward model that adapts and provides feedback to the student as it learns. The authors show experimentally that GAD significantly outperforms the standard method of sequence-level knowledge distillation (SeqKD). A notable result is that a 14-billion parameter student model (Qwen2.5-14B-Instruct) trained with GAD achieves performance comparable to its teacher (GPT-5-Chat) on an automatic evaluation benchmark. The paper concludes that GAD is a powerful and promising technique for black-box LLM distillation.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2511.10643v1
- PDF Link: https://arxiv.org/pdf/2511.10643v1
- Publication Status: This is a preprint available on arXiv. It has not yet undergone formal peer review for a conference or journal publication. The use of model names like GPT-5-Chat suggests the paper is positioned as a forward-looking piece of research.
2. Executive Summary
2.1. Background & Motivation
What is the core problem the paper aims to solve?
The primary problem is black-box knowledge distillation for Large Language Models (LLMs). State-of-the-art LLMs, like OpenAI's GPT series, are often proprietary and accessible only through APIs. This means researchers and developers cannot access the model's internal parameters or the probability distributions (logits) over its vocabulary. This "black-box" nature prevents the use of traditional, highly effective white-box distillation techniques that rely on this internal information to train smaller, more efficient "student" models.
Why is this problem important in the current field? What specific challenges or gaps exist in prior research?
The importance of this problem stems from the need to create smaller, open-source, and specialized models that can run on less powerful hardware while retaining the capabilities of massive, proprietary models. The main challenges and gaps in prior black-box methods are:
- Limited Supervision Signal: The most common black-box method, Sequence-Level Knowledge Distillation (SeqKD), simply fine-tunes the student model on a static dataset of (prompt, teacher-response) pairs. This is equivalent to standard supervised learning and only teaches the student to imitate the teacher's surface-level text, failing to capture deeper stylistic or reasoning patterns.
- Exposure Bias: SeqKD is an off-policy method, meaning the student learns from a fixed set of teacher outputs. When the student generates its own text during inference, it may produce sequences it has never seen during training, leading to compounding errors. This is known as exposure bias.
- Lack of On-Policy Feedback: On-policy learning, where a model learns from its own generations, can mitigate exposure bias. However, it is difficult to implement in a black-box setting. If the student generates a new response, there is no direct way to get a quality score or corrective signal from the black-box teacher, as the teacher's logits are unavailable.
What is the paper's entry point or innovative idea?
The paper's key innovation is Generative Adversarial Distillation (GAD), which enables on-policy learning in the black-box setting. The core idea is to reframe distillation as a game:
- The student LLM is treated as a generator.
- A second model, the discriminator, is trained to distinguish between the student's and the teacher's responses.
- The discriminator's output serves as a reward signal for the student. The student is trained using reinforcement learning to generate text that "fools" the discriminator, i.e., maximizes this reward.

Crucially, the discriminator is not static; it co-evolves with the student. This makes it an adaptive, on-policy reward model that continually adjusts to the student's improving capabilities, providing stable and relevant feedback throughout the training process and avoiding the "reward hacking" common in RL with fixed reward models.
2.2. Main Contributions / Findings
What are the paper's primary contributions?
- A Novel Framework (GAD): The paper proposes Generative Adversarial Distillation (GAD), a new framework for black-box LLM distillation that leverages adversarial training to enable on-policy learning without requiring access to teacher logits.
- An On-Policy Reward Modeling Paradigm: It introduces the concept of using a co-evolving discriminator as an on-policy reward model. This provides a dynamic and adaptive feedback mechanism that is more robust than the fixed reward models used in standard Reinforcement Learning from Human Feedback (RLHF).
- Strong Empirical Validation: The paper provides extensive experimental evidence showing that GAD consistently outperforms the widely used SeqKD baseline across various model sizes and datasets.
What key conclusions or findings did the paper reach? What specific problems do these findings solve?
- GAD is Superior to SeqKD: GAD-trained student models achieve significantly higher performance than those trained with SeqKD, especially on out-of-distribution datasets, demonstrating better generalization. For example, a 3B parameter model trained with GAD can match the performance of a 7B model trained with SeqKD.
- Achieving Teacher-Level Performance: A large-enough student model (Qwen2.5-14B) trained with GAD can approach the performance of its much larger, proprietary teacher (GPT-5-Chat), closing the capability gap.
- GAD Avoids Overfitting and Reward Hacking: Analysis shows that SeqKD tends to overfit to the teacher's local lexical patterns (N-gram overlap), while GAD captures more global stylistic features. Furthermore, the on-policy nature of the GAD discriminator prevents the "reward hacking" phenomenon that plagues RL systems with fixed, off-policy reward models.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
What are the most fundamental technologies, theories, or models required to understand this paper?
- Large Language Models (LLMs): These are massive neural networks, typically based on the Transformer architecture, trained on vast amounts of text data. They excel at understanding and generating human-like text for a wide range of tasks, such as question answering, summarization, and conversation. Models like GPT-4, Llama 3, and Qwen2.5 are examples of LLMs.
- Knowledge Distillation (KD): A technique for model compression where a smaller "student" model is trained to mimic the behavior of a larger, more capable "teacher" model. The goal is to transfer the knowledge from the teacher to the student, creating a more efficient model that retains much of the teacher's performance.
- White-Box vs. Black-Box Distillation:
    - White-Box: The student model has access to the teacher's internal states, such as its parameters, hidden layer activations, or, most commonly, its output probability distribution (logits) for each token. This rich information allows for fine-grained supervision.
    - Black-Box: The student model can only access the final text output generated by the teacher (e.g., via an API call). It has no access to logits or other internal information. This is a more challenging but practical scenario for proprietary models.
- On-Policy vs. Off-Policy Learning: These terms originate from reinforcement learning.
    - On-Policy: The learning agent (the student model in this case) updates its policy based on data generated by its current version. It learns from its own experiences. This helps it explore and correct its own unique mistakes.
    - Off-Policy: The agent learns from data generated by a different policy. In distillation, SeqKD is off-policy because the student learns from a fixed dataset of responses generated by the teacher, not from its own outputs.
- Generative Adversarial Networks (GANs): A class of machine learning frameworks where two neural networks contest with each other in a game.
    - The Generator (G) learns to create plausible data (e.g., images, text).
    - The Discriminator (D) learns to distinguish the generator's fake data from real data. The two are trained in a minimax game: G tries to fool D, while D tries to get better at catching G. This adversarial process pushes the generator to produce increasingly realistic data.
- Reinforcement Learning (RL): A paradigm of learning where an "agent" learns to make decisions by performing actions in an "environment" to maximize a cumulative "reward." In the context of LLMs, the model is the agent, generating text (actions) token by token, and the reward is a score indicating the quality of the generated text. RLHF (Reinforcement Learning from Human Feedback) is a popular technique where a reward model is first trained on human preference data and then used to fine-tune an LLM with RL.
- Bradley-Terry Model: A statistical model for predicting the outcome of a paired comparison. Given two items, it models the probability that one is preferred over the other. In this paper, it is used to model the discriminator's preference for the teacher's response (y) over the student's response (G(x)).
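To make the Bradley-Terry idea concrete, here is a minimal Python sketch (an illustration, not code from the paper; the function name and numbers are invented) showing how two scalar scores are converted into a preference probability via the sigmoid:

```python
import math

def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B,
    given scalar scores (here: discriminator scores for two responses)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Example: a score gap of 1.5 maps to roughly an 82% preference for A.
print(bradley_terry_prob(2.0, 0.5))  # ~0.82
```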
3.2. Previous Works
- Sequence-Level Knowledge Distillation (SeqKD): This is the most common baseline for black-box distillation, first proposed by Kim and Rush (2016). It treats the teacher's outputs as ground-truth labels and trains the student using standard supervised fine-tuning (SFT). The objective is to maximize the likelihood of the student generating the teacher's exact response sequence. It is simple but often leads to "over-imitation" and poor generalization.
- White-Box On-Policy Distillation (e.g., MiniLLM): Recent works like Gu et al. (2024) have shown the benefits of on-policy learning in the white-box setting. They use an objective like reverse Kullback-Leibler (KL) divergence. While standard (forward) KL divergence forces the student to match the entire teacher distribution (mode-covering), reverse KL encourages the student to focus its probability mass on high-probability regions of the teacher distribution (mode-seeking). This is done by having the student generate a response and then penalizing the divergence of its distribution from the teacher's for that specific response. This is impossible in the black-box setting as it requires the teacher's logits.
- Generative Adversarial Networks for Text (e.g., SeqGAN): Training GANs for discrete data like text is challenging because the sampling process (choosing a word) is non-differentiable. Yu et al. (2017) in SeqGAN addressed this by framing the generator as an RL agent. The generator is trained with a policy gradient algorithm, where the reward is provided by the discriminator. This is the same core principle that GAD adopts for its generator training.
3.3. Technological Evolution
The field of LLM distillation has evolved along two axes: access level and learning policy.
- From White-Box to Black-Box: Early methods assumed full white-box access, using rich signals like KL divergence on logits. As powerful models became proprietary APIs, the focus shifted to the more practical but challenging black-box setting.
- From Off-Policy to On-Policy: The standard black-box method, SeqKD, is off-policy (supervised fine-tuning). This was recognized to have limitations like exposure bias. In the white-box world, methods evolved to be on-policy using reverse KL. However, there was no effective way to do on-policy learning in the black-box setting.

This paper's work, GAD, sits at the intersection of these trends. It introduces a method that is both black-box and on-policy, filling a critical gap in the technological landscape.
3.4. Differentiation Analysis
Compared to the main methods in related work, GAD's core innovations are:
- GAD vs. SeqKD:
    - Learning Policy: GAD is on-policy; the student learns from its own generations and receives feedback. SeqKD is off-policy; it only learns from a fixed set of teacher generations.
    - Feedback Mechanism: GAD receives a dynamic, scalar reward from an adaptive discriminator. SeqKD receives a static, token-level supervision signal via cross-entropy loss.
    - Behavior: GAD exhibits mode-seeking behavior, learning to produce high-quality responses, while SeqKD exhibits mode-covering behavior, trying to average over all possible teacher responses, which can lead to bland outputs.
- GAD vs. Standard RLHF:
    - Reward Model: In standard RLHF, the reward model is trained once on a static dataset of human preferences and then frozen. The policy is then optimized against this fixed reward function, which can lead to reward hacking (finding loopholes to get a high score without actually improving quality).
    - GAD's Discriminator: In GAD, the discriminator (acting as the reward model) co-evolves with the student (policy) in a minimax game. It is continuously updated to stay one step ahead of the student, making it much harder to "hack" and providing a more stable, adaptive training signal.
4. Methodology
4.1. Principles
The core principle of Generative Adversarial Distillation (GAD) is to reframe the black-box distillation problem as a two-player minimax game between the student LLM and a discriminator.
- The student model, acting as the Generator (G), aims to produce responses that are so similar to the teacher's that they can "fool" the discriminator.
- The discriminator model (D) is trained to distinguish between responses generated by the teacher and those generated by the student.
The discriminator's output provides a real-time, on-policy quality score (a reward) for the student's self-generated responses. This allows the student to learn and improve from its own outputs, even without direct access to the teacher's internal logic (logits). This adversarial setup forces the student to capture the underlying distribution of the teacher's responses, rather than just memorizing specific examples.
The training procedure of GAD is illustrated in the figure below.
Figure (schematic): the GAD framework, showing the interaction between the teacher model and the student generator, and the role of the discriminator, which is trained by minimizing/maximizing the adversarial loss and feeds its score back to the student.
4.2. Core Methodology In-depth (Layer by Layer)
The entire GAD framework is optimized around a single value function $V(G, D)$ that represents the minimax game.
4.2.1. The Minimax Game Objective
The central objective function for GAD is formulated as:

$$
\max_{G}\; \min_{D}\; V(G, D), \qquad V(G, D) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\Big[-\log \sigma\big(D(y) - D(G(x))\big)\Big]
$$

where $D(y)$ and $D(G(x))$ are shorthand for $D(x, y)$ and $D(x, G(x))$, the discriminator's scores for the teacher's and the student's responses to prompt $x$. Let's break down this formula:

- $G$: The generator, which is the student LLM being trained.
- $D$: The discriminator, a model that assigns a scalar score to a (prompt, response) pair.
- $x$: A prompt sampled from the training dataset $\mathcal{D}$.
- $y$: The corresponding response generated by the teacher model.
- $G(x)$: The response generated by the student model for the same prompt $x$.
- $D(y)$: The scalar score assigned by the discriminator to the teacher's response.
- $D(G(x))$: The scalar score assigned by the discriminator to the student's response.
- $\sigma$: The sigmoid function, $\sigma(z) = 1 / (1 + e^{-z})$, which maps any real number to a value between 0 and 1.
- $-\log(\cdot)$: The negative log-likelihood loss, commonly used in binary classification.

The expression $\sigma\big(D(y) - D(G(x))\big)$ is based on the Bradley-Terry model. It represents the probability that the teacher's response $y$ is preferred over the student's response $G(x)$.

The game proceeds in two alternating steps:

- Minimizing $V$ w.r.t. $D$ (Training the Discriminator): The discriminator is trained to minimize the value function. This is equivalent to maximizing the probability of correctly identifying the teacher's response as better than the student's. In other words, $D$ learns to make the score difference $D(y) - D(G(x))$ as large and positive as possible.
- Maximizing $V$ w.r.t. $G$ (Training the Generator): The generator is trained to maximize the value function. This means $G$ tries to generate responses $G(x)$ that make the score difference as small as possible, ideally pushing $D(G(x))$ to be close to or even higher than $D(y)$, thereby "fooling" the discriminator.
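To make this concrete, here is a minimal PyTorch-style sketch (an illustration under the definitions above, not the authors' code; function and variable names are assumptions) of the two quantities that both players derive from the same Bradley-Terry score gap:

```python
import torch
import torch.nn.functional as F

def bt_logit(d_teacher: torch.Tensor, d_student: torch.Tensor) -> torch.Tensor:
    """Score gap D(y) - D(G(x)); its sigmoid is the Bradley-Terry probability
    that the teacher's response is preferred over the student's."""
    return d_teacher - d_student

def discriminator_loss(d_teacher: torch.Tensor, d_student: torch.Tensor) -> torch.Tensor:
    """-log sigma(D(y) - D(G(x))), averaged over the batch.
    Minimizing this pushes teacher scores up and student scores down."""
    return F.softplus(-bt_logit(d_teacher, d_student)).mean()  # softplus(-z) == -log(sigmoid(z))

def generator_reward(d_student: torch.Tensor) -> torch.Tensor:
    """The generator's per-response reward is its own discriminator score D(G(x));
    maximizing it shrinks the score gap and 'fools' the discriminator."""
    return d_student.detach()
```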
4.2.2. Training the Generator (Student LLM)
From the main minimax objective, the generator's objective is to maximize its own score as assigned by the discriminator:

$$
\max_{G}\; \mathbb{E}_{x \sim \mathcal{D}}\big[\, D\big(G(x)\big) \,\big]
$$

- Explanation: The generator $G$ is updated to maximize the expected score $D(G(x))$ that the current discriminator gives to its own generated outputs $G(x)$.

A critical challenge here is that the generation process $G(x)$ involves sampling tokens, which is a non-differentiable operation. Therefore, we cannot use standard backpropagation. The paper solves this by treating the problem as a reinforcement learning task:

- The student $G$ is the policy.
- The generated response $G(x)$ is the action sequence.
- The discriminator's score $D(G(x))$ is the reward.

The objective is optimized using a policy gradient algorithm, specifically GRPO (a variant of Proximal Policy Optimization, PPO), to update the generator's parameters $\theta$.
4.2.3. Training the Discriminator
The discriminator's objective is to minimize its loss, which is derived directly from the minimax objective:

$$
\mathcal{L}_{D} = \mathbb{E}_{(x, y) \sim \mathcal{D}}\Big[-\log \sigma\big(D(y) - D(G(x))\big)\Big]
$$

- Explanation: This is a binary classification-style loss. For each prompt $x$, the discriminator is given a pair of responses: the teacher's ($y$) and the student's ($G(x)$). It is trained to output a higher score for $y$ than for $G(x)$. This loss function effectively pushes the value of $D(y)$ up and the value of $D(G(x))$ down.

The discriminator is initialized with the student model's parameters, with an added linear head that projects the final hidden state of the sequence to a single scalar score.
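A minimal sketch of such a discriminator, assuming a Hugging Face-style backbone whose forward pass returns an object with `last_hidden_state` (class and variable names are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class ScalarDiscriminator(nn.Module):
    """Sketch of the described discriminator: a language-model backbone
    (e.g. initialized from the student LLM) plus a linear head that maps
    the final token's hidden state to a single scalar score."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state  # [B, T, H]
        last_idx = attention_mask.sum(dim=1) - 1                                 # last non-pad token
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]             # [B, H]
        return self.value_head(last_hidden).squeeze(-1)                          # [B] scalar scores
```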
4.2.4. Algorithm Flow and Warmup
The overall training process is summarized in Algorithm 1. A crucial component is the warmup stage.
Algorithm 1 GAD: Generative Adversarial Distillation

- Input: Distillation data $\mathcal{D}$, student LLM $G$, discriminator $D$.
- Output: Trained student model $G$.
- // Warmup Stage
- for each batch $(x, y) \in \mathcal{D}$ do
    - Update generator $G$ with cross-entropy loss on $(x, y)$ (this is standard SeqKD/SFT).
    - Update discriminator $D$ with the Bradley-Terry loss, using student responses $G(x)$ generated by the pre-trained student and teacher responses $y$.
- end for
- // GAD Training Stage
- repeat
    - for each batch $(x, y) \in \mathcal{D}$ do
        - Sample student responses $G(x)$.
        - Update generator $G$ using $D(G(x))$ as the reward for reinforcement learning.
        - Update discriminator $D$ with the Bradley-Terry loss on pairs $(y, G(x))$.
    - end for
- until convergence
- return $G$
The warmup stage is critical for stabilizing the adversarial training.
- Generator Warmup: The student model is first fine-tuned on the teacher's responses for one epoch using supervised learning (SeqKD). This gives the student a reasonable starting point, ensuring its initial generations are not completely random, which would make the discriminator's job too easy and provide uninformative gradients.
- Discriminator Warmup: The discriminator is also pre-trained to distinguish between the initial student's outputs and the teacher's. This ensures it can provide meaningful feedback from the very beginning of the GAD training stage. Without this warmup, the initial gap between the generator and teacher is too large, leading to unstable training.
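Putting the pieces together, the following schematic Python sketch mirrors Algorithm 1 (warmup followed by adversarial training). All callables (`sft_step`, `sample_responses`, `score`, `policy_gradient_step`, `bt_disc_step`, `converged`) are hypothetical placeholders for standard SFT, sampling, scoring, and update routines, injected as arguments so the skeleton stays framework-agnostic:

```python
def train_gad(dataset, generator, discriminator,
              sft_step, sample_responses, score,
              policy_gradient_step, bt_disc_step, converged,
              warmup_epochs: int = 1):
    """Schematic of Algorithm 1 (a sketch, not the authors' implementation)."""
    # --- Warmup stage: SeqKD for the generator, Bradley-Terry loss for the discriminator ---
    for _ in range(warmup_epochs):
        for prompts, teacher_responses in dataset:
            sft_step(generator, prompts, teacher_responses)           # cross-entropy on (x, y)
            student_responses = sample_responses(generator, prompts)  # G(x)
            bt_disc_step(discriminator, prompts, teacher_responses, student_responses)

    # --- GAD adversarial stage: alternate generator (RL) and discriminator updates ---
    while not converged(generator):
        for prompts, teacher_responses in dataset:
            student_responses = sample_responses(generator, prompts)   # on-policy samples
            rewards = score(discriminator, prompts, student_responses) # D(G(x))
            policy_gradient_step(generator, prompts, student_responses, rewards)
            bt_disc_step(discriminator, prompts, teacher_responses, student_responses)

    return generator
```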
4.2.5. Implementation with GRPO
The paper provides details on implementing GAD using the GRPO algorithm in Appendix A.1.
For each input prompt $x$, the student samples a group of $k$ responses $\{y_1, \dots, y_k\}$. The discriminator scores each of these, and the scores are normalized within the group to compute an advantage for each response:

$$
A_i = \frac{D(y_i) - \operatorname{mean}\big(\{D(y_j)\}_{j=1}^{k}\big)}{\operatorname{std}\big(\{D(y_j)\}_{j=1}^{k}\big)}
$$

- Advantage ($A_i$): This value indicates how much better or worse a particular response $y_i$ is compared to the average response from the student's current policy. Normalizing by the group mean and standard deviation helps stabilize the RL training.

The generator (student) is then trained to maximize the advantage-weighted log-likelihood of its sampled responses, roughly $\mathbb{E}\big[A_i \log \pi_\theta(y_i \mid x)\big]$. (Note: the paper shows a simplified version, but the core idea is to use advantage-weighted log-probabilities as in policy gradient methods.)

The discriminator is updated using all student responses against the single teacher response $y$:

$$
\mathcal{L}_{D} = \frac{1}{k} \sum_{i=1}^{k} -\log \sigma\big(D(y) - D(y_i)\big)
$$

This trains the discriminator to assign a higher score to the teacher's response than to any of the student's sampled responses.
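A minimal sketch of the two group-level computations described above, group-normalized advantages and the grouped Bradley-Terry discriminator loss (illustrative only; it assumes PyTorch tensors of discriminator scores and function names that are not from the paper):

```python
import torch
import torch.nn.functional as F

def group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantages for one prompt's group of k sampled responses:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def group_discriminator_loss(teacher_score: torch.Tensor,
                             student_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss pairing the single teacher response against every
    sampled student response in the group (softplus(-z) == -log(sigmoid(z)))."""
    return F.softplus(-(teacher_score - student_scores)).mean()

# Example with k = 4 sampled responses for one prompt:
rewards = torch.tensor([0.2, -0.1, 0.5, 0.0])       # discriminator scores D(y_i)
print(group_advantages(rewards))                     # zero-mean, unit-scale advantages
print(group_discriminator_loss(torch.tensor(1.2), rewards))
```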
5. Experimental Setup
5.1. Datasets
- Training Dataset: A 200,000-sample subset of LMSYS-Chat-1M-Clean. This dataset contains high-quality, real-world user conversations with various chatbots from the Chatbot Arena platform. The authors collected responses from the teacher model (GPT-5-Chat) for these prompts.
- In-Distribution Test Dataset: A reserved 500-sample test set from LMSYS-Chat-1M-Clean. This evaluates how well the student models perform on the same type of data they were trained on.
- Out-of-Distribution (OOD) Test Datasets: To measure generalization ability, the authors used three additional datasets from different domains:
    - Dolly: A dataset of instruction-following examples crowdsourced from Databricks employees (500 samples used).
    - SelfInst: A dataset of instructions generated by an LLM itself (GPT-3) (252 samples).
    - Vicuna: A benchmark set of 80 challenging questions designed to evaluate chatbot capabilities.

The choice of these datasets allows for a comprehensive evaluation of both in-domain performance and the ability to generalize to unseen instruction styles and topics.
5.2. Evaluation Metrics
The primary metric used for evaluation is the GPT-4o score.
- Conceptual Definition: This metric uses a powerful, unbiased LLM (GPT-4o) as a judge to evaluate the quality of a model's response. The judge is given the user's prompt, a reference answer (also generated by GPT-4o), and the student model's answer. It then scores the student's response on a scale of 1 to 10 based on criteria like helpfulness, relevance, accuracy, and detail. This LLM-as-a-judge approach has become a standard way to automate the evaluation of conversational AI, as it correlates well with human judgment. The evaluation prompt used is shown in Figure 8.
The paper also uses N-gram Overlap (F1 Score) in its analysis.
- Conceptual Definition: This metric measures the similarity between two texts by comparing the overlap of their contiguous sequences of words (N-grams). A high overlap suggests that the texts share similar phrasing and local patterns. The F1 score is the harmonic mean of precision and recall, providing a balanced measure. It is used to analyze whether a student model is simply memorizing the teacher's wording (high N-gram overlap) or learning deeper concepts.
- Mathematical Formula:
$$
\text{Precision} = \frac{|\text{N-grams}_{pred} \cap \text{N-grams}_{ref}|}{|\text{N-grams}_{pred}|}, \quad
\text{Recall} = \frac{|\text{N-grams}_{pred} \cap \text{N-grams}_{ref}|}{|\text{N-grams}_{ref}|}, \quad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$
- Symbol Explanation:
    - $\text{N-grams}_{pred}$: The set of unique N-grams in the predicted (student) text.
    - $\text{N-grams}_{ref}$: The set of unique N-grams in the reference (teacher) text.
    - $|\cdot|$: The number of elements in a set.
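For reference, a small self-contained implementation of the N-gram overlap F1 described above (whitespace tokenization is an assumption; the paper does not specify its tokenizer):

```python
def ngram_f1(pred: str, ref: str, n: int = 2) -> float:
    """N-gram overlap F1 between a predicted and a reference text (unique n-grams)."""
    def ngrams(text: str) -> set:
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    pred_set, ref_set = ngrams(pred), ngrams(ref)
    overlap = len(pred_set & ref_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(ref_set)
    return 2 * precision * recall / (precision + recall)

print(ngram_f1("the cat sat on the mat", "the cat lay on the mat", n=2))  # -> 0.6
```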
5.3. Baselines
The paper's method, GAD, is compared against two main baselines:
- Before Distillation: This refers to the original, pre-trained instruction-tuned student models (e.g., Qwen2.5-3B-Instruct) without any distillation. This baseline shows how much improvement is gained from the distillation process itself.
- SeqKD (Sequence-Level Knowledge Distillation): This is the standard and most common method for black-box distillation. It involves performing supervised fine-tuning (SFT) on the student model using the (prompt, teacher-response) pairs. It serves as a strong baseline representing the status quo.

The experiments use GPT-5-Chat as the primary proprietary teacher and various open-source models from the Qwen2.5 and Llama3 families as students.
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results, presented in Figure 1 and Table 2, consistently demonstrate the superiority of GAD over both the original instruction-tuned models and the SeqKD baseline.
The following figure compares the performance of GAD and SeqKD on both in-distribution (LMSYS) and out-of-distribution datasets.
Figure 1: average scores versus parameter count (3B, 7B, 14B) on the LMSYS-Chat benchmark and under out-of-distribution generalization. GAD (ours) clearly outperforms SeqKD across the Qwen2.5-Instruct students, indicating the effectiveness of GAD distillation.
Key observations from the chart and tables:
- Consistent Outperformance: Across all model sizes (3B, 7B, 14B) and families (Qwen2.5, Llama3), GAD achieves higher GPT-4o scores than SeqKD.
- Improved Model Efficiency: GAD enables smaller models to match the performance of larger models trained with SeqKD. For instance, Qwen2.5-3B-Instruct with GAD performs on par with Qwen2.5-7B-Instruct with SeqKD. This shows GAD is a more effective knowledge extractor.
- Closing the Gap with the Teacher: With a sufficiently large student model, GAD can nearly match the teacher's performance. Qwen2.5-14B-Instruct with GAD achieves a score (52.1) very close to the GPT-5-Chat teacher (51.7) on the LMSYS test set.
- Superior Generalization: The performance gap is even more pronounced on the out-of-distribution (OOD) datasets (Dolly, SelfInst, Vicuna). SeqKD provides marginal or sometimes even negative gains on these datasets, suggesting it overfits to the training data's style. In contrast, GAD shows robust improvements, indicating it learns more generalizable capabilities.
The following are the detailed results from Table 2 of the original paper:
| Model | Method | LMSYS | Dolly | SelfInst | Vicuna |
|---|---|---|---|---|---|
| GPT-5-Chat | Teacher | 51.7 | 49.8 | 49.7 | 49.9 |
| Qwen2.5-3B-Instruct | Before Distill. | 45.8 | 45.1 | 45.6 | 47.3 |
| | SeqKD | 47.5 | 44.8 | 45.7 | 48.0 |
| | GAD | 48.9 | 46.7 | 47.7 | 49.4 |
| Qwen2.5-7B-Instruct | Before Distill. | 48.7 | 47.6 | 48.3 | 49.1 |
| | SeqKD | 49.2 | 47.2 | 48.3 | 49.5 |
| | GAD | 50.8 | 48.5 | 50.1 | 51.4 |
| Qwen2.5-14B-Instruct | Before Distill. | 50.0 | 49.1 | 49.4 | 50.0 |
| | SeqKD | 50.6 | 48.2 | 49.4 | 49.7 |
| | GAD | 52.1 | 50.4 | 51.1 | 51.6 |
| Llama-3.2-3B-Instruct | Before Distill. | 44.0 | 45.8 | 47.0 | 46.9 |
| | SeqKD | 47.6 | 47.0 | 47.1 | 48.1 |
| | GAD | 48.1 | 48.5 | 49.1 | 48.9 |
| Llama-3.1-8B-Instruct | Before Distill. | 46.9 | 46.6 | 48.4 | 47.9 |
| | SeqKD | 49.7 | 47.7 | 48.7 | 48.7 |
| | GAD | 50.3 | 48.8 | 49.5 | 50.2 |
Human Evaluation: Human evaluation results, shown in Figure 3, corroborate the automatic evaluation findings. In pairwise comparisons, responses from GAD-trained models were preferred by human annotators over both the original models and SeqKD-trained models, with win rates consistently exceeding 50%.
Figure 3: human pairwise comparisons of GAD against SeqKD and the before-distillation models. Three subplots correspond to Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct, and Llama-3.1-8B-Instruct; green bars denote "GAD wins", yellow "tie", and red "GAD loses". The percentages reflect GAD's win rates in each comparison.
6.2. Analysis
The paper conducts several insightful analyses to understand why GAD works better than SeqKD.
6.2.1. SeqKD Overfits to Local Patterns
Figure 4 shows the N-gram F1 score between student and teacher responses. The SeqKD student has a higher N-gram overlap with the teacher than the GAD student. However, the GAD student achieves a higher GPT-4o score. This suggests that SeqKD tends to overfit to the teacher's exact wording and surface-level lexical patterns. In contrast, GAD's RL-based approach allows it to learn the teacher's global style and intent without being constrained to simple imitation, leading to higher-quality responses.
Figure 4 (bar chart): N-gram overlap with the teacher for SeqKD and GAD across N-gram sizes n; at small n the overlap approaches 1.
6.2.2. Mode-Seeking vs. Mode-Covering
A toy experiment (Figure 5) visualizes the different optimization behaviors of GAD and SeqKD. The teacher is a multi-modal distribution.
- SeqKD, optimizing via maximum likelihood, exhibits mode-covering behavior. It tries to spread its probability mass to cover all the modes (peaks) of the teacher distribution, resulting in an averaged, often bland, output.
- GAD, via its adversarial, RL-based objective, exhibits mode-seeking behavior. It learns to focus its probability mass on one of the high-quality modes of the teacher distribution, producing more coherent and sharp outputs.
Figure 5 (toy experiment): probability distributions over classes for the teacher (black), SeqKD (blue), and GAD (orange). GAD concentrates its probability mass on a single teacher mode (class 5), illustrating its mode-seeking behavior.
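To build intuition for this distinction, the following small NumPy example (an analogy, not the paper's toy experiment) contrasts a "covering" and a "seeking" student against a bimodal teacher. Forward KL, which maximum-likelihood training (SeqKD) effectively minimizes, favors the spread-out student, while reverse KL, used here only as a stand-in for a mode-seeking objective, favors the student that commits to a single mode:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Bimodal "teacher" over 5 classes: peaks at the first and last class.
teacher  = np.array([0.48, 0.02, 0.001, 0.019, 0.48])
# Mode-covering student: hedges across the whole support (what MLE-style fitting favors).
covering = np.array([0.25, 0.20, 0.10, 0.20, 0.25])
# Mode-seeking student: commits to a single teacher mode.
seeking  = np.array([0.85, 0.10, 0.01, 0.02, 0.02])

print("forward KL(teacher||student): covering=%.2f  seeking=%.2f"
      % (kl(teacher, covering), kl(teacher, seeking)))   # covering wins (~0.53 vs ~1.22)
print("reverse KL(student||teacher): covering=%.2f  seeking=%.2f"
      % (kl(covering, teacher), kl(seeking, teacher)))   # seeking wins (~1.07 vs ~0.61)
```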
6.2.3. On-Policy GAD vs. Off-Policy Discriminator
Figure 6 compares GAD (with its on-policy, co-evolving discriminator) to an alternative where a discriminator is trained once (off-policy) and then frozen to be used as a reward model. The off-policy approach quickly suffers from reward hacking: the student learns to exploit the fixed reward model by generating overly long responses that get a high score but are low quality. In contrast, GAD's on-policy discriminator adapts to the student, preventing such exploits and ensuring stable training over thousands of steps.
Figure 6: response length over training for off-policy (blue) versus on-policy (orange, ours) distillation. The "Reward Hacking" annotation marks where the off-policy run's response length grows sharply during training.
6.3. Ablation Studies / Parameter Analysis
The authors performed an ablation study on the warmup strategy (Table 3), confirming its importance.
The following are the results from Table 3 of the original paper:
| Method | LMSYS | Others |
|---|---|---|
| SeqKD | 49.2 | 48.3 |
| GAD | 50.8 | 50.0 |
| w/o Gen. Warmup | 49.7 | 49.7 |
| w/o Disc. Warmup | 49.0 | 47.7 |
- Without Generator Warmup: If the generator (student) is not warmed up with SFT, its initial quality is very low. The discriminator can easily tell it apart from the teacher, but the feedback signal is not very informative for nuanced improvement. This leads to a performance drop.
- Without Discriminator Warmup: If the discriminator is not warmed up, it starts with no ability to distinguish between student and teacher. This creates an imbalance where the generator has already been improved by SFT, but the discriminator provides random, unhelpful feedback, making the adversarial training ineffective. The model's performance barely improves beyond the SFT warmup phase.
This study confirms that jointly warming up both the generator and discriminator is crucial for creating a balanced and effective adversarial game from the start.
6.4. Additional Automatic Evaluation Results
The paper provides extended results in Table 4, including response lengths. It shows that SeqKD tends to generate shorter responses, closely mimicking the teacher's length, while GAD models produce slightly longer responses, more in line with their original behavior but infused with the teacher's style. This reinforces the idea that GAD integrates knowledge rather than just imitating.
The following are the results from Table 4 of the original paper:
| Model | Method | LMSYS Score | LMSYS Len. | Dolly Score | Dolly Len. | SelfInst Score | SelfInst Len. | Vicuna Score | Vicuna Len. |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5-Chat | Teacher | 51.7 | 329.1 | 49.8 | 148.5 | 49.7 | 188.5 | 49.9 | 378.6 |
| Qwen2.5-3B-I | Before Distill. | 45.8 | 338.9 | 45.1 | 219.2 | 45.6 | 279.3 | 47.3 | 520.9 |
| | SeqKD | 47.5 | 318.2 | 44.8 | 160.6 | 45.7 | 207.1 | 48.0 | 370.4 |
| | GAD | 48.9 | 438.0 | 46.7 | 239.5 | 47.7 | 281.8 | 49.4 | 517.9 |
| Qwen2.5-7B-I | Before Distill. | 48.7 | 345.2 | 47.6 | 220.0 | 48.3 | 259.1 | 49.1 | 501.7 |
| | SeqKD | 49.2 | 320.2 | 47.2 | 152.3 | 48.3 | 182.3 | 49.5 | 398.1 |
| | GAD | 50.8 | 414.0 | 48.5 | 225.1 | 50.1 | 288.5 | 51.4 | 511.9 |
| Qwen2.5-14B-I | Before Distill. | 50.0 | 322.1 | 49.1 | 201.6 | 49.4 | 252.0 | 50.0 | 475.4 |
| | SeqKD | 50.6 | 319.3 | 48.2 | 151.2 | 49.4 | 199.8 | 49.7 | 402.5 |
| | GAD | 52.1 | 438.9 | 50.4 | 262.6 | 51.1 | 284.1 | 51.6 | 499.6 |
| Llama-3.2-3B-I | Before Distill. | 44.0 | 334.4 | 45.8 | 174.5 | 47.0 | 265.6 | 46.9 | 437.6 |
| | SeqKD | 47.6 | 328.6 | 47.0 | 147.4 | 47.1 | 214.5 | 48.1 | 389.3 |
| | GAD | 48.1 | 371.5 | 48.5 | 232.3 | 49.1 | 275.7 | 48.9 | 461.8 |
| Llama-3.1-8B-I | Before Distill. | 46.9 | 329.2 | 46.6 | 184.7 | 48.4 | 276.2 | 47.9 | 487.8 |
| | SeqKD | 49.7 | 319.6 | 47.7 | 148.4 | 48.7 | 199.7 | 48.7 | 400.3 |
| | GAD | 50.3 | 394.6 | 48.8 | 200.6 | 49.5 | 263.8 | 50.2 | 504.2 |
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces and validates Generative Adversarial Distillation (GAD), a novel framework for the black-box distillation of LLMs. By framing the process as a minimax game between a student (generator) and an adaptive discriminator, GAD enables effective on-policy learning without requiring access to the teacher model's internal logits. The discriminator serves as a dynamic, on-policy reward model that provides stable feedback and avoids the reward hacking issues common in RL with fixed reward models. Extensive experiments demonstrate that GAD consistently surpasses the standard SeqKD baseline, showing superior performance, better generalization, and the ability to train smaller student models that can rival the capabilities of their proprietary teachers. GAD is established as a robust and highly effective paradigm for creating powerful open-source models from closed-source APIs.
7.2. Limitations & Future Work
The paper does not explicitly list limitations, but we can infer some potential areas for future exploration:
- Computational Cost: Adversarial training is inherently more complex and computationally expensive than simple supervised fine-tuning. GAD requires generating samples from the student model, running forward passes on the discriminator, and then performing backward passes for both models in each step. This increases training time and resource requirements compared to SeqKD.
- Training Stability: While the paper reports stable training and avoidance of reward hacking, GAN-style training is notoriously sensitive to hyperparameters and can sometimes be unstable. The proposed warmup strategy mitigates this, but its robustness across even more diverse tasks and models could be a subject for future research.
- Discriminator Architecture: The paper uses a discriminator initialized from the student model. The impact of different discriminator architectures or initialization strategies on final performance could be further investigated.
- Data Efficiency: The study uses 200K samples. Investigating how GAD performs in lower-data regimes would be valuable, as collecting large datasets of teacher responses can be costly.
The authors do not explicitly suggest future work, but natural extensions would be applying GAD to other modalities (e.g., vision-language models) or other generation tasks beyond instruction following.
7.3. Personal Insights & Critique
This paper presents a very elegant and powerful solution to a practical and important problem.
Positive Aspects:
- Conceptual Elegance: The core idea of using a GAN-like framework to create an on-policy reward signal is brilliant. It neatly sidesteps the central challenge of on-policy black-box learning: the lack of a supervision signal for self-generated data. The interpretation of the discriminator as a co-evolving reward model is a key insight that connects distillation to modern RL paradigms.
- Strong Solution to a Real-World Problem: As the most powerful LLMs are likely to remain proprietary, methods like GAD are crucial for the broader community to build and democratize capable AI. It provides a principled way to "learn from the best" without needing their secrets.
- Robustness and Stability: The demonstration that GAD avoids reward hacking, a major pitfall of RLHF with static reward models, is a significant practical advantage. This makes the training process more reliable.
- Thorough Evaluation: The paper's experiments are comprehensive, covering multiple model families, sizes, in-distribution and out-of-distribution datasets, and both automatic and human evaluation. The analyses provide convincing evidence for why GAD works.
Potential Critique & Further Thoughts:
- Reliance on LLM-as-a-Judge: The primary evaluation metric is the GPT-4o score. While this is a standard and effective method, it relies on another black-box LLM which has its own biases. The results are strong, but they are relative to the judge's preferences. The inclusion of human evaluation helps to mitigate this concern.
- "Fictional" Models: The use of future-dated model names like GPT-5-Chat and Qwen2.5 makes the paper feel slightly speculative, though this is likely a stylistic choice to position the work as forward-looking. The underlying methodology is sound regardless of the specific model names used.
- Transferability: The method appears highly general. It could potentially be applied to distill other capabilities beyond instruction following, such as specific writing styles, reasoning abilities (by distilling from chain-of-thought outputs), or even coding proficiency. This versatility is one of GAD's most promising aspects.
Overall, this paper is a significant contribution to the field of knowledge distillation. It offers a fresh perspective and a robust technical solution that could become a new standard for black-box model training.