DistillSeq: A Framework for Safety Alignment Testing in Large Language Models using Knowledge Distillation
TL;DR Summary
DistillSeq uses knowledge distillation to transfer safety knowledge from large to small models, combines syntax-tree and LLM-based malicious query generation, and employs sequential filtering, boosting attack success rates by 93% while reducing testing costs for LLM safety alignm
Abstract
Large Language Models (LLMs) have showcased their remarkable capabilities in diverse domains, encompassing natural language understanding, translation, and even code generation. The potential for LLMs to generate harmful content is a significant concern. This risk necessitates rigorous testing and comprehensive evaluation of LLMs to ensure safe and responsible use. However, extensive testing of LLMs requires substantial computational resources, making it an expensive endeavor. Therefore, exploring cost-saving strategies during the testing phase is crucial to balance the need for thorough evaluation with the constraints of resource availability. To address this, our approach begins by transferring the moderation knowledge from an LLM to a small model. Subsequently, we deploy two distinct strategies for generating malicious queries: one based on a syntax tree approach, and the other leveraging an LLM-based method. Finally, our approach incorporates a sequential filter-test process designed to identify test cases that are prone to eliciting toxic responses. Our research evaluated the efficacy of DistillSeq across four LLMs: GPT-3.5, GPT-4.0, Vicuna-13B, and Llama-13B. In the absence of DistillSeq, the observed attack success rates on these LLMs stood at 31.5% for GPT-3.5, 21.4% for GPT-4.0, 28.3% for Vicuna-13B, and 30.9% for Llama-13B. However, upon the application of DistillSeq, these success rates notably increased to 58.5%, 50.7%, 52.5%, and 54.4%, respectively. This translated to an average escalation in attack success rate by a factor of 93.0% when compared to scenarios without the use of DistillSeq. Such findings highlight the significant enhancement DistillSeq offers in terms of reducing the time and resource investment required for effectively testing LLMs.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: DistillSeq: A Framework for Safety Alignment Testing in Large Language Models using Knowledge Distillation
- Authors:
- Mingke Yang (ShanghaiTech University)
- Yuqi Chen (ShanghaiTech University)
- Yi Liu (Nanyang Technological University)
- Ling Shi (Nanyang Technological University)
- Journal/Conference: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA '24).
- Venue Reputation: ISSTA is a premier, highly respected conference in the field of software engineering, specifically focusing on software testing and analysis. Publication at ISSTA indicates a work of high quality and significant contribution to the field.
- Publication Year: 2024
- Abstract: The paper addresses the high computational cost of safety testing for Large Language Models (LLMs). The authors propose DistillSeq, a framework that first uses knowledge distillation to transfer an LLM's moderation knowledge to a smaller, more efficient model. This small model then acts as a filter. The framework generates malicious queries using two methods (syntax-tree based and LLM-based) and uses the distilled model to filter out ineffective queries before testing the target LLM. Experiments on GPT-3.5, GPT-4, Vicuna-13B, and Llama-13B show that DistillSeq significantly increases the attack success rate (an average increase of 93.0%), thereby reducing the resources needed for effective safety testing.
- Original Source Link:
- Official Source: https://arxiv.org/abs/2407.10106
- PDF Link: https://arxiv.org/pdf/2407.10106v4.pdf
- Publication Status: The paper is a preprint available on arXiv and is slated for publication in the proceedings of ISSTA '24.
2. Executive Summary
-
Background & Motivation (Why):
- Core Problem: While Large Language Models (LLMs) are powerful, they can be manipulated to generate harmful, toxic, or unethical content. To prevent this, developers must rigorously test the "safety alignment" of these models, which ensures they refuse to comply with malicious requests. However, this testing process is extremely expensive and time-consuming, requiring millions of interactions (queries) with the LLM, which incurs significant computational and financial costs.
- Gap in Prior Work: Previous research has focused heavily on developing more sophisticated attack methods (
jailbreaks) to bypass safety measures. This often results in a "brute-force" approach, where many generated queries are ineffective, leading to wasted resources. There is a need for a more efficient, cost-effective testing strategy. - Innovation: The paper's key innovation is to shift the focus from generating more attacks to intelligently filtering attacks before testing. It introduces DistillSeq, a framework that first learns the target LLM's safety behavior using a cheap, small model (via
knowledge distillation) and then uses this small model as a pre-screening filter to identify the most promising malicious queries. This drastically reduces the number of expensive interactions with the actual LLM.
-
Main Contributions / Findings (What):
- A Novel Knowledge Distillation Method for Safety Alignment: The paper proposes a technique to "distill" the complex, black-box moderation knowledge of a large, powerful LLM into a small, fast, and efficient student model.
- An Effective and Automated Testing Framework (DistillSeq): The paper introduces a complete, two-stage framework that first generates potentially malicious queries and then uses the distilled model in a sequential filter-test process to select only the most potent queries for testing the target LLM.
- Strong Empirical Evidence of Effectiveness: The authors evaluated DistillSeq on four popular LLMs and showed that it dramatically improves testing efficiency. The attack success rate nearly doubled on average (a 93% increase), demonstrating that the framework successfully reduces the time and cost required to find safety vulnerabilities in LLMs.
3. Prerequisite Knowledge & Related Work
-
Foundational Concepts:
- Large Language Models (LLMs): These are massive artificial intelligence models (e.g., GPT-4, Llama) trained on vast amounts of text data. They can understand, generate, and translate human-like language for a wide range of tasks.
- Safety Alignment: This is the process of training and fine-tuning an LLM to align its behavior with human values, ethics, and safety guidelines. A key goal is to make the model refuse to generate harmful, biased, or dangerous content.
- Jailbreaking: This term refers to the use of clever prompts or inputs to trick an LLM into bypassing its safety alignment and generating content it is normally supposed to refuse. As shown in Image 2, a direct malicious query is refused, but when wrapped in a "jailbreak prompt" that puts the LLM in a fictional role, it complies.
- Knowledge Distillation: A machine learning technique where knowledge from a large, complex "teacher" model is transferred to a smaller, simpler "student" model. The student model learns to mimic the teacher's outputs, allowing it to perform a similar task with much lower computational cost.
-
Previous Works:
- The paper builds upon a body of research focused on attacking LLMs to test their safety.
- Wei et al. [39] and Sun et al. [35] demonstrated the vulnerability of LLMs to jailbreak prompts and categorized various attack types.
- Liu et al. [26] and Zou et al. [45] automated the generation of these attacks, using LLMs to create jailbreak prompts or adding adversarial suffixes.
- These prior works primarily focused on improving the effectiveness of attacks, often overlooking the cost and efficiency of the testing process. They generate a large volume of queries, many of which are unsuccessful, leading to high resource consumption.
-
Differentiation:
- DistillSeq differentiates itself by explicitly targeting the cost-efficiency of safety testing. Instead of just developing better attacks, it introduces a pre-filtering step.
- While other methods interact directly and frequently with the expensive target LLM, DistillSeq's core idea is to perform the bulk of the evaluation on a cheap, distilled "proxy" model. This proxy filters out low-quality test cases, ensuring that only the most likely-to-succeed queries are sent to the target LLM, thus saving significant time and money.
4. Methodology (Core Technology & Implementation)
The DistillSeq framework operates in two main stages, as illustrated in Image 1.

Part 1: Building the Distillation Model
The first stage focuses on training a small "student" model to predict whether a malicious query will cause the large "teacher" LLM to generate a toxic response.
- Principles: The core idea is
response-based knowledge distillation. Since the internal workings of many powerful LLMs (like GPT-4) are a black box, the framework can only learn from the inputs it sends and the outputs it receives. The goal is to train a small model, , to mimic the safety-aligned behavior of the large LLM, . - Steps & Procedures:
- Create a Training Dataset: The authors generate a dataset by combining
jailbreak promptswithmalicious queriesfrom theRealToxicPromptsdataset. - Query the Teacher LLM: These combined inputs are fed to the target LLM (e.g., GPT-3.5).
- Label the Responses: The LLM's responses are evaluated for toxicity using the
Perspective API. A response with a toxicity score above 0.7 is labeled "toxic" (label 1), and below 0.7 is "non-toxic" (label 0). - Train the Student Model: A smaller model (e.g., BERT, RoBERTa) is trained on the malicious queries (as input) and the corresponding toxicity labels (as output). The training process aims to minimize the difference between the student model's predictions and the teacher LLM's actual behavior.
- Create a Training Dataset: The authors generate a dataset by combining
- Mathematical Formulation: The distillation process is framed as an optimization problem to minimize the
distillation loss() between the teacher's output and the student's prediction over the set of malicious queries ().- : The distilled student model being trained.
- : The output (toxicity label) of the teacher LLM for a given malicious query .
- : The predicted output of the student model for the same query .
- : The domain of malicious queries.
- : The loss function, which the paper specifies as
cross-entropy loss, measuring the discrepancy between the two models' predictions.
Part 2: Generating and Filtering New Malicious Queries
The second stage uses the trained distilled model to efficiently find new, effective malicious queries.
-
Steps & Procedures:
- Generate New Queries: The framework uses two methods to create a large pool of new, potentially malicious queries.
- Filter with Distilled Model: The distilled model () from Part 1 is used to predict the toxicity potential of each newly generated query. Queries predicted to be ineffective are discarded.
- Test with Target LLM: Only the queries that pass the filter are sent to the expensive target LLM for final testing.
-
Query Generation Methods:
-
Syntax Tree-Based Method: This method generates new queries by manipulating their grammatical structure, not just individual words.
-
It parses two malicious queries, and , into syntax trees, and .
-
It identifies the most "important" subtrees in using an importance score, . A subtree is considered important if its removal significantly changes the distilled model's toxicity prediction.
-
: The importance score of subtree .
-
: The distilled model's prediction for the full sentence.
-
: The prediction after removing subtree .
-
: The length of the subtree, used to penalize long subtrees and favor concise, meaningful ones.
-
The method then replaces important subtrees in with subtrees of the same syntactic category (e.g., replacing one verb phrase with another) from . Image 3 provides a clear example where "help others out of trouble" is replaced by "destroy other people's property".
该图像是一个示意图,展示了图3中的语法树替换示例。图中展示了输入文本通过语法树替换生成恶意查询的过程,并通过蒸馏模型识别该文本的安全性,标记为“Negative!”或“Positive!”。
-
-
LLM-Based Method: This method uses a fine-tuned LLM to generate new malicious queries.
- Phase 1 (Understanding): A Vicuna-13B model is fine-tuned on the
RealToxicPromptsdataset to better understand the style and content of malicious language. - Phase 2 (Generation): The fine-tuned Vicuna model is given a few examples of malicious queries and prompted to generate new ones that are similar in style, tone, and theme.
- Phase 1 (Understanding): A Vicuna-13B model is fine-tuned on the
-
5. Experimental Setup
- Datasets:
- Jailbreak Prompts: Sourced from
Jailbreak Chat, an online repository. The top-voted prompts from three categories (Pretending,Attention Shifting,Privilege Escalation) were selected. - Malicious Queries: The
RealToxicPromptsdataset and theJigsaw Toxic Comment Classification Challengedataset were used. These contain text snippets that are known to be toxic or harmful.
- Jailbreak Prompts: Sourced from
- Evaluation Metrics:
- Agreement: Measures the consistency between the distilled model's predictions and the target LLM's actual responses.
- Conceptual Definition: It calculates the percentage of times the small, distilled model correctly predicts whether the large LLM will produce a toxic or non-toxic response for a given query.
- Mathematical Formula:
- Symbol Explanation:
Consistent Classificationsrefers to cases where both the distilled model and the LLM classify a query's response as toxic, or both classify it as non-toxic.
- Loss: The standard machine learning metric (cross-entropy loss in this case) used during training to measure the error between the student model's predictions and the true labels from the teacher LLM. Lower loss is better.
- Attack Success Rate (ASR): The primary metric for evaluating the final testing efficiency.
- Conceptual Definition: It measures the percentage of queries that successfully trick the LLM into generating a toxic response.
- Mathematical Formula (without filter):
- Mathematical Formula (with DistillSeq filter):
- Symbol Explanation:
Effective queriesare those that result in a toxic response from the LLM. The second formula is crucial as it shows the success rate on the filtered, high-potential subset of queries, which is expected to be much higher.
- Agreement: Measures the consistency between the distilled model's predictions and the target LLM's actual responses.
- Baselines:
- Simple Methods:
random word replacement,Textfooler(a more sophisticated word-substitution method). - State-of-the-Art (SOTA) Tools:
JailbreakingLLMs,GPTFuzzer,FuzzLLM,HouYi. These are established frameworks for generating attacks against LLMs.
- Simple Methods:
6. Results & Analysis
-
RQ1 (Effectiveness): Can the framework distill the LLM's moderation knowledge?
-
Findings: Yes. As shown in the transcribed Table 2 below, knowledge distillation (
A.K.D.) dramatically improved the performance of all student models compared to their baseline performance before distillation (B.K.D.). -
Analysis: For example,
RoBERTa's agreement with the LLM jumped from 53.12% to 96.73% on the training set, and its loss dropped from 0.61 to 0.01. This indicates that the small models successfully learned to mimic the LLM's moderation behavior.RoBERTaandDeBERTawere the best-performing student model architectures.This is a manual transcription of Table 2 from the paper.
Model Training Set B.K.D. Agreement Training Set A.K.D. Agreement Training Set B.K.D. Loss Training Set A.K.D. Loss Test Set B.K.D. Agreement Test Set A.K.D. Agreement Test Set B.K.D. Loss Test Set A.K.D. Loss BERT 42.73% 93.58% 0.71 0.03 44.20% 91.90% 0.68 0.05 RoBERTa 53.12% 96.73% 0.61 0.01 56.30% 93.60% 0.55 0.02 DeBERTa 54.23% 95.67% 0.60 0.02 53.80% 94.00% 0.57 0.01 ERNIE 62.92% 94.23% 0.48 0.03 46.30% 92.90% 0.64 0.04
-
-
RQ2 (Tradeoff): How to balance performance and dataset size?
-
Findings: The performance of the distilled model improves as the training dataset size increases, but with diminishing returns.
-
Analysis: As seen in Image 4, the agreement score rises sharply at first but then plateaus. The authors calculate that the cost to achieve a 1% improvement in agreement becomes prohibitively expensive at larger dataset sizes. Based on this cost-benefit analysis, they conclude that using 4,000 samples provides a good balance between performance and the cost of data collection for distillation.
该图像是一个折线图,展示了不同训练样本数量下,四种蒸馏模型(BERT、RoBERTA、DeBERTA、ERNIE)与GPT-3.5在一致性指标上的表现趋势。
-
-
RQ3 (Generality): Is the framework effective on new data and different LLMs?
-
Findings: Yes, the framework is general, though performance varies.
-
Analysis: The transcribed tables below show that knowledge distillation consistently improves agreement across all four tested LLMs (Table 3) and on a new dataset (
Jigsaw, Table 4). However, the overall agreement is lower than in RQ1. The authors attribute this to potential overfitting, the inherent randomness of LLMs, and the limitations ofresponse-baseddistillation. Despite this, the average improvement in agreement was still substantial (~40%), confirming the general applicability of the approach.This is a manual transcription of Table 3 from the paper.
Model GPT-3.5 B.K.D. GPT-3.5 A.K.D. GPT-4.0 B.K.D. GPT-4.0 A.K.D. Vicuna-13B B.K.D. Vicuna-13B A.K.D. LLama-13B B.K.D. LLama-13B A.K.D. Average B.K.D. Average A.K.D. BERT 17.30% 66.50% 13.40% 57.60% 25.60% 61.90% 16.80% 63.50% 18.28% 62.38% RoBERTa 33.60% 69.70% 24.20% 63.00% 14.40% 68.50% 29.50% 62.30% 25.43% 65.88% DeBERTa 26.70% 68.50% 19.60% 64.80% 19.50% 65.10% 20.10% 68.20% 21.48% 66.65% ERNIE 24.50% 71.80% 21.50% 59.30% 23.20% 55.10% 34.60% 52.10% 25.95% 59.58%
This is a manual transcription of Table 4 from the paper.
Model B.K.D. Loss A.K.D. Loss BERT 0.78 0.17 RoBERTa 0.76 0.12 DeBERTa 0.64 0.19 ERNIE 0.85 0.23 -
-
RQ4 (Comparisons): How does DistillSeq perform against state-of-the-art tools?
- Findings: DistillSeq significantly outperforms all baseline and SOTA methods.
- Analysis: The paper states that with the filter, the
LLM-basedDistillSeq method achieved an average ASR of 58.25%, and thesyntax-basedmethod achieved 49.78%. In contrast, none of the other methods (without the filter) could even reach a 20% average ASR. Crucially, the paper also shows that applying the DistillSeq filter to other SOTA tools also dramatically boosted their ASR (e.g.,Randomfrom 1.48% to 40.35%). This demonstrates that the filtering mechanism is a powerful, standalone contribution that can enhance existing testing methods.
-
RQ5 (Configuration):
- Note: The provided text is incomplete and cuts off before this section. This question cannot be answered based on the available content.
7. Conclusion & Reflections
-
Conclusion Summary: The paper successfully presents DistillSeq, an innovative and practical framework for making LLM safety testing more cost-effective. By using knowledge distillation to create a lightweight pre-screening filter, the framework ensures that only high-potential malicious queries are tested against expensive LLMs. The empirical results provide strong evidence that this approach nearly doubles the attack success rate, leading to significant savings in time and computational resources.
-
Limitations & Future Work:
- Performance on Unseen Data: As noted in RQ3, the distilled model's performance drops when faced with data from a different distribution than what it was trained on. Future work could explore using more diverse training datasets or more robust distillation techniques to improve generalization.
- Reliance on Teacher LLM's Randomness: The framework learns from a teacher LLM whose responses can be stochastic (random). This inherent variability can introduce noise into the training labels, limiting the upper bound of the distilled model's performance.
- Dependence on
Perspective API: The "ground truth" for toxicity is determined by an external tool (Perspective API). Any biases or inaccuracies in this API will be propagated into the distilled model.
-
Personal Insights & Critique:
- Novelty and Impact: The core contribution of this paper is not just another attack method, but a paradigm shift in how to approach LLM testing. The focus on efficiency and cost-reduction is highly pragmatic and valuable for developers and organizations deploying LLMs in the real world. The idea of using a cheap "proxy" model is clever and widely applicable beyond just safety testing.
- Practicality: The framework's ability to be integrated with existing SOTA tools (as shown in RQ4) is a major strength. It doesn't seek to replace other methods but to augment them, making it a highly practical and adoptable solution.
- Open Questions: Could this distillation approach be used to learn other "soft" behaviors of LLMs, such as their stylistic preferences, political biases, or sense of humor? The framework provides a blueprint for creating cheap proxy models for a variety of LLM characteristics, which is a promising direction for future research.
Similar papers
Recommended via semantic vector search.