Multi-Turn Jailbreaking Large Language Models via Attention Shifting
TL;DR Summary
This work reveals that multi-turn jailbreaks succeed by shifting LLM attention away from harmful keywords, and proposes ASJA, which uses a genetic algorithm to iteratively fabricate dialogue history, effectively inducing harmful outputs and boosting attack success rates.
Abstract
Xiaohu Du 1,2,3,4, Fan Mo 7, Ming Wen 1,2,3,4,6,*, Tu Gu 7, Huadi Zheng 7, Hai Jin 2,3,5, Jie Shi 7

1 School of Cyber Science and Engineering, Huazhong University of Science and Technology (HUST); 2 National Engineering Research Center for Big Data Technology and System; 3 Services Computing Technology and System Lab; 4 Hubei Engineering Research Center on Big Data Security and Hubei Key Laboratory of Distributed System Security; 5 Cluster and Grid Computing Lab, School of Computer Science and Technology, HUST; 6 JinYinHu Laboratory; 7 Huawei International

{xhdu, mwenaa, hjin}@hust.edu.cn, {mofan10, gu.tu, zhenghuadi, shi.jie1}@huawei.com

Large Language Models (LLMs) have achieved significant performance in various natural language processing tasks but also pose safety and ethical threats, thus requiring red teaming and alignment processes to bolster their safety. To effectively exploit these aligned LLMs, recent studies have introduced jailbreak attacks based on multi-turn dialogues. These attacks aim to prompt LLMs to generate harmful or biased content by guiding
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Multi-Turn Jailbreaking Large Language Models via Attention Shifting
1.2. Authors
Xiaohu Du, Fan Mo, Ming Wen, Tu Gu, Huadi Zheng, Hai Jin, and Jie Shi. The authors are affiliated with several institutions, including the Huazhong University of Science and Technology (HUST) and Huawei International. This suggests a collaboration between academic researchers and industry professionals, which is common in applied AI security research where academic insights are tested against industry-scale models and challenges.
1.3. Journal/Conference
The paper lists a publication date of April 11, 2025; a specific venue is not stated. The format and content are typical of a submission to a top-tier computer security conference (such as USENIX Security, ACM CCS, or NDSS) or a leading AI conference (such as NeurIPS, ICLR, or ACL). These venues are highly competitive and considered prestigious in the fields of cybersecurity and artificial intelligence.
1.4. Publication Year
2025
1.5. Abstract
The abstract introduces the problem of Large Language Model (LLM) safety, noting that while alignment processes are in place, vulnerabilities remain, especially in multi-turn dialogues. Existing multi-turn jailbreak attacks are effective but lack a theoretical explanation for their success, often just extending single-turn strategies. This paper provides a novel analysis, finding that successful multi-turn jailbreaks work by dispersing the LLM's attention away from harmful keywords, particularly towards the model's own historical responses. Based on this insight, the authors propose ASJA (Attention Shifting for JAilbreaking), a new attack that fabricates dialogue history using a genetic algorithm to intentionally shift the LLM's attention. The abstract claims that extensive experiments show ASJA surpasses existing methods in attack effectiveness, stealthiness, and efficiency, and highlights the need to improve the robustness of LLM attention mechanisms as a defense strategy.
1.6. Original Source Link
The provided link is /files/papers/690edf32a05cc8091a1130b2/paper.pdf. This appears to be a local file path or an identifier from a paper repository (like an internal server or a preprint archive). Its status is likely a preprint or a paper submitted for review, not yet officially published in a conference proceedings or journal digital library.
2. Executive Summary
2.1. Background & Motivation
- Core Problem: Large Language Models (LLMs) are equipped with safety mechanisms (alignment) to prevent them from generating harmful, unethical, or dangerous content. However, these defenses can be bypassed through malicious prompts, a process known as "jailbreaking." While early jailbreaks focused on single, cleverly crafted prompts (single-turn), recent work has shown that engaging an LLM in a multi-turn conversation is a more effective way to elicit harmful responses.
- Importance and Gaps: The problem is critical because as LLMs become more integrated into daily life, their potential for misuse grows. The "why" behind the success of multi-turn jailbreaks has remained largely unexplored. Existing methods, while effective, are often brute-force extensions of single-turn techniques; they focus on optimizing the user's queries without deeply understanding the LLM's internal state changes during a conversation. This lack of fundamental understanding hinders the development of robust, principled defenses.
- Innovative Idea: This paper's central innovative idea is to move beyond just crafting malicious prompts and instead investigate the LLM's internal attention mechanism during a dialogue. The authors hypothesize that a multi-turn conversation provides a large context that can dilute or "shift" the model's attention away from specific harmful keywords in the final malicious query. If the attention on these keywords falls below a certain threshold, the model's safety filter may not trigger, leading to a successful jailbreak.
2.2. Main Contributions / Findings
- Primary Contributions:
  - First-of-its-kind Analysis: The paper presents the first empirical study analyzing the differences in LLM attention distribution between successful and failed multi-turn jailbreak attempts.
  - Novel Attack Method (ASJA): Based on their findings, the authors propose ASJA (Attention Shifting for JAilbreaking). Unlike previous methods that only optimize user queries, ASJA fabricates the entire dialogue history, including both user queries and the model's own (simulated) past responses, to strategically manipulate the target LLM's attention.
  - Efficient Optimization: ASJA employs a genetic algorithm to efficiently search the vast space of possible dialogue histories, making the attack practical and effective.
- Key Findings:
  - Attention Shifting is Key: Successful multi-turn jailbreaks significantly reduce the LLM's attention on harmful keywords in the final query compared to failed attempts. The attention is redirected towards the conversational history, especially the LLM's previous responses.
  - Superior Performance: ASJA significantly outperforms state-of-the-art jailbreaking methods. On average, it increases the rate of generating harmful responses by 44.91% and improves the relevance of those responses to the original query by 34.02% compared to the best baseline.
  - Practical Advantages: ASJA is not only more effective but also produces more "stealthy" (natural-sounding) prompts, is more efficient (requires fewer queries to succeed), and its attacks can be transferred from open-source models to powerful closed-source models like GPT-4o.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Large Language Models (LLMs): LLMs are advanced AI models, most commonly based on the Transformer architecture, that are trained on enormous amounts of text data. This initial "pre-training" phase allows them to learn grammar, facts, reasoning abilities, and language patterns. After pre-training, they are often "fine-tuned" on specific datasets to follow instructions, engage in conversation, and adhere to safety guidelines.
- Attention Mechanism: This is the core component of the Transformer architecture and the central concept of this paper. The attention mechanism allows a model to weigh the importance of different words (or tokens) in the input when producing an output. Instead of treating the entire input as a single chunk of information, the model can "pay attention" to the most relevant parts.
- Crucial Background - Scaled Dot-Product Attention: The most common form of attention, which this paper implicitly analyzes, is calculated using Queries (Q), Keys (K), and Values (V). For a given word being generated (represented by a Query vector), it is compared against all words in the input context (represented by Key vectors). The similarity scores are then used to create a weighted sum of the input words' Value vectors. The formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
  - Symbol Explanation:
    - $Q$ (Query): A matrix representing the set of queries. In a decoder, this is the token being generated.
    - $K$ (Key): A matrix representing the set of keys. These are derived from the input tokens.
    - $V$ (Value): A matrix representing the set of values, also derived from the input tokens.
    - $QK^T$: This dot product computes the similarity (or "attention score") between each query and all keys.
    - $\sqrt{d_k}$: A scaling factor (the square root of the dimension of the key vectors) to stabilize gradients during training.
    - $\mathrm{softmax}$: A function that converts the raw scores into a probability distribution (weights that sum to 1), indicating how much attention to pay to each input token. This paper's innovation is to analyze these attention weights to understand and exploit the model's focus during a conversation. A worked sketch of this computation follows.
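To make the formula concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The function, shapes, and toy inputs are illustrative assumptions, not taken from the paper; the returned weight matrix is the kind of attention distribution the paper analyzes.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity of each query with all keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights                                # output and the attention weights

# Toy example: one query token attending over 3 context tokens, dimension 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, attn = scaled_dot_product_attention(Q, K, V)
print(attn)  # weights over the 3 context tokens, summing to 1
```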
- Jailbreaking: This is the practice of crafting special inputs (prompts) that trick an LLM into violating its own safety policies. The goal is to make the model generate content it was explicitly trained to refuse, such as instructions for illegal activities, hate speech, or private information.
- Red Teaming & Alignment: These are two sides of the same coin in LLM safety.
- Alignment: The process of fine-tuning an LLM to align its behavior with human values and safety rules. This often involves techniques like Reinforcement Learning from Human Feedback (RLHF), where the model is rewarded for helpful and harmless responses and penalized for unsafe ones.
- Red Teaming: The process of actively trying to break the model's safety alignment, essentially performing structured jailbreaking attacks. The vulnerabilities found during red teaming are used as feedback to improve the next round of alignment.
- Genetic Algorithm (GA): A type of optimization algorithm inspired by Charles Darwin's theory of natural evolution. It is used here to find the most effective dialogue history for jailbreaking. The basic steps are:
- Population: Start with a set of random solutions (in this case, dialogue histories).
- Fitness: Evaluate how "good" each solution is using a fitness function (here, how much it lowers attention on the final query).
- Selection: Select the best solutions (the "fittest") to be "parents" for the next generation.
- Crossover: Combine parts of two parents to create a new solution ("child").
- Mutation: Randomly change a small part of a solution to introduce new variations. This cycle repeats, gradually "evolving" a highly effective solution; a minimal skeleton of the loop is sketched below.
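The following is a generic genetic-algorithm skeleton, a minimal sketch under the assumption that `fitness`, `crossover`, and `mutate` are placeholder callables supplied by the caller; lower fitness is treated as better, matching the attention-minimization objective described later in the methodology.

```python
import random

def genetic_search(init_population, fitness, crossover, mutate,
                   generations=10, elite_k=1):
    """Generic genetic-algorithm loop; lower fitness is better (minimization)."""
    population = list(init_population)
    for _ in range(generations):
        scored = sorted(population, key=fitness)
        elites = scored[:elite_k]                          # keep the best solutions unchanged
        parents_pool = scored[: max(2, len(scored) // 2)]  # bias selection toward the better half
        children = []
        while len(children) < len(population) - elite_k:
            p1, p2 = random.sample(parents_pool, 2)
            children.append(mutate(crossover(p1, p2)))     # recombine, then perturb
        population = elites + children
    return min(population, key=fitness)
```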
3.2. Previous Works
The paper categorizes prior jailbreaking research into two main types:
- Single-round jailbreaks: These attacks try to succeed with a single, complex prompt.
  - GCG (Greedy Coordinate Gradient): A white-box attack that uses gradient information from the model to find a short, gibberish-like adversarial suffix that, when appended to a prompt, causes a jailbreak.
  - AutoDAN, GPTFUZZER, FuzzLLM: These methods use search or "fuzzing" techniques (inspired by traditional software testing) to automatically discover and evolve effective jailbreak prompts without needing gradient access.
  - PAP (Persuasive Adversarial Prompts): Instead of technical tricks, this method uses natural language to try and persuade the LLM to ignore its safety rules, for example, by creating an elaborate fictional scenario.
- Multi-turn jailbreaks: These attacks use a sequence of prompts in a conversation to achieve their goal.
  - PAIR (Prompt Automatic Iterative Refinement): Uses an "attacker" LLM to generate a jailbreak prompt and a "refiner" LLM to analyze the target model's refusal and suggest improvements for the next turn.
  - Crescendo: A gradual escalation method where the conversation starts with benign topics and slowly introduces more sensitive elements, conditioning the model to eventually comply with a harmful request.
  - REDEVAL: Sets up a role-playing scenario where the target model is asked to complete a dialogue between a "harmful" agent and a "helpful but unsafe" agent.
  - CoA (Chain of Attack): Similar to Crescendo, this method guides the model from secure to insecure scenarios step-by-step.
3.3. Technological Evolution
The field of jailbreaking has evolved rapidly. Initially, it was a manual process where users discovered tricks like DAN ("Do Anything Now") through trial and error. This evolved into automated single-turn attacks that optimized prompts using gradients (GCG) or search algorithms (AutoDAN). As LLM defenses improved against these single-shot attacks, the focus shifted to multi-turn attacks (PAIR, Crescendo), which mimic human conversational patterns to be more effective. This paper marks the next step in this evolution: moving from simply optimizing the content of the conversation to analyzing and manipulating the internal mechanics (the attention mechanism) of the LLM itself during the conversation.
3.4. Differentiation Analysis
The core innovation of ASJA is its fundamental shift in attack philosophy compared to previous work:
- Target of Manipulation:
  - Previous Methods (PAIR, AutoDAN, etc.): Focus on optimizing the user's queries. They try to find the perfect sequence of questions to ask.
  - ASJA: Focuses on fabricating the entire dialogue history, including both the user's queries and the LLM's own supposed past responses. This is a crucial distinction, as the paper finds the model's own responses are a more potent area for attention manipulation.
- Underlying Principle:
  - Previous Methods: Operate on the principle of "escalation" or "refinement," trying to trick or wear down the safety alignment through clever wording.
  - ASJA: Operates on the principle of "attention shifting." The goal is not just to ask a tricky question but to create a conversational context that distracts the model from the harmfulness of the final question. It exploits a fundamental mechanism of the Transformer architecture.
- Methodology:
  - Previous Methods: Often use an iterative loop of querying the target, analyzing its refusal, and refining the next prompt.
  - ASJA: Uses a genetic algorithm to holistically optimize an entire dialogue history at once, using the LLM's attention score as a direct fitness signal. This is a more direct and principled way to exploit the identified vulnerability.
4. Methodology
4.1. Principles
The core principle of ASJA (Attention Shifting for JAilbreaking) is that an LLM's safety alignment can be bypassed by manipulating its attention mechanism. The preliminary study in the paper reveals a key insight: in successful multi-turn jailbreaks, the LLM allocates significantly less attention to the harmful keywords in the final malicious query. Instead, its attention is dispersed across the preceding conversation history, especially the model's own prior responses.
ASJA is designed to actively engineer this phenomenon. The goal is to construct a fabricated dialogue history that serves as a "distraction," causing the LLM to focus on the history so much that it fails to recognize the harmful intent of the final query. This is achieved by simultaneously optimizing both the past queries and the LLM's past responses in the dialogue history to minimize the attention paid to the final harmful query.
4.2. Core Methodology In-depth (Layer by Layer)
The ASJA attack is described in Algorithm 1 of the paper. It involves three main phases: Dialogue Initialization, Population Initialization, and Dialogue Optimization using a genetic algorithm.
The overall process is as follows:
(Figure) Bar chart of the attention distribution across queries and responses: the left panel shows attention across turns for the "Reject" and "Jailbreak" cases, while the right panel shows the detailed distribution; R1 through R5 mark the responses of the individual turns.
4.2.1. Algorithm and Process Breakdown
The algorithm is orchestrated to find an optimal multi-turn dialogue that causes a jailbreak. Let's break down the steps as presented in Algorithm 1: Attention Shifting for Jailbreaking.
Require:
- An attack model, used for rewriting queries (e.g., an uncensored LLM).
- The target model to be jailbroken (e.g., LLaMA-2).
- A judge model, used to evaluate whether a response is a successful jailbreak.
- $G$: the maximum number of generations (iterations) for the genetic algorithm.
- $N$: the size of the population (number of candidate dialogues).
Input:
- $X$: the initial harmful query (e.g., "How to build a bomb?").
Output:
- A harmful multi-turn dialogue and the corresponding harmful response from the target model.
Step 1: Multi-turn Dialogue Initialization (Lines 1-4)
The first step is to create a plausible, high-quality initial dialogue history.
- Q ← Init(X): A sequence of queries is generated, starting from a benign topic and gradually increasing in toxicity until it reaches the final harmful query $X$. To handle cases where standard LLMs refuse to generate the intermediate harmful queries, the authors use an uncensored model (a model fine-tuned to ignore safety constraints). The paper notes they design the sequence to have an equal distribution of benign and harmful questions, inspired by the DAN ("Do Anything Now") jailbreak, to make the model believe it has already been non-compliant.
- Dialogue construction: This loop constructs the initial multi-turn dialogue MT. Each generated query is sent to the target model and the model's response is recorded, so MT is a sequence of query-response pairs (see the sketch below).
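A minimal sketch of this initialization step, under the assumption that `target_chat` is a hypothetical chat-completion wrapper around the target model and that the escalating query list has already been produced (e.g., by an uncensored model, not shown here):

```python
def build_initial_dialogue(queries, target_chat):
    """Construct MT = [(q1, r1), ..., (qk, rk)] by sending each escalating query
    to the target model together with the dialogue so far."""
    history, mt = [], []
    for q in queries:
        history.append({"role": "user", "content": q})
        r = target_chat(history)                          # hypothetical chat-completion wrapper
        history.append({"role": "assistant", "content": r})
        mt.append((q, r))
    return mt
```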
Step 2: Population Initialization (Line 6)
The genetic algorithm starts with a diverse set of candidate solutions (a population).
- An initial population of size $N$ is created. Each individual in the population is a multi-turn dialogue history generated by applying a mutation operation to the initial dialogue MT. The mutation introduces variations to create a diverse starting point for the optimization.
Step 3: Dialogue Optimization via Genetic Algorithm (Lines 7-15)
This is the main loop where the dialogue histories are evolved over generations to find an effective jailbreak.
- Iteration Loop: The algorithm iterates from generation $0$ to $G-1$.
- Jailbreak Check (Lines 9-10): For each candidate dialogue in the population, the full dialogue is fed to the target model. The resulting final response is evaluated by the judge model. If the judge model determines the output is a successful jailbreak, the algorithm terminates and returns the successful dialogue and response.
- Fitness Evaluation (Line 11): If no jailbreak is found, the "fitness" of each candidate dialogue in the population is calculated. This is the most crucial part of ASJA's design. The fitness is defined as the attention score the target model places on the final harmful query $Q_k$ in the dialogue; a lower attention score is considered fitter. The paper defines the attention score for the final query as the sum of the attention scores of its constituent tokens: $ A(Q_k) = \sum_{i=1}^{n} A(t_i) $ where $A(t_i)$ is the attention score for the $i$-th input token in the final query $Q_k$. The score for a single input token $t_k$ is obtained by averaging the attention it receives from the subsequent output tokens, as defined in the preliminary study: $ A(t_k) = \frac{1}{m - 1} \sum_{t=2}^{m} A(y_t, t_k) $ and the attention from a single output token $y_t$ to an input token $t_k$ is the average across all layers and heads: $ A(y_t, t_k) = \frac{1}{L \times H} \sum_{i=1}^{L} \sum_{j=1}^{H} A_{i,j}(y_t, t_k) $
  - Symbol Explanation:
    - $A_{i,j}(y_t, t_k)$: The attention score from the $t$-th output token $y_t$ to the input token $t_k$, specifically from the $j$-th attention head in the $i$-th layer of the Transformer.
    - $L$: Total number of layers in the LLM.
    - $H$: Total number of attention heads per layer.
    - $m$: The number of tokens in the model's output sequence. The objective of the genetic algorithm is to find a dialogue history that minimizes $A(Q_k)$; a minimal computation sketch follows this list.
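A rough sketch of how such an attention-based fitness score could be computed with Hugging Face Transformers. This is an approximation under stated assumptions: it re-runs a forward pass over the concatenated dialogue and response with `output_attentions=True` rather than capturing attention during generation, and it locates the final query by matching its token ids inside the prompt. The function names are illustrative and not the paper's code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def final_query_attention(model, tokenizer, prompt_text, response_text, final_query_text):
    """Approximate A(Q_k): mean attention that the response tokens pay to the
    tokens of the final harmful query, averaged over all layers and heads."""
    enc = tokenizer(prompt_text + response_text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    # out.attentions: tuple of L tensors, each of shape (batch, heads, seq, seq)
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]     # -> (seq, seq), averaged over layers/heads
    q_ids = tokenizer(final_query_text, add_special_tokens=False)["input_ids"]
    n_resp = len(tokenizer(response_text, add_special_tokens=False)["input_ids"])
    seq_len = enc["input_ids"].shape[1]
    prompt_ids = enc["input_ids"][0, : seq_len - n_resp].tolist()
    start = _last_occurrence(prompt_ids, q_ids)                # final query's token span in the prompt
    rows = attn[seq_len - n_resp:, :]                          # rows for the generated (output) tokens
    return rows[:, start:start + len(q_ids)].sum(dim=1).mean().item()

def _last_occurrence(seq, sub):
    """Start index of the last occurrence of sub within seq (raises if not found)."""
    for i in range(len(seq) - len(sub), -1, -1):
        if seq[i:i + len(sub)] == sub:
            return i
    raise ValueError("final query tokens not found in the prompt")

# Example usage (hypothetical model choice; eager attention is needed to return attention maps):
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", attn_implementation="eager")
# score = final_query_attention(lm, tok, dialogue_prompt, model_response, harmful_query)
```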
- Selection and Evolution (Lines 12-14):
  - Elitism: The best individual from the current generation (the one with the lowest fitness score, i.e., lowest attention on the final query), called the elite, is preserved and passed directly to the next generation. This ensures the best-found solution is never lost.
  - Selection: For the remaining N-1 spots in the new generation, pairs of parent dialogues are selected from the current population. The selection is probabilistic, where individuals with lower fitness scores (better solutions) have a higher chance of being chosen.
  - Crossover: A new child dialogue is created from the two parents using uniform crossover. This means for each "dialogue unit" (a query or a response in a turn), the child's unit is randomly copied from either parent. This promotes the mixing of good traits from different solutions.
  - Mutation: The newly created child undergoes mutation with a certain probability. This is the key operator for introducing novelty. Mutation can modify the dialogue history in two ways:
    - Query Mutation: A query in the history is rewritten using the attack model and one of eight predefined jailbreak strategies (e.g., Defined Persona, Imagined Scenario).
    - Response Mutation: A response in the history is regenerated using the uncensored model to be more affirmative and compliant with the harmful query. This fabrication strengthens the illusion that the model has already been "broken" in previous turns.
- New Generation: The new population is formed by the elite sample and the N-1 new offspring created through selection, crossover, and mutation. This new population is then used for the next iteration of the loop. A sketch of the full optimization loop is given below.
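Putting the pieces together, a simplified sketch of the optimization loop might look as follows. All callables, namely `target_respond`, `judge_is_jailbreak`, `attention_on_final_query`, and `mutate_unit`, are hypothetical stand-ins for the attack, target, and judge components; the crossover here swaps whole turns rather than individual queries and responses, which is a simplification of the paper's "dialogue unit" crossover.

```python
import random

def mutate_dialogue(dialogue, mutate_unit, p=0.3):
    """Rewrite each (query, response) turn with probability p via a caller-supplied mutator."""
    return [mutate_unit(turn) if random.random() < p else turn for turn in dialogue]

def optimize_dialogue(init_dialogue, final_query, target_respond, judge_is_jailbreak,
                      attention_on_final_query, mutate_unit, pop_size=8, generations=10):
    """Evolve fabricated dialogue histories so the target model pays as little
    attention as possible to the final harmful query (lower fitness is better)."""
    population = [mutate_dialogue(init_dialogue, mutate_unit) for _ in range(pop_size)]
    for _ in range(generations):
        # 1) Jailbreak check: query the target with each candidate history plus the final query.
        responses = [target_respond(d, final_query) for d in population]
        for d, r in zip(population, responses):
            if judge_is_jailbreak(final_query, r):
                return d, r
        # 2) Fitness: attention the target pays to the final query (to be minimized).
        fitness = [attention_on_final_query(d, final_query, r)
                   for d, r in zip(population, responses)]
        elite = population[min(range(pop_size), key=lambda i: fitness[i])]
        # 3) Selection (inverse-fitness weights), uniform crossover per turn, then mutation.
        weights = [1.0 / (f + 1e-8) for f in fitness]
        children = []
        while len(children) < pop_size - 1:
            p1, p2 = random.choices(population, weights=weights, k=2)
            child = [random.choice(pair) for pair in zip(p1, p2)]
            children.append(mutate_dialogue(child, mutate_unit))
        population = [elite] + children
    return None, None  # no jailbreak found within the query budget
```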
5. Experimental Setup
5.1. Datasets
Two datasets were used to evaluate the proposed method:
- QuestionList: A dataset from Yu et al. (2023) containing 100 queries. These queries cover a wide range of prohibited scenarios, including illegal activities, unethical practices, and toxic content. This dataset is relatively small but diverse.
- AdvBench (Adversarial Bench): A larger and more structured dataset from Zou et al. (2023) with 520 instances of harmful behaviors. These are categorized into seven specific scenarios: "Illegal Activity," "Hate Speech," "Malware," "Physical Harm," "Economic Harm," "Fraud," and "Privacy Violence." The paper notes that this dataset contains repeated topics, which allows for more reliable results by simulating multiple runs on the same theme.

These datasets were chosen because they are standard benchmarks in LLM safety and jailbreaking research, allowing for a fair comparison with prior work.
5.2. Evaluation Metrics
The paper uses three main metrics to evaluate the performance of the attacks: two for effectiveness and one for stealthiness.
- ASR-1 (Attack Success Rate, Type 1):
  - Conceptual Definition: This metric measures the percentage of prompts that successfully elicit a harmful response. It uses a two-stage process: first, it checks if the model explicitly refused the request, and if not, it then uses a powerful LLM to judge if the content is actually harmful.
  - Mathematical Formula: $ \text{ASR-1} = \frac{\sum_{i=1}^{N} \mathbb{I}(\text{is\_not\_refused}(R_i) \land \text{is\_harmful}(R_i))}{N} $
  - Symbol Explanation:
    - $N$: The total number of test samples.
    - $R_i$: The response generated by the target model for the $i$-th sample.
    - is_not_refused: A function that returns true if the response does not contain refusal keywords (e.g., "I cannot," "I'm sorry," "As a language model...").
    - is_harmful: A function, typically implemented by a powerful judge LLM, that returns true if the content of the response is harmful.
    - $\mathbb{I}(\cdot)$: The indicator function, which is 1 if the condition inside is true, and 0 otherwise.
- ASR-2 (Attack Success Rate, Type 2):
  - Conceptual Definition: This is a stricter version of ASR-1. In addition to being non-refused and harmful, the response must also be relevant to the original harmful query. This is important because some attacks might cause the model to output something harmful but completely unrelated to the user's intent.
  - Mathematical Formula: $ \text{ASR-2} = \frac{\sum_{i=1}^{N} \mathbb{I}(\text{is\_not\_refused}(R_i) \land \text{is\_harmful}(R_i) \land \text{is\_relevant}(R_i, Q_i))}{N} $
  - Symbol Explanation:
    - $Q_i$: The original harmful query for the $i$-th sample.
    - is_relevant: A function, also implemented by a judge LLM, that returns true if the response directly addresses the query $Q_i$.
    - Other symbols are the same as in ASR-1.
- PPL (Perplexity):
  - Conceptual Definition: Perplexity is a metric used to evaluate the fluency and naturalness of language generated by a model. In the context of jailbreaking, it measures the "stealthiness" of the adversarial prompt. A lower perplexity score indicates that the prompt is more like natural human language and is therefore less likely to be detected by defense mechanisms that flag statistically unusual text.
  - Mathematical Formula: For a sequence of tokens $W = (w_1, \dots, w_N)$, perplexity is calculated as the exponential of the average negative log-likelihood: $ \text{PPL}(W) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i | w_1, \dots, w_{i-1}) \right) $
  - Symbol Explanation:
    - $N$: The number of tokens in the sequence.
    - $p(w_i | w_1, \dots, w_{i-1})$: The probability assigned by a language model (in this case, GPT-2) to the $i$-th token, given all preceding tokens. A small computation sketch of these metrics follows.
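The sketch below shows one way these metrics could be computed. The refusal keyword list is an assumption based on the examples above, `judge_harmful` and `judge_relevant` are placeholder callables standing in for the paper's LLM judge, and perplexity is measured under GPT-2 as in the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumed refusal markers (illustrative, based on the examples in the text).
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as a language model")

def is_not_refused(response: str) -> bool:
    """First stage of ASR-1/ASR-2: a simple keyword-based refusal check."""
    low = response.lower()
    return not any(marker in low for marker in REFUSAL_MARKERS)

def attack_success_rates(samples, judge_harmful, judge_relevant):
    """Compute ASR-1 and ASR-2 (in %) over (query, response) pairs;
    the judge callables stand in for the LLM-based judge."""
    n = len(samples)
    asr1 = asr2 = 0
    for query, response in samples:
        if is_not_refused(response) and judge_harmful(response):
            asr1 += 1
            if judge_relevant(query, response):
                asr2 += 1
    return 100.0 * asr1 / n, 100.0 * asr2 / n

def perplexity(text: str, model: GPT2LMHeadModel, tokenizer: GPT2TokenizerFast) -> float:
    """PPL under GPT-2: exp of the mean negative log-likelihood of the tokens."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # loss is the average NLL per predicted token
    return torch.exp(out.loss).item()

# Example usage:
# tok = GPT2TokenizerFast.from_pretrained("gpt2")
# gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
# print(perplexity("Write a short story about a helpful robot.", gpt2, tok))
```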
5.3. Baselines
The paper compares ASJA against three representative state-of-the-art jailbreaking methods:
- AutoDAN: A method that uses a genetic algorithm to optimize a single-turn jailbreak prompt. It is a strong baseline for automated prompt optimization.
- ReNeLLM: An attack that generates jailbreak prompts by rewriting them and nesting them within specific scenarios (e.g., hiding a prompt inside a LaTeX code block).
- PAIR: A multi-turn attack that uses an LLM to iteratively refine prompts based on the target model's responses. This is the most direct competitor as it also operates in a multi-turn setting.

These baselines were chosen to represent different families of attacks: automated single-turn optimization (AutoDAN), prompt obfuscation (ReNeLLM), and iterative multi-turn dialogue (PAIR).
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results are presented in Table 1, which compares ASJA against the baselines on three open-source LLMs (LLaMA-2, LLaMA-3.1, Qwen-2) and two datasets (AdvBench, QuestionList).
The following are the results from Table 1 of the original paper:
| Dataset | Attack | LLaMA-2 ASR-1↑ | LLaMA-2 ASR-2↑ | LLaMA-2 PPL↓ | LLaMA-3.1 ASR-1↑ | LLaMA-3.1 ASR-2↑ | LLaMA-3.1 PPL↓ | Qwen-2 ASR-1↑ | Qwen-2 ASR-2↑ | Qwen-2 PPL↓ | Avg. ASR-1↑ | Avg. ASR-2↑ | Avg. PPL↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AdvBench | AutoDAN | 24.42 | 16.54 | 116.30 | 9.04 | 10.00 | 137.37 | 64.04 | 50.00 | 114.55 | 32.50 | 25.51 | 122.74 |
| AdvBench | ReNeLLM | 30.38 | 19.42 | 82.21 | 32.31 | 29.04 | 62.68 | 52.50 | 50.77 | 74.51 | 38.40 | 33.08 | 73.13 |
| AdvBench | PAIR | 28.85 | 19.23 | 24.66 | 41.15 | 38.65 | 19.92 | 74.62 | 64.62 | 23.38 | 48.21 | 40.83 | 22.65 |
| AdvBench | ASJA | 57.70 | 37.88 | 34.48 | 54.23 | 54.04 | 39.01 | 78.27 | 69.23 | 38.42 | 63.40 | 53.72 | 37.30 |
| QuestionList | AutoDAN | 23.00 | 20.00 | 128.48 | 9.00 | 11.00 | 154.80 | 67.00 | 50.00 | 131.62 | 33.00 | 27.00 | 138.30 |
| QuestionList | ReNeLLM | 37.00 | 30.00 | 64.40 | 38.00 | 40.00 | 55.72 | 51.00 | 64.00 | 61.28 | 42.00 | 44.67 | 60.47 |
| QuestionList | PAIR | 40.00 | 30.00 | 31.34 | 46.00 | 42.00 | 26.68 | 69.00 | 66.00 | 30.60 | 51.67 | 46.00 | 29.54 |
| QuestionList | ASJA | 78.00 | 52.00 | 33.12 | 81.00 | 63.00 | 41.09 | 85.00 | 71.00 | 36.58 | 81.33 | 62.67 | 36.93 |
- Effectiveness (ASR-1 and ASR-2): ASJA is the clear winner. It achieves the highest ASR-1 and ASR-2 across all three models and both datasets, often by a large margin. For instance, on QuestionList, ASJA achieves an average ASR-1 of 81.33%, while the next best, PAIR, is at 51.67%. This demonstrates that the attention-shifting strategy is significantly more effective than prior methods. The strong performance on ASR-2 is particularly noteworthy, as it shows ASJA not only elicits harmful content but also ensures the content is relevant to the original query, which is a common failure point for other attacks.
- Stealthiness (PPL): ASJA achieves excellent stealthiness. Its average perplexity (PPL) is consistently low, second only to PAIR. Both ASJA and PAIR significantly outperform AutoDAN and ReNeLLM in this regard. This is because they leverage LLMs to generate natural, fluent conversational text, whereas AutoDAN and ReNeLLM rely on more rigid templates and optimization that can result in less natural-sounding prompts. ASJA successfully combines high effectiveness with high stealth.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Efficiency Analysis
The paper evaluates efficiency by measuring the ASR-1 achieved within a certain budget of queries to the target model. The results are shown in Figure 3 from the paper.
(Figure 3) Chart comparing ASJA, PAIR, and AutoDAN on attack success rate (ASR) versus the number of queries: ASJA's ASR rises steadily as the query budget grows, while PAIR and AutoDAN change relatively little.
- ASJA's Superior Efficiency: The chart shows that ASJA (the blue line) achieves a much higher attack success rate with fewer queries compared to the baselines. For any given number of queries, ASJA's ASR-1 is substantially higher. This indicates that its optimization process, guided by the attention mechanism, is far more direct and efficient at finding vulnerabilities.
- Baseline Inefficiencies: AutoDAN (the orange line) requires a large number of queries to even begin, as its method involves multiple forward passes to calculate loss for candidate selection. PAIR (the green line) shows some success at a low query count but quickly plateaus, suggesting its iterative refinement strategy is limited and may get stuck in local optima. ASJA's genetic algorithm, guided by a more fundamental signal (attention), appears to explore the search space more effectively.
6.2.2. Transferability Analysis
The paper tests whether adversarial dialogues crafted for an open-source model (LLaMA-2) can successfully jailbreak powerful, closed-source models (GPT-3.5 and GPT-4o). This is a critical test for the real-world applicability of an attack.
The following are the results from Table 2 of the original paper:
| Attack | GPT-3.5 ASR-1 | GPT-3.5 ASR-2 | GPT-3.5 PPL | GPT-4o ASR-1 | GPT-4o ASR-2 | GPT-4o PPL |
|---|---|---|---|---|---|---|
| AutoDAN | 61.00 | 53.00 | 146.52 | 46.00 | 59.00 | 149.18 |
| ReNeLLM | 59.00 | 48.00 | 60.29 | 57.00 | 58.00 | 58.07 |
| PAIR | 18.00 | 35.00 | 34.24 | 14.00 | 33.00 | 36.39 |
| ASJA | 56.00 | 54.00 | 40.38 | 57.00 | 63.00 | 37.33 |
- High Transferability of ASJA: The results show that ASJA maintains strong performance when transferred. It achieves an ASR-1 of 57% and an ASR-2 of 63% on GPT-4o, the highest relevance-aware success rate among all tested methods. This is a powerful result, suggesting that the vulnerability ASJA exploits (the distractibility of the attention mechanism) is a fundamental issue present even in highly advanced, closed-source models.
- Comparison with Baselines: While ReNeLLM also shows good transferability, its prompts have much higher perplexity (PPL), making them less stealthy. PAIR, which is also a multi-turn method, transfers very poorly, indicating its refinement strategy is likely overfitted to the specific target model it was optimized on. ASJA strikes the best balance, achieving high success rates on transfer attacks while maintaining low PPL.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper makes a significant contribution to the field of LLM security by providing a new, principled understanding of why multi-turn jailbreaks are effective. The core finding is that successful attacks work by shifting the LLM's attention away from harmful keywords in a malicious query and towards the conversational history.
Building on this insight, the authors developed ASJA, a novel attack that fabricates dialogue histories using a genetic algorithm guided by the LLM's attention scores. Extensive experiments demonstrated that ASJA is superior to existing methods across multiple dimensions: it achieves a higher attack success rate (effectiveness), generates more natural and human-like prompts (stealthiness), requires fewer queries to succeed (efficiency), and its attacks generalize well to other, more powerful models (transferability). The research concludes that securing LLMs in the future will require not just better content filtering, but also enhancing the robustness of the underlying attention mechanism in multi-turn scenarios.
7.2. Limitations & Future Work
The authors' primary focus is on demonstrating the vulnerability and the proposed attack. While they do not explicitly list limitations, some can be inferred from the methodology:
- White-Box Dependency: The core ASJA method requires access to the target model's internal attention scores to calculate fitness. This makes it a white-box attack, which is not always feasible, especially against proprietary, closed-source models served via API. Although the paper demonstrates strong transferability (a black-box scenario), the initial optimization process relies on a white-box setup.
- Reliance on Uncensored Models: The method depends on an "uncensored model" for both initializing dialogues and mutating responses. The effectiveness of ASJA could be highly dependent on the quality and capabilities of this auxiliary model.
- Evaluation by LLM: The use of LLaMA-3.1-70b as a judge for harmfulness and relevance is a standard practice but introduces a potential source of noise and bias, as LLM-based evaluation is not infallible.

The paper suggests that future work should focus on developing defense strategies that specifically target this vulnerability by enhancing the robustness of the attention mechanism in multi-turn dialogues.
7.3. Personal Insights & Critique
This paper is highly insightful and represents a clear step forward in LLM security research.
- Key Innovation: The most impressive aspect is the shift in perspective from treating the LLM as a black-box text generator to a white-box system whose internal mechanisms can be analyzed and exploited. By grounding the attack in the attention mechanism, the authors move from empirical prompt hacking to a more principled form of adversarial attack. This is a much more powerful and fundamental approach.
- Practical Implications: The finding that fabricating the LLM's own past responses is an effective manipulation vector is a critical insight for defenders. It suggests that models have a strong "recency bias" or "contextual trust" in their own generated text, which can be exploited. Defenses may need to treat the model's own history with the same level of scrutiny as user input.
- Potential for Improvement:
  - A future direction could be to develop a purely black-box version of ASJA. Instead of using attention scores as a fitness signal, one could use proxy signals, such as the confidence scores of refusal-related tokens in the output, to guide the genetic algorithm.
  - The concept of "attention shifting" could be applied to other domains beyond security. For instance, it could be used to improve the steerability of LLMs in creative writing or to debug why a model is focusing on irrelevant information in a reasoning task.
- Overall Critique: ASJA is a sophisticated and potent attack that lays bare a fundamental weakness in current Transformer-based LLMs. The paper is well-executed, with a clear hypothesis, a novel methodology to test it, and comprehensive experiments to validate it. It serves as a powerful call to action for the research community to look beyond surface-level safety filters and address vulnerabilities in the core architecture of these models.