PEARL: Towards Permutation-Resilient LLMs
TL;DR Summary
PEARL uses distributionally robust optimization and a permutation-proposal network to enhance LLMs' resilience against worst-case input orderings, effectively mitigating permutation attacks and boosting performance across varied contexts.
Abstract
The in-context learning (ICL) capability of large language models (LLMs) enables them to perform challenging tasks using provided demonstrations. However, ICL is highly sensitive to the ordering of demonstrations, leading to instability in predictions. This paper shows that this vulnerability can be exploited to design a natural attack, difficult for model providers to detect, that achieves nearly 80% success rate on LLaMA-3 by simply permuting the demonstrations. Existing mitigation methods primarily rely on post-processing and fail to enhance the model's inherent robustness to input permutations, raising concerns about safety and reliability of LLMs. To address this issue, we propose Permutation-resilient learning (PEARL), a novel framework based on distributionally robust optimization (DRO), which optimizes model performance against the worst-case input permutation. Specifically, PEARL consists of a permutation-proposal network (P-Net) and the LLM. The P-Net generates the most challenging permutations by treating the problem as an optimal transport problem, which is solved using an entropy-constrained Sinkhorn algorithm. Through minimax optimization, the P-Net and the LLM iteratively optimize against each other, progressively improving the LLM's robustness. Experiments on synthetic pre-training and real-world instruction tuning tasks demonstrate that PEARL effectively mitigates permutation attacks and enhances performance. Notably, despite being trained on fewer shots and shorter contexts, PEARL achieves performance gains of up to 40% when scaled to many-shot and long-context scenarios, highlighting its efficiency and generalization capabilities.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
PEARL: Towards Permutation-Resilient LLMs
1.2. Authors
Liang Chen, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong
Affiliations:
- The Chinese University of Hong Kong: Liang Chen, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong
- Shenzhen Campus of Sun Yat-sen University: Li Shen
- SMU: Yang Deng
1.3. Journal/Conference
This paper is published at arXiv, which is a preprint server for scientific papers. It is indicated as a preprint (v1) and its publication status beyond this is not specified. arXiv is a highly influential platform for rapid dissemination of research in various scientific fields, including artificial intelligence and natural language processing.
1.4. Publication Year
2025 (Based on the UTC timestamp 2025-02-20T15:07:02.000Z)
1.5. Abstract
The abstract outlines the critical vulnerability of Large Language Models (LLMs) in in-context learning (ICL): their predictions are highly sensitive to the ordering of demonstrations. This sensitivity can be exploited by a "natural attack" through simple permutation of demonstrations, achieving nearly an 80% success rate on models like LLaMA-3. Existing mitigation methods often rely on post-processing or architectural modifications, failing to enhance the LLM's inherent robustness to input permutations. To counter this, the paper proposes Permutation-resilient learning (PEARL), a novel framework based on distributionally robust optimization (DRO). PEARL optimizes model performance against the worst-case input permutation. It comprises a permutation-proposal network (P-Net) and the LLM. The P-Net identifies challenging permutations by treating it as an optimal transport problem, solved using an entropy-constrained Sinkhorn algorithm. Through minimax optimization, the P-Net and LLM iteratively train, progressively improving the LLM's robustness. Experiments on synthetic pre-training and real-world instruction tuning tasks demonstrate PEARL's effectiveness in mitigating permutation attacks and enhancing performance. Notably, PEARL shows significant performance gains (up to 40%) in many-shot and long-context scenarios, even when trained on fewer shots and shorter contexts, highlighting its efficiency and generalization capabilities.
1.6. Original Source Link
https://arxiv.org/abs/2502.14628
1.7. PDF Link
https://arxiv.org/pdf/2502.14628v1.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the permutation sensitivity of in-context learning (ICL) in Large Language Models (LLMs). While ICL allows LLMs to perform complex tasks by learning from demonstrations, the effectiveness of this learning is highly unstable and dependent on the specific order in which these demonstrations are presented.
This problem is critically important in the current field due to several reasons:
- Instability in Predictions: The unpredictability stemming from demonstration ordering makes LLMs less reliable and harder to control for consistent performance.
- Challenges for Prompt Engineering: Practitioners struggle to engineer effective prompts because a slight change in demonstration order can drastically alter model behavior, hindering the development of robust LLM applications.
- Security Vulnerability: The paper highlights that this sensitivity can be exploited as a "natural attack." A malicious actor can degrade LLM performance by simply reordering demonstrations, without altering their content. Such attacks are difficult for model providers to detect, posing significant safety and reliability concerns. Existing mitigation methods are often post-processing techniques or architectural changes, which incur additional inference overhead or lack scalability, failing to address the fundamental lack of inherent robustness to input permutations within the LLM itself.

The paper's entry point, or innovative idea, is to address this vulnerability from a fundamental learning perspective using Distributionally Robust Optimization (DRO). Instead of superficial fixes, PEARL (Permutation-resilient learning) focuses on making the LLM inherently robust by training it to perform well even under the worst-case permutations of demonstrations, thereby enhancing its resilience against such attacks and improving overall reliability.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Demonstration of Permutation Vulnerability as an Attack: The authors empirically show that even advanced LLMs like LLaMA-3 are highly susceptible to simple permutation-based attacks, with success rates approaching 80%. This reframes permutation sensitivity not just as a performance quirk but as a critical security vulnerability.
- Introduction of the PEARL Framework: They propose Permutation-resilient learning (PEARL), a novel training framework based on Distributionally Robust Optimization (DRO), designed to explicitly optimize LLM performance against the most challenging input permutations.
- Novel Permutation-Proposal Network (P-Net): PEARL introduces a P-Net that acts as an adversary. The P-Net generates "worst-case" permutations by formulating the problem as an optimal transport problem, which is efficiently solved using an entropy-constrained Sinkhorn algorithm.
- Minimax Optimization for Robustness: The P-Net and the LLM are trained together through minimax optimization: the P-Net maximizes the LLM's loss by finding difficult permutations, while the LLM minimizes its loss, thereby learning to be robust to these challenging orders. This adversarial training progressively enhances the LLM's inherent robustness.
- Empirical Validation and Performance Gains: Experiments on synthetic pre-training tasks (linear functions) and real-world instruction tuning tasks (Super-Natural Instructions) demonstrate PEARL's effectiveness.
  - It successfully mitigates permutation attacks and significantly improves both average and worst-case performance.
  - PEARL-trained models exhibit strong generalization, achieving performance gains of up to 40% when scaled to many-shot and long-context scenarios, even though they were trained on fewer shots and shorter contexts. PEARL also improves shot efficiency, allowing models to achieve comparable average performance with two to four times fewer shots than ERM baselines.
- Generalizability Across LLMs: The method's effectiveness is shown across different LLM families, including Llama, Gemma, and Mistral models, indicating its broad applicability.

These findings address the critical issue of permutation sensitivity by providing a robust, training-stage solution that enhances the intrinsic reliability and security of LLMs in ICL settings.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the PEARL framework and its implications, a beginner should be familiar with the following core concepts:
- Large Language Models (LLMs): Advanced artificial intelligence models, typically based on the Transformer architecture, trained on vast amounts of text data. They are capable of understanding, generating, and processing human language for a wide range of tasks, from translation to creative writing. Examples include GPT-3, LLaMA, and Mistral.
- In-Context Learning (ICL): A capability of LLMs where the model learns a new task or adapts its behavior from a few examples provided directly within the input prompt, without explicit weight updates or fine-tuning. For instance, to classify sentiment, one might provide examples like "Movie was great. -> Positive" and "Movie was bad. -> Negative" before asking the model to classify a new sentence. The examples are called demonstrations or shots.
- Permutation Sensitivity: The phenomenon where the performance or output of an LLM changes significantly when the order of the demonstrations within the ICL prompt is altered, even though the content of the demonstrations remains the same. This implies a lack of order invariance (permutation invariance), which is often desired for robust AI systems (a small illustration follows this list).
- Distributionally Robust Optimization (DRO): An optimization framework that seeks solutions performing well under the worst-case probability distribution within a predefined set of plausible distributions, known as an ambiguity set. Unlike standard optimization (such as Empirical Risk Minimization, which optimizes for the observed training data), DRO provides stronger guarantees against distributional shifts and uncertainties. In this paper, the uncertainty lies in the order of ICL demonstrations.
- Optimal Transport (OT): A mathematical theory concerned with moving a "pile of dirt" from one shape to another while minimizing the total transportation cost. More formally, it defines a distance (such as the Wasserstein distance) between probability distributions. In this paper, OT is used to model how to transform one distribution of permutations (e.g., random orderings) into a "worst-case" distribution of permutations that challenges the LLM.
- Sinkhorn Algorithm: An iterative algorithm used to approximate solutions to optimal transport problems, particularly for finding doubly stochastic matrices. A doubly stochastic matrix is a square matrix of non-negative real numbers in which every row and every column sums to 1. In PEARL, it transforms a relationship matrix into a probability distribution over permutations.
- Minimax Optimization: An optimization problem structured as a game between two players, often called a "min-max" game. One player (the minimizer) tries to minimize an objective function, while the other player (the maximizer) tries to maximize the same objective. In PEARL, the P-Net (maximizer) tries to find the worst-case permutation to maximize the LLM's loss, while the LLM (minimizer) tries to improve its performance to minimize this worst-case loss. This adversarial process drives both components to improve.
- Empirical Risk Minimization (ERM): The standard training paradigm in machine learning, in which a model is trained to minimize its average loss on the observed training data. While effective for many tasks, ERM models can be brittle to inputs that differ significantly from the training distribution, as highlighted by permutation sensitivity.
- ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence): A common metric for evaluating generated text against a reference text. ROUGE-L measures the longest common subsequence (LCS) between the generated and reference texts, reflecting their structural similarity. A higher ROUGE-L score indicates greater similarity.
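To make permutation sensitivity concrete, the following minimal Python sketch (with hypothetical sentiment demonstrations) enumerates every ordering of a small ICL prompt; an order-sensitive model may behave differently on each of the n! variants even though their information content is identical.

```python
from itertools import permutations

# Hypothetical sentiment-classification demonstrations as (input, label) pairs.
demos = [
    ("Movie was great.", "Positive"),
    ("Movie was bad.", "Negative"),
    ("Plot was dull but the acting was superb.", "Positive"),
]
query = "The soundtrack was forgettable."

def build_prompt(ordered_demos, query):
    """Concatenate demonstrations in the given order, then append the query."""
    lines = [f"{x} -> {y}" for x, y in ordered_demos]
    lines.append(f"{query} ->")
    return "\n".join(lines)

# n! = 6 distinct prompts carrying exactly the same information.
prompts = [build_prompt(order, query) for order in permutations(demos)]
print(f"{len(prompts)} orderings of the same 3 demonstrations")
print(prompts[0])
```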
3.2. Previous Works
The paper contextualizes PEARL by discussing existing approaches to ICL and permutation sensitivity:
- Order Sensitivity in ICL:
  - Zhao et al. (2021), Lu et al. (2022), and Reynolds & McDonell (2021) established the problem of LLMs' sensitivity to demonstration permutations.
  - This is seen as a fragility of LLMs despite their ICL capabilities.
- Existing Mitigation Methods (Training-Stage):
  - Improving General Performance: Most ICL research (Min et al., 2022; Wei et al., 2023) focuses on normal-case performance without specifically targeting permutation robustness.
  - Architectural Modifications:
    - Xiang et al. (2024) (InfoAC): Modifies the training objective with contrastive learning to mitigate the limitations of transformers' unidirectional attention, aiming for bidirectional token visibility. However, it showed limited success and was restricted to classification tasks.
    - Chen et al. (2023b): Explored permutation-equivariant architectures such as DeepSet, which exhibit better permutation invariance. However, such MLP-based architectures are often too small and lack the scalability of Transformers for complex language modeling tasks.
- Existing Mitigation Methods (Inference-Stage): These methods modify the input or output during inference, introducing additional computational overhead per inference call and thus limiting practicality.
  - Demonstration Selection (Chang & Jia, 2023; Peng et al., 2024): Aims to select the "best" demonstrations to improve normal-case performance but lacks worst-case guarantees under permutations.
  - Output Calibration (Zhao et al., 2021; Li et al., 2023; Guo et al., 2024a): Effective for classification tasks but less applicable to generation tasks due to the difficulty of sequence-level calibration.
  - Order Optimization (Lu et al., 2022): Searches for the best ordering of demonstrations during inference, but the search space grows factorially (n!), making it computationally prohibitive for many demonstrations.
  - Prediction Ensembling (Zhang et al., 2024): Transforms n-shot ICL into n one-shot predictions and ensembles the results. While effective for classification, it can harm generation tasks.
- Distributionally Robust Optimization (DRO):
  - Previous applications of DRO (Ben-Tal et al., 2013; Lam & Zhou, 2015; Duchi et al., 2016; Miyato et al., 2018) have focused on distributional shifts such as label shift (Hu et al., 2018), data source shift (Oren et al., 2019), and group shift (Sagawa et al., 2020). This paper is the first to apply DRO to ICL robustness by defining the ambiguity set over permutations.
- Optimal Transport (OT):
  - A foundational mathematical discipline (Monge, 1781; Kantorovich, 1942) used for distribution matching (Montesuma et al., 2024; Xiao et al., 2024).
  - Mena et al. (2018) explored learning permutation structures with neural networks for tasks such as sorting. This paper extends that line of work by applying OT in the context of LLMs to generate challenging permutations.
3.3. Technological Evolution
The evolution of ICL has seen a progression from focusing on simply achieving few-shot performance to addressing its inherent instabilities. Initially, the success of Transformers and the ICL paradigm (Brown et al., 2020) demonstrated powerful learning from examples. However, early research quickly identified fragilities, such as sensitivity to the choice and order of demonstrations.
The first wave of solutions typically involved inference-stage heuristics (like demonstration selection or output calibration) or minor training tweaks. These methods, while sometimes effective, often added inference overhead or were limited to specific task types (e.g., classification). A parallel line of work explored architectural changes (e.g., DeepSet) to enforce permutation invariance, but these were often less powerful or scalable than the dominant Transformer architecture.
This paper's work (PEARL) represents a significant step in this evolution by integrating a robust optimization principle (DRO) directly into the LLM's training loop. Instead of fixing the output or modifying the architecture, PEARL aims to make the LLM learn to be inherently robust to permutations. It leverages advances in optimal transport for efficiently generating adversarial permutations, moving beyond simple random shuffling or exhaustive search. This places PEARL within the cutting edge of LLM robustness research, as it addresses a core vulnerability from a principled, training-stage perspective that preserves the Transformer architecture and its autoregressive objective, thereby maintaining scalability.
3.4. Differentiation Analysis
Compared to the main methods in related work, PEARL offers several core differences and innovations:
- Training-Stage, Inherent Robustness vs. Post-Processing:
  - Difference: Most existing mitigation methods operate at inference time (e.g., demonstration selection, order optimization, output calibration). They modify prompts or outputs during inference, adding computational overhead without improving the model's fundamental handling of permutation-invariant patterns.
  - Innovation: PEARL is a training-stage method. It fundamentally alters how the LLM learns, making it inherently robust to permutations. Once trained, a PEARL LLM can handle various demonstration orders without extra inference costs or external modifications, in contrast to InfoAC, which also targets training but with limited scope and success.
- Principled Robustness via DRO vs. Heuristics/Randomness:
  - Difference: Prior training-stage methods, such as demonstration shuffling and instance mixup, rely on random shuffling or simple mixing, which are essentially data augmentation techniques. They do not explicitly seek out the most challenging permutations.
  - Innovation: PEARL employs Distributionally Robust Optimization (DRO), a principled framework that explicitly optimizes against the worst-case input permutation within an ambiguity set. This is a significant theoretical advance over empirical data augmentation strategies.
- Adversarial Permutation Generation via P-Net and Optimal Transport:
  - Difference: Exhaustive search for worst-case permutations is computationally infeasible (n! orderings), and heuristic search is not guaranteed to find the true worst case.
  - Innovation: PEARL introduces the P-Net, a specialized neural network that learns to generate adversarial permutations. The P-Net frames permutation generation as an optimal transport problem, solved efficiently with an entropy-constrained Sinkhorn algorithm and Gumbel sampling for differentiability. This allows challenging permutations to be discovered efficiently during training, a capability not present in other methods.
- General Applicability to Generation Tasks:
  - Difference: Many inference-stage methods (e.g., output calibration, prediction ensembling) work best for classification tasks and often underperform or are inapplicable on generation tasks because of the complexity of sequence-level calibration or ensembling.
  - Innovation: PEARL is demonstrated to be effective for both NLG (Natural Language Generation) and NLU (Natural Language Understanding) tasks in instruction tuning, making it broadly applicable without task-specific constraints on output types.
- Preserving the Transformer Architecture and Autoregressive Objective:
  - Difference: Some approaches explore permutation-equivariant architectures (e.g., DeepSet), which deviate from the standard Transformer and may lack its scale or versatility.
  - Innovation: PEARL enhances robustness without modifying the underlying Transformer architecture or its autoregressive objective, so it can be integrated into existing LLMs seamlessly while preserving their core capabilities and scalability.

In essence, PEARL introduces a powerful and theoretically grounded adversarial training framework that tackles permutation sensitivity at its root during the LLM's learning phase, providing robust and generalizable improvements across diverse models and tasks, which is a key differentiator from prior works.
4. Methodology
4.1. Principles
The core idea behind Permutation-resilient learning (PEARL) is to fundamentally enhance an LLM's robustness to the ordering of in-context learning (ICL) demonstrations. This is achieved by moving beyond Empirical Risk Minimization (ERM), which optimizes for average performance on observed data, to Distributionally Robust Optimization (DRO). The principle of DRO is to optimize model performance against the worst-case distribution within a defined set of plausible input distributions. In PEARL, this ambiguity set encompasses all possible permutations of ICL demonstrations.
To operationalize DRO efficiently, PEARL treats the problem as a minimax game between two players:
- The Adversary (P-Net): A permutation-proposal network (P-Net) that aims to find the most challenging permutation of demonstrations for a given input, thereby maximizing the LLM's loss. It formulates this search as an optimal transport problem.
- The Defender (LLM): The large language model itself, which seeks to minimize its loss when faced with the challenging permutations generated by the P-Net.

Through this iterative adversarial training process, the LLM is forced to learn features that are invariant or robust to different demonstration orderings, including those that are most detrimental to its performance. This ensures that the LLM can maintain high performance even under unexpected or adversarial permutations, thereby achieving permutation resilience.
4.2. Core Methodology In-depth (Layer by Layer)
The PEARL framework addresses permutation sensitivity by integrating Distributionally Robust Optimization (DRO) into the LLM's fine-tuning process.
4.2.1. Instruction Tuning via DRO
The standard approach to training LLMs for few-shot learning via supervised fine-tuning (SFT) is Empirical Risk Minimization (ERM).
The ERM objective is defined as:

$$\hat{\theta}_{\mathrm{ERM}} = \arg\min_{\theta} \; \mathbb{E}_{(p, x, y) \sim \hat{P}} \left[ \ell(p, x, y; \theta) \right]$$

Where:
- $\hat{\theta}_{\mathrm{ERM}}$ denotes the optimal model parameters found by ERM.
- $\theta$ represents the parameters of the language model.
- $\ell(\cdot)$ is a non-negative loss function that measures the discrepancy between the model's prediction and the true output.
- $(p, x, y)$ represents a training instance, where $p$ is an ICL prompt (a sequence of demonstrations), $x$ is the input, and $y$ is the true output.
- $\hat{P}$ denotes the empirical distribution derived from the training dataset.
- $\mathbb{E}$ denotes the expectation (average) over samples drawn from the empirical distribution.
ERM models often fail to generalize to different permutations because the training set only covers a subset of possible permutations. PEARL addresses this using Distributionally Robust Optimization (DRO). The DRO objective is to find parameters that perform well under the worst-case distribution within a specified ambiguity set:

$$\hat{\theta}_{\mathrm{DRO}} = \arg\min_{\theta} \; \sup_{Q \in \mathcal{Q}} \; \mathbb{E}_{(p, x, y) \sim Q} \left[ \ell(p, x, y; \theta) \right]$$

Where:
- $\hat{\theta}_{\mathrm{DRO}}$ denotes the optimal model parameters found by DRO.
- $\min_{\theta}$ signifies minimizing over the LLM parameters $\theta$.
- $\sup_{Q \in \mathcal{Q}}$ signifies maximizing over all possible distributions $Q$ within the ambiguity set $\mathcal{Q}$. This finds the worst-case distribution for the current LLM parameters.
- $\mathbb{E}_{(p, x, y) \sim Q}$ denotes the expectation (average) over samples drawn from a permutation-induced distribution $Q$.
The ambiguity set $\mathcal{Q}$ is constructed as the convex hull of all distributions obtained by permuting the prompts in the empirical distribution $\hat{P}$:

$$\mathcal{Q} = \left\{ \sum_{i=1}^{n!} q_i \, Q_{\Pi_i} \;:\; \Pi_i \in \mathcal{P}, \; q \in \Delta^{n!} \right\}$$

Where:
- $\mathcal{Q}$ is the ambiguity set, containing all possible distributions formed by permutations.
- $\sum_i q_i \, Q_{\Pi_i}$ represents a convex combination of permutation-induced distributions.
- $\Pi_i$ is a permutation matrix that reorders the sequence of demonstrations in prompt $p$.
- $\mathcal{P}$ denotes the set of all possible permutation matrices for the demonstrations.
- $q$ is a vector of probabilities, $q \in \Delta^{n!}$, where $\Delta^{n!}$ is the probability simplex of dimension $n!$. The elements of $q$ are non-negative and sum to 1, representing weights over the permutation-induced distributions.
- $Q_{\Pi_i}$ denotes the distribution in which each instance $(p, x, y)$ from the empirical distribution has its prompt permuted by $\Pi_i$.

To visualize the difference, Figure 2 (from the original paper) illustrates how ERM tends to assign high probabilities to frequently seen permutations and low probabilities to less frequent ones, leading to poor performance on unseen but valid permutations. In contrast, DRO aims to distribute probability more uniformly across all possible permutations, ensuring robustness.
Figure 2: Comparison of models trained under ERM and DRO paradigms. The blue bars represent the empirical distribution of training data, showing different frequencies of six permutations in the training set. The purple curves denote the learned distribution by (a) ERM and (b) DRO models, illustrating their different behaviors on less appeared but valid permutations.
4.2.2. Learning to Generate Permutations via P-Net
Directly solving the sup step in the DRO objective (Equation 6) through exhaustive search is computationally prohibitive due to the factorial growth of permutations (n!). PEARL addresses this by introducing a Permutation-proposal Network (P-Net) which learns to efficiently find challenging permutations.
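To illustrate why the exhaustive approach does not scale, here is a minimal sketch of the brute-force worst-case evaluation; `loss_fn` is a hypothetical placeholder for running the LLM on one permuted prompt and returning its loss.

```python
from itertools import permutations

def worst_case_loss(loss_fn, demos, x, y):
    """Exhaustively evaluate the sup over all demonstration orderings.

    loss_fn(ordered_demos, x, y) -> float stands in for running the LLM on the
    permuted prompt. With n demonstrations this loop runs n! times, which is
    exactly the cost PEARL's learned P-Net is designed to avoid.
    """
    return max(loss_fn(list(order), x, y) for order in permutations(demos))

# Dummy loss that only depends on which demonstration comes first (illustration).
demos = ["d1", "d2", "d3"]
dummy_loss = lambda ds, x, y: {"d1": 0.1, "d2": 0.5, "d3": 0.9}[ds[0]]
print(worst_case_loss(dummy_loss, demos, x=None, y=None))  # -> 0.9
```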
The P-Net is designed to learn a distribution over permutations of the demonstrations in a prompt. From this distribution, PEARL samples challenging permutations to reorder the given demonstrations. The P-Net has two main components: a parameter component and a non-parameter component.
Parameter component: This part extracts features from the input and models relationships between demonstrations.

- Feature Extractor: An encoder model (e.g., BERT-base) processes the ICL prompt, which consists of the $n$ demonstration pairs and the predicting sample $(x, y)$, and produces a representation for each demonstration and for the query:

  $$[h_1, \ldots, h_n, h_q] = \mathrm{Encoder}\big([\mathrm{CLS}], x_1, y_1, \; [\mathrm{CLS}], x_2, y_2, \; \ldots, \; [\mathrm{CLS}], x, y\big)$$

  Where:
  - $[\mathrm{CLS}]$ is a special token typically used to segment sequences and extract their representations in Transformer models like BERT.
  - $(x_i, y_i)$ represents the $i$-th demonstration input-output pair.
  - $(x, y)$ represents the query input and its true output (used for context, though $y$ might not be available during prediction).
  - $h_i$ is the representation vector corresponding to the $i$-th $[\mathrm{CLS}]$ token, capturing the semantic information of the $i$-th demonstration.
  - $h_q$ is the representation for the query sample.
- Cross-Relationship Modeling Layer: After obtaining the demonstration representations $H \in \mathbb{R}^{n \times d}$ (where $d$ is the hidden dimension), a simple layer models pairwise relationships between demonstrations to form a relationship matrix:

  $$R = \sigma\!\left(H W H^{\top}\right)$$

  Where:
  - $H$ is the matrix whose $i$-th row is the demonstration representation $h_i$.
  - $W \in \mathbb{R}^{d \times d}$ is a learnable weight matrix.
  - $H^{\top}$ is the transpose of $H$.
  - $\sigma(\cdot)$ denotes a non-linear activation function.
  - The element $R_{ij}$ is interpreted as the potential increase in task difficulty for the LLM if demonstrations $i$ and $j$ are swapped; higher values suggest a greater impact on the prediction. This component essentially models an edge-prediction process on a graph whose nodes are the demonstrations.
Non-parameter component:
The relationship matrix $R$ is then transformed into a doubly stochastic matrix, which represents a distribution over permutations. This is achieved using the Sinkhorn operator.

- Sinkhorn Operator:

  $$S^{0}(R) = \exp(R), \qquad S^{l}(R) = \mathcal{T}_{c}\big(\mathcal{T}_{r}\big(S^{l-1}(R)\big)\big), \qquad S(R) = \lim_{l \to \infty} S^{l}(R)$$

  Where:
  - $S(R)$ is the Sinkhorn operator applied to the matrix $R$.
  - $l \to \infty$ indicates that the alternating normalization is iterated until convergence.
  - $\exp(R)$ applies the exponential function element-wise to $R$.
  - $\mathcal{T}_{r}$ and $\mathcal{T}_{c}$ are the row and column normalization operators, respectively, defined as:

    $$\mathcal{T}_{r}(X) = X \oslash \big(X \mathbf{1}_{n} \mathbf{1}_{n}^{\top}\big), \qquad \mathcal{T}_{c}(X) = X \oslash \big(\mathbf{1}_{n} \mathbf{1}_{n}^{\top} X\big)$$

    Where:
    - $\oslash$ indicates element-wise division.
    - $\mathbf{1}_{n}$ is a column vector of ones of dimension $n$.
    - $X \mathbf{1}_{n} \mathbf{1}_{n}^{\top}$ creates a matrix in which every entry of a row equals the sum of the corresponding row of $X$.
    - $\mathbf{1}_{n} \mathbf{1}_{n}^{\top} X$ creates a matrix in which every entry of a column equals the sum of the corresponding column of $X$.

  The Sinkhorn operator iteratively normalizes rows and columns, ensuring that the resulting matrix has all row and column sums equal to 1, i.e., it is doubly stochastic.
- Gumbel Trick for Differentiable Sampling: To enable differentiable sampling of permutations from the distribution represented by the doubly stochastic matrix $S(R)$, the Gumbel-softmax trick is applied (see the sketch below):

  $$\Pi = \lim_{\tau \to 0^{+}} S\big((R + G)/\tau\big)$$

  Where:
  - $\Pi$ is the resulting permutation matrix.
  - $\tau$ is a temperature parameter. As $\tau$ approaches zero, the output of the Sinkhorn operator approaches a discrete permutation matrix.
  - $G$ is Gumbel noise, added to introduce randomness and make the sampling process differentiable. Each element is generated as $G_{ij} = -\log(-\log(u_{ij}))$, where $u_{ij}$ is a sample drawn from a uniform distribution on $(0, 1)$.

  By regarding permutation generation as an optimal transport problem and implementing it through the P-Net's parameterized and non-parameterized components, the framework transforms the input permutation distribution into a target distribution that the P-Net learns to make maximally challenging for the LLM.
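The sketch below is a minimal PyTorch illustration of this non-parametric pipeline, assuming a sigmoid non-linearity for the relationship matrix and a log-space implementation of the Sinkhorn normalization; it illustrates the mechanism described above and is not the authors' implementation.

```python
import torch

def sinkhorn(log_alpha: torch.Tensor, n_iters: int = 80) -> torch.Tensor:
    """Approximate the Sinkhorn operator S(.) in log space.

    log_alpha: (n, n) score matrix (here, (R + Gumbel noise) / tau).
    Returns an (approximately) doubly stochastic matrix.
    """
    for _ in range(n_iters):
        # Row normalization, then column normalization (log-space for stability).
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=1, keepdim=True)
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=0, keepdim=True)
    return log_alpha.exp()

def sample_soft_permutation(h: torch.Tensor, W: torch.Tensor,
                            tau: float = 0.1, n_iters: int = 80) -> torch.Tensor:
    """Relationship matrix R = sigma(H W H^T) -> Gumbel noise -> Sinkhorn.

    h: (n, d) demonstration representations from the encoder.
    W: (d, d) learnable weight matrix of the cross-relationship layer.
    Returns an (n, n) soft permutation matrix; small tau makes it near-discrete.
    """
    R = torch.sigmoid(h @ W @ h.T)                      # pairwise relationships
    u = torch.rand_like(R).clamp(1e-9, 1 - 1e-9)
    gumbel = -torch.log(-torch.log(u))                  # G_ij = -log(-log(u_ij))
    return sinkhorn((R + gumbel) / tau, n_iters)

# Toy usage: 4 demonstrations with 8-dimensional features.
h = torch.randn(4, 8)
W = torch.randn(8, 8) * 0.1
P = sample_soft_permutation(h, W)
print(P.sum(dim=0), P.sum(dim=1))  # both close to a vector of ones
```

As the temperature decreases, the returned matrix concentrates toward a hard permutation while remaining differentiable with respect to the P-Net parameters.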
The overall PEARL framework illustrating the interaction between the P-Net and the LLM is shown in Figure 3.

Figure 3: An overview of the learning framework. The P-Net is a small model incorporating the Sinkhorn operator, trained jointly with the LLM under the adversarial optimization algorithm. Note that the permutation matrix operates on the input sequence's embeddings (simplified here as text sequences for clarity). After training, only the LLM is retained while the P-Net is discarded.
4.2.3. Adversarial Optimization
The LLM and the P-Net are jointly trained using an adversarial optimization framework. Let $\theta$ represent the parameters of the LLM and $\phi$ represent the parameters of the P-Net.
For each sample $(p, x, y)$ from the empirical distribution $\hat{P}$:

- The P-Net generates an adversarial permutation that aims to maximize the LLM's loss.
- The LLM, in turn, tries to minimize its loss when subjected to this adversarial permutation.

The LLM's loss function, under the influence of the P-Net, is defined as:

$$\mathcal{L}_{\mathrm{LLM}}(\theta, \phi) = \mathbb{E}_{(p, x, y) \sim \hat{P}} \; \mathbb{E}_{\Pi \sim \mathrm{P\text{-}Net}_{\phi}(p, x, y)} \left[ \ell(\Pi \cdot p, x, y; \theta) \right]$$

Where:
- $\mathcal{L}_{\mathrm{LLM}}(\theta, \phi)$ is the LLM's loss, which is a function of both its own parameters $\theta$ and the P-Net's parameters $\phi$ (because $\phi$ influences the generated permutation $\Pi$).
- The double expectation is over training samples drawn from $\hat{P}$ and over permutations $\Pi$ sampled from the distribution generated by the P-Net parameterized by $\phi$.
- $\ell(\Pi \cdot p, x, y; \theta)$ is the loss of the LLM with parameters $\theta$ on a sample whose prompt has been permuted by $\Pi$.

To prevent the P-Net from generating trivial permutations (e.g., always producing the same, easily handled permutation, or permutations that destroy semantic meaning), an entropy-based regularization term is introduced for the P-Net:

$$\mathcal{L}_{\mathrm{ent}}(\phi) = H\big(S(R)\big)$$

Where:
- $\mathcal{L}_{\mathrm{ent}}(\phi)$ is the entropy regularization loss for the P-Net.
- $H(\cdot)$ denotes the element-wise entropy function. Maximizing this entropy encourages the P-Net to explore a wider range of permutations rather than collapsing to a single one.

Combining these, the overall objective becomes a two-player min-max optimization problem:

$$\min_{\theta} \max_{\phi} \; \mathcal{L}_{\mathrm{LLM}}(\theta, \phi) + \beta \, \mathcal{L}_{\mathrm{ent}}(\phi)$$

Where:
- $\min_{\theta}$ means the LLM tries to minimize this objective by adjusting its parameters $\theta$.
- $\max_{\phi}$ means the P-Net tries to maximize this objective by adjusting its parameters $\phi$.
- $\beta$ is a hyperparameter that controls the strength of the entropy regularization. A higher $\beta$ encourages more diverse permutations from the P-Net.

This minimax optimization is solved using alternating optimization (Algorithm 1: Adversarial Optimization Algorithm for PEARL):
| Algorithm 1 | Comment |
| --- | --- |
| **Input:** $\theta$ (LLM), $\phi$ (P-Net); learning rates $\eta_{\theta}$, $\eta_{\phi}$; inner steps $K$; entropy coefficient $\beta$ | |
| **repeat** | |
| $\quad$ **for** $k = 1, \ldots, K$ **do** | |
| $\qquad$ sample a mini-batch $(p, x, y) \sim \hat{P}$ | sample training examples |
| $\qquad$ sample $\Pi \sim \mathrm{P\text{-}Net}_{\phi}(p, x, y)$ | generate permutations |
| $\qquad$ compute $\mathcal{L}_{\mathrm{LLM}}(\theta, \phi)$ on prompts permuted by $\Pi$ | compute LLM loss |
| $\qquad$ compute $\mathcal{L}_{\mathrm{ent}}(\phi)$ | compute entropy regularization |
| $\qquad$ $\phi \leftarrow \phi + \eta_{\phi}\,\nabla_{\phi}\big[\mathcal{L}_{\mathrm{LLM}}(\theta,\phi) + \beta\,\mathcal{L}_{\mathrm{ent}}(\phi)\big]$ | update P-Net (gradient ascent) |
| $\quad$ **end** | |
| $\quad$ $\theta \leftarrow \theta - \eta_{\theta}\,\nabla_{\theta}\,\mathcal{L}_{\mathrm{LLM}}(\theta,\phi)$ | update LLM (gradient descent) |
| **until** convergence | |
Where:

- The inner **for** loop runs $K$ steps, updating the P-Net's parameters $\phi$ while the LLM parameters $\theta$ are held fixed. The P-Net's update is a gradient ascent step because it is the maximizer.
- After the $K$ inner steps, the LLM's parameters $\theta$ are updated by gradient descent to minimize its loss against the P-Net's generated permutations.
- This alternating process continues until convergence, ideally yielding an LLM that is robust to the worst-case permutations discoverable by the P-Net.
- Crucially, after training, only the LLM is retained and the P-Net is discarded, so there is no extra computational cost at inference time.
5. Experimental Setup
5.1. Datasets
The experiments are conducted in two main scenarios:
- In-Context Learning with Linear Functions (Synthetic Pre-training):
  - Source: Follows the methodologies of Garg et al. (2022) and Guo et al. (2024b).
  - Characteristics: The function class is the set of linear functions $f(x) = w^{\top} x$ with $w \in \mathbb{R}^{d}$, where $d$ is the input dimension. This synthetic setting is designed to isolate and study ICL behavior on simple, well-understood function classes.
  - Data Construction: Each data sample is constructed as follows (see the sketch at the end of this subsection):
    - Function sampling: A weight vector $w$ is sampled with each component drawn from a standard normal distribution, defining a linear function $f(x) = w^{\top} x$.
    - Input sampling: Inputs $x_1, \ldots, x_{k+1}$ are independently drawn.
    - Output generation: For each input, the corresponding output is computed as $y_i = w^{\top} x_i$ for $i = 1, \ldots, k+1$.
  - Prompt Structure: The input prompt consists of $k$ demonstrations and the $(k+1)$-th example as the query: $p = (x_1, y_1, \ldots, x_k, y_k, x_{k+1})$.
  - Why chosen: This synthetic task allows controlled study of ICL mechanisms and robustness to permutations without the complexities of natural language, providing clear insight into the model's learning capabilities.
- Instruction Fine-Tuning of Large Language Models (Real-world Tasks):
  - Source: Data derived from Super-Natural Instructions (Wang et al., 2022), which is part of the FLAN v2 benchmark (Chung et al., 2024).
  - Characteristics: A diverse set of 17 tasks, including 9 Natural Language Generation (NLG) tasks and 8 Natural Language Understanding (NLU) tasks.
  - Data Split: 4 datasets are designated as held-out test sets, and the remaining 13 datasets are used for training.
    - Each training dataset contains 150 examples.
    - Each test dataset contains 100 examples.
    - This results in a total training set of 1,950 examples and a test set of 400 examples.
  - Why chosen: Super-Natural Instructions is a comprehensive and diverse benchmark for evaluating LLMs on instruction following and generalization across various NLP tasks, making it suitable for assessing PEARL's performance in real-world scenarios.

The following are the details of the training datasets used for instruction tuning, from Table 5 of the original paper:

| Task ID | Task Name | Source | Category |
| --- | --- | --- | --- |
| 1297 | QASC Question Answering | QASC | Question Answering |
| 442 | COM_QA Paraphrase Question Generation | COM_QA | Question Rewriting |
| 908 | DialogRE Identify Familial Relationships | DialogRE | Speaker Relation Classification |
| 288 | Gigaword Summarization | Gigaword | Title Generation |
| 582 | Natural Questions Answer Generation | Natural Questions | Question Answering |
| 151 | TOMQA Find Location Easy Clean | TOM_QA | Question Answering |
| 1714 | ConvAI3 Sentence Generation | ClariQ | Dialogue Generation |
| 379 | AGNews Topic Classification | AG News | Text Categorization |
| 639 | MultiWOZ User Utterance Generation | MultiWOZ 2.2 | Dialogue Generation |
| 209 | Stance Detection Classification | StarCon | Stance Detection |
| 1516 | IMPPRES Natural Language Inference | IMPPRES | Textual Entailment |
| 589 | Amazon Food Summary Text Generation | Amazon Reviews | Summarization |
| 1285 | KPA Keypoint Matching | ArgKP | Text Matching |
The following is the dataset summary for instruction tuning, from Table 2 of the original paper:

| Split | Category | # Tasks | # Samples |
| --- | --- | --- | --- |
| Training | NLG | 7 | 1050 |
| Training | NLU | 6 | 900 |
| Testing | NLG | 2 | 200 |
| Testing | NLU | 2 | 200 |
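As referenced in the data-construction bullet above, here is a minimal NumPy sketch for generating one sample of the synthetic linear-function task; the standard-Gaussian input sampling is an assumption following the setup of Garg et al. (2022).

```python
import numpy as np

def make_linear_icl_sample(d: int = 20, k: int = 5, rng=np.random.default_rng(0)):
    """Build one synthetic ICL sample for the linear-function task.

    Returns k demonstration pairs plus one query pair; w and the inputs are
    drawn from a standard normal distribution (assumed, per Garg et al. 2022).
    """
    w = rng.standard_normal(d)                 # function sampling: f(x) = w . x
    X = rng.standard_normal((k + 1, d))        # input sampling (k demos + 1 query)
    y = X @ w                                  # output generation: y_i = w . x_i
    demos = list(zip(X[:k], y[:k]))            # (x_i, y_i) demonstration pairs
    query_x, query_y = X[k], y[k]              # the (k+1)-th example is the query
    return demos, query_x, query_y

demos, qx, qy = make_linear_icl_sample()
print(len(demos), qx.shape, float(qy))
```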
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a complete explanation:
- Normalized Squared Error (Normalized MSE)
  - Conceptual Definition: Used for the synthetic task of in-context learning of linear functions. Mean Squared Error (MSE) measures the average of the squared differences between the model's predictions and the true values; a lower MSE indicates that predictions are closer to the truth. The normalized version scales this error by the problem dimension, providing a standardized measure of prediction accuracy for linear regression.
  - Mathematical Formula: The paper reports the normalized squared error as:

    $$\mathrm{error} = \frac{\big(\mathrm{LM}(p) - w^{\top} x_{k+1}\big)^{2}}{d}$$

  - Symbol Explanation:
    - $\mathrm{LM}(p)$: The prediction made by the language model (LM) given the input prompt $p$.
    - $w^{\top} x_{k+1}$: The true output for the query input $x_{k+1}$, computed by the ground-truth linear function.
    - $d$: The dimension of the input vectors (problem dimension).
    - $\big(\mathrm{LM}(p) - w^{\top} x_{k+1}\big)^{2}$: The squared difference between the prediction and the true value.
- ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence)
  - Conceptual Definition: ROUGE-L is a metric for evaluating text generation models, particularly for summarization or open-ended generation tasks. It quantifies the overlap between a generated text and a reference text via the longest common subsequence (LCS): the longest sequence of words appearing in both texts in the same relative order, though not necessarily contiguously. ROUGE-L is robust to minor syntactic variations and focuses on main content and structure. It is often reported as an F-measure combining ROUGE-L Recall and ROUGE-L Precision. A higher ROUGE-L score indicates better agreement with the reference (a minimal implementation sketch follows this list).
  - Mathematical Formula (Standard Formulation): Let $X$ be the reference text of length $m$ and $Y$ be the generated text of length $n$. ROUGE-L Recall, Precision, and F-measure are calculated as:

    $$R_{\mathrm{lcs}} = \frac{\mathrm{LCS}(X, Y)}{m}, \qquad P_{\mathrm{lcs}} = \frac{\mathrm{LCS}(X, Y)}{n}, \qquad F_{\mathrm{lcs}} = \frac{(1 + \beta^{2})\, R_{\mathrm{lcs}}\, P_{\mathrm{lcs}}}{R_{\mathrm{lcs}} + \beta^{2} P_{\mathrm{lcs}}}$$

    Often $\beta$ is set to 1, giving equal weight to Recall and Precision.
  - Symbol Explanation:
    - $\mathrm{LCS}(X, Y)$: The length of the longest common subsequence between texts $X$ and $Y$.
    - $m$: The length (number of words) of the reference text $X$.
    - $n$: The length (number of words) of the generated text $Y$.
    - $\beta$: A non-negative coefficient controlling the relative importance of Recall and Precision. A larger $\beta$ weights Recall more, while a smaller $\beta$ weights Precision more.
- Attack Success Rate (ASR)
  - Conceptual Definition: ASR measures the proportion of samples for which an adversarial attack (here, a permutation-based attack) degrades the model's performance beyond a specified threshold. A higher ASR indicates greater vulnerability of the model to the attack (a toy computation follows this list).
  - Mathematical Formula:

    $$\mathrm{ASR}(\mathcal{D}, s) = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \mathbb{1}\!\left[ \frac{\bar{g}_i - \omega_i}{\bar{g}_i} \geq s \right]$$

  - Symbol Explanation:
    - $\mathrm{ASR}(\mathcal{D}, s)$: The Attack Success Rate for a task dataset $\mathcal{D}$ and a performance-degradation threshold $s$.
    - $|\mathcal{D}|$: The total number of samples in the dataset $\mathcal{D}$.
    - $\mathbb{1}[\cdot]$: The indicator function, which returns 1 if its argument is true and 0 otherwise.
    - $\bar{g}_i$: The average performance of the $i$-th sample across all possible permutations, $\bar{g}_i = \frac{1}{n!} \sum_{\Pi \in \mathcal{P}} g(\Pi \cdot p_i, x_i; y_i)$, where $g$ is the performance metric (e.g., ROUGE-L), $\mathcal{P}$ is the set of all possible permutations, and $n!$ is the total number of permutations of $n$ demonstrations.
    - $\omega_i$: The compromised performance of the $i$-th sample achieved by the permutation attack strategy. For the Exhaustive Search Attack, $\omega_i = \min_{\Pi \in \mathcal{P}} g(\Pi \cdot p_i, x_i; y_i)$. For the Neural Search Attack (using the P-Net), $\omega_i = g(\Pi_i \cdot p_i, x_i; y_i)$ with $\Pi_i \sim \mathrm{P\text{-}Net}(p_i, x_i, y_i)$.
    - $s$: The relative performance-degradation threshold, expressed as a percentage (e.g., 50%). An attack is successful if performance drops by at least this fraction relative to the average performance.
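The following minimal Python sketches implement the two metrics above for illustration. The ROUGE-L function uses a plain dynamic-programming LCS (real evaluations typically rely on an established ROUGE library), and the ASR function assumes you have already scored every permutation of each sample with some metric g.

```python
def rouge_l(reference: str, generated: str, beta: float = 1.0) -> float:
    """Word-level ROUGE-L F-measure via dynamic-programming LCS."""
    ref, gen = reference.split(), generated.split()
    m, n = len(ref), len(gen)
    # dp[i][j] = LCS length of ref[:i] and gen[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ref[i - 1] == gen[j - 1] \
                       else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    recall, precision = lcs / m, lcs / n
    return (1 + beta**2) * recall * precision / (recall + beta**2 * precision)

print(round(rouge_l("the cat sat on the mat", "the cat lay on a mat"), 3))
```

And a toy ASR computation on synthetic per-permutation scores:

```python
import numpy as np

def attack_success_rate(per_perm_scores: np.ndarray,
                        attacked_scores: np.ndarray,
                        threshold: float = 0.5) -> float:
    """ASR as defined above.

    per_perm_scores: (num_samples, num_permutations) metric values g for every
        permutation of each sample (e.g., ROUGE-L per ordering).
    attacked_scores: (num_samples,) compromised performance omega_i chosen by
        the attack (worst permutation, or the P-Net's sampled one).
    threshold: relative degradation s required to count the attack as a success.
    """
    avg = per_perm_scores.mean(axis=1)                      # average performance
    rel_drop = (avg - attacked_scores) / np.maximum(avg, 1e-9)
    return float((rel_drop >= threshold).mean())

# Toy example: 3 samples x 6 permutations; the attack picks the worst ordering.
scores = np.array([[0.6, 0.5, 0.55, 0.2, 0.58, 0.52],
                   [0.7, 0.71, 0.69, 0.68, 0.72, 0.70],
                   [0.4, 0.1, 0.42, 0.39, 0.41, 0.38]])
print(attack_success_rate(scores, scores.min(axis=1), threshold=0.5))
```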
5.3. Baselines
The paper compares PEARL against several baseline learning algorithms:
- For Linear Functions (Synthetic Pre-training):
  - ERM+CL (Empirical Risk Minimization with Curriculum Learning): The standard ERM approach in which the model is trained to minimize the average loss. Curriculum Learning (Bengio et al., 2009; Wu et al., 2020) is added, meaning the training process gradually increases the number of demonstrations (shots) presented to the model. This allows progressive learning of more complex patterns and helps stabilize training. It serves as a strong baseline for ICL on linear functions.
- For Instruction Fine-Tuning of LLMs (Real-world Tasks):
  - ERM (Empirical Risk Minimization) (Min et al., 2022): The standard fine-tuning approach that minimizes the average loss over the training dataset. It is adopted by mainstream instruction tuning models such as FLAN (Chung et al., 2024), Natural Instructions (Mishra et al., 2022; Wang et al., 2022), and MetaICL (Min et al., 2022). It represents the default behavior of LLMs fine-tuned without specific robustness considerations.
  - ERM + DS (Empirical Risk Minimization with Demonstration Shuffling) (Zhang et al., 2018): Enhances ERM by randomly shuffling the order of in-context demonstrations within each sample at each training step. This introduces a basic form of data augmentation that exposes the model to different permutations, aiming for some level of robustness. It can be seen as an epoch-level data augmentation strategy.
  - ERM + IM (Empirical Risk Minimization with Instance Mixup) (Zhang et al., 2018): Incorporates the Instance Mixup technique at each training step. For each data point, multiple augmented versions are generated by randomly selecting different in-context demonstrations; the losses of these augmented versions are computed and averaged, and a single backward pass is performed on the averaged loss. This provides finer-grained data augmentation than simple shuffling. Comparing it with PEARL contrasts min-mean optimization (ERM+IM) with min-max optimization (PEARL).
  - InfoAC (Xiang et al., 2024): A training-stage method that uses contrastive learning to allow earlier tokens in an autoregressive LM to access information from later tokens. It is specifically designed to mitigate the order sensitivity inherent in causal language models by breaking the strict autoregressive constraint.

These baselines are chosen to represent both standard ERM training and existing attempts at addressing permutation sensitivity through data augmentation or architectural modifications, providing a comprehensive evaluation context for PEARL.
5.4. Implementation Details
5.4.1. Architecture and Training for Linear Functions
- LLM: Implemented as a GPT-2 base model (Radford et al., 2019) with 12 layers, 8 attention heads, and a hidden dimension of 256. It takes a sequence of vectors in its embedding space and predicts the next vector.
- P-Net: Initialized as a BERT-base-sized Transformer encoder.
- Training: Both the LLM and the P-Net are trained from scratch.
  - The LLM is pre-trained on a generated dataset of 40k linear functions.
  - Optimizer: AdamW (Loshchilov & Hutter, 2019).
  - Batch size: 128.
  - Training steps: 500k.
  - Checkpoint selection: Based on validation set performance.
- Testing: New functions are sampled to evaluate the model's ability to infer new weights from in-context demonstrations.
5.4.2. Architecture and Training for Instruction Fine-Tuning
- LLMs Evaluated: Llama3-8B, Llama2-7B, Llama2-13B, Mistral-7B, and Gemma-7B.
- P-Net: A FLAN-large encoder.
- Fine-tuning method: Both the LLM and the P-Net are fine-tuned using LoRA (Low-Rank Adaptation) (Hu et al., 2022).
  - The number of fine-tuned parameters of the P-Net is about 1/20 that of the LLM.
- Training Details:
  - GPU: Single NVIDIA A40.
  - Epochs: 2.
  - Batch size: 16.
  - Total training steps: 246.
  - Optimizer: AdamW.
  - Learning rates: P-Net 1e-4; LLM 3e-5.
- Sinkhorn algorithm parameters:
  - Iterations: 80.
  - Temperature parameter: 0.1.
  - Entropy constraint coefficient (β): 1.0.

The following are the hyperparameter settings used in the main experiment, from Table 6 of the original paper:

| Category | Hyperparameter | Value |
| --- | --- | --- |
| LLMs | Learning rate | 3e-5 |
| | Batch size | 16 |
| | Max sequence length | 512 |
| | Weight decay coefficient | 0.1 |
| | Epochs | 2 |
| LoRA | Rank | 8 |
| | Alpha | 32 |
| | Dropout | 0.1 |
| | P-Net target modules | q, v |
| | LLMs target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| P-Net | Temperature | 0.1 |
| | Iteration coefficient | 80 |
| | Entropy constraint | 1.0 |
| | Noise | 0.3 |
| | Learning rate | 1e-4 |
| | Batch size | 16 |
| | Max sequence length | 512 |
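As a rough illustration of these LoRA settings (a sketch using the Hugging Face peft library, not the authors' code), the adapter configurations might look like:

```python
from peft import LoraConfig

# LoRA configuration matching the Table 6 settings for the LLM
# (model loading and get_peft_model(...) omitted for brevity).
llm_lora = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# The P-Net (a FLAN-large encoder) adapts only its attention q/v projections.
pnet_lora = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1,
                       target_modules=["q", "v"])
```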
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Performance on In-Context Learning with Linear Functions
The experiments on linear functions highlight PEARL's ability to improve both average and worst-case performance while defending against permutation attacks.
The following are the normalized MSE across permutations from Table 1 of the original paper:
| Shot | Method | Avg. | Worst. |
| --- | --- | --- | --- |
| 3 | ERM+CL | 1.45 | 2.67 |
| 3 | PEARL | 0.86 (+40.7%) | 0.92 (+65.5%) |
| 4 | ERM+CL | 1.20 | 3.34 |
| 4 | PEARL | 0.79 (+34.1%) | 1.11 (+66.8%) |
| 5 | ERM+CL | 1.28 | 5.03 |
| 5 | PEARL | 0.87 (+32.0%) | 1.33 (+73.6%) |
- Average vs. Worst-Case Performance: Table 1 shows that for the ERM+CL baseline there is a significant gap between average and worst-case performance. For example, at 5 shots the average normalized MSE is 1.28, but the worst case plummets to 5.03. This large degradation indicates severe vulnerability to permutations. As the number of shots increases, the relative performance drop for ERM+CL worsens (from 74.6% at 3 shots to 84.1% at 4 shots), effectively negating the benefit of providing more contextual information.
- PEARL's Improvement: PEARL consistently improves both average and worst-case performance. For instance, at 5 shots PEARL achieves an average MSE of 0.87 (32.0% better than ERM+CL) and a worst-case MSE of 1.33 (73.6% better than ERM+CL). This demonstrates PEARL's effectiveness in making the model more robust to challenging permutations.
- Worst-Case Generalization: While PEARL's average performance gains tend to stabilize with more shots, its worst-case performance gains continue to increase (from 65.5% at 3 shots to 73.6% at 5 shots). This highlights PEARL's strength in enhancing the model's fundamental robustness and generalization under adverse conditions.

The defense capability against permutation attacks is shown in Figure 4.
Figure 4: Comparison of attack success rates between ERM+CL and PEARL at 3, 4, and 5 shots across different threshold ratios; PEARL shows consistently lower attack success rates, reflecting its robustness to permutation attacks.
- Attack Success Rates (ASR): Figure 4 illustrates PEARL's superior defense against permutation attacks. The ASR for PEARL is consistently lower than that of ERM+CL across all attack success thresholds and shot numbers.
- Robustness at High Thresholds: PEARL's advantage becomes more pronounced at higher thresholds, where its defense success rate (the inverse of ASR) is approximately double that of the baseline. This implies PEARL is particularly effective at preventing large performance degradations caused by malicious permutations.
- Scalability: PEARL's performance improves with an increasing number of shots, suggesting better scalability in maintaining robustness in more complex ICL scenarios.
6.1.2. Performance on Instruction Fine-Tuning of Large Language Models
PEARL's effectiveness extends to real-world instruction fine-tuning tasks, demonstrating improvements across various LLMs and scaling scenarios.
The following are the average and worst-case performance of Llama3-8B on four held-out tasks from Table 3 of the original paper:
| Average | CSQA | CurDial | CoLA | TMW | |||||||
| # Shot | Method | Avg. | Worst. | Avg. | Worst. | Avg. | Worst. | Avg. | Worst. | Avg. | Worst. |
| 2 | ERM | 57.3 | 49.4 | 58.0 | 54.0 | 57.9 | 43.4 | 62.0 | 58.0 | 51.1 | 42.0 |
| ERM+DS | 57.5 (-0.2) | 48.6 (-1.6) | 62.0 | 54.0 | 54.1 | 37.8 | 61.0 | 60.0 | 51.5 | 42.7 | |
| ERM+IM | 53.5(-6.6) | 44.4 (-10.1) | 63.0 | 54.0 | 44.7 | 28.1 | 57.0 | 56.3 | 49.4 | 39.2 | |
| INFOAC | 55.7 (-2.9) | 47.6 (-3.7) | 57.5 | 56.0 | 53.4 | 36.4 | 63.0 | 61.5 | 48.7 | 37.3 | |
| PEARL | 62.9 (+9.8) | 56.4 (+14.2) | 65.0 | 62.0 | 60.3 | 50.7 | 71.0 | 68.0 | 55.1 | 44.8 | |
| 3 | ERM | 57.8 | 38.3 | 57.7 | 47.0 | 61.4 | 25.9 | 61.9 | 52.0 | 50.3 | 29.4 |
| ERM+DS | 56.1 (-2.9) | 39.7 (+3.7) | 60.0 | 46.0 | 54.1 | 25.4 | 60.0 | 56.0 | 50.3 | 31.5 | |
| ERM+IM | 55.3 (-4.3) | 39.8(+3.9) | 59.0 | 46.0 | 54.6 | 28.0 | 57.6 | 553.1 | 50.0 | 31.9 | |
| INFOAC | 56.3 (-2.6) | 39.5(+3.1) | 59.3 | 49.0 | 55.2 | 24.3 | 62.1 | 55.8 | 48.4 | 28.8 | |
| PEARL | 63.1 (+9.2) | 46.9 (+22.5) | 68.4 | 62.0 | 66.7 | 34.8 | 64.7 | 56.0 | 52.4 | 34.7 | |
| 4 | ERM | 59.7 | 30.6 | 61.3 | 38.0 | 62.9 | 21.3 | 63.3 | 45.8 | 51.1 | 17.5 |
| ERM+DS | 57.7 (-3.4) | 31.8 (+3.9) | 63.3 | 40.0 | 57.3 | 17.6 | 60.1 | 52.0 | 49.9 | 17.8 | |
| ERM+IM | 56.0 (-6.2) | 32.4(+5.9) | 63.2 | 42.0 | 53.7 | 17.8 | 57.6 | 48.5 | 49.6 | 21.3 | |
| INFOAC | 58.6(-1.8) | 33.0 (+7.8) | 63.7 | 44.0 | 58.7 | 19.0 | 63.9 | 51.0 | 48.1 | 17.0 | |
| PEARL | 63.1 (+5.7) | 39.6 (+29.4) | 68.4 | 52.0 | 69.2 | 31.3 | 64.7 | 52.0 | 50.1 | 23.0 | |
- Consistent Improvement: Table 3 demonstrates that PEARL consistently improves both average and worst-case performance of Llama3-8B across the four held-out tasks. For 2-shot, PEARL yields a 9.8% average gain and a 14.2% worst-case gain over ERM.
- Worst-Case Gain with More Shots: As the number of shots increases, the relative worst-case performance gain over ERM progressively increases, from 14.2% at two shots to 29.4% at four shots. This reinforces PEARL's ability to build robustness against increasingly complex permutation spaces.
- Superior Average Performance: Despite optimizing for worst-case performance, PEARL also achieves superior average performance, with gains ranging from 5.7% to 9.8%. This suggests that making the model robust to difficult examples also improves its general performance.
- Comparison with Baselines: The other baselines (ERM+DS, ERM+IM, and InfoAC) generally show limited or even negative impact on average performance, and often only marginal gains (or even losses) in worst-case performance compared to ERM, highlighting PEARL's distinct advantage. The rapid convergence and effectiveness on an advanced LLM like Llama3 suggest that DRO (focusing on challenging permutations) is more effective than random data augmentation.

The generalization performance of the method across different types of LLMs and many-shot settings is shown in Figure 5.
Figure 5: Generalization performance of the method across different types of LLMs and many-shot settings. Left: Performance gains on 3-shot across different LLMs (Mistral-7B, Gemma-7B, Llama2-7B, and Llama3-8B). Right: Scaling behavior across many-shot settings (8, 16, 32, and 64 shots) and longer sequences (8k tokens) when trained with 5 shots and a sequence length of 512 tokens; bars distinguish average and worst-case performance gains.
- Generalization Across LLMs (Left Panel of Figure 5): PEARL consistently improves worst-case performance by more than 10% in three-shot settings across various LLMs (Mistral-7B, Gemma-7B, Llama2-7B, and Llama3-8B), confirming the method's broad applicability. Llama models are the most sensitive to permutations, followed by Gemma and Mistral, yet PEARL provides consistent improvements across all of them.
- Scalability to Many-Shot and Long Contexts (Right Panel of Figure 5): Despite being trained on limited configurations (5 shots, 512-token max sequence length), PEARL demonstrates strong generalization to many-shot ICL (up to 64 shots) and longer sequences (up to 8,000 tokens), achieving substantial worst-case performance gains of 24% to 40% in these scaled scenarios. This indicates that PEARL helps LLMs learn robust features that are not overfit to the training context length or number of shots.

The following is the shot efficiency comparison from Table 4 of the original paper:

| # Shots | 2 | 4 | 8 | 16 | 32 | 64 |
| --- | --- | --- | --- | --- | --- | --- |
| ERM | 57.3 | 59.7 | 61.8 | 66.9 | 67.4 | 68.1 |
| PEARL | 62.9 | 63.1 | 66.5 | 70.5 | 70.0 | 70.4 |

- Shot Efficiency (Table 4): PEARL-trained models achieve comparable average performance while requiring significantly fewer shots. For example, PEARL with 2 shots (62.9) already outperforms ERM with 8 shots (61.8), PEARL at 4 shots (63.1) is better than ERM at 8 shots, and PEARL at 8 shots (66.5) is close to ERM at 16 shots (66.9). This highlights PEARL's efficiency: it can reach the desired performance level with much less in-context information.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Hyperparameters in Instruction Finetuning
The following shows the impact of the number of iterations and the temperature on the average/worst-case performance, from Figure 6 of the original paper (values are average / worst-case):

| # Iter. | Temperature 0.03 | Temperature 0.1 | Temperature 0.3 |
| --- | --- | --- | --- |
| 80 | 55.7 / 40.0 | 55.7 / 40.0 | 55.4 / 39.6 |
| 200 | 55.7 / 40.0 | 55.8 / 40.0 | 55.8 / 40.6 |
- Effect of Sinkhorn Algorithm Parameters (Table from Figure 6): The analysis of the Sinkhorn algorithm's parameters (number of iterations and temperature) reveals a surprising robustness. With the entropy regularization coefficient fixed at 1, varying the number of iterations (80, 200) and the temperature (0.03, 0.1, 0.3) resulted in minimal performance variation. This suggests that the Sinkhorn algorithm within PEARL is not highly sensitive to these specific parameters, implying a wide range of stable configurations for practical deployment.

The impact of the entropy coefficient is shown in Figure 7.
Figure 7: Impact of the entropy coefficient β on the gradient norms of the P-Net and the LLM.
- Influence of Entropy Regularization (Figure 7): The entropy regularization coefficient β plays a crucial role in the P-Net's learning.
  - At a low coefficient (0.3), the P-Net's gradient norm remains small, indicating that it learns minimally and might generate simple semantic overlaps rather than truly challenging permutations. The LLM's gradient also struggles to decrease.
  - The P-Net's gradient norm peaks at a coefficient of 1.0, suggesting optimal learning conditions where it effectively finds challenging permutations.
  - As the coefficient increases further (3.0 and 10.0), the P-Net's gradient norm decreases again, indicating that excessive regularization restricts its ability to explore diverse permutations.
  - The range 1.0-3.0 appears to strike an ideal balance, encouraging the P-Net to extract meaningful adversarial information. The LLM's gradient norm consistently decreases with increasing coefficients, showing a clear response to the entropy regularization.
6.3. Extended Instruction Finetuning Across Diverse LLMs
The following are the instruction fine-tuning results for Mistral-7B evaluated on four held-out tasks from Table 7 of the original paper:
| Average | CSQA | CurDial | CoLA | TMW | |||||||
| # Shot | Method | Avg. | Worst. | Avg. | Worst. | Avg. | Worst. | Avg. | Worst. | Avg. | Worst. |
| 2 | ERM PEARL | 64.1 67.0 (+4.5) | 58.1 62.4 (+7.5) | 67.0 68.0 | 64.0 66.0 | 54.6 59.4 | 41.8 49.0 | 81.0 82.0 | 78.0 78.0 | 53.7 58.4 | 48.5 56.7 |
| 3 | ERM PEARL | 66.6 69.5 (+4.3) | 56.1 62.8 (+12.0) | 67.0 70.0 | 62.0 66.0 | 63.7 70.1 | 38.9 60.1 | 80.0 83.6 | 76.0 78.0 | 55.6 54.1 | 47.3 47.0 |
| 4 | ERM PEARL | 66.7 68.3 (+2.5) | 50.4 57.1 (+13.4) | 68.9 69.9 | 60.0 62.0 | 67.6 71.6 | 47.8 54.8 | 74.2 74.9 | 52.0 66.0 | 55.9 56.8 | 41.6 45.5 |
| 5 | ERM PEARL | 67.9 70.2 (+3.4) | 50.7 58.1 (+14.5) | 67.5 70.4 | 56.0 64.0 | 70.7 76.7 | 52.6 59.3 | 76.0 73.3 | 56.0 66.0 | 57.4 60.4 | 38.2 43.0 |
-
Mistral-7B (Table 7):
PEARLprovides consistent improvements for Mistral-7B, with worst-case performance gains reaching 14.5% at 5 shots.The following are the instruction fine-tuning results for Gemma-7B evaluated on four held-out tasks from Table 8 of the original paper:
Average CSQA CurDial CoLA TMW # Shot Method Avg. Worst. Avg. Worst. Avg. Worst. Avg. Worst. Avg. Worst. 2 ERM PEARL 66.2 66.3 (+0.0) 59.5 60.7 (+2.0) 71.0 74.0 70.0 68.0 59.1 47.3 46.1 39.2 77.0 82.0 70.0 78.0 57.8 61.7 52.0 57.6 3 ERM PEARL 64.7 68.4 (+5.8) 52.5 59.3 (+13.0) 70.7 74.7 64.0 68.0 67.1 59.2 45.2 42.5 70.3 78.7 60.0 76.0 50.5 61.0 40.7 50.6 4 ERM PEARL 65.0 67.2 (+3.4) 46.5 52.5(+13.0) 65.0 71.4 54.0 60.0 71.4 60.7 41.1 38.9 72.5 75.9 58.0 66.0 51.1 60.8 32.9 45.2 5 ERM PEARL 64.3 66.3 (+3.1) 46.3 51.0 (+10.2) 65.9 70.3 54.0 60.0 73.4 63.4 48.3 43.6 65.6 71.3 50.0 60.0 52.3 60.2 32.9 40.4 -
- Gemma-7B (Table 8): For Gemma-7B, PEARL yields robust worst-case improvements, achieving 13.0% gains at both 3 and 4 shots, and 10.2% at 5 shots.

The following are the instruction fine-tuning results for Llama2-7B evaluated on four held-out tasks from Table 9 of the original paper:
| # Shot | Method | Average (Avg. / Worst.) | CSQA (Avg. / Worst.) | CurDial (Avg. / Worst.) | CoLA (Avg. / Worst.) | TMW (Avg. / Worst.) |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | ERM | 56.6 / 46.3 | 56.0 / 50.0 | 61.3 / 50.2 | 58.2 / 42.0 | 50.7 / 43.1 |
| 2 | PEARL | 57.4 (+1.5) / 46.5 (+0.4) | 58.0 / 48.0 | 55.2 / 44.7 | 62.0 / 48.0 | 54.4 / 45.4 |
| 3 | ERM | 58.2 / 34.0 | 52.7 / 34.0 | 64.0 / 36.4 | 66.0 / 36.0 | 50.1 / 29.4 |
| 3 | PEARL | 59.6 (+2.3) / 40.4 (+19.1) | 56.3 / 40.0 | 66.2 / 46.2 | 67.0 / 42.0 | 48.7 / 33.5 |
| 4 | ERM | 58.9 / 19.9 | 60.0 / 26.0 | 68.1 / 24.4 | 60.2 / 14.0 | 47.3 / 15.1 |
| 4 | PEARL | 60.5 (+2.7) / 31.6 (+59.1) | 61.2 / 40.0 | 69.4 / 40.1 | 62.4 / 24.0 | 48.9 / 22.4 |
| 5 | ERM | 61.9 / 25.8 | 59.0 / 32.0 | 74.2 / 43.9 | 65.7 / 10.0 | 48.6 / 17.1 |
| 5 | PEARL | 62.9 (+1.6) / 32.1 (+24.7) | 62.4 / 38.0 | 73.3 / 43.4 | 64.8 / 24.0 | 51.0 / 23.0 |
- Llama2-7B (Table 9): This model shows particularly high sensitivity, with ERM worst-case performance dropping as low as 19.9 at 4 shots. PEARL provides dramatic improvements, boosting worst-case performance by 59.1% at 4 shots and 24.7% at 5 shots.

The following are the instruction fine-tuning results for Llama2-13B evaluated on four held-out tasks from Table 10 of the original paper:
| # Shot | Method | Average (Avg. / Worst.) | CSQA (Avg. / Worst.) | CurDial (Avg. / Worst.) | CoLA (Avg. / Worst.) | TMW (Avg. / Worst.) |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | ERM | 66.3 / 56.6 | 56.0 / 46.0 | 72.6 / 56.2 | 83.0 / 76.0 | 53.4 / 48.0 |
| 2 | PEARL | 67.9 (+2.4) / 60.7 (+7.3) | 64.0 / 58.0 | 73.8 / 64.2 | 81.0 / 76.0 | 52.6 / 44.4 |
| 3 | ERM | 65.7 / 46.2 | 55.7 / 38.0 | 76.4 / 51.3 | 77.7 / 56.0 | 53.1 / 39.6 |
| 3 | PEARL | 68.5 (+4.2) / 50.3 (+8.7) | 62.7 / 44.0 | 81.0 / 58.4 | 76.7 / 56.0 | 53.5 / 42.6 |
| 4 | ERM | 65.8 / 33.2 | 58.2 / 28.0 | 79.6 / 41.6 | 73.7 / 38.0 | 51.8 / 25.0 |
| 4 | PEARL | 66.4 (+0.9) / 40.2 (+21.1) | 63.3 / 42.0 | 80.4 / 45.5 | 69.4 / 42.0 | 53.1 / 29.1 |
- Llama2-13B (Table 10): Similar to Llama2-7B, PEARL significantly improves worst-case performance, with a 21.1% gain at 4 shots.

The following are the results across 8-, 16-, 32-, and 64-shot settings comparing the PEARL and ERM learning algorithms for Llama3-8B on four held-out tasks, with gains (%) relative to ERM, from Table 11 of the original paper:
| # Shot | Method | Average (Avg. / Worst.) | CSQA (Avg. / Worst.) | CurDial (Avg. / Worst.) | CoLA (Avg. / Worst.) | TMW (Avg. / Worst.) |
| --- | --- | --- | --- | --- | --- | --- |
| 8 | ERM | 61.8 / 21.3 | 61.4 / 36.0 | 68.3 / 22.7 | 62.7 / 16.0 | 54.8 / 10.6 |
| 8 | PEARL | 66.5 (+7.6) / 29.7 (+39.2) | 67.7 / 44.0 | 77.1 / 28.7 | 65.0 / 32.0 | 56.2 / 14.0 |
| 16 | ERM | 66.9 / 21.3 | 67.3 / 36.0 | 76.5 / 31.4 | 67.2 / 8.0 | 56.5 / 9.8 |
| 16 | PEARL | 70.5 (+5.3) / 26.3 (+23.7) | – | – | – | 56.9 / 9.7 |
| 32 | ERM | 67.4 / 19.3 | 67.5 / 32.0 | 77.8 / 30.7 | 68.2 / 6.0 | 56.1 / 8.6 |
| 32 | PEARL | 70.0 (+3.8) / – | 70.9 / 46.0 | 83.9 / 37.5 | 70.1 / 12.0 | – |
| 64 | ERM | 68.1 / 20.6 | 68.1 / 38.0 | 76.9 / 27.7 | 72.2 / 8.7 | 55.0 / 8.0 |
| 64 | PEARL | – / 26.4 (+36.4) | 70.0 / 44.0 | 82.6 / 40.3 | 70.6 / 12.0 | 56.6 / 9.1 |
- Scaling to Many-Shot (Table 11): Even when trained with only 5 shots, PEARL exhibits strong generalization to many-shot scenarios (8, 16, 32, and 64 shots). It consistently achieves substantial worst-case performance gains over ERM (e.g., 39.2% at 8 shots and 36.4% at 64 shots), underscoring its ability to enable LLMs to learn robust features that generalize effectively to larger contexts.

The following is the best-performance comparison between ERM and PEARL from Table 12 of the original paper:
| # Shot | Method | Average | Gain | CSQA | CurDial | CoLA | TMW |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | ERM | 64.1 | - | 68.8 | 64.4 | 64.1 | 59.2 |
| 2 | PEARL | 68.8 | 7.2% | 73.4 | 69.2 | 70.3 | 62.1 |
| 3 | ERM | 72.8 | - | 70.3 | 85.0 | 65.6 | 70.3 |
| 3 | PEARL | 77.0 | 5.7% | 73.4 | 87.9 | 79.7 | 66.9 |
| 4 | ERM | 82.9 | - | 81.3 | 92.4 | 78.1 | 79.7 |
| 4 | PEARL | 84.3 | 1.7% | 82.8 | 93.6 | 81.2 | 79.5 |
| 5 | ERM | 86.8 | - | 84.4 | 95.3 | 81.3 | 86.2 |
| 5 | PEARL | 89.3 | 2.9% | 87.5 | 96.5 | 85.9 | 87.3 |
- Best-Case Performance (Table 12): Surprisingly, PEARL not only improves worst-case scenarios but also slightly enhances best-case performance across all datasets and shot conditions compared to ERM. This suggests that the robustness learned through DRO generalizes to improve performance even under optimal permutation conditions, indicating a more comprehensively capable model. (A sketch of how these average, worst-case, and best-case statistics are computed over permutations follows below.)
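To connect the Avg., Worst., and best-case columns above to a concrete procedure, here is a minimal sketch of evaluating one task under every ordering of its demonstrations. The names (`permutation_stats`, `evaluate`) are illustrative rather than taken from the paper, and the toy scorer stands in for an actual LLM evaluation.

```python
from itertools import permutations
from statistics import mean

def permutation_stats(demos, evaluate):
    """Evaluate a task under every ordering of the in-context demonstrations.

    demos: list of demonstration examples (an order-independent set).
    evaluate: callable mapping an ordered tuple of demos to a task score,
        e.g. accuracy on a held-out evaluation set.
    Returns (average, worst, best) scores over all len(demos)! orderings.
    """
    scores = [evaluate(order) for order in permutations(demos)]
    return mean(scores), min(scores), max(scores)

if __name__ == "__main__":
    demos = ["ex1", "ex2", "ex3"]
    # order-sensitive toy scorer standing in for an LLM evaluation
    dummy = lambda order: 0.6 + 0.1 * (order[0] == "ex1")
    avg, worst, best = permutation_stats(demos, dummy)
    print(f"avg={avg:.2f} worst={worst:.2f} best={best:.2f}")
```

Because the number of orderings grows factorially with the shot count, exhaustive evaluation like this is only feasible for small shot counts, which is one motivation for the learned P-Net.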
6.4. Vulnerability Reassessment of LLaMA-3
As shown in the right panels of Figure 1, permutation-based attacks are effective and easy to mount, even on advanced LLMs like LLaMA-3.

Figure 1: Performance and attack success rates of Llama-3 on CurDial and TMW datasets. Left panels: Random, average and worst-case performance as a function of shot number. Right panels: Attack success rates for exhaustive and neural search attack methods at different thresholds.
- Attack Effectiveness: The exhaustive search attack successfully attacks over 50% and 80% of samples on the CurDial and TMW datasets, respectively, showing the upper bound of vulnerability.
- Neural Attack Efficacy: The neural attack (using the P-Net for adversarial generation) achieves a success rate close to this upper bound across different shot counts. This validates the P-Net's ability to efficiently approximate worst-case permutations, demonstrating that the vulnerability is a practical concern and not just a theoretical one. (See the sketch below for how the two attack modes differ in cost.)
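As a rough illustration of how such an attack success rate can be measured, the sketch below counts a sample as successfully attacked if some ordering of its demonstrations makes the model answer incorrectly. The exhaustive mode enumerates all orderings (the upper bound above), while passing a `propose` callable, standing in for a trained P-Net, checks only a single proposed ordering. All names here (`attack_success_rate`, `demos_for`, `predict_correct`, `propose`) are assumptions for illustration, not the paper's evaluation code.

```python
from itertools import permutations

def attack_success_rate(samples, demos_for, predict_correct, propose=None):
    """Share of samples where some permutation of the demonstrations breaks the model.

    demos_for: callable returning the demonstration set for a sample.
    predict_correct: callable (ordered_demos, sample) -> bool.
    propose: optional callable (demos, sample) -> one adversarial ordering
        (e.g. from a trained P-Net); if None, search all orderings exhaustively.
    """
    attacked = 0
    for sample in samples:
        demos = demos_for(sample)
        orders = permutations(demos) if propose is None else [propose(demos, sample)]
        # success if any tested ordering makes the model answer incorrectly
        if any(not predict_correct(order, sample) for order in orders):
            attacked += 1
    return attacked / len(samples)
```

The efficiency gap is stark: exhaustive search costs k! model evaluations per sample for k demonstrations, whereas the neural attack costs a single one.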
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper effectively identifies and addresses a critical vulnerability in Large Language Models (LLMs): their sensitivity to the ordering of in-context learning (ICL) demonstrations. The authors demonstrate that this permutation sensitivity can be exploited as a "natural attack," significantly degrading LLM performance. To counter this, they introduce PEARL (Permutation-resilient learning), a novel training framework built upon Distributionally Robust Optimization (DRO). PEARL utilizes a permutation-proposal network (P-Net) that leverages optimal transport and an entropy-constrained Sinkhorn algorithm to generate the most challenging permutations. Through minimax optimization, the LLM and P-Net iteratively train, leading to an LLM that is inherently robust to various demonstration orderings. Empirical evaluations on both synthetic tasks and real-world instruction-tuning tasks confirm PEARL's efficacy in mitigating permutation-based attacks and significantly enhancing both average and worst-case performance. Crucially, PEARL demonstrates strong generalization capabilities to many-shot and long-context scenarios, even when trained on fewer resources, and improves shot efficiency.
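To make the minimax recipe summarized above easier to picture, here is a simplified, PyTorch-flavored sketch of one alternating update between the P-Net and the LLM. It is an illustrative reconstruction, not the paper's implementation: the callables `pnet`, `llm_loss`, and `apply_perm`, the loss signs, and the entropy term are assumptions, and details such as how a soft permutation is applied to token sequences are omitted.

```python
import torch

def sinkhorn_torch(scores: torch.Tensor, temperature: float = 0.1, n_iters: int = 80) -> torch.Tensor:
    """Differentiable log-space Sinkhorn normalization (same idea as the NumPy sketch above)."""
    log_p = scores / temperature
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)
    return log_p.exp()

def pearl_minimax_step(demos, query, pnet, llm_loss, apply_perm, pnet_opt, llm_opt, beta=1.0):
    """One alternating min-max update (illustrative simplification, not the paper's code).

    pnet(demos, query)        -> n x n score matrix over the n demonstrations.
    llm_loss(permuted, query) -> scalar training loss of the LLM.
    apply_perm(demos, P)      -> demonstrations reordered/mixed by the soft permutation P.
    """
    # Max step: P-Net proposes a challenging soft permutation; the entropy term keeps P non-trivial.
    P = sinkhorn_torch(pnet(demos, query))
    entropy = -(P * (P + 1e-9).log()).sum()
    pnet_loss = -llm_loss(apply_perm(demos, P), query) - beta * entropy
    pnet_opt.zero_grad(); pnet_loss.backward(); pnet_opt.step()

    # Min step: the LLM minimizes its loss under the proposed ordering (P held fixed).
    P = sinkhorn_torch(pnet(demos, query)).detach()
    loss = llm_loss(apply_perm(demos, P), query)
    llm_opt.zero_grad(); loss.backward(); llm_opt.step()
    return loss.item()
```

In training, these two updates alternate over the data, so the P-Net and the LLM act as each other's moving adversary, which is what progressively hardens the LLM against worst-case orderings.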
7.2. Limitations & Future Work
The authors highlight that while PEARL primarily focuses on improving in-context learning in LLMs, its underlying framework is generalizable. They suggest that PEARL provides a general framework for handling set-structured inputs with order-independent elements. This implies potential future research directions:
- Beyond Text: Applying PEARL to other modalities where input elements form a set but their order might implicitly influence model behavior, such as processing multiple documents, images in a gallery, or video frames.
- Broader Robustness: The DRO paradigm could be extended to address other forms of input variability or adversarial attacks beyond permutations.
- Theoretical Guarantees: Further theoretical work could explore the bounds and guarantees provided by PEARL's DRO formulation in more diverse and complex settings.
- Efficiency of P-Net: While the P-Net is shown to be efficient, exploring alternative or more lightweight P-Net architectures could reduce training overhead even further, especially for extremely large LLMs.

The paper implicitly acknowledges that the P-Net needs to be sufficiently powerful to identify truly challenging permutations, which might be a limitation if the P-Net itself is too simple or gets stuck in local optima.
7.3. Personal Insights & Critique
The PEARL framework presents a highly innovative and principled approach to a well-known vulnerability in LLMs. The integration of Distributionally Robust Optimization (DRO) with Optimal Transport (OT) to create an adversarial permutation-proposal network (P-Net) is a sophisticated and elegant solution. It shifts the paradigm from simply trying to cope with permutation sensitivity to actively learning robustness against it.
One of the most compelling aspects is PEARL's ability to improve worst-case performance without sacrificing average performance; in fact, it often enhances it. This suggests that training on the "hardest" examples forces the LLM to learn more robust and generalizable features, which then benefits all input scenarios. The shot efficiency finding is also significant, implying that PEARL can potentially reduce the amount of in-context information needed for a given performance level, which has practical implications for prompt length and computational cost during inference.
A potential area for further exploration or a subtle unverified assumption might lie in the "naturalness" of the P-Net generated permutations. While P-Net is designed to find challenging permutations, it's not explicitly stated if these permutations always remain semantically coherent or representative of "natural" orderings that users might encounter. The entropy regularization helps prevent trivial solutions, but the quality of adversarial examples generated by P-Net is key to the overall framework's effectiveness.
The method's applicability beyond ICL to any set-structured input is a powerful insight. Imagine a model that processes a collection of medical images; PEARL could make it robust to the order in which those images are presented. This opens up exciting avenues for future research in multi-modal AI and data augmentation strategies across various domains.
In conclusion, PEARL offers a robust, principled, and highly effective solution to a critical problem in LLM reliability. Its theoretical foundation and strong empirical results make it a significant contribution to the field of robust AI and trustworthy language models.