A Neuro-Inspired Interpretation of Unlearning in Large Language Models through Sample-Level Unlearning Difficulty
TL;DR Summary
This paper introduces the Memory Removal Difficulty (MRD) metric for quantifying sample-level unlearning difficulty in Large Language Models, analyzes the sample characteristics that drive that difficulty, and proposes an MRD-based weighted sampling method to optimize existing unlearning algorithms, enhancing their efficiency and effectiveness.
Abstract
Driven by privacy protection laws and regulations, unlearning in Large Language Models (LLMs) is gaining increasing attention. However, current research often neglects the interpretability of the unlearning process, particularly concerning sample-level unlearning difficulty. Existing studies typically assume a uniform unlearning difficulty across samples. This simplification risks attributing the performance of unlearning algorithms to sample selection rather than the algorithm’s design, potentially steering the development of LLM unlearning in the wrong direction. Thus, we investigate the relationship between LLM unlearning and sample characteristics, with a focus on unlearning difficulty. Drawing inspiration from neuroscience, we propose a Memory Removal Difficulty (MRD) metric to quantify sample-level unlearning difficulty. Using MRD, we analyze the characteristics of hard-to-unlearn versus easy-to-unlearn samples. Furthermore, we propose an MRD-based weighted sampling method to optimize existing unlearning algorithms, which prioritizes easily forgettable samples, thereby improving unlearning efficiency and effectiveness. We validate the proposed metric and method using public benchmarks and datasets, with results confirming its effectiveness.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is "A Neuro-Inspired Interpretation Of Unlearning In Large Language Models Through Sample-Level Unlearning Difficulty." It focuses on understanding and quantifying the difficulty of unlearning specific data samples within Large Language Models (LLMs), drawing inspiration from neuroscience.
1.2. Authors
The authors are listed as "Anonymous authors," indicating that the paper is currently under double-blind review. This means the authors' identities are concealed from reviewers to ensure an unbiased review process.
1.3. Journal/Conference
The paper is noted as "Paper under double-blind review," indicating it has been submitted to a conference or journal but has not yet been formally accepted or published. Given the topic, the likely venue is a machine learning or natural language processing conference or journal.
1.4. Publication Year
The publication date (UTC) is stated as 2025-04-09T00:00:00.000Z.
1.5. Abstract
The paper addresses the growing need for unlearning in Large Language Models (LLMs), driven by privacy regulations. It identifies a critical gap in current research: the lack of interpretability of the unlearning process, particularly regarding sample-level unlearning difficulty. Existing studies often assume uniform unlearning difficulty, which can lead to misattributing algorithm performance to sample selection rather than algorithmic design. To address this, the authors investigate the relationship between LLM unlearning and sample characteristics. They propose a neuro-inspired metric called Memory Removal Difficulty (MRD) to quantify sample-level unlearning difficulty. Using MRD, they analyze the characteristics that make samples hard or easy to unlearn. Furthermore, they introduce an MRD-based weighted sampling method to optimize existing unlearning algorithms by prioritizing easily forgettable samples, thereby enhancing unlearning efficiency and effectiveness. The proposed metric and method are validated through experiments on public benchmarks, confirming their efficacy.
1.6. Original Source Link
The original source link is /files/papers/6957ab954a1fbc163064c2a7/paper.pdf. This indicates it is likely a preprint or an internally hosted file, given its anonymous status and the context of being under review. Its publication status is "under double-blind review."
2. Executive Summary
2.1. Background & Motivation
The proliferation of Large Language Models (LLMs) has brought significant advancements in natural language processing (NLP), largely due to their remarkable ability to memorize vast training corpora. However, this strong memorization capability also poses substantial challenges, including privacy breaches, bias propagation, and the generation of illegal content. Regulatory frameworks like the GDPR (General Data Protection Regulation) necessitate the ability to remove specific private or unwanted information from models upon user request. This requirement has propelled Machine Unlearning (MU) into the spotlight.
The core problem the paper aims to solve is the lack of interpretability and a nuanced understanding of the unlearning process in LLMs, specifically regarding sample-level unlearning difficulty. Prior research in MU, particularly for LLMs, frequently treats all data samples as having uniform unlearning difficulty. This simplification is problematic because:
- Misattribution of Performance: An algorithm's apparent superior performance might stem from selecting easy-to-unlearn samples rather than from genuine algorithmic advancements. This can lead to misleading conclusions about the effectiveness of unlearning methods.
- Hindered Development: Without understanding what makes a sample hard or easy to unlearn, the development of more efficient and effective unlearning algorithms is hampered. Researchers might optimize for the wrong aspects if they do not understand the underlying data characteristics influencing unlearning.
The paper's entry point is to challenge this uniform difficulty assumption by investigating the intrinsic characteristics of samples that influence their unlearning difficulty. Drawing inspiration from neuroscience, specifically how different types of memories exhibit varying resistance to forgetting (e.g., long-term vs. short-term memories and their robustness to minor brain injuries), the authors propose a novel approach to quantify this sample-level unlearning difficulty in LLMs.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Novel Metric for Sample-Level Unlearning Difficulty (MRD): The paper proposes Memory Removal Difficulty (MRD) as a novel, computationally efficient metric to quantify the unlearning difficulty of individual data samples in LLMs. The metric is inspired by neuroscience, drawing an analogy between memory robustness in the human brain and the stability of generation probabilities in LLMs under parameter perturbations. A smaller MRD value indicates higher unlearning difficulty (i.e., less change in generation probability under small parameter changes).
- Analysis of Characteristics Influencing Unlearning Difficulty: Using the MRD metric, the paper provides an in-depth analysis of what characteristics make certain samples harder or easier to unlearn. Key findings include: high-frequency samples, or those with high initial generation probabilities, tend to have lower MRD values, indicating they are harder to unlearn; high-complexity samples (e.g., nested syntax) or those containing rare words tend to have higher MRD values, suggesting they are easier to unlearn. These findings offer valuable insights into the factors influencing LLM unlearning behavior.
- MRD-based Weighted Sampling Method: The paper proposes a practical, plug-and-play MRD-based weighted sampling method to optimize existing unlearning algorithms. Inspired by curriculum learning, this method prioritizes the unlearning of easily forgettable samples first, followed by harder ones, by adjusting their sampling probabilities based on their MRD values.
- Enhanced Unlearning Efficiency and Effectiveness: Through extensive experiments on public benchmarks (TOFU, WMDP, WHP, SAFE), the paper demonstrates that the MRD metric effectively captures sample difficulty. The MRD-based weighted sampling method significantly improves both the unlearning completeness (UC) and model utility (UT) of various baseline unlearning algorithms (e.g., GradDiff, PO, NPO), while also accelerating convergence and reducing overall execution time. This confirms the practical value of understanding sample-level unlearning difficulty.

These contributions address the identified gap by providing a formal definition and empirical understanding of sample-level unlearning difficulty, which is crucial for fair evaluation and directed development of future LLM unlearning techniques.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a foundational grasp of Large Language Models (LLMs), Machine Unlearning (MU), and related deep learning concepts is essential.
- Large Language Models (LLMs):
  - Definition: LLMs are deep learning models, typically based on the Transformer architecture, trained on massive amounts of text data. They are designed to understand, generate, and process human-like language. The term "large" refers to their enormous number of parameters (billions to trillions), which allows them to capture complex patterns and relationships in language.
  - Autoregressive Models: Many LLMs, especially those used for text generation, are autoregressive. This means they predict the next token (word or sub-word unit) in a sequence based on all the preceding tokens. Their training objective is usually to minimize the negative log-likelihood of the training data.
  - Log-Likelihood: In probabilistic models, the log-likelihood measures how well the model predicts the observed data. For an autoregressive model generating a sequence of tokens $\pmb{x} = (x_1, \ldots, x_n)$, the log-likelihood is $\sum_{t=1}^{n} \log p(x_t \mid \pmb{x}_{<t}; \pmb{\theta})$, where $p(x_t \mid \pmb{x}_{<t}; \pmb{\theta})$ is the probability of token $x_t$ given the previous tokens $\pmb{x}_{<t}$ and model parameters $\pmb{\theta}$. A higher log-likelihood means the model assigns higher probability to the observed sequence, indicating better memorization or understanding. (A minimal code sketch of this computation appears after this list.)
- Machine Unlearning (MU):
  - Definition: Machine Unlearning refers to the process of removing the influence of specific training data from a machine learning model, as if the model had never been trained on that data in the first place. This is crucial for privacy (e.g., GDPR's "right to be forgotten"), fairness (removing biased data), and security.
  - Forget Set ($\mathcal{D}_f$) and Retain Set ($\mathcal{D}_r$): In Machine Unlearning, the training dataset is typically divided into two parts: the forget set ($\mathcal{D}_f$), which contains the samples to be removed, and the retain set ($\mathcal{D}_r$), which comprises the data that should remain influential on the model.
  - Unlearning Completeness (UC): This metric quantifies how effectively the influence of the forget set has been removed from the model. High UC means the model behaves as if it never saw the forget set data.
  - Model Utility (UT): This metric assesses how well the model retains its performance on unrelated tasks or the retain set after unlearning. High UT means unlearning the forget set did not significantly degrade the model's general capabilities.
- Gradient-based Optimization:
  - Gradient Ascent/Descent: In machine learning, gradient descent is an iterative optimization algorithm used to find the minimum of a function (e.g., a loss function) by taking steps proportional to the negative of the gradient. Gradient ascent is its counterpart, used to find the maximum of a function by moving in the direction of the positive gradient. In unlearning, gradient ascent is often applied on the forget set to actively "push" the model away from memorizing those samples, effectively increasing their negative log-likelihood.
  - Hessian Matrix ($H_t = \nabla^2 P_t(\pmb{\theta})$): The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function. It describes the local curvature of a function. In the context of model parameters $\pmb{\theta}$ and a function like the log-likelihood $P_t(\pmb{\theta})$, the Hessian indicates how sensitive the function's output is to small changes in parameters. A large trace (sum of diagonal elements) of the Hessian implies high curvature, meaning the function changes rapidly with small parameter perturbations.
- Preference Optimization:
  - Concept: This approach frames the task as aligning model behavior with desired preferences, often using techniques like Reinforcement Learning from Human Feedback (RLHF). In unlearning, this can involve treating forgotten data as "negative examples" and optimizing the model to avoid generating responses associated with them, while simultaneously preferring "positive examples" (e.g., refusals or counterfactual samples) for the forget set prompts.
- Curriculum Learning:
  - Concept: Inspired by human learning, curriculum learning is a training strategy where a model is first trained on easier examples and gradually exposed to more complex ones. The idea is that learning from a structured curriculum can lead to faster convergence and better generalization than random ordering. In the context of unlearning, this would mean prioritizing easy-to-unlearn samples before tackling harder-to-unlearn ones.
- Gaussian Perturbation:
  - Definition: A Gaussian perturbation involves adding random noise drawn from a Gaussian (normal) distribution (e.g., $\delta \sim \mathcal{N}(0, \sigma^2 I)$) to model parameters or inputs. Here, $\sigma^2$ is the variance, and $I$ is the identity matrix, indicating independent noise components. This is often used to simulate small, random changes, similar to minor "injuries" in the paper's neuroscience analogy.
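As flagged in the log-likelihood item above, here is a minimal PyTorch sketch of the two building blocks used throughout the paper: per-token log-likelihood of a sequence and an isotropic Gaussian parameter perturbation. `TinyLM` is an illustrative stand-in rather than anything from the paper; any causal LM exposing `(batch, seq) -> (batch, seq, vocab)` logits would behave the same way.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TinyLM(nn.Module):
    """Illustrative stand-in for an autoregressive LM."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):                    # x: (batch, seq) token ids
        h, _ = self.rnn(self.emb(x))
        return self.head(h)                  # (batch, seq, vocab) logits

def sequence_log_likelihood(model, x):
    """sum_t log p(x_t | x_<t; theta): positions 1..T-1 predict tokens 2..T."""
    logits = model(x[:, :-1])
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1)
    return tok_logp.sum()

@torch.no_grad()
def gaussian_perturb_(model, sigma=1e-5):
    """In-place theta <- theta + delta with delta ~ N(0, sigma^2 I);
    returns the deltas so the perturbation can be undone."""
    deltas = [torch.randn_like(p) * sigma for p in model.parameters()]
    for p, d in zip(model.parameters(), deltas):
        p.add_(d)
    return deltas

model, x = TinyLM(), torch.randint(0, 100, (1, 12))
print(sequence_log_likelihood(model, x))     # log-likelihood under theta
deltas = gaussian_perturb_(model)            # theta + delta
print(sequence_log_likelihood(model, x))     # log-likelihood under theta + delta
```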
3.2. Previous Works
The paper contextualizes its contributions by reviewing existing Machine Unlearning (MU) and LLM Unlearning research, highlighting the shift towards interpretability.
- Machine Unlearning (General):
  - Exact Unlearning: Aims to produce a model mathematically identical to one trained from scratch without the forget set. These methods often involve intricate data partitioning or model ensemble techniques to distribute computational overhead (e.g., Bourtoule et al., 2021; Li et al., 2024b). While providing strong guarantees, they are computationally very expensive.
  - Approximate Unlearning: Seeks a model approximately equivalent to the retrained gold standard, either in terms of parameters or output behavior. This is more practical for large models. Methods include estimating data influence (Koh & Liang, 2017; Liu et al., 2024b) or fine-tuning with specific objectives.
- LLM Unlearning: LLM unlearning is primarily framed as approximate unlearning due to the immense scale of LLMs. The goal is to balance unlearning completeness (UC), i.e., removing targeted knowledge, with model utility (UT), i.e., preserving general performance.
  - Gradient-based Methods: Early approaches used gradient ascent (GA) on the forget set to increase the loss, effectively pushing the model to "forget" (Jang et al., 2023). While improving UC, this often degraded UT. Subsequent work introduced regularization (e.g., parameter or loss regularization) to mitigate utility loss (Maini et al., 2024; Yao et al., 2024).
  - Preference Optimization-based Methods: These methods treat unlearning as a preference optimization task. The forget set data is considered negative examples, and the model is optimized to produce rejection-based answers (e.g., refusals) or counterfactual samples when prompted with forget set inputs (Zhang et al., 2024). While integrated, they can suffer from low efficiency.
  - Model Weight-based Methods: More recent research explores identifying and guiding unlearning at the module level by analyzing model weights and modular structures within LLMs (Jia et al., 2024). These methods offer insights but can also be computationally intensive.
- Interpretability of MU: Recent trends emphasize understanding why unlearning works or fails.
  - Fan et al. (2024) studied how different forget set partitions impact model performance on retain sets in image classification.
  - Zhao et al. (2024) investigated explainable features within forget sets and their link to unlearning difficulty.
  - Chen et al. (2024) showed unlearning difficulty variation across users in recommendation systems.
  - These studies collectively point towards the importance of sample-level analysis for unlearning interpretability.
3.3. Technological Evolution
The field of Machine Unlearning has evolved from theoretical concepts and exact unlearning methods, which were often computationally prohibitive for large-scale models, to more practical approximate unlearning techniques. Initially, the focus was on general classification models, with less emphasis on the specific challenges posed by Large Language Models.
The advent of LLMs brought new urgency to unlearning due to their massive scale, strong memorization capabilities, and the implications for privacy and bias. Early LLM unlearning methods adapted existing approximate techniques like gradient ascent and fine-tuning. However, a major challenge has been balancing unlearning completeness with model utility, as aggressive unlearning can degrade overall performance.
More recently, the research community has recognized the need for deeper interpretability in MU. This paper's work fits into this evolving landscape by addressing a critical oversight: the assumption of uniform unlearning difficulty across samples. It pushes the frontier of LLM unlearning by moving beyond simply how to unlearn to understanding what makes unlearning difficult for individual samples, a crucial step for fair evaluation and targeted algorithm design.
3.4. Differentiation Analysis
This paper differentiates itself from existing research in several key ways:
- Focus on Sample-Level Unlearning Difficulty for LLMs: While some prior work (e.g., Chen et al., 2024 for recommendation systems) hinted at varying unlearning difficulty across users, this paper explicitly defines and quantifies sample-level unlearning difficulty specifically for LLMs. Most existing LLM unlearning studies assume uniform difficulty or do not formally define it at the granular sample level.
- Neuro-Inspired Metric (MRD): The paper introduces a novel Memory Removal Difficulty (MRD) metric, drawing a unique analogy from neuroscience regarding memory robustness to minor brain injuries. This provides a fresh theoretical perspective that links the stability of generation probabilities under parameter perturbations to unlearning difficulty. This "neuro-inspired" approach is distinct from purely mathematical or empirical metrics used previously.
- Computational Feasibility for LLMs: Unlike some theoretical measures that require computationally expensive second-order information (e.g., full Hessian matrix inversion) or are designed for image classification (and may not generalize to text-based autoregressive LLMs), MRD is designed to be computationally efficient for LLMs, relying on Monte Carlo sampling of parameter perturbations.
- Interpretability-Driven Analysis: The paper goes beyond proposing a metric; it uses MRD to conduct an in-depth analysis of sample characteristics (semantic complexity, frequency, rare words, initial generation probability) that influence unlearning difficulty. This contributes significantly to the interpretability of the LLM unlearning process, which has been largely underexplored.
- Practical Optimization Strategy: The paper translates its interpretability findings into a practical MRD-based weighted sampling method. This curriculum learning-inspired approach is a general, plug-and-play strategy that can enhance existing LLM unlearning algorithms, offering a concrete way to leverage the MRD metric for improved efficiency and effectiveness. This moves beyond mere measurement to practical application.

In essence, while previous works focused on performing unlearning or broadly interpreting unlearning phenomena, this paper zeroes in on the inherent difficulty of individual samples within LLMs, proposing a biologically inspired metric and a practical method to leverage this understanding.
4. Methodology
This section provides a detailed, step-by-step deconstruction of the paper's methodology, integrating the explanations of mathematical formulas directly into the procedural flow.
4.1. Principles
The core principle behind this paper's methodology is to interpret unlearning difficulty in Large Language Models (LLMs) through an analogy to human memory and brain function, specifically how long-term memories are more resistant to minor brain injuries than short-term memories.
The intuition is as follows:
- In the human brain, long-term memories (like core skills or significant personal experiences) are generally robust and not easily disrupted by minor Traumatic Brain Injuries (mTBI).
- Conversely, short-term memories are more fragile and prone to disruption from such injuries.
- This suggests varying "forgetting difficulty" across different types of knowledge in the brain.

Extending this analogy to LLMs, the authors hypothesize that:

- Hard-to-unlearn samples (analogous to long-term memories) will exhibit minimal changes in their generation probability distribution when the model's parameters undergo minor perturbations (analogous to mTBI).
- Easy-to-unlearn samples (analogous to short-term memories) will show more significant changes in their generation probability under similar minor parameter perturbations.

Therefore, the principle is to quantify unlearning difficulty by measuring the stability of a sample's generation probability in the face of small, random changes to the model's parameters. A sample whose generation probability is highly stable (i.e., changes little) under perturbation is considered harder to unlearn, while one with a volatile generation probability is easier to unlearn.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Setup of LLM Unlearning
The paper first formally defines the training of an autoregressive LLM and the objective of LLM unlearning.
Autoregressive Model Training:
Given a training set $\mathcal{D}$ that is partitioned into a forget set $\mathcal{D}_f$ and a retain set $\mathcal{D}_r$:

- $\mathcal{D}_f$ contains samples to be unlearned.
- $\mathcal{D}_r$ contains samples to be retained. Each sample $\pmb{x}^i$ is a sequence of tokens of length $n_i$.

The parameters $\theta_o$ of an autoregressive model trained on $\mathcal{D}$ are obtained by minimizing the Negative Log-Likelihood (NLL) loss:

$$\theta_o = \arg\min_{\theta} \mathcal{L}(\theta; \mathcal{D}), \qquad \mathcal{L}(\theta; \mathcal{D}) = \mathbb{E}_{\pmb{x}^i \sim \mathcal{D}} \left[ -\sum_{t=1}^{n_i} \log p(x_t \mid \pmb{x}_{<t}; \theta) \right]$$

Here:

- $\theta_o$ represents the model parameters after training on the full dataset $\mathcal{D}$.
- $\mathcal{L}(\theta; \mathcal{D})$ is the Negative Log-Likelihood loss over the entire dataset $\mathcal{D}$.
- $\mathbb{E}_{\pmb{x}^i \sim \mathcal{D}}$ denotes the expectation (average) over all samples in the dataset $\mathcal{D}$.
- $\sum_{t=1}^{n_i} \log p(x_t \mid \pmb{x}_{<t}; \theta)$ is the log-likelihood of a single sample $\pmb{x}^i$, calculated as the sum of log-probabilities of generating each token $x_t$ given the preceding tokens $\pmb{x}_{<t}$ and the current model parameters $\theta$. Minimizing the NLL maximizes the likelihood of observing the training data.
Objective of LLM Unlearning:
The goal of LLM unlearning for a set of samples $\mathcal{D}_f$ is typically framed as an optimization problem:

$$\max_{\theta} \; \mathbb{E}_{f \in F}\left[ f(\theta; \mathcal{D}_r) \right] \quad \text{subject to} \quad g(\theta; \mathcal{D}_f) \geq \epsilon$$

In this formulation:

- The objective is to maximize the model's utility on the retain set $\mathcal{D}_r$.
- $\mathbb{E}_{f \in F}[f(\theta; \mathcal{D}_r)]$ quantifies this utility, where $F$ is a set of functions (e.g., accuracy, perplexity) assessing various model capabilities on $\mathcal{D}_r$.
- The subject-to clause enforces the unlearning completeness constraint.
- $g(\theta; \mathcal{D}_f) \geq \epsilon$ means that a measure of unlearning for the forget set must exceed a certain threshold $\epsilon$. For example, $g$ could ensure the probability of generating forget set samples is sufficiently low, or that the divergence between the model's output distribution on $\mathcal{D}_f$ and the true distribution is sufficiently high.

In essence, the optimization seeks to maintain the model's general abilities while ensuring the targeted forget set information has been effectively erased. (A sketch of a common penalty-form relaxation of this objective follows.)
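In practice, several of the baselines discussed later (notably GradDiff) relax this constrained problem into a single differentiable objective. The sketch below is one common penalty-form relaxation, not the paper's exact formulation; `nll` is assumed to be a negative log-likelihood helper (e.g., the negated `sequence_log_likelihood` from the earlier sketch), and the weight `lam` is a hypothetical hyperparameter.

```python
def unlearning_loss(model, retain_batch, forget_batch, nll, lam=1.0):
    """Penalty-form relaxation of the constrained unlearning objective:
    keep the retain-set NLL low (utility) while pushing the forget-set
    NLL up (gradient ascent on the forget term)."""
    loss_retain = nll(model, retain_batch)   # preserve utility on the retain set
    loss_forget = nll(model, forget_batch)   # increase this to erase the forget set
    return loss_retain - lam * loss_forget   # minimized by standard optimizers
```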
4.2.2. Motivation for Measuring Sample-Level Unlearning Difficulty
The paper highlights that current evaluations of LLM unlearning algorithms often suffer from sample selection bias. This is demonstrated by two key observations:
- Variability in Performance: As shown in Figure 2 (from the original paper), for the same unlearning algorithm (e.g., GradDiff or NPO), the model's performance (Acc, indicating utility retention) varies significantly when different unlearning samples are chosen. This suggests that some samples are inherently easier or harder to unlearn, and random selection can lead to inconsistent evaluation outcomes.

The following figure (Figure 2 from the original paper) shows the impact of sample selection on unlearning evaluation: it reports Acc under two unlearning methods (GradDiff and NPO) across three groups of results, (a) the retain set, (b) real authors, and (c) world facts, highlighting the differences samples exhibit during unlearning.
- Ranking Reversals: Figure 2 also shows that the ranking of unlearning algorithms can reverse depending on the chosen samples. For instance, NPO generally outperforms GradDiff, but for certain samples, GradDiff performs better. This further underscores that a method's perceived superiority might be due to the ease of unlearning the selected test samples, rather than its fundamental design.

The paper argues that existing training difficulty metrics (e.g., GraNd, EN2L, VoG), which quantify how hard a sample is to learn, are not directly applicable to unlearning difficulty. Figure 3 (from the original paper) illustrates this: when samples are ranked by their actual unlearning difficulty (measured by parameter changes), these training metrics do not exhibit a clear monotonic relationship and even disagree sharply for samples of similar unlearning difficulty.

The following figure (Figure 3 from the original paper) shows the rankings of 25 sample groups by GraNd, EN2L, VoG, and MRD under the sample unlearning difficulty ordering; red boxes mark groups where the metrics diverge markedly despite equal sample difficulty.
This motivates the need for a dedicated metric to quantify sample-level unlearning difficulty.
4.2.3. Defining Memory Removal Difficulty (MRD)
The paper proposes Memory Removal Difficulty (MRD) to quantify sample-level unlearning difficulty, inspired by the neuroscience analogy discussed earlier.
Initial Metric and its Limitations:
An initial, intuitive metric for MRD was proposed as:

$$\mathrm{MRD}_0(\pmb{x}^i; \theta) = \left| \sum_{t=1}^{n_i} \left( P_t(\theta + \delta) - P_t(\theta) \right) \right|$$

where $P_t(\theta) = \log p(x_t \mid \pmb{x}_{<t}; \theta)$ is the log-probability of the $t$-th token given current parameters $\theta$, and $\delta$ is a single small random perturbation.

However, this initial metric had two limitations:

- Limited Perturbation Scope: A single perturbation direction might not fully capture the global impact of parameter variations.
- Absolute Metric Bias: Using absolute changes could unfairly penalize samples that inherently have low generation probabilities, making their relative changes appear larger even if the model's confidence is not drastically shifting.
Refined Memory Removal Difficulty (MRD) - Definition 3.1:
To address these limitations, the paper refines the MRD metric by incorporating sample length normalization, a global perturbation mechanism (through expectation over Gaussian noise), and relative measures.
Definition 3.1. For an LLM with parameters $\theta$, the difficulty of unlearning a sample $\pmb{x}^i$ is defined as:

$$\mathrm{MRD}(\pmb{x}^i; \theta) = \left| \mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^2 I)} \sum_{t=1}^{n_i} \frac{P_t(\theta + \delta) - P_t(\theta)}{P_t(\theta)} \right|$$

Here, the symbols are explained as follows:

- $\mathrm{MRD}(\pmb{x}^i; \theta)$: The Memory Removal Difficulty for a specific sample $\pmb{x}^i$ given the current model parameters $\theta$.
- $|\cdot|$: The absolute value, ensuring that MRD is always non-negative.
- $\mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^2 I)}$: The expectation taken over Gaussian perturbation vectors. This means $\delta$ is drawn from a multivariate normal distribution with a mean of 0 and a covariance matrix of $\sigma^2 I$.
  - $\delta$: A vector of small random perturbations applied to each parameter in $\theta$.
  - $\mathcal{N}(0, \sigma^2 I)$: A multivariate normal distribution where 0 is the mean vector (no systematic bias in perturbation), $\sigma^2$ is the variance, and $I$ is the identity matrix, implying that perturbations to different parameters are independent and have the same variance. This is referred to as Gaussian isotropic noise.
- $\sum_{t=1}^{n_i}$: The sum over all tokens in the sample $\pmb{x}^i$, from $t = 1$ to its length $n_i$. This normalizes the measure by the sample's length (implicitly, as it sums over all tokens).
- $P_t(\theta)$: The log-probability (or negative log-likelihood in some contexts, though here it refers to log-likelihood) of generating the $t$-th token given the preceding tokens and the current model parameters $\theta$. This is the generation probability of the token.
- $P_t(\theta + \delta)$: The log-probability of generating the $t$-th token after applying the perturbation $\delta$ to the model parameters.
- $\frac{P_t(\theta + \delta) - P_t(\theta)}{P_t(\theta)}$: The relative change in the log-probability of the $t$-th token due to the perturbation. This addresses the "absolute metric bias" limitation by considering the change proportional to the original value.
Interpretation:
A smaller MRD value indicates less fluctuation in the generation probability of a sample under parameter perturbations. This implies that the sample's "memory" is more stable and thus harder to unlearn. Conversely, a larger MRD suggests greater sensitivity to perturbations, meaning the sample is easier to unlearn. The choice of Gaussian isotropic noise simplifies implementation compared to anisotropic noise by not requiring specific knowledge about which parameters to disturb or their exact ranges.
4.2.4. Approximation of MRD (Theorem 3.2)
The paper provides an approximation of the MRD metric using a second-order Taylor expansion, linking it to the Hessian matrix.
Theorem 3.2 (Approximation of MRD). Assuming that $P_t(\theta)$ and $P_t(\theta + \delta)$ are non-zero, and $\delta \sim \mathcal{N}(0, \sigma^2 I)$ represents a small perturbation where $\sigma$ is sufficiently small, the MRD can be approximated as follows:

$$\mathrm{MRD}(\pmb{x}^i; \theta) \approx \left| \frac{\sigma^2}{2} \sum_{t=1}^{n_i} \frac{\operatorname{tr}(H_t)}{P_t(\theta)} \right|$$

where $H_t = \nabla^2 P_t(\pmb{\theta})$ represents the Hessian matrix of $P_t(\pmb{\theta})$ with respect to $\pmb{\theta}$.
Proof (from Appendix A):
The proof aims to show the relationship between MRD and the Hessian matrix.
- Start with the Definition of MRD:

  $$\mathrm{MRD}(\pmb{x}^i; \theta) = \left| \mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^2 I)} \sum_{t=1}^{n_i} \frac{P_t(\theta + \delta) - P_t(\theta)}{P_t(\theta)} \right|$$

  Here, $P_t(\theta)$ is the log-likelihood of the $t$-th token, $\delta$ is the parameter perturbation, and $n_i$ is the length of the sentence $\pmb{x}^i$.

- Multivariate Taylor Expansion: Perform a multivariate Taylor expansion of $P_t(\theta + \delta)$ up to the second-order term around $\theta$:

  $$P_t(\theta + \delta) \approx P_t(\theta) + \nabla P_t(\theta)^\top \delta + \frac{1}{2} \delta^\top H_t \delta$$

  - $\nabla P_t(\theta)$: The gradient of $P_t(\theta)$ with respect to $\theta$.
  - $H_t = \nabla^2 P_t(\boldsymbol{\theta})$: The Hessian matrix of $P_t(\theta)$ with respect to $\theta$.

- Substitute into the Numerator: Substituting this expansion into the numerator $P_t(\theta + \delta) - P_t(\theta)$ gives:

  $$P_t(\theta + \delta) - P_t(\theta) \approx \nabla P_t(\theta)^\top \delta + \frac{1}{2} \delta^\top H_t \delta$$

- Express the Relative Change: The relative change term in MRD becomes:

  $$\frac{P_t(\theta + \delta) - P_t(\theta)}{P_t(\theta)} \approx \frac{\nabla P_t(\theta)^\top \delta}{P_t(\theta)} + \frac{1}{2} \frac{\delta^\top H_t \delta}{P_t(\theta)}$$

- Compute Expectation: Substitute this expression back into the MRD formula and compute the expectation over $\delta$:

  $$\mathrm{MRD}(\pmb{x}^i; \theta) \approx \left| \mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^2 I)} \sum_{t=1}^{n_i} \left( \frac{\nabla P_t(\theta)^\top \delta}{P_t(\theta)} + \frac{1}{2} \frac{\delta^\top H_t \delta}{P_t(\theta)} \right) \right|$$

  - First-order term: Since $\delta \sim \mathcal{N}(0, \sigma^2 I)$, its expectation is $\mathbb{E}[\delta] = 0$. Therefore, the expectation of the first-order term vanishes: $\mathbb{E}_{\delta}\left[\frac{\nabla P_t(\theta)^\top \delta}{P_t(\theta)}\right] = 0$.
  - Second-order term: For a multivariate normal distribution, the expectation of a quadratic form is $\mathbb{E}_{\delta}[\delta^\top H_t \delta] = \sigma^2 \operatorname{tr}(H_t)$. So, the expectation of the second-order term becomes $\frac{\sigma^2}{2} \frac{\operatorname{tr}(H_t)}{P_t(\theta)}$.

- Final Approximation: Since the first-order term's expectation is zero, taking the absolute value of the remaining second-order term ($P_t(\theta)$ is usually negative, and $\operatorname{tr}(H_t)$ can be positive or negative, so the absolute value ensures a positive difficulty), the approximate expression for MRD is:

  $$\mathrm{MRD}(\pmb{x}^i; \theta) \approx \left| \frac{\sigma^2}{2} \sum_{t=1}^{n_i} \frac{\operatorname{tr}(H_t)}{P_t(\theta)} \right|$$

  Here:

  - $\sigma^2$: The variance of the Gaussian perturbation.
  - $\operatorname{tr}(H_t)$: The trace (sum of diagonal elements) of the Hessian matrix, which captures the local curvature of the log-probability function with respect to the model parameters.
  - $P_t(\theta)$: The log-probability of the $t$-th token.
Interpretation of MRD from Approximation:
Theorem 3.2 shows that MRD is approximately proportional to the Hessian trace ($\operatorname{tr}(H_t)$) and inversely related to the log-probability ($P_t(\theta)$).

- A large trace of the Hessian matrix indicates that the loss function (or generation probability) has a large overall curvature at that point in parameter space. A steeper loss landscape means that smaller parameter changes suffice to significantly alter the generation probability and reach the unlearning threshold. This implies fewer updates are needed, making the sample easier to unlearn.
- Because $P_t(\theta)$ (the log-probability, typically negative) appears in the denominator, MRD is inversely related to it. The paper's reading is that a high log-probability (meaning the model is very confident in generating this token) corresponds to a smaller MRD, indicating higher resistance to unlearning. This aligns with the intuition that well-memorized, high-probability samples are harder to forget.

Thus, MRD serves as a reasonable metric for unlearning difficulty by reflecting how sensitive a sample's generation probability is to parameter changes.
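The key identity behind the second-order step, $\mathbb{E}[\delta^\top H \delta] = \sigma^2 \operatorname{tr}(H)$ for $\delta \sim \mathcal{N}(0, \sigma^2 I)$, is easy to sanity-check numerically. The sketch below verifies it against a random symmetric matrix standing in for $H_t$; all values are illustrative.

```python
import torch

torch.manual_seed(0)
d, sigma, K = 8, 0.1, 200_000

A = torch.randn(d, d)
H = (A + A.T) / 2                          # random symmetric "Hessian"

delta = torch.randn(K, d) * sigma          # K draws of delta ~ N(0, sigma^2 I)
quad = torch.einsum('ki,ij,kj->k', delta, H, delta)  # delta^T H delta per draw

print(quad.mean().item())                  # Monte Carlo estimate of E[delta^T H delta]
print((sigma**2 * torch.trace(H)).item())  # sigma^2 * tr(H): should match closely
```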
4.2.5. Computational Complexity of MRD
The analytical computation of the expectation in the MRD definition is infeasible. Therefore, Monte Carlo sampling is used to approximate it.
The computational complexity of calculating MRD for a sample of length $n_i$ using $K$ Monte Carlo samples is $\mathcal{O}(K \cdot n_i \cdot d)$, where $d$ is the number of model parameters.

- $K$: Number of Monte Carlo samples for the perturbation.
- $n_i$: Length of the input sequence (number of tokens). For each token, the log-probability needs to be computed.
- $d$: Number of model parameters. Computing $P_t(\theta + \delta)$ involves a forward pass through the model with perturbed parameters, and a single forward pass scales linearly with the number of parameters $d$.

This linear scaling with $d$ ensures computational efficiency, making MRD feasible for large LLMs.
The detailed computation of MRD is described in Algorithm 1.
Algorithm 1: Computation implementation of MRD

1: Input: Sample sequence $\pmb{x}^i$; model parameters $\theta$; disturbance variance $\sigma^2$; number of Monte Carlo samples $K$.
2: Output: The MRD value of sample $\pmb{x}^i$.
3: Initialize: $S \leftarrow 0$.
4: for $k = 1$ to $K$ do
5:   Sample disturbance vector $\delta_k \sim \mathcal{N}(0, \sigma^2 I)$.
6:   $\Delta_k \leftarrow 0$.
7:   for $t = 1$ to $n_i$ do
8:     $P_t(\theta) \leftarrow \log p(x_t \mid \pmb{x}_{<t}; \theta)$.
9:     $P_t(\theta + \delta_k) \leftarrow \log p(x_t \mid \pmb{x}_{<t}; \theta + \delta_k)$.
10:    $r_t \leftarrow \left( P_t(\theta + \delta_k) - P_t(\theta) \right) / P_t(\theta)$.
11:    $\Delta_k \leftarrow \Delta_k + r_t$.
12:  end for
13:  $m_k \leftarrow |\Delta_k|$.
14:  $S \leftarrow S + m_k$.
15: end for
16: Return: $\mathrm{MRD}(\pmb{x}^i; \theta) = S / K$.
Explanation of Algorithm 1:
- Lines 4-15: The outer loop iterates $K$ times for Monte Carlo sampling.
- Line 5: In each iteration, a random perturbation vector $\delta_k$ is sampled from a Gaussian distribution.
- Lines 7-12: The inner loop iterates through each token of the sample $\pmb{x}^i$.
- Line 8: Calculates the log-probability of the token with the original model parameters $\theta$.
- Line 9: Calculates the log-probability of the token with the perturbed model parameters $\theta + \delta_k$.
- Line 10: Computes the relative change in log-probability for the current token.
- Line 11: Accumulates these relative changes over all tokens in the sample.
- Line 13: Takes the absolute value of the total accumulated relative change for the current Monte Carlo sample, storing it as $m_k$.
- Line 14: Sums up the $m_k$ values from all Monte Carlo samples.
- Line 16: Returns the average MRD value by dividing the sum by $K$.
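A minimal PyTorch sketch of Algorithm 1, reusing the `TinyLM` stand-in and per-token log-probability helper from the earlier sketch (repeated here so the block is self-contained). The defaults σ = 1e-5 and K = 200 follow the paper's reported settings; the model and token ids are illustrative.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TinyLM(nn.Module):
    """Same illustrative stand-in as in the earlier sketch."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

def token_logprobs(model, x):
    """P_t(theta) = log p(x_t | x_<t; theta) for each predicted position."""
    logp = F.log_softmax(model(x[:, :-1]), dim=-1)
    return logp.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1).squeeze(0)

@torch.no_grad()
def mrd(model, x, sigma=1e-5, K=200):
    """Monte Carlo estimate of MRD(x; theta), following Algorithm 1."""
    params = list(model.parameters())
    base = token_logprobs(model, x)             # P_t(theta)
    total = 0.0
    for _ in range(K):                          # line 4: K Monte Carlo draws
        delta = [torch.randn_like(p) * sigma for p in params]  # line 5
        for p, d in zip(params, delta):
            p.add_(d)                           # theta + delta_k
        pert = token_logprobs(model, x)         # line 9: P_t(theta + delta_k)
        for p, d in zip(params, delta):
            p.sub_(d)                           # restore theta
        rel = (pert - base) / base              # line 10: relative change
        total += rel.sum().abs().item()         # lines 11-14: |sum_t r_t|
    return total / K                            # line 16: average over draws

model, x = TinyLM(), torch.randint(0, 100, (1, 12))
print(mrd(model, x, K=20))                      # small K just for this demo
```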
4.2.6. Characteristics Influencing MRD
Based on the theoretical analysis (Theorem 3.2), MRD is proportional to the local geometric curvature ($\operatorname{tr}(H_t)$) and inversely related to the log-probability ($P_t(\theta)$). This leads to several hypotheses about sample characteristics:

- Semantic Complexity / Syntactic Simplicity:
  - Low Complexity (e.g., "The cat is sleeping."): Samples that are syntactically simple and structurally clear tend to have smooth output distributions. This implies a relatively small local geometric curvature (i.e., $\operatorname{tr}(H_t)$ is small). Consequently, their MRD values are low, indicating higher resistance to unlearning. The model is very confident and stable in generating these common, simple sequences.
  - High Complexity (e.g., "The intricacies of quantum mechanics perplex many scientists."): Samples with nested syntax, complex modifications, or rare vocabulary often have steeper distributions with sharper variations in parameter space. These lead to higher MRD values, making them more susceptible to perturbations and thus easier to unlearn. The model's confidence in these complex sequences is more fragile.
- Initial Generation Probability:
  - High Probability (e.g., "I love reading books."): If a sample's generation probability is high (meaning the model is very confident about it), its corresponding MRD will be small. This indicates greater resistance to unlearning. Such samples are often encountered frequently in training or share strong contextual associations, making them well-ingrained.
  - Low Probability (e.g., "The sesquipedalian lecturer pontificated endlessly."): Samples with complex syntax or rare vocabulary are expected to have lower generation probabilities. They exhibit larger changes in generation probabilities under parameter perturbations, leading to higher MRD values, making them more susceptible to unlearning.
- Occurrence Frequency: This is often correlated with initial generation probability.
  - High Frequency: Samples that appear frequently in the training data are typically well-memorized and have high generation probabilities. They are expected to have lower MRD values, meaning they are harder to unlearn.
  - Low Frequency (Long-tail Distributions): Samples from long-tail distributions are less common and thus less well-memorized. They are expected to have higher MRD values, making them easier to unlearn.

These characteristics, validated experimentally, provide valuable insights into why certain samples are harder or easier to unlearn, moving beyond a simple "black-box" understanding.
4.2.7. MRD-based Weighted Sampling Method
Leveraging the MRD metric, the paper proposes an MRD-based weighted sampling method to optimize existing unlearning algorithms. This method is inspired by curriculum learning, where learning progresses from simple to complex. Here, unlearning progresses from easily forgettable samples to harder-to-unlearn samples.
The strategy works as a general, plug-and-play enhancement:
- MRD Calculation: Calculate the MRD value for all samples in the forget set.
- Weighted Sampling: Use MRD values as scores to adjust the sampling probability of unlearning samples. Samples with higher MRD (easier to unlearn) are prioritized by being sampled more frequently.
- Dynamic Progression: As easy-to-unlearn samples are forgotten, their MRD values might change, so the sampling probabilities are periodically updated, effectively guiding the unlearning process from simple to complex.

For analytical clarity, the paper extends Stochastic Gradient Ascent (SGA) into Curriculum Gradient Ascent (CGA) using MRD.
Remark 3.3. Unlearning Efficiency:
For an unlearning algorithm $\mathcal{U}$, the unlearning efficiency is defined as:

$$E(\mathcal{U}) = \frac{1}{M(\mathcal{U}) \cdot C(\mathcal{U})}$$

where:

- $M(\mathcal{U})$: The total number of updates needed to meet the unlearning goal.
- $C(\mathcal{U})$: The computational cost per update.
Remark 3.4. Updates and MRD:
When the update magnitude per iteration is fixed, the average number of updates $I(\pmb{x}^i)$ required to unlearn a sample $\pmb{x}^i$ can be regarded as:

$$I(\pmb{x}^i) \propto \frac{1}{\mathrm{MRD}(\pmb{x}^i; \theta)}$$

This means samples with lower MRD (harder to unlearn) require more updates ($I(\pmb{x}^i)$ is larger), while samples with higher MRD (easier to unlearn) require fewer updates ($I(\pmb{x}^i)$ is smaller).
The CGA method aims for higher unlearning efficiency. The computational complexity analysis in Appendix B shows that CGA can achieve significantly higher efficiency than SGA, i.e., $E(\mathcal{U}_{\mathrm{CGA}}) \gg E(\mathcal{U}_{\mathrm{SGA}})$, especially for large forget sets. This means that for the same computational cost (e.g., a fixed number of updates), CGA should achieve better unlearning completeness while preserving model utility.
The implementation of Curriculum Gradient Ascent (CGA) leveraging MRD is detailed in Algorithm 2.
Algorithm 2: Curriculum Gradient Ascent Unlearning

1: Input: Model parameters $\theta$; forget set $\mathcal{D}_f$; difficulty metric $\mathrm{MRD}$; update interval $\omega$.
2: Output: Updated model parameters $\theta_u$.
3: Initialize: Compute $\mathrm{MRD}(\pmb{x}^i; \theta)$ for each sample $\pmb{x}^i \in \mathcal{D}_f$.
4: repeat
5:   for $s = 1$ to $S$ do ($S$ is the total number of unlearning steps/iterations)
6:     Sample sentences $\pmb{x}^i$ from $\mathcal{D}_f$ with probability
7:       $p_i = \mathrm{MRD}(\pmb{x}^i; \theta) / \sum_{j} \mathrm{MRD}(\pmb{x}^j; \theta)$.
8:     Update $\theta$ by gradient ascent on the sampled sentences.
9:     if $s \bmod \omega = 0$ then
10:      Update $\mathrm{MRD}(\pmb{x}^i; \theta)$ for each sample.
11:    end if
12:  end for
13: until convergence or maximum iteration reached
14: Return: $\theta_u$
Explanation of Algorithm 2:
- Line 3: At the beginning, the MRD value for every sample in the forget set $\mathcal{D}_f$ is computed using Algorithm 1. This creates an initial difficulty ranking.
- Lines 4-13: The main unlearning process loops until convergence (unlearning goal met) or maximum iterations are reached.
- Lines 5-12: The inner loop performs the unlearning steps.
- Lines 6-7: For each unlearning step, samples are drawn from $\mathcal{D}_f$ with a probability proportional to their current MRD value. This ensures easily forgettable samples (higher MRD) are sampled more frequently initially.
- Line 8: The model parameters are updated using gradient ascent based on the selected sample(s). (This gradient ascent step would typically involve increasing the negative log-likelihood for the forget samples.)
- Lines 9-11: Periodically (every $\omega$ iterations, where $\omega$ is the update interval), the MRD values for all samples are recomputed. This matters because, as the model unlearns, the MRD values of samples change (e.g., an "easy" sample might become "harder" after its influence is mostly removed). This dynamic update allows the curriculum to adapt.
- Line 14: The updated model parameters are returned.
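A minimal sketch of the MRD-weighted sampling step at the heart of Algorithm 2, assuming `mrd_scores` holds the current MRD value of each forget-set sample (the numbers below are made up). In a full CGA loop, these indices pick the sentences for the gradient-ascent update, and the scores are refreshed every $\omega$ steps via Algorithm 1.

```python
import torch

def sample_forget_indices(mrd_scores, batch_size):
    """Draw forget-set indices with p_i = MRD_i / sum_j MRD_j, so higher-MRD
    (easier-to-unlearn) samples are selected more often early on."""
    probs = mrd_scores / mrd_scores.sum()
    return torch.multinomial(probs, batch_size, replacement=True)

mrd_scores = torch.tensor([0.8, 0.1, 2.3, 0.5])   # hypothetical MRD values
print(sample_forget_indices(mrd_scores, 2))       # e.g., tensor([2, 0])
```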
4.2.8. Computational Complexity Analysis (Appendix B)
The appendix provides a comparative analysis of the computational complexity between Stochastic Gradient Ascent (SGA) and Curriculum Gradient Ascent (CGA).
Analyzing SGA:
- Procedure: SGA involves two steps: (i) randomly sample a forget sample at each iteration; (ii) update the parameters using the gradient of the negative log-likelihood for the selected sample.
- Total Updates: Assuming uniform selection probability, the total updates required to unlearn all samples are $M(\mathcal{U}_{\mathrm{SGA}}) = \sum_{i=1}^{N_f} I(\pmb{x}^i)$, where $I(\pmb{x}^i)$ is the number of updates needed for sample $\pmb{x}^i$. Since samples are chosen with probability $1/N_f$, the effective number of updates for a specific sample across many iterations is related to its inverse sampling probability.
- Per-update Cost: The computational cost per update is $\mathcal{O}(d)$ for a forward/backward pass.
- Efficiency: The unlearning efficiency of SGA is $E(\mathcal{U}_{\mathrm{SGA}}) = 1 / (\sum_{i=1}^{N_f} I(\pmb{x}^i) \cdot \mathcal{O}(d))$.
Analyzing CGA:
- Procedure: CGA involves three key steps: (i) compute MRD values for all samples; (ii) select samples based on MRD, prioritizing those with lower unlearning difficulty (higher MRD values); (iii) apply gradient ascent updates.
- Sampling Probability: The sampling probability for each sample is $p_i = I(\pmb{x}^i) / \sum_{j=1}^{N_f} I(\pmb{x}^j)$, based on the inverse relationship with MRD from Remark 3.4. This means harder-to-unlearn samples (larger $I(\pmb{x}^i)$) are sampled more frequently, accelerating their unlearning.
- Total Updates: The total number of unlearning updates for CGA is also $M(\mathcal{U}_{\mathrm{CGA}}) = \sum_{j=1}^{N_f} I(\pmb{x}^j)$. However, due to the weighted sampling, the effective updates for hard samples are concentrated earlier, leading to faster overall progress towards the unlearning goal.
- Complexity: The complexity of CGA includes $\mathcal{O}(K \cdot n_i \cdot d)$ for the MRD computation (done initially and then every $\omega$ steps) and $\mathcal{O}(d)$ for parameter updates.
- Efficiency: The unlearning efficiency is $E(\mathcal{U}_{\mathrm{CGA}}) = 1 / (\sum_{j=1}^{N_f} I(\pmb{x}^j) \cdot \mathcal{O}(d))$. The paper claims $E(\mathcal{U}_{\mathrm{CGA}}) \gg E(\mathcal{U}_{\mathrm{SGA}})$, suggesting a substantial efficiency gain because the weighted sampling allows CGA to reach the unlearning objective more rapidly (a lower effective $M$ for the same overall target) compared to SGA. This advantage is particularly pronounced for large unlearning sets.
Discussion on Other Forms of Improving Methods (Appendix D.1):
The paper notes that MRD is a general metric whose potential to improve existing unlearning methods is independent of the model type. While curriculum learning-based weighted sampling is used here as a heuristic, other improvement directions could include:

- Constructing hierarchical unlearning strategies.
- Using MRD to build reward mechanisms for reinforcement learning-based unlearning.
- Incorporating MRD as a regularization term in the unlearning objective.
Second-Order Approximation Analysis of MRD (Appendix D.2):
The paper clarifies that while high curvature in optimization training might correspond to sharp local minima (which can be problematic), in unlearning, it indicates that generation probabilities are highly sensitive to parameter perturbations. This sensitivity means the sample is easier to unlearn because small parameter changes can significantly alter its probability. Therefore, high curvature is a useful indicator of unlearning difficulty. The paper also differentiates MRD from influence functions and saliency maps by noting its focus on changes in predictions after removing data, and its computational feasibility for LLMs (sampling vs. second-order information for influence functions).
5. Experimental Setup
5.1. Datasets
The authors validate their MRD metric and MRD-based weighted sampling method across four mainstream LLM unlearning tasks and datasets.
- TOFU (Maini et al., 2024):
  - Task: Virtual author information unlearning. The goal is to remove knowledge about specific fictional authors.
  - Source & Characteristics: This benchmark fine-tunes an LLM with data on 200 fictional authors, each associated with 20 Question-Answer (QA) pairs. A subset of these authors is designated as the unlearn set (forget set), and the remaining authors form the retain set. It specifically assesses the model's ability to selectively unlearn targeted factual information.
  - Data Example (from Table 11): Q: Is Farid Benoit currently writing any other books? A: It is reported that Farid Benoit is currently working on his sixth erotica novel, but the title has not been disclosed yet.
  - Chosen Proportion: The authors selected a 10% proportion for the forget set among the available options (1%, 5%, 10%).
- WMDP (Li et al., 2024a):
  - Task: Unlearning harmful capabilities. This dataset tests the model's ability to forget dangerous knowledge.
  - Source & Characteristics: This benchmark evaluates an LLM's capacity to unlearn harmful knowledge, particularly in domains such as biosafety, cybersecurity, and chemical safety. The forget set consists of plain text related to biological and cybersecurity knowledge, while unrelated text serves as the retain set.
- Who's Harry Potter (WHP) (Eldan & Russinovich, 2023):
  - Task: Copyright information removal. The objective is to remove content related to the Harry Potter series.
  - Source & Characteristics: For this task, 200 data chunks, each containing 512 tokens, were extracted from the Harry Potter series to form the forget set.
- PKU SafeRLHF (SAFE) (Ji et al., 2024b):
  - Task: Unlearning model toxic responses. This benchmark aims to remove the model's ability to generate toxic outputs when given inappropriate prompts.
  - Source & Characteristics: For the SAFE task, 200 negative examples were randomly sampled from the PKU-SafeRLHF training set to construct the forget set.
  - Retain Set for WHP and SAFE: To maintain model utility for both the copyright removal (WHP) and detoxification (SAFE) tasks, the C4 dataset (Raffel et al., 2020) was utilized as the retain set. The C4 dataset (Colossal Clean Crawled Corpus) is a very large, diverse, and cleaned-up collection of web text, commonly used for training and evaluating general language models.

These datasets are chosen because they represent diverse and relevant LLM unlearning scenarios, including factual knowledge, harmful content, copyright material, and toxic responses, making them effective for validating the proposed method's general applicability and performance.
5.2. Evaluation Metrics
The performance of unlearned LLMs is assessed across two primary dimensions: Unlearning Completeness (UC) and Model Utility (UT). For every evaluation metric, a conceptual definition, mathematical formula, and symbol explanation are provided.
5.2.1. Unlearning Completeness (UC) Metrics
Unlearning Completeness (UC) quantifies how effectively the model has forgotten the targeted data from the forget set. A higher value generally indicates better unlearning.
General Metrics:
- Memorization Accuracy (MA) (Tirumala et al., 2022):
  - Conceptual Definition: MA quantifies how much a model has memorized a given sequence of tokens. It measures the proportion of tokens in a sequence that the model can correctly predict given the preceding tokens. A lower MA on the forget set after unlearning indicates better UC. The early stopping condition (Appendix E.4) specifies that a sample is forgotten when its MA falls below a certain threshold.
  - Mathematical Formula:

    $$\mathrm{MA}(\pmb{x}) = \frac{\sum_{t=2}^{T} \mathbb{1}\left\{ \arg\max_{x'} p(x' \mid \pmb{x}_{<t}; \theta) = x_t \right\}}{T - 1}$$

  - Symbol Explanation:
    - $\mathrm{MA}(\pmb{x})$: Memorization Accuracy for a token sequence $\pmb{x}$.
    - $\pmb{x}$: A sequence of tokens of length $T$.
    - $\sum_{t=2}^{T}$: Sums over all tokens from the second token ($t = 2$) to the last token ($t = T$), as each token is predicted based on its predecessors.
    - $\mathbb{1}\{\cdot\}$: The indicator function, which evaluates to 1 if the condition inside the braces is true, and 0 otherwise.
    - $\arg\max_{x'} p(x' \mid \pmb{x}_{<t}; \theta)$: The token with the highest predicted probability (the most likely token) output by the model given the preceding tokens and current parameters $\theta$.
    - $x_t$: The true $t$-th token in the sequence.
    - $T - 1$: The total number of predictions made (since the first token has no preceding context for prediction). (A short code sketch of MA and of the n-gram overlap used below appears after this list.)
- Extraction Likelihood (EL) (Jang et al., 2023):
  - Conceptual Definition: EL estimates the general likelihood of extracting a target sequence from the model. It measures the average success rate of varying extraction attacks by quantifying the n-gram overlap between model-generated sequences and target token sequences, starting from different prefixes. A lower EL on the forget set after unlearning indicates better UC. The early stopping condition (Appendix E.4) specifies that a sample is forgotten when its EL falls below a certain threshold.
  - Mathematical Formula:

    $$\mathrm{EL}_n(\pmb{x}) = \frac{\sum_{t=1}^{T-n} \mathrm{Overlap}_n\left( f_\theta(\pmb{x}_{<t}), \pmb{x}_{\geq t} \right)}{T - n}, \qquad \mathrm{Overlap}_n(\pmb{a}, \pmb{b}) = \frac{\sum_{c \in \mathrm{ng}(\pmb{a})} \mathbb{1}\{ c \in \mathrm{ng}(\pmb{b}) \}}{|\mathrm{ng}(\pmb{a})|}$$

  - Symbol Explanation:
    - $\mathrm{EL}_n(\pmb{x})$: Extraction Likelihood for a sequence $\pmb{x}$ using n-grams.
    - $\pmb{x}$: A sequence of tokens of length $T$.
    - $f_\theta(\pmb{x}_{<t})$: The output token sequences generated by the LLM when given $\pmb{x}_{<t}$ as input with parameters $\theta$. The generated sequence can have a maximum length of $T - t$ but might be shorter if an EOS (end-of-sequence) token is generated.
    - $\pmb{x}_{\geq t}$: The true suffix of the sequence starting from token $t$.
    - $T - n$: The number of possible starting points for n-gram overlap comparisons, accounting for the n-gram length.
    - $\mathrm{Overlap}_n(\pmb{a}, \pmb{b})$: A function measuring the n-gram overlap between two token sequences $\pmb{a}$ (generated) and $\pmb{b}$ (target).
    - $\mathrm{ng}(\cdot)$: Denotes the list of n-grams in the given token sequence.
    - $\sum_{c \in \mathrm{ng}(\pmb{a})} \mathbb{1}\{c \in \mathrm{ng}(\pmb{b})\}$: Counts how many n-grams from sequence $\pmb{a}$ are also present in sequence $\pmb{b}$.
    - $|\mathrm{ng}(\pmb{a})|$: The total number of n-grams in sequence $\pmb{a}$.
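As flagged above, here is a minimal sketch of these two measures over plain token-id lists; `predict_next` stands in for a model's greedy next-token prediction and is a hypothetical callable, not an API from the paper.

```python
from typing import Callable, List

def memorization_accuracy(tokens: List[int],
                          predict_next: Callable[[List[int]], int]) -> float:
    """Fraction of tokens x_t (t >= 2) whose greedy prediction from x_<t is correct."""
    hits = sum(predict_next(tokens[:t]) == tokens[t] for t in range(1, len(tokens)))
    return hits / (len(tokens) - 1)

def ngrams(tokens: List[int], n: int) -> List[tuple]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_n(generated: List[int], target: List[int], n: int) -> float:
    """Share of the generated sequence's n-grams that also occur in the target."""
    gen = ngrams(generated, n)
    tgt = set(ngrams(target, n))
    return sum(g in tgt for g in gen) / len(gen) if gen else 0.0

# Toy checks with a fixed "model" that always predicts token 7:
print(memorization_accuracy([7, 7, 7, 3], lambda ctx: 7))   # 2/3
print(overlap_n([1, 2, 3, 4], [0, 2, 3, 4, 5], n=2))        # 2/3 of generated bigrams
```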
Task-Specific UC Metrics:
- TOFU Task:
  - Unlearning Accuracy (UA): Represented as $1 - \mathrm{FA}$ (Jia et al., 2024), where FA measures the model's accuracy on the forget set (e.g., answering questions about forgotten authors). A higher UA indicates better unlearning completeness.
  - Membership Inference Attack (MIA): Evaluates the Area Under the ROC Curve (AUC) using the Min-k% Prob method (Shi et al., 2023) to detect whether a sample was part of the training set. Here, a higher MIA score indicates that the model's confidence in generating forget set samples has been reduced, suggesting more thorough unlearning.
  - Rouge-L Recall (RR): Calculated as $1 - \text{Rouge-L}$. Rouge-L measures the longest common subsequence between a generated text and a reference text, computed over the forget set. A higher RR (i.e., a lower Rouge-L score, indicating less overlap with the original forbidden content) signifies better unlearning performance.
  - Concept Relearning Score (Relearn): Defined as $1 - \text{saliency score}$ (Lo et al., 2024). The saliency score measures how strongly forgotten concepts re-emerge in the model after retraining or fine-tuning. A higher Relearn value (i.e., a lower saliency score) indicates better unlearning completeness and lower susceptibility to relearning the forgotten concept.
- WMDP Task: UC is evaluated using UA on the WMDP-Bio (biosafety knowledge) and WMDP-Cyber (cybersecurity knowledge) subsets.
- WHP Task: UC is determined using Rouge-L on 300-token completions generated by the model from Harry Potter-related instructions. A lower Rouge-L (thus a higher RR) indicates successful unlearning.
- SAFE Task: UC is assessed using Toxic-BERT (Hanu & Unitary team, 2020) scores on toxic prompts from the SAFE test set. A lower Toxic-BERT score (indicating less toxicity in responses) signifies better unlearning completeness. Toxic-BERT is a BERT-based model trained to classify text toxicity.
5.2.2. Model Utility (UT) Metrics
Model Utility (UT) evaluates the impact of unlearning on the model's performance on unrelated tasks or the retain set. A higher value generally indicates better utility retention.
- TOFU Task:
  - Retain Set Accuracy (Acc) and Rouge-L Recall (RR): Measure the model's performance on the data it was meant to retain. Higher values are better.
  - Real Author Acc and RR: Assess the model's ability to retain knowledge about authors not in the forget set.
  - World Fact Acc and RR: Evaluate general factual knowledge.
- WMDP Task: UT is measured by zero-shot accuracy on the MMLU (Massive Multitask Language Understanding) dataset (Hendrycks et al., 2020). MMLU is a broad set of challenging multiple-choice questions across 57 subjects, assessing general knowledge and reasoning abilities.
- WHP and SAFE Tasks (General Utility Metrics):
  - Perplexity (PPL): Measures how well a probability model predicts a sample. A lower PPL indicates that the model is better at predicting the next word in a sequence, implying better language modeling capabilities. It is often evaluated on general text corpora like Wikitext (Merity et al., 2016). (A one-line computation of PPL from the NLL appears after this list.)
  - Zero-shot Accuracy: Averaged across multiple tasks using the Language Model Evaluation Harness (Gao et al., 2021). The tasks include: BoolQ (Clark et al., 2019), boolean question answering; RTE (Dagan et al., 2005), recognizing textual entailment; HellaSwag (Zellers et al., 2019), commonsense NLI for sentence completion; Winogrande (Sakaguchi et al., 2021), commonsense reasoning with Winograd schemas; ARC-Challenge/ARC-Easy (Chollet, 2019), the AI2 Reasoning Challenge assessing scientific reasoning; OpenBookQA (Mihaylov et al., 2018), open-book question answering; and Piqa (Bisk et al., 2020), physical commonsense reasoning. A higher average accuracy across these diverse tasks indicates better retention of general capabilities.
  - TruthfulQA: Measures how truthful a model's answers are to questions that some LMs might answer falsely due to memorized misconceptions. A higher score indicates more truthful and less misleading responses.
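Since PPL is simply the exponentiated average negative log-likelihood per token, it can be computed directly from the per-token log-probabilities used throughout this paper; a minimal sketch (the log-probabilities below are illustrative values):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

print(perplexity([-2.1, -0.3, -1.7, -0.9]))  # ~3.49
```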
5.3. Baselines
The effectiveness of the MRD metric and the MRD-enhanced methods is assessed against mainstream unlearning baselines. For each baseline, an MRD-weighted sampling strategy is applied to create an MRD-enhanced method.

- Gradient-based Methods:
  - GA (Gradient Ascent) (Jang et al., 2023): A foundational unlearning method that performs gradient ascent on the forget set loss to actively "un-memorize" the samples.
  - GradDiff (Yao et al., 2024): A more advanced gradient-based method that aims to improve unlearning completeness while mitigating utility loss, possibly through regularization or refined gradient updates.
  - SGA (Stochastic Gradient Ascent) and CGA (Curriculum Gradient Ascent) are used as specific instantiations of gradient-based unlearning in the context of the TOFU task.
- Preference Optimization Methods:
  - PO (Preference Optimization) (Maini et al., 2024): These methods frame unlearning as a preference alignment task, treating forgotten data as negative examples and optimizing for rejection-based answers. The paper mentions using rejection-based answers for the forget set in PO.
  - NPO (Negative Preference Optimization) (Zhang et al., 2024): An improvement over PO that aims to address issues like catastrophic collapse and improve unlearning effectiveness.

For each baseline (GA, GradDiff, PO, NPO), an MRD-enhanced version is created (e.g., GradDiff + MRD) by incorporating the MRD-based weighted sampling strategy into its unlearning sequence. Comparisons are made against these original baselines, with results averaged over five independent trials to ensure robustness.
5.4. Training Setup
The specific configurations for training and unlearning experiments are as follows:
- Optimizer: AdamW (Loshchilov, 2017) is the default optimization algorithm.
- Learning Rate: 5e-5.
- Perturbation Intensity (σ): For calculating MRD, the standard deviation of the Gaussian perturbation is set to 1e-5.
- Monte Carlo Sampling Iterations (K): The number of Monte Carlo sampling iterations for the MRD calculation is set to 200 (a Monte Carlo sketch of the MRD computation follows this list).
- Update Interval (m): For Algorithm 2 (CGA), the MRD recalculation interval is set to m = 2, the value found optimal in the ablation of Section 6.2.2.
- Epochs/Steps:
  - TOFU Task: Both PO and GradDiff are run for 5 epochs; NPO is run for 4 epochs.
  - WMDP Task: The maximum number of training steps for NPO and GradDiff is set to 500.
  - WHP and SAFE Tasks: 5 epochs are conducted for these tasks.
- Hardware: All experiments are conducted on two NVIDIA A800 GPUs.
- PO Rejection Answers: For the PO method, rejection-based answers are used as target responses for the forget-set samples. Examples from Table 3 include:
  - TOFU: "I'm not informed about that subject," "I don't have the details on that issue."
  - WHP/PKU-Safe: "I apologize, but I'm legally restricted from fulfilling this request," "I'm sorry, but my ability to generate content is limited by copyright laws."
- Early Stopping Condition (Appendix E.4): A sample is considered successfully forgotten when its Extraction Likelihood (EL) and Memorization Accuracy (MA) values on the current model fall below the average EL and MA values of all samples on the initial model. This provides a consistent and objective termination condition for unlearning.
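To make the σ and K settings above concrete, here is a minimal PyTorch sketch of a Monte Carlo MRD estimate. It approximates MRD as the expected relative change in the sample's sequence NLL under Gaussian parameter perturbation; the paper defines MRD on the generation probability itself, so this normalization is an assumption, not the authors' implementation.

```python
import torch

@torch.no_grad()
def estimate_mrd(model, batch, sigma=1e-5, k=200):
    """Monte Carlo MRD estimate for one sample (sketch).

    `batch` is a tokenized dict with `labels`, so `.loss` is the mean
    token-level NLL, i.e. -log p(sample) / length. MRD is approximated
    as the expected relative change of this NLL under Gaussian
    perturbations of the parameters (sigma and k follow the setup above).
    """
    base = model(**batch).loss.item()
    originals = [p.detach().clone() for p in model.parameters()]

    deltas = []
    for _ in range(k):
        for p, p0 in zip(model.parameters(), originals):
            p.copy_(p0 + sigma * torch.randn_like(p0))  # theta + epsilon
        deltas.append(abs(model(**batch).loss.item() - base) / abs(base))

    for p, p0 in zip(model.parameters(), originals):    # restore theta
        p.copy_(p0)
    return sum(deltas) / k
```

Note that this naive version keeps a full parameter copy and runs K forward passes per sample; since only inference is involved, batching many samples together amortizes this cost, consistent with the efficiency analysis in Section 6.1.5.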
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Differences in Unlearning Difficulty
The paper first validates the premise that unlearning difficulty varies across samples. To quantify this, 40 samples are randomly selected (with replacement) from the TOFU unlearning set and concatenated into 300 composite samples. Unlearning is performed using existing baselines with early stopping, and the average absolute value of parameter changes after unlearning is computed.
The following figure (Figure 4 from the original paper) shows the comparison of unlearning difficulty across different sample sets in GA, GradDiff, and NPO:

Analysis:
Figure 4 demonstrates significant variability in parameter changes across different samples when using GA, GradDiff, and NPO algorithms. Each point represents a composite sample set, and the distance from the origin denotes the average absolute value of parameter changes. The scattered distribution indicates that different sample sets require vastly different magnitudes of parameter adjustments to be unlearned. This confirms that unlearning difficulty is not uniform across samples, and the choice of samples indeed has a substantial influence on unlearning performance. This observation directly supports the paper's core motivation.
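As a minimal sketch of the difficulty proxy used here, assuming two model snapshots taken before and after unlearning:

```python
import torch

@torch.no_grad()
def mean_abs_param_change(model_before, model_after):
    """Average absolute parameter change between two model snapshots,
    the per-sample-set difficulty proxy plotted in Figure 4 (sketch)."""
    total, count = 0.0, 0
    for p0, p1 in zip(model_before.parameters(), model_after.parameters()):
        total += (p1 - p0).abs().sum().item()
        count += p0.numel()
    return total / count
```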
6.1.2. Effectiveness of MRD
To validate the MRD metric, experiments are conducted on TOFU and WMDP tasks. For each task, 10 random samples are chosen, and various LLM unlearning baselines are applied. With identical hyperparameter settings, parameter update magnitudes, and early stopping conditions, the number of updates required for unlearning each sample is compared against its MRD value.
The following figure (Figure 5 from the original paper) shows the relationship between the MRD value and the number of unlearning updates (i.e., unlearning difficulty):

Analysis:
Figure 5 shows a consistent relationship between the MRD value and the number of unlearning updates required, which serves as a proxy for unlearning difficulty.
- For both TOFU and WMDP datasets, and across different unlearning algorithms (GradDiff, GA), a clear trend is observable: as the MRD value increases, the number of updates required for unlearning tends to decrease.
- This indicates that samples with higher MRD values are easier to unlearn (requiring fewer updates), while those with lower MRD values are harder to unlearn (requiring more updates). This aligns exactly with the definition of MRD (smaller MRD = higher difficulty).
- Furthermore, the ranking of update counts across different methods for the same sample generally remains consistent. This suggests that variability in unlearning behavior is an intrinsic property of the samples themselves rather than an artifact of any particular algorithm. Together, these observations provide strong empirical evidence for the effectiveness and reliability of MRD in quantifying sample-level unlearning difficulty (a simple rank-correlation check of this relationship is sketched below).
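A quick way to check this monotone relationship is a rank correlation between MRD and the measured update counts; the values below are illustrative placeholders, not the paper's data.

```python
from scipy.stats import spearmanr

# Illustrative placeholder values mirroring the Figure 5 protocol:
# per-sample MRD vs. updates needed to reach the early-stopping condition.
mrd_values    = [0.25, 0.31, 0.34, 0.39, 0.43, 0.49, 0.63, 0.72, 0.77, 1.00]
update_counts = [310, 285, 260, 240, 220, 190, 150, 120, 110, 80]

rho, pval = spearmanr(mrd_values, update_counts)
print(f"Spearman rho = {rho:.3f} (p = {pval:.3g})")  # strongly negative here
```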
6.1.3. Characteristics Influencing MRD
To further interpret MRD and its implications, experiments are conducted on the TOFU task by categorizing samples based on four criteria: semantic complexity, occurrence frequency, initial generation probability, and presence of rare words. 40 samples from each category are randomly selected, and their MRD values are computed.
The following are the results from Table 11 of the original paper:
| Attribute | Level | Example from categorized set | MRD |
| Common Sentence | | Q: Is Farid Benoit currently writing any other books? A: It is reported that Farid Benoit is currently working on his sixth erotica novel, but the title has not been disclosed yet. | 0.4957 |
| Common Sentence | | Q: What is another well-known work by Albert Sidney Lane in the fantasy genre? A: "Beneath the Emerald Veil" is another well-known work by Albert Sidney Lane in the fantasy genre. | 0.4322 |
| Semantic Complexity | Low | Q: What career did Li Mei Yu's mother have? A: Her mother was a nurse. | 0.3085 |
| Semantic Complexity | High | Q: How have Leila Al-Sabah's books contributed to LGBTQ+ representation in literary fiction? A: Through her richly drawn characters and storylines, Leila Al-Sabah has helped to normalize LGBTQ+ experiences in literary fiction. Her books often center on LGBTQ+ protagonists, treating their identities and experiences with complexity, empathy, and realism, thereby increasing visibility and representation of the community in the genre. | 1.0026 |
| Occurrence Frequency | Low | Q: Is Zo Hassani Raharizafy involved in any form of philanthropy? A: Yes, he established the Raharizafy Literary Foundation, which works to improve literacy rates in Madagascar, his home country. | 0.6374 |
| Occurrence Frequency | High | Q: Where was Samir Khoury born? A: Samir Khoury was born in Amman, Jordan. | 0.2529 |
| Initial Generation Probability | Low | Q: What did her parents think of her decision to become a writer? A: Evangeline's parents were initially skeptical about her decision. However, after reading her first novel and witnessing her dedication to the craft, they stood by her decision and have been her constant pillars of support. | 0.3481 |
| Initial Generation Probability | High | Q: What genre does Xin Lee Williams often write in, based on their most famous work, "The Town That Drowned"? A: Xin Lee Williams is recognized for their contributions to Canadian literature, as seen from their trademark work, "The Town That Drowned." | 0.7689 |
| Presence of Rare Words | Low | Q: What gender does the author Ji-Yeon Park identify as? A: The author Ji-Yeon Park identifies as female. | 0.3929 |
| Presence of Rare Words | High | Q: When did Samin Nosrat receive the "Prix Goncourt de Littérature Historique" and for which book? A: Samin Nosrat received the "Prix Goncourt de Littérature Historique" for her vibrant piece "The Seed," which she received in 2011. | 0.7188 |
Analysis:
Table 11 validates the theoretical analysis from Section 3.3 regarding the characteristics influencing MRD.
- Semantic Complexity: Low-complexity samples (e.g., "Her mother was a nurse.") have low MRD (0.3085), indicating they are harder to unlearn. High-complexity samples (e.g., the detailed description of Leila Al-Sabah's contribution to LGBTQ+ representation) have high MRD (1.0026), suggesting they are easier to unlearn. This aligns with the idea that models are more stable on simple, common patterns.
- Occurrence Frequency: High-frequency samples (e.g., "Samir Khoury was born in Amman, Jordan.") have low MRD (0.2529), making them harder to unlearn. Low-frequency samples (e.g., the Raharizafy Literary Foundation in Madagascar) have high MRD (0.6374), making them easier to unlearn. This confirms that frequently seen information is more deeply ingrained.
- Initial Generation Probability: The table lists the Low-probability example (Evangeline's parents' skepticism) with MRD 0.3481 and the High-probability example (Xin Lee Williams) with MRD 0.7689. This is the opposite of the paper's explanation in Section 3.3, which states that a high generation probability yields a small MRD (greater resistance to unlearning). The example assignment in the table therefore appears to be swapped or mislabeled; following the paper's own textual explanation, high initial generation probability should correspond to low MRD (harder to unlearn).
- Presence of Rare Words:
  - Low rare-word presence (e.g., "The author Ji-Yeon Park identifies as female.") has low MRD (0.3929), indicating harder to unlearn.
  - High rare-word presence (e.g., "Samin Nosrat received the 'Prix Goncourt de Littérature Historique'...") has high MRD (0.7188), suggesting easier to unlearn. This confirms that less common vocabulary makes information more volatile.

Overall, the experimental results largely align with the theoretical predictions, demonstrating that MRD effectively captures the influence of these sample characteristics on unlearning difficulty.
6.1.4. Effectiveness and Efficiency of MRD-based Weighted Sampling
The MRD-based weighted sampling method (MRD-enhanced method) is evaluated across four mainstream LLM unlearning tasks against baseline methods.
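Before the results, here is a minimal sketch of what MRD-based weighted sampling could look like; the softmax weighting and temperature are illustrative assumptions, since the paper specifies only that easily forgettable (high-MRD) samples are prioritized.

```python
import torch

def mrd_weighted_order(mrd_values, temperature=1.0):
    """Draw an unlearning order in which high-MRD (easier) samples tend
    to come first. The softmax weighting and temperature are assumed
    instantiations of 'weighted sampling', not the paper's exact rule."""
    mrd = torch.tensor(mrd_values, dtype=torch.float)
    probs = torch.softmax(mrd / temperature, dim=0)
    return torch.multinomial(probs, len(mrd_values), replacement=False).tolist()

# Example: with MRDs [0.25, 0.72, 0.43, 1.00], indices 3 and 1 are
# likely to be scheduled early.
order = mrd_weighted_order([0.25, 0.72, 0.43, 1.00])
```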
Results on TOFU Task: The following are the results from Table 1 of the original paper:
| Method | Unlearning Completeness (UC) | Model Utility (UT) | ||||||||||
| UA (↑) | MIA (↑) | RR (↑) | Relearn (↑) | Avg. (↑) | Retain Set Acc. (↑) | RR (↑) | Real Author Acc. (↑) | RR (↑) | World Fact Acc. (↑) | RR (↑) | Avg. (↑) |
| Original | 0.1475 | 0.4515 | 0.0204 | 1.0000 | 0.4049 | 0.8575 | 0.9825 | 0.8900 | 0.9330 | 0.8632 | 0.8960 | 0.9037 |
| SGA | 0.3725 | 0.4490 | 0.5722 | 0.7375 | 0.5328 | 0.6125 | 0.4212 | 0.3500 | 0.3908 | 0.7094 | 0.7841 | 0.5447 |
| CGA | 0.3825 | 0.4594 | 0.5781 | 0.7625 | 0.5456 | 0.6575 | 0.4296 | 0.5100 | 0.5375 | 0.7436 | 0.7984 | 0.6128 |
| GradDiff | 0.8475 | 0.9977 | 0.9950 | 0.3575 | 0.7994 | 0.7253 | 0.5131 | 0.7100 | 0.7473 | 0.8120 | 0.8547 | 0.7271 |
| GradDiff + MRD | 0.8425 | 0.9997 | 0.9984 | 0.5350 | 0.8439 | 0.7350 | 0.5253 | 0.7300 | 0.7321 | 0.8205 | 0.8561 | 0.7332 |
| PO | 0.7275 | 0.6478 | 0.9314 | 0.5950 | 0.7254 | 0.6114 | 0.4190 | 0.6100 | 0.6988 | 0.7350 | 0.7862 | 0.6434 |
| PO + MRD | 0.7575 | 0.6512 | 0.9773 | 0.7800 | 0.7915 | 0.6250 | 0.4216 | 0.6400 | 0.6963 | 0.7436 | 0.7792 | 0.6510 |
| NPO | 0.8350 | 0.9913 | 0.9821 | 0.4825 | 0.8227 | 0.7433 | 0.5356 | 0.8300 | 0.8291 | 0.8262 | 0.8746 | 0.7731 |
| NPO + MRD | 0.8525 | 0.9992 | 0.9854 | 0.4750 | 0.8280 | 0.7775 | 0.5506 | 0.8900 | 0.8547 | 0.8462 | 0.8832 | 0.8004 |
Analysis of TOFU Results (Table 1):
- Unlearning Completeness (UC): The MRD-enhanced methods consistently improve the UC metrics, particularly MIA, RR, and Relearn, relative to their original baselines. For instance, PO + MRD improves Avg. UC from 0.7254 to 0.7915, GradDiff + MRD from 0.7994 to 0.8439, and NPO + MRD from 0.8227 to 0.8280. This confirms that prioritizing easier-to-unlearn samples enhances the model's ability to forget.
- Model Utility (UT): The MRD-enhanced methods also generally improve model utility. For example, NPO + MRD increases Avg. UT from 0.7731 to 0.8004, and PO + MRD from 0.6434 to 0.6510. This indicates that the curriculum-based sampling strategy not only helps unlearning but also better preserves the model's capabilities on retained knowledge.
- SGA vs. CGA: CGA (Curriculum Gradient Ascent) significantly outperforms SGA (Stochastic Gradient Ascent) on both UC and UT metrics, underscoring the benefit of MRD-based weighted sampling even for basic gradient methods.

The consistent improvements across baselines demonstrate the general applicability and effectiveness of the MRD-based weighted sampling method in optimizing LLM unlearning algorithms.
The following are the results from Table 4 of the original paper, showing metrics change during the unlearning process for SGA, CGA, NPO, and NPO+MRD on TOFU:
| Method | Unlearning Completeness (UC) | Model Utility (UT) | ||||||||||
| UA (↑) | MIA (↑) | RR(↑) | Relearn (↑) | Avg. (↑) | Retain Set Acc. (↑) | RR(↑) | Real Author Acc. (↑) | RR(↑) | World Fact Acc. (↑) | RR (↑) | Avg. (↑) | |
| Original | 0.1475 | 0.4515 | 0.0204 | 1.0000 | 0.4049 | 0.8575 | 0.9825 | 0.8900 | 0.9330 | 0.8632 | 0.8960 | 0.9037 |
| SGA-epoch1 | 0.2025 | 0.4472 | 0.2421 | 0.9675 | 0.4648 | 0.7825 | 0.7514 | 0.7400 | 0.7362 | 0.8034 | 0.8471 | 0.7768 |
| SGA-epoch2 | 0.2750 | 0.4464 | 0.3892 | 0.8800 | 0.4977 | 0.7231 | 0.6353 | 0.6200 | 0.6261 | 0.7606 | 0.8062 | 0.6952 |
| SGA-epoch3 | 0.3200 | 0.4483 | 0.4933 | 0.8150 | 0.5217 | 0.6428 | 0.5277 | 0.4800 | 0.5109 | 0.7179 | 0.7983 | 0.6129 |
| SGA-epoch4 | 0.3725 | 0.4490 | 0.5722 | 0.7375 | 0.5328 | 0.6125 | 0.4212 | 0.3500 | 0.3908 | 0.7094 | 0.7841 | 0.5447 |
| CGA-epoch1 | 0.2475 | 0.4588 | 0.2922 | 0.9425 | 0.4852 | 0.8272 | 0.7614 | 0.7200 | 0.7552 | 0.8376 | 0.8518 | 0.7922 |
| CGA-epoch2 | 0.3075 | 0.4597 | 0.4272 | 0.8700 | 0.5161 | 0.7672 | 0.6526 | 0.6200 | 0.6817 | 0.8034 | 0.8337 | 0.7264 |
| CGA-epoch3 | 0.3450 | 0.4592 | 0.5094 | 0.8075 | 0.5302 | 0.6703 | 0.5328 | 0.5500 | 0.5691 | 0.7606 | 0.8138 | 0.6494 |
| CGA-epoch4 | 0.3825 | 0.4594 | 0.5781 | 0.7625 | 0.5456 | 0.6575 | 0.4296 | 0.5100 | 0.5375 | 0.7436 | 0.7984 | 0.6128 |
| NPO-epoch1 | 0.3375 | 0.8027 | 0.3417 | 0.8225 | 0.5761 | 0.8253 | 0.9015 | 0.8800 | 0.9018 | 0.8462 | 0.8901 | 0.8742 |
| NPO-epoch2 | 0.5650 | 0.9381 | 0.5293 | 0.6825 | 0.6787 | 0.7786 | 0.7803 | 0.8600 | 0.8725 | 0.8376 | 0.8886 | 0.8363 |
| NPO-epoch3 | 0.7125 | 0.9839 | 0.8172 | 0.5425 | 0.7640 | 0.7567 | 0.6519 | 0.8400 | 0.8493 | 0.8290 | 0.8823 | 0.8015 |
| NPO-epoch4 | 0.8350 | 0.9913 | 0.9821 | 0.4825 | 0.8228 | 0.7433 | 0.5356 | 0.8300 | 0.8291 | 0.8262 | 0.8746 | 0.7731 |
| NPO+MRD-epoch1 | 0.3550 | 0.8162 | 0.3715 | 0.8175 | 0.5901 | 0.8367 | 0.9053 | 0.8900 | 0.8937 | 0.8547 | 0.8912 | 0.8786 |
| NPO+MRD-epoch2 | 0.5875 | 0.9481 | 0.5781 | 0.7050 | 0.7047 | 0.7844 | 0.7794 | 0.8800 | 0.8738 | 0.8462 | 0.8885 | 0.8421 |
| NPO+MRD-epoch3 | 0.7425 | 0.9846 | 0.8462 | 0.5325 | 0.7765 | 0.7678 | 0.6781 | 0.8800 | 0.8637 | 0.8462 | 0.8867 | 0.8204 |
| NPO+MRD-epoch4 | 0.8525 | 0.9992 | 0.9854 | 0.4750 | 0.8280 | 0.7775 | 0.5506 | 0.8900 | 0.8547 | 0.8462 | 0.8832 | 0.8004 |
Analysis of TOFU Epoch Changes (Table 4):
Table 4 shows the progression of UC and UT metrics over multiple epochs for SGA, CGA, NPO, and NPO + MRD.
- Faster Convergence: CGA consistently achieves higher UC (e.g., Avg. UC at epoch 1 is 0.4852 for CGA vs. 0.4648 for SGA) and UT (e.g., Avg. UT at epoch 1 is 0.7922 for CGA vs. 0.7768 for SGA) than SGA at earlier epochs, indicating faster convergence toward the unlearning goal.
- Improved Performance: Similarly, NPO + MRD generally shows better UC and UT at equivalent epochs than NPO. For instance, at epoch 2, NPO + MRD reaches an Avg. UC of 0.7047 and Avg. UT of 0.8421, while NPO reaches 0.6787 and 0.8363, respectively. These results confirm that the MRD-enhanced method achieves better unlearning outcomes more quickly or with fewer iterations, validating its role in optimizing unlearning algorithms.
Results on WMDP Task: The following are the results from Table 5 of the original paper:
| Method | Unlearning Completeness (UC) | Model Utility (UT) [MMLU] | | | | | | | |
| Cybersecurity (↓) | Chemical (↓) | Biosafety (↓) | Avg. (↓) | Humanities (↑) | Sciences (↑) | Stem (↑) | Other (↑) | Avg. (↑) | |
| SGA | 0.2430 | 0.2622 | 0.2474 | 0.2467 | 0.2451 | 0.2343 | 0.2388 | 0.2687 | 0.2465 |
| GradDiff | 0.3834 | 0.4460 | 0.6402 | 0.4795 | 0.5028 | 0.6597 | 0.4716 | 0.6343 | 0.5593 |
| NPO | 0.3497 | 0.4656 | 0.6268 | 0.4588 | 0.5292 | 0.6844 | 0.4865 | 0.6569 | 0.5818 |
| CGA | 0.2356 | 0.2547 | 0.2404 | 0.2459 | 0.2417 | 0.3107 | 0.2861 | 0.2514 | 0.2689 |
| GradDiff + MRD | 0.3719 | 0.4387 | 0.6315 | 0.4694 | 0.5132 | 0.6607 | 0.4782 | 0.6392 | 0.5655 |
| NPO + MRD | 0.2773 | 0.4705 | 0.6394 | 0.4244 | 0.5326 | 0.6972 | 0.4906 | 0.6591 | 0.55895 |
Analysis of WMDP Results (Table 5):
- For GradDiff, GradDiff + MRD slightly improves Avg. UC (from 0.4795 to 0.4694; lower is better for UC on WMDP, since these columns measure residual accuracy on hazardous-knowledge questions) and Avg. UT (from 0.5593 to 0.5655).
- For NPO, NPO + MRD shows a significant improvement in Avg. UC (from 0.4588 to 0.4244, lower is better) but a slight decrease in Avg. UT (from 0.5818 to 0.55895). This may indicate a trade-off, or that MRD primarily boosts UC on this task.
- CGA (the MRD-enhanced counterpart of SGA) shows UC comparable to SGA (0.2459 vs. 0.2467) but improved UT (0.2689 vs. 0.2465).
Results on WHP Task: The following are the results from Table 6 of the original paper:
| Method | Unlearning Completeness (UC) | Model Utility (UT) | |||
| Seen Rouge-L (↓) | Unseen Rouge-L (↓) | PPL (↓) | Zero-shot Acc. (↑) | TruthfulQA (↑) | |
| GradDiff | 0.0122 | 0.0132 | 12.46 | 0.6201 | 0.2827 |
| PO | 0.0272 | 0.0292 | 11.88 | 0.6192 | 0.2962 |
| NPO | 0.0121 | 0.0134 | 12.91 | 0.6122 | 0.3023 |
| GradDiff + MRD | 0.0116 | 0.0133 | 12.90 | 0.6191 | 0.2839 |
| PO + MRD | 0.0268 | 0.0291 | 11.76 | 0.6170 | 0.2949 |
| NPO + MRD | 0.0106 | 0.0105 | 12.30 | 0.6205 | 0.3113 |
Analysis of WHP Results (Table 6):
- For WHP (copyright removal), the MRD-enhanced methods generally achieve better UC (lower Seen and Unseen Rouge-L; e.g., NPO + MRD reaches 0.0106 vs. NPO's 0.0121 on Seen Rouge-L) with comparable or slightly improved UT. NPO + MRD attains the best UC overall and an improved TruthfulQA score.
Results on SAFE Task: The following are the results from Table 7 of the original paper:
| Method | Unlearning Completeness (UC) | Model Utility (UT) | |||
| Real Toxicity Prompts Toxic score (↓) | SAFE Toxic score (↓) | PPL (↓) | Zero-shot Acc. (↑) | TruthfulQA (↑) | |
| GradDiff | 0.0268 | 0.0353 | 11.99 | 0.6251 | 0.3011 |
| PO | 0.0308 | 0.0275 | 12.67 | 0.6028 | 0.2386 |
| NPO | 0.0248 | 0.0333 | 11.95 | 0.6270 | 0.3059 |
| GradDiff + MRD | 0.0246 | 0.0353 | 11.71 | 0.6266 | 0.3047 |
| PO + MRD | 0.0252 | 0.0336 | 12.78 | 0.6154 | 0.2766 |
| NPO + MRD | 0.0210 | 0.0332 | 12.82 | 0.6331 | 0.3247 |
Analysis of SAFE Results (Table 7):
- For SAFE (unlearning toxic responses), NPO + MRD achieves the best UC (lowest Real Toxicity Prompts toxic score, 0.0210 vs. NPO's 0.0248) together with the best Zero-shot Acc. and TruthfulQA. GradDiff + MRD also shows slight improvements in UC.
Overall Conclusion on Effectiveness and Efficiency:
The experiments across all four tasks consistently show that integrating MRD-based weighted sampling significantly improves both unlearning completeness and model utility for various baseline unlearning algorithms. These improvements, coupled with faster convergence (as shown in Table 4), validate the hypothesis that dynamically adjusting the unlearning sequence based on MRD values is an effective strategy.
6.1.5. Efficiency Analysis of MRD-based Weighted Sampling
The paper also analyzes the computational overhead of MRD calculation and the overall efficiency gains.
The following are the results from Table 8 of the original paper, showing the connection between time cost of MRD computation and batch size:
| Batch Size | Time |
| 8 | 3m30s |
| 16 | 2m32s |
| 32 | 2m07s |
| 64 | 1m55s |
| 128 | 1m23s |
Analysis of MRD Computation Time (Table 8):
Table 8 shows that MRD computation time decreases as the batch size increases. This is because MRD calculation primarily involves parallel inference operations, which benefit significantly from larger batch sizes by utilizing GPU resources more efficiently. This indicates that the overhead can be managed by optimizing batching.
The following are the results from Table 9 of the original paper, showing the comparison of the time required to execute one round of the algorithm:
| Method | Time |
| GA | 1m40s |
| Graddiff | 2m03s |
| NPO | 2m08s |
| PO | 2m23s |
Analysis of One-Round Execution Time (Table 9):
Table 9 provides the time taken for one round (presumably one epoch or a fixed number of steps) of execution for various unlearning algorithms. Comparing this with Table 8, when the batch size for MRD computation exceeds 64 (e.g., at 128 batch size, 1m23s), MRD computation can actually become more efficient than a single round of some unlearning algorithms. This is significant because MRD only requires inference, while unlearning algorithms involve computationally intensive backward passes and parameter updates, limiting their maximum batch size due to GPU memory.
The following are the results from Table 10 of the original paper, showing the comparison of method times with and without MRD:
| Method | Time | Method | Time |
| GA | 8m20s | GA+MRD | 6m23s |
| Graddiff | 14m21s | Graddiff+MRD | 11m38s |
| PO | 17m4s | PO+MRD | 14m11s |
| NPO | 16m0s | NPO+MRD | 12m3s |
Analysis of Total Execution Time (Table 10):
Table 10 provides the total end-to-end execution time for the original and MRD-enhanced unlearning algorithms. Crucially, even with the overhead of MRD calculation (performed periodically), the MRD-improved methods (GA+MRD, Graddiff+MRD, PO+MRD, NPO+MRD) consistently show reduced total execution times compared to their original counterparts. This is because MRD-based weighted sampling accelerates convergence, so fewer total unlearning epochs or steps are needed to reach the unlearning goal. The result is a significantly lower overall runtime, demonstrating that MRD improves not only effectiveness but also overall efficiency.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Parameter Sensitivity
Experiments are conducted on the TOFU task to evaluate the impact of the perturbation parameter σ and the number of Monte Carlo samples K on the MRD calculation.
The following figure (Figure 6 from the original paper) shows the parameter sensitivity of MRD:

Analysis of Perturbation Parameter (Figure 6a):
- Figure 6a shows the effect of σ on MRD values, keeping K = 200 fixed.
- As σ increases, the MRD value fluctuates around 0.64.
- This suggests that the MRD calculation is not highly sensitive to the specific choice of σ within the tested range. For computational simplicity, σ = 1e-5 was chosen in the paper.
Analysis of Monte Carlo Samples (Figure 6b):
- Figure 6b illustrates the variation of MRD values as K increases, with σ fixed.
- When K is small, the MRD estimate fluctuates significantly, indicating high variance in the Monte Carlo approximation.
- As K reaches around 50, the MRD estimate gradually stabilizes.
- Optimal stability and performance are achieved once K is sufficiently large; the default of K = 200 used in the main experiments lies well inside this stable regime (a sketch of such a stability sweep follows).
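A sketch of such a stability sweep, reusing the estimate_mrd sketch from Section 5.4 and assuming a loaded `model` and tokenized `batch` are already in scope:

```python
# Stability sweep over the Monte Carlo budget K (mirrors Figure 6b).
# Assumes estimate_mrd (sketch above), `model`, and `batch` are in scope.
for k in (10, 25, 50, 100, 200):
    runs = [estimate_mrd(model, batch, sigma=1e-5, k=k) for _ in range(3)]
    mean = sum(runs) / len(runs)
    spread = max(runs) - min(runs)   # run-to-run variation of the estimate
    print(f"K={k:>3}: mean MRD={mean:.4f}, spread={spread:.4f}")
```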
Additionally, the paper notes (Appendix F.4) that different values of K were tested on Qwen3 models of varying sizes.
The following are the results from Table 12 of the original paper, showing stable sample counts K across Qwen3 models:
| Model | Counts |
| Qwen3 4B | 60 |
| Qwen3 8B | 80 |
| Qwen3 14B | 50 |
| Qwen3 32B | 80 |
Analysis of K across Different Models (Table 12):
Table 12 indicates that the number of Monte Carlo samples K needed for a stable estimate varies only slightly across Qwen3 model sizes (60 for 4B, 80 for 8B, 50 for 14B, 80 for 32B). The crucial finding is that model size has minimal impact on the required number of sampling iterations, which stays within a similar range throughout, demonstrating the scalability of the MRD method with respect to K.
6.2.2. Ablation Study of Update Interval
An ablation study is conducted on the update interval m in Algorithm 2 to determine how frequently MRD values should be recomputed.
The following are the results from Table 13 of the original paper, showing an ablation study of m:
| Method | Unlearning Completeness (UC) | Model Utility (UT) | ||||||||||
| UA (↑) | MIA (↑) | RR (↑) | Relearn (↑) | Avg. (↑) | Retain Set Acc. (↑) | RR(↑) | Real Author Acc. (↑) | RR(↑) | World Fact Acc. (↑) | RR(↑) | Avg. (↑) | |
| Original | 0.1475 | 0.4515 | 0.0204 | 1.0000 | 0.4049 | 0.8575 | 0.9825 | 0.8900 | 0.9330 | 0.8632 | 0.8960 | 0.9037 |
| PO + MRD - m=1 | 0.7525 | 0.6472 | 0.9714 | 0.7825 | 0.7884 | 0.6228 | 0.4187 | 0.6200 | 0.6864 | 0.7436 | 0.7778 | 0.6449 |
| PO + MRD - m=2 | 0.7575 | 0.6512 | 0.9773 | 0.7800 | 0.7953 | 0.6250 | 0.4216 | 0.6300 | 0.6963 | 0.7350 | 0.7792 | 0.6478 |
| PO + MRD - m=3 | 0.7500 | 0.6451 | 0.9681 | 0.7850 | 0.7871 | 0.6267 | 0.4245 | 0.6300 | 0.6924 | 0.7350 | 0.7752 | 0.6473 |
Analysis of Update Interval (Table 13):
- Table 13 shows the unlearning performance of PO + MRD for different values of m (the interval at which MRD is recomputed).
- When m = 1 (recomputing MRD every iteration), Avg. UC is 0.7884 and Avg. UT is 0.6449.
- When m = 2 (recomputing MRD every two iterations), Avg. UC is 0.7953 and Avg. UT is 0.6478, the best performance among the tested values.
- When m = 3, UC drops slightly (0.7871) while UT is comparable (0.6473).
- This suggests that recomputing MRD too frequently (m = 1) or too infrequently (m = 3) is suboptimal. An interval of m = 2 strikes the best balance, letting the curriculum adapt to changes in sample difficulty without excessive computational overhead. A sketch of this periodic re-computation loop follows.
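To show how the update interval enters the outer loop, here is a minimal sketch; estimate_mrd is the sketch from Section 5.4, and unlearn_step stands in for one baseline update (both placeholders, not the paper's code).

```python
def curriculum_unlearn(model, forget_set, epochs=5, m=2):
    """Outer loop of an MRD-curriculum unlearning run (sketch).

    Every m epochs the MRD values are recomputed and the forget set is
    reordered so that easier samples (higher MRD) are unlearned first.
    `estimate_mrd` is the sketch from Section 5.4; `unlearn_step` stands
    in for one baseline update (GA/GradDiff/PO/NPO) on a single sample.
    """
    order = list(range(len(forget_set)))
    for epoch in range(epochs):
        if epoch % m == 0:  # refresh the curriculum every m epochs
            mrd = [estimate_mrd(model, sample) for sample in forget_set]
            order = sorted(order, key=lambda i: -mrd[i])  # easiest first
        for i in order:
            unlearn_step(model, forget_set[i])
```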
6.2.3. MRD of Samples at Different Levels
The paper also investigates the robustness of MRD across different granularity levels of text (sentence, paragraph, long-text).
The following figure (Figure 7 from the original paper) shows MRD and unlearning difficulty of different text levels:

Analysis:
Figure 7 presents the MRD values and corresponding unlearning difficulty rankings for samples at the sentence level, paragraph level, and long-text level.
- The results demonstrate that MRD values exhibit a reasonable degree of stability and robustness across different text lengths. While the absolute values may vary, the relative difficulty rankings (i.e., which samples are harder or easier to unlearn) largely hold across these levels of granularity.
- This suggests that MRD is a flexible metric that can be applied to different text units, providing consistent insight into unlearning difficulty regardless of the length of the forget-set samples, which enhances the generalizability of the MRD concept.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces a fresh perspective on LLM unlearning by emphasizing the crucial role of sample-level unlearning difficulty. Driven by privacy concerns and the limitations of current evaluation practices, the authors propose Memory Removal Difficulty (MRD), a novel metric inspired by neuroscience, to quantify how hard individual samples are to unlearn. MRD is defined as the expected relative change in a sample's generation probability under minor Gaussian parameter perturbations. A key finding validated through experiments is that unlearning difficulty varies significantly across samples, making sample selection critical for fair evaluation.
The paper further analyzes the characteristics that influence MRD: high-frequency, high initial generation probability, and syntactically simple samples tend to be harder to unlearn (lower MRD), while high-complexity and rare-word samples are generally easier to unlearn (higher MRD). Leveraging these insights, the authors propose an MRD-based weighted sampling method, a curriculum learning-inspired approach, to optimize existing unlearning algorithms. This method prioritizes easily forgettable samples, which significantly improves both unlearning completeness and model utility while also enhancing unlearning efficiency (reducing total execution time). The extensive experimental validation on diverse LLM unlearning tasks (TOFU, WMDP, WHP, SAFE) confirms the effectiveness, reasonableness, and scalability of MRD and the proposed sampling method.
7.2. Limitations & Future Work
The authors suggest several promising future research directions:
- Hierarchical Unlearning: MRD could be used to construct hierarchical unlearning strategies, where concepts or groups of samples are unlearned in a structured manner based on their difficulty.
- Reinforcement Learning Integration: MRD could serve as a reward mechanism for reinforcement-learning-based unlearning approaches, guiding the agent to prioritize forgetting certain types of samples.
- Regularization Term: MRD might be incorporated as a regularization term directly into the unlearning objective function, explicitly balancing the unlearning process.
- Reassessing Evaluation Rationality: Researchers can use MRD to critically re-evaluate the fairness and rationality of existing LLM unlearning evaluation benchmarks and practices.

The paper does not explicitly state limitations, but some can be inferred:
- The neuro-inspired analogy, while intuitive, is a high-level conceptual link rather than a direct neurobiological model. The empirical evidence supports the metric's utility, but the "neuro-inspired" aspect is primarily a guiding metaphor.
- The approximation of MRD relies on a second-order Taylor expansion and the trace of the Hessian. Although Monte Carlo sampling makes this computationally tractable, approximating Hessian-related quantities can still be demanding for extremely large models.
- The effectiveness of MRD-based weighted sampling may depend on the specific characteristics of the forget set and the baseline unlearning algorithm; there could be scenarios where the curriculum is not optimal or needs more sophisticated dynamic adjustment.
- The sensitivity to parameters such as the number of Monte Carlo samples K shows that some tuning is still required to obtain stable MRD values, though the study shows this is manageable.
7.3. Personal Insights & Critique
This paper offers a highly valuable contribution by bringing the concept of sample-level difficulty to the forefront of LLM unlearning research. The observation that unlearning difficulty is non-uniform is critical, as it exposes a potential flaw in how current unlearning algorithms are evaluated and highlights why results might be inconsistent or misleading. The neuro-inspired analogy is a creative way to motivate the metric, making the concept of MRD intuitive for beginners.
The definition of MRD itself is clever. By focusing on the relative change in generation probability under parameter perturbations, it directly quantifies the stability of a model's "memory" of a sample, which is a robust indicator of how hard it would be to dislodge that memory. The theoretical link to the Hessian trace provides a solid mathematical grounding and interpretability, explaining why samples with higher curvature are easier to unlearn.
The proposed MRD-based weighted sampling method is a practical and elegant solution. It transforms the theoretical understanding of unlearning difficulty into an actionable strategy that demonstrably improves existing methods. The analogy to curriculum learning makes it easily understandable and applicable. The efficiency gains (reduced total runtime) are particularly impressive, showing that interpretability can directly lead to practical benefits.
Potential issues or areas for improvement:
- The "Initial Generation Probability" example in Table 11: As noted in the analysis above, the example labeled "high initial generation probability" having the higher MRD value contradicts the paper's own theoretical explanation (high probability should mean low MRD, i.e., harder to unlearn). Clarifying this discrepancy, for instance whether the example was mislabeled or whether nuances are not fully captured, would strengthen the empirical validation.
- Nature of "Neuro-Inspired": While inspiring, the connection is analogical. Further work could explore whether specific neural mechanisms of memory consolidation and forgetting could inspire more direct computational models or architectural changes in LLMs for better unlearning, moving beyond a metric.
- Generalizability Beyond LLMs: The core idea of quantifying sample-level unlearning difficulty via stability under perturbation could potentially be generalized to other deep learning models (e.g., computer vision), adapting the "generation probability" to task-specific outputs such as classification probabilities.
- Robustness to Malicious Inputs: Is MRD robust to adversarial attempts to manipulate difficulty scores? A sophisticated attacker might craft "easy-to-unlearn" samples that are actually critical, or make critical samples appear "hard" to evade unlearning.

Overall, this paper provides a fresh and impactful perspective, offering both a novel analytical tool (MRD) and a practical optimization strategy for a critical area of LLM development. It successfully steers the conversation toward a more nuanced understanding of LLM unlearning dynamics.