A Neuro-Inspired Interpretation of Unlearning in Large Language Models through Sample-Level Unlearning Difficulty
TL;DR Summary
This paper introduces the Memory Removal Difficulty (MRD) metric for quantifying sample-level unlearning difficulty in Large Language Models, analyzes the sample characteristics that drive that difficulty, and proposes an MRD-based weighted sampling method to optimize existing unlearning algorithms, enhancing their efficiency and effectiveness.
Abstract
Driven by privacy protection laws and regulations, unlearning in Large Language Models (LLMs) is gaining increasing attention. However, current research often neglects the interpretability of the unlearning process, particularly concerning sample-level unlearning difficulty. Existing studies typically assume a uniform unlearning difficulty across samples. This simplification risks attributing the performance of unlearning algorithms to sample selection rather than the algorithm’s design, potentially steering the development of LLM unlearning in the wrong direction. Thus, we investigate the relationship between LLM unlearning and sample characteristics, with a focus on unlearning difficulty. Drawing inspiration from neuroscience, we propose a Memory Removal Difficulty (MRD) metric to quantify sample-level unlearning difficulty. Using MRD, we analyze the characteristics of hard-to-unlearn versus easy-to-unlearn samples. Furthermore, we propose an MRD-based weighted sampling method to optimize existing unlearning algorithms, which prioritizes easily forgettable samples, thereby improving unlearning efficiency and effectiveness. We validate the proposed metric and method using public benchmarks and datasets, with results confirming its effectiveness.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is "A Neuro-Inspired Interpretation Of Unlearning In Large Language Models Through Sample-Level Unlearning Difficulty." It focuses on understanding and quantifying the difficulty of unlearning specific data samples within Large Language Models (LLMs), drawing inspiration from neuroscience.
1.2. Authors
The authors are listed as "Anonymous authors," indicating that the paper is currently under double-blind review. This means the authors' identities are concealed from reviewers to ensure an unbiased review process.
1.3. Journal/Conference
The paper is noted as "Paper under double-blind review," indicating it has been submitted to a conference or journal but has not yet been formally accepted or published. Given the topic, the likely venue is a machine learning or natural language processing conference or journal.
1.4. Publication Year
The publication date (UTC) is stated as 2025-04-09T00:00:00.000Z.
1.5. Abstract
The paper addresses the growing need for unlearning in Large Language Models (LLMs), driven by privacy regulations. It identifies a critical gap in current research: the lack of interpretability of the unlearning process, particularly regarding sample-level unlearning difficulty. Existing studies often assume uniform unlearning difficulty, which can lead to misattributing algorithm performance to sample selection rather than algorithmic design. To address this, the authors investigate the relationship between LLM unlearning and sample characteristics. They propose a neuro-inspired metric called Memory Removal Difficulty (MRD) to quantify sample-level unlearning difficulty. Using MRD, they analyze the characteristics that make samples hard or easy to unlearn. Furthermore, they introduce an MRD-based weighted sampling method to optimize existing unlearning algorithms by prioritizing easily forgettable samples, thereby enhancing unlearning efficiency and effectiveness. The proposed metric and method are validated through experiments on public benchmarks, confirming their efficacy.
1.6. Original Source Link
The original source link is /files/papers/6957ab954a1fbc163064c2a7/paper.pdf. This indicates it is likely a preprint or an internally hosted file, given its anonymous status and the context of being under review. Its publication status is "under double-blind review."
2. Executive Summary
2.1. Background & Motivation
The proliferation of Large Language Models (LLMs) has brought significant advancements in natural language processing (NLP), largely due to their remarkable ability to memorize vast training corpora. However, this strong memorization capability also poses substantial challenges, including privacy breaches, bias propagation, and the generation of illegal content. Regulatory frameworks like the GDPR (General Data Protection Regulation) necessitate the ability to remove specific private or unwanted information from models upon user request. This requirement has propelled Machine Unlearning (MU) into the spotlight.
The core problem the paper aims to solve is the lack of interpretability and a nuanced understanding of the unlearning process in LLMs, specifically regarding sample-level unlearning difficulty. Prior research in MU, particularly for LLMs, frequently treats all data samples as having uniform unlearning difficulty. This simplification is problematic because:
- Misattribution of Performance: An algorithm's apparent superior performance might stem from selecting easy-to-unlearn samples rather than from genuine algorithmic advancements. This can lead to misleading conclusions about the effectiveness of unlearning methods.
- Hindered Development: Without understanding what makes a sample hard or easy to unlearn, the development of more efficient and effective unlearning algorithms is hampered. Researchers might optimize for the wrong aspects if they do not understand the underlying data characteristics influencing unlearning.
The paper's entry point is to challenge this uniform difficulty assumption by investigating the intrinsic characteristics of samples that influence their unlearning difficulty. Drawing inspiration from neuroscience, specifically how different types of memories exhibit varying resistance to forgetting (e.g., long-term vs. short-term memories and their robustness to minor brain injuries), the authors propose a novel approach to quantify this sample-level unlearning difficulty in LLMs.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Novel Metric for Sample-Level Unlearning Difficulty (MRD): The paper proposes Memory Removal Difficulty (MRD) as a novel, computationally efficient metric to quantify the unlearning difficulty of individual data samples in LLMs. The metric is inspired by neuroscience, drawing an analogy between memory robustness in the human brain and the stability of generation probabilities in LLMs under parameter perturbations. A smaller MRD value indicates higher unlearning difficulty (i.e., less change in generation probability under small parameter changes).
- Analysis of Characteristics Influencing Unlearning Difficulty: Using the MRD metric, the paper provides an in-depth analysis of what characteristics make certain samples harder or easier to unlearn. Key findings include: high-frequency samples, or those with high initial generation probabilities, tend to have lower MRD values, indicating they are harder to unlearn; high-complexity samples (e.g., nested syntax) or those containing rare words tend to have higher MRD values, suggesting they are easier to unlearn. These findings offer valuable insights into the factors influencing LLM unlearning behavior.
- MRD-based Weighted Sampling Method: The paper proposes a practical, plug-and-play MRD-based weighted sampling method to optimize existing unlearning algorithms. Inspired by curriculum learning, this method prioritizes the unlearning of easily forgettable samples first, followed by harder ones, by adjusting their sampling probabilities based on their MRD values.
- Enhanced Unlearning Efficiency and Effectiveness: Through extensive experiments on public benchmarks (TOFU, WMDP, WHP, SAFE), the paper demonstrates that the MRD metric effectively captures sample difficulty. The MRD-based weighted sampling method significantly improves both the unlearning completeness (UC) and model utility (UT) of various baseline unlearning algorithms (e.g., GradDiff, PO, NPO), while also accelerating convergence and reducing overall execution time. This confirms the practical value of understanding sample-level unlearning difficulty.

These contributions address the identified gap by providing a formal definition and empirical understanding of sample-level unlearning difficulty, which is crucial for fair evaluation and directed development of future LLM unlearning techniques.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a foundational grasp of Large Language Models (LLMs), Machine Unlearning (MU), and related deep learning concepts is essential.
- Large Language Models (LLMs):
  - Definition: LLMs are deep learning models, typically based on the Transformer architecture, trained on massive amounts of text data. They are designed to understand, generate, and process human-like language. The term "large" refers to their enormous number of parameters (billions to trillions), which allows them to capture complex patterns and relationships in language.
  - Autoregressive Models: Many LLMs, especially those used for text generation, are autoregressive. This means they predict the next token (word or sub-word unit) in a sequence based on all the preceding tokens. Their training objective is usually to minimize the negative log-likelihood of the training data.
  - Log-Likelihood: In probabilistic models, the log-likelihood measures how well the model predicts the observed data. For an autoregressive model generating a sequence of tokens $\pmb{x} = (x_1, \ldots, x_n)$, the log-likelihood is $\sum_{t=1}^{n} \log p(x_t \mid \pmb{x}_{<t}; \pmb{\theta})$, where $p(x_t \mid \pmb{x}_{<t}; \pmb{\theta})$ is the probability of token $x_t$ given the previous tokens $\pmb{x}_{<t}$ and model parameters $\pmb{\theta}$. A higher log-likelihood means the model assigns higher probability to the observed sequence, indicating better memorization or understanding. (A minimal code sketch of this computation appears after this list.)
- Machine Unlearning (MU):
  - Definition: Machine Unlearning refers to the process of removing the influence of specific training data from a machine learning model, as if the model had never been trained on that data in the first place. This is crucial for privacy (e.g., GDPR's "right to be forgotten"), fairness (removing biased data), and security.
  - Forget Set ($\mathcal{D}_f$) and Retain Set ($\mathcal{D}_r$): In Machine Unlearning, the training dataset is typically divided into two parts: the forget set ($\mathcal{D}_f$), which contains the samples to be removed, and the retain set ($\mathcal{D}_r$), which comprises the data that should remain influential on the model.
  - Unlearning Completeness (UC): This metric quantifies how effectively the influence of the forget set has been removed from the model. High UC means the model behaves as if it never saw the forget set data.
  - Model Utility (UT): This metric assesses how well the model retains its performance on unrelated tasks or the retain set after unlearning. High UT means unlearning the forget set did not significantly degrade the model's general capabilities.
- Gradient-based Optimization:
  - Gradient Ascent/Descent: In machine learning, gradient descent is an iterative optimization algorithm used to find the minimum of a function (e.g., a loss function) by taking steps proportional to the negative of the gradient. Gradient ascent is its counterpart, used to find the maximum of a function by moving in the direction of the positive gradient. In unlearning, gradient ascent is often applied on the forget set to actively "push" the model away from memorizing those samples, effectively increasing their negative log-likelihood.
  - Hessian Matrix ($H_t = \nabla^2 P_t(\pmb{\theta})$): The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function. It describes the local curvature of a function. In the context of model parameters $\pmb{\theta}$ and a function like the log-likelihood $P_t(\pmb{\theta})$, the Hessian indicates how sensitive the function's output is to small changes in parameters. A large trace (sum of diagonal elements) of the Hessian implies high curvature, meaning the function changes rapidly with small parameter perturbations.
- Preference Optimization:
  - Concept: This approach frames the task as aligning model behavior with desired preferences, often using techniques like Reinforcement Learning from Human Feedback (RLHF). In unlearning, this can involve treating forgotten data as "negative examples" and optimizing the model to avoid generating responses associated with them, while simultaneously preferring "positive examples" (e.g., refusals or counterfactual samples) for the forget set prompts.
- Curriculum Learning:
  - Concept: Inspired by human learning, curriculum learning is a training strategy where a model is first trained on easier examples and gradually exposed to more complex ones. The idea is that learning from a structured curriculum can lead to faster convergence and better generalization than random ordering. In the context of unlearning, this would mean prioritizing easy-to-unlearn samples before tackling harder-to-unlearn ones.
- Gaussian Perturbation:
  - Definition: A Gaussian perturbation involves adding random noise drawn from a Gaussian (normal) distribution (e.g., $\delta \sim \mathcal{N}(0, \sigma^2 I)$) to model parameters or inputs. Here, $\sigma^2$ is the variance, and $I$ is the identity matrix, indicating independent noise components. This is often used to simulate small, random changes, similar to minor "injuries" in the paper's neuroscience analogy.
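As flagged in the log-likelihood item above, here is a minimal PyTorch sketch of the two building blocks used throughout the paper: per-token log-likelihood of a sequence and an isotropic Gaussian parameter perturbation. `TinyLM` is an illustrative stand-in rather than anything from the paper; any causal LM exposing `(batch, seq) -> (batch, seq, vocab)` logits would behave the same way.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TinyLM(nn.Module):
    """Illustrative stand-in for an autoregressive LM."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):                    # x: (batch, seq) token ids
        h, _ = self.rnn(self.emb(x))
        return self.head(h)                  # (batch, seq, vocab) logits

def sequence_log_likelihood(model, x):
    """sum_t log p(x_t | x_<t; theta): positions 1..T-1 predict tokens 2..T."""
    logits = model(x[:, :-1])
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1)
    return tok_logp.sum()

@torch.no_grad()
def gaussian_perturb_(model, sigma=1e-5):
    """In-place theta <- theta + delta with delta ~ N(0, sigma^2 I);
    returns the deltas so the perturbation can be undone."""
    deltas = [torch.randn_like(p) * sigma for p in model.parameters()]
    for p, d in zip(model.parameters(), deltas):
        p.add_(d)
    return deltas

model, x = TinyLM(), torch.randint(0, 100, (1, 12))
print(sequence_log_likelihood(model, x))     # log-likelihood under theta
deltas = gaussian_perturb_(model)            # theta + delta
print(sequence_log_likelihood(model, x))     # log-likelihood under theta + delta
```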
3.2. Previous Works
The paper contextualizes its contributions by reviewing existing Machine Unlearning (MU) and LLM Unlearning research, highlighting the shift towards interpretability.
- Machine Unlearning (General):
  - Exact Unlearning: Aims to produce a model mathematically identical to one trained from scratch without the forget set. These methods often involve intricate data partitioning or model ensemble techniques to distribute computational overhead (e.g., Bourtoule et al., 2021; Li et al., 2024b). While providing strong guarantees, they are computationally very expensive.
  - Approximate Unlearning: Seeks a model approximately equivalent to the retrained gold standard, either in terms of parameters or output behavior. This is more practical for large models. Methods include estimating data influence (Koh & Liang, 2017; Liu et al., 2024b) or fine-tuning with specific objectives.
- LLM Unlearning: LLM unlearning is primarily framed as approximate unlearning due to the immense scale of LLMs. The goal is to balance unlearning completeness (UC), i.e., removing targeted knowledge, with model utility (UT), i.e., preserving general performance.
  - Gradient-based Methods: Early approaches used gradient ascent (GA) on the forget set to increase the loss, effectively pushing the model to "forget" (Jang et al., 2023). While improving UC, this often degraded UT. Subsequent work introduced regularization (e.g., parameter or loss regularization) to mitigate utility loss (Maini et al., 2024; Yao et al., 2024).
  - Preference Optimization-based Methods: These methods treat unlearning as a preference optimization task. The forget set data is considered negative examples, and the model is optimized to produce rejection-based answers (e.g., refusals) or counterfactual samples when prompted with forget set inputs (Zhang et al., 2024). While integrated, they can suffer from low efficiency.
  - Model Weight-based Methods: More recent research explores identifying and guiding unlearning at the module level by analyzing model weights and modular structures within LLMs (Jia et al., 2024). These methods offer insights but can also be computationally intensive.
- Interpretability of MU: Recent trends emphasize understanding why unlearning works or fails.
  - Fan et al. (2024) studied how different forget set partitions impact model performance on retain sets in image classification.
  - Zhao et al. (2024) investigated explainable features within forget sets and their link to unlearning difficulty.
  - Chen et al. (2024) showed unlearning difficulty variation across users in recommendation systems.
  - These studies collectively point towards the importance of sample-level analysis for unlearning interpretability.
3.3. Technological Evolution
The field of Machine Unlearning has evolved from theoretical concepts and exact unlearning methods, which were often computationally prohibitive for large-scale models, to more practical approximate unlearning techniques. Initially, the focus was on general classification models, with less emphasis on the specific challenges posed by Large Language Models.
The advent of LLMs brought new urgency to unlearning due to their massive scale, strong memorization capabilities, and the implications for privacy and bias. Early LLM unlearning methods adapted existing approximate techniques like gradient ascent and fine-tuning. However, a major challenge has been balancing unlearning completeness with model utility, as aggressive unlearning can degrade overall performance.
More recently, the research community has recognized the need for deeper interpretability in MU. This paper's work fits into this evolving landscape by addressing a critical oversight: the assumption of uniform unlearning difficulty across samples. It pushes the frontier of LLM unlearning by moving beyond simply how to unlearn to understanding what makes unlearning difficult for individual samples, a crucial step for fair evaluation and targeted algorithm design.
3.4. Differentiation Analysis
This paper differentiates itself from existing research in several key ways:
- Focus on Sample-Level Unlearning Difficulty for LLMs: While some prior work (e.g., Chen et al., 2024 for recommendation systems) hinted at varying unlearning difficulty across users, this paper explicitly defines and quantifies sample-level unlearning difficulty specifically for LLMs. Most existing LLM unlearning studies assume uniform difficulty or do not formally define it at the granular sample level.
- Neuro-Inspired Metric (MRD): The paper introduces a novel Memory Removal Difficulty (MRD) metric, drawing a unique analogy from neuroscience regarding memory robustness to minor brain injuries. This provides a fresh theoretical perspective that links the stability of generation probabilities under parameter perturbations to unlearning difficulty. This "neuro-inspired" approach is distinct from purely mathematical or empirical metrics used previously.
- Computational Feasibility for LLMs: Unlike some theoretical measures that require computationally expensive second-order information (e.g., full Hessian matrix inversion) or are designed for image classification (and may not generalize to text-based autoregressive LLMs), MRD is designed to be computationally efficient for LLMs, relying on Monte Carlo sampling of parameter perturbations.
- Interpretability-Driven Analysis: The paper goes beyond proposing a metric; it uses MRD to conduct an in-depth analysis of sample characteristics (semantic complexity, frequency, rare words, initial generation probability) that influence unlearning difficulty. This contributes significantly to the interpretability of the LLM unlearning process, which has been largely underexplored.
- Practical Optimization Strategy: The paper translates its interpretability findings into a practical MRD-based weighted sampling method. This curriculum learning-inspired approach is a general, plug-and-play strategy that can enhance existing LLM unlearning algorithms, offering a concrete way to leverage the MRD metric for improved efficiency and effectiveness. This moves beyond mere measurement to practical application.

In essence, while previous works focused on performing unlearning or broadly interpreting unlearning phenomena, this paper zeroes in on the inherent difficulty of individual samples within LLMs, proposing a biologically inspired metric and a practical method to leverage this understanding.
4. Methodology
This section provides a detailed, step-by-step deconstruction of the paper's methodology, integrating the explanations of mathematical formulas directly into the procedural flow.
4.1. Principles
The core principle behind this paper's methodology is to interpret unlearning difficulty in Large Language Models (LLMs) through an analogy to human memory and brain function, specifically how long-term memories are more resistant to minor brain injuries than short-term memories.
The intuition is as follows:
- In the human brain, long-term memories (like core skills or significant personal experiences) are generally robust and not easily disrupted by minor Traumatic Brain Injuries (mTBI).
- Conversely, short-term memories are more fragile and prone to disruption from such injuries.
- This suggests varying "forgetting difficulty" across different types of knowledge in the brain.

Extending this analogy to LLMs, the authors hypothesize that:

- Hard-to-unlearn samples (analogous to long-term memories) will exhibit minimal changes in their generation probability distribution when the model's parameters undergo minor perturbations (analogous to mTBI).
- Easy-to-unlearn samples (analogous to short-term memories) will show more significant changes in their generation probability under similar minor parameter perturbations.

Therefore, the principle is to quantify unlearning difficulty by measuring the stability of a sample's generation probability in the face of small, random changes to the model's parameters. A sample whose generation probability is highly stable (i.e., changes little) under perturbation is considered harder to unlearn, while one with a volatile generation probability is easier to unlearn.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Setup of LLM Unlearning
The paper first formally defines the training of an autoregressive LLM and the objective of LLM unlearning.
Autoregressive Model Training:
Given a training set $\mathcal{D}$ that is partitioned into a forget set $\mathcal{D}_f$ and a retain set $\mathcal{D}_r$:

- $\mathcal{D}_f$ contains samples to be unlearned.
- $\mathcal{D}_r$ contains samples to be retained. Each sample $\pmb{x}^i$ is a sequence of tokens of length $n_i$.

The parameters $\theta_o$ of an autoregressive model trained on $\mathcal{D}$ are obtained by minimizing the Negative Log-Likelihood (NLL) loss:

$$\theta_o = \arg\min_{\theta} \mathcal{L}(\theta; \mathcal{D}), \qquad \mathcal{L}(\theta; \mathcal{D}) = \mathbb{E}_{\pmb{x}^i \sim \mathcal{D}} \left[ -\sum_{t=1}^{n_i} \log p(x_t \mid \pmb{x}_{<t}; \theta) \right]$$

Here:

- $\theta_o$ represents the model parameters after training on the full dataset $\mathcal{D}$.
- $\mathcal{L}(\theta; \mathcal{D})$ is the Negative Log-Likelihood loss over the entire dataset $\mathcal{D}$.
- $\mathbb{E}_{\pmb{x}^i \sim \mathcal{D}}$ denotes the expectation (average) over all samples in the dataset $\mathcal{D}$.
- $\sum_{t=1}^{n_i} \log p(x_t \mid \pmb{x}_{<t}; \theta)$ is the log-likelihood of a single sample $\pmb{x}^i$, calculated as the sum of log-probabilities of generating each token $x_t$ given the preceding tokens $\pmb{x}_{<t}$ and the current model parameters $\theta$. Minimizing the NLL maximizes the likelihood of observing the training data.
Objective of LLM Unlearning:
The goal of LLM unlearning for a set of samples $\mathcal{D}_f$ is typically framed as an optimization problem:

$$\max_{\theta} \; \mathbb{E}_{f \in F}\left[ f(\theta; \mathcal{D}_r) \right] \quad \text{subject to} \quad g(\theta; \mathcal{D}_f) \geq \epsilon$$

In this formulation:

- The objective is to maximize the model's utility on the retain set $\mathcal{D}_r$.
- $\mathbb{E}_{f \in F}[f(\theta; \mathcal{D}_r)]$ quantifies this utility, where $F$ is a set of functions (e.g., accuracy, perplexity) assessing various model capabilities on $\mathcal{D}_r$.
- The subject-to clause enforces the unlearning completeness constraint.
- $g(\theta; \mathcal{D}_f) \geq \epsilon$ means that a measure of unlearning for the forget set must exceed a certain threshold $\epsilon$. For example, $g$ could ensure the probability of generating forget set samples is sufficiently low, or that the divergence between the model's output distribution on $\mathcal{D}_f$ and the true distribution is sufficiently high.

In essence, the optimization seeks to maintain the model's general abilities while ensuring the targeted forget set information has been effectively erased. (A sketch of a common penalty-form relaxation of this objective follows.)
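In practice, several of the baselines discussed later (notably GradDiff) relax this constrained problem into a single differentiable objective. The sketch below is one common penalty-form relaxation, not the paper's exact formulation; `nll` is assumed to be a negative log-likelihood helper (e.g., the negated `sequence_log_likelihood` from the earlier sketch), and the weight `lam` is a hypothetical hyperparameter.

```python
def unlearning_loss(model, retain_batch, forget_batch, nll, lam=1.0):
    """Penalty-form relaxation of the constrained unlearning objective:
    keep the retain-set NLL low (utility) while pushing the forget-set
    NLL up (gradient ascent on the forget term)."""
    loss_retain = nll(model, retain_batch)   # preserve utility on the retain set
    loss_forget = nll(model, forget_batch)   # increase this to erase the forget set
    return loss_retain - lam * loss_forget   # minimized by standard optimizers
```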
4.2.2. Motivation for Measuring Sample-Level Unlearning Difficulty
The paper highlights that current evaluations of LLM unlearning algorithms often suffer from sample selection bias. This is demonstrated by two key observations:
- Variability in Performance: As shown in Figure 2 (from the original paper), for the same unlearning algorithm (e.g., GradDiff or NPO), the model's performance (Acc, indicating utility retention) varies significantly when different unlearning samples are chosen. This suggests that some samples are inherently easier or harder to unlearn, and random selection can lead to inconsistent evaluation outcomes.

The following figure (Figure 2 from the original paper) shows the impact of sample selection on unlearning evaluation: it reports Acc under two unlearning methods (GradDiff and NPO) across three groups of results, (a) the retain set, (b) real authors, and (c) world facts, highlighting the differences samples exhibit during unlearning.
- Ranking Reversals: Figure 2 also shows that the ranking of unlearning algorithms can reverse depending on the chosen samples. For instance, NPO generally outperforms GradDiff, but for certain samples, GradDiff performs better. This further underscores that a method's perceived superiority might be due to the ease of unlearning the selected test samples, rather than its fundamental design.

The paper argues that existing training difficulty metrics (e.g., GraNd, EN2L, VoG), which quantify how hard a sample is to learn, are not directly applicable to unlearning difficulty. Figure 3 (from the original paper) illustrates this: when samples are ranked by their actual unlearning difficulty (measured by parameter changes), these training metrics do not exhibit a clear monotonic relationship and even disagree sharply for samples of similar unlearning difficulty.

The following figure (Figure 3 from the original paper) shows the rankings of 25 sample groups by GraNd, EN2L, VoG, and MRD under the sample unlearning difficulty ordering; red boxes mark groups where the metrics diverge markedly despite equal sample difficulty.
This motivates the need for a dedicated metric to quantify sample-level unlearning difficulty.
4.2.3. Defining Memory Removal Difficulty (MRD)
The paper proposes Memory Removal Difficulty (MRD) to quantify sample-level unlearning difficulty, inspired by the neuroscience analogy discussed earlier.
Initial Metric and its Limitations:
An initial, intuitive metric for MRD was proposed as:

$$\mathrm{MRD}_0(\pmb{x}^i; \theta) = \left| \sum_{t=1}^{n_i} \left( P_t(\theta + \delta) - P_t(\theta) \right) \right|$$

where $P_t(\theta) = \log p(x_t \mid \pmb{x}_{<t}; \theta)$ is the log-probability of the $t$-th token given current parameters $\theta$, and $\delta$ is a single small random perturbation.

However, this initial metric had two limitations:

- Limited Perturbation Scope: A single perturbation direction might not fully capture the global impact of parameter variations.
- Absolute Metric Bias: Using absolute changes could unfairly penalize samples that inherently have low generation probabilities, making their relative changes appear larger even if the model's confidence is not drastically shifting.
Refined Memory Removal Difficulty (MRD) - Definition 3.1:
To address these limitations, the paper refines the MRD metric by incorporating sample length normalization, a global perturbation mechanism (through expectation over Gaussian noise), and relative measures.
Definition 3.1. For an LLM with parameters $\theta$, the difficulty of unlearning a sample $\pmb{x}^i$ is defined as:

$$\mathrm{MRD}(\pmb{x}^i; \theta) = \left| \mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^2 I)} \sum_{t=1}^{n_i} \frac{P_t(\theta + \delta) - P_t(\theta)}{P_t(\theta)} \right|$$

Here, the symbols are explained as follows:

- $\mathrm{MRD}(\pmb{x}^i; \theta)$: The Memory Removal Difficulty for a specific sample $\pmb{x}^i$ given the current model parameters $\theta$.
- $|\cdot|$: The absolute value, ensuring that MRD is always non-negative.
- $\mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^2 I)}$: The expectation taken over Gaussian perturbation vectors. This means $\delta$ is drawn from a multivariate normal distribution with a mean of 0 and a covariance matrix of $\sigma^2 I$.
  - $\delta$: A vector of small random perturbations applied to each parameter in $\theta$.
  - $\mathcal{N}(0, \sigma^2 I)$: A multivariate normal distribution where 0 is the mean vector (no systematic bias in perturbation), $\sigma^2$ is the variance, and $I$ is the identity matrix, implying that perturbations to different parameters are independent and have the same variance. This is referred to as Gaussian isotropic noise.
- $\sum_{t=1}^{n_i}$: The sum over all tokens in the sample $\pmb{x}^i$, from $t = 1$ to its length $n_i$. This normalizes the measure by the sample's length (implicitly, as it sums over all tokens).
- $P_t(\theta)$: The log-probability (or negative log-likelihood in some contexts, though here it refers to log-likelihood) of generating the $t$-th token given the preceding tokens and the current model parameters $\theta$. This is the generation probability of the token.
- $P_t(\theta + \delta)$: The log-probability of generating the $t$-th token after applying the perturbation $\delta$ to the model parameters.
- $\frac{P_t(\theta + \delta) - P_t(\theta)}{P_t(\theta)}$: The relative change in the log-probability of the $t$-th token due to the perturbation. This addresses the "absolute metric bias" limitation by considering the change proportional to the original value.
Interpretation:
A smaller MRD value indicates less fluctuation in the generation probability of a sample under parameter perturbations. This implies that the sample's "memory" is more stable and thus harder to unlearn. Conversely, a larger MRD suggests greater sensitivity to perturbations, meaning the sample is easier to unlearn. The choice of Gaussian isotropic noise simplifies implementation compared to anisotropic noise by not requiring specific knowledge about which parameters to disturb or their exact ranges.
4.2.4. Approximation of MRD (Theorem 3.2)
The paper provides an approximation of the MRD metric using a second-order Taylor expansion, linking it to the Hessian matrix.
Theorem 3.2 (Approximation of MRD). Assuming that $P_t(\theta)$ and $P_t(\theta + \delta)$ are non-zero, and $\delta \sim \mathcal{N}(0, \sigma^2 I)$ represents a small perturbation where $\sigma$ is sufficiently small, the MRD can be approximated as follows:

$$\mathrm{MRD}(\pmb{x}^i; \theta) \approx \left| \frac{\sigma^2}{2} \sum_{t=1}^{n_i} \frac{\operatorname{tr}(H_t)}{P_t(\theta)} \right|$$

where $H_t = \nabla^2 P_t(\pmb{\theta})$ represents the Hessian matrix of $P_t(\pmb{\theta})$ with respect to $\pmb{\theta}$.
Proof (from Appendix A):
The proof aims to show the relationship between MRD and the Hessian matrix.
- Start with the Definition of MRD:

  $$\mathrm{MRD}(\pmb{x}^i; \theta) = \left| \mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^2 I)} \sum_{t=1}^{n_i} \frac{P_t(\theta + \delta) - P_t(\theta)}{P_t(\theta)} \right|$$

  Here, $P_t(\theta)$ is the log-likelihood of the $t$-th token, $\delta$ is the parameter perturbation, and $n_i$ is the length of the sentence $\pmb{x}^i$.

- Multivariate Taylor Expansion: Perform a multivariate Taylor expansion of $P_t(\theta + \delta)$ up to the second-order term around $\theta$:

  $$P_t(\theta + \delta) \approx P_t(\theta) + \nabla P_t(\theta)^\top \delta + \frac{1}{2} \delta^\top H_t \delta$$

  - $\nabla P_t(\theta)$: The gradient of $P_t(\theta)$ with respect to $\theta$.
  - $H_t = \nabla^2 P_t(\boldsymbol{\theta})$: The Hessian matrix of $P_t(\theta)$ with respect to $\theta$.

- Substitute into the Numerator: Substituting this expansion into the numerator $P_t(\theta + \delta) - P_t(\theta)$ gives:

  $$P_t(\theta + \delta) - P_t(\theta) \approx \nabla P_t(\theta)^\top \delta + \frac{1}{2} \delta^\top H_t \delta$$

- Express the Relative Change: The relative change term in MRD becomes:

  $$\frac{P_t(\theta + \delta) - P_t(\theta)}{P_t(\theta)} \approx \frac{\nabla P_t(\theta)^\top \delta}{P_t(\theta)} + \frac{1}{2} \frac{\delta^\top H_t \delta}{P_t(\theta)}$$

- Compute Expectation: Substitute this expression back into the MRD formula and compute the expectation over $\delta$:

  $$\mathrm{MRD}(\pmb{x}^i; \theta) \approx \left| \mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^2 I)} \sum_{t=1}^{n_i} \left( \frac{\nabla P_t(\theta)^\top \delta}{P_t(\theta)} + \frac{1}{2} \frac{\delta^\top H_t \delta}{P_t(\theta)} \right) \right|$$

  - First-order term: Since $\delta \sim \mathcal{N}(0, \sigma^2 I)$, its expectation is $\mathbb{E}[\delta] = 0$. Therefore, the expectation of the first-order term vanishes: $\mathbb{E}_{\delta}\left[\frac{\nabla P_t(\theta)^\top \delta}{P_t(\theta)}\right] = 0$.
  - Second-order term: For a multivariate normal distribution, the expectation of a quadratic form is $\mathbb{E}_{\delta}[\delta^\top H_t \delta] = \sigma^2 \operatorname{tr}(H_t)$. So, the expectation of the second-order term becomes $\frac{\sigma^2}{2} \frac{\operatorname{tr}(H_t)}{P_t(\theta)}$.

- Final Approximation: Since the first-order term's expectation is zero, taking the absolute value of the remaining second-order term ($P_t(\theta)$ is usually negative, and $\operatorname{tr}(H_t)$ can be positive or negative, so the absolute value ensures a positive difficulty), the approximate expression for MRD is:

  $$\mathrm{MRD}(\pmb{x}^i; \theta) \approx \left| \frac{\sigma^2}{2} \sum_{t=1}^{n_i} \frac{\operatorname{tr}(H_t)}{P_t(\theta)} \right|$$

  Here:

  - $\sigma^2$: The variance of the Gaussian perturbation.
  - $\operatorname{tr}(H_t)$: The trace (sum of diagonal elements) of the Hessian matrix, which captures the local curvature of the log-probability function with respect to the model parameters.
  - $P_t(\theta)$: The log-probability of the $t$-th token.
Interpretation of MRD from Approximation:
Theorem 3.2 shows that MRD is approximately proportional to the Hessian trace ($\operatorname{tr}(H_t)$) and inversely related to the log-probability ($P_t(\theta)$).

- A large trace of the Hessian matrix indicates that the loss function (or generation probability) has a large overall curvature at that point in parameter space. A steeper loss landscape means that smaller parameter changes suffice to significantly alter the generation probability and reach the unlearning threshold. This implies fewer updates are needed, making the sample easier to unlearn.
- Because $P_t(\theta)$ (the log-probability, typically negative) appears in the denominator, MRD is inversely related to it. The paper's reading is that a high log-probability (meaning the model is very confident in generating this token) corresponds to a smaller MRD, indicating higher resistance to unlearning. This aligns with the intuition that well-memorized, high-probability samples are harder to forget.

Thus, MRD serves as a reasonable metric for unlearning difficulty by reflecting how sensitive a sample's generation probability is to parameter changes.
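The key identity behind the second-order step, $\mathbb{E}[\delta^\top H \delta] = \sigma^2 \operatorname{tr}(H)$ for $\delta \sim \mathcal{N}(0, \sigma^2 I)$, is easy to sanity-check numerically. The sketch below verifies it against a random symmetric matrix standing in for $H_t$; all values are illustrative.

```python
import torch

torch.manual_seed(0)
d, sigma, K = 8, 0.1, 200_000

A = torch.randn(d, d)
H = (A + A.T) / 2                          # random symmetric "Hessian"

delta = torch.randn(K, d) * sigma          # K draws of delta ~ N(0, sigma^2 I)
quad = torch.einsum('ki,ij,kj->k', delta, H, delta)  # delta^T H delta per draw

print(quad.mean().item())                  # Monte Carlo estimate of E[delta^T H delta]
print((sigma**2 * torch.trace(H)).item())  # sigma^2 * tr(H): should match closely
```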
4.2.5. Computational Complexity of MRD
The analytical computation of the expectation in the MRD definition is infeasible. Therefore, Monte Carlo sampling is used to approximate it.
The computational complexity of calculating MRD for a sample of length $n_i$ using $K$ Monte Carlo samples is $\mathcal{O}(K \cdot n_i \cdot d)$, where $d$ is the number of model parameters.

- $K$: Number of Monte Carlo samples for the perturbation.
- $n_i$: Length of the input sequence (number of tokens). For each token, the log-probability needs to be computed.
- $d$: Number of model parameters. Computing $P_t(\theta + \delta)$ involves a forward pass through the model with perturbed parameters, and a single forward pass scales linearly with the number of parameters $d$.

This linear scaling with $d$ ensures computational efficiency, making MRD feasible for large LLMs.
The detailed computation of MRD is described in Algorithm 1.
Algorithm 1: Computation implementation of MRD

1: Input: Sample sequence $\pmb{x}^i$; model parameters $\theta$; disturbance variance $\sigma^2$; number of Monte Carlo samples $K$.
2: Output: The MRD value of sample $\pmb{x}^i$.
3: Initialize: $S \leftarrow 0$.
4: for $k = 1$ to $K$ do
5:   Sample disturbance vector $\delta_k \sim \mathcal{N}(0, \sigma^2 I)$.
6:   $\Delta_k \leftarrow 0$.
7:   for $t = 1$ to $n_i$ do
8:     $P_t(\theta) \leftarrow \log p(x_t \mid \pmb{x}_{<t}; \theta)$.
9:     $P_t(\theta + \delta_k) \leftarrow \log p(x_t \mid \pmb{x}_{<t}; \theta + \delta_k)$.
10:    $r_t \leftarrow \left( P_t(\theta + \delta_k) - P_t(\theta) \right) / P_t(\theta)$.
11:    $\Delta_k \leftarrow \Delta_k + r_t$.
12:  end for
13:  $m_k \leftarrow |\Delta_k|$.
14:  $S \leftarrow S + m_k$.
15: end for
16: Return: $\mathrm{MRD}(\pmb{x}^i; \theta) = S / K$.
Explanation of Algorithm 1:
- Lines 4-15: The outer loop iterates $K$ times for Monte Carlo sampling.
- Line 5: In each iteration, a random perturbation vector $\delta_k$ is sampled from a Gaussian distribution.
- Lines 7-12: The inner loop iterates through each token of the sample $\pmb{x}^i$.
- Line 8: Calculates the log-probability of the token with the original model parameters $\theta$.
- Line 9: Calculates the log-probability of the token with the perturbed model parameters $\theta + \delta_k$.
- Line 10: Computes the relative change in log-probability for the current token.
- Line 11: Accumulates these relative changes over all tokens in the sample.
- Line 13: Takes the absolute value of the total accumulated relative change for the current Monte Carlo sample, storing it as $m_k$.
- Line 14: Sums up the $m_k$ values from all Monte Carlo samples.
- Line 16: Returns the average MRD value by dividing the sum by $K$.
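A minimal PyTorch sketch of Algorithm 1, reusing the `TinyLM` stand-in and per-token log-probability helper from the earlier sketch (repeated here so the block is self-contained). The defaults σ = 1e-5 and K = 200 follow the paper's reported settings; the model and token ids are illustrative.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TinyLM(nn.Module):
    """Same illustrative stand-in as in the earlier sketch."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

def token_logprobs(model, x):
    """P_t(theta) = log p(x_t | x_<t; theta) for each predicted position."""
    logp = F.log_softmax(model(x[:, :-1]), dim=-1)
    return logp.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1).squeeze(0)

@torch.no_grad()
def mrd(model, x, sigma=1e-5, K=200):
    """Monte Carlo estimate of MRD(x; theta), following Algorithm 1."""
    params = list(model.parameters())
    base = token_logprobs(model, x)             # P_t(theta)
    total = 0.0
    for _ in range(K):                          # line 4: K Monte Carlo draws
        delta = [torch.randn_like(p) * sigma for p in params]  # line 5
        for p, d in zip(params, delta):
            p.add_(d)                           # theta + delta_k
        pert = token_logprobs(model, x)         # line 9: P_t(theta + delta_k)
        for p, d in zip(params, delta):
            p.sub_(d)                           # restore theta
        rel = (pert - base) / base              # line 10: relative change
        total += rel.sum().abs().item()         # lines 11-14: |sum_t r_t|
    return total / K                            # line 16: average over draws

model, x = TinyLM(), torch.randint(0, 100, (1, 12))
print(mrd(model, x, K=20))                      # small K just for this demo
```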
4.2.6. Characteristics Influencing MRD
Based on the theoretical analysis (Theorem 3.2), MRD is proportional to the local geometric curvature ($\operatorname{tr}(H_t)$) and inversely related to the log-probability ($P_t(\theta)$). This leads to several hypotheses about sample characteristics:

- Semantic Complexity / Syntactic Simplicity:
  - Low Complexity (e.g., "The cat is sleeping."): Samples that are syntactically simple and structurally clear tend to have smooth output distributions. This implies a relatively small local geometric curvature (i.e., $\operatorname{tr}(H_t)$ is small). Consequently, their MRD values are low, indicating higher resistance to unlearning. The model is very confident and stable in generating these common, simple sequences.
  - High Complexity (e.g., "The intricacies of quantum mechanics perplex many scientists."): Samples with nested syntax, complex modifications, or rare vocabulary often have steeper distributions with sharper variations in parameter space. These lead to higher MRD values, making them more susceptible to perturbations and thus easier to unlearn. The model's confidence in these complex sequences is more fragile.
- Initial Generation Probability:
  - High Probability (e.g., "I love reading books."): If a sample's generation probability is high (meaning the model is very confident about it), its corresponding MRD will be small. This indicates greater resistance to unlearning. Such samples are often encountered frequently in training or share strong contextual associations, making them well-ingrained.
  - Low Probability (e.g., "The sesquipedalian lecturer pontificated endlessly."): Samples with complex syntax or rare vocabulary are expected to have lower generation probabilities. They exhibit larger changes in generation probabilities under parameter perturbations, leading to higher MRD values, making them more susceptible to unlearning.
- Occurrence Frequency: This is often correlated with initial generation probability.
  - High Frequency: Samples that appear frequently in the training data are typically well-memorized and have high generation probabilities. They are expected to have lower MRD values, meaning they are harder to unlearn.
  - Low Frequency (Long-tail Distributions): Samples from long-tail distributions are less common and thus less well-memorized. They are expected to have higher MRD values, making them easier to unlearn.

These characteristics, validated experimentally, provide valuable insights into why certain samples are harder or easier to unlearn, moving beyond a simple "black-box" understanding.
4.2.7. MRD-based Weighted Sampling Method
Leveraging the MRD metric, the paper proposes an MRD-based weighted sampling method to optimize existing unlearning algorithms. This method is inspired by curriculum learning, where learning progresses from simple to complex. Here, unlearning progresses from easily forgettable samples to harder-to-unlearn samples.
The strategy works as a general, plug-and-play enhancement:
- MRD Calculation: Calculate the MRD value for all samples in the forget set.
- Weighted Sampling: Use MRD values as scores to adjust the sampling probability of unlearning samples. Samples with higher MRD (easier to unlearn) are prioritized by being sampled more frequently.
- Dynamic Progression: As easy-to-unlearn samples are forgotten, their MRD values might change, so the sampling probabilities are periodically updated, effectively guiding the unlearning process from simple to complex.

For analytical clarity, the paper extends Stochastic Gradient Ascent (SGA) into Curriculum Gradient Ascent (CGA) using MRD.
Remark 3.3. Unlearning Efficiency:
For an unlearning algorithm $\mathcal{U}$, the unlearning efficiency is defined as:

$$E(\mathcal{U}) = \frac{1}{M(\mathcal{U}) \cdot C(\mathcal{U})}$$

where:

- $M(\mathcal{U})$: The total number of updates needed to meet the unlearning goal.
- $C(\mathcal{U})$: The computational cost per update.
Remark 3.4. Updates and MRD:
When the update magnitude per iteration is fixed, the average number of updates $I(\pmb{x}^i)$ required to unlearn a sample $\pmb{x}^i$ can be regarded as:

$$I(\pmb{x}^i) \propto \frac{1}{\mathrm{MRD}(\pmb{x}^i; \theta)}$$

This means samples with lower MRD (harder to unlearn) require more updates ($I(\pmb{x}^i)$ is larger), while samples with higher MRD (easier to unlearn) require fewer updates ($I(\pmb{x}^i)$ is smaller).
The CGA method aims for higher unlearning efficiency. The computational complexity analysis in Appendix B shows that CGA can achieve significantly higher efficiency than SGA, i.e., $E(\mathcal{U}_{\mathrm{CGA}}) \gg E(\mathcal{U}_{\mathrm{SGA}})$, especially for large forget sets. This means that for the same computational cost (e.g., a fixed number of updates), CGA should achieve better unlearning completeness while preserving model utility.
The implementation of Curriculum Gradient Ascent (CGA) leveraging MRD is detailed in Algorithm 2.
Algorithm 2: Curriculum Gradient Ascent Unlearning

1: Input: Model parameters $\theta$; forget set $\mathcal{D}_f$; difficulty metric $\mathrm{MRD}$; update interval $\omega$.
2: Output: Updated model parameters $\theta_u$.
3: Initialize: Compute $\mathrm{MRD}(\pmb{x}^i; \theta)$ for each sample $\pmb{x}^i \in \mathcal{D}_f$.
4: repeat
5:   for $s = 1$ to $S$ do ($S$ is the total number of unlearning steps/iterations)
6:     Sample sentences $\pmb{x}^i$ from $\mathcal{D}_f$ with probability
7:       $p_i = \mathrm{MRD}(\pmb{x}^i; \theta) / \sum_{j} \mathrm{MRD}(\pmb{x}^j; \theta)$.
8:     Update $\theta$ by gradient ascent on the sampled sentences.
9:     if $s \bmod \omega = 0$ then
10:      Update $\mathrm{MRD}(\pmb{x}^i; \theta)$ for each sample.
11:    end if
12:  end for
13: until convergence or maximum iteration reached
14: Return: $\theta_u$
Explanation of Algorithm 2:
- Line 3: At the beginning, the MRD value for every sample in the forget set $\mathcal{D}_f$ is computed using Algorithm 1. This creates an initial difficulty ranking.
- Lines 4-13: The main unlearning process loops until convergence (unlearning goal met) or maximum iterations are reached.
- Lines 5-12: The inner loop performs the unlearning steps.
- Lines 6-7: For each unlearning step, samples are drawn from $\mathcal{D}_f$ with a probability proportional to their current MRD value. This ensures easily forgettable samples (higher MRD) are sampled more frequently initially.
- Line 8: The model parameters are updated using gradient ascent based on the selected sample(s). (This gradient ascent step would typically involve increasing the negative log-likelihood for the forget samples.)
- Lines 9-11: Periodically (every $\omega$ iterations, where $\omega$ is the update interval), the MRD values for all samples are recomputed. This matters because, as the model unlearns, the MRD values of samples change (e.g., an "easy" sample might become "harder" after its influence is mostly removed). This dynamic update allows the curriculum to adapt.
- Line 14: The updated model parameters are returned.
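A minimal sketch of the MRD-weighted sampling step at the heart of Algorithm 2, assuming `mrd_scores` holds the current MRD value of each forget-set sample (the numbers below are made up). In a full CGA loop, these indices pick the sentences for the gradient-ascent update, and the scores are refreshed every $\omega$ steps via Algorithm 1.

```python
import torch

def sample_forget_indices(mrd_scores, batch_size):
    """Draw forget-set indices with p_i = MRD_i / sum_j MRD_j, so higher-MRD
    (easier-to-unlearn) samples are selected more often early on."""
    probs = mrd_scores / mrd_scores.sum()
    return torch.multinomial(probs, batch_size, replacement=True)

mrd_scores = torch.tensor([0.8, 0.1, 2.3, 0.5])   # hypothetical MRD values
print(sample_forget_indices(mrd_scores, 2))       # e.g., tensor([2, 0])
```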
4.2.8. Computational Complexity Analysis (Appendix B)
The appendix provides a comparative analysis of the computational complexity between Stochastic Gradient Ascent (SGA) and Curriculum Gradient Ascent (CGA).
Analyzing SGA:
- Procedure: SGA involves two steps: (i) randomly sample a forget sample at each iteration; (ii) update the parameters using the gradient of the negative log-likelihood for the selected sample.
- Total Updates: Assuming uniform selection probability, the total updates required to unlearn all samples are $M(\mathcal{U}_{\mathrm{SGA}}) = \sum_{i=1}^{N_f} I(\pmb{x}^i)$, where $I(\pmb{x}^i)$ is the number of updates needed for sample $\pmb{x}^i$. Since samples are chosen with probability $1/N_f$, the effective number of updates for a specific sample across many iterations is related to its inverse sampling probability.
- Per-update Cost: The computational cost per update is $\mathcal{O}(d)$ for a forward/backward pass.
- Efficiency: The unlearning efficiency of SGA is $E(\mathcal{U}_{\mathrm{SGA}}) = 1 / (\sum_{i=1}^{N_f} I(\pmb{x}^i) \cdot \mathcal{O}(d))$.
Analyzing CGA:
- Procedure: CGA involves three key steps: (i) compute MRD values for all samples; (ii) select samples based on MRD, prioritizing those with lower unlearning difficulty (higher MRD values); (iii) apply gradient ascent updates.
- Sampling Probability: The sampling probability for each sample is $p_i = I(\pmb{x}^i) / \sum_{j=1}^{N_f} I(\pmb{x}^j)$, based on the inverse relationship with MRD from Remark 3.4. This means harder-to-unlearn samples (larger $I(\pmb{x}^i)$) are sampled more frequently, accelerating their unlearning.
- Total Updates: The total number of unlearning updates for CGA is also $M(\mathcal{U}_{\mathrm{CGA}}) = \sum_{j=1}^{N_f} I(\pmb{x}^j)$. However, due to the weighted sampling, the effective updates for hard samples are concentrated earlier, leading to faster overall progress towards the unlearning goal.
- Complexity: The complexity of CGA includes $\mathcal{O}(K \cdot n_i \cdot d)$ for the MRD computation (done initially and then every $\omega$ steps) and $\mathcal{O}(d)$ for parameter updates.
- Efficiency: The unlearning efficiency is $E(\mathcal{U}_{\mathrm{CGA}}) = 1 / (\sum_{j=1}^{N_f} I(\pmb{x}^j) \cdot \mathcal{O}(d))$. The paper claims $E(\mathcal{U}_{\mathrm{CGA}}) \gg E(\mathcal{U}_{\mathrm{SGA}})$, suggesting a substantial efficiency gain because the weighted sampling allows CGA to reach the unlearning objective more rapidly (a lower effective $M$ for the same overall target) compared to SGA. This advantage is particularly pronounced for large unlearning sets.
Discussion on Other Forms of Improving Methods (Appendix D.1):
The paper notes that MRD is a general metric whose potential to improve existing unlearning methods is independent of the model type. While curriculum learning-based weighted sampling is used here as a heuristic, other improvement directions could include:

- Constructing hierarchical unlearning strategies.
- Using MRD to build reward mechanisms for reinforcement learning-based unlearning.
- Incorporating MRD as a regularization term in the unlearning objective.
Second-Order Approximation Analysis of MRD (Appendix D.2):
The paper clarifies that while high curvature in optimization training might correspond to sharp local minima (which can be problematic), in unlearning, it indicates that generation probabilities are highly sensitive to parameter perturbations. This sensitivity means the sample is easier to unlearn because small parameter changes can significantly alter its probability. Therefore, high curvature is a useful indicator of unlearning difficulty. The paper also differentiates MRD from influence functions and saliency maps by noting its focus on changes in predictions after removing data, and its computational feasibility for LLMs (sampling vs. second-order information for influence functions).
5. Experimental Setup
5.1. Datasets
The authors validate their MRD metric and MRD-based weighted sampling method across four mainstream LLM unlearning tasks and datasets.
- TOFU (Maini et al., 2024):
  - Task: Virtual author information unlearning. The goal is to remove knowledge about specific fictional authors.
  - Source & Characteristics: This benchmark fine-tunes an LLM with data on 200 fictional authors, each associated with 20 Question-Answer (QA) pairs. A subset of these authors is designated as the unlearn set (forget set), and the remaining authors form the retain set. It specifically assesses the model's ability to selectively unlearn targeted factual information.
  - Data Example (from Table 11): Q: Is Farid Benoit currently writing any other books? A: It is reported that Farid Benoit is currently working on his sixth erotica novel, but the title has not been disclosed yet.
  - Chosen Proportion: The authors selected a 10% proportion for the forget set among the available options (1%, 5%, 10%).
- WMDP (Li et al., 2024a):
  - Task: Unlearning harmful capabilities. This dataset tests the model's ability to forget dangerous knowledge.
  - Source & Characteristics: This benchmark evaluates an LLM's capacity to unlearn harmful knowledge, particularly in domains such as biosafety, cybersecurity, and chemical safety. The forget set consists of plain text related to biological and cybersecurity knowledge, while unrelated text serves as the retain set.
- Who's Harry Potter (WHP) (Eldan & Russinovich, 2023):
  - Task: Copyright information removal. The objective is to remove content related to the Harry Potter series.
  - Source & Characteristics: For this task, 200 data chunks, each containing 512 tokens, were extracted from the Harry Potter series to form the forget set.
- PKU SafeRLHF (SAFE) (Ji et al., 2024b):
  - Task: Unlearning model toxic responses. This benchmark aims to remove the model's ability to generate toxic outputs when given inappropriate prompts.
  - Source & Characteristics: For the SAFE task, 200 negative examples were randomly sampled from the PKU-SafeRLHF training set to construct the forget set.
  - Retain Set for WHP and SAFE: To maintain model utility for both the copyright removal (WHP) and detoxification (SAFE) tasks, the C4 dataset (Raffel et al., 2020) was utilized as the retain set. The C4 dataset (Colossal Clean Crawled Corpus) is a very large, diverse, and cleaned-up collection of web text, commonly used for training and evaluating general language models.

These datasets are chosen because they represent diverse and relevant LLM unlearning scenarios, including factual knowledge, harmful content, copyright material, and toxic responses, making them effective for validating the proposed method's general applicability and performance.
5.2. Evaluation Metrics
The performance of unlearned LLMs is assessed across two primary dimensions: Unlearning Completeness (UC) and Model Utility (UT). For every evaluation metric, a conceptual definition, mathematical formula, and symbol explanation are provided.
5.2.1. Unlearning Completeness (UC) Metrics
Unlearning Completeness (UC) quantifies how effectively the model has forgotten the targeted data from the forget set. A higher value generally indicates better unlearning.
General Metrics:
- Memorization Accuracy (MA) (Tirumala et al., 2022):
  - Conceptual Definition: MA quantifies how much a model has memorized a given sequence of tokens. It measures the proportion of tokens in a sequence that the model can correctly predict given the preceding tokens. A lower MA on the forget set after unlearning indicates better UC. The early stopping condition (Appendix E.4) specifies that a sample is forgotten when its MA falls below a certain threshold.
  - Mathematical Formula:

    $$\mathrm{MA}(\pmb{x}) = \frac{\sum_{t=2}^{T} \mathbb{1}\left\{ \arg\max_{x'} p(x' \mid \pmb{x}_{<t}; \theta) = x_t \right\}}{T - 1}$$

  - Symbol Explanation:
    - $\mathrm{MA}(\pmb{x})$: Memorization Accuracy for a token sequence $\pmb{x}$.
    - $\pmb{x}$: A sequence of tokens of length $T$.
    - $\sum_{t=2}^{T}$: Sums over all tokens from the second token ($t = 2$) to the last token ($t = T$), as each token is predicted based on its predecessors.
    - $\mathbb{1}\{\cdot\}$: The indicator function, which evaluates to 1 if the condition inside the braces is true, and 0 otherwise.
    - $\arg\max_{x'} p(x' \mid \pmb{x}_{<t}; \theta)$: The token with the highest predicted probability (the most likely token) output by the model given the preceding tokens and current parameters $\theta$.
    - $x_t$: The true $t$-th token in the sequence.
    - $T - 1$: The total number of predictions made (since the first token has no preceding context for prediction). (A short code sketch of MA and of the n-gram overlap used below appears after this list.)
- Extraction Likelihood (EL) (Jang et al., 2023):
  - Conceptual Definition: EL estimates the general likelihood of extracting a target sequence from the model. It measures the average success rate of varying extraction attacks by quantifying the n-gram overlap between model-generated sequences and target token sequences, starting from different prefixes. A lower EL on the forget set after unlearning indicates better UC. The early stopping condition (Appendix E.4) specifies that a sample is forgotten when its EL falls below a certain threshold.
  - Mathematical Formula:

    $$\mathrm{EL}_n(\pmb{x}) = \frac{\sum_{t=1}^{T-n} \mathrm{Overlap}_n\left( f_\theta(\pmb{x}_{<t}), \pmb{x}_{\geq t} \right)}{T - n}, \qquad \mathrm{Overlap}_n(\pmb{a}, \pmb{b}) = \frac{\sum_{c \in \mathrm{ng}(\pmb{a})} \mathbb{1}\{ c \in \mathrm{ng}(\pmb{b}) \}}{|\mathrm{ng}(\pmb{a})|}$$

  - Symbol Explanation:
    - $\mathrm{EL}_n(\pmb{x})$: Extraction Likelihood for a sequence $\pmb{x}$ using n-grams.
    - $\pmb{x}$: A sequence of tokens of length $T$.
    - $f_\theta(\pmb{x}_{<t})$: The output token sequences generated by the LLM when given $\pmb{x}_{<t}$ as input with parameters $\theta$. The generated sequence can have a maximum length of $T - t$ but might be shorter if an EOS (end-of-sequence) token is generated.
    - $\pmb{x}_{\geq t}$: The true suffix of the sequence starting from token $t$.
    - $T - n$: The number of possible starting points for n-gram overlap comparisons, accounting for the n-gram length.
    - $\mathrm{Overlap}_n(\pmb{a}, \pmb{b})$: A function measuring the n-gram overlap between two token sequences $\pmb{a}$ (generated) and $\pmb{b}$ (target).
    - $\mathrm{ng}(\cdot)$: Denotes the list of n-grams in the given token sequence.
    - $\sum_{c \in \mathrm{ng}(\pmb{a})} \mathbb{1}\{c \in \mathrm{ng}(\pmb{b})\}$: Counts how many n-grams from sequence $\pmb{a}$ are also present in sequence $\pmb{b}$.
    - $|\mathrm{ng}(\pmb{a})|$: The total number of n-grams in sequence $\pmb{a}$.
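As flagged above, here is a minimal sketch of these two measures over plain token-id lists; `predict_next` stands in for a model's greedy next-token prediction and is a hypothetical callable, not an API from the paper.

```python
from typing import Callable, List

def memorization_accuracy(tokens: List[int],
                          predict_next: Callable[[List[int]], int]) -> float:
    """Fraction of tokens x_t (t >= 2) whose greedy prediction from x_<t is correct."""
    hits = sum(predict_next(tokens[:t]) == tokens[t] for t in range(1, len(tokens)))
    return hits / (len(tokens) - 1)

def ngrams(tokens: List[int], n: int) -> List[tuple]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_n(generated: List[int], target: List[int], n: int) -> float:
    """Share of the generated sequence's n-grams that also occur in the target."""
    gen = ngrams(generated, n)
    tgt = set(ngrams(target, n))
    return sum(g in tgt for g in gen) / len(gen) if gen else 0.0

# Toy checks with a fixed "model" that always predicts token 7:
print(memorization_accuracy([7, 7, 7, 3], lambda ctx: 7))   # 2/3
print(overlap_n([1, 2, 3, 4], [0, 2, 3, 4, 5], n=2))        # 2/3 of generated bigrams
```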
Task-Specific UC Metrics:
- TOFU Task:
  - Unlearning Accuracy (UA): Represented as $1 - \mathrm{FA}$ (Jia et al., 2024), where FA measures the model's accuracy on the forget set (e.g., answering questions about forgotten authors). A higher UA indicates better unlearning completeness.
  - Membership Inference Attack (MIA): Evaluates the Area Under the ROC Curve (AUC) using the Min-k% Prob method (Shi et al., 2023) to detect whether a sample was part of the training set. Here, a higher MIA score indicates that the model's confidence in generating forget set samples has been reduced, suggesting more thorough unlearning.
  - Rouge-L Recall (RR): Calculated as $1 - \text{Rouge-L}$. Rouge-L measures the longest common subsequence between a generated text and a reference text, computed over the forget set. A higher RR (i.e., a lower Rouge-L score, indicating less overlap with the original forbidden content) signifies better unlearning performance.
  - Concept Relearning Score (Relearn): Defined as $1 - \text{saliency score}$ (Lo et al., 2024). The saliency score measures how strongly forgotten concepts re-emerge in the model after retraining or fine-tuning. A higher Relearn value (i.e., a lower saliency score) indicates better unlearning completeness and lower susceptibility to relearning the forgotten concept.
- WMDP Task: UC is evaluated using UA on the WMDP-Bio (biosafety knowledge) and WMDP-Cyber (cybersecurity knowledge) subsets.
- WHP Task: UC is determined using Rouge-L on 300-token completions generated by the model from Harry Potter-related instructions. A lower Rouge-L (thus a higher RR) indicates successful unlearning.
- SAFE Task: UC is assessed using Toxic-BERT (Hanu & Unitary team, 2020) scores on toxic prompts from the SAFE test set. A lower Toxic-BERT score (indicating less toxicity in responses) signifies better unlearning completeness. Toxic-BERT is a BERT-based model trained to classify text toxicity.
5.2.2. Model Utility (UT) Metrics
Model Utility (UT) evaluates the impact of unlearning on the model's performance on unrelated tasks or the retain set. A higher value generally indicates better utility retention.
- TOFU Task:
  - Retain Set Accuracy (Acc) and Rouge-L Recall (RR): Measure the model's performance on the data it was meant to retain. Higher values are better.
  - Real Author Acc and RR: Assess the model's ability to retain knowledge about authors not in the forget set.
  - World Fact Acc and RR: Evaluate general factual knowledge.
- WMDP Task: UT is measured by zero-shot accuracy on the MMLU (Massive Multitask Language Understanding) dataset (Hendrycks et al., 2020). MMLU is a broad set of challenging multiple-choice questions across 57 subjects, assessing general knowledge and reasoning abilities.
- WHP and SAFE Tasks (General Utility Metrics):
  - Perplexity (PPL): Measures how well a probability model predicts a sample. A lower PPL indicates that the model is better at predicting the next word in a sequence, implying better language modeling capabilities. It is often evaluated on general text corpora like Wikitext (Merity et al., 2016). (A one-line computation of PPL from the NLL appears after this list.)
  - Zero-shot Accuracy: Averaged across multiple tasks using the Language Model Evaluation Harness (Gao et al., 2021). The tasks include: BoolQ (Clark et al., 2019), boolean question answering; RTE (Dagan et al., 2005), recognizing textual entailment; HellaSwag (Zellers et al., 2019), commonsense NLI for sentence completion; Winogrande (Sakaguchi et al., 2021), commonsense reasoning with Winograd schemas; ARC-Challenge/ARC-Easy (Chollet, 2019), the AI2 Reasoning Challenge assessing scientific reasoning; OpenBookQA (Mihaylov et al., 2018), open-book question answering; and Piqa (Bisk et al., 2020), physical commonsense reasoning. A higher average accuracy across these diverse tasks indicates better retention of general capabilities.
  - TruthfulQA: Measures how truthful a model's answers are to questions that some LMs might answer falsely due to memorized misconceptions. A higher score indicates more truthful and less misleading responses.
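Since PPL is simply the exponentiated average negative log-likelihood per token, it can be computed directly from the per-token log-probabilities used throughout this paper; a minimal sketch (the log-probabilities below are illustrative values):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

print(perplexity([-2.1, -0.3, -1.7, -0.9]))  # ~3.49
```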
5.3. Baselines
The effectiveness of the MRD metric and the MRD-enhanced methods is assessed against mainstream unlearning baselines. For each baseline, an MRD-weighted sampling strategy is applied to create an MRD-enhanced method.

- Gradient-based Methods:
  - GA (Gradient Ascent) (Jang et al., 2023): A foundational unlearning method that performs gradient ascent on the forget set loss to actively "un-memorize" the samples.
  - GradDiff (Yao et al., 2024): A more advanced gradient-based method that aims to improve unlearning completeness while mitigating utility loss, possibly through regularization or refined gradient updates.
  - SGA (Stochastic Gradient Ascent) and CGA (Curriculum Gradient Ascent) are used as specific instantiations of gradient-based unlearning in the context of the TOFU task.
- Preference Optimization Methods:
  - PO (Preference Optimization) (Maini et al., 2024): These methods frame unlearning as a preference alignment task, treating forgotten data as negative examples and optimizing for rejection-based answers. The paper mentions using rejection-based answers for the forget set in PO.
  - NPO (Negative Preference Optimization) (Zhang et al., 2024): An improvement over PO that aims to address issues like catastrophic collapse and improve unlearning effectiveness.

For each baseline (GA, GradDiff, PO, NPO), an MRD-enhanced version is created (e.g., GradDiff + MRD) by incorporating the MRD-based weighted sampling strategy into its unlearning sequence. Comparisons are made against these original baselines, with results averaged over five independent trials to ensure robustness.
5.4. Training Setup
The specific configurations for training and unlearning experiments are as follows:
- Optimizer: AdamW (Loshchilov, 2017) is the default optimization algorithm.
- Learning Rate: 5e-5.
- Perturbation Intensity (σ): For calculating MRD, the standard deviation of the Gaussian perturbation is set to 1e-5.
- Monte Carlo Sampling Iterations (K): The number of Monte Carlo sampling iterations for the MRD calculation is set to 200 (a Monte Carlo sketch of the MRD computation follows this list).
- Update Interval (m): For Algorithm 2 (CGA), the MRD recalculation interval is set to m = 2, the value found optimal in the ablation of Section 6.2.2.
- Epochs/Steps:
  - TOFU Task: Both PO and GradDiff are run for 5 epochs; NPO is run for 4 epochs.
  - WMDP Task: The maximum number of training steps for NPO and GradDiff is set to 500.
  - WHP and SAFE Tasks: 5 epochs are conducted for these tasks.
- Hardware: All experiments are conducted on two NVIDIA A800 GPUs.
- PO Rejection Answers: For the PO method, rejection-based answers are used as target responses for the forget-set samples. Examples from Table 3 include:
  - TOFU: "I'm not informed about that subject," "I don't have the details on that issue."
  - WHP/PKU-Safe: "I apologize, but I'm legally restricted from fulfilling this request," "I'm sorry, but my ability to generate content is limited by copyright laws."
- Early Stopping Condition (Appendix E.4): A sample is considered successfully forgotten when its Extraction Likelihood (EL) and Memorization Accuracy (MA) values on the current model fall below the average EL and MA values of all samples on the initial model. This provides a consistent and objective termination condition for unlearning.
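To make the σ and K settings above concrete, here is a minimal PyTorch sketch of a Monte Carlo MRD estimate. It approximates MRD as the expected relative change in the sample's sequence NLL under Gaussian parameter perturbation; the paper defines MRD on the generation probability itself, so this normalization is an assumption, not the authors' implementation.

```python
import torch

@torch.no_grad()
def estimate_mrd(model, batch, sigma=1e-5, k=200):
    """Monte Carlo MRD estimate for one sample (sketch).

    `batch` is a tokenized dict with `labels`, so `.loss` is the mean
    token-level NLL, i.e. -log p(sample) / length. MRD is approximated
    as the expected relative change of this NLL under Gaussian
    perturbations of the parameters (sigma and k follow the setup above).
    """
    base = model(**batch).loss.item()
    originals = [p.detach().clone() for p in model.parameters()]

    deltas = []
    for _ in range(k):
        for p, p0 in zip(model.parameters(), originals):
            p.copy_(p0 + sigma * torch.randn_like(p0))  # theta + epsilon
        deltas.append(abs(model(**batch).loss.item() - base) / abs(base))

    for p, p0 in zip(model.parameters(), originals):    # restore theta
        p.copy_(p0)
    return sum(deltas) / k
```

Note that this naive version keeps a full parameter copy and runs K forward passes per sample; since only inference is involved, batching many samples together amortizes this cost, consistent with the efficiency analysis in Section 6.1.5.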
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Differences in Unlearning Difficulty
The paper first validates the premise that unlearning difficulty varies across samples. To quantify this, 40 samples are randomly selected (with replacement) from the TOFU unlearning set and concatenated into 300 composite samples. Unlearning is performed using existing baselines with early stopping, and the average absolute value of parameter changes after unlearning is computed.
The following figure (Figure 4 from the original paper) shows the comparison of unlearning difficulty across different sample sets in GA, GradDiff, and NPO:

Analysis:
Figure 4 demonstrates significant variability in parameter changes across different samples when using GA, GradDiff, and NPO algorithms. Each point represents a composite sample set, and the distance from the origin denotes the average absolute value of parameter changes. The scattered distribution indicates that different sample sets require vastly different magnitudes of parameter adjustments to be unlearned. This confirms that unlearning difficulty is not uniform across samples, and the choice of samples indeed has a substantial influence on unlearning performance. This observation directly supports the paper's core motivation.
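As a minimal sketch of the difficulty proxy used here, assuming two model snapshots taken before and after unlearning:

```python
import torch

@torch.no_grad()
def mean_abs_param_change(model_before, model_after):
    """Average absolute parameter change between two model snapshots,
    the per-sample-set difficulty proxy plotted in Figure 4 (sketch)."""
    total, count = 0.0, 0
    for p0, p1 in zip(model_before.parameters(), model_after.parameters()):
        total += (p1 - p0).abs().sum().item()
        count += p0.numel()
    return total / count
```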
6.1.2. Effectiveness of MRD
To validate the MRD metric, experiments are conducted on TOFU and WMDP tasks. For each task, 10 random samples are chosen, and various LLM unlearning baselines are applied. With identical hyperparameter settings, parameter update magnitudes, and early stopping conditions, the number of updates required for unlearning each sample is compared against its MRD value.
The following figure (Figure 5 from the original paper) shows the relationship between the MRD value and the number of unlearning updates (i.e., unlearning difficulty):

Analysis:
Figure 5 shows a consistent relationship between the MRD value and the number of unlearning updates required, which serves as a proxy for unlearning difficulty.
- For both TOFU and WMDP datasets, and across different unlearning algorithms (GradDiff, GA), a clear trend is observable: as the MRD value increases, the number of updates required for unlearning tends to decrease.
- This indicates that samples with higher MRD values are easier to unlearn (requiring fewer updates), while those with lower MRD values are harder to unlearn (requiring more updates). This aligns exactly with the definition of MRD (smaller MRD = higher difficulty).
- Furthermore, the ranking of update counts across different methods for the same sample generally remains consistent. This suggests that variability in unlearning behavior is an intrinsic property of the samples themselves rather than an artifact of any particular algorithm. Together, these observations provide strong empirical evidence for the effectiveness and reliability of MRD in quantifying sample-level unlearning difficulty (a simple rank-correlation check of this relationship is sketched below).
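A quick way to check this monotone relationship is a rank correlation between MRD and the measured update counts; the values below are illustrative placeholders, not the paper's data.

```python
from scipy.stats import spearmanr

# Illustrative placeholder values mirroring the Figure 5 protocol:
# per-sample MRD vs. updates needed to reach the early-stopping condition.
mrd_values    = [0.25, 0.31, 0.34, 0.39, 0.43, 0.49, 0.63, 0.72, 0.77, 1.00]
update_counts = [310, 285, 260, 240, 220, 190, 150, 120, 110, 80]

rho, pval = spearmanr(mrd_values, update_counts)
print(f"Spearman rho = {rho:.3f} (p = {pval:.3g})")  # strongly negative here
```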
6.1.3. Characteristics Influencing MRD
To further interpret MRD and its implications, experiments are conducted on the TOFU task by categorizing samples based on four criteria: semantic complexity, occurrence frequency, initial generation probability, and presence of rare words. 40 samples from each category are randomly selected, and their MRD values are computed.
The following are the results from Table 11 of the original paper:
| Attribute | Level | Example from categorized set | MRD |
| Common Sentence | | Q: Is Farid Benoit currently writing any other books? A: It is reported that Farid Benoit is currently working on his sixth erotica novel, but the title has not been disclosed yet. | 0.4957 |
| Common Sentence | | Q: What is another well-known work by Albert Sidney Lane in the fantasy genre? A: "Beneath the Emerald Veil" is another well-known work by Albert Sidney Lane in the fantasy genre. | 0.4322 |
| Semantic Complexity | Low | Q: What career did Li Mei Yu's mother have? A: Her mother was a nurse. | 0.3085 |
| Semantic Complexity | High | Q: How have Leila Al-Sabah's books contributed to LGBTQ+ representation in literary fiction? A: Through her richly drawn characters and storylines, Leila Al-Sabah has helped to normalize LGBTQ+ experiences in literary fiction. Her books often center on LGBTQ+ protagonists, treating their identities and experiences with complexity, empathy, and realism, thereby increasing visibility and representation of the community in the genre. | 1.0026 |
| Occurrence Frequency | Low | Q: Is Zo Hassani Raharizafy involved in any form of philanthropy? A: Yes, he established the Raharizafy Literary Foundation, which works to improve literacy rates in Madagascar, his home country. | 0.6374 |
| Occurrence Frequency | High | Q: Where was Samir Khoury born? A: Samir Khoury was born in Amman, Jordan. | 0.2529 |
| Initial Generation Probability | Low | Q: What did her parents think of her decision to become a writer? A: Evangeline's parents were initially skeptical about her decision. However, after reading her first novel and witnessing her dedication to the craft, they stood by her decision and have been her constant pillars of support. | 0.3481 |
| Initial Generation Probability | High | Q: What genre does Xin Lee Williams often write in, based on their most famous work, "The Town That Drowned"? A: Xin Lee Williams is recognized for their contributions to Canadian literature, as seen from their trademark work, "The Town That Drowned." | 0.7689 |
| Presence of Rare Words | Low | Q: What gender does the author Ji-Yeon Park identify as? A: The author Ji-Yeon Park identifies as female. | 0.3929 |
| Presence of Rare Words | High | Q: When did Samin Nosrat receive the "Prix Goncourt de Littérature Historique" and for which book? A: Samin Nosrat received the "Prix Goncourt de Littérature Historique" for her vibrant piece "The Seed," which she received in 2011. | 0.7188 |
Analysis:
Table 11 validates the theoretical analysis from Section 3.3 regarding the characteristics influencing MRD.
- Semantic Complexity: Low-complexity samples (e.g., "Her mother was a nurse.") have low MRD (0.3085), indicating they are harder to unlearn. High-complexity samples (e.g., the detailed description of Leila Al-Sabah's contribution to LGBTQ+ representation) have high MRD (1.0026), suggesting they are easier to unlearn. This aligns with the idea that models are more stable on simple, common patterns.
- Occurrence Frequency: High-frequency samples (e.g., "Samir Khoury was born in Amman, Jordan.") have low MRD (0.2529), making them harder to unlearn. Low-frequency samples (e.g., the Raharizafy Literary Foundation in Madagascar) have high MRD (0.6374), making them easier to unlearn. This confirms that frequently seen information is more deeply ingrained.
- Initial Generation Probability: The table lists the Low-probability example (Evangeline's parents' skepticism) with MRD 0.3481 and the High-probability example (Xin Lee Williams) with MRD 0.7689. This is the opposite of the paper's explanation in Section 3.3, which states that a high generation probability yields a small MRD (greater resistance to unlearning). The example assignment in the table therefore appears to be swapped or mislabeled; following the paper's own textual explanation, high initial generation probability should correspond to low MRD (harder to unlearn).
- Presence of Rare Words:
  - Low rare-word presence (e.g., "The author Ji-Yeon Park identifies as female.") has low MRD (0.3929), indicating harder to unlearn.
  - High rare-word presence (e.g., "Samin Nosrat received the 'Prix Goncourt de Littérature Historique'...") has high MRD (0.7188), suggesting easier to unlearn. This confirms that less common vocabulary makes information more volatile.

Overall, the experimental results largely align with the theoretical predictions, demonstrating that MRD effectively captures the influence of these sample characteristics on unlearning difficulty.
6.1.4. Effectiveness and Efficiency of MRD-based Weighted Sampling
The MRD-based weighted sampling method (MRD-enhanced method) is evaluated across four mainstream LLM unlearning tasks against baseline methods.
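Before the results, here is a minimal sketch of what MRD-based weighted sampling could look like; the softmax weighting and temperature are illustrative assumptions, since the paper specifies only that easily forgettable (high-MRD) samples are prioritized.

```python
import torch

def mrd_weighted_order(mrd_values, temperature=1.0):
    """Draw an unlearning order in which high-MRD (easier) samples tend
    to come first. The softmax weighting and temperature are assumed
    instantiations of 'weighted sampling', not the paper's exact rule."""
    mrd = torch.tensor(mrd_values, dtype=torch.float)
    probs = torch.softmax(mrd / temperature, dim=0)
    return torch.multinomial(probs, len(mrd_values), replacement=False).tolist()

# Example: with MRDs [0.25, 0.72, 0.43, 1.00], indices 3 and 1 are
# likely to be scheduled early.
order = mrd_weighted_order([0.25, 0.72, 0.43, 1.00])
```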
Results on TOFU Task: The following are the results from Table 1 of the original paper:
| Method | Unlearning Completeness (UC) | Model Utility (UT) | ||||||||||
| UA (↑) | MIA (↑) | RR (↑) | Relearn (↑) | Avg. (↑) | Retain Set Acc. (↑) | RR (↑) | Real Author Acc. (↑) | RR (↑) | World Fact Acc. (↑) | RR (↑) | Avg. (↑) |
| Original | 0.1475 | 0.4515 | 0.0204 | 1.0000 | 0.4049 | 0.8575 | 0.9825 | 0.8900 | 0.9330 | 0.8632 | 0.8960 | 0.9037 |
| SGA | 0.3725 | 0.4490 | 0.5722 | 0.7375 | 0.5328 | 0.6125 | 0.4212 | 0.3500 | 0.3908 | 0.7094 | 0.7841 | 0.5447 |
| CGA | 0.3825 | 0.4594 | 0.5781 | 0.7625 | 0.5456 | 0.6575 | 0.4296 | 0.5100 | 0.5375 | 0.7436 | 0.7984 | 0.6128 |
| GradDiff | 0.8475 | 0.9977 | 0.9950 | 0.3575 | 0.7994 | 0.7253 | 0.5131 | 0.7100 | 0.7473 | 0.8120 | 0.8547 | 0.7271 |
| GradDiff + MRD | 0.8425 | 0.9997 | 0.9984 | 0.5350 | 0.8439 | 0.7350 | 0.5253 | 0.7300 | 0.7321 | 0.8205 | 0.8561 | 0.7332 |
| PO | 0.7275 | 0.6478 | 0.9314 | 0.5950 | 0.7254 | 0.6114 | 0.4190 | 0.6100 | 0.6988 | 0.7350 | 0.7862 | 0.6434 |
| PO + MRD | 0.7575 | 0.6512 | 0.9773 | 0.7800 | 0.7915 | 0.6250 | 0.4216 | 0.6400 | 0.6963 | 0.7436 | 0.7792 | 0.6510 |
| NPO | 0.8350 | 0.9913 | 0.9821 | 0.4825 | 0.8227 | 0.7433 | 0.5356 | 0.8300 | 0.8291 | 0.8262 | 0.8746 | 0.7731 |
| NPO + MRD | 0.8525 | 0.9992 | 0.9854 | 0.4750 | 0.8280 | 0.7775 | 0.5506 | 0.8900 | 0.8547 | 0.8462 | 0.8832 | 0.8004 |
Analysis of TOFU Results (Table 1):
- Unlearning Completeness (UC): The MRD-enhanced methods consistently improve the UC metrics, particularly MIA, RR, and Relearn, relative to their original baselines. For instance, PO + MRD improves Avg. UC from 0.7254 to 0.7915, GradDiff + MRD from 0.7994 to 0.8439, and NPO + MRD from 0.8227 to 0.8280. This confirms that prioritizing easier-to-unlearn samples enhances the model's ability to forget.
- Model Utility (UT): The MRD-enhanced methods also generally improve model utility. For example, NPO + MRD increases Avg. UT from 0.7731 to 0.8004, and PO + MRD from 0.6434 to 0.6510. This indicates that the curriculum-based sampling strategy not only helps unlearning but also better preserves the model's capabilities on retained knowledge.
- SGA vs. CGA: CGA (Curriculum Gradient Ascent) significantly outperforms SGA (Stochastic Gradient Ascent) on both UC and UT metrics, underscoring the benefit of MRD-based weighted sampling even for basic gradient methods.

The consistent improvements across baselines demonstrate the general applicability and effectiveness of the MRD-based weighted sampling method in optimizing LLM unlearning algorithms.
The following are the results from Table 4 of the original paper, showing metrics change during the unlearning process for SGA, CGA, NPO, and NPO+MRD on TOFU:
| Method | Unlearning Completeness (UC) | Model Utility (UT) | ||||||||||
| UA (↑) | MIA (↑) | RR(↑) | Relearn (↑) | Avg. (↑) | Retain Set Acc. (↑) | RR(↑) | Real Author Acc. (↑) | RR(↑) | World Fact Acc. (↑) | RR (↑) | Avg. (↑) | |
| Original | 0.1475 | 0.4515 | 0.0204 | 1.0000 | 0.4049 | 0.8575 | 0.9825 | 0.8900 | 0.9330 | 0.8632 | 0.8960 | 0.9037 |
| SGA-epoch1 | 0.2025 | 0.4472 | 0.2421 | 0.9675 | 0.4648 | 0.7825 | 0.7514 | 0.7400 | 0.7362 | 0.8034 | 0.8471 | 0.7768 |
| SGA-epoch2 | 0.2750 | 0.4464 | 0.3892 | 0.8800 | 0.4977 | 0.7231 | 0.6353 | 0.6200 | 0.6261 | 0.7606 | 0.8062 | 0.6952 |
| SGA-epoch3 | 0.3200 | 0.4483 | 0.4933 | 0.8150 | 0.5217 | 0.6428 | 0.5277 | 0.4800 | 0.5109 | 0.7179 | 0.7983 | 0.6129 |
| SGA-epoch4 | 0.3725 | 0.4490 | 0.5722 | 0.7375 | 0.5328 | 0.6125 | 0.4212 | 0.3500 | 0.3908 | 0.7094 | 0.7841 | 0.5447 |
| CGA-epoch1 | 0.2475 | 0.4588 | 0.2922 | 0.9425 | 0.4852 | 0.8272 | 0.7614 | 0.7200 | 0.7552 | 0.8376 | 0.8518 | 0.7922 |
| CGA-epoch2 | 0.3075 | 0.4597 | 0.4272 | 0.8700 | 0.5161 | 0.7672 | 0.6526 | 0.6200 | 0.6817 | 0.8034 | 0.8337 | 0.7264 |
| CGA-epoch3 | 0.3450 | 0.4592 | 0.5094 | 0.8075 | 0.5302 | 0.6703 | 0.5328 | 0.5500 | 0.5691 | 0.7606 | 0.8138 | 0.6494 |
| CGA-epoch4 | 0.3825 | 0.4594 | 0.5781 | 0.7625 | 0.5456 | 0.6575 | 0.4296 | 0.5100 | 0.5375 | 0.7436 | 0.7984 | 0.6128 |
| NPO-epoch1 | 0.3375 | 0.8027 | 0.3417 | 0.8225 | 0.5761 | 0.8253 | 0.9015 | 0.8800 | 0.9018 | 0.8462 | 0.8901 | 0.8742 |
| NPO-epoch2 | 0.5650 | 0.9381 | 0.5293 | 0.6825 | 0.6787 | 0.7786 | 0.7803 | 0.8600 | 0.8725 | 0.8376 | 0.8886 | 0.8363 |
| NPO-epoch3 | 0.7125 | 0.9839 | 0.8172 | 0.5425 | 0.7640 | 0.7567 | 0.6519 | 0.8400 | 0.8493 | 0.8290 | 0.8823 | 0.8015 |
| NPO-epoch4 | 0.8350 | 0.9913 | 0.9821 | 0.4825 | 0.8228 | 0.7433 | 0.5356 | 0.8300 | 0.8291 | 0.8262 | 0.8746 | 0.7731 |
| NPO+MRD-epoch1 | 0.3550 | 0.8162 | 0.3715 | 0.8175 | 0.5901 | 0.8367 | 0.9053 | 0.8900 | 0.8937 | 0.8547 | 0.8912 | 0.8786 |
| NPO+MRD-epoch2 | 0.5875 | 0.9481 | 0.5781 | 0.7050 | 0.7047 | 0.7844 | 0.7794 | 0.8800 | 0.8738 | 0.8462 | 0.8885 | 0.8421 |
| NPO+MRD-epoch3 | 0.7425 | 0.9846 | 0.8462 | 0.5325 | 0.7765 | 0.7678 | 0.6781 | 0.8800 | 0.8637 | 0.8462 | 0.8867 | 0.8204 |
| NPO+MRD-epoch4 | 0.8525 | 0.9992 | 0.9854 | 0.4750 | 0.8280 | 0.7775 | 0.5506 | 0.8900 | 0.8547 | 0.8462 | 0.8832 | 0.8004 |
Analysis of TOFU Epoch Changes (Table 4):
Table 4 shows the progression of UC and UT metrics over multiple epochs for SGA, CGA, NPO, and NPO + MRD.
- Faster Convergence: CGA consistently achieves higher UC (e.g., Avg. UC at epoch 1 is 0.4852 for CGA vs. 0.4648 for SGA) and UT (e.g., Avg. UT at epoch 1 is 0.7922 for CGA vs. 0.7768 for SGA) than SGA at earlier epochs, indicating faster convergence toward the unlearning goal.
- Improved Performance: Similarly, NPO + MRD generally shows better UC and UT at equivalent epochs than NPO. For instance, at epoch 2, NPO + MRD reaches an Avg. UC of 0.7047 and Avg. UT of 0.8421, while NPO reaches 0.6787 and 0.8363, respectively. These results confirm that the MRD-enhanced method achieves better unlearning outcomes more quickly or with fewer iterations, validating its role in optimizing unlearning algorithms.
Results on WMDP Task: The following are the results from Table 5 of the original paper:
| Method | Unlearning Completeness (UC) | Model Utility (UT) [MMLU] | | | | | | | |
| Cybersecurity (↓) | Chemical (↓) | Biosafety (↓) | Avg. (↓) | Humanities (↑) | Sciences (↑) | Stem (↑) | Other (↑) | Avg. (↑) | |
| SGA | 0.2430 | 0.2622 | 0.2474 | 0.2467 | 0.2451 | 0.2343 | 0.2388 | 0.2687 | 0.2465 |
| GradDiff | 0.3834 | 0.4460 | 0.6402 | 0.4795 | 0.5028 | 0.6597 | 0.4716 | 0.6343 | 0.5593 |
| NPO | 0.3497 | 0.4656 | 0.6268 | 0.4588 | 0.5292 | 0.6844 | 0.4865 | 0.6569 | 0.5818 |
| CGA | 0.2356 | 0.2547 | 0.2404 | 0.2459 | 0.2417 | 0.3107 | 0.2861 | 0.2514 | 0.2689 |
| GradDiff + MRD | 0.3719 | 0.4387 | 0.6315 | 0.4694 | 0.5132 | 0.6607 | 0.4782 | 0.6392 | 0.5655 |
| NPO + MRD | 0.2773 | 0.4705 | 0.6394 | 0.4244 | 0.5326 | 0.6972 | 0.4906 | 0.6591 | 0.55895 |
Analysis of WMDP Results (Table 5):
- For GradDiff, GradDiff + MRD slightly improves Avg. UC (from 0.4795 to 0.4694; lower is better for UC on WMDP, since these columns measure residual accuracy on hazardous-knowledge questions) and Avg. UT (from 0.5593 to 0.5655).
- For NPO, NPO + MRD shows a significant improvement in Avg. UC (from 0.4588 to 0.4244, lower is better) but a slight decrease in Avg. UT (from 0.5818 to 0.55895). This may indicate a trade-off, or that MRD primarily boosts UC on this task.
- CGA (the MRD-enhanced counterpart of SGA) shows UC comparable to SGA (0.2459 vs. 0.2467) but improved UT (0.2689 vs. 0.2465).
Results on WHP Task: The following are the results from Table 6 of the original paper:
| Method | Unlearning Completeness (UC) | Model Utility (UT) | |||
| Seen Rouge-L (↓) | Unseen Rouge-L (↓) | PPL (↓) | Zero-shot Acc. (↑) | TruthfulQA (↑) | |
| GradDiff | 0.0122 | 0.0132 | 12.46 | 0.6201 | 0.2827 |
| PO | 0.0272 | 0.0292 | 11.88 | 0.6192 | 0.2962 |
| NPO | 0.0121 | 0.0134 | 12.91 | 0.6122 | 0.3023 |
| GradDiff + MRD | 0.0116 | 0.0133 | 12.90 | 0.6191 | 0.2839 |
| PO + MRD | 0.0268 | 0.0291 | 11.76 | 0.6170 | 0.2949 |
| NPO + MRD | 0.0106 | 0.0105 | 12.30 | 0.6205 | 0.3113 |
Analysis of WHP Results (Table 6):
- For WHP (copyright removal), the MRD-enhanced methods generally achieve better UC (lower Seen and Unseen Rouge-L; e.g., NPO + MRD reaches 0.0106 vs. NPO's 0.0121 on Seen Rouge-L) with comparable or slightly improved UT. NPO + MRD attains the best UC overall and an improved TruthfulQA score.
Results on SAFE Task: The following are the results from Table 7 of the original paper:
| Method | Unlearning Completeness (UC) | Model Utility (UT) | |||
| Real Toxicity Prompts Toxic score (↓) | SAFE Toxic score (↓) | PPL (↓) | Zero-shot Acc. (↑) | TruthfulQA (↑) | |
| GradDiff | 0.0268 | 0.0353 | 11.99 | 0.6251 | 0.3011 |
| PO | 0.0308 | 0.0275 | 12.67 | 0.6028 | 0.2386 |
| NPO | 0.0248 | 0.0333 | 11.95 | 0.6270 | 0.3059 |
| GradDiff + MRD | 0.0246 | 0.0353 | 11.71 | 0.6266 | 0.3047 |
| PO + MRD | 0.0252 | 0.0336 | 12.78 | 0.6154 | 0.2766 |
| NPO + MRD | 0.0210 | 0.0332 | 12.82 | 0.6331 | 0.3247 |
Analysis of SAFE Results (Table 7):
- For SAFE (unlearning toxic responses), NPO + MRD achieves the best UC (lowest Real Toxicity Prompts toxic score, 0.0210 vs. NPO's 0.0248) together with the best Zero-shot Acc. and TruthfulQA. GradDiff + MRD also shows slight improvements in UC.
Overall Conclusion on Effectiveness and Efficiency:
The experiments across all four tasks consistently show that integrating MRD-based weighted sampling significantly improves both unlearning completeness and model utility for various baseline unlearning algorithms. These improvements, coupled with faster convergence (as shown in Table 4), validate the hypothesis that dynamically adjusting the unlearning sequence based on MRD values is an effective strategy.
6.1.5. Efficiency Analysis of MRD-based Weighted Sampling
The paper also analyzes the computational overhead of MRD calculation and the overall efficiency gains.
The following are the results from Table 8 of the original paper, showing the connection between time cost of MRD computation and batch size:
| Batch Size | Time |
| 8 | 3m30s |
| 16 | 2m32s |
| 32 | 2m07s |
| 64 | 1m55s |
| 128 | 1m23s |
Analysis of MRD Computation Time (Table 8):
Table 8 shows that MRD computation time decreases as the batch size increases. This is because MRD calculation primarily involves parallel inference operations, which benefit significantly from larger batch sizes by utilizing GPU resources more efficiently. This indicates that the overhead can be managed by optimizing batching.
The following are the results from Table 9 of the original paper, showing the comparison of the time required to execute one round of the algorithm:
| Method | Time |
| GA | 1m40s |
| Graddiff | 2m03s |
| NPO | 2m08s |
| PO | 2m23s |
Analysis of One-Round Execution Time (Table 9):
Table 9 provides the time taken for one round (presumably one epoch or a fixed number of steps) of execution for various unlearning algorithms. Comparing this with Table 8, when the batch size for MRD computation exceeds 64 (e.g., at 128 batch size, 1m23s), MRD computation can actually become more efficient than a single round of some unlearning algorithms. This is significant because MRD only requires inference, while unlearning algorithms involve computationally intensive backward passes and parameter updates, limiting their maximum batch size due to GPU memory.
The following are the results from Table 10 of the original paper, showing the comparison of method times with and without MRD:
| Method | Time | Method | Time |
| GA | 8m20s | GA+MRD | 6m23s |
| Graddiff | 14m21s | Graddiff+MRD | 11m38s |
| PO | 17m4s | PO+MRD | 14m11s |
| NPO | 16m0s | NPO+MRD | 12m3s |
Analysis of Total Execution Time (Table 10):
Table 10 provides the total end-to-end execution time for the original and MRD-enhanced unlearning algorithms. Crucially, even with the overhead of MRD calculation (performed periodically), the MRD-improved methods (GA+MRD, Graddiff+MRD, PO+MRD, NPO+MRD) consistently show reduced total execution times compared to their original counterparts. This is because MRD-based weighted sampling accelerates convergence, so fewer total unlearning epochs or steps are needed to reach the unlearning goal. The result is a significantly lower overall runtime, demonstrating that MRD improves not only effectiveness but also overall efficiency.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Parameter Sensitivity
Experiments are conducted on the TOFU task to evaluate the impact of the perturbation parameter σ and the number of Monte Carlo samples K on the MRD calculation.
The following figure (Figure 6 from the original paper) shows the parameter sensitivity of MRD:

Analysis of Perturbation Parameter (Figure 6a):
- Figure 6a shows the effect of σ on MRD values, keeping K = 200 fixed.
- As σ increases, the MRD value fluctuates around 0.64.
- This suggests that the MRD calculation is not highly sensitive to the specific choice of σ within the tested range. For computational simplicity, σ = 1e-5 was chosen in the paper.
Analysis of Monte Carlo Samples (Figure 6b):
- Figure 6b illustrates the variation of MRD values as K increases, with σ fixed.
- When K is small, the MRD estimate fluctuates significantly, indicating high variance in the Monte Carlo approximation.
- As K reaches around 50, the MRD estimate gradually stabilizes.
- Optimal stability and performance are achieved once K is sufficiently large; the default of K = 200 used in the main experiments lies well inside this stable regime (a sketch of such a stability sweep follows).
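A sketch of such a stability sweep, reusing the estimate_mrd sketch from Section 5.4 and assuming a loaded `model` and tokenized `batch` are already in scope:

```python
# Stability sweep over the Monte Carlo budget K (mirrors Figure 6b).
# Assumes estimate_mrd (sketch above), `model`, and `batch` are in scope.
for k in (10, 25, 50, 100, 200):
    runs = [estimate_mrd(model, batch, sigma=1e-5, k=k) for _ in range(3)]
    mean = sum(runs) / len(runs)
    spread = max(runs) - min(runs)   # run-to-run variation of the estimate
    print(f"K={k:>3}: mean MRD={mean:.4f}, spread={spread:.4f}")
```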
Additionally, the paper notes (Appendix F.4) that different values of K were tested on Qwen3 models of varying sizes.
The following are the results from Table 12 of the original paper, showing stable sample counts K across Qwen3 models:
| Model | Counts |
| Qwen3 4B | 60 |
| Qwen3 8B | 80 |
| Qwen3 14B | 50 |
| Qwen3 32B | 80 |
Analysis of K across Different Models (Table 12):
Table 12 indicates that the number of Monte Carlo samples K needed for a stable estimate varies only slightly across Qwen3 model sizes (60 for 4B, 80 for 8B, 50 for 14B, 80 for 32B). The crucial finding is that model size has minimal impact on the required number of sampling iterations, which stays within a similar range throughout, demonstrating the scalability of the MRD method with respect to K.
6.2.2. Ablation Study of Update Interval
An ablation study is conducted on the update interval m in Algorithm 2 to determine how frequently MRD values should be recomputed.
The following are the results from Table 13 of the original paper, showing an ablation study of m:
| Method | Unlearning Completeness (UC) | Model Utility (UT) | ||||||||||
| UA (↑) | MIA (↑) | RR (↑) | Relearn (↑) | Avg. (↑) | Retain Set Acc. (↑) | RR(↑) | Real Author Acc. (↑) | RR(↑) | World Fact Acc. (↑) | RR(↑) | Avg. (↑) | |
| Original | 0.1475 | 0.4515 | 0.0204 | 1.0000 | 0.4049 | 0.8575 | 0.9825 | 0.8900 | 0.9330 | 0.8632 | 0.8960 | 0.9037 |
| PO + MRD - m=1 | 0.7525 | 0.6472 | 0.9714 | 0.7825 | 0.7884 | 0.6228 | 0.4187 | 0.6200 | 0.6864 | 0.7436 | 0.7778 | 0.6449 |
| PO + MRD - m=2 | 0.7575 | 0.6512 | 0.9773 | 0.7800 | 0.7953 | 0.6250 | 0.4216 | 0.6300 | 0.6963 | 0.7350 | 0.7792 | 0.6478 |
| PO + MRD - m=3 | 0.7500 | 0.6451 | 0.9681 | 0.7850 | 0.7871 | 0.6267 | 0.4245 | 0.6300 | 0.6924 | 0.7350 | 0.7752 | 0.6473 |
Analysis of Update Interval (Table 13):
- Table 13 shows the unlearning performance of PO + MRD for different values of m (the interval at which MRD is recomputed).
- When m = 1 (recomputing MRD every iteration), Avg. UC is 0.7884 and Avg. UT is 0.6449.
- When m = 2 (recomputing MRD every two iterations), Avg. UC is 0.7953 and Avg. UT is 0.6478, the best performance among the tested values.
- When m = 3, UC drops slightly (0.7871) while UT is comparable (0.6473).
- This suggests that recomputing MRD too frequently (m = 1) or too infrequently (m = 3) is suboptimal. An interval of m = 2 strikes the best balance, letting the curriculum adapt to changes in sample difficulty without excessive computational overhead. A sketch of this periodic re-computation loop follows.
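To show how the update interval enters the outer loop, here is a minimal sketch; estimate_mrd is the sketch from Section 5.4, and unlearn_step stands in for one baseline update (both placeholders, not the paper's code).

```python
def curriculum_unlearn(model, forget_set, epochs=5, m=2):
    """Outer loop of an MRD-curriculum unlearning run (sketch).

    Every m epochs the MRD values are recomputed and the forget set is
    reordered so that easier samples (higher MRD) are unlearned first.
    `estimate_mrd` is the sketch from Section 5.4; `unlearn_step` stands
    in for one baseline update (GA/GradDiff/PO/NPO) on a single sample.
    """
    order = list(range(len(forget_set)))
    for epoch in range(epochs):
        if epoch % m == 0:  # refresh the curriculum every m epochs
            mrd = [estimate_mrd(model, sample) for sample in forget_set]
            order = sorted(order, key=lambda i: -mrd[i])  # easiest first
        for i in order:
            unlearn_step(model, forget_set[i])
```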
6.2.3. MRD of Samples at Different Levels
The paper also investigates the robustness of MRD across different granularity levels of text (sentence, paragraph, long-text).
The following figure (Figure 7 from the original paper) shows MRD and unlearning difficulty of different text levels:

Analysis:
Figure 7 presents the MRD values and corresponding unlearning difficulty rankings for samples at the sentence level, paragraph level, and long-text level.
- The results demonstrate that MRD values exhibit a reasonable degree of stability and robustness across different text lengths. While the absolute values may vary, the relative difficulty rankings (i.e., which samples are harder or easier to unlearn) largely hold across these levels of granularity.
- This suggests that MRD is a flexible metric that can be applied to different text units, providing consistent insight into unlearning difficulty regardless of the length of the forget-set samples, which enhances the generalizability of the MRD concept.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces a fresh perspective on LLM unlearning by emphasizing the crucial role of sample-level unlearning difficulty. Driven by privacy concerns and the limitations of current evaluation practices, the authors propose Memory Removal Difficulty (MRD), a novel metric inspired by neuroscience, to quantify how hard individual samples are to unlearn. MRD is defined as the expected relative change in a sample's generation probability under minor Gaussian parameter perturbations. A key finding validated through experiments is that unlearning difficulty varies significantly across samples, making sample selection critical for fair evaluation.
The paper further analyzes the characteristics that influence MRD: high-frequency, high initial generation probability, and syntactically simple samples tend to be harder to unlearn (lower MRD), while high-complexity and rare-word samples are generally easier to unlearn (higher MRD). Leveraging these insights, the authors propose an MRD-based weighted sampling method, a curriculum learning-inspired approach, to optimize existing unlearning algorithms. This method prioritizes easily forgettable samples, which significantly improves both unlearning completeness and model utility while also enhancing unlearning efficiency (reducing total execution time). The extensive experimental validation on diverse LLM unlearning tasks (TOFU, WMDP, WHP, SAFE) confirms the effectiveness, reasonableness, and scalability of MRD and the proposed sampling method.
7.2. Limitations & Future Work
The authors suggest several promising future research directions:
- Hierarchical Unlearning: MRD could be used to construct hierarchical unlearning strategies, where concepts or groups of samples are unlearned in a structured manner based on their difficulty.
- Reinforcement Learning Integration: MRD could serve as a reward mechanism for reinforcement-learning-based unlearning approaches, guiding the agent to prioritize forgetting certain types of samples.
- Regularization Term: MRD might be incorporated as a regularization term directly into the unlearning objective function, explicitly balancing the unlearning process.
- Reassessing Evaluation Rationality: Researchers can use MRD to critically re-evaluate the fairness and rationality of existing LLM unlearning evaluation benchmarks and practices.

The paper does not explicitly state limitations, but some can be inferred:
- The neuro-inspired analogy, while intuitive, is a high-level conceptual link rather than a direct neurobiological model. The empirical evidence supports the metric's utility, but the "neuro-inspired" aspect is primarily a guiding metaphor.
- The approximation of MRD relies on a second-order Taylor expansion and the trace of the Hessian. Although Monte Carlo sampling makes this computationally tractable, approximating Hessian-related quantities can still be demanding for extremely large models.
- The effectiveness of MRD-based weighted sampling may depend on the specific characteristics of the forget set and the baseline unlearning algorithm; there could be scenarios where the curriculum is not optimal or needs more sophisticated dynamic adjustment.
- The sensitivity to parameters such as the number of Monte Carlo samples K shows that some tuning is still required to obtain stable MRD values, though the study shows this is manageable.
7.3. Personal Insights & Critique
This paper offers a highly valuable contribution by bringing the concept of sample-level difficulty to the forefront of LLM unlearning research. The observation that unlearning difficulty is non-uniform is critical, as it exposes a potential flaw in how current unlearning algorithms are evaluated and highlights why results might be inconsistent or misleading. The neuro-inspired analogy is a creative way to motivate the metric, making the concept of MRD intuitive for beginners.
The definition of MRD itself is clever. By focusing on the relative change in generation probability under parameter perturbations, it directly quantifies the stability of a model's "memory" of a sample, which is a robust indicator of how hard it would be to dislodge that memory. The theoretical link to the Hessian trace provides a solid mathematical grounding and interpretability, explaining why samples with higher curvature are easier to unlearn.
The proposed MRD-based weighted sampling method is a practical and elegant solution. It transforms the theoretical understanding of unlearning difficulty into an actionable strategy that demonstrably improves existing methods. The analogy to curriculum learning makes it easily understandable and applicable. The efficiency gains (reduced total runtime) are particularly impressive, showing that interpretability can directly lead to practical benefits.
Potential issues or areas for improvement:
- The "Initial Generation Probability" example in Table 11: As noted in the analysis above, the example labeled "high initial generation probability" having the higher MRD value contradicts the paper's own theoretical explanation (high probability should mean low MRD, i.e., harder to unlearn). Clarifying this discrepancy, for instance whether the example was mislabeled or whether nuances are not fully captured, would strengthen the empirical validation.
- Nature of "Neuro-Inspired": While inspiring, the connection is analogical. Further work could explore whether specific neural mechanisms of memory consolidation and forgetting could inspire more direct computational models or architectural changes in LLMs for better unlearning, moving beyond a metric.
- Generalizability Beyond LLMs: The core idea of quantifying sample-level unlearning difficulty via stability under perturbation could potentially be generalized to other deep learning models (e.g., computer vision), adapting the "generation probability" to task-specific outputs such as classification probabilities.
- Robustness to Malicious Inputs: Is MRD robust to adversarial attempts to manipulate difficulty scores? A sophisticated attacker might craft "easy-to-unlearn" samples that are actually critical, or make critical samples appear "hard" to evade unlearning.

Overall, this paper provides a fresh and impactful perspective, offering both a novel analytical tool (MRD) and a practical optimization strategy for a critical area of LLM development. It successfully steers the conversation toward a more nuanced understanding of LLM unlearning dynamics.