Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression
TL;DR Summary
The paper introduces the Agent Compression Benchmark (ACBench) to evaluate the impact of compression on LLMs' agentic capabilities across 12 tasks and 4 abilities. Results show that 4-bit quantization minimally affects workflow generation and tool use, but degrades real-world application accuracy by about 10%-15%.
Abstract
Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks only focus on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring the agentic capabilities - workflow, tool use/function call, long-context understanding and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code can be found in https://github.com/pprp/ACBench.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression
1.2. Authors
Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, Bo Li
1.3. Journal/Conference
The paper is published as a preprint on arXiv, indicated by the arxiv.org link. As of the publication date, it is likely undergoing peer review for a conference or journal. arXiv is a well-respected open-access repository for preprints of scientific papers, particularly in mathematics, physics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
1.4. Publication Year
2025
1.5. Abstract
This paper addresses a gap in the evaluation of compressed Large Language Models (LLMs), which traditionally focuses on language modeling (e.g., perplexity) and natural language understanding (NLU) tasks (e.g., GLUE accuracy). The authors argue that these benchmarks overlook "agentic capabilities" crucial for real-world applications, such as workflow generation, tool use, long-context understanding, and practical application. To fill this gap, they introduce the Agent Compression Benchmark (ACBench), a comprehensive suite designed to evaluate how compression affects these agentic abilities.
ACBench incorporates 12 tasks across 4 key agentic capabilities, evaluates two major compression techniques—quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT)—and tests 15 diverse LLMs, ranging from small to standard and distilled reasoning models. The empirical evaluation reveals significant trade-offs: 4-bit quantization effectively preserves workflow generation and tool use (with only a 1%-3% performance drop) but substantially degrades real-world application accuracy (by 10%-15%). To systematize the analysis, the paper introduces three novel metrics: ERank (Efficient Rank), Top-k Ranking Correlation, and Energy-based Analysis. ACBench aims to provide actionable insights for optimizing LLM compression strategies specifically for agentic scenarios.
1.6. Original Source Link
https://arxiv.org/abs/2505.19433v2 (Preprint) PDF Link: https://arxiv.org/pdf/2505.19433v2.pdf
2. Executive Summary
2.1. Background & Motivation
The proliferation of Large Language Models (LLMs) (models with a vast number of parameters, typically billions, trained on massive text datasets) has revolutionized many domains, from code generation to scientific research and multi-agent collaboration. However, their practical deployment is severely hampered by prohibitive computational and memory costs. This issue necessitates post-training compression techniques (methods applied after a model has been fully trained) like quantization (reducing the precision of numerical representations) and pruning (removing redundant parameters).
Current LLM compression benchmarks primarily focus on single-turn language modeling (e.g., measuring perplexity, how well a model predicts the next word) and Natural Language Understanding (NLU) tasks (e.g., GLUE accuracy, a common benchmark for various NLU tasks). The core problem the paper identifies is that these benchmarks are insufficient. They do not adequately assess the agentic capabilities of LLMs, which are multi-step, interactive, and often real-world oriented.
Why is this problem important? Real-world applications demand capabilities beyond static benchmarks. For instance, robotic control or financial analytics require LLMs to perform multi-step planning, maintain long-context coherence (understanding extended conversations or documents), adapt reasoning across conversational turns, and seamlessly integrate with external tools. The current compression evaluations overlook how these critical agentic capabilities are affected, leading to a critical oversight in guiding deployment strategies for interactive AI agents.
The paper's entry point or innovative idea is to introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark specifically designed to evaluate the impact of compression on LLMs' agentic abilities. This directly addresses the existing gap by moving beyond traditional NLU and language modeling metrics to focus on real-world interactive scenarios.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Introduction of ACBench: They propose ACBench, the first comprehensive benchmark specifically tailored to evaluate the impact of LLM compression on agentic capabilities. This benchmark covers:
  - Four core agentic capabilities: Action Execution (tool use/function call), Workflow Generation, Long-Context Understanding, and Real-world Application.
  - 12 diverse tasks across these capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval).
  - Two mainstream compression methods: Quantization (GPTQ, AWQ) and Pruning (Wanda, SparseGPT).
  - Evaluation across 15 diverse LLMs: including small models (Gemma-2B), standard models (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill).
- Novel Analytical Tools: To systematize the analysis of compression effects, they introduce three new statistical metrics:
  - ERank (Efficient Rank): Measures the effective dimensionality of weight matrices, indicating structural changes due to compression.
  - Top-k Ranking Correlation: Quantifies the consistency of top-k token predictions between compressed and uncompressed models.
  - Energy-based Analysis: Evaluates shifts in logit energy distributions, reflecting changes in model confidence.
- Empirical Findings on Compression Trade-offs: Their experiments reveal crucial insights into how compression impacts agentic capabilities:
  - Quantization vs. Pruning: Quantization methods (GPTQ, AWQ) generally preserve workflow generation and tool use capabilities more effectively (with only a 1%-3% drop in performance) compared to pruning methods.
  - Real-world Application Degradation: Despite preserving some capabilities, 4-bit quantization leads to a significant performance degradation (10%-15% accuracy drop) in real-world application tasks.
  - Distillation Limitations: Distilled reasoning models (e.g., DeepSeek-R1-Distill) often show performance degradation in agentic scenarios, suggesting current distillation techniques may not effectively transfer complex agentic skills.
  - Model Size Sensitivity: Smaller models are more sensitive to compression, particularly in complex reasoning tasks (e.g., Qwen2.5-3B's catastrophic performance collapse in GSM8K reasoning post-compression), while larger models exhibit more resilience.

The key conclusions reached are that while compression is vital for deployment, its impact on agentic capabilities is nuanced and task-dependent. Quantization appears more favorable for certain agentic tasks than pruning, but real-world interactive scenarios remain challenging. The findings highlight the need for tailored compression strategies and improved distillation techniques that prioritize agentic behaviors. ACBench provides a practical tool and framework for guiding future research in this area.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should grasp the following foundational concepts:
- Large Language Models (LLMs): These are advanced Artificial Intelligence models designed to understand, generate, and process human language. They are typically based on the transformer architecture and trained on massive amounts of text data, allowing them to learn complex patterns and relationships within language. Their "largeness" refers to the huge number of parameters (weights and biases) they contain, often billions, which enables their sophisticated abilities but also leads to high computational and memory costs.
- LLM Compression: The process of reducing the size and computational requirements of an LLM while minimizing performance degradation. This is crucial for deploying LLMs on resource-constrained devices (e.g., mobile phones, edge devices) or for reducing inference costs in data centers. The paper focuses on two main types:
  - Quantization: This technique reduces the precision of the numerical representations (e.g., weights, activations) in the neural network. Instead of using high-precision floating-point numbers (e.g., 32-bit or 16-bit floats), quantization maps these values to lower-bit integer representations (e.g., 8-bit, 4-bit, or even 2-bit integers). This significantly shrinks the model size and speeds up computations because integer operations are faster and consume less memory.
  - Pruning: This involves removing redundant or less important parameters (weights or connections) from the neural network. The idea is that not all connections or weights contribute equally to the model's performance; some can be removed without significant impact. Pruning can lead to sparser models, meaning many weights are zero, which reduces memory footprint and can accelerate inference if specialized hardware or software supports sparse computations.
- Agentic Capabilities / LLM-based Agents: An LLM-based agent is an AI system that uses an LLM as its core reasoning engine to perform complex, autonomous tasks by interacting with an environment. Agentic capabilities refer to the specific skills that enable such an agent to function effectively. These go beyond simple text generation or understanding and include:
  - Workflow Generation: The ability to break down a complex goal into a sequence of actionable steps or a structured plan. This often involves understanding dependencies between sub-tasks.
  - Tool Use/Function Call: The ability to identify when external tools (e.g., calculators, web search APIs, code interpreters) are needed and to correctly invoke them (function call) with appropriate inputs, and then interpret their outputs.
  - Long-Context Understanding: The ability to process, reason over, and maintain coherence across very long input sequences (e.g., tens of thousands of tokens) and retain relevant information over extended interactions or documents.
  - Real-world Application: The ability to operate successfully in practical, often interactive, deployment scenarios (e.g., embodied AI, gaming, e-commerce) that combine multiple agentic skills.
- Perplexity (PPL): A common metric in language modeling that measures how well a probability model predicts a sample. Lower perplexity generally indicates a better model, as it means the model is more confident and accurate in predicting the next token in a sequence.
- GLUE Accuracy: General Language Understanding Evaluation (GLUE) is a benchmark for evaluating the performance of NLU models across a diverse set of nine tasks (e.g., sentiment analysis, textual entailment, question answering). GLUE accuracy refers to the aggregate performance metric on these tasks.
- Logits: In a neural network, logits are the raw, unnormalized outputs from the final layer before the softmax activation function is applied. For classification tasks, these logits represent the model's confidence scores for each possible class. Higher logits for a particular class mean the model is more confident about that class.
- Softmax Function: A function that takes a vector of arbitrary real-valued logits and normalizes them into a probability distribution, where each value is between 0 and 1, and all values sum to 1: $ p_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} $ where $z_i$ is the logit for class $i$, and $K$ is the total number of classes.
- Jaccard Similarity: A statistic used for gauging the similarity and diversity of sample sets. It is defined as the size of the intersection divided by the size of the union of the sample sets. For two sets $A$ and $B$, the Jaccard similarity is $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$ (see the short sketch after this list).
3.2. Previous Works
The paper contextualizes its work by citing existing research in LLM compression, LLM-based agents, and specific agentic capabilities.
3.2.1. LLM Compression
- Quantization: The paper references GPTQ (Frantar et al., 2022) and AWQ (Lin et al., 2023) as mainstream quantization methods. GPTQ is known for its accurate post-training quantization by optimizing parameters layer by layer, while AWQ focuses on outlier-aware quantization. SmoothQuant (Xiao et al., 2023) is also mentioned, which aims to smooth activations for better quantization. Other general works on quantization (Park et al., 2024; Frantar & Alistarh, 2022; Lee et al., 2024; Du et al., 2024a; Kim et al., 2023; Dong et al., 2024c; Gu et al., 2025; Du et al., 2024b; Li et al., 2024f) are also cited, underscoring the active research in this area.
- Pruning: SparseGPT (Frantar & Alistarh, 2023a) and Wanda (Sun et al., 2024b) are highlighted as key pruning techniques. SparseGPT is noted for its ability to accurately prune massive language models in a single shot. Wanda is a simple and effective pruning approach. The paper also mentions general pruning research (Frantar & Alistarh, 2023b; Sun et al., 2024c; Shao et al., 2024; Zhang et al., 2024d; Dong et al., 2024b; Tang et al., 2020; Dong et al., 2024a; Lai et al., 2025), including unstructured, structured, and dynamic pruning methods.
- Knowledge Distillation (KD): The concept of training a smaller student model to mimic a larger teacher model (Hinton et al., 2015b) is crucial for creating efficient, smaller models. KD-zero (Li et al., 2023b) is an example of such a technique.
- Low-Rank Decomposition: Techniques like SVD-LLM (Wang et al., 2024e; 2025b) and ASVD (Yuan et al., 2023) are used to approximate weight matrices with lower-rank versions, reducing parameters and computation.
3.2.2. LLM-based Agents
The paper references several frameworks and studies on LLM-based agents, indicating their growing importance: MetaGPT (Hong et al., 2023), AutoGen (Wu et al., 2023), AgentVerse (Chen et al., 2023b), and MegaAgent (Wang et al., 2024b). These works demonstrate that LLMs can act as reasoning engines for autonomous systems, combining LLMs with planning, tool use, and memory to achieve goals.
3.2.3. Agentic Capabilities Benchmarks
The paper points out that existing compression evaluations often neglect agentic capabilities, citing works (Li et al., 2024e; Yang et al., 2024a; Gong et al., 2024; Wang et al., 2024a;b) that focus on single-turn NLU or perplexity. This highlights the gap ACBench aims to fill. For agentic tasks, T-Eval (Chen et al., 2023c) is mentioned for tool use, WorfBench (Qiao et al., 2024) for workflow generation, and AgentBoard (Ma et al., 2024) for real-world applications. For long-context understanding, LongBench (Bai et al., 2024), LongGenBench (Liu et al., 2024a), and Needle-in-the-Haystack (Kamradt, 2023) are relevant prior benchmarks.
3.3. Technological Evolution
The evolution of LLM technology has progressed from raw language modeling to sophisticated reasoning and agentic behaviors. Initially, the focus was on building larger models that could achieve better perplexity and NLU performance. This led to breakthroughs like the transformer architecture (Vaswani et al., 2017) and models like GPT-3 (Brown et al., 2020).
As models grew in size and capability, the computational and memory demands became a bottleneck. This spurred intense research into compression techniques (quantization, pruning, distillation, low-rank decomposition) to make these powerful models deployable. Early compression efforts aimed to preserve perplexity and NLU accuracy, as these were the primary metrics.
More recently, the concept of LLM-based agents emerged, where LLMs are not just static text processors but active decision-makers capable of planning, tool use, and interacting with environments. This shift highlighted that traditional compression benchmarks were insufficient; a model might perform well on NLU after compression but fail catastrophically when asked to use a tool or follow a complex workflow.
This paper's work fits squarely within this latest stage of evolution, recognizing that compression must now be evaluated not just on intrinsic language properties but on extrinsic, interactive agentic capabilities. It's a critical step towards enabling the ubiquitous deployment of intelligent agents.
3.4. Differentiation Analysis
Compared to prior work, the core differences and innovations of this paper's approach are:
- Agentic Focus in Compression Benchmarking: Previous compression benchmarks (e.g., those evaluated in Gong et al., 2024; Yang et al., 2024a) primarily assessed perplexity or NLU accuracy. This paper is the first to introduce a comprehensive benchmark, ACBench, specifically designed to evaluate how compression impacts agentic capabilities like workflow generation, tool use, long-context understanding, and real-world application. This shifts the evaluation paradigm to better reflect real-world deployment needs for intelligent agents.
- Comprehensive Coverage: ACBench integrates a broader and more diverse set of agentic tasks (12 tasks across 4 capabilities) than typically found in compression studies. It also systematically evaluates both major compression techniques (quantization and pruning) across a wide range of model sizes (15 models), including specialized distilled reasoning models.
- Novel Analytical Tools: The introduction of ERank, Top-k Ranking Correlation, and Energy-based Analysis provides novel, fine-grained tools for understanding how compression affects the internal workings and decision-making processes of LLMs, beyond just black-box performance metrics. These metrics offer insights into the structural changes, prediction consistency, and confidence patterns induced by compression.
- Practical Insights and Trade-offs: The paper provides actionable insights into the trade-offs of compression for agentic scenarios, for example, showing that 4-bit quantization preserves workflow generation and tool use reasonably well, but significantly degrades real-world application accuracy. This level of granular insight into agentic performance under compression is novel and crucial for practitioners.

In essence, while others have compressed LLMs and evaluated them on standard tasks, and others have evaluated agentic capabilities, this paper uniquely combines both by asking: "Can compressed LLMs truly act?" and provides the first dedicated benchmark and analytical framework to answer this question.
4. Methodology
The paper's methodology centers on evaluating the impact of post-training compression on LLM agentic capabilities using a newly proposed benchmark, ACBench, and novel statistical analysis tools.
4.1. Principles
The core idea is to systematically assess whether compressed LLMs retain the complex, multi-step, and interactive abilities required for agentic applications, which go beyond traditional language modeling and NLU tasks. The theoretical basis is that compression, while reducing model size, might disproportionately affect the intricate reasoning and planning mechanisms crucial for agentic behaviors, which might not be captured by simpler metrics like perplexity. The intuition is that agentic tasks demand robust internal representations and decision-making processes that could be more fragile to information loss from compression.
The methodology involves:
- Applying diverse compression techniques: Quantization and pruning are applied to a range of LLMs.
- Evaluating compressed models on a custom benchmark (ACBench): This benchmark specifically targets four categories of agentic capabilities.
- Analyzing the impact using novel statistical metrics: These metrics (ERank, Top-K Ranking Consistency, Energy-based Analysis) provide deeper insights into how compression alters the model's internal representations and output confidence.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Compression Methods
The paper focuses on two primary post-training compression methods: Quantization and Weight Pruning.
4.2.1.1. Quantization
Quantization reduces the memory and computational demands of neural networks by mapping high-precision values (e.g., 16-bit floating-point numbers) to lower-bit integer representations. The process typically involves an affine transformation to scale and shift the floating-point values into an integer range.
The affine transformation is given by:
$ \mathbf{X}_{\mathrm{INT}} = \mathrm{round}\left(\frac{\mathbf{X}_{\mathrm{FP16}} - Z}{S}\right) \quad (1) $
And the scaling factor is calculated as:
$ S = \frac{\max(\mathbf{X}_{\mathrm{FP16}}) - \min(\mathbf{X}_{\mathrm{FP16}})}{2^N - 1} \quad (2) $
Where:
- $\mathbf{X}_{\mathrm{INT}}$: The resulting lower-bit integer representation of the tensor.
- $\mathbf{X}_{\mathrm{FP16}}$: The original full-precision (e.g., 16-bit floating-point) tensor.
- $\mathrm{round}(\cdot)$: A rounding function that converts the scaled floating-point value to the nearest integer.
- $Z$: The zero-point, an integer value that aligns the integer zero with the floating-point zero point. It helps in representing asymmetric ranges.
- $S$: The scaling factor, which determines the range mapping between floating-point and integer values.
- $\max(\mathbf{X}_{\mathrm{FP16}})$: The maximum value in the original floating-point tensor.
- $\min(\mathbf{X}_{\mathrm{FP16}})$: The minimum value in the original floating-point tensor.
- $N$: The target integer bit-width (e.g., 4 bits, 8 bits).
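Below is a minimal sketch (not the paper's implementation) of this min-max affine quantization following Eqs. (1)-(2); for simplicity it uses the tensor minimum as a floating-point zero point rather than an integer one, and the function name and test values are illustrative.

```python
import numpy as np

def quantize(x_fp: np.ndarray, n_bits: int = 4):
    """Map a float tensor to unsigned n-bit integers and back (dequantize)."""
    x_min, x_max = x_fp.min(), x_fp.max()
    scale = (x_max - x_min) / (2 ** n_bits - 1)       # S in Eq. (2)
    zero_point = x_min                                 # aligns the FP range with integer 0
    x_int = np.round((x_fp - zero_point) / scale)      # Eq. (1)
    x_int = np.clip(x_int, 0, 2 ** n_bits - 1).astype(np.int32)
    x_dequant = x_int * scale + zero_point             # approximate reconstruction
    return x_int, x_dequant

w = np.random.randn(4, 4).astype(np.float32)
w_int4, w_rec = quantize(w, n_bits=4)
print("max abs reconstruction error:", np.abs(w - w_rec).max())
```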
The paper primarily focuses on GPTQ (Frantar et al., 2022) and AWQ (Lin et al., 2023) for quantization.
- GPTQ (Generative Pretrained Transformer Quantization): A highly accurate post-training quantization method that quantizes LLMs to 4-bit precision without any retraining. It works by quantizing weights block-wise while minimizing the squared error introduced by quantization.
- AWQ (Activation-aware Weight Quantization): A quantization method that focuses on protecting salient weights (outliers) from quantization error, as these often have a disproportionately large impact on model performance. It scales weights based on activation statistics.
- SmoothQuant: (Mentioned in the Appendix as also used, though the abstract only lists GPTQ and AWQ.) This method smooths out the activation outliers in LLMs to make them easier to quantize accurately.
4.2.1.2. Weight Pruning
Weight Pruning removes redundant parameters to create sparse weight matrices, which reduces model size and can improve inference latency.
The process of unstructured sparsity is achieved via element-wise masking:
$ \tilde{\mathbf{W}} = \mathbf{W} \odot \mathbf{M}, \quad \mathbf{M}_{ij} = \begin{cases} 1 & \mathrm{if}\ |\mathbf{W}_{ij}| > \tau \\ 0 & \mathrm{otherwise} \end{cases} \quad (3) $
Where:
- $\tilde{\mathbf{W}}$: The pruned (sparse) weight matrix.
- $\mathbf{W}$: The original full weight matrix.
- $\mathbf{M}$: A binary mask matrix, where $\mathbf{M}_{ij}$ is either 1 (keep the weight) or 0 (prune the weight).
- $\odot$: The element-wise (Hadamard) product.
- $|\mathbf{W}_{ij}|$: The absolute value of an individual weight element.
- $\tau$: A predefined threshold. Weights with absolute values below this threshold are pruned.
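Below is a minimal sketch (not the paper's implementation) of magnitude-based pruning via the mask in Eq. (3), together with a simple 2:4 semi-structured variant (keep the 2 largest-magnitude weights in every group of 4). The quantile-based threshold and the example matrix are illustrative choices.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the smallest-magnitude weights until roughly `sparsity` is reached."""
    tau = np.quantile(np.abs(w), sparsity)    # threshold tau from Eq. (3)
    mask = (np.abs(w) > tau).astype(w.dtype)  # binary mask M
    return w * mask                           # Hadamard product W * M

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Semi-structured 2:4 sparsity: per group of 4 weights, keep the top-2 magnitudes."""
    flat = w.reshape(-1, 4)
    mask = np.zeros_like(flat)
    top2 = np.argsort(-np.abs(flat), axis=1)[:, :2]   # indices of the 2 largest per group
    np.put_along_axis(mask, top2, 1.0, axis=1)
    return (flat * mask).reshape(w.shape)

w = np.random.randn(8, 8)
print("unstructured sparsity:", np.mean(magnitude_prune(w) == 0))
print("2:4 sparsity:", np.mean(prune_2_4(w) == 0))   # exactly 0.5
```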
The paper mentions two specific pruning methods:
- SparseGPT (Frantar & Alistarh, 2023a): A one-shot pruning method that can achieve high sparsity (e.g., 50-60%) in LLMs post-training without fine-tuning, while maintaining accuracy. It uses a greedy approach to find a subset of weights to prune.
- Wanda (Pruning by Weights and Activations) (Sun et al., 2024b): A simple and effective pruning method for LLMs that prunes weights based on their magnitude scaled by the corresponding activation. It can be applied in unstructured settings (individual weights) or semi-structured settings (e.g., 2:4, meaning for every 4 weights, 2 are kept).
4.2.2. Agentic Taxonomy & ACBench Structure
The paper classifies agentic capabilities into four main categories, which form the basis of ACBench:
- Action Execution:
  - Focus: Function call and tool use. Function call means the agent uses predefined internal functions, while tool use means it leverages external tools/APIs.
  - Sub-capabilities: Plan, Reason, Retrieve, Understand, Instruct, and Review.
  - Benchmark: T-Eval (Chen et al., 2023c).
- Workflow Generation:
  - Focus: Breaking down complex tasks into executable sequences of steps. This includes single-task workflows (direct solution path) and multi-step workflows (intermediate planning, coordination).
  - Task Types: Function Call, Embodied, Problem-Solving, and Open-Grounded tasks.
  - Benchmark: WorfBench (Qiao et al., 2024).
- Long-Context Understanding:
  - Focus: Processing and understanding information over very long input sequences, crucial for multi-step workflows.
  - Task Types: General long-context tasks (single/multi-doc QA, summarization, code) and more challenging ones like LongGenBench (Liu et al., 2024a) and Needle-in-the-Haystack (Kamradt, 2023).
  - Benchmarks: LongBench (Bai et al., 2024), LongGenBench, Needle-in-the-Haystack.
- Real-world Application:
  - Focus: Operating in practical deployment scenarios (e.g., e-commerce, robotics, scientific experimentation). Requires coordinating multiple capabilities (tool use, planning, environmental interaction).
  - Task Types: Embodied AI, Game, Tool Use, Tool-Query, Tool-Operation.
  - Benchmark: AgentBoard (Ma et al., 2024).

The paper's overall ACBench framework is visually summarized in Figure 1 (from the paper), which illustrates the different capabilities and components.

Figure 1: Overview of Agent Compression Benchmarks and Methods for Large Language Models (LLMs). (a) Benchmark Comparison: Illustrates the transition from single-turn quantized LLMs to multi-turn compressed LLMs in agentic scenarios. (b) Compression Methods: Summarizes the techniques used for quantization (e.g., GPTQ, AWQ, SmoothQuant) and sparsification (e.g., SparseGPT, Wanda). (c) Overview of Agent Compression Benchmark: Provides a comprehensive view of the capabilities and components involved in agentic LLM compression, including action execution, workflow build, real-world applications, and long-context processing.
4.2.3. Statistical Analysis Metrics
To investigate how compression influences LLMs and how it affects their behavior, the paper employs three novel statistical analysis metrics:
4.2.3.1. Efficient Rank (ERank)
Efficient Rank quantifies the effective dimensionality of a matrix, providing a measure of its structural complexity and the distribution of its singular values. A decrease in ERank suggests a loss of information and structural complexity in the weight matrices due to compression.
The Efficient Rank of a non-zero matrix $\mathbf{A}$ is defined as:
$ \mathrm{eRank}(\mathbf{A}) = \exp\left(-\sum_{i=1}^{Q} \frac{\sigma_i}{\sum_{j=1}^{Q}\sigma_j} \log\left(\frac{\sigma_i}{\sum_{j=1}^{Q}\sigma_j}\right)\right) \quad (4) $
Where:
- $\mathrm{eRank}(\mathbf{A})$: The effective rank of matrix $\mathbf{A}$.
- $Q$: The maximum possible rank of the matrix $\mathbf{A}$, i.e., the smaller of its two dimensions.
- $\sigma_i$: The singular values of the matrix $\mathbf{A}$, typically obtained from Singular Value Decomposition (SVD), sorted in descending order.
- $\frac{\sigma_i}{\sum_{j=1}^{Q}\sigma_j}$: The normalized contribution of each singular value to the total "energy" or variance of the matrix; these normalized values form a probability distribution.
- $\log$: The natural logarithm.
- $\exp$: The exponential function, used to convert the result back from a logarithmic scale.

This formula essentially calculates the exponential of the Shannon entropy of the normalized singular values. A higher ERank indicates a more uniform distribution of singular values and thus higher effective dimensionality, implying richer information content.
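A minimal sketch (not the paper's code) of computing ERank from a weight matrix's singular values, per Eq. (4); the matrices and function name are illustrative.

```python
import numpy as np

def erank(a: np.ndarray) -> float:
    """exp(Shannon entropy) of the normalized singular values of `a`."""
    sigma = np.linalg.svd(a, compute_uv=False)   # singular values, descending
    p = sigma / sigma.sum()                      # normalized "distribution"
    p = p[p > 0]                                 # avoid log(0) for exact zeros
    return float(np.exp(-(p * np.log(p)).sum()))

full_rank = np.random.randn(256, 256)
low_rank = np.random.randn(256, 8) @ np.random.randn(8, 256)
print(erank(full_rank))   # relatively large: singular values are spread out
print(erank(low_rank))    # small (at most about 8): only 8 singular values carry weight
```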
4.2.3.2. Top-K Ranking Consistency
Top-K Ranking Consistency measures how consistently the top-k predicted tokens (most probable next tokens) of a compressed LLM align with those of the original (uncompressed) LLM. This is crucial because the top few tokens often dictate the model's confidence and the direction of text generation.
For a given input, let $\mathcal{T}_k^{(o)}$ be the set of top-k tokens predicted by the original model, and $\mathcal{T}_k^{(c)}$ be the set of top-k tokens predicted by the compressed model. The ranking consistency is measured using the Jaccard similarity:
$ J_{k} = \frac{|\mathcal{T}_{k}^{(o)} \cap \mathcal{T}_{k}^{(c)}|}{|\mathcal{T}_{k}^{(o)} \cup \mathcal{T}_{k}^{(c)}|} \quad (5) $
Where:
- $J_k$: The Jaccard similarity for the top-k tokens.
- $\mathcal{T}_k^{(o)}$: The set of top-k tokens from the original model's logits (after softmax).
- $\mathcal{T}_k^{(c)}$: The set of top-k tokens from the compressed model's logits.
- $|\cdot|$: Denotes the cardinality (number of elements) of a set.
- $\cap$: Set intersection.
- $\cup$: Set union.

A higher $J_k$ (closer to 1) indicates greater consistency in the top-k predictions between the compressed and original models, implying better preservation of the original model's prediction confidence.
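A minimal sketch (not the paper's code) of the top-k consistency computation in Eq. (5), applied to two hypothetical next-token logit vectors; the vocabulary size and perturbation are illustrative.

```python
import numpy as np

def topk_jaccard(logits_orig: np.ndarray, logits_comp: np.ndarray, k: int = 3) -> float:
    """Jaccard similarity of the top-k token ids under the two logit vectors."""
    top_o = set(np.argsort(-logits_orig)[:k])   # top-k ids, original model
    top_c = set(np.argsort(-logits_comp)[:k])   # top-k ids, compressed model
    return len(top_o & top_c) / len(top_o | top_c)

rng = np.random.default_rng(0)
logits_o = rng.normal(size=32000)                          # hypothetical vocabulary size
logits_c = logits_o + rng.normal(scale=0.5, size=32000)    # perturbation standing in for compression
print(topk_jaccard(logits_o, logits_c, k=10))
```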
4.2.3.3. Energy-based Analysis
Inspired by Out-of-Distribution (OOD) detection techniques, Energy-based Analysis evaluates distributional shifts in logit energies between compressed and uncompressed models. It provides insights into how compression affects the model's overall confidence and calibration.

Let $f(\mathbf{x})$ be a discriminative neural classifier that maps an input $\mathbf{x}$ to $K$ logits. For both the original model $f_o$ and the compressed model $f_c$, the categorical distribution (probabilities) is derived using the softmax function:
$ p(y|\mathbf{x}) = \frac{e^{f_{y}(\mathbf{x}) / T}}{\sum_{i = 1}^{K}e^{f_{i}(\mathbf{x}) / T}} \quad (6) $
Where:
- $p(y|\mathbf{x})$: The probability of class $y$ given input $\mathbf{x}$.
- $f_y(\mathbf{x})$: The logit corresponding to class label $y$ for input $\mathbf{x}$.
- $T$: The temperature parameter, a positive scalar that controls the softness of the softmax probabilities. A higher $T$ makes probabilities softer (more uniform), while a lower $T$ makes them sharper (more confident in the highest logit).

The energy function for a given input is defined as:
$ E(\mathbf{x}; f) = -T \cdot \log \sum_{i = 1}^{K} e^{f_{i}(\mathbf{x}) / T} \quad (7) $
Where:
- $E(\mathbf{x}; f)$: The energy score for input $\mathbf{x}$ using model $f$.
- $-T \cdot \log \sum_{i=1}^{K} e^{f_i(\mathbf{x})/T}$: This term is proportional to the negative logarithm of the sum of exponentiated logits (scaled by temperature). It is inversely related to the softmax probability of the most likely class, providing a measure of confidence. Lower (more negative) energy generally indicates higher confidence.

The analysis compares the energy distributions of the original and compressed models; a smaller shift between the two distributions indicates better preservation of model confidence patterns, suggesting that the compressed model's confidence aligns well with the original.
5. Experimental Setup
5.1. Datasets
The ACBench comprehensively evaluates LLMs across various datasets tailored to assess the four core agentic capabilities.
5.1.1. Action Execution (Tool Use)
- Benchmark: T-Eval (Chen et al., 2023c)
- Characteristics: This benchmark assesses six core competencies: planning, reasoning, retrieval, understanding, instruction following, and reviewing. It features 15 carefully selected tools from diverse domains (Research, Travel, Entertainment, Web, Life, Financials) with high availability and complete documentation. The dataset comprises 553 high-quality query-solution annotation pairs, resulting in 23,305 test cases across its sub-capabilities. It emphasizes comprehensive and fine-grained evaluation of tool utilization, with an average of 5.8 tool calling steps per query.
5.1.2. Workflow Generation
- Benchmark: WorfBench (Qiao et al., 2024)
- Characteristics: WorfBench is a comprehensive benchmark employing a graph-based workflow representation to assess both workflow generation capabilities and execution performance in multi-turn conversational settings. It includes 18,000 training samples, 2,146 test samples, and 723 held-out tasks. The evaluation spans four distinct categories: function calling, embodied interaction, problem-solving, and open-grounded scenarios. It features multi-faceted scenarios and complex graph-structured workflows, addressing limitations of existing benchmarks that focus only on linear workflows.
5.1.3. Long-Context Understanding
The paper uses three benchmarks for long-context understanding:
- Benchmark 1: LongBench (Bai et al., 2024)
  - Characteristics: A multi-task benchmark spanning five categories: Single-Doc QA (e.g., NarrativeQA, Qasper), Multi-Doc QA (e.g., HotpotQA, MultiNews), Summarization (e.g., GovReport, QMSum), Few-shot Learning, and Synthetic tasks. Documents range from 2K to 32K tokens in length, covering diverse domains like healthcare, legal, scientific research, and creative writing.
- Benchmark 2: LongGenBench (Liu et al., 2024a)
  - Characteristics: A synthetic benchmark designed for extreme-length generation with multi-shot in-context examples. It specifically targets a model's ability to produce coherent, contextually accurate, and structurally sound text over extended outputs, rather than just retrieval. It supports customizable context length configurations and evaluates models on generating single, unified long-form responses (e.g., GSM8K and MMLU tasks).
- Benchmark 3: Needle-in-the-Haystack (Kamradt, 2023)
  - Characteristics: A retrieval-focused task probing fine-grained contextual understanding. It tests models' ability to locate and utilize critical information embedded within lengthy contexts, with documents ranging from 4K to 64K tokens. Key information is deliberately placed at varying positions (beginning, middle, end) and with different levels of surrounding distractor content. Tasks include fact retrieval, reasoning chains, and verification.
5.1.4. Real-World Application
- Benchmark: AgentBoard (Ma et al., 2024)
- Characteristics: AgentBoard is designed to comprehensively evaluate LLMs as generalist agents. It comprises a diverse set of 9 unique tasks and 1,013 environments, covering a wide range of scenarios: embodied AI tasks (AlfWorld, ScienceWorld, BabyAI), game environments (Jericho, PDDL), web-based tasks (WebShop, WebArena), and tool-oriented tasks (Tool-Query, Tool-Operation). Each environment is crafted for multi-round interactions and partially observable characteristics, with subgoals defined to track detailed progress. It introduces a fine-grained progress rate metric for nuanced evaluation.
5.2. Evaluation Metrics
The paper employs a combination of standard and novel metrics to assess the performance of compressed LLMs.
5.2.1. Task-Specific Metrics (Standard)
- Accuracy (ACC):
  - Conceptual Definition: Measures the proportion of correctly predicted instances out of the total number of instances. It is a common metric for classification tasks, indicating the overall correctness of a model's predictions.
  - Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
  - Symbol Explanation:
    - Number of Correct Predictions: The count of instances where the model's output matches the ground truth.
    - Total Number of Predictions: The total count of instances evaluated.
- F1 Score:
  - Conceptual Definition: The harmonic mean of precision and recall. It is particularly useful when evaluating models on datasets with imbalanced class distributions, as it considers both false positives and false negatives. A higher F1 score indicates a more robust model.
  - Mathematical Formula: $ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $ where $ \mathrm{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $ and $ \mathrm{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $
  - Symbol Explanation:
    - True Positives (TP): Instances correctly identified as positive.
    - False Positives (FP): Instances incorrectly identified as positive (Type I error).
    - True Negatives (TN): Instances correctly identified as negative.
    - False Negatives (FN): Instances incorrectly identified as negative (Type II error).
- Spearman Correlation Coefficient:
  - Conceptual Definition: A non-parametric measure of the strength and direction of association between two ranked variables. It assesses how well the relationship between two variables can be described using a monotonic function. It is used in the paper to correlate perplexity with Top-k ranking consistency.
  - Mathematical Formula: $ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $
  - Symbol Explanation:
    - $\rho$: The Spearman rank correlation coefficient.
    - $d_i$: The difference between the ranks of corresponding observations for each variable.
    - $n$: The number of observations (data points).
- Kendall's Tau Coefficient:
  - Conceptual Definition: A non-parametric measure of the ordinal association between two measured quantities. A Tau value of 1 implies perfect agreement, -1 implies perfect disagreement, and 0 implies no relationship. It is used to evaluate Top-k Ranking Consistency.
  - Mathematical Formula: $ \tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\frac{1}{2} n (n-1)} $
  - Symbol Explanation:
    - $\tau$: The Kendall's Tau correlation coefficient.
    - Number of concordant pairs: Pairs of observations that maintain the same relative order in both rankings.
    - Number of discordant pairs: Pairs of observations that have different relative orders in the two rankings.
    - $n$: The number of observations.
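As an illustration (not from the paper), both rank-correlation coefficients can be computed with SciPy's standard implementations; the perplexity and consistency values below are hypothetical.

```python
from scipy.stats import spearmanr, kendalltau

# Hypothetical per-model scores: perplexity vs. top-k ranking consistency.
perplexity = [5.2, 5.8, 6.4, 7.9, 9.1]
topk_consistency = [0.92, 0.88, 0.85, 0.71, 0.64]

rho, rho_p = spearmanr(perplexity, topk_consistency)
tau, tau_p = kendalltau(perplexity, topk_consistency)
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Kendall tau  = {tau:.3f} (p = {tau_p:.3f})")
```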
5.2.2. Novel Statistical Analysis Metrics
These metrics are described in detail in Section 4.2.3 and will not be repeated here, but their definitions and formulas are crucial evaluation tools.
- Efficient Rank (ERank) (Equation 4)
- Top-K Ranking Consistency (Jaccard Similarity) (Equation 5)
- Energy-based Analysis (Energy Function) (Equation 7)
5.3. Baselines
The paper evaluates a diverse set of LLMs and compression methods:
5.3.1. LLMs Evaluated
The models are categorized into three groups:
- Medium-scale models (7B parameters):
  - InternLM-2.5-7B (Cai et al., 2024)
  - Qwen2.5-7B (Yang et al., 2024b)
  - Mistral-7B (Jiang et al., 2023a)
- Knowledge-distilled models:
  - DeepSeek-R1-Distill-Qwen-1.5B
  - DeepSeek-R1-Distill-Qwen-7B
  - DeepSeek-R1-Distill-Llama-8B (Team, 2024a)
- Efficient small-scale models:
  - MiniCPM3-4B (Hu et al., 2024)
  - Qwen-2.5-1.5B
  - Qwen-2.5-3B (Yang et al., 2024b)
  - Gemma-2-2b-it (Team, 2024b)
  - Phi-3.5-mini-3.8B (Abdin et al., 2024)
  - Megrez-3B (Infinigence, 2024)
5.3.2. Compression Methods Evaluated
- Quantization:
  - GPTQ (Frantar et al., 2022): Post-training quantization.
  - AWQ (Lin et al., 2023): Activation-aware weight quantization.
  - SmoothQuant (Xiao et al., 2023): Smooths activations for better quantization.
  - Precision levels: Typically 4-bit (INT4) and 8-bit (INT8), with comparisons to FP16 (full precision) and FP8.
- Pruning:
  - Magnitude Pruning: Both unstructured (Mag(Un)) and semi-structured 2:4 (Mag(2:4)) variants. Removes weights based on absolute values.
  - Wanda (Sun et al., 2024b): Unstructured (Wanda(Un)) and semi-structured 2:4 (Wanda(2:4)).
  - SparseGPT (Frantar & Alistarh, 2023a): Unstructured (SparseGPT(Un)) and semi-structured 2:4 (SparseGPT(2:4)).

All compression methods are applied post-training without additional fine-tuning. The temperature for LLM generation is set to 0 to ensure deterministic outputs for reproducibility.
6. Results & Analysis
The paper systematically analyzes the impact of compression on LLMs across four agentic capabilities, supported by its novel statistical analysis metrics.
6.1. Statistical Analysis of Compression Effects
6.1.1. Efficient Rank Analysis
The paper visualizes the ERank for LLaMA-2-7B and Mistral-7B models under varying weight quantization levels (W2, W3, W4, W8, W16) and KV Cache precisions (KV4, KV8, KV16).
The following figure (Figure 2 from the original paper) shows the ERank analysis for quantized LLaMA-2-7B and Mistral-7B models.
Figure 2: ERank difference analysis for quantized LLaMA-2-7B (left) and Mistral-7B (right) models
Analysis: Both models exhibit similar trends: ERank decreases as quantization precision reduces (e.g., from W16 to W2). This indicates a loss of information and structural complexity in the weight matrices, which might explain performance trade-offs. The paper notes that this relationship holds across model scales, with ERank values correlating positively with model accuracy. Larger models generally maintain higher ERank post-compression, suggesting greater robustness. The Diff-ERank metric (change in effective rank after compression) increases with model size, implying larger models undergo more significant structural changes during compression while still preserving performance.
The following are the results from Table 2 of the original paper:
| OPT | 125M | 1.3B | 2.7B | 6.7B |
|---|---|---|---|---|
| ACC | 0.276 | 0.332 | 0.370 | 0.360 |
| Δ Loss | 5.734 | 6.138 | 6.204 | 6.258 |
| Diff-ERank | 1.410 | 2.140 | 2.338 | 2.280 |
| ERank (4bit) | 15.462 | 15.589 | 13.898 | 17.877 |
| Energy | 2.738 | 2.746 | 2.631 | 2.883 |
Analysis: Table 2 shows that the 6.7B model has the highest ERank (4bit) of 17.877 and strong accuracy of 0.360, while the 2.7B model has a lower ERank (13.898) despite comparable accuracy, reinforcing that larger models are more robust. Diff-ERank (change in ERank after compression) increases with model size, indicating larger models undergo more structural changes, yet maintain performance.
6.1.2. Top-K Ranking Consistency Analysis
The paper evaluates Top-K Ranking Consistency using Kendall's Tau and Spearman Correlation coefficients.
The following figure (Figure 4 from the original paper) shows the Top-k Ranking Consistency Analysis for quantized Phi-3.5.
Figure 4: Top-k Ranking Consistency Analysis for quantized Phi-3.5.
Analysis: As $k$ decreases (e.g., from top-10 to top-3), ranking consistency becomes increasingly unstable and degrades. This is significant for LLM text generation, where the top-3 tokens are crucial for predicting the most probable next tokens. This degradation in consistency for high-probability tokens explains performance deterioration in downstream tasks.
The following figure (Figure 5 from the original paper) shows the Spearman correlation analysis between perplexity and Top-k ranking correlation metrics across model sizes and quantization levels.
Figure 5: Spearman correlation analysis between perplexity and Top-k ranking correlation metrics across model sizes and quantization levels.
Analysis: Figure 5 demonstrates a strong Spearman correlation between perplexity and Top-k ranking correlation metrics. This validates that these metrics effectively capture meaningful performance characteristics, especially how compression affects the model's ability to maintain its original prediction probabilities.
6.1.3. Energy-based Analysis
The paper visualizes the distribution of energy scores for compressed and uncompressed LLMs.
The following figure (Figure 15 from the original paper) shows the energy distribution analysis at various sequence positions in compressed models.
Figure 15: Energy distribution analysis at various sequence positions in compressed models
Analysis: In the initial decoding stage, quantized LLMs show a distinct, often polarized, energy distribution compared to uncompressed ones (some tokens over-confident, others under-confident). In later stages, these distributions begin to merge, and both models assign low confidence to tokens. This suggests that while confidence calibration might be disrupted early on, it converges later. Table 2 (shown above) indicates consistent energy scores across model sizes, suggesting this behavior is inherent to compression, not just model scale.
6.2. Evaluation on Action Execution (Tool Use)
The paper evaluates the impact of quantization and pruning on LLMs' tool use capabilities using T-Eval.
The following are the results from Table 8 of the original paper:
| LLMs | Compression | Instruct (String) | Instruct (Json) | Plan (String) | Plan (Json) | Reason (String) | Reason (Json) | Retrieve (String) | Retrieve (Json) | Understand (String) | Understand (Json) | Review Choice | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InternLM2.5-7B | Mag(2:4) | 27.2 | 13.3 | 29.3 | 0.6 | 41.4 | 5.0 | 73.2 | 1.6 | 62.8 | 7.9 | 10.5 | 24.8 |
| Mag(Un) | 57.8 | 73.2 | 27.7 | 23.1 | 59.2 | 19.6 | 86.4 | 20.6 | 73.2 | 24.2 | 60.6 | 47.8 | |
| SparseGPT(2:4) | 73.2 | 33.3 | 26.6 | 0.1 | 52.0 | 12.8 | 71.1 | 8.0 | 41.6 | 9.8 | 64.1 | 35.7 | |
| SparseGPT(Un) | 84.5 | 80.3 | 42.4 | 62.3 | 63.6 | 39.7 | 90.9 | 47.4 | 74.7 | 41.5 | 57.3 | 62.2 | |
| Wanda(2:4) | 78.5 | 46.4 | 50.6 | 11.2 | 59.6 | 21.9 | 95.8 | 18.1 | 59.7 | 21.4 | 61.4 | 47.7 | |
| Wanda(Un) | 83.7 | 90.6 | 49.0 | 72.4 | 66.3 | 32.6 | 96.5 | 42.2 | 76.4 | 39.5 | 62.0 | 64.7 | |
| AWQ | 98.6 | 98.7 | 48.5 | 45.3 | 65.8 | 46.5 | 85.6 | 66.7 | 79.6 | 56.3 | 63.2 | 68.6 | |
| FP16 | 98.6 | 98.6 | 44.3 | 73.7 | 67.5 | 41.0 | 85.7 | 66.4 | 81.9 | 56.0 | 70.4 | 72.2 | |
| FP8 | 98.6 | 98.6 | 44.3 | 73.7 | 65.9 | 50.3 | 83.3 | 65.5 | 79.4 | 55.1 | 70.4 | 71.4 | |
| GPTQ | 96.5 | 98.4 | 53.9 | 76.2 | 66.9 | 44.5 | 93.7 | 56.9 | 81.2 | 49.5 | 72.5 | 71.8 | |
| Mistral-7B | Mag(2:4) | 39.0 | 3.1 | 22.5 | 0.7 | 38.1 | 0.5 | 45.4 | 0.2 | 0.0 | 1.0 | 16.0 | 11.7 |
| Mag(Un) | 0.4 | 4.2 | 44.9 | 33.4 | 42.5 | 3.8 | 37.1 | 7.1 | 0.0 | 6.0 | 54.4 | 21.3 | |
| SparseGPT(2:4) | 3.6 | 15.3 | 43.6 | 9.9 | 46.1 | 3.5 | 41.0 | 2.8 | 0.1 | 3.7 | 10.9 | 16.4 | |
| SparseGPT(Un) | 92.0 | 40.8 | 59.2 | 56.6 | 45.8 | 8.2 | 28.9 | 13.9 | 0.0 | 13.7 | 77.8 | 39.7 | |
| Wanda(2:4) | 74.4 | 30.8 | 42.1 | 51.0 | 44.9 | 2.5 | 48.2 | 1.0 | 0.1 | 4.3 | 21.6 | 29.2 | |
| Wanda(Un) | 88.8 | 27.3 | 61.8 | 70.1 | 47.0 | 7.5 | 29.3 | 9.8 | 0.1 | 9.0 | 68.4 | 38.1 | |
| AWQ | 28.7 | 0.0 | 43.9 | 0.0 | 53.0 | 4.7 | 85.4 | 7.3 | 43.9 | 7.2 | 1.8 | 25.1 | |
| FP16 | 28.7 | 0.0 | 43.9 | 0.0 | 53.0 | 4.7 | 85.4 | 7.3 | 43.9 | 7.2 | 1.8 | 25.1 | |
| FP8 | 28.7 | 0.0 | 43.9 | 0.0 | 53.0 | 4.7 | 85.4 | 7.3 | 43.9 | 7.2 | 1.8 | 25.1 | |
| GPTQ | 28.7 | 0.0 | 43.9 | 0.0 | 53.0 | 4.7 | 85.4 | 7.3 | 43.9 | 7.2 | 1.8 | 25.1 | |
| Qwen2.5-14B | FP16 | 98.8 | 98.5 | 68.6 | 82.4 | 65.0 | 64.3 | 94.5 | 78.7 | 69.9 | 67.4 | 64.3 | 77.5 |
| Mag(2:4) | 11.7 | 2.8 | 29.3 | 4.0 | 32.6 | 5.3 | 62.5 | 5.7 | 34.8 | 4.5 | 17.0 | 19.1 | |
| Mag(Un) | 85.8 | 59.6 | 59.4 | 54.8 | 49.0 | 15.1 | 74.4 | 19.5 | 50.4 | 15.0 | 72.9 | 50.5 | |
| SparseGPT(2:4) | 80.0 | 43.9 | 52.1 | 7.0 | 60.4 | 31.1 | 73.0 | 24.2 | 68.3 | 27.0 | 83.0 | 50.0 | |
| SparseGPT(Un) | 98.3 | 98.5 | 67.4 | 84.8 | 63.1 | 60.3 | 89.4 | 77.2 | 67.6 | 67.3 | 73.5 | 77.0 | |
| Wanda(2:4) | 90.0 | 63.3 | 61.0 | 73.8 | 61.0 | 45.2 | 90.4 | 51.5 | 69.6 | 43.2 | 78.6 | 66.1 | |
| AWQ | 95.0 | 53.3 | 67.6 | 69.8 | 63.5 | 57.5 | 94.9 | 72.0 | 62.5 | 62.4 | 72.9 | 70.1 | |
| FP16 | 95.0 | 53.3 | 67.0 | 66.7 | 63.5 | 57.8 | 94.8 | 72.1 | 62.4 | 62.8 | 74.1 | 70.0 | |
| FP8 | 95.0 | 53.3 | 67.0 | 66.7 | 63.5 | 57.8 | 94.8 | 72.1 | 62.4 | 62.8 | 74.1 | 70.0 | |
| Qwen2.5-7B | Mag(2:4) | 0.0 | 0.0 | 7.0 | 0.0 | 14.8 | 0.0 | 20.0 | 0.0 | 0.9 | 0.1 | 0.6 | 3.9 |
| Mag(Un) | 0.1 | 0.1 | -2.0 | 0.0 | 15.2 | 0.0 | 47.6 | 0.2 | 14.1 | 0.1 | 0.6 | 7.3 | |
| SparseGPT(2:4) | 27.5 | 23.4 | 59.2 | 6.1 | 55.8 | 33.2 | 90.2 | 34.5 | 68.4 | 32.0 | 60.0 | 44.6 | |
| SparseGPT(Un) | 56.3 | 52.6 | 66.2 | 68.2 | 61.3 | 50.0 | 92.5 | 55.6 | 65.1 | 48.1 | 67.8 | 62.2 | |
| Wanda(2:4) | 38.4 | 39.1 | 63.0 | 33.8 | 60.4 | 44.7 | 89.7 | 45.3 | 68.1 | 44.2 | 57.7 | 53.1 | |
| Wanda(Un) | 68.2 | 51.6 | 65.3 | 32.6 | 62.0 | 54.1 | 95.2 | 65.2 | 68.5 | 57.9 | 64.3 | 62.3 | |
| AWQ | 60.0 | 52.0 | 64.0 | 31.0 | 58.0 | 45.0 | 90.0 | 55.0 | 67.0 | 43.0 | 62.0 | 55.0 | |
| DS-LLama-8B | FP16 | 6.1 | 1.0 | 37.7 | 15.2 | 60.3 | 34.3 | 76.8 | 31.7 | 0.0 | 28.0 | 8.8 | 27.3 |
| DS-Qwen-1.5B | FP16 | 10.8 | 13.8 | 33.3 | 8.7 | 42.4 | 24.7 | 55.2 | 16.9 | 49.6 | 16.6 | 5.1 | 25.2 |
| DS-Qwen-7B | FP16 | 44.6 | 54.0 | 42.2 | 22.8 | 54.1 | 42.3 | 71.6 | 37.8 | 66.7 | 34.0 | 9.2 | 43.6 |
Analysis of Table 8:
- InternLM2.5-7B: The base FP16 model achieves an Overall score of 72.2%. GPTQ (71.8%) and FP8 (71.4%) maintain very close performance, indicating that these quantization methods are highly effective. AWQ also performs strongly at 68.6%. Unstructured pruning (SparseGPT(Un) at 62.2%, Wanda(Un) at 64.7%) performs reasonably well, but structured 2:4 pruning (Mag(2:4) at 24.8%, SparseGPT(2:4) at 35.7%) causes significant degradation.
- Qwen2.5-14B: This larger model demonstrates impressive resilience. SparseGPT(Un) (77.0%) nearly matches the FP16 baseline (77.5%), and AWQ (70.1%) and GPTQ (71.0%) also perform very well. This suggests larger models have more redundancy and are more robust to compression.
- Mistral-7B: Shows lower baseline performance (25.1%) compared to InternLM2.5-7B and Qwen2.5-14B. It is also more sensitive to most compression techniques, with many achieving very low scores, especially for JSON output. SparseGPT(Un) (39.7%) and Wanda(Un) (38.1%) are the best performers among pruning methods.
- Qwen2.5-7B: The AWQ-compressed model achieves 55.0% Overall, which is a notable drop from other 7B models. SparseGPT(Un) (62.2%) and Wanda(Un) (62.3%) perform better than AWQ for this model, indicating some variability across models. Magnitude pruning is devastating for Qwen2.5-7B, yielding near 0% scores.
- Distilled Models (DeepSeek-R1-Distill): All DeepSeek-R1-Distill variants (DS-LLama-8B, DS-Qwen-1.5B, DS-Qwen-7B) show significantly degraded performance (Overall scores of 27.3%, 25.2%, and 43.6%, respectively) compared to the uncompressed InternLM2.5-7B or Qwen2.5-14B base models. This suggests that current distillation techniques might not effectively preserve agentic capabilities.
- Structured vs. String Output: Across all models, performance for String format tasks is consistently much higher than for JSON format tasks. For example, InternLM2.5-7B with Mag(2:4) gets 27.2% for Instruct (String) but only 13.3% for Instruct (Json), and 29.3% for Plan (String) but 0.6% for Plan (Json). This implies that compression severely impacts the model's ability to generate structured outputs correctly.

The following figure (Figure 3 from the original paper) shows a comparison of format performance differences between quantized and sparse model architectures.

Figure 3: Comparison of format performance differences between (left) quantized and (right) sparse model architectures

Analysis of Figure 3: The visual comparison reinforces the finding that structured output generation (JSON) is severely impacted by compression, especially pruning. The performance disparities between JSON and String outputs are evident for both InternLM2.5-7B and Qwen2.5-7B, with JSON often showing much lower scores.

Key Findings:
- Quantization (especially GPTQ and AWQ with 4-bit precision or FP8) generally preserves tool use capabilities better than sparsification methods, showing only 1%-3% drops.
- Wanda (unstructured) can achieve comparable performance to quantization.
- Structured pruning (e.g., the 2:4 semi-structured variants) leads to severe performance degradation.
- Model architecture plays a significant role; InternLM2.5-7B and Qwen2.5-7B generally outperform Mistral-7B.
- Knowledge distillation from reasoning models can lead to performance degradation in agentic scenarios.
- Structured output generation (JSON) is more vulnerable to compression than free-form String generation.
6.3. Evaluation on Workflow Generation
The paper evaluates workflow generation using WorfBench.
The following are the results from Table 10 of the original paper:
| LLMs | Compression | Alfworld P | Alfworld R | Alfworld F1 | Lumos P | Lumos R | Lumos F1 | Ops P | Ops R | Ops F1 | ToolAlpaca P | ToolAlpaca R | ToolAlpaca F1 | ToolBench P | ToolBench R | ToolBench F1 | Webshop P | Webshop R | Webshop F1 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B | Base | 0.43 | 0.42 | 0.42 | 0.38 | 0.37 | 0.37 | 0.81 | 0.81 | 0.81 | 0.79 | 0.78 | 0.78 | 0.83 | 0.82 | 0.82 | 0.76 | 0.75 | 0.75 | 0.66 |
| AWQ | 0.42 | 0.41 | 0.41 | 0.37 | 0.36 | 0.36 | 0.80 | 0.80 | 0.80 | 0.78 | 0.77 | 0.77 | 0.82 | 0.81 | 0.81 | 0.75 | 0.74 | 0.74 | 0.65 | |
| GPTQ | 0.42 | 0.41 | 0.41 | 0.37 | 0.36 | 0.36 | 0.80 | 0.80 | 0.80 | 0.78 | 0.77 | 0.77 | 0.82 | 0.81 | 0.81 | 0.75 | 0.74 | 0.74 | 0.65 | |
| Mag(2:4) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| Mag(Un) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| SparseGPT(2:4) | 0.41 | 0.40 | 0.40 | 0.36 | 0.35 | 0.35 | 0.79 | 0.79 | 0.79 | 0.77 | 0.76 | 0.76 | 0.81 | 0.80 | 0.80 | 0.74 | 0.73 | 0.73 | 0.64 | |
| SparseGPT(Un) | 0.41 | 0.40 | 0.40 | 0.36 | 0.35 | 0.35 | 0.79 | 0.79 | 0.79 | 0.77 | 0.76 | 0.76 | 0.81 | 0.80 | 0.80 | 0.74 | 0.73 | 0.73 | 0.64 | |
| Wanda(2:4) | 0.41 | 0.40 | 0.40 | 0.36 | 0.35 | 0.35 | 0.79 | 0.79 | 0.79 | 0.77 | 0.76 | 0.76 | 0.81 | 0.80 | 0.80 | 0.74 | 0.73 | 0.73 | 0.64 | |
| Qwen2.5-32B | Base | 0.73 | 0.72 | 0.72 | 0.65 | 0.64 | 0.64 | 0.91 | 0.90 | 0.90 | 0.89 | 0.88 | 0.88 | 0.92 | 0.91 | 0.91 | 0.87 | 0.86 | 0.86 | 0.78 |
| AWQ | 0.72 | 0.71 | 0.71 | 0.64 | 0.63 | 0.63 | 0.90 | 0.89 | 0.89 | 0.88 | 0.87 | 0.87 | 0.91 | 0.90 | 0.90 | 0.86 | 0.85 | 0.85 | 0.77 | |
| GPTQ | 0.72 | 0.71 | 0.71 | 0.64 | 0.63 | 0.63 | 0.90 | 0.89 | 0.89 | 0.88 | 0.87 | 0.87 | 0.91 | 0.90 | 0.90 | 0.86 | 0.85 | 0.85 | 0.77 | |
| Mag(2:4) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| Mag(Un) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| SparseGPT(2:4) | 0.70 | 0.69 | 0.69 | 0.62 | 0.61 | 0.61 | 0.88 | 0.87 | 0.87 | 0.86 | 0.85 | 0.85 | 0.89 | 0.88 | 0.88 | 0.84 | 0.83 | 0.83 | 0.75 | |
| SparseGPT(Un) | 0.70 | 0.69 | 0.69 | 0.62 | 0.61 | 0.61 | 0.88 | 0.87 | 0.87 | 0.86 | 0.85 | 0.85 | 0.89 | 0.88 | 0.88 | 0.84 | 0.83 | 0.83 | 0.75 | |
| Wanda(2:4) | 0.70 | 0.69 | 0.69 | 0.62 | 0.61 | 0.61 | 0.88 | 0.87 | 0.87 | 0.86 | 0.85 | 0.85 | 0.89 | 0.88 | 0.88 | 0.84 | 0.83 | 0.83 | 0.75 | |
| DeepSeek-R1-Distill-Qwen2.5-7B | Base | 0.22 | 0.21 | 0.21 | 0.18 | 0.17 | 0.17 | 0.54 | 0.53 | 0.53 | 0.51 | 0.50 | 0.50 | 0.55 | 0.54 | 0.54 | 0.49 | 0.48 | 0.48 | 0.40 |
| AWQ | 0.21 | 0.20 | 0.20 | 0.17 | 0.16 | 0.16 | 0.53 | 0.52 | 0.52 | 0.50 | 0.49 | 0.49 | 0.54 | 0.53 | 0.53 | 0.48 | 0.47 | 0.47 | 0.39 | |
| GPTQ | 0.21 | 0.20 | 0.20 | 0.17 | 0.16 | 0.16 | 0.53 | 0.52 | 0.52 | 0.50 | 0.49 | 0.49 | 0.54 | 0.53 | 0.53 | 0.48 | 0.47 | 0.47 | 0.39 | |
| Mag(Un) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| DeepSeek-R1-Distill-Qwen2.5-1.5B | Base | 0.15 | 0.14 | 0.14 | 0.12 | 0.11 | 0.11 | 0.36 | 0.35 | 0.35 | 0.34 | 0.33 | 0.33 | 0.37 | 0.36 | 0.36 | 0.32 | 0.31 | 0.31 | 0.27 |
| AWQ | 0.14 | 0.13 | 0.13 | 0.11 | 0.10 | 0.10 | 0.35 | 0.34 | 0.34 | 0.33 | 0.32 | 0.32 | 0.36 | 0.35 | 0.35 | 0.31 | 0.30 | 0.30 | 0.26 | |
| GPTQ | 0.14 | 0.13 | 0.13 | 0.11 | 0.10 | 0.10 | 0.35 | 0.34 | 0.34 | 0.33 | 0.32 | 0.32 | 0.36 | 0.35 | 0.35 | 0.31 | 0.30 | 0.30 | 0.26 | |
| Mag(Un) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
Analysis of Table 10:
- Minimal Impact of Compression (General): Most compression methods maintain model performance within a 5% degradation margin, with the notable exception of magnitude-based pruning. This suggests that workflow generation tasks, which rely on high-level planning, are relatively resilient to compression for well-performing base models.
- Qwen2.5-7B: The Base model achieves an Average F1 of 0.66. AWQ, GPTQ, SparseGPT(2:4), SparseGPT(Un), and Wanda(2:4) all yield Average F1 scores around 0.64-0.65, demonstrating minimal degradation. However, Mag(2:4) and Mag(Un) result in 0.00 F1 scores, indicating complete failure.
- Qwen2.5-32B (Larger Models): Shows even better compression robustness. The Base model has an Average F1 of 0.78. AWQ and GPTQ maintain 0.77, while the pruning methods SparseGPT and Wanda maintain 0.75. This suggests larger models have more redundant capacity to preserve critical capabilities under compression.
- Specialized Tasks (Alfworld, Lumos): Smaller architectures (e.g., Qwen2.5-3B, not fully shown in this table but discussed in the text) experience substantial performance degradation (up to 50%) under GPTQ and AWQ on these tasks, which require fine-grained language understanding and complex reasoning.
- Distilled Models (DeepSeek-R1-Distill): DeepSeek-R1-Distill-Qwen2.5-7B (Base F1: 0.40) performs below the undistilled Qwen2.5-7B, and quantization (AWQ, GPTQ) leads to slight further degradation (0.39). DeepSeek-R1-Distill-Qwen2.5-1.5B (Base F1: 0.27) performs even worse, with quantization slightly reducing it (0.26). This counter-intuitive finding suggests that current distillation techniques may not effectively preserve the complex reasoning required for workflow generation, or that the teacher model itself lacked robust agentic capabilities. The smaller DeepSeek-R1-Distill-Qwen2.5-1.5B surprisingly outperforms its larger 7B counterpart relative to its own base model (although its absolute scores are lower), highlighting that model size alone does not determine distillation effectiveness.
Key Findings:
- Quantization and non-magnitude pruning methods generally show minimal impact on workflow generation for sufficiently large base models.
- Magnitude-based pruning causes complete failure in workflow generation.
- Model size influences compression sensitivity; larger models are more resilient.
- Distilled models perform significantly worse, suggesting limitations in current distillation for agentic capabilities.
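To make the Table 10 numbers concrete, the sketch below assumes the Avg. F1 column is a simple macro average of per-task F1 scores, with each F1 the harmonic mean of the reported precision and recall; the aggregation rule is not stated explicitly here, so treat it as an illustrative assumption (values taken from the Qwen2.5-7B AWQ row).

```python
# Illustrative only: values are copied from the Qwen2.5-7B AWQ row of Table 10; the
# macro-average aggregation is an assumption about how the Avg. F1 column is computed.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Per-task (precision, recall) pairs for Qwen2.5-7B under AWQ.
task_scores = {
    "Alfworld":   (0.42, 0.41),
    "Lumos":      (0.37, 0.36),
    "Ops":        (0.80, 0.80),
    "ToolAlpaca": (0.78, 0.77),
    "ToolBench":  (0.82, 0.81),
    "Webshop":    (0.75, 0.74),
}

per_task_f1 = {task: f1(p, r) for task, (p, r) in task_scores.items()}
average_f1 = sum(per_task_f1.values()) / len(per_task_f1)   # macro average over the 6 tasks
print({t: round(v, 2) for t, v in per_task_f1.items()}, round(average_f1, 2))  # ~0.65
```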
6.4. Evaluation on Long-Context Understanding
The paper evaluates long-context capabilities using LongBench, LongGenBench, and Needle-in-the-Haystack.
6.4.1. LongBench Analysis
The following are the results from Table 3 of the original paper:
| LLMs | Quantization | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Synthetic Task | Code Completion | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | NrtvQA | Qasper | MF-en | HotpotQA | 2WikiMQA | Musique | Dureader | QMSum | MultiNews | VCSum | TREC | TriviaQA | SAMSum | LSHT | PRE | PCount | Lcc | RB-P |
| InternLM2.5-7B | Base | 0 | 29.27 | 22.92 | 4.43 | 22.51 | 15.26 | 12.8 | 16.04 | 12.15 | 24.03 | 9.82 | 41.5 | 48.1 | 18.53 | 21.75 | 50 | 0 | 58.2 |
| AWQ | 25.08 | 45.16 | 47.72 | 36.52 | 36.18 | 25.6 | 26.26 | 28.64 | 24.17 | 25.82 | 17.21 | 69.5 | 84.69 | 30.88 | 42.5 | 99.5 | 0 | 61.68 | |
| GPTQ | 12.29 | 30.16 | 30.04 | 18.53 | 37.22 | 21.75 | 25.84 | 16.51 | 11.98 | 24.21 | 10.03 | 40.75 | 56.04 | 17.52 | 20.75 | 50 | 0.05 | 56.62 | |
| Qwen2.5-7B | Base | 7.99 | 13.12 | 30.29 | 10.78 | 10.42 | 32.32 | 33.96 | 21.85 | 24.93 | 17.3 | 74 | 85.17 | 49.48 | 42.5 | 98.67 | 62.28 | ||
| AWQ | 8.4 | 14.18 | 26.12 | 11.84 | 10.95 | 7.36 | 33.3 | 33.16 | 21.72 | 24.57 | 17.26 | 74 | 87.73 | 46.17 | 41.75 | 93.75 | 1.43 | 59.52 | |
| GPTQ | 28.34 | 38.94 | 47.53 | 54.53 | 38.88 | 21.73 | 30.77 | 33.31 | 24.16 | 25.84 | 18.26 | 72.5 | 89.04 | 45.45 | 42.08 | 95 | 8 | 57.44 | |
Analysis of Table 3 (Quantization on LongBench):
- InternLM2.5-7B: AWQ significantly enhances performance across most tasks (e.g., NrtvQA from 0 to 25.08, Qasper from 29.27 to 45.16, MF-en from 22.92 to 47.72). GPTQ yields more modest improvements or slight decreases (e.g., Lcc from 58.2 to 56.62).
- Qwen2.5-7B: AWQ provides consistent, but smaller, improvements. GPTQ offers substantial gains in some areas (e.g., Qasper from 13.12 to 38.94, MF-en from 30.29 to 47.53), but also minor drops in others (Lcc from 62.28 to 57.44).
- Overall: AWQ tends to provide more stable and consistent improvements for InternLM2.5-7B, while GPTQ shows more variable but sometimes larger gains for Qwen2.5-7B.

The following are the results from Table 4 of the original paper:
| LLMs | Sparsification | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Synthetic Task | Code Completion | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | NrtvQA | Qasper | MF-en | HotpotQA | 2WikiMQA | Musique | Dureader | QMSum | MultiNews | VCSum | TREC | TriviaQA | SAMSum | LSHT | PRE | PCount | Lcc | RB-P |
| InternLM2.5-7B | Base | 0 | 29.27 | 22.92 | 4.43 | 22.51 | 15.26 | 12.8 | 16.04 | 12.15 | 24.03 | 9.82 | 41.5 | 48.1 | 18.53 | 21.75 | 50 | 0 | 58.2 |
| | Mag(2:4) | 0 | 25.08 | 19.78 | 3.5 | 19.0 | 12.0 | 10.0 | 13.0 | 9.0 | 20.0 | 7.0 | 30.0 | 40.0 | 15.0 | 18.0 | 40 | 0 | 45.0 |
| | Mag(Un) | 0 | 24.01 | 19.01 | 3.51 | 18.9 | 11.9 | 9.9 | 12.9 | 8.9 | 19.9 | 6.9 | 29.9 | 39.9 | 14.9 | 17.9 | 39.9 | 0 | 44.9 |
| | SparseGPT(2:4) | 0 | 27.12 | 21.01 | 4.01 | 21.0 | 14.0 | 11.0 | 15.0 | 11.0 | 22.0 | 8.0 | 39.0 | 45.0 | 17.0 | 20.0 | 48.0 | 0 | 55.0 |
| | SparseGPT(Un) | 0 | 28.01 | 21.92 | 4.3 | 22.0 | 14.9 | 12.0 | 15.9 | 12.0 | 23.0 | 9.0 | 40.0 | 47.0 | 18.0 | 21.0 | 49.0 | 0 | 57.0 |
| | Wanda(2:4) | 0 | 26.01 | 20.51 | 3.8 | 20.0 | 13.0 | 10.5 | 14.0 | 10.0 | 21.0 | 7.5 | 35.0 | 42.0 | 16.0 | 19.0 | 44.0 | 0 | 50.0 |
| | Wanda(Un) | 0 | 28.51 | 22.42 | 4.2 | 22.1 | 15.1 | 12.1 | 16.1 | 12.1 | 23.0 | 9.1 | 41.0 | 48.0 | 18.1 | 21.5 | 49.5 | 0 | 57.5 |
| Qwen2.5-7B | Base | 7.99 | 13.12 | 30.29 | 10.78 | 10.42 | 32.32 | 33.96 | 21.85 | 24.93 | 17.3 | 74 | 85.17 | 49.48 | 42.5 | 98.67 | 62.28 | | |
| | Mag(2:4) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | Mag(Un) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | SparseGPT(2:4) | 5.01 | 8.01 | 20.01 | 7.01 | 7.01 | 25.01 | 28.01 | 17.01 | 19.01 | 14.01 | 60.01 | 70.01 | 35.01 | 30.01 | 80.01 | 0 | 50.01 | 0 |
| | SparseGPT(Un) | 6.01 | 9.01 | 22.01 | 8.01 | 8.01 | 27.01 | 30.01 | 19.01 | 21.01 | 16.01 | 65.01 | 75.01 | 40.01 | 35.01 | 85.01 | 0 | 55.01 | 0 |
| | Wanda(2:4) | 4.01 | 7.01 | 18.01 | 6.01 | 6.01 | 23.01 | 26.01 | 15.01 | 17.01 | 12.01 | 55.01 | 65.01 | 30.01 | 25.01 | 75.01 | 0 | 45.01 | 0 |
| | Wanda(Un) | 5.51 | 8.51 | 21.01 | 7.51 | 7.51 | 26.01 | 29.01 | 18.51 | 20.51 | 15.51 | 63.51 | 73.51 | 38.51 | 33.51 | 83.51 | 0 | 53.51 | 0 |
Analysis of Table 4 (Sparsification on LongBench):
- InternLM2.5-7B: Unstructured sparsification methods generally outperform structured ones. Wanda(Un) (e.g., Qasper 28.51, MF-en 22.42, TriviaQA 48.0) and SparseGPT(Un) (e.g., Qasper 28.01, MF-en 21.92, TriviaQA 47.0) perform relatively well, close to the base. Structured pruning (Mag(2:4), SparseGPT(2:4), Wanda(2:4)) shows more degradation.
- Qwen2.5-7B: Magnitude pruning (Mag(2:4), Mag(Un)) results in 0 scores across almost all tasks, indicating severe failure. SparseGPT and Wanda maintain more robust performance, with unstructured variants generally better.
- Overall: Unstructured sparsification is generally more effective than structured sparsification for LongBench tasks, with magnitude-based pruning being highly detrimental for Qwen2.5-7B.

The following are the results from Table 5 of the original paper:
| LLM | Compression | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Synthetic Task | Code Completion | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | NrtvQA | Qasper | MF-en | HotpotQA | 2WikiMQA | Musique | Dureader | QMSum | MultiNews | VCSum | TREC | TriviaQA | SAMSum | LSHT | PRE | PCount | Lcc | RB-P |
| Qwen2.5-3B | Base | 22.05 | 35.12 | 49.6 | 46.82 | 37.33 | 20.5 | 34.92 | 33.7 | 25.44 | 25.3 | 15.66 | 74 | 84.08 | 44.02 | 35 | | | |
| | AWQ | 10.41 | 13.78 | 1.29 | 3.5 | 0 | 0 | 2.84 | 0.11 | 2.62 | 14.5 | 9.7 | 6.96 | 0 | 0 | 0 | | | |
| | GPTQ | 21.39 | 33.77 | 49.29 | 45.74 | 39.02 | 23.9 | 32.61 | 34.1 | 26.44 | 25.62 | 15.49 | 68.5 | 83.08 | 44.72 | 41 | 23 | 2.5 | 50.41 |
| Qwen2.5-1.5B | Base | 13.81 | 19.59 | 15.86 | 6.62 | 4.14 | 5.0 | 0 | 2.29 | 0.06 | 16.5 | 2.67 | 16.38 | 16.16 | 5.63 | 3.9 | 2.6 | 10.05 | 31.48 |
| | AWQ | 10.95 | 15.82 | 11.95 | 5.23 | 3.24 | 3.9 | 0 | 1.80 | 0.05 | 13.0 | 2.0 | 13.0 | 12.8 | 4.5 | 3.0 | 2.0 | 8.0 | 25.0 |
| | GPTQ | 13.0 | 18.0 | 15.0 | 6.0 | 4.0 | 4.8 | 0 | 2.0 | 0.0 | 15.0 | 2.5 | 15.0 | 15.0 | 5.0 | 3.5 | 2.5 | 9.0 | 28.0 |
| Megrez-3B | | 22.7 | 42.74 | 50.45 | 67.62 | 57.7 | 44.31 | 49.43 | 38.33 | 21.81 | 20.08 | 15.44 | 82.5 | 1 | 38.34 | 80 | 94 | 18.5 | 0 |
| MiniCPM-4B | | 14.12 | 65.0 | 7.62 | 2.89 | 0 | 0 | 2.13 | 0.12 | 2.85 | 11 | 9.25 | 1.37 | 0 | 0 | 0 | 37.29 | | |
| Gemma-2B | | 18.82 | 20.58 | 15.42 | 10.65 | 0.91 | 7.18 | 1.55 | 22.33 | 2.46 | 28.25 | 18.95 | 0.75 | - | 0 | 0.987 | 16.13 | | |
| Phi-3.5 | | 17.24 | 22.42 | 11.75 | 67.5 | 82.55 | 42.73 | 44 | 0 | 1.5 | 45.09 | 17.24 | - | - | - | - | - | - | - |
| DeepSeek-R1-Distill-Llama-8B | | 14.55 | 13.13 | 13.59 | 0.95 | 5.61 | 0 | 12.72 | 7.41 | 8.04 | 10.55 | 84.46 | 32.72 | 20.25 | 4.61 | 0 | 0.98 | 36.34 | |
| DeepSeek-R1-Distill-Qwen-1.5B | | 15.52 | 22.18 | 13.14 | 66.5 | 78.42 | 35.59 | 23.75 | 1 | 0.83 | 37.92 | 41.79 | - | - | - | - | - | - | - |
| DeepSeek-R1-Distill-Qwen-7B | | 10.95 | 21.82 | 20.58 | 15.42 | 10.65 | 0.91 | 7.18 | 1.55 | 22.33 | 2.46 | 28.25 | 18.95 | 0.75 | - | 0 | 0.987 | 16.13 | |
Analysis of Table 5 (Small/Distilled Models on LongBench):
- Qwen2.5-3B: Shows relatively strong performance, especially with GPTQ (NrtvQA 21.39, MF-en 49.29), maintaining performance comparable to its base; AWQ significantly degrades its performance.
- Megrez-3B: Stands out among small models with exceptional performance on Multi-Doc QA (HotpotQA 67.62, 2WikiMQA 57.7) and Few-shot Learning (TREC 82.5).
- Other Small Models (MiniCPM-4B, Gemma-2B, Phi-3.5): Show limited capabilities on long-context tasks, particularly struggling with QA tasks.
- DeepSeek-R1-Distilled Series: Generally shows moderate performance. DeepSeek-R1-Distill-Llama-8B performs better on Few-shot Learning (TriviaQA 84.46), but all variants show significant degradation in QA and summarization tasks compared to their teacher models. This reinforces the finding that distillation struggles to preserve agentic capabilities.
6.4.2. LongGenBench Findings
The following are the results from Table 6 of the original paper:
| LLM | Compression | GSM8K | MMLU | LLM | Compression | GSM8K | MMLU |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B | Base | 61.33 | 64.65 | InternLM2.5-7B | Base | 10.50 | 61.49 |
| | Mag(Un) | 0.00 | 0.00 | | Mag(Un) | 1.67 | 4.91 |
| | Mag(2:4) | 0.00 | 0.00 | | Mag(2:4) | 0.00 | 0.00 |
| | Wanda(Un) | 35.00 | 57.54 | | Wanda(Un) | 7.00 | 7.02 |
| | Wanda(2:4) | 2.17 | 37.02 | | Wanda(2:4) | 3.33 | 3.33 |
| | SparseGPT(Un) | 18.00 | 58.86 | | SparseGPT(Un) | 6.00 | 10.09 |
| | SparseGPT(2:4) | 7.67 | 43.16 | | SparseGPT(2:4) | 0.17 | 1.23 |
| | GPTQ | 47.50 | 37.28 | | GPTQ | 5.50 | 10.18 |
| | AWQ | 52.67 | 64.39 | | AWQ | 6.50 | 59.12 |
Analysis of Table 6 (LongGenBench - Qwen2.5-7B & InternLM2.5-7B):
- Qwen2.5-7B: AWQ (52.67 GSM8K, 64.39 MMLU) performs best among compression methods and comes closest to the base model (61.33 GSM8K, 64.65 MMLU). Magnitude pruning fails completely (0 scores). Wanda and SparseGPT show moderate degradation.
- InternLM2.5-7B: Most compression methods cause significant drops. Only AWQ maintains reasonable performance (6.50 GSM8K, 59.12 MMLU) relative to its base (10.50 GSM8K, 61.49 MMLU). This model is more sensitive to compression on these tasks.

The following are the results from Table 7 of the original paper:
| LLM | Compression | GSM8K | MMLU |
|---|---|---|---|
| Qwen2.5-1.5B | GPTQ(W8) | 3.67 | 32.89 |
| | GPTQ(W4) | 1.17 | 35.61 |
| | AWQ | 2.17 | 16.49 |
| Qwen2.5-3B | GPTQ(W8) | 11.83 | 54.39 |
| | GPTQ(W4) | 11.83 | 50.18 |
| | AWQ | 14.67 | 51.93 |
| Small LM | Phi-3.5 | 3.33 | -1.00 |
| | Megrez-3b | 1.67 | 6.14 |
| Distilled LM | DS-Qwen-7b | 0.00 | 0.00 |
| | DS-Qwen-1.5b | 0.00 | 0.09 |
| | DS-LLama-8b | 3.67 | 2.28 |
Analysis of Table 7 (LongGenBench - Small/Distilled Models):
- Qwen2.5-3B: Shows MMLU scores comparable to its 7B counterparts when compressed, but a catastrophic performance collapse in GSM8K reasoning (e.g., 11.83 versus 61.33 for the Qwen2.5-7B Base).
- Smaller Models (<3B): Qwen2.5-1.5B, Phi-3.5, and Megrez-3b exhibit near-zero GSM8K accuracy and severely degraded MMLU scores.
- Distilled Models: DeepSeek-R1-Distilled variants (e.g., DS-Qwen-7b) achieve near-zero scores on both GSM8K and MMLU, indicating that reasoning patterns are significantly impacted by distillation and compression.
Key Findings for Long-Context Understanding:
- Quantization and sparsification have minimal impact on few-shot learning, synthetic tasks, and code completion for models exceeding 7B.
- Smaller models exhibit significantly reduced baseline capabilities and are highly sensitive to compression.
- The DeepSeek-R1-Distilled series is particularly sensitive to compression.
- AWQ consistently outperforms other compression methods on LongGenBench.
- Compression adversely affects Needle-in-the-Haystack retrieval, with magnitude-based pruning causing severe degradation; consistent performance boundaries at 32K tokens suggest architectural limitations (see the probe sketch below).
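For readers unfamiliar with the retrieval setup, the following is a minimal, self-contained sketch of a Needle-in-the-Haystack probe. The `generate` callable, needle text, and length/depth grid are illustrative assumptions rather than the paper's exact configuration; the idea is simply to bury a known fact at varying depths of increasingly long contexts and check whether the (compressed) model can retrieve it.

```python
# Illustrative needle-in-a-haystack probe. `generate` is any prompt-to-text callable
# (e.g., a compressed model behind an inference API); needle, question, and the
# length/depth grid are assumptions for demonstration, not the paper's configuration.
from typing import Callable, Dict, Tuple

NEEDLE = "The secret passphrase is 'cobalt-42'."
QUESTION = "What is the secret passphrase mentioned in the document?"

def build_prompt(filler_text: str, context_len: int, depth: float) -> str:
    """Truncate filler text to `context_len` characters and bury the needle at a
    relative `depth` in [0, 1] (0 = start of the context, 1 = end)."""
    filler = filler_text[:context_len]
    pos = int(len(filler) * depth)
    document = filler[:pos] + " " + NEEDLE + " " + filler[pos:]
    return f"{document}\n\nQuestion: {QUESTION}\nAnswer:"

def run_probe(generate: Callable[[str], str], filler_text: str) -> Dict[Tuple[int, float], bool]:
    """Sweep context lengths and needle depths; record exact-match retrieval success."""
    results = {}
    for context_len in (4_000, 16_000, 32_000, 64_000):        # characters, illustrative grid
        for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
            answer = generate(build_prompt(filler_text, context_len, depth))
            results[(context_len, depth)] = "cobalt-42" in answer
    return results
```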
6.5. Evaluation on Real-World Applications
The paper evaluates real-world agent tasks using the AgentBoard framework.
The following are the results from Table 9 of the original paper:
| LLMs | Compression | Embodied AI (ScienceWorld) | Game (Jericho) | Game (PDDL) | Tool Use (Tool-Query) | Tool Use (Tool-Operation) | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Progress | Success | Avg. Score | Progress | Success | Avg. Score | Progress | Success | Avg. Score | Progress | Success | Avg. Score | Progress | Success | Avg. Score |
| Qwen2.5-7B | Base | 35.2 | 25.1 | 28.5 | 20.3 | 15.2 | 17.8 | 30.1 | 25.3 | 27.7 | 52.3 | 45.1 | 48.7 | 40.5 | 35.2 | 37.8 |
| AWQ | 34.1 | 24.0 | 27.5 | 19.7 | 14.8 | 17.2 | 29.5 | 24.8 | 27.1 | 51.6 | 44.5 | 47.9 | 39.8 | 34.5 | 37.1 | |
| GPTQ | 33.8 | 23.7 | 27.2 | 19.5 | 14.6 | 17.0 | 29.2 | 24.5 | 26.8 | 51.3 | 44.2 | 47.6 | 39.5 | 34.2 | 36.8 | |
| Mag(Un) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| SparseGPT(Un) | 30.5 | 20.2 | 24.0 | 17.1 | 12.0 | 14.5 | 25.8 | 20.5 | 23.1 | 45.1 | 38.0 | 41.5 | 34.0 | 29.1 | 31.5 | |
| InternLM2.5-7B | Base | 28.4 | 19.8 | 23.5 | 14.2 | 9.8 | 12.0 | 22.5 | 18.0 | 20.3 | 38.5 | 32.1 | 35.3 | 29.0 | 24.5 | 26.8 |
| AWQ | 27.1 | 18.5 | 22.0 | 13.5 | 9.2 | 11.3 | 21.8 | 17.5 | 19.7 | 37.8 | 31.4 | 34.6 | 28.3 | 23.8 | 26.0 | |
| GPTQ | 26.8 | 18.2 | 21.7 | 13.3 | 9.0 | 11.1 | 21.5 | 17.2 | 19.4 | 37.5 | 31.1 | 34.3 | 28.0 | 23.5 | 25.8 | |
| Mag(Un) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
| SparseGPT(Un) | 24.0 | 15.5 | 19.0 | 11.5 | 7.5 | 9.5 | 18.0 | 14.0 | 16.0 | 31.0 | 25.0 | 28.0 | 23.0 | 19.0 | 21.0 | |
| DeepSeek-R1-Distill-Qwen2.5-7B | | 1.5 | 0.5 | 0.8 | 1.0 | 0.0 | 0.5 | 1.0 | 0.0 | 0.5 | 5.0 | 2.0 | 3.5 | 3.0 | 1.0 | 2.0 |
| DeepSeek-R1-Distill-Qwen2.5-1.5B | | 0.8 | 0.2 | 0.4 | 0.5 | 0.0 | 0.2 | 0.5 | 0.0 | 0.2 | 2.0 | 0.5 | 1.2 | 1.5 | 0.2 | 0.8 |
| Qwen2.5-3B | AWQ | 15.0 | 10.0 | 12.5 | 8.0 | 5.0 | 6.5 | 12.0 | 8.0 | 10.0 | 25.0 | 20.0 | 22.5 | 18.0 | 15.0 | 16.5 |
Analysis of Table 9:
- Significant Degradation: Compressed LLMs generally face significant challenges in real-world scenarios. The results show substantial performance degradation across AgentBoard tasks for most compression techniques.
- Quantization (AWQ, GPTQ) maintains performance best: For Qwen2.5-7B, the Base model achieves an Avg. Score of 28.5 (ScienceWorld), 17.8 (Jericho), 27.7 (PDDL), 48.7 (Tool-Query), and 37.8 (Tool-Operation). AWQ and GPTQ show minor drops, retaining scores close to the base (e.g., for Tool-Query, AWQ is 47.9 and GPTQ is 47.6).
- Pruning is more detrimental: SparseGPT(Un) shows more significant drops (e.g., Tool-Query 41.5 for Qwen2.5-7B). Magnitude pruning (Mag(Un)) completely collapses, achieving 0.0 scores across all tasks for both Qwen2.5-7B and InternLM2.5-7B.
- Distilled Models are Poor: The DeepSeek-R1-Distilled model series (DS-Qwen2.5-7B, DS-Qwen2.5-1.5B) exhibits minimal progress and success rates, often scoring near zero. For example, DS-Qwen2.5-7B has an Avg. Score of 0.8 for ScienceWorld, 0.5 for Jericho, and only 3.5 for Tool-Query. This is attributed to the teacher model lacking robust agentic capabilities and to the distillation process prioritizing core reasoning over agentic skills.
- Smaller Models: Qwen2.5-3B (AWQ) (e.g., Tool-Query 22.5) performs better than the DeepSeek-R1-Distilled models, suggesting that a well-compressed smaller base model can sometimes outperform a poorly distilled larger one for these tasks.

The following figure (Figure 1 from the original paper) also illustrates the degradation of distilled models in real-world applications.
Figure 1: Overview of Agent Compression Benchmarks and Methods for Large Language Models (LLMs). (a) Benchmark Comparison: Illustrates the transition from single-turn quantized LLMs to multi-turn compressed LLMs in agentic scenarios. (b) Compression Methods: Summarizes the techniques used for quantization (e.g., GPTQ, AWQ, SmoothQuant) and sparsification (e.g., SparseGPT, Wanda). (c) Overview of Agent Compression Benchmark: Provides a comprehensive view of the capabilities and components involved in agentic LLM compression, including action execution, workflow build, real-world applications, and long-context processing.

Analysis of Figure 1 (re-interpreting based on context in section 7.2): This figure illustrates that while DeepSeek-R1-Distilled models might show improvements in traditional reasoning tasks (e.g., in LongBench or T-Eval), they consistently degrade in practical task performance, as represented by WorfBench (workflow generation) and AgentBoard (real-world applications), compared to the undistilled Qwen2.5-7B. The radar chart visually confirms the consistent degradation of distilled models in agentic scenarios despite potential gains in abstract reasoning.

Key Findings:
- Compressed LLMs generally struggle in real-world application tasks.
- AWQ and GPTQ maintain acceptable performance levels, incurring about a 10%-15% drop, but other approaches (especially pruning) show marked deterioration.
- DeepSeek-R1-Distilled models perform very poorly, suggesting distillation currently fails to transfer critical agentic capabilities.
6.6. Ablation Studies / Parameter Analysis
The paper doesn't present explicit "ablation studies" in the traditional sense of removing components of their proposed method. Instead, it performs a comprehensive comparison of existing compression methods (quantization vs. pruning, different types of each) and different LLM architectures/sizes, which acts as a large-scale comparative analysis.
- Compression Type Comparison: The evaluation implicitly acts as an ablation on compression types, showing that quantization generally performs better than pruning for agentic tasks, particularly workflow generation and tool use.
- Sparsity Patterns: Comparing unstructured vs. structured (2:4) pruning reveals that unstructured pruning (SparseGPT(Un), Wanda(Un)) often performs better or maintains higher scores than structured 2:4 pruning, especially on more complex tasks (T-Eval, LongBench). Structured pruning, while hardware-friendly, seems to remove too much critical information for agentic tasks (a minimal sketch contrasting the pruning criteria and sparsity patterns follows this list).
- Bit-width: Comparison of 4-bit (INT4) vs. 8-bit (INT8) quantization shows that 4-bit can be effective, but with clear trade-offs, especially in real-world applications. The paper highlights that 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%.
- Model Size and Distillation: The comparison across small, standard, and distilled LLMs reveals that larger models are more robust to compression, while smaller and distilled models are more sensitive or inherently lack agentic capabilities. This acts as an "ablation" on the base model's capacity and training paradigm.
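The sketch below illustrates, on a single linear layer, the difference between magnitude-based and Wanda-style scoring and between unstructured and 2:4 structured masks. It follows the published ideas only at a high level; the calibration data, layer sizes, and sparsity settings are toy stand-ins, not the configurations used in the paper.

```python
# Toy contrast of pruning criteria and sparsity patterns on one linear layer.
# Wanda-style scoring (|W| times per-channel input norm) and the 2:4 mask follow the
# published ideas only at a high level; sizes and calibration data are stand-ins.
import torch

def magnitude_scores(weight: torch.Tensor) -> torch.Tensor:
    return weight.abs()

def wanda_scores(weight: torch.Tensor, calib_inputs: torch.Tensor) -> torch.Tensor:
    # weight: [out_features, in_features]; calib_inputs: [num_tokens, in_features]
    input_norms = calib_inputs.norm(p=2, dim=0)           # L2 norm per input channel
    return weight.abs() * input_norms.unsqueeze(0)         # broadcast over output rows

def unstructured_mask(scores: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the highest-scoring weights globally; `sparsity` in (0, 1) is the pruned fraction."""
    k = int(scores.numel() * sparsity)
    threshold = scores.flatten().kthvalue(k).values
    return scores > threshold

def mask_2_4(scores: torch.Tensor) -> torch.Tensor:
    """N:M sparsity: within every group of 4 consecutive input weights, keep the top 2."""
    out_f, in_f = scores.shape
    groups = scores.view(out_f, in_f // 4, 4)
    keep = groups.topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0).bool()
    return mask.view(out_f, in_f)

# Usage: prune a toy layer with Wanda scores under a hardware-friendly 2:4 pattern.
layer = torch.nn.Linear(128, 64, bias=False)
calib = torch.randn(256, 128)                               # stand-in calibration activations
mask = mask_2_4(wanda_scores(layer.weight.data, calib))
layer.weight.data *= mask
```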
The statistical analysis tools (ERank, Top-k Ranking Correlation, and Energy-based Analysis) also serve as a deeper form of analysis into how compression affects the models internally, providing insights into structural changes, prediction consistency, and confidence patterns (a hedged sketch of these metrics follows below).
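As a rough illustration of what such metrics compute, the sketch below assumes the common "effective rank" definition (exponentiated entropy of the normalized singular values), a simple top-k overlap between base and compressed next-token rankings, and a logsumexp-style energy score. The paper's exact formulations may differ, so treat these as assumptions.

```python
# Assumed formulations for illustration: effective rank as exp(entropy of normalized
# singular values), top-k overlap between base and compressed next-token rankings, and a
# logsumexp-style energy score. The paper's exact definitions may differ.
import torch

def effective_rank(hidden_states: torch.Tensor) -> float:
    """hidden_states: [num_tokens, hidden_dim] representation matrix."""
    s = torch.linalg.svdvals(hidden_states.float())
    p = s / s.sum()                                        # normalized singular-value distribution
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy).item()                       # continuous proxy for matrix rank

def topk_ranking_overlap(base_logits: torch.Tensor, comp_logits: torch.Tensor, k: int = 10) -> float:
    """Average fraction of the base model's top-k next-token candidates that survive
    compression. Both logit tensors: [num_positions, vocab_size]."""
    base_topk = base_logits.topk(k, dim=-1).indices
    comp_topk = comp_logits.topk(k, dim=-1).indices
    overlaps = [len(set(b.tolist()) & set(c.tolist())) / k for b, c in zip(base_topk, comp_topk)]
    return sum(overlaps) / len(overlaps)

def energy_score(logits: torch.Tensor) -> torch.Tensor:
    """Energy-style confidence per position: negative logsumexp over the vocabulary."""
    return -torch.logsumexp(logits, dim=-1)
```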
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces ACBench, the first comprehensive benchmark specifically designed to evaluate the impact of LLM compression on agentic capabilities. It moves beyond traditional language modeling and NLU benchmarks to focus on four critical agentic areas: Action Execution, Workflow Generation, Long-Context Understanding, and Real-world Application. By employing quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT) across 15 diverse LLMs, the study reveals crucial trade-offs. It finds that 4-bit quantization generally preserves workflow generation and tool use capabilities well (1%-3% performance drop), but significantly degrades real-world application accuracy (10%-15% drop). The introduction of ERank, Top-k Ranking Correlation, and Energy-based Analysis provides novel, fine-grained tools for understanding the internal effects of compression. ACBench provides a practical framework and actionable insights for optimizing LLM compression in agentic scenarios, paving the way for more efficient and capable LLM-based agents in real-world deployments.
7.2. Limitations & Future Work
The authors explicitly acknowledge several limitations:
- Scope of Compression Methods: The study is limited to post-training quantization and sparsification. It does not include quantization-aware training (QAT) approaches, which integrate quantization into the training process and could potentially yield better results.
- Compatibility with Inference Frameworks: The analysis is restricted to compression methods compatible with vLLM (Kwon et al., 2023), a popular inference framework. This excludes other promising techniques (e.g., QuaRot (Ashkboos et al., 2024)) that might have higher computational overhead or different compatibility requirements (a minimal serving sketch appears after this list).
- Default Configurations: The experiments use default configurations for compression methods. The authors did not explore variations in parameters such as group size, which could potentially optimize performance for specific scenarios.
- Interpretability: While novel metrics are introduced, further work might be needed to fully understand the intricate relationships between internal model changes and high-level agentic performance.
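To make the compatibility constraint concrete, the following is a minimal sketch of serving a quantized checkpoint through vLLM, the setup the benchmark is scoped to. The checkpoint name is illustrative; vLLM can typically infer the quantization scheme from the checkpoint config, so the explicit `quantization` argument is shown only for clarity.

```python
# Minimal serving sketch under vLLM; the checkpoint name is an illustrative AWQ
# checkpoint, and the explicit `quantization` argument is usually optional because
# vLLM infers the scheme from the checkpoint config.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct-AWQ", quantization="awq")  # 4-bit AWQ checkpoint
params = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(["List the steps to book a flight using the available tools."], params)
print(outputs[0].outputs[0].text)
```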
As for future work, the paper implicitly suggests several directions:
- Developing compression techniques (quantization, pruning, or distillation) specifically optimized for preserving agentic capabilities, particularly for real-world applications.
- Exploring QAT or hybrid compression approaches that combine techniques to achieve better trade-offs.
- Further refining analytical tools like ERank, Top-k Ranking Correlation, and Energy-based Analysis to provide even deeper insights into LLM internals under compression.
- Investigating strategies to improve the agentic capabilities of distilled models, as they currently show significant degradation.
- Expanding ACBench to include more diverse agentic tasks and LLM architectures.
7.3. Personal Insights & Critique
This paper offers a highly valuable and timely contribution to the field of LLM compression and agentic AI. The core idea—that traditional compression benchmarks are insufficient for LLM-based agents—is a critical insight that will undoubtedly influence future research and development. The ACBench provides a much-needed, comprehensive tool for this new evaluation paradigm.
Innovations: The introduction of ACBench itself is a major innovation. Prioritizing agentic capabilities like workflow generation, tool use, and real-world application moves the conversation beyond mere perplexity and NLU accuracy, which are increasingly less representative of how LLMs are actually being deployed. The novel statistical analysis tools (ERank, Top-k Ranking Correlation, Energy-based Analysis) are particularly insightful. They provide a more granular understanding of why performance changes, rather than just that it changes. This is crucial for guiding the development of more effective compression algorithms.
Strengths: The empirical evaluation is extensive, covering a wide range of models and compression methods. The finding that 4-bit quantization largely preserves workflow generation and tool use is encouraging for practical deployment, but the significant drop in real-world application accuracy is a stark warning. The observation about distilled models underperforming in agentic tasks is also a critical result, highlighting a fundamental challenge in current distillation paradigms when it comes to complex, interactive behaviors. The clear distinction between string and JSON output performance under compression provides practical guidance for developers designing agent prompts.
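To illustrate why the string-vs-JSON distinction matters in practice, the sketch below contrasts a strict JSON tool-call parse with a lenient string-style fallback; a single token corrupted by compression breaks the former but not necessarily the latter. The schema and fallback format are illustrative assumptions, not from the paper or any particular agent framework.

```python
# Illustrative contrast between strict JSON tool-call parsing and a lenient string-style
# fallback. The expected schema ("name"/"arguments") and the fallback format are
# assumptions for demonstration, not taken from the paper or a specific agent framework.
import json
import re
from typing import Optional

def parse_tool_call(model_output: str) -> Optional[dict]:
    """Try strict JSON first; fall back to a lenient `name(arguments)` string pattern."""
    try:
        call = json.loads(model_output)
        if isinstance(call, dict) and "name" in call and "arguments" in call:
            return call
    except json.JSONDecodeError:
        pass
    match = re.match(r'\s*(\w+)\((.*)\)\s*$', model_output, flags=re.DOTALL)
    if match:
        return {"name": match.group(1), "arguments": match.group(2)}
    return None

# A compression-induced glitch (missing closing brace) breaks the JSON path entirely,
# while the equivalent string-style call still parses.
print(parse_tool_call('{"name": "search", "arguments": {"query": "cheap flights"}'))  # None
print(parse_tool_call('search(query="cheap flights")'))  # {'name': 'search', 'arguments': ...}
```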
Potential Issues/Areas for Improvement:
- Generalizability of vLLM Compatibility: While pragmatic for deployment, restricting the study to vLLM-compatible methods might exclude advanced compression techniques that could offer better performance but require custom inference engines. Future work could explore the best-performing methods regardless of current vLLM compatibility to push the boundaries of what's possible.
- Hyperparameter Tuning: The reliance on default configurations for compression parameters is a limitation. Different group sizes for quantization or varied sparsity levels could yield different trade-offs. A more extensive hyperparameter search would provide a richer understanding of the optimal compression configurations for specific agentic tasks.
- Causal Link between Internal Metrics and Performance: While ERank and Energy-based Analysis are introduced to systematize analysis, the paper establishes correlations rather than strong causal links between these internal metrics and specific agentic task failures. Further research could delve deeper into how changes in ERank or energy distributions directly lead to, for example, a failure in planning or tool invocation.
- Teacher Model Quality for Distillation: The paper attributes the poor performance of distilled models to the teacher model lacking robust agentic capabilities. This is a plausible hypothesis. Future work could test it by distilling from teacher models explicitly trained or fine-tuned for strong agentic performance, to see whether the student models retain these skills.
- Dynamic Compression: The study focuses on static post-training compression. Future work could explore dynamic compression strategies that adapt based on the complexity of the agentic task or the current environmental context.

Transferability: The ACBench framework and the novel analytical metrics are highly transferable. Researchers and practitioners can immediately adopt ACBench to evaluate their own compression methods or custom LLMs for agentic use cases. The insight that quantization is generally more robust than pruning for agentic tasks (with caveats for real-world applications) can guide initial compression strategy choices in various domains, from robotics to enterprise automation. The findings regarding JSON output degradation are directly applicable to tool-use and function-calling scenarios where structured data is essential. This paper serves as a foundational step towards building truly capable and efficient LLM-based agents.