
Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

Published: 05/26/2025

TL;DR Summary

The paper introduces the Agent Compression Benchmark (ACBench) to evaluate the impact of compression on LLMs' agentic capabilities across 12 tasks and 4 abilities. Results show that 4-bit quantization minimally affects workflow generation and tool use, but degrades real-world application accuracy by about 10%-15%.

Abstract

Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks only focus on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring the agentic capabilities - workflow, tool use/function call, long-context understanding and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code can be found in https://github.com/pprp/ACBench.

In-depth Reading

1. Bibliographic Information

1.1. Title

Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

1.2. Authors

Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, Bo Li

1.3. Journal/Conference

The paper is published as a preprint on arXiv, indicated by the arxiv.org link. As of the publication date, it is likely undergoing peer review for a conference or journal. arXiv is a well-respected open-access repository for preprints of scientific papers, particularly in mathematics, physics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.

1.4. Publication Year

2025

1.5. Abstract

This paper addresses a gap in the evaluation of compressed Large Language Models (LLMs), which traditionally focuses on language modeling (e.g., perplexity) and natural language understanding (NLU) tasks (e.g., GLUE accuracy). The authors argue that these benchmarks overlook "agentic capabilities" crucial for real-world applications, such as workflow generation, tool use, long-context understanding, and practical application. To fill this gap, they introduce the Agent Compression Benchmark (ACBench), a comprehensive suite designed to evaluate how compression affects these agentic abilities.

ACBench incorporates 12 tasks across 4 key agentic capabilities, evaluates two major compression techniques—quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT)—and tests 15 diverse LLMs, ranging from small to standard and distilled reasoning models. The empirical evaluation reveals significant trade-offs: 4-bit quantization effectively preserves workflow generation and tool use (with only a 1%-3% performance drop) but substantially degrades real-world application accuracy (by 10%-15%). To systematize the analysis, the paper introduces three novel metrics: ERank (Efficient Rank), Top-k Ranking Correlation, and Energy-based Analysis. ACBench aims to provide actionable insights for optimizing LLM compression strategies specifically for agentic scenarios.

https://arxiv.org/abs/2505.19433v2 (Preprint) PDF Link: https://arxiv.org/pdf/2505.19433v2.pdf

2. Executive Summary

2.1. Background & Motivation

The proliferation of Large Language Models (LLMs) (models with a vast number of parameters, typically billions, trained on massive text datasets) has revolutionized many domains, from code generation to scientific research and multi-agent collaboration. However, their practical deployment is severely hampered by prohibitive computational and memory costs. This issue necessitates post-training compression techniques (methods applied after a model has been fully trained) like quantization (reducing the precision of numerical representations) and pruning (removing redundant parameters).

Current LLM compression benchmarks primarily focus on single-turn language modeling (e.g., measuring perplexity, how well a model predicts the next word) and Natural Language Understanding (NLU) tasks (e.g., GLUE accuracy, a common benchmark for various NLU tasks). The core problem the paper identifies is that these benchmarks are insufficient. They do not adequately assess the agentic capabilities of LLMs, which are multi-step, interactive, and often real-world oriented.

Why is this problem important? Real-world applications demand capabilities beyond static benchmarks. For instance, robotic control or financial analytics require LLMs to perform multi-step planning, maintain long-context coherence (understanding extended conversations or documents), adapt reasoning across conversational turns, and seamlessly integrate with external tools. The current compression evaluations overlook how these critical agentic capabilities are affected, leading to a critical oversight in guiding deployment strategies for interactive AI agents.

The paper's entry point or innovative idea is to introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark specifically designed to evaluate the impact of compression on LLMs' agentic abilities. This directly addresses the existing gap by moving beyond traditional NLU and language modeling metrics to focus on real-world interactive scenarios.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  1. Introduction of ACBench: They propose ACBench, the first comprehensive benchmark specifically tailored to evaluate the impact of LLM compression on agentic capabilities. This benchmark covers:
    • Four core agentic capabilities: Action Execution (tool use/function call), Workflow Generation, Long-Context Understanding, and Real-world Application.
    • 12 diverse tasks across these capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval).
    • Two mainstream compression methods: Quantization (GPTQ, AWQ) and Pruning (Wanda, SparseGPT).
    • Evaluation across 15 diverse LLMs: Including small models (Gemma-2B), standard models (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill).
  2. Novel Analytical Tools: To systematize the analysis of compression effects, they introduce three new statistical metrics:
    • ERank (Efficient Rank): Measures the effective dimensionality of weight matrices, indicating structural changes due to compression.
    • Top-k Ranking Correlation: Quantifies the consistency of top-k token predictions between compressed and uncompressed models.
    • Energy-based Analysis: Evaluates shifts in logit energy distributions, reflecting changes in model confidence.
  3. Empirical Findings on Compression Trade-offs: Their experiments reveal crucial insights into how compression impacts agentic capabilities:
    • Quantization vs. Pruning: Quantization methods (GPTQ, AWQ) generally preserve workflow generation and tool use capabilities more effectively (with only a 1%-3% drop in performance) compared to pruning methods.

    • Real-world Application Degradation: Despite preserving some capabilities, 4-bit quantization leads to a significant performance degradation (10%-15% accuracy drop) in real-world application tasks.

    • Distillation Limitations: Distilled reasoning models (e.g., DeepSeek-R1-Distill) often show performance degradation in agentic scenarios, suggesting current distillation techniques may not effectively transfer complex agentic skills.

    • Model Size Sensitivity: Smaller models are more sensitive to compression, particularly in complex reasoning tasks (e.g., Qwen2.5-3B's catastrophic performance collapse in GSM8K reasoning post-compression), while larger models exhibit more resilience.

      The key conclusions reached are that while compression is vital for deployment, its impact on agentic capabilities is nuanced and task-dependent. Quantization appears more favorable for certain agentic tasks than pruning, but real-world interactive scenarios remain challenging. The findings highlight the need for tailored compression strategies and improved distillation techniques that prioritize agentic behaviors. ACBench provides a practical tool and framework for guiding future research in this area.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should grasp the following foundational concepts:

  • Large Language Models (LLMs): These are advanced Artificial Intelligence models designed to understand, generate, and process human language. They are typically based on the transformer architecture and trained on massive amounts of text data, allowing them to learn complex patterns and relationships within language. Their "largeness" refers to the huge number of parameters (weights and biases) they contain, often billions, which enables their sophisticated abilities but also leads to high computational and memory costs.
  • LLM Compression: The process of reducing the size and computational requirements of an LLM while minimizing performance degradation. This is crucial for deploying LLMs on resource-constrained devices (e.g., mobile phones, edge devices) or for reducing inference costs in data centers. The paper focuses on two main types:
    • Quantization: This technique reduces the precision of the numerical representations (e.g., weights, activations) in the neural network. Instead of using high-precision floating-point numbers (e.g., 32-bit or 16-bit floats), quantization maps these values to lower-bit integer representations (e.g., 8-bit, 4-bit, or even 2-bit integers). This significantly shrinks the model size and speeds up computations because integer operations are faster and consume less memory.
    • Pruning: This involves removing redundant or less important parameters (weights or connections) from the neural network. The idea is that not all connections or weights contribute equally to the model's performance; some can be removed without significant impact. Pruning can lead to sparser models, meaning many weights are zero, which reduces memory footprint and can accelerate inference if specialized hardware or software supports sparse computations.
  • Agentic Capabilities / LLM-based Agents: An LLM-based agent is an AI system that uses an LLM as its core reasoning engine to perform complex, autonomous tasks by interacting with an environment. Agentic capabilities refer to the specific skills that enable such an agent to function effectively. These go beyond simple text generation or understanding and include:
    • Workflow Generation: The ability to break down a complex goal into a sequence of actionable steps or a structured plan. This often involves understanding dependencies between sub-tasks.
    • Tool Use/Function Call: The ability to identify when external tools (e.g., calculators, web search APIs, code interpreters) are needed and to correctly invoke them (function call) with appropriate inputs, and then interpret their outputs.
    • Long-Context Understanding: The ability to process, reason over, and maintain coherence across very long input sequences (e.g., tens of thousands of tokens) and retain relevant information over extended interactions or documents.
    • Real-world Application: The ability to operate successfully in practical, often interactive, deployment scenarios (e.g., embodied AI, gaming, e-commerce) that combine multiple agentic skills.
  • Perplexity (PPL): A common metric in language modeling that measures how well a probability model predicts a sample. Lower perplexity generally indicates a better model, as it means the model is more confident and accurate in predicting the next token in a sequence.
  • GLUE Accuracy: General Language Understanding Evaluation (GLUE) is a benchmark for evaluating the performance of NLU models across a diverse set of nine tasks (e.g., sentiment analysis, textual entailment, question answering). GLUE accuracy refers to the aggregate performance metric on these tasks.
  • Logits: In a neural network, logits are the raw, unnormalized outputs from the final layer before the softmax activation function is applied. For classification tasks, these logits represent the model's confidence scores for each possible class. Higher logits for a particular class mean the model is more confident about that class.
  • Softmax Function: A function that takes a vector of arbitrary real-valued logits and normalizes them into a probability distribution, where each value is between 0 and 1 and all values sum to 1: $ p_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} $, where $z_i$ is the logit for class $i$, and $K$ is the total number of classes.
  • Jaccard Similarity: A statistic used for gauging the similarity and diversity of sample sets. It is defined as the size of the intersection divided by the size of the union of the sample sets. For two sets $A$ and $B$, the Jaccard similarity is $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$.
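
To make these two definitions concrete, here is a minimal NumPy sketch (illustrative, not from the paper) of the softmax normalization and the Jaccard similarity defined above:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Turn raw logits z into a probability distribution."""
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    return len(a & b) / len(a | b)

print(softmax(np.array([2.0, 1.0, 0.1])))                   # ~[0.66, 0.24, 0.10]
print(jaccard({"the", "a", "an"}, {"the", "an", "this"}))   # 2 shared / 4 total = 0.5
```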

3.2. Previous Works

The paper contextualizes its work by citing existing research in LLM compression, LLM-based agents, and specific agentic capabilities.

3.2.1. LLM Compression

  • Quantization: The paper references GPTQ (Frantar et al., 2022) and AWQ (Lin et al., 2023) as mainstream quantization methods. GPTQ is known for its accurate post-training quantization by optimizing parameters layer by layer, while AWQ focuses on outlier-aware quantization. SmoothQuant (Xiao et al., 2023) is also mentioned, which aims to smooth activations for better quantization. Other general works on quantization (Park et al., 2024; Frantar & Alistarh, 2022; Lee et al., 2024; Du et al., 2024a; Kim et al., 2023; Dong et al., 2024c; Gu et al., 2025; Du et al., 2024b; Li et al., 2024f) are also cited, underscoring the active research in this area.
  • Pruning: SparseGPT (Frantar & Alistarh, 2023a) and Wanda (Sun et al., 2024b) are highlighted as key pruning techniques. SparseGPT is noted for its ability to accurately prune massive language models in a single shot. Wanda is a simple and effective pruning approach. The paper also mentions general pruning research (Frantar & Alistarh, 2023b; Sun et al., 2024c; Shao et al., 2024; Zhang et al., 2024d; Dong et al., 2024b; Tang et al., 2020; Dong et al., 2024a; Lai et al., 2025), including unstructured, structured, and dynamic pruning methods.
  • Knowledge Distillation (KD): The concept of training a smaller student model to mimic a larger teacher model (Hinton et al., 2015b) is crucial for creating efficient, smaller models. KD-zero (Li et al., 2023b) is an example of such a technique.
  • Low-Rank Decomposition: Techniques like SVD-LLM (Wang et al., 2024e; 2025b) and ASVD (Yuan et al., 2023) are used to approximate weight matrices with lower-rank versions, reducing parameters and computation.

3.2.2. LLM-based Agents

The paper references several frameworks and studies on LLM-based agents, indicating their growing importance: MetaGPT (Hong et al., 2023), AutoGen (Wu et al., 2023), AgentVerse (Chen et al., 2023b), and MegaAgent (Wang et al., 2024b). These works demonstrate that LLMs can act as reasoning engines for autonomous systems, combining LLMs with planning, tool use, and memory to achieve goals.

3.2.3. Agentic Capabilities Benchmarks

The paper points out that existing compression evaluations often neglect agentic capabilities, citing works (Li et al., 2024e; Yang et al., 2024a; Gong et al., 2024; Wang et al., 2024a;b) that focus on single-turn NLU or perplexity. This highlights the gap ACBench aims to fill. For agentic tasks, T-Eval (Chen et al., 2023c) is mentioned for tool use, WorfBench (Qiao et al., 2024) for workflow generation, and AgentBoard (Ma et al., 2024) for real-world applications. For long-context understanding, LongBench (Bai et al., 2024), LongGenBench (Liu et al., 2024a), and Needle-in-the-Haystack (Kamradt, 2023) are relevant prior benchmarks.

3.3. Technological Evolution

The evolution of LLM technology has progressed from raw language modeling to sophisticated reasoning and agentic behaviors. Initially, the focus was on building larger models that could achieve better perplexity and NLU performance. This led to breakthroughs like the transformer architecture (Vaswani et al., 2017) and models like GPT-3 (Brown et al., 2020).

As models grew in size and capability, the computational and memory demands became a bottleneck. This spurred intense research into compression techniques (quantization, pruning, distillation, low-rank decomposition) to make these powerful models deployable. Early compression efforts aimed to preserve perplexity and NLU accuracy, as these were the primary metrics.

More recently, the concept of LLM-based agents emerged, where LLMs are not just static text processors but active decision-makers capable of planning, tool use, and interacting with environments. This shift highlighted that traditional compression benchmarks were insufficient; a model might perform well on NLU after compression but fail catastrophically when asked to use a tool or follow a complex workflow.

This paper's work fits squarely within this latest stage of evolution, recognizing that compression must now be evaluated not just on intrinsic language properties but on extrinsic, interactive agentic capabilities. It's a critical step towards enabling the ubiquitous deployment of intelligent agents.

3.4. Differentiation Analysis

Compared to prior work, the core differences and innovations of this paper's approach are:

  1. Agentic Focus in Compression Benchmarking: Previous compression benchmarks (e.g., those evaluated in Gong et al., 2024; Yang et al., 2024a) primarily assessed perplexity or NLU accuracy. This paper is the first to introduce a comprehensive benchmark, ACBench, specifically designed to evaluate how compression impacts agentic capabilities like workflow generation, tool use, long-context understanding, and real-world application. This shifts the evaluation paradigm to better reflect real-world deployment needs for intelligent agents.

  2. Comprehensive Coverage: ACBench integrates a broader and more diverse set of agentic tasks (12 tasks across 4 capabilities) than typically found in compression studies. It also systematically evaluates both major compression techniques (quantization and pruning) across a wide range of model sizes (15 models), including specialized distilled reasoning models.

  3. Novel Analytical Tools: The introduction of ERank, Top-k Ranking Correlation, and Energy-based Analysis provides novel, fine-grained tools for understanding how compression affects the internal workings and decision-making processes of LLMs, beyond just black-box performance metrics. These metrics offer insights into the structural changes, prediction consistency, and confidence patterns induced by compression.

  4. Practical Insights and Trade-offs: The paper provides actionable insights into the trade-offs of compression for agentic scenarios. For example, showing that 4-bit quantization preserves workflow generation and tool use reasonably well, but significantly degrades real-world application accuracy. This level of granular insight into agentic performance under compression is novel and crucial for practitioners.

    In essence, while others have compressed LLMs and evaluated them on standard tasks, and others have evaluated agentic capabilities, this paper uniquely combines both by asking: "Can compressed LLMs truly act?" and provides the first dedicated benchmark and analytical framework to answer this question.

4. Methodology

The paper's methodology centers on evaluating the impact of post-training compression on LLM agentic capabilities using a newly proposed benchmark, ACBench, and novel statistical analysis tools.

4.1. Principles

The core idea is to systematically assess whether compressed LLMs retain the complex, multi-step, and interactive abilities required for agentic applications, which go beyond traditional language modeling and NLU tasks. The theoretical basis is that compression, while reducing model size, might disproportionately affect the intricate reasoning and planning mechanisms crucial for agentic behaviors, which might not be captured by simpler metrics like perplexity. The intuition is that agentic tasks demand robust internal representations and decision-making processes that could be more fragile to information loss from compression.

The methodology involves:

  1. Applying diverse compression techniques: Quantization and pruning are applied to a range of LLMs.
  2. Evaluating compressed models on a custom benchmark (ACBench): This benchmark specifically targets four categories of agentic capabilities.
  3. Analyzing the impact using novel statistical metrics: These metrics (ERank, Top-K Ranking Consistency, Energy-based Analysis) provide deeper insights into how compression alters the model's internal representations and output confidence.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Compression Methods

The paper focuses on two primary post-training compression methods: Quantization and Weight Pruning.

4.2.1.1. Quantization

Quantization reduces the memory and computational demands of neural networks by mapping high-precision values (e.g., 16-bit floating-point numbers) to lower-bit integer representations. The process typically involves an affine transformation to scale and shift the floating-point values into an integer range.

The affine transformation is given by: $ \mathbf{X}_{\mathrm{INT}} = \mathrm{round}\left(\frac{\mathbf{X}_{\mathrm{FP16}} - Z}{S}\right) \quad (1) $ and the scaling factor $S$ is calculated as: $ S = \frac{\max(\mathbf{X}_{\mathrm{FP16}}) - \min(\mathbf{X}_{\mathrm{FP16}})}{2^N - 1} \quad (2) $ Where:

  • $\mathbf{X}_{\mathrm{INT}}$: The resulting lower-bit integer representation of the tensor.

  • $\mathbf{X}_{\mathrm{FP16}}$: The original full-precision (e.g., 16-bit floating-point) tensor.

  • $\mathrm{round}(\cdot)$: A rounding function that converts the scaled floating-point value to the nearest integer.

  • $Z$: The zero-point, a value that aligns the integer zero with the floating-point zero point. It helps in representing asymmetric ranges.

  • $S$: The scaling factor, which determines the range mapping between floating-point and integer values.

  • $\max(\mathbf{X}_{\mathrm{FP16}})$: The maximum value in the original floating-point tensor.

  • $\min(\mathbf{X}_{\mathrm{FP16}})$: The minimum value in the original floating-point tensor.

  • $N$: The target integer bit-width (e.g., 4 bits, 8 bits).

    The paper primarily focuses on GPTQ (Frantar et al., 2022) and AWQ (Lin et al., 2023) for quantization.

  • GPTQ (Generative Pretrained Transformer Quantization): A highly accurate post-training quantization method that quantizes LLMs to 4-bit precision without any retraining. It works by quantizing weights block-wise while minimizing the squared error introduced by quantization.

  • AWQ (Activation-aware Weight Quantization): A quantization method that focuses on protecting salient weights (outliers) from quantization error, as these often have a disproportionately large impact on model performance. It scales weights based on activation statistics.

  • SmoothQuant: (Mentioned in the Appendix as also used, though the abstract only lists GPTQ and AWQ). This method smooths out the activation outliers in LLMs to make them easier to quantize accurately.
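
As a concrete illustration of the affine quantization in Eqs. (1)-(2), the following PyTorch sketch quantizes a tensor to N-bit integers and dequantizes it to inspect the rounding error. The choice of zero-point $Z = \min(\mathbf{X})$ is an assumption made for illustration (the paper does not specify how $Z$ is set), and this toy round-trip is not GPTQ or AWQ themselves, which additionally minimize layer-wise reconstruction error and protect salient weights.

```python
import torch

def affine_quantize(x: torch.Tensor, n_bits: int = 4):
    """Min-max affine quantization following Eqs. (1)-(2).

    The zero-point Z is taken as min(x) here (an illustrative choice)."""
    q_levels = 2 ** n_bits - 1
    s = (x.max() - x.min()) / q_levels                    # scaling factor S, Eq. (2)
    z = x.min()                                           # zero-point Z (assumption)
    x_int = torch.round((x - z) / s).clamp(0, q_levels)   # Eq. (1)
    x_dequant = x_int * s + z                             # inverse map, to inspect rounding error
    return x_int.to(torch.uint8), x_dequant

w = torch.randn(16, 16)                                   # stand-in for an FP16 weight block
w_int4, w_hat = affine_quantize(w, n_bits=4)
print("max reconstruction error:", (w - w_hat).abs().max().item())
```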

4.2.1.2. Weight Pruning

Weight Pruning removes redundant parameters to create sparse weight matrices, which reduces model size and can improve inference latency.

The process of unstructured sparsity is achieved via element-wise masking: $ \tilde{\mathbf{W}} = \mathbf{W}\odot \mathbf{M},\quad \mathbf{M}_{ij} = \begin{cases} 1 & \mathrm{if}\ |\mathbf{W}_{ij}| > \tau \\ 0 & \mathrm{otherwise} \end{cases} \quad (3) $ Where:

  • $\tilde{\mathbf{W}}$: The pruned (sparse) weight matrix.

  • $\mathbf{W}$: The original full weight matrix.

  • $\mathbf{M}$: A binary mask matrix, where $\mathbf{M}_{ij}$ is either 1 (keep the weight) or 0 (prune the weight).

  • $\odot$: The element-wise (Hadamard) product.

  • $|\mathbf{W}_{ij}|$: The absolute value of an individual weight element.

  • $\tau$: A predefined threshold. Weights with absolute values below this threshold are pruned.

    The paper mentions two specific pruning methods:

  • SparseGPT: (Frantar & Alistarh, 2023a) A one-shot pruning method that can achieve high sparsity (e.g., 50-60%) in LLMs post-training without fine-tuning, while maintaining accuracy. It uses a greedy approach to find a subset of weights to prune.

  • Wanda (Pruning by Weights and Activations): (Sun et al., 2024b) A simple and effective pruning method for LLMs that prunes weights based on their magnitude scaled by the corresponding activation norm. It can be applied in unstructured settings (individual weights) or semi-structured settings (e.g., 2:4 sparsity, meaning that for every 4 weights, 2 are kept).
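
The sketch below illustrates the masking of Eq. (3) with plain magnitude pruning, in both unstructured and 2:4 semi-structured form. It is a simplified stand-in, not the Wanda or SparseGPT criterion, which additionally use activation statistics or second-order information.

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Unstructured pruning via Eq. (3): zero out weights whose magnitude
    falls below the threshold tau chosen to reach the target sparsity."""
    tau = torch.quantile(w.abs().flatten(), sparsity)
    mask = (w.abs() > tau).float()        # binary mask M
    return w * mask                       # element-wise (Hadamard) product

def prune_2_4(w: torch.Tensor) -> torch.Tensor:
    """2:4 semi-structured pruning: in every group of 4 consecutive weights
    along the last dimension, keep the 2 with the largest magnitude."""
    rows, cols = w.shape
    groups = w.reshape(rows, cols // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
print("unstructured sparsity:", (magnitude_prune(w, 0.5) == 0).float().mean().item())
print("2:4 sparsity:", (prune_2_4(w) == 0).float().mean().item())
```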

4.2.2. Agentic Taxonomy & ACBench Structure

The paper classifies agentic capabilities into four main categories, which form the basis of ACBench:

  1. Action Execution:

    • Focus: Function call and tool use. Function call means the agent uses predefined internal functions, while tool use means it leverages external tools/APIs.
    • Sub-capabilities: Plan, Reason, Retrieve, Understand, Instruct, and Review.
    • Benchmark: T-Eval (Chen et al., 2023c).
  2. Workflow Generation:

    • Focus: Breaking down complex tasks into executable sequences of steps. This includes single-task workflows (direct solution path) and multi-step workflows (intermediate planning, coordination).
    • Task Types: Function Call, Embodied, Problem-Solving, and Open-Grounded tasks.
    • Benchmark: WorfBench (Qiao et al., 2024).
  3. Long-Context Understanding:

    • Focus: Processing and understanding information over very long input sequences, crucial for multi-step workflows.
    • Task Types: General long-context tasks (single/multi-doc QA, summarization, code) and more challenging ones like LongGenBench (Liu et al., 2024a) and Needle-in-the-Haystack (Kamradt, 2023).
    • Benchmarks: LongBench (Bai et al., 2024), LongGenBench, Needle-in-the-Haystack.
  4. Real-world Application:

    • Focus: Operating in practical deployment scenarios (e.g., e-commerce, robotics, scientific experimentation). Requires coordinating multiple capabilities (tool use, planning, environmental interaction).

    • Task Types: Embodied AI, Game, Tool Use, Tool-Query, Tool-Operation.

    • Benchmark: AgentBoard (Ma et al., 2024).

      The paper's overall ACBench framework is visually summarized in Figure 1 (from the paper), which illustrates the different capabilities and components.

      Figure 1: Overview of Agent Compression Benchmarks and Methods for Large Language Models (LLMs). (a) Benchmark Comparison: Illustrates the transition from single-turn quantized LLMs to multi-turn compressed LLMs in agentic scenarios. (b) Compression Methods: Summarizes the techniques used for quantization (e.g., GPTQ, AWQ, SmoothQuant) and sparsification (e.g., SparseGPT, Wanda). (c) Overview of Agent Compression Benchmark: Provides a comprehensive view of the capabilities and components involved in agentic LLM compression, including action execution, workflow build, real-world applications, and long-context processing.

4.2.3. Statistical Analysis Metrics

To investigate the influences of compression on LLMs and how it affects them, the paper employs three novel statistical analysis metrics:

4.2.3.1. Efficient Rank (ERank)

Efficient Rank quantifies the effective dimensionality of a matrix, providing a measure of its structural complexity and the distribution of its singular values. A decrease in ERank suggests a loss of information and structural complexity in the weight matrices due to compression.

The Efficient Rank of a non-zero matrix $\mathbf{A} \in \mathbb{R}^{d \times N}$ is defined as: $ \mathtt{eRank}(\mathbf{A}) = \exp \left(-\sum_{i = 1}^{Q}\frac{\sigma_{i}}{\sum_{j = 1}^{Q}\sigma_{j}}\log \left(\frac{\sigma_{i}}{\sum_{j = 1}^{Q}\sigma_{j}}\right)\right) \quad (4) $ Where:

  • $\mathtt{eRank}(\mathbf{A})$: The effective rank of matrix $\mathbf{A}$.

  • $Q = \min\{N, d\}$: The maximum possible rank of the matrix $\mathbf{A}$, which is the smaller of its dimensions.

  • $\sigma_{1}, \sigma_{2}, \ldots, \sigma_{Q}$: The singular values of the matrix $\mathbf{A}$, typically obtained from Singular Value Decomposition (SVD), sorted in descending order.

  • $\frac{\sigma_{i}}{\sum_{j = 1}^{Q}\sigma_{j}}$: The normalized contribution of each singular value to the total "energy" or variance of the matrix. These terms form a probability distribution.

  • $\log(\cdot)$: The natural logarithm.

  • $\exp(\cdot)$: The exponential function, used to convert the result back from a logarithmic scale.

    This formula is the exponential of the Shannon entropy of the normalized singular values. A higher ERank indicates a more uniform distribution of singular values and thus higher effective dimensionality, implying richer information content.
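
A compact PyTorch implementation of Eq. (4), assuming the natural logarithm as in the formula above, might look like this:

```python
import torch

def erank(a: torch.Tensor) -> float:
    """Effective rank of Eq. (4): exponential of the Shannon entropy of the
    normalized singular-value distribution of a (weight) matrix."""
    sigma = torch.linalg.svdvals(a.float())   # singular values, descending order
    p = sigma / sigma.sum()                   # normalized spectral distribution
    p = p[p > 0]                              # guard against log(0)
    return torch.exp(-(p * torch.log(p)).sum()).item()

w_full = torch.randn(512, 512)
w_low = torch.randn(512, 8) @ torch.randn(8, 512)   # an (almost) rank-8 matrix
print(erank(w_full), erank(w_low))   # the low-rank matrix yields a much smaller ERank
```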

4.2.3.2. Top-K Ranking Consistency

Top-K Ranking Consistency measures how consistently the top-k predicted tokens (most probable next tokens) of a compressed LLM align with those of the original (uncompressed) LLM. This is crucial because the top few tokens often dictate the model's confidence and the direction of text generation.

For a given input, let $\mathcal{T}_k^{(o)}$ be the set of top-$k$ tokens predicted by the original model, and $\mathcal{T}_k^{(c)}$ be the set of top-$k$ tokens predicted by the compressed model. The ranking consistency is measured using the Jaccard similarity: $ J_{k} = \frac{|\mathcal{T}_{k}^{(o)} \cap \mathcal{T}_{k}^{(c)}|}{|\mathcal{T}_{k}^{(o)} \cup \mathcal{T}_{k}^{(c)}|} \quad (5) $ Where:

  • $J_k$: The Jaccard similarity for the top-$k$ tokens.

  • $\mathcal{T}_k^{(o)}$: The set of top-$k$ tokens from the original model's logits (after softmax).

  • $\mathcal{T}_k^{(c)}$: The set of top-$k$ tokens from the compressed model's logits.

  • $|\cdot|$: Denotes the cardinality (number of elements) of a set.

  • $\cap$: Set intersection.

  • $\cup$: Set union.

    A higher $J_k$ value (closer to 1) indicates greater consistency in the top-$k$ predictions between the compressed and original models, implying better preservation of the original model's prediction confidence.
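
A minimal sketch of Eq. (5) for a single decoding step, treating the logits of the original and compressed models as given (the perturbed "compressed" logits below are purely illustrative):

```python
import torch

def topk_jaccard(logits_orig: torch.Tensor, logits_comp: torch.Tensor, k: int = 5) -> float:
    """Top-k ranking consistency of Eq. (5): Jaccard similarity between the
    top-k token sets of the original and compressed models at one step."""
    top_o = set(logits_orig.topk(k).indices.tolist())
    top_c = set(logits_comp.topk(k).indices.tolist())
    return len(top_o & top_c) / len(top_o | top_c)

vocab_size = 32000
logits_fp16 = torch.randn(vocab_size)                       # original model's next-token logits
logits_int4 = logits_fp16 + 0.1 * torch.randn(vocab_size)   # illustrative "compressed" logits
print("J_5 =", topk_jaccard(logits_fp16, logits_int4, k=5))
```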

4.2.3.3. Energy-based Analysis

Inspired by Out-of-Distribution (OOD) detection techniques, Energy-based Analysis evaluates distributional shifts in logit energies between compressed and uncompressed models. It provides insights into how compression affects the model's overall confidence and calibratedness.

Let $f(\mathbf{x}): \mathbb{R}^{D} \rightarrow \mathbb{R}^{K}$ be a discriminative neural classifier that maps an input $\mathbf{x} \in \mathbb{R}^{D}$ to $K$ logits. For both the original model $f^{(o)}$ and compressed model $f^{(c)}$, the categorical distribution (probabilities) is derived using the softmax function: $ p(y|\mathbf{x}) = \frac{e^{f_{y}(\mathbf{x}) / T}}{\sum_{i = 1}^{K}e^{f_{i}(\mathbf{x}) / T}} \quad (6) $ Where:

  • $p(y|\mathbf{x})$: The probability of class $y$ given input $\mathbf{x}$.

  • $f_y(\mathbf{x})$: The logit corresponding to class label $y$ for input $\mathbf{x}$.

  • $T$: The temperature parameter, a positive scalar that controls the softness of the softmax probabilities. A higher $T$ makes probabilities softer (more uniform), while a lower $T$ makes them sharper (more confident in the highest logit).

    The energy function $E(\mathbf{x}; f)$ for a given input $\mathbf{x}$ is defined as: $ E(\mathbf{x};f) = -T\cdot \log \sum_{i = 1}^{K}e^{f_{i}(\mathbf{x}) / T} \quad (7) $ Where:

  • $E(\mathbf{x}; f)$: The energy score for input $\mathbf{x}$ using model $f$.

  • $-T\cdot \log \sum_{i = 1}^{K}e^{f_{i}(\mathbf{x}) / T}$: The negative logarithm of the sum of exponentiated logits, scaled by the temperature. It is inversely related to the softmax probability of the most likely class, providing a measure of confidence. Lower (more negative) energy generally indicates higher confidence.

    The analysis compares the energy distributions between original and compressed models using $\Delta_{E} = |E(\mathbf{x}; f^{(o)}) - E(\mathbf{x}; f^{(c)})|$. A lower $\Delta_{E}$ indicates better preservation of model confidence patterns, suggesting that the compressed model's confidence aligns well with the original.
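
Eq. (7) is a temperature-scaled negative log-sum-exp of the logits, so it can be computed directly. The sketch below also forms the per-token $\Delta_E$ used in the comparison; the perturbed logits stand in for a compressed model and are purely illustrative.

```python
import torch

def energy(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy score of Eq. (7): E(x; f) = -T * log sum_i exp(f_i(x) / T).
    Lower (more negative) energy corresponds to higher model confidence."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

vocab_size = 32000
logits_orig = torch.randn(4, vocab_size)                      # logits at 4 decoding positions
logits_comp = logits_orig + 0.2 * torch.randn(4, vocab_size)  # illustrative compressed-model logits
delta_e = (energy(logits_orig) - energy(logits_comp)).abs()   # per-position Delta_E
print("|Delta E| per position:", delta_e.tolist())
```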

5. Experimental Setup

5.1. Datasets

The ACBench comprehensively evaluates LLMs across various datasets tailored to assess the four core agentic capabilities.

5.1.1. Action Execution (Tool Use)

  • Benchmark: T-Eval (Chen et al., 2023c)
  • Characteristics: This benchmark assesses six core competencies: planning, reasoning, retrieval, understanding, instruction following, and reviewing. It features 15 carefully selected tools from diverse domains (Research, Travel, Entertainment, Web, Life, Financials) with high availability and complete documentation. The dataset comprises 553 high-quality query-solution annotation pairs, resulting in 23,305 test cases across its sub-capabilities. It emphasizes comprehensive and fine-grained evaluation of tool utilization, with an average of 5.8 tool calling steps per query.

5.1.2. Workflow Generation

  • Benchmark: WorfBench (Qiao et al., 2024)
  • Characteristics: WorfBench is a comprehensive benchmark employing a graph-based workflow representation to assess both workflow generation capabilities and execution performance in multi-turn conversational settings. It includes 18,000 training samples, 2,146 test samples, and 723 held-out tasks. The evaluation spans four distinct categories: function calling, embodied interaction, problem-solving, and open-grounded scenarios. It features multi-faceted scenarios and complex graph-structured workflows, addressing limitations of existing benchmarks that focus only on linear workflows.
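
Since WorfBench reports precision, recall, and F1 over generated workflows (cf. Table 10 below), the following simplified sketch computes set-based node-level P/R/F1 for a predicted workflow against a gold workflow. WorfBench's actual evaluation additionally matches graph structure, so treat this only as an illustration; the task names and sub-steps are hypothetical.

```python
def workflow_node_prf(predicted: list, gold: list):
    """Set-based precision/recall/F1 over workflow nodes (a simplification of
    WorfBench's graph-based matching, for illustration only)."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["search flights", "compare prices", "select flight", "enter passenger info", "pay"]
pred = ["search flights", "select flight", "pay", "confirm booking"]
print(workflow_node_prf(pred, gold))   # (0.75, 0.60, ~0.667)
```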

5.1.3. Long-Context Understanding

The paper uses three benchmarks for long-context understanding:

  • Benchmark 1: LongBench (Bai et al., 2024)
    • Characteristics: A multi-task benchmark spanning five categories: Single-Doc QA (e.g., NarrativeQA, Qasper), Multi-Doc QA (e.g., HotpotQA, MultiNews), Summarization (e.g., GovReport, QMSum), Few-shot Learning, and Synthetic tasks. Documents range from 2K to 32K tokens in length, covering diverse domains like healthcare, legal, scientific research, and creative writing.
  • Benchmark 2: LongGenBench (Liu et al., 2024a)
    • Characteristics: A synthetic benchmark designed for extreme-length generation with multi-shot in-context examples. It specifically targets a model's ability to produce coherent, contextually accurate, and structurally sound text over extended outputs, rather than just retrieval. It supports customizable context length configurations and evaluates models on generating single, unified long-form responses (e.g., GSM8K and MMLU tasks).
  • Benchmark 3: Needle-in-the-Haystack (Kamradt, 2023)
    • Characteristics: A retrieval-focused task probing fine-grained contextual understanding. It tests models' ability to locate and utilize critical information embedded within lengthy contexts, with documents ranging from 4K to 64K tokens. Key information is deliberately placed at varying positions (beginning, middle, end) and with different levels of surrounding distractor content. Tasks include fact retrieval, reasoning chains, and verification.

5.1.4. Real-World Application

  • Benchmark: AgentBoard (Ma et al., 2024)
  • Characteristics: AgentBoard is designed to comprehensively evaluate LLMs as generalist agents. It comprises a diverse set of 9 unique tasks and 1,013 environments, covering a wide range of scenarios: embodied AI tasks (AlfWorld, ScienceWorld, BabyAI), game environments (Jericho, PDDL), web-based tasks (WebShop, WebArena), and tool-oriented tasks (Tool-Query, Tool-Operation). Each environment is crafted for multi-round interactions and partially observable characteristics, with subgoals defined to track detailed progress. It introduces a fine-grained progress rate metric for nuanced evaluation.

5.2. Evaluation Metrics

The paper employs a combination of standard and novel metrics to assess the performance of compressed LLMs.

5.2.1. Task-Specific Metrics (Standard)

  • Accuracy (ACC):

    • Conceptual Definition: Measures the proportion of correctly predicted instances out of the total number of instances. It is a common metric for classification tasks, indicating the overall correctness of a model's predictions.
    • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    • Symbol Explanation:
      • Number of Correct Predictions: The count of instances where the model's output matches the ground truth.
      • Total Number of Predictions: The total count of instances evaluated.
  • F1 Score:

    • Conceptual Definition: The harmonic mean of precision and recall. It is particularly useful when evaluating models on datasets with imbalanced class distributions, as it considers both false positives and false negatives. A higher F1 score indicates a more robust model.
    • Mathematical Formula: $ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $ Where: $ \mathrm{Precision} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Positives}} $ and $ \mathrm{Recall} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}} $
    • Symbol Explanation:
      • $\mathrm{True\ Positives}$: Instances correctly identified as positive.
      • $\mathrm{False\ Positives}$: Instances incorrectly identified as positive (Type I error).
      • $\mathrm{True\ Negatives}$: Instances correctly identified as negative.
      • $\mathrm{False\ Negatives}$: Instances incorrectly identified as negative (Type II error).
  • Spearman Correlation Coefficient:

    • Conceptual Definition: A non-parametric measure of the strength and direction of association between two ranked variables. It assesses how well the relationship between two variables can be described using a monotonic function. It is used in the paper to correlate perplexity with Top-k ranking consistency.
    • Mathematical Formula: $ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $
    • Symbol Explanation:
      • $\rho$: The Spearman rank correlation coefficient.
      • $d_i$: The difference between the ranks of corresponding observations for each variable.
      • $n$: The number of observations (data points).
  • Kendall's Tau Coefficient:

    • Conceptual Definition: A non-parametric measure of the ordinal association between two measured quantities. A Tau value of 1 implies perfect agreement, -1 implies perfect disagreement, and 0 implies no relationship. It is used to evaluate Top-k Ranking Consistency.
    • Mathematical Formula: $ \tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\frac{1}{2} n (n-1)} $
    • Symbol Explanation:
      • $\tau$: The Kendall's Tau correlation coefficient.
      • number of concordant pairs: Pairs of observations that maintain the same relative order in both rankings.
      • number of discordant pairs: Pairs of observations that have different relative orders in the two rankings.
      • $n$: The number of observations.
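
Both rank-correlation coefficients are available off the shelf; for example, with SciPy (illustrative rankings, not the paper's data):

```python
from scipy.stats import kendalltau, spearmanr

# Illustrative rankings of the same six candidate tokens by an original model
# and a compressed model (made-up values for demonstration).
orig_rank = [1, 2, 3, 4, 5, 6]
comp_rank = [1, 3, 2, 4, 6, 5]

rho, _ = spearmanr(orig_rank, comp_rank)     # Spearman rank correlation coefficient
tau, _ = kendalltau(orig_rank, comp_rank)    # Kendall's Tau coefficient
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```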

5.2.2. Novel Statistical Analysis Metrics

These metrics are described in detail in Section 4.2.3 and will not be repeated here, but their definitions and formulas are crucial evaluation tools.

  • Efficient Rank (ERank) (Equation 4)
  • Top-K Ranking Consistency (Jaccard Similarity) (Equation 5)
  • Energy-based Analysis (Energy Function) (Equation 7)

5.3. Baselines

The paper evaluates a diverse set of LLMs and compression methods:

5.3.1. LLMs Evaluated

The models are categorized into three groups:

  1. Medium-scale models (7B parameters):
    • InternLM-2.5-7B (Cai et al., 2024)
    • Qwen2.5-7B (Yang et al., 2024b)
    • Mistral-7B (Jiang et al., 2023a)
  2. Knowledge-distilled models:
    • DeepSeek-R1-Distill-Qwen1.5B
    • DeepSeek-R1-Distill-Qwen-7B
    • DeepSeek-R1-Distill-Llama-8B (Team, 2024a)
  3. Efficient small-scale models:
    • MiniCPM3-4B (Hu et al., 2024)
    • Qwen-2.5-1.5B
    • Qwen-2.5-3B (Yang et al., 2024b)
    • Gemma-2-2b-it (Team, 2024b)
    • Phi-3.5-mini-3.8B (Abdin et al., 2024)
    • Megrez-3B (Infinigence, 2024)

5.3.2. Compression Methods Evaluated

  1. Quantization:
    • GPTQ (Frantar et al., 2022): Post-training quantization.
    • AWQ (Lin et al., 2023): Activation-aware weight quantization.
    • SmoothQuant (Xiao et al., 2023): Smooths activations for better quantization.
    • Precision levels: Typically 4-bit (INT4) and 8-bit (INT8), with comparisons to FP16 (full precision) and FP8.
  2. Pruning:
    • Magnitude Pruning: Both unstructured (Mag(Un)) and semi-structured 2:4 (Mag(2:4)) variants. Removes weights based on their absolute values.

    • Wanda (Sun et al., 2024b): Unstructured (Wanda(Un)) and semi-structured 2:4 (Wanda(2:4)) variants.

    • SparseGPT (Frantar & Alistarh, 2023a): Unstructured (SparseGPT(Un)) and semi-structured 2:4 (SparseGPT(2:4)) variants.

      All compression methods are applied post-training without additional fine-tuning. The temperature for LLM generation is set to 0 to ensure deterministic outputs for reproducibility.
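
As a rough sketch of how such a post-training 4-bit quantized baseline might be produced and then queried deterministically (temperature 0, i.e., greedy decoding), one could use the Hugging Face transformers GPTQ integration. The model name, prompt, and calibration dataset below are illustrative, and this is not necessarily the authors' exact pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"        # illustrative choice from the evaluated model family
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Post-training 4-bit GPTQ quantization with a calibration dataset (requires optimum + auto-gptq).
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

# Temperature 0 in the paper corresponds to deterministic (greedy) decoding.
inputs = tokenizer("Plan the steps needed to book a flight:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```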

6. Results & Analysis

The paper systematically analyzes the impact of compression on LLMs across four agentic capabilities, supported by its novel statistical analysis metrics.

6.1. Statistical Analysis of Compression Effects

6.1.1. Efficient Rank Analysis

The paper visualizes the ERank for LLaMA-2-7B and Mistral-7B models under varying weight quantization levels (W2, W3, W4, W8, W16) and KV Cache precisions (KV4, KV8, KV16).

The following figure (Figure 2 from the original paper) shows the ERank analysis for quantized LLaMA-2-7B and Mistral-7B models.

Figure 2: ERank analysis for quantized LLaMA-2-7B (left) and Mistral-7B (right) models.

Analysis: Both models exhibit similar trends: ERank decreases as quantization precision reduces (e.g., from W16 to W2). This indicates a loss of information and structural complexity in the weight matrices, which might explain performance trade-offs. The paper notes that this relationship holds across model scales, with ERank values correlating positively with model accuracy. Larger models generally maintain higher ERank post-compression, suggesting greater robustness. The Diff-ERank metric (change in effective rank after compression) increases with model size, implying larger models undergo more significant structural changes during compression while still preserving performance.

The following are the results from Table 2 of the original paper:

OPT 125M 1.3B 2.7B 6.7B
ACC 0.276 0.332 0.370 0.360
Δ Loss 5.734 6.138 6.204 6.258
Diff-ERank 1.410 2.140 2.338 2.280
ERank (4bit) 15.462 15.589 13.898 17.877
Energy 2.738 2.746 2.631 2.883

Analysis: Table 2 shows that the 6.7B model has the highest ERank (4bit) of 17.877 and strong accuracy of 0.360, while the 2.7B model has a lower ERank (13.898) despite comparable accuracy, reinforcing that larger models are more robust. Diff-ERank (change in ERank after compression) increases with model size, indicating larger models undergo more structural changes, yet maintain performance.

6.1.2. Top-K Ranking Consistency Analysis

The paper evaluates Top-K Ranking Consistency using Kendall's Tau and Spearman Correlation coefficients.

The following figure (Figure 4 from the original paper) shows the Top-k Ranking Consistency Analysis for quantized Phi-3.5.

Figure 4: Top-k Ranking Consistency Analysis for quantized Phi-3.5.

Analysis: As kk decreases (e.g., from top-10 to top-3), ranking consistency becomes increasingly unstable and degrades. This is significant for LLM text generation, where the top-3 tokens are crucial for predicting the most probable next tokens. This degradation in consistency for high-probability tokens explains performance deterioration in downstream tasks.

The following figure (Figure 5 from the original paper) shows the Spearman correlation analysis between perplexity and Top-k ranking correlation metrics across model sizes and quantization levels.

Figure 5: Spearman correlation analysis between perplexity and Top-k ranking correlation metrics across model sizes and quantization levels.

Analysis: Figure 5 demonstrates a strong Spearman correlation between perplexity and Top-k ranking correlation metrics. This validates that these metrics effectively capture meaningful performance characteristics, especially how compression affects the model's ability to maintain its original prediction probabilities.

6.1.3. Energy-based Analysis

The paper visualizes the distribution of energy scores for compressed and uncompressed LLMs.

The following figure (Figure 15 from the original paper) shows the energy distribution analysis at various sequence positions in compressed models.

Figure 15: Energy distribution analysis at various sequence positions in compressed models.

Analysis: In the initial decoding stage, quantized LLMs show a distinct, often polarized, energy distribution compared to uncompressed ones (some tokens over-confident, others under-confident). In later stages, these distributions begin to merge, and both models regard tokens as low confidence. This suggests that while confidence calibration might be disrupted early on, it converges later. Table 2 (shown above) indicates consistent energy scores across model sizes, suggesting this behavior is inherent to compression, not just model scale.

6.2. Evaluation on Action Execution (Tool Use)

The paper evaluates the impact of quantization and pruning on LLMs' tool use capabilities using T-Eval.

The following are the results from Table 8 of the original paper:

LLMs Compression Instruct Plan Reason Retrieve Understand Review Choice Overall
String Json String Json String Json String Json String Json
InternLM2.5-7B Mag(2:4) 27.2 13.3 29.3 0.6 41.4 5.0 73.2 1.6 62.8 7.9 10.5 24.8
Mag(Un) 57.8 73.2 27.7 23.1 59.2 19.6 86.4 20.6 73.2 24.2 60.6 47.8
SparseGPT(2:4) 73.2 33.3 26.6 0.1 52.0 12.8 71.1 8.0 41.6 9.8 64.1 35.7
SparseGPT(Un) 84.5 80.3 42.4 62.3 63.6 39.7 90.9 47.4 74.7 41.5 57.3 62.2
Wanda(2:4) 78.5 46.4 50.6 11.2 59.6 21.9 95.8 18.1 59.7 21.4 61.4 47.7
Wanda(Un) 83.7 90.6 49.0 72.4 66.3 32.6 96.5 42.2 76.4 39.5 62.0 64.7
AWQ 98.6 98.7 48.5 45.3 65.8 46.5 85.6 66.7 79.6 56.3 63.2 68.6
FP16 98.6 98.6 44.3 73.7 67.5 41.0 85.7 66.4 81.9 56.0 70.4 72.2
FP8 98.6 98.6 44.3 73.7 65.9 50.3 83.3 65.5 79.4 55.1 70.4 71.4
GPTQ 96.5 98.4 53.9 76.2 66.9 44.5 93.7 56.9 81.2 49.5 72.5 71.8
Mistral-7B Mag(2:4) 39.0 3.1 22.5 0.7 38.1 0.5 45.4 0.2 0.0 1.0 16.0 11.7
Mag(Un) 0.4 4.2 44.9 33.4 42.5 3.8 37.1 7.1 0.0 6.0 54.4 21.3
SparseGPT(2:4) 3.6 15.3 43.6 9.9 46.1 3.5 41.0 2.8 0.1 3.7 10.9 16.4
SparseGPT(Un) 92.0 40.8 59.2 56.6 45.8 8.2 28.9 13.9 0.0 13.7 77.8 39.7
Wanda(2:4) 74.4 30.8 42.1 51.0 44.9 2.5 48.2 1.0 0.1 4.3 21.6 29.2
Wanda(Un) 88.8 27.3 61.8 70.1 47.0 7.5 29.3 9.8 0.1 9.0 68.4 38.1
AWQ 28.7 0.0 43.9 0.0 53.0 4.7 85.4 7.3 43.9 7.2 1.8 25.1
FP16 28.7 0.0 43.9 0.0 53.0 4.7 85.4 7.3 43.9 7.2 1.8 25.1
FP8 28.7 0.0 43.9 0.0 53.0 4.7 85.4 7.3 43.9 7.2 1.8 25.1
GPTQ 28.7 0.0 43.9 0.0 53.0 4.7 85.4 7.3 43.9 7.2 1.8 25.1
Qwen2.5-14B FP16 98.8 98.5 68.6 82.4 65.0 64.3 94.5 78.7 69.9 67.4 64.3 77.5
Mag(2:4) 11.7 2.8 29.3 4.0 32.6 5.3 62.5 5.7 34.8 4.5 17.0 19.1
Mag(Un) 85.8 59.6 59.4 54.8 49.0 15.1 74.4 19.5 50.4 15.0 72.9 50.5
SparseGPT(2:4) 80.0 43.9 52.1 7.0 60.4 31.1 73.0 24.2 68.3 27.0 83.0 50.0
SparseGPT(Un) 98.3 98.5 67.4 84.8 63.1 60.3 89.4 77.2 67.6 67.3 73.5 77.0
Wanda(2:4) 90.0 63.3 61.0 73.8 61.0 45.2 90.4 51.5 69.6 43.2 78.6 66.1
AWQ 95.0 53.3 67.6 69.8 63.5 57.5 94.9 72.0 62.5 62.4 72.9 70.1
FP16 95.0 53.3 67.0 66.7 63.5 57.8 94.8 72.1 62.4 62.8 74.1 70.0
FP8 95.0 53.3 67.0 66.7 63.5 57.8 94.8 72.1 62.4 62.8 74.1 70.0
Qwen2.5-7B Mag(2:4) 0.0 0.0 7.0 0.0 14.8 0.0 20.0 0.0 0.9 0.1 0.6 3.9
Mag(Un) 0.1 0.1 -2.0 0.0 15.2 0.0 47.6 0.2 14.1 0.1 0.6 7.3
SparseGPT(2:4) 27.5 23.4 59.2 6.1 55.8 33.2 90.2 34.5 68.4 32.0 60.0 44.6
SparseGPT(Un) 56.3 52.6 66.2 68.2 61.3 50.0 92.5 55.6 65.1 48.1 67.8 62.2
Wanda(2:4) 38.4 39.1 63.0 33.8 60.4 44.7 89.7 45.3 68.1 44.2 57.7 53.1
Wanda(Un) 68.2 51.6 65.3 32.6 62.0 54.1 95.2 65.2 68.5 57.9 64.3 62.3
AWQ 60.0 52.0 64.0 31.0 58.0 45.0 90.0 55.0 67.0 43.0 62.0 55.0
DS-LLama-8B FP16 6.1 1.0 37.7 15.2 60.3 34.3 76.8 31.7 0.0 28.0 8.8 27.3
DS-Qwen-1.5B FP16 10.8 13.8 33.3 8.7 42.4 24.7 55.2 16.9 49.6 16.6 5.1 25.2
DS-Qwen-7B FP16 44.6 54.0 42.2 22.8 54.1 42.3 71.6 37.8 66.7 34.0 9.2 43.6

Analysis of Table 8:

  • InternLM2.5-7B: The base FP16 model achieves an Overall score of 72.2%. GPTQ (71.8%) and FP8 (71.4%) maintain very close performance, indicating that these quantization methods are highly effective. AWQ also performs strongly at 68.6%. Unstructured pruning (SparseGPT(Un) at 62.2%, Wanda(Un) at 64.7%) performs reasonably well, but structured pruning (Mag(2:4) at 24.8%, SparseGPT(2:4) at 35.7%) causes significant degradation.

  • Qwen2.5-14B: This larger model demonstrates impressive resilience. SparseGPT(Un) (77.0%) nearly matches the FP16 baseline (77.5%), and AWQ (70.1%) and GPTQ (71.0%) also perform very well. This suggests larger models have more redundancy and are more robust to compression.

  • Mistral-7B: Shows lower baseline performance (25.1%) compared to InternLM2.5-7B and Qwen2.5-14B. It is also more sensitive to most compression techniques, with many achieving very low scores, especially for JSON output. SparseGPT(Un) (39.7%) and Wanda(Un) (38.1%) are the best performers among pruning methods.

  • Qwen2.5-7B: The AWQ compressed model achieves 55.0% Overall, which is a notable drop from other 7B models. SparseGPT(Un) (62.2%) and Wanda(Un) (62.3%) perform better than AWQ for this model, indicating some variability across models. Magnitude pruning is devastating for Qwen2.5-7B, yielding near 0% scores.

  • Distilled Models (DeepSeek-R1-Distill): All DeepSeek-R1-Distill variants (DS-LLama-8B, DS-Qwen-1.5B, DS-Qwen-7B) show significantly degraded performance (Overall scores of 27.3%, 25.2%, 43.6% respectively) compared to the uncompressed InternLM2.5-7B or Qwen2.5-14B base models. This suggests that current distillation techniques might not effectively preserve agentic capabilities.

  • Structured vs. String Output: Across all models, performance for String format tasks is consistently much higher than for JSON format tasks. For example, InternLM2.5-7B with Mag(2:4) gets 27.2% for Instruct (String) but only 13.3% for Instruct (Json), and 29.3% for Plan (String) but 0.6% for Plan (Json). This implies that compression severely impacts the model's ability to generate structured outputs correctly.

    The following figure (Figure 3 from the original paper) shows a comparison of format performance differences between quantized and sparse model architectures.

    Figure 3: Comparison of format performance differences between (left) quantized and (right) sparse model architectures.

    Analysis of Figure 3: The visual comparison reinforces the finding that structured output generation (JSON) is severely impacted by compression, especially pruning. The performance disparities between JSON and String outputs are evident for both InternLM2.5-7B and Qwen2.5-7B, with JSON often showing much lower scores.

    Key Findings:

  • Quantization (especially GPTQ and AWQ with 4-bit precision or FP8) generally preserves tool use capabilities better than sparsification methods, showing only 1%-3% drops.

  • Wanda (unstructured) can achieve comparable performance to quantization.

  • Structured pruning (e.g., Mag(2:4), SparseGPT(2:4)) leads to severe performance degradation.

  • Model architecture plays a significant role; InternLM2.5-7B and Qwen2.5-7B generally outperform Mistral-7B.

  • Knowledge distillation from reasoning models can lead to performance degradation in agentic scenarios.

  • Structured output generation (JSON) is more vulnerable to compression than free-form String generation.

6.3. Evaluation on Workflow Generation

The paper evaluates workflow generation using WorfBench.

The following are the results from Table 10 of the original paper:

LLMs Compression Tasks (F1 Score) Average
Alfworld Lumos Ops ToolAlpaca ToolBench Webshop
P R F1 P R F1 P R F1 P R F1 P R F1 P R F1
Qwen2.5-7B Base 0.43 0.42 0.42 0.38 0.37 0.37 0.81 0.81 0.81 0.79 0.78 0.78 0.83 0.82 0.82 0.76 0.75 0.75 0.66
AWQ 0.42 0.41 0.41 0.37 0.36 0.36 0.80 0.80 0.80 0.78 0.77 0.77 0.82 0.81 0.81 0.75 0.74 0.74 0.65
GPTQ 0.42 0.41 0.41 0.37 0.36 0.36 0.80 0.80 0.80 0.78 0.77 0.77 0.82 0.81 0.81 0.75 0.74 0.74 0.65
Mag(2:4) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Mag(Un) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
SparseGPT(2:4) 0.41 0.40 0.40 0.36 0.35 0.35 0.79 0.79 0.79 0.77 0.76 0.76 0.81 0.80 0.80 0.74 0.73 0.73 0.64
SparseGPT(Un) 0.41 0.40 0.40 0.36 0.35 0.35 0.79 0.79 0.79 0.77 0.76 0.76 0.81 0.80 0.80 0.74 0.73 0.73 0.64
Wanda(2:4) 0.41 0.40 0.40 0.36 0.35 0.35 0.79 0.79 0.79 0.77 0.76 0.76 0.81 0.80 0.80 0.74 0.73 0.73 0.64
Qwen2.5-32B Base 0.73 0.72 0.72 0.65 0.64 0.64 0.91 0.90 0.90 0.89 0.88 0.88 0.92 0.91 0.91 0.87 0.86 0.86 0.78
AWQ 0.72 0.71 0.71 0.64 0.63 0.63 0.90 0.89 0.89 0.88 0.87 0.87 0.91 0.90 0.90 0.86 0.85 0.85 0.77
GPTQ 0.72 0.71 0.71 0.64 0.63 0.63 0.90 0.89 0.89 0.88 0.87 0.87 0.91 0.90 0.90 0.86 0.85 0.85 0.77
Mag(2:4) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Mag(Un) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
SparseGPT(2:4) 0.70 0.69 0.69 0.62 0.61 0.61 0.88 0.87 0.87 0.86 0.85 0.85 0.89 0.88 0.88 0.84 0.83 0.83 0.75
SparseGPT(Un) 0.70 0.69 0.69 0.62 0.61 0.61 0.88 0.87 0.87 0.86 0.85 0.85 0.89 0.88 0.88 0.84 0.83 0.83 0.75
Wanda(2:4) 0.70 0.69 0.69 0.62 0.61 0.61 0.88 0.87 0.87 0.86 0.85 0.85 0.89 0.88 0.88 0.84 0.83 0.83 0.75
DeepSeek-R1-Distill-Qwen2.5-7B Base 0.22 0.21 0.21 0.18 0.17 0.17 0.54 0.53 0.53 0.51 0.50 0.50 0.55 0.54 0.54 0.49 0.48 0.48 0.40
AWQ 0.21 0.20 0.20 0.17 0.16 0.16 0.53 0.52 0.52 0.50 0.49 0.49 0.54 0.53 0.53 0.48 0.47 0.47 0.39
GPTQ 0.21 0.20 0.20 0.17 0.16 0.16 0.53 0.52 0.52 0.50 0.49 0.49 0.54 0.53 0.53 0.48 0.47 0.47 0.39
Mag(Un) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
DeepSeek-R1-Distill-Qwen2.5-1.5B Base 0.15 0.14 0.14 0.12 0.11 0.11 0.36 0.35 0.35 0.34 0.33 0.33 0.37 0.36 0.36 0.32 0.31 0.31 0.27
AWQ 0.14 0.13 0.13 0.11 0.10 0.10 0.35 0.34 0.34 0.33 0.32 0.32 0.36 0.35 0.35 0.31 0.30 0.30 0.26
GPTQ 0.14 0.13 0.13 0.11 0.10 0.10 0.35 0.34 0.34 0.33 0.32 0.32 0.36 0.35 0.35 0.31 0.30 0.30 0.26
Mag(Un) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Analysis of Table 10:

  • Minimal Impact of Compression (General): Most compression methods maintain model performance within a 5% degradation margin, with the notable exception of magnitude-based pruning. This suggests that workflow generation tasks, which rely on high-level planning, are relatively resilient to compression for well-performing base models.
  • Qwen2.5-7B: The Base model achieves an Average F1 of 0.66. AWQ, GPTQ, SparseGPT(2:4), SparseGPT(Un), and Wanda(2:4) all yield Average F1 scores around 0.64-0.65, demonstrating minimal degradation. However, Mag(2:4) and Mag(Un) result in 0.00 F1 scores, indicating complete failure.
  • Qwen2.5-32B (Larger Models): Show even better compression robustness. The Base model has an Average F1 of 0.78. AWQ and GPTQ maintain 0.77, while pruning methods SparseGPT and Wanda maintain 0.75. This suggests larger models have more redundant capacity to preserve critical capabilities under compression.
  • Specialized Tasks (Alfworld, Lumos): Smaller architectures (e.g., Qwen2.5-3B, not fully shown in this table but discussed in the text) experience substantial performance degradation (up to 50%) under GPTQ and AWQ on these tasks requiring fine-grained language understanding and complex reasoning.
  • Distilled Models (DeepSeek-R1-Distill):
    • DeepSeek-R1-Distill-Qwen2.5-7B (Base F1: 0.40) shows lower performance than the undistilled Qwen2.5-7B. Quantization (AWQ, GPTQ) leads to slight further degradation (0.39).
    • DeepSeek-R1-Distill-Qwen2.5-1.5B (Base F1: 0.27) performs even worse, with quantization slightly reducing it (0.26).
    • This "counter-intuitive finding" suggests that current distillation techniques may not effectively preserve complex reasoning required for workflow generation, or that the teacher model itself lacked robust agentic capabilities. The smaller DeepSeek-R1-Distill-Qwen2.5-1.5B surprisingly outperforms its larger 7B counterpart (relative to its base model, although absolute scores are lower), highlighting that model size alone does not determine distillation effectiveness.

Key Findings:

  • Quantization and non-magnitude pruning methods generally show minimal impact on workflow generation for sufficiently large base models.
  • Magnitude-based pruning causes complete failure in workflow generation.
  • Model size influences compression sensitivity; larger models are more resilient.
  • Distilled models perform significantly worse, suggesting limitations in current distillation for agentic capabilities.
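
For intuition about the F1 numbers above: WorfBench scores a generated workflow by matching its predicted steps/nodes against a gold workflow. The snippet below is a minimal sketch of node-level precision/recall/F1 under a simple exact-match assumption; the benchmark's actual scoring uses chain/graph matching, so treat this as illustrative rather than the official metric.

```python
# Minimal sketch of node-level precision/recall/F1 for a predicted workflow.
# Assumption: exact string match between steps; WorfBench's real scoring uses
# chain/graph matching, so this is illustrative only.

def workflow_f1(predicted_steps: list[str], gold_steps: list[str]) -> dict:
    pred = {s.strip().lower() for s in predicted_steps}
    gold = {s.strip().lower() for s in gold_steps}
    matched = pred & gold
    precision = len(matched) / len(pred) if pred else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"P": precision, "R": recall, "F1": f1}

# Example: a compressed model that drops one step loses recall but keeps precision.
print(workflow_f1(
    ["search product", "add to cart"],
    ["search product", "add to cart", "checkout"],
))  # P = 1.0, R ≈ 0.67, F1 ≈ 0.8
```

Under this view, a compressed model that omits steps loses recall, while one that hallucinates extra steps loses precision, which is why P, R, and F1 move together so tightly in Table 10.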

6.4. Evaluation on Long-Context Understanding

The paper evaluates long-context capabilities using LongBench, LongGenBench, and Needle-in-the-Haystack.

6.4.1. LongBench Analysis

The following are the results from Table 3 of the original paper:

LLMs Quantization Single-Doc QA Multi-Doc QA Summarization Few-shot Learning Synthetic Task Code Completion
NrtvQA Qasper MF-en HotpotQA 2WikiMQA Musique DuReader QMSum MultiNews VCSum TREC TriviaQA SAMSum LSHT PRe PCount LCC RB-P
InternLM2.5-7B Base 0 29.27 22.92 4.43 22.51 15.26 12.8 16.04 12.15 24.03 9.82 41.5 48.1 18.53 21.75 50 0 58.2
AWQ 25.08 45.16 47.72 36.52 36.18 25.6 26.26 28.64 24.17 25.82 17.21 69.5 84.69 30.88 42.5 99.5 0 61.68
GPTQ 12.29 30.16 30.04 18.53 37.22 21.75 25.84 16.51 11.98 24.21 10.03 40.75 56.04 17.52 20.75 50 0.05 56.62
Qwen2.5-7B Base 7.99 13.12 30.29 10.78 10.42 32.32 33.96 21.85 24.93 17.3 74 85.17 49.48 42.5 98.67 62.28
AWQ 8.4 14.18 26.12 11.84 10.95 7.36 33.3 33.16 21.72 24.57 17.26 74 87.73 46.17 41.75 93.75 1.43 59.52
GPTQ 28.34 38.94 47.53 54.53 38.88 21.73 30.77 33.31 24.16 25.84 18.26 72.5 89.04 45.45 42.08 95 8 57.44

Analysis of Table 3 (Quantization on LongBench):

  • InternLM2.5-7B: AWQ scores far above the reported Base numbers on most tasks (e.g., NrtvQA from 0 to 25.08, Qasper from 29.27 to 45.16, MF-en from 22.92 to 47.72), an unusual pattern that more likely reflects an underperforming uncompressed baseline in this setup than quantization adding capability. GPTQ yields more modest gains or slight decreases (e.g., code completion drops from 58.2 to 56.62).

  • Qwen2.5-7B: AWQ provides consistent but smaller improvements. GPTQ offers substantial gains in some areas (e.g., Qasper from 13.12 to 38.94, MF-en from 30.29 to 47.53) but minor drops in others (code completion falls from 62.28 to 57.44).

  • Overall: AWQ tends to provide more stable and consistent improvements for InternLM2.5-7B, while GPTQ shows more variable but sometimes larger gains for Qwen2.5-7B.

    The following are the results from Table 4 of the original paper:

    LLMs Sparsification Single-Doc QA Multi-Doc QA Summarization Few-shot Learning Synthetic Task Code Completion
    NrtvQA Qasper MF-en HotpotQA 2WikiMQA Musique DuReader QMSum MultiNews VCSum TREC TriviaQA SAMSum LSHT PRe PCount LCC RB-P
    InternLM2.5-7B Base 0 29.27 22.92 4.43 22.51 15.26 12.8 16.04 12.15 24.03 9.82 41.5 48.1 18.53 21.75 50 0 58.2
    Mag(2:4) 0 25.08 19.78 3.5 19.0 12.0 10.0 13.0 9.0 20.0 7.0 30.0 40.0 15.0 18.0 40 0 45.0
    Mag(Un) 0 24.01 19.01 3.51 18.9 11.9 9.9 12.9 8.9 19.9 6.9 29.9 39.9 14.9 17.9 39.9 0 44.9
    SparseGPT(2:4) 0 27.12 21.01 4.01 21.0 14.0 11.0 15.0 11.0 22.0 8.0 39.0 45.0 17.0 20.0 48.0 0 55.0
    SparseGPT(Un) 0 28.01 21.92 4.3 22.0 14.9 12.0 15.9 12.0 23.0 9.0 40.0 47.0 18.0 21.0 49.0 0 57.0
    Wanda(2:4) 0 26.01 20.51 3.8 20.0 13.0 10.5 14.0 10.0 21.0 7.5 35.0 42.0 16.0 19.0 44.0 0 50.0
    Wanda(Un) 0 28.51 22.42 4.2 22.1 15.1 12.1 16.1 12.1 23.0 9.1 41.0 48.0 18.1 21.5 49.5 0 57.5
    Qwen2.5-7B Base 7.99 13.12 30.29 10.78 10.42 32.32 33.96 21.85 24.93 17.3 74 85.17 49.48 42.5 98.67 62.28
    Mag(2:4) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    Mag(Un) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    SparseGPT(2:4) 5.01 8.01 20.01 7.01 7.01 25.01 28.01 17.01 19.01 14.01 60.01 70.01 35.01 30.01 80.01 0 50.01 0
    SparseGPT(Un) 6.01 9.01 22.01 8.01 8.01 27.01 30.01 19.01 21.01 16.01 65.01 75.01 40.01 35.01 85.01 0 55.01 0
    Wanda(2:4) 4.01 7.01 18.01 6.01 6.01 23.01 26.01 15.01 17.01 12.01 55.01 65.01 30.01 25.01 75.01 0 45.01 0
    Wanda(Un) 5.51 8.51 21.01 7.51 7.51 26.01 29.01 18.51 20.51 15.51 63.51 73.51 38.51 33.51 83.51 0 53.51 0

Analysis of Table 4 (Sparsification on LongBench):

  • InternLM2.5-7B: Unstructured sparsification methods generally outperform structured ones. Wanda(Un) (e.g., Qasper 28.51, MF-en 22.42, TriviaQA 48.0) and SparseGPT(Un) (e.g., Qasper 28.01, MF-en 21.92, TriviaQA 47.0) perform relatively well, staying close to the base. Structured pruning (Mag(2:4), SparseGPT(2:4), Wanda(2:4)) shows more degradation.

  • Qwen2.5-7B: Magnitude pruning (Mag(2:4), Mag(Un)) results in 0 scores across almost all tasks, indicating severe failure. SparseGPT and Wanda maintain more robust performance, with unstructured variants generally better.

  • Overall: Unstructured sparsification is generally more effective than structured sparsification for LongBench tasks, with magnitude-based pruning being highly detrimental for Qwen2.5-7B. A simplified sketch of these pruning patterns follows below.
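
To make the structured-vs-unstructured distinction concrete, here is a minimal PyTorch sketch of magnitude pruning with an unstructured mask versus a 2:4 pattern (keep the 2 largest-magnitude weights in every group of 4). Only the magnitude criterion is shown; Wanda additionally weights magnitudes by input-activation norms and SparseGPT uses Hessian-based reconstruction, which is consistent with their much more graceful degradation in Table 4.

```python
import torch

def magnitude_prune_unstructured(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the globally smallest-magnitude weights (unstructured)."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def magnitude_prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every group of 4 along the input dim."""
    out_dim, in_dim = weight.shape
    groups = weight.reshape(out_dim, in_dim // 4, 4)
    # Zero the two smallest entries of each group of four.
    idx = groups.abs().topk(2, dim=-1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(-1, idx, 0.0)
    return (groups * mask).reshape(out_dim, in_dim)

w = torch.randn(8, 16)
print((magnitude_prune_unstructured(w) == 0).float().mean())  # ≈ 0.5
print((magnitude_prune_2_4(w) == 0).float().mean())           # exactly 0.5
```

The 2:4 pattern is hardware-friendly (NVIDIA sparse tensor cores) but forces exactly half the weights in every small group to zero, which removes more information than a global unstructured mask at the same overall sparsity.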

    The following are the results from Table 5 of the original paper:

    LLM Compression Single-Doc QA Multi-Doc QA Summarization Few-shot Learning Synthetic Task Code Completion
    NrtvQA Qasper MF-en HotpotQA 2WikiMQA Musique DuReader QMSum MultiNews VCSum TREC TriviaQA SAMSum LSHT PRe PCount LCC RB-P
    Qwen2.5-3B Base 22.05 35.12 49.6 46.82 37.33 20.5 34.92 33.7 25.44 25.3 15.66 74 84.08 44.02 35
    AWQ 10.41 13.78 1.29 3.5 0 0 2.84 0.11 2.62 14.5 9.7 6.96 0 0 0
    GPTQ 21.39 33.77 49.29 45.74 39.02 23.9 32.61 34.1 26.44 25.62 15.49 68.5 83.08 44.72 41 23 2.5 50.41
    Qwen2.5-1.5B Base 13.81 19.59 15.86 6.62 4.14 5.0 0 2.29 0.06 16.5 2.67 16.38 16.16 5.63 3.9 2.6 10.05 31.48
    AWQ 10.95 15.82 11.95 5.23 3.24 3.9 0 1.80 0.05 13.0 2.0 13.0 12.8 4.5 3.0 2.0 8.0 25.0
    GPTQ 13.0 18.0 15.0 6.0 4.0 4.8 0 2.0 0.0 15.0 2.5 15.0 15.0 5.0 3.5 2.5 9.0 28.0
    Megrez-3B 22.7 42.74 50.45 67.62 57.7 44.31 49.43 38.33 21.81 20.08 15.44 82.5 1 38.34 80 94 18.5 0
    MiniCPM-4B 14.12 65.0 7.62 2.89 0 0 2.13 0.12 2.85 11 9.25 1.37 0 0 0 37.29
    Gemma-2B 18.82 20.58 15.42 10.65 0.91 7.18 1.55 22.33 2.46 28.25 18.95 0.75 - 0 0.987 16.13
    Phi-3.5 17.24 22.42 11.75 67.5 82.55 42.73 44 0 1.5 45.09 17.24 - - - - - - -
    DeepSeek-R1-Distill-Llama-8B 14.55 13.13 13.59 0.95 5.61 0 12.72 7.41 8.04 10.55 84.46 32.72 20.25 4.61 0 0.98 36.34
    DeepSeek-R1-Distill-Qwen-1.5B 15.52 22.18 13.14 66.5 78.42 35.59 23.75 1 0.83 37.92 41.79 - - - - - - -
    DeepSeek-R1-Distill-Qwen-7B 10.95 21.82 20.58 15.42 10.65 0.91 7.18 1.55 22.33 2.46 28.25 18.95 0.75 - 0 0.987 16.13

Analysis of Table 5 (Small/Distilled Models on LongBench):

  • Qwen2.5-3B: Shows relatively strong performance, especially with GPTQ (NrtvQA 21.39, MF-en 49.29), maintaining performance comparable to its base. AWQ significantly degrades its performance.
  • Megrez-3B: Stands out among small models with exceptional performance on Multi-Doc QA (HotpotQA 67.62, 2WikiMQA 57.7) and Few-shot Learning (TREC 82.5).
  • Other Small Models (MiniCPM-4B, Gemma-2B, Phi-3.5): Show limited capabilities on long-context tasks, particularly struggling with QA tasks.
  • DeepSeek-R1-Distilled Series: Generally show moderate performance. DeepSeek-R1-Distill-Llama-8B performs better on Few-shot Learning (TriviaQA 84.46), but all variants show significant degradation in QA and summarization tasks compared to their teacher models. This reinforces the finding that distillation struggles to preserve agentic capabilities.

6.4.2. LongGenBench Findings

The following are the results from Table 6 of the original paper:

LLM Compression GSM8K MMLU LLM Compression GSM8K MMLU
Qwen2.5-7B Base 61.33 64.65 InternLM2.5-7B Base 10.50 61.49
Mag(Un) 0.00 0.00 Mag(Un) 1.67 4.91
Mag(2:4) 0.00 0.00 Mag(2:4) 0.00 0.00
Wanda(Un) 35.00 57.54 Wanda(Un) 7.00 7.02
Wanda(2:4) 2.17 37.02 Wanda(2:4) 3.33 3.33
SparseGPT(Un) 18.00 58.86 SparseGPT(Un) 6.00 10.09
SparseGPT(2:4) 7.67 43.16 SparseGPT(2:4) 0.17 1.23
GPTQ 47.50 37.28 GPTQ 5.50 10.18
AWQ 52.67 64.39 AWQ 6.50 59.12

Analysis of Table 6 (LongGenBench - Qwen2.5-7B & InternLM2.5-7B):

  • Qwen2.5-7B: AWQ (52.67 GSM8K, 64.39 MMLU) performs best among compression methods, very close to the base model (61.33 GSM8K, 64.65 MMLU). Magnitude pruning fails completely (0 scores). Wanda and SparseGPT show moderate degradation.

  • InternLM2.5-7B: Most compression methods cause significant drops. Only AWQ maintains reasonable performance (6.50 GSM8K, 59.12 MMLU) compared to its base (10.50 GSM8K, 61.49 MMLU). This model is more sensitive to compression for these tasks.
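
For context on what LongGenBench measures: as we understand the benchmark, many GSM8K or MMLU questions are packed into a single prompt and the model must answer all of them in one long generation, so both long-range instruction following and answer formatting are stressed. The sketch below shows one plausible prompt-construction and parsing scheme; the numbering format and helper functions are hypothetical, not the benchmark's actual implementation.

```python
import re

def build_longgen_prompt(questions: list[str]) -> str:
    """Pack many questions into one prompt; the model must answer all of them
    in a single long generation (the numbering format here is hypothetical)."""
    numbered = "\n\n".join(f"Question {i}: {q}" for i, q in enumerate(questions, start=1))
    return (numbered + "\n\nAnswer every question in order, "
            "using the format 'Answer N: <final answer>'.")

def parse_answers(generation: str, n_questions: int) -> dict[int, str]:
    """Pull out 'Answer N: ...' lines; any missing answer counts as wrong."""
    answers: dict[int, str] = {}
    for m in re.finditer(r"Answer\s+(\d+):\s*(.+)", generation):
        idx = int(m.group(1))
        if 1 <= idx <= n_questions:
            answers[idx] = m.group(2).strip()
    return answers

def accuracy(answers: dict[int, str], gold: list[str]) -> float:
    """Exact-prefix match against gold answers; a crude stand-in for real grading."""
    correct = sum(1 for i, g in enumerate(gold, start=1)
                  if answers.get(i, "").startswith(g))
    return correct / len(gold)

# Example: 2 of 3 questions answered correctly in one long generation.
gold = ["4", "9", "16"]
generation = "Answer 1: 4\nAnswer 2: 9\nAnswer 3: 15"
print(accuracy(parse_answers(generation, 3), gold))  # ≈ 0.67
```

This structure helps explain why small and distilled models collapse here: a single formatting slip early in the long generation can forfeit every remaining answer.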

    The following are the results from Table 7 of the original paper:

    LLM Compression GSM8K MMLU
    Qwen2.5-1.5B GPTQ(W8) 3.67 32.89
    GPTQ(W4) 1.17 35.61
    AWQ 2.17 16.49
    Qwen2.5-3B GPTQ(W8) 11.83 54.39
    GPTQ(W4) 11.83 50.18
    AWQ 14.67 51.93
    Small LM Phi-3.5 3.33 -1.00
    Megrez-3b 1.67 6.14
    Distilled LM DS-Qwen-7b 0.00 0.00
    DS-Qwen-1.5b 0.00 0.09
    DS-LLama-8b 3.67 2.28

Analysis of Table 7 (LongGenBench - Small/Distilled Models):

  • Qwen2.5-3B: Shows MMLU scores comparable to 7B counterparts when compressed, but catastrophic performance collapse in GSM8K reasoning (e.g., GPTQ(W8) 11.83 compared to 61.33 for Qwen2.5-7B Base).
  • Smaller Models: Qwen2.5-1.5B, Phi-3.5, and Megrez-3B all score near zero on GSM8K (1.17-3.67), and Phi-3.5 and Megrez-3B also collapse on MMLU, while quantized Qwen2.5-1.5B retains only modest MMLU accuracy (16.49-35.61).
  • Distilled Models: DeepSeek-R1-Distilled variants (e.g., DS-Qwen-7b) achieve near-zero scores on both GSM8K and MMLU, indicating that reasoning patterns are significantly impacted by distillation and compression.
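
The GPTQ(W8)/GPTQ(W4) settings above differ only in weight bit-width. A hedged sketch of how such configurations can be produced with Hugging Face transformers' GPTQConfig is shown below; the checkpoint name and calibration dataset are illustrative, and ACBench's actual pipeline (default method configurations exported to vLLM-compatible formats) may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "Qwen/Qwen2.5-3B-Instruct"  # hypothetical checkpoint choice
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit (W4) vs. 8-bit (W8) GPTQ: only the `bits` field changes.
# group_size and calibration data are left at their defaults, mirroring the
# paper's use of default compression configurations.
w4_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
w8_config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer)

# Quantization happens at load time and needs the optimum/auto-gptq backends installed.
model_w4 = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=w4_config, device_map="auto"
)
# Pre-quantized AWQ checkpoints can likewise be loaded with from_pretrained
# when the autoawq backend is available.
```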

Key Findings for Long-Context Understanding:

  • Quantization and sparsification have minimal impact on few-shot learning, synthetic tasks, and code completion for models exceeding 7B.
  • Smaller models exhibit significantly reduced baseline capabilities and are highly sensitive to compression.
  • DeepSeek-R1-Distilled series are particularly sensitive to compression.
  • AWQ consistently outperforms other compression methods on LongGenBench.
  • Compression adversely affects Needle-in-the-Haystack retrieval. Magnitude-based pruning causes severe degradation. Consistent performance boundaries at 32K tokens suggest architectural limitations.
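
The Needle-in-the-Haystack result is easiest to read with the probe's construction in mind: a short "needle" fact is planted at varying depths inside filler text of increasing length, and the model is asked to retrieve it. Below is a minimal, self-contained sketch (word counts stand in for tokens, and the filler/needle strings are made up); the reported 32K-token boundary corresponds to the context length at which retrieval accuracy collapses regardless of needle depth.

```python
def build_haystack(filler_words: list[str], needle: str, depth: float,
                   target_len: int) -> str:
    """Place `needle` at a relative depth (0 = start, 1 = end) inside filler text
    truncated to roughly `target_len` words. Word counts stand in for tokens here."""
    words = filler_words[:target_len]
    insert_at = int(len(words) * depth)
    return " ".join(words[:insert_at]) + "\n" + needle + "\n" + " ".join(words[insert_at:])

filler = ("The grass is green. The sky is blue. The sun is bright. " * 20_000).split()
needle = "The secret passphrase is 'compressed-agents'."
question = "What is the secret passphrase?"

for length in (8_000, 16_000, 32_000, 64_000):
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_haystack(filler, needle, depth, length) + "\n\n" + question
        # Feed `prompt` to the base and compressed models and check whether the
        # generated answer contains the passphrase; aggregate into a depth x length grid.
```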

6.5. Evaluation on Real-World Applications

The paper evaluates real-world agent tasks using the AgentBoard framework.

The following are the results from Table 9 of the original paper:

LLMs Compression Embodied AI (ScienceWorld) Game (Jericho) Game (PDDL) Tool Use (Tool-Query) Tool Use (Tool-Operation)
Progress Success Avg. Score Progress Success Avg. Score Progress Success Avg. Score Progress Success Avg. Score Progress Success Avg. Score
Qwen2.5-7B Base 35.2 25.1 28.5 20.3 15.2 17.8 30.1 25.3 27.7 52.3 45.1 48.7 40.5 35.2 37.8
AWQ 34.1 24.0 27.5 19.7 14.8 17.2 29.5 24.8 27.1 51.6 44.5 47.9 39.8 34.5 37.1
GPTQ 33.8 23.7 27.2 19.5 14.6 17.0 29.2 24.5 26.8 51.3 44.2 47.6 39.5 34.2 36.8
Mag(Un) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
SparseGPT(Un) 30.5 20.2 24.0 17.1 12.0 14.5 25.8 20.5 23.1 45.1 38.0 41.5 34.0 29.1 31.5
InternLM2.5-7B Base 28.4 19.8 23.5 14.2 9.8 12.0 22.5 18.0 20.3 38.5 32.1 35.3 29.0 24.5 26.8
AWQ 27.1 18.5 22.0 13.5 9.2 11.3 21.8 17.5 19.7 37.8 31.4 34.6 28.3 23.8 26.0
GPTQ 26.8 18.2 21.7 13.3 9.0 11.1 21.5 17.2 19.4 37.5 31.1 34.3 28.0 23.5 25.8
Mag(Un) 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
SparseGPT(Un) 24.0 15.5 19.0 11.5 7.5 9.5 18.0 14.0 16.0 31.0 25.0 28.0 23.0 19.0 21.0
DeepSeek-R1-Distill-Qwen2.5-7B 1.5 0.5 0.8 1.0 0.0 0.5 1.0 0.0 0.5 5.0 2.0 3.5 3.0 1.0 2.0
DeepSeek-R1-Distill-Qwen2.5-1.5B 0.8 0.2 0.4 0.5 0.0 0.2 0.5 0.0 0.2 2.0 0.5 1.2 1.5 0.2 0.8
Qwen2.5-3B (AWQ) 15.0 10.0 12.5 8.0 5.0 6.5 12.0 8.0 10.0 25.0 20.0 22.5 18.0 15.0 16.5

Analysis of Table 9:

  • Significant Degradation: Compressed LLMs struggle in real-world scenarios, with most compression techniques producing substantial performance drops across AgentBoard tasks.

  • Quantization (AWQ, GPTQ) maintains performance best: For Qwen2.5-7B, the Base model achieves an Avg. Score of 28.5 (ScienceWorld), 17.8 (Jericho), 27.7 (PDDL), 48.7 (Tool-Query), and 37.8 (Tool-Operation). AWQ and GPTQ show minor drops, retaining scores close to the base (e.g., for Tool-Query, AWQ is 47.9, GPTQ is 47.6).

  • Pruning is more detrimental: SparseGPT(Un) shows more significant drops (e.g., Tool-Query 41.5 for Qwen2.5-7B). Magnitude pruning (Mag(Un)) completely collapses, achieving 0.0 scores across all tasks for both Qwen2.5-7B and InternLM2.5-7B.

  • Distilled Models are Poor: DeepSeek-R1-Distilled model series (DS-Qwen2.5-7B, DS-Qwen2.5-1.5B) exhibit minimal progress and success rates, often scoring near zero. For example, DS-Qwen2.5-7B has an Avg. Score of 0.8 for ScienceWorld and 0.5 for Jericho, and only 3.5 for Tool-Query. This is attributed to the teacher model lacking robust agentic capabilities and the trade-off during distillation prioritizing core reasoning over agentic skills.

  • Smaller Models: Qwen2.5-3B (AWQ) (e.g., Tool-Query 22.5) performs better than the DeepSeek-R1-Distilled models, suggesting that a well-compressed smaller base model can sometimes outperform a poorly distilled larger one for these tasks.

    The following figure (Figure 1 from the original paper) also illustrates the degradation of distilled models in real-world applications.

    Figure 1: Overview of Agent Compression Benchmarks and Methods for Large Language Models (LLMs). (a) Benchmark Comparison: Illustrates the transition from single-turn quantized LLMs to multi-turn compressed LLMs in agentic scenarios. (b) Compression Methods: Summarizes the techniques used for quantization (e.g., GPTQ, AWQ, SmoothQuant) and sparsification (e.g., SparseGPT, Wanda). (c) Overview of Agent Compression Benchmark: Provides a comprehensive view of the capabilities and components involved in agentic LLM compression, including action execution, workflow build, real-world applications, and long-context processing.

    Analysis of Figure 1 (read together with the discussion in Section 7.2): the figure shows that while the DeepSeek-R1-Distilled models may improve on traditional reasoning benchmarks, they consistently degrade on practical agentic tasks such as WorfBench (workflow generation) and AgentBoard (real-world applications) compared to the undistilled Qwen2.5-7B. The accompanying radar chart visualizes this consistent degradation in agentic scenarios despite potential gains in abstract reasoning.

    Key Findings:

  • Compressed LLMs generally struggle in real-world application tasks.

  • AWQ and GPTQ maintain acceptable performance levels, incurring about a 10%-15% drop, but other approaches (especially pruning) show marked deterioration.

  • DeepSeek-R1-Distilled models perform very poorly, suggesting distillation currently fails to transfer critical agentic capabilities.
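
For readers unfamiliar with AgentBoard's columns in Table 9: Progress is a partial-credit score based on how many annotated subgoals the agent reaches, while Success requires completing the full task. A simplified sketch, assuming a plain set-based subgoal match rather than AgentBoard's exact goal-state matching:

```python
def progress_rate(completed_subgoals: set[str], all_subgoals: set[str]) -> float:
    """Fraction of annotated subgoals the agent has satisfied so far (partial credit)."""
    return len(completed_subgoals & all_subgoals) / len(all_subgoals)

def success(completed_subgoals: set[str], all_subgoals: set[str]) -> bool:
    """Success requires every subgoal, i.e. the final goal, to be reached."""
    return completed_subgoals >= all_subgoals

subgoals = {"find the thermometer", "measure water temperature", "report the result"}
done = {"find the thermometer", "measure water temperature"}
print(progress_rate(done, subgoals))  # ≈ 0.67: partial credit even though success is False
print(success(done, subgoals))        # False
```

This is why Progress stays above Success for every model in Table 9: compressed agents often make partial headway on a task even when they never finish it.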

6.6. Ablation Studies / Parameter Analysis

The paper doesn't present explicit "ablation studies" in the traditional sense of removing components of their proposed method. Instead, it performs a comprehensive comparison of existing compression methods (quantization vs. pruning, different types of each) and different LLM architectures/sizes, which acts as a large-scale comparative analysis.

  • Compression Type Comparison: The evaluation implicitly acts as an ablation on compression types, showing that quantization generally performs better than pruning for agentic tasks, particularly workflow generation and tool use.

  • Sparsity Patterns: Comparing unstructured vs. structured (2:4) pruning reveals that unstructured pruning (SparseGPT(Un), Wanda(Un)) often performs better or maintains higher scores than structured 2:4 pruning, especially on more complex tasks (T-Eval, LongBench). Structured pruning, while hardware-friendly, seems to remove too much critical information for agentic tasks.

  • Bit-width: Comparison of 4-bit (INT4) vs. 8-bit (INT8) quantization shows that 4-bit can be effective, but with clear trade-offs, especially in real-world applications. The paper highlights that 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%.

  • Model Size and Distillation: The comparison across small, standard, and distilled LLMs reveals that larger models are more robust to compression, while smaller and distilled models are more sensitive or inherently lack agentic capabilities. This acts as an "ablation" on the base model's capacity and training paradigm.

    The statistical analysis tools (ERank, Top-K Ranking Consistency, Energy-based Analysis) also serve as a form of deeper "analysis" into how the compression affects the models internally, providing insights into the structural changes, prediction consistency, and confidence patterns.
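
The three analysis tools can be sketched compactly. The definitions below are common formulations (effective rank as the exponential of the singular-value entropy, top-k overlap between base and compressed next-token rankings, and a logsumexp-based energy score); the paper's exact formulations may differ in detail, so treat this as an illustrative reconstruction.

```python
import torch

def effective_rank(matrix: torch.Tensor) -> float:
    """Effective rank as exp(entropy of the normalized singular-value distribution)
    (Roy & Vetterli, 2007); the paper's ERank may be defined slightly differently."""
    s = torch.linalg.svdvals(matrix.float())
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy).item()

def topk_overlap(base_logits: torch.Tensor, comp_logits: torch.Tensor, k: int = 10) -> float:
    """Fraction of the base model's top-k next-token predictions that the
    compressed model also places in its own top k."""
    base_top = set(base_logits.topk(k).indices.tolist())
    comp_top = set(comp_logits.topk(k).indices.tolist())
    return len(base_top & comp_top) / k

def energy(logits: torch.Tensor, temperature: float = 1.0) -> float:
    """Free-energy-style score E(x) = -T * logsumexp(logits / T); lower energy
    indicates higher model confidence on the next-token distribution."""
    return (-temperature * torch.logsumexp(logits / temperature, dim=-1)).item()
```

In this framing, ERank tracks structural changes in weight or representation matrices, top-k overlap tracks how much the compressed model's ranked predictions drift from the base model, and energy tracks shifts in prediction confidence.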

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces ACBench, the first comprehensive benchmark specifically designed to evaluate the impact of LLM compression on agentic capabilities. It moves beyond traditional language modeling and NLU benchmarks to focus on four critical agentic areas: Action Execution, Workflow Generation, Long-Context Understanding, and Real-world Application. By employing quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT) across 15 diverse LLMs, the study reveals crucial trade-offs. It finds that 4-bit quantization generally preserves workflow generation and tool use capabilities well (1%-3% performance drop), but significantly degrades real-world application accuracy (10%-15% drop). The introduction of ERank, Top-k Ranking Correlation, and Energy-based Analysis provides novel, fine-grained tools for understanding the internal effects of compression. ACBench provides a practical framework and actionable insights for optimizing LLM compression in agentic scenarios, paving the way for more efficient and capable LLM-based agents in real-world deployments.

7.2. Limitations & Future Work

The authors explicitly acknowledge several limitations:

  • Scope of Compression Methods: The study is limited to post-training quantization and sparsification. It does not include quantization-aware training (QAT) approaches, which integrate quantization into the training process and could potentially yield better results.

  • Compatibility with Inference Frameworks: The analysis is restricted to compression methods compatible with vLLM (Kwon et al., 2023), a popular inference framework. This excludes other promising techniques (e.g., QuaRot (Ashkboos et al., 2024)) that might have higher computational overhead or different compatibility requirements.

  • Default Configurations: The experiments use default configurations for compression methods. The authors did not explore variations in parameters such as group size, which could potentially optimize performance for specific scenarios.

  • Interpretability: While novel metrics are introduced, further work might be needed to fully understand the intricate relationships between internal model changes and high-level agentic performance.

    As for future work, the paper implicitly suggests several directions:

  • Developing compression techniques (quantization, pruning, or distillation) specifically optimized for preserving agentic capabilities, particularly for real-world applications.

  • Exploring QAT or hybrid compression approaches that combine techniques to achieve better trade-offs.

  • Further refining analytical tools like ERank, Top-k Ranking Correlation, and Energy-based Analysis to provide even deeper insights into LLM internals under compression.

  • Investigating strategies to improve the agentic capabilities of distilled models, as they currently show significant degradation.

  • Expanding ACBench to include more diverse agentic tasks and LLM architectures.

7.3. Personal Insights & Critique

This paper offers a highly valuable and timely contribution to the field of LLM compression and agentic AI. The core idea—that traditional compression benchmarks are insufficient for LLM-based agents—is a critical insight that will undoubtedly influence future research and development. The ACBench provides a much-needed, comprehensive tool for this new evaluation paradigm.

Innovations: The introduction of ACBench itself is a major innovation. Prioritizing agentic capabilities like workflow generation, tool use, and real-world application moves the conversation beyond mere perplexity and NLU accuracy, which are increasingly less representative of how LLMs are actually being deployed. The novel statistical analysis tools (ERank, Top-k Ranking Correlation, Energy-based Analysis) are particularly insightful. They provide a more granular understanding of why performance changes, rather than just that it changes. This is crucial for guiding the development of more effective compression algorithms.

Strengths: The empirical evaluation is extensive, covering a wide range of models and compression methods. The finding that 4-bit quantization largely preserves workflow generation and tool use is encouraging for practical deployment, but the significant drop in real-world application accuracy is a stark warning. The observation about distilled models underperforming in agentic tasks is also a critical result, highlighting a fundamental challenge in current distillation paradigms when it comes to complex, interactive behaviors. The clear distinction between string and JSON output performance under compression provides practical guidance for developers designing agent prompts.

Potential Issues/Areas for Improvement:

  • Generalizability of vLLM Compatibility: While pragmatic for deployment, restricting to vLLM-compatible methods might exclude advanced compression techniques that could offer better performance but require custom inference engines. Future work could explore the best-performing methods regardless of current vLLM compatibility to push the boundaries of what's possible.

  • Hyperparameter Tuning: The reliance on default configurations for compression parameters is a limitation. Different group sizes for quantization or varied sparsity levels could yield different trade-offs. A more extensive hyperparameter search would provide a richer understanding of the optimal compression configurations for specific agentic tasks.

  • Causal Link between Internal Metrics and Performance: While ERank and Energy-based Analysis are introduced to systematize analysis, the paper establishes correlations rather than strong causal links between these internal metrics and specific agentic task failures. Further research could delve deeper into how changes in ERank or energy distributions directly lead to, for example, a failure in planning or tool invocation.

  • Teacher Model Quality for Distillation: The paper attributes the poor performance of distilled models to the teacher model lacking robust agentic capabilities. This is a plausible hypothesis. Future work could test this by distilling from teacher models explicitly trained or fine-tuned for strong agentic performance to see if the student models retain these skills.

  • Dynamic Compression: The study focuses on static post-training compression. Future work could explore dynamic compression strategies that adapt based on the complexity of the agentic task or the current environmental context.

    Transferability: The ACBench framework and the novel analytical metrics are highly transferable. Researchers and practitioners can immediately adopt ACBench to evaluate their own compression methods or custom LLMs for agentic use cases. The insights into quantization being generally more robust than pruning for agentic tasks (with caveats for real-world applications) can guide initial compression strategy choices in various domains, from robotics to enterprise automation. The findings regarding JSON output degradation are directly applicable to tool-use and function-calling scenarios where structured data is essential. This paper serves as a foundational step towards building truly capable and efficient LLM-based agents.
