1. Bibliographic Information

1.1. Title

Jenga: Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity

1.2. Authors

Tuowei Wang (Tsinghua University)
Xingyu Chen (Tsinghua University)
Kun Li (Microsoft Research)
Ting Cao (Microsoft Research)
Ju Ren (Tsinghua University)
Yaoxue Zhang (Tsinghua University)

1.3. Journal/Conference

This paper is included in the Proceedings of the 2025 USENIX Annual Technical Conference (ATC '25). USENIX ATC is a highly reputable conference in the systems and operating systems research community, known for publishing high-quality, impactful work on advanced computing systems. Its inclusion signals a significant contribution to the field.

1.4. Publication Year

2025

1.5. Abstract

The increasing demand for long-context applications necessitates extending the context windows of Large Language Models (LLMs). While recent fine-tuning approaches have successfully expanded context lengths, they face critical practical limitations due to high memory footprints, particularly for activations. Current parameter-efficient fine-tuning (PEFT) methods primarily reduce parameter update overhead, neglecting activation memory constraints. Similarly, existing sparsity mechanisms improve computational efficiency but overlook activation memory optimization because of a phenomenon termed Shadowy Activation.

This paper introduces Jenga, the first LLM fine-tuning system to explore and exploit a novel token-level sparsity mechanism inherent in long-context scenarios, named Contextual Token Sparsity. Jenga aims to minimize redundant token involvement by evaluating the informativeness of token embeddings while maintaining model accuracy. It proposes three key techniques: (1) Token Elimination, which dynamically identifies and excludes redundant tokens across varying inputs and layers; (2) Pattern Prediction, which uses well-trained predictors to approximate token sparsity patterns with minimal overhead; and (3) Kernel Optimization, which employs permutation-free and segment-based strategies to boost system performance. Jenga is implemented as an end-to-end fine-tuning system compatible with various LLM architectures and other optimization techniques. Comprehensive evaluations demonstrate that Jenga reduces memory consumption by up to $1.93 \times$ and achieves up to $1.36 \times$ speedups, outperforming state-of-the-art fine-tuning systems.

1.6. Original Source Link

https://www.usenix.org/system/files/atc25-wang-tuowei.pdf (This is the official PDF link, indicating a published paper).

2. Executive Summary

2.1. Background & Motivation

The paper addresses a critical challenge in the deployment and scaling of Large Language Models (LLMs): their limited context window. A context window refers to the maximum number of tokens (words or sub-word units) an LLM can process at once. As applications demand increasingly comprehensive document analysis, extended multi-turn dialogues, and intricate codebase handling, LLMs need to handle much longer input sequences. However, most LLMs are pre-trained with fixed, relatively small context windows (e.g., 4K tokens for Llama2). When encountering inputs exceeding this limit, their performance degrades significantly.

While recent research has shown that fine-tuning (adapting a pre-trained model to a specific task or dataset) can extend these context windows, this process is extremely resource-intensive. The primary bottleneck is not the model parameters themselves, but the activations (intermediate results and gradients generated during computation), which scale proportionally with the sequence length. For instance, extending Llama models from 2K to 32K context length previously required 128 A100 80GB GPUs, mainly due to memory limitations.

Existing efficient fine-tuning techniques fall short in addressing this activation memory bottleneck:

Parameter-Efficient Fine-Tuning (PEFT) methods (e.g., LoRA) reduce the memory needed for parameter updates by only modifying a small subset of parameters. However, they do not optimize activation memory.
Sparsity mechanisms (e.g., in LongLoRA) aim to improve computational efficiency by approximating dense attention with sparse patterns. Yet, these methods typically focus on sparsity within the hidden dimension of individual tokens, meaning that even if a token is only minimally used, its activations are still retained in memory. The authors term this problem Shadowy Activation. Since the entire token sequence remains involved, these methods fail to provide substantial activation memory reduction.

This critical gap highlights the need for a novel LLM fine-tuning scheme that minimizes token involvement, thereby optimizing both memory and computational efficiency by directly tackling activation memory constraints.

2.2. Main Contributions / Findings

The paper introduces Jenga, a novel system for enhancing long-context fine-tuning of LLMs, built upon a new token-level sparsity mechanism. The main contributions and findings are:

Identification of Contextual Token Sparsity: Jenga identifies and exploits a new sparsity mechanism inherent in long-context fine-tuning. This mechanism recognizes that natural language exhibits significant redundancy, especially in long sequences, allowing for the exclusion of less informative tokens at the token level. This is the first fine-tuning system to leverage such token-level sparsity to optimize both memory and computational efficiency, directly addressing the Shadowy Activation problem.
Development of Three Key Techniques: Jenga proposes a heuristic fine-tuning scheme encompassing both algorithmic and system-level optimizations:
1. Information-driven Token Elimination: A score-based algorithm that dynamically identifies and removes redundant tokens across different inputs and layers, preserving model accuracy. It uses a block-wise approach and layer-specific thresholds.
2. Context-aware Pattern Prediction: A neural-network-based approach to efficiently predict token sparsity patterns at runtime, avoiding the costly computation of full attention scores. It employs elastic size transformation to minimize predictor overhead.
3. High-performance Kernel Optimization: Hardware-efficient techniques, including a permutation-free strategy (fusing token movement operations) and a segment-based peak-cutting method (partitioning loss computation), to mitigate memory access costs and activation memory peaks.
End-to-end System Implementation and Compatibility: Jenga is implemented as an end-to-end fine-tuning system that is compatible with various LLM architectures (e.g., OPT, Llama families) and can seamlessly integrate with other optimization techniques (e.g., 2D-Sparsity, Sparsity-sensitive Offload) for further performance enhancement.
Comprehensive Evaluation Results: Extensive evaluations across diverse LLM families and hardware platforms demonstrate significant performance gains:
- Memory Savings: Jenga achieves an average memory reduction of $38.2\%$ (for 4K sequence length) and $50.5\%$ (for 8K sequence length) compared to LoRA, and up to $1.93 \times$ memory savings overall. It effectively doubles the achievable fine-tuning sequence length under GPU memory constraints.
- Speedups: Jenga achieves comparable computational efficiency to LongLoRA, with an average speedup of $10.8\%$ and $8.6\%$ over LoRA on different platforms, reaching up to $1.36 \times$ speedups for longer sequences.
- Accuracy Preservation: Jenga maintains model accuracy, incurring only a minimal increase in perplexity scores compared to the original LoRA across various sequence lengths and achieving comparable accuracy on the LongBench benchmark.
  
  These findings strongly validate that the inherent redundancy in long-context sequences can be effectively exploited to significantly improve the efficiency of LLM fine-tuning without compromising model accuracy.

3.1. Foundational Concepts

To understand Jenga, a foundational grasp of several concepts related to Large Language Models (LLMs) and their training is essential.

Large Language Models (LLMs): These are advanced artificial intelligence models, often based on the Transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. They excel at tasks like translation, summarization, question-answering, and creative writing. Examples include GPT series, Llama, and OPT.
Fine-tuning: This is a process where a pre-trained LLM, which has learned general language patterns, is further trained on a smaller, specific dataset to adapt its knowledge to a particular downstream task or domain. For example, fine-tuning an LLM on medical texts to improve its performance in medical question-answering.
Context Window (or Context Length): This refers to the maximum number of tokens (sub-word units like "the", "cat", or "##ing") an LLM can consider when processing an input and generating an output. If an input sequence exceeds this length, the model typically truncates it, leading to information loss and degraded performance. Extending this window is crucial for handling long documents or conversations.
Attention Mechanism: A core component of Transformer models. It allows the model to weigh the importance of different parts of the input sequence when processing each token. Instead of processing a sequence word by word, attention looks at all words simultaneously, deciding which ones are most relevant to each other. The most common form is self-attention, which calculates query ( $Q$ $Q$ ), key ( $K$ $K$ ), and value ( $V$ $V$ ) vectors from the input tokens themselves.
- Standard Attention Formula: The scaled dot-product attention, which is fundamental to Transformers, is calculated as: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
  - $Q$ (Query): A matrix where each row corresponds to a query vector for each token in the input sequence.
  - $K$ (Key): A matrix where each row corresponds to a key vector for each token in the input sequence.
  - $V$ (Value): A matrix where each row corresponds to a value vector for each token in the input sequence.
  - $Q K^T$ : The dot product of the query and key matrices, which measures the similarity or "attention score" between each query and all keys.
  - $\sqrt{d_k}$ : The square root of the dimension of the key vectors, used for scaling to prevent very large dot products from pushing the softmax function into regions with tiny gradients.
  - $\mathrm{softmax}$ : A function that normalizes the attention scores into a probability distribution, ensuring they sum to 1.
  - $V$ : The value matrix, which is weighted by the softmax output to produce the final attention output.
- FlashAttention: An optimized attention algorithm designed for modern GPUs that significantly improves the speed and memory efficiency of the attention mechanism by reducing the number of reads/writes to GPU High Bandwidth Memory (HBM). It computes the attention scores and applies softmax in a single, memory-efficient kernel, avoiding the explicit materialization of the large attention matrix. Jenga integrates its custom training kernel into FlashAttention.
Activations: During the forward pass of a neural network, activations are the outputs of each layer's neurons after applying an activation function (like ReLU or SiLU). During the backward pass (gradient computation), these intermediate activations (and their gradients) need to be stored in memory. For deep networks and long sequences, these can consume a significant portion of GPU memory, often more than the model parameters themselves.
Sparsity: In the context of neural networks, sparsity refers to the property where many elements (e.g., weights, activations, attention scores) are zero or very close to zero. Exploiting sparsity can lead to reduced computation and memory usage.
- Hidden-dimension sparsity: Refers to sparsity within the feature dimension of token embeddings. For example, some neurons in a layer might have zero activations.
- Token-level sparsity: Refers to sparsity at the level of entire tokens, meaning some tokens in the sequence are deemed unimportant and can be entirely excluded from computation.

3.2. Previous Works

The paper contextualizes Jenga by discussing prior work in efficient LLM fine-tuning, activation memory optimization, and token utilization.

Efficient LLM Fine-tuning:
- Parameter-Efficient Fine-Tuning (PEFT) methods: These methods aim to reduce the memory footprint of fine-tuning by only updating a small subset of the model's parameters.
  - LoRA (Low-Rank Adaptation) [29]: A representative PEFT method. LoRA freezes the pre-trained model weights and injects small, trainable low-rank matrices into each Transformer block. This significantly reduces the number of trainable parameters and thus the memory needed for optimizer states (parameters, momentum, and variance stored in FP32). However, as highlighted in Jenga, LoRA does not reduce activation memory because gradient computation for these injected matrices still requires a traversal almost identical to full fine-tuning.
- Sparsity-based attention mechanisms: These methods leverage the observation that not all token interactions are equally important, approximating standard dense attention with sparse patterns to reduce computation.
  - LongLoRA [12]: Extends LoRA by incorporating a shifted sparse attention mechanism. It partitions input tokens into groups and performs attention within these groups, with a shift mechanism to allow cross-group information exchange. While LongLoRA achieves computational savings, Jenga argues it fails to provide additional memory reduction for activations due to Shadowy Activation. This is because its sparsity operates on hidden dimensions; all tokens are still "involved" in some computation, even if their interactions are limited.
Optimizations for Activation Memory: These methods primarily aim to manage the large memory footprint of activations during training.
- Activation Recomputation [11, 35, 36]: Instead of storing activations from the forward pass, they are recomputed during the backward pass as needed. This trades computational overhead for memory savings.
- Activation Offloading [27, 30, 55, 56]: Activations are asynchronously transferred from the GPU to CPU memory and prefetched back before they are required. This trades communication overhead for GPU memory savings.
- Activation Compression [9, 21, 44]: Reduces activation memory size through quantization (using lower precision) or pruning (removing insignificant activations). Jenga argues that these methods primarily trade memory for computation/communication rather than fundamentally reducing memory demands. Jenga's approach, by reducing token involvement, is complementary and can be combined with these existing optimizations.
Optimizations for Long-context Fine-tuning (Algorithm Design & Position Embeddings): Some works [10, 18, 43, 59, 68, 74, 80, 84] focus on algorithmic designs or modifying position embeddings (which encode token order) to allow LLMs to handle longer contexts. These are complementary to Jenga's efficiency focus.
Optimizations for Token Utilization (Inference): Several studies [7, 22, 23, 25, 32, 33, 34, 38, 40, 49, 58, 61, 63, 82, 75] explore leveraging natural language redundancy for data engineering, prompt compression, and inference optimization, including token elimination during inference to reduce latency.
- Jenga differentiates itself by being the first to extend attention-based token elimination to LLM long-context fine-tuning. Previous token elimination typically focuses on small-model inference where sparsity patterns are less dynamic across layers. Jenga addresses the challenge of Contextual Token Sparsity, where token importance varies dynamically across inputs and layers during fine-tuning.

3.3. Technological Evolution

The evolution of efficient LLM training and fine-tuning has progressed from full model training, which is extremely resource-intensive, to more memory-efficient approaches. Early advancements focused on optimizing general deep learning training (e.g., activation recomputation, mixed precision training). With the advent of massive Transformer models, Parameter-Efficient Fine-Tuning (PEFT) became prominent, significantly reducing the memory for optimizer states by only updating a small fraction of parameters. Simultaneously, sparsity-based methods emerged to reduce computational costs by making attention mechanisms sparse. However, as LLMs grew larger and context windows expanded, the activation memory bottleneck became increasingly dominant, largely unaddressed by PEFT and existing sparsity methods due to the Shadowy Activation phenomenon. Jenga represents the next evolutionary step by introducing token-level sparsity directly into the fine-tuning process, specifically targeting activation memory reduction, and combining algorithmic insights with system-level kernel optimizations to realize practical gains. It moves beyond hidden-dimension sparsity to a more fundamental token-wise exclusion.

3.4. Differentiation Analysis

Jenga's core differentiation lies in its novel approach to tackling the activation memory bottleneck during LLM long-context fine-tuning by introducing and exploiting Contextual Token Sparsity.

Addressing Shadowy Activation: Unlike existing sparsity-based methods (e.g., LongLoRA), which apply sparsity at the hidden-dimension level of individual tokens, Jenga operates at the token level. This means Jenga can entirely eliminate less informative tokens from computation, thus preventing the Shadowy Activation problem where activations of minimally used tokens are still retained. This direct token exclusion leads to substantial savings in activation memory, a benefit largely missed by previous sparsity techniques.
First to Apply Token Elimination to Fine-tuning: While token elimination has been explored for small-model inference, Jenga is the first to extend this concept to LLM long-context fine-tuning. This is a more complex problem because token importance is highly contextual and dynamic (varying across inputs and layers) during the learning process, unlike static inference. Jenga's Contextual Token Sparsity mechanism and its adaptive design (layer-specific thresholds, pattern prediction) are specifically tailored for this dynamic fine-tuning environment.
Holistic Optimization: Jenga combines algorithmic innovations (Information-driven Token Elimination, Context-aware Pattern Prediction) with system-level Kernel Optimization (permutation-free token movement, segment-based peak cutting). This holistic approach ensures that the algorithmic sparsity is efficiently translated into practical memory and speedup gains on hardware, addressing challenges like data movement and memory peaks that prior works might not fully tackle.
Complementary to Existing Methods: Jenga is designed to be compatible with and enhance existing techniques like PEFT (LoRA) and even other sparsity methods (through its 2D-Sparsity extension), offering a synergistic approach rather than a replacement. Its focus on activation memory directly complements PEFT's focus on parameter memory.

4. Methodology

Jenga is an efficient system designed to enhance LLM long-context fine-tuning by systematically exploring and exploiting contextual token sparsity. Its design is rooted in the observation that natural language exhibits significant redundancy, particularly in long sequences, allowing for the approximation of full attention by focusing on a subset of informative tokens.

4.1. Principles

The core principle behind Jenga is Contextual Token Sparsity. This novel sparsity mechanism in LLM long-context fine-tuning is based on the inherent redundancy of natural language. The key insights are:

Token-wise Importance: Not all tokens in a long sequence are equally important. Some tokens carry more informative weight in influencing the model's output than others. This allows for the exclusion of less valuable tokens.
Contextual Variability: The informativeness of tokens is dynamic. The most important tokens vary across different input texts and even across different layers within the same model for a given input. This dynamic nature necessitates an adaptive system capable of identifying and exploiting this sparsity in real-time.

By directly reducing the involvement of redundant tokens, Jenga aims to alleviate the Shadowy Activation constraint, leading to proportional benefits in both activation memory savings and computational efficiency.

4.2. Core Methodology In-depth (Layer by Layer)

Figure 5 provides an overview of the Jenga system, which is built upon three fundamental techniques: Information-driven Token Elimination, Context-aware Pattern Prediction, and High-performance Kernel Optimization.

4.2.1. Information-driven Token Elimination

This technique is responsible for accurately identifying and excluding redundant tokens.

Token Informativeness Definition: Jenga defines the informativeness of a token based on its interactions with other tokens within the embedding space, drawing inspiration from attention scores. The attention score $S_{ij}$ (calculated as $Q_i K_j$ ) quantifies the interaction between token $i$ and token $j$ . The informativeness of a token $T_j$ , denoted $I(T_j)$ , is defined as the sum of its interaction scores with all other tokens in the sequence: $I(T_{j}) = \sum_{i\neq j}S_{ij} = \sum_{i\neq j}Q_{i}K_{j} \quad (1)$
- $T_j$ : Represents the $j$ -th token in the sequence.
- $S_{ij}$ : The attention score representing the interaction between token $i$ and token $j$ .
- $Q_i$ : The query vector for token $i$ .
- $K_j$ : The key vector for token $j$ .
- The sum $\sum_{i\neq j}$ aggregates the attention scores across all tokens $i$ in the sequence, excluding token $j$ itself, to quantify how much token $j$ interacts with the rest of the context.
Block-wise Elimination: To align with hardware characteristics and process long sequences efficiently, Jenga performs token elimination in a block-wise manner. As illustrated in Figure 6, attention scores are partitioned along the token dimension into multiple score blocks ( $H^S$ ). For each block, Jenga aggregates the positive attention scores across all attention heads and then takes the maximum value within that block as its informativeness score $I(\mathcal{B}_{nm}^S)$ . Positive scores are summed because negative values (after softmax) have negligible impact and could offset positive influences. This process is formalized as: $I(\mathscr{B}_{nm}^S) = \max_{S_{ij}\in \mathscr{B}_{nm}^S}S_{ij} = \max_{S_{ij}\in \mathscr{B}_{nm}^S}\frac{\sum_{h,S_{ij}^t > 0}S_{ij}^t}{N_{\mathrm{head}}} \quad (2)$
- $\mathscr{B}_{nm}^S$ : Represents a specific score block, indexed by $n$ (for query blocks) and $m$ (for key blocks).
- $S_{ij}$ : An attention score within the block.
- $S_{ij}^{t}$ : The attention score between token $i$ and token $j$ in attention head $h$ . The superscript $t$ here is likely a typo and should refer to a specific head, or $S_{ij}^h$ . Given $N_{head}$ is a scaling factor, it implies a summation across heads. Assuming $S_{ij}^t$ refers to the score from a specific attention head $h$ .
- $N_{\mathrm{head}}$ : The total number of attention heads.
- The term $\sum_{h,S_{ij}^t > 0}S_{ij}^t$ sums only the positive attention scores across all attention heads for a given token pair.
- $\max_{S_{ij}\in \mathscr{B}_{nm}^S}S_{ij}$ : The maximum attention score within the block $\mathscr{B}_{nm}^S$ . The rationale for block-wise elimination is twofold: first, with relatively small block sizes, most blocks in long-context sequences contain no important tokens and can be safely eliminated; second, the informativeness score $I(\mathcal{B}^S)$ is robust enough to identify and retain blocks containing even a few important tokens, as their scores are significantly higher than unimportant ones, preventing them from being averaged out. This prioritizes accuracy over minimal computational savings from retaining a few unimportant tokens.
The following image (Figure 6 from the original paper) illustrates the token elimination algorithm:

该图像是一个示意图，展示了 Llama2 和 Llama3 在前向和后向过程中的内存峰值。Llama2 的内存峰值约为总量的 18%，而 Llama3 的内存峰值约为 25%。

Figure 6: Token elimination algorithm. Attention scores are first aggregated across different heads and partitioned into multiple score blocks. The maximum value within each score block is defined as its informativeness score, which is then aggregated along the column. These resulting scores are compared against a layer-specific threshold to determine whether the corresponding tokens should be retained.

Layer-specific Threshold: The informativeness scores of token blocks ( $\mathcal{B}^T$ , derived by aggregating corresponding score blocks) are compared against a threshold to decide whether the tokens within that block are retained or eliminated. Jenga refines this by using a layer-specific threshold. This is crucial because different layers within an LLM exhibit varying sparsity patterns and average informativeness scores (as shown in Figure 7), making a universal threshold suboptimal. Algorithm 1 outlines this approach: it initializes a default threshold for all layers based on score profiling and then fine-tunes these values to align with each layer's unique sparsity characteristics using a gradient-based update.

The following image (Figure 7 from the original paper) shows the variation in average informativeness scores across layers:

fig 7 该图像是一个示意图，展示了 2D 稀疏性在长期上下文微调中的应用，包括标记级别和隐藏维度级别的稀疏性。左侧部分展示了 token 级稀疏的输入结构，右侧部分则对比了 Dense、Naïve 和 Jenga 方法在 fetch 和 offload 过程中的操作流程。重要的稀疏性敏感性模块被突出强调。

Figure 7: The average informativeness scores (normalized) of tokens blocks across different layers.

The following is Algorithm 1: Layer-Specific Threshold Optimization from the original paper:

Input: Model layers L = {L1, L2, …, Ln}

Output: Layer thresholds T = {T1, T2, …, Tn}

// Step 1: Threshold Initialization

foreach layer Li ∈ L do
// Average scores across token blocks

Ti ← avg(I(θT)∀B T ∈ Li);

// Step 2: Threshold Fine-Tuning

foreach layer Li ∈ L do

// Compute gradient with finite changes
G<sub>i</sub> ← (acc(Ti + ϵ) - acc(Ti - ϵ)) / 2ϵ;

// Update threshold based on gradient

Ti ← Ti + η · Gi;

return T

$L$ : Set of model layers.
$T$ : Set of layer-specific thresholds.
$L_i$ : An individual layer in the model.
$\text{avg}(I(\theta T) \forall B^T \in L_i)$ : The average informativeness score of all token blocks ( $B^T$ ) within layer $L_i$ . This initializes the threshold.
$G_i$ : Gradient of accuracy with respect to the threshold for layer $L_i$ , computed using finite differences.
$\text{acc}(T_i + \epsilon)$ : Accuracy when the threshold is slightly increased by $\epsilon$ .
$\text{acc}(T_i - \epsilon)$ : Accuracy when the threshold is slightly decreased by $\epsilon$ .
$\eta$ : Learning rate for updating the threshold.
The algorithm first sets an initial threshold based on the average informativeness across token blocks in each layer, then iteratively fine-tunes these thresholds by computing a gradient of model accuracy with respect to small changes in the threshold, updating them to optimize performance.

Extend to MLP Block: Jenga also extends token elimination to the Multi-Layer Perceptron (MLP) block. Similar to attention scores, intermediate activations within the MLP block are used to evaluate token informativeness. For ReLU-based MLP structures, these are the outputs of the ReLU layer. For SiLU-based structures, they correspond to the element-wise multiplication of the gate projection (after SiLU) and the up projection. This ensures consistent sparsity application across different model components.

4.2.2. Context-aware Pattern Prediction

Directly computing full attention scores to derive sparsity patterns is quadratically expensive. To address this and the dynamic nature of contextual token sparsity, Jenga uses lightweight neural networks as predictors to infer sparsity patterns efficiently.

Neural-network-based Predictor: As depicted in Figure 8, each layer is equipped with a pair of predictors: one for queries ( $Q$ ) and one for keys ( $K$ ). Each predictor consists of three trainable low-rank matrices with ReLU activation functions between them. The inputs to these predictors are token embeddings ( $X$ ), organized into blocks. By extracting a representative embedding from each block, the predictors output approximate informativeness scores, $\hat{I}(Q)$ and $\hat{I}(K)$ , for the respective token blocks. These scores are then multiplied to approximate the informativeness of attention scores ( $\hat{I}(S_{\mathrm{attn}})$ ) for block pairs: $\hat{I} (S_{\mathrm{attn}}) = \hat{I} (Q)\hat{I} (K)^T,\hat{I} (B_{mn}^S) = \hat{I} (B_m^Q)\hat{I} (B_n^K)^T \quad (3)$
- $\hat{I}(Q)$ : The predicted informativeness score for query blocks.
- $\hat{I}(K)$ : The predicted informativeness score for key blocks.
- $\hat{I}(S_{\mathrm{attn}})$ : The approximated informativeness score for the full attention matrix.
- $\hat{I}(B_{mn}^S)$ : The approximated informativeness score for a specific attention score block.
- $\hat{I}(B_m^Q)$ : The predicted informativeness score for the $m$ -th query token block.
- $\hat{I}(B_n^K)$ : The predicted informativeness score for the $n$ -th key token block. These predictors are designed to converge quickly with limited training data. Their size is restricted to the dimension of a single token block, not the full sequence, minimizing prediction overhead.
The following image (Figure 8 from the original paper) illustrates the pattern prediction process:

Figure 8: Pattern prediction process. Each layer is equipped with two predictors to approximate $Q$ and $K$ , respectively. Taking token embeddings as input (organized in token blocks), each predictor outputs the informative score, $\hat{I} (B_{m}^{Q})$ and $\hat{I} (B_{n}^{K})$ , for each token blocks. These scores are then multiplied to compute the informative scores, $\hat{I} (B_{m n}^{S})$ , for blocked attention scores. Besides, elastic size transformation is employed to independently minimize the predictor size for each layer.
Elastic Size Transformation: To further reduce predictor size and overhead, Jenga incorporates an elastic size transformation technique. This dynamically prunes neurons in the predictors based on their activation sparsity. Leveraging the properties of the ReLU activation function (which can produce many zero elements), neurons corresponding to consistently zero intermediate activations are considered inactive and safely pruned. Jenga tracks the zero frequency of intermediate activations during training and periodically prunes neurons with the highest zero frequencies. This adaptively determines the optimal size for each predictor, reducing both computational and memory overhead by an average of $50\%$ .
Comprehensive Overhead Analysis:
- Offline Training: The main bottleneck is obtaining accurate attention informativeness scores ( $I(S_{\mathrm{attn}})$ ). Jenga integrates its custom training kernel into FlashAttention, allowing $I(S_{\mathrm{attn}})$ to be derived online without explicit full attention score computation and storage, leading to a memory complexity that grows linearly with sequence length.
- Online Inference: Given sequence length $s$ $s$ , head dimension $h$ $h$ , and block size $b$ $b$ :
  - Computational overhead for predicting $\hat{I}(Q)$ or $\hat{I}(K)$ is $O(s h^2)$ .
  - Computational overhead for predicting $\hat{I}(S_{\mathrm{attn}})$ is $O(s^2 / b^2)$ . The second term dominates in long-context scenarios but can be mitigated by increasing $b$ .
  - Memory overhead for linear weights within predictors is $O(bh^2)$ , which is constant relative to model configurations. Thanks to elastic size transformation, both computational and memory complexities are further reduced by a sparsity factor (average $50\%$ ).

4.2.3. High-performance Kernel Optimization

Jenga's focus on token-level sparsity introduces specific performance challenges related to data movement and memory peaks. Jenga incorporates hardware-efficient kernel-level techniques to mitigate these.

Permutation-free Token Movement: Contextual token sparsity results in varying sparsity patterns across layers, meaning different subsets of tokens are selected at each layer. As illustrated in Figure 9, naive implementations would involve:
1. Eliminating less informative tokens.
2. Re-permuting the remaining tokens.
3. Padding attention outputs with zeros to maintain dimensional consistency.
4. Residually adding these padded outputs to the original inputs. These steps involve extensive global memory movement and allocation, leading to high latency. Jenga develops a permutation-free strategy by fusing these operations with attention computations:
- Selective Load: Instead of materializing and re-permuting retained tokens, Jenga directly loads the selected tokens from their original input positions.
- In-place Addition: It performs an in-place addition of the attention outputs directly to the original inputs, simultaneously completing token padding (by implicitly ignoring eliminated tokens) and residual addition in one step. During backpropagation, the original inputs are recomputed by subtracting the self-attention output from the output embedding matrix, incurring minimal overhead. This streamlines the process, eliminating unnecessary memory allocation and global memory movement, significantly enhancing performance.
The following image (Figure 9 from the original paper) compares naive token movement with JENGA's permutation-free strategy:

Figure 9: Comparison between naive token movement and JENGA. Highlighted by the red lines, the naive kernel incurs substantial global memory movement costs. JENGA develops a permutation-free strategy through kernel fusion.
Segment-based Peak Cutting: LLMs are often autoregressive, predicting the next token. During fine-tuning, for each token in a long input sequence, the model generates a probability distribution over the large vocabulary size and computes loss. This causes a significant, transient surge in activation memory usage (a memory peak), which elevates the overall GPU memory requirements (Figure 10). To address this, Jenga adopts a segment-based peak-cutting strategy. It partitions the token sequence into smaller, manageable segments during the final loss computation. Instead of performing a forward pass over the entire sequence and retaining all intermediate activations, Jenga processes each segment independently. Activations for a segment are discarded immediately after its gradient computation. This reduces the activation memory peak to $1/N$ when the sequence is divided into $N$ segments, significantly alleviating memory pressure on individual GPUs and remaining compatible with multi-GPU setups.

The following image (Figure 10 from the original paper) illustrates the memory peak during loss gradient computation:

Figure 10: Memory peak during loss gradient computation, exacerbated by the large vocabulary size and long context. The $\mathbf{X}$ -axis represents the timeline of a fine-tuning epoch.

4.3. Implementation and Extension

Jenga is implemented with over 3000 lines of Python and C++ code. Due to minimal changes to the core fine-tuning dynamics, it is compatible with various LLM architectures without requiring code modifications. It also supports seamless integration with other optimization techniques:

Extension 1: Two-dimensional Sparsity (2D-Sparsity): After applying Jenga's token-level sparsity, the remaining tokens can further benefit from existing hidden-dimension-level sparsity techniques (e.g., neuron sparsity). This combination allows for more granular control over resource allocation, leading to further reductions in activation memory and computational costs. The following image (Figure 11(a) from the original paper) depicts Two-dimensional Sparsity:

Figure 11: Two available extensions of JENGA: (a) Two-dimensional Sparsity and (b) Sparsity-sensitive Offload.
Extension 2: Sparsity-sensitive Offload: Jenga enhances existing offload-based techniques by incorporating contextual token sparsity into the optimization. A sparsity-sensitive offloading strategy adapts to varying sparsity ratios across layers, enabling more efficient data transfer between CPU and GPU, thereby alleviating GPU memory constraints. The following image (Figure 11(b) from the original paper) illustrates Sparsity-sensitive Offload:

Figure 11: Two available extensions of JENGA: (a) Two-dimensional Sparsity and (b) Sparsity-sensitive Offload.

5. Experimental Setup

5.1. Datasets

The experiments utilize a diverse set of datasets to evaluate Jenga's long-context fine-tuning capabilities and performance across different tasks:

RedPajama [14]: Used for the core long-context fine-tuning process, following the setup from LongLoRA. RedPajama is a large-scale open dataset for training LLMs, known for its diverse and extensive text content, making it suitable for extending context windows.
PG19 [57]: A book corpus dataset used for evaluating the perplexity (PPL) of fine-tuned models. It consists of books, which are inherently long-form content, making it appropriate for assessing a model's long-context modeling performance.
Cleaned Arxiv Math proof-pile dataset [3]: Another dataset used for perplexity (PPL) evaluation. This dataset focuses on mathematical proofs from Arxiv, representing a domain with highly structured and often long, intricate textual content.
LongBench [5] benchmark: Used for evaluating model accuracy. LongBench is a bilingual, multitask benchmark specifically designed for long-context understanding. It covers various critical long-text application areas, providing a comprehensive assessment. The instruction-tuning for this benchmark was done on LongAlign-10k [4], a dataset geared towards aligning LLMs for long contexts.

These datasets were chosen to provide a robust demonstration of Jenga's scalability and versatility across different content types and task complexities, effectively validating its performance in various long-text application scenarios.

5.2. Evaluation Metrics

The effectiveness of Jenga is measured using several key metrics that assess memory efficiency, computational speed, and model accuracy.

Memory Footprint (GB): This metric quantifies the peak GPU memory consumption during the fine-tuning process. Measurements are taken immediately after the forward pass, which typically corresponds to the highest memory usage. Lower values indicate better memory efficiency.
Execution Time (seconds): This metric measures the time taken per fine-tuning step. Speedup is calculated as the ratio of the baseline's execution time to Jenga's execution time ( $Speedup = T_{baseline} / T_{Jenga}$ ). Higher speedup values indicate better computational efficiency.
Perplexity (PPL):
1. Conceptual Definition: Perplexity is a common intrinsic evaluation metric for language models. It measures how well a probability distribution or language model predicts a sample. In simpler terms, it indicates how "surprised" the model is by a given text. A lower perplexity score means the model assigns a higher probability to the text, indicating that it is a better model of the text. For long-context models, it assesses how well the model can maintain coherence and prediction accuracy over extended sequences.
2. Mathematical Formula: For a sequence of words $W = (w_1, w_2, ..., w_N)$ , the perplexity is defined as: $\mathrm{PPL}(W) = P(w_1, w_2, ..., w_N)^{-1/N}$ In practice, using the chain rule of probability, this is often computed as: $\mathrm{PPL}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i|w_1, ..., w_{i-1})\right)$
3. Symbol Explanation:
  - $W$ : A sequence of $N$ words (or tokens).
  - $N$ : The total number of words/tokens in the sequence.
  - $P(w_1, w_2, ..., w_N)$ : The joint probability of the entire sequence, according to the language model.
  - $P(w_i|w_1, ..., w_{i-1})$ : The probability of the $i$ -th word given all preceding words in the sequence, as predicted by the language model.
  - $\log$ : Natural logarithm.
  - $\exp$ : Exponential function (base $e$ ).
Accuracy: This metric is task-specific for the LongBench benchmark. For tasks like multi-choice questions, it could be the percentage of correctly answered questions. For summarization or generation tasks, it might involve metrics like ROUGE or F1 scores. The paper references the LongBench benchmark, which includes various task types (e.g., question answering, summarization), and for which "higher is better" typically applies to accuracy scores.

5.3. Baselines

Jenga's performance is compared against two state-of-the-art fine-tuning methods, representing the dominant optimization directions:

LoRA (Low-Rank Adaptation) [29]: A representative Parameter-Efficient Fine-Tuning (PEFT) method. LoRA is chosen because it significantly reduces the memory footprint for optimizer states by updating only a small number of injected low-rank matrices, making it a popular and efficient baseline for fine-tuning.
LongLoRA [12]: This method builds upon LoRA by incorporating a shifted sparse attention mechanism. It is chosen as a representative of sparsity-based fine-tuning techniques. It aims to improve computational efficiency by introducing sparsity into the attention mechanism, allowing for efficient scaling of context length.

For speedup analysis, the primary comparison is with LoRA, with LongLoRA serving as a reference. This distinction is made because Jenga and LongLoRA address orthogonal sparsity dimensions (token-level vs. hidden-dimension level), making direct, fair speedup comparisons complex in some scenarios.

5.4. Hardware

Experiments were conducted on three representative hardware platforms to demonstrate Jenga's performance across different GPU capacities and capabilities. The following are the hardware platform configurations from Table 4 of the original paper:

Platform	GPUs	Memory	FP32 TFLOPS	BF16 TFLOPS
Platform A	1 x A800	80GB	19.5	312
Platform B	1 x A40	48GB	37.4	150
Platform C	4 x 4090	24GB	82.6	82.6

Memory measurements were primarily performed on Platform A (A800) due to its large GPU memory capacity, as memory consumption is largely independent of GPU arithmetic performance. Speedup evaluations employed mixed-precision techniques (BF16 and FP32) for efficiency, a common practice.

5.5. Models

The evaluation includes models from two of the most popular LLM families, covering a range of architectures, parameter sizes, and default context window sizes, ensuring a robust demonstration of Jenga's scalability and versatility. The following are the model configurations from Table 5 of the original paper:

Model	# params.	Def Len.	Seq Len.
OPT [81]	350M/1.3B/2.7B/6.7B	2K	2K-64K
Llama2 [67]	7B	4K	4K-64K
Llama3 [19]	8B	8K	4K-64K

# params.: Number of parameters in the model.
Def Len.: Default context window length the model was originally pre-trained with.
Seq Len.: The extended sequence lengths used for fine-tuning experiments with Jenga.

5.6. Implementation Details

Jenga is implemented with over 3000 lines of Python and C++ code. Its design ensures minimal modifications to the original fine-tuning dynamics, making it compatible with a wide range of LLM architectures without requiring code changes. For evaluations, activation recomputation and offloading techniques are generally excluded unless explicitly stated, to highlight Jenga's intrinsic benefits. All reported metrics are averaged over 10 repeated trials to ensure reliability and statistical significance.

6. Results & Analysis

6.1. Core Results Analysis

Memory Footprint

Jenga's primary objective is to reduce activation memory consumption during long-context fine-tuning. The following are the memory footprint (GB) comparison across different fine-tuning methods from Table 1 of the original paper. LoRA and LongLoRA are representative of PEFT and sparsity-based methods, respectively ( $S = 4K$ ):

Model	Llama-2-7B	Llama3-8B	Mistral-7B	OPT-6.7B
Naive	67.9	78.4	73.4	63.8
LoRA	39.2	43.4	39.3	36.1
LongLoRA	41.3	43.9	39.3	38.1
Jenga	31.3	34.5	31.4	30.0

As an initial comparison, for a sequence length of 4K, Jenga significantly reduces memory usage compared to Naive fine-tuning, LoRA, and LongLoRA across various LLMs. For instance, for Llama-2-7B, Jenga uses 31.3GB compared to LoRA's 39.2GB and LongLoRA's 41.3GB. This immediately demonstrates Jenga's effectiveness in tackling the memory bottleneck.

The following image (Figure 12 from the original paper) presents a more comprehensive memory footprint comparison on an A800 GPU across different models and sequence lengths:

fig 1 该图像是一个示意图，展示了“Jenga”方法在长上下文微调中的三个主要步骤。第一部分“Token Elimination”标明如何通过信息得分 $B$ 消除不必要的令牌；第二部分“Pattern Prediction”描述了对块进行分区和模式预测的过程；第三部分“Kernel Design”则阐述了在注意力计算中对令牌进行选择和填充的步骤。整体结构强调了提高上下文稀疏性的策略。

Figure 12: Memory footprints comparison on A800. The results confirm that Jenga consistently achieves substantial memory savings. Specifically, Jenga achieves average memory reductions of $38.2\%$ and $50.5\%$ compared to LoRA across six different models for sequence lengths of 4K and 8K, respectively. Compared to LongLoRA, Jenga shows similar benefits, highlighting that the Shadowy Activation phenomenon prevents LongLoRA's sparsity from yielding memory gains. A crucial observation is that Jenga's memory efficiency improves as the sequence length increases, which aligns with the hypothesis that longer text sequences contain more redundancy that can be exploited. This enhanced efficiency directly extends the maximum fine-tuning sequence length: while LoRA and LongLoRA are limited to 16K (for OPT 1.3B) or 32K (for OPT 350M) without activation recomputation or offloading, Jenga doubles this capacity, supporting sequence lengths up to 32K (OPT 1.3B) and 64K (OPT 350M) on a single A800 GPU. This is a significant practical advantage for long-context applications.

Execution Time

The minimized token involvement in Jenga also translates into computational savings. The following image (Figure 13 from the original paper) presents the end-to-end execution time and corresponding speedups of Jenga during fine-tuning different models at a sequence length of 4K on A800 and A40 GPUs:

fig 2 该图像是一个示意图，展示了注意力块与评分块之间的关系。图中包含了多个层次的 Token 块以及它们的层特定阈值 $θ_l$ 的比较关系。此外，图中显示了评分的计算方式 $S_{ij} = rac{1}{N_{head}} imes ext{Sum}_{h:s_{h,j}>0} s_{h,j}$ 。

Figure 13: End-to-end speedup of JENGa on A800 and A40. Jenga achieves computational efficiency comparable to LongLoRA, demonstrating an average speedup over LoRA of $10.8\%$ on Platform A (A800) and $8.6\%$ on Platform B (A40). The paper notes that LongLoRA can sometimes be slower than LoRA, primarily because fully utilizing hardware for sparse operations can be challenging with insufficient sequence length. In contrast, Jenga's minimal modifications to the computational flow allow it to more effectively convert computational savings into practical speedups. Furthermore, evaluations on longer sequence lengths (with recomputation) reveal even greater performance gains, achieving up to $1.36 \times$ speedups.

Accuracy Evaluation

Maintaining model accuracy is paramount for any optimization technique. The following are the comparative analysis of model accuracy on the LongBench benchmark (higher is better) from Table 6 of the original paper:

Tasks	mfqa_zh	mfqa_en	gov_report	triviaqa	vcsum	qmsum	musique	2wikimqa	reponหนัก
Origin	23.45	23.22	27.44	84.60	13.30	22.64	4.63	9.01	52.00
Ours	23.53	24.74	25.92	82.59	13.02	20.33	5.73	10.14	48.32
Tasks	qasper	hotpotqa	multi_news	pr_zh	pr_en	trec	Isht	dureader	lcc
Origin	15.94	9.40	24.43	10.0	20.0	68.0	21.0	23.69	71.28
Ours	17.68	9.55	22.53	8.00	22.0	68.0	25.0	21.37	70.32

The results on the LongBench benchmark show that Jenga achieves accuracy comparable to the original LoRA baseline across a wide range of long-text application tasks. While there are minor fluctuations (some tasks show slight improvements, others slight drops), the overall accuracy is maintained, confirming that Jenga's token elimination strategy does not significantly compromise model performance.

Additionally, the paper states that Jenga was evaluated for perplexity (PPL) on fine-tuned Llama2 7B models using the PG19 and Proof-Pile datasets across varying sequence lengths. The authors state, "As shown in Table 7, JENGa incurs only a minimal increase in perplexity scores compared to the original LoRA, across varying sequence lengths." While Table 7 itself is not presented in the provided text, this statement indicates that the Information-driven Token Elimination strategy effectively preserves the model's ability to predict text sequences accurately, demonstrating that the exploited redundancy does not carry critical information. These collective evaluations strongly support the claim that Jenga can achieve significant efficiency gains without compromising model accuracy.

6.2. Ablation Studies / Parameter Analysis

Fine-grained Performance Breakdown

To understand the specific contributions of Jenga's components, a detailed performance breakdown was conducted. The following image (Figure 14 from the original paper) shows the performance breakdown for Llama2 fine-tuning, covering both memory footprint and execution time:

fig 3 该图像是一个示意图，展示了注意力机制中的多个重要组成部分，包括注意力头、得分块和标记块。图中显示了如何计算得分矩阵 $S_{ij}$ 和层特定阈值 $\theta_l$ ，并通过比较不同的标记块 ( $B_1, B_2, B_3$ ) 的重要性 I(B) 进行筛选。其中， $I(B_1) > I(B_3)$ 的记录采用了不同的表示方式。整体结构阐述了长上下文微调的过程。

Figure 14: Performance breakdown of Llama2 fine-tuning: (a) Memory footprint and (b) Execution time.

Memory Aspect: Figure 14(a) shows that Jenga effectively reduces activation memory consumption compared to both LoRA and LongLoRA. Although the predictors introduce some additional memory overhead, this overhead is minimal and does not negate the overall memory reduction. This reduction in activation memory scales linearly with sequence length, confirming its efficacy for long-context scenarios.
Computational Aspect: Figure 14(b) demonstrates that Jenga also achieves computational gains over LoRA. The reduced token involvement directly leads to decreased computation during both the forward and backward passes. Similar to memory, the computational overhead introduced by the predictors is negligible in the context of the overall fine-tuning process.

Technique 1: Token Elimination

The effectiveness of the information-driven token elimination algorithm was analyzed at the layer level. The following image (Figure 15 from the original paper) presents the memory footprint and corresponding threshold across layers on Llama2 7B (upper) and OPT 6.7B (lower):

fig 12

Figure 15: Memory footprint and corresponding threshold across layers on Llama2 7B (upper) and OPT 6.7B (lower). The results indicate that the token elimination algorithm achieves significant average memory savings: $38.3\%$ on the Attention block and $51.1\%$ on the MLP block for the Llama2 model, and $38.0\%$ (Attention) and $54.8\%$ (MLP) for the OPT model. The distinct MLP architectures (SiLU for Llama2, ReLU for OPT) are handled effectively. The application of layer-specific thresholds allows for varying degrees of reduction across layers, maximizing the exploitation of token-level sparsity while preserving model accuracy, as visualized by the varying threshold values.

Further investigation into the effectiveness of information-driven token elimination is provided by analyzing the distribution of unimportant blocks. The following image (Figure 16 from the original paper) shows the proportion of blocks without important tokens and the distribution of attention scores within a token block:

fig 13

Figure 16: (a) Proportion of blocks without important tokens and (b) Distribution of attention scores within a token block.

Proportion of Unimportant Blocks (Figure 16(a)): A large proportion of token blocks (especially in deeper layers) do not contain any important tokens (defined as having an informativeness score less than 10% of the maximum). This is attributed to the relatively small block size (e.g., 64) compared to the long-context input sequences, leading to sparse distribution of important tokens as model depth increases. These blocks can be safely eliminated without compromising accuracy.
Attention Score Distribution (Figure 16(b)): Visualization of attention scores within individual blocks shows that blocks are generally classified as important if most of their tokens are informative, or if even a few tokens are highly important. The large difference in informativeness scores between important and unimportant tokens ensures that important tokens are not averaged out, reinforcing the robustness of the block selection strategy.

Technique 2: Pattern Prediction

The Context-aware Pattern Prediction technique's performance and efficiency were evaluated. The following image (Figure 17 from the original paper) shows the training loss curve and prediction visualizations of predictors:

fig 19

Figure 17: (a) Training loss curve on LongAlign (LA)/ Red-Pajama (RP) and (b) Prediction visualizations of predictors.

Training Loss and Accuracy (Figure 17(a)): The training loss curves for predictors across two models (Llama2 and OPT) and two datasets (LongAlign, Red-Pajama) demonstrate quick convergence, typically in fewer than 400 epochs. This overhead is acceptable given the overall expensive LLM fine-tuning process. The predictors achieve an impressive average recall metric of $95.13\%$ , indicating high accuracy in identifying important tokens.
Prediction Visualizations (Figure 17(b)): Visual comparisons show that the predicted attention scores closely approximate the ground truth, effectively identifying redundant tokens with high accuracy. This visual evidence supports the high recall metric and validates the predictors' ability to infer sparsity patterns reliably.

Regarding elastic size transformation, the paper states, "Table 8 details that by exploiting the inherent sparsity within predictors, their parameter sizes can be uniformly reduced across layers, on average $64.6\%$ ." While Table 8 itself is not included in the provided text, this finding indicates that elastic size transformation is highly effective. By adaptively pruning neurons in the predictors, it significantly reduces their parameter sizes (by an average of $64.6\%$ ), which in turn minimizes prediction overhead, aligning with the negligible overhead observed in the performance breakdown.

Technique 3: Kernel Optimization

The performance of Jenga's high-performance kernel optimization techniques was benchmarked. The following image (Figure 18 from the original paper) illustrates the performance of Jenga's permutation-free kernel:

fig 18

Figure 18: Performance of JENGA's permutation-free kernel. The results show that the permutation-free token movement strategy, encompassing both selective load and in-place addition kernel fusion techniques, significantly enhances performance. Speedups range from $10 \times$ to over $50 \times$ , with improvements increasing with sequence length. This substantial speedup is primarily attributed to the reduction in global memory movement and temporary data allocation by fusing operations. This highlights the critical role of kernel-level optimizations in translating algorithmic efficiency into practical system performance.

The following image (Figure 19 from the original paper) illustrates memory usage peak in loss gradient computation:

fig 20

Figure 19: Memory usage peak in loss gradient computation. Thex-axis represents the timeline of a fine-tuning epoch. Figure 19 demonstrates the effectiveness of the segment-based peak-cutting technique. In a naive implementation, gradient computation for the output loss incurs a substantial, inefficient memory peak (around 10GB of temporary activation memory). By partitioning the gradient computation into smaller segments, this sharp peak is transformed into multiple, much smaller memory peaks. This approach achieves an additional $15\%$ memory savings, effectively mitigating the stringent GPU memory demands exacerbated by large vocabulary sizes and long contexts.

6.3. Extension Evaluation

Jenga's compatibility and synergy with other optimization techniques were also evaluated through its proposed extensions. The following image (Figure 20 from the original paper) shows performance improvements from two extensions:

Figure 20: Performance improvements from two extensions.

Extension 1: Two-dimensional Sparsity: Figure 20(a) shows that combining Jenga's token-level sparsity with existing hidden-dimension-level sparsity techniques (forming 2D-Sparsity) further enhances computational efficiency. This synergistic approach achieves up to $2.04 \times$ speedups on Llama2, demonstrating that more granular control over resource allocation through multi-dimensional sparsity can yield significant additional gains.
Extension 2: Sparsity-sensitive Offload: Figure 20(b) compares Jenga's sparsity-sensitive offloading strategy against a naive uniform offloading approach. By adapting to the varying sparsity ratios across layers, Jenga's method enables more efficient data transfer between CPU and GPU or allows offloading larger data volumes. This results in an average speedup of $1.22 \times$ on Llama2, further alleviating GPU memory constraints and improving overall performance.

6.4. Scalability Analysis

The strong scalability of Jenga was analyzed on a multi-GPU setup ( $4 \times 4090$ GPUs). The following image (Figure 21 from the original paper) shows the strong scalability evaluation of Jenga:

fig 8

Figure 21: Strong scalability evaluation of JENGA. Figure 21 demonstrates that Jenga's performance scales proportionally with the number of GPUs across different models and sequence lengths. This strong scalability is achieved because Jenga effectively minimizes token involvement without introducing extra communication overhead between GPUs. These results underscore Jenga's potential for deployment in large-scale distributed systems, making it suitable for industrial-level LLM fine-tuning.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces Jenga, an innovative and efficient system designed to optimize long-context fine-tuning for Large Language Models (LLMs). Its core contribution is the identification and exploitation of a novel sparsity mechanism, termed contextual token sparsity, which is inherent in long-context scenarios due to the natural redundancy of language. Jenga systematically leverages this mechanism through three key techniques: Information-driven Token Elimination (to identify and remove redundant tokens), Context-aware Pattern Prediction (to efficiently infer sparsity patterns), and High-performance Kernel Optimization (to translate algorithmic gains into practical system performance).

Comprehensive evaluations demonstrate that Jenga achieves significant improvements over state-of-the-art fine-tuning systems. It reduces memory consumption by up to $1.93 \times$ and achieves up to $1.36 \times$ speedups, while critically preserving model accuracy across various LLM architectures and long-context benchmarks. Furthermore, Jenga is compatible with existing optimization techniques and exhibits strong scalability in multi-GPU environments. By directly addressing the activation memory bottleneck through token-level sparsity, Jenga represents a crucial advancement in making long-context LLM fine-tuning more accessible and efficient.

7.2. Limitations & Future Work

The authors conclude by stating, "Compression embodies intelligence, with sparsity serving as a potent form of compression. We envision JengA inspiring broader exploration of sparsity for advancing LLMs." This statement, while visionary, implicitly suggests future research directions rather than explicitly detailing current limitations.

Based on the paper's content, potential limitations and areas for future work could include:

Predictor Training Overhead: While the predictors converge quickly (within 400 epochs), this still represents an initial training cost that must be incurred before efficient fine-tuning. For highly dynamic or rapidly evolving datasets, re-training these predictors could add overhead.
Robustness to Diverse Contexts/Domains: Although evaluated on diverse datasets, the generalizability of the learned sparsity patterns and layer-specific thresholds to entirely new, unseen domains or highly adversarial inputs remains an area for further investigation. How well does the informativeness definition hold across all possible linguistic structures?
Granularity of Block-wise Elimination: While block-wise elimination is hardware-friendly, it's a coarser granularity than individual token elimination. This might lead to retaining some unimportant tokens if they are grouped into a block with even a single important token, potentially leaving room for further optimization.
Dynamic Threshold Adaptation: The layer-specific threshold optimization involves a gradient-based update. The sensitivity of this process to hyperparameters like learning rate ( $\eta$ ) and the finite difference $\epsilon$ needs careful tuning and could be an area for more adaptive, automated approaches.
Interaction with Other Optimizations: While Jenga is compatible with activation recomputation and offloading, a deeper analysis of the optimal interplay and potential for combined, co-designed optimizations could yield further benefits. For instance, sparsity-sensitive offloading is shown to be effective, indicating fruitful avenues for combining Jenga with memory management strategies.

7.3. Personal Insights & Critique

Jenga offers a compelling and innovative approach to tackle one of the most pressing challenges in LLM development: efficiently scaling context windows during fine-tuning.

Innovation: The core idea of contextual token sparsity is a powerful insight. Moving beyond hidden-dimension sparsity to truly token-level elimination is a fundamental shift that directly addresses the activation memory bottleneck. This is a crucial distinction that many prior sparsity works overlooked due to the Shadowy Activation problem. The dynamic, context-aware nature of Jenga's sparsity (varying across inputs and layers) is also a significant advancement over static sparsity patterns.
Practicality through System Design: The strength of Jenga lies not just in its algorithmic novelty but also in its meticulous system-level optimizations. The permutation-free kernel optimization and segment-based peak cutting are crucial for translating theoretical gains into tangible memory savings and speedups on real hardware. This holistic approach, integrating algorithmic insights with low-level kernel design, is often the key to high-impact research in systems for AI.
Broader Applicability: The principles behind Jenga could extend beyond LLMs. Any deep learning model that processes long sequences and suffers from activation memory bottlenecks (e.g., in genomics, time-series forecasting, or video processing) might benefit from similar token-level sparsity or sequence-segmentation strategies. On the hardware front, the kernel optimizations for efficient data movement and memory peak management are generally applicable to GPU-accelerated workloads.
Potential Improvements/Critiques:
- "Informativeness" Heuristic: The definition of informativeness relies on the maximum positive attention score within a block. While this simple heuristic proves effective, it might not capture all nuances of token importance (e.g., a token could be critical but have a moderate, rather than maximum, attention score with many other tokens). Further research into more sophisticated, perhaps learned, informativeness metrics could be valuable.
- Training Predictors: While efficient, training separate predictors for each layer adds complexity to the fine-tuning pipeline. Exploring methods where a single global predictor or a small set of predictors can generalize across layers, or methods that are entirely prediction-free (e.g., through hardware-level approximations), could further streamline the process.
- Impact of Block Size: The choice of block size ( $b$ ) is a hyperparameter that balances granularity of sparsity with computational overhead of pattern prediction. A detailed sensitivity analysis on how block size impacts both performance and accuracy, perhaps with an adaptive block-sizing strategy, could be beneficial.
- Transparency of Contextual Token Sparsity: While the paper provides visualizations, a deeper qualitative analysis of which tokens are typically eliminated and why (e.g., stop words, repetitive phrases, less semantically dense segments) across different layers and contexts could offer more insights into the model's internal workings and further validate the linguistic redundancy hypothesis.
  
  Overall, Jenga presents a robust and well-engineered solution that makes significant strides in addressing a practical bottleneck for long-context LLMs, paving the way for more efficient and powerful AI applications.