
SPECTRA: Faster Large Language Model Inference with Optimized Internal and External Speculation

Published: 01/01/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

SPECTRA is a novel framework that accelerates large language model inference through optimized internal and external speculation, requiring no additional training. It achieves up to 4.08x speedup over state-of-the-art methods across various benchmarks, with its implementation publicly available.

Abstract

Inference with modern Large Language Models (LLMs) is both computationally expensive and time-consuming. Speculative decoding has emerged as a promising solution, but existing approaches face key limitations: training-based methods require a draft model that is challenging to obtain and lacks generalizability, while training-free methods offer limited speedup gains. In this work, we present SPECTRA, a novel framework for accelerating LLM inference without the need for additional training or modification to the original LLM. SPECTRA introduces two new techniques for efficiently utilizing internal and external speculation, each outperforming corresponding state-of-the-art (SOTA) methods independently. When combined, these techniques achieve up to a 4.08x speedup across various benchmarks and LLM architectures, significantly surpassing existing training-free approaches. The implementation of SPECTRA is publicly available.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

SPECTRA: Faster Large Language Model Inference with Optimized Internal and External Speculation

1.2. Authors

Nguyen-Khang Le, Dinh-Truong Do, Nguyen Le Minh, all affiliated with Japan Advanced Institute of Science and Technology.

1.3. Journal/Conference

The paper is listed with a publication date in 2025, which suggests it is either accepted at a machine learning or natural language processing venue in 2025 or is an arXiv preprint targeting such venues.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses the significant computational expense and time consumption of Large Language Model (LLM) inference. While speculative decoding offers a solution, existing methods have limitations: training-based approaches require difficult-to-obtain and non-generalizable draft models, and training-free methods yield limited speedups. SPECTRA is introduced as a novel framework to accelerate LLM inference without training or modifying the original LLM. It features two new techniques for internal and external speculation, each independently outperforming current state-of-the-art (SOTA) methods. When combined, SPECTRA achieves up to a 4.08x speedup across various benchmarks and LLM architectures, significantly surpassing existing training-free approaches. The implementation is publicly available.


2. Executive Summary

2.1. Background & Motivation

The core problem SPECTRA aims to solve is the inherent inefficiency of Large Language Model (LLM) inference. Modern LLMs generate text autoregressively, meaning one token at a time. This sequential process leads to generation time scaling linearly with the output sequence length and underutilizes the parallel processing capabilities of modern GPUs. This computational burden makes deploying and running LLMs costly and slow, hindering their widespread application, especially for generating long sequences with low latency.

Prior research has explored speculative decoding as a promising solution. This guess-and-verify paradigm attempts to predict multiple tokens in advance and verify them in parallel. However, existing speculative decoding methods fall into two main categories, both with significant drawbacks:

  1. Training-based methods: These approaches often require a smaller, specialized draft model or modifications to the original LLM itself through specialized training. This demands substantial computational resources, may degrade the original model's capabilities, and lacks generalizability across different LLMs or tasks.

  2. Training-free methods: These approaches avoid additional training by leveraging existing LLM predictions or external information. However, they typically offer only limited speedup gains due to the quality of their speculative guesses or inefficiencies in their integration.

    The paper's entry point is to overcome these limitations by developing a novel training-free speculative decoding framework that does not require modification to the original LLM, yet achieves significantly higher speedups than existing training-free methods. The innovative idea is to optimize both internal speculation (leveraging the LLM's own predicted token distribution) and external speculation (using refined external knowledge) and integrate them efficiently.

2.2. Main Contributions / Findings

The primary contributions of the SPECTRA paper are:

  • A Novel, Training-Free Framework: Introduction of SPECTRA, a plug-and-play speculative decoding method that accelerates LLM inference without requiring any additional training or modifications to the original LLM. This addresses the practical challenges of training-based methods.

  • Optimized Internal Speculation (SPECTRA-CORE): Development of SPECTRA-CORE, a module that leverages the LLM's predicted token distribution to generate high-quality guesses. It employs multi-level N-gram dictionaries for bi-directional search to produce dynamic-length guesses, balancing quality and quantity. It also optimizes a candidate pool to continuously update these dictionaries for broad token coverage, with all updates and verification performed efficiently in a single forward pass.

  • Refined External Speculation (SPECTRA-RETRIEVAL): Introduction of SPECTRA-RETRIEVAL, a module that enhances speedup by deriving guesses from a refined external information source. Unlike prior retrieval-based methods, SPECTRA-RETRIEVAL reduces the search space by selecting only high-quality content from a corpus based on perplexity scores computed by the target LLM. This optimization enables seamless and efficient integration with SPECTRA-CORE.

  • Significant Speedup: When combined, SPECTRA's techniques achieve substantial acceleration, demonstrating up to a 4.08x speedup across a variety of benchmarks and LLM architectures (Llama 2, Llama 3, and CodeLlama, ranging from 7B to 70B parameters). This significantly surpasses existing training-free approaches.

  • Robustness and Generalizability: Extensive experiments confirm SPECTRA's efficiency across diverse tasks (multi-turn conversation, code generation, mathematical reasoning), LLM architectures, GPU types, and sampling settings, while preserving the original model's output quality (lossless acceleration).

  • Public Availability: The implementation of SPECTRA is publicly released, facilitating further research and adoption.

    The key findings show that SPECTRA effectively overcomes the limitations of previous speculative decoding methods by offering a practical, high-impact solution for accelerating LLM inference that is both lossless (maintains output quality) and highly efficient.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand SPECTRA, a basic understanding of Large Language Models (LLMs) and their decoding mechanisms, along with concepts in speculative decoding and language modeling, is essential.

  • Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on the Transformer architecture, trained on vast amounts of text data. They are capable of understanding, generating, and processing human language. Examples include GPT models, Llama, and CodeLlama. Their power comes from their ability to learn complex patterns and relationships in language.

  • Autoregressive Decoding: This is the standard method by which LLMs generate text. It operates sequentially:

    1. The LLM processes an input (prompt) and predicts the most probable next token.
    2. This predicted token is appended to the input.
    3. The LLM then takes the extended sequence as input to predict the next token. This process repeats one token at a time until a stop condition (e.g., end-of-sequence token, maximum length) is met.
    • Inefficiency: The main drawback is that each token generation requires a full forward pass through the LLM, making the process slow, especially for long sequences, as generation time scales linearly with the output length. It also underutilizes the parallel processing capabilities of modern GPUs.
  • Speculative Decoding (Guess-and-Verify Paradigm): This technique aims to speed up autoregressive decoding by generating multiple tokens in parallel. Instead of generating one token at a time, it "guesses" several future tokens and then "verifies" them simultaneously.

    • Draft Model: Often, a smaller, faster model (the draft model) is used to quickly generate a sequence of speculative tokens (a "draft").
    • Verification: The original, larger LLM then takes this draft along with the preceding context and verifies all the drafted tokens in a single, parallel forward pass.
    • Acceptance: The LLM checks if the drafted tokens are consistent with its own predictions. A contiguous prefix of correctly guessed tokens is accepted. If a token is incorrect, the sequence up to the last correct token is accepted, and the LLM then autoregressively generates the next token from that point.
    • Efficiency: If many drafted tokens are accepted, the method effectively generates multiple tokens in the time it would normally take to generate one, leading to speedups.
  • N-grams: In computational linguistics, an N-gram is a contiguous sequence of N items (words, tokens, characters) from a given sample of text or speech. For example, in the sentence "The quick brown fox," "the quick" is a 2-gram (or bigram), and "quick brown fox" is a 3-gram (or trigram). N-gram models are simple probabilistic language models that predict the next item in a sequence based on the preceding N-1 items. In SPECTRA, N-grams are used to store and retrieve sequences of tokens for speculation.

  • Perplexity (PPL): A measure of how well a probability distribution or probability model predicts a sample. In language modeling, perplexity quantifies how "surprised" a model is by a sequence of tokens. A lower perplexity score indicates that the model assigns a higher probability to the sequence, meaning the sequence is more "expected" or "natural" according to the model's learned patterns.

    • Formula: Given a text sequence u = (u_0, u_1, \ldots, u_t), the perplexity is calculated as: $ \mathrm{PPL}(u) = \exp \left\{ - \frac{1}{t} \sum_{i=1}^t \log P_M (u_i \mid u_{<i}) \right\} $ Where:
      • u = (u_0, u_1, \ldots, u_t) is the sequence of tokens.
      • t is the length of the sequence (the number of tokens after the first).
      • P_M (u_i \mid u_{<i}) is the probability of token u_i given the preceding tokens u_{<i} (i.e., u_0, \ldots, u_{i-1}) as predicted by the model M.
      • \log is the natural logarithm. The term - \frac{1}{t} \sum_{i=1}^t \log P_M (u_i \mid u_{<i}) is the average negative log-likelihood, which is equivalent to the cross-entropy of the model on the sequence. A lower PPL value means the model is more confident and accurate in predicting the sequence. (A minimal computation sketch is given after this list.)
  • Attention Mask: In Transformer models, the attention mechanism allows the model to weigh the importance of different tokens in the input sequence when processing each token. An attention mask is a binary (or sometimes continuous) matrix that controls which tokens can "attend" to which other tokens. For autoregressive decoding, the attention mask is typically causal, meaning a token can only attend to itself and preceding tokens, preventing it from "seeing" future tokens. In speculative decoding, specially designed attention masks can enable parallel processing of speculative tokens while still respecting causal dependencies within the draft.

  • FlashAttention: An optimized implementation of the attention mechanism that significantly reduces memory usage and improves computational speed, especially for long sequences. It achieves this by reordering computations and utilizing GPU memory hierarchies more efficiently, reducing the number of reads/writes to high-bandwidth memory.
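The perplexity definition above maps directly onto standard language-modeling APIs. Below is a minimal sketch of computing PPL(u) with Hugging Face Transformers; the model name is only an illustrative choice (not prescribed by the paper), and any causal LM works the same way.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM can be scored this way.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """PPL(u) = exp(mean negative log-likelihood of each token given its prefix)."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    # With labels=input_ids, the model returns the average next-token
    # cross-entropy, which is exactly the exponent in the PPL formula.
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```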

3.2. Previous Works

The paper contextualizes SPECTRA by discussing two main categories of speculative decoding and other related LLM acceleration techniques.

  • LLM Inference Acceleration Techniques (General):

    • Quantization: Reducing the precision of model weights (e.g., from FP16 to INT8) to decrease memory footprint and accelerate computation. (Liu et al., 2024b; Lin et al., 2024; Zhao et al., 2024; Park et al., 2024)
    • Pruning: Removing redundant parameters from the model while trying to maintain performance. (Ma et al., 2023; Xia et al., 2023; Sun et al., 2023a; Le et al., 2025)
    • Knowledge Distillation: Training a smaller "student" model to mimic the behavior of a larger "teacher" model. (Gu et al., 2024; Friha et al., 2024; Zhang et al., 2024b)
    • Limitations: While these methods reduce computational load, they often introduce some degradation in model performance, requiring a trade-off between quality and efficiency.
  • Speculative Decoding (Training-Based Methods): These methods involve training or fine-tuning models to generate speculative tokens.

    • Draft Models: Using a smaller LLM (draft model) to predict several future tokens, which are then verified by the main LLM. (Chen et al., 2023; Leviathan et al., 2023; Miao et al., 2024; Sun et al., 2023b; Zhou et al., 2024; Cai et al., 2024)
    • Self-Speculative Decoding: Training the original LLM itself in a specialized manner to generate its own drafts. (Elhoushi et al., 2024; Liu et al., 2024a; Yang et al., 2024; Zhang et al., 2024a; Li et al., 2024b)
    • Limitations: These approaches require significant computational resources for training, may compromise the original model's capabilities, and often lack generalizability to new models or tasks without further training.
  • Speculative Decoding (Training-Free Methods): These methods aim to avoid additional training by leveraging existing information or structural properties.

    • Retrieval-Based Methods: Using an external datastore (e.g., indexed with observed prefixes in a Trie structure) to retrieve guess sequences. REST (He et al., 2024) and Nearest Neighbor Speculative Decoding (Li et al., 2024a) are examples.
      • Limitation: These methods can struggle with efficiency if the search time in the datastore outweighs the speedup gains, especially when trying to integrate with other speculative techniques. REST (He et al., 2024) is a direct baseline for SPECTRA-RETRIEVAL.
    • Internal Speculation Methods: Methods that generate speculative tokens directly from the LLM's predictions or exploit structural properties. Lookahead Decoding (Fu et al., 2024) and Adaptive N-gram Parallel Decoding (ANPD) (Ou et al., 2024) are examples. They mitigate left-to-right dependencies by generating and validating multiple candidate tokens in parallel.
      • Limitation: Often provide limited speedup due to the quality and quantity of the speculative guesses. Lookahead (Fu et al., 2024) and ANPD (Ou et al., 2024) are direct baselines for SPECTRA-CORE.

3.3. Technological Evolution

The evolution of LLM inference acceleration has progressed from basic autoregressive decoding to sophisticated speculative decoding techniques. Initially, efforts focused on reducing the computational cost per token through methods like quantization and pruning, often with some quality trade-offs. The advent of speculative decoding marked a shift towards accelerating the number of tokens generated per computation step, rather than just making each step cheaper.

Early speculative decoding relied on draft models, introducing the overhead of training and maintaining an auxiliary model. The next wave sought to eliminate this training burden, leading to training-free approaches. These first training-free methods either leveraged the LLM's internal predictions (Lookahead, ANPD) or external knowledge (REST). However, these initial training-free methods faced limitations in speedup gains or integration complexity.

SPECTRA fits into this timeline as a significant advancement in training-free speculative decoding. It represents an evolution from prior training-free methods by introducing more sophisticated internal knowledge utilization and a novel, perplexity-based refinement for external knowledge, enabling seamless and highly effective integration of both.

3.4. Differentiation Analysis

Compared to the main methods in related work, SPECTRA offers several core differences and innovations:

  • Vs. Training-Based Speculative Decoding:

    • Core Difference: SPECTRA is entirely training-free and requires no modifications to the original LLM. This contrasts sharply with methods that need a draft model or specialized self-speculative training, which incur substantial computational costs, may degrade original model capabilities, and lack generalizability.
    • Innovation: SPECTRA provides a plug-and-play solution, making it highly practical for deploying off-the-shelf LLMs without additional resource investment.
  • Vs. Training-Free Internal Speculation (e.g., Lookahead, ANPD):

    • Core Difference: SPECTRA-CORE uses multi-level N-gram dictionaries and a bi-directional search strategy for dynamic-length guesses, along with an optimized candidate pool to continuously update dictionaries. Lookahead often enforces uniform guess lengths and may not have as sophisticated a mechanism for generating high-quality guesses or updating its knowledge base.
    • Innovation: SPECTRA-CORE's approach allows for a better balance of guess quality and quantity, leading to higher acceptance rates and thus greater speedups. The randomness-based mechanism for selecting unseen tokens also increases the coverage of its internal knowledge.
  • Vs. Training-Free External Speculation (e.g., REST):

    • Core Difference: SPECTRA-RETRIEVAL introduces a perplexity-based corpus refinement step. REST relies on a datastore without explicit quality filtering. This refinement is crucial for selecting only high-quality, relevant texts from the corpus.
    • Innovation: The perplexity-based filtering significantly improves the quality of external guesses. More importantly, it enables SPECTRA-RETRIEVAL to be seamlessly and efficiently integrated with SPECTRA-CORE, overcoming the issue where search time in external structures can outweigh speedup gains in naive integrations. The ability to integrate effectively is a key differentiator.
  • Combined Approach:

    • Core Difference: SPECTRA is unique in its optimized combination of both internal and external speculation within a single training-free framework. Previous training-free methods generally focused on one or the other, or struggled with efficient integration.
    • Innovation: The synergy between SPECTRA-CORE's high-quality internal guesses and SPECTRA-RETRIEVAL's refined external guesses, managed by a priority system, allows SPECTRA to achieve significantly higher speedups (up to 4.08x) compared to existing training-free approaches that often yield more limited gains.

4. Methodology

SPECTRA is a novel training-free framework for accelerating LLM inference, designed to be plug-and-play without requiring modifications to the original LLM. It comprises two main modules: SPECTRA-CORE for internal speculation and SPECTRA-RETRIEVAL for external speculation. These modules can operate independently or be combined for maximum acceleration.

The overarching idea is to leverage speculative decoding by generating multiple candidate tokens (guesses) and verifying them in parallel. SPECTRA enhances this by optimizing how these guesses are generated, both from the LLM's internal knowledge and from external sources.

4.1. Principles

The core principle behind SPECTRA is to maximize the acceptance rate of speculative tokens while minimizing the overhead of generating these guesses. It achieves this by:

  1. Leveraging LLM's Predicted Distribution: SPECTRA-CORE intelligently uses the LLM's own output probabilities to build N-gram dictionaries that reflect likely token sequences.
  2. Dynamic Guess Generation: Instead of fixed-length guesses, SPECTRA supports variable-length guesses through bi-directional search in its N-gram dictionaries, enhancing flexibility and efficiency.
  3. Continuous Knowledge Augmentation: The candidate pool and N-gram dictionary update mechanisms ensure that SPECTRA-CORE's internal knowledge base is continuously enriched and adapted to the ongoing generation context.
  4. Refined External Knowledge Integration: SPECTRA-RETRIEVAL addresses the limitations of naive external knowledge usage by pre-filtering external corpora based on perplexity, ensuring that only high-quality, model-aligned guesses are considered, which then integrate seamlessly with internal guesses.
  5. Single-Pass Efficiency: All necessary distributions for predicting new candidates and verifying guesses are obtained in a single forward pass to the LLM, utilizing parallel processing and a specially designed attention mask.

4.2. Core Methodology In-depth

4.2.1. SPECTRA-CORE

SPECTRA-CORE is the internal speculation module. It maintains two primary data structures:

  • N-gram Storage: This consists of two dictionaries:

    • S_fwd (Forward Dictionary): Maps a token to a list of subsequent sequences. Used to prioritize the quantity of guesses.
    • S_bwd (Backward Dictionary): Maps a sequence (prefix) to a single subsequent token. Used to prioritize the quality of guesses.
  • Candidate Pool (C): A collection of W sequences, \{c^{(0)}, c^{(1)}, \ldots, c^{(W-1)}\}, each of length N. c_j^{(i)} denotes the j-th token in the i-th sequence. This pool is used to augment new N-grams into the storage dictionaries.

    The SPECTRA-CORE decoding process, detailed in Algorithm 1 (lines 5-44), involves a continuous loop of obtaining guesses, forwarding through the LLM, verification, predicting new candidates, and updating N-gram dictionaries.

Bi-directional Search for Guesses

At each time step t, SPECTRA generates a set of G guess sequences, \mathcal{G} = \{\tilde{y}^{(0)}, \tilde{y}^{(1)}, \ldots, \tilde{y}^{(G)}\}. This process is dynamic, allowing for variable-length guesses.

  • Forward Direction (Algorithm 1, line 7): The last generated token, x_{t-1}, is used as a key to search S_fwd and retrieve existing guess sequences. This step primarily focuses on quickly obtaining a large quantity of guesses.
  • Backward Direction (Algorithm 1, lines 8-14): A high-quality guess u is constructed iteratively. Starting from the current context, S_bwd is used to predict one token at a time, extending the sequence up to the desired length N. This focuses on the quality of the guess by building it token-by-token from observed high-probability sequences. The set of guesses \mathcal{G} is populated by combining the results from both searches (a minimal sketch of the two dictionaries and this bi-directional search is given below).
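The following is a small Python sketch of the two dictionaries and the bi-directional guess search described above. Token ids are plain integers, the function name get_guesses is illustrative rather than taken from the paper, and details such as deduplication and dynamic guess lengths are simplified.

```python
from collections import defaultdict

# S_fwd: token -> list of continuation sequences (prioritizes guess quantity).
# S_bwd: prefix tuple -> single next token (prioritizes guess quality).
S_fwd: dict[int, list[tuple[int, ...]]] = defaultdict(list)
S_bwd: dict[tuple[int, ...], int] = {}

def get_guesses(context: list[int], N: int, max_guesses: int) -> list[tuple[int, ...]]:
    """Combine the forward and backward searches into a set of guess sequences."""
    guesses: list[tuple[int, ...]] = []

    # Forward direction: key on the last generated token to fetch stored continuations.
    guesses.extend(S_fwd.get(context[-1], []))

    # Backward direction: grow one high-quality guess token by token, trying the
    # longest available key (context suffix + guess so far) first.
    u: list[int] = []
    for _ in range(N - 1):
        next_tok = None
        for k in range(N - 1, len(u), -1):            # total key length k, longest first
            key = tuple(context[-(k - len(u)):] + u)
            if key in S_bwd:
                next_tok = S_bwd[key]
                break
        if next_tok is None:
            break
        u.append(next_tok)
    if u:
        guesses.append(tuple(u))

    return guesses[:max_guesses]
```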

Predict & Verify in One Forward Pass

A crucial efficiency aspect of SPECTRA is that all distributions required for both predicting candidates for the candidate pool and verifying the guesses are obtained in a single forward pass to the LLM. This leverages the parallel processing capabilities of GPUs.

The following figure (Figure 2 from the original paper) shows how SPECTRA handles internal knowledge, including the single forward pass:

Figure 2: Detail of how SPECTRA handles internal knowledge. The dashed arrow indicates interactions between the tokens, which are realized by the attention mask in the LLM.

This is achieved using a specially designed attention mask. For example, in the figure, a token like c_2 in a candidate sequence or a guess token attends only to its causal predecessors (c_1, c_0) and the input x. This ensures correctness for autoregressive models while allowing parallel computation across different guesses and candidates.
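As a concrete illustration of this idea (not the paper's actual implementation), here is a sketch of a boolean attention mask for a sequence laid out as [context | guess_0 | guess_1 | ...], where each speculative token may attend to the full context and to earlier tokens of its own guess, but not to tokens of other guesses.

```python
import numpy as np

def build_spec_mask(ctx_len: int, guess_lens: list[int]) -> np.ndarray:
    """mask[q, k] = True means query position q may attend to key position k."""
    total = ctx_len + sum(guess_lens)
    mask = np.zeros((total, total), dtype=bool)
    # Standard causal mask within the context.
    mask[:ctx_len, :ctx_len] = np.tril(np.ones((ctx_len, ctx_len), dtype=bool))
    start = ctx_len
    for g in guess_lens:
        rows = slice(start, start + g)
        mask[rows, :ctx_len] = True                               # attend to the context
        mask[rows, rows] = np.tril(np.ones((g, g), dtype=bool))   # causal within the guess
        start += g
    return mask

# Example: a 4-token context followed by two guesses of lengths 3 and 2.
print(build_spec_mask(4, [3, 2]).astype(int))
```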

Verification

The verification step ensures that the output distribution remains consistent with the original LLM (lossless acceleration).

  • For greedy sampling (which selects the token with the highest probability), the verification is detailed in Algorithm 3. It involves comparing the LLM's predicted tokens y' with the drafted tokens \tilde{y}. The longest contiguous prefix of accepted tokens across all guesses is identified and appended to the output.
  • For sampling-based methods (e.g., top-k, top-p sampling), SPECTRA adopts the sampling verification approach from prior work (Miao et al., 2024; Fu et al., 2024), ensuring correctness for stochastic generation. This is detailed in Algorithm 4.

Predict Tokens for Candidate Pool

After verification, new tokens are predicted to update the candidate pool C. For each of the W sequences in C, the next token c_{N-1}^{(i)} is predicted using the LLM's distribution obtained in the single forward pass (Algorithm 1, lines 26-34).

  • Randomness-Based Coverage: To address the issue of N-gram dictionaries potentially lacking coverage for certain tokens, SPECTRA introduces a probabilistic mechanism using a hyperparameter \tau \in [0, 1].
    • Let r be a random draw from a Uniform distribution on [0, 1].
    • If r > \tau, SPECTRA attempts to select the token with the highest probability that is not yet present in S_fwd. This encourages exploration and expands dictionary coverage.
    • Otherwise (if r \leq \tau), it selects the token with the highest probability regardless of its presence in S_fwd. (A minimal sketch of this selection rule follows this list.)
  • Candidate Pool Shift: At the end of each time step, all candidate sequences in C are shifted one token to the left (i.e., c_j^{(i)} \gets c_{j+1}^{(i)} for all j \in [0, N-2]), leaving the last position c_{N-1} empty for the new prediction in the next step (Algorithm 1, line 43). This creates a sliding-window effect over the candidate sequences.
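A minimal sketch of this candidate-token prediction step is shown below. The distribution p_next stands in for the probabilities obtained in the shared forward pass; the function name and the exact tie-breaking are illustrative assumptions.

```python
import numpy as np

def predict_candidate_token(p_next: np.ndarray, S_fwd: dict, tau: float,
                            rng: np.random.Generator) -> int:
    """Pick the next token for one candidate sequence (sketch of Alg. 1, lines 26-34).

    With probability (1 - tau) we prefer the most probable token that is NOT yet a
    key in S_fwd, which widens dictionary coverage; otherwise we take the argmax.
    """
    if rng.uniform() > tau:
        for tok in np.argsort(p_next)[::-1]:     # tokens from most to least probable
            if int(tok) not in S_fwd:
                return int(tok)
    return int(np.argmax(p_next))
```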

Update N-gram Dictionaries

The N-gram dictionaries (S_fwd and S_bwd) are updated using the tokens from the candidate pool CC at the end of each time step (Algorithm 1, lines 35-41).

  • Subsequence Inclusion: Unlike previous methods that might only add full N-grams, SPECTRA observes that subsequences within N-grams are also valuable. It adds these subsequences to both dictionaries. (A small sketch of this update follows this list.)
  • S_fwd Update: Subsequences are added to S_fwd using their first token as the key. For an N-gram (c_0, c_1, \ldots, c_{N-1}), entries such as c_0 \to (c_1, \ldots, c_{N-1}), c_1 \to (c_2, \ldots, c_{N-1}), and so on, are appended.
  • S_bwd Update: For an N-gram (c_0, c_1, \ldots, c_{N-1}), S_bwd maps the preceding part of the sequence to the next token (e.g., (c_0, \ldots, c_{j-1}) \to c_j).
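The subsequence-based update can be sketched as follows; the helper name update_ngram_dicts is illustrative, and a real implementation may cap list sizes or deduplicate entries.

```python
def update_ngram_dicts(candidate: list[int], S_fwd: dict, S_bwd: dict) -> None:
    """Add an N-gram and all of its subsequences to both dictionaries (sketch).

    For candidate (c_0, ..., c_{N-1}):
      S_fwd[c_j] gains the continuation (c_{j+1}, ..., c_{N-1})
      S_bwd[(c_0, ..., c_j)] is set to c_{j+1}
    """
    N = len(candidate)
    for j in range(N - 1):
        S_fwd.setdefault(candidate[j], []).append(tuple(candidate[j + 1:]))
        S_bwd[tuple(candidate[:j + 1])] = candidate[j + 1]

# Example: after update_ngram_dicts([5, 8, 2], S_fwd, S_bwd),
# S_fwd[5] contains (8, 2) and S_bwd[(5, 8)] == 2.
```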

4.2.2. SPECTRA-RETRIEVAL

SPECTRA-RETRIEVAL enhances speedup by leveraging external knowledge. It addresses the common pitfall of retrieval-based methods where integrating guesses from unrefined external sources can lead to search time outweighing speedup gains.

Corpus Refinement by Perplexity

The key innovation in SPECTRA-RETRIEVAL is to select high-quality, relevant texts from a large external corpus using perplexity (PPL) scores.

  • Process: Given an external text corpus, each sequence u = (u_0, u_1, \ldots, u_t) in the corpus is evaluated with the target LLM to compute its perplexity.
  • Formula: $ \mathrm{PPL}(u) = \exp \left\{ - \frac{1}{t} \sum_{i=1}^t \log P_M (u_i \mid u_{<i}) \right\} $ Where:
    • u = (u_0, u_1, \ldots, u_t) is the sequence of tokens from the external corpus.
    • t is the length of the sequence.
    • P_M (u_i \mid u_{<i}) is the probability of token u_i given the preceding tokens u_{<i} as predicted by the target LLM M.
  • Selection: Sequences with the lowest perplexity are deemed to be well-aligned with the LLM's predictions and are thus likely to produce high-quality guesses. These selected, high-quality sequences form a smaller, refined subset of the corpus.
  • Indexing: This refined corpus is then indexed into a structure that supports fast prefix search, such as a Trie.
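Assuming a perplexity scorer such as the earlier sketch is available, the refinement and indexing steps might look as follows. The names refine_corpus and build_prefix_index, the keep count, and the flat prefix map are illustrative; the paper uses a proper Trie for prefix search.

```python
def refine_corpus(corpus: list[str], ppl_fn, keep: int = 100_000) -> list[str]:
    """Keep the `keep` sequences with the lowest perplexity under the target LLM."""
    return sorted(corpus, key=ppl_fn)[:keep]

def build_prefix_index(refined: list[list[int]], max_prefix: int = 4) -> dict:
    """Very small stand-in for a Trie: map token-id prefixes to continuations."""
    index: dict[tuple[int, ...], list[tuple[int, ...]]] = {}
    for seq in refined:
        for i in range(len(seq) - max_prefix):
            key = tuple(seq[i:i + max_prefix])
            # Store a short continuation (here up to 8 tokens) as a retrieval guess.
            index.setdefault(key, []).append(tuple(seq[i + max_prefix:i + max_prefix + 8]))
    return index
```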

Integration with SPECTRA-CORE

SPECTRA-RETRIEVAL is designed for seamless integration with SPECTRA-CORE.

  • Guess Generation: During the guess generation step of SPECTRA-CORE (Algorithm 1, line 16), guesses obtained from SPECTRA-RETRIEVAL (\mathcal{G}_{retrieve}) are appended to the guesses generated by SPECTRA-CORE.
  • Limited Budget: Recognizing that too many candidate tokens can strain GPU resources, SPECTRA limits the total number of guesses (G). By refining the corpus with perplexity, SPECTRA-RETRIEVAL ensures that its contribution to the guess budget is of high quality, preventing the search time in the external Trie from negating the speedup.
  • Priority: As empirically observed, internal guesses from SPECTRA-CORE tend to have higher acceptance rates. Therefore, SPECTRA gives priority to these internal guesses, filling any remaining guess budget with external guesses from SPECTRA-RETRIEVAL.
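The priority rule can be expressed in a few lines; this is a sketch, with combine_guesses as a hypothetical helper name.

```python
def combine_guesses(internal: list[tuple[int, ...]],
                    external: list[tuple[int, ...]],
                    budget: int) -> list[tuple[int, ...]]:
    """Fill the guess budget G, prioritizing SPECTRA-CORE's internal guesses and
    topping up with SPECTRA-RETRIEVAL guesses only if room remains (sketch)."""
    combined = list(internal[:budget])
    for g in external:
        if len(combined) >= budget:
            break
        combined.append(g)
    return combined
```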

4.2.3. Overall SPECTRA Decoding Algorithm

The entire SPECTRA decoding process is outlined in Algorithm 1.

Require: sequence x = (x_1, x_2, \ldots, x_n), model P_M, max N-gram size N, candidate pool size W, max guesses G, max number of new tokens m, refine threshold τ
1:  Initialize N-gram forward dictionary S_fwd ← ∅
2:  Initialize N-gram backward dictionary S_bwd ← ∅
3:  Initialize candidate pool c_j^(i), ∀ j ∈ [0, N-1], ∀ i ∈ [0, W-1]
4:  t ← n + 1
5:  while t ≤ n + m do
6:    {Obtain the guesses}
7:    𝒢 ← S_fwd[x_{t-1}]
8:    u ← ∅
9:    for j = 0 to N-1 do
10:     for k = N-1 down to 1 do
11:       u_j ← S_bwd[x_{t+j-k:t-1} ⊕ u_{0:j-1}]
12:       break if a value for u_j is found
13:     end for
14:   end for
15:   𝒢.append(u)
16:   𝒢 ← 𝒢 ⊕ 𝒢_retrieve   ▷ retrieval integration (optional)
17:   𝒢 ← 𝒢_{0:G-1}   ▷ ensure at most G guesses
18:   {Forward in LLM}
19:   Obtain the necessary distributions of P_M in parallel
20:   {Verification}
21:   {Greedy verify (Alg. 3) or sampling verify (Alg. 4)}
22:   hits ← VerificationFunction(x, P_M, 𝒢)
23:   x ← x ⊕ hits
24:   t ← t + size(hits)
25:   {Predict candidates}
26:   for i = 0 to W-1 do
28:     r ∼ Uniform[0, 1]
29:     P_c(c_{N-1}^(i)) ← P_M(c_{N-1}^(i) | c_{:N-2}^(i), x)
30:     if r > τ then
31:       c_{N-1}^(i) ← argmax of P_c over tokens not yet in S_fwd
32:     else
33:       c_{N-1}^(i) ← argmax P_c(c_{N-1}^(i))
34:   end for
35:   {Update N-gram dictionaries}
36:   for i = 0 to W-1 do
38:     for j = 0 to N-2 do
39:       S_fwd[c_j^(i)].append(c_{j+1:}^(i))
40:       S_bwd[c_{0:j}^(i)] ← c_{j+1}^(i)
41:     end for
42:   end for
43:   c_j^(i) ← c_{j+1}^(i), ∀ j ∈ [0, N-2], ∀ i
44: end while
45: Output: x_{n+1:n+m} = (y_1, y_2, \ldots, y_m)

Algorithm 1: Main SPECTRA Decoding Loop

  • Input:
    • x = (x_1, \ldots, x_n): The initial sequence of tokens.
    • P_M: The target LLM (probability distribution function).
    • N: Maximum N-gram size.
    • W: Candidate pool size.
    • G: Maximum number of guesses allowed in parallel.
    • m: Maximum number of new tokens to generate.
    • \tau: Refine threshold for candidate pool prediction.
  • Initialization (lines 1-4):
    • S_fwd and S_bwd are initialized as empty N-gram dictionaries.
    • c_j^{(i)} represents the j-th token of the i-th sequence in the candidate pool, which is initialized.
    • t tracks the current token index in the output sequence, starting after the initial prompt.
  • Main Loop (lines 5-44): Continues as long as the desired number of new tokens (m) has not been generated.
    • Obtain the guesses (lines 6-17):
      • Line 7: Initializes the set of guesses \mathcal{G} by searching S_fwd with the last generated token x_{t-1}.
      • Lines 8-14: Constructs a high-quality guess u using S_bwd through an iterative backward search of up to N-1 tokens. Here x_{t+j-k:t-1} is the preceding context used as part of the backward-search key, and u_{0:j-1} is the guess constructed so far. If a value for u_j is found in S_bwd, the inner loop breaks.
      • Line 15: Appends the high-quality guess u to \mathcal{G}.
      • Line 16: Retrieval Integration (Optional): Appends guesses from SPECTRA-RETRIEVAL (\mathcal{G}_{retrieve}) to \mathcal{G}. This is where the external speculation integrates seamlessly.
      • Line 17: Truncates \mathcal{G} so that it does not exceed the maximum number of guesses G.
    • Forward in LLM (lines 18-19):
      • Line 19: Performs a single LLM forward pass to obtain all necessary probability distributions for both verification and candidate prediction in parallel.
    • Verification (lines 20-24):
      • Line 22: Calls a VerificationFunction (either Algorithm 3 for greedy or Algorithm 4 for sampling) with the current sequence x, the LLM P_M, and the collected guesses \mathcal{G}. It returns hits, the accepted tokens.
      • Line 23: Appends hits to the main output sequence x.
      • Line 24: Increments t by the number of accepted tokens.
    • Predict Candidates (lines 25-34):
      • For each sequence i in the candidate pool (W sequences):
        • Line 28: Samples a random value r.
        • Line 29: Computes the probability distribution P_c(c_{N-1}^{(i)}) for the next token based on the current candidate sequence prefix c_{:N-2}^{(i)} and the context x.
        • Lines 30-31: If r > \tau, it attempts to select the argmax token that is not already in S_fwd, to promote coverage.
        • Line 33: Otherwise, it simply selects the argmax token.
    • Update N-gram dictionaries (lines 35-42):
      • For each sequence i in the candidate pool and for each possible N-gram length j from 0 to N-2:
        • Line 39: Appends subsequences to S_fwd using c_j^{(i)} as the key.
        • Line 40: Updates S_bwd by mapping the prefix c_{0:j}^{(i)} to the next token c_{j+1}^{(i)}.
    • Candidate Pool Shift (line 43):
      • All candidate sequences in C are shifted left by one token, making room for new predictions in the next iteration.
  • Output (line 45): The generated sequence of mm new tokens.

4.2.4. Greedy Verification (Algorithm 3)

Require: sequence x, model P_M, guesses 𝒢 = {g^(i)} with i ∈ [0, G-1]
Ensure: o (accepted tokens, of length 1 to N)
1:  function GreedyVerification(x, P_M, 𝒢)
2:    D ← ∅   ▷ store the distributions
3:    V ← 𝒢   ▷ store the current guesses
4:    for i = 0 to G-1 do
5:      D.append(P_M(g'^(i), x_next | g^(i), x))   ▷ the last token of x and g^(i) yield N distributions in total
6:    end for
7:    for i = 1 to N-1 do
8:      j ← 1
9:      is_accept ← 0
10:     𝒫 ← D[1]_i
11:     while j ≤ size(V) do
12:       s_j ← V[j]_i
13:       if s_j = argmax 𝒫 then   ▷ accepted: update all potential speculations and probabilities
14:         o.append(s_j)
15:         is_accept ← 1
16:         V_new, D_new ← ∅, ∅
17:         for k = j to size(V) do
18:           if s_j = V[k]_i then
19:             V_new.append(V[k])
20:             D_new.append(D[k])
21:           end if
22:         end for
23:         V, D ← V_new, D_new
24:         break
25:       else   ▷ rejected: go to next speculation
26:         j ← j + 1
27:       end if
28:     end while
29:     if is_accept then
30:       continue
31:     else   ▷ guarantee one step of movement
32:       o.append(argmax 𝒫)
33:       break
34:     end if
35:   end for
36:   if is_accept then
37:     o.append(argmax D[1]_N)
38:   end if
39:   return o
40: end function

Algorithm 3: Greedy Verification with SPECTRA DECODING

  • Input:
    • x: The current sequence of tokens.
    • P_M: The target LLM.
    • \mathcal{G} = \{g^{(i)}\}: The set of guesses.
  • Output: o, the list of accepted tokens.
  • Process:
    • Lines 2-6: Initializes D (the list of distributions for each guess from P_M) and V (the list of current guesses). For each guess g^{(i)}, it computes the LLM's predicted distributions for tokens up to length N, conditioned on the context x and the guess prefix.
    • Lines 7-35: Iterates through the potential positions i within the guesses (from 1 to N-1).
      • Lines 11-28: For each position, it iterates through the available guesses (V).
        • Line 13: Checks whether the token s_j from the current guess matches the argmax token predicted by the LLM at that position (\mathcal{P}).
        • If accepted (match found):
          • Line 14: Appends s_j to the output o.
          • Line 15: Sets the is_accept flag.
          • Lines 16-23: Filters V and D to keep only the guesses consistent with the newly accepted token s_j. This effectively prunes inconsistent guesses.
          • Line 24: Breaks from the inner loop (the accepted token has been processed).
        • If rejected:
          • Line 26: Moves to the next guess (j+1) to check for acceptance.
      • Lines 29-34: If is_accept is true, it continues to the next position. If no guess is accepted at the current position, it guarantees one step of movement by autoregressively taking the argmax token, then breaks from the outer loop.
    • Lines 36-38: If tokens were accepted through the entire loop (meaning is_accept is still true), it appends the argmax token from the last distribution (D[1]_N), yielding a full N accepted tokens when every position matches.
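To make the control flow concrete, here is a simplified Python sketch of greedy verification. It assumes all guesses share the same length and that dists[i] holds the target LLM's (N, vocab)-shaped distributions for guess i from the single forward pass; the real Algorithm 3 additionally handles variable-length guesses and the pruned distribution list.

```python
import numpy as np

def greedy_verify(guesses: list[list[int]], dists: list[np.ndarray]) -> list[int]:
    """Accept the longest prefix on which some guess matches the LLM's argmax tokens."""
    accepted: list[int] = []
    alive = list(range(len(guesses)))            # guesses still consistent so far
    for pos in range(len(guesses[0])):
        target = int(np.argmax(dists[alive[0]][pos]))   # LLM's greedy choice here
        matching = [i for i in alive if guesses[i][pos] == target]
        accepted.append(target)
        if not matching:
            return accepted                      # rejected: argmax still gives one step of progress
        alive = matching
    # Every position matched: take one bonus token from the last distribution.
    accepted.append(int(np.argmax(dists[alive[0]][-1])))
    return accepted
```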

4.2.5. Sample Verification (Algorithm 4)

Require: sequence x, model P_M, guesses 𝒢 = {g^(i)} with i ∈ [0, G-1]
Ensure: o (accepted tokens, of length 1 to N)
1:  function SampleVerification(x, P_M, 𝒢)
2:    D ← ∅   ▷ store the distributions
3:    V ← 𝒢   ▷ store the current guesses
4:    for i = 0 to G-1 do
5:      D.append(P_M(g'^(i), x_next | g^(i), x))   ▷ the last token of x and g^(i) yield N distributions in total
6:    end for
7:    for i = 1 to N-1 do
8:      j ← 1
9:      is_accept ← 0
10:     𝒫_j ← D[j]_i
11:     while j ≤ size(V) do
12:       s_j ← V[j]_i
13:       sample r ∼ U(0, 1)
14:       if r ≤ 𝒫_j(s_j) then   ▷ accepted: update all potential speculations and probabilities
15:         o.append(s_j)
16:         is_accept ← 1
17:         V_new, D_new ← ∅, ∅
18:         for k = j to size(V) do
19:           if s_j = V[k]_i then
20:             V_new.append(V[k])
21:             D_new.append(D[k])
22:           end if
23:         end for
24:         V, D ← V_new, D_new
25:         break
26:       else   ▷ rejected: go to next speculation
27:         𝒫_j(s_j) ← 0
28:         𝒫_{j+1} ← norm(𝒫_j)
29:         j ← j + 1
30:       end if
31:     end while
32:     if is_accept then
33:       continue
34:     else   ▷ guarantee one step of movement
35:       sample x_next ∼ 𝒫_j
36:       o.append(x_next)
37:       break
38:     end if
39:   end for
40:   if is_accept then
41:     o.append(sample x_next ∼ D[1]_N)
42:   end if
43:   return o
44: end function

Algorithm 4: Sample Verification

  • Input: Same as GreedyVerification (Algorithm 3).
  • Output: o, the list of accepted tokens.
  • Process: This algorithm is similar to GreedyVerification but adapted for stochastic sampling (e.g., with temperature > 0).
    • Lines 2-6: Initializes D and V similarly.
    • Lines 7-31: Iterates through potential positions and guesses.
      • Lines 13-14: Instead of argmax, a token s_j from a guess is accepted if a randomly sampled value r from a Uniform distribution on [0, 1] is less than or equal to the LLM's predicted probability for that token (\mathcal{P}_j(s_j)). This introduces stochasticity.
      • If accepted:
        • Lines 15-24: Similar to greedy, s_j is appended to o, is_accept is set, and V and D are filtered.
      • If rejected:
        • Lines 27-29: The probability of the rejected token s_j is set to 0 (\mathcal{P}_j(s_j) \gets 0), the remaining probabilities are renormalized (\mathcal{P}_{j+1} = \operatorname{norm}(\mathcal{P}_j)), and it proceeds to the next guess. This is a key difference, ensuring the sampling distribution is correctly updated after a rejection.
    • Lines 32-38: If is_accept is true, it continues. If no guess is accepted at a position, it samples the next token x_next from the modified distribution \mathcal{P}_j and appends it to o before breaking.
    • Lines 40-42: If tokens were accepted through the entire loop, it samples the last token from the distribution D[1]_N.
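The reject-and-renormalize step is the crux of lossless sampling verification. Below is a small sketch of what happens at a single position; p, candidates, and verify_position are illustrative names, and the full algorithm additionally prunes the guess and distribution lists as in Algorithm 4.

```python
import numpy as np

def verify_position(p: np.ndarray, candidates: list[int], rng: np.random.Generator):
    """Sampling-style verification at one position (simplified sketch).

    p: target-LLM token probabilities at this position.
    candidates: candidate token ids proposed by the guesses, tried in order.
    A candidate is accepted with probability p[token]; on rejection its mass is
    removed and the distribution is renormalized, mirroring Algorithm 4.
    """
    p = p.copy()
    for tok in candidates:
        if rng.uniform() <= p[tok]:
            return tok, True          # accepted speculative token
        p[tok] = 0.0                  # rejected: zero out and renormalize
        p = p / p.sum()
    # No candidate accepted: sample one token from the residual distribution.
    return int(rng.choice(len(p), p=p)), False
```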

5. Experimental Setup

5.1. Datasets

The experiments are conducted on a diverse set of tasks and datasets to evaluate SPECTRA's performance across different domains.

  • MT-Bench (Zheng et al., 2023): A benchmark for multi-turn conversation.
    • Characteristic: Involves complex dialogues and assessing LLM performance in interactive, conversational settings.
    • Data Source for SPECTRA-RETRIEVAL: Approximately 100k examples from the UltraChat dataset (Ding et al., 2023) are used, specifically those with minimal perplexity under the target LLM.
  • GSM8K (Cobbe et al., 2021): A dataset for mathematical reasoning.
    • Characteristic: Requires step-by-step logical thinking and numerical accuracy to solve grade school math word problems.
  • HumanEval (Chen et al., 2021): A benchmark for code generation.
    • Characteristic: Assesses the LLM's ability to generate functional Python code given a natural language prompt.
    • Data Source for SPECTRA-RETRIEVAL: Snippets from TheStack (Kocetkov et al., 2023) are refined to 100k examples with the lowest perplexity.
  • MBPP (Austin et al., 2021): Another benchmark for code generation.
    • Characteristic: Focuses on generating short Python programs from natural language descriptions and unit tests.
  • ClassEval (Du et al., 2023): A benchmark for class-level code generation.
    • Characteristic: Evaluates LLMs on more complex code generation tasks involving class definitions and methods.

      These datasets are chosen because they represent a variety of LLM applications, including general language tasks (conversation), logical reasoning (math), and specialized programming tasks (code generation). This diversity allows for a comprehensive evaluation of SPECTRA's robustness and effectiveness.

5.2. Evaluation Metrics

SPECTRA is a lossless acceleration method, meaning it does not modify the original LLM's output distribution. Therefore, the generation quality remains the same as the original LLM. The evaluation primarily focuses on acceleration performance.

  • Speedup Ratio:

    • Conceptual Definition: This metric quantifies how much faster the proposed method is compared to a baseline method (typically standard autoregressive decoding). A speedup ratio of X means the method is X times faster.
    • Mathematical Formula: $ \text{Speedup Ratio} = \frac{\text{Time}_{\text{Autoregressive}}}{\text{Time}_{\text{Method}}} $ Where:
      • \text{Time}_{\text{Autoregressive}} is the total time taken by autoregressive decoding to generate a sequence.
      • \text{Time}_{\text{Method}} is the total time taken by SPECTRA (or another speculative method) to generate the same sequence. (A small computation example of these metrics is given after this list.)
    • Symbol Explanation: Time is usually measured in seconds or milliseconds.
  • Compression Ratio (\tau):

    • Conceptual Definition: This metric measures the efficiency of the speculative decoding process itself, independent of specific hardware. It represents how many tokens are processed per effective LLM forward pass. A higher compression ratio indicates that more tokens are accepted on average in each speculative step.
    • Mathematical Formula: $ \tau = \frac{\text{Total Autoregressive Steps}}{\text{Total Speculative Steps}} $ Where:
      • \text{Total Autoregressive Steps} is the number of steps required if the sequence were generated one token at a time, which is equivalent to the total length of the generated sequence.
      • \text{Total Speculative Steps} is the number of LLM forward passes required by the speculative decoding method to generate the same sequence.
    • Symbol Explanation: Steps refer to LLM forward passes.
  • ROUGE-N (Recall-Oriented Understudy for Gisting Evaluation): (Used for sampling decoding quality evaluation)

    • Conceptual Definition: ROUGE is a set of metrics used for evaluating automatic summarization and machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries (human-produced). ROUGE-N specifically measures the overlap of N-grams between the generated text and the reference text. ROUGE-1 measures unigram (single word) overlap, ROUGE-2 measures bigram overlap, and ROUGE-L measures the longest common subsequence (LCS) match, which accounts for sequence order.
    • Mathematical Formula (for ROUGE-N Recall, common in summarization): $ \text{ROUGE-N} = \frac{\sum_{\text{sentence} \in \text{Reference Summaries}} \sum_{n\text{-gram} \in \text{sentence}} \text{Count}_{\text{match}}(n\text{-gram})}{\sum_{\text{sentence} \in \text{Reference Summaries}} \sum_{n\text{-gram} \in \text{sentence}} \text{Count}(n\text{-gram})} $ Where:
      • n\text{-gram} refers to a specific N-gram (e.g., a single word for ROUGE-1, a pair of words for ROUGE-2).
      • \text{Count}_{\text{match}}(n\text{-gram}) is the maximum number of N-grams co-occurring in a system summary and a set of reference summaries.
      • \text{Count}(n\text{-gram}) is the number of N-grams in the reference summary.
    • Symbol Explanation: This formula calculates ROUGE-N based on recall. Precision and F-measure versions also exist. ROUGE-L (LCS) has a different underlying formula based on finding the longest common subsequence between texts.
  • Throughput Metrics:

    • Macro Throughput (Mac-TP):
      • Conceptual Definition: The average rate of token processing per generation step. It gives insight into the efficiency of individual decoding iterations.
      • Mathematical Formula: $ \text{Mac-TP} = \frac{1}{S} \sum_{i=1}^{S} \frac{\text{tokens}_i}{\text{time}_i} $ Where:
        • S is the total number of speculative decoding steps.
        • \text{tokens}_i is the number of tokens accepted in step i.
        • \text{time}_i is the time taken for step i.
    • Micro Throughput (Mic-TP):
      • Conceptual Definition: The overall rate of token generation, calculated as the total number of generated tokens divided by the total elapsed time. This is a common practical measure of generation speed.
      • Mathematical Formula: $ \text{Mic-TP} = \frac{\text{Total Generated Tokens}}{\text{Total Elapsed Time}} $
    • Symbol Explanation: Both are typically measured in tokens per second (tokens/s).
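For concreteness, the two acceleration metrics can be computed as follows; the numbers in the usage example are made up for illustration and are not results from the paper.

```python
def speedup_ratio(time_autoregressive: float, time_method: float) -> float:
    return time_autoregressive / time_method

def compression_ratio(total_generated_tokens: int, llm_forward_passes: int) -> float:
    # Tokens accepted per LLM forward pass; plain autoregressive decoding gives exactly 1.0.
    return total_generated_tokens / llm_forward_passes

# Illustrative numbers: 512 tokens generated in 120 forward passes,
# taking 10.2 s versus 35.7 s for autoregressive decoding.
print(speedup_ratio(35.7, 10.2))     # ~3.5x
print(compression_ratio(512, 120))   # ~4.27 tokens per step
```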

5.3. Baselines

SPECTRA is compared against standard autoregressive decoding and several leading training-free speculative decoding approaches:

  • Autoregressive Decoding: The standard, one-token-at-a-time generation method. This serves as the primary baseline, with a speedup ratio of 1.00x.

  • Adaptive N-gram Parallel Decoding (ANPD) (Ou et al., 2024): A training-free method that leverages adaptive N-grams for parallel decoding.

  • REST (Retrieval-Based Speculative Decoding) (He et al., 2024): A training-free method that uses a retrieval mechanism from an external datastore to generate guesses. This is a direct competitor to SPECTRA-RETRIEVAL.

  • Lookahead Decoding (Fu et al., 2024): A training-free method that breaks the sequential dependency of LLM inference by generating and validating multiple candidate tokens in parallel. This is a direct competitor to SPECTRA-CORE.

    The baselines are replicated using their publicly available GitHub code with default settings and hyperparameters from their original papers to ensure fair comparison.

5.4. Models

The evaluation covers a range of LLM architectures and sizes:

  • LLaMA-2-Chat (Touvron et al., 2023): Sizes 7B, 13B, 70B. These are general-purpose chat models.

  • CodeLlama (Rozière et al., 2024): Sizes 7B, 13B. These models are specialized for code generation.

  • LLaMA-3-Instruct (Dubey et al., 2024): Sizes 8B, 70B. Instruction-tuned models designed for following instructions.

    All model checkpoints are sourced from official repositories or Hugging Face.

  • Precision: For 7B and 13B models, 16-bit (FP16) precision with a pre-allocated key-value cache is used. For 70B models, 8-bit quantization is primarily used for the main results in Table 1, with FP16 evaluations in multi-GPU settings described in Appendix E. Numerical consistency between FP32 and FP16 for LLaMA-2-7B is also verified.

5.5. Hardware

  • Primary GPU: Single NVIDIA A100 GPU with 80GB of memory for most experiments.
  • Hardware Scalability Test (Appendix C): Additional NVIDIA GPUs including RTX 3090, RTX 8000, A40, and A6000 are used to test SPECTRA's robustness across different hardware.
  • Multi-GPU Environments (Appendix E): For 70B models exceeding single-GPU memory in FP16, distributed computation across multiple H100 GPUs (2x, 4x, or 8x) is performed using Hugging Face's pipeline parallelism.

5.6. Hyperparameters

  • SPECTRA:
    • N-gram size (N): 5-gram setup for both forward and backward dictionaries by default.
    • Candidate Pool Size (W): 15 sequences maintained per key.
    • Refine Threshold (\tau): 0.1 by default, for probabilistically encouraging the selection of tokens not yet present in S_fwd.
    • Max Guesses (G): Up to 15 guesses allowed per step.
    • Guess Priority: Internal guesses from SPECTRA-CORE receive priority; if the G limit is not reached, external guesses are added.
    • External Lookup: A Trie structure is used for rapid prefix queries, similar to REST.
    • Corpus for SPECTRA-RETRIEVAL:
      • Conversation tasks (MT-Bench): ~100k low-perplexity examples from UltraChat.
      • Code tasks (HumanEval, MBPP): ~100k low-perplexity snippets from TheStack.
  • Batch Size: All speedup and throughput metrics are computed at a batch size of 1.
  • Generation Length: Max 512 tokens for code generation; max 1024 tokens for conversation/math tasks, or until end-of-sequence token.
  • Random Seeds: All random seeds are set to 0.

6. Results & Analysis

6.1. Core Results Analysis

The main experimental results demonstrate that SPECTRA consistently achieves superior acceleration compared to other training-free speculative decoding methods across a wide range of models and tasks.

The following are the results from Table 1 of the original paper:

Each cell reports Speedup / τ.

| Model | Method | Classeval | GSM8K | Humaneval | MBPP | MTBench | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Greedy (temperature=0) | | | | | | | |
| CL-13B | ANPD | 1.94 / 2.52 | 2.81 / 3.72 | 2.08 / 2.50 | 2.71 / 3.58 | 2.61 / 3.41 | 2.43 / 3.15 |
| CL-13B | Lookahead | 2.25 / 3.61 | 2.80 / 4.24 | 2.30 / 3.16 | 2.91 / 4.44 | 2.59 / 4.04 | 2.57 / 3.90 |
| CL-13B | REST | 1.28 / 2.14 | 0.93 / 1.54 | 1.58 / 2.31 | 0.85 / 1.40 | 0.94 / 1.53 | 1.12 / 1.78 |
| CL-13B | SPECTRA (Ours) | 4.06 / 4.65 | 3.75 / 4.63 | 2.91 / 3.95 | 3.29 / 4.46 | 2.65 / 4.40 | 3.33 / 4.42 |
| CL-7B | ANPD | 2.30 / 3.21 | 2.68 / 3.78 | 2.16 / 2.90 | 3.16 / 3.83 | 3.35 / 3.83 | 2.73 / 3.51 |
| CL-7B | Lookahead | 2.59 / 3.66 | 2.99 / 3.83 | 2.50 / 3.05 | 3.25 / 4.10 | 3.70 / 4.52 | 3.01 / 3.83 |
| CL-7B | REST | 1.45 / 2.22 | 0.91 / 1.39 | 1.70 / 2.34 | 0.96 / 1.45 | 1.02 / 1.44 | 1.21 / 1.77 |
| CL-7B | SPECTRA (Ours) | 2.70 / 4.10 | 3.33 / 4.59 | 2.96 / 3.90 | 3.56 / 4.45 | 3.70 / 4.52 | 3.25 / 4.31 |
| L2-13B | ANPD | 1.36 / 1.72 | 1.78 / 2.01 | 1.47 / 1.87 | 1.61 / 2.32 | 1.32 / 1.69 | 1.51 / 1.92 |
| L2-13B | Lookahead | 1.78 / 2.76 | 1.81 / 2.01 | 1.46 / 1.87 | 1.73 / 2.32 | 1.38 / 1.69 | 1.63 / 2.13 |
| L2-13B | REST | 1.22 / 1.90 | 0.94 / 1.46 | 1.25 / 1.94 | 0.95 / 1.44 | 1.14 / 1.10 | 1.10 / 1.57 |
| L2-13B | SPECTRA (Ours) | 2.00 / 3.24 | 1.83 / 2.62 | 1.96 / 2.91 | 1.63 / 2.24 | 1.75 / 2.60 | 1.83 / 2.72 |
| L2-70B | ANPD | 1.34 / 2.00 | 1.30 / 1.95 | 1.33 / 1.90 | 1.25 / 1.86 | 1.21 / 1.87 | 1.29 / 1.92 |
| L2-70B | Lookahead | 1.82 / 2.65 | 1.90 / 2.87 | 1.63 / 2.52 | 1.61 / 2.69 | 1.86 / 2.57 | 1.76 / 2.66 |
| L2-70B | REST | 1.17 / 1.49 | 1.20 / 1.54 | 1.94 / 1.30 | 1.56 / 2.10 | 1.46 / 1.45 | 1.47 / 1.58 |
| L2-70B | SPECTRA (Ours) | 2.43 / 3.40 | 1.95 / 2.62 | 2.52 / 3.22 | 1.52 / 1.86 | 1.68 / 1.93 | 2.02 / 2.61 |
| L2-7B | ANPD | 1.33 / 1.73 | 1.30 / 1.69 | 1.37 / 1.82 | 1.25 / 1.82 | 1.01 / 1.46 | 1.25 / 1.70 |
| L2-7B | Lookahead | 1.73 / 2.59 | 1.77 / 2.67 | 2.05 / 2.77 | 1.41 / 2.16 | 1.46 / 2.02 | 1.60 / 2.44 |
| L2-7B | REST | 1.01 / 1.46 | 1.14 / 1.23 | 1.53 / 1.85 | 1.69 / 2.10 | 1.14 / 1.23 | 1.30 / 1.57 |
| L2-7B | SPECTRA (Ours) | 2.59 / 3.86 | 2.83 / 4.57 | 3.49 / 4.65 | 2.49 / 3.36 | 2.49 / 3.58 | 2.78 / 4.00 |
| L3-70B | ANPD | 1.15 / 1.31 | 1.38 / 1.61 | 1.46 / 1.83 | 1.07 / 1.30 | 1.05 / 1.30 | 1.22 / 1.47 |
| L3-70B | Lookahead | 1.24 / 1.42 | 1.34 / 1.96 | 2.31 / 2.89 | 2.33 / 3.48 | 1.20 / 1.35 | 1.68 / 2.22 |
| L3-70B | REST | 1.14 / 1.35 | 1.68 / 1.96 | 1.29 / 1.75 | 1.10 / 1.32 | 1.12 / 1.27 | 1.27 / 1.53 |
| L3-70B | SPECTRA (Ours) | 1.54 / 2.03 | 1.23 / 1.86 | 1.81 / 2.25 | 1.20 / 1.73 | 1.19 / 1.41 | 1.39 / 1.86 |
| L3-8B | ANPD | 1.19 / 1.43 | 0.88 / 1.33 | 1.35 / 1.73 | 1.43 / 1.81 | 1.19 / 1.43 | 1.21 / 1.55 |
| L3-8B | Lookahead | 1.52 / 1.98 | 1.18 / 1.72 | 1.70 / 2.12 | 1.33 / 1.98 | 1.07 / 1.40 | 1.36 / 1.84 |
| L3-8B | REST | 1.18 / 1.48 | 1.12 / 1.44 | 1.74 / 2.01 | 1.12 / 1.48 | 1.03 / 1.22 | 1.24 / 1.53 |
| L3-8B | SPECTRA (Ours) | 1.70 / 2.22 | 1.52 / 2.22 | 1.96 / 2.75 | 1.18 / 1.69 | 1.46 / 1.88 | 1.56 / 2.15 |
| Sampling (temperature=1.0) | | | | | | | |
| CL-13B | ANPD | 1.15 / 1.31 | 1.38 / 1.61 | 1.46 / 1.83 | 1.07 / 1.30 | 1.05 / 1.30 | 1.22 / 1.47 |
| CL-13B | Lookahead | 1.24 / 1.42 | 1.34 / 1.96 | 2.31 / 2.89 | 2.33 / 3.48 | 1.20 / 1.35 | 1.68 / 2.22 |
| CL-13B | REST | 1.14 / 1.35 | 1.68 / 1.96 | 1.29 / 1.75 | 1.10 / 1.32 | 1.12 / 1.27 | 1.27 / 1.53 |
| CL-13B | SPECTRA (Ours) | 1.54 / 2.03 | 1.23 / 1.86 | 1.81 / 2.25 | 1.20 / 1.73 | 1.19 / 1.41 | 1.39 / 1.86 |
| CL-7B | ANPD | 1.19 / 1.43 | 0.88 / 1.33 | 1.35 / 1.73 | 1.43 / 1.81 | 1.19 / 1.43 | 1.21 / 1.55 |
| CL-7B | Lookahead | 1.52 / 1.98 | 1.18 / 1.72 | 1.70 / 2.12 | 1.33 / 1.98 | 1.07 / 1.40 | 1.36 / 1.84 |
| CL-7B | REST | 1.18 / 1.48 | 1.12 / 1.44 | 1.74 / 2.01 | 1.12 / 1.48 | 1.03 / 1.22 | 1.24 / 1.53 |
| CL-7B | SPECTRA (Ours) | 1.70 / 2.22 | 1.52 / 2.22 | 1.96 / 2.75 | 1.18 / 1.69 | 1.46 / 1.88 | 1.56 / 2.15 |
| L2-13B | ANPD | 1.31 / 1.51 | 1.78 / 2.30 | 1.26 / 1.48 | 1.34 / 1.48 | 1.10 / 1.22 | 1.36 / 1.60 |
| L2-13B | Lookahead | 1.78 / 2.30 | 1.72 / 2.09 | 1.51 / 1.76 | 1.28 / 1.46 | 1.25 / 1.36 | 1.51 / 1.79 |
| L2-13B | REST | 1.26 / 1.46 | 0.99 / 1.27 | 1.46 / 1.72 | 1.21 / 1.36 | 1.14 / 1.21 | 1.21 / 1.40 |
| L2-13B | SPECTRA (Ours) | 1.72 / 2.35 | 1.82 / 2.62 | 1.76 / 2.35 | 1.59 / 2.02 | 1.49 / 1.88 | 1.68 / 2.24 |

Table 1: Overall performance of speculative decoding methods across multiple tasks. "CL-xB" denotes CodeLlama with x billion parameters, "L2-xB" denotes LLaMA-2-Chat with x billion parameters, and "L3-xB" denotes LLaMA-3-Instruct with x billion parameters. We report the speedup ratio (vs. autoregressive) and the compression ratio τ.
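
For reference, the two metrics in Table 1 can be computed from simple decoding logs. The sketch below assumes the usual speculative-decoding definitions, namely that the compression ratio τ is the average number of tokens accepted per draft-and-verify step and that speedup is the ratio of autoregressive to speculative wall-clock time; these definitions are stated here as assumptions rather than quoted from the paper.

```python
def compression_ratio(tokens_generated: int, verify_steps: int) -> float:
    """Average number of tokens accepted per draft-and-verify step (assumed definition of tau)."""
    return tokens_generated / verify_steps

def speedup(autoregressive_seconds: float, speculative_seconds: float) -> float:
    """Wall-clock speedup of a speculative method over standard autoregressive decoding."""
    return autoregressive_seconds / speculative_seconds

# Toy numbers: 512 tokens produced in 120 verify steps, 4.1 s vs. 12.5 s autoregressively.
print(f"tau     = {compression_ratio(512, 120):.2f}")  # ~4.27 tokens per step
print(f"speedup = {speedup(12.5, 4.1):.2f}x")          # ~3.05x
```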

  • Overall Performance (Greedy Decoding):

    • SPECTRA consistently achieves the highest speedup ratios across all models and datasets under greedy decoding (temperature=0), peaking at 4.06x for CodeLlama-13B on ClassEval in Table 1; the paper's headline figure of up to 4.08x is drawn from its full set of results.
    • For the 7B-class models (e.g., CL-7B, L2-7B), SPECTRA often approaches or exceeds 3x acceleration, reflecting its high per-step token acceptance; gains for L3-8B are smaller but still lead the baselines on average.
    • For larger 13B and 70B models, speedups are generally between 1.6x and 3x, still outperforming all baselines.
    • This robustness across diverse models (Llama 2, Llama 3, CodeLlama) and tasks (conversation, math, code generation) highlights SPECTRA's generalizability. Baselines such as REST sometimes fall below 1x, i.e., run slower than plain autoregressive decoding (e.g., 0.93x for CL-13B on GSM8K), indicating their instability.
  • Compression Ratio (τ):

    • SPECTRA consistently delivers the highest average compression ratio across all datasets and LLMs. This indicates that SPECTRA's draft-and-verify iterations typically accept a greater number of tokens (ranging from 2.1 to 4.8 tokens per step), significantly more than alternative approaches. This nearly doubles the acceptance length achieved by REST, directly translating to higher speedups.
  • Acceleration in Sampling Decoding (temperature=1.0):

    • Under sampling-based decoding, SPECTRA continues to provide acceleration, offering roughly 1.15x – 2.77x speedups over standard autoregressive decoding.

    • These gains are more modest than in greedy decoding, which is an expected outcome for speculative decoding methods under stochastic sampling due to lower acceptance rates. This observation is consistent with findings in prior work (Fu et al., 2024; Leviathan et al., 2023). Nevertheless, SPECTRA still maintains its lead over other methods in this more challenging setting.

      The following figure (Figure 6 from the original paper) plots the cumulative number of accepted tokens versus decoding steps:

      Figure 6: Total number of accepted tokens across all samples at each decoding step.

Figure 6 visualizes the total number of accepted tokens across all samples at each decoding step for Llama2-7B-chat with greedy decoding on various datasets. The steeper slope of the SPECTRA curve (red line) relative to the other methods (ANPD, Lookahead, REST, and autoregressive decoding) indicates that SPECTRA needs substantially fewer decoding steps to produce the same number of tokens, a direct consequence of its higher token acceptance rate. On HumanEval, for example, SPECTRA reaches the same total token count in roughly half as many steps as ANPD.
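
A curve of the kind shown in Figure 6 can be reproduced by logging how many tokens each verification step accepts and accumulating the counts. The per-step logs below are made-up values for illustration only, not measurements from the paper.

```python
import itertools

import matplotlib.pyplot as plt

# Hypothetical per-step accepted-token logs for two decoding strategies.
accepted_per_step = {
    "Autoregressive": [1] * 200,             # exactly one token per decoding step
    "SPECTRA":        [4, 3, 5, 2, 4] * 40,  # several tokens accepted per verify step
}

for name, per_step in accepted_per_step.items():
    cumulative = list(itertools.accumulate(per_step))
    plt.plot(range(1, len(cumulative) + 1), cumulative, label=name)

plt.xlabel("decoding step")
plt.ylabel("cumulative accepted tokens")
plt.legend()
plt.savefig("accepted_tokens_per_step.png")
```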

6.2. Ablation Studies / Parameter Analysis

The following are the results from Table 2 of the original paper:

| Method | GSM8K Speedup | GSM8K τ | MTBench Speedup | MTBench τ |
| --- | --- | --- | --- | --- |
| REST | 1.01 | 1.47 | 1.25 | 1.90 |
| Lookahead | 1.66 | 1.93 | 1.73 | 2.05 |
| Lookahead + REST | 1.08 | 1.47 | 1.27 | 1.90 |
| SPECTRA's ablation | | | | |
| CORE Module | 2.04 | 2.50 | 1.92 | 2.35 |
| - w/o Forward Dict | 1.50 | 1.68 | 1.20 | 1.37 |
| - w/o Backward Dict | 1.94 | 2.21 | 1.74 | 2.12 |
| - w/o Sub-Ngram | 1.95 | 2.34 | 1.75 | 2.18 |
| RETRIEVAL Module | 1.18 | 1.31 | 1.24 | 1.50 |
| - w/o PPL refine | 1.16 | 1.29 | 1.20 | 1.45 |
| SPECTRA (ours) | 2.14 | 2.64 | 2.02 | 2.59 |

Table 2: Ablation study of SPECTRA's components (greedy decoding, LLaMA2-7B-Chat).

The ablation study (Table 2) on LLaMA2-7B-Chat with greedy decoding provides crucial insights into the contribution of each SPECTRA component:

  • Impact of SPECTRA-CORE Components:

    • The full CORE Module alone achieves a 2.04x speedup on GSM8K and 1.92x on MTBench.
    • Removing the Forward Dictionary (- w/o Forward Dict) drastically reduces the speedup to 1.50x (GSM8K) and 1.20x (MTBench), underscoring the Forward Dictionary's role in supplying the bulk of the initial guesses (a simplified illustration of such dictionary lookups follows this list).
    • Removing the Backward Dictionary (- w/o Backward Dict) reduces the speedup to 1.94x (GSM8K) and 1.74x (MTBench). The drop is less severe than for the Forward Dictionary, but it shows the Backward Dictionary's contribution to the quality of the guesses.
    • Omitting Sub-Ngram inclusion (- w/o Sub-Ngram) also causes a slight drop (1.95x on GSM8K), confirming that indexing subsequences in the N-gram dictionaries improves guess quality and coverage.
  • Impact of SPECTRA-RETRIEVAL Components:

    • The RETRIEVAL Module alone achieves 1.18x speedup on GSM8K and 1.24x on MTBench.
    • Removing perplexity-based filtering (- w/o PPL refine) slightly decreases the speedup to 1.16x (GSM8K) and 1.20x (MTBench). This validates that perplexity refinement does improve the quality of external guesses, though its standalone impact on speedup is more modest than SPECTRA-CORE's main components.
  • Full SPECTRA Framework:

    • The full SPECTRA framework achieves 2.14x speedup on GSM8K and 2.02x on MTBench, surpassing all individual components and combinations. This demonstrates the synergistic effect of integrating all components, maximizing acceptance rates and overall performance.
  • Naive Combination Comparison:

    • A naive combination of Lookahead and REST (Lookahead + REST) performs significantly worse (1.08x on GSM8K, 1.27x on MTBench) than either Lookahead alone (1.66x, 1.73x) or SPECTRA's full framework. This underscores SPECTRA's careful and optimized integration strategy, showing that simply appending guesses from different sources is insufficient.
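
To make the ablated components more concrete, the sketch below shows, in highly simplified form, what a forward N-gram dictionary lookup could look like: observed prefixes map to their observed continuations, and the current context is greedily extended into a guess. The data structure, function names, and most-frequent-continuation heuristic are illustrative assumptions, not the authors' implementation (which additionally maintains a backward dictionary and indexes sub-N-grams, as the ablation above indicates).

```python
from collections import defaultdict

def build_forward_dict(token_ids, n=3):
    """Map each (n-1)-token prefix seen so far to its observed next tokens (illustrative only)."""
    table = defaultdict(list)
    for i in range(len(token_ids) - n + 1):
        prefix = tuple(token_ids[i : i + n - 1])
        table[prefix].append(token_ids[i + n - 1])
    return table

def propose_guess(table, context, n=3, max_len=4):
    """Greedily extend the current context with the most frequent stored continuation."""
    guess, ctx = [], list(context)
    for _ in range(max_len):
        candidates = table.get(tuple(ctx[-(n - 1):]))
        if not candidates:
            break
        next_token = max(set(candidates), key=candidates.count)
        guess.append(next_token)
        ctx.append(next_token)
    return guess

# Toy token stream; in practice the dictionaries are populated from the LLM's own context and outputs.
stream = [5, 8, 13, 5, 8, 13, 5, 8]
print(propose_guess(build_forward_dict(stream), stream))  # -> [13, 5, 8, 13]
```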

6.3. Priority for Source of Guesses

The following figure (Figure 3 from the original paper) shows the acceptance rates of Llama2-7B-chat for different guess sources:

Figure 3: Acceptance rates of Llama2-7B-chat for different guess sources (from SPECTRA-CORE forward dictionary, backward dictionary, SPECTRA-RETRIEVAL). The acceptance rate is the fraction of guessed tokens that pass verification.

Figure 3 illustrates the acceptance rates for different guess sources when the limit on the number of guess sequences is temporarily removed (to assess individual contributions).

  • Guesses generated by SPECTRA-CORE (from both the forward dictionary and backward dictionary) consistently show a higher acceptance rate compared to those obtained from the external knowledge source via SPECTRA-RETRIEVAL.
  • This empirical finding justifies SPECTRA's design choice (as implemented in Algorithm 1) to give priority to internal guesses from SPECTRA-CORE over external guesses from SPECTRA-RETRIEVAL. This ensures that the limited guess budget and LLM forward pass resources are utilized for the most promising candidates, maximizing overall speedup.
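
A simplified picture of this prioritization is a fixed per-step guess budget that is filled from the internal (SPECTRA-CORE) candidates first and only then topped up with external (SPECTRA-RETRIEVAL) candidates. The function and budget below are illustrative assumptions, not the paper's Algorithm 1.

```python
def select_guesses(core_guesses, retrieval_guesses, budget=15):
    """Fill a fixed per-step guess budget: internal (CORE) candidates first,
    then top up with external (RETRIEVAL) candidates. Illustrative only."""
    selected = list(core_guesses[:budget])
    remaining = budget - len(selected)
    if remaining > 0:
        selected.extend(retrieval_guesses[:remaining])
    return selected

core = [["the", "cat", "sat"], ["the", "dog"]]            # guesses from the N-gram dictionaries
external = [["a", "quick", "brown"], ["lorem", "ipsum"]]  # guesses from the retrieval datastore
print(select_guesses(core, external, budget=3))
# -> [['the', 'cat', 'sat'], ['the', 'dog'], ['a', 'quick', 'brown']]
```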

6.4. FlashAttention

The following figure (Figure 4 from the original paper) shows the effect of FlashAttention on speculative decoding speed:

Figure 4: Effect of FlashAttention on speculative decoding speed: Measured speedups on GSM8K and MTBench (LLama2-7B-Chat, greedy decoding). "No Flash" uses standard attention; "With Flash" uses FlashAttention for faster parallel verification.

Figure 4 demonstrates the impact of FlashAttention on the speedup of speculative decoding methods.

  • Consistent Boost: Enabling FlashAttention consistently boosts the speedup for all methods (ANPD, Lookahead, REST, SPECTRA), though to varying degrees.
  • SPECTRA's Greater Benefit: SPECTRA benefits the most, gaining an additional 0.24x speedup on both GSM8K and MTBench. This is attributed to FlashAttention's ability to better exploit the parallel structure of speculative decoding by reducing attention overheads.
  • Reasoning: SPECTRA typically presents the longest verification branches (i.e., accepts more tokens per speculative step, as shown by its higher compression ratio). Therefore, it stands to profit significantly from more efficient attention implementations like FlashAttention, which reduces the computational burden of processing these longer sequences in parallel.
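
For context, one common way to switch a Hugging Face model between the two settings in Figure 4 is the attn_implementation argument (available in recent transformers versions and requiring the flash-attn package plus a supported GPU). The checkpoint below is a placeholder, and this is not necessarily the authors' exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM

# "With Flash": requires the flash-attn package and an Ampere-or-newer GPU;
# omit the argument (or pass "eager"/"sdpa") to fall back to standard attention ("No Flash").
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",           # placeholder checkpoint
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```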

6.5. Other Analysis

6.5.1. Details Results with Throughputs

The following are the results from Table 4 of the original paper, providing detailed throughput analyses (Table 5, also cited in the discussion below, is not reproduced here):

Each cell reports Mac-TP / Mic-TP.

| Model | Method | Classeval | GSM8K | Humaneval | MBPP | MTBench |
| --- | --- | --- | --- | --- | --- | --- |
| Greedy (temperature=0) | | | | | | |
| CL-13B | Autoregressive | 30.85 / 30.85 | 32.03 / 32.03 | 32.35 / 32.35 | 32.07 / 32.07 | 30.69 / 30.63 |
| CL-13B | ANPD | 59.77 / 58.03 | 89.99 / 89.18 | 67.43 / 64.65 | 86.76 / 86.41 | 80.10 / 76.68 |
| CL-13B | Lookahead | 69.28 / 68.62 | 89.73 / 89.00 | 74.33 / 73.23 | 93.38 / 92.80 | 79.38 / 78.67 |
| CL-13B | REST | 39.53 / 37.73 | 29.93 / 29.47 | 51.15 / 47.49 | 27.41 / 27.39 | 28.92 / 27.18 |
| CL-13B | SPECTRA (Ours) | 73.47 / 72.98 | 93.36 / 93.23 | 84.91 / 84.41 | 105.44 / 105.39 | 81.32 / 80.68 |
| CL-7B | Autoregressive | 41.17 / 41.17 | 41.17 / 41.17 | 41.41 / 41.41 | 41.60 / 41.60 | 38.91 / 38.93 |
| CL-7B | ANPD | 94.76 / 93.02 | 132.26 / 131.30 | 89.26 / 87.13 | 131.35 / 130.99 | 130.41 / 126.64 |
| CL-7B | Lookahead | 106.51 / 105.95 | 123.04 / 121.90 | 103.45 / 103.51 | 120.75 / 120.23 | 125.58 / 124.77 |
| CL-7B | REST | 59.49 / 56.61 | 37.61 / 37.21 | 70.38 / 65.22 | 40.11 / 40.09 | 39.64 / 36.70 |
| CL-7B | SPECTRA (Ours) | 111.09 / 110.68 | 137.24 / 136.86 | 122.54 / 122.41 | 148.32 / 148.07 | 143.98 / 144.32 |
| L2-13B | Autoregressive | 31.85 / 31.56 | 32.40 / 32.43 | 32.27 / 32.27 | 32.19 / 32.19 | 31.93 / 31.78 |
| L2-13B | ANPD | 43.30 / 44.44 | 47.54 / 45.22 | 43.24 / 42.28 | 36.20 / 35.84 | 37.44 / 34.84 |
| L2-13B | Lookahead | 57.49 / 58.94 | 47.44 / 47.62 | 55.76 / 55.58 | 44.41 / 44.15 | 48.11 / 46.62 |
| L2-13B | REST | 38.81 / 37.74 | 30.36 / 30.22 | 40.47 / 39.70 | 30.70 / 30.67 | 36.39 / 37.02 |
| L2-13B | SPECTRA (Ours) | 63.64 / 64.31 | 59.21 / 58.63 | 63.39 / 63.18 | 52.43 / 52.19 | 56.04 / 53.75 |
| L2-70B | Autoregressive | 2.60 / 2.60 | 2.61 / 2.61 | 2.61 / 2.61 | 2.63 / 2.63 | 2.60 / 2.60 |
| L2-70B | ANPD | 4.72 / 4.80 | 4.25 / 4.10 | 4.85 / 4.76 | 3.07 / 3.07 | 3.47 / 3.30 |
| L2-70B | Lookahead | 6.90 / 7.16 | 4.87 / 5.12 | 6.71 / 6.73 | 3.92 / 3.93 | 5.05 / 5.02 |
| L2-70B | SPECTRA (Ours) | 8.07 / 8.35 | 6.58 / 6.75 | 8.41 / 8.41 | 4.88 / 4.88 | 6.32 / 6.22 |
| L2-7B | Autoregressive | 40.33 / 40.32 | 41.01 / 41.03 | 41.14 / 41.13 | 41.00 / 41.04 | 40.48 / 40.50 |
| L2-7B | ANPD | 65.54 / 68.10 | 62.40 / 59.38 | 63.27 / 59.98 | 48.94 / 47.67 | 52.47 / 50.06 |
| L2-7B | Lookahead | 88.41 / 91.05 | 68.00 / 68.20 | 84.69 / 83.87 | 59.79 / 60.76 | 70.04 / 69.07 |
| L2-7B | REST | 54.74 / 53.93 | 41.43 / 41.38 | 57.99 / 56.41 | 41.28 / 40.74 | 50.58 / 51.79 |
| L2-7B | SPECTRA (Ours) | 96.88 / 98.75 | 86.51 / 85.50 | 98.77 / 98.38 | 72.39 / 73.22 | 81.93 / 79.20 |
| L3-70B | Autoregressive | 2.58 / 2.57 | 2.58 / 2.58 | 2.59 / 2.59 | 2.59 / 2.59 | 2.55 / 2.55 |
| L3-70B | ANPD | 3.97 / 4.19 | 3.86 / 3.72 | 4.72 / 4.75 | 3.77 / 3.59 | 3.14 / 3.03 |
| L3-70B | Lookahead | 6.17 / 6.47 | 3.99 / 3.96 | 6.63 / 6.75 | 3.70 / 3.66 | 4.49 / 4.53 |
| L3-70B | SPECTRA (Ours) | 6.87 / 7.18 | 5.43 / 5.34 | 7.33 / 7.50 | 5.01 / 4.88 | 5.25 / 5.16 |
| L3-8B | Autoregressive | 36.59 / 36.58 | 36.74 / 36.74 | 36.20 / 36.21 | 35.24 / 35.20 | 36.55 / 36.69 |
| L3-8B | ANPD | 77.21 / 78.76 | 141.89 / 141.36 | 66.31 / 65.57 | 118.47 / 112.95 | 41.77 / 40.20 |
| L3-8B | Lookahead | 94.92 / 90.09 | 136.32 / 135.92 | 89.99 / 90.47 | 133.67 / 133.12 | 56.09 / 55.49 |
| L3-8B | SPECTRA (Ours) | 103.61 / 105.88 | 142.89 / 142.72 | 92.86 / 93.16 | 143.80 / 142.72 | 61.69 / 60.22 |
| Sampling (temperature=1.0) | | | | | | |
| CL-13B | Autoregressive | 30.90 / 30.64 | 31.38 / 31.37 | 31.24 / 31.39 | 31.46 / 31.45 | 30.71 / 30.67 |
| CL-13B | ANPD | 35.48 / 34.86 | 33.54 / 32.34 | 32.64 / 34.36 | 31.57 / 30.95 | 70.92 / 65.68 |
| CL-13B | Lookahead | 42.54 / 40.74 | 33.79 / 32.49 | 40.25 / 42.17 | 32.02 / 31.19 | 71.50 / 68.46 |
| CL-13B | REST | 35.15 / 33.22 | 25.67 / 25.24 | 39.58 / 38.49 | 26.43 / 25.89 | 28.41 / 26.69 |
| CL-13B | SPECTRA (Ours) | 51.86 / 50.04 | 37.57 / 35.67 | 51.60 / 52.64 | 36.29 / 35.27 | 72.90 / 69.98 |
| CL-7B | Autoregressive | 39.60 / 39.58 | 40.85 / 40.87 | 40.05 / 40.10 | 40.81 / 40.81 | 40.49 / 40.50 |
| CL-7B | ANPD | 50.89 / 51.76 | 47.44 / 46.68 | 44.14 / 46.34 | 45.86 / 45.81 | 112.29 / 103.57 |
| CL-7B | Lookahead | 60.87 / 60.29 | 48.54 / 47.64 | 57.12 / 61.14 | 48.64 / 48.27 | 110.07 / 105.00 |
| CL-7B | REST | 48.64 / 46.41 | 35.98 / 35.46 | 53.35 / 52.26 | 37.04 / 36.57 | 39.36 / 36.51 |
| CL-7B | SPECTRA (Ours) | 71.70 / 71.78 | 55.24 / 52.81 | 67.27 / 69.20 | 54.48 / 52.91 | 112.43 / 108.49 |
| L2-13B | Autoregressive | 31.23 / 31.17 | 31.44 / 31.47 | 31.41 / 31.42 | 32.02 / 32.06 | 31.67 / 31.59 |
| L2-13B | ANPD | 37.53 / 37.94 | 39.11 / 37.99 | 36.79 / 36.75 | 32.97 / 32.71 | 36.91 / 34.34 |
| L2-13B | Lookahead | 47.59 / 47.35 | 41.60 / 41.76 | 46.33 / 46.51 | 37.82 / 37.82 | 47.35 / 45.48 |
| L2-13B | REST | 36.78 / 36.17 | 29.33 / 29.25 | 37.46 / 36.71 | 29.38 / 29.28 | 35.50 / 36.21 |
| L2-13B | SPECTRA (Ours) | 53.13 / 52.28 | 48.60 / 48.11 | 52.93 / 53.11 | 42.95 / 43.03 | 54.98 / 52.42 |
| L2-7B | Autoregressive | 39.89 / 39.88 | 40.58 / 40.59 | 40.09 / 40.10 | 40.59 / 40.66 | 40.65 / 40.70 |
| L2-7B | ANPD | 52.14 / 52.78 | 54.23 / 52.90 | 51.40 / 50.97 | 44.73 / 43.77 | 50.92 / 48.24 |
| L2-7B | Lookahead | 70.82 / 71.17 | 61.15 / 61.34 | 68.78 / 69.01 | 50.84 / 51.83 | 68.27 / 66.77 |
| L2-7B | REST | 50.35 / 49.99 | 40.19 / 40.09 | 50.86 / 50.06 | 38.94 / 38.18 | 49.12 / 50.54 |
| L2-7B | SPECTRA (Ours) | 78.46 / 78.74 | 72.13 / 71.68 | 81.71 / 81.76 | 59.77 / 60.09 | 80.21 / 77.00 |
| L3-8B | Autoregressive | 35.75 / 35.76 | 35.16 / 35.17 | 36.01 / 36.02 | 36.05 / 36.07 | 35.39 / 35.48 |
| L3-8B | ANPD | 44.71 / 43.72 | 69.12 / 66.73 | 51.48 / 51.57 | 68.03 / 64.54 | 40.84 / 39.23 |
| L3-8B | Lookahead | 53.05 / 50.57 | 72.68 / 69.11 | 64.59 / 63.79 | 71.88 / 68.90 | 55.46 / 53.74 |
| L3-8B | SPECTRA (Ours) | 69.50 / 68.92 | 79.88 / 76.53 | 69.09 / 68.62 | 78.99 / 76.69 | 60.33 / 57.69 |

Table 4: Micro throughput (Mic-TP) and Macro throughput (Mac-TP) across multiple tasks and models.

The throughput analysis in Table 4 and Table 5 further supports SPECTRA's performance. SPECTRA consistently achieves higher Macro Throughput (Mac-TP) and Micro Throughput (Mic-TP) than both non-speculative baselines and other training-free accelerators. This holds true across various model sizes, datasets, and even older GPU architectures (RTX 3090, RTX 8000), demonstrating SPECTRA's robustness to varying hardware capabilities. Higher Mac-TP indicates more efficient individual speculative steps, while higher Mic-TP confirms overall faster token generation.
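
The two throughput metrics are not defined in this excerpt; the sketch below assumes the conventional reading, with macro throughput averaging per-sample tokens-per-second and micro throughput dividing total generated tokens by total elapsed time.

```python
def macro_throughput(samples):
    """Mean of per-sample tokens/second (assumed reading of Mac-TP)."""
    return sum(tokens / seconds for tokens, seconds in samples) / len(samples)

def micro_throughput(samples):
    """Total tokens divided by total seconds over the whole run (assumed reading of Mic-TP)."""
    total_tokens = sum(tokens for tokens, _ in samples)
    total_seconds = sum(seconds for _, seconds in samples)
    return total_tokens / total_seconds

runs = [(512, 6.0), (384, 5.1), (1024, 11.8)]  # (tokens generated, seconds) per sample
print(f"Mac-TP = {macro_throughput(runs):.2f}, Mic-TP = {micro_throughput(runs):.2f}")
```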

6.5.2. Evaluating SPECTRA in Different GPU Types

The following are the results from Table 3 of the original paper:

| GPU | Method | GSM8K Speedup | GSM8K τ | MTBench Speedup | MTBench τ |
| --- | --- | --- | --- | --- | --- |
| A40 | Lookahead | 1.49 | 1.93 | 1.53 | 2.07 |
| A40 | SPECTRA | 1.92 | 2.46 | 1.84 | 2.36 |
| A6000 | Lookahead | 1.48 | 1.92 | 1.52 | 2.06 |
| A6000 | SPECTRA | 1.92 | 2.46 | 1.84 | 2.36 |
| RTX8000 | Lookahead | 1.33 | 1.93 | 1.34 | 2.08 |
| RTX8000 | SPECTRA | 1.70 | 2.46 | 1.58 | 2.35 |
| RTX3090 | Lookahead | 1.32 | 1.92 | 1.30 | 2.06 |
| RTX3090 | SPECTRA | 1.84 | 2.46 | 1.74 | 2.36 |

Table 3: Hardware scalability of SPECTRA decoding on GSM8K and MTBench for various GPU architectures.

Table 3 shows SPECTRA's performance across different GPU types (A40, A6000, RTX8000, RTX3090).

  • Hardware Robustness: The results indicate that SPECTRA's relative accelerations remain consistent across GPUs with varying memory throughput and compute capabilities.
  • Consistent Lead: SPECTRA consistently outperforms Lookahead (the next best baseline) in all cases.
  • Narrowing Gap on Older GPUs: On older GPUs (e.g., RTX 3090 or RTX 8000), the speedup gap between Lookahead and SPECTRA narrows slightly due to less efficient parallelism on these architectures, but SPECTRA still maintains its lead. This confirms SPECTRA's practicality across both data-center and consumer-grade GPUs.

6.5.3. Evaluating SPECTRA in Multi-GPU Environments

The following are the results from Table 6 of the original paper:

All metrics are reported on MTBench.

| GPU & Model Setting | Method | Mac-TP | Mic-TP | Speedup | τ |
| --- | --- | --- | --- | --- | --- |
| 1xH100 - Quantized Int8 | Autoregressive | 2.60 | 2.60 | 1.00 | 1.00 |
| 1xH100 - Quantized Int8 | SPECTRA | 6.32 | 6.22 | 2.43 | 2.51 |
| 2xH100 - FP16 | Autoregressive | 14.81 | 14.70 | 1.00 | 1.00 |
| 2xH100 - FP16 | SPECTRA | 29.62 | 28.91 | 2.00 | 2.52 |
| 4xH100 - FP16 | Autoregressive | 14.60 | 14.48 | 1.00 | 1.00 |
| 4xH100 - FP16 | SPECTRA | 29.67 | 28.89 | 2.03 | 2.52 |
| 8xH100 - FP16 | Autoregressive | 14.39 | 14.28 | 1.00 | 1.00 |
| 8xH100 - FP16 | SPECTRA | 29.27 | 28.55 | 2.03 | 2.52 |

Table 6: Results in multi-GPU Environments on GSM8K and MTBench using Llama-2-chat-70B.

Table 6 details SPECTRA's scalability in multi-GPU environments using LLaMA-2-70B.

  • Robust Scalability: SPECTRA achieves consistent speedups of 2.00x – 2.03x across all multi-GPU configurations (2xH100, 4xH100, 8xH100 with FP16 precision), and the compression ratio (τ) remains stable at 2.52. This indicates that partitioning model weights across multiple GPUs introduces minimal overhead, and the speculative verification process remains efficient despite inter-GPU communication.
  • Quantized Performance: Even in the quantized (Int8) single-GPU (1xH100) setting, SPECTRA provides a substantial 2.43x speedup, outperforming standard autoregressive decoding.
  • Practicality for Large-Scale Deployment: These results validate SPECTRA's practicality for large-scale deployments where memory constraints necessitate distributed inference.
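
The hardware settings in Table 6 correspond to standard ways of serving a 70B model with Hugging Face tooling. The sketch below shows how such FP16 multi-GPU and Int8 single-GPU configurations are typically loaded; the checkpoint name is a placeholder, the device placement is naive layer sharding via device_map="auto", and the paper's exact parallelism strategy is not specified in this excerpt.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-chat-hf"  # placeholder checkpoint name

# FP16, sharded across all visible GPUs (e.g., 2/4/8 x H100) via accelerate's device map.
fp16_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Int8 weight quantization for the single-GPU setting (requires the bitsandbytes package).
int8_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```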

6.5.4. Verifying Generation Quality with SPECTRA Decoding

The following are the results from Table 7 of the original paper:

| Dataset | Method | ROUGE-1 | ROUGE-2 | ROUGE-L | Speedup | τ |
| --- | --- | --- | --- | --- | --- | --- |
| CNN | Autoregressive | 9.77 | 0.39 | 7.20 | 1.00 | 1.00 |
| CNN | SPECTRA | 9.74 | 0.41 | 7.18 | 1.60 | 2.05 |
| XSUM | Autoregressive | 18.12 | 4.36 | 12.43 | 1.00 | 1.00 |
| XSUM | SPECTRA | 18.13 | 4.40 | 12.49 | 1.69 | 2.08 |

Table 7: Evaluation of SPECTRA Decoding on CNN/DailyMail and XSum using a temperature of 1.0. ROUGE scores, speedups over autoregressive decoding, and compression ratio (τ) are reported for LLaMA-2-7B-Chat.

  • Greedy Decoding Performance: For greedy decoding, SPECTRA produces outputs identical to Hugging Face's standard greedy search when using FP32 precision. When moving to FP16, both Hugging Face's native search and SPECTRA exhibit a comparably small number of discrepancies (25 vs. 26 out of 160 conversational turns). This confirms that SPECTRA maintains the output distribution within the numerical error margins typical of standard half-precision inference libraries.

  • Sampling Decoding Performance: Table 7 demonstrates that for sampling-based decoding (temperature=1.0) on summarization datasets (CNN/DailyMail and XSum), SPECTRA's ROUGE-1, ROUGE-2, and ROUGE-L scores are nearly identical to those of standard autoregressive sampling. Simultaneously, SPECTRA achieves notable speedups (1.60x on CNN/DailyMail and 1.69x on XSum) with compression ratios of 2.05 and 2.08 respectively.

    These findings strongly reaffirm that SPECTRA is a lossless acceleration method, preserving the original LLM's generation quality across both greedy and sampling-based methods, while providing substantial speed benefits.
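
A quality comparison like Table 7 can be reproduced with the rouge_score package; the reference and candidate strings below are placeholders, and in practice the scores would be averaged over the CNN/DailyMail or XSum test outputs rather than computed on a single pair.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The city council approved the new transit budget on Tuesday."
candidate = "The council approved a new transit budget Tuesday."  # hypothetical model output

scores = scorer.score(reference, candidate)
for name, score in scores.items():
    print(name, round(score.fmeasure * 100, 2))
```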

7. Conclusion & Reflections

7.1. Conclusion Summary

In this work, the authors introduced SPECTRA, a novel and highly effective training-free framework designed to accelerate Large Language Model (LLM) inference. SPECTRA's core innovation lies in its dual approach: the SPECTRA-CORE module leverages the LLM's internal knowledge through multi-level N-gram storage and a bi-directional search to generate high-quality, dynamic-length guesses. Complementing this is the SPECTRA-RETRIEVAL module, which refines external knowledge sources by filtering corpora based on perplexity scores, ensuring that only the most relevant and high-quality external cues are used for speculation. This optimized integration of internal and external speculation within a single forward pass paradigm allows SPECTRA to achieve significant speedups of up to 4.08x across diverse tasks and LLM architectures (Llama 2, Llama 3, CodeLlama), from 7B to 70B parameters. Crucially, SPECTRA is a plug-and-play solution that requires no modifications or additional training of the original LLM, making it a lossless and practical method for accelerating LLM inference.
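
As a rough illustration of the perplexity-based refinement summarized above, the sketch below scores candidate corpus snippets with a language model and keeps only those under a threshold. The scoring model, threshold, and corpus are assumptions made for illustration; this is not the paper's actual datastore-construction pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small placeholder scoring model; the choice of model for perplexity scoring is an assumption.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

corpus = [
    "def add(a, b):\n    return a + b",  # candidate snippet (code)
    "asdf qwer zxcv 1234 !!",            # candidate snippet (noisy text)
]
threshold = 50.0  # hypothetical cutoff
datastore = [snippet for snippet in corpus if perplexity(snippet) < threshold]
print(datastore)
```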

7.2. Limitations & Future Work

The authors acknowledge two primary limitations:

  1. Cost of Building External Datastores: While SPECTRA-CORE is self-contained, SPECTRA-RETRIEVAL relies on constructing and indexing a sizable external datastore. This process can be time-consuming and memory-intensive, especially for frequently updated data or memory-constrained environments. Although this investment yields significant speedups, it might not always be universally feasible or cost-effective.

  2. Limited Evaluation Scope: The experiments were primarily conducted on English-language benchmarks in conversational and coding tasks using LLaMA-based models. The generalizability of SPECTRA to other languages (e.g., low-resource languages), specialized technical documents, or a wider range of model families (beyond LLaMA-based architectures) remains to be fully assessed. These factors could affect the acceptance rate and overall speedup.

    Future work is needed to:

  • Assess SPECTRA's generality across diverse linguistic settings and model families.
  • Potentially optimize the cost and efficiency of building and maintaining external datastores for SPECTRA-RETRIEVAL.

7.3. Personal Insights & Critique

SPECTRA presents a significant step forward in LLM inference optimization, particularly for scenarios where training-based acceleration methods are impractical or undesirable. The training-free, plug-and-play nature is a major practical advantage, making it accessible to a broader range of users and deployment environments.

One key insight is the importance of carefully balancing the quality and quantity of speculative guesses. SPECTRA's bi-directional search and perplexity-based refinement exemplify this, ensuring that the LLM's single, expensive forward pass is maximally utilized for high-probability tokens. The observation that internal guesses from SPECTRA-CORE have higher acceptance rates than external guesses from SPECTRA-RETRIEVAL is insightful, guiding the prioritization strategy. This suggests that while external knowledge can augment, the LLM's inherent statistical patterns are often the most reliable source for immediate next-token predictions.

The ablative studies are particularly strong, clearly demonstrating the contribution of each module and component. The finding that a naive combination of Lookahead and REST performs poorly highlights the engineering effort required for effective integration, which SPECTRA successfully addresses.

Potential issues or areas for improvement/further research:

  • Datastore Maintenance: The cost of building and maintaining external datastores for SPECTRA-RETRIEVAL is a valid limitation. For rapidly evolving domains or highly personalized LLMs, continuously updating this datastore and recomputing perplexity could become a bottleneck. Research into more dynamic, online, or adaptive datastore construction methods could enhance SPECTRA-RETRIEVAL's practicality.

  • N-gram Size and Candidate Pool Sensitivity: While the paper mentions N=5 and W=15 as defaults, a deeper analysis of the sensitivity to these hyperparameters could provide more guidance for different LLMs or tasks. How do these values impact memory footprint versus speedup?

  • Beyond LLaMA-based Architectures: Expanding evaluation to other Transformer variants (e.g., Mistral, Falcon, GPT-series if accessible) could further solidify SPECTRA's claim of generalizability. Different architectures might have varying internal mechanisms or biases that could interact differently with SPECTRA's speculation strategies.

  • Cross-Lingual Performance: The limited evaluation scope regarding languages is a notable point. The effectiveness of N-gram dictionaries and perplexity as a filtering mechanism might vary significantly across languages, especially for morphologically rich languages or those with different script systems.

  • Scalability to Extremely Long Contexts: While FlashAttention helps with attention overhead, the overall efficiency of speculative decoding can still face challenges with extremely long contexts. How does SPECTRA perform when the context window is very large, and N-gram predictions might become less reliable?

    Overall, SPECTRA offers a robust and well-engineered solution that directly tackles a critical bottleneck in LLM deployment, providing a practical lossless speedup without demanding model-specific retraining. Its insights into optimizing both internal and external speculation are valuable contributions to the field of efficient LLM inference.
