SPECTRA: Faster Large Language Model Inference with Optimized Internal and External Speculation
TL;DR Summary
SPECTRA is a novel framework that accelerates large language model inference through optimized internal and external speculation, requiring no additional training. It achieves up to a 4.08x speedup over state-of-the-art methods across various benchmarks, and its implementation is publicly available.
Abstract
Inference with modern Large Language Models (LLMs) is both computationally expensive and time-consuming. Speculative decoding has emerged as a promising solution, but existing approaches face key limitations: training-based methods require a draft model that is challenging to obtain and lacks generalizability, while training-free methods offer limited speedup gains. In this work, we present SPECTRA, a novel framework for accelerating LLM inference without the need for additional training or modification to the original LLM. SPECTRA introduces two new techniques for efficiently utilizing internal and external speculation, each outperforming corresponding state-of-the-art (SOTA) methods independently. When combined, these techniques achieve up to a 4.08x speedup across various benchmarks and LLM architectures, significantly surpassing existing training-free approaches. The implementation of SPECTRA is publicly available.
In-depth Reading
1. Bibliographic Information
1.1. Title
SPECTRA: Faster Large Language Model Inference with Optimized Internal and External Speculation
1.2. Authors
Nguyen-Khang Le, Dinh-Truong Do, Nguyen Le Minh, all affiliated with Japan Advanced Institute of Science and Technology.
1.3. Journal/Conference
The paper lists a 2025 publication date, suggesting it is either accepted at a prominent machine learning or natural language processing conference in 2025 or is an arXiv preprint anticipating such a venue. Given the quality of the work and the topic's relevance, it targets top-tier venues in AI/ML.
1.4. Publication Year
2025
1.5. Abstract
The paper addresses the significant computational expense and time consumption of Large Language Model (LLM) inference. While speculative decoding offers a solution, existing methods have limitations: training-based approaches require difficult-to-obtain and non-generalizable draft models, and training-free methods yield limited speedups. SPECTRA is introduced as a novel framework to accelerate LLM inference without training or modifying the original LLM. It features two new techniques for internal and external speculation, each independently outperforming current state-of-the-art (SOTA) methods. When combined, SPECTRA achieves up to a 4.08x speedup across various benchmarks and LLM architectures, significantly surpassing existing training-free approaches. The implementation is publicly available.
1.6. Original Source Link
/files/papers/695011299c764da3f20e36ef/paper.pdf (This link indicates it's a direct PDF hosted by the system, likely associated with a preprint server or conference proceedings.)
2. Executive Summary
2.1. Background & Motivation
The core problem SPECTRA aims to solve is the inherent inefficiency of Large Language Model (LLM) inference. Modern LLMs generate text autoregressively, meaning one token at a time. This sequential process leads to generation time scaling linearly with the output sequence length and underutilizes the parallel processing capabilities of modern GPUs. This computational burden makes deploying and running LLMs costly and slow, hindering their widespread application, especially for generating long sequences with low latency.
Prior research has explored speculative decoding as a promising solution. This guess-and-verify paradigm attempts to predict multiple tokens in advance and verify them in parallel. However, existing speculative decoding methods fall into two main categories, both with significant drawbacks:
- Training-based methods: These approaches often require a smaller, specialized draft model, or modifications to the original LLM itself through specialized training. This demands substantial computational resources, may degrade the original model's capabilities, and lacks generalizability across different LLMs or tasks.
- Training-free methods: These approaches avoid additional training by leveraging existing LLM predictions or external information. However, they typically offer only limited speedup gains due to the quality of their speculative guesses or inefficiencies in their integration.

The paper's entry point is to overcome these limitations by developing a novel training-free speculative decoding framework that does not require modification to the original LLM, yet achieves significantly higher speedups than existing training-free methods. The innovative idea is to optimize both internal speculation (leveraging the LLM's own predicted token distribution) and external speculation (using refined external knowledge) and to integrate the two efficiently.
2.2. Main Contributions / Findings
The primary contributions of the SPECTRA paper are:
- A Novel, Training-Free Framework: Introduction of SPECTRA, a plug-and-play speculative decoding method that accelerates LLM inference without requiring any additional training or modifications to the original LLM. This addresses the practical challenges of training-based methods.
- Optimized Internal Speculation (SPECTRA-CORE): Development of SPECTRA-CORE, a module that leverages the LLM's predicted token distribution to generate high-quality guesses. It employs multi-level N-gram dictionaries for bi-directional search to produce dynamic-length guesses, balancing quality and quantity. It also maintains a candidate pool to continuously update these dictionaries for broad token coverage, with all updates and verification performed efficiently in a single forward pass.
- Refined External Speculation (SPECTRA-RETRIEVAL): Introduction of SPECTRA-RETRIEVAL, a module that enhances speedup by deriving guesses from a refined external information source. Unlike prior retrieval-based methods, SPECTRA-RETRIEVAL reduces the search space by selecting only high-quality content from a corpus based on perplexity scores computed by the target LLM. This optimization enables seamless and efficient integration with SPECTRA-CORE.
- Significant Speedup: When combined, SPECTRA's techniques achieve substantial acceleration, demonstrating up to a 4.08x speedup across a variety of benchmarks and LLM architectures (Llama 2, Llama 3, and CodeLlama, ranging from 7B to 70B parameters). This significantly surpasses existing training-free approaches.
- Robustness and Generalizability: Extensive experiments confirm SPECTRA's efficiency across diverse tasks (multi-turn conversation, code generation, mathematical reasoning), LLM architectures, GPU types, and sampling settings, while preserving the original model's output quality (lossless acceleration).
- Public Availability: The implementation of SPECTRA is publicly released, facilitating further research and adoption.

The key findings show that SPECTRA effectively overcomes the limitations of previous speculative decoding methods by offering a practical, high-impact solution for accelerating LLM inference that is both lossless (maintains output quality) and highly efficient.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand SPECTRA, a basic understanding of Large Language Models (LLMs) and their decoding mechanisms, along with concepts in speculative decoding and language modeling, is essential.
- Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on the Transformer architecture, trained on vast amounts of text data. They are capable of understanding, generating, and processing human language. Examples include GPT models, Llama, and CodeLlama. Their power comes from their ability to learn complex patterns and relationships in language.
- Autoregressive Decoding: This is the standard method by which LLMs generate text. It operates sequentially:
  - The LLM processes an input (prompt) and predicts the most probable next token.
  - This predicted token is appended to the input.
  - The LLM then takes the extended sequence as input to predict the next token. This process repeats one token at a time until a stop condition (e.g., end-of-sequence token, maximum length) is met.
  - Inefficiency: The main drawback is that each token generation requires a full forward pass through the LLM, making the process slow, especially for long sequences, as generation time scales linearly with the output length. It also underutilizes the parallel processing capabilities of modern GPUs.
- Speculative Decoding (Guess-and-Verify Paradigm): This technique aims to speed up autoregressive decoding by generating multiple tokens in parallel. Instead of generating one token at a time, it "guesses" several future tokens and then "verifies" them simultaneously.
  - Draft Model: Often, a smaller, faster model (the draft model) is used to quickly generate a sequence of speculative tokens (a "draft").
  - Verification: The original, larger LLM then takes this draft along with the preceding context and verifies all the drafted tokens in a single, parallel forward pass.
  - Acceptance: The LLM checks if the drafted tokens are consistent with its own predictions. A contiguous prefix of correctly guessed tokens is accepted. If a token is incorrect, the sequence up to the last correct token is accepted, and the LLM then autoregressively generates the next token from that point.
  - Efficiency: If many drafted tokens are accepted, the method effectively generates multiple tokens in the time it would normally take to generate one, leading to speedups.
- N-grams: In computational linguistics, an N-gram is a contiguous sequence of items (words, tokens, characters) from a given sample of text or speech. For example, in the sentence "The quick brown fox," "the quick" is a 2-gram (or bigram), and "quick brown fox" is a 3-gram (or trigram). N-gram models are simple probabilistic language models that predict the next item in a sequence based on the preceding N-1 items. In SPECTRA, N-grams are used to store and retrieve sequences of tokens for speculation.
Perplexity (PPL): A measure of how well a probability distribution or probability model predicts a sample. In language modeling,
perplexityquantifies how "surprised" a model is by a sequence of tokens. A lowerperplexityscore indicates that the model assigns a higher probability to the sequence, meaning the sequence is more "expected" or "natural" according to the model's learned patterns.- Formula: Given a text sequence , the
perplexityis calculated as: $ \mathrm{PPL}(u) = \exp \left{ - \frac{1}{t} \sum_{i=1}^t \log P_M (u_i \mid u_{<i}) \right} $ Where:- is the sequence of tokens.
- is the length of the sequence (number of tokens after the first).
- is the probability of token given the preceding tokens (i.e., ) as predicted by the model .
- is the natural logarithm.
The term
- \frac{1}{t} \sum_{i=1}^t \log P_M (u_i \mid u_{<i})represents the average negative log-likelihood, which is equivalent to thecross-entropyof the model on the sequence. A lowerPPLvalue means the model is more confident and accurate in predicting the sequence.
- Formula: Given a text sequence , the
- Attention Mask: In Transformer models, the attention mechanism allows the model to weigh the importance of different tokens in the input sequence when processing each token. An attention mask is a binary (or sometimes continuous) matrix that controls which tokens can "attend" to which other tokens. For autoregressive decoding, the attention mask is typically causal, meaning a token can only attend to itself and preceding tokens, preventing it from "seeing" future tokens. In speculative decoding, specially designed attention masks can enable parallel processing of speculative tokens while still respecting causal dependencies within the draft.
- FlashAttention: An optimized implementation of the attention mechanism that significantly reduces memory usage and improves computational speed, especially for long sequences. It achieves this by reordering computations and utilizing GPU memory hierarchies more efficiently, reducing the number of reads/writes to high-bandwidth memory.
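To make the perplexity formula above concrete, here is a minimal scoring sketch using the Hugging Face transformers API; the checkpoint name is illustrative, and any causal LM checkpoint works the same way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, model, tokenizer) -> float:
    """exp(average negative log-likelihood) of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean next-token
        # cross-entropy, i.e. the exponent in the PPL formula above.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Illustrative usage with an assumed checkpoint:
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
print(perplexity("The quick brown fox jumps over the lazy dog.", lm, tok))
```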
3.2. Previous Works
The paper contextualizes SPECTRA by discussing two main categories of speculative decoding and other related LLM acceleration techniques.
- LLM Inference Acceleration Techniques (General):
  - Quantization: Reducing the precision of model weights (e.g., from FP16 to INT8) to decrease memory footprint and accelerate computation (Liu et al., 2024b; Lin et al., 2024; Zhao et al., 2024; Park et al., 2024).
  - Pruning: Removing redundant parameters from the model while trying to maintain performance (Ma et al., 2023; Xia et al., 2023; Sun et al., 2023a; Le et al., 2025).
  - Knowledge Distillation: Training a smaller "student" model to mimic the behavior of a larger "teacher" model (Gu et al., 2024; Friha et al., 2024; Zhang et al., 2024b).
  - Limitations: While these methods reduce computational load, they often introduce some degradation in model performance, requiring a trade-off between quality and efficiency.
- Speculative Decoding (Training-Based Methods): These methods involve training or fine-tuning models to generate speculative tokens.
  - Draft Models: Using a smaller LLM (draft model) to predict several future tokens, which are then verified by the main LLM (Chen et al., 2023; Leviathan et al., 2023; Miao et al., 2024; Sun et al., 2023b; Zhou et al., 2024; Cai et al., 2024).
  - Self-Speculative Decoding: Training the original LLM itself in a specialized manner to generate its own drafts (Elhoushi et al., 2024; Liu et al., 2024a; Yang et al., 2024; Zhang et al., 2024a; Li et al., 2024b).
  - Limitations: These approaches require significant computational resources for training, may compromise the original model's capabilities, and often lack generalizability to new models or tasks without further training.
- Speculative Decoding (Training-Free Methods): These methods aim to avoid additional training by leveraging existing information or structural properties.
  - Retrieval-Based Methods: Using an external datastore (e.g., indexed with observed prefixes in a Trie structure) to retrieve guess sequences. REST (He et al., 2024) and Nearest Neighbor Speculative Decoding (Li et al., 2024a) are examples.
    - Limitation: These methods can struggle with efficiency if the search time in the datastore outweighs the speedup gains, especially when trying to integrate with other speculative techniques. REST (He et al., 2024) is a direct baseline for SPECTRA-RETRIEVAL.
  - Internal Speculation Methods: Methods that generate speculative tokens directly from the LLM's predictions or exploit structural properties. Lookahead Decoding (Fu et al., 2024) and Adaptive N-gram Parallel Decoding (ANPD) (Ou et al., 2024) are examples. They mitigate left-to-right dependencies by generating and validating multiple candidate tokens in parallel.
    - Limitation: They often provide limited speedup due to the quality and quantity of the speculative guesses. Lookahead (Fu et al., 2024) and ANPD (Ou et al., 2024) are direct baselines for SPECTRA-CORE.
3.3. Technological Evolution
The evolution of LLM inference acceleration has progressed from basic autoregressive decoding to sophisticated speculative decoding techniques. Initially, efforts focused on reducing the computational cost per token through methods like quantization and pruning, often with some quality trade-offs. The advent of speculative decoding marked a shift towards accelerating the number of tokens generated per computation step, rather than just making each step cheaper.
Early speculative decoding relied on draft models, introducing the overhead of training and maintaining an auxiliary model. The next wave sought to eliminate this training burden, leading to training-free approaches. These first training-free methods either leveraged the LLM's internal predictions (Lookahead, ANPD) or external knowledge (REST). However, these initial training-free methods faced limitations in speedup gains or integration complexity.
SPECTRA fits into this timeline as a significant advancement in training-free speculative decoding. It represents an evolution from prior training-free methods by introducing more sophisticated internal knowledge utilization and a novel, perplexity-based refinement for external knowledge, enabling seamless and highly effective integration of both.
3.4. Differentiation Analysis
Compared to the main methods in related work, SPECTRA offers several core differences and innovations:
- Vs. Training-Based Speculative Decoding:
  - Core Difference: SPECTRA is entirely training-free and requires no modifications to the original LLM. This contrasts sharply with methods that need a draft model or specialized self-speculative training, which incur substantial computational costs, may degrade original model capabilities, and lack generalizability.
  - Innovation: SPECTRA provides a plug-and-play solution, making it highly practical for deploying off-the-shelf LLMs without additional resource investment.
- Vs. Training-Free Internal Speculation (e.g., Lookahead, ANPD):
  - Core Difference: SPECTRA-CORE uses multi-level N-gram dictionaries and a bi-directional search strategy for dynamic-length guesses, along with an optimized candidate pool to continuously update the dictionaries. Lookahead often enforces uniform guess lengths and lacks as sophisticated a mechanism for generating high-quality guesses or updating its knowledge base.
  - Innovation: SPECTRA-CORE's approach allows for a better balance of guess quality and quantity, leading to higher acceptance rates and thus greater speedups. The randomness-based mechanism for selecting unseen tokens also increases the coverage of its internal knowledge.
- Vs. Training-Free External Speculation (e.g., REST):
  - Core Difference: SPECTRA-RETRIEVAL introduces a perplexity-based corpus refinement step, whereas REST relies on a datastore without explicit quality filtering. This refinement is crucial for selecting only high-quality, relevant texts from the corpus.
  - Innovation: The perplexity-based filtering significantly improves the quality of external guesses. More importantly, it enables SPECTRA-RETRIEVAL to be seamlessly and efficiently integrated with SPECTRA-CORE, overcoming the issue where search time in external structures can outweigh speedup gains in naive integrations. The ability to integrate effectively is a key differentiator.
- Combined Approach:
  - Core Difference: SPECTRA is unique in its optimized combination of both internal and external speculation within a single training-free framework. Previous training-free methods generally focused on one or the other, or struggled with efficient integration.
  - Innovation: The synergy between SPECTRA-CORE's high-quality internal guesses and SPECTRA-RETRIEVAL's refined external guesses, managed by a priority system, allows SPECTRA to achieve significantly higher speedups (up to 4.08x) compared to existing training-free approaches that often yield more limited gains.
4. Methodology
SPECTRA is a novel training-free framework for accelerating LLM inference, designed to be plug-and-play without requiring modifications to the original LLM. It comprises two main modules: SPECTRA-CORE for internal speculation and SPECTRA-RETRIEVAL for external speculation. These modules can operate independently or be combined for maximum acceleration.
The overarching idea is to leverage speculative decoding by generating multiple candidate tokens (guesses) and verifying them in parallel. SPECTRA enhances this by optimizing how these guesses are generated, both from the LLM's internal knowledge and from external sources.
4.1. Principles
The core principle behind SPECTRA is to maximize the acceptance rate of speculative tokens while minimizing the overhead of generating these guesses. It achieves this by:
- Leveraging LLM's Predicted Distribution: SPECTRA-CORE intelligently uses the LLM's own output probabilities to build N-gram dictionaries that reflect likely token sequences.
- Dynamic Guess Generation: Instead of fixed-length guesses, SPECTRA supports variable-length guesses through bi-directional search in its N-gram dictionaries, enhancing flexibility and efficiency.
- Continuous Knowledge Augmentation: The candidate pool and the N-gram dictionary update mechanisms ensure that SPECTRA-CORE's internal knowledge base is continuously enriched and adapted to the ongoing generation context.
- Refined External Knowledge Integration: SPECTRA-RETRIEVAL addresses the limitations of naive external knowledge usage by pre-filtering external corpora based on perplexity, ensuring that only high-quality, model-aligned guesses are considered, which then integrate seamlessly with internal guesses.
- Single-Pass Efficiency: All distributions needed for predicting new candidates and verifying guesses are obtained in a single forward pass through the LLM, utilizing parallel processing and a specially designed attention mask.
4.2. Core Methodology In-depth
4.2.1. SPECTRA-CORE
SPECTRA-CORE is the internal speculation module. It maintains two primary data structures:
- N-gram Storage: This consists of two dictionaries:
  - S_fwd (Forward Dictionary): Maps a single token to a list of subsequent sequences. Used to prioritize the quantity of guesses.
  - S_bwd (Backward Dictionary): Maps a sequence (prefix) to a single subsequent token. Used to prioritize the quality of guesses.
- Candidate Pool (C): A collection of W sequences, each of length N-1, where c_{w,j} denotes the j-th token of the w-th sequence. This pool is used to augment new N-grams into the storage dictionaries.

The SPECTRA-CORE decoding process, detailed in Algorithm 1 (lines 5-44), involves a continuous loop of obtaining guesses, forwarding through the LLM, verification, predicting new candidates, and updating the N-gram dictionaries.
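A minimal Python sketch of these two data structures; the exact key/value encoding is our assumption, as the paper does not prescribe it at this level of detail:

```python
from collections import defaultdict

# Forward dictionary: single token -> list of continuation sequences
# (prioritizes the *quantity* of guesses). E.g. S_fwd[42] == [(7, 13), (7, 21, 99)]
# means token 42 has previously been followed by these token sequences.
S_fwd: dict[int, list[tuple[int, ...]]] = defaultdict(list)

# Backward dictionary: prefix tuple -> single next token
# (prioritizes the *quality* of guesses). E.g. S_bwd[(42, 7)] == 13.
S_bwd: dict[tuple[int, ...], int] = {}
```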
Bi-directional Search for Guesses
At each time step t, SPECTRA generates a set of guess sequences Y. This process is dynamic, allowing for variable-length guesses.
- Forward Direction (Algorithm 1, line 7): The last generated token, x_t, is used as a key to search S_fwd and retrieve existing guess sequences. This step primarily focuses on obtaining a quantity of guesses quickly.
- Backward Direction (Algorithm 1, lines 8-14): A high-quality guess u is constructed iteratively. Starting from the current context, S_bwd is used to predict one token at a time, extending the sequence up to a desired length N-1. This focuses on the quality of the guess by building it token-by-token from observed high-probability sequences. The set of guesses Y is populated by combining results from both searches.
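A hedged sketch of the bi-directional search over these dictionaries; the back-off order in the backward search (longest prefix first) is our reading of Algorithm 1, lines 9-14:

```python
def obtain_guesses(last_token, context, S_fwd, S_bwd, max_len):
    """Combine forward (quantity) and backward (quality) searches."""
    # Forward direction: retrieve all stored continuations of the last token.
    guesses = [list(seq) for seq in S_fwd.get(last_token, [])]

    # Backward direction: grow one high-quality guess token by token,
    # backing off to shorter prefixes whenever a longer one is unseen.
    u = []
    for _ in range(max_len):
        next_tok = None
        tail = list(context) + u
        for k in range(min(len(tail), max_len), 0, -1):
            prefix = tuple(tail[-k:])
            if prefix in S_bwd:
                next_tok = S_bwd[prefix]
                break
        if next_tok is None:
            break
        u.append(next_tok)
    if u:
        guesses.append(u)
    return guesses
```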
Predict & Verify in One Forward Pass
A crucial efficiency aspect of SPECTRA is that all distributions required for both predicting candidates for the candidate pool and verifying the guesses are obtained in a single forward pass to the LLM. This leverages the parallel processing capabilities of GPUs.
The following figure (Figure 2 from the original paper) shows how SPECTRA handles internal knowledge, including the single forward pass:

Figure 2: Detail of how SPECTRA handles internal knowledge. The dashed arrow indicates interactions between the tokens, which are realized by the attention mask in the LLM.
This is achieved using a specially designed attention mask: a token in a candidate sequence or a guess attends only to its causal predecessors within its own branch and to the committed input tokens, never to tokens from other branches. This preserves correctness for autoregressive models while allowing parallel computation across different guesses and candidates.
Verification
The verification step ensures that the output distribution remains consistent with the original LLM (lossless acceleration).
- For greedy sampling (which selects the token with the highest probability), verification is detailed in Algorithm 3. It compares the LLM's predicted argmax tokens with the drafted tokens. The longest contiguous prefix of accepted tokens across all guesses is identified and appended to the output.
- For sampling-based methods (e.g., top-k, top-p sampling), SPECTRA adopts the sampling verification approach from prior work (Miao et al., 2024; Fu et al., 2024), ensuring correctness for stochastic generation. This is detailed in Algorithm 4.
Predict Tokens for Candidate Pool
After verification, new tokens are predicted to update the candidate pool C. For each of the W sequences in C, the next token is predicted using the LLM's distribution obtained in the single forward pass (Algorithm 1, lines 26-34).
- Randomness-Based Coverage: To address the issue of N-gram dictionaries potentially lacking coverage for certain tokens, SPECTRA introduces a probabilistic mechanism governed by the refine threshold gamma.
  - Let r be a random draw from a Uniform distribution on [0, 1].
  - If r < gamma, SPECTRA attempts to select the highest-probability token that is not yet present in S_fwd. This encourages exploration and expands dictionary coverage.
  - Otherwise, it selects the token with the highest probability regardless of its presence in S_fwd.
- Candidate Pool Shift: At the end of each time step, all candidate sequences in C are shifted one token to the left, leaving the last position empty for the new prediction in the next step (Algorithm 1, line 43). This creates a sliding window over the candidate sequences. (A sketch of this prediction step follows below.)
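The randomness-based coverage rule and the pool shift can be sketched as follows; the distribution encoding (one dict of token -> probability per pool sequence) is our assumption:

```python
import random

def predict_candidates(pool, next_token_probs, S_fwd, gamma=0.1):
    """Candidate-pool update with randomness-based coverage (sketch).

    `pool` is a list of W token lists; `next_token_probs[w]` is the LLM's
    next-token distribution for pool sequence w, taken from the single
    shared forward pass.
    """
    for w, probs in enumerate(next_token_probs):
        ranked = sorted(probs, key=probs.get, reverse=True)
        if random.random() < gamma:
            # Prefer the most probable token not yet covered by S_fwd.
            tok = next((t for t in ranked if t not in S_fwd), ranked[0])
        else:
            tok = ranked[0]  # plain argmax
        # Shift left and append the new prediction (sliding window).
        pool[w] = pool[w][1:] + [tok]
    return pool
```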
Update N-gram Dictionaries
The N-gram dictionaries (S_fwd and S_bwd) are updated using the tokens from the candidate pool at the end of each time step (Algorithm 1, lines 35-41).
- Subsequence Inclusion: Unlike previous methods that add only full N-grams, SPECTRA observes that subsequences within N-grams are also valuable and adds these subsequences to both dictionaries.
- S_fwd Update: Subsequences are added to S_fwd using their first token as the key; for an N-gram (u_1, ..., u_N), continuations such as (u_2, ..., u_N) and its shorter variants are appended under key u_1.
- S_bwd Update: For an N-gram (u_1, ..., u_N), S_bwd maps each preceding prefix of the sequence to its following token (e.g., (u_1, ..., u_{N-1}) maps to u_N).
4.2.2. SPECTRA-RETRIEVAL
SPECTRA-RETRIEVAL enhances speedup by leveraging external knowledge. It addresses the common pitfall of retrieval-based methods where integrating guesses from unrefined external sources can lead to search time outweighing speedup gains.
Corpus Refinement by Perplexity
The key innovation in SPECTRA-RETRIEVAL is to select high-quality, relevant texts from a large external corpus using perplexity (PPL) scores.
- Process: Given an external text corpus, each sequence in the corpus is scored with the target LLM to compute its perplexity.
- Formula:
  $$ \mathrm{PPL}(u) = \exp\left\{ -\frac{1}{t} \sum_{i=1}^{t} \log P_M(u_i \mid u_{<i}) \right\} $$
  Where:
  - $u = (u_1, \ldots, u_t)$ is a sequence of tokens from the external corpus and $t$ is its length.
  - $P_M(u_i \mid u_{<i})$ is the probability of token $u_i$ given the preceding tokens, as predicted by the target LLM $M$.
- Selection: Sequences with the lowest perplexity are deemed well-aligned with the LLM's predictions and are thus likely to produce high-quality guesses. These selected sequences form a smaller, refined subset of the corpus.
- Indexing: This refined corpus is then indexed into a structure that supports fast prefix search, such as a Trie. (A refinement sketch follows below.)
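A minimal refinement sketch; `ppl_fn` can be the perplexity helper sketched in Section 3.1, and the keep count of 100k matches the experimental setup described later:

```python
def refine_corpus(corpus, ppl_fn, keep=100_000):
    """Perplexity-based corpus refinement (sketch): keep only the
    sequences the target LLM finds most predictable."""
    scored = [(ppl_fn(text), text) for text in corpus]
    scored.sort(key=lambda pair: pair[0])          # lowest PPL first
    return [text for _, text in scored[:keep]]     # best-aligned texts

# The refined texts would then be tokenized and inserted into a Trie
# keyed by token prefixes, enabling fast prefix search at decode time.
```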
Integration with SPECTRA-CORE
SPECTRA-RETRIEVAL is designed for seamless integration with SPECTRA-CORE.
- Guess Generation: During the guess-generation step of SPECTRA-CORE (Algorithm 1, line 16), guesses obtained from SPECTRA-RETRIEVAL (G_retrieve) are appended to the guesses generated by SPECTRA-CORE.
- Limited Budget: Recognizing that too many candidate tokens can strain GPU resources, SPECTRA limits the total number of guesses to G. By refining the corpus with perplexity, SPECTRA-RETRIEVAL ensures that its contribution to the guess budget is of high quality, preventing the search time in the external Trie from negating the speedup.
- Priority: As empirically observed, internal guesses from SPECTRA-CORE tend to have higher acceptance rates. Therefore, SPECTRA gives priority to these internal guesses, filling any remaining guess budget with external guesses from SPECTRA-RETRIEVAL (see the sketch after this list).
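The priority rule amounts to a small budget-filling step, sketched here with hypothetical helper names:

```python
def combine_guesses(internal, external, max_guesses):
    """Priority-based guess budget (sketch): internal guesses first
    (higher acceptance rates), remaining slots filled from retrieval."""
    combined = internal[:max_guesses]
    combined += external[:max_guesses - len(combined)]
    return combined
```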
4.2.3. Overall SPECTRA Decoding Algorithm
The entire SPECTRA decoding process is outlined in Algorithm 1.
Require: sequence x_{1:n}, model P_M, max N-gram size N, candidate pool size W, max guesses G, max number of new tokens T, refine threshold gamma
1: Initialize N-gram forward dictionary S_fwd <- {}
2: Initialize N-gram backward dictionary S_bwd <- {}
3: Initialize candidate pool C = {c_{w,j}} of W sequences, each of length N-1
4: t <- n
5: while t < n + T do
6:     {Obtain the guesses}
7:     Y <- S_fwd[x_t]
8:     u <- ()
9:     for i <- 1 to N-1 do
10:        for j <- i down to 1 do
11:            u_i <- S_bwd[(x_{t-j+1:t}, u_{1:i-1})]
12:            break if a value for u_i was found
13:        end for
14:    end for
15:    Y.append(u)
16:    Y <- Y ∪ G_retrieve    {Retrieval integration (optional)}
17:    Truncate Y so that |Y| <= G    {ensure the max guesses is G}
18:    {Forward in LLM}
19:    Obtain all necessary distributions of P_M in parallel (one forward pass)
20:    {Verification}
21:    {Greedy verify (Alg. 3) or sampling verify (Alg. 4)}
22:    hits <- VerificationFunction(x_{1:t}, P_M, Y)
23:    x <- x || hits
24:    t <- t + |hits|
25:    {Predict candidates}
26:    for w <- 1 to W do
27:        {predict the next token of candidate sequence w}
28:        r ~ Uniform(0, 1)
29:        P <- P_M(· | x_{1:t}, c_{w,1:N-2})
30:        if r < gamma then
31:            c_{w,N-1} <- argmax_v P(v) over tokens v not in S_fwd
32:        else
33:            c_{w,N-1} <- argmax_v P(v)
34:    end for
35:    {Update N-gram dictionaries}
36:    for w <- 1 to W do
37:        g <- (x_t, c_{w,1:N-1})    {the full N-gram for sequence w}
38:        for j <- 0 to N-2 do
39:            S_fwd[g_1].append(g_{2:N-j})
40:            S_bwd[(g_1, ..., g_{N-1-j})] <- g_{N-j}
41:        end for
42:    end for
43:    c_{w,j} <- c_{w,j+1} for all w, j    {shift candidate pool one token left}
44: end while
45: Output: x_{n+1:t}
Algorithm 1: Main SPECTRA Decoding Loop
- Input:
  - x_{1:n}: The initial sequence of tokens (the prompt).
  - P_M: The target LLM (as a probability distribution function).
  - N: Maximum N-gram size.
  - W: Candidate pool size.
  - G: Maximum number of guesses allowed in parallel.
  - T: Maximum number of new tokens to generate.
  - gamma: Refine threshold for candidate pool prediction.
- Initialization (lines 1-4):
  - S_fwd and S_bwd are initialized as empty N-gram dictionaries.
  - The candidate pool C is initialized, where c_{w,j} denotes the j-th token of the w-th sequence.
  - t tracks the current token index in the output sequence, starting after the initial prompt.
- Main Loop (lines 5-44): Continues until the desired number of new tokens T has been generated.
  - Obtain the guesses (lines 6-17):
    - Line 7: Initializes the set of guesses Y by searching S_fwd with the last generated token x_t.
    - Lines 8-14: Constructs a high-quality guess u via an iterative backward search in S_bwd, extending u up to N-1 tokens; at each step, the search backs off over the preceding context until a value for the next token u_i is found, at which point the inner loop breaks.
    - Line 15: Appends the high-quality guess u to Y.
    - Line 16: Retrieval Integration (Optional): Appends guesses from SPECTRA-RETRIEVAL (G_retrieve) to Y. This is where external speculation seamlessly integrates.
    - Line 17: Truncates Y so it does not exceed the maximum allowed guesses G.
  - Forward in LLM (lines 18-19):
    - Line 19: Performs a single LLM forward pass to obtain all probability distributions needed for both verification and candidate prediction in parallel.
  - Verification (lines 20-24):
    - Line 22: Calls a VerificationFunction (Algorithm 3 for greedy or Algorithm 4 for sampling) with the current sequence, the LLM, and the collected guesses Y; it returns hits, the accepted tokens.
    - Line 23: Appends hits to the main output sequence x.
    - Line 24: Increments t by the number of accepted tokens.
  - Predict Candidates (lines 26-34): For each of the W sequences in the candidate pool:
    - Line 28: Samples a random value r from Uniform(0, 1).
    - Line 29: Computes the probability distribution for the next token based on the candidate sequence prefix and the context x_{1:t}.
    - Lines 30-31: If r < gamma, selects the argmax token that is not already in S_fwd to promote coverage.
    - Lines 32-33: Otherwise, simply selects the argmax token.
  - Update N-gram dictionaries (lines 35-42): For each sequence in the candidate pool and each subsequence length j from 0 to N-2:
    - Line 39: Appends subsequences to S_fwd using the first token as the key.
    - Line 40: Updates S_bwd by mapping each prefix to its next token.
  - Candidate Pool Shift (line 43): All candidate sequences in C are shifted left by one token, making room for new predictions in the next iteration.
- Output (line 45): The generated sequence of new tokens.
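Putting the pieces together, the following is a condensed, heavily hedged Python skeleton of the loop. It reuses the helpers sketched earlier (obtain_guesses, combine_guesses, predict_candidates, update_ngram_dicts), plus greedy_verify, which is sketched in Section 4.2.4 below; `forward_fn` and `retrieve_fn` stand in for the single batched LLM forward pass and the Trie lookup, and none of these interfaces are the paper's API:

```python
from collections import defaultdict

def spectra_decode(prompt_ids, forward_fn, retrieve_fn, T, N=5, W=15, G=15, gamma=0.1):
    """Condensed SPECTRA loop (sketch): guess -> one forward pass ->
    verify -> predict candidates -> update dictionaries.

    `forward_fn(x, guesses, pool)` must return per-branch argmax
    predictions for verification (at least one base branch) and one
    next-token distribution per pool sequence; `retrieve_fn(x)` returns
    external guesses. Both interfaces are assumptions."""
    S_fwd, S_bwd = defaultdict(list), {}
    pool = [[0] * (N - 1) for _ in range(W)]   # candidate pool, trivially seeded
    x = list(prompt_ids)
    generated = 0
    while generated < T:
        guesses = obtain_guesses(x[-1], x, S_fwd, S_bwd, max_len=N - 1)
        guesses = combine_guesses(guesses, retrieve_fn(x), G)
        argmax_tokens, pool_dists = forward_fn(x, guesses, pool)  # one pass
        hits = greedy_verify(argmax_tokens, guesses)
        if not hits:                            # guarantee one step of progress
            hits = [argmax_tokens[0][0]]
        x += hits
        generated += len(hits)
        pool = predict_candidates(pool, pool_dists, S_fwd, gamma)
        for seq in pool:
            update_ngram_dicts([x[-1]] + seq, S_fwd, S_bwd, max_n=N)
    return x[len(prompt_ids):]
```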
4.2.4. Greedy Verification (Algorithm 3)
Require: sequence x_{1:t}, model P_M, guesses Y with |Y| <= G
Ensure: o    {accepted tokens, of length 1 to N}
1: function GreedyVerification(x_{1:t}, P_M, Y)
2:     D <- []    {store the distributions}
3:     V <- Y    {store the current guesses}
4:     for g in Y do
5:         D.append(P_M(x_{1:t}, g))    {last token of x and g; N total distributions}
6:     end for
7:     for i <- 1 to N-1 do
8:         j <- 1
9:         is_accept <- 0
10:        P <- D[1]_i
11:        while j <= |V| do
12:            s_j <- V[j]_i
13:            if s_j = argmax P then    {accepted: update all potential speculations and probabilities}
14:                o.append(s_j)
15:                is_accept <- 1
16:                Vnew <- [], Dnew <- []
17:                for k <- 1 to |V| do
18:                    if s_j = V[k]_i then
19:                        Vnew.append(V[k])
20:                        Dnew.append(D[k])
21:                    end if
22:                end for
23:                V <- Vnew, D <- Dnew
24:                break
25:            else    {rejected: go to next speculation}
26:                j <- j + 1
27:            end if
28:        end while
29:        if is_accept then
30:            continue
31:        else    {guarantee one step movement}
32:            o.append(argmax P)
33:            break
34:        end if
35:    end for
36:    if is_accept then
37:        o.append(argmax D[1]_N)
38:    end if
39:    return o
40: end function
Algorithm 3: Greedy Verification with SPECTRA Decoding
- Input:
  - x_{1:t}: The current sequence of tokens.
  - P_M: The target LLM.
  - Y: The set of guesses.
- Output: o, the list of accepted tokens.
- Process:
  - Lines 2-6: Initializes D (the list of distributions for each guess) and V (the list of current guesses). For each guess, the LLM's predicted distributions for tokens up to length N, conditioned on the context and the guess prefix, are read off the shared forward pass.
  - Lines 7-35: Iterates through the potential positions within the guesses (from 1 to N-1).
    - Lines 11-28: For each position i, it iterates through the remaining guesses (index j).
      - Line 13: Checks whether the token s_j from the current guess matches the argmax token predicted by the LLM at that position.
      - If accepted (match found):
        - Line 14: Appends s_j to the output o.
        - Line 15: Sets the is_accept flag.
        - Lines 16-23: Filters V and D to keep only the guesses consistent with the newly accepted token s_j, effectively pruning inconsistent guesses.
        - Line 24: Breaks from the inner loop (the accepted token has been processed).
      - If rejected:
        - Line 26: Moves to the next guess (j <- j + 1).
    - Lines 29-34: If is_accept is true, it continues to the next position. If no guess is accepted at the current position, it guarantees one step of progress by autoregressively taking the argmax token, then breaks from the outer loop.
  - Lines 36-38: If tokens were accepted through the entire loop (is_accept is still true), it appends the argmax token of D[1]_N, the last predicted distribution, to extend the output by one final token.
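The greedy verification logic can be sketched as follows, assuming the per-branch argmax predictions have already been read off the shared forward pass (the encoding `argmax_tokens[k][i]` is ours); the bonus token appended when every position is accepted (lines 36-38) is omitted for brevity:

```python
def greedy_verify(argmax_tokens, guesses):
    """Greedy verification (sketch): walk positions left to right,
    accept a drafted token when some surviving branch agrees with the
    model's argmax, and prune branches that disagree."""
    alive = list(range(len(guesses)))   # guesses still consistent with output
    accepted = []
    max_len = max((len(g) for g in guesses), default=0)
    for i in range(max_len):
        hit = None
        for k in alive:
            if i < len(guesses[k]) and guesses[k][i] == argmax_tokens[k][i]:
                hit = guesses[k][i]
                break
        if hit is None:
            # No guess matches here: take one guaranteed autoregressive
            # step using a surviving branch's prediction, then stop.
            if alive and i < len(argmax_tokens[alive[0]]):
                accepted.append(argmax_tokens[alive[0]][i])
            break
        accepted.append(hit)
        # Keep only guesses that agree with the accepted token at position i.
        alive = [k for k in alive
                 if i < len(guesses[k]) and guesses[k][i] == hit]
    return accepted
```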
4.2.5. Sample Verification (Algorithm 4)
Require: sequence x_{1:t}, model P_M, guesses Y with |Y| <= G
Ensure: o    {accepted tokens, of length 1 to N}
1: function SampleVerification(x_{1:t}, P_M, Y)
2:     D <- []    {store the distributions}
3:     V <- Y    {store the current guesses}
4:     for g in Y do
5:         D.append(P_M(x_{1:t}, g))    {last token of x and g; N total distributions}
6:     end for
7:     for i <- 1 to N-1 do
8:         j <- 1
9:         is_accept <- 0
10:        P <- D[1]_i
11:        while j <= |V| do
12:            s_j <- V[j]_i
13:            r ~ Uniform(0, 1)
14:            if r <= P(s_j) then    {accepted: update all potential speculations and probabilities}
15:                o.append(s_j)
16:                is_accept <- 1
17:                Vnew <- [], Dnew <- []
18:                for k <- 1 to |V| do
19:                    if s_j = V[k]_i then
20:                        Vnew.append(V[k])
21:                        Dnew.append(D[k])
22:                    end if
23:                end for
24:                V <- Vnew, D <- Dnew
25:                break
26:            else    {rejected: go to next speculation}
27:                P(s_j) <- 0
28:                P <- normalize(P)
29:                j <- j + 1
30:            end if
31:        end while
32:        if is_accept then
33:            continue
34:        else    {guarantee one step movement}
35:            x_next ~ P
36:            o.append(x_next)
37:            break
38:        end if
39:    end for
40:    if is_accept then
41:        o.append(sample from D[1]_N)
42:    end if
43:    return o
44: end function
Algorithm 4: Sample Verification
- Input: Same as GreedyVerification (Algorithm 3).
- Output: o, the list of accepted tokens.
- Process: This algorithm is similar to GreedyVerification but adapted for stochastic sampling (e.g., temperature > 0).
  - Lines 2-6: Initializes D and V as before.
  - Lines 7-31: Iterates through potential positions and guesses.
    - Lines 13-14: Instead of an argmax check, a token s_j from a guess is accepted if a value r drawn from a Uniform distribution on [0, 1] satisfies r <= P(s_j), the LLM's predicted probability for that token. This introduces stochasticity.
    - If accepted:
      - Lines 15-24: As in the greedy case, s_j is appended to o, is_accept is set, and V and D are filtered.
    - If rejected:
      - Lines 27-29: The probability of the rejected token is set to zero (P(s_j) <- 0), the remaining probabilities are renormalized, and it proceeds to the next guess. This is a key difference, ensuring the sampling distribution is correctly updated after a rejection.
  - Lines 32-38: If is_accept is true, it continues. If no guess is accepted at a position, it samples the next token x_next from the modified distribution, appends it to o, and breaks.
  - Lines 40-42: If tokens were accepted through the entire loop, it samples a final token from the distribution D[1]_N.
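The rejection-and-renormalization step is the heart of Algorithm 4; here is a one-position sketch with an assumed array encoding of the next-token distribution:

```python
import numpy as np

def sample_verify_position(probs, candidate_tokens, rng=np.random.default_rng()):
    """One position of sampling verification (sketch): accept a drafted
    token with probability P(token); on rejection, zero it out and
    renormalize before trying the next candidate. `probs` is the LLM's
    next-token distribution as a 1-D array (our assumed encoding)."""
    p = probs.copy()
    for tok in candidate_tokens:
        if rng.uniform() <= p[tok]:
            return tok, True            # drafted token accepted
        p[tok] = 0.0                    # rejected: remove and renormalize
        p /= p.sum()
    # No draft accepted: sample from the modified distribution instead.
    return int(rng.choice(len(p), p=p)), False
```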
5. Experimental Setup
5.1. Datasets
The experiments are conducted on a diverse set of tasks and datasets to evaluate SPECTRA's performance across different domains.
- MT-Bench (Zheng et al., 2023): A benchmark for multi-turn conversation.
  - Characteristic: Involves complex dialogues, assessing LLM performance in interactive, conversational settings.
  - Data Source for SPECTRA-RETRIEVAL: Approximately 100k examples from the UltraChat dataset (Ding et al., 2023) are used, specifically those with minimal perplexity under the target LLM.
- GSM8K (Cobbe et al., 2021): A dataset for mathematical reasoning.
  - Characteristic: Requires step-by-step logical thinking and numerical accuracy to solve grade-school math word problems.
- HumanEval (Chen et al., 2021): A benchmark for code generation.
  - Characteristic: Assesses the LLM's ability to generate functional Python code given a natural language prompt.
  - Data Source for SPECTRA-RETRIEVAL: Snippets from TheStack (Kocetkov et al., 2023) are refined to 100k examples with the lowest perplexity.
- MBPP (Austin et al., 2021): Another benchmark for code generation.
  - Characteristic: Focuses on generating short Python programs from natural language descriptions and unit tests.
- ClassEval (Du et al., 2023): A benchmark for class-level code generation.
  - Characteristic: Evaluates LLMs on more complex code generation tasks involving class definitions and methods.

These datasets are chosen because they represent a variety of LLM applications, including general language tasks (conversation), logical reasoning (math), and specialized programming tasks (code generation). This diversity allows for a comprehensive evaluation of SPECTRA's robustness and effectiveness.
5.2. Evaluation Metrics
SPECTRA is a lossless acceleration method, meaning it does not modify the original LLM's output distribution. Therefore, the generation quality remains the same as the original LLM. The evaluation primarily focuses on acceleration performance.
- Speedup Ratio:
  - Conceptual Definition: Quantifies how much faster the proposed method is compared to a baseline (typically standard autoregressive decoding). A speedup ratio of k means the method is k times faster.
  - Mathematical Formula:
    $$ \text{Speedup Ratio} = \frac{\text{Time}_{\text{Autoregressive}}}{\text{Time}_{\text{Method}}} $$
    Where:
    - Time_Autoregressive is the total time taken by autoregressive decoding to generate a sequence.
    - Time_Method is the total time taken by SPECTRA (or another speculative method) to generate the same sequence.
  - Symbol Explanation: Time is usually measured in seconds or milliseconds.
- Compression Ratio (τ):
  - Conceptual Definition: Measures the efficiency of the speculative decoding process itself, independent of specific hardware. It represents how many tokens are produced per LLM forward pass on average; a higher compression ratio indicates that more tokens are accepted in each speculative step.
  - Mathematical Formula:
    $$ \tau = \frac{\text{Total Autoregressive Steps}}{\text{Total Speculative Steps}} $$
    Where:
    - Total Autoregressive Steps is the number of steps required when generating one token at a time, i.e., the total length of the generated sequence.
    - Total Speculative Steps is the number of LLM forward passes required by the speculative decoding method to generate the same sequence.
  - Symbol Explanation: Steps refer to LLM forward passes.
- ROUGE-N (Recall-Oriented Understudy for Gisting Evaluation): (Used for evaluating sampling decoding quality.)
  - Conceptual Definition: ROUGE is a set of metrics for evaluating automatic summarization and machine translation by comparing generated text against a set of human-produced references. ROUGE-N measures N-gram overlap: ROUGE-1 measures unigram (single word) overlap, ROUGE-2 measures bigram overlap, and ROUGE-L measures the longest common subsequence (LCS) match, which accounts for sequence order.
  - Mathematical Formula (ROUGE-N recall, common in summarization):
    $$ \text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)} $$
    Where:
    - gram_n is a specific N-gram (e.g., a word for ROUGE-1, a pair of words for ROUGE-2).
    - Count_match(gram_n) is the maximum number of N-grams co-occurring in the system output and the set of reference summaries.
    - Count(gram_n) is the number of N-grams in the reference summary.
  - Symbol Explanation: This formula gives the recall form of ROUGE-N; precision and F-measure variants also exist. ROUGE-L has a different underlying formula based on the longest common subsequence between texts.
- Throughput Metrics (a consolidated sketch computing the metrics in this section follows this list):
  - Macro Throughput (Mac-TP):
    - Conceptual Definition: The average rate of token processing per generation step, giving insight into the efficiency of individual decoding iterations.
    - Mathematical Formula:
      $$ \text{Mac-TP} = \frac{1}{S} \sum_{i=1}^{S} \frac{\text{tokens}_i}{\text{time}_i} $$
      Where:
      - S is the total number of speculative decoding steps.
      - tokens_i is the number of tokens accepted in step i.
      - time_i is the time taken for step i.
  - Micro Throughput (Mic-TP):
    - Conceptual Definition: The overall rate of token generation, calculated as the total number of generated tokens divided by the total elapsed time. This is the common practical measure of generation speed.
    - Mathematical Formula:
      $$ \text{Mic-TP} = \frac{\text{Total Generated Tokens}}{\text{Total Elapsed Time}} $$
    - Symbol Explanation: Both are typically measured in tokens per second (tokens/s).
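The consolidated sketch referenced above, computing the speedup ratio, compression ratio τ, Mac-TP, and Mic-TP from hypothetical per-step logs (the log format is our assumption):

```python
def decoding_metrics(tokens_per_step, time_per_step, baseline_time):
    """Evaluation metrics from per-step logs. `tokens_per_step[i]` is the
    number of tokens accepted at step i, `time_per_step[i]` its wall time,
    and `baseline_time` the autoregressive time for the same output."""
    total_tokens = sum(tokens_per_step)
    total_time = sum(time_per_step)
    return {
        "speedup": baseline_time / total_time,
        "compression_ratio": total_tokens / len(tokens_per_step),  # tau
        "mac_tp": sum(n / t for n, t in zip(tokens_per_step, time_per_step))
                  / len(tokens_per_step),
        "mic_tp": total_tokens / total_time,
    }
```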
5.3. Baselines
SPECTRA is compared against standard autoregressive decoding and several leading training-free speculative decoding approaches:
- Autoregressive Decoding: The standard, one-token-at-a-time generation method. This serves as the primary baseline, with a speedup ratio of 1.00x.
- Adaptive N-gram Parallel Decoding (ANPD) (Ou et al., 2024): A training-free method that leverages adaptive N-grams for parallel decoding.
- REST (Retrieval-Based Speculative Decoding) (He et al., 2024): A training-free method that uses a retrieval mechanism over an external datastore to generate guesses. This is a direct competitor to SPECTRA-RETRIEVAL.
- Lookahead Decoding (Fu et al., 2024): A training-free method that breaks the sequential dependency of LLM inference by generating and validating multiple candidate tokens in parallel. This is a direct competitor to SPECTRA-CORE.

The baselines are replicated using their publicly available GitHub code with default settings and hyperparameters from their original papers to ensure a fair comparison.
5.4. Models
The evaluation covers a range of LLM architectures and sizes:
- LLaMA-2-Chat (Touvron et al., 2023): Sizes 7B, 13B, and 70B. These are general-purpose chat models.
- CodeLlama (Rozière et al., 2024): Sizes 7B and 13B. These models are specialized for code generation.
- LLaMA-3-Instruct (Dubey et al., 2024): Sizes 8B and 70B. Instruction-tuned models designed for following instructions.

All model checkpoints are sourced from official repositories or Hugging Face.

- Precision: For 7B and 13B models, 16-bit (FP16) precision with a pre-allocated key-value cache is used. For 70B models, 8-bit quantization is primarily used for the main results in Table 1, with FP16 evaluations in multi-GPU settings described in Appendix E. Numerical consistency between FP32 and FP16 for LLaMA-2-7B is also verified.
5.5. Hardware
- Primary GPU: A single NVIDIA A100 GPU with 80GB of memory for most experiments.
- Hardware Scalability Test (Appendix C): Additional NVIDIA GPUs, including RTX 3090, RTX 8000, A40, and A6000, are used to test SPECTRA's robustness across different hardware.
- Multi-GPU Environments (Appendix E): For 70B models exceeding single-GPU memory in FP16, distributed computation across multiple H100 GPUs (2x, 4x, or 8x) is performed using Hugging Face's pipeline parallelism.
5.6. Hyperparameters
- SPECTRA:
  - N-gram size (N): A 5-gram setup for both forward and backward dictionaries by default.
  - Candidate Pool Size (W): 15 sequences maintained per key.
  - Refine Threshold (gamma): 0.1 by default, for probabilistically encouraging selection of tokens unseen in S_fwd.
  - Max Guesses (G): Up to 15 guesses allowed per step.
  - Guess Priority: Internal guesses from SPECTRA-CORE receive priority; if the limit is not reached, external guesses are added.
  - External Lookup: A Trie structure is used for rapid prefix queries, similar to REST.
  - Corpus for SPECTRA-RETRIEVAL:
    - Conversation tasks (MT-Bench): ~100k low-perplexity examples from UltraChat.
    - Code tasks (HumanEval, MBPP): ~100k low-perplexity snippets from TheStack.
- Batch Size: All speedup and throughput metrics are computed at a batch size of 1.
- Generation Length: Max 512 tokens for code generation; max 1024 tokens for conversation/math tasks, or until the end-of-sequence token.
- Random Seeds: All random seeds are set to 0.
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results demonstrate that SPECTRA consistently achieves superior acceleration compared to other training-free speculative decoding methods across a wide range of models and tasks.
The following are the results from Table 1 of the original paper:
| Model | Method | ClassEval Speedup | ClassEval τ | GSM8K Speedup | GSM8K τ | HumanEval Speedup | HumanEval τ | MBPP Speedup | MBPP τ | MTBench Speedup | MTBench τ | Avg. Speedup | Avg. τ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Greedy (temperature=0) | | | | | | | | | | | | | |
| CL-13B | ANPD | 1.94 | 2.52 | 2.81 | 3.72 | 2.08 | 2.50 | 2.71 | 3.58 | 2.61 | 3.41 | 2.43 | 3.15 |
| | Lookahead | 2.25 | 3.61 | 2.80 | 4.24 | 2.30 | 3.16 | 2.91 | 4.44 | 2.59 | 4.04 | 2.57 | 3.90 |
| | REST | 1.28 | 2.14 | 0.93 | 1.54 | 1.58 | 2.31 | 0.85 | 1.40 | 0.94 | 1.53 | 1.12 | 1.78 |
| | SPECTRA (Ours) | 4.06 | 4.65 | 3.75 | 4.63 | 2.91 | 3.95 | 3.29 | 4.46 | 2.65 | 4.40 | 3.33 | 4.42 |
| CL-7B | ANPD | 2.30 | 3.21 | 2.68 | 3.78 | 2.16 | 2.90 | 3.16 | 3.83 | 3.35 | 3.83 | 2.73 | 3.51 |
| | Lookahead | 2.59 | 3.66 | 2.99 | 3.83 | 2.50 | 3.05 | 3.25 | 4.10 | 3.70 | 4.52 | 3.01 | 3.83 |
| | REST | 1.45 | 2.22 | 0.91 | 1.39 | 1.70 | 2.34 | 0.96 | 1.45 | 1.02 | 1.44 | 1.21 | 1.77 |
| | SPECTRA (Ours) | 2.70 | 4.10 | 3.33 | 4.59 | 2.96 | 3.90 | 3.56 | 4.45 | 3.70 | 4.52 | 3.25 | 4.31 |
| L2-13B | ANPD | 1.36 | 1.72 | 1.78 | 2.01 | 1.47 | 1.87 | 1.61 | 2.32 | 1.32 | 1.69 | 1.51 | 1.92 |
| | Lookahead | 1.78 | 2.76 | 1.81 | 2.01 | 1.46 | 1.87 | 1.73 | 2.32 | 1.38 | 1.69 | 1.63 | 2.13 |
| | REST | 1.22 | 1.90 | 0.94 | 1.46 | 1.25 | 1.94 | 0.95 | 1.44 | 1.14 | 1.10 | 1.10 | 1.57 |
| | SPECTRA (Ours) | 2.00 | 3.24 | 1.83 | 2.62 | 1.96 | 2.91 | 1.63 | 2.24 | 1.75 | 2.60 | 1.83 | 2.72 |
| L2-70B | ANPD | 1.34 | 2.00 | 1.30 | 1.95 | 1.33 | 1.90 | 1.25 | 1.86 | 1.21 | 1.87 | 1.29 | 1.92 |
| | Lookahead | 1.82 | 2.65 | 1.90 | 2.87 | 1.63 | 2.52 | 1.61 | 2.69 | 1.86 | 2.57 | 1.76 | 2.66 |
| | REST | 1.17 | 1.49 | 1.20 | 1.54 | 1.94 | 1.30 | 1.56 | 2.10 | 1.46 | 1.45 | 1.47 | 1.58 |
| | SPECTRA (Ours) | 2.43 | 3.40 | 1.95 | 2.62 | 2.52 | 3.22 | 1.52 | 1.86 | 1.68 | 1.93 | 2.02 | 2.61 |
| L2-7B | ANPD | 1.33 | 1.73 | 1.30 | 1.69 | 1.37 | 1.82 | 1.25 | 1.82 | 1.01 | 1.46 | 1.25 | 1.70 |
| | Lookahead | 1.73 | 2.59 | 1.77 | 2.67 | 2.05 | 2.77 | 1.41 | 2.16 | 1.46 | 2.02 | 1.60 | 2.44 |
| | REST | 1.01 | 1.46 | 1.14 | 1.23 | 1.53 | 1.85 | 1.69 | 2.10 | 1.14 | 1.23 | 1.30 | 1.57 |
| | SPECTRA (Ours) | 2.59 | 3.86 | 2.83 | 4.57 | 3.49 | 4.65 | 2.49 | 3.36 | 2.49 | 3.58 | 2.78 | 4.00 |
| L3-70B | ANPD | 1.15 | 1.31 | 1.38 | 1.61 | 1.46 | 1.83 | 1.07 | 1.30 | 1.05 | 1.30 | 1.22 | 1.47 |
| | Lookahead | 1.24 | 1.42 | 1.34 | 1.96 | 2.31 | 2.89 | 2.33 | 3.48 | 1.20 | 1.35 | 1.68 | 2.22 |
| | REST | 1.14 | 1.35 | 1.68 | 1.96 | 1.29 | 1.75 | 1.10 | 1.32 | 1.12 | 1.27 | 1.27 | 1.53 |
| | SPECTRA (Ours) | 1.54 | 2.03 | 1.23 | 1.86 | 1.81 | 2.25 | 1.20 | 1.73 | 1.19 | 1.41 | 1.39 | 1.86 |
| L3-8B | ANPD | 1.19 | 1.43 | 0.88 | 1.33 | 1.35 | 1.73 | 1.43 | 1.81 | 1.19 | 1.43 | 1.21 | 1.55 |
| | Lookahead | 1.52 | 1.98 | 1.18 | 1.72 | 1.70 | 2.12 | 1.33 | 1.98 | 1.07 | 1.40 | 1.36 | 1.84 |
| | REST | 1.18 | 1.48 | 1.12 | 1.44 | 1.74 | 2.01 | 1.12 | 1.48 | 1.03 | 1.22 | 1.24 | 1.53 |
| | SPECTRA (Ours) | 1.70 | 2.22 | 1.52 | 2.22 | 1.96 | 2.75 | 1.18 | 1.69 | 1.46 | 1.88 | 1.56 | 2.15 |
| Sampling (temperature=1.0) | | | | | | | | | | | | | |
| CL-13B | ANPD | 1.15 | 1.31 | 1.38 | 1.61 | 1.46 | 1.83 | 1.07 | 1.30 | 1.05 | 1.30 | 1.22 | 1.47 |
| | Lookahead | 1.24 | 1.42 | 1.34 | 1.96 | 2.31 | 2.89 | 2.33 | 3.48 | 1.20 | 1.35 | 1.68 | 2.22 |
| | REST | 1.14 | 1.35 | 1.68 | 1.96 | 1.29 | 1.75 | 1.10 | 1.32 | 1.12 | 1.27 | 1.27 | 1.53 |
| | SPECTRA (Ours) | 1.54 | 2.03 | 1.23 | 1.86 | 1.81 | 2.25 | 1.20 | 1.73 | 1.19 | 1.41 | 1.39 | 1.86 |
| CL-7B | ANPD | 1.19 | 1.43 | 0.88 | 1.33 | 1.35 | 1.73 | 1.43 | 1.81 | 1.19 | 1.43 | 1.21 | 1.55 |
| | Lookahead | 1.52 | 1.98 | 1.18 | 1.72 | 1.70 | 2.12 | 1.33 | 1.98 | 1.07 | 1.40 | 1.36 | 1.84 |
| | REST | 1.18 | 1.48 | 1.12 | 1.44 | 1.74 | 2.01 | 1.12 | 1.48 | 1.03 | 1.22 | 1.24 | 1.53 |
| | SPECTRA (Ours) | 1.70 | 2.22 | 1.52 | 2.22 | 1.96 | 2.75 | 1.18 | 1.69 | 1.46 | 1.88 | 1.56 | 2.15 |
| L2-13B | ANPD | 1.31 | 1.51 | 1.78 | 2.30 | 1.26 | 1.48 | 1.34 | 1.48 | 1.10 | 1.22 | 1.36 | 1.60 |
| | Lookahead | 1.78 | 2.30 | 1.72 | 2.09 | 1.51 | 1.76 | 1.28 | 1.46 | 1.25 | 1.36 | 1.51 | 1.79 |
| | REST | 1.26 | 1.46 | 0.99 | 1.27 | 1.46 | 1.72 | 1.21 | 1.36 | 1.14 | 1.21 | 1.21 | 1.40 |
| | SPECTRA (Ours) | 1.72 | 2.35 | 1.82 | 2.62 | 1.76 | 2.35 | 1.59 | 2.02 | 1.49 | 1.88 | 1.68 | 2.24 |

Table 1: Overall performance of speculative decoding methods across multiple tasks. "CL-xB" denotes CodeLlama with x billion parameters, "L2-xB" denotes LLaMA-2-Chat of size x billion, and "L3-xB" denotes LLaMA-3-Instruct of size x billion. We report the speedup ratio (vs. autoregressive) and the compression ratio τ.
- Overall Performance (Greedy Decoding): SPECTRA consistently achieves the highest speedup ratios across all models and datasets under greedy decoding (temperature=0); for instance, it reaches an impressive 4.08x speedup for Llama3-8B on MBPP.
  - For 7B models (e.g., CL-7B, L2-7B, L3-8B), SPECTRA often exceeds 3x acceleration, demonstrating the effectiveness of its multi-token compression.
  - For larger 13B and 70B models, speedups are generally between 1.6x and 3x, still outperforming all baselines.
  - This robustness across diverse models (Llama 2, Llama 3, CodeLlama) and tasks (conversation, math, code generation) highlights SPECTRA's generalizability. Baselines like REST can even fall below 1x in some cases (e.g., 0.93x on GSM8K with CL-13B), indicating their instability.
- Compression Ratio (τ): SPECTRA consistently delivers the highest average compression ratio across all datasets and LLMs. Its draft-and-verify iterations typically accept between 2.1 and 4.8 tokens per step, significantly more than alternative approaches and nearly double the acceptance length achieved by REST, which translates directly into higher speedups.
- Acceleration in Sampling Decoding (temperature=1.0):
  - Under sampling-based decoding, SPECTRA continues to provide acceleration, offering roughly 1.15x-2.77x speedups over standard autoregressive decoding.
  - These gains are more modest than in greedy decoding, an expected outcome for speculative decoding under stochastic sampling due to lower acceptance rates, consistent with prior work (Fu et al., 2024; Leviathan et al., 2023). Nevertheless, SPECTRA still maintains its lead over the other methods in this more challenging setting.

The following figure (Figure 6 from the original paper) plots the cumulative number of accepted tokens versus decoding steps:

Figure 6: Total number of accepted tokens across all samples at each decoding step.

Figure 6 visualizes the total number of accepted tokens across all samples at each decoding step for Llama2-7B-chat with greedy decoding on various datasets. The steeper slope of the SPECTRA curve compared to the other methods (ANPD, Lookahead, REST, autoregressive) indicates that SPECTRA requires substantially fewer decoding steps to generate the same number of tokens, a direct consequence of its higher token acceptance rate. For example, on HumanEval, SPECTRA reaches the total token count in roughly half the steps of ANPD.
6.2. Ablation Studies / Parameter Analysis
The following are the results from Table 2 of the original paper:
| Method | GSM8K Speedup | GSM8K τ | MTBench Speedup | MTBench τ |
|---|---|---|---|---|
| REST | 1.01 | 1.47 | 1.25 | 1.90 |
| Lookahead | 1.66 | 1.93 | 1.73 | 2.05 |
| Lookahead + REST | 1.08 | 1.47 | 1.27 | 1.90 |
| SPECTRA's ablation | | | | |
| CORE Module | 2.04 | 2.50 | 1.92 | 2.35 |
| - w/o Forward Dict | 1.50 | 1.68 | 1.20 | 1.37 |
| - w/o Backward Dict | 1.94 | 2.21 | 1.74 | 2.12 |
| - w/o Sub-Ngram | 1.95 | 2.34 | 1.75 | 2.18 |
| RETRIEVAL Module | 1.18 | 1.31 | 1.24 | 1.50 |
| - w/o PPL refine | 1.16 | 1.29 | 1.20 | 1.45 |
| SPECTRA (ours) | 2.14 | 2.64 | 2.02 | 2.59 |
Table 2: Ablation study of SPECTRA's components (greedy decoding, LLaMA2-7B-Chat).
The ablation study (Table 2) on LLaMA2-7B-Chat with greedy decoding provides crucial insights into the contribution of each SPECTRA component:
- Impact of SPECTRA-CORE components:
  - The full CORE Module alone achieves a 2.04x speedup on GSM8K and 1.92x on MTBench.
  - Removing the Forward Dictionary (w/o Forward Dict) drastically reduces the speedup to 1.50x (GSM8K) and 1.20x (MTBench), highlighting the Forward Dictionary's importance in supplying a large quantity of initial guesses.
  - Removing the Backward Dictionary (w/o Backward Dict) reduces the speedup to 1.94x (GSM8K) and 1.74x (MTBench). While less severe than removing the Forward Dictionary, this shows the Backward Dictionary's contribution to the quality of guesses.
  - Omitting Sub-Ngram inclusion (w/o Sub-Ngram) also causes a slight drop (1.95x on GSM8K), confirming that indexing subsequences in the N-gram dictionaries improves guess quality and coverage (a minimal sketch of such a dictionary store follows this list).
- Impact of SPECTRA-RETRIEVAL components:
  - The RETRIEVAL Module alone achieves a 1.18x speedup on GSM8K and 1.24x on MTBench.
  - Removing perplexity-based filtering (w/o PPL refine) slightly decreases the speedup to 1.16x (GSM8K) and 1.20x (MTBench). This validates that perplexity refinement improves the quality of external guesses, though its standalone impact on speedup is more modest than that of SPECTRA-CORE's main components.
- Full SPECTRA framework:
  - The full SPECTRA framework achieves a 2.14x speedup on GSM8K and 2.02x on MTBench, surpassing all individual components and combinations. This demonstrates the synergistic effect of integrating all components, maximizing acceptance rates and overall performance.
- Naive combination comparison:
  - A naive combination of Lookahead and REST (Lookahead + REST) performs significantly worse (1.08x on GSM8K, 1.27x on MTBench) than either Lookahead alone (1.66x, 1.73x) or SPECTRA's full framework. This underscores SPECTRA's careful, optimized integration strategy: simply appending guesses from different sources is insufficient.
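To make the forward/backward dictionary idea concrete, here is a minimal, self-contained sketch in the spirit of SPECTRA-CORE. The class name, data layout, and frequency-based selection heuristic are illustrative assumptions, not the authors' implementation:

```python
from collections import defaultdict

class NGramSpeculator:
    """Illustrative forward/backward N-gram store for draft-guess generation."""

    def __init__(self, max_n: int = 4):
        self.max_n = max_n
        self.forward = defaultdict(list)   # prefix n-gram -> observed next tokens
        self.backward = defaultdict(list)  # suffix n-gram -> observed previous tokens

    def update(self, tokens: list[int]) -> None:
        # Index every sub-n-gram (the "- w/o Sub-Ngram" ablation drops this step).
        for n in range(1, self.max_n):
            for i in range(len(tokens) - n):
                self.forward[tuple(tokens[i:i + n])].append(tokens[i + n])
                self.backward[tuple(tokens[i + 1:i + n + 1])].append(tokens[i])

    def guess(self, context: list[int], length: int = 3) -> list[int]:
        # Greedily extend the context with the most frequent continuation,
        # preferring the longest matching n-gram prefix.
        out = list(context)
        for _ in range(length):
            for n in range(self.max_n - 1, 0, -1):
                cands = self.forward.get(tuple(out[-n:]))
                if cands:
                    out.append(max(set(cands), key=cands.count))
                    break
            else:
                break  # no matching n-gram of any length
        return out[len(context):]
```

In a speculative decoding loop, `update` would ingest previously generated tokens and `guess` would supply draft continuations for the LLM to verify in a single forward pass.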
6.3. Priority for Source of Guesses
The following figure (Figure 3 from the original paper) shows the acceptance rates of Llama2-7B-chat for different guess sources:

Figure 3: Acceptance rates of Llama2-7B-chat for different guess sources (from SPECTRA-CORE forward dictionary, backward dictionary, SPECTRA-RETRIEVAL). The acceptance rate is the fraction of guessed tokens that pass verification.
Figure 3 illustrates the acceptance rates for different guess sources when the limit on the number of guess sequences is temporarily removed (to assess individual contributions).
- Guesses generated by SPECTRA-CORE (from both the forward and backward dictionaries) consistently show a higher acceptance rate than those obtained from the external knowledge source via SPECTRA-RETRIEVAL.
- This empirical finding justifies SPECTRA's design choice (as implemented in Algorithm 1) to prioritize internal guesses from SPECTRA-CORE over external guesses from SPECTRA-RETRIEVAL, ensuring that the limited guess budget and LLM forward-pass resources are spent on the most promising candidates, maximizing overall speedup. A sketch of this prioritization appears below.
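A minimal sketch of this budget-constrained prioritization (the function name and signature are hypothetical, mirroring the priority described for Algorithm 1 rather than reproducing it):

```python
def select_guesses(core_guesses: list[list[int]],
                   retrieval_guesses: list[list[int]],
                   budget: int) -> list[list[int]]:
    """Fill the per-step guess budget with internal (SPECTRA-CORE) guesses
    first; external (SPECTRA-RETRIEVAL) guesses only take leftover slots."""
    selected = core_guesses[:budget]
    selected += retrieval_guesses[:max(0, budget - len(selected))]
    return selected
```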
6.4. FlashAttention
The following figure (Figure 4 from the original paper) shows the effect of FlashAttention on speculative decoding speed:

Figure 4: Effect of FlashAttention on speculative decoding speed: Measured speedups on GSM8K and MTBench (LLama2-7B-Chat, greedy decoding). "No Flash" uses standard attention; "With Flash" uses FlashAttention for faster parallel verification.
Figure 4 demonstrates the impact of FlashAttention on the speedup of speculative decoding methods.
- Consistent boost: Enabling FlashAttention consistently boosts the speedup for all methods (ANPD, Lookahead, REST, SPECTRA), though to varying degrees.
- SPECTRA benefits most: SPECTRA gains an additional 0.24x speedup on both GSM8K and MTBench, attributed to FlashAttention's ability to exploit the parallel structure of speculative decoding by reducing attention overheads.
- Reasoning: SPECTRA typically presents the longest verification branches (i.e., it accepts more tokens per speculative step, as reflected in its higher compression ratio), so it profits most from efficient attention implementations like FlashAttention, which reduce the cost of processing these longer sequences in parallel. An example of enabling FlashAttention follows.
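For reference, one common way to enable FlashAttention in recent versions of Hugging Face transformers is shown below. This is a generic loading example, not SPECTRA's own setup code, and it assumes the flash-attn package and a supported GPU are available:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Generic loading example (not SPECTRA's setup code). Without flash-attn
# installed, use attn_implementation="sdpa" or "eager" instead.
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```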
6.5. Other Analysis
6.5.1. Detailed Results with Throughput
The following are the results from Table 4 of the original paper, providing a detailed throughput analysis (Table 5 reports further throughput results and is not reproduced here):
| Model | Method | ClassEval Mac-TP | ClassEval Mic-TP | GSM8K Mac-TP | GSM8K Mic-TP | HumanEval Mac-TP | HumanEval Mic-TP | MBPP Mac-TP | MBPP Mic-TP | MTBench Mac-TP | MTBench Mic-TP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Greedy (temperature=0) | | | | | | | | | | | |
| CL-13B | Autoregressive | 30.85 | 30.85 | 32.03 | 32.03 | 32.35 | 32.35 | 32.07 | 32.07 | 30.69 | 30.63 |
| | ANPD | 59.77 | 58.03 | 89.99 | 89.18 | 67.43 | 64.65 | 86.76 | 86.41 | 80.10 | 76.68 |
| | Lookahead | 69.28 | 68.62 | 89.73 | 89.00 | 74.33 | 73.23 | 93.38 | 92.80 | 79.38 | 78.67 |
| | REST | 39.53 | 37.73 | 29.93 | 29.47 | 51.15 | 47.49 | 27.41 | 27.39 | 28.92 | 27.18 |
| | SPECTRA (Ours) | 73.47 | 72.98 | 93.36 | 93.23 | 84.91 | 84.41 | 105.44 | 105.39 | 81.32 | 80.68 |
| CL-7B | Autoregressive | 41.17 | 41.17 | 41.17 | 41.17 | 41.41 | 41.41 | 41.60 | 41.60 | 38.91 | 38.93 |
| | ANPD | 94.76 | 93.02 | 132.26 | 131.30 | 89.26 | 87.13 | 131.35 | 130.99 | 130.41 | 126.64 |
| | Lookahead | 106.51 | 105.95 | 123.04 | 121.90 | 103.45 | 103.51 | 120.75 | 120.23 | 125.58 | 124.77 |
| | REST | 59.49 | 56.61 | 37.61 | 37.21 | 70.38 | 65.22 | 40.11 | 40.09 | 39.64 | 36.70 |
| | SPECTRA (Ours) | 111.09 | 110.68 | 137.24 | 136.86 | 122.54 | 122.41 | 148.32 | 148.07 | 143.98 | 144.32 |
| L2-13B | Autoregressive | 31.85 | 31.56 | 32.40 | 32.43 | 32.27 | 32.27 | 32.19 | 32.19 | 31.93 | 31.78 |
| | ANPD | 43.30 | 44.44 | 47.54 | 45.22 | 43.24 | 42.28 | 36.20 | 35.84 | 37.44 | 34.84 |
| | Lookahead | 57.49 | 58.94 | 47.44 | 47.62 | 55.76 | 55.58 | 44.41 | 44.15 | 48.11 | 46.62 |
| | REST | 38.81 | 37.74 | 30.36 | 30.22 | 40.47 | 39.70 | 30.70 | 30.67 | 36.39 | 37.02 |
| | SPECTRA (Ours) | 63.64 | 64.31 | 59.21 | 58.63 | 63.39 | 63.18 | 52.43 | 52.19 | 56.04 | 53.75 |
| L2-70B | Autoregressive | 2.60 | 2.60 | 2.61 | 2.61 | 2.61 | 2.61 | 2.63 | 2.63 | 2.60 | 2.60 |
| | ANPD | 4.72 | 4.80 | 4.25 | 4.10 | 4.85 | 4.76 | 3.07 | 3.07 | 3.47 | 3.30 |
| | Lookahead | 6.90 | 7.16 | 4.87 | 5.12 | 6.71 | 6.73 | 3.92 | 3.93 | 5.05 | 5.02 |
| | SPECTRA (Ours) | 8.07 | 8.35 | 6.58 | 6.75 | 8.41 | 8.41 | 4.88 | 4.88 | 6.32 | 6.22 |
| L2-7B | Autoregressive | 40.33 | 40.32 | 41.01 | 41.03 | 41.14 | 41.13 | 41.00 | 41.04 | 40.48 | 40.50 |
| | ANPD | 65.54 | 68.10 | 62.40 | 59.38 | 63.27 | 59.98 | 48.94 | 47.67 | 52.47 | 50.06 |
| | Lookahead | 88.41 | 91.05 | 68.00 | 68.20 | 84.69 | 83.87 | 59.79 | 60.76 | 70.04 | 69.07 |
| | REST | 54.74 | 53.93 | 41.43 | 41.38 | 57.99 | 56.41 | 41.28 | 40.74 | 50.58 | 51.79 |
| | SPECTRA (Ours) | 96.88 | 98.75 | 86.51 | 85.50 | 98.77 | 98.38 | 72.39 | 73.22 | 81.93 | 79.20 |
| L3-70B | Autoregressive | 2.58 | 2.57 | 2.58 | 2.58 | 2.59 | 2.59 | 2.59 | 2.59 | 2.55 | 2.55 |
| | ANPD | 3.97 | 4.19 | 3.86 | 3.72 | 4.72 | 4.75 | 3.77 | 3.59 | 3.14 | 3.03 |
| | Lookahead | 6.17 | 6.47 | 3.99 | 3.96 | 6.63 | 6.75 | 3.70 | 3.66 | 4.49 | 4.53 |
| | SPECTRA (Ours) | 6.87 | 7.18 | 5.43 | 5.34 | 7.33 | 7.50 | 5.01 | 4.88 | 5.25 | 5.16 |
| L3-8B | Autoregressive | 36.59 | 36.58 | 36.74 | 36.74 | 36.20 | 36.21 | 35.24 | 35.20 | 36.55 | 36.69 |
| | ANPD | 77.21 | 78.76 | 141.89 | 141.36 | 66.31 | 65.57 | 118.47 | 112.95 | 41.77 | 40.20 |
| | Lookahead | 94.92 | 90.09 | 136.32 | 135.92 | 89.99 | 90.47 | 133.67 | 133.12 | 56.09 | 55.49 |
| | SPECTRA (Ours) | 103.61 | 105.88 | 142.89 | 142.72 | 92.86 | 93.16 | 143.80 | 142.72 | 61.69 | 60.22 |
| Sampling (temperature=1.0) | | | | | | | | | | | |
| CL-13B | Autoregressive | 30.90 | 30.64 | 31.38 | 31.37 | 31.24 | 31.39 | 31.46 | 31.45 | 30.71 | 30.67 |
| | ANPD | 35.48 | 34.86 | 33.54 | 32.34 | 32.64 | 34.36 | 31.57 | 30.95 | 70.92 | 65.68 |
| | Lookahead | 42.54 | 40.74 | 33.79 | 32.49 | 40.25 | 42.17 | 32.02 | 31.19 | 71.50 | 68.46 |
| | REST | 35.15 | 33.22 | 25.67 | 25.24 | 39.58 | 38.49 | 26.43 | 25.89 | 28.41 | 26.69 |
| | SPECTRA (Ours) | 51.86 | 50.04 | 37.57 | 35.67 | 51.60 | 52.64 | 36.29 | 35.27 | 72.90 | 69.98 |
| CL-7B | Autoregressive | 39.60 | 39.58 | 40.85 | 40.87 | 40.05 | 40.10 | 40.81 | 40.81 | 40.49 | 40.50 |
| | ANPD | 50.89 | 51.76 | 47.44 | 46.68 | 44.14 | 46.34 | 45.86 | 45.81 | 112.29 | 103.57 |
| | Lookahead | 60.87 | 60.29 | 48.54 | 47.64 | 57.12 | 61.14 | 48.64 | 48.27 | 110.07 | 105.00 |
| | REST | 48.64 | 46.41 | 35.98 | 35.46 | 53.35 | 52.26 | 37.04 | 36.57 | 39.36 | 36.51 |
| | SPECTRA (Ours) | 71.70 | 71.78 | 55.24 | 52.81 | 67.27 | 69.20 | 54.48 | 52.91 | 112.43 | 108.49 |
| L2-13B | Autoregressive | 31.23 | 31.17 | 31.44 | 31.47 | 31.41 | 31.42 | 32.02 | 32.06 | 31.67 | 31.59 |
| | ANPD | 37.53 | 37.94 | 39.11 | 37.99 | 36.79 | 36.75 | 32.97 | 32.71 | 36.91 | 34.34 |
| | Lookahead | 47.59 | 47.35 | 41.60 | 41.76 | 46.33 | 46.51 | 37.82 | 37.82 | 47.35 | 45.48 |
| | REST | 36.78 | 36.17 | 29.33 | 29.25 | 37.46 | 36.71 | 29.38 | 29.28 | 35.50 | 36.21 |
| | SPECTRA (Ours) | 53.13 | 52.28 | 48.60 | 48.11 | 52.93 | 53.11 | 42.95 | 43.03 | 54.98 | 52.42 |
| L2-7B | Autoregressive | 39.89 | 39.88 | 40.58 | 40.59 | 40.09 | 40.10 | 40.59 | 40.66 | 40.65 | 40.70 |
| | ANPD | 52.14 | 52.78 | 54.23 | 52.90 | 51.40 | 50.97 | 44.73 | 43.77 | 50.92 | 48.24 |
| | Lookahead | 70.82 | 71.17 | 61.15 | 61.34 | 68.78 | 69.01 | 50.84 | 51.83 | 68.27 | 66.77 |
| | REST | 50.35 | 49.99 | 40.19 | 40.09 | 50.86 | 50.06 | 38.94 | 38.18 | 49.12 | 50.54 |
| | SPECTRA (Ours) | 78.46 | 78.74 | 72.13 | 71.68 | 81.71 | 81.76 | 59.77 | 60.09 | 80.21 | 77.00 |
| L3-8B | Autoregressive | 35.75 | 35.76 | 35.16 | 35.17 | 36.01 | 36.02 | 36.05 | 36.07 | 35.39 | 35.48 |
| | ANPD | 44.71 | 43.72 | 69.12 | 66.73 | 51.48 | 51.57 | 68.03 | 64.54 | 40.84 | 39.23 |
| | Lookahead | 53.05 | 50.57 | 72.68 | 69.11 | 64.59 | 63.79 | 71.88 | 68.90 | 55.46 | 53.74 |
| | SPECTRA (Ours) | 69.50 | 68.92 | 79.88 | 76.53 | 69.09 | 68.62 | 78.99 | 76.69 | 60.33 | 57.69 |
Table 4: Micro throughput (Mic-TP) and Macro throughput (Mac-TP) across multiple tasks and models.
The throughput analysis in Tables 4 and 5 further supports SPECTRA's performance: SPECTRA consistently achieves higher macro throughput (Mac-TP) and micro throughput (Mic-TP) than both the non-speculative baseline and the other training-free accelerators. This holds across model sizes, datasets, and even older GPU architectures (RTX 3090, RTX 8000), demonstrating SPECTRA's robustness to varying hardware capabilities. Higher Mac-TP indicates more efficient individual speculative steps, while higher Mic-TP confirms faster overall token generation; a sketch of these two metrics follows.
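A sketch of how these two metrics are commonly computed (the paper's exact definitions may differ slightly; the function name and input layout are assumptions):

```python
def micro_macro_throughput(samples: list[tuple[int, float]]) -> tuple[float, float]:
    """Compute (micro-TP, macro-TP) from (generated_tokens, seconds) pairs.
    Micro-TP pools all tokens and all time; macro-TP averages each sample's
    own tokens-per-second rate."""
    total_tokens = sum(tokens for tokens, _ in samples)
    total_time = sum(seconds for _, seconds in samples)
    micro = total_tokens / total_time
    macro = sum(tokens / seconds for tokens, seconds in samples) / len(samples)
    return micro, macro

# Example: micro_macro_throughput([(256, 6.1), (512, 11.8)])
```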
6.5.2. Evaluating SPECTRA in Different GPU Types
The following are the results from Table 3 of the original paper:
| GPU | Method | GSM8K Speedup | GSM8K τ | MTBench Speedup | MTBench τ |
|---|---|---|---|---|---|
| A40 | Lookahead | 1.49 | 1.93 | 1.53 | 2.07 |
| A40 | SPECTRA | 1.92 | 2.46 | 1.84 | 2.36 |
| A6000 | Lookahead | 1.48 | 1.92 | 1.52 | 2.06 |
| A6000 | SPECTRA | 1.92 | 2.46 | 1.84 | 2.36 |
| RTX8000 | Lookahead | 1.33 | 1.93 | 1.34 | 2.08 |
| RTX8000 | SPECTRA | 1.70 | 2.46 | 1.58 | 2.35 |
| RTX3090 | Lookahead | 1.32 | 1.92 | 1.30 | 2.06 |
| RTX3090 | SPECTRA | 1.84 | 2.46 | 1.74 | 2.36 |
Table 3: Hardware scalability of SPECTRA decoding on GSM8K and MTBench for various GPU architectures.
Table 3 shows SPECTRA's performance across different GPU types (A40, A6000, RTX8000, RTX3090).
- Hardware robustness: SPECTRA's relative accelerations remain consistent across GPUs with varying memory throughput and compute capabilities.
- Consistent lead: SPECTRA outperforms Lookahead (the next-best baseline) in every case.
- Narrowing gap on older GPUs: On older GPUs (RTX 3090, RTX 8000), the speedup gap between Lookahead and SPECTRA narrows slightly due to less efficient parallelism on these architectures, but SPECTRA still maintains its lead. This confirms SPECTRA's practicality across both data-center and consumer-grade GPUs.
6.5.3. Evaluating SPECTRA in Multi-GPU Environments
The following are the results from Table 6 of the original paper:
| GPU & Model Setting | Method | Mac-TP | Mic-TP | Speedup | τ |
|---|---|---|---|---|---|
| 1xH100 - Quantized Int8 | Autoregressive | 2.60 | 2.60 | 1.00 | 1.00 |
| 1xH100 - Quantized Int8 | SPECTRA | 6.32 | 6.22 | 2.43 | 2.51 |
| 2xH100 - FP16 | Autoregressive | 14.81 | 14.70 | 1.00 | 1.00 |
| 2xH100 - FP16 | SPECTRA | 29.62 | 28.91 | 2.00 | 2.52 |
| 4xH100 - FP16 | Autoregressive | 14.60 | 14.48 | 1.00 | 1.00 |
| 4xH100 - FP16 | SPECTRA | 29.67 | 28.89 | 2.03 | 2.52 |
| 8xH100 - FP16 | Autoregressive | 14.39 | 14.28 | 1.00 | 1.00 |
| 8xH100 - FP16 | SPECTRA | 29.27 | 28.55 | 2.03 | 2.52 |
Table 6: Results in multi-GPU Environments on GSM8K and MTBench using Llama-2-chat-70B.
Table 6 details SPECTRA's scalability in multi-GPU environments using LLaMA-2-70B.
- Robust scalability: SPECTRA achieves consistent speedups of 2.00x–2.03x across all multi-GPU configurations (2x, 4x, and 8x H100 with FP16 precision), with the compression ratio (τ) stable at 2.52. This indicates that partitioning model weights across multiple GPUs introduces minimal overhead, and the speculative verification process remains efficient despite inter-GPU communication.
- Quantized performance: Even in the quantized (Int8) single-GPU (1xH100) setting, SPECTRA provides a substantial 2.43x speedup over standard autoregressive decoding.
- Practicality for large-scale deployment: These results validate SPECTRA's practicality for large-scale deployments where memory constraints necessitate distributed inference. A sketch of the two loading modes appears below.
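For illustration, the two deployment modes in Table 6 roughly correspond to the following Hugging Face loading patterns. This is a generic sketch, not the authors' scripts, and in practice only one variant would be loaded at a time:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-chat-hf"

# Multi-GPU FP16: device_map="auto" shards the layers across all visible GPUs.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Single-GPU Int8: 8-bit quantization shrinks the weights to fit one card.
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```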
6.5.4. Verifying Generation Quality with SPECTRA Decoding
The following are the results from Table 7 of the original paper:
| Dataset | Method | ROUGE-1 | ROUGE-2 | ROUGE-L | Speedup | τ |
|---|---|---|---|---|---|---|
| CNN | Autoregressive | 9.77 | 0.39 | 7.20 | 1.00 | 1.00 |
| CNN | SPECTRA | 9.74 | 0.41 | 7.18 | 1.60 | 2.05 |
| XSUM | Autoregressive | 18.12 | 4.36 | 12.43 | 1.00 | 1.00 |
| XSUM | SPECTRA | 18.13 | 4.40 | 12.49 | 1.69 | 2.08 |
Table 7: Evaluation of SPECTRA decoding on CNN/DailyMail and XSum using a temperature of 1.0. ROUGE scores, speedups over autoregressive decoding, and compression ratios (τ) are reported for LLaMA-2-7B-Chat.
- Greedy decoding performance: Under greedy decoding, SPECTRA produces outputs identical to standard Hugging Face greedy search with FP32 precision. When transitioning to FP16, both Hugging Face's native search and SPECTRA exhibit a similar, small number of discrepancies (25 vs. 26 out of 160 conversational turns), confirming that SPECTRA stays within the numerical error margins typical of standard half-precision inference libraries.
- Sampling decoding performance: Table 7 shows that for sampling-based decoding (temperature=1.0) on the summarization datasets CNN/DailyMail and XSum, SPECTRA's ROUGE-1, ROUGE-2, and ROUGE-L scores are nearly identical to those of standard autoregressive sampling, while SPECTRA achieves notable speedups (1.60x on CNN/DailyMail, 1.69x on XSum) with compression ratios of 2.05 and 2.08, respectively.

These findings reaffirm that SPECTRA is a lossless acceleration method, preserving the original LLM's generation quality under both greedy and sampling-based decoding while providing substantial speed benefits. A sketch of the ROUGE check follows.
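A minimal sketch of the ROUGE comparison behind Table 7, using the Hugging Face `evaluate` package; the function and variable names are illustrative, not the paper's evaluation script:

```python
import evaluate  # Hugging Face "evaluate" package

rouge = evaluate.load("rouge")

def summarization_quality(predictions: list[str], references: list[str]) -> dict:
    """Score decoded summaries against references; near-identical scores for
    SPECTRA vs. autoregressive sampling indicate lossless acceleration."""
    scores = rouge.compute(predictions=predictions, references=references)
    return {k: scores[k] for k in ("rouge1", "rouge2", "rougeL")}

# Example: summarization_quality(spectra_outputs, xsum_references)
```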
7. Conclusion & Reflections
7.1. Conclusion Summary
In this work, the authors introduced SPECTRA, a novel and highly effective training-free framework designed to accelerate Large Language Model (LLM) inference. SPECTRA's core innovation lies in its dual approach: the SPECTRA-CORE module leverages the LLM's internal knowledge through multi-level N-gram storage and a bi-directional search to generate high-quality, dynamic-length guesses. Complementing this is the SPECTRA-RETRIEVAL module, which refines external knowledge sources by filtering corpora based on perplexity scores, ensuring that only the most relevant and high-quality external cues are used for speculation. This optimized integration of internal and external speculation within a single forward pass paradigm allows SPECTRA to achieve significant speedups of up to 4.08x across diverse tasks and LLM architectures (Llama 2, Llama 3, CodeLlama), from 7B to 70B parameters. Crucially, SPECTRA is a plug-and-play solution that requires no modifications or additional training of the original LLM, making it a lossless and practical method for accelerating LLM inference.
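To make the perplexity-refinement idea concrete, the following sketch scores candidate passages with a causal LM and keeps only the low-perplexity ones for the external datastore. The function names and the simple threshold rule are illustrative assumptions; the paper's exact scoring setup is not reproduced here:

```python
import torch

@torch.no_grad()
def passage_perplexity(model, tokenizer, text: str) -> float:
    """Score a candidate passage with a causal LM; lower perplexity suggests
    text the LLM finds predictable, hence better speculation material."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

def refine_corpus(model, tokenizer, passages: list[str],
                  threshold: float) -> list[str]:
    # Keep only low-perplexity passages as external speculation sources.
    return [p for p in passages
            if passage_perplexity(model, tokenizer, p) <= threshold]
```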
7.2. Limitations & Future Work
The authors acknowledge two primary limitations:
- Cost of building external datastores: While SPECTRA-CORE is self-contained, SPECTRA-RETRIEVAL relies on constructing and indexing a sizable external datastore. This process can be time-consuming and memory-intensive, especially for frequently updated data or memory-constrained environments. Although the investment yields significant speedups, it may not always be feasible or cost-effective.
- Limited evaluation scope: The experiments were conducted primarily on English-language benchmarks in conversational and coding tasks using LLaMA-based models. SPECTRA's generalizability to other languages (e.g., low-resource languages), specialized technical documents, or a wider range of model families remains to be fully assessed; these factors could affect the acceptance rate and overall speedup.

Future work is needed to:

- Assess SPECTRA's generality across diverse linguistic settings and model families.
- Optimize the cost and efficiency of building and maintaining external datastores for SPECTRA-RETRIEVAL.
7.3. Personal Insights & Critique
SPECTRA presents a significant step forward in LLM inference optimization, particularly for scenarios where training-based acceleration methods are impractical or undesirable. The training-free, plug-and-play nature is a major practical advantage, making it accessible to a broader range of users and deployment environments.
One key insight is the importance of carefully balancing the quality and quantity of speculative guesses. SPECTRA's bi-directional search and perplexity-based refinement exemplify this, ensuring that the LLM's single, expensive forward pass is maximally utilized on high-probability tokens. The observation that internal guesses from SPECTRA-CORE have higher acceptance rates than external guesses from SPECTRA-RETRIEVAL is insightful and motivates the prioritization strategy: while external knowledge can augment speculation, the LLM's own statistical patterns are often the most reliable source of immediate next-token predictions.
The ablative studies are particularly strong, clearly demonstrating the contribution of each module and component. The finding that a naive combination of Lookahead and REST performs poorly highlights the engineering effort required for effective integration, which SPECTRA successfully addresses.
Potential issues or areas for improvement/further research:
- Datastore maintenance: The cost of building and maintaining external datastores for SPECTRA-RETRIEVAL is a valid limitation. For rapidly evolving domains or highly personalized LLMs, continuously updating the datastore and recomputing perplexity could become a bottleneck; research into more dynamic, online, or adaptive datastore construction could enhance SPECTRA-RETRIEVAL's practicality.
- N-gram size and candidate-pool sensitivity: While the paper specifies default values for these hyperparameters, a deeper sensitivity analysis would give more guidance for different LLMs or tasks. How do these values trade off memory footprint against speedup?
- Beyond LLaMA-based architectures: Evaluating other Transformer variants (e.g., Mistral, Falcon, or GPT-series models where accessible) would further substantiate SPECTRA's claim of generalizability; different architectures may have internal mechanisms or biases that interact differently with SPECTRA's speculation strategies.
- Cross-lingual performance: The limited language coverage is a notable gap. The effectiveness of N-gram dictionaries and of perplexity as a filtering mechanism may vary significantly across languages, especially morphologically rich languages or those with different scripts.
- Scalability to extremely long contexts: While FlashAttention reduces attention overhead, speculative decoding can still face challenges with extremely long contexts. How does SPECTRA perform when the context window is very large and N-gram predictions become less reliable?

Overall, SPECTRA offers a robust, well-engineered solution to a critical bottleneck in LLM deployment, providing a practical, lossless speedup without model-specific retraining. Its insights into optimizing both internal and external speculation are valuable contributions to the field of efficient LLM inference.