
Accelerating Retrieval-Augmented Language Model Serving with Speculation

Published: 01/25/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The RaLMSpec framework accelerates retrieval-augmented language model serving through speculative retrieval and batched verification while preserving identical model outputs. Combined with prefetching, an optimal speculation stride scheduler, and asynchronous verification, it significantly improves iterative RaLM efficiency, achieving speed-ups of up to 2.39x for naive iterative RaLM serving and up to 7.59x for KNN-LM serving.

Abstract

Retrieval-augmented language models (RaLM) have demonstrated the potential to solve knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model. Instead of fine-tuning a fully parametric model, RaLM excels at its low-cost adaptation to the latest data and better source attribution mechanisms. Among various RaLM approaches, iterative RaLM delivers a better generation quality due to a more frequent interaction between the retriever and the language model. Despite the benefits, iterative RaLM usually encounters high overheads due to the frequent retrieval step. To this end, we propose RaLMSpec, a speculation-inspired framework that provides generic speed-up over iterative RaLM while preserving the same model outputs through speculative retrieval and batched verification. By further incorporating prefetching, optimal speculation stride scheduler, and asynchronous verification, RaLMSpec can automatically exploit the acceleration potential to the fullest. For naive iterative RaLM serving, extensive evaluations over three language models on four downstream QA datasets demonstrate that RaLMSpec can achieve a speed-up ratio of 1.75-2.39x, 1.04-1.39x, and 1.31-1.77x when the retriever is an exact dense retriever, approximate dense retriever, and sparse retriever respectively compared with the baseline. For KNN-LM serving, RaLMSpec can achieve a speed-up ratio up to 7.59x and 2.45x when the retriever is an exact dense retriever and approximate dense retriever, respectively, compared with the baseline.


In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of this paper is the acceleration of serving for retrieval-augmented language models (RaLM) using a speculation-inspired framework.

1.2. Authors

The authors are:

  • Zhihao Zhang (Carnegie Mellon University, School of Computer Science)

  • Alan Zhu (Carnegie Mellon University, School of Computer Science)

  • Lijie Yang (Carnegie Mellon University, School of Computer Science)

  • Yihua Xu (University of California, Berkeley)

  • Lanting Li (Carnegie Mellon University, School of Computer Science)

  • Phitchaya Mangpo Phothilimthana (Google DeepMind)

  • Zhihao Jia (Carnegie Mellon University, School of Computer Science)

    Their affiliations indicate a strong presence in computer science research, particularly from Carnegie Mellon University, with contributions from Google DeepMind and UC Berkeley. This suggests expertise in areas such as natural language processing, machine learning systems, and potentially computer architecture, given the paper's focus on system-level optimizations like speculation.

1.3. Journal/Conference

The paper is a preprint available on arXiv, published on January 25, 2024. As a preprint, it has not yet undergone formal peer review for a specific journal or conference. However, arXiv is a highly reputable platform for disseminating research in fields like AI, machine learning, and computer science, allowing for rapid sharing of new findings and often serving as a precursor to publications in top-tier conferences (e.g., NeurIPS, ICML, ACL) or journals.

1.4. Publication Year

2024

1.5. Abstract

Retrieval-augmented language models (RaLM) are effective for knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model. They offer advantages like low-cost adaptation to new data and improved source attribution. Iterative RaLM approaches, which involve frequent interaction between the retriever and the language model, achieve higher generation quality but incur significant overhead due to these repeated retrieval steps. To address this, the paper proposes RaLMSpec, a speculation-inspired framework designed to generically speed up iterative RaLM serving while guaranteeing identical model outputs. RaLMSpec achieves this through speculative retrieval and batched verification. It further incorporates prefetching, an optimal speculation stride scheduler, and asynchronous verification to maximize acceleration. Extensive evaluations on three language models across four QA datasets show that RaLMSpec can achieve speed-ups of 1.75-2.39x, 1.04-1.39x, and 1.31-1.77x for naive iterative RaLM serving when using exact dense, approximate dense, and sparse retrievers, respectively, compared to a baseline. For KNN-LM serving, speed-ups of up to 7.59x and 2.45x are observed with exact dense and approximate dense retrievers.

The official source link for this paper is https://arxiv.org/abs/2401.14021. The PDF link is https://arxiv.org/pdf/2401.14021v1.pdf. It is published as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the high inference overhead of iterative Retrieval-Augmented Language Models (RaLMs).

Retrieval-augmented language models represent a significant advancement in solving knowledge-intensive natural language processing (NLP) tasks. Instead of trying to encode all knowledge into a massive, fully parametric model, RaLMs combine a smaller parametric language model with an external, non-parametric knowledge base (like Wikipedia). This approach offers several benefits:

  • Low-cost adaptation: RaLMs can easily adapt to new or continuously updated data by simply updating the external knowledge base, rather than requiring expensive fine-tuning of the entire parametric model.

  • Better source attribution: By explicitly retrieving documents, RaLMs can provide references for their generated output, enhancing transparency and trustworthiness.

    Within RaLM approaches, iterative RaLM stands out for its superior generation quality. Unlike one-shot RaLM which retrieves documents only once at the beginning, iterative RaLM frequently interacts with the knowledge base throughout the generation process, allowing the language model to retrieve more relevant information as the context evolves. This frequent interaction, however, introduces a critical challenge: a high overhead due to the repeated and sequential retrieval steps. This significantly impacts the serving latency, making iterative RaLMs prohibitively slow for practical deployment. The paper highlights this as an inherent inefficiency, especially because each retrieval step is usually performed with a single query derived from the current context.

The paper's entry point and innovative idea is to adapt the concept of speculation, traditionally used in computer architecture and recently in large language model (LLM) decoding, to the retrieval component of iterative RaLMs. By speculatively retrieving documents and then verifying them in batches, the goal is to reduce the number of expensive knowledge base calls without compromising the final output quality.

2.2. Main Contributions / Findings

The paper makes several primary contributions to address the challenge of iterative RaLM serving overhead:

  1. RaLMSpec Framework for Generic Iterative RaLM Acceleration: The authors propose RaLMSpec, a novel framework that provides a generic speed-up for iterative RaLM approaches. A key guarantee of RaLMSpec is that it provably preserves the original model outputs, meaning the accelerated system produces exactly the same text as the unoptimized iterative RaLM. This is crucial for maintaining model quality and reliability.
  2. Caching-based Speculative Retrieval with Batched Verification: Leveraging the observed temporal and spatial locality of retrieved documents (i.e., the same or nearby documents are often retrieved repeatedly), RaLMSpec employs a local cache for speculative retrieval. Instead of querying the expensive external knowledge base for every step, it first speculatively retrieves from this fast local cache. To ensure correctness, these speculative retrievals are followed by a batched verification step where the corresponding queries are sent to the external knowledge base in a single, more efficient batch. If a mismatch is detected, the system rolls back and corrects the generation.
  3. Three Additional Latency Reduction Techniques: RaLMSpec integrates three further techniques to maximize performance:
    • Cache Prefetching: Enhances the local cache by updating it with multiple (e.g., top-k) retrieved documents during verification steps, increasing the likelihood of successful speculation.
    • Optimal Speculation Stride Scheduler ($OS^3$): Dynamically adjusts the number of speculative steps (speculation stride) between verification steps. This scheduler adaptively optimizes the trade-off between the overhead of potential mis-speculation and the latency saved by batched retrievals, which is crucial as optimal strides vary across different models and retrievers.
    • Asynchronous Verification: Exploits concurrency by allowing additional speculative steps to occur concurrently with a verification step, effectively hiding verification latency when speculation is successful.
  4. Extensive Empirical Validation: The paper empirically validates RaLMSpec across a wide range of scenarios:
    • Tasks: Naive iterative RaLM serving and KNN-LM serving.

    • Language Models: GPT2-medium, OPT-1.3B, LLaMA-2-7B, and LLaMA-2-13B.

    • Retrievers: Exact Dense Retriever (EDR), Approximate Dense Retriever (ADR), and Sparse Retriever (SR).

    • Datasets: Wiki-QA, Web Questions, Natural Questions, and Trivia QA for naive RaLM; WikiText-103 for KNN-LM.

      Key Findings:

  • For naive iterative RaLM serving, RaLMSpec+PSA (RaLMSpec with Prefetching, $OS^3$, and Asynchronous verification) achieves significant speed-ups:
    • 1.75-2.39x with an exact dense retriever.
    • 1.04-1.39x with an approximate dense retriever.
    • 1.31-1.77x with a sparse retriever.
  • For KNN-LM serving, where retrieval is performed for every token, the speed-ups are even more substantial:
    • Up to 7.59x with an exact dense retriever.
    • Up to 2.45x with an approximate dense retriever. These results demonstrate that RaLMSpec is a generic and effective framework for accelerating iterative RaLM serving across diverse configurations.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand RaLMSpec, a reader should be familiar with the following core concepts:

  • Language Models (LMs) & Large Language Models (LLMs):

    • Concept: LMs are statistical models that learn to predict the next word or token in a sequence based on the preceding context. LLMs are LMs with a vast number of parameters (billions or trillions) and are trained on massive text corpora, exhibiting emergent capabilities like few-shot learning and complex reasoning. Examples include GPT-3, LLaMA-2, and PaLM.
    • Autoregressive Nature: Many generative LMs, especially those used for text generation, are autoregressive. This means they generate text token-by-token, where each new token is predicted based on the previously generated tokens and the initial prompt. This sequential generation is a key bottleneck as it limits parallelism.
  • Retrieval-Augmented Language Models (RaLM):

    • Concept: RaLMs combine the strengths of parametric language models with non-parametric knowledge bases. Instead of relying solely on the knowledge encoded during training, RaLMs can retrieve relevant documents or information from an external database (e.g., Wikipedia) at inference time to augment their generation process. This helps them access up-to-date information, reduce factual errors, and provide citations.
    • Non-parametric Knowledge Base: This refers to an external, explicit database (e.g., a collection of documents, facts, or embeddings) that is separate from the language model's learned parameters. It can be easily updated or modified without retraining the LM.
    • Parametric Language Model: This is the traditional LM whose knowledge is implicitly stored in its weights (parameters) learned during training.
  • Types of RaLM Interaction:

    • One-shot RaLM: In this approach, retrieval is performed only once at the beginning of the generation process. The retrieved documents are then concatenated with the original query or prompt and fed into the language model to assist in generating the entire response. This is simpler but limited if information needs evolve during a long generation.
    • Iterative RaLM: This is the focus of the paper. Here, the retrieval step is performed multiple times throughout the generation process. As the language model generates new tokens, the context (original query + generated tokens) is used to form a new query to the knowledge base, retrieving more context-relevant documents. This allows for dynamic adaptation to information needs but incurs much higher retrieval overhead due to frequent calls to the knowledge base.
  • Retrievers: Mechanisms used to search and retrieve relevant documents from a knowledge base given a query.

    • Sparse Retrievers (e.g., BM25, TF-IDF): These methods rely on lexical matching of keywords. They represent documents and queries as bag-of-words vectors and score relevance based on term frequency and inverse document frequency.
      • BM25 (Best Match 25): An advanced term weighting scheme often used in information retrieval that ranks documents based on the appearance of query terms in each document, taking into account term frequency, inverse document frequency, document length, and query term saturation.
    • Dense Retrievers (e.g., DPR): These methods embed both queries and documents into a shared continuous vector space using neural networks (e.g., BERT-like encoders). Retrieval then becomes a nearest-neighbor search in this embedding space, where documents closest to the query embedding are considered most relevant.
      • Exact Dense Retriever (EDR): Performs an exhaustive or highly accurate nearest-neighbor search in the dense vector space. While precise, this can be computationally expensive, especially for very large knowledge bases.
      • Approximate Dense Retriever (ADR): Uses approximate nearest-neighbor (ANN) search algorithms (e.g., HNSW - Hierarchical Navigable Small World graphs, or FAISS) to find near neighbors more quickly, trading off some accuracy for significant speed.
    • FAISS (Facebook AI Similarity Search): A library for efficient similarity search and clustering of dense vectors. It provides algorithms that search in sets of vectors of any size, up to billions.
    • HNSW (Hierarchical Navigable Small World): A graph-based ANN algorithm that builds a multi-layer graph structure, allowing for fast searches by traversing through layers of increasing density. It's known for its good balance between search speed and accuracy.
  • K-Nearest Neighbor Language Models (KNN-LM):

    • Concept: A specific type of iterative RaLM that augments a standard language model by retrieving k-nearest neighbors (documents or context examples) from a datastore for every token prediction. The final next-token distribution is an interpolation between the base LM's prediction and a distribution derived from the retrieved neighbors' target tokens. This is highly retrieval-intensive.
    • Interpolation: Combining two probability distributions, typically by a weighted sum. For KNN-LM, it means blending the base LM's next-token probabilities with probabilities derived from the retrieved neighbors (a minimal sketch of dense retrieval plus this interpolation appears after this list).
  • Speculative Execution/Decoding:

    • Concept (General): Originating from computer architecture, speculative execution involves performing operations before they are confirmed to be necessary. If the speculation is correct, performance is boosted; if incorrect, the work is discarded, and the correct path is taken (rollback).
    • Speculative Decoding (for LLMs): A recent technique to speed up autoregressive LLM inference. A smaller, faster "draft" model speculatively generates a few tokens. These tokens are then verified in parallel by the larger, slower "main" model. If verified, the tokens are accepted; if not, the main model corrects them and regenerates. This aims to reduce the sequential bottleneck of autoregressive generation. RaLMSpec applies this concept not to token generation, but to the retrieval step.
  • Temporal and Spatial Locality (in systems):

    • Temporal Locality: The principle that if a particular piece of data is accessed, it is likely to be accessed again in the near future.
    • Spatial Locality: The principle that if a particular piece of data is accessed, data located nearby in memory (or in a database) is likely to be accessed soon.
    • RaLMSpec leverages these properties by caching recently retrieved documents locally, anticipating that they or nearby documents will be needed again.
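
To make the dense-retrieval and KNN-LM interpolation ideas above concrete, here is a minimal NumPy sketch (not the paper's implementation; the embedding sizes, datastore contents, and the interpolation weight `lam` are illustrative assumptions). It performs exact nearest-neighbor search by inner product, as an exact dense retriever would (the paper uses DPR with FAISS; an approximate retriever such as HNSW would replace the exhaustive search), and then interpolates a base LM distribution with a distribution built from the retrieved neighbors' recorded next tokens, in the spirit of KNN-LM.

```python
import numpy as np

def exact_dense_retrieve(query_emb, doc_embs, k):
    """Exhaustive inner-product search, i.e., what an exact dense retriever does."""
    scores = doc_embs @ query_emb                  # similarity of query to every key
    topk = np.argsort(-scores)[:k]                 # indices of the k best entries
    return topk, scores[topk]

def knn_lm_next_token_dist(p_lm, neighbor_scores, neighbor_targets,
                           vocab_size, lam=0.25, temp=1.0):
    """Interpolate the base LM distribution with a kNN distribution (KNN-LM style)."""
    w = np.exp(neighbor_scores / temp)             # softmax over neighbor similarities
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, neighbor_targets, w)          # scatter weights onto next tokens
    return lam * p_knn + (1.0 - lam) * p_lm        # weighted mixture of distributions

# Toy usage with random data (all dimensions are arbitrary).
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 64))             # datastore key embeddings
targets = rng.integers(0, 50, size=1000)           # next token recorded for each key
query = rng.normal(size=64)                        # current-context embedding
p_lm = np.full(50, 1 / 50)                         # base LM next-token distribution
idx, sc = exact_dense_retrieve(query, doc_embs, k=16)
p = knn_lm_next_token_dist(p_lm, sc, targets[idx], vocab_size=50)
assert abs(p.sum() - 1.0) < 1e-6                   # result is a valid distribution
```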

3.2. Previous Works

The paper positions RaLMSpec within the context of advancements in RaLM and efficient serving.

  • Early RaLM Pioneers:

    • Guu et al. (2020) proposed the foundational idea of Retrieval Augmented Language Model pre-training (REALM), which inspired many subsequent works. This showed the potential of augmenting LMs with external knowledge.
    • Lewis et al. (2020) introduced Retrieval-Augmented Generation (RAG), which demonstrated how pre-trained seq2seq models could be augmented with a DPR retriever to achieve strong performance on knowledge-intensive NLP tasks. Many existing RaLM approaches build on RAG's principles.
  • One-shot RaLM Approaches: These methods perform retrieval once before generation. Examples include works by Shi et al. (2023), Park et al. (2023), Wang et al. (2023a), Zhu et al. (2023), Rubin & Berant (2023), Wang et al. (2023b), Zhou et al. (2023). While effective, they are limited when the required information changes during long generation processes.

  • Naive Iterative RaLM Approaches: These approaches retrieve regularly during generation.

    • Ram et al. (2023), Lewis et al. (2020), Jiang et al. (2023), Borgeaud et al. (2022), Khattab et al. (2022) are examples where the LM constantly queries an external database with the latest context. These methods use the retrieved information either by directly concatenating it to the prompt or via intermediate layer cross-attention.
    • Khandelwal et al. (2019) introduced K-Nearest Neighbour Language Models (KNN-LM), which performs retrieval for every single token generated. This extreme iteration achieves high quality but makes inference prohibitively expensive, retrieving up to 1024 documents per token. Drozdov et al. (2022) further explore KNN-LM.
  • Efficient Iterative RaLM Serving (Related to RaLMSpec):

    • Alon et al. (2022) proposed a method for KNN-LM serving that reduces calls to the external knowledge base by using a pre-computed automaton state when full retrieval is unnecessary.
      • Differentiation: A crucial distinction is that Alon et al. (2022) is not guaranteed to preserve the same model output and thus might compromise generation quality. RaLMSpec explicitly guarantees output preservation, which is a key advantage.
  • Speculative Inference for LLMs:

    • Recent works have adapted the concept of speculative decoding to accelerate LLM serving (Leviathan et al., 2022; Stern et al., 2018; Chen et al., 2023; Miao et al., 2023; Xia et al.; Joao Gante, 2023; Yang et al., 2023). These methods use a faster draft model to propose tokens, which are then verified by the main model in parallel.
    • Differentiation: The authors emphasize that RaLMSpec is the first work to incorporate speculative retrieval in RaLM serving, and it is orthogonal to speculative inference techniques for LLMs. This means RaLMSpec can potentially be combined with LLM speculative decoding for even greater speed-ups.

3.3. Technological Evolution

The field has evolved from:

  1. Purely Parametric LMs: Relying solely on internal knowledge, often requiring expensive retraining for new information.
  2. One-shot RaLM: A first step towards external knowledge integration, but limited in dynamic contexts.
  3. Iterative RaLM: Enhanced quality through dynamic interaction, but at a high computational cost due to frequent, sequential retrieval.
  4. Optimized Iterative RaLM Serving (RaLMSpec): This paper represents a step towards making high-quality iterative RaLMs practical by directly addressing their primary bottleneck—retrieval overhead—through system-level optimizations that guarantee output correctness.

3.4. Differentiation Analysis

Compared to prior work, RaLMSpec offers several core innovations and differentiators:

  • Speculation Applied to Retrieval: While speculative decoding is known for LLMs, RaLMSpec uniquely applies the speculation paradigm to the retrieval component of RaLMs. This is a novel angle for optimization.
  • Guaranteed Output Preservation: Unlike some prior acceleration methods for RaLMs (e.g., Alon et al., 2022), RaLMSpec is designed to be lossless, meaning it produces identical outputs to the unoptimized iterative RaLM. This is a critical feature for applications where correctness and fidelity to the original model's behavior are paramount.
  • Generic Applicability: RaLMSpec is designed as a generic framework applicable to various iterative RaLM approaches, different language models, and diverse retriever types (sparse, exact dense, approximate dense).
  • Comprehensive Optimization Suite: Beyond the core speculative retrieval and batched verification, the integration of cache prefetching, optimal speculation stride scheduler, and asynchronous verification provides a holistic approach to maximize performance gains.
  • Leveraging Locality: The explicit exploitation of temporal and spatial locality of retrieved documents via a local cache is a key design choice that makes the speculative retrieval highly effective.

4. Methodology

4.1. Principles

The core idea behind RaLMSpec is to address the inefficiency of frequent, sequential retrieval steps in iterative RaLMs by leveraging speculation and batched processing. The primary intuition is rooted in the observation that, during the generation process of iterative RaLMs, the same or closely related documents from the knowledge base are often retrieved repeatedly. This phenomenon is analogous to temporal and spatial locality found in computer systems, where recently accessed data or data near previously accessed data is likely to be accessed again.

Building on this, RaLMSpec employs a caching-based mechanism for speculative retrieval. Instead of immediately querying the expensive external knowledge base for every retrieval request, it first attempts to retrieve documents from a fast, local cache. These "speculated" documents are then used by the language model to generate tokens. To ensure the correctness of the final output, after a certain number of speculative steps, a batched verification step is performed. In this step, the original queries corresponding to the speculative retrievals are sent to the external knowledge base in a single batch. This batched retrieval is significantly more efficient than issuing individual queries sequentially, exploiting parallelism inherent in modern retrieval systems. If any speculated document does not match the ground truth retrieved during verification, the system "rolls back" to the point of mismatch and regenerates the output using the correct documents, thereby preserving the original model's quality.

4.2. Core Methodology In-depth (Layer by Layer)

The RaLMSpec pipeline, as described in Algorithm 1, meticulously orchestrates speculative retrieval, language model generation, and batched verification to achieve speed-up while ensuring correctness.

Algorithm 1 RaLMSpec Pipeline

1: Input: input tokens X = {x_0, x_1, ..., x_{t-1}}, external corpus C, language model f(·)
2: Output: RaLM generated outputs
3: Initialize local cache Q = {}, speculation stride s, model generation stride k
4: q = encode(X); Q.insert(C.retrieve(q))  ▹ cache prefetching
5: while EOS not in X do
6:   for i = 1 to s do
7:     q_i = encode(X); d̂_i = Q.retrieve(q_i)  ▹ speculative retrieval
8:     X̂_i = f(X, d̂_i, k)  ▹ model generation step that generates k new tokens
9:     X = [X, X̂_i]
10:  end for
11:  d_1, ..., d_s = C.retrieve(q_1, ..., q_s)  ▹ batched verification
12:  m = argmin_i (d̂_i ≠ d_i)
13:  if m ≤ s then  ▹ correction if needed
14:    roll X back to the m-th speculation step
15:    X̂ = f(X, d_m, k)
16:    X = [X, X̂]
17:  end if
18: end while

Let's break down each step:

  1. Initialization (Line 3):

    • Input tokens X = {x_0, x_1, ..., x_{t-1}}: The initial prompt or query provided to the RaLM.
    • external corpus C: The large, non-parametric knowledge base (e.g., Wikipedia) from which documents are retrieved.
    • language model f(.): The parametric language model (e.g., GPT2, OPT, LLaMA-2) responsible for generating text.
    • local cache Q = {}: An empty, fast, request-specific cache that will store recently retrieved documents for speculative access.
    • speculation stride s: A hyperparameter determining the number of consecutive speculative retrieval and generation steps before a verification step is triggered.
    • model generation stride k: The number of new tokens generated by the language model in a single generation step.
  2. Initial Cache Prefetching (Line 4):

    • q = encode(X): The initial input tokens X are encoded into a query embedding q. This encode function typically transforms the textual input into a vector representation suitable for the retriever.
    • C.retrieve(q): The first actual retrieval from the external knowledge base C is performed using the initial query q. This ensures the local cache is not empty at the very beginning and also serves as an initial prefetching step.
    • Q.insert(C.retrieve(q)): The retrieved documents are inserted into the local cache Q. This populates the cache with relevant documents immediately.
  3. Main Generation Loop (Lines 5-18):

    • while EOS not in X do: The generation process continues iteratively until an End Of Sequence (EOS) token is generated, indicating the completion of the response.
  4. Speculative Retrieval and Generation Loop (Lines 6-10):

    • for i = 1 to s do: This loop executes s (the speculation stride) consecutive speculative steps.
    • q_i = encode(X): In each step i, the current context X (including previously generated tokens) is encoded into a new query q_i. This q_i represents the context-dependent query.
    • d̂_i = Q.retrieve(q_i): This is the speculative retrieval step. Instead of querying the expensive external corpus C, RaLMSpec attempts to retrieve documents d̂_i from the fast local cache Q using the query q_i. The local cache acts like a mini-retriever, using the same scoring metric as the original retriever but on a much smaller set of documents.
    • X̂_i = f(X, d̂_i, k): The language model f(.) then generates k new tokens (X̂_i) using the current context X and the speculatively retrieved documents d̂_i. This is the model generation step.
    • X = [X, X̂_i]: The newly generated tokens X̂_i are appended to the overall generated sequence X.
  5. Batched Verification (Line 11):

    • d_1, ..., d_s = C.retrieve(q_1, ..., q_s): After s speculative steps, a batched verification is performed. All s context-dependent queries (q_1 through q_s) that were used during the speculative phase are now sent simultaneously (as a batch) to the external corpus C. This exploits the parallelism capabilities of the retrieval system, making it much faster than s sequential retrievals. The ground truth documents d_1, ..., d_s are retrieved.
  6. Mismatch Detection (Line 12):

    • m = argmin_i (d̂_i ≠ d_i): This step identifies the first position m (from 1 to s) where a speculated document d̂_i mismatches the ground truth document d_i retrieved from the knowledge base during verification. If all documents match, m would be greater than s.
  7. Correction if Needed (Lines 13-17):

    • if m <= s then: If a mismatch is detected (i.e., m is within the stride s), a correction mechanism is triggered to ensure output correctness.
    • Roll X back to the m-th speculation step: The generated tokens X are rolled back to the state just before the m-th speculative step where the first mismatch occurred. All tokens generated based on incorrect speculation from this point onwards are discarded.
    • X̂ = f(X, d_m, k): The language model f(.) then regenerates k tokens (X̂) using the correct ground truth document d_m (from the batched verification) and the rolled-back context X.
    • X = [X, X̂]: The correctly generated tokens are appended. The process continues from this corrected state.
    • Local Cache Update: While not explicitly shown in Algorithm 1, the text describes that the local cache Q is updated with the documents d_1, ..., d_s retrieved during the verification step. This can be either top-1 (only the most relevant document) or top-k (multiple top documents), where the top-k update is referred to as prefetching (Figure 2). This keeps the local cache up-to-date and improves future speculation success rates.
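
The control flow above can be summarized in a short Python sketch. This is a simplified illustration of Algorithm 1, not the authors' code: encode, lm_generate, corpus_retrieve_batch, and the cache object are placeholder callables, and the cache update after verification is reduced to a top-1 insert.

```python
def ralmspec_generate(prompt_tokens, encode, cache, corpus_retrieve_batch,
                      lm_generate, s=3, k=4, eos=0, max_len=128):
    """Speculative retrieval with batched verification (sketch of Algorithm 1)."""
    X = list(prompt_tokens)
    # Initial retrieval also warms up the local cache (cache prefetching).
    cache.insert(corpus_retrieve_batch([encode(X)])[0])

    while eos not in X and len(X) < max_len:
        queries, spec_docs, checkpoints = [], [], []
        for _ in range(s):                      # s consecutive speculative steps
            checkpoints.append(len(X))          # remember where to roll back to
            q = encode(X)
            d_hat = cache.retrieve(q)           # speculative retrieval from local cache
            X += lm_generate(X, d_hat, k)       # generate k tokens with speculated doc
            queries.append(q)
            spec_docs.append(d_hat)

        truth = corpus_retrieve_batch(queries)  # one batched verification call
        for i, (d_hat, d) in enumerate(zip(spec_docs, truth)):
            cache.insert(d)                     # keep the cache up to date (top-1 here)
            if d_hat != d:                      # first mismatch: roll back and correct
                X = X[:checkpoints[i]]
                X += lm_generate(X, d, k)
                break
    return X
```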

Figure 1 illustrates this workflow:

  • Figure 1(a) (Existing Iterative RaLM): Shows sequential LM Generation and Retrieval steps. Queries q_0, q_1, q_2 are issued sequentially, leading to high overhead.

  • Figure 1(b) (RaLMSpec Overview): Demonstrates speculative retrieval steps (①, ③, ⑤) from the local cache, followed by a batched verification step (⑥) using the external knowledge base.

  • Figure 1(c) (Timeline Comparison): Highlights how RaLMSpec significantly reduces latency by replacing multiple sequential retrievals with faster speculative retrievals and one efficient batched verification.

Original caption: Figure 1: {q_0, q_1, q_2} denotes context-dependent query embeddings and A, B, C are document entries. Figure 1(a) shows the workflow of existing iterative RaLM, which suffers from high retrieval overhead. Figure 1(b) shows an overview of RaLMSpec, which enables faster speculative retrieval steps (①, ③, ⑤) followed by a batched verification step (⑥) to guarantee correctness. Consequently, RaLMSpec achieves a lower latency while preserving model quality as shown in Figure 1(c).

Speculative Retrieval Details (Figure 2): A key aspect for the effectiveness of speculative retrieval is that for most dense and sparse retrievers, the relative ranking of documents is preserved. This means if the top-ranked document in the large external knowledge base is present in the local cache for a given query, it will also be ranked at the top when retrieving from the local cache using the same metric. This property, combined with temporal and spatial locality (i.e., relevant documents often reappear or are near previously retrieved ones), significantly boosts the speculation success rate.

Original caption: Figure 2: For speculative retrieval, we maintain a local cache for each request and use the same scoring metric as the original retriever to rank the entries within the local cache for a given query. In the verification step, we populate the local cache with either the top-1 or top-k retrieved documents from the knowledge base, where the latter is referred to as prefetching.

Batched Verification and Prefetching: The efficiency gain from batched retrieval is significant because retrieving nn queries in parallel is generally much faster than nn sequential retrievals, as empirically shown in Appendix A.1 (Figure 6). During verification, the local cache is populated not just with the top-1 ground truth document, but potentially with top-k documents. This top-k cache update is referred to as prefetching, aiming to proactively fetch more relevant entries into the local cache to further increase the speculation success rate in subsequent steps.
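
A minimal sketch of such a request-level cache is shown below, assuming inner-product scoring so that cached entries are ranked with the same metric as the dense retriever; prefetch_k controls whether verification inserts only the top-1 document or the top-k (prefetching). The class and the corpus.batch_search call are illustrative assumptions, not the paper's API.

```python
import numpy as np

class LocalCache:
    """Per-request cache that speculates by re-ranking its cached documents."""
    def __init__(self):
        self.doc_ids, self.doc_embs = [], []

    def insert(self, docs):
        """docs: list of (doc_id, embedding) pairs, e.g. top-k results from verification."""
        for doc_id, emb in docs:
            if doc_id not in self.doc_ids:          # avoid duplicate cache entries
                self.doc_ids.append(doc_id)
                self.doc_embs.append(np.asarray(emb))

    def retrieve(self, query_emb):
        """Rank cached entries with the retriever's own metric (inner product here)."""
        scores = np.stack(self.doc_embs) @ query_emb
        return self.doc_ids[int(np.argmax(scores))]

def verify_and_update(cache, corpus, queries, prefetch_k=20):
    """Batched verification; populating the cache with top-k results is prefetching."""
    results = corpus.batch_search(queries, k=prefetch_k)   # hypothetical corpus API
    ground_truth = [hits[0][0] for hits in results]        # top-1 document id per query
    for hits in results:
        cache.insert(hits)                                  # top-k update = prefetching
    return ground_truth
```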

Asynchronous Verification (Figure 3): To further reduce latency, RaLMSpec can employ asynchronous verification. Instead of the system stalling during the verification step, an additional speculation step can be launched concurrently (asynchronously) with the verification of the previous steps.

  • If the verification succeeds, the model can continue generating based on the asynchronously speculated tokens, effectively hiding the verification latency.

  • If the verification fails, the model will roll back to the mismatch point, discard the asynchronously generated tokens, and regenerate using the correct information. This technique is particularly beneficial when the verification latency is shorter than the language model's decoding latency.
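
This overlap can be expressed with a background thread, as in the hedged sketch below (helper names are illustrative): the batched verification is submitted to a worker thread while the main thread performs one more speculative step, and the extra tokens are kept only if verification finds no mismatch.

```python
from concurrent.futures import ThreadPoolExecutor

def speculate_with_async_verification(X, queries, spec_docs, checkpoints,
                                      corpus_retrieve_batch, speculate_one_step):
    """Overlap one extra speculation step with verification of the prior steps."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Launch the batched verification in the background.
        future = pool.submit(corpus_retrieve_batch, queries)
        # Meanwhile, keep generating with one additional speculative step.
        X = speculate_one_step(X)
        truth = future.result()                 # wait for verification to finish

    for i, (d_hat, d) in enumerate(zip(spec_docs, truth)):
        if d_hat != d:
            # Mismatch: discard the extra speculation too and roll back to step i;
            # the caller regenerates from the ground-truth document d.
            return X[:checkpoints[i]], d, i
    # All speculated documents were correct: keep the extra speculation's tokens.
    return X, None, None
```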

Original caption: Figure 3: Asynchronous verification obtains latency saving by hiding the verification latency behind a valid speculation step. In case a mismatch is detected between the speculated document and ground truth document, the language model will regenerate outputs using the ground truth document.

4.3. Optimal Speculation Stride Scheduler ($OS^3$)

The speculation stride s is a crucial hyperparameter that directly impacts the trade-off between speculation overhead (the cost of incorrect speculation and rollback) and retrieval saving (the benefit of batched retrieval). A large s might lead to high overhead if speculation fails early, while a small s might not fully exploit the benefits of speculation. The optimal s varies with the language model, retriever type, and speculation accuracy.

Instead of manual tuning, RaLMSpec introduces the Optimal Speculation Stride Scheduler ($OS^3$) to adaptively determine the best s. The goal is to maximize the expected number of documents verified successfully per unit time.

Let:

  • a: Latency of a single speculation step (speculative retrieval + language model decoding).

  • b: Latency of a single verification step.

  • d_i: Ground truth document retrieved from the corpus at step i.

  • $\hat{d}_i$: Speculated document retrieved from the local cache at step i.

  • $\gamma(X)$: Speculation accuracy, defined as the probability that $d_i = \hat{d}_i$ given the current context X, i.e., $P(d_i = \hat{d}_i \mid X)$. This is assumed to be constant for all $i \in [s]$.

    Expected Number of Matched Documents: If speculation succeeds for the first $i-1$ steps and first fails at step $i$, then $i$ documents are effectively verified (the $i-1$ correct speculations plus the corrected one); if all $s$ speculations succeed, $s$ documents are verified. The expected number of verified documents within a stride $s$ is therefore: $ \mathbb{E}[\text{# of verified documents} \mid X, s] = \sum_{i=0}^{s-1} \gamma(X)^i = \frac{1 - \gamma(X)^s}{1 - \gamma(X)} $

Objective Function for Synchronous Verification: For synchronous verification, the total latency for s speculation steps is $sa + b$. The objective is to maximize the expected number of verified documents per unit time: $ \text{Maximize } \frac{\mathbb{E}[\text{# of verified documents} \mid X, s]}{\text{Latency}} = \frac{\frac{1 - \gamma(X)^s}{1 - \gamma(X)}}{sa + b} = \frac{1 - \gamma(X)^s}{(1 - \gamma(X))(sa + b)} $

Objective Function for Asynchronous Verification: For asynchronous verification, the expected latency calculation is more complex:

  • With probability $\gamma(X)^s$ (all s speculations succeed), the latency is $(s-1)a + \max(a, b)$. This is because the last speculation step's LM decoding can overlap with the verification, and we take the maximum of their latencies.
  • With probability $1 - \gamma(X)^s$ (at least one mismatch occurs), there is no gain from asynchronous verification, and the latency reverts to the synchronous case: $sa + b$. Therefore, the expected latency for asynchronous verification is: $ \text{Expected Latency} = \gamma(X)^s ((s-1)a + \max(a, b)) + (1 - \gamma(X)^s) (sa + b) $ The objective function to maximize becomes: $ \text{Maximize } \frac{1 - \gamma(X)^s}{(1 - \gamma(X))[\gamma(X)^s ((s-1)a + \max(a, b)) + (1 - \gamma(X)^s) (sa + b)]} $ $OS^3$ continuously solves for the optimal s by estimating a, b, and $\gamma(X)$.

Parameter Estimation for $OS^3$:

  • a and b (Latencies): These are estimated by profiling the actual running times of the most recent speculation steps (a) and verification steps (b). The paper notes that for EDR and SR, batched retrieval latency is nearly constant for small batch sizes, while for ADR, it scales linearly but with a significant intercept, still making batched retrieval efficient. The effect of batch size on latency per query is detailed in Appendix A.1 (Figure 6).
  • $\gamma(X)$ (Speculation Accuracy): This is estimated using maximum likelihood estimation (MLE) over a window of the w most recent verification steps, so that the estimate reflects recent performance while remaining stable. Let s(t) be the speculation stride (also the batch size) in the t-th most recent verification step, and M(s(t), X) the corresponding number of matched documents in that step. The estimate $\hat{\gamma}(X)$ is: $ \hat{\gamma}(X) = \frac{\sum_t M(s(t), X)}{\sum_t M(s(t), X) + \sum_t \mathbb{1}(M(s(t), X) < s(t))} $ where $\mathbb{1}(\cdot)$ is the indicator function, which is 1 if the condition is true (i.e., a mismatch occurred in that verification step) and 0 otherwise. The numerator sums the matched documents across the window, and the denominator adds the number of verification steps that contained a mismatch. To prevent overly optimistic estimates and division-by-zero errors when $\hat{\gamma}$ approaches 1, an upper bound $\gamma_{max}$ is set and $\hat{\gamma}$ is truncated accordingly.
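
A compact sketch of the $OS^3$ decision rule is given below, assuming the profiled latencies a and b and the windowed estimate of $\gamma$ described above; the candidate stride range and the bookkeeping format of the verification history are illustrative assumptions rather than the paper's exact procedure.

```python
def estimate_gamma(history, gamma_max=0.6):
    """MLE-style estimate of speculation accuracy over a window of verification steps.

    history: list of (stride, matched) pairs for the w most recent verifications."""
    matched = sum(m for _, m in history)
    mismatches = sum(1 for s, m in history if m < s)
    if matched + mismatches == 0:
        return gamma_max
    return min(matched / (matched + mismatches), gamma_max)   # truncate at gamma_max

def expected_docs_per_second(s, a, b, gamma, asynchronous=False):
    """OS^3 objective: expected verified documents per unit time for stride s."""
    expected_docs = (1 - gamma ** s) / (1 - gamma)
    if asynchronous:
        latency = gamma ** s * ((s - 1) * a + max(a, b)) + (1 - gamma ** s) * (s * a + b)
    else:
        latency = s * a + b
    return expected_docs / latency

def optimal_stride(a, b, history, s_candidates=range(1, 17), asynchronous=False):
    """Pick the stride that maximizes the expected verification throughput."""
    gamma = estimate_gamma(history)
    return max(s_candidates,
               key=lambda s: expected_docs_per_second(s, a, b, gamma, asynchronous))

# Example: slow retrieval (b >> a) and decent accuracy favor a larger stride.
print(optimal_stride(a=0.05, b=0.8, history=[(3, 3), (3, 2), (4, 4), (4, 3), (3, 3)]))
```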

5. Experimental Setup

5.1. Datasets

For evaluating RaLMSpec on knowledge-intensive open-domain question-answering (QA) tasks, the following datasets were used:

  • Wiki-QA (Yang et al., 2015): A QA dataset where questions are typically related to factual information found on Wikipedia. The answers are short phrases or sentences extracted from Wikipedia articles.

  • Web Questions (Berant et al., 2013): A QA dataset consisting of questions posed by real users to a Google search engine. Answers are usually short entities or facts from Freebase.

  • Natural Questions (Kwiatkowski et al., 2019): A large-scale QA dataset composed of real Google search queries and answers derived from Wikipedia. It includes both short and long answers.

  • Trivia QA (Joshi et al., 2017): A challenging QA dataset that contains questions from trivia websites, with supporting evidence from Wikipedia and other web sources. It tests deep understanding and retrieval capabilities.

    For all these QA tasks, the Wikipedia corpus (Chen et al., 2017) was used as the external knowledge base. This corpus is widely used in open-domain QA and is effective for validating methods that rely on external knowledge retrieval.

For KNN-LM evaluation, the WikiText-103 dataset (Merity et al., 2016) was used. This is a large corpus of text extracted from the set of "Good" and "Featured" articles on Wikipedia. It is specifically designed for language modeling tasks and was the same dataset used in the original KNN-LM work (Khandelwal et al., 2019), making it suitable for direct comparison.

The choice of these datasets is appropriate because they are standard benchmarks for knowledge-intensive NLP tasks, specifically open-domain QA and language modeling. They represent diverse question types and data characteristics, allowing for comprehensive validation of RaLMSpec's performance across different scenarios.

5.2. Evaluation Metrics

The paper primarily focuses on speed-up ratio as its evaluation metric, which quantifies the reduction in latency (or increase in throughput) compared to a baseline. While the paper doesn't explicitly provide a formula for speed-up ratio, it is standardly calculated as:

  1. Speed-up Ratio

    • Conceptual Definition: The speed-up ratio measures how much faster a task can be completed using an optimized method compared to a baseline method. It is a direct indicator of performance improvement in terms of execution time.
    • Mathematical Formula: $ \text{Speed-up Ratio} = \frac{\text{Latency}_{\text{Baseline}}}{\text{Latency}_{\text{Optimized}}} $ or equivalently, if measuring throughput: $ \text{Speed-up Ratio} = \frac{\text{Throughput}_{\text{Optimized}}}{\text{Throughput}_{\text{Baseline}}} $
    • Symbol Explanation:
      • $\text{Latency}_{\text{Baseline}}$: The time taken to complete a task using the original, unoptimized baseline method.

      • $\text{Latency}_{\text{Optimized}}$: The time taken to complete the same task using the proposed, optimized method.

      • $\text{Throughput}_{\text{Optimized}}$: The amount of work done per unit time by the optimized method.

      • $\text{Throughput}_{\text{Baseline}}$: The amount of work done per unit time by the baseline method.

        Additionally, the paper mentions that KNN-LM aims to improve the perplexity of the base language model. While RaLMSpec guarantees to preserve output quality, perplexity is a common metric for language models that the original KNN-LM work optimizes.

  2. Perplexity (PPL)

    • Conceptual Definition: Perplexity is a measure of how well a probability model predicts a sample. In natural language processing, it's often used to evaluate language models. A lower perplexity indicates a better model, as it means the model is less "perplexed" (more confident and accurate) when predicting the next word in a sequence. It is the exponentiated average negative log-likelihood of a sequence.
    • Mathematical Formula: For a test set $W = (w_1, w_2, \ldots, w_N)$, the perplexity (PPL) is defined as: $ \text{PPL}(W) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}} $ This can be rewritten using the product rule of probability and logarithms: $ \text{PPL}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, \ldots, w_{i-1})\right) $
    • Symbol Explanation:
      • W: A sequence of N words (tokens) in the test set.

      • $w_i$: The i-th word in the sequence.

      • N: The total number of words in the sequence.

      • $P(w_1, w_2, \ldots, w_N)$: The joint probability of the entire sequence, according to the language model.

      • $P(w_i | w_1, \ldots, w_{i-1})$: The probability of the i-th word, conditioned on all preceding words, as predicted by the language model.

      • $\log$: The natural logarithm.

      • $\exp(\cdot)$: The exponential function (e raised to the power of its argument).

        The main body of the RaLMSpec paper focuses solely on latency reduction and speed-up, explicitly stating that it "preserves the same model outputs." This implies that metrics like perplexity or factual accuracy would be identical to the baseline, hence the focus on efficiency metrics.
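
For reference, both metrics can be computed with a few lines of Python; the numbers below are toy values, not results from the paper.

```python
import math

def speedup_ratio(latency_baseline, latency_optimized):
    """Speed-up = baseline latency / optimized latency."""
    return latency_baseline / latency_optimized

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Toy example: a 4.8 s baseline vs. a 2.0 s optimized run gives a 2.4x speed-up,
# and a model assigning probability 0.25 to every token has perplexity 4.
print(speedup_ratio(4.8, 2.0))              # 2.4
print(perplexity([math.log(0.25)] * 10))    # 4.0
```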

5.3. Baselines

The paper compares RaLMSpec against two primary baselines, tailored to the specific serving tasks:

  1. Naive Iterative RaLM Serving (RaLMSeq):

    • Description: This baseline directly implements the standard iterative RaLM approach, following the design described by Ram et al. (2023). In this setup, a retrieval operation is triggered every four tokens generated by the language model. The most recently retrieved document chunk is then prepended to the prompt, replacing any previously retrieved documents, to inform the next generation step.
    • Representativeness: This is a representative baseline for typical iterative RaLM deployments, demonstrating the direct overhead incurred by frequent, sequential retrieval steps without any optimization.
  2. KNN-LM Serving Baseline:

    • Description: For KNN-LM serving, the baseline is the direct implementation from the original KNN-LM work by Khandelwal et al. (2019). In this highly retrieval-intensive scenario, a retrieval operation is performed for every single token generated by the language model.

    • Representativeness: This baseline represents the unoptimized, highest-overhead version of KNN-LM, which is known for its strong performance but prohibitive inference costs due to its granular retrieval pattern.

      Retrievers Used for Baselines and RaLMSpec: To demonstrate generality and ensure fair comparison, both RaLMSpec and the baselines were tested with various retriever types:

  • Exact Dense Retriever (EDR): Dense Passage Retriever (DPR) (Karpukhin et al., 2020), which is highly accurate but computationally expensive.

  • Approximate Dense Retriever (ADR): DPR-HNSW (Malkov & Yashunin, 2018), a faster but less accurate approximate version of DPR using Hierarchical Navigable Small World graphs.

  • Sparse Retriever (SR): BM25 (Robertson et al., 2009), a classic lexical matching retriever.

    The implementations for all retrievers were based on Pyserini (Lin et al., 2021), with dense retrievers leveraging the FAISS library (Johnson et al., 2019) for efficient similarity search.

Implementation Details for Experiments:

  • Maximum Lengths: Maximum input prompt length was set to 512 tokens, and maximum generation length to 128 tokens. For naive iterative RaLM, the retrieved document chunk length was 256.

  • Speculation Stride (s): When $OS^3$ (Optimal Speculation Stride Scheduler) was disabled, RaLMSpec used a constant stride of s = 3. When $OS^3$ was enabled, it initialized s = 1 and adapted dynamically.

  • $OS^3$ Parameters: Window size w = 5 and $\gamma_{max} = 0.6$ for estimating speculation accuracy.

  • Prefetching: Default prefetch size of 20. Ablation studies also used 256.

  • Asynchronous Verification: Due to Python's Global Interpreter Lock (GIL), the latency of asynchronous verification was simulated to reflect its ideal running time without additional overhead. The authors clarify that it is not a primary speed-up factor.

  • Hardware: All experiments (except the asynchronous verification simulation) were run on a VM.GPU.A10 instance on Oracle Cloud, equipped with one A10 GPU and 15 CPUs.

  • Evaluation Protocol: For the QA datasets, 100 randomly selected questions were used, and results were averaged over five independent runs to report the mean and standard deviation.

6. Results & Analysis

6.1. Core Results Analysis

The experiments extensively evaluate RaLMSpec across various language models, retrievers, and datasets, demonstrating its effectiveness in reducing serving latency for iterative RaLMs.

Naive Iterative RaLM Serving (Figure 4): The main results for naive iterative RaLM serving are presented in Figure 4, comparing RaLMSeq (baseline), RaLMSpec (base version), and RaLMSpec+PSA (RaLMSpec with Prefetching, Optimal Speculation Stride Scheduler, and Asynchronous verification). The overall latency is decomposed into language model generation latency (G) and retrieval latency (R).

Original caption: Figure 4: Latency comparison between RaLMSeq, RaLMSpec, and RaLMSpec+PSA on GPT2-medium, OPT-1.3B, and LLaMA-2-7B over four QA datasets with three different types of retrievers, where EDR, ADR, SR stand for exact dense retriever, approximate dense retriever, and sparse retriever respectively. We decompose the overall latency into the language model generation latency (G) and retrieval latency (R) to demonstrate the trade-off.

  • Exact Dense Retriever (EDR):
    • RaLMSpec+PSA achieves the most significant speed-ups: 2.39x for GPT2, 2.33x for OPT, and 1.75x for LLaMA-2 compared to RaLMSeq.
    • Analysis: EDR typically has high retrieval latency (R). Since RaLMSpec primarily optimizes the retrieval component, it can significantly reduce this bottleneck. The portion of the latency representing retrieval (R) is substantially reduced in RaLMSpec+PSA compared to RaLMSeq, while generation latency (G) remains largely similar. This confirms that RaLMSpec is most impactful when retrieval is the dominant bottleneck.
  • Approximate Dense Retriever (ADR):
    • RaLMSpec+PSA achieves speed-ups of 1.05x for GPT2, 1.39x for OPT, and 1.04x for LLaMA-2.
    • Analysis: The speed-ups are more modest than with EDR. Because ADR is inherently faster than EDR, the retrieval latency (R) is a smaller fraction of the overall end-to-end latency. Thus, even if RaLMSpec reduces R significantly, the overall speed-up is limited by the relatively larger proportion of LM generation latency (G). The naive RaLMSpec (without $OS^3$ or prefetching) sometimes performs worse than the baseline, indicating that a constant, non-optimal speculation stride can lead to speculation overhead that outweighs the gains when retrieval is not the primary bottleneck. This underscores the importance of $OS^3$.
  • Sparse Retriever (SR):
    • RaLMSpec+PSA achieves speed-ups of 1.53x for GPT2, 1.77x for OPT, and 1.31x for LLaMA-2.
    • Analysis: Similar to ADR, SRs are often faster than EDRs. The speed-ups fall between those of EDR and ADR, indicating good performance, but again the overall gain is bounded by the proportion of retrieval latency in the total latency. The observation about the naive RaLMSpec's performance (sometimes worse than the baseline) also applies here, reinforcing the need for adaptive stride scheduling.

Overall Analysis: RaLMSpec+PSA (with all components enabled) consistently achieves the best performance across all scenarios. The results highlight that RaLMSpec's speed-up is intrinsically tied to the ratio of retrieval latency to total latency: the more dominant the retrieval latency, the greater the potential for RaLMSpec to accelerate the process. The adaptive Optimal Speculation Stride Scheduler ($OS^3$) is crucial, especially for faster retrievers (ADR, SR), to prevent excessive speculation overhead from early mismatches.

6.2. Data Presentation (Tables)

6.2.1. Ablation Study for Naive Iterative RaLM Serving (Table 1)

The following are the results from Table 1 of the original paper, where (*) marks the best and (**) the second-best result in each column:

| Retriever | Method | GPT2 | OPT | LLaMA-2 |
|---|---|---|---|---|
| EDR | RaLMSpec | 2.04× | 1.76× | 1.70× |
| EDR | RaLMSpec+P | 2.10× | 2.16× (**) | 1.75× (**) |
| EDR | RaLMSpec+S | 2.26× (**) | 2.15× | 1.69× |
| EDR | RaLMSpec+A | 2.03× | 1.74× | 1.74× |
| EDR | RaLMSpec+PSA | 2.39× (*) | 2.32× (*) | 1.75× (*) |
| ADR | RaLMSpec | 0.62× | 0.61× | 0.58× |
| ADR | RaLMSpec+P | 0.59× | 0.76× | 0.58× |
| ADR | RaLMSpec+S | 0.92× (**) | 1.17× (**) | 1.01× (**) |
| ADR | RaLMSpec+A | 0.66× | 0.46× | 0.55× |
| ADR | RaLMSpec+PSA | 1.05× (*) | 1.39× (*) | 1.04× (*) |
| SR | RaLMSpec | 1.34× | 1.18× | 0.97× |
| SR | RaLMSpec+P | 1.39× | 1.42× | 0.98× |
| SR | RaLMSpec+S | 1.32× | 1.52× (**) | 1.05× (**) |
| SR | RaLMSpec+A | 1.41× (**) | 1.27× | 1.01× |
| SR | RaLMSpec+PSA | 1.53× (*) | 1.77× (*) | 1.31× (*) |

Analysis of Table 1: This table shows the average speed-up ratio of different combinations of RaLMSpec components compared to the RaLMSeq baseline, averaged over the four QA datasets.

  • RaLMSpec (base version, with fixed s = 3): Performs well with EDR (1.70x to 2.04x), but poorly with ADR (0.58x to 0.62x, i.e., a slowdown) and only marginally with SR (0.97x to 1.34x). This highlights how a constant speculation stride is suboptimal across different retriever types.
  • $OS^3$ (RaLMSpec+S): Enabling the Optimal Speculation Stride Scheduler (S) brings the largest individual gain for ADR and SR, and also performs very well for EDR. For ADR, it lifts performance from slowdowns (e.g., 0.62x for GPT2) to modest speed-ups (0.92x for GPT2, 1.17x for OPT, 1.01x for LLaMA-2). This validates the scheduler's importance in adaptively finding optimal strides, especially when retrieval latency is not overwhelmingly large.
  • Prefetching (RaLMSpec+P): Prefetching (P) generally improves performance, especially for EDR (e.g., 2.10x for GPT2, 2.16x for OPT), by caching more relevant entries and increasing speculation accuracy. However, for ADR and SR, its impact can be mixed or even negative (e.g., 0.59x for GPT2 with ADR).
  • Asynchronous Verification (RaLMSpec+A): Asynchronous verification (A) provides marginal improvements in most cases. For EDR, its contribution is minor. For SR, it yields the second-best individual speed-up for GPT2 (1.41x). The authors note its potential is not fully realized due to Python's GIL.
  • RaLMSpec+PSA (all components): Combining all three components (P, S, A) consistently yields the best performance (*) across all models and retriever types, demonstrating their synergistic effect.

6.2.2. Ablation Study on Prefetching Size (Table 2)

The following are the results from Table 2 of the original paper:

| Retriever | Method | GPT2 | OPT | LLaMA-2 |
|---|---|---|---|---|
| EDR | RaLMSpec+P (20) | 2.10× | 2.16× | 1.75× |
| EDR | RaLMSpec+P (256) | 2.15× | 1.72× | 1.63× |
| ADR | RaLMSpec+P (20) | 0.59× | 0.76× | 0.58× |
| ADR | RaLMSpec+P (256) | 0.67× | 0.25× | 0.34× |
| SR | RaLMSpec+P (20) | 1.39× | 1.42× | 0.98× |
| SR | RaLMSpec+P (256) | 1.02× | 0.93× | 0.84× |

Analysis of Table 2: This table compares the speed-up ratio when using different prefetching sizes (20 vs. 256) with RaLMSpec+P.

  • For EDR, increasing the prefetching size from 20 to 256 yields a slight improvement for GPT2 (2.10x to 2.15x) but a decrease for OPT (2.16x to 1.72x) and LLaMA-2 (1.75x to 1.63x).
  • For ADR and SR, increasing the prefetching size to 256 consistently leads to a significant decrease in performance (e.g., for OPT with ADR, it drops from 0.76x to 0.25x).
  • Analysis: While prefetching can increase speculation accuracy, a larger prefetching size also introduces higher retrieval overhead during the verification step. For faster retrievers (ADR, SR), or when the prefetching gain is marginal, this additional overhead can outweigh the benefits, leading to reduced performance or even slowdowns. The optimal prefetching size is therefore context-dependent.

6.2.3. KNN-LM Serving Results (Figures 5a and 5b)

The paper also evaluates RaLMSpec on the more retrieval-intensive KNN-LM serving task, where retrieval occurs for every generated token.
### 6.2.3. KNN-LM Serving Results (Figure 5)

The paper also evaluates `RaLMSpec` on the more retrieval-intensive `KNN-LM` serving task, where retrieval occurs for every token generation. The cache update rule and verification protocol are adapted: instead of matching document *sets*, success is defined by matching the `ground truth next token`, and the cache is populated with the $n=10$ entries directly following the currently retrieved item (exploiting `spatial locality`); a sketch of this adapted logic appears at the end of this subsection.

The following is the speedup chart for `KNN-LM` with an Exact Dense Retriever (Figure 5a from the original paper):

![Bar chart of speed-up ratio versus the number of nearest neighbors (k) for the exact dense retriever; performance varies across configurations as k grows, and the maximum speed-up of 7.59x is reached at k=1.](/files/papers/6929a07df1921d7e196830fe/images/5.jpg)

Speedup for RaLMSpec in kNN-LM with Exact Retriever on WikiText-103

The following is the speedup chart for `KNN-LM` with an Approximate Dense Retriever (Figure 5b from the original paper):

![Bar chart of speed-up ratio versus the number of nearest neighbors (k) for the approximate dense retriever; red, blue, and grey bars denote the $s=3$, $s=5$, and $OS^3$ settings, with the highest speed-up of about 2.45x at k=1.](/files/papers/6929a07df1921d7e196830fe/images/6.jpg)

Speedup for RaLMSpec in kNN-LM with Approximate Retriever on WikiText-103

**Analysis of Figures 5a and 5b**:

* **Exact Dense Retriever (EDR)**: `RaLMSpec` achieves remarkable speed-ups up to **7.59x** with $OS^3$ (at $k=1$). Even for a large $k=1024$ (meaning 1024 nearest neighbors are retrieved per token), it achieves **3.88x** acceleration. A larger fixed stride (e.g., $s=5$) generally performs better than a smaller one ($s=3$) for EDR, but $OS^3$ consistently finds the best performance. This shows `RaLMSpec` is extremely effective when retrieval is a severe bottleneck, as in `KNN-LM` with EDR.
* **Approximate Dense Retriever (ADR)**: `RaLMSpec` achieves speed-ups up to **2.45x** with $OS^3$ (at $k=1$). For $k=1024$, it still reaches **2.37x** acceleration. Here, the advantage of a larger fixed stride is not as consistent; $OS^3$ again proves its value by adapting to the optimal stride for various $k$ values. While the absolute speed-up is lower than for EDR, it is still significant given ADR's already faster retrieval.
* **Analysis**: The substantial speed-ups for `KNN-LM` (especially with EDR) highlight `RaLMSpec`'s ability to tackle highly retrieval-intensive workloads. The adapted verification protocol (matching tokens) and cache update rule (exploiting spatial locality with the $n=10$ next entries) are crucial for its success in this specific task. $OS^3$ remains vital for navigating the optimal trade-offs.
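As referenced above, here is a minimal, hypothetical sketch of the adapted speculation cache for KNN-LM serving. It is not the authors' implementation: the class and method names are invented, and the datastore is abstracted as an indexable list of `(key, next_token)` entries so that "the $n=10$ entries directly following the retrieved item" can be expressed as a slice.

```python
from collections import OrderedDict

class KNNLMSpeculationCache:
    """Illustrative request-level cache for speculative kNN-LM retrieval.

    Assumes the datastore is an indexable sequence of (key_vector, next_token)
    pairs, so spatial locality means "entries adjacent in datastore order".
    """

    def __init__(self, datastore, n_following: int = 10, capacity: int = 1024):
        self.datastore = datastore
        self.n_following = n_following
        self.capacity = capacity
        self.entries = OrderedDict()  # datastore_index -> next_token

    def speculate(self):
        """Return a speculated next token from the most recently cached entry."""
        if not self.entries:
            return None
        _, token = next(reversed(self.entries.items()))
        return token

    def update_after_verification(self, retrieved_index: int) -> None:
        """After the true retrieval, cache the n entries that follow it
        in datastore order (spatial locality)."""
        start, stop = retrieved_index, retrieved_index + self.n_following
        for idx in range(start, min(stop, len(self.datastore))):
            self.entries[idx] = self.datastore[idx][1]
            self.entries.move_to_end(idx)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry

def verify(speculated_token, true_token) -> bool:
    """kNN-LM verification: success iff the speculated token matches the
    token implied by the true retrieval."""
    return speculated_token == true_token
```

Only the "slice the next $n=10$ datastore entries" rule and the token-matching check come from the paper's description; the eviction policy and cache capacity are invented details for illustration.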
### 6.2.4. LLaMA-2-13B Serving Evaluation (Table 3)

The following are the results from Table 3 of the original paper:

<div class="table-wrapper"><table>
<thead>
<tr> <td>Retriever</td> <td>Wiki QA</td> <td>Web Questions</td> <td>Natural Questions</td> <td>Trivia-QA</td> </tr>
</thead>
<tbody>
<tr> <td>EDR</td> <td>1.70×</td> <td>1.85×</td> <td>1.73×</td> <td>1.78×</td> </tr>
<tr> <td>ADR</td> <td>1.03×</td> <td>1.04×</td> <td>1.02×</td> <td>1.03×</td> </tr>
<tr> <td>SR</td> <td>1.18×</td> <td>1.21×</td> <td>1.22×</td> <td>1.26×</td> </tr>
</tbody>
</table></div>

**Analysis of Table 3**: This table presents the $RaLMSpec+PSA$ speed-up ratios for a larger language model, `LLaMA-2-13B`, across the four QA datasets.

* **EDR**: $RaLMSpec+PSA$ still achieves significant speed-ups, up to **1.85x** (Web Questions). This confirms its effectiveness even with larger language models where retrieval can still be a bottleneck.
* **ADR & SR**: The improvements are marginal (e.g., 1.02x to 1.04x for ADR, 1.18x to 1.26x for SR).
* **Analysis**: For `LLaMA-2-13B`, the `language model generation latency` is considerably higher than for smaller models like GPT2 or OPT. When LM generation becomes the dominant factor, even substantial reductions in retrieval latency result in only marginal improvements in the overall `end-to-end latency`. This highlights that `RaLMSpec` is most effective when the retrieval step is a noticeable bottleneck, and its impact naturally diminishes as the LM computation itself becomes the overwhelming cost.

## 6.3. Ablation Studies / Parameter Analysis (Appendix)

### 6.3.1. Batched Retrieval Efficiency (Appendix A.1, Figure 6)

The following figure (Figure 6 from the original paper) demonstrates the efficiency of batched retrievals, which is a fundamental premise for `RaLMSpec`'s latency savings. A small measurement sketch illustrating how such a curve can be produced follows at the end of this subsection.

![Figure 6: Effect of batch size on latency per query for three different types of retrievers. $95\%$ confidence bands for the true mean latency are included.](/files/papers/6929a07df1921d7e196830fe/images/7.jpg)

*The figure shows query latency as a function of batch size for (a) the exact dense retriever, (b) the approximate dense retriever, and (c) the sparse retriever. Per-query latency drops markedly as the batch size grows; for the exact dense retriever it falls from roughly 4 seconds to under 1 second.*

Original caption: Figure 6: Effect of batch size on latency per query for three different types of retrievers. $95\%$ confidence bands for the true mean latency are included.

**Analysis of Figure 6**: This figure plots the latency per query as the `batch size` increases for EDR, ADR, and SR.

* For **EDR** and **SR**, the `latency per query` decreases significantly as batch size increases, and the `total retrieval time` remains almost constant across all batch sizes. This means that processing multiple queries in a batch takes roughly the same time as processing a single query, making batched retrieval highly efficient.
* For **ADR**, the `latency per query` also decreases, but the total latency scales more linearly with batch size, albeit with a significant intercept term. This implies that while ADR is faster, there is still a fixed overhead for initiating any retrieval, which can be amortized over multiple queries in a batch.
* **Conclusion**: Batched retrieval is demonstrably more efficient than sequential individual retrievals across all retriever types, validating a core mechanism of `RaLMSpec`.
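As mentioned above, the batching effect can be reproduced with a few lines of measurement code. The sketch below uses FAISS with a flat inner-product index purely as a stand-in exact dense retriever (the paper's exact retriever setup may differ); the corpus size, dimension, and batch sizes are arbitrary illustrative values.

```python
import time

import numpy as np
import faiss  # assumed available; any dense retriever with batched search would do

d, corpus_size, k = 128, 100_000, 10
rng = np.random.default_rng(0)
corpus = rng.standard_normal((corpus_size, d)).astype("float32")

index = faiss.IndexFlatIP(d)  # exact (brute-force) inner-product search
index.add(corpus)

for batch_size in (1, 2, 4, 8, 16):
    queries = rng.standard_normal((batch_size, d)).astype("float32")
    start = time.perf_counter()
    index.search(queries, k)  # one batched call answers all queries at once
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size:2d}  total={elapsed * 1e3:7.2f} ms  "
          f"per-query={elapsed / batch_size * 1e3:6.2f} ms")
```

The exact numbers depend on hardware, but the per-query latency should shrink noticeably as the batch grows, which is precisely the amortization that `RaLMSpec`'s batched verification exploits.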
### 6.3.2. Detailed Ablation Study on Different System Components (Appendix A.3, Table 4 & Figure 7)

The following figure (Figure 7 from the original paper) shows the ablation study on the contribution of the prefetching (P), optimal speculation stride scheduler (S), and asynchronous verification (A) components.

![Bar chart of the performance of RaLMSpec and its variants under (a) the exact dense retriever, (b) the approximate dense retriever, and (c) the sparse retriever. Blue denotes RaLMSpec, grey RaLMSpec+PS, orange RaLMSpec+P, and yellow RaLMSpec+PSA.](/files/papers/6929a07df1921d7e196830fe/images/8.jpg)

Original caption: Figure 7: Ablation study on the contribution of the prefetching (P), optimal speculation stride scheduler (S), and asynchronous verification (A) components.

The following are the results from Table 4 of the original paper:

<div class="table-wrapper"><table>
<thead>
<tr> <td>Retriever</td> <td>B</td> <td>P</td> <td>S</td> <td>A</td> <td>PS</td> <td>SA</td> <td>PA</td> <td>PSA</td> </tr>
</thead>
<tbody>
<tr> <td>EDR</td> <td>144.39s</td> <td>82.23s</td> <td>85.19s</td> <td>90.49s</td> <td>81.64s</td> <td>85.13s</td> <td>81.60s</td> <td>79.06s</td> </tr>
<tr> <td>ADR</td> <td>8.06s</td> <td>14.25s</td> <td>8.14s</td> <td>13.90s</td> <td>8.17s</td> <td>7.83s</td> <td>12.84s</td> <td>7.89s</td> </tr>
<tr> <td>SR</td> <td>10.75s</td> <td>11.27s</td> <td>10.38s</td> <td>10.88s</td> <td>10.21s</td> <td>8.26s</td> <td>10.61s</td> <td>8.28s</td> </tr>
</tbody>
</table></div>

**Analysis of Table 4 and Figure 7**: This detailed ablation study, conducted on `LLaMA-2-7B` and the `Wiki QA` dataset, evaluates all combinations of Prefetching (P), $OS^3$ (S), and Asynchronous verification (A). $B$ denotes the baseline `RaLMSeq` latency.

* **EDR**:
    * For EDR, $P$ (prefetching) provides the most significant individual speed-up (reducing latency from 144.39s to 82.23s). $S$ ($OS^3$) also helps but slightly less than $P$.
    * Interestingly, $P+A$ (81.60s) even outperforms $P+S$ (81.64s) and $S+A$ (85.13s), and is very close to $P+S+A$ (79.06s).
    * **Reasoning**: For EDR, where retrieval latency is very high, a larger fixed `speculation stride` (like the default $s=3$ when $OS^3$ is off) may already be near-optimal, and $OS^3$'s initial warm-up phase (starting at $s=1$) can introduce some overhead. Prefetching, by improving `speculation accuracy`, provides direct benefits.
* **ADR**:
    * For ADR, $P$ and $A$ individually lead to significant *slowdowns* (latency increases from 8.06s to 14.25s and 13.90s respectively), highlighting their potential overhead if not chosen carefully.
    * $S$ ($OS^3$) alone slightly increases latency (8.14s) but is crucial for $P+S+A$.
    * $S+A$ provides the best performance (7.83s), even outperforming $P+S+A$ (7.89s).
    * **Reasoning**: For ADR, where retrieval is faster, the overhead of prefetching (fetching more documents) or `asynchronous verification` (if it frequently fails early) can easily outweigh the gains. $OS^3$ is critical for adaptively finding a suitable stride to balance these trade-offs.
* **SR**:
    * Similar to ADR, $P$ individually slightly increases latency (11.27s vs. 10.75s).
    * $S$ provides the best individual gain (10.38s).
    * $S+A$ achieves the best overall performance (8.26s), again outperforming $P+S+A$ (8.28s).
    * **Reasoning**: As with ADR, fast retrievers benefit most from $OS^3$'s adaptive stride scheduling. Prefetching's overhead can be counterproductive if not justified by very high `speculation accuracy` improvements.

**Conclusion from Detailed Ablation**:

* $OS^3$ is generally the most crucial component for adapting to diverse workloads and achieving near-optimal performance, especially for faster retrievers where constant strides can be detrimental.
* Prefetching (P) is highly beneficial when retrieval dominates latency (e.g., EDR) but can hurt performance for faster retrievers if its overhead outweighs its gain.
* Asynchronous verification (A) provides marginal benefits and is less impactful, partly due to implementation limitations (GIL) and partly because its gains are conditional on successful speculation.
* Combining all components ($RaLMSpec+PSA$) provides the most consistent overall best performance, as they compensate for each other's weaknesses and collectively maximize the acceleration potential.
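Before looking at the stride ablation in the next subsection, a simplified cost model (my own illustration, not the paper's $OS^3$ formulation) makes the stride trade-off explicit: with speculation stride $s$, each verification round pays one batched retrieval plus $s$ LM steps, but any token generated after the first mis-speculation is wasted and must be regenerated.

```python
def expected_latency_per_token(s: int, t_retrieve: float, t_lm_step: float, p_hit: float) -> float:
    """Rough expected latency per *useful* token for a fixed speculation stride s.

    Simplifying assumptions (illustrative only):
    - each speculative step independently matches the true retrieval with probability p_hit;
    - a batched verification of s queries costs about one retrieval (t_retrieve);
    - tokens generated after the first mismatch are discarded.
    """
    expected_useful = sum(p_hit ** i for i in range(1, s + 1))  # p + p^2 + ... + p^s
    round_cost = t_retrieve + s * t_lm_step
    return round_cost / max(expected_useful, 1e-9)

# Illustrative latencies (seconds): a slow exact retriever vs. a fast approximate retriever.
for name, t_r in [("EDR-like", 0.50), ("ADR-like", 0.02)]:
    best = min(range(1, 9), key=lambda s: expected_latency_per_token(s, t_r, 0.03, 0.8))
    print(name, "prefers stride", best)
```

With these made-up numbers the slow, EDR-like retriever favors the largest stride in the searched range, while the fast, ADR-like retriever favors $s=2$, mirroring the qualitative trend reported in Table 5 below; $OS^3$ automates exactly this kind of selection online.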
### 6.3.3. Ablation Study on Speculation Stride (Appendix A.4, Table 5)

The following are the results from Table 5 of the original paper:

<div class="table-wrapper"><table>
<thead>
<tr> <td>Retriever</td> <td>s=2</td> <td>s=4</td> <td>s=8</td> <td>OS^3</td> </tr>
</thead>
<tbody>
<tr> <td>EDR</td> <td>92.17s</td> <td>81.06s</td> <td>81.90s</td> <td>85.19s</td> </tr>
<tr> <td>ADR</td> <td>9.86s</td> <td>14.93s</td> <td>25.88s</td> <td>8.14s</td> </tr>
<tr> <td>SR</td> <td>10.65s</td> <td>12.48s</td> <td>16.66s</td> <td>10.38s</td> </tr>
</tbody>
</table></div>

**Analysis of Table 5**: This table compares the performance (average serving latency) of `LLaMA-2-7B` on `Wiki QA` with different fixed `speculation strides` ($s=2, 4, 8$) against the `Optimal Speculation Stride Scheduler` ($OS^3$).

* **EDR**: A larger `speculation stride` (e.g., $s=4$ with 81.06s, or $s=8$ with 81.90s) is generally better. This is because EDR's retrieval latency is very high, so taking more speculative steps before an expensive verification reduces the number of costly external calls. The overhead of potential mis-speculation is small relative to the large savings from batched EDR calls. $OS^3$ (85.19s) performs slightly worse than the best fixed strides (e.g., $s=4$) due to its `warm-up phase` (starting at $s=1$) before it can adapt to the optimal larger stride.
* **ADR**: For ADR, a smaller `speculation stride` is preferred. As $s$ increases from 2 (9.86s) to 4 (14.93s) and 8 (25.88s), the latency significantly *increases*. This indicates that for faster retrievers, the cost of `speculation overhead` (tokens generated based on incorrect speculation that need to be rolled back) quickly outweighs the marginal savings from batched retrieval. $OS^3$ (8.14s) clearly outperforms all fixed strides, demonstrating its ability to find the optimal smaller stride.
* **SR**: Similar to ADR, SR prefers a smaller `speculation stride`. Latency increases with $s$ (10.65s for $s=2$, 12.48s for $s=4$, 16.66s for $s=8$). $OS^3$ (10.38s) again performs best, adapting to the optimal stride for SR.
* **Conclusion**: The optimal `speculation stride` is highly dynamic and depends on the retriever's characteristics and the LM's generation latency. $OS^3$ is essential for dynamically tuning this parameter to achieve near-optimal performance across diverse scenarios, effectively handling the dynamism between different retrieval and generation costs.

### 6.3.4. Additional Experimental Results (Appendix A.5)

The following are the full evaluation results for average latency (in seconds) across different models, retrievers, and datasets.
The following are the results from Table 6 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td>Model</td> <td>Method</td> <td>Wiki QA</td> <td>WQ</td> <td>NQ</td> <td>Trivia QA</td> </tr> </thead> <tbody> <tr> <td rowspan="6">GPT2</td> <td>Baseline RaLMSpec</td> <td>142.14 ± 0.96</td> <td>141.38 ± 1.50</td> <td>144.22 ± 1.17</td> <td>144.82 ± 0.96</td> </tr> <tr> <td></td> <td>69.82 ± 0.22</td> <td>69.88 ± 0.52</td> <td>70.79 ± 1.27</td> <td>69.83 ± 0.28</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>68.22 ± 0.18</td> <td>68.09 ± 0.18</td> <td>68.06 ± 0.37</td> <td>68.14 ± 0.06</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>66.14 ± 0.39</td> <td>65.22 ± 0.79</td> <td>67.10 ± 0.59</td> <td>66.64 ± 0.23</td> </tr> <tr> <td>RaLMSpec+S</td> <td>62.72 ± 0.48</td> <td>62.43 ± 0.19</td> <td>63.48 ± 0.69</td> <td>64.63 ± 0.44</td> </tr> <tr> <td>RaLMSpec+A</td> <td>69.92 ± 1.06</td> <td>69.36 ± 0.60</td> <td>71.00 ± 0.74</td> <td>70.40 ± 0.78</td> </tr> <tr> <td rowspan="6">OPT</td> <td>RaLMSpec+P(20)SA RaLMSpec+P(256)SA</td> <td>58.35 ± 0.31 53.95 ± 0.72</td> <td>59.24 ± 0.46 53.36 ± 1.08</td> <td>60.21 ± 0.78 56.26 ± 0.95</td> <td>61.39 ± 0.99 56.95 ± 1.34</td> </tr> <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr> <tr> <td>Baseline</td> <td>126.86 ± 1.39</td> <td>55.60 ± 0.08 29.99 ± 0.54</td> <td>62.02 ± 0.11</td> <td>91.50 ± 0.01</td> </tr> <tr> <td>RaLMSpec RaLMSpec+P(20)</td> <td>77.81 ± 0.84 40.37 ± 0.07</td> <td>29.09 ± 0.07</td> <td>34.75 ± 0.19</td> <td>48.20 ± 0.09</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>72.68 ± 0.75</td> <td>31.94 ± 1.16</td> <td>33.79 ± 0.60</td> <td>52.09 ± 0.11</td> </tr> <tr> <td>RaLMSpec+S</td> <td>40.77 ± 0.52</td> <td>29.49 ± 0.53</td> <td>39.90 ± 3.18</td> <td>50.24 ± 0.02</td> </tr> <tr> <td rowspan="8">LLaMA</td> <td>RaLMSpec+A</td> <td></td> <td></td> <td>35.13 ± 0.10</td> <td>50.82 ± 0.49</td> </tr> <tr> <td></td> <td>77.76 ± 4.99</td> <td>30.28 ± 0.59</td> <td>36.16 ± 1.18</td> <td>47.83 ± 0.03</td> </tr> <tr> <td>RaLMSpec+P(20)SA</td> <td>39.00 ± 0.58</td> <td>28.31 ± 0.62</td> <td>31.88 ± 1.67</td> <td>45.51 ± 2.93</td> </tr> <tr> <td>RaLMSpec+P(256)SA</td> <td>59.21 ± 0.04</td> <td>27.79 ± 0.11</td> <td>30.02 ± 0.01</td> <td>45.13 ± 0.04</td> </tr> <tr> <td>Baseline</td> <td>144.39 ± 1.71</td> <td>146.52 ± 1.92</td> <td>149.76 ± 0.95</td> <td>147.76 ± 2.80</td> </tr> <tr> <td>RaLMSpec</td> <td>81.05 ± 0.78</td> <td>87.20 ± 1.83</td> <td>86.92 ± 1.30</td> <td>90.44 ± 2.01</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>83.94 ± 1.11</td> <td>84.23 ± 0.37</td> <td>84.74 ± 0.47</td> <td></td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>82.23 ± 1.95</td> <td>94.15 ± 1.60</td> <td></td> <td>81.86 ± 0.76</td> </tr> </tbody> </table></div> The following are the results from Table 7 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td>Model</td> <td>Method</td> <td>Wiki QA</td> <td>WQ</td> <td>NQ</td> <td>Trivia QA</td> </tr> </thead> <tbody> <tr> <td rowspan="7">GPT2</td> <td>Baseline</td> <td>4.48 ± 0.11</td> <td>4.50 ± 0.41</td> <td>4.38 ± 0.60</td> <td>3.78 ± 0.10</td> </tr> <tr> <td>RaLMSpec</td> <td>7.26 ± 0.07</td> <td>6.44 ± 0.44</td> <td>6.41 ± 0.71</td> <td>7.47 ± 0.16</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>6.92 ± 0.10</td> <td>7.38 ± 0.56</td> <td>7.37 ± 0.58</td> <td>7.28 ± 0.28</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>6.65 ± 0.07</td> <td>5.97 ± 0.46</td> <td>5.64 ± 0.74</td> <td>6.96 ± 0.63</td> </tr> <tr> <td>RaLMSpec+S</td> <td>4.59 ± 0.28</td> <td>4.77 ± 0.32</td> <td>4.65 ± 0.61</td> 
<td>4.51 ± 0.45</td> </tr> <tr> <td>RaLMSpec+A</td> <td>6.50 ± 0.54</td> <td>6.49 ± 0.38</td> <td>5.70 ± 0.70</td> <td>6.94 ± 0.84</td> </tr> <tr> <td>RaLMSpec+P(20)SA</td> <td>4.24 ± 0.14</td> <td>4.34 ± 0.35</td> <td>4.03 ± 0.68</td> <td>3.64 ± 0.54</td> </tr> <tr> <td rowspan="7">OPT</td> <td>RaLMSpec+P(256)SA</td> <td>4.01 ± 0.21</td> <td>3.81 ± 0.02</td> <td>3.40 ± 0.01</td> <td>3.86 ± 0.31</td> </tr> <tr> <td>Baseline</td> <td>4.43 ± 0.05</td> <td>1.31 ± 0.01</td> <td>1.83 ± 0.01</td> <td>2.42 ± 0.03</td> </tr> <tr> <td>RaLMSpec</td> <td>7.15 ± 0.06</td> <td>2.34 ± 0.01</td> <td>3.04 ± 0.03</td> <td>3.79 ± 0.04</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>3.44 ± 0.02</td> <td>2.34 ± 0.01</td> <td>2.70 ± 0.06</td> <td>4.66 ± 0.03</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>16.03 ± 0.03</td> <td>6.06 ± 0.01</td> <td>7.06 ± 0.04</td> <td>10.17 ± 0.01</td> </tr> <tr> <td>RaLMSpec+S</td> <td>2.21 ± 0.01</td> <td>1.47 ± 0.01</td> <td>1.88 ± 0.05</td> <td>2.97 ± 0.05</td> </tr> <tr> <td>RaLMSpec+A RaLMSpec+P(20)SA</td> <td>7.55 ± 0.05</td> <td>2.25 ± 0.01</td> <td>5.41 ± 1.32</td> <td>6.20 ± 0.02</td> </tr> <tr> <td rowspan="7">LLaMA</td> <td>RaLMSpec+P(256)SA</td> <td>1.98 ± 0.03 9.41 ± 0.66</td> <td>1.30 ± 0.01 4.14 ± 0.02</td> <td>1.50 ± 0.01</td> <td>2.37 ± 0.01</td> </tr> <tr> <td>Baseline</td> <td></td> <td></td> <td>4.31 ± 0.02</td> <td>6.19 ± 0.02</td> </tr> <tr> <td>RaLMSpec</td> <td>8.06 ± 0.07</td> <td>7.97 ± 0.06</td> <td>8.11 ± 0.11</td> <td>8.68 ± 0.10</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>14.10 ± 0.31 14.25 ± 0.39</td> <td>13.44 ± 0.37 13.45 ± 0.28</td> <td>14.35 ± 0.15</td> <td>14.23 ± 0.35</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td></td> <td></td> <td>14.08 ± 0.32</td> <td>14.21 ± 0.30</td> </tr> <tr> <td>RaLMSpec+S</td> <td>20.63 ± 0.48 8.14 ± 0.19</td> <td>26.44 ± 3.11 8.08 ± 0.07</td> <td>27.38 ± 3.39</td> <td>21.04 ± 0.43</td> </tr> <tr> <td>RaLMSpec+A</td> <td>13.90 ± 0.36</td> <td>13.28 ± 0.17</td> <td>8.08 ± 0.07 13.72 ± 0.14</td> <td>8.16 ± 0.09</td> </tr> <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td>18.35 ± 1.11</td> </tr> <tr> <td>RaLMSpec+P(20)SA</td> <td>7.89 ± 0.22</td> <td>7.84 ± 0.15</td> <td>7.90 ± 0.12</td> <td></td> <td>7.91 ± 0.03</td> </tr> <tr> <td>RaLMSpec+P(256)SA</td> <td>14.06 ± 0.08</td> <td>14.96 ± 1.34</td> <td>14.59 ± 2.04</td> <td></td> <td>12.94 ± 0.03</td> </tr> <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr> </tbody> </table></div> The following are the results from Table 8 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td>Model</td> <td>Method</td> <td>Wiki QA</td> <td>WQ</td> <td>NQ</td> <td>Trivia QA</td> </tr> </thead> <tbody> <tr> <td rowspan="6">GPT2</td> <td>Baseline</td> <td>7.41 ± 1.34</td> <td>7.03 ± 1.15</td> <td>7.23 ± 0.11</td> <td>6.80 ± 0.09</td> </tr> <tr> <td>RaLMSpec</td> <td>5.18 ± 0.13</td> <td>5.30 ± 0.95</td> <td>5.34 ± 0.11</td> <td>5.40 ± 0.03</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>5.23 ± 0.23</td> <td>4.58 ± 0.01</td> <td>5.17 ± 0.05</td> <td>5.50 ± 0.04</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>6.88 ± 0.66</td> <td>7.16 ± 1.34</td> <td>6.76 ± 0.27</td> <td>6.91 ± 0.13</td> </tr> <tr> <td>RaLMSpec+S</td> <td>5.62 ± 0.96</td> <td>5.03 ± 0.68</td> <td>5.24 ± 0.13</td> <td>5.61 ± 0.11</td> </tr> <tr> <td>RaLMSpec+A</td> <td>5.34 ± 0.89</td> <td>4.99 ± 0.86</td> <td>4.76 ± 0.14</td> <td>5.04 ± 0.12</td> </tr> <tr> <td rowspan="6">OPT</td> <td>RaLMSpec+P(20)SA RaLMSpec+P(256)SA</td> <td>4.49 ± 0.09 6.66 ± 1.25</td> <td>4.57 ± 0.81 6.50 ± 
1.39</td> <td>4.54 ± 0.02 5.54 ± 0.02</td> <td>4.99 ± 0.01 5.91 ± 0.03</td> </tr> <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr> <tr> <td>Baseline</td> <td>7.68 ± 0.01</td> <td>1.83 ± 0.01</td> <td>2.62 ± 0.01</td> <td>4.71 ± 0.02</td> </tr> <tr> <td>RaLMSpec</td> <td>5.63 ± 0.01</td> <td>1.93 ± 0.01</td> <td>2.60 ± 0.01</td> <td>4.07 ± 0.02</td> </tr> <tr> <td>RaLMSpec+P(20) RaLMSpec+P(256)</td> <td>3.00 ± 0.01</td> <td>2.06 ± 0.01</td> <td>2.50 ± 0.02</td> <td>4.27 ± 0.01</td> </tr> <tr> <td>OPT RaLMSpec+S</td> <td>7.13 ± 0.01 2.79 ± 0.01</td> <td>2.45 ± 0.01</td> <td>3.22 ± 0.01</td> <td>5.24 ± 0.02</td> </tr> <tr> <td rowspan="6">LLaMA</td> <td></td> <td></td> <td>1.69 ± 0.01</td> <td>2.32 ± 0.01</td> <td>4.27 ± 0.01</td> </tr> <tr> <td>RaLMSpec+A</td> <td>5.27 ± 0.02</td> <td>1.86 ± 0.01</td> <td>2.28 ± 0.01</td> <td>3.82 ± 0.02</td> </tr> <tr> <td>RaLMSpec+P(20)SA</td> <td>2.46 ± 0.01</td> <td>1.57 ± 0.01</td> <td>1.88 ± 0.01</td> <td>3.59 ± 0.01</td> </tr> <tr> <td>RaLMSpec+P(256)SA</td> <td>6.37 ± 0.01</td> <td>1.92 ± 0.04</td> <td>2.44 ± 0.02</td> <td>4.53 ± 0.01</td> </tr> <tr> <td>Baseline</td> <td>10.75 ± 0.32</td> <td>10.55 ± 0.07</td> <td>10.72 ± 0.10</td> <td>11.06 ± 0.25</td> </tr> <tr> <td>RaLMSpec</td> <td>11.47 ± 0.17</td> <td>11.02 ± 0.31</td> <td>11.10 ± 0.27</td> <td>10.79 ± 0.20</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>11.27 ± 0.14</td> <td>11.04 ± 0.22</td> <td>10.69 ± 0.15</td> <td>10.66 ± 0.23</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>12.83 ± 0.21</td> <td>13.35 ± 0.81</td> <td>12.48 ± 0.22</td> <td>12.60 ± 0.36</td> </tr> <tr> <td>RaLMSpec+S</td> <td>10.38 ± 0.28</td> <td>10.19 ± 0.19</td> <td>9.95 ± 0.04</td> <td>10.18 ± 0.08</td> </tr> <tr> <td>RaLMSpec+A</td> <td>10.88 ± 0.26</td> <td>10.66 ± 0.12</td> <td>10.56 ± 0.20</td> <td>10.16 ± 0.18</td> </tr> <tr> <td>RaLMSpec+P(20)SA RaLMSpec+P(256)SA</td> <td>8.28 ± 0.18 9.46 ± 0.17</td> <td>8.18 ± 0.10 10.42 ± 0.74</td> <td>8.26 ± 0.14 9.64 ± 0.08</td> <td>8.09 ± 0.18 9.36 ± 0.16</td> </tr> </tbody> </table></div> These tables provide the raw latency numbers (mean ± standard deviation) for all configurations, models, and datasets, mirroring the trends discussed in the main results section (Figure 4) and the ablation studies (Tables 1, 2, 5). They confirm that $RaLMSpec+P(20)SA$ or $RaLMSpec+P(256)SA$ often achieve the lowest latencies, with $OS^3$ (indicated by 'S') being crucial for preventing slowdowns, especially with faster retrievers. The detailed numbers underscore the robustness of `RaLMSpec` across a broad range of experimental setups. # 7. Conclusion & Reflections ## 7.1. Conclusion Summary This work introduces `RaLMSpec`, a novel speculation-inspired framework meticulously designed to accelerate the serving latency of generic iterative Retrieval-Augmented Language Models (RaLMs). The core innovation lies in its ability to significantly reduce the overhead caused by frequent retrieval-generation interactions, a critical bottleneck in iterative RaLM serving, while `provably preserving the exact same model outputs`. `RaLMSpec` leverages the inherent `temporal and spatial locality` of retrieved documents by maintaining a fast, request-level local cache for `speculative retrieval`. This allows the language model to generate tokens using quickly accessible, speculated documents. To ensure correctness, these speculative steps are followed by a `batched verification` process where the true documents are retrieved from the external knowledge base in a highly efficient parallel manner. 
Any detected mismatches trigger a rollback and correction, guaranteeing fidelity. Beyond this core mechanism (sketched at the end of this subsection), `RaLMSpec` is further enhanced by three key techniques:

1. **Cache Prefetching**: Proactively updates the local cache with multiple top-ranked documents during verification, boosting future speculation success rates.
2. **Optimal Speculation Stride Scheduler ($OS^3$)**: Dynamically adjusts the number of speculative steps between verifications, adaptively balancing `speculation overhead` and `retrieval latency savings` across diverse scenarios.
3. **Asynchronous Verification**: Enables overlapping speculative and verification steps to hide latency and exploit concurrency.

Empirical evaluations across a broad spectrum of tasks (naive iterative RaLM and `KNN-LM` serving), language models (GPT2, OPT, LLaMA-2), retrievers (Exact Dense, Approximate Dense, Sparse), and downstream QA datasets demonstrate `RaLMSpec`'s substantial effectiveness. It achieves speed-up ratios of up to 2.39x for naive iterative RaLM and an impressive 7.59x for `KNN-LM` serving with exact dense retrievers, consistently outperforming baselines. These results firmly establish `RaLMSpec` as a generic and powerful acceleration framework for iterative RaLMs.
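To make the mechanism recapped above concrete, here is a minimal, hypothetical sketch of the speculative-retrieval loop with batched verification and rollback. The function and object names are invented, and the real system additionally layers prefetching, $OS^3$, and asynchronous verification on top of this skeleton.

```python
def ralm_serve_speculative(lm, retriever, cache, prompt, stride=3, max_tokens=128):
    """Illustrative speculative RaLM serving loop (not the authors' code).

    Assumed interfaces:
      lm.generate(context, doc)    -> next token given the current document
      retriever.batch_retrieve(qs) -> true documents for a batch of queries
      cache.lookup(q) / cache.add(doc) implement the request-level local cache
    """
    output = []
    while len(output) < max_tokens:
        # --- speculation phase: use cached (possibly stale) documents ---
        queries, spec_docs, spec_tokens = [], [], []
        for _ in range(stride):
            query = prompt + "".join(output + spec_tokens)
            doc = cache.lookup(query)          # fast local guess
            token = lm.generate(query, doc)    # tentative token
            queries.append(query)
            spec_docs.append(doc)
            spec_tokens.append(token)

        # --- verification phase: one batched call to the true retriever ---
        true_docs = retriever.batch_retrieve(queries)
        keep = 0
        for spec, true in zip(spec_docs, true_docs):
            cache.add(true)                    # cache update with ground-truth documents
            if spec != true:
                break                          # first mismatch: roll back the rest
            keep += 1

        output.extend(spec_tokens[:keep])
        if keep < stride:                      # regenerate the mismatched step correctly
            query = prompt + "".join(output)
            output.append(lm.generate(query, true_docs[keep]))
    return "".join(output)
```

Because every emitted token is either verified against the true retrieval or regenerated from it, the output matches what the sequential baseline would produce, which is the correctness guarantee the paper emphasizes.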
## 7.2. Limitations & Future Work

The authors acknowledge a few limitations and suggest future research directions:

* **Asynchronous Verification Implementation**: The full potential of `asynchronous verification` was not realized due to the `Global Interpreter Lock (GIL)` in Python, necessitating simulated latency measurements. Overcoming this would require a lower-level system implementation (e.g., in C++ or another language that allows true parallelism without GIL constraints) or specialized hardware scheduling.
* **`KNN-LM` Asynchronous Verification**: `Asynchronous verification` was not enabled for `KNN-LM` serving in the current work and is left as future work. Given the high retrieval frequency of `KNN-LM`, integrating asynchronous verification could yield further significant gains.
* **Speed-up Bottleneck**: The paper implicitly highlights that `RaLMSpec`'s speed-up is inherently limited by the proportion of retrieval latency in the total end-to-end latency. When language model generation itself becomes the overwhelming bottleneck (e.g., with very large LMs or very fast retrievers), the gains from optimizing retrieval will naturally diminish.

Based on these, potential future work could explore:

* **Hardware-aware Asynchronous Verification**: Developing `RaLMSpec` with hardware-level concurrency primitives to fully exploit asynchronous verification's benefits.
* **Integration with LLM Speculative Decoding**: Combining `RaLMSpec` with existing `speculative decoding` techniques for LLMs could lead to compounded speed-ups by optimizing both retrieval and generation simultaneously.
* **Adaptive Prefetching Strategies**: Further research into more sophisticated, adaptive prefetching mechanisms that dynamically adjust the prefetch size based on current speculation accuracy and retriever characteristics, to avoid the overhead observed with fixed large prefetch sizes.
* **Cross-Request Speculation**: Exploring whether speculation can extend beyond a single request to leverage locality across multiple concurrent inference requests.

## 7.3. Personal Insights & Critique

This paper presents a highly relevant and practical solution to a critical problem in the deployment of high-quality RaLMs. Iterative RaLMs, while offering superior accuracy and adaptability, have been held back by their inference costs. `RaLMSpec` directly tackles this bottleneck with an elegant and provably correct approach.

**Strengths**:

* **Provable Correctness**: The guarantee of `same model outputs` is a significant advantage. Many optimization techniques trade off quality for speed; `RaLMSpec` offers pure speed-up without compromise, which is crucial for reliable AI applications.
* **Orthogonal Innovation**: Applying `speculation` to the *retrieval* component is novel and distinct from existing LLM `speculative decoding`. This orthogonality means `RaLMSpec` can potentially complement other speed-up methods for LLMs, paving the way for even faster RaLM systems.
* **Generality**: The framework's demonstrated effectiveness across various LMs, retrievers, and datasets suggests it is a broadly applicable solution.
* **Holistic Optimization**: The combination of `caching`, `batched verification`, `prefetching`, $OS^3$, and `asynchronous verification` shows a thorough understanding of the problem and a comprehensive approach to maximizing performance. $OS^3$ is particularly impressive for its adaptive nature, addressing the dynamic trade-offs involved.

**Potential Issues/Critique**:

* **Impact of LM Size**: As the paper itself implicitly acknowledges, the gains from `RaLMSpec` diminish when the LM generation time vastly overshadows retrieval time (e.g., with very large LLMs like `LLaMA-2-13B` and fast retrievers). While still providing some speed-up, the relative impact becomes less pronounced. This suggests that future optimizations for RaLMs may need to increasingly focus on the LM generation part for larger models.
* **Complexity of $OS^3$**: While powerful, the `Optimal Speculation Stride Scheduler` relies on estimations of $a$, $b$, and $\gamma(X)$. The accuracy and stability of these estimations, especially $\gamma(X)$ with its windowing approach, could be sensitive to sudden shifts in query patterns or knowledge base characteristics. The warm-up phase overhead for EDR is an example of this (a toy illustration of this estimation lag follows the critique list below).

* **KNN-LM Adaptation Details**: While the paper states that KNN-LM's cache update and verification protocols were modified (matching the next token, populating the cache with the next $n=10$ entries), a more detailed exploration of these adaptations and their rationale (e.g., why $n=10$ was chosen, a sensitivity analysis for $n$) could provide deeper insights.
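To illustrate the estimation-sensitivity concern raised above, here is a toy windowed estimator of the speculation success rate (call it $\gamma$). This is my own sketch, not the paper's estimator, and the window length is an arbitrary choice.

```python
from collections import deque

class WindowedGammaEstimator:
    """Toy estimate of the speculation success rate over a sliding window of
    recent verification outcomes (True = speculation matched the true retrieval)."""

    def __init__(self, window: int = 50):
        self.outcomes = deque(maxlen=window)

    def update(self, success: bool) -> None:
        self.outcomes.append(success)

    def gamma(self, default: float = 0.5) -> float:
        if not self.outcomes:
            return default  # cold start: no evidence yet
        return sum(self.outcomes) / len(self.outcomes)

# A sudden shift in query patterns (hits -> misses) is only reflected after
# roughly one window of new observations, which is the lag discussed above.
est = WindowedGammaEstimator(window=50)
for _ in range(50):
    est.update(True)
for _ in range(10):
    est.update(False)
print(round(est.gamma(), 2))  # 0.8: still dominated by the old regime
```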

**Transferability/Application**: The core idea of speculative execution with batched verification is broadly applicable to any system where expensive, sequential operations can be partially predicted and then verified. This paradigm could be transferred to:

* **Other multi-modal AI systems**: Systems involving frequent calls to external processing units (e.g., image generation with external modules, video processing pipelines).
* **Database interactions**: Optimizing complex database queries that involve multiple sub-queries.
* **Agentic AI systems**: AI agents that frequently interact with external tools or environments, where speculative tool use could be verified in batches.

Overall, `RaLMSpec` is a significant contribution towards making powerful iterative RaLMs practical for real-world applications by elegantly solving their most pressing performance bottleneck.
