
Accelerating Retrieval-Augmented Language Model Serving with Speculation

Published: 01/25/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The RaLMSpec framework accelerates retrieval-augmented language model serving through speculative retrieval and batched verification while preserving identical model outputs. Combined with prefetching, an optimal speculation stride scheduler, and asynchronous verification, it significantly improves iterative RaLM efficiency, achieving speed-ups of up to 2.39x for naive iterative RaLM serving and up to 7.59x for KNN-LM serving.

Abstract

Retrieval-augmented language models (RaLM) have demonstrated the potential to solve knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model. Instead of fine-tuning a fully parametric model, RaLM excels at its low-cost adaptation to the latest data and better source attribution mechanisms. Among various RaLM approaches, iterative RaLM delivers a better generation quality due to a more frequent interaction between the retriever and the language model. Despite the benefits, iterative RaLM usually encounters high overheads due to the frequent retrieval step. To this end, we propose RaLMSpec, a speculation-inspired framework that provides generic speed-up over iterative RaLM while preserving the same model outputs through speculative retrieval and batched verification. By further incorporating prefetching, optimal speculation stride scheduler, and asynchronous verification, RaLMSpec can automatically exploit the acceleration potential to the fullest. For naive iterative RaLM serving, extensive evaluations over three language models on four downstream QA datasets demonstrate that RaLMSpec can achieve a speed-up ratio of 1.75-2.39x, 1.04-1.39x, and 1.31-1.77x when the retriever is an exact dense retriever, approximate dense retriever, and sparse retriever respectively compared with the baseline. For KNN-LM serving, RaLMSpec can achieve a speed-up ratio up to 7.59x and 2.45x when the retriever is an exact dense retriever and approximate dense retriever, respectively, compared with the baseline.


In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of this paper is the acceleration of serving for retrieval-augmented language models (RaLM) using a speculation-inspired framework.

1.2. Authors

The authors are:

  • Zhihao Zhang (Carnegie Mellon University, School of Computer Science)

  • Alan Zhu (Carnegie Mellon University, School of Computer Science)

  • Lijie Yang (Carnegie Mellon University, School of Computer Science)

  • Yihua Xu (University of California, Berkeley)

  • Lanting Li (Carnegie Mellon University, School of Computer Science)

  • Phitchaya Mangpo Phothilimthana (Google DeepMind)

  • Zhihao Jia (Carnegie Mellon University, School of Computer Science)

    Their affiliations indicate a strong presence in computer science research, particularly from Carnegie Mellon University, with contributions from Google DeepMind and UC Berkeley. This suggests expertise in areas such as natural language processing, machine learning systems, and potentially computer architecture, given the paper's focus on system-level optimizations like speculation.

1.3. Journal/Conference

The paper is a preprint available on arXiv, published on January 25, 2024. As a preprint, it has not yet undergone formal peer review for a specific journal or conference. However, arXiv is a highly reputable platform for disseminating research in fields like AI, machine learning, and computer science, allowing for rapid sharing of new findings and often serving as a precursor to publications in top-tier conferences (e.g., NeurIPS, ICML, ACL) or journals.

1.4. Publication Year

2024

1.5. Abstract

Retrieval-augmented language models (RaLM) are effective for knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model. They offer advantages like low-cost adaptation to new data and improved source attribution. Iterative RaLM approaches, which involve frequent interaction between the retriever and the language model, achieve higher generation quality but incur significant overhead due to these repeated retrieval steps. To address this, the paper proposes RaLMSpec, a speculation-inspired framework designed to generically speed up iterative RaLM serving while guaranteeing identical model outputs. RaLMSpec achieves this through speculative retrieval and batched verification. It further incorporates prefetching, an optimal speculation stride scheduler, and asynchronous verification to maximize acceleration. Extensive evaluations on three language models across four QA datasets show that RaLMSpec can achieve speed-ups of 1.75-2.39x, 1.04-1.39x, and 1.31-1.77x for naive iterative RaLM serving when using exact dense, approximate dense, and sparse retrievers, respectively, compared to a baseline. For KNN-LM serving, speed-ups of up to 7.59x and 2.45x are observed with exact dense and approximate dense retrievers.

The official source link for this paper is https://arxiv.org/abs/2401.14021. The PDF link is https://arxiv.org/pdf/2401.14021v1.pdf. It is published as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the high inference overhead of iterative Retrieval-Augmented Language Models (RaLMs).

Retrieval-augmented language models represent a significant advancement in solving knowledge-intensive natural language processing (NLP) tasks. Instead of trying to encode all knowledge into a massive, fully parametric model, RaLMs combine a smaller parametric language model with an external, non-parametric knowledge base (like Wikipedia). This approach offers several benefits:

  • Low-cost adaptation: RaLMs can easily adapt to new or continuously updated data by simply updating the external knowledge base, rather than requiring expensive fine-tuning of the entire parametric model.

  • Better source attribution: By explicitly retrieving documents, RaLMs can provide references for their generated output, enhancing transparency and trustworthiness.

    Within RaLM approaches, iterative RaLM stands out for its superior generation quality. Unlike one-shot RaLM which retrieves documents only once at the beginning, iterative RaLM frequently interacts with the knowledge base throughout the generation process, allowing the language model to retrieve more relevant information as the context evolves. This frequent interaction, however, introduces a critical challenge: a high overhead due to the repeated and sequential retrieval steps. This significantly impacts the serving latency, making iterative RaLMs prohibitively slow for practical deployment. The paper highlights this as an inherent inefficiency, especially because each retrieval step is usually performed with a single query derived from the current context.

The paper's entry point and innovative idea is to adapt the concept of speculation, traditionally used in computer architecture and recently in large language model (LLM) decoding, to the retrieval component of iterative RaLMs. By speculatively retrieving documents and then verifying them in batches, the goal is to reduce the number of expensive knowledge base calls without compromising the final output quality.

2.2. Main Contributions / Findings

The paper makes several primary contributions to address the challenge of iterative RaLM serving overhead:

  1. RaLMSpec Framework for Generic Iterative RaLM Acceleration: The authors propose RaLMSpec, a novel framework that provides a generic speed-up for iterative RaLM approaches. A key guarantee of RaLMSpec is that it provably preserves the original model outputs, meaning the accelerated system produces exactly the same text as the unoptimized iterative RaLM. This is crucial for maintaining model quality and reliability.
  2. Caching-based Speculative Retrieval with Batched Verification: Leveraging the observed temporal and spatial locality of retrieved documents (i.e., the same or nearby documents are often retrieved repeatedly), RaLMSpec employs a local cache for speculative retrieval. Instead of querying the expensive external knowledge base for every step, it first speculatively retrieves from this fast local cache. To ensure correctness, these speculative retrievals are followed by a batched verification step where the corresponding queries are sent to the external knowledge base in a single, more efficient batch. If a mismatch is detected, the system rolls back and corrects the generation.
  3. Three Additional Latency Reduction Techniques: RaLMSpec integrates three further techniques to maximize performance:
    • Cache Prefetching: Enhances the local cache by updating it with multiple (e.g., top-k) retrieved documents during verification steps, increasing the likelihood of successful speculation.
    • Optimal Speculation Stride Scheduler ($OS^3$): Dynamically adjusts the number of speculative steps (speculation stride) between verification steps. This scheduler adaptively optimizes the trade-off between the overhead of potential mis-speculation and the latency saved by batched retrievals, which is crucial as optimal strides vary across different models and retrievers.
    • Asynchronous Verification: Exploits concurrency by allowing additional speculative steps to occur concurrently with a verification step, effectively hiding verification latency when speculation is successful.
  4. Extensive Empirical Validation: The paper empirically validates RaLMSpec across a wide range of scenarios:
    • Tasks: Naive iterative RaLM serving and KNN-LM serving.

    • Language Models: GPT2-medium, OPT-1.3B, LLaMA-2-7B, and LLaMA-2-13B.

    • Retrievers: Exact Dense Retriever (EDR), Approximate Dense Retriever (ADR), and Sparse Retriever (SR).

    • Datasets: Wiki-QA, Web Questions, Natural Questions, and Trivia QA for naive RaLM; WikiText-103 for KNN-LM.

      Key Findings:

  • For naive iterative RaLM serving, RaLMSpec+PSA (RaLMSpec with Prefetching, $OS^3$, and Asynchronous verification) achieves significant speed-ups:
    • 1.75-2.39x with an exact dense retriever.
    • 1.04-1.39x with an approximate dense retriever.
    • 1.31-1.77x with a sparse retriever.
  • For KNN-LM serving, where retrieval is performed for every token, the speed-ups are even more substantial:
    • Up to 7.59x with an exact dense retriever.
    • Up to 2.45x with an approximate dense retriever. These results demonstrate that RaLMSpec is a generic and effective framework for accelerating iterative RaLM serving across diverse configurations.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand RaLMSpec, a reader should be familiar with the following core concepts:

  • Language Models (LMs) & Large Language Models (LLMs):

    • Concept: LMs are statistical models that learn to predict the next word or token in a sequence based on the preceding context. LLMs are LMs with a vast number of parameters (billions or trillions) and are trained on massive text corpora, exhibiting emergent capabilities like few-shot learning and complex reasoning. Examples include GPT-3, LLaMA-2, and PaLM.
    • Autoregressive Nature: Many generative LMs, especially those used for text generation, are autoregressive. This means they generate text token-by-token, where each new token is predicted based on the previously generated tokens and the initial prompt. This sequential generation is a key bottleneck as it limits parallelism.
  • Retrieval-Augmented Language Models (RaLM):

    • Concept: RaLMs combine the strengths of parametric language models with non-parametric knowledge bases. Instead of relying solely on the knowledge encoded during training, RaLMs can retrieve relevant documents or information from an external database (e.g., Wikipedia) at inference time to augment their generation process. This helps them access up-to-date information, reduce factual errors, and provide citations.
    • Non-parametric Knowledge Base: This refers to an external, explicit database (e.g., a collection of documents, facts, or embeddings) that is separate from the language model's learned parameters. It can be easily updated or modified without retraining the LM.
    • Parametric Language Model: This is the traditional LM whose knowledge is implicitly stored in its weights (parameters) learned during training.
  • Types of RaLM Interaction:

    • One-shot RaLM: In this approach, retrieval is performed only once at the beginning of the generation process. The retrieved documents are then concatenated with the original query or prompt and fed into the language model to assist in generating the entire response. This is simpler but limited if information needs evolve during a long generation.
    • Iterative RaLM: This is the focus of the paper. Here, the retrieval step is performed multiple times throughout the generation process. As the language model generates new tokens, the context (original query + generated tokens) is used to form a new query to the knowledge base, retrieving more context-relevant documents. This allows for dynamic adaptation to information needs but incurs much higher retrieval overhead due to frequent calls to the knowledge base.
  • Retrievers: Mechanisms used to search and retrieve relevant documents from a knowledge base given a query.

    • Sparse Retrievers (e.g., BM25, TF-IDF): These methods rely on lexical matching of keywords. They represent documents and queries as bag-of-words vectors and score relevance based on term frequency and inverse document frequency.
      • BM25 (Best Match 25): An advanced term weighting scheme often used in information retrieval that ranks documents based on the appearance of query terms in each document, taking into account term frequency, inverse document frequency, document length, and query term saturation.
    • Dense Retrievers (e.g., DPR): These methods embed both queries and documents into a shared continuous vector space using neural networks (e.g., BERT-like encoders). Retrieval then becomes a nearest-neighbor search in this embedding space, where documents closest to the query embedding are considered most relevant.
      • Exact Dense Retriever (EDR): Performs an exhaustive or highly accurate nearest-neighbor search in the dense vector space. While precise, this can be computationally expensive, especially for very large knowledge bases.
      • Approximate Dense Retriever (ADR): Uses approximate nearest-neighbor (ANN) search algorithms (e.g., HNSW - Hierarchical Navigable Small World graphs, or FAISS) to find near neighbors more quickly, trading off some accuracy for significant speed.
    • FAISS (Facebook AI Similarity Search): A library for efficient similarity search and clustering of dense vectors. It provides algorithms that search in sets of vectors of any size, up to billions.
    • HNSW (Hierarchical Navigable Small World): A graph-based ANN algorithm that builds a multi-layer graph structure, allowing for fast searches by traversing through layers of increasing density. It's known for its good balance between search speed and accuracy.
  • K-Nearest Neighbor Language Models (KNN-LM):

    • Concept: A specific type of iterative RaLM that augments a standard language model by retrieving k-nearest neighbors (documents or context examples) from a datastore for every token prediction. The final next-token distribution is an interpolation between the base LM's prediction and a distribution derived from the retrieved neighbors' target tokens. This is highly retrieval-intensive.
    • Interpolation: Combining two probability distributions, typically by a weighted sum. For KNN-LM, it means blending the base LM's next-token probabilities with probabilities derived from the retrieved neighbors (a minimal sketch of dense retrieval plus this interpolation appears after this list).
  • Speculative Execution/Decoding:

    • Concept (General): Originating from computer architecture, speculative execution involves performing operations before they are confirmed to be necessary. If the speculation is correct, performance is boosted; if incorrect, the work is discarded, and the correct path is taken (rollback).
    • Speculative Decoding (for LLMs): A recent technique to speed up autoregressive LLM inference. A smaller, faster "draft" model speculatively generates a few tokens. These tokens are then verified in parallel by the larger, slower "main" model. If verified, the tokens are accepted; if not, the main model corrects them and regenerates. This aims to reduce the sequential bottleneck of autoregressive generation. RaLMSpec applies this concept not to token generation, but to the retrieval step.
  • Temporal and Spatial Locality (in systems):

    • Temporal Locality: The principle that if a particular piece of data is accessed, it is likely to be accessed again in the near future.
    • Spatial Locality: The principle that if a particular piece of data is accessed, data located nearby in memory (or in a database) is likely to be accessed soon.
    • RaLMSpec leverages these properties by caching recently retrieved documents locally, anticipating that they or nearby documents will be needed again.
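
To make the dense-retrieval and KNN-LM interpolation ideas above concrete, here is a minimal NumPy sketch (not the paper's implementation; the embedding sizes, datastore contents, and the interpolation weight `lam` are illustrative assumptions). It performs exact nearest-neighbor search by inner product, as an exact dense retriever would (the paper uses DPR with FAISS; an approximate retriever such as HNSW would replace the exhaustive search), and then interpolates a base LM distribution with a distribution built from the retrieved neighbors' recorded next tokens, in the spirit of KNN-LM.

```python
import numpy as np

def exact_dense_retrieve(query_emb, doc_embs, k):
    """Exhaustive inner-product search, i.e., what an exact dense retriever does."""
    scores = doc_embs @ query_emb                  # similarity of query to every key
    topk = np.argsort(-scores)[:k]                 # indices of the k best entries
    return topk, scores[topk]

def knn_lm_next_token_dist(p_lm, neighbor_scores, neighbor_targets,
                           vocab_size, lam=0.25, temp=1.0):
    """Interpolate the base LM distribution with a kNN distribution (KNN-LM style)."""
    w = np.exp(neighbor_scores / temp)             # softmax over neighbor similarities
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, neighbor_targets, w)          # scatter weights onto next tokens
    return lam * p_knn + (1.0 - lam) * p_lm        # weighted mixture of distributions

# Toy usage with random data (all dimensions are arbitrary).
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 64))             # datastore key embeddings
targets = rng.integers(0, 50, size=1000)           # next token recorded for each key
query = rng.normal(size=64)                        # current-context embedding
p_lm = np.full(50, 1 / 50)                         # base LM next-token distribution
idx, sc = exact_dense_retrieve(query, doc_embs, k=16)
p = knn_lm_next_token_dist(p_lm, sc, targets[idx], vocab_size=50)
assert abs(p.sum() - 1.0) < 1e-6                   # result is a valid distribution
```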

3.2. Previous Works

The paper positions RaLMSpec within the context of advancements in RaLM and efficient serving.

  • Early RaLM Pioneers:

    • Guu et al. (2020) proposed the foundational idea of Retrieval Augmented Language Model pre-training (REALM), which inspired many subsequent works. This showed the potential of augmenting LMs with external knowledge.
    • Lewis et al. (2020) introduced Retrieval-Augmented Generation (RAG), which demonstrated how pre-trained seq2seq models could be augmented with a DPR retriever to achieve strong performance on knowledge-intensive NLP tasks. Many existing RaLM approaches build on RAG's principles.
  • One-shot RaLM Approaches: These methods perform retrieval once before generation. Examples include works by Shi et al. (2023), Park et al. (2023), Wang et al. (2023a), Zhu et al. (2023), Rubin & Berant (2023), Wang et al. (2023b), Zhou et al. (2023). While effective, they are limited when the required information changes during long generation processes.

  • Naive Iterative RaLM Approaches: These approaches retrieve regularly during generation.

    • Ram et al. (2023), Lewis et al. (2020), Jiang et al. (2023), Borgeaud et al. (2022), Khattab et al. (2022) are examples where the LM constantly queries an external database with the latest context. These methods use the retrieved information either by directly concatenating it to the prompt or via intermediate layer cross-attention.
    • Khandelwal et al. (2019) introduced K-Nearest Neighbour Language Models (KNN-LM), which performs retrieval for every single token generated. This extreme iteration achieves high quality but makes inference prohibitively expensive, retrieving up to 1024 documents per token. Drozdov et al. (2022) further explore KNN-LM.
  • Efficient Iterative RaLM Serving (Related to RaLMSpec):

    • Alon et al. (2022) proposed a method for KNN-LM serving that reduces calls to the external knowledge base by using a pre-computed automaton state when full retrieval is unnecessary.
      • Differentiation: A crucial distinction is that Alon et al. (2022) is not guaranteed to preserve the same model output and thus might compromise generation quality. RaLMSpec explicitly guarantees output preservation, which is a key advantage.
  • Speculative Inference for LLMs:

    • Recent works have adapted the concept of speculative decoding to accelerate LLM serving (Leviathan et al., 2022; Stern et al., 2018; Chen et al., 2023; Miao et al., 2023; Xia et al.; Joao Gante, 2023; Yang et al., 2023). These methods use a faster draft model to propose tokens, which are then verified by the main model in parallel.
    • Differentiation: The authors emphasize that RaLMSpec is the first work to incorporate speculative retrieval in RaLM serving, and it is orthogonal to speculative inference techniques for LLMs. This means RaLMSpec can potentially be combined with LLM speculative decoding for even greater speed-ups.

3.3. Technological Evolution

The field has evolved from:

  1. Purely Parametric LMs: Relying solely on internal knowledge, often requiring expensive retraining for new information.
  2. One-shot RaLM: A first step towards external knowledge integration, but limited in dynamic contexts.
  3. Iterative RaLM: Enhanced quality through dynamic interaction, but at a high computational cost due to frequent, sequential retrieval.
  4. Optimized Iterative RaLM Serving (RaLMSpec): This paper represents a step towards making high-quality iterative RaLMs practical by directly addressing their primary bottleneck—retrieval overhead—through system-level optimizations that guarantee output correctness.

3.4. Differentiation Analysis

Compared to prior work, RaLMSpec offers several core innovations and differentiators:

  • Speculation Applied to Retrieval: While speculative decoding is known for LLMs, RaLMSpec uniquely applies the speculation paradigm to the retrieval component of RaLMs. This is a novel angle for optimization.
  • Guaranteed Output Preservation: Unlike some prior acceleration methods for RaLMs (e.g., Alon et al., 2022), RaLMSpec is designed to be lossless, meaning it produces identical outputs to the unoptimized iterative RaLM. This is a critical feature for applications where correctness and fidelity to the original model's behavior are paramount.
  • Generic Applicability: RaLMSpec is designed as a generic framework applicable to various iterative RaLM approaches, different language models, and diverse retriever types (sparse, exact dense, approximate dense).
  • Comprehensive Optimization Suite: Beyond the core speculative retrieval and batched verification, the integration of cache prefetching, optimal speculation stride scheduler, and asynchronous verification provides a holistic approach to maximize performance gains.
  • Leveraging Locality: The explicit exploitation of temporal and spatial locality of retrieved documents via a local cache is a key design choice that makes the speculative retrieval highly effective.

4. Methodology

4.1. Principles

The core idea behind RaLMSpec is to address the inefficiency of frequent, sequential retrieval steps in iterative RaLMs by leveraging speculation and batched processing. The primary intuition is rooted in the observation that, during the generation process of iterative RaLMs, the same or closely related documents from the knowledge base are often retrieved repeatedly. This phenomenon is analogous to temporal and spatial locality found in computer systems, where recently accessed data or data near previously accessed data is likely to be accessed again.

Building on this, RaLMSpec employs a caching-based mechanism for speculative retrieval. Instead of immediately querying the expensive external knowledge base for every retrieval request, it first attempts to retrieve documents from a fast, local cache. These "speculated" documents are then used by the language model to generate tokens. To ensure the correctness of the final output, after a certain number of speculative steps, a batched verification step is performed. In this step, the original queries corresponding to the speculative retrievals are sent to the external knowledge base in a single batch. This batched retrieval is significantly more efficient than issuing individual queries sequentially, exploiting parallelism inherent in modern retrieval systems. If any speculated document does not match the ground truth retrieved during verification, the system "rolls back" to the point of mismatch and regenerates the output using the correct documents, thereby preserving the original model's quality.

4.2. Core Methodology In-depth (Layer by Layer)

The RaLMSpec pipeline, as described in Algorithm 1, meticulously orchestrates speculative retrieval, language model generation, and batched verification to achieve speed-up while ensuring correctness.

Algorithm 1 RaLMSpec Pipeline

1: Input: input tokens X = {x_0, x_1, ..., x_{t-1}}, external corpus C, language model f(·)
2: Output: RaLM generated outputs
3: Initialize local cache Q = {}, speculation stride s, model generation stride k
4: q = encode(X); Q.insert(C.retrieve(q))  ▹ cache prefetching
5: while EOS not in X do
6:   for i = 1 to s do
7:     q_i = encode(X); d̂_i = Q.retrieve(q_i)  ▹ speculative retrieval
8:     X̂_i = f(X, d̂_i, k)  ▹ model generation step that generates k new tokens
9:     X = [X, X̂_i]
10:  end for
11:  d_1, ..., d_s = C.retrieve(q_1, ..., q_s)  ▹ batched verification
12:  m = argmin_i (d̂_i ≠ d_i)
13:  if m ≤ s then  ▹ correction if needed
14:    roll X back to the m-th speculation step
15:    X̂ = f(X, d_m, k)
16:    X = [X, X̂]
17:  end if
18: end while

Let's break down each step:

  1. Initialization (Line 3):

    • Input tokens X = {x_0, x_1, ..., x_{t-1}}: The initial prompt or query provided to the RaLM.
    • external corpus C: The large, non-parametric knowledge base (e.g., Wikipedia) from which documents are retrieved.
    • language model f(.): The parametric language model (e.g., GPT2, OPT, LLaMA-2) responsible for generating text.
    • local cache Q = {}: An empty, fast, request-specific cache that will store recently retrieved documents for speculative access.
    • speculation stride s: A hyperparameter determining the number of consecutive speculative retrieval and generation steps before a verification step is triggered.
    • model generation stride k: The number of new tokens generated by the language model in a single generation step.
  2. Initial Cache Prefetching (Line 4):

    • q = encode(X): The initial input tokens X are encoded into a query embedding q. This encode function typically transforms the textual input into a vector representation suitable for the retriever.
    • C.retrieve(q): The first actual retrieval from the external knowledge base C is performed using the initial query q. This ensures the local cache is not empty at the very beginning and also serves as an initial prefetching step.
    • Q.insert(C.retrieve(q)): The retrieved documents are inserted into the local cache Q. This populates the cache with relevant documents immediately.
  3. Main Generation Loop (Lines 5-18):

    • while EOS not in X do: The generation process continues iteratively until an End Of Sequence (EOS) token is generated, indicating the completion of the response.
  4. Speculative Retrieval and Generation Loop (Lines 6-10):

    • for i = 1 to s do: This loop executes s (the speculation stride) consecutive speculative steps.
    • q_i = encode(X): In each step i, the current context X (including previously generated tokens) is encoded into a new query q_i. This q_i represents the context-dependent query.
    • d̂_i = Q.retrieve(q_i): This is the speculative retrieval step. Instead of querying the expensive external corpus C, RaLMSpec attempts to retrieve documents d̂_i from the fast local cache Q using the query q_i. The local cache acts like a mini-retriever, using the same scoring metric as the original retriever but on a much smaller set of documents.
    • X̂_i = f(X, d̂_i, k): The language model f(.) then generates k new tokens (X̂_i) using the current context X and the speculatively retrieved documents d̂_i. This is the model generation step.
    • X = [X, X̂_i]: The newly generated tokens X̂_i are appended to the overall generated sequence X.
  5. Batched Verification (Line 11):

    • d_1, ..., d_s = C.retrieve(q_1, ..., q_s): After s speculative steps, a batched verification is performed. All s context-dependent queries (q_1 through q_s) that were used during the speculative phase are now sent simultaneously (as a batch) to the external corpus C. This exploits the parallelism capabilities of the retrieval system, making it much faster than s sequential retrievals. The ground truth documents d_1, ..., d_s are retrieved.
  6. Mismatch Detection (Line 12):

    • m = argmin_i (d̂_i ≠ d_i): This step identifies the first position m (from 1 to s) where a speculated document d̂_i mismatches the ground truth document d_i retrieved from the knowledge base during verification. If all documents match, m would be greater than s.
  7. Correction if Needed (Lines 13-17):

    • if m <= s then: If a mismatch is detected (i.e., m is within the stride s), a correction mechanism is triggered to ensure output correctness.
    • Roll X back to the m-th speculation step: The generated tokens X are rolled back to the state just before the m-th speculative step where the first mismatch occurred. All tokens generated based on incorrect speculation from this point onwards are discarded.
    • X̂ = f(X, d_m, k): The language model f(.) then regenerates k tokens (X̂) using the correct ground truth document d_m (from the batched verification) and the rolled-back context X.
    • X = [X, X̂]: The correctly generated tokens are appended. The process continues from this corrected state.
    • Local Cache Update: While not explicitly shown in Algorithm 1, the text describes that the local cache Q is updated with the documents d_1, ..., d_s retrieved during the verification step. This can be either top-1 (only the most relevant document) or top-k (multiple top documents), where the top-k update is referred to as prefetching (Figure 2). This keeps the local cache up-to-date and improves future speculation success rates.
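
The control flow above can be summarized in a short Python sketch. This is a simplified illustration of Algorithm 1, not the authors' code: encode, lm_generate, corpus_retrieve_batch, and the cache object are placeholder callables, and the cache update after verification is reduced to a top-1 insert.

```python
def ralmspec_generate(prompt_tokens, encode, cache, corpus_retrieve_batch,
                      lm_generate, s=3, k=4, eos=0, max_len=128):
    """Speculative retrieval with batched verification (sketch of Algorithm 1)."""
    X = list(prompt_tokens)
    # Initial retrieval also warms up the local cache (cache prefetching).
    cache.insert(corpus_retrieve_batch([encode(X)])[0])

    while eos not in X and len(X) < max_len:
        queries, spec_docs, checkpoints = [], [], []
        for _ in range(s):                      # s consecutive speculative steps
            checkpoints.append(len(X))          # remember where to roll back to
            q = encode(X)
            d_hat = cache.retrieve(q)           # speculative retrieval from local cache
            X += lm_generate(X, d_hat, k)       # generate k tokens with speculated doc
            queries.append(q)
            spec_docs.append(d_hat)

        truth = corpus_retrieve_batch(queries)  # one batched verification call
        for i, (d_hat, d) in enumerate(zip(spec_docs, truth)):
            cache.insert(d)                     # keep the cache up to date (top-1 here)
            if d_hat != d:                      # first mismatch: roll back and correct
                X = X[:checkpoints[i]]
                X += lm_generate(X, d, k)
                break
    return X
```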

Figure 1 illustrates this workflow:

  • Figure 1(a) (Existing Iterative RaLM): Shows sequential LM Generation and Retrieval steps. Queries q_0, q_1, q_2 are issued sequentially, leading to high overhead.

  • Figure 1(b) (RaLMSpec Overview): Demonstrates speculative retrieval steps (①, ③, ⑤) from the local cache, followed by a batched verification step (⑥) using the external knowledge base.

  • Figure 1(c) (Timeline Comparison): Highlights how RaLMSpec significantly reduces latency by replacing multiple sequential retrievals with faster speculative retrievals and one efficient batched verification.

Original caption: Figure 1: {q_0, q_1, q_2} denotes context-dependent query embeddings and A, B, C are document entries. Figure 1(a) shows the workflow of existing iterative RaLM, which suffers from high retrieval overhead. Figure 1(b) shows an overview of RaLMSpec, which enables faster speculative retrieval steps (①, ③, ⑤) followed by a batched verification step (⑥) to guarantee correctness. Consequently, RaLMSpec achieves a lower latency while preserving model quality as shown in Figure 1(c).

Speculative Retrieval Details (Figure 2): A key aspect for the effectiveness of speculative retrieval is that for most dense and sparse retrievers, the relative ranking of documents is preserved. This means if the top-ranked document in the large external knowledge base is present in the local cache for a given query, it will also be ranked at the top when retrieving from the local cache using the same metric. This property, combined with temporal and spatial locality (i.e., relevant documents often reappear or are near previously retrieved ones), significantly boosts the speculation success rate.

Original caption: Figure 2: For speculative retrieval, we maintain a local cache for each request and use the same scoring metric as the original retriever to rank the entries within the local cache for a given query. In the verification step, we populate the local cache with either the top-1 or top-k retrieved documents from the knowledge base, where the latter is referred to as prefetching.

Batched Verification and Prefetching: The efficiency gain from batched retrieval is significant because retrieving nn queries in parallel is generally much faster than nn sequential retrievals, as empirically shown in Appendix A.1 (Figure 6). During verification, the local cache is populated not just with the top-1 ground truth document, but potentially with top-k documents. This top-k cache update is referred to as prefetching, aiming to proactively fetch more relevant entries into the local cache to further increase the speculation success rate in subsequent steps.
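
A minimal sketch of such a request-level cache is shown below, assuming inner-product scoring so that cached entries are ranked with the same metric as the dense retriever; prefetch_k controls whether verification inserts only the top-1 document or the top-k (prefetching). The class and the corpus.batch_search call are illustrative assumptions, not the paper's API.

```python
import numpy as np

class LocalCache:
    """Per-request cache that speculates by re-ranking its cached documents."""
    def __init__(self):
        self.doc_ids, self.doc_embs = [], []

    def insert(self, docs):
        """docs: list of (doc_id, embedding) pairs, e.g. top-k results from verification."""
        for doc_id, emb in docs:
            if doc_id not in self.doc_ids:          # avoid duplicate cache entries
                self.doc_ids.append(doc_id)
                self.doc_embs.append(np.asarray(emb))

    def retrieve(self, query_emb):
        """Rank cached entries with the retriever's own metric (inner product here)."""
        scores = np.stack(self.doc_embs) @ query_emb
        return self.doc_ids[int(np.argmax(scores))]

def verify_and_update(cache, corpus, queries, prefetch_k=20):
    """Batched verification; populating the cache with top-k results is prefetching."""
    results = corpus.batch_search(queries, k=prefetch_k)   # hypothetical corpus API
    ground_truth = [hits[0][0] for hits in results]        # top-1 document id per query
    for hits in results:
        cache.insert(hits)                                  # top-k update = prefetching
    return ground_truth
```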

Asynchronous Verification (Figure 3): To further reduce latency, RaLMSpec can employ asynchronous verification. Instead of the system stalling during the verification step, an additional speculation step can be launched concurrently (asynchronously) with the verification of the previous steps.

  • If the verification succeeds, the model can continue generating based on the asynchronously speculated tokens, effectively hiding the verification latency.

  • If the verification fails, the model will roll back to the mismatch point, discard the asynchronously generated tokens, and regenerate using the correct information. This technique is particularly beneficial when the verification latency is shorter than the language model's decoding latency.
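
This overlap can be expressed with a background thread, as in the hedged sketch below (helper names are illustrative): the batched verification is submitted to a worker thread while the main thread performs one more speculative step, and the extra tokens are kept only if verification finds no mismatch.

```python
from concurrent.futures import ThreadPoolExecutor

def speculate_with_async_verification(X, queries, spec_docs, checkpoints,
                                      corpus_retrieve_batch, speculate_one_step):
    """Overlap one extra speculation step with verification of the prior steps."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Launch the batched verification in the background.
        future = pool.submit(corpus_retrieve_batch, queries)
        # Meanwhile, keep generating with one additional speculative step.
        X = speculate_one_step(X)
        truth = future.result()                 # wait for verification to finish

    for i, (d_hat, d) in enumerate(zip(spec_docs, truth)):
        if d_hat != d:
            # Mismatch: discard the extra speculation too and roll back to step i;
            # the caller regenerates from the ground-truth document d.
            return X[:checkpoints[i]], d, i
    # All speculated documents were correct: keep the extra speculation's tokens.
    return X, None, None
```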

Original caption: Figure 3: Asynchronous verification obtains latency saving by hiding the verification latency behind a valid speculation step. In case a mismatch is detected between the speculated document and ground truth document, the language model will regenerate outputs using the ground truth document.

4.3. Optimal Speculation Stride Scheduler ($OS^3$)

The speculation stride s is a crucial hyperparameter that directly impacts the trade-off between speculation overhead (the cost of incorrect speculation and rollback) and retrieval saving (the benefit of batched retrieval). A large s might lead to high overhead if speculation fails early, while a small s might not fully exploit the benefits of speculation. The optimal s varies with the language model, retriever type, and speculation accuracy.

Instead of manual tuning, RaLMSpec introduces the Optimal Speculation Stride Scheduler ($OS^3$) to adaptively determine the best s. The goal is to maximize the expected number of documents verified successfully per unit time.

Let:

  • a: Latency of a single speculation step (speculative retrieval + language model decoding).

  • b: Latency of a single verification step.

  • d_i: Ground truth document retrieved from the corpus at step i.

  • $\hat{d}_i$: Speculated document retrieved from the local cache at step i.

  • $\gamma(X)$: Speculation accuracy, defined as the probability that $d_i = \hat{d}_i$ given the current context X, i.e., $P(d_i = \hat{d}_i \mid X)$. This is assumed to be constant for all $i \in [s]$.

    Expected Number of Matched Documents: If speculation succeeds for the first $i-1$ steps and first fails at step $i$, then $i$ documents are effectively verified (the $i-1$ correct speculations plus the corrected one); if all $s$ speculations succeed, $s$ documents are verified. The expected number of verified documents within a stride $s$ is therefore: $ \mathbb{E}[\text{# of verified documents} \mid X, s] = \sum_{i=0}^{s-1} \gamma(X)^i = \frac{1 - \gamma(X)^s}{1 - \gamma(X)} $

Objective Function for Synchronous Verification: For synchronous verification, the total latency for s speculation steps is $sa + b$. The objective is to maximize the expected number of verified documents per unit time: $ \text{Maximize } \frac{\mathbb{E}[\text{# of verified documents} \mid X, s]}{\text{Latency}} = \frac{\frac{1 - \gamma(X)^s}{1 - \gamma(X)}}{sa + b} = \frac{1 - \gamma(X)^s}{(1 - \gamma(X))(sa + b)} $

Objective Function for Asynchronous Verification: For asynchronous verification, the expected latency calculation is more complex:

  • With probability $\gamma(X)^s$ (all s speculations succeed), the latency is $(s-1)a + \max(a, b)$. This is because the last speculation step's LM decoding can overlap with the verification, and we take the maximum of their latencies.
  • With probability $1 - \gamma(X)^s$ (at least one mismatch occurs), there is no gain from asynchronous verification, and the latency reverts to the synchronous case: $sa + b$. Therefore, the expected latency for asynchronous verification is: $ \text{Expected Latency} = \gamma(X)^s ((s-1)a + \max(a, b)) + (1 - \gamma(X)^s) (sa + b) $ The objective function to maximize becomes: $ \text{Maximize } \frac{1 - \gamma(X)^s}{(1 - \gamma(X))[\gamma(X)^s ((s-1)a + \max(a, b)) + (1 - \gamma(X)^s) (sa + b)]} $ $OS^3$ continuously solves for the optimal s by estimating a, b, and $\gamma(X)$.

Parameter Estimation for $OS^3$:

  • a and b (Latencies): These are estimated by profiling the actual running times of the most recent speculation steps (a) and verification steps (b). The paper notes that for EDR and SR, batched retrieval latency is nearly constant for small batch sizes, while for ADR, it scales linearly but with a significant intercept, still making batched retrieval efficient. The effect of batch size on latency per query is detailed in Appendix A.1 (Figure 6).
  • $\gamma(X)$ (Speculation Accuracy): This is estimated using maximum likelihood estimation (MLE) over a window of the w most recent verification steps, so that the estimate reflects recent performance while remaining stable. Let s(t) be the speculation stride (also the batch size) in the t-th most recent verification step, and M(s(t), X) the corresponding number of matched documents in that step. The estimate $\hat{\gamma}(X)$ is: $ \hat{\gamma}(X) = \frac{\sum_t M(s(t), X)}{\sum_t M(s(t), X) + \sum_t \mathbb{1}(M(s(t), X) < s(t))} $ where $\mathbb{1}(\cdot)$ is the indicator function, which is 1 if the condition is true (i.e., a mismatch occurred in that verification step) and 0 otherwise. The numerator sums the matched documents across the window, and the denominator adds the number of verification steps that contained a mismatch. To prevent overly optimistic estimates and division-by-zero errors when $\hat{\gamma}$ approaches 1, an upper bound $\gamma_{max}$ is set and $\hat{\gamma}$ is truncated accordingly.
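
A compact sketch of the $OS^3$ decision rule is given below, assuming the profiled latencies a and b and the windowed estimate of $\gamma$ described above; the candidate stride range and the bookkeeping format of the verification history are illustrative assumptions rather than the paper's exact procedure.

```python
def estimate_gamma(history, gamma_max=0.6):
    """MLE-style estimate of speculation accuracy over a window of verification steps.

    history: list of (stride, matched) pairs for the w most recent verifications."""
    matched = sum(m for _, m in history)
    mismatches = sum(1 for s, m in history if m < s)
    if matched + mismatches == 0:
        return gamma_max
    return min(matched / (matched + mismatches), gamma_max)   # truncate at gamma_max

def expected_docs_per_second(s, a, b, gamma, asynchronous=False):
    """OS^3 objective: expected verified documents per unit time for stride s."""
    expected_docs = (1 - gamma ** s) / (1 - gamma)
    if asynchronous:
        latency = gamma ** s * ((s - 1) * a + max(a, b)) + (1 - gamma ** s) * (s * a + b)
    else:
        latency = s * a + b
    return expected_docs / latency

def optimal_stride(a, b, history, s_candidates=range(1, 17), asynchronous=False):
    """Pick the stride that maximizes the expected verification throughput."""
    gamma = estimate_gamma(history)
    return max(s_candidates,
               key=lambda s: expected_docs_per_second(s, a, b, gamma, asynchronous))

# Example: slow retrieval (b >> a) and decent accuracy favor a larger stride.
print(optimal_stride(a=0.05, b=0.8, history=[(3, 3), (3, 2), (4, 4), (4, 3), (3, 3)]))
```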

5. Experimental Setup

5.1. Datasets

For evaluating RaLMSpec on knowledge-intensive open-domain question-answering (QA) tasks, the following datasets were used:

  • Wiki-QA (Yang et al., 2015): A QA dataset where questions are typically related to factual information found on Wikipedia. The answers are short phrases or sentences extracted from Wikipedia articles.

  • Web Questions (Berant et al., 2013): A QA dataset consisting of questions posed by real users to a Google search engine. Answers are usually short entities or facts from Freebase.

  • Natural Questions (Kwiatkowski et al., 2019): A large-scale QA dataset composed of real Google search queries and answers derived from Wikipedia. It includes both short and long answers.

  • Trivia QA (Joshi et al., 2017): A challenging QA dataset that contains questions from trivia websites, with supporting evidence from Wikipedia and other web sources. It tests deep understanding and retrieval capabilities.

    For all these QA tasks, the Wikipedia corpus (Chen et al., 2017) was used as the external knowledge base. This corpus is widely used in open-domain QA and is effective for validating methods that rely on external knowledge retrieval.

For KNN-LM evaluation, the WikiText-103 dataset (Merity et al., 2016) was used. This is a large corpus of text extracted from the set of "Good" and "Featured" articles on Wikipedia. It is specifically designed for language modeling tasks and was the same dataset used in the original KNN-LM work (Khandelwal et al., 2019), making it suitable for direct comparison.

The choice of these datasets is appropriate because they are standard benchmarks for knowledge-intensive NLP tasks, specifically open-domain QA and language modeling. They represent diverse question types and data characteristics, allowing for comprehensive validation of RaLMSpec's performance across different scenarios.

5.2. Evaluation Metrics

The paper primarily focuses on speed-up ratio as its evaluation metric, which quantifies the reduction in latency (or increase in throughput) compared to a baseline. While the paper doesn't explicitly provide a formula for speed-up ratio, it is standardly calculated as:

  1. Speed-up Ratio

    • Conceptual Definition: The speed-up ratio measures how much faster a task can be completed using an optimized method compared to a baseline method. It is a direct indicator of performance improvement in terms of execution time.
    • Mathematical Formula: $ \text{Speed-up Ratio} = \frac{\text{Latency}_{\text{Baseline}}}{\text{Latency}_{\text{Optimized}}} $ or equivalently, if measuring throughput: $ \text{Speed-up Ratio} = \frac{\text{Throughput}_{\text{Optimized}}}{\text{Throughput}_{\text{Baseline}}} $
    • Symbol Explanation:
      • $\text{Latency}_{\text{Baseline}}$: The time taken to complete a task using the original, unoptimized baseline method.

      • $\text{Latency}_{\text{Optimized}}$: The time taken to complete the same task using the proposed, optimized method.

      • $\text{Throughput}_{\text{Optimized}}$: The amount of work done per unit time by the optimized method.

      • $\text{Throughput}_{\text{Baseline}}$: The amount of work done per unit time by the baseline method.

        Additionally, the paper mentions that KNN-LM aims to improve the perplexity of the base language model. While RaLMSpec guarantees to preserve output quality, perplexity is a common metric for language models that the original KNN-LM work optimizes.

  2. Perplexity (PPL)

    • Conceptual Definition: Perplexity is a measure of how well a probability model predicts a sample. In natural language processing, it's often used to evaluate language models. A lower perplexity indicates a better model, as it means the model is less "perplexed" (more confident and accurate) when predicting the next word in a sequence. It is the exponentiated average negative log-likelihood of a sequence.
    • Mathematical Formula: For a test set $W = (w_1, w_2, \ldots, w_N)$, the perplexity (PPL) is defined as: $ \text{PPL}(W) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}} $ This can be rewritten using the product rule of probability and logarithms: $ \text{PPL}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, \ldots, w_{i-1})\right) $
    • Symbol Explanation:
      • W: A sequence of N words (tokens) in the test set.

      • $w_i$: The i-th word in the sequence.

      • N: The total number of words in the sequence.

      • $P(w_1, w_2, \ldots, w_N)$: The joint probability of the entire sequence, according to the language model.

      • $P(w_i | w_1, \ldots, w_{i-1})$: The probability of the i-th word, conditioned on all preceding words, as predicted by the language model.

      • $\log$: The natural logarithm.

      • $\exp(\cdot)$: The exponential function (e raised to the power of its argument).

        The main body of the RaLMSpec paper focuses solely on latency reduction and speed-up, explicitly stating that it "preserves the same model outputs." This implies that metrics like perplexity or factual accuracy would be identical to the baseline, hence the focus on efficiency metrics.
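
For reference, both metrics can be computed with a few lines of Python; the numbers below are toy values, not results from the paper.

```python
import math

def speedup_ratio(latency_baseline, latency_optimized):
    """Speed-up = baseline latency / optimized latency."""
    return latency_baseline / latency_optimized

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Toy example: a 4.8 s baseline vs. a 2.0 s optimized run gives a 2.4x speed-up,
# and a model assigning probability 0.25 to every token has perplexity 4.
print(speedup_ratio(4.8, 2.0))              # 2.4
print(perplexity([math.log(0.25)] * 10))    # 4.0
```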

5.3. Baselines

The paper compares RaLMSpec against two primary baselines, tailored to the specific serving tasks:

  1. Naive Iterative RaLM Serving (RaLMSeq):

    • Description: This baseline directly implements the standard iterative RaLM approach, following the design described by Ram et al. (2023). In this setup, a retrieval operation is triggered every four tokens generated by the language model. The most recently retrieved document chunk is then prepended to the prompt, replacing any previously retrieved documents, to inform the next generation step.
    • Representativeness: This is a representative baseline for typical iterative RaLM deployments, demonstrating the direct overhead incurred by frequent, sequential retrieval steps without any optimization.
  2. KNN-LM Serving Baseline:

    • Description: For KNN-LM serving, the baseline is the direct implementation from the original KNN-LM work by Khandelwal et al. (2019). In this highly retrieval-intensive scenario, a retrieval operation is performed for every single token generated by the language model.

    • Representativeness: This baseline represents the unoptimized, highest-overhead version of KNN-LM, which is known for its strong performance but prohibitive inference costs due to its granular retrieval pattern.

      Retrievers Used for Baselines and RaLMSpec: To demonstrate generality and ensure fair comparison, both RaLMSpec and the baselines were tested with various retriever types:

  • Exact Dense Retriever (EDR): Dense Passage Retriever (DPR) (Karpukhin et al., 2020), which is highly accurate but computationally expensive.

  • Approximate Dense Retriever (ADR): DPR-HNSW (Malkov & Yashunin, 2018), a faster but less accurate approximate version of DPR using Hierarchical Navigable Small World graphs.

  • Sparse Retriever (SR): BM25 (Robertson et al., 2009), a classic lexical matching retriever.

    The implementations for all retrievers were based on Pyserini (Lin et al., 2021), with dense retrievers leveraging the FAISS library (Johnson et al., 2019) for efficient similarity search.

Implementation Details for Experiments:

  • Maximum Lengths: Maximum input prompt length was set to 512 tokens, and maximum generation length to 128 tokens. For naive iterative RaLM, the retrieved document chunk length was 256.

  • Speculation Stride (s): When $OS^3$ (Optimal Speculation Stride Scheduler) was disabled, RaLMSpec used a constant stride of s = 3. When $OS^3$ was enabled, it initialized s = 1 and adapted dynamically.

  • $OS^3$ Parameters: Window size w = 5 and $\gamma_{max} = 0.6$ for estimating speculation accuracy.

  • Prefetching: Default prefetch size of 20. Ablation studies also used 256.

  • Asynchronous Verification: Due to Python's Global Interpreter Lock (GIL), the latency of asynchronous verification was simulated to reflect its ideal running time without additional overhead. The authors clarify that it is not a primary speed-up factor.

  • Hardware: All experiments (except the asynchronous verification simulation) were run on a VM.GPU.A10 instance on Oracle Cloud, equipped with one A10 GPU and 15 CPUs.

  • Evaluation Protocol: For the QA datasets, 100 randomly selected questions were used, and results were averaged over five independent runs to report the mean and standard deviation.

6. Results & Analysis

6.1. Core Results Analysis

The experiments extensively evaluate RaLMSpec across various language models, retrievers, and datasets, demonstrating its effectiveness in reducing serving latency for iterative RaLMs.

Naive Iterative RaLM Serving (Figure 4): The main results for naive iterative RaLM serving are presented in Figure 4, comparing RaLMSeq (baseline), RaLMSpec (base version), and RaLMSpec+PSA (RaLMSpec with Prefetching, Optimal Speculation Stride Scheduler, and Asynchronous verification). The overall latency is decomposed into language model generation latency (G) and retrieval latency (R).

Original caption: Figure 4: Latency comparison between RaLMSeq, RaLMSpec, and RaLMSpec+PSA on GPT2-medium, OPT-1.3B, and LLaMA-2-7B over four QA datasets with three different types of retrievers, where EDR, ADR, SR stand for exact dense retriever, approximate dense retriever, and sparse retriever respectively. We decompose the overall latency into the language model generation latency (G) and retrieval latency (R) to demonstrate the trade-off.

  • Exact Dense Retriever (EDR):
    • RaLMSpec+PSA achieves the most significant speed-ups: 2.39x for GPT2, 2.33x for OPT, and 1.75x for LLaMA-2 compared to RaLMSeq.
    • Analysis: EDR typically has high retrieval latency (R). Since RaLMSpec primarily optimizes the retrieval component, it can significantly reduce this bottleneck. The portion of the latency representing retrieval (R) is substantially reduced in RaLMSpec+PSA compared to RaLMSeq, while generation latency (G) remains largely similar. This confirms that RaLMSpec is most impactful when retrieval is the dominant bottleneck.
  • Approximate Dense Retriever (ADR):
    • RaLMSpec+PSA achieves speed-ups of 1.05x for GPT2, 1.39x for OPT, and 1.04x for LLaMA-2.
    • Analysis: The speed-ups are more modest than with EDR. Because ADR is inherently faster than EDR, the retrieval latency (R) is a smaller fraction of the overall end-to-end latency. Thus, even if RaLMSpec reduces R significantly, the overall speed-up is limited by the relatively larger proportion of LM generation latency (G). The naive RaLMSpec (without $OS^3$ or prefetching) sometimes performs worse than the baseline, indicating that a constant, non-optimal speculation stride can lead to speculation overhead that outweighs the gains when retrieval is not the primary bottleneck. This underscores the importance of $OS^3$.
  • Sparse Retriever (SR):
    • RaLMSpec+PSA achieves speed-ups of 1.53x for GPT2, 1.77x for OPT, and 1.31x for LLaMA-2.
    • Analysis: Similar to ADR, SRs are often faster than EDRs. The speed-ups fall between those of EDR and ADR, indicating good performance, but again the overall gain is bounded by the proportion of retrieval latency in the total latency. The observation about the naive RaLMSpec's performance (sometimes worse than the baseline) also applies here, reinforcing the need for adaptive stride scheduling.

Overall Analysis: RaLMSpec+PSA (with all components enabled) consistently achieves the best performance across all scenarios. The results highlight that RaLMSpec's speed-up is intrinsically tied to the ratio of retrieval latency to total latency: the more dominant the retrieval latency, the greater the potential for RaLMSpec to accelerate the process. The adaptive Optimal Speculation Stride Scheduler ($OS^3$) is crucial, especially for faster retrievers (ADR, SR), to prevent excessive speculation overhead from early mismatches.

6.2. Data Presentation (Tables)

6.2.1. Ablation Study for Naive Iterative RaLM Serving (Table 1)

The following are the results from Table 1 of the original paper, where (*) marks the best and (**) the second-best result in each column:

| Retriever | Method | GPT2 | OPT | LLaMA-2 |
|---|---|---|---|---|
| EDR | RaLMSpec | 2.04× | 1.76× | 1.70× |
| EDR | RaLMSpec+P | 2.10× | 2.16× (**) | 1.75× (**) |
| EDR | RaLMSpec+S | 2.26× (**) | 2.15× | 1.69× |
| EDR | RaLMSpec+A | 2.03× | 1.74× | 1.74× |
| EDR | RaLMSpec+PSA | 2.39× (*) | 2.32× (*) | 1.75× (*) |
| ADR | RaLMSpec | 0.62× | 0.61× | 0.58× |
| ADR | RaLMSpec+P | 0.59× | 0.76× | 0.58× |
| ADR | RaLMSpec+S | 0.92× (**) | 1.17× (**) | 1.01× (**) |
| ADR | RaLMSpec+A | 0.66× | 0.46× | 0.55× |
| ADR | RaLMSpec+PSA | 1.05× (*) | 1.39× (*) | 1.04× (*) |
| SR | RaLMSpec | 1.34× | 1.18× | 0.97× |
| SR | RaLMSpec+P | 1.39× | 1.42× | 0.98× |
| SR | RaLMSpec+S | 1.32× | 1.52× (**) | 1.05× (**) |
| SR | RaLMSpec+A | 1.41× (**) | 1.27× | 1.01× |
| SR | RaLMSpec+PSA | 1.53× (*) | 1.77× (*) | 1.31× (*) |

Analysis of Table 1: This table shows the average speed-up ratio of different combinations of RaLMSpec components compared to the RaLMSeq baseline, averaged over the four QA datasets.

  • RaLMSpec (base version, with fixed s = 3): Performs well with EDR (1.70x to 2.04x), but poorly with ADR (0.58x to 0.62x, i.e., a slowdown) and only marginally with SR (0.97x to 1.34x). This highlights how a constant speculation stride is suboptimal across different retriever types.
  • $OS^3$ (RaLMSpec+S): Enabling the Optimal Speculation Stride Scheduler (S) brings the largest individual gain for ADR and SR, and also performs very well for EDR. For ADR, it lifts performance from slowdowns (e.g., 0.62x for GPT2) to modest speed-ups (0.92x for GPT2, 1.17x for OPT, 1.01x for LLaMA-2). This validates the scheduler's importance in adaptively finding optimal strides, especially when retrieval latency is not overwhelmingly large.
  • Prefetching (RaLMSpec+P): Prefetching (P) generally improves performance, especially for EDR (e.g., 2.10x for GPT2, 2.16x for OPT), by caching more relevant entries and increasing speculation accuracy. However, for ADR and SR, its impact can be mixed or even negative (e.g., 0.59x for GPT2 with ADR).
  • Asynchronous Verification (RaLMSpec+A): Asynchronous verification (A) provides marginal improvements in most cases. For EDR, its contribution is minor. For SR, it yields the second-best individual speed-up for GPT2 (1.41x). The authors note its potential is not fully realized due to Python's GIL.
  • RaLMSpec+PSA (all components): Combining all three components (P, S, A) consistently yields the best performance (*) across all models and retriever types, demonstrating their synergistic effect.

6.2.2. Ablation Study on Prefetching Size (Table 2)

The following are the results from Table 2 of the original paper:

| Retriever | Method | GPT2 | OPT | LLaMA-2 |
|---|---|---|---|---|
| EDR | RaLMSpec+P (20) | 2.10× | 2.16× | 1.75× |
| EDR | RaLMSpec+P (256) | 2.15× | 1.72× | 1.63× |
| ADR | RaLMSpec+P (20) | 0.59× | 0.76× | 0.58× |
| ADR | RaLMSpec+P (256) | 0.67× | 0.25× | 0.34× |
| SR | RaLMSpec+P (20) | 1.39× | 1.42× | 0.98× |
| SR | RaLMSpec+P (256) | 1.02× | 0.93× | 0.84× |

Analysis of Table 2: This table compares the speed-up ratio when using different prefetching sizes (20 vs. 256) with RaLMSpec+P.

  • For EDR, increasing the prefetching size from 20 to 256 yields a slight improvement for GPT2 (2.10x to 2.15x) but a decrease for OPT (2.16x to 1.72x) and LLaMA-2 (1.75x to 1.63x).
  • For ADR and SR, increasing the prefetching size to 256 consistently leads to a significant decrease in performance (e.g., for OPT with ADR, it drops from 0.76x to 0.25x).
  • Analysis: While prefetching can increase speculation accuracy, a larger prefetching size also introduces higher retrieval overhead during the verification step. For faster retrievers (ADR, SR), or when the prefetching gain is marginal, this additional overhead can outweigh the benefits, leading to reduced performance or even slowdowns. The optimal prefetching size is therefore context-dependent.

6.2.3. KNN-LM Serving Results (Figures 5a and 5b)

The paper also evaluates RaLMSpec on the more retrieval-intensive KNN-LM serving task, where retrieval occurs for every generated token.
### 6.2.3. KNN-LM Serving Results (Figure 5)

The paper also evaluates `RaLMSpec` on the more retrieval-intensive `KNN-LM` serving task, where retrieval occurs for every token generation. The cache update rule and verification protocol are adapted: instead of matching document *sets*, success is defined by matching the `ground truth next token`, and the cache is populated with the $n=10$ entries directly following the currently retrieved item (exploiting `spatial locality`); a sketch of this adapted logic appears at the end of this subsection.

The following is the speedup chart for `KNN-LM` with an Exact Dense Retriever (Figure 5a from the original paper):

![Bar chart of speed-up ratio versus the number of nearest neighbors (k) for the exact dense retriever; performance varies across configurations as k grows, and the maximum speed-up of 7.59x is reached at k=1.](/files/papers/6929a07df1921d7e196830fe/images/5.jpg)

Speedup for RaLMSpec in kNN-LM with Exact Retriever on WikiText-103

The following is the speedup chart for `KNN-LM` with an Approximate Dense Retriever (Figure 5b from the original paper):

![Bar chart of speed-up ratio versus the number of nearest neighbors (k) for the approximate dense retriever; red, blue, and grey bars denote the $s=3$, $s=5$, and $OS^3$ settings, with the highest speed-up of about 2.45x at k=1.](/files/papers/6929a07df1921d7e196830fe/images/6.jpg)

Speedup for RaLMSpec in kNN-LM with Approximate Retriever on WikiText-103

**Analysis of Figures 5a and 5b**:

* **Exact Dense Retriever (EDR)**: `RaLMSpec` achieves remarkable speed-ups up to **7.59x** with $OS^3$ (at $k=1$). Even for a large $k=1024$ (meaning 1024 nearest neighbors are retrieved per token), it achieves **3.88x** acceleration. A larger fixed stride (e.g., $s=5$) generally performs better than a smaller one ($s=3$) for EDR, but $OS^3$ consistently finds the best performance. This shows `RaLMSpec` is extremely effective when retrieval is a severe bottleneck, as in `KNN-LM` with EDR.
* **Approximate Dense Retriever (ADR)**: `RaLMSpec` achieves speed-ups up to **2.45x** with $OS^3$ (at $k=1$). For $k=1024$, it still reaches **2.37x** acceleration. Here, the advantage of a larger fixed stride is not as consistent; $OS^3$ again proves its value by adapting to the optimal stride for various $k$ values. While the absolute speed-up is lower than for EDR, it is still significant given ADR's already faster retrieval.
* **Analysis**: The substantial speed-ups for `KNN-LM` (especially with EDR) highlight `RaLMSpec`'s ability to tackle highly retrieval-intensive workloads. The adapted verification protocol (matching tokens) and cache update rule (exploiting spatial locality with the $n=10$ next entries) are crucial for its success in this specific task. $OS^3$ remains vital for navigating the optimal trade-offs.
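As referenced above, here is a minimal, hypothetical sketch of the adapted speculation cache for KNN-LM serving. It is not the authors' implementation: the class and method names are invented, and the datastore is abstracted as an indexable list of `(key, next_token)` entries so that "the $n=10$ entries directly following the retrieved item" can be expressed as a slice.

```python
from collections import OrderedDict

class KNNLMSpeculationCache:
    """Illustrative request-level cache for speculative kNN-LM retrieval.

    Assumes the datastore is an indexable sequence of (key_vector, next_token)
    pairs, so spatial locality means "entries adjacent in datastore order".
    """

    def __init__(self, datastore, n_following: int = 10, capacity: int = 1024):
        self.datastore = datastore
        self.n_following = n_following
        self.capacity = capacity
        self.entries = OrderedDict()  # datastore_index -> next_token

    def speculate(self):
        """Return a speculated next token from the most recently cached entry."""
        if not self.entries:
            return None
        _, token = next(reversed(self.entries.items()))
        return token

    def update_after_verification(self, retrieved_index: int) -> None:
        """After the true retrieval, cache the n entries that follow it
        in datastore order (spatial locality)."""
        start, stop = retrieved_index, retrieved_index + self.n_following
        for idx in range(start, min(stop, len(self.datastore))):
            self.entries[idx] = self.datastore[idx][1]
            self.entries.move_to_end(idx)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry

def verify(speculated_token, true_token) -> bool:
    """kNN-LM verification: success iff the speculated token matches the
    token implied by the true retrieval."""
    return speculated_token == true_token
```

Only the "slice the next $n=10$ datastore entries" rule and the token-matching check come from the paper's description; the eviction policy and cache capacity are invented details for illustration.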
### 6.2.4. LLaMA-2-13B Serving Evaluation (Table 3)

The following are the results from Table 3 of the original paper:

<div class="table-wrapper"><table>
<thead>
<tr> <td>Retriever</td> <td>Wiki QA</td> <td>Web Questions</td> <td>Natural Questions</td> <td>Trivia-QA</td> </tr>
</thead>
<tbody>
<tr> <td>EDR</td> <td>1.70×</td> <td>1.85×</td> <td>1.73×</td> <td>1.78×</td> </tr>
<tr> <td>ADR</td> <td>1.03×</td> <td>1.04×</td> <td>1.02×</td> <td>1.03×</td> </tr>
<tr> <td>SR</td> <td>1.18×</td> <td>1.21×</td> <td>1.22×</td> <td>1.26×</td> </tr>
</tbody>
</table></div>

**Analysis of Table 3**: This table presents the $RaLMSpec+PSA$ speed-up ratios for a larger language model, `LLaMA-2-13B`, across the four QA datasets.

* **EDR**: $RaLMSpec+PSA$ still achieves significant speed-ups, up to **1.85x** (Web Questions). This confirms its effectiveness even with larger language models where retrieval can still be a bottleneck.
* **ADR & SR**: The improvements are marginal (e.g., 1.02x to 1.04x for ADR, 1.18x to 1.26x for SR).
* **Analysis**: For `LLaMA-2-13B`, the `language model generation latency` is considerably higher than for smaller models like GPT2 or OPT. When LM generation becomes the dominant factor, even substantial reductions in retrieval latency result in only marginal improvements in the overall `end-to-end latency`. This highlights that `RaLMSpec` is most effective when the retrieval step is a noticeable bottleneck, and its impact naturally diminishes as the LM computation itself becomes the overwhelming cost.

## 6.3. Ablation Studies / Parameter Analysis (Appendix)

### 6.3.1. Batched Retrieval Efficiency (Appendix A.1, Figure 6)

The following figure (Figure 6 from the original paper) demonstrates the efficiency of batched retrievals, which is a fundamental premise for `RaLMSpec`'s latency savings. A small measurement sketch illustrating how such a curve can be produced follows at the end of this subsection.

![Figure 6: Effect of batch size on latency per query for three different types of retrievers. $95\%$ confidence bands for the true mean latency are included.](/files/papers/6929a07df1921d7e196830fe/images/7.jpg)

*The figure shows query latency as a function of batch size for (a) the exact dense retriever, (b) the approximate dense retriever, and (c) the sparse retriever. Per-query latency drops markedly as the batch size grows; for the exact dense retriever it falls from roughly 4 seconds to under 1 second.*

Original caption: Figure 6: Effect of batch size on latency per query for three different types of retrievers. $95\%$ confidence bands for the true mean latency are included.

**Analysis of Figure 6**: This figure plots the latency per query as the `batch size` increases for EDR, ADR, and SR.

* For **EDR** and **SR**, the `latency per query` decreases significantly as batch size increases, and the `total retrieval time` remains almost constant across all batch sizes. This means that processing multiple queries in a batch takes roughly the same time as processing a single query, making batched retrieval highly efficient.
* For **ADR**, the `latency per query` also decreases, but the total latency scales more linearly with batch size, albeit with a significant intercept term. This implies that while ADR is faster, there is still a fixed overhead for initiating any retrieval, which can be amortized over multiple queries in a batch.
* **Conclusion**: Batched retrieval is demonstrably more efficient than sequential individual retrievals across all retriever types, validating a core mechanism of `RaLMSpec`.
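As mentioned above, the batching effect can be reproduced with a few lines of measurement code. The sketch below uses FAISS with a flat inner-product index purely as a stand-in exact dense retriever (the paper's exact retriever setup may differ); the corpus size, dimension, and batch sizes are arbitrary illustrative values.

```python
import time

import numpy as np
import faiss  # assumed available; any dense retriever with batched search would do

d, corpus_size, k = 128, 100_000, 10
rng = np.random.default_rng(0)
corpus = rng.standard_normal((corpus_size, d)).astype("float32")

index = faiss.IndexFlatIP(d)  # exact (brute-force) inner-product search
index.add(corpus)

for batch_size in (1, 2, 4, 8, 16):
    queries = rng.standard_normal((batch_size, d)).astype("float32")
    start = time.perf_counter()
    index.search(queries, k)  # one batched call answers all queries at once
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size:2d}  total={elapsed * 1e3:7.2f} ms  "
          f"per-query={elapsed / batch_size * 1e3:6.2f} ms")
```

The exact numbers depend on hardware, but the per-query latency should shrink noticeably as the batch grows, which is precisely the amortization that `RaLMSpec`'s batched verification exploits.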
### 6.3.2. Detailed Ablation Study on Different System Components (Appendix A.3, Table 4 & Figure 7)

The following figure (Figure 7 from the original paper) shows the ablation study on the contribution of the prefetching (P), optimal speculation stride scheduler (S), and asynchronous verification (A) components.

![Bar chart of the performance of RaLMSpec and its variants under (a) the exact dense retriever, (b) the approximate dense retriever, and (c) the sparse retriever. Blue denotes RaLMSpec, grey RaLMSpec+PS, orange RaLMSpec+P, and yellow RaLMSpec+PSA.](/files/papers/6929a07df1921d7e196830fe/images/8.jpg)

Original caption: Figure 7: Ablation study on the contribution of the prefetching (P), optimal speculation stride scheduler (S), and asynchronous verification (A) components.

The following are the results from Table 4 of the original paper:

<div class="table-wrapper"><table>
<thead>
<tr> <td>Retriever</td> <td>B</td> <td>P</td> <td>S</td> <td>A</td> <td>PS</td> <td>SA</td> <td>PA</td> <td>PSA</td> </tr>
</thead>
<tbody>
<tr> <td>EDR</td> <td>144.39s</td> <td>82.23s</td> <td>85.19s</td> <td>90.49s</td> <td>81.64s</td> <td>85.13s</td> <td>81.60s</td> <td>79.06s</td> </tr>
<tr> <td>ADR</td> <td>8.06s</td> <td>14.25s</td> <td>8.14s</td> <td>13.90s</td> <td>8.17s</td> <td>7.83s</td> <td>12.84s</td> <td>7.89s</td> </tr>
<tr> <td>SR</td> <td>10.75s</td> <td>11.27s</td> <td>10.38s</td> <td>10.88s</td> <td>10.21s</td> <td>8.26s</td> <td>10.61s</td> <td>8.28s</td> </tr>
</tbody>
</table></div>

**Analysis of Table 4 and Figure 7**: This detailed ablation study, conducted on `LLaMA-2-7B` and the `Wiki QA` dataset, evaluates all combinations of Prefetching (P), $OS^3$ (S), and Asynchronous verification (A). $B$ denotes the baseline `RaLMSeq` latency.

* **EDR**:
    * For EDR, $P$ (prefetching) provides the most significant individual speed-up (reducing latency from 144.39s to 82.23s). $S$ ($OS^3$) also helps but slightly less than $P$.
    * Interestingly, $P+A$ (81.60s) even outperforms $P+S$ (81.64s) and $S+A$ (85.13s), and is very close to $P+S+A$ (79.06s).
    * **Reasoning**: For EDR, where retrieval latency is very high, a larger fixed `speculation stride` (like the default $s=3$ when $OS^3$ is off) may already be near-optimal, and $OS^3$'s initial warm-up phase (starting at $s=1$) can introduce some overhead. Prefetching, by improving `speculation accuracy`, provides direct benefits.
* **ADR**:
    * For ADR, $P$ and $A$ individually lead to significant *slowdowns* (latency increases from 8.06s to 14.25s and 13.90s respectively), highlighting their potential overhead if not chosen carefully.
    * $S$ ($OS^3$) alone slightly increases latency (8.14s) but is crucial for $P+S+A$.
    * $S+A$ provides the best performance (7.83s), even outperforming $P+S+A$ (7.89s).
    * **Reasoning**: For ADR, where retrieval is faster, the overhead of prefetching (fetching more documents) or `asynchronous verification` (if it frequently fails early) can easily outweigh the gains. $OS^3$ is critical for adaptively finding a suitable stride to balance these trade-offs.
* **SR**:
    * Similar to ADR, $P$ individually slightly increases latency (11.27s vs. 10.75s).
    * $S$ provides the best individual gain (10.38s).
    * $S+A$ achieves the best overall performance (8.26s), again outperforming $P+S+A$ (8.28s).
    * **Reasoning**: As with ADR, fast retrievers benefit most from $OS^3$'s adaptive stride scheduling. Prefetching's overhead can be counterproductive if not justified by very high `speculation accuracy` improvements.

**Conclusion from Detailed Ablation**:

* $OS^3$ is generally the most crucial component for adapting to diverse workloads and achieving near-optimal performance, especially for faster retrievers where constant strides can be detrimental.
* Prefetching (P) is highly beneficial when retrieval dominates latency (e.g., EDR) but can hurt performance for faster retrievers if its overhead outweighs its gain.
* Asynchronous verification (A) provides marginal benefits and is less impactful, partly due to implementation limitations (GIL) and partly because its gains are conditional on successful speculation.
* Combining all components ($RaLMSpec+PSA$) provides the most consistent overall best performance, as they compensate for each other's weaknesses and collectively maximize the acceleration potential.
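Before looking at the stride ablation in the next subsection, a simplified cost model (my own illustration, not the paper's $OS^3$ formulation) makes the stride trade-off explicit: with speculation stride $s$, each verification round pays one batched retrieval plus $s$ LM steps, but any token generated after the first mis-speculation is wasted and must be regenerated.

```python
def expected_latency_per_token(s: int, t_retrieve: float, t_lm_step: float, p_hit: float) -> float:
    """Rough expected latency per *useful* token for a fixed speculation stride s.

    Simplifying assumptions (illustrative only):
    - each speculative step independently matches the true retrieval with probability p_hit;
    - a batched verification of s queries costs about one retrieval (t_retrieve);
    - tokens generated after the first mismatch are discarded.
    """
    expected_useful = sum(p_hit ** i for i in range(1, s + 1))  # p + p^2 + ... + p^s
    round_cost = t_retrieve + s * t_lm_step
    return round_cost / max(expected_useful, 1e-9)

# Illustrative latencies (seconds): a slow exact retriever vs. a fast approximate retriever.
for name, t_r in [("EDR-like", 0.50), ("ADR-like", 0.02)]:
    best = min(range(1, 9), key=lambda s: expected_latency_per_token(s, t_r, 0.03, 0.8))
    print(name, "prefers stride", best)
```

With these made-up numbers the slow, EDR-like retriever favors the largest stride in the searched range, while the fast, ADR-like retriever favors $s=2$, mirroring the qualitative trend reported in Table 5 below; $OS^3$ automates exactly this kind of selection online.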
### 6.3.3. Ablation Study on Speculation Stride (Appendix A.4, Table 5)

The following are the results from Table 5 of the original paper:

<div class="table-wrapper"><table>
<thead>
<tr> <td>Retriever</td> <td>s=2</td> <td>s=4</td> <td>s=8</td> <td>OS^3</td> </tr>
</thead>
<tbody>
<tr> <td>EDR</td> <td>92.17s</td> <td>81.06s</td> <td>81.90s</td> <td>85.19s</td> </tr>
<tr> <td>ADR</td> <td>9.86s</td> <td>14.93s</td> <td>25.88s</td> <td>8.14s</td> </tr>
<tr> <td>SR</td> <td>10.65s</td> <td>12.48s</td> <td>16.66s</td> <td>10.38s</td> </tr>
</tbody>
</table></div>

**Analysis of Table 5**: This table compares the performance (average serving latency) of `LLaMA-2-7B` on `Wiki QA` with different fixed `speculation strides` ($s=2, 4, 8$) against the `Optimal Speculation Stride Scheduler` ($OS^3$).

* **EDR**: A larger `speculation stride` (e.g., $s=4$ with 81.06s, or $s=8$ with 81.90s) is generally better. This is because EDR's retrieval latency is very high, so taking more speculative steps before an expensive verification reduces the number of costly external calls. The overhead of potential mis-speculation is small relative to the large savings from batched EDR calls. $OS^3$ (85.19s) performs slightly worse than the best fixed strides (e.g., $s=4$) due to its `warm-up phase` (starting at $s=1$) before it can adapt to the optimal larger stride.
* **ADR**: For ADR, a smaller `speculation stride` is preferred. As $s$ increases from 2 (9.86s) to 4 (14.93s) and 8 (25.88s), the latency significantly *increases*. This indicates that for faster retrievers, the cost of `speculation overhead` (tokens generated based on incorrect speculation that need to be rolled back) quickly outweighs the marginal savings from batched retrieval. $OS^3$ (8.14s) clearly outperforms all fixed strides, demonstrating its ability to find the optimal smaller stride.
* **SR**: Similar to ADR, SR prefers a smaller `speculation stride`. Latency increases with $s$ (10.65s for $s=2$, 12.48s for $s=4$, 16.66s for $s=8$). $OS^3$ (10.38s) again performs best, adapting to the optimal stride for SR.
* **Conclusion**: The optimal `speculation stride` is highly dynamic and depends on the retriever's characteristics and the LM's generation latency. $OS^3$ is essential for dynamically tuning this parameter to achieve near-optimal performance across diverse scenarios, effectively handling the dynamism between different retrieval and generation costs.

### 6.3.4. Additional Experimental Results (Appendix A.5)

The following are the full evaluation results for average latency (in seconds) across different models, retrievers, and datasets.
The following are the results from Table 6 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td>Model</td> <td>Method</td> <td>Wiki QA</td> <td>WQ</td> <td>NQ</td> <td>Trivia QA</td> </tr> </thead> <tbody> <tr> <td rowspan="6">GPT2</td> <td>Baseline RaLMSpec</td> <td>142.14 ± 0.96</td> <td>141.38 ± 1.50</td> <td>144.22 ± 1.17</td> <td>144.82 ± 0.96</td> </tr> <tr> <td></td> <td>69.82 ± 0.22</td> <td>69.88 ± 0.52</td> <td>70.79 ± 1.27</td> <td>69.83 ± 0.28</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>68.22 ± 0.18</td> <td>68.09 ± 0.18</td> <td>68.06 ± 0.37</td> <td>68.14 ± 0.06</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>66.14 ± 0.39</td> <td>65.22 ± 0.79</td> <td>67.10 ± 0.59</td> <td>66.64 ± 0.23</td> </tr> <tr> <td>RaLMSpec+S</td> <td>62.72 ± 0.48</td> <td>62.43 ± 0.19</td> <td>63.48 ± 0.69</td> <td>64.63 ± 0.44</td> </tr> <tr> <td>RaLMSpec+A</td> <td>69.92 ± 1.06</td> <td>69.36 ± 0.60</td> <td>71.00 ± 0.74</td> <td>70.40 ± 0.78</td> </tr> <tr> <td rowspan="6">OPT</td> <td>RaLMSpec+P(20)SA RaLMSpec+P(256)SA</td> <td>58.35 ± 0.31 53.95 ± 0.72</td> <td>59.24 ± 0.46 53.36 ± 1.08</td> <td>60.21 ± 0.78 56.26 ± 0.95</td> <td>61.39 ± 0.99 56.95 ± 1.34</td> </tr> <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr> <tr> <td>Baseline</td> <td>126.86 ± 1.39</td> <td>55.60 ± 0.08 29.99 ± 0.54</td> <td>62.02 ± 0.11</td> <td>91.50 ± 0.01</td> </tr> <tr> <td>RaLMSpec RaLMSpec+P(20)</td> <td>77.81 ± 0.84 40.37 ± 0.07</td> <td>29.09 ± 0.07</td> <td>34.75 ± 0.19</td> <td>48.20 ± 0.09</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>72.68 ± 0.75</td> <td>31.94 ± 1.16</td> <td>33.79 ± 0.60</td> <td>52.09 ± 0.11</td> </tr> <tr> <td>RaLMSpec+S</td> <td>40.77 ± 0.52</td> <td>29.49 ± 0.53</td> <td>39.90 ± 3.18</td> <td>50.24 ± 0.02</td> </tr> <tr> <td rowspan="8">LLaMA</td> <td>RaLMSpec+A</td> <td></td> <td></td> <td>35.13 ± 0.10</td> <td>50.82 ± 0.49</td> </tr> <tr> <td></td> <td>77.76 ± 4.99</td> <td>30.28 ± 0.59</td> <td>36.16 ± 1.18</td> <td>47.83 ± 0.03</td> </tr> <tr> <td>RaLMSpec+P(20)SA</td> <td>39.00 ± 0.58</td> <td>28.31 ± 0.62</td> <td>31.88 ± 1.67</td> <td>45.51 ± 2.93</td> </tr> <tr> <td>RaLMSpec+P(256)SA</td> <td>59.21 ± 0.04</td> <td>27.79 ± 0.11</td> <td>30.02 ± 0.01</td> <td>45.13 ± 0.04</td> </tr> <tr> <td>Baseline</td> <td>144.39 ± 1.71</td> <td>146.52 ± 1.92</td> <td>149.76 ± 0.95</td> <td>147.76 ± 2.80</td> </tr> <tr> <td>RaLMSpec</td> <td>81.05 ± 0.78</td> <td>87.20 ± 1.83</td> <td>86.92 ± 1.30</td> <td>90.44 ± 2.01</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>83.94 ± 1.11</td> <td>84.23 ± 0.37</td> <td>84.74 ± 0.47</td> <td></td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>82.23 ± 1.95</td> <td>94.15 ± 1.60</td> <td></td> <td>81.86 ± 0.76</td> </tr> </tbody> </table></div> The following are the results from Table 7 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td>Model</td> <td>Method</td> <td>Wiki QA</td> <td>WQ</td> <td>NQ</td> <td>Trivia QA</td> </tr> </thead> <tbody> <tr> <td rowspan="7">GPT2</td> <td>Baseline</td> <td>4.48 ± 0.11</td> <td>4.50 ± 0.41</td> <td>4.38 ± 0.60</td> <td>3.78 ± 0.10</td> </tr> <tr> <td>RaLMSpec</td> <td>7.26 ± 0.07</td> <td>6.44 ± 0.44</td> <td>6.41 ± 0.71</td> <td>7.47 ± 0.16</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>6.92 ± 0.10</td> <td>7.38 ± 0.56</td> <td>7.37 ± 0.58</td> <td>7.28 ± 0.28</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>6.65 ± 0.07</td> <td>5.97 ± 0.46</td> <td>5.64 ± 0.74</td> <td>6.96 ± 0.63</td> </tr> <tr> <td>RaLMSpec+S</td> <td>4.59 ± 0.28</td> <td>4.77 ± 0.32</td> <td>4.65 ± 0.61</td> 
<td>4.51 ± 0.45</td> </tr> <tr> <td>RaLMSpec+A</td> <td>6.50 ± 0.54</td> <td>6.49 ± 0.38</td> <td>5.70 ± 0.70</td> <td>6.94 ± 0.84</td> </tr> <tr> <td>RaLMSpec+P(20)SA</td> <td>4.24 ± 0.14</td> <td>4.34 ± 0.35</td> <td>4.03 ± 0.68</td> <td>3.64 ± 0.54</td> </tr> <tr> <td rowspan="7">OPT</td> <td>RaLMSpec+P(256)SA</td> <td>4.01 ± 0.21</td> <td>3.81 ± 0.02</td> <td>3.40 ± 0.01</td> <td>3.86 ± 0.31</td> </tr> <tr> <td>Baseline</td> <td>4.43 ± 0.05</td> <td>1.31 ± 0.01</td> <td>1.83 ± 0.01</td> <td>2.42 ± 0.03</td> </tr> <tr> <td>RaLMSpec</td> <td>7.15 ± 0.06</td> <td>2.34 ± 0.01</td> <td>3.04 ± 0.03</td> <td>3.79 ± 0.04</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>3.44 ± 0.02</td> <td>2.34 ± 0.01</td> <td>2.70 ± 0.06</td> <td>4.66 ± 0.03</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>16.03 ± 0.03</td> <td>6.06 ± 0.01</td> <td>7.06 ± 0.04</td> <td>10.17 ± 0.01</td> </tr> <tr> <td>RaLMSpec+S</td> <td>2.21 ± 0.01</td> <td>1.47 ± 0.01</td> <td>1.88 ± 0.05</td> <td>2.97 ± 0.05</td> </tr> <tr> <td>RaLMSpec+A RaLMSpec+P(20)SA</td> <td>7.55 ± 0.05</td> <td>2.25 ± 0.01</td> <td>5.41 ± 1.32</td> <td>6.20 ± 0.02</td> </tr> <tr> <td rowspan="7">LLaMA</td> <td>RaLMSpec+P(256)SA</td> <td>1.98 ± 0.03 9.41 ± 0.66</td> <td>1.30 ± 0.01 4.14 ± 0.02</td> <td>1.50 ± 0.01</td> <td>2.37 ± 0.01</td> </tr> <tr> <td>Baseline</td> <td></td> <td></td> <td>4.31 ± 0.02</td> <td>6.19 ± 0.02</td> </tr> <tr> <td>RaLMSpec</td> <td>8.06 ± 0.07</td> <td>7.97 ± 0.06</td> <td>8.11 ± 0.11</td> <td>8.68 ± 0.10</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>14.10 ± 0.31 14.25 ± 0.39</td> <td>13.44 ± 0.37 13.45 ± 0.28</td> <td>14.35 ± 0.15</td> <td>14.23 ± 0.35</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td></td> <td></td> <td>14.08 ± 0.32</td> <td>14.21 ± 0.30</td> </tr> <tr> <td>RaLMSpec+S</td> <td>20.63 ± 0.48 8.14 ± 0.19</td> <td>26.44 ± 3.11 8.08 ± 0.07</td> <td>27.38 ± 3.39</td> <td>21.04 ± 0.43</td> </tr> <tr> <td>RaLMSpec+A</td> <td>13.90 ± 0.36</td> <td>13.28 ± 0.17</td> <td>8.08 ± 0.07 13.72 ± 0.14</td> <td>8.16 ± 0.09</td> </tr> <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td>18.35 ± 1.11</td> </tr> <tr> <td>RaLMSpec+P(20)SA</td> <td>7.89 ± 0.22</td> <td>7.84 ± 0.15</td> <td>7.90 ± 0.12</td> <td></td> <td>7.91 ± 0.03</td> </tr> <tr> <td>RaLMSpec+P(256)SA</td> <td>14.06 ± 0.08</td> <td>14.96 ± 1.34</td> <td>14.59 ± 2.04</td> <td></td> <td>12.94 ± 0.03</td> </tr> <tr> <td></td> <td></td> <td></td> <td></td> <td></td> <td></td> </tr> </tbody> </table></div> The following are the results from Table 8 of the original paper: <div class="table-wrapper"><table> <thead> <tr> <td>Model</td> <td>Method</td> <td>Wiki QA</td> <td>WQ</td> <td>NQ</td> <td>Trivia QA</td> </tr> </thead> <tbody> <tr> <td rowspan="6">GPT2</td> <td>Baseline</td> <td>7.41 ± 1.34</td> <td>7.03 ± 1.15</td> <td>7.23 ± 0.11</td> <td>6.80 ± 0.09</td> </tr> <tr> <td>RaLMSpec</td> <td>5.18 ± 0.13</td> <td>5.30 ± 0.95</td> <td>5.34 ± 0.11</td> <td>5.40 ± 0.03</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>5.23 ± 0.23</td> <td>4.58 ± 0.01</td> <td>5.17 ± 0.05</td> <td>5.50 ± 0.04</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>6.88 ± 0.66</td> <td>7.16 ± 1.34</td> <td>6.76 ± 0.27</td> <td>6.91 ± 0.13</td> </tr> <tr> <td>RaLMSpec+S</td> <td>5.62 ± 0.96</td> <td>5.03 ± 0.68</td> <td>5.24 ± 0.13</td> <td>5.61 ± 0.11</td> </tr> <tr> <td>RaLMSpec+A</td> <td>5.34 ± 0.89</td> <td>4.99 ± 0.86</td> <td>4.76 ± 0.14</td> <td>5.04 ± 0.12</td> </tr> <tr> <td rowspan="6">OPT</td> <td>RaLMSpec+P(20)SA RaLMSpec+P(256)SA</td> <td>4.49 ± 0.09 6.66 ± 1.25</td> <td>4.57 ± 0.81 6.50 ± 
1.39</td> <td>4.54 ± 0.02 5.54 ± 0.02</td> <td>4.99 ± 0.01 5.91 ± 0.03</td> </tr> <tr> <td></td> <td></td> <td></td> <td></td> <td></td> </tr> <tr> <td>Baseline</td> <td>7.68 ± 0.01</td> <td>1.83 ± 0.01</td> <td>2.62 ± 0.01</td> <td>4.71 ± 0.02</td> </tr> <tr> <td>RaLMSpec</td> <td>5.63 ± 0.01</td> <td>1.93 ± 0.01</td> <td>2.60 ± 0.01</td> <td>4.07 ± 0.02</td> </tr> <tr> <td>RaLMSpec+P(20) RaLMSpec+P(256)</td> <td>3.00 ± 0.01</td> <td>2.06 ± 0.01</td> <td>2.50 ± 0.02</td> <td>4.27 ± 0.01</td> </tr> <tr> <td>OPT RaLMSpec+S</td> <td>7.13 ± 0.01 2.79 ± 0.01</td> <td>2.45 ± 0.01</td> <td>3.22 ± 0.01</td> <td>5.24 ± 0.02</td> </tr> <tr> <td rowspan="6">LLaMA</td> <td></td> <td></td> <td>1.69 ± 0.01</td> <td>2.32 ± 0.01</td> <td>4.27 ± 0.01</td> </tr> <tr> <td>RaLMSpec+A</td> <td>5.27 ± 0.02</td> <td>1.86 ± 0.01</td> <td>2.28 ± 0.01</td> <td>3.82 ± 0.02</td> </tr> <tr> <td>RaLMSpec+P(20)SA</td> <td>2.46 ± 0.01</td> <td>1.57 ± 0.01</td> <td>1.88 ± 0.01</td> <td>3.59 ± 0.01</td> </tr> <tr> <td>RaLMSpec+P(256)SA</td> <td>6.37 ± 0.01</td> <td>1.92 ± 0.04</td> <td>2.44 ± 0.02</td> <td>4.53 ± 0.01</td> </tr> <tr> <td>Baseline</td> <td>10.75 ± 0.32</td> <td>10.55 ± 0.07</td> <td>10.72 ± 0.10</td> <td>11.06 ± 0.25</td> </tr> <tr> <td>RaLMSpec</td> <td>11.47 ± 0.17</td> <td>11.02 ± 0.31</td> <td>11.10 ± 0.27</td> <td>10.79 ± 0.20</td> </tr> <tr> <td>RaLMSpec+P(20)</td> <td>11.27 ± 0.14</td> <td>11.04 ± 0.22</td> <td>10.69 ± 0.15</td> <td>10.66 ± 0.23</td> </tr> <tr> <td>RaLMSpec+P(256)</td> <td>12.83 ± 0.21</td> <td>13.35 ± 0.81</td> <td>12.48 ± 0.22</td> <td>12.60 ± 0.36</td> </tr> <tr> <td>RaLMSpec+S</td> <td>10.38 ± 0.28</td> <td>10.19 ± 0.19</td> <td>9.95 ± 0.04</td> <td>10.18 ± 0.08</td> </tr> <tr> <td>RaLMSpec+A</td> <td>10.88 ± 0.26</td> <td>10.66 ± 0.12</td> <td>10.56 ± 0.20</td> <td>10.16 ± 0.18</td> </tr> <tr> <td>RaLMSpec+P(20)SA RaLMSpec+P(256)SA</td> <td>8.28 ± 0.18 9.46 ± 0.17</td> <td>8.18 ± 0.10 10.42 ± 0.74</td> <td>8.26 ± 0.14 9.64 ± 0.08</td> <td>8.09 ± 0.18 9.36 ± 0.16</td> </tr> </tbody> </table></div> These tables provide the raw latency numbers (mean ± standard deviation) for all configurations, models, and datasets, mirroring the trends discussed in the main results section (Figure 4) and the ablation studies (Tables 1, 2, 5). They confirm that $RaLMSpec+P(20)SA$ or $RaLMSpec+P(256)SA$ often achieve the lowest latencies, with $OS^3$ (indicated by 'S') being crucial for preventing slowdowns, especially with faster retrievers. The detailed numbers underscore the robustness of `RaLMSpec` across a broad range of experimental setups. # 7. Conclusion & Reflections ## 7.1. Conclusion Summary This work introduces `RaLMSpec`, a novel speculation-inspired framework meticulously designed to accelerate the serving latency of generic iterative Retrieval-Augmented Language Models (RaLMs). The core innovation lies in its ability to significantly reduce the overhead caused by frequent retrieval-generation interactions, a critical bottleneck in iterative RaLM serving, while `provably preserving the exact same model outputs`. `RaLMSpec` leverages the inherent `temporal and spatial locality` of retrieved documents by maintaining a fast, request-level local cache for `speculative retrieval`. This allows the language model to generate tokens using quickly accessible, speculated documents. To ensure correctness, these speculative steps are followed by a `batched verification` process where the true documents are retrieved from the external knowledge base in a highly efficient parallel manner. 
Any detected mismatches trigger a rollback and correction, guaranteeing fidelity. Beyond this core mechanism (sketched at the end of this subsection), `RaLMSpec` is further enhanced by three key techniques:

1. **Cache Prefetching**: Proactively updates the local cache with multiple top-ranked documents during verification, boosting future speculation success rates.
2. **Optimal Speculation Stride Scheduler ($OS^3$)**: Dynamically adjusts the number of speculative steps between verifications, adaptively balancing `speculation overhead` and `retrieval latency savings` across diverse scenarios.
3. **Asynchronous Verification**: Enables overlapping speculative and verification steps to hide latency and exploit concurrency.

Empirical evaluations across a broad spectrum of tasks (naive iterative RaLM and `KNN-LM` serving), language models (GPT2, OPT, LLaMA-2), retrievers (Exact Dense, Approximate Dense, Sparse), and downstream QA datasets demonstrate `RaLMSpec`'s substantial effectiveness. It achieves speed-up ratios of up to 2.39x for naive iterative RaLM and an impressive 7.59x for `KNN-LM` serving with exact dense retrievers, consistently outperforming baselines. These results firmly establish `RaLMSpec` as a generic and powerful acceleration framework for iterative RaLMs.
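To make the mechanism recapped above concrete, here is a minimal, hypothetical sketch of the speculative-retrieval loop with batched verification and rollback. The function and object names are invented, and the real system additionally layers prefetching, $OS^3$, and asynchronous verification on top of this skeleton.

```python
def ralm_serve_speculative(lm, retriever, cache, prompt, stride=3, max_tokens=128):
    """Illustrative speculative RaLM serving loop (not the authors' code).

    Assumed interfaces:
      lm.generate(context, doc)    -> next token given the current document
      retriever.batch_retrieve(qs) -> true documents for a batch of queries
      cache.lookup(q) / cache.add(doc) implement the request-level local cache
    """
    output = []
    while len(output) < max_tokens:
        # --- speculation phase: use cached (possibly stale) documents ---
        queries, spec_docs, spec_tokens = [], [], []
        for _ in range(stride):
            query = prompt + "".join(output + spec_tokens)
            doc = cache.lookup(query)          # fast local guess
            token = lm.generate(query, doc)    # tentative token
            queries.append(query)
            spec_docs.append(doc)
            spec_tokens.append(token)

        # --- verification phase: one batched call to the true retriever ---
        true_docs = retriever.batch_retrieve(queries)
        keep = 0
        for spec, true in zip(spec_docs, true_docs):
            cache.add(true)                    # cache update with ground-truth documents
            if spec != true:
                break                          # first mismatch: roll back the rest
            keep += 1

        output.extend(spec_tokens[:keep])
        if keep < stride:                      # regenerate the mismatched step correctly
            query = prompt + "".join(output)
            output.append(lm.generate(query, true_docs[keep]))
    return "".join(output)
```

Because every emitted token is either verified against the true retrieval or regenerated from it, the output matches what the sequential baseline would produce, which is the correctness guarantee the paper emphasizes.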
## 7.2. Limitations & Future Work

The authors acknowledge a few limitations and suggest future research directions:

* **Asynchronous Verification Implementation**: The full potential of `asynchronous verification` was not realized due to the `Global Interpreter Lock (GIL)` in Python, necessitating simulated latency measurements. Overcoming this would require a lower-level system implementation (e.g., in C++ or another language that allows true parallelism without GIL constraints) or specialized hardware scheduling.
* **`KNN-LM` Asynchronous Verification**: `Asynchronous verification` was not enabled for `KNN-LM` serving in the current work and is left as future work. Given the high retrieval frequency of `KNN-LM`, integrating asynchronous verification could yield further significant gains.
* **Speed-up Bottleneck**: The paper implicitly highlights that `RaLMSpec`'s speed-up is inherently limited by the proportion of retrieval latency in the total end-to-end latency. When language model generation itself becomes the overwhelming bottleneck (e.g., with very large LMs or very fast retrievers), the gains from optimizing retrieval will naturally diminish.

Based on these, potential future work could explore:

* **Hardware-aware Asynchronous Verification**: Developing `RaLMSpec` with hardware-level concurrency primitives to fully exploit asynchronous verification's benefits.
* **Integration with LLM Speculative Decoding**: Combining `RaLMSpec` with existing `speculative decoding` techniques for LLMs could lead to compounded speed-ups by optimizing both retrieval and generation simultaneously.
* **Adaptive Prefetching Strategies**: Further research into more sophisticated, adaptive prefetching mechanisms that dynamically adjust the prefetch size based on current speculation accuracy and retriever characteristics, to avoid the overhead observed with fixed large prefetch sizes.
* **Cross-Request Speculation**: Exploring whether speculation can extend beyond a single request to leverage locality across multiple concurrent inference requests.

## 7.3. Personal Insights & Critique

This paper presents a highly relevant and practical solution to a critical problem in the deployment of high-quality RaLMs. Iterative RaLMs, while offering superior accuracy and adaptability, have been held back by their inference costs. `RaLMSpec` directly tackles this bottleneck with an elegant and provably correct approach.

**Strengths**:

* **Provable Correctness**: The guarantee of `same model outputs` is a significant advantage. Many optimization techniques trade off quality for speed; `RaLMSpec` offers pure speed-up without compromise, which is crucial for reliable AI applications.
* **Orthogonal Innovation**: Applying `speculation` to the *retrieval* component is novel and distinct from existing LLM `speculative decoding`. This orthogonality means `RaLMSpec` can potentially complement other speed-up methods for LLMs, paving the way for even faster RaLM systems.
* **Generality**: The framework's demonstrated effectiveness across various LMs, retrievers, and datasets suggests it is a broadly applicable solution.
* **Holistic Optimization**: The combination of `caching`, `batched verification`, `prefetching`, $OS^3$, and `asynchronous verification` shows a thorough understanding of the problem and a comprehensive approach to maximizing performance. $OS^3$ is particularly impressive for its adaptive nature, addressing the dynamic trade-offs involved.

**Potential Issues/Critique**:

* **Impact of LM Size**: As the paper itself implicitly acknowledges, the gains from `RaLMSpec` diminish when the LM generation time vastly overshadows retrieval time (e.g., with very large LLMs like `LLaMA-2-13B` and fast retrievers). While still providing some speed-up, the relative impact becomes less pronounced. This suggests that future optimizations for RaLMs may need to increasingly focus on the LM generation part for larger models.
* **Complexity of $OS^3$**: While powerful, the `Optimal Speculation Stride Scheduler` relies on estimations of $a$, $b$, and $\gamma(X)$. The accuracy and stability of these estimations, especially $\gamma(X)$ with its windowing approach, could be sensitive to sudden shifts in query patterns or knowledge base characteristics. The warm-up phase overhead for EDR is an example of this (a toy illustration of this estimation lag follows the critique list below).

* **KNN-LM Adaptation Details**: While the paper states that KNN-LM's cache update and verification protocols were modified (matching the next token, populating the cache with the next $n=10$ entries), a more detailed exploration of these adaptations and their rationale (e.g., why $n=10$ was chosen, a sensitivity analysis for $n$) could provide deeper insights.
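To illustrate the estimation-sensitivity concern raised above, here is a toy windowed estimator of the speculation success rate (call it $\gamma$). This is my own sketch, not the paper's estimator, and the window length is an arbitrary choice.

```python
from collections import deque

class WindowedGammaEstimator:
    """Toy estimate of the speculation success rate over a sliding window of
    recent verification outcomes (True = speculation matched the true retrieval)."""

    def __init__(self, window: int = 50):
        self.outcomes = deque(maxlen=window)

    def update(self, success: bool) -> None:
        self.outcomes.append(success)

    def gamma(self, default: float = 0.5) -> float:
        if not self.outcomes:
            return default  # cold start: no evidence yet
        return sum(self.outcomes) / len(self.outcomes)

# A sudden shift in query patterns (hits -> misses) is only reflected after
# roughly one window of new observations, which is the lag discussed above.
est = WindowedGammaEstimator(window=50)
for _ in range(50):
    est.update(True)
for _ in range(10):
    est.update(False)
print(round(est.gamma(), 2))  # 0.8: still dominated by the old regime
```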

**Transferability/Application**: The core idea of speculative execution with batched verification is broadly applicable to any system where expensive, sequential operations can be partially predicted and then verified. This paradigm could be transferred to:

* **Other multi-modal AI systems**: Systems involving frequent calls to external processing units (e.g., image generation with external modules, video processing pipelines).
* **Database interactions**: Optimizing complex database queries that involve multiple sub-queries.
* **Agentic AI systems**: AI agents that frequently interact with external tools or environments, where speculative tool use could be verified in batches.

Overall, `RaLMSpec` is a significant contribution towards making powerful iterative RaLMs practical for real-world applications by elegantly solving their most pressing performance bottleneck.
