
Accelerating Retrieval-Augmented Generation

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper explores Retrieval-Augmented Generation (RAG) to address hallucinations in large language models (LLMs). It introduces the Intelligent Knowledge Store (IKS), a near-memory acceleration architecture that speeds up exact retrieval by 13.4–27.9×, which translates to 1.7–26.3× lower end-to-end inference time for representative RAG applications.

Abstract

An evolving solution to address hallucination and enhance accuracy in large language models (LLMs) is Retrieval-Augmented Generation (RAG), which involves augmenting LLMs with information retrieved from an external knowledge source, such as the web. This paper profiles several RAG execution pipelines and demystifies the complex interplay between their retrieval and generation phases. We demonstrate that while exact retrieval schemes are expensive, they can reduce inference time compared to approximate retrieval variants because an exact retrieval model can send a smaller but more accurate list of documents to the generative model while maintaining the same end-to-end accuracy. This observation motivates the acceleration of the exact nearest neighbor search for RAG. In this work, we design Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators. IKS offers 13.4–27.9 × faster exact nearest neighbor search over a 512GB vector database compared with executing the search on Intel Sapphire Rapids CPUs. This higher search performance translates to 1.7–26.3 × lower end-to-end inference time for representative RAG applications. IKS is inherently a memory expander; its internal DRAM can be disaggregated and used for other applications running on the server to prevent DRAM – which is the most expensive component in today’s servers – from being stranded.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Accelerating Retrieval-Augmented Generation

1.2. Authors

  • Derrick Quinn (Cornell University, Ithaca, NY, USA)
  • Mohammad Nouri (Cornell University, Ithaca, NY, USA)
  • Neel Patel (Cornell University, Ithaca, NY, USA)
  • John Salihu (University of Kansas, Lawrence, KS, USA)
  • Alireza Salemi (University of Massachusetts Amherst, Amherst, MA, USA)
  • Sukhan Lee (Samsung Electronics, Hwasung, Republic of Korea)
  • Hamed Zamani (University of Massachusetts Amherst, Amherst, MA, USA)
  • Mohammad Alian (Cornell University, Ithaca, NY, USA)

1.3. Journal/Conference

Published in the Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (ASPLOS '25), March 30-April 3, 2025, Rotterdam, Netherlands.

ASPLOS is a highly reputable and influential conference in the field of computer architecture, programming languages, and operating systems. Publication here signifies significant contributions and rigor in system design and performance.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses the issue of hallucinations and accuracy in large language models (LLMs) by focusing on Retrieval-Augmented Generation (RAG). RAG enhances LLMs by retrieving information from external knowledge sources. The authors profile various RAG execution pipelines, elucidating the interplay between retrieval and generation. They demonstrate that while Exact Nearest Neighbor Search (ENNS) is computationally expensive, it reduces inference time compared to Approximate Nearest Neighbor Search (ANNS) variants by providing a smaller, more accurate document list to the generative model, thus maintaining end-to-end accuracy. This observation motivates accelerating ENNS.

The paper introduces Intelligent Knowledge Store (IKS), a Type-2 CXL device. IKS implements a scale-out near-memory acceleration architecture featuring a novel cache-coherent interface between the host CPU and near-memory accelerators (NMAs). IKS achieves 13.4–27.9× faster ENNS over a 512GB vector database compared to Intel Sapphire Rapids CPUs. This superior search performance results in 1.7–26.3× lower end-to-end inference time for representative RAG applications. Additionally, IKS functions as a memory expander, allowing its internal DRAM to be disaggregated and used by other applications, preventing DRAM from being stranded.


2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the inherent limitations of Large Language Models (LLMs), specifically their susceptibility to hallucinations (generating factually incorrect or nonsensical information) and the static nature of their parametric knowledge. LLMs "memorize" information from their training corpora, which becomes outdated in rapidly evolving domains and struggles with less frequent entities or real-time information. Retraining or fine-tuning LLMs to update this knowledge is prohibitively expensive.

Retrieval-Augmented Generation (RAG) emerges as a crucial solution to these challenges. RAG enhances LLMs by augmenting them with dynamic, non-parametric knowledge retrieved from external, up-to-date sources like the web or vast document corpora. This approach significantly improves LLM accuracy and reduces hallucinations.

However, the effectiveness of RAG critically depends on the quality and speed of the retrieval phase. State-of-the-art retrieval methods, known as dense retrieval, encode queries and documents into high-dimensional embedding vectors and store them in a vector database. Retrieving relevant documents then involves a K-nearest neighbor (KNN) search. Exact Nearest Neighbor Search (ENNS), which guarantees the highest retrieval quality, is computationally intensive and memory bandwidth-limited, often consuming up to 97% of the total end-to-end inference time in RAG applications. While Approximate Nearest Neighbor Search (ANNS) offers faster search, the paper's experiments show that ANNS often requires retrieving significantly more documents (larger K) to match the accuracy of ENNS, negating performance gains and increasing the generation time of the LLM. This creates a bottleneck where high-quality retrieval is essential for RAG's accuracy but is currently too slow and expensive on commodity hardware, especially CPUs and even GPUs for very large vector databases.

The paper's entry point is this bottleneck: given that high-quality (exact) retrieval is critical for RAG's overall performance and accuracy, there is a strong motivation to accelerate ENNS efficiently and cost-effectively.

2.2. Main Contributions / Findings

The paper makes several significant contributions to addressing the RAG retrieval bottleneck:

  • Demystification and Profiling of RAG Pipelines: The authors provide an extensive profiling of RAG execution pipelines, analyzing the interplay between hardware and software configurations. They demonstrate that RAG applications require high-quality retrieval for effective performance, and that high-quality retrieval (whether ENNS or slow ANNS) constitutes a significant portion of the end-to-end runtime, bottlenecking RAG.
  • Motivation for ENNS Acceleration: The paper empirically shows that ENNS can reduce overall RAG inference time compared to ANNS variants because its higher accuracy allows the generative model to receive a smaller, more precise list of documents while maintaining the same end-to-end accuracy. This observation strongly motivates the need for dedicated ENNS acceleration.
  • Design of Intelligent Knowledge Store (IKS): The paper introduces IKS, a novel, cost-optimized, and purpose-built Type-2 CXL memory expander that functions as a high-performance, high-capacity vector database accelerator. IKS offloads memory-intensive dot-product operations of ENNS to a distributed array of low-profile accelerators placed near LPDDR5X DRAM packages.
  • Novel Cache-Coherent Interface: IKS implements an innovative interface built on top of the CXL.cache protocol. This interface facilitates seamless and efficient offloading of exact vector database search operations from the host CPU to the near-memory accelerators (NMAs), eliminating the need for DMA setup, buffer management, interrupts, or polling.
  • Scale-Out Near-Memory Acceleration Architecture: IKS employs a minimalist scale-out near-memory accelerator architecture where NMA logic is distributed across multiple chips, each connected to an LPDDR5X package. This design optimizes for area, yield, and aggregate memory bandwidth.
  • Significant Performance Gains: IKS achieves a remarkable 13.4–27.9× faster ENNS for a 512GB knowledge store compared to Intel Sapphire Rapids CPUs. This translates to a 1.7–26.3× reduction in end-to-end inference time for representative RAG applications (FiDT5, Llama-8B, Llama-70B).
  • Memory Disaggregation and Cost-Effectiveness: IKS is inherently a memory expander. Its internal DRAM can be disaggregated and shared with other co-running applications via CXL.mem and CXL.cache protocols, preventing expensive DRAM from being stranded and improving server resource utilization. The design is also shown to be more cost-effective and power-efficient than GPU-based solutions for ENNS.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

Large Language Models (LLMs)

LLMs are advanced neural networks, typically based on the Transformer architecture, trained on massive datasets of text and code. They excel at understanding, generating, and translating human-like text.

  • Parametric Knowledge: The knowledge acquired by LLMs during training is stored in their millions or billions of internal parameters (weights and biases). This knowledge is static and can only be updated by retraining or fine-tuning the entire model, which is a costly and time-consuming process.
  • Hallucination: A critical issue where LLMs generate plausible-sounding but factually incorrect or nonsensical information, often due to limitations in their parametric knowledge or misunderstanding of context.

Retrieval-Augmented Generation (RAG)

RAG is a technique that combines LLMs with external information retrieval systems. Instead of relying solely on their internal (parametric) knowledge, RAG models first search for relevant documents or passages from a knowledge source (e.g., a vector database of Wikipedia articles, web pages) and then use this retrieved information to inform their text generation.

  • Components:
    • Retrieval Model (Retriever): Responsible for searching the knowledge source for relevant items based on the user's query. It typically converts the query into an embedding vector and searches for similar document embedding vectors.
    • Generative Model (Generator): An LLM that receives the user's query and the retrieved documents. It then synthesizes a response based on this augmented input.
  • Benefits: Reduces hallucination, provides up-to-date information, increases factual accuracy, and makes LLMs more grounded in verifiable sources.

Dense Retrieval

A modern information retrieval technique used in RAG.

  • Bi-encoder Neural Networks: Dense retrieval models use two separate neural networks (encoders): a query encoder (E_q) and a document encoder (E_d).
  • Embedding Vectors: Both encoders map queries and documents into a shared high-dimensional vector space. The outputs of these encoders are called embedding vectors (or embeddings). For a query q and a document d, their embeddings are E_q(q) and E_d(d), respectively.
  • Similarity Score: The similarity between a query and a document is typically calculated as the dot product (or cosine similarity) of their respective embedding vectors: S_d = E_q(q) · E_d(d). A higher score indicates greater relevance.
  • Vector Database: A specialized database designed to efficiently store and query embedding vectors. It allows for fast similarity searches.
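As an illustration of the scoring step, the following minimal NumPy sketch (not from the paper; array shapes and data are made up) computes S_d = E_q(q) · E_d(d) for every document and keeps the K highest-scoring indices:

```python
import numpy as np

rng = np.random.default_rng(0)
doc_embeddings = rng.standard_normal((10_000, 768)).astype(np.float32)   # E_d(d) for each document
query_embedding = rng.standard_normal(768).astype(np.float32)            # E_q(q)

K = 16
scores = doc_embeddings @ query_embedding          # one dot product S_d per document
top_k = np.argpartition(-scores, K)[:K]            # indices of the K largest scores (unordered)
top_k = top_k[np.argsort(-scores[top_k])]          # order those K by descending score
print(top_k[:5], scores[top_k[:5]])
```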

K-Nearest Neighbor (KNN) Search

After converting queries and documents into embedding vectors, retrieving the most relevant documents involves finding the K embedding vectors in the vector database that are "closest" (most similar) to the query embedding vector. KNN search finds these K neighbors.

  • Exact Nearest Neighbor Search (ENNS): This method exhaustively compares the query vector with every single embedding vector in the vector database to find the K truly closest ones. It guarantees perfect retrieval accuracy but is computationally expensive and memory bandwidth-bound, especially for large databases.
  • Approximate Nearest Neighbor Search (ANNS): This method employs various data structures and algorithms (e.g., Product Quantization, IVFPQ, HNSW) to reduce the search space and find K neighbors that are approximately the closest. ANNS trades off a small amount of retrieval accuracy for significantly faster search times. However, maintaining high accuracy with ANNS often requires more complex configurations or may still be slower than desired.
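The ENNS-vs-ANNS trade-off can be reproduced in miniature with Faiss, the library used in the paper's evaluation (Section 5.4). The sketch below is illustrative only: the corpus is random, and the HNSW parameters simply mirror those reported in Section 5.3.

```python
import numpy as np
import faiss

d, n, k = 768, 10_000, 16
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype(np.float32)
queries = rng.standard_normal((8, d)).astype(np.float32)

# Exact nearest neighbor search: brute-force inner product against every vector.
exact = faiss.IndexFlatIP(d)
exact.add(corpus)
_, exact_ids = exact.search(queries, k)

# Approximate search: HNSW graph index; efSearch trades speed for recall.
approx = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
approx.hnsw.efConstruction = 128
approx.add(corpus)
approx.hnsw.efSearch = 2048
_, approx_ids = approx.search(queries, k)

# Recall of ANNS against the ENNS ground truth (the metric used in Section 5.2).
recall = np.mean([len(set(a) & set(e)) / k for a, e in zip(approx_ids, exact_ids)])
print(f"HNSW recall@{k} vs. exact search: {recall:.3f}")
```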

Compute Express Link (CXL)

CXL is an open industry standard interconnect that provides high-bandwidth, low-latency connectivity between a host processor and accelerators or memory devices. It operates over the PCIe (Peripheral Component Interconnect Express) physical and electrical interface but adds crucial features for coherency and memory semantics.

  • CXL.mem (Memory Protocol): Allows a CXL device to act as a memory expander, providing additional memory capacity to the host CPU. The host can access this memory directly.
  • CXL.cache (Caching Protocol): Enables CXL devices (like accelerators) to cache host CPU memory and participate in the CPU's cache coherency domain. This is critical for shared memory models, allowing CPU and accelerator to operate on the same data without explicit DMA transfers and ensuring data consistency.
  • CXL Type 2 Device: A CXL device that supports both CXL.mem and CXL.cache protocols. This type of device can act as a memory expander and also contain accelerator logic that can cache CPU memory, offering a shared address space between the host CPU and the accelerator. This is the type of device IKS is.

LPDDR5X (Low-Power Double Data Rate 5X)

LPDDR5X is a type of synchronous dynamic random-access memory (SDRAM) primarily designed for mobile and embedded systems, emphasizing low power consumption and high bandwidth.

  • Characteristics: It offers higher per-pin bandwidth (8533 Mb/s per pin) than standard DDR5 (7200 MT/s) and is more power-efficient due to shorter interconnections. It is typically integrated near a system-on-chip (SoC).
  • IKS Usage: IKS utilizes LPDDR5X for its internal memory to store embedding vectors due to its bandwidth and power efficiency, despite traditional reliability concerns in datacenter settings (which the paper argues are less critical for ENNS due to bit flip resilience).

SIMD (Single Instruction, Multiple Data)

A class of parallel computing where a single instruction operates on multiple data points simultaneously. This is often implemented through specialized CPU extensions (e.g., AVX-512, Intel AMX) or in GPUs. ENNS is highly data-parallel and benefits greatly from SIMD operations (e.g., vector dot products).

GPU (Graphics Processing Unit)

A specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer for output to a display device. Modern GPUs are highly parallel processors capable of performing a large number of computations simultaneously, making them suitable for deep learning workloads.

  • HBM (High-Bandwidth Memory): A type of high-performance RAM used with GPUs and some FPGAs. It's known for very high memory bandwidth but is significantly more expensive than DDR or LPDDR memories.
  • Limitations for ENNS: While GPUs can accelerate ENNS, they often have excess compute capacity relative to their memory bandwidth for memory-bound workloads like ENNS, leading to poor utilization. Their HBM is also very expensive, and offloading large vector databases to GPU memory can require multiple GPUs, increasing cost.

Intel AMX (Advanced Matrix Extensions)

A set of new CPU extensions introduced in Intel Xeon Scalable Processors (e.g., Sapphire Rapids). AMX is designed to accelerate matrix multiplication and deep learning workloads, particularly with bfloat16 data types. It provides specialized hardware units (tiles) for efficient tensor operations.

Roofline Model

A graphical model used in computer architecture to characterize the performance of a computation based on its operational intensity (floating-point operations per byte accessed from memory) and the hardware's peak performance (FLOPS) and peak memory bandwidth. It helps identify whether a workload is compute-bound (limited by processing speed) or memory-bound (limited by data transfer speed).
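As a back-of-the-envelope example (our arithmetic, using the baseline CPU figures from Section 5.3), single-query ENNS over 16-bit embeddings performs one multiply-accumulate (2 FLOPs) per 2-byte element read, so its operational intensity is about 1 FLOP/byte; batching B queries raises it to roughly B FLOP/byte because each embedding element is reused across queries:

```latex
% Roofline arithmetic for single-query ENNS over fp16 embeddings; the CPU numbers
% (16 cores x 164 GFLOP/s, 256 GB/s) come from the baseline described in Section 5.3.
\[
  I_{\mathrm{ENNS}} = \frac{2~\text{FLOPs per element}}{2~\text{bytes per element}} = 1~\tfrac{\text{FLOP}}{\text{byte}},
  \qquad
  I_{\mathrm{balance}} \approx \frac{16 \times 164~\text{GFLOP/s}}{256~\text{GB/s}} \approx 10~\tfrac{\text{FLOP}}{\text{byte}}.
\]
```

Because 1 FLOP/byte is well below the balance point, ENNS lands in the bandwidth-limited region of the roofline, which is the quantitative reason the paper treats it as a memory-bound workload.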

Transformer Networks

The foundational architecture for most modern LLMs. Transformers rely heavily on the self-attention mechanism to weigh the importance of different parts of the input sequence when processing each element. They consist of an encoder stack and a decoder stack.

BERT (Bidirectional Encoder Representations from Transformers)

A popular Transformer-based language model pre-trained by Google. BERT is often used as an encoder for generating embedding vectors for text due to its strong contextual understanding.

T5 (Text-to-Text Transfer Transformer)

Another Transformer-based model by Google, framed as a text-to-text problem, meaning every NLP task is converted into a text generation task. T5 can be used as a generative model in RAG. Fusion-in-Decoder (FiD) is a variant where multiple retrieved documents are encoded independently and then their representations are concatenated before being passed to the decoder.

Llama (Large Language Model Meta AI)

A family of LLMs developed by Meta AI. These models are known for their strong performance and open-source availability. The paper uses 4-bit-Quantized Llama-3-8B-Instruct and Llama-3-70B-Instruct as generative models.

Faiss

An open-source library developed by Meta for efficient similarity search and clustering of dense vectors. It provides implementations of many ENNS and ANNS algorithms and is widely used in vector database applications.

3.2. Previous Works

The paper builds upon a rich body of work in LLMs, RAG, and nearest neighbor search acceleration.

LLM Limitations

  • Static Knowledge & Costly Updates: The issue of LLMs having static knowledge that is expensive to update is well-documented [108]. Early LLMs relied solely on their parametric knowledge, which became outdated.
  • Hallucinations: The problem of LLM hallucinations is a major research area [83, 103], highlighting the need for external grounding.
  • Limited Memorization & Temporal Degradation: LLMs have shown limited memorization for less frequent entities [29] and can exhibit temporal degradation (performance decline over time as information becomes stale) [31].

Retrieval-Augmented Generation (RAG)

  • Foundational RAG Models: The concept of RAG gained significant traction with works like Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Lewis et al. (2020) [41] and Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering by Izacard and Grave (2021) [26], both from Meta AI. These works established the two-phase retriever-generator pipeline that the current paper's FiDT5 and Llama implementations closely align with.
  • Dense Retrieval: The effectiveness of dense retrieval using bi-encoder neural networks for RAG has been demonstrated across various modalities [19, 25, 26, 41, 71, 74], as highlighted by Karpukhin et al. (2020) [30] with Dense Passage Retrieval (DPR). The paper's focus on dense retrieval reflects this trend.
  • RAG Applications: RAG has been successfully applied to numerous NLP tasks, including dialogue generation [2, 8, 10, 83], machine translation [18, 21, 102], question answering [25, 26, 41, 66, 67, 76, 78], summarization [58, 64], code generation [20, 49], and multi-modal tasks [11, 12, 15, 19, 70, 71].

Nearest Neighbor Search Algorithms

  • ENNS: Exhaustive search is the baseline for KNN.
  • ANNS Algorithms:
    • Product Quantization (PQ) [28]: A technique to compress embedding vectors and speed up search by breaking vectors into subvectors and quantizing them.
    • Inverted File with Product Quantization (IVFPQ) [7]: Combines PQ with an inverted file index (like in traditional information retrieval) to narrow down the search space to a subset of clusters.
    • Hierarchical Navigable Small World (HNSW) [52]: A graph-based ANNS algorithm that constructs a multi-layered graph where searches efficiently navigate through layers to find neighbors. The paper uses HNSW as its representative ANNS algorithm for comparison.
  • Challenges with High-Quality ANNS: Prior work has shown that achieving high-quality ANNS can be nearly as slow as ENNS [45, 94], and GPUs are not always effective for complex ANNS algorithms like IVFPQ and HNSW [27].
  • DIMM-based NMP: Solutions like AxDIMM [33, 35] and others [4, 62, 111] have explored NMP by integrating computation into DDR DIMMs.
    • Limitation: These approaches often require sophisticated mechanisms for shared address space [61], limit per-rank memory capacity, and may compromise host CPU memory capacity when used as accelerators. They also face compute and thermal capacity limitations.
  • Specialized ANNS Accelerators:
    • ANNA [40]: A specialized architecture for PQ-based ANNS.
    • NDSearch [95]: An accelerator for graph-based ANNS (like HNSW).
    • Limitation: The paper argues these are highly task-specific due to complex algorithms and memory access patterns of ANNS, making them less generalizable as different corpora might be better suited for different ANNS algorithms. ENNS, being simpler, is more amenable to general acceleration.
  • Computational CXL Memory: Sim et al. (2023) [84] explored computational CXL memory for ENNS acceleration.
    • Limitation: This work used DDR DRAM, which may not meet the power and bandwidth requirements for the large corpus sizes relevant to RAG.
  • CXL-PNM (CXL-based Processing Near Memory): Park et al. (2024) [57] presented an LPDDR-based CXL-PNM platform for Transformer-based LLMs. IKS shares some foundational ideas but differs in its scale-out NMA architecture and cache-coherent interface design.

3.3. Technological Evolution

The evolution of LLMs began with models relying purely on their internal parametric knowledge, leading to issues like hallucinations and stale information. This spurred the development of Retrieval-Augmented Generation (RAG) systems, which dynamically fetch external, up-to-date information to augment LLM responses.

Initially, RAG research focused on various retrieval models, with dense retrieval becoming prominent for its effectiveness. However, the dense retrieval phase, particularly Exact Nearest Neighbor Search (ENNS), quickly emerged as a performance bottleneck due to its memory-bandwidth intensive nature and the rapidly growing size of vector databases. While Approximate Nearest Neighbor Search (ANNS) offered speed, it often compromised retrieval quality, forcing LLMs to process more documents, negating retrieval gains and increasing generation time.

This paper's work (IKS) fits into the timeline by addressing this critical retrieval bottleneck. It represents a shift towards specialized hardware acceleration for ENNS, recognizing its importance for RAG's overall accuracy and latency. By leveraging CXL technology, IKS pushes ENNS computation directly near memory, aiming for both high performance and cost-effectiveness, moving beyond general-purpose CPUs and GPUs which are often inefficient for this specific memory-bound workload. It also differentiates itself from task-specific ANNS accelerators by focusing on the more generalizable ENNS.

3.4. Differentiation Analysis

Compared to main methods in related work, IKS introduces several core differences and innovations:

  • Focus on ENNS over ANNS for Acceleration: Unlike prior ANNS accelerators (e.g., ANNA [40] for PQ, NDSearch [95] for graph-based ANNS), IKS specifically targets ENNS. The authors argue that ANNS accelerators are highly task-specific and dataset-dependent, while ENNS offers universal high quality and simpler algorithms amenable to more general acceleration. Their empirical findings confirm that ENNS often leads to Pareto-superior configurations for RAG requiring high accuracy.
  • CXL Type 2 Device for Shared Address Space: Instead of DIMM-based NMP (e.g., AxDIMM [33]), which often struggle with shared address space and capacity limitations, IKS is a Type-2 CXL device. This enables a cache-coherent shared address space between the host CPU and NMAs, eliminating DMA overheads and simplifying data management. It also scales independently in memory capacity.
  • Novel Cache-Coherent Interface for Offload: IKS designs a unique CPU-accelerator interface atop CXL.cache. This interface uses memory-mapped context buffers and a coherent doorbell register for low-overhead offloading and notification (leveraging umwait()), avoiding expensive interrupts or polling mechanisms common in PCIe-based offloads.
  • Scale-Out Near-Memory Acceleration Architecture: IKS distributes NMA logic across multiple low-profile chips, each adjacent to an LPDDR5X package. This scale-out design, driven by physical manufacturing constraints (reticle limits, shoreline for PHYs), allows for high aggregate bandwidth, improved yield, and better thermal management compared to monolithic large accelerators.
  • Cost-Effective and Power-Efficient Memory: IKS utilizes LPDDR5X DRAM instead of HBM (used in GPUs) or DDR (used in some CXL NMP). LPDDR5X offers higher bandwidth than DDR5 and is more power-efficient and cost-effective than HBM, making it suitable for high-capacity vector databases.
  • Hardware/Software Co-Design: IKS implements a minimalist hardware architecture, relying on software to manage data mapping and final top-K aggregation. This approach simplifies the NMA design while still achieving high performance.
  • Memory Disaggregation: IKS is explicitly designed as a memory expander that can disaggregate its internal DRAM for other applications, preventing memory stranding – a common issue with dedicated accelerators.

4. Methodology

4.1. Principles

The core principle behind Intelligent Knowledge Store (IKS) is to directly address the memory bandwidth-bound nature of Exact Nearest Neighbor Search (ENNS) in RAG applications by moving computation near memory in a cost-effective and scalable manner. IKS leverages the Compute Express Link (CXL) Type 2 specification to create a compute-enabled memory expander. This allows the host CPU and near-memory accelerators (NMAs) to share a cache-coherent address space, enabling low-overhead offloading of vector similarity search operations.

The theoretical basis and intuition are rooted in observing that ENNS operations have no data reuse for pairwise similarity calculations, consist of simple vector-vector dot-products with top-K logic, exhibit regular memory access patterns, and are highly parallelizable. These characteristics make ENNS ideal for near-memory acceleration because:

  1. Cache Inefficiency: Deep cache hierarchies are not beneficial for ENNS due to the lack of data reuse, and can even introduce overheads.

  2. Low Overhead Coherency: Limited data reuse with large datasets simplifies software-managed cache coherency between the host CPU and NMAs.

  3. Predictable Access: Regular memory access patterns enable coarse-grain virtual-to-physical address translation on NMAs.

  4. Distributed Parallelism: ENNS can be efficiently offloaded to a distributed array of NMAs, each processing a shard of the corpus data in parallel, with a simple top-K aggregation step at the end.

    IKS co-designs hardware and software to implement a minimalist scale-out near-memory accelerator architecture that is exposed as a memory expander, preventing DRAM stranding and providing cost-effective high capacity.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. IKS System Integration and Architecture Overview

The Intelligent Knowledge Store (IKS) is designed as a Type-2 CXL device, meaning it supports both CXL.mem (memory protocol) and CXL.cache (caching protocol). This allows IKS to function both as a regular memory expander and as a high-performance vector database accelerator.

The following figure (Figure 5 from the original paper) provides an overview of the IKS architecture and its integration:

Fig. 5. (a) IKS internal DRAM, scratchpad spaces, and configuration registers are mapped to the host address space. The scratchpad and configuration register address ranges are labeled as Context Buffers (CB). (b) IKS is a compute-enabled CXL memory expander that includes eight LPDDR5X packages with one near-memory accelerator (NMA) chip near each package. (c) Each NMA includes 64 processing engines. (d) Dot-product units reuse the query vector (QV) dimension across 68 MAC units.

System Address Space (Figure 5a): IKS's internal memory (LPDDR5X DRAM), scratchpad spaces (for queries and results), and configuration registers are all mapped into the host CPU's address space. This creates a unified address space where both the CPU and IKS can cache addresses, simplifying interaction and data sharing. The scratchpad and configuration register ranges are specifically labeled as Context Buffers (CB).

IKS Architecture (Figure 5b): IKS employs a scale-out near-memory processing architecture. It features:

  • Eight LPDDR5X Packages: These DRAM packages provide high-capacity, high-bandwidth, and low-power memory for storing the embedding vectors of the vector database. The total capacity of a single IKS unit is 8 × 512 Gb = 4 Tb = 512 GB.
  • Eight Near-Memory Accelerators (NMAs): Each LPDDR5X package is directly connected to a dedicated NMA chip. Each NMA integrates both LPDDR5X memory controllers and the accelerator logic for ENNS.
  • CXL Controller: A central CXL controller on the IKS card connects to the host CPU via a CXL.mem/CXL.cache link (over PCIe). Each NMA connects to this CXL controller via a ×2 PCIe 5.0 uplink, providing 8 GBps per NMA. This scale-out design (distributing NMA logic over multiple smaller chips) is chosen to manage chip area, improve manufacturing yield, and achieve high aggregate chip shoreline for memory PHYs.
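For intuition, the per-device totals implied by these figures (our arithmetic, combining the 512 Gb packages above with the 136 GBps per-package DRAM bandwidth cited in Section 4.2.4 and the 8 GBps per-NMA uplink) are:

```latex
% Back-of-the-envelope totals for one IKS card (derived from the per-package
% figures in Sections 4.2.1 and 4.2.4; not quoted directly from the paper).
\[
\begin{aligned}
  \text{Capacity}                 &= 8 \times 512~\text{Gb} = 4~\text{Tb} = 512~\text{GB} \\
  \text{Aggregate DRAM bandwidth} &= 8 \times 136~\text{GB/s} = 1088~\text{GB/s} \\
  \text{Aggregate CXL uplink}     &= 8 \times 8~\text{GB/s}   = 64~\text{GB/s}
\end{aligned}
\]
```

The roughly 17× gap between internal DRAM bandwidth and the uplink is exactly what near-memory acceleration exploits: embedding vectors are scanned locally, and only the small top-K lists cross the CXL link.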

4.2.2. Offload Model

The IKS offload model is designed for seamless and efficient userspace interaction without system calls or context switches.

  1. Data Storage: The host CPU stores embedding vectors in a specific block data layout (detailed in Section 4.2.5) within the IKS LPDDR5X DRAM. The actual human-readable documents corresponding to these embedding vectors are stored in the host's DDR or CXL memory.
  2. Offload API: The vector database application running on the host CPU initiates an ENNS offload by calling a blocking API: iks_search(query). This API hides the hardware interaction complexity.
  3. Cache Coherency: After any update operations to the vector database in IKS memory, the CPU flushes its caches to ensure NMAs receive the most up-to-date data.
  4. Offload Context: The iks_search(query) API prepares an offload context. This context includes:
    • The query vectors (for batch processing).
    • The vector dimensions (VD).
    • The base address of the first embedding vector stored in each LPDDR5X package.
  5. Initiating Offload: The offload context is written to memory-mapped regions called context buffers (part of the IKS address space, visible to the host CPU and NMAs). The offload is then initiated by writing to a doorbell register.
  6. Notification: The host CPU then calls umwait() (a user-mode instruction for waiting) to monitor the doorbell register. This register is shared and kept cache-coherent via the CXL.cache protocol, allowing the CPU to efficiently wait for the NMAs to complete without busy-polling or interrupts.
  7. Aggregation: Once all NMAs complete their respective searches, they update the doorbell register, notifying the CPU. The CPU then executes an aggregation routine to combine the partial top-K lists from each NMA into a single, final top-K list.
  8. Document Retrieval: The CPU uses the physical addresses of the top-K embedding vectors (which are known a priori because of the fixed data layout) to retrieve the actual documents from host memory.
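The host-side flow can be pictured with the hypothetical Python sketch below. The iks_search() call, context buffers, and doorbell are hardware/driver details that are not modeled here; only the final software aggregation of the eight per-NMA top-32 lists (step 7 above) is shown concretely.

```python
import heapq

NUM_NMAS = 8     # one NMA per LPDDR5X package
HW_TOP_K = 32    # each NMA returns a fixed top-32 partial list (Section 4.2.4)

def aggregate_top_k(partial_lists, k):
    """Merge per-NMA (score, address) lists, each sorted by descending score,
    into a single global top-k list (step 7 of the offload model)."""
    merged = heapq.merge(*partial_lists, reverse=True)   # streams results in descending order
    return list(merged)[:k]

# Illustrative partial results, as if read back from the output scratchpads.
partials = [
    sorted(((0.1 * nma + 0.01 * i, ("nma%d" % nma, i)) for i in range(HW_TOP_K)), reverse=True)
    for nma in range(NUM_NMAS)
]
print(aggregate_top_k(partials, k=16))
```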

4.2.3. Cache Coherent Interface

The IKS design leverages the CXL.cache protocol to implement an efficient and low-overhead interface between NMAs and host CPU processes through shared memory.

The following figure (Figure 6 from the original paper) illustrates the transaction flow through the CXL.cache interface:

Fig. 6. CPU-IKS interface through cache coherent CXL interconnect.

  1. Host Writes Offload Context (Step 1): The host CPU writes the offload context (containing query vectors, vector dimensions, base addresses) to predefined context buffer address ranges. These buffers are cacheable shared memory regions, and the CPU uses temporal writes to populate them efficiently.

  2. Host Writes Doorbell Register (Step 2-3): After populating the context buffers, the host CPU writes a specific value to a doorbell register. This doorbell register is mapped to a cache line that is shared by both the NMAs and the host, ensuring coherency.

  3. Host Waits (Step 3-4): Immediately after writing the doorbell, the host CPU executes the umwait() instruction, which makes the CPU core enter a low-power, idle state while monitoring the doorbell register for changes. This provides fine-grained notification without costly interrupts or busy-waiting.

  4. NMA Polls Doorbell (Step 4): The NMAs continuously poll their local copy of the cache-coherent doorbell register. As soon as they detect a change (i.e., the host has written a new value), they recognize that an offload request has arrived, and the ENNS computation starts.

  5. NMA Reads Offload Context (Step 5-6): The NMA reads the offload context from the IKS cache (which reflects the CPU's temporal writes via CXL.cache coherency) and loads it into its internal scratchpad.

  6. NMA Computation (Step 7): Each NMA performs its ENNS computation on its local shard of the vector database using its processing engines.

  7. NMA Updates Context Buffers (Step 8-9): Upon completion, each NMA writes its partial top-K list (similarity scores and corresponding physical addresses of embedding vectors) back into the output scratchpad within the shared context buffer range.

  8. NMA Writes Doorbell (Step 10-11): Finally, each NMA writes a completion signal to the doorbell register.

  9. Host Notified & Aggregates (Step 11-12): The host CPU, which was umwait()ing, is notified of the change in the doorbell register. It then proceeds to read the partial top-K lists from the output scratchpads (which are also cache-coherent) and aggregates them into the final top-K result.

    This cache-coherent interface significantly reduces offload overheads, achieving 1.6× higher throughput than CXL.io-mimicking non-temporal writes (which resemble PCIe MMIO).

4.2.4. NMA Architecture

As detailed in Figure 5c and 5d, each Near-Memory Accelerator (NMA) chip within IKS is specifically designed for ENNS.

  • Processing Engines (PEs): Each NMA includes 64 processing engines. These PEs enable parallel calculation of similarity scores for up to 64 query vectors simultaneously, supporting batch processing.
  • Components per PE: Each processing engine is composed of:
    • Query Scratchpad: A small, fast SRAM buffer (2KB in the current incarnation) to store a query vector.
    • Dot-Product Unit: The computational core for similarity score calculation.
    • Top-K Unit: Hardware logic for maintaining an ordered list of the top K similarity scores (fixed at K = 32 in hardware) and their corresponding embedding vector addresses.
    • Output Scratchpad: A buffer to store the partial top-K list before being read by the CPU.
  • Central Control Unit: Each NMA has a central control unit that manages memory accesses, internal data movement, and activates processing engines based on the batch size of query vectors.
  • Network-on-Chip (NoC): A fixed broadcast network connects the DRAM interface to all processing engines. This allows embedding vectors read from DRAM to be broadcast to multiple active PEs, facilitating data reuse when processing different query vectors in a batch.
  • Dot-Product Unit Operation (Figure 5d):
    • Each dot-product unit contains 68 MAC (Multiply-Accumulate) units.
    • These MAC units operate at 1 GHz, providing 68 GFLOPS (16-bit floating-point operations) of compute throughput, designed to saturate the 136 GBps memory bandwidth of the LPDDR5X channels.
    • Parallel Calculation: In each clock cycle, 68 MAC operations are performed. For a given vector dimension j:
      • The first input to MAC unit p (where p ranges from 0 to 67) is dimension j of the query vector assigned to that processing engine (denoted QV[PE][j]), which is stored in the query scratchpad.
      • The second input is dimension j of embedding vector i (denoted EV[i][j]), where i ranges from 0 to 67. These embedding vectors are streamed from DRAM.
    • Vector Dimension Cycles: It takes VD (vector dimension) cycles for a dot-product unit to compute the similarity scores for a block of 68 embedding vectors.
    • Score Registers: Once a similarity score is computed, it's loaded into a score register and then streamed to the Top-K unit.
  • Top-K Unit Operation (Figure 5d):
    • The Top-K unit maintains an ordered list of K = 32 similarity scores (and corresponding addresses).
    • Incoming similarity scores are compared with the head (smallest score) of the ordered list. If an incoming score is larger, it's inserted into the list, and the smallest is evicted.
    • This insertion process is serialized but is overlapped with the similarity score evaluations because vector dimensions are typically much larger than 68, ensuring it's not on the critical path.
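A behavioral (cycle-free) Python model of one processing engine, written from the description above, may help: 68 accumulators consume one dimension of the query vector and of 68 streamed embedding vectors per step, and the resulting scores feed a top-K unit that keeps the K = 32 largest. The names and the use of local indices instead of DRAM addresses are simplifications.

```python
import numpy as np

VD, K, BLOCK = 768, 32, 68   # vector dimension, hardware top-K, vectors per block

def process_block(query_vec, ev_block, top_k):
    """Score one block of 68 embedding vectors against one query and update top_k.

    ev_block: (BLOCK, VD) array; top_k: running list of (score, local_index) pairs,
    kept sorted ascending so the smallest score sits at the head (as in the Top-K unit).
    """
    acc = np.zeros(BLOCK, dtype=np.float32)      # one accumulator per MAC unit
    for j in range(VD):                          # one "cycle" per dimension
        acc += query_vec[j] * ev_block[:, j]     # 68 multiply-accumulates
    for idx, score in enumerate(acc):            # stream scores into the Top-K unit
        if len(top_k) < K:
            top_k.append((float(score), idx))
            top_k.sort()
        elif score > top_k[0][0]:                # larger than the current smallest?
            top_k[0] = (float(score), idx)       # evict the head (serialized insertion)
            top_k.sort()
    return top_k

rng = np.random.default_rng(0)
q = rng.standard_normal(VD).astype(np.float32)
block = rng.standard_normal((BLOCK, VD)).astype(np.float32)
print(sorted(process_block(q, block, []), reverse=True)[:3])   # three best scores in the block
```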

4.2.5. Data Layout Inside DRAM and Query Scratchpad

The host CPU must adhere to specific data layouts when storing embedding vectors in IKS DRAM and populating query scratchpads.

Data Layout in LPDDR5X Package (Figure 7):

The following figure (Figure 7 from the original paper) illustrates the required data layout for embedding vectors within each LPDDR5X package:

Byte offsets within each 136-byte row are shown as column headers:

| Physical Address | 0 | 2 | 4 | ... | 134 |
| --- | --- | --- | --- | --- | --- |
| B | EV[0][0] | EV[1][0] | EV[2][0] | ... | EV[67][0] |
| B + 0×136×VD + 136×1 | EV[0][1] | EV[1][1] | EV[2][1] | ... | EV[67][1] |
| B + 0×136×VD + 136×2 | EV[0][2] | EV[1][2] | EV[2][2] | ... | EV[67][2] |
| ... | ... | ... | ... | ... | ... |
| B + 0×136×VD + 136×(VD-1) | EV[0][VD-1] | EV[1][VD-1] | EV[2][VD-1] | ... | EV[67][VD-1] |
| B + 1×136×VD | EV[68][0] | EV[69][0] | EV[70][0] | ... | EV[135][0] |
| ... | ... | ... | ... | ... | ... |
| B + 1×136×VD + 136×(VD-1) | EV[68][VD-1] | EV[69][VD-1] | EV[70][VD-1] | ... | EV[135][VD-1] |
| ... | ... | ... | ... | ... | ... |
| B + (N-67)×136×VD | EV[N-68][0] | EV[N-67][0] | EV[N-66][0] | ... | EV[N-1][0] |
| B + (N-67)×136×VD + 136×1 | EV[N-68][1] | EV[N-67][1] | EV[N-66][1] | ... | EV[N-1][1] |
| ... | ... | ... | ... | ... | ... |
| B + (N-67)×136×VD + 136×(VD-1) | EV[N-68][VD-1] | EV[N-67][VD-1] | EV[N-66][VD-1] | ... | EV[N-1][VD-1] |

Fig. 7. Data layout inside each LPDDR5X package. The host CPU communicates the base address "B", vector dimension "VD", and the number of vectors "N" to the NMAs for each offload. Four embedding vectors (EVs) are highlighted in this layout.

  • Block-based Storage: Embedding vectors (EVs) are stored in contiguous blocks, each containing 68 vectors.
  • Column-Major Order: Within each block, the embedding vectors are stored in column-major order. This means that for a block of 68 vectors (e.g., EV[0] to EV[67]), dimension 0 of all 68 vectors is stored consecutively, followed by dimension 1 of all 68 vectors, and so on, up to dimension VD-1.
  • Size per block: Each embedding vector dimension is 2 bytes (16-bit floating point). So, dimension j for 68 vectors occupies 68 × 2 = 136 bytes. A full block of 68 vectors, across all VD dimensions, occupies 136 × VD bytes.
  • Efficient Batching: This column-major layout allows the NMA to read and process embedding vectors dimension-by-dimension for up to 68 vectors simultaneously. The NMA can access up to 136 bytes per cycle from the memory controller, comprising one element from 68 distinct embedding vectors.
  • Parameters for NMA: The host CPU provides the base address ("B"), vector dimension ("VD"), and the total number of vectors ("N") in the corpus shard to each NMA.
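A possible host-side packing routine for this layout is sketched below (our code, not the paper's; zero-padding the last partial block is an assumption). It groups vectors into blocks of 68 and stores each block dimension-major, so that one 136-byte row holds dimension j of all 68 vectors:

```python
import numpy as np

def pack_for_iks(embeddings: np.ndarray) -> bytes:
    """Pack an (N, VD) float16 array into the Figure 7 layout: blocks of 68 vectors,
    stored dimension-major so one 136-byte row holds dimension j of all 68 vectors."""
    n, vd = embeddings.shape
    block = 68
    padded_n = ((n + block - 1) // block) * block
    padded = np.zeros((padded_n, vd), dtype=np.float16)   # zero-pad the last partial block
    padded[:n] = embeddings
    # (num_blocks, 68, VD) -> (num_blocks, VD, 68): within a block, dimensions become rows.
    blocks = padded.reshape(-1, block, vd).transpose(0, 2, 1)
    return np.ascontiguousarray(blocks).tobytes()

rng = np.random.default_rng(0)
image = pack_for_iks(rng.standard_normal((1000, 768)).astype(np.float16))
print(len(image))   # ceil(1000/68) * 136 * 768 = 1,566,720 bytes
```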

Data Layout in Query Scratchpads (Figure 8):

The following figure (Figure 8 from the original paper) illustrates the data layout for query vectors within the query scratchpads, which are mapped to the host memory address space at QS_B (the query scratchpad base address):

(Batch Size = 1) — byte offsets within each scratchpad are shown as column headers:

| Physical Address | 0 | 2 | 4 | ... | 2×(VD-1) |
| --- | --- | --- | --- | --- | --- |
| QS_B | QV[0][0] | QV[0][1] | QV[0][2] | ... | QV[0][VD-1] |

(Batch Size = 4)

| Physical Address | 0 | 2 | 4 | ... | 2×(VD-1) |
| --- | --- | --- | --- | --- | --- |
| QS_B | QV[0][0] | QV[0][1] | QV[0][2] | ... | QV[0][VD-1] |
| QS_B + 2048 | QV[1][0] | QV[1][1] | QV[1][2] | ... | QV[1][VD-1] |
| QS_B + 2×2048 | QV[2][0] | QV[2][1] | QV[2][2] | ... | QV[2][VD-1] |
| QS_B + 3×2048 | QV[3][0] | QV[3][1] | QV[3][2] | ... | QV[3][VD-1] |

(Batch Size = 64)

| Physical Address | 0 | 2 | 4 | ... | 2×(VD-1) |
| --- | --- | --- | --- | --- | --- |
| QS_B | QV[0][0] | QV[0][1] | QV[0][2] | ... | QV[0][VD-1] |
| QS_B + 2048 | QV[1][0] | QV[1][1] | QV[1][2] | ... | QV[1][VD-1] |
| ... | ... | ... | ... | ... | ... |
| QS_B + 63×2048 | QV[63][0] | QV[63][1] | QV[63][2] | ... | QV[63][VD-1] |

Fig. 8. Data layout inside the query scratchpads mapped to the host memory address space at query scratchpad base address "QS_B". As the batch size increases, more query scratchpads are populated with distinct query vectors.

  • Sequential Query Vectors: Query vectors (QVs) are stored in sequential memory addresses within the query scratchpads.

  • Batching: As the batch size increases, more query scratchpads (one per processing engine) are populated with distinct query vectors. Each query vector is stored linearly, with its dimensions following each other.

  • Size per Query Vector: Each query vector (assuming VD dimensions and 2 bytes per dimension) occupies 2 × VD bytes. The 2048-byte stride shown implies a maximum VD of 1024 for a 2KB scratchpad (2048 bytes / 2 bytes per dimension = 1024 dimensions).

    This simplified data layout (both for embedding vectors and query vectors) facilitates address generation and network-on-chip (NoC) architecture within the NMAs. The vector database application's memory allocation scheme is modified to implement this block data mapping.
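The address arithmetic implied by Figure 8 is straightforward; the toy helper below (QS_B is a placeholder base address) computes where dimension j of query vector b in a batch lands:

```python
QS_B = 0x0    # placeholder: query-scratchpad base address mapped into the host address space
VD = 768      # must satisfy VD <= 1024 (2 KB scratchpad, 2 bytes per dimension)

def query_element_addr(batch_idx: int, dim: int) -> int:
    """Byte address of dimension `dim` of query vector `batch_idx`, per Figure 8."""
    assert dim < VD <= 1024
    return QS_B + batch_idx * 2048 + 2 * dim

print(hex(query_element_addr(3, 5)))   # QS_B + 3*2048 + 10
```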

4.2.6. Multi-Tenancy

IKS supports both spatial and coarse-grain temporal multi-tenancy:

  • Spatial Multi-Tenancy: The IKS driver can partition embedding vectors from different vector databases across different LPDDR5X packages. This allows each NMA to perform ENNS independently for different vector databases in parallel.
  • Temporal Multi-Tenancy: For vector databases that share the same LPDDR5X package, the IKS driver can time-multiplex similarity searches among them. This time-multiplexing occurs at the boundary of a complete similarity search operation.

5. Experimental Setup

5.1. Datasets

The primary dataset used for evaluating the RAG applications is the Google's Natural Questions (NQ) dataset [38, 39].

  • Source and Characteristics: NQ is a large-scale dataset for open-domain question answering. It consists of real user questions issued to the Google search engine and corresponding answers derived from Wikipedia pages. Each question is paired with a Wikipedia page that contains the answer, along with supporting document passages.
  • Scale: The NQ dataset is typically split into a training set (nq-train) and a validation set (nq-dev). Meta's KILT benchmark [65] provides these specific splits.
  • Corpus Size: The document corpus (knowledge source) for all workloads is constructed from Wikipedia, as described in Karpukhin et al. (2020) [30]. The paper tests IKS with various vector database sizes (corpus sizes) storing the embedding vectors, including 50 GB and 512 GB. The documents themselves (plaintext) are stored in CPU memory.
  • Embedding Generation: A pre-trained BERT base (uncased) model is used to generate embedding vectors for the documents. 16-bit floating point representation is assumed for these vectors.
  • Example Data Sample (Conceptual):
    • Question: "What is the capital of France?"
    • Relevant Document Passage (from Wikipedia): "...Paris is the capital and most populous city of France..."
    • Answer: "Paris"
    This structure is typical for NQ dataset entries. The embedding vector of the question would be matched against the embedding vectors of document passages to retrieve the most relevant ones.

5.2. Evaluation Metrics

The paper uses a combination of retrieval accuracy, generation accuracy, and performance metrics.

Retrieval Accuracy

  • Recall: This metric quantifies how many of the truly relevant documents (identified by ENNS as perfect) are successfully retrieved by an ANNS algorithm.
    • Conceptual Definition: Recall measures the proportion of relevant items that are successfully retrieved by an information retrieval system out of all relevant items that should have been retrieved. In the context of ANNS vs. ENNS, it indicates how well ANNS can find the same ground truth relevant documents as ENNS.
    • Mathematical Formula: $\mathrm{Recall} = \frac{|\{\text{relevant documents retrieved by ANNS}\} \cap \{\text{relevant documents retrieved by ENNS}\}|}{|\{\text{relevant documents retrieved by ENNS}\}|}$
    • Symbol Explanation:
      • $|\{\text{relevant documents retrieved by ANNS}\} \cap \{\text{relevant documents retrieved by ENNS}\}|$: The number of relevant documents retrieved by both the Approximate Nearest Neighbor Search (ANNS) algorithm and the Exact Nearest Neighbor Search (ENNS) algorithm. ENNS is treated as the ground truth for relevant documents.
      • $|\{\text{relevant documents retrieved by ENNS}\}|$: The total number of relevant documents retrieved by the Exact Nearest Neighbor Search (ENNS) algorithm.
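This definition translates directly into code; the following snippet (with illustrative document IDs) computes the recall of an ANNS result against the ENNS ground truth:

```python
def anns_recall(anns_ids, enns_ids):
    """Recall of an ANNS result list, treating the ENNS result list as ground truth."""
    enns, anns = set(enns_ids), set(anns_ids)
    return len(anns & enns) / len(enns)

print(anns_recall(anns_ids=[3, 7, 9, 21], enns_ids=[3, 7, 11, 21]))   # 0.75
```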

Generation Accuracy

This refers to how well the end-to-end RAG system answers questions. The metric depends on the specific generative model used.

  • Exact Match (EM) for FiDT5:

    • Conceptual Definition: Exact Match is a strict metric that measures whether the generated answer text is identical to one of the reference answers, after normalization (e.g., lowercasing, removing punctuation). It's common for extractive question answering tasks.
    • Mathematical Formula: (standard EM formula, not explicitly provided in the paper but universally understood) $\mathrm{EM} = \frac{\text{Number of questions with exact matching answers}}{\text{Total number of questions}} \times 100\%$
    • Symbol Explanation:
      • Number of questions with exact matching answers: the count of questions where the RAG application's generated answer, after normalization, exactly matches one of the acceptable reference answers.
      • Total number of questions: the total number of questions in the evaluation dataset.
  • ROUGE-L Recall for Llama-8B and Llama-70B:

    • Conceptual Definition: ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence) measures the overlap in terms of the longest common subsequence (LCS) between the generated answer and the reference answer. ROUGE-L recall specifically focuses on how much of the reference answer is captured by the generated answer. It's more flexible than Exact Match and suitable for abstractive summarization or answer generation where exact wording might vary.
    • Mathematical Formula: (standard ROUGE-L recall formula, not explicitly provided in the paper but universally understood) Let X be the reference answer, Y be the generated answer, and LCS(X, Y) the length of the longest common subsequence between X and Y. Then $\mathrm{ROUGE\text{-}L_{Recall}} = \frac{\mathrm{LCS}(X, Y)}{\mathrm{Length}(X)}$
    • Symbol Explanation:
      • LCS(X, Y): the length of the longest common subsequence between the reference answer (X) and the generated answer (Y). A subsequence does not require contiguous matches, only preserved order.
      • Length(X): the length (e.g., number of words or tokens) of the reference answer.
    • Context: The paper notes that prompting was used instead of fine-tuning for Llama models to preserve generality, which limits evaluation by prompt adherence. Therefore, recall is used over precision or F1-Score to account for potential verbosity while still assessing if key information from the correct answer is present.
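For concreteness, a minimal ROUGE-L recall implementation following the definition above (whitespace tokenization and lowercasing are simplifying assumptions) might look like:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence between token lists x and y (dynamic programming)."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l_recall(reference: str, generated: str) -> float:
    """ROUGE-L recall: LCS length divided by the reference length."""
    ref, gen = reference.lower().split(), generated.lower().split()
    return lcs_length(ref, gen) / len(ref)

print(rouge_l_recall("Paris", "The capital of France is Paris"))   # 1.0
```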

Performance Metrics

  • Inference Time (Latency): The total time taken for an end-to-end RAG application to process a single query (or a batch of queries) and generate a response. The paper specifically mentions time-to-interactive (time to first token) for latency-critical applications.
  • Throughput (Queries/sec): The number of queries an end-to-end RAG system can process per second, often measured for batch size > 1.
  • Speedup (×): A ratio comparing the performance of an accelerated system (e.g., IKS) to a baseline system (e.g., CPU), indicating how many times faster the accelerated system is.
  • Power Consumption (W): The electrical power consumed by the IKS device.
  • Area (mm²): The physical silicon area occupied by the NMA chips and PHYs.

5.3. Baselines

The paper evaluates IKS performance against several key baselines:

  • CPU (Intel Xeon 4th Gen Sapphire Rapids):

    • Model: Intel Xeon 4416+ (16 cores @ 2.00 GHz).
    • Memory: 512 GB DDR5-4000 across 8 channels (256 GB/s bandwidth).
    • Cache: 48 kB dcache, 32 kB icache (L1), 2 MB (L2), 37.5 MB shared (L3).
    • Compute: 2x AVX-512 FMA units (164 GFlop/s/core).
    • Purpose: This serves as the primary baseline for ENNS retrieval, representing a high-performance general-purpose processor.
  • Intel AMX (Advanced Matrix Extensions):

    • Description: Integrated into Intel Sapphire Rapids CPUs. AMX is specialized for BFloat16 operations and provides 500 GFlop/s/core.
    • Purpose: Evaluates the impact of specialized CPU instruction sets for accelerating ENNS. The paper notes AMX speedup is flat for small batch sizes due to the memory-bound nature of the workload.
  • NVIDIA H100 GPU (SXM):

    • Model: NVIDIA H100 SXM.
    • Memory: High-Bandwidth Memory (HBM) with 3.35 TB/s bandwidth.
    • Compute: 1979 TFlop/s.
    • Purpose: Represents a state-of-the-art GPU accelerator commonly used for LLM inference and general data-parallel tasks. It's used to accelerate both the generation phase (for FiDT5, Llama-8B, Llama-70B) and as a baseline for ENNS acceleration.
    • Configurations: Tested with 1, 2, 4, and 8 H100 GPUs to evaluate scalability and capacity for large corpus sizes. The paper highlights its high cost and HBM limitations.
  • ANNS (Approximate Nearest Neighbor Search) Configurations:

    • Algorithm: HNSW (Hierarchical Navigable Small World) [52], a state-of-the-art graph-based ANNS.
    • Index Parameters: M = 32, efConstruction = 128 (chosen to optimize accuracy while keeping graph construction reasonable).
    • Configurations:
      • ANNS-1: efSearch = 2048.
      • ANNS-2: efSearch = 10000.
    • Purpose: To compare the trade-offs between retrieval quality, speed, and end-to-end RAG accuracy against ENNS. The paper demonstrates that while ANNS can be faster, achieving high generation accuracy with ANNS often requires a larger K (more documents), which negates the speed benefits.

5.4. Software Configuration

  • Evaluation Dataset: Google's Natural Questions (NQ) dataset, specifically the nq-dev (validation) split, is used for evaluating models.
  • Retrieval Model: A pre-trained BERT base (uncased) model is used to generate embedding vectors for documents and queries.
  • Vector Database Library: Faiss [27] is used for index management for both ENNS and ANNS.
    • ENNS Optimization: For ENNS, Intel's OneMKL BLAS backend is used for dot-product calculations at all batch sizes, providing better performance than Faiss's default BLAS usage (which only applies to batch sizes ≥ 20).
  • Generative Models:
    • FiDT5: Uses the T5-based Fusion-in-Decoder [26, 68]. The generator is a pre-trained T5-base model (220 million parameters) fine-tuned on the nq-train dataset to predict answers from question-evidence pairs. Documents are presented to the T5 encoder, and their representations are combined for the decoder.
    • Llama-8B and Llama-70B: Uses 4-bit-Quantized Llama-3-8B-Instruct and Llama-3-70B-Instruct [3]. Retrieved documents are presented as plaintext in the prompt. Prompting is used over fine-tuning to maintain model generality.
  • Batching: Experiments consider batch sizes of 1 (for latency-critical applications) and 16 (for throughput-optimized applications).
  • Document Count (K): The number of documents (K) retrieved and fed to the LLM is a key parameter. Increasing K significantly impacts generation time due to the linear scaling of Transformer inference with input size and the increased key-value cache memory. Each document is approximately 100 words (avg. 127 tokens for Llama models).

5.5. Experimental Setup (Hardware & Simulation)

  • Physical Servers: Two servers with Intel Xeon 4th generation CPUs and one NVIDIA H100 GPU are used for baseline measurements and end-to-end RAG execution (generation phase on H100).
  • IKS Evaluation: A cycle-approximate simulator (see Appendix A) is developed to model IKS performance. This simulator uses:
    • Timing parameters from RTL synthesis (for NMA logic).
    • LPDDR5X access timing.
    • PCIe/CXL timing [43, 81].
    • Calculations for real software stack overheads (e.g., top-K aggregation, umwait() overhead).
    • It emulates IKS as a CXL device running on a remote CPU socket.
  • RTL Design and Synthesis: The NMA (Near-Memory Accelerator) RTL design for IKS was synthesized using Synopsys Design Compiler targeting TSMC's 16nm technology node to obtain area, power, and timing metrics for 1 GHz operation.
  • Area and Power Estimation:
    • Memory controllers and PHYs area estimated from Apple M2 die shots (LPDDR5 in 5nm process), scaled to 16nm (assuming negligible mixed-signal scaling [23, 88]).
    • Power model developed by evaluating RTL-level energy consumption of processing operations and data access to scratchpads and LPDDR memory. Energy values (e.g., SRAM 39 fJ/bit, LPDDR 4 pJ/bit [13]) are scaled to 16nm [79].

6. Results & Analysis

6.1. Core Results Analysis

The paper first explores the trade-offs between retrieval quality and generation accuracy for RAG applications, comparing ENNS with ANNS variants.

The following figure (Figure 2 from the original paper) compares the generation accuracy and throughput of ANNS- and ENNS-based RAG applications:

Fig. 2. Generation accuracy vs. throughput (Queries/sec) of representative RAG applications for various retrieval algorithms and document counts (K). The corpus size is set to 50 GB and the batch size to 16.

Analysis of Figure 2:

  • Impact of Retrieval Quality: The figure clearly demonstrates that retrieval quality strongly influences end-to-end generation accuracy.

  • ANNS Accuracy Drop:

    • With a document count (K) of 1, ANNS-1 and ANNS-2 (using HNSW with efSearch of 2048 and 10000 respectively) show significant accuracy drops compared to ENNS:
      • FiDT5: 22.6% and 34.0% reduction.
      • Llama-8B: 52.8% and 53.4% reduction.
      • Llama-70B: 51.0% and 51.5% reduction.
    • Even with a larger document count (K) of 16, accuracy reductions are substantial:
      • FiDT5: 13.6% and 22% reduction.
      • Llama-8B: 38.4% and 42.2% reduction.
      • Llama-70B: 38.4% and 45.2% reduction.
  • Larger LLMs and Retrieval Quality: The impact of retrieval quality on generation accuracy appears even more pronounced for Llama-8B and Llama-70B compared to FiDT5, especially at lower K values. This suggests that larger LLMs not fine-tuned for the task are more sensitive to the quality of the retrieved context.

  • Throughput vs. Accuracy Trade-off: While ANNS configurations (e.g., ANNS-2) can offer higher throughput (Queries/sec) than ENNS at certain K values, this often comes at the cost of lower generation accuracy.

  • Compensation for ANNS: Increasing K to compensate for lower ANNS retrieval quality can lead to lower generation accuracy and end-to-end throughput compared to a higher-quality, slower ANNS configuration (e.g., ANNS-2 with 16 documents vs. ANNS-1 with 128 documents for FiDT5, where ANNS-2 achieves 3% higher accuracy and 128% higher throughput). This is because larger K significantly increases generation time due to the LLM's computational and memory overheads.

  • Pareto-Superiority of ENNS: For RAG applications requiring high generation accuracy (e.g., above ~43% for FiDT5, ~27% for Llama-8B, ~14% for Llama-70B), ENNS-based RAG exhibits Pareto-superiority. This means ENNS achieves higher accuracy for a given throughput or higher throughput for a given accuracy, especially at the high-accuracy end of the spectrum.

  • Modest Speedup of High-Quality ANNS: The paper notes that ANNS-2, the best-performing ANNS configuration in terms of throughput in this analysis, offers only a 2.5× speedup compared to ENNS. This modest speedup, coupled with lower accuracy, reinforces the motivation for ENNS acceleration.

    Conclusion from this section: High-quality retrieval is crucial for RAG's effectiveness, and ENNS often yields the best accuracy-throughput trade-off at high accuracy demands, despite its higher baseline cost. This motivates the need to accelerate ENNS specifically.

6.1.2. End-to-End RAG Performance with ENNS

After establishing the importance of ENNS, the paper profiles its end-to-end performance as a bottleneck.

The following figure (Figure 3 from the original paper) shows the latency breakdown of FiDT5, Llama-8B, Llama-70B for various values of K and corpus sizes:

Fig. 3. Latency breakdown of FiDT5, Llama-8B, Llama-70B for various values of K, corpus sizes. All configurations use batch size 1. Retrieval is ENNS and runs on CPU, generation runs on a single NVIDIA H100 (SXM) for all generative models. The value in each bar shows the absolute retrieval time.

Analysis of Figure 3:

  • Retrieval as a Bottleneck: For all RAG applications (FiDT5, Llama-8B, Llama-70B), ENNS retrieval on the CPU (represented by the blue portion of the bars) consumes a significant portion of the end-to-end latency.
    • For a 50GB corpus and K=1, retrieval time can be 97% (FiDT5), 90% (Llama-8B), and 74% (Llama-70B) of the total time.
    • As corpus size increases to 512GB, retrieval time becomes even more dominant, reaching 99% for FiDT5, 98% for Llama-8B, and 92% for Llama-70B (all with K=1). The absolute retrieval times (indicated in the bars) are in the range of hundreds of milliseconds to several seconds for large corpus sizes.
  • Impact of K on Generation Time: While increasing K might seem like a way to compensate for lower ANNS quality, Figure 3b clearly shows that increasing K (the number of documents sent to the LLM) significantly increases the generation time (orange portion of the bars). This is due to the linear scaling of Transformer inference with input size and KV-cache memory overheads, making large K costly in terms of time to first token (a back-of-the-envelope KV-cache estimate follows this list).
  • Memory-bound vs. Compute-bound: ENNS is identified as memory bandwidth-bound, while generation is relatively compute-bound. The CPU-based ENNS cannot saturate the available DRAM bandwidth, indicating a fundamental limitation for software-only approaches.
  • Urgency for Retrieval Acceleration: The end-to-end latency for large corpus sizes can exceed several seconds, which is unacceptable for interactive user-facing applications. This underscores the critical need for accelerating the retrieval phase.
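
To give a sense of scale for the KV-cache term, the following is an illustrative back-of-the-envelope estimate. The model shape (32 layers, 8 grouped-query KV heads of dimension 128, fp16 cache entries) is the commonly reported Llama-3-8B configuration and is an assumption of this sketch, not a number from the paper; only the ~127 tokens per retrieved document comes from the setup above.

```python
# Back-of-the-envelope KV-cache size as a function of K (illustrative only).
# Assumed model shape: Llama-3-8B-style GQA with 32 layers, 8 KV heads, head_dim 128,
# fp16 cache entries; ~127 tokens per retrieved document (as reported above).
def kv_cache_bytes(num_docs, tokens_per_doc=127, layers=32, kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    tokens = num_docs * tokens_per_doc                              # prompt tokens from retrieved context
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # keys + values
    return tokens * per_token

for k in (1, 4, 16, 32):
    print(f"K={k:3d}: ~{kv_cache_bytes(k) / 2**20:.0f} MiB of KV cache per sequence")
# K=16 already implies roughly a quarter-gigabyte of KV cache per sequence,
# before counting the question itself or batching, which is why generation time
# and memory pressure grow quickly with K.
```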

6.1.3. High-Quality Search Acceleration

The paper examines existing hardware solutions for ENNS acceleration.

The following table (Table 1 from the original paper) compares the speedup of Intel AMX and GPU for ENNS, relative to a CPU baseline:

| | Batch 1, 50 GB corpus | Batch 1, 512 GB corpus | Batch 16, 50 GB corpus | Batch 16, 512 GB corpus |
| --- | --- | --- | --- | --- |
| CPU | 1.00× | 1.00× | 1.00× | 1.00× |
| AMX | 1.05× | 1.02× | 1.10× | 1.09× |
| GPU | 5.2× | 36.9× | 6.0× | 43.7× |

Table 1. Speedup of Intel AMX and GPU for ENNS, relative to a CPU baseline. AMX speedup is flat for very small batch sizes, due to the memory-bound nature of similarity search. For 50GB and 512GB corpus size, 1 and 8 H100 GPUs are used, respectively.

Analysis of Table 1:

  • AMX Performance: Intel AMX provides only a modest speedup (1.02–1.10×) for ENNS. This is because ENNS is primarily memory-bound, and AMX, while boosting FLOPs, cannot overcome the memory bandwidth bottleneck of the CPU system.
  • GPU Performance: GPUs offer significant speedup over CPU for ENNS (5.2–43.7×).
    • For a 50GB corpus size, a single H100 GPU achieves 5.2× (batch 1) and 6.0× (batch 16) speedup.
    • For a 512GB corpus size, 8 H100 GPUs are required to fit the data, achieving 36.9× (batch 1) and 43.7× (batch 16) speedup.
  • GPU Cost and Utilization Issues: The paper highlights two key problems with GPU acceleration for ENNS:
    1. High Cost: As corpus size grows, more GPUs are needed (e.g., 8 H100s for 512GB), significantly increasing cost due to HBM memory. HBM is several times more expensive than DDR or LPDDR.

    2. Poor Utilization: GPUs provision massive amounts of compute relative to memory bandwidth. For memory-bound ENNS, a large GPU die is poorly utilized, leading to inefficient resource usage [24].

      The following figure (Figure 4 from the original paper) illustrates the Roofline model for ENNS on CPU:

Fig. 4. Roofline model for ENNS using Batch Size 1 and 16. See Section 6 for the experimental setup.

Analysis of Figure 4 (Roofline Model):

  • The Roofline model visually confirms that ENNS running on the CPU is memory-bound: its achieved performance (FLOPs/s) is bounded by the sloped memory-bandwidth roofline rather than by the flat peak-compute ceiling.
  • The CPU cannot even saturate its available DRAM bandwidth for ENNS, reinforcing the argument that deep cache hierarchies and high FLOPs are of limited benefit for this memory-intensive workload and further motivating dedicated near-memory acceleration that prioritizes bandwidth (a back-of-the-envelope roofline calculation follows).
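
As a companion to the figure, the following is an illustrative roofline calculation for ENNS over fp32 embeddings. The 2 FLOPs per 4-byte element and the batch-reuse argument follow directly from the dot-product structure; the peak bandwidth and compute numbers are placeholder values for a modern server socket, not measurements from the paper.

```python
# Illustrative roofline arithmetic for ENNS (not the paper's exact numbers).
# Each fp32 corpus element (4 bytes) streamed from DRAM contributes one multiply
# and one add (2 FLOPs) per query vector in the batch; the query itself stays in cache.
def enns_operational_intensity(batch_size, bytes_per_elem=4, flops_per_elem=2):
    return batch_size * flops_per_elem / bytes_per_elem   # FLOPs per DRAM byte

def attainable_gflops(oi, peak_bw_gbs=300.0, peak_compute_gflops=3000.0):
    # Roofline: performance is capped by the sloped bandwidth line (BW * OI)
    # or by the flat compute ceiling, whichever is lower.
    return min(peak_bw_gbs * oi, peak_compute_gflops)

for b in (1, 16):
    oi = enns_operational_intensity(b)
    print(f"batch {b:2d}: OI = {oi:4.1f} FLOPs/byte, "
          f"attainable ≈ {attainable_gflops(oi):.0f} GFLOP/s")
# With batch 1 (OI = 0.5) the workload sits far to the left of the ridge point,
# so DRAM bandwidth, not FLOPs, bounds ENNS throughput.
```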

6.1.4. Effectiveness and Scalability of IKS Retrieval

The paper then presents the performance of IKS for ENNS retrieval.

The following figure (Figure 9 from the original paper) compares the ENNS retrieval time for CPU, AMX, GPU (1, 2, 4, and 8 devices), and IKS (1 and 4 devices) for various corpus sizes:

Fig. 9. Comparison of ENNS retrieval time for CPU, AMX, GPU (1, 2, 4, and 8 devices), and IKS (1 and 4 devices) for various corpus sizes. The absence of bars in specific GPU and IKS configurations indicates that the corpus exceeds the capacity of the accelerator memory. The Y-axis is in log-scale.

Analysis of Figure 9:

  • IKS vs. CPU/AMX: IKS significantly outperforms CPU and AMX for all corpus sizes and batch sizes. For a 512GB corpus, IKS achieves 13.4–27.9× faster ENNS than Intel Sapphire Rapids CPUs.

  • IKS vs. GPU (Single Device): Counterintuitively, a single IKS unit outperforms a single H100 GPU for a 50GB corpus by 2.6× (batch 1) and 4.6× (batch 16). This is attributed to:

    1. Specialized Top-K Units: IKS has dedicated Top-K units, which are more efficient than GPU's general-purpose compute for this specific operation.
    2. Compute Needed to Saturate HBM Bandwidth: A GPU needs many streaming multiprocessors and tensor cores issuing memory accesses in parallel to saturate its HBM bandwidth, which the memory-bound ENNS workload often fails to do. IKS is designed to balance compute and memory bandwidth specifically for this task.
  • Scalability (Multi-GPU vs. Multi-IKS):

    • GPU Scaling: GPU retrieval time for a 50GB corpus drops by 1.9×, 3.6×, and 6.9× with 2, 4, and 8 GPU devices, respectively. Each H100 GPU fits 80GB, so 8 H100s can hold up to 640GB.
    • IKS Scaling: IKS scales well in both capacity and performance. Four IKS units accommodate up to a 2TB corpus (versus 640GB for 8 H100 GPUs), and for a 50GB corpus, four units reduce retrieval time by 3.9× relative to a single unit.
    • Near-Perfect Weak Scaling: IKS demonstrates near-perfect weak scaling: retrieval over a 2TB corpus on 4 IKS units takes only 100 µs longer than over a 512GB corpus on 1 IKS unit. This highlights the low-overhead IKS-CPU interface and the highly parallelizable nature of ENNS.
  • Memory Capacity: The absence of bars for GPU at higher corpus sizes (e.g., 512GB for 1 GPU) and IKS at 2TB for 1 IKS unit indicates that the corpus size exceeds the memory capacity of those configurations.

    The following table (Table 3 from the original paper) reports the absolute time breakdown of ENNS retrieval on IKS:

| | 50 GB, Batch 1 | 50 GB, Batch 64 | 512 GB, Batch 1 | 512 GB, Batch 64 |
| --- | --- | --- | --- | --- |
| Write Query Vector | 0.3 µs | 1 µs | 0.3 µs | 1 µs |
| Dot-Product | 45.96 ms | 45.96 ms | 470.6 ms | 470.6 ms |
| Partial Top-32 Read | 0.7 µs | 22.4 µs | 0.7 µs | 22.4 µs |
| Top-K Aggregation | 19 µs | 540 µs | 23 µs | 390 µs |
| Total | 46.0 ms | 46.5 ms | 470.6 ms | 471.0 ms |

Table 3. Breakdown of ENNS latency on IKS.

Analysis of Table 3:

  • Dominance of Dot-Product: The dot-product computation and DRAM accesses (combined under "Dot-Product") consume the overwhelming majority of the ENNS retrieval time on IKS (e.g., 45.96 ms out of 46.0 ms for 50GB, batch 1). This is expected for a memory-bound workload accelerated near memory.
  • Negligible Offload Overheads: The overheads associated with the IKS-CPU interface are extremely low:
    • Write Query Vector (offload context transfer): 0.3–1 µs.
    • Partial Top-32 Read (results transfer): 0.7–22.4 µs.
    • Top-K Aggregation (on CPU): 19–540 µs (a host-side merge sketch follows this list).
  • Insensitivity to K: The retrieval time on IKS does not change with the value of K (up to 32) because IKS always computes and returns the top 32 similarity scores. The choice of how many of these to pass to the generative model is then made by the retriever model software.
  • Batch Size Impact: For dot-product, the time remains constant regardless of batch size (1 vs. 64) because IKS's NMA is designed to be fully utilized at max batch size and embedding vectors are reused across query vectors. The Top-K aggregation time increases with batch size as more partial lists need to be processed.
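
To make the host-side aggregation step concrete, the following is a minimal sketch of merging the partial top-32 lists returned by the NMAs into a global top-K. The function and data layout are illustrative assumptions, not the IKS software stack.

```python
import heapq

def aggregate_top_k(partial_lists, k):
    """Merge per-NMA partial results into a global top-k.

    partial_lists: list of per-NMA results, each a list of (score, doc_id)
                   pairs for one query (at most 32 entries per NMA in IKS).
    Returns the k highest-scoring (score, doc_id) pairs overall.
    """
    # heapq.nlargest over the concatenation is O(N log k) with N = 32 * num_NMAs,
    # which is tiny compared to the near-memory dot-product phase.
    return heapq.nlargest(k, (pair for plist in partial_lists for pair in plist))

# Example with two hypothetical NMA partial lists for one query:
nma0 = [(0.91, 17), (0.88, 4), (0.73, 99)]
nma1 = [(0.95, 123), (0.70, 8)]
print(aggregate_top_k([nma0, nma1], k=3))  # [(0.95, 123), (0.91, 17), (0.88, 4)]
```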

6.1.5. End-to-End Performance

The paper demonstrates the impact of IKS acceleration on end-to-end RAG inference time.

The following figure (Figure 8 from the original paper) compares end-to-end time-to-interactive under CPU-based and IKS-based ENNS retrieval across FiDT5, Llama-8B, and Llama-70B. The x-axis is the number of retrieved documents (K) and the y-axis is time-to-interactive in seconds; the figure shows that IKS retrieval sharply reduces time-to-interactive under both the 50GB and 512GB corpus configurations.

Fig. 8. Time-to-interactive of FiDT5, Llama-8B, and Llama-70B with CPU-based vs. IKS-based ENNS retrieval, for varying K and 50GB/512GB corpus sizes.

Analysis of Figure 8:

  • Dramatic Reduction in Retrieval Bottleneck: The graphs clearly show that when IKS is used for ENNS retrieval (green bars), the retrieval time portion of the end-to-end inference is drastically reduced, almost becoming negligible compared to the generation time. In contrast, CPU retrieval (blue bars) constitutes a major bottleneck, especially for larger corpus sizes (512GB).

  • Speedup Range: IKS provides substantial end-to-end inference time speedups:

    • FiDT5: 5.6× to 25.6×.
    • Llama-8B: 5.0× to 24.6×.
    • Llama-70B: 1.7× to 16.8×. The speedup varies with batch size, corpus size, and document count (K).
  • Enabling Real-time RAG: For large corpus sizes (e.g., 512GB) or large batch sizes, CPU retrieval leads to inference times of several seconds, which is unacceptable for user-facing applications. IKS brings these times down to hundreds of milliseconds or less, enabling real-time RAG.

  • Shifting Bottleneck: With IKS, the retrieval bottleneck is largely eliminated, and the generation phase (running on GPU) becomes the dominant factor in end-to-end latency.

    The following figure (Figure 11 from the original paper) compares the accuracy and throughput of FiDT5, Llama-8B, and Llama-70B for various configurations:

Fig. 11. Comparison of accuracy and throughput of FiDT5, Llama-8B, and Llama-70B for various configurations. ANNS-2 is an HNSW index with M, efConstruction, and efSearch of 32, 128, and 2048, respectively.

Analysis of Figure 11:

  • ANNS vs. ENNS (CPU): ANNS-2 configurations (yellow points) generally exhibit higher throughput than ENNS running on CPU (blue points) but come with lower generation accuracy. This reinforces the earlier finding that ANNS trades off accuracy for speed, and that achieving high accuracy with ANNS can be challenging.
  • IKS for Pareto-Superiority: RAG applications using IKS (green points) achieve significantly improved throughput while maintaining (or even slightly improving) generation accuracy compared to both CPU ENNS and ANNS variants.
    • This is because IKS's high-quality ENNS allows the generative model to receive a smaller but more accurate context (lower K), drastically reducing generation time and thus increasing end-to-end throughput at high accuracy levels. Retrieval is no longer a bottleneck.
  • Importance of High-Quality Retrieval: IKS demonstrates that investing in high-quality retrieval acceleration is key to unlocking the full potential of RAG in terms of both accuracy and throughput.

6.1.6. Power and Area Analysis

  • NMA Area: Each NMA chip (with 64 processing engines, query scratchpad, top-K unit, output scratchpad) occupies approximately 3.4 mm² in 16nm TSMC technology.
    • An additional 14 mm² is required for PHYs and memory controllers.
    • However, the minimum NMA chip area is determined by the shoreline needed for the eight LPDDR5X memory channels (20 mm) and PCIe PHYs (1 mm), totaling 21 mm of shoreline. A square die with a 21 mm perimeter has 5.25 mm sides, which necessitates an area of at least 27.56 mm² in 16nm.
    • The paper notes that this shoreline constraint leaves "free" area on the chip of up to 25 mm², which IKS utilizes by overprovisioning compute (64 PEs) to remain memory-bandwidth bound even at lower batch sizes.
  • IKS Power Consumption:
    • For batch size 1 and vector dimensions 1024:
      • Processing engines and query scratchpad accesses consume ~59 mW.
      • Accessing embedding vectors from LPDDR memory consumes ~4.35 W.
      • Total power for one IKS unit (8 NMAs) is 8 × (59 mW + 4.35 W) ≈ 35.2 W.
    • For batch size 64 (full utilization):
      • LPDDR access power remains constant (due to data reuse).
      • Processing engine power increases linearly.
      • Total power increases to ~65 W.
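
The arithmetic behind these figures can be reconstructed as follows. Only the 4 pJ/bit LPDDR access energy, the ~59 mW compute figure at batch 1, and the 8-NMA organization come from the text; the per-channel LPDDR5X bandwidth is an illustrative assumption chosen to be consistent with the reported ~4.35 W.

```python
# Rough reconstruction of the per-NMA power arithmetic (illustrative assumptions:
# eight LPDDR5X channels at ~17 GB/s each; only the 4 pJ/bit access energy and the
# per-NMA compute figure of ~59 mW at batch 1 are taken from the text above).
LPDDR_PJ_PER_BIT = 4.0          # pJ per bit accessed (from the paper's power model)
CHANNELS_PER_NMA = 8
GBPS_PER_CHANNEL = 17.0         # assumed LPDDR5X channel bandwidth
COMPUTE_W_BATCH1 = 0.059        # ~59 mW for PEs + query scratchpad at batch 1

dram_bw_bits = CHANNELS_PER_NMA * GBPS_PER_CHANNEL * 1e9 * 8     # bits per second
dram_power_w = dram_bw_bits * LPDDR_PJ_PER_BIT * 1e-12           # ~4.35 W per NMA

nma_power_batch1 = COMPUTE_W_BATCH1 + dram_power_w
iks_power_batch1 = 8 * nma_power_batch1                          # 8 NMAs per IKS unit
print(f"per-NMA ~{nma_power_batch1:.2f} W, per-IKS ~{iks_power_batch1:.1f} W")
# Matches the ~35 W figure above; at batch 64 the DRAM term stays fixed (data reuse)
# while the compute term grows roughly linearly, pushing the unit toward ~65 W.
```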

6.1.7. Cost and Power Comparison with GPU

  • Memory Cost: IKS uses LPDDR5X, while the HBM used in GPUs is estimated to be more than 3× more expensive per GB [54]. Since a single IKS unit carries 6.4× as much onboard memory as a single NVIDIA H100 GPU (512GB vs. 80GB), its total memory cost is still higher than the GPU's, even though it is much cheaper per GB:
    • Cost(H100 memory) ≈ 80 GB × Cost_HBM_per_GB.
    • Cost(IKS memory) ≈ 512 GB × Cost_LPDDR_per_GB < 512 GB × (Cost_HBM_per_GB / 3) ≈ 2.1 × Cost(H100 memory).
    • In other words, IKS pays roughly 2× the GPU's total memory cost for 6.4× the capacity; the higher absolute cost reflects the much larger capacity, not a higher cost per GB, which underpins the paper's cost-effective high-capacity argument.
  • Compute Unit Cost: A GPU (e.g., H100) has a die area of 826 mm². The total die area of all IKS NMAs is 220 mm² (8 × 27.56 mm²). Since chip production cost increases superlinearly with die area, an IKS unit (despite having 6.4× larger memory capacity) is expected to cost a fraction of a GPU. This makes IKS a highly cost-effective solution.

6.2. Ablation Studies / Parameter Analysis

The paper incorporates parameter analysis throughout Section 3:

  • Batch Size: The impact of batch size (1 for latency-critical, 16 for throughput-optimized) is evaluated across all RAG applications. While batch size does not impact generation accuracy, it significantly affects execution time and throughput. IKS is designed to be efficient across various batch sizes, achieving full NMA utilization at batch size 64.
  • Document Count (K): The number of documents K retrieved and fed to the LLM is a crucial parameter. The analysis in Figure 2 and Figure 3 demonstrates:
    • Increasing K can compensate for lower ANNS retrieval quality but drastically increases generation time and memory overhead (due to the KV cache).
    • High-quality ENNS (accelerated by IKS) allows smaller K values to achieve the same or higher generation accuracy, thus reducing generation time and improving end-to-end throughput.
  • ANNS Parameters (M, efConstruction, efSearch): For HNSW ANNS, M (number of outgoing edges per node), efConstruction (construction-time candidate list size), and efSearch (search-time candidate list size) were tuned (a minimal HNSW configuration sketch follows this list).
    • M = 32 and efConstruction = 128 were chosen to maximize retrieval accuracy while maintaining a reasonable graph size.

    • efSearch (2048 for ANNS-1, 10000 for ANNS-2) was varied to explore the search speed vs. accuracy trade-off. Lower efSearch gives faster search but lower accuracy; higher efSearch gives slower search but higher accuracy. The results show that trading retrieval quality for speed by using a small efSearch and then increasing KK often results in worse generation accuracy and end-to-end throughput than a higher-quality, slower ANNS (or ENNS).

      These analyses collectively emphasize that high-quality retrieval is not merely a component but a critical factor in RAG's overall performance, justifying IKS's focus on accelerating ENNS.
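
As a companion to the earlier ENNS sketch, the following shows how these HNSW parameters map onto a Faiss index; the corpus, dimensions, and chosen values mirror the configuration discussed above, and the data itself is a placeholder.

```python
import numpy as np
import faiss

d = 768
corpus = np.random.rand(100_000, d).astype("float32")   # placeholder embeddings

# HNSW index: M controls outgoing edges per node (graph density / memory),
# efConstruction the build-time candidate list, efSearch the query-time one.
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)  # M = 32
index.hnsw.efConstruction = 128
index.add(corpus)

index.hnsw.efSearch = 2048     # larger efSearch -> slower but higher-recall search
queries = np.random.rand(4, d).astype("float32")
scores, doc_ids = index.search(queries, 16)
```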

7. Conclusion & Reflections

7.1. Conclusion Summary

This work presents a comprehensive analysis of Retrieval-Augmented Generation (RAG) applications, identifying the retrieval phase as a critical bottleneck for accuracy, latency, and throughput. The authors empirically demonstrate that Exact Nearest Neighbor Search (ENNS), despite its computational cost, is often Pareto-superior for RAG systems requiring high generation accuracy, as it enables sending a smaller, more precise set of documents to the Large Language Model (LLM), thereby reducing generation time.

To address this, the paper introduces Intelligent Knowledge Store (IKS), a novel Type-2 CXL device designed for near-memory acceleration of ENNS. IKS features a scale-out architecture with near-memory accelerators (NMAs) integrated with LPDDR5X DRAM, and a cache-coherent interface over CXL.cache for efficient CPU-accelerator interaction. IKS achieves 13.4–27.9× faster ENNS over a 512GB vector database compared to Intel Sapphire Rapids CPUs, translating to 1.7–26.3× lower end-to-end inference time for representative RAG workloads. Beyond its acceleration capabilities, IKS functions as a memory expander, allowing its DRAM to be disaggregated and used by other applications, enhancing overall server resource utilization and cost-effectiveness compared to GPU-based solutions.

7.2. Limitations & Future Work

The authors highlight several limitations and propose future research directions:

  • Exhaustive Search Inefficiency: The current IKS performs an exhaustive search over the entire corpus, which consumes energy and constantly saturates memory bandwidth. This can potentially cause slowdowns for other applications using IKS as a general memory expander.
    • Future Work: Exploring early termination techniques for similarity search [9, 42] could reduce memory bandwidth utilization without compromising search accuracy.
  • Low NMA Utilization for Small Batch Sizes: For batch sizes less than 64, the NMA chip utilization is low. This is partly a design choice (overprovisioning compute to utilize "free" area dictated by PHY shoreline), but it leads to energy inefficiency.
    • Future Work: Implementing circuit-level techniques like clock and power gating to power off inactive processing engines for smaller batch sizes. Additionally, dynamic voltage and frequency scaling (DVFS) could be employed to reduce NMA frequency and voltage, potentially allowing multiple processing engines to process a single query vector for very small batch sizes.
  • Dataset Dependency for ANNS Comparison: While IKS provides dataset-independent high-quality search via ENNS, the paper acknowledges that if a dataset is highly amenable to clustering, ANNS might reduce the accuracy gap, making it more attractive. IKS's performance heavily relies on the sequential memory access pattern of ENNS, making integration of approximation techniques challenging.
    • Future Work: This implicitly suggests exploring hybrid approaches or IKS variants that could selectively incorporate some approximation if it doesn't disrupt the efficient ENNS memory access patterns too much.
  • Overhead of Host-side Aggregation: For deployments with a very large number of IKS units (beyond the four evaluated), the overhead of host-side final top-K aggregation could become a bottleneck.
  • Multi-Node Deployments: The current evaluation does not cover deployments of IKS spanning multiple nodes.

7.3. Personal Insights & Critique

This paper makes a compelling case for specialized hardware acceleration of Exact Nearest Neighbor Search within the RAG paradigm. The observation that ENNS can paradoxically lead to lower end-to-end inference time by enabling a smaller, higher-quality context for the LLM is a critical insight often overlooked when focusing solely on retrieval speed. This challenges the common assumption that faster approximate retrieval is always better.

The design of IKS is innovative in several aspects:

  • CXL.cache as a First-Class Interface: Leveraging CXL.cache for a cache-coherent doorbell and context buffers is an elegant solution to the CPU-accelerator communication bottleneck, offering superior performance over MMIO-like CXL.io and avoiding complex DMA programming. This sets a precedent for how CXL can truly enable shared memory processing.
  • Scale-Out NMA Architecture for Manufacturability: The pragmatic decision to use a scale-out NMA architecture, distributing logic across multiple chips due to shoreline constraints for LPDDR5X PHYs, is a smart engineering choice that balances performance, cost, and manufacturability.
  • LPDDR5X for Cost-Effectiveness: The choice of LPDDR5X over HBM demonstrates a clear focus on cost-effectiveness and power efficiency, recognizing that for vector databases (which are large and growing), memory cost is a dominant factor.

Potential Issues/Areas for Improvement:

  1. "Free Area" Justification: While the paper justifies overprovisioning compute on the NMA chip due to "free area" (constrained by PHY shoreline), this is a static design decision. The proposed future work on power gating and DVFS is crucial to mitigate power inefficiency for smaller batch sizes and fully capitalize on this "free" area.

  2. Dataset Sensitivity of ANNS: The argument against ANNS accelerators due to dataset dependence is strong, but some hybrid approach where IKS could, for example, quickly filter out a large portion of the search space using simple ANNS techniques before performing an ENNS on a reduced candidate set, might offer even greater efficiency for certain datasets, without necessarily breaking the sequential access pattern. This could be a more dynamic approach to early termination.

  3. Software Ecosystem: While the paper focuses on hardware, the long-term success of such a specialized accelerator depends heavily on its integration into the broader vector database software ecosystem. The minimalist hardware design relies on software for mapping and aggregation, which means strong driver and library support is essential.

  4. Reliability of LPDDR5X: The paper mentions that ENNS is resilient to bit flips in LPDDR5X. While this might hold for ENNS, if IKS is truly a memory expander for other applications, LPDDR5X's datacenter reliability (or lack thereof compared to ECC DDR) could become a concern for other data-sensitive workloads. Further research into in-line ECC for LPDDR5X in datacenter contexts for CXL memory could be valuable.

    Overall, IKS represents a significant step towards practical and scalable RAG acceleration. Its rigorous analysis of RAG bottlenecks and innovative hardware-software co-design using CXL offer valuable insights for the future of AI system architectures. The ability to act as a memory expander also makes it a more versatile and economically attractive solution than single-purpose accelerators.
