Accelerating Retrieval-Augmented Generation
TL;DR Summary
The paper explores Retrieval-Augmented Generation (RAG) to address hallucinations in large language models (LLMs). It introduces the Intelligent Knowledge Store (IKS), a near-memory acceleration architecture that speeds up exact retrieval by 13.4–27.9 times and reduces end-to-end inference time by 1.7–26.3 times for representative RAG applications.
Abstract
An evolving solution to address hallucination and enhance accuracy in large language models (LLMs) is Retrieval-Augmented Generation (RAG), which involves augmenting LLMs with information retrieved from an external knowledge source, such as the web. This paper profiles several RAG execution pipelines and demystifies the complex interplay between their retrieval and generation phases. We demonstrate that while exact retrieval schemes are expensive, they can reduce inference time compared to approximate retrieval variants because an exact retrieval model can send a smaller but more accurate list of documents to the generative model while maintaining the same end-to-end accuracy. This observation motivates the acceleration of the exact nearest neighbor search for RAG. In this work, we design Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators. IKS offers 13.4–27.9 × faster exact nearest neighbor search over a 512GB vector database compared with executing the search on Intel Sapphire Rapids CPUs. This higher search performance translates to 1.7–26.3 × lower end-to-end inference time for representative RAG applications. IKS is inherently a memory expander; its internal DRAM can be disaggregated and used for other applications running on the server to prevent DRAM – which is the most expensive component in today’s servers – from being stranded.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Accelerating Retrieval-Augmented Generation
1.2. Authors
- Derrick Quinn (Cornell University, Ithaca, NY, USA)
- Mohammad Nouri (Cornell University, Ithaca, NY, USA)
- Neel Patel (Cornell University, Ithaca, NY, USA)
- John Salihu (University of Kansas, Lawrence, KS, USA)
- Alireza Salemi (University of Massachusetts Amherst, Amherst, MA, USA)
- Sukhan Lee (Samsung Electronics, Hwasung, Republic of Korea)
- Hamed Zamani (University of Massachusetts Amherst, Amherst, MA, USA)
- Mohammad Alian (Cornell University, Ithaca, NY, USA)
1.3. Journal/Conference
Published in the Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (ASPLOS '25), March 30-April 3, 2025, Rotterdam, Netherlands.
ASPLOS is a highly reputable and influential conference in the field of computer architecture, programming languages, and operating systems. Publication here signifies significant contributions and rigor in system design and performance.
1.4. Publication Year
2025
1.5. Abstract
The paper addresses the issue of hallucinations and accuracy in large language models (LLMs) by focusing on Retrieval-Augmented Generation (RAG). RAG enhances LLMs by retrieving information from external knowledge sources. The authors profile various RAG execution pipelines, elucidating the interplay between retrieval and generation. They demonstrate that while Exact Nearest Neighbor Search (ENNS) is computationally expensive, it reduces inference time compared to Approximate Nearest Neighbor Search (ANNS) variants by providing a smaller, more accurate document list to the generative model, thus maintaining end-to-end accuracy. This observation motivates accelerating ENNS.
The paper introduces Intelligent Knowledge Store (IKS), a Type-2 CXL device. IKS implements a scale-out near-memory acceleration architecture featuring a novel cache-coherent interface between the host CPU and near-memory accelerators (NMAs). IKS achieves faster ENNS over a 512GB vector database compared to Intel Sapphire Rapids CPUs. This superior search performance results in lower end-to-end inference time for representative RAG applications. Additionally, IKS functions as a memory expander, allowing its internal DRAM to be disaggregated and used by other applications, preventing DRAM from being stranded.
1.6. Original Source Link
/files/papers/692998d14241c84d8510f9f7/paper.pdf (This link indicates an internal file path, suggesting it might be a link within a specific system or a preprint server, as the publication date is in 2025. It is likely a pre-print or accepted version for ASPLOS '25.)
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the inherent limitations of Large Language Models (LLMs), specifically their susceptibility to hallucinations (generating factually incorrect or nonsensical information) and the static nature of their parametric knowledge. LLMs "memorize" information from their training corpora, which becomes outdated in rapidly evolving domains and struggles with less frequent entities or real-time information. Retraining or fine-tuning LLMs to update this knowledge is prohibitively expensive.
Retrieval-Augmented Generation (RAG) emerges as a crucial solution to these challenges. RAG enhances LLMs by augmenting them with dynamic, non-parametric knowledge retrieved from external, up-to-date sources like the web or vast document corpora. This approach significantly improves LLM accuracy and reduces hallucinations.
However, the effectiveness of RAG critically depends on the quality and speed of the retrieval phase. State-of-the-art retrieval methods, known as dense retrieval, encode queries and documents into high-dimensional embedding vectors and store them in a vector database. Retrieving relevant documents then involves a K-nearest neighbor (KNN) search. Exact Nearest Neighbor Search (ENNS), which guarantees the highest retrieval quality, is computationally intensive and memory bandwidth-limited, often consuming the dominant share of the total end-to-end inference time in RAG applications (up to 99% for the largest corpus profiled). While Approximate Nearest Neighbor Search (ANNS) offers faster search, the paper's experiments show that ANNS often requires retrieving significantly more documents (larger K) to match the accuracy of ENNS, negating performance gains and increasing the generation time of the LLM. This creates a bottleneck where high-quality retrieval is essential for RAG's accuracy but is currently too slow and expensive on commodity hardware, especially on CPUs and even on GPUs for very large vector databases.
The paper's entry point is this bottleneck: given that high-quality (exact) retrieval is critical for RAG's overall performance and accuracy, there is a strong motivation to accelerate ENNS efficiently and cost-effectively.
2.2. Main Contributions / Findings
The paper makes several significant contributions to addressing the RAG retrieval bottleneck:
- Demystification and Profiling of RAG Pipelines: The authors provide an extensive profiling of RAG execution pipelines, analyzing the interplay between hardware and software configurations. They demonstrate that RAG applications require high-quality retrieval for effective performance, and that high-quality retrieval (whether ENNS or slow ANNS) constitutes a significant portion of the end-to-end runtime, bottlenecking RAG.
- Motivation for ENNS Acceleration: The paper empirically shows that ENNS can reduce overall RAG inference time compared to ANNS variants because its higher accuracy allows the generative model to receive a smaller, more precise list of documents while maintaining the same end-to-end accuracy. This observation strongly motivates dedicated ENNS acceleration.
- Design of Intelligent Knowledge Store (IKS): The paper introduces IKS, a novel, cost-optimized, and purpose-built Type-2 CXL memory expander that functions as a high-performance, high-capacity vector database accelerator. IKS offloads the memory-intensive dot-product operations of ENNS to a distributed array of low-profile accelerators placed near LPDDR5X DRAM packages.
- Novel Cache-Coherent Interface: IKS implements an innovative interface built on top of the CXL.cache protocol. This interface facilitates seamless and efficient offloading of exact vector database search operations from the host CPU to the near-memory accelerators (NMAs), eliminating the need for DMA setup, buffer management, interrupts, or polling.
- Scale-Out Near-Memory Acceleration Architecture: IKS employs a minimalist scale-out near-memory accelerator architecture where NMA logic is distributed across multiple chips, each connected to an LPDDR5X package. This design optimizes for area, yield, and aggregate memory bandwidth.
- Significant Performance Gains: IKS achieves 13.4–27.9× faster ENNS for a 512GB knowledge store compared to Intel Sapphire Rapids CPUs, which translates to a 1.7–26.3× reduction in end-to-end inference time for representative RAG applications (FiDT5, Llama-8B, Llama-70B).
- Memory Disaggregation and Cost-Effectiveness: IKS is inherently a memory expander. Its internal DRAM can be disaggregated and shared with other co-running applications via the CXL.mem and CXL.cache protocols, preventing expensive DRAM from being stranded and improving server resource utilization. The design is also shown to be more cost-effective and power-efficient than GPU-based solutions for ENNS.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
Large Language Models (LLMs)
LLMs are advanced neural networks, typically based on the Transformer architecture, trained on massive datasets of text and code. They excel at understanding, generating, and translating human-like text.
- Parametric Knowledge: The knowledge acquired by LLMs during training is stored in their millions or billions of internal parameters (weights and biases). This knowledge is static and can only be updated by retraining or fine-tuning the entire model, which is a costly and time-consuming process.
- Hallucination: A critical issue where LLMs generate plausible-sounding but factually incorrect or nonsensical information, often due to limitations in their parametric knowledge or misunderstanding of context.
Retrieval-Augmented Generation (RAG)
RAG is a technique that combines LLMs with external information retrieval systems. Instead of relying solely on their internal (parametric) knowledge, RAG models first search for relevant documents or passages from a knowledge source (e.g., a vector database of Wikipedia articles, web pages) and then use this retrieved information to inform their text generation.
- Components:
  - Retrieval Model (Retriever): Responsible for searching the knowledge source for relevant items based on the user's query. It typically converts the query into an embedding vector and searches for similar document embedding vectors.
  - Generative Model (Generator): An LLM that receives the user's query and the retrieved documents. It then synthesizes a response based on this augmented input.
- Benefits: Reduces hallucination, provides up-to-date information, increases factual accuracy, and makes LLMs more grounded in verifiable sources.
Dense Retrieval
A modern information retrieval technique used in RAG.
- Bi-encoder Neural Networks: Dense retrieval models use two separate neural networks (encoders): a query encoder and a document encoder.
- Embedding Vectors: Both encoders map queries and documents into a shared high-dimensional vector space. The outputs of these encoders are called embedding vectors (or embeddings).
- Similarity Score: The similarity between a query $q$ and a document $d$ is typically calculated as the dot product (or cosine similarity) of their respective embedding vectors, $s(q, d) = E_Q(q) \cdot E_D(d)$, where $E_Q$ and $E_D$ denote the query and document encoders. A higher score indicates greater relevance.
- Vector Database: A specialized database designed to efficiently store and query embedding vectors. It allows for fast similarity searches.
K-Nearest Neighbor (KNN) Search
After converting queries and documents into embedding vectors, retrieving the most relevant documents involves finding the embedding vectors in the vector database that are "closest" (most similar) to the query embedding vector. KNN search finds these neighbors.
- Exact Nearest Neighbor Search (ENNS): This method exhaustively compares the query vector with every single embedding vector in the vector database to find the closest ones. It guarantees perfect retrieval accuracy but is computationally expensive and memory bandwidth-bound, especially for large databases (a minimal code sketch follows this list).
- Approximate Nearest Neighbor Search (ANNS): This method employs various data structures and algorithms (e.g., Product Quantization, IVFPQ, HNSW) to reduce the search space and find neighbors that are approximately the closest. ANNS trades off a small amount of retrieval accuracy for significantly faster search times. However, maintaining high accuracy with ANNS often requires more complex configurations or may still be slower than desired.
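For concreteness, here is a minimal NumPy sketch (not from the paper) of brute-force ENNS using inner-product similarity; the corpus size, dimension, and K are arbitrary placeholders:

```python
import numpy as np

def enns_topk(corpus_emb, query_emb, k):
    """Exhaustive (exact) K-nearest-neighbor search by inner product.

    corpus_emb: (N, VD) array of document embedding vectors
    query_emb : (VD,)   query embedding vector
    Returns the indices of the k highest-scoring documents and their scores.
    """
    scores = corpus_emb @ query_emb              # one dot product per document
    cand = np.argpartition(-scores, k)[:k]       # unordered top-k candidates
    order = cand[np.argsort(-scores[cand])]      # sort candidates by score, descending
    return order, scores[order]

# Tiny usage example with random data
corpus = np.random.rand(10_000, 768).astype(np.float32)
query = np.random.rand(768).astype(np.float32)
ids, scores = enns_topk(corpus, query, k=16)
```

Every document embedding is streamed exactly once per query, which is what makes the operation memory-bandwidth-bound rather than compute-bound.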
CXL (Compute Express Link)
CXL is an open industry standard interconnect that provides high-bandwidth, low-latency connectivity between a host processor and accelerators or memory devices. It operates over the PCIe (Peripheral Component Interconnect Express) physical and electrical interface but adds crucial features for coherency and memory semantics.
- CXL.mem (Memory Protocol): Allows a CXL device to act as a memory expander, providing additional memory capacity to the host CPU. The host can access this memory directly.
- CXL.cache (Caching Protocol): Enables CXL devices (like accelerators) to cache host CPU memory and participate in the CPU's cache coherency domain. This is critical for shared memory models, allowing the CPU and accelerator to operate on the same data without explicit DMA transfers while ensuring data consistency.
- CXL Type 2 Device: A CXL device that supports both the CXL.mem and CXL.cache protocols. Such a device can act as a memory expander and also contain accelerator logic that can cache CPU memory, offering a shared address space between the host CPU and the accelerator. IKS is this type of device.
LPDDR5X (Low-Power Double Data Rate 5X)
LPDDR5X is a type of synchronous dynamic random-access memory (SDRAM) primarily designed for mobile and embedded systems, emphasizing low power consumption and high bandwidth.
- Characteristics: It offers higher per-pin bandwidth (8533 Mbps) than standard DDR5 (7200 MT/s) and is more power-efficient due to shorter interconnections. It is typically integrated near a system-on-chip (SoC).
- IKS Usage: IKS utilizes LPDDR5X for its internal memory to store embedding vectors due to its bandwidth and power efficiency, despite traditional reliability concerns in datacenter settings (which the paper argues are less critical for ENNS because of its resilience to bit flips).
SIMD (Single Instruction, Multiple Data)
A class of parallel computing where a single instruction operates on multiple data points simultaneously. This is often implemented through specialized CPU extensions (e.g., AVX-512, Intel AMX) or in GPUs. ENNS is highly data-parallel and benefits greatly from SIMD operations (e.g., vector dot products).
GPU (Graphics Processing Unit)
A specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer for output to a display device. Modern GPUs are highly parallel processors capable of performing a large number of computations simultaneously, making them suitable for deep learning workloads.
- HBM (High-Bandwidth Memory): A type of high-performance RAM used with GPUs and some FPGAs. It offers very high memory bandwidth but is significantly more expensive than DDR or LPDDR memories.
- Limitations for ENNS: While GPUs can accelerate ENNS, they often have excess compute capacity relative to their memory bandwidth for memory-bound workloads like ENNS, leading to poor utilization. Their HBM is also very expensive, and offloading large vector databases to GPU memory can require multiple GPUs, increasing cost.
Intel AMX (Advanced Matrix Extensions)
A set of new CPU extensions introduced in Intel Xeon Scalable Processors (e.g., Sapphire Rapids). AMX is designed to accelerate matrix multiplication and deep learning workloads, particularly with bfloat16 data types. It provides specialized hardware units (tiles) for efficient tensor operations.
Roofline Model
A graphical model used in computer architecture to characterize the performance of a computation based on its operational intensity (floating-point operations per byte accessed from memory) and the hardware's peak performance (FLOPS) and peak memory bandwidth. It helps identify whether a workload is compute-bound (limited by processing speed) or memory-bound (limited by data transfer speed).
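As an illustrative back-of-the-envelope application of the roofline model (my own arithmetic, not a figure from the paper), consider an ENNS scan over fp16 embeddings with no data reuse:

```latex
% Illustrative operational-intensity estimate for an ENNS scan over fp16 embeddings (no data reuse)
\begin{aligned}
\text{work per element}      &= 2~\text{FLOPs (one multiply, one add)} \\
\text{traffic per element}   &= 2~\text{bytes (one fp16 value streamed from DRAM)} \\
\text{operational intensity} &= \frac{2~\text{FLOPs}}{2~\text{bytes}} = 1~\text{FLOP/byte}
\end{aligned}
```

At roughly 1 FLOP/byte, attainable throughput is capped by memory bandwidth (e.g., about 256 GFLOP/s on the 256 GB/s memory system of the baseline CPU described in Section 5.3), far below the peak compute of a modern CPU or GPU, which is why the paper characterizes ENNS as memory-bound.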
Transformer Networks
The foundational architecture for most modern LLMs. Transformers rely heavily on the self-attention mechanism to weigh the importance of different parts of the input sequence when processing each element. They consist of an encoder stack and a decoder stack.
BERT (Bidirectional Encoder Representations from Transformers)
A popular Transformer-based language model pre-trained by Google. BERT is often used as an encoder for generating embedding vectors for text due to its strong contextual understanding.
T5 (Text-to-Text Transfer Transformer)
Another Transformer-based model by Google, framed as a text-to-text problem, meaning every NLP task is converted into a text generation task. T5 can be used as a generative model in RAG. Fusion-in-Decoder (FiD) is a variant where multiple retrieved documents are encoded independently and then their representations are concatenated before being passed to the decoder.
Llama (Large Language Model Meta AI)
A family of LLMs developed by Meta AI. These models are known for their strong performance and open-source availability. The paper uses 4-bit-Quantized Llama-3-8B-Instruct and Llama-3-70B-Instruct as generative models.
Faiss (Facebook AI Similarity Search)
An open-source library developed by Meta for efficient similarity search and clustering of dense vectors. It provides implementations of many ENNS and ANNS algorithms and is widely used in vector database applications.
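A minimal Faiss sketch contrasting exact search with HNSW-based approximate search; the dataset, dimension, and parameter values below are illustrative placeholders, not the paper's exact configuration:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768                                             # e.g., BERT-base embedding dimension
xb = np.random.rand(20_000, d).astype("float32")    # document embeddings
xq = np.random.rand(4, d).astype("float32")         # a small batch of query embeddings

# ENNS: exhaustive inner-product search (exact top-K)
exact = faiss.IndexFlatIP(d)
exact.add(xb)
D_exact, I_exact = exact.search(xq, 16)             # top-16 document ids per query

# ANNS: HNSW graph index; larger efSearch trades speed for recall
approx = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
approx.hnsw.efConstruction = 200
approx.add(xb)
approx.hnsw.efSearch = 2048
D_approx, I_approx = approx.search(xq, 16)
```

Comparing I_exact and I_approx per query gives exactly the recall-versus-speed trade-off the paper profiles.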
3.2. Previous Works
The paper builds upon a rich body of work in LLMs, RAG, and nearest neighbor search acceleration.
LLM Limitations
- Static Knowledge & Costly Updates: The issue of LLMs having static knowledge that is expensive to update is well-documented [108]. Early LLMs relied solely on their parametric knowledge, which became outdated.
- Hallucinations: The problem of LLM hallucinations is a major research area [83, 103], highlighting the need for external grounding.
- Limited Memorization & Temporal Degradation: LLMs have shown limited memorization of less frequent entities [29] and can exhibit temporal degradation (performance decline over time as information becomes stale) [31].
Retrieval-Augmented Generation (RAG)
- Foundational RAG Models: The concept of RAG gained significant traction with works like "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Lewis et al. (2020) [41] and "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering" by Izacard and Grave (2021) [26], both from Meta AI. These works established the two-phase retriever-generator pipeline that the current paper's FiDT5 and Llama implementations closely follow.
- Dense Retrieval: The effectiveness of dense retrieval using bi-encoder neural networks for RAG has been demonstrated across various modalities [19, 25, 26, 41, 71, 74], notably by Karpukhin et al. (2020) [30] with Dense Passage Retrieval (DPR). The paper's focus on dense retrieval reflects this trend.
- RAG Applications: RAG has been successfully applied to numerous NLP tasks, including dialogue generation [2, 8, 10, 83], machine translation [18, 21, 102], question answering [25, 26, 41, 66, 67, 76, 78], summarization [58, 64], code generation [20, 49], and multi-modal tasks [11, 12, 15, 19, 70, 71].
Nearest Neighbor Search Algorithms
- ENNS: Exhaustive search is the baseline for KNN.
- ANNS Algorithms:
  - Product Quantization (PQ) [28]: A technique to compress embedding vectors and speed up search by breaking vectors into subvectors and quantizing them.
  - Inverted File with Product Quantization (IVFPQ) [7]: Combines PQ with an inverted file index (as in traditional information retrieval) to narrow the search space to a subset of clusters.
  - Hierarchical Navigable Small World (HNSW) [52]: A graph-based ANNS algorithm that constructs a multi-layered graph through which searches efficiently navigate to find neighbors. The paper uses HNSW as its representative ANNS algorithm for comparison.
- Challenges with High-Quality ANNS: Prior work has shown that achieving high-quality ANNS can be nearly as slow as ENNS [45, 94], and GPUs are not always effective for complex ANNS algorithms like IVFPQ and HNSW [27].
Near-Memory Processing (NMP) and Accelerators for Search
- DIMM-based NMP: Solutions like AxDIMM [33, 35] and others [4, 62, 111] have explored NMP by integrating computation into DDR DIMMs.
  - Limitation: These approaches often require sophisticated mechanisms for a shared address space [61], limit per-rank memory capacity, and may compromise host CPU memory capacity when used as accelerators. They also face compute and thermal capacity limitations.
- Specialized ANNS Accelerators:
  - ANNA [40]: A specialized architecture for PQ-based ANNS.
  - NDSearch [95]: An accelerator for graph-based ANNS (like HNSW).
  - Limitation: The paper argues these are highly task-specific due to the complex algorithms and memory access patterns of ANNS, making them less generalizable, as different corpora might be better suited to different ANNS algorithms. ENNS, being simpler, is more amenable to general acceleration.
- Computational CXL Memory: Sim et al. (2023) [84] explored computational CXL memory for ENNS acceleration.
  - Limitation: This work used DDR DRAM, which may not meet the power and bandwidth requirements of the large corpus sizes relevant to RAG.
- CXL-PNM (CXL-based Processing Near Memory): Park et al. (2024) [57] presented an LPDDR-based CXL-PNM platform for Transformer-based LLMs. IKS shares some foundational ideas but differs in its scale-out NMA architecture and cache-coherent interface design.
3.3. Technological Evolution
The evolution of LLMs began with models relying purely on their internal parametric knowledge, leading to issues like hallucinations and stale information. This spurred the development of Retrieval-Augmented Generation (RAG) systems, which dynamically fetch external, up-to-date information to augment LLM responses.
Initially, RAG research focused on various retrieval models, with dense retrieval becoming prominent for its effectiveness. However, the dense retrieval phase, particularly Exact Nearest Neighbor Search (ENNS), quickly emerged as a performance bottleneck due to its memory-bandwidth intensive nature and the rapidly growing size of vector databases. While Approximate Nearest Neighbor Search (ANNS) offered speed, it often compromised retrieval quality, forcing LLMs to process more documents, negating retrieval gains and increasing generation time.
This paper's work (IKS) fits into the timeline by addressing this critical retrieval bottleneck. It represents a shift towards specialized hardware acceleration for ENNS, recognizing its importance for RAG's overall accuracy and latency. By leveraging CXL technology, IKS pushes ENNS computation directly near memory, aiming for both high performance and cost-effectiveness, moving beyond general-purpose CPUs and GPUs which are often inefficient for this specific memory-bound workload. It also differentiates itself from task-specific ANNS accelerators by focusing on the more generalizable ENNS.
3.4. Differentiation Analysis
Compared to main methods in related work, IKS introduces several core differences and innovations:
- Focus on ENNS over ANNS for Acceleration: Unlike prior ANNS accelerators (e.g., ANNA [40] for PQ, NDSearch [95] for graph-based ANNS), IKS specifically targets ENNS. The authors argue that ANNS accelerators are highly task-specific and dataset-dependent, while ENNS offers universal high quality and simpler algorithms amenable to more general acceleration. Their empirical findings confirm that ENNS often leads to Pareto-superior configurations for RAG applications requiring high accuracy.
- CXL Type 2 Device for Shared Address Space: Instead of DIMM-based NMP (e.g., AxDIMM [33]), which often struggles with shared address spaces and capacity limitations, IKS is a Type-2 CXL device. This enables a cache-coherent shared address space between the host CPU and NMAs, eliminating DMA overheads and simplifying data management. It also scales memory capacity independently.
- Novel Cache-Coherent Interface for Offload: IKS designs a unique CPU-accelerator interface atop CXL.cache. This interface uses memory-mapped context buffers and a coherent doorbell register for low-overhead offloading and notification (leveraging umwait()), avoiding the expensive interrupt or polling mechanisms common in PCIe-based offloads.
- Scale-Out Near-Memory Acceleration Architecture: IKS distributes NMA logic across multiple low-profile chips, each adjacent to an LPDDR5X package. This scale-out design, driven by physical manufacturing constraints (reticle limits, shoreline for PHYs), allows for high aggregate bandwidth, improved yield, and better thermal management compared to a monolithic large accelerator.
- Cost-Effective and Power-Efficient Memory: IKS utilizes LPDDR5X DRAM instead of HBM (used in GPUs) or DDR (used in some CXL NMP designs). LPDDR5X offers higher bandwidth than DDR5 and is more power-efficient and cost-effective than HBM, making it suitable for high-capacity vector databases.
- Hardware/Software Co-Design: IKS implements a minimalist hardware architecture, relying on software to manage data mapping and the final top-K aggregation. This approach simplifies the NMA design while still achieving high performance.
- Memory Disaggregation: IKS is explicitly designed as a memory expander that can disaggregate its internal DRAM for other applications, preventing memory stranding, a common issue with dedicated accelerators.
4. Methodology
4.1. Principles
The core principle behind Intelligent Knowledge Store (IKS) is to directly address the memory bandwidth-bound nature of Exact Nearest Neighbor Search (ENNS) in RAG applications by moving computation near memory in a cost-effective and scalable manner. IKS leverages the Compute Express Link (CXL) Type 2 specification to create a compute-enabled memory expander. This allows the host CPU and near-memory accelerators (NMAs) to share a cache-coherent address space, enabling low-overhead offloading of vector similarity search operations.
The theoretical basis and intuition are rooted in observing that ENNS operations have no data reuse for pairwise similarity calculations, consist of simple vector-vector dot-products with top-K logic, exhibit regular memory access patterns, and are highly parallelizable. These characteristics make ENNS ideal for near-memory acceleration because:
- Cache Inefficiency: Deep cache hierarchies are not beneficial for ENNS due to the lack of data reuse, and can even introduce overheads.
- Low-Overhead Coherency: Limited data reuse over large datasets simplifies software-managed cache coherency between the host CPU and the NMAs.
- Predictable Access: Regular memory access patterns enable coarse-grain virtual-to-physical address translation on the NMAs.
- Distributed Parallelism: ENNS can be efficiently offloaded to a distributed array of NMAs, each processing a shard of the corpus data in parallel, with a simple top-K aggregation step at the end.

IKS co-designs hardware and software to implement a minimalist scale-out near-memory accelerator architecture that is exposed as a memory expander, preventing DRAM stranding and providing cost-effective high capacity.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. IKS System Integration and Architecture Overview
The Intelligent Knowledge Store (IKS) is designed as a Type-2 CXL device, meaning it supports both CXL.mem (memory protocol) and CXL.cache (caching protocol). This allows IKS to function both as a regular memory expander and as a high-performance vector database accelerator.
The following figure (Figure 5 from the original paper) provides an overview of the IKS architecture and its integration:

The figure is a schematic of the near-memory accelerator (NMA) internals, showing the components of processing engines 0 through 63, the structure of the dot-product and Top-K units, the connection between the memory controller and the query buffers, and the data flow through the MAC units.
Fig. 5. (a) IKS internal DRAM, scratchpad spaces, and configuration registers are mapped to the host address space. The scratchpad and configuration register address ranges are labeled as Context Buffers (CB). (b) IKS is a compute-enabled CXL memory expander that includes eight LPDDR5X packages with one near-memory accelerator (NMA) chip near each package. (c) Each NMA includes 64 processing engines. (d) Dot-product units reuse the query vector (QV) dimension across 68 MAC units.
System Address Space (Figure 5a):
IKS's internal memory (LPDDR5X DRAM), scratchpad spaces (for queries and results), and configuration registers are all mapped into the host CPU's address space. This creates a unified address space where both the CPU and IKS can cache addresses, simplifying interaction and data sharing. The scratchpad and configuration register ranges are specifically labeled as Context Buffers (CB).
IKS Architecture (Figure 5b):
IKS employs a scale-out near-memory processing architecture. It features:
- Eight LPDDR5X Packages: These DRAM packages provide high-capacity, high-bandwidth, and low-power memory for storing the embedding vectors of the vector database. The total capacity of a single IKS unit is 512 GB (assuming 16-bit floating-point embeddings).
- Eight Near-Memory Accelerators (NMAs): Each LPDDR5X package is directly connected to a dedicated NMA chip. Each NMA integrates both the LPDDR5X memory controllers and the accelerator logic for ENNS.
- CXL Controller: A central CXL controller on the IKS card connects to the host CPU via a CXL link (over PCIe). Each NMA connects to this CXL controller via a PCIe 5.0 uplink, providing 8 GBps per NMA. This scale-out design (distributing NMA logic over multiple smaller chips) is chosen to manage chip area, improve manufacturing yield, and achieve a high aggregate chip shoreline for memory PHYs (an illustrative bandwidth calculation follows this list).
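To see why the compute must sit next to the memory, the aggregate internal bandwidth can be estimated from the figures quoted in this section and in Section 4.2.4 (an illustrative calculation, not a quote from the paper):

```latex
% Illustrative bandwidth arithmetic using the figures quoted for IKS
\begin{aligned}
\text{per-NMA DRAM bandwidth} &\approx 68~\text{MAC lanes} \times 2~\text{bytes} \times 1~\text{GHz} = 136~\text{GB/s} \\
\text{aggregate over 8 NMAs}  &\approx 8 \times 136~\text{GB/s} \approx 1.1~\text{TB/s} \\
\text{per-NMA uplink to the CXL controller} &= 8~\text{GB/s}
\end{aligned}
```

Streaming embeddings to the host over the 8 GBps uplinks would throttle the search by more than an order of magnitude, so the dot-products are performed beside each LPDDR5X package and only the small partial top-K lists cross the CXL link.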
4.2.2. Offload Model
The IKS offload model is designed for seamless and efficient userspace interaction without system calls or context switches.
- Data Storage: The host CPU stores embedding vectors in a specific block data layout (detailed in Section 4.2.5) within the IKS LPDDR5X DRAM. The actual human-readable documents corresponding to these embedding vectors are stored in the host's DDR or CXL memory.
- Offload API: The vector database application running on the host CPU initiates an ENNS offload by calling a blocking API: iks_search(query). This API hides the hardware interaction complexity.
- Cache Coherency: After any update operations to the vector database in IKS memory, the CPU flushes its caches to ensure the NMAs receive the most up-to-date data.
- Offload Context: The iks_search(query) API prepares an offload context. This context includes:
  - The query vectors (for batch processing).
  - The vector dimension (VD).
  - The base address of the first embedding vector stored in each LPDDR5X package.
- Initiating Offload: The offload context is written to memory-mapped regions called context buffers (part of the IKS address space, visible to the host CPU and the NMAs). The offload is then initiated by writing to a doorbell register.
- Notification: The host CPU then calls umwait() (a user-mode instruction for waiting) to monitor the doorbell register. This register is shared and kept cache-coherent via the CXL.cache protocol, allowing the CPU to efficiently wait for the NMAs to complete without busy-polling or interrupts.
- Aggregation: Once all NMAs complete their respective searches, they update the doorbell register, notifying the CPU. The CPU then executes an aggregation routine to combine the partial top-K lists from each NMA into a single, final top-K list (see the sketch after this list).
- Document Retrieval: The CPU uses the physical addresses of the top-K embedding vectors (which are known a priori because of the fixed data layout) to retrieve the actual documents from host memory.
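The sketch below illustrates the host-side aggregation routine described above (my own Python illustration, not the paper's code): each NMA returns a partial top-K list of (score, address) pairs, and the CPU merges them into a single global top-K.

```python
import heapq

def aggregate_topk(partial_lists, k):
    """Merge per-NMA partial top-K lists into the final global top-K.

    partial_lists: one list per NMA (eight for a fully populated IKS card),
                   each containing (similarity_score, embedding_vector_address) pairs.
    k            : number of global nearest neighbors to return.
    """
    candidates = [pair for plist in partial_lists for pair in plist]
    return heapq.nlargest(k, candidates)   # sorted by score, highest first

# Example with two hypothetical NMAs, each returning its local top-3
nma0 = [(0.91, 0x1000), (0.88, 0x2200), (0.80, 0x90)]
nma1 = [(0.95, 0x5000), (0.70, 0x6100), (0.65, 0x77)]
print(aggregate_topk([nma0, nma1], k=3))   # [(0.95, 20480), (0.91, 4096), (0.88, 8704)]
```

Because each NMA already reduced its shard to a short list, this final merge touches only a handful of entries and is negligible compared with the search itself.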
4.2.3. Cache Coherent Interface
The IKS design leverages the CXL.cache protocol to implement an efficient and low-overhead interface between NMAs and host CPU processes through shared memory.
The following figure (Figure 6 from the original paper) illustrates the transaction flow through the CXL.cache interface:

The figure is a schematic of the interface between the CPU and IKS over the cache-coherent CXL interconnect, detailing the steps for passing parameters and queries through the context buffers and the corresponding cache-coherence activity.
Fig. 6. CPU-IKS interface through cache coherent CXL interconnect.
- Host Writes Offload Context (Step 1): The host CPU writes the offload context (containing query vectors, vector dimensions, and base addresses) to predefined context buffer address ranges. These buffers are cacheable shared memory regions, and the CPU uses temporal writes to populate them efficiently.
- Host Writes Doorbell Register (Steps 2-3): After populating the context buffers, the host CPU writes a specific value to a doorbell register. This doorbell register is mapped to a cache line that is shared by both the NMAs and the host, ensuring coherency.
- Host Waits (Steps 3-4): Immediately after writing the doorbell, the host CPU executes the umwait() instruction, which puts the CPU core into a low-power, idle state while it monitors the doorbell register for changes. This provides fine-grained notification without costly interrupts or busy-waiting.
- NMA Polls Doorbell (Step 4): The NMAs continuously poll their local copy of the cache-coherent doorbell register. As soon as they detect a change (i.e., the host has written a new value), they recognize that an offload request has arrived, and the ENNS computation starts.
- NMA Reads Offload Context (Steps 5-6): The NMA reads the offload context from the IKS cache (which reflects the CPU's temporal writes via CXL.cache coherency) and loads it into its internal scratchpad.
- NMA Computation (Step 7): Each NMA performs its ENNS computation on its local shard of the vector database using its processing engines.
- NMA Updates Context Buffers (Steps 8-9): Upon completion, each NMA writes its partial top-K list (similarity scores and corresponding physical addresses of embedding vectors) back into the output scratchpad within the shared context buffer range.
- NMA Writes Doorbell (Steps 10-11): Finally, each NMA writes a completion signal to the doorbell register.
- Host Notified & Aggregates (Steps 11-12): The host CPU, which was waiting in umwait(), is notified of the change to the doorbell register. It then reads the partial top-K lists from the output scratchpads (which are also cache-coherent) and aggregates them into the final top-K result.

This cache-coherent interface significantly reduces offload overheads, achieving higher throughput than CXL.io-mimicking non-temporal writes (which resemble PCIe MMIO).
4.2.4. NMA Architecture
As detailed in Figure 5c and 5d, each Near-Memory Accelerator (NMA) chip within IKS is specifically designed for ENNS.
- Processing Engines (PEs): Each NMA includes 64 processing engines. These PEs enable parallel calculation of similarity scores for up to 64 query vectors simultaneously, supporting batch processing.
- Components per PE: Each processing engine is composed of:
  - Query Scratchpad: A small, fast SRAM buffer (2 KB in the current incarnation) that stores a query vector.
  - Dot-Product Unit: The computational core for similarity score calculation.
  - Top-K Unit: Hardware logic that maintains an ordered list of the top K similarity scores (K is fixed in hardware) and their corresponding embedding vector addresses.
  - Output Scratchpad: A buffer that stores the partial top-K list before it is read by the CPU.
- Central Control Unit: Each NMA has a central control unit that manages memory accesses and internal data movement, and activates processing engines based on the batch size of query vectors.
- Network-on-Chip (NoC): A fixed broadcast network connects the DRAM interface to all processing engines. This allows embedding vectors read from DRAM to be broadcast to multiple active PEs, facilitating data reuse when processing different query vectors in a batch.
- Dot-Product Unit Operation (Figure 5d), with a small functional model after this list:
  - Each dot-product unit contains 68 MAC (multiply-accumulate) units.
  - These MAC units operate at 1 GHz, providing 68 GFLOPS (16-bit floating-point) of compute throughput, sized to saturate the 136 GBps memory bandwidth of the LPDDR5X channels.
  - Parallel Calculation: In each clock cycle, 68 MAC operations are performed. For a given dimension j, dimension j of the PE's query vector (QV[PE][j], held in the query scratchpad) is reused across all 68 MAC units; MAC unit i (for i from 0 to 67) multiplies it by dimension j of embedding vector i (EV[i][j]), streamed from DRAM.
  - Vector Dimension Cycles: It takes VD (vector dimension) cycles for a dot-product unit to compute the similarity scores for a block of 68 embedding vectors.
  - Score Registers: Once a similarity score is computed, it is loaded into a score register and then streamed to the Top-K unit.
- Top-K Unit Operation (Figure 5d):
  - The Top-K unit maintains an ordered list of similarity scores (and corresponding addresses).
  - Each incoming similarity score is compared with the head (smallest score) of the ordered list. If the incoming score is larger, it is inserted into the list and the smallest entry is evicted.
  - This insertion process is serialized but is overlapped with the similarity score evaluations; because vector dimensions are typically much larger than 68, it stays off the critical path.
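To make the dataflow concrete, here is a small functional model of one processing engine (my own sketch, not the authors' RTL; the value of K is an assumption, since the hardware constant is not stated here): in cycle j, dimension j of the query vector is reused by all 68 MAC lanes, each accumulating against one of the 68 embedding vectors of the current block, and the finished scores then pass through a serialized top-K insertion.

```python
import heapq
import numpy as np

def pe_process_block(query, ev_block, topk_heap, base_idx, k=32):
    """Functional model of one processing engine handling one 68-vector block.

    query    : (VD,)    query vector held in the PE's query scratchpad
    ev_block : (68, VD) embedding vectors streamed from the LPDDR5X package
    topk_heap: min-heap of (score, vector_index) maintained by the top-K unit
    base_idx : index of the first embedding vector in this block
    k        : assumed hardware top-K capacity (illustrative value)
    """
    acc = np.zeros(68, dtype=np.float32)                 # one accumulator per MAC lane
    for j in range(query.shape[0]):                      # VD cycles per block
        acc += ev_block[:, j].astype(np.float32) * float(query[j])  # QV[PE][j] reused by all 68 MACs
    for lane, score in enumerate(acc):                   # serialized top-K insertion
        entry = (float(score), base_idx + lane)
        if len(topk_heap) < k:
            heapq.heappush(topk_heap, entry)
        elif entry[0] > topk_heap[0][0]:
            heapq.heapreplace(topk_heap, entry)          # evict the current smallest score
    return topk_heap
```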
4.2.5. Data Layout Inside DRAM and Query Scratchpad
The host CPU must adhere to specific data layouts when storing embedding vectors in IKS DRAM and populating query scratchpads.
Data Layout in LPDDR5X Package (Figure 7):
The following figure (Figure 7 from the original paper) illustrates the required data layout for embedding vectors within each LPDDR5X package:
| | Byte Offset | | | | |
| Physical Address | 0 | 2 | 4 | ... | 134 |
| B | EV[0][0] | EV[1][0] | EV[2][0] | ... | EV[67][0] |
| B + 0*136*VD + 136*1 | EV[0][1] | EV[1][1] | EV[2][1] | ... | EV[67][1] |
| B + 0*136*VD + 136*2 | EV[0][2] | EV[1][2] | EV[2][2] | ... | EV[67][2] |
| ... | ... | ... | ... | ... | |
| B + 0*136*VD + 136*(VD-1) | EV[0][VD-1] | EV[1][VD-1] | EV[2][VD-1] | ... | EV[67][VD-1] |
| B + 1*136*VD | EV[68][0] | EV[69][0] | EV[70][0] | ... | EV[135][0] |
| ... | ... | ... | ... | ... | |
| B + 1*136*VD + 136*(VD-1) | EV[68][VD-1] | EV[69][VD-1] | EV[70][VD-1] | ... | EV[135][VD-1] |
| ... | ... | ... | ... | ... | |
| B + (N-67)*136*VD | EV[N-68][0] | EV[N-67][0] | EV[N-66][0] | ... | EV[N-1][0] |
| B + (N-67)*136*VD + 136*1 | EV[N-68][1] | EV[N-67][1] | EV[N-66][1] | ... | EV[N-1][1] |
| ... | ... | ... | ... | ... | |
| B + (N-67)*136*VD + 136*(VD-1) | EV[N-68][VD-1] | EV[N-67][VD-1] | EV[N-66][VD-1] | ... | EV[N-1][VD-1] |
Fig. 7. Data layout inside each LPDDR5X package. The host CPU communicates the base address "B", the vector dimension "VD", and the number of vectors "N" to the NMAs for each offload. Four embedding vectors (EVs) are highlighted in this layout.
- Block-based Storage: Embedding vectors (EVs) are stored in contiguous blocks, each containing 68 vectors.
- Column-Major Order: Within each block, the embedding vectors are stored in column-major order. For a block of 68 vectors (e.g., EV[0] to EV[67]), dimension 0 of all 68 vectors is stored consecutively, followed by dimension 1 of all 68 vectors, and so on, up to dimension VD-1.
- Size per Block: Each embedding vector dimension is 2 bytes (16-bit floating point), so one dimension of 68 vectors occupies 68 × 2 = 136 bytes. A full block of 68 vectors, across all VD dimensions, occupies 136 × VD bytes.
- Efficient Batching: This column-major layout allows the NMA to read and process embedding vectors dimension-by-dimension for up to 68 vectors simultaneously. The NMA can access up to 136 bytes per cycle from the memory controller, comprising one element from each of 68 distinct embedding vectors.
- Parameters for NMA: The host CPU provides the base address ("B"), the vector dimension ("VD"), and the total number of vectors ("N") in the corpus shard to each NMA (a small address-calculation helper follows this list).
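Under these layouts the byte address of any element can be computed in closed form. The helpers below are an illustrative sketch consistent with Figure 7 (and with the query-scratchpad layout of Figure 8, described in the next subsection), not code from the paper:

```python
QS_STRIDE = 2048  # bytes reserved per query scratchpad (2 KB, enough for VD up to 1024)

def ev_byte_address(B, VD, i, j):
    """Byte address of dimension j of embedding vector i in an LPDDR5X package.

    Vectors are grouped into blocks of 68; within a block, storage is column-major:
    the 68 fp16 values of dimension j are contiguous (2 bytes each, 136 bytes per row).
    """
    block, lane = divmod(i, 68)               # which 68-vector block, and the lane within it
    return B + block * 136 * VD + j * 136 + lane * 2

def qv_byte_address(QS_B, pe, j):
    """Byte address of dimension j of the query vector assigned to processing engine `pe`
    (row-major layout of the query scratchpads, Figure 8)."""
    return QS_B + pe * QS_STRIDE + j * 2

# EV[2][1] with base address 0 lands at 136*1 + 2*2 = 140, matching Figure 7;
# dimension 0 of PE 3's query sits at QS_B + 3*2048, matching Figure 8.
assert ev_byte_address(0, 768, 2, 1) == 140
assert qv_byte_address(0, 3, 0) == 3 * 2048
```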
Data Layout in Query Scratchpads (Figure 8):
The following figure (Figure 8 from the original paper) illustrates the data layout for query vectors within the query scratchpads, which are mapped into the host memory address space at the query scratchpad base address "QS_B":
| (Batch Size = 1) | Byte Offset | ||||
| Physical Address | 0 | 2 | 4 | ... | 2*(VD-1) |
| QS_B | QV[0][0] | QV[0][1] | QV[0][2] | ... | QV[0][VD-1] |
| (Batch Size = 4) | Byte Offset | ||||
| Physical Address | 0 | 2 | 4 | ... | 2*(VD-1) |
| QS_B | QV[0][0] | QV[0][1] | QV[0][2] | ... | QV[0][VD-1] |
| QS_B + 2048 | QV[1][0] | QV[1][1] | QV[1][2] | ... | QV[1][VD-1] |
| QS_B + 2 * 2048 | QV[2][0] | QV[2][1] | QV[2][2] | ... | QV[2][VD-1] |
| QS_B + 3 * 2048 | QV[3][0] | QV[3][1] | QV[3][2] | ... | QV[3][VD-1] |
| (Batch Size = 64) | Byte Offset | ||||
| Physical Address | 0 | 2 | 4 | ... | 2*(VD-1) |
| QS_B | QV[0][0] | QV[0][1] | QV[0][2] | ... | QV[0][VD-1] |
| QS_B + 2048 | QV[1][0] | QV[1][1] | QV[1][2] | ... | QV[1][VD-1] |
| ... | ... | ... | ... | ... | |
| QS_B + 63 * 2048 | QV[63][0] | QV[63][1] | QV[63][2] | ... | QV[63][VD-1] |
Fig. 8. Data layout inside the query scratchpads, mapped to the host memory address space at the query scratchpad base address "QS_B". As the batch size increases, more query scratchpads are populated with distinct query vectors.
- Sequential Query Vectors: Query vectors (QVs) are stored at sequential memory addresses within the query scratchpads.
- Batching: As the batch size increases, more query scratchpads (one per processing engine) are populated with distinct query vectors. Each query vector is stored linearly, with its dimensions following one another.
- Size per Query Vector: Each query vector (with VD dimensions and 2 bytes per dimension) occupies 2 × VD bytes. The 2048-byte stride shown implies a maximum VD of 1024 for a 2 KB scratchpad (2 × 1024 = 2048 bytes).

This simplified data layout (for both embedding vectors and query vectors) simplifies address generation and the network-on-chip (NoC) architecture within the NMAs. The vector database application's memory allocation scheme is modified to implement this block data mapping.
4.2.6. Multi-Tenancy
IKS supports both spatial and coarse-grain temporal multi-tenancy:
- Spatial Multi-Tenancy: The IKS driver can partition embedding vectors from different vector databases across different LPDDR5X packages. This allows each NMA to perform ENNS independently for different vector databases in parallel.
- Temporal Multi-Tenancy: For vector databases that share the same LPDDR5X package, the IKS driver can time-multiplex similarity searches among them. This time-multiplexing occurs at the boundary of a complete similarity search operation.
5. Experimental Setup
5.1. Datasets
The primary dataset used for evaluating the RAG applications is the Google's Natural Questions (NQ) dataset [38, 39].
- Source and Characteristics: NQ is a large-scale dataset for open-domain question answering. It consists of real user questions issued to the Google search engine and corresponding answers derived from Wikipedia pages. Each question is paired with a Wikipedia page that contains the answer, along with supporting document passages.
- Scale: The NQ dataset is typically split into a training set (nq-train) and a validation set (nq-dev). Meta's KILT benchmark [65] provides these specific splits.
- Corpus Size: The document corpus (knowledge source) for all workloads is constructed from Wikipedia, as described in Karpukhin et al. (2020) [30]. The paper tests IKS with various vector database sizes (corpus sizes) storing the embedding vectors, including 50 GB and 512 GB. The documents themselves (plaintext) are stored in CPU memory.
- Embedding Generation: A pre-trained BERT base (uncased) model is used to generate embedding vectors for the documents. A 16-bit floating-point representation is assumed for these vectors.
- Example Data Sample (Conceptual):
  - Question: "What is the capital of France?"
  - Relevant Document Passage (from Wikipedia): "...Paris is the capital and most populous city of France..."
  - Answer: "Paris"

This structure is typical of NQ dataset entries. The embedding vector of the question would be matched against the embedding vectors of document passages to retrieve the most relevant ones.
5.2. Evaluation Metrics
The paper uses a combination of retrieval accuracy, generation accuracy, and performance metrics.
Retrieval Accuracy
- Recall: This metric quantifies how many of the truly relevant documents (those identified by ENNS, treated as ground truth) are successfully retrieved by an ANNS algorithm (a small code example follows this list).
  - Conceptual Definition: Recall measures the proportion of relevant items that are successfully retrieved by an information retrieval system out of all relevant items that should have been retrieved. In the context of ANNS vs. ENNS, it indicates how well ANNS finds the same ground-truth relevant documents as ENNS.
  - Mathematical Formula: $ \mathrm{Recall} = \frac{|\{\text{relevant documents retrieved by ANNS}\} \cap \{\text{relevant documents retrieved by ENNS}\}|}{|\{\text{relevant documents retrieved by ENNS}\}|} $
  - Symbol Explanation:
    - Numerator: the number of relevant documents retrieved by both the Approximate Nearest Neighbor Search (ANNS) algorithm and the Exact Nearest Neighbor Search (ENNS) algorithm; ENNS is considered the ground truth for relevant documents.
    - Denominator: the total number of relevant documents retrieved by the Exact Nearest Neighbor Search (ENNS) algorithm.
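As a concrete illustration (not the paper's evaluation code), recall of an ANNS result against the ENNS result treated as ground truth can be computed as:

```python
def recall_vs_enns(anns_ids, enns_ids):
    """Fraction of the ENNS (ground-truth) retrieved documents that ANNS also retrieves."""
    anns, enns = set(anns_ids), set(enns_ids)
    return len(anns & enns) / len(enns)

# Example: ANNS recovers 3 of the 4 documents returned by exact search
print(recall_vs_enns(anns_ids=[7, 2, 9, 11], enns_ids=[2, 7, 9, 5]))  # 0.75
```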
Generation Accuracy
This refers to how well the end-to-end RAG system answers questions. The metric depends on the specific generative model used.
- Exact Match (EM) for FiDT5:
  - Conceptual Definition: Exact Match is a strict metric that measures whether the generated answer text is identical to one of the reference answers after normalization (e.g., lowercasing, removing punctuation). It is common for extractive question answering tasks.
  - Mathematical Formula (standard EM formula, not explicitly provided in the paper but universally understood): $ \mathrm{EM} = \frac{\text{Number of questions with exactly matching answers}}{\text{Total number of questions}} \times 100\% $
  - Symbol Explanation:
    - Numerator: the count of questions where the RAG application's generated answer, after normalization, exactly matches one of the acceptable reference answers.
    - Denominator: the total number of questions in the evaluation dataset.
- ROUGE-L Recall for Llama-8B and Llama-70B:
  - Conceptual Definition: ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence) measures the overlap, in terms of the longest common subsequence (LCS), between the generated answer and the reference answer. ROUGE-L recall specifically focuses on how much of the reference answer is captured by the generated answer. It is more flexible than Exact Match and suited to abstractive summarization or answer generation, where exact wording may vary.
  - Mathematical Formula (standard ROUGE-L recall formula, not explicitly provided in the paper but universally understood): Let $X$ be the reference answer and $Y$ be the generated answer, and let $LCS(X, Y)$ be the length of their longest common subsequence. Then $ \mathrm{ROUGE\text{-}L_{Recall}} = \frac{LCS(X, Y)}{\mathrm{Length}(X)} $
  - Symbol Explanation:
    - $LCS(X, Y)$: the length of the longest common subsequence between the reference answer $X$ and the generated answer $Y$. A subsequence does not require contiguous matches, only the same order.
    - $\mathrm{Length}(X)$: the length (e.g., number of words or tokens) of the reference answer.
  - Context: The paper notes that prompting was used instead of fine-tuning for the Llama models to preserve generality, which limits evaluation by prompt adherence. Therefore, recall is used over precision or F1-score to account for potential verbosity while still assessing whether key information from the correct answer is present (a reference implementation sketch follows this list).
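A straightforward implementation of this metric (an illustrative sketch using whitespace tokenization, not the paper's evaluation script):

```python
def lcs_length(ref_tokens, gen_tokens):
    """Length of the longest common subsequence between two token lists."""
    m, n = len(ref_tokens), len(gen_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if ref_tokens[i] == gen_tokens[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def rouge_l_recall(reference, generated):
    """ROUGE-L recall: LCS length divided by the reference length."""
    ref, gen = reference.split(), generated.split()
    return lcs_length(ref, gen) / len(ref) if ref else 0.0

print(rouge_l_recall("paris is the capital of france",
                     "the capital of france is paris"))  # ~0.667 (LCS = "the capital of france")
```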
Performance Metrics
- Inference Time (Latency): The total time taken for an end-to-end RAG application to process a single query (or a batch of queries) and generate a response. The paper specifically mentions time-to-interactive (time to first token) for latency-critical applications.
- Throughput (Queries/sec): The number of queries an end-to-end RAG system can process per second, often measured at batch sizes greater than 1.
- Speedup (×): A ratio comparing the performance of an accelerated system (e.g., IKS) to a baseline system (e.g., CPU), indicating how many times faster the accelerated system is.
- Power Consumption (W): The electrical power consumed by the IKS device.
- Area (mm²): The physical silicon area occupied by the NMA chips and PHYs.
5.3. Baselines
The paper evaluates IKS performance against several key baselines:
- CPU (Intel Xeon 4th Gen Sapphire Rapids):
  - Model: Intel Xeon 4416+ (16 cores @ 2.00 GHz).
  - Memory: 512 GB DDR5-4000 across 8 channels (256 GB/s bandwidth).
  - Cache: 48 KB dcache and 32 KB icache (L1), 2 MB (L2), 37.5 MB shared (L3).
  - Compute: 2x AVX-512 FMA units (164 GFlop/s/core).
  - Purpose: This serves as the primary baseline for ENNS retrieval, representing a high-performance general-purpose processor.
- Intel AMX (Advanced Matrix Extensions):
  - Description: Integrated into Intel Sapphire Rapids CPUs. AMX is specialized for BFloat16 operations and provides 500 GFlop/s/core.
  - Purpose: Evaluates the impact of specialized CPU instruction sets on accelerating ENNS. The paper notes that the AMX speedup is flat for small batch sizes due to the memory-bound nature of the workload.
- NVIDIA H100 GPU (SXM):
  - Model: NVIDIA H100 SXM.
  - Memory: High-Bandwidth Memory (HBM) with 3.35 TB/s bandwidth.
  - Compute: 1979 TFlop/s.
  - Purpose: Represents a state-of-the-art GPU accelerator commonly used for LLM inference and general data-parallel tasks. It is used both to accelerate the generation phase (for FiDT5, Llama-8B, Llama-70B) and as a baseline for ENNS acceleration.
  - Configurations: Tested with 1, 2, 4, and 8 H100 GPUs to evaluate scalability and capacity for large corpus sizes. The paper highlights the GPUs' high cost and HBM capacity limitations.
- ANNS (Approximate Nearest Neighbor Search) Configurations:
  - Algorithm: HNSW (Hierarchical Navigable Small World) [52], a state-of-the-art graph-based ANNS.
  - Index Parameters: chosen for accuracy and a reasonably sized graph.
  - Configurations:
    - ANNS-1: efSearch = 2048.
    - ANNS-2: efSearch = 10000.
  - Purpose: To compare the trade-offs between retrieval quality, speed, and end-to-end RAG accuracy against ENNS. The paper demonstrates that while ANNS can be faster, achieving high generation accuracy with ANNS often requires a larger K (more documents), which negates the speed benefits.
5.4. Software Configuration
- Evaluation Dataset: Google's Natural Questions (NQ) dataset, specifically the nq-dev (validation) split, is used for evaluating the models.
- Retrieval Model: A pre-trained BERT base (uncased) model is used to generate embedding vectors for documents and queries.
- Vector Database Library: Faiss [27] is used for index management for both ENNS and ANNS.
  - ENNS Optimization: For ENNS, Intel's OneMKL BLAS backend is used for dot-product calculations at all batch sizes, providing better performance than Faiss's default BLAS usage (which applies only for batch sizes ≥ 20).
- Generative Models:
  - FiDT5: Uses the T5-based Fusion-in-Decoder [26, 68]. The generator is a pre-trained T5-base model (220 million parameters) fine-tuned on the nq-train dataset to predict answers from question-evidence pairs. Documents are presented to the T5 encoder, and their representations are combined for the decoder.
  - Llama-8B and Llama-70B: Use 4-bit-quantized Llama-3-8B-Instruct and Llama-3-70B-Instruct [3]. Retrieved documents are presented as plaintext in the prompt. Prompting is used instead of fine-tuning to maintain model generality.
- Batching: Experiments consider batch sizes of 1 (for latency-critical applications) and 16 (for throughput-optimized applications).
- Document Count (K): The number of documents (K) retrieved and fed to the LLM is a key parameter. Increasing K significantly impacts generation time due to the linear scaling of Transformer inference with input size and the increased key-value cache memory. Each document is approximately 100 words (avg. 127 tokens for the Llama models).
5.5. Experimental Setup (Hardware & Simulation)
- Physical Servers: Two servers with Intel Xeon 4th generation CPUs and one NVIDIA H100 GPU are used for baseline measurements and end-to-end RAG execution (the generation phase runs on the H100).
- IKS Evaluation: A cycle-approximate simulator (see Appendix A) is developed to model IKS performance. This simulator uses:
  - Timing parameters from RTL synthesis (for the NMA logic).
  - LPDDR5X access timing.
  - PCIe/CXL timing [43, 81].
  - Calculations of real software stack overheads (e.g., top-K aggregation, umwait() overhead).
  - It emulates IKS as a CXL device running on a remote CPU socket.
- RTL Design and Synthesis: The NMA (near-memory accelerator) RTL design for IKS was synthesized using Synopsys Design Compiler targeting TSMC's 16nm technology node to obtain area, power, and timing metrics at 1 GHz operation.
- Area and Power Estimation:
  - The area of the memory controllers and PHYs is estimated from Apple M2 die shots (LPDDR5 in a 5nm process), scaled to 16nm (assuming negligible mixed-signal scaling [23, 88]).
  - The power model is developed by evaluating the RTL-level energy consumption of processing operations and data accesses to the scratchpads and LPDDR memory. Energy values (e.g., SRAM 39 fJ/bit, LPDDR 4 pJ/bit [13]) are scaled to 16nm [79].
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Tuning RAG Software Parameters and Examining Approximate Search
The paper first explores the trade-offs between retrieval quality and generation accuracy for RAG applications, comparing ENNS with ANNS variants.
The following figure (Figure 2 from the original paper) compares the generation accuracy and throughput of ANNS- and ENNS-based RAG applications:

The figure is a chart showing how different retrieval algorithms and document counts (K) affect the generation accuracy and throughput (Queries/sec) of representative RAG applications; the primary axis shows generation accuracy and the secondary axis shows throughput.
Fig. 2. Generation accuracy vs. throughput (Queries/sec) of representative RAG applications for various retrieval algorithms and document counts (K). The corpus size is set to and batch size to 16.
Analysis of Figure 2:
- Impact of Retrieval Quality: The figure clearly demonstrates that retrieval quality strongly influences end-to-end generation accuracy.
- ANNS Accuracy Drop:
  - With a document count (K) of 1, ANNS-1 and ANNS-2 (using HNSW with efSearch of 2048 and 10000, respectively) show significant accuracy drops compared to ENNS:
    - FiDT5: 22.6% and 34.0% reduction.
    - Llama-8B: 52.8% and 53.4% reduction.
    - Llama-70B: 51.0% and 51.5% reduction.
  - Even with a larger document count (K) of 16, the accuracy reductions are substantial:
    - FiDT5: 13.6% and 22% reduction.
    - Llama-8B: 38.4% and 42.2% reduction.
    - Llama-70B: 38.4% and 45.2% reduction.
- Larger LLMs and Retrieval Quality: The impact of retrieval quality on generation accuracy appears even more pronounced for Llama-8B and Llama-70B than for FiDT5, especially at lower K values. This suggests that larger LLMs not fine-tuned for the task are more sensitive to the quality of the retrieved context.
- Throughput vs. Accuracy Trade-off: While ANNS configurations (e.g., ANNS-2) can offer higher throughput (Queries/sec) than ENNS at certain K values, this often comes at the cost of lower generation accuracy.
- Compensation for ANNS: Increasing K to compensate for lower ANNS retrieval quality can lead to lower generation accuracy and end-to-end throughput than a higher-quality, slower ANNS configuration (e.g., ANNS-2 with 16 documents vs. ANNS-1 with 128 documents for FiDT5, where ANNS-2 achieves 3% higher accuracy and 128% higher throughput). This is because larger K significantly increases generation time due to the LLM's computational and memory overheads.
- Pareto-Superiority of ENNS: For RAG applications requiring high generation accuracy (e.g., above ~43% for FiDT5, ~27% for Llama-8B, ~14% for Llama-70B), ENNS-based RAG is Pareto-superior: it achieves higher accuracy for a given throughput, or higher throughput for a given accuracy, especially at the high-accuracy end of the spectrum.
- Modest Speedup of High-Quality ANNS: The paper notes that ANNS-2, the best-performing ANNS configuration in terms of throughput in this analysis, offers only a modest speedup over ENNS. This modest speedup, coupled with lower accuracy, reinforces the motivation for ENNS acceleration.

Conclusion from this section: High-quality retrieval is crucial for RAG's effectiveness, and ENNS often yields the best accuracy-throughput trade-off at high accuracy demands, despite its higher baseline cost. This motivates accelerating ENNS specifically.
6.1.2. End-to-End RAG Performance with ENNS
After establishing the importance of ENNS, the paper profiles its end-to-end performance as a bottleneck.
The following figure (Figure 3 from the original paper) shows the latency breakdown of FiDT5, Llama-8B, and Llama-70B for various values of K and corpus sizes:

Fig. 3. Latency breakdown of FiDT5, Llama-8B, Llama-70B for various values of K, corpus sizes. All configurations use batch size 1. Retrieval is ENNS and runs on CPU, generation runs on a single NVIDIA H100 (SXM) for all generative models. The value in each bar shows the absolute retrieval time.
Analysis of Figure 3:
- Retrieval as a Bottleneck: For all RAG applications (FiDT5, Llama-8B, Llama-70B), ENNS retrieval on the CPU consumes a significant portion of the end-to-end latency.
  - For a 50GB corpus, retrieval can account for 97% (FiDT5), 90% (Llama-8B), and 74% (Llama-70B) of the total time.
  - As the corpus size increases to 512GB, retrieval becomes even more dominant, reaching 99% for FiDT5, 98% for Llama-8B, and 92% for Llama-70B. The absolute retrieval times (indicated in the bars) range from hundreds of milliseconds to several seconds for large corpus sizes.
- Impact of K on Generation Time: While increasing K might seem like a way to compensate for lower ANNS quality, Figure 3b clearly shows that increasing K (the number of documents sent to the LLM) significantly increases generation time. This is due to the linear scaling of Transformer inference with input size and the KV cache memory overhead, making large K costly in terms of time to first token.
- Memory-bound vs. Compute-bound: ENNS is memory bandwidth-bound, while generation is relatively compute-bound. The CPU-based ENNS cannot saturate the available DRAM bandwidth, indicating a fundamental limitation of software-only approaches.
- Urgency for Retrieval Acceleration: The end-to-end latency for large corpus sizes can exceed several seconds, which is unacceptable for interactive, user-facing applications. This underscores the critical need to accelerate the retrieval phase.
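A rough analytic model captures the trend in Figure 3: ENNS retrieval time is bounded by how fast the corpus can be streamed from memory, while generation (prefill) time grows roughly linearly with the number of retrieved documents K. The bandwidth and token-rate numbers below are placeholder assumptions for illustration only.

```python
def retrieval_time_s(corpus_gb: float, effective_bw_gbps: float = 100.0) -> float:
    """Lower bound for a memory-bound exhaustive scan: every embedding byte is read once."""
    return corpus_gb / effective_bw_gbps

def generation_time_s(k_docs: int, tokens_per_doc: int = 128,
                      prefill_tok_per_s: float = 20_000.0) -> float:
    """Crude prefill model: time-to-first-token scales ~linearly with prompt length (~K docs)."""
    return (k_docs * tokens_per_doc) / prefill_tok_per_s

for corpus_gb in (50, 512):
    for k in (1, 16):
        r, g = retrieval_time_s(corpus_gb), generation_time_s(k)
        print(f"{corpus_gb:>4} GB, K={k:>2}: retrieval ~{r:.2f}s, "
              f"generation ~{g:.2f}s, retrieval share ~{r / (r + g):.0%}")
```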
6.1.3. High-Quality Search Acceleration
The paper examines existing hardware solutions for ENNS acceleration.
The following table (Table 1 from the original paper) compares the speedup of Intel AMX and GPU for ENNS, relative to a CPU baseline:
| | Batch 1, 50 GB | Batch 1, 512 GB | Batch 16, 50 GB | Batch 16, 512 GB |
| --- | --- | --- | --- | --- |
| CPU | 1.0 | 1.0 | 1.0 | 1.0 |
| AMX | 1.05 | 1.02 | 1.10 | 1.09 |
| GPU | 5.2 | 36.9 | 6.0 | 43.7 |
Table 1. Speedup of Intel AMX and GPU for ENNS, relative to a CPU baseline. AMX speedup is flat for very small batch sizes, due to the memory-bound nature of similarity search. For 50GB and 512GB corpus size, 1 and 8 H100 GPUs are used, respectively.
Analysis of Table 1:
- AMX Performance: Intel AMX provides only a modest speedup (1.02-1.10x) for ENNS. This is because ENNS is primarily memory-bound, and AMX, while boosting FLOPs, cannot overcome the memory bandwidth bottleneck of the CPU system.
- GPU Performance: GPUs offer significant speedup over the CPU for ENNS (5.2-43.7x).
  - For a 50GB corpus, a single H100 GPU achieves 5.2x (batch 1) and 6.0x (batch 16) speedup.
  - For a 512GB corpus, 8 H100 GPUs are required to fit the data, achieving 36.9x (batch 1) and 43.7x (batch 16) speedup.
- GPU Cost and Utilization Issues: The paper highlights two key problems with GPU acceleration for ENNS:
  - High Cost: As the corpus grows, more GPUs are needed (e.g., 8 H100s for 512GB), significantly increasing cost due to HBM memory. HBM is several times more expensive than DDR or LPDDR.
  - Poor Utilization: GPUs provision massive amounts of compute relative to memory bandwidth. For memory-bound ENNS, a large GPU die is poorly utilized, leading to inefficient resource usage [24].

The following figure (Figure 4 from the original paper) illustrates the Roofline model for ENNS on the CPU:
(Figure: 16-core roofline, operational intensity (FLOPs/Byte) vs. performance (GFLOPs/s) for batch sizes 1 and 16, with the peak DRAM bandwidth and peak compute ceilings annotated.)
Fig. 4. Roofline model for ENNS using Batch Size 1 and 16. See Section 6 for the experimental setup.
Analysis of Figure 4 (Roofline Model):
- The Roofline model visually confirms that ENNS running on the CPU is memory-bound: the achieved performance (FLOPs/s) is limited by the system's memory bandwidth (the horizontal roofline) rather than by peak compute performance (the diagonal roofline).
- The CPU cannot saturate its available DRAM bandwidth for ENNS, reinforcing the argument that deep cache hierarchies and high FLOP counts are of limited benefit for this memory-intensive workload. This further motivates dedicated near-memory acceleration that focuses on bandwidth.
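The roofline reasoning can be reproduced with a few lines of arithmetic: a batched inner-product scan performs 2·B·d FLOPs for every d·4 bytes of fp32 embeddings streamed, so its operational intensity is B/2 FLOPs/byte and sits far below the machine balance point at small batch sizes. The peak-compute and peak-bandwidth numbers below are illustrative assumptions, not the paper's measured values.

```python
def operational_intensity(batch: int, bytes_per_elem: int = 4) -> float:
    # 2*B*d FLOPs (multiply + add per query) over d*bytes_per_elem bytes streamed;
    # the embedding dimension d cancels out.
    return 2 * batch / bytes_per_elem        # FLOPs per byte

def attainable_gflops(batch: int, peak_gflops: float = 2_000.0,
                      peak_bw_gbps: float = 300.0) -> float:
    # Classic roofline: performance is the minimum of the compute and bandwidth ceilings.
    return min(peak_gflops, operational_intensity(batch) * peak_bw_gbps)

for b in (1, 16, 64):
    print(f"batch {b:>2}: OI = {operational_intensity(b):.1f} FLOP/B, "
          f"attainable ~{attainable_gflops(b):.0f} GFLOP/s")
```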
6.1.4. Effectiveness and Scalability of IKS Retrieval
The paper then presents the performance of IKS for ENNS retrieval.
The following figure (Figure 9 from the original paper) compares the ENNS retrieval time for CPU, AMX, GPU (1, 2, 4, and 8 devices), and IKS (1 and 4 devices) for various corpus sizes:

Fig. 9. Comparison of ENNS retrieval time for CPU, AMX, GPU (1, 2, 4, and 8 devices), and IKS (1, and 4 devices) for various corpus sizes. The absence of bars in specific GPU and IKS configurations indicates that the corpus exceeds the capacity of the accelerator memory. The Y-axis is in log-scale.
Analysis of Figure 9:
- IKS vs. CPU/AMX: IKS significantly outperforms the CPU and AMX for all corpus sizes and batch sizes. For a 512GB corpus, IKS achieves 13.4-27.9x faster ENNS than Intel Sapphire Rapids CPUs.
- IKS vs. GPU (Single Device): Counterintuitively, a single IKS unit outperforms a single H100 GPU for a 50GB corpus at both batch 1 and batch 16. This is attributed to:
  - Specialized Top-K Units: IKS has dedicated Top-K units, which are more efficient than the GPU's general-purpose compute for this operation.
  - Higher Utilization of Memory Bandwidth: GPUs need many streaming multiprocessors and tensor cores issuing memory accesses in parallel to saturate HBM bandwidth, which the memory-bound ENNS workload often fails to achieve. IKS is designed to balance compute and memory bandwidth specifically for this task.
- Scalability (Multi-GPU vs. Multi-IKS):
  - GPU Scaling: GPU retrieval time for a 50GB corpus decreases as 2, 4, and 8 GPU devices are used. Each H100 GPU fits 80GB, so 8 H100s can handle up to 640GB.
  - IKS Scaling: IKS shows excellent scalability. With only four IKS devices, it can accommodate up to a 2TB corpus (whereas 8 H100 GPUs manage 640GB).
  - Near-Perfect Weak Scaling: IKS demonstrates near-perfect weak scaling; the retrieval time for a 2TB corpus on 4 IKS units is only marginally longer than for a 512GB corpus on 1 IKS unit. This highlights the low-overhead IKS-CPU interface and the highly parallelizable nature of ENNS.
- Memory Capacity: The absence of bars for the GPU at larger corpus sizes (e.g., 512GB for 1 GPU) and for IKS at 2TB with a single unit indicates that the corpus exceeds the memory capacity of those configurations.

The following table (Table 3 from the original paper) reports the absolute time breakdown of ENNS retrieval on IKS:
| | 50 GB, batch 1 | 50 GB, batch 64 | 512 GB, batch 1 | 512 GB, batch 64 |
| --- | --- | --- | --- | --- |
| Write Query Vector | 0.3 µs | 1 µs | 0.3 µs | 1 µs |
| Dot-Product | 45.96 ms | 45.96 ms | 470.6 ms | 470.6 ms |
| Partial Top-32 Read | 0.7 µs | 22.4 µs | 0.7 µs | 22.4 µs |
| Top-K Aggregation | 19 µs | 540 µs | 23 µs | 390 µs |
| Total | 46.0 ms | 46.5 ms | 470.6 ms | 471.0 ms |
Table 3. Breakdown of ENNS latency on IKS.
Analysis of Table 3:
- Dominance of Dot-Product: The dot-product computation and the associated DRAM accesses (combined under "Dot-Product") consume the overwhelming majority of ENNS retrieval time on IKS (e.g., 45.96 ms out of 46.0 ms for the 50GB corpus at batch 1). This is expected for a memory-bound workload accelerated near memory.
- Negligible Offload Overheads: The overheads associated with the IKS-CPU interface are extremely low: Write Query Vector (offload context transfer) takes 0.3-1 µs, Partial Top-32 Read (results transfer) takes 0.7-22.4 µs, and the final Top-K Aggregation on the CPU takes 19-540 µs (a host-side merge sketch follows this list).
- Insensitivity to K: The retrieval time on IKS does not change with K (up to 32) because IKS always computes and returns the top 32 similarity scores; the retriever software then decides how many of these to pass to the generative model.
- Batch Size Impact: The dot-product time remains constant across batch sizes (1 vs. 64) because IKS's NMA is designed to be fully utilized at the maximum batch size and embedding vectors are reused across query vectors. Top-K aggregation time increases with batch size because more partial lists must be processed.
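As a concrete illustration of the aggregation step, the sketch below merges the partial top-32 lists returned by each NMA into a single global top-K on the host. The interface (plain Python tuples per NMA) is an assumption for illustration, not IKS's actual driver API.

```python
import heapq

def aggregate_top_k(partial_lists, k=32):
    """Merge per-NMA partial results into a global top-k.

    partial_lists: iterable of lists of (score, doc_id) pairs, one list per NMA.
    Returns the k highest-scoring pairs across all NMAs.
    """
    merged = (pair for lst in partial_lists for pair in lst)
    return heapq.nlargest(k, merged, key=lambda pair: pair[0])

# Example: 8 NMAs in one IKS unit, each returning its local top-32 for a single query.
partials = [[(0.9 - 0.01 * i - 0.001 * nma, f"doc_{nma}_{i}") for i in range(32)]
            for nma in range(8)]
top16 = aggregate_top_k(partials, k=16)  # retriever software picks K <= 32 to send to the LLM
print(top16[:3])
```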
6.1.5. End-to-End Performance
The paper demonstrates the impact of IKS acceleration on end-to-end RAG inference time.
The following figure (Figure 8 from the original paper) compares inference times under different retrieval strategies (CPU vs. IKS retrieval) across multiple models (FiDT5, Llama-8B, Llama-70B). The x-axis represents the number of documents retrieved (K), and the y-axis shows the time-to-interactive in seconds. It specifically illustrates how IKS search significantly reduces time-to-interactive for both the 512GB and 50GB corpus configurations.

Analysis of Figure 8:
- Dramatic Reduction in Retrieval Bottleneck: The graphs clearly show that when IKS is used for ENNS retrieval (green bars), the retrieval portion of end-to-end inference shrinks to near-negligible levels compared to generation time. In contrast, CPU retrieval (blue bars) constitutes a major bottleneck, especially for the larger 512GB corpus.
- Speedup Range: IKS provides substantial end-to-end inference time speedups for FiDT5, Llama-8B, and Llama-70B, spanning the 1.7-26.3x range reported by the paper. The speedup varies with batch size, corpus size, and document count (K).
- Enabling Real-time RAG: For large corpus sizes (e.g., 512GB) or large batch sizes, CPU retrieval leads to inference times of several seconds, which is unacceptable for user-facing applications. IKS brings these times down to hundreds of milliseconds or less, enabling real-time RAG.
- Shifting Bottleneck: With IKS, the retrieval bottleneck is largely eliminated, and the generation phase (running on a GPU) becomes the dominant factor in end-to-end latency.

The following figure (Figure 11 from the original paper) compares the accuracy and throughput of FiDT5, Llama-8B, and Llama-70B for various configurations:
(Figure: generation accuracy on the x-axis vs. queries/sec on a log-scale y-axis; marker shapes distinguish K = 1, 4, 16, and 32/128, and each curve corresponds to a model paired with a retrieval algorithm such as ENNS or ANNS-2, showing the accuracy-efficiency trade-off.)
Fig. 11. Comparison of accuracy and throughput of FiDT5, Llama-8B, and Llama-70B for various configurations. ANNS-2 is an HNSW index with efConstruction = 128 and efSearch = 2048.
Analysis of Figure 11:
- ANNS vs. ENNS (CPU): ANNS-2 configurations (yellow points) generally exhibit higher throughput than ENNS running on the CPU (blue points) but at lower generation accuracy. This reinforces the earlier finding that ANNS trades accuracy for speed, and that reaching high accuracy with ANNS is challenging.
- IKS for Pareto-Superiority: RAG applications using IKS (green points) achieve significantly higher throughput while maintaining (or even slightly improving) generation accuracy compared to both CPU ENNS and the ANNS variants. This is because IKS's high-quality ENNS lets the generative model receive a smaller but more accurate context (lower K), drastically reducing generation time and thus increasing end-to-end throughput at high accuracy levels; retrieval is no longer the bottleneck.
- Importance of High-Quality Retrieval: IKS demonstrates that investing in high-quality retrieval acceleration is key to unlocking the full potential of RAG in terms of both accuracy and throughput.
6.1.6. Power and Area Analysis
- NMA Area: Each NMA chip (with 64 processing engines, a query scratchpad, a top-K unit, and an output scratchpad) occupies only a small logic area in 16nm TSMC technology.
  - Additional area is required for the PHYs and memory controllers.
  - However, the minimum NMA chip area is dictated by the shoreline needed for the eight LPDDR5X memory channels and the PCIe PHYs, which forces a larger die than the logic alone requires in 16nm.
  - The paper notes that this shoreline constraint leaves "free" area on the chip, which IKS exploits by overprovisioning compute (64 PEs) so the design remains memory-bandwidth bound even at lower batch sizes.
- IKS Power Consumption:
- For batch size 1 and vector dimension 1024:
  - The processing engines and query scratchpad accesses consume ~59 mW per NMA.
  - Accessing embedding vectors from LPDDR memory consumes ~4.35 W per NMA.
  - The total power for one IKS unit (8 NMAs) is therefore roughly 8 × (4.35 W + 59 mW) ≈ 35 W (see the arithmetic sketch after this list).
- For batch size 64 (full utilization):
  - LPDDR access power remains constant (due to data reuse across query vectors).
  - Processing engine power increases roughly linearly with batch size.
  - Total power increases to ~65 W.
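The unit-level numbers follow directly from the per-NMA figures above; the short sketch below just makes the arithmetic explicit (the per-NMA attribution and linear PE scaling are my reading of the breakdown, consistent with the ~65 W full-utilization figure).

```python
NMAS_PER_UNIT = 8
PE_POWER_BATCH1_W = 0.059  # processing engines + query scratchpad, per NMA, batch 1
LPDDR_POWER_W = 4.35       # embedding-vector accesses from LPDDR, per NMA (batch-independent)

batch1_unit_w = NMAS_PER_UNIT * (PE_POWER_BATCH1_W + LPDDR_POWER_W)
# PE power scales ~linearly with batch size; LPDDR power stays flat thanks to data reuse.
batch64_unit_w = NMAS_PER_UNIT * (64 * PE_POWER_BATCH1_W + LPDDR_POWER_W)

print(f"batch 1:  ~{batch1_unit_w:.0f} W per IKS unit")   # ~35 W
print(f"batch 64: ~{batch64_unit_w:.0f} W per IKS unit")  # ~65 W, matching the reported figure
```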
6.1.7. Cost and Power Comparison with GPU
- Memory Cost: IKS uses LPDDR5X, which the paper estimates to be more than 3x less expensive per GB than HBM [54]. However, a single IKS unit carries 6.4x as much onboard memory as a single NVIDIA H100 GPU (512GB vs. 80GB), so the total onboard memory cost of an IKS unit works out to roughly 2-2.5x that of a GPU: 6.4x the capacity at less than a third of the per-GB price (see the arithmetic sketch at the end of this section). In other words, IKS's memory is more expensive in total only because it offers far more capacity; per GB it is substantially more cost-effective, which supports the paper's "cost-effective high capacity" argument.
- Compute Unit Cost: A GPU such as the H100 has a die area of roughly 814 mm², whereas the combined die area of the eight IKS NMAs is far smaller. Since chip production cost increases superlinearly with die area, an IKS unit (despite its larger memory capacity) is expected to cost a fraction of a GPU, making IKS a highly cost-effective solution.
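The memory-cost comparison above is simple ratio arithmetic; the sketch below spells it out. The absolute per-GB prices are not needed, and only the >3x HBM-to-LPDDR price ratio comes from the paper's discussion.

```python
IKS_CAPACITY_GB, GPU_CAPACITY_GB = 512, 80
HBM_TO_LPDDR_PRICE_RATIO = 3.0  # paper's claim: HBM is more than 3x pricier per GB

capacity_ratio = IKS_CAPACITY_GB / GPU_CAPACITY_GB            # 6.4x more onboard memory on IKS
total_cost_ratio = capacity_ratio / HBM_TO_LPDDR_PRICE_RATIO  # ~2.1x higher total memory cost
per_gb_cost_ratio = 1.0 / HBM_TO_LPDDR_PRICE_RATIO            # ~0.33x, i.e., 3x cheaper per GB

print(f"capacity: {capacity_ratio:.1f}x, total memory cost: ~{total_cost_ratio:.1f}x, "
      f"per-GB cost: ~{per_gb_cost_ratio:.2f}x of the H100's HBM")
```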
6.2. Ablation Studies / Parameter Analysis
The paper incorporates parameter analysis throughout Section 3:
- Batch Size: The impact of batch size (1 for latency-critical serving, 16 for throughput-optimized serving) is evaluated across all RAG applications. While batch size does not affect generation accuracy, it significantly affects execution time and throughput. IKS is designed to be efficient across batch sizes, achieving full NMA utilization at batch size 64.
- Document Count (K): The number of documents retrieved and fed to the LLM is a crucial parameter. The analyses in Figures 2 and 3 demonstrate that:
  - Increasing K can compensate for lower ANNS retrieval quality but drastically increases generation time and memory overhead (due to the KV cache).
  - High-quality ENNS (accelerated by IKS) achieves the same or higher generation accuracy with smaller K, thus reducing generation time and improving end-to-end throughput.
- ANNS Parameters (M, efConstruction, efSearch): For HNSW ANNS, M (the number of outgoing edges per node), efConstruction (a construction-time parameter), and efSearch (a search-time parameter) were tuned.
  - M and efConstruction were chosen to maximize retrieval accuracy while keeping the graph size reasonable.
  - efSearch (2048 for ANNS-1, 10000 for ANNS-2) was varied to explore the search-speed vs. accuracy trade-off: a lower efSearch gives faster but less accurate search, and a higher efSearch the opposite. The results show that trading retrieval quality for speed via a small efSearch and then increasing K often yields worse generation accuracy and end-to-end throughput than a higher-quality, slower ANNS (or ENNS).

These analyses collectively emphasize that high-quality retrieval is not merely a component but a critical factor in RAG's overall performance, justifying IKS's focus on accelerating ENNS.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work presents a comprehensive analysis of Retrieval-Augmented Generation (RAG) applications, identifying the retrieval phase as a critical bottleneck for accuracy, latency, and throughput. The authors empirically demonstrate that Exact Nearest Neighbor Search (ENNS), despite its computational cost, is often Pareto-superior for RAG systems requiring high generation accuracy, as it enables sending a smaller, more precise set of documents to the Large Language Model (LLM), thereby reducing generation time.
To address this, the paper introduces the Intelligent Knowledge Store (IKS), a novel Type-2 CXL device designed for near-memory acceleration of ENNS. IKS features a scale-out architecture with near-memory accelerators (NMAs) integrated with LPDDR5X DRAM and a cache-coherent interface over CXL.cache for efficient CPU-accelerator interaction. IKS achieves 13.4-27.9x faster ENNS over a 512GB vector database compared with Intel Sapphire Rapids CPUs, translating to 1.7-26.3x lower end-to-end inference time for representative RAG workloads. Beyond acceleration, IKS functions as a memory expander: its DRAM can be disaggregated and used by other applications, improving overall server resource utilization and cost-effectiveness compared to GPU-based solutions.
7.2. Limitations & Future Work
The authors highlight several limitations and propose future research directions:
- Exhaustive Search Inefficiency: The current IKS performs an exhaustive search over the entire corpus, which consumes energy and constantly saturates memory bandwidth. This can potentially slow down other applications using IKS as a general memory expander.
  - Future Work: Exploring early-termination techniques for similarity search [9, 42] could reduce memory bandwidth utilization without compromising search accuracy.
- Low NMA Utilization for Small Batch Sizes: For batch sizes below 64, NMA chip utilization is low. This is partly a design choice (overprovisioning compute to use the "free" area dictated by the PHY shoreline), but it leads to energy inefficiency.
  - Future Work: Implementing circuit-level techniques such as clock and power gating to power off inactive processing engines at smaller batch sizes. Additionally, dynamic voltage and frequency scaling (DVFS) could reduce NMA frequency and voltage, potentially allowing multiple processing engines to process a single query vector for very small batch sizes.
- Dataset Dependency for ANNS Comparison: While IKS provides dataset-independent, high-quality search via ENNS, the paper acknowledges that if a dataset is highly amenable to clustering, ANNS might close the accuracy gap, making it more attractive. IKS's performance relies heavily on the sequential memory access pattern of ENNS, which makes integrating approximation techniques challenging.
  - Future Work: This implicitly suggests exploring hybrid approaches or IKS variants that selectively incorporate approximation when it does not disrupt the efficient ENNS memory access pattern.
- Overhead of Host-side Aggregation: For deployments with a very large number of IKS units (beyond the four evaluated), the overhead of host-side final top-K aggregation could become a bottleneck.
- Multi-Node Deployments: The current evaluation does not cover deployments of IKS spanning multiple nodes.
7.3. Personal Insights & Critique
This paper makes a compelling case for specialized hardware acceleration of Exact Nearest Neighbor Search within the RAG paradigm. The observation that ENNS can paradoxically lead to lower end-to-end inference time by enabling a smaller, higher-quality context for the LLM is a critical insight often overlooked when focusing solely on retrieval speed. This challenges the common assumption that faster approximate retrieval is always better.
The design of IKS is innovative in several aspects:
- CXL.cache as a First-Class Interface: Leveraging CXL.cache for a cache-coherent doorbell and context buffers is an elegant solution to the CPU-accelerator communication bottleneck, offering superior performance over MMIO-like CXL.io and avoiding complex DMA programming. This sets a precedent for how CXL can truly enable shared-memory processing.
- Scale-Out NMA Architecture for Manufacturability: The pragmatic decision to use a scale-out NMA architecture, distributing logic across multiple chips due to shoreline constraints for the LPDDR5X PHYs, is a smart engineering choice that balances performance, cost, and manufacturability.
- LPDDR5X for Cost-Effectiveness: The choice of LPDDR5X over HBM demonstrates a clear focus on cost-effectiveness and power efficiency, recognizing that for vector databases (which are large and growing), memory cost is a dominant factor.
Potential Issues/Areas for Improvement:
- "Free Area" Justification: While the paper justifies overprovisioning compute on the NMA chip with the "free area" left by the PHY shoreline constraint, this is a static design decision. The proposed future work on power gating and DVFS is crucial to mitigate power inefficiency at smaller batch sizes and to fully capitalize on this "free" area.
- Dataset Sensitivity of ANNS: The argument against ANNS accelerators on grounds of dataset dependence is strong, but a hybrid approach in which IKS quickly filters out a large portion of the search space with simple ANNS techniques before performing ENNS on a reduced candidate set might offer even greater efficiency for certain datasets, without necessarily breaking the sequential access pattern. This could be a more dynamic form of early termination.
- Software Ecosystem: While the paper focuses on hardware, the long-term success of such a specialized accelerator depends heavily on its integration into the broader vector database software ecosystem. The minimalist hardware design relies on software for mapping and aggregation, so strong driver and library support is essential.
- Reliability of LPDDR5X: The paper mentions that ENNS is resilient to bit flips in LPDDR5X. While this may hold for ENNS, if IKS is truly a memory expander for other applications, LPDDR5X's datacenter reliability (compared with ECC DDR) could become a concern for data-sensitive workloads. Further research into in-line ECC for LPDDR5X in datacenter CXL-memory contexts would be valuable.

Overall, IKS represents a significant step toward practical and scalable RAG acceleration. Its rigorous analysis of RAG bottlenecks and its innovative hardware-software co-design using CXL offer valuable insights for the future of AI system architectures. The ability to act as a memory expander also makes it a more versatile and economically attractive solution than single-purpose accelerators.