SPFresh: Incremental In-Place Update for Billion-Scale Vector Search
TL;DR Summary
SPFresh introduces LIRE, a lightweight incremental rebalancing protocol enabling in-place updates for billion-scale vector search, significantly reducing update latency and resource use while improving accuracy compared to costly global rebuild methods.
Abstract
Approximate Nearest Neighbor Search (ANNS) is now widely used in applications ranging from information retrieval and question answering to recommendation, all of which search for similar high-dimensional vectors. As the amount of vector data grows continuously, it becomes important to support updates to the vector index, the enabling technique for efficient and accurate ANNS on vectors. Because of the curse of high dimensionality, it is often costly to identify the right neighbors of a single new vector, a necessary step in an index update. To amortize update costs, existing systems maintain a secondary index to accumulate updates, which are periodically merged into the main index by globally rebuilding the entire index. However, this approach causes large fluctuations in search latency and accuracy, not to mention that it requires substantial resources and is extremely time-consuming for rebuilds. We introduce SPFresh, a system that supports in-place vector updates. At the heart of SPFresh is LIRE, a lightweight incremental rebalancing protocol that splits vector partitions and reassigns vectors in nearby partitions to adapt to data distribution shift. LIRE achieves low-overhead vector updates by reassigning only the vectors at the boundaries between partitions, where in a high-quality vector index the number of such vectors is small. With LIRE, SPFresh provides superior query latency and accuracy to solutions based on global rebuilds, with only 1% of the DRAM and less than 10% of the CPU cores needed at peak compared to the state-of-the-art, on a billion-scale vector index with a 1% daily vector update rate.
English Analysis
1. Bibliographic Information
- Title: SPFresh: Incremental In-Place Update for Billion-Scale Vector Search
- Authors: Yuming Xu, Hengyu Liang, Jin Li, Shuotao Xu, Qi Chen, Qianxi Zhang, Cheng Li, Ziyue Yang, Fan Yang, Yuqing Yang, Peng Cheng, Mao Yang
- Affiliations: The authors are affiliated with the University of Science and Technology of China, Microsoft Research Asia, and Harvard University. This indicates a collaboration between a leading academic institution and a top-tier industrial research lab, suggesting a blend of rigorous academic research and practical, large-scale systems expertise.
- Journal/Conference: The paper was published in the ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP '23). SOSP is one of the most prestigious and competitive conferences in the field of computer systems, known for publishing foundational work in operating systems, distributed systems, and storage. Publication at SOSP signifies a high level of novelty, technical depth, and impact.
- Publication Year: 2023
- Abstract: The paper addresses the challenge of updating large-scale vector indexes, which are crucial for Approximate Nearest Neighbor Search (ANNS). Existing systems often accumulate updates and perform periodic, resource-intensive global rebuilds of the entire index, leading to fluctuating search performance and high operational costs. The authors introduce SPFresh, a system that enables efficient, incremental "in-place" updates. The core of SPFresh is LIRE, a lightweight protocol that locally rebalances vector partitions by splitting them and reassigning only a small number of vectors near the partition boundaries. This approach adapts to data distribution shifts without global rebuilds. The abstract claims that for a billion-scale index with a 1% daily update rate, SPFresh provides superior and more stable query performance than state-of-the-art solutions, while using only 1% of the peak DRAM and less than 10% of the peak CPU cores.
- Original Source Link:
  - Official Link: https://doi.org/10.1145/3600006.3613166
  - Preprint (arXiv): https://arxiv.org/abs/2410.14452
  - PDF Link: https://arxiv.org/pdf/2410.14452v1.pdf
- The paper was formally published at SOSP '23.
2. Executive Summary
- Background & Motivation (Why): Modern AI applications, from recommendation systems to generative AI, rely on searching for similar items within massive datasets of high-dimensional vectors. This is accomplished using Approximate Nearest Neighbor Search (ANNS), which depends on a specialized data structure called a vector index. As data is continuously generated (e.g., new images, user interactions, documents), these indexes must be updated frequently to provide "fresh" results. The dominant industry practice is to batch updates and periodically rebuild the entire index from scratch. This "global rebuild" approach is extremely problematic: it consumes massive computational resources (CPU and memory), takes days to complete for billion-scale datasets, and causes significant fluctuations in search latency and accuracy, harming the user experience of live services. The core challenge is that updating a high-dimensional index structure is inherently complex and costly, a problem this paper aims to solve.
- Main Contributions / Findings (What): The paper introduces SPFresh, a disk-based ANNS system designed to handle continuous updates efficiently without resorting to global rebuilds. Its primary contributions are:
  - A novel incremental update protocol, LIRE (Lightweight Incremental REbalancing): Instead of global rebuilds, LIRE performs local, on-the-fly maintenance. When a data partition grows too large from insertions, it is split. LIRE then intelligently reassigns a minimal set of vectors at the boundaries of the affected and neighboring partitions to maintain the overall quality and balance of the index.
  - A complete, high-performance system design: SPFresh is architected as a two-stage pipeline with a foreground Updater for fast insertions and a background Local Rebuilder that executes the LIRE protocol. This decouples the expensive rebalancing work from the live update path.
  - A specialized storage engine: The system includes a custom Block Controller built on SPDK that bypasses the operating system's storage stack for direct, low-latency SSD access. It is optimized for the append-only and bulk-read patterns of vector partitions.
  - Demonstrated superiority: Experimental results show that SPFresh provides stable, low search latency and high accuracy under continuous updates, outperforming state-of-the-art systems that rely on global rebuilds. Crucially, it achieves this with a tiny fraction of the resource cost: requiring 99% less peak memory and over 90% fewer CPU cores than a leading system like DiskANN during updates.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
  - Vector Embeddings: The process of converting complex data like images, text, or audio into numerical, high-dimensional vectors using deep learning models. The geometric distance between two vectors in this "embedding space" reflects their semantic similarity. As shown in Figure 1, images from a dataset are converted into vectors. To find images similar to a query image, the query is also converted into a vector, and the system searches for the nearest vectors in the dataset.
  - Approximate Nearest Neighbor Search (ANNS): Finding the exact nearest neighbors for a query vector in a high-dimensional space is computationally infeasible due to the "curse of dimensionality." ANNS algorithms trade perfect accuracy for a massive speedup by finding vectors that are probably among the true nearest neighbors.
  - Vector Index: A data structure that organizes vectors to accelerate ANNS. It creates a "navigation map" of the vector space. The paper categorizes them into two main types:
    - Graph-based (Fine-grained): Each vector is a node, and edges connect it to its nearest neighbors. A search traverses this graph. Examples include HNSW and DiskANN. These are difficult to update because adding or deleting a single vector can require changing many connections throughout the graph.
    - Cluster-based (Coarse-grained): Vectors are grouped into partitions or "clusters." A search first identifies the most promising clusters (by comparing the query to cluster centers, or centroids) and then performs a detailed search only on the vectors within those clusters. SPANN is a key example. These are easier to update locally but risk becoming imbalanced or inaccurate over time.
  - On-Disk Index: For billion-scale datasets, storing the entire index in memory (DRAM) is prohibitively expensive. On-disk indexes store the bulk of the data on cheaper secondary storage like SSDs, keeping only essential metadata in memory. This makes I/O efficiency a critical performance factor.
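To ground the ANNS idea, here is a tiny brute-force nearest-neighbor search in Python. This is purely illustrative (real systems replace the linear scan with a vector index such as the ones described above), and the toy 2-D data and function name are my own:

```python
import math

# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
dataset = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]

def nearest(query, vectors):
    """Exact nearest neighbor by linear scan: O(N) distance computations."""
    return min(vectors, key=lambda v: math.dist(query, v))

print(nearest((1.2, 0.9), dataset))  # the dataset vector closest to the query
```

ANNS indexes exist precisely because this exact scan becomes infeasible at a billion vectors per query.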
- Previous Works:
  - SPANN: The state-of-the-art disk-based, cluster-based index that SPFresh is built upon. SPANN's key innovation was a clustering algorithm that creates a large number of balanced partitions (called postings), ensuring that search latency is stable and predictable. However, SPANN was originally designed for static datasets and lacks an efficient update mechanism.
  - DiskANN: A state-of-the-art disk-based, graph-based index. It is known for high search performance but, like other graph-based methods, its updates rely on costly global rebuilds.
  - Out-of-Place Updates (LSM-style): Systems like Milvus and ADBV use this common strategy. New data is written to a separate, smaller in-memory index. Periodically, this "delta" index is merged with the main on-disk index by rebuilding everything. This causes two major problems: 1) search queries must check both indexes, increasing latency, and 2) the rebuild process is a "stop-the-world" event that consumes enormous resources, as quantified in Table 1.

    Manual transcription of Table 1: Global rebuild costs of disk-based ANNS indices for billion-scale datasets.

    | Index | Memory | CPU | Time |
    | --- | --- | --- | --- |
    | DiskANN | 1100 GB | 32 cores | 2 days |
    | | 64 GB | 16 cores | 5 days |
    | SPANN | 260 GB | 45 cores | 4 days |
  - Naive In-Place Updates: Systems like Vearch (and a modified SPANN shown in the paper's experiments) simply insert new vectors into the closest existing partition. This is fast initially but quickly degrades index quality. Partitions become imbalanced (hurting latency), and the fixed centroids no longer accurately represent the data within them (hurting accuracy). This is demonstrated in Figure 2, where naive updates cause recall to drop and tail latency to skyrocket by 4x.

    Figure 2: Two line charts comparing a static index against naive in-place updates: the left plot shows Recall@10 versus latency, and the right plot shows the tail-latency CDF, based on the SPANN index and the SIFT dataset.
- Differentiation: SPFresh carves a new path. Unlike DiskANN and Milvus, it avoids global rebuilds entirely. Unlike Vearch or a naive SPANN, it introduces an active, lightweight maintenance protocol (LIRE) that preserves index quality incrementally. It is the first system to offer true, efficient, in-place updates for billion-scale on-disk vector indexes.
4. Methodology (Core Technology & Implementation)
The core of SPFresh is the LIRE protocol, which is implemented within a carefully designed system architecture.
4.1 SPANN: The Foundation
SPFresh builds on SPANN, a balanced, cluster-based index. As shown in Figure 3, SPANN has two main parts:

- Posting Lists (On Disk): The entire vector dataset is divided into a large number of partitions, called postings. Each posting is a cluster of vectors that are close to each other. These are stored on an SSD.
- Centroid Index (In Memory): The centroid (average vector) of each posting is stored in memory. These centroids are organized into a graph-based index (SPTAG) to allow for quick identification of the most relevant postings for a given query.

The key property SPFresh inherits from SPANN is that the index is already in a well-partitioned and balanced state, which makes local, incremental changes feasible.

Figure 3: A diagram of the SPANN index structure, showing the in-memory posting centroids and the on-disk posting lists, and the basic relationship between index construction and storage.
4.2 LIRE: Lightweight Incremental RE-balancing Protocol

The goal of LIRE is to maintain the Nearest Partition Assignment (NPA) property: every vector should be stored in the posting whose centroid is closest to it. Naive updates violate this.
The Problem: NPA Violation on Split

When a posting receives too many new vectors, it must be split. LIRE splits it into two new postings, A1 and A2, with new centroids. As illustrated in Figure 4:

- A vector (yellow dot) that was in the original posting and is now assigned to A2 might actually be closer to the centroid of a neighboring posting. The NPA property is violated.
- A vector (green dot) in a neighboring posting might now be closer to the new centroid of A2 than to its own original centroid. The NPA property is violated again.

These violations degrade search accuracy because the centroids no longer faithfully represent the vectors in their postings.

LIRE Operations

To fix this, LIRE defines five operations: Insert, Delete, Split, Merge, and Reassign. The key innovation lies in Split and Reassign.
- Split: When a posting exceeds a size threshold, it is split into two balanced postings. This action triggers a Reassign job.
- Merge: When a posting becomes too small (e.g., due to deletions), it is merged with its nearest neighbor to prevent having many tiny, inefficient postings. This also triggers a Reassign.
- Reassign: This is the most critical operation. To avoid a costly global check, LIRE identifies a small set of candidate vectors for reassignment using two necessary conditions. For a posting with old centroid $o$ that splits into new postings with centroids $o_1$ and $o_2$ (writing $D(\cdot,\cdot)$ for the distance function):
  - A vector $v$ from the original split posting is a candidate for reassignment only if its distance to the old centroid was smaller than its distance to both new centroids, i.e., $D(v, o) < \min(D(v, o_1), D(v, o_2))$.
    - Explanation: If the new centroids $o_1$ and $o_2$ are actually worse (further away) for $v$ than the old one $o$, it raises the possibility that a completely different neighboring centroid (like the one in Figure 4) might now be the best fit.
  - A vector $v$ from a neighboring posting (with centroid $b$) is a candidate for reassignment only if a new centroid is closer to it than the old centroid was, i.e., $\min(D(v, o_1), D(v, o_2)) < D(v, o)$.
    - Explanation: Before the split, we knew $D(v, b) \le D(v, o)$ (due to the NPA property). If a new centroid $o_i$ is even closer than $o$ was (i.e., $D(v, o_i) < D(v, o)$), then it is possible that $o_i$ is also closer than $b$, i.e., $D(v, o_i) < D(v, b)$. This condition efficiently filters out all neighbors for which a reassignment is impossible. LIRE only applies these checks to a small number of the nearest neighboring postings, making the process lightweight.
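As a concrete illustration of the two candidate filters, here is a minimal Python sketch (my own, not the paper's code), assuming Euclidean distance; `split_candidates` and `neighbor_candidates` are hypothetical helper names:

```python
import math

def dist(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.dist(a, b)

def split_candidates(vectors, old_c, new_c1, new_c2):
    """Vectors from the split posting that may now belong elsewhere:
    the old centroid was closer than both new centroids."""
    return [v for v in vectors
            if dist(v, old_c) < min(dist(v, new_c1), dist(v, new_c2))]

def neighbor_candidates(vectors, old_c, new_c1, new_c2):
    """Vectors from a nearby posting that may need reassignment:
    some new centroid is closer than the old centroid was."""
    return [v for v in vectors
            if min(dist(v, new_c1), dist(v, new_c2)) < dist(v, old_c)]
```

Both conditions are necessary (not sufficient) tests; vectors that pass still need a final nearest-centroid check, but everything filtered out can safely stay put.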
Split-Reassign Convergence

The paper provides a proof that the split-reassign process, even with cascading splits, will always terminate. The logic is as follows:

- The state of the index is defined by its set of centroids $C$.
- A split operation always increases the number of centroids by one: $|C_{i+1}| = |C_i| + 1$ (one old centroid is removed, two new ones are added).
- The total number of centroids cannot exceed the total number of vectors in the dataset: $|C| \le N$.
- Since the number of centroids increases with each split and is bounded by a finite number, the sequence of splits must be finite.
4.3 SPFresh System Design and Implementation

The LIRE protocol is implemented within the SPFresh system, which is architected for high performance and concurrency.
Figure 5: The SPFresh system architecture, showing the interactions between the in-memory SPTAG index module, the updater, the local rebuilder, the block controller, and the underlying storage, and illustrating the local-rebuild workflow and the data read/write paths.
As shown in Figure 5, the architecture has three key new components:
- In-place Updater: A lightweight module that handles incoming insert and delete requests. It appends new vectors to the appropriate posting and uses a version map to manage deletions via tombstones. When a posting grows too large, it submits a split job to the Local Rebuilder. This keeps the foreground update path extremely fast.
- Local Rebuilder: A background, multi-threaded component that processes a queue of split, merge, and reassign jobs. This is where the LIRE protocol is executed. By running in the background, it does not block incoming searches or updates. Concurrency is managed with fine-grained locks at the posting level and atomic operations (compare-and-swap) on the version map.
- Block Controller: A custom user-space storage engine built on SPDK (Storage Performance Development Kit). This is a crucial engineering detail for performance.

  Figure 6: The SPFresh storage structure, showing how the in-memory block controller maps and updates data blocks in storage, coordinating the block mapping, the free block pool, and the concurrent I/O request queues.
  - Direct SSD Access: As seen in Figure 6, it bypasses the kernel's file system to issue I/O commands directly to the NVMe SSD, minimizing overhead and latency.
  - Data Layout: It maintains an in-memory Block Mapping from posting ID to a list of SSD block locations. Free blocks are managed in a Free Block Pool.
  - Optimized APPEND: To add a vector to a posting, it only performs a read-modify-write on the last block of that posting, rather than rewriting the entire multi-block file. This dramatically reduces write amplification.

  Crash Recovery: SPFresh uses a standard snapshot and Write-Ahead Log (WAL) approach. Snapshots of the in-memory data structures are taken periodically. A copy-on-write mechanism at the block level ensures that old data blocks are not freed until a new snapshot is complete, allowing for consistent recovery.
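A toy model of the append path may help; this sketch (entirely my own, not SPFresh code) shows why appending touches only the last, partially filled block of a posting, keeping write amplification at one block write per insert:

```python
BLOCK_SIZE = 4  # vectors per SSD block (tiny, for illustration only)

class BlockStore:
    """Toy model of the Block Controller's append path."""
    def __init__(self):
        self.blocks = {}    # block id -> list of vectors
        self.mapping = {}   # posting id -> list of block ids (Block Mapping)
        self.next_block = 0
        self.writes = 0     # blocks written: a write-amplification proxy

    def append(self, posting_id, vector):
        ids = self.mapping.setdefault(posting_id, [])
        if ids and len(self.blocks[ids[-1]]) < BLOCK_SIZE:
            # Read-modify-write of only the last block of the posting.
            self.blocks[ids[-1]].append(vector)
        else:
            # Last block is full (or posting is new): take a fresh block,
            # standing in for an allocation from the Free Block Pool.
            self.blocks[self.next_block] = [vector]
            ids.append(self.next_block)
            self.next_block += 1
        self.writes += 1  # exactly one block write per append
```

A naive file-based layout would rewrite every block of the posting on each append; here the write cost stays constant regardless of posting size.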
5. Experimental Setup
- Platform: Experiments were run on a storage-optimized Azure VM with 16 vCPUs and an attached high-performance NVMe SSD.
- Datasets: The primary dataset was SIFT1B, a standard benchmark containing one billion 128-dimensional vectors. Another dataset, SPACEV, with a shifting data distribution, was also used (details not in the provided text).
- Evaluation Metrics:
  - RecallK@K: Measures search accuracy.
    - Conceptual Definition: For a K-nearest-neighbor query, it measures the fraction of the true K nearest neighbors that were successfully found by the approximate algorithm. A higher recall means higher accuracy.
    - Mathematical Formula: $\text{Recall } K@K = \frac{|R \cap G|}{|G|}$
    - Symbol Explanation:
      - $R$: The set of results returned by the ANNS algorithm.
      - $G$: The ground-truth set of the true K nearest neighbors.
      - $|\cdot|$: The number of elements in a set. Here, $|G| = K$.
- Search Latency: The time taken to complete a query, measured in milliseconds (ms). The paper focuses on tail latency (e.g., 99th percentile), which is critical for user-facing online services.
- Throughput: The rate of operations, measured in Queries Per Second (QPS) for search and insertions per second for updates.
- Resource Usage: Peak memory (DRAM) in GB and CPU cores used during the update process.
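The RecallK@K metric defined above is straightforward to compute; a minimal Python sketch (the function name is my own):

```python
def recall_at_k(returned, ground_truth):
    """RecallK@K = |R ∩ G| / |G|, where |G| = K."""
    r, g = set(returned), set(ground_truth)
    return len(r & g) / len(g)

# Example: the algorithm found 9 of the true 10 nearest neighbors.
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 42],
                  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 0.9
```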
- Baselines: SPFresh is primarily compared against DiskANN and a version of SPANN that uses global rebuilds (SPANN+), representing the state-of-the-art for graph-based and cluster-based indexes, respectively.
6. Results & Analysis
The experiments comprehensively validate SPFresh's claims of superior performance and efficiency.
- Overall Performance (Figure 7): This is the key result, showing performance over an extended period of updates.
  - Latency & Accuracy: SPFresh maintains consistently low tail latency and high, stable recall. In contrast, DiskANN and SPANN+ exhibit periodic spikes in latency and drops in accuracy during their global rebuild phases (the graph for DiskANN shows its latency and memory steadily increasing as updates accumulate before a rebuild).
  - Resource Usage: The most dramatic result is in memory consumption. SPFresh operates with a stable, low memory footprint (around 10 GB). DiskANN, on the other hand, requires a massive memory allocation (over 1000 GB) for its global rebuild process. SPFresh achieves a 2.4x lower tail latency with less than 1% of the peak memory and 10% of the CPU cores needed by DiskANN.

  Figure 7: SPFresh's overall performance in insertion throughput, accuracy, memory usage, and search latency at different percentiles, comparing SPFresh, DiskANN, and SPANN+ over time and highlighting SPFresh's advantages in performance and resource usage.
- Ablation Study on LIRE (Figure 10): This study justifies the design of the LIRE protocol.
  - In-place update only (like Vearch) shows a rapid degradation of the accuracy-latency trade-off.
  - Adding split helps manage partition size but doesn't fix the NPA violations, so the curve is still poor.
  - Only with the full LIRE protocol (split + reassign) does the system maintain a high-quality index, achieving a performance curve close to a statically built index.

  Figure 10: The accuracy-latency trade-off (Recall 10@10 vs. latency) of different update techniques as the data distribution becomes increasingly skewed, comparing a static index with three in-place schemes: update only, update + split, and update + split + reassign.
- Parameter Sensitivity of Reassignment (Figure 11): This experiment explores how many neighboring postings need to be checked during the reassign step. The results show that checking just the nearest 64 neighbors (top64) is sufficient to achieve nearly all the possible recall improvement. Checking more neighbors offers negligible benefit while increasing cost. This confirms that the reassign process can be kept "lightweight".

  Figure 11: The effect of different reassignment-range parameters on Recall 10@10 and latency (ms), showing that a top64 (nearest 64) reassignment scan is sufficient.
- System Scalability:
  - Search Throughput (Figure 8): As the number of search threads increases, the search QPS scales linearly until it hits the maximum IOPS (I/O Operations Per Second) of the SSD. This demonstrates that SPFresh is highly efficient and its performance is bound by the underlying hardware, not software bottlenecks.

    Figure 8: The relationship between the number of SPFresh search threads, search throughput, and disk IOPS on an Azure lsv instance. The x-axis is the thread count; the left y-axis is queries per second (QPS, bars) and the right y-axis is IOPS (line).

  - Update Throughput (Figure 12): The system's throughput scales well with both the number of foreground (updater) threads and background (rebuilder) threads, showing that the pipelined architecture is effective at handling high update loads.

    Figure 12: Bar charts showing how the number of foreground and background update threads affects throughput. The left panel fixes the background thread count at 1 and varies the foreground threads; the right panel fixes the foreground threads at 8 and varies the background threads.

  - Stress Test (Figure 9): Under a heavy, continuous update workload, SPFresh maintains stable and low latency for both search and insert operations. Its performance remains robust even on skewed datasets, proving the effectiveness of the dynamic rebalancing.

    Figure 9: Latency, throughput, and IOPS for SPFresh under stress tests on a uniform dataset (SIFT) and a skewed dataset (SPACEV), comparing search and insert performance across day-sized batches.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully presents SPFresh, a novel system that solves a critical and previously open problem: efficient, incremental updates for billion-scale on-disk vector indexes. Its core contribution, the LIRE protocol, replaces the disruptive and costly "global rebuild" paradigm with a lightweight, local rebalancing mechanism. By doing so, SPFresh delivers stable, high-quality search performance (low latency, high accuracy) under continuous data ingestion, all while consuming drastically fewer computational resources than existing state-of-the-art systems.
- Limitations & Future Work: (Based on the provided text and general domain knowledge)
  - The paper focuses on insert-heavy workloads. The dynamics of merge operations triggered by heavy deletions are less explored. A mix of frequent splits and merges could introduce more complex concurrency scenarios.
  - The LIRE protocol's necessary conditions and opportunistic checks are shown to work well empirically, but there might be theoretical edge cases or adversarial data distributions where index quality could slowly degrade over very long periods.
  - The system is highly optimized for a specific hardware setup (NVMe SSDs via SPDK). Its performance characteristics on other storage media (e.g., cloud object storage, HDDs) would be different and might require architectural changes.
  - The current design is for a single-node system. Extending the LIRE protocol to a distributed setting would be a significant and challenging next step.
- Personal Insights & Critique:
  - SPFresh is a landmark paper in the field of vector search systems. Its central idea, localizing the cost of updates, is powerful and elegant. It draws a compelling parallel to the principles of log-structured merge-trees (LSM-trees), where small, frequent compactions are preferred over rewriting entire data levels.
  - The work is a masterclass in systems design. It combines a clever algorithm (LIRE) with deep, practical systems engineering (the SPDK-based Block Controller, the pipelined architecture). The custom storage layer is a key enabler; without its I/O efficiency, the frequent local rebuilds might not have been fast enough.
  - The impact of this work is immediate and practical. For any company running large-scale ANNS in production (e.g., for search, ads, or recommendation), SPFresh offers a path to dramatically reduce operational costs, improve service stability, and deliver fresher results to users. It effectively turns vector databases from periodically updated warehouses into real-time, dynamic systems.