
SPFresh: Incremental In-Place Update for Billion-Scale Vector Search

Keywords: Incremental Vector Index Update, Billion-Scale Vector Search System, Approximate Nearest Neighbor Search Optimization, Lightweight Incremental Rebalancing Protocol, Memory and Compute Resource Efficiency
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

SPFresh introduces LIRE, a lightweight incremental rebalancing protocol enabling in-place updates for billion-scale vector search, significantly reducing update latency and resource use while improving accuracy compared to costly global rebuild methods.

Abstract

Approximate Nearest Neighbor Search (ANNS) is now widely used in various applications, ranging from information retrieval, question answering, and recommendation, to searching for similar high-dimensional vectors. As the amount of vector data grows continuously, it becomes important to support updates to the vector index, the enabling technique for efficient and accurate ANNS on vectors. Because of the curse of high dimensionality, it is often costly to identify the right neighbors of a single new vector, a necessary process for index update. To amortize update costs, existing systems maintain a secondary index to accumulate updates, which are merged into the main index by periodically rebuilding the entire index. However, this approach causes high fluctuations in search latency and accuracy, not to mention that rebuilds require substantial resources and are extremely time-consuming. We introduce SPFresh, a system that supports in-place vector updates. At the heart of SPFresh is LIRE, a lightweight incremental rebalancing protocol that splits vector partitions and reassigns vectors in nearby partitions to adapt to data distribution shifts. LIRE achieves low-overhead vector updates by reassigning only vectors at the boundaries between partitions, where in a high-quality vector index the number of such vectors is small. With LIRE, SPFresh provides superior query latency and accuracy compared to solutions based on global rebuild, with only 1% of the DRAM and less than 10% of the cores needed at peak by the state of the art, on a billion-scale vector index with a 1% daily vector update rate.

English Analysis

1. Bibliographic Information

  • Title: SPFresh: Incremental In-Place Update for Billion-Scale Vector Search
  • Authors: Yuming Xu, Hengyu Liang, Jin Li, Shuotao Xu, Qi Chen, Qianxi Zhang, Cheng Li, Ziyue Yang, Fan Yang, Yuqing Yang, Peng Cheng, Mao Yang
  • Affiliations: The authors are affiliated with the University of Science and Technology of China, Microsoft Research Asia, and Harvard University. This indicates a collaboration between a leading academic institution and a top-tier industrial research lab, suggesting a blend of rigorous academic research and practical, large-scale systems expertise.
  • Journal/Conference: The paper was published in the ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP '23). SOSP is one of the most prestigious and competitive conferences in the field of computer systems, known for publishing foundational work in operating systems, distributed systems, and storage. Publication at SOSP signifies a high level of novelty, technical depth, and impact.
  • Publication Year: 2023
  • Abstract: The paper addresses the challenge of updating large-scale vector indexes, which are crucial for Approximate Nearest Neighbor Search (ANNS). Existing systems often accumulate updates and perform periodic, resource-intensive global rebuilds of the entire index, leading to fluctuating search performance and high operational costs. The authors introduce SPFresh, a system that enables efficient, incremental "in-place" updates. The core of SPFresh is LIRE, a lightweight protocol that locally rebalances vector partitions by splitting them and reassigning only a small number of vectors near the partition boundaries. This approach adapts to data distribution shifts without global rebuilds. The abstract claims that for a billion-scale index with a 1% daily update rate, SPFresh provides superior and more stable query performance than state-of-the-art solutions, while using only 1% of the peak DRAM and less than 10% of the peak CPU cores.
  • Original Source Link:

2. Executive Summary

  • Background & Motivation (Why): Modern AI applications, from recommendation systems to generative AI, rely on searching for similar items within massive datasets of high-dimensional vectors. This is accomplished using Approximate Nearest Neighbor Search (ANNS), which depends on a specialized data structure called a vector index. As data is continuously generated (e.g., new images, user interactions, documents), these indexes must be updated frequently to provide "fresh" results. The dominant industry practice is to batch updates and periodically rebuild the entire index from scratch. This "global rebuild" approach is extremely problematic: it consumes massive computational resources (CPU and memory), takes days to complete for billion-scale datasets, and causes significant fluctuations in search latency and accuracy, harming the user experience of live services. The core challenge is that updating a high-dimensional index structure is inherently complex and costly, a problem this paper aims to solve.

  • Main Contributions / Findings (What): The paper introduces SPFresh, a disk-based ANNS system designed to handle continuous updates efficiently without resorting to global rebuilds. Its primary contributions are:

    1. A novel incremental update protocol, LIRE (Lightweight Incremental REbalancing): Instead of global rebuilds, LIRE performs local, on-the-fly maintenance. When a data partition grows too large from insertions, it is split. LIRE then intelligently reassigns a minimal set of vectors at the boundaries of the affected and neighboring partitions to maintain the overall quality and balance of the index.
    2. A complete, high-performance system design: SPFresh is architected as a two-stage pipeline with a foreground Updater for fast insertions and a background Local Rebuilder that executes the LIRE protocol. This decouples the expensive rebalancing work from the live update path.
    3. A specialized storage engine: The system includes a custom Block Controller built on SPDK that bypasses the operating system's storage stack for direct, low-latency SSD access. It is optimized for the append-only and bulk-read patterns of vector partitions.
    4. Demonstrated Superiority: Experimental results show that SPFresh provides stable, low search latency and high accuracy under continuous updates, outperforming state-of-the-art systems that rely on global rebuilds. Crucially, it achieves this with a tiny fraction of the resource cost—requiring 99% less peak memory and over 90% fewer CPU cores than a leading system like DiskANN during updates.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Vector Embeddings: The process of converting complex data like images, text, or audio into numerical, high-dimensional vectors using deep learning models. The geometric distance between two vectors in this "embedding space" reflects their semantic similarity.

      Figure 1. An example of vector search: to find the most similar images in an image dataset. Query images and data images are all represented as vectors.

      As shown in Figure 1, images from a dataset are converted into vectors. To find images similar to a query image, the query is also converted into a vector, and the system searches for the nearest vectors in the dataset.

    • Approximate Nearest Neighbor Search (ANNS): Finding the exact nearest neighbors for a query vector in a high-dimensional space is computationally infeasible due to the "curse of dimensionality." ANNS algorithms trade perfect accuracy for a massive speedup by finding vectors that are probably among the true nearest neighbors.

    • Vector Index: A data structure that organizes vectors to accelerate ANNS. It creates a "navigation map" of the vector space. The paper categorizes them into two main types:

      • Graph-based (Fine-grained): Each vector is a node, and edges connect it to its nearest neighbors. A search traverses this graph. Examples include HNSW and DiskANN. These are difficult to update because adding or deleting a single vector can require changing many connections throughout the graph.
      • Cluster-based (Coarse-grained): Vectors are grouped into partitions or "clusters." A search first identifies the most promising clusters (by comparing the query to cluster centers, or centroids) and then performs a detailed search only on the vectors within those clusters. SPANN is a key example. These are easier to update locally but risk becoming imbalanced or inaccurate over time.
    • On-Disk Index: For billion-scale datasets, storing the entire index in memory (DRAM) is prohibitively expensive. On-disk indexes store the bulk of the data on cheaper secondary storage like SSDs, keeping only essential metadata in memory. This makes I/O efficiency a critical performance factor.

  • Previous Works:

    • SPANN: The state-of-the-art disk-based, cluster-based index that SPFresh is built upon. SPANN's key innovation was a clustering algorithm that creates a large number of balanced partitions (called postings), ensuring that search latency is stable and predictable. However, SPANN was originally designed for static datasets and lacks an efficient update mechanism.

    • DiskANN: A state-of-the-art disk-based, graph-based index. It is known for high search performance but, like other graph-based methods, its updates rely on costly global rebuilds.

    • Out-of-Place Updates (LSM-style): Systems like Milvus and ADBV use this common strategy. New data is written to a separate, smaller in-memory index. Periodically, this "delta" index is merged with the main on-disk index by rebuilding everything. This causes two major problems: 1) search queries must check both indexes, increasing latency, and 2) the rebuild process is a "stop-the-world" event that consumes enormous resources, as quantified in Table 1.

      Manual Transcription of Table 1: Global rebuild costs of disk-based ANNS indices for billion-scale datasets.

      | System  | Memory  | CPU      | Time   |
      |---------|---------|----------|--------|
      | DiskANN | 1100 GB | 32 cores | 2 days |
      |         | 64 GB   | 16 cores | 5 days |
      | SPANN   | 260 GB  | 45 cores | 4 days |
    • Naive In-Place Updates: Systems like Vearch (and a modified SPANN shown in the paper's experiments) simply insert new vectors into the closest existing partition. This is fast initially but quickly degrades index quality. Partitions become imbalanced (hurting latency), and the fixed centroids no longer accurately represent the data within them (hurting accuracy). This is demonstrated in Figure 2, where naive updates cause recall to drop and tail latency to skyrocket by 4x.

      Figure 2. Recall and tail latency in two system settings, namely, static and in-place update. The static setting refers to an index of 2 million vectors, while the in-place update setting refers to a… (Two line charts: left, Recall@10 vs. latency; right, the latency CDF; measured with a SPANN index on the SIFT dataset.)

  • Differentiation: SPFresh carves a new path. Unlike DiskANN and Milvus, it avoids global rebuilds entirely. Unlike Vearch or a naive SPANN, it introduces an active, lightweight maintenance protocol (LIRE) that preserves index quality incrementally. It is the first system to offer true, efficient, in-place updates for billion-scale on-disk vector indexes.

4. Methodology (Core Technology & Implementation)

The core of SPFresh is the LIRE protocol, which is implemented within a carefully designed system architecture.

4.1 SPANN: The Foundation

SPFresh builds on SPANN, a balanced, cluster-based index. As shown in Figure 3, SPANN has two main parts:

  1. Posting Lists (On Disk): The entire vector dataset is divided into a large number of partitions, called postings. Each posting is a cluster of vectors that are close to each other. These are stored on an SSD.

  2. Centroid Index (In Memory): The centroid (average vector) of each posting is stored in memory. These centroids are organized into a graph-based index (SPTAG) to allow for quick identification of the most relevant postings for a given query.

    The key property SPFresh inherits from SPANN is that the index is already in a well-partitioned and balanced state, which makes local, incremental changes feasible.

    Figure 3. SPANN index data architecture. 该图像是示意图,展示了SPANN索引的数据结构,包括内存中的Posting Centroids和磁盘上的Posting Lists,表示索引构建和存储的基本关系。
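To make the two-level layout concrete, here is a minimal Python sketch (illustrative names, not the authors' code) of a cluster-based lookup: pick the closest centroids in memory, then scan only those postings on disk. A real SPANN deployment navigates an SPTAG graph over the centroids instead of the brute-force scan used here.

```python
from math import dist

def spann_search(query, centroids, postings, n_probe=4, k=10):
    """Cluster-based ANNS sketch: probe only the postings whose
    centroids are closest to the query, then rank their vectors.
    centroids: {posting_id: centroid}; postings: {posting_id: [(vec_id, vec)]}."""
    # Stage 1 (in memory): pick the n_probe nearest centroids.
    # (SPANN uses an SPTAG graph index here; brute force keeps the sketch short.)
    probe_ids = sorted(centroids, key=lambda pid: dist(query, centroids[pid]))[:n_probe]

    # Stage 2 (on disk): scan only the selected postings.
    candidates = [(vid, vec) for pid in probe_ids for vid, vec in postings[pid]]
    candidates.sort(key=lambda item: dist(query, item[1]))
    return [vid for vid, _ in candidates[:k]]
```

The `n_probe` knob mirrors the usual cluster-index trade-off: probing more postings raises recall at the cost of more disk reads.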

4.2 LIRE: Lightweight Incremental REbalancing Protocol

The goal of LIRE is to maintain the Nearest Partition Assignment (NPA) property: every vector should be stored in the posting whose centroid is closest to it. Naive updates violate this.
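The NPA property is easy to state as code. This hypothetical helper (names are illustrative) defines a vector's target posting as the one with the nearest centroid; any stored assignment that disagrees is a violation.

```python
from math import dist

def nearest_partition(vec, centroids):
    """Return the id of the centroid closest to vec (its NPA target)."""
    return min(centroids, key=lambda cid: dist(vec, centroids[cid]))

def npa_violations(assignment, vectors, centroids):
    """Ids of vectors stored in a posting other than their nearest one."""
    return [vid for vid, pid in assignment.items()
            if nearest_partition(vectors[vid], centroids) != pid]
```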

The Problem: NPA Violation on Split

When a posting A receives too many new vectors, it must be split. LIRE splits it into two new postings, A_1 and A_2, with new centroids. As illustrated in Figure 4:

  • A vector (yellow dot) that was in A and is now assigned to A_2 might actually be closer to the centroid of a neighboring posting B. The NPA property is violated.

  • A vector (green dot) in posting B might now be closer to the new centroid of A_2 than to B, its original centroid. The NPA property is violated again.

    These violations degrade search accuracy because the centroids no longer faithfully represent the vectors in their postings.

    Figure 4. Posting split violates the NPA property.

    LIRE Operations

To fix this, LIRE defines five operations: Insert, Delete, Split, Merge, and Reassign. The key innovation lies in Split and Reassign.

  • Split: When a posting exceeds a size threshold, it's split into two balanced postings. This action triggers a Reassign job.

  • Merge: When a posting becomes too small (e.g., due to deletions), it is merged with its nearest neighbor to prevent having many tiny, inefficient postings. This also triggers a Reassign.

  • Reassign: This is the most critical operation. To avoid a costly global check, LIRE identifies a small set of candidate vectors for reassignment using two necessary conditions. For a posting with old centroid A_o that splits into new postings with centroids A_1 and A_2:

    1. A vector v from the original split posting is a candidate for reassignment only if its distance to the old centroid is smaller than its distance to both new centroids: $D(v, A_o) \leq D(v, A_i), \forall i \in \{1, 2\}$

      • Explanation: If the new centroids A_1 and A_2 are actually worse (farther away) for v than the old one A_o, it raises the possibility that a completely different neighboring centroid (like B in Figure 4) might now be the best fit.
    2. A vector v from a neighboring posting (with centroid B) is a candidate for reassignment only if some new centroid A_i is closer to it than the old centroid A_o was: $D(v, A_i) \leq D(v, A_o), \exists i \in \{1, 2\}$

      • Explanation: Before the split, we knew $D(v, B) \leq D(v, A_o)$ (due to the NPA property). If a new centroid A_i is even closer than A_o was (i.e., $D(v, A_i) < D(v, A_o)$), then it is possible that A_i is also closer than B, i.e., $D(v, A_i) < D(v, B)$. This condition efficiently filters out all neighbors for which a reassignment is impossible.

        LIRE only applies these checks to a small number of the nearest neighboring postings, making the process lightweight.
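The two necessary conditions can be sketched as a candidate filter. This is a toy model with illustrative names (real distances are over high-dimensional vectors, and LIRE applies it only to the nearest neighboring postings):

```python
from math import dist

def reassign_candidates(a_old, a_new, split_vectors, neighbor_postings):
    """Shortlist vectors whose nearest-partition assignment may have
    changed after a split, using LIRE's two necessary conditions.
    a_old: centroid of the split posting; a_new: the two new centroids.
    neighbor_postings: list of (centroid, vectors) for nearby postings."""
    candidates = []

    # Condition 1: a vector from the split posting needs a recheck only
    # if BOTH new centroids are farther from it than the old one was --
    # then some neighboring centroid B may now be its nearest.
    for v in split_vectors:
        if all(dist(v, a_old) <= dist(v, a_i) for a_i in a_new):
            candidates.append(v)

    # Condition 2: a neighbor's vector needs a recheck only if SOME new
    # centroid is closer to it than the old centroid was -- otherwise
    # D(v, B) <= D(v, A_o) <= D(v, A_i) rules out reassignment.
    for _b, vecs in neighbor_postings:
        for v in vecs:
            if any(dist(v, a_i) <= dist(v, a_old) for a_i in a_new):
                candidates.append(v)
    return candidates
```

Vectors that fail both conditions are provably still in their nearest posting, so only the shortlist needs an actual nearest-centroid search.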

Split-Reassign Convergence

The paper provides a proof that the split-reassign process, even with cascading splits, will always terminate. The logic is as follows:

  1. The state of the index is defined by its set of centroids C.
  2. A split operation always increases the number of centroids by one: $|C_{i+1}| = |C_i| + 1$ (one old centroid is removed, two new ones are added).
  3. The total number of centroids cannot exceed the total number of vectors in the dataset: $|C| \leq |V|$.
  4. Since the number of centroids increases with each split and is bounded by a finite number, the sequence of splits must be finite.
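A toy simulation illustrates the argument: each split adds exactly one posting, and postings can never outnumber vectors, so cascading splits always stop. (This sketch splits lists in half rather than re-clustering, purely for illustration.)

```python
def run_splits(postings, max_len):
    """Toy model of cascading splits: while any posting exceeds max_len,
    split it in half. Each split replaces one posting with two, so the
    posting count grows by exactly one per split and is bounded by the
    total number of vectors -- hence the loop must terminate."""
    splits = 0
    while True:
        oversized = next((p for p in postings if len(p) > max_len), None)
        if oversized is None:
            return postings, splits
        postings.remove(oversized)
        mid = len(oversized) // 2
        postings += [oversized[:mid], oversized[mid:]]
        splits += 1
```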

4.3 SPFresh System Design and Implementation

The LIRE protocol is implemented within the SPFresh system, which is architected for high performance and concurrency.

Figure 5. SPFresh architecture (LR means Local Rebuild): the in-memory SPTAG index module interacts with the Updater, the Local Rebuilder, and the Block Controller over the underlying storage, showing the local-rebuild workflow and the data read/write paths.

As shown in Figure 5, the architecture has three key new components:

  • In-place Updater: A lightweight module that handles incoming insert and delete requests. It appends new vectors to the appropriate posting and uses a version map to manage deletions via tombstones. When a posting grows too large, it submits a split job to the Local Rebuilder. This keeps the foreground update path extremely fast.

  • Local Rebuilder: A background, multi-threaded component that processes a queue of split, merge, and reassign jobs. This is where the LIRE protocol is executed. By running in the background, it does not block incoming searches or updates. Concurrency is managed with fine-grained locks at the posting level and atomic operations (compare-and-swap) on the version map.

  • Block Controller: A custom user-space storage engine built on SPDK (Storage Performance Development Kit). This is a crucial engineering detail for performance.

    Figure 6. SPFresh storage overview: the in-memory Block Controller maps postings to data blocks on storage, coordinating the block mapping, the free block pool, and the concurrent I/O request queues.

    • Direct SSD Access: As seen in Figure 6, it bypasses the kernel's file system to issue I/O commands directly to the NVMe SSD, minimizing overhead and latency.

    • Data Layout: It maintains an in-memory Block Mapping from posting ID to a list of SSD block locations. Free blocks are managed in a Free Block Pool.

    • Optimized APPEND: To add a vector to a posting, it only performs a read-modify-write on the last block of that posting, rather than rewriting the entire multi-block file. This dramatically reduces write amplification.

      Crash Recovery: SPFresh uses a standard snapshot and Write-Ahead Log (WAL) approach. Snapshots of the in-memory data structures are taken periodically. A copy-on-write mechanism at the block level ensures that old data blocks are not freed until a new snapshot is complete, allowing for consistent recovery.
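The append path can be sketched as follows (a toy in-memory model with illustrative names, standing in for SPDK and real SSD blocks): only the last block of a posting's chain is touched per append, which is what keeps write amplification low.

```python
BLOCK_SIZE = 4  # vectors per block (illustrative; real blocks are SSD pages)

class BlockController:
    """Toy model of the optimized APPEND: each append performs a
    read-modify-write on the posting's last block only, instead of
    rewriting the whole multi-block posting."""
    def __init__(self):
        self.blocks = {}                  # block id -> list of vectors ("SSD")
        self.mapping = {}                 # posting id -> list of block ids
        self.free_pool = list(range(1000))  # free block pool
        self.rewrites = 0                 # blocks written back, per append

    def append(self, posting_id, vec):
        chain = self.mapping.setdefault(posting_id, [])
        if chain and len(self.blocks[chain[-1]]) < BLOCK_SIZE:
            self.blocks[chain[-1]].append(vec)  # modify last block in place
        else:
            bid = self.free_pool.pop(0)         # allocate a fresh block
            self.blocks[bid] = [vec]
            chain.append(bid)
        self.rewrites += 1                      # exactly one block touched

    def read_posting(self, posting_id):
        """Bulk read: concatenate the posting's blocks in chain order."""
        return [v for bid in self.mapping.get(posting_id, [])
                for v in self.blocks[bid]]
```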

5. Experimental Setup

  • Platform: Experiments were run on a storage-optimized Azure VM with 16 vCPUs and an attached high-performance NVMe SSD.

  • Datasets: The primary dataset was SIFT1B, a standard benchmark containing one billion 128-dimensional vectors. Another dataset, SPACEV, whose data distribution shifts over time, was also used (details are not in the provided text).

  • Evaluation Metrics:

    • RecallK@K: Measures search accuracy.
      1. Conceptual Definition: For a K-Nearest Neighbor query, it measures the fraction of the true K nearest neighbors that were successfully found by the approximate algorithm. A higher recall means higher accuracy.
      2. Mathematical Formula: $\mathrm{Recall}K@K = \frac{|Y \cap G|}{|G|}$
      3. Symbol Explanation:
        • Y: The set of K results returned by the ANNS algorithm.
        • G: The ground-truth set of the true K nearest neighbors.
        • $|\cdot|$: The number of elements in a set. Here, $|Y| = |G| = K$.
    • Search Latency: The time taken to complete a query, measured in milliseconds (ms). The paper focuses on tail latency (e.g., 99th percentile), which is critical for user-facing online services.
    • Throughput: The rate of operations, measured in Queries Per Second (QPS) for search and insertions per second for updates.
    • Resource Usage: Peak memory (DRAM) in GB and CPU cores used during the update process.
  • Baselines: SPFresh is primarily compared against DiskANN and a version of SPANN that uses global rebuilds (SPANN+), representing the state of the art for graph-based and cluster-based indexes, respectively.
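The RecallK@K metric can be computed directly; this small helper (illustrative, not from the paper's code) mirrors the definition above:

```python
def recall_k_at_k(returned, ground_truth):
    """RecallK@K = |Y ∩ G| / |G| for a single K-NN query, where
    returned (Y) and ground_truth (G) each hold K result ids."""
    y, g = set(returned), set(ground_truth)
    assert len(y) == len(g)  # both contain exactly K ids
    return len(y & g) / len(g)
```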

6. Results & Analysis

The experiments comprehensively validate SPFresh's claims of superior performance and efficiency.

  • Overall Performance (Figure 7): This is the key result, showing performance over an extended period of updates.

    • Latency & Accuracy: SPFresh maintains consistently low tail latency and high, stable recall. In contrast, DiskANN and SPANN+ exhibit periodic latency spikes and accuracy drops during their global rebuild phases (the graphs show DiskANN's latency and memory steadily increasing as updates accumulate before a rebuild).

    • Resource Usage: The most dramatic result is in memory consumption. SPFresh operates with a stable, low memory footprint (around 10GB). DiskANN, on the other hand, requires a massive memory allocation (over 1000GB) for its global rebuild process. SPFresh achieves a 2.4x lower tail latency with less than 1% of the peak memory and 10% of the CPU cores needed by DiskANN.

      Figure 7. SPFresh overall performance (SPACEV: data distribution shifts over time). The figure compares insertion throughput, accuracy, memory usage, and search latency at various percentiles over time for SPFresh, DiskANN, and SPANN+.

  • Ablation Study on LIRE (Figure 10): This study justifies the design of the LIRE protocol.

    • In-place update only (like Vearch) shows a rapid degradation of the accuracy-latency trade-off.

    • Adding split helps manage partition size but doesn't fix the NPA violations, so the curve is still poor.

    • Only with the full LIRE protocol (split + reassign) does the system maintain a high-quality index, achieving a performance curve close to a statically built index.

      Figure 10. The trade-off between search accuracy (Recall10@10) and latency with various update techniques under an increasingly skewed data distribution, comparing a static index against three in-place schemes: update only, update + split, and update + split + reassign.

  • Parameter Sensitivity of Reassignment (Figure 11): This experiment explores how many neighboring postings need to be checked during the reassign step. The results show that checking just the nearest 64 neighbors (top64) is sufficient to achieve nearly all the possible recall improvement. Checking more neighbors offers negligible benefits while increasing cost. This confirms that the reassign process can be kept "lightweight".

    Figure 11. Parameter study: reassign range. Scanning the top64 (nearest 64) postings is enough for reassignment; the charts show the effect of the reassign-range parameter on Recall10@10 and latency (ms).

  • System Scalability:

    • Search Throughput (Figure 8): As the number of search threads increases, the search QPS scales linearly until it hits the maximum IOPS (I/O Operations Per Second) of the SSD. This demonstrates that SPFresh is highly efficient and its performance is bound by the underlying hardware, not software bottlenecks.

      Figure 8. Search throughput / disk IOPS vs. the number of SPFresh search threads on an Azure lsv instance [41]. The x-axis is the thread count; the left y-axis is queries per second (QPS, bars) and the right y-axis is IOPS (line).

    • Update Throughput (Figure 12): The system's throughput scales well with both the number of foreground (updater) threads and background (rebuilder) threads, showing that the pipelined architecture is effective at handling high update loads.

      Figure 12. Foreground scalability (background thread number = 1) and background scalability (foreground thread number = 8). Left: throughput for different foreground thread counts with one background thread; right: throughput for different background thread counts with eight foreground threads.

    • Stress Test (Figure 9): Under a heavy, continuous update workload, SPFresh maintains stable and low latency for both search and insert operations. Its performance remains robust even on skewed datasets, proving the effectiveness of the dynamic rebalancing.

      Figure 9. SPFresh stress-test latency, throughput, and IOPS on a uniform dataset (SIFT) and a skewed dataset (SPACEV), comparing search and insert performance across update batches (days).

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully presents SPFresh, a novel system that solves a critical and previously open problem: efficient, incremental updates for billion-scale on-disk vector indexes. Its core contribution, the LIRE protocol, replaces the disruptive and costly "global rebuild" paradigm with a lightweight, local rebalancing mechanism. By doing so, SPFresh delivers stable, high-quality search performance (low latency, high accuracy) under continuous data ingestion, all while consuming drastically fewer computational resources than existing state-of-the-art systems.

  • Limitations & Future Work: (Based on the provided text and general domain knowledge)

    • The paper focuses on insert-heavy workloads. The dynamics of merge operations triggered by heavy deletions are less explored. A mix of frequent splits and merges could introduce more complex concurrency scenarios.
    • The LIRE protocol's necessary conditions and opportunistic checks are shown to work well empirically, but there might be theoretical edge cases or adversarial data distributions where index quality could slowly degrade over very long periods.
    • The system is highly optimized for a specific hardware setup (NVMe SSDs via SPDK). Its performance characteristics on other storage media (e.g., cloud object storage, HDDs) would be different and might require architectural changes.
    • The current design is for a single-node system. Extending the LIRE protocol to a distributed setting would be a significant and challenging next step.
  • Personal Insights & Critique:

    • SPFresh is a landmark paper in the field of vector search systems. Its central idea—localizing the cost of updates—is powerful and elegant. It draws a compelling parallel to the principles of log-structured merge-trees (LSM-trees), where small, frequent compactions are preferred over rewriting entire data levels.
    • The work is a masterclass in systems design. It combines a clever algorithm (LIRE) with deep, practical systems engineering (the SPDK-based Block Controller, the pipelined architecture). The custom storage layer is a key enabler; without its I/O efficiency, the frequent local rebuilds might not have been fast enough.
    • The impact of this work is immediate and practical. For any company running large-scale ANNS in production (e.g., for search, ads, or recommendation), SPFresh offers a path to dramatically reduce operational costs, improve service stability, and deliver fresher results to users. It effectively turns vector databases from periodically updated warehouses into real-time, dynamic systems.
