- Title: FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search
- Authors:
- Aditi Singh (Microsoft Research India)
- Suhas Jayaram Subramanya (Carnegie Mellon University, work done at Microsoft Research)
- Ravishankar Krishnaswamy (Microsoft Research India)
- Harsha Vardhan Simhadri (Microsoft Research India)
- Journal/Conference: The paper is available on arXiv as a preprint; it appears targeted at a top-tier conference in data management (e.g., VLDB, SIGMOD) or information retrieval (e.g., SIGIR).
- Publication Year: 2021
- Abstract: The paper addresses a major limitation of state-of-the-art graph-based Approximate Nearest Neighbor Search (ANNS) indices: they are static. While these indices excel at searching billion-point datasets on a single machine, they cannot handle real-time data updates (inserts and deletes), which are common in real-world applications like news or email search. The current industry solution is to periodically rebuild the entire index, which is extremely expensive. This paper introduces FreshDiskANN, the first graph-based ANNS system that supports real-time updates without sacrificing search performance. FreshDiskANN can index over a billion points on a commodity machine with an SSD, handle thousands of concurrent updates and searches per second, and maintain high accuracy (>95% recall). The authors claim this reduces the cost of maintaining index freshness by 5-10x compared to periodic rebuilding.
- Original Source Link:
2. Executive Summary
4. Methodology (Core Technology & Implementation)
The core of the paper is a set of algorithms that allow a graph-based index to be updated without degrading its quality, and a system architecture to apply these updates efficiently to a large, disk-based index.
4.1 The Problem: Why Simple Updates Fail
The authors first demonstrate that naive approaches to updating graph indices are flawed. They experiment with two deletion policies on popular indices (HNSW, NSG, Vamana):
- Delete Policy A: Simply remove the deleted node and all its incident edges.
- Delete Policy B: When a node p is deleted, try to "heal" the graph by connecting its in-neighbors to its out-neighbors.
As shown in Figure 1, both policies lead to a steady decline in recall over repeated cycles of deletions and insertions. This happens because these simple modifications break the carefully constructed "navigability" of the graph, making it harder for the greedy search to find the true nearest neighbors. The graph becomes too sparse or disconnected.
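Policy B's healing step can be sketched as follows (a minimal sketch over a dict-of-sets adjacency list; the representation, function name, and the naive degree-truncation rule are my own, not the paper's):

```python
def delete_policy_b(graph, p, max_degree):
    """Delete node p, 'healing' the graph by connecting each
    in-neighbor of p to p's out-neighbors (Delete Policy B)."""
    out_nbrs = graph.pop(p)
    for q, nbrs in graph.items():
        if p in nbrs:
            nbrs.discard(p)
            # redirect: q inherits p's out-neighbors
            nbrs.update(n for n in out_nbrs if n != q)
            # naive truncation if the degree bound is exceeded
            while len(nbrs) > max_degree:
                nbrs.pop()
```

Even with this healing step, repeated delete/insert rounds erode the graph's navigability, which is exactly the degradation Figure 1 reports.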
Figure 1: On the SIFT1M dataset, 5-recall@5 of statically built HNSW, Vamana, and NSG indices under Delete Policies A and B, over 20 rounds of deleting and re-inserting 5% batches of the data.
4.2 FreshVamana: An Updatable Graph Index
The key insight is to use a more robust graph structure that can tolerate updates. The authors build on Vamana, the graph algorithm from DiskANN, which uses a relaxed pruning rule.
- α-RNG Property: The core idea is the Relative Neighborhood Graph (RNG) property, relaxed by a parameter α > 1. During graph construction, an edge from node p to p'' is pruned (removed) only if there is another neighbor p' of p that is significantly closer to p'' than p is. Formally, the edge (p, p'') is removed if there exists a neighbor p' of p such that α·d(p', p'') ≤ d(p, p''). Using α > 1 (e.g., 1.2) creates denser graphs with more redundant paths, which makes them more robust to changes and ensures search paths converge quickly.
- RobustPrune (Algorithm 3): This is the central pruning procedure used in all updates. Given a node p and a set of candidate neighbors, it iteratively selects the closest remaining candidate, adds it to p's neighborhood, and then prunes other candidates that are now "reachable" through the newly added neighbor according to the α-RNG rule. This ensures the degree of each node stays bounded by R while maintaining the graph's navigability.
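The procedure can be sketched as follows (my own minimal in-memory version, not the paper's exact pseudocode; points are tuples and distance is Euclidean):

```python
import math

def robust_prune(p, candidates, alpha, R, dist=math.dist):
    """Select at most R out-neighbors for point p from `candidates`
    using the alpha-RNG rule (sketch of Algorithm 3)."""
    pool = sorted(set(candidates) - {p}, key=lambda c: dist(p, c))
    out = []
    while pool and len(out) < R:
        p_star = pool.pop(0)      # closest remaining candidate
        out.append(p_star)
        # drop candidates now alpha-covered by p_star
        pool = [c for c in pool if alpha * dist(p_star, c) > dist(p, c)]
    return out
```

Note how the α > 1 factor in the filter keeps more candidates in the pool than a strict RNG would, yielding the denser, more update-tolerant graph the paper relies on.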
- Insertion (Algorithm 2):
  - To insert a new point p, first treat p as a query and search the existing graph to find a set of candidate neighbors V.
  - Use RobustPrune to select the best R out-neighbors for p from V.
  - For each new neighbor j of p, add a reciprocal edge from j back to p. If this causes j's out-degree to exceed R, run RobustPrune on j's neighborhood as well.
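The three insertion steps above can be sketched as follows (a simplified in-memory version; it assumes a `greedy_search` that returns the set of visited nodes V and a `robust_prune` as described earlier, both passed in as parameters since their exact forms are the paper's Algorithms 1 and 3):

```python
def insert_point(graph, p, R, alpha, greedy_search, robust_prune):
    """Insert p (sketch of Algorithm 2): search for candidates,
    prune to R out-neighbors, then add reciprocal edges."""
    # 1. treat p as a query; V = nodes visited by the search
    V = greedy_search(graph, query=p)
    # 2. choose p's out-neighbors from the visited set
    graph[p] = robust_prune(p, V, alpha, R)
    # 3. reciprocal edges, re-pruning any node whose degree overflows
    for j in graph[p]:
        graph[j] = list(set(graph[j]) | {p})
        if len(graph[j]) > R:
            graph[j] = robust_prune(j, graph[j], alpha, R)
    return graph
```

The reciprocal-edge step is what keeps the graph navigable toward newly inserted points; the re-prune keeps every degree bounded by R.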
- Deletion (Algorithm 4):
  - Lazy Deletion: Deletions are not processed immediately. The ID of the deleted point is added to a DeleteList. During search, points in this list are still used for navigation but are filtered out of the final results.
  - Batch Consolidation: Periodically, the deletions are consolidated in a batch. For every node p that had an edge to a deleted node v, its neighborhood is "repaired": the algorithm takes the remaining neighbors of p plus the neighbors of the deleted node v as a new candidate set, then runs RobustPrune on this set to form a new, robust neighborhood for p.
4.3 FreshDiskANN: The Billion-Scale System
To handle datasets larger than RAM, FreshDiskANN uses a hybrid architecture.
5. Experimental Setup
6. Results & Analysis
The paper presents a comprehensive set of experiments to validate its claims.
- FreshVamana Recall Stability (Figures 2 & 3):
  - Figure 2 shows that over 50 cycles of deleting and re-inserting 5%, 10%, or even 50% of the data, the recall of FreshVamana remains stable around 95%. This directly contrasts with the degradation seen in Figure 1 for other methods.
  - Figure 3 demonstrates the importance of the α parameter. With α = 1 (the strict RNG rule), recall degrades over time; with α > 1, recall is stable, confirming that the denser graph structure is key to robustness.
- StreamingMerge Stability (Figure 4): This experiment tracks the recall of the SSD-based index over 40 cycles of StreamingMerge. There is an initial drop in recall because the merge process uses approximate distances from compressed vectors, but after this adjustment period (~20 cycles) recall stabilizes, showing that the system can maintain a consistent quality of service in a long-running, steady-state scenario.
- Billion-Scale FreshDiskANN Evaluation (Figures 5-8): This week-long experiment on an 800-million-point index is the main result.
  - System Performance: The system sustained 1800 inserts/sec and 1800 deletes/sec concurrently with 1000 searches/sec.
  - Search Latency (Figures 5 & 6): User-perceived mean search latency remained low, mostly under 15-20 ms, even when a StreamingMerge was active in the background. Latency is lowest when no merge is running (~5 ms) and highest during the Insert phase of the merge, due to random-read contention on the SSD.
  - Update Latency (Figure 6, right): User-perceived insert latency is under 1 ms, and delete latency is even lower (microseconds), because both are handled by fast in-memory operations. The system's overall update throughput is limited by the speed of the background StreamingMerge.
  - Cost-Effectiveness (Table 2): Merging a 7.5% change (30M inserts + 30M deletes) into an 800M-point index takes ~16,000 seconds, whereas rebuilding the entire index from scratch would take ~190,000 seconds. This >10x improvement directly supports the paper's central claim of a 5-10x cost reduction. StreamingMerge time also scales near-linearly with the number of threads used (Figure 7).
Below are manual transcriptions of the tables from the paper.
Table 1: Index build times for Vamana and FreshVamana in memory, with R = 64, Lc = 75, α = 1.2

| Dataset | Vamana | FreshVamana | Speedup |
|:---|---:|---:|---:|
| SIFT1M | 32.3 s | 21.8 s | 1.48x |
| DEEP1M | 26.9 s | 17.7 s | 1.52x |
| GIST1M | 417.2 s | 228.1 s | 1.83x |
| SIFT100M | 7187.1 s | 4672.1 s | 1.54x |
This table shows that building an index from scratch by streaming inserts with FreshVamana is ~1.5x faster than the batch-build process of the original Vamana.
Table 2: Full build time with DiskANN (96 threads) versus StreamingMerge with FreshDiskANN (40 threads) to update an 800M-point index with 30M inserts and 30M deletes

| Dataset | DiskANN (s) | StreamingMerge (s) |
|:---|---:|---:|
| SIFT800M | 83,140* | 15,832 |

*Note: The text states that a rebuild takes ~190,000 seconds; the 83,140 s in the table may refer to a specific configuration or phase. The key takeaway is the order-of-magnitude difference.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully tackles a critical, practical problem in large-scale ANNS: maintaining index freshness. It introduces FreshVamana, the first graph-based algorithm with robust update rules that preserve search quality, and then engineers FreshDiskANN, a complete system that scales this algorithm to billion-point datasets on a single machine using an efficient StreamingMerge procedure. The system achieves impressive performance, supporting thousands of concurrent updates and queries per second with low latency and high accuracy, all while reducing the cost of maintaining freshness by an order of magnitude compared to the industry-standard practice of periodic rebuilding.
- Limitations & Future Work:
  - Parameter Tuning: The system has several important parameters (α, R, L, merge thresholds) that likely require careful tuning for different datasets and workloads to achieve optimal performance.
  - Consistency Model: The paper assumes quiescent consistency, where a search at time t reflects all updates completed before t. This is a reasonable model but may not suffice for applications requiring stricter transactional guarantees.
  - Worst-Case Scenarios: The performance analysis focuses on random updates. A highly skewed or adversarial update pattern (e.g., deleting all the central "hub" nodes in the graph) could potentially degrade performance more significantly.
- Personal Insights & Critique:
  - Impact: This paper is a significant contribution to the field. It makes the best-performing class of ANNS algorithms (graph-based) practical for a much wider range of real-world, dynamic applications, and the massive cost reduction it enables is a compelling reason for industry adoption.
  - Engineering Excellence: The design of the StreamingMerge procedure is a standout piece of systems engineering. It elegantly solves the challenge of applying many small, random updates to a massive, disk-resident data structure by converting them into I/O-efficient sequential passes.
  - Future Directions: The principles behind FreshVamana and StreamingMerge could likely be adapted to other graph-based indices beyond Vamana. Furthermore, this work opens the door to fully-fledged distributed databases for vector search that can handle both high-volume queries and real-time data ingestion efficiently.