Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters
TL;DR Summary
Filtered-DiskANN introduces two graph-based algorithms for efficient filtered ANNS. They build indexes that consider both vector geometry and label metadata, significantly outperforming prior methods in speed and accuracy for filtered queries and supporting high-recall search at large scale.
Abstract
As Approximate Nearest Neighbor Search (ANNS)-based dense retrieval becomes ubiquitous for search and recommendation scenarios, efficiently answering filtered ANNS queries has become a critical requirement. Filtered ANNS queries ask for the nearest neighbors of a query’s embedding from the points in the index that match the query’s labels, such as date, price range, or language. There has been little prior work on algorithms that use label metadata associated with vector data to build efficient indices for filtered ANNS queries. Consequently, current indices have high search latency or low recall, which is not practical in interactive web scenarios. We present two algorithms with native support for faster and more accurate filtered ANNS queries: one with streaming support, and another based on batch construction. Central to our algorithms is the construction of a graph-structured index which forms connections not only based on the geometry of the vector data, but also the associated label set. On real-world data with natural labels, both algorithms are an order of magnitude or more efficient for filtered queries than the current state-of-the-art algorithms. The generated indices can also be queried from an SSD and support thousands of queries per second at over 90% recall@10.
English Analysis
1. Bibliographic Information
- Title: Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters
- Authors: Siddharth Gollapudi, Neel Karia, Varun Sivashankar, Ravishankar Krishnaswamy, Nikit Begwani, Swapnil Raz, Yiyong Lin, Yin Zhang, Neelam Mahapatro, Premkumar Srinivasan, Amit Singh, and Harsha Vardhan Simhadri. The authors are primarily affiliated with Microsoft Research and Microsoft, with one author from Columbia University. This indicates a strong industry-research collaboration, often focused on solving large-scale, practical problems.
- Journal/Conference: Proceedings of the ACM Web Conference 2023 (WWW '23). The Web Conference (formerly WWW) is a top-tier, highly competitive venue for research on web technologies, data science, and internet applications. Publication here signifies high quality and impact.
- Publication Year: 2023
- Abstract: The paper addresses the growing need for efficient Approximate Nearest Neighbor Search (ANNS) that can handle queries with filters (e.g., search for similar items within a specific price range or language). Existing methods often suffer from high latency or low accuracy (recall), which is unacceptable for interactive applications. The authors introduce two novel graph-based algorithms, `FilteredVamana` (streaming-friendly) and `StitchedVamana` (batch-built), that natively incorporate filter information into the index structure itself. By considering both the vector geometry and the associated labels during graph construction, these algorithms significantly outperform state-of-the-art methods on real-world datasets, often by an order of magnitude. The resulting indices are also efficient enough to run on SSDs, supporting thousands of queries per second with high recall.
- Original Source Link: https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf (Formally published paper)
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Modern search and recommendation systems heavily rely on finding the "nearest" items (e.g., products, documents, images) to a query in a high-dimensional vector space. This is the Approximate Nearest Neighbor Search (ANNS) problem. A critical real-world requirement is to perform this search on a subset of the data that matches certain metadata criteria, or "filters." This is known as Filtered ANNS. For example, finding visually similar dresses that are also under $50 and available in size M.
- Importance & Gaps: As ANNS becomes standard, the lack of efficient filtering is a major bottleneck. Prior methods are generally inefficient. Post-processing (search first, then filter) fails when the filter is very specific, as it may return no valid results. Pre-processing (building a separate index for every possible filter) is not scalable. Inline-processing (filtering during the search traversal) improves on this but is often applied to less efficient index structures (like IVF) and not the state-of-the-art graph-based indices.
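The arithmetic behind the post-processing failure mode is worth making explicit. Under a simplifying assumption (mine, not the paper's) that filter matches are spread roughly uniformly through the unfiltered result list, about k/specificity candidates must be retrieved to collect k matching results:

```python
def candidates_needed(k, specificity):
    """Rough number of unfiltered ANNS candidates that must be retrieved
    so that about k of them match a filter of the given specificity,
    assuming matches are spread uniformly through the result list."""
    return round(k / specificity)

# A 10%-specificity filter needs only ~100 candidates for k = 10 ...
print(candidates_needed(10, 0.10))       # 100
# ... but a 0.0001%-specificity filter needs ~10 million candidates.
print(candidates_needed(10, 0.000001))   # 10000000
```

At the 0.0001% specificity level that the paper tests, post-processing would have to scan millions of candidates per query, which is why the filter-aware indices can be an order of magnitude (or more) faster.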
- Innovation: This paper's central innovation is to integrate filter awareness directly into the graph construction process. Instead of treating filters as an afterthought at query time, the proposed algorithms build a graph where the connections (edges) are chosen based on both vector similarity and shared labels. This ensures that the graph is navigable not just in the geometric space but also within the context of any given filter.
- Main Contributions / Findings (What):
- Two Novel Algorithms: The paper introduces `FilteredVamana` and `StitchedVamana`, two graph-based algorithms for filtered ANNS. `FilteredVamana` is an incremental algorithm, suitable for streaming data and dynamic updates. `StitchedVamana` is a batch-construction algorithm that builds separate graphs per filter and then intelligently "stitches" them together.
- State-of-the-Art Performance: On real-world datasets, both algorithms are shown to be an order of magnitude or more efficient (in terms of query speed for a target accuracy) than existing baselines like HNSW with post-processing, IVF with inline-processing, and other specialized methods like NHQ and Milvus. They achieve high recall (>90%) even for highly specific filters (e.g., matching only 0.0001% of the data).
- Real-World Validation: The effectiveness of `FilteredVamana` was confirmed in a live A/B test on a major sponsored search engine. It led to significant increases in user clicks (30-70%) and revenue (30-80%), especially for queries targeting smaller, less-represented groups.
- Scalability to SSDs: The proposed methods can be integrated into the `DiskANN` framework, allowing massive datasets (tens of millions of points) to be indexed and queried from cost-effective SSDs at interactive speeds.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Approximate Nearest Neighbor Search (ANNS): Given a large dataset of high-dimensional vectors (e.g., embeddings of images or text), the goal is to find the k vectors from the dataset that are closest to a given query vector. Doing this exactly requires a linear scan, which is too slow for large datasets due to the "curse of dimensionality." ANNS algorithms trade perfect accuracy for a massive speedup by finding vectors that are very likely to be the true nearest neighbors.
- Graph-based ANNS: A leading class of ANNS algorithms (e.g., HNSW, Vamana). They build a graph where each data point is a node. Edges connect nodes that are close to each other in the vector space. To find neighbors for a query, the algorithm performs a greedy search on this graph, starting from an entry point and iteratively moving to the neighbor closest to the query. A well-constructed graph guides the search quickly to the true nearest neighbors.
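The greedy traversal described above can be sketched in a few lines (a toy illustration under simplified assumptions — squared-Euclidean distance, a bounded candidate list, invented node IDs and parameters — not any library's actual implementation):

```python
def greedy_search(graph, points, query, start, L_size=4, k=2):
    """Greedy best-first search on a proximity graph (Vamana/HNSW style).
    graph: {node: [neighbor nodes]}, points: {node: coordinate tuple}.
    Repeatedly expands the closest unvisited candidate, keeping a candidate
    list of at most L_size nodes, until no unvisited candidates remain."""
    dist = lambda n: sum((a - b) ** 2 for a, b in zip(points[n], query))
    L = {start}                                  # candidate list
    visited = set()
    while L - visited:
        p = min(L - visited, key=dist)           # closest unvisited candidate
        visited.add(p)
        L.update(graph[p])                       # consider p's out-neighbors
        L = set(sorted(L, key=dist)[:L_size])    # truncate to L_size closest
    return sorted(L, key=dist)[:k]

# Points on a line, chained 0-1-2-3-4; the walk homes in on the query at 3.1.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
points = {i: (float(i),) for i in range(5)}
print(greedy_search(graph, points, (3.1,), start=0, L_size=3))  # [3, 4]
```

A well-built graph keeps this walk short: each hop roughly halves the remaining distance to the query, so the search touches only a tiny fraction of the dataset.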
- Filtered ANNS: The problem of performing ANNS on a subset of the data defined by a filter. A query consists of a vector `x_q` and a filter label `f`. The goal is to find the nearest neighbors of `x_q` only among the data points that carry the label `f`.
- Specificity: A term defined in the paper as the fraction of the total dataset that matches a given filter. A filter with low specificity (e.g., 1%) matches a small subset of the data and is typically harder for existing methods to handle.
- Recall@k: A primary metric for ANNS quality. It measures the fraction of the true top-k nearest neighbors that the algorithm successfully found.
- Previous Works:
  - Post-processing: The simplest approach. Run a standard ANNS query to retrieve a large number of candidates (e.g., 1000) and then filter out those that don't match the query's label. Limitation: this fails dramatically for low-specificity filters, as you might need to retrieve millions of candidates to find a few that match.
  - Pre-processing (Separate Indices): Build a separate ANNS index for each possible filter label. Limitation: this is infeasible if there are many labels or if data points can have multiple labels, as it leads to an explosion in memory and maintenance costs.
  - Inline-processing: This approach integrates filtering into the search process. For example, in an inverted file index (`IVF`), the data is partitioned into clusters; during a query, only points in the most promising clusters are checked, and inline-processing skips points within those clusters that don't match the query filter. Limitation: while better than post-processing, this technique has been primarily applied to less performant index structures like `IVF` or `LSH`, not the highly efficient graph-based methods.
  - NHQ: A recent graph-based method that encodes filter labels as vectors and appends them to the data vectors. Limitation: the paper notes this is less effective when points can have multiple labels, and it is outperformed by the proposed methods.
- Differentiation: The key differentiator of `Filtered-DiskANN` is its filter-aware index construction.
Unlike previous methods that only modify the search step, these algorithms build a graph whose structure is fundamentally designed to facilitate filtered searches from the ground up. This leads to vastly superior performance because the graph provides direct, efficient paths between points that share the same labels.
4. Methodology (Core Technology & Implementation)
The paper's core technology revolves around modifying the Vamana graph construction algorithm to be filter-aware.
- FilteredGreedySearch (Algorithm 1): This is the search algorithm used for both querying and index construction. It is a modification of the standard greedy search on a graph.
  - Inputs: A graph G, a set of start nodes S, a query vector `x_q`, the number of neighbors k, a search list size L, and a query filter `F_q`.
  - Procedure:
    1. Initialize a priority queue L (the candidate list) with the start nodes that match the query filter `F_q`.
    2. Initialize an empty set V of visited nodes.
    3. While L contains unvisited nodes:
       a. Select the unvisited node p* in L that is closest to the query `x_q`.
       b. Add p* to the visited set V.
       c. For each neighbor of p*, if the neighbor matches the query filter `F_q` and has not been visited, add it to L.
       d. If L exceeds its size limit, truncate it to keep only the L closest nodes to `x_q`.
    4. The search terminates when all nodes in L have been visited.
  - Output: The k nearest neighbors from the final candidate list L.
- FilteredRobustPrune (Algorithm 3): This is the most critical component. It is the procedure for pruning edges from a node's neighborhood to maintain a fixed maximum degree R while preserving graph navigability for filtered queries.
  - Principle: An edge from a point p to a candidate neighbor p' can be pruned if there is another neighbor p* of p that provides a "better" path to p'. The conditions for this are:
    1. Geometric Condition: p* is significantly closer to p' than p is.
Formally, \alpha \cdot d(p^*, p') \leq d(p, p') for a pruning parameter \alpha \ge 1.
    2. Label Condition: every label that p shares with p' must also be carried by p*, i.e., F_{p'} \cap F_p \subseteq F_{p^*}. This guarantees that pruning the edge from p to p' never disconnects a search path restricted to a shared filter.
- StitchedVamana: Builds the index in three steps:
  1. Build Per-Filter Graphs: For each filter f in the label universe \mathcal{F}, build a standard `Vamana` graph G_f using only the subset of points P_f that have that label. These sub-graphs are built with smaller degree (`R_small`) and search list (`L_small`) parameters for speed.
  2. Stitch Graphs: Create a final graph G by taking the union of all nodes and all edges from the per-filter graphs G_f. A node in G has outgoing edges to all its neighbors from every sub-graph it belongs to, which can result in nodes with very high degrees.
  3. Prune Stitched Graph: For every node v in the combined graph G, run `FilteredRobustPrune` on its neighborhood to reduce its out-degree to a final, manageable bound `R_stitched`.
5. Experimental Setup
- Datasets: The evaluation used a mix of real-world and semi-synthetic datasets to test performance under various conditions. This is a transcribed version of Table 1 from the paper; the percentile columns (100pc.–1pc.) report filter specificities at the corresponding percentiles.

| Dataset | Dim | # Pts. | # Queries | Source Data | Filters | Filters per Pt. | Unique Filters | 100pc. | 75pc. | 50pc. | 25pc. | 1pc. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Turing | 100 | 2,599,968 | 996 | Text | Natural | 1.09 | 3070 | 0.127 | 1.56x10⁻⁴ | 4.15x10⁻⁵ | 1.54x10⁻⁵ | 7.7x10⁻⁶ |
| Prep | 64 | 1,000,000 | 10000 | Text | Natural | 8.84 | 47 | 0.425 | 0.136 | 0.130 | 0.127 | 0.09 |
| DANN | 64 | 3,305,317 | 32926 | Text | Natural | 3.91 | 47 | 0.735 | 0.361 | 0.183 | 0.167 | 0.150 |
| SIFT | 128 | 1,000,000 | 10000 | Image | Random | 1 | 12 | 0.083 | 0.083 | 0.083 | 0.083 | 0.082 |
| GIST | 960 | 1,000,000 | 1000 | Image | Random | 1 | 12 | 0.083 | 0.083 | 0.083 | 0.083 | 0.082 |
| msong | 420 | 992,272 | 200 | Audio | Random | 1 | 12 | 0.083 | 0.082 | 0.082 | 0.082 | 0.082 |
| audio | 192 | 53,387 | 200 | Audio | Random | 1 | 12 | 0.085 | 0.084 | 0.083 | 0.082 | 0.081 |
| paper | 200 | 2,029,997 | 10000 | Text | Random | 1 | 12 | 0.083 | 0.083 | 0.083 | 0.083 | 0.082 |

The `Turing`, `Prep`, and `DANN` datasets are real-world industrial datasets with naturally occurring labels, making them strong indicators of practical performance. The other datasets are standard benchmarks where labels were assigned randomly.
- Evaluation Metrics:
  1. Recall@k:
     - Conceptual Definition: Measures the accuracy of the search. It is the proportion of the true k nearest neighbors (found by a brute-force search) that are present in the top-k results returned by the approximate algorithm. A recall of 1.0 (or 100%) means perfect accuracy. This paper reports `Recall@10`.
     - Mathematical Formula: \mathrm{Recall}@k = \frac{|A \cap G|}{k}, where A is the set of k results returned by the algorithm, G is the ground-truth set of true k nearest neighbors, and k is the number of neighbors to retrieve.
  2. Queries Per Second (QPS):
     - Conceptual Definition: Measures the speed or throughput of the search index. It is the total number of queries processed divided by the total time taken. Higher QPS is better. The paper reports QPS for runs using 48 threads.
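As a concrete illustration of the FilteredGreedySearch procedure from Section 4, here is a minimal sketch (toy one-dimensional data, single-label query filters; the graphs, labels, and parameters are invented for illustration, not the paper's code). The second graph adds one same-label edge of the kind a filter-aware construction would create:

```python
def filtered_greedy_search(graph, points, labels, query, f, starts, L_size=4, k=2):
    """FilteredGreedySearch sketch: greedy graph search that only admits
    nodes carrying the query's filter label f, both when seeding the
    candidate list and when expanding neighbors."""
    dist = lambda n: sum((a - b) ** 2 for a, b in zip(points[n], query))
    L = {s for s in starts if f in labels[s]}            # filtered start nodes
    visited = set()
    while L - visited:
        p = min(L - visited, key=dist)                   # closest unvisited candidate
        visited.add(p)
        L.update(nb for nb in graph[p] if f in labels[nb])  # filtered expansion
        L = set(sorted(L, key=dist)[:L_size])            # truncate candidate list
    return sorted(L, key=dist)[:k]

points = {i: (float(i),) for i in range(5)}
labels = {0: {"a"}, 1: {"a"}, 2: {"b"}, 3: {"a"}, 4: {"b"}}
# Purely geometric chain graph: node 2 (label "b") blocks the path to node 3.
geometric = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(filtered_greedy_search(geometric, points, labels, (3.1,), "a", [0]))  # [1, 0]
# One extra same-label edge 1 -> 3 lets the search reach the true answer.
aware = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [2, 4], 4: [3]}
print(filtered_greedy_search(aware, points, labels, (3.1,), "a", [0]))      # [3, 1]
```

Note how the filter-blind graph strands the search at node 1 even though node 3 matches the filter and is far closer to the query: this is exactly the navigability failure that the label condition in `FilteredRobustPrune` is designed to prevent.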
- Baselines:
  - `IVF Inline-Processing`: A FAISS IVF index where filtering is done "on the fly" as clusters are explored.
  - `IVF post-processing`: A standard FAISS IVF index where a large number of results are retrieved and then filtered.
  - `HNSW post-processing`: A standard FAISS HNSW index with post-processing.
  - `NHQ`: The competing graph-based filtered ANNS method.
  - `Milvus`: A popular open-source vector database that supports filtered search. Several of its internal algorithms were tested.
6. Results & Analysis
- Core Results: The main results are presented as QPS vs. Recall@10 curves, where the ideal algorithm is in the top-right (high recall and high QPS).

(Figure: four subplots showing Recall@10 against the relevant tuning parameter for four algorithms (HNSW, IVF FLAT, IVF SQ8, IVF PQ) on four datasets (Audio, SIFT1M, Paper, Msong), illustrating the recall/speed trade-offs of the different methods.)

7. Conclusion & Reflections
- Conclusion Summary: The paper successfully demonstrates that integrating filter awareness into the construction of graph-based ANNS indices is a highly effective strategy for filtered ANNS queries. The proposed algorithms, `FilteredVamana` and `StitchedVamana`, establish a new state of the art, offering an order-of-magnitude performance improvement over existing methods. The gains are particularly pronounced for challenging low-specificity filters, and the real-world impact is validated through significant revenue and engagement improvements in a large-scale production system.
- Limitations & Future Work: The authors identify several areas for future work:
  - More Complex Filters: The current work focuses on single-label matching (equivalent to an OR of labels). Supporting complex SQL-like queries with conjunctions (AND), negations, and range checks remains an open challenge.
  - Larger Filter Sets: Scaling to thousands of filters is demonstrated, but performance with even larger label universes could be explored.
  - Dynamic Deletions: While `FilteredVamana` is suited for insertions, a full evaluation of the dynamic setting, including efficient point deletions, is left for future work.
- Personal Insights & Critique:
  - Key Insight: The paper's core strength lies in its simple but powerful idea: graph navigability must be preserved not just geometrically but also per filter. The `FilteredRobustPrune` condition (F_{p'} \cap F_p \subseteq F_{p^*}) is an elegant and effective way to encode this principle. It fundamentally shifts the problem from a search-time modification to an index-time optimization, which proves to be the correct approach.
  - Practicality: The work is heavily grounded in practical needs. The choice of building upon `Vamana`/`DiskANN`, the consideration of SSDs, the analysis of build-time vs. query-time trade-offs, and especially the A/B test results make this paper exceptionally relevant for practitioners in search, e-commerce, and recommendation systems.
  - Open Questions: The trade-off between `FilteredVamana` and `StitchedVamana` is interesting. `FilteredVamana` seems more general-purpose due to its streaming-friendliness and robustness, while `StitchedVamana` is a powerful batch solution. It would be interesting to see hybrid approaches that combine the strengths of both; for instance, one could use `StitchedVamana` for a large static base and `FilteredVamana` for an incremental layer on top.
Similar papers
Recommended via semantic vector search.
FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search
Existing graph-based ANNS indices struggle with real-time updates for billion-scale data. FreshDiskANN introduces the first dynamic graph-based ANNS index with update rules, enabling concurrent real-time inserts, deletes, and searches on SSD-backed billion-point datasets without
Product quantization for nearest neighbor search
This paper introduces product quantization for efficient approximate nearest neighbor search, using subspace quantization and asymmetric distance computation, achieving superior accuracy and scalability on billion-scale image descriptor datasets.
SPFresh: Incremental In-Place Update for Billion-Scale Vector Search
SPFresh introduces LIRE, a lightweight incremental rebalancing protocol enabling in-place updates for billion-scale vector search, significantly reducing update latency and resource use while improving accuracy compared to costly global rebuild methods.