Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters
TL;DR Summary
Filtered-DiskANN introduces two graph-based algorithms for efficient filtered ANNS. They build indexes that consider both vector geometry and label metadata, significantly outperforming prior methods in speed and accuracy for filtered queries and supporting high-recall search at large scale.
Abstract
As Approximate Nearest Neighbor Search (ANNS)-based dense retrieval becomes ubiquitous for search and recommendation scenarios, efficiently answering filtered ANNS queries has become a critical requirement. Filtered ANNS queries ask for the nearest neighbors of a query’s embedding from the points in the index that match the query’s labels, such as date, price range, or language. There has been little prior work on algorithms that use label metadata associated with vector data to build efficient indices for filtered ANNS queries. Consequently, current indices have high search latency or low recall, which is not practical in interactive web scenarios. We present two algorithms with native support for faster and more accurate filtered ANNS queries: one with streaming support, and another based on batch construction. Central to our algorithms is the construction of a graph-structured index which forms connections not only based on the geometry of the vector data, but also the associated label set. On real-world data with natural labels, both algorithms are an order of magnitude or more efficient for filtered queries than the current state-of-the-art algorithms. The generated indices can also be queried from an SSD and support thousands of queries per second at over 90% recall@10.
English Analysis
1. Bibliographic Information
- Title: Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters
- Authors: Siddharth Gollapudi, Neel Karia, Varun Sivashankar, Ravishankar Krishnaswamy, Nikit Begwani, Swapnil Raz, Yiyong Lin, Yin Zhang, Neelam Mahapatro, Premkumar Srinivasan, Amit Singh, and Harsha Vardhan Simhadri. The authors are primarily affiliated with Microsoft Research and Microsoft, with one author from Columbia University. This indicates a strong industry-research collaboration, often focused on solving large-scale, practical problems.
- Journal/Conference: Proceedings of the ACM Web Conference 2023 (WWW '23). The Web Conference (formerly WWW) is a top-tier, highly competitive venue for research on web technologies, data science, and internet applications. Publication here signifies high quality and impact.
- Publication Year: 2023
- Abstract: The paper addresses the growing need for efficient Approximate Nearest Neighbor Search (ANNS) that can handle queries with filters (e.g., search for similar items within a specific price range or language). Existing methods often suffer from high latency or low accuracy (recall), which is unacceptable for interactive applications. The authors introduce two novel graph-based algorithms, `FilteredVamana` (streaming-friendly) and `StitchedVamana` (batch-built), that natively incorporate filter information into the index structure itself. By considering both the vector geometry and the associated labels during graph construction, these algorithms significantly outperform state-of-the-art methods on real-world datasets, often by an order of magnitude. The resulting indices are also efficient enough to run on SSDs, supporting thousands of queries per second with high recall.
- Original Source Link: https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf (Formally published paper)
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Modern search and recommendation systems heavily rely on finding the "nearest" items (e.g., products, documents, images) to a query in a high-dimensional vector space. This is the Approximate Nearest Neighbor Search (ANNS) problem. A critical real-world requirement is to perform this search on a subset of the data that matches certain metadata criteria, or "filters." This is known as Filtered ANNS. For example, finding visually similar dresses that are also under $50 and available in size M.
- Importance & Gaps: As ANNS becomes standard, the lack of efficient filtering is a major bottleneck. Prior methods are generally inefficient. Post-processing (search first, then filter) fails when the filter is very specific, as it may return no valid results. Pre-processing (building a separate index for every possible filter) is not scalable. Inline-processing (filtering during the search traversal) improves on this but is often applied to less efficient index structures (like IVF) and not the state-of-the-art graph-based indices.
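The arithmetic behind the post-processing failure mode is worth making explicit. Under a simplifying assumption (mine, not the paper's) that filter matches are spread roughly uniformly through the unfiltered result list, about k/specificity candidates must be retrieved to collect k matching results:

```python
def candidates_needed(k, specificity):
    """Rough number of unfiltered ANNS candidates that must be retrieved
    so that about k of them match a filter of the given specificity,
    assuming matches are spread uniformly through the result list."""
    return round(k / specificity)

# A 10%-specificity filter needs only ~100 candidates for k = 10 ...
print(candidates_needed(10, 0.10))       # 100
# ... but a 0.0001%-specificity filter needs ~10 million candidates.
print(candidates_needed(10, 0.000001))   # 10000000
```

At the 0.0001% specificity level that the paper tests, post-processing would have to scan millions of candidates per query, which is why the filter-aware indices can be an order of magnitude (or more) faster.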
- Innovation: This paper's central innovation is to integrate filter awareness directly into the graph construction process. Instead of treating filters as an afterthought at query time, the proposed algorithms build a graph where the connections (edges) are chosen based on both vector similarity and shared labels. This ensures that the graph is navigable not just in the geometric space but also within the context of any given filter.
- Main Contributions / Findings (What):
- Two Novel Algorithms: The paper introduces `FilteredVamana` and `StitchedVamana`, two graph-based algorithms for filtered ANNS. `FilteredVamana` is an incremental algorithm, suitable for streaming data and dynamic updates. `StitchedVamana` is a batch-construction algorithm that builds separate graphs per filter and then intelligently "stitches" them together.
- State-of-the-Art Performance: On real-world datasets, both algorithms are shown to be an order of magnitude or more efficient (in terms of query speed for a target accuracy) than existing baselines like HNSW with post-processing, IVF with inline-processing, and other specialized methods like NHQ and Milvus. They achieve high recall (>90%) even for highly specific filters (e.g., matching only 0.0001% of the data).
- Real-World Validation: The effectiveness of `FilteredVamana` was confirmed in a live A/B test on a major sponsored search engine. It led to significant increases in user clicks (30-70%) and revenue (30-80%), especially for queries targeting smaller, less-represented groups.
- Scalability to SSDs: The proposed methods can be integrated into the `DiskANN` framework, allowing massive datasets (tens of millions of points) to be indexed and queried from cost-effective SSDs at interactive speeds.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Approximate Nearest Neighbor Search (ANNS): Given a large dataset of high-dimensional vectors (e.g., embeddings of images or text), the goal is to find the k vectors from the dataset that are closest to a given query vector. Doing this exactly requires a linear scan, which is too slow for large datasets due to the "curse of dimensionality." ANNS algorithms trade perfect accuracy for a massive speedup by finding vectors that are very likely to be the true nearest neighbors.
- Graph-based ANNS: A leading class of ANNS algorithms (e.g., HNSW, Vamana). They build a graph where each data point is a node. Edges connect nodes that are close to each other in the vector space. To find neighbors for a query, the algorithm performs a greedy search on this graph, starting from an entry point and iteratively moving to the neighbor closest to the query. A well-constructed graph guides the search quickly to the true nearest neighbors.
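The greedy traversal described above can be sketched in a few lines (a toy illustration under simplified assumptions — squared-Euclidean distance, a bounded candidate list, invented node IDs and parameters — not any library's actual implementation):

```python
def greedy_search(graph, points, query, start, L_size=4, k=2):
    """Greedy best-first search on a proximity graph (Vamana/HNSW style).
    graph: {node: [neighbor nodes]}, points: {node: coordinate tuple}.
    Repeatedly expands the closest unvisited candidate, keeping a candidate
    list of at most L_size nodes, until no unvisited candidates remain."""
    dist = lambda n: sum((a - b) ** 2 for a, b in zip(points[n], query))
    L = {start}                                  # candidate list
    visited = set()
    while L - visited:
        p = min(L - visited, key=dist)           # closest unvisited candidate
        visited.add(p)
        L.update(graph[p])                       # consider p's out-neighbors
        L = set(sorted(L, key=dist)[:L_size])    # truncate to L_size closest
    return sorted(L, key=dist)[:k]

# Points on a line, chained 0-1-2-3-4; the walk homes in on the query at 3.1.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
points = {i: (float(i),) for i in range(5)}
print(greedy_search(graph, points, (3.1,), start=0, L_size=3))  # [3, 4]
```

A well-built graph keeps this walk short: each hop roughly halves the remaining distance to the query, so the search touches only a tiny fraction of the dataset.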
- Filtered ANNS: The problem of performing ANNS on a subset of the data defined by a filter. A query consists of a vector `x_q` and a filter label `f`. The goal is to find the nearest neighbors of `x_q` only among the data points that carry the label `f`.
- Specificity: A term defined in the paper as the fraction of the total dataset that matches a given filter. A filter with low specificity (e.g., 1%) matches a small subset of the data and is typically harder for existing methods to handle.
- Recall@k: A primary metric for ANNS quality. It measures the fraction of the true top-k nearest neighbors that the algorithm successfully found.
- Previous Works:
  - Post-processing: The simplest approach. Run a standard ANNS query to retrieve a large number of candidates (e.g., 1000) and then filter out those that don't match the query's label. Limitation: this fails dramatically for low-specificity filters, as you might need to retrieve millions of candidates to find a few that match.
  - Pre-processing (Separate Indices): Build a separate ANNS index for each possible filter label. Limitation: this is infeasible if there are many labels or if data points can have multiple labels, as it leads to an explosion in memory and maintenance costs.
  - Inline-processing: This approach integrates filtering into the search process. For example, in an inverted file index (`IVF`), the data is partitioned into clusters; during a query, only points in the most promising clusters are checked, and inline-processing skips points within those clusters that don't match the query filter. Limitation: while better than post-processing, this technique has been primarily applied to less performant index structures like `IVF` or `LSH`, not the highly efficient graph-based methods.
  - NHQ: A recent graph-based method that encodes filter labels as vectors and appends them to the data vectors. Limitation: the paper notes this is less effective when points can have multiple labels, and it is outperformed by the proposed methods.
- Differentiation: The key differentiator of `Filtered-DiskANN` is its filter-aware index construction.
Unlike previous methods that only modify the search step, these algorithms build a graph whose structure is fundamentally designed to facilitate filtered searches from the ground up. This leads to vastly superior performance because the graph provides direct, efficient paths between points that share the same labels.
4. Methodology (Core Technology & Implementation)
The paper's core technology revolves around modifying the Vamana graph construction algorithm to be filter-aware.
- FilteredGreedySearch (Algorithm 1): This is the search algorithm used for both querying and index construction. It is a modification of the standard greedy search on a graph.
  - Inputs: A graph G, a set of start nodes S, a query vector `x_q`, the number of neighbors k, a search list size L, and a query filter `F_q`.
  - Procedure:
    1. Initialize a priority queue L (the candidate list) with the start nodes that match the query filter `F_q`.
    2. Initialize an empty set V of visited nodes.
    3. While L contains unvisited nodes:
       a. Select the unvisited node p* in L that is closest to the query `x_q`.
       b. Add p* to the visited set V.
       c. For each neighbor of p*, if the neighbor matches the query filter `F_q` and has not been visited, add it to L.
       d. If L exceeds its size limit, truncate it to keep only the L closest nodes to `x_q`.
    4. The search terminates when all nodes in L have been visited.
  - Output: The k nearest neighbors from the final candidate list L.
- FilteredRobustPrune (Algorithm 3): This is the most critical component. It is the procedure for pruning edges from a node's neighborhood to maintain a fixed maximum degree R while preserving graph navigability for filtered queries.
  - Principle: An edge from a point p to a candidate neighbor p' can be pruned if there is another neighbor p* of p that provides a "better" path to p'. The conditions for this are:
    1. Geometric Condition: p* is significantly closer to p' than p is.
Formally, \alpha \cdot d(p^*, p') \leq d(p, p') for a pruning parameter \alpha \ge 1.
    2. Label Condition: every label that p shares with p' must also be carried by p*, i.e., F_{p'} \cap F_p \subseteq F_{p^*}. This guarantees that pruning the edge from p to p' never disconnects a search path restricted to a shared filter.
- StitchedVamana: Builds the index in three steps:
  1. Build Per-Filter Graphs: For each filter f in the label universe \mathcal{F}, build a standard `Vamana` graph G_f using only the subset of points P_f that have that label. These sub-graphs are built with smaller degree (`R_small`) and search list (`L_small`) parameters for speed.
  2. Stitch Graphs: Create a final graph G by taking the union of all nodes and all edges from the per-filter graphs G_f. A node in G has outgoing edges to all its neighbors from every sub-graph it belongs to, which can result in nodes with very high degrees.
  3. Prune Stitched Graph: For every node v in the combined graph G, run `FilteredRobustPrune` on its neighborhood to reduce its out-degree to a final, manageable bound `R_stitched`.
5. Experimental Setup
- Datasets: The evaluation used a mix of real-world and semi-synthetic datasets to test performance under various conditions. This is a transcribed version of Table 1 from the paper; the percentile columns (100pc.–1pc.) report filter specificities at the corresponding percentiles.

| Dataset | Dim | # Pts. | # Queries | Source Data | Filters | Filters per Pt. | Unique Filters | 100pc. | 75pc. | 50pc. | 25pc. | 1pc. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Turing | 100 | 2,599,968 | 996 | Text | Natural | 1.09 | 3070 | 0.127 | 1.56x10⁻⁴ | 4.15x10⁻⁵ | 1.54x10⁻⁵ | 7.7x10⁻⁶ |
| Prep | 64 | 1,000,000 | 10000 | Text | Natural | 8.84 | 47 | 0.425 | 0.136 | 0.130 | 0.127 | 0.09 |
| DANN | 64 | 3,305,317 | 32926 | Text | Natural | 3.91 | 47 | 0.735 | 0.361 | 0.183 | 0.167 | 0.150 |
| SIFT | 128 | 1,000,000 | 10000 | Image | Random | 1 | 12 | 0.083 | 0.083 | 0.083 | 0.083 | 0.082 |
| GIST | 960 | 1,000,000 | 1000 | Image | Random | 1 | 12 | 0.083 | 0.083 | 0.083 | 0.083 | 0.082 |
| msong | 420 | 992,272 | 200 | Audio | Random | 1 | 12 | 0.083 | 0.082 | 0.082 | 0.082 | 0.082 |
| audio | 192 | 53,387 | 200 | Audio | Random | 1 | 12 | 0.085 | 0.084 | 0.083 | 0.082 | 0.081 |
| paper | 200 | 2,029,997 | 10000 | Text | Random | 1 | 12 | 0.083 | 0.083 | 0.083 | 0.083 | 0.082 |

The `Turing`, `Prep`, and `DANN` datasets are real-world industrial datasets with naturally occurring labels, making them strong indicators of practical performance. The other datasets are standard benchmarks where labels were assigned randomly.
- Evaluation Metrics:
  1. Recall@k:
     - Conceptual Definition: Measures the accuracy of the search. It is the proportion of the true k nearest neighbors (found by a brute-force search) that are present in the top-k results returned by the approximate algorithm. A recall of 1.0 (or 100%) means perfect accuracy. This paper reports `Recall@10`.
     - Mathematical Formula: \mathrm{Recall}@k = \frac{|A \cap G|}{k}, where A is the set of k results returned by the algorithm, G is the ground-truth set of true k nearest neighbors, and k is the number of neighbors to retrieve.
  2. Queries Per Second (QPS):
     - Conceptual Definition: Measures the speed or throughput of the search index. It is the total number of queries processed divided by the total time taken. Higher QPS is better. The paper reports QPS for runs using 48 threads.
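As a concrete illustration of the FilteredGreedySearch procedure from Section 4, here is a minimal sketch (toy one-dimensional data, single-label query filters; the graphs, labels, and parameters are invented for illustration, not the paper's code). The second graph adds one same-label edge of the kind a filter-aware construction would create:

```python
def filtered_greedy_search(graph, points, labels, query, f, starts, L_size=4, k=2):
    """FilteredGreedySearch sketch: greedy graph search that only admits
    nodes carrying the query's filter label f, both when seeding the
    candidate list and when expanding neighbors."""
    dist = lambda n: sum((a - b) ** 2 for a, b in zip(points[n], query))
    L = {s for s in starts if f in labels[s]}            # filtered start nodes
    visited = set()
    while L - visited:
        p = min(L - visited, key=dist)                   # closest unvisited candidate
        visited.add(p)
        L.update(nb for nb in graph[p] if f in labels[nb])  # filtered expansion
        L = set(sorted(L, key=dist)[:L_size])            # truncate candidate list
    return sorted(L, key=dist)[:k]

points = {i: (float(i),) for i in range(5)}
labels = {0: {"a"}, 1: {"a"}, 2: {"b"}, 3: {"a"}, 4: {"b"}}
# Purely geometric chain graph: node 2 (label "b") blocks the path to node 3.
geometric = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(filtered_greedy_search(geometric, points, labels, (3.1,), "a", [0]))  # [1, 0]
# One extra same-label edge 1 -> 3 lets the search reach the true answer.
aware = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [2, 4], 4: [3]}
print(filtered_greedy_search(aware, points, labels, (3.1,), "a", [0]))      # [3, 1]
```

Note how the filter-blind graph strands the search at node 1 even though node 3 matches the filter and is far closer to the query: this is exactly the navigability failure that the label condition in `FilteredRobustPrune` is designed to prevent.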
- Baselines:
  - `IVF Inline-Processing`: A FAISS IVF index where filtering is done "on the fly" as clusters are explored.
  - `IVF post-processing`: A standard FAISS IVF index where a large number of results are retrieved and then filtered.
  - `HNSW post-processing`: A standard FAISS HNSW index with post-processing.
  - `NHQ`: The competing graph-based filtered ANNS method.
  - `Milvus`: A popular open-source vector database that supports filtered search. Several of its internal algorithms were tested.
6. Results & Analysis
- Core Results: The main results are presented as QPS vs. Recall@10 curves, where the ideal algorithm is in the top-right (high recall and high QPS).

(Figure: four subplots showing Recall@10 against the relevant tuning parameter for four algorithms (HNSW, IVF FLAT, IVF SQ8, IVF PQ) on four datasets (Audio, SIFT1M, Paper, Msong), illustrating the recall/speed trade-offs of the different methods.)

7. Conclusion & Reflections
- Conclusion Summary: The paper successfully demonstrates that integrating filter awareness into the construction of graph-based ANNS indices is a highly effective strategy for filtered ANNS queries. The proposed algorithms, `FilteredVamana` and `StitchedVamana`, establish a new state of the art, offering an order-of-magnitude performance improvement over existing methods. The gains are particularly pronounced for challenging low-specificity filters, and the real-world impact is validated through significant revenue and engagement improvements in a large-scale production system.
- Limitations & Future Work: The authors identify several areas for future work:
  - More Complex Filters: The current work focuses on single-label matching (equivalent to an OR of labels). Supporting complex SQL-like queries with conjunctions (AND), negations, and range checks remains an open challenge.
  - Larger Filter Sets: Scaling to thousands of filters is demonstrated, but performance with even larger label universes could be explored.
  - Dynamic Deletions: While `FilteredVamana` is suited for insertions, a full evaluation of the dynamic setting, including efficient point deletions, is left for future work.
- Personal Insights & Critique:
  - Key Insight: The paper's core strength lies in its simple but powerful idea: graph navigability must be preserved not just geometrically but also per filter. The `FilteredRobustPrune` condition (F_{p'} \cap F_p \subseteq F_{p^*}) is an elegant and effective way to encode this principle. It fundamentally shifts the problem from a search-time modification to an index-time optimization, which proves to be the correct approach.
  - Practicality: The work is heavily grounded in practical needs. The choice of building upon `Vamana`/`DiskANN`, the consideration of SSDs, the analysis of build-time vs. query-time trade-offs, and especially the A/B test results make this paper exceptionally relevant for practitioners in search, e-commerce, and recommendation systems.
  - Open Questions: The trade-off between `FilteredVamana` and `StitchedVamana` is interesting. `FilteredVamana` seems more general-purpose due to its streaming-friendliness and robustness, while `StitchedVamana` is a powerful batch solution. It would be interesting to see hybrid approaches that combine the strengths of both; for instance, one could use `StitchedVamana` for a large static base and `FilteredVamana` for an incremental layer on top.
Similar papers
Recommended via semantic vector search.
FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search
Existing graph-based ANNS indices struggle with real-time updates for billion-scale data. FreshDiskANN introduces the first dynamic graph-based ANNS index with update rules, enabling concurrent real-time inserts, deletes, and searches on SSD-backed billion-point datasets without
Product quantization for nearest neighbor search
This paper introduces product quantization for efficient approximate nearest neighbor search, using subspace quantization and asymmetric distance computation, achieving superior accuracy and scalability on billion-scale image descriptor datasets.
SPFresh: Incremental In-Place Update for Billion-Scale Vector Search
SPFresh introduces LIRE, a lightweight incremental rebalancing protocol enabling in-place updates for billion-scale vector search, significantly reducing update latency and resource use while improving accuracy compared to costly global rebuild methods.