HARMONY: A Scalable Distributed Vector Database for High-Throughput Approximate Nearest Neighbor Search
TL;DR Summary
Harmony is a scalable distributed vector database designed for high-throughput Approximate Nearest Neighbor Search. It addresses load imbalance and communication overhead with a novel multi-granularity partition strategy, achieving significant performance improvements in extensive experiments on real-world datasets.
Abstract
Approximate Nearest Neighbor Search (ANNS) is essential for various data-intensive applications, including recommendation systems, image retrieval, and machine learning. Scaling ANNS to handle billions of high-dimensional vectors on a single machine presents significant challenges in memory capacity and processing efficiency. To address these challenges, distributed vector databases leverage multiple nodes for the parallel storage and processing of vectors. However, existing solutions often suffer from load imbalance and high communication overhead, primarily due to traditional partition strategies that fail to effectively distribute the workload. In this paper, we introduce Harmony, a distributed ANNS system that employs a novel multi-granularity partition strategy, combining dimension-based and vector-based partition. This strategy ensures a balanced distribution of computational load across all nodes while effectively minimizing communication costs. Furthermore, Harmony incorporates an early-stop pruning mechanism that leverages the monotonicity of distance computations in dimension-based partition, resulting in significant reductions in both computational and communication overhead. We conducted extensive experiments on diverse real-world datasets, demonstrating that Harmony outperforms leading distributed vector databases, achieving 4.63 times throughput on average in four nodes and 58% performance improvement over traditional distribution for skewed workloads.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is a scalable distributed vector database system named Harmony, designed for high-throughput Approximate Nearest Neighbor Search (ANNS).
1.2. Authors
The authors and their affiliations are:
- Qian Xu, Feng Zhang, Chengxi Li, Zheng Chen, Xiaoyong Du - Renmin University of China
- Lei Cao - Massachusetts Institute of Technology
- Jidong Zhai - Tsinghua University
1.3. Journal/Conference
The paper is published on arXiv, which is a preprint server for electronic preprints of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, electrical engineering and systems science, and economics. While arXiv is not a peer-reviewed journal or conference, it is a highly influential platform for rapid dissemination of research findings within the scientific community, often preceding formal publication.
1.4. Publication Year
2025
1.5. Abstract
Approximate Nearest Neighbor Search (ANNS) is crucial for various data-intensive applications. However, scaling ANNS to handle billions of high-dimensional vectors on a single machine is limited by memory and processing efficiency. Distributed vector databases attempt to solve this by using multiple nodes for parallel processing, but existing solutions often suffer from load imbalance and high communication overhead due to traditional partitioning strategies. This paper introduces Harmony, a distributed ANNS system that proposes a novel multi-granularity partition strategy, combining dimension-based and vector-based partition. This strategy aims to balance computational load and minimize communication costs. Furthermore, Harmony incorporates an early-stop pruning mechanism that leverages the monotonicity of distance computations in dimension-based partition to reduce both computational and communication overhead. Extensive experiments on real-world datasets show that Harmony outperforms leading distributed vector databases, achieving 4.63 times throughput on average in a four-node setup and a 58% performance improvement over traditional distribution for skewed workloads.
1.6. Original Source Link
https://arxiv.org/abs/2506.14707 (Preprint)
https://arxiv.org/pdf/2506.14707v1.pdf (PDF Link)
The paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the significant challenge of scaling Approximate Nearest Neighbor Search (ANNS) to handle billions of high-dimensional vectors efficiently. In modern data-intensive applications, such as recommendation systems, image retrieval, and machine learning, ANNS is a fundamental operation. The sheer volume of unstructured data being generated daily (e.g., 500PB for Alibaba, 500 hours of content on YouTube every minute) necessitates solutions that can scale beyond single-machine capabilities.
Existing distributed vector databases attempt to address this by leveraging multiple nodes for parallel storage and processing of vectors. However, they face two critical limitations:
- Load Imbalance: Traditional partitioning methods (such as cluster-based or hash-based) often result in some machines becoming overloaded with computations while others are underutilized, especially under skewed query workloads. This degrades overall system performance.
- High Communication Overhead: Distributing queries across machines typically involves transmitting intermediate results. Although these might be smaller than the original vectors, the disparity between network bandwidth (e.g., 100 Gb/s) and computation speed (hundreds of GB/s) means communication can still introduce significant delays, acting as a bottleneck.

The paper identifies that over 90% of ANNS time in cluster-based indices is spent on distance computations. This highlights the need for optimizing these computations in a distributed setting. Traditional graph-based segmentation for standalone machines also struggles in distributed deployments due to cross-machine edges and high latency.
The paper's entry point or innovative idea is to develop a distributed-friendly vector distribution method and optimize distributed vector distance computations based on cluster-based vectors by introducing a novel multi-granularity partition strategy combined with an early-stop pruning mechanism.
2.2. Main Contributions / Findings
The paper introduces Harmony, a distributed ANNS system with three primary innovations and corresponding key findings:
- Multi-Granularity Partition Strategy: Harmony integrates vector-based and dimension-based partition into a hybrid index construction framework, guided by a cost model.
  - Contribution: This approach dynamically adapts to query characteristics and cluster state, ensuring efficient query processing and stable performance across various loads. It addresses load imbalance by combining the strengths of both partitioning methods.
  - Finding: This adaptive strategy ensures robust performance even under skewed workloads, outperforming traditional distribution methods.
- Early-Stop Pruning Mechanism: Harmony incorporates an early-stop pruning mechanism that leverages the monotonicity of distance computations in dimension-based partition.
  - Contribution: When partial vector dimensions no longer contribute to the final result, redundant calculations are skipped. This early pruning, coupled with a pipelined execution model, minimizes unnecessary computations and communications, especially for high-dimensional queries.
  - Finding: Experiments show significant pruning ratios, especially in later stages of dimension processing (e.g., over 80% for the third slice, over 90% for the fourth slice), dramatically reducing computational overhead and network transfers.
- Adaptive Index Construction & Load-Aware Partitioning:
  - Contribution: The system adaptively adjusts its index structure based on workload characteristics and uses a load-aware routing mechanism to minimize communication overhead and balance query processing.
  - Finding: This results in Harmony achieving substantial performance gains over state-of-the-art vector databases.

Key Conclusions / Findings:
- Harmony achieves a 4.63x throughput improvement on average compared to Faiss in a four-node environment.
- It demonstrates a 58% performance improvement over traditional distribution methods when dealing with skewed workloads.
- The dimension-based partition inherently allows for early stopping during distance computations, with partial distances quickly exceeding the current best candidate threshold, thus enabling significant pruning.
- The multi-granularity approach (combining dimension and vector partitioning) effectively balances computational load and minimizes communication, addressing the core challenges of distributed ANNS.
- Individual optimization techniques contribute significantly: balanced load yields a 1.75x speedup, pipeline and asynchronous execution 1.25x, and pruning 1.51x.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Harmony, a reader needs to grasp several core concepts in ANNS and distributed systems:
- Approximate Nearest Neighbor Search (ANNS):
  - Concept: ANNS is a technique used to find data points in a high-dimensional space that are "approximately" closest to a given query vector. Unlike exact Nearest Neighbor Search (NNS), which guarantees finding the absolute closest points, ANNS sacrifices a small amount of accuracy for significantly faster search times, which is crucial for large datasets.
  - How it works: ANNS algorithms typically build an index structure (e.g., a graph, tree, or cluster-based structure) that allows for rapid traversal of the search space, avoiding a full scan of all data points.
  - Why it's needed: In applications like recommendation systems or image retrieval, data is often represented as high-dimensional vectors (e.g., embeddings from deep learning models). Exact search on billions of such vectors is computationally prohibitive.
- High-Dimensional Vectors:
  - Concept: These are numerical representations of data points where each point has many attributes or features (dimensions). For example, a word embedding might have 300 dimensions, and an image feature vector could have 1024 dimensions.
  - Challenges: As dimensionality increases, traditional distance metrics become less meaningful (the "curse of dimensionality"), and the volume of the space grows exponentially, making exhaustive search impractical.
- Distributed Vector Databases:
  - Concept: Systems designed to store and search vast collections of high-dimensional vectors across multiple interconnected machines (nodes). They leverage parallel processing to handle workloads that would overwhelm a single machine.
  - Benefits: Increased storage capacity, reduced query latency (by processing parts of a query concurrently), and improved fault tolerance.
- Partitioning Strategies (for Distributed Systems):
  - Vector-based Partitioning:
    - Concept: The entire vector dataset is divided into subsets (shards), and each subset is assigned to a different node. Each node stores complete vectors.
    - Pros: Simplifies distance computation locally on each node; communication typically only happens at the beginning (sending the query) and end (aggregating results).
    - Cons: Can lead to load imbalance if certain subsets of vectors (hot partitions) are queried more frequently or are computationally heavier; doesn't help with memory if individual vectors are too large for a single node.
  - Dimension-based Partitioning:
    - Concept: The dimensional space of the vectors is split into disjoint subsets, and each subset of dimensions is assigned to a different node. Each node stores only a slice of each vector.
    - Pros: Ensures a balanced load across all nodes, as each node contributes equally to the distance computation for any given query; reduces memory pressure per node for very high-dimensional vectors.
    - Cons: Requires more communication during query processing, as partial distance results from multiple nodes need to be combined to form a full distance, potentially increasing communication overhead.
  - Cluster-based Partitioning:
    - Concept: Data points are grouped into clusters based on similarity (e.g., using k-means), and then these clusters (or the data points within them) are distributed across nodes.
    - Relevance to ANNS: Many ANNS indexing techniques are cluster-based, where the first step is to find the closest clusters to the query, then search only within those clusters.
  - Hash-based Partitioning:
    - Concept: Vectors are assigned to partitions based on a hash function applied to some property of the vector or its ID.
    - Pros: Simple, usually ensures uniform distribution.
    - Cons: Doesn't consider data locality, and can lead to cross-partition data access for related vectors.
- Euclidean Distance (Squared Euclidean Distance):
  - Concept: A common metric for measuring the straight-line distance between two points in Euclidean space. The squared version is often used to avoid the square root operation, which doesn't change the relative ordering of distances.
  - Formula: For two $d$-dimensional vectors $\mathbf{p}$ and $\mathbf{q}$, the squared Euclidean distance is: $ D^2_{\mathrm{Eucl}}(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{d} (p_i - q_i)^2 $
  - Significance for dimension-based partitioning: The total squared Euclidean distance is the sum of squared partial distances over disjoint sets of dimensions. This monotonic property is crucial for early stopping. If the sum of partial distances already exceeds a threshold, adding more non-negative terms will only increase it further, meaning the vector can be pruned.
- Cosine Similarity:
  - Concept: Measures the cosine of the angle between two non-zero vectors. It indicates how similar the direction of two vectors is, regardless of their magnitude. A value of 1 means identical direction, 0 means orthogonal, and -1 means exactly opposite.
  - Formula: For two vectors $\mathbf{p}$ and $\mathbf{q}$: $ \text{Cosine Similarity}(\mathbf{p}, \mathbf{q}) = \frac{\mathbf{p} \cdot \mathbf{q}}{\|\mathbf{p}\| \|\mathbf{q}\|} = \frac{\sum_{i=1}^{d} p_i q_i}{\sqrt{\sum_{i=1}^{d} p_i^2} \sqrt{\sum_{i=1}^{d} q_i^2}} $
  - Significance for dimension-based partitioning: If vectors are pre-normalized (i.e., $\|\mathbf{p}\| = \|\mathbf{q}\| = 1$), then cosine similarity is directly proportional to the dot product. The dot product also accumulates over disjoint dimension slices, allowing for early stopping when maximizing similarity (e.g., if a partial sum of dot products is too low, the candidate won't be a good match).
- Early Stopping / Pruning:
  - Concept: A technique to terminate computations prematurely if it's determined that the current path or candidate cannot lead to a better result than what's already found.
  - Relevance: In ANNS, if the partial distance calculated for a candidate vector already exceeds the distance of the $k$-th best neighbor found so far, then that candidate cannot be among the top-$k$ neighbors, and its remaining distance computations can be skipped (see the sketch after this list).
- Throughput (QPS - Queries Per Second):
  - Concept: A measure of the number of queries a system can process successfully in a given unit of time. Higher QPS indicates better performance and efficiency.
- Recall:
  - Concept: In ANNS, recall measures the proportion of true nearest neighbors that are successfully retrieved by the approximate search algorithm. A recall of 1.0 means all true nearest neighbors were found.
- Load Imbalance:
  - Concept: A situation in a distributed system where the workload is unevenly distributed among the processing nodes. Some nodes are heavily utilized (hot), while others are underutilized (cold), leading to bottlenecks and reduced overall system performance.
- Communication Overhead:
  - Concept: The time or resources consumed by communication between different components of a distributed system (e.g., sending data, synchronizing processes). Excessive communication overhead can negate the benefits of parallel processing.
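To make the monotonicity and early-stopping ideas above concrete, here is a minimal Python sketch (illustrative only, not Harmony's code; the function name and slicing scheme are assumptions): partial squared Euclidean distances can only grow as more dimension slices are accumulated, so a candidate is safely discarded once its running sum exceeds the current $k$-th best distance.

```python
import numpy as np

def early_stop_distance(query, candidate, slices, threshold):
    """Accumulate squared Euclidean distance slice by slice.

    Returns (partial_distance, pruned). Because every slice contributes a
    non-negative amount, the running sum is monotonically non-decreasing,
    so exceeding `threshold` (the current k-th best distance) proves the
    candidate cannot enter the top-k, and the remaining slices are skipped.
    """
    partial = 0.0
    for lo, hi in slices:                      # disjoint dimension ranges
        diff = query[lo:hi] - candidate[lo:hi]
        partial += float(np.dot(diff, diff))   # non-negative slice contribution
        if partial > threshold:                # monotonicity => safe to prune
            return partial, True
    return partial, False

# Toy usage: 128-dim vectors split into four 32-dim slices.
rng = np.random.default_rng(0)
q, c = rng.normal(size=128), rng.normal(size=128)
slices = [(i, i + 32) for i in range(0, 128, 32)]
print(early_stop_distance(q, c, slices, threshold=50.0))
```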
3.2. Previous Works
The paper discusses previous research in several areas:
- Vector Database Systems:
  - General advancements: ANNS has seen significant progress in indexing structures and query algorithms (e.g., [28, 47]), but distance computations remain the bottleneck.
  - Indexing techniques:
    - Graph-based indexes: Examples like [23, 24, 26, 27, 32, 47, 55, 57, 66] build a graph based on vector distances and use breadth-first search. The paper notes these are not well-compatible with distributed deployments due to cross-machine edges.
    - Partition-based indexes: Include cluster-based [14, 15, 17, 25, 36-38, 63, 67], hash-based [20, 33, 53, 59, 60], and high-dimensional tree-based [16, 45, 49, 58] methods. These divide vectors into partitions for faster identification.
  - Challenges: High storage overhead for full-dimensionality vectors, despite indexing improvements. Lossy compression (e.g., quantization [25]) can reduce storage but sacrifices accuracy. This drives the need for distributed ANNS.
- Distributed Solutions in Data Management:
  - Graph databases: Systems like Neo4j [4], TigerGraph [8], and JanusGraph [11] employ partitioning (sharding) to split graphs across machines, aiming to minimize cross-partition edges and data transfers.
  - Distributed relational databases: Google Spanner [19], CockroachDB [2], and F1 [54] achieve horizontal scalability by sharding tables, providing transactional consistency and parallel query execution.
  - Distributed query engines: Presto [6], Spark SQL [3], and Apache Drill [1] use massively parallel processing (MPP) and in-memory computation to reduce query latencies.
  - Lessons learned: These systems highlight the importance of robust data partitioning (to minimize inter-node communication and balance workloads) and parallel processing frameworks. They also expose common challenges like load imbalance and excessive data transfers.
- Distributed Vector Databases Specifics:
  - Previous research [5, 7, 9, 10, 46, 68] has focused on improving query performance and ensuring accuracy in multi-machine settings. However, the paper points out that these studies often do not address performance guarantees under load imbalance or explore pruning strategies that reduce inter-machine workload.
- Faiss [25]: Mentioned as a state-of-the-art stand-alone ANNS query engine used as a baseline for comparison. Faiss is developed by Facebook AI Research and provides efficient implementations of ANNS algorithms, often leveraging quantization and clustering techniques.
- Auncel [68]: Another distributed ANNS system, mentioned for comparison. Auncel focuses on error-bound guarantees for low-search-load scenarios and uses a fixed partitioning strategy similar to pure vector-based partitioning. The paper notes Auncel shares Harmony-vector's limitation in load balancing under skewed workloads.
3.3. Technological Evolution
The evolution of ANNS can be broadly described as follows:
- Early ANNS (Single Machine): Initial efforts focused on optimizing search within a single machine using techniques like k-d trees, LSH (Locality Sensitive Hashing), and later, more sophisticated graph-based (e.g., HNSW) and cluster-based (e.g., IVFADC variants in Faiss) indexes. These methods significantly improved query speed compared to brute-force search but were limited by the memory and computational capacity of a single machine.
- Emergence of Distributed Needs: With the explosion of high-dimensional data (e.g., embeddings from deep learning) reaching billions of vectors, single-machine solutions became inadequate. This necessitated the development of distributed ANNS systems.
- Initial Distributed Approaches: Early distributed systems often adapted existing partitioning strategies from other domains (e.g., vector-based sharding) or basic cluster-based distributions. While these offered scalability in terms of data volume, they frequently encountered issues like load imbalance (some nodes becoming bottlenecks) and high communication overhead (data transfer between nodes becoming costly).
- Specialized Distributed Vector Databases: The recognition of unique challenges posed by high-dimensional vector data led to the development of specialized distributed vector databases. These aimed to optimize for vector similarity search specifically, rather than generic data management. However, many still struggled with effectively balancing workloads and minimizing communication in diverse query scenarios.

This paper's work, Harmony, fits into this timeline as an advanced specialized distributed vector database. It addresses the limitations of earlier distributed solutions by proposing a more sophisticated multi-granularity partitioning and an early-stop pruning mechanism specifically tailored to the characteristics of vector distance computations. It represents a step towards more robust and efficient large-scale ANNS in multi-node environments.
3.4. Differentiation Analysis
Compared to the main methods in related work, Harmony's core differences and innovations lie in its integrated approach to partitioning and query processing:
- Traditional Vector-based Partitioning (e.g., Harmony-vector, or Auncel's fixed strategy):
  - Difference: These methods assign entire vectors to nodes. While they have lower communication overhead per query (one-time data transfer), they are highly susceptible to load imbalance under skewed workloads (as shown in Figure 1a), where "hot" partitions can overload specific nodes.
  - Harmony's Innovation: Harmony mitigates this by dynamically choosing between vector-based and dimension-based partitioning using a cost model. It leverages vector-based partitioning for its low communication cost when workloads are balanced, but shifts towards dimension-based partitioning or a hybrid when imbalance is detected.
- Traditional Dimension-based Partitioning (e.g., Harmony-dimension):
  - Difference: These methods split vectors by their dimensions across nodes. They inherently provide load balance for distance computations (as each node contributes to every query's distance calculation, Figure 1b). However, they typically incur high communication overhead because partial results from all dimension-holding nodes must be collected and combined for each query.
  - Harmony's Innovation: Harmony adopts dimension-based partitioning principles but significantly reduces the communication overhead through a novel early-stop pruning mechanism and pipelined execution. This pruning, enabled by the monotonicity of distance computations, allows Harmony to skip processing on later dimension slices if a candidate is already deemed unpromising. This transforms a potential weakness (high communication) into a strength (efficient pruning).
- Existing Distributed Vector Databases (general):
  - Difference: Many existing systems focus on query performance and accuracy but don't explicitly address load imbalance under varying workloads or comprehensive inter-machine pruning strategies.
  - Harmony's Innovation: Harmony explicitly tackles load imbalance through its adaptive, multi-granularity partitioning guided by a cost model. It also introduces a sophisticated multi-machine pruning technique within a pipelined execution engine, making pruning a first-class citizen in the distributed query process. This distinguishes it by focusing on robustness and efficiency across a wider range of operational conditions.

In essence, Harmony's innovation lies in its hybrid and adaptive nature, intelligently combining the strengths of different partitioning methods and introducing a powerful dimension-level pruning mechanism that is deeply integrated with a pipelined distributed execution model.
4. Methodology
4.1. Principles
The core idea behind Harmony is to achieve high-throughput and robust Approximate Nearest Neighbor Search (ANNS) in a distributed environment by adaptively combining different data partitioning strategies and leveraging early pruning. The theoretical basis and intuition are rooted in two key observations:
- Varied Granularities of Partitioning: Vectors can be partitioned in different ways: either by assigning entire vectors to nodes (vector-based partition) or by splitting dimensions across nodes (dimension-based partition). Each has distinct advantages and disadvantages regarding computational load distribution and communication overhead. The principle is that a flexible system should dynamically choose or combine these granularities based on the current workload.
- Monotonicity of Distance Computations: For common distance metrics like squared Euclidean distance or cosine similarity (when normalized), the total distance or similarity can be cumulatively summed from partial contributions across disjoint dimensions. This monotonic property means that if a partial sum already indicates a poor candidate (e.g., the partial distance exceeds the best-found distance), then subsequent computations on remaining dimensions for that candidate are redundant. This intuition forms the basis for early-stop pruning.

Harmony's design integrates these principles: it uses a cost model to adaptively decide the optimal partitioning strategy (or a hybrid) to balance load and minimize communication, and then employs a pipelined execution engine with dimension-level pruning to exploit the monotonicity of distance calculations, further reducing computational and communication overhead.
4.2. Core Methodology In-depth (Layer by Layer)
Harmony is designed with two main modules: a fine-grained query planner and a flexible pipelined execution engine.
4.2.1. Fine-grained Query Planner
The fine-grained query planner is responsible for determining the most efficient partitioning and query strategy. It uses a cost model to estimate computational and communication overhead.
Cost Model
The cost model helps Harmony make informed decisions about partitioning and query execution. It considers partition definitions, query processing costs, and imbalance factors.
The following are the results from Table 1 of the original paper:
| Symbol | Meaning |
|---|---|
| $\pi$ | Partition plan |
| $\mathcal{B}_{\mathrm{dim}}(\pi)$, $\mathcal{B}_{\mathrm{vec}}(\pi)$ | Dimension- and vector-based blocks in $\pi$ |
| $b$, $s$ | Single block $b \in \mathcal{B}_{\mathrm{dim}}(\pi)$, $s \in \mathcal{B}_{\mathrm{vec}}(\pi)$ |
| $c_{\mathrm{comp}}^{*}(\cdot, q)$, $c_{\mathrm{comm}}^{*}(\cdot, q)$ | Computation/communication costs for query $q$ when processed in $\mathcal{B}_{\mathrm{dim}}$ or $\mathcal{B}_{\mathrm{vec}}$ |
| $\mathrm{Load}(n, \pi)$ | Total load on node $n$ under plan $\pi$ |
| $\bar{Z}(\pi)$ | Imbalance factor |
| $\alpha$ | Weight for imbalance in overall cost |
- Query Cost ($C_q(\pi)$): For each query $q$ in the set of queries $Q$, the cost model sums the computation and communication costs across two types of partitioned data: dimension-based partitions and vector-based partitions. $ C_q(\pi) = \sum_{b \in \mathcal{B}_{\mathrm{dim}}(\pi)} \left[ c_{\mathrm{comp}}^{\mathrm{dim}}(b, q) + c_{\mathrm{comm}}^{\mathrm{dim}}(b, q) \right] + \sum_{s \in \mathcal{B}_{\mathrm{vec}}(\pi)} \left[ c_{\mathrm{comp}}^{\mathrm{vec}}(s, q) + c_{\mathrm{comm}}^{\mathrm{vec}}(s, q) \right] $
  - $C_q(\pi)$: The total cost for processing a single query $q$ under a given partition plan $\pi$.
  - $\mathcal{B}_{\mathrm{dim}}(\pi)$: The set of dimension-based blocks defined by partition plan $\pi$.
  - $\mathcal{B}_{\mathrm{vec}}(\pi)$: The set of vector-based blocks (shards) defined by partition plan $\pi$.
  - $b$: A single dimension-based block.
  - $s$: A single vector-based block.
  - $c_{\mathrm{comp}}^{\mathrm{dim}}(b, q)$: The computation cost for query $q$ on a dimension-based block $b$.
  - $c_{\mathrm{comm}}^{\mathrm{dim}}(b, q)$: The communication cost for query $q$ related to a dimension-based block $b$.
  - $c_{\mathrm{comp}}^{\mathrm{vec}}(s, q)$: The computation cost for query $q$ on a vector-based block $s$.
  - $c_{\mathrm{comm}}^{\mathrm{vec}}(s, q)$: The communication cost for query $q$ related to a vector-based block $s$.

  This formula calculates the total cost for a query by summing up the costs incurred from all involved dimension-based and vector-based blocks.
- Imbalance Factor ($\bar{Z}(\pi)$): This factor measures how unevenly the workload is distributed across the nodes.
  - Node Load ($\mathrm{Load}(n, \pi)$): The total work a node $n$ performs under a given partition plan $\pi$ is calculated by summing the computation costs of all dimension-based and vector-based blocks assigned to that node for all queries in $Q$. $ \mathrm{Load}(n, \pi) = \sum_{q \in Q} \Big[ \sum_{b \in \mathcal{B}_{\mathrm{dim}}(\pi, n)} c_{\mathrm{comp}}^{\mathrm{dim}}(b, q) + \sum_{s \in \mathcal{B}_{\mathrm{vec}}(\pi, n)} c_{\mathrm{comp}}^{\mathrm{vec}}(s, q) \Big] $
    - $\mathrm{Load}(n, \pi)$: The total computational load on node $n$ for all queries under partition plan $\pi$.
    - $\mathcal{B}_{\mathrm{dim}}(\pi, n)$: Dimension-based blocks stored on node $n$.
    - $\mathcal{B}_{\mathrm{vec}}(\pi, n)$: Vector-based shards on node $n$.
  - Imbalance Factor Formula: The imbalance factor itself is defined as the standard deviation of the node loads: $ \bar{Z}(\pi) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \bigl[ \mathrm{Load}(n, \pi) - \overline{L} \bigr]^2} $
    - $\bar{Z}(\pi)$: The imbalance factor for partition plan $\pi$.
    - $N$: The total number of nodes in the system.
    - $\mathrm{Load}(n, \pi)$: The load on node $n$.
    - $\overline{L}$: The average load across all nodes, calculated as $\overline{L} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Load}(n, \pi)$.

  A higher value of $\bar{Z}(\pi)$ indicates a more significant load imbalance, meaning some nodes are disproportionately busy compared to others.
- Overall Cost Function ($C(\pi, Q)$): The planner combines the per-query costs and the imbalance factor into a single objective function to evaluate a partition plan: $ C(\pi, Q) = \sum_{q \in Q} C_q(\pi) + \alpha \cdot \bar{Z}(\pi) $
  - $C(\pi, Q)$: The total cost of a partition plan $\pi$ for processing the entire set of queries $Q$.
  - $C_q(\pi)$: The cost for a single query (as defined above).
  - $\alpha$: A user-defined weight that determines how much the system penalizes skew (imbalance) relative to local computation/communication efficiency. A higher $\alpha$ means the system prioritizes load balance more.

The goal of the fine-grained query planner is to find a partition plan $\pi^*$ that minimizes this overall cost function; a minimal sketch of this objective follows.
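To illustrate how the pieces of the objective fit together, here is a minimal Python sketch (an illustration under assumptions, not the paper's planner; the `plan` interface with `nodes`, `blocks`, `comp_cost`, and `comm_cost` is hypothetical):

```python
import math

def plan_cost(plan, queries, alpha):
    """Evaluate C(pi, Q) = sum_q C_q(pi) + alpha * Z(pi) for one candidate plan."""
    total = 0.0
    load = {n: 0.0 for n in plan.nodes}                 # Load(n, pi) per node
    for q in queries:
        for block, node in plan.blocks(q):              # dimension- and vector-based blocks
            comp = plan.comp_cost(block, q)
            total += comp + plan.comm_cost(block, q)    # C_q(pi) contribution
            load[node] += comp                          # only computation counts as load
    mean = sum(load.values()) / len(load)
    z = math.sqrt(sum((l - mean) ** 2 for l in load.values()) / len(load))
    return total + alpha * z                            # higher alpha favors balance

# The planner would search candidate plans and keep the minimizer:
# best_plan = min(candidate_plans, key=lambda p: plan_cost(p, Q, alpha))
```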
Query Load Distribution
After the cost model analyzes the workload and determines an optimal partition plan $\pi^*$, Harmony distributes queries across the cluster. This process involves several steps, as illustrated in Figure 4:
The following figure (Figure 4 from the original paper) illustrates Harmony's query distribution. $C_i$ denotes the id of cluster $i$, $Q_i$ denotes query block $i$, $V_i$ denotes the $i$-th block according to vector-based partition, $D_i$ denotes the $i$-th block according to dimension-based partition, and $M_i$ denotes the $i$-th machine.
This figure is a schematic diagram of query distribution in the Harmony system. The left part (a) shows the base vector distribution, including the different clusters and vectors. The right part (b) shows the query load distribution, indicating how query blocks are assigned to different machines and vector blocks. The figure visually reflects the multi-granularity partition strategy used to balance load and reduce communication overhead.
- Grid Formation: The partition plan defines a grid of dimension splits and vector splits. For example, as shown in Figure 4(a), three dimension partitions (D1-D3) and two vector partitions (V1, V2) create six blocks (V1D1 through V2D3). Each machine (M1-M6) in the cluster is assigned one of these blocks, establishing the base vector distribution.
- Identify Cluster Centroids (Client-side): As shown in Figure 4(b) (purple table), similar to other cluster-based ANNS engines, Harmony first identifies relevant cluster centroids on the client side. This step maps each incoming query vector to the appropriate portion of the base vector collection it needs to interact with. This is a crucial initial filtering step.
- Map Queries to Vector-based Blocks (Client-side): After centroid identification, Harmony determines which vector partitions each query should visit. For example, in Figure 4(b) (blue table), Q1 is mapped to V1. This means Q1 will be processed against the vectors residing in vector partition V1.
- Combine with Dimension Splits and Map to Machines (Client-side and Routers): Once the vector partitions are determined, Harmony further splits the queries based on dimension partitions. For Q1 mapped to V1, it will be split into Q1D1, Q1D2, and Q1D3, corresponding to the dimension partitions D1-D3. These query chunks are then routed to the corresponding machines (M1, M2, M3 for V1D1, V1D2, V1D3, respectively) for partial distance computations. This ensures that Q1D1, Q1D2, and Q1D3 are processed in parallel on separate machines (see the routing sketch at the end of this subsection).

Advantages: This multi-granularity distribution ensures that:
- The blocks (ViDj) are evenly distributed among machines.
- All available nodes are fully utilized.
- No single machine is overloaded, leading to balanced load and higher throughput.
Time Complexity:
- Centroid Assignment (Computation overhead): For $Q$ query vectors of dimensionality $D$ and $NC$ base centroids, finding the closest centroids typically costs $O(Q \times D \times NC)$. This is a standard operation in ANNS and not overhead specific to Harmony.
- Query Distribution (Communication overhead): With pure vector-based partitioning, a query vector interacts only with the machines holding its probed vector partitions. With Harmony's dimension-based partitioning, the number of interactions increases by a factor of $\mathcal{B}_{\mathrm{dim}}(\pi)$, the number of dimension-based splits. However, the total data transferred remains the same because the size of each split is reduced to $D / \mathcal{B}_{\mathrm{dim}}(\pi)$. Therefore, while the number of communication messages increases, the total amount of data transmitted does not, preventing additional communication overhead.

Space Complexity:
Harmony's space requirements are minimal because each base vector is stored on only one machine (no redundancy). The query set is typically small, so storing and routing query splits incurs negligible space costs. The overall space complexity remains $O(NB \times D)$, where $NB$ is the number of base vectors and $D$ is their dimensionality, matching standard distributed systems without extra replication.
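The routing sketch referenced above, in Python (illustrative only; the mapping helper and machine naming are assumptions mirroring the Figure 4 example): a query assigned to a vector partition is split into one chunk per dimension partition, and each chunk goes to the machine that owns the corresponding (vector, dimension) grid block.

```python
def route_query(query_id, vector_part, num_dim_parts, block_to_machine):
    """Split a query bound for one vector partition into per-dimension chunks,
    each routed to the machine owning that (vector_part, dim_part) grid block."""
    return [(f"Q{query_id}D{d + 1}", block_to_machine[(vector_part, d)])
            for d in range(num_dim_parts)]

# Grid from Figure 4: 2 vector partitions x 3 dimension partitions on machines M1-M6.
grid = {(v, d): f"M{v * 3 + d + 1}" for v in range(2) for d in range(3)}
print(route_query(1, vector_part=0, num_dim_parts=3, block_to_machine=grid))
# [('Q1D1', 'M1'), ('Q1D2', 'M2'), ('Q1D3', 'M3')] -- partial distances in parallel
```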
4.2.2. Flexible Pipelined Execution Engine
The flexible pipelined execution engine is designed to exploit the opportunity for distributed pruning, where partial distance computations can eliminate unpromising candidates early. The challenge is to maintain pruning ability while distributing computations across machines.
The following figure (Figure 5 from the original paper) illustrates Harmony's pipeline pruning and querying.
This figure is a schematic diagram of pipelined pruning and querying in the Harmony system. The left part (a) illustrates vector-level pruning, showing how pruning conditions are updated and query queues are processed in stages A and B. The right part (b) depicts the three stages (X, Y, Z) of dimension-level pruning, indicating the flow of computation and sending. The overall diagram highlights Harmony's design strategy for efficient querying.
Algorithm 1 (not explicitly numbered in the paper, but presented as pseudocode) describes the orchestration:
1 Function PrewarmHeap (QuerySet Q, int K): // Compute partial distances for a subset of queries to build an initial max-heap.
2 foreach q in Q_subset do
3 dist = ComputeDistance (q, randomVectors)
4 heap_insert(q, dist, K)
5 return heap
6 Function DimensionPipeline (q, DSet): // Process each dimension block in DSet sequentially.
7 foreach d in DSet do
8 partialDist ← ComputeDistance (q, d)
9 if partialDist > q.currentThreshold then
10 prune(q)
11 return
12 UpdatePruning (q, partialDist)
13 Function VectorPipeline (QueryBatch Qv, DSet): // Process a batch of queries for a specific vector partition.
14 foreach q in Qv do
15 DimensionPipeline (q, DSet)
16 if q.pruned == true then
17 continue
18 UpdatePruning (q, finalDist)
19 Function QUERYPIPELINE(QuerySet Q, VSetList VList, DSetList DList, int K): // Stage 0: Prewarming
20 heap = PrewarmHeap (Q, K) // Stage I: Vector-level pipeline
21 foreach vPart in VList do
22 Qv = filterQueries(Q, vPart)
23 VectorPipeline (Qv, DList)
- Prewarm Stage (Lines 1-5, PrewarmHeap function):
  - Purpose: To establish an initial pruning threshold.
  - Mechanism: A subset of queries (Q_subset) is processed on the client node. ComputeDistance calculates initial distances for these queries against some randomVectors (likely centroids or a small sample). These distances are then inserted into a max-heap of size $K$ (for top-$K$ search). The max-heap stores the $K$ smallest distances found so far, with the largest of these (the $K$-th best distance) serving as the initial pruning threshold (q.currentThreshold).
- Vector Pipeline (Lines 13-18, VectorPipeline function):
  - Purpose: To process queries in batches corresponding to vector partitions and perform coarse-grained pruning.
  - Mechanism: It iterates through each vector partition (vPart) in VList. For each vPart, it filters the QuerySet to get a QueryBatch Qv relevant to that partition. Then, for each query in Qv, it calls DimensionPipeline (Line 15) to handle partial distance computations. After the DimensionPipeline completes, if the query has been pruned (Line 16), it continues to the next query. Otherwise, it updates the global pruning thresholds with the finalDist (Line 18). This ensures that each machine specializes in searching a particular vector partition. The max-heap is refined iteratively, so subsequent query batches benefit from tighter pruning thresholds.
- Dimension Pipeline (Lines 6-12, DimensionPipeline function):
  - Purpose: To perform fine-grained, early-stop pruning by processing dimension blocks sequentially for a single query.
  - Mechanism: For a given query $q$, it iterates through each dimension block $d$ in DSet (which represents the dimension partitions for that query).
    - It computes a partial distance (partialDist) for the current dimension block (Line 8).
    - It then compares this partialDist to the query's currentThreshold (Line 9).
    - If partialDist > q.currentThreshold, the candidate vector can be pruned immediately (Line 10), and the function returns (Line 11), skipping any further computation on subsequent dimension blocks.
    - If not pruned, UpdatePruning (Line 12) is called, which updates q.currentThreshold to reflect the accumulated partial distance, potentially making the threshold tighter for subsequent blocks.
Overall Workflow (QUERYPIPELINE function, Lines 19-23):
The QUERYPIPELINE orchestrates the entire process. First, it PrewarmHeap (Line 20) to get an initial threshold. Then, it enters a loop over vector partitions (vPart in VList). For each vPart, it filters queries (filterQueries) and then calls VectorPipeline (Line 23) to process that batch of queries.
Example (Figure 5):
- Vector-level pipeline (Figure 5a): Queries Q1-Q3 are processed for V1 in Stage A, and Q4-Q6 for V2. After Stage A, the global pruning threshold is updated. Stage B then processes Q4-Q6 for V1 and Q1-Q3 for V2. The pruning threshold established after Stage A benefits Stage B.
- Dimension-level pipeline (Figure 5b): This illustrates how dimensions of different queries are processed across machines M1-M3 (see the scheduling sketch below).
  - Stage X: Q1's D1 is on M1, Q2's D2 on M2, and Q3's D3 on M3. Each machine computes a partial distance.
  - Stage Y: Q1's D2 moves to M2, Q2's D3 to M3, and Q3's D1 to M1. Partial results from Stage X are accumulated, and pruning checks occur.
  - Stage Z: The remaining dimension blocks are processed.

Harmony ensures that different dimensions of the same query vector are processed in non-consecutive stages and on different machines. This minimizes synchronization overhead and allows partial result transmission to overlap with computation.
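One way to picture this staging, as a hedged Python sketch (the exact rotation rule is an assumption inferred from Figure 5b): in stage $t$, query $i$ works on dimension block $(i + t) \bmod B$, so at every stage each machine handles a different query's slice.

```python
def rotation_schedule(num_queries, num_dim_blocks, num_stages):
    """Round-robin schedule: in stage t, query i processes dimension block
    (i + t) % num_dim_blocks on the machine that owns that block."""
    return [[(f"Q{i + 1}", f"D{(i + t) % num_dim_blocks + 1}")
             for i in range(num_queries)]
            for t in range(num_stages)]

for t, stage in enumerate(rotation_schedule(3, 3, 3)):
    print(f"Stage {chr(ord('X') + t)}: {stage}")
# Stage X: [('Q1', 'D1'), ('Q2', 'D2'), ('Q3', 'D3')]
# Stage Y: [('Q1', 'D2'), ('Q2', 'D3'), ('Q3', 'D1')]  -- matches Figure 5b's rotation
```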
Complexity Analysis:
Let $Q$ be the number of query vectors, $NB$ the number of base vectors, $D$ the dimensionality, $n_{\mathrm{list}}$ the total number of clusters, and $n_{\mathrm{probe}}$ the number of probed clusters per query.
- Time Complexity: In a centralized system, the theoretical total operations would be $Q \times (n_{\mathrm{probe}} / n_{\mathrm{list}}) \times NB \times D^2$ (the numerator of the expression below). With Harmony, work is distributed across machines. Under a balanced load, each machine handles: $ \frac{Q \times ( n_{\mathrm{probe}} / n_{\mathrm{list}} ) \times NB \times D^2}{\mathcal{B}_{\mathrm{vec}}(\pi) \times \mathcal{B}_{\mathrm{dim}}(\pi)} $ Thus, the final per-machine cost scales as: $ O\left( \frac{Q \cdot n_{\mathrm{list}} \cdot NB \cdot D^2}{n_{\mathrm{probe}} \, \mathcal{B}_{\mathrm{vec}}(\pi) \, \mathcal{B}_{\mathrm{dim}}(\pi)} \right) $
  - Explanation: The total work is divided by the product of the number of vector blocks and dimension blocks, effectively distributing the computational burden. The pruning mechanism further reduces the effective dimension term, as not all dimensions are processed for all candidates.
- Communication Overhead: On average, each query interacts with $n_{\mathrm{probe}} \times \mathcal{B}_{\mathrm{vec}}(\pi) / n_{\mathrm{list}}$ vector blocks and, within each, $\mathcal{B}_{\mathrm{dim}}(\pi)$ dimension blocks. While the number of communication events increases by $\mathcal{B}_{\mathrm{dim}}(\pi)$ times (due to multiple dimension slices), the amount of data transferred per event is reduced by the same factor. The communication required for query distribution is proportional to: $ \left( \frac{NB}{\mathcal{B}_{\mathrm{vec}}(\pi)} \right) \times \left( n_{\mathrm{probe}} \times \frac{\mathcal{B}_{\mathrm{vec}}(\pi)}{n_{\mathrm{list}}} \right) \times \frac{D}{\mathcal{B}_{\mathrm{dim}}(\pi)} \times \mathcal{B}_{\mathrm{dim}}(\pi) $
  - Explanation: This formula represents:
    - $NB / \mathcal{B}_{\mathrm{vec}}(\pi)$: The average number of base vectors handled per vector block.
    - $n_{\mathrm{probe}} \times \mathcal{B}_{\mathrm{vec}}(\pi) / n_{\mathrm{list}}$: The number of vector blocks probed per query.
    - $D / \mathcal{B}_{\mathrm{dim}}(\pi)$: The size of each dimension slice.
    - $\mathcal{B}_{\mathrm{dim}}(\pi)$: The number of dimension slices.
  - The last two terms ($\frac{D}{\mathcal{B}_{\mathrm{dim}}(\pi)} \times \mathcal{B}_{\mathrm{dim}}(\pi)$) effectively cancel out to $D$. This means the total communication overhead in Harmony remains identical to a purely vector-based scheme: the finer partitioning of dimensions does not increase the total amount of transmitted data, it only reorganizes it into smaller, parallelizable chunks.
- Space Complexity: Each of the $NB$ base vectors is split among $\mathcal{B}_{\mathrm{vec}}(\pi)$ vector blocks and $\mathcal{B}_{\mathrm{dim}}(\pi)$ dimension blocks. Each block stores approximately: $ \frac{NB \times D}{\mathcal{B}_{\mathrm{vec}}(\pi) \times \mathcal{B}_{\mathrm{dim}}(\pi)} $ elements.
  - Explanation: Since there is no data duplication, the total storage remains $O(NB \times D)$. Intermediate query results are small and negligible. Thus, the space cost is $O(NB \times D)$, matching standard distributed systems.
Load Balancing Strategies:
While the initial distribution aims for balance, pruning itself can introduce imbalance: later dimension slices are more likely to be pruned, reducing work for the machines holding them. To counter this, Harmony dynamically adjusts the execution order of dimensions. If a machine (e.g., the one holding dimension block D1) becomes overloaded, subsequent queries (e.g., Q2) are reconfigured to process D1 last. This leverages the higher pruning effectiveness in later stages to alleviate the load on overloaded machines, preventing bottlenecks. A hedged sketch of this reordering follows.
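A minimal sketch of this load-aware reordering, assuming per-machine load estimates are available (illustrative Python; the paper gives no pseudocode for this step):

```python
def reorder_dims_for_query(dim_owner, machine_load):
    """Order a query's dimension slices so those owned by the busiest machines
    run last, where tighter thresholds make pruning most likely.

    dim_owner: dict mapping dimension block id -> machine id.
    machine_load: dict mapping machine id -> current load estimate.
    """
    return sorted(dim_owner, key=lambda d: machine_load[dim_owner[d]])

# Toy usage: D1 lives on overloaded machine M1, so it is scheduled last.
order = reorder_dims_for_query({"D1": "M1", "D2": "M2", "D3": "M3"},
                               {"M1": 0.9, "M2": 0.4, "M3": 0.2})
print(order)  # ['D3', 'D2', 'D1']
```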
4.3. Implementation
Harmony is implemented in C++ (approximately 6,000 lines of code), leveraging modern C++ features for efficiency.
- Communication: OpenMPI is used for inter-node communication. Workers issue non-blocking operations (MPI_Isend / MPI_Irecv) for data and partial results, enabling communication and local computation to overlap (a sketch of this pattern follows this list).
- Computation: Intel MKL (Math Kernel Library) accelerates fundamental operations like L2 (squared Euclidean distance) and inner-product computations. OpenMP is used to parallelize distance calculations within a worker node.
- Control Flow: The master node sends updated top-k heaps (as pruning thresholds) to worker nodes during query execution.
- Parameters: Harmony offers configurable parameters:
  - Node count: Specifies the number of nodes in the distributed index.
  - Pruning_Configuration [nargs=0..1]: Enables or disables vector- and dimension-level pruning.
  - Indexing_Parameters [e.g., nlist, nprobe, dim]: Controls clustering granularity and search scope to balance recall, latency, and memory usage.
    - nlist: Number of inverted lists (clusters).
    - nprobe: Number of inverted lists to probe during search.
    - dim: Dimensionality of the vectors.
  - $\alpha$: A user-defined weight in the cost model to control the preference between load balancing and throughput.
  - Mode [Harmony, Harmony-vector, Harmony-dimension]: Specifies the operational mode (the full Harmony adaptive strategy, or the pure vector-based / dimension-based baselines).
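The overlap pattern referenced in the Communication bullet, as a hedged sketch. Harmony itself is C++ over OpenMPI; this Python (mpi4py) analogue only mirrors the non-blocking Isend/Irecv structure, with made-up buffers, and would be launched under mpirun:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Post non-blocking transfers of partial results to the next pipeline stage...
partial = np.random.rand(1024).astype(np.float32)   # stand-in partial distances
recv_buf = np.empty(1024, dtype=np.float32)
send_req = comm.Isend(partial, dest=(rank + 1) % size, tag=0)
recv_req = comm.Irecv(recv_buf, source=(rank - 1) % size, tag=0)

# ...and overlap them with local distance computation on the next batch.
local = np.random.rand(4096, 64).astype(np.float32)
query = np.random.rand(64).astype(np.float32)
dists = ((local - query) ** 2).sum(axis=1)          # local L2 work proceeds

MPI.Request.Waitall([send_req, recv_req])           # synchronize before reusing buffers
```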
5. Experimental Setup
5.1. Datasets
The experiments utilize ten open-source datasets (Table 2), chosen for their diverse sizes, dimensions, and data types, which are representative of real-world ANNS scenarios. These datasets are commonly used in ANN system evaluations. The Star Light Curves and Hand Outlines datasets were expanded to ensure sufficient data for distributed processing.
The following are the results from Table 2 of the original paper:
| Dataset | Size | Dim | Query Size | Data Type |
|---|---|---|---|---|
| Star Light Curves | 823,600 | 1024 | 1,000 | Time Series |
| Msong | 992,272 | 420 | 1,000 | Audio |
| Sift1M | 1,000,000 | 128 | 10,000 | Image |
| Deep1M | 1,000,000 | 256 | 1,000 | Image |
| Word2vec | 1,000,000 | 300 | 1,000 | Word Vectors |
| Hand Outlines | 1,000,000 | 2709 | 370 | Time Series |
| Glove1.2m | 1,193,514 | 200 | 1,000 | Text |
| Glove2.2m | 2,196,017 | 300 | 1,000 | Text |
| SpaceV1B | 1,000,000,000 | 100 | 10,000 | Text |
| Sift1B | 1,000,000,000 | 128 | 10,000 | Image |
- Star Light Curves (Time Series): A dataset representing light intensity measurements over time, often used in astronomy. The high dimensionality (1024) indicates a rich temporal feature set.
- Msong (Audio): Audio features, typically extracted from music or speech. Its dimensionality (420) is moderately high.
- Sift1M / Sift1B (Image): Scale-Invariant Feature Transform (SIFT) descriptors are classic image features. Sift1M (1 million vectors) is a common benchmark, and Sift1B (1 billion vectors) represents a very large-scale challenge. Dimensionality is 128.
- Deep1M (Image): Features extracted from deep neural networks for image data, with a dimensionality of 256.
- Word2vec (Word Vectors): Embeddings that represent words in a continuous vector space, capturing semantic relationships. Dimensionality is 300.
- Hand Outlines (Time Series): Features describing the outlines of hands, often used in gesture recognition or biometrics. Its very high dimensionality (2709) poses significant challenges.
- Glove1.2m / Glove2.2m (Text): Global Vectors for Word Representation (GloVe) are pre-trained word embeddings. These datasets represent text data with dimensionalities of 200 and 300, respectively, and are of varying sizes.
- SpaceV1B (Text): A very large-scale text embedding dataset with 1 billion vectors and 100 dimensions.

These datasets were chosen because they:
- Cover a wide range of data types (image, audio, text, time series).
- Vary significantly in size (from ~800K to 1 billion vectors) and dimensionality (from 100 to 2709), allowing for robust evaluation under different scales and complexities.
- Are widely recognized benchmarks in the ANNS community, enabling comparison with other research.
- Have characteristics (especially high dimensionality) that make them suitable for validating Harmony's strengths in handling memory and computational challenges in distributed settings.
5.2. Evaluation Metrics
The paper primarily uses Queries Per Second (QPS), Recall, and Speedup to evaluate performance.
- Queries Per Second (QPS):
  - Conceptual Definition: QPS measures the processing capacity of an ANNS system, indicating how many search queries it can successfully execute per second. A higher QPS value signifies greater efficiency and throughput, which is critical for real-time applications.
  - Mathematical Formula: While not explicitly provided in the paper, QPS is generally calculated as: $ \text{QPS} = \frac{\text{Total number of queries processed}}{\text{Total time taken to process all queries (in seconds)}} $
  - Symbol Explanation:
    - Total number of queries processed: The count of individual search requests submitted to and completed by the system.
    - Total time taken to process all queries: The cumulative duration, in seconds, from the submission of the first query to the completion of the last query, including any system overhead.
- Recall:
  - Conceptual Definition: In the context of ANNS, recall quantifies the effectiveness of the approximate search algorithm. It measures the proportion of the true nearest neighbors (those found by an exact search) that are successfully retrieved by the approximate search within its top-K results. A higher recall value indicates a more accurate approximate search (see the worked example after this list).
  - Mathematical Formula: $ \text{Recall} = \frac{\text{Number of true nearest neighbors found by ANNS}}{\text{Total number of true nearest neighbors (from exact search)}} $
  - Symbol Explanation:
    - Number of true nearest neighbors found by ANNS: The count of items that are both in the ground-truth set of nearest neighbors and were returned by the approximate search query.
    - Total number of true nearest neighbors: The total count of actual nearest neighbors for a given query, typically obtained by performing a brute-force (exact) search.
- Speedup:
  - Conceptual Definition: Speedup measures the performance improvement gained by using a parallel or optimized system compared to a baseline serial or less optimized system. It shows how many times faster the new system is.
  - Mathematical Formula: $ \text{Speedup} = \frac{\text{Time taken by baseline system}}{\text{Time taken by proposed system}} = \frac{\text{Throughput of proposed system}}{\text{Throughput of baseline system}} $
  - Symbol Explanation:
    - Time taken by baseline system: The time required for the baseline (e.g., single-node Faiss) to complete a task.
    - Time taken by proposed system: The time required for the Harmony system to complete the same task.
    - Throughput of proposed system: The QPS achieved by the Harmony system.
    - Throughput of baseline system: The QPS achieved by the baseline system.
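As a quick worked example of the recall metric (illustrative Python, not from the paper):

```python
def recall_at_k(retrieved_ids, ground_truth_ids):
    """Recall@K: fraction of the true top-K neighbors the ANNS actually returned."""
    return len(set(retrieved_ids) & set(ground_truth_ids)) / len(ground_truth_ids)

# The ANNS returned 10 candidates; 9 of the true top-10 neighbors are among them.
approx = [3, 7, 12, 25, 31, 40, 44, 52, 60, 99]
truth  = [3, 7, 12, 25, 31, 40, 44, 52, 60, 61]
print(recall_at_k(approx, truth))  # 0.9
```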
5.3. Baselines
Harmony is compared against the following baselines:
- Faiss [25]: The state-of-the-art stand-alone ANNS query engine, serving as the primary single-node baseline. For fairness, all methods (including Harmony and its variants) adopt the same clustering algorithm and number of clusters as Faiss for index construction.
- Harmony-vector (pure vector-based partitioning): A baseline configuration within the Harmony framework that uses only vector-based partitioning, distributing entire vectors to different nodes. This setup is used to assess the performance of traditional partitioning without the benefits of dimension splitting or adaptive strategies.
- Harmony-dimension (pure dimension-based partitioning): Another baseline configuration within the Harmony framework that uses only dimension-based partitioning, distributing vector dimensions across nodes. This setup helps evaluate the inherent load balancing and communication characteristics of dimension-based approaches before Harmony's optimizations.
- Auncel [68]: A distributed ANNS system that provides error-bound guarantees for low search load. Auncel uses a fixed partitioning strategy similar to Harmony-vector. It is referenced for comparison, particularly regarding its limitations under skewed workloads.

These baselines are chosen to:
- Demonstrate Harmony's performance against a highly optimized single-node solution (Faiss).
- Isolate and quantify the benefits of Harmony's hybrid multi-granularity partitioning and pruning by comparing it against pure vector-based and dimension-based distributed setups.
- Contextualize Harmony's advancements within the landscape of existing distributed ANNS systems (Auncel).
5.4. Platform
The experiments were conducted on a robust distributed system with the following specifications:
- Nodes: 20 nodes in total.
- CPU: Intel(R) Xeon(R) Gold 6258R, featuring 28 cores and 56 threads per node.
- Memory: 256GB per node, configured in a NUMA (Non-Uniform Memory Access) architecture.
- Network: High-speed 100 Gb/s interconnect between nodes. This fast network minimizes potential communication bottlenecks.
- Operating System: CentOS Stream 8.
- Instruction Set: Each node supports AVX-512, an advanced instruction set for single instruction, multiple data (SIMD) operations, which is beneficial for accelerating vector computations.

This platform provides a high-performance environment suitable for demanding distributed workloads, allowing for accurate evaluation of Harmony's scalability and efficiency.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. QPS-recall trade-off under uniform workloads
The following figure (Figure 6 from the original paper) shows the query per second (QPS) performance of Harmony and other algorithms across different datasets such as StarLightCurves, Msong, and Sift1M at varying recall rates.
This figure is a chart showing the queries-per-second (QPS) performance of Harmony and other algorithms at different recall rates on various datasets (such as StarLightCurves, Msong, and Sift1M). The figure shows that Harmony delivers higher query performance across recall levels.
The QPS-recall trade-off analysis (Figure 6) compares Harmony (running on four worker nodes) against Faiss (on a single node) under uniform workloads. For very large datasets like SpaceV1B and Sift1B, Harmony used 16 nodes, as Faiss and a 4-node setup could not handle their scale.
Key Findings:
- Scalability and Average Speedup: All Harmony variants (Harmony, Harmony-vector, Harmony-dimension) demonstrate strong scalability. On average, they outperform Faiss by 3.75x in terms of speedup. This indicates that distributing the workload across multiple nodes, even with basic partitioning, significantly boosts performance compared to a single-node solution.
- Exceeding Theoretical Speedup with Pruning: Under high recall precision conditions (e.g., closer to 1.0 recall), Harmony frequently exceeds the theoretical maximum speedup of 4x (for a 4-node system), achieving an impressive 4.63x speedup. This superlinear speedup is attributed to Harmony's pruning mechanism, which effectively reduces the overall computational overhead, making the distributed system more efficient than a linear scaling of single-node performance.
- Optimal Performance for Harmony-vector at Lower Precision: When the recall is lower than 99%, the Harmony-vector partitioning strategy (pure vector-based) shows the optimal performance among the Harmony variants. This suggests that the overhead introduced by dimension-based partitioning and its associated communication for Harmony and Harmony-dimension might outweigh the benefits of pruning or load balancing at lower precision targets, where less stringent search criteria mean that simpler, communication-light strategies perform better.
6.1.2. Search performance under skewed workloads
For skewed workload experiments, the analysis focuses on 8 relatively smaller datasets, as SpaceV1B and Sift1B are too large for a 4-node setup. The experiments use four distinct nodes and measure QPS for three Harmony strategies under varying levels of load imbalance, quantified by variance (as defined in Section 4.2.1).
The following figure (Figure 7 from the original paper) shows the impact of load distribution on query performance.

Key Findings:
- Traditional Vector Partition's Vulnerability to Imbalance: The Harmony-vector (traditional vector-based partition) method performs poorly under load-imbalance scenarios. As the load variance increases (indicating more severe skew), its average QPS decreases by 56% on average. This validates the paper's initial motivation that traditional partitioning struggles with uneven workloads.
- Robustness of Harmony and Harmony-dimension: Both Harmony and Harmony-dimension partitioning methods handle load imbalance much more favorably. They show no clear performance degradation even as load skew changes significantly. This highlights the inherent load-balancing capability of dimension-based approaches.
- Harmony's Adaptive Advantage for Extreme Skew: Even though Harmony-dimension is stable, Harmony (the hybrid adaptive approach) maintains an advantage. When the load is extremely imbalanced, Harmony outperforms Harmony-dimension by 91%. This demonstrates the effectiveness of Harmony's cost model in dynamically adapting the partitioning strategy to maintain high performance under severely skewed workloads, leveraging the synergy between dimension-split and vector-split strategies.
- Consistency for Larger/Higher-Dimensional Data: Harmony consistently retains its advantage for datasets with larger vector sizes or higher feature dimensions. This indicates that its adaptive strategy is particularly beneficial as data complexity and distribution skew worsen, providing robust performance.
6.2. Performance Ablation Study
6.2.1. Time breakdown
This section analyzes how time is spent across data communication, computation, and other overhead for Harmony, Harmony-vector, and Harmony-dimension.
The following figure (Figure 8 from the original paper) shows the time breakdown across data communication, computation, and other overhead for Harmony, Harmony-vector, and Harmony-dimension.

Key Findings:
- Communication Overhead:
Harmony-vectorconsistently incurslower communication overhead. This is expected as it sends entire query vectors to specific nodes and requires less inter-node communication for partial results.- Both
HarmonyandHarmony-dimensionincurcommunication overhead, withHarmony-dimensiongenerally havinghigher communication overheaddue to moredimension slicingand frequent exchange of partial results.Harmonyhas less dimension partitioning compared toHarmony-dimension, thus its communication time is lower.
- Computational Overhead:
  - Despite similar computation overhead between Harmony-dimension and Harmony-vector in some cases (likely because the total sum of partial computations is similar before pruning), Harmony generally exhibits lower computational overhead than the other two. This reduction is primarily attributed to Harmony's pruning module, which eliminates unnecessary distance calculations.
- Communication vs. Computation:
  - For all three methods, the main overheads are concentrated in communication and computation.
  - Crucially, communication overhead is independent of dimensionality. This means that for datasets with larger dimensions (e.g., Msong with 420 dimensions), the communication overhead becomes significantly lower relative to the computation overhead; conversely, for lower-dimensional datasets (e.g., Sift1M with 128 dimensions), communication overhead is proportionally higher. As dimensions grow, computation becomes the dominant factor, making optimizations like pruning even more impactful (a back-of-the-envelope model follows below).
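A back-of-the-envelope model (my illustration, not the paper's cost model) makes this trend concrete: per-candidate compute grows with the number of dimensions a node holds, while each exchanged partial result is a fixed-size scalar, so the communication share shrinks as dimensionality grows. The constants below are arbitrary and chosen only to show the trend:

```python
# Illustrative constants only; real costs depend on hardware and network.
COMPUTE_PER_DIM = 1.0    # hypothetical cost units per dimension
COMM_PER_RESULT = 50.0   # hypothetical fixed cost per partial result

for dim in (128, 420, 960):
    compute = COMPUTE_PER_DIM * dim
    comm_share = COMM_PER_RESULT / (COMM_PER_RESULT + compute)
    print(f"d={dim:4d}: communication share of total ≈ {comm_share:.0%}")
# d= 128: ≈ 28%    d= 420: ≈ 11%    d= 960: ≈ 5%
```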
6.2.2. Contribution of optimization techniques
This section dissects the speedup provided by Harmony's key features: balanced load, pipeline and asynchronous execution, and pruning, measured on a 4-node setup.
The following figure (Figure 9 from the original paper) shows the contribution of the three optimization techniques to Harmony's query throughput.

Key Findings:
- Individual Performance Improvements: Each optimization technique contributes positively to performance (a rough composition of these factors is sketched below).
  - Balanced load: provides an average 1.75x increase in throughput, confirming the importance of distributing work evenly.
  - Pipeline and asynchronous execution: achieves a 1.25x improvement, highlighting the benefits of overlapping communication and computation and of efficient task sequencing.
  - Pruning: delivers a 1.51x improvement in throughput.
- Pruning as the Most Impactful Optimization: The results confirm that pruning is the most impactful optimization technique for improving Harmony's query throughput, which aligns with the motivation in Section 3.1 on the potential of early stopping.
- Dataset-Specific Impact: For the Sift1M dataset, the performance improvements from balanced load and pipeline execution are less pronounced. This is attributed to the relatively uniform load distribution in Sift1M, where the benefits of balancing and asynchronous execution are naturally diminished. However, pruning still yields a clear throughput improvement for Sift1M, indicating its robustness regardless of the initial data distribution.
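As a back-of-the-envelope composition (my arithmetic, not the paper's): if the three reported factors acted independently and multiplicatively, their combined effect would be

$$1.75 \times 1.25 \times 1.51 \approx 3.30\times,$$

though in practice the contributions interact (for example, pruning changes the load that balancing must equalize), so this product should be read only as a rough aggregate.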
6.2.3. Pruning ratio breakdown
This experiment evaluates the effectiveness of Harmony's pruning strategy (discussed in Section 4.3) in reducing computational overhead. It uses a dimensional split of size 4 (meaning vectors are split into four dimension blocks) and measures the pruning ratio for each slice.
The following are the results from Table 3 of the original paper:
| Dataset | First Slice (%) | Second Slice (%) | Third Slice (%) | Fourth Slice (%) | Average Pruning Ratio (%) |
|---|---|---|---|---|---|
| Msong | 0.00 | 43.14 | 76.06 | 95.29 | 53.87 |
| Glove1.2m | 0.00 | 1.54 | 30.71 | 86.66 | 29.73 |
| Word2vec | 0.00 | 24.85 | 53.77 | 83.66 | 40.32 |
| Deep1M | 0.00 | 7.67 | 66.09 | 97.36 | 42.03 |
| Sift1m | 0.00 | 41.76 | 85.04 | 98.40 | 57.05 |
| Star | 0.00 | 81.24 | 95.23 | 99.05 | 69.14 |
| Glove2.2m | 0.00 | 5.14 | 30.70 | 81.18 | 29.76 |
| Hand | 0.00 | 63.54 | 91.62 | 98.10 | 63.83 |
Key Observations:
- Increasing Pruning Rates in Later Slices: Pruning rates are significantly higher in later slices. The first slice consistently shows 0.00% pruning (it is the initial computation, before a meaningful threshold is established). The average pruning ratio is 33.61% for the second slice, 66.15% for the third, and 92.33% for the fourth. This demonstrates that subsequent query slices benefit from accumulated results and a tighter pruning threshold from earlier slices, as discussed in Section 3.1 (see the check after this list).
- Dataset-Dependent Pruning Effectiveness: Pruning rates vary considerably across datasets. For example, Glove1.2M's third slice achieves only 30.71% pruning, whereas Star's third slice reaches 95.23%. This variance is attributed to differences in dataset distributions; some datasets (e.g., Star) lend themselves more readily to early pruning than others (e.g., the Glove datasets), likely due to their inherent clustering or sparsity.
- High Pruning in the Final Slice: The pruning rate in the final slice exceeds 80% for all datasets, often reaching over 95% (Msong, Deep1M, Sift1m, Star, Hand). This strongly indicates that for the vast majority of candidate vectors, distance computations do not need to process the last dimensions; most vectors are pruned before the final dimension slice is reached, significantly reducing computational load.
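As a quick sanity check, the per-slice averages quoted above follow from simple column means over Table 3; the fourth-slice mean computes to ≈92.46%, slightly above the quoted 92.33%, suggesting the paper's averages are rounded or weighted slightly differently:

```python
# Column means over the per-slice pruning ratios from Table 3.
slices = {
    "second": [43.14, 1.54, 24.85, 7.67, 41.76, 81.24, 5.14, 63.54],
    "third":  [76.06, 30.71, 53.77, 66.09, 85.04, 95.23, 30.70, 91.62],
    "fourth": [95.29, 86.66, 83.66, 97.36, 98.40, 99.05, 81.18, 98.10],
}
for name, vals in slices.items():
    print(f"{name} slice mean: {sum(vals) / len(vals):.2f}%")
# second ≈ 33.61%, third ≈ 66.15%, fourth ≈ 92.46%
```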
6.3. Index Build Experiments
6.3.1. Index build time
This experiment measures the time overhead for index construction across Harmony-vector, Harmony-dimension, Harmony (all in a 4-node environment), and Faiss (single-machine). The index construction process is broken down into three stages: training the clustering centers (Train), distributing base vectors to clustering centers (Add), and assigning parts of the index to different machines (Pre-assign).
The following figure (Figure 10 from the original paper) shows Harmony's index build time breakdown.

Key Observations:
- Similar Train and Add Times: The train and add times for all methods (including Faiss and the Harmony variants) are similar. This suggests that Harmony does not introduce significant overhead in the core clustering and initial vector distribution steps, and that it scales well in these phases without major modifications to the underlying index structure.
- Longer Pre-assign Time for Harmony-dimension and Harmony: The pre-assign time is longer for Harmony-dimension and Harmony than for Harmony-vector, because these two methods involve dimension-based partitioning, which requires additional steps such as allocating space and initializing intermediate results for distance computation. This overhead depends on the data size (e.g., approximately twice as long for Glove2.2m as for Glove1.2m).
- Proportionality to Dimensionality: For datasets of the same size, the train and add times are generally proportional to the dataset's dimensionality. This is expected, because both training (e.g., clustering) and distribution (adding vectors to lists) scale linearly with the number of dimensions.
6.3.2. Index space overhead
This experiment compares the memory footprint of Harmony's index with Faiss and the baseline Harmony configurations (4 nodes).
The following are the results from Table 4 of the original paper:
| Dataset | Faiss | Harmony-vector | Harmony-dimension | Harmony |
|---|---|---|---|---|
| StarLightCurves | 3.2GB | 788MB | 815MB | 798MB |
| Msong | 1.6GB | 411MB | 418MB | 413MB |
| Sift1M | 497MB | 126MB | 131MB | 128MB |
| Deep1M | 986MB | 245MB | 253MB | 250MB |
| Word2Vec | 1.2GB | 258MB | 295MB | 279MB |
| HandOutlines | 6.1GB | 1.50GB | 1.54GB | 1.51GB |
| Glove1.2M | 921MB | 227MB | 238MB | 233MB |
| Glove2.2M | 2.5GB | 660MB | 697MB | 686MB |
Key Points:
- Proportionality to Dataset Size and Dimensions: The index overhead (memory usage) is directly proportional to both dataset size and dimensionality. For instance, Msong's index is about 3.2 times larger than Sift1M's, which matches their dimension × dataset size products (Msong: 992,272 vectors × 420 dims ≈ 416M; Sift1M: 1,000,000 vectors × 128 dims = 128M; a ratio of roughly 3.26). This confirms the expected memory scaling with data characteristics.
- Efficiency of Distributed Indexing: All three Harmony partitioning schemes (Harmony-vector, Harmony-dimension, Harmony) occupy approximately 1/4 of the space of Faiss (the single-machine index), because the data is distributed across four nodes. This indicates that distributed indexing does not introduce significant additional space overhead from replication or complex distributed structures, and it effectively alleviates memory pressure on individual nodes.
- Minimal Additional Overhead for Harmony and Harmony-dimension: Both Harmony and Harmony-dimension incur some additional space overhead compared to Harmony-vector, but it is minimal, only about 2% of the original space. The extra data structures and intermediate results needed for dimension-based partitioning and the adaptive strategy are therefore very lightweight.
6.4. Additional Discussion
6.4.1. Impact of dimensions and dataset size on Harmony performance
The paper constructs Gaussian-distributed datasets with dimensions ranging from 64 to 512 and sizes from 250K to 1M to study the impact on Harmony's speedup using four nodes.
The following figure (Figure 11 from the original paper) shows the impact of pruning effectiveness and indexing parameters on search performance.

Key Findings (Figure 11a):
- Performance Improvement with Increased Size and Dimension: As both dataset size and dimension increase, Harmony's speedup improves (an extrapolation sketch follows this list).
  - For each doubling of the dimension, the speedup increases by 26.8%.
  - For each doubling of the dataset size, the speedup increases by 25.9%.
  This indicates that Harmony is particularly effective for larger and higher-dimensional datasets, which are typical challenges in modern ANNS.
- Superlinear Speedup in Large Datasets: On large datasets (1M vectors, 512 dimensions), Harmony exhibits a speedup greater than the number of machines (i.e., greater than 4x). This superlinear speedup is attributed to the pruning techniques, which reduce the overall computational overhead, allowing the distributed system to perform disproportionately better.
- Suboptimal Performance for Smaller Datasets: Harmony's performance is suboptimal for smaller datasets, where communication overhead dominates overall performance. For small datasets, the benefits of distribution and pruning may not outweigh the costs of inter-node communication and coordination, making the relative overhead more significant.
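An extrapolation sketch from the two reported growth rates (my assumption that the per-doubling gains compose multiplicatively and independently; the paper only reports the two average rates, and the anchor point below is arbitrary):

```python
import math

DIM_GAIN, SIZE_GAIN = 1.268, 1.259   # +26.8% and +25.9% per doubling
BASE_DIM, BASE_SIZE = 64, 250_000    # arbitrary anchor configuration

def relative_speedup(dim, size):
    """Speedup relative to the anchor, compounding per-doubling gains."""
    return (DIM_GAIN ** math.log2(dim / BASE_DIM)
            * SIZE_GAIN ** math.log2(size / BASE_SIZE))

for dim, size in [(64, 250_000), (128, 500_000), (512, 1_000_000)]:
    print(f"d={dim:3d}, n={size:>9,}: ≈ {relative_speedup(dim, size):.2f}x vs. anchor")
```

Under this model, moving from the smallest to the largest configuration (three dimension doublings and two size doublings) compounds to roughly a 3.2x relative gain, consistent with the observation that larger, higher-dimensional datasets benefit most.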
6.4.2. Scalability
To validate Harmony's scalability, experiments were performed across different machine configurations: 4, 8, and 16 nodes.
Key Findings (Figure 11b):
- Group-based Partitioning (Harmony) Shows Superlinear Speedup: The group-based partitioning method (Harmony's adaptive strategy) consistently demonstrates a speedup greater than the number of machines. This superlinear speedup is a direct result of the pruning design, which reduces the per-node computational overhead, yielding disproportionately higher gains as nodes are added.
- Vector-based Partitioning Scales Linearly: Vector-based partitioning (Harmony-vector) achieves performance close to the number of workers, scaling almost linearly, since it introduces no overhead beyond simple distribution; each added worker contributes roughly its share of processing capacity.
- Dimension-based Partitioning's Communication Bottleneck: Dimension-based partitioning (Harmony-dimension) initially gains performance as the number of workers grows, but declines after a certain point. As more workers are added (leading to finer dimension splits), the communication overhead for exchanging partial results grows significantly and eventually dominates, diminishing the overall gain and leading to degradation (a toy model of this trade-off follows this list).
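A toy latency model (illustrative, not the paper's cost model) captures this bottleneck: per-node compute shrinks as 1/w while partial-result exchange grows with the worker count w, so latency is minimized at a finite w and throughput declines beyond it. The constants are arbitrary:

```python
# Toy model: latency(w) = compute/w + comm_per_worker * w.
COMPUTE_TOTAL = 1000.0   # hypothetical total compute cost per query
COMM_PER_WORKER = 5.0    # hypothetical exchange cost added per worker

for w in (2, 4, 8, 16, 32, 64):
    latency = COMPUTE_TOTAL / w + COMM_PER_WORKER * w
    print(f"workers={w:2d}: latency ≈ {latency:6.1f}")
# Minimized near w = sqrt(COMPUTE_TOTAL / COMM_PER_WORKER) ≈ 14, after
# which adding workers makes things worse (the observed decline).
```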
6.4.3. Peak memory usage
This experiment measures the peak memory consumption of Harmony and compares it to baseline partition methods in a 4-node setup.
The following are the results from Table 5 of the original paper:
| Dataset | Harmony-vector | Harmony | Harmony-dimension |
|---|---|---|---|
| StarLightCurves | 3.94GB | 4.01GB | 4.07GB |
| Msong | 1.15GB | 1.32GB | 1.46GB |
| Sift1M | 1.37GB | 1.72GB | 1.96GB |
| Deep1M | 1.23GB | 1.61GB | 1.88GB |
| Word2Vec | 658MB | 723MB | 812MB |
| HandOutlines | 11.06GB | 11.19GB | 11.33GB |
| Glove1.2M | 814MB | 932MB | 1.06GB |
| Glove2.2M | 1.64GB | 1.98GB | 2.23GB |
Key Findings:
- Proportionality to Data Dimensions and Size: Peak memory usage for all three partition methods is generally proportional to both data dimensionality and dataset size, a natural consequence of storing and processing vector data.
- Memory Overhead Diminishes with Increasing Dimensions: Peak memory consumption for Harmony and Harmony-dimension is slightly higher than for Harmony-vector, but the gap shrinks as dimensionality grows. For example, on Deep1M (256 dimensions), Harmony exceeds Harmony-vector by 30.9%, whereas on HandOutlines (2709 dimensions) the increase drops to 1.17% (verified in the check after this list). This suggests that the additional intermediate results required by dimension-based partitioning, whose size does not scale with dimensionality, become diluted as the vector data itself grows, demonstrating Harmony's potential for efficiently handling high-dimensional data.
- Memory Overhead Related to Dimension Slices: The increase in memory usage from intermediate results is related to the number of dimension slices. Harmony shows a smaller increase than Harmony-dimension because its adaptive strategy does not partition entirely by dimension, managing intermediate-result overhead by sometimes favoring vector-based partitioning or a coarser dimension split.
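The quoted overhead percentages follow directly from Table 5 (a quick check):

```python
# Peak memory from Table 5, in GB: (Harmony-vector, Harmony).
peak_gb = {
    "Deep1M (256 dims)":        (1.23, 1.61),
    "HandOutlines (2709 dims)": (11.06, 11.19),
}
for name, (vec, har) in peak_gb.items():
    print(f"{name}: Harmony over Harmony-vector ≈ {(har - vec) / vec:.1%}")
# Deep1M ≈ 30.9%, HandOutlines ≈ 1.2%
```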
6.4.4. Comparison between Auncel and Harmony
- Auncel [68]: This distributed vector database targets low-search-load scenarios with specific precision requirements. It employs a fixed partitioning strategy similar to Harmony-vector, distributing the query load across nodes.
- Auncel's Limitation: Auncel shares Harmony-vector's drawback regarding load balancing: its performance is not optimal under skewed workloads, because its fixed, vector-based partitioning cannot adapt to uneven query distributions, leading to bottlenecks on nodes holding "hot" data.
- Harmony's Advantage: In contrast, Harmony is specifically designed for skewed workloads. It leverages adaptive multi-granularity partitioning (guided by the cost model) and pruning to achieve fine-grained load balancing. This allows Harmony to handle uneven query distributions far more effectively, maintaining higher throughput and stability in scenarios where Auncel (and Harmony-vector) would falter due to load imbalance.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces Harmony, a novel distributed vector database system designed to address the challenges of load imbalance and high communication overhead in large-scale Approximate Nearest Neighbor Search (ANNS). Harmony's core innovation lies in its multi-granularity partition strategy, which intelligently combines dimension-based and vector-based partitioning, dynamically guided by a cost model. This adaptive approach ensures balanced computational load across nodes while minimizing communication costs. Furthermore, Harmony integrates an early-stop pruning mechanism within a flexible pipelined execution engine, exploiting the monotonicity of distance computations in dimension-based partitions to significantly reduce redundant computations and network transfers.
Experimental evaluations across diverse real-world datasets demonstrate Harmony's superior performance. It achieves an average 4.63 times throughput improvement over Faiss (a state-of-the-art single-node system) in a four-node setup, exhibiting superlinear speedup due to its effective pruning. Crucially, Harmony shows a 58% performance improvement over traditional distributed methods when faced with skewed workloads, proving its robustness. Its optimization techniques (balanced load, pipelined execution, and pruning) each contribute substantial performance gains, with pruning being the most impactful. Harmony also scales efficiently with increasing dataset size and dimensionality, while maintaining minimal index space and manageable peak memory usage.
7.2. Limitations & Future Work
The paper does not explicitly dedicate a section to Limitations & Future Work. However, based on the experimental results and discussions, several implicit limitations and potential future research directions can be inferred:
Implicit Limitations:
- Overhead for Smaller Datasets: The experiments show that Harmony's performance is suboptimal for smaller datasets, where communication overhead dominates. The complexity introduced by adaptive partitioning and pipelined pruning may not be justified for datasets small enough to be handled efficiently by single-node or simpler distributed solutions.
- Communication Overhead in Harmony-dimension: While Harmony manages communication better, the Harmony-dimension baseline clearly suffers from increased communication overhead as the number of nodes or dimension slices grows. This highlights a fundamental challenge of pure dimension-based partitioning that Harmony mitigates but may not entirely eliminate, especially in extreme configurations.
- Cost Model Complexity: Although a cost model drives the dynamic adaptation, its own computational overhead is not deeply analyzed. For rapidly changing workloads or very frequent adjustments, the cost of running the cost model itself could become a factor.
- Generalizability of Pruning Effectiveness: The paper notes that pruning rates vary significantly across datasets due to their distinct distributions. While Harmony benefits from pruning, its efficacy is dataset-dependent, so performance gains from pruning may not be uniform across all real-world data distributions.
- Dependency on Metrics with Monotonicity: The early-stop pruning relies on the monotonicity of distance computations (e.g., Euclidean distance, or cosine similarity with normalized vectors). For non-monotonic similarity metrics, this pruning technique may not be directly applicable, limiting Harmony's generalizability across all ANNS use cases.
Potential Future Work (Inferred):
- Adaptive Cost Model Refinement: Further research could make the cost model more sophisticated, potentially incorporating machine learning to predict workload patterns and to adjust the imbalance weight, or even the structure of the partition plan, more granularly.
- Optimizing Communication for Small Datasets: Investigating techniques to reduce communication overhead specifically for smaller datasets, or when the number of nodes is very high, would make Harmony universally efficient.
- Dynamic Dimension Ordering: The paper mentions dynamically adjusting the execution order of dimensions to mitigate load imbalance caused by pruning. This could be explored in more depth, potentially with predictive models for optimal dimension ordering.
- Support for Diverse Similarity Metrics: Expanding Harmony's pruning capabilities to non-monotonic distance or similarity metrics, perhaps through novel bounding techniques or approximate-monotonicity approaches.
- Fault Tolerance and Elasticity: Although Harmony is a distributed system, the paper does not detail its fault tolerance mechanisms (e.g., handling node failures, dynamic scaling of nodes), which would be crucial for production-grade deployments.
- Integration with Advanced Indexing: Exploring how Harmony's partitioning and pruning strategies could be integrated with more advanced graph-based or quantization-based ANNS indexing techniques, which traditionally struggle in distributed environments.
7.3. Personal Insights & Critique
Harmony presents a compelling solution to critical challenges in distributed ANNS. The hybrid multi-granularity partitioning is a highly intuitive and practical approach, effectively bridging the trade-off between load balancing and communication overhead. Rather than rigidly adhering to one partitioning paradigm, the cost model-driven adaptation is a significant strength. This flexibility allows Harmony to perform robustly across diverse and skewed workloads, which is a common pain point in real-world distributed systems.
The early-stop pruning mechanism is particularly elegant. By leveraging the inherent monotonicity of standard distance metrics, Harmony avoids a tremendous amount of redundant computation. The empirical results, showing superlinear speedup and pruning ratios exceeding 90% in later stages, strongly validate this design choice. This dimension-level pruning is a powerful example of how a deep understanding of the underlying mathematical properties of the problem can lead to substantial system-level optimizations. The pipelined execution further enhances this by overlapping computation and communication, which is crucial for high-throughput systems.
Critique:
- Black Box Nature of the Cost Model: While the cost model is central to Harmony's adaptive strategy, the paper provides limited detail on its practical implementation and on how its parameters (such as the imbalance weight) are tuned or learned. A more in-depth discussion of the estimation methodology for computation and communication costs would have been beneficial: are these empirically derived or analytical models, and how accurate are they in dynamic environments?
- Absence of Explicit Future Work: The lack of a dedicated Limitations & Future Work section is a minor omission. While some challenges can be inferred, stating them explicitly would guide future research and give a more complete picture of the system's current boundaries.
- Comparison to Graph-based Distributed ANNS: The paper primarily compares against Faiss (cluster-based) and its own vector-based and dimension-based variants. While Faiss is state-of-the-art, a more comprehensive comparison against recent distributed graph-based ANNS systems (even with different trade-offs) would further contextualize Harmony's performance, especially given the paper's remark that graph-based methods are "not well compatible with distributed features."
- Scalability at Very High Node Counts: Although scalability up to 16 nodes is shown, the decline of Harmony-dimension at higher worker counts suggests that communication becomes dominant even for dimension-based approaches. Further analysis or mitigation strategies for extreme scale (e.g., hundreds or thousands of nodes) would be insightful.
Transference/Application:
The principles employed in Harmony – adaptive multi-granularity partitioning, cost model-driven optimization, and monotonicity-based early pruning in a pipelined distributed execution – are highly transferable. They could be applied to other high-dimensional data processing tasks beyond ANNS, such as:
- Distributed Machine Learning Inference: where models operate on high-dimensional feature vectors and computation can be distributed.
- Large-scale Data Filtering/Querying: for non-ANN queries on high-dimensional data, where partial computations could allow early rejection of irrelevant data.
- Other Distance/Similarity Computations: the monotonicity principle applies to various metrics, suggesting broader applicability for pruning in distributed environments.

Overall, Harmony makes a significant contribution by presenting a well-engineered and rigorously evaluated solution that pushes the boundaries of efficient and robust distributed ANNS, particularly for skewed workloads and high-dimensional data.