HARMONY: A Scalable Distributed Vector Database for High-Throughput Approximate Nearest Neighbor Search
TL;DR Summary
Harmony is a scalable distributed vector database designed for high-throughput Approximate Nearest Neighbor Search. It addresses load imbalance and communication overhead with a novel multi-granularity partition strategy, achieving significant performance improvements in extensive experiments on real-world datasets.
Abstract
Approximate Nearest Neighbor Search (ANNS) is essential for various data-intensive applications, including recommendation systems, image retrieval, and machine learning. Scaling ANNS to handle billions of high-dimensional vectors on a single machine presents significant challenges in memory capacity and processing efficiency. To address these challenges, distributed vector databases leverage multiple nodes for the parallel storage and processing of vectors. However, existing solutions often suffer from load imbalance and high communication overhead, primarily due to traditional partition strategies that fail to effectively distribute the workload. In this paper, we introduce Harmony, a distributed ANNS system that employs a novel multi-granularity partition strategy, combining dimension-based and vector-based partition. This strategy ensures a balanced distribution of computational load across all nodes while effectively minimizing communication costs. Furthermore, Harmony incorporates an early-stop pruning mechanism that leverages the monotonicity of distance computations in dimension-based partition, resulting in significant reductions in both computational and communication overhead. We conducted extensive experiments on diverse real-world datasets, demonstrating that Harmony outperforms leading distributed vector databases, achieving 4.63 times throughput on average in four nodes and 58% performance improvement over traditional distribution for skewed workloads.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is a scalable distributed vector database system named Harmony, designed for high-throughput Approximate Nearest Neighbor Search (ANNS).
1.2. Authors
The authors and their affiliations are:
- Qian Xu, Feng Zhang, Chengxi Li, Zheng Chen, Xiaoyong Du - Renmin University of China
- Lei Cao - Massachusetts Institute of Technology
- Jidong Zhai - Tsinghua University
1.3. Journal/Conference
The paper is published on arXiv, which is a preprint server for electronic preprints of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, electrical engineering and systems science, and economics. While arXiv is not a peer-reviewed journal or conference, it is a highly influential platform for rapid dissemination of research findings within the scientific community, often preceding formal publication.
1.4. Publication Year
2025
1.5. Abstract
Approximate Nearest Neighbor Search (ANNS) is crucial for various data-intensive applications. However, scaling ANNS to handle billions of high-dimensional vectors on a single machine is limited by memory and processing efficiency. Distributed vector databases attempt to solve this by using multiple nodes for parallel processing, but existing solutions often suffer from load imbalance and high communication overhead due to traditional partitioning strategies. This paper introduces Harmony, a distributed ANNS system that proposes a novel multi-granularity partition strategy, combining dimension-based and vector-based partition. This strategy aims to balance computational load and minimize communication costs. Furthermore, Harmony incorporates an early-stop pruning mechanism that leverages the monotonicity of distance computations in dimension-based partition to reduce both computational and communication overhead. Extensive experiments on real-world datasets show that Harmony outperforms leading distributed vector databases, achieving 4.63 times throughput on average in a four-node setup and a 58% performance improvement over traditional distribution for skewed workloads.
1.6. Original Source Link
https://arxiv.org/abs/2506.14707 (Preprint)
https://arxiv.org/pdf/2506.14707v1.pdf (PDF Link)
The paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the significant challenge of scaling Approximate Nearest Neighbor Search (ANNS) to handle billions of high-dimensional vectors efficiently. In modern data-intensive applications, such as recommendation systems, image retrieval, and machine learning, ANNS is a fundamental operation. The sheer volume of unstructured data being generated daily (e.g., 500PB for Alibaba, 500 hours of content on YouTube every minute) necessitates solutions that can scale beyond single-machine capabilities.
Existing distributed vector databases attempt to address this by leveraging multiple nodes for parallel storage and processing of vectors. However, they face two critical limitations:
- Load Imbalance: Traditional partitioning methods (such as cluster-based or hash-based) often result in some machines becoming overloaded with computations while others are underutilized, especially under skewed query workloads. This degrades overall system performance.
- High Communication Overhead: Distributing queries across machines typically involves transmitting intermediate results. Although these might be smaller than the original vectors, the disparity between network bandwidth (e.g., 100 Gb/s) and computation speed (hundreds of GB/s) means communication can still introduce significant delays, acting as a bottleneck.

The paper identifies that over 90% of ANNS time in cluster-based indices is spent on distance computations. This highlights the need for optimizing these computations in a distributed setting. Traditional graph-based segmentation for standalone machines also struggles in distributed deployments due to cross-machine edges and high latency.
The paper's entry point or innovative idea is to develop a distributed-friendly vector distribution method and optimize distributed vector distance computations based on cluster-based vectors by introducing a novel multi-granularity partition strategy combined with an early-stop pruning mechanism.
2.2. Main Contributions / Findings
The paper introduces Harmony, a distributed ANNS system with three primary innovations and corresponding key findings:
- Multi-Granularity Partition Strategy: Harmony integrates vector-based and dimension-based partition into a hybrid index construction framework, guided by a cost model.
  - Contribution: This approach dynamically adapts to query characteristics and cluster state, ensuring efficient query processing and stable performance across various loads. It addresses load imbalance by combining the strengths of both partitioning methods.
  - Finding: This adaptive strategy ensures robust performance even under skewed workloads, outperforming traditional distribution methods.
- Early-Stop Pruning Mechanism: Harmony incorporates an early-stop pruning mechanism that leverages the monotonicity of distance computations in dimension-based partition.
  - Contribution: When partial vector dimensions no longer contribute to the final result, redundant calculations are skipped. This early pruning, coupled with a pipelined execution model, minimizes unnecessary computations and communications, especially for high-dimensional queries.
  - Finding: Experiments show significant pruning ratios, especially in later stages of dimension processing (e.g., over 80% for the third slice, over 90% for the fourth slice), dramatically reducing computational overhead and network transfers.
- Adaptive Index Construction & Load-Aware Partitioning:
  - Contribution: The system adaptively adjusts its index structure based on workload characteristics and uses a load-aware routing mechanism to minimize communication overhead and balance query processing.
  - Finding: This results in Harmony achieving substantial performance gains over state-of-the-art vector databases.

Key Conclusions / Findings:
- Harmony achieves a 4.63x throughput improvement on average compared to Faiss in a four-node environment.
- It demonstrates a 58% performance improvement over traditional distribution methods when dealing with skewed workloads.
- The dimension-based partition inherently allows for early stopping during distance computations, with partial distances quickly exceeding the current best candidate threshold, thus enabling significant pruning.
- The multi-granularity approach (combining dimension and vector partitioning) effectively balances computational load and minimizes communication, addressing the core challenges of distributed ANNS.
- Individual optimization techniques contribute significantly: balanced load yields a 1.75x speedup, pipeline and asynchronous execution 1.25x, and pruning 1.51x.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Harmony, a reader needs to grasp several core concepts in ANNS and distributed systems:
- Approximate Nearest Neighbor Search (ANNS):
  - Concept: ANNS is a technique used to find data points in a high-dimensional space that are "approximately" closest to a given query vector. Unlike exact Nearest Neighbor Search (NNS), which guarantees finding the absolute closest points, ANNS sacrifices a small amount of accuracy for significantly faster search times, which is crucial for large datasets.
  - How it works: ANNS algorithms typically build an index structure (e.g., a graph, tree, or cluster-based structure) that allows for rapid traversal of the search space, avoiding a full scan of all data points.
  - Why it's needed: In applications like recommendation systems or image retrieval, data is often represented as high-dimensional vectors (e.g., embeddings from deep learning models). Exact search on billions of such vectors is computationally prohibitive.
- High-Dimensional Vectors:
  - Concept: These are numerical representations of data points where each point has many attributes or features (dimensions). For example, a word embedding might have 300 dimensions, and an image feature vector could have 1024 dimensions.
  - Challenges: As dimensionality increases, traditional distance metrics become less meaningful (the "curse of dimensionality"), and the volume of the space grows exponentially, making exhaustive search impractical.
- Distributed Vector Databases:
  - Concept: Systems designed to store and search vast collections of high-dimensional vectors across multiple interconnected machines (nodes). They leverage parallel processing to handle workloads that would overwhelm a single machine.
  - Benefits: Increased storage capacity, reduced query latency (by processing parts of a query concurrently), and improved fault tolerance.
- Partitioning Strategies (for Distributed Systems):
  - Vector-based Partitioning:
    - Concept: The entire vector dataset is divided into subsets (shards), and each subset is assigned to a different node. Each node stores complete vectors.
    - Pros: Simplifies distance computation locally on each node; communication typically only happens at the beginning (sending the query) and end (aggregating results).
    - Cons: Can lead to load imbalance if certain subsets of vectors (hot partitions) are queried more frequently or are computationally heavier; doesn't help with memory if individual vectors are too large for a single node.
  - Dimension-based Partitioning:
    - Concept: The dimensional space of the vectors is split into disjoint subsets, and each subset of dimensions is assigned to a different node. Each node stores only a slice of each vector.
    - Pros: Ensures a balanced load across all nodes, as each node contributes equally to the distance computation for any given query; reduces memory pressure per node for very high-dimensional vectors.
    - Cons: Requires more communication during query processing, as partial distance results from multiple nodes need to be combined to form a full distance, potentially increasing communication overhead.
  - Cluster-based Partitioning:
    - Concept: Data points are grouped into clusters based on similarity (e.g., using k-means), and then these clusters (or the data points within them) are distributed across nodes.
    - Relevance to ANNS: Many ANNS indexing techniques are cluster-based, where the first step is to find the closest clusters to the query, then search only within those clusters.
  - Hash-based Partitioning:
    - Concept: Vectors are assigned to partitions based on a hash function applied to some property of the vector or its ID.
    - Pros: Simple, usually ensures uniform distribution.
    - Cons: Doesn't consider data locality, and can lead to cross-partition data access for related vectors.
- Euclidean Distance (Squared Euclidean Distance):
  - Concept: A common metric for measuring the straight-line distance between two points in Euclidean space. The squared version is often used to avoid the square root operation, which doesn't change the relative ordering of distances.
  - Formula: For two $d$-dimensional vectors $\mathbf{p}$ and $\mathbf{q}$, the squared Euclidean distance is: $ D^2_{\mathrm{Eucl}}(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{d} (p_i - q_i)^2 $
  - Significance for dimension-based partitioning: The total squared Euclidean distance is the sum of squared partial distances over disjoint sets of dimensions. This monotonic property is crucial for early stopping. If the sum of partial distances already exceeds a threshold, adding more non-negative terms will only increase it further, meaning the vector can be pruned.
- Cosine Similarity:
  - Concept: Measures the cosine of the angle between two non-zero vectors. It indicates how similar the direction of two vectors is, regardless of their magnitude. A value of 1 means identical direction, 0 means orthogonal, and -1 means exactly opposite.
  - Formula: For two vectors $\mathbf{p}$ and $\mathbf{q}$: $ \text{Cosine Similarity}(\mathbf{p}, \mathbf{q}) = \frac{\mathbf{p} \cdot \mathbf{q}}{\|\mathbf{p}\| \|\mathbf{q}\|} = \frac{\sum_{i=1}^{d} p_i q_i}{\sqrt{\sum_{i=1}^{d} p_i^2} \sqrt{\sum_{i=1}^{d} q_i^2}} $
  - Significance for dimension-based partitioning: If vectors are pre-normalized (i.e., $\|\mathbf{p}\| = \|\mathbf{q}\| = 1$), then cosine similarity is directly proportional to the dot product. The dot product also accumulates over disjoint dimension slices, allowing for early stopping when maximizing similarity (e.g., if a partial sum of dot products is too low, the candidate won't be a good match).
- Early Stopping / Pruning:
  - Concept: A technique to terminate computations prematurely if it's determined that the current path or candidate cannot lead to a better result than what's already found.
  - Relevance: In ANNS, if the partial distance calculated for a candidate vector already exceeds the distance of the $k$-th best neighbor found so far, then that candidate cannot be among the top-$k$ neighbors, and its remaining distance computations can be skipped (see the sketch after this list).
- Throughput (QPS - Queries Per Second):
  - Concept: A measure of the number of queries a system can process successfully in a given unit of time. Higher QPS indicates better performance and efficiency.
- Recall:
  - Concept: In ANNS, recall measures the proportion of true nearest neighbors that are successfully retrieved by the approximate search algorithm. A recall of 1.0 means all true nearest neighbors were found.
- Load Imbalance:
  - Concept: A situation in a distributed system where the workload is unevenly distributed among the processing nodes. Some nodes are heavily utilized (hot), while others are underutilized (cold), leading to bottlenecks and reduced overall system performance.
- Communication Overhead:
  - Concept: The time or resources consumed by communication between different components of a distributed system (e.g., sending data, synchronizing processes). Excessive communication overhead can negate the benefits of parallel processing.
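To make the monotonicity and early-stopping ideas above concrete, here is a minimal Python sketch (illustrative only, not Harmony's code; the function name and slicing scheme are assumptions): partial squared Euclidean distances can only grow as more dimension slices are accumulated, so a candidate is safely discarded once its running sum exceeds the current $k$-th best distance.

```python
import numpy as np

def early_stop_distance(query, candidate, slices, threshold):
    """Accumulate squared Euclidean distance slice by slice.

    Returns (partial_distance, pruned). Because every slice contributes a
    non-negative amount, the running sum is monotonically non-decreasing,
    so exceeding `threshold` (the current k-th best distance) proves the
    candidate cannot enter the top-k, and the remaining slices are skipped.
    """
    partial = 0.0
    for lo, hi in slices:                      # disjoint dimension ranges
        diff = query[lo:hi] - candidate[lo:hi]
        partial += float(np.dot(diff, diff))   # non-negative slice contribution
        if partial > threshold:                # monotonicity => safe to prune
            return partial, True
    return partial, False

# Toy usage: 128-dim vectors split into four 32-dim slices.
rng = np.random.default_rng(0)
q, c = rng.normal(size=128), rng.normal(size=128)
slices = [(i, i + 32) for i in range(0, 128, 32)]
print(early_stop_distance(q, c, slices, threshold=50.0))
```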
3.2. Previous Works
The paper discusses previous research in several areas:
- Vector Database Systems:
  - General advancements: ANNS has seen significant progress in indexing structures and query algorithms (e.g., [28, 47]), but distance computations remain the bottleneck.
  - Indexing techniques:
    - Graph-based indexes: Examples like [23, 24, 26, 27, 32, 47, 55, 57, 66] build a graph based on vector distances and use breadth-first search. The paper notes these are not well-compatible with distributed deployments due to cross-machine edges.
    - Partition-based indexes: Include cluster-based [14, 15, 17, 25, 36-38, 63, 67], hash-based [20, 33, 53, 59, 60], and high-dimensional tree-based [16, 45, 49, 58] methods. These divide vectors into partitions for faster identification.
  - Challenges: High storage overhead for full-dimensionality vectors, despite indexing improvements. Lossy compression (e.g., quantization [25]) can reduce storage but sacrifices accuracy. This drives the need for distributed ANNS.
- Distributed Solutions in Data Management:
  - Graph databases: Systems like Neo4j [4], TigerGraph [8], and JanusGraph [11] employ partitioning (sharding) to split graphs across machines, aiming to minimize cross-partition edges and data transfers.
  - Distributed relational databases: Google Spanner [19], CockroachDB [2], and F1 [54] achieve horizontal scalability by sharding tables, providing transactional consistency and parallel query execution.
  - Distributed query engines: Presto [6], Spark SQL [3], and Apache Drill [1] use massively parallel processing (MPP) and in-memory computation to reduce query latencies.
  - Lessons learned: These systems highlight the importance of robust data partitioning (to minimize inter-node communication and balance workloads) and parallel processing frameworks. They also expose common challenges like load imbalance and excessive data transfers.
- Distributed Vector Databases Specifics:
  - Previous research [5, 7, 9, 10, 46, 68] has focused on improving query performance and ensuring accuracy in multi-machine settings. However, the paper points out that these studies often do not address performance guarantees under load imbalance or explore pruning strategies that reduce inter-machine workload.
- Faiss [25]: Mentioned as a state-of-the-art stand-alone ANNS query engine used as a baseline for comparison. Faiss is developed by Facebook AI Research and provides efficient implementations of ANNS algorithms, often leveraging quantization and clustering techniques.
- Auncel [68]: Another distributed ANNS system, mentioned for comparison. Auncel focuses on error-bound guarantees for low-search-load scenarios and uses a fixed partitioning strategy similar to pure vector-based partitioning. The paper notes Auncel shares Harmony-vector's limitation in load balancing under skewed workloads.
3.3. Technological Evolution
The evolution of ANNS can be broadly described as follows:
- Early ANNS (Single Machine): Initial efforts focused on optimizing search within a single machine using techniques like k-d trees, LSH (Locality Sensitive Hashing), and later, more sophisticated graph-based (e.g., HNSW) and cluster-based (e.g., IVFADC variants in Faiss) indexes. These methods significantly improved query speed compared to brute-force search but were limited by the memory and computational capacity of a single machine.
- Emergence of Distributed Needs: With the explosion of high-dimensional data (e.g., embeddings from deep learning) reaching billions of vectors, single-machine solutions became inadequate. This necessitated the development of distributed ANNS systems.
- Initial Distributed Approaches: Early distributed systems often adapted existing partitioning strategies from other domains (e.g., vector-based sharding) or basic cluster-based distributions. While these offered scalability in terms of data volume, they frequently encountered issues like load imbalance (some nodes becoming bottlenecks) and high communication overhead (data transfer between nodes becoming costly).
- Specialized Distributed Vector Databases: The recognition of unique challenges posed by high-dimensional vector data led to the development of specialized distributed vector databases. These aimed to optimize for vector similarity search specifically, rather than generic data management. However, many still struggled with effectively balancing workloads and minimizing communication in diverse query scenarios.

This paper's work, Harmony, fits into this timeline as an advanced specialized distributed vector database. It addresses the limitations of earlier distributed solutions by proposing a more sophisticated multi-granularity partitioning and an early-stop pruning mechanism specifically tailored to the characteristics of vector distance computations. It represents a step towards more robust and efficient large-scale ANNS in multi-node environments.
3.4. Differentiation Analysis
Compared to the main methods in related work, Harmony's core differences and innovations lie in its integrated approach to partitioning and query processing:
- Traditional Vector-based Partitioning (e.g., Harmony-vector, or Auncel's fixed strategy):
  - Difference: These methods assign entire vectors to nodes. While they have lower communication overhead per query (one-time data transfer), they are highly susceptible to load imbalance under skewed workloads (as shown in Figure 1a), where "hot" partitions can overload specific nodes.
  - Harmony's Innovation: Harmony mitigates this by dynamically choosing between vector-based and dimension-based partitioning using a cost model. It leverages vector-based partitioning for its low communication cost when workloads are balanced, but shifts towards dimension-based partitioning or a hybrid when imbalance is detected.
- Traditional Dimension-based Partitioning (e.g., Harmony-dimension):
  - Difference: These methods split vectors by their dimensions across nodes. They inherently provide load balance for distance computations (as each node contributes to every query's distance calculation, Figure 1b). However, they typically incur high communication overhead because partial results from all dimension-holding nodes must be collected and combined for each query.
  - Harmony's Innovation: Harmony adopts dimension-based partitioning principles but significantly reduces the communication overhead through a novel early-stop pruning mechanism and pipelined execution. This pruning, enabled by the monotonicity of distance computations, allows Harmony to skip processing on later dimension slices if a candidate is already deemed unpromising. This transforms a potential weakness (high communication) into a strength (efficient pruning).
- Existing Distributed Vector Databases (general):
  - Difference: Many existing systems focus on query performance and accuracy but don't explicitly address load imbalance under varying workloads or comprehensive inter-machine pruning strategies.
  - Harmony's Innovation: Harmony explicitly tackles load imbalance through its adaptive, multi-granularity partitioning guided by a cost model. It also introduces a sophisticated multi-machine pruning technique within a pipelined execution engine, making pruning a first-class citizen in the distributed query process. This distinguishes it by focusing on robustness and efficiency across a wider range of operational conditions.

In essence, Harmony's innovation lies in its hybrid and adaptive nature, intelligently combining the strengths of different partitioning methods and introducing a powerful dimension-level pruning mechanism that is deeply integrated with a pipelined distributed execution model.
4. Methodology
4.1. Principles
The core idea behind Harmony is to achieve high-throughput and robust Approximate Nearest Neighbor Search (ANNS) in a distributed environment by adaptively combining different data partitioning strategies and leveraging early pruning. The theoretical basis and intuition are rooted in two key observations:
- Varied Granularities of Partitioning: Vectors can be partitioned in different ways: either by assigning entire vectors to nodes (vector-based partition) or by splitting dimensions across nodes (dimension-based partition). Each has distinct advantages and disadvantages regarding computational load distribution and communication overhead. The principle is that a flexible system should dynamically choose or combine these granularities based on the current workload.
- Monotonicity of Distance Computations: For common distance metrics like squared Euclidean distance or cosine similarity (when normalized), the total distance or similarity can be cumulatively summed from partial contributions across disjoint dimensions. This monotonic property means that if a partial sum already indicates a poor candidate (e.g., the partial distance exceeds the best-found distance), then subsequent computations on remaining dimensions for that candidate are redundant. This intuition forms the basis for early-stop pruning.

Harmony's design integrates these principles: it uses a cost model to adaptively decide the optimal partitioning strategy (or a hybrid) to balance load and minimize communication, and then employs a pipelined execution engine with dimension-level pruning to exploit the monotonicity of distance calculations, further reducing computational and communication overhead.
4.2. Core Methodology In-depth (Layer by Layer)
Harmony is designed with two main modules: a fine-grained query planner and a flexible pipelined execution engine.
4.2.1. Fine-grained Query Planner
The fine-grained query planner is responsible for determining the most efficient partitioning and query strategy. It uses a cost model to estimate computational and communication overhead.
Cost Model
The cost model helps Harmony make informed decisions about partitioning and query execution. It considers partition definitions, query processing costs, and imbalance factors.
The following are the results from Table 1 of the original paper:
| Symbol | Meaning |
|---|---|
| $\pi$ | Partition plan |
| $\mathcal{B}_{\mathrm{dim}}(\pi)$, $\mathcal{B}_{\mathrm{vec}}(\pi)$ | Dimension- and vector-based blocks in $\pi$ |
| $b$, $s$ | Single block $b \in \mathcal{B}_{\mathrm{dim}}(\pi)$, $s \in \mathcal{B}_{\mathrm{vec}}(\pi)$ |
| $c_{\mathrm{comp}}^{*}(\cdot, q)$, $c_{\mathrm{comm}}^{*}(\cdot, q)$ | Computation/communication costs for query $q$ when processed in $\mathcal{B}_{\mathrm{dim}}$ or $\mathcal{B}_{\mathrm{vec}}$ |
| $\mathrm{Load}(n, \pi)$ | Total load on node $n$ under plan $\pi$ |
| $\bar{Z}(\pi)$ | Imbalance factor |
| $\alpha$ | Weight for imbalance in overall cost |
- Query Cost ($C_q(\pi)$): For each query $q$ in the set of queries $Q$, the cost model sums the computation and communication costs across two types of partitioned data: dimension-based partitions and vector-based partitions. $ C_q(\pi) = \sum_{b \in \mathcal{B}_{\mathrm{dim}}(\pi)} \left[ c_{\mathrm{comp}}^{\mathrm{dim}}(b, q) + c_{\mathrm{comm}}^{\mathrm{dim}}(b, q) \right] + \sum_{s \in \mathcal{B}_{\mathrm{vec}}(\pi)} \left[ c_{\mathrm{comp}}^{\mathrm{vec}}(s, q) + c_{\mathrm{comm}}^{\mathrm{vec}}(s, q) \right] $
  - $C_q(\pi)$: The total cost for processing a single query $q$ under a given partition plan $\pi$.
  - $\mathcal{B}_{\mathrm{dim}}(\pi)$: The set of dimension-based blocks defined by partition plan $\pi$.
  - $\mathcal{B}_{\mathrm{vec}}(\pi)$: The set of vector-based blocks (shards) defined by partition plan $\pi$.
  - $b$: A single dimension-based block.
  - $s$: A single vector-based block.
  - $c_{\mathrm{comp}}^{\mathrm{dim}}(b, q)$: The computation cost for query $q$ on a dimension-based block $b$.
  - $c_{\mathrm{comm}}^{\mathrm{dim}}(b, q)$: The communication cost for query $q$ related to a dimension-based block $b$.
  - $c_{\mathrm{comp}}^{\mathrm{vec}}(s, q)$: The computation cost for query $q$ on a vector-based block $s$.
  - $c_{\mathrm{comm}}^{\mathrm{vec}}(s, q)$: The communication cost for query $q$ related to a vector-based block $s$.

  This formula calculates the total cost for a query by summing up the costs incurred from all involved dimension-based and vector-based blocks.
- Imbalance Factor ($\bar{Z}(\pi)$): This factor measures how unevenly the workload is distributed across the nodes.
  - Node Load ($\mathrm{Load}(n, \pi)$): The total work a node $n$ performs under a given partition plan $\pi$ is calculated by summing the computation costs of all dimension-based and vector-based blocks assigned to that node for all queries in $Q$. $ \mathrm{Load}(n, \pi) = \sum_{q \in Q} \Big[ \sum_{b \in \mathcal{B}_{\mathrm{dim}}(\pi, n)} c_{\mathrm{comp}}^{\mathrm{dim}}(b, q) + \sum_{s \in \mathcal{B}_{\mathrm{vec}}(\pi, n)} c_{\mathrm{comp}}^{\mathrm{vec}}(s, q) \Big] $
    - $\mathrm{Load}(n, \pi)$: The total computational load on node $n$ for all queries under partition plan $\pi$.
    - $\mathcal{B}_{\mathrm{dim}}(\pi, n)$: Dimension-based blocks stored on node $n$.
    - $\mathcal{B}_{\mathrm{vec}}(\pi, n)$: Vector-based shards on node $n$.
  - Imbalance Factor Formula: The imbalance factor itself is defined as the standard deviation of the node loads: $ \bar{Z}(\pi) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \bigl[ \mathrm{Load}(n, \pi) - \overline{L} \bigr]^2} $
    - $\bar{Z}(\pi)$: The imbalance factor for partition plan $\pi$.
    - $N$: The total number of nodes in the system.
    - $\mathrm{Load}(n, \pi)$: The load on node $n$.
    - $\overline{L}$: The average load across all nodes, calculated as $\overline{L} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Load}(n, \pi)$.

  A higher value of $\bar{Z}(\pi)$ indicates a more significant load imbalance, meaning some nodes are disproportionately busy compared to others.
- Overall Cost Function ($C(\pi, Q)$): The planner combines the per-query costs and the imbalance factor into a single objective function to evaluate a partition plan: $ C(\pi, Q) = \sum_{q \in Q} C_q(\pi) + \alpha \cdot \bar{Z}(\pi) $
  - $C(\pi, Q)$: The total cost of a partition plan $\pi$ for processing the entire set of queries $Q$.
  - $C_q(\pi)$: The cost for a single query (as defined above).
  - $\alpha$: A user-defined weight that determines how much the system penalizes skew (imbalance) relative to local computation/communication efficiency. A higher $\alpha$ means the system prioritizes load balance more.

The goal of the fine-grained query planner is to find a partition plan $\pi^*$ that minimizes this overall cost function; a minimal sketch of this objective follows.
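To illustrate how the pieces of the objective fit together, here is a minimal Python sketch (an illustration under assumptions, not the paper's planner; the `plan` interface with `nodes`, `blocks`, `comp_cost`, and `comm_cost` is hypothetical):

```python
import math

def plan_cost(plan, queries, alpha):
    """Evaluate C(pi, Q) = sum_q C_q(pi) + alpha * Z(pi) for one candidate plan."""
    total = 0.0
    load = {n: 0.0 for n in plan.nodes}                 # Load(n, pi) per node
    for q in queries:
        for block, node in plan.blocks(q):              # dimension- and vector-based blocks
            comp = plan.comp_cost(block, q)
            total += comp + plan.comm_cost(block, q)    # C_q(pi) contribution
            load[node] += comp                          # only computation counts as load
    mean = sum(load.values()) / len(load)
    z = math.sqrt(sum((l - mean) ** 2 for l in load.values()) / len(load))
    return total + alpha * z                            # higher alpha favors balance

# The planner would search candidate plans and keep the minimizer:
# best_plan = min(candidate_plans, key=lambda p: plan_cost(p, Q, alpha))
```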
Query Load Distribution
After the cost model analyzes the workload and determines an optimal partition plan $\pi^*$, Harmony distributes queries across the cluster. This process involves several steps, as illustrated in Figure 4:
The following figure (Figure 4 from the original paper) illustrates Harmony's query distribution. $C_i$ denotes the id of cluster $i$, $Q_i$ denotes query block $i$, $V_i$ denotes the $i$-th block according to vector-based partition, $D_i$ denotes the $i$-th block according to dimension-based partition, and $M_i$ denotes the $i$-th machine.
This figure is a schematic diagram of query distribution in the Harmony system. The left part (a) shows the base vector distribution, including the different clusters and vectors. The right part (b) shows the query load distribution, indicating how query blocks are assigned to different machines and vector blocks. The figure visually reflects the multi-granularity partition strategy used to balance load and reduce communication overhead.
- Grid Formation: The partition plan defines a grid of dimension splits and vector splits. For example, as shown in Figure 4(a), three dimension partitions (D1-D3) and two vector partitions (V1, V2) create six blocks (V1D1 through V2D3). Each machine (M1-M6) in the cluster is assigned one of these blocks, establishing the base vector distribution.
- Identify Cluster Centroids (Client-side): As shown in Figure 4(b) (purple table), similar to other cluster-based ANNS engines, Harmony first identifies relevant cluster centroids on the client side. This step maps each incoming query vector to the appropriate portion of the base vector collection it needs to interact with. This is a crucial initial filtering step.
- Map Queries to Vector-based Blocks (Client-side): After centroid identification, Harmony determines which vector partitions each query should visit. For example, in Figure 4(b) (blue table), Q1 is mapped to V1. This means Q1 will be processed against the vectors residing in vector partition V1.
- Combine with Dimension Splits and Map to Machines (Client-side and Routers): Once the vector partitions are determined, Harmony further splits the queries based on dimension partitions. For Q1 mapped to V1, it will be split into Q1D1, Q1D2, and Q1D3, corresponding to the dimension partitions D1-D3. These query chunks are then routed to the corresponding machines (M1, M2, M3 for V1D1, V1D2, V1D3, respectively) for partial distance computations. This ensures that Q1D1, Q1D2, and Q1D3 are processed in parallel on separate machines (see the routing sketch at the end of this subsection).

Advantages: This multi-granularity distribution ensures that:
- The blocks (ViDj) are evenly distributed among machines.
- All available nodes are fully utilized.
- No single machine is overloaded, leading to balanced load and higher throughput.
Time Complexity:
- Centroid Assignment (Computation overhead): For $Q$ query vectors of dimensionality $D$ and $NC$ base centroids, finding the closest centroids typically costs $O(Q \times D \times NC)$. This is a standard operation in ANNS and not overhead specific to Harmony.
- Query Distribution (Communication overhead): With pure vector-based partitioning, a query vector interacts only with the machines holding its probed vector partitions. With Harmony's dimension-based partitioning, the number of interactions increases by a factor of $\mathcal{B}_{\mathrm{dim}}(\pi)$, the number of dimension-based splits. However, the total data transferred remains the same because the size of each split is reduced to $D / \mathcal{B}_{\mathrm{dim}}(\pi)$. Therefore, while the number of communication messages increases, the total amount of data transmitted does not, preventing additional communication overhead.

Space Complexity:
Harmony's space requirements are minimal because each base vector is stored on only one machine (no redundancy). The query set is typically small, so storing and routing query splits incurs negligible space costs. The overall space complexity remains $O(NB \times D)$, where $NB$ is the number of base vectors and $D$ is their dimensionality, matching standard distributed systems without extra replication.
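The routing sketch referenced above, in Python (illustrative only; the mapping helper and machine naming are assumptions mirroring the Figure 4 example): a query assigned to a vector partition is split into one chunk per dimension partition, and each chunk goes to the machine that owns the corresponding (vector, dimension) grid block.

```python
def route_query(query_id, vector_part, num_dim_parts, block_to_machine):
    """Split a query bound for one vector partition into per-dimension chunks,
    each routed to the machine owning that (vector_part, dim_part) grid block."""
    return [(f"Q{query_id}D{d + 1}", block_to_machine[(vector_part, d)])
            for d in range(num_dim_parts)]

# Grid from Figure 4: 2 vector partitions x 3 dimension partitions on machines M1-M6.
grid = {(v, d): f"M{v * 3 + d + 1}" for v in range(2) for d in range(3)}
print(route_query(1, vector_part=0, num_dim_parts=3, block_to_machine=grid))
# [('Q1D1', 'M1'), ('Q1D2', 'M2'), ('Q1D3', 'M3')] -- partial distances in parallel
```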
4.2.2. Flexible Pipelined Execution Engine
The flexible pipelined execution engine is designed to exploit the opportunity for distributed pruning, where partial distance computations can eliminate unpromising candidates early. The challenge is to maintain pruning ability while distributing computations across machines.
The following figure (Figure 5 from the original paper) illustrates Harmony's pipeline pruning and querying.
This figure is a schematic diagram of pipelined pruning and querying in the Harmony system. The left part (a) illustrates vector-level pruning, showing how pruning conditions are updated and query queues are processed in stages A and B. The right part (b) depicts the three stages (X, Y, Z) of dimension-level pruning, indicating the flow of computation and sending. The overall diagram highlights Harmony's design strategy for efficient querying.
Algorithm 1 (not explicitly numbered in the paper, but presented as pseudocode) describes the orchestration:
1 Function PrewarmHeap (QuerySet Q, int K): // Compute partial distances for a subset of queries to build an initial max-heap.
2 foreach q in Q_subset do
3 dist = ComputeDistance (q, randomVectors)
4 heap_insert(q, dist, K)
5 return heap
6 Function DimensionPipeline (q, DSet): // Process each dimension block in DSet sequentially.
7 foreach d in DSet do
8 partialDist ← ComputeDistance (q, d)
9 if partialDist > q.currentThreshold then
10 prune(q)
11 return
12 UpdatePruning (q, partialDist)
13 Function VectorPipeline (QueryBatch Qv, DSet): // Process a batch of queries for a specific vector partition.
14 foreach q in Qv do
15 DimensionPipeline (q, DSet)
16 if q.pruned == true then
17 continue
18 UpdatePruning (q, finalDist)
19 Function QUERYPIPELINE(QuerySet Q, VSetList VList, DSetList DList, int K): // Stage 0: Prewarming
20 heap = PrewarmHeap (Q, K) // Stage I: Vector-level pipeline
21 foreach vPart in VList do
22 Qv = filterQueries(Q, vPart)
23 VectorPipeline (Qv, DList)
- Prewarm Stage (Lines 1-5, PrewarmHeap function):
  - Purpose: To establish an initial pruning threshold.
  - Mechanism: A subset of queries (Q_subset) is processed on the client node. ComputeDistance calculates initial distances for these queries against some randomVectors (likely centroids or a small sample). These distances are then inserted into a max-heap of size $K$ (for top-$K$ search). The max-heap stores the $K$ smallest distances found so far, with the largest of these (the $K$-th best distance) serving as the initial pruning threshold (q.currentThreshold).
- Vector Pipeline (Lines 13-18, VectorPipeline function):
  - Purpose: To process queries in batches corresponding to vector partitions and perform coarse-grained pruning.
  - Mechanism: It iterates through each vector partition (vPart) in VList. For each vPart, it filters the QuerySet to get a QueryBatch Qv relevant to that partition. Then, for each query in Qv, it calls DimensionPipeline (Line 15) to handle partial distance computations. After the DimensionPipeline completes, if the query has been pruned (Line 16), it continues to the next query. Otherwise, it updates the global pruning thresholds with the finalDist (Line 18). This ensures that each machine specializes in searching a particular vector partition. The max-heap is refined iteratively, so subsequent query batches benefit from tighter pruning thresholds.
- Dimension Pipeline (Lines 6-12, DimensionPipeline function):
  - Purpose: To perform fine-grained, early-stop pruning by processing dimension blocks sequentially for a single query.
  - Mechanism: For a given query $q$, it iterates through each dimension block $d$ in DSet (which represents the dimension partitions for that query).
    - It computes a partial distance (partialDist) for the current dimension block (Line 8).
    - It then compares this partialDist to the query's currentThreshold (Line 9).
    - If partialDist > q.currentThreshold, the candidate vector can be pruned immediately (Line 10), and the function returns (Line 11), skipping any further computation on subsequent dimension blocks.
    - If not pruned, UpdatePruning (Line 12) is called, which updates q.currentThreshold to reflect the accumulated partial distance, potentially making the threshold tighter for subsequent blocks.
Overall Workflow (QUERYPIPELINE function, Lines 19-23):
The QUERYPIPELINE orchestrates the entire process. First, it PrewarmHeap (Line 20) to get an initial threshold. Then, it enters a loop over vector partitions (vPart in VList). For each vPart, it filters queries (filterQueries) and then calls VectorPipeline (Line 23) to process that batch of queries.
Example (Figure 5):
- Vector-level pipeline (Figure 5a): Queries Q1-Q3 are processed for V1 in Stage A, and Q4-Q6 for V2. After Stage A, the global pruning threshold is updated. Stage B then processes Q4-Q6 for V1 and Q1-Q3 for V2. The pruning threshold established after Stage A benefits Stage B.
- Dimension-level pipeline (Figure 5b): This illustrates how dimensions of different queries are processed across machines M1-M3 (see the scheduling sketch below).
  - Stage X: Q1's D1 is on M1, Q2's D2 on M2, and Q3's D3 on M3. Each machine computes a partial distance.
  - Stage Y: Q1's D2 moves to M2, Q2's D3 to M3, and Q3's D1 to M1. Partial results from Stage X are accumulated, and pruning checks occur.
  - Stage Z: The remaining dimension blocks are processed.

Harmony ensures that different dimensions of the same query vector are processed in non-consecutive stages and on different machines. This minimizes synchronization overhead and allows partial result transmission to overlap with computation.
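One way to picture this staging, as a hedged Python sketch (the exact rotation rule is an assumption inferred from Figure 5b): in stage $t$, query $i$ works on dimension block $(i + t) \bmod B$, so at every stage each machine handles a different query's slice.

```python
def rotation_schedule(num_queries, num_dim_blocks, num_stages):
    """Round-robin schedule: in stage t, query i processes dimension block
    (i + t) % num_dim_blocks on the machine that owns that block."""
    return [[(f"Q{i + 1}", f"D{(i + t) % num_dim_blocks + 1}")
             for i in range(num_queries)]
            for t in range(num_stages)]

for t, stage in enumerate(rotation_schedule(3, 3, 3)):
    print(f"Stage {chr(ord('X') + t)}: {stage}")
# Stage X: [('Q1', 'D1'), ('Q2', 'D2'), ('Q3', 'D3')]
# Stage Y: [('Q1', 'D2'), ('Q2', 'D3'), ('Q3', 'D1')]  -- matches Figure 5b's rotation
```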
Complexity Analysis:
Let $Q$ be the number of query vectors, $NB$ the number of base vectors, $D$ the dimensionality, $n_{\mathrm{list}}$ the total number of clusters, and $n_{\mathrm{probe}}$ the number of probed clusters per query.
- Time Complexity: In a centralized system, the theoretical total operations would be $Q \times (n_{\mathrm{probe}} / n_{\mathrm{list}}) \times NB \times D^2$ (the numerator of the expression below). With Harmony, work is distributed across machines. Under a balanced load, each machine handles: $ \frac{Q \times ( n_{\mathrm{probe}} / n_{\mathrm{list}} ) \times NB \times D^2}{\mathcal{B}_{\mathrm{vec}}(\pi) \times \mathcal{B}_{\mathrm{dim}}(\pi)} $ Thus, the final per-machine cost scales as: $ O\left( \frac{Q \cdot n_{\mathrm{list}} \cdot NB \cdot D^2}{n_{\mathrm{probe}} \, \mathcal{B}_{\mathrm{vec}}(\pi) \, \mathcal{B}_{\mathrm{dim}}(\pi)} \right) $
  - Explanation: The total work is divided by the product of the number of vector blocks and dimension blocks, effectively distributing the computational burden. The pruning mechanism further reduces the effective dimension term, as not all dimensions are processed for all candidates.
- Communication Overhead: On average, each query interacts with $n_{\mathrm{probe}} \times \mathcal{B}_{\mathrm{vec}}(\pi) / n_{\mathrm{list}}$ vector blocks and, within each, $\mathcal{B}_{\mathrm{dim}}(\pi)$ dimension blocks. While the number of communication events increases by $\mathcal{B}_{\mathrm{dim}}(\pi)$ times (due to multiple dimension slices), the amount of data transferred per event is reduced by the same factor. The communication required for query distribution is proportional to: $ \left( \frac{NB}{\mathcal{B}_{\mathrm{vec}}(\pi)} \right) \times \left( n_{\mathrm{probe}} \times \frac{\mathcal{B}_{\mathrm{vec}}(\pi)}{n_{\mathrm{list}}} \right) \times \frac{D}{\mathcal{B}_{\mathrm{dim}}(\pi)} \times \mathcal{B}_{\mathrm{dim}}(\pi) $
  - Explanation: This formula represents:
    - $NB / \mathcal{B}_{\mathrm{vec}}(\pi)$: The average number of base vectors handled per vector block.
    - $n_{\mathrm{probe}} \times \mathcal{B}_{\mathrm{vec}}(\pi) / n_{\mathrm{list}}$: The number of vector blocks probed per query.
    - $D / \mathcal{B}_{\mathrm{dim}}(\pi)$: The size of each dimension slice.
    - $\mathcal{B}_{\mathrm{dim}}(\pi)$: The number of dimension slices.
  - The last two terms ($\frac{D}{\mathcal{B}_{\mathrm{dim}}(\pi)} \times \mathcal{B}_{\mathrm{dim}}(\pi)$) effectively cancel out to $D$. This means the total communication overhead in Harmony remains identical to a purely vector-based scheme: the finer partitioning of dimensions does not increase the total amount of transmitted data, it only reorganizes it into smaller, parallelizable chunks.
- Space Complexity: Each of the $NB$ base vectors is split among $\mathcal{B}_{\mathrm{vec}}(\pi)$ vector blocks and $\mathcal{B}_{\mathrm{dim}}(\pi)$ dimension blocks. Each block stores approximately: $ \frac{NB \times D}{\mathcal{B}_{\mathrm{vec}}(\pi) \times \mathcal{B}_{\mathrm{dim}}(\pi)} $ elements.
  - Explanation: Since there is no data duplication, the total storage remains $O(NB \times D)$. Intermediate query results are small and negligible. Thus, the space cost is $O(NB \times D)$, matching standard distributed systems.
Load Balancing Strategies:
While the initial distribution aims for balance, pruning itself can introduce imbalance: later dimension slices are more likely to be pruned, reducing work for the machines holding them. To counter this, Harmony dynamically adjusts the execution order of dimensions. If a machine (e.g., the one holding dimension block D1) becomes overloaded, subsequent queries (e.g., Q2) are reconfigured to process D1 last. This leverages the higher pruning effectiveness in later stages to alleviate the load on overloaded machines, preventing bottlenecks. A hedged sketch of this reordering follows.
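A minimal sketch of this load-aware reordering, assuming per-machine load estimates are available (illustrative Python; the paper gives no pseudocode for this step):

```python
def reorder_dims_for_query(dim_owner, machine_load):
    """Order a query's dimension slices so those owned by the busiest machines
    run last, where tighter thresholds make pruning most likely.

    dim_owner: dict mapping dimension block id -> machine id.
    machine_load: dict mapping machine id -> current load estimate.
    """
    return sorted(dim_owner, key=lambda d: machine_load[dim_owner[d]])

# Toy usage: D1 lives on overloaded machine M1, so it is scheduled last.
order = reorder_dims_for_query({"D1": "M1", "D2": "M2", "D3": "M3"},
                               {"M1": 0.9, "M2": 0.4, "M3": 0.2})
print(order)  # ['D3', 'D2', 'D1']
```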
4.3. Implementation
Harmony is implemented in C++ (approximately 6,000 lines of code), leveraging modern C++ features for efficiency.
- Communication: OpenMPI is used for inter-node communication. Workers issue non-blocking operations (MPI_Isend / MPI_Irecv) for data and partial results, enabling communication and local computation to overlap (a sketch of this pattern follows this list).
- Computation: Intel MKL (Math Kernel Library) accelerates fundamental operations like L2 (squared Euclidean distance) and inner-product computations. OpenMP is used to parallelize distance calculations within a worker node.
- Control Flow: The master node sends updated top-k heaps (as pruning thresholds) to worker nodes during query execution.
- Parameters: Harmony offers configurable parameters:
  - Node count: Specifies the number of nodes in the distributed index.
  - Pruning_Configuration [nargs=0..1]: Enables or disables vector- and dimension-level pruning.
  - Indexing_Parameters [e.g., nlist, nprobe, dim]: Controls clustering granularity and search scope to balance recall, latency, and memory usage.
    - nlist: Number of inverted lists (clusters).
    - nprobe: Number of inverted lists to probe during search.
    - dim: Dimensionality of the vectors.
  - $\alpha$: A user-defined weight in the cost model to control the preference between load balancing and throughput.
  - Mode [Harmony, Harmony-vector, Harmony-dimension]: Specifies the operational mode (the full Harmony adaptive strategy, or the pure vector-based / dimension-based baselines).
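The overlap pattern referenced in the Communication bullet, as a hedged sketch. Harmony itself is C++ over OpenMPI; this Python (mpi4py) analogue only mirrors the non-blocking Isend/Irecv structure, with made-up buffers, and would be launched under mpirun:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Post non-blocking transfers of partial results to the next pipeline stage...
partial = np.random.rand(1024).astype(np.float32)   # stand-in partial distances
recv_buf = np.empty(1024, dtype=np.float32)
send_req = comm.Isend(partial, dest=(rank + 1) % size, tag=0)
recv_req = comm.Irecv(recv_buf, source=(rank - 1) % size, tag=0)

# ...and overlap them with local distance computation on the next batch.
local = np.random.rand(4096, 64).astype(np.float32)
query = np.random.rand(64).astype(np.float32)
dists = ((local - query) ** 2).sum(axis=1)          # local L2 work proceeds

MPI.Request.Waitall([send_req, recv_req])           # synchronize before reusing buffers
```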
5. Experimental Setup
5.1. Datasets
The experiments utilize ten open-source datasets (Table 2), chosen for their diverse sizes, dimensions, and data types, which are representative of real-world ANNS scenarios. These datasets are commonly used in ANN system evaluations. The Star Light Curves and Hand Outlines datasets were expanded to ensure sufficient data for distributed processing.
The following are the results from Table 2 of the original paper:
| Dataset | Size | Dim | Query Size | Data Type |
|---|---|---|---|---|
| Star Light Curves | 823,600 | 1024 | 1,000 | Time Series |
| Msong | 992,272 | 420 | 1,000 | Audio |
| Sift1M | 1,000,000 | 128 | 10,000 | Image |
| Deep1M | 1,000,000 | 256 | 1,000 | Image |
| Word2vec | 1,000,000 | 300 | 1,000 | Word Vectors |
| Hand Outlines | 1,000,000 | 2709 | 370 | Time Series |
| Glove1.2m | 1,193,514 | 200 | 1,000 | Text |
| Glove2.2m | 2,196,017 | 300 | 1,000 | Text |
| SpaceV1B | 1,000,000,000 | 100 | 10,000 | Text |
| Sift1B | 1,000,000,000 | 128 | 10,000 | Image |
- Star Light Curves (Time Series): A dataset representing light intensity measurements over time, often used in astronomy. The high dimensionality (1024) indicates a rich temporal feature set.
- Msong (Audio): Audio features, typically extracted from music or speech. Its dimensionality (420) is moderately high.
- Sift1M / Sift1B (Image): Scale-Invariant Feature Transform (SIFT) descriptors are classic image features. Sift1M (1 million vectors) is a common benchmark, and Sift1B (1 billion vectors) represents a very large-scale challenge. Dimensionality is 128.
- Deep1M (Image): Features extracted from deep neural networks for image data, with a dimensionality of 256.
- Word2vec (Word Vectors): Embeddings that represent words in a continuous vector space, capturing semantic relationships. Dimensionality is 300.
- Hand Outlines (Time Series): Features describing the outlines of hands, often used in gesture recognition or biometrics. Its very high dimensionality (2709) poses significant challenges.
- Glove1.2m / Glove2.2m (Text): Global Vectors for Word Representation (GloVe) are pre-trained word embeddings. These datasets represent text data with dimensionalities of 200 and 300, respectively, and are of varying sizes.
- SpaceV1B (Text): A very large-scale text embedding dataset with 1 billion vectors and 100 dimensions.

These datasets were chosen because they:
- Cover a wide range of data types (image, audio, text, time series).
- Vary significantly in size (from ~800K to 1 billion vectors) and dimensionality (from 100 to 2709), allowing for robust evaluation under different scales and complexities.
- Are widely recognized benchmarks in the ANNS community, enabling comparison with other research.
- Have characteristics (especially high dimensionality) that make them suitable for validating Harmony's strengths in handling memory and computational challenges in distributed settings.
5.2. Evaluation Metrics
The paper primarily uses Queries Per Second (QPS), Recall, and Speedup to evaluate performance.
- Queries Per Second (QPS):
  - Conceptual Definition: QPS measures the processing capacity of an ANNS system, indicating how many search queries it can successfully execute per second. A higher QPS value signifies greater efficiency and throughput, which is critical for real-time applications.
  - Mathematical Formula: While not explicitly provided in the paper, QPS is generally calculated as: $ \text{QPS} = \frac{\text{Total number of queries processed}}{\text{Total time taken to process all queries (in seconds)}} $
  - Symbol Explanation:
    - Total number of queries processed: The count of individual search requests submitted to and completed by the system.
    - Total time taken to process all queries: The cumulative duration, in seconds, from the submission of the first query to the completion of the last query, including any system overhead.
- Recall:
  - Conceptual Definition: In the context of ANNS, recall quantifies the effectiveness of the approximate search algorithm. It measures the proportion of the true nearest neighbors (those found by an exact search) that are successfully retrieved by the approximate search within its top-K results. A higher recall value indicates a more accurate approximate search (see the worked example after this list).
  - Mathematical Formula: $ \text{Recall} = \frac{\text{Number of true nearest neighbors found by ANNS}}{\text{Total number of true nearest neighbors (from exact search)}} $
  - Symbol Explanation:
    - Number of true nearest neighbors found by ANNS: The count of items that are both in the ground-truth set of nearest neighbors and were returned by the approximate search query.
    - Total number of true nearest neighbors: The total count of actual nearest neighbors for a given query, typically obtained by performing a brute-force (exact) search.
- Speedup:
  - Conceptual Definition: Speedup measures the performance improvement gained by using a parallel or optimized system compared to a baseline serial or less optimized system. It shows how many times faster the new system is.
  - Mathematical Formula: $ \text{Speedup} = \frac{\text{Time taken by baseline system}}{\text{Time taken by proposed system}} = \frac{\text{Throughput of proposed system}}{\text{Throughput of baseline system}} $
  - Symbol Explanation:
    - Time taken by baseline system: The time required for the baseline (e.g., single-node Faiss) to complete a task.
    - Time taken by proposed system: The time required for the Harmony system to complete the same task.
    - Throughput of proposed system: The QPS achieved by the Harmony system.
    - Throughput of baseline system: The QPS achieved by the baseline system.
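As a quick worked example of the recall metric (illustrative Python, not from the paper):

```python
def recall_at_k(retrieved_ids, ground_truth_ids):
    """Recall@K: fraction of the true top-K neighbors the ANNS actually returned."""
    return len(set(retrieved_ids) & set(ground_truth_ids)) / len(ground_truth_ids)

# The ANNS returned 10 candidates; 9 of the true top-10 neighbors are among them.
approx = [3, 7, 12, 25, 31, 40, 44, 52, 60, 99]
truth  = [3, 7, 12, 25, 31, 40, 44, 52, 60, 61]
print(recall_at_k(approx, truth))  # 0.9
```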
5.3. Baselines
Harmony is compared against the following baselines:
- Faiss [25]: The state-of-the-art stand-alone ANNS query engine, serving as the primary single-node baseline. For fairness, all methods (including Harmony and its variants) adopt the same clustering algorithm and number of clusters as Faiss for index construction.
- Harmony-vector (pure vector-based partitioning): A baseline configuration within the Harmony framework that uses only vector-based partitioning, distributing entire vectors to different nodes. This setup is used to assess the performance of traditional partitioning without the benefits of dimension splitting or adaptive strategies.
- Harmony-dimension (pure dimension-based partitioning): Another baseline configuration within the Harmony framework that uses only dimension-based partitioning, distributing vector dimensions across nodes. This setup helps evaluate the inherent load balancing and communication characteristics of dimension-based approaches before Harmony's optimizations.
- Auncel [68]: A distributed ANNS system that provides error-bound guarantees for low search load. Auncel uses a fixed partitioning strategy similar to Harmony-vector. It is referenced for comparison, particularly regarding its limitations under skewed workloads.

These baselines are chosen to:
- Demonstrate Harmony's performance against a highly optimized single-node solution (Faiss).
- Isolate and quantify the benefits of Harmony's hybrid multi-granularity partitioning and pruning by comparing it against pure vector-based and dimension-based distributed setups.
- Contextualize Harmony's advancements within the landscape of existing distributed ANNS systems (Auncel).
5.4. Platform
The experiments were conducted on a robust distributed system with the following specifications:
- Nodes: 20 nodes in total.
- CPU: Intel(R) Xeon(R) Gold 6258R, featuring 28 cores and 56 threads per node.
- Memory: 256GB per node, configured in a NUMA (Non-Uniform Memory Access) architecture.
- Network: High-speed 100 Gb/s interconnect between nodes. This fast network minimizes potential communication bottlenecks.
- Operating System: CentOS Stream 8.
- Instruction Set: Each node supports AVX-512, an advanced instruction set for single instruction, multiple data (SIMD) operations, which is beneficial for accelerating vector computations.

This platform provides a high-performance environment suitable for demanding distributed workloads, allowing for accurate evaluation of Harmony's scalability and efficiency.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. QPS-recall trade-off under uniform workloads
The following figure (Figure 6 from the original paper) shows the query per second (QPS) performance of Harmony and other algorithms across different datasets such as StarLightCurves, Msong, and Sift1M at varying recall rates.
This figure is a chart showing the queries-per-second (QPS) performance of Harmony and other algorithms at different recall rates on various datasets (such as StarLightCurves, Msong, and Sift1M). The figure shows that Harmony delivers higher query performance across recall levels.
The QPS-recall trade-off analysis (Figure 6) compares Harmony (running on four worker nodes) against Faiss (on a single node) under uniform workloads. For very large datasets like SpaceV1B and Sift1B, Harmony used 16 nodes, as Faiss and a 4-node setup could not handle their scale.
Key Findings:
- Scalability and Average Speedup: All Harmony variants (Harmony, Harmony-vector, Harmony-dimension) demonstrate strong scalability. On average, they outperform Faiss by 3.75x in terms of speedup. This indicates that distributing the workload across multiple nodes, even with basic partitioning, significantly boosts performance compared to a single-node solution.
- Exceeding Theoretical Speedup with Pruning: Under high recall precision conditions (e.g., closer to 1.0 recall), Harmony frequently exceeds the theoretical maximum speedup of 4x (for a 4-node system), achieving an impressive 4.63x speedup. This superlinear speedup is attributed to Harmony's pruning mechanism, which effectively reduces the overall computational overhead, making the distributed system more efficient than a linear scaling of single-node performance.
- Optimal Performance for Harmony-vector at Lower Precision: When the recall is lower than 99%, the Harmony-vector partitioning strategy (pure vector-based) shows the optimal performance among the Harmony variants. This suggests that the overhead introduced by dimension-based partitioning and its associated communication for Harmony and Harmony-dimension might outweigh the benefits of pruning or load balancing at lower precision targets, where less stringent search criteria mean that simpler, communication-light strategies perform better.
6.1.2. Search performance under skewed workloads
For skewed workload experiments, the analysis focuses on 8 relatively smaller datasets, as SpaceV1B and Sift1B are too large for a 4-node setup. The experiments use four distinct nodes and measure QPS for three Harmony strategies under varying levels of load imbalance, quantified by variance (as defined in Section 4.2.1).
The following figure (Figure 7 from the original paper) shows the impact of load distribution on query performance.

Key Findings:
- Traditional Vector Partition's Vulnerability to Imbalance: The Harmony-vector (traditional vector-based partition) method performs poorly under load-imbalance scenarios. As the load variance increases (indicating more severe skew), its average QPS decreases by 56% on average. This validates the paper's initial motivation that traditional partitioning struggles with uneven workloads.
- Robustness of Harmony and Harmony-dimension: Both Harmony and Harmony-dimension partitioning methods handle load imbalance much more favorably. They show no clear performance degradation even as load skew changes significantly. This highlights the inherent load-balancing capability of dimension-based approaches.
- Harmony's Adaptive Advantage for Extreme Skew: Even though Harmony-dimension is stable, Harmony (the hybrid adaptive approach) maintains an advantage. When the load is extremely imbalanced, Harmony outperforms Harmony-dimension by 91%. This demonstrates the effectiveness of Harmony's cost model in dynamically adapting the partitioning strategy to maintain high performance under severely skewed workloads, leveraging the synergy between dimension-split and vector-split strategies.
- Consistency for Larger/Higher-Dimensional Data: Harmony consistently retains its advantage for datasets with larger vector sizes or higher feature dimensions. This indicates that its adaptive strategy is particularly beneficial as data complexity and distribution skew worsen, providing robust performance.
6.2. Performance Ablation Study
6.2.1. Time breakdown
This section analyzes how time is spent across data communication, computation, and other overhead for Harmony, Harmony-vector, and Harmony-dimension.
The following figure (Figure 8 from the original paper) shows the time breakdown across data communication, computation, and other overhead for Harmony, Harmony-vector, and Harmony-dimension.

Key Findings:
- Communication Overhead:
Harmony-vectorconsistently incurslower communication overhead. This is expected as it sends entire query vectors to specific nodes and requires less inter-node communication for partial results.- Both
HarmonyandHarmony-dimensionincurcommunication overhead, withHarmony-dimensiongenerally havinghigher communication overheaddue to moredimension slicingand frequent exchange of partial results.Harmonyhas less dimension partitioning compared toHarmony-dimension, thus its communication time is lower.
- Computational Overhead:
  - Despite similar computation overhead between Harmony-dimension and Harmony-vector in some cases (likely because the total sum of partial computations is similar before pruning), Harmony generally exhibits lower computational overhead than the other two. This reduction is primarily attributed to Harmony's pruning module, which eliminates unnecessary distance calculations.
- Communication vs. Computation:
  - For all three methods, the main overheads are concentrated in communication and computation.
  - Crucially, communication overhead is independent of dimensionality. This means that for datasets with larger dimensions (e.g., Msong with 420 dimensions), the communication overhead becomes significantly lower relative to the computation overhead; conversely, for lower-dimensional datasets (e.g., Sift1M with 128 dimensions), communication overhead is proportionally higher. As dimensions grow, computation becomes the dominant factor, making optimizations like pruning even more impactful (a back-of-the-envelope model follows below).
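A back-of-the-envelope model (my illustration, not the paper's cost model) makes this trend concrete: per-candidate compute grows with the number of dimensions a node holds, while each exchanged partial result is a fixed-size scalar, so the communication share shrinks as dimensionality grows. The constants below are arbitrary and chosen only to show the trend:

```python
# Illustrative constants only; real costs depend on hardware and network.
COMPUTE_PER_DIM = 1.0    # hypothetical cost units per dimension
COMM_PER_RESULT = 50.0   # hypothetical fixed cost per partial result

for dim in (128, 420, 960):
    compute = COMPUTE_PER_DIM * dim
    comm_share = COMM_PER_RESULT / (COMM_PER_RESULT + compute)
    print(f"d={dim:4d}: communication share of total ≈ {comm_share:.0%}")
# d= 128: ≈ 28%    d= 420: ≈ 11%    d= 960: ≈ 5%
```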
6.2.2. Contribution of optimization techniques
This section dissects the speedup provided by Harmony's key features: balanced load, pipeline and asynchronous execution, and pruning, measured on a 4-node setup.
The following figure (Figure 9 from the original paper) shows the contribution of the three optimization techniques to Harmony's query throughput.

Key Findings:
- Individual Performance Improvements: Each optimization technique contributes positively to performance (a rough composition of these factors is sketched below).
  - Balanced load: provides an average 1.75x increase in throughput, confirming the importance of distributing work evenly.
  - Pipeline and asynchronous execution: achieves a 1.25x improvement, highlighting the benefits of overlapping communication and computation and of efficient task sequencing.
  - Pruning: delivers a 1.51x improvement in throughput.
- Pruning as the Most Impactful Optimization: The results confirm that pruning is the most impactful optimization technique for improving Harmony's query throughput, which aligns with the motivation in Section 3.1 on the potential of early stopping.
- Dataset-Specific Impact: For the Sift1M dataset, the performance improvements from balanced load and pipeline execution are less pronounced. This is attributed to the relatively uniform load distribution in Sift1M, where the benefits of balancing and asynchronous execution are naturally diminished. However, pruning still yields a clear throughput improvement for Sift1M, indicating its robustness regardless of the initial data distribution.
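As a back-of-the-envelope composition (my arithmetic, not the paper's): if the three reported factors acted independently and multiplicatively, their combined effect would be

$$1.75 \times 1.25 \times 1.51 \approx 3.30\times,$$

though in practice the contributions interact (for example, pruning changes the load that balancing must equalize), so this product should be read only as a rough aggregate.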
6.2.3. Pruning ratio breakdown
This experiment evaluates the effectiveness of Harmony's pruning strategy (discussed in Section 4.3) in reducing computational overhead. It uses a dimensional split of size 4 (meaning vectors are split into four dimension blocks) and measures the pruning ratio for each slice.
The following are the results from Table 3 of the original paper:
| Dataset | First Slice (%) | Second Slice (%) | Third Slice (%) | Fourth Slice (%) | Average Pruning Ratio (%) |
|---|---|---|---|---|---|
| Msong | 0.00 | 43.14 | 76.06 | 95.29 | 53.87 |
| Glove1.2m | 0.00 | 1.54 | 30.71 | 86.66 | 29.73 |
| Word2vec | 0.00 | 24.85 | 53.77 | 83.66 | 40.32 |
| Deep1M | 0.00 | 7.67 | 66.09 | 97.36 | 42.03 |
| Sift1m | 0.00 | 41.76 | 85.04 | 98.40 | 57.05 |
| Star | 0.00 | 81.24 | 95.23 | 99.05 | 69.14 |
| Glove2.2m | 0.00 | 5.14 | 30.70 | 81.18 | 29.76 |
| Hand | 0.00 | 63.54 | 91.62 | 98.10 | 63.83 |
Key Observations:
- Increasing Pruning Rates in Later Slices: Pruning rates are significantly higher in later slices. The first slice consistently shows 0.00% pruning (it is the initial computation, before a meaningful threshold is established). The average pruning ratio is 33.61% for the second slice, 66.15% for the third, and 92.33% for the fourth. This demonstrates that subsequent query slices benefit from accumulated results and a tighter pruning threshold from earlier slices, as discussed in Section 3.1 (see the check after this list).
- Dataset-Dependent Pruning Effectiveness: Pruning rates vary considerably across datasets. For example, Glove1.2M's third slice achieves only 30.71% pruning, whereas Star's third slice reaches 95.23%. This variance is attributed to differences in dataset distributions; some datasets (e.g., Star) lend themselves more readily to early pruning than others (e.g., the Glove datasets), likely due to their inherent clustering or sparsity.
- High Pruning in the Final Slice: The pruning rate in the final slice exceeds 80% for all datasets, often reaching over 95% (Msong, Deep1M, Sift1m, Star, Hand). This strongly indicates that for the vast majority of candidate vectors, distance computations do not need to process the last dimensions; most vectors are pruned before the final dimension slice is reached, significantly reducing computational load.
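As a quick sanity check, the per-slice averages quoted above follow from simple column means over Table 3; the fourth-slice mean computes to ≈92.46%, slightly above the quoted 92.33%, suggesting the paper's averages are rounded or weighted slightly differently:

```python
# Column means over the per-slice pruning ratios from Table 3.
slices = {
    "second": [43.14, 1.54, 24.85, 7.67, 41.76, 81.24, 5.14, 63.54],
    "third":  [76.06, 30.71, 53.77, 66.09, 85.04, 95.23, 30.70, 91.62],
    "fourth": [95.29, 86.66, 83.66, 97.36, 98.40, 99.05, 81.18, 98.10],
}
for name, vals in slices.items():
    print(f"{name} slice mean: {sum(vals) / len(vals):.2f}%")
# second ≈ 33.61%, third ≈ 66.15%, fourth ≈ 92.46%
```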
6.3. Index Build Experiments
6.3.1. Index build time
This experiment measures the time overhead for index construction across Harmony-vector, Harmony-dimension, Harmony (all in a 4-node environment), and Faiss (single-machine). The index construction process is broken down into three stages: training the clustering centers (Train), distributing base vectors to clustering centers (Add), and assigning parts of the index to different machines (Pre-assign).
The following figure (Figure 10 from the original paper) shows Harmony's index build time breakdown.

Key Observations:
- Similar Train and Add Times: The train and add times for all methods (including Faiss and the Harmony variants) are similar. This suggests that Harmony does not introduce significant overhead in the core clustering and initial vector distribution steps, and that it scales well in these phases without major modifications to the underlying index structure.
- Longer Pre-assign Time for Harmony-dimension and Harmony: The pre-assign time is longer for Harmony-dimension and Harmony than for Harmony-vector, because these two methods involve dimension-based partitioning, which requires additional steps such as allocating space and initializing intermediate results for distance computation. This overhead depends on the data size (e.g., approximately twice as long for Glove2.2m as for Glove1.2m).
- Proportionality to Dimensionality: For datasets of the same size, the train and add times are generally proportional to the dataset's dimensionality. This is expected, because both training (e.g., clustering) and distribution (adding vectors to lists) scale linearly with the number of dimensions.
6.3.2. Index space overhead
This experiment compares the memory footprint of Harmony's index with Faiss and the baseline Harmony configurations (4 nodes).
The following are the results from Table 4 of the original paper:
| Dataset | Faiss | Harmony-vector | Harmony-dimension | Harmony |
|---|---|---|---|---|
| StarLightCurves | 3.2GB | 788MB | 815MB | 798MB |
| Msong | 1.6GB | 411MB | 418MB | 413MB |
| Sift1M | 497MB | 126MB | 131MB | 128MB |
| Deep1M | 986MB | 245MB | 253MB | 250MB |
| Word2Vec | 1.2GB | 258MB | 295MB | 279MB |
| HandOutlines | 6.1GB | 1.50GB | 1.54GB | 1.51GB |
| Glove1.2M | 921MB | 227MB | 238MB | 233MB |
| Glove2.2M | 2.5GB | 660MB | 697MB | 686MB |
Key Points:
- Proportionality to Dataset Size and Dimensions: The index overhead (memory usage) is directly proportional to both dataset size and dimensionality. For instance, Msong's index is about 3.2 times larger than Sift1M's, which matches their dimension × dataset size products (Msong: 992,272 vectors × 420 dims ≈ 416M; Sift1M: 1,000,000 vectors × 128 dims = 128M; a ratio of roughly 3.26). This confirms the expected memory scaling with data characteristics.
- Efficiency of Distributed Indexing: All three Harmony partitioning schemes (Harmony-vector, Harmony-dimension, Harmony) occupy approximately 1/4 of the space of Faiss (the single-machine index), because the data is distributed across four nodes. This indicates that distributed indexing does not introduce significant additional space overhead from replication or complex distributed structures, and it effectively alleviates memory pressure on individual nodes.
- Minimal Additional Overhead for Harmony and Harmony-dimension: Both Harmony and Harmony-dimension incur some additional space overhead compared to Harmony-vector, but it is minimal, only about 2% of the original space. The extra data structures and intermediate results needed for dimension-based partitioning and the adaptive strategy are therefore very lightweight.
6.4. Additional Discussion
6.4.1. Impact of dimensions and dataset size on Harmony performance
The paper constructs Gaussian-distributed datasets with dimensions ranging from 64 to 512 and sizes from 250K to 1M to study the impact on Harmony's speedup using four nodes.
The following figure (Figure 11 from the original paper) shows the impact of pruning effectiveness and indexing parameters on search performance.

Key Findings (Figure 11a):
- Performance Improvement with Increased Size and Dimension: As both dataset size and dimension increase, Harmony's speedup improves (an extrapolation sketch follows this list).
  - For each doubling of the dimension, the speedup increases by 26.8%.
  - For each doubling of the dataset size, the speedup increases by 25.9%.
  This indicates that Harmony is particularly effective for larger and higher-dimensional datasets, which are typical challenges in modern ANNS.
- Superlinear Speedup in Large Datasets: On large datasets (1M vectors, 512 dimensions), Harmony exhibits a speedup greater than the number of machines (i.e., greater than 4x). This superlinear speedup is attributed to the pruning techniques, which reduce the overall computational overhead, allowing the distributed system to perform disproportionately better.
- Suboptimal Performance for Smaller Datasets: Harmony's performance is suboptimal for smaller datasets, where communication overhead dominates overall performance. For small datasets, the benefits of distribution and pruning may not outweigh the costs of inter-node communication and coordination, making the relative overhead more significant.
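An extrapolation sketch from the two reported growth rates (my assumption that the per-doubling gains compose multiplicatively and independently; the paper only reports the two average rates, and the anchor point below is arbitrary):

```python
import math

DIM_GAIN, SIZE_GAIN = 1.268, 1.259   # +26.8% and +25.9% per doubling
BASE_DIM, BASE_SIZE = 64, 250_000    # arbitrary anchor configuration

def relative_speedup(dim, size):
    """Speedup relative to the anchor, compounding per-doubling gains."""
    return (DIM_GAIN ** math.log2(dim / BASE_DIM)
            * SIZE_GAIN ** math.log2(size / BASE_SIZE))

for dim, size in [(64, 250_000), (128, 500_000), (512, 1_000_000)]:
    print(f"d={dim:3d}, n={size:>9,}: ≈ {relative_speedup(dim, size):.2f}x vs. anchor")
```

Under this model, moving from the smallest to the largest configuration (three dimension doublings and two size doublings) compounds to roughly a 3.2x relative gain, consistent with the observation that larger, higher-dimensional datasets benefit most.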
6.4.2. Scalability
To validate Harmony's scalability, experiments were performed across different machine configurations: 4, 8, and 16 nodes.
Key Findings (Figure 11b):
- Group-based Partitioning (Harmony) Shows Superlinear Speedup: The group-based partitioning method (Harmony's adaptive strategy) consistently demonstrates a speedup greater than the number of machines. This superlinear speedup is a direct result of the pruning design, which reduces the per-node computational overhead, yielding disproportionately higher gains as nodes are added.
- Vector-based Partitioning Scales Linearly: Vector-based partitioning (Harmony-vector) achieves performance close to the number of workers, scaling almost linearly, since it introduces no overhead beyond simple distribution; each added worker contributes roughly its share of processing capacity.
- Dimension-based Partitioning's Communication Bottleneck: Dimension-based partitioning (Harmony-dimension) initially gains performance as the number of workers grows, but declines after a certain point. As more workers are added (leading to finer dimension splits), the communication overhead for exchanging partial results grows significantly and eventually dominates, diminishing the overall gain and leading to degradation (a toy model of this trade-off follows this list).
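A toy latency model (illustrative, not the paper's cost model) captures this bottleneck: per-node compute shrinks as 1/w while partial-result exchange grows with the worker count w, so latency is minimized at a finite w and throughput declines beyond it. The constants are arbitrary:

```python
# Toy model: latency(w) = compute/w + comm_per_worker * w.
COMPUTE_TOTAL = 1000.0   # hypothetical total compute cost per query
COMM_PER_WORKER = 5.0    # hypothetical exchange cost added per worker

for w in (2, 4, 8, 16, 32, 64):
    latency = COMPUTE_TOTAL / w + COMM_PER_WORKER * w
    print(f"workers={w:2d}: latency ≈ {latency:6.1f}")
# Minimized near w = sqrt(COMPUTE_TOTAL / COMM_PER_WORKER) ≈ 14, after
# which adding workers makes things worse (the observed decline).
```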
6.4.3. Peak memory usage
This experiment measures the peak memory consumption of Harmony and compares it to baseline partition methods in a 4-node setup.
The following are the results from Table 5 of the original paper:
| Dataset | Harmony-vector | Harmony | Harmony-dimension |
|---|---|---|---|
| StarLightCurves | 3.94GB | 4.01GB | 4.07GB |
| Msong | 1.15GB | 1.32GB | 1.46GB |
| Sift1M | 1.37GB | 1.72GB | 1.96GB |
| Deep1M | 1.23GB | 1.61GB | 1.88GB |
| Word2Vec | 658MB | 723MB | 812MB |
| HandOutlines | 11.06GB | 11.19GB | 11.33GB |
| Glove1.2M | 814MB | 932MB | 1.06GB |
| Glove2.2M | 1.64GB | 1.98GB | 2.23GB |
Key Findings:
- Proportionality to Data Dimensions and Size: Peak memory usage for all three partition methods is generally proportional to both data dimensionality and dataset size, a natural consequence of storing and processing vector data.
- Memory Overhead Diminishes with Increasing Dimensions: Peak memory consumption for Harmony and Harmony-dimension is slightly higher than for Harmony-vector, but the gap shrinks as dimensionality grows. For example, on Deep1M (256 dimensions), Harmony exceeds Harmony-vector by 30.9%, whereas on HandOutlines (2709 dimensions) the increase drops to 1.17% (verified in the check after this list). This suggests that the additional intermediate results required by dimension-based partitioning, whose size does not scale with dimensionality, become diluted as the vector data itself grows, demonstrating Harmony's potential for efficiently handling high-dimensional data.
- Memory Overhead Related to Dimension Slices: The increase in memory usage from intermediate results is related to the number of dimension slices. Harmony shows a smaller increase than Harmony-dimension because its adaptive strategy does not partition entirely by dimension, managing intermediate-result overhead by sometimes favoring vector-based partitioning or a coarser dimension split.
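The quoted overhead percentages follow directly from Table 5 (a quick check):

```python
# Peak memory from Table 5, in GB: (Harmony-vector, Harmony).
peak_gb = {
    "Deep1M (256 dims)":        (1.23, 1.61),
    "HandOutlines (2709 dims)": (11.06, 11.19),
}
for name, (vec, har) in peak_gb.items():
    print(f"{name}: Harmony over Harmony-vector ≈ {(har - vec) / vec:.1%}")
# Deep1M ≈ 30.9%, HandOutlines ≈ 1.2%
```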
6.4.4. Comparison between Auncel and Harmony
- Auncel [68]: This distributed vector database targets low-search-load scenarios with specific precision requirements. It employs a fixed partitioning strategy similar to Harmony-vector, distributing the query load across nodes.
- Auncel's Limitation: Auncel shares Harmony-vector's drawback regarding load balancing: its performance is not optimal under skewed workloads, because its fixed, vector-based partitioning cannot adapt to uneven query distributions, leading to bottlenecks on nodes holding "hot" data.
- Harmony's Advantage: In contrast, Harmony is specifically designed for skewed workloads. It leverages adaptive multi-granularity partitioning (guided by the cost model) and pruning to achieve fine-grained load balancing. This allows Harmony to handle uneven query distributions far more effectively, maintaining higher throughput and stability in scenarios where Auncel (and Harmony-vector) would falter due to load imbalance.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces Harmony, a novel distributed vector database system designed to address the challenges of load imbalance and high communication overhead in large-scale Approximate Nearest Neighbor Search (ANNS). Harmony's core innovation lies in its multi-granularity partition strategy, which intelligently combines dimension-based and vector-based partitioning, dynamically guided by a cost model. This adaptive approach ensures balanced computational load across nodes while minimizing communication costs. Furthermore, Harmony integrates an early-stop pruning mechanism within a flexible pipelined execution engine, exploiting the monotonicity of distance computations in dimension-based partitions to significantly reduce redundant computations and network transfers.
Experimental evaluations across diverse real-world datasets demonstrate Harmony's superior performance. It achieves an average 4.63 times throughput improvement over Faiss (a state-of-the-art single-node system) in a four-node setup, exhibiting superlinear speedup due to its effective pruning. Crucially, Harmony shows a 58% performance improvement over traditional distributed methods when faced with skewed workloads, proving its robustness. Its optimization techniques (balanced load, pipelined execution, and pruning) each contribute substantial performance gains, with pruning being the most impactful. Harmony also scales efficiently with increasing dataset size and dimensionality, while maintaining minimal index space and manageable peak memory usage.
7.2. Limitations & Future Work
The paper does not explicitly dedicate a section to Limitations & Future Work. However, based on the experimental results and discussions, several implicit limitations and potential future research directions can be inferred:
Implicit Limitations:
- Overhead for Smaller Datasets: The experiments show that Harmony's performance is suboptimal for smaller datasets, where communication overhead dominates. The complexity introduced by adaptive partitioning and pipelined pruning may not be justified for datasets small enough to be handled efficiently by single-node or simpler distributed solutions.
- Communication Overhead in Harmony-dimension: While Harmony manages communication better, the Harmony-dimension baseline clearly suffers from increased communication overhead as the number of nodes or dimension slices grows. This highlights a fundamental challenge of pure dimension-based partitioning that Harmony mitigates but may not entirely eliminate, especially in extreme configurations.
- Cost Model Complexity: Although a cost model drives the dynamic adaptation, its own computational overhead is not deeply analyzed. For rapidly changing workloads or very frequent adjustments, the cost of running the cost model itself could become a factor.
- Generalizability of Pruning Effectiveness: The paper notes that pruning rates vary significantly across datasets due to their distinct distributions. While Harmony benefits from pruning, its efficacy is dataset-dependent, so performance gains from pruning may not be uniform across all real-world data distributions.
- Dependency on Metrics with Monotonicity: The early-stop pruning relies on the monotonicity of distance computations (e.g., Euclidean distance, or cosine similarity with normalized vectors). For non-monotonic similarity metrics, this pruning technique may not be directly applicable, limiting Harmony's generalizability across all ANNS use cases.
Potential Future Work (Inferred):
- Adaptive Cost Model Refinement: Further research could make the cost model more sophisticated, potentially incorporating machine learning to predict workload patterns and to adjust the imbalance weight, or even the structure of the partition plan, more granularly.
- Optimizing Communication for Small Datasets: Investigating techniques to reduce communication overhead specifically for smaller datasets, or when the number of nodes is very high, would make Harmony universally efficient.
- Dynamic Dimension Ordering: The paper mentions dynamically adjusting the execution order of dimensions to mitigate load imbalance caused by pruning. This could be explored in more depth, potentially with predictive models for optimal dimension ordering.
- Support for Diverse Similarity Metrics: Expanding Harmony's pruning capabilities to non-monotonic distance or similarity metrics, perhaps through novel bounding techniques or approximate-monotonicity approaches.
- Fault Tolerance and Elasticity: Although Harmony is a distributed system, the paper does not detail its fault tolerance mechanisms (e.g., handling node failures, dynamic scaling of nodes), which would be crucial for production-grade deployments.
- Integration with Advanced Indexing: Exploring how Harmony's partitioning and pruning strategies could be integrated with more advanced graph-based or quantization-based ANNS indexing techniques, which traditionally struggle in distributed environments.
7.3. Personal Insights & Critique
Harmony presents a compelling solution to critical challenges in distributed ANNS. The hybrid multi-granularity partitioning is a highly intuitive and practical approach, effectively bridging the trade-off between load balancing and communication overhead. Rather than rigidly adhering to one partitioning paradigm, the cost model-driven adaptation is a significant strength. This flexibility allows Harmony to perform robustly across diverse and skewed workloads, which is a common pain point in real-world distributed systems.
The early-stop pruning mechanism is particularly elegant. By leveraging the inherent monotonicity of standard distance metrics, Harmony avoids a tremendous amount of redundant computation. The empirical results, showing superlinear speedup and pruning ratios exceeding 90% in later stages, strongly validate this design choice. This dimension-level pruning is a powerful example of how a deep understanding of the underlying mathematical properties of the problem can lead to substantial system-level optimizations. The pipelined execution further enhances this by overlapping computation and communication, which is crucial for high-throughput systems.
Critique:
- Black Box Nature of the Cost Model: While the cost model is central to Harmony's adaptive strategy, the paper provides limited detail on its practical implementation and on how its parameters (such as the imbalance weight) are tuned or learned. A more in-depth discussion of the estimation methodology for computation and communication costs would have been beneficial: are these empirically derived or analytical models, and how accurate are they in dynamic environments?
- Absence of Explicit Future Work: The lack of a dedicated Limitations & Future Work section is a minor omission. While some challenges can be inferred, stating them explicitly would guide future research and give a more complete picture of the system's current boundaries.
- Comparison to Graph-based Distributed ANNS: The paper primarily compares against Faiss (cluster-based) and its own vector-based and dimension-based variants. While Faiss is state-of-the-art, a more comprehensive comparison against recent distributed graph-based ANNS systems (even with different trade-offs) would further contextualize Harmony's performance, especially given the paper's remark that graph-based methods are "not well compatible with distributed features."
- Scalability at Very High Node Counts: Although scalability up to 16 nodes is shown, the decline of Harmony-dimension at higher worker counts suggests that communication becomes dominant even for dimension-based approaches. Further analysis or mitigation strategies for extreme scale (e.g., hundreds or thousands of nodes) would be insightful.
Transference/Application:
The principles employed in Harmony – adaptive multi-granularity partitioning, cost model-driven optimization, and monotonicity-based early pruning in a pipelined distributed execution – are highly transferable. They could be applied to other high-dimensional data processing tasks beyond ANNS, such as:
- Distributed Machine Learning Inference: where models operate on high-dimensional feature vectors and computation can be distributed.
- Large-scale Data Filtering/Querying: for non-ANN queries on high-dimensional data, where partial computations could allow early rejection of irrelevant data.
- Other Distance/Similarity Computations: the monotonicity principle applies to various metrics, suggesting broader applicability for pruning in distributed environments.

Overall, Harmony makes a significant contribution by presenting a well-engineered and rigorously evaluated solution that pushes the boundaries of efficient and robust distributed ANNS, particularly for skewed workloads and high-dimensional data.