
AnyKey: A Key-Value SSD for All Workload Types

Published: 02/06/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

AnyKey, a novel KV-SSD design, addresses performance issues in existing systems under workloads with larger keys by optimizing metadata size. It outperforms state-of-the-art KV-SSD designs across various workload types.

Abstract

Key-value solid-state drives (KV-SSDs) are considered as a potential storage solution for large-scale key-value (KV) store applications. Unfortunately, the existing KV-SSD designs are tuned for a specific type of workload, namely, those in which the size of the values are much larger than the size of the keys. Interestingly, there also exists another type of workload, in practice, in which the sizes of keys are relatively large. We re-evaluate the current KV-SSD designs using such unexplored workloads and document their significantly-degraded performance. Observing that the performance problem stems from the increased size of the metadata, we subsequently propose a novel KV-SSD design, called AnyKey, which prevents the size of the metadata from increasing under varying sizes of keys. Our detailed evaluation using a wide range of real-life workloads indicates that AnyKey outperforms the state-of-the-art KV-SSD design under different types of workloads with varying sizes of keys and values.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

AnyKey: A Key-Value SSD for All Workload Types

1.2. Authors

  • Chanyoung Park (Hanyang University, Ansan, Republic of Korea)
  • Jungho Lee (Hanyang University, Ansan, Republic of Korea)
  • Chun-Yi Liu (Micron Technology Inc., Folsom, CA, USA)
  • Kyungtae Kang (Hanyang University, Ansan, Republic of Korea)
  • Mahmut Taylan Kandemir (Pennsylvania State University, University Park, PA, USA)
  • Wonil Choi (Hanyang University, Ansan, Republic of Korea) - Corresponding author

1.3. Journal/Conference

The paper does not explicitly state the specific journal or conference in the provided text. However, the presence of CCS Concepts, Keywords, and ACM Reference Format suggests publication in a computing or systems conference/journal, likely associated with ACM. Given the technical depth and focus on storage systems, venues like USENIX FAST, USENIX ATC, ACM ASPLOS, or ACM SIGMOD are common for such research. These venues are highly reputable and influential in the fields of operating systems, computer architecture, and storage.

1.4. Publication Year

2025 (published 2025-02-06)

1.5. Abstract

Key-value solid-state drives (KV-SSDs) are a promising storage solution for large-scale key-value (KV) store applications. However, existing KV-SSD designs are primarily optimized for workloads where values are much larger than keys (referred to as high-v/k workloads). The authors discover a significant performance degradation in current KV-SSD designs when subjected to low-v/k workloads, where key sizes are relatively large. This performance issue is attributed to the increased size of metadata, which cannot be fully accommodated in the device-internal DRAM.

To address this, the paper proposes AnyKey, a novel KV-SSD design that prevents metadata size from increasing with varying key sizes. AnyKey achieves this by grouping KV pairs and storing only one key per group in the metadata, while also introducing hash lists and a value log for efficient management and reduced flash accesses. Through extensive evaluation using diverse real-life KV workloads, AnyKey demonstrates superior performance over state-of-the-art KV-SSD designs, particularly in low-v/k scenarios, and its enhanced version (AnyKey+) further improves performance across all workload types.


2. Executive Summary

2.1. Background & Motivation

The explosive growth of key-value (KV) store-based applications and services has driven the development of Key-Value Solid-State Drives (KV-SSDs). KV-SSDs aim to integrate the KV store directly into the SSD device, offering several advantages over traditional host-based KV stores using block SSDs:

  1. Low-latency access: Streamlines the software stack for direct KV requests.

  2. Improved scalability: Simply adding more KV-SSDs enhances system scalability by reducing CPU and memory requirements on the host.

  3. Enhanced bandwidth and device lifetime: SSD-aware KV store management (e.g., reducing write amplification).

    Despite these potential benefits, the adoption of KV-SSDs has been slow, partly because their performance often lags behind host-based KV stores. The key challenge lies in effectively utilizing the SSD's limited internal resources to meet performance demands.

The core problem identified by the paper is that existing LSM-tree-based KV-SSD designs are predominantly tuned for specific workload types, namely high-v/k workloads where the size of values is significantly larger than keys. The authors observed that in practice, there are also low-v/k workloads with relatively large key sizes. When existing KV-SSDs (like the state-of-the-art PinK) are re-evaluated under these unexplored low-v/k workloads, their performance degrades significantly. This degradation is attributed to an explosion in metadata size, which can no longer fit into the SSD's internal DRAM. Consequently, requests incur additional flash accesses to retrieve metadata, drastically increasing tail latencies and decreasing IOPS.

The paper's innovative idea, or entry point, is to address this metadata scalability issue for low-v/k workloads. By redesigning how KV pairs are indexed and managed within the KV-SSD, AnyKey aims to minimize metadata size and ensure it always resides in DRAM, thereby making KV-SSDs truly versatile for all workload types.

2.2. Main Contributions / Findings

The paper makes the following primary contributions:

  1. Unveiling the low-v/k Workload Problem: The authors identify and characterize a range of real-world KV workloads with low value-to-key ratios (low-v/k) from various domains. They demonstrate that existing KV-SSD designs exhibit significantly degraded performance under these previously underexplored workloads. This finding highlights a crucial limitation in current KV-SSD research and design.

  2. Proposing AnyKey for low-v/k Workloads: They propose and evaluate a novel KV-SSD design, AnyKey, specifically targeting low-v/k workloads. The core innovation of AnyKey is to streamline the metadata required for locating KV pairs, making it compact enough to fit entirely within the SSD's internal DRAM. This design significantly improves tail latencies, IOPS, and device lifetime compared to state-of-the-art designs.

  3. Presenting AnyKey+ for All Workloads: The paper also introduces an enhanced version, AnyKey+, which further improves IOPS for high-v/k workloads by mitigating high overheads associated with LSM-tree management (specifically, "compaction chains"). This enhancement ensures that AnyKey+ outperforms the state-of-the-art across all types of KV workloads, making it a truly universal KV-SSD solution.

    The key conclusion is that AnyKey (and its enhanced version AnyKey+) successfully resolves the performance bottleneck of KV-SSDs under low-v/k workloads, which primarily stems from unmanageably large metadata. By intelligently grouping KV pairs and separating values, AnyKey ensures metadata remains DRAM-resident, leading to substantial performance gains and increased storage utilization and device lifetime.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the AnyKey paper, a foundational understanding of key-value stores, SSDs, and Log-Structured Merge (LSM) trees is essential.

3.1.1. Key-Value (KV) Store

A key-value (KV) store is a simple database model designed to store, retrieve, and manage associative arrays (also known as dictionaries or hash maps). In this model, data is stored as a collection of key-value pairs, where each key is a unique identifier, and the value is the data associated with that key. The key is used to efficiently retrieve its corresponding value.

  • Example: Imagine a contacts list. The key could be a person's name (e.g., "Alice Smith"), and the value could be their phone number, email, and address (e.g., "{'phone': '555-1234', 'email': 'alice@example.com'}").
  • Operations: Basic operations include PUT (insert/update a key-value pair), GET (retrieve a value given a key), and DELETE (remove a key-value pair).
  • Applications: Widely used in web services, mobile apps, databases, and caches due to their simplicity, scalability, and high performance for specific access patterns.
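
To make the interface concrete, below is a minimal in-memory sketch of the three basic operations (purely illustrative; a real KV store adds persistence, concurrency control, and indexing, and the class and key names here are made up):

```python
class TinyKVStore:
    """A toy in-memory key-value store illustrating PUT / GET / DELETE."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: str) -> None:
        # Insert a new pair, or overwrite the value if the key already exists.
        self._data[key] = value

    def get(self, key: str) -> str | None:
        # Return the value for the key, or None if the key is absent.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        # Remove the pair; a missing key is ignored.
        self._data.pop(key, None)


store = TinyKVStore()
store.put("Alice Smith", "{'phone': '555-1234', 'email': 'alice@example.com'}")
print(store.get("Alice Smith"))
store.delete("Alice Smith")
```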

3.1.2. Solid-State Drive (SSD)

An SSD is a type of non-volatile storage device that stores persistent data on flash memory cells. Unlike traditional hard disk drives (HDDs) that use spinning platters and read/write heads, SSDs have no moving parts, leading to significantly faster data access, lower latency, and higher durability.

  • Flash Memory: The core component of an SSD. Data is stored in NAND flash memory cells.
  • Pages: The smallest unit of data that can be written to or read from flash memory. Typically 4KB, 8KB, or 16KB in size.
  • Blocks: A collection of multiple pages (e.g., 64, 128, 256 pages). This is the smallest unit of data that can be erased.
  • Erase-before-write: A fundamental characteristic of flash memory. Before new data can be written to a page, the entire block containing that page must first be erased. This makes in-place updates (modifying data directly in its location) inefficient and costly for SSDs.
  • Flash Translation Layer (FTL): A software layer within the SSD controller that manages the mapping between logical block addresses (LBAs) from the host and physical page addresses (PPAs) on the flash memory. It handles wear leveling, garbage collection, and bad block management to extend the SSD's lifespan and present a block device interface to the host.

3.1.3. Key-Value SSD (KV-SSD)

A KV-SSD is an SSD that exposes a key-value interface directly to the host system, rather than the traditional block-level interface. Instead of the host managing key-value pairs and translating them into block I/Os, the KV-SSD itself understands and manages key-value requests.

  • Advantages:
    • Reduced Host Overhead: Offloads KV store management tasks (like LSM-tree compaction, garbage collection) from the host CPU and DRAM to the SSD controller.
    • Optimized Storage: The SSD controller has intimate knowledge of the flash memory characteristics. By directly managing KV pairs, it can perform SSD-aware optimizations, such as reducing write amplification (WA) and improving data locality.
    • Streamlined I/O Path: Eliminates layers of software stack between the application and the physical storage, leading to lower latency.
  • Internal Architecture: KV-SSDs still contain flash memory and a controller. The difference is in the firmware on the controller, which now includes logic to manage KV pairs directly using data structures like hash tables or LSM-trees and their associated metadata.

3.1.4. Log-Structured Merge (LSM) Tree

An LSM-tree is a data structure optimized for write-heavy workloads, commonly used in key-value stores like RocksDB and LevelDB. It is particularly flash-friendly due to its out-of-place update mechanism.

  • Structure: An LSM-tree organizes data into multiple sorted components (levels).
    • Memtable (C0): An in-memory component (usually DRAM) where all new writes are buffered. It's typically a sorted data structure (e.g., skip list, B-tree).
    • SSTables (L1, L2, ...): When the Memtable becomes full, its contents are flushed to disk (or flash) as an immutable Sorted String Table (SSTable). This becomes the first level (L1). Subsequent SSTables are merged from higher levels to lower levels (L1 merges into L2, L2 into L3, etc.).
  • Out-of-Place Updates: Instead of modifying data directly, new versions of data are always written to the Memtable. This means old versions of data on disk are never overwritten; they are eventually cleaned up during compaction. This is highly beneficial for SSDs because it avoids expensive in-place updates that would trigger block erases.
  • Compaction: The process of merging SSTables from a higher level with SSTables from the next lower level. During compaction, KV pairs are read from both levels, sorted, and then written out as new, larger SSTables for the lower level. Invalid (overwritten or deleted) KV pairs are discarded during this process, reclaiming space. This is a critical background operation that consumes I/O bandwidth and impacts performance.
  • Tail Latencies: An LSM-tree can guarantee bounded tail latencies because its multi-level indexing structure allows for predictable access paths, although compaction operations can introduce latency spikes if not managed carefully.
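
To ground the structure above, here is a heavily simplified, self-contained sketch of the LSM-tree write and read paths, with a memtable that flushes into immutable sorted runs (real implementations add write-ahead logging, multi-level compaction, and Bloom filters; the flush threshold below is arbitrary):

```python
import bisect

MEMTABLE_LIMIT = 4      # tiny flush threshold, chosen only for illustration

memtable = {}           # C0: mutable, in-memory
sstables = []           # immutable sorted runs, newest first (stands in for L1, L2, ...)

def put(key, value):
    memtable[key] = value                # out-of-place update: newest version wins
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    run = sorted(memtable.items())       # freeze the memtable into a sorted run (an SSTable)
    sstables.insert(0, run)              # newer runs are searched before older ones
    memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    for run in sstables:                 # search newest to oldest
        keys = [k for k, _ in run]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return run[i][1]
    return None
```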

3.1.5. Metadata

In the context of storage, metadata is "data about data." For a KV-SSD, metadata includes information necessary to locate KV pairs stored in the flash memory. This typically involves:

  • Keys: Or at least prefixes/hashes of keys for indexing.
  • Pointers/Addresses: Physical addresses (PPAs) in flash memory where the actual KV values or KV entities are stored.
  • Index Structures: Data structures (like level lists or meta segments in PinK, or level lists and hash lists in AnyKey) that organize these keys and pointers for efficient searching.
  • Location: Ideally, metadata should reside in the SSD's fast internal DRAM for quick access. If DRAM is insufficient, metadata must be stored in flash, incurring additional, slower flash accesses.

3.1.6. Tail Latency

Latency is the time delay between a request and its response. Tail latency refers to the latency experienced by a small percentage of requests, typically the slowest ones (e.g., the 95th, 99th, or 99.9th percentile).

  • Importance: In large-scale online services, even a small fraction of slow requests can significantly impact user experience or overall system performance. High tail latencies can be caused by various factors, including garbage collection, compaction, or contention for resources.
  • Example: A 95th percentile read latency of 10ms means that 95% of read requests complete within 10ms, but 5% take longer.

3.1.7. IOPS (Input/Output Operations Per Second)

IOPS is a common performance metric that measures the number of distinct input and output operations a storage device or system can handle in one second. Higher IOPS generally indicates better performance for transactional workloads.

3.1.8. Write Amplification (WA)

Write amplification is an undesirable phenomenon in flash memory where the actual amount of data written to the flash media is a multiple of the amount of data originally written by the host.

  • Causes: Out-of-place updates, garbage collection (which involves reading valid data from old blocks and rewriting them to new blocks), and wear leveling.
  • Impact: High WA reduces the SSD's lifespan (due to a finite number of erase/program cycles per flash cell) and consumes I/O bandwidth, leading to lower performance.
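
A quick worked example of the metric (the numbers are made up for illustration):

```python
host_bytes_written = 10 * 2**30     # 10 GiB issued by the host
flash_bytes_written = 30 * 2**30    # 30 GiB actually programmed to flash (GC, compaction, ...)

write_amplification = flash_bytes_written / host_bytes_written
print(write_amplification)          # 3.0 -> every host byte costs three flash bytes
```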

3.1.9. Garbage Collection (GC)

Garbage collection is a background process in SSDs essential for managing flash memory. Due to the erase-before-write constraint, SSDs cannot simply overwrite old data.

  • Process: When flash blocks contain a mix of valid (live) and invalid (stale/deleted) pages, GC identifies victim blocks. It then reads all the valid pages from these victim blocks, consolidates them, and rewrites them to new, empty pages in other blocks. Finally, the victim blocks are erased, making them available for new writes.
  • Overhead: GC operations consume significant I/O bandwidth (reading and writing valid data) and CPU resources, potentially increasing latency and reducing IOPS for foreground I/Os.

3.2. Previous Works

The paper categorizes existing KV store execution models into three types and discusses the evolution of KV-SSD designs, particularly focusing on LSM-tree-based approaches.

3.2.1. Execution Models of KV Store

  1. Host + Block Device (Figure 3a):
    • Description: The traditional approach where a KV store (e.g., RocksDB, LevelDB) runs entirely on the host system, using conventional block-based SSDs.
    • Performance Issue: This model suffers from performance bottlenecks due to the full software stack required to generate block I/Os and access the underlying storage.
    • Prior Efforts: Many research efforts [7, 11, 18-20, 44, 46-49, 61] have focused on optimizing host-based KV stores to better leverage block devices. More recent work [12, 16, 23, 35, 37, 54, 56, 60] introduced byte-addressable persistent memory (PM) in the host to reduce block I/O traffic.
  2. Host + SSD Support (Figure 3b):
    • Description: This model attempts to improve performance by enabling the host-based KV store to leverage specific features or internal structures of the SSD. The SSD is no longer treated as a "black box."
    • Examples: Defining new SSD interfaces that expose device-internal parallelism or transactional persistence capabilities [50], or modifying kernel drivers for Open-Channel SSDs to improve caching and scheduling [63].
  3. KV-SSD (Figure 3c):
    • Description: The most recent approach, where the KV store logic is moved directly into the SSD controller. The host issues KV requests, and the SSD manages the KV pairs internally.

    • Advantages: Higher performance, better scalability, and more efficient device management due to SSD-aware optimizations [4, 36, 38].

    • Focus of Paper: This paper focuses on improving KV-SSD designs.

      The following figure (Figure 3 from the original paper) illustrates the three execution models and the internal architecture of a KV-SSD.


Figure 3. (a-c) The comparison of three execution models for a KV store, and (d) the internal architecture of a KV-SSD.

3.2.2. Existing KV-SSD Designs

KV-SSDs share the same hardware architecture as conventional block SSDs (flash chips, blocks, pages). The key difference lies in how they store and locate KV pairs and their metadata (usually in DRAM). The choice of metadata data structure is critical.

  1. Hash-based KV-SSD:

    • Description: Early KV-SSD designs often used hash tables [8, 24, 32, 36, 64] for metadata due to their simplicity.
    • Issues:
      • In-place Updates: Hash tables typically require in-place updates for KV pairs. On SSDs, this leads to partial page updates and frequent, costly garbage collection (GC) operations, wasting bandwidth and reducing device lifetime.
      • Hash Collisions: To avoid frequent GC, hash tables might be minimized to fit in DRAM. However, a small hash table increases the likelihood of hash collisions, leading to additional flash accesses to resolve collisions and consequently increasing tail latencies.
    • Alternative Sorted Indexes: Other sorted indexing structures like B-trees [15] and learned indexes [18, 40] also suffer from in-place update issues, making them less suitable for write-heavy workloads on SSDs.
  2. LSM-tree-based KV-SSD:

    • Description: LSM-trees [13, 30, 31, 43, 51] have gained popularity as an alternative due to their flash-friendly out-of-place update mechanism. They are now the mainstream for KV-SSD designs.

    • PinK as State-of-the-Art: The paper identifies PinK [30, 31] as a state-of-the-art LSM-tree-based KV-SSD design and uses it as the primary baseline for comparison.

    • PinK's Structure (Figure 4):

      • Data Segments: KV pairs are stored directly in flash pages as data segments.

      • Meta Segments: Keys and their physical page addresses (PPAs) are sorted within meta segments. Upper-level meta segments are kept in DRAM, while lower-level ones are stored in flash.

      • Level Lists: For each meta segment, a level list entry exists, containing the meta segment's first key and its location (DRAM or flash address). Level lists are grouped into multiple levels (L1 to Ln) and sorted.

        The following figure (Figure 4 from the original paper) illustrates the internal structure of PinK.

        Figure 4. The internal structure of PinK.

    • Locating a KV pair (Read Operation in PinK, Figure 4):

      1. Given a key (e.g., "PP"), PinK searches level lists from L1 downwards.
      2. If an entry in a level list (e.g., L1) is found where "PP" falls into its key range, its corresponding meta segment is accessed (e.g., from DRAM address 0x10).
      3. The key "PP" is then searched within this meta segment. If not found (due to overlapping key ranges across levels), the search continues to the next level list.
      4. If "PP" is found in a meta segment corresponding to an entry in a lower level list (e.g., L2), the PPA of the data segment containing the target KV pair is obtained.
      5. Finally, the data segment at that PPA is read, and the value is extracted.
    • Write Operation in PinK: New write requests are initially buffered in DRAM. When the buffer fills, KV pairs are merged into L1 (L0-L1 compaction), generating new level list entries and meta segments in DRAM, while values are written to data segments in flash.
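
The read lookup described above can be summarized with the following hedged sketch (a reconstruction from this description, not PinK's actual code; each level is modeled as a sorted list of (first_key, meta_segment) pairs, a meta segment as a key-to-PPA dictionary, and flash reads as a caller-supplied function):

```python
import bisect

def pink_get(levels, key, read_data_segment):
    """levels: per-level sorted lists of (first_key, meta_segment) tuples.
    read_data_segment(ppa, key) stands in for a flash read of a data segment."""
    for level in levels:                           # search L1 downwards
        first_keys = [fk for fk, _ in level]
        i = bisect.bisect_right(first_keys, key) - 1
        if i < 0:
            continue                               # key precedes this level's key range
        meta_segment = level[i][1]                 # may itself require a flash read if not in DRAM
        ppa = meta_segment.get(key)
        if ppa is None:
            continue                               # overlapping key ranges: try the next level
        return read_data_segment(ppa, key)         # flash read of the data segment holding the value
    return None                                    # key not stored
```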

3.3. Technological Evolution

The evolution of storage systems for key-value stores has progressed from entirely host-managed, block-device-agnostic approaches to increasingly SSD-aware and eventually in-storage computation models.

  1. Traditional Host-Managed (early 2000s onwards): Initially, KV stores like Berkeley DB, and later NoSQL databases like Cassandra and MongoDB, ran on conventional HDDs or block-based SSDs. The KV store software (e.g., LSM-trees) handled all indexing, compaction, and garbage collection logic. The SSD was a passive block device, abstracted by the FTL. This led to high CPU and DRAM overheads on the host, and I/O patterns (especially random writes) that were often flash-unfriendly.

  2. SSD-Aware Host-Side Optimizations (mid-2010s): Researchers and developers began optimizing host-based KV stores to be more SSD-aware. This involved modifying LSM-tree compaction strategies (e.g., WiscKey [49] separating keys from values), improving caching, and exploiting SSD parallelism, sometimes through Open-Channel SSDs or custom SSD interfaces. The goal was to reduce write amplification and tail latencies by coordinating host-side KV store logic with SSD internals.

  3. KV-SSDs (late 2010s onwards): The logical next step was to move the KV store intelligence directly into the SSD controller. Early KV-SSDs primarily used hash tables for indexing, leveraging the SSD's internal DRAM for metadata. However, hash-based designs faced challenges with in-place updates (leading to GC overheads) and hash collisions. The industry quickly shifted towards LSM-tree-based KV-SSDs (e.g., PinK), which are inherently flash-friendly due to out-of-place updates and tiered storage management. These designs promised bounded tail latencies and reduced write amplification.

    This paper's work, AnyKey, fits into the latest stage of this evolution. It builds upon the LSM-tree-based KV-SSD paradigm but addresses a critical oversight: the assumption that KV-SSDs would primarily handle high-v/k workloads. AnyKey pioneers optimizations for low-v/k workloads, ensuring that LSM-tree-based KV-SSDs can perform efficiently across a broader spectrum of real-world KV store applications.

3.4. Differentiation Analysis

Compared to PinK and other contemporary LSM-tree-based KV-SSDs, AnyKey introduces fundamental differences in metadata management and LSM-tree compaction strategies, driven by its focus on low-v/k workloads.

  1. Metadata Representation and Size:

    • PinK: Maintains individual meta segments for each key (or small groups of keys) which store keys and PPAs in sorted order. For high-v/k workloads, where keys are small, these meta segments are compact and can mostly reside in DRAM. However, for low-v/k workloads (large keys), the collective size of meta segments explodes, forcing many to be stored in flash. This incurs multiple flash accesses for metadata lookups, causing severe performance degradation.
    • AnyKey: Drastically reduces metadata size by grouping multiple KV pairs into data segment groups. Instead of storing metadata for each KV pair, AnyKey stores metadata only for each data segment group. This means level list entries contain only the smallest key of the group, the PPA of the first page, and a list of hash values for the first keys in each page within the group. This design ensures the metadata (level lists and hash lists) remains small enough to fit entirely within DRAM, regardless of the individual key sizes.
  2. Handling Metadata in DRAM vs. Flash:

    • PinK: Dynamically stores meta segments in DRAM or flash based on available DRAM capacity, leading to unpredictable flash accesses for metadata under low-v/k scenarios.
    • AnyKey: Explicitly designs metadata (level lists and hash lists) to be DRAM-resident. By minimizing their size, AnyKey guarantees that KV pair location can almost always be determined without additional flash accesses for metadata.
  3. Mechanism for Locating KV pairs within a Group:

    • PinK: Directly stores keys in meta segments and uses a binary search (or similar) to find the key within the meta segment.
    • AnyKey: Introduces hash values (e.g., 32-bit xxHash) for keys. KV pairs within a data segment group are sorted by these hash values. To quickly find a KV pair within a group, AnyKey uses a list of hash values for the first keys on each page. This allows for efficient page-level targeting using a smaller metadata footprint than storing actual keys for each page. It also includes hash collision bits to manage rare cases of hash collisions across page boundaries.
  4. LSM-tree Compaction Efficiency:

    • PinK: During compaction, KV pairs (both keys and values) are read from source levels and rewritten to the destination level. When meta segments are in flash, this involves substantial reads and writes of key data, further exacerbating compaction overhead.
    • AnyKey: Introduces a value log in flash, separating values from keys in data segment groups. During compaction, only keys and pointers (if values are in the value log) are moved, significantly reducing read/write amplification for large values. AnyKey+ further refines this by modifying log-triggered compaction to prevent "compaction chains," where one compaction triggers another, which is a common issue for LSM-trees under specific update patterns and value log interactions.
  5. Storage Utilization:

    • PinK: Under low-v/k workloads, if meta segments are too large for DRAM, PinK stores them in flash. This effectively means two copies of key information reside in flash (one in meta segments, one in data segments with values), leading to lower storage utilization.

    • AnyKey: By eliminating meta segments in flash and always keeping metadata in DRAM, AnyKey only stores one copy of each unique key's information in flash, thus achieving higher storage utilization.

      In essence, AnyKey represents a fundamental architectural shift for KV-SSDs to handle low-v/k workloads efficiently, while AnyKey+ extends this efficiency to cover all workload types by optimizing LSM-tree dynamics.

4. Methodology

4.1. Principles

The core principle behind AnyKey is to fundamentally change how metadata is managed within a KV-SSD to ensure it remains compact enough to always reside in the device-internal DRAM, regardless of the key size. This directly addresses the performance degradation observed in existing KV-SSD designs when faced with low value-to-key ratio (low-v/k) workloads.

The intuition is that existing LSM-tree-based KV-SSDs (like PinK) store metadata that explicitly references every single KV pair in flash. When keys are small, this metadata is manageable. However, as key sizes increase (as in low-v/k workloads), the sheer volume of metadata (specifically, the meta segments in PinK) can exceed DRAM capacity, forcing it into slower flash memory. This results in extra flash accesses for every KV lookup, causing tail latencies to skyrocket and IOPS to plummet.

AnyKey's solution is built on two main ideas:

  1. Grouping KV pairs: Instead of individual KV pairs, AnyKey manages sets of sorted KV pairs in logical units called data segment groups. Metadata is then maintained only for these groups, rather than for each KV pair. This dramatically reduces the metadata footprint.

  2. Separating values from keys (for efficiency): While grouping helps metadata size, compaction operations in LSM-trees still involve moving KV pairs. If values are large and physically co-located with keys within data segment groups, compaction would incur significant read/write amplification. To mitigate this, AnyKey introduces a value log to store values separately, allowing compaction to primarily handle smaller key/pointer pairs, thus improving LSM-tree management efficiency and device lifetime.

    By implementing these principles, AnyKey aims to ensure that metadata lookups are always DRAM-speed and LSM-tree background operations (like compaction) are optimized, leading to high IOPS and low tail latencies across a wide range of KV workloads.

4.2. Core Methodology In-depth (Layer by Layer)

AnyKey introduces novel data structures and modifies LSM-tree operations to achieve its goals. The internal structure shown in Figure 6 describes the new components.

The following figure (Figure 6 from the original paper) describes the internal structure of AnyKey.

Figure 6. The internal structure of AnyKey.

4.2.1. Minimizing the Size of the Metadata

The primary innovation is to manage KV pairs in sorted groups to drastically reduce the metadata size.

4.2.1.1. Data Segment Group (in the flash)

  • Concept: AnyKey combines multiple flash pages to form a data segment group. Within each data segment group, KV pairs are stored in sorted order.
  • Implementation: In the current implementation, AnyKey combines neighboring physical pages within a flash block to form a data segment group. This simplifies access, as the PPAs of pages within a block are sequential. A group can contain 32 flash pages of 8KB each, accommodating 100-5,000 KV pairs under the tested workloads.
  • KV Entity: Each KV pair within a data segment group is stored as a KV entity, which comprises:
    1. (i) The key: The actual key string.
    2. (ii) The hash value of the key: A small-sized (e.g., 32-bit) hash value of the key. This is crucial for sorting and locating KV pairs efficiently within a group, especially when keys are large, without needing to store the full key in internal lookup structures.
    3. (iii) The value: The actual value, or a PPA pointer to where the value is stored in a separate value log (explained later).
  • Hash-based Sorting: KV pairs within a data segment group are sorted based on the 32-bit hash values of their keys. This allows efficient binary search to locate a target KV pair once its data segment group is identified.
  • Hash Collision Handling: A potential issue is hash collisions, where different keys have the same hash value. Since KV entities with identical hash values are stored consecutively, collisions might span across page boundaries.
    • To handle this, AnyKey reserves 2 bits per page, called hash collision bits.

    • If a page contains KV entities with a hash value that continues into the next page, the bits are set to 01.

    • If a hash value from the previous page continues into the current page, the bits are set to 10.

    • This mechanism helps determine if an adjacent page needs to be read to find a KV entity with a colliding hash. The paper notes that such collisions are rare (0.075% of groups), so their performance impact is negligible.

      The following figure (Figure 7 from the original paper) illustrates how hash collision bits are used when two different keys have the same hash value.


Figure 7. When two different keys have the same hash value.
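
A hedged sketch of one way the hash collision bits might be consulted during a lookup (this is an interpretation of the description above, not the paper's code; the bit encoding and the helper callbacks are assumptions):

```python
CONTINUES_INTO_NEXT = 0b01   # a run of equal hash values spills from this page into the next
CONTINUED_FROM_PREV = 0b10   # a run of equal hash values started on the previous page

def pages_to_read(page_idx, collision_bits, target_hash,
                  first_hash_of_page, last_hash_of_page):
    """Return the flash pages that may hold KV entities with target_hash."""
    pages = [page_idx]
    bits = collision_bits[page_idx]
    # Only when the target hash sits at a page boundary can an adjacent page hold more matches.
    if bits & CONTINUES_INTO_NEXT and target_hash == last_hash_of_page(page_idx):
        pages.append(page_idx + 1)
    if bits & CONTINUED_FROM_PREV and target_hash == first_hash_of_page(page_idx):
        pages.insert(0, page_idx - 1)
    return pages
```

Since only 0.075% of groups experience such collisions, the extra page read is rare in practice.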

4.2.1.2. Level Lists (in the DRAM)

  • Concept: Similar to existing LSM-tree designs, AnyKey uses level lists to index the data segment groups. However, each level list entry is redesigned to hold group metadata.
  • Structure of a Level List Entry: For each data segment group, a level list entry (residing in DRAM) contains:
    1. (i) The smallest key: The smallest key of the entire data segment group.
    2. (ii) The PPA of the first page: The Physical Page Address of the very first page in the data segment group. Since pages within a group are contiguous (from neighboring pages in a block), this single PPA allows access to all pages in the group.
    3. (iii) A list of the hash values of the first keys in the pages: A list of truncated hash values (e.g., first 16 bits Hash[0:15]) corresponding to the first key on each page within the data segment group.
  • Lookup Process:
    • First, the level lists (sorted by smallest key) are searched to identify the data segment group that might contain the target key.
    • Once a data segment group is identified, the PPA of its first page is known.
    • To find the exact flash page within that group, the hash value of the target key is compared against the list of hash values for the first keys in each page (item (iii) above). This quickly narrows down the search to a specific page.
  • Metadata Size Reduction: By having metadata at the group level rather than per-KV pair and using truncated hash values for intra-group indexing, the collective size of level lists is significantly reduced, ensuring they fit in DRAM.
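
A minimal sketch of a level list entry and the intra-group page targeting it enables (the field names, the use of the upper 16 hash bits, and contiguous page addressing are illustrative assumptions based on the description above):

```python
import bisect
from dataclasses import dataclass

@dataclass
class LevelListEntry:
    smallest_key: str             # (i) smallest key of the data segment group
    first_page_ppa: int           # (ii) PPA of the group's first page (pages are contiguous)
    page_first_hashes: list[int]  # (iii) truncated 16-bit hashes of each page's first key

    def target_page_ppa(self, key_hash32: int) -> int:
        """Pick the page whose first-key hash prefix does not exceed the target's prefix."""
        prefix = key_hash32 >> 16                                  # keep 16 bits of the 32-bit hash
        idx = max(bisect.bisect_right(self.page_first_hashes, prefix) - 1, 0)
        return self.first_page_ppa + idx                           # contiguous pages within the block
```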

4.2.2. Further Reducing Flash Accesses

Even with level lists identifying data segment groups, there's still a possibility that a key falls into a key range indicated by a level list entry, but the key does not actually exist in the corresponding data segment group (due to overlapping key ranges across LSM-tree levels). This would lead to unnecessary flash accesses. To prevent this:

4.2.2.1. Hash Lists (in the DRAM)

  • Concept: AnyKey introduces a new metadata type called hash list for each level list entry.
  • Structure: Each hash list stores the hash values of all the keys that actually exist in its corresponding data segment group.
  • Purpose: Before accessing the flash to read a data segment group, AnyKey first checks if the hash value of the target key is present in the hash list for the candidate level list entry. If the hash is not found, it definitively indicates that the KV pair is not in that group, and the search can immediately proceed to the next LSM-tree level without a wasteful flash read.
  • Placement: Hash lists are stored in the remaining DRAM space. If DRAM is full, hash lists are not maintained for lower LSM-tree levels, in which case unnecessary flash accesses might still occur for those levels (similar to other LSM-tree-based designs).
  • Implementation: Currently, an integer array with binary search is used for hash lists. While Bloom filters or Cuckoo filters could save DRAM space, PinK also noted their high computational burden on SSD resources, a reason AnyKey avoids them for now.

4.2.3. Improving the Efficiency of the LSM-tree Management

While the above designs improve read latency, LSM-tree compaction can become an overhead if values are large and part of the data segment groups. If values are always co-located with keys and frequently moved during compaction, it would waste device bandwidth and lifetime.

4.2.3.1. Value Log (in the flash)

  • Concept: Inspired by designs like WiscKey [49], AnyKey detaches values from keys for LSM-tree management. A separate flash area is designated as a value log.
  • Storage: The flash space in AnyKey is divided into two areas: one for data segment groups (storing keys and pointers/small values) and one for the value log (storing large values).
  • KV Entity Update: If a value is stored in the value log, the KV entity in the data segment group contains a pointer (PPA) to its location in the value log, rather than the value itself.
  • Write Flow: Initially, all values from new writes are written into the value log.
  • Compaction Benefit: This separation means that during compaction, only the smaller key/pointer pairs in data segment groups are read and rewritten. The large values in the value log are not moved unless a log-triggered compaction explicitly decides to merge them back into data segment groups (explained below). This significantly reduces write amplification.
  • Capacity Management: AnyKey reserves half of the remaining SSD capacity for the value log. If SSD capacity is full, over-provisioned (OP) capacity can be used for the value log.

4.2.4. Detailed Operations

4.2.4.1. Read

Given a key request:

  1. Level List Search: AnyKey searches the level lists from L1 to Ln sequentially to find an entry whose key range contains the target key.
  2. Hash List Check: If a candidate level list entry is found, AnyKey checks if the hash value of the target key exists in the hash list corresponding to that entry (if a hash list is maintained for that level).
    • If the hash value is not found in the hash list, the KV pair does not exist in that data segment group, and the search continues to the next LSM-tree level.
    • If the hash value is found, the target data segment group has been identified.
  3. Page Identification: Using the PPA of the first page of the data segment group and the list of hash values for the first keys of each page (from the level list entry), AnyKey determines the specific flash page where the target KV entity is likely stored.
  4. KV Entity Read: This specific flash page is read to retrieve the KV entity.
  5. Value Retrieval:
    • If the KV entity itself contains the value, the read operation is complete.

    • If the KV entity contains a pointer (PPA) to the value log, AnyKey performs a second flash read to retrieve the actual value from the value log.

      Figure 6 (repeated below for convenience) illustrates this read process: A request for key "PP" (hash "4791") is shown. If L1 doesn't have it, it moves to the next level. If a candidate entry in a lower level is found and its hash is in the hash list, the data segment group is identified. The page at PPA 300 is read. If the KV entity holds a pointer, the value at PPA 901 in the value log is read.

      Figure 6. The internal structure of AnyKey.
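
Putting the read steps together, the following hedged sketch (reusing the LevelListEntry sketch from Section 4.2.1.2) traces the whole path; hash lists are modeled as sorted integer arrays, and flash reads of pages and the value log are caller-supplied functions returning objects with value and value_is_pointer attributes, all of which are assumptions for illustration:

```python
import bisect

def anykey_get(levels, hash_lists, key, key_hash32, read_page, read_value_log):
    """levels: per-level lists of LevelListEntry sorted by smallest_key.
    hash_lists: {level index: {entry index: sorted list of 32-bit hashes}}."""
    for lvl, entries in enumerate(levels):                    # 1. search L1 .. Ln
        keys = [e.smallest_key for e in entries]
        i = bisect.bisect_right(keys, key) - 1                #    candidate group by key range
        if i < 0:
            continue
        entry = entries[i]
        hashes = hash_lists.get(lvl, {}).get(i)               # 2. DRAM-resident hash list, if kept
        if hashes is not None:
            j = bisect.bisect_left(hashes, key_hash32)
            if j == len(hashes) or hashes[j] != key_hash32:
                continue                                      #    key definitely not in this group
        ppa = entry.target_page_ppa(key_hash32)               # 3. pick the page inside the group
        kv_entity = read_page(ppa, key)                       # 4. one flash read
        if kv_entity is None:
            continue                                          #    hash prefix matched but key differs
        if kv_entity.value_is_pointer:                        # 5. value lives in the value log
            return read_value_log(kv_entity.value)            #    second flash read
        return kv_entity.value
    return None
```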

4.2.4.2. Write

  1. Buffer in DRAM: A write request (either an update to an existing key or a new key addition) is initially buffered in the device-internal DRAM.
  2. L0-L1 Compaction: When this DRAM buffer (effectively L0) becomes full, its KV pairs are merged into L1. During this L0-L1 compaction:
    • New data segment groups and level list entries are constructed for L1.
    • All values associated with these KV pairs are written into the value log. (Only pointers or small values are written into the KV entities in data segment groups).
    • This ensures that new values entering the LSM-tree are placed into the value log first.

4.2.4.3. Compaction

AnyKey employs two types of compaction operations, both of which involve reading KV pairs from source levels, sorting them, and constructing new data segment groups and level list entries for the destination level. A key distinction is that KV entities are sorted based on hash values for insertion into data segment groups.

The following figure (Figure 8 from the original paper) illustrates the two types of compaction operations.


Figure 8. The two types of compaction operations.

  1. Tree-triggered Compaction:

    • Trigger: Invoked when the size of LSM-tree level Ln-1 (i.e., the collective size of its data segment groups and any values referenced in the value log) reaches a predefined threshold.
    • Purpose: This is the conventional compaction operation required by LSM-trees to maintain their structure and reclaim space from invalidated KV pairs due to out-of-place updates.
    • Process: Reads KV pairs from Ln-1 and Ln, merges them, discards invalid KV pairs, and writes new data segment groups for Ln. Values are read from wherever they are currently stored (either data segment group or value log) and written to the data segment groups of the new Ln (potentially moving from value log to data segment group at this stage if desired, or remaining in value log).
  2. Log-triggered Compaction:

    • Trigger: Invoked when the value log becomes full.
    • Purpose: To secure available space in the value log for ongoing writes.
    • Process: AnyKey selects a target source level Ln-1 (initially, the level with the largest volume of its values in the value log). A standard Ln-1-to-Ln compaction is performed. Crucially, during this compaction, all values of KV pairs from both Ln-1 and Ln that are currently in the value log are relocated and merged directly into the data segment groups of the newly created Ln. This frees up space in the value log.
  • Hash List Update: Whenever compaction operations modify level lists, AnyKey updates the hash lists for the top levels in DRAM accordingly. This overhead is minimal given the I/O-intensive nature of compaction.

4.2.4.4. GC (Garbage Collection)

  • Conventional GC: Typically relocates all valid data (page by page) from a victim block before erasing it.
  • AnyKey's GC:
    1. Relocates data segment groups (unit of multiple pages) from a victim block.
    2. Updates the new PPA in the corresponding level list entry.
    • Efficiency: The paper observes that most victim blocks in AnyKey do not contain valid data and can be erased immediately. This is because data segment groups belonging to a specific LSM-tree level tend to be stored in the same flash block and are relocated together during compaction. This significantly reduces GC overhead.
    • Value Log GC: GC is not triggered in the value log itself, as space reclamation there is handled by log-triggered compaction.

4.2.4.5. Range Query

  • Challenge: Since KV pairs within a data segment group are sorted by their hash values, a range query (retrieving keys between key_min and key_max) would require re-sorting by actual key to deliver results in lexicographical order.
  • Solution: AnyKey maintains extra information for each KV pair in a data segment group: {target page, page offset}. This information is sorted by actual key and stored in the first page of the data segment group.
  • Benefit: For a range query, this pre-sorted location information allows KV pairs to be located quickly without any on-the-fly sorting. For longer range queries, AnyKey performs better than PinK because consecutive KV pairs are often co-located in flash pages within a data segment group, reducing the number of flash pages that need to be read. For small range queries, AnyKey's performance is comparable to PinK.

4.2.5. Analysis on the Size of Metadata

The paper conducts a quantitative analysis comparing the metadata sizes of PinK and AnyKey under varying low value-to-key ratios. The analysis assumes a 64GB SSD with 64MB of DRAM that is completely filled with KV pairs.

The following are the results from Table 1 of the original paper:

| v/k ratio (value/key) | PinK: Level lists | PinK: Meta segments | PinK: Sum | AnyKey: Level lists | AnyKey: Hash lists | AnyKey: Sum |
|---|---|---|---|---|---|---|
| 4.0 (160B/40B) | 56MB | 316MB | 372MB | 29MB | 35MB | 64MB |
| 2.0 (120B/60B) | 117MB | 414MB | 531MB | 34MB | 30MB | 64MB |
| 1.0 (80B/80B) | 200MB | 503MB | 703MB | 38MB | 26MB | 64MB |

Table 1. The analysis on the metadata size in PinK and AnyKey under varying (low) value-to-key ratios, assuming that our 64GB SSD is full of KV pairs (Section 5.1).

Observations from Table 1:

  • PinK's Metadata Growth: As the value-to-key ratio decreases (meaning key sizes are relatively larger, e.g., from 4.0 to 1.0), the sum of level lists and meta segments in PinK grows dramatically from 372MB to 703MB. This clearly exceeds the assumed 64MB DRAM capacity, forcing a large portion of meta segments into flash.
  • AnyKey's Stable Metadata: In contrast, AnyKey's total metadata size (sum of level lists and hash lists) consistently remains at 64MB. This is precisely the DRAM capacity, demonstrating that AnyKey effectively limits its metadata to fit within DRAM across varying low-v/k scenarios.
  • Mechanism: This stability in AnyKey is due to:
    1. The significant reduction in level list size by managing metadata per data segment group rather than per-KV pair.
    2. Hash lists being dynamically generated and stored using only the remaining DRAM space, ensuring the total DRAM footprint for metadata does not exceed the available DRAM capacity.
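
As a rough sanity check of the level-list numbers (a simplification that ignores the value-log split and assumes 4-byte PPAs, which the paper does not state): with groups of 32 pages × 8KB = 256KB, a 64GB SSD holds about 262,000 data segment groups; for the 1.0 (80B/80B) case, one entry of roughly 80B (smallest key) + 4B (PPA) + 32 × 2B (hash prefixes) ≈ 148B per group gives about 262,000 × 148B ≈ 39MB, in line with the 38MB reported for AnyKey's level lists in Table 1.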

4.2.6. Design Overhead

4.2.6.1. Storage Capacity Overhead

AnyKey introduces minor storage overheads:

  • Hash Values: Each KV entity in a data segment group includes a 32-bit hash value of its key. For 10 million KV pairs in a 64GB SSD, this amounts to 38MB, which is only 0.05% of the total SSD capacity.
  • Range Query Information: For efficient range query processing, each data segment group stores {target page, page offset} for all KV pairs within it, sorted by key. Assuming 1,000 KV pairs in a 32-page (8KB/page) group, this information is about 2KB, less than 1% of the group size.
  • Overall: These collective overheads are extremely small, especially compared to PinK which might store two copies of each key in flash (one in meta segments and one in data segments) if metadata overflows DRAM. The paper asserts that AnyKey achieves higher storage utilization than PinK (Section 5.4).

4.2.6.2. Computation Overhead

AnyKey introduces some computational tasks on the SSD controller:

  • Hash Value Generation: AnyKey needs to generate a 32-bit hash value for a key when a KV pair is first written and when a read request is issued. Using xxHash [14] on a 1.2 GHz ARM Cortex-A53 core (a common SSD controller CPU), generating a 32-bit hash for a 40-byte key takes 79ns. This is considered a marginal overhead compared to typical flash read/write latencies.
  • Sorting KV Entities: During compaction, when a data segment group is constructed, AnyKey needs to sort a set of KV entities based on their hash values. Since the KV entities from the merging groups are already sorted, this is a fast merge-sort process. Merging two groups, each with 8,192 KV entities, takes approximately 118µs on the ARM Cortex-A53 core, which is negligible relative to the overall time-consuming compaction operation.
  • Inclusion in Evaluation: The paper explicitly states that all experimental data presented in Section 5 already includes these computation overheads.
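
For reference, the 32-bit xxHash computation described above can be reproduced with the xxhash Python bindings (the firmware presumably uses the C implementation; the key below is made up):

```python
import xxhash  # pip install xxhash

key = b"user:00000000000000000000000000042"   # an illustrative key
h32 = xxhash.xxh32(key).intdigest()           # 32-bit hash used to sort and locate KV entities
prefix16 = h32 >> 16                          # 16-bit truncation kept per page in a level list entry
print(hex(h32), hex(prefix16))
```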

4.2.7. An Enhancement to Compaction Process (AnyKey+)

While the base AnyKey design excels under low-v/k workloads, the authors observed decreased IOPS under high-v/k workloads and under some low-v/k workloads with relatively large values.

4.2.7.1. Motivational Observations

  • Compaction Chains: The problem stems from log-triggered compaction. When the value log fills up, log-triggered compaction is invoked to move values from the value log into the data segment groups of an LSM-tree level Ln. If a significant volume of values is merged into Ln, Ln itself can exceed its size threshold, immediately triggering a tree-triggered compaction to merge Ln into Ln+1. This chain of consecutive compaction invocations (a compaction chain) adds substantial overhead.

    The following figure (Figure 9a from the original paper) illustrates a problematic situation where a compaction chain occurs.


Figure 9. An illustration of the problematic situations during the compaction process, and an enhancement to avoid them.

  • Figure 9a describes this: when the value log fills, a log-triggered Ln-1 → Ln compaction (1) merges many values into Ln. If Ln then exceeds its threshold, it triggers a tree-triggered Ln → Ln+1 compaction (2).

4.2.7.2. Modified Log-Triggered Compaction (AnyKey+)

To mitigate compaction chains, AnyKey+ implements a modified log-triggered compaction:

The following figure (Figure 9b from the original paper) illustrates the modified log-triggered compaction scheme.

Figure 9(b). The modified log-triggered compaction scheme.

  1. Threshold Monitoring: During a log-triggered Ln-1-to-Ln compaction, AnyKey+ continuously monitors whether the size of the target destination level Ln is approaching its threshold.

  2. Early Termination of Value Merging: If Ln reaches a certain point (e.g., threshold × α, where 0 ≤ α ≤ 1), AnyKey+ stops merging values from the value log into the data segment groups of Ln. Any remaining values that were supposed to be merged are instead written back into the value log. This prevents Ln from exceeding its threshold and avoids triggering a subsequent tree-triggered compaction.

  3. Improved Target Level Selection: To compensate for potentially reduced space reclamation in the value log (due to values being written back), AnyKey+ changes its strategy for choosing the target source level Ln-1. Instead of selecting the level with the largest total size of values in the value log, it selects the level with the most "invalid" values in the value log. This is effective because LSM-trees generally accumulate many invalid (overwritten or deleted) values. By targeting levels with more invalid values, AnyKey+ can still reclaim significant free space in the value log while avoiding compaction chains.

    This enhancement (AnyKey+) ensures that LSM-tree management remains efficient even under workloads that would typically induce compaction chains, thereby providing high IOPS across all workload types.
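
The following sketch captures only the two policy changes (early termination of value merging, and victim selection by invalid value-log bytes); it is a schematic reconstruction under assumed data structures, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class Level:
    size: int                  # bytes currently stored in this level's data segment groups
    threshold: int             # tree-triggered compaction fires when size exceeds this
    log_bytes: int             # bytes this level keeps in the value log
    invalid_log_bytes: int     # of those, bytes already overwritten or deleted

def pick_source_level(levels):
    # AnyKey+ targets the level with the most *invalid* value-log bytes,
    # instead of the level with the largest total value-log footprint.
    return max(levels, key=lambda lvl: lvl.invalid_log_bytes)

def merge_values_into(dst: Level, pending_value_sizes, alpha=0.8):
    """Fold value-log values into the destination level only up to threshold * alpha."""
    written_back = []
    for size in pending_value_sizes:
        if dst.size + size <= dst.threshold * alpha:
            dst.size += size              # value leaves the log and joins Ln's data segment groups
        else:
            written_back.append(size)     # value is rewritten back into the value log instead
    return written_back                   # keeps Ln under its threshold, avoiding a compaction chain
```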

5. Experimental Setup

5.1. Datasets

The paper used 14 Key-Value (KV) workloads, explicitly categorized into high value-to-key ratio (high-v/k) and low value-to-key ratio (low-v/k) based on their characteristics.

The following are the results from Table 2 of the original paper:

Workload Description Key Value
KVSSD[36] The workload used in Samsung. 16 4,096
YCSB[17] The default key and value sizes of YCSB. 20 1,000
W-PinK[31] The workload used in PinK. 32 1,024
Xbox[19] Xbox LIVE Primetime online game. 94 1,200
ETC[6] General-purpose KV store of Facebook. 41 358
UDB[10] Facebook storage layer for social graph. 27 127
Cache[58] Twitter's cache cluster. 42 188
VAR[6] Server-side browser info. of Facebook. 35 115
Crypto2[55] Trezor's KV store for Bitcoin wallet. 37 110
Dedup[21] DB of Microsoft's storage dedup. engine. 20 44
Cache15[59] 15% of the 153 cache clusters at Twitter. 38 38
ZippyDB[10] Object metadata of Facebook store. 48 43
Crypto1[9] BlockStream's store for Bitcoin explorer. 76 50
RTDATA[26] IBM's real-time data analytics workloads. 24 10

Table 2. Our tested KV workloads (sizes are in bytes).

Dataset Characteristics:

  • Source: The workloads are derived from real-life applications and services, including:
    • Industry benchmarks (e.g., YCSB).
    • Workloads from major tech companies (e.g., Samsung, Facebook, Twitter, Microsoft, IBM).
    • Specific application domains (e.g., Xbox LIVE game, Bitcoin wallets, data analytics).
  • Key and Value Sizes: The table clearly lists the key and value sizes in bytes for each workload. This is the crucial factor determining the value-to-key ratio.
    • High-v/k Workloads (typical, previously studied): KVSSD, YCSB, W-PinK, Xbox. These generally have small keys (16-94 bytes) and large values (1000-4096 bytes), leading to high value-to-key ratios.
    • Low-v/k Workloads (unexplored, focus of paper): ETC, UDB, Cache, VAR, Crypto2, Dedup, Cache15, ZippyDB, Crypto1, RTDATA. These have more balanced key and value sizes, or even keys larger than values (e.g., Crypto1, RTDATA), resulting in low value-to-key ratios.
  • Why these datasets were chosen: These datasets represent a wide spectrum of real-world KV store usage, including both the traditionally studied high-v/k patterns and the newly identified low-v/k patterns that challenge existing KV-SSD designs. Their diversity allows for a comprehensive evaluation of AnyKey's performance across different key/value size distributions and access patterns.

5.2. Evaluation Metrics

The paper uses several standard performance metrics to evaluate the KV-SSD designs.

5.2.1. Tail Latency

  • Conceptual Definition: Tail latency measures the performance of the slowest requests, often expressed as a high percentile (e.g., 95th, 99th). It quantifies the time taken for a certain percentage of requests to complete. High tail latency indicates that a significant portion of users or applications experience slow responses, which is critical in interactive or large-scale services. The paper specifically focuses on the 95th percentile read tail latency. A 95th percentile latency of X ms means that 95% of all read requests completed within X ms, while 5% took longer.
  • Mathematical Formula: There is no single universal formula for tail latency. It's typically calculated from a distribution of observed latencies. For the P-th percentile, if L is a sorted list of latencies from N requests, the P-th percentile latency is the value $L_k$ where $k = \lceil \frac{P}{100} \times N \rceil$.
  • Symbol Explanation:
    • L: A sorted list of observed latencies.
    • N: The total number of requests.
    • P: The desired percentile (e.g., 95 for the 95th percentile).
    • k: The index in the sorted list corresponding to the P-th percentile.
    • $\lceil \cdot \rceil$: The ceiling function, which rounds a number up to the nearest integer.
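
A hedged sketch matching the ceiling-index definition above (shown explicitly because numpy's default percentile interpolation differs slightly; the sample latencies are made up):

```python
import math

def percentile_latency(latencies, p):
    """Return the p-th percentile latency using k = ceil(P/100 * N)."""
    ordered = sorted(latencies)
    k = math.ceil(p / 100 * len(ordered))   # 1-indexed position in the sorted list
    return ordered[k - 1]

samples_ms = [1.2, 0.9, 3.4, 2.2, 15.0, 1.1, 0.8, 2.9, 1.5, 40.0]
print(percentile_latency(samples_ms, 95))   # 95th-percentile read latency of this sample
```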

5.2.2. IOPS (Input/Output Operations Per Second)

  • Conceptual Definition: IOPS quantifies the number of read and write operations a storage device or system can perform per second. It is a measure of throughput, indicating how many discrete storage transactions can be processed within a given time frame. Higher IOPS generally signifies better performance for transactional workloads common in KV stores.
  • Mathematical Formula: $ \text{IOPS} = \frac{\text{Total I/O Operations}}{\text{Total Time (seconds)}} $
  • Symbol Explanation:
    • Total I/O Operations: The sum of all read and write operations performed during the measurement period.
    • Total Time (seconds): The duration of the measurement period in seconds.

5.2.3. Device Lifetime (implied by number of page writes)

  • Conceptual Definition: Device lifetime refers to the expected operational duration of an SSD before its flash memory cells degrade to an unusable state. Flash memory has a finite number of Program/Erase (P/E) cycles. Each write to a flash page consumes one P/E cycle for the underlying block. Therefore, the total number of page writes (including write amplification from compaction and GC) is a direct indicator of device wear and, consequently, device lifetime. Fewer page writes imply a longer device lifetime. The paper uses "total number of page writes experienced" as the metric.
  • Mathematical Formula: No explicit formula given, but it is implicitly related to Total Page Writes / P/E Cycles per Block.
  • Symbol Explanation:
    • Total Page Writes: The cumulative count of all pages written to the flash memory throughout the SSD's operation, including those due to user writes, compaction, and GC.
    • P/E Cycles per Block: The maximum number of times a flash block can be erased and programmed before it becomes unreliable.
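
The following back-of-the-envelope Python sketch (not from the paper) shows how a reduction in total page writes translates into a smaller consumed share of the endurance budget. The block geometry and the 3,000-cycle TLC endurance are assumptions; only the 8KB page size and 64GB capacity come from the paper's configuration.

```python
# Assumed geometry and endurance; only the 8KB page and 64GB capacity match the paper.
PAGE_SIZE = 8 * 1024                                   # bytes per flash page
PAGES_PER_BLOCK = 256                                  # assumed
TOTAL_BLOCKS = 64 * 1024**3 // (PAGE_SIZE * PAGES_PER_BLOCK)
PE_CYCLES_PER_BLOCK = 3000                             # assumed TLC endurance

def wear_fraction(total_page_writes):
    """Fraction of the total P/E budget consumed, assuming perfectly even wear."""
    block_erases = total_page_writes / PAGES_PER_BLOCK
    return block_erases / (TOTAL_BLOCKS * PE_CYCLES_PER_BLOCK)

# Halving the page-write count (the paper reports ~50% fewer writes for AnyKey+)
# halves the consumed endurance budget, i.e., roughly doubles the remaining lifetime.
print(wear_fraction(400e6), wear_fraction(200e6))
```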

5.2.4. Storage Utilization

  • Conceptual Definition: Storage utilization measures the efficiency with which the available storage capacity is used to store unique KV pairs. It's the ratio of the actual size of unique user data to the total physical storage capacity. Lower storage utilization indicates that a significant portion of the SSD is occupied by redundant data (e.g., old versions of KV pairs not yet cleaned by GC/compaction) or excessive metadata. Higher storage utilization is desirable as it means more user data can be stored for a given SSD size.
  • Mathematical Formula: $ \text{Storage Utilization} = \frac{\text{Total Size of Unique KV Pairs}}{\text{Total SSD Capacity}} $
  • Symbol Explanation:
    • Total Size of Unique KV Pairs: The sum of the sizes of all currently valid and distinct key-value pairs stored in the system.
    • Total SSD Capacity: The total raw storage capacity of the SSD.
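
For instance (hypothetical numbers), 48GB of unique KV pairs stored on a 64GB SSD gives $\text{Storage Utilization} = \frac{48\,\text{GB}}{64\,\text{GB}} = 75\%$.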

5.3. Baselines

The paper primarily compares AnyKey against PinK and also evaluates AnyKey+, an enhanced version of AnyKey.

  • PinK [30, 31]: This is cited as the state-of-the-art LSM-tree-based KV-SSD design. It serves as the main baseline because it represents the most advanced existing approach that AnyKey aims to improve upon.
  • AnyKey (Base Version): The proposed design focusing on minimizing metadata for low-v/k workloads.
  • AnyKey+ (Enhanced Version): An improved version of AnyKey that includes a modified log-triggered compaction mechanism to prevent compaction chains and enhance LSM-tree management efficiency, particularly for high-v/k workloads.

5.4. Testbed & Configuration

The experiments were conducted in a highly controlled and realistic KV-SSD emulation environment.

  • Testbed Components:
    • Flash Emulator (FEMU [45]): A publicly available and widely used emulator that accurately models flash memory behavior, including latency characteristics of flash reads, writes, and erases.
    • Host System Driver (uNVMe [2]): A host-side driver used for generating KV requests in user space and directly issuing them to the emulated SSD without passing through a conventional I/O software stack. This setup allows for direct KV-SSD interaction.
    • Implementation: The authors ported PinK's publicly available source code [31] into this testbed and modified/re-engineered its data structures to implement both AnyKey and AnyKey+. The source code for their implementation is available on GitHub.
  • Device Configuration:
    • SSD Capacity: 64GB (The paper also mentions evaluating a 4TB SSD with 4GB DRAM in Section 5.9 for scalability, but 64GB is the default for most experiments).
    • Architecture: Consists of eight channels, each with eight flash chips (similar to PinK's setup).
    • DRAM Capacity: Default 64MB (approximately 0.1% of total SSD capacity, a common ratio [25, 39]). A sensitivity study varied this from 32MB to 96MB.
    • Page Size: Default 8KB. A sensitivity study varied this from 4KB to 16KB.
    • Flash Type & Latencies: Assumed a modern TLC flash with specific latencies based on [34]:
      • Read Times: (56.5, 77.5, 106) µs (for typical operations).
      • Write Times: (0.8, 2.2, 5.7) ms.
      • Erase Time: 3 ms.
  • Workload Configuration:
    • Write Ratio: All tested workloads had a 20% write ratio (meaning 20% of operations are writes, 80% are reads).
    • Key Distribution: By default, a Zipfian distribution with $\theta = 0.99$ was used for key access patterns, simulating a skewed access pattern where a few keys are accessed much more frequently than others (a minimal sketch of such a skewed request generator follows this list). A sensitivity study varied the $\theta$ value.
    • I/O Queue Depth: Set to 64, allowing all 64 underlying flash chips to potentially service I/O requests concurrently, maximizing parallelism.
  • Motivation for Emulation: The paper notes the lack of publicly available KV applications that directly interact with KV-SSDs (they typically generate block I/Os). Therefore, KV-SSD research commonly relies on KV generators (like FIO engine with uNVMe) to control key/value sizes, read/write ratios, and access distributions.
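
The following Python sketch (not the paper's FIO/uNVMe-based generator) illustrates how a request stream with the configuration above can be produced: a bounded Zipfian sampler with $\theta = 0.99$ over the key space combined with a 20% write ratio. The function names, key-space size, and seeds are illustrative assumptions.

```python
import random

def zipfian_sampler(num_keys, theta=0.99, seed=42):
    """Sample key ranks 1..num_keys with P(rank i) proportional to 1/i^theta."""
    rng = random.Random(seed)
    weights = [1.0 / (i ** theta) for i in range(1, num_keys + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)

    def sample():
        u = rng.random()
        lo, hi = 0, num_keys - 1
        while lo < hi:                      # binary search for the first cdf[i] >= u
            mid = (lo + hi) // 2
            if cdf[mid] < u:
                lo = mid + 1
            else:
                hi = mid
        return lo + 1                       # ranks are 1-based

    return sample

sample_key = zipfian_sampler(num_keys=100_000, theta=0.99)
op_rng = random.Random(7)
ops = [("write" if op_rng.random() < 0.2 else "read", sample_key()) for _ in range(10)]
print(ops)                                  # 20% writes, 80% reads over a skewed key space
```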

5.5. Execution Procedure

The experiments follow a two-stage process to ensure stable and representative performance data.

  1. Warm-up Stage:
    • During this stage, all KV pairs for the target workload are initially inserted into the emulated SSD.
    • GC and compaction operations are triggered frequently, filling the SSD and bringing the system into a steady, operational state, akin to a long-running production environment.
  2. Execution Stage (Data Collection):
    • After the warm-up stage, performance data (e.g., latencies, IOPS) are collected.
    • This stage continues until the total size of issued KV requests reaches twice the full SSD capacity (e.g., 128GB for a 64GB SSD).
    • The paper notes that this entire process can take 4-13 hours depending on the workload, emphasizing the thoroughness of the evaluation.

6. Results & Analysis

The experimental evaluation compares PinK, AnyKey, and AnyKey+ across various workloads and configurations, highlighting AnyKey's effectiveness, especially for low-v/k workloads, and AnyKey+'s overall superior performance.

6.1. Tail Latencies and IOPS

The following figure (Figure 10 from the original paper) shows the Cumulative Distribution Function (CDF) of the $95^{th}$ percentile read latency for representative workloads under the three systems.

Figure 10. The CDF of the $95^{th}$ percentile read latencies of the representative workloads under the three systems.

  • Analysis of Figure 10 (Tail Latencies):
    • low-v/k Workloads (Figures 10a-e: RTDATA, Crypto1, ZippyDB, Cache15, Cache):
      • PinK exhibits significantly longer tail latencies (e.g., up to 10ms for Crypto1, 100ms for RTDATA). This validates the paper's initial hypothesis that PinK performs poorly under these workloads.
      • AnyKey and AnyKey+ drastically reduce tail latencies for these workloads. For instance, Crypto1's tail latency is reduced from ~10ms in PinK to less than 1ms in AnyKey/AnyKey+. This is because AnyKey's metadata (level lists and hash lists) is designed to fit entirely in DRAM, eliminating extra flash accesses for metadata lookups.
    • high-v/k Workloads (Figures 10f-g: W-PinK, KVSSD):
      • AnyKey and AnyKey+ show tail latencies comparable to PinK. This indicates that AnyKey's design, while optimized for low-v/k, does not negatively impact performance for high-v/k workloads where PinK already performs well. This is important for AnyKey to be a "universal" solution.

        The following figure (Figure 11 from the original paper) provides a deeper analysis of the metadata size and its impact on AnyKey and AnyKey+ for low-v/k workloads.

Figure 11. An analysis on the metadata size and its impact on AnyKey and AnyKey+.

  • Analysis of Figure 11 (Metadata Size and Flash Accesses):
    • Figure 11a (Metadata Size): For low-v/k workloads, PinK's metadata (especially meta segments) grows significantly, exceeding the 64MB DRAM limit and requiring storage in flash. In contrast, AnyKey's level lists are much smaller, and its hash lists utilize the remaining DRAM space, ensuring the total metadata size consistently fits within DRAM.

    • Figure 11b (Flash Accesses for Reads): PinK often requires 4-7 flash accesses per read for low-v/k workloads (due to metadata in flash). AnyKey and AnyKey+ limit flash accesses to at most 2 in most cases (one for the KV entity in the data segment group and potentially one for the value in the value log). This direct correlation explains the tail latency improvements.

      The following figure (Figure 12 from the original paper) shows the IOPS achieved by the three systems.


Figure 12. The IOPS achieved by the three systems.

  • Analysis of Figure 12 (IOPS):
    • low-v/k Workloads: AnyKey achieves a substantial 3.15x improvement in IOPS on average compared to PinK. This significant gain is attributed to:
      1. Reduced flash accesses for read requests (as seen in Figure 11b).
      2. Decreased compaction overhead (discussed in the next section).
    • high-v/k Workloads: AnyKey's IOPS can be mixed, sometimes outperforming PinK, sometimes not. However, AnyKey+ consistently outperforms both PinK and AnyKey for these workloads. This demonstrates the effectiveness of AnyKey+'s modified log-triggered compaction in preventing compaction chains that can degrade performance under workloads with larger values. Specifically, AnyKey+ achieves 15% higher IOPS than PinK for high-v/k workloads.

6.2. Analysis of Compaction and GC

The following are the results from Table 3 of the original paper:

| Workload | System | Compaction reads | Compaction writes | GC reads | GC writes |
|---|---|---|---|---|---|
| Crypto1 (low-v/k) | PinK | 85M | 88M | 340M | 0 |
| Crypto1 (low-v/k) | AnyKey | 49M | 26M | 0 | 0 |
| Crypto1 (low-v/k) | AnyKey+ | 49M | 26M | 0 | 0 |
| Cache (low-v/k) | PinK | 41M | 45M | 339M | 0 |
| Cache (low-v/k) | AnyKey | 64M | 20M | 0 | 0 |
| Cache (low-v/k) | AnyKey+ | 48M | 15M | 0 | 0 |
| W-PinK (high-v/k) | PinK | 3M | 9M | 27M | 0 |
| W-PinK (high-v/k) | AnyKey | 25M | 22M | 0 | 0 |
| W-PinK (high-v/k) | AnyKey+ | 20M | 6M | 26K | 26K |
| KVSSD (high-v/k) | PinK | 0.2M | 10M | 15M | 0 |
| KVSSD (high-v/k) | AnyKey | 48M | 50M | 0 | 0 |
| KVSSD (high-v/k) | AnyKey+ | 6M | 9M | 0 | 0 |

(All entries are numbers of flash page reads or writes.)

Table 3. The analysis on compaction and GC operations.

  • Analysis of Table 3 (Compaction and GC Operations):
    • GC Operations (Reads/Writes): AnyKey and AnyKey+ consistently show zero or very few GC reads/writes across all workloads. This is a significant advantage. The reason is that data segment groups belonging to the same LSM-tree level tend to be co-located in flash blocks and are relocated together during compaction. Consequently, victim blocks often become entirely invalid, allowing GC to erase them immediately without the costly process of reading and rewriting valid data. PinK, in contrast, often incurs millions to hundreds of millions of GC reads.
    • Compaction for low-v/k Workloads (e.g., Crypto1, Cache):
      • PinK generates a high number of compaction reads and writes (e.g., 85M reads, 88M writes for Crypto1). This is primarily because its large meta segments (containing keys) are stored in flash and must be read and rewritten during compaction.
      • AnyKey (and AnyKey+) significantly reduces compaction writes (e.g., 26M for Crypto1) by eliminating flash-resident meta segments and separating values into the value log. While AnyKey might sometimes have higher compaction reads than PinK (e.g., for Cache), its total write amplification (especially combined with GC) is much lower.
    • Compaction for high-v/k Workloads (e.g., W-PinK, KVSSD):
      • AnyKey can experience increased compaction overheads (e.g., 25M reads, 22M writes for W-PinK) compared to PinK (3M reads, 9M writes). This is where compaction chains (log-triggered leading to tree-triggered) occur when large values are merged from the value log into data segment groups.
      • AnyKey+ effectively addresses this. For W-PinK, AnyKey+ reduces compaction writes to 6M (from 22M in AnyKey) and reads to 20M (from 25M). For KVSSD, AnyKey+ reduces compaction reads from 48M (AnyKey) to 6M and writes from 50M (AnyKey) to 9M. This demonstrates AnyKey+'s success in preventing compaction chains and optimizing LSM-tree management for high-v/k workloads.

6.3. Device Lifetime and Storage Utilization

The following figure (Figure 13 from the original paper) shows the total number of page writes experienced.


Figure 13. The total number of page writes experienced.

  • Analysis of Figure 13 (Total Page Writes / Device Lifetime):
    • This figure directly reflects the cumulative write amplification from both compaction and GC operations.

    • AnyKey+ significantly reduces the total number of page writes by 50% on average compared to PinK. This is due to its efficient GC (no relocation of valid data) and optimized compaction process (avoiding compaction chains and value movement).

    • A lower number of total page writes directly translates to a longer device lifetime for the SSD.

      The following figure (Figure 14 from the original paper) shows the storage utilization of the three systems.


Figure 14. The storage utilizations of the three systems.

  • Analysis of Figure 14 (Storage Utilization):
    • low-v/k Workloads: AnyKey and AnyKey+ achieve significantly higher storage utilization than PinK. This is because PinK, under low-v/k workloads, is forced to store large meta segments (containing keys) in flash due to DRAM overflow. This means PinK effectively stores two copies of key information in flash (one in meta segments, one with values in data segments), leading to lower utilization.
    • AnyKey's Advantage: By eliminating flash-resident meta segments and ensuring metadata is DRAM-resident, AnyKey and AnyKey+ maintain only one copy of each unique key's information in flash, thus maximizing storage utilization.

6.4. Sensitivity Analysis

6.4.1. Varying DRAM Sizes

The following figure (Figure 15 from the original paper) shows the read latencies under varying DRAM sizes.


Figure 15. The read latencies under varying DRAM sizes.

  • Analysis of Figure 15 (Read Latencies under Varying DRAM Sizes):
    • low-v/k Workloads (Crypto1, ETC): When DRAM capacity decreases from 96MB to 32MB, the tail latencies for Crypto1 and ETC (where metadata size is large) increase. This happens because even a reduced amount of metadata can no longer be fully accommodated in the smaller DRAM, potentially forcing some hash lists or level list entries to be accessed from flash.
    • high-v/k Workloads (W-PinK): For W-PinK, where metadata is inherently small, changing DRAM size from 96MB to 32MB has minimal impact on tail latencies. A small (32MB) DRAM is sufficient to hold its metadata.
    • Conclusion: This validates that AnyKey's performance is tied to its ability to keep metadata in DRAM, and if DRAM becomes too constrained for low-v/k workloads, performance can degrade.

6.4.2. Varying Flash Page Sizes

The following figure (Figure 16 from the original paper) shows the read latencies under varying flash page sizes.


Figure 16. The read latencies under varying flash page sizes.

  • Analysis of Figure 16 (Read Latencies under Varying Flash Page Sizes):
    • As the flash page size increases (from 4KB to 16KB), tail latencies generally decrease for AnyKey.
    • Reasoning: Assuming a fixed total SSD capacity, a larger page size means fewer total flash pages. This, in turn, implies fewer data segment groups in AnyKey. Fewer data segment groups result in a reduced number of level list entries (i.e., smaller metadata size). A smaller metadata footprint further strengthens AnyKey's ability to keep all metadata in DRAM, reducing flash accesses and improving latency.

6.5. Different Key Distributions

The following figure (Figure 17 from the original paper) illustrates the latency of ETC under varying key distributions.


Figure 17. The latencies of ETC under varying key distributions.

  • Analysis of Figure 17 (Latencies under Varying Key Distributions):
    • The Zipfian distribution parameter $\theta$ was varied (a lower $\theta$ means keys are more evenly distributed, while a higher $\theta$ means a few keys are accessed very frequently).
    • PinK's Behavior: As the key distribution becomes more even (lower $\theta$), PinK's tail latencies increase. This is because a more even distribution means more requests are directed to KV pairs in the lower LSM-tree levels, whose metadata (meta segments) is more likely to be stored in flash (as seen in Section 3.2).
    • AnyKey's Advantage: AnyKey and AnyKey+ exhibit more uniform performance across different key distributions. This is because they effectively reduce the size of metadata for all LSM-tree levels, ensuring that metadata for keys in lower levels can also be found in DRAM. This consistently low metadata access latency translates to stable tail latencies regardless of how skewed the key distribution is.

6.6. Range Queries

The following figure (Figure 18 from the original paper) shows the latencies of UDB under varying scan lengths.


Figure 18. The latencies of UDB under varying scan lengths.

  • Analysis of Figure 18 (Latencies for Range Queries):
    • The experiment used a scan-centric workload based on UDB, varying the scan length (number of consecutive keys retrieved).
    • AnyKey's Advantage: AnyKey and AnyKey+ show increasing benefits as scan length increases. This is because AnyKey stores sets of consecutive keys together within the flash pages of a data segment group. When a range query requests multiple consecutive keys, many sub-requests can be serviced from the same (or very few) flash pages and data segment groups, leading to efficient burst reads.
    • PinK's Disadvantage: In PinK, KV pairs are logically sorted in meta segments in DRAM, but their corresponding values in data segments can be scattered across different flash pages. This means a range query in PinK might require accessing a substantial number of disparate flash pages, increasing latency.
    • Small Range Queries: For small range queries, AnyKey's performance is comparable to PinK because it can directly locate target KV pairs using the location information stored in the first page of the data segment group, avoiding sorting overhead.

6.7. The Size of The Value Log

The following figure (Figure 19 from the original paper) shows the impact of varying the sizes of the value log.


Figure 19. The Impact of varying the sizes of the value log.

  • Analysis of Figure 19 (Impact of Value Log Size):
    • The value log size was varied from 3.2GB to 9.6GB (5% to 15% of the 64GB SSD).
    • Figure 19a (IOPS):
      • Small Values (ZippyDB): For workloads with relatively small values (like ZippyDB, a low-v/k workload), a small value log (e.g., 3.2GB) is sufficient and already achieves high IOPS. Increasing the value log size beyond this point does not significantly improve IOPS. This is because a small value log for such workloads rarely becomes full, thus rarely triggering costly log-triggered compactions.
      • Larger Values (UDB, ETC): As value sizes increase (UDB and ETC are low-v/k but with relatively larger values than ZippyDB), increasing the value log size leads to slightly higher IOPS. A larger value log reduces the frequency of log-triggered compactions, thereby reducing compaction overheads.
    • Figure 19b (Total Page Writes): Similar patterns are observed for the number of page writes. Workloads with small values don't see much change with increasing value log size, while those with larger values show a reduction in page writes with a larger value log, reflecting fewer log-triggered compactions.
    • Comparison with AnyKey- (without value log): The paper also briefly compares AnyKey+ with a hypothetical AnyKey- system (without a value log).
      • When the write ratio increases, AnyKey- experiences a significant decrease in IOPS, while AnyKey+ maintains high IOPS. This highlights the crucial role of the value log in managing write-intensive workloads by deferring value movement.
      • AnyKey+ might have slightly increased tail latencies compared to AnyKey- (due to a potential second flash access for values in the value log), but it is still comparable to PinK and offers much better IOPS.

6.8. Design Scalability

The paper asserts that AnyKey's design is applicable to any SSD capacity and that larger SSD capacities yield even greater benefits under low-v/k workloads.

  • Example: For a 4TB SSD running Crypto1 (a low-v/k workload), PinK's metadata could swell to 25.2GB, requiring extensive flash accesses. In contrast, AnyKey would limit its metadata to approximately 3.65GB, which would still comfortably fit into a proportionally larger DRAM (e.g., 4GB DRAM for a 4TB SSD). This demonstrates that the fundamental problem of metadata overflow in PinK scales with SSD size for low-v/k workloads, and AnyKey's solution scales effectively to maintain DRAM-resident metadata.

6.9. Multi-Workload Execution Scenarios

To assess applicability in shared environments, the paper considers multi-workload scenarios using partitioning.

  • Setup: A scenario with two co-located workloads (W-PinK and ZippyDB) was evaluated on a 64GB SSD, divided into two equal partitions. Each partition was managed independently by either PinK or AnyKey.
  • Results: Compared to PinK, AnyKey improved the $99.9^{th}$ percentile tail latencies of W-PinK by 14% and ZippyDB by 216%.
  • Conclusion: This indicates that AnyKey's benefits can be realized in multi-tenant environments by partitioning the SSD and running AnyKey on each partition, especially benefiting low-v/k workloads like ZippyDB.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully identifies a critical blind spot in existing Key-Value SSD (KV-SSD) designs: their sub-optimal performance under low value-to-key ratio (low-v/k) workloads, where key sizes are relatively large. This performance bottleneck was directly linked to the excessive growth of metadata in state-of-the-art designs like PinK, forcing metadata into slower flash memory and incurring prohibitive tail latencies and reduced IOPS.

To address this, the authors propose AnyKey, a novel KV-SSD architecture. AnyKey's core contribution is its innovative metadata management, which organizes KV pairs into data segment groups and maintains metadata only at the group level, using hash values for efficient intra-group indexing. This design ensures that metadata remains compact and entirely DRAM-resident, eliminating flash accesses for metadata lookups. Furthermore, AnyKey introduces a value log to separate values from keys, significantly improving the efficiency of LSM-tree compaction by reducing write amplification.

The enhanced version, AnyKey+, further refines the LSM-tree management by modifying log-triggered compaction to prevent "compaction chains," which can degrade performance under high-v/k workloads. Extensive evaluation using diverse real-world workloads demonstrates that AnyKey dramatically improves tail latencies and IOPS for low-v/k scenarios, while AnyKey+ extends this superior performance to all workload types, outperforming the state-of-the-art across the board in IOPS, device lifetime (reduced page writes), and storage utilization.

7.2. Limitations & Future Work

The paper doesn't explicitly dedicate a section to "Limitations and Future Work," but some can be inferred from the design choices and discussions:

  • Computational Overhead for Hash Generation and Sorting: While the paper states that the xxHash generation and merge-sort operations for data segment groups are marginal (79ns and 118µs respectively on a Cortex-A53 core), these still add computational load to the SSD controller. For extremely latency-sensitive applications or very small KV pairs where flash access time is minimal, this overhead, however small, might become more noticeable. Future work could explore hardware accelerators for hashing or sorting within the SSD controller to further reduce this.
  • Range Query Efficiency for Small Ranges: While AnyKey improves range query performance for longer scans, its performance for very small range queries is only "comparable" to PinK. The overhead of parsing the location information stored in the first page of a data segment group might still be present. Further optimizations specific to very short range queries could be explored.
  • Fixed Value Log Sizing: The current implementation reserves half of the remaining SSD capacity for the value log. While a sensitivity study on value log size is performed, a dynamic and adaptive sizing mechanism for the value log based on real-time write patterns and value sizes could potentially optimize storage use and compaction frequency further.
  • Hash Collision Handling: Although hash collisions are rare (0.075%) and their impact is deemed negligible, the current mechanism (2 hash collision bits per page and potentially reading an adjacent page) is a simple heuristic. For future very high-density data segment groups or specific hash functions, more sophisticated, yet lightweight, collision resolution strategies might be beneficial.
  • Real Hardware Validation: The evaluation is based on an emulation environment (FEMU). While FEMU is widely-used and accurate, validation on real KV-SSD hardware would provide definitive proof of performance in a production setting.
  • Complexity of LSM-tree Management: While AnyKey+ addresses compaction chains, LSM-tree management remains complex. Further advancements in LSM-tree compaction algorithms, perhaps leveraging AI/ML or more dynamic resource allocation within the SSD controller, could yield additional benefits.

7.3. Personal Insights & Critique

The AnyKey paper makes a highly relevant and practical contribution to the KV-SSD landscape. My personal insights and critique are as follows:

  • Novelty and Importance: The most significant aspect of AnyKey is its identification and systematic address of the low-v/k workload problem. This is a crucial "aha!" moment, as much of KV-SSD research seemed implicitly biased towards high-v/k scenarios. By showing that metadata scalability is the bottleneck, the paper fills a critical gap, making KV-SSDs truly viable for a broader range of real-world applications (e.g., IoT devices, blockchain, specific database indexing where keys are complex identifiers). The paper's re-evaluation of PinK under low-v/k conditions provides compelling evidence for this overlooked problem.

  • Elegance of Solution: AnyKey's solution is elegant in its simplicity and effectiveness. Grouping KV pairs and abstracting metadata to the group level is a straightforward idea that yields massive benefits in metadata footprint. The separation of keys and values into data segment groups and a value log is a well-established technique (WiscKey [49]), but its integration and refinement within the KV-SSD context, especially with the AnyKey+ enhancements for compaction chains, is highly effective.

  • Applicability and Transferability: The principles of metadata reduction and value-key separation are not unique to KV-SSDs. These ideas could be adapted to other in-storage computing paradigms or even host-based KV stores that struggle with large metadata structures on NVMe SSDs. For instance, learned indexes that store model parameters as metadata could benefit from similar group-level indexing if their metadata size becomes an issue.

  • Potential for Further Optimization:

    • Dynamic Page/Group Sizing: The paper mentions the number of pages in a data segment group can be configured. An adaptive mechanism that dynamically adjusts group size based on workload characteristics (e.g., key/value size distributions, read/write patterns) or SSD resources could further fine-tune performance.
    • Tiered Value Log: For very large SSDs, the value log itself could be tiered (e.g., hot values in a faster, smaller region; cold values in a slower, larger region) to optimize access latency for values, similar to how LSM-trees tier SSTables.
    • Heterogeneous flash memory: With the advent of different flash technologies (e.g., QLC, PLC), AnyKey could potentially allocate metadata and hot data segments to faster/more durable flash types, and colder data/values to denser/cheaper flash.
    • DRAM-less/Hybrid KV-SSDs: The design heavily relies on DRAM. For DRAM-less SSDs or those with limited DRAM, a hybrid metadata approach (e.g., critical metadata in on-chip SRAM, less critical in flash with a very smart caching layer) would be an interesting extension.
  • Critique on DRAM usage for hash lists: While hash lists help avoid flash accesses, they consume DRAM. The paper mentions Bloom filters were considered but rejected due to computational overhead. Re-evaluating this choice with newer SSD controller capabilities (e.g., specialized AI/ML accelerators on-chip) might reveal new trade-offs. Bloom filters could potentially offer a probabilistic way to save DRAM while still significantly reducing unnecessary flash accesses.

  • Unverified Assumptions: The claim that hash collision impact is negligible relies on a measured low frequency (0.075%). While plausible, this could be sensitive to the choice of hash function and key distribution. Further analysis on the robustness of this assumption for various extreme cases might be valuable.

  • Impact on Standardization: KV-SSDs are still an evolving standard. AnyKey's insights could influence future KV-SSD API designs, particularly in how metadata is exposed or optimized for different workload types.

    Overall, AnyKey is a robust and well-thought-out contribution that broadens the applicability of KV-SSDs, making them a more compelling solution for diverse modern data storage challenges. Its rigorous evaluation and detailed analysis of performance bottlenecks are commendable.
