- Title: RapidStore: An Efficient Dynamic Graph Storage System for Concurrent Queries
- Authors: Chiyu Hao, Jixian Su, Shixuan Sun, Hao Zhang, Sen Gao, Jianwen Zhao, Chenyi Zhang, Jieru Zhao, Chen Chen, Minyi Guo.
- Affiliations: Shanghai Jiao Tong University, China, and Huawei, China.
- Journal/Conference: Currently a preprint on arXiv, not yet published at a peer-reviewed venue. arXiv is a widely used repository for pre-publication papers in computer science, enabling rapid dissemination of research.
- Publication Year: 2025 (the arXiv identifier indicates a July 2025 submission).
- Abstract: The paper addresses the challenge of designing efficient dynamic graph storage systems that can handle concurrent read and write operations. The authors identify key issues in existing systems: interference between reads and writes, high overhead from per-edge versioning, and poor search performance under concurrent workloads. They propose RapidStore, an in-memory dynamic graph storage system designed for read-intensive workloads. The core idea is a decoupled architecture that separates the management of read and write queries and detaches version data from graph data. This is achieved through a novel subgraph-level concurrency control mechanism and an optimized graph data store. Experiments show that RapidStore improves query scalability and efficiency, effectively balancing performance across different operations.
- Original Source Link:
  - arXiv Page: http://arxiv.org/abs/2507.00839
  - PDF Link: http://arxiv.org/pdf/2507.00839v1
- Publication Status: Preprint.
2. Executive Summary
- Foundational Concepts:
- Dynamic Graph: A graph where vertices and edges can be added or removed over time. This contrasts with a static graph, which does not change.
- Concurrency Control: In databases and storage systems, this is the mechanism that ensures multiple transactions (queries) can run simultaneously without causing data inconsistencies.
- Multi-Version Concurrency Control (MVCC): A popular concurrency control technique. Instead of locking data, it creates a new version of a data item whenever it's modified. Read queries can then access an older, consistent version of the data without being blocked by write queries.
- Two-Phase Locking (2PL): A locking protocol that ensures serializability. In its "growing phase," a transaction can acquire locks but cannot release any; in its "shrinking phase," it can release locks but cannot acquire new ones. MV2PL is a variant of 2PL used together with MVCC.
- Adaptive Radix Tree (ART): A memory-efficient tree data structure for storing key-value pairs. It adapts its node sizes based on the number of children, avoiding the memory waste of traditional radix trees. It's very fast for lookups, insertions, and deletions.
- Compressed Sparse Row (CSR): A highly efficient, compact format for storing static sparse matrices or graphs. It is excellent for scan operations but very inefficient for adding or removing edges (a minimal example follows this list).
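For concreteness, here is a minimal sketch of the CSR layout on a hypothetical three-vertex toy graph (our own illustration, not from the paper):

```python
# Minimal CSR sketch for a toy directed graph (hypothetical example).
# Graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {} (3 vertices, 3 edges).
offsets = [0, 2, 3, 3]   # offsets[v] .. offsets[v+1] delimit v's neighbors
neighbors = [1, 2, 2]    # all neighbor lists concatenated back-to-back

def scan(v):
    """Return the neighbor list of v -- a contiguous slice, hence cache-friendly."""
    return neighbors[offsets[v]:offsets[v + 1]]

print(scan(0))  # [1, 2]
# Inserting an edge (e.g., 2 -> 0) would require shifting `neighbors`
# and rewriting `offsets` -- which is why CSR is poor for updates.
```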
- Previous Works:
- Sortledton: A state-of-the-art system that uses per-edge versioning with MVCC and MV2PL. It locks vertices for both read and write operations, leading to the contention issues identified in the paper. It uses skip lists to store the neighbors of high-degree vertices, which are slower to search than tree-based structures.
- Teseo: Another per-edge-versioning system. It stores graph data in a Packed Memory Array (PMA) to improve data locality for scans, but still suffers from lock contention and versioning overhead.
- LiveGraph: Stores neighbor lists as timestamp-ordered logs. This is fast for insertions but very slow for searches, as finding an edge may require scanning the entire log.
- Aspen: Uses a copy-on-write approach at the global graph level, which is good for reads but limits the system to a single writer, severely hampering write concurrency.
- Technological Evolution & Differentiation:
The field has moved from static graph processing systems (like Ligra) to dynamic ones that can handle updates. Early dynamic systems struggled with concurrency. More recent systems like Sortledton and Teseo adopted database techniques like MVCC but applied them at a fine-grained (per-edge) level. This created new bottlenecks related to version check overhead and lock contention.
RapidStore represents the next step in this evolution. Its key differentiation is moving from fine-grained (per-edge) versioning to coarse-grained (subgraph) versioning. This seemingly simple change has profound architectural implications:
- Decouples Versions from Data: Version information is no longer attached to every edge. Instead, it's a pointer to an entire subgraph snapshot. This eliminates the need for version checks during graph traversal.
- Decouples Reads from Writes: Read queries become completely lock-free. They simply determine which subgraph versions to use at the start and then operate on an immutable snapshot, completely isolated from ongoing write operations. This directly solves the read-write interference problem.
4. Methodology (Core Technology & Implementation)
RapidStore's methodology is built on two pillars: a subgraph-centric concurrency control mechanism and a multi-version graph store optimized for it.
4.1 Overall Architecture and Execution Flow
The system coordinates operations using a global logical clock.
- Write Query Execution:
- Identify Subgraphs: A write query first determines which subgraphs (S) are affected by its updates.
- Acquire Locks: It acquires exclusive locks on these subgraphs in a sorted order to prevent deadlocks.
- Create New Versions: It uses a copy-on-write strategy to create new, modified versions of the affected subgraphs. The new version number is set to the current clock value + 1.
- Commit: The changes are committed, and the global clock is incremented.
- Garbage Collect: Outdated versions no longer needed by any active readers are removed.
- Release Locks: The locks are released.
- Read Query Execution:
- Register: A read query registers itself and takes the current clock value as its start time. This is a lock-free operation.
- Build Snapshot: It constructs a consistent view (snapshot) of the graph by selecting the appropriate version of each subgraph based on its start time.
- Execute: It performs its operations (e.g., PageRank) on this immutable snapshot, without locks and without interference from writers.
- Unregister: After completion, it unregisters itself.
This decoupled design is the foundation for RapidStore's high concurrency. A minimal sketch of both execution paths follows.
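The following runnable Python sketch is our own illustration of the two flows under simplifying assumptions (dict-based subgraphs, single-process demo); names such as `write_edges`, `read_snapshot`, and `chains` are hypothetical, not the paper's API:

```python
import threading
from collections import defaultdict

P = 64                                   # partition size (the paper's default)
GLOBAL_CLOCK = 0                         # global logical clock
clock_lock = threading.Lock()
subgraph_locks = defaultdict(threading.Lock)
# Version chain per subgraph: list of (version_ts, adjacency dict), oldest first.
chains = defaultdict(lambda: [(0, {})])

def subgraph_of(u):
    return u // P                        # contiguous vertex-ID partitioning

def write_edges(edges):
    affected = sorted({subgraph_of(u) for u, _ in edges})
    for s in affected:                   # sorted acquisition order prevents deadlocks
        subgraph_locks[s].acquire()
    try:
        global GLOBAL_CLOCK
        t_w = GLOBAL_CLOCK + 1           # new version number = current clock + 1
        for s in affected:
            _, latest = chains[s][-1]
            new_ver = {u: list(ns) for u, ns in latest.items()}  # copy-on-write
            for u, v in edges:
                if subgraph_of(u) == s:
                    new_ver.setdefault(u, []).append(v)
            chains[s].append((t_w, new_ver))
        with clock_lock:
            GLOBAL_CLOCK = t_w           # commit: advance the global clock
        # (garbage collection of versions unseen by any reader would run here)
    finally:
        for s in affected:
            subgraph_locks[s].release()

def read_snapshot():
    t_r = GLOBAL_CLOCK                   # register: take the current clock (lock-free)
    # Per subgraph, pick the newest version with timestamp <= t_r.
    return {s: next(ver for ts, ver in reversed(chain) if ts <= t_r)
            for s, chain in chains.items()}

write_edges([(1, 2), (1, 3), (70, 5)])   # touches subgraphs 0 and 1
snapshot = read_snapshot()               # immutable, consistent view for the reader
print(snapshot[0][1], snapshot[1][70])   # [2, 3] [5]
```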
4.2 Subgraph-Centric Concurrency Control
- Principles: The core idea is to manage versions at a coarser granularity (subgraphs) instead of a fine one (edges). This amortizes the overhead of version management and enables efficient, lock-free reads.
- Steps & Procedures:
- Graph Partitioning: The graph's vertices are divided into equal-sized partitions. Each partition of vertices and their outgoing edges forms a subgraph. The paper uses a simple partitioning scheme based on contiguous vertex IDs, with a default partition size of 64.
- Version Management: Each subgraph has a version chain (a linked list of its historical versions), stored separately from the graph data itself. When a subgraph is modified, a new version is created and added to the head of its chain.
- Concurrency for Writers: Writers use MV2PL on subgraphs. By locking an entire subgraph, they gain exclusive access for modifications. Timestamps (t_w for writes, t_r for reads) coordinate commits to maintain serializability.
- Concurrency for Readers: Reads are lock-free. A reader tracer (an array) tracks active read queries and their start timestamps. When a new read query starts, it finds a free slot in the tracer, records the current global read timestamp t_r, and proceeds; this timestamp guarantees it sees a consistent snapshot. To build the snapshot, the reader walks each subgraph's version chain and picks the latest version whose timestamp is less than or equal to its start time.
- Key Details - Garbage Collection: After committing, a writer scans the reader tracer to find the oldest start timestamp among all active readers. Any subgraph version older than this timestamp (other than the latest version) can be safely reclaimed. A sketch of the tracer and this reclamation rule follows.
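A minimal sketch of the reader tracer and the reclamation rule, under our own simplifying assumptions (a lock guards slot allocation here, whereas the paper's tracer is lock-free; `gc_watermark` and `garbage_collect` are hypothetical names):

```python
import threading

NUM_SLOTS = 64
tracer = [None] * NUM_SLOTS              # reader tracer: one slot per active reader
tracer_lock = threading.Lock()           # stand-in for the paper's lock-free slot claim

def register_reader(global_read_ts):
    """Find a free slot and record the reader's start timestamp t_r."""
    with tracer_lock:
        for i, slot in enumerate(tracer):
            if slot is None:
                tracer[i] = global_read_ts
                return i
    raise RuntimeError("no free reader slot")

def unregister_reader(slot):
    tracer[slot] = None

def gc_watermark(current_ts):
    """Oldest start timestamp among active readers (or the clock if none)."""
    active = [ts for ts in tracer if ts is not None]
    return min(active) if active else current_ts

def garbage_collect(chain, watermark):
    """Drop versions no active reader can see: keep the newest version
    with ts <= watermark plus everything newer than it."""
    keep_from = 0
    for i, (ts, _) in enumerate(chain):
        if ts <= watermark:
            keep_from = i                # newest version still visible at the watermark
    del chain[:keep_from]

# Example: versions with timestamps 1, 3, 5; the oldest active reader started at 4.
slot = register_reader(global_read_ts=4)
chain = [(1, "v1"), (3, "v3"), (5, "v5")]
garbage_collect(chain, gc_watermark(current_ts=5))
print(chain)                             # [(3, 'v3'), (5, 'v5')] -- v1 is reclaimed
unregister_reader(slot)
```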
4.3 Multi-Version Graph Store
This is the underlying storage engine that implements the copy-on-write strategy and provides fast graph operations.
- Principles: The store is designed to support fast snapshot creation via copy-on-write and to provide balanced performance for search, scan, and insert operations. It handles degree skew (some vertices having vastly more neighbors than others) by using different data structures.
- Compressed Adaptive Radix Tree (C-ART):
- Purpose: Used to store the neighbor sets of high-degree vertices.
- Problem with standard ART: Standard ART (Image 1) stores one key per leaf node. For graph scans, traversing from leaf to leaf is inefficient. Storing neighbors in larger, contiguous blocks would improve scan speed, but because neighbor sets are sparse, such blocks would be mostly empty, wasting memory.
- C-ART Innovation: C-ART introduces horizontal compression. Instead of one vertex per leaf, a C-ART leaf can store up to B (e.g., 256) vertices. It achieves this by allowing multiple keys in an inner node to point to the same leaf node, effectively merging what would otherwise be sparse leaves in a standard ART. This dramatically increases the filling ratio of leaves, improving memory locality and scan performance. Image 6 shows an example of C-ART.
- Operations:
- Search(u, v): Traverses the tree using the byte sequence of neighbor v's ID, then performs a binary search within the target leaf.
- Scan(u): Performs a depth-first traversal of the tree, reading vertices sequentially from the compressed leaves.
- Insert(u, v): Follows the search path. If the target leaf has space, v is added; if the leaf is full, it is split into two, or a new internal node is created to deepen the tree, as shown in Image 7.
- Optimizations: C-ART further saves space by storing only the unique suffixes of vertex IDs in leaves (the common prefix is implied by the path). It also uses bitmaps and AVX2 instructions to quickly skip empty child pointers in internal nodes during traversal. A leaf-level sketch of these operations follows the figure below.
[Figure: Schematic of an ART tree, showing nodes of varying depths and prefixes together with their keys and child pointers, as used for dynamic indexing in the graph store.]
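The sketch below illustrates the three operations at the leaf level only; it is a simplified stand-in (inner radix nodes, suffix-only storage, and the bitmap/AVX2 tricks are elided, and `Leaf` is a hypothetical name):

```python
import bisect

B = 256  # maximum number of vertex IDs per compressed leaf

class Leaf:
    """A C-ART-style compressed leaf: a sorted block of up to B neighbor IDs.
    (In the real structure, inner radix nodes route byte prefixes to leaves,
    and leaves store only ID suffixes; both are elided here.)"""
    def __init__(self, ids=None):
        self.ids = ids or []

    def search(self, v):
        i = bisect.bisect_left(self.ids, v)          # binary search within the leaf
        return i < len(self.ids) and self.ids[i] == v

    def insert(self, v):
        """Insert v; if the leaf overflows, split it into two halves."""
        bisect.insort(self.ids, v)
        if len(self.ids) > B:
            mid = len(self.ids) // 2
            right = Leaf(self.ids[mid:])             # new leaf; the parent inner node
            self.ids = self.ids[:mid]                # would now route keys to both
            return right
        return None

    def scan(self):
        return iter(self.ids)                        # sequential, cache-friendly reads

leaf = Leaf()
for v in (42, 7, 19):
    leaf.insert(v)
print(leaf.search(19), list(leaf.scan()))            # True [7, 19, 42]
```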
- Clustered Index:
- Purpose: Used to store the neighbor sets of low-degree vertices.
- How it works: It is implemented as a B+ tree. Instead of keeping a separate data structure for each low-degree vertex, their neighbor lists are stored sequentially together within the clustered index. This co-location improves data locality when accessing multiple low-degree vertices within the same subgraph, a common access pattern (see the sketch below).
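A minimal sketch of the co-location idea, with a sorted array standing in for the B+ tree's leaf level (our own simplification; `pairs`, `insert`, and `scan` are hypothetical names):

```python
import bisect

# Clustered-index sketch: neighbor lists of many low-degree vertices kept
# sorted together in one array of (source, neighbor) pairs, so vertices of
# the same subgraph sit adjacently in memory.
pairs = []  # sorted list of (u, v) edges for all low-degree vertices

def insert(u, v):
    bisect.insort(pairs, (u, v))

def scan(u):
    """All neighbors of u -- one contiguous range, no per-vertex structure.
    Uses (u, -1) as a lower-bound sentinel (valid since IDs are >= 0)."""
    lo = bisect.bisect_left(pairs, (u, -1))
    hi = bisect.bisect_left(pairs, (u + 1, -1))
    return [v for _, v in pairs[lo:hi]]

for u, v in [(3, 9), (2, 5), (3, 4), (2, 7)]:
    insert(u, v)
print(scan(2), scan(3))   # [5, 7] [4, 9]
```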
- Copy-on-Write Implementation: When an edge is added or deleted, only the path from the root of the C-ART (or clustered index) to the affected leaf is duplicated; all other nodes and leaves are shared with the previous version. This makes snapshot creation very cheap in both time and space, as the path-copying sketch below illustrates.
5. Experimental Setup
- Datasets: A diverse set of six real-world and synthetic graphs is used, varying in size, density, and degree distribution, which makes the evaluation comprehensive.
This is a manual transcription of Table 5 from the paper.
| Dataset | Abbr. | \|V\| | \|E\| | Avg. Deg. | Max Deg. | Size (GB) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| LiveJournal | lj | 4M | 43M | 17.4 | 14815 | 0.67 |
| Orkut | ot | 3M | 117M | 76.3 | 33313 | 1.7 |
| LDBC | ldbc | 30M | 176M | 11.8 | 4282595 | 2.84 |
| Graph500 | g5 | 9M | 260M | 58.7 | 406416 | 4.16 |
| Twitter | tw | 21M | 265M | 24.8 | 698112 | 4.11 |
| Friendster | fr | 65M | 2B | 55.1 | 5214 | 30.1 |
- Evaluation Metrics:
- Throughput: Measures the rate of operations.
- Conceptual Definition: The number of operations (e.g., edges inserted, edges processed) completed per unit of time. Higher is better. It is measured in TEPS (Thousand Edges Per Second) or MEPS (Million Edges Per Second).
- Mathematical Formula:
$\text{Throughput} = \frac{\text{Total Operations}}{\text{Total Time}}$
- Latency: Measures the time to completion.
- Conceptual Definition: The time taken to execute a single query or a workload (e.g., 10 iterations of PageRank). Lower is better. Measured in seconds (s).
- Memory Consumption: Measures space efficiency.
- Conceptual Definition: The amount of main memory (RAM) used by the system after loading the graph. Measured as the Resident Set Size (RSS) in Gigabytes (GB). Lower is better.
- Baselines:
- Sortledton: State-of-the-art transactional graph data structure with per-edge versioning.
- Teseo: A dynamic graph library using a Packed Memory Array and per-edge versioning.
- Aspen: A dynamic graph system using copy-on-write but limited to a single writer.
- LiveGraph: A transactional system using a logging approach for neighbor lists.
- CSR: A static graph representation, used as a baseline for optimal read performance (it does not support updates).
6. Results & Analysis
- Core Results (Read and Write Performance):
- Read Performance: Table 4 (transcribed below) shows the slowdown of each system relative to the static CSR baseline (lower is better). RapidStore consistently achieves the lowest slowdown, i.e., performance closest to the static optimum. For example, on the lj dataset for PageRank, RapidStore is only 1.53x slower than CSR, while Sortledton is 6.69x slower. This highlights the effectiveness of eliminating version checks and improving data locality.
This is a manual transcription of the slowdown results from Table 4 in the paper.
lj:
| System | BFS | PR | SSSP | WCC | TC |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Sortledton | 7.23 | 6.69 | 2.45 | 3.11 | 3.18 |
| Teseo | 15.25 | 4.32 | 3.53 | 2.52 | 3.99 |
| Aspen | 7.11 | 7.10 | 4.81 | 3.97 | 6.24 |
| LiveGraph | 21.77 | 17.12 | 4.76 | 7.01 | - |
| RapidStore | 2.11 | 1.53 | 1.61 | 1.18 | 1.39 |

ot:
| System | BFS | PR | SSSP | WCC | TC |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Sortledton | 5.97 | 4.77 | 2.46 | 3.07 | 3.07 |
| Teseo | 3.83 | 2.74 | 3.50 | 4.29 | 4.34 |
| Aspen | 3.67 | 3.34 | 2.71 | 2.80 | 10.15 |
| LiveGraph | 15.77 | 10.49 | 7.84 | 6.86 | - |
| RapidStore | 1.59 | 1.35 | 1.28 | 1.22 | 1.51 |

ldbc:
| System | BFS | PR | SSSP | WCC | TC |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Sortledton | 8.35 | 8.49 | 3.04 | 4.20 | 4.94 |
| Teseo | 6.28 | 4.01 | 3.84 | 3.23 | 7.31 |
| Aspen | 8.90 | 9.07 | 5.97 | 4.58 | 9.87 |
| LiveGraph | 11.46 | 13.91 | 4.99 | 6.41 | - |
| RapidStore | 1.82 | 1.65 | 1.79 | 1.36 | 1.49 |
- Write Performance: Figure 8 (described in the text) shows that Sortledton has the highest raw insertion throughput in most cases because it modifies data in place. RapidStore is a close second, with its copy-on-write overhead managed effectively. On skewed datasets like ldbc, however, RapidStore outperforms Sortledton because its coarser locking reduces contention.
- Concurrent Performance Analysis:
- Readers with Concurrent Writers: Figure 9 demonstrates RapidStore's key advantage. When writers are active, the read performance of Sortledton and Teseo degrades sharply due to lock contention, while RapidStore's read latency remains almost flat, showing minimal interference.
- Writers with Concurrent Readers: As shown in Figure 10, adding readers has only a small impact on RapidStore's write throughput. In contrast, Sortledton and Teseo see their write throughput drop significantly because readers acquire locks that block writers.
- Hardware Utilization: Figure 11 shows that RapidStore's performance degradation under heavy concurrency stems from hitting the machine's physical memory bandwidth limit. This is a positive sign: the system is efficient enough that its bottleneck is the hardware, not software-induced contention. The other systems are bottlenecked by lock contention long before they can saturate the hardware.
[Figure 10: Insert throughput of RapidStore and three other systems as the number of readers grows, with the writer count fixed at 8. Panels (a) and (b) show insertion performance on the lj and g5 datasets; throughput declines as readers are added.]
[Figure 11: Average memory bandwidth utilization of Sortledton, Teseo, LiveGraph, and RapidStore on the lj and g5 datasets as the number of readers varies; RapidStore's bandwidth usage stands out.]
- Ablations / Parameter Sensitivity:
- Partition Size: Figure 12 shows the performance trade-off. Larger partition sizes (|P|) improve read performance (better locality) but decrease write performance (higher chance of lock conflicts). The default of |P|=64 is chosen as a good balance.
- Ablation Study: Table 6 (transcribed below) confirms the contribution of each component.
- Adding subgraph-centric concurrency control (ART + SC) significantly improves read performance over the baseline (ART with per-edge versioning).
- Replacing ART with C-ART (C-ART + SC) further boosts read performance, especially on graphs with high-degree vertices such as g5.
- Using the clustered index (C-ART + SC + CI) yields a major leap in both read and write performance over a simpler vector-based approach (C-ART + SC + VEC), validating its design.
This is a manual transcription of Table 6 from the paper.
| Dataset | Method | Insert (TEPS) | BFS (s) | PR (s) | SSSP (s) | WCC (s) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| lj | ART | 2445.59 | 10.92 | 59.84 | 11.01 | 17.04 |
| lj | ART + SC | 2119.22 | 7.31 | 38.97 | 5.42 | 11.55 |
| lj | C-ART + SC | 1991.25 | 7.30 | 37.77 | 5.25 | 10.81 |
| lj | C-ART + SC + VEC | 1710.56 | 2.42 | 11.25 | 2.73 | 2.59 |
| lj | C-ART + SC + CI | 2373.05 | 1.05 | 5.90 | 2.31 | 1.97 |
| g5 | ART | 1592.24 | 25.22 | 246.28 | 32.21 | 45.98 |
| g5 | ART + SC | 1562.43 | 22.85 | 243.93 | 27.46 | 41.86 |
| g5 | C-ART + SC | 1797.23 | 9.73 | 98.81 | 13.20 | 18.87 |
| g5 | C-ART + SC + VEC | 1711.60 | 4.18 | 47.12 | 8.23 | 9.06 |
| g5 | C-ART + SC + CI | 1973.13 | 2.71 | 29.28 | 6.47 | 5.81 |
- Memory Consumption: Figure 13 clearly shows that RapidStore is the most memory-efficient system, because it avoids per-edge version metadata and its C-ART structure with ID compression is very compact. On the largest dataset (fr), it uses significantly less memory than Sortledton and Teseo.
[Figure 12: RapidStore's write and read performance under different partition sizes. Write throughput/latency and read throughput/latency are compared on the lj and g5 datasets; both shift as the partition size grows.]
[Figure 13: Bar chart of memory consumption across systems on the six datasets. RapidStore's bars show consistently low memory use, markedly lower than several systems on large datasets such as fr.]
7. Conclusion & Reflections
- Conclusion Summary:
RapidStore presents a compelling solution to the long-standing problem of concurrent query processing in dynamic graph systems. By shifting from per-edge to subgraph-centric versioning, it creates a decoupled architecture that effectively eliminates read-write interference. This, combined with the novel C-ART data store, delivers exceptional read performance, high concurrency, competitive write speeds, and low memory usage. The work successfully demonstrates that a holistic system design, tailored to the access patterns of graph queries, can significantly outperform general-purpose approaches adapted from relational databases.
- Limitations & Future Work:
- Authors' Acknowledgment: The authors identify that RapidStore is currently an in-memory system. A key area for future work is extending it to support persistence on modern storage devices like NVMe SSDs, making it suitable for graphs that exceed main memory capacity.
- Static Partitioning: The current graph partitioning strategy is simple and static (based on vertex ID ranges). A potential limitation is that this may not be optimal for highly skewed workloads where updates are concentrated in specific parts of the graph. An adaptive partitioning strategy that adjusts to workload patterns could further reduce write contention.
- Personal Insights & Critique:
- Novelty and Impact: The core idea of subgraph-centric versioning is both elegant and highly effective. It is a significant conceptual shift from the prevailing per-edge versioning paradigm and is likely to influence future designs of dynamic graph systems. The paper's strength lies in its holistic approach—it's not just a new data structure or algorithm, but a complete system design where the concurrency protocol and storage layer are co-designed to work in synergy.
- Trade-offs: The system is explicitly designed for read-intensive workloads. While it shows competitive write performance, the copy-on-write strategy inherently has higher overhead than in-place updates. For extremely write-heavy workloads, a different design might be more appropriate. The paper is transparent about this trade-off.
- Open Questions: The optimal partition size (|P|) likely depends on the graph structure and workload mix. While the paper provides an empirical evaluation, developing a theoretical model or an auto-tuning mechanism for this hyperparameter would be a valuable extension. Furthermore, how the system would perform on graph algorithms that modify the graph structure during execution (as opposed to just reading it) is an interesting area for exploration.