Read this in other languages: English | 简体中文
A high-performance, concurrent, in-memory key-value store optimized for C++ applications, with optional WAL persistence, SIMD-accelerated vector search, and a Graph-on-KV layer supporting GraphRAG workloads.
- Extreme Concurrency — Sharded Locking architecture (32 independent shards by default) dramatically reduces lock contention in multi-threaded environments.
- Read-Heavy Optimization — Combines
std::shared_mutex(Reader-Writer Locks) with a Lazy LRU Promotion strategy, making 99% ofgetoperations virtually lock-free. - Industrial-Grade Reliability — Supports TTL expiration, capacity limits, LRU eviction, and Write-Ahead Logging (WAL) with Group Commit for crash consistency.
- SIMD-Accelerated Vector Search — AVX2-optimized cosine similarity for high-dimensional embedding workloads (4.58x speedup over scalar).
- Graph-on-KV + GraphRAG — Graph database layer built on top of ShardedCache: Node/Edge CRUD, K-hop BFS traversal, and a two-phase GraphRAG query engine (vector search → concurrent BFS expansion).
- Zero-Copy API — Interfaces designed with
std::string_viewand move semantics to eliminate unnecessary allocations.
Environment: Tencent Cloud (Shanghai), 4 Cores / 8 Threads, Ubuntu 22.04, g++ 11
Compilation:-O2 -march=native -mavx2 -mfma
(Note:-O2outperforms-O3on this workload — aggressive inlining inflates hot-path code size and increases L1 icache pressure.)
| Metric | Value |
|---|---|
| Peak Throughput | 5.08M QPS (2 threads, 32 shards, R90W10) |
| P99 Latency | 2.81 μs |
| WAL Overhead (8 threads) | 10.6% (Group Commit, 10ms interval) |
| Sharding Gain | +77% (1 shard → 32 shards) |
| SIMD Speedup | 4.58x (512-dim L2 distance, AVX2) |
Peak QPS hits 5.08M at 2 threads. Performance plateaus beyond 2 threads due to memory bandwidth saturation — the 100K key range exceeds L3 cache capacity, causing contention on the memory bus rather than on locks.
| Threads | QPS | P50 (μs) | P95 (μs) | P99 (μs) | Hit Rate |
|---|---|---|---|---|---|
| 1 | 3.72M | 0.20 | 0.50 | 0.66 | 10% |
| 2 | 5.08M | 0.29 | 0.59 | 2.81 | 10% |
| 4 | 3.93M | 0.52 | 3.26 | 8.85 | 11% |
| 8 | 3.10M | 1.02 | 10.45 | 18.52 | 13% |
| 16 | 2.86M | 1.44 | 25.85 | 51.65 | 16% |
P99 rises sharply beyond 2 threads due to lock contention, peaking at 51.65 μs at 16 threads.
Optimal at 32 shards (+77% over 1 shard). Performance saturates beyond 4 shards; the difference between 16 and 64 shards is under 1%.
| Shards | QPS | Relative |
|---|---|---|
| 1 | 1.71M | 0.57x |
| 4 | 2.99M | 0.99x |
| 8 | 3.02M | 0.99x |
| 32 | 3.04M | 1.00x |
| 64 | 2.99M | 0.99x |
fsync overhead drops from 34.9% at 1 thread to 10.6% at 8 threads thanks to Group Commit batching writes. Group Commit is 199x faster than sync-every-write (~16K QPS).
The 1-thread overhead is elevated because the WAL background flush thread competes with the main thread for the same physical CPU core. The 8-thread figure (10.6%) is more representative of real production multi-core environments.
| Threads | In-Memory QPS | Group Commit QPS | Overhead |
|---|---|---|---|
| 1 | 3.42M | 2.23M | 34.9% |
| 2 | 3.94M | 3.02M | 23.4% |
| 4 | 3.76M | 3.11M | 17.4% |
| 8 | 3.09M | 2.76M | 10.6% |
AVX2 SIMD achieves a 4.58x QPS speedup (18.73M vs 4.09M) and reduces average latency from 0.24 μs to 0.05 μs. This is 57.2% of the theoretical 8x maximum (AVX2 processes 8 floats per instruction), with the gap explained by memory bandwidth limits, horizontal reduction overhead, and cache misses.
| Version | QPS | Avg Latency | Speedup |
|---|---|---|---|
| Scalar (loop) | 4.09M | 0.24 μs | 1.00x |
| AVX2 SIMD | 18.73M | 0.05 μs | 4.58x |
System-level KV impact is ±3% (Amdahl's Law: vector computation is only ~1–2% of total KV operation time; the bottleneck is hash lookup and lock management).
Header-only integration — just include and go:
#include "db/sharded_cache.h"
#include <string>
#include <iostream>
int main() {
// 10,000 capacity per shard × 32 shards = 320,000 total capacity
minkv::db::ShardedCache<std::string, std::string> cache(10000, 32);
// Write with TTL (milliseconds)
cache.put("user:1001", "Robinson", 5000); // expires in 5 seconds
// Read
auto value = cache.get("user:1001");
if (value) {
std::cout << "Found: " << *value << std::endl;
} else {
std::cout << "Not found or expired" << std::endl;
}
// Thread-safe by default — no external locking needed
return 0;
}MinKV includes a graph database layer built entirely on top of ShardedCache<string, string>. All graph data (nodes, edges, adjacency lists, embeddings) lives in the same KV instance — no separate storage engine needed.
n:{node_id} → Node binary serialization
e:{src}:{dst}:{label} → Edge binary serialization
adj:out:{node_id} → Outgoing adjacency list (binary length-prefix array)
adj:in:{node_id} → Incoming adjacency list
vec:{node_id} → Embedding raw bytes (float[])
#include "graph/graph_store.h"
#include "core/sharded_cache.h"
using namespace minkv::graph;
using GraphKVStore = minkv::db::ShardedCache<std::string, std::string>;
auto kv = std::make_shared<GraphKVStore>(65536, 16);
GraphStore gs(kv);
// Build a knowledge graph
gs.AddNode({"Elon_Musk", R"({"role":"CEO"})"});
gs.AddNode({"SpaceX", R"({"type":"company"})"});
gs.AddNode({"Starship", R"({"type":"rocket"})"}); // no embedding needed
gs.SetNodeEmbedding("Elon_Musk", embed("Elon Musk founder entrepreneur"));
gs.SetNodeEmbedding("SpaceX", embed("aerospace rocket launch"));
gs.AddEdge({"Elon_Musk", "SpaceX", "founded"});
gs.AddEdge({"SpaceX", "Starship", "product"});
// GraphRAG query: vector search → concurrent K-hop BFS expansion
auto results = gs.GraphRAGQuery(query_vec, /*top_k=*/3, /*hop_depth=*/2);
// Returns Elon_Musk + SpaceX (via vector search) + Starship (via graph traversal)
// Starship has no embedding — pure vector search would miss it entirely| Method | Recall on no-embedding nodes |
|---|---|
| Pure vector search | 0% |
| GraphRAG (2-hop) | 42.6% |
Nodes like Starship, GPT4, AWS have no embeddings. Vector search cannot find them. GraphRAG traverses the graph and retrieves them via relationship chains.
- Binary adjacency list — Replaced JSON with length-prefix binary encoding. Eliminates JSON escape/parse overhead for hub nodes (high-degree nodes with thousands of edges).
- Thread pool concurrent BFS — Multiple entry nodes' BFS runs in parallel. Fixed-size thread pool eliminates
std::asyncthread creation overhead. P99 latency: 467ms → 10ms. - Crash recovery — Write order guarantees edge data precedes adjacency list updates.
RebuildAdjacencyList()rescans alle:keys to restore consistency after a crash.
cmake -S MinKV -B MinKV/build -DCMAKE_BUILD_TYPE=Release
cmake --build MinKV/build --target demo_graphrag test_graphrag_accuracy -j$(nproc)
# GraphRAG demo (Elon Musk → SpaceX → Starship chain)
./MinKV/build/bin/demo_graphrag 2>/dev/null
# Accuracy test: GraphRAG vs pure vector search
./MinKV/build/bin/test_graphrag_accuracy 2>/dev/nullHeader-only integration — just include and go:
#include "db/sharded_cache.h"
#include <string>
#include <iostream>
int main() {
// 10,000 capacity per shard × 32 shards = 320,000 total capacity
minkv::db::ShardedCache<std::string, std::string> cache(10000, 32);
// Write with TTL (milliseconds)
cache.put("user:1001", "Robinson", 5000); // expires in 5 seconds
// Read
auto value = cache.get("user:1001");
if (value) {
std::cout << "Found: " << *value << std::endl;
} else {
std::cout << "Not found or expired" << std::endl;
}
// Thread-safe by default — no external locking needed
return 0;
}A single global mutex forces all threads to serialize. Under high concurrency, threads spend most of their time waiting — this is Lock Contention, the primary bottleneck in concurrent systems.
1. Sharding
Keys are hashed into 32 independent buckets. Each bucket has its own lock, so threads accessing different keys never block each other. Ablation tests confirm a +77% throughput gain over a single-shard baseline.
2. Reader-Writer Locks
~99% of real-world cache traffic is reads. std::shared_mutex allows unlimited concurrent readers; only writers need exclusive access. This is the foundation that makes high read throughput possible.
3. Lazy LRU Promotion
Traditional LRU updates the linked list on every read — a write operation under the hood. MinKV's Lazy Promotion skips list updates for accesses within a 1-second window, converting 99% of get calls into pure read-lock operations and fully unlocking the potential of the reader-writer lock.
4. Group Commit WAL
A background thread batches writes and calls fsync every 10ms. This amortizes the cost of disk I/O across many operations, keeping persistence overhead at ~10% in multi-threaded workloads while guaranteeing crash consistency.
MinKV/
├── src/
│ ├── core/ # ShardedCache, LRU, expiration
│ ├── persistence/ # WAL, checkpoint, recovery
│ ├── vector/ # SIMD-accelerated vector ops
│ ├── graph/ # Graph-on-KV layer (GraphStore, serializer, types)
│ ├── base/ # Thread pool, logging, utilities
│ └── tests/ # Benchmarks and unit tests
├── tests/
│ └── graph/ # Graph unit tests, PBT, GraphRAG accuracy test
├── mcp_server_go/ # Go MCP server (Cursor/Claude integration)
├── docs/
│ ├── images/ # Benchmark charts
│ └── tests/ # Detailed test reports
└── scripts/ # Chart generation, build helpers
License: MIT | Author: Robinson




