A single-partition key/value storage engine on top of the monterey io_uring runtime. Designed for low-latency point and range access with zero-copy read paths.
Status: experimental — the on-disk format is not stable and the public API will change.
A Partition is a self-contained COW B+tree backed by a block device:
- Copy-on-write B+tree. Every mutation allocates new pages; the old root stays valid until a new root is committed to a meta page. Crash-safe without in-place writes.
- Block cache over FixedBufs. 4K pages are cached in io_uring-registered buffers. Clock (second-chance) eviction, CRC32C verification on read, writev coalescing on flush.
- Append-only WriteBuffer. Puts and deletes hit an arena-backed index in O(1); gets skip it via a key-range filter in O(1) when the key is out of range.
- Write-ahead log. Optional per-op or batched fsync for durability.
- Snapshots. Generation-gated freelist holds freed pages until all referencing snapshots are dropped, so a snapshot can keep reading an old root while the live tree continues to mutate.
- Streaming range scans. Zero-copy
key()/value()slices pointing directly into the page cache — no per-entry allocation. - Overflow pages. Values larger than ~1 KiB spill into contiguous overflow chains, read via a single multi-block I/O that bypasses the cache.
For multi-partition sharding, see the store-sharded crate.
use store::block::file::FileBlockDevice;
use store::wal::WalMode;
use store::{Partition, PartitionConfig};
fn main() {
let config = PartitionConfig::default()
.cache_size(1024) // pages (4 MiB at 4K each)
.wal_mode(WalMode::None);
monterey::Cluster::builder()
.cores(1)
.buffers(config.buffer_config())
.init(move || {
monterey::spawn(async move {
let dev = FileBlockDevice::open("/tmp/mydb", 16 * 1024).await.unwrap();
let mut db = Partition::create_with(dev, config).await.unwrap();
db.put(b"hello", b"world").await.unwrap();
db.flush().await.unwrap();
let v = db.get(b"hello").await.unwrap().unwrap();
assert_eq!(&v, b"world");
// Streaming range scan — zero-copy.
let mut scan = db.scan(b"a", Some(b"z")).await.unwrap();
while scan.advance().await.unwrap() {
println!("{:?} = {:?}", scan.key(), scan.value());
}
monterey::request_shutdown();
});
})
.run()
.unwrap();
}A runnable version is in examples/basic.rs:
cargo run --example basic --release
┌────────────────────────┐
│ Partition API │ put / get / delete / scan / snapshot
└───────────┬────────────┘
│
┌──────────────┼──────────────────────┐
│ │ │
┌──▼──────┐ ┌────▼─────┐ ┌─────▼─────-┐
│WriteBuf │ │ Flusher │ ────────►│ Tree<D> │
│ (arena) │ │ task │ │ COW B+tree│
└────┬────┘ └────┬─────┘ └─────┬─────-┘
│ │ │
│ │ │
│ ┌────▼──────────────────────▼──--┐
│ │ PageCache (Rc) │ interior-mutable
│ │ clock eviction, CRC32C, writev│ shared by writer+flusher
│ └────┬──────────────────────────-┘
│ │
┌────▼────-┐ ┌────▼─────────┐
│ WAL │ │ BlockDevice │ FileBlockDevice / NVMe passthrough
│(optional)│ └──────────────┘
└────────-─┘
The writer task takes put/delete calls and appends them to a WriteBuffer. When the buffer crosses a threshold or flush() is called, it's handed to the flusher task which merges it into the B+tree and rewrites the meta page. Both tasks share a single Rc<PageCache>, which is safe because the cache is interior-mutable and RefCell borrows are never held across await points.
Range iterators also hold a clone of the Rc<PageCache>, so a scan can outlive the tree view that created it and a snapshot scan can survive across flushes.
| File | What it does |
|---|---|
src/block/ |
BlockDevice trait and FileBlockDevice / NVMe passthrough impls. |
src/cache.rs |
PageCache — clock eviction, CRC verification, writev coalescing, direct overflow reads. |
src/page.rs |
Raw 4 KiB page layout: header, leaf/internal cells, CRC32C, overflow pointers. |
src/tree.rs |
COW B+tree: get, put, delete, bulk_load, merge_rebuild, compaction helpers. |
src/freelist.rs |
Free-page tracking with generation-gated holding for snapshot isolation. |
src/range.rs |
RangeIter — zero-copy streaming scan with prefetch directory. |
src/writebuf.rs |
Arena-backed append-only write buffer with key-range filter. |
src/wal.rs |
Write-ahead log (optional durability layer). |
src/partition.rs |
Partition — public API, writer + flusher tasks, snapshots, compaction. |
benches/compare.rs compares store against RocksDB across a handful of scenarios (in-memory fast path, cache-pressured random reads, no-cache reads, durable writes). Results depend heavily on the block device — a local file on an NVMe SSD is a different regime from direct NVMe passthrough.
cargo bench --bench compare
Bench outputs are not checked in.
cargo check --tests --benches --examples
cargo fmt
cargo test
Imports follow rustfmt defaults: std → external crates → crate::, blank line between groups. cargo fmt is the source of truth.
MIT OR Apache-2.0