Skip to content

bryandmc/store

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

79 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

store

A single-partition key/value storage engine on top of the monterey io_uring runtime. Designed for low-latency point and range access with zero-copy read paths.

Status: experimental — the on-disk format is not stable and the public API will change.

What it is

A Partition is a self-contained COW B+tree backed by a block device:

  • Copy-on-write B+tree. Every mutation allocates new pages; the old root stays valid until a new root is committed to a meta page. Crash-safe without in-place writes.
  • Block cache over FixedBufs. 4K pages are cached in io_uring-registered buffers. Clock (second-chance) eviction, CRC32C verification on read, writev coalescing on flush.
  • Append-only WriteBuffer. Puts and deletes hit an arena-backed index in O(1); gets skip it via a key-range filter in O(1) when the key is out of range.
  • Write-ahead log. Optional per-op or batched fsync for durability.
  • Snapshots. Generation-gated freelist holds freed pages until all referencing snapshots are dropped, so a snapshot can keep reading an old root while the live tree continues to mutate.
  • Streaming range scans. Zero-copy key()/value() slices pointing directly into the page cache — no per-entry allocation.
  • Overflow pages. Values larger than ~1 KiB spill into contiguous overflow chains, read via a single multi-block I/O that bypasses the cache.

For multi-partition sharding, see the store-sharded crate.

Quick start

use store::block::file::FileBlockDevice;
use store::wal::WalMode;
use store::{Partition, PartitionConfig};

fn main() {
    let config = PartitionConfig::default()
        .cache_size(1024)            // pages (4 MiB at 4K each)
        .wal_mode(WalMode::None);

    monterey::Cluster::builder()
        .cores(1)
        .buffers(config.buffer_config())
        .init(move || {
            monterey::spawn(async move {
                let dev = FileBlockDevice::open("/tmp/mydb", 16 * 1024).await.unwrap();
                let mut db = Partition::create_with(dev, config).await.unwrap();

                db.put(b"hello", b"world").await.unwrap();
                db.flush().await.unwrap();

                let v = db.get(b"hello").await.unwrap().unwrap();
                assert_eq!(&v, b"world");

                // Streaming range scan — zero-copy.
                let mut scan = db.scan(b"a", Some(b"z")).await.unwrap();
                while scan.advance().await.unwrap() {
                    println!("{:?} = {:?}", scan.key(), scan.value());
                }

                monterey::request_shutdown();
            });
        })
        .run()
        .unwrap();
}

A runnable version is in examples/basic.rs:

cargo run --example basic --release

Architecture

      ┌────────────────────────┐
      │     Partition API      │  put / get / delete / scan / snapshot
      └───────────┬────────────┘
                  │
   ┌──────────────┼──────────────────────┐
   │              │                      │
┌──▼──────┐  ┌────▼─────┐          ┌─────▼─────-┐
│WriteBuf │  │  Flusher │ ────────►│  Tree<D>   │
│ (arena) │  │   task   │          │  COW B+tree│
└────┬────┘  └────┬─────┘          └─────┬─────-┘
     │            │                      │
     │            │                      │
     │       ┌────▼──────────────────────▼──--┐
     │       │       PageCache (Rc)           │  interior-mutable
     │       │  clock eviction, CRC32C, writev│  shared by writer+flusher
     │       └────┬──────────────────────────-┘
     │            │
┌────▼────-┐  ┌────▼─────────┐
│   WAL    │  │ BlockDevice  │  FileBlockDevice / NVMe passthrough
│(optional)│  └──────────────┘
└────────-─┘

The writer task takes put/delete calls and appends them to a WriteBuffer. When the buffer crosses a threshold or flush() is called, it's handed to the flusher task which merges it into the B+tree and rewrites the meta page. Both tasks share a single Rc<PageCache>, which is safe because the cache is interior-mutable and RefCell borrows are never held across await points.

Range iterators also hold a clone of the Rc<PageCache>, so a scan can outlive the tree view that created it and a snapshot scan can survive across flushes.

Module layout

File What it does
src/block/ BlockDevice trait and FileBlockDevice / NVMe passthrough impls.
src/cache.rs PageCache — clock eviction, CRC verification, writev coalescing, direct overflow reads.
src/page.rs Raw 4 KiB page layout: header, leaf/internal cells, CRC32C, overflow pointers.
src/tree.rs COW B+tree: get, put, delete, bulk_load, merge_rebuild, compaction helpers.
src/freelist.rs Free-page tracking with generation-gated holding for snapshot isolation.
src/range.rs RangeIter — zero-copy streaming scan with prefetch directory.
src/writebuf.rs Arena-backed append-only write buffer with key-range filter.
src/wal.rs Write-ahead log (optional durability layer).
src/partition.rs Partition — public API, writer + flusher tasks, snapshots, compaction.

Benchmarks

benches/compare.rs compares store against RocksDB across a handful of scenarios (in-memory fast path, cache-pressured random reads, no-cache reads, durable writes). Results depend heavily on the block device — a local file on an NVMe SSD is a different regime from direct NVMe passthrough.

cargo bench --bench compare

Bench outputs are not checked in.

Development

cargo check --tests --benches --examples
cargo fmt
cargo test

Imports follow rustfmt defaults: std → external crates → crate::, blank line between groups. cargo fmt is the source of truth.

License

MIT OR Apache-2.0

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages