diff --git a/PERF_FINDINGS.md b/PERF_FINDINGS.md
deleted file mode 100644
index eb74a8a..0000000
--- a/PERF_FINDINGS.md
+++ /dev/null
@@ -1,266 +0,0 @@
-# Holt performance findings
-
-A record of the read/write optimization pass on the `perf/u16-children`
-branch: what landed, how holt measures against RocksDB/SQLite on a *fair*
-benchmark, and (most importantly) the precisely diagnosed remaining
-bottleneck and the honest architectural tradeoffs. Read this before
-chasing further perf work so you don't re-derive it.
-
-## What landed (10 commits)
-
-```
-c6062da bench: drop the holt Tree before its TempDir in persistent groups
-9261965 perf(addressing)!: child body offsets — drop slot-table read indirection (2 loads→1/hop; manifest v4)
-2845bec perf(layout)!: flatten leaf → one variable-size self-describing node (manifest v3; subsumes LeafInline)
-4ffe973 perf(walker): one-byte leaf key fingerprint (skip extent read on key mismatch)
-10bfa0c perf(range): scan-ahead prefetch of the next sibling subtree
-a1852d2 perf(walker): software-prefetch the next node body during descent (+ aarch64 PRFM)
-83b92df perf(buffer-manager): cheap GUID hasher for the blob cache map (pin −37%)
-945b9f9 ③ LeafInline   2ffe29a ① NEON 16-lane scan   abd4e69 ② spillover footprint memo
-```
-
-Two breaking on-disk format changes (R3 leaf flatten = v3, R1 offset
-addressing = v4). Every step validated on **aarch64 (NEON)** and
-**x86 (AVX2 + io_uring)** through the corruption gates: `proptest`
-(randomized ops vs a BTreeMap/WAL-replay oracle) and
-`checkpoint_failpoint` (crash injection). `④ pointer swizzling` was
-**measured and rejected** (≤9–16% ceiling, large concurrency surface)
-in favor of the GUID hasher.
-
-## Fair benchmark vs RocksDB / SQLite
-
-Methodology (benches/main.rs, holt-bench crate): **N = 20 000**
-object-store metadata keys (~30 B path keys, ~60 B JSON values), spread
-across ~7 × 512 KB blobs.
-Persistent groups run holt `Durability::Wal { sync: false }` with the **journal + background
-checkpointer threads running**, vs RocksDB **WAL on, sync off**, a fair
-"hot service" durability profile (WAL to page cache, no per-op fsync).
-Numbers from the x86 box (perf_event_paranoid=4, no perf sampling).
-
-| operation (persistent, threads running) | holt | RocksDB | result |
-|---|---|---|---|
-| point read (objstore_persist_get) | **210 ns** | 499 ns | **holt 2.4× faster** |
-| point read (memory) | 219 ns | 487 ns | holt 2.2× |
-| write (objstore_persist_put, 1 thread) | 2.6 µs | 2.58 µs | **parity** |
-| write (memory, no WAL) | 418 ns | 1397 ns | holt 3.3× (no durability) |
-| prefix scan (objstore_list, 100 entries) | 16.35 µs | 15.74 µs | **~parity (within 4%)** |
-
-R1 (offset addressing) was the highest-leverage single change: point
-read −10.6%, **prefix scan −24.2%** (closed a 30%→4% gap), writes
-unchanged.
-
-### Concurrent write (1M keys, persistent WAL + checkpoint, 16-core x86)
-
-| threads | holt | RocksDB |
-|---|---|---|
-| 1 | 8938 ns/op (0.11 Mops/s) | 3716 ns/op (0.27) |
-| 4 | 2211 ns/op (0.45) | 2044 ns/op (0.49) |
-| 8 | 3089 ns/op (0.32) | 1435 ns/op (0.70) |
-| 16 | 3437 ns/op (0.29) | 1532 ns/op (0.65) |
-
-holt **peaks at 4 threads then negatively scales**; p99 tail at 16
-threads ≈ 296 µs vs RocksDB 56 µs. RocksDB scales to ~0.65–0.70 Mops/s.
-
-## Honest conclusions
-
-**holt is a read engine.** It crushes RocksDB on reads (2.2–2.4×, durable or
-not) because the ART + 512 KB self-describing blobs give one-load node
-hops (post-R1) and subtree locality. That is the product story.
-
-**Writes have two separate problems:**
-
-1. **Architectural (hard to beat): in-place tree vs append-only LSM.**
-   A holt `put` costs O(tree depth + route-cache miss + possible
-   spillover); at 1M keys that's ~513 blobs, depth 2, ~78% route-cache
-   miss, 512 spillovers, so a single-thread put is ~2.4× slower than RocksDB at scale.
-   RocksDB's LSM append is ~O(1) (append memtable + WAL, defer
-   reorganization to background compaction). Chasing LSM on raw write
-   throughput fights the architecture; don't.
-
-2. **Fixable: concurrent writes serialize on the root blob's latch.**
-   The write path is **lock-coupled with exclusive latches**:
-   `cross_and_insert` (src/engine/walker/insert.rs) takes the parent
-   blob's `BlobWriteGuard` (exclusive), pins the child, takes the
-   child's `BlobWriteGuard`, then drops the parent. So **every write to
-   any child blob first exclusively latches the root blob** to traverse
-   it → all writers serialize on the root's exclusive latch (classic
-   lock-coupled-tree root bottleneck). This is why concurrency scales
-   negatively and the tail explodes, and it is *not* architectural; it
-   is fixable. (The isolation experiments in the next section later overturned this root-latch diagnosis.)
-
-   **Attempted fix (REJECTED: measured regression):** optimistic write
-   descent (LeanStore-style optimistic lock coupling): traverse the
-   upper blobs optimistically (snapshot `content_version`, read wait-free,
-   validate) exactly like the read path, and take the **exclusive latch
-   only on the target blob** where the mutation lands (revalidate the
-   parent chain by version, restart on a miss, escalate to pessimistic
-   after a 4-restart budget; CoW-fork escalates too). Fully implemented
-   and **validated for correctness** (lib +3 tests, concurrent_stress 5,
-   proptest 5, checkpoint_failpoint 8, restart counts bounded; aarch64
-   20/20 + x86 12/12 hardened-stress loops clean, both arches).
-
-   But a **clean same-machine A/B** (x86, `objstore put`, 50k ops/thread,
-   RocksDB reproduced identically both runs → clean attribution) showed
-   it is a **large regression at every thread count**, not a win:
-
-   | threads | baseline Mops/s | with descent | Δ | baseline p50 | descent p50 |
-   |--------:|----------------:|-------------:|---:|------------:|------------:|
-   | 1 | 0.105 | 0.087 | **−17%** | 1.9µs | 4.8µs (2.5×) |
-   | 4 | 0.453 | 0.138 | **−70%** | 2.9µs | 28.8µs (10×) |
-   | 8 | 0.316 | 0.127 | **−60%** | 6.6µs | 62µs (9×) |
-   | 16 | 0.287 | 0.134 | **−53%** | 7.3µs | 92µs (13×) |
-
-   **Why it lost:** the workload *grows* the tree, so node splits and blob
-   spillover are frequent. When a target mutation needs to split a node
-   across a blob boundary (`TargetMutation::Crossing`) or hits a
-   snapshot-shared child (`Escalate`), the optimistic attempt is wasted:
-   it descends, takes the target latch, discovers it can't complete,
-   restarts, burns the budget, and falls back to the full pessimistic
-   path, **paying for both descents**. The −17% / 2.5× single-thread p50
-   blowup (zero contention) is the proof: the fast path is mostly wasted
-   work here, not a parallelism win. Optimistic descent only pays off when
-   target blobs are disjoint *and* the mutation lands in-place (no split);
-   for a path-shaped, growing object-store keyspace neither holds often
-   enough. The rejected patch lives in this branch's git history;
-   do not re-attempt without first making the bail path cheap (skip the
-   optimistic attempt when the target node is full / a split is likely)
-   or solving the real bottleneck below.
-
-## Write-concurrency bottleneck: ISOLATED to the WAL group-commit path
-
-The negative concurrent-write scaling (4t 0.453 → 16t 0.287 Mops/s, p99
-55µs → 286µs) was diagnosed with two zero-/low-risk experiments instead of
-guessing, and the answer overturns the earlier speculation (it is NOT the
-root latch, and it does NOT need a ROWEX rewrite).
-
-**Profile-by-stats (zero code).** A 16t `objstore put` run's `holt_shape`
-line shows: tree `max_depth=2`, and during the measured overwrite phase
-`route_hits +800000 / route_misses +0` → **100% route-cache hit, 0%
-Phase-3 root-exclusive fallback**. `spillovers` flat (no tree growth, no
-`Crossing`). So every put takes the Phase-1 fast path: a *shared* `.read()`
-latch on the (single, depth-2) root + an exclusive `.write()` on the child.
-
-**No-merge multi-root A/B (bench-only, `HOLT_SHARD_N`).** Open N independent
-`Tree`s over ONE `DB` (each its own root blob, but sharing the DB's
-`next_seq` + BufferManager + journal), route puts by `hash(key)%N`, point
-ops only, no cross-shard merge. This splits the root latch N ways while
-holding everything else constant:
-
-  | config (Mops/s) | 1t | 4t | 8t | 16t |
-  |-----------------|----|----|----|-----|
-  | sh1 (1 root) | 0.116 | 0.458 | 0.321 | **0.290** |
-  | sh8 (8 roots) | 0.036* | 0.420 | 0.323 | **0.289** |
-  | sh16 (16 roots, depth-1 ⇒ all root-*exclusive*) | 0.396 | 0.341 | 0.345 | **0.307** |
-
-  At 16t every config converges to ~0.29–0.31 regardless of root count
-  (+6% from 16× the roots). **The root latch (shared OR exclusive) is
-  NOT the bottleneck**, so `prefix-sharded-forest` would not have helped.
-  (`*`sh8 1t=0.036 is a one-off p99=688µs checkpoint spike.)
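The `hash(key)%N` routing used in the multi-root experiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the bench code: `ShardedForest` and `shard_of` are hypothetical names, a `Mutex<BTreeMap>` stands in for each independent `Tree` and its root latch, and the hash is the standard library's `DefaultHasher`, which may differ from what the bench uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Hypothetical stand-in for N independent trees sharing one DB.
/// Each shard owns its own lock, playing the role of a per-root latch.
pub struct ShardedForest {
    shards: Vec<Mutex<BTreeMap<Vec<u8>, Vec<u8>>>>,
}

impl ShardedForest {
    pub fn new(n: usize) -> Self {
        assert!(n > 0, "need at least one shard");
        ShardedForest {
            shards: (0..n).map(|_| Mutex::new(BTreeMap::new())).collect(),
        }
    }

    /// Route a key to a shard index with hash(key) % N.
    pub fn shard_of(&self, key: &[u8]) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() % self.shards.len() as u64) as usize
    }

    /// Point write: only the owning shard's lock is taken.
    pub fn put(&self, key: &[u8], value: &[u8]) {
        let idx = self.shard_of(key);
        self.shards[idx]
            .lock()
            .unwrap()
            .insert(key.to_vec(), value.to_vec());
    }

    /// Point read: same routing, so a key is always found in its shard.
    pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        let idx = self.shard_of(key);
        self.shards[idx].lock().unwrap().get(key).cloned()
    }
}
```

Because keys are spread by hash, an ordered prefix scan would have to merge all N shards; the sketch, like the experiment, supports point operations only.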
-
-**Memory-mode A/B (bench-only, `HOLT_STORAGE=memory`).** `Storage::Memory`
-attaches **no journal** (`wal_path`=None), so the put takes the else-branch:
-same ART mutation, same `maintenance_gate`/`mutation_gate`/`endpoint_locks`/
-`next_seq`/`mark_dirty`, but **no `commit_gate` + no `journal.submit`**:
-
-  | threads | memory Mops/s (no journal) | WAL-mode Mops/s | p99 mem | p99 WAL |
-  |--------:|---------------------------:|----------------:|--------:|--------:|
-  | 1 | 1.328 | 0.105 | 11µs | 13µs |
-  | 4 | 3.768 | 0.453 | 9µs | 55µs |
-  | 8 | 4.567 | 0.316 | 7µs | 145µs |
-  | 16 | **5.782** | 0.287 | 9µs | 286µs |
-
-  Removing the journal makes concurrent writes **keep scaling up to
-  5.78 Mops/s at 16t (4.4× the 1t rate, 20× the WAL-mode 0.287), with flat p99**. The ART
-  write path + all three gates + `next_seq` + `mark_dirty` are all still
-  present and scale fine.
-
-**Conclusion (measured, not guessed).** The concurrent-write ceiling is
-the **WAL group-commit plumbing**, nothing else: per-put `Vec` record
-encode + crossbeam channel `send` to a single bounded channel + a single
-worker thread draining it (the 286µs p99 = foreground blocked on a full
-channel behind the saturated worker). `commit_gate.enter_writer()` is ruled
-out: it is `gate.enter_shared()`, the same primitive `mutation_gate`/
-`maintenance_gate` use, and those scale. **This overturns the "concede
-writes, it's structural" verdict**: holt's *structure* scales (5.78 Mops/s
-@ 16t ≈ 9× RocksDB's durable 0.62). Only the WAL plumbing doesn't, and
-that is fixable without touching read/scan/cold locality.
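The shape of the legacy path named in the conclusion (a per-put `Vec`, a send on one bounded channel, one draining worker) can be sketched with the standard library. This is an illustration only: `std::sync::mpsc::sync_channel` stands in for the crossbeam channel, and `encode`, `run_channel_pipeline`, and the 8-byte record layout are hypothetical, not holt's journal code.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Encode one record as [writer id: u32 LE][sequence: u32 LE].
/// The layout is hypothetical; it only has to be decodable for the demo.
fn encode(writer: u32, seq: u32) -> Vec<u8> {
    let mut rec = Vec::with_capacity(8); // one heap allocation per put
    rec.extend_from_slice(&writer.to_le_bytes());
    rec.extend_from_slice(&seq.to_le_bytes());
    rec
}

/// Run `writers` foreground threads that each submit `per_writer` records
/// through ONE bounded channel drained by ONE worker. Returns the records
/// as (writer, seq) pairs in the order the worker appended them.
pub fn run_channel_pipeline(writers: u32, per_writer: u32, capacity: usize) -> Vec<(u32, u32)> {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);

    // Single worker: the only thread that ever appends to the "log".
    let worker = thread::spawn(move || {
        let mut log = Vec::new();
        for rec in rx {
            let w = u32::from_le_bytes([rec[0], rec[1], rec[2], rec[3]]);
            let s = u32::from_le_bytes([rec[4], rec[5], rec[6], rec[7]]);
            log.push((w, s));
        }
        log
    });

    let handles: Vec<_> = (0..writers)
        .map(|w| {
            let tx = tx.clone();
            thread::spawn(move || {
                for s in 0..per_writer {
                    // Blocks while the channel is full, i.e. whenever the
                    // worker is slower than the combined writers.
                    tx.send(encode(w, s)).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // the worker's loop ends once every writer has finished

    for h in handles {
        h.join().unwrap();
    }
    worker.join().unwrap()
}
```

Every writer funnels through the same queue and the same consumer, so adding writers adds contention on the channel rather than throughput once the worker is saturated.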
-
-**Fix: lock-free shared WAL ring (single ordered log), IMPLEMENTED &
-MEASURED, beats RocksDB.** Replaced the per-record `Vec` + channel +
-single-encoder-worker with a shared in-RAM ring: each writer reserves a byte
-range via one atomic `fetch_add` on the tail (gap-free byte tiling = the
-order key), memcpies its encoded record **in parallel**, and publishes by
-folding the contiguous published byte interval into `committed_addr` under a
-brief lock; a single background flusher drains the committed prefix into the
-**unchanged** `WalWriter` (so on-disk format + replay reader are byte-for-byte
-identical) and fsyncs on the sync path. ONE ordered log → trim-watermark /
-single-pass-replay invariants preserved (unlike the rejected multi-lane
-`wal-commit-sharding`). **This is now the sole WAL backend; the legacy
-channel+worker has been removed (no feature flag).** See `src/journal/ring.rs` +
-`src/journal/group_commit.rs` (the module docs are the design).
-
-Measured A/B (x86, `objstore put`, 50k ops/thread, same machine; RocksDB the
-fixed comparator):
-
-  | threads | legacy Mops/s | **ring Mops/s** | RocksDB | ring vs RocksDB |
-  |--------:|--------------:|----------------:|--------:|----------------:|
-  | 1 | 0.105 | 0.112 | 0.271 | 0.41× (p50 1.5µs) |
-  | 4 | 0.453 | **2.605** | 0.495 | **5.3×** |
-  | 8 | 0.316 | **2.049** | 0.701 | **2.9×** |
-  | 16 | 0.287 | **1.660** | 0.640 | **2.6×** |
-
-Negative→positive scaling; **5.8–6.5× over legacy, 2.6–5.3× over RocksDB** at
-4/8/16 threads; p99 51µs @16t (legacy 286µs). It does not reach the 5.78
-memory-mode ceiling: one ordered file still funnels through a single flusher +
-the shared `tail.fetch_add` cacheline + the in-publish `advance` lock (the
-4→16t taper). **holt now beats RocksDB on concurrent durable write** while
-keeping its read/scan/cold edge.
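The reserve / copy / publish scheme described above can be sketched in safe Rust. This is a simplified model under stated assumptions, not the code in `src/journal/ring.rs`: `LogRing`, `Publish`, `append`, and `drain` are hypothetical names, the buffer is a flat `Vec<AtomicU8>` that never wraps (a real ring would wrap and apply backpressure), and the record framing is an assumed `[len: u32 LE][payload]`.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};
use std::sync::Mutex;

/// State guarded by the brief publish lock.
struct Publish {
    committed: usize,                // every byte below this offset is fully written
    pending: BTreeMap<usize, usize>, // finished ranges (start -> end) not yet contiguous
    flushed: usize,                  // prefix already handed to the flusher
}

pub struct LogRing {
    buf: Vec<AtomicU8>,
    tail: AtomicUsize,
    publish: Mutex<Publish>,
}

impl LogRing {
    pub fn new(capacity: usize) -> Self {
        LogRing {
            buf: (0..capacity).map(|_| AtomicU8::new(0)).collect(),
            tail: AtomicUsize::new(0),
            publish: Mutex::new(Publish {
                committed: 0,
                pending: BTreeMap::new(),
                flushed: 0,
            }),
        }
    }

    /// Append one framed record. Returns false once the buffer is exhausted.
    pub fn append(&self, payload: &[u8]) -> bool {
        let len = 4 + payload.len();
        // 1. Reserve: one fetch_add hands out a gap-free byte range.
        let start = self.tail.fetch_add(len, Ordering::Relaxed);
        let end = start + len;
        if end > self.buf.len() {
            return false;
        }
        // 2. Copy: each writer fills its own disjoint range, in parallel.
        let header = (payload.len() as u32).to_le_bytes();
        for (i, b) in header.iter().chain(payload.iter()).enumerate() {
            self.buf[start + i].store(*b, Ordering::Relaxed);
        }
        // 3. Publish: fold contiguous finished ranges into `committed`.
        //    The mutex also orders the byte stores before any later drain.
        let mut p = self.publish.lock().unwrap();
        p.pending.insert(start, end);
        loop {
            let at = p.committed;
            match p.pending.remove(&at) {
                Some(next) => p.committed = next,
                None => break,
            }
        }
        true
    }

    /// Flusher side: take the committed bytes that were not drained yet.
    pub fn drain(&self) -> Vec<u8> {
        let mut p = self.publish.lock().unwrap();
        let out: Vec<u8> = (p.flushed..p.committed)
            .map(|i| self.buf[i].load(Ordering::Relaxed))
            .collect();
        let upto = p.committed;
        p.flushed = upto;
        out
    }
}
```

Keying the publish step on byte offsets rather than on a separate counter mirrors the design point in the text: the byte tiling itself is the order, so a finished record can never be folded in ahead of an unfinished one that precedes it.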
-
-Validated (both arches, ring LIVE in the engine under the feature): 6 ring
-`Journal` contract tests, lib 286, **proptest BTreeMap/WAL-replay oracle 5**,
-**checkpoint_failpoint crash-injection 8**, concurrent_stress 3 (+10× release
-loop 0 flaky), loom gap-safety model, clippy clean. loom also **caught a real
-design bug** (separate work-id counter could disagree with byte order →
-unpublished-gap copy) → keyed on the byte tiling instead.
-
-Hardening completed before making it the sole backend: a multi-process
-**SIGKILL crash-soak** (`examples/wal_crash_soak.rs`, 40 rounds, every recovery
-a contiguous valid prefix; covers the async RAM→page-cache window +
-flusher-mid-drain + mid-checkpoint-truncate), a **2+3-publisher loom** model of
-the `advance` lock (a leaf lock, so no deadlock by construction), and **built-in
-backpressure** (writers park on a `space_cv` the flusher signals). The 1-thread
-~66ms outlier was root-caused (a per-op flusher wake = a channel send on every
-write) and removed → 1t is now 1.12 Mops/s (p99 14.8µs), 4.1× RocksDB.
-
-## Suggested next work (each its own focused session)
-
-- ~~**Lock-free shared WAL ring**~~: **SHIPPED as the sole WAL backend**
-  (legacy removed; see "Write-concurrency bottleneck" above): beats RocksDB
-  **2.8–5.5×** on concurrent durable write at 1/4/8/16 threads, dual-arch
-  validated (proptest oracle + checkpoint_failpoint + 40-round SIGKILL
-  crash-soak + loom). All hardening done.
-- ~~**Optimistic write descent**~~ / ~~**prefix-sharded-forest**~~: both
-  **RULED OUT by measurement**: the root latch (shared or exclusive) is
-  not the bottleneck (no-merge multi-root A/B: 16 roots ≈ 1 root at 16t),
-  so neither helps. Optimistic-descent patch lives in this branch's git history.
-- **R2 (BlobNode prefix Bloom)**: a per-edge Bloom (a Bloom *extent* in
-  the parent, sized ~10 bits/key, not inline) so a negative lookup whose
-  key matches a crossing's path prefix is answered without pinning +
-  reading the 512 KB child. Targets cold-miss / existence checks; the
-  crossing's inline path prefix already filters cross-prefix misses, so
-  the marginal win is within-prefix existence misses. Write maintenance
-  (update the parent edge on insert) is the cost; Bloom = no false
-  negatives, so skipping the child read on a Bloom miss is correctness-safe.
-- **Key-ordered leaf layout for cold scans**: compaction's `clone_subtree`
-  DFS already lays leaves out in key order, so post-compaction/cold scans
-  are already sequential; the remaining hot-scan ~4% is the optimistic
-  restart-safe cursor's per-entry copy (diminishing returns).
-
-## Benchmark reproduction notes
-
-- RocksDB/SQLite comparators need libclang; the x86 box has only
-  `libclang.so.1`, so symlink a shim and point clang-sys at it:
-  `mkdir -p ~/libclang-shim && ln -sf /usr/lib/llvm-18/lib/libclang.so.1 ~/libclang-shim/libclang.so`
-  then `export LIBCLANG_PATH=$HOME/libclang-shim`.
-- Single-thread latency: `cargo bench --manifest-path benches/Cargo.toml --bench main -- --quick --noplot "objstore_persist_(get|put)/(holt|rocksdb)"`
-- Concurrency: `HOLT_CONCURRENT_THREADS=1,4,8,16 HOLT_CONCURRENT_OPS_PER_THREAD=50000 HOLT_CONCURRENT_OPS=put cargo bench --manifest-path benches/Cargo.toml --bench concurrent -- objstore`
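For the R2 item in the suggested-work list, the "no false negatives" property that makes the Bloom safe can be sketched with a plain Bloom filter sized at ~10 bits/key. This is a generic illustration, not a proposed holt layout: `Bloom`, `may_contain`, the double-hashing scheme, and the use of `DefaultHasher` are all assumptions, and nothing here models the per-edge extent or its write maintenance.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

pub struct Bloom {
    bits: Vec<u64>,
    nbits: u64,
    k: u64, // number of probe positions per key
}

impl Bloom {
    /// Size the filter for `expected_keys` at `bits_per_key` (the text suggests ~10).
    pub fn new(expected_keys: usize, bits_per_key: usize) -> Self {
        let nbits = std::cmp::max(64, expected_keys * bits_per_key) as u64;
        // k ≈ bits_per_key * ln 2 minimizes the false-positive rate.
        let k = std::cmp::max(1, (bits_per_key as f64 * 0.693).round() as u64);
        Bloom {
            bits: vec![0; ((nbits + 63) / 64) as usize],
            nbits,
            k,
        }
    }

    /// Two base hashes; probe i uses a + i*b (double hashing).
    fn hashes(key: &[u8]) -> (u64, u64) {
        let mut h1 = DefaultHasher::new();
        key.hash(&mut h1);
        let mut h2 = DefaultHasher::new();
        (key, 0x9e37_79b9u32).hash(&mut h2);
        (h1.finish(), h2.finish() | 1)
    }

    pub fn insert(&mut self, key: &[u8]) {
        let (a, b) = Self::hashes(key);
        for i in 0..self.k {
            let bit = a.wrapping_add(i.wrapping_mul(b)) % self.nbits;
            self.bits[(bit / 64) as usize] |= 1u64 << (bit % 64);
        }
    }

    /// false => the key was definitely never inserted (the child read can be skipped).
    /// true  => the key may be present (the child must still be read).
    pub fn may_contain(&self, key: &[u8]) -> bool {
        let (a, b) = Self::hashes(key);
        (0..self.k).all(|i| {
            let bit = a.wrapping_add(i.wrapping_mul(b)) % self.nbits;
            (self.bits[(bit / 64) as usize] & (1u64 << (bit % 64))) != 0
        })
    }
}
```

A lookup would consult `may_contain` before pinning the child and fall through to the normal read whenever it returns true, so a false positive costs one extra read and never a wrong answer.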