perf: diagnose order-book storage benchmark behavior#122

Merged
div0rce merged 12 commits into main from perf/m47-storage-benchmark-diagnosis
Jun 15, 2026

@div0rce div0rce commented Jun 15, 2026

Milestone

M47 follow-up — Storage benchmark diagnosis (on top of the merged M47 study, PR #119).

Summary

This follow-up diagnoses why the M47 storage modes ranked as they did by adding deterministic
workload variants and non-timed shape characterization. In review, a Codex comment flagged that the
benchmark timed per-run setup as command cost, and chasing it down overturned the headline
conclusion.

run_once constructed a fresh MatchingEngine and applied the RegisterSymbol prefix inside the
timed interval. Book construction is eager, so for the pooled modes that prefix runs
OrderPool/RawPool free-list initialization over 65536 slots per book. With a fresh engine
per replay and only ~5000 commands, that one-time setup was charged to per-command time, amortized
over far too few commands, and scaled with symbol count — so it inflated the intrusive mode
most. (Note: the cost is incurred at register_symbol/book construction, not at the
MatchingEngine constructor as the review supposed — the constructor only stores the storage enum.)

The fix runs engine construction, the registration prefix, and the end-of-run snapshot outside
the timed interval and normalizes over timed commands. With setup excluded, the earlier "intrusive
is ~4–5× slower" result disappears: it was a benchmark-setup artifact, not a per-command property.
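
The timing boundary described above can be illustrated with a minimal sketch. This is not the repository's `run_once`: `ToyBook`, `RunResult`, and `run_once_setup_excluded` are hypothetical names, and the toy book only mimics eager free-list initialization over 65536 slots. The point is the placement of the clock reads: setup before the first read, snapshot after the second, and normalization over the timed command count only.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a pooled book: construction eagerly links a free list.
struct ToyBook {
    static constexpr std::size_t kSlots = 65536;
    std::vector<std::uint32_t> next_free;
    std::uint64_t applied = 0;

    ToyBook() : next_free(kSlots) {
        for (std::size_t i = 0; i < kSlots; ++i) {
            next_free[i] = static_cast<std::uint32_t>(i + 1);
        }
    }
    void apply(std::uint64_t cmd) { applied += cmd; }
};

struct RunResult {
    double ns_per_cmd;
    std::size_t timed_commands;
    std::uint64_t checksum;
};

RunResult run_once_setup_excluded(std::size_t symbols,
                                  const std::vector<std::uint64_t>& trading) {
    using Clock = std::chrono::steady_clock;
    if (symbols == 0 || trading.empty()) {
        return {0.0, 0, 0};  // nothing to time
    }

    // Untimed: per-run setup (stands in for engine construction + registration).
    std::vector<ToyBook> books(symbols);

    // Timed: only the trading commands.
    const auto start = Clock::now();
    for (std::size_t i = 0; i < trading.size(); ++i) {
        books[i % symbols].apply(trading[i]);
    }
    const auto stop = Clock::now();

    // Untimed: end-of-run readout (stands in for the snapshot).
    std::uint64_t checksum = 0;
    for (const ToyBook& b : books) checksum += b.applied;

    const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    const std::size_t n = trading.size();
    return {ns / static_cast<double>(n), n, checksum};
}
```

Moving the `books` construction below the first `Clock::now()` call would reproduce the contaminated measurement, since the free-list loop would then be divided across the timed commands.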

⚠️ This overturns the intrusive-slowness reading carried into the merged M47 docs. The merged
PR #119 description is immutable, but the live docs/pool_backed_storage.md interpretation is
corrected here so the repo is not self-contradicting.

Corrected results (engine-level synthetic; aarch64 / Linux / g++ 13.3 / Release; results/pool_backed_storage.txt)

Median ns per timed command (per-run setup excluded), lower is better. Single-machine,
hardware/compiler/build-dependent — not a production-throughput claim.

| Workload | baseline | pooled pmr | intrusive | contiguous | was (intrusive, contaminated) |
|---|---|---|---|---|---|
| general (4 sym) | 111.0 | 121.4 | 95.4 | **93.2** | 486.5 |
| dense (2 sym) | 96.4 | 88.3 | **66.0** | 70.7 | 196.5 |
| sparse (4 sym) | 81.0 | 72.1 | **48.2** | 60.9 | 426.8 |
| cancel/modify (3 sym) | 59.7 | 59.8 | 44.3 | **42.8** | 283.1 |
| match/traversal (1 sym) | 109.3 | 117.9 | 87.2 | **69.9** | 126.5 |

(bold = fastest mode in that row; Source digest: sha256:b606452b…, Dirty inputs: no.)

Honest reading: with per-run setup excluded the four modes cluster into a tight ~40–120 ns/cmd
band (vs the old 40–486 spread). Intrusive and contiguous are the two fastest and trade the lead by
workload shape; baseline/PMR sit behind. This does not mean "intrusive won" — it still pays a
large fixed init cost (pre-allocating 65536 slots/book) that this per-command metric deliberately
excludes and that only amortizes over a long engine lifetime. The contiguous fixed-band caveat
([1,1024]) still holds.

Verification of the fix

A controlled macOS before/after on the same host isolates the effect: intrusive dropped 45–92
ns/cmd while baseline/PMR/contiguous stayed within noise, and the drop scales with symbol count
(4-symbol flows ≈80–92, 2-symbol ≈45) — exactly the "N books × 2 × 65536 free-list init removed"
signature.
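
The scaling argument can be checked with a few lines of arithmetic. The sketch below is an illustration, not harness code: `setup_slots` and `slots_per_command` are hypothetical helpers, the factor of 2 is taken from the "N books × 2 × 65536" term as written, and 5000 stands in for the approximate per-replay command count quoted earlier.

```cpp
#include <cstddef>

// Slots initialized per run: books x 2 x slots (the factor of 2 is taken
// from the text as-is; what it counts is not spelled out there).
constexpr std::size_t setup_slots(std::size_t books, std::size_t slots = 65536) {
    return books * 2 * slots;
}

// Setup work charged to each timed command when a fresh engine is built per replay.
constexpr double slots_per_command(std::size_t books, std::size_t commands) {
    return commands == 0
               ? 0.0
               : static_cast<double>(setup_slots(books)) / static_cast<double>(commands);
}
```

Under these assumptions a 4-symbol flow carries about 105 slot initializations per command and a 2-symbol flow about 52. The observed macOS drops (≈80–92 and ≈45 ns/cmd) then both imply a little under 1 ns per slot, which is what a fixed per-slot init cost would predict.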

Definition of Done

  • Artifact generated by committed scripts with full metadata; regenerated in Docker Linux.
  • Provenance intact: digest sha256:b606452b…, Dirty inputs: no, informational commit cf0396f.
  • Second review finding (same root cause) closed: the non-timed characterize pass now observes
    the same post-registration trading range the timed rows measure, sharing one registration-prefix
    boundary and the same should_probe predicate — so the shape line's commands/top_probe_calls
    match the per-run cmds/probes/run instead of counting the registration prefix.
  • Docs state the methodology change and the corrected interpretation; the stale "intrusive slow"
    ranking is removed and explained as an artifact.
  • Negative/neutral framing preserved; no speedup overclaim (intrusive's excluded fixed cost is stated).
  • All-mode equivalence regression green; order_book.cpp change behavior-preserving.
  • make check / make asan pass; PROGRESS.md updated.
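
The shared-boundary item in the list above can be sketched as follows. The names `registration_prefix_len` and `should_probe` come from the PR text, but their bodies here are assumptions, as are `Kind`, `Command`, `Shape`, and `count_range`. The sketch assumes the registration commands form a leading run, that every command is a probe point, and that each probe point costs 4 top-of-book calls (2 symbols × bid/ask).

```cpp
#include <cstddef>
#include <vector>

enum class Kind { RegisterSymbol, Trade };  // hypothetical command tags

struct Command {
    Kind kind;
};

// Length of the leading run of registration commands (assumed layout).
std::size_t registration_prefix_len(const std::vector<Command>& cmds) {
    std::size_t n = 0;
    while (n < cmds.size() && cmds[n].kind == Kind::RegisterSymbol) ++n;
    return n;
}

// Hypothetical predicate: treat every command index as a probe point.
bool should_probe(std::size_t /*index*/) { return true; }

struct Shape {
    std::size_t commands;
    std::size_t top_probe_calls;
};

// Count commands and probes over [first, cmds.size()) with one predicate,
// so the shape pass and the timed path cannot drift apart.
Shape count_range(const std::vector<Command>& cmds, std::size_t first,
                  std::size_t probes_per_point) {
    Shape s{0, 0};
    for (std::size_t i = first; i < cmds.size(); ++i) {
        ++s.commands;
        if (should_probe(i)) s.top_probe_calls += probes_per_point;
    }
    return s;
}
```

With 2 registration commands followed by 5002 trading commands (a count inferred from 20008 / 4, not stated in the PR), counting from index 0 gives 20016 probe calls while counting from `registration_prefix_len` gives 20008, the same mismatch the review reported.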

Tests

make check   # native macOS: 232/232 passed (Docker Linux: 240/240; delta is Linux-only epoll/socket tests)
make asan    # native macOS: 232/232 passed, 0 failed
make bench-storage  # regenerated in Docker Ubuntu 24.04 (g++ 13.3); Dirty inputs: no
CodeScene analyze_change_set (base main): quality_gates = passed, no degraded files

Notes / limitations

coderabbitai Bot commented Jun 15, 2026

Walkthrough

This PR diagnoses a benchmarking artifact in the M47 storage study by excluding engine construction and symbol-registration from the timed interval. It replaces the single engine_flow benchmark with a five-workload framework (WorkloadBuilder, characterization, time_storage), refactors IntrusiveStore cancel/modify/rest paths, adds a cross-storage-mode equivalence regression test, and regenerates results and documentation.

Changes

M47 Storage Benchmark Diagnosis

| Layer / File(s) | Summary |
|---|---|
| IntrusiveStore resting-order refactor (src/engine/order_book.cpp) | Introduces RestingNode and Index alias; adds append_resting (with index rollback) and erase_indexed_order helpers; rewires rest, cancel, and modify to use them; adds early-capacity check to can_store_limit. |
| Storage-mode equivalence regression test (tests/unit/test_matching_engine.cpp) | Adds benchmark_mix_flow, BenchmarkMixCoverage, and run_flow_with_storage; inserts a Catch2 test asserting all four OrderBook::Storage modes emit identical events and snapshots for the benchmark command mix. |
| Benchmark data structures, WorkloadBuilder, and characterization (benchmarks/bench_storage.cpp) | Defines LimitSpec, Workload, WorkloadShape, RunSummary, and Timing types; implements WorkloadBuilder for deterministic replay-command construction with active-order tracking; adds characterize pipeline computing command/event tallies, price-domain metrics, and top-of-book probe cadence. |
| Benchmark execution pipeline and workload generators (benchmarks/bench_storage.cpp) | Implements probe_top_of_book, apply_trading (timed path), finalize_run (untimed), run_once, and time_storage (warmup + ns/cmd sampling returning median/min/max); adds benchmark_workload coordinator; adds five named workload generators and rewrites run_storage_benchmarks. |
| Script metadata, regenerated results, and documentation (scripts/run_storage_benchmarks.sh, results/pool_backed_storage.txt, docs/pool_backed_storage.md, docs/benchmarking.md) | Updates shell-script header strings for timing scope and caveat; regenerates results with per-workload shape sections and per-mode median/min/max; expands docs with Command Hot Path comparison table, five-workload methodology, corrected-artifact section, and updated limitations. |
| Milestone and progress tracking (MILESTONES.md, PROGRESS.md) | Marks M47 as merged; updates current-state block, milestone table row, 2026-06-15 decision-log entry, and next-action guidance to reflect the active perf/m47-storage-benchmark-diagnosis branch and PR #122. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • div0rce/quant-systems-lab#112: Directly overlaps with this PR's IntrusiveStore resting-node/index helper work and can_store_limit capacity-guarding logic in the same file and behaviors.

Suggested reviewers

  • codescene-delta-analysis

Poem

🐇 Hoppity-hop through the order book tree,
Five workloads now timed with precision and glee!
The setup excluded, the nanoseconds fair,
append_resting rolls back with delicate care.
Median, min, max — the benchmarks align,
This bunny says: corrected results are divine! 🎉

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'perf: diagnose order-book storage benchmark behavior' accurately describes the main change: a diagnostic follow-up to M47 that investigates and corrects storage benchmark methodology and results. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description comprehensively covers all required template sections: Milestone, Summary, Definition of Done (with completed checklist items), Tests, and Notes/limitations, providing thorough context about the M47 follow-up diagnostic work. |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a42e7caf93

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread benchmarks/bench_storage.cpp Outdated
div0rce added 2 commits June 15, 2026 14:29
The storage benchmark timed run_once in full, including MatchingEngine
construction and the RegisterSymbol prefix. Book construction is eager,
so for the pooled modes that prefix runs OrderPool/RawPool free-list
initialization over 65536 slots per book -- a fixed per-run setup cost
charged to per-command time and amortized over only ~5k commands. With
2-4 books that setup dominated the intrusive-pool numbers.

Apply registration (and the end-of-run snapshot readout) outside the
timed interval and normalize over the timed command count, so each row
reflects per-command work rather than per-run pool initialization.

Addresses the PR #122 review finding that per-run setup leaked into the
timed storage rows.

Install the Docker-Linux artifact regenerated with per-run setup excluded
from timing (digest 81ff7430..., Dirty inputs: no, informational commit
476ba71) and rewrite the interpretation. The earlier "intrusive is the
slow outlier / PMR fastest" ranking was dominated by per-run pool
free-list initialization timed as command cost; with setup excluded the
four modes cluster into a tight ~40-120 ns/cmd band and intrusive and
contiguous are the two fastest, trading the lead by workload shape.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a396af2995

Comment thread benchmarks/bench_storage.cpp Outdated
div0rce added 2 commits June 15, 2026 15:05
The shape pass (characterize) walked the full command stream and derived
top_probe_calls from a formula on the total command count, while the timed
path measures only the post-registration trading commands. The shape line
therefore reported top_probe_calls=20016 against probes/run=20008 (the 2
registration commands x 2 symbols x bid/ask).

Share one registration-prefix boundary (registration_prefix_len) between
characterize and the timed path, and count probes with the same
should_probe predicate over the same range. The shape line now describes
exactly the timed trading workload -- commands and top_probe_calls match
the per-run rows -- closing the registration-prefix accounting class at the
source rather than patching one formula.

Addresses the second PR #122 review finding (same root cause as the first).

Regenerate the Docker-Linux artifact after the characterize fix (digest
e12d1416..., Dirty inputs: no, informational commit d3ed253) so the shape
line's commands/top_probe_calls match the per-run cmds/probes/run, and
rebuild the doc table from that single artifact. The timed path was
unchanged, so medians moved only within noise and the ranking is identical:
the four modes cluster into a tight ~40-120 ns/cmd band with intrusive and
contiguous fastest, trading the lead by workload shape.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/bench_storage.cpp`:
- Around line 369-393: The time_storage function can fail when reps is zero
(leaving samples empty) or when timed_commands is zero (causing division by
zero). Add explicit guards in the function: first check if reps is zero and
handle appropriately before entering the loop, then within the loop check if
timed_commands is zero before performing the division operation (ns /
timed_commands), and finally before accessing samples vector elements
(samples[samples.size() / 2], samples.front(), samples.back()), verify that
samples is not empty. These guards will prevent undefined behavior and crashes
on edge case inputs.

In `@docs/benchmarking.md`:
- Around line 113-117: In the paragraph describing the storage experiment that
includes variants like PMR-backed container-node allocation, intrusive
OrderPool-backed resting-order nodes, and M47 fixed-band contiguous
direct-price-indexed storage mode, the phrase on line 117 about timing being
over "full workload replays" is imprecise. Reword this portion to explicitly
clarify that the timing measurements are specifically for the post-registration
command path executed per replay, not the entire workload replay, to align with
the actual harness behavior and maintain precision in performance documentation
as required by the coding guidelines.
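
The guards requested for time_storage might look like the following sketch. It is not the harness code: `summarize` is a hypothetical helper and this `Timing` struct is a simplified stand-in for the type named in the walkthrough. It assumes per-run nanosecond totals are collected first and that the timed-command count is the same for every rep.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Simplified stand-in for the benchmark's Timing type (ns per command).
struct Timing {
    double median;
    double min;
    double max;
};

// Returns std::nullopt instead of dividing by zero or indexing an empty vector.
std::optional<Timing> summarize(const std::vector<double>& run_ns,
                                std::size_t timed_commands) {
    if (run_ns.empty() || timed_commands == 0) {
        return std::nullopt;  // reps == 0, or no post-registration commands
    }
    std::vector<double> samples;
    samples.reserve(run_ns.size());
    for (double ns : run_ns) {
        samples.push_back(ns / static_cast<double>(timed_commands));
    }
    std::sort(samples.begin(), samples.end());
    // Upper median for even sizes, matching samples[samples.size() / 2].
    return Timing{samples[samples.size() / 2], samples.front(), samples.back()};
}
```

Checking both conditions once, before the sampling loop, keeps the guard out of the per-rep path and makes the empty case explicit to the caller.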

📥 Commits

Reviewing files that changed from the base of the PR and between 93d5062 and a42dcc4.

📒 Files selected for processing (9)
  • MILESTONES.md
  • PROGRESS.md
  • benchmarks/bench_storage.cpp
  • docs/benchmarking.md
  • docs/pool_backed_storage.md
  • results/pool_backed_storage.txt
  • scripts/run_storage_benchmarks.sh
  • src/engine/order_book.cpp
  • tests/unit/test_matching_engine.cpp

Comment thread benchmarks/bench_storage.cpp
Comment thread docs/benchmarking.md Outdated
div0rce added 2 commits June 15, 2026 15:40
time_storage now resolves the timed-command count once (it is identical
across reps) and returns early when reps == 0 or a workload has no
post-registration commands, avoiding empty-vector indexing and a
divide-by-zero in the sampling math. docs/benchmarking.md no longer says
timing covers "full workload replays" -- it times only the post-registration
command path, matching the storage-doc and the harness.

Addresses the CodeRabbit review comments on PR #122.

Regenerate the Docker-Linux artifact (digest b606452b..., Dirty inputs: no,
informational commit cf0396f) after the time_storage guard/hoist and rebuild
the doc table from that single artifact. The timed-path logic is unchanged,
so medians moved only within noise and the four-mode ranking is identical.

@codescene-delta-analysis codescene-delta-analysis Bot left a comment

Code Health Improved (1 files improve in Code Health)

Gates Passed
6 Quality Gates Passed

| File | Code Health Impact | Categories Improved |
|---|---|---|
| bench_storage.cpp | 9.69 → 10.00 | Excess Number of Function Arguments |

Quality Gate Profile: Pay Down Tech Debt

@div0rce div0rce merged commit 548cb68 into main Jun 15, 2026
8 checks passed
@div0rce div0rce deleted the perf/m47-storage-benchmark-diagnosis branch June 16, 2026 17:34