perf: diagnose order-book storage benchmark behavior by div0rce · Pull Request #122 · div0rce/quant-systems-lab

div0rce · 2026-06-15T18:01:51Z

Milestone

M47 follow-up — Storage benchmark diagnosis (on top of the merged M47 study, PR #119).

Summary

This follow-up diagnoses why the M47 storage modes ranked as they did by adding deterministic
workload variants and non-timed shape characterization. In review, a Codex comment flagged that the
benchmark timed per-run setup as command cost — and chasing it down overturned the headline
conclusion.

run_once constructed a fresh MatchingEngine and applied the RegisterSymbol prefix inside the
timed interval. Book construction is eager, so for the pooled modes that prefix runs
OrderPool/RawPool free-list initialization over 65536 slots per book. With a fresh engine
per replay and only ~5000 commands, that one-time setup was charged to per-command time, amortized
over far too few commands, and scaled with symbol count — so it inflated the intrusive mode
most. (Note: the cost is incurred at register_symbol/book construction, not at the
MatchingEngine constructor as the review supposed — the constructor only stores the storage enum.)

The fix runs engine construction, the registration prefix, and the end-of-run snapshot outside
the timed interval and normalizes over timed commands. With setup excluded, the earlier "intrusive
is ~4–5× slower" result disappears — it was a benchmark-setup artifact, not a per-command property.
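
The timing boundary described above can be sketched as follows. This is a minimal illustration, not the harness in `benchmarks/bench_storage.cpp`: `ToyEngine`, `RunResult`, and `run_once_setup_excluded` are hypothetical names, and the toy registration loop only stands in for the eager free-list initialization.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real engine: registration is deliberately
// expensive (it writes every pool slot), trading commands are cheap.
struct ToyEngine {
    std::vector<std::uint32_t> free_list;
    std::uint64_t checksum = 0;

    void register_symbol(std::size_t slots = 65536) {
        // Eager free-list initialization: O(slots) work per registered book.
        const std::size_t base = free_list.size();
        free_list.resize(base + slots);
        for (std::size_t i = 0; i < slots; ++i) {
            free_list[base + i] = static_cast<std::uint32_t>(i + 1);
        }
    }

    void apply(std::uint64_t command) { checksum += command * 2654435761u; }
};

struct RunResult {
    double ns_per_command;   // covers timed trading commands only
    std::uint64_t checksum;  // read back after the clock stops
};

// Construction, registration, and readout sit outside the timed interval;
// only the trading commands run between the two clock reads.
inline RunResult run_once_setup_excluded(std::size_t symbols,
                                         const std::vector<std::uint64_t>& trading) {
    using Clock = std::chrono::steady_clock;

    ToyEngine engine;                                // untimed: construction
    for (std::size_t s = 0; s < symbols; ++s) {
        engine.register_symbol();                    // untimed: registration prefix
    }

    const auto start = Clock::now();
    for (std::uint64_t command : trading) {
        engine.apply(command);                       // timed: per-command work
    }
    const auto stop = Clock::now();

    const std::uint64_t checksum = engine.checksum;  // untimed: end-of-run readout
    const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    const double per_cmd =
        trading.empty() ? 0.0 : ns / static_cast<double>(trading.size());
    return {per_cmd, checksum};
}
```

Calling `run_once_setup_excluded(4, commands)` with about 5000 commands normalizes only over those commands, so the cost of the four registrations no longer appears in `ns_per_command`.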

⚠️ This overturns the intrusive-slowness reading carried into the merged M47 docs. The merged
PR #119 description is immutable, but the live docs/pool_backed_storage.md interpretation is
corrected here so the repo is not self-contradicting.

Corrected results (engine-level synthetic; aarch64 / Linux / g++ 13.3 / Release; `results/pool_backed_storage.txt`)

Median ns per timed command (per-run setup excluded), lower is better. Single-machine,
hardware/compiler/build-dependent — not a production-throughput claim.

| Workload | baseline | pooled pmr | intrusive | contiguous | intrusive before fix (contaminated) |
| --- | --- | --- | --- | --- | --- |
| general (4 sym) | 111.0 | 121.4 | 95.4 | **93.2** | 486.5 |
| dense (2 sym) | 96.4 | 88.3 | **66.0** | 70.7 | 196.5 |
| sparse (4 sym) | 81.0 | 72.1 | **48.2** | 60.9 | 426.8 |
| cancel/modify (3 sym) | 59.7 | 59.8 | 44.3 | **42.8** | 283.1 |
| match/traversal (1 sym) | 109.3 | 117.9 | 87.2 | **69.9** | 126.5 |

(bold = fastest mode in that row; Source digest: sha256:b606452b…, Dirty inputs: no.)

Honest reading: with per-run setup excluded the four modes cluster into a tight ~40–120 ns/cmd
band (vs the old 40–486 spread). Intrusive and contiguous are the two fastest and trade the lead by
workload shape; baseline/PMR sit behind. This does not mean "intrusive won" — it still pays a
large fixed init cost (pre-allocating 65536 slots/book) that this per-command metric deliberately
excludes and that only amortizes over a long engine lifetime. The contiguous fixed-band caveat
([1,1024]) still holds.

Verification of the fix

A controlled macOS before/after on the same host isolates the effect: intrusive dropped 45–92
ns/cmd while baseline/PMR/contiguous stayed within noise, and the drop scales with symbol count
(4-symbol flows ≈80–92, 2-symbol ≈45) — exactly the "N books × 2 × 65536 free-list init removed"
signature.
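
As a back-of-envelope check of that signature, the arithmetic can be written out. This is a sketch under stated assumptions: the "× 2" is read here as two pools per book, the timed command count is taken as the approximate 5000 quoted in the summary, and the whole per-command drop is attributed to slot initialization. The helper names are hypothetical.

```cpp
#include <cstddef>

// Assumptions taken from the PR text: 65536 slots per pool, two pools per
// book (one reading of the "x 2" in the quoted signature).
constexpr std::size_t kSlotsPerPool = 65536;
constexpr std::size_t kPoolsPerBook = 2;

// Slots initialized per run for a given number of registered books.
constexpr std::size_t slots_initialized(std::size_t books) {
    return books * kPoolsPerBook * kSlotsPerPool;
}

// Implied init cost per slot if the entire per-command drop was setup cost
// spread over `commands` timed commands.
constexpr double implied_ns_per_slot(double drop_ns_per_cmd, std::size_t commands,
                                     std::size_t books) {
    return drop_ns_per_cmd * static_cast<double>(commands) /
           static_cast<double>(slots_initialized(books));
}
```

With these assumptions, `implied_ns_per_slot(45.0, 5000, 2)` and `implied_ns_per_slot(80.0, 5000, 4)` through `implied_ns_per_slot(92.0, 5000, 4)` all land at roughly 0.75 to 0.9 ns per slot, which is what a cost proportional to books × slots would look like.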

Definition of Done

- [x] Artifact generated by committed scripts with full metadata; regenerated in Docker Linux.
- [x] Provenance intact: digest sha256:b606452b…, Dirty inputs: no, informational commit cf0396f.
- [x] Second review finding (same root cause) closed: the non-timed characterize pass now observes the same post-registration trading range the timed rows measure, sharing one registration-prefix boundary and the same should_probe predicate, so the shape line's commands/top_probe_calls match the per-run cmds/probes/run instead of counting the registration prefix.
- [x] Docs state the methodology change and the corrected interpretation; the stale "intrusive slow" ranking is removed and explained as an artifact.
- [x] Negative/neutral framing preserved; no speedup overclaim (intrusive's excluded fixed cost is stated).
- [x] All-mode equivalence regression green; order_book.cpp change behavior-preserving.
- [x] make check / make asan pass; PROGRESS.md updated.

Tests

make check   # native macOS: 232/232 passed (Docker Linux: 240/240; delta is Linux-only epoll/socket tests)
make asan    # native macOS: 232/232 passed, 0 failed
make bench-storage  # regenerated in Docker Ubuntu 24.04 (g++ 13.3); Dirty inputs: no
CodeScene analyze_change_set (base main): quality_gates = passed, no degraded files

Notes / limitations

Engine-level synthetic benchmark only (single process, Release, no network/disk).
Workloads keep resting prices in the contiguous mode's [1,1024] band.
Production order-book storage default is unchanged; this is diagnostic evidence.
Does not address issue #90 (M29 follow-up: generate full Linux hardware PMU perf artifacts; real hardware PMU evidence) or issue #94 (External technical review request).

coderabbitai · 2026-06-15T18:02:04Z

📝 Walkthrough

Walkthrough

This PR diagnoses a benchmarking artifact in the M47 storage study by excluding engine construction and symbol-registration from the timed interval. It replaces the single engine_flow benchmark with a five-workload framework (WorkloadBuilder, characterization, time_storage), refactors IntrusiveStore cancel/modify/rest paths, adds a cross-storage-mode equivalence regression test, and regenerates results and documentation.

Changes

M47 Storage Benchmark Diagnosis

Layer / File(s)	Summary
IntrusiveStore resting-order refactor `src/engine/order_book.cpp`	Introduces `RestingNode` and `Index` alias; adds `append_resting` (with index rollback) and `erase_indexed_order` helpers; rewires `rest`, `cancel`, and `modify` to use them; adds early-capacity check to `can_store_limit`.
Storage-mode equivalence regression test `tests/unit/test_matching_engine.cpp`	Adds `benchmark_mix_flow`, `BenchmarkMixCoverage`, and `run_flow_with_storage`; inserts a Catch2 test asserting all four `OrderBook::Storage` modes emit identical events and snapshots for the benchmark command mix.
Benchmark data structures, WorkloadBuilder, and characterization `benchmarks/bench_storage.cpp`	Defines `LimitSpec`, `Workload`, `WorkloadShape`, `RunSummary`, and `Timing` types; implements `WorkloadBuilder` for deterministic replay-command construction with active-order tracking; adds `characterize` pipeline computing command/event tallies, price-domain metrics, and top-of-book probe cadence.
Benchmark execution pipeline and workload generators `benchmarks/bench_storage.cpp`	Implements `probe_top_of_book`, `apply_trading` (timed path), `finalize_run` (untimed), `run_once`, and `time_storage` (warmup + ns/cmd sampling returning median/min/max); adds `benchmark_workload` coordinator; adds five named workload generators and rewrites `run_storage_benchmarks`.
Script metadata, regenerated results, and documentation `scripts/run_storage_benchmarks.sh`, `results/pool_backed_storage.txt`, `docs/pool_backed_storage.md`, `docs/benchmarking.md`	Updates shell-script header strings for timing scope and caveat; regenerates results with per-workload shape sections and per-mode median/min/max; expands docs with Command Hot Path comparison table, five-workload methodology, corrected-artifact section, and updated limitations.
Milestone and progress tracking `MILESTONES.md`, `PROGRESS.md`	Marks M47 as merged; updates current-state block, milestone table row, 2026-06-15 decision-log entry, and next-action guidance to reflect the active `perf/m47-storage-benchmark-diagnosis` branch and PR `#122`.
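
The "append with index rollback" idea named in the first row can be illustrated in isolation. The sketch below is not the code in `src/engine/order_book.cpp`: `ToyStore`, its members, and the capacity check are hypothetical, and it uses standard containers rather than an intrusive pool. It only shows the pattern of reserving an index entry first and undoing it if the append cannot complete.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <list>
#include <unordered_map>

using OrderId = std::uint64_t;

// Hypothetical store: a FIFO queue of resting orders plus an id -> node index.
struct ToyStore {
    std::size_t capacity = 0;
    std::list<OrderId> queue;
    std::unordered_map<OrderId, std::list<OrderId>::iterator> index;

    // Reserve the index entry first, then append; if the append cannot
    // happen, erase the reserved entry so index and queue stay in sync.
    bool append_resting(OrderId id) {
        auto reserved = index.emplace(id, queue.end());
        if (!reserved.second) {
            return false;  // duplicate id: nothing was changed
        }
        if (queue.size() >= capacity) {
            index.erase(reserved.first);  // rollback of the reservation
            return false;
        }
        queue.push_back(id);
        reserved.first->second = std::prev(queue.end());
        return true;
    }

    // Remove an order through the index in O(1) average time.
    bool erase_indexed_order(OrderId id) {
        auto it = index.find(id);
        if (it == index.end()) {
            return false;
        }
        queue.erase(it->second);
        index.erase(it);
        return true;
    }
};
```

The point of the rollback is that a failed append leaves no orphaned index entry, so a later cancel or modify never follows an index entry to a node that was never stored.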

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

div0rce/quant-systems-lab#112: Directly overlaps with this PR's IntrusiveStore resting-node/index helper work and can_store_limit capacity-guarding logic in the same file and behaviors.

Suggested reviewers

codescene-delta-analysis

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'perf: diagnose order-book storage benchmark behavior' accurately describes the main change—a diagnostic follow-up to M47 that investigates and corrects storage benchmark methodology and results.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request description comprehensively covers all required template sections: Milestone, Summary, Definition of Done (with completed checklist items), Tests, and Notes/limitations, providing thorough context about the M47 follow-up diagnostic work.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a42e7caf93

The storage benchmark timed run_once in full, including MatchingEngine construction and the RegisterSymbol prefix. Book construction is eager, so for the pooled modes that prefix runs OrderPool/RawPool free-list initialization over 65536 slots per book -- a fixed per-run setup cost charged to per-command time and amortized over only ~5k commands. With 2-4 books that setup dominated the intrusive-pool numbers. Apply registration (and the end-of-run snapshot readout) outside the timed interval and normalize over the timed command count, so each row reflects per-command work rather than per-run pool initialization. Addresses the PR #122 review finding that per-run setup leaked into the timed storage rows.

Install the Docker-Linux artifact regenerated with per-run setup excluded from timing (digest 81ff7430..., Dirty inputs: no, informational commit 476ba71) and rewrite the interpretation. The earlier "intrusive is the slow outlier / PMR fastest" ranking was dominated by per-run pool free-list initialization timed as command cost; with setup excluded the four modes cluster into a tight ~40-120 ns/cmd band and intrusive and contiguous are the two fastest, trading the lead by workload shape.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a396af2995

The shape pass (characterize) walked the full command stream and derived top_probe_calls from a formula on the total command count, while the timed path measures only the post-registration trading commands. The shape line therefore reported top_probe_calls=20016 against probes/run=20008; the 8 extra calls are the 2 registration commands x 2 symbols x bid/ask. Share one registration-prefix boundary (registration_prefix_len) between characterize and the timed path, and count probes with the same should_probe predicate over the same range. The shape line now describes exactly the timed trading workload (commands and top_probe_calls match the per-run rows), closing the registration-prefix accounting class at the source rather than patching one formula. Addresses the second PR #122 review finding (same root cause as the first).

Regenerate the Docker-Linux artifact after the characterize fix (digest e12d1416..., Dirty inputs: no, informational commit d3ed253) so the shape line's commands/top_probe_calls match the per-run cmds/probes/run, and rebuild the doc table from that single artifact. The timed path was unchanged, so medians moved only within noise and the ranking is identical: the four modes cluster into a tight ~40-120 ns/cmd band with intrusive and contiguous fastest, trading the lead by workload shape.
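
The shared-boundary idea from the characterize fix can be sketched as below. The names `registration_prefix_len` and `should_probe` come from the commit message, but their signatures here are assumptions, as are the `Command`, `Kind`, and `Shape` types and the interval-based probe cadence; the real predicate is not shown in this PR page.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical command model: only the kind matters for this sketch.
enum class Kind { RegisterSymbol, Limit, Cancel };
struct Command { Kind kind; };

// Single source of truth for where the trading range starts: the number of
// leading RegisterSymbol commands.
inline std::size_t registration_prefix_len(const std::vector<Command>& cmds) {
    std::size_t n = 0;
    while (n < cmds.size() && cmds[n].kind == Kind::RegisterSymbol) {
        ++n;
    }
    return n;
}

// Assumed cadence: probe every `interval`-th trading command.
inline bool should_probe(std::size_t trading_index, std::size_t interval) {
    return interval != 0 && trading_index % interval == 0;
}

struct Shape {
    std::size_t commands;         // trading commands only
    std::size_t top_probe_calls;  // bid + ask per symbol per probed command
};

// Both the shape pass and the timed path would walk [prefix, end) and apply
// the same predicate, so their counts cannot drift apart.
inline Shape trading_shape(const std::vector<Command>& cmds, std::size_t symbols,
                           std::size_t interval) {
    const std::size_t first = registration_prefix_len(cmds);
    Shape shape{cmds.size() - first, 0};
    for (std::size_t i = first; i < cmds.size(); ++i) {
        if (should_probe(i - first, interval)) {
            shape.top_probe_calls += symbols * 2;
        }
    }
    return shape;
}
```

For a stream of 2 registrations followed by 5002 trading commands, `trading_shape(cmds, 2, 1)` counts 20008 probe calls, while applying the same per-command factor to all 5004 commands would give 20016. That workload size is inferred from the two figures quoted above rather than stated in the PR.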

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/bench_storage.cpp`:
- Around line 369-393: The time_storage function can fail when reps is zero
(leaving samples empty) or when timed_commands is zero (causing division by
zero). Add explicit guards in the function: first check if reps is zero and
handle appropriately before entering the loop, then within the loop check if
timed_commands is zero before performing the division operation (ns /
timed_commands), and finally before accessing samples vector elements
(samples[samples.size() / 2], samples.front(), samples.back()), verify that
samples is not empty. These guards will prevent undefined behavior and crashes
on edge case inputs.

In `@docs/benchmarking.md`:
- Around line 113-117: In the paragraph describing the storage experiment that
includes variants like PMR-backed container-node allocation, intrusive
OrderPool-backed resting-order nodes, and M47 fixed-band contiguous
direct-price-indexed storage mode, the phrase on line 117 about timing being
over "full workload replays" is imprecise. Reword this portion to explicitly
clarify that the timing measurements are specifically for the post-registration
command path executed per replay, not the entire workload replay, to align with
the actual harness behavior and maintain precision in performance documentation
as required by the coding guidelines.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b6b270b7-07a7-4895-a9d7-905cfe43c960

📥 Commits

Reviewing files that changed from the base of the PR and between 93d5062 and a42dcc4.

📒 Files selected for processing (9)

MILESTONES.md
PROGRESS.md
benchmarks/bench_storage.cpp
docs/benchmarking.md
docs/pool_backed_storage.md
results/pool_backed_storage.txt
scripts/run_storage_benchmarks.sh
src/engine/order_book.cpp
tests/unit/test_matching_engine.cpp

time_storage now resolves the timed-command count once (it is identical across reps) and returns early when reps == 0 or a workload has no post-registration commands, avoiding empty-vector indexing and a divide-by-zero in the sampling math. docs/benchmarking.md no longer says timing covers "full workload replays"; it now says timing covers only the post-registration command path, matching the storage doc and the harness. Addresses the CodeRabbit review comments on PR #122.
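
The guards described above can be sketched like this. It is an illustration only: `summarize_samples`, the `valid` flag, and the `run_ns` callback are hypothetical, and the PR does not show what the real `time_storage` returns on the early-exit path. The `samples[samples.size() / 2]` median follows the indexing quoted in the review comment.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

struct Timing {
    bool valid;     // false when there was nothing to measure
    double median;  // ns per timed command
    double min;
    double max;
};

// `run_ns` is a hypothetical callback that performs one replay and returns
// the elapsed nanoseconds of its timed section.
inline Timing summarize_samples(std::size_t reps, std::size_t timed_commands,
                                const std::function<double()>& run_ns) {
    // Guard both edge cases before any division or vector indexing.
    if (reps == 0 || timed_commands == 0) {
        return Timing{false, 0.0, 0.0, 0.0};
    }

    std::vector<double> samples;
    samples.reserve(reps);
    for (std::size_t i = 0; i < reps; ++i) {
        samples.push_back(run_ns() / static_cast<double>(timed_commands));
    }

    // Non-empty by construction, so the index and front/back reads are safe.
    std::sort(samples.begin(), samples.end());
    return Timing{true, samples[samples.size() / 2], samples.front(), samples.back()};
}
```

Resolving `timed_commands` once, outside the loop, is what lets a single early return cover both the empty-vector and the divide-by-zero cases.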

Regenerate the Docker-Linux artifact (digest b606452b..., Dirty inputs: no, informational commit cf0396f) after the time_storage guard/hoist and rebuild the doc table from that single artifact. The timed-path logic is unchanged, so medians moved only within noise and the four-mode ranking is identical.

codescene-delta-analysis

Code Health Improved (1 files improve in Code Health)

Gates Passed
6 Quality Gates Passed

View Improvements

File	Code Health Impact	Categories Improved
bench_storage.cpp	9.69 → 10.00	Excess Number of Function Arguments

Quality Gate Profile: Pay Down Tech Debt

div0rce added 5 commits June 15, 2026 11:37

perf: expand storage benchmark diagnostics

4ea839e

refactor: simplify storage benchmark diagnostics

2a2d53f

docs: record storage benchmark diagnosis

13f8dcc

perf: trim intrusive storage lookup overhead

2c8748c

docs: refresh storage diagnosis artifact

bca7c4d

docs: sync progress for follow-up PR #122

a42e7ca

chatgpt-codex-connector Bot reviewed Jun 15, 2026

Comment thread benchmarks/bench_storage.cpp Outdated

div0rce added 2 commits June 15, 2026 14:29

chatgpt-codex-connector Bot reviewed Jun 15, 2026

Comment thread benchmarks/bench_storage.cpp Outdated

div0rce added 2 commits June 15, 2026 15:05

coderabbitai Bot reviewed Jun 15, 2026

Comment thread benchmarks/bench_storage.cpp

Comment thread docs/benchmarking.md Outdated

div0rce added 2 commits June 15, 2026 15:40

codescene-delta-analysis Bot approved these changes Jun 15, 2026

div0rce merged commit 548cb68 into main Jun 15, 2026
8 checks passed

div0rce deleted the perf/m47-storage-benchmark-diagnosis branch June 16, 2026 17:34
