perf: diagnose order-book storage benchmark behavior#122
Conversation
📝 WalkthroughWalkthroughThis PR diagnoses a benchmarking artifact in the M47 storage study by excluding engine construction and symbol-registration from the timed interval. It replaces the single ChangesM47 Storage Benchmark Diagnosis
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a42e7caf93
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The storage benchmark timed run_once in full, including MatchingEngine construction and the RegisterSymbol prefix. Book construction is eager, so for the pooled modes that prefix runs OrderPool/RawPool free-list initialization over 65536 slots per book -- a fixed per-run setup cost charged to per-command time and amortized over only ~5k commands. With 2-4 books that setup dominated the intrusive-pool numbers. Apply registration (and the end-of-run snapshot readout) outside the timed interval and normalize over the timed command count, so each row reflects per-command work rather than per-run pool initialization. Addresses the PR #122 review finding that per-run setup leaked into the timed storage rows.
Install the Docker-Linux artifact regenerated with per-run setup excluded from timing (digest 81ff7430..., Dirty inputs: no, informational commit 476ba71) and rewrite the interpretation. The earlier "intrusive is the slow outlier / PMR fastest" ranking was dominated by per-run pool free-list initialization timed as command cost; with setup excluded the four modes cluster into a tight ~40-120 ns/cmd band and intrusive and contiguous are the two fastest, trading the lead by workload shape.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a396af2995
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The shape pass (characterize) walked the full command stream and derived top_probe_calls from a formula on the total command count, while the timed path measures only the post-registration trading commands. The shape line therefore reported top_probe_calls=20016 against probes/run=20008 (the 2 registration commands x 2 symbols x bid/ask). Share one registration-prefix boundary (registration_prefix_len) between characterize and the timed path, and count probes with the same should_probe predicate over the same range. The shape line now describes exactly the timed trading workload -- commands and top_probe_calls match the per-run rows -- closing the registration-prefix accounting class at the source rather than patching one formula. Addresses the second PR #122 review finding (same root cause as the first).
Regenerate the Docker-Linux artifact after the characterize fix (digest e12d1416..., Dirty inputs: no, informational commit d3ed253) so the shape line's commands/top_probe_calls match the per-run cmds/probes/run, and rebuild the doc table from that single artifact. The timed path was unchanged, so medians moved only within noise and the ranking is identical: the four modes cluster into a tight ~40-120 ns/cmd band with intrusive and contiguous fastest, trading the lead by workload shape.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/bench_storage.cpp`:
- Around line 369-393: The time_storage function can fail when reps is zero
(leaving samples empty) or when timed_commands is zero (causing division by
zero). Add explicit guards in the function: first check if reps is zero and
handle appropriately before entering the loop, then within the loop check if
timed_commands is zero before performing the division operation (ns /
timed_commands), and finally before accessing samples vector elements
(samples[samples.size() / 2], samples.front(), samples.back()), verify that
samples is not empty. These guards will prevent undefined behavior and crashes
on edge case inputs.
In `@docs/benchmarking.md`:
- Around line 113-117: In the paragraph describing the storage experiment that
includes variants like PMR-backed container-node allocation, intrusive
OrderPool-backed resting-order nodes, and M47 fixed-band contiguous
direct-price-indexed storage mode, the phrase on line 117 about timing being
over "full workload replays" is imprecise. Reword this portion to explicitly
clarify that the timing measurements are specifically for the post-registration
command path executed per replay, not the entire workload replay, to align with
the actual harness behavior and maintain precision in performance documentation
as required by the coding guidelines.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b6b270b7-07a7-4895-a9d7-905cfe43c960
📒 Files selected for processing (9)
MILESTONES.mdPROGRESS.mdbenchmarks/bench_storage.cppdocs/benchmarking.mddocs/pool_backed_storage.mdresults/pool_backed_storage.txtscripts/run_storage_benchmarks.shsrc/engine/order_book.cpptests/unit/test_matching_engine.cpp
time_storage now resolves the timed-command count once (it is identical across reps) and returns early when reps == 0 or a workload has no post-registration commands, avoiding empty-vector indexing and a divide-by-zero in the sampling math. docs/benchmarking.md no longer says timing covers "full workload replays" -- it times only the post-registration command path, matching the storage-doc and the harness. Addresses the CodeRabbit review comments on PR #122.
Regenerate the Docker-Linux artifact (digest b606452b..., Dirty inputs: no, informational commit cf0396f) after the time_storage guard/hoist and rebuild the doc table from that single artifact. The timed-path logic is unchanged, so medians moved only within noise and the four-mode ranking is identical.
There was a problem hiding this comment.
Code Health Improved
(1 files improve in Code Health)
Our agent can fix these. Install it.
Gates Passed
6 Quality Gates Passed
View Improvements
| File | Code Health Impact | Categories Improved |
|---|---|---|
| bench_storage.cpp | 9.69 → 10.00 | Excess Number of Function Arguments |
Quality Gate Profile: Pay Down Tech Debt
Install CodeScene MCP: safeguard and uplift AI-generated code. Catch issues early with our IDE extension and CLI tool.
Milestone
M47 follow-up — Storage benchmark diagnosis (on top of the merged M47 study, PR #119).
Summary
This follow-up diagnoses why the M47 storage modes ranked as they did by adding deterministic
workload variants and non-timed shape characterization. In review, a Codex comment flagged that the
benchmark timed per-run setup as command cost — and chasing it down overturned the headline
conclusion.
run_onceconstructed a freshMatchingEngineand applied theRegisterSymbolprefix inside thetimed interval. Book construction is eager, so for the pooled modes that prefix runs
OrderPool/RawPoolfree-list initialization over 65536 slots per book. With a fresh engineper replay and only ~5000 commands, that one-time setup was charged to per-command time, amortized
over far too few commands, and scaled with symbol count — so it inflated the intrusive mode
most. (Note: the cost is incurred at
register_symbol/book construction, not at theMatchingEngineconstructor as the review supposed — the constructor only stores the storage enum.)The fix runs engine construction, the registration prefix, and the end-of-run snapshot outside
the timed interval and normalizes over timed commands. With setup excluded, the earlier "intrusive
is ~4–5× slower" result disappears — it was a benchmark-setup artifact, not a per-command property.
Corrected results (engine-level synthetic; aarch64 / Linux / g++ 13.3 / Release;
results/pool_backed_storage.txt)Median ns per timed command (per-run setup excluded), lower is better. Single-machine,
hardware/compiler/build-dependent — not a production-throughput claim.
(bold = fastest mode in that row;
Source digest: sha256:b606452b…,Dirty inputs: no.)Honest reading: with per-run setup excluded the four modes cluster into a tight ~40–120 ns/cmd
band (vs the old 40–486 spread). Intrusive and contiguous are the two fastest and trade the lead by
workload shape; baseline/PMR sit behind. This does not mean "intrusive won" — it still pays a
large fixed init cost (pre-allocating 65536 slots/book) that this per-command metric deliberately
excludes and that only amortizes over a long engine lifetime. The contiguous fixed-band caveat
(
[1,1024]) still holds.Verification of the fix
A controlled macOS before/after on the same host isolates the effect: intrusive dropped 45–92
ns/cmd while baseline/PMR/contiguous stayed within noise, and the drop scales with symbol count
(4-symbol flows ≈80–92, 2-symbol ≈45) — exactly the "N books × 2 × 65536 free-list init removed"
signature.
Definition of Done
sha256:b606452b…,Dirty inputs: no, informational commitcf0396f.characterizepass now observesthe same post-registration trading range the timed rows measure, sharing one registration-prefix
boundary and the same
should_probepredicate — so the shape line'scommands/top_probe_callsmatch the per-run
cmds/probes/runinstead of counting the registration prefix.ranking is removed and explained as an artifact.
order_book.cppchange behavior-preserving.make check/make asanpass;PROGRESS.mdupdated.Tests
Notes / limitations
[1,1024]band.