From 7bb86014d0cb4e57358a7e70cfbb904b102ca015 Mon Sep 17 00:00:00 2001
From: Bartosz Burda
Date: Tue, 23 Jun 2026 21:03:42 +0200
Subject: [PATCH] docs(benchmark): point the README at generated results, not hard-coded numbers

The example-results, what-to-optimize, and reference-point sections
quoted host-specific figures from a seed run that drift out of sync with
the committed baseline. Replace them with pointers to where results
actually live - the committed baseline (benchmark/baseline/ci.json) and
the per-run report.md / summary.json - plus a per-lane description of
what each lane reports. Keeps the method and the qualitative findings,
drops the absolute numbers that go stale.
---
 benchmark/README.md | 130 ++++++++++++++++++--------------------------
 1 file changed, 53 insertions(+), 77 deletions(-)

diff --git a/benchmark/README.md b/benchmark/README.md
index b6b0eb9..ccef8d3 100644
--- a/benchmark/README.md
+++ b/benchmark/README.md
@@ -94,83 +94,59 @@ The goal is numbers you can trust and compare, not a single reading.
   dimension (and `--load light|heavy` composes onto `footprint`); `fault` measures a
   transient burst, not steady state.
 
-## Example results
-
-From one host (Intel i7-10700K, 16 cores, 32 GiB, glibc allocator). Absolute
-numbers depend on hardware; re-run for your own. The value is the method.
-
-Footprint on a real headless Nav2 stack (turtlebot3, ~22 discovered entities,
-5 repeats): gateway USS around 95-100 MiB, PSS ~104 MiB, RSS ~128 MiB, 53 threads,
-~0.2-0.3 CPU-cores under live diagnostics. Note: the gateway takes a while to settle
-on this stack, so some repeats are flagged not-steady and excluded from the median;
-the report shows the steady count.
-
-Scaling, gateway against a synthetic graph (6 sizes, 3 repeats each):
-
-![USS vs entity count](docs/scaling-example.png)
-
-| entities | USS | USS per entity |
-|---|---|---|
-| 12 | 24 MiB | 2.01 MiB |
-| 32 | 28 MiB | 0.87 MiB |
-| 62 | 36 MiB | 0.58 MiB |
-| 102 | 48 MiB | 0.47 MiB |
-| 152 | 67 MiB | 0.44 MiB |
-| 252 | 95 MiB | 0.38 MiB |
-
-Fitted exponent **k = 0.46, 95% CI [0.26, 0.65]** (6 points, R2 = 0.91). Because the
-whole CI is below 1, this is **sub-linear, confirmed**: memory grows slower than the
-graph and per-entity cost falls as it grows. The gateway does not blow up on a large
-project. (On only 3-4 points the CI spans 1 and the harness reports INDETERMINATE
-instead - more points are needed to make the claim.) Absolute footprint is driven by
-graph *complexity* (topic and message-type count) more than node count - a Nav2 node
-with maps and scans costs several times more than a plain talker.
-
-Config sweep showed the discovery refresh interval as the main CPU lever: 200 ms
-costs about 4x the CPU of the 1000 ms default, while memory differences across
-configs stayed within a few MiB.
-
-Heap: a short synthetic run reports `inconclusive: warmup/cache fill` (positive USS
-slope but no attributable leak call-sites). A 25-min heaptrack run on the **real Nav2
-demo** (gateway rebuilt with debug symbols, `benchmark/scripts/heap_on_nav2.sh`) settles
-the question: the tracked heap plateaus at ~15.7 MiB (the last 40 % of the run adds
-+0.7 MiB, final snapshots flat), so the gateway does **not** leak on Nav2. The slow USS
-creep some footprint repeats show is warmup / per-message-type cache fill that takes
-~15-20 min to settle (longer than the footprint lane's 120 s window), plus non-heap
-allocator-arena retention - not growing allocations.
-
-HTTP load (synthetic graph, off / 8 clients / 32 clients): USS 28 -> 29 -> 35 MiB,
-CPU 0.01 -> 0.02 -> 0.18 cores, request p95 latency 1.6 -> 2.3 ms. The thread census
-shows ~50 threads of which ~39 share the `gateway_node` comm (the rclcpp executor and
-cpp-httplib pools), 9 are DDS. Latency stays low under load; the thread count is the
-efficiency concern, not latency.
-
-Fault / snapshot (synthetic, fault count N, 20 s window, a fresh container per N so
-each baseline is clean): peak USS delta grows monotonically with N (~0.5 MiB at N=1 ->
-~5.8 MiB at N=16); capture recovers within the window only for N<=2, and at **N>=4 it
-no longer returns to baseline within 20 s**. Each cell is a single sample (n=1).
-Sequentially reported faults each get their own rosbag (no contention); simultaneous
-reporting would
-hit the single rosbag writer.
+## Reading the results
+
+Absolute numbers are per-host and per-build, so they live in the files each run
+generates, not hard-coded here. Read them at the source:
+
+- **The committed baseline** - `benchmark/baseline/ci.json`: the footprint (USS, CPU,
+  threads) and the scaling exponent with its confidence interval that `compare` checks a
+  new run against, plus the host and gateway SHA it was measured on.
+- **Per-run output** - `benchmark/results///`: `report.md` (median +
+  IQR table), `summary.json` (machine-readable numbers and a one-line verdict), `*.png`
+  (the chart), and `run_metadata.json` (host, allocator, gateway version).
+
+Re-run a lane to produce your own; the numbers depend on hardware, the **method** is
+what transfers.
+
+### What each lane tells you
+
+- **footprint** - steady-state USS/PSS/CPU/threads of the gateway process alone on a real
+  headless Nav2 stack (turtlebot3), median over repeats with not-steady repeats excluded.
+- **scaling** - fits `USS ~ entities^k` and reports `k` with a 95% CI: **sub-linear** only
+  when the CI upper bound is below 1, **super-linear** only when the lower bound exceeds 1,
+  otherwise **INDETERMINATE**. The fit, not a single ratio, is the claim.
+
+  ![Example scaling fit - one host; regenerate for your own](docs/scaling-example.png)
+
+- **sweep** - footprint and CPU per config variant; surfaces which single setting costs the
+  most (the discovery refresh interval is the main CPU lever).
+- **heap** / **memcheck** - the authoritative leak verdict: a long heaptrack run on the real
+  Nav2 demo (`benchmark/scripts/heap_on_nav2.sh`) shows whether the tracked heap plateaus or
+  grows, and memcheck gives a hard "definitely lost" count.
+- **load** - footprint, CPU, threads and p50/p95 request latency under M concurrent HTTP
+  clients; the thread count, not latency, is the efficiency signal.
+- **fault** - peak memory/CPU and recovery of snapshot capture across a burst of N faults,
+  with and without rosbag.
+- **churn** - PASS/FAIL leak gate: USS slope on a static graph vs a churning one.
 
 ## What to optimize
 
-The benchmark points at three concrete targets:
+The harness points at concrete targets - run `load`, `fault`, and `sweep` to see the
+current magnitudes on your own host:
 
-- **Cap the executor / HTTP thread pool.** ~50 threads at idle, ~39 of them the
-  rclcpp executor + cpp-httplib pools. The `MultiThreadedExecutor` defaults to one
-  thread per hardware core; a bounded pool sized to the actual work would cut thread
-  overhead. (`load`, thread census.)
-- **Make snapshot capture asynchronous / queued.** Capture stays fast for <=2
-  concurrent faults but at N>=4 it no longer returns to baseline within the window and
-  peak memory grows monotonically with N (~5.8 MiB at N=16), so captures queue faster
-  than they drain. An async capture queue with bounded buffers would keep a fault storm
-  from blocking.
-  (`fault`.)
-- **Watch the refresh interval for CPU.** A 200 ms discovery refresh costs ~4x the
-  CPU of the 1000 ms default for little memory benefit. (`sweep`.)
+- **Bound the executor / HTTP thread pool.** Most of the gateway's idle threads are the
+  rclcpp executor and cpp-httplib pools; `MultiThreadedExecutor` defaults to one thread per
+  hardware core, so a bounded pool sized to the actual work cuts thread overhead.
+  (`load`, thread census.)
+- **Make snapshot capture asynchronous / queued.** Capture stays fast for a couple of
+  concurrent faults but stops returning to baseline once several land at once, so captures
+  queue faster than they drain; an async queue with bounded buffers keeps a fault storm from
+  blocking. (`fault`.)
+- **Watch the discovery refresh interval for CPU.** A fast refresh costs several times the
+  CPU of the default for little memory benefit. (`sweep`.)
 
-These are signals to investigate, sized on one host - re-run to confirm on yours.
+These are signals to investigate, measured on one host - re-run to confirm on yours.
 
 ## Running it
 
@@ -212,11 +188,11 @@ Results are gitignored; commit your own copy if you want to track them over time
 
 ## Reference point
 
-The published example numbers (in [Example results](#example-results)) were measured on:
-
-- **Host:** Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz, 16 cores, 32 GiB RAM
-- **Gateway SHA:** `8569f213e7a1e4b4fd2f699546eda9a11e78dcf1` (scaling lane)
-- **Footprint gateway SHA:** captured from image digest (SHA capture added after this seed run; will be real on the next footprint build)
+Every result records what it was measured on, so there is no host to hard-code here: the
+committed baseline carries its host and gateway SHA in `benchmark/baseline/ci.json`, and
+each run writes host CPU/RAM, allocator, kernel and gateway version to `run_metadata.json`.
+`compare` refuses to compare across different hosts or under high load, so pin
+`ROS2_MEDKIT_REF` to a commit and run on the same machine to track a number over time.
 
 ### Benchmarking a specific commit
 