^f C2 session-start latency benchmark harness (locally-validatable core)#393
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7a011a34e6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7a011a34e6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Fixed both: (P1) cold mode now forces warmup=0 so measured cold samples run against the freshly cold-prepped state instead of a warmup-warmed one (warm/cache-hit keep the configured warmup); added a caveat that a strict per-sample cold median needs a re-evicting cold-prep hook with ITERATIONS=1. (P2) added the |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 752a727626
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 752a727626
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Fixed both follow-ups: (P1) cold mode now re-runs WORKCELL_STARTUP_COLD_PREP before EVERY measured sample — a new measure_cold drives the harness one sample at a time (re-prep between each), then aggregates the samples back through the stats core so the cold row is a normal n=N distribution; warm/cache-hit keep their single shared prep. Enforced in the driver + a dry-run test asserts cold preps 3× vs warm/cache-hit 1×. (P2) WORKCELL_STARTUP_CMD is now parsed into an argv array honoring shell quoting (eval into CMD_ARGV, launched as "${CMD_ARGV[@]}"), so spaced/quoted args survive — a test asserts argv boundaries are preserved. go test (12 tests)/shellcheck/shfmt/markdownlint green; numbers still placeholders pending live capture. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 271b4c5cc1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Fixed: live runs now fail fast if a driven mode's prep hook (WORKCELL_STARTUP_COLD_PREP/CACHE_HIT_PREP/WARM_PREP) is unset, naming the missing mode+var, so you can't publish numbers measured against arbitrary state; the dry-run path needs no hooks. Test asserts the live fail-fast + dry-run pass-through. Also rebased onto current main (which now includes E5) so the diff is just the 4 bench/doc/test files. go test (13)/shellcheck/shfmt/markdownlint green. |
271b4c5 to
658ff40
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 658ff40213
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Fixed all three: (1) the Go script-driver tests now build a hermetic child env (PATH + HOME/TMPDIR + only test-set vars) so exported WORKCELL_STARTUP_* can't leak in; (2) the dry-run/canned path no longer calls prep_mode at all (guarded on DRY_RUN), so exported live prep hooks aren't executed during a dry run; (3) added validate_int fail-fast for RUNS/ITERATIONS/WARMUP/STABILITY_PCT so 0/non-numeric can't exit 0 with a misleading STABLE. Tests added for each (all fail pre-fix). go test (19)/shellcheck/shfmt/markdownlint green. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cc7418734c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cc7418734c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…additive CI-safe harness that skips cleanly without a runtime, no scripts/workcell or boundary touched; median/p90/gate/skip logic validated locally by go test + shellcheck + shfmt + markdownlint) Implement the locally-validatable core of roadmap item C2 (Session Start Latency Program), mirroring the C5 exec-guard benchmark structure. - scripts/bench/startup-bench.sh: measurement harness timing one mode (cold/warm/cache-hit) with an N-sample median + p90; stats conventions match the C5 harness exactly. Deterministic dry-run path via WORKCELL_STARTUP_SAMPLES_NS makes the math reproducible without a runtime. - scripts/bench/run-startup-bench.sh: driver that runs cold vs warm passes, repeats for RUNS, and enforces a cross-run median-stability gate. Skips cleanly (exit 0) when no container runtime is available, so it is CI-safe. - docs/session-startup-benchmarks.md: methodology + rerun guide mirroring the C5 doc, with results tables left as explicit "pending live capture" placeholders (no fabricated numbers). - internal/startupbench/bench_test.go: pins the median/p90/stddev math, the stability-gate threshold (pass/fail), and the skip-when-no-runtime behavior; runs under `go test ./...` with no runtime. Deferred (needs a live runtime, blocked by the Debian-mirror outage): live benchmark numbers, the kept-warm lane implementation, the cold/warm prep-hook wiring, and the scheduled bench.yml workflow lane. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… go test + shellcheck/shfmt/markdownlint green, live runtime unverifiable; user-visible benchmark output) P1: with the default warmup, the driver ran the cold-prep hook once and then spent that freshly-evicted state on the discarded warmup launch, so every measured cold sample was actually a warmed start. Force warmup to 0 for cold so the first measured sample is a true first start; warm and cache-hit keep the configured warmup. P2: the documented three-mode methodology (cold, cache-hit, warm) was only driven as cold+warm. Add cache-hit to the driven modes with its own WORKCELL_STARTUP_CACHE_HIT_PREP hook, mirroring the cold/warm prep wiring, and document it. Results tables stay pending-live-capture placeholders. Tests: add a stub-harness (WORKCELL_STARTUP_HARNESS test seam) assertion that the driver passes warmup 0 for cold, the configured warmup for warm/cache-hit, and drives all three modes including cache-hit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…cal go test + shellcheck/shfmt/markdownlint green, live runtime unverifiable; user-visible benchmark correctness) Follow-up to the C2 driver review (752a727): P1: warmup=0 was not sufficient. prep_mode/the cold-prep hook ran once before the harness measured all WORKCELL_STARTUP_ITERATIONS launches, so a session start warmed the cache and only the FIRST cold sample was a true first start; the rest were cache-hits. Add measure_cold: re-run WORKCELL_STARTUP_COLD_PREP before EACH measured cold sample, time each start on its own (warmup 0), and aggregate the per-sample timings through the harness stats core so the cold row still reads like a normal distribution. warm/cache-hit keep one shared prep. P2: the driver expanded $WORKCELL_STARTUP_CMD unquoted, so an argument with spaces (e.g. --workspace '/path/with space') was word-split/globbed and the wrong argv was timed. Parse the command once via a shell-quoting-aware `eval "CMD_ARGV=( ... )"` and launch "${CMD_ARGV[@]}", preserving argv boundaries. Documented as the WORKCELL_STARTUP_CMD contract. Tests: dry-run test asserts cold prep runs once per measured sample (3 canned samples -> 3 preps) while warm/cache-hit prep once; a recorder-target test on the live harness path asserts the spaced argument reaches the target as one argv element. Doc updated for both contracts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ocal go test + shellcheck/shfmt/markdownlint green, live runtime unverifiable; user-visible fail-fast guardrail) Follow-up to the C2 driver review (271b4c5): On a live run (WORKCELL_STARTUP_CMD set), a missing mode prep hook left prep_mode a no-op, so the harness measured whatever runtime state happened to be present -- no real cold eviction or warm-lane priming -- yet still exited 0/STABLE and produced publishable-looking numbers. Fail fast on a live run if the prep hook for any driven mode (WORKCELL_STARTUP_COLD_PREP / WORKCELL_STARTUP_CACHE_HIT_PREP / WORKCELL_STARTUP_WARM_PREP) is unset, naming the mode and the env var. The dry-run path (WORKCELL_STARTUP_SAMPLES_NS / stub harness) needs no prep hooks and is unaffected. Tests: add a test asserting a live run with a missing cold prep hook fails with the documented error while a dry run with no hooks still passes STABLE; the two existing live-path tests now set no-op prep hooks. Doc states live runs require each driven mode's prep hook or the numbers are not publishable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…alidate numeric controls (local go test + shellcheck/shfmt/markdownlint green, live runtime unverifiable; user-visible driver behavior + test hardening) Follow-up to the C2 driver review (658ff40); three findings: 1) Hermetic script-driver tests. runScript inherited os.Environ(), so an exported WORKCELL_STARTUP_* (stray prep hook / SAMPLES_NS) could leak into tests that assume a clean process. Build the child env explicitly from PATH (+ HOME, TMPDIR when present) plus only what the test sets. Add a test that leaks SAMPLES_NS + a prep hook via t.Setenv and asserts a no-runtime run still cleanly skips and never runs the hook. 2) Dry runs must not run prep hooks. The canned dry-run path still called prep_mode, so live hooks exported from a previous run would execute. Suppress prep on the dry-run path (main loop guarded by DRY_RUN; measure_cold dry-run now emits the group's stats directly with no prep). Add a test asserting a dry run with all three hooks exported creates no marker file. The per-sample cold re-prep test moves to the live path (prep only runs live now). 3) Validate numeric controls. WORKCELL_STARTUP_RUNS/ITERATIONS/WARMUP/ STABILITY_PCT were unvalidated; RUNS=0 or a non-integer exited 0 with no benchmarking or a misleading STABLE. Add validate_int and fail fast: ITERATIONS/RUNS >= 1, WARMUP/STABILITY_PCT >= 0, naming the offending var. Add a table-driven test. Doc updated for the dry-run/no-prep and numeric-control rules. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…re >=2 live runs, fail zero medians (local go test + shellcheck/shfmt/markdownlint green, live runtime unverifiable; user-visible gate/integrity behavior) Follow-up to the C2 driver review (cc74187); three false-publishable integrity bugs: 1) Probe runtime health, not just the client. Auto-detect used `command -v`, so a host with the docker/colima/container CLIENT but no usable daemon selected live mode and then hard-failed on the missing command/prep hooks instead of the documented clean CI-safe skip. Add runtime_usable() (docker info / colima status / container system status) and require it in the detect loop; an explicit WORKCELL_STARTUP_RUNTIME override is respected as-is and skips the probe. 2) Require RUNS >= 2 for a live/reportable capture. With RUNS=1 the cross-run stability section is skipped yet the script exits 0, which the docs say means the gate passed -- a single run has no repeatability evidence. Fail fast on a live run with RUNS<2. Dry-run canned mode keeps whatever RUNS the data implies. 3) Zero median must not read as STABLE. When one run's median was 0 and another nonzero the spread% was forced to 0 (STABLE, exit 0) despite a degenerate result. Treat min<=0 as a degenerate measurement: mark the mode UNSTABLE and fail the gate (exit 2) with a clear message; a 0 ns session start is impossible. Tests: client-only host -> clean skip (fake clients whose health probe fails, shadowing real ones on PATH; runScript now lets a test override PATH cleanly); RUNS=1 live -> fail fast while dry-run RUNS=1 still passes; zero-vs-nonzero medians -> exit 2 UNSTABLE. Existing live tests bumped to RUNS=2 (stub/high threshold to keep the gate deterministic). Doc updated for all three rules. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t (the 5 rounds of integrity hardening pushed the PR to 1312 lines, over the 1200 PR-shape limit; tighten verbose regression comments and redundant methodology prose without changing any behavior or dropping a test assertion) (go test/shellcheck/shfmt/markdownlint green, no assertion removed; docs/test-comment-only trim)
|
Fixed the 3 integrity findings (runtime-health probe, RUNS>=2 gate, zero-median-is-unstable) in a26e641, rebased onto current main, and trimmed verbose comments/prose to bring the PR back under the 1200-line shape limit (1199 now) — no behavior change, no test assertion dropped. go test (19)/shellcheck/shfmt/markdownlint green. |
|
@codex review |
cc74187 to
1f1a0fc
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f1a0fcbc6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f1a0fcbc6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…r, use a monotonic clock (P2) (three integrity/hygiene fixes: add run-startup-bench.sh and startup-bench.sh to the shellcheck/shfmt allowlists in validate-repo.sh and dev-quick-check.sh so they can't escape the repo shell gate; redirect prep_mode hook output to stderr so a docker-pull etc. can't prepend non-report lines to the stdout Markdown report; and time live samples on CLOCK_MONOTONIC (perl Time::HiRes / python3 monotonic_ns, mirroring C5) so an NTP step or sleep/wake can't corrupt a sample; re-trimmed doc/test comments to stay under the 1200-line PR-shape limit) (go test/shellcheck/shfmt/markdownlint green; user-visible benchmark integrity + CI coverage)
|
Fixed all three: (1) registered run-startup-bench.sh and startup-bench.sh in the shellcheck/shfmt allowlists (validate-repo.sh + dev-quick-check.sh) so they're under the repo shell gate; (2) prep_mode output now goes to stderr so hook noise (docker pull etc.) can't prepend non-report lines to the stdout Markdown report; (3) live samples now use CLOCK_MONOTONIC (perl Time::HiRes / python3 monotonic_ns, mirroring C5) so a clock step can't corrupt a sample. Re-trimmed to 1192 lines (under 1200). go test/shellcheck/shfmt/markdownlint green. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f4874428c3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f4874428c3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
… (P2) (two measurement-validity fixes: move the per-sample timing loop into ONE long-lived perl/python process reading CLOCK_MONOTONIC in-process (with a launch()/system per sample), so a ~20ms interpreter startup no longer sits inside every measured interval and inflates fast warm/cache-hit medians - mirrors C5's in-process loop; and re-run the cache-hit prep hook before every measured sample like cold, since with warmup>=1 the discarded warmup was spending the cache-hit state before any measurement, drifting cache-hit toward warm; emitted line format unchanged so the Go stats tests + driver aggregation still pass; re-trimmed comments/prose to 1193 lines under the 1200 shape limit) (go test/shellcheck/shfmt/markdownlint green, format-preserving; user-visible benchmark measurement integrity)
|
Fixed both: (1) the measured timing loop now runs inside ONE long-lived perl/python process reading CLOCK_MONOTONIC in-process (launch() per sample), so a ~20ms interpreter startup is no longer inside each measured interval — it was dominating fast warm/cache-hit medians and hiding the improvement; mirrors C5's in-process loop and keeps the emitted field format identical (Go stats tests unchanged). (2) cache-hit now re-preps before every measured sample (measure_reprep, like cold), since warmup was spending the cache-hit state before measurement and drifting it toward warm. Re-trimmed to 1195 lines. go test/shellcheck/shfmt/markdownlint green. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 81b6ec8512
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 81b6ec8512
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Convergence decision (codex-loop round cap). This PR has been through 8 review rounds; every finding since round 1 has been P2 (no P0/P1), and the last several — including this one (bound the runtime-usability probe with a timeout for a wedged daemon) — are marginal robustness refinements on the live benchmark path. That live path is deferred by design: real session-start numbers can't be captured until a container runtime is available (the image lane was mirror-blocked), so the results tables ship as explicit This finding is real but marginal (installed-client-with-wedged-daemon), and a portable timeout wrapper (macOS has no |
C2 (v0.14): Session Start Latency Program — locally-validatable core.
Mirrors the C5 exec-guard benchmark structure (harness + driver + methodology doc + pure-logic tests).
What (4 files, 757 lines)
scripts/bench/startup-bench.sh— times one mode (cold/warm/cache-hit) as N session-start launches → median + p90 + stddev (exact C5 stats conventions), with a deterministic dry-run path (WORKCELL_STARTUP_SAMPLES_NS) so the math is reproducible with no runtime, and a clean skip when no runtime is available (CI-safe).scripts/bench/run-startup-bench.sh— cold-vs-warm driver with a cross-run median-stability gate; skips cleanly when no container runtime is present.docs/session-startup-benchmarks.md— methodology, cold/warm/cache-hit definitions, rerun guide; results tables are explicit "pending live capture" placeholders (no fabricated numbers).internal/startupbench/bench_test.go— pins the median/p90/stddev math, unknown-mode/non-integer rejection, live-timing path, skip-when-no-runtime, and the stability-gate pass/fail.Deferred (needs a live runtime)
Actual cold/warm/cache-hit numbers, the kept-warm lane, wiring the prep hooks to the real runtime, and the scheduled non-PR-blocking bench workflow lane. Tracked as follow-ups.
Validation
go test ./...,shellcheck -x,shfmt,markdownlint— all green.scripts/workcelluntouched.🤖 Generated with Claude Code