Skip to content

^f C2 session-start latency benchmark harness (locally-validatable core)#393

Merged
omkhar merged 9 commits into
mainfrom
c2/session-start-latency
Jul 5, 2026
Merged

^f C2 session-start latency benchmark harness (locally-validatable core)#393
omkhar merged 9 commits into
mainfrom
c2/session-start-latency

Conversation

@omkhar

@omkhar omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner

C2 (v0.14): Session Start Latency Program — locally-validatable core.

Mirrors the C5 exec-guard benchmark structure (harness + driver + methodology doc + pure-logic tests).

What (4 files, 757 lines)

  • scripts/bench/startup-bench.sh — times one mode (cold/warm/cache-hit) as N session-start launches → median + p90 + stddev (exact C5 stats conventions), with a deterministic dry-run path (WORKCELL_STARTUP_SAMPLES_NS) so the math is reproducible with no runtime, and a clean skip when no runtime is available (CI-safe).
  • scripts/bench/run-startup-bench.sh — cold-vs-warm driver with a cross-run median-stability gate; skips cleanly when no container runtime is present.
  • docs/session-startup-benchmarks.md — methodology, cold/warm/cache-hit definitions, rerun guide; results tables are explicit "pending live capture" placeholders (no fabricated numbers).
  • internal/startupbench/bench_test.go — pins the median/p90/stddev math, unknown-mode/non-integer rejection, live-timing path, skip-when-no-runtime, and the stability-gate pass/fail.

Deferred (needs a live runtime)

Actual cold/warm/cache-hit numbers, the kept-warm lane, wiring the prep hooks to the real runtime, and the scheduled non-PR-blocking bench workflow lane. Tracked as follow-ups.

Validation

go test ./..., shellcheck -x, shfmt, markdownlint — all green. scripts/workcell untouched.

🤖 Generated with Claude Code

@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a011a34e6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/run-startup-bench.sh Outdated
Comment thread scripts/bench/run-startup-bench.sh Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a011a34e6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/run-startup-bench.sh Outdated
Comment thread scripts/bench/run-startup-bench.sh Outdated
@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

Fixed both: (P1) cold mode now forces warmup=0 so measured cold samples run against the freshly cold-prepped state instead of a warmup-warmed one (warm/cache-hit keep the configured warmup); added a caveat that a strict per-sample cold median needs a re-evicting cold-prep hook with ITERATIONS=1. (P2) added the cache-hit mode to the driver (order cold, cache-hit, warm) with a WORKCELL_STARTUP_CACHE_HIT_PREP hook wired like the others, documented in the methodology/rerun/tunable sections. New test asserts cold gets warmup 0, warm/cache-hit get 1, all three modes drive, and a cache-hit row is emitted. go test/shellcheck/shfmt/markdownlint green; numbers still placeholders.

@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 752a727626

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/run-startup-bench.sh Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 752a727626

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/run-startup-bench.sh Outdated
Comment thread scripts/bench/run-startup-bench.sh Outdated
@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

Fixed both follow-ups: (P1) cold mode now re-runs WORKCELL_STARTUP_COLD_PREP before EVERY measured sample — a new measure_cold drives the harness one sample at a time (re-prep between each), then aggregates the samples back through the stats core so the cold row is a normal n=N distribution; warm/cache-hit keep their single shared prep. Enforced in the driver + a dry-run test asserts cold preps 3× vs warm/cache-hit 1×. (P2) WORKCELL_STARTUP_CMD is now parsed into an argv array honoring shell quoting (eval into CMD_ARGV, launched as "${CMD_ARGV[@]}"), so spaced/quoted args survive — a test asserts argv boundaries are preserved. go test (12 tests)/shellcheck/shfmt/markdownlint green; numbers still placeholders pending live capture.

@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 271b4c5cc1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/run-startup-bench.sh
@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

Fixed: live runs now fail fast if a driven mode's prep hook (WORKCELL_STARTUP_COLD_PREP/CACHE_HIT_PREP/WARM_PREP) is unset, naming the missing mode+var, so you can't publish numbers measured against arbitrary state; the dry-run path needs no hooks. Test asserts the live fail-fast + dry-run pass-through. Also rebased onto current main (which now includes E5) so the diff is just the 4 bench/doc/test files. go test (13)/shellcheck/shfmt/markdownlint green.

@omkhar omkhar force-pushed the c2/session-start-latency branch from 271b4c5 to 658ff40 Compare July 5, 2026 15:07
@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 658ff40213

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/run-startup-bench.sh
Comment thread internal/startupbench/bench_test.go Outdated
Comment thread scripts/bench/run-startup-bench.sh Outdated
@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

Fixed all three: (1) the Go script-driver tests now build a hermetic child env (PATH + HOME/TMPDIR + only test-set vars) so exported WORKCELL_STARTUP_* can't leak in; (2) the dry-run/canned path no longer calls prep_mode at all (guarded on DRY_RUN), so exported live prep hooks aren't executed during a dry run; (3) added validate_int fail-fast for RUNS/ITERATIONS/WARMUP/STABILITY_PCT so 0/non-numeric can't exit 0 with a misleading STABLE. Tests added for each (all fail pre-fix). go test (19)/shellcheck/shfmt/markdownlint green.

@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cc7418734c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/run-startup-bench.sh Outdated
Comment thread scripts/bench/run-startup-bench.sh

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cc7418734c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/run-startup-bench.sh
omkhar and others added 7 commits July 5, 2026 12:25
…additive CI-safe harness that skips cleanly without a runtime, no scripts/workcell or boundary touched; median/p90/gate/skip logic validated locally by go test + shellcheck + shfmt + markdownlint)

Implement the locally-validatable core of roadmap item C2 (Session Start
Latency Program), mirroring the C5 exec-guard benchmark structure.

- scripts/bench/startup-bench.sh: measurement harness timing one mode
  (cold/warm/cache-hit) with an N-sample median + p90; stats conventions
  match the C5 harness exactly. Deterministic dry-run path via
  WORKCELL_STARTUP_SAMPLES_NS makes the math reproducible without a runtime.
- scripts/bench/run-startup-bench.sh: driver that runs cold vs warm passes,
  repeats for RUNS, and enforces a cross-run median-stability gate. Skips
  cleanly (exit 0) when no container runtime is available, so it is CI-safe.
- docs/session-startup-benchmarks.md: methodology + rerun guide mirroring
  the C5 doc, with results tables left as explicit "pending live capture"
  placeholders (no fabricated numbers).
- internal/startupbench/bench_test.go: pins the median/p90/stddev math,
  the stability-gate threshold (pass/fail), and the skip-when-no-runtime
  behavior; runs under `go test ./...` with no runtime.

Deferred (needs a live runtime, blocked by the Debian-mirror outage):
live benchmark numbers, the kept-warm lane implementation, the cold/warm
prep-hook wiring, and the scheduled bench.yml workflow lane.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… go test + shellcheck/shfmt/markdownlint green, live runtime unverifiable; user-visible benchmark output)

P1: with the default warmup, the driver ran the cold-prep hook once and then
spent that freshly-evicted state on the discarded warmup launch, so every
measured cold sample was actually a warmed start. Force warmup to 0 for cold so
the first measured sample is a true first start; warm and cache-hit keep the
configured warmup.

P2: the documented three-mode methodology (cold, cache-hit, warm) was only
driven as cold+warm. Add cache-hit to the driven modes with its own
WORKCELL_STARTUP_CACHE_HIT_PREP hook, mirroring the cold/warm prep wiring, and
document it. Results tables stay pending-live-capture placeholders.

Tests: add a stub-harness (WORKCELL_STARTUP_HARNESS test seam) assertion that
the driver passes warmup 0 for cold, the configured warmup for warm/cache-hit,
and drives all three modes including cache-hit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…cal go test + shellcheck/shfmt/markdownlint green, live runtime unverifiable; user-visible benchmark correctness)

Follow-up to the C2 driver review (752a727):

P1: warmup=0 was not sufficient. prep_mode/the cold-prep hook ran once before
the harness measured all WORKCELL_STARTUP_ITERATIONS launches, so a session
start warmed the cache and only the FIRST cold sample was a true first start;
the rest were cache-hits. Add measure_cold: re-run WORKCELL_STARTUP_COLD_PREP
before EACH measured cold sample, time each start on its own (warmup 0), and
aggregate the per-sample timings through the harness stats core so the cold row
still reads like a normal distribution. warm/cache-hit keep one shared prep.

P2: the driver expanded $WORKCELL_STARTUP_CMD unquoted, so an argument with
spaces (e.g. --workspace '/path/with space') was word-split/globbed and the
wrong argv was timed. Parse the command once via a shell-quoting-aware
`eval "CMD_ARGV=( ... )"` and launch "${CMD_ARGV[@]}", preserving argv
boundaries. Documented as the WORKCELL_STARTUP_CMD contract.

Tests: dry-run test asserts cold prep runs once per measured sample (3 canned
samples -> 3 preps) while warm/cache-hit prep once; a recorder-target test on
the live harness path asserts the spaced argument reaches the target as one
argv element. Doc updated for both contracts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ocal go test + shellcheck/shfmt/markdownlint green, live runtime unverifiable; user-visible fail-fast guardrail)

Follow-up to the C2 driver review (271b4c5):

On a live run (WORKCELL_STARTUP_CMD set), a missing mode prep hook left
prep_mode a no-op, so the harness measured whatever runtime state happened to be
present -- no real cold eviction or warm-lane priming -- yet still exited
0/STABLE and produced publishable-looking numbers. Fail fast on a live run if
the prep hook for any driven mode (WORKCELL_STARTUP_COLD_PREP /
WORKCELL_STARTUP_CACHE_HIT_PREP / WORKCELL_STARTUP_WARM_PREP) is unset, naming
the mode and the env var. The dry-run path (WORKCELL_STARTUP_SAMPLES_NS / stub
harness) needs no prep hooks and is unaffected.

Tests: add a test asserting a live run with a missing cold prep hook fails with
the documented error while a dry run with no hooks still passes STABLE; the two
existing live-path tests now set no-op prep hooks. Doc states live runs require
each driven mode's prep hook or the numbers are not publishable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…alidate numeric controls (local go test + shellcheck/shfmt/markdownlint green, live runtime unverifiable; user-visible driver behavior + test hardening)

Follow-up to the C2 driver review (658ff40); three findings:

1) Hermetic script-driver tests. runScript inherited os.Environ(), so an
exported WORKCELL_STARTUP_* (stray prep hook / SAMPLES_NS) could leak into tests
that assume a clean process. Build the child env explicitly from PATH (+ HOME,
TMPDIR when present) plus only what the test sets. Add a test that leaks
SAMPLES_NS + a prep hook via t.Setenv and asserts a no-runtime run still cleanly
skips and never runs the hook.

2) Dry runs must not run prep hooks. The canned dry-run path still called
prep_mode, so live hooks exported from a previous run would execute. Suppress
prep on the dry-run path (main loop guarded by DRY_RUN; measure_cold dry-run now
emits the group's stats directly with no prep). Add a test asserting a dry run
with all three hooks exported creates no marker file. The per-sample cold re-prep
test moves to the live path (prep only runs live now).

3) Validate numeric controls. WORKCELL_STARTUP_RUNS/ITERATIONS/WARMUP/
STABILITY_PCT were unvalidated; RUNS=0 or a non-integer exited 0 with no
benchmarking or a misleading STABLE. Add validate_int and fail fast:
ITERATIONS/RUNS >= 1, WARMUP/STABILITY_PCT >= 0, naming the offending var. Add a
table-driven test. Doc updated for the dry-run/no-prep and numeric-control rules.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…re >=2 live runs, fail zero medians (local go test + shellcheck/shfmt/markdownlint green, live runtime unverifiable; user-visible gate/integrity behavior)

Follow-up to the C2 driver review (cc74187); three false-publishable integrity bugs:

1) Probe runtime health, not just the client. Auto-detect used `command -v`, so a
host with the docker/colima/container CLIENT but no usable daemon selected live
mode and then hard-failed on the missing command/prep hooks instead of the
documented clean CI-safe skip. Add runtime_usable() (docker info / colima status
/ container system status) and require it in the detect loop; an explicit
WORKCELL_STARTUP_RUNTIME override is respected as-is and skips the probe.

2) Require RUNS >= 2 for a live/reportable capture. With RUNS=1 the cross-run
stability section is skipped yet the script exits 0, which the docs say means the
gate passed -- a single run has no repeatability evidence. Fail fast on a live
run with RUNS<2. Dry-run canned mode keeps whatever RUNS the data implies.

3) Zero median must not read as STABLE. When one run's median was 0 and another
nonzero the spread% was forced to 0 (STABLE, exit 0) despite a degenerate result.
Treat min<=0 as a degenerate measurement: mark the mode UNSTABLE and fail the
gate (exit 2) with a clear message; a 0 ns session start is impossible.

Tests: client-only host -> clean skip (fake clients whose health probe fails,
shadowing real ones on PATH; runScript now lets a test override PATH cleanly);
RUNS=1 live -> fail fast while dry-run RUNS=1 still passes; zero-vs-nonzero
medians -> exit 2 UNSTABLE. Existing live tests bumped to RUNS=2 (stub/high
threshold to keep the gate deterministic). Doc updated for all three rules.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t (the 5 rounds of integrity hardening pushed the PR to 1312 lines, over the 1200 PR-shape limit; tighten verbose regression comments and redundant methodology prose without changing any behavior or dropping a test assertion) (go test/shellcheck/shfmt/markdownlint green, no assertion removed; docs/test-comment-only trim)
@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

Fixed the 3 integrity findings (runtime-health probe, RUNS>=2 gate, zero-median-is-unstable) in a26e641, rebased onto current main, and trimmed verbose comments/prose to bring the PR back under the 1200-line shape limit (1199 now) — no behavior change, no test assertion dropped. go test (19)/shellcheck/shfmt/markdownlint green.

@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@omkhar omkhar force-pushed the c2/session-start-latency branch from cc74187 to 1f1a0fc Compare July 5, 2026 16:45

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f1a0fcbc6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/run-startup-bench.sh
Comment thread scripts/bench/run-startup-bench.sh

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f1a0fcbc6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/startup-bench.sh Outdated
…r, use a monotonic clock (P2) (three integrity/hygiene fixes: add run-startup-bench.sh and startup-bench.sh to the shellcheck/shfmt allowlists in validate-repo.sh and dev-quick-check.sh so they can't escape the repo shell gate; redirect prep_mode hook output to stderr so a docker-pull etc. can't prepend non-report lines to the stdout Markdown report; and time live samples on CLOCK_MONOTONIC (perl Time::HiRes / python3 monotonic_ns, mirroring C5) so an NTP step or sleep/wake can't corrupt a sample; re-trimmed doc/test comments to stay under the 1200-line PR-shape limit) (go test/shellcheck/shfmt/markdownlint green; user-visible benchmark integrity + CI coverage)
@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

Fixed all three: (1) registered run-startup-bench.sh and startup-bench.sh in the shellcheck/shfmt allowlists (validate-repo.sh + dev-quick-check.sh) so they're under the repo shell gate; (2) prep_mode output now goes to stderr so hook noise (docker pull etc.) can't prepend non-report lines to the stdout Markdown report; (3) live samples now use CLOCK_MONOTONIC (perl Time::HiRes / python3 monotonic_ns, mirroring C5) so a clock step can't corrupt a sample. Re-trimmed to 1192 lines (under 1200). go test/shellcheck/shfmt/markdownlint green.

@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f4874428c3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/startup-bench.sh Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f4874428c3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/run-startup-bench.sh
… (P2) (two measurement-validity fixes: move the per-sample timing loop into ONE long-lived perl/python process reading CLOCK_MONOTONIC in-process (with a launch()/system per sample), so a ~20ms interpreter startup no longer sits inside every measured interval and inflates fast warm/cache-hit medians - mirrors C5's in-process loop; and re-run the cache-hit prep hook before every measured sample like cold, since with warmup>=1 the discarded warmup was spending the cache-hit state before any measurement, drifting cache-hit toward warm; emitted line format unchanged so the Go stats tests + driver aggregation still pass; re-trimmed comments/prose to 1193 lines under the 1200 shape limit) (go test/shellcheck/shfmt/markdownlint green, format-preserving; user-visible benchmark measurement integrity)
@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

Fixed both: (1) the measured timing loop now runs inside ONE long-lived perl/python process reading CLOCK_MONOTONIC in-process (launch() per sample), so a ~20ms interpreter startup is no longer inside each measured interval — it was dominating fast warm/cache-hit medians and hiding the improvement; mirrors C5's in-process loop and keeps the emitted field format identical (Go stats tests unchanged). (2) cache-hit now re-preps before every measured sample (measure_reprep, like cold), since warmup was spending the cache-hit state before measurement and drifting it toward warm. Re-trimmed to 1195 lines. go test/shellcheck/shfmt/markdownlint green.

@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81b6ec8512

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/run-startup-bench.sh

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81b6ec8512

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/bench/run-startup-bench.sh
@omkhar

omkhar commented Jul 5, 2026

Copy link
Copy Markdown
Owner Author

Convergence decision (codex-loop round cap). This PR has been through 8 review rounds; every finding since round 1 has been P2 (no P0/P1), and the last several — including this one (bound the runtime-usability probe with a timeout for a wedged daemon) — are marginal robustness refinements on the live benchmark path.

That live path is deferred by design: real session-start numbers can't be captured until a container runtime is available (the image lane was mirror-blocked), so the results tables ship as explicit pending live capture placeholders. The locally-validatable core is complete and green — harness median/p90/stddev math, the in-process CLOCK_MONOTONIC timing loop, per-sample cold/cache-hit re-prep, the cross-run stability gate, numeric-control validation, runtime-health probing, hermetic tests, and the dry-run path — 26 tests, plus shellcheck/shfmt/markdownlint and the newly-added shell-gate coverage.

This finding is real but marginal (installed-client-with-wedged-daemon), and a portable timeout wrapper (macOS has no timeout) would push the PR back over the 1200-line shape limit for a path that cannot execute yet. So I'm capping the loop here and tracking the remaining live-path hardening — probe timeout, and final calibration against a real runtime — as a follow-up to land alongside the actual live-capture lane, where they can be validated against a booted VM instead of added speculatively. Merging on the documented round-cap decision per the review policy.

@omkhar omkhar merged commit a24f2f3 into main Jul 5, 2026
13 of 14 checks passed
@omkhar omkhar deleted the c2/session-start-latency branch July 5, 2026 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant