Merged
83 commits
3de1415
Harden CI measurement regression gates
schickling-assistant May 14, 2026
2784c66
Clarify CI measurement gate labels
schickling-assistant May 14, 2026
64ff288
Clarify measurement gate summary labels
schickling-assistant May 14, 2026
d455009
Make measurement comments hermetic
schickling-assistant May 14, 2026
f1ad665
Make measurement comment renderer hermetic
schickling-assistant May 14, 2026
702a75f
Add CI measurement baseline backfill support
schickling-assistant May 15, 2026
988f1d3
Use committed lock for closure measurements
schickling-assistant May 16, 2026
c2a673a
Support historical devenv tracing in CI measurements
schickling-assistant May 16, 2026
61c147a
Allow historical baseline probes to publish partial data
schickling-assistant May 16, 2026
c0a028f
Fix CI measurement seed artifact lookup
schickling-assistant May 16, 2026
97f0092
Resolve GitHub CLI for CI baseline downloads
schickling-assistant May 16, 2026
76c5c70
Allow CI measurements to read baseline artifacts
schickling-assistant May 16, 2026
7b42b8d
Prefer explicit CI measurement baseline seeds
schickling-assistant May 16, 2026
7e60e57
Read nested CI measurement baselines
schickling-assistant May 16, 2026
ef7a4e3
Prune nested CI measurement baselines
schickling-assistant May 16, 2026
7ad96c5
Record only bounded baseline candidates
schickling-assistant May 16, 2026
5e842c8
Track typed measurement baseline seeds
schickling-assistant May 16, 2026
012fe75
Update measurement seed workflow tests
schickling-assistant May 16, 2026
a6813f9
Expose measurement gate readiness
schickling-assistant May 16, 2026
1b45f54
Allow parallel measurement baseline backfills
schickling-assistant May 16, 2026
3daf309
Seed effect-utils devenv perf baselines
schickling-assistant May 16, 2026
15226c8
Filter non-comparable measurement history
schickling-assistant May 16, 2026
d3139f8
Make measurement backfill concurrency reusable
schickling-assistant May 16, 2026
5bc374d
Accumulate compatible measurement baselines
schickling-assistant May 16, 2026
5027825
Clarify measurement comment interpretation
schickling-assistant May 17, 2026
1c9b554
Render CI measurement chart as PNG
schickling-assistant May 18, 2026
5f67d95
Avoid private raw URLs in CI measurement comments
schickling-assistant May 18, 2026
0ca1c21
Use PR head SHA for CI measurement comments
schickling-assistant May 18, 2026
43b9cdc
Adapt CI measurement SVG to dark mode
schickling-assistant May 18, 2026
c0d8b03
Use theme-aware CI measurement preview images
schickling-assistant May 18, 2026
ef07395
Allow dispatched CI measurement PR comments
schickling-assistant May 18, 2026
574fe53
Require public assets for private measurement charts
schickling-assistant May 18, 2026
98ea4a9
Scope managed measurement comments
schickling-assistant May 18, 2026
47d6ba9
Harden CI measurement gates
schickling-assistant May 18, 2026
54aaa2e
Make measurement budgets authoritative
schickling-assistant May 18, 2026
f6178fd
Make CI measurement chart show semantic impact
schickling-assistant May 18, 2026
c9b324e
Clarify diagnostic measurement impact
schickling-assistant May 19, 2026
c3a9b9c
Require paired evidence for wall-clock gates
schickling-assistant May 19, 2026
9c668e7
Format measurement architecture doc
schickling-assistant May 19, 2026
52004dd
Produce paired wall-clock measurement evidence
schickling-assistant May 19, 2026
32da40c
Make wall-clock measurement gates uncertainty-aware
schickling-assistant May 19, 2026
2d000f0
Keep deterministic measurement gates point-budget based
schickling-assistant May 19, 2026
317dbd0
Update measurement gate workflow assertion
schickling-assistant May 19, 2026
d62fb13
Document typed CI measurement architecture
schickling-assistant May 19, 2026
c1fbbed
Gate paired wall-clock measurements from paired evidence
schickling-assistant May 19, 2026
a9039e9
Use paired delta evidence for wall-clock gates
schickling-assistant May 19, 2026
a9e6518
Document CI measurement engine direction
schickling-assistant May 19, 2026
aa2130a
Clarify quick check measurement workloads
schickling-assistant May 19, 2026
ac2ce6e
Update measurement workflow expectations
schickling-assistant May 19, 2026
b78ad06
Backfill source shape measurement artifacts
schickling-assistant May 19, 2026
4a4888f
Seed source shape baseline artifact
schickling-assistant May 19, 2026
8a9bed2
Rank measurement comments by semantic impact
schickling-assistant May 19, 2026
b5ff591
Clarify below-threshold measurement improvements
schickling-assistant May 19, 2026
0468728
Add reusable Nix closure measurement helpers
schickling-assistant May 19, 2026
5978381
Measure effect-utils Nix closure sizes
schickling-assistant May 19, 2026
21d7273
Split zero-impact measurement rows
schickling-assistant May 19, 2026
2df4450
Test zero-impact measurement table split
schickling-assistant May 19, 2026
0c55ca8
Unblock current CI measurement runs
schickling-assistant May 20, 2026
dc9fada
Support manual measurement PR comment refresh
schickling-assistant May 20, 2026
0369f97
Trigger CI measurement comment refresh
schickling-assistant May 20, 2026
b366b85
Improve CI measurement comment scanability
schickling-assistant May 20, 2026
72d0690
Hide diagnostic measurements from scan table
schickling-assistant May 20, 2026
19ebea8
Unblock manual measurement dispatches
schickling-assistant May 20, 2026
7ac62d6
Use job-level CI concurrency
schickling-assistant May 20, 2026
ef5c63e
Make PR CI authoritative for measurement comments
schickling-assistant May 20, 2026
868059f
Merge branch 'main' into schickling/2026-05-14-ci-measurement-gates
schickling-assistant May 20, 2026
234bd49
Probe PR CI synchronize trigger
schickling-assistant May 20, 2026
8b77e43
Consolidate CI measurement reporting
schickling-assistant May 20, 2026
6002014
Format CI workflow helper test
schickling-assistant May 20, 2026
f878f6c
Split CI measurement production from reporting
schickling-assistant May 20, 2026
332c3b5
Bootstrap CI measurement producer tools
schickling-assistant May 20, 2026
db7be84
Merge branch 'main' into schickling/2026-05-14-ci-measurement-gates
schickling-assistant May 20, 2026
fc8703b
Fix CI workflow helper expectations
schickling-assistant May 20, 2026
327d393
Trigger clean CI run
schickling-assistant May 20, 2026
5119eb0
Fix matrix-safe CI job concurrency
schickling-assistant May 20, 2026
c9b03ce
Format matrix concurrency test
schickling-assistant May 20, 2026
0ad1844
Run merge queue gates on control-plane runners
schickling-assistant May 20, 2026
3ca45dc
Keep merge queue gates on available runners
schickling-assistant May 20, 2026
820c16f
Retry transient Nix source fetch failures
schickling-assistant May 20, 2026
54169bc
Refresh effect megarepo lock
schickling-assistant May 20, 2026
61aee5a
Declare baseline ref input for shared concurrency
schickling-assistant May 20, 2026
79156f1
Harden CI measurement setup
schickling-assistant May 20, 2026
b5220e8
Bound CI measurement baseline downloads
schickling-assistant May 20, 2026
3,805 changes: 3,301 additions & 504 deletions .github/workflows/ci.yml

Large diffs are not rendered by default.

331 changes: 318 additions & 13 deletions .github/workflows/ci.yml.genie.ts

Large diffs are not rendered by default.

175 changes: 175 additions & 0 deletions context/ci-measurement-engine.md
@@ -0,0 +1,175 @@
# CI Measurement Engine

This document specifies the reusable CI measurement engine. It builds on
[ci-measurements.md](./ci-measurements.md).

## Status

Draft - architecture target for replacing generated shell/jq comparison code
with a typed reusable implementation.

## Scope

This spec defines:

- the stable measurement artifact contract;
- comparison policy semantics;
- the native engine boundary;
- external-tool integration boundaries;
- the rollout path from generated shell/jq to a packaged CLI.

This spec does not define individual probes. Devenv, Nix closure, source-shape,
LOC, and complexity probes remain producer adapters that emit the shared
artifact format.

## Architecture

```text
producer adapters
  devenv wall-clock
  nix closure size
  source shape
  future LOC / complexity
        |
        v
measurements.json
        |
        v
ci-measure native engine
  schema validation
  compatibility matching
  comparison policy
  gate decision
  report projection
        |
        +--> measurement-comparison.json
        +--> GitHub Markdown comment
        +--> SVG/PNG chart payload
        +--> optional trend export
```

The engine owns comparison and rendering. Workflows own checkout, dependency
setup, artifact upload, and GitHub API calls.

## Measurement Registry

Every observation is interpreted through a registry entry:

| Field | Purpose |
| ------------------------- | ------------------------------------------------------ |
| `id` | Stable public identity. |
| `label` | Human review label. |
| `semanticPath` | Hierarchical grouping for comments and charts. |
| `measurementKind` | `deterministic`, `wall-clock`, or `diagnostic`. |
| `unit` | Canonical unit for values and deltas. |
| `direction` | Whether larger values are better, worse, or neutral. |
| `defaultComparisonMode` | `budget`, `paired`, `historical`, or `diagnostic`. |
| `gatePolicy` | Absolute/relative budgets and sample requirements. |
| `compatibilityDimensions` | Which dimensions must match for historical comparison. |
| `displayPolicy` | Visibility, sorting, and chart inclusion behavior. |
| `rawSampleSchema` | Optional schema for per-sample evidence. |

The registry is the public API for cross-repo reuse. Repos may add local entries,
but they must not fork comparison semantics.
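
As an illustration, a registry entry could be modeled as a typed record. The sketch below is one possible TypeScript shape, not the engine's actual schema: the field names follow the table, but every field type, the nested `GatePolicy` and `DisplayPolicy` shapes, the `Direction` literals, and the sample entry `nix_closure_size` are assumptions. The kind-to-mode checks mirror the pairings listed under Comparison Semantics.

```typescript
// Hypothetical shape for a registry entry. Field names follow the table;
// all field types and nested policy shapes are assumptions of this sketch.
type MeasurementKind = "deterministic" | "wall-clock" | "diagnostic";
type ComparisonMode = "budget" | "paired" | "historical" | "diagnostic";
type Direction = "larger-is-better" | "larger-is-worse" | "neutral";

interface GatePolicy {
  warnBudget?: number; // assumed to be expressed in the entry's canonical unit
  failBudget?: number;
  minSamples?: number;
}

interface DisplayPolicy {
  visible: boolean;
  includeInChart: boolean;
}

interface RegistryEntry {
  id: string;
  label: string;
  semanticPath: string[];
  measurementKind: MeasurementKind;
  unit: string;
  direction: Direction;
  defaultComparisonMode: ComparisonMode;
  gatePolicy: GatePolicy;
  compatibilityDimensions: string[];
  displayPolicy: DisplayPolicy;
  rawSampleSchema?: unknown;
}

// Collects basic consistency problems; an empty list means these checks passed.
function validateEntry(entry: RegistryEntry): string[] {
  const problems: string[] = [];
  const kind = entry.measurementKind;
  const mode = entry.defaultComparisonMode;
  if (entry.id.trim() === "") problems.push("id must be non-empty");
  if (entry.semanticPath.length === 0) problems.push("semanticPath must be non-empty");
  if (kind === "deterministic" && mode !== "budget") {
    problems.push("deterministic entries must default to budget mode");
  }
  if (kind === "wall-clock" && mode !== "paired" && mode !== "historical") {
    problems.push("wall-clock entries must default to paired or historical mode");
  }
  if (kind === "diagnostic" && mode !== "diagnostic") {
    problems.push("diagnostic entries must default to diagnostic mode");
  }
  return problems;
}

// Hypothetical entry for illustration only; the ID and budgets are invented.
const closureSizeEntry: RegistryEntry = {
  id: "nix_closure_size",
  label: "Nix closure size",
  semanticPath: ["nix", "closure", "size"],
  measurementKind: "deterministic",
  unit: "bytes",
  direction: "larger-is-worse",
  defaultComparisonMode: "budget",
  gatePolicy: { warnBudget: 1_000_000, failBudget: 5_000_000 },
  compatibilityDimensions: ["system"],
  displayPolicy: { visible: true, includeInChart: true },
};
```

Calling `validateEntry(closureSizeEntry)` returns an empty list, while switching its `defaultComparisonMode` to `paired` would report one problem.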

Wall-clock registry entries should include a workload dimension when the same
logical command can be measured under different cache conditions. For example,
`task_check_quick_warm` and `task_check_quick_forced` intentionally share the
semantic path `devenv / quality gates / check:quick`, but they are separate IDs
because one measures the warm cached no-op path while the other refreshes the
devenv task cache. Keeping the IDs separate avoids misleading claims, such as
reporting a faster cached no-op path as a faster full developer quick check.
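
A minimal sketch of how two workload variants can share one semantic path while staying separate measurements. The `WorkloadEntry` shape and the `workload` field name are assumptions; only the two IDs and the semantic path come from the text above.

```typescript
// Minimal hypothetical entry shape: only the fields needed for grouping.
interface WorkloadEntry {
  id: string;
  semanticPath: string;
  workload: string; // assumed dimension name, for example "warm" or "forced"
}

const quickCheckEntries: WorkloadEntry[] = [
  {
    id: "task_check_quick_warm",
    semanticPath: "devenv / quality gates / check:quick",
    workload: "warm",
  },
  {
    id: "task_check_quick_forced",
    semanticPath: "devenv / quality gates / check:quick",
    workload: "forced",
  },
];

// Groups entries by semantic path for display while keeping each ID distinct,
// so a report can show both workloads side by side without merging them.
function groupBySemanticPath(list: WorkloadEntry[]): Map<string, WorkloadEntry[]> {
  const groups = new Map<string, WorkloadEntry[]>();
  for (const entry of list) {
    const bucket = groups.get(entry.semanticPath) ?? [];
    bucket.push(entry);
    groups.set(entry.semanticPath, bucket);
  }
  return groups;
}
```

Grouping the two entries yields a single report section with two rows, one per workload, so each delta is attributed to the cache condition that produced it.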

## Comparison Semantics

| Kind | Merge-gate mode | Evidence model |
| --------------- | --------------- | -------------------------------------------------- |
| `deterministic` | `budget` | Exact comparable value plus configured budget. |
| `wall-clock` | `paired` | Same-run base/head pairs and paired delta samples. |
| `wall-clock` | `historical` | Advisory trend context only. |
| `diagnostic` | `diagnostic` | Non-gating explanatory data. |

Wall-clock PR gates must not depend on historical timing alone. Historical
timing is useful for drift detection, A/A calibration, and dashboards, but it
does not prove PR causality.
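
The table can be read as a lookup from kind and mode to gating behavior. The sketch below encodes that reading; the function name is hypothetical, and rejecting combinations that the table does not list is an assumption rather than specified behavior.

```typescript
type Kind = "deterministic" | "wall-clock" | "diagnostic";
type Mode = "budget" | "paired" | "historical" | "diagnostic";

// Returns true only for kind/mode pairs that the table treats as merge-gating.
// Combinations not listed in the table throw, which is an assumption of this sketch.
function isMergeGating(kind: Kind, mode: Mode): boolean {
  if (kind === "deterministic" && mode === "budget") return true;
  if (kind === "wall-clock" && mode === "paired") return true;
  if (kind === "wall-clock" && mode === "historical") return false; // advisory trend context
  if (kind === "diagnostic" && mode === "diagnostic") return false; // explanatory data
  throw new Error(`unsupported combination: ${kind}/${mode}`);
}
```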

Paired wall-clock gates use nonparametric evidence by default:

```text
paired_delta_i = current_duration_i - baseline_duration_i
evidence_lower = quantile(paired_delta, pairedEvidenceQuantile)
evidence_upper = quantile(paired_delta, 1 - pairedEvidenceQuantile)
fail = evidence_lower > fail_budget
warn = evidence_lower > warn_budget
```

The engine may add bootstrap or permutation intervals for selected probes, but
it must keep the raw paired delta samples in the artifact so decisions remain
auditable.
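
The formulas above can be sketched in TypeScript as follows. This is a sketch under stated assumptions, not the engine's implementation: the quantile uses linear interpolation between order statistics (the spec does not name a quantile definition), and the type and function names are hypothetical. The raw paired deltas are returned alongside the decision so that it stays auditable.

```typescript
type GateStatus = "pass" | "warn" | "fail";

// Linear-interpolation quantile over a sorted copy of the samples.
// The interpolation method is an assumption; the spec only says "quantile".
function quantile(samples: number[], q: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (q < 0 || q > 1) throw new Error("q must be in [0, 1]");
  const sorted = [...samples].sort((a, b) => a - b);
  const position = (sorted.length - 1) * q;
  const lowerIndex = Math.floor(position);
  const upperIndex = Math.ceil(position);
  const fraction = position - lowerIndex;
  return sorted[lowerIndex] + (sorted[upperIndex] - sorted[lowerIndex]) * fraction;
}

interface PairedGateInput {
  currentDurations: number[];
  baselineDurations: number[]; // same length and order as currentDurations
  pairedEvidenceQuantile: number;
  warnBudget: number;
  failBudget: number;
}

interface PairedGateResult {
  pairedDeltas: number[];
  evidenceLower: number;
  evidenceUpper: number;
  status: GateStatus;
}

function evaluatePairedGate(input: PairedGateInput): PairedGateResult {
  const { currentDurations, baselineDurations } = input;
  if (currentDurations.length !== baselineDurations.length) {
    throw new Error("paired samples must have equal length");
  }
  // paired_delta_i = current_duration_i - baseline_duration_i
  const pairedDeltas = currentDurations.map((current, i) => current - baselineDurations[i]);
  const evidenceLower = quantile(pairedDeltas, input.pairedEvidenceQuantile);
  const evidenceUpper = quantile(pairedDeltas, 1 - input.pairedEvidenceQuantile);
  // Both thresholds test the lower evidence bound; fail takes precedence over warn.
  let status: GateStatus = "pass";
  if (evidenceLower > input.failBudget) status = "fail";
  else if (evidenceLower > input.warnBudget) status = "warn";
  return { pairedDeltas, evidenceLower, evidenceUpper, status };
}
```

Gating on the lower bound means a regression is only flagged when most paired deltas exceed the budget, which makes a single noisy pair unlikely to fail a PR on its own.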

## Native CLI Boundary

The long-term implementation should be a packaged `ci-measure` CLI.

```text
ci-measure validate --input measurements.json
ci-measure compare --current DIR --baseline DIR --output comparison.json
ci-measure render-comment --comparison comparison.json --output comment.md
ci-measure render-chart --comparison comparison.json --theme light --output chart.svg
ci-measure export-trends --comparison comparison.json --format bencher-json
```

Rust is the preferred implementation language for the engine because it gives:

- typed schemas for artifact compatibility;
- deterministic rendering without ad hoc heredocs;
- fast startup in generated CI workflows;
- property tests for policy classification;
- snapshot tests for Markdown/SVG output;
- a single packaged binary for all repos.

Shell remains appropriate for probe execution because probes invoke arbitrary
repo-local commands, Nix, devenv, and GitHub workflow primitives.

## External Tool Boundary

External tools may be exporters, not authorities.

| Tool class | Allowed role | Not allowed role |
| -------------------------- | ----------------------------------------- | -------------------------------------- |
| Bencher / trend stores | Historical storage, dashboards, alerting. | Primary PR gate for paired wall-clock. |
| CodSpeed-style instruments | Language-level benchmark suites. | Devenv/Nix shell gate replacement. |
| OTEL backends | Trace explanation and runner diagnostics. | Canonical numeric regression decision. |
| GitHub artifacts/comments | Current authoritative review projection. | Long-term statistical trend database. |

This keeps the merge contract under our control while still allowing the best
external system to own trend visualization or specialized microbenchmarking.

The Bencher experiment in
[ci-measurement-experiments.md](./ci-measurement-experiments.md) confirms this
boundary: Bencher is useful for historical storage and scalar threshold alerts,
but it does not natively gate on same-run paired base/head evidence.

## Rollout

1. Keep the current generated workflow behavior and comment shape stable.
2. Add schema fixtures from existing production `measurements.json` artifacts.
3. Implement `ci-measure compare` behind a workflow environment switch.
4. Run generated jq and native CLI comparisons side by side in CI.
5. Require byte-for-byte compatible `measurement-comparison.json` for existing
fixtures, except for intentional schema-version changes.
6. Move Markdown and SVG rendering into the native CLI after comparison parity.
7. Remove generated jq/Node snippets once all megarepo consumers use the CLI.

The branch-protection surface must keep the same job names during rollout.
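
The parity requirement in step 5 could be checked with a byte-level comparison like the sketch below. The function names are hypothetical, and how the two outputs are read from disk is left to the workflow. `asciiBytes` is a test helper that assumes ASCII-only input.

```typescript
// Returns the offset of the first differing byte, or -1 when both outputs
// are byte-identical. A length mismatch counts as a difference at the
// end of the shorter input.
function firstByteDifference(a: Uint8Array, b: Uint8Array): number {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a[i] !== b[i]) return i;
  }
  return a.length === b.length ? -1 : shared;
}

// Test helper: converts an ASCII-only string to bytes (assumption: no
// multi-byte characters in the fixture text).
function asciiBytes(text: string): Uint8Array {
  return Uint8Array.from(text, (ch) => ch.charCodeAt(0));
}

// Hypothetical fixture contents standing in for the jq and native CLI outputs.
const jqOutput = asciiBytes('{"status":"pass"}\n');
const cliOutput = asciiBytes('{"status":"pass"}\n');
const parityOffset = firstByteDifference(jqOutput, cliOutput);
```

Reporting the first differing offset, rather than a boolean, makes it easier to tell a trailing-newline mismatch from a real semantic divergence.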

## Open Questions

- **DQ1 Bootstrap intervals:** Which probes are valuable enough to pay for
bootstrap or permutation intervals instead of quantile evidence?
- **DQ2 Trend backend:** Should historical trend export target Bencher, an
object-store-backed JSON index, Prometheus/OTEL metrics, or more than one?
- **DQ3 Registry location:** Should shared registry entries live in effect-utils
source, generated repo config, or both?
- **DQ4 Calibration lane:** Which repos should run scheduled A/A and injected
regression calibration first?
97 changes: 97 additions & 0 deletions context/ci-measurement-experiments.md
@@ -0,0 +1,97 @@
# CI Measurement Experiments

This document records experiments that inform
[ci-measurement-engine.md](./ci-measurement-engine.md).

## Bencher Fit Experiment

Date: 2026-05-19.

Purpose: evaluate whether Bencher should replace or complement the
GitHub-native CI measurement gate.

### Setup

The experiment used a local self-hosted Bencher instance and synthetic metrics
that mimic our current measurement families:

- wall-clock duration;
- deterministic Nix closure size;
- deterministic store path count;
- diagnostic counters.

Commands exercised:

```bash
docker run --rm ghcr.io/bencherdev/bencher --version

bencher up --detach --pull missing \
  --console-port 33080 \
  --api-port 61018 \
  --console-env BENCHER_API_URL=http://localhost:61018

bencher run --host http://localhost:61018 \
  --project effect-utils-ci-measurements \
  --branch main \
  --testbed github-ubuntu-latest \
  --adapter json \
  --file measurements-base.json \
  --format json

bencher run --host http://localhost:61018 \
  --project effect-utils-ci-measurements \
  --branch pr-658 \
  --start-point main \
  --start-point-clone-thresholds \
  --start-point-reset \
  --testbed github-ubuntu-latest \
  --error-on-alert \
  --adapter json \
  --file measurements-head.json \
  --format json
```

### Findings

Bencher worked well for:

- storing historical benchmark rows by project, branch, testbed, benchmark,
and measure;
- cloning thresholds from a main start point into a PR branch;
- failing CI through `--error-on-alert`;
- percentage thresholds for coarse performance trend alerts;
- static thresholds for simple absolute deterministic budgets;
- multi-measure reports through Bencher Metric Format JSON;
- local self-hosting through Docker.

Bencher did not model our primary wall-clock gate:

- same-run base/head paired samples are not first-class;
- multiple files in one report become iterations, not paired comparisons;
- alerting compares scalar metric values against thresholds;
- stored lower/upper metric fields are not treated as paired evidence
intervals for gating;
- comments and checks would be Bencher-shaped alerts, not our semantic PR
report with paired `n` and delta evidence intervals.

### Decision

Bencher is not the authority for PR merge gates.

Allowed use:

- optional trend backend;
- historical dashboards;
- coarse scheduled alerts;
- export target for already-computed metrics, including paired summary metrics
and deterministic budget ratios.

Disallowed use:

- replacing the GitHub-native PR comment;
- replacing paired wall-clock gate decisions;
- replacing deterministic budget evaluation when budgets are metric-specific.

The native `ci-measure` engine should own gate semantics. A future Bencher
exporter can publish selected observations after `ci-measure compare` has
produced the authoritative decision.
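
A future exporter might look like the sketch below, which projects already-computed comparison rows into Bencher Metric Format JSON (benchmark name, then measure name, then `value` with optional `lower_value` and `upper_value`). The `ComparisonRow` shape, the function name, and the sample numbers are assumptions; no such exporter exists in this change. The exported bounds are informational only, since the gate decision has already been made by the time the export runs.

```typescript
// Hypothetical row shape taken from an already-computed comparison.
interface ComparisonRow {
  id: string; // measurement ID, used as the benchmark name
  measure: string; // Bencher measure name or slug, for example "latency"
  value: number;
  evidenceLower?: number;
  evidenceUpper?: number;
}

interface BmfMetric {
  value: number;
  lower_value?: number;
  upper_value?: number;
}

type BmfReport = Record<string, Record<string, BmfMetric>>;

// Projects comparison rows into a benchmark -> measure -> metric mapping.
function toBencherMetricFormat(rows: ComparisonRow[]): BmfReport {
  const report: BmfReport = {};
  for (const row of rows) {
    const metric: BmfMetric = { value: row.value };
    if (row.evidenceLower !== undefined) metric.lower_value = row.evidenceLower;
    if (row.evidenceUpper !== undefined) metric.upper_value = row.evidenceUpper;
    const measures = report[row.id] ?? {};
    measures[row.measure] = metric;
    report[row.id] = measures;
  }
  return report;
}

// Invented sample values for illustration only.
const exportedReport = toBencherMetricFormat([
  { id: "task_check_quick_warm", measure: "latency", value: 120, evidenceLower: 110, evidenceUpper: 135 },
  { id: "task_check_quick_forced", measure: "latency", value: 900 },
]);
```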