feat(verification): distributional V1 graders + rolling baseline (clinical eval) #6

Draft
jim4226 wants to merge 2 commits into main from claude/link-evals-response-loop-YKoOo

@jim4226 jim4226 commented May 20, 2026

Summary

Adds a second flavor of V1 grader for domains where the "good" criterion is a scalar metric over a held-out sample (Dice score, anatomical landmark RMSE, Hausdorff95, boundary F-score) rather than a binary pass/fail check — and links it to the loop via a persisted rolling baseline so the threshold tightens with each promoted artifact.

Prompted by a follow-up exchange with Michael Cohen on graders for outcomes-based eval in clinical workflows. Draft response in brain/responses/01-michael-reply-distributional-graders.md.

What's new

  • csis/verification/distributional.py — three types (DistributionalSample, RollingBaseline, DistributionalThreshold) plus a distributional_grader(name=..., threshold=..., baseline=...) factory that pins into the same GraderRegistry the existing V1 graders use. F6 (pinned-source-hash drift), cross-checkpoint cert signing, V2 critic, and auditor why-doc all keep working unmodified.
  • make_clinical_imaging_registry() — four-metric V1 set tuned for the 2D-X-ray-to-3D-bone-reconstruction problem at sub-mm precision (dice_score, boundary_f1, landmark_rmse, hausdorff_95), each with (floor, op, summary_stat, min_samples, max_regression, regression_stat).
  • update_baselines_after_promotion() — the explicit link from this iteration's verified gain to the next iteration's threshold. Daemon calls it after coord.run_iteration() returns outcome == "promoted"; the next iteration's grader reads the updated JSON cold.
  • brain/research/02-distributional-graders.md — writeup mapping the pattern to Open Research Question #1 ("Verifier-grade for fuzzy domains") and V5 calibration.
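
A minimal sketch of how the persisted baseline and the promotion hook could fit together. This is not the code in csis/verification/distributional.py: the class name `RollingBaselineSketch`, its fields, the JSON layout, and the `update_after_promotion` signature are assumptions made for illustration. Only the "promoted" outcome string and the window / JSON / cold-load behavior come from the description above.

```python
import json
import statistics
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class RollingBaselineSketch:
    """Hypothetical stand-in for RollingBaseline; names and layout are assumed."""
    path: Path
    window: int = 5
    values: List[float] = field(default_factory=list)

    @classmethod
    def load(cls, path: Path, window: int = 5) -> "RollingBaselineSketch":
        # Cold-load: a fresh process sees only what was persisted to JSON.
        values = json.loads(path.read_text())["values"] if path.exists() else []
        return cls(path=path, window=window, values=values[-window:])

    def append(self, value: float) -> None:
        # Keep only the most recent `window` promoted iterations, then persist.
        self.values = (self.values + [value])[-self.window:]
        self.path.write_text(json.dumps({"values": self.values}))

    def watermark(self) -> Optional[float]:
        # Median of recent per-iteration stats; None means "no baseline yet".
        return statistics.median(self.values) if self.values else None


def update_after_promotion(
    outcome: str, baseline: RollingBaselineSketch, regression_stat: float
) -> bool:
    """Append to the baseline only when the iteration was promoted."""
    if outcome != "promoted":
        return False
    baseline.append(regression_stat)
    return True
```

Because the file is rewritten only on a promoted outcome, a rejected iteration leaves the next iteration's watermark untouched.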

The pass criterion that matters

A grader passes iff:

  1. summary_stat of the sample (mean, median, p10, p90 — configurable) clears floor under op.
  2. AND min_samples is met (defends against zero-sample fake-passes).
  3. AND (if baseline + max_regression are set) the sample's regression_stat is not worse than the median of recent per-iteration regression_stats by more than max_regression. Median-of-tail-watermarks is robust to a single outlier promoted iteration.

The closure does NOT mutate the baseline at evaluate time: updating the baseline before the promotion decision would let a failing artifact contaminate the very watermark its successor will be measured against.
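
The three conditions can be sketched as a single pure function. This is an illustrative reading of the criterion, not the PR's implementation: the function name `passes`, the linear-interpolation percentile, and the exact parameter shapes are assumptions; the parameter names mirror the tuple (floor, op, summary_stat, min_samples, max_regression, regression_stat) listed above.

```python
import math
import operator
import statistics
from typing import Optional, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation (one of several common definitions)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


STATS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "p10": lambda v: percentile(v, 10),
    "p90": lambda v: percentile(v, 90),
}
OPS = {">=": operator.ge, "<=": operator.le}


def passes(
    sample: Sequence[float],
    *,
    floor: float,
    op: str = ">=",
    summary_stat: str = "mean",
    min_samples: int = 1,
    baseline_stats: Optional[Sequence[float]] = None,
    max_regression: Optional[float] = None,
    regression_stat: str = "p10",
) -> bool:
    # Condition 2 first, so an empty sample never reaches the statistics.
    if len(sample) < max(min_samples, 1):
        return False
    # Condition 1: the summary statistic clears the floor under `op`.
    if not OPS[op](STATS[summary_stat](sample), floor):
        return False
    # Condition 3: compare against the median of recent per-iteration stats.
    if baseline_stats and max_regression is not None:
        watermark = statistics.median(baseline_stats)
        current = STATS[regression_stat](sample)
        # "Worse" means lower for ">=" metrics and higher for "<=" metrics.
        worsening = watermark - current if op == ">=" else current - watermark
        if worsening > max_regression:
            return False
    return True
```

The baseline is read but never written here, which keeps evaluation side-effect free; with `baseline_stats=[0.85, 0.86, 0.20]` the watermark is 0.85, so the single outlier iteration does not loosen the check.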

Test plan

  • 20 new tests in tests/test_distributional_graders.py
  • Summary statistics correctness (mean, median, p10, p90, single-value edge case)
  • Rolling-baseline window enforcement + JSON persistence (write → fresh process → cold-load)
  • Median-of-p10s robust to a single outlier promoted iteration
  • Higher-is-better and lower-is-better threshold paths
  • Silent-regression case (mean still clears floor, p10 collapsed → regression check fails)
  • "No baseline yet" first-iteration case
  • F6 pinning + drift detection holds with distributional graders in the registry
  • Clinical-imaging registry happy path + sub-mm-precision-broken failure case
  • End-to-end: Coordinator runs the full 8-step loop with the clinical registry, the cert carries every distributional grader's full metrics dict, the artifact promotes, the baseline updates, and the next iteration's grader reads the tighter watermark.
  • Full suite: 237 passing (217 → 237)

https://claude.ai/code/session_01XesuxhRkBxEyHGGY4GzqdX


Generated by Claude Code

claude added 2 commits May 20, 2026 14:02
…e-Vision-style clinical eval)

Adds a second flavor of V1 grader for domains where the "good" criterion
is a scalar metric over a held-out sample (Dice score, anatomical
landmark RMSE, Hausdorff95, boundary F-score) rather than a binary
pass/fail check. Pass criterion = "summary statistic clears a threshold
AND the lower-tail watermark does not regress vs the rolling baseline of
recent promoted artifacts." Composes with the existing GraderRegistry —
F6 pinning, cross-checkpoint cert signing, V2 critic, and auditor
why-doc all keep working unmodified.

Links to the loop via update_baselines_after_promotion(): each promoted
artifact appends summary stats to a per-metric RollingBaseline persisted
as JSON, so the next iteration's grader reads a tighter watermark. This
unblocks Open Research Question #1 ("Verifier-grade for fuzzy domains")
for the distributional case; V5 calibration (Phase 2) will read the same
persisted series.

Includes a ready-to-use make_clinical_imaging_registry() tuned for the
2D-X-ray-to-3D-bone-reconstruction problem at sub-mm precision, an
end-to-end Coordinator integration test, and a research writeup +
response draft for the conversation that prompted this.

20 new tests; full suite 237 passing.

https://claude.ai/code/session_01XesuxhRkBxEyHGGY4GzqdX
Refactor the distributional grader factory so clinical imaging is one of
several preset spec packs, not a special case. Adds a generic builder
make_distributional_registry(specs, baseline_root, window) plus a
DistributionalGraderSpec dataclass that bundles (name, threshold).

Six preset packs ship in csis/verification/distributional.py with
convenience factories around the generic builder:

  - CLINICAL_IMAGING_SPECS / make_clinical_imaging_registry
      dice_score, boundary_f1, landmark_rmse, hausdorff_95
  - SEARCH_RANKING_SPECS / make_search_ranking_registry
      ndcg_at_10, mrr, recall_at_10
  - FORECASTING_SPECS / make_forecasting_registry
      mae, crps, pinball_loss_p90
  - CLASSIFICATION_SPECS / make_classification_registry
      f1_macro, roc_auc, precision_at_recall_90
  - LLM_EVAL_SPECS / make_llm_eval_registry
      pass_at_1, win_rate_vs_baseline, judge_score
  - RECOMMENDATION_SPECS / make_recommendation_registry
      hit_at_10, ndcg_at_10, catalog_coverage
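
One way to picture the spec-pack pattern is the sketch below. Everything in it is hypothetical: `ThresholdSketch`, `SpecSketch`, and `make_registry_sketch` are reduced stand-ins for DistributionalThreshold, DistributionalGraderSpec, and make_distributional_registry, the registry is a plain dict rather than a GraderRegistry, and the floors are placeholder numbers rather than the values shipped in the preset packs.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Callable, Dict, Iterable, Sequence

Grader = Callable[[Sequence[float]], bool]


@dataclass(frozen=True)
class ThresholdSketch:
    """Hypothetical, reduced stand-in for DistributionalThreshold."""
    floor: float
    op: str = ">="  # ">=" for higher-is-better, "<=" for lower-is-better


@dataclass(frozen=True)
class SpecSketch:
    """Hypothetical stand-in for DistributionalGraderSpec: (name, threshold)."""
    name: str
    threshold: ThresholdSketch


def _grader_for(threshold: ThresholdSketch) -> Grader:
    # Build each closure in its own scope so it captures its own threshold
    # rather than the builder's loop variable.
    def grade(sample: Sequence[float]) -> bool:
        if not sample:
            return False
        value = fmean(sample)
        if threshold.op == ">=":
            return value >= threshold.floor
        return value <= threshold.floor
    return grade


def make_registry_sketch(specs: Iterable[SpecSketch]) -> Dict[str, Grader]:
    registry: Dict[str, Grader] = {}
    for spec in specs:
        if spec.name in registry:
            raise ValueError(f"duplicate grader name: {spec.name}")
        registry[spec.name] = _grader_for(spec.threshold)
    return registry


# Placeholder floors, chosen only for illustration.
SEARCH_SKETCH = [
    SpecSketch("ndcg_at_10", ThresholdSketch(floor=0.50)),
    SpecSketch("mrr", ThresholdSketch(floor=0.40)),
]
FORECAST_SKETCH = [SpecSketch("mae", ThresholdSketch(floor=2.0, op="<="))]
```

Under this shape a domain is just a list of specs, so adding a new pack means adding data rather than another special-cased factory.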

The same DistributionalThreshold primitive handles continuous Dice
scores AND binary LLM win-rates without branching — for binary metrics
the regression watermark is mean (per-prompt p10 is degenerate on 0/1
data), expressed declaratively via regression_stat="mean".
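
A small worked example of why p10 is degenerate on 0/1 data. The win/loss vectors are made up for illustration, and the nearest-rank percentile used here is an assumption about the definition (the PR does not say which one it uses); any common definition gives the same result on these inputs.

```python
import math
from statistics import fmean
from typing import Sequence


def p10_nearest_rank(values: Sequence[float]) -> float:
    """10th percentile by nearest rank: the value at position ceil(0.10 * n)."""
    ordered = sorted(values)
    k = max(math.ceil(0.10 * len(ordered)), 1)
    return ordered[k - 1]


# Hypothetical per-prompt outcomes: 1 = win, 0 = loss.
wins_before = [1] * 70 + [0] * 30  # 70% win rate
wins_after = [1] * 55 + [0] * 45   # 55% win rate: a clear regression

# p10 is 0 for both, because more than 10% of prompts are losses either way.
p10_before = p10_nearest_rank(wins_before)
p10_after = p10_nearest_rank(wins_after)

# The mean moves from 0.70 to 0.55, so it is the stat that sees the drop.
mean_drop = fmean(wins_before) - fmean(wins_after)
```

On binary outcomes p10 only leaves 0 once the win rate exceeds roughly 90%, so it cannot serve as a regression watermark below that level; the mean changes with every flipped prompt.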

10 new tests bring the file to 30 (suite 247). New coverage:
  - generic builder accepts any spec list
  - each domain registry happy-path
  - search ranking floor-below-threshold failure
  - forecasting lower-is-better p90 regression check (op inverts)
  - LLM eval win-rate regression-against-baseline (binary-metric path)
  - structural invariant: every preset registry pins cleanly and the
    F6 drift check holds across all six

Writeup and Michael reply draft updated to lead with the cross-domain
framing instead of clinical-specific.

https://claude.ai/code/session_01XesuxhRkBxEyHGGY4GzqdX