feat(verification): distributional V1 graders + rolling baseline (clinical eval) by jim4226 · Pull Request #6 · jim4226/CSIS

jim4226 · 2026-05-20T14:03:19Z

Summary

Adds a second flavor of V1 grader for domains where the "good" criterion is a scalar metric over a held-out sample (Dice score, anatomical landmark RMSE, Hausdorff95, boundary F-score) rather than a binary pass/fail check. The grader is linked to the loop through a persisted rolling baseline, so the threshold tightens with each promoted artifact.

Prompted by a follow-up exchange with Michael Cohen on graders for outcomes-based eval in clinical workflows. Draft response in brain/responses/01-michael-reply-distributional-graders.md.

What's new

csis/verification/distributional.py — three types (DistributionalSample, RollingBaseline, DistributionalThreshold) plus a distributional_grader(name=..., threshold=..., baseline=...) factory that pins into the same GraderRegistry the existing V1 graders use. F6 (pinned-source-hash drift), cross-checkpoint cert signing, V2 critic, and auditor why-doc all keep working unmodified.
make_clinical_imaging_registry() — four-metric V1 set tuned for the 2D-X-ray-to-3D-bone-reconstruction problem at sub-mm precision (dice_score, boundary_f1, landmark_rmse, hausdorff_95), each with (floor, op, summary_stat, min_samples, max_regression, regression_stat).
update_baselines_after_promotion() — the explicit link from this iteration's verified gain to the next iteration's threshold. Daemon calls it after coord.run_iteration() returns outcome == "promoted"; the next iteration's grader reads the updated JSON cold.
brain/research/02-distributional-graders.md: writeup mapping the pattern to Open Research Question #1 ("Verifier-grade for fuzzy domains") and V5 calibration.

The pass criterion that matters

A grader passes iff:

summary_stat of the sample (mean, median, p10, p90 — configurable) clears floor under op.
AND min_samples is met (defends against zero-sample fake-passes).
AND (if baseline + max_regression are set) the sample's regression_stat is not worse than the median of recent per-iteration regression_stats by more than max_regression. Median-of-tail-watermarks is robust to a single outlier promoted iteration.
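The three conditions above can be sketched as a single predicate. This is a minimal illustration, not the code in csis/verification/distributional.py: `SketchThreshold`, `passes`, and `percentile` are hypothetical names, the field names are borrowed from the tuple listed under "What's new", and the percentile interpolation and the way `op` flips the "worse" direction are assumptions.

```python
import operator
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile, q in [0, 100] (assumed convention)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


STATS = {
    "mean": mean,
    "median": median,
    "p10": lambda v: percentile(v, 10),
    "p90": lambda v: percentile(v, 90),
}
OPS = {">=": operator.ge, "<=": operator.le}


@dataclass(frozen=True)
class SketchThreshold:
    """Hypothetical stand-in for DistributionalThreshold."""
    floor: float
    op: str = ">="                      # ">=" higher is better, "<=" lower is better
    summary_stat: str = "mean"
    min_samples: int = 1
    max_regression: Optional[float] = None
    regression_stat: str = "p10"


def passes(sample: Sequence[float], t: SketchThreshold,
           baseline_watermarks: Sequence[float] = ()) -> bool:
    # Condition 2 first: an empty or undersized sample can never pass.
    if len(sample) < max(t.min_samples, 1):
        return False
    # Condition 1: the summary statistic clears the floor under op.
    if not OPS[t.op](STATS[t.summary_stat](sample), t.floor):
        return False
    # Condition 3: compare against the median of recent watermarks.
    if t.max_regression is not None and baseline_watermarks:
        current = STATS[t.regression_stat](sample)
        reference = median(baseline_watermarks)
        # "Worse" means lower for ">=" metrics and higher for "<=" metrics.
        worse_by = reference - current if t.op == ">=" else current - reference
        if worse_by > t.max_regression:
            return False
    return True


dice = [0.90, 0.92, 0.88, 0.91, 0.93]
t = SketchThreshold(floor=0.85, min_samples=5, max_regression=0.02)
# One outlier watermark (0.70) does not move the median reference (0.89).
print(passes(dice, t, [0.89, 0.90, 0.70]))
```

Note that the function only reads `baseline_watermarks`; nothing is written back during evaluation.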

The closure does NOT mutate the baseline at evaluate time — promoting the baseline pre-decision would let a failing artifact contaminate the very watermark its successor will be measured against.
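To make the ordering concrete, here is a hedged sketch of a JSON-persisted rolling baseline that is appended to only after a promotion. `SketchRollingBaseline`, `finish_iteration`, the `{"watermarks": [...]}` file layout, and the default window are all assumptions for illustration; the real RollingBaseline and update_baselines_after_promotion() may differ.

```python
import json
import tempfile
from pathlib import Path
from statistics import median
from typing import List, Optional


class SketchRollingBaseline:
    """Hypothetical windowed series of per-iteration watermarks, stored as JSON."""

    def __init__(self, path: Path, window: int = 5) -> None:
        self.path = Path(path)
        self.window = window

    def load(self) -> List[float]:
        # Read from disk on every call, so a fresh process sees the same state.
        if not self.path.exists():
            return []
        return json.loads(self.path.read_text())["watermarks"]

    def reference(self) -> Optional[float]:
        # Median of the recent watermarks; None until something is promoted.
        values = self.load()
        return median(values) if values else None

    def append(self, watermark: float) -> None:
        # Keep only the most recent `window` entries.
        values = (self.load() + [watermark])[-self.window:]
        self.path.write_text(json.dumps({"watermarks": values}))


def finish_iteration(outcome: str, watermark: float,
                     baseline: SketchRollingBaseline) -> None:
    # The baseline changes only after the promotion decision has been made.
    if outcome == "promoted":
        baseline.append(watermark)


with tempfile.TemporaryDirectory() as d:
    baseline = SketchRollingBaseline(Path(d) / "dice_score.json", window=3)
    finish_iteration("rejected", 0.50, baseline)   # leaves the file untouched
    for w in (0.88, 0.90, 0.89):
        finish_iteration("promoted", w, baseline)
    print(baseline.reference())
```

Because a rejected iteration never reaches `append`, a failing artifact's low watermark cannot loosen the reference its successor is compared against.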

Test plan

https://claude.ai/code/session_01XesuxhRkBxEyHGGY4GzqdX

Generated by Claude Code

…e-Vision-style clinical eval)

Adds a second flavor of V1 grader for domains where the "good" criterion is a scalar metric over a held-out sample (Dice score, anatomical landmark RMSE, Hausdorff95, boundary F-score) rather than a binary pass/fail check. Pass criterion = "summary statistic clears a threshold AND the lower-tail watermark does not regress vs the rolling baseline of recent promoted artifacts."

Composes with the existing GraderRegistry: F6 pinning, cross-checkpoint cert signing, V2 critic, and auditor why-doc all keep working unmodified.

Links to the loop via update_baselines_after_promotion(): each promoted artifact appends summary stats to a per-metric RollingBaseline persisted as JSON, so the next iteration's grader reads a tighter watermark. This unblocks Open Research Question #1 ("Verifier-grade for fuzzy domains") for the distributional case; V5 calibration (Phase 2) will read the same persisted series.

Includes a ready-to-use make_clinical_imaging_registry() tuned for the 2D-X-ray-to-3D-bone-reconstruction problem at sub-mm precision, an end-to-end Coordinator integration test, and a research writeup + response draft for the conversation that prompted this.

20 new tests; full suite 237 passing.

https://claude.ai/code/session_01XesuxhRkBxEyHGGY4GzqdX

Refactor the distributional grader factory so clinical imaging is one of several preset spec packs, not a special case.

Adds a generic builder make_distributional_registry(specs, baseline_root, window) plus a DistributionalGraderSpec dataclass that bundles (name, threshold). Six preset packs ship in csis/verification/distributional.py with convenience factories around the generic builder:

- CLINICAL_IMAGING_SPECS / make_clinical_imaging_registry: dice_score, boundary_f1, landmark_rmse, hausdorff_95
- SEARCH_RANKING_SPECS / make_search_ranking_registry: ndcg_at_10, mrr, recall_at_10
- FORECASTING_SPECS / make_forecasting_registry: mae, crps, pinball_loss_p90
- CLASSIFICATION_SPECS / make_classification_registry: f1_macro, roc_auc, precision_at_recall_90
- LLM_EVAL_SPECS / make_llm_eval_registry: pass_at_1, win_rate_vs_baseline, judge_score
- RECOMMENDATION_SPECS / make_recommendation_registry: hit_at_10, ndcg_at_10, catalog_coverage

The same DistributionalThreshold primitive handles continuous Dice scores AND binary LLM win-rates without branching. For binary metrics the regression watermark is the mean (per-prompt p10 is degenerate on 0/1 data), expressed declaratively via regression_stat="mean".

10 new tests bring the file to 30 (suite 247). New coverage:

- generic builder accepts any spec list
- each domain registry happy-path
- search ranking floor-below-threshold failure
- forecasting lower-is-better p90 regression check (op inverts)
- LLM eval win-rate regression-against-baseline (binary-metric path)
- structural invariant: every preset registry pins cleanly and the F6 drift check holds across all six

Writeup and Michael reply draft updated to lead with the cross-domain framing instead of clinical-specific.

https://claude.ai/code/session_01XesuxhRkBxEyHGGY4GzqdX
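A small illustration of why p10 is degenerate on 0/1 data, as the commit message notes. The helper `p10` and the two synthetic win/loss samples are made up for this sketch (they are not project data), and the interpolation convention is an assumption.

```python
from statistics import mean
from typing import Sequence


def p10(values: Sequence[float]) -> float:
    """Linearly interpolated 10th percentile (assumed convention)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * 0.10
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


# Synthetic per-prompt outcomes: 1 = win, 0 = loss.
weaker = [1] * 60 + [0] * 40     # 60% win rate
stronger = [1] * 80 + [0] * 20   # 80% win rate

# Both samples contain well over 10% losses, so the 10th percentile is 0
# for each and cannot tell a regression from an improvement.
print(p10(weaker), p10(stronger))

# The mean is the win rate itself, so it does separate the two samples.
print(mean(weaker), mean(stronger))
```

On continuous metrics such as Dice, the lower-tail percentile still carries information, which is presumably why the watermark statistic is configurable per metric rather than fixed.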

claude added 2 commits May 20, 2026 14:02

jim4226 wants to merge 2 commits into main from claude/link-evals-response-loop-YKoOo
