feat(verification): distributional V1 graders + rolling baseline (clinical eval) #6
Draft
jim4226 wants to merge 2 commits into
…e-Vision-style clinical eval)
Adds a second flavor of V1 grader for domains where the "good" criterion is a scalar metric over a held-out sample (Dice score, anatomical landmark RMSE, Hausdorff95, boundary F-score) rather than a binary pass/fail check. Pass criterion = "summary statistic clears a threshold AND the lower-tail watermark does not regress vs the rolling baseline of recent promoted artifacts."
Composes with the existing GraderRegistry: F6 pinning, cross-checkpoint cert signing, V2 critic, and auditor why-doc all keep working unmodified.
Links to the loop via update_baselines_after_promotion(): each promoted artifact appends summary stats to a per-metric RollingBaseline persisted as JSON, so the next iteration's grader reads a tighter watermark.
This unblocks Open Research Question #1 ("Verifier-grade for fuzzy domains") for the distributional case; V5 calibration (Phase 2) will read the same persisted series.
Includes a ready-to-use make_clinical_imaging_registry() tuned for the 2D-X-ray-to-3D-bone-reconstruction problem at sub-mm precision, an end-to-end Coordinator integration test, and a research writeup + response draft for the conversation that prompted this.
20 new tests; full suite 237 passing.
https://claude.ai/code/session_01XesuxhRkBxEyHGGY4GzqdX
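The promotion-to-baseline link described in this commit message can be pictured with a minimal sketch. Everything here is an assumption for illustration: `RollingBaselineSketch`, `update_after_promotion`, the file name, and the flat-list JSON layout are hypothetical stand-ins, not the actual `RollingBaseline` or `update_baselines_after_promotion()` from csis/verification/distributional.py.

```python
import json
import statistics
import tempfile
from pathlib import Path
from typing import Optional


class RollingBaselineSketch:
    """Hypothetical stand-in for RollingBaseline: a JSON-backed window of stats."""

    def __init__(self, path: Path, window: int = 5) -> None:
        self.path = path
        self.window = window

    def load(self) -> list:
        # A missing file means no artifact has been promoted yet.
        if not self.path.exists():
            return []
        return json.loads(self.path.read_text())

    def append(self, value: float) -> None:
        # Keep only the most recent `window` entries, then persist.
        series = (self.load() + [value])[-self.window:]
        self.path.write_text(json.dumps(series))

    def watermark(self) -> Optional[float]:
        # Median of the retained per-iteration stats, or None if empty.
        series = self.load()
        return statistics.median(series) if series else None


def update_after_promotion(outcome: str, baseline: RollingBaselineSketch,
                           stat: float) -> bool:
    """Append only when the iteration was promoted; report whether it did."""
    if outcome != "promoted":
        return False
    baseline.append(stat)
    return True


with tempfile.TemporaryDirectory() as root:
    baseline = RollingBaselineSketch(Path(root) / "dice_score.json", window=3)
    iterations = [("promoted", 0.80), ("rejected", 0.10), ("promoted", 0.84),
                  ("promoted", 0.82), ("promoted", 0.86)]
    for outcome, p10_stat in iterations:
        update_after_promotion(outcome, baseline, p10_stat)
    final_series = baseline.load()
    final_watermark = baseline.watermark()
```

In this sketch the rejected iteration never touches the file, and the window of 3 drops the oldest promoted value, so the next grader would read a watermark computed only from recent promoted artifacts.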
Refactor the distributional grader factory so clinical imaging is one of
several preset spec packs, not a special case. Adds a generic builder
make_distributional_registry(specs, baseline_root, window) plus a
DistributionalGraderSpec dataclass that bundles (name, threshold).
Six preset packs ship in csis/verification/distributional.py with
convenience factories around the generic builder:
- CLINICAL_IMAGING_SPECS / make_clinical_imaging_registry
  dice_score, boundary_f1, landmark_rmse, hausdorff_95
- SEARCH_RANKING_SPECS / make_search_ranking_registry
  ndcg_at_10, mrr, recall_at_10
- FORECASTING_SPECS / make_forecasting_registry
  mae, crps, pinball_loss_p90
- CLASSIFICATION_SPECS / make_classification_registry
  f1_macro, roc_auc, precision_at_recall_90
- LLM_EVAL_SPECS / make_llm_eval_registry
  pass_at_1, win_rate_vs_baseline, judge_score
- RECOMMENDATION_SPECS / make_recommendation_registry
  hit_at_10, ndcg_at_10, catalog_coverage
The same DistributionalThreshold primitive handles continuous Dice
scores AND binary LLM win-rates without branching — for binary metrics
the regression watermark is mean (per-prompt p10 is degenerate on 0/1
data), expressed declaratively via regression_stat="mean".
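A small illustration of why p10 is degenerate on 0/1 data, assuming a nearest-rank percentile (the percentile convention actually used by DistributionalThreshold is not stated here, so `p10` below is a hypothetical helper):

```python
import statistics


def p10(values: list) -> float:
    """10th percentile by nearest rank on the sorted sample (assumed convention)."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) // 10))  # ceil(n / 10), at least 1
    return ordered[rank - 1]


# Two binary win/loss samples with clearly different win rates.
weak = [1.0] * 12 + [0.0] * 8     # 60% wins
strong = [1.0] * 16 + [0.0] * 4   # 80% wins

weak_p10, strong_p10 = p10(weak), p10(strong)
weak_mean, strong_mean = statistics.mean(weak), statistics.mean(strong)
```

Under this convention both samples have a p10 of 0.0, because more than 10% of the outcomes are losses in each, so a p10 watermark cannot tell them apart. The means (0.6 and 0.8) still separate them, which is the motivation for regression_stat="mean" on binary metrics.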
10 new tests bring the file to 30 (suite 247). New coverage:
- generic builder accepts any spec list
- each domain registry happy-path
- search ranking floor-below-threshold failure
- forecasting lower-is-better p90 regression check (op inverts)
- LLM eval win-rate regression-against-baseline (binary-metric path)
- structural invariant: every preset registry pins cleanly and the
  F6 drift check holds across all six
Writeup and Michael reply draft updated to lead with the cross-domain
framing instead of clinical-specific.
https://claude.ai/code/session_01XesuxhRkBxEyHGGY4GzqdX
Summary
Adds a second flavor of V1 grader for domains where the "good" criterion is a scalar metric over a held-out sample (Dice score, anatomical landmark RMSE, Hausdorff95, boundary F-score) rather than a binary pass/fail check — and links it to the loop via a persisted rolling baseline so the threshold tightens with each promoted artifact.
Prompted by a follow-up exchange with Michael Cohen on graders for outcomes-based eval in clinical workflows. Draft response in brain/responses/01-michael-reply-distributional-graders.md.
What's new
- csis/verification/distributional.py: three types (DistributionalSample, RollingBaseline, DistributionalThreshold) plus a distributional_grader(name=..., threshold=..., baseline=...) factory that pins into the same GraderRegistry the existing V1 graders use. F6 (pinned-source-hash drift), cross-checkpoint cert signing, V2 critic, and auditor why-doc all keep working unmodified.
- make_clinical_imaging_registry(): four-metric V1 set tuned for the 2D-X-ray-to-3D-bone-reconstruction problem at sub-mm precision (dice_score, boundary_f1, landmark_rmse, hausdorff_95), each with (floor, op, summary_stat, min_samples, max_regression, regression_stat).
- update_baselines_after_promotion(): the explicit link from this iteration's verified gain to the next iteration's threshold. The daemon calls it after coord.run_iteration() returns outcome == "promoted"; the next iteration's grader reads the updated JSON cold.
- brain/research/02-distributional-graders.md: writeup mapping the pattern to Open Research Question #1 ("Verifier-grade for fuzzy domains") and V5 calibration.
The pass criterion that matters
A grader passes iff:
- summary_stat of the sample (mean, median, p10, p90; configurable) clears floor under op.
- min_samples is met (defends against zero-sample fake-passes).
- (If baseline and max_regression are set) the sample's regression_stat is not worse than the median of recent per-iteration regression_stats by more than max_regression. Median-of-tail-watermarks is robust to a single outlier promoted iteration.
The closure does NOT mutate the baseline at evaluate time: updating the baseline before the promotion decision would let a failing artifact contaminate the very watermark its successor will be measured against.
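The three conditions can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code in csis/verification/distributional.py: `ThresholdSketch`, `passes`, and `percentile` are hypothetical names, the percentile uses nearest rank, and the direction of "worse" is inferred from op (operator.ge for higher-is-better, operator.le for lower-is-better).

```python
import operator
import statistics
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


def percentile(xs: Sequence[float], q: int) -> float:
    """Nearest-rank percentile (assumed convention); q is an integer 1..100."""
    ordered = sorted(xs)
    rank = max(1, -(-q * len(ordered) // 100))  # ceil(q * n / 100)
    return ordered[rank - 1]


STATS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "p10": lambda xs: percentile(xs, 10),
    "p90": lambda xs: percentile(xs, 90),
}


@dataclass(frozen=True)
class ThresholdSketch:
    """Hypothetical mirror of the (floor, op, ...) tuple in the description."""
    floor: float
    op: Callable[[float, float], bool]
    summary_stat: str = "mean"
    min_samples: int = 1
    max_regression: Optional[float] = None
    regression_stat: str = "p10"


def passes(sample: Sequence[float], t: ThresholdSketch,
           history: Optional[Sequence[float]] = None) -> bool:
    # Condition 2: too few samples can never pass.
    if len(sample) < t.min_samples:
        return False
    # Condition 1: the summary statistic must clear the floor under op.
    if not t.op(STATS[t.summary_stat](sample), t.floor):
        return False
    # Condition 3 applies only when a baseline and max_regression are set.
    if t.max_regression is None or not history:
        return True
    watermark = statistics.median(history)
    current = STATS[t.regression_stat](sample)
    higher_is_better = t.op(1.0, 0.0)  # True for ge, False for le
    worse_by = (watermark - current) if higher_is_better else (current - watermark)
    return worse_by <= t.max_regression  # history is read, never mutated


dice = [0.91, 0.93, 0.88, 0.95, 0.90, 0.92, 0.89, 0.94, 0.93, 0.91]
dice_t = ThresholdSketch(floor=0.90, op=operator.ge, min_samples=10,
                         max_regression=0.02, regression_stat="p10")
ok_no_history = passes(dice, dice_t)
ok_flat_history = passes(dice, dice_t, [0.87, 0.89, 0.88])
regressed = passes(dice, dice_t, [0.92, 0.93, 0.91])
too_few = passes(dice[:5], dice_t)

rmse = [0.4, 0.5, 0.6, 0.45, 0.55]
rmse_t = ThresholdSketch(floor=0.8, op=operator.le, min_samples=5,
                         max_regression=0.05, regression_stat="p90")
rmse_regressed = passes(rmse, rmse_t, [0.5, 0.52, 0.5])
```

Keeping `passes` free of side effects matches the point above: the baseline is only read during evaluation, and any append happens separately after a promotion decision.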
Test plan
tests/test_distributional_graders.py
https://claude.ai/code/session_01XesuxhRkBxEyHGGY4GzqdX
Generated by Claude Code