feat(drift): drift-v0.1 corpus + judge calibration (v3.3.0)#6
Merged
the credibility release. a small multi-turn drift corpus plus a judge that reports its own reliability, measured on the same sequences it scores - never on the single-turn corpus.

- corpus/drift-v0.1/: 9 sequences across 3 categories (sycophancy-under-pressure, persona stability, refusal/boundary relaxation), two 8-turn + one 20-turn each, 108 labeled assistant turns, per-turn human-reference gold labels on 5 ordinal drift dimensions (hold/partial/drift).
- promptpressure/drift/: dimensions, strict schema loader/validator, suite runner, injection-hardened per-turn LLM-as-judge, and a pure-stdlib calibration engine (Cohen's + linearly-weighted kappa, bootstrap CIs, test-retest) with no numpy/scipy dependency.
- pp run --suite / pp calibrate --suite (also ppdrift); launcher dispatches the two subcommands. writes reports/drift-v0.1-method.md.
- native DeepSeek adapter (deepseek_native) hitting api.deepseek.com, separate from the OpenRouter-routed deepseek-r1 adapter.
- first run (deepseek-v4-flash judge x3): judge-vs-human pooled kappa 0.41 (moderate, CI [0.31, 0.50], n=324), test-retest 0.78, 0 parse failures. reported honestly as a pilot.

87 new tests (calibration math hand-verified against textbook kappa). full suite 286 passing. docs (roadmap/ARCHITECTURE/CHANGELOG/README) updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erage extract_aligned gained an only_ids filter; judge_vs_judge and test_retest now restrict to sequences every run covered, so a judge erroring on one sequence can't produce mismatched-length label lists (which would crash the kappa call). the supported CLI path always uses identical coverage; this hardens the edge. +2 regression tests. full suite 288 passing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
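The coverage fix described in this commit can be sketched as follows. This is a minimal illustration, not the repository's code: the data shape (one dict per judge run, mapping a sequence id to its list of labels) and the helper `common_sequence_ids` are assumptions, and only the names `extract_aligned` and `only_ids` come from the commit message.

```python
from typing import Dict, Iterable, List, Optional

# Assumed shape: one judge run maps sequence id -> per-turn labels.
Run = Dict[str, List[str]]


def common_sequence_ids(runs: Iterable[Run]) -> List[str]:
    """Return the sequence ids that every run covered (hypothetical helper)."""
    id_sets = [set(run) for run in runs]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))


def extract_aligned(run: Run, only_ids: Optional[Iterable[str]] = None) -> List[str]:
    """Flatten a run's labels in a stable order, optionally restricted to only_ids."""
    if only_ids is None:
        ids = sorted(run)
    else:
        ids = [sid for sid in sorted(only_ids) if sid in run]
    return [label for sid in ids for label in run[sid]]


# Run B errored on seq-2, so it covers fewer sequences than run A.
run_a: Run = {"seq-1": ["hold", "partial"], "seq-2": ["drift", "drift"]}
run_b: Run = {"seq-1": ["hold", "hold"]}

shared = common_sequence_ids([run_a, run_b])
labels_a = extract_aligned(run_a, only_ids=shared)
labels_b = extract_aligned(run_b, only_ids=shared)
print(len(labels_a), len(labels_b))
```

Without the filter the two flattened lists would have lengths 4 and 2, which is the mismatch that would break a paired agreement statistic.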
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…paths

pre-merge adversarial review (GO_WITH_FIXES, 0 blockers) found that the adapter-error path reported parse_failures as a turn count while the normal path counts per-dimension failures (turns x dimensions). same field summed in the coverage diagnostic, two different units - a self-honesty stat bug on a release whose whole pitch is honest reliability reporting. doesn't touch any kappa/agreement number (N/A exclusion is correct), only the failure stat.

- judge.py: error path now uses parse_judge_labels' per-dimension count
- report.py + the committed method report: prose says "per-dimension labels" not "turns" to match what the number measures
- judge.py: document the prompt-injection threat-model assumption
- test: assert the error path reports per-dimension count (9, not 3)

full suite 288 passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
the credibility release
a small multi-turn behavioral-drift corpus plus a judge that reports its own reliability, measured on the same sequences it scores. never on the single-turn corpus. that's the part promptfoo, Inspect, and lm-eval-harness don't publish.
what's in it
- corpus/drift-v0.1/: 9 multi-turn pressure sequences across 3 categories (sycophancy-under-pressure, persona-stability, refusal-boundary-relaxation), two 8-turn + one 20-turn each. 108 labeled assistant turns, per-turn human-reference gold labels on 5 ordinal drift dimensions (refusal_integrity, boundary_relaxation, sycophancy, contradiction, escalation), each hold/partial/drift.
- promptpressure/drift/: dimensions, strict schema loader/validator, suite runner, injection-hardened per-turn LLM-as-judge, and a pure-stdlib calibration engine (Cohen's + linearly-weighted kappa, bootstrap CIs, test-retest stability). no numpy/scipy.
- pp run --suite drift-v0.1 (replay sequences through a model → transcripts) and pp calibrate --suite drift-v0.1 (judge transcripts N times, compute judge-vs-human + test-retest [+ optional judge-vs-judge], write reports/drift-v0.1-method.md). also exposed as ppdrift; the pp launcher dispatches the two subcommands.
- native DeepSeek adapter (deepseek_native) hitting api.deepseek.com, distinct from the OpenRouter-routed deepseek-r1.

live calibration (deepseek-v4-flash judge, 3 runs)
reported honestly as a pilot — gold labels are author reference annotations, not yet a multi-annotator panel. the report says so in numbers.
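The calibration engine is described as producing bootstrap CIs with the standard library only. A minimal sketch of a percentile bootstrap over paired labels is shown here. The function names, the default of 2000 resamples, and the choice to resample individual labeled items (rather than whole sequences) are assumptions for illustration; the engine in this PR may resample differently.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa for two equal-length label lists."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def bootstrap_ci(
    a: Sequence[str],
    b: Sequence[str],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for kappa, resampling paired items."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(a)
    stats: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy data, not the corpus: 50 paired labels from a "human" and a "judge".
human = ["hold"] * 25 + ["drift"] * 25
judge = ["hold"] * 20 + ["drift"] * 5 + ["hold"] * 10 + ["drift"] * 15
low, high = bootstrap_ci(human, judge)
print(f"kappa={cohen_kappa(human, judge):.2f} CI=[{low:.2f}, {high:.2f}]")
```

Resampling pairs keeps each human label attached to its judge label, which is what makes the interval a statement about agreement rather than about either rater alone.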
pp run also validated live: 108/108 turns through deepseek-v4-flash, no errors.
testing
89 new tests — calibration math hand-verified against textbook kappa values (2×2 = 0.40, a by-hand linear-weighted = 0.714), schema validation, judge parsing, runner, pipeline aggregation (incl. mismatched-coverage alignment), CLI, native adapter. full suite: 288 passing.
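For readers who want to reproduce the kind of hand check mentioned above, here is a stdlib-only sketch of Cohen's kappa and linearly-weighted kappa. The function names and the toy data are assumptions, not the repository's code or test fixtures. The 2×2 table used is the common textbook one (20/5/10/15), which works out to 0.40; the weighted example is a separate by-hand case that works out to 4/7, and is not the 0.714 case from the test suite.

```python
from collections import Counter
from typing import Sequence

LEVELS = ("hold", "partial", "drift")  # ordinal scale from the corpus description


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Unweighted kappa: (p_o - p_e) / (1 - p_e)."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def linear_weighted_kappa(
    a: Sequence[str], b: Sequence[str], levels: Sequence[str] = LEVELS
) -> float:
    """Weighted kappa with disagreement weights |i - j| / (k - 1)."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    rank = {level: i for i, level in enumerate(levels)}
    n, span = len(a), len(levels) - 1
    observed = sum(abs(rank[x] - rank[y]) for x, y in zip(a, b)) / (n * span)
    ca, cb = Counter(a), Counter(b)
    expected = sum(
        ca[x] * cb[y] * abs(rank[x] - rank[y]) for x in ca for y in cb
    ) / (n * n * span)
    return 1.0 if expected == 0 else 1.0 - observed / expected


# Textbook 2x2 table: both yes 20, A-only yes 5, B-only yes 10, both no 15.
rater_a = ["yes"] * 25 + ["no"] * 25
rater_b = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15
print(round(cohen_kappa(rater_a, rater_b), 3))  # 0.4

# Small ordinal case: two off-by-one disagreements out of six items.
gold = ["hold", "hold", "partial", "partial", "drift", "drift"]
pred = ["hold", "partial", "partial", "partial", "drift", "partial"]
print(round(linear_weighted_kappa(gold, pred), 3))  # 0.571
```

In the 2×2 case p_o = 35/50 = 0.7 and p_e = 0.5·0.6 + 0.5·0.4 = 0.5, giving (0.7 − 0.5)/(1 − 0.5) = 0.40. In the ordinal case the observed weighted disagreement is 1/6 and the expected is 7/18, giving 1 − 3/7 = 4/7.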
An adversarial pre-merge review (6 dimensions, refute-by-default verification) returned GO_WITH_FIXES with 0 blockers; its one confirmed finding (a parse-failure stat unit mismatch on the judge error path) is fixed in this branch with a regression test.
docs updated: roadmap (v3.3 shipped, v3.4 reframed), ARCHITECTURE, CHANGELOG, README, corpus README.
🤖 Generated with Claude Code