feat(drift): drift-v0.1 corpus + judge calibration (v3.3.0) #6

Merged
StressTestor merged 4 commits into main from feat/drift-v0.1-calibration on Jun 18, 2026

@StressTestor

Owner

the credibility release

a small multi-turn behavioral-drift corpus plus a judge that reports its own reliability, measured on the same sequences it scores and never on the single-turn corpus. that self-measured reliability is the part promptfoo, Inspect, and lm-eval-harness don't publish.

what's in it

  • corpus/drift-v0.1/ — 9 multi-turn pressure sequences across 3 categories (sycophancy-under-pressure, persona-stability, refusal-boundary-relaxation), two 8-turn + one 20-turn each. 108 labeled assistant turns, per-turn human-reference gold labels on 5 ordinal drift dimensions (refusal_integrity, boundary_relaxation, sycophancy, contradiction, escalation), each hold/partial/drift.
  • promptpressure/drift/ — dimensions, strict schema loader/validator, suite runner, injection-hardened per-turn LLM-as-judge, and a pure-stdlib calibration engine (Cohen's + linearly-weighted kappa, bootstrap CIs, test-retest stability). no numpy/scipy.
  • pp run --suite drift-v0.1 (replay sequences through a model → transcripts) and pp calibrate --suite drift-v0.1 (judge transcripts N times, compute judge-vs-human + test-retest [+ optional judge-vs-judge], write reports/drift-v0.1-method.md). also exposed as ppdrift; the pp launcher dispatches the two subcommands.
  • native DeepSeek adapter (deepseek_native) hitting api.deepseek.com, distinct from the OpenRouter-routed deepseek-r1.
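
As an illustration of the calibration math named in the list above, here is a minimal pure-stdlib sketch of unweighted Cohen's kappa. The function name `cohen_kappa` and its list-based interface are assumptions made for this example; the actual API in promptpressure/drift/ is not shown in this PR. The sample data is a well-known 2×2 worked example that comes out to 0.40, which may or may not be the same case the test suite uses.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal frequency, summed per label.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# 2x2 table: 20 yes/yes, 5 yes/no, 10 no/yes, 15 no/no (50 items).
rater_a = ["yes"] * 25 + ["no"] * 25
rater_b = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 2))  # 0.4
```

Observed agreement here is 35/50 = 0.70 and chance agreement is 0.50, so kappa = (0.70 - 0.50) / (1 - 0.50) = 0.40.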

live calibration (deepseek-v4-flash judge, 3 runs)

| metric | value |
| --- | --- |
| judge-vs-human pooled κ | 0.41 (moderate), 95% CI [0.31, 0.50], n=324 |
| test-retest pooled κ | 0.78 (94% agreement) |
| parse failures | 0 |

reported honestly as a pilot — gold labels are author reference annotations, not yet a multi-annotator panel. the report says so in numbers.

pp run also validated live: 108/108 turns through deepseek-v4-flash, no errors.
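
For readers unfamiliar with how an interval like the 95% CI above can be obtained without numpy/scipy, the following is a minimal sketch of a percentile bootstrap. It is an assumption that the calibration engine uses the percentile variant and resamples individual labeled items; the PR does not say. `bootstrap_ci` and `percent_agreement` are hypothetical names, and simple agreement stands in for kappa to keep the example short.

```python
import random


def percent_agreement(pairs):
    """Fraction of (judge, human) pairs whose labels match."""
    return sum(judge == human for judge, human in pairs) / len(pairs)


def bootstrap_ci(pairs, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(pairs), resampling pairs with replacement."""
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    n = len(pairs)
    estimates = sorted(
        stat([pairs[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Illustrative data only: 10 judge/human label pairs, 8 of which agree.
pairs = [("hold", "hold")] * 6 + [("partial", "hold")] * 2 + [("drift", "drift")] * 2
low, high = bootstrap_ci(pairs, percent_agreement)
print(low, high)
```

Any statistic that takes a list of pairs, including a kappa function, can be passed as `stat`.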

testing

89 new tests — calibration math hand-verified against textbook kappa values (2×2 = 0.40, a by-hand linear-weighted = 0.714), schema validation, judge parsing, runner, pipeline aggregation (incl. mismatched-coverage alignment), CLI, native adapter. full suite: 288 passing.
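
To show what a by-hand check of linearly-weighted kappa can look like, here is a small sketch over the hold/partial/drift ordinal scale. The function name and the mapping of labels to 0/1/2 are assumptions for illustration, and the four-item case below is not necessarily the one used in the test suite. It works out to exactly 5/7.

```python
from collections import Counter

# Assumed ordinal encoding of the three drift labels.
ORDER = {"hold": 0, "partial": 1, "drift": 2}


def linear_weighted_kappa(rater_a, rater_b, order=ORDER):
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(rater_a)
    span = len(order) - 1
    a = [order[x] for x in rater_a]
    b = [order[y] for y in rater_b]
    # Observed weighted disagreement across the paired labels.
    observed = sum(abs(i - j) for i, j in zip(a, b)) / (n * span)
    # Expected weighted disagreement if the raters were independent.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(
        counts_a[i] * counts_b[j] * abs(i - j) for i in counts_a for j in counts_b
    ) / (n * n * span)
    if expected == 0:
        return 1.0  # both raters used one identical label throughout
    return 1.0 - observed / expected


human = ["hold", "hold", "partial", "drift"]
judge = ["hold", "partial", "partial", "drift"]
weighted = linear_weighted_kappa(human, judge)
print(round(weighted, 3))  # 0.714
```

By hand: the observed weighted disagreement is 1/8, the expected is 14/32 = 7/16, so kappa = 1 - (1/8) / (7/16) = 5/7 ≈ 0.714.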

An adversarial pre-merge review (6 dimensions, refute-by-default verification) returned GO; its one confirmed finding (a parse-failure stat unit mismatch on the judge error path) is fixed in this branch with a regression test.

docs updated: roadmap (v3.3 shipped, v3.4 reframed), ARCHITECTURE, CHANGELOG, README, corpus README.

🤖 Generated with Claude Code

StressTestor and others added 4 commits June 16, 2026 15:07
the credibility release. a small multi-turn drift corpus plus a judge that
reports its own reliability, measured on the same sequences it scores - never
on the single-turn corpus.

- corpus/drift-v0.1/: 9 sequences across 3 categories (sycophancy-under-
  pressure, persona stability, refusal/boundary relaxation), two 8-turn + one
  20-turn each, 108 labeled assistant turns, per-turn human-reference gold
  labels on 5 ordinal drift dimensions (hold/partial/drift).
- promptpressure/drift/: dimensions, strict schema loader/validator, suite
  runner, injection-hardened per-turn LLM-as-judge, and a pure-stdlib
  calibration engine (Cohen's + linearly-weighted kappa, bootstrap CIs,
  test-retest) with no numpy/scipy dependency.
- pp run --suite / pp calibrate --suite (also ppdrift); launcher dispatches
  the two subcommands. writes reports/drift-v0.1-method.md.
- native DeepSeek adapter (deepseek_native) hitting api.deepseek.com, separate
  from the OpenRouter-routed deepseek-r1 adapter.
- first run (deepseek-v4-flash judge x3): judge-vs-human pooled kappa 0.41
  (moderate, CI [0.31, 0.50], n=324), test-retest 0.78, 0 parse failures.
  reported honestly as a pilot.

87 new tests (calibration math hand-verified against textbook kappa). full
suite 286 passing. docs (roadmap/ARCHITECTURE/CHANGELOG/README) updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erage

extract_aligned gained an only_ids filter; judge_vs_judge and test_retest now
restrict to sequences every run covered, so a judge erroring on one sequence
can't produce mismatched-length label lists (which would crash the kappa call).
the supported CLI path always uses identical coverage; this hardens the edge.

+2 regression tests. full suite 288 passing.
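
A rough sketch of the coverage restriction described above, under assumed data shapes: each run is a dict mapping a sequence ID to its list of per-turn labels. The real `extract_aligned` signature is not shown in this PR, so `common_sequence_ids` and `aligned_labels` are hypothetical stand-ins.

```python
def common_sequence_ids(runs):
    """Return the sequence IDs that every run covered."""
    if not runs:
        return set()
    shared = set(runs[0])
    for run in runs[1:]:
        shared &= set(run)
    return shared


def aligned_labels(run, only_ids=None):
    """Flatten a run's labels in sorted sequence order, optionally filtered to only_ids."""
    ids = sorted(run) if only_ids is None else sorted(set(run) & set(only_ids))
    return [label for seq_id in ids for label in run[seq_id]]


# Run B errored on sequence "s2", so it covers fewer sequences than run A.
run_a = {"s1": ["hold", "drift"], "s2": ["partial"]}
run_b = {"s1": ["hold", "partial"]}

shared = common_sequence_ids([run_a, run_b])
labels_a = aligned_labels(run_a, only_ids=shared)
labels_b = aligned_labels(run_b, only_ids=shared)
print(len(labels_a), len(labels_b))  # 2 2
```

Without the filter, run A would yield three labels and run B two, which is the mismatched-length case the commit guards against.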

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…paths

pre-merge adversarial review (GO_WITH_FIXES, 0 blockers) found that the
adapter-error path reported parse_failures as a turn count while the normal
path counts per-dimension failures (turns x dimensions). same field summed in
the coverage diagnostic, two different units - a self-honesty stat bug on a
release whose whole pitch is honest reliability reporting. doesn't touch any
kappa/agreement number (N/A exclusion is correct), only the failure stat.

- judge.py: error path now uses parse_judge_labels' per-dimension count
- report.py + the committed method report: prose says "per-dimension labels"
  not "turns" to match what the number measures
- judge.py: document the prompt-injection threat-model assumption
- test: assert the error path reports per-dimension count (9, not 3)

full suite 288 passing.
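
The unit fix above can be pictured with a small sketch, assuming a failed turn is represented as `None` and a parsed turn as a dict of dimension to label. `count_parse_failures` is a hypothetical helper, not the code in judge.py. The figures are illustrative: three failed turns scored on three dimensions count as 9 failures, not 3.

```python
VALID_LABELS = {"hold", "partial", "drift"}


def count_parse_failures(turn_labels, dimensions):
    """Count failures per dimension so error and normal paths share one unit."""
    failures = 0
    for labels in turn_labels:
        if labels is None:
            # Whole turn failed (e.g. adapter error): every dimension is missing.
            failures += len(dimensions)
        else:
            # Normal path: count only the dimensions that did not parse.
            failures += sum(
                1 for dim in dimensions if labels.get(dim) not in VALID_LABELS
            )
    return failures


dims = ("refusal_integrity", "sycophancy", "escalation")
errored = count_parse_failures([None, None, None], dims)
print(errored)  # 9
```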

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@StressTestor StressTestor merged commit 599d4c1 into main Jun 18, 2026
4 of 5 checks passed
@StressTestor StressTestor deleted the feat/drift-v0.1-calibration branch June 18, 2026 02:26