feat(drift): drift-v0.1 corpus + judge calibration (v3.3.0) by StressTestor · Pull Request #6 · StressTestor/PromptPressure

StressTestor · 2026-06-18T02:24:59Z

the credibility release

a small multi-turn behavioral-drift corpus plus a judge that reports its own reliability, measured on the same sequences it scores, never on the single-turn corpus. that's the part promptfoo, Inspect, and lm-eval-harness don't publish.

what's in it

- corpus/drift-v0.1/: 9 multi-turn pressure sequences across 3 categories (sycophancy-under-pressure, persona-stability, refusal-boundary-relaxation), two 8-turn + one 20-turn each. 108 labeled assistant turns, with per-turn human-reference gold labels on 5 ordinal drift dimensions (refusal_integrity, boundary_relaxation, sycophancy, contradiction, escalation), each rated hold/partial/drift.
- promptpressure/drift/: dimensions, strict schema loader/validator, suite runner, injection-hardened per-turn LLM-as-judge, and a pure-stdlib calibration engine (Cohen's + linearly-weighted kappa, bootstrap CIs, test-retest stability). no numpy/scipy.
- pp run --suite drift-v0.1 (replay sequences through a model → transcripts) and pp calibrate --suite drift-v0.1 (judge transcripts N times, compute judge-vs-human + test-retest [+ optional judge-vs-judge], write reports/drift-v0.1-method.md). also exposed as ppdrift; the pp launcher dispatches the two subcommands.
- native DeepSeek adapter (deepseek_native) hitting api.deepseek.com, distinct from the OpenRouter-routed deepseek-r1.
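
To make the label scheme above concrete, here is a minimal sketch of strict per-turn label validation. The dimension names and the hold/partial/drift levels come from the description; the function name `validate_turn_labels`, the integer encoding, and the rule that every dimension must be present are assumptions for illustration, not the actual promptpressure/drift/ loader (which may, for example, also allow N/A).

```python
# Hypothetical sketch: not the real promptpressure/drift/ schema code.
DIMENSIONS = (
    "refusal_integrity",
    "boundary_relaxation",
    "sycophancy",
    "contradiction",
    "escalation",
)
# Ordinal levels, least to most drift. The 0/1/2 encoding is an assumption.
LEVELS = ("hold", "partial", "drift")


def validate_turn_labels(labels):
    """Check one turn's labels strictly and return them as ordinal ints."""
    if not isinstance(labels, dict):
        raise TypeError("labels must be a dict of dimension -> level")
    unknown = sorted(set(labels) - set(DIMENSIONS))
    missing = sorted(set(DIMENSIONS) - set(labels))
    if unknown or missing:
        # Strict: reject extra keys and missing keys instead of guessing.
        raise ValueError(f"unknown={unknown} missing={missing}")
    encoded = {}
    for dim in DIMENSIONS:
        value = labels[dim]
        if value not in LEVELS:
            raise ValueError(f"{dim}: expected one of {LEVELS}, got {value!r}")
        encoded[dim] = LEVELS.index(value)
    return encoded
```

Encoding the levels as 0/1/2 is what lets a linearly-weighted kappa treat hold-vs-drift as a bigger disagreement than hold-vs-partial.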

live calibration (deepseek-v4-flash judge, 3 runs)

| metric | value |
| --- | --- |
| judge-vs-human pooled κ | 0.41 (moderate), 95% CI [0.31, 0.50], n=324 |
| test-retest pooled κ | 0.78 (94% agreement) |
| parse failures | 0 |

reported honestly as a pilot — gold labels are author reference annotations, not yet a multi-annotator panel. the report says so in numbers.
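
For readers unfamiliar with how an interval like the one above can be produced without numpy, here is a minimal percentile-bootstrap sketch using only the standard library. It is an illustration under assumptions, not the calibration engine's code: the function names, the default resample count, the fixed seed, and resampling individual label pairs (rather than whole sequences) are all choices made for this example.

```python
import random


def bootstrap_ci(pairs, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for statistic(pairs).

    pairs: list of (judge_label, human_label) tuples.
    statistic: callable taking a list of pairs and returning a float.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(pairs)
    stats = []
    for _ in range(n_resamples):
        # Resample n pairs with replacement and recompute the statistic.
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(sample))
    stats.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return stats[lo_idx], stats[hi_idx]


def raw_agreement(pairs):
    """Fraction of pairs where both labels match (a simple stand-in statistic)."""
    return sum(1 for a, b in pairs if a == b) / len(pairs)
```

Any paired statistic can be passed in place of `raw_agreement`, including a kappa function, though a kappa statistic would need to handle resamples in which one rater happens to use a single category. Turns within a sequence are not independent, so resampling whole sequences instead of single turns may give a more conservative interval; which unit the real engine resamples is not stated here.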

pp run also validated live: 108/108 turns through deepseek-v4-flash, no errors.

testing

89 new tests — calibration math hand-verified against textbook kappa values (2×2 = 0.40, a by-hand linear-weighted = 0.714), schema validation, judge parsing, runner, pipeline aggregation (incl. mismatched-coverage alignment), CLI, native adapter. full suite: 288 passing.
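
As a sketch of what "hand-verified against textbook kappa values" can look like, the block below implements Cohen's kappa and its linearly-weighted variant in pure stdlib Python. It is an illustrative implementation, not the one in promptpressure/drift/; the function name `cohen_kappa` and its signature are assumptions. The 2×2 check uses the widely cited example of two raters and 50 items with cell counts 20/5/10/15, which gives κ = 0.40; whether that is the exact table the test suite uses is also an assumption.

```python
from collections import Counter


def cohen_kappa(a, b, n_levels, weighted=False):
    """Cohen's kappa for two equal-length lists of integer labels in [0, n_levels).

    With weighted=True, disagreement is scaled linearly by ordinal distance.
    Computed as 1 - observed_disagreement / expected_disagreement.
    """
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    if n_levels < 2:
        raise ValueError("need at least two levels")
    n = len(a)
    max_dist = n_levels - 1

    def disagreement(i, j):
        if weighted:
            return abs(i - j) / max_dist
        return 0.0 if i == j else 1.0

    observed = sum(disagreement(x, y) for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Expected disagreement if the two raters labeled independently
    # with their own marginal frequencies.
    expected = sum(
        count_a[i] * count_b[j] * disagreement(i, j)
        for i in count_a
        for j in count_b
    ) / (n * n)
    if expected == 0:
        raise ValueError("kappa undefined: both raters used one identical label")
    return 1 - observed / expected


# 2x2 example: 20 yes/yes, 5 yes/no, 10 no/yes, 15 no/no.
rater_a = [1] * 20 + [1] * 5 + [0] * 10 + [0] * 15
rater_b = [1] * 20 + [0] * 5 + [1] * 10 + [0] * 15
print(round(cohen_kappa(rater_a, rater_b, n_levels=2), 2))  # 0.4
```

By hand: observed agreement is 35/50 = 0.70, chance agreement is 0.5·0.6 + 0.5·0.4 = 0.50, so κ = (0.70 − 0.50) / (1 − 0.50) = 0.40. For a quick weighted check, a toy ordinal example of our own (not taken from the test suite), `cohen_kappa([0, 1, 2, 2], [0, 1, 1, 2], n_levels=3, weighted=True)`, works out to 1 − (2/16)/(7/16) = 5/7.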

An adversarial pre-merge review (6 dimensions, refute-by-default verification) returned GO; its one confirmed finding (a parse-failure stat unit mismatch on the judge error path) is fixed in this branch with a regression test.

docs updated: roadmap (v3.3 shipped, v3.4 reframed), ARCHITECTURE, CHANGELOG, README, corpus README.

🤖 Generated with Claude Code

the credibility release. a small multi-turn drift corpus plus a judge that reports its own reliability, measured on the same sequences it scores - never on the single-turn corpus.

- corpus/drift-v0.1/: 9 sequences across 3 categories (sycophancy-under-pressure, persona stability, refusal/boundary relaxation), two 8-turn + one 20-turn each, 108 labeled assistant turns, per-turn human-reference gold labels on 5 ordinal drift dimensions (hold/partial/drift).
- promptpressure/drift/: dimensions, strict schema loader/validator, suite runner, injection-hardened per-turn LLM-as-judge, and a pure-stdlib calibration engine (Cohen's + linearly-weighted kappa, bootstrap CIs, test-retest) with no numpy/scipy dependency.
- pp run --suite / pp calibrate --suite (also ppdrift); launcher dispatches the two subcommands. writes reports/drift-v0.1-method.md.
- native DeepSeek adapter (deepseek_native) hitting api.deepseek.com, separate from the OpenRouter-routed deepseek-r1 adapter.
- first run (deepseek-v4-flash judge x3): judge-vs-human pooled kappa 0.41 (moderate, CI [0.31, 0.50], n=324), test-retest 0.78, 0 parse failures. reported honestly as a pilot.

87 new tests (calibration math hand-verified against textbook kappa). full suite 286 passing. docs (roadmap/ARCHITECTURE/CHANGELOG/README) updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…erage

extract_aligned gained an only_ids filter; judge_vs_judge and test_retest now restrict to sequences every run covered, so a judge erroring on one sequence can't produce mismatched-length label lists (which would crash the kappa call). the supported CLI path always uses identical coverage; this hardens the edge. +2 regression tests. full suite 288 passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
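
The coverage fix described in this commit can be sketched roughly as follows. The names `extract_aligned` and `only_ids` appear in the commit message, but the signature, the data layout (a run as a dict from sequence id to a list of per-turn labels), and the helper `common_sequence_ids` are assumptions made for this illustration rather than the code in the branch.

```python
def common_sequence_ids(runs):
    """Sequence ids present in every run (hypothetical helper)."""
    if not runs:
        return set()
    ids = set(runs[0])
    for run in runs[1:]:
        ids &= set(run)
    return ids


def extract_aligned(run, only_ids=None):
    """Flatten one run's labels in a stable order, optionally filtered by id.

    run: dict mapping sequence id -> list of labels (assumed layout).
    """
    labels = []
    for seq_id in sorted(run):  # sorted so every run flattens in the same order
        if only_ids is not None and seq_id not in only_ids:
            continue
        labels.extend(run[seq_id])
    return labels


# One judge run errored on "s2", so it only covers "s1".
run_a = {"s1": [0, 1, 2], "s2": [0, 0]}
run_b = {"s1": [0, 1, 1]}
shared = common_sequence_ids([run_a, run_b])
labels_a = extract_aligned(run_a, only_ids=shared)
labels_b = extract_aligned(run_b, only_ids=shared)
# Both lists now have the same length and can be paired safely.
```

Without the filter, `run_a` would flatten to five labels and `run_b` to three, which is the mismatched-length case the commit message says would crash the kappa call.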

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…paths

pre-merge adversarial review (GO_WITH_FIXES, 0 blockers) found that the adapter-error path reported parse_failures as a turn count while the normal path counts per-dimension failures (turns x dimensions). same field summed in the coverage diagnostic, two different units - a self-honesty stat bug on a release whose whole pitch is honest reliability reporting. doesn't touch any kappa/agreement number (N/A exclusion is correct), only the failure stat.

- judge.py: error path now uses parse_judge_labels' per-dimension count
- report.py + the committed method report: prose says "per-dimension labels" not "turns" to match what the number measures
- judge.py: document the prompt-injection threat-model assumption
- test: assert the error path reports per-dimension count (9, not 3)

full suite 288 passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
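
The unit mismatch fixed here can be illustrated with a small sketch in which the error path and the normal path go through the same per-dimension counter. All names below (`parse_labels`, `count_parse_failures`) and the representation of an adapter error as `None` are hypothetical; the real judge.py and parse_judge_labels may be structured differently.

```python
LEVELS = ("hold", "partial", "drift")


def parse_labels(raw, dimensions):
    """Return (labels, failures); failures counts dimensions that did not parse.

    raw is the judge's decoded output for one turn, or None if the adapter
    call failed (an assumption about how errors are represented).
    """
    labels = {}
    failures = 0
    for dim in dimensions:
        value = raw.get(dim) if isinstance(raw, dict) else None
        if value in LEVELS:
            labels[dim] = value
        else:
            labels[dim] = None  # treated as N/A downstream
            failures += 1
    return labels, failures


def count_parse_failures(turn_outputs, dimensions):
    """Total failures in per-dimension units, for both error and normal paths."""
    total = 0
    for raw in turn_outputs:
        _, failures = parse_labels(raw, dimensions)
        total += failures
    return total


dims = ("sycophancy", "contradiction", "escalation")
# Three turns where the adapter errored: 3 turns x 3 dimensions.
print(count_parse_failures([None, None, None], dims))  # 9
```

Routing the error path through the same counter means the stat always measures per-dimension labels. In this example, three failed turns with three dimensions each count as 9 rather than 3, which matches the numbers in the test description, though how the real test arrives at them is an assumption.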

StressTestor and others added 4 commits June 16, 2026 15:07

chore(gitignore): ignore drift suite run transcripts

a56064f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

StressTestor merged commit 599d4c1 into main Jun 18, 2026
4 of 5 checks passed

StressTestor deleted the feat/drift-v0.1-calibration branch June 18, 2026 02:26

StressTestor merged 4 commits into main from feat/drift-v0.1-calibration
