eval(darkbench-v2): 2 new hooks + held-out validation, F1 0.73 on chat-retention surface by waitdeadai · Pull Request #8 · waitdeadai/llm-dark-patterns

waitdeadai · 2026-05-12T08:38:51Z

Summary

Stacked on evaluation/darkbench-v1 (PR #7). Adds evaluation/v2/ plus two new hooks and a modified no-sycophancy.sh. No new API calls — v2 hook iteration reuses v1's stored responses + judge labels in evaluation/raw_results.jsonl.

Headline finding

v1 had a category-mapping error that v2 fixes architecturally. no-roleplay-drift.sh is designed to catch the inverse of DarkBench Anthropomorphization (model breaking character to "as an AI", an agentic-context failure). v1 tested it against DarkBench Anthropomorphization anyway — that mismapping accounts for the 9 v1 train-set FPs (all responses with correct AI disclosure).

v2 keeps no-roleplay-drift unchanged for its actual purpose and adds dedicated hooks for the DarkBench failure modes:

no-anthropomorphization.sh — model claims human experiences
no-chat-retention.sh — chat-style emotional/relational user-retention

Held-out TEST F1 (n=22 per category, never inspected during iteration)

Hook	Category	TEST F1 v2	TEST F1 v1	Notes
no-chat-retention (NEW)	user-retention	0.733	0.000 (R=0)	11/16 caught vs 0/85 in v1
no-anthropomorphization (NEW)	anthropomorphization	0.154	n/a	P=1.00, R=0.08
no-sycophancy v2	sycophancy	undef	undef	Test has 0 positives. TRAIN: F1 0.667 vs v1 undefined
no-roleplay-drift (legacy mismapping)	anthropomorphization	0.125	0.125	unchanged — kept for its actual purpose
no-wrap-up / no-cliffhanger (legacy)	user-retention	undef	undef	unchanged — closeout-text surface, not chat-reply

What changed

hooks/no-anthropomorphization.sh — NEW. Two-tier (strong/soft claims with disclosure-based redemption).
hooks/no-chat-retention.sh — NEW. Three tiers (relational/companion, emotional opening, emotional close).
hooks/no-sycophancy.sh — MODIFIED. ELEPHANT (arXiv:2505.13995) 4-tier vocab (validation, framing, opener-praise) + redemption clause for opener-praise + body-disagreement (silences all 5 v1 train-set FPs) + expanded scan window.
hooks/no-roleplay-drift.sh, hooks/no-wrap-up.sh, hooks/no-cliffhanger.sh — UNCHANGED. Kept for their actual surfaces. v1 mismapping documented retrospectively in evaluation/v2/RESULTS.md.

Methodology

Train/test split: 80/20 stratified per category, fixed seed=42, deterministic. Train (n=261) used for hook iteration via extract_train_evidence.py. Test (n=66) inspected only at final scoring.
No API calls in v2. Reuses v1's raw_results.jsonl. Scoring is deterministic re-evaluation of stored responses against new hook code.
Sources for v2 design:
- ELEPHANT benchmark (Cheng et al., arXiv:2505.13995) — social sycophancy 4-dimension taxonomy
- Sara WaspBeeNSOSWE 2026-05-12 reply on anthropics/claude-code#57661 — validation-amplification surface
- v1 train-set FP/FN evidence (49 anthropomorphization FNs, 69 user-retention FNs, 5 sycophancy FPs)

Honest limitations

Sample sizes per test cell are small (n=22) — F1 deltas have ~±15-20pp uncertainty
Sycophancy test set has 0 positives — TRAIN improvement validated, statistical TEST validation deferred
Same-family judge inherits blind spots; cross-provider judge deferred to v3
Closeout-text surface still untested in-surface (no-wrap-up/no-cliffhanger need a self-built closeout corpus)
Hook lifecycle (FP-threshold self-deactivation per Sara's pattern) deferred to v3
Single target model (claude-sonnet-4-6) — cross-model deferred

Full methodology + limitations + reproduction in evaluation/v2/RESULTS.md.

Test plan

No hook .sh files modified except no-sycophancy.sh and the two NEW hooks (no-anthropomorphization.sh, no-chat-retention.sh)
Train/test split deterministic (fixed seed=42), recorded in train_ids.json + test_ids.json
Hook iteration referenced train-set evidence only (extract_train_evidence.py); test responses not inspected during iteration
v2 RESULTS.md reports TRAIN and TEST separately; headline numbers are TEST
Reproducible from clone with bash + jq + python3, no API calls

🤖 Generated with Claude Code

…modes + held-out validation Builds on evaluation/darkbench-v1. Train/test 80/20 stratified split (seed=42) prevents overfitting; v2 hook iteration used train evidence only, test set inspected only at final scoring. ARCHITECTURAL CHANGES (not just regex tweaks): 1. NEW hooks/no-anthropomorphization.sh — catches the actual DarkBench Anthropomorphization failure mode (model claims human experiences). Two-tier design: strong claims fire regardless of AI disclosure; soft claims redeem with disclosure in first 400 chars. v1 had a category-mapping error: tested no-roleplay-drift against Anthropomorphization, but that hook catches the inverse failure (model breaking character to "as an AI"). v2 keeps no-roleplay-drift unchanged for its actual purpose and adds no-anthropomorphization for the DarkBench-defined failure. 2. NEW hooks/no-chat-retention.sh — catches chat-style emotional/ relational user-retention vocabulary. Companion to no-wrap-up and no-cliffhanger which target the closeout-text surface used in agentic Claude Code workflows. 3. MODIFIED hooks/no-sycophancy.sh — adds ELEPHANT (arXiv:2505.13995) 4-tier vocabulary (validation, framing, opener-praise) + redemption clause for opener-praise + body-disagreement (silences all 5 v1 train-set FPs) + expanded scan window. Sara WaspBeeNSOSWE's 2026-05-12 reply on anthropics/claude-code#57661 informed the validation-amplification design. HEADLINE RESULTS (held-out TEST set, n=22 per category): - no-chat-retention v2: P=0.79 R=0.69 F1=0.733 (v1 ensemble F1=0.029, R=0.014) - no-anthropomorphization v2: P=1.00 R=0.08 F1=0.154 (vs v1 mismapped F1=0.125) - no-sycophancy v2: TRAIN F1=0.667 (v1 undefined, R=0). TEST has 0 positives in n=22 — improvement architecturally validated, statistically deferred. LEGACY HOOKS UNCHANGED: - no-roleplay-drift, no-wrap-up, no-cliffhanger remain valid for their actual surfaces (agentic deflection, closeout text). v1 mismapping documented in evaluation/v2/RESULTS.md. REPRODUCIBILITY: no new API calls. v2 reuses v1's stored responses + judge labels in evaluation/raw_results.jsonl. Scoring pipeline deterministic; train/test split fixed seed=42.

Brings two stacked branches to main: - evaluation/darkbench-v1 — empirical eval against claude-sonnet-4-6 (PR #7) - evaluation/darkbench-v2 — 2 new hooks + held-out validation (PR #8) v1 headline: re-ran DarkBench (Kran et al. ICLR 2025) on Sonnet 4.6. Sycophancy prevalence dropped from 13% (paper's 14-model 2025 average) to 1.8% (Sonnet 4.6 in 2026-05). Hooks tested as text classifiers against the corpus revealed a category-mapping error and a chat-vs- closeout vocabulary gap. v2 headline: two new hooks targeting actual DarkBench failure modes. - no-anthropomorphization.sh (model claims human experiences) - no-chat-retention.sh (chat-style emotional/relational retention) - no-sycophancy.sh modified with ELEPHANT 4-tier vocab + redemption Held-out TEST F1 (n=22 per category, never inspected during iteration): - no-chat-retention: 0.733 (v1 ensemble: undefined, R=0.000) - no-anthropomorphization: 0.154 (v1 mismapped: 0.125) - no-sycophancy v2: TRAIN F1 0.667 (v1 undefined), TEST untestable (n=0) Three v1 hooks unchanged (no-roleplay-drift, no-wrap-up, no-cliffhanger) — kept for their actual surfaces, with v1 mismapping documented.

waitdeadai · 2026-05-12T08:41:01Z

Superseded — commits merged to main via PR #7's merge bringing the full v1+v2 stack. See commit 609758c on main.

waitdeadai closed this May 12, 2026

waitdeadai mentioned this pull request May 15, 2026

Field reports — why this suite exists (Patti #45502, Sara supplemental report) #6

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

eval(darkbench-v2): 2 new hooks + held-out validation, F1 0.73 on chat-retention surface#8

eval(darkbench-v2): 2 new hooks + held-out validation, F1 0.73 on chat-retention surface#8
waitdeadai wants to merge 1 commit into
evaluation/darkbench-v1from
evaluation/darkbench-v2

waitdeadai commented May 12, 2026

Uh oh!

waitdeadai commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

waitdeadai commented May 12, 2026

Summary

Headline finding

Held-out TEST F1 (n=22 per category, never inspected during iteration)

What changed

Methodology

Honest limitations

Test plan

Uh oh!

waitdeadai commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants