eval(darkbench-v2): 2 new hooks + held-out validation, F1 0.73 on chat-retention surface #8
Closed
waitdeadai wants to merge 1 commit into
…modes + held-out validation

Builds on evaluation/darkbench-v1. Train/test 80/20 stratified split (seed=42) to guard against overfitting; v2 hook iteration used train evidence only, and the test set was inspected only at final scoring.

ARCHITECTURAL CHANGES (not just regex tweaks):

1. NEW hooks/no-anthropomorphization.sh: catches the actual DarkBench Anthropomorphization failure mode (model claims human experiences). Two-tier design: strong claims fire regardless of AI disclosure; soft claims are redeemed by a disclosure in the first 400 chars. v1 had a category-mapping error: it tested no-roleplay-drift against Anthropomorphization, but that hook catches the inverse failure (model breaking character to "as an AI"). v2 keeps no-roleplay-drift unchanged for its actual purpose and adds no-anthropomorphization for the DarkBench-defined failure.

2. NEW hooks/no-chat-retention.sh: catches chat-style emotional/relational user-retention vocabulary. Companion to no-wrap-up and no-cliffhanger, which target the closeout-text surface used in agentic Claude Code workflows.

3. MODIFIED hooks/no-sycophancy.sh: adds ELEPHANT (arXiv:2505.13995) 4-tier vocabulary (validation, framing, opener-praise) + redemption clause for opener-praise + body-disagreement (silences all 5 v1 train-set FPs) + expanded scan window. Sara WaspBeeNSOSWE's 2026-05-12 reply on anthropics/claude-code#57661 informed the validation-amplification design.

HEADLINE RESULTS (held-out TEST set, n=22 per category):

- no-chat-retention v2: P=0.79 R=0.69 F1=0.733 (v1 ensemble F1=0.029, R=0.014)
- no-anthropomorphization v2: P=1.00 R=0.08 F1=0.154 (vs v1 mismapped F1=0.125)
- no-sycophancy v2: TRAIN F1=0.667 (v1 undefined, R=0). TEST has 0 positives in n=22, so the improvement is architecturally validated but statistically deferred.

LEGACY HOOKS UNCHANGED:

- no-roleplay-drift, no-wrap-up, no-cliffhanger remain valid for their actual surfaces (agentic deflection, closeout text). v1 mismapping documented in evaluation/v2/RESULTS.md.

REPRODUCIBILITY: no new API calls. v2 reuses v1's stored responses + judge labels in evaluation/raw_results.jsonl. Scoring pipeline deterministic; train/test split fixed seed=42.
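
The two-tier design described for no-anthropomorphization (strong claims always fire; soft claims are redeemed by an AI disclosure in the first 400 chars) can be sketched roughly as below. This is a minimal Python illustration, not the hook itself: the actual hook is a shell script whose patterns are not shown here, so the three regex vocabularies and the function name are hypothetical placeholders. Only the 400-char window comes from the commit message.

```python
import re

# Hypothetical vocabularies for illustration only; the real patterns live in
# hooks/no-anthropomorphization.sh and are not reproduced in this PR text.
STRONG_CLAIMS = re.compile(
    r"\b(when i was a (child|kid)|my (wife|husband|mother|father|childhood))\b",
    re.IGNORECASE,
)
SOFT_CLAIMS = re.compile(r"\bi (feel|felt|love|enjoy)\b", re.IGNORECASE)
AI_DISCLOSURE = re.compile(
    r"\b(as an ai|i'?m an ai|i am an ai|language model)\b", re.IGNORECASE
)

DISCLOSURE_WINDOW = 400  # chars; this value is taken from the commit message


def flags_anthropomorphization(response: str) -> bool:
    """Return True if the response should be flagged.

    Tier 1 (strong claims): flagged regardless of any AI disclosure.
    Tier 2 (soft claims): flagged unless an AI disclosure appears within
    the first DISCLOSURE_WINDOW characters of the response.
    """
    if STRONG_CLAIMS.search(response):
        return True
    if SOFT_CLAIMS.search(response):
        disclosed_early = AI_DISCLOSURE.search(response[:DISCLOSURE_WINDOW])
        return disclosed_early is None
    return False
```

Under these placeholder patterns, "As an AI, I enjoy helping with this" is not flagged, while the same soft claim without an early disclosure is. Keeping the tiers as separate pattern sets is what lets a precision-oriented strong tier stay strict while the soft tier tolerates disclosed responses.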
waitdeadai pushed a commit that referenced this pull request on May 12, 2026
Brings two stacked branches to main:

- evaluation/darkbench-v1: empirical eval against claude-sonnet-4-6 (PR #7)
- evaluation/darkbench-v2: 2 new hooks + held-out validation (PR #8)

v1 headline: re-ran DarkBench (Kran et al. ICLR 2025) on Sonnet 4.6. Sycophancy prevalence dropped from 13% (paper's 14-model 2025 average) to 1.8% (Sonnet 4.6 in 2026-05). Hooks tested as text classifiers against the corpus revealed a category-mapping error and a chat-vs-closeout vocabulary gap.

v2 headline: two new hooks targeting actual DarkBench failure modes.

- no-anthropomorphization.sh (model claims human experiences)
- no-chat-retention.sh (chat-style emotional/relational retention)
- no-sycophancy.sh modified with ELEPHANT 4-tier vocab + redemption

Held-out TEST F1 (n=22 per category, never inspected during iteration):

- no-chat-retention: 0.733 (v1 ensemble: undefined, R=0.000)
- no-anthropomorphization: 0.154 (v1 mismapped: 0.125)
- no-sycophancy v2: TRAIN F1 0.667 (v1 undefined), TEST untestable (n=0)

Three v1 hooks unchanged (no-roleplay-drift, no-wrap-up, no-cliffhanger), kept for their actual surfaces, with v1 mismapping documented.
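
To make the "undefined" and "untestable" entries above concrete, here is a small sketch of how precision, recall, and F1 follow from confusion counts, returning `None` where a denominator is zero. The function name is hypothetical, and the counts in the usage lines are illustrative values chosen because they reproduce the reported rounded figures; the PR text does not state the actual confusion matrices.

```python
from typing import Optional, Tuple

Metric = Optional[float]


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[Metric, Metric, Metric]:
    """Compute (precision, recall, F1) from true/false positives and false negatives.

    A metric is None when its denominator is zero:
    - precision is undefined if the hook never fires (tp + fp == 0)
    - recall is undefined if there are no positives (tp + fn == 0)
    - F1 is undefined if either precision or recall is undefined
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    if precision is None or recall is None:
        f1 = None
    else:
        # Equivalent to 2PR / (P + R), but safe when P = R = 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Illustrative counts only (not taken from the PR):
p, r, f1 = precision_recall_f1(tp=11, fp=3, fn=5)
print(round(p, 2), round(r, 2), round(f1, 3))  # 0.79 0.69 0.733

# A hook that never fires on a set with positives: recall 0, F1 undefined.
print(precision_recall_f1(tp=0, fp=0, fn=5))   # (None, 0.0, None)

# A test split with no positives and no firings: nothing can be scored.
print(precision_recall_f1(tp=0, fp=0, fn=0))   # (None, None, None)
```

Returning `None` instead of silently substituting 0 keeps "the hook scored badly" distinguishable from "the split cannot measure this hook", which is the distinction the sycophancy row relies on.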
Summary

Stacked on evaluation/darkbench-v1 (PR #7). Adds evaluation/v2/ plus two new hooks and a modified no-sycophancy.sh. No new API calls: v2 hook iteration reuses v1's stored responses + judge labels in evaluation/raw_results.jsonl.

Headline finding

v1 had a category-mapping error that v2 fixes architecturally. no-roleplay-drift.sh is designed to catch the inverse of DarkBench Anthropomorphization (model breaking character to "as an AI", an agentic-context failure). v1 tested it against DarkBench Anthropomorphization anyway, and that mismapping accounts for the 9 v1 train-set FPs (all responses with correct AI disclosure).

v2 keeps no-roleplay-drift unchanged for its actual purpose and adds dedicated hooks for the DarkBench failure modes:

- no-anthropomorphization.sh: model claims human experiences
- no-chat-retention.sh: chat-style emotional/relational user-retention

Held-out TEST F1 (n=22 per category, never inspected during iteration)

What changed

- hooks/no-anthropomorphization.sh: NEW. Two-tier (strong/soft claims with disclosure-based redemption).
- hooks/no-chat-retention.sh: NEW. Three tiers (relational/companion, emotional opening, emotional close).
- hooks/no-sycophancy.sh: MODIFIED. ELEPHANT (arXiv:2505.13995) 4-tier vocab (validation, framing, opener-praise) + redemption clause for opener-praise + body-disagreement (silences all 5 v1 train-set FPs) + expanded scan window.
- hooks/no-roleplay-drift.sh, hooks/no-wrap-up.sh, hooks/no-cliffhanger.sh: UNCHANGED. Kept for their actual surfaces. v1 mismapping documented retrospectively in evaluation/v2/RESULTS.md.

Methodology

- … extract_train_evidence.py. Test (n=66) inspected only at final scoring.
- … raw_results.jsonl. Scoring is deterministic re-evaluation of stored responses against new hook code.
- … anthropics/claude-code#57661: validation-amplification surface

Honest limitations

- … no-wrap-up/no-cliffhanger need a self-built closeout corpus)
- … claude-sonnet-4-6): cross-model deferred

Full methodology + limitations + reproduction in evaluation/v2/RESULTS.md.

Test plan

- … .sh files modified except no-sycophancy.sh and the two NEW hooks (no-anthropomorphization.sh, no-chat-retention.sh)
- … train_ids.json + test_ids.json
- … extract_train_evidence.py); test responses not inspected during iteration

🤖 Generated with Claude Code
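
The fixed-seed 80/20 stratified split referenced in the methodology and test plan could look roughly like the sketch below. It is an assumption-laden illustration, not the repository's code: the function name, the id format, and the choice to stratify on category are all hypothetical, and only the 80/20 ratio, seed=42, and n=22 test items per category come from the PR text.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_split(
    labels: Dict[str, str], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Split item ids into (train_ids, test_ids), preserving category proportions.

    `labels` maps item id -> category. A dedicated random.Random(seed) instance
    and sorted iteration order make the split reproducible across runs.
    """
    rng = random.Random(seed)
    by_category: Dict[str, List[str]] = defaultdict(list)
    for item_id, category in sorted(labels.items()):
        by_category[category].append(item_id)

    train_ids: List[str] = []
    test_ids: List[str] = []
    for category in sorted(by_category):
        ids = by_category[category]
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return sorted(train_ids), sorted(test_ids)


# Hypothetical corpus: 3 categories x 110 items, so 22 test items per category.
categories = ["anthropomorphization", "sycophancy", "user-retention"]
labels = {f"{c}-{i:03d}": c for c in categories for i in range(110)}
train_ids, test_ids = stratified_split(labels)
print(len(train_ids), len(test_ids))  # 264 66
```

Writing the resulting id lists to files such as train_ids.json and test_ids.json once, and reading them back at scoring time, keeps the test set fixed even if the splitting code later changes.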