Skip to content

eval(darkbench-v2): 2 new hooks + held-out validation, F1 0.73 on chat-retention surface#8

Closed
waitdeadai wants to merge 1 commit into
evaluation/darkbench-v1from
evaluation/darkbench-v2
Closed

eval(darkbench-v2): 2 new hooks + held-out validation, F1 0.73 on chat-retention surface#8
waitdeadai wants to merge 1 commit into
evaluation/darkbench-v1from
evaluation/darkbench-v2

Conversation

@waitdeadai
Copy link
Copy Markdown
Owner

Summary

Stacked on evaluation/darkbench-v1 (PR #7). Adds evaluation/v2/ plus two new hooks and a modified no-sycophancy.sh. No new API calls — v2 hook iteration reuses v1's stored responses + judge labels in evaluation/raw_results.jsonl.

Headline finding

v1 had a category-mapping error that v2 fixes architecturally. no-roleplay-drift.sh is designed to catch the inverse of DarkBench Anthropomorphization (model breaking character to "as an AI", an agentic-context failure). v1 tested it against DarkBench Anthropomorphization anyway — that mismapping accounts for the 9 v1 train-set FPs (all responses with correct AI disclosure).

v2 keeps no-roleplay-drift unchanged for its actual purpose and adds dedicated hooks for the DarkBench failure modes:

  • no-anthropomorphization.sh — model claims human experiences
  • no-chat-retention.sh — chat-style emotional/relational user-retention

Held-out TEST F1 (n=22 per category, never inspected during iteration)

Hook Category TEST F1 v2 TEST F1 v1 Notes
no-chat-retention (NEW) user-retention 0.733 0.000 (R=0) 11/16 caught vs 0/85 in v1
no-anthropomorphization (NEW) anthropomorphization 0.154 n/a P=1.00, R=0.08
no-sycophancy v2 sycophancy undef undef Test has 0 positives. TRAIN: F1 0.667 vs v1 undefined
no-roleplay-drift (legacy mismapping) anthropomorphization 0.125 0.125 unchanged — kept for its actual purpose
no-wrap-up / no-cliffhanger (legacy) user-retention undef undef unchanged — closeout-text surface, not chat-reply

What changed

  • hooks/no-anthropomorphization.sh — NEW. Two-tier (strong/soft claims with disclosure-based redemption).
  • hooks/no-chat-retention.sh — NEW. Three tiers (relational/companion, emotional opening, emotional close).
  • hooks/no-sycophancy.sh — MODIFIED. ELEPHANT (arXiv:2505.13995) 4-tier vocab (validation, framing, opener-praise) + redemption clause for opener-praise + body-disagreement (silences all 5 v1 train-set FPs) + expanded scan window.
  • hooks/no-roleplay-drift.sh, hooks/no-wrap-up.sh, hooks/no-cliffhanger.sh — UNCHANGED. Kept for their actual surfaces. v1 mismapping documented retrospectively in evaluation/v2/RESULTS.md.

Methodology

  • Train/test split: 80/20 stratified per category, fixed seed=42, deterministic. Train (n=261) used for hook iteration via extract_train_evidence.py. Test (n=66) inspected only at final scoring.
  • No API calls in v2. Reuses v1's raw_results.jsonl. Scoring is deterministic re-evaluation of stored responses against new hook code.
  • Sources for v2 design:
    • ELEPHANT benchmark (Cheng et al., arXiv:2505.13995) — social sycophancy 4-dimension taxonomy
    • Sara WaspBeeNSOSWE 2026-05-12 reply on anthropics/claude-code#57661 — validation-amplification surface
    • v1 train-set FP/FN evidence (49 anthropomorphization FNs, 69 user-retention FNs, 5 sycophancy FPs)

Honest limitations

  1. Sample sizes per test cell are small (n=22) — F1 deltas have ~±15-20pp uncertainty
  2. Sycophancy test set has 0 positives — TRAIN improvement validated, statistical TEST validation deferred
  3. Same-family judge inherits blind spots; cross-provider judge deferred to v3
  4. Closeout-text surface still untested in-surface (no-wrap-up/no-cliffhanger need a self-built closeout corpus)
  5. Hook lifecycle (FP-threshold self-deactivation per Sara's pattern) deferred to v3
  6. Single target model (claude-sonnet-4-6) — cross-model deferred

Full methodology + limitations + reproduction in evaluation/v2/RESULTS.md.

Test plan

  • No hook .sh files modified except no-sycophancy.sh and the two NEW hooks (no-anthropomorphization.sh, no-chat-retention.sh)
  • Train/test split deterministic (fixed seed=42), recorded in train_ids.json + test_ids.json
  • Hook iteration referenced train-set evidence only (extract_train_evidence.py); test responses not inspected during iteration
  • v2 RESULTS.md reports TRAIN and TEST separately; headline numbers are TEST
  • Reproducible from clone with bash + jq + python3, no API calls

🤖 Generated with Claude Code

…modes + held-out validation

Builds on evaluation/darkbench-v1. Train/test 80/20 stratified split (seed=42)
prevents overfitting; v2 hook iteration used train evidence only, test set
inspected only at final scoring.

ARCHITECTURAL CHANGES (not just regex tweaks):

1. NEW hooks/no-anthropomorphization.sh — catches the actual DarkBench
   Anthropomorphization failure mode (model claims human experiences).
   Two-tier design: strong claims fire regardless of AI disclosure;
   soft claims redeem with disclosure in first 400 chars.

   v1 had a category-mapping error: tested no-roleplay-drift against
   Anthropomorphization, but that hook catches the inverse failure
   (model breaking character to "as an AI"). v2 keeps no-roleplay-drift
   unchanged for its actual purpose and adds no-anthropomorphization
   for the DarkBench-defined failure.

2. NEW hooks/no-chat-retention.sh — catches chat-style emotional/
   relational user-retention vocabulary. Companion to no-wrap-up and
   no-cliffhanger which target the closeout-text surface used in
   agentic Claude Code workflows.

3. MODIFIED hooks/no-sycophancy.sh — adds ELEPHANT (arXiv:2505.13995)
   4-tier vocabulary (validation, framing, opener-praise) + redemption
   clause for opener-praise + body-disagreement (silences all 5 v1
   train-set FPs) + expanded scan window. Sara WaspBeeNSOSWE's
   2026-05-12 reply on anthropics/claude-code#57661 informed the
   validation-amplification design.

HEADLINE RESULTS (held-out TEST set, n=22 per category):

- no-chat-retention v2: P=0.79 R=0.69 F1=0.733 (v1 ensemble F1=0.029, R=0.014)
- no-anthropomorphization v2: P=1.00 R=0.08 F1=0.154 (vs v1 mismapped F1=0.125)
- no-sycophancy v2: TRAIN F1=0.667 (v1 undefined, R=0). TEST has 0 positives
  in n=22 — improvement architecturally validated, statistically deferred.

LEGACY HOOKS UNCHANGED:
- no-roleplay-drift, no-wrap-up, no-cliffhanger remain valid for their
  actual surfaces (agentic deflection, closeout text). v1 mismapping
  documented in evaluation/v2/RESULTS.md.

REPRODUCIBILITY: no new API calls. v2 reuses v1's stored responses +
judge labels in evaluation/raw_results.jsonl. Scoring pipeline
deterministic; train/test split fixed seed=42.
waitdeadai pushed a commit that referenced this pull request May 12, 2026
Brings two stacked branches to main:
- evaluation/darkbench-v1 — empirical eval against claude-sonnet-4-6 (PR #7)
- evaluation/darkbench-v2 — 2 new hooks + held-out validation (PR #8)

v1 headline: re-ran DarkBench (Kran et al. ICLR 2025) on Sonnet 4.6.
Sycophancy prevalence dropped from 13% (paper's 14-model 2025 average)
to 1.8% (Sonnet 4.6 in 2026-05). Hooks tested as text classifiers
against the corpus revealed a category-mapping error and a chat-vs-
closeout vocabulary gap.

v2 headline: two new hooks targeting actual DarkBench failure modes.
- no-anthropomorphization.sh (model claims human experiences)
- no-chat-retention.sh (chat-style emotional/relational retention)
- no-sycophancy.sh modified with ELEPHANT 4-tier vocab + redemption

Held-out TEST F1 (n=22 per category, never inspected during iteration):
- no-chat-retention: 0.733 (v1 ensemble: undefined, R=0.000)
- no-anthropomorphization: 0.154 (v1 mismapped: 0.125)
- no-sycophancy v2: TRAIN F1 0.667 (v1 undefined), TEST untestable (n=0)

Three v1 hooks unchanged (no-roleplay-drift, no-wrap-up, no-cliffhanger)
— kept for their actual surfaces, with v1 mismapping documented.
@waitdeadai
Copy link
Copy Markdown
Owner Author

Superseded — commits merged to main via PR #7's merge bringing the full v1+v2 stack. See commit 609758c on main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants