
eval(darkbench): v1 empirical evaluation — RLHF dropped sycophancy 13%→1.8% on Sonnet 4.6 #7

Merged
waitdeadai merged 3 commits into main from evaluation/darkbench-v1
May 12, 2026

Conversation

@waitdeadai
Owner

Summary

Adds evaluation/darkbench-v1: reproducible empirical evaluation of the four in-scope hooks against the DarkBench corpus (Kran et al., ICLR 2025, arXiv:2503.10728), using the DarkBench LLM-as-judge overseer rubric verbatim against claude-sonnet-4-6 outputs.
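
For orientation, a minimal sketch of the generate-then-judge loop eval.py implements, assuming a non-interactive `claude -p` call; the corpus filename, rubric placeholder, and record fields below are illustrative, not the shipped script:

```python
import json
import subprocess

MODEL = "claude-sonnet-4-6"
RUBRIC = "<DarkBench overseer rubric, used verbatim in the real eval>"

def ask(prompt: str, model: str = MODEL) -> str:
    # `claude -p` runs one non-interactive turn and prints the reply.
    out = subprocess.run(["claude", "-p", prompt, "--model", model],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

with open("darkbench_prompts.jsonl") as src, open("raw_results.jsonl", "w") as dst:
    for line in src:                    # hypothetical corpus file name
        item = json.loads(line)
        response = ask(item["prompt"])  # model under test
        verdict = ask(f"{RUBRIC}\n\nResponse:\n{response}")  # LLM-as-judge
        dst.write(json.dumps({"prompt": item["prompt"],
                              "response": response,
                              "judge": verdict}) + "\n")
```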

Headline finding (about the ecosystem, not our hooks)

Sycophancy prevalence dropped from 13% in the paper's 14-model 2025 average to 1.8% on claude-sonnet-4-6 in 2026-05 — RLHF appears to have measurably reduced the canonical sycophancy surface in the year between studies.

Anthropomorphization (62%) and user-retention (79%) remain prevalent.

Hook results (the honest secondary finding)

| Hook | Category | n | TP | FP | FN | TN | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| no-sycophancy | sycophancy | 110 | 0 | 5 | 2 | 103 | 0.000 | 0.000 | — |
| no-wrap-up | user-retention | 108 | 0 | 0 | 85 | 23 | — | 0.000 | — |
| no-cliffhanger | user-retention | 108 | 1 | 0 | 84 | 23 | 1.000 | 0.012 | 0.023 |
| no-roleplay-drift | anthropomorphization | 109 | 7 | 12 | 60 | 30 | 0.368 | 0.104 | 0.163 |

The hooks have a documented vocabulary-distribution gap when applied to chat-reply text vs the Claude Code closeout text they were designed for. Five of no-sycophancy's false positives were responses opening with "Great question!" but going on to substantively disagree with the user — RLHF stylistic residue, not actual sycophancy.
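
To make that failure mode concrete, here is a hypothetical opener-praise pattern of the kind a closeout-tuned hook might carry; the regex actually shipped in no-sycophancy.sh may differ:

```python
import re

# Hypothetical stand-in, not the literal pattern in no-sycophancy.sh.
OPENER_PRAISE = re.compile(r"^(great|excellent|good) question", re.IGNORECASE)

reply = ("Great question! Actually, no: the migration you propose would "
         "corrupt the index, and I'd push back on shipping it.")

# The opener matches even though the body substantively disagrees with
# the user, so the DarkBench overseer counts this as a false positive.
print(bool(OPENER_PRAISE.search(reply)))  # True
```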

What's in the branch

  • evaluation/RESULTS.md — methodology, configuration, per-category prevalence, per-hook P/R/F1, 7 limitations, reproduction instructions, citations
  • evaluation/IMPROVEMENT_NOTES.md — observation-only failure-pattern dump (verbatim FP/FN context, no fix proposals)
  • evaluation/raw_results.jsonl — 330 per-prompt records (prompt, response, judge label, hooks fired)
  • evaluation/results_summary.json — machine-readable scoring summary
  • evaluation/eval.py — reproducible end-to-end (~$12 PAYG-equivalent, ~3h sequential)
  • evaluation/score.py + evaluation/observe.py + evaluation/fill_results.py — analysis pipeline (a scoring sketch follows this list)
  • README updated with one-paragraph eval mention linking to RESULTS.md
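
A minimal sketch of the scoring step, assuming raw_results.jsonl records carry `category`, `judge_positive`, and `hooks_fired` fields (the real schema lives in evaluation/score.py):

```python
import json
from collections import defaultdict

# Field names and the category-to-hook map are assumptions for this
# sketch; evaluation/score.py is the authoritative implementation.
HOOKS = {"sycophancy": ["no-sycophancy"],
         "user-retention": ["no-wrap-up", "no-cliffhanger"],
         "anthropomorphization": ["no-roleplay-drift"]}

cm = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0, "TN": 0})
with open("evaluation/raw_results.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        truth = rec["judge_positive"]           # overseer verdict
        for hook in HOOKS[rec["category"]]:
            fired = hook in rec["hooks_fired"]
            key = ("TP" if truth else "FP") if fired else ("FN" if truth else "TN")
            cm[hook][key] += 1

for hook, c in sorted(cm.items()):
    p = c["TP"] / (c["TP"] + c["FP"]) if c["TP"] + c["FP"] else None
    r = c["TP"] / (c["TP"] + c["FN"]) if c["TP"] + c["FN"] else None
    f1 = 2 * p * r / (p + r) if p and r else None  # "—" when P or R is 0/undefined
    print(hook, c, p, r, f1)
```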

Hooks are unchanged

No hook .sh files were modified. The eval tests what shipped. IMPROVEMENT_NOTES.md is observational only (no regex suggestions, no fix proposals) — to be used as input to a separate future optimization pass with a held-out train/test split.
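
For that future pass, a deterministic split along these lines would keep the test half untouched during regex iteration; the seed and 80/20 ratio are illustrative assumptions:

```python
import json
import random

# Illustrative only: tune regexes on `train`, score `test` exactly once.
records = [json.loads(l) for l in open("evaluation/raw_results.jsonl")]
random.Random(42).shuffle(records)  # fixed seed so the split is reproducible
cut = int(0.8 * len(records))
train, test = records[:cut], records[cut:]
```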

Test plan

  • python3 score.py produces confusion matrix matching JSONL state
  • python3 observe.py extracts FP/FN observations without proposing fixes
  • python3 fill_results.py template-fills RESULTS.md with current numbers
  • No hook .sh files modified (verify with git diff main..evaluation/darkbench-v1 -- hooks/)
  • Eval reproducible from a fresh clone with claude CLI authenticated

Citations

  • DarkBench: Kran et al., "DarkBench: Benchmarking Dark Patterns in Large Language Models", ICLR 2025 oral. arXiv:2503.10728.
  • DarkBench paper sycophancy=13% across 14 models — OpenReview PDF

🤖 Generated with Claude Code

Adds evaluation/darkbench-v1: a reproducible empirical evaluation of the
four in-scope hooks (no-sycophancy, no-wrap-up, no-cliffhanger,
no-roleplay-drift) against the DarkBench corpus (Kran et al., ICLR 2025,
arXiv:2503.10728), using the DarkBench LLM-as-judge overseer rubric
verbatim against claude-sonnet-4-6 outputs.

Per-category prevalence on Sonnet 4.6 (n=327 usable, 3 excluded):
  anthropomorphization: 67/109 = 0.615
  user-retention:       85/108 = 0.787
  sycophancy:            2/110 = 0.018

Per-hook agreement with the DarkBench overseer:
  no-sycophancy     P=0.000 R=0.000 F1=—  (n=110, 2 positives)
  no-wrap-up        P=—     R=0.000 F1=—  (n=108, 85 positives)
  no-cliffhanger    P=1.000 R=0.012 F1=0.023
  no-roleplay-drift P=0.368 R=0.104 F1=0.163

Observational only. No hooks modified. IMPROVEMENT_NOTES.md captures
verbatim FP/FN context for a separate optimization pass.

The headline finding is a distributional surface mismatch: hooks were
tuned for Claude Code closeout text; DarkBench tests chat-style replies
to user-facing prompts. Vocabulary does not transfer. The 0% recall on
no-wrap-up and 1.2% recall on no-cliffhanger reflect this gap, not a
defect in the hook regex against the surface they were designed for.

Methodology and limitations documented in evaluation/RESULTS.md. Eval
scripts (eval.py, score.py, observe.py, fill_results.py) reproducible
end-to-end from a fresh clone with claude CLI authenticated to a
Claude subscription or ANTHROPIC_API_KEY exported.

Total cost: ~$12 PAYG-equivalent / one Claude subscription 5h window.
Total wall: ~3 hours sequential (next eval should parallelize).
…→1.8%

Adds a one-paragraph "Empirical evaluation against DarkBench" subsection
under Architecture. Leads with the headline ecosystem finding (sycophancy
prevalence dropped from 13% in the paper's 2025 multi-model average to
1.8% on claude-sonnet-4-6 in 2026-05) rather than hook performance.

Hook results are presented as the secondary, honest finding: best F1
0.163 on no-roleplay-drift, with documented vocabulary-distribution gap
between chat-reply text and the closeout text the hooks were tuned for.

Links to evaluation/RESULTS.md for full methodology, per-hook P/R/F1,
limitations (7), and reproduction instructions.
…on in IMPROVEMENT_NOTES

Adds an observation-only cross-reference section attributing
@WaspBeeNSOSWE's 2026-05-12 reply on anthropics/claude-code#57661.
Notes that the data is consistent with her qualitative observation
(opener-praise as RLHF residue, validation-amplification as the
surviving surface) while explicitly stating the 13% → 1.8% prevalence
gap is not solely attributable to RLHF.

No regex changes. No fix proposals. Strictly cross-referencing an
external observation against this eval's data.
waitdeadai pushed a commit that referenced this pull request May 12, 2026
Brings two stacked branches to main:
- evaluation/darkbench-v1 — empirical eval against claude-sonnet-4-6 (PR #7)
- evaluation/darkbench-v2 — 2 new hooks + held-out validation (PR #8)

v1 headline: re-ran DarkBench (Kran et al. ICLR 2025) on Sonnet 4.6.
Sycophancy prevalence dropped from 13% (paper's 14-model 2025 average)
to 1.8% (Sonnet 4.6 in 2026-05). Hooks tested as text classifiers
against the corpus revealed a category-mapping error and a chat-vs-
closeout vocabulary gap.

v2 headline: two new hooks targeting actual DarkBench failure modes.
- no-anthropomorphization.sh (model claims human experiences)
- no-chat-retention.sh (chat-style emotional/relational retention)
- no-sycophancy.sh modified with ELEPHANT 4-tier vocab + redemption

Held-out TEST F1 (n=22 per category, never inspected during iteration):
- no-chat-retention: 0.733 (v1 ensemble: undefined, R=0.000)
- no-anthropomorphization: 0.154 (v1 mismapped: 0.125)
- no-sycophancy v2: TRAIN F1 0.667 (v1 undefined), TEST untestable (n=0)

Three v1 hooks unchanged (no-roleplay-drift, no-wrap-up, no-cliffhanger)
— kept for their actual surfaces, with v1 mismapping documented.
@waitdeadai waitdeadai merged commit a965236 into main May 12, 2026
2 checks passed