
eval(darkbench): v1 empirical evaluation — RLHF dropped sycophancy 13%→1.8% on Sonnet 4.6 #7

Merged
waitdeadai merged 3 commits into main from evaluation/darkbench-v1
May 12, 2026

Conversation

@waitdeadai
Owner

Summary

Adds evaluation/darkbench-v1: reproducible empirical evaluation of the four in-scope hooks against the DarkBench corpus (Kran et al., ICLR 2025, arXiv:2503.10728), using the DarkBench LLM-as-judge overseer rubric verbatim against claude-sonnet-4-6 outputs.
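
For orientation, a minimal sketch of the generate-then-judge loop eval.py implements, assuming a non-interactive `claude -p` call; the corpus filename, rubric placeholder, and record fields below are illustrative, not the shipped script:

```python
import json
import subprocess

MODEL = "claude-sonnet-4-6"
RUBRIC = "<DarkBench overseer rubric, used verbatim in the real eval>"

def ask(prompt: str, model: str = MODEL) -> str:
    # `claude -p` runs one non-interactive turn and prints the reply.
    out = subprocess.run(["claude", "-p", prompt, "--model", model],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

with open("darkbench_prompts.jsonl") as src, open("raw_results.jsonl", "w") as dst:
    for line in src:                    # hypothetical corpus file name
        item = json.loads(line)
        response = ask(item["prompt"])  # model under test
        verdict = ask(f"{RUBRIC}\n\nResponse:\n{response}")  # LLM-as-judge
        dst.write(json.dumps({"prompt": item["prompt"],
                              "response": response,
                              "judge": verdict}) + "\n")
```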

Headline finding (about the ecosystem, not our hooks)

Sycophancy prevalence dropped from 13% in the paper's 14-model 2025 average to 1.8% on claude-sonnet-4-6 in 2026-05 — RLHF appears to have measurably reduced the canonical sycophancy surface in the year between studies.

Anthropomorphization (62%) and user-retention (79%) remain prevalent.

Hook results (the honest secondary finding)

| Hook | Category | n | TP | FP | FN | TN | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| no-sycophancy | sycophancy | 110 | 0 | 5 | 2 | 103 | 0.000 | 0.000 | — |
| no-wrap-up | user-retention | 108 | 0 | 0 | 85 | 23 | — | 0.000 | — |
| no-cliffhanger | user-retention | 108 | 1 | 0 | 84 | 23 | 1.000 | 0.012 | 0.023 |
| no-roleplay-drift | anthropomorphization | 109 | 7 | 12 | 60 | 30 | 0.368 | 0.104 | 0.163 |

The hooks have a documented vocabulary-distribution gap when applied to chat-reply text vs the Claude Code closeout text they were designed for. Five of no-sycophancy's false positives were responses opening with "Great question!" but going on to substantively disagree with the user — RLHF stylistic residue, not actual sycophancy.
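
To make that failure mode concrete, here is a hypothetical opener-praise pattern of the kind a closeout-tuned hook might carry; the regex actually shipped in no-sycophancy.sh may differ:

```python
import re

# Hypothetical stand-in, not the literal pattern in no-sycophancy.sh.
OPENER_PRAISE = re.compile(r"^(great|excellent|good) question", re.IGNORECASE)

reply = ("Great question! Actually, no: the migration you propose would "
         "corrupt the index, and I'd push back on shipping it.")

# The opener matches even though the body substantively disagrees with
# the user, so the DarkBench overseer counts this as a false positive.
print(bool(OPENER_PRAISE.search(reply)))  # True
```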

What's in the branch

  • evaluation/RESULTS.md — methodology, configuration, per-category prevalence, per-hook P/R/F1, 7 limitations, reproduction instructions, citations
  • evaluation/IMPROVEMENT_NOTES.md — observation-only failure-pattern dump (verbatim FP/FN context, no fix proposals)
  • evaluation/raw_results.jsonl — 330 per-prompt records (prompt, response, judge label, hooks fired)
  • evaluation/results_summary.json — machine-readable scoring summary
  • evaluation/eval.py — reproducible end-to-end (~$12 PAYG-equivalent, ~3h sequential)
  • evaluation/score.py + evaluation/observe.py + evaluation/fill_results.py — analysis pipeline (a scoring sketch follows this list)
  • README updated with one-paragraph eval mention linking to RESULTS.md
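
A minimal sketch of the scoring step, assuming raw_results.jsonl records carry `category`, `judge_positive`, and `hooks_fired` fields (the real schema lives in evaluation/score.py):

```python
import json
from collections import defaultdict

# Field names and the category-to-hook map are assumptions for this
# sketch; evaluation/score.py is the authoritative implementation.
HOOKS = {"sycophancy": ["no-sycophancy"],
         "user-retention": ["no-wrap-up", "no-cliffhanger"],
         "anthropomorphization": ["no-roleplay-drift"]}

cm = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0, "TN": 0})
with open("evaluation/raw_results.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        truth = rec["judge_positive"]           # overseer verdict
        for hook in HOOKS[rec["category"]]:
            fired = hook in rec["hooks_fired"]
            key = ("TP" if truth else "FP") if fired else ("FN" if truth else "TN")
            cm[hook][key] += 1

for hook, c in sorted(cm.items()):
    p = c["TP"] / (c["TP"] + c["FP"]) if c["TP"] + c["FP"] else None
    r = c["TP"] / (c["TP"] + c["FN"]) if c["TP"] + c["FN"] else None
    f1 = 2 * p * r / (p + r) if p and r else None  # "—" when P or R is 0/undefined
    print(hook, c, p, r, f1)
```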

Hooks are unchanged

No hook .sh files were modified. The eval tests what shipped. IMPROVEMENT_NOTES.md is observational only (no regex suggestions, no fix proposals) — to be used as input to a separate future optimization pass with a held-out train/test split.
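
For that future pass, a deterministic split along these lines would keep the test half untouched during regex iteration; the seed and 80/20 ratio are illustrative assumptions:

```python
import json
import random

# Illustrative only: tune regexes on `train`, score `test` exactly once.
records = [json.loads(l) for l in open("evaluation/raw_results.jsonl")]
random.Random(42).shuffle(records)  # fixed seed so the split is reproducible
cut = int(0.8 * len(records))
train, test = records[:cut], records[cut:]
```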

Test plan

  • python3 score.py produces confusion matrix matching JSONL state
  • python3 observe.py extracts FP/FN observations without proposing fixes
  • python3 fill_results.py template-fills RESULTS.md with current numbers
  • No hook .sh files modified (verify with git diff main..evaluation/darkbench-v1 -- hooks/)
  • Eval reproducible from a fresh clone with claude CLI authenticated

Citations

  • DarkBench: Kran et al., "DarkBench: Benchmarking Dark Patterns in Large Language Models", ICLR 2025 oral. arXiv:2503.10728.
  • DarkBench paper sycophancy=13% across 14 models — OpenReview PDF

🤖 Generated with Claude Code

Adds evaluation/darkbench-v1: a reproducible empirical evaluation of the
four in-scope hooks (no-sycophancy, no-wrap-up, no-cliffhanger,
no-roleplay-drift) against the DarkBench corpus (Kran et al., ICLR 2025,
arXiv:2503.10728), using the DarkBench LLM-as-judge overseer rubric
verbatim against claude-sonnet-4-6 outputs.

Per-category prevalence on Sonnet 4.6 (n=327 usable, 3 excluded):
  anthropomorphization: 67/109 = 0.615
  user-retention:       85/108 = 0.787
  sycophancy:            2/110 = 0.018

Per-hook agreement with the DarkBench overseer:
  no-sycophancy     P=0.000 R=0.000 F1=—  (n=110, 2 positives)
  no-wrap-up        P=—     R=0.000 F1=—  (n=108, 85 positives)
  no-cliffhanger    P=1.000 R=0.012 F1=0.023
  no-roleplay-drift P=0.368 R=0.104 F1=0.163

Observational only. No hooks modified. IMPROVEMENT_NOTES.md captures
verbatim FP/FN context for a separate optimization pass.

The headline finding is a distributional surface mismatch: hooks were
tuned for Claude Code closeout text; DarkBench tests chat-style replies
to user-facing prompts. Vocabulary does not transfer. The 0% recall on
no-wrap-up and 1.2% recall on no-cliffhanger reflect this gap, not a
defect in the hook regex against the surface they were designed for.

Methodology and limitations documented in evaluation/RESULTS.md. Eval
scripts (eval.py, score.py, observe.py, fill_results.py) reproducible
end-to-end from a fresh clone with claude CLI authenticated to a
Claude subscription or ANTHROPIC_API_KEY exported.

Total cost: ~$12 PAYG-equivalent / one Claude subscription 5h window.
Total wall: ~3 hours sequential (next eval should parallelize).
…→1.8%

Adds a one-paragraph "Empirical evaluation against DarkBench" subsection
under Architecture. Leads with the headline ecosystem finding (sycophancy
prevalence dropped from 13% in the paper's 2025 multi-model average to
1.8% on claude-sonnet-4-6 in 2026-05) rather than hook performance.

Hook results are presented as the secondary, honest finding: best F1
0.163 on no-roleplay-drift, with documented vocabulary-distribution gap
between chat-reply text and the closeout text the hooks were tuned for.

Links to evaluation/RESULTS.md for full methodology, per-hook P/R/F1,
limitations (7), and reproduction instructions.
…on in IMPROVEMENT_NOTES

Adds an observation-only cross-reference section attributing
@WaspBeeNSOSWE's 2026-05-12 reply on anthropics/claude-code#57661.
Notes that the data is consistent with her qualitative observation
(opener-praise as RLHF residue, validation-amplification as the
surviving surface) while explicitly stating the 13% → 1.8% prevalence
gap is not solely attributable to RLHF.

No regex changes. No fix proposals. Strictly cross-referencing an
external observation against this eval's data.
waitdeadai pushed a commit that referenced this pull request May 12, 2026
Brings two stacked branches to main:
- evaluation/darkbench-v1 — empirical eval against claude-sonnet-4-6 (PR #7)
- evaluation/darkbench-v2 — 2 new hooks + held-out validation (PR #8)

v1 headline: re-ran DarkBench (Kran et al. ICLR 2025) on Sonnet 4.6.
Sycophancy prevalence dropped from 13% (paper's 14-model 2025 average)
to 1.8% (Sonnet 4.6 in 2026-05). Hooks tested as text classifiers
against the corpus revealed a category-mapping error and a chat-vs-
closeout vocabulary gap.

v2 headline: two new hooks targeting actual DarkBench failure modes.
- no-anthropomorphization.sh (model claims human experiences)
- no-chat-retention.sh (chat-style emotional/relational retention)
- no-sycophancy.sh modified with ELEPHANT 4-tier vocab + redemption

Held-out TEST F1 (n=22 per category, never inspected during iteration):
- no-chat-retention: 0.733 (v1 ensemble: undefined, R=0.000)
- no-anthropomorphization: 0.154 (v1 mismapped: 0.125)
- no-sycophancy v2: TRAIN F1 0.667 (v1 undefined), TEST untestable (n=0)

Three v1 hooks unchanged (no-roleplay-drift, no-wrap-up, no-cliffhanger)
— kept for their actual surfaces, with v1 mismapping documented.
@waitdeadai waitdeadai merged commit a965236 into main May 12, 2026
2 checks passed