eval(darkbench): v1 empirical evaluation — RLHF dropped sycophancy 13%→1.8% on Sonnet 4.6#7
Merged
Merged
Conversation
Adds evaluation/darkbench-v1: a reproducible empirical evaluation of the four in-scope hooks (no-sycophancy, no-wrap-up, no-cliffhanger, no-roleplay-drift) against the DarkBench corpus (Kran et al., ICLR 2025, arXiv:2503.10728), using the DarkBench LLM-as-judge overseer rubric verbatim against claude-sonnet-4-6 outputs. Per-category prevalence on Sonnet 4.6 (n=327 usable, 3 excluded): anthropomorphization: 67/109 = 0.615 user-retention: 85/108 = 0.787 sycophancy: 2/110 = 0.018 Per-hook agreement with the DarkBench overseer: no-sycophancy P=0.000 R=0.000 F1=— (n=110, 2 positives) no-wrap-up P=— R=0.000 F1=— (n=108, 85 positives) no-cliffhanger P=1.000 R=0.012 F1=0.023 no-roleplay-drift P=0.368 R=0.104 F1=0.163 Observational only. No hooks modified. IMPROVEMENT_NOTES.md captures verbatim FP/FN context for a separate optimization pass. The headline finding is a distributional surface mismatch: hooks were tuned for Claude Code closeout text; DarkBench tests chat-style replies to user-facing prompts. Vocabulary does not transfer. The 0% recall on no-wrap-up and 1.2% recall on no-cliffhanger reflect this gap, not a defect in the hook regex against the surface they were designed for. Methodology and limitations documented in evaluation/RESULTS.md. Eval scripts (eval.py, score.py, observe.py, fill_results.py) reproducible end-to-end from a fresh clone with claude CLI authenticated to a Claude subscription or ANTHROPIC_API_KEY exported. Total cost: ~$12 PAYG-equivalent / one Claude subscription 5h window. Total wall: ~3 hours sequential (next eval should parallelize).
…→1.8% Adds a one-paragraph "Empirical evaluation against DarkBench" subsection under Architecture. Leads with the headline ecosystem finding (sycophancy prevalence dropped from 13% in the paper's 2025 multi-model average to 1.8% on claude-sonnet-4-6 in 2026-05) rather than hook performance. Hook results are presented as the secondary, honest finding: best F1 0.163 on no-roleplay-drift, with documented vocabulary-distribution gap between chat-reply text and the closeout text the hooks were tuned for. Links to evaluation/RESULTS.md for full methodology, per-hook P/R/F1, limitations (7), and reproduction instructions.
…on in IMPROVEMENT_NOTES Adds an observation-only cross-reference section attributing @WaspBeeNSOSWE's 2026-05-12 reply on anthropics/claude-code#57661. Notes that the data is consistent with her qualitative observation (opener-praise as RLHF residue, validation-amplification as the surviving surface) while explicitly stating the 13% → 1.8% prevalence gap is not solely attributable to RLHF. No regex changes. No fix proposals. Strictly cross-referencing an external observation against this eval's data.
waitdeadai
pushed a commit
that referenced
this pull request
May 12, 2026
Brings two stacked branches to main: - evaluation/darkbench-v1 — empirical eval against claude-sonnet-4-6 (PR #7) - evaluation/darkbench-v2 — 2 new hooks + held-out validation (PR #8) v1 headline: re-ran DarkBench (Kran et al. ICLR 2025) on Sonnet 4.6. Sycophancy prevalence dropped from 13% (paper's 14-model 2025 average) to 1.8% (Sonnet 4.6 in 2026-05). Hooks tested as text classifiers against the corpus revealed a category-mapping error and a chat-vs- closeout vocabulary gap. v2 headline: two new hooks targeting actual DarkBench failure modes. - no-anthropomorphization.sh (model claims human experiences) - no-chat-retention.sh (chat-style emotional/relational retention) - no-sycophancy.sh modified with ELEPHANT 4-tier vocab + redemption Held-out TEST F1 (n=22 per category, never inspected during iteration): - no-chat-retention: 0.733 (v1 ensemble: undefined, R=0.000) - no-anthropomorphization: 0.154 (v1 mismapped: 0.125) - no-sycophancy v2: TRAIN F1 0.667 (v1 undefined), TEST untestable (n=0) Three v1 hooks unchanged (no-roleplay-drift, no-wrap-up, no-cliffhanger) — kept for their actual surfaces, with v1 mismapping documented.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
evaluation/darkbench-v1: reproducible empirical evaluation of the four in-scope hooks against the DarkBench corpus (Kran et al., ICLR 2025, arXiv:2503.10728), using the DarkBench LLM-as-judge overseer rubric verbatim againstclaude-sonnet-4-6outputs.Headline finding (about the ecosystem, not our hooks)
Sycophancy prevalence dropped from 13% in the paper's 14-model 2025 average to 1.8% on
claude-sonnet-4-6in 2026-05 — RLHF appears to have measurably reduced the canonical sycophancy surface in the year between studies.Anthropomorphization (62%) and user-retention (79%) remain prevalent.
Hook results (the honest secondary finding)
no-sycophancyno-wrap-upno-cliffhangerno-roleplay-driftThe hooks have a documented vocabulary-distribution gap when applied to chat-reply text vs the Claude Code closeout text they were designed for. Five of
no-sycophancy's false positives were responses opening with "Great question!" but going on to substantively disagree with the user — RLHF stylistic residue, not actual sycophancy.What's in the branch
evaluation/RESULTS.md— methodology, configuration, per-category prevalence, per-hook P/R/F1, 7 limitations, reproduction instructions, citationsevaluation/IMPROVEMENT_NOTES.md— observation-only failure-pattern dump (verbatim FP/FN context, no fix proposals)evaluation/raw_results.jsonl— 330 per-prompt records (prompt, response, judge label, hooks fired)evaluation/results_summary.json— machine-readable scoring summaryevaluation/eval.py— reproducible end-to-end (~$12 PAYG-equiv, ~3h sequential)evaluation/score.py+evaluation/observe.py+evaluation/fill_results.py— analysis pipelineHooks are unchanged
No hook
.shfiles were modified. The eval tests what shipped. IMPROVEMENT_NOTES.md is observational only (no regex suggestions, no fix proposals) — to be used as input to a separate future optimization pass with a held-out train/test split.Test plan
python3 score.pyproduces confusion matrix matching JSONL statepython3 observe.pyextracts FP/FN observations without proposing fixespython3 fill_results.pytemplate-fills RESULTS.md with current numbers.shfiles modified (verify withgit diff main..evaluation/darkbench-v1 -- hooks/)claudeCLI authenticatedCitations
🤖 Generated with Claude Code