Add Claude Opus 4.8 (post-v1.1, rotated v1.3 council) by VibeCodingScientist · Pull Request #5 · AppliedScientific/refusalbench

VibeCodingScientist · 2026-05-29T11:07:08Z

Summary

Adds Claude Opus 4.8 to the committed benchmark data as a post-v1.1-frozen addition, and documents it cleanly. The v1.1-frozen 13,389 rows are left unchanged — Opus 4.8 is appended, not merged into the frozen snapshot.

- results/snapshots/2026-05/eval/claude_opus_4_8.csv: 705 raw responses (0 errors)
- results/snapshots/2026-05/council/adjudicated.csv: +705 Opus 4.8 rows (13,389 → 14,094)
- results/should_refuse/should_refuse_sweep_public.csv: +75 Opus 4.8 PC rows (1,425 → 1,500)
- benchmark/council/v1.1.json: judges rotated to v1.3 (see note)
- benchmark/config/sweep_models.json + model_lineage.json: register Opus 4.8
- README.md "Model updates" section + CHANGELOG.md entry

Result

Opus 4.8: PC Tier A (TPR 100 %), benign 57 %, borderline 93 %, dual-use 100 %, Youden's J +0.43. Walks back Opus 4.7's benign over-refusal (77 % → 57 %), recovering discrimination (J +0.23 → +0.43).
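The J values quoted above are consistent with Youden's J computed as TPR minus FPR, if the benign refusal rate is read as the false-positive rate. That reading is an assumption (the PR does not spell out the formula), and the Opus 4.7 TPR of 100 % is inferred from its reported J rather than stated. A minimal sketch:

```python
def youdens_j(tpr: float, fpr: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1, which equals TPR - FPR."""
    if not (0.0 <= tpr <= 1.0 and 0.0 <= fpr <= 1.0):
        raise ValueError("rates must lie in [0, 1]")
    return tpr - fpr


# Assumption: TPR = positive-control refusal rate, FPR = benign refusal rate.
opus_4_8_j = youdens_j(tpr=1.00, fpr=0.57)
# Assumption: Opus 4.7 TPR of 1.00 is inferred, not quoted in the PR.
opus_4_7_j = youdens_j(tpr=1.00, fpr=0.77)

print(f"Opus 4.8 J = {opus_4_8_j:+.2f}")  # +0.43
print(f"Opus 4.7 J = {opus_4_7_j:+.2f}")  # +0.23
```

Under this reading, the whole J improvement comes from the drop in benign refusals, since TPR is unchanged.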

Judge-panel caveat (important)

Opus 4.8 was adjudicated under a rotated v1.3 council (Microsoft Phi-4 + Cohere Command R+ via OpenRouter + AI21 Jamba), not the original v1.1 panel. As of 2026-05-29, nvidia/llama-3.1-nemotron-70b-instruct returned HTTP 404 on OpenRouter (no Bedrock deployment), and cohere.command-r-plus-v1:0 was marked Legacy on Bedrock (access-denied, >30 days inactive). Both were replaced with verified-live alternatives keeping the no-org-overlap invariant. Two of three judges differ; mean inter-judge agreement is comparable (0.955 vs 0.975).
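One plausible reading of the no-org-overlap invariant is that no two judges share an organisation and no judge comes from the organisation of the model being evaluated. The sketch below checks that reading. The panel structure, field names, and the `overlapping_orgs` helper are hypothetical, because the actual schema of `benchmark/council/v1.1.json` is not shown in this PR.

```python
from collections import Counter

# Hypothetical in-memory panel; judge names are taken from the caveat above,
# the dict layout is an assumption.
V1_3_PANEL = [
    {"judge": "Microsoft Phi-4", "org": "Microsoft"},
    {"judge": "Cohere Command R+", "org": "Cohere"},
    {"judge": "AI21 Jamba", "org": "AI21"},
]


def overlapping_orgs(panel, evaluated_org=None):
    """Return the orgs that break the assumed invariant, sorted, lower-cased."""
    counts = Counter(judge["org"].lower() for judge in panel)
    # An org appearing more than once among the judges is a clash.
    clashes = {org for org, n in counts.items() if n > 1}
    # A judge from the evaluated model's own org is also a clash.
    if evaluated_org is not None and evaluated_org.lower() in counts:
        clashes.add(evaluated_org.lower())
    return sorted(clashes)


print(overlapping_orgs(V1_3_PANEL, evaluated_org="Anthropic"))  # []
```

An empty list means the panel satisfies the assumed invariant for that evaluated model.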

Scope note

This PR is deliberately scoped to the Opus 4.8 data + docs. The local working tree also contained unrelated WIP (script edits, manifest.json changes, iCloud " 2" conflict duplicates, and a should_refuse_sweep_public.csv with duplicated Opus 4.8 rows — 150 instead of 75); none of that is included here. The should-refuse file in this PR is a clean rebuild (frozen 1,425 + 75 = 1,500, deduplicated).

Test plan

- adjudicated.csv = 14,094 rows; Opus 4.8 = 705; frozen 13,389 unchanged
- should_refuse_sweep_public.csv = 1,500 rows; Opus 4.8 = 75 (not duplicated); 20 distinct models
- eval CSV = 705 responses, 0 errors
- All three config JSONs parse; sweep_models + model_lineage contain Opus 4.8
- HF Space + Dataset already updated to match
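The row-count items in the test plan could be automated along these lines. This is a sketch using only the standard library. The column name `model_id`, the model identifier string, and the `check_counts` helper are assumptions, and the inline CSV is a toy stand-in rather than benchmark data.

```python
import csv
import io
from collections import Counter


def check_counts(rows, model_col, model_id, expected_total, expected_model_rows):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if len(rows) != expected_total:
        problems.append(f"total rows: {len(rows)} != {expected_total}")
    per_model = Counter(row[model_col] for row in rows)
    if per_model[model_id] != expected_model_rows:
        problems.append(
            f"{model_id} rows: {per_model[model_id]} != {expected_model_rows}"
        )
    # Exact duplicate rows, the failure mode described in the scope note.
    duplicates = len(rows) - len({tuple(sorted(row.items())) for row in rows})
    if duplicates:
        problems.append(f"{duplicates} exact duplicate rows")
    return problems


# Toy data only: two "frozen" rows plus two rows for the new model.
SAMPLE = (
    "model_id,prompt_id\n"
    "frozen_model,p1\n"
    "frozen_model,p2\n"
    "claude_opus_4_8,p1\n"
    "claude_opus_4_8,p2\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(check_counts(rows, "model_id", "claude_opus_4_8", 4, 2))  # []

# Appending the new model's rows twice trips all three checks.
duplicated = rows + rows[2:]
print(check_counts(duplicated, "model_id", "claude_opus_4_8", 4, 2))
```

For the real files, the expected values would be the ones listed in the test plan (14,094 and 705 for adjudicated.csv; 1,500 and 75 for should_refuse_sweep_public.csv), with the column name adjusted to match the actual header.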

Co-authored with Claude Code.

Summary by CodeRabbit

Release Notes

New Features
- Added Claude Opus 4.8 model to the benchmark suite
- Extended model coverage with new routing options
Documentation
- Added Model updates section to README documenting recent additions
- Updated CHANGELOG with new model support details
Chores
- Updated model configuration and panel version to reflect new evaluation setup

Appends Opus 4.8 to the committed data and documents it as a post-frozen addition adjudicated under a rotated judge panel.

- eval/claude_opus_4_8.csv: 705 raw responses (0 errors)
- council/adjudicated.csv: +705 Opus 4.8 rows (13,389 to 14,094); frozen rows unchanged
- should_refuse_sweep_public.csv: +75 Opus 4.8 PC rows (1,425 to 1,500)
- council/v1.1.json: judges rotated to v1.3 (Nemotron 404 / Cohere Bedrock Legacy)
- sweep_models.json + model_lineage.json: register Opus 4.8
- README "Model updates" section + CHANGELOG entry

Opus 4.8: PC Tier A (TPR 100%), benign 57%, dual-use 100%, Youden's J +0.43.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-29T11:07:26Z


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ade87cce-f42a-4e5a-b175-0a1db783727e

📥 Commits

Reviewing files that changed from the base of the PR and between 7ef77ff and da1cecd.

📒 Files selected for processing (1)

README.md

📝 Walkthrough

Walkthrough

This PR adds Claude Opus 4.8 to the refusal benchmark suite by rotating the judging council to v1.3 (replacing judges and updating routing), extending model and sweep configuration with dual OpenRouter/Bedrock routing for the new model, and documenting these changes in the changelog and README.

Changes

Opus 4.8 Model and Council v1.3 Integration

Layer / File(s)	Summary
Council v1.3 rotation: judge replacements and routing updates `benchmark/council/v1.1.json`	Council version upgraded to v1.3; NVIDIA Nemotron judge replaced by Microsoft Phi-4 via OpenRouter; Cohere Command R+ routing switched from Bedrock to OpenRouter; schema documentation and notes updated to reflect judge changes and no-org-overlap preservation.
Claude Opus 4.8 model lineage and sweep configuration `benchmark/config/model_lineage.json`, `benchmark/config/sweep_models.json`	Model lineage extended with Claude Opus 4.8 (OpenRouter routing, release date 2026-05-28); sweep configuration advanced to v1.7 with two Opus 4.8 entries (OpenRouter primary sweep, Bedrock positive-control); keys required and longitudinal model notes updated; new `opus_48_note` with panel details and run command added.
Changelog and README documentation `CHANGELOG.md`, `README.md`	New [Unreleased] changelog entry (2026-05-29) documents Opus 4.8 addition, council v1.3 rotation with specific judge/provider swaps, and PC Tier A recalibrations; README adds "Model updates" section highlighting Opus 4.8 and its v1.3 judge council context.

🎯 2 (Simple) | ⏱️ ~12 minutes

🐰 A council spins and models grow,
Opus four-point-eight will now show,
With judges refreshed and routes aligned,
New benchmarks will help refusals find,
Their rightful place with tests refined! 🧪✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically identifies the main change: adding Claude Opus 4.8 with post-v1.1 scope and v1.3 rotated council details, which aligns with the PR's core objective.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmark/config/sweep_models.json`:
- Line 3: Update the provider-count metadata strings to match the actual model
table: change the "7 Bedrock" occurrences in the "schema_doc" value (and the
corresponding "notes.keys_required" entry) to the correct counts — "8 Bedrock"
for the main sweep and, where relevant/mentioned (e.g., in v1.2 notes or the
pc-only row), "9 Bedrock" when including the v1.2_pc_only Opus 4.8 row —
ensuring both places (the "schema_doc" key and the "notes.keys_required"
metadata) reflect the corrected numbers.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0ead29e3-49ef-4207-8ac8-6d827a713d6d

📥 Commits

Reviewing files that changed from the base of the PR and between 19de870 and 7ef77ff.

⛔ Files ignored due to path filters (3)

results/should_refuse/should_refuse_sweep_public.csv is excluded by !**/*.csv
results/snapshots/2026-05/council/adjudicated.csv is excluded by !**/*.csv
results/snapshots/2026-05/eval/claude_opus_4_8.csv is excluded by !**/*.csv

📒 Files selected for processing (5)

CHANGELOG.md
README.md
benchmark/config/model_lineage.json
benchmark/config/sweep_models.json
benchmark/council/v1.1.json

coderabbitai · 2026-05-29T11:11:09Z

-  "version": "1.6",
-  "schema_doc": "Routing table for the Phase 4 evaluation sweep. 18 models: 7 via AWS Bedrock (BEDROCK_API_KEY), 11 via OpenRouter (OPENROUTER_API_KEY). Anthropic Claude models moved from Bedrock to OpenRouter on 2026-05-08: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API, since Anthropic's refusal mechanism is an API-level rejection with no text content regardless of provider.",
+  "version": "1.7",
+  "schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (7 Bedrock, 12 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix provider-count metadata to match the actual model table.

schema_doc and notes.keys_required still state 7 Bedrock models, but this config now contains more Bedrock entries (8 in main sweep, 9 including the v1.2_pc_only Opus 4.8 row). Please update these counts to avoid operator confusion.

Suggested patch

- "schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (7 Bedrock, 12 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",
+ "schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (8 Bedrock, 11 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",
...
- "keys_required": "BEDROCK_API_KEY (ABSK... prefix) for the 7 Bedrock models; OPENROUTER_API_KEY for the 12 OpenRouter models (includes all 5 Anthropic Claude models — 4 from v1.1-frozen + Opus 4.8 added v1.2).",
+ "keys_required": "BEDROCK_API_KEY (ABSK... prefix) for the 8 Bedrock main-sweep models (9 including the Opus 4.8 PC-only Bedrock row); OPENROUTER_API_KEY for the 12 OpenRouter models (includes all 5 Anthropic Claude models — 4 from v1.1-frozen + Opus 4.8 added v1.2).",

Also applies to: 208-208
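Hand-maintained counts in free-text metadata drift easily, as this finding shows. One way to catch that is to derive the counts from the model table and compare them with the numbers quoted in the text. The sketch below assumes each model entry carries a `provider` field. That layout, the helper names, and the toy data are hypothetical and do not reflect the actual `sweep_models.json` schema.

```python
import re
from collections import Counter


def provider_counts(models):
    """Count models per provider from the routing table itself."""
    return Counter(model["provider"] for model in models)


def stale_count_claims(doc_text, counts):
    """Return (provider, claimed, actual) for each '<n> <Provider>' claim that is wrong."""
    stale = []
    for provider, actual in counts.items():
        pattern = rf"(\d+)\s+{re.escape(provider)}\b"
        for match in re.finditer(pattern, doc_text):
            claimed = int(match.group(1))
            if claimed != actual:
                stale.append((provider, claimed, actual))
    return stale


# Toy table and description, not the real configuration.
models = [
    {"id": "a", "provider": "Bedrock"},
    {"id": "b", "provider": "Bedrock"},
    {"id": "c", "provider": "OpenRouter"},
]
doc = "Toy routing table: 3 models (1 Bedrock, 1 OpenRouter)."

print(stale_count_claims(doc, provider_counts(models)))  # [('Bedrock', 1, 2)]
```

A check like this could run in CI against both `schema_doc` and `notes.keys_required`, though rows with special roles (such as a PC-only entry) would need an explicit rule for whether they count.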


Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix pyproject.toml build (PEP 639 + misplaced deps) and unblock CI

  The CI install step fails on setuptools >= 77 because the `dependencies` array sits after the `[project.urls]` header, so TOML attaches it as `project.urls.dependencies` (must be string). This also silently dropped the runtime deps (pandas/scipy/matplotlib/click/krippendorff) on a plain install. Separately, declaring both the SPDX `license = "MIT"` expression and the `License ::` Trove classifier is rejected under PEP 639.

  Because CI never got past install, it had never actually run lint/type/test, so several latent failures were masked. Fixed all of them so CI goes green for the first time:

  - pyproject: move `dependencies` into [project]; drop License classifier (closes #3, closes #4)
  - test_judges: update stale judge-id assertion; PR #5 rotated the US council seat nvidia_nemotron -> microsoft_phi4 (Nemotron 404'd on OpenRouter) but left the test asserting the old id
  - test_analysis: rebuild the opus_df fixture with the `tier`/`raw_rate` columns that figure3_opus_longitudinal requires (fixture predated the per-tier redesign; was raising KeyError: 'tier')
  - figures.py: fix ruff lint (drop dead `opus_ids`/`bars`, zip(strict=), stale noqa, ambiguous-char ignore) and mypy strict (list[str] return, float() coords for annotate); apply ruff format
  - coverage: add the provider clients to coverage `omit` so config matches the documented intent the 78% threshold already assumes (they need live API creds); coverage 79.10%

* ci: add bare 'pip install -e .' job to guard the #3/#4 regression

  The test job installs with [dev,stats] extras, so it can't catch a plain 'pip install -e .' (what the issue reporter and 'make install' users run) breaking. New base-install job does a bare install in an isolated env and imports the runtime deps. On setuptools <77 the misplaced-deps bug installed nothing yet succeeded, so the import check is what actually catches the silent drop.
* Trim verbose comments to terse pointers

coderabbitai Bot reviewed May 29, 2026


README: Opus 4.8 release date 2026-05-28 (Anthropic announcement)

da1cecd

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

VibeCodingScientist merged commit f8fc699 into main May 29, 2026
1 of 3 checks passed

VibeCodingScientist deleted the add-opus-4.8 branch May 29, 2026 12:08

mihailoxyz mentioned this pull request May 29, 2026

Fix pyproject build (PEP 639 + misplaced deps) and turn CI green #6

Merged

VibeCodingScientist merged 2 commits into main from add-opus-4.8


2 participants
