Add Claude Opus 4.8 (post-v1.1, rotated v1.3 council)#5

Merged
VibeCodingScientist merged 2 commits into main from add-opus-4.8
May 29, 2026

Conversation

Collaborator

@VibeCodingScientist VibeCodingScientist commented May 29, 2026

Summary

Adds Claude Opus 4.8 to the committed benchmark data as a post-v1.1-frozen addition, and documents it cleanly. The v1.1-frozen 13,389 rows are left unchanged — Opus 4.8 is appended, not merged into the frozen snapshot.

  • results/snapshots/2026-05/eval/claude_opus_4_8.csv — 705 raw responses (0 errors)
  • results/snapshots/2026-05/council/adjudicated.csv — +705 Opus 4.8 rows (13,389 → 14,094)
  • results/should_refuse/should_refuse_sweep_public.csv — +75 Opus 4.8 PC rows (1,425 → 1,500)
  • benchmark/council/v1.1.json — judges rotated to v1.3 (see note)
  • benchmark/config/sweep_models.json + model_lineage.json — register Opus 4.8
  • README.md "Model updates" section + CHANGELOG.md entry

Result

Opus 4.8: PC Tier A (TPR 100%); refusal rates: benign 57%, borderline 93%, dual-use 100%; Youden's J +0.43. This walks back Opus 4.7's benign over-refusal (77% → 57%) and recovers discrimination (J +0.23 → +0.43).
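
For readers checking the arithmetic, the J values above are consistent with the usual definition of Youden's J (TPR minus FPR) if the benign refusal rate is taken as the false-positive rate. That reading is an assumption, as is a 100% TPR for Opus 4.7, which is not stated here. A minimal sketch:

```python
def youden_j(tpr: float, fpr: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1, which equals TPR - FPR."""
    return tpr - fpr


# Assumption: the benign refusal rate plays the role of the false-positive rate.
opus_4_8 = youden_j(tpr=1.00, fpr=0.57)
# Assumption: Opus 4.7 also had a TPR of 100% (not stated in this PR).
opus_4_7 = youden_j(tpr=1.00, fpr=0.77)

print(f"Opus 4.8 J = {opus_4_8:+.2f}")
print(f"Opus 4.7 J = {opus_4_7:+.2f}")
```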

Judge-panel caveat (important)

Opus 4.8 was adjudicated under a rotated v1.3 council (Microsoft Phi-4 + Cohere Command R+ via OpenRouter + AI21 Jamba), not the original v1.1 panel. As of 2026-05-29, nvidia/llama-3.1-nemotron-70b-instruct returned HTTP 404 on OpenRouter (no Bedrock deployment), and cohere.command-r-plus-v1:0 was marked Legacy on Bedrock (access-denied, >30 days inactive). Both were replaced with verified-live alternatives keeping the no-org-overlap invariant. Two of three judges differ; mean inter-judge agreement is comparable (0.955 vs 0.975).
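
One possible reading of the no-org-overlap invariant is that no two judges on the panel come from the same organization (it may also cover overlap with the evaluated model's vendor, which this sketch does not check). The helper name and two of the three judge ids below are hypothetical; only the organizations come from the panel description.

```python
from collections import Counter


def org_overlaps(judges: dict[str, str]) -> list[str]:
    """Return the organizations that appear more than once among the judges."""
    counts = Counter(judges.values())
    return sorted(org for org, n in counts.items() if n > 1)


# Judge ids are illustrative labels, not necessarily the ids used in the council JSON.
v1_3_panel = {
    "microsoft_phi4": "Microsoft",
    "cohere_command_r_plus": "Cohere",
    "ai21_jamba": "AI21",
}

print(org_overlaps(v1_3_panel))
```

An empty list means every judge seat is held by a different organization.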

Scope note

This PR is deliberately scoped to the Opus 4.8 data + docs. The local working tree also contained unrelated WIP (script edits, manifest.json changes, iCloud " 2" conflict duplicates, and a should_refuse_sweep_public.csv with duplicated Opus 4.8 rows — 150 instead of 75); none of that is included here. The should-refuse file in this PR is a clean rebuild (frozen 1,425 + 75 = 1,500, deduplicated).
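
The clean rebuild can be pictured as appending new rows to the frozen set while skipping any key already seen. This is a sketch on synthetic data, not the script used for this PR; the key columns (`model`, `prompt_id`) and the model label are hypothetical.

```python
def rebuild(frozen_rows, new_rows, key):
    """Append new rows to the frozen rows, skipping any row whose key was already seen."""
    seen = {key(row) for row in frozen_rows}
    out = list(frozen_rows)
    for row in new_rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out


def row_key(row):
    # Hypothetical identity of a row: one response per (model, prompt).
    return (row["model"], row["prompt_id"])


# Synthetic stand-ins: 1,425 frozen rows and 150 new rows (75 unique, each duplicated).
frozen = [{"model": f"m{i % 19}", "prompt_id": i // 19} for i in range(1425)]
new = [{"model": "claude_opus_4_8", "prompt_id": p} for p in range(75)] * 2

rebuilt = rebuild(frozen, new, row_key)
print(len(frozen), len(new), len(rebuilt))
```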

Test plan

  • adjudicated.csv = 14,094 rows; Opus 4.8 = 705; frozen 13,389 unchanged
  • should_refuse_sweep_public.csv = 1,500 rows; Opus 4.8 = 75 (not duplicated); 20 distinct models
  • eval CSV = 705 responses, 0 errors
  • All three config JSONs parse; sweep_models + model_lineage contain Opus 4.8
  • HF Space + Dataset already updated to match
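
The row-count items in this test plan could be checked with a short script along these lines. It is a sketch only: the `model` column name and the function names are assumptions, not taken from the repository.

```python
import csv
from collections import Counter


def rows_per_model(csv_path, model_column="model"):
    """Count data rows per model in a CSV file (the header row is not counted)."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        return Counter(row[model_column] for row in csv.DictReader(fh))


def check_counts(csv_path, expected_total, model, expected_model_rows, model_column="model"):
    """Raise ValueError if the total or per-model row count differs from expectations."""
    counts = rows_per_model(csv_path, model_column)
    total = sum(counts.values())
    if total != expected_total:
        raise ValueError(f"{csv_path}: expected {expected_total} rows, found {total}")
    if counts[model] != expected_model_rows:
        raise ValueError(
            f"{csv_path}: expected {expected_model_rows} rows for {model}, found {counts[model]}"
        )
    return counts
```

For example, `check_counts("results/snapshots/2026-05/council/adjudicated.csv", 14094, "claude_opus_4_8", 705)` would cover the first item, assuming the model is labeled `claude_opus_4_8` in that column. `len(counts)` then gives the number of distinct models.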

Co-authored with Claude Code.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Claude Opus 4.8 model to the benchmark suite
    • Extended model coverage with new routing options
  • Documentation

    • Added Model updates section to README documenting recent additions
    • Updated CHANGELOG with new model support details
  • Chores

    • Updated model configuration and panel version to reflect new evaluation setup


Appends Opus 4.8 to the committed data and documents it as a post-frozen
addition adjudicated under a rotated judge panel.

- eval/claude_opus_4_8.csv: 705 raw responses (0 errors)
- council/adjudicated.csv: +705 Opus 4.8 rows (13,389 to 14,094); frozen rows unchanged
- should_refuse_sweep_public.csv: +75 Opus 4.8 PC rows (1,425 to 1,500)
- council/v1.1.json: judges rotated to v1.3 (Nemotron 404 / Cohere Bedrock Legacy)
- sweep_models.json + model_lineage.json: register Opus 4.8
- README "Model updates" section + CHANGELOG entry

Opus 4.8: PC Tier A (TPR 100%), benign 57%, dual-use 100%, Youden's J +0.43.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented May 29, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ade87cce-f42a-4e5a-b175-0a1db783727e

📥 Commits

Reviewing files that changed from the base of the PR and between 7ef77ff and da1cecd.

📒 Files selected for processing (1)
  • README.md
Walkthrough

This PR adds Claude Opus 4.8 to the refusal benchmark suite by rotating the judging council to v1.3 (replacing judges and updating routing), extending model and sweep configuration with dual OpenRouter/Bedrock routing for the new model, and documenting these changes in the changelog and README.

Changes

Opus 4.8 Model and Council v1.3 Integration

| Layer / File(s) | Summary |
| --- | --- |
| Council v1.3 rotation: judge replacements and routing updates (benchmark/council/v1.1.json) | Council version upgraded to v1.3; NVIDIA Nemotron judge replaced by Microsoft Phi-4 via OpenRouter; Cohere Command R+ routing switched from Bedrock to OpenRouter; schema documentation and notes updated to reflect judge changes and no-org-overlap preservation. |
| Claude Opus 4.8 model lineage and sweep configuration (benchmark/config/model_lineage.json, benchmark/config/sweep_models.json) | Model lineage extended with Claude Opus 4.8 (OpenRouter routing, release date 2026-05-28); sweep configuration advanced to v1.7 with two Opus 4.8 entries (OpenRouter primary sweep, Bedrock positive-control); keys required and longitudinal model notes updated; new opus_48_note with panel details and run command added. |
| Changelog and README documentation (CHANGELOG.md, README.md) | New [Unreleased] changelog entry (2026-05-29) documents Opus 4.8 addition, council v1.3 rotation with specific judge/provider swaps, and PC Tier A recalibrations; README adds "Model updates" section highlighting Opus 4.8 and its v1.3 judge council context. |

🎯 2 (Simple) | ⏱️ ~12 minutes

🐰 A council spins and models grow,
Opus four-point-eight will now show,
With judges refreshed and routes aligned,
New benchmarks will help refusals find,
Their rightful place with tests refined! 🧪✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: adding Claude Opus 4.8 with post-v1.1 scope and v1.3 rotated council details, which aligns with the PR's core objective. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmark/config/sweep_models.json`:
- Line 3: Update the provider-count metadata strings to match the actual model
table: change the "7 Bedrock" occurrences in the "schema_doc" value (and the
corresponding "notes.keys_required" entry) to the correct counts — "8 Bedrock"
for the main sweep and, where relevant/mentioned (e.g., in v1.2 notes or the
pc-only row), "9 Bedrock" when including the v1.2_pc_only Opus 4.8 row —
ensuring both places (the "schema_doc" key and the "notes.keys_required"
metadata) reflect the corrected numbers.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0ead29e3-49ef-4207-8ac8-6d827a713d6d

📥 Commits

Reviewing files that changed from the base of the PR and between 19de870 and 7ef77ff.

⛔ Files ignored due to path filters (3)
  • results/should_refuse/should_refuse_sweep_public.csv is excluded by !**/*.csv
  • results/snapshots/2026-05/council/adjudicated.csv is excluded by !**/*.csv
  • results/snapshots/2026-05/eval/claude_opus_4_8.csv is excluded by !**/*.csv
📒 Files selected for processing (5)
  • CHANGELOG.md
  • README.md
  • benchmark/config/model_lineage.json
  • benchmark/config/sweep_models.json
  • benchmark/council/v1.1.json

"version": "1.6",
"schema_doc": "Routing table for the Phase 4 evaluation sweep. 18 models: 7 via AWS Bedrock (BEDROCK_API_KEY), 11 via OpenRouter (OPENROUTER_API_KEY). Anthropic Claude models moved from Bedrock to OpenRouter on 2026-05-08: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API, since Anthropic's refusal mechanism is an API-level rejection with no text content regardless of provider.",
"version": "1.7",
"schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (7 Bedrock, 12 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",
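
The schema_doc text in the diff above says that refusals surface as native_finish_reason='refusal' with empty content. A minimal sketch of detecting that case follows. The response shape (an OpenAI-style chat completion dict with `choices[0].message.content`) is an assumption, and `is_native_refusal` is a hypothetical helper, not a function from this repository.

```python
def is_native_refusal(response: dict) -> bool:
    """Return True when the first choice reports a native refusal with no text content."""
    choices = response.get("choices") or []
    if not choices:
        return False
    choice = choices[0]
    content = (choice.get("message") or {}).get("content") or ""
    return choice.get("native_finish_reason") == "refusal" and content.strip() == ""


# Illustrative payloads only, not captured API output.
refused = {"choices": [{"native_finish_reason": "refusal", "message": {"content": ""}}]}
answered = {"choices": [{"native_finish_reason": "stop", "message": {"content": "Sure."}}]}

print(is_native_refusal(refused), is_native_refusal(answered))
```

Treating a refusal as "finish reason plus empty content" rather than matching on refusal wording keeps the check independent of any particular refusal phrasing.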

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix provider-count metadata to match the actual model table.

schema_doc and notes.keys_required still state 7 Bedrock models, but this config now contains more Bedrock entries (8 in main sweep, 9 including the v1.2_pc_only Opus 4.8 row). Please update these counts to avoid operator confusion.
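
Drift of this kind between a metadata string and the model table could be caught mechanically. The sketch below assumes a hypothetical layout (a top-level `models` list whose entries carry a `provider` field); the real sweep_models.json may be structured differently.

```python
import re
from collections import Counter


def provider_counts(models):
    """Count routing entries per provider."""
    return Counter(entry["provider"] for entry in models)


def stated_count(text, label):
    """Extract the first '<N> <label>' count mentioned in a metadata string, if any."""
    match = re.search(rf"(\d+)\s+{re.escape(label)}", text)
    return int(match.group(1)) if match else None


def metadata_mismatches(config):
    """List providers whose count in schema_doc disagrees with the model table."""
    actual = provider_counts(config["models"])
    problems = []
    for provider, label in (("bedrock", "Bedrock"), ("openrouter", "OpenRouter")):
        stated = stated_count(config["schema_doc"], label)
        if stated is not None and stated != actual[provider]:
            problems.append(f"{label}: schema_doc says {stated}, table has {actual[provider]}")
    return problems


# Toy config with a deliberately stale schema_doc.
toy = {
    "schema_doc": "3 models (1 Bedrock, 2 OpenRouter).",
    "models": [
        {"id": "a", "provider": "bedrock"},
        {"id": "b", "provider": "bedrock"},
        {"id": "c", "provider": "openrouter"},
    ],
}

print(metadata_mismatches(toy))
```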

Suggested patch
-  "schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (7 Bedrock, 12 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",
+  "schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (8 Bedrock, 11 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",
...
-    "keys_required": "BEDROCK_API_KEY (ABSK... prefix) for the 7 Bedrock models; OPENROUTER_API_KEY for the 12 OpenRouter models (includes all 5 Anthropic Claude models — 4 from v1.1-frozen + Opus 4.8 added v1.2).",
+    "keys_required": "BEDROCK_API_KEY (ABSK... prefix) for the 8 Bedrock main-sweep models (9 including the Opus 4.8 PC-only Bedrock row); OPENROUTER_API_KEY for the 12 OpenRouter models (includes all 5 Anthropic Claude models — 4 from v1.1-frozen + Opus 4.8 added v1.2).",

Also applies to: 208-208

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@VibeCodingScientist VibeCodingScientist merged commit f8fc699 into main May 29, 2026
1 of 3 checks passed
@VibeCodingScientist VibeCodingScientist deleted the add-opus-4.8 branch May 29, 2026 12:08
mihailoxyz added a commit that referenced this pull request May 29, 2026
* Fix pyproject.toml build (PEP 639 + misplaced deps) and unblock CI

The CI install step fails on setuptools >= 77 because the `dependencies`
array sits after the `[project.urls]` header, so TOML attaches it as
`project.urls.dependencies` (must be string). This also silently dropped
the runtime deps (pandas/scipy/matplotlib/click/krippendorff) on a plain
install. Separately, declaring both the SPDX `license = "MIT"` expression
and the `License ::` Trove classifier is rejected under PEP 639.

Because CI never got past install, it had never actually run lint/type/
test — so several latent failures were masked. Fixed all of them so CI
goes green for the first time:

- pyproject: move `dependencies` into [project]; drop License classifier
  (closes #3, closes #4)
- test_judges: update stale judge-id assertion — PR #5 rotated the US
  council seat nvidia_nemotron -> microsoft_phi4 (Nemotron 404'd on
  OpenRouter) but left the test asserting the old id
- test_analysis: rebuild the opus_df fixture with the `tier`/`raw_rate`
  columns that figure3_opus_longitudinal requires (fixture predated the
  per-tier redesign; was raising KeyError: 'tier')
- figures.py: fix ruff lint (drop dead `opus_ids`/`bars`, zip(strict=),
  stale noqa, ambiguous-char ignore) and mypy strict (list[str] return,
  float() coords for annotate); apply ruff format
- coverage: add the provider clients to coverage `omit` so config matches
  the documented intent the 78% threshold already assumes (they need live
  API creds); coverage 79.10%

* ci: add bare 'pip install -e .' job to guard the #3/#4 regression

The test job installs with [dev,stats] extras, so it can't catch a plain
'pip install -e .' (what the issue reporter and 'make install' users run)
breaking. New base-install job does a bare install in an isolated env and
imports the runtime deps — on setuptools <77 the misplaced-deps bug
installed nothing yet succeeded, so the import check is what actually
catches the silent drop.

* Trim verbose comments to terse pointers