Add Claude Opus 4.8 (post-v1.1, rotated v1.3 council) #5
Appends Opus 4.8 to the committed data and documents it as a post-frozen addition adjudicated under a rotated judge panel.

- eval/claude_opus_4_8.csv: 705 raw responses (0 errors)
- council/adjudicated.csv: +705 Opus 4.8 rows (13,389 to 14,094); frozen rows unchanged
- should_refuse_sweep_public.csv: +75 Opus 4.8 PC rows (1,425 to 1,500)
- council/v1.1.json: judges rotated to v1.3 (Nemotron 404 / Cohere Bedrock Legacy)
- sweep_models.json + model_lineage.json: register Opus 4.8
- README "Model updates" section + CHANGELOG entry

Opus 4.8: PC Tier A (TPR 100%), benign 57%, dual-use 100%, Youden's J +0.43.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 Walkthrough
This PR adds Claude Opus 4.8 to the refusal benchmark suite by rotating the judging council to v1.3 (replacing judges and updating routing), extending model and sweep configuration with dual OpenRouter/Bedrock routing for the new model, and documenting these changes in the changelog and README.
Changes: Opus 4.8 Model and Council v1.3 Integration
🎯 2 (Simple) | ⏱️ ~12 minutes
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmark/config/sweep_models.json`:
- Line 3: Update the provider-count metadata strings to match the actual model
table: change the "7 Bedrock" occurrences in the "schema_doc" value (and the
corresponding "notes.keys_required" entry) to the correct counts — "8 Bedrock"
for the main sweep and, where relevant/mentioned (e.g., in v1.2 notes or the
pc-only row), "9 Bedrock" when including the v1.2_pc_only Opus 4.8 row —
ensuring both places (the "schema_doc" key and the "notes.keys_required"
metadata) reflect the corrected numbers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0ead29e3-49ef-4207-8ac8-6d827a713d6d
⛔ Files ignored due to path filters (3)
- `results/should_refuse/should_refuse_sweep_public.csv` is excluded by `!**/*.csv`
- `results/snapshots/2026-05/council/adjudicated.csv` is excluded by `!**/*.csv`
- `results/snapshots/2026-05/eval/claude_opus_4_8.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (5)
- `CHANGELOG.md`
- `README.md`
- `benchmark/config/model_lineage.json`
- `benchmark/config/sweep_models.json`
- `benchmark/council/v1.1.json`
- "version": "1.6",
- "schema_doc": "Routing table for the Phase 4 evaluation sweep. 18 models: 7 via AWS Bedrock (BEDROCK_API_KEY), 11 via OpenRouter (OPENROUTER_API_KEY). Anthropic Claude models moved from Bedrock to OpenRouter on 2026-05-08: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API, since Anthropic's refusal mechanism is an API-level rejection with no text content regardless of provider.",
+ "version": "1.7",
+ "schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (7 Bedrock, 12 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",
Fix provider-count metadata to match the actual model table.
schema_doc and notes.keys_required still state 7 Bedrock models, but this config now contains more Bedrock entries (8 in main sweep, 9 including the v1.2_pc_only Opus 4.8 row). Please update these counts to avoid operator confusion.
Suggested patch
- "schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (7 Bedrock, 12 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",
+ "schema_doc": "Routing table for the RefusalBench sweep panel. v1.1-frozen: 19 models (8 Bedrock, 11 OpenRouter). v1.2 addition: Claude Opus 4.8 (2026-05-28), extending the Anthropic longitudinal series to 4 points. Anthropic Claude models route via OpenRouter: Bedrock applies a domain-level content filter to all protein engineering prompts for Claude models (including benign human targets), making it unsuitable for measuring model-level refusal calibration. OpenRouter routes directly to Anthropic's API and surfaces refusals as native_finish_reason='refusal' with empty content — functionally identical to the direct Anthropic API.",
...
- "keys_required": "BEDROCK_API_KEY (ABSK... prefix) for the 7 Bedrock models; OPENROUTER_API_KEY for the 12 OpenRouter models (includes all 5 Anthropic Claude models — 4 from v1.1-frozen + Opus 4.8 added v1.2).",
+ "keys_required": "BEDROCK_API_KEY (ABSK... prefix) for the 8 Bedrock main-sweep models (9 including the Opus 4.8 PC-only Bedrock row); OPENROUTER_API_KEY for the 12 OpenRouter models (includes all 5 Anthropic Claude models — 4 from v1.1-frozen + Opus 4.8 added v1.2).",
Also applies to: 208-208
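One way to keep count strings like these from drifting is to derive them from the model table and compare. The sketch below is a minimal illustration under assumptions: the `provider` field name, the list-of-dicts layout, and the tiny example table are hypothetical and do not reflect the real schema of `sweep_models.json`.

```python
from collections import Counter


def provider_counts(models):
    """Count models per provider in a list of {"id": ..., "provider": ...} dicts."""
    return Counter(m["provider"] for m in models)


def check_documented_counts(models, documented):
    """Return {provider: (documented, actual)} for every provider whose counts differ."""
    actual = provider_counts(models)
    providers = set(actual) | set(documented)
    return {
        p: (documented.get(p, 0), actual.get(p, 0))
        for p in sorted(providers)
        if documented.get(p, 0) != actual.get(p, 0)
    }


# Hypothetical routing table, NOT the real contents of sweep_models.json.
example_models = [
    {"id": "model-a", "provider": "bedrock"},
    {"id": "model-b", "provider": "bedrock"},
    {"id": "model-c", "provider": "openrouter"},
]

# The documentation claims 1 Bedrock model, but the table holds 2.
mismatches = check_documented_counts(example_models, {"bedrock": 1, "openrouter": 1})
print(mismatches)  # {'bedrock': (1, 2)}
```

An empty result means the documented counts agree with the table; run in CI, a check like this would flag stale metadata before a reviewer has to.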
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Fix pyproject.toml build (PEP 639 + misplaced deps) and unblock CI

  The CI install step fails on setuptools >= 77 because the `dependencies` array sits after the `[project.urls]` header, so TOML attaches it as `project.urls.dependencies` (must be string). This also silently dropped the runtime deps (pandas/scipy/matplotlib/click/krippendorff) on a plain install. Separately, declaring both the SPDX `license = "MIT"` expression and the `License ::` Trove classifier is rejected under PEP 639.

  Because CI never got past install, it had never actually run lint/type/test, so several latent failures were masked. Fixed all of them so CI goes green for the first time:

  - pyproject: move `dependencies` into [project]; drop License classifier (closes #3, closes #4)
  - test_judges: update stale judge-id assertion. PR #5 rotated the US council seat nvidia_nemotron -> microsoft_phi4 (Nemotron 404'd on OpenRouter) but left the test asserting the old id
  - test_analysis: rebuild the opus_df fixture with the `tier`/`raw_rate` columns that figure3_opus_longitudinal requires (fixture predated the per-tier redesign; was raising KeyError: 'tier')
  - figures.py: fix ruff lint (drop dead `opus_ids`/`bars`, zip(strict=), stale noqa, ambiguous-char ignore) and mypy strict (list[str] return, float() coords for annotate); apply ruff format
  - coverage: add the provider clients to coverage `omit` so config matches the documented intent the 78% threshold already assumes (they need live API creds); coverage 79.10%

* ci: add bare 'pip install -e .' job to guard the #3/#4 regression

  The test job installs with [dev,stats] extras, so it can't catch a plain 'pip install -e .' (what the issue reporter and 'make install' users run) breaking. New base-install job does a bare install in an isolated env and imports the runtime deps. On setuptools <77 the misplaced-deps bug installed nothing yet succeeded, so the import check is what actually catches the silent drop.
* Trim verbose comments to terse pointers
Summary
Adds Claude Opus 4.8 to the committed benchmark data as a post-v1.1-frozen addition, and documents it cleanly. The v1.1-frozen 13,389 rows are left unchanged — Opus 4.8 is appended, not merged into the frozen snapshot.
- `results/snapshots/2026-05/eval/claude_opus_4_8.csv`: 705 raw responses (0 errors)
- `results/snapshots/2026-05/council/adjudicated.csv`: +705 Opus 4.8 rows (13,389 → 14,094)
- `results/should_refuse/should_refuse_sweep_public.csv`: +75 Opus 4.8 PC rows (1,425 → 1,500)
- `benchmark/council/v1.1.json`: judges rotated to v1.3 (see note)
- `benchmark/config/sweep_models.json` + `model_lineage.json`: register Opus 4.8
- `README.md` "Model updates" section + `CHANGELOG.md` entry

Result
Opus 4.8: PC Tier A (TPR 100 %), benign 57 %, borderline 93 %, dual-use 100 %, Youden's J +0.43. Walks back Opus 4.7's benign over-refusal (77 % → 57 %), recovering discrimination (J +0.23 → +0.43).
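
The reported J values can be reproduced from the other figures using the standard definition of Youden's J (sensitivity + specificity − 1, equivalently TPR − FPR). This worked example assumes the "benign" percentage is the refusal rate on benign prompts, i.e. the false-positive rate, and that Opus 4.7's TPR was also 100 %. The summary does not state the latter, though it is consistent with the quoted +0.23.

```python
def youden_j(tpr: float, fpr: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1, i.e. TPR - FPR."""
    return tpr - fpr


# Opus 4.8: TPR 100 %, benign refusal rate 57 % (treated here as the FPR).
j_48 = youden_j(tpr=1.00, fpr=0.57)

# Opus 4.7: benign refusal rate 77 %; a TPR of 100 % is an assumption.
j_47 = youden_j(tpr=1.00, fpr=0.77)

print(f"Opus 4.8 J = {j_48:+.2f}")  # Opus 4.8 J = +0.43
print(f"Opus 4.7 J = {j_47:+.2f}")  # Opus 4.7 J = +0.23
```

With TPR pinned at 100 %, every point of reduced benign over-refusal translates directly into a point of J, which is why the 20-point drop in benign refusals appears as a 0.20 gain in discrimination.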
Judge-panel caveat (important)
Opus 4.8 was adjudicated under a rotated v1.3 council (Microsoft Phi-4 + Cohere Command R+ via OpenRouter + AI21 Jamba), not the original v1.1 panel. As of 2026-05-29, `nvidia/llama-3.1-nemotron-70b-instruct` returned HTTP 404 on OpenRouter (no Bedrock deployment), and `cohere.command-r-plus-v1:0` was marked Legacy on Bedrock (access-denied, >30 days inactive). Both were replaced with verified-live alternatives keeping the no-org-overlap invariant. Two of three judges differ; mean inter-judge agreement is comparable (0.955 vs 0.975).

Scope note
This PR is deliberately scoped to the Opus 4.8 data + docs. The local working tree also contained unrelated WIP (script edits, `manifest.json` changes, iCloud `" 2"` conflict duplicates, and a `should_refuse_sweep_public.csv` with duplicated Opus 4.8 rows, 150 instead of 75); none of that is included here. The should-refuse file in this PR is a clean rebuild (frozen 1,425 + 75 = 1,500, deduplicated).

Test plan
- `adjudicated.csv` = 14,094 rows; Opus 4.8 = 705; frozen 13,389 unchanged
- `should_refuse_sweep_public.csv` = 1,500 rows; Opus 4.8 = 75 (not duplicated); 20 distinct models

Co-authored with Claude Code.
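
The row-count checks in the test plan could be scripted roughly as follows. This is a sketch under assumptions: the `model` column name and the `claude_opus_4_8` identifier are guesses based on the file names, and the demo runs on synthetic rows rather than the real CSVs.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by header name."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def check_rows(rows, new_model, expected_total, expected_new, model_col="model"):
    """Summarize row counts and flag whether they match the expected totals."""
    counts = Counter(r[model_col] for r in rows)
    new_rows = counts[new_model]
    return {
        "total": len(rows),
        "new_model_rows": new_rows,
        "frozen_rows": len(rows) - new_rows,
        "distinct_models": len(counts),
        "ok": len(rows) == expected_total and new_rows == expected_new,
    }


# Synthetic stand-in for adjudicated.csv: 13,389 frozen rows + 705 new rows.
demo = [{"model": "frozen-model"}] * 13_389 + [{"model": "claude_opus_4_8"}] * 705
report = check_rows(demo, "claude_opus_4_8", expected_total=14_094, expected_new=705)
print(report["total"], report["frozen_rows"], report["ok"])  # 14094 13389 True
```

Against the real files, `check_rows(load_rows(path), ...)` would run the same check. A duplicated append, such as the 150-row case mentioned in the scope note, would come back with `ok` set to `False`.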