Remove legacy Claude CLI leaderboard alias #58
📝 Walkthrough
This PR migrates the benchmark from the legacy […]
Changes: Legacy Claude CLI Provider Removal and Leaderboard Update
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request updates the benchmark documentation and website to reflect a revised count of 1,584 runs and the removal of the 'Full-history reasoning' taxonomy label. Key code changes include updating Claude model aliases to consistently use the 'anthropic:' prefix and implementing logic to exclude legacy CLI runs from the public leaderboard. Additionally, new tests were added to verify the rejection of legacy provider strings and the correct filtering of leaderboard results. I have no feedback to provide as there were no review comments to evaluate.
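The two behaviours described above (rejecting legacy provider strings and keeping legacy CLI runs off the public leaderboard) can be illustrated with a minimal sketch. It is not the code from this PR: the `Run` record, `validate_model`, and `public_runs` are hypothetical names, and treating `claude:` as the legacy prefix is an assumption based on the PR summary.

```python
from dataclasses import dataclass

# Assumption: "claude:" is the legacy CLI prefix and "anthropic:" the supported one.
LEGACY_PREFIX = "claude:"
SUPPORTED_PREFIX = "anthropic:"


@dataclass(frozen=True)
class Run:
    """Hypothetical stand-in for one benchmark run record."""

    model: str
    success: bool


def validate_model(model: str) -> str:
    """Reject legacy provider strings instead of silently aliasing them."""
    if model.startswith(LEGACY_PREFIX):
        raise ValueError(f"legacy provider string is not supported: {model!r}")
    return model


def public_runs(runs: list[Run]) -> list[Run]:
    """Drop legacy CLI runs so they are not counted on the public leaderboard."""
    return [run for run in runs if not run.model.startswith(LEGACY_PREFIX)]


if __name__ == "__main__":
    # Illustrative model identifiers only.
    sample = [
        Run("anthropic:claude-haiku-4.5", True),
        Run("claude:legacy-cli", False),
    ]
    print([run.model for run in public_runs(sample)])
```

Filtering at aggregation time, rather than rewriting stored run records, keeps the historical data intact while ensuring the legacy runs are not merged into another model's bucket.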
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/DATASHEET.md`:
- Around line 89-91: The datasheet contains inconsistent taxonomy wording: the
sentence beginning "Taxonomy labels currently represented in the public
comparison slice: Minimal prompt, Short-context reasoning, Compact memory /
memo, Prompt hints, Tools + compact memory, Tools + hints + compact memory, and
Planner loop." conflicts with another section that still lists "full-history
reasoning." Make the two descriptions identical by either adding "full-history
reasoning" to the public comparison slice list or removing it wherever the other
section includes it; update every occurrence of the exact phrase "Taxonomy
labels currently represented..." and any section that mentions "full-history
reasoning" so both sections use the same taxonomy wording.
- Around line 83-84: The datasheet header's "Last updated:" date is inconsistent
with the body "as of 2026-05-13"; update the header value for the "Last
updated:" field in DATASHEET.md to 2026-05-13 so the header and the reported "as
of" timestamp in the content match, ensuring the string for the header exactly
reflects the new date.
In `@docs/SPEC.md`:
- Line 32: The SPEC.md still lists an 8-label taxonomy entry "Full-history
reasoning" which conflicts with the new public 7-label taxonomy; edit the
taxonomy table in SPEC.md to remove the "Full-history reasoning" row so the
table lists only the 7 public labels used elsewhere in the PR, and verify any
surrounding text or counts that reference "8-label" are updated to reflect the
7-label taxonomy and the updated run count.
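The three findings above are all consistency checks across documentation files, so they could be verified mechanically before merging. The following is a rough sketch only: `find_label_mentions` and `dates_agree` are hypothetical helpers, and the regular expressions assume the date strings look like `Last updated: YYYY-MM-DD` and `as of YYYY-MM-DD`, which may not match the real files exactly.

```python
import re

# Assumed formats; the optional \W* tolerates markdown emphasis such as "**".
HEADER_DATE = re.compile(r"Last updated:\W*(\d{4}-\d{2}-\d{2})")
BODY_DATE = re.compile(r"as of\s+(\d{4}-\d{2}-\d{2})")


def find_label_mentions(docs: dict[str, str], label: str) -> dict[str, list[int]]:
    """Map each document name to the 1-based lines that mention `label`."""
    needle = label.lower()
    hits: dict[str, list[int]] = {}
    for name, text in docs.items():
        lines = [
            number
            for number, line in enumerate(text.splitlines(), start=1)
            if needle in line.lower()
        ]
        if lines:
            hits[name] = lines
    return hits


def dates_agree(text: str) -> bool:
    """True only if both dates are present and identical."""
    header = HEADER_DATE.search(text)
    body = BODY_DATE.search(text)
    return bool(header and body and header.group(1) == body.group(1))


if __name__ == "__main__":
    # Paths taken from the review comments; run from the repository root.
    docs = {}
    for path in ("docs/DATASHEET.md", "docs/SPEC.md"):
        with open(path, encoding="utf-8") as handle:
            docs[path] = handle.read()
    print(find_label_mentions(docs, "full-history reasoning"))
    print(dates_agree(docs["docs/DATASHEET.md"]))
```

An empty result from `find_label_mentions` would indicate the removed label no longer appears in either file; any remaining hits point at the lines still to be fixed.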
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6373b392-d085-4ce0-b1d7-163b8460a9b8
📒 Files selected for processing (8)
- docs/DATASHEET.md
- docs/SPEC.md
- llm_quest_benchmark/core/leaderboard.py
- llm_quest_benchmark/llm/client.py
- llm_quest_benchmark/tests/test_leaderboard.py
- llm_quest_benchmark/tests/test_llm_client.py
- site/about.html
- site/leaderboard.json
Summary
- `claude:` benchmark runs into the public Claude Haiku bucket
- `claude-haiku-4.5`
Verification
- `uv run pytest`
- `uv run ruff check .`
- `uv run ruff format --check .`
- `./scripts/update_leaderboard.sh`
Summary by CodeRabbit
Documentation
Bug Fixes
Tests