Remove legacy Claude CLI leaderboard alias #58

Merged
yourconscience merged 2 commits into master from remove-claude-exec
May 13, 2026

Conversation

@yourconscience yourconscience commented May 13, 2026

Owner

Summary

  • stop mapping legacy claude: benchmark runs into the public Claude Haiku bucket
  • keep Anthropic API Claude runs mapped to claude-haiku-4.5
  • refresh checked-in leaderboard/about/docs counts after removing the legacy CLI data

Verification

  • uv run pytest
  • uv run ruff check .
  • uv run ruff format --check .
  • ./scripts/update_leaderboard.sh

Summary by CodeRabbit

  • Documentation

    • Updated published benchmark count from 1,615 to 1,584 runs.
    • Updated taxonomy coverage from 8 to 7 labels.
    • Removed "Full-history reasoning" from benchmark scenarios.
  • Bug Fixes

    • Removed legacy provider support (command injection risk).
  • Tests

    • Updated test expectations for revised taxonomy.
    • Added validation for legacy model exclusion.

coderabbitai Bot commented May 13, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d8208322-dd99-4b2c-b205-533ac92869c0

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR migrates the benchmark from the legacy claude: provider prefix to the Anthropic-qualified anthropic: prefix, adds validation to reject the old prefix, excludes legacy Claude CLI runs from public leaderboard slices, and updates all documentation to reflect revised published run counts (1,615→1,584) and taxonomy labels (8→7).
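
The prefix rejection described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `parse_model_name_sketch` and `SUPPORTED_PROVIDERS` are hypothetical names, and the set of supported providers is an assumption. Only the error message is taken from the test described in this PR.

```python
# Hypothetical sketch; the real parsing lives in llm_quest_benchmark/llm/client.py
# and may be structured differently.
SUPPORTED_PROVIDERS = {"anthropic", "openai"}  # assumed set, for illustration only


def parse_model_name_sketch(model: str) -> tuple[str, str]:
    """Split a 'provider:model' string and reject unsupported providers."""
    provider, sep, model_id = model.partition(":")
    if not sep:
        raise ValueError(f"Expected 'provider:model', got {model!r}")
    if provider not in SUPPORTED_PROVIDERS:
        # The legacy "claude" prefix falls through to the generic rejection.
        raise NotImplementedError(f"Provider {provider} is not supported")
    return provider, model_id
```

Under these assumptions, `anthropic:claude-haiku-4-5-20251001` parses normally, while any `claude:` model string raises `NotImplementedError`.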

Changes

Legacy Claude CLI Provider Removal and Leaderboard Update

| Layer / File(s) | Summary |
| --- | --- |
| Provider alias migration and rejection: llm_quest_benchmark/core/leaderboard.py, llm_quest_benchmark/llm/client.py | MODEL_ALIASES now maps anthropic:claude-haiku-4-5-20251001 to the display name, and get_llm_client removes "claude" from the removed-provider set so the legacy prefix no longer triggers the command-injection-risk rejection path. |
| Provider validation unit tests: llm_quest_benchmark/tests/test_llm_client.py | New test test_parse_model_name_rejects_legacy_claude_cli_provider verifies that the legacy claude: prefix raises NotImplementedError with message "Provider claude is not supported". |
| Leaderboard legacy CLI exclusion and taxonomy update: llm_quest_benchmark/tests/test_leaderboard.py | New test test_generate_leaderboard_excludes_legacy_claude_cli_runs_from_public_slice ensures that public leaderboard generation with public_model_ids=["claude-haiku-4.5"] excludes runs with agent_id="legacy-claude-cli" and includes only "anthropic-api" runs. Existing taxonomy test updated to remove deprecated "Full-history reasoning" label. |
| Published run count and taxonomy documentation updates: docs/DATASHEET.md, docs/SPEC.md, site/about.html | All documentation updated consistently: published run count 1,615 → 1,584, taxonomy labels 8 → 7 (removing "Full-history reasoning"), and related figure captions and content sections revised across datasheet, specification, and landing page. |
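
One way the public-slice exclusion could work is sketched below. The `Run` dataclass, the `public_slice` helper, and the legacy model string are assumptions made for illustration. Only the alias key, the display name `claude-haiku-4.5`, and the two `agent_id` values come from this PR.

```python
from dataclasses import dataclass

# Only the Anthropic-qualified id maps to the public bucket; there is no
# entry for a legacy "claude:" model string.
MODEL_ALIASES = {"anthropic:claude-haiku-4-5-20251001": "claude-haiku-4.5"}


@dataclass(frozen=True)
class Run:
    agent_id: str
    model: str


def public_slice(runs, public_model_ids):
    """Keep runs whose model resolves, via MODEL_ALIASES, to a public id."""
    allowed = set(public_model_ids)
    return [run for run in runs if MODEL_ALIASES.get(run.model) in allowed]


runs = [
    Run("anthropic-api", "anthropic:claude-haiku-4-5-20251001"),
    # Assumed shape of a legacy CLI model string.
    Run("legacy-claude-cli", "claude:claude-haiku-4-5-20251001"),
]
kept = public_slice(runs, ["claude-haiku-4.5"])
print([run.agent_id for run in kept])  # prints ['anthropic-api']
```

Because the legacy string has no alias, it never resolves to a public model id, so it drops out of the slice without a separate deny-list.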

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 The Claude CLI bids farewell, no longer in our config,
Anthropic now leads the way with qualified prefixes so chic,
Run counts and labels realigned, the leaderboard grows sleek,
Legacy ghosts are filtered out—the future's looking strong and antique! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main change: removing the legacy Claude CLI leaderboard alias. This aligns perfectly with the PR's primary objective to stop mapping legacy claude: benchmark runs and the code changes that remove the claude: provider check and update model aliases. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the benchmark documentation and website to reflect a revised count of 1,584 runs and the removal of the 'Full-history reasoning' taxonomy label. Key code changes include updating Claude model aliases to consistently use the 'anthropic:' prefix and implementing logic to exclude legacy CLI runs from the public leaderboard. Additionally, new tests were added to verify the rejection of legacy provider strings and the correct filtering of leaderboard results. I have no feedback to provide as there were no review comments to evaluate.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/DATASHEET.md`:
- Around line 89-91: The datasheet contains inconsistent taxonomy wording: the
sentence beginning "Taxonomy labels currently represented in the public
comparison slice: Minimal prompt, Short-context reasoning, Compact memory /
memo, Prompt hints, Tools + compact memory, Tools + hints + compact memory, and
Planner loop." conflicts with another section that still lists "full-history
reasoning." Make the two descriptions identical by either adding "full-history
reasoning" to the public comparison slice list or removing it wherever the other
section includes it; update every occurrence of the exact phrase "Taxonomy
labels currently represented..." and any section that mentions "full-history
reasoning" so both sections use the same taxonomy wording.
- Around line 83-84: The datasheet header's "Last updated:" date is inconsistent
with the body "as of 2026-05-13"; update the header value for the "Last
updated:" field in DATASHEET.md to 2026-05-13 so the header and the reported "as
of" timestamp in the content match, ensuring the string for the header exactly
reflects the new date.

In `@docs/SPEC.md`:
- Line 32: The SPEC.md still lists an 8-label taxonomy entry "Full-history
reasoning" which conflicts with the new public 7-label taxonomy; edit the
taxonomy table in SPEC.md to remove the "Full-history reasoning" row so the
table lists only the 7 public labels used elsewhere in the PR, and verify any
surrounding text or counts that reference "8-label" are updated to reflect the
7-label taxonomy and the updated run count.
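
The consistency findings above could be checked mechanically with a small scan like the one below. It is a sketch under assumptions: `find_stale_mentions` is a hypothetical helper, not part of the repository, and it only looks for a case-insensitive match of the removed label.

```python
from pathlib import Path

STALE_LABEL = "full-history reasoning"


def find_stale_mentions(paths, needle=STALE_LABEL):
    """Return (path, line_number, line) for each case-insensitive match."""
    hits = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle.lower() in line.lower():
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_stale_mentions(["docs/DATASHEET.md", "docs/SPEC.md"])` from the repository root would list any remaining mentions of the removed label, and an empty result would suggest the two documents agree on the 7-label taxonomy.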
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6373b392-d085-4ce0-b1d7-163b8460a9b8

📥 Commits

Reviewing files that changed from the base of the PR and between 4fb0a9d and 71c9234.

📒 Files selected for processing (8)
  • docs/DATASHEET.md
  • docs/SPEC.md
  • llm_quest_benchmark/core/leaderboard.py
  • llm_quest_benchmark/llm/client.py
  • llm_quest_benchmark/tests/test_leaderboard.py
  • llm_quest_benchmark/tests/test_llm_client.py
  • site/about.html
  • site/leaderboard.json

Comment thread docs/DATASHEET.md
Comment thread docs/DATASHEET.md
Comment thread docs/SPEC.md
@yourconscience yourconscience merged commit 2e88e42 into master May 13, 2026
2 checks passed