Remove legacy Claude CLI leaderboard alias by yourconscience · Pull Request #58 · yourconscience/llm_quest_benchmark

yourconscience · 2026-05-13T19:22:48Z

Summary

- Stop mapping legacy `claude:` benchmark runs into the public Claude Haiku bucket.
- Keep Anthropic API Claude runs mapped to `claude-haiku-4.5`.
- Refresh checked-in leaderboard/about/docs counts after removing the legacy CLI data.

Verification

uv run pytest
uv run ruff check .
uv run ruff format --check .
./scripts/update_leaderboard.sh

Summary by CodeRabbit

Documentation
- Updated published benchmark count from 1,615 to 1,584 runs.
- Updated taxonomy coverage from 8 to 7 labels.
- Removed "Full-history reasoning" from benchmark scenarios.
Bug Fixes
- Removed legacy provider support (command injection risk).
Tests
- Updated test expectations for revised taxonomy.
- Added validation for legacy model exclusion.

coderabbitai · 2026-05-13T19:23:00Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d8208322-dd99-4b2c-b205-533ac92869c0

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR migrates the benchmark from the legacy claude: provider prefix to the Anthropic-qualified anthropic: prefix, adds validation to reject the old prefix, excludes legacy Claude CLI runs from public leaderboard slices, and updates all documentation to reflect revised published run counts (1,615→1,584) and taxonomy labels (8→7).
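The rejection of the old prefix can be pictured with a minimal sketch. This is not the repository's code: the `parse_model_name` signature, the `SUPPORTED_PROVIDERS` set, and the `ValueError` branch are assumptions for illustration. Only the error message format is taken from the test summary reported for this PR.

```python
from typing import Tuple

# Assumed provider set for illustration; the real client may support others.
SUPPORTED_PROVIDERS = {"anthropic", "openai"}


def parse_model_name(spec: str) -> Tuple[str, str]:
    """Split 'provider:model' and reject providers outside the supported set."""
    provider, sep, model = spec.partition(":")
    if not sep or not model:
        # Hypothetical handling of malformed specs; not described in the PR.
        raise ValueError(f"Expected 'provider:model', got {spec!r}")
    if provider not in SUPPORTED_PROVIDERS:
        # Message format follows the one quoted in this PR's test summary.
        raise NotImplementedError(f"Provider {provider} is not supported")
    return provider, model
```

Under these assumptions, `parse_model_name("anthropic:claude-haiku-4-5-20251001")` returns the provider and model parts, while any `claude:`-prefixed spec raises `NotImplementedError`.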

Changes

Legacy Claude CLI Provider Removal and Leaderboard Update

| Layer / File(s) | Summary |
| --- | --- |
| Provider alias migration and rejection: `llm_quest_benchmark/core/leaderboard.py`, `llm_quest_benchmark/llm/client.py` | `MODEL_ALIASES` now maps `anthropic:claude-haiku-4-5-20251001` to the display name, and `get_llm_client` removes `"claude"` from the removed-provider set so the legacy prefix no longer triggers the command-injection-risk rejection path. |
| Provider validation unit tests: `llm_quest_benchmark/tests/test_llm_client.py` | New test `test_parse_model_name_rejects_legacy_claude_cli_provider` verifies that the legacy `claude:` prefix raises `NotImplementedError` with message "Provider claude is not supported". |
| Leaderboard legacy CLI exclusion and taxonomy update: `llm_quest_benchmark/tests/test_leaderboard.py` | New test `test_generate_leaderboard_excludes_legacy_claude_cli_runs_from_public_slice` ensures that public leaderboard generation with `public_model_ids=["claude-haiku-4.5"]` excludes runs with `agent_id="legacy-claude-cli"` and includes only `"anthropic-api"` runs. Existing taxonomy test updated to remove deprecated "Full-history reasoning" label. |
| Published run count and taxonomy documentation updates: `docs/DATASHEET.md`, `docs/SPEC.md`, `site/about.html` | All documentation updated consistently: published run count 1,615 → 1,584, taxonomy labels 8 → 7 (removing "Full-history reasoning"), and related figure captions and content sections revised across datasheet, specification, and landing page. |
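One way the public-slice exclusion could work is by resolving each run's raw model spec through the alias table and keeping only runs that land in a public bucket. The sketch below is hedged: the `Run` dataclass, the `public_runs` helper, and the legacy spec string `claude:claude-haiku-4-5` are hypothetical, and the real `generate_leaderboard` logic may differ. The alias entry, the `public_model_ids` value, and the two `agent_id` values come from the summary above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Only this alias entry is described in the PR; the dict shape is assumed.
MODEL_ALIASES: Dict[str, str] = {
    "anthropic:claude-haiku-4-5-20251001": "claude-haiku-4.5",
}


@dataclass
class Run:
    """Hypothetical run record; the real schema is not shown in the PR."""
    agent_id: str
    model: str  # raw 'provider:model' spec recorded with the run


def public_runs(runs: Iterable[Run], public_model_ids: List[str]) -> List[Run]:
    """Keep runs whose aliased model id belongs to the public slice."""
    allowed = set(public_model_ids)
    # Unmapped specs resolve to None, which is never in `allowed`.
    return [r for r in runs if MODEL_ALIASES.get(r.model) in allowed]


runs = [
    Run(agent_id="anthropic-api", model="anthropic:claude-haiku-4-5-20251001"),
    # Hypothetical legacy spec: it has no alias, so it is dropped.
    Run(agent_id="legacy-claude-cli", model="claude:claude-haiku-4-5"),
]
kept = public_runs(runs, ["claude-haiku-4.5"])
print([r.agent_id for r in kept])  # ['anthropic-api']
```

The design choice illustrated here is that removing an alias is enough to exclude the legacy runs: nothing needs to filter on `agent_id` directly.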

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

yourconscience/llm_quest_benchmark#28: Modifies claude: handling in llm_quest_benchmark/llm/client.py, directly affecting how legacy Claude CLI model specs are interpreted.
yourconscience/llm_quest_benchmark#45: Modifies llm_quest_benchmark/core/leaderboard.py's MODEL_ALIASES for Claude/Anthropic model-identifier unification.
yourconscience/llm_quest_benchmark#29: Adds Claude Haiku 4.5 benchmark configs that rely on the anthropic:claude-haiku-4-5-20251001 model-id handling updated in this PR.

Poem

🐰 The Claude CLI bids farewell, no longer in our config,
Anthropic now leads the way with qualified prefixes so chic,
Run counts and labels realigned, the leaderboard grows sleek,
Legacy ghosts are filtered out—the future's looking strong and antique! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main change: removing the legacy Claude CLI leaderboard alias. This aligns perfectly with the PR's primary objective to stop mapping legacy `claude:` benchmark runs and the code changes that remove the `claude:` provider check and update model aliases. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

gemini-code-assist

Code Review

This pull request updates the benchmark documentation and website to reflect a revised count of 1,584 runs and the removal of the 'Full-history reasoning' taxonomy label. Key code changes include updating Claude model aliases to consistently use the 'anthropic:' prefix and implementing logic to exclude legacy CLI runs from the public leaderboard. Additionally, new tests were added to verify the rejection of legacy provider strings and the correct filtering of leaderboard results. I have no feedback to provide as there were no review comments to evaluate.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/DATASHEET.md`:
- Around line 89-91: The datasheet contains inconsistent taxonomy wording: the
sentence beginning "Taxonomy labels currently represented in the public
comparison slice: Minimal prompt, Short-context reasoning, Compact memory /
memo, Prompt hints, Tools + compact memory, Tools + hints + compact memory, and
Planner loop." conflicts with another section that still lists "full-history
reasoning." Make the two descriptions identical by either adding "full-history
reasoning" to the public comparison slice list or removing it wherever the other
section includes it; update every occurrence of the exact phrase "Taxonomy
labels currently represented..." and any section that mentions "full-history
reasoning" so both sections use the same taxonomy wording.
- Around line 83-84: The datasheet header's "Last updated:" date is inconsistent
with the body "as of 2026-05-13"; update the header value for the "Last
updated:" field in DATASHEET.md to 2026-05-13 so the header and the reported "as
of" timestamp in the content match, ensuring the string for the header exactly
reflects the new date.

In `@docs/SPEC.md`:
- Line 32: The SPEC.md still lists an 8-label taxonomy entry "Full-history
reasoning" which conflicts with the new public 7-label taxonomy; edit the
taxonomy table in SPEC.md to remove the "Full-history reasoning" row so the
table lists only the 7 public labels used elsewhere in the PR, and verify any
surrounding text or counts that reference "8-label" are updated to reflect the
7-label taxonomy and the updated run count.
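The taxonomy findings above amount to checking that a removed label no longer appears anywhere in the docs. A minimal sketch of such a check follows; the helper name `stale_label_lines` is hypothetical and this script is not part of the repository. It matches case-insensitively because the comments quote both "Full-history reasoning" and "full-history reasoning".

```python
import re
from typing import List

REMOVED_LABEL = "Full-history reasoning"


def stale_label_lines(text: str, label: str = REMOVED_LABEL) -> List[int]:
    """Return 1-based line numbers where the removed label still appears."""
    pattern = re.compile(re.escape(label), re.IGNORECASE)
    return [
        n
        for n, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


sample = "Minimal prompt\nfull-history reasoning\nPlanner loop"
print(stale_label_lines(sample))  # [2]
```

A label that wraps across two lines would not be caught by this line-based scan, so it is a quick sanity check and not a substitute for reading the affected sections.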

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6373b392-d085-4ce0-b1d7-163b8460a9b8

📥 Commits

Reviewing files that changed from the base of the PR and between 4fb0a9d and 71c9234.

📒 Files selected for processing (8)

docs/DATASHEET.md
docs/SPEC.md
llm_quest_benchmark/core/leaderboard.py
llm_quest_benchmark/llm/client.py
llm_quest_benchmark/tests/test_leaderboard.py
llm_quest_benchmark/tests/test_llm_client.py
site/about.html
site/leaderboard.json

Remove legacy Claude CLI leaderboard alias

71c9234

gemini-code-assist Bot reviewed May 13, 2026

coderabbitai Bot reviewed May 13, 2026

Comment thread docs/DATASHEET.md

Comment thread docs/DATASHEET.md

Comment thread docs/SPEC.md

Address public taxonomy docs

40b26d0

yourconscience merged commit 2e88e42 into master May 13, 2026
2 checks passed

yourconscience merged 2 commits into master from remove-claude-exec
