Fix pyproject build (PEP 639 + misplaced deps) and turn CI green#6
Conversation
The CI install step fails on setuptools >= 77 because the `dependencies` array sits after the `[project.urls]` header, so TOML attaches it as `project.urls.dependencies` (must be string). This also silently dropped the runtime deps (pandas/scipy/matplotlib/click/krippendorff) on a plain install. Separately, declaring both the SPDX `license = "MIT"` expression and the `License ::` Trove classifier is rejected under PEP 639. Because CI never got past install, it had never actually run lint/type/ test — so several latent failures were masked. Fixed all of them so CI goes green for the first time: - pyproject: move `dependencies` into [project]; drop License classifier (closes #3, closes #4) - test_judges: update stale judge-id assertion — PR #5 rotated the US council seat nvidia_nemotron -> microsoft_phi4 (Nemotron 404'd on OpenRouter) but left the test asserting the old id - test_analysis: rebuild the opus_df fixture with the `tier`/`raw_rate` columns that figure3_opus_longitudinal requires (fixture predated the per-tier redesign; was raising KeyError: 'tier') - figures.py: fix ruff lint (drop dead `opus_ids`/`bars`, zip(strict=), stale noqa, ambiguous-char ignore) and mypy strict (list[str] return, float() coords for annotate); apply ruff format - coverage: add the provider clients to coverage `omit` so config matches the documented intent the 78% threshold already assumes (they need live API creds); coverage 79.10%
|
Warning Review limit reached
More reviews will be available in 47 minutes and 13 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR refactors figure generation code and test data structures. Configuration updates expand coverage exclusions for credential-dependent modules. Figure helper functions and rendering code are reformatted for consistency, and two locations introduce stricter sequence validation via ChangesFigure refactoring with test alignment
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/refusalbench/analysis/figures.py`:
- Line 584: The provider color mapping can raise when provider_col contains NaN
or non-string values because _provider_color calls p.lower(); normalize values
first by retrieving the column with df.get(provider_col, pd.Series(["other"] *
len(df))) then coerce/fill entries to a safe string (e.g. replace NaN with
"other" and convert non-strings to str or call .astype(str).str.lower()) before
calling _provider_color for each element; update the list comprehension that
builds colors to use the normalized series so _provider_color always receives a
lowercase string (or "other") and cannot error on None/NaN.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e46f259b-6cd6-4965-af4f-483e281b6402
📒 Files selected for processing (5)
pyproject.tomlsrc/refusalbench/analysis/figures.pysrc/refusalbench/analysis/longitudinal.pytests/test_analysis.pytests/test_judges.py
| _provider_color(p) | ||
| for p in df.get(provider_col, pd.Series(["other"] * len(df))) | ||
| ] | ||
| colors = [_provider_color(p) for p in df.get(provider_col, pd.Series(["other"] * len(df)))] |
There was a problem hiding this comment.
Harden provider color mapping against null/non-string provider values.
If provider_col exists but contains NaN/non-string values, _provider_color(p) can fail on p.lower(). Normalize before mapping.
Suggested fix
- colors = [_provider_color(p) for p in df.get(provider_col, pd.Series(["other"] * len(df)))]
+ provider_vals = df.get(provider_col, pd.Series(["other"] * len(df), index=df.index)).fillna("other")
+ colors = [_provider_color(str(p)) for p in provider_vals]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/refusalbench/analysis/figures.py` at line 584, The provider color mapping
can raise when provider_col contains NaN or non-string values because
_provider_color calls p.lower(); normalize values first by retrieving the column
with df.get(provider_col, pd.Series(["other"] * len(df))) then coerce/fill
entries to a safe string (e.g. replace NaN with "other" and convert non-strings
to str or call .astype(str).str.lower()) before calling _provider_color for each
element; update the list comprehension that builds colors to use the normalized
series so _provider_color always receives a lowercase string (or "other") and
cannot error on None/NaN.
The test job installs with [dev,stats] extras, so it can't catch a plain 'pip install -e .' (what the issue reporter and 'make install' users run) breaking. New base-install job does a bare install in an isolated env and imports the runtime deps — on setuptools <77 the misplaced-deps bug installed nothing yet succeeded, so the import check is what actually catches the silent drop.
Summary
Fixes the two reported
pyproject.tomlbuild failures (#3, #4) and, in the process, gets CI to pass for the first time in the repo's history. CI had always died atpip install, so the lint / type / test steps had never actually executed — this PR fixes the install and the latent failures that were hiding behind it.Root cause of the red CI
Every run failed at ~19s during
pip install -e ".[dev,stats]":The
dependenciesarray was placed after the[project.urls]header, so TOML attached it asproject.urls.dependencies. This also meant a plain install silently dropped the runtime deps (pandas,scipy,matplotlib,click,krippendorff).Changes
pyproject.tomldependenciesinto the[project]table (closes pip install fails on setuptools >= 77 #3). Verifiedpip install -e .now actually installs the runtime deps.License :: OSI Approved :: MIT LicenseTrove classifier; keep the SPDXlicense = "MIT"(PEP 639 / setuptools ≥77 reject both together) (closes License classifier is deprecated (PEP 639) #4).Latent failures unmasked once install worked:
tests/test_judges.py— PR Add Claude Opus 4.8 (post-v1.1, rotated v1.3 council) #5 rotated the US council seatnvidia_nemotron→microsoft_phi4(Nemotron 404'd on OpenRouter) incouncil/v1.1.jsonbut left the test asserting the old id. Updated the assertion to match the shipped config.tests/test_analysis.py— theopus_dffixture predatedfigure3_opus_longitudinal's per-tier redesign and lacked thetier/raw_ratecolumns it requires, raisingKeyError: 'tier'. Rebuilt the fixture to match the function's documented contract (the shape_compute_statsemits in production).analysis/figures.py— ruff lint (remove deadopus_ids/bars,zip(strict=True), stalenoqa, ambiguous-char ignore) and mypy--strict(list[str]return in_overall_order,float()coords forannotate); appliedruff format.anthropic.py,bedrock.py,openrouter.py) to coverageomit. The existing[tool.coverage.report]note already documents that these are "excluded from CI coverage measurement — hence 78%", but they were never actually in the omit list. Config now matches the documented intent the threshold assumes. Coverage: 79.10%.Verification
Full CI sequence run locally in a fresh venv (Python 3.11, setuptools 82.0.1):
pip install -e ".[dev,stats]"ruff check src testsruff format --check src testsmypy --strict src/refusalbenchpytest --covpython scripts/validate_prompts.pyCloses #3
Closes #4
Summary by CodeRabbit
Bug Fixes
Tests
Chores