Fix pyproject build (PEP 639 + misplaced deps) and turn CI green by mihailoxyz · Pull Request #6 · AppliedScientific/refusalbench

mihailoxyz · 2026-05-29T16:13:02Z

Summary

Fixes the two reported pyproject.toml build failures (#3, #4) and, in the process, gets CI to pass for the first time in the repo's history. CI had always died at pip install, so the lint / type / test steps had never actually executed — this PR fixes the install and the latent failures that were hiding behind it.

Root cause of the red CI

Every run failed at ~19s during pip install -e ".[dev,stats]":

configuration error: `project.urls.dependencies` must be string

The dependencies array was placed after the [project.urls] header, so TOML attached it as project.urls.dependencies. This also meant a plain install silently dropped the runtime deps (pandas, scipy, matplotlib, click, krippendorff).

Changes

pyproject.toml

- Move dependencies into the [project] table (closes #3, "pip install fails on setuptools >= 77"). Verified that pip install -e . now installs the runtime deps.
- Remove the deprecated License :: OSI Approved :: MIT License Trove classifier and keep the SPDX license = "MIT" expression, since PEP 639 / setuptools ≥77 reject the two together (closes #4, "License classifier is deprecated (PEP 639)").

Latent failures unmasked once install worked:

- tests/test_judges.py: PR #5 ("Add Claude Opus 4.8 (post-v1.1, rotated v1.3 council)") rotated the US council seat nvidia_nemotron → microsoft_phi4 (Nemotron 404'd on OpenRouter) in council/v1.1.json but left the test asserting the old id. Updated the assertion to match the shipped config.
- tests/test_analysis.py: the opus_df fixture predated figure3_opus_longitudinal's per-tier redesign and lacked the tier/raw_rate columns it requires, raising KeyError: 'tier'. Rebuilt the fixture to match the function's documented contract (the shape _compute_stats emits in production).
- analysis/figures.py: ruff lint fixes (remove dead opus_ids/bars, add zip(strict=True), drop a stale noqa, add an ambiguous-char ignore) and mypy --strict fixes (list[str] return in _overall_order, float() coords for annotate); applied ruff format.
- coverage: added the provider clients (anthropic.py, bedrock.py, openrouter.py) to the coverage omit list. The existing [tool.coverage.report] note already states that these are excluded from CI coverage measurement (hence the 78% threshold), but they were never actually in the omit list. The config now matches the documented intent that the threshold assumes. Coverage: 79.10%.

Verification

Full CI sequence run locally in a fresh venv (Python 3.11, setuptools 82.0.1):

Step	Result
`pip install -e ".[dev,stats]"`	✅
`ruff check src tests`	✅ All checks passed
`ruff format --check src tests`	✅
`mypy --strict src/refusalbench`	✅ no issues (33 files)
`pytest --cov`	✅ 324 passed, coverage 79.10%
`python scripts/validate_prompts.py`	✅

Closes #3
Closes #4

Summary by CodeRabbit

Bug Fixes
- Stricter validation added for data consistency checks in figure generation and data mapping.
Tests
- Updated test fixtures and assertions to cover additional dataset scenarios and configurations.
Chores
- Updated project configuration and code formatting.
- Expanded linting rule exceptions.

The CI install step fails on setuptools >= 77 because the `dependencies` array sits after the `[project.urls]` header, so TOML attaches it as `project.urls.dependencies` (must be string). This also silently dropped the runtime deps (pandas/scipy/matplotlib/click/krippendorff) on a plain install. Separately, declaring both the SPDX `license = "MIT"` expression and the `License ::` Trove classifier is rejected under PEP 639.

Because CI never got past install, it had never actually run lint/type/test, so several latent failures were masked. Fixed all of them so CI goes green for the first time:

- pyproject: move `dependencies` into [project]; drop License classifier (closes #3, closes #4)
- test_judges: update stale judge-id assertion. PR #5 rotated the US council seat nvidia_nemotron -> microsoft_phi4 (Nemotron 404'd on OpenRouter) but left the test asserting the old id
- test_analysis: rebuild the opus_df fixture with the `tier`/`raw_rate` columns that figure3_opus_longitudinal requires (fixture predated the per-tier redesign; was raising KeyError: 'tier')
- figures.py: fix ruff lint (drop dead `opus_ids`/`bars`, zip(strict=), stale noqa, ambiguous-char ignore) and mypy strict (list[str] return, float() coords for annotate); apply ruff format
- coverage: add the provider clients to coverage `omit` so config matches the documented intent the 78% threshold already assumes (they need live API creds); coverage 79.10%

coderabbitai · 2026-05-29T16:13:24Z

Warning

Review limit reached

@mihailoxyz, we couldn't start this review because you've reached your PR review rate limit.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fa360cef-2a32-47cf-a984-1bc24a79eabb

📥 Commits

Reviewing files that changed from the base of the PR and between 8e70653 and abab13e.

📒 Files selected for processing (3)

.github/workflows/ci.yml
pyproject.toml
tests/test_analysis.py

📝 Walkthrough

Walkthrough

This PR refactors figure generation code and test data structures. Configuration updates expand coverage exclusions for credential-dependent modules. Figure helper functions and rendering code are reformatted for consistency, and two locations introduce stricter sequence validation via zip(..., strict=True). Test fixtures are updated to match new tiered data expectations.

Changes

Figure refactoring with test alignment

Layer / File(s)	Summary
Configuration and coverage updates `pyproject.toml`	License classifier removed; coverage omit list expanded to exclude credential-dependent provider modules; Ruff per-file ignores extended with `RUF003`.
Figure helper functions and data computation `src/refusalbench/analysis/figures.py`	Color palettes and `_compute_stats` helper refactored with compact lambda expressions for display/provider/is_refused derivation; grouping logic and Wilson interval stats generation unchanged.
Figure 1–5 visualization formatting `src/refusalbench/analysis/figures.py`	Titles, legends, bar/errorbar arguments, and label styling reformatted across `figure1_provider_gradient`, `figure2_subdomain_heatmap`, `figure3_opus_longitudinal`, `figure4_refusal_taxonomy`, and `figure5_tier_comparison` with no change to rendered output.
Strict sequence validation in figure 3 and CLI `src/refusalbench/analysis/figures.py`	Tier annotation loop in `figure3_opus_longitudinal` and `wmdp_scores` dict construction in CLI switched to `zip(..., strict=True)` to raise on length mismatches. CLI removes intermediate `opus_ids` list; Opus model selection now driven directly from `opus_labels` for figure-3.
Figure 6 and CLI infrastructure `src/refusalbench/analysis/figures.py`	Figure 6 provider color derivation compacted to list comprehension; regression-line plotting refactored; CLI data mapping for figures 2 and 4 switched to single-line lambda/isin expressions.
Test data and expectations `tests/test_analysis.py`, `tests/test_judges.py`, `src/refusalbench/analysis/longitudinal.py`	Opus fixture updated to generate tiered longitudinal dataset (one row per Opus release and tier) replacing single-row structure. Judge expectation changed to `microsoft_phi4`. Path formatting in `cochran_q_across_snapshots` refactored without logic change.

🎯 3 (Moderate) | ⏱️ ~25 minutes


🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title directly addresses the main objectives: fixing pyproject.toml build issues (PEP 639 + misplaced deps) and resolving CI failures, which matches the core changes in the changeset.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/refusalbench/analysis/figures.py`:
- Line 584: The provider color mapping can raise when provider_col contains NaN
or non-string values because _provider_color calls p.lower(); normalize values
first by retrieving the column with df.get(provider_col, pd.Series(["other"] *
len(df))) then coerce/fill entries to a safe string (e.g. replace NaN with
"other" and convert non-strings to str or call .astype(str).str.lower()) before
calling _provider_color for each element; update the list comprehension that
builds colors to use the normalized series so _provider_color always receives a
lowercase string (or "other") and cannot error on None/NaN.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e46f259b-6cd6-4965-af4f-483e281b6402

📥 Commits

Reviewing files that changed from the base of the PR and between f8fc699 and 8e70653.

📒 Files selected for processing (5)

pyproject.toml
src/refusalbench/analysis/figures.py
src/refusalbench/analysis/longitudinal.py
tests/test_analysis.py
tests/test_judges.py

coderabbitai · 2026-05-29T16:17:47Z

-        _provider_color(p)
-        for p in df.get(provider_col, pd.Series(["other"] * len(df)))
-    ]
+    colors = [_provider_color(p) for p in df.get(provider_col, pd.Series(["other"] * len(df)))]


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Harden provider color mapping against null/non-string provider values.

If provider_col exists but contains NaN/non-string values, _provider_color(p) can fail on p.lower(). Normalize before mapping.

Suggested fix

-    colors = [_provider_color(p) for p in df.get(provider_col, pd.Series(["other"] * len(df)))]
+    provider_vals = df.get(provider_col, pd.Series(["other"] * len(df), index=df.index)).fillna("other")
+    colors = [_provider_color(str(p)) for p in provider_vals]
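The idea behind the suggestion, normalizing each value before anything calls `.lower()` on it, can be illustrated without pandas. This is a sketch under assumptions: `normalize_provider`, `provider_color`, and the palette are hypothetical stand-ins, since the real `_provider_color` body is not shown in this review.

```python
import math


def normalize_provider(value: object, default: str = "other") -> str:
    """Return a lowercase provider label, mapping None and NaN to the default."""
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default
    return str(value).lower()


# Hypothetical palette and lookup standing in for the real _provider_color.
_PALETTE = {"anthropic": "#d97757", "other": "#888888"}


def provider_color(provider: str) -> str:
    # Calling .lower() here is what would fail on None or NaN input.
    return _PALETTE.get(provider.lower(), _PALETTE["other"])


raw = ["Anthropic", None, float("nan"), 42]
colors = [provider_color(normalize_provider(p)) for p in raw]
print(colors)  # ['#d97757', '#888888', '#888888', '#888888']
```

The pandas version in the suggested fix does the same thing at column level with `fillna("other")` followed by `str(p)`.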


The test job installs with [dev,stats] extras, so it can't catch a plain 'pip install -e .' (what the issue reporter and 'make install' users run) breaking. New base-install job does a bare install in an isolated env and imports the runtime deps — on setuptools <77 the misplaced-deps bug installed nothing yet succeeded, so the import check is what actually catches the silent drop.
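The import check this commit message describes could look roughly like the following. It is a sketch only: the actual base-install job in ci.yml is not shown here, `missing_modules` is a hypothetical helper, and it assumes each runtime dependency is importable under its distribution name.

```python
import importlib.util

# Runtime deps named in the PR description; import names are assumed to match.
RUNTIME_DEPS = ["pandas", "scipy", "matplotlib", "click", "krippendorff"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_runtime_deps() -> int:
    """Return a process exit code: 0 if every runtime dep is importable, else 1."""
    missing = missing_modules(RUNTIME_DEPS)
    if missing:
        print("missing runtime deps: " + ", ".join(missing))
        return 1
    print("all runtime deps importable")
    return 0
```

A CI step would run `check_runtime_deps()` after a bare `pip install -e .` and fail the job on a non-zero result, which catches an install that succeeds while installing none of the dependencies.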

coderabbitai Bot reviewed May 29, 2026


mihailoxyz added 2 commits May 29, 2026 18:23

Trim verbose comments to terse pointers

abab13e

mihailoxyz merged commit 13f9637 into main May 29, 2026
4 checks passed

mihailoxyz merged 3 commits into main from fix/pyproject-pep639-and-deps
