Fix pyproject build (PEP 639 + misplaced deps) and turn CI green#6

Merged
mihailoxyz merged 3 commits into main from fix/pyproject-pep639-and-deps on May 29, 2026

Conversation

@mihailoxyz mihailoxyz commented May 29, 2026

Summary

Fixes the two reported pyproject.toml build failures (#3, #4) and, in the process, gets CI to pass for the first time in the repo's history. CI had always died at pip install, so the lint / type / test steps had never actually executed — this PR fixes the install and the latent failures that were hiding behind it.

Root cause of the red CI

Every run failed at ~19s during pip install -e ".[dev,stats]":

configuration error: `project.urls.dependencies` must be string

The dependencies array was placed after the [project.urls] header, so TOML attached it as project.urls.dependencies. This also meant a plain install silently dropped the runtime deps (pandas, scipy, matplotlib, click, krippendorff).

Changes

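pyproject.toml

  • Moved the dependencies array into [project] and dropped the License :: Trove classifier (closes #3, closes #4).

Latent failures unmasked once install worked: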

  • tests/test_judges.py — PR Add Claude Opus 4.8 (post-v1.1, rotated v1.3 council) #5 rotated the US council seat nvidia_nemotron -> microsoft_phi4 (Nemotron 404'd on OpenRouter) in council/v1.1.json but left the test asserting the old id. Updated the assertion to match the shipped config.
  • tests/test_analysis.py — the opus_df fixture predated figure3_opus_longitudinal's per-tier redesign and lacked the tier/raw_rate columns it requires, raising KeyError: 'tier'. Rebuilt the fixture to match the function's documented contract (the shape _compute_stats emits in production).
  • analysis/figures.py — ruff lint (remove dead opus_ids/bars, zip(strict=True), stale noqa, ambiguous-char ignore) and mypy --strict (list[str] return in _overall_order, float() coords for annotate); applied ruff format.
  • coverage — added the provider clients (anthropic.py, bedrock.py, openrouter.py) to coverage omit. The existing [tool.coverage.report] note already documents that these are "excluded from CI coverage measurement — hence 78%", but they were never actually in the omit list. Config now matches the documented intent the threshold assumes. Coverage: 79.10%.

Verification

Full CI sequence run locally in a fresh venv (Python 3.11, setuptools 82.0.1):

| Step | Result |
| --- | --- |
| pip install -e ".[dev,stats]" | |
| ruff check src tests | ✅ All checks passed |
| ruff format --check src tests | |
| mypy --strict src/refusalbench | ✅ no issues (33 files) |
| pytest --cov | ✅ 324 passed, coverage 79.10% |
| python scripts/validate_prompts.py | |

Closes #3
Closes #4

Summary by CodeRabbit

  • Bug Fixes

    • Stricter validation added for data consistency checks in figure generation and data mapping.
  • Tests

    • Updated test fixtures and assertions to cover additional dataset scenarios and configurations.
  • Chores

    • Updated project configuration and code formatting.
    • Expanded linting rule exceptions.

The CI install step fails on setuptools >= 77 because the `dependencies`
array sits after the `[project.urls]` header, so TOML attaches it as
`project.urls.dependencies` (must be string). This also silently dropped
the runtime deps (pandas/scipy/matplotlib/click/krippendorff) on a plain
install. Separately, declaring both the SPDX `license = "MIT"` expression
and the `License ::` Trove classifier is rejected under PEP 639.

Because CI never got past install, it had never actually run lint/type/
test — so several latent failures were masked. Fixed all of them so CI
goes green for the first time:

- pyproject: move `dependencies` into [project]; drop License classifier
  (closes #3, closes #4)
- test_judges: update stale judge-id assertion — PR #5 rotated the US
  council seat nvidia_nemotron -> microsoft_phi4 (Nemotron 404'd on
  OpenRouter) but left the test asserting the old id
- test_analysis: rebuild the opus_df fixture with the `tier`/`raw_rate`
  columns that figure3_opus_longitudinal requires (fixture predated the
  per-tier redesign; was raising KeyError: 'tier')
- figures.py: fix ruff lint (drop dead `opus_ids`/`bars`, zip(strict=),
  stale noqa, ambiguous-char ignore) and mypy strict (list[str] return,
  float() coords for annotate); apply ruff format
- coverage: add the provider clients to coverage `omit` so config matches
  the documented intent the 78% threshold already assumes (they need live
  API creds); coverage 79.10%

coderabbitai Bot commented May 29, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fa360cef-2a32-47cf-a984-1bc24a79eabb

📥 Commits

Reviewing files that changed from the base of the PR and between 8e70653 and abab13e.

📒 Files selected for processing (3)
  • .github/workflows/ci.yml
  • pyproject.toml
  • tests/test_analysis.py

Walkthrough

This PR refactors figure generation code and test data structures. Configuration updates expand coverage exclusions for credential-dependent modules. Figure helper functions and rendering code are reformatted for consistency, and two locations introduce stricter sequence validation via zip(..., strict=True). Test fixtures are updated to match new tiered data expectations.

Changes

Figure refactoring with test alignment

| Layer / File(s) | Summary |
| --- | --- |
| Configuration and coverage updates: pyproject.toml | License classifier removed; coverage omit list expanded to exclude credential-dependent provider modules; Ruff per-file ignores extended with RUF003. |
| Figure helper functions and data computation: src/refusalbench/analysis/figures.py | Color palettes and _compute_stats helper refactored with compact lambda expressions for display/provider/is_refused derivation; grouping logic and Wilson interval stats generation unchanged. |
| Figure 1–5 visualization formatting: src/refusalbench/analysis/figures.py | Titles, legends, bar/errorbar arguments, and label styling reformatted across figure1_provider_gradient, figure2_subdomain_heatmap, figure3_opus_longitudinal, figure4_refusal_taxonomy, and figure5_tier_comparison with no change to rendered output. |
| Strict sequence validation in figure 3 and CLI: src/refusalbench/analysis/figures.py | Tier annotation loop in figure3_opus_longitudinal and wmdp_scores dict construction in CLI switched to zip(..., strict=True) to raise on length mismatches. CLI removes intermediate opus_ids list; Opus model selection now driven directly from opus_labels for figure-3. |
| Figure 6 and CLI infrastructure: src/refusalbench/analysis/figures.py | Figure 6 provider color derivation compacted to list comprehension; regression-line plotting refactored; CLI data mapping for figures 2 and 4 switched to single-line lambda/isin expressions. |
| Test data and expectations: tests/test_analysis.py, tests/test_judges.py, src/refusalbench/analysis/longitudinal.py | Opus fixture updated to generate tiered longitudinal dataset (one row per Opus release and tier) replacing single-row structure. Judge expectation changed to microsoft_phi4. Path formatting in cochran_q_across_snapshots refactored without logic change. |

🎯 3 (Moderate) | ⏱️ ~25 minutes


📊 Figures now stand with stricter sight,
Tiers aligned from left to right,
Formatting cleaned, validations tight,
Data shapes dancing in the light. ✨
Test fixtures leap to match the code,
One refactor down the winding road! 🐰

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title directly addresses the main objectives: fixing pyproject.toml build issues (PEP 639 + misplaced deps) and resolving CI failures, which matches the core changes in the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/refusalbench/analysis/figures.py`:
- Line 584: The provider color mapping can raise when provider_col contains NaN
or non-string values because _provider_color calls p.lower(); normalize values
first by retrieving the column with df.get(provider_col, pd.Series(["other"] *
len(df))) then coerce/fill entries to a safe string (e.g. replace NaN with
"other" and convert non-strings to str or call .astype(str).str.lower()) before
calling _provider_color for each element; update the list comprehension that
builds colors to use the normalized series so _provider_color always receives a
lowercase string (or "other") and cannot error on None/NaN.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e46f259b-6cd6-4965-af4f-483e281b6402

📥 Commits

Reviewing files that changed from the base of the PR and between f8fc699 and 8e70653.

📒 Files selected for processing (5)
  • pyproject.toml
  • src/refusalbench/analysis/figures.py
  • src/refusalbench/analysis/longitudinal.py
  • tests/test_analysis.py
  • tests/test_judges.py

_provider_color(p)
for p in df.get(provider_col, pd.Series(["other"] * len(df)))
]
colors = [_provider_color(p) for p in df.get(provider_col, pd.Series(["other"] * len(df)))]

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Harden provider color mapping against null/non-string provider values.

If provider_col exists but contains NaN/non-string values, _provider_color(p) can fail on p.lower(). Normalize before mapping.

Suggested fix
-    colors = [_provider_color(p) for p in df.get(provider_col, pd.Series(["other"] * len(df)))]
+    provider_vals = df.get(provider_col, pd.Series(["other"] * len(df), index=df.index)).fillna("other")
+    colors = [_provider_color(str(p)) for p in provider_vals]
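
An alternative sketch that does the normalization in plain Python, without relying on pandas methods. `normalize_provider` is a hypothetical helper, and the assumption that `_provider_color` only needs a lowercase string comes from this review comment, not from the function's source.

```python
import math


def normalize_provider(value: object, default: str = "other") -> str:
    """Coerce a provider cell to a lowercase string, mapping nulls to `default`.

    Handles None and float NaN (which covers numpy.nan, a plain Python float).
    Other missing-value sentinels such as pandas.NA are NOT handled here.
    """
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default
    text = str(value).strip().lower()
    # Treat empty or whitespace-only strings as missing too.
    return text or default


# Illustrative inputs: a normal name, two null forms, and a non-string.
samples = ["Anthropic", None, float("nan"), 42]
normalized = [normalize_provider(s) for s in samples]
```

Each normalized value could then be passed to `_provider_color` without risking an AttributeError on `.lower()`.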
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/refusalbench/analysis/figures.py` at line 584, The provider color mapping
can raise when provider_col contains NaN or non-string values because
_provider_color calls p.lower(); normalize values first by retrieving the column
with df.get(provider_col, pd.Series(["other"] * len(df))) then coerce/fill
entries to a safe string (e.g. replace NaN with "other" and convert non-strings
to str or call .astype(str).str.lower()) before calling _provider_color for each
element; update the list comprehension that builds colors to use the normalized
series so _provider_color always receives a lowercase string (or "other") and
cannot error on None/NaN.

The test job installs with [dev,stats] extras, so it can't catch a plain
'pip install -e .' (what the issue reporter and 'make install' users run)
breaking. New base-install job does a bare install in an isolated env and
imports the runtime deps — on setuptools <77 the misplaced-deps bug
installed nothing yet succeeded, so the import check is what actually
catches the silent drop.
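
The import check this commit message describes might be sketched as follows. The module list is taken from the runtime deps named in this PR; the function names and the idea of running it as a standalone script are assumptions, since the actual base-install job is not shown here.

```python
import importlib.util

# Top-level import names of the runtime deps listed in this PR.
RUNTIME_DEPS = ("pandas", "scipy", "matplotlib", "click", "krippendorff")


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_runtime_deps(names=RUNTIME_DEPS) -> int:
    """Return 0 if every module is importable, 1 otherwise (shell-style code)."""
    missing = missing_modules(names)
    if missing:
        print("missing runtime deps: " + ", ".join(missing))
        return 1
    print("all runtime deps importable")
    return 0
```

A CI step could call `check_runtime_deps()` after a bare `pip install -e .` and fail the job on a non-zero return, which would flag an install that succeeded but pulled in nothing.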
@mihailoxyz mihailoxyz merged commit 13f9637 into main May 29, 2026
4 checks passed

Development

Successfully merging this pull request may close these issues.

  • License classifier is deprecated (PEP 639)
  • pip install fails on setuptools >= 77

1 participant