
PolicyBench v2: AI tax/benefit benchmark #2

Merged

MaxGhenis merged 10 commits into main from v2 on Feb 25, 2026

Conversation

@MaxGhenis
Contributor

Summary

Complete rewrite of PolicyBench as a rigorous benchmark measuring whether AI models can accurately calculate US tax/benefit outcomes — and whether PolicyEngine tools bridge the gap.

  • Two conditions: AI alone (EDSL) vs AI with PolicyEngine tools (LiteLLM); see the sketch after this list
  • 5 models: Claude Opus 4.6, Claude Sonnet 4.5, GPT-4o, OpenAI o3, Gemini 2.5 Pro
  • 14 programs: Federal income tax, EITC, CTC, SNAP, SSI, Medicaid, state tax, net income, marginal rates
  • 100 scenarios: Deterministic households varying income ($0-$500k), filing status, children (0-4), 12 states
  • 46 tests: Full coverage with mocked external calls for CI, plus PE-US integration tests
  • CI: GitHub Actions with ruff lint + pytest
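
For context on how the two conditions are wired up, here is a minimal sketch of the harness using LiteLLM tool calling. The tool name, schema, and the `run_policyengine` helper are illustrative placeholders rather than this repo's actual API (and, per the commits below, EDSL was later dropped in favor of LiteLLM for both conditions).

```python
# Illustrative sketch only: tool name, schema, and run_policyengine are
# placeholders, not PolicyBench's actual implementation.
import json

import litellm

TOOL_SPEC = [{
    "type": "function",
    "function": {
        "name": "calculate_household",  # hypothetical tool name
        "description": "Compute a tax/benefit variable for a household with PolicyEngine.",
        "parameters": {
            "type": "object",
            "properties": {
                "household": {"type": "object"},
                "variable": {"type": "string"},
            },
            "required": ["household", "variable"],
        },
    },
}]


def run_policyengine(household: dict, variable: str) -> float:
    """Placeholder: the real harness would call policyengine-us here."""
    raise NotImplementedError


def predict(model: str, prompt: str, with_tools: bool):
    """Ask the model for a dollar amount, optionally exposing the PolicyEngine tool."""
    messages = [{"role": "user", "content": prompt}]
    kwargs = {"tools": TOOL_SPEC, "tool_choice": "auto"} if with_tools else {}
    response = litellm.completion(model=model, messages=messages, **kwargs)
    message = response.choices[0].message
    if with_tools and message.tool_calls:
        args = json.loads(message.tool_calls[0].function.arguments)
        return run_policyengine(args["household"], args["variable"])
    return message.content
```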

Test plan

  • All 46 tests pass locally (39 fast + 7 PE-US integration)
  • Ruff lint and format checks pass
  • CI passes on push
  • Run full benchmark with real API keys (manual, post-merge)

🤖 Generated with Claude Code

MaxGhenis and others added 10 commits February 16, 2026 07:27
Two-condition benchmark testing whether AI models can accurately calculate
tax/benefit outcomes without tools vs with PolicyEngine tool access.

- 5 frontier models (Claude Opus/Sonnet, GPT-4o/o3, Gemini 2.5 Pro)
- 14 programs (federal tax, credits, benefits, state tax, rates)
- 100 deterministic household scenarios across 12 states
- EDSL for AI-alone eval, LiteLLM for tool-calling eval
- Full test suite (46 tests), ruff lint, GitHub Actions CI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s thesis

- Drop EDSL dependency, use LiteLLM for both conditions
- Add diskcache for reproducible/cost-efficient re-runs
- Update models: GPT-5.2, Gemini 3 Pro (latest as of Feb 2026)
- Fix handle_tool_call to use fallback household when model omits it
- Pilot results (5 scenarios × 3 programs × GPT-5.2):
  - Without tools: 80% within 10%, MAE $1,990, misses SNAP entirely
  - With tools: 100% exact match, MAE $0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
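
A minimal sketch of how diskcache can memoize LiteLLM calls for reproducible, cost-efficient re-runs; the cache path and key construction here are assumptions, not the repo's actual layout.

```python
# Assumption: responses are keyed by a hash of (model, messages); the actual
# PolicyBench cache layout may differ.
import hashlib
import json

import diskcache
import litellm

cache = diskcache.Cache(".litellm_cache")


def cached_completion(model: str, messages: list) -> str:
    """Return the model's reply, reusing a disk-cached response on repeat runs."""
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit
    response = litellm.completion(model=model, messages=messages)
    text = response.choices[0].message.content
    cache.set(key, text)
    return text
```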
- RESULTS.md with partial results (4000/4200 no-tools, 1400/4200 with-tools)
- Key finding: tools > models (100% vs 74.5% within-10% accuracy)
- JB2 paper in docs/ (intro, methodology, results, discussion, references)
- Add retry logic with exponential backoff to both eval modules
- Add incremental CSV saving to prevent data loss on crashes
- Prediction CSVs for completed model runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
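
Roughly how retry-with-backoff and incremental saving fit together; the attempt count, delays, and CSV schema below are illustrative, not the repo's actual values.

```python
# Illustrative constants; the repo's actual retry policy and CSV schema may differ.
import csv
import time

import litellm


def completion_with_retry(model: str, messages: list, max_attempts: int = 5,
                          base_delay: float = 1.0):
    """Retry transient API failures with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_attempts):
        try:
            return litellm.completion(model=model, messages=messages)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


def append_prediction(path: str, row: dict) -> None:
    """Append one prediction row immediately, so a crash loses at most the in-flight call."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if f.tell() == 0:  # new file: write the header first
            writer.writeheader()
        writer.writerow(row)
```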
- Fix with-tools extraction: use tool result directly instead of parsing
  verbose model responses (which grabbed income amounts, not answers)
- All 4,200 with-tools predictions now $0 MAE, 100% within-10%
- No-tools results: Opus 70.8%, Sonnet 61.9%, GPT-5.2 62.1%
- Update React app with real benchmark data (replacing mock data)
- Fix paper data loading to use single predictions.csv
- Clean up partial/stale result files
- All 53 tests pass, lint clean

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
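
The extraction fix amounts to trusting the tool's return value rather than regexing the model's prose. A sketch under the assumption of OpenAI-style tool messages, with an illustrative regex fallback for no-tools runs:

```python
# Assumes OpenAI/LiteLLM-style message dicts; the regex fallback is illustrative.
import re


def extract_prediction(messages: list):
    """Prefer the PolicyEngine tool's returned value over the model's final prose."""
    for message in reversed(messages):
        if message.get("role") == "tool":
            return float(message["content"])  # the tool result is the answer itself
    # No-tools fallback: take the last dollar amount in the model's reply.
    final = messages[-1].get("content") or ""
    amounts = re.findall(r"-?\d[\d,]*(?:\.\d+)?", final)
    return float(amounts[-1].replace(",", "")) if amounts else None
```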
All 8,400 predictions (3 models × 100 scenarios × 14 programs × 2 conditions)
now complete. All three models achieve 100% accuracy with PolicyEngine tools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sonnet 4.6 edges out Opus (72.3% vs 70.8% within 10%) and dramatically
improves over Sonnet 4.5 (72.3% vs 61.9%). Still far from 100% with tools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
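
For reference, the two headline metrics (MAE and the share of predictions within 10% of the true value) can be computed as sketched below; the column names and the zero-truth tolerance are assumptions, not necessarily what the repo uses.

```python
# Assumed columns `predicted` and `actual`; the $100 tolerance for zero-valued
# ground truth is an illustrative choice to avoid dividing by zero.
import pandas as pd


def score(df: pd.DataFrame, zero_tolerance: float = 100.0) -> dict:
    """Return MAE and the within-10% accuracy rate for one model/condition."""
    error = (df["predicted"] - df["actual"]).abs()
    within_10 = (error <= 0.10 * df["actual"].abs()) | (
        (df["actual"] == 0) & (error <= zero_tolerance)
    )
    return {"mae": float(error.mean()), "within_10_pct": float(within_10.mean())}
```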
MaxGhenis merged commit a883bb5 into main on Feb 25, 2026
4 checks passed