Conversation
Two-condition benchmark testing whether AI models can accurately calculate tax/benefit outcomes without tools vs with PolicyEngine tool access.

- 5 frontier models (Claude Opus/Sonnet, GPT-4o/o3, Gemini 2.5 Pro)
- 14 programs (federal tax, credits, benefits, state tax, rates)
- 100 deterministic household scenarios across 12 states
- EDSL for AI-alone eval, LiteLLM for tool-calling eval
- Full test suite (46 tests), ruff lint, GitHub Actions CI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
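The with-tools condition this commit describes can be sketched with LiteLLM's OpenAI-style tool calling. Everything below is illustrative, not the benchmark's actual code: the tool name `calculate_household`, its schema, and the stubbed PolicyEngine call are assumptions.

```python
# Hypothetical sketch of the with-tools condition: the model is offered a
# PolicyEngine-style calculator tool via LiteLLM, and the benchmark executes
# whatever call it makes. Tool name and schema are assumptions.
import json
import litellm

TOOLS = [{
    "type": "function",
    "function": {
        "name": "calculate_household",
        "description": "Compute one tax/benefit variable for a household.",
        "parameters": {
            "type": "object",
            "properties": {
                "household": {"type": "object"},
                "variable": {"type": "string"},
            },
            "required": ["household", "variable"],
        },
    },
}]

def policyengine_calculate(household: dict, variable: str) -> float:
    """Stub: the real benchmark would invoke PolicyEngine here."""
    raise NotImplementedError

def run_with_tools(model: str, prompt: str) -> float:
    messages = [{"role": "user", "content": prompt}]
    response = litellm.completion(model=model, messages=messages, tools=TOOLS)
    # Assume the model issued a tool call; a robust harness would handle
    # the no-call case too (see the fallback-household fix below).
    call = response.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)
    return policyengine_calculate(args["household"], args["variable"])
```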
…s thesis

- Drop EDSL dependency, use LiteLLM for both conditions
- Add diskcache for reproducible/cost-efficient re-runs
- Update models: GPT-5.2, Gemini 3 Pro (latest as of Feb 2026)
- Fix handle_tool_call to use fallback household when model omits it
- Pilot results (5 scenarios × 3 programs × GPT-5.2):
  - Without tools: 80% within 10%, MAE $1,990, misses SNAP entirely
  - With tools: 100% exact match, MAE $0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
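The diskcache re-run pattern mentioned above is roughly the following; this is a minimal sketch, and the cache directory and wrapper name are assumptions rather than the repo's actual identifiers.

```python
# Hedged sketch of diskcache-backed LiteLLM calls for reproducible,
# cost-efficient re-runs. Identical (model, prompt) pairs hit the disk
# cache instead of the API on subsequent runs.
import litellm
from diskcache import Cache

cache = Cache(".llm_cache")  # illustrative path

@cache.memoize()
def cached_completion(model: str, prompt: str) -> str:
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # determinism also needs a fixed decoding setting
    )
    return response.choices[0].message.content
```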
- RESULTS.md with partial results (4000/4200 no-tools, 1400/4200 with-tools)
- Key finding: tools > models (100% vs 74.5% within-10% accuracy)
- JB2 paper in docs/ (intro, methodology, results, discussion, references)
- Add retry logic with exponential backoff to both eval modules
- Add incremental CSV saving to prevent data loss on crashes
- Prediction CSVs for completed model runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
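A minimal sketch of the retry-with-exponential-backoff and incremental-CSV patterns this commit adds; the function names and row schema are illustrative, not the repo's actual code.

```python
# Retry transient API failures with exponential backoff, and append each
# prediction to disk as it completes so a crash loses at most one row.
import csv
import time

def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn(), doubling the sleep after each failure: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

def append_prediction(path: str, row: dict) -> None:
    """Append one result row, writing the header only if the file is empty."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if f.tell() == 0:
            writer.writeheader()
        writer.writerow(row)
```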
- Fix with-tools extraction: use tool result directly instead of parsing verbose model responses (which grabbed income amounts, not answers)
- All 4,200 with-tools predictions now $0 MAE, 100% within-10%
- No-tools results: Opus 70.8%, Sonnet 61.9%, GPT-5.2 62.1%
- Update React app with real benchmark data (replacing mock data)
- Fix paper data loading to use single predictions.csv
- Clean up partial/stale result files
- All 53 tests pass, lint clean

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
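The extraction fix amounts to preferring the executed tool call's result over the model's prose, since the prose parser was matching income figures rather than the computed answer. A hedged sketch with hypothetical names:

```python
# Prefer the deterministic tool result; only regex-parse free text when no
# tool call was made. Both function names here are illustrative.
import re

def parse_dollar_amount(text: str) -> float | None:
    """Naive parser: take the last dollar amount in the response. This is
    exactly the kind of heuristic that previously grabbed the wrong number."""
    matches = re.findall(r"\$([\d,]+(?:\.\d+)?)", text)
    return float(matches[-1].replace(",", "")) if matches else None

def extract_prediction(tool_result: float | None, model_text: str) -> float | None:
    if tool_result is not None:
        return tool_result  # authoritative: PolicyEngine computed it
    return parse_dollar_amount(model_text)  # fallback only
```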
All 8,400 predictions (3 models × 100 scenarios × 14 programs × 2 conditions) now complete. All three models achieve 100% accuracy with PolicyEngine tools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sonnet 4.6 edges out Opus (72.3% vs 70.8% within 10%) and dramatically improves over Sonnet 4.5 (72.3% vs 61.9%), but remains far from the 100% all models achieve with tools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Complete rewrite of PolicyBench as a rigorous benchmark measuring whether AI models can accurately calculate US tax/benefit outcomes — and whether PolicyEngine tools bridge the gap.
Test plan
🤖 Generated with Claude Code