Commit a883bb5

Merge pull request #2 from CosilicoAI/v2
PolicyBench v2: AI tax/benefit benchmark
2 parents 6fd5196 + 3caf19f commit a883bb5

64 files changed

Lines changed: 102637 additions & 2397 deletions


.github/workflows/ci.yml

Lines changed: 29 additions & 0 deletions
```diff
@@ -0,0 +1,29 @@
+name: CI
+
+on:
+  push:
+    branches: [main, v2]
+  pull_request:
+    branches: [main]
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install ruff
+      - run: ruff check .
+      - run: ruff format --check .
+
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install -e ".[dev]"
+      - run: pytest -m "not slow" --tb=short -q
```

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -1,3 +1,6 @@
+# LiteLLM disk cache
+.policybench_cache/
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
```

CLAUDE.md

Lines changed: 32 additions & 0 deletions
````diff
@@ -0,0 +1,32 @@
+# PolicyBench development
+
+## Quick start
+```bash
+pip install -e ".[dev]"
+pytest          # Run tests (all external calls mocked)
+ruff check .    # Lint
+ruff format .   # Format
+```
+
+## Architecture
+- **Two conditions**: AI alone (EDSL) vs AI with PE tools (LiteLLM)
+- **Ground truth**: policyengine-us Simulation
+- **TDD**: Write tests first, then implement
+
+## Key files
+- `policybench/config.py` — Models, programs, constants
+- `policybench/scenarios.py` — Household scenario generation
+- `policybench/ground_truth.py` — PE-US calculations
+- `policybench/prompts.py` — Natural language prompt templates
+- `policybench/eval_no_tools.py` — EDSL-based AI-alone benchmark
+- `policybench/eval_with_tools.py` — LiteLLM tool-calling benchmark
+- `policybench/analysis.py` — Metrics and reporting
+
+## Testing
+- All tests mock external calls (EDSL, LiteLLM, PE-US API)
+- `pytest -m "not slow"` to skip slow tests
+- Full benchmark runs are manual and expensive
+
+## Formatting
+- Use `ruff format .` before committing
+- Use `ruff check . --fix` for auto-fixable lint issues
````
Makefile

Lines changed: 0 additions & 18 deletions
This file was deleted.

README.md

Lines changed: 37 additions & 12 deletions
````diff
@@ -1,18 +1,43 @@
 # PolicyBench
 
-A mini-benchmark comparing Large Language Model estimates to PolicyEngine calculations of US tax/benefit programs.
+Can AI models accurately calculate tax and benefit outcomes without tools?
 
-## Requirements
+PolicyBench measures how well frontier AI models estimate US tax/benefit values for specific households — both **without tools** (pure reasoning) and **with PolicyEngine tools** (API access to ground truth).
 
-- Python 3.9+
-- `policyengine-us` for ground truth
-- `edsl` for LLM queries
-- (Optional) `pytest`, etc. for tests
+## Conditions
 
-## Installation
+1. **AI alone**: Models estimate tax/benefit values using only their training knowledge
+2. **AI with PolicyEngine**: Models use a PolicyEngine tool to compute exact answers
 
-1. Clone the repo:
-```bash
-git clone https://github.com/YOUR_USERNAME/policybench.git
-cd policybench
-```
+## Models tested
+
+- Claude (Opus 4.6, Sonnet 4.5)
+- GPT (4o, o3)
+- Gemini 2.5 Pro
+
+## Programs evaluated
+
+Federal tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state income tax, net income, marginal tax rates, and more.
+
+## Quick start
+
+```bash
+pip install -e ".[dev]"
+pytest  # Run tests (mocked, no API calls)
+```
+
+## Full benchmark
+
+```bash
+# Generate ground truth from PolicyEngine-US
+policybench ground-truth
+
+# Run AI-alone evaluations
+policybench eval-no-tools
+
+# Run AI-with-tools evaluations
+policybench eval-with-tools
+
+# Analyze results
+policybench analyze
+```
````

RESULTS.md

Lines changed: 71 additions & 0 deletions
```diff
@@ -0,0 +1,71 @@
+# PolicyBench: AI can't accurately calculate taxes and benefits — but tools fix that
+
+> Can frontier AI models accurately calculate US tax and benefit outcomes?
+
+**TL;DR: No — but with PolicyEngine tools, they achieve 100% accuracy.**
+
+## Setup
+
+- **100 household scenarios** across 12 US states, varying income ($0–$500k), filing status, and family composition
+- **14 tax/benefit programs**: federal income tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state taxes, and more
+- **4 frontier models**: GPT-5.2, Claude Sonnet 4.5, Claude Sonnet 4.6, Claude Opus 4.6
+- **2 conditions**: AI alone (parametric knowledge only) vs. AI with PolicyEngine tools
+- **Ground truth**: PolicyEngine-US microsimulation (1,400 scenario-program pairs)
+- **Total predictions**: 9,800 (5,600 no-tools + 4,200 with-tools)
+
+## Headline results
+
+### Without tools (AI alone)
+
+| Model | MAE | MAPE | Within 10% |
+|:------|----:|-----:|----------:|
+| Claude Sonnet 4.6 | $1,285 | 52% | 72.3% |
+| Claude Opus 4.6 | $1,257 | 85% | 70.8% |
+| GPT-5.2 | $2,578 | 78% | 62.1% |
+| Claude Sonnet 4.5 | $2,276 | 125% | 61.9% |
+
+### With PolicyEngine tools
+
+| Model | MAE | MAPE | Within 10% |
+|:------|----:|-----:|----------:|
+| Claude Opus 4.6 | **$0** | **0%** | **100.0%** |
+| Claude Sonnet 4.5 | **$0** | **0%** | **100.0%** |
+| GPT-5.2 | **$0** | **0%** | **100.0%** |
+
+### By program (AI alone, all models averaged)
+
+| Program | MAE | MAPE | Within 10% |
+|:--------|----:|-----:|----------:|
+| Federal income tax | $4,234 | 54% | 41.0% |
+| Income tax before credits | $2,683 | 39% | 62.7% |
+| EITC | $727 | 298% | 75.3% |
+| CTC | $1,028 | 174% | 74.3% |
+| Refundable credits | $981 | 128% | 62.3% |
+| SNAP | $769 | 55% | 80.7% |
+| SSI | $436 | 100% | 95.7% |
+| State income tax | $938 | 76% | 59.7% |
+| Household net income | $10,586 | 14% | 66.0% |
+| Total benefits | $5,228 | 117% | 43.7% |
+| Market income | $0 | 0% | 100.0% |
+| Marginal tax rate | $347 | N/A | 18.0% |
+
+## Key takeaways
+
+1. **Tools > models.** Every model with PolicyEngine (100% accuracy) vastly outperforms every model without it (62–72%). The choice of computational tool matters more than the choice of frontier model.
+
+2. **AI alone is unreliable for policy calculations.** Even the best model (Claude Sonnet 4.6) averages $1,285 error per calculation and gets only 72% of answers within 10% of correct. The worst programs — income tax (41%), marginal tax rates (18%), and aggregate benefits (44%) — are precisely where accuracy matters most.
+
+3. **With tools, accuracy is perfect.** All three tested models achieve $0 MAE and 100% within-10% accuracy across all 4,200 with-tools predictions. The tool returns ground truth, and models faithfully report it.
+
+4. **Newer models are improving, but not enough.** Claude Sonnet 4.6 improved significantly over 4.5 (72% vs 62% within 10%), but still falls far short of the 100% achievable with tools. Model improvements can't substitute for computational tools.
+
+5. **Marginal tax rates are nearly impossible without tools.** Only 18% of AI-alone predictions are within 10% of the correct marginal rate. This makes AI-generated policy advice about work incentives unreliable without computational backing.
+
+6. **The benchmark validates PolicyEngine's value proposition.** Any AI system that needs to answer questions about US taxes and benefits should use PolicyEngine rather than relying on parametric knowledge.
+
+## Methodology
+
+See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). All API responses are cached for reproducibility.
+
+---
+*[Cosilico](https://cosilico.ai) · [PolicyEngine](https://policyengine.org)*
```
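As a sanity check on the metric definitions used in the tables above, here is a minimal pure-Python sketch of MAE, MAPE, and within-10% accuracy. This is an illustration, not the repo's `analysis.py`; in particular, the handling of zero-valued ground truth (excluded from MAPE, exact-match for the within-10% rate) is an assumption:

```python
def metrics(preds, truths, tol=0.10):
    """Return (MAE, MAPE %, within-`tol` %) for paired predictions/truths."""
    errs = [abs(p - t) for p, t in zip(preds, truths)]
    mae = sum(errs) / len(errs)
    # MAPE is undefined at zero truth, so skip those pairs (assumption).
    pct = [abs(p - t) / abs(t) for p, t in zip(preds, truths) if t != 0]
    mape = sum(pct) / len(pct) * 100 if pct else 0.0
    # A prediction against a zero truth counts as "within 10%" only if exact.
    within = sum(
        (p == t) if t == 0 else (abs(p - t) / abs(t) <= tol)
        for p, t in zip(preds, truths)
    ) / len(preds) * 100
    return mae, mape, within

# A perfect tool-using model reproduces ground truth exactly:
assert metrics([500.0, 0.0], [500.0, 0.0]) == (0.0, 0.0, 100.0)
```

Under these definitions, a model that returns exactly what the PolicyEngine tool computes scores $0 MAE, 0% MAPE, and 100% within 10%, matching the with-tools table.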

app/.gitignore

Lines changed: 24 additions & 0 deletions
```diff
@@ -0,0 +1,24 @@
+# Logs
+logs
+*.log
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+pnpm-debug.log*
+lerna-debug.log*
+
+node_modules
+dist
+dist-ssr
+*.local
+
+# Editor directories and files
+.vscode/*
+!.vscode/extensions.json
+.idea
+.DS_Store
+*.suo
+*.ntvs*
+*.njsproj
+*.sln
+*.sw?
```
