-# PolicyBench: AI can't accurately calculate taxes and benefits — but tools fix that
+# PolicyBench Results
 
-> Can frontier AI models accurately calculate US tax and benefit outcomes?
+PolicyBench is a no-tools benchmark. Generated outputs live in `results/analysis/` after a benchmark run.
 
-**TL;DR: No — but with PolicyEngine tools, they achieve 100% accuracy.**
+## Run
 
-## Setup
+```bash
+policybench ground-truth -n 100 --seed 42
+policybench eval-no-tools -n 100 --seed 42
+policybench analyze --output-dir results/analysis
+```
 
-- **100 household scenarios** across 12 US states, varying income ($0–$500k), filing status, and family composition
-- **14 tax/benefit programs**: federal income tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state taxes, and more
-- **4 frontier models**: GPT-5.2, Claude Sonnet 4.5, Claude Sonnet 4.6, Claude Opus 4.6
-- **2 conditions**: AI alone (parametric knowledge only) vs. AI with PolicyEngine tools
-- **Ground truth**: PolicyEngine-US microsimulation (1,400 scenario-program pairs)
-- **Total predictions**: 9,800 (5,600 no-tools + 4,200 with-tools)
+## Artifacts
 
-## Headline results
-
-### Without tools (AI alone)
-
-| Model | MAE | MAPE | Within 10% |
-|:------|----:|-----:|----------:|
-| Claude Sonnet 4.6 | $1,285 | 52% | 72.3% |
-| Claude Opus 4.6 | $1,257 | 85% | 70.8% |
-| GPT-5.2 | $2,578 | 78% | 62.1% |
-| Claude Sonnet 4.5 | $2,276 | 125% | 61.9% |
-
-### With PolicyEngine tools
-
-| Model | MAE | MAPE | Within 10% |
-|:------|----:|-----:|----------:|
-| Claude Opus 4.6 | **$0** | **0%** | **100.0%** |
-| Claude Sonnet 4.5 | **$0** | **0%** | **100.0%** |
-| GPT-5.2 | **$0** | **0%** | **100.0%** |
-
-### By program (AI alone, all models averaged)
-
-| Program | MAE | MAPE | Within 10% |
-|:--------|----:|-----:|----------:|
-| Federal income tax | $4,234 | 54% | 41.0% |
-| Income tax before credits | $2,683 | 39% | 62.7% |
-| EITC | $727 | 298% | 75.3% |
-| CTC | $1,028 | 174% | 74.3% |
-| Refundable credits | $981 | 128% | 62.3% |
-| SNAP | $769 | 55% | 80.7% |
-| SSI | $436 | 100% | 95.7% |
-| State income tax | $938 | 76% | 59.7% |
-| Household net income | $10,586 | 14% | 66.0% |
-| Total benefits | $5,228 | 117% | 43.7% |
-| Market income | $0 | 0% | 100.0% |
-| Marginal tax rate | $347 | N/A | 18.0% |
-
-## Key takeaways
-
-1. **Tools > models.** Every model with PolicyEngine (100% accuracy) vastly outperforms every model without it (62–72%). The choice of computational tool matters more than the choice of frontier model.
-
-2. **AI alone is unreliable for policy calculations.** Even the best model (Claude Sonnet 4.6) averages $1,285 error per calculation and gets only 72% of answers within 10% of correct. The worst programs — income tax (41%), marginal tax rates (18%), and aggregate benefits (44%) — are precisely where accuracy matters most.
-
-3. **With tools, accuracy is perfect.** All three tested models achieve $0 MAE and 100% within-10% accuracy across all 4,200 with-tools predictions. The tool returns ground truth, and models faithfully report it.
-
-4. **Newer models are improving, but not enough.** Claude Sonnet 4.6 improved significantly over 4.5 (72% vs 62% within 10%), but still falls far short of the 100% achievable with tools. Model improvements can't substitute for computational tools.
-
-5. **Marginal tax rates are nearly impossible without tools.** Only 18% of AI-alone predictions are within 10% of the correct marginal rate. This makes AI-generated policy advice about work incentives unreliable without computational backing.
-
-6. **The benchmark validates PolicyEngine's value proposition.** Any AI system that needs to answer questions about US taxes and benefits should use PolicyEngine rather than relying on parametric knowledge.
+- `results/ground_truth.csv`
+- `results/no_tools/predictions.csv`
+- `results/analysis/metrics.csv`
+- `results/analysis/summary_by_model.csv`
+- `results/analysis/summary_by_variable.csv`
+- `results/analysis/report.md`
 
 ## Methodology
 
-See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). All API responses are cached for reproducibility.
+See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). LLM responses are cached for reproducibility.
 
 ---
 *[Cosilico](https://cosilico.ai) · [PolicyEngine](https://policyengine.org)*
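The per-model summary added under `results/analysis/` is a plain CSV, so it can be inspected with the standard library alone. A minimal sketch of ranking models by mean absolute error; the column names `model`, `mae`, and `within_10_pct` are assumptions (check the real header of `summary_by_model.csv`), and the sample values are taken from the results table removed in this commit:

```python
import csv
import io

# Illustrative excerpt standing in for results/analysis/summary_by_model.csv;
# the actual file and its column names may differ.
sample = io.StringIO(
    "model,mae,within_10_pct\n"
    "claude-sonnet-4.6,1285,72.3\n"
    "claude-opus-4.6,1257,70.8\n"
    "gpt-5.2,2578,62.1\n"
)

# Rank models by mean absolute error, best first.
rows = sorted(csv.DictReader(sample), key=lambda r: float(r["mae"]))
for r in rows:
    print(f"{r['model']}: MAE ${float(r['mae']):,.0f}, {r['within_10_pct']}% within 10%")
```

For the real file, replace the `io.StringIO` buffer with `open("results/analysis/summary_by_model.csv")`.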
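The response caching mentioned in the methodology implies a deterministic key per LLM call, so that rerunning with the same model, prompt, and seed hits the cache instead of the API. A sketch of one way to build such a key; this is an illustration, not the benchmark's actual cache implementation:

```python
import hashlib
import json

def cache_key(model: str, prompt: str, seed: int) -> str:
    """Deterministic cache key for one LLM call (illustrative sketch;
    the benchmark's real cache layout may differ)."""
    # sort_keys makes the serialization stable across runs.
    payload = json.dumps(
        {"model": model, "prompt": prompt, "seed": seed},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = cache_key("gpt-5.2", "Compute the EITC for a single filer earning $20,000.", 42)
k2 = cache_key("gpt-5.2", "Compute the EITC for a single filer earning $20,000.", 42)
assert k1 == k2  # identical inputs produce identical keys, so a rerun reuses the cached response
```

The key can then name a file such as `cache/<key>.json` holding the raw response (a hypothetical layout).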