# PolicyBench: AI can't accurately calculate taxes and benefits — but tools fix that

> Can frontier AI models accurately calculate US tax and benefit outcomes?

**TL;DR: No — but with PolicyEngine tools, they achieve 100% accuracy.**

## Setup

- **100 household scenarios** across 12 US states, varying income ($0–$500k), filing status, and family composition
- **14 tax/benefit programs**: federal income tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state taxes, and more
- **3 frontier models**: GPT-5.2, Claude Sonnet 4.5, Claude Opus 4.6
- **2 conditions**: AI alone (parametric knowledge only) vs. AI with PolicyEngine tools
- **Ground truth**: PolicyEngine-US microsimulation (1,400 scenario-program pairs)
- **Total predictions**: 8,400 (4,200 per condition)
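The counts above follow from a simple cross product over the benchmark's dimensions. A minimal sketch of the arithmetic (identifier names are illustrative, not taken from the benchmark code):

```python
from itertools import product

# Benchmark dimensions as stated in the setup (names illustrative).
N_SCENARIOS = 100  # household scenarios
N_PROGRAMS = 14    # tax/benefit programs evaluated per scenario
MODELS = ["gpt-5.2", "claude-sonnet-4.5", "claude-opus-4.6"]
CONDITIONS = ["no_tools", "with_tools"]

# One ground-truth value per scenario-program pair...
ground_truth_values = N_SCENARIOS * N_PROGRAMS  # 1,400

# ...and one model prediction per (scenario, program, model, condition) cell.
predictions = list(product(range(N_SCENARIOS), range(N_PROGRAMS), MODELS, CONDITIONS))

print(ground_truth_values)                    # 1400
print(len(predictions))                       # 8400
print(len(predictions) // len(CONDITIONS))    # 4200 per condition
```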

## Headline results

### Without tools (AI alone)

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|----------:|
| Claude Opus 4.6 | $1,257 | 85% | 70.8% |
| Claude Sonnet 4.5 | $2,276 | 125% | 61.9% |
| GPT-5.2 | $2,578 | 78% | 62.1% |
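The metrics in these tables are standard: MAE is mean absolute error in dollars, MAPE is mean absolute percentage error over nonzero ground-truth values, and "Within 10%" is the share of predictions within 10% of ground truth. A minimal sketch of how such scores can be computed (the function and variable names are illustrative, not the benchmark's actual code):

```python
def score(predictions, truths):
    """Compute (MAE, MAPE %, within-10% share %) for paired value lists."""
    n = len(predictions)
    mae = sum(abs(p - t) for p, t in zip(predictions, truths)) / n
    # Percentage error is undefined when ground truth is $0, so MAPE
    # is restricted to pairs with nonzero truth.
    nonzero = [(p, t) for p, t in zip(predictions, truths) if t != 0]
    mape = 100 * sum(abs(p - t) / abs(t) for p, t in nonzero) / len(nonzero)
    # A $0 prediction of a $0 truth counts as exactly correct.
    within = sum(
        (t == 0 and p == 0) or (t != 0 and abs(p - t) / abs(t) <= 0.10)
        for p, t in zip(predictions, truths)
    ) / n
    return mae, mape, 100 * within

# Toy example: two predictions off by exactly 10%, one exact zero.
mae, mape, within = score([1100.0, 0.0, 4500.0], [1000.0, 0.0, 5000.0])
# mae = $200, mape = 10%, within = 100%
```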

### With PolicyEngine tools

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|----------:|
| Claude Opus 4.6 | **$0** | **0%** | **100.0%** |
| Claude Sonnet 4.5 | **$0** | **0%** | **100.0%** |
| GPT-5.2 | **$0** | **0%** | **100.0%** |

### By program (AI alone, all models averaged)

| Program | MAE | MAPE | Within 10% |
|:--------|----:|-----:|----------:|
| Federal income tax | $4,234 | 54% | 41.0% |
| Income tax before credits | $2,683 | 39% | 62.7% |
| EITC | $727 | 298% | 75.3% |
| CTC | $1,028 | 174% | 74.3% |
| Refundable credits | $981 | 128% | 62.3% |
| SNAP | $769 | 55% | 80.7% |
| SSI | $436 | 100% | 95.7% |
| State income tax | $938 | 76% | 59.7% |
| Household net income | $10,586 | 14% | 66.0% |
| Total benefits | $5,228 | 117% | 43.7% |
| Market income | $0 | 0% | 100.0% |
| Marginal tax rate | $347 | N/A | 18.0% |

## Key takeaways

1. **Tools > models.** Every model with PolicyEngine (100% accuracy) vastly outperforms every model without it (62–71%). The choice of computational tool matters more than the choice of frontier model.

2. **AI alone is unreliable for policy calculations.** Even the best model (Claude Opus 4.6) averages $1,257 error per calculation and gets only 71% of answers within 10% of correct. The worst programs — income tax (41%), marginal tax rates (18%), and aggregate benefits (44%) — are precisely where accuracy matters most.

3. **With tools, accuracy is perfect.** All three frontier models achieve $0 MAE and 100% within-10% accuracy across all 4,200 with-tools predictions. Since the ground truth is itself computed with PolicyEngine, this condition measures whether models invoke the tool correctly and report its output faithfully; all three do.

4. **Marginal tax rates are nearly impossible without tools.** Only 18% of AI-alone predictions fall within 10% of the correct marginal rate — the worst accuracy of any quantity in the benchmark. This makes AI-generated policy advice about work incentives unreliable without computational backing.

5. **The benchmark validates PolicyEngine's value proposition.** Any AI system that needs to answer questions about US taxes and benefits should use PolicyEngine rather than relying on parametric knowledge.

## Methodology

See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). All API responses are cached for reproducibility.
63 | 66 |
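One common way to make cached API responses reproducible is to key the cache on a canonical serialization of the full request, so an identical request always hits the same cached response. A minimal stdlib sketch of that idea (the benchmark's actual cache layout may differ; all names here are illustrative):

```python
import hashlib
import json

def cache_key(model: str, scenario_id: int, condition: str, prompt: str) -> str:
    """Deterministic key: the same request always maps to the same entry."""
    payload = json.dumps(
        {"model": model, "scenario": scenario_id,
         "condition": condition, "prompt": prompt},
        sort_keys=True,  # canonical key order keeps the hash stable
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key = cache_key("gpt-5.2", 17, "no_tools", "Compute the household's 2025 EITC.")
```

Keying on the request rather than a timestamp is what makes reruns deterministic: changing any field (model, scenario, condition, or prompt) yields a different key, while repeating a run reuses the stored response.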
---
*[Cosilico](https://cosilico.ai) · [PolicyEngine](https://policyengine.org)*