
Commit 476e462 · MaxGhenis and claude committed

Update RESULTS.md with final complete benchmark results

All 8,400 predictions (3 models x 100 scenarios x 14 programs x 2 conditions) now complete. All three models achieve 100% accuracy with PolicyEngine tools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent b355903 · 1 file changed (RESULTS.md): 35 additions & 32 deletions
# PolicyBench: AI can't accurately calculate taxes and benefits — but tools fix that

> Can frontier AI models accurately calculate US tax and benefit outcomes?

**TL;DR: No — but with PolicyEngine tools, they achieve 100% accuracy.**

## Setup

- **100 household scenarios** across 12 US states, varying income ($0–$500k), filing status, and family composition
- **14 tax/benefit programs**: federal income tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state taxes, and more
- **3 frontier models**: GPT-5.2, Claude Sonnet 4.5, Claude Opus 4.6
- **2 conditions**: AI alone (parametric knowledge only) vs. AI with PolicyEngine tools
- **Ground truth**: PolicyEngine-US microsimulation (1,400 scenario-program pairs)
- **Total predictions**: 8,400 (4,200 per condition)
## Headline results

### Without tools (AI alone)

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|-----------:|
| Claude Opus 4.6 | $1,257 | 85% | 70.8% |
| Claude Sonnet 4.5 | $2,276 | 125% | 61.9% |
| GPT-5.2 | $2,578 | 78% | 62.1% |
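The three headline metrics are standard error statistics. A minimal sketch of how they might be computed from paired predictions and ground-truth values (pure Python; the handling of $0 ground truths is an assumption, since the published tables don't say how zeros enter MAPE or the within-10% rate):

```python
def score(predictions, truths, tolerance=0.10):
    """Compute MAE, MAPE, and within-tolerance share for paired
    prediction/ground-truth dollar values."""
    pairs = list(zip(predictions, truths))
    # Mean absolute error, in dollars.
    mae = sum(abs(p - t) for p, t in pairs) / len(pairs)
    # MAPE is only defined where ground truth is nonzero.
    nonzero = [(p, t) for p, t in pairs if t != 0]
    mape = sum(abs(p - t) / abs(t) for p, t in nonzero) / len(nonzero)
    # "Within 10%": relative error at most `tolerance`;
    # a $0 ground truth counts only on an exact match (assumption).
    within = sum(
        1 for p, t in pairs
        if (p == t if t == 0 else abs(p - t) / abs(t) <= tolerance)
    ) / len(pairs)
    return mae, mape, within
```

Under this convention a model that echoes the tool's output exactly scores $0 MAE, 0% MAPE, and 100% within 10%, which is what the with-tools tables show.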

### With PolicyEngine tools

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|-----------:|
| Claude Opus 4.6 | **$0** | **0%** | **100.0%** |
| Claude Sonnet 4.5 | **$0** | **0%** | **100.0%** |
| GPT-5.2 | **$0** | **0%** | **100.0%** |

### By program (AI alone, all models averaged)

| Program | MAE | MAPE | Within 10% |
|:--------|----:|-----:|-----------:|
| Federal income tax | $4,234 | 54% | 41.0% |
| Income tax before credits | $2,683 | 39% | 62.7% |
| EITC | $727 | 298% | 75.3% |
| CTC | $1,028 | 174% | 74.3% |
| Refundable credits | $981 | 128% | 62.3% |
| SNAP | $769 | 55% | 80.7% |
| SSI | $436 | 100% | 95.7% |
| State income tax | $938 | 76% | 59.7% |
| Household net income | $10,586 | 14% | 66.0% |
| Total benefits | $5,228 | 117% | 43.7% |
| Market income | $0 | 0% | 100.0% |
| Marginal tax rate | $347 | N/A | 18.0% |

## Key takeaways

1. **Tools > models.** Every model with PolicyEngine (100% accuracy) vastly outperforms every model without it (62–71% within 10%). The choice of computational tool matters more than the choice of frontier model.

2. **AI alone is unreliable for policy calculations.** Even the best model (Claude Opus 4.6) averages $1,257 of error per calculation and gets only 71% of answers within 10% of correct. The worst programs — income tax (41%), marginal tax rates (18%), and aggregate benefits (44%) — are precisely where accuracy matters most.

3. **With tools, accuracy is perfect.** All three frontier models achieve $0 MAE and 100% within-10% accuracy across all 4,200 with-tools predictions. The tool returns ground truth, and the models faithfully report it.

4. **Marginal tax rates are nearly impossible without tools.** Only 18% of AI-alone predictions fall within 10% of the correct marginal rate. This makes AI-generated policy advice about work incentives unreliable without computational backing.

5. **The benchmark validates PolicyEngine's value proposition.** Any AI system that needs to answer questions about US taxes and benefits should use PolicyEngine rather than relying on parametric knowledge.
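The marginal-rate result is worth unpacking: a marginal tax rate is a finite difference of net income, so a model has to get two complete tax-and-benefit calculations consistent with each other, not just one. A toy sketch of the calculation (the tax schedule and benefit phase-out here are invented for illustration, not any real program):

```python
def marginal_tax_rate(net_income, earnings, delta=1.0):
    """Share of an extra dollar of earnings lost to taxes and
    benefit phase-outs: 1 - d(net)/d(earnings)."""
    return 1 - (net_income(earnings + delta) - net_income(earnings)) / delta


def toy_net(earnings):
    """Invented example: 10% tax above $10,000, plus a $5,000
    benefit phasing out at 30 cents per dollar of earnings."""
    tax = 0.10 * max(earnings - 10_000, 0)
    benefit = max(5_000 - 0.30 * earnings, 0)
    return earnings - tax + benefit


# Inside the phase-out range, the tax rate and the phase-out stack:
print(marginal_tax_rate(toy_net, 12_000))  # ≈ 0.40 (10% tax + 30% phase-out)
```

Overlapping phase-outs like this are exactly what parametric knowledge tends to miss, which is consistent with the 18% within-10% figure above.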
## Methodology

See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). All API responses are cached for reproducibility.
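Caching the model responses is what makes an 8,400-prediction run replayable. A minimal sketch of how such a cache might key predictions (the class and key fields are illustrative assumptions, not the benchmark's actual code):

```python
import hashlib
import json


class PredictionCache:
    """Memoize predictions keyed by (model, scenario, program, condition)
    so re-runs replay cached responses instead of making new API calls."""

    def __init__(self):
        self._store = {}  # in practice, backed by files on disk

    @staticmethod
    def key(model, scenario_id, program, condition):
        # Canonical JSON so the same inputs always hash identically.
        payload = json.dumps(
            {"model": model, "scenario": scenario_id,
             "program": program, "condition": condition},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_call(self, model, scenario_id, program, condition, fn):
        k = self.key(model, scenario_id, program, condition)
        if k not in self._store:
            self._store[k] = fn()  # only hit the API on a cache miss
        return self._store[k]
```

With every response cached, the reported tables can be regenerated without re-querying any model.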

---
*[Cosilico](https://cosilico.ai) · [PolicyEngine](https://policyengine.org)*
