
Commit 6cd9bce

Migrate dashboard to Next.js and refresh benchmark app

1 parent: dc3069f
123 files changed: 54,614 additions & 100,199 deletions


CLAUDE.md

Lines changed: 3 additions & 4 deletions
````diff
@@ -9,7 +9,7 @@ ruff format . # Format
 ```
 
 ## Architecture
-- **Two conditions**: AI alone (EDSL) vs AI with PE tools (LiteLLM)
+- **One condition**: AI alone (no tools)
 - **Ground truth**: policyengine-us Simulation
 - **TDD**: Write tests first, then implement
 
@@ -18,12 +18,11 @@ ruff format . # Format
 - `policybench/scenarios.py` — Household scenario generation
 - `policybench/ground_truth.py` — PE-US calculations
 - `policybench/prompts.py` — Natural language prompt templates
-- `policybench/eval_no_tools.py` — EDSL-based AI-alone benchmark
-- `policybench/eval_with_tools.py` — LiteLLM tool-calling benchmark
+- `policybench/eval_no_tools.py` — LiteLLM-based AI-alone benchmark
 - `policybench/analysis.py` — Metrics and reporting
 
 ## Testing
-- All tests mock external calls (EDSL, LiteLLM, PE-US API)
+- All tests mock external calls (LiteLLM, PE-US API)
 - `pytest -m "not slow"` to skip slow tests
 - Full benchmark runs are manual and expensive
 
````
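The testing note in CLAUDE.md says all tests mock external calls (LiteLLM, PE-US API). As a minimal illustrative sketch of that pattern with `unittest.mock` — the helper name `run_single_eval` and the response shape are assumptions for illustration, not code from this commit:

```python
from unittest.mock import MagicMock

def run_single_eval(prompt: str, completion_fn) -> str:
    # completion_fn stands in for a LiteLLM-style completion call;
    # injecting it keeps the test free of real API traffic.
    response = completion_fn(
        model="claude-opus-4-6",  # illustrative model id
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def test_eval_is_mocked():
    # Build a fake response object with the chat-completion shape.
    fake_response = MagicMock()
    fake_response.choices[0].message.content = '{"eitc": 3200}'
    mock_completion = MagicMock(return_value=fake_response)

    out = run_single_eval("Estimate EITC for ...", mock_completion)

    assert out == '{"eitc": 3200}'   # output came from the mock, not an API
    mock_completion.assert_called_once()

test_eval_is_mocked()
```

Injecting the completion function (rather than patching a module global) is one way to keep such tests hermetic; the repo may instead use `unittest.mock.patch` on the real client.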

README.md

Lines changed: 26 additions & 17 deletions
````diff
@@ -1,23 +1,25 @@
 # PolicyBench
 
-Can AI models accurately calculate tax and benefit outcomes without tools?
+How well can frontier models calculate tax and benefit outcomes without tools?
 
-PolicyBench measures how well frontier AI models estimate US tax/benefit values for specific households — both **without tools** (pure reasoning) and **with PolicyEngine tools** (API access to ground truth).
+PolicyBench measures how well frontier AI models estimate US tax/benefit values for specific households using pure reasoning alone.
 
-## Conditions
+Benchmark scenarios are sampled from real households in the Enhanced CPS and then evaluated under 2025 policy rules with PolicyEngine-US. That gives the benchmark more realistic joint distributions of age, income, filing status, and family composition than independent synthetic sampling.
+
+## Condition
 
 1. **AI alone**: Models estimate tax/benefit values using only their training knowledge
-2. **AI with PolicyEngine**: Models use a PolicyEngine tool to compute exact answers
 
 ## Models tested
 
-- Claude (Opus 4.6, Sonnet 4.5)
-- GPT (4o, o3)
-- Gemini 2.5 Pro
+- Claude Opus 4.6
+- Claude Sonnet 4.6
+- GPT-5.4
+- Gemini 3.1 Pro Preview
 
 ## Programs evaluated
 
-Federal tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state income tax, net income, marginal tax rates, and more.
+Federal tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state income tax, and related core household policy outputs.
 
 ## Quick start
 
@@ -26,18 +28,25 @@ pip install -e ".[dev]"
 pytest # Run tests (mocked, no API calls)
 ```
 
-## Full benchmark
+## Benchmark run
 
 ```bash
-# Generate ground truth from PolicyEngine-US
-policybench ground-truth
+# Generate ground truth for 100 sampled CPS households
+policybench ground-truth -n 100 --seed 42
+
+# Run AI-alone evaluations on the same sampled households
+policybench eval-no-tools -n 100 --seed 42
 
-# Run AI-alone evaluations
-policybench eval-no-tools
+# Analyze results and export production artifacts
+policybench analyze --output-dir results/analysis
+```
 
-# Run AI-with-tools evaluations
-policybench eval-with-tools
+## Repeated runs
+
+```bash
+# Optional: run the same benchmark multiple times on the same sampled households
+policybench eval-no-tools-repeated -n 100 --seed 42 --repeats 3 -o results/no_tools/runs
 
-# Analyze results
-policybench analyze
+# Analyze the canonical point estimate plus across-run stability
+policybench analyze --runs-dir results/no_tools/runs --output-dir results/analysis
 ```
````
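The new README commands pass the same `-n 100 --seed 42` to both `ground-truth` and `eval-no-tools`, so both steps operate on the identical household sample. A hedged sketch of why a fixed seed guarantees that (the sampler below is illustrative, not the repo's code):

```python
import random

def sample_household_ids(population_size: int, n: int, seed: int) -> list[int]:
    # A dedicated Random instance avoids interference from global RNG state.
    rng = random.Random(seed)
    return rng.sample(range(population_size), n)

# Same seed and population -> identical sample, so eval-no-tools
# predictions join 1:1 against the ground-truth rows.
first = sample_household_ids(10_000, 100, seed=42)
second = sample_household_ids(10_000, 100, seed=42)
assert first == second and len(first) == 100
```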

RESULTS.md

Lines changed: 16 additions & 60 deletions
````diff
@@ -1,71 +1,27 @@
-# PolicyBench: AI can't accurately calculate taxes and benefits — but tools fix that
+# PolicyBench Results
 
-> Can frontier AI models accurately calculate US tax and benefit outcomes?
+PolicyBench is a no-tools benchmark. Generated outputs live in `results/analysis/` after a benchmark run.
 
-**TL;DR: No — but with PolicyEngine tools, they achieve 100% accuracy.**
+## Run
 
-## Setup
+```bash
+policybench ground-truth -n 100 --seed 42
+policybench eval-no-tools -n 100 --seed 42
+policybench analyze --output-dir results/analysis
+```
 
-- **100 household scenarios** across 12 US states, varying income ($0–$500k), filing status, and family composition
-- **14 tax/benefit programs**: federal income tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state taxes, and more
-- **4 frontier models**: GPT-5.2, Claude Sonnet 4.5, Claude Sonnet 4.6, Claude Opus 4.6
-- **2 conditions**: AI alone (parametric knowledge only) vs. AI with PolicyEngine tools
-- **Ground truth**: PolicyEngine-US microsimulation (1,400 scenario-program pairs)
-- **Total predictions**: 9,800 (5,600 no-tools + 4,200 with-tools)
+## Artifacts
 
-## Headline results
-
-### Without tools (AI alone)
-
-| Model | MAE | MAPE | Within 10% |
-|:------|----:|-----:|----------:|
-| Claude Sonnet 4.6 | $1,285 | 52% | 72.3% |
-| Claude Opus 4.6 | $1,257 | 85% | 70.8% |
-| GPT-5.2 | $2,578 | 78% | 62.1% |
-| Claude Sonnet 4.5 | $2,276 | 125% | 61.9% |
-
-### With PolicyEngine tools
-
-| Model | MAE | MAPE | Within 10% |
-|:------|----:|-----:|----------:|
-| Claude Opus 4.6 | **$0** | **0%** | **100.0%** |
-| Claude Sonnet 4.5 | **$0** | **0%** | **100.0%** |
-| GPT-5.2 | **$0** | **0%** | **100.0%** |
-
-### By program (AI alone, all models averaged)
-
-| Program | MAE | MAPE | Within 10% |
-|:--------|----:|-----:|----------:|
-| Federal income tax | $4,234 | 54% | 41.0% |
-| Income tax before credits | $2,683 | 39% | 62.7% |
-| EITC | $727 | 298% | 75.3% |
-| CTC | $1,028 | 174% | 74.3% |
-| Refundable credits | $981 | 128% | 62.3% |
-| SNAP | $769 | 55% | 80.7% |
-| SSI | $436 | 100% | 95.7% |
-| State income tax | $938 | 76% | 59.7% |
-| Household net income | $10,586 | 14% | 66.0% |
-| Total benefits | $5,228 | 117% | 43.7% |
-| Market income | $0 | 0% | 100.0% |
-| Marginal tax rate | $347 | N/A | 18.0% |
-
-## Key takeaways
-
-1. **Tools > models.** Every model with PolicyEngine (100% accuracy) vastly outperforms every model without it (62–72%). The choice of computational tool matters more than the choice of frontier model.
-
-2. **AI alone is unreliable for policy calculations.** Even the best model (Claude Sonnet 4.6) averages $1,285 error per calculation and gets only 72% of answers within 10% of correct. The worst programs — income tax (41%), marginal tax rates (18%), and aggregate benefits (44%) — are precisely where accuracy matters most.
-
-3. **With tools, accuracy is perfect.** All three tested models achieve $0 MAE and 100% within-10% accuracy across all 4,200 with-tools predictions. The tool returns ground truth, and models faithfully report it.
-
-4. **Newer models are improving, but not enough.** Claude Sonnet 4.6 improved significantly over 4.5 (72% vs 62% within 10%), but still falls far short of the 100% achievable with tools. Model improvements can't substitute for computational tools.
-
-5. **Marginal tax rates are nearly impossible without tools.** Only 18% of AI-alone predictions are within 10% of the correct marginal rate. This makes AI-generated policy advice about work incentives unreliable without computational backing.
-
-6. **The benchmark validates PolicyEngine's value proposition.** Any AI system that needs to answer questions about US taxes and benefits should use PolicyEngine rather than relying on parametric knowledge.
+- `results/ground_truth.csv`
+- `results/no_tools/predictions.csv`
+- `results/analysis/metrics.csv`
+- `results/analysis/summary_by_model.csv`
+- `results/analysis/summary_by_variable.csv`
+- `results/analysis/report.md`
 
 ## Methodology
 
-See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). All API responses are cached for reproducibility.
+See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). LLM responses are cached for reproducibility.
 
 ---
 *[Cosilico](https://cosilico.ai) · [PolicyEngine](https://policyengine.org)*
````
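The removed results tables reported MAE, MAPE, and the share of predictions within 10% of ground truth, and the new workflow exports these to `results/analysis/metrics.csv`. As a hedged sketch assuming the usual definitions of those columns (the repo's `policybench/analysis.py` may handle edge cases differently):

```python
def accuracy_metrics(preds: list[float], truths: list[float]) -> dict:
    """Compute MAE, MAPE, and within-10% share for one slice of predictions.

    Illustrative only: definitions assumed from the old RESULTS tables,
    not taken from the benchmark's analysis code.
    """
    pairs = list(zip(preds, truths))
    mae = sum(abs(p - t) for p, t in pairs) / len(pairs)
    # MAPE only over nonzero ground truth, to avoid dividing by zero.
    nonzero = [(p, t) for p, t in pairs if t != 0]
    mape = (100 * sum(abs(p - t) / abs(t) for p, t in nonzero) / len(nonzero)
            if nonzero else float("nan"))
    # A zero ground-truth value counts as "within 10%" only when exact.
    within10 = 100 * sum(abs(p - t) <= 0.10 * abs(t) for p, t in pairs) / len(pairs)
    return {"mae": mae, "mape_pct": mape, "within_10_pct": within10}

# MAE is 100.0 here and all three predictions land within 10%.
print(accuracy_metrics([1100.0, 0.0, 5000.0], [1000.0, 0.0, 5200.0]))
```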

app/.gitignore

Lines changed: 4 additions & 0 deletions
````diff
@@ -10,6 +10,8 @@ lerna-debug.log*
 node_modules
 dist
 dist-ssr
+.next
+out
 *.local
 
 # Editor directories and files
@@ -22,3 +24,5 @@ dist-ssr
 *.njsproj
 *.sln
 *.sw?
+.vercel
+.env*.local
````
