29 changes: 29 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,29 @@
name: CI

on:
push:
branches: [main, v2]
pull_request:
branches: [main]

jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install ruff
- run: ruff check .
- run: ruff format --check .

test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install -e ".[dev]"
- run: pytest -m "not slow" --tb=short -q
3 changes: 3 additions & 0 deletions .gitignore
@@ -1,3 +1,6 @@
# LiteLLM disk cache
.policybench_cache/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
32 changes: 32 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,32 @@
# PolicyBench development

## Quick start
```bash
pip install -e ".[dev]"
pytest # Run tests (all external calls mocked)
ruff check . # Lint
ruff format . # Format
```

## Architecture
- **Two conditions**: AI alone (EDSL) vs AI with PE tools (LiteLLM)
- **Ground truth**: policyengine-us Simulation
- **TDD**: Write tests first, then implement

## Key files
- `policybench/config.py` — Models, programs, constants
- `policybench/scenarios.py` — Household scenario generation
- `policybench/ground_truth.py` — PE-US calculations
- `policybench/prompts.py` — Natural language prompt templates
- `policybench/eval_no_tools.py` — EDSL-based AI-alone benchmark
- `policybench/eval_with_tools.py` — LiteLLM tool-calling benchmark
- `policybench/analysis.py` — Metrics and reporting

## Testing
- All tests mock external calls (EDSL, LiteLLM, PE-US API)
- `pytest -m "not slow"` to skip slow tests
- Full benchmark runs are manual and expensive
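
A minimal sketch of the mocking pattern, assuming a hypothetical injectable `query_model` seam (the real test suite's module layout and names may differ):

```python
from unittest.mock import MagicMock


def evaluate_one(prompt: str, query_model) -> float:
    """Run one benchmark item through an injected model-call function."""
    # The real code would parse and validate the model's reply;
    # this sketch just coerces the raw string to a number.
    return float(query_model(prompt))


def test_no_network() -> None:
    # The model call is a MagicMock, so the test never touches an API.
    fake_model = MagicMock(return_value="4016.0")
    answer = evaluate_one("Federal income tax for a $50k single filer?", fake_model)
    assert answer == 4016.0
    fake_model.assert_called_once()
```

Injecting the model call as a parameter keeps the slow, expensive API paths out of the default `pytest` run.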

## Formatting
- Use `ruff format .` before committing
- Use `ruff check . --fix` for auto-fixable lint issues
18 changes: 0 additions & 18 deletions Makefile

This file was deleted.

49 changes: 37 additions & 12 deletions README.md
@@ -1,18 +1,43 @@
# PolicyBench

Can AI models accurately calculate tax and benefit outcomes without tools?

PolicyBench measures how well frontier AI models estimate US tax/benefit values for specific households — both **without tools** (pure reasoning) and **with PolicyEngine tools** (API access to ground truth).

## Conditions

1. **AI alone**: Models estimate tax/benefit values using only their training knowledge
2. **AI with PolicyEngine**: Models use a PolicyEngine tool to compute exact answers
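
In the second condition, PolicyEngine is wired into the model via function calling. A sketch of what the tool definition might look like in the OpenAI/LiteLLM function-calling format — the name and parameter schema here are illustrative, not the repo's actual definition:

```python
# Illustrative tool schema for the "AI with PolicyEngine" condition.
policyengine_tool = {
    "type": "function",
    "function": {
        "name": "calculate_household_variable",  # hypothetical name
        "description": (
            "Compute one tax/benefit variable for a household "
            "using the PolicyEngine-US microsimulation."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "variable": {
                    "type": "string",
                    "description": "PolicyEngine-US variable, e.g. 'eitc' or 'snap'.",
                },
                "household": {
                    "type": "object",
                    "description": "Household in PolicyEngine's situation format.",
                },
            },
            "required": ["variable", "household"],
        },
    },
}
```

The benchmark harness would pass a definition like this in the `tools` list of the LiteLLM completion call and execute the tool whenever the model requests it.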

## Models tested

- Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4.5
- GPT-5.2

## Programs evaluated

Federal tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state income tax, net income, marginal tax rates, and more.
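
Ground truth for each program comes from the PolicyEngine-US `Simulation` API. A minimal sketch for a single-adult household — the helper names are illustrative, and the repo's `ground_truth.py` may structure this differently:

```python
def build_situation(employment_income: int, state: str = "CA", year: str = "2024") -> dict:
    """Single-adult household in the PolicyEngine-US situation format."""
    return {
        "people": {
            "adult": {
                "age": {year: 35},
                "employment_income": {year: employment_income},
            }
        },
        "tax_units": {"tax_unit": {"members": ["adult"]}},
        "households": {
            "household": {"members": ["adult"], "state_name": {year: state}}
        },
    }


def ground_truth(variable: str, situation: dict, year: int = 2024) -> float:
    # Lazy import so the sketch stays importable without policyengine-us.
    from policyengine_us import Simulation

    sim = Simulation(situation=situation)
    return float(sim.calculate(variable, year).sum())
```

For example, `ground_truth("eitc", build_situation(20_000))` would return that household's 2024 EITC as computed by the microsimulation.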

## Quick start

```bash
pip install -e ".[dev]"
pytest # Run tests (mocked, no API calls)
```

## Full benchmark

```bash
# Generate ground truth from PolicyEngine-US
policybench ground-truth

# Run AI-alone evaluations
policybench eval-no-tools

# Run AI-with-tools evaluations
policybench eval-with-tools

# Analyze results
policybench analyze
```
71 changes: 71 additions & 0 deletions RESULTS.md
@@ -0,0 +1,71 @@
# PolicyBench: AI can't accurately calculate taxes and benefits — but tools fix that

> Can frontier AI models accurately calculate US tax and benefit outcomes?

**TL;DR: No — but with PolicyEngine tools, they achieve 100% accuracy.**

## Setup

- **100 household scenarios** across 12 US states, varying income ($0–$500k), filing status, and family composition
- **14 tax/benefit programs**: federal income tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state taxes, and more
- **4 frontier models**: GPT-5.2, Claude Sonnet 4.5, Claude Sonnet 4.6, Claude Opus 4.6
- **2 conditions**: AI alone (parametric knowledge only) vs. AI with PolicyEngine tools
- **Ground truth**: PolicyEngine-US microsimulation (1,400 scenario-program pairs)
- **Total predictions**: 9,800 (5,600 no-tools + 4,200 with-tools)
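
The prediction counts above are internally consistent; a quick arithmetic check (note that only three models completed the with-tools condition):

```python
scenarios, programs = 100, 14
pairs = scenarios * programs           # scenario-program pairs
no_tools = 4 * pairs                   # four models, AI alone
with_tools = 3 * pairs                 # three models, with PolicyEngine tools
assert pairs == 1_400
assert no_tools == 5_600 and with_tools == 4_200
assert no_tools + with_tools == 9_800  # total predictions
```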

## Headline results

### Without tools (AI alone)

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|----------:|
| Claude Sonnet 4.6 | $1,285 | 52% | 72.3% |
| Claude Opus 4.6 | $1,257 | 85% | 70.8% |
| GPT-5.2 | $2,578 | 78% | 62.1% |
| Claude Sonnet 4.5 | $2,276 | 125% | 61.9% |
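
These metrics follow the standard definitions; a sketch of how they can be computed (the repo's `analysis.py` may handle zero ground truth differently — the convention below is an assumption):

```python
import numpy as np


def score(pred: np.ndarray, truth: np.ndarray) -> dict:
    """MAE, MAPE, and share of predictions within 10% of ground truth."""
    err = np.abs(pred - truth)
    nonzero = truth != 0
    # MAPE is undefined where truth == 0, so average over nonzero rows only.
    mape = 100 * np.mean(err[nonzero] / np.abs(truth[nonzero]))
    # Where truth == 0, count a prediction as "within 10%" only if exact.
    hit = np.where(nonzero, err <= 0.10 * np.abs(truth), err == 0)
    return {
        "mae": float(err.mean()),
        "mape": float(mape),
        "within_10": 100 * float(hit.mean()),
    }
```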

### With PolicyEngine tools

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|----------:|
| Claude Opus 4.6 | **$0** | **0%** | **100.0%** |
| Claude Sonnet 4.5 | **$0** | **0%** | **100.0%** |
| GPT-5.2 | **$0** | **0%** | **100.0%** |

### By program (AI alone, all models averaged)

| Program | MAE | MAPE | Within 10% |
|:--------|----:|-----:|----------:|
| Federal income tax | $4,234 | 54% | 41.0% |
| Income tax before credits | $2,683 | 39% | 62.7% |
| EITC | $727 | 298% | 75.3% |
| CTC | $1,028 | 174% | 74.3% |
| Refundable credits | $981 | 128% | 62.3% |
| SNAP | $769 | 55% | 80.7% |
| SSI | $436 | 100% | 95.7% |
| State income tax | $938 | 76% | 59.7% |
| Household net income | $10,586 | 14% | 66.0% |
| Total benefits | $5,228 | 117% | 43.7% |
| Market income | $0 | 0% | 100.0% |
| Marginal tax rate | $347 | N/A | 18.0% |

## Key takeaways

1. **Tools > models.** Every model with PolicyEngine (100% accuracy) vastly outperforms every model without it (62–72%). The choice of computational tool matters more than the choice of frontier model.

2. **AI alone is unreliable for policy calculations.** Even the best model (Claude Sonnet 4.6) averages $1,285 error per calculation and gets only 72% of answers within 10% of correct. The worst programs — income tax (41%), marginal tax rates (18%), and aggregate benefits (44%) — are precisely where accuracy matters most.

3. **With tools, accuracy is perfect.** All three tested models achieve $0 MAE and 100% within-10% accuracy across all 4,200 with-tools predictions. The tool returns ground truth, and models faithfully report it.

4. **Newer models are improving, but not enough.** Claude Sonnet 4.6 improved significantly over 4.5 (72% vs 62% within 10%), but still falls far short of the 100% achievable with tools. Model improvements can't substitute for computational tools.

5. **Marginal tax rates are nearly impossible without tools.** Only 18% of AI-alone predictions are within 10% of the correct marginal rate. This makes AI-generated policy advice about work incentives unreliable without computational backing.
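
A marginal tax rate ground truth can be computed by finite difference on net income; a sketch, where the `$1,000` step size is an assumption and `net_income` is any callable mapping gross earnings to household net income (e.g. a wrapper around a PolicyEngine-US simulation):

```python
def marginal_tax_rate(net_income, gross: float, delta: float = 1000.0) -> float:
    """Share of an extra $delta of earnings lost to taxes and benefit phase-outs."""
    kept = net_income(gross + delta) - net_income(gross)
    return 1.0 - kept / delta
```

This is hard for a model to estimate unaided because the rate combines statutory brackets with overlapping benefit phase-outs, which is exactly why tool-free accuracy collapses to 18%.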

6. **The benchmark validates PolicyEngine's value proposition.** Any AI system that needs to answer questions about US taxes and benefits should use PolicyEngine rather than relying on parametric knowledge.

## Methodology

See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). All API responses are cached for reproducibility.

---
*[Cosilico](https://cosilico.ai) · [PolicyEngine](https://policyengine.org)*
24 changes: 24 additions & 0 deletions app/.gitignore
@@ -0,0 +1,24 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*

node_modules
dist
dist-ssr
*.local

# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?