29 changes: 29 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,29 @@
name: CI

on:
push:
branches: [main, v2]
pull_request:
branches: [main]

jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install ruff
- run: ruff check .
- run: ruff format --check .

test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install -e ".[dev]"
- run: pytest -m "not slow" --tb=short -q
3 changes: 3 additions & 0 deletions .gitignore
@@ -1,3 +1,6 @@
# LiteLLM disk cache
.policybench_cache/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
32 changes: 32 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,32 @@
# PolicyBench development

## Quick start
```bash
pip install -e ".[dev]"
pytest # Run tests (all external calls mocked)
ruff check . # Lint
ruff format . # Format
```

## Architecture
- **Two conditions**: AI alone (EDSL) vs AI with PE tools (LiteLLM)
- **Ground truth**: policyengine-us Simulation
- **TDD**: Write tests first, then implement

## Key files
- `policybench/config.py` — Models, programs, constants
- `policybench/scenarios.py` — Household scenario generation
- `policybench/ground_truth.py` — PE-US calculations
- `policybench/prompts.py` — Natural language prompt templates
- `policybench/eval_no_tools.py` — EDSL-based AI-alone benchmark
- `policybench/eval_with_tools.py` — LiteLLM tool-calling benchmark
- `policybench/analysis.py` — Metrics and reporting

## Testing
- All tests mock external calls (EDSL, LiteLLM, PE-US API)
- `pytest -m "not slow"` to skip slow tests
- Full benchmark runs are manual and expensive
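
A minimal sketch of the mocking pattern, assuming a hypothetical injectable `query_model` seam (the real test suite's module layout and names may differ):

```python
from unittest.mock import MagicMock


def evaluate_one(prompt: str, query_model) -> float:
    """Run one benchmark item through an injected model-call function."""
    # The real code would parse and validate the model's reply;
    # this sketch just coerces the raw string to a number.
    return float(query_model(prompt))


def test_no_network() -> None:
    # The model call is a MagicMock, so the test never touches an API.
    fake_model = MagicMock(return_value="4016.0")
    answer = evaluate_one("Federal income tax for a $50k single filer?", fake_model)
    assert answer == 4016.0
    fake_model.assert_called_once()
```

Injecting the model call as a parameter keeps the slow, expensive API paths out of the default `pytest` run.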

## Formatting
- Use `ruff format .` before committing
- Use `ruff check . --fix` for auto-fixable lint issues
18 changes: 0 additions & 18 deletions Makefile

This file was deleted.

49 changes: 37 additions & 12 deletions README.md
@@ -1,18 +1,43 @@
# PolicyBench

Can AI models accurately calculate tax and benefit outcomes without tools?

PolicyBench measures how well frontier AI models estimate US tax/benefit values for specific households — both **without tools** (pure reasoning) and **with PolicyEngine tools** (API access to ground truth).

## Conditions

1. **AI alone**: Models estimate tax/benefit values using only their training knowledge
2. **AI with PolicyEngine**: Models use a PolicyEngine tool to compute exact answers
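
In the second condition, PolicyEngine is wired into the model via function calling. A sketch of what the tool definition might look like in the OpenAI/LiteLLM function-calling format — the name and parameter schema here are illustrative, not the repo's actual definition:

```python
# Illustrative tool schema for the "AI with PolicyEngine" condition.
policyengine_tool = {
    "type": "function",
    "function": {
        "name": "calculate_household_variable",  # hypothetical name
        "description": (
            "Compute one tax/benefit variable for a household "
            "using the PolicyEngine-US microsimulation."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "variable": {
                    "type": "string",
                    "description": "PolicyEngine-US variable, e.g. 'eitc' or 'snap'.",
                },
                "household": {
                    "type": "object",
                    "description": "Household in PolicyEngine's situation format.",
                },
            },
            "required": ["variable", "household"],
        },
    },
}
```

The benchmark harness would pass a definition like this in the `tools` list of the LiteLLM completion call and execute the tool whenever the model requests it.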

## Models tested

- Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4.5
- GPT-5.2

## Programs evaluated

Federal tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state income tax, net income, marginal tax rates, and more.
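
Ground truth for each program comes from the PolicyEngine-US `Simulation` API. A minimal sketch for a single-adult household — the helper names are illustrative, and the repo's `ground_truth.py` may structure this differently:

```python
def build_situation(employment_income: int, state: str = "CA", year: str = "2024") -> dict:
    """Single-adult household in the PolicyEngine-US situation format."""
    return {
        "people": {
            "adult": {
                "age": {year: 35},
                "employment_income": {year: employment_income},
            }
        },
        "tax_units": {"tax_unit": {"members": ["adult"]}},
        "households": {
            "household": {"members": ["adult"], "state_name": {year: state}}
        },
    }


def ground_truth(variable: str, situation: dict, year: int = 2024) -> float:
    # Lazy import so the sketch stays importable without policyengine-us.
    from policyengine_us import Simulation

    sim = Simulation(situation=situation)
    return float(sim.calculate(variable, year).sum())
```

For example, `ground_truth("eitc", build_situation(20_000))` would return that household's 2024 EITC as computed by the microsimulation.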

## Quick start

```bash
pip install -e ".[dev]"
pytest # Run tests (mocked, no API calls)
```

## Full benchmark

```bash
# Generate ground truth from PolicyEngine-US
policybench ground-truth

# Run AI-alone evaluations
policybench eval-no-tools

# Run AI-with-tools evaluations
policybench eval-with-tools

# Analyze results
policybench analyze
```
71 changes: 71 additions & 0 deletions RESULTS.md
@@ -0,0 +1,71 @@
# PolicyBench: AI can't accurately calculate taxes and benefits — but tools fix that

> Can frontier AI models accurately calculate US tax and benefit outcomes?

**TL;DR: No — but with PolicyEngine tools, they achieve 100% accuracy.**

## Setup

- **100 household scenarios** across 12 US states, varying income ($0–$500k), filing status, and family composition
- **14 tax/benefit programs**: federal income tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state taxes, and more
- **4 frontier models**: GPT-5.2, Claude Sonnet 4.5, Claude Sonnet 4.6, Claude Opus 4.6
- **2 conditions**: AI alone (parametric knowledge only) vs. AI with PolicyEngine tools
- **Ground truth**: PolicyEngine-US microsimulation (1,400 scenario-program pairs)
- **Total predictions**: 9,800 (5,600 no-tools + 4,200 with-tools)
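
The prediction counts above are internally consistent; a quick arithmetic check (note that only three models completed the with-tools condition):

```python
scenarios, programs = 100, 14
pairs = scenarios * programs           # scenario-program pairs
no_tools = 4 * pairs                   # four models, AI alone
with_tools = 3 * pairs                 # three models, with PolicyEngine tools
assert pairs == 1_400
assert no_tools == 5_600 and with_tools == 4_200
assert no_tools + with_tools == 9_800  # total predictions
```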

## Headline results

### Without tools (AI alone)

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|----------:|
| Claude Sonnet 4.6 | $1,285 | 52% | 72.3% |
| Claude Opus 4.6 | $1,257 | 85% | 70.8% |
| GPT-5.2 | $2,578 | 78% | 62.1% |
| Claude Sonnet 4.5 | $2,276 | 125% | 61.9% |
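
These metrics follow the standard definitions; a sketch of how they can be computed (the repo's `analysis.py` may handle zero ground truth differently — the convention below is an assumption):

```python
import numpy as np


def score(pred: np.ndarray, truth: np.ndarray) -> dict:
    """MAE, MAPE, and share of predictions within 10% of ground truth."""
    err = np.abs(pred - truth)
    nonzero = truth != 0
    # MAPE is undefined where truth == 0, so average over nonzero rows only.
    mape = 100 * np.mean(err[nonzero] / np.abs(truth[nonzero]))
    # Where truth == 0, count a prediction as "within 10%" only if exact.
    hit = np.where(nonzero, err <= 0.10 * np.abs(truth), err == 0)
    return {
        "mae": float(err.mean()),
        "mape": float(mape),
        "within_10": 100 * float(hit.mean()),
    }
```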

### With PolicyEngine tools

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|----------:|
| Claude Opus 4.6 | **$0** | **0%** | **100.0%** |
| Claude Sonnet 4.5 | **$0** | **0%** | **100.0%** |
| GPT-5.2 | **$0** | **0%** | **100.0%** |

### By program (AI alone, all models averaged)

| Program | MAE | MAPE | Within 10% |
|:--------|----:|-----:|----------:|
| Federal income tax | $4,234 | 54% | 41.0% |
| Income tax before credits | $2,683 | 39% | 62.7% |
| EITC | $727 | 298% | 75.3% |
| CTC | $1,028 | 174% | 74.3% |
| Refundable credits | $981 | 128% | 62.3% |
| SNAP | $769 | 55% | 80.7% |
| SSI | $436 | 100% | 95.7% |
| State income tax | $938 | 76% | 59.7% |
| Household net income | $10,586 | 14% | 66.0% |
| Total benefits | $5,228 | 117% | 43.7% |
| Market income | $0 | 0% | 100.0% |
| Marginal tax rate | $347 | N/A | 18.0% |

## Key takeaways

1. **Tools > models.** Every model with PolicyEngine (100% accuracy) vastly outperforms every model without it (62–72%). The choice of computational tool matters more than the choice of frontier model.

2. **AI alone is unreliable for policy calculations.** Even the best model (Claude Sonnet 4.6) averages $1,285 error per calculation and gets only 72% of answers within 10% of correct. The worst programs — income tax (41%), marginal tax rates (18%), and aggregate benefits (44%) — are precisely where accuracy matters most.

3. **With tools, accuracy is perfect.** All three tested models achieve $0 MAE and 100% within-10% accuracy across all 4,200 with-tools predictions. The tool returns ground truth, and models faithfully report it.

4. **Newer models are improving, but not enough.** Claude Sonnet 4.6 improved significantly over 4.5 (72% vs 62% within 10%), but still falls far short of the 100% achievable with tools. Model improvements can't substitute for computational tools.

5. **Marginal tax rates are nearly impossible without tools.** Only 18% of AI-alone predictions are within 10% of the correct marginal rate. This makes AI-generated policy advice about work incentives unreliable without computational backing.
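
A marginal tax rate ground truth can be computed by finite difference on net income; a sketch, where the `$1,000` step size is an assumption and `net_income` is any callable mapping gross earnings to household net income (e.g. a wrapper around a PolicyEngine-US simulation):

```python
def marginal_tax_rate(net_income, gross: float, delta: float = 1000.0) -> float:
    """Share of an extra $delta of earnings lost to taxes and benefit phase-outs."""
    kept = net_income(gross + delta) - net_income(gross)
    return 1.0 - kept / delta
```

This is hard for a model to estimate unaided because the rate combines statutory brackets with overlapping benefit phase-outs, which is exactly why tool-free accuracy collapses to 18%.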

6. **The benchmark validates PolicyEngine's value proposition.** Any AI system that needs to answer questions about US taxes and benefits should use PolicyEngine rather than relying on parametric knowledge.

## Methodology

See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). All API responses are cached for reproducibility.

---
*[Cosilico](https://cosilico.ai) · [PolicyEngine](https://policyengine.org)*
24 changes: 24 additions & 0 deletions app/.gitignore
@@ -0,0 +1,24 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*

node_modules
dist
dist-ssr
*.local

# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?