
Commit 6cd9bce

Migrate dashboard to Next.js and refresh benchmark app

1 parent: dc3069f
123 files changed: 54,614 additions & 100,199 deletions


CLAUDE.md

Lines changed: 3 additions & 4 deletions
````diff
@@ -9,7 +9,7 @@ ruff format . # Format
 ```
 
 ## Architecture
-- **Two conditions**: AI alone (EDSL) vs AI with PE tools (LiteLLM)
+- **One condition**: AI alone (no tools)
 - **Ground truth**: policyengine-us Simulation
 - **TDD**: Write tests first, then implement
 
@@ -18,12 +18,11 @@ ruff format . # Format
 - `policybench/scenarios.py` — Household scenario generation
 - `policybench/ground_truth.py` — PE-US calculations
 - `policybench/prompts.py` — Natural language prompt templates
-- `policybench/eval_no_tools.py` — EDSL-based AI-alone benchmark
-- `policybench/eval_with_tools.py` — LiteLLM tool-calling benchmark
+- `policybench/eval_no_tools.py` — LiteLLM-based AI-alone benchmark
 - `policybench/analysis.py` — Metrics and reporting
 
 ## Testing
-- All tests mock external calls (EDSL, LiteLLM, PE-US API)
+- All tests mock external calls (LiteLLM, PE-US API)
 - `pytest -m "not slow"` to skip slow tests
 - Full benchmark runs are manual and expensive
 
````
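The testing note in CLAUDE.md says all tests mock external calls (LiteLLM, PE-US API). As a minimal illustrative sketch of that pattern with `unittest.mock` — the helper name `run_single_eval` and the response shape are assumptions for illustration, not code from this commit:

```python
from unittest.mock import MagicMock

def run_single_eval(prompt: str, completion_fn) -> str:
    # completion_fn stands in for a LiteLLM-style completion call;
    # injecting it keeps the test free of real API traffic.
    response = completion_fn(
        model="claude-opus-4-6",  # illustrative model id
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def test_eval_is_mocked():
    # Build a fake response object with the chat-completion shape.
    fake_response = MagicMock()
    fake_response.choices[0].message.content = '{"eitc": 3200}'
    mock_completion = MagicMock(return_value=fake_response)

    out = run_single_eval("Estimate EITC for ...", mock_completion)

    assert out == '{"eitc": 3200}'   # output came from the mock, not an API
    mock_completion.assert_called_once()

test_eval_is_mocked()
```

Injecting the completion function (rather than patching a module global) is one way to keep such tests hermetic; the repo may instead use `unittest.mock.patch` on the real client.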

README.md

Lines changed: 26 additions & 17 deletions
````diff
@@ -1,23 +1,25 @@
 # PolicyBench
 
-Can AI models accurately calculate tax and benefit outcomes without tools?
+How well can frontier models calculate tax and benefit outcomes without tools?
 
-PolicyBench measures how well frontier AI models estimate US tax/benefit values for specific households — both **without tools** (pure reasoning) and **with PolicyEngine tools** (API access to ground truth).
+PolicyBench measures how well frontier AI models estimate US tax/benefit values for specific households using pure reasoning alone.
 
-## Conditions
+Benchmark scenarios are sampled from real households in the Enhanced CPS and then evaluated under 2025 policy rules with PolicyEngine-US. That gives the benchmark more realistic joint distributions of age, income, filing status, and family composition than independent synthetic sampling.
+
+## Condition
 
 1. **AI alone**: Models estimate tax/benefit values using only their training knowledge
-2. **AI with PolicyEngine**: Models use a PolicyEngine tool to compute exact answers
 
 ## Models tested
 
-- Claude (Opus 4.6, Sonnet 4.5)
-- GPT (4o, o3)
-- Gemini 2.5 Pro
+- Claude Opus 4.6
+- Claude Sonnet 4.6
+- GPT-5.4
+- Gemini 3.1 Pro Preview
 
 ## Programs evaluated
 
-Federal tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state income tax, net income, marginal tax rates, and more.
+Federal tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state income tax, and related core household policy outputs.
 
 ## Quick start
 
@@ -26,18 +28,25 @@ pip install -e ".[dev]"
 pytest # Run tests (mocked, no API calls)
 ```
 
-## Full benchmark
+## Benchmark run
 
 ```bash
-# Generate ground truth from PolicyEngine-US
-policybench ground-truth
+# Generate ground truth for 100 sampled CPS households
+policybench ground-truth -n 100 --seed 42
+
+# Run AI-alone evaluations on the same sampled households
+policybench eval-no-tools -n 100 --seed 42
 
-# Run AI-alone evaluations
-policybench eval-no-tools
+# Analyze results and export production artifacts
+policybench analyze --output-dir results/analysis
+```
 
-# Run AI-with-tools evaluations
-policybench eval-with-tools
+## Repeated runs
+
+```bash
+# Optional: run the same benchmark multiple times on the same sampled households
+policybench eval-no-tools-repeated -n 100 --seed 42 --repeats 3 -o results/no_tools/runs
 
-# Analyze results
-policybench analyze
+# Analyze the canonical point estimate plus across-run stability
+policybench analyze --runs-dir results/no_tools/runs --output-dir results/analysis
 ```
````
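The new README commands pass the same `-n 100 --seed 42` to both `ground-truth` and `eval-no-tools`, so both steps operate on the identical household sample. A hedged sketch of why a fixed seed guarantees that (the sampler below is illustrative, not the repo's code):

```python
import random

def sample_household_ids(population_size: int, n: int, seed: int) -> list[int]:
    # A dedicated Random instance avoids interference from global RNG state.
    rng = random.Random(seed)
    return rng.sample(range(population_size), n)

# Same seed and population -> identical sample, so eval-no-tools
# predictions join 1:1 against the ground-truth rows.
first = sample_household_ids(10_000, 100, seed=42)
second = sample_household_ids(10_000, 100, seed=42)
assert first == second and len(first) == 100
```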

RESULTS.md

Lines changed: 16 additions & 60 deletions
````diff
@@ -1,71 +1,27 @@
-# PolicyBench: AI can't accurately calculate taxes and benefits — but tools fix that
+# PolicyBench Results
 
-> Can frontier AI models accurately calculate US tax and benefit outcomes?
+PolicyBench is a no-tools benchmark. Generated outputs live in `results/analysis/` after a benchmark run.
 
-**TL;DR: No — but with PolicyEngine tools, they achieve 100% accuracy.**
+## Run
 
-## Setup
+```bash
+policybench ground-truth -n 100 --seed 42
+policybench eval-no-tools -n 100 --seed 42
+policybench analyze --output-dir results/analysis
+```
 
-- **100 household scenarios** across 12 US states, varying income ($0–$500k), filing status, and family composition
-- **14 tax/benefit programs**: federal income tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state taxes, and more
-- **4 frontier models**: GPT-5.2, Claude Sonnet 4.5, Claude Sonnet 4.6, Claude Opus 4.6
-- **2 conditions**: AI alone (parametric knowledge only) vs. AI with PolicyEngine tools
-- **Ground truth**: PolicyEngine-US microsimulation (1,400 scenario-program pairs)
-- **Total predictions**: 9,800 (5,600 no-tools + 4,200 with-tools)
+## Artifacts
 
-## Headline results
-
-### Without tools (AI alone)
-
-| Model | MAE | MAPE | Within 10% |
-|:------|----:|-----:|----------:|
-| Claude Sonnet 4.6 | $1,285 | 52% | 72.3% |
-| Claude Opus 4.6 | $1,257 | 85% | 70.8% |
-| GPT-5.2 | $2,578 | 78% | 62.1% |
-| Claude Sonnet 4.5 | $2,276 | 125% | 61.9% |
-
-### With PolicyEngine tools
-
-| Model | MAE | MAPE | Within 10% |
-|:------|----:|-----:|----------:|
-| Claude Opus 4.6 | **$0** | **0%** | **100.0%** |
-| Claude Sonnet 4.5 | **$0** | **0%** | **100.0%** |
-| GPT-5.2 | **$0** | **0%** | **100.0%** |
-
-### By program (AI alone, all models averaged)
-
-| Program | MAE | MAPE | Within 10% |
-|:--------|----:|-----:|----------:|
-| Federal income tax | $4,234 | 54% | 41.0% |
-| Income tax before credits | $2,683 | 39% | 62.7% |
-| EITC | $727 | 298% | 75.3% |
-| CTC | $1,028 | 174% | 74.3% |
-| Refundable credits | $981 | 128% | 62.3% |
-| SNAP | $769 | 55% | 80.7% |
-| SSI | $436 | 100% | 95.7% |
-| State income tax | $938 | 76% | 59.7% |
-| Household net income | $10,586 | 14% | 66.0% |
-| Total benefits | $5,228 | 117% | 43.7% |
-| Market income | $0 | 0% | 100.0% |
-| Marginal tax rate | $347 | N/A | 18.0% |
-
-## Key takeaways
-
-1. **Tools > models.** Every model with PolicyEngine (100% accuracy) vastly outperforms every model without it (62–72%). The choice of computational tool matters more than the choice of frontier model.
-
-2. **AI alone is unreliable for policy calculations.** Even the best model (Claude Sonnet 4.6) averages $1,285 error per calculation and gets only 72% of answers within 10% of correct. The worst programs — income tax (41%), marginal tax rates (18%), and aggregate benefits (44%) — are precisely where accuracy matters most.
-
-3. **With tools, accuracy is perfect.** All three tested models achieve $0 MAE and 100% within-10% accuracy across all 4,200 with-tools predictions. The tool returns ground truth, and models faithfully report it.
-
-4. **Newer models are improving, but not enough.** Claude Sonnet 4.6 improved significantly over 4.5 (72% vs 62% within 10%), but still falls far short of the 100% achievable with tools. Model improvements can't substitute for computational tools.
-
-5. **Marginal tax rates are nearly impossible without tools.** Only 18% of AI-alone predictions are within 10% of the correct marginal rate. This makes AI-generated policy advice about work incentives unreliable without computational backing.
-
-6. **The benchmark validates PolicyEngine's value proposition.** Any AI system that needs to answer questions about US taxes and benefits should use PolicyEngine rather than relying on parametric knowledge.
+- `results/ground_truth.csv`
+- `results/no_tools/predictions.csv`
+- `results/analysis/metrics.csv`
+- `results/analysis/summary_by_model.csv`
+- `results/analysis/summary_by_variable.csv`
+- `results/analysis/report.md`
 
 ## Methodology
 
-See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). All API responses are cached for reproducibility.
+See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). LLM responses are cached for reproducibility.
 
 ---
 *[Cosilico](https://cosilico.ai) · [PolicyEngine](https://policyengine.org)*
````
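The removed results tables reported MAE, MAPE, and the share of predictions within 10% of ground truth, and the new workflow exports these to `results/analysis/metrics.csv`. As a hedged sketch assuming the usual definitions of those columns (the repo's `policybench/analysis.py` may handle edge cases differently):

```python
def accuracy_metrics(preds: list[float], truths: list[float]) -> dict:
    """Compute MAE, MAPE, and within-10% share for one slice of predictions.

    Illustrative only: definitions assumed from the old RESULTS tables,
    not taken from the benchmark's analysis code.
    """
    pairs = list(zip(preds, truths))
    mae = sum(abs(p - t) for p, t in pairs) / len(pairs)
    # MAPE only over nonzero ground truth, to avoid dividing by zero.
    nonzero = [(p, t) for p, t in pairs if t != 0]
    mape = (100 * sum(abs(p - t) / abs(t) for p, t in nonzero) / len(nonzero)
            if nonzero else float("nan"))
    # A zero ground-truth value counts as "within 10%" only when exact.
    within10 = 100 * sum(abs(p - t) <= 0.10 * abs(t) for p, t in pairs) / len(pairs)
    return {"mae": mae, "mape_pct": mape, "within_10_pct": within10}

# MAE is 100.0 here and all three predictions land within 10%.
print(accuracy_metrics([1100.0, 0.0, 5000.0], [1000.0, 0.0, 5200.0]))
```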

app/.gitignore

Lines changed: 4 additions & 0 deletions
````diff
@@ -10,6 +10,8 @@ lerna-debug.log*
 node_modules
 dist
 dist-ssr
+.next
+out
 *.local
 
 # Editor directories and files
@@ -22,3 +24,5 @@ dist-ssr
 *.njsproj
 *.sln
 *.sw?
+.vercel
+.env*.local
````
