PolicyEngine
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 29 additions & 0 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 32 additions & 0 deletions
diff --git a/Makefile b/Makefile
Lines changed: 0 additions & 18 deletions
diff --git a/README.md b/README.md
Lines changed: 37 additions & 12 deletions
diff --git a/benchmark_results/benchmark_output.csv b/benchmark_results/benchmark_output.csv
Lines changed: 0 additions & 21 deletions
@@ -0,0 +1,29 @@
+name: CI
+
+on:
+  push:
+    branches: [main, v2]
+  pull_request:
+    branches: [main]
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install ruff
+      - run: ruff check .
+      - run: ruff format --check .
+
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install -e ".[dev]"
+      - run: pytest -m "not slow" --tb=short -q
@@ -0,0 +1,32 @@
+# PolicyBench development
+
+## Quick start
+```bash
+pip install -e ".[dev]"
+pytest                    # Run tests (all external calls mocked)
+ruff check .              # Lint
+ruff format .             # Format
+```
+
+## Architecture
+- **Two conditions**: AI alone (EDSL) vs AI with PE tools (LiteLLM)
+- **Ground truth**: policyengine-us Simulation
+- **TDD**: Write tests first, then implement
+
+## Key files
+- `policybench/config.py` — Models, programs, constants
+- `policybench/scenarios.py` — Household scenario generation
+- `policybench/ground_truth.py` — PE-US calculations
+- `policybench/prompts.py` — Natural language prompt templates
+- `policybench/eval_no_tools.py` — EDSL-based AI-alone benchmark
+- `policybench/eval_with_tools.py` — LiteLLM tool-calling benchmark
+- `policybench/analysis.py` — Metrics and reporting
+
+## Testing
+- All tests mock external calls (EDSL, LiteLLM, PE-US API)
+- `pytest -m "not slow"` to skip slow tests
+- Full benchmark runs are manual and expensive
+
+## Formatting
+- Use `ruff format .` before committing
+- Use `ruff check . --fix` for auto-fixable lint issues
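Reviewer note: the testing section of CLAUDE.md says every external call is mocked. A minimal sketch of that pattern with the standard library's `unittest.mock` might look like the following. The names `estimate_eitc` and `query_model` are hypothetical and used only for illustration; they are not confirmed to exist in `policybench`, and the dollar amount is an arbitrary placeholder.

```python
from unittest.mock import Mock


def estimate_eitc(query_model, household: dict) -> float:
    """Hypothetical helper: ask a model for an EITC estimate and parse the reply."""
    prompt = f"Estimate the EITC for this household: {household}"
    raw = query_model(prompt)  # in a real run this would be an external call
    return float(raw.replace("$", "").replace(",", ""))


def test_estimate_eitc_uses_mocked_model():
    # Stand-in for the real model client, so no network request is made.
    fake_model = Mock(return_value="$1,234.50")  # arbitrary placeholder value
    household = {"income": 20_000, "children": 1}

    result = estimate_eitc(fake_model, household)

    assert result == 1234.5
    fake_model.assert_called_once()
```

Passing the client in as an argument keeps the test free of patching by import path. A real suite may instead patch the EDSL or LiteLLM entry points directly.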
@@ -1,18 +1,43 @@
 # PolicyBench
 
-A mini-benchmark comparing Large Language Model estimates to PolicyEngine calculations of US tax/benefit programs.
+Can AI models accurately calculate tax and benefit outcomes without tools?
 
-## Requirements
+PolicyBench measures how well frontier AI models estimate US tax/benefit values for specific households — both **without tools** (pure reasoning) and **with PolicyEngine tools** (API access to ground truth).
 
-- Python 3.9+
-- `policyengine-us` for ground truth
-- `edsl` for LLM queries
-- (Optional) `pytest`, etc. for tests
+## Conditions
 
-## Installation
+1. **AI alone**: Models estimate tax/benefit values using only their training knowledge
+2. **AI with PolicyEngine**: Models use a PolicyEngine tool to compute exact answers
 
-1. Clone the repo:
-   ```bash
-   git clone https://github.com/YOUR_USERNAME/policybench.git
-   cd policybench
-   ```
+## Models tested
+
+- Claude (Opus 4.6, Sonnet 4.5)
+- GPT (4o, o3)
+- Gemini 2.5 Pro
+
+## Programs evaluated
+
+Federal tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state income tax, net income, marginal tax rates, and more.
+
+## Quick start
+
+```bash
+pip install -e ".[dev]"
+pytest  # Run tests (mocked, no API calls)
+```
+
+## Full benchmark
+
+```bash
+# Generate ground truth from PolicyEngine-US
+policybench ground-truth
+
+# Run AI-alone evaluations
+policybench eval-no-tools
+
+# Run AI-with-tools evaluations
+policybench eval-with-tools
+
+# Analyze results
+policybench analyze
+```
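Reviewer note: the README describes comparing model estimates against PolicyEngine ground truth under two conditions, but the diff does not show which metrics `policybench analyze` reports. As a sketch under that assumption, mean absolute error per condition could be computed as below. The function name, the scenario keys, and all numbers are made up for illustration and are not benchmark results.

```python
from statistics import mean


def mean_absolute_error(estimates: dict, truth: dict) -> float:
    """Average absolute gap between estimates and ground truth over shared scenarios."""
    shared = estimates.keys() & truth.keys()
    if not shared:
        raise ValueError("no overlapping scenarios to score")
    return mean(abs(estimates[key] - truth[key]) for key in shared)


# Illustrative values only; not taken from any benchmark run.
ground_truth = {"scenario_1": 1000.0, "scenario_2": 2500.0}
ai_alone = {"scenario_1": 1200.0, "scenario_2": 2000.0}
ai_with_tools = {"scenario_1": 1000.0, "scenario_2": 2500.0}

print("AI alone MAE:", mean_absolute_error(ai_alone, ground_truth))
print("AI with tools MAE:", mean_absolute_error(ai_with_tools, ground_truth))
```

Scoring only the scenarios present in both dictionaries avoids penalizing a condition for scenarios it was never run on. Whether the project handles missing responses this way is not stated in the diff.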