Commit 87fb46a

MaxGhenis and claude committed
PolicyBench v2: complete rewrite for AI tax/benefit benchmark
Two-condition benchmark testing whether AI models can accurately calculate tax/benefit outcomes without tools vs with PolicyEngine tool access.

- 5 frontier models (Claude Opus/Sonnet, GPT-4o/o3, Gemini 2.5 Pro)
- 14 programs (federal tax, credits, benefits, state tax, rates)
- 100 deterministic household scenarios across 12 states
- EDSL for AI-alone eval, LiteLLM for tool-calling eval
- Full test suite (46 tests), ruff lint, GitHub Actions CI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 6fd5196 commit 87fb46a

34 files changed

Lines changed: 1660 additions & 2397 deletions
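
The commit message mentions 100 deterministic household scenarios across 12 states. The diff shown here does not include `policybench/scenarios.py`, so the following is only a minimal sketch of how seeded generation could make scenarios reproducible. The function name, the field names, the seed, and the list of states are all assumptions for illustration; the commit does not say which 12 states are used.

```python
import random

# Placeholder list: the commit says "12 states" but does not name them.
STATES = ["CA", "TX", "NY", "FL", "PA", "IL", "OH", "GA", "NC", "MI", "WA", "MA"]
FILING_STATUSES = ["single", "joint", "head_of_household"]


def generate_scenarios(n=100, seed=42):
    """Return n household dicts; the same seed always yields the same list."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    scenarios = []
    for i in range(n):
        scenarios.append(
            {
                "id": f"scenario_{i:03d}",
                "state": rng.choice(STATES),
                "filing_status": rng.choice(FILING_STATUSES),
                "employment_income": rng.randrange(0, 200_001, 5_000),
                "num_children": rng.randint(0, 4),
            }
        )
    return scenarios
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the output stable even if other code draws random numbers in between.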

.github/workflows/ci.yml

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+name: CI
+
+on:
+  push:
+    branches: [main, v2]
+  pull_request:
+    branches: [main]
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install ruff
+      - run: ruff check .
+      - run: ruff format --check .
+
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install -e ".[dev]"
+      - run: pytest -m "not slow" --tb=short -q

CLAUDE.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# PolicyBench development
+
+## Quick start
+```bash
+pip install -e ".[dev]"
+pytest # Run tests (all external calls mocked)
+ruff check . # Lint
+ruff format . # Format
+```
+
+## Architecture
+- **Two conditions**: AI alone (EDSL) vs AI with PE tools (LiteLLM)
+- **Ground truth**: policyengine-us Simulation
+- **TDD**: Write tests first, then implement
+
+## Key files
+- `policybench/config.py` — Models, programs, constants
+- `policybench/scenarios.py` — Household scenario generation
+- `policybench/ground_truth.py` — PE-US calculations
+- `policybench/prompts.py` — Natural language prompt templates
+- `policybench/eval_no_tools.py` — EDSL-based AI-alone benchmark
+- `policybench/eval_with_tools.py` — LiteLLM tool-calling benchmark
+- `policybench/analysis.py` — Metrics and reporting
+
+## Testing
+- All tests mock external calls (EDSL, LiteLLM, PE-US API)
+- `pytest -m "not slow"` to skip slow tests
+- Full benchmark runs are manual and expensive
+
+## Formatting
+- Use `ruff format .` before committing
+- Use `ruff check . --fix` for auto-fixable lint issues
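
The Testing section of CLAUDE.md states that all tests mock external calls. The test files are not part of the diff shown here, so this is a hedged sketch of that pattern using only the standard library's `unittest.mock`. The helper `ask_model` and the client method `complete` are hypothetical names chosen for illustration; they are not taken from EDSL, LiteLLM, or the PolicyBench code.

```python
from unittest.mock import Mock


def ask_model(client, prompt):
    """Hypothetical helper: send a prompt and parse a dollar amount from the reply."""
    reply = client.complete(prompt)  # `complete` is an assumed method name
    return float(reply.replace("$", "").replace(",", "").strip())


def test_ask_model_parses_reply():
    # The Mock stands in for the real LLM client, so no network call is made.
    client = Mock()
    client.complete.return_value = "$1,234.50"

    assert ask_model(client, "What is this household's EITC?") == 1234.50
    # Verify the helper forwarded the prompt unchanged, exactly once.
    client.complete.assert_called_once_with("What is this household's EITC?")
```

Passing the client in as an argument is one way to keep such a test free of patching; the real suite may patch module-level imports instead.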

Makefile

Lines changed: 0 additions & 18 deletions
This file was deleted.

README.md

Lines changed: 37 additions & 12 deletions
@@ -1,18 +1,43 @@
 # PolicyBench
 
-A mini-benchmark comparing Large Language Model estimates to PolicyEngine calculations of US tax/benefit programs.
+Can AI models accurately calculate tax and benefit outcomes without tools?
 
-## Requirements
+PolicyBench measures how well frontier AI models estimate US tax/benefit values for specific households — both **without tools** (pure reasoning) and **with PolicyEngine tools** (API access to ground truth).
 
-- Python 3.9+
-- `policyengine-us` for ground truth
-- `edsl` for LLM queries
-- (Optional) `pytest`, etc. for tests
+## Conditions
 
-## Installation
+1. **AI alone**: Models estimate tax/benefit values using only their training knowledge
+2. **AI with PolicyEngine**: Models use a PolicyEngine tool to compute exact answers
 
-1. Clone the repo:
-```bash
-git clone https://github.com/YOUR_USERNAME/policybench.git
-cd policybench
-```
+## Models tested
+
+- Claude (Opus 4.6, Sonnet 4.5)
+- GPT (4o, o3)
+- Gemini 2.5 Pro
+
+## Programs evaluated
+
+Federal tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state income tax, net income, marginal tax rates, and more.
+
+## Quick start
+
+```bash
+pip install -e ".[dev]"
+pytest # Run tests (mocked, no API calls)
+```
+
+## Full benchmark
+
+```bash
+# Generate ground truth from PolicyEngine-US
+policybench ground-truth
+
+# Run AI-alone evaluations
+policybench eval-no-tools
+
+# Run AI-with-tools evaluations
+policybench eval-with-tools
+
+# Analyze results
+policybench analyze
+```
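
The README's last pipeline step, `policybench analyze`, compares model answers from the two conditions against the PolicyEngine ground truth. The contents of `policybench/analysis.py` are not shown in this diff, so the sketch below only illustrates one plausible pair of metrics: mean absolute error and the share of answers within a dollar tolerance. The function names, the tolerance default, and every number are assumptions; the sample values are made up and are not benchmark results.

```python
def mean_absolute_error(predictions, truths):
    """Average absolute dollar gap between predictions and ground truth."""
    return sum(abs(p - t) for p, t in zip(predictions, truths)) / len(truths)


def share_within_tolerance(predictions, truths, tolerance=1.0):
    """Fraction of predictions within `tolerance` dollars of ground truth."""
    hits = sum(1 for p, t in zip(predictions, truths) if abs(p - t) <= tolerance)
    return hits / len(truths)


# Made-up numbers for illustration only; these are not benchmark results.
truth = [3000.0, 0.0, 1500.0, 250.0]
no_tools = [2800.0, 0.0, 2100.0, 400.0]
with_tools = [3000.0, 0.0, 1500.0, 250.0]

for name, preds in [("no_tools", no_tools), ("with_tools", with_tools)]:
    mae = mean_absolute_error(preds, truth)
    share = share_within_tolerance(preds, truth)
    print(f"{name}: MAE={mae:.2f}, within $1: {share:.0%}")
```

Reporting both numbers is a deliberate choice in this sketch: the error average is dominated by large misses, while the tolerance share shows how often an answer is effectively exact.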

benchmark_results/benchmark_output.csv

Lines changed: 0 additions & 21 deletions
This file was deleted.
