
Commit 2e0aa0d

MaxGhenis and claude committed
Add preliminary benchmark results, paper draft, and eval improvements
- RESULTS.md with partial results (4000/4200 no-tools, 1400/4200 with-tools)
- Key finding: tools > models (100% vs 74.5% within-10% accuracy)
- JB2 paper in docs/ (intro, methodology, results, discussion, references)
- Add retry logic with exponential backoff to both eval modules
- Add incremental CSV saving to prevent data loss on crashes
- Prediction CSVs for completed model runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
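The two eval-module changes named above follow standard patterns; a minimal sketch of retry-with-exponential-backoff plus incremental CSV saving (function names and parameters here are illustrative, not the repo's actual API):

```python
import csv
import random
import time
from pathlib import Path


def call_with_retries(fn, max_retries=5, base_delay=1.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Sleep ~1s, 2s, 4s, ... with jitter to avoid synchronized retries.
            time.sleep(base_delay * 2**attempt + random.uniform(0, 1))


def append_prediction(path: Path, row: dict) -> None:
    """Write each prediction as soon as it arrives, so a crash loses
    at most the in-flight row rather than the whole run."""
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```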
1 parent 5f19c57 commit 2e0aa0d

13 files changed

Lines changed: 44527 additions & 13 deletions

File tree

RESULTS.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# PolicyBench: preliminary results

> Can frontier AI models accurately calculate US tax and benefit outcomes?

**TL;DR: No — but they can with the right tools.**

## Setup

- **100 household scenarios** across 12 US states, varying income ($0–$500k), filing status, and family composition
- **14 tax/benefit programs** (federal income tax, EITC, CTC, SNAP, SSI, Medicaid, state taxes, and more)
- **3 frontier models**: GPT-5.2, Claude Sonnet 4.5, Claude Opus 4.6
- **2 conditions**: AI alone (parametric knowledge only) vs. AI with PolicyEngine tools
- **Ground truth**: PolicyEngine-US microsimulation (1,400 values)
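For concreteness, a single scenario-variable pair might look like the following sketch (field names and values are illustrative; see the benchmark code for the actual schema):

```python
# Hypothetical shape of one benchmark record: a household scenario plus one
# target variable. Values are illustrative, not drawn from the actual benchmark.
scenario = {
    "scenario_id": 42,
    "state": "CA",
    "filing_status": "single",
    "employment_income": 40_000,
    "num_children": 2,
    "adult_ages": [35],
}
target = {"variable": "snap", "year": 2025, "ground_truth": 3_500.0}  # illustrative
```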
## Headline results

### Without tools (AI alone)

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|----------:|
| Claude Opus 4.6 | $1,551 | 44% | 74.5% |
| Claude Sonnet 4.5 | $2,751 | 90% | 66.1% |
| GPT-5.2 | $3,231 | 57% | 66.7% |
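The accuracy metrics in these tables are standard; a minimal sketch of how MAE, MAPE, and within-10% accuracy can be computed (not necessarily the benchmark's exact implementation; in particular, its handling of zero-valued ground truths may differ):

```python
import numpy as np


def accuracy_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """MAE, MAPE, and the share of predictions within 10% of ground truth."""
    err = np.abs(pred - truth)
    nonzero = truth != 0  # MAPE is undefined where ground truth is $0
    return {
        "mae": err.mean(),
        "mape": 100 * (err[nonzero] / np.abs(truth[nonzero])).mean(),
        # A $0 ground truth counts as "within 10%" only on an exact match.
        "within_10pct": (err <= 0.10 * np.abs(truth)).mean(),
    }
```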
### With PolicyEngine tools

| Model | MAE | Within 10% |
|:------|----:|----------:|
| GPT-5.2 | $0 | 100.0% |

### By program (AI alone, all models averaged)

| Program | MAE | Within 10% |
|:--------|----:|----------:|
| Federal income tax | $4,328 | 40.9% |
| Income tax before refundable credits | $2,745 | 62.6% |
| EITC | $744 | 74.8% |
| CTC | $986 | 75.2% |
| Refundable credits | $998 | 61.5% |
| SNAP | $797 | 80.1% |
| SSI | $458 | 95.5% |
| State income tax | $955 | 58.7% |
| Household net income | $10,852 | 65.3% |
| Total benefits | $5,315 | 42.8% |
| Market income | $0 | 100.0% |
## Key takeaways

1. **Tools > models.** The weakest model with PolicyEngine (100% within-10% accuracy) vastly outperforms the strongest model without it (74.5%). The choice of computational tool matters more than the choice of frontier model.
2. **AI alone is unreliable for policy calculations.** Even the best model (Claude Opus 4.6) averages $1,551 of error per calculation and gets only 3 in 4 answers within 10% of correct.
3. **Complex programs are hardest.** Federal income tax (40.9% within 10%), total benefits (42.8%), and state income tax (58.7%) have the worst AI-alone accuracy — precisely the programs where getting it wrong matters most.
4. **With tools, accuracy is perfect so far.** GPT-5.2 with PolicyEngine achieves $0 MAE and 100% within-10% accuracy across all 1,400 scenario-program pairs tested to date.
## Status

- No-tools: 4,000/4,200 predictions complete
- With-tools: 1,400/4,200 predictions complete (remaining models in progress)
- Full results with all models and analysis charts coming soon
## Methodology

See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). All predictions are cached and reproducible.
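As an illustration of the ground-truth pipeline, one household can be scored directly with the policyengine-us Python package. A minimal sketch, assuming its Simulation API; the household below is a hand-written example, not one of the benchmark's scenarios:

```python
from policyengine_us import Simulation

# A hand-written example household: single parent, one child, California.
situation = {
    "people": {
        "parent": {"age": {"2025": 35}, "employment_income": {"2025": 40_000}},
        "child": {"age": {"2025": 8}},
    },
    "tax_units": {"tax_unit": {"members": ["parent", "child"]}},
    "spm_units": {"spm_unit": {"members": ["parent", "child"]}},
    "households": {
        "household": {"members": ["parent", "child"], "state_name": {"2025": "CA"}}
    },
}

sim = Simulation(situation=situation)
snap = sim.calculate("snap", 2025)[0]  # annual SNAP allotment, in dollars
eitc = sim.calculate("eitc", 2025)[0]
```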
---
*[Cosilico](https://cosilico.ai) · [PolicyEngine](https://policyengine.org)*

docs/discussion.md

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
---
title: Discussion
---

# Discussion

## Where models fail

The AI-alone results reveal systematic patterns in model errors that reflect the underlying structure of US tax and benefit programs.

**Means-tested benefits are hardest.** Programs like SNAP and SSI involve multi-step eligibility determinations: gross income tests, net income tests, asset limits, categorical eligibility provisions, and benefit reduction rates that differ by household size and state. Models must not only know these rules but execute them in the correct order, applying the right thresholds for the specific household configuration. Even models that can recite SNAP eligibility rules struggle to correctly determine whether a family of four in California with $25,000 in income qualifies, and if so, for how much.

**Phase-outs and cliffs create discontinuities.** The EITC, CTC, and many state tax provisions have phase-in and phase-out schedules that create sharp nonlinearities in the relationship between income and the computed value. Models tend to produce smooth approximations where the true function is discontinuous. For example, a model might estimate a positive EITC for a household whose income is just above the phase-out threshold, producing an error of several thousand dollars from a one-dollar difference in income.

**State-level variation adds complexity.** State income tax calculations require knowledge of state-specific bracket structures, deductions, credits, and their interactions with federal provisions. Models must effectively maintain 50 separate tax code implementations in their parameters. Errors are systematically larger for states with complex tax systems (California, New York) than for states with no income tax (Texas, Florida, Washington).

**Marginal tax rates are especially challenging.** Computing the marginal tax rate requires determining how a one-dollar increase in income changes tax liability and benefit amounts across all programs simultaneously. This involves understanding not just each program's rules but their interactions --- how additional income affects SNAP eligibility, EITC phase-out, and federal tax brackets concurrently. The resulting effective marginal tax rates can exceed 100% in some income ranges due to benefit cliffs, a phenomenon that models rarely capture.
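One way to make this concrete: the effective marginal tax rate is typically computed by finite differences against a microsimulation, which requires the full cross-program calculation twice. A hedged sketch, assuming the policyengine-us Simulation API and its `household_net_income` variable:

```python
import copy

from policyengine_us import Simulation


def effective_mtr(situation: dict, year: str = "2025", delta: float = 1.0) -> float:
    """Share of one extra earned dollar lost to taxes and benefit
    phase-outs combined, computed via finite differences."""

    def net(sit: dict) -> float:
        return Simulation(situation=sit).calculate("household_net_income", year)[0]

    bumped = copy.deepcopy(situation)
    # Assumes the first person has employment_income keyed by `year`.
    person = next(iter(bumped["people"].values()))
    person["employment_income"][year] += delta

    return 1 - (net(bumped) - net(situation)) / delta
```

Values above 1.0 from this computation are exactly the benefit cliffs described above.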
**Income tax estimates are closer but still unreliable.** Federal income tax is the program where models perform best in the AI-alone condition, likely because tax bracket calculations are well-represented in training data and involve relatively straightforward arithmetic. However, even here, models make errors on the order of thousands of dollars for complex returns, particularly those involving interactions between the standard deduction, credits, and the alternative minimum tax.

## Why tool access works

The tool-augmented condition produces near-perfect accuracy because it shifts the computational burden from the model to the microsimulation engine. The model's role changes from "compute the answer" to "translate the question into the correct API call." This is a fundamentally easier task: the model must construct a valid household JSON object and specify the correct variable name, but it does not need to execute any tax or benefit calculations itself.
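Concretely, the model's output in the tool condition is a structured call rather than a number. A sketch of what such a call can look like (the tool name is hypothetical; the household payload follows policyengine-us situation conventions):

```python
# Hypothetical tool call: the model supplies a household payload and a
# target variable name; the engine does the arithmetic. Requesting
# "snap_benefits" instead of "snap" is one of the residual error modes
# listed below.
tool_call = {
    "tool": "policyengine_calculate",  # hypothetical tool name
    "arguments": {
        "household": {
            "people": {
                "adult": {"age": {"2025": 30}, "employment_income": {"2025": 25_000}}
            },
            "households": {
                "household": {"members": ["adult"], "state_name": {"2025": "TX"}}
            },
        },
        "variable": "snap",
        "year": 2025,
    },
}
```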
The residual errors in the tool-augmented condition fall into a few categories:

- **Malformed household JSON.** Occasionally, a model constructs a household object that is syntactically valid but semantically incorrect --- for example, placing a child in the wrong tax unit or omitting the state code.
- **Wrong variable name.** A model might request `federal_income_tax` instead of `income_tax`, or `snap_benefits` instead of `snap`.
- **Failure to invoke the tool.** In rare cases, a model attempts to answer from memory rather than using the available tool, particularly for variables it perceives as simple (like market income).

These errors are model-specific but small in aggregate. Importantly, they are addressable through better tool documentation, structured output schemas, or few-shot examples --- unlike the fundamental computational limitations exposed in the AI-alone condition.

## The tool matters more than the model

Perhaps the most striking finding is the relative magnitude of the two gaps: (1) the gap between models within each condition, and (2) the gap between conditions for each model. The between-model differences in the AI-alone condition are modest: all frontier models struggle with the same programs and make qualitatively similar errors. The between-condition difference, by contrast, is transformative: the worst model with tools outperforms the best model without tools by a wide margin.

This finding has a direct practical implication: investments in better computational tools yield larger returns than investments in better base models, at least for the specific task of policy calculation. An organization seeking accurate AI-assisted policy analysis should prioritize tool access over model selection.

## Implications for AI-assisted policy analysis

These results suggest a clear architecture for AI systems that provide policy analysis:

1. **Computation should be delegated to validated tools.** LLMs should not be trusted to perform tax and benefit calculations from memory, regardless of their general capability. Microsimulation engines like PolicyEngine exist precisely to handle this complexity and have been validated against statutory rules.
2. **Models add value as interfaces, not calculators.** The appropriate role for an LLM in policy analysis is to translate natural language questions into structured API calls, interpret results for non-technical users, and synthesize findings across multiple scenarios. These are tasks where models excel.
3. **Benchmarks should test tool-augmented performance.** Evaluating models on unaided policy computation may be informative for understanding model capabilities but is not predictive of real-world utility. Practical evaluations should measure the full system --- model plus tools --- since that is what users will interact with.
4. **Tool quality is a bottleneck.** If the tool matters more than the model, then the accuracy, coverage, and usability of the computational tool become the binding constraints on system performance. Expanding microsimulation coverage to more programs, states, and countries is likely to have a larger impact than improving model reasoning on policy questions.

## Limitations

Several limitations qualify these findings:

**Scope of programs.** PolicyBench evaluates 14 variables covering the major federal and state tax-and-benefit programs, but the US system includes hundreds of additional provisions (housing subsidies, healthcare premium tax credits, education credits, retirement savings incentives, and more). Model performance may differ on programs not included in this benchmark.

**Household complexity.** Our 100 scenarios vary across several dimensions (state, filing status, income, children, adult ages) but do not include many real-world complications: multiple income sources (self-employment, investment, retirement), itemized deductions, prior-year carryovers, mid-year moves, or non-standard family structures. More complex households may be even harder for models to evaluate correctly.

**Single tax year.** All evaluations use tax year 2025. Model performance may differ for historical years (where training data is more abundant) or future years (where models must extrapolate from known rules).

**Prompt sensitivity.** We use a single prompt template per condition. Model performance may be sensitive to prompt phrasing, particularly in the AI-alone condition where chain-of-thought prompting or structured reasoning might improve accuracy.

**Model versions.** AI model capabilities change rapidly. Results for specific model versions may not generalize to future releases, though the qualitative finding --- that models struggle with precise computation without tools --- is likely to persist.

## Future work

Several extensions of PolicyBench are planned or in progress:

**International coverage.** PolicyEngine supports the UK, Canadian, and other tax-benefit systems. Extending PolicyBench to multiple countries would test whether models' computational limitations are specific to US policy complexity or are more general.

**Specialized policy models.** Cosilico is developing AI models specifically trained for policy analysis, with fine-tuning on microsimulation inputs and outputs. PolicyBench provides a natural evaluation framework for measuring whether specialized training improves unaided performance.

**Dynamic scenarios.** Current scenarios are static household snapshots. Future versions could test models on reform scenarios (e.g., "What would this household's SNAP benefits be if the maximum allotment increased by 10%?"), which require understanding both baseline rules and the proposed change.
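A sketch of how such a reform question could be graded, assuming the policyengine-core `Reform.from_dict` interface; the parameter path below is illustrative, not a verified policyengine-us path:

```python
from policyengine_core.reforms import Reform
from policyengine_us import Simulation

situation = {
    "people": {"adult": {"age": {"2025": 30}, "employment_income": {"2025": 10_000}}},
    "households": {"household": {"members": ["adult"], "state_name": {"2025": "TX"}}},
}

# Illustrative parameter path; the real SNAP max-allotment parameter differs.
reform = Reform.from_dict(
    {"gov.usda.snap.max_allotment.multiplier": {"2025-01-01.2025-12-31": 1.10}},
    country_id="us",
)

baseline = Simulation(situation=situation)
reformed = Simulation(situation=situation, reform=reform)
delta_snap = reformed.calculate("snap", 2025)[0] - baseline.calculate("snap", 2025)[0]
```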
**Multi-turn evaluation.** Real-world policy analysis often involves iterative questioning: a user asks about one variable, then follows up about related variables or alternative scenarios. Evaluating models in multi-turn settings would better reflect actual use cases.

docs/index.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
---
title: "PolicyBench: Can AI models calculate tax and benefit outcomes?"
---

# PolicyBench: Can AI models calculate tax and benefit outcomes?

**Max Ghenis** (Cosilico)

## Abstract

Large language models have absorbed vast quantities of information about tax codes, benefit programs, and policy rules, yet their ability to translate this knowledge into precise quantitative outputs remains largely untested. PolicyBench is a benchmark that evaluates whether frontier AI models can accurately calculate US tax and benefit outcomes for specific households. We test three frontier models --- GPT-5.2, Claude Sonnet 4.5, and Claude Opus 4.6 --- across 14 federal and state tax-and-benefit programs for 100 diverse household scenarios spanning 12 states, income levels from $0 to $500,000, and varying family compositions.

We evaluate models under two conditions: (1) AI alone, where models rely solely on their parametric knowledge to estimate policy outcomes, and (2) AI with PolicyEngine, where models have tool access to the PolicyEngine-US microsimulation engine. Without tools, models achieve low accuracy across programs, with particularly large errors on means-tested benefits and programs involving complex phase-outs. With PolicyEngine tool access, models achieve near-perfect accuracy, as the microsimulation engine handles the computational complexity that models cannot reliably perform from memory alone.

These findings demonstrate that domain-specific computational tools are essential for reliable AI-assisted policy analysis. The choice of tool matters more than the choice of model: even the most capable frontier models cannot substitute for rigorous microsimulation when precise household-level calculations are required.

```{tableofcontents}
```

docs/introduction.md

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
---
title: Introduction
---

# Introduction

## The promise and peril of AI for policy analysis

Artificial intelligence is increasingly invoked as a tool for public policy analysis. Large language models (LLMs) can summarize legislation, explain eligibility rules, and draft policy memos with impressive fluency. Policymakers, journalists, and researchers have begun using these models to answer questions about how tax and benefit systems affect specific households --- questions like "How much would this family receive in SNAP benefits?" or "What is the marginal tax rate for a single parent earning $40,000 in California?"

These questions have precise, deterministic answers. The US tax code and benefit programs define exact formulas, phase-out schedules, income thresholds, and interaction effects that together determine a household's tax liability, credit amounts, and benefit eligibility. A correct answer requires not just knowledge of individual program rules but the ability to execute multi-step calculations that account for interactions across programs, state-specific provisions, and household-specific circumstances.

LLMs are trained on tax law, IRS publications, benefit program documentation, and policy analyses. They can often describe the rules governing a program in considerable detail. But describing rules and computing outcomes from those rules are fundamentally different tasks. The question motivating this paper is whether frontier AI models can bridge that gap --- whether their parametric knowledge of policy rules translates into accurate quantitative outputs for specific household scenarios.

## Why precision matters

Policy analysis is a domain where approximate answers can be worse than no answer at all. Consider a family evaluating whether to accept a raise that might push them above a benefit cliff, a tax preparer estimating a client's refundable credits, or a researcher modeling the distributional effects of a proposed reform. In each case, errors of even a few hundred dollars can lead to materially wrong conclusions.

The stakes are compounded by the complexity of the US tax-and-benefit system. Federal income tax alone involves multiple filing statuses, bracket structures, deductions, exemptions, and credits --- each with its own phase-in and phase-out schedules. Layered on top are state income taxes (with their own brackets and rules), means-tested benefits like SNAP and SSI (with asset tests, income disregards, and categorical eligibility rules), and tax credits like the EITC and CTC (with earned income requirements, child age limits, and investment income thresholds). These programs interact in ways that create effective marginal tax rates that are discontinuous, non-monotonic, and difficult to compute even for domain experts.

Microsimulation models exist precisely to handle this complexity. Tools like PolicyEngine-US encode the full logic of each program and compute exact outcomes for arbitrary household configurations. The question is whether AI models, armed with their training data, can approximate these computations --- or whether they require access to such tools to produce reliable results.

## Prior work

Benchmarking AI models on quantitative reasoning tasks is a well-established area. Mathematical reasoning benchmarks like GSM8K {cite}`cobbe2021gsm8k` and MATH {cite}`hendrycks2021math` evaluate models on multi-step arithmetic and algebraic problems. Domain-specific benchmarks exist for medical reasoning, legal analysis, and financial calculations.

However, benchmarks for tax and benefit computation are scarce. TaxBench evaluated LLMs on tax preparation questions but focused on qualitative understanding of tax rules rather than precise numerical computation for specific households. No prior benchmark, to our knowledge, has systematically evaluated frontier models on their ability to compute exact tax liabilities, credit amounts, and benefit levels for diverse household scenarios across multiple programs.

PolicyBench fills this gap. It provides a rigorous, reproducible benchmark that isolates the computational challenge: given a fully specified household and a specific policy variable, can the model produce the correct numerical answer? By testing under two conditions --- with and without tool access --- we can separate what models know from what they can compute.

## This paper's contributions

This paper makes three contributions:

1. **A new benchmark for AI-assisted policy analysis.** PolicyBench defines 100 household scenarios across 12 US states with varying income levels ($0--$500,000), filing statuses, and family compositions. For each scenario, we evaluate 14 tax-and-benefit variables, producing 1,400 ground-truth values computed by PolicyEngine-US. This benchmark is open-source and extensible to additional countries and programs.
2. **An empirical evaluation of frontier model capabilities.** We test GPT-5.2, Claude Sonnet 4.5, and Claude Opus 4.6 under both AI-alone and tool-augmented conditions. Our results quantify the gap between what models know about policy rules and what they can accurately compute.
3. **Evidence for tool-augmented policy analysis.** We show that tool access transforms model performance from unreliable to near-perfect, demonstrating that the computational tool matters more than the choice of frontier model. This finding has direct implications for how AI systems should be designed for policy analysis applications.
