Conversation
Two-condition benchmark testing whether AI models can accurately calculate tax/benefit outcomes without tools vs with PolicyEngine tool access.

- 5 frontier models (Claude Opus/Sonnet, GPT-4o/o3, Gemini 2.5 Pro)
- 14 programs (federal tax, credits, benefits, state tax, rates)
- 100 deterministic household scenarios across 12 states
- EDSL for AI-alone eval, LiteLLM for tool-calling eval
- Full test suite (46 tests), ruff lint, GitHub Actions CI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
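The with-tools condition this commit describes can be sketched with LiteLLM's OpenAI-style tool calling. Everything below is illustrative, not the benchmark's actual code: the tool name `calculate_household`, its schema, and the stubbed PolicyEngine call are assumptions.

```python
# Hypothetical sketch of the with-tools condition: the model is offered a
# PolicyEngine-style calculator tool via LiteLLM, and the benchmark executes
# whatever call it makes. Tool name and schema are assumptions.
import json
import litellm

TOOLS = [{
    "type": "function",
    "function": {
        "name": "calculate_household",
        "description": "Compute one tax/benefit variable for a household.",
        "parameters": {
            "type": "object",
            "properties": {
                "household": {"type": "object"},
                "variable": {"type": "string"},
            },
            "required": ["household", "variable"],
        },
    },
}]

def policyengine_calculate(household: dict, variable: str) -> float:
    """Stub: the real benchmark would invoke PolicyEngine here."""
    raise NotImplementedError

def run_with_tools(model: str, prompt: str) -> float:
    messages = [{"role": "user", "content": prompt}]
    response = litellm.completion(model=model, messages=messages, tools=TOOLS)
    # Assume the model issued a tool call; a robust harness would handle
    # the no-call case too (see the fallback-household fix below).
    call = response.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)
    return policyengine_calculate(args["household"], args["variable"])
```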
…s thesis

- Drop EDSL dependency, use LiteLLM for both conditions
- Add diskcache for reproducible/cost-efficient re-runs
- Update models: GPT-5.2, Gemini 3 Pro (latest as of Feb 2026)
- Fix handle_tool_call to use fallback household when model omits it
- Pilot results (5 scenarios × 3 programs × GPT-5.2):
  - Without tools: 80% within 10%, MAE $1,990, misses SNAP entirely
  - With tools: 100% exact match, MAE $0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
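The diskcache re-run pattern mentioned above is roughly the following; this is a minimal sketch, and the cache directory and wrapper name are assumptions rather than the repo's actual identifiers.

```python
# Hedged sketch of diskcache-backed LiteLLM calls for reproducible,
# cost-efficient re-runs. Identical (model, prompt) pairs hit the disk
# cache instead of the API on subsequent runs.
import litellm
from diskcache import Cache

cache = Cache(".llm_cache")  # illustrative path

@cache.memoize()
def cached_completion(model: str, prompt: str) -> str:
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # determinism also needs a fixed decoding setting
    )
    return response.choices[0].message.content
```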
- RESULTS.md with partial results (4000/4200 no-tools, 1400/4200 with-tools)
- Key finding: tools > models (100% vs 74.5% within-10% accuracy)
- JB2 paper in docs/ (intro, methodology, results, discussion, references)
- Add retry logic with exponential backoff to both eval modules
- Add incremental CSV saving to prevent data loss on crashes
- Prediction CSVs for completed model runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
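A minimal sketch of the retry-with-exponential-backoff and incremental-CSV patterns this commit adds; the function names and row schema are illustrative, not the repo's actual code.

```python
# Retry transient API failures with exponential backoff, and append each
# prediction to disk as it completes so a crash loses at most one row.
import csv
import time

def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn(), doubling the sleep after each failure: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

def append_prediction(path: str, row: dict) -> None:
    """Append one result row, writing the header only if the file is empty."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if f.tell() == 0:
            writer.writeheader()
        writer.writerow(row)
```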
- Fix with-tools extraction: use tool result directly instead of parsing verbose model responses (which grabbed income amounts, not answers)
- All 4,200 with-tools predictions now $0 MAE, 100% within-10%
- No-tools results: Opus 70.8%, Sonnet 61.9%, GPT-5.2 62.1%
- Update React app with real benchmark data (replacing mock data)
- Fix paper data loading to use single predictions.csv
- Clean up partial/stale result files
- All 53 tests pass, lint clean

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
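The extraction fix amounts to preferring the executed tool call's result over the model's prose, since the prose parser was matching income figures rather than the computed answer. A hedged sketch with hypothetical names:

```python
# Prefer the deterministic tool result; only regex-parse free text when no
# tool call was made. Both function names here are illustrative.
import re

def parse_dollar_amount(text: str) -> float | None:
    """Naive parser: take the last dollar amount in the response. This is
    exactly the kind of heuristic that previously grabbed the wrong number."""
    matches = re.findall(r"\$([\d,]+(?:\.\d+)?)", text)
    return float(matches[-1].replace(",", "")) if matches else None

def extract_prediction(tool_result: float | None, model_text: str) -> float | None:
    if tool_result is not None:
        return tool_result  # authoritative: PolicyEngine computed it
    return parse_dollar_amount(model_text)  # fallback only
```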
All 8,400 predictions (3 models × 100 scenarios × 14 programs × 2 conditions) now complete. All three models achieve 100% accuracy with PolicyEngine tools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sonnet 4.6 edges out Opus (72.3% vs 70.8% within 10%) and dramatically improves over Sonnet 4.5 (72.3% vs 61.9%), but remains far from the 100% all models achieve with tools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Complete rewrite of PolicyBench as a rigorous benchmark measuring whether AI models can accurately calculate US tax/benefit outcomes — and whether PolicyEngine tools bridge the gap.
Test plan
🤖 Generated with Claude Code