Cost-per-correct headline metric with versioned pricing (HELM methodology) #34
jphein wants to merge 2 commits into
…dology) (#26) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds versioned model pricing support (YAML-backed) and utilities to compute cost-per-correct metrics, with accompanying tests.
Changes:
- Introduces `sme.eval.pricing` with pricing table loading and cost computation helpers.
- Adds a May 2026 pricing snapshot YAML.
- Adds unit tests validating pricing lookup, token cost calculation, and cost-per-correct behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `sme/eval/pricing.py` | Implements YAML-backed pricing tables and cost-per-correct computation. |
| `sme/eval/pricing_2026_05.yaml` | Adds the initial versioned pricing snapshot used by the loader/tests. |
| `tests/test_pricing.py` | Adds tests covering pricing loading, lookup, and cost-per-correct outputs. |
`sme/eval/pricing.py`

    raw = yaml.safe_load(path.read_text())
    models = {}
    for model_id, rates in raw.get("models", {}).items():
        models[model_id] = ModelPricing(
            model_id=model_id,
            input_per_1m=float(rates.get("input_per_1m", 0)),
            output_per_1m=float(rates.get("output_per_1m", 0)),
        )
    return PricingTable(version=version, models=models)

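The per-million-token rates loaded in the excerpt above translate into a per-call cost with simple arithmetic. The following is a minimal sketch, not the PR's code: the `ModelPricing` stand-in only mirrors the three fields visible in the excerpt, `token_cost_usd` is a hypothetical helper name, and the rates are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Illustrative stand-in; fields mirror those visible in the review excerpt."""
    model_id: str
    input_per_1m: float   # USD per 1,000,000 input tokens
    output_per_1m: float  # USD per 1,000,000 output tokens


def token_cost_usd(pricing: ModelPricing, input_tokens: int, output_tokens: int) -> float:
    """Hypothetical helper: USD cost of one call at per-million-token rates."""
    return (
        input_tokens / 1_000_000 * pricing.input_per_1m
        + output_tokens / 1_000_000 * pricing.output_per_1m
    )


# Made-up rates, not taken from the pricing snapshot.
example = ModelPricing(model_id="example-model", input_per_1m=2.0, output_per_1m=8.0)
cost = token_cost_usd(example, input_tokens=500_000, output_tokens=250_000)

# A local model with zero rates costs nothing regardless of token counts.
local = ModelPricing(model_id="local-example", input_per_1m=0.0, output_per_1m=0.0)
local_cost = token_cost_usd(local, input_tokens=500_000, output_tokens=250_000)
```

Because the loader in the excerpt falls back to `0` when a rate key is absent, arithmetic like this would treat a model with a missing rate the same as a free one.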
`sme/eval/pricing.py`

    if not path.exists():
        raise FileNotFoundError(
            f"pricing table not found: {path}. "
            f"Available: {[p.stem for p in PRICING_DIR.glob('pricing_*.yaml')]}"
        )

`tests/test_pricing.py`

    from sme.eval.pricing import load_pricing_table, cost_per_correct, ModelPricing, PricingTable
    import pytest


    def test_load_default_pricing_table():

`tests/test_pricing.py`

    result = cost_per_correct(1.0, 10, 100)
    assert result["cost_per_correct_usd"] == 0.1
    assert result["cost_per_query_usd"] == 0.01

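Exact `==` comparisons on float costs, as in the assertions above, can break once the total is accumulated from per-query amounts. The sketch below illustrates the failure mode with the standard library's `math.isclose`; in a pytest suite, `pytest.approx` plays the same role. The per-query amounts are hypothetical.

```python
import math

# Summing per-query costs accumulates binary rounding error.
total_cost_usd = 0.1 + 0.2  # two hypothetical per-query costs in USD

# Exact equality fails even though the values agree to many digits.
exact_match = total_cost_usd == 0.3

# A tolerance-based comparison treats the two values as equal.
close_match = math.isclose(total_cost_usd, 0.3, rel_tol=1e-9)
```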
Cost-per-correct is the right headline metric: pure-accuracy comparisons across systems with wildly different inference costs are misleading, and HELM's framing is the closest thing to a standard in the literature. The versioned YAML pricing tables are exactly the right shape: model prices move, and versioning lets historical readings stay reproducible.
CI is red only on a single ruff finding: the unused `PricingTable` import in the tests. Two small follow-ups worth doing in the same touch:
Framing question: see PR #33 comment on whether Cat 7 (cost) and Cat 7.b (latency) want a combined readout or two side-by-side. Resolving that before merge lets both PRs land with a consistent reporting shape.
…pprox
- Remove unused `PricingTable` import from tests (was failing CI ruff)
- `load_pricing_table` now uses the YAML's declared version as authoritative and raises `ValueError` on filename/YAML version drift instead of silently trusting the constructor argument
- Use `pytest.approx()` for cost-per-correct float assertions

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
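The version-drift check described in this commit could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `resolve_version` is a hypothetical name, the YAML is assumed to carry a top-level `version` key, and the filename is assumed to follow `pricing_<version>.yaml`. Raising when the field is absent is a choice made for this sketch only.

```python
from pathlib import Path


def resolve_version(path: Path, raw: dict) -> str:
    """Hypothetical check: the YAML's declared version wins; drift from the filename is an error.

    `raw` is the already-parsed YAML mapping. The "version" key and the
    pricing_<version>.yaml naming are assumptions made for this sketch.
    """
    filename_version = path.stem.removeprefix("pricing_")
    declared = raw.get("version")
    if declared is None:
        # Sketch-only choice: refuse to guess when the file declares no version.
        raise ValueError(f"{path.name}: no 'version' field declared")
    if str(declared) != filename_version:
        raise ValueError(
            f"{path.name}: filename says {filename_version!r} "
            f"but YAML declares {declared!r}"
        )
    return str(declared)


# Matching filename and declared version resolve to the declared value.
version = resolve_version(Path("pricing_2026_05.yaml"), {"version": "2026_05"})
```

Failing loudly on drift keeps a renamed or hand-edited file from silently reporting costs under the wrong table version.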
Pushed 60978a0:
On the Cat 7 framing question (also on #33): going with two side-by-side readouts, so cost (this PR) and latency (#33) stay independent and each can land on its own. A fused "cost-per-correct at p95" headline can come later if it earns its keep, without blocking either now. 🫏
Verified: YAML
One edge case worth flagging (small follow-up, not a blocker):
Ship as-is; the missing-field case is rare enough not to gate the merge.
Summary
Closes #26.
Adds cost-per-correct computation following HELM methodology (Liang et al. 2022):
- `sme/eval/pricing.py`: `PricingTable`, `ModelPricing`, `cost_per_correct()` with versioned pricing table loading
- `sme/eval/pricing_2026_05.yaml`: May 2026 pricing snapshot covering OpenAI (gpt-4o, o1, o4-mini), Anthropic (Claude Opus/Sonnet/Haiku), open-source endpoints (Llama 3.1, Qwen 2.5), and local (zero-cost)
- `cost_per_correct()` returns `total_cost_usd`, `cost_per_correct_usd`, `cost_per_query_usd`

Design decisions:
- Versioned pricing tables (`pricing_YYYY_MM.yaml`) per HELM convention: pricing drifts, so re-runs must declare which table they used
- `cost_estimator()` ABC method deferred to follow-up: this PR provides the computation engine; adapter integration comes separately

Test plan
- `test_pricing.py`: table loading, model lookup, cost computation, missing model handling, zero-correct edge case

🫏 Generated with Claude Code
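
For readers who want the arithmetic behind the three returned fields, here is a minimal sketch. It is not the PR's implementation: the argument order is inferred from the test call `cost_per_correct(1.0, 10, 100)` quoted in the review, and returning `None` when there are no correct answers (or no queries) is an assumption, since the PR's zero-correct behavior is not shown here.

```python
from typing import Optional


def cost_per_correct(total_cost_usd: float, n_correct: int, n_queries: int) -> dict:
    """Sketch of the metric: total spend divided by correct answers and by queries.

    Argument order is inferred from the reviewed test; the None fallback
    for zero denominators is an assumption of this sketch.
    """
    per_correct: Optional[float] = total_cost_usd / n_correct if n_correct > 0 else None
    per_query: Optional[float] = total_cost_usd / n_queries if n_queries > 0 else None
    return {
        "total_cost_usd": total_cost_usd,
        "cost_per_correct_usd": per_correct,
        "cost_per_query_usd": per_query,
    }


# Same inputs as the reviewed test: $1.00 spent, 10 correct out of 100 queries.
result = cost_per_correct(1.0, 10, 100)

# Zero correct answers: the per-correct cost is undefined, reported here as None.
zero = cost_per_correct(1.0, 0, 100)
```

Dividing by correct answers rather than by queries is what makes the metric penalize a cheap but inaccurate system: the same spend spread over fewer correct answers yields a higher cost per correct.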