Skip to content

Cost-per-correct headline metric with versioned pricing (HELM methodology)#34

Open
jphein wants to merge 2 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/cat7-cost-per-correct
Open

Cost-per-correct headline metric with versioned pricing (HELM methodology)#34
jphein wants to merge 2 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/cat7-cost-per-correct

Conversation

@jphein
Copy link
Copy Markdown
Contributor

@jphein jphein commented May 25, 2026

Summary

Closes #26.

Adds cost-per-correct computation following HELM methodology (Liang et al. 2022):

  • sme/eval/pricing.pyPricingTable, ModelPricing, cost_per_correct() with versioned pricing table loading
  • sme/eval/pricing_2026_05.yaml — May 2026 pricing snapshot covering OpenAI (gpt-4o, o1, o4-mini), Anthropic (Claude Opus/Sonnet/Haiku), open-source endpoints (Llama 3.1, Qwen 2.5), and local (zero-cost)
  • cost_per_correct() returns total_cost_usd, cost_per_correct_usd, cost_per_query_usd

Design decisions:

  • Pricing tables are versioned YAML files (pricing_YYYY_MM.yaml) per HELM convention — pricing drifts, so re-runs must declare which table they used
  • No new dependencies (yaml is already a core dep)
  • cost_estimator() ABC method deferred to follow-up — this PR provides the computation engine; adapter integration comes separately
  • Local models priced at $0 — amortized hardware cost tracking is out of scope

Test plan

  • test_pricing.py — table loading, model lookup, cost computation, missing model handling, zero-correct edge case
  • Existing tests pass unchanged

🫏 Generated with Claude Code

…dology) (#26)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 25, 2026 01:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds versioned model pricing support (YAML-backed) and utilities to compute cost-per-correct metrics, with accompanying tests.

Changes:

  • Introduces sme.eval.pricing with pricing table loading and cost computation helpers.
  • Adds a May 2026 pricing snapshot YAML.
  • Adds unit tests validating pricing lookup, token cost calculation, and cost-per-correct behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
sme/eval/pricing.py Implements YAML-backed pricing tables and cost-per-correct computation.
sme/eval/pricing_2026_05.yaml Adds the initial versioned pricing snapshot used by the loader/tests.
tests/test_pricing.py Adds tests covering pricing loading, lookup, and cost-per-correct outputs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread sme/eval/pricing.py Outdated
Comment on lines +59 to +67
raw = yaml.safe_load(path.read_text())
models = {}
for model_id, rates in raw.get("models", {}).items():
models[model_id] = ModelPricing(
model_id=model_id,
input_per_1m=float(rates.get("input_per_1m", 0)),
output_per_1m=float(rates.get("output_per_1m", 0)),
)
return PricingTable(version=version, models=models)
Comment thread sme/eval/pricing.py
if not path.exists():
raise FileNotFoundError(
f"pricing table not found: {path}. "
f"Available: {[p.stem for p in PRICING_DIR.glob('pricing_*.yaml')]}"
Comment thread tests/test_pricing.py
from sme.eval.pricing import load_pricing_table, cost_per_correct, ModelPricing, PricingTable
import pytest

def test_load_default_pricing_table():
Comment thread tests/test_pricing.py Outdated
Comment on lines +34 to +36
result = cost_per_correct(1.0, 10, 100)
assert result["cost_per_correct_usd"] == 0.1
assert result["cost_per_query_usd"] == 0.01
@M0nkeyFl0wer
Copy link
Copy Markdown
Owner

Cost-per-correct is the right headline metric — pure-accuracy comparisons across systems with wildly different inference costs are misleading, and HELM's framing is the closest thing to a standard in the literature. The versioned YAML pricing tables are exactly the right shape — model prices move, and versioning lets historical readings stay reproducible.

CI is red only on a single ruff finding: unused PricingTable import in tests/test_pricing.py:2. One-line fix and this is ready.

Two small follow-ups worth doing in the same touch:

  1. Loader currently discards the YAML's declared version field and uses the constructor argument instead. Either validate they match (and raise on mismatch) or use the YAML's version as authoritative — silent drift between filename, YAML field, and PricingTable.version is the kind of confusion that bites six months later.
  2. Float equality in test_pricing.py:36 — use pytest.approx().

Framing question: see PR #33 comment on whether Cat 7 (cost) and Cat 7.b (latency) want a combined readout or two side-by-side. Resolving that before merge lets both PRs land with a consistent reporting shape.

…pprox

- Remove unused PricingTable import from tests (was failing CI ruff)
- load_pricing_table now uses the YAML's declared version as authoritative
  and raises ValueError on filename/YAML version drift instead of silently
  trusting the constructor argument
- Use pytest.approx() for cost-per-correct float assertions

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jphein
Copy link
Copy Markdown
Contributor Author

jphein commented May 27, 2026

Pushed 60978a0:

  1. Dropped the unused PricingTable import in tests/test_pricing.py — that was the only red-CI finding; checks are green now on 3.10/3.11/3.12.
  2. The YAML's declared version is now authoritative: the loader uses it and raises ValueError on mismatch with the requested version, so filename / YAML field / PricingTable.version can't silently drift.
  3. The float equality in test_pricing.py now uses pytest.approx().

On the Cat 7 framing question (also on #33): going with two side-by-side readouts — cost (this PR) and latency (#33) stay independent so each can land on its own. A fused "cost-per-correct at p95" headline can come later if it earns its keep, without blocking either now.

🫏

@M0nkeyFl0wer
Copy link
Copy Markdown
Owner

Verified — YAML version: is now authoritative with ValueError on filename/field drift, unused import gone, pytest.approx in place. "Version is the YAML's, not the caller's claim" is the right reading of HELM-versioned pricing.

One edge case worth flagging (small follow-up, not a blocker): declared_version = raw.get("version") returns None if the field is missing entirely, and the if declared_version is not None and ... != version guard then silently accepts the constructor argument. So mismatch raises, but a missing field doesn't — which is the same silent-drift shape this fix was trying to close. If every pricing table should self-declare its version (probably the right answer for HELM-style reproducibility), the guard should require declared_version to be present and raise otherwise. Worth a small follow-up so a future YAML can't lose its version: line and become caller-trusted by accident.

Ship as-is; the missing-field case is rare enough not to gate the merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cat 7 — cost-per-correct headline (HELM methodology), versioned pricing

3 participants