Cost-per-correct headline metric with versioned pricing (HELM methodology) by jphein · Pull Request #34 · M0nkeyFl0wer/multipass-structural-memory-eval

jphein · 2026-05-25T01:01:48Z

Summary

Closes #26.

Adds cost-per-correct computation following HELM methodology (Liang et al. 2022):

sme/eval/pricing.py — PricingTable, ModelPricing, cost_per_correct() with versioned pricing table loading
sme/eval/pricing_2026_05.yaml — May 2026 pricing snapshot covering OpenAI (gpt-4o, o1, o4-mini), Anthropic (Claude Opus/Sonnet/Haiku), open-source endpoints (Llama 3.1, Qwen 2.5), and local (zero-cost)
cost_per_correct() returns total_cost_usd, cost_per_correct_usd, cost_per_query_usd

Design decisions:

Pricing tables are versioned YAML files (pricing_YYYY_MM.yaml) per HELM convention — pricing drifts, so re-runs must declare which table they used
No new dependencies (yaml is already a core dep)
cost_estimator() ABC method deferred to follow-up — this PR provides the computation engine; adapter integration comes separately
Local models priced at $0 — amortized hardware cost tracking is out of scope

Test plan

test_pricing.py — table loading, model lookup, cost computation, missing model handling, zero-correct edge case
Existing tests pass unchanged

🫏 Generated with Claude Code

…dology) (#26) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds versioned model pricing support (YAML-backed) and utilities to compute cost-per-correct metrics, with accompanying tests.

Changes:

Introduces sme.eval.pricing with pricing table loading and cost computation helpers.
Adds a May 2026 pricing snapshot YAML.
Adds unit tests validating pricing lookup, token cost calculation, and cost-per-correct behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
`sme/eval/pricing.py`	Implements YAML-backed pricing tables and cost-per-correct computation.
`sme/eval/pricing_2026_05.yaml`	Adds the initial versioned pricing snapshot used by the loader/tests.
`tests/test_pricing.py`	Adds tests covering pricing loading, lookup, and cost-per-correct outputs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    raw = yaml.safe_load(path.read_text())
+    models = {}
+    for model_id, rates in raw.get("models", {}).items():
+        models[model_id] = ModelPricing(
+            model_id=model_id,
+            input_per_1m=float(rates.get("input_per_1m", 0)),
+            output_per_1m=float(rates.get("output_per_1m", 0)),
+        )
+    return PricingTable(version=version, models=models)


+    if not path.exists():
+        raise FileNotFoundError(
+            f"pricing table not found: {path}. "
+            f"Available: {[p.stem for p in PRICING_DIR.glob('pricing_*.yaml')]}"


+from sme.eval.pricing import load_pricing_table, cost_per_correct, ModelPricing, PricingTable
+import pytest
+
+def test_load_default_pricing_table():


+    result = cost_per_correct(1.0, 10, 100)
+    assert result["cost_per_correct_usd"] == 0.1
+    assert result["cost_per_query_usd"] == 0.01


M0nkeyFl0wer · 2026-05-25T23:34:11Z

Cost-per-correct is the right headline metric — pure-accuracy comparisons across systems with wildly different inference costs are misleading, and HELM's framing is the closest thing to a standard in the literature. The versioned YAML pricing tables are exactly the right shape — model prices move, and versioning lets historical readings stay reproducible.

CI is red only on a single ruff finding: unused PricingTable import in tests/test_pricing.py:2. One-line fix and this is ready.

Two small follow-ups worth doing in the same touch:

Loader currently discards the YAML's declared version field and uses the constructor argument instead. Either validate they match (and raise on mismatch) or use the YAML's version as authoritative — silent drift between filename, YAML field, and PricingTable.version is the kind of confusion that bites six months later.
Float equality in test_pricing.py:36 — use pytest.approx().

Framing question: see PR #33 comment on whether Cat 7 (cost) and Cat 7.b (latency) want a combined readout or two side-by-side. Resolving that before merge lets both PRs land with a consistent reporting shape.

…pprox - Remove unused PricingTable import from tests (was failing CI ruff) - load_pricing_table now uses the YAML's declared version as authoritative and raises ValueError on filename/YAML version drift instead of silently trusting the constructor argument - Use pytest.approx() for cost-per-correct float assertions Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jphein · 2026-05-27T14:11:03Z

Pushed 60978a0:

Dropped the unused PricingTable import in tests/test_pricing.py — that was the only red-CI finding; checks are green now on 3.10/3.11/3.12.
The YAML's declared version is now authoritative: the loader uses it and raises ValueError on mismatch with the requested version, so filename / YAML field / PricingTable.version can't silently drift.
The float equality in test_pricing.py now uses pytest.approx().

On the Cat 7 framing question (also on #33): going with two side-by-side readouts — cost (this PR) and latency (#33) stay independent so each can land on its own. A fused "cost-per-correct at p95" headline can come later if it earns its keep, without blocking either now.

🫏

M0nkeyFl0wer · 2026-05-27T14:56:21Z

Verified — YAML version: is now authoritative with ValueError on filename/field drift, unused import gone, pytest.approx in place. "Version is the YAML's, not the caller's claim" is the right reading of HELM-versioned pricing.

One edge case worth flagging (small follow-up, not a blocker): declared_version = raw.get("version") returns None if the field is missing entirely, and the if declared_version is not None and ... != version guard then silently accepts the constructor argument. So mismatch raises, but a missing field doesn't — which is the same silent-drift shape this fix was trying to close. If every pricing table should self-declare its version (probably the right answer for HELM-style reproducibility), the guard should require declared_version to be present and raise otherwise. Worth a small follow-up so a future YAML can't lose its version: line and become caller-trusted by accident.

Ship as-is; the missing-field case is rare enough not to gate the merge.

feat(eval): cost-per-correct with versioned pricing table (HELM metho…

19ef532

…dology) (#26) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 25, 2026 01:01

Copilot AI reviewed May 25, 2026

View reviewed changes

M0nkeyFl0wer mentioned this pull request May 25, 2026

Cat 7.b: query latency distribution (YCSB percentiles) #33

Open

2 tasks

This was referenced May 27, 2026

Random + oracle retrieval baselines (TREC bounds) #32

Open

test+refactor: adapter contract testkit, registry allowlist, test coverage (#8, #18, #19, #20) #30

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cost-per-correct headline metric with versioned pricing (HELM methodology)#34

Cost-per-correct headline metric with versioned pricing (HELM methodology)#34
jphein wants to merge 2 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/cat7-cost-per-correct

jphein commented May 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

M0nkeyFl0wer commented May 25, 2026

Uh oh!

jphein commented May 27, 2026

Uh oh!

M0nkeyFl0wer commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jphein commented May 25, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

M0nkeyFl0wer commented May 25, 2026

Uh oh!

jphein commented May 27, 2026

Uh oh!

M0nkeyFl0wer commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants