Scope. scripts/eval_harness.py, scripts/assemble.py, scripts/gen_*.py,
scripts/qa_build.py, SCHEMA.md, design/judge-minimization.md, and the
merged cachebench.jsonl (2,000 rows).
Method. Static reading + dynamic probes: ran every baseline, re-ran every
generator in fresh subprocesses (with varying PYTHONHASHSEED), exercised the
harness on hand-crafted edge cases, byte-compared re-merges, validated metric
formulae against numpy/Wilson-formula references.
Code-quality grade: B- (passes happy-path; reproducibility, schema-validation, and verification-tier wiring all have material defects).
The harness's headline metrics (precision/recall/F1/FHR/Wilson CI/percentile) are arithmetically correct against published formulae. The assembler enforces the documented row-count quotas and the merged dataset hits 2,000/2,000 with all per-domain × per-label cells matching the SCHEMA matrix exactly. Three built-in baselines run end-to-end and produce sensible reports.
But the audit found:
- Reproducibility is broken for two generators (`gen_multilingual.py`, `qa_build.py`). They seed `random.Random(hash(...))` on Python strings/tuples; CPython randomizes string `hash()` per process unless `PYTHONHASHSEED` is pinned to a fixed value (e.g. `PYTHONHASHSEED=0`). Two fresh runs produce different SHA-256s. This is a P0 reproducibility regression for ~640 of the 2,000 rows.
- Construction-method mix drifts severely from `SCHEMA.md` (e.g. `lm_generated` is 767 vs target 250; `real_traffic` is 310 vs target 600). Nothing flags this: `assemble.py` checks label quotas but not construction quotas.
- `assemble.py` does not validate the `verification_method` enum. 380 rows use `rubric`, which is not in SCHEMA.md; SCHEMA.md lists `code_exec`, `sql_canonical`, `set_match`, which never appear. The two documents are out of sync and the assembler is silent.
- `SCHEMA.md` itself doesn't list `rubric` in the `verification_method` enum, but generators emit it 380×. Either fix the doc or fix the data.
- The `validate_row` return value is dead code (never assigned) AND uses a buggy `errors[-5:].startswith(rid + ":")` heuristic that yields false positives when prior rows' errors happen to share an id prefix.
- The harness's `load_bench` is too tolerant: unknown fields are silently dropped, and missing fields silently become `None`. A row with `binary_label: null` crashes downstream in `truth_is_hit` with `AttributeError`.
- The harness crashes if a cache returns `None` (no friendly error) or sets `cost_usd=None`. There is no input validation on the `Verdict` it receives.
- Verification methods are referenced but not implemented. SCHEMA.md and `judge-minimization.md` describe an elaborate tier-1/2/3 verification stack (sympy, ast_diff, json_schema, llm_judge, rubric…); the harness implements none of them. It only compares the cache's `is_hit` against the pre-computed `binary_label`. This is internally consistent (HIT/MISS is the headline metric) but undocumented and surprising. Anyone reading SCHEMA.md expects the harness to honor `verification_method`; it doesn't.
Below: bug list, then patches.
| # | File:line | Severity | Description |
|---|---|---|---|
| C1 | scripts/gen_multilingual.py:85, 114 | Critical | Uses `random.Random(hash(lang) & 0xFFFF).shuffle(idx)`. CPython hashes strings with a per-process random salt unless `PYTHONHASHSEED` is pinned to a fixed value, so two fresh runs produce different PAWS-X picks → different multilingual.jsonl. Verified empirically: two fresh subprocess runs produced SHA-256 3eec67d… vs d621241…. |
| C2 | scripts/qa_build.py:200, 405 | Critical | Same bug pattern: `random.Random(hash(intent) % (2**32))` and `random.Random((hash((ia, ib)) & 0xFFFFFFFF) ^ 1)`. ~490 of 2,000 rows depend on this. Verified: two fresh runs produced different SHA-256s; with `PYTHONHASHSEED=0` vs `=1` results also differ. |
| C3 | scripts/qa_build.py whole file | Major | qa_build.py is missing a shebang line and an `if __name__ == "__main__":` guard. The script executes module-level side effects on import (the assertion-heavy build sequence). Anything that imports it as a module will run the whole build. (gen_multilingual.py, gen_personalized.py, gen_creative.py have the same issue, but their reproducibility is intact.) |
| C4 | cachebench.jsonl distribution | Major | Construction-method counts deviate hugely from SCHEMA.md §"Construction-method mix": lm_generated is 767/250 (+207%), real_traffic is 310/600 (-48%), intent_bottom_up is 205/500 (-59%), adversarial_perturbation is 533/500 (+7%), existing_benchmark is 185/150 (+23%). Either fix generators or rewrite SCHEMA.md. |
| C5 | cachebench.jsonl verification_method | Major | 380 rows have `verification_method: rubric` but SCHEMA.md line 30 doesn't include rubric in the allowed enum. Conversely, code_exec, sql_canonical, set_match are in the schema enum but never appear in the dataset. |
| C6 | cachebench.jsonl difficulty mix | Minor | SCHEMA.md targets 30/50/20 → 600/1000/400. Actual: 639/869/492 (+39 easy, -131 medium, +92 hard). Within tolerance but worth documenting. |
| # | File:line | Severity | Description |
|---|---|---|---|
| M1 | scripts/assemble.py:76 | Major | `return not any(e.startswith(f"{rid}:") for e in errors[-5:])` is a fragile sliding-window check. It (a) silently misses errors when more than 5 are pushed by the same row; (b) yields false positives when prior rows' errors happen to share an id prefix. Reproduced: `validate_row(good_row, ...)` returns False when `errors = ['X: missing fields']` and `good_row.id == 'X'`. The return value is never used in `main()`, so the bug is latent, but the dead code is a maintenance trap. |
| M2 | scripts/assemble.py:54-77 | Major | validate_row never checks verification_method against the SCHEMA enum, never checks construction_method against the SCHEMA enum, and never checks that verification_payload is None-or-dict. All three are SCHEMA.md requirements. |
| M3 | scripts/assemble.py:39-50 | Major | EXPECTED_QUOTA only enforces per-domain × per-label rows. It does not enforce SCHEMA.md's construction-method mix (Section "Construction-method mix") nor the difficulty mix (Section "Difficulty distribution"). The "OK/DRIFT" audit only covers labels, so reviewers think the dataset is clean when half the documented mixes are off. |
| M4 | scripts/eval_harness.py:78-87 | Major | load_bench uses `Row(**{k: d.get(k) for k in Row.__dataclass_fields__})`. Two failure modes: (i) unknown JSONL fields are silently dropped; (ii) `d.get(k)` returns None for missing fields. A row with `binary_label: null` (or missing) crashes in truth_is_hit with `'NoneType' object has no attribute 'upper'`. No friendly error. |
| M5 | scripts/eval_harness.py:75 | Major | truth_is_hit is `self.binary_label.upper() == "HIT"`. An unrecognized string label (e.g. "true") silently maps to MISS, a lowercase "hit" is silently accepted as HIT, and a non-string label (e.g. 1, None) raises AttributeError. Should explicitly validate-or-raise. |
| M6 | scripts/eval_harness.py:294, 297 | Major | No validation that verdict.decision_ms ≥ 0, that verdict.cost_usd is a finite number, or that verdict.tier_used is a non-empty string. A cache returning `Verdict(cost_usd=None)` causes TypeError on the sum. A cache returning `Verdict(decision_ms=float('inf'))` poisons the percentile. |
| M7 | scripts/eval_harness.py:290 | Major | `verdict = cache_decide(row)`: if the cache raises, the entire evaluation aborts. No try/except to mark that row as a "decision error" and continue. For a 2,000-row benchmark you want the rest of the run regardless. |
| M8 | scripts/eval_harness.py:5 | Minor | Docstring claims the cache exposes `decide(row) -> Verdict`. The actual code calls `cache_decide(row)` (a Callable); the built-in baselines use `__call__`. Either the docstring or the contract is wrong. |
| M9 | scripts/eval_harness.py:188 | Minor | The by_label docstring says "restricted to MISS labels for FHR breakdown" but at line 302 the harness puts ALL labels (including EQUIV / PARA_SAFE) into the dict. `to_markdown()` only iterates `["RELATED_UNSAFE", "ADVERSARIAL", "UNRELATED"]` so the extra entries are dead weight. |
| M10 | scripts/eval_harness.py:376 | Minor | If --cache-module is given but the module exposes neither a class nor a top-level cache attribute, the user gets `AttributeError: module 'X' has no attribute 'cache'`, which is not a friendly diagnostic. |
| M11 | scripts/eval_harness.py:373-376 | Minor | `cache = getattr(mod, args.cache_class)()` instantiates with no args. If the cache class needs config (model name, embedder path, threshold), the user can't pass it via CLI. No --cache-args flag, no JSON-config option. |
| M12 | scripts/qa_build.py:101, 200, 405 and scripts/gen_multilingual.py:85, 114 | Major | `hash()` of `str` and `bytes` objects (and of tuples containing them) is process-randomized. Treating `hash(s) & 0xFFFF` as a deterministic seed is a textbook reproducibility bug. |
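
M11 observes that constructor arguments cannot be passed from the CLI. One possible shape, sketched here, is a `--cache-args` flag that carries a JSON object and is splatted into the constructor. The flag name, the `parse_cache_args` and `build_cache` helpers, and the `ThresholdCache` class are all hypothetical; none of them exist in the current harness.

```python
import argparse
import json
from typing import List


class ThresholdCache:
    """Hypothetical cache class, used only to illustrate the flag."""

    def __init__(self, threshold: float = 0.9, model: str = "none"):
        self.threshold = threshold
        self.model = model


def parse_cache_args(raw: str) -> dict:
    """Parse the --cache-args JSON string and insist on a JSON object."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise argparse.ArgumentTypeError(f"not valid JSON: {exc}") from exc
    if not isinstance(value, dict):
        raise argparse.ArgumentTypeError("must be a JSON object")
    return value


def build_cache(argv: List[str]) -> ThresholdCache:
    ap = argparse.ArgumentParser()
    # Assumed flag name; the default of {} keeps the no-config case working.
    ap.add_argument("--cache-args", type=parse_cache_args, default={})
    args = ap.parse_args(argv)
    return ThresholdCache(**args.cache_args)


cache = build_cache(["--cache-args", '{"threshold": 0.8, "model": "demo"}'])
print(cache.threshold, cache.model)
```

Because argparse converts `ArgumentTypeError` into a normal usage error, malformed JSON produces a readable message instead of a traceback.
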
| # | File:line | Severity | Description |
|---|---|---|---|
| m1 | scripts/gen_personalized.py:15, gen_multilingual.py:19, gen_creative.py:14, gen_conv.py:10, gen_multiturn.py:13, gen_code.py:36, gen_math.py:936, qa_build.py:32-33, gen_tool.py:22 | Minor | Every generator hardcodes `/home/bud/ditto/budCache/research/...` paths. A third party cannot rerun these without sed-replacing or symlinking. Take an `--out` and `--datasets-dir` CLI flag, or read from `$CACHEBENCH_DATASETS`. |
| m2 | scripts/gen_tool.py:709, 714, 957, 1587, 1711 | Minor | Hard-coded `/home/me/...` strings appear inside query text. Cosmetic: they're synthetic example paths intended to look like a user's home dir, but worth a comment. |
| m3 | scripts/qa_build.py:1 | Minor | Missing shebang line. Same for gen_multilingual.py, gen_personalized.py, gen_creative.py. gen_math.py, gen_code.py, gen_conv.py, gen_multiturn.py, gen_tool.py, eval_harness.py, assemble.py have it. Inconsistent. |
| m4 | scripts/eval_harness.py:142-149 | Minor | to_dict rounds to 4 decimals at serialization time. If a downstream consumer wants raw precision they have to recompute. Prefer storing raw and rounding only in to_markdown. |
| m5 | scripts/eval_harness.py:170-176 | Minor | Wilson CI returns (0.0, 0.0) when n=0. Mathematically the CI is undefined; (0, 1) is a more conventional default. The harness's "Wilson CI: 0.000, 0.000" line in the empty-dataset markdown is misleading. |
| m6 | scripts/eval_harness.py:256-259 | Minor | `per 1k decisions: ${self.total_cost_usd / max(1, self.overall.total) * 1000:.4f}` divides by `max(1, total)`, which gives the right answer for total=0 but the displayed denominator is fictional. |
| m7 | scripts/eval_harness.py:265-268 | Minor | Tier-counts table sorts by descending count using `sorted(..., key=lambda x: -x[1])`. For ties the iteration order is dict-insertion, which is fine but not documented. Use `sorted(..., key=lambda x: (-x[1], x[0]))` for stable output. |
| m8 | scripts/assemble.py:159 | Minor | `json.dumps(row, ensure_ascii=False)` is fine, but with no separators argument it uses the default `, ` and `: ` spacing. The merged file is therefore 25% larger than necessary. Use `separators=(",", ":")` for compactness. Trivial. |
| m9 | scripts/gen_multilingual.py:303 | Minor | `random.seed(123)` is called after a different seed was used at line 22; the second call overrides global state for the rest of the script. If a future edit adds RNG calls after that point, behavior changes silently. |
| m10 | scripts/eval_harness.py:84 | Minor | `for line in f` reads line-by-line inside the `with` block, so the file is closed correctly. However, `line.strip()` does not remove a leading BOM (U+FEFF is not whitespace), so a BOM-prefixed file fails in `json.loads` on its first line unless the file is opened with `encoding="utf-8-sig"`. Undocumented. |
| m11 | scripts/eval_harness.py:362-368 | Minor | The baselines dict is rebuilt on every CLI invocation. Trivial cost. The `name = args.cache_name or args.baseline` line never falls into the else branch (args.baseline is truthy when set), so naming is fine. |
| m12 | design/judge-minimization.md & SCHEMA.md | Minor | The two documents disagree on enum values for verification_method (SCHEMA.md has code_exec and sql_canonical, judge-minimization.md shows policy_no_cache, rubric). And the dataset uses a third subset of the two. Pick a master list and link both docs to it. |
| m13 | scripts/eval_harness.py no test file | Minor | No test_eval_harness.py. No unit tests for wilson_ci, percentile, ConfusionMatrix, or evaluate. Adding ~50 lines of pytest would catch most of the issues above. |
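
The tie-breaking fix suggested in m7 can be illustrated in isolation. This is a minimal sketch; the tier names and counts are made up for the example and do not come from a real run.

```python
# Illustrative tier counts; "tier2" and "error" tie at 5.
tier_counts = {"tier2": 5, "tier1": 9, "error": 5, "tier3": 1}

# Primary key: descending count. Secondary key: tier name, so the order of
# tied rows no longer depends on dict insertion order.
ordered = sorted(tier_counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ordered)
```

With the secondary key, the two tied tiers always appear in alphabetical order regardless of which one the cache happened to emit first.
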
Each metric is correctly defined and correctly implemented for non-degenerate inputs:
| Metric | Formula | Implementation | Verdict |
|---|---|---|---|
| precision | TP / (TP+FP) | eval_harness.py:116-118 returns 0.0 at TP+FP=0 | Correct. Returning 0.0 at "no predicted hits" is the standard sklearn-like convention; a paper-pure choice would be undefined / NaN but 0.0 is defensible. |
| recall | TP / (TP+FN) | line 121-123 returns 0.0 at TP+FN=0 | Correct. Same caveat: 0.0 vs NaN at the zero-positive edge. |
| f1 | 2pr / (p+r) | line 126-128 returns 0.0 at p+r=0 | Correct. |
| false_hit_rate | FP / (FP+TN) | line 131-134 returns 0.0 at FP+TN=0 | Correct. |
| accuracy | (TP+TN) / total | line 137-138 | Correct. |
| Wilson 95% CI | per Wilson 1927 / Newcombe 1998 | line 169-176 | Correct. Compared against scipy-equivalent: wilson_ci(10, 100) = (0.05522854, 0.17436730), matches Brown/Cai/DasGupta. Edge cases handled: n=0 → (0,0); k=0 → (0, 0.037); k=n → (0.963, 1.0). Center and half formulas are textbook. |
| latency p50/p95/p99 | linear interpolation | line 156-162 | Correct, matches numpy's method='linear'. Diff < 1e-13 across test cases including p=0.0, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0. |
| total cost | sum(verdict.cost_usd) | line 286-297 | Correct if the cache reports its own total cost per decision. The harness does not detect double-counting (e.g., a cache returning the same call's cost twice). The contract is "incremental cost of THIS decision", which is reasonable. |
Bottom line: the metrics are correctly computed. The only concerns are
(a) zero-input behavior (0.0 vs NaN — a documentation/convention choice), and
(b) no input validation on Verdict, so a malformed cache can poison the
report (Bug M6).
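
For readers who want to repeat the formula check, the following is an independent reference sketch of the Wilson score interval and the linear-interpolation percentile. It is not the harness's code: the function bodies are written from the textbook formulas, and returning `(0.0, 1.0)` at `n == 0` is an assumption that follows the suggestion in m5 rather than the harness's current `(0.0, 0.0)`.

```python
import math
from typing import List, Tuple

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def wilson_ci(k: int, n: int, z: float = Z_95) -> Tuple[float, float]:
    """Wilson score interval for k successes out of n trials."""
    if n == 0:
        return (0.0, 1.0)  # assumption: undefined CI reported as the full range
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def percentile(values: List[float], p: float) -> float:
    """Percentile with linear interpolation between order statistics, p in [0, 1]."""
    if not values:
        return 0.0
    xs = sorted(values)
    pos = p * (len(xs) - 1)          # fractional index into the sorted data
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


print(wilson_ci(10, 100))
print(percentile([1.0, 2.0, 3.0, 4.0], 0.5))
```
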
Handled:
- Empty `cachebench.jsonl` → no crash, all metrics return 0, markdown report renders without errors (verified: `evaluate(lambda r: Verdict(False), [])` produces a clean degenerate report).
- `query_a == query_b` with `binary_label == "MISS"` → 172 such rows in the dataset (personalized 50, conversational 50, multi_turn 50, tool 22). The `ExactMatchCache` baseline correctly flags these as FP and the per-domain FHR breakdown surfaces them.
- `query_a == query_b` with `binary_label == "HIT"` → 192 such rows; counted as TP.
- `is_hit=True` with `confidence=0.0` → silently accepted (semantically questionable; see M6).
- Bad `--baseline` value → argparse rejects.
- Bad `--bench` path → `FileNotFoundError`, not a friendly message but it bubbles up.
- Multiple judges per decision → cache must aggregate. Harness sees only the total.
- Verdict with `decision_ms=0` → falls back to wall-clock perf_counter.
- Verdict with negative `decision_ms` → also falls back to wall-clock because `verdict.decision_ms > 0` is False (works by accident, not by design).
- Verdict with NaN `decision_ms` → falls back to wall-clock (`NaN > 0 == False`).
Not handled (bugs):
- Row with an unrecognized string `binary_label` (e.g. `"true"`) → silently treated as MISS.
- Row with `binary_label = null` → AttributeError on `.upper()`.
- Verdict with `cost_usd = None` → TypeError on summation (M6).
- Verdict with `cost_usd = float('inf')` → corrupts total silently.
- Cache callable that raises → entire run aborts (M7).
- Cache callable that returns `None` → crash with cryptic AttributeError.
- Cache callable that returns an int (truthy) → likely AttributeError on `.decision_ms`.
- Unknown `domain` value in a row → silently bucketed under that string; per-domain table grows with garbage rows.
- Unknown `verification_method` in a row → ignored (the harness doesn't use the field at all).
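
Several of the behaviors listed above follow directly from Python semantics and can be reproduced without the harness. The probe below is a standalone sketch; the `raises` helper is introduced only for this illustration.

```python
import math


def raises(exc_type, fn) -> bool:
    """Return True if calling fn() raises exc_type (helper for this probe only)."""
    try:
        fn()
    except exc_type:
        return True
    return False


# NaN and negative latencies both fail a `> 0` test, which is why the
# wall-clock fallback kicks in for them.
nan_passes = float("nan") > 0
neg_passes = -5.0 > 0

# An infinite cost does not raise; it silently turns the running total into inf.
total = sum([0.01, float("inf"), 0.02])

# A None cost raises as soon as it is added to a float.
none_cost_fails = raises(TypeError, lambda: 0.01 + None)

# A null label has no .upper(), matching the crash described for truth_is_hit.
null_label_fails = raises(AttributeError, lambda: None.upper())

print(nan_passes, neg_passes, math.isinf(total), none_cost_fails, null_label_fails)
```
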
Cannot rebuild bit-for-bit on a clean machine. Three independent failure modes:

- `hash()`-based seeding (P0). `gen_multilingual.py` (lines 85, 114) and `qa_build.py` (lines 200, 405) seed `random.Random(hash(...))` on strings and tuples. CPython's `hash()` of `str` and `bytes` objects (and of tuples containing them) includes a per-process random salt unless `PYTHONHASHSEED` is pinned to a fixed value. Two fresh subprocess runs produce SHA-256 mismatches; `PYTHONHASHSEED=0` vs `=1` also produce different outputs.

  Verified:

  ```
  gen_multilingual  run1: 3eec67d…  run2: d621241…  match: False
  qa_build          run1: 88a849d…  run2: 2e824e6…  match: False
  ```

  The other generators (`gen_code`, `gen_conv`, `gen_creative`, `gen_math`, `gen_multiturn`, `gen_personalized`, `gen_tool`) are deterministic across fresh processes.

- Hard-coded absolute paths (P1). Every generator (and `qa_build.py`) hardcodes `/home/bud/ditto/budCache/research/...`. A third party MUST either (a) put their checkout at that exact path, or (b) sed-replace.

- External datasets (P1). Generators depend on local copies of:
  - TriviaQA (`rag-qa/trivia-qa/rc.nocontext/train-00000-of-00001.parquet`)
  - BANKING77 MTEB (`intent-canonicalization/banking77-mteb/train.jsonl`)
  - PAWS-X for 7 languages (`paraphrase-pairs/paws-x/{en,de,fr,es,ja,ko,zh}/test-00000-of-00001.parquet`)
  - GLUE QQP (`paraphrase-pairs/glue-qqp/qqp/train-00000-of-00001.parquet`)
  - BFCL v3 (`agent-tool/function-calling/bfcl/`)
  - xLAM irrelevance 7.5k (`agent-tool/function-calling/xlam-irrelevance-7p5k/xlam-7.5k-irrelevancek.json`)

  None of these are vendored. There is no `download_datasets.sh`. If HuggingFace shifts a dataset's file structure, the build breaks.

On the positive side, assemble.py IS bit-for-bit reproducible: rerunning
it on the same domains/ directory produces an identical
cachebench.jsonl (SHA-256 verified).
Reproducibility grade: D. The merge step is deterministic, but ~640 of
2,000 rows (multilingual + qa) cannot be regenerated to match the published
bytes without setting PYTHONHASHSEED=0 and matching the original env
exactly.
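
The string-hash randomization behind this grade can be observed directly. The probe below is a minimal sketch, not the audit's original script: it evaluates the same expression pattern flagged in C1 in several fresh interpreters, once with `PYTHONHASHSEED` pinned and once with it unset. The helper name `seed_in_fresh_process` is an assumption made for this example.

```python
import os
import subprocess
import sys
from typing import Optional

# Same pattern as the seed expression flagged in C1.
SNIPPET = 'print(hash("en") & 0xFFFF)'


def seed_in_fresh_process(hashseed: Optional[str]) -> str:
    """Run SNIPPET in a new interpreter, optionally pinning PYTHONHASHSEED."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)      # start from an unpinned environment
    if hashseed is not None:
        env["PYTHONHASHSEED"] = hashseed
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


pinned = {seed_in_fresh_process("0") for _ in range(3)}
unpinned = {seed_in_fresh_process(None) for _ in range(3)}
print("pinned values:  ", pinned)    # a single value, since hashing is fixed
print("unpinned values:", unpinned)  # typically several distinct values
```

The pinned set always collapses to one value. The unpinned set usually holds three different values, although a collision in the low 16 bits is possible, so the sketch does not assert on it.
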
Python stdlib only. No third-party deps required.
Verified imports in eval_harness.py: argparse, importlib, json, time,
collections, dataclasses, pathlib, statistics, typing, __future__.
Python ≥ 3.10 (uses set[str], tuple[float, float], list[float], X | Y
union syntax).
| Package | Used by |
|---|---|
| sympy | gen_math.py (sanity-check equivalences during build) |
| pyarrow | qa_build.py (read TriviaQA / QQP parquet) |
| pandas | gen_multilingual.py, qa_build.py (read PAWS-X parquet via pd.read_parquet) |
External data (must be present at `/home/bud/ditto/budCache/research/datasets/`
or paths sed-replaced):
- `rag-qa/trivia-qa/rc.nocontext/train-00000-of-00001.parquet`
- `intent-canonicalization/banking77-mteb/train.jsonl`
- `paraphrase-pairs/paws-x/{en,de,fr,es,ja,ko,zh}/test-00000-of-00001.parquet`
- `paraphrase-pairs/glue-qqp/qqp/train-00000-of-00001.parquet`
- `agent-tool/function-calling/bfcl/...`
- `agent-tool/function-calling/xlam-irrelevance-7p5k/xlam-7.5k-irrelevancek.json`

Also internal: cachebench/sources/code_fixtures/snippets.py (used by
gen_code.py:42).
These methods are not implemented anywhere in the harness — the harness's
evaluate() ignores verification_method and only compares
is_hit against binary_label. If someone wanted to actually wire up the
verification stack as designed, they'd need:
| Method | Needed deps |
|---|---|
| exact_match | stdlib only (regex normalization) |
| sympy | sympy (already required for build); optional: sympy.parsing.latex needs antlr4-python3-runtime |
| ast_diff | stdlib ast |
| code_exec | A sandboxed Python subprocess (the design doc recommends subprocess.run with timeout); ideally nsjail / firejail / Docker for safety |
| regex | stdlib re |
| json_schema | jsonschema (not currently installed) |
| sql_canonical | sqlglot (not installed; referenced in judge-min doc §3.5) |
| set_match | spacy for NER + en_core_web_sm model (per design doc §3.1) |
| policy_no_cache | stdlib only |
| llm_judge | anthropic + openai SDKs, snapshot-pinned model IDs, ~$20–60/run per the doc §1.5 |
| rubric | stdlib only (judge-min §4.8) |
Optional secondaries from the design doc:
- `sentence-transformers` (judge-min §3.6 for embedding cosine sim)
- `bert-score` (judge-min §3.7)
- `LaBSE` (SCHEMA.md multilingual fallback)
- `pint` (units conversion in math tier-2)
- `tree-sitter` (code tier-2)

None of these tier-2/3 dependencies are pinned anywhere. There is no
requirements.txt, no pyproject.toml, no environment.yml in the
cachebench/ tree.
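
If the verification stack were wired up, one plausible starting point is a small dispatch table keyed by `verification_method`, beginning with the stdlib-only methods from the table above. The sketch below is an assumption about how that could look, not the design doc's implementation: the verifier signatures, the normalization rule, and the `{"pattern": ...}` payload shape are all hypothetical.

```python
import re
from typing import Callable, Dict, Optional


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization rule)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def verify_exact_match(a: str, b: str, payload: dict) -> bool:
    return _normalize(a) == _normalize(b)


def verify_regex(a: str, b: str, payload: dict) -> bool:
    # Hypothetical payload shape: {"pattern": "<regex both texts must match>"}
    pattern = re.compile(payload["pattern"])
    return bool(pattern.search(a)) and bool(pattern.search(b))


# Only stdlib-backed methods are registered; everything else fails loudly.
VERIFIERS: Dict[str, Callable[[str, str, dict], bool]] = {
    "exact_match": verify_exact_match,
    "regex": verify_regex,
}


def verify(method: str, a: str, b: str, payload: Optional[dict] = None) -> bool:
    fn = VERIFIERS.get(method)
    if fn is None:
        raise NotImplementedError(f"no verifier wired up for {method!r}")
    return fn(a, b, payload or {})


print(verify("exact_match", "  Paris ", "paris"))
print(verify("regex", "total: 42", "answer is 42", {"pattern": r"\b42\b"}))
```

Raising `NotImplementedError` for unregistered methods keeps the gap between the documented enum and the implemented subset visible instead of silently skipping rows.
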
Replace random.Random(hash(...)) with a deterministic integer seed.
gen_multilingual.py line 85:
```python
# OLD
random.Random(hash(lang) & 0xFFFF).shuffle(idx)
# NEW: stable lookup
_LANG_SEED = {"en": 1, "de": 2, "fr": 3, "es": 4, "ja": 5, "ko": 6, "zh": 7}
random.Random(_LANG_SEED[lang]).shuffle(idx)
```

gen_multilingual.py line 114:

```python
# OLD
random.Random((hash(lang) * 31) & 0xFFFF).shuffle(idx)
# NEW
random.Random(_LANG_SEED[lang] * 31).shuffle(idx)
```

qa_build.py line 200 and line 405: build a stable int from the string using
`hashlib.sha256(intent.encode()).digest()[:4]` (a real cryptographic hash, not
process-salted):

```python
import hashlib

def _stable_seed(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:4], "big")

# then
rng = random.Random(_stable_seed(intent))
```

After patching, verify two fresh subprocess runs produce identical bytes and
update the published cachebench.jsonl (this WILL change ~640 row contents).
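
The line-405 call seeds from a tuple `(ia, ib)`, so a string-only helper needs a small extension there. The variant below is a sketch under the assumption that both tuple members are strings; the name `stable_seed` and the example intent names are illustrative only.

```python
import hashlib
import random


def stable_seed(*parts: str) -> int:
    """Derive a 32-bit seed from one or more strings, independent of PYTHONHASHSEED."""
    h = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return int.from_bytes(h.digest()[:4], "big")


# Mirrors the `... ^ 1` in the original seed expression; intent names are examples.
idx = list(range(10))
random.Random(stable_seed("card_arrival", "card_linking") ^ 1).shuffle(idx)
print(idx)
```

Because the seed depends only on the UTF-8 bytes of the inputs, the shuffle order is the same in every process on the same Python version.
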
Each generator should accept --out and --datasets-dir:
```python
import argparse
import os
from pathlib import Path

ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path(__file__).parent.parent / "domains" / "multilingual.jsonl")
ap.add_argument("--datasets-dir", type=Path, default=Path(os.environ.get("CACHEBENCH_DATASETS", "/home/bud/ditto/budCache/research/datasets")))
args = ap.parse_args()
OUT = args.out
PAWS = args.datasets_dir / "paraphrase-pairs/paws-x"
```

Or simpler: read `CACHEBENCH_ROOT` from env with a sensible default. Apply
to every generator.
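
The environment-variable alternative could look like the following. This is a sketch only: `resolve_root` is a hypothetical helper, and the demo falls back to the current directory, whereas a real generator would more likely default to a path derived from `__file__`.

```python
import os
from pathlib import Path


def resolve_root(default: Path) -> Path:
    """Return $CACHEBENCH_ROOT if it is set and non-empty, else the given default."""
    raw = os.environ.get("CACHEBENCH_ROOT", "").strip()
    return Path(raw).expanduser() if raw else default


# Demo default; in a generator script this might be Path(__file__).resolve().parent.parent.
ROOT = resolve_root(Path.cwd())
OUT = ROOT / "domains" / "multilingual.jsonl"
print(OUT)
```
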
assemble.py ~line 30, add:
```python
VALID_VERIFICATION = {
    "exact_match", "sympy", "ast_diff", "code_exec", "regex",
    "json_schema", "sql_canonical", "set_match", "policy_no_cache",
    "llm_judge", "rubric",
}
VALID_CONSTRUCTION = {
    "real_traffic", "intent_bottom_up", "adversarial_perturbation",
    "lm_generated", "existing_benchmark",
}
EXPECTED_CONSTRUCTION = {
    "real_traffic": 600, "intent_bottom_up": 500,
    "adversarial_perturbation": 500, "lm_generated": 250,
    "existing_benchmark": 150,
}
EXPECTED_DIFFICULTY = {"easy": 600, "medium": 1000, "hard": 400}
```

Then inside `validate_row` add:

```python
if row["verification_method"] not in VALID_VERIFICATION:
    errors.append(f"{rid}: invalid verification_method {row['verification_method']!r}")
if row["construction_method"] not in VALID_CONSTRUCTION:
    errors.append(f"{rid}: invalid construction_method {row['construction_method']!r}")
if row.get("verification_payload") is not None and not isinstance(row["verification_payload"], dict):
    errors.append(f"{rid}: verification_payload must be null or dict")
```

And in `main`, add a second audit block that prints construction-method drift
and difficulty drift the same way label drift is printed.
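
A possible shape for that second audit block is sketched below. The `drift_report` helper and its output format are assumptions; only the OK/DRIFT vocabulary is taken from the existing label audit, and the three demo rows are made up.

```python
from collections import Counter
from typing import Dict, Iterable, List


def drift_report(rows: Iterable[dict], field: str, expected: Dict[str, int]) -> List[str]:
    """Compare actual per-value counts of `field` against the expected quotas."""
    actual = Counter(row.get(field) for row in rows)
    lines = []
    # Include values that appear in the data but not in the quota table.
    for key in sorted(set(expected) | set(actual), key=str):
        want = expected.get(key, 0)
        got = actual.get(key, 0)
        status = "OK" if got == want else "DRIFT"
        lines.append(f"{field}={key}: {got}/{want} {status}")
    return lines


demo_rows = [
    {"construction_method": "real_traffic"},
    {"construction_method": "lm_generated"},
    {"construction_method": "lm_generated"},
]
demo_expected = {"real_traffic": 1, "lm_generated": 1}
report = drift_report(demo_rows, "construction_method", demo_expected)
print("\n".join(report))
```

Calling the same helper with `"difficulty"` and `EXPECTED_DIFFICULTY` would cover the second mix without extra code.
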
Either:
(a) add "rubric" to the SCHEMA.md verification_method enum at line 30, and
remove code_exec / sql_canonical / set_match (they're unused), OR
(b) regenerate the dataset using only the SCHEMA-blessed enum.
Recommendation: (a). The design doc §3 makes it clear that "rubric" is a first-class tier. Update SCHEMA.md to:

```diff
- "verification_method": "exact_match | sympy | ast_diff | code_exec | regex | json_schema | sql_canonical | set_match | policy_no_cache | llm_judge",
+ "verification_method": "exact_match | sympy | ast_diff | regex | json_schema | policy_no_cache | rubric | llm_judge",
```

(or keep code_exec / sql_canonical / set_match if generators will
eventually emit them).
scripts/assemble.py:54-77 — drop the return value entirely:
```python
def validate_row(row: Dict[str, Any], path: Path, line_no: int, errors: List[str]) -> None:
    rid = row.get("id", f"{path.name}:{line_no}")
    missing = REQUIRED_FIELDS - row.keys()
    if missing:
        errors.append(f"{rid}: missing required fields: {sorted(missing)}")
        return  # <- early return, simple
    # ... rest of checks unchanged, no return at the end
```

`main()` currently doesn't use the return value, so this change does not alter
behavior.
```python
def load_bench(path: Path) -> List[Row]:
    rows = []
    with path.open(encoding="utf-8") as f:
        for ln, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            d = json.loads(line)
            # Validate required fields before building the Row.
            for k in ("id", "domain", "binary_label", "query_a", "query_b"):
                if d.get(k) is None:
                    raise ValueError(f"{path}:{ln} missing required field {k!r}")
            if d["binary_label"] not in ("HIT", "MISS"):
                raise ValueError(f"{path}:{ln} bad binary_label {d['binary_label']!r}")
            rows.append(Row(**{k: d.get(k) for k in Row.__dataclass_fields__}))
    return rows
```

In `evaluate`, wrap the cache call and validate the `Verdict` (addresses M6 and M7):

```python
# Requires `import math` at the top of eval_harness.py.
for row in rows:
    t0 = time.perf_counter()
    try:
        verdict = cache_decide(row)
    except Exception:
        # Count as a decision error and score the row as a predicted MISS
        # in every breakdown, so per-bucket totals still sum to the overall.
        by_tier["error"] += 1
        overall.add(row.truth_is_hit, False)
        by_domain[row.domain].add(row.truth_is_hit, False)
        by_label[row.label].add(row.truth_is_hit, False)
        by_diff[row.difficulty].add(row.truth_is_hit, False)
        latencies.append((time.perf_counter() - t0) * 1000)
        costs.append(0.0)
        continue
    elapsed_ms = (time.perf_counter() - t0) * 1000
    if not isinstance(verdict, Verdict):
        raise TypeError(f"cache returned {type(verdict).__name__}, expected Verdict")
    # Fall back to wall-clock time for missing, non-finite, or non-positive latencies.
    dm = verdict.decision_ms
    if dm is None or not math.isfinite(dm) or dm <= 0:
        dm = elapsed_ms
    # Treat None, NaN, and infinite costs as zero rather than poisoning the total.
    cost = verdict.cost_usd
    if cost is None or not math.isfinite(cost):
        cost = 0.0
    latencies.append(dm)
    costs.append(cost)
    total_cost += cost
    by_tier[verdict.tier_used or "unknown"] += 1
    overall.add(row.truth_is_hit, verdict.is_hit)
    by_domain[row.domain].add(row.truth_is_hit, verdict.is_hit)
    by_label[row.label].add(row.truth_is_hit, verdict.is_hit)
    by_diff[row.difficulty].add(row.truth_is_hit, verdict.is_hit)
```

Add a `## Note on verification_method` section to SCHEMA.md:
> The harness's headline HIT/MISS metric does not invoke `verification_method`. Labels are pre-computed at build time and the harness only compares the cache's `Verdict.is_hit` against `binary_label`. The `verification_method` field documents how the gold label was derived and is intended for downstream response-quality / answer-correctness scoring (out of scope for v1). For ~120 rows that carry `llm_judge`, the gold labels were curator-assigned; no judge runs at eval time.

```python
# tests/test_eval_harness.py
import pytest
from eval_harness import wilson_ci, percentile, ConfusionMatrix, Verdict, Row, evaluate


def test_wilson_ci_edge_cases():
    assert wilson_ci(0, 0) == (0.0, 0.0)
    lo, hi = wilson_ci(0, 100)
    assert 0.0 == lo and 0.03 < hi < 0.05
    lo, hi = wilson_ci(50, 100)
    assert abs(lo - 0.404) < 1e-2 and abs(hi - 0.596) < 1e-2


def test_percentile_matches_numpy():
    import numpy as np
    vals = list(range(1, 1001))
    for p in [0.0, 0.5, 0.95, 0.99, 1.0]:
        assert abs(percentile([float(x) for x in vals], p) - np.percentile(vals, p*100)) < 1e-9


def test_confusion_matrix_zero():
    cm = ConfusionMatrix()
    assert cm.precision == 0.0 and cm.recall == 0.0 and cm.f1 == 0.0


def test_evaluate_empty():
    rep = evaluate(lambda r: Verdict(False), [])
    assert rep.overall.total == 0
    assert rep.to_markdown()  # doesn't crash
```

Replace `random.seed(123)` with a local `rng = random.Random(123)`:

```python
rng = random.Random(123)
while len(unrel_pairs) < 30 and attempts < 1000:
    la = rng.choice(all_langs)
    lb = rng.choice([x for x in all_langs if x != la])
    # ...
```

This avoids clobbering the global RNG state.
At cachebench/requirements.txt:
```text
# To run the harness
# (stdlib only)
# To re-run generators
sympy>=1.12
pyarrow>=15.0
pandas>=2.1
# Optional: to wire up verification methods per design/judge-minimization.md
# sqlglot>=23.0                # sql_canonical
# jsonschema>=4.0              # json_schema
# spacy>=3.7                   # set_match (NER)
# sentence-transformers>=2.2   # conv tier-2 embedding sim
# bert-score>=0.3              # creative tier-2
# pint>=0.23                   # math tier-2 units
# anthropic>=0.30              # llm_judge (cross-family)
# openai>=1.30                 # llm_judge (cross-family)
```

cachebench/build.sh:

```bash
#!/usr/bin/env bash
set -euo pipefail
cd "$(dirname "$0")"
export PYTHONHASHSEED=0   # critical for qa_build and gen_multilingual until they're patched
export CACHEBENCH_DATASETS="${CACHEBENCH_DATASETS:-/home/bud/ditto/budCache/research/datasets}"
mkdir -p domains
python scripts/gen_math.py
python scripts/gen_code.py
python scripts/gen_conv.py
python scripts/gen_multiturn.py
python scripts/gen_tool.py
python scripts/gen_creative.py
python scripts/gen_personalized.py
python scripts/gen_multilingual.py
python scripts/qa_build.py
python scripts/assemble.py --strict
```

```text
assemble.py --strict on current dataset: PASSES (2000/2000, all label quotas OK)
re-running assemble.py: bit-for-bit identical (sha256 stable)
gen_creative.py rerun: bit-for-bit identical
gen_personalized.py rerun: bit-for-bit identical
gen_math.py rerun: bit-for-bit identical
gen_code.py rerun: bit-for-bit identical
gen_conv.py rerun: bit-for-bit identical
gen_tool.py rerun: bit-for-bit identical
gen_multiturn.py rerun: bit-for-bit identical
gen_multilingual.py rerun: DIFFERENT (hash() salt, Bug C1)
qa_build.py rerun: DIFFERENT (hash() salt, Bug C2)
gen_multilingual.py PYTHONHASHSEED=0 vs 1: DIFFERENT (confirms C1)
qa_build.py PYTHONHASHSEED=0 vs 1: DIFFERENT (confirms C2)
eval_harness baselines:
  always_miss: Precision 0.000 Recall 0.000 FHR 0.000 N=2000
  always_hit: Precision 0.412 Recall 1.000 FHR 1.000 N=2000
  exact_match: has 172 trivial FPs from personalized/multi_turn/conv same-text MISSes
Empirical metric verification:
  wilson_ci(10, 100) = (0.05522854, 0.17436730) ← matches Brown/Cai/DasGupta
  percentile matches numpy method="linear" within 1e-13 across p=0..1
```

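
The bit-for-bit comparisons reported above amount to hashing each output file and comparing digests. A minimal sketch of such a check follows; `sha256_file` is a hypothetical helper, and the demo writes two temporary files rather than touching real generator output.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large JSONL files need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    run1 = Path(tmp) / "run1.jsonl"
    run2 = Path(tmp) / "run2.jsonl"
    run1.write_bytes(b'{"id": "x1"}\n')
    run2.write_bytes(b'{"id": "x1"}\n')
    digest1 = sha256_file(run1)
    digest2 = sha256_file(run2)

print("match:", digest1 == digest2)
```
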
/home/bud/ditto/budCache/research/cachebench/scripts/eval_harness.py (386 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/assemble.py (166 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_math.py (946 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_code.py (1400+ lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_conv.py (~600 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_multiturn.py (~1500 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_tool.py (~2000 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_creative.py (~400 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_personalized.py (~500 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_multilingual.py (~356 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/qa_build.py (~1100 lines)
/home/bud/ditto/budCache/research/cachebench/SCHEMA.md
/home/bud/ditto/budCache/research/cachebench/design/judge-minimization.md
/home/bud/ditto/budCache/research/cachebench/cachebench.jsonl (2000 rows, 2.1 MB)