CacheBench v1 — Brutal Code Audit

Scope. scripts/eval_harness.py, scripts/assemble.py, scripts/gen_*.py, scripts/qa_build.py, SCHEMA.md, design/judge-minimization.md, and the merged cachebench.jsonl (2,000 rows).

Method. Static reading + dynamic probes: ran every baseline, re-ran every generator in fresh subprocesses (with varying PYTHONHASHSEED), exercised the harness on hand-crafted edge cases, byte-compared re-merges, validated metric formulae against numpy/Wilson-formula references.


1. Executive summary

Code-quality grade: B- (passes happy-path; reproducibility, schema-validation, and verification-tier wiring all have material defects).

The harness's headline metrics (precision/recall/F1/FHR/Wilson CI/percentile) are arithmetically correct against published formulae. The assembler enforces the documented row-count quotas and the merged dataset hits 2,000/2,000 with all per-domain × per-label cells matching the SCHEMA matrix exactly. Three built-in baselines run end-to-end and produce sensible reports.

But the audit found:

  • Reproducibility is broken for two generators (gen_multilingual.py, qa_build.py). They seed random.Random(hash(...)) from Python strings and tuples of strings; CPython salts these hashes per process unless PYTHONHASHSEED is pinned to a fixed value (0 disables the salt entirely). Two fresh runs produce different SHA-256s. This is a P0 reproducibility defect for ~640 of the 2,000 rows.
  • Construction-method mix drifts severely from SCHEMA.md (e.g. lm_generated is 767 vs target 250; real_traffic is 310 vs target 600). Nothing flags this — assemble.py checks label quotas but not construction quotas.
  • assemble.py does not validate the verification_method enum, and SCHEMA.md and the data are out of sync. 380 rows use rubric, which is not in the SCHEMA.md enum; SCHEMA.md lists code_exec, sql_canonical, and set_match, which never appear in the data. The assembler is silent about both. Either fix the doc or fix the data.
  • The validate_row return value is dead code (no caller reads it) AND relies on a buggy any(e.startswith(rid + ":") for e in errors[-5:]) heuristic that wrongly returns False for a valid row when an earlier row's errors happen to share its id prefix.
  • The harness's load_bench is too tolerant: unknown fields silently dropped, missing fields silently None. A row with binary_label: null crashes downstream in truth_is_hit with AttributeError.
  • The harness crashes if a cache returns None (no friendly error) or sets cost_usd=None. No input validation on the Verdict it receives.
  • Verification methods are referenced but not implemented. SCHEMA.md and judge-minimization.md describe an elaborate tier-1/2/3 verification stack (sympy, ast_diff, json_schema, llm_judge, rubric…); the harness implements none of them — it only compares the cache's is_hit against the pre-computed binary_label. This is internally consistent (HIT/MISS is the headline metric) but undocumented and surprising. Anyone reading SCHEMA.md expects the harness to honor verification_method; it doesn't.

Below: bug list, then patches.


2. Bug list

Critical (data-correctness or reproducibility)

| # | File:line | Severity | Description |
|---|---|---|---|
| C1 | scripts/gen_multilingual.py:85, 114 | Critical | Uses random.Random(hash(lang) & 0xFFFF).shuffle(idx). CPython hashes strings with a per-process random salt unless PYTHONHASHSEED=0, so two fresh runs produce different PAWS-X picks → different multilingual.jsonl. Verified empirically: two fresh subprocess runs produced SHA-256 3eec67d… vs d621241…. |
| C2 | scripts/qa_build.py:200, 405 | Critical | Same bug pattern: random.Random(hash(intent) % (2**32)) and random.Random((hash((ia, ib)) & 0xFFFFFFFF) ^ 1). ~490 of 2,000 rows depend on this. Verified: two fresh runs produced different SHA-256s; with PYTHONHASHSEED=0 vs =1 results also differ. |
| C3 | scripts/qa_build.py whole file | Major | qa_build.py is missing a shebang line and a if __name__ == "__main__": guard. The script executes module-level side-effects on import (the assertion-heavy build sequence). Anything that imports it as a module will run the whole build. (gen_multilingual.py, gen_personalized.py, gen_creative.py have the same issue, but their reproducibility is intact.) |
| C4 | cachebench.jsonl distribution | Major | Construction-method counts deviate hugely from SCHEMA.md §"Construction-method mix": lm_generated is 767/250 (+207%), real_traffic is 310/600 (-48%), intent_bottom_up is 205/500 (-59%), adversarial_perturbation is 533/500 (+7%), existing_benchmark is 185/150 (+23%). Either fix generators or rewrite SCHEMA.md. |
| C5 | cachebench.jsonl verification_method | Major | 380 rows have verification_method: rubric but SCHEMA.md line 30 doesn't include rubric in the allowed enum. Conversely, code_exec, sql_canonical, set_match are in the schema enum but never appear in the dataset. |
| C6 | cachebench.jsonl difficulty mix | Minor | SCHEMA.md targets 30/50/20 → 600/1000/400. Actual: 639/869/492 (+39 easy, -131 medium, +92 hard). Within tolerance but worth documenting. |
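
C1 and C2 can be reproduced in isolation, without the generators. The sketch below is illustrative only: `seed_in_fresh_process` and `stable_seed` are hypothetical helper names, not functions from the repository. It evaluates `hash('en')` in two fresh interpreters with different PYTHONHASHSEED values and contrasts that with a SHA-256-derived seed, which does not depend on the process.

```python
import hashlib
import os
import subprocess
import sys


def seed_in_fresh_process(expr: str, hashseed: str) -> int:
    """Evaluate an integer expression in a new interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=hashseed)
    out = subprocess.run(
        [sys.executable, "-c", f"print({expr})"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())


def stable_seed(text: str) -> int:
    """Process-independent 32-bit seed: first 4 bytes of SHA-256 over the UTF-8 bytes."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")


if __name__ == "__main__":
    expr = "hash('en')"
    # Different hash seeds give different str hashes, hence different RNG seeds.
    print("salted:", seed_in_fresh_process(expr, "1"), seed_in_fresh_process(expr, "2"))
    # The SHA-256-derived seed is the same in every process.
    print("stable:", stable_seed("en"))
```

Pinning PYTHONHASHSEED makes the salted variant repeatable too, but only for callers who remember to set it; deriving the seed from the content removes that dependency.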

Major (correctness bugs that don't yet bite)

| # | File:line | Severity | Description |
|---|---|---|---|
| M1 | scripts/assemble.py:76 | Major | return not any(e.startswith(f"{rid}:") for e in errors[-5:]) is a fragile sliding-window check. It (a) silently misses errors when more than 5 are pushed by the same row; (b) wrongly returns False for a valid row when an earlier row's errors share its id prefix. Reproduced: validate_row(good_row, ...) returns False when errors = ['X: missing fields'] and good_row.id == 'X'. The return value is never used in main(), so the bug is latent, but the dead code is a maintenance trap. |
| M2 | scripts/assemble.py:54-77 | Major | validate_row never checks verification_method against the SCHEMA enum, never checks construction_method against the SCHEMA enum, and never checks that verification_payload is None-or-dict. All three are SCHEMA.md requirements. |
| M3 | scripts/assemble.py:39-50 | Major | EXPECTED_QUOTA only enforces per-domain × per-label rows. It does not enforce SCHEMA.md's construction-method mix (Section "Construction-method mix") nor the difficulty mix (Section "Difficulty distribution"). The "OK/DRIFT" audit only covers labels, so reviewers think the dataset is clean when half the documented mixes are off. |
| M4 | scripts/eval_harness.py:78-87 | Major | load_bench uses Row(**{k: d.get(k) for k in Row.__dataclass_fields__}). Two failure modes: (i) unknown JSONL fields are silently dropped; (ii) d.get(k) returns None for missing fields. A row with binary_label: null (or missing) crashes in truth_is_hit with 'NoneType' object has no attribute 'upper'. No friendly error. |
| M5 | scripts/eval_harness.py:75 | Major | truth_is_hit is self.binary_label.upper() == "HIT". If a row has an unrecognized string binary_label (e.g. "true", "yes"), it silently maps to MISS; non-string values such as 1 or None raise AttributeError instead. Should explicitly validate-or-raise. |
| M6 | scripts/eval_harness.py:294, 297 | Major | No validation that verdict.decision_ms ≥ 0, that verdict.cost_usd is a finite number, or that verdict.tier_used is a non-empty string. A cache returning Verdict(cost_usd=None) causes TypeError on the sum. A cache returning Verdict(decision_ms=float('inf')) poisons the percentile. |
| M7 | scripts/eval_harness.py:290 | Major | verdict = cache_decide(row): if the cache raises, the entire evaluation aborts. No try/except to mark that row as a "decision error" and continue. For a 2,000-row benchmark you want the rest of the run regardless. |
| M8 | scripts/eval_harness.py:5 | Minor | Docstring claims cache exposes decide(row) -> Verdict. The actual code calls cache_decide(row) (a Callable); the built-in baselines use __call__. Either the docstring or the contract is wrong. |
| M9 | scripts/eval_harness.py:188 | Minor | The by_label docstring says "restricted to MISS labels for FHR breakdown" but at line 302 the harness puts ALL labels (including EQUIV / PARA_SAFE) into the dict. to_markdown() only iterates ["RELATED_UNSAFE", "ADVERSARIAL", "UNRELATED"] so the extra entries are dead weight. |
| M10 | scripts/eval_harness.py:376 | Minor | If --cache-module is given but the module exposes neither a class nor a top-level cache attribute, the user gets AttributeError: module 'X' has no attribute 'cache', which is not a friendly diagnostic. |
| M11 | scripts/eval_harness.py:373-376 | Minor | cache = getattr(mod, args.cache_class)() instantiates with no args. If the cache class needs config (model name, embedder path, threshold), the user can't pass it via CLI. No --cache-args flag, no JSON-config option. |
| M12 | scripts/qa_build.py:101, 200, 405 and scripts/gen_multilingual.py:85, 114 | Major | hash() of str and bytes objects (and of tuples that contain them) is salted per process. Treating hash(s) & 0xFFFF as a deterministic seed is a textbook reproducibility bug. |
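
The M1 false alarm is easy to show outside assemble.py. In the sketch below, `window_check` copies the expression quoted in M1, while `validate_row_counted` is a hypothetical alternative (not code from the repository) that decides validity by whether the row appended anything to the shared error list. The single required-field check is a stand-in for the real validation logic.

```python
from typing import Any, Dict, List


def window_check(rid: str, errors: List[str]) -> bool:
    """The pattern quoted in M1: inspect only the last five collected errors."""
    return not any(e.startswith(f"{rid}:") for e in errors[-5:])


def validate_row_counted(row: Dict[str, Any], errors: List[str]) -> bool:
    """Hypothetical alternative: a row is valid iff it appended no new errors."""
    n_before = len(errors)
    rid = row.get("id", "<no id>")
    if "binary_label" not in row:  # stand-in for the real checks
        errors.append(f"{rid}: missing required fields: ['binary_label']")
    return len(errors) == n_before


# A stale error left by an earlier row with the same id.
errors = ["X: missing fields"]
good_row = {"id": "X", "binary_label": "HIT"}

print(window_check("X", errors))               # False: the clean row is blamed
print(validate_row_counted(good_row, errors))  # True: nothing was appended
```

Comparing list lengths does not depend on id uniqueness or on how many errors a single row produces.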

Minor / style / docs

| # | File:line | Severity | Description |
|---|---|---|---|
| m1 | scripts/gen_personalized.py:15, gen_multilingual.py:19, gen_creative.py:14, gen_conv.py:10, gen_multiturn.py:13, gen_code.py:36, gen_math.py:936, qa_build.py:32-33, gen_tool.py:22 | Minor | Every generator hardcodes /home/bud/ditto/budCache/research/... paths. A third party cannot rerun these without sed-replacing or symlinking. Take an --out and --datasets-dir CLI flag, or read from $CACHEBENCH_DATASETS. |
| m2 | scripts/gen_tool.py:709, 714, 957, 1587, 1711 | Minor | Hard-coded /home/me/... strings appear inside query text. Cosmetic: they're synthetic example paths intended to look like a user's home dir, but worth a comment. |
| m3 | scripts/qa_build.py:1 | Minor | Missing shebang line. Same for gen_multilingual.py, gen_personalized.py, gen_creative.py. gen_math.py, gen_code.py, gen_conv.py, gen_multiturn.py, gen_tool.py, eval_harness.py, assemble.py have it. Inconsistent. |
| m4 | scripts/eval_harness.py:142-149 | Minor | to_dict rounds to 4 decimals at serialization time. If a downstream consumer wants raw precision they have to recompute. Prefer storing raw and rounding only in to_markdown. |
| m5 | scripts/eval_harness.py:170-176 | Minor | Wilson CI returns (0.0, 0.0) when n=0. Mathematically the CI is undefined; (0, 1) is a more conventional default. The harness's "Wilson CI: 0.000, 0.000" line in the empty-dataset markdown is misleading. |
| m6 | scripts/eval_harness.py:256-259 | Minor | per 1k decisions: ${self.total_cost_usd / max(1, self.overall.total) * 1000:.4f} divides by max(1, total) which gives the right answer for total=0 but the displayed denominator is fictional. |
| m7 | scripts/eval_harness.py:265-268 | Minor | Tier-counts table sorts by descending count using sorted(..., key=lambda x: -x[1]). For ties the iteration order is dict-insertion which is fine but not documented. Use sorted(..., key=lambda x: (-x[1], x[0])) for stable output. |
| m8 | scripts/assemble.py:159 | Minor | json.dumps(row, ensure_ascii=False) passes no separators argument, so it uses the default ", " and ": " spacing. The merged file is therefore 25% larger than necessary. Use separators=(",", ":") for compactness. Trivial. |
| m9 | scripts/gen_multilingual.py:303 | Minor | random.seed(123) after using a different seed at line 22: the second seed(123) overrides global state for the rest of the script. If a future edit adds RNG calls after that point, behavior changes silently. |
| m10 | scripts/eval_harness.py:84 | Minor | for line in f reads line-by-line inside the with block, so the file is closed correctly. However, line.strip() does not remove a UTF-8 BOM (U+FEFF is not whitespace in Python), so a BOM-prefixed file decoded as UTF-8 would fail in json.loads on its first line. Opening with encoding="utf-8-sig" would tolerate it. Undocumented either way. |
| m11 | scripts/eval_harness.py:362-368 | Minor | The baselines dict is rebuilt on every CLI invocation. Trivial cost, but the name = args.cache_name or args.baseline line never falls into the else branch (args.baseline is truthy when set), so naming is fine. |
| m12 | design/judge-minimization.md & SCHEMA.md | Minor | The two documents disagree on enum values for verification_method (SCHEMA.md has code_exec and sql_canonical, judge-minimization.md shows policy_no_cache, rubric). And the dataset uses a third subset of the two. Pick a master list and link both docs to it. |
| m13 | scripts/eval_harness.py no test file | Minor | No test_eval_harness.py. No unit tests for wilson_ci, percentile, ConfusionMatrix, or evaluate. Adding ~50 lines of pytest would catch most of the issues above. |
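
The overhead named in m8 depends on how many keys and list items each row carries, so it is worth measuring on the actual file before and after the change. A minimal sketch, assuming rows are plain dicts as loaded from the JSONL; `separator_overhead` is a hypothetical helper name, and it counts characters rather than encoded bytes.

```python
import json
from typing import Any, Dict, Iterable, Tuple


def separator_overhead(rows: Iterable[Dict[str, Any]]) -> Tuple[int, int]:
    """Return (default_chars, compact_chars) for serializing the rows one per line."""
    default_chars = 0
    compact_chars = 0
    for row in rows:
        # Default separators are ", " and ": ".
        default_chars += len(json.dumps(row, ensure_ascii=False))
        # Compact separators drop the space after each comma and colon.
        compact_chars += len(json.dumps(row, ensure_ascii=False, separators=(",", ":")))
    return default_chars, compact_chars


if __name__ == "__main__":
    sample = [{"a": 1, "b": [1, 2]}]
    default_chars, compact_chars = separator_overhead(sample)
    saved = 100 * (default_chars - compact_chars) / default_chars
    print(f"default={default_chars} compact={compact_chars} saved={saved:.1f}%")
```

Running this over the rows of cachebench.jsonl gives the real saving for this dataset.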

3. Metric correctness verdict

Each metric is correctly defined and correctly implemented for non-degenerate inputs:

| Metric | Formula | Implementation | Verdict |
|---|---|---|---|
| precision | TP / (TP+FP) | eval_harness.py:116-118 returns 0.0 at TP+FP=0 | Correct. Returning 0.0 at "no predicted hits" is the standard sklearn-like convention; a paper-pure choice would be undefined / NaN but 0.0 is defensible. |
| recall | TP / (TP+FN) | line 121-123 returns 0.0 at TP+FN=0 | Correct. Same caveat: 0.0 vs NaN at the zero-positive edge. |
| f1 | 2pr / (p+r) | line 126-128 returns 0.0 at p+r=0 | Correct. |
| false_hit_rate | FP / (FP+TN) | line 131-134 returns 0.0 at FP+TN=0 | Correct. |
| accuracy | (TP+TN) / total | line 137-138 | Correct. |
| Wilson 95% CI | per Wilson 1927 / Newcombe 1998 | line 169-176 | Correct. Compared against scipy-equivalent: wilson_ci(10, 100) = (0.05522854, 0.17436730), matches Brown/Cai/DasGupta. Edge cases handled: n=0 → (0,0); at n=100, k=0 → (0, 0.037) and k=n → (0.963, 1.0). Center and half formulas are textbook. |
| latency p50/p95/p99 | linear interpolation | line 156-162 | Correct, matches numpy's method='linear'. Diff < 1e-13 across test cases including p=0.0, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0. |
| total cost | sum(verdict.cost_usd) | line 286-297 | Correct if the cache reports its own total cost per decision. The harness does not detect double-counting (e.g., a cache returning the same call's cost twice). The contract is "incremental cost of THIS decision", which is reasonable. |

Bottom line: the metrics are correctly computed. The only concerns are (a) zero-input behavior (0.0 vs NaN — a documentation/convention choice), and (b) no input validation on Verdict, so a malformed cache can poison the report (Bug M6).
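
For readers who want to re-check the two non-trivial formulas, here is a reference sketch of the Wilson score interval and linear-interpolation percentile. It is written from the textbook definitions, not copied from eval_harness.py. The z = 1.96 default, the (0.0, 0.0) return at n = 0 (mirroring the harness convention described above), and the 0.0 return for an empty percentile input are assumptions.

```python
import math
from typing import List, Tuple


def wilson_ci(k: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for k successes out of n trials."""
    if n == 0:
        return (0.0, 0.0)  # assumption: mirrors the harness's n=0 convention
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at k=0 and k=n.
    return (max(0.0, center - half), min(1.0, center + half))


def percentile(values: List[float], p: float) -> float:
    """Percentile for p in [0, 1], interpolating linearly between closest ranks."""
    if not values:
        return 0.0  # assumption: degenerate default for empty input
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


if __name__ == "__main__":
    print(wilson_ci(10, 100))
    print(percentile([1.0, 2.0, 3.0, 4.0], 0.5))
```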


4. Edge-case coverage

Handled:

  • Empty cachebench.jsonl → no crash, all metrics return 0, markdown report renders without errors (verified: evaluate(lambda r: Verdict(False), []) produces a clean degenerate report).
  • query_a == query_b with binary_label == "MISS" → 172 such rows in the dataset (personalized 50, conversational 50, multi_turn 50, tool 22). The ExactMatchCache baseline correctly flags these as FP and the per-domain FHR breakdown surfaces them.
  • query_a == query_b with binary_label == "HIT" → 192 such rows; counted as TP.
  • is_hit=True with confidence=0.0 → silently accepted (semantically questionable — see M6).
  • Bad --baseline value → argparse rejects.
  • Bad --bench path → FileNotFoundError, not a friendly message but it bubbles up.
  • Multiple judges per decision → cache must aggregate. Harness sees only the total.
  • Verdict with decision_ms=0 → falls back to wall-clock perf_counter.
  • Verdict with negative decision_ms → also falls back to wall-clock because verdict.decision_ms > 0 is False (works by accident, not by design).
  • Verdict with NaN decision_ms → falls back to wall-clock (NaN > 0 == False).

Not handled (bugs):

  • Row with binary_label not in {HIT, MISS} → silently treated as MISS.
  • Row with binary_label = null → AttributeError on .upper().
  • Verdict with cost_usd = None → TypeError on summation (M6).
  • Verdict with cost_usd = float('inf') → corrupts total silently.
  • Cache callable that raises → entire run aborts (M7).
  • Cache callable that returns None → crash with cryptic AttributeError.
  • Cache callable that returns an int (truthy) → likely AttributeError on .decision_ms.
  • Unknown domain value in a row → silently bucketed under that string; per-domain table grows with garbage rows.
  • Unknown verification_method in a row → ignored (the harness doesn't use the field at all).
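
Several of the unhandled cases above come down to the harness trusting whatever the cache returns. The sketch below shows one way a guard could look. The `Verdict` class here is a stand-in whose field names follow this audit and whose defaults are assumed, and `check_verdict` is a hypothetical helper. Rejecting negative costs and NaN latencies is a design choice of this sketch, not current harness behavior.

```python
import math
from dataclasses import dataclass


@dataclass
class Verdict:
    """Stand-in for the harness's Verdict; defaults are assumptions."""
    is_hit: bool
    confidence: float = 0.0
    decision_ms: float = 0.0
    cost_usd: float = 0.0
    tier_used: str = ""


def check_verdict(v: object) -> Verdict:
    """Raise a descriptive error if the cache returned something unusable."""
    if not isinstance(v, Verdict):
        raise TypeError(f"cache returned {type(v).__name__}, expected Verdict")
    if not isinstance(v.is_hit, bool):
        raise TypeError(f"Verdict.is_hit must be bool, got {v.is_hit!r}")
    for name in ("decision_ms", "cost_usd"):
        x = getattr(v, name)
        is_number = isinstance(x, (int, float)) and not isinstance(x, bool)
        if not is_number or not math.isfinite(x) or x < 0:
            raise ValueError(f"Verdict.{name} must be a finite number >= 0, got {x!r}")
    return v


if __name__ == "__main__":
    print(check_verdict(Verdict(True, cost_usd=0.002)))
    for bad in (None, 1, Verdict(True, cost_usd=float("inf"))):
        try:
            check_verdict(bad)
        except (TypeError, ValueError) as exc:
            print("rejected:", exc)
```

decision_ms = 0 stays valid here because the harness treats it as "fall back to wall-clock time".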

5. Reproducibility verdict

Cannot rebuild bit-for-bit on a clean machine. Three independent failure modes:

  1. hash()-based seeding (P0). gen_multilingual.py (lines 85, 114) and qa_build.py (lines 200, 405) seed random.Random(hash(...)) from strings and tuples of strings. CPython's hash() of str and bytes objects (and of tuples that contain them) includes a per-process random salt unless PYTHONHASHSEED is pinned to a fixed value. Two fresh subprocess runs produce SHA-256 mismatches; PYTHONHASHSEED=0 vs =1 also produce different outputs.

    Verified:

    gen_multilingual run1: 3eec67d…  run2: d621241…  match: False
    qa_build       run1: 88a849d…  run2: 2e824e6…  match: False
    

    The other generators (gen_code, gen_conv, gen_creative, gen_math, gen_multiturn, gen_personalized, gen_tool) are deterministic across fresh processes.

  2. Hard-coded absolute paths (P1). Every generator (and qa_build.py) hardcodes /home/bud/ditto/budCache/research/.... A third party MUST either (a) put their checkout at that exact path, or (b) sed-replace.

  3. External datasets (P1). Generators depend on local copies of:

    • TriviaQA (rag-qa/trivia-qa/rc.nocontext/train-00000-of-00001.parquet)
    • BANKING77 MTEB (intent-canonicalization/banking77-mteb/train.jsonl)
    • PAWS-X for 7 languages (paraphrase-pairs/paws-x/{en,de,fr,es,ja,ko,zh}/test-00000-of-00001.parquet)
    • GLUE QQP (paraphrase-pairs/glue-qqp/qqp/train-00000-of-00001.parquet)
    • BFCL v3 (agent-tool/function-calling/bfcl/)
    • xLAM irrelevance 7.5k (agent-tool/function-calling/xlam-irrelevance-7p5k/xlam-7.5k-irrelevancek.json)

    None of these are vendored. There is no download_datasets.sh. If HuggingFace shifts a dataset's file structure, the build breaks.
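
Until a download script exists, a preflight check at least turns a missing dataset into a clear message instead of a mid-build traceback. The sketch below uses the relative paths listed above; `missing_datasets` is a hypothetical helper, and reading the root from `$CACHEBENCH_DATASETS` assumes the environment variable proposed elsewhere in this audit has been adopted.

```python
import os
import sys
from pathlib import Path
from typing import List

PAWSX_LANGS = ["en", "de", "fr", "es", "ja", "ko", "zh"]

# Paths relative to the datasets root, as listed in this audit.
REQUIRED = [
    "rag-qa/trivia-qa/rc.nocontext/train-00000-of-00001.parquet",
    "intent-canonicalization/banking77-mteb/train.jsonl",
    "paraphrase-pairs/glue-qqp/qqp/train-00000-of-00001.parquet",
    "agent-tool/function-calling/bfcl",
    "agent-tool/function-calling/xlam-irrelevance-7p5k/xlam-7.5k-irrelevancek.json",
] + [f"paraphrase-pairs/paws-x/{lang}/test-00000-of-00001.parquet" for lang in PAWSX_LANGS]


def missing_datasets(root: Path) -> List[str]:
    """Return the required relative paths that do not exist under root."""
    return [rel for rel in REQUIRED if not (root / rel).exists()]


if __name__ == "__main__":
    root = Path(os.environ.get("CACHEBENCH_DATASETS", "."))
    missing = missing_datasets(root)
    for rel in missing:
        print(f"missing: {root / rel}", file=sys.stderr)
    sys.exit(1 if missing else 0)
```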

On the positive side, assemble.py IS bit-for-bit reproducible: rerunning it on the same domains/ directory produces an identical cachebench.jsonl (SHA-256 verified).

Reproducibility grade: D. The merge step is deterministic, but ~640 of 2,000 rows (multilingual + qa) cannot be regenerated to match the published bytes without setting PYTHONHASHSEED=0 and matching the original env exactly.


6. Required dependencies inventory

(a) To run the harness against an existing cachebench.jsonl

Python stdlib only. No third-party deps required.

Verified imports in eval_harness.py: argparse, importlib, json, time, collections, dataclasses, pathlib, statistics, typing, __future__.

Python ≥ 3.10 (uses set[str], tuple[float, float], list[float], X | Y union syntax).

(b) To re-run all generators

| Package | Used by |
|---|---|
| sympy | gen_math.py (sanity-check equivalences during build) |
| pyarrow | qa_build.py (read TriviaQA / QQP parquet) |
| pandas | gen_multilingual.py, qa_build.py (read PAWS-X parquet via pd.read_parquet) |

External data (must be present at /home/bud/ditto/budCache/research/datasets/ or paths sed-replaced):

  • rag-qa/trivia-qa/rc.nocontext/train-00000-of-00001.parquet
  • intent-canonicalization/banking77-mteb/train.jsonl
  • paraphrase-pairs/paws-x/{en,de,fr,es,ja,ko,zh}/test-00000-of-00001.parquet
  • paraphrase-pairs/glue-qqp/qqp/train-00000-of-00001.parquet
  • agent-tool/function-calling/bfcl/...
  • agent-tool/function-calling/xlam-irrelevance-7p5k/xlam-7.5k-irrelevancek.json

Also internal: cachebench/sources/code_fixtures/snippets.py (used by gen_code.py:42).

(c) To execute all verification methods that judge-minimization.md proposes

These methods are not implemented anywhere in the harness — the harness's evaluate() ignores verification_method and only compares is_hit against binary_label. If someone wanted to actually wire up the verification stack as designed, they'd need:

| Method | Needed deps |
|---|---|
| exact_match | stdlib only (regex normalization) |
| sympy | sympy (already required for build); optional: sympy.parsing.latex needs antlr4-python3-runtime |
| ast_diff | stdlib ast |
| code_exec | A sandboxed Python subprocess (the design doc recommends subprocess.run with timeout); ideally nsjail / firejail / Docker for safety |
| regex | stdlib re |
| json_schema | jsonschema (not currently installed) |
| sql_canonical | sqlglot (not installed; referenced in judge-min doc §3.5) |
| set_match | spacy for NER + en_core_web_sm model (per design doc §3.1) |
| policy_no_cache | stdlib only |
| llm_judge | anthropic + openai SDKs, snapshot-pinned model IDs, ~$20–60/run per the doc §1.5 |
| rubric | stdlib only (judge-min §4.8) |

Optional secondaries from the design doc:

  • sentence-transformers (judge-min §3.6 for embedding cosine sim)
  • bert-score (judge-min §3.7)
  • LaBSE (SCHEMA.md multilingual fallback)
  • pint (units conversion in math tier-2)
  • tree-sitter (code tier-2)

None of these tier-2/3 dependencies are pinned anywhere. There is no requirements.txt, no pyproject.toml, no environment.yml in the cachebench/ tree.


7. Concrete patch list

Patch 1: Fix gen_multilingual.py and qa_build.py reproducibility

Replace random.Random(hash(...)) with a deterministic integer seed.

gen_multilingual.py line 85:

# OLD
random.Random(hash(lang) & 0xFFFF).shuffle(idx)
# NEW: stable lookup
_LANG_SEED = {"en": 1, "de": 2, "fr": 3, "es": 4, "ja": 5, "ko": 6, "zh": 7}
random.Random(_LANG_SEED[lang]).shuffle(idx)

gen_multilingual.py line 114:

# OLD
random.Random((hash(lang) * 31) & 0xFFFF).shuffle(idx)
# NEW
random.Random(_LANG_SEED[lang] * 31).shuffle(idx)

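qa_build.py line 200 and line 405: build a stable int from the string with hashlib.sha256 (a content hash that is not process-salted). For the tuple seed at line 405, join the parts with a separator first:

import hashlib
def _stable_seed(*parts: str) -> int:
    # Join on a separator unlikely to occur in the parts, then use the first 4 digest bytes as a 32-bit seed.
    joined = "\x1f".join(parts)
    return int.from_bytes(hashlib.sha256(joined.encode("utf-8")).digest()[:4], "big")
# line 200
rng = random.Random(_stable_seed(intent))
# line 405 (keeps the original XOR so the two streams stay distinct)
rng = random.Random(_stable_seed(ia, ib) ^ 1)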

After patching, verify two fresh subprocess runs produce identical bytes and update the published cachebench.jsonl (this WILL change ~640 row contents).
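
The verification step can be scripted. The sketch below is one possible shape, with hypothetical helper names (`output_digest`, `is_reproducible`): it runs a command in fresh interpreters under different PYTHONHASHSEED values and compares the SHA-256 of the file the command writes. The command and output path are supplied by the caller, since the generators' exact invocation is not fixed here.

```python
import hashlib
import os
import subprocess
from pathlib import Path
from typing import List, Sequence


def output_digest(cmd: List[str], out_path: Path, hashseed: str) -> str:
    """Run cmd in a fresh process with the given PYTHONHASHSEED and hash its output file."""
    env = dict(os.environ, PYTHONHASHSEED=hashseed)
    subprocess.run(cmd, env=env, check=True)
    return hashlib.sha256(out_path.read_bytes()).hexdigest()


def is_reproducible(cmd: List[str], out_path: Path, seeds: Sequence[str] = ("1", "2")) -> bool:
    """True if every run produces byte-identical output regardless of hash seed."""
    digests = {output_digest(cmd, out_path, seed) for seed in seeds}
    return len(digests) == 1
```

A call such as `is_reproducible([sys.executable, "scripts/qa_build.py"], Path("domains/qa.jsonl"))` would be the intended use; the output filename in that example is a guess and should be replaced with whatever the generator actually writes.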

Patch 2: Make all generator paths configurable

Each generator should accept --out and --datasets-dir:

import argparse
import os
from pathlib import Path

ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path(__file__).parent.parent / "domains" / "multilingual.jsonl")
# The environment variable overrides the legacy hard-coded default.
ap.add_argument("--datasets-dir", type=Path, default=Path(os.environ.get("CACHEBENCH_DATASETS", "/home/bud/ditto/budCache/research/datasets")))
args = ap.parse_args()
OUT = args.out
PAWS = args.datasets_dir / "paraphrase-pairs/paws-x"

Or simpler: read CACHEBENCH_ROOT from env with a sensible default. Apply to every generator.

Patch 3: Fix assemble.py to validate verification & construction methods

assemble.py ~line 30, add:

VALID_VERIFICATION = {
    "exact_match", "sympy", "ast_diff", "code_exec", "regex",
    "json_schema", "sql_canonical", "set_match", "policy_no_cache",
    "llm_judge", "rubric",
}
VALID_CONSTRUCTION = {
    "real_traffic", "intent_bottom_up", "adversarial_perturbation",
    "lm_generated", "existing_benchmark",
}

EXPECTED_CONSTRUCTION = {
    "real_traffic": 600, "intent_bottom_up": 500,
    "adversarial_perturbation": 500, "lm_generated": 250,
    "existing_benchmark": 150,
}

EXPECTED_DIFFICULTY = {"easy": 600, "medium": 1000, "hard": 400}

Then inside validate_row add:

if row["verification_method"] not in VALID_VERIFICATION:
    errors.append(f"{rid}: invalid verification_method {row['verification_method']!r}")
if row["construction_method"] not in VALID_CONSTRUCTION:
    errors.append(f"{rid}: invalid construction_method {row['construction_method']!r}")
if row.get("verification_payload") is not None and not isinstance(row["verification_payload"], dict):
    errors.append(f"{rid}: verification_payload must be null or dict")

And in main, add a second audit block that prints construction-method drift and difficulty drift the same way label drift is printed.
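
A sketch of what that second audit block could look like. `drift_report` is a hypothetical helper, not code from assemble.py; it assumes rows are dicts and that an exact match is required unless a tolerance is passed. The targets are the SCHEMA.md numbers quoted in this audit.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Mapping

EXPECTED_CONSTRUCTION = {
    "real_traffic": 600, "intent_bottom_up": 500,
    "adversarial_perturbation": 500, "lm_generated": 250,
    "existing_benchmark": 150,
}
EXPECTED_DIFFICULTY = {"easy": 600, "medium": 1000, "hard": 400}


def drift_report(rows: Iterable[Dict[str, Any]], field: str,
                 expected: Mapping[str, int], tolerance: int = 0) -> List[str]:
    """One 'OK'/'DRIFT' line per value of field, including values not in expected."""
    actual = Counter(str(row.get(field)) for row in rows)
    lines = []
    for key in sorted(set(expected) | set(actual)):
        want = expected.get(key, 0)
        got = actual.get(key, 0)
        status = "OK" if abs(got - want) <= tolerance else "DRIFT"
        lines.append(f"{field}={key}: {got}/{want} {status}")
    return lines


if __name__ == "__main__":
    demo = [{"construction_method": "lm_generated", "difficulty": "easy"}] * 3
    for line in drift_report(demo, "construction_method", EXPECTED_CONSTRUCTION):
        print(line)
    for line in drift_report(demo, "difficulty", EXPECTED_DIFFICULTY):
        print(line)
```

Values that are absent from the expected mapping show up with a target of 0, so an unlisted construction_method is reported as drift rather than ignored.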

Patch 4: Sync SCHEMA.md ↔ data

Either: (a) add "rubric" to the SCHEMA.md verification_method enum at line 30, and remove code_exec / sql_canonical / set_match (they're unused), OR (b) regenerate the dataset using only the SCHEMA-blessed enum.

Recommendation: (a). The design doc §3 makes it clear that "rubric" is a first-class tier. Update SCHEMA.md to:

- "verification_method": "exact_match | sympy | ast_diff | code_exec | regex | json_schema | sql_canonical | set_match | policy_no_cache | llm_judge",
+ "verification_method": "exact_match | sympy | ast_diff | regex | json_schema | policy_no_cache | rubric | llm_judge",

(or keep code_exec / sql_canonical / set_match if generators will eventually emit them).

Patch 5: Remove the validate_row dead-code / fragile prefix check

scripts/assemble.py:54-77 — drop the return value entirely:

def validate_row(row: Dict[str, Any], path: Path, line_no: int, errors: List[str]) -> None:
    rid = row.get("id", f"{path.name}:{line_no}")
    missing = REQUIRED_FIELDS - row.keys()
    if missing:
        errors.append(f"{rid}: missing required fields: {sorted(missing)}")
        return                                  # <- early return, simple
    # ... rest of checks unchanged, no return at the end

main() currently doesn't use the return value, so this is a no-op semantic change.

Patch 6: Harden load_bench

def load_bench(path: Path) -> List[Row]:
    rows = []
    # Open explicitly so the handle is closed and the encoding is not locale-dependent.
    with path.open(encoding="utf-8") as f:
        for ln, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            d = json.loads(line)
            # Fail loudly on missing or null required fields.
            for k in ("id", "domain", "binary_label", "query_a", "query_b"):
                if d.get(k) is None:
                    raise ValueError(f"{path}:{ln} missing required field {k!r}")
            if d["binary_label"] not in ("HIT", "MISS"):
                raise ValueError(f"{path}:{ln} bad binary_label {d['binary_label']!r}")
            rows.append(Row(**{k: d.get(k) for k in Row.__dataclass_fields__}))
    return rows

Patch 7: Harden evaluate against bad caches

# requires `import math` at the top of eval_harness.py
for row in rows:
    t0 = time.perf_counter()
    try:
        verdict = cache_decide(row)
    except Exception:
        # Record a decision error, score the row as a MISS, and keep going.
        by_tier["error"] += 1
        is_hit, dm, cost = False, (time.perf_counter() - t0) * 1000, 0.0
    else:
        if not isinstance(verdict, Verdict):
            raise TypeError(f"cache returned {type(verdict).__name__}, expected Verdict")
        elapsed_ms = (time.perf_counter() - t0) * 1000
        # Fall back to wall-clock time unless decision_ms is a finite positive number.
        dm = verdict.decision_ms
        if not isinstance(dm, (int, float)) or not math.isfinite(dm) or dm <= 0:
            dm = elapsed_ms
        # Treat a missing or non-finite cost (None, NaN, inf) as zero.
        cost = verdict.cost_usd
        if not isinstance(cost, (int, float)) or not math.isfinite(cost):
            cost = 0.0
        by_tier[verdict.tier_used or "unknown"] += 1
        is_hit = verdict.is_hit
    # Update every breakdown on both paths so per-domain totals still sum to overall.
    latencies.append(dm)
    costs.append(cost)
    total_cost += cost
    overall.add(row.truth_is_hit, is_hit)
    by_domain[row.domain].add(row.truth_is_hit, is_hit)
    by_label[row.label].add(row.truth_is_hit, is_hit)
    by_diff[row.difficulty].add(row.truth_is_hit, is_hit)

Patch 8: Document the verification-method gap

Add a "## Note on verification_method" section to SCHEMA.md:

The harness's headline HIT/MISS metric does not invoke verification_method. Labels are pre-computed at build time and the harness only compares the cache's Verdict.is_hit against binary_label. The verification_method field documents how the gold label was derived and is intended for downstream response-quality / answer-correctness scoring (out of scope for v1). For ~120 rows that carry llm_judge, the gold labels were curator-assigned; no judge runs at eval time.

Patch 9: Add a minimal test suite

# tests/test_eval_harness.py
import pytest
from eval_harness import wilson_ci, percentile, ConfusionMatrix, Verdict, Row, evaluate

def test_wilson_ci_edge_cases():
    assert wilson_ci(0, 0) == (0.0, 0.0)
    lo, hi = wilson_ci(0, 100)
    assert 0.0 == lo and 0.03 < hi < 0.05
    lo, hi = wilson_ci(50, 100)
    assert abs(lo - 0.404) < 1e-2 and abs(hi - 0.596) < 1e-2

def test_percentile_matches_numpy():
    import numpy as np
    vals = list(range(1, 1001))
    for p in [0.0, 0.5, 0.95, 0.99, 1.0]:
        assert abs(percentile([float(x) for x in vals], p) - np.percentile(vals, p*100)) < 1e-9

def test_confusion_matrix_zero():
    cm = ConfusionMatrix()
    assert cm.precision == 0.0 and cm.recall == 0.0 and cm.f1 == 0.0

def test_evaluate_empty():
    rep = evaluate(lambda r: Verdict(False), [])
    assert rep.overall.total == 0
    assert rep.to_markdown()  # doesn't crash

Patch 10: Fix gen_multilingual.py re-seed at line 303

Replace random.seed(123) with a local rng = random.Random(123):

rng = random.Random(123)
while len(unrel_pairs) < 30 and attempts < 1000:
    la = rng.choice(all_langs)
    lb = rng.choice([x for x in all_langs if x != la])
    # ...

Avoid clobbering the global RNG state.

Patch 11: Add a requirements.txt

At cachebench/requirements.txt:

# To run the harness
# (stdlib only)

# To re-run generators
sympy>=1.12
pyarrow>=15.0
pandas>=2.1

# Optional: to wire up verification methods per design/judge-minimization.md
# sqlglot>=23.0       # sql_canonical
# jsonschema>=4.0     # json_schema
# spacy>=3.7          # set_match (NER)
# sentence-transformers>=2.2  # conv tier-2 embedding sim
# bert-score>=0.3     # creative tier-2
# pint>=0.23          # math tier-2 units
# anthropic>=0.30     # llm_judge (cross-family)
# openai>=1.30        # llm_judge (cross-family)

Patch 12: Add a download / build script

cachebench/build.sh:

#!/usr/bin/env bash
set -euo pipefail
cd "$(dirname "$0")"
export PYTHONHASHSEED=0  # critical for qa_build and gen_multilingual until they're patched
export CACHEBENCH_DATASETS="${CACHEBENCH_DATASETS:-/home/bud/ditto/budCache/research/datasets}"

mkdir -p domains
python scripts/gen_math.py
python scripts/gen_code.py
python scripts/gen_conv.py
python scripts/gen_multiturn.py
python scripts/gen_tool.py
python scripts/gen_creative.py
python scripts/gen_personalized.py
python scripts/gen_multilingual.py
python scripts/qa_build.py
python scripts/assemble.py --strict

Appendix A: What the audit verified empirically

assemble.py --strict on current dataset:  PASSES (2000/2000, all label quotas OK)
re-running assemble.py:                   bit-for-bit identical (sha256 stable)
gen_creative.py rerun:                    bit-for-bit identical
gen_personalized.py rerun:                bit-for-bit identical
gen_math.py rerun:                        bit-for-bit identical
gen_code.py rerun:                        bit-for-bit identical
gen_conv.py rerun:                        bit-for-bit identical
gen_tool.py rerun:                        bit-for-bit identical
gen_multiturn.py rerun:                   bit-for-bit identical
gen_multilingual.py rerun:                DIFFERENT (hash() salt — Bug C1)
qa_build.py rerun:                        DIFFERENT (hash() salt — Bug C2)
gen_multilingual.py PYTHONHASHSEED=0 vs 1: DIFFERENT (confirms C1)
qa_build.py PYTHONHASHSEED=0 vs 1:        DIFFERENT (confirms C2)

eval_harness baselines:
  always_miss:  Precision 0.000  Recall 0.000  FHR 0.000  N=2000
  always_hit:   Precision 0.412  Recall 1.000  FHR 1.000  N=2000
  exact_match:  has 172 trivial FPs from personalized/conv/multi_turn/tool same-text MISSes

Empirical metric verification:
  wilson_ci(10, 100) = (0.05522854, 0.17436730)  ← matches Brown/Cai/DasGupta
  percentile matches numpy method="linear" within 1e-13 across p=0..1

Appendix B: Files audited

/home/bud/ditto/budCache/research/cachebench/scripts/eval_harness.py    (386 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/assemble.py        (166 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_math.py        (946 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_code.py        (1400+ lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_conv.py        (~600 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_multiturn.py   (~1500 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_tool.py        (~2000 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_creative.py    (~400 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_personalized.py (~500 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_multilingual.py (~356 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/qa_build.py        (~1100 lines)
/home/bud/ditto/budCache/research/cachebench/SCHEMA.md
/home/bud/ditto/budCache/research/cachebench/design/judge-minimization.md
/home/bud/ditto/budCache/research/cachebench/cachebench.jsonl           (2000 rows, 2.1 MB)