CacheBench v1 — Brutal Code Audit

Scope. scripts/eval_harness.py, scripts/assemble.py, scripts/gen_*.py, scripts/qa_build.py, SCHEMA.md, design/judge-minimization.md, and the merged cachebench.jsonl (2,000 rows).

Method. Static reading + dynamic probes: ran every baseline, re-ran every generator in fresh subprocesses (with varying PYTHONHASHSEED), exercised the harness on hand-crafted edge cases, byte-compared re-merges, validated metric formulae against numpy/Wilson-formula references.

1. Executive summary

Code-quality grade: B- (passes happy-path; reproducibility, schema-validation, and verification-tier wiring all have material defects).

The harness's headline metrics (precision/recall/F1/FHR/Wilson CI/percentile) are arithmetically correct against published formulae. The assembler enforces the documented row-count quotas and the merged dataset hits 2,000/2,000 with all per-domain × per-label cells matching the SCHEMA matrix exactly. Three built-in baselines run end-to-end and produce sensible reports.

But the audit found:

Reproducibility is broken for two generators (gen_multilingual.py, qa_build.py). They seed random.Random(hash(...)) with Python strings/tuples; CPython salts str and bytes hashes (and therefore the hashes of tuples that contain them) per process unless PYTHONHASHSEED is pinned to a fixed value (0 disables the salt). Two fresh runs produce different SHA-256s, a P0 reproducibility defect affecting ~640 of the 2,000 rows.
Construction-method mix drifts severely from SCHEMA.md (e.g. lm_generated is 767 vs target 250; real_traffic is 310 vs target 600). Nothing flags this — assemble.py checks label quotas but not construction quotas.
assemble.py does not validate the verification_method enum. 380 rows use rubric, which is not in the SCHEMA.md enum, while SCHEMA.md lists code_exec, sql_canonical, and set_match, which never appear in the data. The schema and the dataset are out of sync and the assembler is silent; either fix the doc or fix the data.
The validate_row return value is dead code (main() never reads it) and is computed with a buggy any(e.startswith(rid + ":") for e in errors[-5:]) heuristic that spuriously returns False for a clean row when an earlier row's error happens to share its id prefix.
The harness's load_bench is too tolerant: unknown fields silently dropped, missing fields silently None. A row with binary_label: null crashes downstream in truth_is_hit with AttributeError.
The harness crashes if a cache returns None (no friendly error) or sets cost_usd=None. No input validation on the Verdict it receives.
Verification methods are referenced but not implemented. SCHEMA.md and judge-minimization.md describe an elaborate tier-1/2/3 verification stack (sympy, ast_diff, json_schema, llm_judge, rubric…); the harness implements none of them — it only compares the cache's is_hit against the pre-computed binary_label. This is internally consistent (HIT/MISS is the headline metric) but undocumented and surprising. Anyone reading SCHEMA.md expects the harness to honor verification_method; it doesn't.

Below: bug list, then patches.

2. Bug list

Critical (data-correctness or reproducibility)

#	File:line	Severity	Description
C1	`scripts/gen_multilingual.py:85, 114`	Critical	Uses `random.Random(hash(lang) & 0xFFFF).shuffle(idx)`. CPython hashes strings with a per-process random salt unless `PYTHONHASHSEED=0`, so two fresh runs produce different PAWS-X picks → different multilingual.jsonl. Verified empirically: two fresh subprocess runs produced SHA-256 `3eec67d…` vs `d621241…`.
C2	`scripts/qa_build.py:200, 405`	Critical	Same bug pattern: `random.Random(hash(intent) % (2**32))` and `random.Random((hash((ia, ib)) & 0xFFFFFFFF) ^ 1)`. ~490 of 2,000 rows depend on this. Verified: two fresh runs produced different SHA-256s; with `PYTHONHASHSEED=0` vs `=1` results also differ.
C3	`scripts/qa_build.py` whole file	Major	`qa_build.py` is missing a shebang line and an `if __name__ == "__main__":` guard. The script executes module-level side-effects on import (the assertion-heavy build sequence). Anything that imports it as a module will run the whole build. (`gen_multilingual.py`, `gen_personalized.py`, and `gen_creative.py` have the same issue; of these, only `gen_multilingual.py` also has the reproducibility defect, see C1.)
C4	`cachebench.jsonl` distribution	Major	Construction-method counts deviate hugely from SCHEMA.md §"Construction-method mix": `lm_generated` is 767/250 (+207%), `real_traffic` is 310/600 (-48%), `intent_bottom_up` is 205/500 (-59%), `adversarial_perturbation` is 533/500 (+7%), `existing_benchmark` is 185/150 (+23%). Either fix generators or rewrite SCHEMA.md.
C5	`cachebench.jsonl` verification_method	Major	380 rows have `verification_method: rubric` but SCHEMA.md line 30 doesn't include `rubric` in the allowed enum. Conversely, `code_exec`, `sql_canonical`, `set_match` are in the schema enum but never appear in the dataset.
C6	`cachebench.jsonl` difficulty mix	Minor	SCHEMA.md targets 30/50/20 → 600/1000/400. Actual: 639/869/492 (+39 easy, -131 medium, +92 hard). Within tolerance but worth documenting.
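
The salt behaviour behind C1 and C2 can be reproduced outside the generators. The following is a minimal sketch, not code from the audited repository: `salted_hash` and `stable_seed` are hypothetical helper names, and the seed values 1 and 2 are arbitrary. It asks fresh interpreters for `hash("en")` under different `PYTHONHASHSEED` values and contrasts that with a SHA-256-derived seed.

```python
import hashlib
import os
import subprocess
import sys


def salted_hash(text: str, hash_seed: int) -> int:
    """hash(text) as seen by a fresh interpreter started with PYTHONHASHSEED=hash_seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())


def stable_seed(text: str) -> int:
    """32-bit seed from SHA-256; does not depend on the interpreter's hash salt."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")


if __name__ == "__main__":
    # Same text, different salt: the two values differ between processes.
    print(salted_hash("en", 1), salted_hash("en", 2))
    # The SHA-256-derived seed is the same in every process.
    print(stable_seed("en"))
```

Running a generator twice without pinning `PYTHONHASHSEED` is equivalent to the first print with two random salts, which is why the output files differ.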

Major (correctness bugs that don't yet bite)

#	File:line	Severity	Description
M1	`scripts/assemble.py:76`	Major	`return not any(e.startswith(f"{rid}:") for e in errors[-5:])` is a fragile sliding-window check. It (a) silently misses errors when more than 5 are pushed by the same row; (b) spuriously returns `False` for a clean row when an earlier row's error happens to share its id prefix. Reproduced: `validate_row(good_row, ...)` returns `False` when `errors = ['X: missing fields']` and `good_row.id == 'X'`. The return value is never used in `main()`, so the bug is latent, but the dead code is a maintenance trap.
M2	`scripts/assemble.py:54-77`	Major	`validate_row` never checks `verification_method` against the SCHEMA enum, never checks `construction_method` against the SCHEMA enum, and never checks that `verification_payload` is None-or-dict. All three are SCHEMA.md requirements.
M3	`scripts/assemble.py:39-50`	Major	`EXPECTED_QUOTA` only enforces per-domain × per-label rows. It does not enforce SCHEMA.md's construction-method mix (Section "Construction-method mix") nor the difficulty mix (Section "Difficulty distribution"). The "OK/DRIFT" audit only covers labels, so reviewers think the dataset is clean when half the documented mixes are off.
M4	`scripts/eval_harness.py:78-87`	Major	`load_bench` uses `Row(**{k: d.get(k) for k in Row.__dataclass_fields__})`. Two failure modes: (i) unknown JSONL fields are silently dropped; (ii) `d.get(k)` returns None for missing fields. A row with `binary_label: null` (or missing) crashes in `truth_is_hit` with `'NoneType' object has no attribute 'upper'`. No friendly error.
M5	`scripts/eval_harness.py:75`	Major	`truth_is_hit` is `self.binary_label.upper() == "HIT"`. An unrecognized string label (e.g. `"true"`, `"yes"`, `"H"`) silently maps to MISS, a lowercase `"hit"` is silently accepted as HIT, and a non-string label (`1`, `None`) raises `AttributeError` (see M4). Should explicitly validate-or-raise.
M6	`scripts/eval_harness.py:294, 297`	Major	No validation that `verdict.decision_ms ≥ 0`, that `verdict.cost_usd` is a finite number, or that `verdict.tier_used` is a non-empty string. A cache returning `Verdict(cost_usd=None)` causes `TypeError` on the sum. A cache returning `Verdict(decision_ms=float('inf'))` poisons the percentile.
M7	`scripts/eval_harness.py:290`	Major	`verdict = cache_decide(row)` — if cache raises, the entire evaluation aborts. No `try/except` to mark that row as a "decision error" and continue. For a 2,000-row benchmark you want the rest of the run regardless.
M8	`scripts/eval_harness.py:5`	Minor	Docstring claims `cache` exposes `decide(row) -> Verdict`. The actual code calls `cache_decide(row)` (a `Callable`); the built-in baselines use `__call__`. Either the docstring or the contract is wrong.
M9	`scripts/eval_harness.py:188`	Minor	The `by_label` docstring says "restricted to MISS labels for FHR breakdown" but at line 302 the harness puts ALL labels (including EQUIV / PARA_SAFE) into the dict. `to_markdown()` only iterates `["RELATED_UNSAFE", "ADVERSARIAL", "UNRELATED"]` so the extra entries are dead weight.
M10	`scripts/eval_harness.py:376`	Minor	If `--cache-module` is given but the module exposes neither a class nor a top-level `cache` attribute, the user gets `AttributeError: module 'X' has no attribute 'cache'` — not a friendly diagnostic.
M11	`scripts/eval_harness.py:373-376`	Minor	`cache = getattr(mod, args.cache_class)()` instantiates with no args. If the cache class needs config (model name, embedder path, threshold), the user can't pass it via CLI. No `--cache-args` flag, no JSON-config option.
M12	`scripts/qa_build.py:101, 200, 405` and `scripts/gen_multilingual.py:85, 114`	Major	`hash()` of `str` and `bytes` objects (and of tuples that contain them) is salted per process unless `PYTHONHASHSEED` is pinned. Treating `hash(s) & 0xFFFF` as a deterministic seed is a textbook reproducibility bug (the root cause of C1 and C2).
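
The M1 failure mode can be shown in isolation. This is a small sketch rather than the audited code: `fragile_ok` mimics the expression quoted in M1, and `validate_with_count` is a hypothetical alternative that derives the result from how many errors the current call appended (its single field check is only a stand-in for the real checks).

```python
from typing import Any, Dict, List


def fragile_ok(rid: str, errors: List[str]) -> bool:
    """Mimics the audited check: scan the last five errors for this id prefix."""
    return not any(e.startswith(f"{rid}:") for e in errors[-5:])


def validate_with_count(row: Dict[str, Any], errors: List[str]) -> bool:
    """Hypothetical alternative: a row is valid iff this call appended nothing."""
    before = len(errors)
    rid = row.get("id", "<no id>")
    if "binary_label" not in row:  # stand-in for the real field checks
        errors.append(f"{rid}: missing binary_label")
    return len(errors) == before


errors = ["X: missing fields"]                 # left behind by an earlier row with id "X"
good_row = {"id": "X", "binary_label": "HIT"}  # a clean row that reuses the id

print(fragile_ok(good_row["id"], errors))      # False: the clean row is blamed
print(validate_with_count(good_row, errors))   # True: only this call's errors count
```

Counting errors before and after the call avoids any dependence on id prefixes or on how many errors earlier rows produced.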

Minor / style / docs

#	File:line	Severity	Description
m1	`scripts/gen_personalized.py:15`, `gen_multilingual.py:19`, `gen_creative.py:14`, `gen_conv.py:10`, `gen_multiturn.py:13`, `gen_code.py:36`, `gen_math.py:936`, `qa_build.py:32-33`, `gen_tool.py:22`	Minor	Every generator hardcodes `/home/bud/ditto/budCache/research/...` paths. A third party cannot rerun these without sed-replacing or symlinking. Accept `--out` and `--datasets-dir` CLI flags, or read from `$CACHEBENCH_DATASETS`.
m2	`scripts/gen_tool.py:709, 714, 957, 1587, 1711`	Minor	Hard-coded `/home/me/...` strings appear inside query text. Cosmetic — they're synthetic example paths intended to look like a user's home dir — but worth a comment.
m3	`scripts/qa_build.py:1`	Minor	Missing shebang line. Same for `gen_multilingual.py`, `gen_personalized.py`, `gen_creative.py`. `gen_math.py`, `gen_code.py`, `gen_conv.py`, `gen_multiturn.py`, `gen_tool.py`, `eval_harness.py`, `assemble.py` have it. Inconsistent.
m4	`scripts/eval_harness.py:142-149`	Minor	`to_dict` rounds to 4 decimals at serialization time. If a downstream consumer wants raw precision they have to recompute. Prefer storing raw and rounding only in `to_markdown`.
m5	`scripts/eval_harness.py:170-176`	Minor	Wilson CI returns `(0.0, 0.0)` when `n=0`. Mathematically the CI is undefined; `(0, 1)` is a more conventional default. The harness's "Wilson CI: 0.000, 0.000" line in the empty-dataset markdown is misleading.
m6	`scripts/eval_harness.py:256-259`	Minor	`per 1k decisions: ${self.total_cost_usd / max(1, self.overall.total) * 1000:.4f}` divides by `max(1, total)` which gives the right answer for total=0 but the displayed denominator is fictional.
m7	`scripts/eval_harness.py:265-268`	Minor	Tier-counts table sorts by descending count using `sorted(..., key=lambda x: -x[1])`. For ties the iteration order is dict-insertion which is fine but not documented. Use `sorted(..., key=lambda x: (-x[1], x[0]))` for stable output.
m8	`scripts/assemble.py:159`	Minor	`json.dumps(row, ensure_ascii=False)` uses the default separators `", "` and `": "`, so every row carries one extra space per separator and the merged file is larger than necessary. Use `separators=(",", ":")` for compactness. Trivial.
m9	`scripts/gen_multilingual.py:303`	Minor	`random.seed(123)` after using a different seed at line 22 — the second `seed(123)` overrides global state for the rest of the script. If a future edit adds RNG calls after that point, behavior changes silently.
m10	`scripts/eval_harness.py:84`	Minor	`for line in f` reads line by line inside a `with` block, so the file is closed correctly. However, `line.strip()` does not remove a leading UTF-8 BOM (U+FEFF is not whitespace), so a BOM-prefixed file fails in `json.loads` unless it is opened with `encoding="utf-8-sig"`. Undocumented.
m11	`scripts/eval_harness.py:362-368`	Minor	The baselines dict is rebuilt on every CLI invocation. Trivial cost but the `name = args.cache_name or args.baseline` line never falls into the else branch (`args.baseline` is truthy when set), so naming is fine.
m12	`design/judge-minimization.md` & `SCHEMA.md`	Minor	The two documents disagree on enum values for `verification_method` (`SCHEMA.md` has `code_exec` and `sql_canonical`; `judge-minimization.md` shows `policy_no_cache` and `rubric`), and the dataset uses yet another subset drawn from both. Pick a master list and link both docs to it.
m13	`scripts/eval_harness.py` no test file	Minor	No `test_eval_harness.py`. No unit tests for `wilson_ci`, `percentile`, `ConfusionMatrix`, or `evaluate`. Adding ~50 lines of `pytest` would catch most of the issues above.
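
The tie-breaking point in m7 is easy to check. A minimal sketch, assuming the tier counts live in a `Counter`-like mapping (the tier names and the helper name `ordered_tiers` are made up for illustration):

```python
from collections import Counter


def ordered_tiers(by_tier):
    """Sort by descending count, then by tier name so ties are deterministic."""
    return sorted(by_tier.items(), key=lambda kv: (-kv[1], kv[0]))


# Two runs that saw the same tiers in a different order.
run_a = Counter()
for tier in ["judge"] * 3 + ["exact"] * 5 + ["embed"] * 3:
    run_a[tier] += 1
run_b = Counter()
for tier in ["embed"] * 3 + ["judge"] * 3 + ["exact"] * 5:
    run_b[tier] += 1

# Sorting by count alone keeps insertion order for the tied tiers.
by_count_a = sorted(run_a.items(), key=lambda kv: -kv[1])
by_count_b = sorted(run_b.items(), key=lambda kv: -kv[1])
print(by_count_a == by_count_b)                      # False
print(ordered_tiers(run_a) == ordered_tiers(run_b))  # True
```

With the secondary key, two reports over the same counts render identically regardless of the order in which the cache emitted tiers.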

3. Metric correctness verdict

Each metric is correctly defined and correctly implemented for non-degenerate inputs:

Metric	Formula	Implementation	Verdict
precision	`TP / (TP+FP)`	`eval_harness.py:116-118` returns 0.0 at `TP+FP=0`	Correct. Returning 0.0 at "no predicted hits" is the standard sklearn-like convention; a paper-pure choice would be undefined / NaN but 0.0 is defensible.
recall	`TP / (TP+FN)`	line 121-123 returns 0.0 at `TP+FN=0`	Correct. Same caveat — 0.0 vs NaN at the zero-positive edge.
f1	`2pr / (p+r)`	line 126-128 returns 0.0 at `p+r=0`	Correct.
false_hit_rate	`FP / (FP+TN)`	line 131-134 returns 0.0 at `FP+TN=0`	Correct.
accuracy	`(TP+TN) / total`	line 137-138	Correct.
Wilson 95% CI	per Wilson 1927 / Newcombe 1998	line 169-176	Correct. Compared against scipy-equivalent: `wilson_ci(10, 100) = (0.05522854, 0.17436730)`, matches Brown/Cai/DasGupta. Edge cases handled: `n=0` → (0,0); `k=0` → (0, 0.037); `k=n` → (0.963, 1.0). Center and half formulas are textbook.
latency p50/p95/p99	linear interpolation	line 156-162	Correct, matches numpy's `method='linear'`. Diff < 1e-13 across test cases including p=0.0, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0.
total cost	`sum(verdict.cost_usd)`	line 286-297	Correct if the cache reports its own total cost per decision. The harness does not detect double-counting (e.g., a cache returning the same call's cost twice). The contract is "incremental cost of THIS decision", which is reasonable.

Bottom line: the metrics are correctly computed. The only concerns are (a) zero-input behavior (0.0 vs NaN — a documentation/convention choice), and (b) no input validation on Verdict, so a malformed cache can poison the report (Bug M6).
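
For readers who want to re-check the two less obvious metrics, here is a reference sketch written from the textbook formulas. It is not the harness's code: the function bodies, the default `z = 1.96`, the clamping to [0, 1], and the empty-input return values are assumptions chosen to match the behaviour described in the table.

```python
import math
from typing import List, Tuple


def wilson_ci(k: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for k successes out of n trials."""
    if n == 0:
        return (0.0, 0.0)  # convention reported for the harness; arguably (0, 1)
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def percentile(values: List[float], p: float) -> float:
    """Percentile for p in [0, 1] using linear interpolation between ranks."""
    if not values:
        return 0.0  # assumed degenerate value for an empty run
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


if __name__ == "__main__":
    print(wilson_ci(10, 100))
    print(percentile([1.0, 2.0, 3.0, 4.0], 0.5))  # 2.5
```

With `z = 1.96` the first call lands on the interval quoted in the table to within rounding.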

4. Edge-case coverage

Handled:

Empty cachebench.jsonl → no crash, all metrics return 0, markdown report renders without errors (verified: evaluate(lambda r: Verdict(False), []) produces a clean degenerate report).
query_a == query_b with binary_label == "MISS" → 172 such rows in the dataset (personalized 50, conversational 50, multi_turn 50, tool 22). The ExactMatchCache baseline correctly flags these as FP and the per-domain FHR breakdown surfaces them.
query_a == query_b with binary_label == "HIT" → 192 such rows; counted as TP.
is_hit=True with confidence=0.0 → silently accepted (semantically questionable — see M6).
Bad --baseline value → argparse rejects.
Bad --bench path → FileNotFoundError, not a friendly message but it bubbles up.
Multiple judges per decision → cache must aggregate. Harness sees only the total.
Verdict with decision_ms=0 → falls back to wall-clock perf_counter.
Verdict with negative decision_ms → also falls back to wall-clock because verdict.decision_ms > 0 is False (works by accident, not by design).
Verdict with NaN decision_ms → falls back to wall-clock (NaN > 0 == False).

Not handled (bugs):

Row with a string binary_label that is not a case variant of HIT or MISS → silently treated as MISS.
Row with binary_label = null → AttributeError on .upper().
Verdict with cost_usd = None → TypeError on summation (M6).
Verdict with cost_usd = float('inf') → corrupts total silently.
Cache callable that raises → entire run aborts (M7).
Cache callable that returns None → crash with cryptic AttributeError.
Cache callable that returns an int (truthy) → likely AttributeError on .decision_ms.
Unknown domain value in a row → silently bucketed under that string; per-domain table grows with garbage rows.
Unknown verification_method in a row → ignored (the harness doesn't use the field at all).
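
The latency fallback cases listed under "Handled" follow from how Python compares floats. A minimal sketch, assuming the harness's rule is "use `decision_ms` if `decision_ms > 0`, else wall-clock" as described above; both function names are hypothetical, and `strict_latency_ms` is one possible explicit version, not the audited code.

```python
import math


def effective_latency_ms(decision_ms, wall_ms):
    """Mimics the described rule: trust decision_ms only when it compares > 0."""
    return decision_ms if decision_ms > 0 else wall_ms


def strict_latency_ms(decision_ms, wall_ms):
    """Explicit variant: also reject non-numeric and non-finite values."""
    ok = (
        isinstance(decision_ms, (int, float))
        and math.isfinite(decision_ms)
        and decision_ms > 0
    )
    return decision_ms if ok else wall_ms


wall = 1.25
for reported in (0, -5.0, float("nan"), float("inf")):
    print(reported, effective_latency_ms(reported, wall), strict_latency_ms(reported, wall))
```

Zero, negative, and NaN inputs fall back in both versions because each `> 0` comparison is false, while infinity passes the implicit rule and is only caught by the explicit one.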

5. Reproducibility verdict

Cannot rebuild bit-for-bit on a clean machine. Three independent failure modes:

hash()-based seeding (P0). gen_multilingual.py (lines 85, 114) and qa_build.py (lines 200, 405) seed random.Random(hash(...)) on strings and tuples. CPython's hash() of str and bytes objects (and of tuples that contain them) includes a per-process random salt unless PYTHONHASHSEED is pinned to a fixed value (0 disables the salt). Two fresh subprocess runs produce SHA-256 mismatches; PYTHONHASHSEED=0 vs =1 also produce different outputs.

Verified:
```
gen_multilingual run1: 3eec67d…  run2: d621241…  match: False
qa_build       run1: 88a849d…  run2: 2e824e6…  match: False
```
The other generators (gen_code, gen_conv, gen_creative, gen_math, gen_multiturn, gen_personalized, gen_tool) are deterministic across fresh processes.
Hard-coded absolute paths (P1). Every generator (and qa_build.py) hardcodes /home/bud/ditto/budCache/research/.... A third party MUST either (a) put their checkout at that exact path, or (b) sed-replace.
External datasets (P1). Generators depend on local copies of:
- TriviaQA (rag-qa/trivia-qa/rc.nocontext/train-00000-of-00001.parquet)
- BANKING77 MTEB (intent-canonicalization/banking77-mteb/train.jsonl)
- PAWS-X for 7 languages (paraphrase-pairs/paws-x/{en,de,fr,es,ja,ko,zh}/test-00000-of-00001.parquet)
- GLUE QQP (paraphrase-pairs/glue-qqp/qqp/train-00000-of-00001.parquet)
- BFCL v3 (agent-tool/function-calling/bfcl/)
- xLAM irrelevance 7.5k (agent-tool/function-calling/xlam-irrelevance-7p5k/xlam-7.5k-irrelevancek.json)
None of these are vendored. There is no download_datasets.sh. If HuggingFace shifts a dataset's file structure, the build breaks.

On the positive side, assemble.py IS bit-for-bit reproducible: rerunning it on the same domains/ directory produces an identical cachebench.jsonl (SHA-256 verified).

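Reproducibility grade: D. The merge step is deterministic, but ~640 of 2,000 rows (multilingual + qa) cannot be regenerated to match the published bytes unless PYTHONHASHSEED is pinned to the value the original build ran under (which is only possible if that build itself used a fixed seed) and the rest of the original environment is matched exactly.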
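
The byte-for-byte comparisons in this section amount to hashing two build outputs and comparing digests. A small sketch of that check; the helper names `file_sha256` and `same_bytes` and the temporary files are illustrative assumptions, not part of the repository.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large JSONL files are fine."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_bytes(a: Path, b: Path) -> bool:
    """True when two build outputs are bit-for-bit identical."""
    return file_sha256(a) == file_sha256(b)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        run1, run2 = Path(tmp) / "run1.jsonl", Path(tmp) / "run2.jsonl"
        run1.write_text('{"id": "a"}\n', encoding="utf-8")
        run2.write_text('{"id": "b"}\n', encoding="utf-8")
        print(same_bytes(run1, run1), same_bytes(run1, run2))  # True False
```

In practice the two paths would be the outputs of two fresh generator runs written to separate directories.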

6. Required dependencies inventory

(a) To run the harness against an existing `cachebench.jsonl`

Python stdlib only. No third-party deps required.

Verified imports in eval_harness.py: argparse, importlib, json, time, collections, dataclasses, pathlib, statistics, typing, __future__.

Python ≥ 3.10 (uses set[str], tuple[float, float], list[float], X | Y union syntax).

(b) To re-run all generators

Package	Used by
`sympy`	`gen_math.py` (sanity-check equivalences during build)
`pyarrow`	`qa_build.py` (read TriviaQA / QQP parquet)
`pandas`	`gen_multilingual.py`, `qa_build.py` (read PAWS-X parquet via `pd.read_parquet`)

External data (must be present at /home/bud/ditto/budCache/research/datasets/ or paths sed-replaced):

rag-qa/trivia-qa/rc.nocontext/train-00000-of-00001.parquet
intent-canonicalization/banking77-mteb/train.jsonl
paraphrase-pairs/paws-x/{en,de,fr,es,ja,ko,zh}/test-00000-of-00001.parquet
paraphrase-pairs/glue-qqp/qqp/train-00000-of-00001.parquet
agent-tool/function-calling/bfcl/...
agent-tool/function-calling/xlam-irrelevance-7p5k/xlam-7.5k-irrelevancek.json

Also internal: cachebench/sources/code_fixtures/snippets.py (used by gen_code.py:42).

(c) To execute all verification methods that `judge-minimization.md` proposes

These methods are not implemented anywhere in the harness — the harness's evaluate() ignores verification_method and only compares is_hit against binary_label. If someone wanted to actually wire up the verification stack as designed, they'd need:

Method	Needed deps
`exact_match`	stdlib only (regex normalization)
`sympy`	`sympy` (already required for build); optional: `sympy.parsing.latex` needs `antlr4-python3-runtime`
`ast_diff`	stdlib `ast`
`code_exec`	A sandboxed Python subprocess (the design doc recommends `subprocess.run` with `timeout`); ideally `nsjail` / `firejail` / Docker for safety
`regex`	stdlib `re`
`json_schema`	`jsonschema` (not currently installed)
`sql_canonical`	`sqlglot` (not installed; referenced in judge-min doc §3.5)
`set_match`	`spacy` for NER + `en_core_web_sm` model (per design doc §3.1)
`policy_no_cache`	stdlib only
`llm_judge`	`anthropic` + `openai` SDKs, snapshot-pinned model IDs, ~$20–60/run per the doc §1.5
`rubric`	stdlib only (judge-min §4.8)

Optional secondaries from the design doc:

sentence-transformers (judge-min §3.6 for embedding cosine sim)
bert-score (judge-min §3.7)
LaBSE (SCHEMA.md multilingual fallback)
pint (units conversion in math tier-2)
tree-sitter (code tier-2)

None of these tier-2/3 dependencies are pinned anywhere. There is no requirements.txt, no pyproject.toml, no environment.yml in the cachebench/ tree.
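
Until a pinned requirements file exists, a quick probe can at least report which of the packages named above are importable. A minimal sketch: the two lists repeat import names from the tables in this section (and are deliberately incomplete), while `missing_modules` is a hypothetical helper.

```python
import importlib.util
from typing import Iterable, List

# Import names taken from the tables above; extend as needed.
GENERATOR_DEPS = ["sympy", "pyarrow", "pandas"]
OPTIONAL_DEPS = ["jsonschema", "sqlglot", "spacy"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    print("missing generator deps:", missing_modules(GENERATOR_DEPS))
    print("missing optional deps: ", missing_modules(OPTIONAL_DEPS))
```

This only checks that a module can be located; it says nothing about versions or about extra assets such as spaCy language models.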

7. Concrete patch list

Patch 1: Fix `gen_multilingual.py` and `qa_build.py` reproducibility

Replace random.Random(hash(...)) with a deterministic integer seed.

gen_multilingual.py line 85:

# OLD
random.Random(hash(lang) & 0xFFFF).shuffle(idx)
# NEW: stable lookup
_LANG_SEED = {"en": 1, "de": 2, "fr": 3, "es": 4, "ja": 5, "ko": 6, "zh": 7}
random.Random(_LANG_SEED[lang]).shuffle(idx)

gen_multilingual.py line 114:

# OLD
random.Random((hash(lang) * 31) & 0xFFFF).shuffle(idx)
# NEW
random.Random(_LANG_SEED[lang] * 31).shuffle(idx)

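qa_build.py line 200 and line 405: build a stable int from the string using hashlib.sha256(intent.encode()).digest()[:4] (a real cryptographic hash, not process-salted). For the tuple seed at line 405, join the parts into one string first (this assumes ia and ib are strings):

import hashlib
import random

def _stable_seed(s: str) -> int:
    # first 4 bytes of SHA-256 -> unsigned 32-bit int, identical in every process
    return int.from_bytes(hashlib.sha256(s.encode("utf-8")).digest()[:4], "big")

# line 200
rng = random.Random(_stable_seed(intent))
# line 405; "\x1f" is an arbitrary separator that keeps ("ab", "c") distinct from ("a", "bc")
rng = random.Random(_stable_seed(f"{ia}\x1f{ib}") ^ 1)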

After patching, verify two fresh subprocess runs produce identical bytes and update the published cachebench.jsonl (this WILL change ~640 row contents).

Patch 2: Make all generator paths configurable

Each generator should accept --out and --datasets-dir:

import argparse
import os
from pathlib import Path

ap = argparse.ArgumentParser()
ap.add_argument("--out", type=Path, default=Path(__file__).parent.parent / "domains" / "multilingual.jsonl")
ap.add_argument("--datasets-dir", type=Path, default=Path(os.environ.get("CACHEBENCH_DATASETS", "/home/bud/ditto/budCache/research/datasets")))
args = ap.parse_args()
OUT = args.out
PAWS = args.datasets_dir / "paraphrase-pairs/paws-x"

Or simpler: read CACHEBENCH_DATASETS from the environment with a sensible default and skip the CLI flags. Apply to every generator.

Patch 3: Fix `assemble.py` to validate verification & construction methods

assemble.py ~line 30, add:

VALID_VERIFICATION = {
    "exact_match", "sympy", "ast_diff", "code_exec", "regex",
    "json_schema", "sql_canonical", "set_match", "policy_no_cache",
    "llm_judge", "rubric",
}
VALID_CONSTRUCTION = {
    "real_traffic", "intent_bottom_up", "adversarial_perturbation",
    "lm_generated", "existing_benchmark",
}

EXPECTED_CONSTRUCTION = {
    "real_traffic": 600, "intent_bottom_up": 500,
    "adversarial_perturbation": 500, "lm_generated": 250,
    "existing_benchmark": 150,
}

EXPECTED_DIFFICULTY = {"easy": 600, "medium": 1000, "hard": 400}

Then inside validate_row add:

if row["verification_method"] not in VALID_VERIFICATION:
    errors.append(f"{rid}: invalid verification_method {row['verification_method']!r}")
if row["construction_method"] not in VALID_CONSTRUCTION:
    errors.append(f"{rid}: invalid construction_method {row['construction_method']!r}")
if row.get("verification_payload") is not None and not isinstance(row["verification_payload"], dict):
    errors.append(f"{rid}: verification_payload must be null or dict")

And in main, add a second audit block that prints construction-method drift and difficulty drift the same way label drift is printed.
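
One way the second audit block could look is sketched below. This is an illustration under assumptions, not the assembler's code: `audit_mix` and its `tolerance` parameter are hypothetical, rows are taken to be plain dicts, and the example counts are the construction-method figures reported in C4.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

EXPECTED_CONSTRUCTION = {
    "real_traffic": 600, "intent_bottom_up": 500,
    "adversarial_perturbation": 500, "lm_generated": 250,
    "existing_benchmark": 150,
}


def audit_mix(rows: Iterable[dict], field: str, expected: Dict[str, int],
              tolerance: int = 0) -> List[Tuple[str, int, int, str]]:
    """Compare per-value counts of `field` with targets; flag anything off by more than tolerance."""
    actual = Counter(r.get(field) for r in rows)
    report = []
    for name in sorted(set(expected) | set(actual), key=str):
        got, want = actual.get(name, 0), expected.get(name, 0)
        status = "OK" if abs(got - want) <= tolerance else "DRIFT"
        report.append((name, got, want, status))
    return report


if __name__ == "__main__":
    observed = {"lm_generated": 767, "real_traffic": 310, "intent_bottom_up": 205,
                "adversarial_perturbation": 533, "existing_benchmark": 185}
    rows = [{"construction_method": m} for m, n in observed.items() for _ in range(n)]
    for name, got, want, status in audit_mix(rows, "construction_method", EXPECTED_CONSTRUCTION):
        print(f"{name:26s} {got:5d} / {want:5d}  {status}")
```

The same function can be called with `"difficulty"` and `EXPECTED_DIFFICULTY`; values that appear in the data but not in the targets show up with a target of 0 and are flagged as drift.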

Patch 4: Sync `SCHEMA.md` ↔ data

Either: (a) add "rubric" to the SCHEMA.md verification_method enum at line 30, and remove code_exec / sql_canonical / set_match (they're unused), OR (b) regenerate the dataset using only the SCHEMA-blessed enum.

Recommendation: (a). The design doc §3 makes it clear that "rubric" is a first-class tier. Update SCHEMA.md to:

- "verification_method": "exact_match | sympy | ast_diff | code_exec | regex | json_schema | sql_canonical | set_match | policy_no_cache | llm_judge",
+ "verification_method": "exact_match | sympy | ast_diff | regex | json_schema | policy_no_cache | rubric | llm_judge",

(or keep code_exec / sql_canonical / set_match if generators will eventually emit them).

Patch 5: Remove the `validate_row` dead-code / fragile prefix check

scripts/assemble.py:54-77 — drop the return value entirely:

def validate_row(row: Dict[str, Any], path: Path, line_no: int, errors: List[str]) -> None:
    rid = row.get("id", f"{path.name}:{line_no}")
    missing = REQUIRED_FIELDS - row.keys()
    if missing:
        errors.append(f"{rid}: missing required fields: {sorted(missing)}")
        return                                  # <- early return, simple
    # ... rest of checks unchanged, no return at the end

main() currently doesn't use the return value, so dropping it does not change behavior.

Patch 6: Harden `load_bench`

def load_bench(path: Path) -> List[Row]:
    rows = []
    # the context manager closes the file; explicit encoding avoids locale surprises
    with path.open(encoding="utf-8") as f:
        for ln, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            d = json.loads(line)
            # validate required fields
            for k in ("id", "domain", "binary_label", "query_a", "query_b"):
                if d.get(k) is None:
                    raise ValueError(f"{path}:{ln} missing required field {k!r}")
            if d["binary_label"] not in ("HIT", "MISS"):
                raise ValueError(f"{path}:{ln} bad binary_label {d['binary_label']!r}")
            rows.append(Row(**{k: d.get(k) for k in Row.__dataclass_fields__}))
    return rows

Patch 7: Harden `evaluate` against bad caches

# requires `import math` at module top
for row in rows:
    t0 = time.perf_counter()
    try:
        verdict = cache_decide(row)
    except Exception:
        # count as a decision error, score the row as a MISS everywhere, and continue
        by_tier["error"] += 1
        overall.add(row.truth_is_hit, False)
        by_domain[row.domain].add(row.truth_is_hit, False)
        by_label[row.label].add(row.truth_is_hit, False)
        by_diff[row.difficulty].add(row.truth_is_hit, False)
        latencies.append((time.perf_counter() - t0) * 1000)
        costs.append(0.0)
        continue
    elapsed_ms = (time.perf_counter() - t0) * 1000
    if not isinstance(verdict, Verdict):
        raise TypeError(f"cache returned {type(verdict).__name__}, expected Verdict")
    # fall back to wall-clock unless decision_ms is a finite positive number
    dm = verdict.decision_ms
    if not (isinstance(dm, (int, float)) and math.isfinite(dm) and dm > 0):
        dm = elapsed_ms
    # treat a missing or non-finite cost (None, NaN, inf) as 0.0; raising instead is equally defensible
    cost = verdict.cost_usd
    if not (isinstance(cost, (int, float)) and math.isfinite(cost)):
        cost = 0.0
    latencies.append(dm)
    costs.append(cost)
    total_cost += cost
    by_tier[verdict.tier_used or "unknown"] += 1
    overall.add(row.truth_is_hit, verdict.is_hit)
    by_domain[row.domain].add(row.truth_is_hit, verdict.is_hit)
    by_label[row.label].add(row.truth_is_hit, verdict.is_hit)
    by_diff[row.difficulty].add(row.truth_is_hit, verdict.is_hit)

Patch 8: Document the verification-method gap

Add a `## Note on verification_method` section to SCHEMA.md:

The harness's headline HIT/MISS metric does not invoke verification_method. Labels are pre-computed at build time and the harness only compares the cache's Verdict.is_hit against binary_label. The verification_method field documents how the gold label was derived and is intended for downstream response-quality / answer-correctness scoring (out of scope for v1). For ~120 rows that carry llm_judge, the gold labels were curator-assigned; no judge runs at eval time.

Patch 9: Add a minimal test suite

# tests/test_eval_harness.py
import pytest
from eval_harness import wilson_ci, percentile, ConfusionMatrix, Verdict, Row, evaluate

def test_wilson_ci_edge_cases():
    assert wilson_ci(0, 0) == (0.0, 0.0)
    lo, hi = wilson_ci(0, 100)
    # approx() tolerates float rounding in (center - half) at k=0
    assert lo == pytest.approx(0.0, abs=1e-12) and 0.03 < hi < 0.05
    lo, hi = wilson_ci(50, 100)
    assert abs(lo - 0.404) < 1e-2 and abs(hi - 0.596) < 1e-2

def test_percentile_matches_numpy():
    import numpy as np
    vals = list(range(1, 1001))
    for p in [0.0, 0.5, 0.95, 0.99, 1.0]:
        assert abs(percentile([float(x) for x in vals], p) - np.percentile(vals, p*100)) < 1e-9

def test_confusion_matrix_zero():
    cm = ConfusionMatrix()
    assert cm.precision == 0.0 and cm.recall == 0.0 and cm.f1 == 0.0

def test_evaluate_empty():
    rep = evaluate(lambda r: Verdict(False), [])
    assert rep.overall.total == 0
    assert rep.to_markdown()  # doesn't crash

Patch 10: Fix `gen_multilingual.py` re-seed at line 303

Replace random.seed(123) with a local rng = random.Random(123):

rng = random.Random(123)
while len(unrel_pairs) < 30 and attempts < 1000:
    la = rng.choice(all_langs)
    lb = rng.choice([x for x in all_langs if x != la])
    # ...

Avoid clobbering the global RNG state.
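
The difference between reseeding the global generator and using a local one can be checked directly. A minimal sketch with arbitrary seed values (7 stands in for whatever the script seeded earlier; 123 is the value from line 303):

```python
import random

# Baseline: what the global stream yields after the script's first seed.
random.seed(7)
expected = [random.random() for _ in range(3)]

# Local RNG: draws from rng leave the global stream untouched.
random.seed(7)
rng = random.Random(123)
_ = [rng.choice("abc") for _ in range(10)]
undisturbed = [random.random() for _ in range(3)]

# Global reseed: every later random.* call now follows the seed-123 stream.
random.seed(7)
random.seed(123)
disturbed = [random.random() for _ in range(3)]

print(undisturbed == expected)  # True
print(disturbed == expected)    # False
```

Any RNG call added after the reseed in a future edit would silently change behavior in the global-reseed version but not in the local one.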

Patch 11: Add a `requirements.txt`

At cachebench/requirements.txt:

# To run the harness
# (stdlib only)

# To re-run generators
sympy>=1.12
pyarrow>=15.0
pandas>=2.1

# Optional: to wire up verification methods per design/judge-minimization.md
# sqlglot>=23.0       # sql_canonical
# jsonschema>=4.0     # json_schema
# spacy>=3.7          # set_match (NER)
# sentence-transformers>=2.2  # conv tier-2 embedding sim
# bert-score>=0.3     # creative tier-2
# pint>=0.23          # math tier-2 units
# anthropic>=0.30     # llm_judge (cross-family)
# openai>=1.30        # llm_judge (cross-family)

Patch 12: Add a download / build script

cachebench/build.sh:

#!/usr/bin/env bash
set -euo pipefail
cd "$(dirname "$0")"
export PYTHONHASHSEED=0  # critical for qa_build and gen_multilingual until they're patched
export CACHEBENCH_DATASETS="${CACHEBENCH_DATASETS:-/home/bud/ditto/budCache/research/datasets}"

mkdir -p domains
python scripts/gen_math.py
python scripts/gen_code.py
python scripts/gen_conv.py
python scripts/gen_multiturn.py
python scripts/gen_tool.py
python scripts/gen_creative.py
python scripts/gen_personalized.py
python scripts/gen_multilingual.py
python scripts/qa_build.py
python scripts/assemble.py --strict

Appendix A: What the audit verified empirically

assemble.py --strict on current dataset:  PASSES (2000/2000, all label quotas OK)
re-running assemble.py:                   bit-for-bit identical (sha256 stable)
gen_creative.py rerun:                    bit-for-bit identical
gen_personalized.py rerun:                bit-for-bit identical
gen_math.py rerun:                        bit-for-bit identical
gen_code.py rerun:                        bit-for-bit identical
gen_conv.py rerun:                        bit-for-bit identical
gen_tool.py rerun:                        bit-for-bit identical
gen_multiturn.py rerun:                   bit-for-bit identical
gen_multilingual.py rerun:                DIFFERENT (hash() salt — Bug C1)
qa_build.py rerun:                        DIFFERENT (hash() salt — Bug C2)
gen_multilingual.py PYTHONHASHSEED=0 vs 1: DIFFERENT (confirms C1)
qa_build.py PYTHONHASHSEED=0 vs 1:        DIFFERENT (confirms C2)

eval_harness baselines:
  always_miss:  Precision 0.000  Recall 0.000  FHR 0.000  N=2000
  always_hit:   Precision 0.412  Recall 1.000  FHR 1.000  N=2000
  exact_match:  has 172 trivial FPs from personalized/multi_turn/conv/tool same-text MISSes

Empirical metric verification:
  wilson_ci(10, 100) = (0.05522854, 0.17436730)  ← matches Brown/Cai/DasGupta
  percentile matches numpy method="linear" within 1e-13 across p=0..1

Appendix B: Files audited

/home/bud/ditto/budCache/research/cachebench/scripts/eval_harness.py    (386 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/assemble.py        (166 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_math.py        (946 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_code.py        (1400+ lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_conv.py        (~600 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_multiturn.py   (~1500 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_tool.py        (~2000 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_creative.py    (~400 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_personalized.py (~500 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/gen_multilingual.py (~356 lines)
/home/bud/ditto/budCache/research/cachebench/scripts/qa_build.py        (~1100 lines)
/home/bud/ditto/budCache/research/cachebench/SCHEMA.md
/home/bud/ditto/budCache/research/cachebench/design/judge-minimization.md
/home/bud/ditto/budCache/research/cachebench/cachebench.jsonl           (2000 rows, 2.1 MB)
