feat: autoresearch harness for combfind init speed (SPRT-gated) by nymann · Pull Request #10 · The127/combfind

nymann · 2026-05-03T07:31:55Z

Status: draft. Not ready for merge — opening for visibility while we iterate.

What this is

A Karpathy-style autoresearch loop, gated by chess-engine SPRT, for autonomously optimizing combfind's init speed. The agent edits combfind/pipeline/; the harness (frozen) decides if the change is correct and ≥ Δ faster.

bench/
  fixture/                # 20-file synthetic Java tree (com.example.tasks)
  fixture.manifest        # sha256 per file; tampering rejected before scoring
  golden/                 # cold + incremental reference outputs (parse, index)
  _manifest.py            # fixture integrity check
  capture.py              # rebaseline golden from current master
  incremental_edit.py     # canonical edit applied to a tmp copy
  score.py                # one run, gate + timing, JSON to stdout on pass
  runner.py               # paired Wald SPRT, baseline vs patch
  program.md              # instructions for the autonomous agent
  README.md               # how to use it

See plans/autoresearch-harness.md for the design (Karpathy's three primitives + SPRT for the noisy continuous metric).

What's been validated

ACCEPT path: lazy stage imports (perf: lazy-import pipeline stage modules #8) accepted in 1 paired trial, 86% cold-init speedup.
REJECT path: lazy parser construction (within noise floor +0.3%) rejected in 8 pairs.
CORRECTNESS path: a "skip scip-java when no build descriptor" attempt was rejected with more refs than golden, which surfaced the underlying scip-java fallback bug now fixed in #7 (fix: tree-sitter fallback runs when scip-java fails at runtime). The harness genuinely catches output regressions before any timing is reported.
13 unit tests pin score.py's exit-code contract and the SPRT math.

Why this PR is draft

The methodology I used while building this was wrong:

Linear chain on one branch with reverts inline, instead of per-experiment branches merging into an integration branch on ACCEPT.
Stopped after ~40 minutes of an authorized 8-hour autonomous session.

Before merging, I want to:

Re-run the loop with the proper topology (<owner>/<topic> integration branch + <owner>/experiment/N branches; merge only ACCEPTed) to demonstrate the harness's actual workflow.
Fold the methodology guidance ("ask methodology questions up front for long sessions") into bench/program.md.
Decide on the plans/2026-05-03-night-session.md artifact — keep, drop, or rework as a generic runbook.

Dependencies

Stacked on (rebase once these land):

#6 (fix: handle Java method overloads in incremental reparse): the canonical edit on Math.java assumes overloads survive
#7 (fix: tree-sitter fallback runs when scip-java fails at runtime): the bench/golden index_refs values assume the fallback runs

#8 (lazy stage imports) and #9 (input_hash filter) are also included as part of the harness chain since they were discovered while building it; they can be reviewed standalone via their own PRs.

Test plan

pytest tests/unit/ — 143 passed (gleam/erlang excluded for missing tree-sitter-language-pack, unrelated to this work)
uv run python -m bench.score --mode=cold returns OK
uv run python -m bench.score --mode=incremental returns OK
uv run python -m bench.runner --mode=cold ACCEPT/REJECT smoke-tested
Re-run with proper branch topology to demonstrate the loop (pending)

_diff_symbols keyed `existing` on qualified_name alone, which collides for Java overloads like Foo.bar(int) and Foo.bar(String) (the walker omits arity from qualified_name). On reparse of an edited file, the dict collapsed both rows to one; the hash check then deleted the wrong overload's row, leaving an orphan and silently dropping a valid sibling. Key `existing` on (qualified_name, signature) instead. Fresh-parse already inserted overloads as distinct rows; this fix makes the reparse path match.
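A minimal sketch of the collision described above. The row tuples and variable names are hypothetical stand-ins, not combfind's actual schema or the real _diff_symbols code.

```python
# Hypothetical rows: (qualified_name, signature, content_hash).
rows = [
    ("Foo.bar", "(int)", "h1"),
    ("Foo.bar", "(String)", "h2"),
]

# Keyed on qualified_name alone: the second overload overwrites the first.
by_name = {qname: (sig, digest) for qname, sig, digest in rows}

# Keyed on (qualified_name, signature): both overloads keep their own entry.
by_name_and_sig = {(qname, sig): digest for qname, sig, digest in rows}

print(len(by_name), len(by_name_and_sig))
```

With the name-only key, any later "hash differs, delete the old row" step can only see one of the two overloads, which is how the wrong row ends up deleted.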

Hand-crafted ~20-file Java tree (com.example.tasks: model, queue, worker, store, util, exceptions). Covers: classes, interfaces, records, enums, nested classes, generics, javadoc, method overloads (Math, Format, ConcurrentQueue, MemoryStore). No third-party dependencies, no build tool — combfind's tree-sitter parser + tree-sitter-import fallback handle it directly. bench/_manifest.py records sha256 per file; verify() raises SystemExit with a tampered-files diff so score.py and runner.py can refuse to run against a doctored fixture. Smoke-tested: combfind parse+index on the fixture yields 126 symbols and 4 references in ~3s.
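The integrity check could look roughly like the sketch below. It assumes the manifest is a mapping from relative path to sha256 hex digest; the function names build_manifest and verify here are illustrative, and the real bench/_manifest.py may store and report things differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex sha256 of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (relative POSIX path) to its sha256."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> None:
    """Raise SystemExit naming every added, removed, or modified file."""
    current = build_manifest(root)
    tampered = sorted(
        name
        for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    )
    if tampered:
        raise SystemExit("fixture tampered: " + ", ".join(tampered))
```

Comparing the union of both key sets means a deleted or newly added file is reported as well as an edited one.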

bench/capture.py runs combfind's parse + index stages against the fixture in a temp DB and dumps deterministic outputs as TSV:

- bench/golden/parse_symbols.tsv (126 rows): file, qualified_name, signature, kind, content_hash. Sorted lexicographically.
- bench/golden/index_refs.tsv (4 rows): src_qname, dst_qname, kind. Sorted lexicographically. Inheritance refs only (tree-sitter import fallback; scip-java needs a build tool we don't ship).

Also runs an idempotency check: two consecutive captures must produce byte-identical files, otherwise the byte-equal gate downstream would be flaky. Pass.

The golden contains the multi-row overload structure (4x Math.add, 3x Format.hex, etc.), so reverting the parse.py overload fix makes the byte-equal check fail, covering the bench/score.py smoke test ahead.
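A small sketch of why sorting matters for the byte-equal gate. It assumes rows are tuples of strings; dump_tsv and is_idempotent are hypothetical helpers, not functions from bench/capture.py.

```python
def dump_tsv(rows):
    """Serialize rows as sorted, tab-separated, newline-terminated bytes."""
    lines = ["\t".join(row) for row in sorted(rows)]
    return ("\n".join(lines) + "\n").encode("utf-8")


def is_idempotent(capture):
    """Run a zero-argument capture callable twice and compare the bytes."""
    return capture() == capture()


# Input order does not affect the output once rows are sorted.
a = dump_tsv([("Math.java", "Math.add", "(int, int)"), ("App.java", "App.run", "()")])
b = dump_tsv([("App.java", "App.run", "()"), ("Math.java", "Math.add", "(int, int)")])
print(a == b)
```

Any source of nondeterminism that survives sorting (timestamps, temp paths, row ids) would still break the byte-equal comparison, which is what the two-capture check is there to detect.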

bench/incremental_edit.py defines the canonical edit (adds pause/resume to App.java); bench applies it only to a tmp copy so the pristine fixture is preserved.

bench/capture.py now seeds four golden files (parse_symbols and index_refs, each in cold and incremental variants) and runs an idempotency check that both modes are byte-stable across two captures.

bench/score.py runs combfind's parse+index against the fixture (cold: fresh DB; incremental: warmed DB + canonical edit) and gates on:

1. Fixture manifest matches (exit 2 if tampered)
2. parse_symbols byte-equal to golden (exit 3 on mismatch)
3. index_refs byte-equal to golden (exit 4 on mismatch)
4. Pipeline crash (exit 5)

On pass, prints JSON to stdout with elapsed_seconds and counts. On any fail, prints diagnostic to stderr; no timing reported.

bench/__init__.py silences combfind telemetry so stdout is pure JSON.

Smoke-tested: cold passes (1.9s, 126 rows), incremental passes (0.25s, 128 rows after edit), tampering detected (exit 2), parse mismatch detected (exit 3) with a sensible diff summary on stderr.
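The exit-code contract above can be sketched as a pure function. The exit codes come from the commit message; the function names, the argument shapes, and the parse_rows JSON key are assumptions for illustration, and exit 5 (pipeline crash) would be raised by the caller wrapping the pipeline run rather than by this gate.

```python
import json
import sys

EXIT_OK = 0
EXIT_TAMPERED = 2
EXIT_PARSE_MISMATCH = 3
EXIT_INDEX_MISMATCH = 4
EXIT_CRASH = 5  # returned by the caller if the pipeline itself raises


def gate(manifest_ok, parse_out, parse_golden, index_out, index_golden):
    """Return the exit code for one scoring run; checks run in order."""
    if not manifest_ok:
        return EXIT_TAMPERED
    if parse_out != parse_golden:
        return EXIT_PARSE_MISMATCH
    if index_out != index_golden:
        return EXIT_INDEX_MISMATCH
    return EXIT_OK


def report(code, elapsed_seconds, parse_rows):
    """Timing goes to stdout only on a pass; failures go to stderr."""
    if code == EXIT_OK:
        print(json.dumps({"elapsed_seconds": elapsed_seconds, "parse_rows": parse_rows}))
    else:
        print(f"gate failed with exit code {code}", file=sys.stderr)
```

Keeping timing out of the failure path means a caller that parses stdout can never mistake a fast but wrong run for an improvement.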

bench/runner.py drives paired baseline-vs-patch trials and decides via Wald's SPRT (normal mean, assumed variance). Refs are materialized as detached git worktrees in a tmp dir; bench/ is overridden from the current repo so the harness is held constant while pipeline code is the variable under test.

Flow:

1. Worktree both refs.
2. Run scorer once per side; abort with exit 2 if either fails the correctness gate (no timing reported).
3. Loop paired trials in randomized order; first --warmup-pairs are discarded (cold caches).
4. After each non-warmup pair, compute LR. Stop at upper bound (ACCEPT, exit 0) or lower bound (REJECT, exit 1). Hard cap INDETERMINATE → REJECT-equivalent (exit 1).

Defaults: alpha=beta=0.05, delta=0.05 (5% baseline mean), max-pairs=50, warmup-pairs=2.

Sigma assumption: 0.10*baseline_mean for the first 5 pairs, then running sample stdev floored at 0.05*baseline_mean.

Final verdict prints sample sizes, means, mean diff (absolute and %), and 95% CI half-width.
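The stopping rule can be sketched as below. This is a simplified version under stated assumptions: it tests H0 (mean diff = 0) against H1 (mean diff = delta), takes each diff as baseline seconds minus patch seconds, and keeps sigma fixed at 0.10*baseline_mean for the whole run instead of switching to the running stdev. The function name and signature are hypothetical, not runner.py's actual _sprt().

```python
import math


def sprt(diffs, baseline_mean, alpha=0.05, beta=0.05,
         delta_frac=0.05, sigma_frac=0.10):
    """Wald SPRT on paired diffs for a normal mean with assumed sigma."""
    upper = math.log((1 - beta) / alpha)   # cross it: ACCEPT
    lower = math.log(beta / (1 - alpha))   # cross it: REJECT
    delta = delta_frac * baseline_mean
    sigma = sigma_frac * baseline_mean
    if sigma <= 0:
        return "INDETERMINATE", 0          # avoid dividing by zero
    llr = 0.0
    for n, d in enumerate(diffs, start=1):
        # Log-likelihood ratio increment for N(delta, sigma^2) vs N(0, sigma^2).
        llr += (delta / sigma**2) * (d - delta / 2)
        if llr >= upper:
            return "ACCEPT", n
        if llr <= lower:
            return "REJECT", n
    return "INDETERMINATE", len(diffs)


print(sprt([1.72], baseline_mean=2.0))        # one very large improvement
print(sprt([0.0] * 50, baseline_mean=2.0))    # a no-op patch
```

A single diff far larger than delta can cross the upper bound on its own, while a no-op patch drifts toward the lower bound by a fixed amount per pair and needs many more pairs to stop.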

…regressions

The incremental scorer is the gate that catches the overload-on-reparse class of bug. Without an overload-bearing file in the canonical edit set, the bench would happily score a buggy _diff_symbols (keyed on qualified_name alone) as correct.

Now applying two edits to the tmp fixture copy:

1. App.java: add pause()/resume() (pure insertion path).
2. Math.java: tweak `int add(int, int)`'s docstring; exercises the "key already in existing, hash differs" branch in the presence of three sibling add() overloads.

Smoke-tested by reverting the overload fix locally: scorer exits 3 with a parse_symbols mismatch showing 2 rows missing (one of the unrelated overloads got erroneously deleted) and 1 row stale (the orphaned pre-edit add(int,int)). Restoring the fix passes again.

Documents the per-round loop (hypothesis → edit → commit → run SPRT → ACCEPT keep / REJECT git reset), editable scope (combfind/pipeline + db.py), forbidden scope (bench/, tests/), the meaning of each runner exit code, an enumerated list of known suspects from plans/incremental-reindex-investigation.md, and the hard rules. Pairs with bench/runner.py for the overnight loop the user spec'd: agent runs autonomously on a branch, every ACCEPT becomes a commit, human reviews the chain at merge time.

_stage_fn imported all seven stage modules unconditionally on every call, even when only one stage was being executed. embed and embed_concepts pull in sentence_transformers (PyTorch), which adds 1-2s of import time at the very top of any combfind invocation.

Switch to per-stage imports: only the stage module actually being returned gets imported. For full `combfind init` runs nothing changes (every stage runs eventually); for partial-stage runs (tooling, tests, the bench harness) the unused-stage imports are skipped.

The bench harness invokes parse+index only; on this fixture, score.py's wall time in a fresh Python process drops dramatically once the heavy stage modules are not imported.
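A sketch of the per-stage import pattern, assuming a simple stage-name-to-module-path table. The mapping below uses stdlib modules as stand-ins so the example runs anywhere; the real _stage_fn and combfind's module paths are not shown in this PR.

```python
import importlib

# Hypothetical mapping; stdlib modules stand in for combfind's stage modules.
_STAGE_MODULES = {
    "parse": "json",
    "index": "sqlite3",
    "embed": "decimal",
}


def stage_module(stage):
    """Import and return only the module backing the requested stage."""
    try:
        module_name = _STAGE_MODULES[stage]
    except KeyError:
        raise ValueError(f"unknown stage: {stage}") from None
    # import_module caches in sys.modules, so repeated calls stay cheap.
    return importlib.import_module(module_name)


print(stage_module("parse").__name__)
```

Because nothing is imported until a stage is requested, a run that only asks for parse and index never pays for the heavy embedding dependencies.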

JavaIndexer.run claimed to fall back to tree-sitter when scip-java wasn't available, but the existing branch only triggered when shutil.which("scip-java") returned None. When scip-java was on PATH but failed at runtime (e.g., no Maven/Gradle/sbt/mill descriptor), _run_scip silently returned 0 and the tree-sitter else-branch never ran (see plans/scip-java-fallback-bug.md).

Predict scip-java's runtime requirement cheaply: only invoke it when the repo root has one of the build descriptors it understands. If either the binary is missing OR no descriptor is present, fall through to the tree-sitter import extractor.

Side benefits:

- Saves ~400ms per init for Java repos without a build tool (the failing scip-java subprocess is no longer invoked).
- Closes the long-standing tests/unit/test_index_java.py::test_import_reference failure.

Re-baselined bench/golden/index_refs*.tsv to reflect the now-emitted 17 import refs that the broken fallback was silently dropping; the fixture has zero build descriptors so this is the path the bench exercises.
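The cheap pre-check might look like the sketch below. The descriptor file names are the conventional ones for Maven, Gradle, sbt, and mill and are an assumption here; the exact list the fix uses, and the helper names, are not given in the commit message.

```python
import shutil
from pathlib import Path

# Assumed descriptor names for Maven, Gradle, sbt, and mill.
BUILD_DESCRIPTORS = (
    "pom.xml",
    "build.gradle",
    "build.gradle.kts",
    "build.sbt",
    "build.sc",
)


def has_build_descriptor(repo_root: Path) -> bool:
    """True if the repo root contains any known build descriptor file."""
    return any((repo_root / name).is_file() for name in BUILD_DESCRIPTORS)


def should_run_scip_java(repo_root: Path, binary: str = "scip-java") -> bool:
    """Invoke scip-java only when the binary exists AND a descriptor is present."""
    if shutil.which(binary) is None:
        return False
    return has_build_descriptor(repo_root)
```

When should_run_scip_java returns False for either reason, the caller falls through to the tree-sitter import extractor instead of starting a subprocess that is known to fail.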

_input_hash hashed the full **params dict, including operational keys that don't change pipeline output: repo_path, llm_model, llm_workers, docgen flag (already excluded as kwarg). The result: changing parallelism or moving the DB to a different cwd invalidated every stage's cache, forcing a full re-run on data that was identical. Filter to the keys that actually affect output: exclude_paths and exclude_regex. Adding new output-affecting params (e.g., a future walker config flag) is a one-line allowlist update. Tracked in plans/incremental-reindex-investigation.md as suspect #1. The bench harness can't measure this directly (each trial uses a fresh DB), but it's a correctness fix that means re-running combfind init with --llm-workers tweaked no longer triggers a full re-parse.
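A minimal sketch of the allowlist approach, assuming the hash is a sha256 over a canonical JSON dump of the selected params. The two key names come from the commit message; the function body is illustrative and not the actual _input_hash.

```python
import hashlib
import json

# Only these params are assumed to change pipeline output.
OUTPUT_AFFECTING_KEYS = ("exclude_paths", "exclude_regex")


def input_hash(**params):
    """Hash only the allowlisted params; operational keys are ignored."""
    relevant = {k: params[k] for k in OUTPUT_AFFECTING_KEYS if k in params}
    payload = json.dumps(relevant, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


same = input_hash(exclude_paths=["build/"], llm_workers=4) == input_hash(
    exclude_paths=["build/"], llm_workers=8, repo_path="/elsewhere"
)
print(same)
```

An allowlist fails safe in one direction only: a new output-affecting param that is not added to the tuple would leave stale cache entries valid, so the list needs updating whenever such a param is introduced.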

…RT math

Two new test modules pin the bench harness's behavior so future tweaks don't silently break the agent loop:

- test_bench_score.py invokes score.py as a subprocess against the real fixture and verifies each exit code path: 0 (clean), 2 (tampered fixture, restored after), 3 (parse golden mismatch), 4 (index golden mismatch). Also verifies stdout JSON shape on pass and absence of JSON on fail.
- test_bench_runner.py exercises the _sprt() function directly with scripted diff arrays: strong improvement accepts past the upper bound, strong regression rejects past the lower, no-op rejects given enough samples, sigma floor prevents division blowup with zero-variance inputs, baseline_mean=0 doesn't raise.

Runner now prints a final one-liner formatted for inclusion in the agent's commit body:

SPRT-verdict: ACCEPT mode=cold pairs=8 mean_diff=+12.34% ci=±2.10%

program.md instructs the agent to git commit --amend --no-edit with this line appended so the morning reviewer can pull a verdict summary across an overnight run with:

git log --grep="SPRT-verdict:" --pretty="%h %s%n%b" master..HEAD
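Producing that one-liner could be as simple as the sketch below. The field layout follows the example line in the commit message; the function name is hypothetical, and rendering a missing CI as "ci=n/a" is an assumption based on the later "runner reports n/a for CI" commit title.

```python
def format_verdict(verdict, mode, pairs, mean_diff_pct, ci_half_width_pct=None):
    """Render the grep-friendly SPRT-verdict line for a commit body."""
    if ci_half_width_pct is None:
        ci = "n/a"  # assumed rendering when only one pair was counted
    else:
        ci = f"±{ci_half_width_pct:.2f}%"
    return (
        f"SPRT-verdict: {verdict} mode={mode} pairs={pairs} "
        f"mean_diff={mean_diff_pct:+.2f}% ci={ci}"
    )


print(format_verdict("ACCEPT", "cold", 8, 12.34, 2.10))
```

The fixed "SPRT-verdict:" prefix is what makes the git log --grep query work across a long chain of commits.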


nymann added 15 commits May 3, 2026 09:34

fix(bench): runner reports n/a for CI when only one paired trial counted

916705e

docs(bench): README explaining structure, usage, and what's been proven

079c7de

docs(bench): remove references to plans/ (kept as local notes, not in…

25cc7d2

… repo)

nymann force-pushed the bench/autoresearch-harness branch from ae4df88 to 25cc7d2 Compare May 3, 2026 07:35

nymann wants to merge 15 commits into master from bench/autoresearch-harness
