feat: autoresearch harness for combfind init speed (SPRT-gated) #10

Draft
nymann wants to merge 15 commits into master from bench/autoresearch-harness

@nymann nymann commented May 3, 2026

Status: draft. Not ready for merge — opening for visibility while we iterate.

What this is

A Karpathy-style autoresearch loop, gated by chess-engine SPRT, for autonomously optimizing combfind's init speed. The agent edits combfind/pipeline/; the harness (frozen) decides if the change is correct and ≥ Δ faster.

bench/
  fixture/                # 20-file synthetic Java tree (com.example.tasks)
  fixture.manifest        # sha256 per file; tampering rejected before scoring
  golden/                 # cold + incremental reference outputs (parse, index)
  _manifest.py            # fixture integrity check
  capture.py              # rebaseline golden from current master
  incremental_edit.py     # canonical edit applied to a tmp copy
  score.py                # one run, gate + timing, JSON to stdout on pass
  runner.py               # paired Wald SPRT, baseline vs patch
  program.md              # instructions for the autonomous agent
  README.md               # how to use it

See plans/autoresearch-harness.md for the design (Karpathy's three primitives + SPRT for the noisy continuous metric).

What's been validated

  • ACCEPT path: lazy stage imports (perf: lazy-import pipeline stage modules #8) accepted in 1 paired trial, 86% cold-init speedup.
  • REJECT path: lazy parser construction (within noise floor +0.3%) rejected in 8 pairs.
  • CORRECTNESS path: a "skip scip-java when no build descriptor" attempt was rejected with more refs than golden — surfaced the underlying scip-java fallback bug now fixed in fix: tree-sitter fallback runs when scip-java fails at runtime #7. The harness genuinely catches output regressions before any timing is reported.
  • 13 unit tests pin score.py's exit-code contract and the SPRT math.

Why this PR is draft

The methodology I used while building this was wrong:

  • Linear chain on one branch with reverts inline, instead of per-experiment branches merging into an integration branch on ACCEPT.
  • Stopped after ~40 minutes of an authorized 8-hour autonomous session.

Before merging, I want to:

  • Re-run the loop with the proper topology (<owner>/<topic> integration branch + <owner>/experiment/N branches; merge only ACCEPTed) to demonstrate the harness's actual workflow.
  • Fold the methodology guidance ("ask methodology questions up front for long sessions") into bench/program.md.
  • Decide on the plans/2026-05-03-night-session.md artifact — keep, drop, or rework as a generic runbook.

Dependencies

Stacked on (rebase once these land):

#8 (lazy stage imports) and #9 (input_hash filter) are also included as part of the harness chain since they were discovered while building it; they can be reviewed standalone via their own PRs.

Test plan

  • pytest tests/unit/ — 143 passed (gleam/erlang excluded for missing tree-sitter-language-pack, unrelated to this work)
  • uv run python -m bench.score --mode=cold returns OK
  • uv run python -m bench.score --mode=incremental returns OK
  • uv run python -m bench.runner --mode=cold ACCEPT/REJECT smoke-tested
  • Re-run with proper branch topology to demonstrate the loop (pending)

nymann added 15 commits May 3, 2026 09:34
_diff_symbols keyed `existing` on qualified_name alone, which collides
for Java overloads like Foo.bar(int) and Foo.bar(String) (the walker
omits arity from qualified_name). On reparse of an edited file the dict
collapsed both rows to one, the hash check then deleted the wrong
overload's row, leaving an orphan and silently dropping a valid sibling.

Key existing on (qualified_name, signature) instead. Fresh-parse already
inserted overloads as distinct rows; this fix makes the reparse path
match.
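
The collision can be reproduced in a few lines. This is a minimal
sketch, not the actual _diff_symbols code: SymbolRow and both helper
functions are hypothetical names invented for illustration.

```python
from typing import NamedTuple


class SymbolRow(NamedTuple):
    # Hypothetical row shape; the real table has more columns.
    qualified_name: str
    signature: str
    content_hash: str


def index_by_name(rows):
    """Old keying: overloads share a qualified_name, so later rows overwrite earlier ones."""
    return {row.qualified_name: row for row in rows}


def index_by_name_and_signature(rows):
    """New keying: the signature disambiguates overloads."""
    return {(row.qualified_name, row.signature): row for row in rows}


rows = [
    SymbolRow("Foo.bar", "(int)", "h1"),
    SymbolRow("Foo.bar", "(String)", "h2"),
]

print(len(index_by_name(rows)))                # 1: both overloads collapsed into one entry
print(len(index_by_name_and_signature(rows)))  # 2: one entry per overload
```

With the composite key, the hash comparison on reparse always looks at
the row for the same overload.
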
Hand-crafted ~20-file Java tree (com.example.tasks: model, queue,
worker, store, util, exceptions). Covers: classes, interfaces, records,
enums, nested classes, generics, javadoc, method overloads (Math,
Format, ConcurrentQueue, MemoryStore). No third-party dependencies, no
build tool — combfind's tree-sitter parser + tree-sitter-import
fallback handle it directly.

bench/_manifest.py records sha256 per file; verify() raises SystemExit
with a tampered-files diff so score.py and runner.py can refuse to run
against a doctored fixture.
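
A rough sketch of how such a check could look, assuming the manifest
is a mapping of relative path to sha256 hex digest. build_manifest and
this verify() signature are illustrative assumptions, not the actual
bench/_manifest.py API.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its sha256 hex digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> None:
    """Raise SystemExit naming every added, removed, or modified file."""
    current = build_manifest(root)
    tampered = sorted(
        name
        for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    )
    if tampered:
        raise SystemExit("fixture tampered: " + ", ".join(tampered))


# Demo on a throwaway directory standing in for bench/fixture/.
with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp)
    (fixture / "App.java").write_text("class App {}")
    recorded = build_manifest(fixture)
    verify(fixture, recorded)  # pristine: returns silently
    (fixture / "App.java").write_text("class App { int x; }")
    try:
        verify(fixture, recorded)
    except SystemExit as exc:
        print(exc)
```
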

Smoke-tested: combfind parse+index on the fixture yields 126 symbols
and 4 references in ~3s.
bench/capture.py runs combfind's parse + index stages against the
fixture in a temp DB and dumps deterministic outputs as TSV:

- bench/golden/parse_symbols.tsv (126 rows): file, qualified_name,
  signature, kind, content_hash. Sorted lexicographically.
- bench/golden/index_refs.tsv (4 rows): src_qname, dst_qname, kind.
  Sorted lexicographically. Inheritance refs only (tree-sitter import
  fallback; scip-java needs a build tool we don't ship).

Also runs an idempotency check: two consecutive captures must produce
byte-identical files, otherwise the byte-equal gate downstream would be
flaky. Pass.

The golden contains the multi-row overload structure (4x Math.add,
3x Format.hex, etc.), so reverting the parse.py overload fix makes the
byte-equal check fail — covering the bench/score.py smoke test ahead.
bench/incremental_edit.py defines the canonical edit (adds pause/resume
to App.java); bench applies it only to a tmp copy so the pristine
fixture is preserved.

bench/capture.py now seeds four golden files (parse_symbols and
index_refs, each in cold and incremental variants) and runs an
idempotency check that both modes are byte-stable across two captures.

bench/score.py runs combfind's parse+index against the fixture (cold:
fresh DB; incremental: warmed DB + canonical edit) and gates on:
  1. Fixture manifest matches (exit 2 if tampered)
  2. parse_symbols byte-equal to golden (exit 3 on mismatch)
  3. index_refs byte-equal to golden (exit 4 on mismatch)
  4. Pipeline completes without crashing (exit 5 on crash)
On pass, prints JSON to stdout with elapsed_seconds and counts.
On any fail, prints diagnostic to stderr; no timing reported.
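
The exit-code contract can be sketched as follows. This is an
illustration only: score(), its arguments, and the JSON keys other
than elapsed_seconds are assumptions, and the order in which the real
score.py evaluates its checks may differ.

```python
import json
import time

EXIT_OK = 0
EXIT_TAMPERED = 2
EXIT_PARSE_MISMATCH = 3
EXIT_INDEX_MISMATCH = 4
EXIT_CRASH = 5


def score(manifest_ok, run_pipeline, parse_golden, index_golden):
    """Return (exit_code, payload); payload is None unless every gate passes.

    run_pipeline is a hypothetical callable returning (parse_bytes, index_bytes).
    """
    if not manifest_ok:
        return EXIT_TAMPERED, None
    start = time.perf_counter()
    try:
        parse_out, index_out = run_pipeline()
    except Exception:
        return EXIT_CRASH, None
    elapsed = time.perf_counter() - start
    if parse_out != parse_golden:
        return EXIT_PARSE_MISMATCH, None
    if index_out != index_golden:
        return EXIT_INDEX_MISMATCH, None
    # Timing is only reported once the output is known to be correct.
    payload = json.dumps({
        "elapsed_seconds": elapsed,
        "parse_rows": len(parse_out.splitlines()),  # hypothetical key
        "index_rows": len(index_out.splitlines()),  # hypothetical key
    })
    return EXIT_OK, payload


code, payload = score(True, lambda: (b"a\nb\n", b"r\n"), b"a\nb\n", b"r\n")
print(code, payload)
```

Returning no payload on every failing path mirrors the rule that a
failed gate never reports a timing.
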

bench/__init__.py silences combfind telemetry so stdout is pure JSON.

Smoke-tested: cold passes (1.9s, 126 rows), incremental passes (0.25s,
128 rows after edit), tampering detected (exit 2), parse mismatch
detected (exit 3) with a sensible diff summary on stderr.
bench/runner.py drives paired baseline-vs-patch trials and decides via
Wald's SPRT (normal mean, assumed variance). Refs are materialized as
detached git worktrees in a tmp dir; bench/ is overridden from the
current repo so the harness is held constant while pipeline code is
the variable under test.

Flow:
  1. Worktree both refs.
  2. Run scorer once per side; abort with exit 2 if either fails the
     correctness gate (no timing reported).
  3. Loop paired trials in randomized order; first --warmup-pairs are
     discarded (cold caches).
  4. After each non-warmup pair, compute the likelihood ratio (LR).
     Stop at the upper bound (ACCEPT, exit 0) or the lower bound
     (REJECT, exit 1). Hitting the max-pairs hard cap yields
     INDETERMINATE, which is treated as REJECT (exit 1).

Defaults: alpha=beta=0.05, delta=0.05 (5% of baseline mean),
max-pairs=50, warmup-pairs=2. Sigma assumption: 0.10*baseline_mean for
the first 5 pairs, then the running sample stdev floored at
0.05*baseline_mean.

Final verdict prints sample sizes, means, mean diff (absolute and %),
and 95% CI half-width.
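
For readers unfamiliar with Wald's test, here is a simplified sketch
of the decision rule. It is not the runner's _sprt(): it assumes each
diff is baseline minus patch in seconds, tests H0 "mean saving = 0"
against H1 "mean saving = delta * baseline_mean", and keeps sigma
fixed at sigma_frac * baseline_mean instead of switching to a floored
running stdev. It also assumes baseline_mean > 0.

```python
import math


def sprt(diffs, baseline_mean, delta=0.05, alpha=0.05, beta=0.05, sigma_frac=0.10):
    """Wald SPRT for a normal mean with assumed (fixed) sigma.

    Returns (verdict, pairs_used).
    """
    upper = math.log((1 - beta) / alpha)  # crossing it means ACCEPT
    lower = math.log(beta / (1 - alpha))  # crossing it means REJECT
    theta1 = delta * baseline_mean        # effect size under H1, in seconds
    sigma = sigma_frac * baseline_mean
    llr = 0.0
    for n, d in enumerate(diffs, start=1):
        # Per-sample log-likelihood ratio of N(theta1, sigma^2) vs N(0, sigma^2).
        llr += (theta1 / sigma**2) * (d - theta1 / 2.0)
        if llr >= upper:
            return "ACCEPT", n
        if llr <= lower:
            return "REJECT", n
    return "INDETERMINATE", len(diffs)


# One large saving is enough to cross the upper bound.
print(sprt([1.7], baseline_mean=2.0))
# A true no-op drifts down to the lower bound before the 50-pair cap.
print(sprt([0.0] * 50, baseline_mean=2.0))
```

Under these assumptions a single pair with a large saving accepts
immediately, while a patch with no effect needs a couple of dozen
pairs before the evidence against H1 is strong enough to reject.
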
…regressions

The incremental scorer is the gate that catches the
overload-on-reparse class of bug. Without an overload-bearing file in
the canonical edit set, the bench would happily score a buggy
_diff_symbols (keyed on qualified_name alone) as correct.

Now applying two edits to the tmp fixture copy:
  1. App.java: add pause()/resume() — pure insertion path.
  2. Math.java: tweak `int add(int, int)`'s docstring — exercises the
     "key already in existing, hash differs" branch in the presence of
     three sibling add() overloads.

Smoke-tested by reverting the overload fix locally: scorer exits 3
with a parse_symbols mismatch showing 2 rows missing (one of the
unrelated overloads got erroneously deleted) and 1 row stale
(the orphaned pre-edit add(int,int)). Restoring the fix passes again.
Documents the per-round loop (hypothesis → edit → commit → run SPRT →
ACCEPT keep / REJECT git reset), editable scope (combfind/pipeline +
db.py), forbidden scope (bench/, tests/), the meaning of each runner
exit code, an enumerated list of known suspects from
plans/incremental-reindex-investigation.md, and the hard rules.

Pairs with bench/runner.py for the overnight loop the user specified:
agent runs autonomously on a branch, every ACCEPT becomes a commit,
human reviews the chain at merge time.
_stage_fn imported all seven stage modules unconditionally on every
call, even when only one stage was being executed. embed and
embed_concepts pull in sentence_transformers (PyTorch), which adds
1-2s of import time at the very top of any combfind invocation.

Switch to per-stage imports: only the stage module actually being
returned gets imported. For full `combfind init` runs nothing changes
(every stage runs eventually); for partial-stage runs (tooling,
tests, the bench harness) the unused-stage imports are skipped.
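
The pattern is roughly the following. This is a sketch, not the
actual _stage_fn: the registry and stage_fn name are hypothetical, and
stdlib modules stand in for combfind's stage modules so the example
runs anywhere.

```python
import importlib

# Hypothetical registry: stage name -> (module path, callable name).
# In combfind these would point at the pipeline stage modules.
_STAGES = {
    "parse": ("json", "loads"),
    "index": ("bisect", "bisect_left"),
    "embed": ("statistics", "mean"),
}


def stage_fn(stage: str):
    """Import only the module backing the requested stage, then return its callable."""
    try:
        module_name, attr = _STAGES[stage]
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None
    module = importlib.import_module(module_name)  # deferred until requested
    return getattr(module, attr)


parse = stage_fn("parse")  # the "embed" stand-in is never touched by this call
print(parse('{"symbols": 126}'))
```

Because importlib.import_module caches in sys.modules, repeated calls
for the same stage pay the import cost only once.
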

The bench harness invokes parse+index only; on this fixture, the wall
time of score.py in a fresh Python process drops dramatically once the
heavy stage modules are no longer imported.
JavaIndexer.run claimed to fall back to tree-sitter when scip-java
wasn't available, but the existing branch only triggered when
shutil.which("scip-java") returned None. When scip-java was on PATH
but failed at runtime (e.g., no Maven/Gradle/sbt/mill descriptor),
_run_scip silently returned 0 and the tree-sitter else-branch never
ran — see plans/scip-java-fallback-bug.md.

Predict scip-java's runtime requirement cheaply: only invoke it when
the repo root has one of the build descriptors it understands. If
either the binary is missing OR no descriptor is present, fall
through to the tree-sitter import extractor.
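
In sketch form, the decision looks something like this. The function
names are hypothetical, and the descriptor filenames are assumed from
the usual Maven/Gradle/sbt/mill conventions rather than copied from
JavaIndexer.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed filenames for the Maven, Gradle, sbt, and mill descriptors.
BUILD_DESCRIPTORS = ("pom.xml", "build.gradle", "build.gradle.kts", "build.sbt", "build.sc")


def should_run_scip_java(repo_root: Path, which=shutil.which) -> bool:
    """True only if the binary is on PATH AND the repo root has a build descriptor."""
    if which("scip-java") is None:
        return False
    return any((repo_root / name).is_file() for name in BUILD_DESCRIPTORS)


def pick_indexer(repo_root: Path, which=shutil.which) -> str:
    """Name the extractor that would handle this repo."""
    return "scip-java" if should_run_scip_java(repo_root, which) else "tree-sitter"


# Demo with a stubbed PATH lookup so the result does not depend on the machine.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    on_path = lambda name: "/usr/local/bin/" + name
    print(pick_indexer(root, which=on_path))  # no descriptor yet
    (root / "pom.xml").write_text("<project/>")
    print(pick_indexer(root, which=on_path))  # descriptor present
```

The which parameter is injectable purely so the check can be tested
without scip-java installed.
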

Side benefits:
  - Saves ~400ms per init for Java repos without a build tool
    (the failing scip-java subprocess is no longer invoked).
  - Closes the long-standing
    tests/unit/test_index_java.py::test_import_reference failure.

Re-baselined bench/golden/index_refs*.tsv to reflect the now-emitted
17 import refs that the broken fallback was silently dropping; the
fixture has zero build descriptors so this is the path the bench
exercises.
_input_hash hashed the full **params dict, including operational keys
that don't change pipeline output: repo_path, llm_model, and
llm_workers (the docgen flag was already excluded as a kwarg). As a
result, changing parallelism or moving the DB to a different cwd
invalidated every stage's cache, forcing a full re-run on data that
was identical.

Filter to the keys that actually affect output: exclude_paths and
exclude_regex. Adding new output-affecting params (e.g., a future
walker config flag) is a one-line allowlist update.
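
A minimal sketch of the allowlist approach, assuming a JSON-based
canonical encoding and sha256; the real _input_hash may serialize and
hash differently, and OUTPUT_AFFECTING_KEYS is a hypothetical name.

```python
import hashlib
import json

# Only these params are allowed to influence the cache key.
OUTPUT_AFFECTING_KEYS = ("exclude_paths", "exclude_regex")


def input_hash(**params) -> str:
    """Hash the allowlisted params only, in a key-order-independent way."""
    relevant = {k: params[k] for k in OUTPUT_AFFECTING_KEYS if k in params}
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = input_hash(exclude_paths=["build/"], llm_workers=4, repo_path="/tmp/a")
b = input_hash(exclude_paths=["build/"], llm_workers=16, repo_path="/srv/b")
c = input_hash(exclude_paths=["target/"], llm_workers=4, repo_path="/tmp/a")
print(a == b)  # True: operational keys are ignored
print(a == c)  # False: an output-affecting key changed
```

An allowlist fails safe in one direction only: a new output-affecting
param that is not added to the tuple would be silently ignored by the
cache key, so the tuple has to be kept in sync with the pipeline.
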

Tracked in plans/incremental-reindex-investigation.md as suspect #1.
The bench harness can't measure this directly (each trial uses a
fresh DB), but it's a correctness fix that means re-running combfind
init with --llm-workers tweaked no longer triggers a full re-parse.
…RT math

Two new test modules pin the bench harness's behavior so future
tweaks don't silently break the agent loop:

- test_bench_score.py invokes score.py as a subprocess against the
  real fixture and verifies each exit code path: 0 (clean), 2
  (tampered fixture, restored after), 3 (parse golden mismatch), 4
  (index golden mismatch). Also verifies stdout JSON shape on pass
  and absence of JSON on fail.

- test_bench_runner.py exercises the _sprt() function directly with
  scripted diff arrays: strong improvement accepts past the upper
  bound, strong regression rejects past the lower, no-op rejects
  given enough samples, sigma floor prevents division blowup with
  zero-variance inputs, baseline_mean=0 doesn't raise.
Runner now prints a final one-liner formatted for inclusion in the
agent's commit body:

    SPRT-verdict: ACCEPT mode=cold pairs=8 mean_diff=+12.34% ci=±2.10%

program.md instructs the agent to git commit --amend --no-edit with
this line appended so the morning reviewer can pull a verdict
summary across an overnight run with:

    git log --grep="SPRT-verdict:" --pretty="%h %s%n%b" master..HEAD
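
For post-processing the git log output, a small parser along these
lines could turn each verdict line into a record. The regex is
inferred from the single example line above, so treat the field
grammar (and the parse_verdict name) as an assumption.

```python
import re

VERDICT_RE = re.compile(
    r"SPRT-verdict: (?P<verdict>\w+) mode=(?P<mode>\w+) pairs=(?P<pairs>\d+) "
    r"mean_diff=(?P<mean_diff>[+-]?\d+(?:\.\d+)?)% ci=±(?P<ci>\d+(?:\.\d+)?)%"
)


def parse_verdict(line: str):
    """Return a dict for a verdict line, or None if the line does not match."""
    match = VERDICT_RE.search(line)
    if match is None:
        return None
    return {
        "verdict": match["verdict"],
        "mode": match["mode"],
        "pairs": int(match["pairs"]),
        "mean_diff_pct": float(match["mean_diff"]),
        "ci_pct": float(match["ci"]),
    }


print(parse_verdict("SPRT-verdict: ACCEPT mode=cold pairs=8 mean_diff=+12.34% ci=±2.10%"))
```
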
@nymann nymann force-pushed the bench/autoresearch-harness branch from ae4df88 to 25cc7d2 Compare May 3, 2026 07:35