feat: autoresearch harness for combfind init speed (SPRT-gated) #10
Draft
nymann wants to merge 15 commits into
_diff_symbols keyed `existing` on qualified_name alone, which collides for Java overloads like Foo.bar(int) and Foo.bar(String) (the walker omits arity from qualified_name). On reparse of an edited file the dict collapsed both rows to one, the hash check then deleted the wrong overload's row, leaving an orphan and silently dropping a valid sibling. Key existing on (qualified_name, signature) instead. Fresh-parse already inserted overloads as distinct rows; this fix makes the reparse path match.
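A minimal sketch of the collision described above, not the actual _diff_symbols code. The SymbolRow shape and both helper names are hypothetical; only the choice of key (qualified_name alone versus the (qualified_name, signature) pair) comes from the commit message.

```python
from typing import NamedTuple


class SymbolRow(NamedTuple):
    # Hypothetical row shape; the real symbols table has more columns.
    qualified_name: str
    signature: str
    content_hash: str


def index_by_name(rows):
    """Pre-fix keying: overloads share a key, so later rows overwrite earlier ones."""
    return {row.qualified_name: row for row in rows}


def index_by_name_and_signature(rows):
    """Post-fix keying: each overload keeps its own entry."""
    return {(row.qualified_name, row.signature): row for row in rows}


rows = [
    SymbolRow("Foo.bar", "(int)", "hash-a"),
    SymbolRow("Foo.bar", "(String)", "hash-b"),
]

collapsed = index_by_name(rows)
distinct = index_by_name_and_signature(rows)
print(len(collapsed), len(distinct))  # 1 2
```

With the single-field key, the Foo.bar(int) row is no longer reachable from the dict, which is the precondition for the wrong-row deletion on reparse.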
Hand-crafted ~20-file Java tree (com.example.tasks: model, queue, worker, store, util, exceptions). Covers: classes, interfaces, records, enums, nested classes, generics, javadoc, method overloads (Math, Format, ConcurrentQueue, MemoryStore). No third-party dependencies, no build tool — combfind's tree-sitter parser + tree-sitter-import fallback handle it directly. bench/_manifest.py records sha256 per file; verify() raises SystemExit with a tampered-files diff so score.py and runner.py can refuse to run against a doctored fixture. Smoke-tested: combfind parse+index on the fixture yields 126 symbols and 4 references in ~3s.
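For illustration, a manifest check along these lines could look like the sketch below. It is an assumption about the shape of bench/_manifest.py, not a copy of it: build_manifest and tampered_files are hypothetical names, and only the sha256-per-file record and the SystemExit on mismatch come from the description above.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path) -> dict:
    """Record one sha256 per .java file, keyed by POSIX-style relative path."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*.java"))
    }


def tampered_files(root: Path, manifest: dict) -> list:
    """Return relative paths that are missing or whose hash differs."""
    bad = []
    for rel, expected in sorted(manifest.items()):
        path = root / rel
        if not path.is_file() or sha256_of(path) != expected:
            bad.append(rel)
    return bad


def verify(root: Path, manifest: dict) -> None:
    """Refuse to continue against a doctored fixture."""
    bad = tampered_files(root, manifest)
    if bad:
        raise SystemExit("fixture tampered: " + ", ".join(bad))


# Tiny demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Math.java").write_text("class Math {}\n")
    manifest = build_manifest(root)
    clean = tampered_files(root, manifest)
    (root / "Math.java").write_text("class Math { int x; }\n")
    dirty = tampered_files(root, manifest)
```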
bench/capture.py runs combfind's parse + index stages against the fixture in a temp DB and dumps deterministic outputs as TSV:
- bench/golden/parse_symbols.tsv (126 rows): file, qualified_name, signature, kind, content_hash. Sorted lexicographically.
- bench/golden/index_refs.tsv (4 rows): src_qname, dst_qname, kind. Sorted lexicographically. Inheritance refs only (tree-sitter import fallback; scip-java needs a build tool we don't ship).
Also runs an idempotency check: two consecutive captures must produce byte-identical files, otherwise the byte-equal gate downstream would be flaky. Pass.
The golden contains the multi-row overload structure (4x Math.add, 3x Format.hex, etc.), so reverting the parse.py overload fix makes the byte-equal check fail, covering the bench/score.py smoke test ahead.
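The byte-stability requirement can be sketched as follows. rows_to_tsv and is_idempotent are hypothetical helpers, not functions from bench/capture.py; the point is only that sorting before serializing makes the output independent of query order.

```python
def rows_to_tsv(rows):
    """Serialize rows as sorted, tab-separated, newline-terminated UTF-8 bytes."""
    lines = ["\t".join(row) for row in sorted(rows)]
    return ("\n".join(lines) + "\n").encode("utf-8")


def is_idempotent(capture):
    """Two consecutive captures must be byte-identical."""
    return capture() == capture()


rows_a = [
    ("Math.java", "Math.add", "(int, int)", "method", "h1"),
    ("App.java", "App.run", "()", "method", "h2"),
]
rows_b = list(reversed(rows_a))  # same rows, different arrival order

same_bytes = rows_to_tsv(rows_a) == rows_to_tsv(rows_b)
```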
bench/incremental_edit.py defines the canonical edit (adds pause/resume to App.java); bench applies it only to a tmp copy so the pristine fixture is preserved.
bench/capture.py now seeds four golden files (parse_symbols and index_refs, each in cold and incremental variants) and runs an idempotency check that both modes are byte-stable across two captures.
bench/score.py runs combfind's parse+index against the fixture (cold: fresh DB; incremental: warmed DB + canonical edit) and gates on:
1. Fixture manifest matches (exit 2 if tampered)
2. parse_symbols byte-equal to golden (exit 3 on mismatch)
3. index_refs byte-equal to golden (exit 4 on mismatch)
4. Pipeline completes without crashing (exit 5 on crash)
On pass, prints JSON to stdout with elapsed_seconds and counts. On any fail, prints diagnostic to stderr; no timing reported. bench/__init__.py silences combfind telemetry so stdout is pure JSON.
Smoke-tested: cold passes (1.9s, 126 rows), incremental passes (0.25s, 128 rows after edit), tampering detected (exit 2), parse mismatch detected (exit 3) with a sensible diff summary on stderr.
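The gate order and exit codes above can be sketched as a single function. This is a simplified stand-in for bench/score.py: the score signature and the run_pipeline callable are hypothetical, and only the exit-code mapping and the "no timing on failure" rule are taken from the description.

```python
import json
import time

EXIT_OK = 0
EXIT_TAMPERED = 2
EXIT_PARSE_MISMATCH = 3
EXIT_INDEX_MISMATCH = 4
EXIT_CRASH = 5


def score(manifest_ok, run_pipeline, golden_parse, golden_index):
    """Return (exit_code, json_or_None); timing is reported only on a full pass."""
    if not manifest_ok:
        return EXIT_TAMPERED, None
    start = time.perf_counter()
    try:
        # run_pipeline is assumed to return (parse_tsv_bytes, index_tsv_bytes).
        parse_tsv, index_tsv = run_pipeline()
    except Exception:
        return EXIT_CRASH, None
    elapsed = time.perf_counter() - start
    if parse_tsv != golden_parse:
        return EXIT_PARSE_MISMATCH, None
    if index_tsv != golden_index:
        return EXIT_INDEX_MISMATCH, None
    return EXIT_OK, json.dumps({"elapsed_seconds": elapsed})
```

Checking correctness before emitting any timing keeps a fast-but-wrong patch from ever producing a number the runner could compare.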
bench/runner.py drives paired baseline-vs-patch trials and decides via
Wald's SPRT (normal mean, assumed variance). Refs are materialized as
detached git worktrees in a tmp dir; bench/ is overridden from the
current repo so the harness is held constant while pipeline code is
the variable under test.
Flow:
1. Worktree both refs.
2. Run scorer once per side; abort with exit 2 if either fails the
correctness gate (no timing reported).
3. Loop paired trials in randomized order; first --warmup-pairs are
discarded (cold caches).
4. After each non-warmup pair, compute the likelihood ratio (LR). Stop at
the upper bound (ACCEPT, exit 0) or the lower bound (REJECT, exit 1).
Hitting the max-pairs hard cap first yields INDETERMINATE, which is
treated as REJECT-equivalent (exit 1).
Defaults: alpha=beta=0.05, delta=0.05 (5% of baseline mean), max-pairs=50,
warmup-pairs=2. Sigma assumption: 0.10*baseline_mean for the first 5
pairs, then running sample stdev floored at 0.05*baseline_mean.
Final verdict prints sample sizes, means, mean diff (absolute and %),
and 95% CI half-width.
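A rough sketch of the decision rule, under stated assumptions: each diff is baseline time minus patch time (positive means the patch is faster), H0 is "mean diff = 0" and H1 is "mean diff = delta * baseline_mean". The function name, signature, and return shape are hypothetical and may differ from the real _sprt(); the hard cap is left to the caller.

```python
import math
import statistics


def sprt(diffs, baseline_mean, alpha=0.05, beta=0.05, delta=0.05):
    """Wald SPRT for a normal mean with assumed variance.

    Returns (verdict, log_likelihood_ratio), verdict being one of
    "ACCEPT", "REJECT", "CONTINUE".
    """
    upper = math.log((1 - beta) / alpha)   # crossing it accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0
    n = len(diffs)
    effect = delta * baseline_mean
    if n == 0 or effect <= 0:
        return "CONTINUE", 0.0
    if n <= 5:
        sigma = 0.10 * baseline_mean
    else:
        # The floor keeps the ratio finite when the samples barely vary.
        sigma = max(statistics.stdev(diffs), 0.05 * baseline_mean)
    # Log-likelihood ratio of N(effect, sigma^2) against N(0, sigma^2).
    llr = (effect / sigma ** 2) * (sum(diffs) - n * effect / 2)
    if llr >= upper:
        return "ACCEPT", llr
    if llr <= lower:
        return "REJECT", llr
    return "CONTINUE", llr
```

With alpha = beta = 0.05 the bounds are ±ln(19), roughly ±2.94, so a run of zero diffs drifts down by effect²/(2σ²) per pair and eventually rejects.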
…regressions
The incremental scorer is the gate that catches the
overload-on-reparse class of bug. Without an overload-bearing file in
the canonical edit set, the bench would happily score a buggy
_diff_symbols (keyed on qualified_name alone) as correct.
Now applying two edits to the tmp fixture copy:
1. App.java: add pause()/resume() — pure insertion path.
2. Math.java: tweak `int add(int, int)`'s docstring — exercises the
"key already in existing, hash differs" branch in the presence of
three sibling add() overloads.
Smoke-tested by reverting the overload fix locally: scorer exits 3
with a parse_symbols mismatch showing 2 rows missing (one of the
unrelated overloads got erroneously deleted) and 1 row stale
(the orphaned pre-edit add(int,int)). Restoring the fix passes again.
Documents the per-round loop (hypothesis → edit → commit → run SPRT → ACCEPT keep / REJECT git reset), editable scope (combfind/pipeline + db.py), forbidden scope (bench/, tests/), the meaning of each runner exit code, an enumerated list of known suspects from plans/incremental-reindex-investigation.md, and the hard rules. Pairs with bench/runner.py for the overnight loop the user spec'd: agent runs autonomously on a branch, every ACCEPT becomes a commit, human reviews the chain at merge time.
_stage_fn imported all seven stage modules unconditionally on every call, even when only one stage was being executed. embed and embed_concepts pull in sentence_transformers (PyTorch), which adds 1-2s of import time at the very top of any combfind invocation. Switch to per-stage imports: only the stage module actually being returned gets imported. For full `combfind init` runs nothing changes (every stage runs eventually); for partial-stage runs (tooling, tests, the bench harness) the unused-stage imports are skipped. The bench harness invokes parse+index only; on this fixture, the wall time of score.py in a fresh Python process drops dramatically once the heavy stage modules are not imported.
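The per-stage import pattern can be sketched with importlib. The module paths below are stdlib stand-ins so the snippet runs anywhere; the real mapping would point at combfind's stage modules, whose exact names are not shown in this PR, and stage_fn's signature here is hypothetical.

```python
import importlib

# Stand-in mapping: stage name -> module path (stdlib modules used as
# placeholders for the real stage modules).
_STAGE_MODULES = {
    "parse": "json",
    "index": "csv",
}


def stage_fn(stage, attr):
    """Import only the module for the requested stage, then return its callable."""
    module = importlib.import_module(_STAGE_MODULES[stage])
    return getattr(module, attr)


dumps = stage_fn("parse", "dumps")
print(dumps([1]))  # [1]
```

importlib.import_module caches in sys.modules, so repeated calls for the same stage pay the import cost once.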
JavaIndexer.run claimed to fall back to tree-sitter when scip-java
wasn't available, but the existing branch only triggered when
shutil.which("scip-java") returned None. When scip-java was on PATH
but failed at runtime (e.g., no Maven/Gradle/sbt/mill descriptor),
_run_scip silently returned 0 and the tree-sitter else-branch never
ran — see plans/scip-java-fallback-bug.md.
Predict scip-java's runtime requirement cheaply: only invoke it when
the repo root has one of the build descriptors it understands. If
either the binary is missing OR no descriptor is present, fall
through to the tree-sitter import extractor.
Side benefits:
- Saves ~400ms per init for Java repos without a build tool
(the failing scip-java subprocess is no longer invoked).
- Closes the long-standing
tests/unit/test_index_java.py::test_import_reference failure.
Re-baselined bench/golden/index_refs*.tsv to reflect the now-emitted
17 import refs that the broken fallback was silently dropping; the
fixture has zero build descriptors so this is the path the bench
exercises.
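A minimal sketch of the descriptor probe, assuming the conventional root-level file names for Maven, Gradle, sbt, and Mill; the exact list checked by JavaIndexer is not shown here, and should_run_scip_java is a hypothetical name. The which parameter exists only so the check can be exercised without scip-java installed.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed descriptor names (Maven, Gradle Groovy/Kotlin DSL, sbt, Mill).
BUILD_DESCRIPTORS = (
    "pom.xml",
    "build.gradle",
    "build.gradle.kts",
    "build.sbt",
    "build.sc",
)


def has_build_descriptor(root: Path) -> bool:
    return any((root / name).is_file() for name in BUILD_DESCRIPTORS)


def should_run_scip_java(root: Path, which=shutil.which) -> bool:
    """Invoke scip-java only if the binary exists AND a descriptor is present."""
    return which("scip-java") is not None and has_build_descriptor(root)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    on_path = lambda _cmd: "/usr/bin/scip-java"  # pretend the binary is installed
    without_descriptor = should_run_scip_java(root, which=on_path)
    (root / "pom.xml").write_text("<project/>\n")
    with_descriptor = should_run_scip_java(root, which=on_path)
    binary_missing = should_run_scip_java(root, which=lambda _cmd: None)
```

Whenever the function returns False, the caller falls through to the tree-sitter import extractor.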
_input_hash hashed the full **params dict, including operational keys that don't change pipeline output: repo_path, llm_model, llm_workers, docgen flag (already excluded as kwarg). The result: changing parallelism or moving the DB to a different cwd invalidated every stage's cache, forcing a full re-run on data that was identical. Filter to the keys that actually affect output: exclude_paths and exclude_regex. Adding new output-affecting params (e.g., a future walker config flag) is a one-line allowlist update. Tracked in plans/incremental-reindex-investigation.md as suspect #1. The bench harness can't measure this directly (each trial uses a fresh DB), but it's a correctness fix that means re-running combfind init with --llm-workers tweaked no longer triggers a full re-parse.
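The allowlist idea, sketched under assumptions: the real _input_hash signature and what else it folds into the digest are not shown in this PR, so upstream_hash and the JSON encoding are illustrative. Only the two allowlisted keys come from the commit message.

```python
import hashlib
import json

# Only params that change pipeline output participate in the hash.
OUTPUT_AFFECTING_KEYS = ("exclude_paths", "exclude_regex")


def input_hash(upstream_hash, **params):
    """Hash upstream state plus the allowlisted params; ignore operational keys."""
    relevant = {k: params[k] for k in OUTPUT_AFFECTING_KEYS if k in params}
    payload = json.dumps([upstream_hash, relevant], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = input_hash("abc", exclude_paths=["build/"], llm_workers=4, repo_path="/a")
moved = input_hash("abc", exclude_paths=["build/"], llm_workers=16, repo_path="/b")
narrowed = input_hash("abc", exclude_paths=["build/", "gen/"], llm_workers=4)
```

Here base and moved are equal because only operational keys differ, while narrowed differs because exclude_paths changed.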
…RT math
Two new test modules pin the bench harness's behavior so future tweaks don't silently break the agent loop:
- test_bench_score.py invokes score.py as a subprocess against the real fixture and verifies each exit code path: 0 (clean), 2 (tampered fixture, restored after), 3 (parse golden mismatch), 4 (index golden mismatch). Also verifies stdout JSON shape on pass and absence of JSON on fail.
- test_bench_runner.py exercises the _sprt() function directly with scripted diff arrays: strong improvement accepts past the upper bound, strong regression rejects past the lower, no-op rejects given enough samples, sigma floor prevents division blowup with zero-variance inputs, baseline_mean=0 doesn't raise.
Runner now prints a final one-liner formatted for inclusion in the
agent's commit body:
SPRT-verdict: ACCEPT mode=cold pairs=8 mean_diff=+12.34% ci=±2.10%
program.md instructs the agent to git commit --amend --no-edit with
this line appended so the morning reviewer can pull a verdict
summary across an overnight run with:
git log --grep="SPRT-verdict:" --pretty="%h %s%n%b" master..HEAD
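For reference, a line in that shape can be produced with plain f-string formatting. format_verdict and its parameter names are hypothetical; only the output layout is taken from the example above.

```python
def format_verdict(verdict, mode, pairs, mean_diff_pct, ci_half_width_pct):
    """Build the one-line summary; the sign is always shown on mean_diff."""
    return (
        f"SPRT-verdict: {verdict} mode={mode} pairs={pairs} "
        f"mean_diff={mean_diff_pct:+.2f}% ci=±{ci_half_width_pct:.2f}%"
    )


line = format_verdict("ACCEPT", "cold", 8, 12.34, 2.1)
print(line)  # SPRT-verdict: ACCEPT mode=cold pairs=8 mean_diff=+12.34% ci=±2.10%
```

Keeping the prefix fixed is what lets the git log --grep query above find every verdict in the chain.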
Status: draft. Not ready for merge — opening for visibility while we iterate.
What this is
A Karpathy-style autoresearch loop, gated by chess-engine SPRT, for autonomously optimizing combfind's init speed. The agent edits combfind/pipeline/; the harness (frozen) decides if the change is correct and ≥ Δ faster.
See plans/autoresearch-harness.md for the design (Karpathy's three primitives + SPRT for the noisy continuous metric).
What's been validated
Why this PR is draft
The methodology I used while building this was wrong:
Before merging, want to:
- <owner>/<topic> integration branch + <owner>/experiment/N branches; merge only ACCEPTed) to demonstrate the harness's actual workflow.
- bench/program.md.
- plans/2026-05-03-night-session.md artifact: keep, drop, or rework as a generic runbook.
Dependencies
Stacked on (rebase once these land):
#8 (lazy stage imports) and #9 (input_hash filter) are also included as part of the harness chain since they were discovered while building it; they can be reviewed standalone via their own PRs.
Test plan
- pytest tests/unit/: 143 passed (gleam/erlang excluded for missing tree-sitter-language-pack, unrelated to this work)
- uv run python -m bench.score --mode=cold returns OK
- uv run python -m bench.score --mode=incremental returns OK
- uv run python -m bench.runner --mode=cold ACCEPT/REJECT smoke-tested