A multi-target workspace for evolving Lance
hot-path kernels via LLM coding agents (Claude Code, Codex, Cursor),
in the style of Andrej Karpathy's
autoresearch single-agent loop.
Karpathy's autoresearch
(early 2026): give an LLM agent one mutable file, a fixed bench, and a
program.md of priors. The agent loops (edit, build, run,
keep-or-revert, commit) overnight, until you stop it. Karpathy's framing:
"You wake up to a log of experiments and (hopefully) a better model."
This repo adapts that shape for Lance kernel optimization: each trial
takes ~30s, there is one mutable file (crates/<target>/src/kernels.rs),
and the correctness oracle is the upstream Lance code itself (vendored
verbatim in crates/lance-snapshots/). Any kept commit is bit-equivalent
to what Lance ships; wins port upstream as Apache-2.0 PRs.
Why the shape works: fixed-cost trials bound the per-iteration budget
(~100/hour); one mutable file keeps diffs reviewable and prevents scope
creep; a deterministic oracle kills failed trials without spiraling;
the loop self-orchestrates so the human can leave; findings compound
across sessions via gitignored lessons.md per target.
Each landed target is an independent Rust crate under crates/. The
candidates below are listed as a roadmap. They have no code yet, only a
docs/targets/<name>.md capsule when one exists. Spinning up a candidate
follows the docs/adding-a-target.md workflow.
| Target | Status | Lance source area | What's optimized | Best result |
|---|---|---|---|---|
| crates/pq-l2 | landed | lance-linalg::distance::l2, PQ probe | PQ L2 distance: distance_table + per-vector distances | −43% geomean vs upstream (M1 Max, aarch64; bit-equivalent output; x86 untested) |
| crates/pq-cosine | candidate | lance-linalg::distance::cosine | PQ cosine distance | pending |
| crates/pq-dot | candidate | lance-linalg::distance::dot | PQ dot-product distance | pending |
| crates/ivf-partition | candidate | lance-index::vector::ivf partition select | IVF partition selection (centroid scan) | pending |
| crates/fts-bm25 | candidate (Step 0 ✓) | lance-index::scalar::inverted::scorer Scorer::doc_weight | FTS BM25 scoring inner loop | pending; clean call site at wand.rs:252 via the Scorer trait, ready to scaffold |
| crates/bitpack | candidate | lance-encoding::encodings::bitpack | Bitpack integer decode | pending |
| crates/dictionary | candidate | lance-encoding::encodings::dictionary | Dictionary decode | pending |
| crates/fsst | candidate | lance-encoding::encodings::fsst | FSST string decode | pending |
| crates/take | candidate | lance-core::utils::take | Take / gather kernel | pending |
| crates/predicate | candidate | lance-datafusion filter eval | Predicate evaluation kernels | pending |
| crates/posting-intersect | landed (off-path; see capsule) | lance-index::scalar::inverted (no direct call site) | Sorted u32 posting-list AND intersect | −81% geomean vs scalar K-way merge (M1 Max, aarch64; bit-equivalent output; x86 fallback intact). Kernel surface not in current Lance hot path; see posting-seek for the Lance-aligned shape. |
| crates/posting-seek | kernel landed; REJECTED as upstream PR | lance-index::scalar::inverted::wand (next, shallow_next) | Block-aware seek over compressed posting list | Microbench −97% worst-case / −58% geomean. Upstream integration: 1M no change (p > 0.05); 10M REGRESSES OR queries +12.7% (p=0.03). WAND's score-skip preempts deep next() calls; gallop's overhead loses on shallow skips that actually dominate. The microbench was self-fulfilling. See capsule for full empirical breakdown. |
| crates/topk-merge | candidate | scan-merge | Top-K k-way merge | pending |
The candidate targets are documented in docs/targets/ and
can be added by following docs/adding-a-target.md.
pq-l2, posting-intersect, and posting-seek are landed; the rest
wait for an agent to spin them up. pq-l2 carries a −43% geomean win
on M1 Max. posting-intersect lands at −81% geomean via three trials
(branchless merge → galloping at ratio>16× → NEON cross-product SIMD
merge), but a retroactive Step 0 trace (see docs/adding-a-target.md)
showed its kernel surface is not in Lance's current WAND hot path —
the trial wins are clean kernel engineering on a primitive Lance would
need a refactor to use. posting-seek is the Lance-aligned follow-up:
a hybrid linear-budget + McIlroy gallop change in wand.rs::next that
drops the worst-case microbench seek (Large × Skip-deep) from 3011 ns to
74 ns in ~30 LOC, with no unsafe and no SIMD. Step 0 of the workflow was added in
response to posting-intersect's mis-scope; future targets won't ship
without their "Lance call site" capsule section filed first.
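To make the "linear-budget + gallop" idea concrete, here is a minimal sketch over a plain sorted `&[u32]` slice. It is not the code in wand.rs and does not model Lance's compressed posting blocks; the function name `seek_ge` and the `linear_budget` parameter are hypothetical, chosen only for illustration. The sketch scans linearly for a bounded number of steps (cheap for shallow skips), then switches to exponential probing followed by a binary search (cheap for deep skips).

```rust
/// Returns the index of the first element at or after `start` that is
/// `>= target`, or `postings.len()` if no such element exists.
/// `postings` must be sorted in ascending order.
///
/// Hypothetical helper for illustration; not Lance's API.
pub fn seek_ge(postings: &[u32], start: usize, target: u32, linear_budget: usize) -> usize {
    let n = postings.len();
    let mut i = start.min(n);

    // Phase 1: bounded linear scan. Wins when the target is only a few
    // slots ahead, which avoids any gallop setup cost.
    let linear_end = i.saturating_add(linear_budget).min(n);
    while i < linear_end {
        if postings[i] >= target {
            return i;
        }
        i += 1;
    }
    if i == n {
        return n;
    }

    // Phase 2: gallop. Double the probe distance until we either pass the
    // target or run off the end of the slice.
    let base = i;
    let mut bound = 1usize;
    while base + bound < n && postings[base + bound] < target {
        bound *= 2;
    }

    // The answer now lies in [base + bound / 2, base + bound] (clamped to n).
    let lo = base + bound / 2;
    let hi = (base + bound + 1).min(n);

    // Phase 3: binary search inside the bracket found by the gallop.
    lo + postings[lo..hi].partition_point(|&x| x < target)
}
```

With `linear_budget = 0` this degenerates to a pure gallop, and with a very large budget to a pure linear scan, which makes it easy to compare both extremes under the same benchmark.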
Karpathy's three-file shape, applied per target:
| File (per target crate) | Mutability | Edited by |
|---|---|---|
| src/kernels.rs | mutable | the agent |
| src/reference.rs, src/inputs.rs, src/lib.rs, src/bin/run_experiment.rs, benches/*.rs | immutable | nobody |
| program.md | human-iterated | the human, between runs |
| results.tsv | append-only | the agent, per trial (gitignored) |
| lessons.md | append-only | the agent, on load-bearing findings (gitignored) |
The shared utilities (deterministic PRNG, geomean, bootstrap CI, PMC
counters, peak-RSS readback, tolerance constants, time-budget) live in
crates/harness-common and are
consumed by every target. There is intentionally no Target trait:
decode-kernel signatures and distance-kernel signatures are different
enough that a unifying trait would either bloat or require type-erased
boxing. Each target is its own natural shape; the shared crate is
plumbing only.
The shared loop conventions every target's program.md inherits live in
HARNESS.md. Per-target priors and API specifics live in
each target's own program.md.
Every other ANN benchmark you've seen is "compete on this fixed dataset" (SIFT1M, GIST1M, DEEP1B). That conflates two things: kernel correctness (the math) and kernel speed under one specific data distribution. An LLM agent given recall@K as the oracle has incentive to overfit to the dataset's quirks.
We split them for every target:
- Correctness = bit-equivalent (max_abs_err ≤ 1e-4 for floats; bitwise for integer/byte kernels) match to a scalar reference, on diverse generated inputs. Mathematical equivalence; no dataset to overfit. Lossy techniques fail this gate.
- Speed = geomean ns/operation across multiple shape × distribution combinations, with worst-case guard. A kernel that wins on one distribution and regresses on another fails to keep.
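The correctness gate can be sketched roughly as follows. This is an illustration only: the function names `max_abs_err`, `floats_pass`, and `ints_pass` are hypothetical and are not taken from crates/harness-common; only the 1e-4 tolerance comes from the text above.

```rust
/// Tolerance for float kernels, as stated in this README.
pub const MAX_ABS_ERR: f32 = 1e-4;

/// Largest absolute element-wise difference between two float outputs.
/// Returns infinity on a length mismatch or when any difference is NaN,
/// so that malformed output can never pass the gate.
pub fn max_abs_err(candidate: &[f32], reference: &[f32]) -> f32 {
    if candidate.len() != reference.len() {
        return f32::INFINITY;
    }
    let mut worst = 0.0f32;
    for (c, r) in candidate.iter().zip(reference) {
        let diff = (c - r).abs();
        if diff.is_nan() {
            return f32::INFINITY;
        }
        worst = worst.max(diff);
    }
    worst
}

/// Float kernels pass when every element is within the tolerance.
pub fn floats_pass(candidate: &[f32], reference: &[f32]) -> bool {
    max_abs_err(candidate, reference) <= MAX_ABS_ERR
}

/// Integer and byte kernels must match the reference exactly.
pub fn ints_pass(candidate: &[u32], reference: &[u32]) -> bool {
    candidate == reference
}
```

Treating NaN and length mismatches as infinite error is a design choice of this sketch: it keeps the gate conservative, so a trial can only pass by producing well-formed output.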
Fixtures generate from a seeded PRNG in each target's inputs.rs.
Nothing to download. Reproducible across machines and across runs from
the same SHA.
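As an illustration of why seeded fixtures reproduce across machines, here is a small sketch built on the SplitMix64 generator (the layout section lists SplitMix64 among the harness-common utilities). The struct API and the `make_fixture` helper are assumptions made for this example and need not match what inputs.rs or harness-common actually expose.

```rust
/// SplitMix64: a tiny deterministic PRNG. The same seed always yields the
/// same sequence because the state update uses only wrapping integer math.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f32 in [0, 1), built from the top 24 bits so the
    /// conversion to f32 is exact.
    pub fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u32 << 24) as f32
    }
}

/// Hypothetical fixture generator: `len` floats derived only from `seed`.
pub fn make_fixture(seed: u64, len: usize) -> Vec<f32> {
    let mut rng = SplitMix64::new(seed);
    (0..len).map(|_| rng.next_f32()).collect()
}
```

Two calls such as `make_fixture(42, 64)` return identical vectors, so a fixture never needs to be stored or downloaded; it only needs its seed.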
```bash
# Run the landed PQ L2 target's baseline (3-pass for tight CI).
cargo run --release --bin run_experiment -p pq-l2 -- --mode baseline

# Or per-trial mode (1-pass, faster iteration):
cargo run --release --bin run_experiment -p pq-l2

# With Claude Code / Codex, working on one target:
cd crates/pq-l2
# Open in your agent of choice and prompt:
#   Hi, have a look at program.md and let's kick off a new experiment.

# Add a new target (see docs/adding-a-target.md):
./scripts/scaffold-target.sh pq-cosine
# Then rewrite kernels.rs / reference.rs / inputs.rs / program.md for the
# new kernel's math.

# Check whether our vendored upstream code has drifted:
./scripts/check-lance-drift.sh
```

```
lance-autoresearch/
├── Cargo.toml  # workspace root
├── README.md  # you are here
├── HARNESS.md  # shared loop contract every target inherits
├── LICENSE  # Apache-2.0 (matches upstream Lance)
├── scripts/
│   ├── scaffold-target.sh  # cp -r pq-l2 + rename for a new target
│   └── check-lance-drift.sh  # report upstream-snapshot drift
├── crates/
│   ├── harness-common/  # SplitMix64, geomean, bootstrap CI, PMC counters, tolerance, time budget
│   │   └── src/{lib,prng,stats,sysinfo,tolerance,perf}.rs
│   ├── lance-snapshots/  # verbatim Apache-2.0 vendored Lance hot-path kernels (pinned SHA)
│   │   ├── RESYNC.md
│   │   └── src/{lib,assume,l2,pq}.rs
│   └── pq-l2/  # landed target
│       ├── Cargo.toml
│       ├── program.md  # this target's agent skill
│       ├── src/
│       │   ├── lib.rs  # PqShape + module wiring (immutable)
│       │   ├── kernels.rs  # MUTABLE; agent's playground (starts as upstream clone)
│       │   ├── reference.rs  # IMMUTABLE; thin wrapper over lance-snapshots (oracle IS upstream code)
│       │   ├── inputs.rs  # IMMUTABLE; diverse test-data generators
│       │   └── bin/run_experiment.rs  # IMMUTABLE; per-trial entry point
│       └── benches/pq_l2.rs  # criterion benchmark (immutable)
└── docs/
    ├── design.md  # rationale for the workspace shape
    ├── robustness.md  # why each measurement feature exists
    ├── adding-a-target.md  # workflow for spinning up a new target
    └── targets/
        └── pq-l2.md  # capsule: upstream Lance pointers, oracle, status
```
When a commit on any target clears the keep bar by a meaningful margin
(≥10% geomean speedup with worst-case guard intact), the human reviews
the diff, ports the technique against
lance-format/lance HEAD, runs
Lance's own test suite, and opens a PR. The harness is Apache-2.0
licensed to match Lance; the upstream PR inherits Apache-2.0 cleanly.
The correctness gate (MAX_ABS_ERR ≤ 1e-4 against the vendored upstream
code in crates/lance-snapshots) means any kept commit is bit-equivalent
to what Lance ships today. Recall is preserved by construction, not just
empirically.
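The keep bar described above (≥10% geomean speedup with the worst-case guard intact) could be checked along these lines. This is a sketch under stated assumptions, not the harness's actual logic: "speedup" is read as a reduction in geomean ns/operation, the worst-case guard is modelled as a cap on how much any single shape × distribution cell may regress, and the names `geomean`, `clears_keep_bar`, and `max_cell_regression` are hypothetical.

```rust
/// Geometric mean of strictly positive, finite values.
/// Returns None for empty input or any non-positive / non-finite value.
pub fn geomean(values: &[f64]) -> Option<f64> {
    if values.is_empty() || values.iter().any(|&v| !(v > 0.0) || !v.is_finite()) {
        return None;
    }
    let log_sum: f64 = values.iter().map(|v| v.ln()).sum();
    Some((log_sum / values.len() as f64).exp())
}

/// Decides whether a candidate clears the keep bar.
///
/// `baseline_ns` and `candidate_ns` hold ns/operation per benchmark cell,
/// in the same order. `min_speedup` is the required geomean time
/// reduction (0.10 for 10%). `max_cell_regression` is the assumed
/// worst-case guard: the largest slowdown tolerated in any single cell.
pub fn clears_keep_bar(
    baseline_ns: &[f64],
    candidate_ns: &[f64],
    min_speedup: f64,
    max_cell_regression: f64,
) -> bool {
    if baseline_ns.len() != candidate_ns.len() {
        return false;
    }
    // Per-cell time ratio: below 1.0 means the candidate is faster.
    let ratios: Vec<f64> = candidate_ns
        .iter()
        .zip(baseline_ns)
        .map(|(c, b)| c / b)
        .collect();
    let Some(overall) = geomean(&ratios) else {
        return false;
    };
    let worst = ratios.iter().copied().fold(0.0, f64::max);
    overall <= 1.0 - min_speedup && worst <= 1.0 + max_cell_regression
}
```

Taking the geomean of per-cell ratios rather than of raw timings keeps one slow cell from dominating the summary, while the separate worst-case check still rejects a kernel that wins on average by regressing badly on one distribution.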
Licensed under the Apache License, Version 2.0 (LICENSE).
Vendored upstream code in crates/lance-snapshots/ carries the same
license and is attributed to The Lance Authors in each file's SPDX
header. See crates/lance-snapshots/RESYNC.md for the re-sync ritual.