[TEST] improvement/flat-folder-throughput — parallel scan, batched folder summaries, batched Neo4j upserts, concurrency-limited backfill #62

Merged: nitesh32 merged 11 commits into pre-release from feat/fastingest on May 22, 2026

@Dead-Bytes (Collaborator) commented May 22, 2026

What changed

Scan: two-pass parallel mode

packages/ingest-github/src/pipeline/scan.ts gains a twoPassScan path, used when both a skipDecider and a ConcurrencyLimiter are supplied. The original inline-await walk stays as the fallback. Pass 1 collects every undecided file as a PendingFile and deduplicates extension-based decisions; pass 2 dispatches LLM skip decisions through the limiter and yields ScanEntry values.
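For reviewers who want the shape of the two-pass idea without opening scan.ts, here is a minimal sketch. It is not the code in this PR: the type shapes, STATIC_SKIP, extOf, and createLimiter are all hypothetical stand-ins, and pass 1 here applies a toy static extension rule rather than the real dedupe logic.

```typescript
// Hypothetical shapes; the real PendingFile and ScanEntry types may differ.
type ScanEntry = { path: string; skip: boolean; source: "static" | "llm" };
type PendingFile = { path: string; slot: number };
type SkipDecider = (path: string) => Promise<boolean>;
type Limit = <T>(task: () => Promise<T>) => Promise<T>;

// Illustrative static rule table, not the project's ignorePatterns.json.
const STATIC_SKIP = new Set<string>([".png", ".lock"]);

function extOf(path: string): string {
  const dot = path.lastIndexOf(".");
  return dot < 0 ? "" : path.slice(dot);
}

// Minimal limiter: at most `max` tasks run at once, the rest wait in FIFO order.
function createLimiter(max: number): Limit {
  let active = 0;
  const queue: Array<() => void> = [];
  const pump = (): void => {
    while (active < max && queue.length > 0) {
      active++;
      const start = queue.shift();
      if (start) start();
    }
  };
  return <T>(task: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      queue.push(() => {
        Promise.resolve()
          .then(task)
          .then(resolve, reject)
          .then(() => {
            active--; // free the slot, then start the next queued task
            pump();
          });
      });
      pump();
    });
}

async function twoPassScan(
  paths: string[],
  decide: SkipDecider,
  limit: Limit,
): Promise<ScanEntry[]> {
  const entries: ScanEntry[] = [];
  const pending: PendingFile[] = [];

  // Pass 1: synchronous walk. Static rules decide at once; the rest are queued.
  paths.forEach((path, slot) => {
    const staticSkip = STATIC_SKIP.has(extOf(path));
    entries.push({ path, skip: staticSkip, source: "static" });
    if (!staticSkip) pending.push({ path, slot });
  });

  // Pass 2: send every pending decision through the limiter, then fill its slot.
  await Promise.all(
    pending.map(async ({ path, slot }) => {
      const skip = await limit(() => decide(path));
      entries[slot] = { path, skip, source: "llm" };
    }),
  );
  return entries;
}
```

The point of the split is that pass 1 never awaits, so the walk finishes quickly and the expensive calls overlap up to the limiter ceiling instead of running one at a time.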

SkipDeciderInput was promoted to an exported type in packages/ingest-github/src/types/pipeline.ts so the new code path can construct it without going through Parameters<typeof …>.

Skip-decider: caching + dedupe

packages/ingest-github/src/pipeline/skip-decisions/decider.ts is rewritten so identical decisions (same extension + similar file signature) collapse to a single LLM round-trip instead of N. skip-decisions/README.md is updated with the dedupe semantics, and ignorePatterns.json gains one extra entry.
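One common way to collapse N identical decisions into one round-trip is to cache the in-flight promise per key. The sketch below illustrates that technique only. DecisionInput and dedupeDecider are hypothetical names, and it keys on an exact extension + signature match, whereas the PR describes "similar" signatures.

```typescript
// Hypothetical input shape; the real SkipDeciderInput is defined in types/pipeline.ts.
type DecisionInput = { ext: string; signature: string };
type Decide = (input: DecisionInput) => Promise<boolean>;

// Wrap a decider so concurrent and repeated calls with the same key share one promise.
function dedupeDecider(inner: Decide): Decide {
  const cache = new Map<string, Promise<boolean>>();
  return (input) => {
    const key = input.ext + "\u0000" + input.signature;
    let hit = cache.get(key);
    if (!hit) {
      hit = inner(input);
      // Assumption: failed decisions are not cached, so a later call can retry.
      hit.catch(() => cache.delete(key));
      cache.set(key, hit);
    }
    return hit;
  };
}
```

Caching the promise rather than the resolved value matters here: callers that arrive while the first request is still in flight attach to it instead of issuing their own.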

Phase split: classify-and-analyse-small → scanAndClassify + analyseSmallFiles

The old single phase did both jobs sequentially. It is now split into scanAndClassify and analyseSmallFiles.

FileAnalysisCache: one-shot load between phases

New strategies/flat-folder/file-analysis-cache.ts. Loads every CondensedFileAnalysis JSON under metaPaths.fileAnalysisDir once (concurrency 20), exposes .get / .set / .values / .size. Replaces three sequential iterateCondensed disk walks (backfill, folder-summary, graph-store) with one parallel preload + in-memory iteration. Phase 3 calls .set(...) after each disk write to keep the cache in sync.
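A rough sketch of the preload-then-iterate shape, for orientation. The record fields, the injected read function, keying by file path, and the .set signature are assumptions made for illustration; the real class reads JSON from metaPaths.fileAnalysisDir.

```typescript
// Hypothetical record shape; the real CondensedFileAnalysis has more fields.
type CondensedFileAnalysis = { path: string; summary: string };
type ReadAnalysis = (jsonPath: string) => Promise<CondensedFileAnalysis>;

class FileAnalysisCache {
  private readonly entries = new Map<string, CondensedFileAnalysis>();

  // Preload every JSON path with a fixed-size worker pool (default 20 readers).
  static async load(
    jsonPaths: string[],
    read: ReadAnalysis,
    concurrency = 20,
  ): Promise<FileAnalysisCache> {
    const cache = new FileAnalysisCache();
    let next = 0; // shared cursor; safe because JS runs one callback at a time
    const worker = async (): Promise<void> => {
      while (next < jsonPaths.length) {
        const jsonPath = jsonPaths[next++];
        if (jsonPath === undefined) break;
        cache.set(await read(jsonPath));
      }
    };
    const workers: Promise<void>[] = [];
    const poolSize = Math.min(concurrency, jsonPaths.length);
    for (let i = 0; i < poolSize; i++) workers.push(worker());
    await Promise.all(workers);
    return cache;
  }

  get(path: string): CondensedFileAnalysis | undefined {
    return this.entries.get(path);
  }

  // Called after a disk write so later phases see the fresh analysis.
  set(analysis: CondensedFileAnalysis): void {
    this.entries.set(analysis.path, analysis);
  }

  values(): CondensedFileAnalysis[] {
    return Array.from(this.entries.values());
  }

  get size(): number {
    return this.entries.size;
  }
}
```

After the single load, backfill, folder-summary, and graph-store can each iterate values() in memory rather than walking the directory again.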

Backfill: concurrency-limited + cache-driven

backfill/fields.ts now takes a FileAnalysisCache and a ConcurrencyLimiter, dispatches every missing-field rewrite through the limiter, collects the resulting promises in an array, and awaits them with Promise.all at the end. Progress total switches from { kind: "growing" } to { kind: "fixed", total: cache.size } because the total is known up front. backfill/big-files.ts is deleted: the dedicated big-file backfill is redundant now that fields.ts sweeps the whole cache.
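The dispatch-then-await pattern, sketched under assumptions: the cache is modelled as a plain Map, "missing field" is reduced to an absent keywords array, and backfillMissingFields, Analysis, and fill are hypothetical names rather than the actual fields.ts API.

```typescript
// Hypothetical shapes for illustration only.
type Analysis = { path: string; keywords?: string[] };
type Limit = <T>(task: () => Promise<T>) => Promise<T>;
type ProgressTotal = { kind: "growing" } | { kind: "fixed"; total: number };

async function backfillMissingFields(
  cache: Map<string, Analysis>,
  limit: Limit,
  fill: (analysis: Analysis) => Promise<Analysis>,
): Promise<{ progress: ProgressTotal; rewritten: number }> {
  // The cache is fully loaded, so the total is known before any work starts.
  const progress: ProgressTotal = { kind: "fixed", total: cache.size };
  const tasks: Promise<void>[] = [];

  cache.forEach((analysis, key) => {
    if (analysis.keywords !== undefined) return; // nothing missing, skip
    tasks.push(
      limit(async () => {
        cache.set(key, await fill(analysis)); // keep the cache in sync
      }),
    );
  });

  await Promise.all(tasks); // single join point for every dispatched rewrite
  return { progress, rewritten: tasks.length };
}
```

Any limiter with the `Limit` signature can be passed in; a pass-through `(task) => task()` is enough to exercise the function in a test.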

Folder summary: batching

folder-summary.ts gains groupFoldersForBatching + FOLDER_BATCH_SYSTEM_PROMPT / folderBatchUserPrompt. Small folders (≤ folder.summary.batch.max.files) are grouped into batches of folder.summary.batch.size and summarised in one LLM call per batch; folders over the file cap stay on the individual path. Sort is deterministic by folder path so two runs produce identical batch compositions. groupByDirectFolder switched to sync (it now iterates the in-memory cache instead of iterateCondensed).
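To make the grouping rule concrete, here is a small sketch of how such a plan could be computed. planFolderBatches and its types are hypothetical and only mirror the behaviour described above (deterministic sort, file-count cap, batch size of 1 disabling batching); the real groupFoldersForBatching may differ in detail.

```typescript
// Hypothetical shapes for illustration only.
type FolderGroup = { folder: string; files: string[] };
type BatchPlan = { batches: FolderGroup[][]; individual: FolderGroup[] };

function planFolderBatches(
  folders: FolderGroup[],
  batchSize: number, // stands in for folder.summary.batch.size
  maxFiles: number, // stands in for folder.summary.batch.max.files
): BatchPlan {
  // Sort a copy by folder path so two runs produce identical batches.
  const sorted = folders
    .slice()
    .sort((a, b) => (a.folder < b.folder ? -1 : a.folder > b.folder ? 1 : 0));

  // A batch size of 1 (or less) disables batching entirely.
  if (batchSize <= 1) return { batches: [], individual: sorted };

  const small: FolderGroup[] = [];
  const individual: FolderGroup[] = [];
  for (const group of sorted) {
    (group.files.length <= maxFiles ? small : individual).push(group);
  }

  const batches: FolderGroup[][] = [];
  for (let i = 0; i < small.length; i += batchSize) {
    batches.push(small.slice(i, i + batchSize));
  }
  return { batches, individual };
}
```

With batchSize 2 and maxFiles 3, folders src/a (1 file), src/b (2 files), src/c (3 files), and src/big (5 files) yield batches [src/a, src/b] and [src/c], with src/big on the individual path, regardless of input order.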

prompts/folder-summary.ts adds the batch prompt + BatchedFolderInput type. folder-summary-selective.ts reworked to consume the cache.

Big files: parallel chunk analysis + condense retry

phases/process-big-files.ts keeps the legacy processBigFilesQueue (still used by pipeline/pull.ts) and adds an analyseBigFiles(manifest, …) path that drives chunker → analyzeChunk → condenseChunks per file, with chunk-level parallelism through the limiter and a 2-attempt condense retry (2 s backoff). Per-chunk results are persisted via saveChunk so a worker restart can resume mid-file. big-file/condenser.ts gets a small tweak to support the retry path.
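The retry policy is simple enough to sketch. withRetry below is a hypothetical helper, not the code in condenser.ts; it shows a fixed-delay retry with 2 attempts and a 2 s pause, with the sleep function injectable so tests do not have to wait.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run `task` up to `attempts` times, pausing `backoffMs` between failures.
async function withRetry<T>(
  task: (attempt: number) => Promise<T>,
  attempts = 2,
  backoffMs = 2000,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown = new Error("attempts must be >= 1");
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      // Only sleep if another attempt is coming.
      if (attempt < attempts) await sleep(backoffMs);
    }
  }
  throw lastError; // every attempt failed; surface the last error
}
```

A condense step would then be wrapped as `withRetry(() => condense(chunks))`, where `condense` stands for whatever function performs the condense call.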

Neo4j: batched upsert

packages/neo4j/src/files.ts gains a batched upsert path used by the flat-folder indexing phase to land 50+ files (neo4j.batch.size) in one transaction instead of 12 round-trips per file. Same Cypher shape as the existing single-shot upsert, wrapped with UNWIND $files AS f. Each of the five rel types (HAS_KEYWORD, HAS_CLASS, HAS_FUNCTION, HAS_IMPORT_INTERNAL, HAS_IMPORT_EXTERNAL) gets a paired batched DELETE + batched UNWIND. packages/neo4j/src/client.ts gains _runInTransaction + CypherStep to compose the multi-statement transaction. packages/neo4j/src/folder.ts gets a matching batched folder write.
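As an illustration of the UNWIND shape, the sketch below plans one transaction per batch with a MERGE step plus a paired DELETE and re-create for a single rel type. The Cypher is valid, but the node properties, the Keyword label, and the Step / planTransactions names are assumptions; the real queries in files.ts cover all five rel types and may differ.

```typescript
// Hypothetical row and step shapes (the PR's own type is called CypherStep).
type FileRow = { path: string; summary: string; keywords: string[] };
type Step = { query: string; params: { files: FileRow[] } };

const UPSERT_FILES = `
UNWIND $files AS f
MERGE (n:File {path: f.path})
SET n.summary = f.summary`;

const DELETE_KEYWORD_RELS = `
UNWIND $files AS f
MATCH (:File {path: f.path})-[r:HAS_KEYWORD]->()
DELETE r`;

const CREATE_KEYWORD_RELS = `
UNWIND $files AS f
MATCH (n:File {path: f.path})
UNWIND f.keywords AS kw
MERGE (k:Keyword {name: kw})
MERGE (n)-[:HAS_KEYWORD]->(k)`;

function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError("size must be >= 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// One inner array per transaction; every step in it shares the same batch.
function planTransactions(files: FileRow[], batchSize = 50): Step[][] {
  return chunk(files, batchSize).map((batch) =>
    [UPSERT_FILES, DELETE_KEYWORD_RELS, CREATE_KEYWORD_RELS].map((query) => ({
      query,
      params: { files: batch },
    })),
  );
}
```

Each inner array would be handed to a transaction runner so that all statements for a batch commit together, which is what removes the per-file round-trips.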

strategies/flat-folder/phases/store-flat-analysis.ts switched onto the new batched API.

OpenRouter: pin upstream provider

packages/llm/src/openrouter.ts adds provider: { allow_fallbacks: false } to every request. Without it, OpenRouter silently cycles across upstream providers on a slow first attempt, and the per-call wall-clock budget is gone before any real error surfaces. With fallbacks disabled, the failure surfaces inside the budget and the model-escalation retry above OpenRouter takes over instead.
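For reference, a request with fallbacks disabled looks roughly like the sketch below. The provider.allow_fallbacks field and the chat completions endpoint are OpenRouter's; the helper names, the message type, and the timeout wiring are illustrative assumptions and not the code in openrouter.ts.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Build the JSON body; `provider.allow_fallbacks: false` disables provider fallback.
function buildRequestBody(model: string, messages: ChatMessage[]) {
  return { model, messages, provider: { allow_fallbacks: false } };
}

// Hypothetical caller: aborts the request once the per-call budget is spent.
async function askOpenRouter(
  apiKey: string,
  model: string,
  messages: ChatMessage[],
  timeoutMs: number,
): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(buildRequestBody(model, messages)),
      signal: controller.signal,
    });
    if (!res.ok) throw new Error(`OpenRouter responded ${res.status}`);
    return await res.json();
  } finally {
    clearTimeout(timer); // never leave the abort timer running
  }
}
```

When the call rejects, whether from the abort or a non-2xx status, the caller can move to the next model in its own escalation chain.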

LLM credential context plumbed through big-file path

big-file/condenser.ts + big-file/index.ts now take and forward AskLlmOptions so per-call apiKey / provider overrides reach the chunk-analyzer and condenser. Previously these used the global config and ignored the per-job credential resolution at the enqueue boundary.

Config

Four new keys in packages/types/src/config.ts + packages/config/src/schema.ts, with sensible defaults:

| Key | Default | Purpose |
| --- | --- | --- |
| llm.concurrency | 29 | Shared LLM limiter ceiling across scan + analyse + backfill phases |
| folder.summary.batch.size | 10 | Folders summarised per batched LLM call (1 disables batching) |
| folder.summary.batch.max.files | 15 | Folder file-count cap above which we fall back to per-folder calls |
| neo4j.batch.size | 50 | Files landed per batched Cypher transaction |

All four are settable via bytebell set <key> <n>.

Why

Repository ingestion was bottlenecked at three independent points:

  1. Scan phase awaited each LLM skip-decision inline → multi-minute scans on repos with thousands of ambiguous files.
  2. Phases 2a/2b/3 each walked fileAnalysisDir from disk → three sequential O(filecount) directory walks where one parallel preload sufficed.
  3. Neo4j indexing did 12 round-trips per file (one MERGE + five rel-type DELETE + five rel-type UNWIND, separate calls) → indexing dominated the tail of large ingestions.

On top of that, OpenRouter's silent provider-fallback was masking upstream stalls — our model-escalation retry never ran because OpenRouter would hold the connection until its own internal fallback chain exhausted, by which time our timeout had fired.

The backfillBigFiles removal is correctness, not perf: with the unified cache, the fields.ts sweep already visits every condensed analysis (big-file or otherwise), so the dedicated big-file backfill was double-visiting the same entries.

How to test

Unit / type

  1. bun install
  2. bun run typecheck — must pass clean across the workspace
  3. bun test packages/ingest-github packages/neo4j packages/llm — no regressions

End-to-end ingestion

  1. Empty Neo4j + clean ~/.bytebell work dir
  2. bytebell index <small-public-github-repo> — confirm completion, check logs for:
    • scan: acceptStatic=… acceptLlm=… (LLM count should be visibly lower than file count if dedupe is working)
    • file-analysis-cache: loaded N entries in M ms
    • phase3 dispatching N backfill tasks followed by phase3 done: …
    • Folder summary phase emitting folder-summary: batched N folders into M batches (or equivalent — see logs)
    • Neo4j indexing log mentioning batch size 50
  3. Open Neo4j Browser, sanity-check that File, Folder, HAS_KEYWORD, HAS_FUNCTION, HAS_CLASS, HAS_IMPORT_INTERNAL, HAS_IMPORT_EXTERNAL rels exist and CONTAINS from folder to file is present
  4. Re-run ingestion against the same repo, same commit — should be a near no-op (caches hit, skip-decision cache hit, big-file cache hit)

Big-file path

  1. Point at a repo with at least one file over Config.BigFileLineThreshold
  2. Confirm logs show chunker → chunk-analyzer dispatched through the limiter (look for parallel chunk activity, not sequential)
  3. Kill the worker mid-big-file, restart, confirm it picks up from the last persisted chunk via loadChunkIfPresent

Folder summary batching

  1. bytebell set folder.summary.batch.size 1 → re-run ingest → folder summaries should all be individual calls
  2. bytebell set folder.summary.batch.size 10 → re-run → batches should re-form
  3. bytebell set folder.summary.batch.max.files 3 → repo with one big folder (>3 files) → that folder takes individual path, smaller siblings batch

OpenRouter fallback off

  1. Force an OpenRouter slow upstream (OR_TEST_DELAY_MS=15000 or a known-slow model)
  2. Confirm timeout fires within Config.LlmTimeout, error surfaces, and model escalation retries on the next model in the chain rather than burning the wall-clock inside OpenRouter

Neo4j throughput sanity

  1. Time the store-flat-analysis phase on a 500-file repo before/after this PR
  2. Expect order-of-magnitude reduction in wall-clock for that phase

@Dead-Bytes changed the title from "Feat/fastingest" to "[TEST] improvement/flat-folder-throughput — parallel scan, batched folder summaries, batched Neo4j upserts, concurrency-limited backfill" on May 22, 2026
@nitesh32 merged commit 9143c74 into pre-release on May 22, 2026
2 checks passed