[TEST] improvement/flat-folder-throughput — parallel scan, batched folder summaries, batched Neo4j upserts, concurrency-limited backfill by Dead-Bytes · Pull Request #62 · ByteBell/bytebell-oss

Dead-Bytes · 2026-05-22T12:14:13Z

What changed

Scan: two-pass parallel mode

packages/ingest-github/src/pipeline/scan.ts gains a twoPassScan path, used when both a skipDecider and a ConcurrencyLimiter are supplied. The original inline-await walk stays as the fallback. Pass 1 collects every undecided file as a PendingFile and deduplicates extension-based decisions; pass 2 dispatches the LLM skip decisions through the limiter and yields ScanEntry values.
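The two-pass flow can be sketched roughly as below. This is an illustration only, not the code in scan.ts: the PendingFile and ScanEntry shapes, the createLimiter helper (a stand-in for ConcurrencyLimiter), the staticDecisions map, and the decide callback are all assumptions made for the example.

```typescript
// Hypothetical shapes; the real PendingFile / ScanEntry types may differ.
type PendingFile = { path: string; ext: string };
type ScanEntry = { path: string; skip: boolean };
type Decide = (file: PendingFile) => Promise<boolean>;
type Limit = <T>(task: () => Promise<T>) => Promise<T>;

// Minimal limiter: at most `max` tasks hold a slot at any time.
function createLimiter(max: number): Limit {
  let active = 0;
  const waiting: Array<() => void> = [];
  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= max) {
      // Wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) next(); // slot is transferred, so `active` stays the same
      else active--;
    }
  };
}

async function twoPassScan(
  files: PendingFile[],
  staticDecisions: Map<string, boolean>, // extension -> skip?, decided without the LLM
  decide: Decide,
  limit: Limit,
): Promise<ScanEntry[]> {
  const entries: ScanEntry[] = [];
  const pending: PendingFile[] = [];
  // Pass 1: settle what can be settled statically, collect the rest.
  for (const file of files) {
    const known = staticDecisions.get(file.ext);
    if (known === undefined) pending.push(file);
    else entries.push({ path: file.path, skip: known });
  }
  // Pass 2: dispatch the undecided files through the limiter in parallel.
  const decided = await Promise.all(
    pending.map((file) =>
      limit(async () => ({ path: file.path, skip: await decide(file) })),
    ),
  );
  return entries.concat(decided);
}
```

The point of the second pass is that no LLM call blocks the directory walk; the limiter alone bounds how many decisions are in flight.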

SkipDeciderInput was promoted to an exported type in packages/ingest-github/src/types/pipeline.ts so the new code path can construct it without going through Parameters<typeof …>.

Skip-decider: caching + dedupe

packages/ingest-github/src/pipeline/skip-decisions/decider.ts was rewritten so that identical decisions (same extension + similar file signature) collapse into a single LLM round-trip instead of N. skip-decisions/README.md was updated with the dedupe semantics, along with one extra entry in ignorePatterns.json.
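One way to get this kind of collapse is to cache the decision promise under a key built from extension and signature, so that concurrent callers share the same in-flight call. The sketch below assumes the signature is already available as a string; the SkipInput shape and the dedupeDecider name are hypothetical and not taken from decider.ts.

```typescript
// Hypothetical input shape; the real SkipDeciderInput has its own fields.
type SkipInput = { ext: string; signature: string };
type SkipDecider = (input: SkipInput) => Promise<boolean>;

// Wraps an expensive decider so inputs with the same key share one call,
// including calls that are still in flight.
function dedupeDecider(decide: SkipDecider): SkipDecider {
  const cache = new Map<string, Promise<boolean>>();
  return (input) => {
    const key = `${input.ext}|${input.signature}`;
    let hit = cache.get(key);
    if (!hit) {
      hit = decide(input);
      // Forget failed lookups so a later call can retry
      // instead of reusing the rejected promise.
      hit.catch(() => {
        cache.delete(key);
      });
      cache.set(key, hit);
    }
    return hit;
  };
}
```

Caching the promise rather than the resolved value is what makes the dedupe effective under parallel dispatch: the second caller arrives before the first answer exists.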

Phase split: `classify-and-analyse-small` → `scanAndClassify` + `analyseSmallFiles`

The old single phase did both jobs sequentially. It is now split into:

phases/scan-and-classify.ts (new) — walks the scanner and writes scan-manifest.json with { kind: "small" | "big" | "oversized" } entries.
phases/analyse-small.ts (new) — consumes the manifest, runs every small entry through the LLM limiter, writes oversized stubs.
phases/classify-and-analyse-small.ts — deleted.

`FileAnalysisCache`: one-shot load between phases

New strategies/flat-folder/file-analysis-cache.ts. Loads every CondensedFileAnalysis JSON under metaPaths.fileAnalysisDir once (concurrency 20), exposes .get / .set / .values / .size. Replaces three sequential iterateCondensed disk walks (backfill, folder-summary, graph-store) with one parallel preload + in-memory iteration. Phase 3 calls .set(...) after each disk write to keep the cache in sync.
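A minimal sketch of the preload-then-iterate idea, under stated assumptions: the Analysis type stands in for CondensedFileAnalysis, the reader is injected instead of touching metaPaths.fileAnalysisDir, and size is modelled as a getter. None of these details are confirmed by the PR text.

```typescript
// Hypothetical stand-in for CondensedFileAnalysis.
type Analysis = { path: string; summary: string };

class AnalysisCache {
  private readonly entries = new Map<string, Analysis>();

  // Reads every key exactly once, with at most `concurrency` reads in flight.
  static async load(
    keys: string[],
    readOne: (key: string) => Promise<Analysis>,
    concurrency = 20,
  ): Promise<AnalysisCache> {
    const cache = new AnalysisCache();
    let next = 0;
    // Each worker pulls the next unread key until none are left.
    const worker = async (): Promise<void> => {
      while (next < keys.length) {
        const key = keys[next++];
        cache.entries.set(key, await readOne(key));
      }
    };
    const workerCount = Math.min(concurrency, keys.length);
    await Promise.all(Array.from({ length: workerCount }, () => worker()));
    return cache;
  }

  get(key: string): Analysis | undefined {
    return this.entries.get(key);
  }

  // Called after a disk write so later phases see the fresh value.
  set(key: string, value: Analysis): void {
    this.entries.set(key, value);
  }

  values() {
    return this.entries.values();
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Later phases then iterate cache.values() in memory instead of walking the directory again.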

Backfill: concurrency-limited + cache-driven

backfill/fields.ts now takes a FileAnalysisCache and a ConcurrencyLimiter. It dispatches every missing-field rewrite through the limiter, collects the resulting promises in an array, and awaits them with Promise.all at the end. The progress total switches from { kind: "growing" } to { kind: "fixed", total: cache.size } because the total is known up front. backfill/big-files.ts is deleted: the dedicated big-file backfill is redundant now that fields.ts sweeps the whole cache.

Folder summary: batching

folder-summary.ts gains groupFoldersForBatching + FOLDER_BATCH_SYSTEM_PROMPT / folderBatchUserPrompt. Small folders (at most folder.summary.batch.max.files files) are grouped into batches of folder.summary.batch.size and summarised in one LLM call per batch; folders over the file cap stay on the individual path. Folders are sorted deterministically by path, so two runs produce identical batch compositions. groupByDirectFolder is now synchronous (it iterates the in-memory cache instead of iterateCondensed).
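The grouping rule described above could look something like the following. The FolderFiles shape, the groupFolders name, and the return shape are assumptions for illustration; the actual groupFoldersForBatching signature is not shown in this PR description.

```typescript
// Hypothetical shape: a folder path plus the files directly inside it.
type FolderFiles = { folder: string; files: string[] };

function groupFolders(
  folders: FolderFiles[],
  batchSize: number, // folder.summary.batch.size
  maxFiles: number, // folder.summary.batch.max.files
): { batches: FolderFiles[][]; individual: FolderFiles[] } {
  // Sort by path so repeated runs form the same batches.
  const sorted = [...folders].sort((a, b) =>
    a.folder < b.folder ? -1 : a.folder > b.folder ? 1 : 0,
  );
  const small: FolderFiles[] = [];
  const individual: FolderFiles[] = [];
  for (const entry of sorted) {
    // A batch size of 1 (or less) disables batching entirely.
    const batchable = batchSize > 1 && entry.files.length <= maxFiles;
    (batchable ? small : individual).push(entry);
  }
  const batches: FolderFiles[][] = [];
  for (let i = 0; i < small.length; i += batchSize) {
    batches.push(small.slice(i, i + batchSize));
  }
  return { batches, individual };
}
```

Sorting before slicing is the step that makes batch composition reproducible; without it, batch membership would depend on traversal order.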

prompts/folder-summary.ts adds the batch prompt + BatchedFolderInput type. folder-summary-selective.ts reworked to consume the cache.

Big files: parallel chunk analysis + condense retry

phases/process-big-files.ts keeps the legacy processBigFilesQueue (still used by pipeline/pull.ts) and adds an analyseBigFiles(manifest, …) path that drives chunker → analyzeChunk → condenseChunks per file, with chunk-level parallelism through the limiter and a 2-attempt condense retry (2 s backoff). Per-chunk results are persisted via saveChunk so a worker restart can resume mid-file. big-file/condenser.ts gets a small tweak to support the retry path.
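The retry policy (two attempts, fixed 2 s pause between them) can be expressed as a small generic helper. This is a sketch of the policy only; withRetry and sleep are hypothetical names, and how the PR wires the retry around condenseChunks is not shown here.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `task` up to `attempts` times, pausing `backoffMs` between failures.
// The last error is rethrown once all attempts are used up.
async function withRetry<T>(
  task: () => Promise<T>,
  attempts = 2,
  backoffMs = 2000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No pause after the final attempt; there is nothing left to wait for.
      if (attempt < attempts) await sleep(backoffMs);
    }
  }
  throw lastError;
}
```

Because chunk results are persisted before condensing, a retry of this kind only repeats the condense call, not the per-chunk analysis.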

Neo4j: batched upsert

packages/neo4j/src/files.ts gains a batched upsert path, used by the flat-folder indexing phase to land a whole batch of files (neo4j.batch.size, default 50) in one transaction instead of 12 round-trips per file. It uses the same Cypher shape as the existing single-shot upsert, wrapped with UNWIND $files AS f. Each of the five rel types (HAS_KEYWORD, HAS_CLASS, HAS_FUNCTION, HAS_IMPORT_INTERNAL, HAS_IMPORT_EXTERNAL) gets a paired batched DELETE + batched UNWIND. packages/neo4j/src/client.ts gains _runInTransaction + CypherStep to compose the multi-statement transaction. packages/neo4j/src/folder.ts gets a matching batched folder write.
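To make the UNWIND pattern concrete, here is a sketch that builds the statements for one batch and one relationship type. Everything schema-related is an assumption: the Keyword label, the path and name properties, the FileRow shape, and the Step type (a stand-in for CypherStep, whose real shape is not shown). The steps are plain data and would still need to be run inside a single transaction.

```typescript
// Hypothetical row and step shapes for illustration.
type FileRow = { path: string; keywords: string[] };
type Step = { query: string; params: Record<string, unknown> };

// Splits the full file list into batches of `size` (neo4j.batch.size).
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("batch size must be at least 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// One MERGE plus a paired DELETE + UNWIND for HAS_KEYWORD.
// The other four rel types would follow the same two-statement pattern.
function buildBatchSteps(files: FileRow[]): Step[] {
  const params = { files };
  return [
    {
      query: "UNWIND $files AS f MERGE (n:File {path: f.path})",
      params,
    },
    {
      query:
        "UNWIND $files AS f MATCH (n:File {path: f.path})-[r:HAS_KEYWORD]->() DELETE r",
      params,
    },
    {
      query:
        "UNWIND $files AS f MATCH (n:File {path: f.path}) " +
        "UNWIND f.keywords AS k MERGE (kw:Keyword {name: k}) " +
        "MERGE (n)-[:HAS_KEYWORD]->(kw)",
      params,
    },
  ];
}
```

With this shape the number of statements depends on the number of relationship types, not on the number of files in the batch.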

strategies/flat-folder/phases/store-flat-analysis.ts switched onto the new batched API.

OpenRouter: pin upstream provider

packages/llm/src/openrouter.ts adds provider: { allow_fallbacks: false } to every request. Without it, OpenRouter silently cycles across upstream providers when the first attempt is slow, and the per-call wall-clock budget is gone before any real error surfaces. With fallbacks disabled, a stalled upstream surfaces as an error in time for the model-escalation retry above OpenRouter to take over instead.
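For reference, a request body carrying this setting might be assembled as below. Only the provider.allow_fallbacks field comes from the PR; the ChatMessage type, the buildOpenRouterBody helper, and the model string in the usage note are placeholders.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Builds a chat-completions body that asks OpenRouter not to fall back
// to other upstream providers for this request.
function buildOpenRouterBody(model: string, messages: ChatMessage[]) {
  return {
    model,
    messages,
    provider: { allow_fallbacks: false },
  };
}
```

A caller would serialise it with JSON.stringify(buildOpenRouterBody("some/model", messages)) and send it as the POST body, keeping its own timeout as the only wall-clock budget in play.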

LLM credential context plumbed through big-file path

big-file/condenser.ts + big-file/index.ts now take and forward AskLlmOptions so per-call apiKey / provider overrides reach the chunk-analyzer and condenser. Previously these used the global config and ignored the per-job credential resolution at the enqueue boundary.

Config

Four new keys in packages/types/src/config.ts + packages/config/src/schema.ts, with sensible defaults:

| Key | Default | Purpose |
| --- | --- | --- |
| `llm.concurrency` | 29 | Shared LLM limiter ceiling across scan + analyse + backfill phases |
| `folder.summary.batch.size` | 10 | Folders summarised per batched LLM call (1 disables batching) |
| `folder.summary.batch.max.files` | 15 | Folder file-count cap above which we fall back to per-folder calls |
| `neo4j.batch.size` | 50 | Files landed per batched Cypher transaction |

All four are settable via bytebell set <key> <n>.

Why

Repository ingestion was bottlenecked at three independent points:

Scan phase awaited each LLM skip-decision inline → multi-minute scans on repos with thousands of ambiguous files.
Phases 2a/2b/3 each walked fileAnalysisDir from disk → three sequential O(filecount) directory walks where one parallel preload sufficed.
Neo4j indexing did 12 round-trips per file (one MERGE + five rel-type DELETE + five rel-type UNWIND, separate calls) → indexing dominated the tail of large ingestions.

On top of that, OpenRouter's silent provider-fallback was masking upstream stalls — our model-escalation retry never ran because OpenRouter would hold the connection until its own internal fallback chain exhausted, by which time our timeout had fired.

The backfillBigFiles removal is correctness, not perf: with the unified cache, the fields.ts sweep already visits every condensed analysis (big-file or otherwise), so the dedicated big-file backfill was double-visiting the same entries.

How to test

Unit / type

bun install
bun run typecheck — must pass clean across the workspace
bun test packages/ingest-github packages/neo4j packages/llm — no regressions

End-to-end ingestion

Empty Neo4j + clean ~/.bytebell work dir
bytebell index <small-public-github-repo> — confirm completion, check logs for:
- scan: acceptStatic=… acceptLlm=… (LLM count should be visibly lower than file count if dedupe is working)
- file-analysis-cache: loaded N entries in M ms
- phase3 dispatching N backfill tasks followed by phase3 done: …
- Folder summary phase emitting folder-summary: batched N folders into M batches (or equivalent — see logs)
- Neo4j indexing log mentioning batch size 50
Open Neo4j Browser and sanity-check that File and Folder nodes exist, that HAS_KEYWORD, HAS_FUNCTION, HAS_CLASS, HAS_IMPORT_INTERNAL, and HAS_IMPORT_EXTERNAL rels exist, and that CONTAINS from folder to file is present
Re-run ingestion against the same repo, same commit — should be a near no-op (caches hit, skip-decision cache hit, big-file cache hit)

Big-file path

Point at a repo with at least one file over Config.BigFileLineThreshold
Confirm logs show chunker → chunk-analyzer dispatched through the limiter (look for parallel chunk activity, not sequential)
Kill the worker mid-big-file, restart, confirm it picks up from the last persisted chunk via loadChunkIfPresent

Folder summary batching

bytebell set folder.summary.batch.size 1 → re-run ingest → folder summaries should all be individual calls
bytebell set folder.summary.batch.size 10 → re-run → batches should re-form
bytebell set folder.summary.batch.max.files 3 → repo with one big folder (>3 files) → that folder takes individual path, smaller siblings batch

OpenRouter fallback off

Force an OpenRouter slow upstream (OR_TEST_DELAY_MS=15000 or a known-slow model)
Confirm timeout fires within Config.LlmTimeout, error surfaces, and model escalation retries on the next model in the chain rather than burning the wall-clock inside OpenRouter

Neo4j throughput sanity

Time the store-flat-analysis phase on a 500-file repo before/after this PR
Expect an order-of-magnitude reduction in wall-clock time for that phase

Dead-Bytes added 7 commits May 22, 2026 13:01

refactor: update LLM credential handling in big-file processing and condensing

064ebf2

refactor: enhance OpenRouter provider routing to prevent fallback on slow calls

f9949f6

refactor: restructure flat-folder phases for improved clarity and performance

665c4d1

refactor: implement FileAnalysisCache for improved performance in file analysis phases

b6311ba

refactor: add folder summary batching configuration and enhance folder summary processing

13970c7

refactor: remove backfillBigFiles phase and update related documentation

d4b99b1

Refactor backfill process to use concurrency limiter and batch Neo4j upserts

1afd5d6

Dead-Bytes changed the title ~~Feat/fastingest~~ [TEST] improvement/flat-folder-throughput — parallel scan, batched folder summaries, batched Neo4j upserts, concurrency-limited backfill May 22, 2026

Dead-Bytes added 4 commits May 22, 2026 18:11

chore(format): clean up README formatting and improve code readability across multiple files

e45277d

chore: update bun.lock to reflect dependency changes

f5cdaa3

refactor: rename import for analyseBigFiles and remove legacy processBigFiles code

0a58a29

chore(ts-cleanup): tsconfig files cleared

29e6cc5

nitesh32 merged commit 9143c74 into pre-release May 22, 2026
2 checks passed

nitesh32 merged 11 commits into pre-release from feat/fastingest
