[TEST] improvement/flat-folder-throughput: parallel scan, batched folder summaries, batched Neo4j upserts, concurrency-limited backfill #62
Merged
What changed
Scan: two-pass parallel mode
packages/ingest-github/src/pipeline/scan.ts gains a `twoPassScan` path used when both a `skipDecider` and a `ConcurrencyLimiter` are supplied. The original inline-`await` `walk` stays as the fallback. Pass 1 collects every undecided file as a `PendingFile` and deduplicates extension-based decisions; pass 2 dispatches LLM skip decisions through the limiter and yields `ScanEntry`s.
`SkipDeciderInput` was promoted to an exported type in packages/ingest-github/src/types/pipeline.ts so the new code path can construct it without going through `Parameters<typeof …>`.
Skip-decider: caching + dedupe
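To make the two-pass scan and the decision dedupe concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `PendingFile` and `ScanEntry` shapes, the `createLimiter` helper, and the exact-match `decisionKey` (the PR speaks of a "similar" file signature, which may be looser than string equality). It only shows the general idea of collecting first, then issuing one limited decider call per distinct key.

```typescript
// Hypothetical shapes for illustration; the PR's real PendingFile, ScanEntry
// and ConcurrencyLimiter types are not shown in this description.
type PendingFile = { path: string; extension: string; signature: string };
type ScanEntry = { path: string; skip: boolean };
type Limit = <T>(task: () => Promise<T>) => Promise<T>;

// Minimal limiter: at most `max` tasks are in flight at any time.
function createLimiter(max: number): Limit {
  let active = 0;
  const waiting: Array<() => void> = [];
  return function run<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const start = (): void => {
        active++;
        Promise.resolve()
          .then(task)
          .then(resolve, reject)
          .then(() => {
            active--;
            const next = waiting.shift();
            if (next) next();
          });
      };
      if (active < max) start();
      else waiting.push(start);
    });
  };
}

// Assumed dedupe key: same extension + same signature string.
function decisionKey(file: PendingFile): string {
  return file.extension + "|" + file.signature;
}

async function twoPassScan(
  files: PendingFile[],
  decide: (file: PendingFile) => Promise<boolean>,
  limit: Limit,
): Promise<ScanEntry[]> {
  // Pass 1: collect one representative file per decision key.
  const representatives = new Map<string, PendingFile>();
  for (const file of files) {
    const key = decisionKey(file);
    if (!representatives.has(key)) representatives.set(key, file);
  }
  // Pass 2: one limited decider call per key, then fan results back out.
  const verdicts = new Map<string, boolean>();
  await Promise.all(
    Array.from(representatives.entries()).map(([key, file]) =>
      limit(() => decide(file)).then((skip) => {
        verdicts.set(key, skip);
      }),
    ),
  );
  return files.map((file) => ({
    path: file.path,
    skip: verdicts.get(decisionKey(file)) === true,
  }));
}
```

With N files that share a key, the decider runs once instead of N times, which is the effect the scan log line in the test plan is meant to confirm.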
packages/ingest-github/src/pipeline/skip-decisions/decider.ts rewritten so identical decisions (same extension + similar file signature) collapse to a single LLM round-trip instead of N. README in skip-decisions/README.md updated with the dedupe semantics, plus one extra entry in `ignorePatterns.json`.
Phase split: classify-and-analyse-small → scanAndClassify + analyseSmallFiles
The old single phase did both jobs sequentially. Split into:
- scanAndClassify: emits `{ kind: "small" | "big" | "oversized" }` entries.
- analyseSmallFiles: sends every `small` entry through the LLM limiter, writes oversized stubs.
FileAnalysisCache: one-shot load between phases
New strategies/flat-folder/file-analysis-cache.ts. Loads every `CondensedFileAnalysis` JSON under `metaPaths.fileAnalysisDir` once (concurrency 20), exposes `.get` / `.set` / `.values` / `.size`. Replaces three sequential `iterateCondensed` disk walks (backfill, folder-summary, graph-store) with one parallel preload + in-memory iteration. Phase 3 calls `.set(...)` after each disk write to keep the cache in sync.
Backfill: concurrency-limited + cache-driven
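The backfill consumes the FileAnalysisCache described above, so a rough picture of such a cache may help. This is a sketch under stated assumptions, not the PR's file: the `CondensedFileAnalysis` shape is reduced to two fields, and disk access is replaced by an injected `read` function so the example stays self-contained. It shows a bounded-concurrency preload followed by plain in-memory access.

```typescript
// Hypothetical record shape; the real CondensedFileAnalysis has more fields.
type CondensedFileAnalysis = { path: string; summary: string };

class FileAnalysisCache {
  private readonly entries = new Map<string, CondensedFileAnalysis>();

  // Reads every path once, with at most `concurrency` reads in flight.
  // `read` stands in for the disk read of one JSON file (an assumption).
  static async load(
    paths: string[],
    read: (path: string) => Promise<string>,
    concurrency = 20,
  ): Promise<FileAnalysisCache> {
    const cache = new FileAnalysisCache();
    let next = 0;
    const worker = async (): Promise<void> => {
      while (next < paths.length) {
        const path = paths[next++];
        if (path === undefined) break;
        const parsed = JSON.parse(await read(path)) as CondensedFileAnalysis;
        cache.set(parsed.path, parsed);
      }
    };
    const workers: Promise<void>[] = [];
    const workerCount = Math.min(concurrency, paths.length);
    for (let i = 0; i < workerCount; i++) workers.push(worker());
    await Promise.all(workers);
    return cache;
  }

  get(path: string): CondensedFileAnalysis | undefined {
    return this.entries.get(path);
  }

  // Called after a disk write so later phases see the fresh value.
  set(path: string, analysis: CondensedFileAnalysis): void {
    this.entries.set(path, analysis);
  }

  values(): CondensedFileAnalysis[] {
    return Array.from(this.entries.values());
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Because `size` is known as soon as the preload finishes, a consumer can report a fixed progress total instead of a growing one.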
backfill/fields.ts now takes a `FileAnalysisCache` and `ConcurrencyLimiter`, dispatches every missing-field rewrite through the limiter as a `Promise[]`, and `Promise.all`s at the end. Progress total switches from `{ kind: "growing" }` to `{ kind: "fixed", total: cache.size }` because the total is known up front. backfill/big-files.ts deleted: the dedicated big-file backfill is redundant now that `fields.ts` sweeps the whole cache.
Folder summary: batching
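As an illustration of the folder batching covered in this section, the grouping rule can be sketched as a pure function. The name `groupFoldersSketch`, the `FolderInput` shape, and the parameter names are hypothetical; the two numeric arguments stand in for `folder.summary.batch.size` and `folder.summary.batch.max.files`. The real `groupFoldersForBatching` may differ in detail.

```typescript
// Hypothetical input shape: one folder and the number of files directly in it.
type FolderInput = { folderPath: string; fileCount: number };
type FolderGroups = { batches: FolderInput[][]; individual: FolderInput[] };

function groupFoldersSketch(
  folders: FolderInput[],
  batchSize: number, // stands in for folder.summary.batch.size
  maxFiles: number, // stands in for folder.summary.batch.max.files
): FolderGroups {
  // Sort by path with a code-unit comparison so the result does not depend
  // on input order or locale, which keeps batch compositions reproducible.
  const sorted = [...folders].sort((a, b) =>
    a.folderPath < b.folderPath ? -1 : a.folderPath > b.folderPath ? 1 : 0,
  );
  // Folders over the file cap stay on the individual path.
  const individual = sorted.filter((f) => f.fileCount > maxFiles);
  const small = sorted.filter((f) => f.fileCount <= maxFiles);
  // Remaining folders are cut into consecutive batches of `batchSize`.
  const size = Math.max(1, Math.floor(batchSize));
  const batches: FolderInput[][] = [];
  for (let i = 0; i < small.length; i += size) {
    batches.push(small.slice(i, i + size));
  }
  return { batches, individual };
}
```

In this sketch a batch size of 1 yields single-folder batches; how the actual code routes that case is not something the sketch claims to reproduce.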
folder-summary.ts gains `groupFoldersForBatching` + `FOLDER_BATCH_SYSTEM_PROMPT` / `folderBatchUserPrompt`. Small folders (≤ `folder.summary.batch.max.files`) are grouped into batches of `folder.summary.batch.size` and summarised in one LLM call per batch; folders over the file cap stay on the individual path. Sort is deterministic by folder path so two runs produce identical batch compositions. `groupByDirectFolder` switched to sync (it now iterates the in-memory cache instead of `iterateCondensed`).
prompts/folder-summary.ts adds the batch prompt + `BatchedFolderInput` type. folder-summary-selective.ts reworked to consume the cache.
Big files: parallel chunk analysis + condense retry
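The big-file path in this section pairs chunk-level parallelism with a 2-attempt condense retry and a 2 s backoff. A minimal sketch of that control flow follows. The helper names `withRetry` and `analyseBigFileSketch` and the string-based chunk types are assumptions, and the sketch omits both the limiter and the per-chunk persistence that the PR adds.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `task` up to `attempts` times, waiting `backoffMs` between attempts.
async function withRetry<T>(
  task: () => Promise<T>,
  attempts = 2,
  backoffMs = 2000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) await sleep(backoffMs);
    }
  }
  throw lastError;
}

// Hypothetical per-file driver: analyse all chunks in parallel, then condense
// the results, retrying only the condense step.
async function analyseBigFileSketch(
  chunks: string[],
  analyzeChunk: (chunk: string) => Promise<string>,
  condenseChunks: (analysed: string[]) => Promise<string>,
  backoffMs = 2000,
): Promise<string> {
  const analysed = await Promise.all(chunks.map((chunk) => analyzeChunk(chunk)));
  return withRetry(() => condenseChunks(analysed), 2, backoffMs);
}
```

Retrying only the condense step means the chunk analyses already completed are reused rather than recomputed on the second attempt.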
phases/process-big-files.ts keeps the legacy `processBigFilesQueue` (still used by pipeline/pull.ts) and adds an `analyseBigFiles(manifest, …)` path that drives chunker → `analyzeChunk` → `condenseChunks` per file with chunk-level parallelism through the limiter and a 2-attempt condense retry (2 s backoff). Per-chunk results are persisted via `saveChunk` so a worker restart can resume mid-file. big-file/condenser.ts gets a small tweak to support the retry path.
Neo4j: batched upsert
packages/neo4j/src/files.ts gains a batched upsert path used by the flat-folder indexing phase to land 50+ files (`neo4j.batch.size`) in one transaction instead of 12 round-trips per file. Same Cypher shape as the existing single-shot upsert, wrapped with `UNWIND $files AS f`. Each of the five rel types (`HAS_KEYWORD`, `HAS_CLASS`, `HAS_FUNCTION`, `HAS_IMPORT_INTERNAL`, `HAS_IMPORT_EXTERNAL`) gets a paired batched DELETE + batched UNWIND. packages/neo4j/src/client.ts gains `_runInTransaction` + `CypherStep` to compose the multi-statement transaction. packages/neo4j/src/folder.ts gets a matching batched folder write.
strategies/flat-folder/phases/store-flat-analysis.ts switched onto the new batched API.
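The batched upsert described above can be pictured as a list of Cypher steps that all share one `$files` parameter. The sketch below is illustrative only: the `CypherStep` shape, the `FileRow` fields, the `Keyword` label, and the `path` / `name` properties are assumptions, and only the `HAS_KEYWORD` pair is shown. It builds the statements without touching a driver.

```typescript
// Assumed shapes; the PR's real CypherStep and file payload are not shown.
type CypherStep = { cypher: string; params: Record<string, unknown> };
type FileRow = { path: string; keywords: string[] };

// Splits rows into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const step = Math.max(1, Math.floor(size));
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += step) out.push(items.slice(i, i + step));
  return out;
}

// Builds the steps for one batch: upsert the nodes, delete stale
// HAS_KEYWORD rels for the whole batch, then recreate them.
function buildFileBatchSteps(files: FileRow[]): CypherStep[] {
  const params = { files };
  return [
    {
      cypher: "UNWIND $files AS f MERGE (n:File {path: f.path})",
      params,
    },
    {
      cypher:
        "UNWIND $files AS f MATCH (n:File {path: f.path})-[r:HAS_KEYWORD]->() DELETE r",
      params,
    },
    {
      cypher:
        "UNWIND $files AS f MATCH (n:File {path: f.path}) " +
        "UNWIND f.keywords AS k MERGE (kw:Keyword {name: k}) " +
        "MERGE (n)-[:HAS_KEYWORD]->(kw)",
      params,
    },
  ];
}
```

Under these assumptions, each batch costs a fixed number of statements regardless of how many files it holds, which is where the round-trip saving comes from. The other four rel types would each add a similar DELETE + UNWIND pair.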
OpenRouter: pin upstream provider
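The provider pin in this section amounts to merging one field into each request body. A minimal sketch, assuming a simplified body type: `OpenRouterBody` and `pinProvider` are illustrative names rather than the PR's code, and only `allow_fallbacks` is taken from this description.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Simplified request body; a real request carries more fields.
type OpenRouterBody = {
  model: string;
  messages: ChatMessage[];
  provider?: { allow_fallbacks?: boolean; order?: string[] };
};

// Returns a copy of the body with provider fallbacks disabled, keeping any
// other provider preferences the caller already set.
function pinProvider(body: OpenRouterBody): OpenRouterBody {
  return { ...body, provider: { ...body.provider, allow_fallbacks: false } };
}
```

Returning a copy rather than mutating the input keeps the helper safe to apply at a single choke point before every request is sent.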
packages/llm/src/openrouter.ts adds `provider: { allow_fallbacks: false }` to every request. Without it, OpenRouter silently cycles across upstream providers on a slow first attempt and the per-call wall-clock budget is gone before any real error surfaces. With fallbacks off, the error surfaces and the model-escalation retry above OpenRouter takes over instead.
LLM credential context plumbed through big-file path
big-file/condenser.ts + big-file/index.ts now take and forward `AskLlmOptions` so per-call `apiKey` / `provider` overrides reach the chunk-analyzer and condenser. Previously these used the global config and ignored the per-job credential resolution at the enqueue boundary.
Config
Four new keys in packages/types/src/config.ts + packages/config/src/schema.ts, with sensible defaults:
- `llm.concurrency`
- `folder.summary.batch.size`
- `folder.summary.batch.max.files`
- `neo4j.batch.size`
All four are settable via `bytebell set <key> <n>`.
Why
Repository ingestion was bottlenecked at three independent points:
- … `fileAnalysisDir` from disk → three sequential `O(filecount)` directory walks where one parallel preload sufficed.
On top of that, OpenRouter's silent provider-fallback was masking upstream stalls: our model-escalation retry never ran because OpenRouter would hold the connection until its own internal fallback chain was exhausted, by which time our timeout had fired.
The `backfillBigFiles` removal is correctness, not perf: with the unified cache, the `fields.ts` sweep already visits every condensed analysis (big-file or otherwise), so the dedicated big-file backfill was double-visiting the same entries.
How to test
Unit / type
- `bun install`
- `bun run typecheck`: must pass clean across the workspace
- `bun test packages/ingest-github packages/neo4j packages/llm`: no regressions
End-to-end ingestion
- … `~/.bytebell` work dir
- `bytebell index <small-public-github-repo>`: confirm completion, check logs for:
  - `scan: acceptStatic=… acceptLlm=…` (LLM count should be visibly lower than file count if dedupe is working)
  - `file-analysis-cache: loaded N entries in M ms`
  - `phase3 dispatching N backfill tasks` followed by `phase3 done: …`
  - `folder-summary: batched N folders into M batches` (or equivalent; see logs)
- … `File`, `Folder`, `HAS_KEYWORD`, `HAS_FUNCTION`, `HAS_CLASS`, `HAS_IMPORT_INTERNAL`, `HAS_IMPORT_EXTERNAL` rels exist and `CONTAINS` from folder to file is present
Big-file path
- … `Config.BigFileLineThreshold`
- … `loadChunkIfPresent`
Folder summary batching
- `bytebell set folder.summary.batch.size 1` → re-run ingest → folder summaries should all be individual calls
- `bytebell set folder.summary.batch.size 10` → re-run → batches should re-form
- `bytebell set folder.summary.batch.max.files 3` → repo with one big folder (>3 files) → that folder takes individual path, smaller siblings batch
OpenRouter fallback off
- … (`OR_TEST_DELAY_MS=15000` or a known-slow model)
- … `Config.LlmTimeout`, error surfaces, and model escalation retries on the next model in the chain rather than burning the wall-clock inside OpenRouter
Neo4j throughput sanity
- … `store-flat-analysis` phase on a 500-file repo before/after this PR