fix(sync+embed+extract): code-symbol ingest + graph extraction (#767 + #769 + extract follow-ups)#768
Open
rayers wants to merge 4 commits into
Conversation
…an#767)

`gbrain sync --strategy code --source <id>` silently ran a markdown-only import on first sync (no anchor commit yet). The flag was parsed but dropped when performFullSync called runImport, which is hardcoded to collectMarkdownFiles. Result: registering a fresh code source produced 0 code pages, breaking code-def / code-refs / code-callers / code-callees on freshly registered code repos.

Changes:

- src/commands/import.ts
  - runImport now parses `--strategy markdown|code|auto` (defaults to markdown, preserving backward compat).
  - When strategy != markdown, route file collection through a new `collectFilesByStrategy` helper that mirrors collectMarkdownFiles safety guards (lstatSync symlink containment, hidden + node_modules skip, 5MB size cap) and uses isSyncable as the include filter so inclusion logic stays in sync with the incremental walker.
  - Validates the --strategy value with a clear error before falling into the file walk.
- src/commands/sync.ts
  - performFullSync threads opts.strategy through to runImport's argv.
  - The dry-run branch also honors strategy (a parallel bug: silently misleading dry-run counts when strategy=code).
- test/collect-files-by-strategy.test.ts (new)
  - 10 tests, all passing: strategy=code returns code-only, strategy=auto returns code+md, recursion, node_modules + hidden-dir skips, 5MB cap, symlink containment (file + dir + dangling), parity with markdown.
  - Mirrors import-walker.test.ts patterns + L002 security invariants.

Verified end-to-end against a 1.6GB nvr repo (~7000 files, mixed Java + Python + C + TypeScript + bash): `gbrain sync --strategy code` produced 1882 code pages and 31933 tree-sitter chunks (was 0 before this patch). `gbrain code-def DstreamView` returns count: 1.

Existing test surface (import-walker, import-file, import-resume, sync-classifier-widening, sources-resync-recovery, skillpack-sync-guard) runs green: 64 pass, 0 fail.

Closes garrytan#767
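The safety guards named above (lstatSync symlink containment, hidden + node_modules skip, 5MB cap) can be sketched roughly as follows. This is an illustrative sketch only, not the project's actual `collectFilesByStrategy`: the function name, the injected include predicate, and the constant value are assumptions for demonstration.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Illustrative sketch mirroring the guards described in the commit message;
// the real helper lives in src/commands/import.ts.
const MAX_IMPORT_FILE_SIZE = 5 * 1024 * 1024; // 5MB cap (assumed value)

function collectFiles(root: string, include: (p: string) => boolean): string[] {
  const resolvedRoot = fs.realpathSync(root);
  const out: string[] = [];
  const walk = (dir: string): void => {
    for (const name of fs.readdirSync(dir)) {
      if (name.startsWith(".") || name === "node_modules") continue; // hidden + deps skip
      const full = path.join(dir, name);
      const st = fs.lstatSync(full); // lstat: never follow symlinks blindly
      if (st.isSymbolicLink()) {
        let real: string;
        try {
          real = fs.realpathSync(full);
        } catch {
          continue; // dangling symlink
        }
        if (!real.startsWith(resolvedRoot + path.sep)) continue; // containment check
        const rst = fs.statSync(real);
        if (rst.isDirectory()) walk(full);
        else if (rst.isFile() && rst.size <= MAX_IMPORT_FILE_SIZE && include(full)) out.push(full);
      } else if (st.isDirectory()) {
        walk(full);
      } else if (st.isFile() && st.size <= MAX_IMPORT_FILE_SIZE && include(full)) {
        out.push(full);
      }
    }
  };
  walk(resolvedRoot);
  return out;
}
```

The include predicate is passed in so one filter (isSyncable in the real code) can serve both the full-import walk and the incremental walker without drifting apart.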
Adversarial review (Claude subagent) flagged three HIGH issues in the initial garrytan#767 patch, plus two smaller findings. This commit addresses each:

1. HIGH — `auto` was not a strict superset of `collectMarkdownFiles`. The previous implementation routed `auto` through isSyncable, which strips brain-convention files (README.md, index.md, log.md, schema.md, ops/**, .raw/**) and multimodal images. A user migrating from strategy=markdown to strategy=auto would silently lose pages — the exact silent-drop class the patch is meant to fix. Fixed by computing `auto` in runImport as a UNION of `collectMarkdownFiles(dir) ∪ collectFilesByStrategy(dir, 'code')`. The raw `collectFilesByStrategy(dir, 'auto')` helper still uses isSyncable's auto rules (consistent with the incremental sync path) and is documented accordingly; callers wanting the union compose the two.

2. HIGH — `--strategy=code` (equals form) silently dropped. The original `args.indexOf('--strategy')` only matched the space-separated form; `--strategy=code` fell through to the default 'markdown'. Verified by manual smoke test before the fix: `gbrain import <dir> --strategy=code` reported "Found N markdown files". Fixed with a two-form parser that accepts both `--strategy code` and `--strategy=code`. Verified: both forms now print "Found N code files".

3. MEDIUM — `require('../core/sync.ts')` in collectFilesByStrategy was the only `require` in the file (vs `import` everywhere else). Stylistic inconsistency, and brittle if the project moves to ESM-only. Fixed by promoting to a top-level `import { isSyncable } from '...'`.

4. MEDIUM — duplicated 5MB constant. Extracted to a documented module-level `MAX_IMPORT_FILE_SIZE` to make the coupling between this and `MAX_FILE_SIZE` in `import-file.ts` discoverable.

5. LOW — test coverage gaps: empty dir, non-existent dir, unreadable subdir, brain-convention strip behavior on `auto`. All added. Test count for collect-files-by-strategy.test.ts: 10 → 13, all pass.
Full import+sync suite (7 files): 77 pass, 0 fail (was 64 before, +13 new tests).

Equals-form spot-check confirms parity:

- `--strategy code` → Found N code files
- `--strategy=code` → Found N code files
- `--strategy auto` → 4 files (3 md + 1 ts) on a 4-file fixture
- `--strategy markdown` → 3 files (md only)
- (no flag) → Found N markdown files (backward compat)

The single-source `gbrain sync --source <id>` path that doesn't read cfg.strategy from sources.config (`sync.ts:1071-1101`) is pre-existing behavior and out of scope for garrytan#767. Documented in the related issue discussion; potential follow-up.
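The two-form parser described in point 2 above can be sketched like this. The function name, error text, and missing-value behavior are illustrative assumptions, not the actual runImport code:

```typescript
// Illustrative sketch of a two-form CLI flag parser: accepts both
// `--strategy code` (space form) and `--strategy=code` (equals form),
// validates the value, and defaults to markdown for backward compat.
const STRATEGIES = new Set(["markdown", "code", "auto"]);

function parseStrategy(args: string[]): string {
  let value: string | undefined;
  for (let i = 0; i < args.length; i++) {
    const a = args[i];
    if (a === "--strategy") value = args[i + 1]; // space-separated form
    else if (a.startsWith("--strategy=")) value = a.slice("--strategy=".length); // equals form
  }
  if (value === undefined) return "markdown"; // backward-compatible default
  if (!STRATEGIES.has(value)) {
    throw new Error(`invalid --strategy '${value}' (expected markdown|code|auto)`);
  }
  return value;
}
```

Validating before returning mirrors the "clear error before falling into the file walk" behavior noted in the commit message.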
Closes garrytan#769.

Every re-embed pass clobbered code-chunk metadata (language, symbol_name, symbol_type, start_line, end_line, parent_symbol_path, doc_comment, symbol_name_qualified) to NULL, disabling code-def queries across thousands of indexed chunks.

Two complementary fixes:

- embed.ts — three re-upsert call sites (embedOnePage, embedAll non-stale, embedAllStale autopilot path) build ChunkInputs from loaded chunks; they were stripping the 8 metadata fields. A new preserveCodeMetadata helper threads those fields through consistently.
- postgres-engine.ts + pglite-engine.ts — the upsertChunks ON CONFLICT clause overwrote metadata columns from EXCLUDED. This was asymmetric vs the embedding/embedded_at columns, which already used a chunk_text-gated CASE pattern (re-chunk → trust EXCLUDED, re-embed → COALESCE preserve). Applied the same pattern to all 8 metadata columns.

Three regression tests in test/embed.serial.test.ts cover the --stale (autopilot), --all, and --slugs paths. Each loads a chunk with full metadata, runs runEmbed, and asserts engine.upsertChunks receives the metadata round-tripped.

Backfill required after deploy: `gbrain sync --strategy code --force --source <id>` per code source to re-populate metadata via the chunker. Without backfill, existing NULL columns stay NULL — re-embed alone never produces metadata; only the chunker does.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
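The chunk_text-gated CASE pattern described above can be sketched as a small SQL-fragment builder. The `chunks` table name and the builder itself are illustrative assumptions; the eight column names are the ones listed in the commit message:

```typescript
// Illustrative sketch: build the ON CONFLICT SET fragment so that a re-chunk
// (chunk_text changed) trusts the incoming EXCLUDED value, while a re-embed
// (chunk_text unchanged) preserves the stored value via COALESCE.
function gatedUpdate(col: string): string {
  return (
    `${col} = CASE WHEN chunks.chunk_text IS DISTINCT FROM EXCLUDED.chunk_text ` +
    `THEN EXCLUDED.${col} ELSE COALESCE(EXCLUDED.${col}, chunks.${col}) END`
  );
}

// The eight metadata columns named in the commit message.
const METADATA_COLS = [
  "language", "symbol_name", "symbol_type", "start_line",
  "end_line", "parent_symbol_path", "doc_comment", "symbol_name_qualified",
];

const onConflictSet = METADATA_COLS.map(gatedUpdate).join(",\n  ");
```

The point of the gate: a plain `col = EXCLUDED.col` would NULL out metadata whenever a re-embed upserts chunks without those fields, which is exactly the clobbering this commit fixes.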
Two related fixes for graph extraction:
A. doctor.ts: graph_coverage warning pointed users at non-existent
commands (`gbrain link-extract && gbrain timeline-extract`). The
actual command is `gbrain extract all`. Misleading at best,
silently produces 0 results when copy-pasted at worst.
B. extract.ts + link-extraction.ts: body wikilinks were silently
dropped. Three layered issues:
1. WIKILINK_RE was gated on DIR_PATTERN (people|companies|
meetings|...). Wiki/topic/learning content uses bare-name
wikilinks like `[[Fast-Weigh]]` or `[[2026-05-07-cost-plan]]`
which fall outside that whitelist — the regex never matched.
2. extractPageLinks step 1 pushed `targetSlug = ref.slug`
verbatim with no resolver, so even when a wikilink matched
the regex, its raw text (`Fast-Weigh`) didn't match any
canonical slug.
3. extract.ts passed nullResolver when --include-frontmatter was
off, so frontmatter resolution AND any body resolution were
both disabled together.
Fixes:
- New WIKILINK_GENERIC_RE matches `[[anything]]` outside DIR_PATTERN.
extractEntityRefs runs it as pass 2c with masked-out ranges from
2a/2b so refs aren't double-emitted. Tagged with needsResolution.
- extractPageLinks step 1 routes needsResolution refs through the
resolver (existing makeResolver). Unresolvable refs silently
drop — better than dangling rows.
- Resolver gains step 2.5: slug-tail match. Wikilink text
`2026-05-07-cost-plan` matches the tail of slug
`topics/dragon-pilot/raw/notes/2026-05-07-cost-plan` even when
the title diverges. One getAllSlugs per resolver instance,
in-memory map lookups thereafter. Defensive — returns empty
index if engine.getAllSlugs throws.
- extractPageLinks gains opts.skipFrontmatter. extract.ts uses it
to keep --include-frontmatter semantics while still passing the
active resolver for body wikilinks.
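The step-2.5 slug-tail match above can be sketched as a one-time tail index plus in-memory lookups. The names and the ambiguity rule (only resolve a tail that maps to exactly one slug) are illustrative assumptions, not the actual makeResolver code:

```typescript
// Illustrative sketch: index every slug by its last path segment once,
// then resolve wikilink text by exact tail match.
function buildTailIndex(slugs: string[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const slug of slugs) {
    const tail = slug.split("/").pop()!.toLowerCase();
    const hits = index.get(tail) ?? [];
    hits.push(slug);
    index.set(tail, hits);
  }
  return index;
}

// Ambiguous or missing tails return null, i.e. the ref stays unresolved
// rather than guessing a wrong target.
function resolveByTail(text: string, index: Map<string, string[]>): string | null {
  const hits = index.get(text.trim().toLowerCase()) ?? [];
  return hits.length === 1 ? hits[0] : null;
}
```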
4 new regression tests in test/link-extraction.test.ts:
- extractEntityRefs picks up generic wikilinks with needsResolution
- DIR_PATTERN refs stay untagged (no double-emit)
- skips qualified-syntax tokens (`[[wiki:topic]]`)
- extractPageLinks resolves generic wikilinks via resolver, drops
unresolvable ones, tags them linkSource='wikilink-resolved'
- skipFrontmatter opt suppresses frontmatter pass
Live verified on 7180-page brain: dry-run extract --source db
produced 221 real edges across dragon-pilot + aggregate-scale-software
wiki content. Pre-fix returned 0 from the same content.
FS-source path (extractLinksFromDir) NOT updated. It uses a different
codepath via extractMarkdownLinks + resolveSlug; bare-name wikilinks
in FS mode still won't resolve. Most users are on --source db
(autopilot uses it); FS is for offline Obsidian-vault mode. Separate
concern.
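The pass-2c idea above — a generic `[[...]]` matcher that respects ranges already claimed by earlier passes and skips qualified tokens — can be sketched like this. The type shape, function name, and colon-based qualified-syntax check are illustrative assumptions:

```typescript
// Illustrative sketch of a masked-range generic wikilink pass.
type Ref = { text: string; start: number; needsResolution: boolean };

const GENERIC_WIKILINK = /\[\[([^\[\]]+)\]\]/g;

function genericWikilinkPass(body: string, masked: Array<[number, number]>): Ref[] {
  const refs: Ref[] = [];
  for (const m of body.matchAll(GENERIC_WIKILINK)) {
    const start = m.index!;
    const end = start + m[0].length;
    // Skip ranges already claimed by earlier passes (no double-emit).
    if (masked.some(([s, e]) => start < e && end > s)) continue;
    // Skip qualified-syntax tokens like [[wiki:topic]].
    if (m[1].includes(":")) continue;
    refs.push({ text: m[1], start, needsResolution: true });
  }
  return refs;
}
```

Tagging the output `needsResolution` (rather than emitting a slug directly) defers the Fast-Weigh-style bare names to the resolver, matching the split described above.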
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Three related fixes for code-symbol ingest and graph extraction:
`gbrain sync --strategy code` silently ran a markdown-only import on first sync. `gbrain doctor` recommended non-existent commands; body wikilinks like `[[bare-name]]` were silently dropped because they fell outside `DIR_PATTERN` and the resolver wasn't routed through markdown extraction. Surfaced while diagnosing why post-#769 brains (Code chunks land in DB with NULL language / symbol_name / symbol_type across all languages) showed `graph_coverage 0%` even after running the (correct) extract command.

All three ship together: each is a layer of the same pipeline. Without #767 the chunks don't exist; without #769 their metadata is dead; without the extract fixes the graph stays empty for any brain not formatted in YC/CRM-style
`[Name](people/slug)` markup.

#767 changes — first-sync strategy

- src/commands/import.ts
  - runImport now parses `--strategy markdown|code|auto` (defaults to `markdown` for backward compat).
  - `auto` strategy is computed as a UNION of `collectMarkdownFiles ∪ collectFilesByStrategy('code')` — a strict superset of legacy markdown coverage.
  - New `collectFilesByStrategy(dir, strategy)` helper mirrors collectMarkdownFiles symlink + hidden-dir + node_modules safety.
- src/commands/sync.ts
  - performFullSync threads opts.strategy through to runImport's importArgs. Same fix in the dryRun branch.

#769 changes — metadata preservation across re-embed
- src/commands/embed.ts — three re-upsert call sites (embedOnePage, embedAll non-stale, embedAllStale autopilot path) were stripping the 8 metadata fields. New preserveCodeMetadata helper threads them through consistently.
- src/core/postgres-engine.ts + src/core/pglite-engine.ts — the upsertChunks ON CONFLICT clause was overwriting metadata columns from EXCLUDED unconditionally. Applied the chunk_text-gated CASE pattern to all 8 metadata columns (re-chunk → trust EXCLUDED, re-embed → COALESCE preserve), symmetric with the existing embedding/embedded_at/model pattern.
- test/embed.serial.test.ts — three regression tests cover all three runEmbed paths. Each fails on pre-fix code (8 fields come back undefined), passes on the patched code.
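A rough sketch of the preserveCodeMetadata idea — carrying the eight fields from an already-loaded chunk into the rebuilt ChunkInput instead of dropping them. The type shapes here are assumptions for illustration; only the field names come from the PR:

```typescript
// Illustrative types; the real ChunkInput lives in the gbrain codebase.
interface CodeMetadata {
  language?: string; symbol_name?: string; symbol_type?: string;
  start_line?: number; end_line?: number; parent_symbol_path?: string;
  doc_comment?: string; symbol_name_qualified?: string;
}
interface LoadedChunk extends CodeMetadata { id: string; chunk_text: string }
interface ChunkInput extends CodeMetadata { id: string; chunk_text: string; embedding: number[] }

// Thread the eight metadata fields through when rebuilding a ChunkInput
// for re-embedding, so the upsert never writes them back as undefined/NULL.
function preserveCodeMetadata(chunk: LoadedChunk, embedding: number[]): ChunkInput {
  const { language, symbol_name, symbol_type, start_line, end_line,
          parent_symbol_path, doc_comment, symbol_name_qualified } = chunk;
  return {
    id: chunk.id, chunk_text: chunk.chunk_text, embedding,
    language, symbol_name, symbol_type, start_line, end_line,
    parent_symbol_path, doc_comment, symbol_name_qualified,
  };
}
```

Centralizing this in one helper is what keeps the three re-upsert call sites consistent.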
Extract follow-ups — doctor hint + body wikilinks
- src/commands/doctor.ts — graph_coverage warning pointed users at `gbrain link-extract && gbrain timeline-extract`. Those commands don't exist. Updated to `gbrain extract all`.
- src/core/link-extraction.ts
  - WIKILINK_GENERIC_RE matches `[[anything]]` without DIR_PATTERN whitelist gating. extractEntityRefs runs it as pass 2c with masked-out ranges from 2a/2b. Refs tagged `needsResolution: true`.
  - extractPageLinks step 1 routes needsResolution refs through the resolver. Unresolvable refs silently drop instead of writing dangling rows. Tagged `linkSource: 'wikilink-resolved'`.
  - makeResolver gains step 2.5: slug-tail match. Wikilink text `2026-05-07-cost-plan` matches the tail of slug `topics/dragon-pilot/raw/notes/2026-05-07-cost-plan` even when the title diverges. One getAllSlugs per resolver instance, in-memory map lookups thereafter. Defensive — empty index if getAllSlugs throws or is missing (legacy/test mocks).
  - extractPageLinks gains opts.skipFrontmatter so callers can control frontmatter resolution independently of body resolution. (Pre-fix nullResolver blocked both together.)
- src/commands/extract.ts — DB-source path always passes the active batch resolver. The --include-frontmatter flag now controls only the frontmatter pass via the new opt, not body resolution.
- test/link-extraction.test.ts — four new tests cover the new behavior:
  - extractEntityRefs picks up generic wikilinks with `needsResolution: true`
  - qualified-syntax tokens (`[[wiki:topics/x]]`) skipped by the generic pass
  - extractPageLinks resolves generic wikilinks via resolver, drops unresolvable ones, tags them `wikilink-resolved`
  - skipFrontmatter opt suppresses the frontmatter pass

Out-of-scope follow-up
FS-source extraction (extractLinksFromDir + extractLinksFromFile) uses a separate code path that doesn't go through extractEntityRefs or extractPageLinks. Body wikilinks in --source fs mode still won't resolve. Most users are on --source db (autopilot uses it); FS is for offline Obsidian-vault mode. Worth a follow-up PR but skipped here to keep scope contained.

Verification
#767 end-to-end against a 1.6GB Dividia NVR repo (~7000 files): `gbrain sync --strategy code` produced 1882 code pages and 31933 tree-sitter chunks. Was 0 before this patch.
#769 unit test reproduction
Pre-fix: 3 fails (8 metadata fields come back undefined for --stale, --all, --slugs paths).
Post-fix: 14 pass.
#769 live verification
On a 7180-page brain, ran `gbrain reindex-code --force --no-embed --yes` to backfill the 4875 existing code pages whose metadata had been clobbered. Result: existing embeddings were reused (chunk_text byte-identical), and `gbrain code-def DstreamView` now returns 20 hits across Java + Swift sources (was 0).

Extract follow-up live verification
On the same 7180-page brain:
- `gbrain doctor` warning now reads `Run: gbrain extract all` (was the broken link-extract && timeline-extract recommendation).
- `gbrain extract links --source db --dry-run` produced 20+ real edges in the wiki/topic content. Pre-fix the same content produced 0.
- links table: 193 `mentions`, 14 `documents`, 14 `documented_by`.

Tests
- test/collect-files-by-strategy.test.ts (#767)
- test/embed.serial.test.ts (#769)
- test/link-extraction.test.ts (extract follow-up)
- Full suite (import-walker, import-file, import-resume, sync-classifier-widening, sources-resync-recovery, skillpack-sync-guard, embed.serial, incremental-chunking, chunk-grain-fts, chunker-version-gate, parent-scope, pglite-engine, schema-verify, search-image-column, search-lang-symbol-kind, link-extraction, extract-db, link-extraction-code-refs, cycle-patterns, post-write-lint): 178 → 178, all pass.

Backfill required after #769 deploy
Existing brains where `embed --stale` already clobbered metadata need a chunker re-walk, per code source:

`gbrain sync --strategy code --force --source <id>`

Cost: only chunks whose chunk_text actually changed get re-embedded. The chunker is deterministic and chunker_version hasn't changed, so most chunks pass the dedup check and reuse existing embeddings. On the verification brain above, 0 chunks needed re-embedding.

Adversarial review
Two rounds for #767: a Claude subagent in hostile-reviewer mode caught three HIGH issues (auto-strategy not a strict superset, `--strategy=` form silently dropped, lone `require()`) — all addressed in commit d8e79f7.

For #769 the fix mirrors the embedding column's existing chunk_text-gated CASE pattern verbatim — pattern symmetry is the strongest correctness argument.

For the extract follow-up: the body wikilink path was carefully gated with masked-range tracking so existing DIR_PATTERN matches stay untagged (preserving back-compat for the 0.13 frontmatter behavior). The slug-tail match is defensive (graceful when getAllSlugs is missing) and dir-hint scoped to avoid cross-linking unrelated dirs.

Codex CLI review was attempted on #767 round-1 (`codex review --base master`) but stalled on this repo's size; the Claude subagent review was more productive.

Out of scope (separate from this PR)
- `gbrain sync --source <id>` without `--strategy` doesn't consult cfg.strategy from the source's stored config. The `--all` path does. Pre-existing inconsistency.
- [Java] `path:line` for some node types where extractSymbolName(node) returns null. With #769 fixed, the symbols that DO populate (Python class/method, TS function/class, Go func/type, Java class/method) survive across re-embed. Coverage tuning is a separate chunker concern.
- The brain_score graph component is still a page-coverage metric. For brains that primarily ingest code + transcripts + structured learnings (no Obsidian-style cross-references), the score stays low even after this fix because the content style doesn't produce wikilinks. That's a metric/expectation mismatch, not an extraction bug. Worth its own discussion.
Environment

gbrain v0.30.1 (dffb607), engine = Postgres 16.13 (pgvector + pg_trgm), bun 1.3.10, macOS 26.3.1.

🤖 Generated with Claude Code