fix(sync+embed+extract): code-symbol ingest + graph extraction (#767 + #769 + extract follow-ups)#768

Open
rayers wants to merge 4 commits into garrytan:master from rayers:local/sync-strategy-fix

Conversation

@rayers

@rayers rayers commented May 9, 2026

Summary

Three related fixes for code-symbol ingest and graph extraction:

All three ship together: each is a layer of the same pipeline. Without #767 the chunks don't exist; without #769 their metadata is dead; without the extract fixes the graph stays empty for any brain not formatted in YC/CRM-style [Name](people/slug) markup.

#767 changes — first-sync strategy

src/commands/import.ts

  • runImport now parses --strategy markdown|code|auto (defaults to markdown for backward compat).
  • Both space-separated and equals-separated forms accepted.
  • auto strategy is computed as collectMarkdownFiles ∪ collectFilesByStrategy('code') — a strict superset of legacy markdown coverage.
  • New collectFilesByStrategy(dir, strategy) helper. Mirrors collectMarkdownFiles symlink + hidden-dir + node_modules safety.
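The two-form flag handling can be sketched as follows. This is a minimal illustration, not the actual import.ts code; parseStrategy is a hypothetical name:

```typescript
// Hypothetical sketch of a two-form CLI flag parser: accepts both
// `--strategy code` and `--strategy=code`, defaulting to 'markdown'
// for backward compatibility, and rejecting unknown values early.
type Strategy = "markdown" | "code" | "auto";

function parseStrategy(args: string[]): Strategy {
  let value = "markdown"; // backward-compatible default when flag is absent
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    if (arg === "--strategy") {
      // space-separated form: the value is the next token
      value = args[i + 1] ?? "";
    } else if (arg.startsWith("--strategy=")) {
      // equals-separated form: the value follows the '='
      value = arg.slice("--strategy=".length);
    }
  }
  if (value !== "markdown" && value !== "code" && value !== "auto") {
    throw new Error(`invalid --strategy value: ${value}`);
  }
  return value;
}
```

Matching only the space-separated form was exactly the HIGH issue the adversarial review caught: `--strategy=code` fell through to the default and silently ran a markdown-only import.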

src/commands/sync.ts

  • performFullSync threads opts.strategy through to runImport's importArgs. Same fix in the dryRun branch.

#769 changes — metadata preservation across re-embed

src/commands/embed.ts

Three re-upsert call sites (embedOnePage, embedAll non-stale, embedAllStale autopilot path) were stripping the 8 metadata fields. New preserveCodeMetadata helper threads them through consistently.
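A minimal sketch of the field-threading idea, with assumed names (the actual helper signature and ChunkInput shape may differ):

```typescript
// Copy the 8 code-symbol metadata fields from a loaded chunk onto the
// chunk input being re-upserted, so a re-embed pass doesn't strip them.
// Field names follow the column names described in this PR.
const CODE_METADATA_FIELDS = [
  "language", "symbol_name", "symbol_type", "start_line", "end_line",
  "parent_symbol_path", "doc_comment", "symbol_name_qualified",
] as const;

function preserveCodeMetadata(
  input: Record<string, unknown>,
  loaded: Record<string, unknown>,
): Record<string, unknown> {
  const out = { ...input };
  for (const field of CODE_METADATA_FIELDS) {
    // only fill fields the call site failed to carry over
    if (out[field] === undefined && loaded[field] !== undefined) {
      out[field] = loaded[field];
    }
  }
  return out;
}
```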

src/core/postgres-engine.ts + src/core/pglite-engine.ts

upsertChunks ON CONFLICT clause was overwriting metadata columns from EXCLUDED unconditionally. Applied the chunk_text-gated CASE pattern to all 8 metadata columns (re-chunk → trust EXCLUDED, re-embed → COALESCE preserve), symmetric with the existing embedding/embedded_at/model pattern.
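The per-column decision the CASE pattern encodes can be modeled as a pure function. This is an assumption-level sketch of the semantics described above, not the engine SQL; the exact COALESCE argument order in the real query may differ:

```typescript
// Model of the chunk_text-gated CASE pattern for one metadata column.
// If the incoming chunk_text differs, the row came from a re-chunk, so
// the incoming (EXCLUDED) value is trusted outright, NULLs included.
// If chunk_text is identical, the row came from a re-embed, so the
// existing value is preserved unless the caller supplied a non-null
// replacement (COALESCE semantics).
function mergeMetadataColumn(
  existingChunkText: string,
  incomingChunkText: string,
  existingValue: string | null,
  incomingValue: string | null,
): string | null {
  if (existingChunkText !== incomingChunkText) {
    return incomingValue; // re-chunk: EXCLUDED wins
  }
  return incomingValue ?? existingValue; // re-embed: preserve existing
}
```

Pre-fix behavior corresponds to always returning incomingValue, which is why every re-embed pass nulled the 8 columns.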

test/embed.serial.test.ts

Three regression tests cover all three runEmbed paths. Each fails on pre-fix code (8 fields come back undefined), passes on the patched code.

Extract follow-ups — doctor hint + body wikilinks

src/commands/doctor.ts

graph_coverage warning pointed users at gbrain link-extract && gbrain timeline-extract. Those commands don't exist. Updated to gbrain extract all.

src/core/link-extraction.ts

  • New WIKILINK_GENERIC_RE matches [[anything]] without DIR_PATTERN whitelist gating.
  • extractEntityRefs runs it as pass 2c with masked-out ranges from 2a/2b. Refs tagged needsResolution: true.
  • extractPageLinks step 1 routes needsResolution refs through the resolver. Unresolvable refs silently drop instead of writing dangling rows. Tagged linkSource: 'wikilink-resolved'.
  • makeResolver gains step 2.5: slug-tail match. Wikilink text 2026-05-07-cost-plan matches the tail of slug topics/dragon-pilot/raw/notes/2026-05-07-cost-plan even when the title diverges. One getAllSlugs per resolver instance, in-memory map lookups thereafter. Defensive — empty index if getAllSlugs throws or is missing (legacy/test mocks).
  • extractPageLinks gains opts.skipFrontmatter so callers can control frontmatter resolution independently of body resolution. (Pre-fix nullResolver blocked both together.)
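The slug-tail step above can be sketched roughly like this. buildTailIndex and resolveByTail are illustrative names; the real resolver also applies dir-hint scoping, which is omitted here:

```typescript
// Build a one-time in-memory index from slug tail (final path segment)
// to full slugs, then resolve wikilink text by unambiguous tail match.
function buildTailIndex(getAllSlugs: () => string[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  let slugs: string[] = [];
  try {
    slugs = getAllSlugs(); // defensive: empty index if this throws
  } catch {
    return index;
  }
  for (const slug of slugs) {
    const tail = slug.split("/").pop() ?? slug;
    const bucket = index.get(tail) ?? [];
    bucket.push(slug);
    index.set(tail, bucket);
  }
  return index;
}

function resolveByTail(index: Map<string, string[]>, text: string): string | null {
  const matches = index.get(text) ?? [];
  // only an unambiguous tail match resolves; ambiguous or missing drops the ref
  return matches.length === 1 ? matches[0] : null;
}
```

The one-getAllSlugs-per-resolver-instance cost keeps the lookup O(1) per wikilink after the initial walk.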

src/commands/extract.ts

DB-source path always passes the active batch resolver. --include-frontmatter flag now controls only the frontmatter pass via the new opt, not body resolution.

test/link-extraction.test.ts

New tests cover the new behavior:

  • generic wikilinks emerge with needsResolution: true
  • DIR_PATTERN refs stay untagged (no double-emit)
  • qualified-syntax tokens ([[wiki:topics/x]]) skipped by the generic pass
  • generic wikilinks resolve via resolver, unresolvable ones silently drop, resolved ones tag as wikilink-resolved
  • skipFrontmatter opt suppresses frontmatter pass

Out-of-scope follow-up

FS-source extraction (extractLinksFromDir + extractLinksFromFile) uses a separate code path that doesn't go through extractEntityRefs or extractPageLinks. Body wikilinks in --source fs mode still won't resolve. Most users are on --source db (autopilot uses it); FS is for offline Obsidian-vault mode. Worth a follow-up PR but skipped here to keep scope contained.

Verification

#767 end-to-end against a 1.6GB Dividia NVR repo (~7000 files)

$ gbrain sources add gstack-code-nvr --path /repo --federated
$ gbrain sync --source gstack-code-nvr --strategy code --no-embed
Found 1882 code files
[import.files] 1882/1882 (100%) imported=1882 skipped=0 errors=0
  31933 chunks created

Was 0 before this patch.

#769 unit test reproduction

Pre-fix: 3 fails (8 metadata fields come back undefined for --stale, --all, --slugs paths).
Post-fix: 14 pass.

#769 live verification

On a 7180-page brain, ran gbrain reindex-code --force --no-embed --yes to backfill the 4875 existing code pages whose metadata had been clobbered. Result:

  • 47,908 chunks now have language + symbol_type metadata (was ~42)
  • Java symbol_name coverage 99.9%, Python 98.6%, Swift 87.3%
  • 0 stale embeddings (incremental-chunking dedup hit 100% — chunk_text byte-identical)
  • gbrain code-def DstreamView now returns 20 hits across Java + Swift sources (was 0)

Extract follow-up live verification

On the same 7180-page brain:

  • gbrain doctor warning now reads Run: gbrain extract all (was the broken link-extract && timeline-extract recommendation).
  • gbrain extract links --source db --dry-run produced 20+ real edges in the wiki/topic content. Pre-fix the same content produced 0.
  • After running for real, 221 real edges landed in the links table (193 mentions, 14 documents, 14 documented_by).

Backfill required after #769 deploy

Existing brains where embed --stale already clobbered metadata need a chunker re-walk:

gbrain reindex-code --force --no-embed --yes

Cost: only chunks whose chunk_text actually changed get re-embedded. The chunker is deterministic and chunker_version hasn't changed, so most chunks pass the dedup check and reuse existing embeddings. On the verification brain above, 0 chunks needed re-embedding.

Adversarial review

Two rounds for #767 (Claude subagent in hostile-reviewer mode caught three HIGH issues — auto-strategy not strict superset, --strategy= form silently dropped, lone require() — all addressed in commit d8e79f7).

For #769 the fix mirrors the embedding column's existing chunk_text-gated CASE pattern verbatim — pattern-symmetry is the strongest correctness argument.

For the extract follow-up: the body wikilink path was carefully gated with masked-range tracking so existing DIR_PATTERN matches stay untagged (preserving back-compat for the 0.13 frontmatter behavior). The slug-tail match is defensive (graceful when getAllSlugs is missing) and dir-hint scoped to avoid cross-linking unrelated dirs.

Codex CLI review attempted on #767 round-1 (codex review --base master) but stalled on this repo size; Claude subagent review was more productive.

Out of scope (separate from this PR)

  • gbrain sync --source <id> without --strategy doesn't consult cfg.strategy from the source's stored config. The --all path does. Pre-existing inconsistency.
  • Java tree-sitter chunker still emits headers like [Java] path:line for some node types where extractSymbolName(node) returns null. With #769 ("Code chunks land in DB with NULL language / symbol_name / symbol_type across all languages") fixed, the symbols that DO populate (Python class/method, TS function/class, Go func/type, Java class/method) survive across re-embed. Coverage tuning is a separate chunker concern.
  • FS-source extract path doesn't use the new generic wikilink + resolver flow (see "Out-of-scope follow-up" above).
  • Doctor's brain_score graph component is still a page-coverage metric. For brains that primarily ingest code + transcripts + structured learnings (no Obsidian-style cross-references), the score stays low even after this fix because the content style doesn't produce wikilinks. That's a metric/expectation mismatch, not an extraction bug. Worth its own discussion.

Environment

gbrain v0.30.1 (dffb607), engine = Postgres 16.13 (pgvector + pg_trgm), bun 1.3.10, macOS 26.3.1.

🤖 Generated with Claude Code


rayers added 2 commits May 8, 2026 21:55
…an#767)

`gbrain sync --strategy code --source <id>` silently ran a markdown-only
import on first sync (no anchor commit yet). The flag was parsed but
dropped when performFullSync called runImport, which is hardcoded to
collectMarkdownFiles. Result: registering a fresh code source produced
0 code pages, breaking code-def / code-refs / code-callers / code-callees
on freshly-registered code repos.

Changes:

- src/commands/import.ts
  - runImport now parses `--strategy markdown|code|auto` (defaults to
    markdown, preserves backward compat).
  - When strategy != markdown, route file collection through a new
    `collectFilesByStrategy` helper that mirrors collectMarkdownFiles
    safety guards (lstatSync symlink containment, hidden + node_modules
    skip, 5MB size cap) and uses isSyncable as the include filter so
    inclusion logic stays in sync with the incremental walker.
  - Validates --strategy value with a clear error before falling into
    the file walk.

- src/commands/sync.ts
  - performFullSync threads opts.strategy through to runImport's argv.
  - Dry-run branch also honors strategy (was a parallel bug — silent
    misleading dry-run counts when strategy=code).

- test/collect-files-by-strategy.test.ts (new)
  - 10 tests, all pass: strategy=code returns code-only, strategy=auto
    returns code+md, recursion, node_modules + hidden-dir skips, 5MB cap,
    symlink containment (file + dir + dangling), parity with markdown.
  - Mirrors import-walker.test.ts patterns + L002 security invariants.

Verified end-to-end: against a 1.6GB nvr repo (~7000 files, mixed Java
+ Python + C + TypeScript + bash), `gbrain sync --strategy code`
produced 1882 code pages and 31933 tree-sitter chunks (was 0 before
this patch). `gbrain code-def DstreamView` returns count: 1.

Existing test surface (import-walker, import-file, import-resume,
sync-classifier-widening, sources-resync-recovery, skillpack-sync-guard)
runs green: 64 pass, 0 fail.

Closes garrytan#767
Adversarial review (Claude subagent) flagged the issues below in the
initial garrytan#767 patch. This commit addresses each:

1. HIGH — `auto` was not a strict superset of `collectMarkdownFiles`.
   The previous implementation routed `auto` through isSyncable, which
   strips brain-convention files (README.md, index.md, log.md, schema.md,
   ops/**, .raw/**) and multimodal images. A user migrating from
   strategy=markdown to strategy=auto would silently lose pages — the
   exact silent-drop class the patch is meant to fix.

   Fixed by computing `auto` in runImport as a UNION of
   `collectMarkdownFiles(dir) ∪ collectFilesByStrategy(dir, 'code')`. The
   raw `collectFilesByStrategy(dir, 'auto')` helper still uses isSyncable's
   auto rules (consistent with the incremental sync path) and is documented
   accordingly; callers wanting the union compose the two.

2. HIGH — `--strategy=code` (equals form) silently dropped. The original
   `args.indexOf('--strategy')` only matched the space-separated form;
   `--strategy=code` fell through to default 'markdown'. Verified by
   manual smoke test before the fix: `gbrain import <dir> --strategy=code`
   reported "Found N markdown files".

   Fixed with a two-form parser that accepts both `--strategy code` and
   `--strategy=code`. Verified: both forms now print "Found N code files".

3. MEDIUM — `require('../core/sync.ts')` in collectFilesByStrategy was the
   only `require` in the file (vs `import` everywhere else). Stylistic
   inconsistency + brittle if the project moves to ESM-only.

   Fixed by promoting to a top-level `import { isSyncable } from '...'`.

4. MEDIUM — duplicated 5MB constant. Extracted to a documented module-
   level `MAX_IMPORT_FILE_SIZE` to make the sync between this and
   `MAX_FILE_SIZE` in `import-file.ts` discoverable.

5. LOW — test coverage gaps: empty dir, non-existent dir, unreadable
   subdir, brain-convention strip behavior on `auto`. All added.

Test count for collect-files-by-strategy.test.ts: 10 → 13 pass.
Full import+sync suite (7 files): 77 pass, 0 fail (was 64 before, +13
new tests).

Equals-form spot-check confirms parity:
  --strategy code   → Found N code files
  --strategy=code   → Found N code files
  --strategy auto   → 4 files (3 md + 1 ts) on a 4-file fixture
  --strategy markdown → 3 files (md only)
  (no flag)        → Found N markdown files (backward compat)

The single-source `gbrain sync --source <id>` path that doesn't read
cfg.strategy from sources.config (`sync.ts:1071-1101`) is pre-existing
behavior and out of scope for garrytan#767. Documented in the related issue
discussion; potential follow-up.
Closes garrytan#769. Every re-embed pass clobbered code-chunk metadata
(language, symbol_name, symbol_type, start_line, end_line,
parent_symbol_path, doc_comment, symbol_name_qualified) to NULL,
disabling code-def queries across thousands of indexed chunks.

Two complementary fixes:

embed.ts — three re-upsert call sites (embedOnePage, embedAll
non-stale, embedAllStale autopilot path) build ChunkInputs from
loaded chunks; they were stripping the 8 metadata fields. New
preserveCodeMetadata helper threads those fields through
consistently.

postgres-engine.ts + pglite-engine.ts — upsertChunks ON CONFLICT
clause OVERWROTE metadata columns from EXCLUDED. Asymmetric vs the
embedding/embedded_at columns which already used a chunk_text-gated
CASE pattern (re-chunk → trust EXCLUDED, re-embed → COALESCE
preserve). Applied the same pattern to all 8 metadata columns.

Three regression tests in test/embed.serial.test.ts cover --stale
(autopilot), --all, and --slugs paths. Each loads a chunk with
full metadata, runs runEmbed, and asserts engine.upsertChunks
receives the metadata round-tripped.

Backfill required after deploy: `gbrain sync --strategy code
--force --source <id>` per code source to re-populate metadata via
the chunker. Without backfill, existing NULL columns stay NULL —
re-embed alone never produces metadata, only the chunker does.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rayers changed the title from "fix(sync): honor --strategy on first sync via performFullSync (#767)" to "fix(sync+embed): code-symbol ingest end-to-end (#767 + #769)" on May 10, 2026
Two related fixes for graph extraction:

A. doctor.ts: graph_coverage warning pointed users at non-existent
   commands (`gbrain link-extract && gbrain timeline-extract`). The
   actual command is `gbrain extract all`. Misleading at best,
   silently produces 0 results when copy-pasted at worst.

B. extract.ts + link-extraction.ts: body wikilinks were silently
   dropped. Three layered issues:

   1. WIKILINK_RE was gated on DIR_PATTERN (people|companies|
      meetings|...). Wiki/topic/learning content uses bare-name
      wikilinks like `[[Fast-Weigh]]` or `[[2026-05-07-cost-plan]]`
      which fall outside that whitelist — the regex never matched.
   2. extractPageLinks step 1 pushed `targetSlug = ref.slug`
      verbatim with no resolver, so even when a wikilink matched
      the regex, its raw text (`Fast-Weigh`) didn't match any
      canonical slug.
   3. extract.ts passed nullResolver when --include-frontmatter was
      off, so frontmatter resolution AND any body resolution were
      both disabled together.

   Fixes:
   - New WIKILINK_GENERIC_RE matches `[[anything]]` outside DIR_PATTERN.
     extractEntityRefs runs it as pass 2c with masked-out ranges from
     2a/2b so refs aren't double-emitted. Tagged with needsResolution.
   - extractPageLinks step 1 routes needsResolution refs through the
     resolver (existing makeResolver). Unresolvable refs silently
     drop — better than dangling rows.
   - Resolver gains step 2.5: slug-tail match. Wikilink text
     `2026-05-07-cost-plan` matches the tail of slug
     `topics/dragon-pilot/raw/notes/2026-05-07-cost-plan` even when
     the title diverges. One getAllSlugs per resolver instance,
     in-memory map lookups thereafter. Defensive — returns empty
     index if engine.getAllSlugs throws.
   - extractPageLinks gains opts.skipFrontmatter. extract.ts uses it
     to keep --include-frontmatter semantics while still passing the
     active resolver for body wikilinks.

4 new regression tests in test/link-extraction.test.ts:
  - extractEntityRefs picks up generic wikilinks with needsResolution
  - DIR_PATTERN refs stay untagged (no double-emit)
  - skips qualified-syntax tokens (`[[wiki:topic]]`)
  - extractPageLinks resolves generic wikilinks via resolver, drops
    unresolvable ones, tags them linkSource='wikilink-resolved'
  - skipFrontmatter opt suppresses frontmatter pass

Live verified on 7180-page brain: dry-run extract --source db
produced 221 real edges across dragon-pilot + aggregate-scale-software
wiki content. Pre-fix returned 0 from the same content.

FS-source path (extractLinksFromDir) NOT updated. It uses a different
codepath via extractMarkdownLinks + resolveSlug; bare-name wikilinks
in FS mode still won't resolve. Most users are on --source db
(autopilot uses it); FS is for offline Obsidian-vault mode. Separate
concern.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rayers changed the title from "fix(sync+embed): code-symbol ingest end-to-end (#767 + #769)" to "fix(sync+embed+extract): code-symbol ingest + graph extraction (#767 + #769 + extract follow-ups)" on May 10, 2026