fix(sync+embed+extract): code-symbol ingest + graph extraction (#767 + #769 + extract follow-ups)#768

Open
rayers wants to merge 4 commits into garrytan:master from rayers:local/sync-strategy-fix

Conversation

@rayers

@rayers rayers commented May 9, 2026

Summary

Three related fixes for code-symbol ingest and graph extraction:

All three ship together: each is a layer of the same pipeline. Without #767 the chunks don't exist; without #769 their metadata is dead; without the extract fixes the graph stays empty for any brain not formatted in YC/CRM-style [Name](people/slug) markup.

#767 changes — first-sync strategy

src/commands/import.ts

  • runImport now parses --strategy markdown|code|auto (defaults to markdown for backward compat).
  • Both space-separated and equals-separated forms accepted.
  • auto strategy is computed as collectMarkdownFiles ∪ collectFilesByStrategy('code') — a strict superset of legacy markdown coverage.
  • New collectFilesByStrategy(dir, strategy) helper. Mirrors collectMarkdownFiles symlink + hidden-dir + node_modules safety.
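The two-form flag handling can be sketched as follows. This is a minimal illustration, not the actual import.ts code; parseStrategy is a hypothetical name:

```typescript
// Hypothetical sketch of a two-form CLI flag parser: accepts both
// `--strategy code` and `--strategy=code`, defaulting to 'markdown'
// for backward compatibility, and rejecting unknown values early.
type Strategy = "markdown" | "code" | "auto";

function parseStrategy(args: string[]): Strategy {
  let value = "markdown"; // backward-compatible default when flag is absent
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    if (arg === "--strategy") {
      // space-separated form: the value is the next token
      value = args[i + 1] ?? "";
    } else if (arg.startsWith("--strategy=")) {
      // equals-separated form: the value follows the '='
      value = arg.slice("--strategy=".length);
    }
  }
  if (value !== "markdown" && value !== "code" && value !== "auto") {
    throw new Error(`invalid --strategy value: ${value}`);
  }
  return value;
}
```

Matching only the space-separated form was exactly the HIGH issue the adversarial review caught: `--strategy=code` fell through to the default and silently ran a markdown-only import.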

src/commands/sync.ts

  • performFullSync threads opts.strategy through to runImport's importArgs. Same fix in the dryRun branch.

#769 changes — metadata preservation across re-embed

src/commands/embed.ts

Three re-upsert call sites (embedOnePage, embedAll non-stale, embedAllStale autopilot path) were stripping the 8 metadata fields. New preserveCodeMetadata helper threads them through consistently.
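A minimal sketch of the field-threading idea, with assumed names (the actual helper signature and ChunkInput shape may differ):

```typescript
// Copy the 8 code-symbol metadata fields from a loaded chunk onto the
// chunk input being re-upserted, so a re-embed pass doesn't strip them.
// Field names follow the column names described in this PR.
const CODE_METADATA_FIELDS = [
  "language", "symbol_name", "symbol_type", "start_line", "end_line",
  "parent_symbol_path", "doc_comment", "symbol_name_qualified",
] as const;

function preserveCodeMetadata(
  input: Record<string, unknown>,
  loaded: Record<string, unknown>,
): Record<string, unknown> {
  const out = { ...input };
  for (const field of CODE_METADATA_FIELDS) {
    // only fill fields the call site failed to carry over
    if (out[field] === undefined && loaded[field] !== undefined) {
      out[field] = loaded[field];
    }
  }
  return out;
}
```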

src/core/postgres-engine.ts + src/core/pglite-engine.ts

upsertChunks ON CONFLICT clause was overwriting metadata columns from EXCLUDED unconditionally. Applied the chunk_text-gated CASE pattern to all 8 metadata columns (re-chunk → trust EXCLUDED, re-embed → COALESCE preserve), symmetric with the existing embedding/embedded_at/model pattern.
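The per-column decision the CASE pattern encodes can be modeled as a pure function. This is an assumption-level sketch of the semantics described above, not the engine SQL; the exact COALESCE argument order in the real query may differ:

```typescript
// Model of the chunk_text-gated CASE pattern for one metadata column.
// If the incoming chunk_text differs, the row came from a re-chunk, so
// the incoming (EXCLUDED) value is trusted outright, NULLs included.
// If chunk_text is identical, the row came from a re-embed, so the
// existing value is preserved unless the caller supplied a non-null
// replacement (COALESCE semantics).
function mergeMetadataColumn(
  existingChunkText: string,
  incomingChunkText: string,
  existingValue: string | null,
  incomingValue: string | null,
): string | null {
  if (existingChunkText !== incomingChunkText) {
    return incomingValue; // re-chunk: EXCLUDED wins
  }
  return incomingValue ?? existingValue; // re-embed: preserve existing
}
```

Pre-fix behavior corresponds to always returning incomingValue, which is why every re-embed pass nulled the 8 columns.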

test/embed.serial.test.ts

Three regression tests cover all three runEmbed paths. Each fails on pre-fix code (8 fields come back undefined), passes on the patched code.

Extract follow-ups — doctor hint + body wikilinks

src/commands/doctor.ts

graph_coverage warning pointed users at gbrain link-extract && gbrain timeline-extract. Those commands don't exist. Updated to gbrain extract all.

src/core/link-extraction.ts

  • New WIKILINK_GENERIC_RE matches [[anything]] without DIR_PATTERN whitelist gating.
  • extractEntityRefs runs it as pass 2c with masked-out ranges from 2a/2b. Refs tagged needsResolution: true.
  • extractPageLinks step 1 routes needsResolution refs through the resolver. Unresolvable refs silently drop instead of writing dangling rows. Tagged linkSource: 'wikilink-resolved'.
  • makeResolver gains step 2.5: slug-tail match. Wikilink text 2026-05-07-cost-plan matches the tail of slug topics/dragon-pilot/raw/notes/2026-05-07-cost-plan even when the title diverges. One getAllSlugs per resolver instance, in-memory map lookups thereafter. Defensive — empty index if getAllSlugs throws or is missing (legacy/test mocks).
  • extractPageLinks gains opts.skipFrontmatter so callers can control frontmatter resolution independently of body resolution. (Pre-fix nullResolver blocked both together.)
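The slug-tail step above can be sketched roughly like this. buildTailIndex and resolveByTail are illustrative names; the real resolver also applies dir-hint scoping, which is omitted here:

```typescript
// Build a one-time in-memory index from slug tail (final path segment)
// to full slugs, then resolve wikilink text by unambiguous tail match.
function buildTailIndex(getAllSlugs: () => string[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  let slugs: string[] = [];
  try {
    slugs = getAllSlugs(); // defensive: empty index if this throws
  } catch {
    return index;
  }
  for (const slug of slugs) {
    const tail = slug.split("/").pop() ?? slug;
    const bucket = index.get(tail) ?? [];
    bucket.push(slug);
    index.set(tail, bucket);
  }
  return index;
}

function resolveByTail(index: Map<string, string[]>, text: string): string | null {
  const matches = index.get(text) ?? [];
  // only an unambiguous tail match resolves; ambiguous or missing drops the ref
  return matches.length === 1 ? matches[0] : null;
}
```

The one-getAllSlugs-per-resolver-instance cost keeps the lookup O(1) per wikilink after the initial walk.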

src/commands/extract.ts

DB-source path always passes the active batch resolver. --include-frontmatter flag now controls only the frontmatter pass via the new opt, not body resolution.

test/link-extraction.test.ts

New tests cover the new behavior:

  • generic wikilinks emerge with needsResolution: true
  • DIR_PATTERN refs stay untagged (no double-emit)
  • qualified-syntax tokens ([[wiki:topics/x]]) skipped by the generic pass
  • generic wikilinks resolve via resolver, unresolvable ones silently drop, resolved ones tag as wikilink-resolved
  • skipFrontmatter opt suppresses frontmatter pass

Out-of-scope follow-up

FS-source extraction (extractLinksFromDir + extractLinksFromFile) uses a separate code path that doesn't go through extractEntityRefs or extractPageLinks. Body wikilinks in --source fs mode still won't resolve. Most users are on --source db (autopilot uses it); FS is for offline Obsidian-vault mode. Worth a follow-up PR but skipped here to keep scope contained.

Verification

#767 end-to-end against a 1.6GB Dividia NVR repo (~7000 files)

$ gbrain sources add gstack-code-nvr --path /repo --federated
$ gbrain sync --source gstack-code-nvr --strategy code --no-embed
Found 1882 code files
[import.files] 1882/1882 (100%) imported=1882 skipped=0 errors=0
  31933 chunks created

Was 0 before this patch.

#769 unit test reproduction

Pre-fix: 3 fails (8 metadata fields come back undefined for --stale, --all, --slugs paths).
Post-fix: 14 pass.

#769 live verification

On a 7180-page brain, ran gbrain reindex-code --force --no-embed --yes to backfill the 4875 existing code pages whose metadata had been clobbered. Result:

  • 47,908 chunks now have language + symbol_type metadata (was ~42)
  • Java symbol_name coverage 99.9%, Python 98.6%, Swift 87.3%
  • 0 stale embeddings (incremental-chunking dedup hit 100% — chunk_text byte-identical)
  • gbrain code-def DstreamView now returns 20 hits across Java + Swift sources (was 0)

Extract follow-up live verification

On the same 7180-page brain:

  • gbrain doctor warning now reads Run: gbrain extract all (was the broken link-extract && timeline-extract recommendation).
  • gbrain extract links --source db --dry-run produced 20+ real edges in the wiki/topic content. Pre-fix the same content produced 0.
  • After running for real, 221 real edges landed in the links table (193 mentions, 14 documents, 14 documented_by).

Backfill required after #769 deploy

Existing brains where embed --stale already clobbered metadata need a chunker re-walk:

gbrain reindex-code --force --no-embed --yes

Cost: only chunks whose chunk_text actually changed get re-embedded. The chunker is deterministic and chunker_version hasn't changed, so most chunks pass the dedup check and reuse existing embeddings. On the verification brain above, 0 chunks needed re-embedding.

Adversarial review

Two rounds for #767 (Claude subagent in hostile-reviewer mode caught three HIGH issues — auto-strategy not strict superset, --strategy= form silently dropped, lone require() — all addressed in commit d8e79f7).

For #769 the fix mirrors the embedding column's existing chunk_text-gated CASE pattern verbatim — pattern-symmetry is the strongest correctness argument.

For the extract follow-up: the body wikilink path was carefully gated with masked-range tracking so existing DIR_PATTERN matches stay untagged (preserving back-compat for the 0.13 frontmatter behavior). The slug-tail match is defensive (graceful when getAllSlugs is missing) and dir-hint scoped to avoid cross-linking unrelated dirs.

Codex CLI review attempted on #767 round-1 (codex review --base master) but stalled on this repo size; Claude subagent review was more productive.

Out of scope (separate from this PR)

  • gbrain sync --source <id> without --strategy doesn't consult cfg.strategy from the source's stored config. The --all path does. Pre-existing inconsistency.
  • Java tree-sitter chunker still emits headers like [Java] path:line for some node types where extractSymbolName(node) returns null. With #769 ("Code chunks land in DB with NULL language / symbol_name / symbol_type across all languages") fixed, the symbols that DO populate (Python class/method, TS function/class, Go func/type, Java class/method) survive across re-embed. Coverage tuning is a separate chunker concern.
  • FS-source extract path doesn't use the new generic wikilink + resolver flow (see "Out-of-scope follow-up" above).
  • Doctor's brain_score graph component is still a page-coverage metric. For brains that primarily ingest code + transcripts + structured learnings (no Obsidian-style cross-references), the score stays low even after this fix because the content style doesn't produce wikilinks. That's a metric/expectation mismatch, not an extraction bug. Worth its own discussion.

Environment

gbrain v0.30.1 (dffb607), engine = Postgres 16.13 (pgvector + pg_trgm), bun 1.3.10, macOS 26.3.1.

🤖 Generated with Claude Code


rayers added 2 commits May 8, 2026 21:55
…an#767)

`gbrain sync --strategy code --source <id>` silently ran a markdown-only
import on first sync (no anchor commit yet). The flag was parsed but
dropped when performFullSync called runImport, which is hardcoded to
collectMarkdownFiles. Result: registering a fresh code source produced
0 code pages, breaking code-def / code-refs / code-callers / code-callees
on freshly-registered code repos.

Changes:

- src/commands/import.ts
  - runImport now parses `--strategy markdown|code|auto` (defaults to
    markdown, preserves backward compat).
  - When strategy != markdown, route file collection through a new
    `collectFilesByStrategy` helper that mirrors collectMarkdownFiles
    safety guards (lstatSync symlink containment, hidden + node_modules
    skip, 5MB size cap) and uses isSyncable as the include filter so
    inclusion logic stays in sync with the incremental walker.
  - Validates --strategy value with a clear error before falling into
    the file walk.

- src/commands/sync.ts
  - performFullSync threads opts.strategy through to runImport's argv.
  - Dry-run branch also honors strategy (was a parallel bug — silent
    misleading dry-run counts when strategy=code).

- test/collect-files-by-strategy.test.ts (new)
  - 10 tests, all pass: strategy=code returns code-only, strategy=auto
    returns code+md, recursion, node_modules + hidden-dir skips, 5MB cap,
    symlink containment (file + dir + dangling), parity with markdown.
  - Mirrors import-walker.test.ts patterns + L002 security invariants.

Verified end-to-end: against a 1.6GB nvr repo (~7000 files, mixed Java
+ Python + C + TypeScript + bash), `gbrain sync --strategy code`
produced 1882 code pages and 31933 tree-sitter chunks (was 0 before
this patch). `gbrain code-def DstreamView` returns count: 1.

Existing test surface (import-walker, import-file, import-resume,
sync-classifier-widening, sources-resync-recovery, skillpack-sync-guard)
runs green: 64 pass, 0 fail.

Closes garrytan#767
Adversarial review (Claude subagent) flagged the issues below in the
initial garrytan#767 patch. This commit addresses each:

1. HIGH — `auto` was not a strict superset of `collectMarkdownFiles`.
   The previous implementation routed `auto` through isSyncable, which
   strips brain-convention files (README.md, index.md, log.md, schema.md,
   ops/**, .raw/**) and multimodal images. A user migrating from
   strategy=markdown to strategy=auto would silently lose pages — the
   exact silent-drop class the patch is meant to fix.

   Fixed by computing `auto` in runImport as a UNION of
   `collectMarkdownFiles(dir) ∪ collectFilesByStrategy(dir, 'code')`. The
   raw `collectFilesByStrategy(dir, 'auto')` helper still uses isSyncable's
   auto rules (consistent with the incremental sync path) and is documented
   accordingly; callers wanting the union compose the two.

2. HIGH — `--strategy=code` (equals form) silently dropped. The original
   `args.indexOf('--strategy')` only matched the space-separated form;
   `--strategy=code` fell through to default 'markdown'. Verified by
   manual smoke test before the fix: `gbrain import <dir> --strategy=code`
   reported "Found N markdown files".

   Fixed with a two-form parser that accepts both `--strategy code` and
   `--strategy=code`. Verified: both forms now print "Found N code files".

3. MEDIUM — `require('../core/sync.ts')` in collectFilesByStrategy was the
   only `require` in the file (vs `import` everywhere else). Stylistic
   inconsistency + brittle if the project moves to ESM-only.

   Fixed by promoting to a top-level `import { isSyncable } from '...'`.

4. MEDIUM — duplicated 5MB constant. Extracted to a documented module-
   level `MAX_IMPORT_FILE_SIZE` to make the sync between this and
   `MAX_FILE_SIZE` in `import-file.ts` discoverable.

5. LOW — test coverage gaps: empty dir, non-existent dir, unreadable
   subdir, brain-convention strip behavior on `auto`. All added.

Test count for collect-files-by-strategy.test.ts: 10 → 13 pass.
Full import+sync suite (7 files): 77 pass, 0 fail (was 64 before, +13
new tests).

Equals-form spot-check confirms parity:
  --strategy code   → Found N code files
  --strategy=code   → Found N code files
  --strategy auto   → 4 files (3 md + 1 ts) on a 4-file fixture
  --strategy markdown → 3 files (md only)
  (no flag)        → Found N markdown files (backward compat)

The single-source `gbrain sync --source <id>` path that doesn't read
cfg.strategy from sources.config (`sync.ts:1071-1101`) is pre-existing
behavior and out of scope for garrytan#767. Documented in the related issue
discussion; potential follow-up.
Closes garrytan#769. Every re-embed pass clobbered code-chunk metadata
(language, symbol_name, symbol_type, start_line, end_line,
parent_symbol_path, doc_comment, symbol_name_qualified) to NULL,
disabling code-def queries across thousands of indexed chunks.

Two complementary fixes:

embed.ts — three re-upsert call sites (embedOnePage, embedAll
non-stale, embedAllStale autopilot path) build ChunkInputs from
loaded chunks; they were stripping the 8 metadata fields. New
preserveCodeMetadata helper threads those fields through
consistently.

postgres-engine.ts + pglite-engine.ts — upsertChunks ON CONFLICT
clause OVERWROTE metadata columns from EXCLUDED. Asymmetric vs the
embedding/embedded_at columns which already used a chunk_text-gated
CASE pattern (re-chunk → trust EXCLUDED, re-embed → COALESCE
preserve). Applied the same pattern to all 8 metadata columns.

Three regression tests in test/embed.serial.test.ts cover --stale
(autopilot), --all, and --slugs paths. Each loads a chunk with
full metadata, runs runEmbed, and asserts engine.upsertChunks
receives the metadata round-tripped.

Backfill required after deploy: `gbrain sync --strategy code
--force --source <id>` per code source to re-populate metadata via
the chunker. Without backfill, existing NULL columns stay NULL —
re-embed alone never produces metadata, only the chunker does.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rayers changed the title from "fix(sync): honor --strategy on first sync via performFullSync (#767)" to "fix(sync+embed): code-symbol ingest end-to-end (#767 + #769)" on May 10, 2026
Two related fixes for graph extraction:

A. doctor.ts: graph_coverage warning pointed users at non-existent
   commands (`gbrain link-extract && gbrain timeline-extract`). The
   actual command is `gbrain extract all`. Misleading at best,
   silently produces 0 results when copy-pasted at worst.

B. extract.ts + link-extraction.ts: body wikilinks were silently
   dropped. Three layered issues:

   1. WIKILINK_RE was gated on DIR_PATTERN (people|companies|
      meetings|...). Wiki/topic/learning content uses bare-name
      wikilinks like `[[Fast-Weigh]]` or `[[2026-05-07-cost-plan]]`
      which fall outside that whitelist — the regex never matched.
   2. extractPageLinks step 1 pushed `targetSlug = ref.slug`
      verbatim with no resolver, so even when a wikilink matched
      the regex, its raw text (`Fast-Weigh`) didn't match any
      canonical slug.
   3. extract.ts passed nullResolver when --include-frontmatter was
      off, so frontmatter resolution AND any body resolution were
      both disabled together.

   Fixes:
   - New WIKILINK_GENERIC_RE matches `[[anything]]` outside DIR_PATTERN.
     extractEntityRefs runs it as pass 2c with masked-out ranges from
     2a/2b so refs aren't double-emitted. Tagged with needsResolution.
   - extractPageLinks step 1 routes needsResolution refs through the
     resolver (existing makeResolver). Unresolvable refs silently
     drop — better than dangling rows.
   - Resolver gains step 2.5: slug-tail match. Wikilink text
     `2026-05-07-cost-plan` matches the tail of slug
     `topics/dragon-pilot/raw/notes/2026-05-07-cost-plan` even when
     the title diverges. One getAllSlugs per resolver instance,
     in-memory map lookups thereafter. Defensive — returns empty
     index if engine.getAllSlugs throws.
   - extractPageLinks gains opts.skipFrontmatter. extract.ts uses it
     to keep --include-frontmatter semantics while still passing the
     active resolver for body wikilinks.

4 new regression tests in test/link-extraction.test.ts:
  - extractEntityRefs picks up generic wikilinks with needsResolution
  - DIR_PATTERN refs stay untagged (no double-emit)
  - skips qualified-syntax tokens (`[[wiki:topic]]`)
  - extractPageLinks resolves generic wikilinks via resolver, drops
    unresolvable ones, tags them linkSource='wikilink-resolved'
  - skipFrontmatter opt suppresses frontmatter pass

Live verified on 7180-page brain: dry-run extract --source db
produced 221 real edges across dragon-pilot + aggregate-scale-software
wiki content. Pre-fix returned 0 from the same content.

FS-source path (extractLinksFromDir) NOT updated. It uses a different
codepath via extractMarkdownLinks + resolveSlug; bare-name wikilinks
in FS mode still won't resolve. Most users are on --source db
(autopilot uses it); FS is for offline Obsidian-vault mode. Separate
concern.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rayers changed the title from "fix(sync+embed): code-symbol ingest end-to-end (#767 + #769)" to "fix(sync+embed+extract): code-symbol ingest + graph extraction (#767 + #769 + extract follow-ups)" on May 10, 2026