
refactor(parser): structural parser registry foundation (#933 phase 1.0) #950

Merged
gfargo merged 1 commit into main from feat/933-tree-sitter-foundation-ts
May 14, 2026


@gfargo gfargo commented May 14, 2026

Summary

First piece of the tree-sitter integration (#933). Pure abstraction — introduces the registry that future phases plug tree-sitter parsers into. No behavior change to the existing regex extractors.

What's here

  • New `StructuralParser` interface. Sync OR async; returns a templated summary or undefined to surrender to the next parser in the chain.
  • New `REGISTRY: Record<StructuralLanguageId, StructuralParser[]>` — per-language chain in priority order. Today every entry is a regex parser wrapping the existing summarizer.
  • New `dispatchStructuralParser` walks the chain and returns the first non-undefined summary. Errors thrown by any parser are swallowed so the chain continues.
  • `summarizeLargeFiles.ts` refactored to await the new dispatcher.
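
A minimal sketch of how these pieces could fit together. The names `StructuralParser`, `REGISTRY`, `StructuralLanguageId`, and `dispatchStructuralParser` come from this PR; the field names (`kind`, `parse`), the two-member language union, and the placeholder parsers are assumptions for illustration, not the merged code.

```typescript
// Assumed language ids; the real union lives in the codebase.
type StructuralLanguageId = 'ts' | 'js';

// Assumed shape: a parser may be sync or async and returns `undefined`
// to surrender to the next parser in the chain.
interface StructuralParser {
  kind: 'regex' | 'tree-sitter';
  parse(fileDiff: string): string | undefined | Promise<string | undefined>;
}

// Placeholder standing in for an existing regex summarizer.
const regexTs: StructuralParser = {
  kind: 'regex',
  parse: (fileDiff) =>
    /^[+-]\s*export\s+(function|class)\s/m.test(fileDiff)
      ? 'changed exported declarations'
      : undefined,
};

// Placeholder that always throws, to exercise the fallback path.
const brokenParser: StructuralParser = {
  kind: 'tree-sitter',
  parse: () => {
    throw new Error('grammar failed to load');
  },
};

// Per-language chains in priority order.
const REGISTRY: Record<StructuralLanguageId, StructuralParser[]> = {
  ts: [brokenParser, regexTs],
  js: [regexTs],
};

async function dispatchStructuralParser(
  language: string,
  fileDiff: string,
): Promise<string | undefined> {
  const chains = REGISTRY as Record<string, StructuralParser[] | undefined>;
  for (const parser of chains[language] ?? []) {
    try {
      const summary = await parser.parse(fileDiff);
      if (summary !== undefined) return summary;
    } catch {
      // A throwing parser surrenders; the loop moves on to the next entry.
    }
  }
  return undefined;
}
```

In this sketch, a `ts` diff that adds an exported function is summarized by `regexTs` even though `brokenParser` throws first, while a body-only diff or an unregistered language resolves to `undefined`.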

Why a registry vs. inline dispatch

The upgrade path (regex → tree-sitter → tree-sitter-with-better-grammar) needs "multiple parsers per language, tried in order, with graceful fallback on error". Hard-coded dispatch would need `try { } catch { fallthrough }` branches in every case; the registry makes that part of the iteration loop.

Strategy for #933 (captured in the issue)

Bundle TS/JS, lazy-load Python/Rust/Go from `~/.cache/coco/tree-sitter/`, and use WASM (`web-tree-sitter`) rather than native bindings.

Phasing:

  • Phase 1.0 (this PR) — abstraction + registry, no tree-sitter yet
  • Phase 1.1 — bundle `web-tree-sitter` + TS parser `.wasm`, implement the tree-sitter side of the registry
  • Phase 2 — bundle JS parser
  • Phase 3 — lazy-load infra (cache, download, prompt)
  • Phase 4–6 — Python / Rust / Go via lazy-load
  • Phase 7 — polish + eval-harness comparison vs. regex

Test plan

  • `npx tsc --noEmit` → 0 errors
  • `npx jest` → 1631/1631 pass (6 new)
  • `npx eslint` on touched files → clean
  • No user-visible change — all existing extractor tests still pass through the new dispatch path

Refs #933.

First piece of the tree-sitter integration. Introduces the
abstraction layer that future phases plug into; no behavior change
to the existing regex extractors.

Why a registry instead of inline dispatch: the upgrade path
(regex → tree-sitter → tree-sitter-with-better-grammar) needs a
shape that supports "multiple parsers per language, tried in order,
with graceful fallback on error". Hard-coded dispatch would need
to grow `try { } catch { fallthrough }` branches in every case;
the registry makes that part of the iteration loop and keeps
per-language modules focused on "given this diff, produce a
summary".

Concretely:

- New `StructuralParser` interface — sync OR async; returns a
  templated summary or undefined to surrender to the next parser
  in the chain. Throwing also surrenders (failure observability
  via logger hook lands in phase 1.1+).
- New `REGISTRY: Record<StructuralLanguageId, StructuralParser[]>`
  — per-language chain in priority order. Today every entry is
  a regex parser wrapping the existing per-language summarizer
  (`summarizeTsStructuralDiff`, `summarizePythonStructuralDiff`,
  etc.). Phase 1.1 will prepend the tree-sitter parser for `ts`
  and `js` without touching call sites.
- New `dispatchStructuralParser(language, fileDiff)` walks the
  chain and returns the first non-undefined summary. Errors
  thrown by any parser are swallowed so the chain continues.
- `summarizeLargeFiles.ts` refactored to await the new dispatcher
  instead of calling the per-language switch directly. The
  language-detection helper stays where it is.

Strategy / packaging decisions for #933 captured in the issue
comment: bundle TS/JS, lazy-load Python/Rust/Go from a cache at
`~/.cache/coco/tree-sitter/`, WASM (`web-tree-sitter`) over native
bindings. Phasing:

  - Phase 1.0 (THIS PR): abstraction + registry, no tree-sitter yet
  - Phase 1.1: bundle web-tree-sitter + TS parser .wasm,
    implement the tree-sitter side
  - Phase 2: bundle JS parser
  - Phase 3: lazy-load infra (cache, download, prompt)
  - Phase 4–6: Python / Rust / Go via lazy-load
  - Phase 7: polish + eval-harness comparison vs. regex

Tests (6 new):
- Registry shape — regex parser registered for every supported
  language; every chain non-empty; only valid kind identifiers
- Dispatcher — returns regex summary for TS + Python diffs with
  structural signal; undefined for body-only diffs; undefined for
  unregistered languages
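
The registry-shape checks listed above could be expressed roughly as follows. This is a sketch under assumptions: the `kind` values and the `checkRegistryShape` helper are hypothetical, and the real tests presumably assert against the actual `REGISTRY` in jest.

```typescript
// Assumed set of valid kind identifiers.
const VALID_KINDS: readonly string[] = ['regex', 'tree-sitter'];

// Returns a list of violations; an empty list means the registry passes.
function checkRegistryShape(registry: Record<string, { kind: string }[]>): string[] {
  const problems: string[] = [];
  for (const [language, chain] of Object.entries(registry)) {
    if (chain.length === 0) problems.push(`${language}: empty chain`);
    if (!chain.some((p) => p.kind === 'regex')) problems.push(`${language}: no regex parser`);
    for (const p of chain) {
      if (!VALID_KINDS.includes(p.kind)) problems.push(`${language}: invalid kind "${p.kind}"`);
    }
  }
  return problems;
}
```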

Validation:
- npx tsc --noEmit → 0 errors
- npx jest → 1631/1631 pass (6 new)
- npx eslint on touched files → clean

Refs #933.
@gfargo gfargo merged commit 16b6157 into main May 14, 2026
7 checks passed
@gfargo gfargo deleted the feat/933-tree-sitter-foundation-ts branch May 14, 2026 12:36
gfargo added a commit that referenced this pull request May 14, 2026
…1.1) (#955)

Plugs the first actual tree-sitter parser into the registry
introduced in phase 1.0 (#950). `ts` and `js` extraction now
prefer tree-sitter; the regex parser stays in the chain as the
fallback. No change to user-visible config — gated by the same
`languageAware.enabled` flag as before.

Packaging strategy (per the decision captured on #933):

- `web-tree-sitter` is added as a runtime dep (the engine; ~4.5 MB
  unpacked including its own .wasm engine).
- `tree-sitter-typescript` is added as a devDep — we only need
  the two .wasm grammar files (typescript + tsx), and we copy
  those into `dist/tree-sitter/` at build time. The package's
  native prebuilds (~18 MB) never ship to users.
- New `bin/copyTreeSitterWasm.mjs` runs as `postbuild`. Copies
  the engine + tsx + typescript wasms to `dist/tree-sitter/`
  (~2.9 MB total bundle cost).
- Rollup's ESM output (`dist/index.esm.mjs`) gains an `intro`
  shim that defines `__dirname` from `import.meta.url`. Source
  code uses `__dirname` for filesystem-relative paths in a
  format-agnostic way; CJS gets `__dirname` natively, ts-jest
  compiles to CJS so it does too, and tsx provides its own shim.
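
For illustration, the `__dirname` shim for the ESM output might look like the fragment below. `intro` is a standard Rollup output option; the exact lines the build injects and the alias names used here are assumptions.

```typescript
// Lines injected at the top of the ESM bundle so `__dirname` resolves there too.
// The import aliases are arbitrary; they only need to avoid clashing with
// names already present in the bundled code.
const esmDirnameIntro = [
  "import { fileURLToPath as __cocoFileURLToPath } from 'node:url';",
  "import { dirname as __cocoDirname } from 'node:path';",
  'const __dirname = __cocoDirname(__cocoFileURLToPath(import.meta.url));',
].join('\n');

// Fragment of a Rollup output entry for the ESM build.
const esmOutput = {
  file: 'dist/index.esm.mjs',
  format: 'es' as const,
  intro: esmDirnameIntro,
};
```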

Runtime (`src/lib/parsers/default/__tree_sitter__/runtime.ts`):

- One-time `Parser.init()` via a memoized promise. `locateFile`
  hook points the engine at the bundled WASM. Failure caches as
  "unavailable" so we don't retry on every diff.
- Per-language `Language.load` + `Parser` cache keyed on language
  id. First parse for a language pays ~15ms; subsequent parses
  reuse the cached parser and skip that load cost.
- Uses the codebase's standard `new Function('specifier', 'return
  import(specifier)')` shim to load `web-tree-sitter` (which is
  `"type": "module"`), bypassing ts-jest's CJS rewrite of
  `await import()` to `await require()`. Same idiom Ink and
  inquirer use.
- Path resolution: tries `<dist>/tree-sitter/` first (installed
  package layout), then `../../../../../dist/tree-sitter/` (dev
  / test layout running from `src/`). Returns undefined when
  neither contains the engine WASM — caller surrenders to the
  next parser in the registry chain.
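
The memoized init and per-language cache described above follow a common pattern, sketched here with injected loaders so it runs without `web-tree-sitter`. `createRuntime`, `Engine`, and `LoadedParser` are hypothetical names; caching a failed language load as unavailable is also an assumption (the commit only states this for the engine init).

```typescript
// Hypothetical stand-ins for the engine and a loaded grammar; the real
// runtime would wrap web-tree-sitter objects instead.
interface LoadedParser {
  languageId: string;
}
interface Engine {
  loadLanguage(id: string): Promise<LoadedParser>;
}

function createRuntime(loadEngine: () => Promise<Engine>) {
  // Memoized init: every caller shares one promise. A failed init resolves
  // to `undefined` ("unavailable"), so later diffs do not retry.
  let enginePromise: Promise<Engine | undefined> | undefined;
  const parsers = new Map<string, Promise<LoadedParser | undefined>>();

  function getEngine(): Promise<Engine | undefined> {
    if (!enginePromise) {
      enginePromise = loadEngine().catch(() => undefined);
    }
    return enginePromise;
  }

  // Per-language cache keyed on language id.
  function getParser(languageId: string): Promise<LoadedParser | undefined> {
    let cached = parsers.get(languageId);
    if (!cached) {
      cached = getEngine().then((engine) =>
        engine ? engine.loadLanguage(languageId).catch(() => undefined) : undefined,
      );
      parsers.set(languageId, cached);
    }
    return cached;
  }

  return { getParser };
}
```

A caller that receives `undefined` from `getParser` would return `undefined` itself, which is how the parser surrenders to the next entry in the registry chain.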

Extractor (`tsTreeSitterParser.ts`):

- Per-line AST extraction: parses each `+`/`-` line with
  tree-sitter, walks the root's named children, classifies
  recognized declarations into `StructuralSymbol` shapes
  matching the regex extractor's output. Feeds through the
  shared `summarizeStructuralDiff` scaffolding so the summary
  text is identical for cases both extractors handle.
- Catches what the regex misses:
  - Arrow-function exports (`export const handler = (req) =>
    process(req)`) — regex returns `const handler`; tree-sitter
    sees the arrow_function in the declarator and classifies
    the binding as a function (`handler()`).
  - String-embedded keywords — tree-sitter's AST awareness
    skips `function` / `class` tokens inside string literals.
  - TSX (`.tsx` / `.jsx`) — the tsx grammar parses JSX cleanly.
- Falls through to regex for:
  - Multi-line declarations (parsing only `+`/`-` lines means we
    don't see the full signature across lines yet — phase 1.2).
  - Anything tree-sitter classifies as ERROR.
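
A rough sketch of the classification step, assuming the node type names from the tree-sitter TypeScript grammar (`export_statement`, `lexical_declaration`, `variable_declarator`, `arrow_function`). The `AstNode` interface is a trimmed structural view of a syntax node, and the `StructuralSymbol` shape and `label` helper are assumptions rather than the codebase's definitions.

```typescript
// Minimal structural view of a syntax node; web-tree-sitter nodes expose
// members with these names and meanings.
interface AstNode {
  type: string;
  text: string;
  namedChildren: AstNode[];
  childForFieldName(name: string): AstNode | null;
}

// Assumed symbol shape; the real `StructuralSymbol` may carry more fields.
interface StructuralSymbol {
  kind: 'function' | 'class' | 'const';
  name: string;
}

function classify(node: AstNode): StructuralSymbol | undefined {
  // Unwrap `export ...` to reach the declaration it carries.
  if (node.type === 'export_statement') {
    const inner = node.childForFieldName('declaration');
    return inner ? classify(inner) : undefined;
  }
  if (node.type === 'function_declaration' || node.type === 'class_declaration') {
    const name = node.childForFieldName('name');
    if (!name) return undefined;
    return { kind: node.type === 'class_declaration' ? 'class' : 'function', name: name.text };
  }
  if (node.type === 'lexical_declaration') {
    const declarator = node.namedChildren.find((c) => c.type === 'variable_declarator');
    const name = declarator?.childForFieldName('name');
    if (!declarator || !name) return undefined;
    // An arrow function on the right-hand side makes the binding a function.
    const value = declarator.childForFieldName('value');
    return { kind: value?.type === 'arrow_function' ? 'function' : 'const', name: name.text };
  }
  // Unrecognized or ERROR nodes surrender to the regex parser.
  return undefined;
}

// Renders a symbol the way the examples above show it: `handler()` or `const handler`.
function label(symbol: StructuralSymbol): string {
  return symbol.kind === 'function' ? `${symbol.name}()` : `${symbol.kind} ${symbol.name}`;
}
```

Under these assumptions, the tree for `export const handler = (req) => process(req)` yields a function symbol labeled `handler()`, whereas a plain `export const limit = 5` stays `const limit`.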

Registry change: `ts` and `js` chains become
`[treeSitterTsParser, regexTs]`. When tree-sitter is unavailable
(no wasm, ESM-import fails, parser load errors), the parser
surrenders via `undefined` and the regex parser handles the
diff. Steady-state behavior is unchanged from phase 1.0 for
users who don't have the wasms loaded.

Tests:

- 8 integration tests for `tsTreeSitterParser`. Gated on
  `COCO_TEST_TREE_SITTER=1` + `NODE_OPTIONS=--experimental-vm-modules`
  because jest doesn't support ESM dynamic import by
  default. CI can opt in; default `npx jest` runs skip the suite
  and rely on the registry-level tests (already covering the
  regex fallback path) plus the eval-harness CLI (#934) for
  integration verification. Eval harness runs as a vanilla Node
  CLI with full ESM — the right tool for the integration check.
- Existing 1640 tests still pass with no changes. 7 skipped
  (the new tree-sitter suite) when the integration flag isn't
  set.
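
The gating described above could be expressed as a small predicate over the environment. The helper name is hypothetical; only the two environment variables come from this commit.

```typescript
type Env = Record<string, string | undefined>;

// True only when the opt-in flag is set and Node was started with VM modules.
function treeSitterSuiteEnabled(env: Env): boolean {
  const optedIn = env.COCO_TEST_TREE_SITTER === '1';
  const vmModules = (env.NODE_OPTIONS ?? '').includes('--experimental-vm-modules');
  return optedIn && vmModules;
}

// Assumed usage inside a jest test file:
//   const maybeDescribe = treeSitterSuiteEnabled(process.env) ? describe : describe.skip;
//   maybeDescribe('tsTreeSitterParser', () => { /* integration tests */ });
```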

Validation:

- npx tsc --noEmit → 0 errors
- npx jest → 1640/1647 pass, 7 skipped (default)
- NODE_OPTIONS=--experimental-vm-modules COCO_TEST_TREE_SITTER=1
  npx jest src/lib/parsers/default/__tree_sitter__ → 8/8 pass
- npx eslint on touched files → clean
- npm run eval:structural-extract: tree-sitter parser fires on
  the existing TS fixtures with parity to regex (the arrow-fn
  regression case is proven by the integration test). Adding a
  fixture that specifically targets the regex-miss case is a
  phase 1.2 polish item.
- Manual: `node bin/copyTreeSitterWasm.mjs` copies the 3 wasms
  to `dist/tree-sitter/` (2.91 MB total).

yarn.lock churn: large but legitimate. Cleans up stale transitive
deps from removed packages (`@anthropic-ai/sdk`,
`@browserbasehq/*` lineage) and adds tree-sitter's deps. The two
direct adds are `web-tree-sitter@^0.26.8` and
`tree-sitter-typescript@^0.23.2`.

Out of scope (future phases):

- Phase 2: bundle JS parser (separate grammar; `tree-sitter-javascript`).
- Phase 3: lazy-load infrastructure (~/.cache/coco/tree-sitter/,
  first-use prompt, manifest + integrity check).
- Phase 4–6: Python / Rust / Go via lazy-load.
- Phase 7: eval-harness baseline comparison + `coco cache`
  subcommand.

Cleanup follow-up: enable `--experimental-vm-modules` in
jest.config.ts so the tree-sitter integration tests run by
default. Today's gating keeps the contract clean while we
stabilize.

Refs #933.