refactor(parser): structural parser registry foundation (#933 phase 1.0) #950
Merged
First piece of the tree-sitter integration. Introduces the
abstraction layer that future phases plug into; no behavior change
to the existing regex extractors.
Why a registry instead of inline dispatch: the upgrade path
(regex → tree-sitter → tree-sitter-with-better-grammar) needs a
shape that supports "multiple parsers per language, tried in order,
with graceful fallback on error". Hard-coded dispatch would need
to grow `try { } catch { fallthrough }` branches in every case;
the registry makes that part of the iteration loop and keeps
per-language modules focused on "given this diff, produce a
summary".
Concretely:
- New `StructuralParser` interface — sync OR async; returns a
templated summary or undefined to surrender to the next parser
in the chain. Throwing also surrenders (failure observability
via logger hook lands in phase 1.1+).
- New `REGISTRY: Record<StructuralLanguageId, StructuralParser[]>`
— per-language chain in priority order. Today every entry is
a regex parser wrapping the existing per-language summarizer
(`summarizeTsStructuralDiff`, `summarizePythonStructuralDiff`,
etc.). Phase 1.1 will prepend the tree-sitter parser for `ts`
and `js` without touching call sites.
- New `dispatchStructuralParser(language, fileDiff)` walks the
chain and returns the first non-undefined summary. Errors
thrown by any parser are swallowed so the chain continues.
- `summarizeLargeFiles.ts` refactored to await the new dispatcher
instead of calling the per-language switch directly. The
language-detection helper stays where it is.
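The pieces listed above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the `kind` values, the `parse` method name, the string-typed `fileDiff`, the `regexParser` stand-in, and the `runChain` helper are all assumptions made for the example. Only the names `StructuralParser`, `StructuralLanguageId`, `REGISTRY`, and `dispatchStructuralParser` come from the description.

```typescript
// Hypothetical shapes; the real definitions in the PR may differ.
type StructuralLanguageId = "ts" | "js" | "python";

interface StructuralParser {
  kind: "regex" | "tree-sitter"; // assumed set of kind identifiers
  // Sync OR async. Returning undefined surrenders to the next parser.
  parse(fileDiff: string): string | undefined | Promise<string | undefined>;
}

// Stand-in for a regex parser wrapping a per-language summarizer
// (e.g. summarizeTsStructuralDiff). The regex here is illustrative only.
const regexParser = (label: string): StructuralParser => ({
  kind: "regex",
  parse: (fileDiff) =>
    /^[+-]\s*(export\s+)?(function|class|def)\s/m.test(fileDiff)
      ? `${label}: structural change`
      : undefined,
});

// Per-language chain in priority order. A later phase could prepend a
// tree-sitter parser to a chain without touching call sites.
const REGISTRY: Record<StructuralLanguageId, StructuralParser[]> = {
  ts: [regexParser("ts")],
  js: [regexParser("js")],
  python: [regexParser("python")],
};

// Walk one chain and return the first non-undefined summary.
async function runChain(
  chain: StructuralParser[],
  fileDiff: string,
): Promise<string | undefined> {
  for (const parser of chain) {
    try {
      const summary = await parser.parse(fileDiff);
      if (summary !== undefined) return summary;
    } catch {
      // Throwing also surrenders: swallow the error and try the next parser.
    }
  }
  return undefined;
}

async function dispatchStructuralParser(
  language: string,
  fileDiff: string,
): Promise<string | undefined> {
  const chains = REGISTRY as Record<string, StructuralParser[] | undefined>;
  // Unregistered languages get an empty chain, so the result is undefined.
  return runChain(chains[language] ?? [], fileDiff);
}
```

Because the `try`/`catch` lives in the loop, a per-language module only has to answer "given this diff, produce a summary"; the fallback behavior is the same for every language.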
Strategy / packaging decisions for #933 captured in the issue
comment: bundle TS/JS, lazy-load Python/Rust/Go from a cache at
`~/.cache/coco/tree-sitter/`, WASM (`web-tree-sitter`) over native
bindings. Phasing:
- Phase 1.0 (THIS PR): abstraction + registry, no tree-sitter yet
- Phase 1.1: bundle web-tree-sitter + TS parser .wasm,
implement the tree-sitter side
- Phase 2: bundle JS parser
- Phase 3: lazy-load infra (cache, download, prompt)
- Phase 4–6: Python / Rust / Go via lazy-load
- Phase 7: polish + eval-harness comparison vs. regex
Tests (6 new):
- Registry shape — regex parser registered for every supported
language; every chain non-empty; only valid kind identifiers
- Dispatcher — returns regex summary for TS + Python diffs with
structural signal; undefined for body-only diffs; undefined for
unregistered languages
Validation:
- npx tsc --noEmit → 0 errors
- npx jest → 1631/1631 pass (6 new)
- npx eslint on touched files → clean
Refs #933.
gfargo added a commit that referenced this pull request on May 14, 2026
…1.1) (#955)
Plugs the first actual tree-sitter parser into the registry introduced in phase 1.0 (#950). `ts` and `js` extraction now prefer tree-sitter; the regex parser stays in the chain as the fallback. No change to user-visible config: it is gated by the same `languageAware.enabled` flag as before.
Packaging strategy (per the decision captured on #933):
- `web-tree-sitter` is added as a runtime dep (the engine; ~4.5 MB unpacked including its own .wasm engine).
- `tree-sitter-typescript` is added as a devDep. We only need the two .wasm grammar files (typescript + tsx), and we copy those into `dist/tree-sitter/` at build time. The package's native prebuilds (~18 MB) never ship to users.
- New `bin/copyTreeSitterWasm.mjs` runs as `postbuild`. Copies the engine + tsx + typescript wasms to `dist/tree-sitter/` (~2.9 MB total bundle cost).
- Rollup's ESM output (`dist/index.esm.mjs`) gains an `intro` shim that defines `__dirname` from `import.meta.url`. Source code uses `__dirname` for filesystem-relative paths in a format-agnostic way; CJS gets `__dirname` natively, ts-jest compiles to CJS so it does too, and tsx provides its own shim.
Runtime (`src/lib/parsers/default/__tree_sitter__/runtime.ts`):
- One-time `Parser.init()` via a memoized promise. `locateFile` hook points the engine at the bundled WASM. Failure caches as "unavailable" so we don't retry on every diff.
- Per-language `Language.load` + `Parser` cache keyed on language id. First parse for a language pays ~15ms; subsequent parses are free.
- Uses the codebase's standard `new Function('specifier', 'return import(specifier)')` shim to load `web-tree-sitter` (which is `"type": "module"`), bypassing ts-jest's CJS rewrite of `await import()` to `await require()`. Same idiom Ink and inquirer use.
- Path resolution: tries `<dist>/tree-sitter/` first (installed package layout), then `../../../../../dist/tree-sitter/` (dev / test layout running from `src/`). Returns undefined when neither contains the engine WASM, and the caller surrenders to the next parser in the registry chain.
Extractor (`tsTreeSitterParser.ts`):
- Per-line AST extraction: parses each `+`/`-` line with tree-sitter, walks the root's named children, classifies recognized declarations into `StructuralSymbol` shapes matching the regex extractor's output. Feeds through the shared `summarizeStructuralDiff` scaffolding so the summary text is identical for cases both extractors handle.
- Catches what the regex misses:
  - Arrow-function exports (`export const handler = (req) => process(req)`): regex returns `const handler`; tree-sitter sees the arrow_function in the declarator and classifies the binding as a function (`handler()`).
  - String-embedded keywords: tree-sitter's AST awareness skips `function` / `class` tokens inside string literals.
  - TSX (`.tsx` / `.jsx`): the tsx grammar parses JSX cleanly.
- Falls through to regex for:
  - Multi-line declarations (parsing only `+`/`-` lines means we don't see the full signature across lines yet; phase 1.2).
  - Anything tree-sitter classifies as ERROR.
Registry change: `ts` and `js` chains become `[treeSitterTsParser, regexTs]`. When tree-sitter is unavailable (no wasm, ESM-import fails, parser load errors), the parser surrenders via `undefined` and the regex parser handles the diff. Steady-state behavior is unchanged from phase 1.0 for users who don't have the wasms loaded.
Tests:
- 8 integration tests for `tsTreeSitterParser`. Gated on `COCO_TEST_TREE_SITTER=1` + `NODE_OPTIONS=--experimental-vm-modules` because jest doesn't support ESM dynamic import by default. CI can opt in; default `npx jest` runs skip the suite and rely on the registry-level tests (already covering the regex fallback path) plus the eval-harness CLI (#934) for integration verification. The eval harness runs as a vanilla Node CLI with full ESM, which makes it the right tool for the integration check.
- Existing 1640 tests still pass with no changes. 7 skipped (the new tree-sitter suite) when the integration flag isn't set.
Validation:
- npx tsc --noEmit → 0 errors
- npx jest → 1640/1647 pass, 7 skipped (default)
- NODE_OPTIONS=--experimental-vm-modules COCO_TEST_TREE_SITTER=1 npx jest src/lib/parsers/default/__tree_sitter__ → 8/8 pass
- npx eslint on touched files → clean
- npm run eval:structural-extract: tree-sitter parser fires on the existing TS fixtures with parity to regex (the arrow-fn regression case is proven by the integration test). Adding a fixture that specifically targets the regex-miss case is a phase 1.2 polish item.
- Manual: `node bin/copyTreeSitterWasm.mjs` copies the 3 wasms to `dist/tree-sitter/` (2.91 MB total).
yarn.lock churn: large but legitimate. Cleans up stale transitive deps from removed packages (`@anthropic-ai/sdk`, `@browserbasehq/*` lineage) and adds tree-sitter's deps. The two direct adds are `web-tree-sitter@^0.26.8` and `tree-sitter-typescript@^0.23.2`.
Out of scope (future phases):
- Phase 2: bundle JS parser (separate grammar; `tree-sitter-javascript`).
- Phase 3: lazy-load infrastructure (~/.cache/coco/tree-sitter/, first-use prompt, manifest + integrity check).
- Phase 4–6: Python / Rust / Go via lazy-load.
- Phase 7: eval-harness baseline comparison + `coco cache` subcommand.
Cleanup follow-up: enable `--experimental-vm-modules` in jest.config.ts so the tree-sitter integration tests run by default. Today's gating keeps the contract clean while we stabilize.
Refs #933.
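The runtime behavior described in the referenced commit (one-time init through a memoized promise, with a failure cached as "unavailable" so later diffs do not retry) might look something like the sketch below. It deliberately avoids calling `web-tree-sitter` itself: `Engine`, `EngineLoader`, `createRuntime`, and `treeSitterSummary` are hypothetical names, and the loader is a stand-in for whatever the real `runtime.ts` does to initialize the engine and load a grammar.

```typescript
// Hypothetical engine shape; stands in for an initialized parser.
interface Engine {
  summarize(fileDiff: string): string | undefined;
}

// Hypothetical loader: in the real runtime this would initialize the
// WASM engine and load a grammar. Here it is injected so the sketch runs alone.
type EngineLoader = () => Promise<Engine>;

function createRuntime(load: EngineLoader) {
  // Memoized promise: set on first use, reused afterwards.
  let ready: Promise<Engine | undefined> | undefined;
  return {
    get(): Promise<Engine | undefined> {
      if (!ready) {
        // A rejected (or synchronously throwing) load is turned into
        // undefined and cached, so a failed init is never retried.
        ready = Promise.resolve()
          .then(load)
          .catch(() => undefined);
      }
      return ready;
    },
  };
}

// A parser built on the runtime surrenders via undefined when the engine
// is unavailable, which lets the next parser in the chain handle the diff.
async function treeSitterSummary(
  runtime: ReturnType<typeof createRuntime>,
  fileDiff: string,
): Promise<string | undefined> {
  const engine = await runtime.get();
  if (!engine) return undefined;
  return engine.summarize(fileDiff);
}
```

Caching the failure as a resolved `undefined` rather than a rejected promise keeps the caller free of `try`/`catch`: an unavailable engine and a parser that has nothing to say look the same to the registry chain.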
Summary
First piece of the tree-sitter integration (#933). Pure abstraction — introduces the registry that future phases plug tree-sitter parsers into. No behavior change to the existing regex extractors.
What's here
Why a registry vs. inline dispatch
The upgrade path (regex → tree-sitter → tree-sitter-with-better-grammar) needs "multiple parsers per language, tried in order, with graceful fallback on error". Hard-coded dispatch would need `try { } catch { fallthrough }` branches in every case; the registry makes that part of the iteration loop.
Strategy for #933 (captured in the issue)
Bundle TS/JS, lazy-load Python/Rust/Go from `~/.cache/coco/tree-sitter/`, WASM (`web-tree-sitter`) over native bindings.
Phasing:
Test plan
Refs #933.