feat(parser): bundled tree-sitter TS/TSX parser via WASM (#933 phase 1.1) #955

Merged: gfargo merged 1 commit into main from feat/933-tree-sitter-ts-parser on May 14, 2026

@gfargo gfargo commented May 14, 2026

Summary

First actual tree-sitter parser plugged into the registry from #950. ts / js extraction now prefers tree-sitter; the regex parser stays in the chain as the fallback. No change to user-visible config — gated by the same languageAware.enabled flag.

What lands

  • Bundle: web-tree-sitter (runtime), tree-sitter-typescript (devDep, only the .wasm files ship).
  • Postbuild script copies the engine + tsx + ts wasms to dist/tree-sitter/ (~2.9 MB total).
  • Rollup ESM intro defines __dirname from import.meta.url so source code can use __dirname in both CJS and ESM outputs (per discussion about future-proofness — see commit body).
  • Runtime wrapper with memoized engine init, per-language parser cache, and graceful surrender when wasms aren't available.
  • TS/TSX extractor — per-line AST extraction matching the regex output shape. Catches arrow-function exports, string-embedded keywords, and JSX-in-TSX (which the regex either misses or could false-positive on).
  • Registry: ts/js chains become [treeSitterTsParser, regexTs]. Steady-state behavior unchanged for users without wasms loaded.

What lights up vs. regex

  • export const handler = (req) => process(req) now reads as handler() (function), not const handler.
  • TSX with JSX bodies parses cleanly.
  • Tokens inside string literals are correctly ignored.

Tests

  • 8 integration tests for the TS extractor, gated on COCO_TEST_TREE_SITTER=1 and NODE_OPTIONS=--experimental-vm-modules. Run them locally with both variables set; CI can opt in. Default npx jest skips them and relies on the registry-level tests (regex fallback path) plus the eval-harness CLI (#934) for integration verification. The eval harness runs as vanilla Node with full ESM, which is the right tool for this check.
  • Existing 1640 tests still pass with no changes.

Decisions explained in commit body

  • Why WASM and not native bindings
  • Why __dirname + rollup intro vs. import.meta.url everywhere
  • Why the Function-constructor dynamic-import shim
  • Why the integration suite is opt-in (jest ESM limitations)
  • Why yarn.lock has a big diff (legit: cleans up stale transitive deps that have been out of sync, in addition to the 2 direct adds)

Out of scope (future phases)

  • Phase 2 — JS parser (tree-sitter-javascript)
  • Phase 3 — lazy-load infra (cache dir, first-use prompt, manifest)
  • Phase 4–6 — Python / Rust / Go via lazy-load
  • Phase 7 — eval-harness baseline comparison + coco cache subcommand

Test plan

  • npx tsc --noEmit → 0 errors
  • npx jest → 1640/1647 pass, 7 skipped (default mode)
  • NODE_OPTIONS=--experimental-vm-modules COCO_TEST_TREE_SITTER=1 npx jest src/lib/parsers/default/__tree_sitter__ → 8/8 pass
  • npx eslint on touched files → clean
  • node bin/copyTreeSitterWasm.mjs → copies 3 wasms (2.91 MB)
  • Manual: full build (npm run build) → confirms postbuild copies wasms to dist
  • Manual: in a TS project, run with languageAware.enabled: true and confirm tree-sitter fires (verbose log shows "language-aware fast-path skip")

Refs #933.

…1.1)

Plugs the first actual tree-sitter parser into the registry
introduced in phase 1.0 (#950). `ts` and `js` extraction now
prefer tree-sitter; the regex parser stays in the chain as the
fallback. No change to user-visible config — gated by the same
`languageAware.enabled` flag as before.

Packaging strategy (per the decision captured on #933):

- `web-tree-sitter` is added as a runtime dep (the engine; ~4.5 MB
  unpacked including its own .wasm engine).
- `tree-sitter-typescript` is added as a devDep — we only need
  the two .wasm grammar files (typescript + tsx), and we copy
  those into `dist/tree-sitter/` at build time. The package's
  native prebuilds (~18 MB) never ship to users.
- New `bin/copyTreeSitterWasm.mjs` runs as `postbuild`. Copies
  the engine + tsx + typescript wasms to `dist/tree-sitter/`
  (~2.9 MB total bundle cost).
- Rollup's ESM output (`dist/index.esm.mjs`) gains an `intro`
  shim that defines `__dirname` from `import.meta.url`. Source
  code uses `__dirname` for filesystem-relative paths in a
  format-agnostic way; CJS gets `__dirname` natively, ts-jest
  compiles to CJS so it does too, and tsx provides its own shim.
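
A minimal sketch of what such an `intro` shim could look like. The real
rollup config is not shown in this PR, so the CJS output file name and
the prefixed helper names below are assumptions; only `output.intro`
and `dist/index.esm.mjs` come from the text above.

```typescript
// Hypothetical shim text: derives __dirname from import.meta.url.
// The `__coco*` aliases are illustrative and chosen to avoid name clashes
// with identifiers inside the bundle.
const esmDirnameIntro = [
  "import { fileURLToPath as __cocoFileURLToPath } from 'node:url';",
  "import { dirname as __cocoDirname } from 'node:path';",
  'const __dirname = __cocoDirname(__cocoFileURLToPath(import.meta.url));',
].join('\n');

// Only the ESM output carries the shim; CommonJS provides __dirname natively.
// 'dist/index.js' is an assumed name for the CJS bundle.
const outputs = [
  { file: 'dist/index.js', format: 'cjs' as const },
  { file: 'dist/index.esm.mjs', format: 'es' as const, intro: esmDirnameIntro },
];
```

An array like `outputs` would be passed as the `output` field of the
rollup config.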

Runtime (`src/lib/parsers/default/__tree_sitter__/runtime.ts`):

- One-time `Parser.init()` via a memoized promise. `locateFile`
  hook points the engine at the bundled WASM. Failure caches as
  "unavailable" so we don't retry on every diff.
- Per-language `Language.load` + `Parser` cache keyed on language
  id. First parse for a language pays ~15ms; subsequent parses
  are free.
- Uses the codebase's standard `new Function('specifier', 'return
  import(specifier)')` shim to load `web-tree-sitter` (which is
  `"type": "module"`), bypassing ts-jest's CJS rewrite of
  `await import()` to `await require()`. Same idiom Ink and
  inquirer use.
- Path resolution: tries `<dist>/tree-sitter/` first (installed
  package layout), then `../../../../../dist/tree-sitter/` (dev
  / test layout running from `src/`). Returns undefined when
  neither contains the engine WASM — caller surrenders to the
  next parser in the registry chain.
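
The memoized init and per-language cache described above can be
sketched roughly as follows. This is an illustration under assumptions,
not the code in `runtime.ts`: `EngineLike`, `createRuntime` and the
injected loaders are hypothetical stand-ins so the pattern can be shown
without importing `web-tree-sitter`.

```typescript
// Hypothetical stand-ins for the engine and a loaded grammar.
interface EngineLike {
  loadLanguage(wasmPath: string): Promise<unknown>;
}
interface ParserHandle {
  languageId: string;
  grammar: unknown;
}

function createRuntime(
  loadEngine: () => Promise<EngineLike>,
  wasmPathFor: (languageId: string) => string,
) {
  let enginePromise: Promise<EngineLike | undefined> | undefined;
  const parsers = new Map<string, Promise<ParserHandle | undefined>>();

  // One-time init shared by all callers. A failure resolves to undefined
  // and stays cached, so later diffs do not retry the load.
  function ensureEngine(): Promise<EngineLike | undefined> {
    if (!enginePromise) {
      enginePromise = loadEngine().catch(() => undefined);
    }
    return enginePromise;
  }

  // Per-language cache keyed on language id; the first call pays the load.
  function getParser(languageId: string): Promise<ParserHandle | undefined> {
    let cached = parsers.get(languageId);
    if (!cached) {
      cached = (async () => {
        const engine = await ensureEngine();
        if (!engine) return undefined; // surrender to the next parser
        try {
          const grammar = await engine.loadLanguage(wasmPathFor(languageId));
          return { languageId, grammar };
        } catch {
          return undefined; // grammar failed to load: surrender as well
        }
      })();
      parsers.set(languageId, cached);
    }
    return cached;
  }

  return { getParser };
}
```

Caching the promise rather than the resolved value means concurrent
first calls share a single load instead of racing.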

Extractor (`tsTreeSitterParser.ts`):

- Per-line AST extraction: parses each `+`/`-` line with
  tree-sitter, walks the root's named children, classifies
  recognized declarations into `StructuralSymbol` shapes
  matching the regex extractor's output. Feeds through the
  shared `summarizeStructuralDiff` scaffolding so the summary
  text is identical for cases both extractors handle.
- Catches what the regex misses:
  - Arrow-function exports (`export const handler = (req) =>
    process(req)`) — regex returns `const handler`; tree-sitter
    sees the arrow_function in the declarator and classifies
    the binding as a function (`handler()`).
  - String-embedded keywords — tree-sitter's AST awareness
    skips `function` / `class` tokens inside string literals.
  - TSX (`.tsx` / `.jsx`) — the tsx grammar parses JSX cleanly.
- Falls through to regex for:
  - Multi-line declarations (parsing only `+`/`-` lines means we
    don't see the full signature across lines yet — phase 1.2).
  - Anything tree-sitter classifies as ERROR.
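
To illustrate the arrow-function case, here is a rough sketch of the
classification step. It is not the code in `tsTreeSitterParser.ts`:
`SyntaxNodeLike`, `SymbolSketch` and `classifyTopLevel` are
hypothetical, and the interface is only a structural subset of what a
tree-sitter node exposes. The node type names are the ones the
tree-sitter TypeScript grammar is understood to use.

```typescript
// Hypothetical minimal view of a syntax node.
interface SyntaxNodeLike {
  type: string;
  text: string;
  namedChildren: SyntaxNodeLike[];
  childForFieldName(name: string): SyntaxNodeLike | null;
}

interface SymbolSketch {
  kind: 'function' | 'class' | 'const';
  label: string;
}

function classifyTopLevel(node: SyntaxNodeLike): SymbolSketch | undefined {
  // Unwrap `export ...` to the declaration it carries.
  const decl =
    node.type === 'export_statement' ? node.childForFieldName('declaration') : node;
  if (!decl) return undefined;

  const name = decl.childForFieldName('name')?.text;
  if (decl.type === 'function_declaration' && name) {
    return { kind: 'function', label: `${name}()` };
  }
  if (decl.type === 'class_declaration' && name) {
    return { kind: 'class', label: `class ${name}` };
  }
  if (decl.type === 'lexical_declaration') {
    const declarator = decl.namedChildren.find((c) => c.type === 'variable_declarator');
    const bound = declarator?.childForFieldName('name')?.text;
    if (!declarator || !bound) return undefined;
    // The binding counts as a function when its value is an arrow function.
    const value = declarator.childForFieldName('value');
    return value?.type === 'arrow_function'
      ? { kind: 'function', label: `${bound}()` }
      : { kind: 'const', label: `const ${bound}` };
  }
  return undefined; // unrecognized or ERROR node: let the regex parser try
}
```

A line-based regex only sees the `const` keyword, whereas the declarator's
`value` field is what lets the binding be reported as `handler()`.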

Registry change: `ts` and `js` chains become
`[treeSitterTsParser, regexTs]`. When tree-sitter is unavailable
(no wasm, ESM-import fails, parser load errors), the parser
surrenders via `undefined` and the regex parser handles the
diff. Steady-state behavior is unchanged from phase 1.0 for
users who don't have the wasms loaded.
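
The surrender-via-`undefined` contract can be sketched like this. The
types, `runParserChain` and the two stub parsers are hypothetical and
stand in for the real registry, whose internals are not shown here.

```typescript
interface StructuralSummary {
  source: string;
  text: string;
}
type StructuralParser = (diff: string) => Promise<StructuralSummary | undefined>;

// Try each parser in order; the first one that does not surrender wins.
async function runParserChain(
  chain: StructuralParser[],
  diff: string,
): Promise<StructuralSummary | undefined> {
  for (const parser of chain) {
    const result = await parser(diff);
    if (result !== undefined) return result;
  }
  return undefined; // every parser surrendered
}

// Stub standing in for the tree-sitter parser: surrenders when the wasm
// files are not available.
function makeTreeSitterStub(wasmAvailable: boolean): StructuralParser {
  return async (diff) =>
    wasmAvailable ? { source: 'tree-sitter', text: `parsed ${diff.length} chars` } : undefined;
}

// Stub standing in for the regex parser.
const regexStub: StructuralParser = async (diff) =>
  /\b(function|class|const)\b/.test(diff)
    ? { source: 'regex', text: 'matched a declaration keyword' }
    : undefined;
```

With this shape, `runParserChain([makeTreeSitterStub(false), regexStub], diff)`
behaves exactly like a regex-only chain, which is the steady-state
guarantee described above.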

Tests:

- 8 integration tests for `tsTreeSitterParser`. Gated on
  `COCO_TEST_TREE_SITTER=1` +
  `NODE_OPTIONS=--experimental-vm-modules` because jest
  doesn't support ESM dynamic import by
  default. CI can opt in; default `npx jest` runs skip the suite
  and rely on the registry-level tests (already covering the
  regex fallback path) plus the eval-harness CLI (#934) for
  integration verification. Eval harness runs as a vanilla Node
  CLI with full ESM — the right tool for the integration check.
- Existing 1640 tests still pass with no changes. 7 skipped
  (the new tree-sitter suite) when the integration flag isn't
  set.

Validation:

- npx tsc --noEmit → 0 errors
- npx jest → 1640/1647 pass, 7 skipped (default)
- NODE_OPTIONS=--experimental-vm-modules COCO_TEST_TREE_SITTER=1
  npx jest src/lib/parsers/default/__tree_sitter__ → 8/8 pass
- npx eslint on touched files → clean
- npm run eval:structural-extract: tree-sitter parser fires on
  the existing TS fixtures with parity to regex (the arrow-fn
  regression case is proven by the integration test). Adding a
  fixture that specifically targets the regex-miss case is a
  phase 1.2 polish item.
- Manual: `node bin/copyTreeSitterWasm.mjs` copies the 3 wasms
  to `dist/tree-sitter/` (2.91 MB total).

yarn.lock churn: large but legitimate. Cleans up stale transitive
deps from removed packages (`@anthropic-ai/sdk`,
`@browserbasehq/*` lineage) and adds tree-sitter's deps. The two
direct adds are `web-tree-sitter@^0.26.8` and
`tree-sitter-typescript@^0.23.2`.

Out of scope (future phases):

- Phase 2: bundle JS parser (separate grammar; `tree-sitter-javascript`).
- Phase 3: lazy-load infrastructure (~/.cache/coco/tree-sitter/,
  first-use prompt, manifest + integrity check).
- Phase 4–6: Python / Rust / Go via lazy-load.
- Phase 7: eval-harness baseline comparison + `coco cache`
  subcommand.

Cleanup follow-up: enable `--experimental-vm-modules` in
jest.config.ts so the tree-sitter integration tests run by
default. Today's gating keeps the contract clean while we
stabilize.

Refs #933.
@gfargo gfargo merged commit 050fc19 into main May 14, 2026
4 of 7 checks passed
@gfargo gfargo deleted the feat/933-tree-sitter-ts-parser branch May 14, 2026 18:13
gfargo added a commit that referenced this pull request May 14, 2026
Closes out the loose ends from #955 (phase 1.1) before moving on
to the lazy-load infrastructure in phase 3.

1. **Integration tests run by default.**
   Added `cross-env NODE_OPTIONS=--experimental-vm-modules` to the
   `test:jest` script so jest can load the ESM-only
   `web-tree-sitter` package via dynamic import. Removed the
   `COCO_TEST_TREE_SITTER=1` opt-in gate from the test file — the
   .wasm-available check stays as the defensive fence.

   `pretest:jest` now also runs `node bin/copyTreeSitterWasm.mjs`
   to populate `dist/tree-sitter/` before jest starts, so a clean
   checkout's first `npm run test:jest` runs the integration suite
   instead of skipping it.

   Added `cross-env` as a devDep (small, standard solution for
   cross-platform env-var syntax — CI is Linux but contributors may
   be on Windows).

2. **Arrow-function eval fixture.**
   New `ts-arrow-fn-export` fixture in `__evals__/fixtures.ts`
   targeting the regex extractor's known weak spot:
   `export const handler = (...) => ...`. The regex parser
   classifies the symbol as `const handler`; the tree-sitter
   parser inspects the declarator value, recognizes the
   `arrow_function`, and classifies it as a function — surfacing
   the same diff as `handler()` instead.

   Both parsers fire the fast path on this fixture (so the
   LLM-call-saved metric is identical in the harness's mock-mode
   counter), but the per-fixture JSON now shows the qualitative
   difference between the two extractors. Run the eval with the
   wasms present vs. removed (`rm -rf dist/tree-sitter/`) to see
   the diff.

3. **Stability fix: retry-on-failure for the runtime init.**
   Discovered while validating Phase 2: when jest's full suite
   runs with ESM dynamic imports enabled, a transient init failure
   (e.g. a sibling test file's environment tearing down
   mid-import) was permanently caching `undefined` in the runtime's
   `initPromise`. Subsequent calls to `getTreeSitterParser` from
   any test file inherited the cached failure.

   `ensureRuntime` now awaits the cached promise and re-tries if
   the result is undefined, while still reusing a successful
   init across the whole process. Net cost: at most one extra
   init attempt per real failure. Removed the
   `beforeEach _resetTreeSitterRuntimeForTesting` call in the
   integration suite — the runtime is process-lifetime by design,
   and resetting it between tests created surface area for the
   exact teardown race the retry path now handles.
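
   A rough sketch of the retry-on-failure behavior, assuming a generic
   `init` callback. `createEnsureRuntime` is a hypothetical factory used
   here so the pattern is testable in isolation; the real `ensureRuntime`
   may be structured differently.

```typescript
function createEnsureRuntime<T>(init: () => Promise<T | undefined>) {
  let initPromise: Promise<T | undefined> | undefined;

  return async function ensureRuntime(): Promise<T | undefined> {
    if (initPromise) {
      const cached = await initPromise;
      if (cached !== undefined) return cached; // reuse a successful init
      // Cached failure: fall through and try again instead of
      // returning the stale undefined forever.
    }
    // Thrown errors are folded into undefined so callers can surrender.
    const attempt = init().catch(() => undefined);
    initPromise = attempt;
    return attempt;
  };
}
```

   In this simplified version, callers that observe the same failed
   attempt concurrently may each start their own retry; a stricter
   variant would deduplicate those.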

Validation:
- npx tsc --noEmit → 0 errors
- `npm run test:jest` → 1647/1647 pass (3 consecutive runs, all
  clean — previously 4-6 tree-sitter integration tests would flake)
- `npx eslint` on touched files → clean
- `npm run eval:structural-extract -- --fixtures-only` → 8
  fixtures, 6 LLM calls saved, including the new arrow-fn case.

Out of scope (queued for phase 3):
- Lazy-load infra for non-bundled languages (cache dir,
  first-use prompt, manifest, integrity check).
- Surface the qualitative regex-vs-tree-sitter delta in the eval
  output (today you have to diff JSONs by hand). Issue #934
  polish.
- Telemetry on tree-sitter init failures (today they're silently
  caught; would help diagnose user-environment issues like
  corrupted wasms).

Refs #933.
gfargo added a commit that referenced this pull request May 14, 2026
…phase 7) (#959)

Closes the tree-sitter integration feature (#933). Lazy-loaded
parsers gain a first-class CLI surface for cache management;
verbose mode surfaces a discoverability hint when the fast path
falls through to regex.

## New `coco cache` subcommands

Extends the existing `coco cache` (diff-summary info / clear)
with three tree-sitter subcommands:

- `coco cache parsers` — show every manifest language with its
  current cache status (cached size or "not cached" + the
  fetched-size estimate), version pin, and source URL. Footer
  summarizes total disk usage + quick-reference commands.
- `coco cache prefetch [languages...]` — download specific
  parsers (e.g. `coco cache prefetch py rs go` or
  `coco cache prefetch all`). When invoked with no args AND
  stdin is a TTY, opens an interactive checkbox picker. In
  non-interactive contexts (CI, pipes), no-arg invocations error
  out with usage hints instead of hanging on a prompt.
- `coco cache clear-parsers` — wipe `~/.cache/coco/tree-sitter/`.
  Idempotent; reports a per-language ✓ for each removed file.

Aliases mirror `COCO_PREFETCH` env grammar: `py` / `python`,
`rs` / `rust`, `go` / `golang`, `all`.
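
A minimal sketch of how the alias grammar could be resolved. The
function name `resolvePrefetchTokens` and its return shape are
hypothetical (the commit reuses `parsePrefetchEnv`, whose signature is
not shown); only the alias pairs come from the text above.

```typescript
type LazyLanguage = 'python' | 'rust' | 'go';

const ALL_LANGUAGES: LazyLanguage[] = ['python', 'rust', 'go'];

// A Map avoids accidental hits on inherited object keys such as "constructor".
const ALIASES = new Map<string, LazyLanguage>([
  ['py', 'python'],
  ['python', 'python'],
  ['rs', 'rust'],
  ['rust', 'rust'],
  ['go', 'go'],
  ['golang', 'go'],
]);

function resolvePrefetchTokens(tokens: string[]): {
  languages: LazyLanguage[];
  unknown: string[];
} {
  const languages = new Set<LazyLanguage>();
  const unknown: string[] = [];
  for (const raw of tokens) {
    const token = raw.trim().toLowerCase();
    if (token === '') continue;
    if (token === 'all') {
      ALL_LANGUAGES.forEach((language) => languages.add(language));
      continue;
    }
    const hit = ALIASES.get(token);
    if (hit) languages.add(hit);
    else unknown.push(raw); // reported back so the CLI can warn
  }
  return { languages: Array.from(languages), unknown };
}
```

Returning the unknown tokens separately lets the caller warn about them
while still prefetching the languages it did recognize.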

## Surrender telemetry

In verbose mode, when the language-aware fast path is enabled and
the parser chain falls through to LLM, emit a discoverability
hint:

  `Tree-sitter parser surrendered for 'python'; using regex
   fallback. Hint: `coco cache parsers` to inspect,
   `coco cache prefetch python` to enable.`

Quiet on the default path; visible only when the user is
debugging summary quality. Hint copy adapts: bundled-language
surrenders (`ts` / `js`) point at `coco cache prefetch all`
because TS / TSX wasms are always shipped (the surrender is from
a parser-init failure, not a missing download); lazy-loaded
languages get a per-language prefetch hint.

## Implementation

### `cache.ts` (lazy-load cache module)

- New `getCachedParserStatus(language)` returns
  `{ language, cached, path, bytes?, mtime? }` for the table
  renderer + interactive picker.
- New `clearCachedParser(language)` unlinks the cached .wasm.
  Idempotent; returns `true` when a file was actually removed.
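
A sketch of the two helpers under stated assumptions: the functions
below take the cache directory as an explicit argument and assume a
`tree-sitter-<language>.wasm` file name, neither of which is confirmed
by this commit. The `sketch*` names are hypothetical; only the status
shape comes from the description above.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

interface CachedParserStatus {
  language: string;
  cached: boolean;
  path: string;
  bytes?: number;
  mtime?: Date;
}

function sketchParserStatus(cacheDir: string, language: string): CachedParserStatus {
  // Assumed file naming; the real layout may differ.
  const wasmPath = path.join(cacheDir, `tree-sitter-${language}.wasm`);
  try {
    const stat = fs.statSync(wasmPath);
    return { language, cached: true, path: wasmPath, bytes: stat.size, mtime: stat.mtime };
  } catch {
    return { language, cached: false, path: wasmPath };
  }
}

// Idempotent: returns true only when a file was actually removed.
function sketchClearParser(cacheDir: string, language: string): boolean {
  const status = sketchParserStatus(cacheDir, language);
  if (!status.cached) return false;
  fs.rmSync(status.path, { force: true });
  return true;
}

// Usage against a throwaway directory standing in for the real cache dir.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'coco-parsers-'));
fs.writeFileSync(path.join(demoDir, 'tree-sitter-python.wasm'), 'stub');
const statusBefore = sketchParserStatus(demoDir, 'python');
const removedFirst = sketchClearParser(demoDir, 'python');
const removedSecond = sketchClearParser(demoDir, 'python');
```

The second `sketchClearParser` call is a no-op, matching the
idempotent behavior described for `clearCachedParser`.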

### `structuralParserRegistry.ts`

- New `hasTreeSitterParser(language)` lets the LLM fallthrough
  path know whether a tree-sitter parser is registered for the
  language — used by the surrender-telemetry hint. Doesn't
  expose internals; the caller just needs the boolean.

### `summarizeLargeFiles.ts`

- Surrender-telemetry block fires after the registry returns
  undefined and BEFORE the cache lookup. Only emits when the
  chain includes a tree-sitter parser, so regex-only languages
  don't get a misleading hint.

### `commands/cache/`

- `config.ts` gains the `CACHE_SUBCOMMANDS` enum and a positional
  `[languages..]` for prefetch. Yargs validates the subcommand
  set; unknown tokens get caught by the language resolver.
- `handler.ts` adds three new branches:
  - `parsers` calls `renderParsersTable`
  - `prefetch` resolves tokens via `parsePrefetchEnv`
    (reusing the env-var grammar), prompts when interactive,
    and delegates to `prefetchTreeSitterParsers`. Failed
    downloads → `process.exitCode = 1`.
  - `clear-parsers` walks every manifest entry, calls
    `clearCachedParser`, reports per-language status.

### `inquirerPrompts.ts`

- New `checkboxPrompt` helper. Same dynamic-import shim as the
  other prompts; reuses the codebase's standard pattern for ESM
  inquirer modules under ts-jest.

## Tests

4 new test cases in `handler.test.ts` cover the new subcommands:
`parsers` lists every manifest language, `prefetch` warns on
unknown tokens, `clear-parsers` reports no-op when empty AND
removes cached files when present.

Test isolation: each test sets `COCO_CACHE_DIR` to the same tmp
dir the existing tests use for `XDG_CACHE_HOME`, so the
tree-sitter cache lives inside the per-test sandbox.

## Manual validation

```
$ COCO_CACHE_DIR=/tmp/coco-phase7-smoke coco cache parsers
Tree-sitter parser cache

  Python   not cached          (448.0 KB when fetched)
  Rust     not cached          (1.05 MB when fetched)
  Go       not cached          (212.1 KB when fetched)

  cached: 0/3  total on disk: 0 B

$ coco cache prefetch py
· Python: downloading https://cdn.jsdelivr.net/.../tree-sitter-python.wasm…
✓ Python parser cached (447 KB)

Summary: 1 downloaded · 0 already cached · 0 failed

$ coco cache clear-parsers
✓ cleared Python

Cleared 1 parser(s) from ~/.cache/coco/tree-sitter/
```

## Validation

- `npx tsc --noEmit` → 0 errors
- `npm run test:jest` → 1674/1674 pass (3 of 4 consecutive runs
  clean, 1 flake on the pre-existing scenarioInputs timeout
  pattern)
- `npx eslint` on touched files → clean
- Manual: all four subcommands round-trip cleanly

## Out of scope (genuine future work)

- **Eval-harness side-by-side regex-vs-tree-sitter comparison
  in the report output**. Today the eval reports per-fixture
  outcomes but doesn't discriminate WHICH parser produced each
  summary. Surfacing the regex vs. tree-sitter delta requires
  registry injection at eval time (the harness builds its own
  parser chain instead of using the global). Reasonable
  follow-up; not gating on #933 closure.

## #933 status: feature complete

| Phase | Status |
|---|---|
| 1.0 — Registry abstraction | ✓ #950 |
| 1.1 — TS/TSX bundled | ✓ #955 |
| 2 — Polish + ESM jest + arrow-fn fixture | ✓ #956 |
| 3 — Lazy-load infra + Python | ✓ #957 |
| 5 — Rust | ✓ #958 |
| 6 — Go | ✓ #958 |
| **7 — Cache CLI + telemetry** | **this PR** |

Closes #933.