spec-08: additional general-purpose language support (C#, Kotlin, PHP, C, Scala, Dart, Lua, Elixir, Bash)#93

Merged
clay-good merged 6 commits into main from openlore-spec-08-additional-languages
May 27, 2026

@clay-good
Owner

Summary

Adds call-graph extraction for nine general-purpose languages on the
existing tree-sitter extractor pattern — no changes to the
FunctionNode/CallEdge/ClassNode schema, MCP tools, orient, the search
index, SCIP export, or the federation manifest.

  • Phantom-language bug fixed: C#, Kotlin, PHP, and C were detected by
    detectLanguage but had no dispatch branch, so they were counted but
    produced an empty graph. They now emit real nodes and edges.
  • New languages graphed: Scala, Dart, Lua, Elixir, Bash.
  • Each language is wired in the four canonical places (lazy soft loader, FN/CALL
    queries, dispatch branch, detection). Structurally similar languages share one
    query-driven extractor (extractByQueries); Elixir uses a bespoke walk (its
    grammar models everything as call nodes).
  • Graceful degradation: grammars are native modules. loadGrammarSoft
    warns once and skips a missing/ABI-incompatible grammar without aborting
    analyze or any other language. Files of that language are still indexed for
    search.
  • .h C/C++ heuristic (resolveHeaderLanguage, tested): C-only project → C;
    any C++ source → C++; standalone → C++ (superset default).
  • Name-based call resolution (matches the repo's best-effort approach) resolves
    this.M(), Class.M(), $this->m(), Obj.m(), Mod.fun(), etc.
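
The three-case .h rule above can be sketched as follows. This is a minimal illustration, not the merged implementation: the PR names resolveHeaderLanguage but does not show its signature, so the parameter (a list of project file paths), the return tags, and the extension sets here are all assumptions.

```typescript
// Hypothetical shape: the real resolveHeaderLanguage signature is not shown in the PR.
type HeaderLanguage = "c" | "cpp";

// Assumed extension sets, for illustration only.
const C_SOURCE_EXTS = new Set([".c"]);
const CPP_SOURCE_EXTS = new Set([".cc", ".cpp", ".cxx"]);

function extensionOf(path: string): string {
  const dot = path.lastIndexOf(".");
  return dot === -1 ? "" : path.slice(dot).toLowerCase();
}

// Decide how a .h file should be parsed, given the other files in the project.
function resolveHeaderLanguage(projectFiles: readonly string[]): HeaderLanguage {
  let sawCSource = false;
  for (const file of projectFiles) {
    const ext = extensionOf(file);
    if (CPP_SOURCE_EXTS.has(ext)) return "cpp"; // any C++ source → C++
    if (C_SOURCE_EXTS.has(ext)) sawCSource = true;
  }
  // C-only project → C; standalone header (no sources at all) → C++ (superset default).
  return sawCSource ? "c" : "cpp";
}
```

Under these assumptions, a project containing only main.c and util.h treats util.h as C, while adding a single .cpp file switches every header to C++.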

Native grammar dependencies (justification)

All ship prebuilt binaries; none required a source compile here. Pinned to
ABI-14 prebuilds where needed to match the host tree-sitter (0.22.4, ABI 14):
tree-sitter-c-sharp@0.23.1, tree-sitter-php@0.23.12, tree-sitter-c@0.23.6,
tree-sitter-bash@0.23.3; tree-sitter-kotlin/-scala/-elixir at latest.
tree-sitter-lua and tree-sitter-dart ship ABI-15-only builds, so in this
environment (node 25 cannot build tree-sitter 0.25 from source) they exercise
the graceful-degradation path; their extractors + fixtures are in place and
graph wherever an ABI-15 host binding is available. 7 of 9 extract fully in
CI today; all 9 are wired end to end.

Test plan

  • Per-language unit tests (exact nodes/classes/edges)
  • Phantom-regression: C#/Kotlin/PHP/C produce non-empty graphs
  • .h heuristic (3-case table)
  • Polyglot integration across ≥3 new languages + TS (tools unchanged, TS unregressed)
  • Graceful degradation (simulated unavailable grammar via vi.doMock)
  • Determinism (build-twice deep-equal per language)
  • No regression: full suite green (2891 passing); existing extractors untouched
  • lint / typecheck / test:run / build all pass

Follow-ups (TODO(spec-08-followup))

  • Lua/Dart full extraction once an ABI-15 host tree-sitter is buildable on the
    CI node version (code + fixtures already present).
  • Bash source and . (dot) includes modeled as file-level dependency edges.
  • Per-language Stage-1 regex signatures (the call graph already feeds search).
  • Deferred languages (Objective-C, Perl, Haskell, …) remain future specs.

🤖 Generated with Claude Code

clay-good and others added 6 commits May 27, 2026 15:01
…Dart, Lua, Elixir, Bash)

Fixes the phantom-language bug (C#/Kotlin/PHP/C were detected but never
graphed) and adds 5 new languages, all on the existing tree-sitter extractor
pattern — no schema/MCP/tool changes.

- loadGrammarSoft: lazy, cached, soft-failing grammar loader (graceful
  degradation — a missing/ABI-incompatible grammar warns once and skips that
  language without aborting analyze or any other language).
- extractByQueries: shared query-driven extractor (C#/Kotlin/PHP/C/Scala/Dart/
  Lua/Bash) parameterized by FN/CALL queries + grouping node types; bespoke
  walk for Elixir (its grammar models everything as `call` nodes).
- Name-based call resolution (matches the repo's best-effort approach).
- detectLanguage extensions; .h C/C++ heuristic (resolveHeaderLanguage);
  CALL_GRAPH_LANGS extended; ambient types for untyped grammars.
- Grammar versions pinned to ABI-14 prebuilds where required (c-sharp 0.23.1,
  php 0.23.12, c 0.23.6, bash 0.23.3); Lua/Dart ship ABI-15 only and exercise
  the graceful-degradation path in this environment.
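
A minimal sketch of the lazy, cached, soft-failing loader described above. The commit does not show the real loadGrammarSoft signature, so the injected load callback, the warn parameter, and the module-level cache are assumptions made to keep the example self-contained.

```typescript
// Hypothetical signature: the loader is injected so the sketch needs no native modules.
const grammarCache = new Map<string, unknown>();

function loadGrammarSoft<G>(
  language: string,
  load: () => G,
  warn: (message: string) => void = console.warn,
): G | null {
  // Cached result (including a cached failure) means load() runs at most once per language.
  if (grammarCache.has(language)) {
    return grammarCache.get(language) as G | null;
  }
  let grammar: G | null = null;
  try {
    grammar = load(); // lazy: only attempted on first use
  } catch (err) {
    // Soft failure: warn once, remember the failure, and let other languages proceed.
    warn(`grammar for ${language} unavailable, skipping call-graph extraction: ${String(err)}`);
  }
  grammarCache.set(language, grammar);
  return grammar;
}
```

Because a failure is cached as null, a second call for the same language neither retries the load nor repeats the warning, which is one way to get the "warns once" behavior.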

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…glot, degradation, determinism)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…itter)

Previously Lua/Dart only degraded (no ABI-compatible native grammar for the
pinned host tree-sitter). They now extract for real via tree-sitter-wasms +
web-tree-sitter, so all nine spec-08 languages produce graphs.

- Unified GrammarHandle (withTree) abstracts native tree-sitter and the WASM
  backend behind one interface; the 7 native languages are unchanged.
- loadWasmGrammarSoft loads the grammar's WASM bytes ourselves and hands a
  Uint8Array to Language.load (avoids web-tree-sitter's ESM-unfriendly internal
  fs require); withTree disposes tree/queries per parse to protect the WASM heap.
- Dart uses a custom walk (its function_body is a sibling of function_signature,
  so a generic query attributes no calls); Lua uses the generic query path with
  the bundled grammar's node types, incl. t.f()/t:m() and table-name className.
- Validated against the real grammars; web-tree-sitter pinned to 0.25.0,
  native tree-sitter-lua/-dart deps dropped (WASM comes from tree-sitter-wasms).
- Dart/Lua tests split into their own files (vitest's sandbox corrupts the
  shared WASM heap across grammars in one file; production node does not).
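
The per-parse disposal that withTree performs can be sketched like this. The interface names and the injected parse function are hypothetical; the commit only states that GrammarHandle exposes withTree and that trees and queries are disposed after each parse. A native backend could satisfy the same interface with a no-op delete.

```typescript
// Hypothetical interfaces: only the withTree name comes from the commit message.
interface Deletable {
  delete(): void;
}

interface ParsedTree extends Deletable {
  readonly rootType: string;
}

interface GrammarHandle {
  withTree<T>(source: string, use: (tree: ParsedTree) => T): T;
}

// Wrap a backend-specific parse function behind the shared handle.
function makeGrammarHandle(parse: (source: string) => ParsedTree): GrammarHandle {
  return {
    withTree<T>(source: string, use: (tree: ParsedTree) => T): T {
      const tree = parse(source);
      try {
        return use(tree);
      } finally {
        // Always release the tree, even if extraction throws, so WASM memory is not leaked.
        tree.delete();
      }
    },
  };
}
```

Scoping the tree to a callback keeps callers from holding a reference past disposal, at the cost of requiring extraction results to be copied out before the callback returns.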

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-1 signatures

- Kotlin extension-function CALL now asserted ("hi".shout() resolves).
- Elixir remote Mod.fun() resolves to an in-project module (cross-module edge);
  emit the function name only so name-based resolution matches.
- ClassNode grouping: assert Service.methodIds references both methods (C#).
- Explicit phantom-regression test: C#/Kotlin/PHP/C always non-empty nodes+edges.
- Cross-cutting interop tests: SCIP export emits new-language nodes (no-enum →
  UnspecifiedLanguage), federation manifest languages[] includes the new tags.
- Best-effort Stage-1 search signatures for all nine languages so they are
  searchable via BM25 even when a grammar can't load (graceful degradation).
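
For orientation, name-based call resolution of the kind these assertions exercise might look like the sketch below. The FnNode, RawCall, and Edge shapes are illustrative stand-ins, not the repo's FunctionNode/CallEdge schema, and the tie-breaking rule (prefer a class match, otherwise the first declared candidate) is an assumption.

```typescript
// Illustrative shapes only; the repo's actual schema fields are not shown in the PR.
interface FnNode { id: string; name: string; className?: string }
interface RawCall { callerId: string; calleeName: string; receiver?: string }
interface Edge { from: string; to: string }

function resolveCalls(nodes: readonly FnNode[], calls: readonly RawCall[]): Edge[] {
  const byId = new Map<string, FnNode>();
  const byName = new Map<string, FnNode[]>();
  for (const node of nodes) {
    byId.set(node.id, node);
    const bucket = byName.get(node.name);
    if (bucket) bucket.push(node);
    else byName.set(node.name, [node]);
  }

  const edges: Edge[] = [];
  for (const call of calls) {
    const candidates = byName.get(call.calleeName);
    const first = candidates?.[0];
    if (!candidates || !first) continue; // best effort: unresolved calls are dropped

    // this.m() / $this->m() hint at the caller's own class; Class.m() hints at Class.
    const selfCall = call.receiver === "this" || call.receiver === "$this";
    const classHint = selfCall ? byId.get(call.callerId)?.className : call.receiver;
    const match = classHint === undefined
      ? undefined
      : candidates.find((c) => c.className === classHint);
    edges.push({ from: call.callerId, to: (match ?? first).id });
  }
  return edges;
}
```

Emitting only the bare function name for a remote call such as Mod.fun() keeps the lookup key identical to the declared name, which is why the callee is matched on name first and the receiver is used only as a hint.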

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a committed graph snapshot for a representative TypeScript fixture
(classes, generics, async methods, free functions, calls). Any change to an
existing-language extractor would diff this snapshot — the spec's "before/after
byte-identical repo graph" guard, made maintainable as a committed baseline.
Also asserts byte-identical output across rebuilds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clay-good clay-good merged commit e2baf74 into main May 27, 2026
4 checks passed
@clay-good clay-good deleted the openlore-spec-08-additional-languages branch May 29, 2026 13:13