
feat: multi-language tree-sitter extractor architecture (10 languages) #89

Merged
Lum1104 merged 18 commits into main from fix/graph-builder-use-language-registry
Apr 15, 2026

@Lum1104 Lum1104 commented Apr 15, 2026

Summary

  • Decouple AST extraction from TS/JS-specific node types — introduced LanguageExtractor interface with per-language extractor classes, refactored TreeSitterPlugin (746→297 lines) to dispatch via registered extractors
  • Add tree-sitter structural analysis for 10 languages: TypeScript, JavaScript, Python, Go, Rust, Java, Ruby, PHP, C/C++, C# each get a dedicated extractor with full extractStructure() + extractCallGraph() support (225 new tests)
  • Replace LLM-generated regex scripts with deterministic extraction — bundled extract-structure.mjs uses PluginRegistry (tree-sitter + non-code parsers) so the file-analyzer agent no longer writes throwaway scripts from scratch each run
  • Fix GraphBuilder duplicate language mapping — replaced hardcoded EXTENSION_LANGUAGE map with LanguageRegistry delegation

Architecture

TreeSitterPlugin
  ├── registerExtractor(extractor)
  ├── getExtractor(langKey) → LanguageExtractor
  ├── analyzeFile() → delegates to extractor.extractStructure()
  └── extractCallGraph() → delegates to extractor.extractCallGraph()

extractors/
  ├── types.ts              # LanguageExtractor interface
  ├── base-extractor.ts     # Shared AST utilities
  ├── typescript-extractor.ts
  ├── python-extractor.ts
  ├── go-extractor.ts
  ├── rust-extractor.ts
  ├── java-extractor.ts
  ├── ruby-extractor.ts
  ├── php-extractor.ts
  ├── cpp-extractor.ts
  ├── csharp-extractor.ts
  └── index.ts              # builtinExtractors array

Changes by area

Core extractors (new)

  • LanguageExtractor interface + shared traverse/findChild/getStringValue utilities
  • 9 language-specific extractors (TS/JS shares one), each handling language-specific AST node types for functions, classes, imports, exports, and call graphs
  • builtinExtractors array auto-registered in TreeSitterPlugin constructor
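
To make the extractor contract concrete, here is a minimal sketch of what a LanguageExtractor and the shared traverse/findChild helpers could look like. The AstNode, Structure, and CallEdge shapes, the toy node types, and the toyExtractor are assumptions for illustration; the PR does not show the real signatures.

```typescript
// Minimal stand-in for a tree-sitter syntax node (assumption: the real node type is richer).
interface AstNode {
  type: string;
  text: string;
  children: AstNode[];
}

// Hypothetical result shapes; the PR does not show the real ones.
interface Structure {
  functions: string[];
  classes: string[];
}
interface CallEdge {
  caller: string;
  callee: string;
}

// Sketch of the per-language contract described above.
interface LanguageExtractor {
  languageIds: string[];
  extractStructure(root: AstNode): Structure;
  extractCallGraph(root: AstNode): CallEdge[];
}

// Shared utility: depth-first, pre-order walk over a subtree.
function traverse(node: AstNode, visit: (n: AstNode) => void): void {
  visit(node);
  for (const child of node.children) traverse(child, visit);
}

// Shared utility: first direct child with the given node type, if any.
function findChild(node: AstNode, type: string): AstNode | undefined {
  return node.children.find((c) => c.type === type);
}

// Toy extractor for an invented language whose AST uses "function_def",
// "class_def", "call", and "name" node types.
const toyExtractor: LanguageExtractor = {
  languageIds: ["toy"],
  extractStructure(root) {
    const result: Structure = { functions: [], classes: [] };
    traverse(root, (n) => {
      const name = findChild(n, "name")?.text;
      if (!name) return;
      if (n.type === "function_def") result.functions.push(name);
      if (n.type === "class_def") result.classes.push(name);
    });
    return result;
  },
  extractCallGraph(root) {
    const edges: CallEdge[] = [];
    traverse(root, (fn) => {
      if (fn.type !== "function_def") return;
      const caller = findChild(fn, "name")?.text;
      if (!caller) return;
      // Every "call" node inside this function becomes a caller -> callee edge.
      traverse(fn, (n) => {
        const callee = n.type === "call" ? findChild(n, "name")?.text : undefined;
        if (callee) edges.push({ caller, callee });
      });
    });
    return edges;
  },
};

// Hand-built tree standing in for parser output.
const leaf = (type: string, text: string): AstNode => ({ type, text, children: [] });
const sampleTree: AstNode = {
  type: "module",
  text: "",
  children: [
    {
      type: "function_def",
      text: "",
      children: [leaf("name", "main"), { type: "call", text: "", children: [leaf("name", "helper")] }],
    },
    { type: "function_def", text: "", children: [leaf("name", "helper")] },
    { type: "class_def", text: "", children: [leaf("name", "Widget")] },
  ],
};

console.log(toyExtractor.extractStructure(sampleTree));
console.log(toyExtractor.extractCallGraph(sampleTree));
```

Each real extractor would presumably differ only in which node types it matches, which is why the walking helpers can be shared.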

TreeSitterPlugin refactor

  • Removed all TS/JS-specific extraction methods (moved to TypeScriptExtractor)
  • Added extractor dispatch: registerExtractor(), getExtractor() with tsx→typescript mapping
  • Constructor accepts optional extractors param, defaults to builtinExtractors
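
The dispatch described in this list can be sketched roughly as follows. ExtractorDispatcher and ExtractorLike are hypothetical names standing in for TreeSitterPlugin and LanguageExtractor, and the alias table holds only the tsx mapping mentioned above; the real plugin also handles parsing, which is omitted here.

```typescript
// Reduced stand-in for the LanguageExtractor contract (assumed shape).
interface ExtractorLike {
  languageIds: string[];
  extractStructure(source: string): string[];
}

class ExtractorDispatcher {
  // Alias table: only the tsx -> typescript mapping is mentioned in the PR.
  private static readonly ALIASES = new Map<string, string>([["tsx", "typescript"]]);
  private readonly extractors = new Map<string, ExtractorLike>();

  // Optional extractors param; the real plugin defaults to builtinExtractors.
  constructor(extractors: ExtractorLike[] = []) {
    for (const extractor of extractors) this.registerExtractor(extractor);
  }

  registerExtractor(extractor: ExtractorLike): void {
    for (const id of extractor.languageIds) this.extractors.set(id, extractor);
  }

  getExtractor(langKey: string): ExtractorLike | undefined {
    const key = ExtractorDispatcher.ALIASES.get(langKey) ?? langKey;
    return this.extractors.get(key);
  }

  // Delegates to the registered extractor; returns nothing for unknown languages.
  analyzeFile(langKey: string, source: string): string[] {
    const extractor = this.getExtractor(langKey);
    return extractor ? extractor.extractStructure(source) : [];
  }
}

// Stub extractors that only report which one handled the file.
const tsExtractor: ExtractorLike = {
  languageIds: ["typescript", "javascript"],
  extractStructure: () => ["handled by the TypeScript/JavaScript extractor"],
};
const pyExtractor: ExtractorLike = {
  languageIds: ["python"],
  extractStructure: () => ["handled by the Python extractor"],
};

const dispatcher = new ExtractorDispatcher([tsExtractor]);
dispatcher.registerExtractor(pyExtractor);

console.log(dispatcher.analyzeFile("tsx", "const x = 1;"));
console.log(dispatcher.analyzeFile("python", "x = 1"));
console.log(dispatcher.analyzeFile("lua", "x = 1"));
```

Keeping the alias lookup inside getExtractor() means callers never need to know that two language keys share one extractor.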

Plugin infrastructure

  • PluginRegistry.extractCallGraph() — new delegation method
  • DEFAULT_PLUGIN_CONFIG now derives languages dynamically from builtinLanguageConfigs.filter(c => c.treeSitter)
  • 8 new tree-sitter grammar deps + treeSitter configs for Python, Go, Rust, Java, Ruby, PHP, C++, C#
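
As a rough illustration of deriving the language list instead of hardcoding it, the sketch below filters a small config array on the presence of a treeSitter entry. The LanguageConfig field layout (including grammarPackage) and the three sample entries are assumptions; only the filter expression comes from the PR.

```typescript
// Assumed config shape; only `treeSitter` being optional matters for the filter.
interface LanguageConfig {
  id: string;
  extensions: string[];
  treeSitter?: { grammarPackage: string }; // hypothetical field layout
}

// Illustrative subset, not the project's real config list.
const builtinLanguageConfigs: LanguageConfig[] = [
  { id: "typescript", extensions: [".ts", ".tsx"], treeSitter: { grammarPackage: "tree-sitter-typescript" } },
  { id: "python", extensions: [".py"], treeSitter: { grammarPackage: "tree-sitter-python" } },
  { id: "markdown", extensions: [".md"] }, // handled by a non-code parser, so no treeSitter entry
];

// Languages are derived, so adding a config with `treeSitter` enables it automatically.
const DEFAULT_PLUGIN_CONFIG = {
  languages: builtinLanguageConfigs.filter((c) => c.treeSitter).map((c) => c.id),
};

console.log(DEFAULT_PLUGIN_CONFIG.languages); // [ 'typescript', 'python' ]
```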

Agent pipeline

  • extract-structure.mjs — bundled script using PluginRegistry for deterministic extraction
  • file-analyzer.md Phase 1 rewritten: executes bundled script instead of LLM-generated regex (−213 lines)
  • SKILL.md passes <SKILL_DIR> to file-analyzer dispatch

Test plan

  • All 693 tests pass (651 core + 42 skill)
  • 225 new extractor tests across 8 test files (Python: 31, Go: 25, Rust: 30, Java: 25, Ruby: 32, PHP: 27, C++: 28, C#: 27)
  • All 426 pre-existing tests pass unchanged (zero behavior change in TS/JS extraction)
  • Core and skill packages build cleanly (tsc, zero errors)
  • extract-structure.mjs tested locally against TS, Markdown, and JSON files
  • Run /understand --full on a multi-language project to verify end-to-end pipeline

🤖 Generated with Claude Code

…th LanguageRegistry

GraphBuilder maintained its own ~60-line extension-to-language mapping that
duplicated and could diverge from the canonical LanguageRegistry. Now delegates
language detection to LanguageRegistry.getForFile(), eliminating the duplication
and ensuring new language configs are automatically picked up everywhere.
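
A minimal sketch of this delegation, assuming a registry keyed by file extension. The LanguageRegistry internals and the sample configs are hypothetical; only getForFile() and the "unknown" default are taken from the change itself.

```typescript
interface LanguageConfig {
  id: string;
  extensions: string[];
}

// Hypothetical registry: one canonical extension -> config table.
class LanguageRegistry {
  private readonly byExtension = new Map<string, LanguageConfig>();

  constructor(configs: LanguageConfig[]) {
    for (const config of configs) {
      for (const ext of config.extensions) this.byExtension.set(ext.toLowerCase(), config);
    }
  }

  getForFile(filePath: string): LanguageConfig | undefined {
    const fileName = filePath.slice(filePath.lastIndexOf("/") + 1);
    const dot = fileName.lastIndexOf(".");
    if (dot <= 0) return undefined; // no extension, or a dotfile such as ".gitignore"
    return this.byExtension.get(fileName.slice(dot).toLowerCase());
  }
}

// GraphBuilder keeps no mapping of its own and asks the registry instead.
class GraphBuilder {
  constructor(private readonly languageRegistry: LanguageRegistry) {}

  detectLanguage(filePath: string): string {
    return this.languageRegistry.getForFile(filePath)?.id ?? "unknown";
  }
}

const registry = new LanguageRegistry([
  { id: "typescript", extensions: [".ts", ".tsx"] },
  { id: "go", extensions: [".go"] },
]);
const builder = new GraphBuilder(registry);

console.log(builder.detectLanguage("src/server/main.go"));
console.log(builder.detectLanguage("src/App.TSX"));
console.log(builder.detectLanguage("Makefile"));
```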

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6cba6de658


}

private detectLanguage(filePath: string): string {
  return this.languageRegistry.getForFile(filePath)?.id ?? "unknown";

P2: Restore Lua language detection fallback

Routing detection entirely through LanguageRegistry here drops support for extensions that were only in the old map, notably .lua. Because builtinLanguageConfigs has no Lua config, addFile("*.lua", ...) now resolves to "unknown" and project.languages no longer includes "lua", which regresses graph metadata and downstream summaries for Lua repositories compared with the previous behavior.
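
One way the suggested fallback could look, as a sketch only: consult the registry first, then a small legacy table for extensions it does not know. LEGACY_EXTENSION_LANGUAGE, RegistryLookup, and detectLanguageWithFallback are hypothetical names, and the table holds only the .lua entry named in this review. Adding a Lua entry to builtinLanguageConfigs would be the other obvious option.

```typescript
// Hypothetical legacy table: only the .lua entry is named in the review.
const LEGACY_EXTENSION_LANGUAGE = new Map<string, string>([[".lua", "lua"]]);

// Stand-in for LanguageRegistry.getForFile(); only the `id` field is used.
type RegistryLookup = (filePath: string) => { id: string } | undefined;

function detectLanguageWithFallback(filePath: string, lookup: RegistryLookup): string {
  // The registry stays the primary source of truth.
  const fromRegistry = lookup(filePath)?.id;
  if (fromRegistry) return fromRegistry;

  // Fall back to the legacy table for extensions the registry has no config for.
  const dot = filePath.lastIndexOf(".");
  const ext = dot >= 0 ? filePath.slice(dot).toLowerCase() : "";
  return LEGACY_EXTENSION_LANGUAGE.get(ext) ?? "unknown";
}

// Stub registry that only knows TypeScript.
const stubLookup: RegistryLookup = (filePath) =>
  filePath.endsWith(".ts") ? { id: "typescript" } : undefined;

console.log(detectLanguageWithFallback("src/index.ts", stubLookup)); // typescript
console.log(detectLanguageWithFallback("scripts/init.lua", stubLookup)); // lua
console.log(detectLanguageWithFallback("data/blob.bin", stubLookup)); // unknown
```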


…r architecture

Plan covers: LanguageExtractor interface, per-language extractors for 8 languages
(Python, Go, Rust, Java, Ruby, PHP, C/C++, C#), bundled extract-structure.mjs
script replacing LLM-generated regex in file-analyzer agent, and PluginRegistry
enhancements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Lum1104 Lum1104 changed the title from "refactor: replace hardcoded EXTENSION_LANGUAGE with LanguageRegistry" to "refactor: eliminate duplicated language mapping + plan multi-language tree-sitter support" on Apr 15, 2026
Lum1104 and others added 16 commits April 15, 2026 18:12
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…h via LanguageExtractor interface

Extract all TypeScript/JavaScript-specific AST extraction functions
(extractParams, extractReturnType, extractImportSpecifiers, processTopLevelNode,
extractFunction, extractClass, extractVariableDeclarations, extractImport,
processExportStatement, and call graph walking) from TreeSitterPlugin into the
new TypeScriptExtractor class. TreeSitterPlugin now dispatches to registered
LanguageExtractor instances, defaulting to TypeScriptExtractor for backward
compatibility. All 426 existing tests pass unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ONFIG from configs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the LanguageExtractor interface for Python, extracting functions
(with type annotations, defaults, *args/**kwargs), classes (methods +
annotated properties), imports (plain, from, aliased, wildcard), exports
(top-level defs), and caller-callee call graphs. Includes 31 tests using
the real tree-sitter parser.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the LanguageExtractor interface for Go, handling functions,
methods with receivers, structs, interfaces, imports, exports (via
capitalization convention), and call graph extraction. Includes 25 tests.
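
For readers unfamiliar with the Go convention: an identifier is exported when its first character is an uppercase letter. A minimal sketch of that check follows; the function names are hypothetical, and the case-mapping test is an approximation of Go's Unicode rule rather than the extractor's actual code.

```typescript
// Approximates Go's rule: the first character must be an uppercase letter.
// Characters without case (digits, "_") fail because lower and upper forms are equal.
function isExportedGoIdentifier(name: string): boolean {
  const first = name.charAt(0);
  return first !== "" && first === first.toUpperCase() && first !== first.toLowerCase();
}

// Keep only the declarations that would count as exports.
function exportedNames(declared: string[]): string[] {
  return declared.filter(isExportedGoIdentifier);
}

console.log(exportedNames(["NewServer", "handleConn", "Config", "_internal", "maxRetries"]));
// [ 'NewServer', 'Config' ]
```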

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Handles methods, classes, modules, attr_* properties, require imports,
and call graph including bare identifier calls (no-arg method invocations).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ctors into TreeSitterPlugin

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the LLM-generated throwaway regex scripts in Phase 1 of the
file-analyzer with a deterministic script that uses PluginRegistry
(TreeSitterPlugin + all non-code parsers) from @understand-anything/core.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nerated regex

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ackage

Completes the language extractor architecture — 10 languages with
tree-sitter support (TS, JS, Python, Go, Rust, Java, Ruby, PHP, C/C++, C#).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Lum1104 Lum1104 changed the title from "refactor: eliminate duplicated language mapping + plan multi-language tree-sitter support" to "feat: multi-language tree-sitter extractor architecture (10 languages)" on Apr 15, 2026
@Lum1104 Lum1104 merged commit 3f07df0 into main Apr 15, 2026
1 check passed
@Lum1104 Lum1104 deleted the fix/graph-builder-use-language-registry branch April 15, 2026 14:34