feat: multi-language tree-sitter extractor architecture (10 languages)#89
…th LanguageRegistry
GraphBuilder maintained its own ~60-line extension-to-language mapping that duplicated and could diverge from the canonical LanguageRegistry. Now delegates language detection to LanguageRegistry.getForFile(), eliminating the duplication and ensuring new language configs are automatically picked up everywhere. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6cba6de658
}

private detectLanguage(filePath: string): string {
  return this.languageRegistry.getForFile(filePath)?.id ?? "unknown";
Restore Lua language detection fallback
Routing detection entirely through LanguageRegistry here drops support for extensions that were only in the old map, notably .lua. Because builtinLanguageConfigs has no Lua config, addFile("*.lua", ...) now resolves to "unknown" and project.languages no longer includes "lua", which regresses graph metadata and downstream summaries for Lua repositories compared with the previous behavior.
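One possible way to address this is sketched below, assuming a small fallback table is acceptable until a Lua config exists in the registry. Only `getForFile(filePath)?.id` comes from the diff; `LanguageRegistryLike`, `FALLBACK_EXTENSION_LANGUAGE`, the standalone `detectLanguage` function, and `stubRegistry` are hypothetical names used for illustration.

```typescript
// Hypothetical minimal shape of the registry; only getForFile() and `id` appear in the diff.
interface LanguageRegistryLike {
  getForFile(filePath: string): { id: string } | undefined;
}

// Assumed fallback table for extensions that have no registry config yet.
const FALLBACK_EXTENSION_LANGUAGE: Record<string, string> = {
  ".lua": "lua",
};

function detectLanguage(registry: LanguageRegistryLike, filePath: string): string {
  // Prefer the canonical registry so new language configs are picked up automatically.
  const fromRegistry = registry.getForFile(filePath)?.id;
  if (fromRegistry) return fromRegistry;

  // Otherwise look at the extension of the final path segment, lowercased.
  const fileName = filePath.split(/[\\/]/).pop() ?? "";
  const dot = fileName.lastIndexOf(".");
  const ext = dot > 0 ? fileName.slice(dot).toLowerCase() : "";
  return FALLBACK_EXTENSION_LANGUAGE[ext] ?? "unknown";
}

// Stub registry that only knows TypeScript, standing in for the real one.
const stubRegistry: LanguageRegistryLike = {
  getForFile: (p) => (p.endsWith(".ts") ? { id: "typescript" } : undefined),
};
```

With this shape, `detectLanguage(stubRegistry, "scripts/init.lua")` resolves to `"lua"` while registry-backed languages still take precedence. Adding a Lua entry to `builtinLanguageConfigs` would be the alternative that keeps a single source of truth.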
…r architecture
Plan covers: LanguageExtractor interface, per-language extractors for 8 languages (Python, Go, Rust, Java, Ruby, PHP, C/C++, C#), bundled extract-structure.mjs script replacing LLM-generated regex in file-analyzer agent, and PluginRegistry enhancements. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…h via LanguageExtractor interface
Extract all TypeScript/JavaScript-specific AST extraction functions (extractParams, extractReturnType, extractImportSpecifiers, processTopLevelNode, extractFunction, extractClass, extractVariableDeclarations, extractImport, processExportStatement, and call graph walking) from TreeSitterPlugin into the new TypeScriptExtractor class. TreeSitterPlugin now dispatches to registered LanguageExtractor instances, defaulting to TypeScriptExtractor for backward compatibility. All 426 existing tests pass unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ONFIG from configs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uages Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the LanguageExtractor interface for Python, extracting functions (with type annotations, defaults, *args/**kwargs), classes (methods + annotated properties), imports (plain, from, aliased, wildcard), exports (top-level defs), and caller-callee call graphs. Includes 31 tests using the real tree-sitter parser. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the LanguageExtractor interface for Go, handling functions, methods with receivers, structs, interfaces, imports, exports (via capitalization convention), and call graph extraction. Includes 25 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
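The capitalization convention mentioned in this commit can be illustrated with a short sketch. This is not the extractor's code from the PR; `isExportedGoIdentifier` is a hypothetical helper, and the case check is an approximation of Go's rule (first character in Unicode category Lu) that assumes the first character is in the Basic Multilingual Plane.

```typescript
// Go treats an identifier as exported when its first character is an upper-case letter.
// Approximation: the character must have a distinct lower-case form and equal its upper-case form.
function isExportedGoIdentifier(name: string): boolean {
  const first = name.charAt(0);
  return first !== "" && first !== first.toLowerCase() && first === first.toUpperCase();
}

const names = ["ParseFile", "parseFile", "_Helper", "Ünicode"];
const exported = names.filter(isExportedGoIdentifier);
// exported is ["ParseFile", "Ünicode"]
```

Underscore-prefixed and lower-case names are treated as package-private, which is why no explicit export keyword needs to be parsed for Go.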
Handles methods, classes, modules, attr_* properties, require imports, and call graph including bare identifier calls (no-arg method invocations). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ctors into TreeSitterPlugin Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the LLM-generated throwaway regex scripts in Phase 1 of the file-analyzer with a deterministic script that uses PluginRegistry (TreeSitterPlugin + all non-code parsers) from @understand-anything/core. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nerated regex Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ackage
Completes the language extractor architecture: 10 languages with tree-sitter support (TS, JS, Python, Go, Rust, Java, Ruby, PHP, C/C++, C#). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- `LanguageExtractor` interface with per-language extractor classes, refactored `TreeSitterPlugin` (746→297 lines) to dispatch via registered extractors
- `extractStructure()` + `extractCallGraph()` support (239 new tests)
- `extract-structure.mjs` uses `PluginRegistry` (tree-sitter + non-code parsers) so the file-analyzer agent no longer writes throwaway scripts from scratch each run
- … `EXTENSION_LANGUAGE` map with `LanguageRegistry` delegation
Architecture
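The dispatch design summarized above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: the real `LanguageExtractor` works on tree-sitter nodes, while `StructureResult`, the `languageIds` field, `ExtractorDispatcher`, and the stub extractors here are hypothetical. The sketch also leaves out the default fallback to `TypeScriptExtractor` that the commit message describes.

```typescript
// Hypothetical, simplified result shape for illustration only.
interface StructureResult {
  functions: string[];
}

// Assumed interface shape; `languageIds` is an invented field name.
interface LanguageExtractor {
  languageIds: string[];
  extractStructure(source: string): StructureResult;
}

class ExtractorDispatcher {
  private readonly extractors = new Map<string, LanguageExtractor>();
  // tsx reuses the TypeScript extractor, mirroring the tsx→typescript mapping in the PR.
  private readonly aliases: Record<string, string> = { tsx: "typescript" };

  constructor(initial: LanguageExtractor[] = []) {
    for (const extractor of initial) this.registerExtractor(extractor);
  }

  registerExtractor(extractor: LanguageExtractor): void {
    for (const id of extractor.languageIds) this.extractors.set(id, extractor);
  }

  getExtractor(languageId: string): LanguageExtractor | undefined {
    return this.extractors.get(this.aliases[languageId] ?? languageId);
  }
}

// Stub extractors returning fixed results, standing in for real tree-sitter walkers.
const tsStub: LanguageExtractor = {
  languageIds: ["typescript", "javascript"],
  extractStructure: () => ({ functions: ["tsFn"] }),
};
const pyStub: LanguageExtractor = {
  languageIds: ["python"],
  extractStructure: () => ({ functions: ["py_fn"] }),
};

const dispatcher = new ExtractorDispatcher([tsStub, pyStub]);
```

Here `dispatcher.getExtractor("tsx")` returns the same object as `dispatcher.getExtractor("typescript")`, and an unregistered language yields `undefined`. Keeping the registry keyed by language id is what lets a new language be added by registering one more extractor rather than editing the plugin.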
Changes by area
Core extractors (new)
- `LanguageExtractor` interface + shared `traverse`/`findChild`/`getStringValue` utilities
- `builtinExtractors` array auto-registered in `TreeSitterPlugin` constructor
TreeSitterPlugin refactor
- … `TypeScriptExtractor`)
- `registerExtractor()`, `getExtractor()` with tsx→typescript mapping
- `extractors` param, defaults to `builtinExtractors`
Plugin infrastructure
- `PluginRegistry.extractCallGraph()`: new delegation method
- `DEFAULT_PLUGIN_CONFIG` now derives languages dynamically from `builtinLanguageConfigs.filter(c => c.treeSitter)`
- `treeSitter` configs for Python, Go, Rust, Java, Ruby, PHP, C++, C#
Agent pipeline
- `extract-structure.mjs`: bundled script using `PluginRegistry` for deterministic extraction
- `file-analyzer.md` Phase 1 rewritten: executes bundled script instead of LLM-generated regex (−213 lines)
- `SKILL.md` passes `<SKILL_DIR>` to file-analyzer dispatch
Test plan
- … `tsc`, zero errors)
- `extract-structure.mjs` tested locally against TS, Markdown, and JSON files
- `/understand --full` on a multi-language project to verify end-to-end pipeline
🤖 Generated with Claude Code