
analysis code_parser

aakash-anko edited this page May 25, 2026 · 1 revision

analysis/code_parser.py

Multi-language source code parser built on tree-sitter. It loads grammar modules, extracts function and class names and parameters from ASTs, and supports 14 languages.


Key Concepts

| Term | Definition | Example |
|---|---|---|
| AST | Abstract Syntax Tree: a tree representation of source code structure, where each node is a language construct (function, class, if-statement, etc.). | def add(a, b): return a+b becomes a tree: FunctionDef → [args: a, b] → [body: Return → BinOp(a + b)]. |
| tree-sitter | A fast, multi-language parser that builds ASTs. Supports 100+ languages without needing each language's compiler. | tree-sitter parses config.py into an AST, then we extract function/class nodes from it. |
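
The node names in the AST example (FunctionDef, Return, BinOp) are the ones used by Python's built-in ast module, so the example can be reproduced without tree-sitter. The sketch below is only an illustration of the concept; tree-sitter uses its own node type names (for example function_definition, as listed under NODE_TYPES).

```python
import ast

# Parse the one-line function from the table's AST example.
tree = ast.parse("def add(a, b): return a+b")

func = tree.body[0]                         # the FunctionDef node
arg_names = [a.arg for a in func.args.args]  # parameter names
ret = func.body[0]                          # the Return statement

summary = (
    type(func).__name__,
    arg_names,
    type(ret).__name__,
    type(ret.value).__name__,
)
print(summary)  # ('FunctionDef', ['a', 'b'], 'Return', 'BinOp')
```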

Source: src/codewalk/analysis/code_parser.py


Constants

GRAMMAR_MAP (line 9)

Maps each language name to the importable module name of its tree-sitter grammar package (the string later passed to importlib.import_module):

{
    "python":     "tree_sitter_python",
    "javascript": "tree_sitter_javascript",
    "typescript": "tree_sitter_typescript",
    "dart":       "tree_sitter_dart",
    ...  # 14 languages total
}

_language_cache (line 26)

Module-level dict that caches loaded Language objects so each grammar is loaded only once.

NODE_TYPES (line 28)

Per-language mapping of which AST node types represent functions vs classes, and which child fields hold the name and parameters. Example for Python:

{
    "function": ["function_definition"],
    "class": ["class_definition"],
    "name_field": "name",
    "params_field": "parameters"
}
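
A minimal sketch of how an entry like this could be consumed: given a language and a raw node type, decide whether the node is a function or a class. The classify helper is hypothetical and exists only for illustration; only the Python entry shown above is included.

```python
# Only the Python entry from this page; the real mapping covers more languages.
NODE_TYPES = {
    "python": {
        "function": ["function_definition"],
        "class": ["class_definition"],
        "name_field": "name",
        "params_field": "parameters",
    },
}


def classify(language, node_type):
    """Hypothetical helper: return "function", "class", or None."""
    entry = NODE_TYPES.get(language)
    if entry is None:
        return None  # language has no node-type mapping
    if node_type in entry["function"]:
        return "function"
    if node_type in entry["class"]:
        return "class"
    return None  # some other construct (if-statement, import, ...)


print(classify("python", "function_definition"))  # function
print(classify("python", "if_statement"))         # None
```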

get_language

Loads a tree-sitter Language object for a given language name. Returns from cache if already loaded, otherwise imports the grammar module and creates the Language.

Example

Input language: "python"

Line 107: "python" in _language_cache → False (first call)
Line 110: model_name = GRAMMAR_MAP.get("python") → "tree_sitter_python"
Line 112: model_name is not None → skip the return None
Line 115: grammar_module = importlib.import_module("tree_sitter_python")
Line 117: language == "typescript"? No
Line 119: language == "php"? No
Line 121: lang = Language(grammar_module.language()) — calls the grammar's language() C function
Line 123: _language_cache["python"] = lang — cached for next time

Return: <Language object for Python>

Second call with "python" → Line 107: cache hit → returns immediately.

Special cases

  • "typescript" → calls grammar_module.language_typescript() (has separate TS + TSX grammars)
  • "php" → calls grammar_module.language_php()
  • Unknown language → returns None
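
The cache-then-import flow and the special cases above can be sketched as follows. This is an approximation under stated assumptions, not the code from code_parser.py: the import_module and wrap parameters are hypothetical additions so the sketch runs without tree-sitter installed (in the real function, wrap would correspond to tree-sitter's Language constructor), and the demo uses a fake grammar module.

```python
import importlib
from types import SimpleNamespace

# The four entries shown on this page; the real map has 14.
GRAMMAR_MAP = {
    "python": "tree_sitter_python",
    "javascript": "tree_sitter_javascript",
    "typescript": "tree_sitter_typescript",
    "dart": "tree_sitter_dart",
}
_language_cache = {}


def get_language(language, import_module=importlib.import_module,
                 wrap=lambda handle: handle):
    """Sketch: import_module and wrap are hypothetical injection points."""
    if language in _language_cache:
        return _language_cache[language]       # cache hit
    module_name = GRAMMAR_MAP.get(language)
    if module_name is None:
        return None                            # unknown language
    grammar_module = import_module(module_name)
    if language == "typescript":
        handle = grammar_module.language_typescript()
    elif language == "php":
        handle = grammar_module.language_php()
    else:
        handle = grammar_module.language()
    lang = wrap(handle)
    _language_cache[language] = lang           # cached for next time
    return lang


# Demo with a fake importer that records which modules were requested.
imported = []


def fake_import(name):
    imported.append(name)
    return SimpleNamespace(
        language=lambda: f"{name}:default",
        language_typescript=lambda: f"{name}:typescript",
    )


first = get_language("python", import_module=fake_import)
second = get_language("python", import_module=fake_import)   # served from cache
ts = get_language("typescript", import_module=fake_import)
missing = get_language("cobol", import_module=fake_import)
```

In the demo, the second request for "python" never reaches fake_import, which is the behaviour the cache is meant to provide.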

get_parser_for_language

Creates a tree-sitter Parser loaded with the grammar for the given language.

Example

Input language: "dart"

Line 131: lang = get_language("dart") → loads and returns the Dart Language object
Line 133: lang is not None → skip return None
Line 135: Creates Parser(lang) and returns it

Return: <Parser object configured for Dart>

If language is unsupported → get_language returns None → this returns None.


extract_name

Pulls the function/class name out of an AST node by looking up a named child field.

Example (Python function)

Input node: <function_definition node for "def scan_directory(path):">
Input name_field: "name"

Line 138: name_node = node.child_by_field_name("name") → <identifier node "scan_directory">
Line 139: name_node is truthy → Return: name_node.text.decode("utf-8") → "scan_directory"

Example (C function — fallback)

Input node: <function_definition for "int main(int argc, char *argv[])">
Input name_field: "declarator"

Line 138: name_node = node.child_by_field_name("declarator") → None (C nests the name deeper)

Lines 143–146: Fallback loop — iterates over node.children:

  • Finds child.type == "function_declarator"
  • inner = child.child_by_field_name("declarator") → <identifier "main">
  • Return: "main"

Example (Dart method — second fallback)

Input node: <method_signature wrapping a function_signature>
Input name_field: "name"

Line 138: Direct lookup fails → None
Lines 143–146: No function_declarator child
Lines 149–153: Finds child with type == "function_signature", looks up "name" inside it → returns the method name

If all fallbacks fail → Return: "<anonymous>"
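
The three lookup paths can be sketched as below. This is a reconstruction from the walk-throughs above, not the original source. Node is a hypothetical stand-in for a tree-sitter node that offers only type, text, children, and child_by_field_name, and the trees are built by hand to mirror the examples rather than produced by a real parser.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a tree-sitter node (illustration only)."""
    type: str
    text: bytes = b""
    children: list = field(default_factory=list)
    fields: dict = field(default_factory=dict)  # field name -> child Node

    def child_by_field_name(self, name):
        return self.fields.get(name)


def extract_name(node, name_field):
    # Direct lookup: works when the name is an immediate named field.
    name_node = node.child_by_field_name(name_field)
    if name_node:
        return name_node.text.decode("utf-8")
    # First fallback: the name sits inside a function_declarator child.
    for child in node.children:
        if child.type == "function_declarator":
            inner = child.child_by_field_name("declarator")
            if inner:
                return inner.text.decode("utf-8")
    # Second fallback: the name sits inside a function_signature child.
    for child in node.children:
        if child.type == "function_signature":
            inner = child.child_by_field_name("name")
            if inner:
                return inner.text.decode("utf-8")
    return "<anonymous>"


py_func = Node("function_definition",
               fields={"name": Node("identifier", b"scan_directory")})
c_func = Node("function_definition", children=[
    Node("function_declarator",
         fields={"declarator": Node("identifier", b"main")}),
])
dart_method = Node("method_signature", children=[
    Node("function_signature", fields={"name": Node("identifier", b"build")}),
])

print(extract_name(py_func, "name"))       # scan_directory
print(extract_name(c_func, "declarator"))  # main
print(extract_name(dart_method, "name"))   # build
```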


extract_params

Pulls parameter names from a function's AST node.

Example

Input node: <function_definition for "def greet(name, age):">
Input params_field: "parameters"

Line 157: params_node = node.child_by_field_name("parameters") → <parameters node "(name, age)">
Line 158: params_node is truthy → skip fallback

Line 167: param_names = []

Lines 169–180: Loop through params_node.children:

  • child = "(" → type is "(" → continue
  • child = <identifier "name">:
    • Line 174: name_node = child.child_by_field_name("name") → None (it IS the identifier)
    • Line 176: child.type == "identifier" ✓ → param_names.append("name")
  • child = "," → continue
  • child = <identifier "age">:
    • Same path → param_names.append("age")
  • child = ")" → continue

Return: ["name", "age"]

Example (typed Python parameter)

Input node: <function_definition for "def greet(name: str):">

The child is a typed_parameter node (not an identifier). Neither child_by_field_name("name") nor child.type == "identifier" matches.

Lines 179–182: Fallback — iterates child.children:

  • Finds sub.type == "identifier" → param_names.append("name") → break

Return: ["name"]
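
Both walk-throughs follow the same loop, which can be sketched as below. Node is again a hypothetical stand-in for a tree-sitter node, and the trees are hand-built. The original's fallback for a missing parameters node is not described on this page, so this sketch simply returns an empty list in that case.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a tree-sitter node (illustration only)."""
    type: str
    text: bytes = b""
    children: list = field(default_factory=list)
    fields: dict = field(default_factory=dict)

    def child_by_field_name(self, name):
        return self.fields.get(name)


def extract_params(node, params_field):
    params_node = node.child_by_field_name(params_field)
    if not params_node:
        return []  # assumption: the real fallback is not shown here
    param_names = []
    for child in params_node.children:
        if child.type in ("(", ")", ","):
            continue  # punctuation tokens carry no names
        name_node = child.child_by_field_name("name")
        if name_node:
            param_names.append(name_node.text.decode("utf-8"))
        elif child.type == "identifier":
            param_names.append(child.text.decode("utf-8"))
        else:
            # Wrapper node such as typed_parameter: take its first identifier.
            for sub in child.children:
                if sub.type == "identifier":
                    param_names.append(sub.text.decode("utf-8"))
                    break
    return param_names


def ident(name):
    return Node("identifier", name.encode("utf-8"))


plain = Node("function_definition", fields={"parameters": Node(
    "parameters",
    children=[Node("("), ident("name"), Node(","), ident("age"), Node(")")],
)})
typed = Node("function_definition", fields={"parameters": Node(
    "parameters",
    children=[
        Node("("),
        Node("typed_parameter",
             children=[ident("name"), Node(":"), Node("type", b"str")]),
        Node(")"),
    ],
)})

print(extract_params(plain, "parameters"))  # ['name', 'age']
print(extract_params(typed, "parameters"))  # ['name']
```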


walk_tree

Recursively walks the concrete syntax tree (CST) and yields nodes whose type matches a given set.

Example

Input node: <root of a Python file with two function defs>
Input target_types: {"function_definition", "class_definition"}
Input skip_children_types: None

Line 223: node.type = "module" → not in target_types → skip yield
Lines 228–229: Recurse into each child of module:

  • Child 1: <function_definition> → type in target_types ✓ → yield this node
    • skip_children_types is None → recurse into its children (this example has no nested functions, so nothing more is found)
  • Child 2: <function_definition> → yield

Yields: 2 function_definition nodes

skip_children_types usage

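For Dart: a method_signature node contains a function_signature node. If both types are in target_types, walking without skip_children_types yields both nodes for the same method (a duplicate). Setting skip_children_types = {"method_signature"} stops the walk from recursing into a matched method_signature node, so only the outer node is yielded.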
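
A minimal generator with this behaviour might look like the sketch below. It is reconstructed from the description above rather than copied from the source; Node is a hypothetical stand-in that carries only type and children, and the Dart-like tree is built by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a tree-sitter node (illustration only)."""
    type: str
    children: list = field(default_factory=list)


def walk_tree(node, target_types, skip_children_types=None):
    if node.type in target_types:
        yield node
        # A matched node listed in skip_children_types is not descended into.
        if skip_children_types and node.type in skip_children_types:
            return
    for child in node.children:
        yield from walk_tree(child, target_types, skip_children_types)


# Python-like tree: a module with two top-level functions.
module = Node("module", [Node("function_definition"),
                         Node("function_definition")])
found = list(walk_tree(module, {"function_definition", "class_definition"}))
print(len(found))  # 2

# Dart-like tree: a method_signature wrapping a function_signature.
dart = Node("class_body",
            [Node("method_signature", [Node("function_signature")])])
targets = {"method_signature", "function_signature"}
without_skip = [n.type for n in walk_tree(dart, targets)]
with_skip = [n.type for n in walk_tree(dart, targets, {"method_signature"})]
print(without_skip)  # ['method_signature', 'function_signature']
print(with_skip)     # ['method_signature']
```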


parse_file

Parses any supported language file and returns a list of function and class definitions with their source code.

Example

Input file_path: "/repo/src/config.py"
Input language: "python"

The file contains:

class Settings:
    host = "localhost"

def load_config(path):
    return Settings()

Line 234: parser = get_parser_for_language("python") → Parser object
Line 239: node_types = NODE_TYPES.get("python") → {"function": ["function_definition"], "class": ["class_definition"], "name_field": "name", "params_field": "parameters"}

Line 245: Reads file as bytes
Line 250: tree = parser.parse(source) → AST
Line 252: lines = source.decode(...).splitlines() → ["class Settings:", '    host = "localhost"', "", "def load_config(path):", "    return Settings()"]

Line 255: function_types = {"function_definition"}
Line 256: class_types = {"class_definition"}
Line 257: all_target_types = {"function_definition", "class_definition"}

Line 261: Walk tree, find matching nodes:

Node 1: class_definition → item_type = "class"

  • start_line = 1, end_line = 2
  • name = extract_name(node, "name") → "Settings"
  • code = "class Settings:\n    host = \"localhost\""
  • Appends {"type": "class", "name": "Settings", "start_line": 1, "end_line": 2, "code": "..."}

Node 2: function_definition → item_type = "function"

  • start_line = 4, end_line = 5
  • name = "load_config"
  • code = "def load_config(path):\n    return Settings()"
  • args = extract_params(node, "parameters") → ["path"]
  • Appends {"type": "function", "name": "load_config", "start_line": 4, "end_line": 5, "code": "...", "args": ["path"]}

Return:

[
    {"type": "class", "name": "Settings", "start_line": 1, "end_line": 2, "code": "..."},
    {"type": "function", "name": "load_config", "start_line": 4, "end_line": 5, "code": "...", "args": ["path"]},
]
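
The start_line, end_line, and code values in the result can be reproduced with a short sketch. It assumes the 1-based line numbers come from adding 1 to the 0-based rows that tree-sitter reports in a node's start_point and end_point, and that code is the joined slice of lines. build_item is a hypothetical helper, not a function from code_parser.py, and the row numbers are supplied by hand instead of by a parser.

```python
# Same file content as the example above, as bytes.
SOURCE = (
    b"class Settings:\n"
    b'    host = "localhost"\n'
    b"\n"
    b"def load_config(path):\n"
    b"    return Settings()\n"
)


def build_item(item_type, name, lines, start_row, end_row, args=None):
    """Hypothetical helper: turn 0-based rows into a result dict."""
    item = {
        "type": item_type,
        "name": name,
        "start_line": start_row + 1,  # rows are 0-based, lines are 1-based
        "end_line": end_row + 1,
        "code": "\n".join(lines[start_row:end_row + 1]),
    }
    if item_type == "function":
        item["args"] = args or []     # only functions carry "args"
    return item


lines = SOURCE.decode("utf-8").splitlines()
items = [
    build_item("class", "Settings", lines, 0, 1),
    build_item("function", "load_config", lines, 3, 4, args=["path"]),
]
for item in items:
    print(item["type"], item["name"], item["start_line"], item["end_line"])
```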
