This document describes the high-level architecture of codemod-pilot. It is intended for contributors who want to understand how the system works internally.
codemod-pilot is structured as a Cargo workspace with three crates:
codemod-pilot/
├── crates/
│   ├── codemod-core/        # Core engine (library)
│   ├── codemod-cli/         # CLI frontend (binary)
│   └── codemod-languages/   # Language adapters (library)
The data flows through the system in a pipeline:
User Input ──▶ Pattern Inference ──▶ Codebase Scanning ──▶ Transformation ──▶ Output
(examples)     (codemod-core)        (codemod-core)        (codemod-core)     (codemod-cli)
The core library contains the fundamental algorithms and data structures. It has no CLI or language-specific dependencies.
Key modules:
| Module | Responsibility |
|---|---|
| `pattern` | Infers a structural transformation pattern from before/after AST pairs |
| `matcher` | Finds all occurrences of a pattern in a given AST |
| `transform` | Applies the inferred transformation to matched code |
| `rule` | Parses and serializes `.codemod.yaml` rule files |
| `scanner` | Walks the file system, filters by glob patterns, reads files in parallel |
| `diff` | Generates unified diffs for preview output |
Design principles:
- All AST operations go through a `LanguageAdapter` trait, so the core never depends on a specific tree-sitter grammar
- The core is fully synchronous; parallelism is achieved via `rayon` in the scanner
- All public functions return `Result<T, CoreError>` using `thiserror`
The CLI binary provides the user-facing interface. It depends on codemod-core and codemod-languages.
Subcommands:
| Command | Description |
|---|---|
| `learn` | Accept before/after examples and infer a pattern |
| `scan` | Scan a directory and report all matches |
| `apply` | Apply transformations (with `--preview`, `--execute`, `--rollback`) |
| `export` | Export the current pattern as a `.codemod.yaml` file |
| `validate` | Validate a `.codemod.yaml` rule file |
Design principles:
- Uses `clap` derive API for argument parsing
- All user-facing output goes through a `Printer` abstraction for testability
- Supports `--ci` mode (JSON output, no interactive prompts)
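To illustrate why routing output through a `Printer` helps testability, here is a minimal sketch. The trait method `line`, the `StdoutPrinter` and `BufferPrinter` types, and the `report_matches` function are assumptions made for this example; the actual `Printer` API in codemod-cli may look different.

```rust
use std::io::{self, Write};

/// Hypothetical output abstraction; the real `Printer` API may differ.
pub trait Printer {
    fn line(&mut self, msg: &str);
}

/// Writes to stdout for normal CLI runs.
pub struct StdoutPrinter;

impl Printer for StdoutPrinter {
    fn line(&mut self, msg: &str) {
        // Write errors (for example a closed pipe) are ignored in this sketch.
        let _ = writeln!(io::stdout(), "{msg}");
    }
}

/// Captures output in memory so tests can assert on it.
#[derive(Default)]
pub struct BufferPrinter {
    pub lines: Vec<String>,
}

impl Printer for BufferPrinter {
    fn line(&mut self, msg: &str) {
        self.lines.push(msg.to_string());
    }
}

/// Command logic depends only on the trait, never on stdout directly.
pub fn report_matches(printer: &mut dyn Printer, file: &str, count: usize) {
    printer.line(&format!("{file}: {count} match(es)"));
}
```

A test can pass a `BufferPrinter` and inspect `lines`, while the binary passes a `StdoutPrinter`; a JSON-emitting printer for `--ci` mode could plausibly be a third implementation of the same trait.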
Provides concrete LanguageAdapter implementations backed by tree-sitter grammars.
Design principles:
- Each language is a separate module implementing `LanguageAdapter`
- Languages are registered in a `LanguageRegistry` that maps file extensions to adapters
- Adding a new language requires only implementing the trait and registering it
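The registration flow described above might look roughly like the following sketch. The trait here is a simplified stand-in (the real `LanguageAdapter` wraps tree-sitter parsing), and the method names `name`, `extensions`, `register`, and `for_path`, as well as the `TypeScript` adapter, are assumptions for illustration only.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Simplified stand-in for the real trait, which also exposes parsing.
pub trait LanguageAdapter {
    fn name(&self) -> &'static str;
    fn extensions(&self) -> &'static [&'static str];
}

/// Hypothetical adapter used only for this example.
pub struct TypeScript;

impl LanguageAdapter for TypeScript {
    fn name(&self) -> &'static str {
        "typescript"
    }
    fn extensions(&self) -> &'static [&'static str] {
        &["ts", "tsx"]
    }
}

/// Maps file extensions to registered adapters.
#[derive(Default)]
pub struct LanguageRegistry {
    adapters: Vec<Box<dyn LanguageAdapter>>,
    by_extension: HashMap<&'static str, usize>,
}

impl LanguageRegistry {
    /// Registers an adapter under every extension it declares.
    pub fn register(&mut self, adapter: Box<dyn LanguageAdapter>) {
        let index = self.adapters.len();
        for ext in adapter.extensions() {
            self.by_extension.insert(*ext, index);
        }
        self.adapters.push(adapter);
    }

    /// Looks up the adapter for a path by its file extension.
    pub fn for_path(&self, path: &str) -> Option<&dyn LanguageAdapter> {
        let ext = Path::new(path).extension()?.to_str()?;
        let index = *self.by_extension.get(ext)?;
        Some(self.adapters[index].as_ref())
    }
}
```

With this shape, the core only ever sees `&dyn LanguageAdapter`, which is what keeps it independent of any particular grammar.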
Represents an inferred transformation pattern.
Pattern {
before_template: AstTemplate, // Generalized AST with placeholders
after_template: AstTemplate, // Target AST with same placeholders
variables: Vec<PatternVar>, // Named placeholders ($id, $expr, etc.)
language: LanguageId, // Which language this pattern targets
}
A tree structure that mirrors tree-sitter's concrete syntax tree (CST) but with placeholder nodes where pattern variables appear.
AstTemplate {
kind: NodeKind, // Either Concrete("identifier") or Variable("$name")
children: Vec<AstTemplate>,
text: Option<String>, // Leaf node text (None for inner nodes)
}
A found occurrence of a pattern in a source file.
Match {
file_path: PathBuf,
byte_range: Range<usize>,
line_range: Range<usize>,
bindings: HashMap<String, String>, // $variable -> captured text
original_text: String,
transformed_text: String,
}
A serializable codemod rule (stored as .codemod.yaml).
Rule {
name: String,
description: String,
language: LanguageId,
version: String,
pattern: PatternDef, // before/after strings
include: Vec<GlobPattern>,
exclude: Vec<GlobPattern>,
examples: Vec<Example>, // For validation
}
before_code ──▶ parse(AST₁) ──┐
                              ├──▶ structural_diff(AST₁, AST₂) ──▶ generalize() ──▶ Pattern
after_code  ──▶ parse(AST₂) ──┘
The inference algorithm:
- Parse both snippets into tree-sitter CSTs
- Walk both trees in parallel, comparing node types and text
- Where nodes differ, create a pattern variable
- Where nodes are identical, keep them as concrete template nodes
- Validate that all variables in `after` also appear in `before` (no invented variables)
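The final validation step can be sketched on its own. The types below mirror the `AstTemplate` shape shown earlier; the function `invented_variables` and the `var` and `node` constructors are hypothetical helpers introduced for this example, not names taken from the codebase.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, PartialEq)]
pub enum NodeKind {
    Concrete(String),
    Variable(String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct AstTemplate {
    pub kind: NodeKind,
    pub children: Vec<AstTemplate>,
    pub text: Option<String>,
}

/// Collects the names of all placeholder nodes in a template.
fn collect_variables(template: &AstTemplate, out: &mut BTreeSet<String>) {
    if let NodeKind::Variable(name) = &template.kind {
        out.insert(name.clone());
    }
    for child in &template.children {
        collect_variables(child, out);
    }
}

/// Returns the variables used in `after` that `before` never binds,
/// in sorted order. An empty result means the pattern is valid.
pub fn invented_variables(before: &AstTemplate, after: &AstTemplate) -> Vec<String> {
    let mut bound = BTreeSet::new();
    let mut used = BTreeSet::new();
    collect_variables(before, &mut bound);
    collect_variables(after, &mut used);
    used.difference(&bound).cloned().collect()
}

/// Builds a placeholder node (example helper).
pub fn var(name: &str) -> AstTemplate {
    AstTemplate { kind: NodeKind::Variable(name.to_string()), children: Vec::new(), text: None }
}

/// Builds a concrete inner node (example helper).
pub fn node(kind: &str, children: Vec<AstTemplate>) -> AstTemplate {
    AstTemplate { kind: NodeKind::Concrete(kind.to_string()), children, text: None }
}
```

An `after` template may drop variables that `before` binds, but any variable it introduces has no captured text to substitute, so inference should reject it.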
target_dir ──▶ walk_files() ──▶ filter(globs) ──▶ par_iter() ──▶ parse + match ──▶ Vec<Match>
Scanning uses walkdir for traversal, globset for filtering, and rayon for parallel processing. Each file is independently parsed and matched.
Match ──▶ substitute(after_template, bindings) ──▶ transformed_text
For each match, the after_template is instantiated by replacing pattern variables with their captured bindings from the match.
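As a rough illustration of instantiation, the sketch below flattens the template into a sequence of literal and variable segments instead of walking a full `AstTemplate` tree. That flattening, the `Segment` type, and the `String` error type are simplifying assumptions for this example.

```rust
use std::collections::HashMap;

/// Flattened template piece: either fixed text or a placeholder.
pub enum Segment {
    Literal(String),
    Variable(String),
}

/// Replaces each variable with its captured text.
/// Fails if a variable has no binding instead of panicking.
pub fn substitute(
    template: &[Segment],
    bindings: &HashMap<String, String>,
) -> Result<String, String> {
    let mut out = String::new();
    for segment in template {
        match segment {
            Segment::Literal(text) => out.push_str(text),
            Segment::Variable(name) => {
                let value = bindings
                    .get(name)
                    .ok_or_else(|| format!("unbound variable {name}"))?;
                out.push_str(value);
            }
        }
    }
    Ok(out)
}
```

For example, a template of `logger.info(`, `$msg`, `)` with `$msg` bound to `"hi"` yields `logger.info("hi")`.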
Vec<Match> ──▶ sort_by_file_and_offset() ──▶ apply_in_reverse_order() ──▶ write_files()
                                                                      ──▶ generate_rollback_patch()
Matches within the same file are applied in reverse byte-offset order to avoid invalidating earlier offsets. A rollback patch (unified diff) is always generated before writing.
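The reason for reverse order is easiest to see in code. The sketch below uses a pared-down `Edit` type (a hypothetical stand-in for the relevant fields of `Match`) and assumes the ranges do not overlap and fall on UTF-8 character boundaries; `String::replace_range` panics otherwise, so real code would validate first.

```rust
use std::ops::Range;

/// Minimal stand-in for the parts of `Match` needed to rewrite a file.
pub struct Edit {
    pub byte_range: Range<usize>,
    pub replacement: String,
}

/// Applies edits from the highest start offset to the lowest.
/// Replacing a later range never shifts the bytes before it,
/// so the offsets of the remaining edits stay valid.
pub fn apply_in_reverse_order(source: &str, mut edits: Vec<Edit>) -> String {
    edits.sort_by(|a, b| b.byte_range.start.cmp(&a.byte_range.start));
    let mut out = source.to_string();
    for edit in edits {
        out.replace_range(edit.byte_range, &edit.replacement);
    }
    out
}
```

Applying the same edits front to back would require adjusting every later offset by the length difference of each replacement already made.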
- `codemod-core`: Uses `thiserror` with a `CoreError` enum. All functions return `Result<T, CoreError>`.
- `codemod-cli`: Uses `anyhow` for ergonomic error propagation. Errors are formatted for human-readable output.
- Panics: The codebase should never panic in release mode. All potential panics are converted to `Result` errors.
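A small sketch of the "convert panics to errors" rule follows. The `InvalidRange` variant and the `slice_source` function are hypothetical, and `Display` is implemented by hand here to keep the example dependency-free, whereas the core derives it with `thiserror`.

```rust
use std::fmt;
use std::ops::Range;

/// Illustrative error enum; the real `CoreError` variants are not shown here.
#[derive(Debug)]
pub enum CoreError {
    InvalidRange { start: usize, end: usize },
}

impl fmt::Display for CoreError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CoreError::InvalidRange { start, end } => {
                write!(f, "invalid byte range {start}..{end}")
            }
        }
    }
}

impl std::error::Error for CoreError {}

/// Indexing with `&source[range]` would panic on a bad range;
/// `str::get` returns `None` instead, which is mapped to an error.
pub fn slice_source(source: &str, range: Range<usize>) -> Result<&str, CoreError> {
    let (start, end) = (range.start, range.end);
    source
        .get(range)
        .ok_or(CoreError::InvalidRange { start, end })
}
```

Because `CoreError` implements `std::error::Error`, the CLI can propagate it through `anyhow` with `?`.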
- File scanning: `par_iter()` from `rayon` over files; each file is processed independently
- Pattern matching: Single-threaded within a file (tree-sitter is not thread-safe per parser instance)
- File writing: Sequential to avoid data races on the file system
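The per-file parallelism can be sketched with the standard library alone. This example swaps in `std::thread::scope` for `rayon` and a substring count for real parsing and matching, and it spawns one thread per file, which a work-stealing pool would avoid. The names `scan_parallel` and `count_matches` are made up for the illustration.

```rust
use std::thread;

/// Stand-in for per-file parse + match: counts occurrences of `needle`.
/// In the real scanner each worker would own its own parser instance.
fn count_matches(source: &str, needle: &str) -> usize {
    source.matches(needle).count()
}

/// Processes every (path, source) pair on its own scoped thread and
/// returns results in the same order as the input.
pub fn scan_parallel(files: &[(String, String)], needle: &str) -> Vec<(String, usize)> {
    thread::scope(|scope| {
        let handles: Vec<_> = files
            .iter()
            .map(|(path, source)| {
                scope.spawn(move || (path.clone(), count_matches(source, needle)))
            })
            .collect();
        handles
            .into_iter()
            // A production version would turn a worker panic into an error.
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

No state is shared between workers beyond read-only input, which is what makes the per-file step safe to parallelize while writing stays sequential.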
| Level | Location | Framework |
|---|---|---|
| Unit tests | `src/*.rs` (`#[cfg(test)]`) | Built-in |
| Integration tests | `crates/*/tests/` | Built-in + `insta` |
| Snapshot tests | Transformation output | `insta` (YAML snapshots) |
| End-to-end tests | `tests/` (workspace root) | `assert_cmd` + `tempfile` |
- Plugin system: Language adapters may move to dynamic loading (`.so`/`.dylib`) for v1.0
- LSP server: A `codemod-lsp` crate may be added for VS Code extension support
- WASM target: Core may compile to WASM for the web playground