analysis module_detector
Groups files into logical modules by detecting wrapper directories (like src/) and finding the right directory depth for module boundaries.
| Term | Definition | Example |
|---|---|---|
| strongly connected component | A group of vertices where every vertex can reach every other vertex. These form cycle groups. | If A→B→C→A, then {A, B, C} is one strongly connected component. |
| blast radius | All files that would be affected if a given file changes — found by following reverse import edges transitively. | If A imports B and C imports A, changing B has blast radius = {A, C}. |
| embedding | A numerical vector (list of numbers) that represents the meaning of text. Similar text → similar vectors. | The code def add(a, b): return a+b might become [0.12, -0.45, 0.78, ...] (1536 numbers for OpenAI). |
| chunk | A piece of source code (usually one function or class) stored as a unit for search. | The function def scan_directory(root): ... (20 lines) is one chunk. |
| AST | Abstract Syntax Tree: a tree representation of source code structure, where each node is a language construct (function, class, if-statement, etc.). | def add(a, b): return a+b becomes a tree: FunctionDef → [args: a, b] → [body: Return → BinOp(a + b)]. |
| RAG | Retrieval-Augmented Generation — instead of asking an LLM to answer from memory, first retrieve relevant documents, then include them in the prompt. | Question: "What does scan_directory do?" → retrieve the source code of scan_directory → include it in the LLM prompt → get an accurate answer. |
| diff | The set of changes between two versions of code, showing added (+) and removed (-) lines. | - old_line followed by + new_line shows that old_line was replaced with new_line. |
| hunk | A contiguous block of changes within a diff. One diff can contain multiple hunks (changes in different parts of a file). | A diff might have hunk 1 (lines 10-15 changed) and hunk 2 (lines 80-85 changed). |
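
The blast radius definition in the table can be illustrated with a short breadth-first search over reversed import edges. This is a minimal sketch, not the project's implementation: the function name `blast_radius_sketch` and the `imports` mapping (file → list of files it imports) are assumptions made for illustration.

```python
from collections import defaultdict, deque


def blast_radius_sketch(imports, changed):
    """Return every file that transitively imports `changed`."""
    # Invert the edges: for each file, who imports it?
    reverse = defaultdict(set)
    for importer, targets in imports.items():
        for target in targets:
            reverse[target].add(importer)

    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for importer in reverse[current]:
            if importer != changed and importer not in affected:
                affected.add(importer)
                queue.append(importer)
    return affected


# The glossary example: A imports B, C imports A.
print(blast_radius_sketch({"A": ["B"], "C": ["A"], "B": []}, "B"))
```

For the glossary example this yields the set containing A and C.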
Source: src/codewalk/analysis/module_detector.py
_WRAPPER_DIRS is the set of directory names treated as "wrappers": they don't represent real modules, they're just organizational containers:
{"src", "lib", "app", "source", "packages", "pkg", "internal", "cmd", "main"}

_find_source_root finds wrapper directories to strip before detecting modules. It walks down the directory tree, stripping one level at a time when either the most common first-level directory is a known wrapper holding at least 50% of the files that have a directory component, or it is the only first-level directory.
Input file_paths: [
"src/codewalk/config.py",
"src/codewalk/pipeline.py",
"src/codewalk/analysis/blast_radius.py",
"README.md",
]
Line 25: prefix_parts = []
Line 26: remaining = ["src/codewalk/config.py", "src/codewalk/pipeline.py", "src/codewalk/analysis/blast_radius.py", "README.md"]
--- Iteration 1 (depth 0) ---
Lines 30–35: Count first-level directories:
- "src/codewalk/config.py" → parts = ["src", "codewalk", "config.py"], len > 1 ✓ → dir_counts["src"] += 1
- "src/codewalk/pipeline.py" → dir_counts["src"] += 1
- "src/codewalk/analysis/blast_radius.py" → dir_counts["src"] += 1
- "README.md" → parts = ["README.md"], len == 1 → skip

file_with_dirs = 3
Line 37: top_dir = "src", top_count = 3
Line 42: wrapper = "src" in _WRAPPER_DIRS → True
Line 43: single = len(dir_counts) == 1 → True (only "src")
Line 45: (3/3 >= 0.5 and True) or True → True → strip!
Line 46: prefix_parts = ["src"]
Line 47: prefix = "src/"
Line 48–50: remaining = ["codewalk/config.py", "codewalk/pipeline.py", "codewalk/analysis/blast_radius.py"] (README.md dropped — doesn't start with "src/")
--- Iteration 2 (depth 1) ---
Lines 30–35: Count first-level dirs in remaining:
- All 3 files → dir_counts["codewalk"] = 3, file_with_dirs = 3
Line 37: top_dir = "codewalk", top_count = 3
Line 42: wrapper = "codewalk" in _WRAPPER_DIRS → False
Line 43: single = len(dir_counts) == 1 → True
Line 45: (False) or True → True → strip!
Line 46: prefix_parts = ["src", "codewalk"]
Line 47: prefix = "src/codewalk/"
Line 48–50: remaining = ["config.py", "pipeline.py", "analysis/blast_radius.py"]
--- Iteration 3 (depth 2) ---
Lines 30–35: Count first-level dirs:
- "config.py" → len == 1 → skip
- "pipeline.py" → len == 1 → skip
- "analysis/blast_radius.py" → dir_counts["analysis"] = 1, file_with_dirs = 1
Line 37: top_dir = "analysis", top_count = 1
Line 42: wrapper = "analysis" in _WRAPPER_DIRS → False
Line 43: single = len(dir_counts) == 1 → True (only "analysis")
Line 45: (1/1 >= 0.5 and False) or True → True → strip!
Line 46: prefix_parts = ["src", "codewalk", "analysis"]
Line 47: prefix = "src/codewalk/analysis/"
Line 48–50: remaining = ["blast_radius.py"] (config.py and pipeline.py dropped because they don't start with "src/codewalk/analysis/")
--- Iteration 4 (depth 3) ---
Lines 30–35: "blast_radius.py" → len == 1 → skip, so dir_counts is empty → break.
Return: "src/codewalk/analysis"
This strips one level too many. file_with_dirs counts only the files that sit inside a subdirectory, so the two files directly under src/codewalk/ do not prevent the single check from firing: "analysis" is the only subdirectory name, even though it holds just 1 of the 3 remaining files. In practice, with several modules under src/codewalk/ (analysis, embeddings, rag, etc.), single is False and the loop stops at "src/codewalk".
Example with a realistic file set:
file_paths: [
"src/codewalk/config.py",
"src/codewalk/analysis/blast_radius.py",
"src/codewalk/embeddings/chunker.py",
"src/codewalk/rag/pipeline.py",
]
At depth 2 (after stripping src/codewalk): remaining = ["config.py", "analysis/blast_radius.py", "embeddings/chunker.py", "rag/pipeline.py"]. First-level dirs: dir_counts = {"analysis": 1, "embeddings": 1, "rag": 1}, single = False, wrapper = False → (False) or False → break.
Return: "src/codewalk"
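
The stripping loop traced above can be condensed into a short sketch. It is reconstructed only from the trace (the counting step, the condition `(top_count / file_with_dirs >= 0.5 and wrapper) or single`, and the prefix filter), so it may differ from the actual source; the name `find_source_root_sketch` is hypothetical.

```python
from collections import Counter

# Wrapper names copied from the set listed earlier in this page.
WRAPPER_DIRS = {"src", "lib", "app", "source", "packages", "pkg", "internal", "cmd", "main"}


def find_source_root_sketch(file_paths):
    prefix_parts = []
    remaining = list(file_paths)
    while True:
        # Count the first-level directory of every path that has one.
        dir_counts = Counter()
        for path in remaining:
            parts = path.split("/")
            if len(parts) > 1:
                dir_counts[parts[0]] += 1
        if not dir_counts:
            break  # only bare file names are left
        file_with_dirs = sum(dir_counts.values())
        top_dir, top_count = dir_counts.most_common(1)[0]
        wrapper = top_dir in WRAPPER_DIRS
        single = len(dir_counts) == 1
        if not ((top_count / file_with_dirs >= 0.5 and wrapper) or single):
            break
        # Strip one more level and keep only the files under the new prefix.
        prefix_parts.append(top_dir)
        prefix = "/".join(prefix_parts) + "/"
        remaining = [p[len(prefix):] for p in file_paths if p.startswith(prefix)]
    return "/".join(prefix_parts)


small = [
    "src/codewalk/config.py",
    "src/codewalk/pipeline.py",
    "src/codewalk/analysis/blast_radius.py",
    "README.md",
]
realistic = [
    "src/codewalk/config.py",
    "src/codewalk/analysis/blast_radius.py",
    "src/codewalk/embeddings/chunker.py",
    "src/codewalk/rag/pipeline.py",
]
print(find_source_root_sketch(small))      # src/codewalk/analysis
print(find_source_root_sketch(realistic))  # src/codewalk
```

Running both inputs reproduces the two results from the walkthrough, including the over-stripping on the small file set.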
_find_module_depth finds the directory depth that represents the module boundary. It looks for the level where child folder names repeat across different parent directories (more than 50% of the child names shared), which signals that the children are internal structure rather than separate modules.
Input file_paths: [
"features/auth/bloc/auth_bloc.dart",
"features/auth/ui/auth_screen.dart",
"features/home/bloc/home_bloc.dart",
"features/home/ui/home_screen.dart",
]
Input source_root: ""
Line 77: stripped = ["features/auth/bloc/auth_bloc.dart", "features/auth/ui/auth_screen.dart", "features/home/bloc/home_bloc.dart", "features/home/ui/home_screen.dart"] (no source_root to strip)
Line 88: best_depth = 1
--- depth = 1 ---
Lines 91–95: Collect names at depth 1:
- All paths have parts[0] = "features" → names_at_depth = ["features", "features", "features", "features"]
Line 97: unique = 1 — only one unique name
Lines 100–107: Build parent_to_children:
- All paths have len(parts) > 2 ✓
- Parent = parts[:1] = "features", children: "auth", "auth", "home", "home"
- parent_to_children = {"features": {"auth", "home"}}
- Only 1 parent → len(parent_to_children) >= 2 is False → cross_parent_repeat = 0
Line 114: unique >= 3? 1 >= 3? No → skip both branches
--- depth = 2 ---
Lines 91–95: Names at depth 2:
- "features/auth", "features/auth", "features/home", "features/home"
Line 97: unique = 2 — "features/auth" and "features/home"
Lines 100–107: parent_to_children:
- Parent "features/auth" → children: {"bloc", "ui"}
- Parent "features/home" → children: {"bloc", "ui"}
- 2 parents ✓
- all_children = ["bloc", "ui", "bloc", "ui"]
- child_counts = {"bloc": 2, "ui": 2}
- repeated_children = 2 (both appear ≥ 2 times)
- total_unique_children = 2
- cross_parent_repeat = 2/2 = 1.0
Line 114: unique >= 3? 2 >= 3? No → best_depth = 2 (candidate, but doesn't enter the break branch)
--- depth = 3 ---
Lines 91–95: names_at_depth = ["features/auth/bloc", "features/auth/ui", "features/home/bloc", "features/home/ui"]
Line 97: unique = 4
Lines 100–107: Parents at depth 3, children at depth 4:
- Parts are length 4, so len(parts) > 4? No → parent_to_children stays empty
- cross_parent_repeat = 0
Line 114: unique >= 3? 4 >= 3? Yes. cross_parent_repeat > 0.5? 0 > 0.5? No → enters elif: best_depth = 3 (candidate)
At depth 2 the algorithm already found cross_parent_repeat = 1.0, but unique was only 2, so the break branch did not trigger. At depth 3, unique >= 3 holds but there is no cross-parent repeat, so best_depth is set to 3.
With more features (3 or more unique modules at depth 2), the algorithm would see unique >= 3 and cross_parent_repeat > 0.5 at depth 2 → best_depth = 2 → break.
Return: 2 (with ≥ 3 features) or 3 (with only 2 features, as in this small example)
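
The cross_parent_repeat ratio used in the trace above can be computed in isolation. The following is a minimal sketch of that one step, assuming the definitions shown in the walkthrough (parents are the first `depth` path parts, children are the next part, and only paths with at least one more part below the child are counted); the function name is hypothetical, and the branch logic that picks best_depth is deliberately left out.

```python
from collections import Counter, defaultdict


def cross_parent_repeat_sketch(paths, depth):
    """Share of child folder names that occur under two or more parents."""
    parent_to_children = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        if len(parts) > depth + 1:
            parent = "/".join(parts[:depth])
            parent_to_children[parent].add(parts[depth])

    # With fewer than two parents nothing can repeat across parents.
    if len(parent_to_children) < 2:
        return 0.0

    all_children = [c for children in parent_to_children.values() for c in children]
    child_counts = Counter(all_children)
    repeated_children = sum(1 for n in child_counts.values() if n >= 2)
    return repeated_children / len(child_counts)


feature_paths = [
    "features/auth/bloc/auth_bloc.dart",
    "features/auth/ui/auth_screen.dart",
    "features/home/bloc/home_bloc.dart",
    "features/home/ui/home_screen.dart",
]
for d in (1, 2, 3):
    print(d, cross_parent_repeat_sketch(feature_paths, d))
```

For the four Dart paths this gives 0.0 at depth 1, 1.0 at depth 2, and 0.0 at depth 3, matching the values in the trace.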
_assign_modules assigns each file to a module based on its path components up to the detected depth.
Input files: [
{"file_path": "src/codewalk/analysis/blast_radius.py", "language": "python"},
{"file_path": "src/codewalk/analysis/code_parser.py", "language": "python"},
{"file_path": "src/codewalk/config.py", "language": "python"},
{"file_path": "README.md", "language": "markdown"},
]
Input source_root: "src/codewalk"
Input module_depth: 1
Line 131: modules = defaultdict(...) — auto-creates module entries
Iteration 1: file_path = "src/codewalk/analysis/blast_radius.py"
- Line 137: Starts with "src/codewalk/" ✓ → relative_path = "analysis/blast_radius.py", depth = 1
- Line 144: parts = ["analysis", "blast_radius.py"], len(parts) = 2 > 1 ✓
- Line 145: module_name = "analysis"
- Appends to modules["analysis"]
Iteration 2: file_path = "src/codewalk/analysis/code_parser.py"
- Same directory as iteration 1 → module_name = "analysis"
Iteration 3: file_path = "src/codewalk/config.py"
- relative_path = "config.py", parts = ["config.py"], len(parts) = 1 → not > depth
- len(parts) > 1? No → Line 149: module_name = "root"
Iteration 4: file_path = "README.md"
- Doesn't start with "src/codewalk/" → relative_path = "README.md", depth = 1
- parts = ["README.md"], len = 1, not > 1 → module_name = "root"
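
The four iterations above can be reproduced with a sketch like the following. It is an approximation built from the trace, not the actual source: the function name is hypothetical, and the middle branch (a nested file that is shallower than the module depth falls back to its own directory) is an assumption, since the trace never exercises it.

```python
from collections import Counter, defaultdict


def assign_modules_sketch(files, source_root, module_depth):
    """Group files by the first `module_depth` path parts below source_root."""
    modules = defaultdict(lambda: {"files": [], "languages": Counter(), "file_count": 0})
    prefix = source_root + "/" if source_root else ""
    for entry in files:
        path = entry["file_path"]
        # Files outside the source root keep their full path.
        relative = path[len(prefix):] if prefix and path.startswith(prefix) else path
        parts = relative.split("/")
        if len(parts) > module_depth:
            name = "/".join(parts[:module_depth])
        elif len(parts) > 1:
            # Assumption: shallower nested files are grouped by their directory.
            name = "/".join(parts[:-1])
        else:
            name = "root"
        module = modules[name]
        module["files"].append(path)
        module["languages"][entry["language"]] += 1
        module["file_count"] += 1
    return dict(modules)


example_files = [
    {"file_path": "src/codewalk/analysis/blast_radius.py", "language": "python"},
    {"file_path": "src/codewalk/analysis/code_parser.py", "language": "python"},
    {"file_path": "src/codewalk/config.py", "language": "python"},
    {"file_path": "README.md", "language": "markdown"},
]
result = assign_modules_sketch(example_files, "src/codewalk", 1)
print(sorted(result))  # ['analysis', 'root']
```
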
Return:
{
"analysis": {"files": ["src/codewalk/analysis/blast_radius.py", "src/codewalk/analysis/code_parser.py"], "languages": Counter({"python": 2}), "file_count": 2},
"root": {"files": ["src/codewalk/config.py", "README.md"], "languages": Counter({"python": 1, "markdown": 1}), "file_count": 2},
}

Top-level function that orchestrates module detection: finds source root, optimal depth, assigns files to modules, and builds a module-level dependency graph.
Input files: [
{"file_path": "src/codewalk/analysis/blast_radius.py", "language": "python"},
{"file_path": "src/codewalk/analysis/code_parser.py", "language": "python"},
{"file_path": "src/codewalk/embeddings/chunker.py", "language": "python"},
{"file_path": "src/codewalk/config.py", "language": "python"},
]
Input dep_graph: {
"graph": {
"src/codewalk/analysis/blast_radius.py": [],
"src/codewalk/analysis/code_parser.py": [],
"src/codewalk/embeddings/chunker.py": ["src/codewalk/config.py"],
"src/codewalk/config.py": [],
}
}
Line 202: file_paths = ["src/codewalk/analysis/blast_radius.py", ..., "src/codewalk/config.py"]
Step 1 (line 205): source_root = _find_source_root(file_paths) → "src/codewalk"
Step 2 (line 208): module_depth = _find_module_depth(file_paths, "src/codewalk") → 1
Step 3 (line 211): modules = _assign_modules(files, "src/codewalk", 1):
{
"analysis": {"files": [...blast_radius.py, ...code_parser.py], "file_count": 2, ...},
"embeddings": {"files": [...chunker.py], "file_count": 1, ...},
"root": {"files": [...config.py], "file_count": 1, ...},
}

Line 214: len(modules) = 3, not > 20 → no fallback
Line 220: Convert Counter to dict for JSON
Step 4 (lines 223–238): Build module dependency graph:
- file_to_module = {"src/codewalk/analysis/blast_radius.py": "analysis", "src/codewalk/analysis/code_parser.py": "analysis", "src/codewalk/embeddings/chunker.py": "embeddings", "src/codewalk/config.py": "root"}
- Module "analysis": files have no deps → deps = set() → module_graph["analysis"] = []
- Module "embeddings": chunker.py depends on config.py → config.py is in module "root" → deps = {"root"} → module_graph["embeddings"] = ["root"]
- Module "root": config.py has no deps → module_graph["root"] = []
Return:
{
"source_root": "src/codewalk",
"modules": {
"analysis": {"files": [...], "languages": {"python": 2}, "file_count": 2},
"embeddings": {"files": [...], "languages": {"python": 1}, "file_count": 1},
"root": {"files": [...], "languages": {"python": 1}, "file_count": 1},
},
"module_graph": {
"analysis": [],
"embeddings": ["root"],
"root": [],
},
"stats": {"total_modules": 3, "total_files": 4},
}
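
Step 4 (lifting file-level imports to module-level edges) can be sketched as below. The function name is hypothetical, and two details are assumptions rather than facts from the source: edges between files of the same module are dropped, and dependency lists are sorted for stable output.

```python
def build_module_graph_sketch(modules, dep_graph):
    # Map every file to the module that owns it.
    file_to_module = {
        path: name for name, info in modules.items() for path in info["files"]
    }
    module_graph = {}
    for name, info in modules.items():
        deps = set()
        for path in info["files"]:
            for target in dep_graph["graph"].get(path, []):
                target_module = file_to_module.get(target)
                # Assumption: skip unknown files and edges inside one module.
                if target_module is not None and target_module != name:
                    deps.add(target_module)
        module_graph[name] = sorted(deps)
    return module_graph


example_modules = {
    "analysis": {"files": [
        "src/codewalk/analysis/blast_radius.py",
        "src/codewalk/analysis/code_parser.py",
    ]},
    "embeddings": {"files": ["src/codewalk/embeddings/chunker.py"]},
    "root": {"files": ["src/codewalk/config.py"]},
}
example_dep_graph = {
    "graph": {
        "src/codewalk/analysis/blast_radius.py": [],
        "src/codewalk/analysis/code_parser.py": [],
        "src/codewalk/embeddings/chunker.py": ["src/codewalk/config.py"],
        "src/codewalk/config.py": [],
    }
}
module_graph = build_module_graph_sketch(example_modules, example_dep_graph)
print(module_graph)
```

With the inputs from the walkthrough, only "embeddings" gains an edge, pointing at "root".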