analysis module_detector

analysis/module_detector.py

Groups files into logical modules by detecting wrapper directories (like src/) and finding the right directory depth for module boundaries.

Key Concepts

| Term | Definition | Example |
| --- | --- | --- |
| strongly connected component | A group of vertices in a directed graph where every vertex can reach every other vertex. These form cycle groups. | If A→B→C→A, then {A, B, C} is one strongly connected component. |
| blast radius | All files that would be affected if a given file changes, found by following reverse import edges transitively. | If A imports B and C imports A, changing B has blast radius = {A, C}. |
| embedding | A numerical vector (list of numbers) that represents the meaning of text. Similar text → similar vectors. | The code `def add(a, b): return a+b` might become `[0.12, -0.45, 0.78, ...]` (1536 numbers for some OpenAI embedding models). |
| chunk | A piece of source code (usually one function or class) stored as a unit for search. | The function `def scan_directory(root): ...` (20 lines) is one chunk. |
| AST | Abstract Syntax Tree: a tree representation of source code structure, where each node is a language construct (function, class, if-statement, etc.). | `def add(a, b): return a+b` becomes a tree: FunctionDef → [args: a, b] → [body: Return → BinOp(a + b)]. |
| RAG | Retrieval-Augmented Generation: instead of asking an LLM to answer from memory, first retrieve relevant documents, then include them in the prompt. | Question: `"What does scan_directory do?"` → retrieve the source code of scan_directory → include it in the LLM prompt → get an accurate answer. |
| diff | The set of changes between two versions of code, showing added (+) and removed (-) lines. | `- old_line\n+ new_line` shows `old_line` was replaced with `new_line`. |
| hunk | A contiguous block of changes within a diff. One diff can contain multiple hunks (changes in different parts of a file). | A diff might have hunk 1 (lines 10-15 changed) and hunk 2 (lines 80-85 changed). |
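
The blast radius definition above describes a transitive walk over reverse import edges. A minimal sketch of that idea follows; the function name `blast_radius` and the `imports` mapping shape (file → list of files it imports) are assumptions made for illustration and are not taken from the codewalk source.

```python
from collections import defaultdict, deque


def blast_radius(imports, changed):
    """Return every file that transitively imports `changed`.

    `imports` maps a file to the files it imports (hypothetical shape).
    """
    # Invert the import edges: imported file -> files that import it.
    reverse = defaultdict(set)
    for importer, imported in imports.items():
        for target in imported:
            reverse[target].add(importer)

    # Breadth-first walk over the reverse edges, starting at the changed file.
    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in reverse[current]:
            if dependent != changed and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# The example from the table: A imports B, and C imports A.
example_imports = {"A": ["B"], "B": [], "C": ["A"]}
radius = blast_radius(example_imports, "B")
```

Here `radius` contains A and C, matching the table's example.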

Source: src/codewalk/analysis/module_detector.py

Constants

`_WRAPPER_DIRS` (line 8)

Set of directory names that are "wrappers" — they don't represent real modules, they're just organizational containers:

{"src", "lib", "app", "source", "packages", "pkg", "internal", "cmd", "main"}

`_find_source_root`

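Finds wrapper directories to strip before detecting modules. Walks down the directory tree, stripping one level at a time while either of two conditions holds: the most common first-level directory is a known wrapper dir and holds at least 50% of the files that have a directory component (the 0.5 threshold in the line 45 check traced below), or it is the only first-level directory among the remaining files.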

Example

Input file_paths: [
    "src/codewalk/config.py",
    "src/codewalk/pipeline.py",
    "src/codewalk/analysis/blast_radius.py",
    "README.md",
]

Line 25: prefix_parts = []
Line 26: remaining = ["src/codewalk/config.py", "src/codewalk/pipeline.py", "src/codewalk/analysis/blast_radius.py", "README.md"]

--- Iteration 1 (depth 0) ---

Lines 30–35: Count first-level directories:

"src/codewalk/config.py" → parts = ["src", "codewalk", "config.py"], len > 1 ✓ → dir_counts["src"] += 1
"src/codewalk/pipeline.py" → dir_counts["src"] += 1
"src/codewalk/analysis/blast_radius.py" → dir_counts["src"] += 1
"README.md" → parts = ["README.md"], len == 1 → skip
file_with_dirs = 3

Line 37: top_dir = "src", top_count = 3

Line 42: wrapper = "src" in _WRAPPER_DIRS → True
Line 43: single = len(dir_counts) == 1 → True (only "src")

Line 45: (3/3 >= 0.5 and True) or True → True → strip!
Line 46: prefix_parts = ["src"]
Line 47: prefix = "src/"
Line 48–50: remaining = ["codewalk/config.py", "codewalk/pipeline.py", "codewalk/analysis/blast_radius.py"] (README.md dropped — doesn't start with "src/")

--- Iteration 2 (depth 1) ---

Lines 30–35: Count first-level dirs in remaining:

All 3 files → dir_counts["codewalk"] = 3, file_with_dirs = 3

Line 37: top_dir = "codewalk", top_count = 3

Line 42: wrapper = "codewalk" in _WRAPPER_DIRS → False
Line 43: single = len(dir_counts) == 1 → True

Line 45: (False) or True → True → strip!
Line 46: prefix_parts = ["src", "codewalk"]
Line 47: prefix = "src/codewalk/"
Line 48–50: remaining = ["config.py", "pipeline.py", "analysis/blast_radius.py"]

--- Iteration 3 (depth 2) ---

Lines 30–35: Count first-level dirs:

"config.py" → len == 1 → skip
"pipeline.py" → len == 1 → skip
"analysis/blast_radius.py" → dir_counts["analysis"] = 1, file_with_dirs = 1

Line 37: top_dir = "analysis", top_count = 1

Line 42: wrapper = False, single = True

Line 45: (1/1 >= 0.5 and False) or True → (True and False) or True → True → strip!
Line 46: prefix_parts = ["src", "codewalk", "analysis"]
Line 47: prefix = "src/codewalk/analysis/"
Line 48–50: remaining = ["blast_radius.py"] (config.py and pipeline.py dropped, since they don't start with "src/codewalk/analysis/")

--- Iteration 4 (depth 3) ---

Lines 30–35: "blast_radius.py" → len == 1 → skip, so dir_counts is empty → break.

Return: "src/codewalk/analysis"

This strips too much. single = True fires because "analysis" is the only subdirectory name, even though only 1 of the 3 remaining files lives under it. In practice, with more files (many modules under src/codewalk/), single is False (multiple subdirs: analysis, embeddings, rag, etc.), so the loop stops at "src/codewalk".

Example with more realistic files:

file_paths: [
    "src/codewalk/config.py",
    "src/codewalk/analysis/blast_radius.py",
    "src/codewalk/embeddings/chunker.py",
    "src/codewalk/rag/pipeline.py",
]

At depth 2 (after stripping src/codewalk): remaining = ["config.py", "analysis/blast_radius.py", "embeddings/chunker.py", "rag/pipeline.py"]. First-level dirs: dir_counts = {"analysis": 1, "embeddings": 1, "rag": 1}, single = False, wrapper = False → (False) or False → break.

Return: "src/codewalk"
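
The loop traced in the two examples above can be sketched as follows. This is a reconstruction from the walkthrough, not the actual source: the names `find_source_root_sketch` and `WRAPPER_DIRS`, and the assumption that each stripped level removes the prefix from the remaining paths, are illustrative.

```python
from collections import Counter

# Copied from the Constants section; the real constant is named _WRAPPER_DIRS.
WRAPPER_DIRS = {"src", "lib", "app", "source", "packages", "pkg", "internal", "cmd", "main"}


def find_source_root_sketch(file_paths, threshold=0.5):
    """Strip leading directories while the walkthrough's condition holds."""
    prefix_parts = []
    remaining = list(file_paths)
    while True:
        # Count the first-level directory of every path that has one.
        dir_counts = Counter()
        for path in remaining:
            parts = path.split("/")
            if len(parts) > 1:
                dir_counts[parts[0]] += 1
        if not dir_counts:
            break  # only bare file names are left

        files_with_dirs = sum(dir_counts.values())
        top_dir, top_count = dir_counts.most_common(1)[0]
        wrapper = top_dir in WRAPPER_DIRS
        single = len(dir_counts) == 1

        if (top_count / files_with_dirs >= threshold and wrapper) or single:
            prefix_parts.append(top_dir)
            prefix = top_dir + "/"
            # Keep only files under the stripped directory, minus the prefix.
            remaining = [p[len(prefix):] for p in remaining if p.startswith(prefix)]
        else:
            break
    return "/".join(prefix_parts)


small = [
    "src/codewalk/config.py",
    "src/codewalk/pipeline.py",
    "src/codewalk/analysis/blast_radius.py",
    "README.md",
]
realistic = [
    "src/codewalk/config.py",
    "src/codewalk/analysis/blast_radius.py",
    "src/codewalk/embeddings/chunker.py",
    "src/codewalk/rag/pipeline.py",
]
small_root = find_source_root_sketch(small)
realistic_root = find_source_root_sketch(realistic)
```

Under these assumptions the sketch reproduces both traces: `small_root` is `"src/codewalk/analysis"` (the over-stripped case) and `realistic_root` is `"src/codewalk"`.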

`_find_module_depth`

Finds the directory depth that represents the module boundary. Looks for the level where child folder names start repeating across different parent directories (>50% shared), signaling internal structure.

Example

Input file_paths: [
    "features/auth/bloc/auth_bloc.dart",
    "features/auth/ui/auth_screen.dart",
    "features/home/bloc/home_bloc.dart",
    "features/home/ui/home_screen.dart",
]
Input source_root: ""

Line 77: stripped = ["features/auth/bloc/auth_bloc.dart", "features/auth/ui/auth_screen.dart", "features/home/bloc/home_bloc.dart", "features/home/ui/home_screen.dart"] (no source_root to strip)

Line 88: best_depth = 1

--- depth = 1 ---

Lines 91–95: Collect names at depth 1:

All paths have parts[0] = "features" → names_at_depth = ["features", "features", "features", "features"]

Line 97: unique = 1 — only one unique name

Lines 100–107: Build parent_to_children:

All paths have len(parts) > 2 ✓
Parent = parts[:1] = "features", children: "auth", "auth", "home", "home"
parent_to_children = {"features": {"auth", "home"}}
Only 1 parent → len(parent_to_children) >= 2 is False → cross_parent_repeat = 0

Line 114: unique >= 3? 1 >= 3? No → skip both branches

--- depth = 2 ---

Lines 91–95: Names at depth 2:

"features/auth", "features/auth", "features/home", "features/home"

Line 97: unique = 2 — "features/auth" and "features/home"

Lines 100–107: parent_to_children:

Parent "features/auth" → children: {"bloc", "ui"}
Parent "features/home" → children: {"bloc", "ui"}
2 parents ✓
all_children = ["bloc", "ui", "bloc", "ui"]
child_counts = {"bloc": 2, "ui": 2}
repeated_children = 2 (both appear ≥ 2 times)
total_unique_children = 2
cross_parent_repeat = 2/2 = 1.0

Line 114: unique >= 3? 2 >= 3? No → best_depth = 2 (candidate, but doesn't enter the break branch)

--- depth = 3 ---

Lines 91–95: names_at_depth = ["features/auth/bloc", "features/auth/ui", "features/home/bloc", "features/home/ui"]
Line 97: unique = 4

Lines 100–107: Parents at depth 3, children at depth 4:

Parts are length 4, so len(parts) > 4? No → parent_to_children stays empty
cross_parent_repeat = 0

Line 114: unique >= 3? 4 >= 3? Yes. cross_parent_repeat > 0.5? 0 > 0.5? No → enters elif: best_depth = 3 (candidate)

At depth 2, cross_parent_repeat was already 1.0, but unique was only 2, so the break did not trigger (unique < 3). At depth 3, unique >= 3 holds but there is no cross-parent repeat, so best_depth is set to 3.

With more features (3+ unique modules at depth 2), the algorithm would detect unique >= 3 AND cross_parent_repeat > 0.5 at depth 2 → best_depth = 2 → break.

Return: 3 for this small two-feature example; 2 once there are at least 3 features.
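
The cross_parent_repeat value used in the walkthrough above can be sketched as a standalone helper. This is a reconstruction from the traced steps (lines 100–107), assuming paths are split on "/" and that a path contributes a child only when it has a component below the parent; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def cross_parent_repeat(stripped_paths, depth):
    """Fraction of child folder names that appear under 2+ parents at `depth`."""
    parent_to_children = defaultdict(set)
    for path in stripped_paths:
        parts = path.split("/")
        # Need a parent (first `depth` parts), a child folder, and a file below it.
        if len(parts) > depth + 1:
            parent = "/".join(parts[:depth])
            parent_to_children[parent].add(parts[depth])

    if len(parent_to_children) < 2:
        return 0.0  # nothing to compare across parents

    # Count how many parents contain each child folder name.
    child_counts = Counter(
        child for children in parent_to_children.values() for child in children
    )
    repeated_children = sum(1 for count in child_counts.values() if count >= 2)
    return repeated_children / len(child_counts)


paths = [
    "features/auth/bloc/auth_bloc.dart",
    "features/auth/ui/auth_screen.dart",
    "features/home/bloc/home_bloc.dart",
    "features/home/ui/home_screen.dart",
]
ratios = {depth: cross_parent_repeat(paths, depth) for depth in (1, 2, 3)}
```

For the example input, `ratios` is 0.0 at depth 1, 1.0 at depth 2, and 0.0 at depth 3, matching the trace.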

`_assign_modules`

Assigns each file to a module based on its path components up to the detected depth.

Example

Input files: [
    {"file_path": "src/codewalk/analysis/blast_radius.py", "language": "python"},
    {"file_path": "src/codewalk/analysis/code_parser.py",  "language": "python"},
    {"file_path": "src/codewalk/config.py",                "language": "python"},
    {"file_path": "README.md",                             "language": "markdown"},
]
Input source_root: "src/codewalk"
Input module_depth: 1

Line 131: modules = defaultdict(...) — auto-creates module entries

Iteration 1: file_path = "src/codewalk/analysis/blast_radius.py"

Line 137: Starts with "src/codewalk/" ✓ → relative_path = "analysis/blast_radius.py", depth = 1
Line 144: parts = ["analysis", "blast_radius.py"], len(parts) = 2 > 1 ✓
Line 145: module_name = "analysis"
Appends to modules["analysis"]

Iteration 2: file_path = "src/codewalk/analysis/code_parser.py"

Same directory as iteration 1 → module_name = "analysis", appended to modules["analysis"]

Iteration 3: file_path = "src/codewalk/config.py"

relative_path = "config.py", parts = ["config.py"], len(parts) = 1 → not > depth
len(parts) > 1? No → Line 149: module_name = "root"

Iteration 4: file_path = "README.md"

Doesn't start with "src/codewalk/" → relative_path = "README.md", depth = 1
parts = ["README.md"], len = 1, not > 1 → module_name = "root"

Return:

{
    "analysis": {"files": ["src/codewalk/analysis/blast_radius.py", "src/codewalk/analysis/code_parser.py"], "languages": Counter({"python": 2}), "file_count": 2},
    "root":     {"files": ["src/codewalk/config.py", "README.md"], "languages": Counter({"python": 1, "markdown": 1}), "file_count": 2},
}
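
The assignment steps above can be sketched as below. This is a reconstruction from the trace, not the actual source. The name `assign_modules_sketch` is hypothetical, and the middle branch (a path shallower than the module depth but still inside a directory) is an assumption: the trace only shows that such a check exists, not what it assigns.

```python
from collections import Counter, defaultdict


def assign_modules_sketch(files, source_root, module_depth):
    """Group file records into modules by their first `module_depth` path parts."""
    modules = defaultdict(lambda: {"files": [], "languages": Counter(), "file_count": 0})
    prefix = source_root + "/" if source_root else ""

    for info in files:
        file_path = info["file_path"]
        # Files outside the source root keep their full path.
        if prefix and file_path.startswith(prefix):
            relative_path = file_path[len(prefix):]
        else:
            relative_path = file_path

        parts = relative_path.split("/")
        if len(parts) > module_depth:
            module_name = "/".join(parts[:module_depth])
        elif len(parts) > 1:
            # Assumption: fall back to the first directory; the real code may differ.
            module_name = parts[0]
        else:
            module_name = "root"

        entry = modules[module_name]
        entry["files"].append(file_path)
        entry["languages"][info["language"]] += 1
        entry["file_count"] += 1
    return dict(modules)


example_files = [
    {"file_path": "src/codewalk/analysis/blast_radius.py", "language": "python"},
    {"file_path": "src/codewalk/analysis/code_parser.py", "language": "python"},
    {"file_path": "src/codewalk/config.py", "language": "python"},
    {"file_path": "README.md", "language": "markdown"},
]
assigned = assign_modules_sketch(example_files, "src/codewalk", 1)
```

With the example input, `assigned` has the two modules shown in the Return block above: "analysis" with two files and "root" with config.py and README.md.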

`detect_modules`

Top-level function that orchestrates module detection: finds source root, optimal depth, assigns files to modules, and builds a module-level dependency graph.

Example

Input files: [
    {"file_path": "src/codewalk/analysis/blast_radius.py", "language": "python"},
    {"file_path": "src/codewalk/analysis/code_parser.py",  "language": "python"},
    {"file_path": "src/codewalk/embeddings/chunker.py",    "language": "python"},
    {"file_path": "src/codewalk/config.py",                "language": "python"},
]
Input dep_graph: {
    "graph": {
        "src/codewalk/analysis/blast_radius.py": [],
        "src/codewalk/analysis/code_parser.py":  [],
        "src/codewalk/embeddings/chunker.py":    ["src/codewalk/config.py"],
        "src/codewalk/config.py":                [],
    }
}

Line 202: file_paths = ["src/codewalk/analysis/blast_radius.py", ..., "src/codewalk/config.py"]

Step 1 (line 205): source_root = _find_source_root(file_paths) → "src/codewalk"

Step 2 (line 208): module_depth = _find_module_depth(file_paths, "src/codewalk") → 1

Step 3 (line 211): modules = _assign_modules(files, "src/codewalk", 1):

{
    "analysis":   {"files": [...blast_radius.py, ...code_parser.py], "file_count": 2, ...},
    "embeddings": {"files": [...chunker.py], "file_count": 1, ...},
    "root":       {"files": [...config.py], "file_count": 1, ...},
}

Line 214: len(modules) = 3, not > 20 → no fallback

Line 220: Convert Counter to dict for JSON

Step 4 (lines 223–238): Build module dependency graph:

file_to_module = {"src/codewalk/analysis/blast_radius.py": "analysis", "src/codewalk/analysis/code_parser.py": "analysis", "src/codewalk/embeddings/chunker.py": "embeddings", "src/codewalk/config.py": "root"}
Module "analysis": files have no deps → deps = set() → module_graph["analysis"] = []
Module "embeddings": chunker.py depends on config.py → config.py is in module "root" → deps = {"root"} → module_graph["embeddings"] = ["root"]
Module "root": config.py has no deps → module_graph["root"] = []

Return:

{
    "source_root": "src/codewalk",
    "modules": {
        "analysis":   {"files": [...], "languages": {"python": 2}, "file_count": 2},
        "embeddings": {"files": [...], "languages": {"python": 1}, "file_count": 1},
        "root":       {"files": [...], "languages": {"python": 1}, "file_count": 1},
    },
    "module_graph": {
        "analysis":   [],
        "embeddings": ["root"],
        "root":       [],
    },
    "stats": {"total_modules": 3, "total_files": 4},
}
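
Step 4 above, which lifts file-level dependencies to module-level ones, can be sketched as follows. This is a reconstruction from the trace: the name `build_module_graph` is hypothetical, and two details are assumptions, namely that a module's dependency on itself is skipped and that the dependency list is sorted.

```python
def build_module_graph(modules, dep_graph):
    """Map each module to the other modules its files depend on."""
    # Reverse lookup: file path -> owning module.
    file_to_module = {
        file_path: name
        for name, info in modules.items()
        for file_path in info["files"]
    }

    module_graph = {}
    for name, info in modules.items():
        deps = set()
        for file_path in info["files"]:
            for target in dep_graph["graph"].get(file_path, []):
                target_module = file_to_module.get(target)
                # Assumption: ignore unknown files and intra-module edges.
                if target_module is not None and target_module != name:
                    deps.add(target_module)
        module_graph[name] = sorted(deps)
    return module_graph


example_modules = {
    "analysis": {"files": [
        "src/codewalk/analysis/blast_radius.py",
        "src/codewalk/analysis/code_parser.py",
    ]},
    "embeddings": {"files": ["src/codewalk/embeddings/chunker.py"]},
    "root": {"files": ["src/codewalk/config.py"]},
}
example_dep_graph = {
    "graph": {
        "src/codewalk/analysis/blast_radius.py": [],
        "src/codewalk/analysis/code_parser.py": [],
        "src/codewalk/embeddings/chunker.py": ["src/codewalk/config.py"],
        "src/codewalk/config.py": [],
    }
}
example_module_graph = build_module_graph(example_modules, example_dep_graph)
```

For the example input this yields the same module_graph as the Return block: only "embeddings" has a dependency, on "root".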
