
analysis module_detector

aakash-anko edited this page May 25, 2026 · 1 revision

analysis/module_detector.py

Groups files into logical modules by detecting wrapper directories (like src/) and finding the right directory depth for module boundaries.


Key Concepts

| Term | Definition | Example |
| --- | --- | --- |
| strongly connected component | A group of vertices where every vertex can reach every other vertex. These form cycle groups. | If A→B→C→A, then {A, B, C} is one strongly connected component. |
| blast radius | All files that would be affected if a given file changes, found by following reverse import edges transitively. | If A imports B and C imports A, changing B has blast radius = {A, C}. |
| embedding | A numerical vector (list of numbers) that represents the meaning of text. Similar text → similar vectors. | The code def add(a, b): return a+b might become [0.12, -0.45, 0.78, ...] (1536 numbers for OpenAI). |
| chunk | A piece of source code (usually one function or class) stored as a unit for search. | The function def scan_directory(root): ... (20 lines) is one chunk. |
| AST | Abstract Syntax Tree: a tree representation of source code structure, where each node is a language construct (function, class, if-statement, etc.). | def add(a, b): return a+b becomes a tree: FunctionDef → [args: a, b] → [body: Return → BinOp(a + b)]. |
| RAG | Retrieval-Augmented Generation: instead of asking an LLM to answer from memory, first retrieve relevant documents, then include them in the prompt. | Question: "What does scan_directory do?" → retrieve the source code of scan_directory → include it in the LLM prompt → get an accurate answer. |
| diff | The set of changes between two versions of code, showing added (+) and removed (-) lines. | - old_line\n+ new_line shows old_line was replaced with new_line. |
| hunk | A contiguous block of changes within a diff. One diff can contain multiple hunks (changes in different parts of a file). | A diff might have hunk 1 (lines 10-15 changed) and hunk 2 (lines 80-85 changed). |
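
The blast radius entry describes a transitive walk over reverse import edges. A minimal sketch of that idea follows; the function name `blast_radius` and the `imports` mapping shape (file → list of files it imports) are assumptions for illustration, not taken from the codewalk source.

```python
from collections import defaultdict, deque


def blast_radius(imports: dict[str, list[str]], changed: str) -> set[str]:
    """Return every file that directly or transitively imports `changed`."""
    # Invert the import graph: for each file, record who imports it.
    importers = defaultdict(set)
    for source, targets in imports.items():
        for target in targets:
            importers[target].add(source)

    # Breadth-first walk along the reverse edges, starting at the changed file.
    affected: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for importer in importers[current]:
            if importer != changed and importer not in affected:
                affected.add(importer)
                queue.append(importer)
    return affected


# The table's example: A imports B, and C imports A.
example = {"A": ["B"], "C": ["A"], "B": []}
print(sorted(blast_radius(example, "B")))  # ['A', 'C']
```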

Source: src/codewalk/analysis/module_detector.py


Constants

_WRAPPER_DIRS (line 8)

Set of directory names that are "wrappers" — they don't represent real modules, they're just organizational containers:

{"src", "lib", "app", "source", "packages", "pkg", "internal", "cmd", "main"}

_find_source_root

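Finds wrapper directories to strip before detecting modules. Walks down the directory tree one level at a time and strips the most common first-level directory when either (a) it is a known wrapper dir and holds at least 50% of the files that have a directory component, or (b) it is the only directory name at that level. The 0.5 threshold and the single-directory rule come from the condition traced at line 45 in the example below.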

Example

Input file_paths: [
    "src/codewalk/config.py",
    "src/codewalk/pipeline.py",
    "src/codewalk/analysis/blast_radius.py",
    "README.md",
]

Line 25: prefix_parts = []
Line 26: remaining = ["src/codewalk/config.py", "src/codewalk/pipeline.py", "src/codewalk/analysis/blast_radius.py", "README.md"]

--- Iteration 1 (depth 0) ---

Lines 30–35: Count first-level directories:

  • "src/codewalk/config.py" → parts = ["src", "codewalk", "config.py"], len > 1 ✓ → dir_counts["src"] += 1
  • "src/codewalk/pipeline.py" → dir_counts["src"] += 1
  • "src/codewalk/analysis/blast_radius.py" → dir_counts["src"] += 1
  • "README.md" → parts = ["README.md"], len == 1 → skip
  • file_with_dirs = 3

Line 37: top_dir = "src", top_count = 3

Line 42: wrapper = "src" in _WRAPPER_DIRS → True
Line 43: single = len(dir_counts) == 1 → True (only "src")

Line 45: (3/3 >= 0.5 and True) or True → True → strip!
Line 46: prefix_parts = ["src"]
Line 47: prefix = "src/"
Line 48–50: remaining = ["codewalk/config.py", "codewalk/pipeline.py", "codewalk/analysis/blast_radius.py"] (README.md dropped — doesn't start with "src/")

--- Iteration 2 (depth 1) ---

Lines 30–35: Count first-level dirs in remaining:

  • All 3 files → dir_counts["codewalk"] = 3, file_with_dirs = 3

Line 37: top_dir = "codewalk", top_count = 3

Line 42: wrapper = "codewalk" in _WRAPPER_DIRS → False
Line 43: single = len(dir_counts) == 1 → True

Line 45: (False) or True → True → strip!
Line 46: prefix_parts = ["src", "codewalk"]
Line 47: prefix = "src/codewalk/"
Line 48–50: remaining = ["config.py", "pipeline.py", "analysis/blast_radius.py"]

--- Iteration 3 (depth 2) ---

Lines 30–35: Count first-level dirs:

  • "config.py" → len == 1 → skip
  • "pipeline.py" → len == 1 → skip
  • "analysis/blast_radius.py" → dir_counts["analysis"] = 1, file_with_dirs = 1

Line 37: top_dir = "analysis", top_count = 1

Line 42: wrapper = False, single = True

Line 45: (top_count / file_with_dirs >= 0.5 and wrapper) or single = (1/1 >= 0.5 and False) or True = False or True → True → strip!
Line 46: prefix_parts = ["src", "codewalk", "analysis"]
Line 47: prefix = "src/codewalk/analysis/"
Line 48–50: remaining = ["blast_radius.py"] (config.py and pipeline.py dropped because they don't start with "src/codewalk/analysis/")

--- Iteration 4 (depth 3) ---

Lines 30–35: "blast_radius.py" has no directory component → dir_counts is empty → break.

Return: "src/codewalk/analysis"

This strips one level too many. At depth 2 only 1 of the 3 remaining files sits in a subdirectory, but because "analysis" is the only subdirectory name, single = True and the function descends into it, collapsing everything into one module. In practice, with many modules under src/codewalk/ (analysis, embeddings, rag, etc.), single is False at this level and the function stops at "src/codewalk".

Corrected example with realistic files:

file_paths: [
    "src/codewalk/config.py",
    "src/codewalk/analysis/blast_radius.py",
    "src/codewalk/embeddings/chunker.py",
    "src/codewalk/rag/pipeline.py",
]

At depth 2 (after stripping src/codewalk): remaining = ["config.py", "analysis/blast_radius.py", "embeddings/chunker.py", "rag/pipeline.py"]. First-level dirs: dir_counts = {"analysis": 1, "embeddings": 1, "rag": 1}, single = False, wrapper = False → (False) or False → False → break.

Return: "src/codewalk"
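
The loop traced above can be sketched as follows. This is a reconstruction from the walkthrough only, not the actual source: the names `find_source_root` and `WRAPPER_DIRS` are stand-ins, the "/" separator is assumed, and the sketch re-strips `remaining` in place, which may differ from how the real function filters paths.

```python
from collections import Counter

# Copied from the _WRAPPER_DIRS constant documented above.
WRAPPER_DIRS = {"src", "lib", "app", "source", "packages", "pkg", "internal", "cmd", "main"}


def find_source_root(file_paths: list[str]) -> str:
    prefix_parts: list[str] = []
    remaining = list(file_paths)
    while True:
        # Count the first directory component of each path that has one.
        dir_counts = Counter(p.split("/")[0] for p in remaining if "/" in p)
        if not dir_counts:
            break  # only bare file names are left
        file_with_dirs = sum(dir_counts.values())
        top_dir, top_count = dir_counts.most_common(1)[0]
        wrapper = top_dir in WRAPPER_DIRS
        single = len(dir_counts) == 1
        # Condition as quoted in the walkthrough (line 45).
        if not ((top_count / file_with_dirs >= 0.5 and wrapper) or single):
            break
        prefix_parts.append(top_dir)
        prefix = top_dir + "/"
        # Keep only files under the stripped directory, made relative to it.
        remaining = [p[len(prefix):] for p in remaining if p.startswith(prefix)]
    return "/".join(prefix_parts)


realistic = [
    "src/codewalk/config.py",
    "src/codewalk/analysis/blast_radius.py",
    "src/codewalk/embeddings/chunker.py",
    "src/codewalk/rag/pipeline.py",
]
print(find_source_root(realistic))  # src/codewalk
```

Running the sketch on the first, smaller example reproduces the over-stripped result "src/codewalk/analysis", which suggests the single-directory rule is what drives that behaviour.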


_find_module_depth

Finds the directory depth that represents the module boundary. Looks for the level where child folder names start repeating across different parent directories (>50% shared), signaling internal structure.

Example

Input file_paths: [
    "features/auth/bloc/auth_bloc.dart",
    "features/auth/ui/auth_screen.dart",
    "features/home/bloc/home_bloc.dart",
    "features/home/ui/home_screen.dart",
]
Input source_root: ""

Line 77: stripped = ["features/auth/bloc/auth_bloc.dart", "features/auth/ui/auth_screen.dart", "features/home/bloc/home_bloc.dart", "features/home/ui/home_screen.dart"] (no source_root to strip)

Line 88: best_depth = 1

--- depth = 1 ---

Lines 91–95: Collect names at depth 1:

  • All paths have parts[0] = "features" → names_at_depth = ["features", "features", "features", "features"]

Line 97: unique = 1 — only one unique name

Lines 100–107: Build parent_to_children:

  • All paths have len(parts) > 2
  • Parent = parts[:1] = "features", children: "auth", "auth", "home", "home"
  • parent_to_children = {"features": {"auth", "home"}}
  • Only 1 parent → len(parent_to_children) >= 2 is False → cross_parent_repeat = 0

Line 114: unique >= 3? 1 >= 3? No → skip both branches

--- depth = 2 ---

Lines 91–95: Names at depth 2:

  • "features/auth", "features/auth", "features/home", "features/home"

Line 97: unique = 2 — "features/auth" and "features/home"

Lines 100–107: parent_to_children:

  • Parent "features/auth" → children: {"bloc", "ui"}
  • Parent "features/home" → children: {"bloc", "ui"}
  • 2 parents ✓
  • all_children = ["bloc", "ui", "bloc", "ui"]
  • child_counts = {"bloc": 2, "ui": 2}
  • repeated_children = 2 (both appear ≥ 2 times)
  • total_unique_children = 2
  • cross_parent_repeat = 2/2 = 1.0

Line 114: unique >= 3? 2 >= 3? No → skip both branches again; best_depth stays 1 even though cross_parent_repeat = 1.0

--- depth = 3 ---

Lines 91–95: names_at_depth = ["features/auth/bloc", "features/auth/ui", "features/home/bloc", "features/home/ui"]
Line 97: unique = 4

Lines 100–107: Parents at depth 3, children at depth 4:

  • Parts are length 4, so len(parts) > 4? No → parent_to_children stays empty
  • cross_parent_repeat = 0

Line 114: unique >= 3? 4 >= 3? Yes. cross_parent_repeat > 0.5? 0 > 0.5? No → enters elif: best_depth = 3 (candidate)

Note that depth 2 already had cross_parent_repeat = 1.0, but unique was only 2, so the break branch did not fire there (it requires unique >= 3). At depth 3, unique >= 3 holds but there is no cross-parent repeat, so the elif sets best_depth = 3.

With more features (3+ unique modules at depth 2), the algorithm would detect unique >= 3 AND cross_parent_repeat > 0.5 at depth 2 → best_depth = 2 → break.

Return: 2 (with ≥ 3 features) or 3 (with only 2 features in this small example)
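
A sketch of the depth search, reconstructed from the trace above, may make the two return values easier to check. It is an approximation under assumptions: the name `find_module_depth`, the upper bound on depth, and the rule that only paths with more than `depth` parts contribute to `names_at_depth` are guesses, not confirmed by the source.

```python
from collections import Counter, defaultdict


def find_module_depth(file_paths: list[str], source_root: str = "") -> int:
    prefix = source_root + "/" if source_root else ""
    stripped = [p[len(prefix):] if prefix and p.startswith(prefix) else p
                for p in file_paths]
    split_paths = [p.split("/") for p in stripped]
    # Assumption: search up to the deepest directory level present.
    max_depth = max((len(parts) - 1 for parts in split_paths), default=0)

    best_depth = 1
    for depth in range(1, max_depth + 1):
        # Directory prefixes of length `depth`, e.g. "features/auth" at depth 2.
        names_at_depth = ["/".join(parts[:depth])
                          for parts in split_paths if len(parts) > depth]
        unique = len(set(names_at_depth))

        # Map each directory at this depth to its child directories.
        parent_to_children = defaultdict(set)
        for parts in split_paths:
            if len(parts) > depth + 1:
                parent_to_children["/".join(parts[:depth])].add(parts[depth])

        cross_parent_repeat = 0.0
        if len(parent_to_children) >= 2:
            child_counts = Counter(child
                                   for children in parent_to_children.values()
                                   for child in children)
            repeated_children = sum(1 for n in child_counts.values() if n >= 2)
            cross_parent_repeat = repeated_children / len(child_counts)

        if unique >= 3 and cross_parent_repeat > 0.5:
            best_depth = depth
            break  # children repeat across parents: module boundary found
        elif unique >= 3:
            best_depth = depth
    return best_depth


features = ["auth", "home", "settings"]
paths = [f"features/{name}/{layer}/x.dart" for name in features for layer in ("bloc", "ui")]
print(find_module_depth(paths))  # 2
```

With only the auth and home features from the example above, the same sketch returns 3, matching the trace.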


_assign_modules

Assigns each file to a module based on its path components up to the detected depth.

Example

Input files: [
    {"file_path": "src/codewalk/analysis/blast_radius.py", "language": "python"},
    {"file_path": "src/codewalk/analysis/code_parser.py",  "language": "python"},
    {"file_path": "src/codewalk/config.py",                "language": "python"},
    {"file_path": "README.md",                             "language": "markdown"},
]
Input source_root: "src/codewalk"
Input module_depth: 1

Line 131: modules = defaultdict(...) — auto-creates module entries

Iteration 1: file_path = "src/codewalk/analysis/blast_radius.py"

  • Line 137: Starts with "src/codewalk/" ✓ → relative_path = "analysis/blast_radius.py", depth = 1
  • Line 144: parts = ["analysis", "blast_radius.py"], len(parts) = 2 > 1
  • Line 145: module_name = "analysis"
  • Appends to modules["analysis"]

Iteration 2: file_path = "src/codewalk/analysis/code_parser.py"

  • Same directory as iteration 1 → module_name = "analysis", appended to modules["analysis"]

Iteration 3: file_path = "src/codewalk/config.py"

  • relative_path = "config.py", parts = ["config.py"], len(parts) = 1 → not > depth
  • len(parts) > 1? No → Line 149: module_name = "root"

Iteration 4: file_path = "README.md"

  • Doesn't start with "src/codewalk/" → relative_path = "README.md", depth = 1
  • parts = ["README.md"], len = 1, not > 1 → module_name = "root"

Return:

{
    "analysis": {"files": ["src/codewalk/analysis/blast_radius.py", "src/codewalk/analysis/code_parser.py"], "languages": Counter({"python": 2}), "file_count": 2},
    "root":     {"files": ["src/codewalk/config.py", "README.md"], "languages": Counter({"python": 1, "markdown": 1}), "file_count": 2},
}
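
The assignment logic can be approximated as below. This is a sketch built from the trace, with stated guesses: the name `assign_modules` is a stand-in, files outside the source root are assumed to use the same depth, and the middle branch (a path shallower than the depth but still inside a directory) is a guess because the example never exercises it.

```python
from collections import Counter, defaultdict


def assign_modules(files: list[dict], source_root: str, module_depth: int) -> dict:
    # Each module entry is created on first access.
    modules = defaultdict(lambda: {"files": [], "languages": Counter(), "file_count": 0})
    prefix = source_root + "/" if source_root else ""

    for info in files:
        file_path = info["file_path"]
        # Paths under the source root are made relative to it; others are kept whole.
        if prefix and file_path.startswith(prefix):
            relative_path = file_path[len(prefix):]
        else:
            relative_path = file_path
        depth = module_depth  # assumption: same depth inside and outside the root

        parts = relative_path.split("/")
        if len(parts) > depth:
            module_name = "/".join(parts[:depth])
        elif len(parts) > 1:
            module_name = "/".join(parts[:-1])  # guess: use the directory portion
        else:
            module_name = "root"  # bare file name with no directory

        entry = modules[module_name]
        entry["files"].append(file_path)
        entry["languages"][info["language"]] += 1
        entry["file_count"] += 1
    return dict(modules)


files = [
    {"file_path": "src/codewalk/analysis/blast_radius.py", "language": "python"},
    {"file_path": "src/codewalk/config.py", "language": "python"},
    {"file_path": "README.md", "language": "markdown"},
]
result = assign_modules(files, "src/codewalk", 1)
print(sorted(result))  # ['analysis', 'root']
```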

detect_modules

Top-level function that orchestrates module detection: finds source root, optimal depth, assigns files to modules, and builds a module-level dependency graph.

Example

Input files: [
    {"file_path": "src/codewalk/analysis/blast_radius.py", "language": "python"},
    {"file_path": "src/codewalk/analysis/code_parser.py",  "language": "python"},
    {"file_path": "src/codewalk/embeddings/chunker.py",    "language": "python"},
    {"file_path": "src/codewalk/config.py",                "language": "python"},
]
Input dep_graph: {
    "graph": {
        "src/codewalk/analysis/blast_radius.py": [],
        "src/codewalk/analysis/code_parser.py":  [],
        "src/codewalk/embeddings/chunker.py":    ["src/codewalk/config.py"],
        "src/codewalk/config.py":                [],
    }
}

Line 202: file_paths = ["src/codewalk/analysis/blast_radius.py", ..., "src/codewalk/config.py"]

Step 1 (line 205): source_root = _find_source_root(file_paths) → "src/codewalk"

Step 2 (line 208): module_depth = _find_module_depth(file_paths, "src/codewalk") → 1

Step 3 (line 211): modules = _assign_modules(files, "src/codewalk", 1):

{
    "analysis":   {"files": [...blast_radius.py, ...code_parser.py], "file_count": 2, ...},
    "embeddings": {"files": [...chunker.py], "file_count": 1, ...},
    "root":       {"files": [...config.py], "file_count": 1, ...},
}

Line 214: len(modules) = 3, not > 20 → no fallback

Line 220: Convert Counter to dict for JSON

Step 4 (lines 223–238): Build module dependency graph:

  • file_to_module = {"src/codewalk/analysis/blast_radius.py": "analysis", "src/codewalk/analysis/code_parser.py": "analysis", "src/codewalk/embeddings/chunker.py": "embeddings", "src/codewalk/config.py": "root"}

  • Module "analysis": files have no deps → deps = set() → module_graph["analysis"] = []

  • Module "embeddings": chunker.py depends on config.py → config.py is in module "root" → deps = {"root"} → module_graph["embeddings"] = ["root"]

  • Module "root": config.py has no deps → module_graph["root"] = []

Return:

{
    "source_root": "src/codewalk",
    "modules": {
        "analysis":   {"files": [...], "languages": {"python": 2}, "file_count": 2},
        "embeddings": {"files": [...], "languages": {"python": 1}, "file_count": 1},
        "root":       {"files": [...], "languages": {"python": 1}, "file_count": 1},
    },
    "module_graph": {
        "analysis":   [],
        "embeddings": ["root"],
        "root":       [],
    },
    "stats": {"total_modules": 3, "total_files": 4},
}
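
Step 4 is the only part of detect_modules that is not delegated to a helper, so a sketch of it may be useful. The function name `build_module_graph` is hypothetical, and two details are assumptions not shown in the trace: dependencies on a file's own module are dropped, and each dependency list is sorted.

```python
def build_module_graph(modules: dict, dep_graph: dict) -> dict[str, list[str]]:
    # Reverse lookup: which module does each file belong to?
    file_to_module = {
        file_path: name
        for name, info in modules.items()
        for file_path in info["files"]
    }

    file_graph = dep_graph.get("graph", {})
    module_graph: dict[str, list[str]] = {}
    for name, info in modules.items():
        deps = set()
        for file_path in info["files"]:
            for target in file_graph.get(file_path, []):
                target_module = file_to_module.get(target)
                # Assumption: skip unknown files and intra-module edges.
                if target_module is not None and target_module != name:
                    deps.add(target_module)
        module_graph[name] = sorted(deps)
    return module_graph


modules = {
    "analysis": {"files": ["src/codewalk/analysis/blast_radius.py"]},
    "embeddings": {"files": ["src/codewalk/embeddings/chunker.py"]},
    "root": {"files": ["src/codewalk/config.py"]},
}
dep_graph = {"graph": {"src/codewalk/embeddings/chunker.py": ["src/codewalk/config.py"]}}
print(build_module_graph(modules, dep_graph))
# {'analysis': [], 'embeddings': ['root'], 'root': []}
```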
