query
Core query logic — shared by MCP server, LangGraph agent, and FastAPI API. Each function takes explicit data arguments (no global state) and returns formatted markdown strings.
| Term | Definition | Example |
|---|---|---|
| topological sort | Ordering vertices so that every vertex appears after all the vertices it depends on: for an import edge A→B (A imports B), B is listed before A. Only works on DAGs. | If A imports B and B imports C, topological order is: C, B, A (dependencies first). |
| cycle | A path in a graph that starts and ends at the same vertex. A→B→C→A is a cycle. | If auth.py imports user.py and user.py imports auth.py, that's a cycle. |
| pagerank | Algorithm that ranks vertices by importance based on how many other important vertices link to them. Originally used by Google for web pages. | utils.py with pagerank=0.12 is important because many important files import it. |
| blast radius | All files that would be affected if a given file changes — found by following reverse import edges transitively. | If A imports B and C imports A, changing B has blast radius = {A, C}. |
| transitive dependency | An indirect dependency through a chain. If A imports B and B imports C, then A transitively depends on C. | Changing C could break A even though A never directly imports C. |
| embedding | A numerical vector (list of numbers) that represents the meaning of text. Similar text → similar vectors. | The code def add(a, b): return a+b might become [0.12, -0.45, 0.78, ...] (1536 numbers for OpenAI). |
| vector store | A database optimized for storing embeddings and finding the most similar ones quickly. | ChromaDB stores code chunk embeddings and returns the 5 most similar chunks to your query. |
| cosine distance | Measures how far apart the directions of two vectors are (1 minus cosine similarity). 0.0 = same direction (identical meaning), 1.0 = unrelated (orthogonal), 2.0 = opposite. | Query "scan files" has cosine distance 0.15 to scan_directory() (very similar) and 0.85 to grade_answer() (very different). |
| ChromaDB | An open-source vector database for storing and searching embeddings. Used here to store code chunks. | collection.query(query_texts=["scan files"], n_results=5) returns the 5 closest code chunks. |
| chunk | A piece of source code (usually one function or class) stored as a unit for search. | The function def scan_directory(root): ... (20 lines) is one chunk. |
| AST | Abstract Syntax Tree: a tree representation of source code structure, where each node is a language construct (function, class, if-statement, etc.). | def add(a, b): return a+b becomes a tree: FunctionDef → [args: a, b] → [body: Return → BinOp(a + b)]. |
| RAG | Retrieval-Augmented Generation — instead of asking an LLM to answer from memory, first retrieve relevant documents, then include them in the prompt. | Question: "What does scan_directory do?" → retrieve the source code of scan_directory → include it in the LLM prompt → get an accurate answer. |
| glob pattern | A wildcard pattern for matching file paths. `*` matches anything in one directory, `**` matches across directories. | `**/*.py` matches all Python files in any subdirectory. `src/*.ts` matches TypeScript files only in `src/`. |
| hunk | A contiguous block of changes within a diff. One diff can contain multiple hunks (changes in different parts of a file). | A diff might have hunk 1 (lines 10-15 changed) and hunk 2 (lines 80-85 changed). |
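The graph terms in the glossary (topological sort, blast radius, transitive dependency) can be made concrete with a small sketch. It assumes a hypothetical three-file import graph and uses only the Python standard library; the helper name `blast_radius` and the file names are illustrative, not the project's actual implementation.

```python
from collections import deque
from graphlib import TopologicalSorter

# Hypothetical import graph: each file maps to the files it imports.
imports = {"a.py": {"b.py"}, "b.py": {"c.py"}, "c.py": set()}

# TopologicalSorter treats the mapped values as predecessors, so
# dependencies are emitted before the files that import them.
reading_order = list(TopologicalSorter(imports).static_order())


def blast_radius(changed, imports):
    """Return every file that directly or transitively imports `changed`."""
    # Build reverse edges: file -> set of files that import it.
    importers = {name: set() for name in imports}
    for importer, deps in imports.items():
        for dep in deps:
            importers.setdefault(dep, set()).add(importer)

    # Breadth-first walk along the reverse import edges.
    affected, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for importer in importers.get(current, ()):
            if importer not in affected:
                affected.add(importer)
                queue.append(importer)
    return affected


print(reading_order)                          # ['c.py', 'b.py', 'a.py']
print(sorted(blast_radius("c.py", imports)))  # ['a.py', 'b.py']
```

`a.py` never imports `c.py` directly, yet it appears in the blast radius of `c.py`: that is the transitive dependency from the glossary.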
One-line: Case-insensitive module name lookup — returns the actual key from the modules dict or None.
Input: module_name = "Analysis", modules = {"analysis": {...}, "embeddings": {...}, "rag": {...}}
Line 20: for name in modules:
name = "analysis" → "analysis".lower() == "analysis".lower()? → "analysis" == "analysis"? → True
return "analysis"
Return: "analysis"
Input: module_name = "database", modules = {"analysis": {...}, "embeddings": {...}}
Line 20: Loop through "analysis", "embeddings" → neither matches "database"
Line 22: return None
Return: None
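The two traces above are consistent with a lookup along the following lines. This is a minimal sketch reconstructed from the walkthrough (the name `resolve_module_name` appears later in this document); the real function body may differ in details.

```python
def resolve_module_name(module_name, modules):
    """Return the key of `modules` that matches `module_name`, ignoring case."""
    wanted = module_name.lower()
    for name in modules:
        if name.lower() == wanted:
            return name  # the key as it is actually stored in the dict
    return None


modules = {"analysis": {}, "embeddings": {}, "rag": {}}
print(resolve_module_name("Analysis", modules))  # analysis
print(resolve_module_name("database", modules))  # None
```

Returning the stored key rather than the caller's spelling means later lookups such as `modules[actual_name]` cannot fail on a case mismatch.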
One-line: Returns a standard error message listing available modules.
Input: module_name = "database", modules = {"analysis": {...}, "embeddings": {...}, "rag": {...}}
Line 26: available = ", ".join(sorted({"analysis", "embeddings", "rag"}))
→ available = "analysis, embeddings, rag"
Line 27: return "Module 'database' not found. Available: analysis, embeddings, rag"
Return: "Module 'database' not found. Available: analysis, embeddings, rag"
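A sketch of the message construction traced above. The function name `module_not_found` is hypothetical, since the walkthrough does not name it; the message format is copied from the trace.

```python
def module_not_found(module_name, modules):
    """Build the error message, listing known modules alphabetically."""
    available = ", ".join(sorted(modules))  # iterating a dict yields its keys
    return f"Module '{module_name}' not found. Available: {available}"


modules = {"rag": {}, "analysis": {}, "embeddings": {}}
print(module_not_found("database", modules))
# Module 'database' not found. Available: analysis, embeddings, rag
```

Sorting the keys keeps the message stable regardless of the order in which modules were discovered.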
One-line: Extracts the filename (the last component) from a "/"-separated path.
Input: path = "src/codewalk/analysis/blast_radius.py"
Line 31: return "src/codewalk/analysis/blast_radius.py".split("/")[-1]
→ return "blast_radius.py"
Return: "blast_radius.py"
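As a sketch, the split-based extraction looks like this. The name `short_name` is hypothetical; the comparison with `pathlib` is added here for illustration and is not part of the walkthrough.

```python
from pathlib import PurePosixPath


def short_name(path):
    """Return the text after the last '/' in `path`."""
    return path.split("/")[-1]


path = "src/codewalk/analysis/blast_radius.py"
print(short_name(path))          # blast_radius.py
print(PurePosixPath(path).name)  # blast_radius.py
```

The two approaches agree on ordinary file paths. They differ on a trailing slash: `short_name("src/dir/")` returns an empty string, while `PurePosixPath("src/dir/").name` returns `"dir"`.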
One-line: Computes per-file blast radius, returns sorted list of risk dicts + overall max risk level.
Input: file_paths = ["src/config.py", "src/utils.py", "src/main.py"]
runtime = <GraphRuntime instance>
Line 36: risk_order = {"critical": 4, "high": 3, "moderate": 2, "low": 1, "none": 0}
Line 37: max_risk = "low"
Line 38: results = []
file_path = "src/config.py":
Line 40: radius = get_blast_radius("src/config.py", runtime)
→ radius = {"risk_level": "high", "affected_files": 5, "direct": ["src/pipeline.py", "src/query.py"], "transitive": ["src/main.py"]}
Line 41: risk_order["high"]=3 > risk_order["low"]=1 → True → max_risk = "high"
Line 43: results.append({
"file": "src/config.py",
"risk_level": "high",
"affected_files": 5,
"direct": ["pipeline.py", "query.py"],
"transitive": ["main.py"],
})
file_path = "src/utils.py":
→ radius = {"risk_level": "moderate", "affected_files": 2, "direct": ["src/pipeline.py"], "transitive": []}
→ risk_order["moderate"]=2 < risk_order["high"]=3 → max_risk stays "high"
→ results.append({...})
file_path = "src/main.py":
→ radius = {"risk_level": "low", "affected_files": 0, "direct": [], "transitive": []}
→ results.append({...})
Line 50: results.sort(key=lambda x: x["affected_files"], reverse=True)
→ sorted: [config.py (5), utils.py (2), main.py (0)]
Return: ([{config.py, high, 5}, {utils.py, moderate, 2}, {main.py, low, 0}], "high")
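The loop traced above can be sketched as follows. To keep the example self-contained, the call `get_blast_radius(file_path, runtime)` is replaced by an injected `get_radius` callable backed by canned data; that substitution, and the `short_name` helper, are assumptions made for illustration.

```python
RISK_ORDER = {"critical": 4, "high": 3, "moderate": 2, "low": 1, "none": 0}


def short_name(path):
    return path.split("/")[-1]


def compute_file_risks(file_paths, get_radius):
    """Return (per-file risk dicts sorted by impact, highest risk level seen)."""
    max_risk = "low"
    results = []
    for file_path in file_paths:
        radius = get_radius(file_path)  # stand-in for get_blast_radius(file_path, runtime)
        if RISK_ORDER[radius["risk_level"]] > RISK_ORDER[max_risk]:
            max_risk = radius["risk_level"]
        results.append({
            "file": file_path,
            "risk_level": radius["risk_level"],
            "affected_files": radius["affected_files"],
            "direct": [short_name(p) for p in radius["direct"]],
            "transitive": [short_name(p) for p in radius["transitive"]],
        })
    # Most widely felt changes first.
    results.sort(key=lambda x: x["affected_files"], reverse=True)
    return results, max_risk


# Canned radii taken from the trace above.
canned = {
    "src/config.py": {"risk_level": "high", "affected_files": 5,
                      "direct": ["src/pipeline.py", "src/query.py"],
                      "transitive": ["src/main.py"]},
    "src/utils.py": {"risk_level": "moderate", "affected_files": 2,
                     "direct": ["src/pipeline.py"], "transitive": []},
    "src/main.py": {"risk_level": "low", "affected_files": 0,
                    "direct": [], "transitive": []},
}
risks, overall = compute_file_risks(
    ["src/main.py", "src/utils.py", "src/config.py"], canned.__getitem__)
print([r["file"] for r in risks], overall)
# ['src/config.py', 'src/utils.py', 'src/main.py'] high
```

The input order is deliberately shuffled to show that the final ordering comes from the sort on `affected_files`, not from the order of `file_paths`.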
One-line: Tries exact module match first, then falls back to matching as a sub-folder within a module.
Input: module_name = "auth",
modules_result = {
"source_root": "lib",
"modules": {
"features": {"files": ["lib/features/auth/login.dart", "lib/features/auth/register.dart", "lib/features/home/home.dart"]}
}
},
files = [{"file_path": "lib/features/auth/login.dart", "language": "dart"}, ...]
Line 62: modules = {"features": {...}}
Line 63: actual_name = resolve_module_name("auth", {"features": {...}})
→ None (no module called "auth")
Line 65: actual_name is None → fall to sub-folder search
Line 67: source_root = "lib"
Line 68: mod_name = "features", mod_info = {"files": [...]}
Line 70: prefix = "lib/features/auth/"
Line 74: matching_files = ["lib/features/auth/login.dart", "lib/features/auth/register.dart"]
→ 2 matches!
Line 75: lang_counter = Counter()
Line 77: matching_set = {"lib/features/auth/login.dart", "lib/features/auth/register.dart"}
Line 78-79: count languages → lang_counter = Counter({"dart": 2})
Line 80: info = {"files": ["lib/features/auth/login.dart", "lib/features/auth/register.dart"], "file_count": 2, "languages": {"dart": 2}}
Line 84: return ("features", info, True)
Return: ("features", {"files": [...], "file_count": 2, "languages": {"dart": 2}}, True)
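The sub-folder fallback traced above can be sketched like this. Only the fallback step is shown (the exact-match attempt via `resolve_module_name` is omitted), the function name `find_subfolder` is hypothetical, and returning `None` when nothing matches is an assumption, since the walkthrough does not cover that branch.

```python
from collections import Counter


def find_subfolder(module_name, modules_result, files):
    """Look for <source_root>/<module>/<module_name>/ inside each known module."""
    source_root = modules_result["source_root"]
    language_of = {f["file_path"]: f["language"] for f in files}
    for mod_name, mod_info in modules_result["modules"].items():
        prefix = f"{source_root}/{mod_name}/{module_name}/"
        matching_files = [p for p in mod_info["files"] if p.startswith(prefix)]
        if matching_files:
            lang_counter = Counter(
                language_of[p] for p in matching_files if p in language_of)
            info = {
                "files": matching_files,
                "file_count": len(matching_files),
                "languages": dict(lang_counter),
            }
            return mod_name, info, True  # True marks a sub-folder match
    return None  # assumed behaviour when no sub-folder matches


modules_result = {
    "source_root": "lib",
    "modules": {"features": {"files": [
        "lib/features/auth/login.dart",
        "lib/features/auth/register.dart",
        "lib/features/home/home.dart",
    ]}},
}
files = [{"file_path": p, "language": "dart"}
         for p in modules_result["modules"]["features"]["files"]]
print(find_subfolder("auth", modules_result, files))
```

The trailing slash in the prefix matters: without it, a request for `auth` would also match a sibling folder such as `lib/features/authz/`.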
One-line: Semantic search against ChromaDB, returns formatted context of top results.
Input: store = <VectorStore>, query = "how does authentication work"
Line 92: results = store.search("how does authentication work", n_results=5)
→ results = [{"content": "class AuthBloc...", "metadata": {...}, "distance": 0.3}, ...]
Line 93: filtered, _ = filter_by_distance(results)
→ filtered = [result1, result2, result3] (distance < threshold)
Line 94: filtered is not empty → skip "No relevant code" branch
Line 96: return format_context(filtered)
→ formatted markdown with code snippets
Return: Markdown string with matching code snippets.
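The filtering step is the part of this flow that decides whether any context is returned at all, so a sketch may help. Everything specific here is an assumption: the cutoff value `0.5`, the meaning of the second tuple element (the walkthrough discards it), the fallback message text, and the plain join used in place of `format_context`.

```python
MAX_DISTANCE = 0.5  # hypothetical cutoff; the real threshold is not stated


def filter_by_distance(results, max_distance=MAX_DISTANCE):
    """Split results into (close enough, too far) by cosine distance."""
    kept = [r for r in results if r["distance"] < max_distance]
    rejected = [r for r in results if r["distance"] >= max_distance]
    return kept, rejected


def search_context(results):
    """Return joined snippets, or a fallback message if nothing is close."""
    filtered, _ = filter_by_distance(results)
    if not filtered:
        return "No relevant code found."  # assumed wording
    # Stand-in for format_context(filtered).
    return "\n\n".join(r["content"] for r in filtered)


results = [
    {"content": "class AuthBloc...", "metadata": {}, "distance": 0.3},
    {"content": "def grade_answer...", "metadata": {}, "distance": 0.85},
]
print(search_context(results))  # class AuthBloc...
```

A vector store always returns its nearest neighbours, even when they are unrelated to the query, which is why a distance cutoff is applied before building the context.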
One-line: Returns markdown with module files, languages, dependencies, hub files, and coupling stats.
Input: module_name = "analysis", modules_result has modules and module_graph
Line 102: actual_name = resolve_module_name("analysis", modules) → "analysis"
Line 105: info = modules["analysis"]
→ {"files": ["analysis/blast_radius.py", ...], "file_count": 8, "languages": {"python": 8}}
Line 106: depends_on = module_graph["analysis"] → ["embeddings", "graph"]
Line 107: depended_by = [mod for mod, deps in module_graph.items() if "analysis" in deps]
→ ["query", "mcp"]
Line 134: file_names = ["blast_radius.py", "code_parser.py", ...] (sorted short names)
Line 135: lang_str = "python (8 files)"
Return:
## Module: analysis
**Files (8):** blast_radius.py, code_parser.py, ...
**Languages:** python (8 files)
**Depends on:** embeddings, graph
**Depended on by:** query, mcp
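A sketch of how the dependency lists and the header lines above could be assembled. The function name `describe_module` and the sample data are hypothetical; hub files and coupling stats are left out because the trace does not show how they are computed.

```python
def describe_module(name, modules, module_graph):
    """Format the file, language, and dependency summary for one module."""
    info = modules[name]
    depends_on = module_graph.get(name, [])
    # Reverse lookup: every module whose dependency list mentions `name`.
    depended_by = [mod for mod, deps in module_graph.items() if name in deps]
    file_names = sorted(path.split("/")[-1] for path in info["files"])
    lang_str = ", ".join(
        f"{lang} ({count} files)" for lang, count in info["languages"].items())
    return "\n".join([
        f"## Module: {name}",
        f"**Files ({info['file_count']}):** {', '.join(file_names)}",
        f"**Languages:** {lang_str}",
        f"**Depends on:** {', '.join(depends_on)}",
        f"**Depended on by:** {', '.join(depended_by)}",
    ])


modules = {"analysis": {
    "files": ["analysis/code_parser.py", "analysis/blast_radius.py"],
    "file_count": 2,
    "languages": {"python": 2},
}}
module_graph = {
    "analysis": ["embeddings", "graph"],
    "query": ["analysis"],
    "mcp": ["analysis", "embeddings"],
    "embeddings": [],
}
print(describe_module("analysis", modules, module_graph))
```

The module graph only stores forward edges, so "depended on by" has to be derived by scanning every module's dependency list.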
One-line: Looks up a function/class in ChromaDB, shows its code + blast radius + callers/callees.
Input: function_name = "chunk_file", store = <VectorStore>
Line 146: results = store.search("chunk_file", n_results=10)
Line 147: filtered, _ = filter_by_distance(results)
Line 148: matches = [r for r in filtered if "chunk_file" in r.metadata.symbol_name.lower()]
→ matches = [result_for_chunk_file]
Line 150: to_show = matches[:3] → [result_for_chunk_file]
Line 153: context = format_context(to_show) → "```python\ndef chunk_file(file_info):\n..."
Line 155: file_path = "src/codewalk/embeddings/chunker.py"
Line 157: radius = get_blast_radius(file_path, runtime)
→ {"risk_level": "high", "affected_files": 3, "direct": ["pipeline.py"], "transitive": ["main.py"]}
context += "\n\n### Blast Radius\n**Risk:** HIGH — 3 files affected\n..."
Line 172: callers = graph_store.get_callers_of_symbol("chunker.py:chunk_file")
→ [{"caller": "chunk_and_embed_parallel", "file": "pipeline.py", "line": 48}]
context += "\n\n### Called by (1 caller):\n - chunk_and_embed_parallel() at pipeline.py:48"
Return: Markdown with code, blast radius, and call graph info.
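The final step, appending the caller list, can be sketched as below. The helper name `format_callers` is hypothetical, the caller dicts follow the shape shown in the trace, and the plural form "callers" is an assumption because the trace only shows the single-caller case.

```python
def format_callers(callers):
    """Render call-graph callers as a markdown section."""
    noun = "caller" if len(callers) == 1 else "callers"  # plural is assumed
    lines = [f"### Called by ({len(callers)} {noun}):"]
    for c in callers:
        lines.append(f" - {c['caller']}() at {c['file']}:{c['line']}")
    return "\n".join(lines)


callers = [{"caller": "chunk_and_embed_parallel", "file": "pipeline.py", "line": 48}]
context = "...code and blast radius sections..."
context += "\n\n" + format_callers(callers)
print(format_callers(callers))
# ### Called by (1 caller):
#  - chunk_and_embed_parallel() at pipeline.py:48
```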
One-line: Full project overview — tech stack, modules, dependency flow, riskiest files, PageRank, cycles.
Return: Multi-section markdown overview.
One-line: Blast radius report for a target module, file, or top 30 riskiest files.
Input: target = "analysis", modules_result has modules, deps has graph
Line 316: modules = modules_result["modules"]
Line 317: actual_module = resolve_module_name("analysis", modules) → "analysis"
Line 319: target_files = sorted(modules["analysis"]["files"])
→ ["analysis/blast_radius.py", "analysis/code_parser.py", ...]
Line 320: scope = "module 'analysis'"
Line 332: all_risks, max_risk = compute_file_risks(target_files, runtime)
→ ([{blast_radius.py, moderate, 3}, ...], "high")
Line 338-345: Format each risk entry into lines:
"[HIGH] analysis/blast_radius.py — 3 affected | breaks: query.py → then: main.py"
Return: Formatted blast radius report markdown.
One-line: Returns files in topological dependency order with blast radius risk annotations.
Input: module_name = "embeddings"
Line 362: order = generate_reading_order_raw(files, deps)
Line 366: actual_name = resolve_module_name("embeddings", modules) → "embeddings"
Line 369: module_files = set(modules["embeddings"]["files"])
Line 370: all_items = [item for item in all_items if item["file"] in module_files]
→ 4 items (chunker, embedder, vector_store, __init__)
Line 374: For each item, compute blast radius and format:
"1. [LOW] embeddings/__init__.py (0 affected) — re-exports"
"2. [MODERATE] embeddings/chunker.py (2 affected) — no deps, read first"
...
Return: "## Reading Order — module 'embeddings' (4 files)\n1. [LOW] ..."
One-line: Module-to-module or file-to-file dependency flow.
Input: module_name = ""
Line 395: depended_on = {"embeddings", "graph", "analysis"} (modules that others depend on)
Line 398: entry_modules = ["agent", "api", "mcp"] (not depended on by anyone)
Line 400-405: Format each module:
" analysis (8 files) → depends on: embeddings, graph"
" api (3 files) → (standalone)"
Return: Module-level dependency flow markdown.
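The module-level view above boils down to two set operations over the module graph. The sketch below uses a hypothetical graph and file counts chosen to reproduce the values in the trace; the function name `format_flow` is illustrative.

```python
# Hypothetical module graph: module -> modules it depends on.
module_graph = {
    "agent": ["analysis"],
    "analysis": ["embeddings", "graph"],
    "api": [],
    "embeddings": [],
    "graph": [],
    "mcp": ["analysis"],
}
file_counts = {"agent": 4, "analysis": 8, "api": 3,
               "embeddings": 4, "graph": 5, "mcp": 2}

# Modules that appear in at least one dependency list.
depended_on = {dep for deps in module_graph.values() for dep in deps}
# Entry points: modules nothing else depends on.
entry_modules = sorted(m for m in module_graph if m not in depended_on)


def format_flow(module_graph, file_counts):
    """One line per module, showing its outgoing dependencies."""
    lines = []
    for mod in sorted(module_graph):
        deps = module_graph[mod]
        target = f"depends on: {', '.join(deps)}" if deps else "(standalone)"
        lines.append(f"  {mod} ({file_counts[mod]} files) → {target}")
    return lines


print(entry_modules)  # ['agent', 'api', 'mcp']
print("\n".join(format_flow(module_graph, file_counts)))
```

Entry modules are a natural starting point for reading the project top-down, while the `depended_on` set identifies the shared foundations.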
Input: module_name = "embeddings"
Shows file-by-file imports within the module, separating internal vs cross-module dependencies.
Return: File-level dependency flow for the specified module.