aakash-anko edited this page May 25, 2026 · 1 revision

query.py

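Core query logic shared by the MCP server, the LangGraph agent, and the FastAPI app. Each function takes explicit data arguments (no global state). The functions whose names end in _text return formatted markdown strings; the small helpers (resolve_module_name, short_name, compute_file_risks, resolve_module_with_fallback) return plain Python values.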


Key Concepts

| Term | Definition | Example |
| --- | --- | --- |
| topological sort | Ordering vertices so that every vertex appears after everything it depends on: for every import edge A→B (A imports B), B comes before A. Only works on DAGs (directed acyclic graphs). | If A imports B and B imports C, the topological order is C, B, A (dependencies first). |
| cycle | A path in a graph that starts and ends at the same vertex. A→B→C→A is a cycle. | If auth.py imports user.py and user.py imports auth.py, that's a cycle. |
| pagerank | Algorithm that ranks vertices by importance based on how many other important vertices link to them. Originally used by Google for web pages. | utils.py with pagerank=0.12 is important because many important files import it. |
| blast radius | All files that would be affected if a given file changes, found by following reverse import edges transitively. | If A imports B and C imports A, changing B has blast radius = {A, C}. |
| transitive dependency | An indirect dependency through a chain. If A imports B and B imports C, then A transitively depends on C. | Changing C could break A even though A never directly imports C. |
| embedding | A numerical vector (list of numbers) that represents the meaning of text. Similar text → similar vectors. | The code `def add(a, b): return a+b` might become `[0.12, -0.45, 0.78, ...]` (1536 numbers for OpenAI models such as text-embedding-3-small). |
| vector store | A database optimized for storing embeddings and finding the most similar ones quickly. | ChromaDB stores code chunk embeddings and returns the 5 most similar chunks to your query. |
| cosine distance | Measures how different two vectors are in direction. 0.0 = same direction (identical meaning), 1.0 = orthogonal (unrelated), 2.0 = opposite. | Query "scan files" has cosine distance 0.15 to scan_directory() (very similar) and 0.85 to grade_answer() (very different). |
| ChromaDB | An open-source vector database for storing and searching embeddings. Used here to store code chunks. | `collection.query(query_texts=["scan files"], n_results=5)` returns the 5 closest code chunks. |
| chunk | A piece of source code (usually one function or class) stored as a unit for search. | The function `def scan_directory(root): ...` (20 lines) is one chunk. |
| AST | Abstract Syntax Tree: a tree representation of source code structure, where each node is a language construct (function, class, if-statement, etc.). | `def add(a, b): return a+b` becomes a tree: FunctionDef → [args: a, b] → [body: Return → BinOp(a + b)]. |
| RAG | Retrieval-Augmented Generation: instead of asking an LLM to answer from memory, first retrieve relevant documents, then include them in the prompt. | Question: "What does scan_directory do?" → retrieve the source code of scan_directory → include it in the LLM prompt → get an accurate answer. |
| glob pattern | A wildcard pattern for matching file paths. `*` matches anything within one directory, `**` matches across directories. | `**/*.py` matches all Python files in any subdirectory. `src/*.ts` matches TypeScript files only in src/. |
| hunk | A contiguous block of changes within a diff. One diff can contain multiple hunks (changes in different parts of a file). | A diff might have hunk 1 (lines 10-15 changed) and hunk 2 (lines 80-85 changed). |
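
The graph terms above (topological sort, blast radius) can be made concrete with a small sketch. It assumes the import graph is a plain dict mapping each file to the files it imports; the `blast_radius` helper here is hypothetical and is not the project's implementation. `graphlib` ships with the Python standard library (3.9+).

```python
from collections import deque
from graphlib import TopologicalSorter

# Hypothetical import graph: each file maps to the files it imports.
imports = {"A": ["B"], "B": ["C"], "C": []}

# TopologicalSorter expects node -> predecessors, which is exactly
# "file -> its dependencies", so dependencies come out first.
reading_order = list(TopologicalSorter(imports).static_order())


def blast_radius(changed, imports):
    """Return every file that imports `changed`, directly or transitively."""
    # Reverse the edges: dependency -> files that import it.
    importers = {}
    for src, deps in imports.items():
        for dep in deps:
            importers.setdefault(dep, []).append(src)

    affected, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for parent in importers.get(current, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


print(reading_order)  # ['C', 'B', 'A']
print(sorted(blast_radius("B", {"A": ["B"], "C": ["A"]})))  # ['A', 'C']
```

The second call reproduces the table's blast radius example: A imports B and C imports A, so changing B affects both.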

Function: resolve_module_name(module_name, modules)

One-line: Case-insensitive module name lookup — returns the actual key from the modules dict or None.

Example

Input: module_name = "Analysis", modules = {"analysis": {...}, "embeddings": {...}, "rag": {...}}
Line 20: for name in modules:
  name = "analysis" → "analysis".lower() == "analysis".lower()? → "analysis" == "analysis"? → True
  return "analysis"

Return: "analysis"

Example (not found)

Input: module_name = "database", modules = {"analysis": {...}, "embeddings": {...}}
Line 20: Loop through "analysis", "embeddings" → neither matches "database"
Line 22: return None

Return: None
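
The two traces above are consistent with a loop like the following. This is a minimal sketch reconstructed from the walkthrough, not a copy of the source file, so details such as the local variable `wanted` are assumptions.

```python
def resolve_module_name(module_name, modules):
    """Return the key in `modules` that matches `module_name`, ignoring case."""
    wanted = module_name.lower()
    for name in modules:
        if name.lower() == wanted:
            return name  # the key as stored, not the caller's spelling
    return None


modules = {"analysis": {}, "embeddings": {}, "rag": {}}
print(resolve_module_name("Analysis", modules))  # analysis
print(resolve_module_name("database", modules))  # None
```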


Function: module_not_found_error(module_name, modules)

One-line: Returns a standard error message listing available modules.

Example

Input: module_name = "database", modules = {"analysis": {...}, "embeddings": {...}, "rag": {...}}
Line 26: available = ", ".join(sorted({"analysis", "embeddings", "rag"}))
         → available = "analysis, embeddings, rag"
Line 27: return "Module 'database' not found. Available: analysis, embeddings, rag"

Return: "Module 'database' not found. Available: analysis, embeddings, rag"
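
A sketch that produces the message shown in the trace, assuming the function only needs the keys of the modules dict:

```python
def module_not_found_error(module_name, modules):
    """Build the standard 'not found' message with a sorted module list."""
    # Iterating a dict yields its keys, so sorted(modules) sorts module names.
    available = ", ".join(sorted(modules))
    return f"Module '{module_name}' not found. Available: {available}"


modules = {"rag": {}, "analysis": {}, "embeddings": {}}
print(module_not_found_error("database", modules))
# Module 'database' not found. Available: analysis, embeddings, rag
```

Sorting keeps the message stable regardless of the order in which modules were discovered.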


Function: short_name(path)

One-line: Extracts filename from a path.

Example

Input: path = "src/codewalk/analysis/blast_radius.py"
Line 31: return "src/codewalk/analysis/blast_radius.py".split("/")[-1]
         → return "blast_radius.py"

Return: "blast_radius.py"
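
Because the trace shows a plain split on "/", a few edge cases are worth knowing. The sketch below mirrors the traced line; the extra inputs are illustrative.

```python
def short_name(path):
    """Return the last '/'-separated component of a path."""
    return path.split("/")[-1]


print(short_name("src/codewalk/analysis/blast_radius.py"))  # blast_radius.py
print(short_name("README.md"))                              # README.md
print(repr(short_name("src/codewalk/")))                    # ''
```

A trailing slash yields an empty string, and a backslash-separated path would be returned unchanged, so this approach assumes POSIX-style paths.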


Function: compute_file_risks(file_paths, runtime)

One-line: Computes the blast radius of each file and returns a list of risk dicts sorted by affected-file count (descending), plus the highest risk level seen.

Example

Input: file_paths = ["src/config.py", "src/utils.py", "src/main.py"]
       runtime = <GraphRuntime instance>
Line 36: risk_order = {"critical": 4, "high": 3, "moderate": 2, "low": 1, "none": 0}
Line 37: max_risk = "low"
Line 38: results = []

  file_path = "src/config.py":
    Line 40: radius = get_blast_radius("src/config.py", runtime)
             → radius = {"risk_level": "high", "affected_files": 5, "direct": ["src/pipeline.py", "src/query.py"], "transitive": ["src/main.py"]}
    Line 41: risk_order["high"]=3 > risk_order["low"]=1 → True → max_risk = "high"
    Line 43: results.append({
        "file": "src/config.py",
        "risk_level": "high",
        "affected_files": 5,
        "direct": ["pipeline.py", "query.py"],
        "transitive": ["main.py"],
    })

  file_path = "src/utils.py":
    → radius = {"risk_level": "moderate", "affected_files": 2, "direct": ["src/pipeline.py"], "transitive": []}
    → risk_order["moderate"]=2 < risk_order["high"]=3 → max_risk stays "high"
    → results.append({...})

  file_path = "src/main.py":
    → radius = {"risk_level": "low", "affected_files": 0, "direct": [], "transitive": []}
    → results.append({...})

Line 50: results.sort(key=lambda x: x["affected_files"], reverse=True)
         → sorted: [config.py (5), utils.py (2), main.py (0)]

Return: ([{config.py, high, 5}, {utils.py, moderate, 2}, {main.py, low, 0}], "high")
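
The walkthrough can be condensed into a runnable sketch. To keep it self-contained, the real `get_blast_radius(file_path, runtime)` call is replaced by a hypothetical `get_radius` callable backed by canned data (`FAKE_RADIUS`), so the signature differs from the documented one.

```python
RISK_ORDER = {"critical": 4, "high": 3, "moderate": 2, "low": 1, "none": 0}

# Canned results standing in for get_blast_radius (values from the trace above).
FAKE_RADIUS = {
    "src/config.py": {"risk_level": "high", "affected_files": 5,
                      "direct": ["src/pipeline.py", "src/query.py"],
                      "transitive": ["src/main.py"]},
    "src/utils.py": {"risk_level": "moderate", "affected_files": 2,
                     "direct": ["src/pipeline.py"], "transitive": []},
    "src/main.py": {"risk_level": "low", "affected_files": 0,
                    "direct": [], "transitive": []},
}


def short_name(path):
    return path.split("/")[-1]


def compute_file_risks(file_paths, get_radius):
    """Return (risk dicts sorted by affected_files desc, highest risk level)."""
    max_risk = "low"
    results = []
    for file_path in file_paths:
        radius = get_radius(file_path)
        if RISK_ORDER[radius["risk_level"]] > RISK_ORDER[max_risk]:
            max_risk = radius["risk_level"]
        results.append({
            "file": file_path,
            "risk_level": radius["risk_level"],
            "affected_files": radius["affected_files"],
            "direct": [short_name(p) for p in radius["direct"]],
            "transitive": [short_name(p) for p in radius["transitive"]],
        })
    results.sort(key=lambda x: x["affected_files"], reverse=True)
    return results, max_risk


risks, overall = compute_file_risks(
    ["src/main.py", "src/utils.py", "src/config.py"], FAKE_RADIUS.__getitem__
)
print([r["file"] for r in risks], overall)
# ['src/config.py', 'src/utils.py', 'src/main.py'] high
```

The input order is deliberately shuffled to show that the result order comes from the sort, not from the order of `file_paths`.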


Function: resolve_module_with_fallback(module_name, modules_result, files)

One-line: Tries a case-insensitive module name match first (via resolve_module_name), then falls back to matching module_name as a sub-folder within a module.

Example: Sub-folder fallback

Input: module_name = "auth", 
       modules_result = {
           "source_root": "lib",
           "modules": {
               "features": {"files": ["lib/features/auth/login.dart", "lib/features/auth/register.dart", "lib/features/home/home.dart"]}
           }
       },
       files = [{"file_path": "lib/features/auth/login.dart", "language": "dart"}, ...]
Line 62: modules = {"features": {...}}
Line 63: actual_name = resolve_module_name("auth", {"features": {...}})
         → None (no module called "auth")

Line 65: actual_name is None → fall to sub-folder search

Line 67: source_root = "lib"
Line 68: mod_name = "features", mod_info = {"files": [...]}
Line 70: prefix = "lib/features/auth/"
Line 74: matching_files = ["lib/features/auth/login.dart", "lib/features/auth/register.dart"]
         → 2 matches!

Line 75: lang_counter = Counter()
Line 77: matching_set = {"lib/features/auth/login.dart", "lib/features/auth/register.dart"}
Line 78-79: count languages → lang_counter = Counter({"dart": 2})

Line 80: info = {"files": ["lib/features/auth/login.dart", "lib/features/auth/register.dart"], "file_count": 2, "languages": {"dart": 2}}
Line 84: return ("features", info, True)

Return: ("features", {"files": [...], "file_count": 2, "languages": {"dart": 2}}, True)
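
The sub-folder fallback traced above might look like the sketch below. The fallback path follows the walkthrough; the return values for an exact hit (`False` flag) and for no match at all (`None, None, False`) are assumptions, since the trace does not show those branches.

```python
from collections import Counter


def resolve_module_name(module_name, modules):
    wanted = module_name.lower()
    for name in modules:
        if name.lower() == wanted:
            return name
    return None


def resolve_module_with_fallback(module_name, modules_result, files):
    modules = modules_result["modules"]
    actual_name = resolve_module_name(module_name, modules)
    if actual_name is not None:
        # Assumption: a direct hit returns the stored info and a False flag.
        return actual_name, modules[actual_name], False

    source_root = modules_result["source_root"]
    for mod_name, mod_info in modules.items():
        prefix = f"{source_root}/{mod_name}/{module_name}/"
        matching_files = [f for f in mod_info["files"] if f.startswith(prefix)]
        if not matching_files:
            continue
        matching_set = set(matching_files)
        lang_counter = Counter(
            f["language"] for f in files if f["file_path"] in matching_set
        )
        info = {
            "files": matching_files,
            "file_count": len(matching_files),
            "languages": dict(lang_counter),
        }
        return mod_name, info, True  # True marks a sub-folder match

    return None, None, False  # assumption: nothing matched


paths = ["lib/features/auth/login.dart", "lib/features/auth/register.dart",
         "lib/features/home/home.dart"]
modules_result = {"source_root": "lib", "modules": {"features": {"files": paths}}}
files = [{"file_path": p, "language": "dart"} for p in paths]
print(resolve_module_with_fallback("auth", modules_result, files))
```

The trailing slash in `prefix` matters: without it, a query for "auth" would also match a sibling folder such as "authz".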


Function: search_codebase_text(store, query)

One-line: Semantic search against ChromaDB, returns formatted context of top results.

Example

Input: store = <VectorStore>, query = "how does authentication work"
Line 92: results = store.search("how does authentication work", n_results=5)
         → results = [{"content": "class AuthBloc...", "metadata": {...}, "distance": 0.3}, ...]

Line 93: filtered, _ = filter_by_distance(results)
         → filtered = [result1, result2, result3]  (distance < threshold)

Line 94: filtered is not empty → skip "No relevant code" branch

Line 96: return format_context(filtered)
         → formatted markdown with code snippets

Return: Markdown string with matching code snippets.
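
The distance filter in the trace is not shown in detail. A minimal sketch, assuming it keeps results below a fixed cosine-distance threshold and returns the rejected ones as the second element (both the 0.5 default and the meaning of the second value are assumptions):

```python
def filter_by_distance(results, max_distance=0.5):
    """Split search results into (kept, dropped) by cosine distance."""
    kept = [r for r in results if r["distance"] < max_distance]
    dropped = [r for r in results if r["distance"] >= max_distance]
    return kept, dropped


results = [
    {"content": "class AuthBloc...", "metadata": {}, "distance": 0.3},
    {"content": "def grade_answer...", "metadata": {}, "distance": 0.85},
]
filtered, dropped = filter_by_distance(results)
print(len(filtered), len(dropped))  # 1 1
```

Filtering before formatting means an unrelated query can yield an empty list, which is what triggers the "No relevant code" branch mentioned at line 94.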


Function: module_info_text(modules_result, module_name, graph_runtime, graph_store)

One-line: Returns markdown with module files, languages, dependencies, hub files, and coupling stats.

Example

Input: module_name = "analysis", modules_result has modules and module_graph
Line 102: actual_name = resolve_module_name("analysis", modules) → "analysis"
Line 105: info = modules["analysis"]
          → {"files": ["analysis/blast_radius.py", ...], "file_count": 8, "languages": {"python": 8}}
Line 106: depends_on = module_graph["analysis"] → ["embeddings", "graph"]
Line 107: depended_by = [mod for mod, deps in module_graph.items() if "analysis" in deps]
          → ["query", "mcp"]

Line 134: file_names = ["blast_radius.py", "code_parser.py", ...] (sorted short names)
Line 135: lang_str = "python (8 files)"

Return:

## Module: analysis
**Files (8):** blast_radius.py, code_parser.py, ...
**Languages:** python (8 files)
**Depends on:** embeddings, graph
**Depended on by:** query, mcp
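
The dependency fields in this output come from two lookups on the module graph: a direct read for "depends on" and a reverse scan for "depended on by". The sketch below uses a hypothetical `module_graph` shaped like the one in the trace.

```python
# Hypothetical module graph: module -> modules it depends on.
module_graph = {
    "analysis": ["embeddings", "graph"],
    "query": ["analysis"],
    "mcp": ["analysis"],
    "embeddings": [],
    "graph": [],
}
languages = {"python": 8}

name = "analysis"
depends_on = module_graph[name]
# Reverse lookup: every module whose dependency list mentions `name`.
depended_by = [mod for mod, deps in module_graph.items() if name in deps]
lang_str = ", ".join(f"{lang} ({count} files)" for lang, count in languages.items())

report = "\n".join([
    f"## Module: {name}",
    f"**Languages:** {lang_str}",
    f"**Depends on:** {', '.join(depends_on)}",
    f"**Depended on by:** {', '.join(depended_by)}",
])
print(report)
```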

Function: explain_function_text(store, function_name, deps, graph_runtime, graph_store)

One-line: Looks up a function/class in ChromaDB, shows its code + blast radius + callers/callees.

Example

Input: function_name = "chunk_file", store = <VectorStore>
Line 146: results = store.search("chunk_file", n_results=10)
Line 147: filtered, _ = filter_by_distance(results)
Line 148: matches = [r for r in filtered if "chunk_file" in r["metadata"]["symbol_name"].lower()]
          → matches = [result_for_chunk_file]

Line 150: to_show = matches[:3] → [result_for_chunk_file]
Line 153: context = format_context(to_show) → "```python\ndef chunk_file(file_info):\n..."

Line 155: file_path = "src/codewalk/embeddings/chunker.py"
Line 157: radius = get_blast_radius(file_path, runtime)
          → {"risk_level": "high", "affected_files": 3, "direct": ["pipeline.py"], "transitive": ["main.py"]}

          context += "\n\n### Blast Radius\n**Risk:** HIGH — 3 files affected\n..."

Line 172: callers = graph_store.get_callers_of_symbol("chunker.py:chunk_file")
          → [{"caller": "chunk_and_embed_parallel", "file": "pipeline.py", "line": 48}]
          context += "\n\n### Called by (1 caller):\n  - chunk_and_embed_parallel() at pipeline.py:48"

Return: Markdown with code, blast radius, and call graph info.
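
The "Called by" section in the trace can be produced from the caller dicts with a small formatter. `format_callers` is a hypothetical helper name, and the singular/plural handling is an assumption based on the "(1 caller)" text above.

```python
def format_callers(callers):
    """Render caller dicts as the markdown 'Called by' section."""
    if not callers:
        return ""
    noun = "caller" if len(callers) == 1 else "callers"
    lines = [f"### Called by ({len(callers)} {noun}):"]
    for c in callers:
        lines.append(f"  - {c['caller']}() at {c['file']}:{c['line']}")
    return "\n".join(lines)


callers = [{"caller": "chunk_and_embed_parallel", "file": "pipeline.py", "line": 48}]
print(format_callers(callers))
# ### Called by (1 caller):
#   - chunk_and_embed_parallel() at pipeline.py:48
```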


Function: overview_text(repo_path, modules_result, deps, graph_runtime)

One-line: Full project overview — tech stack, modules, dependency flow, riskiest files, PageRank, cycles.

Return: Multi-section markdown overview.
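
The overview reports import cycles. How the project detects them is not shown here; one common approach is a depth-first search that tracks the files on the current path. The sketch below is an illustration of that technique under the assumption that imports are a dict of file → imported files, and `find_cycle` is a hypothetical name.

```python
def find_cycle(imports):
    """Return one import cycle as a list of files, or None if there is none."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for dep in imports.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path: we closed a loop.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in imports:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


print(find_cycle({"auth.py": ["user.py"], "user.py": ["auth.py"]}))
# ['auth.py', 'user.py', 'auth.py']
print(find_cycle({"a.py": ["b.py"], "b.py": []}))  # None
```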


Function: blast_radius_map_text(modules_result, deps, target, graph_runtime)

One-line: Blast radius report for a target module, file, or top 30 riskiest files.

Example

Input: target = "analysis", modules_result has modules, deps has graph
Line 316: modules = modules_result["modules"]
Line 317: actual_module = resolve_module_name("analysis", modules) → "analysis"
Line 319: target_files = sorted(modules["analysis"]["files"])
          → ["analysis/blast_radius.py", "analysis/code_parser.py", ...]
Line 320: scope = "module 'analysis'"

Line 332: all_risks, max_risk = compute_file_risks(target_files, runtime)
          → ([{blast_radius.py, moderate, 3}, ...], "high")

Line 338-345: Format each risk entry into lines:
  "[MODERATE] analysis/blast_radius.py — 3 affected | breaks: query.py → then: main.py"

Return: Formatted blast radius report markdown.
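
Each report line has the shape "[LEVEL] file — N affected | breaks: ... → then: ...". A sketch of a formatter that produces it from one risk dict follows; `format_risk_line` is a hypothetical name, and the comma separators and the omission of empty sections are assumptions.

```python
def format_risk_line(risk):
    """Render one risk dict as a single report line."""
    line = f"[{risk['risk_level'].upper()}] {risk['file']} — {risk['affected_files']} affected"
    if risk["direct"]:
        line += " | breaks: " + ", ".join(risk["direct"])
    if risk["transitive"]:
        line += " → then: " + ", ".join(risk["transitive"])
    return line


entry = {
    "file": "src/config.py",
    "risk_level": "high",
    "affected_files": 5,
    "direct": ["pipeline.py", "query.py"],
    "transitive": ["main.py"],
}
print(format_risk_line(entry))
# [HIGH] src/config.py — 5 affected | breaks: pipeline.py, query.py → then: main.py
```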


Function: reading_order_text(files, deps, modules_result, module_name, graph_runtime)

One-line: Returns files in topological dependency order with blast radius risk annotations.

Example (scoped to a module)

Input: module_name = "embeddings"
Line 362: order = generate_reading_order_raw(files, deps)
Line 366: actual_name = resolve_module_name("embeddings", modules) → "embeddings"
Line 369: module_files = set(modules["embeddings"]["files"])
Line 370: all_items = [item for item in all_items if item["file"] in module_files]
          → 4 items (chunker, embedder, vector_store, __init__)

Line 374: For each item, compute blast radius and format (dependencies first):
  "1. [MODERATE] embeddings/chunker.py (2 affected) — no deps, read first"
  ...
  "4. [LOW] embeddings/__init__.py (0 affected) — re-exports"

Return: "## Reading Order — module 'embeddings' (4 files)\n1. [MODERATE] ..."
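
Scoping to a module works by filtering the global reading order rather than re-sorting. A sketch with hypothetical items (the file names and their relative order are illustrative):

```python
# Hypothetical global reading order, dependencies first.
all_items = [
    {"file": "embeddings/chunker.py"},
    {"file": "analysis/code_parser.py"},
    {"file": "embeddings/embedder.py"},
    {"file": "embeddings/vector_store.py"},
    {"file": "analysis/blast_radius.py"},
]
module_files = {
    "embeddings/chunker.py",
    "embeddings/embedder.py",
    "embeddings/vector_store.py",
}

# A list comprehension keeps the original relative order, so the scoped
# list is still in dependency order without sorting again.
scoped = [item for item in all_items if item["file"] in module_files]
lines = [f"{i}. {item['file']}" for i, item in enumerate(scoped, start=1)]
print("\n".join(lines))
# 1. embeddings/chunker.py
# 2. embeddings/embedder.py
# 3. embeddings/vector_store.py
```

Using a set for `module_files` keeps each membership check constant-time, which matters for large repositories.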


Function: execution_flow_text(modules_result, deps, module_name)

One-line: Module-to-module dependency flow when module_name is empty, or file-to-file dependency flow within the named module.

Example (module level, no module_name)

Input: module_name = ""
Line 395: depended_on = {"embeddings", "graph", "analysis"}  (modules that others depend on)
Line 398: entry_modules = ["agent", "api", "mcp"]  (not depended on by anyone)

Line 400-405: Format each module:
  "  analysis (8 files) → depends on: embeddings, graph"
  "  api (3 files) → (standalone)"

Return: Module-level dependency flow markdown.
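
The entry-module computation above boils down to a set difference: modules that nobody depends on. The sketch below uses a hypothetical `module_graph` and file counts chosen to match the trace; the counts for modules other than analysis and api are made up for illustration.

```python
# Hypothetical data: module -> modules it depends on, and file counts.
module_graph = {
    "agent": ["analysis"],
    "analysis": ["embeddings", "graph"],
    "api": [],
    "embeddings": [],
    "graph": [],
    "mcp": ["analysis"],
}
file_counts = {"agent": 2, "analysis": 8, "api": 3, "embeddings": 4, "graph": 5, "mcp": 2}

# Every module that appears in someone else's dependency list.
depended_on = {dep for deps in module_graph.values() for dep in deps}
# Entry modules are the ones no other module depends on.
entry_modules = sorted(m for m in module_graph if m not in depended_on)

lines = []
for mod in sorted(module_graph):
    deps = module_graph[mod]
    tail = f"depends on: {', '.join(deps)}" if deps else "(standalone)"
    lines.append(f"  {mod} ({file_counts[mod]} files) → {tail}")

print(entry_modules)  # ['agent', 'api', 'mcp']
print("\n".join(lines))
```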

Example (file level within a module)

Input: module_name = "embeddings"

Shows file-by-file imports within the module, separating internal vs cross-module dependencies.

Return: File-level dependency flow for the specified module.
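
Separating internal from cross-module dependencies is a membership test against the module's file set. A sketch with hypothetical data (the import lists and the `split_imports` helper are illustrative, not taken from the source):

```python
# Hypothetical inputs: files in the module, and each file's imports.
module_files = {"embeddings/chunker.py", "embeddings/embedder.py"}
file_imports = {
    "embeddings/embedder.py": ["embeddings/chunker.py", "config.py"],
    "embeddings/chunker.py": [],
}


def split_imports(file_path):
    """Return (internal, external) imports for one file in the module."""
    deps = file_imports.get(file_path, [])
    internal = [d for d in deps if d in module_files]
    external = [d for d in deps if d not in module_files]
    return internal, external


print(split_imports("embeddings/embedder.py"))
# (['embeddings/chunker.py'], ['config.py'])
```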
