Home
aakash-anko edited this page May 28, 2026 · 2 revisions
Codewalk is an AI-powered codebase onboarding tool that analyzes repositories, builds dependency graphs, embeds code into vector stores, and answers questions about codebases using RAG (Retrieval-Augmented Generation).
| Term | Definition | Example |
|---|---|---|
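| topological sort | Ordering the vertices of a directed graph so that every edge points the same way through the order. Here an edge A→B means "A imports B", and B is placed before A so that dependencies come first (the reverse of the textbook convention, which puts A before B). Only works on DAGs. | If A imports B and B imports C, topological order is: C, B, A (dependencies first). |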
| blast radius | All files that would be affected if a given file changes — found by following reverse import edges transitively. | If A imports B and C imports A, changing B has blast radius = {A, C}. |
| transitive dependency | An indirect dependency through a chain. If A imports B and B imports C, then A transitively depends on C. | Changing C could break A even though A never directly imports C. |
| embedding | A numerical vector (list of numbers) that represents the meaning of text. Similar text → similar vectors. | The code `def add(a, b): return a+b` might become `[0.12, -0.45, 0.78, ...]` (for example, 1536 numbers for some OpenAI embedding models). |
| vector store | A database optimized for storing embeddings and finding the most similar ones quickly. | ChromaDB stores code chunk embeddings and returns the 5 most similar chunks to your query. |
| cosine distance | Measures how different the directions of two vectors are. 0.0 = same direction (identical meaning), 1.0 = orthogonal (unrelated), 2.0 = opposite. | Query "scan files" has cosine distance 0.15 to `scan_directory()` (very similar) and 0.85 to `grade_answer()` (very different). |
| ChromaDB | An open-source vector database for storing and searching embeddings. Used here to store code chunks. | `collection.query(query_texts=["scan files"], n_results=5)` returns the 5 closest code chunks. |
| DuckDB | An embedded SQL database (like SQLite but optimized for analytics). Used here to store file metadata and import edges. | `conn.execute("SELECT path FROM files WHERE language='python'").fetchall()` returns all Python file paths. |
| chunk | A piece of source code (usually one function or class) stored as a unit for search. | The function `def scan_directory(root): ...` (20 lines) is one chunk. |
| AST | Abstract Syntax Tree: a tree representation of source code structure, where each node is a language construct (function, class, if-statement, etc.). | `def add(a, b): return a+b` becomes a tree: FunctionDef → [args: a, b] → [body: Return → BinOp(a + b)]. |
| tree-sitter | A fast, multi-language parser that builds ASTs. Supports 100+ languages without needing each language's compiler. | tree-sitter parses `config.py` into an AST, then we extract function/class nodes from it. |
| RAG | Retrieval-Augmented Generation: instead of asking an LLM to answer from memory, first retrieve relevant documents, then include them in the prompt. | Question: "What does scan_directory do?" → retrieve the source code of `scan_directory` → include it in the LLM prompt → get an accurate answer. |
| LLM | Large Language Model: an AI model (like GPT-4, Claude) that generates text given a prompt. | `get_llm()` returns a `ChatOpenAI` instance that can answer questions about code. |
| Pydantic | A Python library for data validation using type hints. Defines schemas as classes with typed fields. | With `class QueryRoute(BaseModel): route: str; target: str`, any instance is guaranteed to have string `route` and `target` fields. |
| glob pattern | A wildcard pattern for matching file paths. `*` matches anything in one directory, `**` matches across directories. | `**/*.py` matches all Python files in any subdirectory. `src/*.ts` matches TypeScript files only in `src/`. |
| diff | The set of changes between two versions of code, showing added (+) and removed (-) lines. | `- old_line` followed by `+ new_line` shows that `old_line` was replaced with `new_line`. |
| hunk | A contiguous block of changes within a diff. One diff can contain multiple hunks (changes in different parts of a file). | A diff might have hunk 1 (lines 10-15 changed) and hunk 2 (lines 80-85 changed). |
| TTS | Text-to-Speech: converting text into spoken audio. | `speak("The config module loads settings")` generates an audio file and plays it. |
| STT | Speech-to-Text — converting spoken audio into text (transcription). | User speaks into microphone → STT produces "What does scan_directory do?". |
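
The cosine distance scale in the table (0.0, 1.0, 2.0) can be checked by hand. The sketch below is a plain-Python illustration, not Codewalk's own code; the function name `cosine_distance` and the two-dimensional toy vectors are assumptions made for the example, and real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors (hypothetical, chosen to hit the three reference points).
same = cosine_distance([1.0, 0.0], [1.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Because the measure depends only on direction, scaling a vector does not change its distance to another vector.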
Full definitions of all technical terms (vertex, edge, DAG, embedding, RAG, etc.) with examples.
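
For the graph terms, the two table examples (topological sort and blast radius) can be reproduced with the standard library. This is a minimal sketch under stated assumptions: the import graph is a plain dict mapping each file to the set of files it imports, and the helper names `reading_order` and `blast_radius` are hypothetical, not the API of `reading_order.py` or `blast_radius.py`.

```python
from collections import deque
from graphlib import TopologicalSorter  # Python 3.9+


def reading_order(imports):
    """Dependencies-first order; raises graphlib.CycleError if not a DAG."""
    # TopologicalSorter expects node -> predecessors, which matches
    # "file -> files it imports" when dependencies should come first.
    return list(TopologicalSorter(imports).static_order())


def blast_radius(imports, changed):
    """All files that directly or transitively import `changed`."""
    # Build reverse import edges: file -> files that import it.
    dependents = {}
    for importer, targets in imports.items():
        for target in targets:
            dependents.setdefault(target, set()).add(importer)

    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent != changed and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Table example: A imports B, B imports C.
chain = {"A": {"B"}, "B": {"C"}, "C": set()}
print(reading_order(chain))

# Table example: A imports B, C imports A; B changes.
graph = {"A": {"B"}, "C": {"A"}, "B": set()}
print(blast_radius(graph, "B"))
```

A breadth-first walk over reverse edges is one straightforward way to collect transitive dependents; the project may well use NetworkX traversal instead, as the `graph_runtime.py` entry suggests.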
- config.py — Settings and LLM factory for all providers
- errors.py — User-friendly error classification
- log.py — Shared logging utility (stderr + file)
- pipeline.py — Full indexing pipeline: scan → chunk → embed → store
- query.py — Core query logic shared by MCP, API, and agent
- module_detector.py — Detects logical modules from file paths
- dependency_graph.py — Builds file-level import dependency graphs
- blast_radius.py — Calculates change impact (direct + transitive dependents)
- reading_order.py — Suggests file reading order via topological sort
- code_parser.py — Extracts functions/classes from source files (Tree-sitter)
- python_parser.py — Python-specific AST parser for symbols
- relevance_filter.py — LLM-based file filtering for indexing
- chunker.py — Splits source files into parent/child chunks
- embedder.py — Generates vector embeddings for code chunks
- vector_store.py — ChromaDB wrapper for storing/querying embeddings
- scanner.py — Walks directories and collects source files
- file_filter.py — Pattern-based file inclusion/exclusion
- tech_detect.py — Detects project tech stack from marker files
- overview_generator.py — LLM-generated project overviews
- module_explainer.py — LLM-generated module explanations
- flow_generator.py — LLM-generated execution flow narratives
- diagram_generator.py — Mermaid diagram generation
- graph_store.py — DuckDB-backed import/call graph storage
- graph_runtime.py — In-memory graph for traversal (NetworkX)
- call_extractor.py — Extracts function call relationships from AST
- chain.py — RAG answer generation chain
- query_router.py — Routes queries to graph/vector/hybrid
- query_rewriter.py — Rewrites queries for better retrieval
- chunk_grader.py — Grades retrieved chunks for relevance
- answer_grader.py — Grades generated answers for hallucination
- graph_expansion.py — Expands retrieval using dependency graph
- retrieval_quality.py — Distance-based filtering of search results
- prompts.py — Prompt templates for RAG chain
- graph.py — LangGraph agent workflow definition
- tools.py — Tool definitions for the LangGraph agent
- prompts.py — System prompts for the agent
- main.py — FastAPI REST endpoints
- models.py — Pydantic request/response models
- state.py — Global application state management
- server.py — Model Context Protocol server for VS Code integration
- reviewer.py — Code review engine (diff + file review)
- diff_parser.py — Git diff parsing into structured hunks
- models.py — Review result data models
- review_prompts.py — Prompt templates for code review
- guidelines_loader.py — Loads team coding standards
- test_coverage.py — Test coverage gap detection
- doc_parser.py — Parses .md, .pdf, .txt documents into chunks
- doc_store.py — ChromaDB store for document chunks
- prompts.py — Prompt template for doc Q&A
- companion.py — Voice companion orchestrator
- stt.py — Speech-to-text (microphone recording + transcription)
- tts.py — Text-to-speech output
- router.py — Routes voice transcriptions to codewalk tools
- backends.py — TTS/STT backend implementations
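
To make the `diff_parser.py` entry and the glossary's "hunk" definition concrete, here is a small sketch of splitting a unified diff into hunks. It is an illustration only and does not reflect the actual implementation: the `Hunk` dataclass and `parse_hunks` function are hypothetical names, and the code assumes a single-file diff in standard unified format, where each hunk starts with an `@@ -old_start,old_count +new_start,new_count @@` header.

```python
import re
from dataclasses import dataclass, field

# Unified diff hunk header; a count is omitted when it equals 1.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass
class Hunk:
    old_start: int
    old_count: int
    new_start: int
    new_count: int
    lines: list = field(default_factory=list)


def parse_hunks(diff_text):
    """Split a single-file unified diff into Hunk objects."""
    hunks = []
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            old_start, old_count, new_start, new_count = match.groups()
            hunks.append(Hunk(int(old_start), int(old_count or 1),
                              int(new_start), int(new_count or 1)))
        elif hunks and line[:1] in ("+", "-", " "):
            # Added, removed, or context line belonging to the current hunk.
            hunks[-1].lines.append(line)
        # Lines before the first hunk (the ---/+++ file headers) are skipped.
    return hunks


# Hypothetical sample diff with two hunks in one file.
sample = """\
--- a/config.py
+++ b/config.py
@@ -10,2 +10,2 @@
 import os
-old_line
+new_line
@@ -80 +80,2 @@
 return x
+log(x)
"""

for hunk in parse_hunks(sample):
    print(hunk.old_start, hunk.new_start, len(hunk.lines))
```

A multi-file diff would need extra handling for the `---`/`+++` headers of later files, which this sketch deliberately leaves out.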