aakash-anko edited this page May 28, 2026 · 2 revisions

Codewalk Wiki

Codewalk is an AI-powered codebase onboarding tool that analyzes repositories, builds dependency graphs, embeds code into vector stores, and answers questions about codebases using RAG (Retrieval-Augmented Generation).


Key Concepts

| Term | Definition | Example |
| --- | --- | --- |
| topological sort | An ordering of a graph's vertices that respects every edge. Here an edge A→B means "A imports B" and dependencies are listed first, so B comes before A. Only possible on DAGs (directed acyclic graphs). | If A imports B and B imports C, the topological order is C, B, A (dependencies first). |
| blast radius | All files that would be affected if a given file changes, found by following reverse import edges transitively. | If A imports B and C imports A, changing B has blast radius = {A, C}. |
| transitive dependency | An indirect dependency through a chain. | If A imports B and B imports C, then A transitively depends on C. Changing C could break A even though A never directly imports C. |
| embedding | A numerical vector (list of numbers) that represents the meaning of text. Similar text → similar vectors. | The code `def add(a, b): return a+b` might become `[0.12, -0.45, 0.78, ...]` (1536 numbers for OpenAI models such as text-embedding-3-small). |
| vector store | A database optimized for storing embeddings and finding the most similar ones quickly. | ChromaDB stores code chunk embeddings and returns the 5 most similar chunks to your query. |
| cosine distance | Measures how far apart two vectors point, computed as 1 minus their cosine similarity. 0.0 = same direction (near-identical meaning), 1.0 = unrelated (orthogonal), 2.0 = opposite. | Query "scan files" has cosine distance 0.15 to `scan_directory()` (very similar) and 0.85 to `grade_answer()` (very different). |
| ChromaDB | An open-source vector database for storing and searching embeddings. Used here to store code chunks. | `collection.query(query_texts=["scan files"], n_results=5)` returns the 5 closest code chunks. |
| DuckDB | An embedded SQL database (like SQLite but optimized for analytics). Used here to store file metadata and import edges. | `conn.execute("SELECT path FROM files WHERE language='python'").fetchall()` returns one row per Python file path. |
| chunk | A piece of source code (usually one function or class) stored as a unit for search. | The function `def scan_directory(root): ...` (20 lines) is one chunk. |
| AST | Abstract Syntax Tree: a tree representation of source code structure, where each node is a language construct (function, class, if-statement, etc.). | `def add(a, b): return a+b` becomes a tree: FunctionDef → [args: a, b] → [body: Return → BinOp(a + b)]. |
| tree-sitter | A fast, multi-language parser that builds ASTs. Supports 100+ languages without needing each language's compiler. | tree-sitter parses `config.py` into an AST, then we extract function/class nodes from it. |
| RAG | Retrieval-Augmented Generation: instead of asking an LLM to answer from memory, first retrieve relevant documents, then include them in the prompt. | Question: "What does scan_directory do?" → retrieve the source code of `scan_directory` → include it in the LLM prompt → get an accurate answer. |
| LLM | Large Language Model: an AI model (like GPT-4 or Claude) that generates text given a prompt. | `get_llm()` returns a `ChatOpenAI` instance that can answer questions about code. |
| Pydantic | A Python library for data validation using type hints. Defines schemas as classes with typed fields. | `class QueryRoute(BaseModel): route: str; target: str` guarantees that any instance has string `route` and `target` fields. |
| glob pattern | A wildcard pattern for matching file paths. `*` matches anything within one directory level, `**` matches across directories. | `**/*.py` matches all Python files in any subdirectory. `src/*.ts` matches TypeScript files only in `src/`. |
| diff | The set of changes between two versions of code, showing added (+) and removed (-) lines. | `- old_line\n+ new_line` shows that old_line was replaced with new_line. |
| hunk | A contiguous block of changes within a diff. One diff can contain multiple hunks (changes in different parts of a file). | A diff might have hunk 1 (lines 10-15 changed) and hunk 2 (lines 80-85 changed). |
| TTS | Text-to-Speech: converting text into spoken audio. | `speak("The config module loads settings")` generates an audio file and plays it. |
| STT | Speech-to-Text: converting spoken audio into text (transcription). | User speaks into microphone → STT produces "What does scan_directory do?". |
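
The three graph terms above (topological sort, blast radius, transitive dependency) can be illustrated with a minimal sketch that uses only the Python standard library. The file names and the `imports` mapping are hypothetical, and this is not necessarily how Codewalk's graph package implements these ideas.

```python
from collections import deque
from graphlib import TopologicalSorter

# Hypothetical import graph: each file maps to the set of files it imports.
imports = {
    "cli.py": {"app.py"},
    "app.py": {"service.py"},
    "service.py": {"config.py"},
    "config.py": set(),
}


def dependency_order(imports):
    """Return files with dependencies first (raises CycleError if not a DAG)."""
    # TopologicalSorter reads each mapped set as "these must come before the key".
    return list(TopologicalSorter(imports).static_order())


def blast_radius(imports, changed):
    """Return every file that directly or transitively imports `changed`."""
    # Build reverse edges: file -> files that import it.
    importers = {}
    for name, deps in imports.items():
        for dep in deps:
            importers.setdefault(dep, set()).add(name)
    affected, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for parent in importers.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


order = dependency_order(imports)
radius = blast_radius(imports, "service.py")
print(order)
print(sorted(radius))
```

In this toy graph `cli.py` never imports `config.py` directly, yet it appears in the blast radius of `config.py`. That is the transitive dependency at work.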

Full definitions of all technical terms (vertex, edge, DAG, embedding, RAG, etc.) with examples.
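
To make the cosine distance scale from Key Concepts concrete, here is a small sketch in plain Python. The three-dimensional vectors are made-up toy values rather than real embeddings, and `load_config` is a hypothetical name added for illustration. A vector store such as ChromaDB does this ranking internally with optimized indexes.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest(query_vec, stored, n_results=2):
    """Return the names of the stored vectors closest to the query."""
    ranked = sorted(stored, key=lambda name: cosine_distance(query_vec, stored[name]))
    return ranked[:n_results]


# Toy vectors (assumed values); real embeddings have far more dimensions.
stored = {
    "scan_directory": [0.9, 0.1, 0.0],
    "grade_answer": [0.0, 0.2, 0.9],
    "load_config": [0.5, 0.5, 0.1],
}
query = [1.0, 0.0, 0.0]
top = nearest(query, stored)
print(top)
```

Because cosine distance depends only on direction, scaling a vector by a positive constant does not change its distance to any other vector.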

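The AST and chunk entries in Key Concepts can be sketched with Python's built-in `ast` module. This is a stand-in for illustration only: Codewalk is described as using tree-sitter, and the `extract_chunks` helper and the chunk dictionary layout below are assumptions, not the project's actual chunker.

```python
import ast

source = '''\
def add(a, b):
    return a + b

class Greeter:
    def greet(self, name):
        return "hi " + name
'''


def extract_chunks(source):
    """Return one chunk per top-level function or class in the source."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks


tree = ast.parse(source)
add_node = tree.body[0]          # FunctionDef for add
return_node = add_node.body[0]   # Return statement inside add
chunks = extract_chunks(source)

print(type(add_node).__name__, [arg.arg for arg in add_node.args.args])
print(type(return_node).__name__, type(return_node.value).__name__)
for chunk in chunks:
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk keeps its line range, so a search hit can be traced back to the exact location in the file.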

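The diff and hunk entries in Key Concepts can be seen in a short sketch built on the standard library's `difflib`. The file contents, the `example.txt` name, and the `split_hunks` helper are hypothetical. Codewalk's review package may parse diffs differently.

```python
import difflib

# Two versions of a 30-line file that differ in two distant places.
old = [f"line {i}\n" for i in range(1, 31)]
new = list(old)
new[4] = "line 5 (edited)\n"
new[24] = "line 25 (edited)\n"

diff_lines = list(
    difflib.unified_diff(old, new, fromfile="a/example.txt", tofile="b/example.txt")
)


def split_hunks(diff_lines):
    """Group unified-diff lines into hunks, one per '@@' header line."""
    hunks = []
    for line in diff_lines:
        if line.startswith("@@"):
            hunks.append([line])      # a new hunk starts at each header
        elif hunks:
            hunks[-1].append(line)    # context, '+' and '-' lines
    return hunks


hunks = split_hunks(diff_lines)
print(len(hunks))
print("".join(hunks[0]))
```

The two edits are far enough apart that their surrounding context lines do not overlap, so the diff contains two separate hunks instead of one.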
Table of Contents

Core Modules

  • config.py — Settings and LLM factory for all providers
  • errors.py — User-friendly error classification
  • log.py — Shared logging utility (stderr + file)
  • pipeline.py — Full indexing pipeline: scan → chunk → embed → store
  • query.py — Core query logic shared by MCP, API, and agent

Analysis Package

Embeddings Package

  • chunker.py — Splits source files into parent/child chunks
  • embedder.py — Generates vector embeddings for code chunks
  • vector_store.py — ChromaDB wrapper for storing/querying embeddings

Ingestion Package

Generation Package

Graph Package

RAG Package

Agent Package

  • graph.py — LangGraph agent workflow definition
  • tools.py — Tool definitions for the LangGraph agent
  • prompts.py — System prompts for the agent

API Package

  • main.py — FastAPI REST endpoints
  • models.py — Pydantic request/response models
  • state.py — Global application state management

MCP Package

  • server.py — Model Context Protocol server for VS Code integration

Review Package

Doc Knowledge Package

Voice Package

  • companion.py — Voice companion orchestrator
  • stt.py — Speech-to-text (microphone recording + transcription)
  • tts.py — Text-to-speech output
  • router.py — Routes voice transcriptions to codewalk tools
  • backends.py — TTS/STT backend implementations
