CodeAtlas Architecture

Overview

CodeAtlas consists of modular components to crawl, embed, index, and search codebases.

graph TD
    %% --- Data Ingestion Pipeline ---
    subgraph "Data Ingestion Pipeline"
        A[📂 Code Repositories] --> B[🕷️ Crawler<br/>File Discovery]
        B --> C[🌳 Chunker<br/>AST / Tree-sitter Parsing]
        C --> D[🧩 Embedder<br/>HuggingFace / OpenAI]
        D --> E[🗄️ Indexer<br/>FAISS Vector Store]
    end

    %% --- Search & Intelligence Layer ---
    subgraph "Search & Intelligence Layer"
        E --> F[🔍 Searcher<br/>Vector Similarity]
        F --> G[💬 Chat Service<br/>Context + LLM]
        G --> H[🤖 LLM Backends<br/>HF Transformers / OpenAI]
    end

    %% --- API & Interface Layer ---
    subgraph "API & Interface Layer"
        G --> I[⚡ FastAPI Endpoints<br/>Chat / Search / Repos]
        I --> J[🖥️ Streamlit Frontend<br/>Interactive UI]
    end

    %% --- Config & Dependencies ---
    subgraph "Configuration & Dependencies"
        K[⚙️ Settings<br/>.env Config] --> L[📦 Dependency Injection<br/>Cached Services]
        L --> G
        L --> F
        L --> D
    end

    %% --- Continuous Pipeline ---
    subgraph "Continuous Pipeline"
        M[🚀 Init Scripts<br/>Auto Indexing] --> N[♻️ Hash-based<br/>Change Detection]
        N --> B
    end

    %% --- Styles (High Contrast, Works on Dark/Light) ---
    style A fill:#4FC3F7,stroke:#0288D1,stroke-width:2px,color:#000
    style B fill:#BA68C8,stroke:#6A1B9A,stroke-width:2px,color:#fff
    style C fill:#BA68C8,stroke:#6A1B9A,stroke-width:2px,color:#fff
    style D fill:#81C784,stroke:#2E7D32,stroke-width:2px,color:#000
    style E fill:#81C784,stroke:#2E7D32,stroke-width:2px,color:#000
    style F fill:#FFB74D,stroke:#E65100,stroke-width:2px,color:#000
    style G fill:#E57373,stroke:#B71C1C,stroke-width:2px,color:#fff
    style H fill:#E57373,stroke:#B71C1C,stroke-width:2px,color:#fff
    style I fill:#BDBDBD,stroke:#424242,stroke-width:2px,color:#000
    style J fill:#F06292,stroke:#880E4F,stroke-width:2px,color:#fff
    style K fill:#EEEEEE,stroke:#616161,stroke-width:2px,color:#000
    style L fill:#EEEEEE,stroke:#616161,stroke-width:2px,color:#000
    style M fill:#4DB6AC,stroke:#00695C,stroke-width:2px,color:#000
    style N fill:#4DB6AC,stroke:#00695C,stroke-width:2px,color:#000
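
The Continuous Pipeline in the diagram routes auto indexing through hash-based change detection before the crawler runs. A minimal sketch of that idea, assuming SHA-256 over file contents and an in-memory map of previous digests; the names `file_digest` and `changed_files` are hypothetical and not taken from the CodeAtlas source:

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(paths, previous):
    """Compare current digests against `previous` (path -> digest).

    Returns (changed, current): the sorted paths whose digest is new or
    different, and the fresh digest map to store for the next run.
    """
    current = {str(p): file_digest(p) for p in paths}
    changed = sorted(p for p, digest in current.items() if previous.get(p) != digest)
    return changed, current
```

On a first run `previous` would be empty, so every file counts as changed; later runs would re-index only the paths returned in `changed`. Where the digest map is persisted between runs is left open here.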



Components

  • Crawler: Recursively scans target repositories for source files, keeps those with a supported extension (.py, .js, .java, .ts, .cpp, .c, .go), and skips directories such as .git, __pycache__, node_modules, and venv.

  • Chunker: Extracts class, function, and overview chunks from source files, using Python AST parsing for Python and Tree-sitter for JavaScript, TypeScript, Java, Go, C, and C++. Chunks carry overlapping context for better retrieval.

  • Embedder: Converts code chunks into dense vector embeddings using either HuggingFace SentenceTransformer models or OpenAI embedding APIs (configurable backend).

  • Indexer: Builds and maintains a FAISS vector index using L2 distance, storing embeddings with metadata (file path, line ranges, chunk type, chunk name).

  • Searcher: Performs semantic vector similarity search on the FAISS index to find relevant code snippets based on query embeddings.

  • Chat Service: Assembles context from search results and selects an LLM backend (HuggingFace Transformers or OpenAI GPT) to generate developer-friendly responses.

  • API: Implements FastAPI REST endpoints for chat queries, semantic search, and repository listing, with services supplied through dependency injection.

  • Frontend: Streamlit web application providing an interactive chat interface for querying codebases.
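
The first two components can be sketched with the standard library alone. This is an illustration under stated assumptions, not the CodeAtlas implementation: the names `crawl` and `chunk_python` and the metadata keys are hypothetical, only top-level Python definitions are chunked, and overview chunks, overlapping context, and the Tree-sitter languages are left out.

```python
import ast
import os
from pathlib import Path

# Extensions and excluded directories as listed in the Crawler description.
SOURCE_EXTENSIONS = {".py", ".js", ".java", ".ts", ".cpp", ".c", ".go"}
EXCLUDED_DIRS = {".git", "__pycache__", "node_modules", "venv"}


def crawl(root):
    """Yield source files under `root`, skipping excluded directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = sorted(d for d in dirnames if d not in EXCLUDED_DIRS)
        for name in sorted(filenames):
            if Path(name).suffix in SOURCE_EXTENSIONS:
                yield Path(dirpath) / name


def chunk_python(source, file_path):
    """Return one chunk per top-level class or function, with metadata."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "file_path": str(file_path),
                "chunk_type": "class" if isinstance(node, ast.ClassDef) else "function",
                "chunk_name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Each chunk dictionary carries the kind of metadata the Indexer is described as storing (file path, line range, chunk type, chunk name), so the chunk text could be embedded and the remaining fields kept alongside the vector.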

Technologies

  • FastAPI for the REST API.
  • HuggingFace SentenceTransformers for local embedding models that convert code to vectors (configurable model selection).
  • OpenAI API for optional cloud-based embeddings (text-embedding-3-small) and conversational language models.
  • FAISS for vector similarity search.
  • Streamlit for the optional frontend.
  • Tree-sitter, a language-agnostic parser generator, for extracting structured code chunks from multiple programming languages.
  • Python AST, the built-in abstract syntax tree parser, for precise Python code analysis.
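
To make the L2-distance search concrete, here is a pure-Python stand-in that performs exact nearest-neighbour lookup over stored vectors and returns their metadata. It does not use FAISS; the class `FlatL2Index` and its methods are hypothetical, and it assumes exact (flat) search with squared Euclidean distance, which is what FAISS's flat L2 index reports.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class FlatL2Index:
    """Brute-force stand-in for a flat L2 vector index with per-vector metadata."""

    def __init__(self):
        self.vectors = []
        self.metadata = []

    def add(self, vector, meta):
        """Store a vector together with its chunk metadata."""
        self.vectors.append(list(vector))
        self.metadata.append(meta)

    def search(self, query, k=3):
        """Return the k closest entries as (distance, metadata), nearest first."""
        scored = ((squared_l2(query, v), i) for i, v in enumerate(self.vectors))
        return [(dist, self.metadata[i]) for dist, i in heapq.nsmallest(k, scored)]


index = FlatL2Index()
index.add([0.0, 0.0], {"file_path": "a.py", "chunk_name": "origin"})
index.add([3.0, 4.0], {"file_path": "b.py", "chunk_name": "far"})
index.add([1.0, 0.0], {"file_path": "c.py", "chunk_name": "near"})
hits = index.search([0.9, 0.0], k=2)  # nearest first: "near", then "origin"
```

A smaller distance means a closer match, so results are sorted ascending. A real deployment would replace the linear scan with the FAISS index and use embeddings from the configured backend rather than hand-written vectors.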