This document outlines the detailed architecture of Codecrawl. It explains the responsibilities of each component and how they interact to achieve efficient code processing.
Codecrawl is designed to traverse code repositories, extract code segments using language-specific parsing techniques, and package the results into a configurable output format. Its modular architecture makes it easy to extend and maintain, with clear separations between the API layer, the core crawling and parsing functionalities, and the auxiliary services like database interaction and background processing.
- Location: `/apps/api`
- Purpose: Houses the primary API server, which acts as the entry point for all client interactions.
- Environment Setup:
- Environment variables are managed via `.env` files.
- Build configuration is handled in files such as `tsconfig.json` and `drizzle.config.ts`.
- Tools Used: PNPM handles package management and workspaces.
- Configuration Files:
- `pnpm-lock.yaml` and `pnpm-workspace.yaml` ensure consistent dependency management.
- Various configuration files control TypeScript and build settings.
- Core Technologies: Node.js, Express, and WebSockets.
- Responsibilities:
- Receive HTTP requests and WebSocket connections.
- Validate incoming data using middleware.
- Route requests to appropriate internal modules (e.g., crawler, parser, database handler).
- Main Entry: `apps/api/src/index.ts` initializes the server.
- Routing: Request routing is defined using Express, with error-handling and logging middleware integrated.
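
The validate-then-route flow listed above can be illustrated with a small, framework-free sketch. It deliberately avoids Express so that it runs on its own. The request shape (`target`, `payload`), the handler table, and the status codes are all assumptions made for illustration; they are not taken from the Codecrawl source.

```typescript
// Hypothetical request shape; the real API payload is not described here.
type Target = "crawler" | "parser" | "database";

interface ApiRequest {
  target: Target;
  payload: unknown;
}

type Handler = (payload: unknown) => string;

// Stand-ins for the internal modules the API routes to.
const handlers: Record<Target, Handler> = {
  crawler: () => "crawler invoked",
  parser: () => "parser invoked",
  database: () => "database handler invoked",
};

// Validation step: reject anything that does not name a known module.
function validate(body: unknown): ApiRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("request body must be an object");
  }
  const { target, payload } = body as Record<string, unknown>;
  if (
    typeof target !== "string" ||
    !Object.prototype.hasOwnProperty.call(handlers, target)
  ) {
    throw new Error(`unknown target: ${String(target)}`);
  }
  return { target: target as Target, payload };
}

// Routing step: dispatch a validated request, mapping failures to a 400.
function route(body: unknown): { status: number; body: string } {
  try {
    const req = validate(body);
    return { status: 200, body: handlers[req.target](req.payload) };
  } catch (err) {
    return { status: 400, body: (err as Error).message };
  }
}
```

In an Express application the same two steps would typically live in a validation middleware and a route handler, with the `catch` branch replaced by error-handling middleware.
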
- Purpose: Recursively traverse directories to locate and read source files for processing.
- Configuration:
- Users can configure file inclusion/exclusion patterns.
- The crawler adapts based on runtime configurations, merging inputs from command-line arguments and environment variables.
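
One way the merging of command-line arguments and environment variables might look is sketched below. The precedence order (command line over environment over defaults), the `--include`/`--exclude` flags, and the `CODECRAWL_INCLUDE`/`CODECRAWL_EXCLUDE` variable names are assumptions for illustration; the source does not specify them.

```typescript
interface CrawlConfig {
  include: string[];
  exclude: string[];
}

// Either source may leave a field unset.
type Overrides = {
  include: string[] | undefined;
  exclude: string[] | undefined;
};

// Assumed defaults, used when neither source sets a value.
const defaults: CrawlConfig = {
  include: ["**/*"],
  exclude: ["node_modules/**"],
};

// Turn "a, b,c" into ["a", "b", "c"]; empty or missing input means "not set".
function splitList(value: string | undefined): string[] | undefined {
  if (value === undefined || value.trim() === "") return undefined;
  return value
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Hypothetical environment variable names.
function fromEnv(env: Record<string, string | undefined>): Overrides {
  return {
    include: splitList(env["CODECRAWL_INCLUDE"]),
    exclude: splitList(env["CODECRAWL_EXCLUDE"]),
  };
}

// Hypothetical flags: --include <list> and --exclude <list>.
function fromArgs(argv: string[]): Overrides {
  const out: Overrides = { include: undefined, exclude: undefined };
  for (let i = 0; i < argv.length - 1; i++) {
    if (argv[i] === "--include") out.include = splitList(argv[i + 1]);
    if (argv[i] === "--exclude") out.exclude = splitList(argv[i + 1]);
  }
  return out;
}

// Assumed precedence: command line, then environment, then defaults.
function mergeConfig(
  env: Record<string, string | undefined>,
  argv: string[],
): CrawlConfig {
  const e = fromEnv(env);
  const a = fromArgs(argv);
  return {
    include: a.include ?? e.include ?? defaults.include,
    exclude: a.exclude ?? e.exclude ?? defaults.exclude,
  };
}
```

Resolving each field independently means a flag can override one setting while the environment still supplies the other.
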
- Language Parsing:
- Supported languages are parsed using tree-sitter integrated through WebAssembly modules.
- File Mapping:
- Extensions are mapped to specific languages (e.g., `.js`, `.ts`, `.py`, `.go`).
- The file `apps/api/src/core/treeSitter/ext2Lang.ts` handles these mappings.
- Parsing Strategies:
- Each language has a dedicated parse strategy implementing the common interface.
- Examples:
- `TypescriptStrategy.ts` for TypeScript files.
- `PythonParseStrategy.ts` for Python files.
- A default strategy in `DefaultParseStrategy.ts` for unsupported file types.
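
The extension lookup and strategy selection described above could be sketched roughly as follows. The `ParseStrategy` interface, the function names, and the line-based regular expressions are hypothetical stand-ins. The real strategies work on tree-sitter parse results, and the actual contents of `ext2Lang.ts` are not shown in this document.

```typescript
// Assumed extension-to-language table, following the examples listed above.
const ext2Lang: Record<string, string> = {
  js: "javascript",
  ts: "typescript",
  py: "python",
  go: "go",
};

// Hypothetical common interface that each strategy implements.
interface ParseStrategy {
  name: string;
  parse(source: string): string[];
}

// Simplified stand-in: keep declaration lines instead of tree-sitter captures.
const typescriptStrategy: ParseStrategy = {
  name: "typescript",
  parse: (source) =>
    source
      .split("\n")
      .filter((l) => /^\s*(export\s+)?(function|class|interface)\s/.test(l))
      .map((l) => l.trim()),
};

const pythonStrategy: ParseStrategy = {
  name: "python",
  parse: (source) =>
    source
      .split("\n")
      .filter((l) => /^\s*(def|class)\s/.test(l))
      .map((l) => l.trim()),
};

// Fallback for unsupported file types: return the file as a single segment.
const defaultStrategy: ParseStrategy = {
  name: "default",
  parse: (source) => [source],
};

const strategies = new Map<string, ParseStrategy>([
  ["typescript", typescriptStrategy],
  ["python", pythonStrategy],
]);

// Map a file path to a language name, or undefined if the extension is unknown.
function languageFor(filePath: string): string | undefined {
  const dot = filePath.lastIndexOf(".");
  if (dot === -1) return undefined;
  const ext = filePath.slice(dot + 1).toLowerCase();
  return Object.prototype.hasOwnProperty.call(ext2Lang, ext)
    ? ext2Lang[ext]
    : undefined;
}

// Pick the dedicated strategy for a file, falling back to the default one.
function strategyFor(filePath: string): ParseStrategy {
  const lang = languageFor(filePath);
  if (lang === undefined) return defaultStrategy;
  return strategies.get(lang) ?? defaultStrategy;
}
```

With this layout, supporting another language means adding one table entry and one strategy object, without touching the selection logic.
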
- Packaging Output:
- Processed code segments are merged into a final output file.
- Supported formats include Markdown, XML, and plain text.
- Specific styling is applied depending on the output style, managed by files in `apps/api/src/core/output/outputStyles/`.
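
As an illustration of style-dependent packaging, the sketch below renders the same segments as Markdown, XML, or plain text. The `Segment` shape, the function names, and the exact layout of each format are assumptions; the real templates live in the `outputStyles` directory and may look quite different.

```typescript
type OutputStyle = "markdown" | "xml" | "plain";

// Hypothetical shape of one processed code segment.
interface Segment {
  path: string;
  code: string;
}

// Three backticks, built at runtime so this example stays easy to embed.
const FENCE = "`".repeat(3);

// Escape the characters that are unsafe in XML text and attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// One renderer per output style (assumed layouts).
const renderers: Record<OutputStyle, (seg: Segment) => string> = {
  markdown: (s) => `## ${s.path}\n\n${FENCE}\n${s.code}\n${FENCE}`,
  xml: (s) =>
    `<file path="${escapeXml(s.path)}">\n${escapeXml(s.code)}\n</file>`,
  plain: (s) => `==== ${s.path} ====\n${s.code}`,
};

// Merge all segments into one output document in the requested style.
function packageOutput(segments: Segment[], style: OutputStyle): string {
  const body = segments.map(renderers[style]).join("\n\n");
  return style === "xml" ? `<files>\n${body}\n</files>` : body;
}
```

Keeping each style behind the same function signature lets the merge step stay identical regardless of the format the user selects.
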
- Mechanism: Languages are loaded dynamically using WebAssembly modules.
- File Involved: `apps/api/src/core/treeSitter/loadLanguage.ts`.
- Benefit: Allows for high-performance parsing and makes it possible to add new languages quickly.
- Queries:
- Each supported language has associated query files defining syntax patterns.
- Located in `apps/api/src/core/treeSitter/queries/`.
- Strategy Mapping:
- Mapping files (`ext2Lang.ts` and `lang2Query.ts`) determine the parsing logic based on file type and language.
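
Dynamic loading usually benefits from a cache so that each grammar is fetched and instantiated only once. The sketch below shows that pattern in isolation. The `loader` argument stands in for whatever call actually loads a language's WebAssembly module; `createLanguageCache` is a hypothetical name, and nothing here reflects the real contents of `loadLanguage.ts`.

```typescript
// A loader resolves a language name to a loaded grammar of some type T.
type Loader<T> = (lang: string) => Promise<T>;

// Wrap a loader so that each language is loaded at most once.
function createLanguageCache<T>(loader: Loader<T>): Loader<T> {
  const cache = new Map<string, Promise<T>>();
  return (lang) => {
    let pending = cache.get(lang);
    if (pending === undefined) {
      // Cache the promise itself so concurrent callers share one load.
      pending = loader(lang).catch((err: unknown) => {
        // Forget failed loads so that a later call can retry.
        cache.delete(lang);
        throw err;
      });
      cache.set(lang, pending);
    }
    return pending;
  };
}
```

Storing the promise rather than the resolved value means that two files of the same language parsed at the same moment trigger a single load.
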
- Usage:
- For heavy or long-running tasks (e.g., processing large repositories), tasks are offloaded to background workers.
- Service Files:
- Worker logic is implemented in dedicated files (`apps/api/src/services/queue-worker.js`).
- Purpose:
- Ensures reliability and prevents request timeouts by decoupling the heavy processing logic from the main API response cycle.
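
The decoupling described above can be sketched with a minimal in-memory queue: the request path only enqueues a job and returns its ID, while a separate worker step does the processing. The `InMemoryQueue` class and its method names are hypothetical. A real deployment would likely use a persistent queue and a worker in a separate process, and the handler would be asynchronous; it is synchronous here only to keep the sketch short.

```typescript
type JobStatus = "queued" | "done" | "failed";

interface Job {
  id: number;
  repo: string;
  status: JobStatus;
  result?: string;
}

class InMemoryQueue {
  private jobs = new Map<number, Job>();
  private pending: number[] = [];
  private nextId = 1;

  // Called from the API request path: cheap, returns immediately.
  enqueue(repo: string): number {
    const id = this.nextId++;
    this.jobs.set(id, { id, repo, status: "queued" });
    this.pending.push(id);
    return id;
  }

  // Lets a client poll for the state of a job.
  status(id: number): Job | undefined {
    return this.jobs.get(id);
  }

  // Called from the worker: process one job, return false if none is waiting.
  processNext(handler: (repo: string) => string): boolean {
    const id = this.pending.shift();
    if (id === undefined) return false;
    const job = this.jobs.get(id);
    if (job === undefined) return false;
    try {
      job.result = handler(job.repo);
      job.status = "done";
    } catch (err) {
      job.result = (err as Error).message;
      job.status = "failed";
    }
    return true;
  }
}
```

Because `enqueue` does no heavy work, the API can respond before processing starts, and a failing job is recorded as `failed` instead of breaking the request that created it.
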