Download and Compile a Full Codebase Dump Report for any GitHub repo or folder.
This script compiles a codebase into a single Markdown report that includes each file, and it counts the number of tokens in the codebase. The output is formatted to be dropped straight into your favorite long-context LLM for analysis.
It's a great way to explore and understand a codebase, and it has inadvertently become a handy tool for archiving our own codebase and other open source projects that are not on GitHub. It also helps you gauge the complexity of a codebase and identify potential security risks.
More planned features coming soon! Feel free to contribute to the project or contact us to work with us!
David Tapang / david@crewbrain.ai / CREWBRAIN.AI
- Scans and analyzes entire codebases, including local directories and remote GitHub repositories.
- Supports multiple file types including Python (.py), TypeScript (.ts), JavaScript (.js), Markdown (.md), and plain text (.txt).
- Configurable file type support allows easy addition or removal of supported languages.
- Counts lines and characters for each file.
- Calculates token count for each file using the tiktoken library, compatible with OpenAI's tokenization.
- Provides a comprehensive overview of the codebase structure and complexity.
- Generates detailed reports in Markdown format, perfect for integration with LLMs or documentation systems.
- Customizable report types allow focusing on specific file types or combinations.
- Includes a directory structure overview in the report for easy navigation of the codebase.
- Directly clone or update GitHub repositories for analysis.
- Handles both public and private repositories (with proper authentication).
- Displays download progress with a rich, interactive console interface.
- Customizable include/exclude folders for targeted analysis.
- Adjustable maximum file size limit to handle large codebases efficiently.
- Configurable maximum token count per file to manage analysis scope.
- Attempts multiple encodings (utf-8, latin-1, ascii) to read files, enhancing compatibility.
- Skips files exceeding size or token limits to prevent analysis bottlenecks.
- Utilizes the `rich` library for a colorful, interactive console experience.
- Provides progress bars and spinners for long-running operations like file analysis and report generation.
- Comprehensive logging system captures info, warnings, and errors.
- Logs are saved to a specified file for easy troubleshooting and auditing.
- Includes a built-in Flask server to expose analysis functionality via API.
- Allows for remote triggering of codebase analysis and report generation.
- Checks for required packages and attempts to install them if missing.
- Ensures all necessary dependencies are available before running the analysis.
- Automatically extracts and includes the content of README.md files in the generated report.
- Provides immediate context and project overview at the beginning of each report.
- Calculates and reports total token count for the entire codebase.
- Useful for estimating LLM processing costs or complexity metrics.
- Generates statistics on the number of files for each supported file type.
- Offers a quick overview of the codebase composition.
- Allows creation, modification, and deletion of report types through the settings menu.
- Enables tailored analysis for different project needs or language focuses.
- Saves user-defined settings to a YAML configuration file.
- Allows for consistent analysis parameters across multiple runs.
- Robust error handling for file reading, Git operations, and API requests.
- Continues analysis even if individual files or operations fail, ensuring maximum data collection.
- Provides an interactive menu for adjusting all configurable options.
- Allows real-time customization of the analysis process without editing configuration files directly.
- Implements generator-based file scanning to efficiently handle large directory structures.
- Manages memory usage effectively even for extensive codebases.
- Stores downloaded repositories and generated reports in configurable local directories.
- Ensures data privacy and allows for offline analysis of previously downloaded codebases.
- Designed with modularity in mind, allowing easy addition of new analysis types or report formats.
- Structured to facilitate future enhancements and integrations with other tools or services.
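As a rough illustration of the scanning features listed above (generator-based traversal, include/exclude folders, a maximum file size, and fallback encodings), here is a minimal sketch. It is not the tool's actual implementation: the names `scan_files` and `read_text`, the excluded folder names, and the size limit are all assumptions chosen for the example.

```python
import os
from pathlib import Path
from typing import Iterator, Optional

# Hypothetical defaults for illustration; the real values come from the tool's settings.
SUPPORTED_EXTENSIONS = {".py", ".ts", ".js", ".md", ".txt"}
EXCLUDED_FOLDERS = {".git", "node_modules", "__pycache__"}
MAX_FILE_SIZE = 1_000_000  # bytes (assumed limit)
ENCODINGS = ("utf-8", "latin-1", "ascii")  # order taken from the feature list


def scan_files(root: str) -> Iterator[Path]:
    """Lazily yield supported files under root, one at a time."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded folders in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if d not in EXCLUDED_FOLDERS)
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
                continue  # unsupported file type
            if path.stat().st_size > MAX_FILE_SIZE:
                continue  # skip oversized files instead of failing
            yield path


def read_text(path: Path) -> Optional[str]:
    """Try each encoding in turn; return None if the file cannot be read."""
    for encoding in ENCODINGS:
        try:
            return path.read_text(encoding=encoding)
        except UnicodeDecodeError:
            continue  # try the next encoding
        except OSError:
            return None  # unreadable file: let the caller move on
    return None
```

Because `scan_files` is a generator, the caller can process one file at a time without building a full file list in memory. Note that `latin-1` can decode any byte sequence, so with this ordering the `ascii` fallback is only reached if the list is reordered.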
- Clone this repository:
  `git clone https://github.com/your-username/crewbrain-code-compiler.git`
  `cd crewbrain-code-compiler`
- Install Git on your system:
  - For Windows: Download and install from https://git-scm.com/download/win
  - For macOS: Use Homebrew with `brew install git` or download from https://git-scm.com/download/mac
  - For Linux: Use your distribution's package manager, e.g., `sudo apt-get install git` for Ubuntu/Debian
- Install requirements:
  `pip install -r requirements.txt`
The CREWBRAIN Code Compiler offers multiple ways to analyze and report on codebases:
- Run the script:
  `python Code_Compiler.py`
- You'll be presented with a menu:
  - Choose a report type (e.g., "All Supported File Types" or "Python, TypeScript, JavaScript")
  - Access settings
  - Start the API server
  - Exit the program
- If you choose a report type:
  - Enter the path to a local directory or a GitHub repository URL
  - The script will analyze the codebase and generate a report
- Start the API server from the main menu or run:
  `flask run`
- Send a POST request to `http://localhost:5000/api/analyze` with a JSON payload:
  `{ "source": "/path/to/local/directory or https://github.com/user/repo", "report_type": 1 }`
- The server will respond with a success message or error details
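The API call described above can be sketched in Python with only the standard library. The endpoint and the `source` / `report_type` fields come from the description above; the helper names `build_analyze_request` and `analyze`, and the timeout value, are assumptions made for this example.

```python
import json
import urllib.request

# Endpoint as described above; adjust host/port if your server differs.
API_URL = "http://localhost:5000/api/analyze"


def build_analyze_request(source: str, report_type: int = 1) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = json.dumps({"source": source, "report_type": report_type}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(source: str, report_type: int = 1, timeout: float = 600.0) -> str:
    """Send the request and return the raw response body as text."""
    request = build_analyze_request(source, report_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, `analyze("https://github.com/user/repo", 1)` would return whatever body the server sends back. A non-2xx response raises `urllib.error.HTTPError`, which the caller can catch to read the error details.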
Access the settings menu to customize:
- Supported file types
- Included/excluded folders
- Maximum file size and token count
- Report types
- Output directories
- Reports are saved in the configured results directory (default: `output/`)
- Each report is a Markdown file named `{directory_name}_report_{report_type}.md`
- Reports include:
  - README content (if available)
  - Codebase overview (total tokens, file type counts)
  - Directory structure
  - Detailed file contents
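To make the report layout above more concrete, here is a small sketch of how such a file name and report body could be assembled. The section order follows the list above, but the heading text, the function names `report_path` and `build_report`, and the way `report_type` appears in the file name are assumptions, not the tool's actual output format.

```python
from pathlib import Path
from typing import Dict


def report_path(source_dir: str, report_type: str, results_dir: str = "output") -> Path:
    """Build {directory_name}_report_{report_type}.md inside the results directory."""
    name = Path(source_dir).name  # last path component; a trailing slash is ignored
    return Path(results_dir) / f"{name}_report_{report_type}.md"


def build_report(readme: str, total_tokens: int, file_counts: Dict[str, int],
                 tree: str, files: Dict[str, str]) -> str:
    """Assemble the report sections in the order listed above (headings are hypothetical)."""
    fence = "`" * 3  # Markdown code fence, built here to keep this example readable
    lines = []
    if readme:  # README content is included only if available
        lines += ["# README", "", readme.strip(), ""]
    lines += ["# Codebase Overview", "", f"Total tokens: {total_tokens}"]
    lines += [f"- {ext}: {count} file(s)" for ext, count in sorted(file_counts.items())]
    lines += ["", "# Directory Structure", "", fence, tree.strip(), fence, ""]
    lines += ["# File Contents", ""]
    for rel_path, content in files.items():
        lines += [f"## {rel_path}", "", fence, content.rstrip("\n"), fence, ""]
    return "\n".join(lines)
```

For example, `report_path("/path/to/my_project/", "1")` gives `output/my_project_report_1.md` under these assumptions.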
- Directly analyze GitHub repositories by providing the URL
- The script will clone or update the repository before analysis
- Logs are saved in the `logs/` directory for debugging and auditing
- For large codebases, consider using more specific report types or adjusting the max file size/token count. The code will automatically chunk and separate the files.
- Regularly check for updates to ensure you have the latest features and optimizations
- Consolidate and compile any folder or repo into a single Markdown file.
- Comprehensive codebase understanding and analysis using an LLM.
This project is licensed under the MIT License - see the LICENSE file for details.
We appreciate your interest in Code Compiler! Here are several ways you can support the project and get assistance:
Your support helps us continue developing and maintaining this tool. Consider sponsoring the project, or get involved with the community:
- Open an issue for bug reports or feature requests
- Start a discussion to ask questions or share ideas
Need expert help with your AI and project development needs? We're here to assist you:
- AI Integration: Let us help you integrate AI into your existing projects or develop new AI-powered solutions.
- Custom Development: We can tailor Code Compiler to your specific needs or create custom tools for your workflow.
- Consulting: Get expert advice on code analysis, AI implementation, and software architecture.
Contact us at david@crewbrain.ai for professional services or to discuss your project requirements.
Help us grow by sharing the Code Compiler with your network:
- Star the repository on GitHub
- Share your experience on social media (Twitter, LinkedIn, etc.)
- Write a blog post about how you use the tool in your workflow
Your support, whether through sponsorship, community involvement, or spreading the word, is crucial for the continued development and improvement of this project. Thank you for being part of our community!
9-4-2024 - Fixed token counting error for special tokens
- Modified the `count_tokens` method in the `CodeCompiler` class
- Added the `disallowed_special=()` parameter to the tiktoken encoding call
- Allows all special tokens (including '<|endoftext|>') to be encoded as normal text
- Resolves error when processing files containing special token strings
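The fix described in this changelog entry can be illustrated with a short sketch. This is not the actual `count_tokens` method from the `CodeCompiler` class: the standalone function signature, the `load_encoding` helper, and the `cl100k_base` encoding name are assumptions. The `disallowed_special=()` argument is the part taken from the entry above.

```python
def load_encoding(name: str = "cl100k_base"):
    """Load a tiktoken encoding (the encoding name here is an assumed default)."""
    import tiktoken  # third-party dependency, imported lazily
    return tiktoken.get_encoding(name)


def count_tokens(text: str, encoding) -> int:
    """Count tokens, treating special-token strings as ordinary text.

    By default tiktoken raises ValueError when the input contains a special
    token string such as '<|endoftext|>'. Passing disallowed_special=()
    disables that check, so the string is encoded like any other text.
    """
    return len(encoding.encode(text, disallowed_special=()))
```

A typical call would be `count_tokens(source_text, load_encoding())`, which no longer fails on files that happen to contain special token strings.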