For extensive documentation, including detailed API guides, GUI walkthroughs, plugin development tutorials, and CLI references, visit our Compileo Documentation Site.
Compileo is a modular platform designed to process, analyze, structure, and generate data. Whether you're processing 1,000-page PDFs, scraping JavaScript-heavy websites, or engineering datasets for LLM fine-tuning and personal study, Compileo provides a unified, AI-driven lifecycle for the modern data era.
Compileo isn't just a parser—it's a comprehensive data engineering ecosystem. It automates the complex journey from "Messy Source" to Validated Structured Data.
Imagine you have several thick textbooks and want to create a specialized dataset focused on disease treatments and symptoms. Compileo can:
- Ingest all books simultaneously (PDF, DOCX, etc.).
- Discover unified "Treatment" and "Symptom" taxonomies across all sources automatically.
- Extract every mention of treatments, dosages, clinical manifestations, and more, mapping each mention to its associated taxonomy node for each disease.
- Consolidate this into one unified, high-quality Q&A dataset for training or study.
Compileo is designed for every workflow:
- Web GUI: A user-friendly Streamlit interface, including a 7-step guided wizard.
- REST API: Seamlessly integrate data processing, chunking, NER, extraction, and dataset generation into your own applications.
- CLI: Automate heavy-duty processing with powerful command-line parameters.
- Massive PDF Autonomy: Automatically splits 1,000+ page documents into manageable segments, using intelligent context-aware chunking to stay within LLM token limits while preserving context.
- Two-Pass VLM Parsing: Employs a "Skim and Extract" methodology using Vision Language Models (Grok, Gemini, ChatGPT, Huggingface, Ollama) to first understand document layout and then extract high-fidelity Markdown.
- AI-Assisted Chunking: Want to chunk your book by chapter even though every chapter has a different title? You can instruct the AI how to split ("Each chapter starts with...; split the document at these points") or ask it for advice ("This is the pattern that starts each chapter. Which splitting strategy should I use?"). Compileo's AI analyzes your documents, pattern examples, and splitting goal to recommend the optimal Semantic, Character, Token, Delimiter, or Schema-based chunking strategy.
- AI-Assisted Taxonomy: Don't waste weeks defining categories. Compileo can build hierarchical knowledge trees automatically based on your goals. You can also manually define some categories and let Compileo extend them.
- Multi-Stage Extraction: Performs Hierarchical Classification, moving from coarse-grained categories to fine-grained entities based on your custom or generated taxonomy.
- Context-Aware NER or Full Text: Uses parent context during extraction to disambiguate entities and discover deep relationships between concepts, then extracts named entities or full text.
- Two-Step AI Confidence Scoring: Every extracted entity and relationship is assigned an AI confidence level (0.0–1.0), allowing you to filter for only the most reliable data (see the sketch after this list).
- Deep Quality Metrics: Automated scoring for Lexical Diversity, Demographic Bias, Answer Coherence, and Target Audience Alignment via the `datasetqual` module.
- Fine-Tuned Model Testing: Use the `benchmarking` module to evaluate how your fine-tuned models perform on your custom datasets using industry-standard metrics (Accuracy, F1, BLEU, ROUGE) and suites like GLUE or MMLU.
- Robust Plugin System: Effortlessly extend Compileo by adding custom plugins via a simple `.zip` package architecture.
- Custom Exports: Out-of-the-box support for Scrapy URL scraping and Anki flashcard export, allowing you to turn any technical document into a high-quality study deck.
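To illustrate how the confidence scores from Two-Step AI Confidence Scoring might be consumed downstream, here is a minimal sketch in Python. The record shape (`entity`, `label`, `confidence` keys) and the threshold are assumptions for illustration, not Compileo's actual export schema:

```python
# Minimal sketch of confidence-based filtering. The record shape here is
# hypothetical -- adapt the key names to whatever your Compileo export
# actually contains. Scores range from 0.0 to 1.0 as described above.
extracted = [
    {"entity": "metformin", "label": "Treatment", "confidence": 0.94},
    {"entity": "fatigue",   "label": "Symptom",   "confidence": 0.41},
]

THRESHOLD = 0.8  # keep only the most reliable extractions
reliable = [rec for rec in extracted if rec["confidence"] >= THRESHOLD]
print(reliable)  # -> only the "metformin" record survives
```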
System requirements:
- CPU: 4-core processor minimum (8-core recommended).
- RAM: 8GB minimum (16GB recommended for heavy processing).
- GPU (Optional): NVIDIA GPU with 8GB+ VRAM. Required for HuggingFace local inference and advanced system performance monitoring.
- Storage: 25GB free disk space.
- Operating System: Linux, macOS, or Windows.
Docker Compose is the fastest way to deploy the full stack (API, GUI, and Redis).
- Clone & Prepare:
```bash
git clone https://github.com/SunPCSolutions/Compileo.git
cd compileo
cp .env.example .env  # Configure COMPILEO_API_KEYS (optional - can also be set via GUI after deployment)
```
- Launch:
```bash
docker compose up --build -d
```
- Access:
  - Web GUI: http://localhost:8501
  - API Docs: http://localhost:8000/docs
Compileo implements an "Auto-Lock" security model designed for zero-config startup without sacrificing security.
- Unsecured Mode (Default): If no API keys are defined, Compileo allows all requests. This is ideal for first-time setup and local experimentation.
- Secured Mode: As soon as you define an API key, the system "locks" and strictly requires that key for all API and GUI operations.
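A minimal way to observe the Auto-Lock behavior is to probe the API without a key, as in the sketch below. The port comes from the Quick Start above; the exact rejection status code in Secured Mode is an assumption (a 401/403-style error), and whether the `/docs` route itself is gated may vary, so substitute any API route:

```python
import urllib.error
import urllib.request

# Probe the API without an X-API-Key header. In Unsecured Mode (no keys
# defined) the request should be accepted; once a key is defined and the
# system "locks", it should be rejected (assumed 401/403-style error).
try:
    with urllib.request.urlopen("http://localhost:8000/docs", timeout=5) as resp:
        print(f"Unsecured Mode: request accepted (HTTP {resp.status})")
except urllib.error.HTTPError as err:
    print(f"Secured Mode: request rejected (HTTP {err.code})")
```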
API keys can be set in three ways:
- GUI (Recommended): Launch Compileo, go to Settings > 🔗 API Configuration, enter one or more API Keys, and click Save. The system locks instantly.
- CLI: Start the API with the `--api-key` flag (Note: set API keys via GUI Settings after startup):
```bash
uvicorn src.compileo.api.main:app --host 0.0.0.0 --port 8000
```
- Environment: Define `COMPILEO_API_KEY=your_secret_key` in your `.env` or Docker configuration.
All API requests must include the following header:
```
X-API-Key: your_secret_key
```
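As a sketch of what an authenticated call might look like from Python: the header name and the `COMPILEO_API_KEY` variable come from this section, while the `/api/v1/projects` path is purely hypothetical (browse http://localhost:8000/docs for the routes your deployment actually exposes):

```python
import os
import urllib.request

# Hypothetical endpoint for illustration only; consult the live API docs at
# http://localhost:8000/docs for the real routes.
API_URL = "http://localhost:8000/api/v1/projects"

req = urllib.request.Request(API_URL)
# Required on every request once Secured Mode is active.
req.add_header("X-API-Key", os.environ["COMPILEO_API_KEY"])

with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode("utf-8"))
```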
A manual installation is ideal for local development, CLI automation, or custom integrations.
Prerequisites: a running Redis server (`sudo apt install redis-server`).
- Setup Environment:
```bash
cd <github clone folder>
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
```
- Install Dependencies:
```bash
pip install -r requirements.txt
```
- Start Services:
Note: For API security, set API keys via the GUI Settings after startup.
```bash
# 1. Start the API server
uvicorn src.compileo.api.main:app --host 0.0.0.0 --port 8000

# 2. Start the Web GUI (in a new terminal)
streamlit run src/compileo/features/gui/main.py --server.port 8501 --server.address 0.0.0.0
```
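Once both processes are running, a quick reachability check like the sketch below can confirm they came up; it simply assumes the default ports used in the commands above:

```python
import urllib.request

# Ping both services on their default ports. Adjust host/ports if you
# changed them in the startup commands above.
for name, url in [("API", "http://localhost:8000/docs"),
                  ("GUI", "http://localhost:8501")]:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name} reachable at {url} (HTTP {resp.status})")
    except Exception as exc:
        print(f"{name} NOT reachable at {url}: {exc}")
```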
Apache-2.0 license