Infrastructure for optimizing and testing LLMs on edge devices using llama.cpp
🚧 Under Development - See IMPLEMENTATION_PLAN.md for detailed roadmap
Ellmo is a Python-based infrastructure project for:
- Downloading models from HuggingFace
- Converting and quantizing models for edge deployment
- Training and applying LoRA adapters
- Benchmarking model performance on CPU
- Running optimized inference on edge devices
# Setup (coming soon - Phase 0)
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Download a model (Phase 2)
ellmo download --model "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF" --variant Q4_K_M
# Run inference (Phase 3)
ellmo run --model tinyllama-Q4_K_M.gguf --prompt "Hello, world!"
# Quantize a model (Phase 5)
ellmo quantize --model tinyllama-f16.gguf --type Q4_K_M
# Benchmark performance (Phase 6)
ellmo benchmark --model tinyllama-Q4_K_M.gguf --runs 10
Planned features:
- ✅ Model downloading from HuggingFace Hub
- ✅ Model format conversion (PyTorch → GGUF)
- ✅ Multiple quantization levels (Q2_K → Q8_0)
- ✅ LoRA training and inference
- ✅ CPU-optimized inference engine
- ✅ Comprehensive benchmarking suite
- ✅ Model registry and management
- ✅ OpenAI-compatible API server
- ✅ Configuration presets for common scenarios
- ✅ Visual benchmark reports
- ✅ Device-specific optimization profiles
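The benchmarking feature and the `--runs` option shown earlier suggest repeated, timed inference passes. A minimal sketch of such a harness follows, using only the standard library. The names `benchmark`, `BenchmarkResult`, and `fake_generate` are hypothetical and not part of Ellmo; a real run would replace `fake_generate` with a call into the inference engine.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkResult:
    runs: int
    mean_tokens_per_sec: float
    stdev_tokens_per_sec: float


def benchmark(generate: Callable[[], int], runs: int = 10, warmup: int = 1) -> BenchmarkResult:
    """Time `generate` over several runs and summarize throughput.

    `generate` is a hypothetical callable that performs one inference
    pass and returns the number of tokens it produced.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")

    # Warm-up passes run but are not measured (caches, lazy loading).
    for _ in range(warmup):
        generate()

    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate()
        # Guard against a zero interval on very fast callables.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(tokens / elapsed)

    # Sample standard deviation needs at least two measurements.
    stdev = statistics.stdev(rates) if runs > 1 else 0.0
    return BenchmarkResult(runs, statistics.mean(rates), stdev)


def fake_generate() -> int:
    """Stand-in for a real model call: sleeps briefly, reports 16 tokens."""
    time.sleep(0.001)
    return 16


if __name__ == "__main__":
    result = benchmark(fake_generate, runs=10)
    print(
        f"{result.mean_tokens_per_sec:.1f} tok/s "
        f"± {result.stdev_tokens_per_sec:.1f} over {result.runs} runs"
    )
```

Reporting the spread alongside the mean matters on edge devices, where thermal throttling and background load can make individual runs vary noticeably.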
- Language: Python 3.10+
- Inference: llama.cpp + llama-cpp-python
- Models: HuggingFace Hub
- Optimization: Quantization, LoRA/PEFT
- Target: CPU (edge devices)
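To make the quantization entry above concrete, here is a rough pure-Python sketch of block-wise 8-bit quantization in the style of Q8_0: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. This is an illustration only, not llama.cpp's implementation and not Ellmo code. The function names are hypothetical, and the K-quant formats (such as Q4_K_M) use more elaborate schemes that this sketch does not cover.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # Q8_0-style formats group weights into blocks of 32


def quantize_block(values: List[float]) -> Tuple[float, List[int]]:
    """Quantize one block to signed 8-bit integers with a per-block scale."""
    amax = max(abs(v) for v in values)
    # An all-zero block gets a dummy scale of 1.0 to avoid division by zero.
    scale = amax / 127.0 if amax > 0 else 1.0
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale: float, quants: List[int]) -> List[float]:
    """Recover approximate float values from a quantized block."""
    return [scale * q for q in quants]


def quantize(values: List[float]) -> List[Tuple[float, List[int]]]:
    """Split a flat weight list into blocks and quantize each one."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError(f"length must be a multiple of {BLOCK_SIZE}")
    return [
        quantize_block(values[i:i + BLOCK_SIZE])
        for i in range(0, len(values), BLOCK_SIZE)
    ]


if __name__ == "__main__":
    weights = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
    scale, quants = quantize_block(weights)
    restored = dequantize_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"scale={scale:.6f}, worst-case error={worst:.6f}")
```

Because each value is rounded to the nearest multiple of the block's scale, the reconstruction error per weight is bounded by half the scale. Lower-bit formats trade a larger error bound for a smaller file.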
ellmo/
├── src/ # Source code
│ ├── models/ # Model downloading and conversion
│ ├── optimization/ # Quantization and LoRA
│ ├── inference/ # Inference engine
│ ├── benchmark/ # Performance testing
│ └── cli/ # Command-line interface
├── configs/ # Configuration files
├── models/ # Downloaded and optimized models
├── tests/ # Test suite
└── docs/ # Documentation
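As an illustration of how the `models/` directory could back the planned model registry, the sketch below scans a directory for GGUF files and parses names that follow the `<name>-<variant>.gguf` pattern used in the commands above (for example `tinyllama-Q4_K_M.gguf`). Everything here is an assumption rather than Ellmo's actual design: `ModelEntry`, `parse_model_file`, and `scan_models` are hypothetical names. The only external fact relied on is that GGUF files begin with the four ASCII bytes `GGUF`.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional

GGUF_MAGIC = b"GGUF"  # GGUF files begin with these four bytes

# Matches names such as "tinyllama-Q4_K_M.gguf" or "tinyllama-f16.gguf".
_NAME_PATTERN = re.compile(r"^(?P<name>.+)-(?P<variant>[A-Za-z0-9_]+)\.gguf$")


@dataclass
class ModelEntry:
    name: str
    variant: str
    path: Path
    size_bytes: int


def parse_model_file(path: Path) -> Optional[ModelEntry]:
    """Return a registry entry, or None if the file does not look like GGUF."""
    match = _NAME_PATTERN.match(path.name)
    if match is None:
        return None
    # Check the magic bytes so renamed or truncated files are skipped.
    with path.open("rb") as handle:
        if handle.read(4) != GGUF_MAGIC:
            return None
    return ModelEntry(match["name"], match["variant"], path, path.stat().st_size)


def scan_models(models_dir: Path) -> List[ModelEntry]:
    """Collect entries for every valid GGUF file directly inside models_dir."""
    entries = []
    for path in sorted(models_dir.glob("*.gguf")):
        entry = parse_model_file(path)
        if entry is not None:
            entries.append(entry)
    return entries


if __name__ == "__main__":
    for entry in scan_models(Path("models")):
        print(f"{entry.name:<20} {entry.variant:<8} {entry.size_bytes} bytes")
```

Deriving the registry from the files on disk keeps it from drifting out of sync with the directory, at the cost of relying on a strict naming convention.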
See IMPLEMENTATION_PLAN.md for the detailed multi-phase implementation plan with testing checkpoints.
Current phase: Phase 0 (Project Bootstrap), setting up basic infrastructure.
This project is currently in early development. Contribution guidelines will be added once the core infrastructure is complete.
- Implementation Plan - Detailed 14-phase roadmap with testing strategy
- Testing Strategy - Comprehensive testing approach and protocols
All major design decisions are documented with full context, rationale, and alternatives:
- ADR Index - Complete list of all decisions
- Decision Making Guide - How we make and document decisions
- 6 ADRs documenting Phase 0 decisions (~1,500 lines of documentation)
Key Decisions:
- ADR-001: Python as Primary Language
- ADR-003: Click CLI Framework
- ADR-005: Project Structure
- ADR-006: YAML Configuration
MIT License - see LICENSE file for details
Copyright (c) 2025 Fadi Labib
- llama.cpp - High-performance LLM inference
- HuggingFace - Model hub and tools
- PEFT - Parameter-efficient fine-tuning