Ellmo - llama.cpp Edge Infrastructure

Infrastructure for optimizing and testing LLMs on edge devices using llama.cpp

Project Status

🚧 Under Development - See IMPLEMENTATION_PLAN.md for the detailed roadmap

Overview

Ellmo is a Python-based infrastructure project for:

  • Downloading models from HuggingFace
  • Converting and quantizing models for edge deployment
  • Training and applying LoRA adapters
  • Benchmarking model performance on CPU
  • Running optimized inference on edge devices

Quick Start

# Setup (coming soon - Phase 0)
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Download a model (Phase 2)
ellmo download --model "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF" --variant Q4_K_M

# Run inference (Phase 3)
ellmo run --model tinyllama-Q4_K_M.gguf --prompt "Hello, world!"

# Quantize a model (Phase 5)
ellmo quantize --model tinyllama-f16.gguf --type Q4_K_M

# Benchmark performance (Phase 6)
ellmo benchmark --model tinyllama-Q4_K_M.gguf --runs 10
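
The commands above describe a planned interface. As a minimal sketch of how such a CLI could be wired up with Python's standard `argparse` module: the `build_parser` function, the defaults for `--variant` and `--runs`, and the help strings are assumptions made for illustration, not the project's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the planned `ellmo` subcommands (hypothetical)."""
    parser = argparse.ArgumentParser(prog="ellmo")
    sub = parser.add_subparsers(dest="command", required=True)

    # ellmo download --model <repo id> --variant <quantization>
    download = sub.add_parser("download", help="Fetch a model from HuggingFace")
    download.add_argument("--model", required=True, help="HuggingFace repo id")
    download.add_argument("--variant", default="Q4_K_M")  # default is an assumption

    # ellmo run --model <file.gguf> --prompt <text>
    run = sub.add_parser("run", help="Run inference on a GGUF file")
    run.add_argument("--model", required=True)
    run.add_argument("--prompt", required=True)

    # ellmo quantize --model <file.gguf> --type <quantization>
    quantize = sub.add_parser("quantize", help="Quantize a GGUF model")
    quantize.add_argument("--model", required=True)
    quantize.add_argument("--type", required=True, dest="quant_type")

    # ellmo benchmark --model <file.gguf> --runs <n>
    benchmark = sub.add_parser("benchmark", help="Measure inference speed")
    benchmark.add_argument("--model", required=True)
    benchmark.add_argument("--runs", type=int, default=10)  # default is an assumption

    return parser


# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(
    ["run", "--model", "tinyllama-Q4_K_M.gguf", "--prompt", "Hello, world!"]
)
print(args.command, args.model, args.prompt)
```

Each subcommand keeps its own flags, so later phases could add options to one command without touching the others.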

Features (Planned)

Core Capabilities

  • ✅ Model downloading from HuggingFace Hub
  • ✅ Model format conversion (PyTorch → GGUF)
  • ✅ Multiple quantization levels (Q2_K → Q8_0)
  • ✅ LoRA training and inference
  • ✅ CPU-optimized inference engine
  • ✅ Comprehensive benchmarking suite
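
To make the benchmarking item concrete, here is a minimal sketch of a timing harness that uses only the standard library. The names `BenchmarkResult`, `benchmark`, and `fake_generate` are hypothetical, and the stand-in generator only sleeps. A real suite would call the inference engine and would likely also track memory use and time to first token.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkResult:
    runs: int
    mean_seconds: float
    stdev_seconds: float
    tokens_per_second: float


def benchmark(generate: Callable[[], int], runs: int = 10, warmup: int = 1) -> BenchmarkResult:
    """Time `generate` over `runs` calls; `generate` returns the number of tokens produced."""
    if runs < 1:
        raise ValueError("runs must be at least 1")

    # Warm-up calls are executed but not timed (caches, lazy initialization).
    for _ in range(warmup):
        generate()

    durations = []
    total_tokens = 0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += generate()
        durations.append(time.perf_counter() - start)

    total_time = sum(durations)
    return BenchmarkResult(
        runs=runs,
        mean_seconds=statistics.mean(durations),
        # Sample standard deviation needs at least two measurements.
        stdev_seconds=statistics.stdev(durations) if runs > 1 else 0.0,
        tokens_per_second=total_tokens / total_time if total_time > 0 else 0.0,
    )


def fake_generate() -> int:
    """Stand-in for a model call: sleeps briefly and reports 16 generated tokens."""
    time.sleep(0.001)
    return 16


result = benchmark(fake_generate, runs=5)
print(result)
```

Passing the generator as a callable keeps the harness independent of any particular inference backend.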

Advanced Features

  • ✅ Model registry and management
  • ✅ OpenAI-compatible API server
  • ✅ Configuration presets for common scenarios
  • ✅ Visual benchmark reports
  • ✅ Device-specific optimization profiles
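
As one possible reading of "configuration presets" and "device-specific optimization profiles", the sketch below picks inference settings from the number of CPU cores. The `InferencePreset` class, the `preset_for_device` function, and every numeric value are assumptions chosen for illustration, not tuned recommendations. The field names follow the `n_ctx`, `n_threads`, and `n_batch` arguments of llama-cpp-python's `Llama` constructor.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InferencePreset:
    name: str
    n_ctx: int      # context window size in tokens
    n_threads: int  # CPU threads used for generation
    n_batch: int    # prompt tokens processed per batch


def preset_for_device(cpu_count: Optional[int] = None, low_memory: bool = False) -> InferencePreset:
    """Choose a preset from the core count; the numbers are illustrative only."""
    cores = cpu_count or os.cpu_count() or 1
    # Leave one core free for the rest of the system, but never drop below one thread.
    threads = max(1, cores - 1)
    if low_memory:
        return InferencePreset("low-memory", n_ctx=512, n_threads=threads, n_batch=64)
    return InferencePreset("balanced", n_ctx=2048, n_threads=threads, n_batch=256)


print(preset_for_device(cpu_count=4))
print(preset_for_device(cpu_count=4, low_memory=True))
```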

Technology Stack

  • Language: Python 3.10+
  • Inference: llama.cpp + llama-cpp-python
  • Models: HuggingFace Hub
  • Optimization: Quantization, LoRA/PEFT
  • Target: CPU (edge devices)
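
For orientation, this is roughly what CPU inference through llama-cpp-python looks like. It is a minimal sketch, not Ellmo's inference engine: the `run_prompt` helper and its default values are assumptions, and it requires `pip install llama-cpp-python` plus a local GGUF file.

```python
from pathlib import Path


def run_prompt(model_path: str, prompt: str, max_tokens: int = 64, n_threads: int = 4) -> str:
    """Load a GGUF model on the CPU and return the completion text for `prompt`."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")

    # Imported lazily so this module can be loaded without llama-cpp-python installed.
    from llama_cpp import Llama

    llm = Llama(
        model_path=str(path),
        n_ctx=2048,           # context window; this value is an assumption
        n_threads=n_threads,  # CPU threads used for generation
        verbose=False,
    )
    result = llm(prompt, max_tokens=max_tokens)
    # Completions come back as a dict in the OpenAI-style "choices" layout.
    return result["choices"][0]["text"]
```

With a model file in place, a call such as `run_prompt("tinyllama-Q4_K_M.gguf", "Hello, world!")` would return the generated text. The file name is taken from the Quick Start examples and must exist locally.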

Project Structure

ellmo/
├── src/                  # Source code
│   ├── models/          # Model downloading and conversion
│   ├── optimization/    # Quantization and LoRA
│   ├── inference/       # Inference engine
│   ├── benchmark/       # Performance testing
│   └── cli/             # Command-line interface
├── configs/             # Configuration files
├── models/              # Downloaded and optimized models
├── tests/               # Test suite
└── docs/                # Documentation

Development

See IMPLEMENTATION_PLAN.md for the detailed multi-phase implementation plan with testing checkpoints.

Current Phase

Phase 0: Project Bootstrap - Setting up basic infrastructure

Contributing

This project is currently in early development. Contribution guidelines will be added once the core infrastructure is complete.

Documentation

Core Documentation

Architecture Decision Records (ADRs)

All major design decisions are documented with full context, rationale, and alternatives:

  • ADR Index - Complete list of all decisions
  • Decision Making Guide - How we make and document decisions
  • 6 ADRs documenting Phase 0 decisions (~1,500 lines of documentation)

Key Decisions:

License

MIT License - see LICENSE file for details

Copyright (c) 2025 Fadi Labib

Acknowledgments
