A comprehensive toolkit for profiling, benchmarking, and running Local LLMs (GGUF format) on Edge CPUs. This project is built on top of llama.cpp and provides automated scripts for model management, memory footprint analysis, and inference performance testing.
- Model Management: Easily download and manage various GGUF models ranging from tiny 15M parameter models to large 22B parameter ones using a unified configuration (
models_config.py). - CPU-Optimized Inference: Automated setup builds
llama.cpplocally with AVX2 and FMA optimizations for peak CPU performance. - Memory Profiling: Tools to measure Static RAM usage and dynamic KV-cache growth per token during inference.
- Benchmarking: Automated benchmarking scripts to evaluate decode rate, prefill rate, latency, and parallel efficiency across different thread configurations.
- Interactive Demo: A simple CLI chat application to interact with downloaded models in real-time.
Run the setup script to install necessary system dependencies, compile llama.cpp with optimization flags, and prepare the Python virtual environment:
./setup.shActivate the Python virtual environment created by the setup script:
python -m venv venv
source venv/bin/activateThe project includes a unified model registry in models_config.py. To download all configured models:
python3 download_manager.pyNote: This script uses tqdm to display download progress and saves models directly to the root directory.
To quickly test a model interactively, run the demo script. By default, it will load the tinyllama_15m model:
python3 demo_inference.pyThe primary benchmarking logic is driven by memory_test.py and prompts.py. The suite records inference statistics across different thread counts and context sizes.
Metrics collected include:
- Prefill Rate (tokens/sec)
- Decode Rate (tokens/sec)
- Total Latency
- Peak Memory (Static + KV Cache)
Results are automatically appended to report.csv for further analysis.
A simple C program (list.c) is provided to parse report.csv and extract unique model names tested:
# Compile
gcc list.c -o list_models
# Run
./list_modelsBased on extensive benchmarking across various models and thread configurations (captured in report.csv), several key patterns for Edge CPU inference have emerged:
- Observation: Performance scaling is non-linear. Moving from 1 to 4 threads typically yields a ~2x speedup in decode rate.
- Diminishing Returns: Increasing from 4 to 8 threads often shows negligible gains or even performance regression due to cache contention and synchronization overhead on many consumer/edge CPUs.
- Recommendation: For most edge devices, 4 threads provides the best balance between throughput and power efficiency.
- Ultra-Small ( < 500M params): Models like
TinyLlama (15M)andStableLM-2 Tiny (0.13B)achieve blistering speeds of 60-100+ TPS, making them ideal for real-time edge triggers or simple classification. - Small (0.5B - 2B params): The most versatile tier.
Falcon3 (1B)andQwen 2.5 (0.5B)consistently hit 20-30 TPS, providing a smooth conversational experience. - Medium (3B - 7B params): Higher intelligence models like
Gemma (7B)orPhi-3 (6B)run at 3-6 TPS. While slower, they are capable of complex reasoning within acceptable human-reading speeds.
| Model Tier | Parameter Count | Peak RAM (GGUF) |
|---|---|---|
| Tiny | 15M - 160M | 0.15GB - 0.4GB |
| Small | 0.5B - 1.5B | 0.7GB - 2.5GB |
| Medium | 3B - 7B | 3.5GB - 12GB |
- Context Overhead: KV cache growth is a critical factor for long-context applications. Some models (e.g.,
Falcon3 (1B)) exhibit significantly higher memory growth per 1,000 tokens compared toPhi-3 Mini (1.5B). - Constraint: When running on devices with < 4GB RAM, choosing models with low KV cache growth (like
TinyLlama (15M)) is essential for maintaining stability during long-running sessions.
- High Throughput:
Falcon3 (1B)outperformed many 1B competitors, achieving nearly double the TPS ofGemma (1B)in similar tests. - Resource Efficiency:
TinyLlama (15M)remains the undisputed king of efficiency with the highest "CIES Score" (Performance/Resource ratio), proving that specialized small models are highly viable for edge deployment.
setup.sh: Environment initialization, dependency installation, andllama.cppsource compilation.models_config.py: Registry containing model details (URL, filename, prompt templates).download_manager.py: Utility to fetch models defined in the registry.demo_inference.py: Minimal CLI chat interface for testing models.memory_test.py: Core profiling script to measure RAM footprint and benchmark inference speeds.prompts.py: Utility to generate synthetic prompts of specific lengths for standardized benchmarking.report.csv: Aggregated dataset of inference benchmark results.list.c/list_models: C utility to extract unique model runs from the generated report.