A Rust LLM inference engine, built to run on the Raspberry Pi Zero 2W. Inspired by picolm.
Runs SmolLM 360M at ~2.5 tokens/second on a ~18€ board with 512 MB RAM and a quad-core ARM Cortex-A53.
| Device | Model | Format | Generation speed |
|---|---|---|---|
| Pi Zero 2W | SmolLM 360M | Q4_K_M | 2.5 tok/s |
| Pi Zero 2W | TinyLlama 1.1B | Q4_K_M | 0.04 tok/s (IO-bound: the model exceeds the 512 MB of RAM, so weights page in from disk) |
| x86 desktop (no SIMD) | TinyLlama 1.1B | bf16 | 18.4 tok/s |
| x86 desktop (no SIMD) | TinyLlama 1.1B | f32 | 10.0 tok/s |
- Memory-mapped weights (mmap): zero-copy loading, so any model that fits on disk can run (sketched below)
- Parallel matrix-vector multiply (rayon; see the fused-kernel sketch after this list)
- KV cache (currently f32)
- Grouped Query Attention (GQA)
- BF16 and GGUF quantized weights (Q4_K, Q5_0, Q6_K, Q8_0)
- ARM NEON SIMD with fused dequantize + dot product
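The mmap bullet is the whole loading story. A minimal sketch of the idea with the memmap2 crate (file name illustrative; this is not the engine's actual loader):

```rust
use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("model.gguf")?;
    // SAFETY: the mapping is only valid while nothing truncates or
    // rewrites the file underneath us.
    let weights = unsafe { Mmap::map(&file)? };
    // Nothing is copied up front: bytes fault in from the page cache
    // on first touch, so a model larger than RAM still "loads".
    println!("mapped {} bytes", weights.len());
    Ok(())
}
```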
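And a scalar sketch of the fused dequantize + dot product, row-parallelized with rayon as the matvec bullet describes. The block layout is simplified (GGUF's Q8_0 stores an f16 scale per 32 weights; plain f32 keeps the sketch short), and the real kernel replaces the inner loop with NEON intrinsics:

```rust
use rayon::prelude::*;

/// One Q8_0-style block: 32 int8 weights sharing a single scale.
struct BlockQ8 {
    scale: f32,
    qs: [i8; 32],
}

/// Row-parallel matvec with fused dequantize + dot product: each
/// weight is dequantized on the fly inside the dot product, so no
/// f32 copy of the matrix is ever materialized.
fn matvec_q8(rows: &[Vec<BlockQ8>], x: &[f32], out: &mut [f32]) {
    out.par_iter_mut().zip(rows.par_iter()).for_each(|(y, row)| {
        *y = row
            .iter()
            .enumerate()
            .map(|(b, blk)| {
                let xs = &x[b * 32..(b + 1) * 32];
                let dot: f32 =
                    blk.qs.iter().zip(xs).map(|(&q, &xv)| q as f32 * xv).sum();
                blk.scale * dot
            })
            .sum();
    });
}
```

Fusing matters here: dequantizing into a temporary f32 matrix first would immediately blow the Pi's 512 MB budget.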
Any Llama-architecture model in GGUF or safetensors format.
Supported GGUF quantization types: Q4_K, Q5_0, Q6_K, Q8_0, F16, F32
Tested with:
- SmolLM 135M / 360M
- TinyLlama 1.1B
Requires Rust and Cargo. For cross-compilation to the Pi, install the aarch64 toolchain:
```bash
rustup target add aarch64-unknown-linux-gnu
sudo apt install gcc-aarch64-linux-gnu
```

```bash
cargo build --release   # native x86 build
cargo build-pi          # cross-compile for Pi Zero 2W (aarch64)
cargo test              # run tests
```
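The `cargo build-pi` alias implies a `.cargo/config.toml` in the repo; a sketch of what such a file typically contains (the alias expansion and linker name are assumptions, not copied from this repo):

```toml
# .cargo/config.toml -- sketch, not the repo's actual file
[alias]
# Hypothetical expansion of the build-pi alias used above.
build-pi = "build --release --target aarch64-unknown-linux-gnu"

[target.aarch64-unknown-linux-gnu]
# Point cargo at the cross-linker installed via apt above.
linker = "aarch64-linux-gnu-gcc"
```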
Run:

```bash
./rustllm -m model.gguf -tok HuggingFaceTB/SmolLM-360M-Instruct -p "Hello" -n 100 --chat --template smollm
```

| Flag | Default | Description |
|---|---|---|
| `-m` | required | Path to model file (`.gguf` or `.safetensors`). For safetensors, place `config.json` in the same directory |
| `-tok` | `SmolLM-360M-Instruct` | HuggingFace tokenizer ID |
| `-p` | required | Prompt |
| `-n` | 256 | Max tokens to generate |
| `--chat` | off | Apply chat template |
| `--template` | `smollm` | Chat template: `smollm`, `llama`, or `default` |
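For example, a chat-templated run against TinyLlama (the model path and tokenizer ID here are illustrative, not from the repo):

```bash
./rustllm -m tinyllama-1.1b.Q4_K_M.gguf \
  -tok TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  -p "Explain what mmap does" -n 128 --chat --template llama
```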
Library usage:

```rust
use rustllm::model::Model;
use rustllm::generate::GenerationConfig;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut model = Model::load("model.gguf")?;
    let input_ids = vec![1, 15043, 29892];

    let stats = model.generate(
        &input_ids,
        &GenerationConfig {
            max_new_tokens: 100,
            ..Default::default()
        },
        // Called once per generated token as it streams out.
        &mut |token_id| {
            // decode() is a placeholder for your tokenizer's id-to-text step.
            print!("{}", decode(token_id));
        },
    );
println!("Generation: {:.1} tok/s", stats.generation_tok_per_sec());Three crates:
Three crates:
- `memmap2`: memory-mapped file I/O
- `rayon`: parallel iterators
- `tokenizers`: HuggingFace tokenizers
