A Rust LLM inference engine, built to run on the Raspberry Pi Zero 2W. Inspired by picolm.
Runs SmolLM 360M at ~2.5 tokens/second on a ~18€ board with 512 MB RAM and a quad-core ARM Cortex-A53.
| Device | Model | Format | Generation speed |
|---|---|---|---|
| Pi Zero 2W | SmolLM 360M | Q4_K_M | 2.5 tok/s |
| Pi Zero 2W | TinyLlama 1.1B | Q4_K_M | 0.04 tok/s (IO-bound: the model exceeds the 512 MB of RAM, so weights page in from disk) |
| x86 desktop (no SIMD) | TinyLlama 1.1B | bf16 | 18.4 tok/s |
| x86 desktop (no SIMD) | TinyLlama 1.1B | f32 | 10.0 tok/s |
- Memory-mapped weights (mmap): zero-copy loading, so any model that fits on disk can run (sketched below)
- Parallel matrix-vector multiply (rayon; see the fused-kernel sketch after this list)
- KV cache (currently f32)
- Grouped Query Attention (GQA)
- BF16 and GGUF quantized weights (Q4_K, Q5_0, Q6_K, Q8_0)
- ARM NEON SIMD with fused dequantize + dot product
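The mmap bullet is the whole loading story. A minimal sketch of the idea with the memmap2 crate (file name illustrative; this is not the engine's actual loader):

```rust
use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("model.gguf")?;
    // SAFETY: the mapping is only valid while nothing truncates or
    // rewrites the file underneath us.
    let weights = unsafe { Mmap::map(&file)? };
    // Nothing is copied up front: bytes fault in from the page cache
    // on first touch, so a model larger than RAM still "loads".
    println!("mapped {} bytes", weights.len());
    Ok(())
}
```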
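And a scalar sketch of the fused dequantize + dot product, row-parallelized with rayon as the matvec bullet describes. The block layout is simplified (GGUF's Q8_0 stores an f16 scale per 32 weights; plain f32 keeps the sketch short), and the real kernel replaces the inner loop with NEON intrinsics:

```rust
use rayon::prelude::*;

/// One Q8_0-style block: 32 int8 weights sharing a single scale.
struct BlockQ8 {
    scale: f32,
    qs: [i8; 32],
}

/// Row-parallel matvec with fused dequantize + dot product: each
/// weight is dequantized on the fly inside the dot product, so no
/// f32 copy of the matrix is ever materialized.
fn matvec_q8(rows: &[Vec<BlockQ8>], x: &[f32], out: &mut [f32]) {
    out.par_iter_mut().zip(rows.par_iter()).for_each(|(y, row)| {
        *y = row
            .iter()
            .enumerate()
            .map(|(b, blk)| {
                let xs = &x[b * 32..(b + 1) * 32];
                let dot: f32 =
                    blk.qs.iter().zip(xs).map(|(&q, &xv)| q as f32 * xv).sum();
                blk.scale * dot
            })
            .sum();
    });
}
```

Fusing matters here: dequantizing into a temporary f32 matrix first would immediately blow the Pi's 512 MB budget.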
Any Llama-architecture model in GGUF or safetensors format.
Supported GGUF quantization types: Q4_K, Q5_0, Q6_K, Q8_0, F16, F32
Tested with:
- SmolLM 135M / 360M
- TinyLlama 1.1B
Requires Rust and Cargo. For cross-compilation to the Pi, install the aarch64 toolchain:
```bash
rustup target add aarch64-unknown-linux-gnu
sudo apt install gcc-aarch64-linux-gnu
```

```bash
cargo build --release   # native x86 build
cargo build-pi          # cross-compile for Pi Zero 2W (aarch64)
cargo test              # run tests
```
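The `cargo build-pi` alias implies a `.cargo/config.toml` in the repo; a sketch of what such a file typically contains (the alias expansion and linker name are assumptions, not copied from this repo):

```toml
# .cargo/config.toml -- sketch, not the repo's actual file
[alias]
# Hypothetical expansion of the build-pi alias used above.
build-pi = "build --release --target aarch64-unknown-linux-gnu"

[target.aarch64-unknown-linux-gnu]
# Point cargo at the cross-linker installed via apt above.
linker = "aarch64-linux-gnu-gcc"
```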
Run:

```bash
./rustllm -m model.gguf -tok HuggingFaceTB/SmolLM-360M-Instruct -p "Hello" -n 100 --chat --template smollm
```

| Flag | Default | Description |
|---|---|---|
| `-m` | required | Path to model file (`.gguf` or `.safetensors`). For safetensors, place `config.json` in the same directory |
| `-tok` | `SmolLM-360M-Instruct` | HuggingFace tokenizer ID |
| `-p` | required | Prompt |
| `-n` | 256 | Max tokens to generate |
| `--chat` | off | Apply chat template |
| `--template` | `smollm` | Chat template: `smollm`, `llama`, or `default` |
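For example, a chat-templated run against TinyLlama (the model path and tokenizer ID here are illustrative, not from the repo):

```bash
./rustllm -m tinyllama-1.1b.Q4_K_M.gguf \
  -tok TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  -p "Explain what mmap does" -n 128 --chat --template llama
```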
Library usage:

```rust
use rustllm::model::Model;
use rustllm::generate::GenerationConfig;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut model = Model::load("model.gguf")?;
    let input_ids = vec![1, 15043, 29892];

    let stats = model.generate(
        &input_ids,
        &GenerationConfig {
            max_new_tokens: 100,
            ..Default::default()
        },
        // Called once per generated token as it streams out.
        &mut |token_id| {
            // decode() is a placeholder for your tokenizer's id-to-text step.
            print!("{}", decode(token_id));
        },
    );
println!("Generation: {:.1} tok/s", stats.generation_tok_per_sec());Three crates:
Three crates:
- `memmap2`: memory-mapped file I/O
- `rayon`: parallel iterators
- `tokenizers`: HuggingFace tokenizers
