rustllm

Raspberry Pi Zero 2W

A Rust LLM inference engine, built to run on the Raspberry Pi Zero 2W. Inspired by picolm.

Runs SmolLM 360M at ~2.5 tokens/second on a ~18€ board with 512 MB of RAM and a quad-core ARM Cortex-A53.

Device                  Model            Format   Generation
Pi Zero 2W              SmolLM 360M      Q4_K_M   2.5 tok/s
Pi Zero 2W              TinyLlama 1.1B   Q4_K_M   0.04 tok/s (IO-bound, model > RAM)
x86 desktop (no SIMD)   TinyLlama 1.1B   bf16     18.4 tok/s
x86 desktop (no SIMD)   TinyLlama 1.1B   f32      10.0 tok/s

Optimizations

  • Memory-mapped weights (mmap): zero-copy, run any model size as long as it fits on disk
  • Parallel matrix-vector multiply (rayon); see the sketch after this list
  • KV cache (currently stored in f32)
  • Grouped Query Attention (GQA)
  • BF16 and GGUF quantized weights (Q4_K, Q5_0, Q6_K, Q8_0)
  • ARM NEON SIMD with fused dequantize + dot product
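
A minimal f32 sketch of the rayon-parallel matrix-vector multiply (function name and layout are illustrative, not the repo's actual API; the quantized paths follow the same structure but fuse dequantization into the inner loop):

use rayon::prelude::*;

/// y = W·x for a row-major weight matrix W of shape (rows, cols).
/// Each output element is an independent dot product, so rows are
/// spread across threads with rayon's parallel iterator.
fn matvec(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .into_par_iter()
        .map(|r| {
            let row = &w[r * cols..(r + 1) * cols];
            row.iter().zip(x).map(|(a, b)| a * b).sum()
        })
        .collect()
}

On the Pi Zero 2W this spreads the work across the four Cortex-A53 cores.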

Supported models

Any Llama-architecture model in GGUF or safetensors format.

Supported GGUF quantization types: Q4_K, Q5_0, Q6_K, Q8_0, F16, F32
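
As a reference point for the quantized formats: Q8_0 stores weights in blocks of 32 signed 8-bit values with one f16 scale per block, so a dot product can fuse dequantization into the accumulation (this is what the NEON path vectorizes). A scalar sketch assuming that standard GGUF layout; the type and function names are made up:

const QK8_0: usize = 32; // values per Q8_0 block

/// One parsed Q8_0 block: the per-block scale (f16 on disk, widened
/// to f32 here) and 32 signed 8-bit quantized values.
struct BlockQ80 {
    d: f32,
    qs: [i8; QK8_0],
}

/// Fused dequantize + dot product, scalar reference version:
/// dot(blocks, x) = Σ over blocks of d * Σ(q_i * x_i), accumulated in f32.
fn dot_q8_0(blocks: &[BlockQ80], x: &[f32]) -> f32 {
    blocks
        .iter()
        .zip(x.chunks_exact(QK8_0))
        .map(|(b, xs)| {
            let acc: f32 = b.qs.iter().zip(xs).map(|(&q, &v)| q as f32 * v).sum();
            b.d * acc
        })
        .sum()
}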

Tested with:

  • SmolLM 135M / 360M
  • TinyLlama 1.1B

Build

Requires Rust and Cargo. For cross-compilation to the Pi, install the aarch64 toolchain:

rustup target add aarch64-unknown-linux-gnu
sudo apt install gcc-aarch64-linux-gnu
cargo build --release       # native x86 build
cargo build-pi              # cross-compile for Pi Zero 2W (aarch64)
cargo test                  # run tests

Usage

./rustllm -m model.gguf -tok HuggingFaceTB/SmolLM-360M-Instruct -p "Hello" -n 100 --chat --template smollm

Flag         Default                Description
-m           required               Path to model file (.gguf or .safetensors). For safetensors, place config.json in the same directory
-tok         SmolLM-360M-Instruct   HuggingFace tokenizer ID
-p           required               Prompt
-n           256                    Max tokens to generate
--chat       off                    Apply chat template
--template   smollm                 Chat template: smollm, llama, or default

Rust API

use rustllm::model::Model;
use rustllm::generate::GenerationConfig;

let mut model = Model::load("model.gguf")?;
let input_ids = vec![1, 15043, 29892];

let stats = model.generate(
    &input_ids,
    &GenerationConfig {
        max_new_tokens: 100,
        ..Default::default()
    },
    &mut |token_id| {
        // streaming callback, invoked once per generated token;
        // decode() stands in for your tokenizer's decode step
        print!("{}", decode(token_id));
    },
);

println!("Generation: {:.1} tok/s", stats.generation_tok_per_sec());

Dependencies

Three crates:

  • memmap2: memory-mapped file I/O (see the sketch below)
  • rayon: parallel iterators
  • tokenizers: HuggingFace tokenizers
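
For illustration, the zero-copy loading that memmap2 enables boils down to mapping the file once and viewing byte ranges in place. A minimal sketch; the real loader additionally parses the GGUF/safetensors headers to find each tensor's offset, and the helper names here are made up:

use memmap2::Mmap;
use std::fs::File;

/// Map a weight file into memory. Pages are faulted in on demand by
/// the OS, so a model larger than RAM can still run (IO-bound).
fn map_weights(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Safety: the file must not be truncated or modified while mapped.
    unsafe { Mmap::map(&file) }
}

/// View a byte range of the mapping as f32 values without copying.
/// Assumes little-endian data and 4-byte alignment at `offset`.
fn tensor_f32(mmap: &Mmap, offset: usize, len: usize) -> &[f32] {
    let bytes = &mmap[offset..offset + len * 4];
    unsafe { std::slice::from_raw_parts(bytes.as_ptr() as *const f32, len) }
}

Because the mapping is backed by the file, nothing is copied at load time, which is why TinyLlama 1.1B still runs on 512 MB of RAM, just IO-bound, as the numbers above show.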
