A small language model training framework written in Rust. It covers the full pipeline: tokenizer training, pretraining on text corpora, supervised fine-tuning on chat data, reinforcement learning with GRPO, and inference with an interactive chat mode and web UI. The default model is a 28M-parameter GPT with grouped-query attention, rotary embeddings, and sliding window masking.
The project was inspired by Karpathy's nanochat. I believe understanding how LLMs are trained is the first step toward understanding how to build systems that use their full power. I chose Rust because I wanted to work close to the metal and understand every operation, without framework magic hiding what's happening. Rust's type system also makes it easier to catch shape mismatches and lifetime issues at compile time rather than hitting silent numerical bugs at runtime. And honestly, I just wanted to see how far a from-scratch LLM implementation can be pushed in a systems language.
- Rust toolchain (tested with stable)
- A C compiler (gcc)
```sh
cargo build --release
```
A pretrained 90M-parameter checkpoint is available on Hugging Face. Download the files and run:
```sh
mkdir -p runs/model
# place model.safetensors, config.json, and tokenizer.json in runs/model/
cargo run --release -- \
  --chat --load runs/model --tokenizer runs/model/tokenizer.json \
  --temperature 0.8 --max-tokens 256
```
```sh
cargo run --release -- \
  --tokenizer-train --data data/corpus.txt \
  --vocab-size 32768 --save-tokenizer runs/tok.json
```
Tokenizer training also works from parquet files with a text column.
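The tokenizer is a BPE trainer. As a rough illustration of what one training round does (a simplification, not the project's actual implementation), the trainer counts adjacent symbol pairs across the corpus and merges the most frequent pair everywhere it occurs:

```rust
// Minimal sketch of one BPE merge round over byte/char-level symbols.
use std::collections::HashMap;

/// Count adjacent symbol pairs across all words.
fn count_pairs(words: &[Vec<String>]) -> HashMap<(String, String), usize> {
    let mut counts = HashMap::new();
    for w in words {
        for pair in w.windows(2) {
            *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
        }
    }
    counts
}

/// Merge every occurrence of the most frequent pair into a single symbol.
fn merge_best(words: &mut [Vec<String>]) -> Option<(String, String)> {
    let counts = count_pairs(words);
    let (best, _) = counts.into_iter().max_by_key(|&(_, c)| c)?;
    for w in words.iter_mut() {
        let mut i = 0;
        while i + 1 < w.len() {
            if w[i] == best.0 && w[i + 1] == best.1 {
                w[i] = format!("{}{}", w[i], w[i + 1]);
                w.remove(i + 1);
            } else {
                i += 1;
            }
        }
    }
    Some(best)
}

fn main() {
    // "aaab": the pair ("a", "a") is most frequent and gets merged first.
    let mut words: Vec<Vec<String>> = vec!["aaab", "aaab"]
        .into_iter()
        .map(|w| w.chars().map(|c| c.to_string()).collect())
        .collect();
    let best = merge_best(&mut words).unwrap();
    println!("merged pair: {:?}", best);
    println!("words: {:?}", words);
}
```

The real trainer repeats this until the vocabulary reaches --vocab-size, then records the merge order so encoding can replay it.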
Pretraining expects a directory of parquet files (FineWeb format, with a text column). If you have a plain text file, convert it first:
```sh
cargo run --release -- \
  --prepare-data --data data/corpus.txt --output data/train/corpus.parquet
```
Then pretrain:
```sh
cargo run --release -- \
  --pretrain --data data/train --tokenizer runs/tok.json \
  --depth 4 --steps 3000 --batch-size 2 --seq-len 256 \
  --save runs/pretrain
```
SFT expects JSONL files with chat conversations:
```json
{"messages": [{"role": "user", "content": "What is gravity?"}, {"role": "assistant", "content": "Gravity is..."}]}
```

Run SFT on a pretrained checkpoint:
```sh
cargo run --release -- \
  --sft --load runs/pretrain --tokenizer runs/tok.json \
  --sft-data data/chat.jsonl --steps 200 --batch-size 2 --seq-len 256 \
  --save runs/sft
```
Multiple datasets with weights: `--sft-data "data/a.jsonl:1.0,data/b.jsonl:0.5"`
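A sketch of how such a path:weight spec might be parsed; the exact parsing rules and defaults of the real CLI are an assumption here:

```rust
// Parse "path:weight,path:weight,..." into (path, weight) pairs.
// Entries without an explicit weight default to 1.0 (an assumption).
fn parse_sft_data(spec: &str) -> Vec<(String, f64)> {
    spec.split(',')
        .map(|entry| match entry.rsplit_once(':') {
            // Explicit ":weight" suffix, e.g. "data/b.jsonl:0.5".
            Some((path, w)) if w.parse::<f64>().is_ok() => {
                (path.to_string(), w.parse().unwrap())
            }
            // No numeric suffix: take the whole entry as a path.
            _ => (entry.to_string(), 1.0),
        })
        .collect()
}

fn main() {
    let parsed = parse_sft_data("data/a.jsonl:1.0,data/b.jsonl:0.5");
    println!("{:?}", parsed);
}
```

The weights would then drive proportional sampling across datasets during SFT batching.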
Trains reasoning ability with group-relative policy optimization (GRPO) on math (GSM8K), multiple-choice (ARC), and tool-use tasks:
```sh
cargo run --release -- \
  --grpo --load runs/sft --tokenizer runs/tok.json \
  --gsm8k-data data/gsm8k.jsonl --arc-data data/arc.jsonl \
  --steps 100 --group-size 16 \
  --save runs/grpo
```
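The core of GRPO is the group-relative advantage: each of the --group-size sampled completions has its reward normalized against the mean and standard deviation of its own group, so no learned value function is needed. A minimal sketch of that normalization (the real training loop also applies the clipped policy-gradient objective on top of these advantages):

```rust
// Normalize each rollout's reward within its group: (r - mean) / std.
fn group_advantages(rewards: &[f64]) -> Vec<f64> {
    let n = rewards.len() as f64;
    let mean = rewards.iter().sum::<f64>() / n;
    let var = rewards.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt().max(1e-8); // avoid division by zero when all rewards match
    rewards.iter().map(|r| (r - mean) / std).collect()
}

fn main() {
    // A group of 4 rollouts where only one got the answer right:
    // the correct rollout receives a positive advantage (sqrt(3) here),
    // the others a small negative one.
    let adv = group_advantages(&[1.0, 0.0, 0.0, 0.0]);
    println!("{:?}", adv);
}
```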
```sh
cargo run --release -- \
  --chat --load runs/sft --tokenizer runs/tok.json \
  --temperature 0.8 --max-tokens 256
```
```sh
cargo run --release -- \
  --serve --load runs/sft --tokenizer runs/tok.json --port 8000
```
Bits-per-byte on validation data:
```sh
cargo run --release -- \
  --eval-bpb --load runs/pretrain --tokenizer runs/tok.json \
  --val-data data/val
```
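Bits-per-byte normalizes the model's cross-entropy by raw byte count rather than token count, so scores stay comparable across tokenizers with different vocabularies. The conversion can be sketched as:

```rust
// Total cross-entropy (in nats, as training losses usually are) divided by
// ln(2) gives bits; dividing by the byte count of the raw text gives bits/byte.
fn bits_per_byte(total_loss_nats: f64, total_bytes: usize) -> f64 {
    total_loss_nats / (std::f64::consts::LN_2 * total_bytes as f64)
}

fn main() {
    // If the model pays ln(2) nats per byte on average, that is exactly 1 bit/byte.
    let bpb = bits_per_byte(std::f64::consts::LN_2 * 1000.0, 1000);
    println!("{bpb}");
}
```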
ARC-Challenge accuracy:
```sh
cargo run --release -- \
  --eval-arc --load runs/sft --tokenizer runs/tok.json \
  --arc-data data/arc.jsonl
```
The model is a decoder-only transformer with:
- Grouped-query attention (GQA) with half as many KV heads as query heads
- Rotary positional embeddings (RoPE)
- Sliding window attention with a configurable per-layer pattern (default: SSSL)
- ReLU-squared activation in the MLP
- RMS normalization
- Learnable residual scaling (per-layer lambdas)
- Logit soft-capping
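A few of these components are simple enough to sketch directly. The shapes, epsilon, and cap value below are illustrative, not taken from the actual config:

```rust
// RMS normalization: scale the vector by the reciprocal of its root-mean-square.
fn rms_norm(x: &[f32], eps: f32) -> Vec<f32> {
    let ms = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (ms + eps).sqrt();
    x.iter().map(|v| v * scale).collect()
}

/// ReLU-squared MLP activation: zero for negatives, quadratic for positives.
fn relu_squared(x: f32) -> f32 {
    let r = x.max(0.0);
    r * r
}

/// Logit soft-capping: a scaled tanh keeps logits within (-cap, cap).
fn soft_cap(logit: f32, cap: f32) -> f32 {
    cap * (logit / cap).tanh()
}

fn main() {
    println!("{:?}", rms_norm(&[3.0, 4.0], 0.0)); // RMS of [3, 4] is sqrt(12.5)
    println!("{}", relu_squared(3.0)); // 9
    println!("{}", soft_cap(1000.0, 30.0)); // saturates near the cap
}
```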
The --depth flag controls model size:
| Depth | Params | Layers | Dim | Heads |
|---|---|---|---|---|
| 4 | 28M | 4 | 256 | 4 |
| 8 | 90M | 8 | 512 | 8 |
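The two rows imply a simple scaling rule, dim = 64 * depth and heads = depth. This is an inference from the table, not a guaranteed invariant of the codebase:

```rust
// Width and head count implied by the size table (inferred, illustrative).
fn dims_for_depth(depth: usize) -> (usize, usize) {
    (64 * depth, depth) // (model dim, query heads)
}

fn main() {
    println!("{:?}", dims_for_depth(4)); // (256, 4), the 28M config
    println!("{:?}", dims_for_depth(8)); // (512, 8), the 90M config
}
```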
The optimizer is a hybrid of Muon (for 2D weight matrices) and AdamW (for everything else), with per-parameter-group learning rates.
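The routing between the two optimizers can be sketched as a check on parameter rank; the enum and shape convention here are illustrative, not the crate's actual API:

```rust
// Route each parameter tensor to an optimizer by its number of dimensions:
// 2D weight matrices -> Muon, everything else (norm gains, biases,
// per-layer lambdas, higher-rank tensors) -> AdamW.
#[derive(Debug, PartialEq)]
enum Optim {
    Muon,
    AdamW,
}

fn route(shape: &[usize]) -> Optim {
    if shape.len() == 2 {
        Optim::Muon
    } else {
        Optim::AdamW
    }
}

fn main() {
    println!("{:?}", route(&[512, 512])); // a linear layer -> Muon
    println!("{:?}", route(&[512]));      // an RMSNorm gain -> AdamW
}
```

Each group would then carry its own learning rate, matching the per-parameter-group scheduling described above.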
- `picochat-core` -- model architecture, attention, embeddings, KV cache
- `picochat-tokenizer` -- BPE tokenizer with special tokens for chat, reasoning, and tool use
- `picochat-data` -- parquet readers, packing dataloaders, dataset formats
- `picochat-optim` -- Muon/AdamW hybrid optimizer, LR scheduling
- `picochat-train` -- pretraining, SFT, and GRPO training loops
- `picochat-eval` -- BPB, ARC, GSM8K, and reasoning quality metrics
- `picochat-engine` -- generation, sampling, reasoning/tool-call handling, INT8 quantization
- `picochat-serve` -- HTTP server with SSE streaming
- `picochat-tool` -- tool/expression evaluator for function calls
- `picochat-cli` -- command-line entry point
```sh
cargo test --workspace
```
This is a learning project, not a production language model. The models trained here produce coherent text on topics seen during training but will generate garbled output on novel questions. Getting useful general-purpose chat behavior requires billions of tokens of training data and GPU-scale compute -- far beyond what a CPU-trained 28M-90M parameter model can achieve. The value of this project is in the training framework and pipeline, not the resulting model weights.
This project draws heavily from Andrej Karpathy's nanochat and the broader nanoGPT lineage. His work on making LLM training understandable and accessible is what motivated this project.
MIT