Chat with a 95M parameter model.

A 95.7M parameter hybrid-attention language model pretrained on FineWeb (10B tokens) and finetuned on simple conversations. Built on a Qwen3.5 backbone with DeepSeek V4 architectural innovations, trained with the Muon optimizer on 4x NVIDIA B300 GPUs.

Model architecture


Parameters	95.7M
Model dimension	576
Layers	16 (hybrid pattern)
Attention heads	9 (3 KV heads, GQA)
Head dim	64
Context length	1024 tokens
Vocabulary	32K (Llama 2 tokenizer)
RoPE	Partial (50%), theta=20K
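
The "Partial (50%)" RoPE entry means only part of each 64-dim head carries positional rotation. The sketch below is a minimal pure-Python illustration, not the repository's implementation: the function name `partial_rope` and the half-split pairing of dimensions are assumptions, and model.py may pair dimensions differently (for example, interleaved).

```python
import math

def partial_rope(x, pos, rotary_frac=0.5, theta=20_000.0):
    """Rotate the first `rotary_frac` of a head vector; pass the rest through.

    Assumption: dimension i is paired with dimension i + rot_dim/2
    (the "half-split" convention).
    """
    head_dim = len(x)
    rot_dim = int(head_dim * rotary_frac)   # 32 for a 64-dim head
    half = rot_dim // 2                     # 16 rotation frequencies
    out = list(x)                           # dims >= rot_dim stay unchanged
    for i in range(half):
        angle = pos * theta ** (-2.0 * i / rot_dim)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [float(i + 1) for i in range(64)]
k = [float(64 - i) for i in range(64)]
# Attention scores depend only on the relative offset between positions.
print(dot(partial_rope(q, 5), partial_rope(k, 3)))
print(dot(partial_rope(q, 2), partial_rope(k, 0)))
```

The two printed scores agree up to floating-point error, because rotating the query and the key by position only changes their dot product through the offset between the two positions.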

Hybrid attention layers

The model uses a repeating 4-layer pattern: 2 linear + 1 compressed + 1 full attention, tiled 4 times.

linear → linear → CSA → full  (×4 = 16 layers)
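
As a quick check of the tiling above, the per-layer types can be generated from the 4-layer pattern. This is an illustrative sketch; the names `PATTERN` and `layer_types` are hypothetical and are not taken from model.py.

```python
PATTERN = ("linear", "linear", "csa", "full")

def layer_types(n_layers=16, pattern=PATTERN):
    """Return the attention type of every layer by tiling the pattern."""
    if n_layers % len(pattern) != 0:
        raise ValueError("n_layers must be a multiple of the pattern length")
    return [pattern[i % len(pattern)] for i in range(n_layers)]

layers = layer_types()
for idx, kind in enumerate(layers):
    print(idx, kind)
```

With 16 layers this gives 8 linear, 4 CSA, and 4 full-attention layers, with a full-attention layer closing every group of four.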

Linear attention (Gated Delta Net) — O(n) recurrence with causal depthwise conv, learned gating, and delta rule updates. 9 key/value heads, 112-dim keys and values.
Compressed Sparse Attention (CSA) — Dual-path design from DeepSeek V4. A learned stride-4 compressor produces compressed KV for long-range context, while a 64-token local window preserves fine-grained detail. A learned gate blends the two paths per-token.
Full attention — Standard softmax causal attention with gated Q projections, QK normalization, and grouped-query attention (9 heads, 3 KV heads).
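
To make the "delta rule updates" of the linear attention layers concrete, here is a minimal single-head sketch of one recurrent step in the commonly published Gated DeltaNet form, S_new = alpha * S (I - beta * k k^T) + beta * v k^T. It is an assumption that the repository follows exactly this form; the function names are hypothetical, keys are assumed unit-norm, and the causal depthwise conv is omitted.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One O(1)-per-token update of a (d_v x d_k) state matrix S.

    alpha in [0, 1] decays the old memory (the learned gate);
    beta in [0, 1] is the write strength of the delta rule.
    """
    d_v, d_k = len(S), len(k)
    # What the current state would return for key k.
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Erase the old value stored under k, decay, then write the new value.
    return [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]

def read(S, q):
    """Query the state: o = S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]

S = [[0.0] * 3 for _ in range(2)]
k1, k2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
S = gated_delta_step(S, k1, [1.0, 2.0], alpha=1.0, beta=1.0)
S = gated_delta_step(S, k1, [5.0, -1.0], alpha=1.0, beta=1.0)  # overwrite k1
print(read(S, k1))
```

With alpha = 1 and beta = 1, writing twice under the same unit-norm key replaces the stored value instead of accumulating it, which is the property that distinguishes the delta rule from plain additive linear attention.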

Hyper-Connections

Every layer uses learned per-dimension residual scaling (zero-initialized). Instead of h = h + sublayer(h), each sublayer computes:

h = (1 + residual_scale) * h + (1 + sublayer_scale) * sublayer(h)

This gives the model fine-grained control over information flow at a negligible parameter cost. Because both scales are zero-initialized, every layer starts out as a standard residual connection.
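
The update rule above can be sketched in plain Python for a single hidden vector. This is an illustration only; the function name `scaled_residual` is hypothetical, and the real model applies the same elementwise rule to tensors.

```python
def scaled_residual(h, sub_out, residual_scale, sublayer_scale):
    """h = (1 + residual_scale) * h + (1 + sublayer_scale) * sublayer(h),
    applied independently to every hidden dimension."""
    return [
        (1.0 + r) * x + (1.0 + s) * y
        for x, y, r, s in zip(h, sub_out, residual_scale, sublayer_scale)
    ]

h = [1.0, -2.0, 0.5, 3.0]
sub_out = [0.1, 0.2, -0.3, 0.4]   # stand-in for sublayer(h)
zeros = [0.0] * len(h)

# At initialization both scales are zero, so this is h + sublayer(h).
out_init = scaled_residual(h, sub_out, zeros, zeros)
print(out_init)

# Two learned vectors per sublayer at model dimension 576.
extra_params_per_sublayer = 2 * 576
print(extra_params_per_sublayer)
```

At model dimension 576 each sublayer adds 1,152 scale parameters, which is small next to the 95.7M total.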

Training

Optimizer: Muon

Muon applies Newton-Schulz orthogonalization to the momentum buffer, normalizing all singular values toward 1 for faster convergence. Embeddings and the output head fall back to AdamW.
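
The orthogonalization step can be illustrated with the textbook cubic Newton-Schulz iteration, X <- 1.5 X - 0.5 X X^T X. This is a sketch under stated assumptions: Muon implementations commonly use a tuned higher-order polynomial instead, and the coefficients used in this repository's muon.py are not given here. The helper names are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz(G, steps=5, eps=1e-7):
    """Push the singular values of G toward 1 without computing an SVD."""
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    norm = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / (norm + eps) for x in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        # Each singular value s is mapped to 1.5*s - 0.5*s**3.
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X

G = [[3.0, 1.0], [1.0, 2.0]]          # stand-in for a momentum buffer
X = newton_schulz(G, steps=5)
XXt = matmul(X, transpose(X))         # close to the identity if X is orthogonal
for row in XXt:
    print([round(v, 3) for v in row])
```

For this well-conditioned example, five steps bring X X^T close to the identity. Very small singular values converge more slowly under the cubic map, which is one motivation for the tuned polynomials used in practice.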


Learning rate	0.02 (Muon) / 0.002 (AdamW fallback)
Momentum	0.95 (Nesterov)
NS iterations	5
Weight decay	0.01
Gradient clip	1.0
Warmup	500 steps
Schedule	Cosine decay to 0
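
The warmup and schedule rows can be combined into one learning-rate function. The sketch below assumes linear warmup from 0 and a cosine that ends exactly at the last of the 10,000 iterations; the exact warmup shape used by train.py is not stated, and the name `lr_at` is hypothetical.

```python
import math

def lr_at(step, peak_lr, warmup=500, total=10_000):
    """Linear warmup to peak_lr, then cosine decay to 0 at `total` steps."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = min((step - warmup) / (total - warmup), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

for step in (0, 250, 500, 5_250, 10_000):
    # Muon peak 0.02 and AdamW fallback peak 0.002, from the table above.
    print(step, lr_at(step, 0.02), lr_at(step, 0.002))
```

Both optimizers follow the same shape here and differ only in the peak value.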

Pretraining data: FineWeb

Pretrained on FineWeb sample-10BT: 10 billion tokens of cleaned web text from Common Crawl, curated by Hugging Face.


Dataset	FineWeb sample-10BT
Documents	14.9M
Tokens	~10B
Tokenizer	Llama 2 SentencePiece (32K vocab)
Epochs	~2
Batch size	64 sequences × 8 grad accum × 4 GPUs × 1024 tokens ≈ 2.1M tokens/iter
Total iters	10,000
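
The batch-size and epoch figures in this table follow from simple arithmetic, assuming the 64 refers to sequences per GPU at the 1024-token context length. The variable names are illustrative.

```python
seqs_per_gpu = 64
grad_accum = 8
n_gpus = 4
context_len = 1024                  # tokens per sequence
total_iters = 10_000
dataset_tokens = 10_000_000_000     # "~10B" in the table, so approximate

tokens_per_iter = seqs_per_gpu * grad_accum * n_gpus * context_len
total_tokens = tokens_per_iter * total_iters
epochs = total_tokens / dataset_tokens

print(f"{tokens_per_iter:,} tokens/iter")      # about 2.1M
print(f"{total_tokens:,} tokens seen in total")
print(f"{epochs:.2f} passes over the data")    # about 2
```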

Hardware

4x NVIDIA B300 SXM6 (275 GB HBM3e each). With torch.compile, steady-state iteration time is ~2.35s at 23% MFU. Full pretraining takes approximately 7 hours.
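
A rough consistency check of these numbers, assuming 10,000 iterations of about 2.1M tokens each at the steady-state iteration time. It counts steady-state steps only; the source does not break down what makes up the rest of the quoted total.

```python
iter_time_s = 2.35                        # steady-state seconds per iteration
total_iters = 10_000
tokens_per_iter = 64 * 8 * 4 * 1024       # 2,097,152

steady_state_hours = iter_time_s * total_iters / 3600
tokens_per_second = tokens_per_iter / iter_time_s

print(f"{steady_state_hours:.2f} h of steady-state compute")
print(f"{tokens_per_second:,.0f} tokens/s across 4 GPUs")
```

This comes to roughly 6.5 hours of pure iteration time, in line with the approximately 7 hours quoted for the full run.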

Quick start

Pretrain from scratch

# Download and tokenize FineWeb (10B tokens, ~28 GB download)
python fineweb.py download
python fineweb.py pretokenize

# Train on 4 GPUs
torchrun --standalone --nproc_per_node=4 train.py config/fineweb_95m_dsv4_4xB300.py

Export and run in C

python export.py out_fineweb_95m_dsv4/model.bin --version 3 --checkpoint out_fineweb_95m_dsv4/ckpt.pt
gcc -O2 -o run run.c -lm
./run out_fineweb_95m_dsv4/model.bin -i "The history of" -n 256

Monitor training

tail -f train.log                          # live output
grep "| loss" train.log | tail -20         # recent loss values
nvidia-smi                                 # GPU utilization

Training metrics (loss, LR, MFU, samples) are logged to Weights & Biases. Checkpoints are saved every 1000 steps and uploaded to Cloudflare R2.

Flash Linear Attention

The linear attention layers automatically use fused Triton kernels from flash-linear-attention when installed, providing roughly 4x throughput improvement over the naive PyTorch fallback.

pip install flash-linear-attention
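
To see which path a given environment will take, one can check whether the package is importable. This is a sketch, not the repository's detection logic: it assumes the package is imported under the module name `fla`, and the function name `fla_available` is hypothetical.

```python
import importlib.util

def fla_available():
    """True if the flash-linear-attention package can be imported as `fla`."""
    return importlib.util.find_spec("fla") is not None

backend = "fused Triton kernels" if fla_available() else "naive PyTorch fallback"
print(f"linear attention backend: {backend}")
```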

License

MIT
