A curated list of open source Small Language Models (roughly ≤13B parameters, or small-active-parameter MoE variants).
Why small? Small language models can run on consumer GPUs, edge devices, and even phones — making AI accessible without cloud dependencies.
Latest version of each model family. Click the model name to jump to details.
| Model | Org | Sizes | Context | Modality | License | Highlights |
|---|---|---|---|---|---|---|
| OLMo 2 | AI2 | 1B, 7B, 13B | 4K | Text | Apache 2.0 | Fully open (data + code + recipes) |
| Qwen 3 | Alibaba | 0.6B–14B, 30B-A3B | 128K | Text | Apache 2.0 | Hybrid thinking; 119 languages; 36T tokens |
| OpenELM | Apple | 270M–3B | 2K | Text | Apple | Layer-wise scaling; MLX support |
| Command R7B | Cohere | 7B | 128K | Text | CC-BY-NC | RAG & tool-use optimized; 23 languages |
| DeepSeek-R1 Distill | DeepSeek | 1.5B–14B | 128K | Text | MIT | Reasoning via RL; distilled from R1 |
| Gemma 4 | Google | E2B, E4B, 26B-A4B, 31B | 128K–256K | Text + Vision + Audio | Apache 2.0 | MoE + Dense; agentic workflows; 140+ languages |
| SmolLM2 | HuggingFace | 135M–1.7B | 8K | Text | Apache 2.0 | Ultra-small; outperforms Qwen2.5-1.5B at similar scale |
| LLaMA 3.2 | Meta | 1B, 3B, 11B | 128K | Text + Vision | Llama 3.2 | Edge/mobile optimized; SpinQuant |
| Phi-4 | Microsoft | 3.8B, 5.6B, 14B | 16K | Text + Vision + Audio | MIT | Surpasses GPT-4o on STEM QA |
| Mistral Small 3.1 | Mistral | 24B | 128K | Text + Vision | Apache 2.0 | Beats GPT-4o Mini; 150 tok/s |
| RWKV-7 | RWKV | 0.4B–2.9B | Unlimited | Text | Apache 2.0 | RNN; O(1) memory; no KV cache |
| StableLM 2 | Stability AI | 1.6B, 12B | 4K | Text | Stability AI | Laptop-deployable; 7 languages |
| Falcon 3 | TII | 1B–10B | 8K | Text | Apache 2.0 | 14T tokens; includes Mamba variant |
See also: Techniques for Creating SLMs | Deployment Tools | Contributing
AI2 OLMo 2 (December 2024)
Models: 1B | 7B-Instruct | 13B-Instruct
Features:
- Fully open: weights, training data, code, recipes, and intermediate checkpoints
- OLMo 2 7B outperforms LLaMA 3.1 8B; 13B outperforms Qwen 2.5 7B
- Two-stage curriculum pretraining on 3.9T tokens
Alibaba Qwen 3 (April 2025)
Dense Models: 0.6B | 1.7B | 4B | 8B | 14B
MoE Models: 30B-A3B
Features:
- Hybrid thinking modes: fast responses + deep chain-of-thought reasoning
- 119 languages/dialects; trained on 36T tokens; Apache 2.0 license
- MoE variant (30B total, 3B active) for efficient deployment
Alibaba Qwen 2.5 (September 2024)
Models: 0.5B | 1.5B | 3B | 7B | 14B
Features:
- 18T token pretraining; strong on coding, math, and instruction following
- Specialized Coder and Math variants available
- 29 languages; long context up to 128K tokens
Alibaba Qwen 2 (June 2024)
Features:
- Significant improvements in coding, math, and multilingual tasks
- GQA for efficient KV cache; 128K context for 7B model
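The KV-cache saving from grouped-query attention is easy to estimate: keys and values are stored once per KV head, so sharing each KV head across a group of query heads shrinks the cache proportionally. The head counts and dtype below are illustrative assumptions, not Qwen 2's published configuration:

```python
def kv_cache_bytes(seq_len, n_kv_heads, head_dim=128, bytes_per_elem=2):
    """Bytes of KV cache for one layer: keys + values, stored per KV head."""
    return seq_len * n_kv_heads * head_dim * 2 * bytes_per_elem

# Multi-head attention: one KV head per query head (e.g. 32 heads).
mha = kv_cache_bytes(seq_len=128_000, n_kv_heads=32)
# GQA: query heads share a smaller set of KV heads (e.g. 8) -> 4x smaller cache.
gqa = kv_cache_bytes(seq_len=128_000, n_kv_heads=8)
```

At 128K context the difference dominates memory use, which is why long-context small models almost universally adopt GQA.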
Apple OpenELM (April 2024)
Pre-trained Models: 270M | 450M | 1.1B | 3B
Instruction-Tuned Models: 270M-Instruct | 450M-Instruct | 1.1B-Instruct | 3B-Instruct
Features:
- Layer-wise scaling strategy for efficient parameter allocation
- Fully open: training data, logs, checkpoints, and configurations released
- MLX support for inference/fine-tuning on Apple devices
Cohere Command R7B (December 2024)
Models: R7B
Features:
- 7B parameter model optimized for RAG, tool use, and agentic workflows
- Multilingual (23 languages); 128K context length
- Strong performance among similar-sized models on HuggingFace Open LLM Leaderboard
DeepSeek-R1 Distilled Models (January 2025)
Small Distilled Models: Qwen-1.5B | Qwen-7B | Llama-8B | Qwen-14B
Features:
- Distilled from DeepSeek-R1 using 800K curated reasoning samples
- 8B distilled model matches larger models on MATH-500 and AIME 2024
- Reasoning via reinforcement learning without human-annotated reasoning data
Google Gemma 4 (April 2026)
Pre-trained Models: E2B | E4B | 26B-A4B | 31B
Instruction-Tuned Models: E2B-IT | E4B-IT | 26B-A4B-IT | 31B-IT
Features:
- Dense (E2B/E4B/31B) and MoE (26B total, 4B active) architectures; Apache 2.0 license
- Multimodal: text + image + video input (31B/26B), text + image + audio input (E2B/E4B); 128K–256K context
- Native function-calling, structured JSON output, and agentic workflow support; 140+ languages
Google Gemma 3 (March 2025)
Pre-trained Models: 1B | 4B | 12B
Instruction-Tuned Models: 1B-IT | 4B-IT | 12B-IT
Features:
- Multimodal (text + image input) with 128K context window
- Supports 140+ languages; knowledge distillation during training
- Gemma 3n variants optimized for mobile/edge devices
Google Gemma 2 (June 2024)
Instruction-Tuned Models: 2B-IT | 9B-IT
Features:
- Interleaved local-global attention and group-query attention
- Smaller models trained via knowledge distillation instead of next-token prediction
- 2B model trained on 2T tokens; 9B model trained on 8T tokens
Google Gemma (February 2024)
Instruction-Tuned Models: 2B-IT | 7B-IT
Features:
- Built on research from Gemini; trained on 6T tokens
- Open-source code (PyTorch) and inference framework (C++)
HuggingFace SmolLM2 (November 2024)
Models: 135M | 360M | 1.7B | 1.7B-Instruct
Features:
- 1.7B model trained on 11T tokens with multi-stage curriculum
- Outperforms Qwen2.5-1.5B and LLaMA 3.2-1B at similar scale
- Apache 2.0 license; designed for on-device deployment
Meta LLaMA 3.2 (September 2024)
Text Models: 1B | 3B | 1B-Instruct | 3B-Instruct
Vision Models: 11B-Vision-Instruct
Features:
- First multimodal LLaMA release; 1B/3B text models for edge/mobile
- 128K context length; trained on 9T tokens with knowledge distillation from LLaMA 3.1
- Optimized for on-device with SpinQuant and QLoRA support
Meta LLaMA 3.1 (July 2024)
Models: 8B | 8B-Instruct
Features:
- 128K context length; multilingual support for 8 languages
- Improved tool use and function calling capabilities
Meta LLaMA 3 (April 2024)
Models: 8B | 8B-Instruct
Features:
- New 128K vocabulary tokenizer (up from 32K in LLaMA 2)
- Grouped-Query Attention (GQA) for improved inference
Meta LLaMA 2 (July 2023)
Chat Models: 7B-Chat | 13B-Chat
Features:
- Iterative RLHF for chat alignment; 4K context window
- Trained on 2T tokens; safety-tuned with red-teaming
Meta LLaMA (February 2023)
Pre-trained Models: 7B
Features:
- Trained on publicly available data only (1T–1.4T tokens)
- Demonstrated that smaller models trained on more data can match larger models
Microsoft Phi-4 (December 2024)
Models: 14B | Mini 3.8B | Multimodal 5.6B
Features:
- 14B model surpasses teacher model GPT-4o on STEM QA; trained on 9.8T tokens
- Strategic use of synthetic data throughout training; superior on math competition problems
- Phi-4-mini (3.8B) and Phi-4-multimodal (5.6B, handles text/images/audio) variants
Microsoft Phi-3.5 (August 2024)
Models: Mini 3.8B | MoE (6.6B active) | Vision 4.2B
Features:
- Enhanced multilingual, multimodal, and long-context (128K) capabilities
- MoE variant for efficient inference; Vision variant for image understanding
- MIT licensed
Microsoft Phi-3 (April 2024)
Models: Mini 3.8B | Small 7B | Medium 14B
Features:
- Phi-3-mini (3.8B) outperforms models twice its size
- Data quality over scale philosophy; high-quality curated + synthetic data
- Runs locally on phones
Mistral Small 3.1 (March 2025)
Models: 24B-Instruct | 24B-Base
Features:
- 24B parameter multimodal model with 128K context; Apache 2.0 license
- Outperforms GPT-4o Mini and Gemma 3; 150 tokens/s inference speed
- Built on Mistral Small 3 with added vision understanding
Mistral Small 3 (January 2025)
Models: 24B-Instruct
Features:
- 24B parameters competitive with LLaMA 3.3 70B at 3x faster speed
- Fits on single RTX 4090 or 32GB MacBook when quantized
- Apache 2.0 license
Ministral (les Ministraux) (October 2024)
Models: 8B-Instruct
Features:
- Ministral 3B and 8B for on-device and edge use cases
- 128K context support; interleaved sliding-window attention
- Priced at $0.04/M tokens (3B) and $0.10/M tokens (8B)
Mistral 7B v0.3 (May 2024)
Features:
- Extended vocabulary and improved function calling
- Sliding-window attention for efficient long-context inference
RWKV-7 "Goose" (March 2025)
Features:
- RNN architecture with constant memory and constant inference time per token (no KV cache)
- 2.9B model achieves new 3B SOTA on multilingual tasks
- Linear time complexity; infinite context length; Apache 2.0 license
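The constant-memory claim above is easy to visualize: an RNN carries one fixed-size state forward, while a transformer's KV cache grows with every token. A toy contrast (illustrative numbers only, not RWKV-7's actual update rule):

```python
def state_sizes(num_tokens, d_state=8):
    """Toy contrast: an RNN keeps a fixed-size state across tokens,
    while a transformer-style KV cache grows with sequence length."""
    state = [0.0] * d_state      # recurrent state: size never changes
    kv_cache = []                # transformer-style cache, for comparison
    for t in range(num_tokens):
        x = float(t)             # stand-in for a token embedding
        state = [0.9 * s + 0.1 * x for s in state]  # O(1)-memory update
        kv_cache.append(x)       # grows by one entry per token
    return len(state), len(kv_cache)
```

The recurrent state stays at `d_state` floats no matter how long the sequence runs, which is why RWKV needs no KV cache and its per-token cost is constant.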
Stability AI StableLM 2 (January 2024)
Models: 1.6B | 1.6B-Chat | 12B | 12B-Chat
Features:
- 1.6B model trained on 2T tokens across 7 languages
- 12B model trained on 2T tokens with DPO for chat alignment
- Compact enough for laptop deployment
TII Falcon 3 (December 2024)
Models: 1B | 3B | 7B | 7B-Instruct | 10B | Mamba-7B
Features:
- Trained on 14T tokens (2x Falcon 2); 1B to 10B parameter range
- Includes SSM-based Mamba variant alongside transformer models
- Compatible with Llama architecture; Apache 2.0-based license
TII Falcon 2 (May 2024)
Features:
- 11B parameters trained on 5.5T tokens across 11 languages
- First Falcon with vision-language model (VLM) capability
- Deployable on single A10 GPU
Techniques for Creating SLMs
Knowledge Distillation
Training a smaller "student" model to mimic the behavior of a larger "teacher" model. The student learns from the teacher's output probability distributions rather than just ground-truth labels, capturing richer information about inter-class relationships.
- Used by: DeepSeek-R1 Distill, LLaMA 3.2 (1B/3B distilled from 3.1 70B), Gemma 2/3/4 smaller variants
- Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
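The soft-target objective can be sketched in a few lines. This is a minimal stand-alone illustration in plain Python with toy logits, not any model's actual training code:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in Hinton et al. (2015)."""
    p = softmax(teacher_logits, temperature)   # soft targets
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl
```

A temperature above 1 softens the teacher's distribution, exposing the relative probabilities of wrong-but-plausible tokens — the "dark knowledge" the student learns from. In practice this loss is combined with the ordinary cross-entropy on ground-truth labels.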
Pruning
Removing redundant or less important parameters (weights, neurons, or entire layers) from a trained model to reduce size while preserving performance.
- Unstructured pruning: zeroes out individual weights (sparse matrices)
- Structured pruning: removes entire neurons, attention heads, or layers (directly reduces model dimensions)
- A Survey on Model Compression for Large Language Models
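Unstructured magnitude pruning is the simplest variant: zero out the weights with the smallest absolute values. A minimal sketch over a toy nested-list weight matrix (production pruning operates on real tensors and typically fine-tunes afterward to recover accuracy):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning).

    weights:  matrix as a list of rows
    sparsity: fraction of weights to remove, in [0, 1]
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)              # number of weights to drop
    threshold = flat[k - 1] if k > 0 else -1.0  # largest magnitude to prune
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]
```

At 50% sparsity the two smallest-magnitude entries of a 2×2 matrix are zeroed; note that without structured sparsity the matrix dimensions are unchanged, so the speedup depends on sparse-kernel support.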
Quantization
Reducing the numerical precision of model weights and activations (e.g., FP32 → INT8 or INT4) to significantly reduce memory footprint and speed up inference.
- Post-Training Quantization (PTQ): quantize after training (GPTQ, AWQ, GGUF)
- Quantization-Aware Training (QAT): train with quantization in the loop for better accuracy
- Popular tools: llama.cpp, bitsandbytes, AutoGPTQ
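A minimal sketch of symmetric per-tensor INT8 quantization, the simplest PTQ scheme — real tools such as GPTQ and AWQ use calibration data and per-channel or per-group scales, but the core idea is the same:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP values from INT8 codes."""
    return [qi * scale for qi in q]
```

Each weight now occupies 1 byte instead of 4, at the cost of rounding error bounded by half the scale; the largest-magnitude weight maps exactly to ±127.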
Deployment Tools
| Tool | Platform | Description | Link |
|---|---|---|---|
| Ollama | macOS, Linux, Windows | Run SLMs locally with a single command; supports GGUF models | GitHub |
| llama.cpp | Cross-platform | High-performance C/C++ inference with quantization support | GitHub |
| vLLM | Linux, Cloud | High-throughput serving with PagedAttention; production-grade | GitHub |
| MLC LLM | Mobile, Web, Desktop | Universal deployment across platforms including iOS/Android/WebGPU | GitHub |
| PocketPal AI | iOS, Android | Mobile app for running SLMs on-device | iOS / Android |
Contributing
Contributions are welcome! Please open a pull request or issue to add a model, fix a link, or suggest improvements. When adding a new model, please follow the existing format and include:
- Official announcement/blog link
- HuggingFace model links
- 2–3 key features
- Paper link (arXiv preferred)