A curated list of open source Small Language Models (roughly ≤13B parameters, or small-active-parameter MoE variants).
Why small? Small language models can run on consumer GPUs, edge devices, and even phones — making AI accessible without cloud dependencies.
Latest version of each model family. Click the model name to jump to details.
| Model | Org | Sizes | Context | Modality | License | Highlights |
|---|---|---|---|---|---|---|
| OLMo 2 | AI2 | 1B, 7B, 13B | 4K | Text | Apache 2.0 | Fully open (data + code + recipes) |
| Qwen 3 | Alibaba | 0.6B–14B, 30B-A3B | 128K | Text | Apache 2.0 | Hybrid thinking; 119 languages; 36T tokens |
| OpenELM | Apple | 270M–3B | 2K | Text | Apple | Layer-wise scaling; MLX support |
| Command R7B | Cohere | 7B | 128K | Text | CC-BY-NC | RAG & tool-use optimized; 23 languages |
| DeepSeek-R1 Distill | DeepSeek | 1.5B–14B | 128K | Text | MIT | Reasoning via RL; distilled from R1 |
| Gemma 4 | Google | E2B, E4B, 26B-A4B, 31B | 128K–256K | Text + Vision + Audio | Apache 2.0 | MoE + Dense; agentic workflows; 140+ languages |
| SmolLM2 | HuggingFace | 135M–1.7B | 8K | Text | Apache 2.0 | Ultra-small; outperforms Qwen2.5-1.5B at similar scale |
| LLaMA 3.2 | Meta | 1B, 3B, 11B | 128K | Text + Vision | Llama 3.2 | Edge/mobile optimized; SpinQuant |
| Phi-4 | Microsoft | 3.8B, 5.6B, 14B | 16K | Text + Vision + Audio | MIT | Surpasses GPT-4o on STEM QA |
| Mistral Small 3.1 | Mistral | 24B | 128K | Text + Vision | Apache 2.0 | Beats GPT-4o Mini; 150 tok/s |
| RWKV-7 | RWKV | 0.4B–2.9B | Unlimited | Text | Apache 2.0 | RNN; O(1) memory; no KV cache |
| StableLM 2 | Stability AI | 1.6B, 12B | 4K | Text | Stability AI | Laptop-deployable; 7 languages |
| Falcon 3 | TII | 1B–10B | 8K | Text | Apache 2.0 | 14T tokens; includes Mamba variant |
See also: Techniques for Creating SLMs | Deployment Tools | Contributing
AI2 OLMo 2 (December 2024)
Models: 1B | 7B-Instruct | 13B-Instruct
Features:
- Fully open: weights, training data, code, recipes, and intermediate checkpoints
- OLMo 2 7B outperforms LLaMA 3.1 8B; 13B outperforms Qwen 2.5 7B
- Two-stage curriculum pretraining on 3.9T tokens
Alibaba Qwen 3 (April 2025)
Dense Models: 0.6B | 1.7B | 4B | 8B | 14B
MoE Models: 30B-A3B
Features:
- Hybrid thinking modes: fast responses + deep chain-of-thought reasoning
- 119 languages/dialects; trained on 36T tokens; Apache 2.0 license
- MoE variant (30B total, 3B active) for efficient deployment
Alibaba Qwen 2.5 (September 2024)
Models: 0.5B | 1.5B | 3B | 7B | 14B
Features:
- 18T token pretraining; strong on coding, math, and instruction following
- Specialized Coder and Math variants available
- 29 languages; long context up to 128K tokens
Alibaba Qwen 2 (June 2024)
Features:
- Significant improvements in coding, math, and multilingual tasks
- GQA for efficient KV cache; 128K context for 7B model
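The KV-cache saving from grouped-query attention is easy to estimate: keys and values are stored once per KV head, so sharing each KV head across a group of query heads shrinks the cache proportionally. The head counts and dtype below are illustrative assumptions, not Qwen 2's published configuration:

```python
def kv_cache_bytes(seq_len, n_kv_heads, head_dim=128, bytes_per_elem=2):
    """Bytes of KV cache for one layer: keys + values, stored per KV head."""
    return seq_len * n_kv_heads * head_dim * 2 * bytes_per_elem

# Multi-head attention: one KV head per query head (e.g. 32 heads).
mha = kv_cache_bytes(seq_len=128_000, n_kv_heads=32)
# GQA: query heads share a smaller set of KV heads (e.g. 8) -> 4x smaller cache.
gqa = kv_cache_bytes(seq_len=128_000, n_kv_heads=8)
```

At 128K context the difference dominates memory use, which is why long-context small models almost universally adopt GQA.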
Apple OpenELM (April 2024)
Pre-trained Models: 270M | 450M | 1.1B | 3B
Instruction-Tuned Models: 270M-Instruct | 450M-Instruct | 1.1B-Instruct | 3B-Instruct
Features:
- Layer-wise scaling strategy for efficient parameter allocation
- Fully open: training data, logs, checkpoints, and configurations released
- MLX support for inference/fine-tuning on Apple devices
Cohere Command R7B (December 2024)
Models: R7B
Features:
- 7B parameter model optimized for RAG, tool use, and agentic workflows
- Multilingual (23 languages); 128K context length
- Strong performance among similar-sized models on HuggingFace Open LLM Leaderboard
DeepSeek-R1 Distilled Models (January 2025)
Small Distilled Models: Qwen-1.5B | Qwen-7B | Llama-8B | Qwen-14B
Features:
- Distilled from DeepSeek-R1 using 800K curated reasoning samples
- 8B distilled model matches larger models on MATH-500 and AIME 2024
- Reasoning via reinforcement learning without human-annotated reasoning data
Google Gemma 4 (April 2026)
Pre-trained Models: E2B | E4B | 26B-A4B | 31B
Instruction-Tuned Models: E2B-IT | E4B-IT | 26B-A4B-IT | 31B-IT
Features:
- Dense (E2B/E4B/31B) and MoE (26B total, 4B active) architectures; Apache 2.0 license
- Multimodal: text + image + video input (31B/26B), text + image + audio input (E2B/E4B); 128K–256K context
- Native function-calling, structured JSON output, and agentic workflow support; 140+ languages
Google Gemma 3 (March 2025)
Pre-trained Models: 1B | 4B | 12B
Instruction-Tuned Models: 1B-IT | 4B-IT | 12B-IT
Features:
- Multimodal (text + image input) with 128K context window
- Supports 140+ languages; knowledge distillation during training
- Gemma 3n variants optimized for mobile/edge devices
Google Gemma 2 (June 2024)
Instruction-Tuned Models: 2B-IT | 9B-IT
Features:
- Interleaved local-global attention and group-query attention
- Smaller models trained via knowledge distillation instead of next-token prediction
- 2B model trained on 2T tokens; 9B model trained on 8T tokens
Google Gemma (February 2024)
Instruction-Tuned Models: 2B-IT | 7B-IT
Features:
- Built on research from Gemini; trained on 6T tokens
- Open-source code (PyTorch) and inference framework (C++)
HuggingFace SmolLM2 (November 2024)
Models: 135M | 360M | 1.7B | 1.7B-Instruct
Features:
- 1.7B model trained on 11T tokens with multi-stage curriculum
- Outperforms Qwen2.5-1.5B and LLaMA 3.2-1B at similar scale
- Apache 2.0 license; designed for on-device deployment
Meta LLaMA 3.2 (September 2024)
Text Models: 1B | 3B | 1B-Instruct | 3B-Instruct
Vision Models: 11B-Vision-Instruct
Features:
- First multimodal LLaMA release; 1B/3B text models for edge/mobile
- 128K context length; trained on 9T tokens with knowledge distillation from LLaMA 3.1
- Optimized for on-device with SpinQuant and QLoRA support
Meta LLaMA 3.1 (July 2024)
Models: 8B | 8B-Instruct
Features:
- 128K context length; multilingual support for 8 languages
- Improved tool use and function calling capabilities
Meta LLaMA 3 (April 2024)
Models: 8B | 8B-Instruct
Features:
- New 128K vocabulary tokenizer (up from 32K in LLaMA 2)
- Grouped-Query Attention (GQA) for improved inference
Meta LLaMA 2 (July 2023)
Chat Models: 7B-Chat | 13B-Chat
Features:
- Iterative RLHF for chat alignment; 4K context window
- Trained on 2T tokens; safety-tuned with red-teaming
Meta LLaMA (February 2023)
Pre-trained Models: 7B
Features:
- Trained on publicly available data only (1T–1.4T tokens)
- Demonstrated that smaller models trained on more data can match larger models
Microsoft Phi-4 (December 2024)
Models: 14B | Mini 3.8B | Multimodal 5.6B
Features:
- 14B model surpasses teacher model GPT-4o on STEM QA; trained on 9.8T tokens
- Strategic use of synthetic data throughout training; superior on math competition problems
- Phi-4-mini (3.8B) and Phi-4-multimodal (5.6B, handles text/images/audio) variants
Microsoft Phi-3.5 (August 2024)
Models: Mini 3.8B | MoE (6.6B active) | Vision 4.2B
Features:
- Enhanced multilingual, multimodal, and long-context (128K) capabilities
- MoE variant for efficient inference; Vision variant for image understanding
- MIT licensed
Microsoft Phi-3 (April 2024)
Models: Mini 3.8B | Small 7B | Medium 14B
Features:
- Phi-3-mini (3.8B) outperforms models twice its size
- Data quality over scale philosophy; high-quality curated + synthetic data
- Runs locally on phones
Mistral Small 3.1 (March 2025)
Models: 24B-Instruct | 24B-Base
Features:
- 24B parameter multimodal model with 128K context; Apache 2.0 license
- Outperforms GPT-4o Mini and Gemma 3; 150 tokens/s inference speed
- Built on Mistral Small 3 with added vision understanding
Mistral Small 3 (January 2025)
Models: 24B-Instruct
Features:
- 24B parameters competitive with LLaMA 3.3 70B at 3x faster speed
- Fits on single RTX 4090 or 32GB MacBook when quantized
- Apache 2.0 license
Ministral (les Ministraux) (October 2024)
Models: 8B-Instruct
Features:
- Ministral 3B and 8B for on-device and edge use cases
- 128K context support; interleaved sliding-window attention
- Priced at $0.04/M tokens (3B) and $0.10/M tokens (8B)
Mistral 7B v0.3 (May 2024)
Features:
- Extended vocabulary and improved function calling
- Sliding-window attention for efficient long-context inference
RWKV-7 "Goose" (March 2025)
Features:
- RNN architecture with constant memory and constant inference time per token (no KV cache)
- 2.9B model achieves new 3B SOTA on multilingual tasks
- Linear time complexity; infinite context length; Apache 2.0 license
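The constant-memory claim above is easy to visualize: an RNN carries one fixed-size state forward, while a transformer's KV cache grows with every token. A toy contrast (illustrative numbers only, not RWKV-7's actual update rule):

```python
def state_sizes(num_tokens, d_state=8):
    """Toy contrast: an RNN keeps a fixed-size state across tokens,
    while a transformer-style KV cache grows with sequence length."""
    state = [0.0] * d_state      # recurrent state: size never changes
    kv_cache = []                # transformer-style cache, for comparison
    for t in range(num_tokens):
        x = float(t)             # stand-in for a token embedding
        state = [0.9 * s + 0.1 * x for s in state]  # O(1)-memory update
        kv_cache.append(x)       # grows by one entry per token
    return len(state), len(kv_cache)
```

The recurrent state stays at `d_state` floats no matter how long the sequence runs, which is why RWKV needs no KV cache and its per-token cost is constant.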
Stability AI StableLM 2 (January 2024)
Models: 1.6B | 1.6B-Chat | 12B | 12B-Chat
Features:
- 1.6B model trained on 2T tokens across 7 languages
- 12B model trained on 2T tokens with DPO for chat alignment
- Compact enough for laptop deployment
TII Falcon 3 (December 2024)
Models: 1B | 3B | 7B | 7B-Instruct | 10B | Mamba-7B
Features:
- Trained on 14T tokens (2x Falcon 2); 1B to 10B parameter range
- Includes SSM-based Mamba variant alongside transformer models
- Compatible with Llama architecture; Apache 2.0-based license
TII Falcon 2 (May 2024)
Features:
- 11B parameters trained on 5.5T tokens across 11 languages
- First Falcon with vision-language model (VLM) capability
- Deployable on single A10 GPU
Techniques for Creating SLMs
Knowledge Distillation
Training a smaller "student" model to mimic the behavior of a larger "teacher" model. The student learns from the teacher's output probability distributions rather than just ground-truth labels, capturing richer information about inter-class relationships.
- Used by: DeepSeek-R1 Distill, LLaMA 3.2 (1B/3B distilled from 3.1 70B), Gemma 2/3/4 smaller variants
- Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
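The soft-target objective can be sketched in a few lines. This is a minimal stand-alone illustration in plain Python with toy logits, not any model's actual training code:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in Hinton et al. (2015)."""
    p = softmax(teacher_logits, temperature)   # soft targets
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl
```

A temperature above 1 softens the teacher's distribution, exposing the relative probabilities of wrong-but-plausible tokens — the "dark knowledge" the student learns from. In practice this loss is combined with the ordinary cross-entropy on ground-truth labels.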
Pruning
Removing redundant or less important parameters (weights, neurons, or entire layers) from a trained model to reduce size while preserving performance.
- Unstructured pruning: zeroes out individual weights (sparse matrices)
- Structured pruning: removes entire neurons, attention heads, or layers (directly reduces model dimensions)
- A Survey on Model Compression for Large Language Models
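Unstructured magnitude pruning is the simplest variant: zero out the weights with the smallest absolute values. A minimal sketch over a toy nested-list weight matrix (production pruning operates on real tensors and typically fine-tunes afterward to recover accuracy):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning).

    weights:  matrix as a list of rows
    sparsity: fraction of weights to remove, in [0, 1]
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)              # number of weights to drop
    threshold = flat[k - 1] if k > 0 else -1.0  # largest magnitude to prune
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]
```

At 50% sparsity the two smallest-magnitude entries of a 2×2 matrix are zeroed; note that without structured sparsity the matrix dimensions are unchanged, so the speedup depends on sparse-kernel support.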
Quantization
Reducing the numerical precision of model weights and activations (e.g., FP32 → INT8 or INT4) to significantly reduce memory footprint and speed up inference.
- Post-Training Quantization (PTQ): quantize after training (GPTQ, AWQ, GGUF)
- Quantization-Aware Training (QAT): train with quantization in the loop for better accuracy
- Popular tools: llama.cpp, bitsandbytes, AutoGPTQ
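A minimal sketch of symmetric per-tensor INT8 quantization, the simplest PTQ scheme — real tools such as GPTQ and AWQ use calibration data and per-channel or per-group scales, but the core idea is the same:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP values from INT8 codes."""
    return [qi * scale for qi in q]
```

Each weight now occupies 1 byte instead of 4, at the cost of rounding error bounded by half the scale; the largest-magnitude weight maps exactly to ±127.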
Deployment Tools
| Tool | Platform | Description | Link |
|---|---|---|---|
| Ollama | macOS, Linux, Windows | Run SLMs locally with a single command; supports GGUF models | GitHub |
| llama.cpp | Cross-platform | High-performance C/C++ inference with quantization support | GitHub |
| vLLM | Linux, Cloud | High-throughput serving with PagedAttention; production-grade | GitHub |
| MLC LLM | Mobile, Web, Desktop | Universal deployment across platforms including iOS/Android/WebGPU | GitHub |
| PocketPal AI | iOS, Android | Mobile app for running SLMs on-device | iOS / Android |
Contributing
Contributions are welcome! Please open a pull request or issue to add a model, fix a link, or suggest improvements. When adding a new model, please follow the existing format and include:
- Official announcement/blog link
- HuggingFace model links
- 2–3 key features
- Paper link (arXiv preferred)