agi-templar/Awesome-Small-Language-Model

Awesome Small Language Models 👼🏻


A curated list of open source Small Language Models (roughly ≤13B parameters, or small-active-parameter MoE variants).

Why small? Small language models can run on consumer GPUs, edge devices, and even phones — making AI accessible without cloud dependencies.

Quick Comparison

Latest version of each model family. Click the model name to jump to details.

| Model | Org | Sizes | Context | Modality | License | Highlights |
|---|---|---|---|---|---|---|
| OLMo 2 | AI2 | 1B, 7B, 13B | 4K | Text | Apache 2.0 | Fully open (data + code + recipes) |
| Qwen 3 | Alibaba | 0.6B–14B, 30B-A3B | 128K | Text | Apache 2.0 | Hybrid thinking; 119 languages; 36T tokens |
| OpenELM | Apple | 270M–3B | 2K | Text | Apple | Layer-wise scaling; MLX support |
| Command R7B | Cohere | 7B | 128K | Text | CC-BY-NC | RAG & tool-use optimized; 23 languages |
| DeepSeek-R1 Distill | DeepSeek | 1.5B–14B | 128K | Text | MIT | Reasoning distilled from R1 |
| Gemma 4 | Google | E2B, E4B, 26B-A4B, 31B | 128K–256K | Text + Vision + Audio | Apache 2.0 | MoE + Dense; agentic workflows; 140+ languages |
| SmolLM2 | HuggingFace | 135M–1.7B | 8K | Text | Apache 2.0 | Ultra-small; beats Qwen2.5-1.5B at similar scale |
| LLaMA 3.2 | Meta | 1B, 3B, 11B | 128K | Text + Vision | Llama 3.2 | Edge/mobile optimized; SpinQuant |
| Phi-4 | Microsoft | 3.8B, 5.6B, 14B | 16K | Text + Vision + Audio | MIT | Surpasses GPT-4o on STEM QA |
| Mistral Small 3.1 | Mistral | 24B | 128K | Text + Vision | Apache 2.0 | Beats GPT-4o Mini; 150 tok/s |
| RWKV-7 | RWKV | 0.4B–2.9B | Unlimited | Text | Apache 2.0 | RNN; O(1) memory; no KV cache |
| StableLM 2 | Stability AI | 1.6B, 12B | 4K | Text | Stability AI | Laptop-deployable; 7 languages |
| Falcon 3 | TII | 1B–10B | 8K | Text | Apache 2.0 | 14T tokens; includes Mamba variant |

See also: Techniques for Creating SLMs | Deployment Tools | Contributing


AI2 OLMo

AI2 OLMo 2 (December 2024)

Models: 1B | 7B-Instruct | 13B-Instruct

Features:

  • Fully open: weights, training data, code, recipes, and intermediate checkpoints
  • OLMo 2 7B outperforms LLaMA 3.1 8B; 13B outperforms Qwen 2.5 7B
  • Two-stage curriculum pretraining on 3.9T tokens

Paper | BibTeX


Alibaba Qwen

Alibaba Qwen 3 (April 2025)

Dense Models: 0.6B | 1.7B | 4B | 8B | 14B

MoE Models: 30B-A3B

Features:

  • Hybrid thinking modes: fast responses + deep chain-of-thought reasoning
  • 119 languages/dialects; trained on 36T tokens; Apache 2.0 license
  • MoE variant (30B total, 3B active) for efficient deployment

Paper | BibTeX


Alibaba Qwen 2.5 (September 2024)

Models: 0.5B | 1.5B | 3B | 7B | 14B

Features:

  • 18T token pretraining; strong on coding, math, and instruction following
  • Specialized Coder and Math variants available
  • 29 languages; long context up to 128K tokens

Paper | BibTeX


Alibaba Qwen 2 (June 2024)

Models: 0.5B | 1.5B | 7B

Features:

  • Significant improvements in coding, math, and multilingual tasks
  • GQA for efficient KV cache; 128K context for 7B model

Paper | BibTeX


Apple

Apple OpenELM (April 2024)

Pre-trained Models: 270M | 450M | 1.1B | 3B

Instruction-Tuned Models: 270M-Instruct | 450M-Instruct | 1.1B-Instruct | 3B-Instruct

Features:

  • Layer-wise scaling strategy for efficient parameter allocation
  • Fully open: training data, logs, checkpoints, and configurations released
  • MLX support for inference/fine-tuning on Apple devices

Paper | BibTeX


Cohere

Cohere Command R7B (December 2024)

Models: R7B

Features:

  • 7B parameter model optimized for RAG, tool use, and agentic workflows
  • Multilingual (23 languages); 128K context length
  • Strong performance among similar-sized models on HuggingFace Open LLM Leaderboard

DeepSeek

DeepSeek-R1 Distilled Models (January 2025)

Small Distilled Models: Qwen-1.5B | Qwen-7B | Llama-8B | Qwen-14B

Features:

  • Distilled from DeepSeek-R1 using 800K curated reasoning samples
  • 8B distilled model matches larger models on MATH-500 and AIME 2024
  • Inherit reasoning from DeepSeek-R1, which learned to reason via reinforcement learning without human-annotated reasoning data

Paper | BibTeX


Google Gemma

Google Gemma 4 (April 2026)

Pre-trained Models: E2B | E4B | 26B-A4B | 31B

Instruction-Tuned Models: E2B-IT | E4B-IT | 26B-A4B-IT | 31B-IT

Features:

  • Dense (E2B/E4B/31B) and MoE (26B total, 4B active) architectures; Apache 2.0 license
  • Multimodal: text + image + video input (31B/26B), text + image + audio input (E2B/E4B); 128K–256K context
  • Native function-calling, structured JSON output, and agentic workflow support; 140+ languages

Google Gemma 3 (March 2025)

Pre-trained Models: 1B | 4B | 12B

Instruction-Tuned Models: 1B-IT | 4B-IT | 12B-IT

Features:

  • Multimodal (text + image input) with 128K context window
  • Supports 140+ languages; knowledge distillation during training
  • Gemma 3n variants optimized for mobile/edge devices

Paper | BibTeX


Google Gemma 2 (June 2024)

Pre-trained Models: 2B | 9B

Instruction-Tuned Models: 2B-IT | 9B-IT

Features:

  • Interleaved local-global attention and group-query attention
  • Smaller models trained via knowledge distillation instead of next-token prediction
  • 2B model trained on 2T tokens; 9B model trained on 8T tokens

Paper | BibTeX


Google Gemma (February 2024)

Pre-trained Models: 2B | 7B

Instruction-Tuned Models: 2B-IT | 7B-IT

Features:

  • Built on research from Gemini; trained on 6T tokens
  • Open-source code (PyTorch) and inference framework (C++)

Paper | BibTeX


HuggingFace SmolLM

HuggingFace SmolLM2 (November 2024)

Models: 135M | 360M | 1.7B | 1.7B-Instruct

Features:

  • 1.7B model trained on 11T tokens with multi-stage curriculum
  • Outperforms Qwen2.5-1.5B and LLaMA 3.2-1B at similar scale
  • Apache 2.0 license; designed for on-device deployment

Paper | BibTeX


Meta LLaMA

Meta LLaMA 3.2 (September 2024)

Text Models: 1B | 3B | 1B-Instruct | 3B-Instruct

Vision Models: 11B-Vision-Instruct

Features:

  • First multimodal LLaMA release; 1B/3B text models for edge/mobile
  • 128K context length; trained on 9T tokens with knowledge distillation from LLaMA 3.1
  • Optimized for on-device with SpinQuant and QLoRA support

Paper | BibTeX


Meta LLaMA 3.1 (July 2024)

Models: 8B | 8B-Instruct

Features:

  • 128K context length; multilingual support for 8 languages
  • Improved tool use and function calling capabilities

Paper | BibTeX


Meta LLaMA 3 (April 2024)

Models: 8B | 8B-Instruct

Features:

  • New 128K vocabulary tokenizer (up from 32K in LLaMA 2)
  • Grouped-Query Attention (GQA) for improved inference

Paper | BibTeX


Meta LLaMA 2 (July 2023)

Pre-trained Models: 7B | 13B

Chat Models: 7B-Chat | 13B-Chat

Features:

  • Iterative RLHF for chat alignment; 4K context window
  • Trained on 2T tokens; safety-tuned with red-teaming

Paper | BibTeX


Meta LLaMA (February 2023)

Pre-trained Models: 7B

Features:

  • Trained on publicly available data only (1T–1.4T tokens)
  • Demonstrated that smaller models trained on more data can match larger models

Paper | BibTeX


Microsoft Phi

Microsoft Phi-4 (December 2024)

Models: 14B | Mini 3.8B | Multimodal 5.6B

Features:

  • 14B model surpasses teacher model GPT-4o on STEM QA; trained on 9.8T tokens
  • Strategic use of synthetic data throughout training; superior on math competition problems
  • Phi-4-mini (3.8B) and Phi-4-multimodal (5.6B, handles text/images/audio) variants

Paper | BibTeX


Microsoft Phi-3.5 (August 2024)

Models: Mini 3.8B | MoE | Vision 4.2B

Features:

  • Enhanced multilingual, multimodal, and long-context (128K) capabilities
  • MoE variant for efficient inference; Vision variant for image understanding
  • MIT licensed

Paper | BibTeX


Microsoft Phi-3 (April 2024)

Models: Mini 3.8B | Small 7B | Medium 14B

Features:

  • Phi-3-mini (3.8B) outperforms models twice its size
  • Data quality over scale philosophy; high-quality curated + synthetic data
  • Runs locally on phones

Paper | BibTeX


Mistral

Mistral Small 3.1 (March 2025)

Models: 24B-Instruct | 24B-Base

Features:

  • 24B parameter multimodal model with 128K context; Apache 2.0 license
  • Outperforms GPT-4o Mini and Gemma 3; 150 tokens/s inference speed
  • Built on Mistral Small 3 with added vision understanding

Mistral Small 3 (January 2025)

Models: 24B-Instruct

Features:

  • 24B parameters competitive with LLaMA 3.3 70B at 3x faster speed
  • Fits on single RTX 4090 or 32GB MacBook when quantized
  • Apache 2.0 license

Ministral (les Ministraux) (October 2024)

Models: 8B-Instruct

Features:

  • Ministral 3B (API-only) and 8B (open weights) for on-device and edge use cases
  • 128K context support; interleaved sliding-window attention
  • Priced at $0.04/M tokens (3B) and $0.1/M tokens (8B)

Mistral 7B v0.3 (May 2024)

Models: Base | Instruct

Features:

  • Extended vocabulary and improved function calling
  • Sliding-window attention for efficient long-context inference

Paper | BibTeX


RWKV

RWKV-7 "Goose" (March 2025)

Models: 421M | 1.5B | 2.9B

Features:

  • RNN architecture with constant memory and constant inference time per token (no KV cache)
  • 2.9B model achieves new 3B SOTA on multilingual tasks
  • Linear time complexity; infinite context length; Apache 2.0 license

Paper | BibTeX


Stability AI

Stability AI StableLM 2 (January 2024)

Models: 1.6B | 1.6B-Chat | 12B | 12B-Chat

Features:

  • 1.6B model trained on 2T tokens across 7 languages
  • 12B model trained on 2T tokens with DPO for chat alignment
  • Compact enough for laptop deployment

Paper | BibTeX


TII Falcon

TII Falcon 3 (December 2024)

Models: 1B | 3B | 7B | 7B-Instruct | 10B | Mamba-7B

Features:

  • Trained on 14T tokens (2x Falcon 2); 1B to 10B parameter range
  • Includes SSM-based Mamba variant alongside transformer models
  • Compatible with Llama architecture; Apache 2.0-based license

TII Falcon 2 (May 2024)

Models: 11B | 11B-VLM

Features:

  • 11B parameters trained on 5.5T tokens across 11 languages
  • First Falcon with vision-language model (VLM) capability
  • Deployable on single A10 GPU

Paper | BibTeX


Techniques for Creating SLMs

Knowledge Distillation

Training a smaller "student" model to mimic the behavior of a larger "teacher" model. The student learns from the teacher's output probability distributions rather than just ground-truth labels, capturing richer information about inter-class relationships.
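The classic recipe (after Hinton et al.) blends a soft-target loss against the teacher's temperature-scaled distribution with the usual hard-label cross-entropy. A minimal PyTorch sketch; the function name and default hyperparameters are illustrative, not taken from any specific model's training recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL loss (teacher) with hard-label cross-entropy.

    T > 1 softens both distributions so the student sees the teacher's
    relative preferences among wrong classes; the T**2 factor keeps
    gradient magnitudes comparable across temperatures.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

In a real pipeline the teacher runs in `torch.no_grad()` mode and only the student's parameters receive gradients.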

Pruning

Removing redundant or less important parameters (weights, neurons, or entire layers) from a trained model to reduce size while preserving performance.
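Unstructured magnitude pruning is the simplest variant: rank weights by absolute value and zero out the smallest fraction. A hedged PyTorch sketch (the helper name is illustrative; production pipelines typically prune iteratively and fine-tune between rounds):

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a copy of `weight` with the smallest-magnitude
    `sparsity` fraction of entries zeroed (unstructured pruning)."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask
```

Structured variants (removing whole neurons, attention heads, or layers) trade some accuracy for speedups on hardware that cannot exploit sparse tensors.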

Quantization

Reducing the numerical precision of model weights and activations (e.g., FP32 → INT8 or INT4), significantly reducing memory footprint and speeding up inference.

  • Post-Training Quantization (PTQ): quantize after training (GPTQ, AWQ, GGUF)
  • Quantization-Aware Training (QAT): train with quantization in the loop for better accuracy
  • Popular tools: llama.cpp, bitsandbytes, AutoGPTQ
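As a toy illustration of PTQ, symmetric per-tensor INT8 quantization maps the largest absolute weight to 127 and rounds everything else to the nearest step. Real tools like GPTQ and AWQ quantize per-group and calibrate on data, so treat this purely as a sketch:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: one FP32 scale per tensor."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(1024, 1024)      # FP32 weights: 4 MiB
q, scale = quantize_int8(w)      # INT8 weights: 1 MiB (4x smaller)
# Round-to-nearest bounds the per-weight error by half a step (scale / 2).
max_err = (w - dequantize_int8(q, scale)).abs().max()
```

Going from INT8 to INT4 halves memory again but needs the finer-grained scales (per-channel or per-group) that the tools above provide.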

Deployment Tools

| Tool | Platform | Description | Link |
|---|---|---|---|
| Ollama | macOS, Linux, Windows | Run SLMs locally with a single command; supports GGUF models | GitHub |
| llama.cpp | Cross-platform | High-performance C/C++ inference with quantization support | GitHub |
| vLLM | Linux, Cloud | High-throughput serving with PagedAttention; production-grade | GitHub |
| MLC LLM | Mobile, Web, Desktop | Universal deployment across platforms including iOS/Android/WebGPU | GitHub |
| PocketPal AI | iOS, Android | Mobile app for running SLMs on-device | iOS / Android |

Contributing

Contributions are welcome! Please open a pull request or issue to add a model, fix a link, or suggest improvements. When adding a new model, please follow the existing format and include:

  • Official announcement/blog link
  • HuggingFace model links
  • 2–3 key features
  • Paper link (arXiv preferred)
