A model registry and lifecycle manager for local GGUF models served via llama-swap.
Herd gives you a single source of truth for your local model collection — organized by category, with cascading defaults, automatic HuggingFace metadata fetching, and one-command config generation.
- Registry — Organize models in per-category YAML files with cascading defaults (global → category → model)
- Add — Interactively add models from HuggingFace with auto-detected metadata (type, context length, special flags)
- Download — Generate and run `huggingface-cli download` commands for your entire registry
- Build — Generate a llama-swap `config.yaml` from your registry with correct paths, flags, and sampling parameters
- Status — Dashboard showing which models are downloaded, missing, or disabled
- Validate — Check for missing files, orphaned GGUFs, and registry errors
```bash
# Option 1: pip install (creates the `herd` command)
pip install -e .

# Option 2: use directly without installing
pip install pyyaml requests
alias herd='python3 /path/to/herd/manage.py'
```

```bash
# Set your models directory
# Edit models/_defaults.yaml and set base_path to where your GGUFs live

# Add a model interactively
herd add unsloth/Qwen3-4B-Instruct-2507-GGUF

# Download it
herd download qwen3-4b

# Build llama-swap config and start serving
herd build --output config.yaml
llama-swap --config config.yaml
```

```
models/
  _defaults.yaml   # Global config: base_path, server settings, category defaults
  instruct.yaml    # Instruct/chat models
  coding.yaml      # Code generation models
  reasoning.yaml   # Chain-of-thought reasoning models
  embedding.yaml   # Embedding models (auto-adds --embedding flag)
  reranker.yaml    # Reranker models (auto-adds --reranking flag)
```
Files prefixed with `_` are configuration. Everything else is a category — the filename determines the category name. Add as many as you want: `creative.yaml`, `medical.yaml`, `multilingual.yaml`, etc.
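The filename-as-category convention can be sketched roughly like this (an illustrative sketch, not Herd's actual implementation; it only assumes the `models/` layout shown above):

```python
from pathlib import Path

def discover_categories(models_dir: str) -> list[str]:
    """Treat every top-level .yaml file as a category,
    skipping files prefixed with '_' (those are config)."""
    return sorted(
        p.stem
        for p in Path(models_dir).glob("*.yaml")
        if not p.name.startswith("_")
    )
```

Dropping a new `creative.yaml` into the directory would then surface a `creative` category with no other changes.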
```yaml
# models/instruct.yaml
qwen3-4b:
  path: instruct/qwen3-4b-q8/Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf
  repo: unsloth/Qwen3-4B-Instruct-2507-GGUF
  file: Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf
  ctx: 262144
  system_prompt: "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
  sampling:
    top_p: 0.8
    top_k: 20
  tags:
    - q8
    - small
```

Config resolution: `_defaults.yaml` globals → category defaults → per-model overrides.
```yaml
# models/_defaults.yaml
defaults:
  gpu_layers: 99

categories:
  instruct:
    ctx: 32768
  embedding:
    ctx: 32768
    flags: [--embedding]
```

A model in `instruct.yaml` inherits `gpu_layers: 99` and `ctx: 32768` automatically. Override any field at the model level.
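The cascade can be pictured as a shallow merge of three dictionaries, with special handling for flags (an illustrative sketch, not Herd's actual code; field names follow the registry format above):

```python
def resolve_model(globals_: dict, category: dict, model: dict) -> dict:
    """Resolve a model's effective config: global defaults first,
    then category defaults, then per-model fields win.
    `flags` lists are concatenated; `flags_override` replaces them."""
    resolved = {**globals_, **category, **model}
    if "flags_override" in model:
        resolved["flags"] = model["flags_override"]
        resolved.pop("flags_override", None)
    else:
        resolved["flags"] = category.get("flags", []) + model.get("flags", [])
    return resolved
```

Under this sketch, a model in `embedding.yaml` with `flags: [--extra]` would resolve to `[--embedding, --extra]`, while one with `flags_override: [--only-this]` would resolve to exactly `[--only-this]`.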
Flags merge by default (category flags + model flags). Use `flags_override` to replace them entirely:
```yaml
special-model:
  flags: [--extra]                # merged with category defaults

override-model:
  flags_override: [--only-this]   # replaces category defaults entirely
```

```
herd add [repo]          # Add a model (interactive if no repo given)
herd build               # Generate llama-swap config.yaml
herd status              # Show model status dashboard
herd list                # List models with filters
herd info <model>        # Show resolved config and command for a model
herd download            # Download models from HuggingFace
herd download --dry-run  # Preview download commands
herd enable <model>      # Enable a disabled model
herd disable <model>     # Disable a model
herd validate            # Check registry health
herd scan                # Find orphaned GGUFs and propose registry entries
herd cleanup             # Remove orphaned files from disk
herd monitor             # Live TUI dashboard (requires textual, httpx)
```

```
herd build --only embedding
herd build --exclude embedding,reranker
herd list --only instruct --tags small
herd status --all        # include disabled models
```

| Field | Required | Description |
|---|---|---|
| `path` | yes | Relative path from `base_path` to the GGUF file |
| `repo` | yes | HuggingFace repository (for downloads) |
| `file` | yes | Filename or glob pattern for the GGUF |
| `ctx` | no | Context length (inherits from category/global default) |
| `system_prompt` | no | Recommended system prompt from model documentation |
| `summary` | no | Model description and notes |
| `sampling` | no | Sampling parameters (temperature, top_p, top_k, etc.) |
| `tags` | no | Tags for filtering (e.g., small, q8, general) |
| `flags` | no | Additional llama-server flags (merged with category defaults) |
| `flags_override` | no | Replace category default flags entirely |
| `mmproj` | no | Path to multimodal projection file (vision models) |
| `enabled` | no | Set to `false` to disable without removing |
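A minimal check for the required fields in this table might look like the following (an illustrative sketch, not Herd's `validate` implementation):

```python
REQUIRED_FIELDS = ("path", "repo", "file")

def validate_entry(name: str, entry: dict) -> list[str]:
    """Return human-readable errors for one registry entry:
    missing required fields, or a non-boolean `enabled` value."""
    errors = [
        f"{name}: missing required field '{field}'"
        for field in REQUIRED_FIELDS
        if field not in entry
    ]
    if entry.get("enabled") not in (None, True, False):
        errors.append(f"{name}: 'enabled' must be a boolean")
    return errors
```

An empty list means the entry is structurally valid; checks for the file actually existing on disk would layer on top of this.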
Relative paths from `base_path` follow:

```
{category}/{model-dir}/{filename.gguf}
```

Example: `instruct/qwen3-4b-q8/Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf`
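In code, the convention is a simple three-segment join (a sketch of the convention only; the helper name is hypothetical):

```python
from pathlib import PurePosixPath

def model_path(category: str, model_dir: str, filename: str) -> str:
    """Build the base_path-relative path {category}/{model-dir}/{filename}."""
    return str(PurePosixPath(category) / model_dir / filename)
```

For the example above, `model_path("instruct", "qwen3-4b-q8", "Qwen3-4B-Instruct-2507-UD-Q8_K_XL.gguf")` reproduces the path shown.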
When adding models, Herd auto-detects:
- Model type — OCR, embedding, reranker, reasoning, coding, or instruct (from filename/repo patterns)
- Special flags — `--embedding` for embedding models, `--reranking` for rerankers, `--jinja` for Phi-4, `--chat-template chatml` for OLMo
- Context length — From the HuggingFace `config.json`
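The filename/repo pattern matching could be sketched like this (the patterns below are hypothetical stand-ins for illustration; Herd's actual heuristics may differ):

```python
import re

# Ordered (pattern, type) pairs: first match wins.
# Illustrative patterns only, not Herd's real detection table.
TYPE_PATTERNS = [
    (r"ocr", "ocr"),
    (r"embed", "embedding"),
    (r"rerank", "reranker"),
    (r"reason|think", "reasoning"),
    (r"coder|code", "coding"),
]

def detect_type(repo: str) -> str:
    """Guess a model's category from its repo name; default to instruct."""
    lowered = repo.lower()
    for pattern, model_type in TYPE_PATTERNS:
        if re.search(pattern, lowered):
            return model_type
    return "instruct"
```

The ordering matters: more specific signals (embedding, reranker) are checked before broader ones, and anything unmatched falls back to the instruct category.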
- Python 3.8+
- `pyyaml`, `requests`
- llama-swap (for serving)
- llama.cpp (`llama-server` backend)
- `huggingface-cli` (for downloads)
MIT