Unified LLM server with model-ID-based routing. Designed for Apple Silicon (M1/M2/M3/M4) using MLX and llama.cpp backends.
```
Client Request (port 8000)
        ↓
Routing Service (FastAPI, port 8000)
        ↓  (reads model ID from request body)
Backend Model Servers (MLX / llama.cpp on ports 8501, 8502, ...)
```
- macOS with Apple Silicon — required for MLX backend; llama.cpp works cross-platform
- Python 3.12+
- uv package manager
- For llama.cpp: `llama-server` on PATH (e.g. `brew install llama.cpp`) — required for rerank and Qwen3.5/newer architectures
- Install dependencies:

  ```bash
  git clone <repository-url>
  cd slm_server
  uv sync --extra mlx        # For MLX backend
  uv sync --extra llamacpp   # For llama.cpp backend (Python server fallback)
  ```

- Configure models:

  ```bash
  cp config/models.yaml.example config/models.yaml
  # Edit config/models.yaml with your model paths
  ```

- Start all services:

  ```bash
  ./start.sh
  ```

  Or start individually:

  ```bash
  # Terminal 1: Start backend servers
  uv run python -m slm_server backends

  # Terminal 2: Start routing service
  uv run python -m slm_server router
  ```

For detailed setup, see SETUP.md.
Copy `config/models.yaml.example` to `config/models.yaml` and set your model paths. Each model entry maps a role name to a server instance.
| Field | Required | Default | Description |
|---|---|---|---|
| `id` | yes | — | Model identifier used for routing (must match `model` field in requests) |
| `backend` | yes | — | `mlx` or `llamacpp` |
| `port` | yes | — | Port for this model's backend server (must be unique) |
| `model_path` | yes | — | Local path to model file/directory, or Hugging Face model ID (MLX only for HF IDs) |
| `default_timeout` | yes | — | Request timeout in seconds |
| `quantization` | yes | — | Quantization level (e.g. `8bit`, `Q8_0`, `f16`) — informational for MLX; affects KV cache defaults for llamacpp |
| `model_type` | no | `lm` | `lm`, `multimodal`, `image-generation`, `image-edit`, `embeddings`, `rerank`, or `whisper` |
| `context_length` | no | model default | Maximum context length; omit to use the model's built-in default |
| `max_concurrency` | no | `1` | Maximum concurrent requests (MLX maps this to the server's supported queue/concurrency flag at runtime) |
| `host` | no | `0.0.0.0` | Host the backend server binds to |
| `enabled` | no | `true` | Set to `false` to skip this model on startup |
| `supports_function_calling` | no | `false` | Reported in `/v1/models` response |
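For illustration, a single entry using only these fields might look like the sketch below. The role-name keying is inferred from the health response shown later; confirm the exact layout against `config/models.yaml.example` and treat the path and timeout as placeholders.

```yaml
# Illustrative sketch only; follow config/models.yaml.example for the real layout
standard:                          # role name for this server instance
  id: "qwen/qwen3-4b-2507"         # must match the `model` field in requests
  backend: "mlx"                   # mlx or llamacpp
  port: 8501                       # must be unique across models
  model_path: "/path/to/models/qwen3-4b-2507-mlx"  # local dir, or an HF ID for MLX
  default_timeout: 300             # seconds (placeholder value)
  quantization: "8bit"
  model_type: "lm"
  max_concurrency: 1
  enabled: true
```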
MLX-only fields (passed to `mlx-openai-server` launch):

| Field | Default | Description |
|---|---|---|
| `enable_auto_tool_choice` | `false` | Pass `--enable-auto-tool-choice` to `mlx-openai-server` |
| `tool_call_parser` | `null` | Parser for tool calls. See current `mlx-openai-server --help` for the full parser list supported by your installed version |
| `reasoning_parser` | `null` | Parser for reasoning/thinking tokens. Set to `null` (or omit) to disable thinking mode |
| `config_name` | `flux-schnell` / `flux-kontext-dev` | Config name for `image-generation` or `image-edit` model types |
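For example, an MLX entry that enables automatic tool choice but keeps thinking disabled might add the fields below (illustrative values; valid parser names depend on the `mlx-openai-server` version you have installed):

```yaml
enable_auto_tool_choice: true
tool_call_parser: "qwen3_coder"   # check `mlx-openai-server --help` for supported parsers
reasoning_parser: null            # null (or omitted) disables thinking mode
```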
llama.cpp-only fields (passed to `llama-server` or `llama_cpp.server`):

| Field | Default | Description |
|---|---|---|
| `chat_template_kwargs` | `null` | Dict passed as `--chat-template-kwargs` (e.g. `{enable_thinking: true}` for Qwen3.5) |
| `chat_template_file` | `null` | Optional path passed as `--chat-template-file` to override the GGUF-embedded Jinja template (resolved relative to repo root if not absolute) |
| `temp` | — | Sampling temperature |
| `top_p` | — | Top-p sampling |
| `top_k` | — | Top-k sampling |
| `min_p` | — | Min-p sampling |
| `cache_type_k` | — | KV cache type for K (e.g. `q8_0`, `f16`) |
| `cache_type_v` | — | KV cache type for V (e.g. `q8_0`, `f16`) |
| `flash_attn` | — | Flash attention (`true` / `false`) |
| `kv_unified` | — | Unified KV cache — native `llama-server` only |
| `fit` | — | `--fit` flag — native `llama-server` only |
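As a sketch, a GGUF chat model tuned with a few of these knobs might add the following (all values are illustrative; `kv_unified` and `fit` only apply when the native `llama-server` is used):

```yaml
chat_template_kwargs: {enable_thinking: true}  # forwarded as --chat-template-kwargs
temp: 0.7
top_p: 0.9
top_k: 40
min_p: 0.05
cache_type_k: "q8_0"
cache_type_v: "q8_0"
flash_attn: true
```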
Two formats are accepted:

- Hugging Face model ID (MLX backend only): downloaded automatically on first use

  ```yaml
  model_path: "mlx-community/Qwen3-8B-MLX-8bit"
  ```

- Local path: directory containing a `.gguf` (llamacpp) or model files (MLX), or a direct path to a `.gguf` file

  ```yaml
  model_path: "/path/to/models/Qwen3.5-9B-GGUF"
  ```

For llamacpp with a directory, the server picks the first `.gguf` file found (alphabetically). Hugging Face model IDs are not supported for llamacpp — use a local path.
For `mlx-community/Qwen3.5-9B-8bit` local checkpoints, use:

```yaml
model_type: "multimodal"
tool_call_parser: "qwen3_coder"
reasoning_parser: null  # disables thinking
```

This model's config uses the Qwen 3.5 multimodal architecture, so `model_type: multimodal` is required.
The routing service exposes OpenAI-compatible endpoints on port 8000.
Standard chat completions. The `model` field in the request body selects the backend:

```json
{
  "model": "qwen/qwen3-4b-2507",
  "messages": [{"role": "user", "content": "Hello"}]
}
```

The router also injects `chat_template_kwargs` from config into the request body if set and not already present.
Routing uses only enabled model entries. If the same model ID appears in multiple enabled entries, the router returns 409 so model IDs stay unambiguous.
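Assuming the router is running locally on port 8000, a minimal request looks like this (the model ID must match a configured `id`):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen/qwen3-4b-2507","messages":[{"role":"user","content":"Hello"}]}' | jq
```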
Responses API with automatic fallback. The router first tries /v1/responses on the backend. If the backend returns 404 or 422, it converts the request to /v1/chat/completions format and retries.
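A sketch of a direct call, assuming the standard OpenAI Responses request shape (`input` instead of `messages`) and a local router:

```bash
curl -s http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen/qwen3-4b-2507","input":"Hello"}' | jq
```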
OpenAI-compatible embeddings. Requires a model with model_type: embeddings and backend: llamacpp. The backend is started with --embedding (native llama-server) or --embedding true (Python server).
```json
{
  "model": "Qwen/Qwen3-Embedding-0.6B",
  "input": "Hello, world"
}
```

```bash
curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen3-Embedding-0.6B","input":"test"}' | jq
```

MLX embedding models are also supported: set `backend: mlx` and `model_type: embeddings`.
Reranking. Requires model_type: rerank, backend: llamacpp, and native llama-server on PATH. The backend is started with --embedding --pooling rank --reranking. The Python llama_cpp.server does not support rerank.
Request body follows the llama.cpp server rerank format (query + documents).
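Assuming the router forwards the backend's `/v1/rerank` path and runs locally, a minimal call might look like this (the model ID is a placeholder for a configured rerank model):

```bash
curl -s http://localhost:8000/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{"model":"my-reranker","query":"What is MLX?","documents":["MLX is an array framework for Apple Silicon.","llama.cpp runs GGUF models."]}' | jq
```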
Lists all configured models and their settings (id, backend, port, model_type, context_length, quantization, supports_function_calling).
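To inspect this locally:

```bash
curl -s http://localhost:8000/v1/models | jq
```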
Health status of all configured backends:
```json
{
  "standard": {
    "status": "healthy",
    "model_id": "qwen/qwen3-4b-2507",
    "backend": "mlx",
    "port": 8501
  },
  "reasoning": {
    "status": "unreachable",
    "error": "Connection refused - backend not running"
  }
}
```

Possible statuses: `healthy`, `unreachable`, `timeout`, `unhealthy`, `error`, `disabled`.
Router health check.
MLX backend:

- Install: `uv sync --extra mlx`
- Requires the `mlx-openai-server` command (installed via the extra)
- Accepts Hugging Face model IDs (auto-downloads) or local model directories
- Apple Silicon only

llama.cpp backend:

- Install: `uv sync --extra llamacpp` (installs `llama-cpp-python[server]` as fallback)
- Native `llama-server` (e.g. `brew install llama.cpp`) is used automatically when found on PATH and is required for:
  - `model_type: rerank`
  - Models with newer architectures (Qwen3.5, etc.) not yet supported by the PyPI build
  - `kv_unified` and `fit` flags
- When native `llama-server` is not found, falls back to `python -m llama_cpp.server`
- Requires local `.gguf` files — Hugging Face model IDs are not supported
```bash
curl http://localhost:8000/v1/backends/health | jq
```

- Verify the `id` in `config/models.yaml` matches the `model` field in your request exactly (see the quick check below)
- Check that `enabled` is not set to `false`
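A quick way to list the exact model IDs the router knows about (assuming the response follows the standard OpenAI list shape):

```bash
curl -s http://localhost:8000/v1/models | jq '.data[].id'
```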
```bash
lsof -i :8501
```

Each model must have a unique port. Config validation warns about port conflicts on startup.
- Check `/v1/backends/health` to see which backends are down
- Ensure model paths are correct and files exist
- For llamacpp: verify `llama-server` is on PATH (`which llama-server`)
- `start.sh` now checks process liveness in addition to open ports, so stale listeners on old PIDs no longer produce a false "ready" status
- Check logs for error messages
The PyPI build of `llama-cpp-python` may not support newer model architectures. Install native `llama-server`:

```bash
brew install llama.cpp
```

The server detects it on PATH and uses it automatically.