> 🚀 **Production-ready FastAPI model server** for small language models with OpenAI-compatible API, built-in observability, and enterprise-grade deployment tools.
🚀 A light model server that serves small language models (default: `Qwen3-0.6B-GGUF`) as a **thin wrapper** around `llama-cpp`, exposing the OpenAI-compatible `/chat/completions` API. The core logic is under 100 lines in `./slm_server/app.py`!
> This is still a WIP project; issues and pull requests are welcome. I mainly use this repo to deploy an SLM as part of the backend on my own site [x3huang.dev](https://x3huang.dev/), while trying my best to keep this repo model-agnostic.
## ✨ Features
- 🔌 **OpenAI-compatible API** - Drop-in replacement with `/chat/completions` endpoint and streaming support
- ⚡ **Llama.cpp integration** - High-performance inference optimized for limited CPU and memory resources
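Since the endpoint follows the OpenAI chat-completions convention, any standard HTTP client can talk to it. A minimal stdlib-only sketch, assuming the server is reachable at `http://localhost:8000` and the endpoint path is `/chat/completions` as described above (host, port, and model name are assumptions, not guaranteed defaults):

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumed local host/port


def build_payload(prompt: str, stream: bool = False) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": "Qwen3-0.6B-GGUF",  # default model named in this README
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }


def chat(prompt: str) -> str:
    """Send a non-streaming request and return the assistant's reply."""
    req = Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses put the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Because the request/response shapes match OpenAI's, the official `openai` client should also work by pointing its `base_url` at the server.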