Train · Fine-Tune · Convert · Quantize · Serve · Benchmark AI models on NPU · TPU · GPU · CPU
NPU-STACK is an open-source, full-stack AI toolkit for developing, serving, and deploying machine learning models across hardware accelerators — NPUs, TPUs, GPUs, and CPUs. It ships with an OpenAI-compatible API, making it a self-hosted alternative to LM Studio, Ollama, and OpenAI.
| Feature | Description |
|---|---|
| 🖥️ Model Serving | OpenAI-compatible /v1 API — chat completions, embeddings, streaming SSE. Works with LangChain, Open WebUI, and more |
| 🏋️ Fine-Tuning | LoRA/QLoRA via PEFT. Custom datasets, hyperparameters, real-time metrics |
| 🧪 Playground | Test models interactively — text generation, image classification, object detection, image synthesis |
| 🤗 HuggingFace Hub | Search, browse, one-click download models from HuggingFace |
| 🔄 Convert & Quantize | PyTorch → ONNX → OpenVINO IR. INT8/INT4 quantization with NNCF |
| 📊 Benchmark | Latency (p50/p95/p99), throughput, memory profiling across CPU/GPU/NPU |
| 📁 Dataset Manager | Upload, organize, auto-detect datasets (images, CSV, JSON, Parquet) |
| 🌐 Web Dashboard | Premium React UI with real-time training charts via WebSocket |
| 🦙 GGUF Studio | 5-tab studio for inspecting, quantizing (21 formats), converting, and LoRA merging |
| 🪄 Onboarding Wizard | Interactive 5-step tour guiding new users from import to deployment |
| ☁️ Edge & Cloud | Connect to NVIDIA NIM APIs, compile Vitis AI .xmodels, and manage CVEDIA-RT |
| 🐳 Docker Deploy | Single docker compose up launches the full stack |
| 📷 Webcam Detection | Real-time object detection with bounding box overlays |
| 🔍 Model Scanner | Discover model files on your PC (12+ formats) with interactive folder browser |
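The Benchmark feature above reports tail-latency percentiles (p50/p95/p99). As a quick, self-contained illustration of what those numbers mean (a sketch using synthetic data, not NPU-STACK's internal benchmarking code):

```python
import statistics

# 1000 synthetic latency samples in milliseconds, 5.00 ms to 14.99 ms
latencies_ms = [5 + 0.01 * i for i in range(1000)]

# statistics.quantiles with n=100 returns the 99 percentile cut points;
# index 49 is the median (p50), 94 is p95, 98 is p99
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50={p50:.2f} ms  p95={p95:.2f} ms  p99={p99:.2f} ms")
```

p99 being much larger than p50 is the usual sign of tail-latency outliers, which is why benchmarks report all three rather than a single average.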
NPU-STACK includes a fully OpenAI-compatible API server. Use it as a drop-in replacement for OpenAI in any application.
| Method | Endpoint | Description |
|---|---|---|
| GET | `/v1/models` | List available models |
| POST | `/v1/chat/completions` | Chat completion (streaming + non-streaming) |
| POST | `/v1/completions` | Text completion |
| POST | `/v1/embeddings` | Generate text embeddings |
| POST | `/v1/models/load` | Load a model into memory |
| POST | `/v1/models/unload` | Unload a model from memory |
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any"  # Not required for local use
)

# Streaming chat completion
response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)
for chunk in response:
    # The final chunk's delta.content is None, so fall back to ""
    print(chunk.choices[0].delta.content or "", end="")
```

```bash
# cURL
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Works with: OpenAI Python/JS SDK · LangChain · LlamaIndex · Open WebUI · Chatbot UI · Vercel AI SDK · cURL · Postman
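The `/v1/embeddings` endpoint works the same way. As a standard-library-only sketch (request/response field names follow the OpenAI embeddings spec; `my-embed-model` is a placeholder for whatever embedding model you have loaded):

```python
import json
import urllib.request
import urllib.error

# Build an OpenAI-style embeddings request against the local server
payload = json.dumps({"model": "my-embed-model", "input": ["hello npu"]}).encode()
req = urllib.request.Request(
    "http://localhost:8000/v1/embeddings",
    data=payload,
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.load(resp)
        # Each input string maps to one entry in body["data"]
        print("embedding dimensionality:", len(body["data"][0]["embedding"]))
except (urllib.error.URLError, OSError):
    print("embedding request failed - is the server running?")
```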
Fine-tune any model in the registry with parameter-efficient methods:
```python
import requests

requests.post("http://localhost:8000/api/finetune/start", json={
    "model_id": 1,
    "dataset": "my-dataset",
    "epochs": 3,
    "learning_rate": 2e-4,
    "use_lora": True,
    "lora_r": 16,
    "lora_alpha": 32
})
```

- Background training with real-time step/epoch/loss tracking
- Supports custom uploaded datasets and HuggingFace datasets
- Fine-tuned adapters saved to model registry
Expose NPU-STACK directly to Claude Desktop, Cursor, or any MCP-compatible AI assistant! This gives your assistant the ability to compile models to NPU formats and query your system hardware.
Add the following to your `claude_desktop_config.json` (or equivalent client config):

```json
{
  "mcpServers": {
    "npu-stack": {
      "command": "python",
      "args": [
        "J:\\NPU-STACK\\backend\\mcp_server.py"
      ]
    }
  }
}
```

Note: Update the path to `mcp_server.py` and ensure the `python` command resolves to your local NPU-STACK venv if you aren't installing globally.
```bash
git clone https://github.com/chainchopper/NPU-STACK.git
cd NPU-STACK
setup.bat    # Downloads Python, creates venv, installs everything
run-all.bat  # Launches backend + frontend + API
```

Note: `llama-cpp-python` (GGUF inference) is an optional dependency. If a pre-built wheel is unavailable for your Python version or platform, setup will print an `[INFO]` warning and continue — the core platform works without it. Use `docker compose up --build` for full out-of-the-box GGUF support.
```bash
git clone https://github.com/chainchopper/NPU-STACK.git
cd NPU-STACK
chmod +x *.sh
./setup.sh    # Creates venv, installs dependencies, generates .env
./run-all.sh  # Launches backend + frontend with proper SIGINT handling
```

```bash
docker compose up --build
```

```bash
# Backend
cd backend && pip install -r requirements.txt && python main.py

# Frontend
cd frontend && npm install && npm run dev
```

Access:
- 🌐 Dashboard: http://localhost:5173
- 📡 API Docs: http://localhost:8000/api/docs
- 🤖 OpenAI API: http://localhost:8000/v1
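To confirm the stack is up, you can hit the `/v1/models` listing (the response shape follows the OpenAI-compatible `{"data": [{"id": ...}]}` convention; this sketch uses only the standard library and degrades gracefully if the server isn't running):

```python
import json
import urllib.request
import urllib.error

url = "http://localhost:8000/v1/models"
try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        # OpenAI-style listing: {"data": [{"id": "..."}, ...]}
        model_ids = [m["id"] for m in json.load(resp)["data"]]
        print("available models:", model_ids)
except (urllib.error.URLError, OSError):
    print("server not reachable at", url, "- is the backend running?")
```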
```
├── backend/
│   ├── main.py                     # FastAPI entry (11 routers)
│   ├── database.py                 # SQLAlchemy models
│   ├── routers/
│   │   ├── models.py               # Model registry CRUD
│   │   ├── training.py             # Training job management
│   │   ├── inference.py            # Multi-task inference
│   │   ├── conversion.py           # Format conversion & quantization
│   │   ├── benchmark.py            # Performance benchmarking
│   │   ├── serving.py              # OpenAI-compatible /v1 API
│   │   ├── finetuning.py           # LoRA/QLoRA fine-tuning
│   │   ├── huggingface.py          # HuggingFace Hub search & download
│   │   ├── datasets.py             # Dataset management
│   │   ├── scanner.py              # Local model scanner (12+ formats)
│   │   ├── webcam.py               # WebSocket webcam inference
│   │   └── filebrowser.py          # Interactive file/folder browser
│   └── services/                   # Business logic
│       ├── benchmark_service.py    # 12-capability hardware detection
│       ├── conversion_service.py   # OpenVINO/NNCF/Vitis conversion
│       ├── opencv_service.py       # cv2.dnn inference & preprocessing
│       └── gguf_service.py         # llama.cpp GGUF inference
├── frontend/
│   └── src/
│       ├── App.jsx                 # Router + sidebar (12 pages)
│       ├── components/
│       │   └── FolderBrowser.jsx   # Modal folder picker
│       └── pages/
│           ├── Dashboard.jsx       # Overview + system info
│           ├── Playground.jsx      # Interactive model testing
│           ├── Models.jsx          # Model registry
│           ├── HuggingFaceHub.jsx  # Model discovery
│           ├── Datasets.jsx        # Dataset manager
│           ├── Serving.jsx         # Model serving UI
│           ├── Training.jsx        # Training console
│           ├── FineTuning.jsx      # Fine-tuning config & jobs
│           ├── Conversion.jsx      # Format & quantization studio
│           ├── GGUFStudio.jsx      # llama.cpp GGUF tooling suite
│           ├── Scanner.jsx         # Model file scanner
│           ├── WebcamTest.jsx      # Real-time object detection
│           └── Benchmark.jsx       # Performance lab
├── docs/screenshots/               # App screenshots
├── web/                            # Promotional website
└── docker-compose.yml
```
| Hardware | Backend | Status |
|---|---|---|
| NVIDIA CUDA GPUs | PyTorch CUDA, ONNX Runtime CUDA, TensorRT | ✅ |
| AMD ROCm GPUs | PyTorch HIP, ONNX Runtime ROCm | ✅ |
| AMD Vitis AI / Alveo FPGA | vai_q_onnx, Quark quantizer, xbutil | ✅ |
| Intel NPU (Core Ultra) | OpenVINO NPU plugin | ✅ |
| Google Coral Edge TPU | TFLite Delegate | ✅ |
| Rockchip NPU (RK3588, RV1103) | RKNN Toolkit 2, RKNN Lite 2, rk-llama.cpp | ✅ |
| DirectML (Windows) | ONNX Runtime DML Provider | ✅ |
| OpenCV DNN | cv2.dnn with CPU/OpenCL/CUDA targets | ✅ |
| CPU (x86/ARM) | ONNX Runtime, OpenVINO CPU | ✅ |
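Several rows in the table above are ONNX Runtime execution providers (CUDA, ROCm, DirectML, CPU). As a hedged sketch of how a backend might pick one at runtime — the preference order is illustrative, and the block falls back to CPU if `onnxruntime` isn't installed:

```python
# Sketch: choose an ONNX Runtime execution provider by availability.
try:
    import onnxruntime as ort
    available = set(ort.get_available_providers())
except ImportError:
    # onnxruntime not installed: only a CPU fallback is meaningful
    available = {"CPUExecutionProvider"}

# Illustrative preference order: CUDA, then DirectML, then OpenVINO, then CPU
preference = [
    "CUDAExecutionProvider",
    "DmlExecutionProvider",
    "OpenVINOExecutionProvider",
    "CPUExecutionProvider",
]
providers = [p for p in preference if p in available]
print("will try providers in order:", providers)
```

The resulting list can be passed as the `providers` argument to `onnxruntime.InferenceSession`, which tries each provider in order.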
Edit `.env` in the project root:

| Variable | Default | Description |
|---|---|---|
| `NPU_STACK_API_KEY` | — | Optional API key for `/v1` endpoints |
| `HUGGINGFACE_TOKEN` | — | HuggingFace token for private models |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8000` | Server port |
| `MODEL_STORAGE` | `backend/data/models` | Model storage path |
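Putting the variables above together, a minimal `.env` might look like this (values are illustrative — substitute your own key and token):

```env
# Optional: require this key on /v1 requests
NPU_STACK_API_KEY=changeme
# Needed only for gated/private HuggingFace models
HUGGINGFACE_TOKEN=hf_xxxxxxxx
HOST=0.0.0.0
PORT=8000
MODEL_STORAGE=backend/data/models
```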
We welcome contributions! All PRs should target the `dev` branch.

```bash
git clone https://github.com/chainchopper/NPU-STACK.git
cd NPU-STACK && git checkout dev
# make your changes, then push and open a PR
```

MIT License — see LICENSE for details.
Made by Fanalogy · Powered by Nirvana
⭐ Star this repo to support the project!