Train · Fine-Tune · Convert · Quantize · Serve · Benchmark AI models on NPU · TPU · GPU · CPU
NPU-STACK is an open-source, full-stack AI toolkit for developing, serving, and deploying machine learning models across hardware accelerators — NPUs, TPUs, GPUs, and CPUs. It ships with an OpenAI-compatible API, making it a self-hosted alternative to LM Studio, Ollama, and OpenAI.
| Feature | Description |
|---|---|
| 🖥️ Model Serving | OpenAI-compatible /v1 API — chat completions, embeddings, streaming SSE. Works with LangChain, Open WebUI, and more |
| 🏋️ Fine-Tuning | LoRA/QLoRA via PEFT. Custom datasets, hyperparameters, real-time metrics |
| 🧪 Playground | Test models interactively — text generation, image classification, object detection, image synthesis |
| 🤗 HuggingFace Hub | Search, browse, one-click download models from HuggingFace |
| 🔄 Convert & Quantize | PyTorch → ONNX → OpenVINO IR. INT8/INT4 quantization with NNCF |
| 📊 Benchmark | Latency (p50/p95/p99), throughput, memory profiling across CPU/GPU/NPU |
| 📁 Dataset Manager | Upload, organize, auto-detect datasets (images, CSV, JSON, Parquet) |
| 🌐 Web Dashboard | Premium React UI with real-time training charts via WebSocket |
| 🦙 GGUF Studio | 5-tab studio for inspecting, quantizing (21 formats), converting, and LoRA merging |
| 🪄 Onboarding Wizard | Interactive 5-step tour guiding new users from import to deployment |
| ☁️ Edge & Cloud | Connect to NVIDIA NIM APIs, compile Vitis AI .xmodels, and manage CVEDIA-RT |
| 🐳 Docker Deploy | Single docker compose up launches the full stack |
| 📷 Webcam Detection | Real-time object detection with bounding box overlays |
| 🔍 Model Scanner | Discover model files on your PC (12+ formats) with interactive folder browser |
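The Benchmark feature above reports tail-latency percentiles (p50/p95/p99). As a quick, self-contained illustration of what those numbers mean (a sketch using synthetic data, not NPU-STACK's internal benchmarking code):

```python
import statistics

# 1000 synthetic latency samples in milliseconds, 5.00 ms to 14.99 ms
latencies_ms = [5 + 0.01 * i for i in range(1000)]

# statistics.quantiles with n=100 returns the 99 percentile cut points;
# index 49 is the median (p50), 94 is p95, 98 is p99
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50={p50:.2f} ms  p95={p95:.2f} ms  p99={p99:.2f} ms")
```

p99 being much larger than p50 is the usual sign of tail-latency outliers, which is why benchmarks report all three rather than a single average.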
NPU-STACK includes a fully OpenAI-compatible API server. Use it as a drop-in replacement for OpenAI in any application.
| Method | Endpoint | Description |
|---|---|---|
| GET | `/v1/models` | List available models |
| POST | `/v1/chat/completions` | Chat completion (streaming + non-streaming) |
| POST | `/v1/completions` | Text completion |
| POST | `/v1/embeddings` | Generate text embeddings |
| POST | `/v1/models/load` | Load a model into memory |
| POST | `/v1/models/unload` | Unload a model from memory |
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any"  # Not required for local use
)

# Streaming chat completion
response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)
for chunk in response:
    # The final chunk's delta.content is None, so fall back to ""
    print(chunk.choices[0].delta.content or "", end="")
```

```bash
# cURL
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Works with: OpenAI Python/JS SDK · LangChain · LlamaIndex · Open WebUI · Chatbot UI · Vercel AI SDK · cURL · Postman
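The `/v1/embeddings` endpoint works the same way. As a standard-library-only sketch (request/response field names follow the OpenAI embeddings spec; `my-embed-model` is a placeholder for whatever embedding model you have loaded):

```python
import json
import urllib.request
import urllib.error

# Build an OpenAI-style embeddings request against the local server
payload = json.dumps({"model": "my-embed-model", "input": ["hello npu"]}).encode()
req = urllib.request.Request(
    "http://localhost:8000/v1/embeddings",
    data=payload,
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.load(resp)
        # Each input string maps to one entry in body["data"]
        print("embedding dimensionality:", len(body["data"][0]["embedding"]))
except (urllib.error.URLError, OSError):
    print("embedding request failed - is the server running?")
```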
Fine-tune any model in the registry with parameter-efficient methods:
```python
import requests

requests.post("http://localhost:8000/api/finetune/start", json={
    "model_id": 1,
    "dataset": "my-dataset",
    "epochs": 3,
    "learning_rate": 2e-4,
    "use_lora": True,
    "lora_r": 16,
    "lora_alpha": 32
})
```

- Background training with real-time step/epoch/loss tracking
- Supports custom uploaded datasets and HuggingFace datasets
- Fine-tuned adapters saved to model registry
Expose NPU-STACK directly to Claude Desktop, Cursor, or any MCP-compatible AI assistant! This gives your assistant the ability to compile models to NPU formats and query your system hardware.
Add the following to your `claude_desktop_config.json` (or equivalent client config):

```json
{
  "mcpServers": {
    "npu-stack": {
      "command": "python",
      "args": [
        "J:\\NPU-STACK\\backend\\mcp_server.py"
      ]
    }
  }
}
```

Note: Update the path to `mcp_server.py` and ensure the `python` command resolves to your local NPU-STACK venv if you aren't installing globally.
```bash
git clone https://github.com/chainchopper/NPU-STACK.git
cd NPU-STACK
setup.bat    # Downloads Python, creates venv, installs everything
run-all.bat  # Launches backend + frontend + API
```

Note: `llama-cpp-python` (GGUF inference) is an optional dependency. If a pre-built wheel is unavailable for your Python version or platform, setup will print an `[INFO]` warning and continue — the core platform works without it. Use `docker compose up --build` for full out-of-the-box GGUF support.
```bash
git clone https://github.com/chainchopper/NPU-STACK.git
cd NPU-STACK
chmod +x *.sh
./setup.sh    # Creates venv, installs dependencies, generates .env
./run-all.sh  # Launches backend + frontend with proper SIGINT handling
```

```bash
docker compose up --build
```

```bash
# Backend
cd backend && pip install -r requirements.txt && python main.py

# Frontend
cd frontend && npm install && npm run dev
```

Access:
- 🌐 Dashboard: http://localhost:5173
- 📡 API Docs: http://localhost:8000/api/docs
- 🤖 OpenAI API: http://localhost:8000/v1
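To confirm the stack is up, you can hit the `/v1/models` listing (the response shape follows the OpenAI-compatible `{"data": [{"id": ...}]}` convention; this sketch uses only the standard library and degrades gracefully if the server isn't running):

```python
import json
import urllib.request
import urllib.error

url = "http://localhost:8000/v1/models"
try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        # OpenAI-style listing: {"data": [{"id": "..."}, ...]}
        model_ids = [m["id"] for m in json.load(resp)["data"]]
        print("available models:", model_ids)
except (urllib.error.URLError, OSError):
    print("server not reachable at", url, "- is the backend running?")
```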
```
├── backend/
│   ├── main.py                     # FastAPI entry (11 routers)
│   ├── database.py                 # SQLAlchemy models
│   ├── routers/
│   │   ├── models.py               # Model registry CRUD
│   │   ├── training.py             # Training job management
│   │   ├── inference.py            # Multi-task inference
│   │   ├── conversion.py           # Format conversion & quantization
│   │   ├── benchmark.py            # Performance benchmarking
│   │   ├── serving.py              # OpenAI-compatible /v1 API
│   │   ├── finetuning.py           # LoRA/QLoRA fine-tuning
│   │   ├── huggingface.py          # HuggingFace Hub search & download
│   │   ├── datasets.py             # Dataset management
│   │   ├── scanner.py              # Local model scanner (12+ formats)
│   │   ├── webcam.py               # WebSocket webcam inference
│   │   └── filebrowser.py          # Interactive file/folder browser
│   └── services/                   # Business logic
│       ├── benchmark_service.py    # 12-capability hardware detection
│       ├── conversion_service.py   # OpenVINO/NNCF/Vitis conversion
│       ├── opencv_service.py       # cv2.dnn inference & preprocessing
│       └── gguf_service.py         # llama.cpp GGUF inference
├── frontend/
│   └── src/
│       ├── App.jsx                 # Router + sidebar (12 pages)
│       ├── components/
│       │   └── FolderBrowser.jsx   # Modal folder picker
│       └── pages/
│           ├── Dashboard.jsx       # Overview + system info
│           ├── Playground.jsx      # Interactive model testing
│           ├── Models.jsx          # Model registry
│           ├── HuggingFaceHub.jsx  # Model discovery
│           ├── Datasets.jsx        # Dataset manager
│           ├── Serving.jsx         # Model serving UI
│           ├── Training.jsx        # Training console
│           ├── FineTuning.jsx      # Fine-tuning config & jobs
│           ├── Conversion.jsx      # Format & quantization studio
│           ├── GGUFStudio.jsx      # llama.cpp GGUF tooling suite
│           ├── Scanner.jsx         # Model file scanner
│           ├── WebcamTest.jsx      # Real-time object detection
│           └── Benchmark.jsx       # Performance lab
├── docs/screenshots/               # App screenshots
├── web/                            # Promotional website
└── docker-compose.yml
```
| Hardware | Backend | Status |
|---|---|---|
| NVIDIA CUDA GPUs | PyTorch CUDA, ONNX Runtime CUDA, TensorRT | ✅ |
| AMD ROCm GPUs | PyTorch HIP, ONNX Runtime ROCm | ✅ |
| AMD Vitis AI / Alveo FPGA | vai_q_onnx, Quark quantizer, xbutil | ✅ |
| Intel NPU (Core Ultra) | OpenVINO NPU plugin | ✅ |
| Google Coral Edge TPU | TFLite Delegate | ✅ |
| Rockchip NPU (RK3588, RV1103) | RKNN Toolkit 2, RKNN Lite 2, rk-llama.cpp | ✅ |
| DirectML (Windows) | ONNX Runtime DML Provider | ✅ |
| OpenCV DNN | cv2.dnn with CPU/OpenCL/CUDA targets | ✅ |
| CPU (x86/ARM) | ONNX Runtime, OpenVINO CPU | ✅ |
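Several rows in the table above are ONNX Runtime execution providers (CUDA, ROCm, DirectML, CPU). As a hedged sketch of how a backend might pick one at runtime — the preference order is illustrative, and the block falls back to CPU if `onnxruntime` isn't installed:

```python
# Sketch: choose an ONNX Runtime execution provider by availability.
try:
    import onnxruntime as ort
    available = set(ort.get_available_providers())
except ImportError:
    # onnxruntime not installed: only a CPU fallback is meaningful
    available = {"CPUExecutionProvider"}

# Illustrative preference order: CUDA, then DirectML, then OpenVINO, then CPU
preference = [
    "CUDAExecutionProvider",
    "DmlExecutionProvider",
    "OpenVINOExecutionProvider",
    "CPUExecutionProvider",
]
providers = [p for p in preference if p in available]
print("will try providers in order:", providers)
```

The resulting list can be passed as the `providers` argument to `onnxruntime.InferenceSession`, which tries each provider in order.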
Edit `.env` in the project root:

| Variable | Default | Description |
|---|---|---|
| `NPU_STACK_API_KEY` | — | Optional API key for `/v1` endpoints |
| `HUGGINGFACE_TOKEN` | — | HuggingFace token for private models |
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8000` | Server port |
| `MODEL_STORAGE` | `backend/data/models` | Model storage path |
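Putting the variables above together, a minimal `.env` might look like this (values are illustrative — substitute your own key and token):

```env
# Optional: require this key on /v1 requests
NPU_STACK_API_KEY=changeme
# Needed only for gated/private HuggingFace models
HUGGINGFACE_TOKEN=hf_xxxxxxxx
HOST=0.0.0.0
PORT=8000
MODEL_STORAGE=backend/data/models
```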
We welcome contributions! All PRs should target the `dev` branch.

```bash
git clone https://github.com/chainchopper/NPU-STACK.git
cd NPU-STACK && git checkout dev
# make your changes, then push and open a PR
```

MIT License — see LICENSE for details.
Made by Fanalogy · Powered by Nirvana
⭐ Star this repo to support the project!