wLLM — The Windows Native Inference Engine

The Vision

wLLM is a 100% ground-up, high-performance inference engine specifically architected for the Windows ecosystem. Built in pure Python and PyTorch, it delivers server-grade continuous batching and KV-paging capabilities to consumer hardware without the Linux-dependency overhead of vLLM.

Important

This is NOT a fork. wLLM is an independent implementation designed to bridge the gap between high-level research (HuggingFace) and low-level system performance on Windows.

Performance Matrix: Hardware-Agnostic Adaptation

wLLM uses a dynamic profiling engine to mathematically tune its performance based on your exact hardware signature. No static profiles, just raw math.

Hardware Class	VRAM Headroom	Max Batch Size	Context Window	Optimizations
Mobile / Entry-Level	< 12 GB	1 – 4	2,048	4-bit NF4 + SDPA
Prosumer Desktop	12 – 24 GB	5 – 12	4,096	Mixed Precision + SDPA
Enthusiast (RTX 4090)	24 GB+	12 – 32	8,192+	FlashAttention-2 + AWQ
Profiles can also be customised using commands.	∞	∞	∞	Custom

Architectural Flow: Continuous Batching Engine

Unlike standard sequential servers, wLLM treats the GPU as a shared resource that accepts new requests into the batch at every token iteration.

graph TD
    Client([API Client]) -->|v1/chat/completions| API[FastAPI Server]
    API -->|Submit Request| S[Scheduler Queue]
    
    subgraph Engine [The Inference Core]
        S -->|Admit| Loop[Inference Loop Thread]
        Loop -->|Check Budget| KV[KV Cache Manager]
        KV -->|Allocate Blocks| B[Paged Memory Blocks]
        Loop -->|Process Iteration| E[Batched Generation Step]
        E -->|Generate| Sampler[Logit Sampler]
        Sampler -->|Token Output| Loop
    end
    
    Loop -->|Stream| API
    API -->|SSE Stream| Client

Why Developers Choose wLLM

Zero-Day Architecture Support: Any AutoModelForCausalLM on HuggingFace works instantly. No waiting for community GGUF conversions.
Infinitely Extensible: 100% Python. Modify scheduling logic, integrate custom logit processors, or add backends without touching a C++ compiler.
Optimized Speculative Decoding: Integrated draft models accelerate inference by predicting multiple tokens per target forward pass.
Multi-Backend Choice: Native support for PyTorch, ONNX Runtime (via Optimum), and DirectML.

Rapid Deployment

Modern Installation (Recommended)

The install.ps1 script (or install.bat wrapper) handles everything: Python 3.12 bootstrapping, virtual environment creation, CUDA-specific PyTorch wheels, and PATH configuration. No pre-installed Python required.

# Run the installer directly
powershell -ExecutionPolicy Bypass -File .\install.ps1

Manual Configuration

Environment (uv - Recommended):

uv venv .venv --python 3.12
.venv\Scripts\activate
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
uv pip install -e . --extra-index-url https://download.pytorch.org/whl/cu124

Environment (standard pip):

python -m venv .venv
.venv\Scripts\activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install -e . --extra-index-url https://download.pytorch.org/whl/cu124

Interface Reference

Interactive Chat:

winllm chat --model "microsoft/Phi-3-mini-4k-instruct" --quantization 4bit

API Server Deployment:

winllm serve --model "microsoft/Phi-3-mini-4k-instruct" --quantization 4bit --port 8000

Technical Deep Dives

Dive into the engineering principles behind the engine:

Architecture Details — Visual guide to internal workflows.
Genesys Deep Dive — From-first-principles inference guide.
Walkthrough — End-to-end user guide.
Changelog — Evolution of the engine.

License

MIT — wLLM is, and will always be, free and open source.

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
.github		.github
.obsidian		.obsidian
documentation		documentation
tests		tests
tmp		tmp
winllm		winllm
.gitignore		.gitignore
README.md		README.md
check_imports.py		check_imports.py
compile_onnx.py		compile_onnx.py
diagnose_gpu.py		diagnose_gpu.py
fix_torch.py		fix_torch.py
install.bat		install.bat
install.ps1		install.ps1
pyproject.toml		pyproject.toml
req_torch.txt		req_torch.txt
setup_and_test.bat		setup_and_test.bat
setup_and_test.ps1		setup_and_test.ps1
test_detect.bat		test_detect.bat

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

wLLM — The Windows Native Inference Engine

The Vision

Performance Matrix: Hardware-Agnostic Adaptation

Architectural Flow: Continuous Batching Engine

Why Developers Choose wLLM

Rapid Deployment

Modern Installation (Recommended)

Manual Configuration

Interface Reference

Technical Deep Dives

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

wLLM — The Windows Native Inference Engine

The Vision

Performance Matrix: Hardware-Agnostic Adaptation

Architectural Flow: Continuous Batching Engine

Why Developers Choose wLLM

Rapid Deployment

Modern Installation (Recommended)

Manual Configuration

Interface Reference

Technical Deep Dives

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages