| title | Architecture |
|---|---|
| category | WinLLM |
| tags | |
| status | Active |
| created | 2026-04-01 |
This document provides a comprehensive, visual guide to the software architecture, design patterns, and internal workflows of the WinLLM inference engine. WinLLM is built from the ground up in pure Python, designed for Windows, and inspired by vLLM's memory management principles (though it is not a fork of vLLM).
At its core, WinLLM is divided into three main layers: the API Layer (FastAPI), the Core Engine (Request Scheduling & Memory Management), and the Inference Layer (Multi-Backend Generation Loop).
```mermaid
flowchart TB
%% Definitions
Client([API Client / HTTP])
subgraph APILayer ["API Layer Async Thread"]
Server["api_server.py (FastAPI App)"]
Router["Endpoints (/v1/chat/completions)"]
Lifespan["Lifespan Context Manager"]
end
subgraph CoreEngine ["Core Management Thread (Async)"]
Schedule["scheduler.py (Request Queue)"]
KVManager["kv_cache.py (Dynamic VRAM)"]
Config["config.py (Unified Defaults)"]
end
subgraph InferenceLayer ["Inference Loop Thread (Background)"]
Loop["Inference Loop (Continuous Batching)"]
SpecEngine["speculative.py (Speculative Decoding)"]
Engine["engine.py (Batched Generation)"]
Backend["backend.py (PyTorch / ONNX / DirectML)"]
Loader["model_loader.py (Draft Support)"]
Sample["sampler.py (Logits to Tokens)"]
end
subgraph AutoConfig ["Hardware & Model Auto-Tuning"]
Hardware["device.py (Hardware Detection)"]
Registry["registry.py (Model Profile)"]
end
%% Relationships
Client <-->|REST / SSE Streams| Router
Server --> Lifespan
Lifespan -.->|Trigger load/unload| Loader
Router -->|Creates GenerationRequest| Schedule
Schedule <-->|Checks admission| KVManager
Schedule -->|Centralized Loop| Loop
Loop -->|Calls generate_step| Engine
Loop <-->|Verification Loop| SpecEngine
Engine <-->|Computes next token| Sample
Engine -.->|Claims/Frees Blocks| KVManager
Engine -.->|Updates Request State| Schedule
Loader -->|Delegates to| Backend
Backend -->|Returns Model + Tokenizer| Loader
Hardware -->|Builds Defaults| Config
Config -->|Applies overrides| Registry
```
When a user submits a prompt, the request travels through the following pipeline:
```mermaid
sequenceDiagram
participant C as Client
participant A as API Server (api_server.py)
participant S as Scheduler (scheduler.py)
participant L as Inference Loop (Background Thread)
participant E as Engine (engine.py)
participant K as KV Manager (kv_cache.py)
C->>A: POST /v1/chat/completions (prompt)
A->>S: submit(GenerationRequest)
Note over S: Add to _waiting queue
loop Every 100ms or on New Request
S->>K: can_allocate(new_req)?
alt Enough VRAM
S->>K: allocate_sequence()
S->>L: Admit to _active_reqs
end
Note over L: ITERATION STEP
L->>E: generate_step(batch)
Note over E: 1. Prefill for new reqs
Note over E: 2. Decode for existing reqs
E-->>L: batch (updated tokens)
alt Speculative Enabled
L->>L: Speculative Verification Loop
end
Note over L: Update Request Status & Stream
L-->>A: [Callback] Stream latest tokens
A-->>C: SSE chunks
alt Finished / Cancelled
L->>K: free_sequence(req_id)
L->>S: Move to _completed
end
end
```
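To make this lifecycle concrete, here is a minimal client-side sketch that drives the pipeline end to end through the OpenAI-compatible endpoint. The host/port, model name, and the exact chunk schema (standard OpenAI SSE deltas) are assumptions, not guarantees of this codebase:

```python
import json
import requests

BASE_URL = "http://localhost:8000"  # hypothetical local deployment

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain continuous batching in one sentence."}],
    "max_tokens": 128,
    "stream": True,  # ask the server for SSE chunks
}

# Each SSE line carries a JSON chunk with a token delta, terminated by [DONE].
with requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```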
WinLLM derives its memory budget mathematically at runtime rather than relying on hardcoded rules.
```mermaid
flowchart LR
A["device.py - Hardware Detection"] --> B["Aggregate VRAM and GPU Count"]
B --> C{Total VRAM?}
C -- "Under 16 GB" --> D["Quantization = 4bit"]
C -- "16 GB or more" --> E["Quantization = none"]
B --> F["Max Batch Size Calculation"]
B --> G["Context Length Tiering"]
B --> H{Compute Capability?}
H -- "8.0 or higher" --> I["Backend = flash_attention_2"]
H -- "Lower than 8.0" --> J["Backend = sdpa"]
D & E & F & G & I & J --> K((HardwareDefaults))
K --> L["Process Environment Overrides"]
L --> M["Apply to Model Config"]
```
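The decision tree above can be re-derived in a few lines. The sketch below is illustrative only: the 16 GB quantization threshold and the compute-capability 8.0 split come from the chart, while the batch-size and context-length tiers are placeholder formulas, not the actual ones in `device.py`.

```python
from dataclasses import dataclass

import torch


@dataclass
class HardwareDefaults:
    quantization: str
    attention_backend: str
    max_batch_size: int
    max_model_len: int


def detect_defaults() -> HardwareDefaults:
    """Illustrative re-derivation of the flowchart above."""
    if not torch.cuda.is_available():
        raise RuntimeError("no CUDA device found")

    total_vram_gb = sum(
        torch.cuda.get_device_properties(i).total_memory
        for i in range(torch.cuda.device_count())
    ) / 1e9

    # VRAM tier decides quantization (under 16 GB -> 4-bit).
    quantization = "4bit" if total_vram_gb < 16 else "none"

    # Compute capability decides the attention backend.
    major, _minor = torch.cuda.get_device_capability(0)
    attention_backend = "flash_attention_2" if major >= 8 else "sdpa"

    # Batch size and context length scale with VRAM (placeholder tiers).
    max_batch_size = max(1, int(total_vram_gb // 8))
    max_model_len = 32768 if total_vram_gb >= 24 else 8192

    return HardwareDefaults(quantization, attention_backend, max_batch_size, max_model_len)
```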
Because pure PyTorch doesn't natively support memory paging the way vLLM does, `kv_cache.py` simulates block-level allocation. When the scheduler receives a request, the `KVCacheManager` does the following (see the sketch after this list):
- Uses actual model parameters (`num_layers`, `num_kv_heads`, `head_dim`) to compute precise token byte costs.
- Checks remaining available system VRAM via `_get_total_available_vram()`.
- Pre-allocates a percentage (default 90%) into logical blocks of 16 tokens.
- Tells the scheduler if there is enough block space to fit the incoming prompt + generation.
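For a back-of-the-envelope feel for that arithmetic, the sketch below assumes fp16 KV entries, the 16-token blocks, and the 90% fraction mentioned above; the function names are illustrative and not the real `kv_cache.py` API.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # Each token stores one key and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def plan_blocks(available_vram_bytes: int, num_layers: int, num_kv_heads: int,
                head_dim: int, block_tokens: int = 16, fraction: float = 0.90) -> tuple[int, int]:
    """Return (total_blocks, bytes_per_block) for the usable slice of VRAM."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim)
    bytes_per_block = per_token * block_tokens
    usable = int(available_vram_bytes * fraction)
    return usable // bytes_per_block, bytes_per_block


# Example: a Llama-3.1-8B-like shape (32 layers, 8 KV heads, head_dim 128, fp16)
# on 8 GB of free VRAM yields roughly 3,700 blocks, i.e. about 59k tokens of KV capacity.
blocks, block_bytes = plan_blocks(8 * 1024**3, num_layers=32, num_kv_heads=8, head_dim=128)
print(blocks, block_bytes)
```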
This diagram illustrates how core classes interact, and how data structures (like configs and requests) are passed throughout the system.
```mermaid
classDiagram
%% Core Data Structures
class ModelConfig {
+str model_name_or_path
+str draft_model_name_or_path
+str inference_backend
+QuantizationType quantization
+apply_hardware_defaults(defaults)
}
class SamplingParams {
+int max_tokens
+float temperature
+float top_p
+int top_k
+float repetition_penalty
}
class GenerationRequest {
+str request_id
+list[int] output_token_ids
+RequestStatus status
+tuple _past_key_values
+int _prefix_cache_token_len
+int _stream_text_cursor
+tuple _draft_past_key_values
}
class HardwareDefaults {
+int max_batch_size
+int max_model_len
+str attention_backend
+float kv_cache_fraction
}
%% Core Managers
class InferenceEngine {
+ModelConfig model_config
+KVCacheManager kv_cache_manager
+SpeculativeEngine speculative_engine
+generate_step(requests)
}
class Scheduler {
+deque _waiting
+list _active_reqs
+Thread _loop_thread
+submit(request)
-_run_inference_loop()
-_evict_completed()
}
class KVCacheManager {
+allocate_sequence(seq_id, tokens)
+extend_sequence(seq_id, tokens)
+free_sequence(seq_id)
}
class ModelLoader {
+ModelConfig config
+load() Model, Tokenizer
+get_kv_cache_params() dict
-_resolve_device_map()
}
class SpeculativeEngine {
+PreTrainedModel target_model
+PreTrainedModel draft_model
+step(request)
}
%% Relationships and Data Flow
ModelConfig <-- InferenceEngine : Contains
ModelConfig <-- ModelLoader : Uses
HardwareDefaults ..> ModelConfig : Applies overrides
SamplingParams <-- GenerationRequest : Contains
GenerationRequest <-- Scheduler : Batches
GenerationRequest <-- InferenceEngine : Processes & Modifies
GenerationRequest <-- SpeculativeEngine : Modifies
InferenceEngine *-- KVCacheManager : Initializes & Calls
InferenceEngine *-- ModelLoader : Initializes & Calls
InferenceEngine *-- SpeculativeEngine : Initializes
Scheduler o-- InferenceEngine : Calls generate_step()
class BackendFactory {
+load(model_config) Model, Tokenizer
-_load_pytorch()
-_load_onnxruntime()
-_load_directml()
-_load_tokenizer()
}
ModelLoader --> BackendFactory : Delegates loading
%% API entry point
class APIServer {
+chat_completions(req)
+completions(req)
-_stream_response()
}
APIServer --> GenerationRequest : Creates
APIServer --> Scheduler : Submits via submit()
```
api_server.py | The Gateway
- Emulates the standard OpenAI REST API.
- Implements FastAPI's modern `@asynccontextmanager` lifespan hook. The model is loaded onto the GPU during startup and gracefully unloaded during shutdown (Ctrl+C).
- Handles streaming by acting as an asynchronous bridge to the synchronous PyTorch loops, using `asyncio.Queue` and `loop.call_soon_threadsafe()` (see the sketch below).
- Catches GPU timeouts and injects JSON-formatted error chunks securely into the SSE stream.
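A minimal sketch of that async/sync bridge, assuming a hypothetical `submit_to_scheduler` hook that registers a per-token callback with the background inference thread; the real server wires this through the scheduler and wraps each token in an OpenAI-style SSE chunk:

```python
import asyncio


async def stream_tokens(submit_to_scheduler):
    """Bridge the synchronous inference thread into an async SSE stream."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    # Called from the synchronous inference thread for every new token.
    def on_token(text):
        loop.call_soon_threadsafe(queue.put_nowait, text)

    submit_to_scheduler(on_token)  # background thread starts pushing tokens

    while True:
        token = await queue.get()
        if token is None:           # sentinel: generation finished
            break
        yield f"data: {token}\n\n"  # consumed by FastAPI's StreamingResponse
```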
scheduler.py | The Task Orchestrator
- Continuous Batching: No longer uses a semaphore for simple concurrency. Instead, it maintains a background `_loop_thread` that constantly attempts to admit new requests into the active batch based on KV cache availability (see the sketch below).
- Async Interface: Provides `submit()` and `submit_streaming()` as async interfaces, while the actual heavy lifting happens in the background thread via the `InferenceLoop`.
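A toy version of that admission loop is sketched below. The `_waiting` / `_active_reqs` / `_loop_thread` names mirror the description above and `can_allocate()` appears in the sequence diagram, but the request fields (`request_id`, `prompt_token_ids`, `is_finished()`) are simplifying assumptions:

```python
import threading
import time
from collections import deque


class MiniScheduler:
    """Toy admission loop; the real scheduler is considerably more involved."""

    def __init__(self, kv_manager, engine, max_batch_size: int = 8):
        self._waiting = deque()
        self._active_reqs = []
        self._kv = kv_manager
        self._engine = engine
        self._max_batch_size = max_batch_size
        self._loop_thread = threading.Thread(target=self._run_inference_loop, daemon=True)
        self._loop_thread.start()

    def submit(self, request) -> None:
        self._waiting.append(request)

    def _run_inference_loop(self) -> None:
        while True:
            # Admit waiting requests while the KV cache has room for them.
            while self._waiting and len(self._active_reqs) < self._max_batch_size:
                req = self._waiting[0]
                if not self._kv.can_allocate(req):
                    break
                self._kv.allocate_sequence(req.request_id, req.prompt_token_ids)
                self._active_reqs.append(self._waiting.popleft())

            if self._active_reqs:
                # One prefill/decode iteration across the whole batch.
                self._engine.generate_step(self._active_reqs)
                for req in [r for r in self._active_reqs if r.is_finished()]:
                    self._kv.free_sequence(req.request_id)
                    self._active_reqs.remove(req)
            else:
                time.sleep(0.1)  # idle: poll every 100 ms, matching the sequence diagram
```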
engine.py | The Batched Inference Engine
- `generate_step()`: The primary entry point for inference. It takes a list of requests and performs one iteration of prefill or decode for all of them.
- Decomposed internals: The main generation pipeline is broken into focused helpers: `_validate_prompt()`, `_allocate_kv_cache()`, `_run_decode_loop()`, and `_finalize_generation()`, making the code easy to follow step by step.
- `_prefill_single_request()` / `_decode_single_request()`: Extracted from `generate_step()` for clarity.
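The shape of one iteration can be sketched as follows. For readability this processes requests one at a time, whereas the real engine batches the prefill and decode passes; `_past_key_values` and `output_token_ids` follow the class diagram, while `prompt_token_ids` and the rest are simplifications:

```python
import torch


def generate_step(model, requests, sampler) -> None:
    """One scheduler iteration: prefill new requests, decode existing ones."""
    for req in requests:
        if req._past_key_values is None:
            # Prefill: run the full prompt once and cache the attention KV states.
            input_ids = torch.tensor([req.prompt_token_ids])
        else:
            # Decode: feed only the most recent token, reusing the cached KV states.
            input_ids = torch.tensor([[req.output_token_ids[-1]]])

        with torch.no_grad():
            out = model(input_ids=input_ids, past_key_values=req._past_key_values, use_cache=True)

        req._past_key_values = out.past_key_values
        next_token = sampler(out.logits[:, -1, :])  # logits -> token id (see sampler.py)
        req.output_token_ids.append(int(next_token))
```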
backend.py | Multi-Backend Model Loading
- `BackendFactory`: Factory class that abstracts model loading across three inference backends:
  - PyTorch (default): Standard HuggingFace `AutoModelForCausalLM` with quantization and multi-GPU support.
  - ONNX Runtime: Uses Optimum's `ORTModelForCausalLM` for Windows-native acceleration without Triton/MSVC. Includes smart handling of pre-exported ONNX repositories (e.g., LiquidAI models with `subfolder` and `file_name` routing).
  - DirectML: Uses `torch-directml` for cross-vendor GPU acceleration via DX12.
- Tokenizer Fallback: Built-in workaround for the Optimum `TokenizersBackend` bug on ONNX-exported models; automatically falls back to the base model's tokenizer.
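A condensed sketch of the dispatch logic, assuming the vanilla HuggingFace / Optimum / torch-directml entry points; the real `BackendFactory` additionally handles quantization, device maps, subfolder routing, and the tokenizer fallback described above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(model_id: str, backend: str = "pytorch"):
    """Simplified backend dispatch in the spirit of BackendFactory."""
    if backend == "pytorch":
        model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    elif backend == "onnxruntime":
        # Optimum's ONNX Runtime wrapper; exports the model if no ONNX files exist.
        from optimum.onnxruntime import ORTModelForCausalLM
        model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
    elif backend == "directml":
        import torch_directml
        model = AutoModelForCausalLM.from_pretrained(model_id).to(torch_directml.device())
    else:
        raise ValueError(f"unknown backend: {backend}")

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```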
speculative.py | Accelerated Generation
- Draft Model Logic: Implements speculative decoding where a smaller model proposes tokens that the larger target model verifies in a single forward pass.
- Three-phase pipeline: `_draft_proposals()` generates candidates, `_verify_proposals()` runs target verification in one pass, and `_accept_or_reject()` handles the acceptance loop.
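A greedy, cache-free sketch of those three phases is shown below. It illustrates the idea only and is not the actual `speculative.py` implementation, which also manages KV caches for both models and handles sampling:

```python
import torch


@torch.no_grad()
def speculative_step(target_model, draft_model, input_ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Draft k tokens, verify with one target pass, keep the agreed prefix."""
    n = input_ids.shape[1]

    # Phase 1: draft proposals, one token at a time with the small model.
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
    proposals = draft_ids[:, n:]

    # Phase 2: verify all proposals with a single target forward pass.
    target_logits = target_model(draft_ids).logits
    target_preds = target_logits[:, n - 1:-1, :].argmax(-1)  # target's choice at each drafted position

    # Phase 3: accept the longest agreeing prefix, then append the target's own next token.
    matches = (proposals == target_preds)[0]
    accepted = int(matches.long().cumprod(dim=0).sum())
    corrected = target_logits[:, n - 1 + accepted, :].argmax(-1, keepdim=True)
    return torch.cat([input_ids, proposals[:, :accepted], corrected], dim=-1)
```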
kv_cache.py | Logical Memory Tracker
- Iteration-Level Allocation: Tracks block usage across the entire batch.
- Sequence Management: Provides `allocate_sequence`, `extend_sequence`, and `free_sequence` methods invoked by the scheduler and engine during the generation lifecycle.
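The block bookkeeping behind those methods can be illustrated with a toy manager. Token counts stand in for the token lists of the real signatures, and all GPU-side storage is omitted:

```python
class ToyKVCacheManager:
    """Minimal block-table bookkeeping in the spirit of the methods above."""

    BLOCK_TOKENS = 16

    def __init__(self, total_blocks: int):
        self._free_blocks = total_blocks
        self._seq_blocks: dict[str, int] = {}

    def _blocks_needed(self, num_tokens: int) -> int:
        return -(-num_tokens // self.BLOCK_TOKENS)  # ceiling division

    def allocate_sequence(self, seq_id: str, num_tokens: int) -> bool:
        needed = self._blocks_needed(num_tokens)
        if needed > self._free_blocks:
            return False  # scheduler keeps the request in _waiting
        self._free_blocks -= needed
        self._seq_blocks[seq_id] = needed
        return True

    def extend_sequence(self, seq_id: str, new_total_tokens: int) -> bool:
        extra = self._blocks_needed(new_total_tokens) - self._seq_blocks[seq_id]
        if extra > self._free_blocks:
            return False
        self._free_blocks -= extra
        self._seq_blocks[seq_id] += extra
        return True

    def free_sequence(self, seq_id: str) -> None:
        self._free_blocks += self._seq_blocks.pop(seq_id)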
config.py | Unified Defaults
- Centralizes dataclasses (`ModelConfig`, `SchedulerConfig`, `KVCacheConfig`, `SamplingParams`).
- CLI params naturally cascade into config objects. The `--auto-config` flag triggers the dynamic hardware discovery sequence, overwriting baseline constraints with optimized formulas.
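A sketch of how a CLI flag could cascade into a config object and then receive hardware defaults. Only `--auto-config` and the `ModelConfig` field names are taken from the descriptions above; the other flags, the merge rule, and the reuse of `detect_defaults()` from the earlier hardware sketch are hypothetical:

```python
import argparse
from dataclasses import dataclass


@dataclass
class ModelConfig:
    model_name_or_path: str
    quantization: str = "none"
    inference_backend: str = "pytorch"

    def apply_hardware_defaults(self, defaults) -> None:
        # Auto-tuned values only fill fields the user left at their baseline.
        if self.quantization == "none" and defaults.quantization != "none":
            self.quantization = defaults.quantization


parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--backend", default="pytorch")
parser.add_argument("--auto-config", action="store_true")
args = parser.parse_args()

config = ModelConfig(model_name_or_path=args.model, inference_backend=args.backend)
if args.auto_config:
    config.apply_hardware_defaults(detect_defaults())  # see the hardware sketch earlier
```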
registry.py | Model Introspection
- Examines the HuggingFace repo name (e.g. `meta-llama/Llama-3.1-8B-Instruct`).
- Determines the architectural family (Llama, Gemma, Mistral, Qwen).
- Injects ideal hyper-parameters (e.g. `max_context_window=32768`, `rope_scaling=True`) before the tensors are ever initialized in VRAM.
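A toy lookup in that spirit is shown below; the substring-based family detection and the per-family values are placeholders for illustration, not the real registry contents:

```python
def lookup_profile(repo_name: str) -> dict:
    """Toy family lookup in the spirit of registry.py (placeholder values)."""
    profiles = {
        "llama":   {"max_context_window": 32768, "rope_scaling": True},
        "gemma":   {"max_context_window": 8192,  "rope_scaling": False},
        "mistral": {"max_context_window": 32768, "rope_scaling": False},
        "qwen":    {"max_context_window": 32768, "rope_scaling": True},
    }
    name = repo_name.lower()
    for family, profile in profiles.items():
        if family in name:
            return {"family": family, **profile}
    return {"family": "unknown"}


print(lookup_profile("meta-llama/Llama-3.1-8B-Instruct"))
# -> {'family': 'llama', 'max_context_window': 32768, 'rope_scaling': True}
```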