---
title: WinLLM Deep Dive
description: How LLM inference engines work — from first principles
tags: []
created: 2026-02-27
---
This guide teaches you every concept and component behind WinLLM, from first principles.
[!abstract] Table of Contents
- [[#Chapter 1 How LLMs Generate Text|How LLMs Generate Text]]
- [[#Chapter 2 The KV Cache|The KV Cache]]
- [[#Chapter 3 Token Sampling|Token Sampling]]
- [[#Chapter 4 Model Loading and Quantization|Model Loading & Quantization]]
- [[#Chapter 5 The Request Lifecycle|The Request Lifecycle]]
- [[#Chapter 6 The Scheduler Continuous Batching|The Scheduler (Continuous Batching)]]
- [[#Chapter 7 Performance Optimizations|Performance Optimizations]]
- [[#Chapter 8 Hardware Detection and Scaling|Hardware Detection & Scaling]]
- [[#Chapter 9 Model Registry|Model Registry]]
- [[#Chapter 10 The OpenAI API|The OpenAI API]]
- [[#Chapter 11 WinLLM vs vLLM|WinLLM vs vLLM]]
- [[#Chapter 12 Concept-to-Code Map|Concept-to-Code Map]]
An LLM is a function that takes a sequence of ==tokens== (numbers) and outputs ==probabilities== for what the next token should be. The entire "intelligence" comes from repeating this one step over and over.
"Hello, how are" → tokenize → [15496, 11, 703, 527] → model → probabilities for 50,000+ tokens
↓
" you" (highest prob)
The generation loop:
Step 1: "Hello" → model → ","
Step 2: "Hello," → model → " how"
Step 3: "Hello, how" → model → " are"
Step 4: "Hello, how are" → model → " you"
Step 5: "Hello, how are you" → model → "?"
Step 6: "Hello, how are you?" → model → [EOS] ← stop!
[!info] Autoregressive Generation This is called autoregressive generation — each token depends on all previous tokens. The model never "sees" the future, only the past.
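To make the loop concrete, here is a minimal single-sequence greedy version using plain HuggingFace Transformers. The model id is only an example; WinLLM's real loop is batched, as shown later in this chapter.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # example model; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

ids = tok("Hello, how are", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                         # generate at most 20 new tokens
        logits = model(ids).logits              # [1, seq_len, vocab_size]
        next_id = logits[0, -1].argmax()        # greedy: pick the highest-probability token
        if next_id.item() == tok.eos_token_id:  # stop when the model emits EOS
            break
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

print(tok.decode(ids[0]))
```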
Every LLM inference has two distinct phases:
[!tip] Phase 1: Prefill (Prompt Processing)
- Feed the entire prompt through the model in one shot
- The model processes all tokens in parallel (this is fast!)
- Result: the first output token + internal state (KV cache)
[!tip] Phase 2: Decode (Token Generation)
- Generate one token at a time
- Each step uses the cached state from previous steps
- This is the bottleneck — it's sequential and memory-bound
In our code — [[../winllm/engine.py|winllm/engine.py]] (The Batched Reality):
```python
# Instead of a simple loop, we use iteration-level scheduling:
def generate_step(self, requests: list[GenerationRequest]):
    # 1. PREFILL for new requests (batch of prompts)
    # 2. DECODE for existing requests (batch of single tokens)
    # Each step computes exactly ONE token for EVERY request in the batch.
    # This allows us to maximize GPU throughput.
```

Prefix caching allows wLLM to reuse the KV cache of common prompt prefixes (e.g., system prompts or long instruction contexts) across different requests.
- Hashing: Prompts are hashed at block boundaries (e.g., every 16 tokens).
- Physical Cache: Instead of just tracking block IDs, the `KVCacheManager` stores the actual physical tensors for these prefix blocks.
- Reference Counting: Blocks in the prefix cache have an incremented `ref_count` to prevent them from being evicted while in use or during normal LRU cycles.
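A minimal sketch of how those three pieces could fit together. The hashing helper and `PrefixCache` class below are illustrative stand-ins for the real logic in [[../winllm/kv_cache.py|kv_cache.py]] (`match_prefix()` / `promote_to_prefix()`), not its actual code.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block (matches the 16-token blocks described above)

def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash the prompt at every full block boundary (illustrative helper)."""
    return [
        hashlib.sha256(str(token_ids[:end]).encode()).hexdigest()
        for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE)
    ]

class PrefixCache:
    """Hypothetical sketch: maps prefix-block hashes to cached KV tensors."""

    def __init__(self):
        self._blocks = {}  # hash -> [kv_tensors, ref_count]

    def match_prefix(self, token_ids: list[int]) -> int:
        """Return how many prompt tokens are covered by cached blocks, pinning them."""
        matched_blocks = 0
        for h in block_hashes(token_ids):
            if h not in self._blocks:
                break
            self._blocks[h][1] += 1         # bump ref_count so the block can't be evicted
            matched_blocks += 1
        return matched_blocks * BLOCK_SIZE  # tokens the engine can skip during prefill
```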
Beyond bitsandbytes, wLLM supports hardware-native 4-bit quantization:
- AWQ (AutoAWQ): Optimized for throughput with fused kernels.
- GPTQ (AutoGPTQ): High-speed inference using ExLlama-style kernels.
The inference loop is now "bubble-free" regarding network I/O:
- Token Queues: The GPU loop only emits raw token IDs to a thread-safe queue.
- Async Decoding: The REST API layer performs `tokenizer.decode()` in the main async thread, decoupling expensive string manipulations from the time-sensitive GPU iteration.
In a transformer, every token needs to "attend" to every previous token. That means computing Key and Value matrices for every token at every layer.
Without caching: generating token #100 requires recomputing Keys and Values for all 99 previous tokens. Token #101 recomputes all 100. This is ==O(n²)== — disastrously slow.
After you compute Key/Value for a token, save them. The next step only needs to compute K/V for the new token and concatenate it to the cache.
```
Step 1: Compute KV for tokens 1-10   → cache = [KV₁, KV₂, ..., KV₁₀]
Step 2: Compute KV for token 11 ONLY → cache = [KV₁, ..., KV₁₀, KV₁₁]
Step 3: Compute KV for token 12 ONLY → cache = [KV₁, ..., KV₁₁, KV₁₂]
```
This is what `past_key_values` is in our engine — PyTorch / HuggingFace handle the actual tensors internally.
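A sketch of what that looks like with the HuggingFace API, assuming a loaded `model` and a tokenized `prompt_ids` tensor as in the Chapter 1 example (greedy decoding for brevity):

```python
import torch

with torch.no_grad():
    out = model(prompt_ids, use_cache=True)     # PREFILL: all prompt tokens at once
    past = out.past_key_values                  # cached K/V for every layer
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    for _ in range(20):                         # DECODE: feed ONE new token per step
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values              # cache grows by exactly one token
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
```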
KV cache is the ==biggest memory hog== after the model weights itself:
[!example] Example: Mistral-7B
| Metric | Calculation | Result |
|---|---|---|
| Per token | 2 × 32 × 8 × 128 × 2 | 128 KB |
| 4096 tokens | 128 KB × 4096 | 512 MB (one sequence!) |
| 8 sequences | 512 MB × 8 | 4 GB ← half your 8GB GPU! |
[!important] Why vLLM Invented PagedAttention To avoid wasting memory with fragmented allocations. They split the KV cache into small "pages" like an OS manages RAM. This requires custom CUDA kernels.
We built a dynamic block manager in [[../winllm/kv_cache.py|winllm/kv_cache.py]]:
- Reads the model's actual dimensions after loading
- Calculates exact per-token memory cost
- Aggregates total available VRAM across all detected GPUs (`_get_total_available_vram()`)
- Divides available VRAM into fixed-size blocks (16 tokens each) with dynamic caps (e.g., `max(2048, int(total_vram_gb * 50))`)
- Tracks allocations per sequence
- Tells the scheduler "can we afford this new request?"
```python
# After model loads, we read its real architecture:
kv_params = self._loader.get_kv_cache_params()
# → {"num_layers": 32, "num_kv_heads": 8, "head_dim": 128}

# Then calculate precise budget:
per_token = 2 * 32 * 8 * 128 * 2              # = 131,072 bytes per token
per_block = per_token * 16                     # = 2 MB per block
available = _get_total_available_vram() * 0.4  # Reserve 40% of remaining VRAM
max_blocks = available / per_block             # How many blocks we can afford dynamically
```

The model outputs logits — raw scores for every token in the vocabulary (e.g., 32,000 scores). We need to pick ==ONE==. Sampling controls the model's "personality."
```mermaid
flowchart LR
    A[Raw Logits] --> B[Repetition Penalty]
    B --> C[Temperature]
    C --> D[Top-K]
    D --> E[Top-P]
    E --> F[Sample Token]
    style A fill:#4a5568,color:#fff
    style F fill:#2d6a4f,color:#fff
```
See [[../winllm/sampler.py|winllm/sampler.py]] for the implementation.
Reduce score of tokens already generated, so the model doesn't repeat itself:
```python
# If the model keeps saying "the the the", penalize "the"
# Positive logits → divide by penalty (make less attractive)
# Negative logits → multiply by penalty (make even less attractive)
scores = torch.where(scores > 0, scores / penalty, scores * penalty)
```

Controls randomness by scaling the logits before converting to probabilities:
```python
logits = logits / temperature
```

[!note] Temperature Intuition
| Value | Effect | Use Case |
|---|---|---|
| `0` | Greedy — always pick highest | Factual Q&A |
| `0.1` | Near-deterministic | Code generation |
| `0.7` | Balanced (default) | General chat |
| `1.0` | Natural diversity | Creative writing |
| `2.0` | Very random | Brainstorming |
Think of it: dividing by a small number makes big scores ==HUGE== relative to small scores → model becomes more "confident." Dividing by a large number flattens the distribution → more random.
Only keep the K highest-scoring tokens, set everything else to -inf:
```python
# top_k=50: keep the 50 best candidates
# This prevents sampling from the very low-probability "tail"
top_k_values, _ = torch.topk(logits, top_k)
min_threshold = top_k_values[:, -1:]            # k-th best score per sequence
logits[logits < min_threshold] = float("-inf")
```

Keep the smallest set of tokens whose cumulative probability >= P:
```python
# top_p=0.9: keep the tokens that together account for 90% probability
# This ADAPTS to the distribution:
# - If the model is confident: maybe only 5 tokens pass
# - If the model is uncertain: maybe 500 tokens pass
```

[!tip] Why Top-P > Top-K Top-p ==adapts== to confidence. When the model is sure about one word, top-k=50 still includes 49 bad options. Top-p=0.9 might only keep 2 tokens.
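For reference, a standard nucleus-filtering implementation in PyTorch — a sketch, not necessarily identical to `apply_top_p()` in sampler.py:

```python
import torch
import torch.nn.functional as F

def apply_top_p(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    """Keep the smallest set of tokens whose cumulative probability >= top_p."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    cumulative = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

    # Mark tokens that fall outside the nucleus...
    remove = cumulative > top_p
    # ...but shift right so the first token that crosses the threshold is kept
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False

    # Map the mask back to the original (unsorted) vocabulary order
    remove = remove.scatter(dim=-1, index=sorted_idx, src=remove)
    return logits.masked_fill(remove, float("-inf"))
```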
Finally, sample from the filtered distribution:

```python
probs = F.softmax(logits, dim=-1)    # logits → probabilities (sum to 1.0)
token = torch.multinomial(probs, 1)  # random weighted draw
```

`torch.multinomial` is like a weighted dice roll — tokens with higher probability are more likely to be chosen.
Your RTX 4070 Mobile has 8 GB VRAM. Model sizes in float16:
| Model | Parameters | float16 Size | Fits in 8GB? |
|---|---|---|---|
| Phi-3-mini | 3.8B | 7.6 GB | Barely |
| Mistral-7B | 7B | 14 GB | No |
| Llama-3.1-8B | 8B | 16 GB | No |
| Llama-3.1-70B | 70B | 140 GB | No |
4-bit quantization compresses each weight from 16 bits to 4 bits:
```
Mistral-7B: 7B × 2 bytes   = 14 GB  (float16)
            7B × 0.5 bytes = 3.5 GB (4-bit)  ← fits!
```
[!info] NormalFloat4 Neural network weights follow a roughly normal distribution (bell curve). NF4 creates a 4-bit lookup table optimized for this distribution — so the 16 quantization levels are placed where most weights actually are, not evenly spaced.
```python
BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # Compute stays in fp16 for accuracy
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_use_double_quant=True,        # Also quantize the quantization constants!
)
```

`double_quant=True` saves another ~0.4 GB by quantizing the scaling factors themselves.
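For context, here is how such a config is typically passed to HuggingFace's `from_pretrained`. The model id is only an example, not something WinLLM pins:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example id; any causal LM works

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # weights stored as 4-bit NF4
    device_map="auto",                 # let accelerate place layers (see below)
)
```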
See [[../winllm/model_loader.py|winllm/model_loader.py]].
When you call `device_map="auto"`, HuggingFace's accelerate library:
- Creates the model on a "meta" device (uses ==0 bytes==)
- Measures each layer's actual size
- Greedily assigns layers: fill GPU₀ → GPU₁ → … → CPU RAM → disk
load_kwargs["device_map"] = "auto" # Single GPU: everything on cuda:0
load_kwargs["device_map"] = "balanced" # Multi-GPU: spread evenlyFor tensor parallelism (splitting individual layers across GPUs):
load_kwargs["tp_plan"] = "auto" # Shard attention/MLP across GPUsflowchart TD
A["HTTP POST /v1/chat/completions"] --> B["Add to Scheduler Waiting Queue"]
B --> C["Inference Loop (Background Thread)"]
C --> D{"Space in KV Cache?"}
D -->|Yes| E["Admit to Active Batch"]
D -->|No| B
E --> F["PREFILL PHASE (Initial Prompt)"]
F --> G["DECODE PHASE (Token by Token)"]
G --> H{"Target Accepted?"}
H -->|No| I["Speculative Correction"]
H -->|Yes| J["Stream Token to Client"]
J --> K{"EOS or Max Tokens?"}
K -->|No| G
K -->|Yes| L["Free KV blocks & Cleanup"]
L --> M["Update Request Status"]
| Step | Module |
|---|---|
| 1–2 | [[../winllm/api_server.py|api_server.py]] |
| 3 | [[../winllm/scheduler.py|scheduler.py]] |
| 4–9 | [[../winllm/engine.py|engine.py]] |
| 5 | [[../winllm/kv_cache.py|kv_cache.py]] |
| 8 | [[../winllm/sampler.py|sampler.py]] |
| 10 | [[../winllm/kv_cache.py|kv_cache.py]] |
| 11 | [[../winllm/scheduler.py|scheduler.py]] |
| 12 | [[../winllm/api_server.py|api_server.py]] |
For "stream": true, we use a threadsafe async queue pattern to correctly bridge the blocking GPU thread and the async FastAPI server:
```
[GPU Thread]                            [Async HTTP Thread]
     │                                         │
     ├── generate token ──→ queue.put() ──→ queue.get()
     │                                         ├──→ SSE event to client
     ├── generate token ──→ queue.put() ──→ queue.get()
     │                                         ├──→ SSE event to client
     └── finished ────────→ queue.put() ──→ queue.get()
                                               └──→ [DONE] event
```
[!note] Robust Error Handling & Timeouts Our streaming implementation catches `asyncio.TimeoutError` if generation stalls, gracefully cancels the generation request, and yields an SSE error block to the client. This prevents hung connections and memory leaks.
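A condensed sketch of that bridge, assuming an `asyncio.Queue` per request and a `None` sentinel for completion. Names are illustrative, not the exact api_server.py code:

```python
import asyncio

async def stream_tokens(token_queue: asyncio.Queue, timeout: float = 30.0):
    """Async generator the SSE endpoint can iterate over."""
    while True:
        try:
            chunk = await asyncio.wait_for(token_queue.get(), timeout=timeout)
        except asyncio.TimeoutError:
            yield 'data: {"error": "generation stalled"}\n\n'  # graceful error block
            break
        if chunk is None:                  # sentinel: generation finished
            yield "data: [DONE]\n\n"
            break
        yield f"data: {chunk}\n\n"         # one SSE event per token/chunk

def on_token(loop: asyncio.AbstractEventLoop, token_queue: asyncio.Queue, chunk):
    """Called from the blocking GPU thread; hops safely onto the event loop."""
    loop.call_soon_threadsafe(token_queue.put_nowait, chunk)
```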
We use FastAPI's robust `@asynccontextmanager` lifespan hook to manage the model's memory. When the server starts, the code before the `yield` loads the model on a background thread. When the server shuts down (e.g. Ctrl+C), the code after the `yield` resumes to gracefully unload the model and free GPU VRAM, completely replacing the deprecated `@app.on_event` handlers.
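A minimal sketch of that lifespan hook. `load_engine()` and `engine.shutdown()` are placeholder names for whatever loads and frees the model:

```python
import asyncio
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: load the model off the event loop so the server stays responsive
    app.state.engine = await asyncio.to_thread(load_engine)
    yield                            # server handles requests while suspended here
    # Shutdown (e.g. Ctrl+C): unload the model and free GPU VRAM
    app.state.engine.shutdown()

app = FastAPI(lifespan=lifespan)
```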
Why batch requests at all? Because the GPU is ==underutilized== during decode. Each decode step is memory-bound (waiting for data to move between GPU memory and compute cores). The engine spends 90% of its time waiting for weights to load from VRAM rather than doing math.
See [[../winllm/scheduler.py|winllm/scheduler.py]]. We use a background thread `_run_inference_loop` that treats the model as a giant state machine.
[!info] The Global Loop Whereas standard Python servers use many threads, the `InferenceLoop` is a single architectural bottleneck where all GPU operations happen. This avoids race conditions on the CUDA streams.
- Waiting Pool: New requests sit in a `deque`.
- Admission: Every iteration, the loop checks `kv_cache_manager.can_allocate()`. If yes, it allocates blocks and moves the request to the `_active_reqs` list.
- Execution: The loop calls `engine.generate_step(active_batch)`.
    - Prefill Merge: New requests are concatenated into a single padded prefill batch.
    - Decode Merge: Existing requests are batched into a single 1-token-wide forward pass.
- Streaming: Tokens are streamed back via thread-safe callbacks (`loop.call_soon_threadsafe`) to the FastAPI front-end.
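A condensed sketch of that loop under the assumptions above. `free_sequence()` and the exact control flow are illustrative; the real `_run_inference_loop` also handles streaming callbacks and error paths:

```python
import time
from collections import deque

def run_inference_loop(engine, kv_cache_manager, waiting: deque, active: list, should_stop):
    """Iteration-level scheduling sketch: one token for every active request per pass."""
    while not should_stop():
        # 1. Admission: move waiting requests into the batch while the KV cache has room
        while waiting and kv_cache_manager.can_allocate(waiting[0]):
            req = waiting.popleft()
            kv_cache_manager.allocate_sequence(req)
            active.append(req)

        if not active:
            time.sleep(0.001)          # idle: nothing to run this iteration
            continue

        # 2. Execution: prefill new requests, decode existing ones — one token each
        finished = engine.generate_step(active)

        # 3. Cleanup: release KV blocks of requests that hit EOS or max tokens
        for req in finished:
            kv_cache_manager.free_sequence(req)
            active.remove(req)
```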
[!tip] No More Semaphores We moved from a simple `Semaphore` (which blocked based on the number of users) to Dynamic Admission Control. We admit as many users as the KV cache can literally hold, utilizing 100% of available VRAM.
To prevent memory leaks from completed requests continuously piling up in memory, the Scheduler runs a periodic eviction pass (_evict_completed). It clears out finished requests based on two bounded constraints:
- TTL (Time to Live): Requests older than `completed_request_ttl` seconds are deleted.
- Max Kept Limit: The dictionary size is hard-capped to `max_completed_requests` to ensure stable memory usage during extended operation.
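A sketch of that eviction pass. The `finished_at` timestamp field is assumed for illustration:

```python
import time

def evict_completed(completed: dict, ttl_seconds: float, max_kept: int) -> None:
    """Drop finished requests by age (TTL) and by a hard size cap."""
    now = time.monotonic()

    # TTL: anything older than completed_request_ttl seconds goes away
    expired = [rid for rid, req in completed.items() if now - req.finished_at > ttl_seconds]
    for rid in expired:
        del completed[rid]

    # Hard cap: keep only the max_completed_requests most recent entries
    while len(completed) > max_kept:
        oldest = min(completed, key=lambda rid: completed[rid].finished_at)
        del completed[oldest]
```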
| Hardware | `max_batch_size` | Why |
|---|---|---|
| Laptop (8GB) | 4 | Limited VRAM for KV cache |
| Desktop (24GB) | 8 | More headroom |
| Datacenter (H200) | 64 | Massive VRAM |
[!warning] How vLLM Does It Differently vLLM's continuous batching combines multiple sequences into a single GPU kernel call — all sequences are processed together in one matrix multiplication. This is much more efficient but requires custom CUDA kernels. Our approach runs them in separate threads on the same GPU.
Generating tokens one-by-one is inherently slow. We implement three strategies to cheat the speed limit.
WinLLM supports three inference backends, each optimized for different deployment scenarios:
- PyTorch (default): Standard HuggingFace Transformers with quantization, multi-GPU, and speculative decoding.
- ONNX Runtime: Hardware-optimized inference via Optimum, ideal for pre-exported models (e.g., LiquidAI). Avoids Triton/MSVC entirely.
- DirectML: Cross-vendor GPU acceleration via DX12. Works with AMD, Intel, and NVIDIA GPUs.
[!tip] Choosing a Backend For most users, the default PyTorch backend is best. Use ONNX Runtime when you have pre-exported models that need Windows-native acceleration. Use DirectML for non-NVIDIA GPUs.
In standard servers, you wait for Request A to finish 500 tokens before starting Request B. In WinLLM, Request B enters the batch at the next token iteration. This is achieved by:
- Dynamic KV Admission: The scheduler checks `kv_cache_manager.can_allocate()` every iteration, admitting as many requests as VRAM can hold.
- Iteration Steering: The scheduler can stop, add, or remove requests from the batch every single token step.
- Prefix Caching: Common prompt prefixes (e.g., system prompts) are cached and reused across requests, avoiding redundant prefill computation.
We use a tiny Draft Model (e.g., Phi-1B) to guess the next 4 tokens. We then send all 4 guesses to the massive Target Model (e.g., Llama-70B) in a single forward pass.
- The Proposal: The Draft model is fast enough to generate 4 tokens in ~5ms.
- The Verification: The Target model checks the sequence `[Prompt + G1 + G2 + G3 + G4]` in roughly the same time it takes to check `[Prompt]`.
- The Acceptance: We compare the Target's output probabilities at each position. If the Target says `probs(G1)` is high, we keep it and check `G2`.
```mermaid
flowchart LR
    Draft["Draft Model (Proposals)"] -- "G1, G2, G3, G4" --> Target["Target Model (Verification)"]
    Target -- "Accepted: G1, G2. Corrected: X" --> Output["Final: G1, G2, X"]
```
Hard-coded settings don't scale. We need the engine to adapt to hardware. See [[../winllm/device.py|winllm/device.py]].
```python
hw = DeviceInfo.detect()
# Queries torch.cuda.get_device_properties() for each GPU
# Returns: name, VRAM, compute capability, platform, CPU RAM
```

Instead of classifying hardware into named tiers ("laptop", "desktop", etc.) with fixed lookup tables, we ==calculate every parameter mathematically== from the actual hardware:
```python
# _build_defaults() — all allocation is formula-driven:
defaults = HardwareDefaults(
    default_quantization="4bit" if total_vram_gb < 16 else "none",
    max_batch_size=max(1, int(total_vram_gb / 1.5)),  # Scales continuously
    max_model_len=8192 if total_vram_gb >= 24 else (4096 if total_vram_gb >= 12 else 2048),
    device_map_strategy="balanced" if device_count > 1 else "auto",
    tensor_parallel_size=device_count,  # Use all GPUs
    gpu_memory_utilization=0.90,
    kv_cache_fraction=0.90,             # Pre-allocate 90% of remaining VRAM to KV pool
    attention_backend="sdpa",           # Default; overridden below if hardware supports it
)
```

[!tip] Why Continuous Allocation > Named Profiles A 12 GB GPU shouldn't behave identically to an 8 GB GPU just because they're both "laptops." With math-based allocation, a 12 GB card gets `batch_size=8` and an 8 GB card gets `batch_size=5` — each scaled to its exact capacity.
The engine automatically selects the fastest attention implementation your GPU supports:
```python
min_compute = min(g.compute_capability for g in info.devices)
if min_compute >= (8, 0):  # Ampere+ (RTX 30xx, A100, etc.)
    defaults.attention_backend = "flash_attention_2"
# Otherwise stays on "sdpa" (PyTorch native scaled dot-product)
```

[!info] Attention Backends

| Backend | When | Speed |
|---|---|---|
| `sdpa` | Default, all GPUs | Baseline |
| `flash_attention_2` | Compute ≥ 8.0 (Ampere+) | ~2× faster attention |
| `eager` | Debugging / compatibility | Slowest, most compatible |
Every auto-tuned default can be overridden without touching code:
```bat
set WINLLM_QUANTIZATION=none
set WINLLM_MAX_BATCH_SIZE=16
set WINLLM_ATTENTION_BACKEND=eager
winllm serve -m my-model --auto-config
```

[!example] All Environment Variables

| Variable | Controls | Type |
|---|---|---|
| `WINLLM_QUANTIZATION` | Quantization mode | str |
| `WINLLM_MAX_BATCH_SIZE` | Max concurrent requests | int |
| `WINLLM_MAX_MODEL_LEN` | Max context length | int |
| `WINLLM_DEVICE_MAP` | Device map strategy | str |
| `WINLLM_TP_SIZE` | Tensor parallel size | int |
| `WINLLM_GPU_UTILIZATION` | GPU memory utilization | float |
| `WINLLM_KV_FRACTION` | KV cache VRAM fraction | float |
| `WINLLM_ATTENTION_BACKEND` | Attention implementation | str |
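A sketch of how those overrides could be applied on top of the computed defaults. Field names follow the `HardwareDefaults` example above; the real `_apply_env_overrides()` covers every variable in the table:

```python
import os

def apply_env_overrides(defaults):
    """If a WINLLM_* variable is set, it wins over the auto-tuned value."""
    if (v := os.getenv("WINLLM_QUANTIZATION")) is not None:
        defaults.default_quantization = v
    if (v := os.getenv("WINLLM_MAX_BATCH_SIZE")) is not None:
        defaults.max_batch_size = int(v)
    if (v := os.getenv("WINLLM_GPU_UTILIZATION")) is not None:
        defaults.gpu_memory_utilization = float(v)
    if (v := os.getenv("WINLLM_ATTENTION_BACKEND")) is not None:
        defaults.attention_backend = v
    return defaults
```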
[!note] Device Map (Model Sharding) — Fits Bigger Models
```
GPU 0: layers 0-15
GPU 1: layers 16-31
Data flows: GPU0 → GPU1 (sequentially, no speedup)
```
Easy to use (`device_map="balanced"`). No speedup per-request, but fits models larger than one GPU.
[!note] Tensor Parallelism (Layer Splitting) — Actually Faster
```
GPU 0: left half of layer
GPU 1: right half of layer
Data flows: Both GPUs compute simultaneously, sync after each layer
```
Real speedup (~linear with GPU count). Requires fast GPU interconnect (NVLink). Enabled with `tp_plan="auto"`.
See [[../winllm/registry.py|winllm/registry.py]].
Different model families have different optimal settings. Llama-3 supports RoPE scaling for extended context. Mistral handles sliding window attention natively. Gemma is sensitive to quantization. You shouldn't need to know all this — the engine should just pick the right settings.
The registry maintains a list of known model families with pre-tuned profiles:
```python
KNOWN_MODELS = [
    ModelProfile(family="llama", match_keywords=["llama-3", "llama-2", "llama"],
                 max_context_window=8192, rope_scaling=True),
    ModelProfile(family="mistral", match_keywords=["mistral", "mixtral"],
                 max_context_window=32768, rope_scaling=False),
    ModelProfile(family="qwen", match_keywords=["qwen1.5", "qwen2", "qwen"],
                 max_context_window=32768, rope_scaling=True),
    ModelProfile(family="gemma", match_keywords=["gemma"],
                 max_context_window=8192, rope_scaling=False),
]
```

When `--auto-config` is used, the pipeline:
- Detects hardware → `_build_defaults()` calculates base parameters
- Identifies model family → `identify_model_profile()` matches repo name keywords
- Applies family tweaks → `apply_model_profile()` adjusts quantization, context window
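A keyword-matching sketch of the family-identification step, assuming the `ModelProfile` shape shown above (not necessarily the exact registry.py code):

```python
def identify_model_profile(repo_id: str, known_models=None):
    """Return the first profile whose keyword appears in the repo name, else None."""
    profiles = known_models if known_models is not None else KNOWN_MODELS
    name = repo_id.lower()
    for profile in profiles:
        if any(keyword in name for keyword in profile.match_keywords):
            return profile
    return None  # unknown family → engine falls back to generic defaults

# e.g. identify_model_profile("mistralai/Mistral-7B-Instruct-v0.2") matches the "mistral" profile
```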
[!info] Automatic, Not Mandatory If no model is recognized, the engine falls back to generic defaults. For new or custom models, everything still works — you just don't get family-specific optimizations.
See [[../winllm/api_server.py|winllm/api_server.py]].
By implementing the same API shape, any tool built for OpenAI works with WinLLM. Just change the base URL:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Phi-3-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

[!success] Compatible With
- Python `openai` library
- LangChain / LlamaIndex
- Continue (VS Code extension)
- Any OpenAI-compatible client
Different models expect different prompt formats. Llama uses `[INST]` tags, Phi-3 uses `<|system|>` tags, ChatML uses `<|im_start|>`. If you get the format wrong, the model outputs garbage.
```python
# In api_server.py — we don't need to know the format!
prompt = engine.tokenizer.apply_chat_template(
    messages,                    # [{"role": "user", "content": "Hello!"}]
    tokenize=False,              # Return string, not token IDs
    add_generation_prompt=True,  # Add the "Assistant:" prefix
)
# The tokenizer knows the model's format and does the right thing
```

[!tip] Always use `apply_chat_template()` instead of manually building prompts — it works with any model automatically.
An honest comparison. Understanding the gap helps you know where to improve.
| Feature | vLLM | WinLLM |
|---|---|---|
| Language | Python + C++/CUDA kernels | Pure Python |
| KV Cache | PagedAttention (custom CUDA) | Logical block tracking + prefix caching |
| Batching | Continuous batching (fused kernels) | Iteration-level continuous batching |
| Speculative | Yes | Yes (SpeculativeEngine) |
| Backends | PyTorch only | PyTorch, ONNX Runtime, DirectML |
| Platform | Linux only | Windows + Linux |
| Performance | High-end production | High-performance consumer |
| Model Support | Model-specific kernels (Delayed) | Zero-Day (HF-Native) |
What vLLM's extra engineering buys:
- PagedAttention kernel — Custom CUDA kernel reads KV cache from scattered memory "pages." Eliminates fragmentation.
- Fused continuous batching — Packs all active sequences into one padded batch tensor, one GPU kernel for the whole batch.
- Custom model implementations — Reimplements architectures with FlashAttention, fused rotary embeddings, etc.
[!success] Our Strengths
- Zero-Day Support — The "Wedge." If it's on HuggingFace, it runs in wLLM instantly. No waiting for GGUF/Exl2 conversions.
- Simplicity — ~800 lines of Python vs vLLM's ~100,000+
- Compatibility — Any HuggingFace model works out of the box
- Windows native — No WSL, no Linux-only CUDA extensions
- Learning — You can read and understand every component
- 80/20 rule — ~80% of capability with ~5% of complexity
[!abstract] Future Optimizations
- FlashAttention — `pip install flash-attn` → ~2× speedup on attention
- llama.cpp backend — GGUF model support for optimized C++ inference
- TensorRT-LLM — NVIDIA's production engine, extremely fast
- True continuous batching — Fused batch scheduler combining sequences
Quick reference — where each concept lives in the codebase:
| Concept | File | Key Function |
|---|---|---|
| Autoregressive generation | [[../winllm/engine.py|engine.py]] | generate_step() |
| Continuous Batching | [[../winllm/scheduler.py|scheduler.py]] | _run_inference_loop() |
| Speculative Decoding | [[../winllm/speculative.py|speculative.py]] | SpeculativeEngine.step() |
| Multi-Backend Loading | [[../winllm/backend.py|backend.py]] | BackendFactory.load() |
| Model profile tuning | [[../winllm/registry.py|registry.py]] | apply_model_profile() |
| Memory budget | [[../winllm/kv_cache.py|kv_cache.py]] | allocate_sequence() |
| Prefix caching | [[../winllm/kv_cache.py|kv_cache.py]] | match_prefix(), promote_to_prefix() |
| Concept | File | Key Function |
|---|---|---|
| KV memory budget | [[../winllm/kv_cache.py|kv_cache.py]] | _estimate_max_blocks() |
| Model-aware KV sizing | [[../winllm/kv_cache.py|kv_cache.py]] | _estimate_per_token_kv_bytes() |
| Block allocation | [[../winllm/kv_cache.py|kv_cache.py]] | allocate_sequence() |
| Concept | File | Key Function |
|---|---|---|
| Temperature | [[../winllm/sampler.py|sampler.py]] | apply_temperature() |
| Top-k filtering | [[../winllm/sampler.py|sampler.py]] | apply_top_k() |
| Top-p / nucleus | [[../winllm/sampler.py|sampler.py]] | apply_top_p() |
| Repetition penalty | [[../winllm/sampler.py|sampler.py]] | apply_repetition_penalty() |
| Full pipeline | [[../winllm/sampler.py|sampler.py]] | sample_token() |
| Concept | File | Key Function |
|---|---|---|
| 4-bit quantization | [[../winllm/model_loader.py|model_loader.py]] | _build_quantization_config() |
| Model loading | [[../winllm/model_loader.py|model_loader.py]] | ModelLoader.load() |
| Multi-GPU distribution | [[../winllm/model_loader.py|model_loader.py]] | _resolve_device_map() |
| Model introspection | [[../winllm/model_loader.py|model_loader.py]] | _extract_model_kv_params() |
| Concept | File | Key Function |
|---|---|---|
| Hardware detection | [[../winllm/device.py|device.py]] | DeviceInfo.detect() |
| Dynamic defaults | [[../winllm/device.py|device.py]] | _build_defaults(), HardwareDefaults |
| Env overrides | [[../winllm/device.py|device.py]] | _apply_env_overrides() |
| Attention backend | [[../winllm/device.py|device.py]] | Auto-detect via compute_capability |
| GPU memory queries | [[../winllm/device.py|device.py]] | get_aggregate_gpu_memory(), get_all_gpu_memory_info() |
| Model registry | [[../winllm/registry.py|registry.py]] | identify_model_profile(), apply_model_profile() |
| Request scheduling | [[../winllm/scheduler.py|scheduler.py]] | Scheduler.submit() |
| Completed eviction | [[../winllm/scheduler.py|scheduler.py]] | _evict_completed() |
| Concurrency control | [[../winllm/scheduler.py|scheduler.py]] | Dynamic KV admission via can_allocate() |
| OpenAI chat endpoint | [[../winllm/api_server.py|api_server.py]] | chat_completions() |
| SSE streaming | [[../winllm/api_server.py|api_server.py]] | _stream_response() |
| CLI commands | [[../winllm/cli.py|cli.py]] | cmd_serve(), cmd_chat() |
| Configuration | [[../winllm/config.py|config.py]] | ModelConfig, SamplingParams, KVCacheConfig |