# zerostart
Fast cold starts for Python GPU applications. Drop-in wrapper that installs packages in the background while your app starts running.
```bash
zerostart run serve.py
```

Auto-detects dependencies from PEP 723 inline metadata, `pyproject.toml`, or `requirements.txt`.
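For reference, PEP 723 inline metadata is a comment block at the top of the script itself. A hypothetical `serve.py` might declare its dependencies like this (the package list and function body are illustrative, not part of zerostart):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "fastapi",
#     "uvicorn",
#     "torch",
# ]
# ///
# The comment block above is PEP 723 inline script metadata; tools that
# support it (zerostart, uv) read it to resolve dependencies before running.

def main() -> str:
    # Stand-in for the real server startup.
    return "server would start here"

if __name__ == "__main__":
    print(main())
```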
## Benchmarks
### Inference Server Cold Start
Time-to-health-check for an inference server (torch + transformers + fastapi + uvicorn) on an RTX A6000 depends on pod network bandwidth.

On bandwidth-constrained pods (common with cheaper providers), parallel Range requests download large wheels 3x faster. On fast-network pods, a single connection already saturates the link and both tools finish in about the same time. For small packages, zerostart's startup overhead makes uv faster — just use uv directly.
With progressive loading, your server accepts health probes in under 2 seconds while torch installs in the background. Kubernetes/load balancers see a healthy pod almost immediately.
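The mechanism is easy to picture with a standard-library sketch (a toy stand-in, not zerostart's code): the health endpoint answers immediately while a fake two-second "install" runs in a background thread.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Simulated background install; in zerostart this is the extraction daemon.
install_done = threading.Event()

def fake_install():
    time.sleep(2.0)  # stand-in for extracting torch in the background
    install_done.set()

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        # /health answers immediately, whether or not the install finished.
        body = b"ok" if install_done.is_set() else b"starting"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Health)
threading.Thread(target=server.serve_forever, daemon=True).start()
threading.Thread(target=fake_install, daemon=True).start()

# Probe the health endpoint long before the 2s "install" completes.
resp = urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/health")
first = resp.read()
server.shutdown()
```

A load balancer probing this endpoint marks the pod healthy within milliseconds, even though the heavy dependencies are still being extracted.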
### AI Pipeline Cold Starts
Cold start benchmarks across real AI workloads on RTX A6000:

| Workload | zerostart | uv | Speedup |
|----------|-----------|-----|---------|
| Data Science (pandas+sklearn+xgboost) | 8s | 8s | ~1x |
### Warm Start (cached environment)
Warm starts are where zerostart consistently wins. uv re-resolves and rebuilds the environment on every invocation. zerostart checks a cache marker and exec's Python directly.
| Workload | zerostart | uv | Speedup |
|----------|-----------|-----|---------|
All measured on RunPod (RTX 4090 / A6000).
## How It Works

Python starts immediately while packages install in the background. A lazy import hook blocks each `import` only until that specific package is extracted:

```
uv (sequential):
[====== install all packages ======] then [start Python]
46s to health check

zerostart (progressive):
[start Python immediately]
import fastapi → ready in 0.3s → /health responding at 1.8s
import torch → blocks 23s → first inference at 33s
[======= packages installing in background =======]
```

Small packages (fastapi, uvicorn) are available in under a second. Large packages (torch, transformers) block on import until extracted. Your app runs as soon as its first imports resolve.
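The hook can be sketched with Python's import machinery (an illustrative stand-in: the real hook waits on per-package marker files written by the extraction daemon, replaced here with an in-process event and a timer):

```python
import importlib.abc
import sys
import threading

class LazyGate(importlib.abc.MetaPathFinder):
    """Block imports of named packages until they are marked extracted."""

    def __init__(self):
        self.events = {}  # top-level package name -> threading.Event

    def register(self, name):
        self.events[name] = threading.Event()

    def mark_extracted(self, name):
        self.events[name].set()

    def find_spec(self, fullname, path, target=None):
        event = self.events.get(fullname.split(".")[0])
        if event is not None:
            event.wait()  # block this import until "extraction" completes
        return None       # then defer to the normal finders

gate = LazyGate()
gate.register("csv")            # pretend csv is still being extracted
sys.meta_path.insert(0, gate)
sys.modules.pop("csv", None)    # force a fresh import for the demo

# Background "daemon" finishes extracting csv 0.2 seconds from now.
threading.Timer(0.2, gate.mark_extracted, args=["csv"]).start()

import csv                      # blocks ~0.2s, then works normally
```

Returning `None` after the wait means the gate never loads anything itself; it only delays the moment the standard finders see the import.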
### GET+pipeline architecture
Downloads full wheels via single GET requests (32 parallel connections), then pipelines extraction through 4 parallel workers. Biggest wheels download first to maximize overlap:
```
Download: torch [==============>]  numpy [=>]  six [>]
Extract:  [worker 0: torch ======>]
          [worker 1: numpy =>]
          [worker 2: six >]
```
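A toy model of the pipeline (sizes, names, and timings are stand-ins, not zerostart's internals; the point is the shape: big wheels queued first into a small pool of extract workers):

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

# Fake wheels with sizes in MB; extraction is simulated by appending to a list.
wheels = {"torch": 900, "numpy": 60, "six": 1}

extract_queue = queue.Queue()
extracted = []
done = threading.Event()

def download(name):
    # In the real tool this is one HTTP GET per wheel, up to 32 in flight.
    extract_queue.put(name)

def extract_worker():
    # Drain the queue until downloads are finished and nothing is pending.
    while not (done.is_set() and extract_queue.empty()):
        try:
            name = extract_queue.get(timeout=0.05)
        except queue.Empty:
            continue
        extracted.append(name)  # stand-in for unzipping into site-packages
        extract_queue.task_done()

workers = [threading.Thread(target=extract_worker) for _ in range(4)]
for w in workers:
    w.start()

# Biggest wheels first, so extracting torch overlaps the remaining downloads.
order = sorted(wheels, key=wheels.get, reverse=True)
with ThreadPoolExecutor(max_workers=32) as pool:
    pool.map(download, order)

extract_queue.join()
done.set()
for w in workers:
    w.join()
```

Dedicated extract workers are what keep a small wheel like six from waiting behind torch: extraction order is decoupled from download order.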
### System CUDA detection
On pods with CUDA pre-installed, zerostart detects the system CUDA version and skips downloading nvidia-* wheels when the system already provides compatible libraries. This saves ~2-6GB of downloads depending on the workload.
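A simplified sketch of the filtering step (a hypothetical helper, not zerostart's code; real detection must also match CUDA versions, while this only checks for a library directory):

```python
from pathlib import Path

def plan_without_system_cuda(packages, cuda_root="/usr/local/cuda"):
    """Drop nvidia-* wheels from the install plan if system CUDA is present."""
    if Path(cuda_root, "lib64").is_dir():
        return [p for p in packages if not p.startswith("nvidia-")]
    return list(packages)

# Example plan: the nvidia-* entries are the multi-GB wheels worth skipping.
plan = ["torch", "nvidia-cublas-cu12", "nvidia-cudnn-cu12", "fastapi"]
```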
### Shared CUDA layer cache
CUDA libraries (nvidia-cublas, nvidia-cudnn, nvidia-nccl, etc.) are ~6GB and identical across torch, vllm, and diffusers environments. zerostart caches extracted wheels and hardlinks them into new environments — so the second torch-based environment skips re-extracting those 6GB.
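The hardlink trick, sketched with hypothetical paths (zerostart does this in Rust):

```python
import os
from pathlib import Path

def link_layer(cache_dir, env_dir):
    """Hardlink every file from a cached extracted wheel into a new env.

    Hardlinked files share one inode, so the 6GB of CUDA libraries occupy
    disk space once no matter how many environments reference them.
    """
    cache_dir, env_dir = Path(cache_dir), Path(env_dir)
    for src in cache_dir.rglob("*"):
        if src.is_file():
            dst = env_dir / src.relative_to(cache_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            os.link(src, dst)  # no copy: same bytes, new directory entry
```

Hardlinks cannot cross filesystems (`os.link` fails with EXDEV), which is why the wheel cache and the environments it feeds must live on the same disk.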
### Warm starts: Rust cache check
zerostart's warm path is three operations in Rust:
1. `stat(".complete")` — does the cached environment exist?
2. `find("lib/python*/site-packages")` — locate it
3. `exec(python)` — run directly
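The same three steps, written out in Python for illustration (zerostart does this in Rust; the environment layout and the `serve.py` entry point are assumptions):

```python
import glob
import os
import sys
from pathlib import Path

def warm_exec(env_root):
    """Exec straight into a cached environment, or return None to go cold."""
    if not Path(env_root, ".complete").exists():      # 1. stat the marker
        return None                                   # fall back to cold path
    matches = glob.glob(os.path.join(env_root, "lib/python*/site-packages"))
    if not matches:                                   # 2. find site-packages
        return None
    os.environ["PYTHONPATH"] = matches[0]
    # 3. exec: replaces this process with Python; never returns on success.
    os.execv(sys.executable, [sys.executable, "serve.py"])
```

No resolver, no metadata checks, no linking: the warm path does constant work regardless of how many packages the environment contains.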
## Install
Auto-registers via vLLM's plugin system when zerostart is installed.
## Architecture
The entire cold path runs in Rust with progressive loading:
```
7. Start Python (immediately, with lazy import hook)
   └─ imports block only until their package is extracted
```
Key design decisions:
- **Progressive loading** — Python starts immediately. Imports block per-package, not globally. Health checks respond in seconds, not minutes.
- **GET+pipeline** — single GET per wheel (maximizes bandwidth), 4 parallel extract workers (prevents small wheels queuing behind large ones).
- **System CUDA detection** — skips nvidia-* wheels when the pod already has compatible CUDA libraries, saving ~2-6GB of downloads.
- **Atomic extraction** — each wheel extracts to a staging directory, then renames into site-packages. Partial extractions never corrupt the target.
- **No venv overhead** — uses a flat site-packages directory with a content-addressed cache key.
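The staging-then-rename pattern behind the atomic-extraction guarantee, simplified to a single top-level package (real wheels contain several top-level entries, each needing its own rename; names and layout here are illustrative):

```python
import os
import tempfile
import zipfile
from pathlib import Path

def extract_atomically(wheel_path, site_packages, pkg_name):
    """Extract a wheel into a staging dir, then rename into site-packages.

    Readers either see nothing or the complete package, never a
    half-written tree, because rename is atomic on POSIX filesystems.
    """
    final = Path(site_packages) / pkg_name
    # Staging must be on the same filesystem for rename to be atomic.
    staging = tempfile.mkdtemp(dir=site_packages, prefix=".staging-")
    with zipfile.ZipFile(wheel_path) as zf:
        zf.extractall(staging)
    os.rename(os.path.join(staging, pkg_name), final)
    os.rmdir(staging)  # staging is empty once its contents are renamed out
```

If the process dies mid-extraction, only a hidden `.staging-*` directory is left behind; the importable package name never exists in a partial state.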