On a faster pod (1Gbps+ network), cold starts drop further:
| Workload | zerostart | uvx | Speedup |
|----------|-----------|-----|---------|
| torch | **23.1s** | 90.9s | **3.9x** |
| vllm | **33.3s** | 138.1s | **4.1x** |

## Why Is It Faster?

### Cold starts: parallel Range-request streaming

uvx downloads each wheel over a single HTTP connection: an 873MB torch wheel is one TCP stream.

zerostart uses **HTTP Range requests** to download multiple chunks of each wheel in parallel, and starts extracting files while chunks are still arriving:
```
chunk1 [====>]──extract──►
chunk2 [====>]──extract──►  ← 4 concurrent Range requests
chunk3 [====>]──extract──►    per large wheel
chunk4 [====>]──extract──►
numpy.whl [=>]──extract──►  ← all wheels in parallel
```
On a slow network, this is the difference between 1 connection at 15 MB/s (60s for torch) and 16+ connections saturating the link.
### Warm starts: Rust cache check vs full re-resolve
uvx re-resolves dependencies and rebuilds the tool environment on every invocation — even when packages are cached. For vllm (177 packages), that means 177 cache lookups + metadata checks + links.
zerostart's warm path is three operations in Rust:
1. `stat(".complete")` — does the cached environment exist?
2. `find("lib/python*/site-packages")` — locate it
3. `exec(python)` — run directly
No resolution, no environment setup, no uv involved.
### Shared CUDA layer cache
CUDA libraries (nvidia-cublas, nvidia-cudnn, nvidia-nccl, etc.) are ~6GB and identical across torch, vllm, and diffusers environments. zerostart caches extracted wheels at `$ZEROSTART_CACHE/shared_wheels/` and hardlinks them into new environments — so the second torch-based environment skips downloading those 6GB entirely.
## Quick Start
```bash
# Run a package (like uvx)
zerostart run torch -- -c "import torch; print(torch.cuda.is_available())"

# Run a script with dependencies
zerostart run -p torch -p transformers serve.py

# With a requirements file
zerostart run -r requirements.txt serve.py

# Pass args to your script
zerostart run serve.py -- --port 8000
```
### PEP 723 Inline Script Metadata

```python
# /// script
# dependencies = ["torch", "transformers"]
# ///
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
print(f"Loaded on {model.device}")
```

```bash
zerostart run serve.py # deps auto-detected from script
```
## Architecture
- **All wheels through the streaming daemon** — every package with a wheel URL goes through parallel download+extract. Only sdist-only packages (rare) fall back to `uv pip install`.
- **Atomic extraction** — each wheel extracts to a staging directory, then renames into site-packages. Partial extractions never corrupt the target.
- **No venv overhead** — uses a flat site-packages directory with a content-addressed cache key. No `uv venv` on the critical path.
- **Demand-driven scheduling** — when Python hits `import torch`, the daemon reprioritizes torch to the front of the download queue.
zerostart caches completed environments in `.zerostart/`. If the same requirements are resolved again, it reuses the cached environment (~0s install). The cache key is a hash of the resolved artifact set.
```bash
# Crank up parallelism on a fast network
ZS_PARALLEL_DOWNLOADS=32 ZS_CHUNK_MB=32 zerostart run -v -p torch test.py
```
## Requirements

- Linux (container GPU providers: RunPod, Vast.ai, Lambda, etc.)
- Python 3.10+
macOS works for development (same CLI, no streaming optimization).
## gpu-cli Integration
If you use [gpu-cli](https://gpu-cli.sh):
```bash
# Your script runs on a GPU pod with fast package loading
gpu run "zerostart run serve.py"
```