Commit cd40eea: Update README with benchmarks and architecture explanation

10-12x faster cold starts, 10-31x faster warm starts vs uvx. Explains why: parallel Range-request streaming, Rust cache check, shared CUDA layer cache.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent: 85a7602
README.md: 1 file changed, 101 additions, 111 deletions
# zerostart

**Fast cold starts for GPU Python.** A drop-in replacement for `uvx` that is 10-12x faster on cold starts and 10-30x faster on warm starts.

```bash
# Instead of: uvx --from torch python serve.py
zerostart run -p torch serve.py
```

No code changes. No platform lock-in. Works on any container GPU provider.
## Benchmarks

Measured on RTX 4090 pods (RunPod). Both tools download the same wheels from PyPI — zerostart is faster because of *how* it downloads.

### Cold Start (first run, empty cache)

| Workload | Packages | Size | zerostart | uvx | Speedup |
|----------|----------|------|-----------|-----|---------|
| torch + CUDA | 27 wheels | 6.8 GB | **47.9s** | 512.2s | **10.7x** |
| vllm (full LLM stack) | 177 wheels | 9.4 GB | **69.6s** | 862.1s | **12.4x** |
| transformers + torch | 51 wheels | 7.0 GB | **45.4s** | | |
| diffusers + torch | 60 wheels | 7.0 GB | **50.4s** | | |
| triton | 1 wheel | 638 MB | **7.0s** | | |

### Warm Start (cached environment)

| Workload | zerostart | uvx | Speedup |
|----------|-----------|-----|---------|
| torch | **2.5s** | 25.5s | **10.2x** |
| vllm | **3.6s** | 114.0s | **31.7x** |
| transformers + torch | **3.9s** | | |
| diffusers + torch | **5.3s** | | |
| triton | **0.9s** | | |

On a faster pod (1 Gbps+ network), absolute cold-start times drop further, though the speedup over uvx narrows:

| Workload | zerostart | uvx | Speedup |
|----------|-----------|-----|---------|
| torch | **23.1s** | 90.9s | **3.9x** |
| vllm | **33.3s** | 138.1s | **4.1x** |

## Why Is It Faster?

### Cold starts: parallel Range-request streaming

uvx downloads each wheel over a single HTTP connection: an 873 MB torch wheel is one TCP stream.

zerostart uses **HTTP Range requests** to download multiple chunks of each wheel in parallel, and starts extracting files while chunks are still arriving:

```
uvx (sequential per wheel):
  torch.whl  [=========downloading=========>] then [==extracting==]
  numpy.whl  [=====>] then [=]

zerostart (parallel chunks, overlapped extraction):
  torch.whl  chunk1 [====>]──extract──►
             chunk2 [====>]──extract──►    ← 4 concurrent Range requests
             chunk3 [====>]──extract──►      per large wheel
             chunk4 [====>]──extract──►
  numpy.whl  [=>]──extract──►              ← all wheels in parallel
```

On a slow network, this is the difference between one connection at 15 MB/s (about 60 seconds for the torch wheel alone) and 16+ connections saturating the link.
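The chunking scheme above can be sketched in a few lines of Python. This is a minimal illustration, not zerostart's Rust implementation: the helper names (`byte_ranges`, `fetch_range`, `parallel_download`) are invented, and the overlapped streaming extraction is omitted — chunks are simply buffered and joined.

```python
import concurrent.futures
import urllib.request

def byte_ranges(total_size: int, chunk_bytes: int):
    """Split total_size bytes into inclusive (start, end) pairs,
    matching the HTTP Range header convention (bytes=start-end)."""
    return [(start, min(start + chunk_bytes, total_size) - 1)
            for start in range(0, total_size, chunk_bytes)]

def fetch_range(url: str, start: int, end: int) -> bytes:
    # Each Range request is an independent TCP stream
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def parallel_download(url: str, total_size: int, chunk_mb: int = 16,
                      workers: int = 4) -> bytes:
    ranges = byte_ranges(total_size, chunk_mb * 1024 * 1024)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda r: fetch_range(url, *r), ranges)
        return b"".join(chunks)

# An 873 MB torch wheel split into 16 MB chunks -> 55 Range requests
print(len(byte_ranges(873 * 1024 * 1024, 16 * 1024 * 1024)))  # 55
```

This only works when the wheel host honors Range requests; a real implementation would write each chunk at its byte offset rather than concatenating everything in memory.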
### Warm starts: Rust cache check vs full re-resolve

uvx re-resolves dependencies and rebuilds the tool environment on every invocation — even when packages are cached. For vllm (177 packages), that means 177 cache lookups + metadata checks + links.

zerostart's warm path is three operations in Rust:

1. `stat(".complete")` — does the cached environment exist?
2. `find("lib/python*/site-packages")` — locate it
3. `exec(python)` — run directly

No resolution, no environment setup, no uv involved.
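The three-step warm path translates almost directly into code. A sketch under an assumed cache layout — the real implementation is Rust, and the function name and directory structure here are illustrative:

```python
import glob
import os
import sys

def warm_start(env_root: str, script: str) -> None:
    """Illustrative warm path: stat, find, exec."""
    # 1. stat(".complete") -- a marker written only after a full install
    marker = os.path.join(env_root, ".complete")
    if not os.path.exists(marker):
        raise FileNotFoundError("no complete cached environment; take the cold path")
    # 2. find("lib/python*/site-packages") -- locate the cached packages
    site = glob.glob(os.path.join(env_root, "lib", "python*", "site-packages"))[0]
    # 3. exec(python) -- replace this process with the user's script
    env = dict(os.environ, PYTHONPATH=site)
    os.execve(sys.executable, [sys.executable, script], env)
```

Because the final step is `exec`, the launcher adds no lingering process or IPC overhead — the cached environment's Python simply becomes the running process.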
### Shared CUDA layer cache

CUDA libraries (nvidia-cublas, nvidia-cudnn, nvidia-nccl, etc.) are ~6 GB and identical across torch, vllm, and diffusers environments. zerostart caches extracted wheels at `$ZEROSTART_CACHE/shared_wheels/` and hardlinks them into new environments — so the second torch-based environment skips downloading those 6 GB entirely.
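Hardlinking the shared layer is just a `link(2)` walk over the extracted wheel tree. A sketch, assuming one extracted wheel per directory under the cache (the function name and layout are invented for illustration):

```python
import os

def link_shared_wheel(cache_dir: str, env_site: str) -> int:
    """Hardlink every file of a cached, extracted wheel into a new
    environment's site-packages. Hardlinked files share an inode, so
    the ~6 GB CUDA layer costs no extra disk space or download."""
    linked = 0
    for root, _dirs, files in os.walk(cache_dir):
        rel = os.path.relpath(root, cache_dir)
        dest_dir = os.path.join(env_site, rel)
        os.makedirs(dest_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(dest_dir, name)
            if not os.path.exists(dst):
                os.link(src, dst)  # same inode, zero bytes copied
                linked += 1
    return linked
```

Hardlinks require cache and environment to live on the same filesystem, which is why the cache directory sits alongside the environments.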

## Quick Start

```bash
# Run a package (like uvx)
zerostart run torch -- -c "import torch; print(torch.cuda.is_available())"

# Run a script with dependencies
zerostart run -p torch -p transformers serve.py

# With a requirements file
zerostart run -r requirements.txt serve.py

# Pass args to your script
zerostart run serve.py -- --port 8000
```

### PEP 723 Inline Script Metadata

```
# ... (example script elided in this diff view)
print(f"Loaded on {model.device}")
```

```bash
zerostart run serve.py   # deps auto-detected from script
```

## Architecture

The entire cold path runs in Rust — no Python orchestrator:

```
zerostart run -p torch serve.py

1. Find Python         (uv python find || which python3)
2. Check warm cache    (stat .complete marker — instant)
3. Resolve deps        (uv pip compile --format pylock.toml)
4. Check shared cache  (hardlink cached CUDA libs — parallel via rayon)
5. Stream wheels       (parallel Range-request download + extract)
6. exec(python)        (replaces process, no overhead)
```
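Step 2's warm-cache lookup is keyed by the resolved artifact set — a content-addressed cache key, so identical lockfiles map to the same cached environment. A hypothetical sketch of such a key; the real derivation is internal to zerostart:

```python
import hashlib

def cache_key(pylock_text: str) -> str:
    """Content-addressed cache key: a hash of the resolved artifact
    set (here, the lockfile text). Illustrative, not zerostart's
    actual key derivation."""
    return hashlib.sha256(pylock_text.encode()).hexdigest()[:16]

key = cache_key("torch==2.4.0\nnvidia-cublas-cu12==12.1\n")
print(key)  # stable 16-hex-char key for this exact resolved set
```

Any change in the resolved set — a version bump, an added package — produces a different key and therefore a fresh environment, while repeat runs hit the same directory.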

Key design decisions:

- **All wheels through the streaming daemon** — every package with a wheel URL goes through parallel download + extract. Only sdist-only packages (rare) fall back to `uv pip install`.
- **Atomic extraction** — each wheel extracts to a staging directory, then renames into site-packages. Partial extractions never corrupt the target.
- **No venv overhead** — uses a flat site-packages directory with a content-addressed cache key. No `uv venv` on the critical path.
- **Demand-driven scheduling** — when Python hits `import torch`, the daemon reprioritizes torch to the front of the download queue.

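Demand-driven scheduling hinges on intercepting imports. A minimal sketch of a `sys.meta_path` gate that blocks an import until its package is on disk, then falls through to Python's normal finders; this illustrates the idea only and is not zerostart's actual hook (class and method names are invented):

```python
import importlib.abc
import sys
import threading

class GateFinder(importlib.abc.MetaPathFinder):
    """Block gated imports until the installer marks them done,
    then defer to the regular import machinery."""

    def __init__(self):
        self._done = {}           # top-level package -> threading.Event
        self._lock = threading.Lock()

    def expect(self, name):
        """Gate a package that the installer has not finished yet."""
        with self._lock:
            self._done.setdefault(name, threading.Event())

    def mark_done(self, name):
        """Installer callback: the package is fully extracted."""
        with self._lock:
            self._done.setdefault(name, threading.Event()).set()

    def find_spec(self, fullname, path, target=None):
        top = fullname.split(".")[0]
        event = self._done.get(top)
        if event is not None and not event.is_set():
            # a real hook would also signal the daemon to reprioritize `top`
            event.wait()          # block until the wheel is extracted
        return None               # fall through to the normal finders

finder = GateFinder()
sys.meta_path.insert(0, finder)
```

Returning `None` is the key trick: the finder never loads anything itself, it only delays the lookup, so ordinary Python import machinery does all the real work once the files exist.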
## Tuning

Performance knobs via environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `ZS_PARALLEL_DOWNLOADS` | 16 | Concurrent HTTP connections |
| `ZS_EXTRACT_THREADS` | num_cpus * 2 | Parallel extraction threads |
| `ZS_CHUNK_MB` | 16 | Streaming chunk size (MB) for Range requests |
| `ZEROSTART_CACHE` | `~/.cache/zerostart` | Cache directory |

```bash
# Crank up parallelism on a fast network
ZS_PARALLEL_DOWNLOADS=32 ZS_CHUNK_MB=32 zerostart run -v -p torch test.py
```

## Requirements

- Linux (container GPU providers: RunPod, Vast.ai, Lambda, etc.)
- `uv` for requirement resolution (pre-installed on most GPU containers)
- Python 3.10+

macOS works for development (same CLI, no streaming optimization).

## gpu-cli Integration

If you use [gpu-cli](https://gpu-cli.sh):

```bash
# Your script runs on a GPU pod with fast package loading
gpu run "zerostart run -p torch serve.py"
```

## License
