Commit 9305251

Update README with progressive loading, new benchmarks, current architecture
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent cb039b8 commit 9305251

1 file changed: README.md (68 additions & 53 deletions)

@@ -1,6 +1,6 @@
 # zerostart
 
-Parallel streaming wheel extraction for installing large Python packages on remote GPUs.
+Fast cold starts for Python GPU applications. Drop-in wrapper that installs packages in the background while your app starts running.
 
 ```bash
 zerostart run serve.py
@@ -10,21 +10,32 @@ Auto-detects dependencies from PEP 723 inline metadata, `pyproject.toml`, or `re
 
 ## Benchmarks
 
-### Cold Start (first run, empty cache)
+### Inference Server Cold Start
 
-Cold start speedup depends on pod network bandwidth. zerostart opens multiple parallel HTTP connections per wheel — this helps when a single connection can't saturate the link, but doesn't help when one connection already maxes out the pipe.
+Time-to-health-check for an inference server (torch + transformers + fastapi + uvicorn) on RTX A6000:
 
-| Pod network | Workload | zerostart | uv | Speedup |
-|-------------|----------|-----------|-----|---------|
-| Moderate (~200 Mbps) | torch (6.8 GB) | 33s | 98s | 3x |
-| Moderate (~200 Mbps) | triton (638 MB) | 3.4s | 1.0s | uv faster |
-| Fast (~1 Gbps) | diffusers+torch (7 GB) | 57s | 57s | ~1x |
+| Metric | uv | zerostart | Speedup |
+|--------|-----|-----------|---------|
+| Health check ready | 46s | **1.8s** | **25x** |
+| First CUDA inference | 55s | **33s** | 1.7x |
 
-On bandwidth-constrained pods (common with cheaper providers), parallel Range requests download large wheels 3x faster. On fast-network pods, a single connection already saturates the link and both tools finish in about the same time. For small packages, zerostart's startup overhead makes uv faster — just use uv directly.
+With progressive loading, your server accepts health probes in under 2 seconds while torch installs in the background. Kubernetes/load balancers see a healthy pod almost immediately.
+
+### AI Pipeline Cold Starts
+
+Cold start benchmarks across real AI workloads on RTX A6000:
+
+| Workload | uv | zerostart | Speedup |
+|----------|-----|-----------|---------|
+| Diffusion (torch+diffusers+transformers) | 51s | **40s** | 22% faster |
+| LLM Fine-tuning (torch+transformers+peft+trl) | 59s | **38s** | 36% faster |
+| Computer Vision (torch+torchvision+opencv) | 54s | **40s** | 26% faster |
+| Audio/Speech (torch+transformers+numpy) | 48s | **31s** | 35% faster |
+| Data Science (pandas+sklearn+xgboost) | 8s | 8s | ~1x |
 
 ### Warm Start (cached environment)
 
-Warm starts are where zerostart consistently wins regardless of network speed. uv re-resolves dependencies and rebuilds the environment on every invocation. zerostart checks a cache marker and exec's Python directly.
+Warm starts are where zerostart consistently wins. uv re-resolves and rebuilds the environment on every invocation. zerostart checks a cache marker and exec's Python directly.
 
 | Workload | zerostart | uv | Speedup |
 |----------|-----------|-----|---------|
@@ -34,52 +45,52 @@ Warm starts are where zerostart consistently wins regardless of network speed. u
 
 All measured on RunPod (RTX 4090 / A6000).
 
-### End-to-End: Install + Download + Load Model + Inference
-
-Full cold-to-inference benchmark with Qwen3.5-35B-A3B (34.7B params, MoE) on RTX A6000. Includes 20-token generation to verify correct model loading. Each variant runs on its own isolated pod with HF cache cleared for fair cold-start comparison.
+## How It Works
 
-| Test | Baseline (uv) | zerostart | Notes |
-|------|---------------|-----------|-------|
-| **Full cold** | 569s | 642s | HF download (~7 min) dominates both paths |
-| **Warm** | 144s | **128s** | zerostart 11% faster — accelerate() eliminates import overhead |
-| **Cold install** (HF cached) || 182s | Realistic "second deploy" scenario |
+### Progressive loading
 
-**Where zerostart wins (warm path):** torch import 0.0s vs 3.4s, transformers import 0.1s vs 5.8s, model load 35s vs 40s. `accelerate()` patches `from_pretrained` to skip random weight init and use direct safetensors loading.
+Python starts immediately while packages install in the background. A lazy import hook blocks each `import` only until that specific package is extracted:
 
-**Where baseline wins (cold path):** uv installs faster (42s vs 60s) when installing alone — zerostart's daemon and uv compete for bandwidth. HF model download time varies by pod (~6.5 min vs ~7.4 min). For full cold starts, the 7-minute HF download dwarfs any install optimization.
+```
+uv (sequential):
+  [====== install all packages ======] then [start Python]
+  46s to health check
+
+zerostart (progressive):
+  [start Python immediately]
+  import fastapi → ready in 0.3s → /health responding at 1.8s
+  import torch → blocks 23s → first inference at 33s
+  [======= packages installing in background =======]
+```
 
-## How It Works
+Small packages (fastapi, uvicorn) are available in under a second. Large packages (torch, transformers) block on import until extracted. Your app runs as soon as its first imports resolve.
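
The lazy import hook described in this hunk can be sketched in pure Python. This is an illustration of the mechanism only, not zerostart's actual hook (which lives in Rust); the class name `LazyImportGate` and its methods are invented for the sketch:

```python
import importlib.abc
import sys
import threading

class LazyImportGate(importlib.abc.MetaPathFinder):
    """Block `import pkg` until the installer signals pkg is extracted.

    Hypothetical sketch: package names are tracked in two sets and a
    condition variable; the real hook's bookkeeping is not shown here.
    """

    def __init__(self):
        self._pending = set()  # packages still being installed
        self._cond = threading.Condition()

    def mark_pending(self, name):
        with self._cond:
            self._pending.add(name)

    def mark_ready(self, name):
        # Called by the background installer when a wheel finishes extracting.
        with self._cond:
            self._pending.discard(name)
            self._cond.notify_all()

    def find_spec(self, fullname, path=None, target=None):
        top = fullname.split(".")[0]
        with self._cond:
            # Block only imports of packages still being installed.
            while top in self._pending:
                self._cond.wait()
        return None  # defer to the normal finders once the package exists

gate = LazyImportGate()
sys.meta_path.insert(0, gate)
```

A background daemon would call `mark_ready(pkg)` as each wheel finishes extracting; imports of anything not pending pass straight through with no added latency.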
 
-### Cold starts: parallel Range-request streaming
+### GET+pipeline architecture
 
-uv downloads each wheel as a single HTTP connection. zerostart uses HTTP Range requests to download multiple chunks of each wheel in parallel, and starts extracting files while chunks are still arriving:
+Downloads full wheels via single GET requests (32 parallel connections), then pipelines extraction through 4 parallel workers. Biggest wheels download first to maximize overlap:
 
 ```
-uv (sequential per wheel):
-torch.whl [=========downloading=========>] then [==extracting==]
-numpy.whl [=====>] then [=]
-
-zerostart (parallel chunks, overlapped extraction):
-torch.whl chunk1 [====>]──extract──►
-          chunk2 [====>]──extract──►  ← 4 concurrent Range requests
-          chunk3 [====>]──extract──►    per large wheel
-          chunk4 [====>]──extract──►
-numpy.whl [=>]──extract──►            ← all wheels in parallel
+Download: torch [==============>] numpy [=>] six [>]
+Extract:  [worker 0: torch ======>]
+          [worker 1: numpy =>]
+          [worker 2: six >]
 ```
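
The GET+pipeline scheduling added in this hunk can be sketched with two thread pools. This is an illustrative sketch, not the Rust scheduler; `download` and `extract` are placeholder callables supplied by the caller:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def install(wheels, download, extract, n_downloads=32, n_extractors=4):
    """Download wheels with one GET each, pipelining extraction.

    wheels: list of dicts with "name" and "size". Biggest wheels are
    submitted first so the longest extractions start as early as possible
    and overlap with the remaining downloads.
    """
    ordered = sorted(wheels, key=lambda w: w["size"], reverse=True)
    with ThreadPoolExecutor(n_downloads) as dl_pool, \
         ThreadPoolExecutor(n_extractors) as ex_pool:

        def pipeline(wheel):
            data = download(wheel)               # one GET per wheel
            return ex_pool.submit(extract, wheel, data)

        dl_futures = [dl_pool.submit(pipeline, w) for w in ordered]
        extractions = [f.result() for f in dl_futures]  # downloads done
        wait(extractions)                        # all wheels extracted
```

Separate pools mean a small wheel never queues behind a large one: its download worker hands it to any free extract worker immediately.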
 
-### Warm starts: Rust cache check vs re-resolve
+### System CUDA detection
 
-uv re-resolves dependencies and rebuilds the tool environment on every invocation — even when packages are cached. For vllm (177 packages), that means metadata checks and linking for each one.
+On pods with CUDA pre-installed, zerostart detects the system CUDA version and skips downloading nvidia-* wheels when the system already provides compatible libraries. This saves ~2-6GB of downloads depending on the workload.
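
A best-effort sketch of that detection in Python, for illustration only: zerostart's actual probe is not shown in this diff, and the paths below are conventional CUDA toolkit install locations, not guaranteed to match what zerostart checks:

```python
import json
import re
from pathlib import Path

def system_cuda_version():
    """Best-effort check for a preinstalled CUDA toolkit.

    Returns a version string like "12.4", or None if nothing is found.
    Assumes the conventional /usr/local/cuda install prefix.
    """
    vjson = Path("/usr/local/cuda/version.json")
    if vjson.exists():  # modern toolkits ship a JSON version manifest
        info = json.loads(vjson.read_text())
        return info.get("cuda", {}).get("version")
    vtxt = Path("/usr/local/cuda/version.txt")  # older toolkits
    if vtxt.exists():
        m = re.search(r"(\d+\.\d+)", vtxt.read_text())
        return m.group(1) if m else None
    return None

def skip_nvidia_wheels(requirements, cuda_version):
    # If the pod already provides CUDA, drop nvidia-* wheels from
    # the install set (the ~2-6GB saving mentioned above).
    if cuda_version is None:
        return requirements
    return [r for r in requirements if not r.startswith("nvidia-")]
```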
+
+### Shared CUDA layer cache
+
+CUDA libraries (nvidia-cublas, nvidia-cudnn, nvidia-nccl, etc.) are ~6GB and identical across torch, vllm, and diffusers environments. zerostart caches extracted wheels and hardlinks them into new environments — so the second torch-based environment skips re-extracting those 6GB.
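
The hardlink step can be illustrated in a few lines; `hardlink_tree` is an invented helper name, not zerostart's API:

```python
import os
from pathlib import Path

def hardlink_tree(cached: Path, site_packages: Path):
    """Link an extracted wheel from the shared cache into a new environment.

    Hardlinks cost no disk space and no copy time; both paths must live on
    the same filesystem for os.link to succeed.
    """
    for src in cached.rglob("*"):
        dst = site_packages / src.relative_to(cached)
        if src.is_dir():
            dst.mkdir(parents=True, exist_ok=True)
        else:
            dst.parent.mkdir(parents=True, exist_ok=True)
            os.link(src, dst)  # same inode, zero bytes copied
```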
+
+### Warm starts: Rust cache check
 
 zerostart's warm path is three operations in Rust:
 1. `stat(".complete")` — does the cached environment exist?
 2. `find("lib/python*/site-packages")` — locate it
 3. `exec(python)` — run directly
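
A Python equivalent of those three steps, for illustration only (the real implementation is the Rust binary, and `warm_start` is an invented name):

```python
import glob
import os
import sys

def warm_start(env_dir, script):
    """Warm path sketch: stat a marker, find site-packages, exec Python.

    Returns False (fall through to the cold path) if the cached
    environment is missing or incomplete; otherwise never returns,
    because execve replaces the current process.
    """
    if not os.path.exists(os.path.join(env_dir, ".complete")):   # 1. stat
        return False
    matches = glob.glob(
        os.path.join(env_dir, "lib/python*/site-packages"))      # 2. find
    if not matches:
        return False
    env = dict(os.environ, PYTHONPATH=matches[0])
    os.execve(sys.executable, [sys.executable, script], env)     # 3. exec
```

No resolver, no metadata checks, no environment rebuild: the warm path does constant work regardless of how many packages the environment contains.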
 
-### Shared CUDA layer cache
-
-CUDA libraries (nvidia-cublas, nvidia-cudnn, nvidia-nccl, etc.) are ~6GB and identical across torch, vllm, and diffusers environments. zerostart caches extracted wheels and hardlinks them into new environments — so the second torch-based environment skips re-extracting those 6GB.
-
 ## Install
 
 ```bash
@@ -203,54 +214,58 @@ Auto-registers via vLLM's plugin system when zerostart is installed.
 
 ## Architecture
 
-The entire cold path runs in Rust — no Python orchestrator:
+The entire cold path runs in Rust with progressive loading:
 
 ```
 zerostart run -p torch serve.py
 
 1. Find Python (uv python find || which python3)
 2. Check warm cache (stat .complete marker — instant)
 3. Resolve deps (uv pip compile --format pylock.toml)
-4. Check shared cache (hardlink cached CUDA libs)
-5. Stream wheels (parallel Range-request download + extract)
-6. exec(python) (replaces process, no overhead)
+4. Detect system CUDA (skip nvidia wheels if system libs match)
+5. Check shared cache (hardlink cached CUDA libs)
+6. Start daemon (GET+pipeline download + parallel extract)
+7. Start Python (immediately, with lazy import hook)
+   └─ imports block only until their package is extracted
 ```
 
 Key design decisions:
 
-- **All wheels through the streaming daemon** — every package with a wheel URL goes through parallel download+extract. Only sdist-only packages (rare) fall back to `uv pip install`.
+- **Progressive loading** — Python starts immediately. Imports block per-package, not globally. Health checks respond in seconds, not minutes.
+- **GET+pipeline** — single GET per wheel (maximizes bandwidth), 4 parallel extract workers (prevents small wheels queuing behind large ones).
+- **System CUDA detection** — skips nvidia-* wheels when the pod already has compatible CUDA libraries, saving ~2-6GB of downloads.
 - **Atomic extraction** — each wheel extracts to a staging directory, then renames into site-packages. Partial extractions never corrupt the target.
 - **No venv overhead** — uses a flat site-packages directory with a content-addressed cache key.
-- **Demand-driven scheduling** — when Python hits `import torch`, the daemon reprioritizes torch to the front of the download queue.
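
The atomic-extraction decision can be sketched as: extract into a staging directory inside site-packages (same filesystem), then rename entries into place. `extract_atomically` is an invented name for illustration, and the sketch assumes a fresh environment where the target entries don't already exist:

```python
import os
import tempfile
import zipfile

def extract_atomically(wheel_path, site_packages):
    """Extract a wheel so a crash mid-extract never corrupts the target.

    The staging dir lives inside site_packages so the final os.replace is
    a same-filesystem rename, which is atomic on POSIX.
    """
    staging = tempfile.mkdtemp(dir=site_packages, prefix=".staging-")
    with zipfile.ZipFile(wheel_path) as whl:
        whl.extractall(staging)  # partial state is confined to staging
    for entry in os.listdir(staging):
        os.replace(os.path.join(staging, entry),
                   os.path.join(site_packages, entry))  # atomic rename
    os.rmdir(staging)
```

If the process dies during `extractall`, site-packages only gains a hidden `.staging-*` directory that can be swept up later; the importable tree is never half-written.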
 
 ## Tuning
 
 | Variable | Default | Description |
 |----------|---------|-------------|
-| `ZS_PARALLEL_DOWNLOADS` | 16 | Concurrent HTTP connections |
+| `ZS_PARALLEL_DOWNLOADS` | 32 | Concurrent HTTP connections |
 | `ZS_EXTRACT_THREADS` | num_cpus * 2 | Parallel extraction threads |
-| `ZS_CHUNK_MB` | 16 | Streaming chunk size (MB) for Range requests |
 | `ZEROSTART_CACHE` | `~/.cache/zerostart` | Cache directory |
+| `ZS_NO_SYSTEM_CUDA` | unset | Disable system CUDA detection |
+| `ZS_NO_SHARED_CACHE` | unset | Disable shared wheel cache |
 
 ## When to Use It
 
 **Good fit:**
+- Inference servers — health check in <2s on cold start vs 45s+ with uv
 - Repeated runs on the same pod — warm starts are 4-7x faster than uv
-- Large GPU packages on bandwidth-constrained pods — parallel downloads help when a single connection is slow
-- Spot instances, CI/CD, autoscaling where you restart often and warm cache pays off
+- Spot instances, CI/CD, autoscaling where you restart often
+- Large GPU packages (torch, vllm, diffusers) — parallel downloads + progressive loading
 
 **Not worth it:**
-- One-off cold starts on fast-network pods — uv is just as fast
-- Small packages — uv is faster, zerostart adds startup overhead
-- Local NVMe with models in page cache
+- Small packages only — uv is faster, zerostart adds startup overhead
+- One-off scripts that don't need to respond to health checks during install
 
 ## Requirements
 
 - Linux (container GPU providers: RunPod, Vast.ai, Lambda, etc.)
 - `uv` for dependency resolution (pre-installed on most GPU containers)
 - Python 3.10+
 
-macOS works for development (same CLI, no streaming optimization).
+macOS works for development (same CLI, no GPU optimization).
 
 ## gpu-cli Integration
 