Plans

Great stuff. I'm doing something similar on the storage front and have forked your work.
I'm working on expanding it to address the shortfalls, and would love to work together.
I already have a plan and am executing on it.

Here's my first pass. I'll define it better once I get through it, since I still need to merge it with my storage concept and the Kubernetes integration.

Phase 1 — Security & Correctness [CRITICAL]

 Add seccomp-bpf filter to VMM host process
Small · 2–3 days | tags: security, rust
 Inject CSPRNG reseed before every fork (kernel + userspace numpy/OpenSSL)
Small · 2–3 days | tags: rust
 Audit vmstate parser for unsafe memory reads; add bounds checks
Small · 2–3 days | tags: rust
 Replace hardcoded demo API key with proper key issuance + scoping system
Small · 3–5 days | tags: ops, security
 Add per-key rate limiting and usage tracking in the API server
Small · 3–5 days | tags: rust, ops
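
For the per-key rate limiting and usage tracking item, here is a rough sketch of what I have in mind, assuming a token bucket (the algorithm is not settled yet). `RateLimiter`, `try_acquire` and `usage` are placeholder names, not anything in the current API server, and the caller passes in the clock so the logic stays easy to test.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Token bucket state for one API key (hypothetical type).
struct Bucket {
    tokens: f64,
    last_refill: Instant,
}

/// Per-key rate limiter that also counts accepted requests per key.
pub struct RateLimiter {
    capacity: f64,
    refill_per_sec: f64,
    buckets: HashMap<String, Bucket>,
    usage: HashMap<String, u64>,
}

impl RateLimiter {
    pub fn new(capacity: u32, refill_per_sec: f64) -> Self {
        RateLimiter {
            capacity: f64::from(capacity),
            refill_per_sec,
            buckets: HashMap::new(),
            usage: HashMap::new(),
        }
    }

    /// Returns true if a request for `key` is allowed at time `now`.
    pub fn try_acquire(&mut self, key: &str, now: Instant) -> bool {
        let capacity = self.capacity;
        let rate = self.refill_per_sec;
        // A key seen for the first time starts with a full bucket.
        let bucket = self.buckets.entry(key.to_string()).or_insert(Bucket {
            tokens: capacity,
            last_refill: now,
        });
        // Refill in proportion to elapsed time, capped at the bucket capacity.
        let elapsed = now.saturating_duration_since(bucket.last_refill).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * rate).min(capacity);
        bucket.last_refill = now;
        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            *self.usage.entry(key.to_string()).or_insert(0) += 1;
            true
        } else {
            false
        }
    }

    /// Number of accepted requests recorded for `key`.
    pub fn usage(&self, key: &str) -> u64 {
        self.usage.get(key).copied().unwrap_or(0)
    }
}
```

In the server this would sit behind a mutex or be sharded per key; rejected requests would map to a 429.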


Phase 2 — Observability & Operability [HIGH]

 Integrate OpenTelemetry tracing across fork lifecycle (spawn → run → teardown)
Small · 2–3 days | tags: rust, ops
 Add structured per-fork metrics (RSS, CoW page faults, wall-clock, exit code)
Small · 2–3 days | tags: rust, ops
 Wire up Prometheus /metrics endpoint with dashboard (Grafana template)
Small · 2–3 days | tags: ops
 Add streaming stdout via SSE or WebSocket (Axum native)
Medium · 3–5 days | tags: rust, ux
 Implement hard CPU wall-clock timeout with SIGKILL fallback per fork
Small · 2–3 days | tags: rust
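
To make the per-fork metrics and the /metrics endpoint a bit more concrete, here is a minimal sketch that renders the four listed values in the Prometheus text exposition format. The struct, the metric names and the `fork_id` label are my assumptions, and it assumes fork IDs are plain alphanumeric strings (no label escaping is done).

```rust
use std::fmt::Write;

/// Metrics captured for one fork (field names are hypothetical).
pub struct ForkMetrics {
    pub fork_id: String,
    pub rss_bytes: u64,
    pub cow_page_faults: u64,
    pub wall_clock_ms: u64,
    pub exit_code: i32,
}

/// Renders all forks as Prometheus text, one TYPE line per metric
/// followed by one sample per fork.
pub fn render_prometheus(forks: &[ForkMetrics]) -> String {
    // Metric name plus a function that extracts its value as text.
    let columns: [(&str, fn(&ForkMetrics) -> String); 4] = [
        ("fork_rss_bytes", |m: &ForkMetrics| m.rss_bytes.to_string()),
        ("fork_cow_page_faults", |m: &ForkMetrics| m.cow_page_faults.to_string()),
        ("fork_wall_clock_ms", |m: &ForkMetrics| m.wall_clock_ms.to_string()),
        ("fork_exit_code", |m: &ForkMetrics| m.exit_code.to_string()),
    ];
    let mut out = String::new();
    for (name, value_of) in columns.iter() {
        writeln!(out, "# TYPE {} gauge", name).unwrap();
        for fork in forks {
            writeln!(out, "{}{{fork_id=\"{}\"}} {}", name, fork.fork_id, value_of(fork)).unwrap();
        }
    }
    out
}
```

A per-fork label will get expensive at high fork rates, so the real endpoint would probably aggregate into histograms instead; this only shows the shape of the data.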


Phase 3 — Resource Isolation [HIGH]

 Wrap each fork process in its own cgroup v2 slice (memory + CPU quota)
Medium · 1 week | tags: rust, ops
 Implement CoW dirty-page cap — evict/kill sandbox when it exceeds memory limit
Medium · 1 week | tags: rust
 Add multi-vCPU support: restore LAPIC per vCPU, handle IPI/INIT-SIPI sequence
Medium · 1–2 weeks | tags: rust
 Add filesystem artifact extraction: capture stdout + size-capped /tmp tarball on exit
Medium · 1 week | tags: rust, ux
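
For the cgroup v2 and dirty-page cap items, a small sketch of the values involved, under my own assumptions: one cgroup directory per fork named `fork-<id>`, a fixed 100 ms CPU period, and CPU limits expressed in thousandths of a core. It only computes what would be written to `memory.max` and `cpu.max`; it does not touch the filesystem.

```rust
use std::path::{Path, PathBuf};

/// Limits for one fork (hypothetical type).
pub struct ForkLimits {
    pub memory_max_bytes: u64,
    /// CPU allowance in thousandths of one core (500 = half a core).
    pub cpu_millis: u32,
}

/// CPU accounting period in microseconds (assumed value).
const CPU_PERIOD_US: u64 = 100_000;

/// Returns (file, contents) pairs to write when setting up the fork's cgroup.
pub fn cgroup_writes(root: &Path, fork_id: &str, limits: &ForkLimits) -> Vec<(PathBuf, String)> {
    let dir = root.join(format!("fork-{}", fork_id));
    // cpu.max takes "<quota> <period>", both in microseconds.
    let quota_us = CPU_PERIOD_US * u64::from(limits.cpu_millis) / 1000;
    vec![
        (dir.join("memory.max"), limits.memory_max_bytes.to_string()),
        (dir.join("cpu.max"), format!("{} {}", quota_us, CPU_PERIOD_US)),
    ]
}

/// True when the fork's dirty CoW pages exceed its byte cap.
pub fn exceeds_dirty_cap(dirty_pages: u64, page_size: u64, cap_bytes: u64) -> bool {
    dirty_pages.saturating_mul(page_size) > cap_bytes
}
```

How the dirty-page count is obtained (KVM dirty log, userfaultfd, or something else) is still open; the check above just takes the count as input.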


Phase 4 — Networking [CRITICAL]

 Design TAP/veth pool manager: pre-allocated interfaces with unique MAC/IP per fork
Large · 1 week design | tags: network, rust
 Re-snapshot template with idle virtio-net device; validate state restore with NIC attached
Large · 1 week | tags: network, rust
 Implement TAP fd injection at fork time and per-fork IP assignment (172.16.x.x/30 pairs)
Large · 1–2 weeks | tags: network, rust
 Add egress firewall rules per fork (iptables/nftables): block inter-fork, allow outbound only
Medium · 1 week | tags: network, security
 Expose optional network enable flag in API; default off for compute-only sandboxes
Small · 2–3 days | tags: rust, ux
 Update Python + Node SDKs to surface network access option
Small · 2 days | tags: ux
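
The per-fork /30 addressing can be sketched as plain arithmetic. This assumes the pool is 172.16.0.0/16 carved into consecutive /30 blocks, with the first usable address on the host side of the TAP and the second inside the guest. `ForkNet`, `fork_net` and the MAC scheme (locally administered prefix plus the pool index) are my placeholders.

```rust
use std::net::Ipv4Addr;

/// Addressing for one fork (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct ForkNet {
    pub host_ip: Ipv4Addr,
    pub guest_ip: Ipv4Addr,
    pub guest_mac: [u8; 6],
}

/// Maps a pool index to its /30 block inside 172.16.0.0/16.
/// Returns None once the pool is exhausted.
pub fn fork_net(index: u32) -> Option<ForkNet> {
    // A /16 has 65536 addresses, so it holds 65536 / 4 = 16384 /30 blocks.
    const BLOCKS: u32 = 16_384;
    if index >= BLOCKS {
        return None;
    }
    let base = u32::from(Ipv4Addr::new(172, 16, 0, 0)) + index * 4;
    let idx = index.to_be_bytes();
    Some(ForkNet {
        // base + 0 is the network address and base + 3 the broadcast address.
        host_ip: Ipv4Addr::from(base + 1),
        guest_ip: Ipv4Addr::from(base + 2),
        // First octet 0x02 marks a locally administered unicast MAC.
        guest_mac: [0x02, 0x00, idx[0], idx[1], idx[2], idx[3]],
    })
}
```

That gives 16384 concurrent networked forks per host before the pool needs a wider range.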


Phase 5 — Persistent Sessions (REPL Mode) [HIGH]

 Design session model: fork stays alive, second serial port or vsock for follow-up commands
Large · design spike 1 week | tags: rust, network
 Implement session manager: TTL, keep-alive pings, idle eviction
Large · 1–2 weeks | tags: rust, ops
 Add session API endpoints: create_session, exec_in_session, close_session
Medium · 1 week | tags: rust, ux
 Update SDKs with session abstraction (sb.session() context manager)
Small · 3 days | tags: ux
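
A minimal sketch of the session bookkeeping (TTL, keep-alive, idle eviction), assuming sessions are tracked by a numeric ID and a last-seen timestamp. The method names mirror the endpoints above but are otherwise hypothetical, and the actual fork teardown is left out.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks live sessions and when each was last used (hypothetical type).
pub struct SessionManager {
    idle_ttl: Duration,
    last_seen: HashMap<u64, Instant>,
    next_id: u64,
}

impl SessionManager {
    pub fn new(idle_ttl: Duration) -> Self {
        SessionManager { idle_ttl, last_seen: HashMap::new(), next_id: 1 }
    }

    /// Registers a new session and returns its ID.
    pub fn create_session(&mut self, now: Instant) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.last_seen.insert(id, now);
        id
    }

    /// Refreshes the idle timer; returns false if the session is gone.
    pub fn keep_alive(&mut self, id: u64, now: Instant) -> bool {
        match self.last_seen.get_mut(&id) {
            Some(seen) => {
                *seen = now;
                true
            }
            None => false,
        }
    }

    /// Closes a session explicitly; returns false if it did not exist.
    pub fn close_session(&mut self, id: u64) -> bool {
        self.last_seen.remove(&id).is_some()
    }

    /// Removes and returns (sorted) every session idle for at least the TTL.
    pub fn evict_idle(&mut self, now: Instant) -> Vec<u64> {
        let ttl = self.idle_ttl;
        let mut expired: Vec<u64> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) >= ttl)
            .map(|(id, _)| *id)
            .collect();
        expired.sort_unstable();
        for id in &expired {
            self.last_seen.remove(id);
        }
        expired
    }
}
```

`exec_in_session` would call `keep_alive` on every command, and a background task would call `evict_idle` on a timer and kill the returned forks.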


Phase 6 — Template Management [MEDIUM]

 Build template registry: versioned snapshots stored in object store (S3/MinIO)
Medium · 1 week | tags: infra, ops
 Add snapshot versioning API: create, list, pin, deprecate
Small · 3–5 days | tags: rust, ops
 Implement parallel re-snapshot pipeline to reduce 15s blocking window
Medium · 1 week | tags: rust, infra
 Add runtime template library: Python 3.11/3.12, Node 20/22, Go 1.22, Ruby 3
Medium · 1 week per runtime | tags: infra
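
For the versioning API (create, list, pin, deprecate), here is an in-memory sketch of the semantics I'm leaning toward. All of it is an assumption at this point: versions are numbered from 1 per template, a pin overrides "latest", deprecating a pinned version drops the pin, and `resolve` is an extra helper I added to show which version a fork request would get.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Active,
    Deprecated,
}

/// In-memory template registry (hypothetical; the real one would sit on an object store).
#[derive(Default)]
pub struct TemplateRegistry {
    /// Index i holds the status of version i + 1.
    versions: BTreeMap<String, Vec<Status>>,
    pinned: BTreeMap<String, u32>,
}

impl TemplateRegistry {
    /// Adds a new active version and returns its number.
    pub fn create(&mut self, name: &str) -> u32 {
        let list = self.versions.entry(name.to_string()).or_default();
        list.push(Status::Active);
        list.len() as u32
    }

    pub fn list(&self, name: &str) -> Vec<(u32, Status)> {
        self.versions
            .get(name)
            .map(|v| v.iter().enumerate().map(|(i, s)| (i as u32 + 1, *s)).collect())
            .unwrap_or_default()
    }

    fn status_mut(&mut self, name: &str, version: u32) -> Option<&mut Status> {
        let idx = (version as usize).checked_sub(1)?;
        self.versions.get_mut(name)?.get_mut(idx)
    }

    /// Pins an existing, active version; returns false otherwise.
    pub fn pin(&mut self, name: &str, version: u32) -> bool {
        let is_active = matches!(self.status_mut(name, version), Some(Status::Active));
        if is_active {
            self.pinned.insert(name.to_string(), version);
        }
        is_active
    }

    /// Marks a version deprecated and drops the pin if it pointed at it.
    pub fn deprecate(&mut self, name: &str, version: u32) -> bool {
        let found = match self.status_mut(name, version) {
            Some(status) => {
                *status = Status::Deprecated;
                true
            }
            None => false,
        };
        if found && self.pinned.get(name) == Some(&version) {
            self.pinned.remove(name);
        }
        found
    }

    /// Pinned version if any, otherwise the newest active one.
    pub fn resolve(&self, name: &str) -> Option<u32> {
        if let Some(v) = self.pinned.get(name) {
            return Some(*v);
        }
        let versions = self.versions.get(name)?;
        versions.iter().rposition(|s| *s == Status::Active).map(|i| i as u32 + 1)
    }
}
```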


Phase 7 — Multi-Host Scale-Out [INFRA]

 Design control plane: stateless API front-end + per-host fork agents
Large · architecture sprint | tags: infra, rust
 Implement snapshot distribution: replicate to all hosts on template publish
Large · 1–2 weeks | tags: infra
 Add host health + capacity scheduler: route fork requests to lowest-load host
Large · 2 weeks | tags: infra, rust
 Implement fork request queue with backpressure and overflow rejection
Medium · 1 week | tags: rust, ops
 Add cluster-wide metrics aggregation (Prometheus federation or Thanos)
Medium · 1 week | tags: ops
 Write deploy automation: Ansible/Terraform for multi-node provisioning
Medium · 1 week | tags: ops, infra
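
The scheduler and queue items roughly come down to two small pieces, sketched here under my own assumptions: "load" is the ratio of active forks to capacity, unhealthy or full hosts are skipped, and the queue rejects new work once it reaches a fixed length. `Host`, `pick_host` and `ForkQueue` are placeholder names.

```rust
use std::collections::VecDeque;

/// What the scheduler knows about one host (hypothetical type).
pub struct Host {
    pub name: String,
    pub healthy: bool,
    pub active_forks: u32,
    pub capacity: u32,
}

/// Picks the healthy host with the lowest active/capacity ratio, if any has room.
pub fn pick_host(hosts: &[Host]) -> Option<&Host> {
    hosts
        .iter()
        .filter(|h| h.healthy && h.active_forks < h.capacity)
        .min_by(|a, b| {
            // Compare a.active/a.capacity with b.active/b.capacity
            // by cross-multiplying, which avoids floating point.
            let lhs = u64::from(a.active_forks) * u64::from(b.capacity);
            let rhs = u64::from(b.active_forks) * u64::from(a.capacity);
            lhs.cmp(&rhs)
        })
}

/// Bounded FIFO queue: pushing onto a full queue hands the item back.
pub struct ForkQueue<T> {
    max_len: usize,
    items: VecDeque<T>,
}

impl<T> ForkQueue<T> {
    pub fn new(max_len: usize) -> Self {
        ForkQueue { max_len, items: VecDeque::new() }
    }

    /// Ok(()) if queued, Err(item) if the queue is full (overflow rejection).
    pub fn try_push(&mut self, item: T) -> Result<(), T> {
        if self.items.len() >= self.max_len {
            return Err(item);
        }
        self.items.push_back(item);
        Ok(())
    }

    pub fn pop(&mut self) -> Option<T> {
        self.items.pop_front()
    }
}
```

An `Err` from `try_push` is where the API front-end would return a 503 with a retry hint instead of letting requests pile up.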


Phase 8 — Production Hardening [MEDIUM]

 Add fork engine fuzzing harness (cargo-fuzz on vmstate parser + CPU restore path)
Medium · 1 week | tags: rust, security
 Write integration test suite: 1000-fork stress, concurrent network, session stability
Medium · 1 week | tags: rust, ops
 Run formal KVM escape threat model; document residual risk + mitigations
Medium · 1 week | tags: security
 Add graceful degradation: host OOM → queue forks, emit 503 with retry-after
Small · 3 days | tags: rust, ops
 Add ARM64 KVM backend (separate CPU state restore path for Graviton/Ampere)
Large · 3–4 weeks | tags: rust, infra
 Publish API stability guarantee, changelog, and deprecation policy
Small · 2 days | tags: ux, ops

Plans #7
