Plans #7

@glennswest

Great stuff. I'm doing something similar on the storage front, and have forked your work.
I'm working on expanding it to address the shortfalls, and would love to work together.
I already have a plan and am executing on it.

Here's my first pass. I'll define it better once I've worked through it, since I also need to merge it with my storage concept and Kubernetes integration.

Phase 1 — Security & Correctness [CRITICAL]

Add seccomp-bpf filter to VMM host process
Small · 2–3 days | tags: security, rust
Inject CSPRNG reseed before every fork (kernel + userspace numpy/OpenSSL)
Small · 2–3 days | tags: rust
Audit vmstate parser for unsafe memory reads; add bounds checks
Small · 2–3 days | tags: rust
Replace hardcoded demo API key with proper key issuance + scoping system
Small · 3–5 days | tags: ops, security
Add per-key rate limiting and usage tracking in the API server
Small · 3–5 days | tags: rust, ops
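
To make the per-key rate limiting and usage tracking item a bit more concrete, here is a minimal token-bucket sketch. It is not taken from the existing API server: `RateLimiter`, `allow`, and `usage` are hypothetical names, the state is in-memory only, and the caller passes in the clock so the logic can be tested without sleeping.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Token bucket plus a request counter for one API key (illustrative only).
struct Bucket {
    tokens: f64,
    last_refill: Instant,
    total_requests: u64, // usage tracking: number of allowed requests
}

/// Hypothetical in-memory limiter; a real server would likely persist usage.
pub struct RateLimiter {
    capacity: f64,       // maximum burst size per key
    refill_per_sec: f64, // sustained requests per second per key
    buckets: HashMap<String, Bucket>,
}

impl RateLimiter {
    pub fn new(capacity: u32, refill_per_sec: f64) -> Self {
        RateLimiter {
            capacity: capacity as f64,
            refill_per_sec,
            buckets: HashMap::new(),
        }
    }

    /// Returns true if a request for `key` is allowed at time `now`.
    pub fn allow(&mut self, key: &str, now: Instant) -> bool {
        let capacity = self.capacity;
        let rate = self.refill_per_sec;
        // A key seen for the first time starts with a full bucket.
        let b = self.buckets.entry(key.to_string()).or_insert(Bucket {
            tokens: capacity,
            last_refill: now,
            total_requests: 0,
        });
        // Refill in proportion to elapsed time, capped at the burst size.
        let elapsed = now.saturating_duration_since(b.last_refill).as_secs_f64();
        b.tokens = (b.tokens + elapsed * rate).min(capacity);
        b.last_refill = now;
        if b.tokens >= 1.0 {
            b.tokens -= 1.0;
            b.total_requests += 1;
            true
        } else {
            false
        }
    }

    /// Number of allowed requests recorded for `key` so far.
    pub fn usage(&self, key: &str) -> u64 {
        self.buckets.get(key).map_or(0, |b| b.total_requests)
    }
}
```

With `RateLimiter::new(2, 1.0)`, a key can burst two requests, the third at the same instant is refused, and one more is allowed a second later. Rejected requests are deliberately not counted as usage here; that is a design choice that could go either way.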

Phase 2 — Observability & Operability [HIGH]

Integrate OpenTelemetry tracing across fork lifecycle (spawn → run → teardown)
Small · 2–3 days | tags: rust, ops
Add structured per-fork metrics (RSS, CoW page faults, wall-clock, exit code)
Small · 2–3 days | tags: rust, ops
Wire up Prometheus /metrics endpoint with dashboard (Grafana template)
Small · 2–3 days | tags: ops
Add streaming stdout via SSE or WebSocket (Axum native)
Medium · 3–5 days | tags: rust, ux
Implement hard CPU wall-clock timeout with SIGKILL fallback per fork
Small · 2–3 days | tags: rust
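
For the per-fork metrics and /metrics items, this is roughly what I have in mind for the output side: a small function that renders per-fork measurements in the Prometheus text exposition format. The struct, the metric names, and the `fork_id` label are all placeholders I made up for illustration, and the sketch assumes fork ids contain no characters that need label escaping.

```rust
/// Per-fork measurements; field names are illustrative, not from the project.
pub struct ForkMetrics {
    pub fork_id: String, // assumed to need no escaping inside a label value
    pub rss_bytes: u64,
    pub cow_page_faults: u64,
    pub wall_clock_ms: u64,
    pub exit_code: i32,
}

/// Append one metric family: a TYPE line, then one sample per fork.
fn emit(
    out: &mut String,
    name: &str,
    kind: &str,
    forks: &[ForkMetrics],
    value: impl Fn(&ForkMetrics) -> String,
) {
    out.push_str(&format!("# TYPE {} {}\n", name, kind));
    for f in forks {
        out.push_str(&format!(
            "{}{{fork_id=\"{}\"}} {}\n",
            name,
            f.fork_id,
            value(f)
        ));
    }
}

/// Render all forks as Prometheus text; an HTTP handler would return this body.
pub fn render_prometheus(forks: &[ForkMetrics]) -> String {
    let mut out = String::new();
    emit(&mut out, "fork_rss_bytes", "gauge", forks, |f| {
        f.rss_bytes.to_string()
    });
    emit(&mut out, "fork_cow_page_faults_total", "counter", forks, |f| {
        f.cow_page_faults.to_string()
    });
    // Prometheus convention is base units, so milliseconds become seconds.
    emit(&mut out, "fork_wall_clock_seconds", "gauge", forks, |f| {
        (f.wall_clock_ms as f64 / 1000.0).to_string()
    });
    emit(&mut out, "fork_exit_code", "gauge", forks, |f| {
        f.exit_code.to_string()
    });
    out
}
```

One caveat: a label per fork is high-cardinality, so for short-lived forks it may be better to aggregate into histograms and keep per-fork detail in traces or logs.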

Phase 3 — Resource Isolation [HIGH]

Wrap each fork process in its own cgroup v2 slice (memory + CPU quota)
Medium · 1 week | tags: rust, ops
Implement CoW dirty-page cap — evict/kill sandbox when it exceeds memory limit
Medium · 1 week | tags: rust
Add multi-vCPU support: restore LAPIC per vCPU, handle IPI/INIT-SIPI sequence
Medium · 1–2 weeks | tags: rust
Add filesystem artifact extraction: capture stdout + size-capped /tmp tarball on exit
Medium · 1 week | tags: rust, ux

Phase 4 — Networking [CRITICAL]

Design TAP/veth pool manager: pre-allocated interfaces with unique MAC/IP per fork
Large · 1 week design | tags: network, rust
Re-snapshot template with idle virtio-net device; validate state restore with NIC attached
Large · 1 week | tags: network, rust
Implement TAP fd injection at fork time and per-fork IP assignment (172.16.x.x/30 pairs)
Large · 1–2 weeks | tags: network, rust
Add egress firewall rules per fork (iptables/nftables): block inter-fork, allow outbound only
Medium · 1 week | tags: network, security
Expose optional network enable flag in API; default off for compute-only sandboxes
Small · 2–3 days | tags: rust, ux
Update Python + Node SDKs to surface network access option
Small · 2 days | tags: ux
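
The per-fork /30 addressing can be sketched as a pure function from a pool index to a host/guest address pair and a MAC. This assumes the pool is carved out of 172.16.0.0/16 (the plan only says 172.16.x.x), that the first usable address in each /30 goes to the host-side TAP and the second to the guest, and that MACs are locally administered and derived from the index. `ForkNet` and `net_for_fork` are hypothetical names.

```rust
use std::net::Ipv4Addr;

/// Addressing for one fork's point-to-point link (illustrative layout).
pub struct ForkNet {
    pub host_ip: Ipv4Addr,  // host side of the TAP, first usable address
    pub guest_ip: Ipv4Addr, // guest NIC, second usable address
    pub mac: [u8; 6],       // locally administered, unique per index
}

/// A /16 holds 65536 addresses, i.e. 16384 non-overlapping /30 subnets.
pub const MAX_FORKS: u32 = 16_384;

/// Map a pool index to its /30 pair, or None if the pool is exhausted.
pub fn net_for_fork(index: u32) -> Option<ForkNet> {
    if index >= MAX_FORKS {
        return None;
    }
    // Each /30 spans four addresses: network, host, guest, broadcast.
    let base = u32::from(Ipv4Addr::new(172, 16, 0, 0)) + index * 4;
    let idx = index.to_be_bytes();
    Some(ForkNet {
        host_ip: Ipv4Addr::from(base + 1),
        guest_ip: Ipv4Addr::from(base + 2),
        // 0x02 in the first octet marks a locally administered unicast MAC.
        mac: [0x02, 0x00, idx[0], idx[1], idx[2], idx[3]],
    })
}
```

Index 0 yields 172.16.0.1 (host) and 172.16.0.2 (guest); index 1 yields 172.16.0.5 and 172.16.0.6. Because the mapping is deterministic, the pool manager only needs to track which indices are in use.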

Phase 5 — Persistent Sessions (REPL Mode) [HIGH]

Design session model: fork stays alive, second serial port or vsock for follow-up commands
Large · design spike 1 week | tags: rust, network
Implement session manager: TTL, keep-alive pings, idle eviction
Large · 1–2 weeks | tags: rust, ops
Add session API endpoints: create_session, exec_in_session, close_session
Medium · 1 week | tags: rust, ux
Update SDKs with session abstraction (sb.session() context manager)
Small · 3 days | tags: ux
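
A rough sketch of the session manager bookkeeping (TTL, keep-alive, idle eviction), independent of how commands actually reach the fork. Everything here is an assumption for illustration: session ids are plain counters, the clock is passed in by the caller, and eviction only returns the ids so that the caller can tear down the matching forks.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Session {
    last_seen: Instant, // updated on create and on every keep-alive
}

/// Hypothetical session table; a real one would also hold the fork handle.
pub struct SessionManager {
    idle_ttl: Duration,
    next_id: u64,
    sessions: HashMap<u64, Session>,
}

impl SessionManager {
    pub fn new(idle_ttl: Duration) -> Self {
        SessionManager { idle_ttl, next_id: 1, sessions: HashMap::new() }
    }

    /// Register a new session and return its id.
    pub fn create_session(&mut self, now: Instant) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.sessions.insert(id, Session { last_seen: now });
        id
    }

    /// Refresh a session. Returns false if it is unknown or already past its TTL.
    pub fn keep_alive(&mut self, id: u64, now: Instant) -> bool {
        let ttl = self.idle_ttl;
        if let Some(s) = self.sessions.get_mut(&id) {
            if now.saturating_duration_since(s.last_seen) <= ttl {
                s.last_seen = now;
                return true;
            }
        }
        false
    }

    /// Explicit close. Returns false if the session did not exist.
    pub fn close_session(&mut self, id: u64) -> bool {
        self.sessions.remove(&id).is_some()
    }

    /// Remove every session idle for longer than the TTL; returns their ids, sorted.
    pub fn evict_idle(&mut self, now: Instant) -> Vec<u64> {
        let ttl = self.idle_ttl;
        let mut expired: Vec<u64> = self
            .sessions
            .iter()
            .filter(|(_, s)| now.saturating_duration_since(s.last_seen) > ttl)
            .map(|(id, _)| *id)
            .collect();
        expired.sort_unstable();
        for id in &expired {
            self.sessions.remove(id);
        }
        expired
    }
}
```

The idea is that `exec_in_session` would call `keep_alive` before dispatching, and a periodic task would call `evict_idle` and kill the forks for the returned ids.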

Phase 6 — Template Management [MEDIUM]

Build template registry: versioned snapshots stored in object store (S3/MinIO)
Medium · 1 week | tags: infra, ops
Add snapshot versioning API: create, list, pin, deprecate
Small · 3–5 days | tags: rust, ops
Implement parallel re-snapshot pipeline to reduce 15s blocking window
Medium · 1 week | tags: rust, infra
Add runtime template library: Python 3.11/3.12, Node 20/22, Go 1.22, Ruby 3
Medium · 1 week per runtime | tags: infra

Phase 7 — Multi-Host Scale-Out [INFRA]

Design control plane: stateless API front-end + per-host fork agents
Large · architecture sprint | tags: infra, rust
Implement snapshot distribution: replicate to all hosts on template publish
Large · 1–2 weeks | tags: infra
Add host health + capacity scheduler: route fork requests to lowest-load host
Large · 2 weeks | tags: infra, rust
Implement fork request queue with backpressure and overflow rejection
Medium · 1 week | tags: rust, ops
Add cluster-wide metrics aggregation (Prometheus federation or Thanos)
Medium · 1 week | tags: ops
Write deploy automation: Ansible/Terraform for multi-node provisioning
Medium · 1 week | tags: ops, infra
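
The scheduler and queue items fit together roughly like this: pick the healthy host with the lowest load ratio, and if none has room, queue the request up to a bound and reject beyond it. This is only a single-process sketch with made-up names (`Host`, `Dispatch`, `submit`); the real control plane would get load from the per-host agents and would also need to drain the queue as forks finish.

```rust
use std::collections::VecDeque;

/// Scheduler's view of one host (illustrative fields).
pub struct Host {
    pub name: String,
    pub healthy: bool,
    pub running: u32,  // forks currently running
    pub capacity: u32, // maximum concurrent forks
}

#[derive(Debug, PartialEq)]
pub enum Dispatch {
    Run(String), // name of the chosen host
    Queued,      // no free host, request parked in the queue
    Rejected,    // queue full; the API layer would turn this into an error
}

/// Index of the healthy host with the lowest running/capacity ratio, if any has room.
pub fn pick_host(hosts: &[Host]) -> Option<usize> {
    hosts
        .iter()
        .enumerate()
        .filter(|(_, h)| h.healthy && h.running < h.capacity)
        .min_by(|(_, a), (_, b)| {
            // Compare a.running/a.capacity with b.running/b.capacity by
            // cross-multiplying, which avoids floating point.
            let lhs = a.running as u64 * b.capacity as u64;
            let rhs = b.running as u64 * a.capacity as u64;
            lhs.cmp(&rhs)
        })
        .map(|(i, _)| i)
}

/// Route one fork request: run it, queue it, or reject it.
pub fn submit(
    hosts: &mut [Host],
    queue: &mut VecDeque<u64>,
    max_queue: usize,
    request_id: u64,
) -> Dispatch {
    if let Some(i) = pick_host(hosts) {
        hosts[i].running += 1;
        return Dispatch::Run(hosts[i].name.clone());
    }
    if queue.len() < max_queue {
        queue.push_back(request_id);
        Dispatch::Queued
    } else {
        Dispatch::Rejected
    }
}
```

Bounding the queue is what provides backpressure: once it is full, callers get an immediate rejection instead of an ever-growing wait.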

Phase 8 — Production Hardening [MEDIUM]

Add fork engine fuzzing harness (cargo-fuzz on vmstate parser + CPU restore path)
Medium · 1 week | tags: rust, security
Write integration test suite: 1000-fork stress, concurrent network, session stability
Medium · 1 week | tags: rust, ops
Run formal KVM escape threat model; document residual risk + mitigations
Medium · 1 week | tags: security
Add graceful degradation: host OOM → queue forks, emit 503 with retry-after
Small · 3 days | tags: rust, ops
Add ARM64 KVM backend (separate CPU state restore path for Graviton/Ampere)
Large · 3–4 weeks | tags: rust, infra
Publish API stability guarantee, changelog, and deprecation policy
Small · 2 days | tags: ux, ops
