`src/gpu` is designed for standard NVIDIA Linux hosts that expose the normal
driver stack:

- `libnvidia-ml.so.1`
- `/dev/nvidia*`
- `/proc/driver/nvidia/*`
- `/sys/class/drm/*`
- optionally `/sys/class/infiniband/*`
That maps well to GPU VMs on GCP, AWS, and Azure. It is not a good fit for opaque GPU products that hide the host driver stack.
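The driver-stack paths above can be probed before running anything heavier. The sketch below is illustrative, not part of yoq: the `check` helper is a hypothetical name, and the NVML library location is a distro-dependent assumption. It only reports presence, so it is safe to run on non-GPU hosts.

```shell
#!/usr/bin/env sh
# Hypothetical preflight sketch: report which pieces of the NVIDIA driver
# stack this host exposes. check() is illustrative, not a yoq command.
check() {
  # $1 = label, $2 = glob pattern (expanded unquoted on purpose)
  if ls $2 >/dev/null 2>&1; then
    echo "present: $1"
  else
    echo "missing: $1"
  fi
}

check "NVML library"          '/usr/lib/*/libnvidia-ml.so.1'  # path varies by distro
check "device nodes"          '/dev/nvidia*'
check "driver procfs"         '/proc/driver/nvidia/*'
check "DRM class"             '/sys/class/drm/*'
check "InfiniBand (optional)" '/sys/class/infiniband/*'
```

On a healthy GPU VM, every line except the optional InfiniBand one should read `present`.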
For deterministic coverage without physical hardware:
```
env YOQ_SKIP_SLOW_TESTS=1 \
  ZIG_GLOBAL_CACHE_DIR="$PWD/.zig-global-cache" \
  ZIG_LOCAL_CACHE_DIR="$PWD/.zig-local-cache" \
  zig build test-gpu
```

This target exercises the GPU subtree plus the manifest GPU env glue. It is intended to stay green on non-GPU hosts.
For one real cloud GPU VM:
- Provision a Linux VM with an NVIDIA driver installed.
- Confirm `nvidia-smi` works on the host.
- Build `yoq`.
- Run the smoke script: `scripts/gpu-cloud-smoke.sh`

The smoke script checks:

- driver and NVML visibility
- `/dev/nvidia*` device presence
- `yoq gpu topo --json`
- `yoq gpu bench --json` when at least 2 GPUs are visible
- `zig build test-gpu`
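The checks above could be sketched roughly as follows. This is a hypothetical minimal version, not the contents of `scripts/gpu-cloud-smoke.sh`; the `run` and `gpu_smoke` helper names and the `SMOKE_DRY_RUN` variable are illustrative inventions.

```shell
#!/usr/bin/env sh
# Hypothetical sketch of the smoke checks; the real scripts/gpu-cloud-smoke.sh
# may differ. run() and gpu_smoke() are illustrative, not part of yoq.
run() {
  if [ "${SMOKE_DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"   # dry mode: print the command instead of running it
    return 0
  fi
  "$@" || { echo "FAILED: $*" >&2; return 1; }
}

gpu_smoke() {
  run nvidia-smi || return 1                 # driver + NVML visibility
  run sh -c 'ls /dev/nvidia*' || return 1    # device nodes present
  run yoq gpu topo --json || return 1        # topology query succeeds
  # Bench only when at least 2 GPUs are visible (skipped in dry mode).
  if [ "${SMOKE_DRY_RUN:-0}" != "1" ] && \
     [ "$(nvidia-smi --list-gpus | wc -l)" -ge 2 ]; then
    run yoq gpu bench --json || return 1
  fi
  run zig build test-gpu || return 1
}
```

On a GPU host, call `gpu_smoke` directly; `SMOKE_DRY_RUN=1 gpu_smoke` previews the command sequence on any machine.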
For clustered training validation, the app-first control plane now manages training definitions and remote lifecycle commands. The relevant operator path is:
```
yoq up --server 10.0.0.1:7700 -f manifest.toml
yoq train start finetune --server 10.0.0.1:7700
yoq train status finetune --server 10.0.0.1:7700
yoq train logs finetune --server 10.0.0.1:7700
```

`yoq train logs --server ...` now proxies to the agent hosting the selected rank. If that agent is unreachable, the API returns an explicit hosting-agent error.
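The lifecycle commands above can be wrapped in a small polling loop for unattended runs. This is a sketch, not yoq functionality: `wait_for_job` is a hypothetical helper, and matching the word `running` in the status output is an assumption about what `yoq train status` prints, not documented behavior.

```shell
#!/usr/bin/env sh
# Hypothetical polling wrapper around the train lifecycle; wait_for_job is
# not a yoq command, and grepping for "running" is an assumed status format.
wait_for_job() {
  # $1 = training definition name, $2 = host:port of the control plane
  while yoq train status "$1" --server "$2" | grep -q running; do
    sleep 30   # poll interval is arbitrary
  done
  # Once the job leaves the running state, fetch its logs.
  yoq train logs "$1" --server "$2"
}

# Example: wait_for_job finetune 10.0.0.1:7700
```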
Start with a single Linux GPU VM, not Kubernetes and not multi-node NCCL. That gives the best signal with the least setup work.
Good first targets:
- GCP L4 or T4 VM
- AWS g4dn / g5
- Azure N-series GPU VM
Only add MIG or multi-node InfiniBand validation after the single-node lane is stable.