0009 - add GPU runtime policy and executor coverage #21

ethanbailie wants to merge 2 commits into `main` from
Conversation
las7 left a comment
Review Notes
The runtime logic is clean and well-separated. A few issues to flag:
Issues
1. Duplicate + diverged schema code from PR #20

This PR carries all of #20's schema changes (config, job_types, API, version, tests) on top of the runtime layer, so the two PRs will conflict. More importantly, this PR's `validate_device_ids` is missing the duplicate device ID detection that #20 added (the `seen` set with case-insensitive dedup). Recommend merging #20 first, then rebasing this PR down to only the runtime additions.
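For reference, the case-insensitive dedup that #20 adds looks roughly like this (the function name matches the PR; the exact error type and message are illustrative):

```python
# Sketch of case-insensitive duplicate detection for GPU device IDs,
# mirroring what PR #20 adds to validate_device_ids (illustrative only).
def validate_device_ids(device_ids: list[str]) -> None:
    seen: set[str] = set()
    for device_id in device_ids:
        key = device_id.lower()  # "GPU-abc" and "gpu-ABC" name the same device
        if key in seen:
            raise ValueError(f"duplicate GPU device id: {device_id}")
        seen.add(key)
```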
2. --cap-drop=ALL may break GPU workloads
The container drops all capabilities and only adds back SETUID/SETGID. The NVIDIA container runtime may need additional capabilities to initialize GPU access. GPU jobs could silently fail at runtime. The unit tests mock Docker so this wouldn't be caught.
3. --read-only + GPU compatibility
The container runs with --read-only. NVIDIA CUDA often needs to write to /dev/shm for shared memory (especially multi-GPU). The existing --tmpfs=/tmp may not be enough. Worth testing on real hardware.
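A quick way to check issues 2 and 3 on real hardware is to run `nvidia-smi` inside a container launched with the same hardening flags. This is a rough sketch; the flag list and CUDA image tag are assumptions, not the executor's actual values:

```python
# Manual smoke-test sketch: verify a GPU is visible inside a container
# launched with the hardening flags the review flags as risky
# (--cap-drop=ALL, --read-only). Flag list and image are assumptions.
import shutil
import subprocess

HARDENED_FLAGS = [
    "--rm", "--read-only", "--tmpfs=/tmp",
    "--cap-drop=ALL", "--cap-add=SETUID", "--cap-add=SETGID",
]

def gpu_visible_under_hardening(
    image: str = "nvidia/cuda:12.4.1-base-ubuntu22.04",
) -> bool:
    if shutil.which("docker") is None:
        raise RuntimeError("docker not installed")
    cmd = ["docker", "run", *HARDENED_FLAGS, "--gpus=all", image, "nvidia-smi", "-L"]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    # nvidia-smi -L lists one "GPU N: ..." line per visible device
    return result.returncode == 0 and "GPU" in result.stdout
```

If this returns False (or `nvidia-smi` fails to initialize), try re-adding capabilities one at a time and widening the tmpfs mounts (e.g. `/dev/shm`) to isolate which restriction breaks GPU access.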
4. Inconsistent unknown vendor handling
build_gpu_flags() raises RuntimeUnavailableError for unknown vendors, but build_gpu_env_vars() silently returns {}. In practice build_gpu_flags is called first so the error is caught, but if someone calls build_gpu_env_vars independently it'd silently do nothing.
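One fix is to make `build_gpu_env_vars` fail loudly for unknown vendors, matching `build_gpu_flags`. A sketch (the exception class, signature, and env var shapes are inferred from this review, not copied from the PR):

```python
# Sketch: make build_gpu_env_vars raise for unknown vendors instead of
# silently returning {}. Names/signature are assumptions from the review.
class RuntimeUnavailableError(Exception):
    pass

def build_gpu_env_vars(vendor: str, device_ids: list[str]) -> dict[str, str]:
    visible = ",".join(device_ids)
    if vendor == "nvidia":
        return {"CUDA_VISIBLE_DEVICES": visible}
    if vendor == "amd":
        return {"ROCR_VISIBLE_DEVICES": visible, "HIP_VISIBLE_DEVICES": visible}
    # Previously: return {}  (silently did nothing for unknown vendors)
    raise RuntimeUnavailableError(f"unsupported GPU vendor: {vendor!r}")
```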
Looks good
- Runtime policy logic is clear: strict + GPU = rejected, permissive + GPU = forced `runc`
- `self._runtime` replacement with per-job `runtime` in `_run_container` is correct — confirmed no stale references elsewhere
- NVIDIA flag generation covers all three modes (all/count/device)
- AMD device mounts (`/dev/kfd`, `/dev/dri`) are correct
- GPU env vars are vendor-specific (`CUDA_VISIBLE_DEVICES` vs `ROCR/HIP_VISIBLE_DEVICES`)
- Test coverage in `test_runtime.py` covers all branches
- Error handling in `_run_container` gracefully catches `RuntimeUnavailableError`
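The strict/permissive policy described above can be sketched as follows (function name from the PR; the return values and non-GPU fallthrough are placeholders, not the actual implementation):

```python
# Sketch of the runtime policy the review confirms: strict mode rejects
# GPU jobs outright; permissive mode forces runc. Return values for the
# non-GPU path are placeholders, not the PR's actual code.
class RuntimeUnavailableError(Exception):
    pass

def resolve_runtime_for_job_type(security_mode: str, wants_gpu: bool) -> str:
    if not wants_gpu:
        return "default"  # placeholder: whatever runtime non-GPU jobs use
    if security_mode == "strict":
        raise RuntimeUnavailableError("GPU jobs are not allowed in strict mode")
    if security_mode == "permissive":
        return "runc"  # GPU jobs are always forced onto runc
    raise ValueError(f"unknown security mode: {security_mode!r}")
```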
Recommendation
Merge #20 first, rebase this PR to drop duplicate schema code, and open an issue to validate GPU + capability/read-only compatibility on real hardware before production use.
Now that GPU capability exists in job-type schema, this PR adds runtime enforcement in the executor so GPU jobs are handled safely and predictably across strict/permissive security modes.
What Changed
- `tako_vm/execution/worker.py`: force `runc` in permissive mode, NVIDIA `--gpus=all|N|device=...` flag generation, AMD `/dev/kfd` + `/dev/dri` device mounts

Tests Added/Updated
- `tests/test_runtime.py`

How To Review
- `tako_vm/execution/worker.py` (`resolve_runtime_for_job_type`, `build_gpu_flags`, `build_gpu_env_vars`)
- `tests/test_runtime.py`

Suggested Verification
- `ruff check tako_vm tests`
- `pytest tests/test_runtime.py -v`

Out of Scope