0009 - add GPU runtime policy and executor coverage #21

ethanbailie wants to merge 2 commits into `main` from
Conversation
las7 left a comment
Review Notes
The runtime logic is clean and well-separated. A few issues to flag:
Issues
1. Duplicate + diverged schema code from PR #20

This PR carries all of #20's schema changes (config, job_types, API, version, tests) on top of the runtime layer, so the two PRs will conflict. More importantly, this PR's `validate_device_ids` is missing the duplicate device ID detection that #20 added (the `seen` set with case-insensitive dedup). Recommend merging #20 first, then rebasing this PR down to only the runtime additions.
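For reference, the case-insensitive dedup that #20 adds looks roughly like this (the function name matches the PR; the exact error type and message are illustrative):

```python
# Sketch of case-insensitive duplicate detection for GPU device IDs,
# mirroring what PR #20 adds to validate_device_ids (illustrative only).
def validate_device_ids(device_ids: list[str]) -> None:
    seen: set[str] = set()
    for device_id in device_ids:
        key = device_id.lower()  # "GPU-abc" and "gpu-ABC" name the same device
        if key in seen:
            raise ValueError(f"duplicate GPU device id: {device_id}")
        seen.add(key)
```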
2. --cap-drop=ALL may break GPU workloads
The container drops all capabilities and only adds back SETUID/SETGID. The NVIDIA container runtime may need additional capabilities to initialize GPU access. GPU jobs could silently fail at runtime. The unit tests mock Docker so this wouldn't be caught.
3. --read-only + GPU compatibility
The container runs with --read-only. NVIDIA CUDA often needs to write to /dev/shm for shared memory (especially multi-GPU). The existing --tmpfs=/tmp may not be enough. Worth testing on real hardware.
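A quick way to check issues 2 and 3 on real hardware is to run `nvidia-smi` inside a container launched with the same hardening flags. This is a rough sketch; the flag list and CUDA image tag are assumptions, not the executor's actual values:

```python
# Manual smoke-test sketch: verify a GPU is visible inside a container
# launched with the hardening flags the review flags as risky
# (--cap-drop=ALL, --read-only). Flag list and image are assumptions.
import shutil
import subprocess

HARDENED_FLAGS = [
    "--rm", "--read-only", "--tmpfs=/tmp",
    "--cap-drop=ALL", "--cap-add=SETUID", "--cap-add=SETGID",
]

def gpu_visible_under_hardening(
    image: str = "nvidia/cuda:12.4.1-base-ubuntu22.04",
) -> bool:
    if shutil.which("docker") is None:
        raise RuntimeError("docker not installed")
    cmd = ["docker", "run", *HARDENED_FLAGS, "--gpus=all", image, "nvidia-smi", "-L"]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    # nvidia-smi -L lists one "GPU N: ..." line per visible device
    return result.returncode == 0 and "GPU" in result.stdout
```

If this returns False (or `nvidia-smi` fails to initialize), try re-adding capabilities one at a time and widening the tmpfs mounts (e.g. `/dev/shm`) to isolate which restriction breaks GPU access.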
4. Inconsistent unknown vendor handling
build_gpu_flags() raises RuntimeUnavailableError for unknown vendors, but build_gpu_env_vars() silently returns {}. In practice build_gpu_flags is called first so the error is caught, but if someone calls build_gpu_env_vars independently it'd silently do nothing.
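One fix is to make `build_gpu_env_vars` fail loudly for unknown vendors, matching `build_gpu_flags`. A sketch (the exception class, signature, and env var shapes are inferred from this review, not copied from the PR):

```python
# Sketch: make build_gpu_env_vars raise for unknown vendors instead of
# silently returning {}. Names/signature are assumptions from the review.
class RuntimeUnavailableError(Exception):
    pass

def build_gpu_env_vars(vendor: str, device_ids: list[str]) -> dict[str, str]:
    visible = ",".join(device_ids)
    if vendor == "nvidia":
        return {"CUDA_VISIBLE_DEVICES": visible}
    if vendor == "amd":
        return {"ROCR_VISIBLE_DEVICES": visible, "HIP_VISIBLE_DEVICES": visible}
    # Previously: return {}  (silently did nothing for unknown vendors)
    raise RuntimeUnavailableError(f"unsupported GPU vendor: {vendor!r}")
```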
Looks good
- Runtime policy logic is clear: strict + GPU = rejected, permissive + GPU = forced `runc`
- `self._runtime` replacement with per-job `runtime` in `_run_container` is correct — confirmed no stale references elsewhere
- NVIDIA flag generation covers all three modes (all/count/device)
- AMD device mounts (`/dev/kfd`, `/dev/dri`) are correct
- GPU env vars are vendor-specific (`CUDA_VISIBLE_DEVICES` vs `ROCR/HIP_VISIBLE_DEVICES`)
- Test coverage in `test_runtime.py` covers all branches
- Error handling in `_run_container` gracefully catches `RuntimeUnavailableError`
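The strict/permissive policy described above can be sketched as follows (function name from the PR; the return values and non-GPU fallthrough are placeholders, not the actual implementation):

```python
# Sketch of the runtime policy the review confirms: strict mode rejects
# GPU jobs outright; permissive mode forces runc. Return values for the
# non-GPU path are placeholders, not the PR's actual code.
class RuntimeUnavailableError(Exception):
    pass

def resolve_runtime_for_job_type(security_mode: str, wants_gpu: bool) -> str:
    if not wants_gpu:
        return "default"  # placeholder: whatever runtime non-GPU jobs use
    if security_mode == "strict":
        raise RuntimeUnavailableError("GPU jobs are not allowed in strict mode")
    if security_mode == "permissive":
        return "runc"  # GPU jobs are always forced onto runc
    raise ValueError(f"unknown security mode: {security_mode!r}")
```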
Recommendation
Merge #20 first, rebase this PR to drop duplicate schema code, and open an issue to validate GPU + capability/read-only compatibility on real hardware before production use.
Now that GPU capability exists in job-type schema, this PR adds runtime enforcement in the executor so GPU jobs are handled safely and predictably across strict/permissive security modes.
What Changed
- `tako_vm/execution/worker.py`: force `runc` in permissive mode, NVIDIA `--gpus=all|N|device=...` flag generation, AMD `/dev/kfd` + `/dev/dri` device mounts

Tests Added/Updated
- `tests/test_runtime.py`

How To Review
- `tako_vm/execution/worker.py` (`resolve_runtime_for_job_type`, `build_gpu_flags`, `build_gpu_env_vars`)
- `tests/test_runtime.py`

Suggested Verification
- `ruff check tako_vm tests`
- `pytest tests/test_runtime.py -v`

Out of Scope