
0009 - add GPU runtime policy and executor coverage #21

Draft
ethanbailie wants to merge 2 commits into main from 0009-gpu-runtime-policy

Conversation

@ethanbailie
Collaborator

Now that GPU capability exists in job-type schema, this PR adds runtime enforcement in the executor so GPU jobs are handled safely and predictably across strict/permissive security modes.

What Changed

  • Updated tako_vm/execution/worker.py:
    • GPU workloads are rejected in strict mode when gVisor is required
    • GPU workloads force runc in permissive mode
    • added vendor-specific GPU Docker flags
      • NVIDIA: --gpus=all|N|device=...
      • AMD: /dev/kfd + /dev/dri device mounts
    • added vendor-specific env vars for device selection

Tests Added/Updated

  • tests/test_runtime.py
    • strict-mode GPU rejection
    • permissive-mode runtime fallback
    • NVIDIA/AMD flag generation
    • GPU env var generation
    • unsupported vendor rejection
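The policy branches these tests exercise (strict-mode rejection, permissive-mode fallback to runc) could be sketched as follows. Hedged: `resolve_runtime_for_job_type` is named in the PR, but its signature and the default runtime name `runsc` are assumptions here.

```python
# Hypothetical sketch of the runtime policy described above; the real
# resolve_runtime_for_job_type in tako_vm/execution/worker.py may differ.
class RuntimeUnavailableError(Exception):
    pass


def resolve_runtime_for_job_type(needs_gpu: bool, security_mode: str,
                                 default_runtime: str = "runsc") -> str:
    if not needs_gpu:
        return default_runtime
    if security_mode == "strict":
        # Strict mode requires gVisor, which this executor treats as
        # incompatible with GPU passthrough, so the job is rejected.
        raise RuntimeUnavailableError(
            "GPU jobs require runc, which strict mode forbids")
    # Permissive mode falls back to runc so the GPU can be passed through.
    return "runc"
```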

How To Review

  1. Read runtime decision logic in tako_vm/execution/worker.py (resolve_runtime_for_job_type, build_gpu_flags, build_gpu_env_vars).
  2. Confirm behavior is policy-driven and not tied to test-specific internals.
  3. Verify each policy branch has direct assertions in tests/test_runtime.py.

Suggested Verification

  • ruff check tako_vm tests
  • pytest tests/test_runtime.py -v

Out of Scope

  • No session persistence/API changes.
  • No DB model changes.

@ethanbailie added the `enhancement` (New feature or request) label on Mar 13, 2026
Owner

@las7 las7 left a comment


Review Notes

The runtime logic is clean and well-separated. A few issues to flag:

Issues

1. Duplicate + diverged schema code from PR #20
This PR carries all of #20's schema changes (config, job_types, API, version, tests) plus the runtime layer. These two PRs will conflict. More importantly, this PR's validate_device_ids is missing the duplicate device ID detection that #20 added (the seen set with case-insensitive dedup). Recommend merging #20 first, then rebasing this PR to only include the runtime additions.

2. --cap-drop=ALL may break GPU workloads
The container drops all capabilities and only adds back SETUID/SETGID. The NVIDIA container runtime may need additional capabilities to initialize GPU access. GPU jobs could silently fail at runtime. The unit tests mock Docker so this wouldn't be caught.

3. --read-only + GPU compatibility
The container runs with --read-only. NVIDIA CUDA often needs to write to /dev/shm for shared memory (especially multi-GPU). The existing --tmpfs=/tmp may not be enough. Worth testing on real hardware.

4. Inconsistent unknown vendor handling
build_gpu_flags() raises RuntimeUnavailableError for unknown vendors, but build_gpu_env_vars() silently returns {}. In practice build_gpu_flags is called first so the error is caught, but if someone calls build_gpu_env_vars independently it'd silently do nothing.
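One way to resolve the inconsistency is to have `build_gpu_env_vars` raise the same error as `build_gpu_flags` on unknown vendors. Hedged sketch: the env var names come from the review comment; the signature and internals are assumptions.

```python
# Sketch of a build_gpu_env_vars that fails loudly on unknown vendors,
# matching build_gpu_flags, instead of silently returning {}.
class RuntimeUnavailableError(Exception):
    pass


def build_gpu_env_vars(vendor: str, device_ids: list[str]) -> dict[str, str]:
    devices = ",".join(device_ids)
    if vendor == "nvidia":
        return {"CUDA_VISIBLE_DEVICES": devices} if device_ids else {}
    if vendor == "amd":
        if not device_ids:
            return {}
        return {"ROCR_VISIBLE_DEVICES": devices,
                "HIP_VISIBLE_DEVICES": devices}
    # Previously this branch silently returned {}; raising keeps the two
    # helpers consistent even when called independently.
    raise RuntimeUnavailableError(f"unsupported GPU vendor: {vendor}")
```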

Looks good

  • Runtime policy logic is clear: strict + GPU = rejected, permissive + GPU = forced runc
  • self._runtime replacement with per-job runtime in _run_container is correct — confirmed no stale references elsewhere
  • NVIDIA flag generation covers all three modes (all/count/device)
  • AMD device mounts (/dev/kfd, /dev/dri) are correct
  • GPU env vars are vendor-specific (CUDA_VISIBLE_DEVICES vs ROCR/HIP_VISIBLE_DEVICES)
  • Test coverage in test_runtime.py covers all branches
  • Error handling in _run_container gracefully catches RuntimeUnavailableError

Recommendation

Merge #20 first, rebase this PR to drop duplicate schema code, and open an issue to validate GPU + capability/read-only compatibility on real hardware before production use.

