fix(sandbox): restore GPU filesystem baseline #1522

Open
elezar wants to merge 2 commits into main from fix/1486-gpu-sandbox-filesystem-policy/elezar

elezar (Member) commented May 22, 2026

Summary

Restore CUDA GPU filesystem access for Docker-backed GPU sandboxes without promoting all of /proc to full read-write policy access.

This keeps the GPU device-node baseline from #1524, but handles CUDA procfs thread-name writes with a GPU-only Landlock WriteFile exception on procfs. Non-GPU sandboxes do not receive this exception.

Related Issue

Fixes #1486

Builds on the GPU no-network enrichment fix merged in #1524.
The no-network enrichment regression is handled in #1524 and was introduced by #158. This PR addresses the follow-up GPU procfs baseline regression introduced by #910, where explicit default read-only paths prevented GPU-required baseline handling.
The GPU workload test images used for validation come from #1484.

Changes

  • Keep injected NVIDIA/WSL GPU device nodes in the GPU read-write baseline.
  • Stop promoting /proc into filesystem_policy.read_write; /proc can remain read-only in the policy.
  • Add a Linux Landlock runtime exception that grants only AccessFs::WriteFile under /proc, and only when GPU devices are present in the sandbox.
  • Cover descendant CUDA processes, such as a shell workload script that later starts deviceQuery.
  • Preserve custom-policy conflicts for injected GPU device nodes that are incorrectly kept read-only.
  • Update GPU sandbox policy documentation to describe the narrower procfs behavior.
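The gating described in the list above can be sketched as a small, dependency-free model. This is an illustration only and not the code in this PR: the `Access`, `Rule`, `is_gpu_device`, and `build_rules` names are hypothetical, and the device-path heuristic (`/dev/nvidia*`, `/dev/dxg`) is an assumption. The real change applies the exception through Landlock's `AccessFs::WriteFile` at runtime rather than through a rule list like this one.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical access levels, loosely modelled on Landlock filesystem rights.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Access {
    ReadOnly,
    ReadWrite,
    /// Write to existing files only; no create, remove, or rename.
    WriteFileOnly,
}

/// One path rule in the illustrative rule set.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Rule {
    path: PathBuf,
    access: Access,
}

/// Assumed heuristic for spotting injected NVIDIA/WSL GPU device nodes.
fn is_gpu_device(path: &Path) -> bool {
    let s = path.to_string_lossy();
    s.starts_with("/dev/nvidia") || s == "/dev/dxg"
}

/// Build the rule set for a sandbox from the device nodes injected into it.
fn build_rules(injected_devices: &[PathBuf]) -> Vec<Rule> {
    // /proc stays read-only in the policy for every sandbox.
    let mut rules = vec![Rule {
        path: PathBuf::from("/proc"),
        access: Access::ReadOnly,
    }];

    // Injected GPU device nodes belong to the read-write baseline.
    let mut has_gpu = false;
    for dev in injected_devices.iter().filter(|p| is_gpu_device(p.as_path())) {
        has_gpu = true;
        rules.push(Rule {
            path: dev.clone(),
            access: Access::ReadWrite,
        });
    }

    // The procfs write exception is added only when a GPU device is present.
    if has_gpu {
        rules.push(Rule {
            path: PathBuf::from("/proc"),
            access: Access::WriteFileOnly,
        });
    }
    rules
}
```

In this model, `build_rules(&[])` yields only the read-only `/proc` rule, while passing `/dev/nvidia0` adds both the read-write device rule and the write-file-only `/proc` exception.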

Testing

  • /home/elezar/.local/bin/mise run pre-commit
  • /home/elezar/.cargo/bin/cargo test -p openshell-sandbox --lib baseline_tests -- --nocapture
  • /home/elezar/.cargo/bin/cargo test -p openshell-sandbox --lib landlock::tests -- --nocapture
  • /home/elezar/.cargo/bin/cargo clippy -p openshell-sandbox --lib --tests -- -D warnings
  • Plain Docker control: docker run --rm --device nvidia.com/gpu=all localhost/openshell/gpu-workload-cuda-basic:bdaa08fb-dirty passed with OPENSHELL_GPU_WORKLOAD_SUCCESS cuda-basic
  • Docker-backed OpenShell sandbox: openshell sandbox create --no-keep --from localhost/openshell/gpu-workload-cuda-basic:bdaa08fb-dirty --gpu -- /usr/local/bin/openshell-gpu-workload passed with OPENSHELL_GPU_WORKLOAD_SUCCESS cuda-basic
  • A narrower /proc/self/task prototype failed the same sandbox workload with cudaGetDeviceCount returned 304, confirming the need to cover descendant CUDA processes.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture/docs updated (if applicable)

@elezar elezar requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 22, 2026 13:47
@elezar elezar changed the base branch from main to fix/1486-gpu-enrichment-no-network/elezar May 22, 2026 14:06
Base automatically changed from fix/1486-gpu-enrichment-no-network/elezar to main May 27, 2026 08:20
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the fix/1486-gpu-sandbox-filesystem-policy/elezar branch from 96a1caa to 59e399a Compare May 27, 2026 09:02
Keep /proc out of the GPU filesystem baseline and allow only Landlock WriteFile access on procfs for GPU sandboxes. This lets CUDA update /proc/<pid>/task/<tid>/comm without promoting procfs to read-write in the policy.

A more restrictive rule rooted at /proc/self/task is insufficient because CUDA workloads can spawn descendant processes after Landlock is enforced, and those descendants resolve /proc/self to their own process-specific subtree.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
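
The reasoning in the commit message above can be modelled in a few lines. This is a simplified sketch, not the sandbox code: `resolve_proc_self` and `covers` are hypothetical helpers and the PIDs are made up. It assumes that a Landlock path-beneath rule is tied to the directory that was opened when the rule was added, so a rule created from `/proc/self/task` in the parent keeps pointing at the parent's `/proc/<pid>/task`.

```rust
use std::path::{Path, PathBuf};

/// Model how `/proc/self/...` resolves for the process with the given PID.
/// Paths that do not start with `/proc/self` are returned unchanged.
fn resolve_proc_self(pid: u32, path: &str) -> PathBuf {
    match path.strip_prefix("/proc/self") {
        Some(rest) if rest.is_empty() || rest.starts_with('/') => {
            PathBuf::from(format!("/proc/{pid}{rest}"))
        }
        _ => PathBuf::from(path),
    }
}

/// Model a path-beneath rule: the target is covered when it lies under the
/// directory the rule was bound to (compared component by component).
fn covers(rule_root: &Path, target: &Path) -> bool {
    target.starts_with(rule_root)
}
```

With a parent PID of 100 and a descendant PID of 200, a rule bound to `resolve_proc_self(100, "/proc/self/task")` covers the parent's `/proc/100/task/100/comm` but not the descendant's `/proc/200/task/200/comm`, whereas a rule rooted at `/proc` covers both.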
@elezar elezar force-pushed the fix/1486-gpu-sandbox-filesystem-policy/elezar branch from 12bde4d to d73e6de Compare May 28, 2026 19:22
Development

Successfully merging this pull request may close these issues.

bug: GPU sandboxes miss filesystem access for CUDA workloads