
fix(gpu): add Tegra/Jetson GPU support#625

Open
elezar wants to merge 10 commits into main from fix/tegra-gpu-support

Conversation

@elezar
Member

@elezar elezar commented Mar 26, 2026

Summary

Adds GPU support for NVIDIA Tegra/Jetson platforms by bind-mounting the
host-files configuration directory, updating the device plugin image, and
preserving CDI-injected GIDs across privilege drop.

Related Issue

Part of #398 (CDI injection). Depends on #568 (Tegra system support). Should be merged after #495 and #503.

Upstream PRs:

Changes

  • Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d (read-only) into the gateway container when present, so the nvidia runtime inside k3s applies the same host-file injection config as the host — required for Jetson/Tegra CDI spec generation
  • Pin k8s-device-plugin to an image that supports host-files bind-mounts and generates additionalGids in the CDI spec (GID 44 / video, required for /dev/nvmap access on Tegra)
  • Preserve CDI-injected supplemental GIDs across initgroups() during privilege drop, so exec'd processes retain access to GPU devices
  • Fall back to /usr/sbin/nvidia-smi in the GPU e2e test for Tegra systems where nvidia-smi is not on the default PATH

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

elezar added 4 commits March 26, 2026 14:44
Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d
(read-only) into the gateway container when it exists, so the nvidia
runtime running inside k3s can apply the same host-file injection
config as on the host — required for Jetson/Tegra platforms.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Use ghcr.io/nvidia/k8s-device-plugin:2ab68c16 which includes support for
mounting /etc/nvidia-container-runtime/host-files-for-container.d into the
device plugin pod, required for correct CDI spec generation on Tegra-based
systems.

Also included is an nvcdi API bump that ensures that additional GIDs are
included in the generated CDI spec.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
initgroups(3) replaces all supplemental groups with the user's entries
from /etc/group, discarding GIDs injected by the container runtime via
CDI (e.g. GID 44/video needed for /dev/nvmap on Tegra). Snapshot the
container-level GIDs before initgroups runs and merge them back
afterwards, excluding GID 0 (root) to avoid privilege retention.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
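The snapshot-and-merge step described above can be sketched as follows. This is a minimal sketch of the assumed shape, not the actual code from this PR; the function name `merge_preserved_gids` is hypothetical, and the pure merge logic is separated from the privileged `initgroups(3)`/`setgroups(2)` calls so it can be checked in isolation.

```rust
/// Merge the GIDs snapshotted before initgroups() back into the
/// post-initgroups supplemental group set, excluding GID 0 so we never
/// retain root group membership across the privilege drop.
fn merge_preserved_gids(snapshot: &[u32], after_initgroups: &[u32]) -> Vec<u32> {
    let mut merged = after_initgroups.to_vec();
    for &gid in snapshot {
        if gid != 0 && !merged.contains(&gid) {
            merged.push(gid);
        }
    }
    merged
}

fn main() {
    // CDI injected GID 44 (video) at the container level; initgroups()
    // for the sandbox user would otherwise discard it.
    let snapshot = [0, 44];
    let after_initgroups = [1000];
    let merged = merge_preserved_gids(&snapshot, &after_initgroups);
    // GID 0 is excluded, GID 44 is restored alongside the user's groups.
    println!("{merged:?}");
}
```

In the real code the merged set would then be handed to `setgroups(2)` before `setuid()`.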
On Jetson/Tegra platforms nvidia-smi is installed at /usr/sbin/nvidia-smi
rather than /usr/bin/nvidia-smi and may not be on PATH inside the sandbox.
Fall back to the full path when the bare command is not found.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
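The fallback logic for the e2e test can be sketched like this. The helper name `resolve_binary` is hypothetical; a real implementation might instead inspect the spawn error (`ErrorKind::NotFound`) rather than probing PATH up front.

```rust
use std::path::Path;

/// Return `name` if it resolves on PATH, otherwise fall back to the
/// given absolute path (on Jetson/Tegra, nvidia-smi lives in /usr/sbin,
/// which is typically not on the sandbox's default PATH).
fn resolve_binary(name: &str, fallback: &str) -> String {
    let on_path = std::env::var("PATH")
        .map(|p| p.split(':').any(|dir| Path::new(dir).join(name).exists()))
        .unwrap_or(false);
    if on_path { name.to_string() } else { fallback.to_string() }
}

fn main() {
    // On Tegra this resolves to /usr/sbin/nvidia-smi when the bare
    // command is not found on PATH.
    println!("{}", resolve_binary("nvidia-smi", "/usr/sbin/nvidia-smi"));
}
```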
@elezar elezar self-assigned this Mar 26, 2026
@elezar
Member Author

elezar commented Mar 26, 2026

cc @johnnynunez

@johnnynunez

johnnynunez commented Mar 26, 2026

LGTM @elezar
ready to merge @johntmyers

@elezar
Member Author

elezar commented Mar 27, 2026

This was only tested in conjunction with #495 and #503. Once those are in, there should be no reason not to get this in too.

@elezar elezar marked this pull request as ready for review March 27, 2026 07:16
@elezar elezar requested a review from a team as a code owner March 27, 2026 07:16
@johnnynunez

Yes, I know. I was tracking it, and tested it.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

I dug into the GID-preservation change here and I think PR #710 may make it unnecessary.

What I verified locally:

  • Inside a running sandbox, the GPU device nodes are owned by sandbox:sandbox after supervisor setup.
  • The corresponding host and k3s-container device nodes remain root:root 666, so the sandbox-side chown() does not appear to mutate the host devices.
  • That suggests these are container-local CDI-created device nodes, not direct host bind mounts.

If that holds generally, then once #710 adds the needed GPU device paths to filesystem.read_write, prepare_filesystem() will chown(path, uid, gid) before privilege drop and DAC access should come from ownership rather than from preserving CDI-injected supplemental groups.

So I think we should re-check whether the drop_privileges() GID merge is still needed after #710 lands. It may be removable if all required GPU paths (including Tegra-specific ones like /dev/nvmap if applicable) are present and successfully chowned.
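The ownership-based alternative described above could look roughly like this. This is a sketch under the assumption that #710 lists the GPU device paths in filesystem.read_write; the helper name `existing_gpu_paths` is hypothetical and the actual chown call is only indicated in a comment.

```rust
use std::path::Path;

/// Filter the configured GPU device paths down to the ones that actually
/// exist in this sandbox, so prepare_filesystem() only chowns real
/// container-local CDI-created device nodes.
fn existing_gpu_paths<'a>(candidates: &[&'a str]) -> Vec<&'a str> {
    candidates.iter().copied().filter(|p| Path::new(p).exists()).collect()
}

fn main() {
    // Usual device nodes plus Tegra-specific ones like /dev/nvmap.
    let candidates = ["/dev/nvidiactl", "/dev/nvidia0", "/dev/nvmap"];
    for path in existing_gpu_paths(&candidates) {
        // std::os::unix::fs::chown(path, Some(uid), Some(gid)) would run
        // here, while the supervisor still has privileges, so DAC access
        // comes from ownership rather than preserved supplemental GIDs.
        println!("would chown {path}");
    }
}
```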

@pimlock
Collaborator

pimlock commented Apr 1, 2026

Follow-up: I removed the checked-in custom ghcr.io/nvidia/k8s-device-plugin:2ab68c16 image override from this branch.

If someone still needs that image on a live gateway for testing, they can patch the running cluster in place:

openshell doctor exec -- kubectl -n kube-system patch helmchart nvidia-device-plugin --type merge -p '{
  "spec": {
    "valuesContent": "image:\n  repository: ghcr.io/nvidia/k8s-device-plugin\n  tag: \"2ab68c16\"\nruntimeClassName: nvidia\ndeviceListStrategy: cdi-cri\ndeviceIDStrategy: index\ncdi:\n  nvidiaHookPath: /usr/bin/nvidia-cdi-hook\nnvidiaDriverRoot: \"/\"\ngfd:\n  enabled: false\nnfd:\n  enabled: false\naffinity: null\n"
  }
}'
openshell doctor exec -- kubectl -n nvidia-device-plugin rollout status ds/nvidia-device-plugin
openshell doctor exec -- kubectl -n nvidia-device-plugin get ds nvidia-device-plugin -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

That only affects the running gateway. Recreating the gateway reapplies the checked-in manifest.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

Once #710 is reviewed and merged, I will pull it in here and test again. I'm getting a lease on colossus for a Jetson-based system.

It's very likely some policy updates will be required now that #677 is merged; before it, Landlock policies were not correctly applied in many contexts.

@pimlock pimlock added the test:e2e Requires end-to-end coverage label Apr 2, 2026
Comment on lines +605 to +610
const HOST_FILES_DIR: &str = "/etc/nvidia-container-runtime/host-files-for-container.d";
if std::path::Path::new(HOST_FILES_DIR).is_dir() {
    let mut binds = host_config.binds.take().unwrap_or_default();
    binds.push(format!("{HOST_FILES_DIR}:{HOST_FILES_DIR}:ro"));
    host_config.binds = Some(binds);
}
Collaborator


For context, without this mount the failure is:

CDI --device-list-strategy options are only supported on NVML-based systems

The device plugin can't detect the GPU via NVML (Tegra uses a different driver model), so it refuses to start in CDI mode.

The bind mount is needed: without it, the NVIDIA toolkit inside the gateway can't recognize this as a Tegra platform with GPU capabilities, CDI spec generation fails, and the device plugin crashes.

- Bump nvidia-container-toolkit from 1.18.2 to 1.19.0 to support the
  -host-cuda-version flag used by newer CDI spec generation.
- Replace local filesystem check for host-files-for-container.d with
  Docker API kernel version detection (contains "tegra"). This fixes
  remote SSH deploys where the CLI machine may not have the directory.
- Only perform the Tegra check when GPU devices are requested.
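The detection change in the commit above reduces to a substring check on the kernel version the Docker daemon reports. A minimal sketch (the function name is hypothetical; in the real code the string would come from docker.info() and the check would run only when GPU device IDs are requested):

```rust
/// Detect a Tegra host from the Docker daemon's reported kernel version,
/// which works over remote SSH deploys where the CLI machine has no
/// /etc/nvidia-container-runtime directory to inspect.
fn is_tegra_kernel(kernel_version: &str) -> bool {
    kernel_version.to_ascii_lowercase().contains("tegra")
}

fn main() {
    // Jetson kernels carry a "tegra" suffix; generic distro kernels don't.
    assert!(is_tegra_kernel("5.15.148-tegra"));
    assert!(!is_tegra_kernel("6.8.0-51-generic"));
    println!("ok");
}
```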
@pimlock
Collaborator

pimlock commented Apr 7, 2026

Testing on Jetson Thor (NVIDIA Thor GPU, driver 580.00, CUDA 13.0)

Validated the following on a physical Jetson Thor device:

Container Toolkit bump (1.18.2 → 1.19.0)

  • Required. The custom device plugin (k8s-device-plugin PR #1675) generates CDI specs with the --host-cuda-version flag. Toolkit 1.18.2's nvidia-cdi-hook doesn't recognize this flag, causing RunContainerError on GPU sandbox pods. 1.19.0 supports it.

host-files-for-container.d bind mount

  • Required. Without it, the device plugin cannot discover Tegra GPU devices and fails with "CDI options are only supported on NVML-based systems". The CSVs (devices.csv, drivers.csv) are inputs to CDI spec generation — they tell the toolkit which Tegra-specific device nodes and host libraries to inject.
  • Fixed the detection: replaced local Path::is_dir() check with docker.info() kernel version detection (contains("tegra")). The previous approach broke remote SSH deploys (CLI machine doesn't have the directory). Now gated on !device_ids.is_empty() so it's only checked for GPU gateways.

Custom device plugin (k8s-device-plugin PR #1675)

  • Required. Stock device plugin 0.18.2 can detect the Tegra platform and register with kubelet, but generates a nearly empty CDI spec (no device nodes, no library mounts). The custom build with driver-root-aware CSV resolution is needed for functional GPU access.
  • Tested: stock plugin → torch.cuda.is_available() returns False (no /dev/nvidia* in sandbox). Custom plugin → full PyTorch CUDA 13.0 matrix multiply succeeds on Thor.

GID merge in process.rs

  • Not needed today but harmless. CDI spec is v0.5.0 (no additionalGids support) and all device nodes are 0666. The code would become relevant with CDI v0.7.0+ and toolkit 1.19.1+ (nvidia-container-toolkit PR #1745).

Other findings

  • br_netfilter kernel module must be loaded on Tegra for k3s DNS/service networking. Without it, pods can't reach CoreDNS via ClusterIP. Good candidate for a pre-flight check.
  • Device plugin warnings about missing V4L2/GStreamer/legacy-tegra files are harmless — they're display/video codec libraries, not needed for CUDA compute.
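The suggested br_netfilter pre-flight check could be sketched as below. This is a hypothetical helper, not code from this PR; it relies on the fact that a loaded (or built-in) br_netfilter module exposes the bridge sysctls under /proc/sys/net/bridge, which is what k3s ClusterIP/CoreDNS traffic needs.

```rust
use std::path::Path;

/// Pre-flight check: br_netfilter is available when the bridge netfilter
/// sysctl is visible under the given /proc root (parameterized for tests).
fn br_netfilter_available(proc_root: &Path) -> bool {
    proc_root.join("sys/net/bridge/bridge-nf-call-iptables").exists()
}

fn main() {
    if br_netfilter_available(Path::new("/proc")) {
        println!("br_netfilter: ok");
    } else {
        // On Tegra this means pods won't reach CoreDNS via ClusterIP.
        println!("br_netfilter: missing (try `modprobe br_netfilter`)");
    }
}
```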

…ID preservation

- Log when Tegra platform is detected and host-files bind mount is added,
  including the kernel version from the Docker daemon.
- Extract CDI GID snapshot logic into `snapshot_cdi_gids()` function that
  only activates when GPU devices are present (/dev/nvidiactl exists).
- Log preserved CDI-injected GIDs when they are restored after initgroups.
- Fix cargo fmt formatting issue in docker.rs.

Labels

test:e2e Requires end-to-end coverage
