Description
Is there a way to configure the device-name-strategy for the management.nvidia.com CDI specification generated by the nvidia-container-toolkit daemonset?
Context
I'm running a k3s cluster with the NVIDIA GPU Operator and Volcano scheduler with the volcano-device-plugin. When scheduling pods using Volcano resource types, I get the following error:
```
Error: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```
The Volcano device plugin requests GPUs by UUID (e.g., `management.nvidia.com/gpu=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`), but the `management.nvidia.com-gpu.yaml` CDI spec generated by the toolkit only contains:
```yaml
devices:
  - name: all
# ...
```

It's missing the UUID-based device entries like:

```yaml
devices:
  - name: all
  - name: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```
This is what I get when I run `nvidia-ctk cdi list`:

```console
root@cast-edge-019cb982-469d-774d-8a66-c210be5b65af:/# /usr/local/nvidia/toolkit/nvidia-ctk cdi list
INFO[0000] Found 2 CDI devices
k8s.device-plugin.nvidia.com/gpu=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
management.nvidia.com/gpu=all
```
What I've tried
I tried setting environment variables on the toolkit daemonset, but they had no effect. These are the environment variables I passed through the Helm values at installation time:
```yaml
toolkit:
  enabled: true
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: RUNTIME_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: NVIDIA_CTK_CDI_DEVICE_NAME_STRATEGY
      value: "uuid"
    - name: NVIDIA_CTK_CDI_ANNOTATION_PREFIXES
      value: "cdi.k8s.io/"
    - name: NVIDIA_CTK_CDI_SPEC_DIRS
      value: /var/run/cdi
    - name: NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEVICE_NAME_STRATEGY
      value: "uuid"
```

However, none of these appear to affect the generated configuration.
The only way I found to configure it correctly was to run this command directly on the node:

```shell
/usr/local/nvidia/toolkit/nvidia-ctk cdi generate \
  --vendor management.nvidia.com \
  --device-name-strategy uuid \
  --output /var/run/cdi/management.nvidia.com-gpu.yaml
```
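As a stopgap, the per-node command above could be automated with a small DaemonSet that runs it on every GPU node. This is only a sketch of my workaround, not a supported configuration: the namespace, labels, image, and hostPath layout are assumptions from my environment, and it relies on the toolkit already having installed `nvidia-ctk` under `/usr/local/nvidia/toolkit` on the host.

```yaml
# Hypothetical stopgap: regenerate the management CDI spec on each node.
# All names, paths, and the image below are assumptions, not official values.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cdi-uuid-regen
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: cdi-uuid-regen
  template:
    metadata:
      labels:
        app: cdi-uuid-regen
    spec:
      containers:
        - name: regen
          image: ubuntu:24.04        # any image with a shell; assumption
          securityContext:
            privileged: true          # needed to touch host CDI state
          command:
            - /bin/sh
            - -c
            - |
              # nvidia-ctk and the output directory are bind-mounted from the host
              /host/usr/local/nvidia/toolkit/nvidia-ctk cdi generate \
                --vendor management.nvidia.com \
                --device-name-strategy uuid \
                --output /host/var/run/cdi/management.nvidia.com-gpu.yaml
              sleep infinity
          volumeMounts:
            - name: host-root
              mountPath: /host
      volumes:
        - name: host-root
          hostPath:
            path: /
```

This obviously re-creates on boot what I would prefer the toolkit daemonset to generate itself, which is why I'm asking whether the device-name strategy for the management spec is configurable.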
Environment
- GPU Operator: v25.10.1
- Volcano: v1.14.1
- Volcano-device-plugin: v1.11.0
- K3s: v1.34.4+k3s1