How to configure device-name-strategy for management.nvidia.com CDI spec generation? #2186

@cheina97

Description

Is there a way to configure the device-name-strategy for the management.nvidia.com CDI specification generated by the nvidia-container-toolkit daemonset?

Context

I'm running a k3s cluster with the NVIDIA GPU Operator and the Volcano scheduler with volcano-device-plugin. When scheduling pods that request Volcano resource types, I get the following error:

Error: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

The Volcano device plugin requests GPUs by UUID (e.g., management.nvidia.com/gpu=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx), but the management.nvidia.com-gpu.yaml CDI spec generated by the toolkit only contains:

devices:
  - name: all
    # ...

It's missing the UUID-based device entries like:

devices:
  - name: all
  - name: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
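
For comparison, a spec generated with the `uuid` strategy should look roughly like the following. This is a hand-written sketch, not actual toolkit output: the UUID is the placeholder from above, and the `cdiVersion` and device node paths are assumptions on my part.

cdiVersion: "0.6.0"
kind: management.nvidia.com/gpu
devices:
  # Catch-all entry, as in the spec the toolkit currently generates.
  - name: all
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
  # Per-GPU entry keyed by UUID, which is what the Volcano device plugin requests.
  - name: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0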

This is what I get when I run nvidia-ctk cdi list:

root@cast-edge-019cb982-469d-774d-8a66-c210be5b65af:/# /usr/local/nvidia/toolkit/nvidia-ctk cdi list
INFO[0000] Found 2 CDI devices                          
k8s.device-plugin.nvidia.com/gpu=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
management.nvidia.com/gpu=all

What I've tried

I tried setting environment variables on the toolkit daemonset, but they have no effect.
These are the environment variables I passed through the Helm values at installation time:

toolkit:
    enabled: true
    env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: RUNTIME_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: NVIDIA_CTK_CDI_DEVICE_NAME_STRATEGY
      value: "uuid"
    - name: NVIDIA_CTK_CDI_ANNOTATION_PREFIXES
      value: "cdi.k8s.io/"     
    - name: NVIDIA_CTK_CDI_SPEC_DIRS
      value: /var/run/cdi 
    - name: NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEVICE_NAME_STRATEGY
      value: "uuid"

But it seems that they do not affect the generated configuration.

The only way I found to configure it correctly was to run this command directly on the node:

/usr/local/nvidia/toolkit/nvidia-ctk cdi generate --vendor management.nvidia.com  --device-name-strategy uuid  --output /var/run/cdi/management.nvidia.com-gpu.yaml
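
As a stop-gap, that step can be wrapped in a small script and re-run whenever the toolkit regenerates its specs. This is only a sketch of my workaround, not an operator-supported mechanism: the paths match the k3s layout above, and the need to re-run after toolkit restarts is an assumption on my part.

```shell
#!/bin/sh
# Regenerate the management CDI spec with UUID-named devices, then verify
# that the UUID-based entries are resolvable.
set -eu

CTK=/usr/local/nvidia/toolkit/nvidia-ctk

"$CTK" cdi generate \
  --vendor management.nvidia.com \
  --device-name-strategy uuid \
  --output /var/run/cdi/management.nvidia.com-gpu.yaml

# The output should now include management.nvidia.com/gpu=GPU-... entries
# in addition to management.nvidia.com/gpu=all.
"$CTK" cdi list
```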

Environment

  • GPU Operator: v25.10.1
  • Volcano: v1.14.1
  • Volcano-device-plugin: v1.11.0
  • K3s: v1.34.4+k3s1
