
Conversation


@jokestax jokestax commented Dec 27, 2025

Description

This PR fixes an issue where containers using the NVIDIA runtime in legacy mode lose GPU access after a systemctl daemon-reload is performed on the host.

The root cause is that the devices.allow cgroup file is not updated with NVIDIA character device rules when containers are created in legacy mode: the device access granted by the legacy prestart hook is never recorded in the container's spec, so systemd drops it when it re-applies its device policy during the daemon-reload. nvidia-smi and other tools in the container then fail with:

Failed to initialize NVML: Unknown Error
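
For context, here is a minimal sketch of the cgroup v1 device-allow mechanism involved. This is illustrative only: the cgroup path below is hypothetical and depends on the cgroup driver, and while 195 is the usual NVIDIA character major, nvidia-uvm gets a dynamically assigned major.

   # Each permitted device corresponds to one "c <major>:<minor> rwm" rule in the
   # container's device cgroup. When the NVIDIA rules are missing, open() on
   # /dev/nvidia* inside the container fails with EPERM, which NVML reports as
   # "Unknown Error".
   CG=/sys/fs/cgroup/devices/docker/<container-id>       # hypothetical cgroup v1 path
   cat "$CG/devices.list"                                # rules currently in effect
   echo 'c 195:255 rwm' | sudo tee "$CG/devices.allow"   # e.g. re-allow /dev/nvidiactl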

Changes

  • Added cgroup device rules for NVIDIA devices in the legacy mode modifier
  • Rules are now added based on the NVIDIA_VISIBLE_DEVICES environment variable:
    • Common devices: nvidiactl, nvidia-uvm, nvidia-uvm-tools, nvidia-modeset
    • GPU-specific devices: /dev/nvidia0, /dev/nvidia1, etc. (based on the requested GPUs)
  • Supports NVIDIA_VISIBLE_DEVICES=all, specific indices (0,1,2), and UUIDs (see the sketch after this list for how device nodes map to cgroup rules)
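
As an illustration (not code from this PR), the device nodes involved for NVIDIA_VISIBLE_DEVICES=0 and the allow rules they translate to can be inspected on the host. Major numbers should be read from /proc/devices rather than hard-coded, since nvidia-uvm's major is assigned dynamically:

   # Device nodes that need allow rules for a single GPU in legacy mode:
   ls -l /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools /dev/nvidia-modeset /dev/nvidia0
   # Their character-device majors as registered with the kernel:
   grep nvidia /proc/devices
   # Each node maps to one device-cgroup rule of the form: c <major>:<minor> rwm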

Related Issue(s)

Fixes opencontainers/runc#4859

How To Test

Reproducing the Issue (Before Fix)

  1. Install the latest nvidia-container-toolkit version

  2. Update /etc/nvidia-container-runtime/config.toml to use legacy mode:

   [nvidia-container-runtime]
   #debug = "/var/log/nvidia-container-runtime.log"
   log-level = "info"
   mode = "legacy"
   runtimes = ["runc", "crun"]
  3. Restart Docker:
   sudo systemctl restart docker
  4. Start a GPU container (alternatively, replace the -e flag with --gpus=all):
   docker run -d --runtime=nvidia \
     -e NVIDIA_VISIBLE_DEVICES=0 \
     --name gpu-test \
     nvidia/cuda:12.2.0-base-ubuntu22.04 sleep 30000
  5. Verify nvidia-smi works:
   docker exec -it gpu-test nvidia-smi
   # Should work ✓
  6. In a second terminal, run:
   sudo systemctl daemon-reload
  7. Test nvidia-smi again in the container (an optional cgroup check is sketched after this list):
   docker exec -it gpu-test nvidia-smi
   # ERROR: Failed to initialize NVML: Unknown Error ✗
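
Optionally, the root cause can be observed directly, assuming cgroup v1 with the devices controller (on cgroup v2 the device policy lives in an eBPF program and is not exposed as a file):

   # Compare the container's device allow-list before and after the daemon-reload;
   # the NVIDIA entries (major 195 plus the dynamic nvidia-uvm major) drop out
   # after the reload, which is why NVML can no longer open /dev/nvidia*.
   docker exec gpu-test cat /sys/fs/cgroup/devices/devices.list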

Testing the Fix

  1. Clone this repository, check out the fix/gpu-permission branch, and build the new binary:
   git clone <repository-url>
   cd nvidia-container-toolkit
   git checkout fix/gpu-permission
   go build -o nvidia-container-runtime ./cmd/nvidia-container-runtime
  2. Backup the original binary:
   sudo cp /usr/bin/nvidia-container-runtime /usr/bin/nvidia-container-runtime.bak
  3. Install the new binary:
   sudo cp nvidia-container-runtime /usr/bin/nvidia-container-runtime
  4. Restart Docker:
   sudo systemctl restart docker
  5. Start a new GPU container (alternatively, replace the -e flag with --gpus=all):
   docker run -d --runtime=nvidia \
     -e NVIDIA_VISIBLE_DEVICES=0 \
     --name gpu-test-fixed \
     nvidia/cuda:12.2.0-base-ubuntu22.04 sleep 30000
  6. Verify nvidia-smi works:
   docker exec -it gpu-test-fixed nvidia-smi
   # Should work ✓
  7. Run daemon-reload in another terminal:
   sudo systemctl daemon-reload
  8. Test nvidia-smi again (the cgroup check below can confirm the device rules survived):
   docker exec -it gpu-test-fixed nvidia-smi
   # Should still work ✓
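
The same inspection as in the previous section can be repeated against the patched container to confirm the NVIDIA device rules now survive the reload (again assuming cgroup v1 with the devices controller):

   # The NVIDIA entries should still be listed after the daemon-reload:
   docker exec gpu-test-fixed cat /sys/fs/cgroup/devices/devices.list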

Rollback (If Needed)

sudo cp /usr/bin/nvidia-container-runtime.bak /usr/bin/nvidia-container-runtime
sudo systemctl restart docker


copy-pr-bot bot commented Dec 27, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jokestax jokestax force-pushed the fix/gpu-permission branch 2 times, most recently from d01cc93 to 9a8ee88 on December 27, 2025 at 19:14
Signed-off-by: mrrishi <mrrishi373@gmail.com>