
Conversation


@jokestax jokestax commented Dec 27, 2025

Description

This PR fixes an issue where containers using the NVIDIA runtime in legacy mode lose GPU access after a systemctl daemon-reload is performed on the host.

The root cause is that the devices.allow cgroup file is not updated with NVIDIA character device rules when containers are created in legacy mode: the device access granted by the legacy prestart hook is never recorded in the container's spec, so systemd drops it when it re-applies its device policy during the daemon-reload. nvidia-smi and other tools in the container then fail with:

Failed to initialize NVML: Unknown Error
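
For context, here is a minimal sketch of the cgroup v1 device-allow mechanism involved. This is illustrative only: the cgroup path below is hypothetical and depends on the cgroup driver, and while 195 is the usual NVIDIA character major, nvidia-uvm gets a dynamically assigned major.

   # Each permitted device corresponds to one "c <major>:<minor> rwm" rule in the
   # container's device cgroup. When the NVIDIA rules are missing, open() on
   # /dev/nvidia* inside the container fails with EPERM, which NVML reports as
   # "Unknown Error".
   CG=/sys/fs/cgroup/devices/docker/<container-id>       # hypothetical cgroup v1 path
   cat "$CG/devices.list"                                # rules currently in effect
   echo 'c 195:255 rwm' | sudo tee "$CG/devices.allow"   # e.g. re-allow /dev/nvidiactl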

Changes

  • Added cgroup device rules for NVIDIA devices in the legacy mode modifier
  • Rules are now added based on the NVIDIA_VISIBLE_DEVICES environment variable:
    • Common devices: nvidiactl, nvidia-uvm, nvidia-uvm-tools, nvidia-modeset
    • GPU-specific devices: /dev/nvidia0, /dev/nvidia1, etc. (based on the requested GPUs)
  • Supports NVIDIA_VISIBLE_DEVICES=all, specific indices (0,1,2), and UUIDs (see the sketch after this list for how device nodes map to cgroup rules)
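
As an illustration (not code from this PR), the device nodes involved for NVIDIA_VISIBLE_DEVICES=0 and the allow rules they translate to can be inspected on the host. Major numbers should be read from /proc/devices rather than hard-coded, since nvidia-uvm's major is assigned dynamically:

   # Device nodes that need allow rules for a single GPU in legacy mode:
   ls -l /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools /dev/nvidia-modeset /dev/nvidia0
   # Their character-device majors as registered with the kernel:
   grep nvidia /proc/devices
   # Each node maps to one device-cgroup rule of the form: c <major>:<minor> rwm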

Related Issue(s)

Fixes opencontainers/runc#4859

How To Test

Reproducing the Issue (Before Fix)

  1. Install the latest nvidia-container-toolkit version

  2. Update /etc/nvidia-container-runtime/config.toml to use legacy mode:

   [nvidia-container-runtime]
   #debug = "/var/log/nvidia-container-runtime.log"
   log-level = "info"
   mode = "legacy"
   runtimes = ["runc", "crun"]
  3. Restart Docker:
   sudo systemctl restart docker
  4. Start a GPU container (alternatively, replace the -e flag with --gpus=all):
   docker run -d --runtime=nvidia \
     -e NVIDIA_VISIBLE_DEVICES=0 \
     --name gpu-test \
     nvidia/cuda:12.2.0-base-ubuntu22.04 sleep 30000
  5. Verify nvidia-smi works:
   docker exec -it gpu-test nvidia-smi
   # Should work ✓
  6. In a second terminal, run:
   sudo systemctl daemon-reload
  7. Test nvidia-smi again in the container (an optional cgroup check is sketched after this list):
   docker exec -it gpu-test nvidia-smi
   # ERROR: Failed to initialize NVML: Unknown Error ✗
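
Optionally, the root cause can be observed directly, assuming cgroup v1 with the devices controller (on cgroup v2 the device policy lives in an eBPF program and is not exposed as a file):

   # Compare the container's device allow-list before and after the daemon-reload;
   # the NVIDIA entries (major 195 plus the dynamic nvidia-uvm major) drop out
   # after the reload, which is why NVML can no longer open /dev/nvidia*.
   docker exec gpu-test cat /sys/fs/cgroup/devices/devices.list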

Testing the Fix

  1. Clone this repository, check out the fix/gpu-permission branch, and build the new binary:
   git clone <repository-url>
   cd nvidia-container-toolkit
   git checkout fix/gpu-permission
   go build -o nvidia-container-runtime ./cmd/nvidia-container-runtime
  2. Backup the original binary:
   sudo cp /usr/bin/nvidia-container-runtime /usr/bin/nvidia-container-runtime.bak
  3. Install the new binary:
   sudo cp nvidia-container-runtime /usr/bin/nvidia-container-runtime
  4. Restart Docker:
   sudo systemctl restart docker
  5. Start a new GPU container (alternatively, replace the -e flag with --gpus=all):
   docker run -d --runtime=nvidia \
     -e NVIDIA_VISIBLE_DEVICES=0 \
     --name gpu-test-fixed \
     nvidia/cuda:12.2.0-base-ubuntu22.04 sleep 30000
  6. Verify nvidia-smi works:
   docker exec -it gpu-test-fixed nvidia-smi
   # Should work ✓
  7. Run daemon-reload in another terminal:
   sudo systemctl daemon-reload
  8. Test nvidia-smi again (the cgroup check below can confirm the device rules survived):
   docker exec -it gpu-test-fixed nvidia-smi
   # Should still work ✓
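
The same inspection as in the previous section can be repeated against the patched container to confirm the NVIDIA device rules now survive the reload (again assuming cgroup v1 with the devices controller):

   # The NVIDIA entries should still be listed after the daemon-reload:
   docker exec gpu-test-fixed cat /sys/fs/cgroup/devices/devices.list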

Rollback (If Needed)

sudo cp /usr/bin/nvidia-container-runtime.bak /usr/bin/nvidia-container-runtime
sudo systemctl restart docker


copy-pr-bot bot commented Dec 27, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jokestax jokestax force-pushed the fix/gpu-permission branch 2 times, most recently from d01cc93 to 9a8ee88 on December 27, 2025 at 19:14
Signed-off-by: mrrishi <mrrishi373@gmail.com>