The NVIDIA device plugin DaemonSet must be running and healthy before GPU sandboxes can be created. It uses CDI injection (`deviceListStrategy: cdi-cri`) to inject GPU devices into sandbox pods — no `runtimeClassName` is set on sandbox pods.
```bash
# DaemonSet status — numberReady must be >= 1
openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin
# Device plugin pod logs — look for "CDI" lines confirming CDI mode is active

# Confirm a GPU sandbox pod has no runtimeClassName (CDI injection, not runtime class)
openshell doctor exec -- kubectl get pod -n openshell -o jsonpath='{range .items[*]}{.metadata.name}{" runtimeClassName="}{.spec.runtimeClassName}{"\n"}{end}'
```
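When wiring the DaemonSet check above into a script, the `READY` column (e.g. `1/1`) can be parsed directly. A minimal sketch — the helper name and parsing are illustrative, not part of the openshell CLI:

```bash
# Decide DaemonSet health from a kubectl READY column value such as "1/1":
# healthy when at least one pod is ready and ready == desired.
check_ready() {
  local ready="${1%%/*}" desired="${1##*/}"
  if [ "$ready" -ge 1 ] && [ "$ready" -eq "$desired" ]; then
    echo healthy
  else
    echo not-ready
  fi
}

check_ready "1/1"   # healthy
check_ready "0/1"   # not-ready
```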
Common issues:
- **DaemonSet 0/N ready**: The device plugin chart may still be deploying (the k3s Helm controller can take 1–2 min) or the pod is crashing. Check pod logs.
- **`nvidia-ctk cdi list` returns no `k8s.device-plugin.nvidia.com/gpu=` entries**: CDI spec generation has not completed. The device plugin may still be starting, or the `cdi-cri` strategy isn't active. Verify that `deviceListStrategy: cdi-cri` is in the rendered Helm values.
- **No CDI spec files at `/var/run/cdi/`**: Same as above — the device plugin hasn't written CDI specs yet.
- **`HEALTHCHECK_GPU_DEVICE_PLUGIN_NOT_READY` in health check logs**: The device plugin has no ready pods. Check DaemonSet events and pod logs.
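The CDI-related issues above can be checked directly. A sketch — whether `nvidia-ctk` is reachable through `doctor exec` or must run on the host depends on your setup:

```bash
# CDI spec files appear under /var/run/cdi/ once the device plugin has generated them
openshell doctor exec -- ls -la /var/run/cdi/

# The generated specs should include GPU entries under the device-plugin CDI kind
openshell doctor exec -- nvidia-ctk cdi list | grep 'k8s.device-plugin.nvidia.com/gpu='
```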
### Step 9: Check DNS Resolution
DNS misconfiguration is a common root cause, especially on remote/Linux hosts:
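A quick way to verify both the node-level and in-cluster paths — the pod name and busybox image below are illustrative:

```bash
# k3s uses its own resolv.conf; confirm it lists a reachable nameserver
openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf

# Spot-check in-cluster DNS with a throwaway pod
openshell doctor exec -- kubectl run dns-test --rm -i --restart=Never \
  --image=busybox:1.36 -- nslookup kubernetes.default
```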
| gRPC `UNIMPLEMENTED` for newer RPCs in push mode | Helm values still point at older pulled images instead of the pushed refs | Verify rendered `openshell-helmchart.yaml` uses the expected push refs (`server`, `sandbox`, `pki-job`) and not `:latest` |
| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` target in `deploy/docker/Dockerfile.images`. Rebuild with `mise run docker:build:cluster` and recreate the gateway. Bootstrap auto-detects this via the `HEALTHCHECK_MISSING_SUPERVISOR` marker |
| `HEALTHCHECK_MISSING_SUPERVISOR` in health check logs | `/opt/openshell/bin/openshell-sandbox` not found in gateway container | Rebuild cluster image: `mise run docker:build:cluster`, then `openshell gateway destroy <name> && openshell gateway start` |
| `nvidia-ctk cdi list` returns no `k8s.device-plugin.nvidia.com/gpu=` entries | CDI specs not yet generated by device plugin | Device plugin may still be starting; wait and retry, or check pod logs (Step 8) |
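The supervisor-related rows can be confirmed directly with the same check the diagnostic dump uses:

```bash
# If this file is missing, rebuild the cluster image and recreate the gateway
openshell doctor exec -- ls -la /opt/openshell/bin/openshell-sandbox
```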
## Full Diagnostic Dump
echo "=== DNS Configuration ==="
openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf
# GPU gateways only
echo "=== GPU Device Plugin ==="
openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin