Commit 6840a21 (parent 65f5642)

docs(debug-skill): add CDI device plugin diagnostics for GPU gateways

1 file changed, +43 −1: `.agents/skills/debug-openshell-cluster/SKILL.md`
@@ -256,7 +256,43 @@ Look for:

- `OOMKilled` — memory limits too low
- `FailedMount` — volume issues

### Step 8: Check GPU Device Plugin and CDI (GPU gateways only)

Skip this step for non-GPU gateways.

The NVIDIA device plugin DaemonSet must be running and healthy before GPU sandboxes can be created. It uses CDI injection (`deviceListStrategy: cdi-cri`) to inject GPU devices into sandbox pods — no `runtimeClassName` is set on sandbox pods.

```bash
# DaemonSet status — numberReady must be >= 1
openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin

# Device plugin pod logs — look for "CDI" lines confirming CDI mode is active
openshell doctor exec -- kubectl logs -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin --tail=50

# List CDI devices registered by the device plugin (requires nvidia-ctk in the cluster image).
# Device plugin CDI entries use the vendor string "k8s.device-plugin.nvidia.com", so entries
# will be prefixed "k8s.device-plugin.nvidia.com/gpu=". If the list is empty, CDI spec
# generation has not completed yet.
openshell doctor exec -- nvidia-ctk cdi list

# Verify CDI spec files were generated on the node
openshell doctor exec -- ls /var/run/cdi/

# Helm install job logs for the device plugin chart
openshell doctor exec -- kubectl -n kube-system logs -l job-name=helm-install-nvidia-device-plugin --tail=100

# Confirm a GPU sandbox pod has no runtimeClassName (CDI injection, not runtime class)
openshell doctor exec -- kubectl get pod -n openshell -o jsonpath='{range .items[*]}{.metadata.name}{" runtimeClassName="}{.spec.runtimeClassName}{"\n"}{end}'
```

Common issues:

- **DaemonSet 0/N ready**: The device plugin chart may still be deploying (the k3s Helm controller can take 1–2 min) or the pod is crashing. Check pod logs.
- **`nvidia-ctk cdi list` returns no `k8s.device-plugin.nvidia.com/gpu=` entries**: CDI spec generation has not completed. The device plugin may still be starting, or the `cdi-cri` strategy isn't active. Verify `deviceListStrategy: cdi-cri` is in the rendered Helm values.
- **No CDI spec files at `/var/run/cdi/`**: Same as above — the device plugin hasn't written CDI specs yet.
- **`HEALTHCHECK_GPU_DEVICE_PLUGIN_NOT_READY` in health check logs**: The device plugin has no ready pods. Check DaemonSet events and pod logs.
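The empty-`cdi list` symptom can also be checked mechanically. A minimal sketch, assuming a hypothetical `cdi_gpu_count` helper; only the `k8s.device-plugin.nvidia.com/gpu=` prefix comes from this doc, and the sample entries are illustrative:

```shell
# cdi_gpu_count: count device-plugin GPU entries in `nvidia-ctk cdi list`
# output read from stdin. 0 means CDI spec generation has not completed.
cdi_gpu_count() {
  grep -c '^k8s\.device-plugin\.nvidia\.com/gpu=' || true
}

# Illustrative sample output for a healthy single-GPU node:
printf '%s\n' \
  'k8s.device-plugin.nvidia.com/gpu=0' \
  'k8s.device-plugin.nvidia.com/gpu=all' | cdi_gpu_count   # prints 2
```

On a GPU gateway the input would instead come from `openshell doctor exec -- nvidia-ctk cdi list`.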
### Step 9: Check DNS Resolution

DNS misconfiguration is a common root cause, especially on remote/Linux hosts:

@@ -314,6 +350,7 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w

| gRPC `UNIMPLEMENTED` for newer RPCs in push mode | Helm values still point at older pulled images instead of the pushed refs | Verify rendered `openshell-helmchart.yaml` uses the expected push refs (`server`, `sandbox`, `pki-job`) and not `:latest` |
| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` target in `deploy/docker/Dockerfile.images`. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker |
| `HEALTHCHECK_MISSING_SUPERVISOR` in health check logs | `/opt/openshell/bin/openshell-sandbox` not found in gateway container | Rebuild cluster image: `mise run docker:build:cluster`, then `openshell gateway destroy <name> && openshell gateway start` |
| `nvidia-ctk cdi list` returns no `k8s.device-plugin.nvidia.com/gpu=` entries | CDI specs not yet generated by device plugin | Device plugin may still be starting; wait and retry, or check pod logs (Step 8) |
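Several of the remediations above reduce to "wait and retry". A generic retry helper, as a sketch: the `retry_until` name is an assumption, and the 24 x 5 s budget in the usage comment mirrors the 1–2 minute Helm-controller note in Step 8.

```shell
# retry_until ATTEMPTS DELAY CMD...: run CMD until it succeeds, trying at
# most ATTEMPTS times and sleeping DELAY seconds between tries.
retry_until() {
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: wait up to ~2 minutes for the device plugin DaemonSet to report a
# ready pod (the kubectl command is from Step 8; the budget is an assumption):
# retry_until 24 5 sh -c \
#   '[ "$(openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin \
#        -o jsonpath="{.items[0].status.numberReady}")" -ge 1 ]'
```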

## Full Diagnostic Dump

@@ -367,4 +404,9 @@

```bash
openshell doctor exec -- ls -la /opt/openshell/bin/openshell-sandbox

echo "=== DNS Configuration ==="
openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf

# GPU gateways only
echo "=== GPU Device Plugin ==="
openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin
openshell doctor exec -- nvidia-ctk cdi list
```
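One possible extension of the dump for GPU gateways (not part of this commit): CDI spec files are JSON or YAML documents whose `kind` field names the vendor/class, so the files under `/var/run/cdi/` can be grepped for the device-plugin vendor string. The `cdi_specs_present` helper below is hypothetical:

```shell
# cdi_specs_present DIR: print the CDI spec files in DIR that mention the
# device-plugin vendor string; print nothing if none do.
cdi_specs_present() {
  grep -l 'k8s\.device-plugin\.nvidia\.com' "$1"/* 2>/dev/null || true
}

# Example with an illustrative spec fragment written to a temp dir:
dir=$(mktemp -d)
printf 'cdiVersion: "0.6.0"\nkind: k8s.device-plugin.nvidia.com/gpu\n' > "$dir/nvidia.yaml"
cdi_specs_present "$dir"   # prints the path of nvidia.yaml
```

On the gateway this would run against the real directory, e.g. via `openshell doctor exec`.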
