Commit 6840a21 (parent 65f5642)

docs(debug-skill): add CDI device plugin diagnostics for GPU gateways

1 file changed, +43 −1: `.agents/skills/debug-openshell-cluster/SKILL.md`
@@ -256,7 +256,43 @@ Look for:

- `OOMKilled` — memory limits too low
- `FailedMount` — volume issues

### Step 8: Check GPU Device Plugin and CDI (GPU gateways only)

Skip this step for non-GPU gateways.

The NVIDIA device plugin DaemonSet must be running and healthy before GPU sandboxes can be created. It uses CDI injection (`deviceListStrategy: cdi-cri`) to inject GPU devices into sandbox pods — no `runtimeClassName` is set on sandbox pods.

```bash
# DaemonSet status — numberReady must be >= 1
openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin

# Device plugin pod logs — look for "CDI" lines confirming CDI mode is active
openshell doctor exec -- kubectl logs -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin --tail=50

# List CDI devices registered by the device plugin (requires nvidia-ctk in the cluster image).
# Device plugin CDI entries use the vendor string "k8s.device-plugin.nvidia.com", so entries
# will be prefixed "k8s.device-plugin.nvidia.com/gpu=". If the list is empty, CDI spec
# generation has not completed yet.
openshell doctor exec -- nvidia-ctk cdi list

# Verify CDI spec files were generated on the node
openshell doctor exec -- ls /var/run/cdi/

# Helm install job logs for the device plugin chart
openshell doctor exec -- kubectl -n kube-system logs -l job-name=helm-install-nvidia-device-plugin --tail=100

# Confirm a GPU sandbox pod has no runtimeClassName (CDI injection, not runtime class)
openshell doctor exec -- kubectl get pod -n openshell -o jsonpath='{range .items[*]}{.metadata.name}{" runtimeClassName="}{.spec.runtimeClassName}{"\n"}{end}'
```

Common issues:

- **DaemonSet 0/N ready**: The device plugin chart may still be deploying (the k3s Helm controller can take 1–2 min) or the pod is crashing. Check pod logs.
- **`nvidia-ctk cdi list` returns no `k8s.device-plugin.nvidia.com/gpu=` entries**: CDI spec generation has not completed. The device plugin may still be starting, or the `cdi-cri` strategy isn't active. Verify `deviceListStrategy: cdi-cri` is in the rendered Helm values.
- **No CDI spec files at `/var/run/cdi/`**: Same as above — the device plugin hasn't written CDI specs yet.
- **`HEALTHCHECK_GPU_DEVICE_PLUGIN_NOT_READY` in health check logs**: The device plugin has no ready pods. Check DaemonSet events and pod logs.
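The empty-`cdi list` symptom can also be checked mechanically. A minimal sketch, assuming a hypothetical `cdi_gpu_count` helper; only the `k8s.device-plugin.nvidia.com/gpu=` prefix comes from this doc, and the sample entries are illustrative:

```shell
# cdi_gpu_count: count device-plugin GPU entries in `nvidia-ctk cdi list`
# output read from stdin. 0 means CDI spec generation has not completed.
cdi_gpu_count() {
  grep -c '^k8s\.device-plugin\.nvidia\.com/gpu=' || true
}

# Illustrative sample output for a healthy single-GPU node:
printf '%s\n' \
  'k8s.device-plugin.nvidia.com/gpu=0' \
  'k8s.device-plugin.nvidia.com/gpu=all' | cdi_gpu_count   # prints 2
```

On a GPU gateway the input would instead come from `openshell doctor exec -- nvidia-ctk cdi list`.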
### Step 9: Check DNS Resolution

DNS misconfiguration is a common root cause, especially on remote/Linux hosts:

@@ -314,6 +350,7 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w

| gRPC `UNIMPLEMENTED` for newer RPCs in push mode | Helm values still point at older pulled images instead of the pushed refs | Verify rendered `openshell-helmchart.yaml` uses the expected push refs (`server`, `sandbox`, `pki-job`) and not `:latest` |
| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` target in `deploy/docker/Dockerfile.images`. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker |
| `HEALTHCHECK_MISSING_SUPERVISOR` in health check logs | `/opt/openshell/bin/openshell-sandbox` not found in gateway container | Rebuild cluster image: `mise run docker:build:cluster`, then `openshell gateway destroy <name> && openshell gateway start` |
| `nvidia-ctk cdi list` returns no `k8s.device-plugin.nvidia.com/gpu=` entries | CDI specs not yet generated by device plugin | Device plugin may still be starting; wait and retry, or check pod logs (Step 8) |
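Several of the remediations above reduce to "wait and retry". A generic retry helper, as a sketch: the `retry_until` name is an assumption, and the 24 x 5 s budget in the usage comment mirrors the 1–2 minute Helm-controller note in Step 8.

```shell
# retry_until ATTEMPTS DELAY CMD...: run CMD until it succeeds, trying at
# most ATTEMPTS times and sleeping DELAY seconds between tries.
retry_until() {
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: wait up to ~2 minutes for the device plugin DaemonSet to report a
# ready pod (the kubectl command is from Step 8; the budget is an assumption):
# retry_until 24 5 sh -c \
#   '[ "$(openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin \
#        -o jsonpath="{.items[0].status.numberReady}")" -ge 1 ]'
```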

## Full Diagnostic Dump

@@ -367,4 +404,9 @@

```bash
openshell doctor exec -- ls -la /opt/openshell/bin/openshell-sandbox

echo "=== DNS Configuration ==="
openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf

# GPU gateways only
echo "=== GPU Device Plugin ==="
openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin
openshell doctor exec -- nvidia-ctk cdi list
```
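One possible extension of the dump for GPU gateways (not part of this commit): CDI spec files are JSON or YAML documents whose `kind` field names the vendor/class, so the files under `/var/run/cdi/` can be grepped for the device-plugin vendor string. The `cdi_specs_present` helper below is hypothetical:

```shell
# cdi_specs_present DIR: print the CDI spec files in DIR that mention the
# device-plugin vendor string; print nothing if none do.
cdi_specs_present() {
  grep -l 'k8s\.device-plugin\.nvidia\.com' "$1"/* 2>/dev/null || true
}

# Example with an illustrative spec fragment written to a temp dir:
dir=$(mktemp -d)
printf 'cdiVersion: "0.6.0"\nkind: k8s.device-plugin.nvidia.com/gpu\n' > "$dir/nvidia.yaml"
cdi_specs_present "$dir"   # prints the path of nvidia.yaml
```

On the gateway this would run against the real directory, e.g. via `openshell doctor exec`.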
