fix(onboard): detect stale or unrefreshed NVIDIA CDI specs in host preflight#4733
fix(onboard): detect stale or unrefreshed NVIDIA CDI specs in host preflight#4733zyang-dev wants to merge 7 commits into
Conversation
…eflight Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
…oid false-positive blocks Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughDetects and flags missing or stale NVIDIA CDI GPU specs, compares declared vs live device nodes, probes nvidia-cdi-refresh unit health, emits generate/refresh/warn remediation actions, performs installer-level refresh via systemd (with sudo when needed), and adds dedicated tests and harness updates. ChangesCDI Stale Spec Detection and Refresh Remediation
Sequence DiagramsequenceDiagram
participant CLI as assessHost
participant DockerCdi as docker-cdi.ts
participant Systemctl as systemctl
participant Planner as planHostRemediation
participant Installer as installer (scripts/install.sh)
CLI->>DockerCdi: discover effective CDI spec / collect device nodes
DockerCdi->>CLI: report mismatch / stale
CLI->>Systemctl: probe nvidia-cdi-refresh (is-enabled/is-failed)
CLI->>Planner: compute remediation (missing | stale | warn)
Planner->>Installer: emit repair plan (generate / refresh / warn)
Installer->>Systemctl: enable/start nvidia-cdi-refresh (sudo if needed)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
PR Review AdvisorFindings: 3 needs attention, 3 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/install.sh`:
- Around line 1903-1912: The stale-spec auto-repair branch currently runs when
host.cdiNvidiaGpuSpecStale && host.cdiNvidiaGpuSpecNeedsRepair &&
!host.cdiNvidiaGpuSpecMissing && !isWslDockerDesktopRuntime(host), which can
prompt sudo even when nvidia-container-toolkit is absent; add a requirement that
host.nvidiaContainerToolkitInstalled is true to that condition (i.e., gate the
stale branch on host.nvidiaContainerToolkitInstalled) so the installer defers to
the later "install toolkit first" remediation instead of attempting auto-repair.
In `@src/lib/onboard.ts`:
- Around line 1844-1849: The guard that checks host.cdiNvidiaGpuSpecNeedsRepair,
host.cdiNvidiaGpuSpecMissing and optedOutGpuPassthrough is verbose and inflates
the file; collapse it into a single compact conditional and keep behavior
identical (early return when not needing repair or when opted out). Then
compress the subsequent error logging/throwing block (the code that runs when
the check fails) by formatting the error message as a single-line template
string or using a short helper to build the message so you remove extra blank
lines and multi-line concatenation; update references to the same symbols
(host.cdiNvidiaGpuSpecNeedsRepair, host.cdiNvidiaGpuSpecMissing,
optedOutGpuPassthrough) and to the function that emits the error so behavior
remains unchanged.
In `@src/lib/onboard/preflight.ts`:
- Around line 1189-1191: The stale-spec branch must mirror the missing-spec
branch by checking assessment.systemctlAvailable before producing systemd-only
commands: change the repairCommands assignment so when missingSpec is false it
uses buildStaleCdiWarnCommands(flaggedFilePath) only if
assessment.systemctlAvailable is true, otherwise call a new or existing
non-systemd fallback (e.g., buildStaleCdiManualWarnCommands or return a
non-blocking warning action) that emits manual remediation steps instead of
systemctl commands; update references to repairCommands,
buildStaleCdiWarnCommands, buildNvidiaCdiRepairCommands,
assessment.systemctlAvailable, specPath and flaggedFilePath accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f820c13a-4139-4ddd-be6a-1414d00efb93
📒 Files selected for processing (6)
scripts/install.shsrc/lib/onboard.tssrc/lib/onboard/preflight-cdi.test.tssrc/lib/onboard/preflight.test.tssrc/lib/onboard/preflight.tstest/install-preflight.test.ts
💤 Files with no reviewable changes (1)
- src/lib/onboard/preflight.test.ts
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard/preflight-cdi.test.ts (1)
51-64:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMake these
assessHostcases fully hermetic.These tests don't inject
runCaptureImpl, soassessHost()still shells out for package-manager detection. That makes the suite depend on whatever tools are installed on the runner instead of just the fixture data. Add a shared no-op/stubrunCaptureImplfor these cases so the unit tests stop touching the host environment. As per coding guidelines,**/*.test.{js,ts}: Mock external dependencies in unit tests; do not call real NVIDIA APIs in unit tests.Also applies to: 71-86, 222-240
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/preflight-cdi.test.ts` around lines 51 - 64, The tests call assessHost but don't inject runCaptureImpl, so package-manager detection still shells out; add a hermetic stub by passing runCaptureImpl that returns a deterministic capture object (e.g., { stdout: "", stderr: "", code: 0 } or similar) to each assessHost invocation in this test file; update the assessHost calls (the ones in this test and the other similar blocks) to include runCaptureImpl: () => ({ stdout: "", stderr: "", code: 0 }) so the unit tests no longer touch the real host or NVIDIA tools.
♻️ Duplicate comments (1)
src/lib/onboard/preflight.ts (1)
894-896:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep stale-spec remediation runnable without systemd.
When
systemctlAvailable === false, the stale-spec path still emits onlysystemctlcommands and then marks the action blocking. That leaves non-systemd Linux hosts with no executable repair path, even though the missing-spec path already has a fallback. Gate this branch the same way or downgrade it to a non-blocking/manual warning.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/preflight.ts` around lines 894 - 896, The stale-spec branch currently always generates systemctl-only remediation via buildStaleCdiWarnCommands and can be blocking on non-systemd hosts; modify the logic that sets repairCommands so it checks systemctlAvailable the same way the missingSpec path does (or make the stale-spec path non-blocking/manual when systemctlAvailable === false). Specifically, when choosing between buildNvidiaCdiRepairCommands and buildStaleCdiWarnCommands, gate the buildStaleCdiWarnCommands call on systemctlAvailable (or replace it with a non-systemd fallback/no-op warning and clear the blocking flag) using the existing symbols repairCommands, systemctlAvailable, buildStaleCdiWarnCommands, buildNvidiaCdiRepairCommands, missingSpec, and flaggedFilePath to locate and implement the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/docker-cdi.ts`:
- Around line 257-272: The shell commands generated by
buildNvidiaCdiRepairCommands interpolate specDir and specPath raw into strings;
run specDir and specPath through the repo's shell-quoting helper before building
each command string so spaces or metacharacters are safely quoted, and do the
same for flaggedFilePath in the other remediation block referenced around lines
300-305—i.e., locate where flaggedFilePath is interpolated and replace direct
interpolation with the quoted value from the shell-quoting helper.
---
Outside diff comments:
In `@src/lib/onboard/preflight-cdi.test.ts`:
- Around line 51-64: The tests call assessHost but don't inject runCaptureImpl,
so package-manager detection still shells out; add a hermetic stub by passing
runCaptureImpl that returns a deterministic capture object (e.g., { stdout: "",
stderr: "", code: 0 } or similar) to each assessHost invocation in this test
file; update the assessHost calls (the ones in this test and the other similar
blocks) to include runCaptureImpl: () => ({ stdout: "", stderr: "", code: 0 })
so the unit tests no longer touch the real host or NVIDIA tools.
---
Duplicate comments:
In `@src/lib/onboard/preflight.ts`:
- Around line 894-896: The stale-spec branch currently always generates
systemctl-only remediation via buildStaleCdiWarnCommands and can be blocking on
non-systemd hosts; modify the logic that sets repairCommands so it checks
systemctlAvailable the same way the missingSpec path does (or make the
stale-spec path non-blocking/manual when systemctlAvailable === false).
Specifically, when choosing between buildNvidiaCdiRepairCommands and
buildStaleCdiWarnCommands, gate the buildStaleCdiWarnCommands call on
systemctlAvailable (or replace it with a non-systemd fallback/no-op warning and
clear the blocking flag) using the existing symbols repairCommands,
systemctlAvailable, buildStaleCdiWarnCommands, buildNvidiaCdiRepairCommands,
missingSpec, and flaggedFilePath to locate and implement the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a42d9078-1396-486d-a757-485dcbccdbbf
📒 Files selected for processing (4)
src/lib/onboard/docker-cdi.test.tssrc/lib/onboard/docker-cdi.tssrc/lib/onboard/preflight-cdi.test.tssrc/lib/onboard/preflight.ts
✅ Files skipped from review due to trivial changes (1)
- src/lib/onboard/docker-cdi.test.ts
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard/docker-cdi.ts (1)
337-379:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAlign user-facing explanation with remediation command (
nemoclaw onboard) to avoid confusing OpenShell command references.
src/lib/onboard/docker-cdi.tsremediation tells users to rerunnemoclaw onboard(optionally--no-gpu), but the explanation strings attribute the failure to OpenShell’sgateway start --gpu, which is only an internal detail of the GPU passthrough path. Rewording to referencenemoclaw onboardGPU passthrough keeps the cause while matching the command users are actually told to run.📝 Proposed wording alignment
- "OpenShell's `gateway start --gpu` injects devices from the CDI spec, so a stale " + + "`nemoclaw onboard` (GPU passthrough) injects devices from the CDI spec, so a stale " +reasons.push( - "OpenShell's `gateway start --gpu` can fail until the CDI spec is refreshed and verified.", + "`nemoclaw onboard` GPU passthrough can fail until the CDI spec is refreshed and verified.", );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/docker-cdi.ts` around lines 337 - 379, Replace user-facing references to "OpenShell's `gateway start --gpu`" with the user-facing remediation command "nemoclaw onboard" (mentioning the optional --gpu flag where appropriate) in the strings in this file: update the long explanatory return string near the top of docker-cdi.ts and the final reasons.push in explainNvidiaCdiRepairReason so they reference running "nemoclaw onboard" (or "nemoclaw onboard --no-gpu"/"--gpu" as applicable) instead of the internal OpenShell command; keep the rest of each message unchanged so the technical detail and mitigation advice remain intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/lib/onboard/docker-cdi.ts`:
- Around line 337-379: Replace user-facing references to "OpenShell's `gateway
start --gpu`" with the user-facing remediation command "nemoclaw onboard"
(mentioning the optional --gpu flag where appropriate) in the strings in this
file: update the long explanatory return string near the top of docker-cdi.ts
and the final reasons.push in explainNvidiaCdiRepairReason so they reference
running "nemoclaw onboard" (or "nemoclaw onboard --no-gpu"/"--gpu" as
applicable) instead of the internal OpenShell command; keep the rest of each
message unchanged so the technical detail and mitigation advice remain intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 05a9f650-d750-47e3-93a9-56686c4d79b3
📒 Files selected for processing (2)
src/lib/onboard/docker-cdi.test.tssrc/lib/onboard/docker-cdi.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard/docker-cdi.test.ts
Summary
--gpus allresolves GPUs via the CDI spec, so a stale/etc/cdi/nvidia.yaml(e.g. annvidia-uvmdevice major that drifted after a driver change) silently injects the wrong device and breaks CUDA in GPU containers (nvidia-smiworks,cuInitfails). This adds host-preflight detection for a missing, stale, or unrefreshed NVIDIA CDI spec and guides the user to regenerate it / enable thenvidia-cdi-refreshunits.Related Issue
Fixes #4658
Changes
nvidia.com/gpuCDI spec when comparing declared/devGPU device nodes with live device numbers./etc/cdi/nvidia.yamlleftovers when a fresh/var/run/cdi/nvidia.yamloverrides them.minorvalues to0and preferhostPathfor host-side stat checks.nvidia-ctk cdi generatefallback.nvidia-cdi-refreshservice refresh commands only, with optional manual leftover removal in displayed guidance.src/lib/onboard/preflight-cdi.test.tsand add regression coverage for precedence, stale/matching/absent devices, omitted-minor nodes, hostPath, refresh warnings, and stale auto-fix behavior.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Reproduced #4658 and manually verified that this PR fixes #4658
Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com
Summary by CodeRabbit
New Features
Bug Fixes
Tests