Skip to content

fix(onboard): detect stale or unrefreshed NVIDIA CDI specs in host preflight#4733

Open
zyang-dev wants to merge 7 commits into
mainfrom
fix/cdi-refresh-preflight
Open

fix(onboard): detect stale or unrefreshed NVIDIA CDI specs in host preflight#4733
zyang-dev wants to merge 7 commits into
mainfrom
fix/cdi-refresh-preflight

Conversation

@zyang-dev
Copy link
Copy Markdown
Contributor

@zyang-dev zyang-dev commented Jun 3, 2026

Summary

--gpus all resolves GPUs via the CDI spec, so a stale /etc/cdi/nvidia.yaml (e.g. an nvidia-uvm device major that drifted after a driver change) silently injects the wrong device and breaks CUDA in GPU containers (nvidia-smi works, cuInit fails). This adds host-preflight detection for a missing, stale, or unrefreshed NVIDIA CDI spec and guides the user to regenerate it / enable the nvidia-cdi-refresh units.

Related Issue

Fixes #4658

Changes

  • Check only the highest-precedence effective nvidia.com/gpu CDI spec when comparing declared /dev GPU device nodes with live device numbers.
  • Ignore stale /etc/cdi/nvidia.yaml leftovers when a fresh /var/run/cdi/nvidia.yaml overrides them.
  • Default omitted CDI device-node minor values to 0 and prefer hostPath for host-side stat checks.
  • Keep missing-spec remediation unchanged, including direct nvidia-ctk cdi generate fallback.
  • Split stale-spec remediation to use nvidia-cdi-refresh service refresh commands only, with optional manual leftover removal in displayed guidance.
  • Keep NVIDIA CDI refresh-unit health as a non-blocking warning unless the spec is missing or proven stale.
  • Move CDI preflight coverage into src/lib/onboard/preflight-cdi.test.ts and add regression coverage for precedence, stale/matching/absent devices, omitted-minor nodes, hostPath, refresh warnings, and stale auto-fix behavior.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Reproduced #4658 and manually verified that this PR fixes #4658


Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com

Summary by CodeRabbit

  • New Features

    • Detects stale NVIDIA CDI GPU device specs and attempts automatic refresh via the NVIDIA CDI refresh service during host preflight; provides manual guidance for non-systemd hosts.
  • Bug Fixes

    • Treats stale CDI GPU specs as blocking when CDI injection is configured.
    • Installer repair now prefers enabling/starting the NVIDIA refresh service (with elevation) and updates messaging to say specs may be "missing or stale" and are being refreshed.
  • Tests

    • Added/updated tests covering CDI detection, stale vs missing remediation plans, and installer repair behaviors.

zyang-dev added 3 commits June 3, 2026 14:48
…eflight

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
…oid false-positive blocks

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 3, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Detects and flags missing or stale NVIDIA CDI GPU specs, compares declared vs live device nodes, probes nvidia-cdi-refresh unit health, emits generate/refresh/warn remediation actions, performs installer-level refresh via systemd (with sudo when needed), and adds dedicated tests and harness updates.

Changes

CDI Stale Spec Detection and Refresh Remediation

Layer / File(s) Summary
docker-cdi module: types, parsing, detection, and command builders
src/lib/onboard/docker-cdi.ts, src/lib/onboard/docker-cdi.test.ts
Adds CDI-focused public types, spec-dir parsing, effective-spec selection, device-node collection, mismatch detection, and command/explanation builders for repair/refresh/warn; includes parsing, staleness detection, and remediation command tests.
Preflight assessment and remediation planner
src/lib/onboard/preflight.ts
Extends HostAssessment with CDI fields, probes nvidia-cdi-refresh systemd state, computes stale/mismatch/needsRepair flags, and updates planHostRemediation to emit generate/refresh/warn actions with toolkit/WSL/systemctl branching.
onboard: runtime assertion update
src/lib/onboard.ts
assertCdiNvidiaGpuSpecPresent treats missing OR stale CDI specs as fatal when CDI injection is configured and updates the fatal message text.
Installer-level stale-spec repair and messaging
scripts/install.sh
Adds repair_installer_stale_nvidia_cdi_spec(flagged_file) to enable/start nvidia-cdi-refresh (uses sudo when needed); repair_installer_nvidia_cdi_spec now distinguishes missing vs stale and delegates; CLI messaging updated to reference “missing or stale” / “Refreshing NVIDIA CDI device spec”.
Tests, harness updates, and test reorg
src/lib/onboard/preflight-cdi.test.ts, src/lib/onboard/docker-cdi.test.ts, src/lib/onboard/preflight.test.ts, test/install-preflight.test.ts
Adds dedicated CDI preflight and docker-cdi test suites; removes moved CDI blocks from preflight.test.ts; extends install-preflight harness to simulate stale state, pass CDI_STALE_FILE, and verify refresh-only repair behavior; updates assertions.

Sequence Diagram

sequenceDiagram
  participant CLI as assessHost
  participant DockerCdi as docker-cdi.ts
  participant Systemctl as systemctl
  participant Planner as planHostRemediation
  participant Installer as installer (scripts/install.sh)

  CLI->>DockerCdi: discover effective CDI spec / collect device nodes
  DockerCdi->>CLI: report mismatch / stale
  CLI->>Systemctl: probe nvidia-cdi-refresh (is-enabled/is-failed)
  CLI->>Planner: compute remediation (missing | stale | warn)
  Planner->>Installer: emit repair plan (generate / refresh / warn)
  Installer->>Systemctl: enable/start nvidia-cdi-refresh (sudo if needed)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4298: Modifies CDI remediation/validation decision points and WSL+Docker Desktop exemptions; related to CDI handling and installer repair flow.

Suggested labels

fix, Docker, bug-fix, Platform: Ubuntu

Suggested reviewers

  • cv

"I sniffed the stale CDI trace,
I nudged systemd to start the race,
I checked majors, minors, and specs,
I hopped to enable refresh checks,
Now GPUs wake with sunny grace!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main change: detecting stale or unrefreshed NVIDIA CDI specs in host preflight, which aligns with the core objective of adding staleness detection to prevent incorrect device node injection.
Linked Issues check ✅ Passed The PR implements the primary coding objective from #4658: detecting and handling stale NVIDIA CDI specs that cause incorrect device node injection and CUDA failures in containers. All major requirements are met.
Out of Scope Changes check ✅ Passed All changes are focused on NVIDIA CDI spec detection, staleness validation, and remediation. No out-of-scope modifications were introduced; changes remain targeted to the CDI preflight assessment and repair workflow.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cdi-refresh-preflight

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

PR Review Advisor

Findings: 3 needs attention, 3 worth checking, 0 nice ideas
Since last review: 2 prior items resolved, 3 still apply, 2 new items found

Review findings

🛠️ Needs attention

  • Add runtime validation for the linked DGX Spark vLLM startup path: The linked issue's expected result is end-to-end runtime behavior: the vLLM container starts, serves the target model on port 8000, onboarding reaches inference verification, and status shows a running provider. This PR adds strong unit coverage for CDI detection and installer repair, but the changed installer/onboarding/sandbox infrastructure does not include or identify targeted runtime validation proving that the documented DGX Spark vLLM path is unblocked.
    • Recommendation: Add or identify targeted runtime/integration validation for the changed behavior on the affected GPU/CDI path, such as confirming stale CDI preflight repair clears and the documented vLLM onboarding path reaches inference verification and a running provider status. Keep this separate from reporting external CI/E2E status.
    • Evidence: Issue [DGX Spark][Inference] PR #4619 vLLM/NVFP4 path crashes on container start with "CUDA unknown error" #4658 Expected Result: "Container `nemoclaw-vllm` starts cleanly, vLLM serves `nvidia/Qwen3.6-35B-A3B-NVFP4` on port 8000, onboard proceeds to `[4/8] Verifying inference` and beyond. `nemoclaw <name> status` shows running provider." Deterministic testDepth verdict: runtime_validation_recommended for `scripts/install.sh`, `src/lib/onboard.ts`, `src/lib/onboard/docker-cdi.ts`, and `src/lib/onboard/preflight.ts`.
  • Offset preflight monolith growth before merge (src/lib/onboard/preflight.ts:1): The PR extracts CDI helpers into `docker-cdi.ts`, but the current large `preflight.ts` hotspot still grows by more than the repository's monolith threshold. This increases maintenance risk in an already high-churn onboarding/preflight surface.
    • Recommendation: Move enough of the new CDI assessment/remediation planning out of `preflight.ts` or otherwise offset the growth so this large-file hotspot does not grow by 20+ lines.
    • Evidence: Deterministic monolith evidence: `src/lib/onboard/preflight.ts` baseLines 2074, headLines 2116, delta +42, severity blocker: "Current monolith grew by 20 or more lines; extract or offset the growth before merge."
  • Update stale CDI remediation assertion for quoted command output (src/lib/onboard/preflight-cdi.test.ts:280): The stale CDI remediation test still searches for an unquoted `sudo rm -f /etc/cdi/nvidia.yaml` substring, while the implementation now intentionally shell-quotes the path as `sudo rm -f '/etc/cdi/nvidia.yaml'`. That makes the changed test expectation inconsistent with the security fix and likely fails when run against the current implementation.
    • Recommendation: Update the assertion to expect the quoted command, or assert via the command-builder tests already checking quoted stale paths. Keep the test aligned with the new shell-safe display format.
    • Evidence: `buildStaleCdiWarnCommands()` emits `sudo rm -f ${shellQuote(flaggedFilePath)}` in `src/lib/onboard/docker-cdi.ts`, and `src/lib/onboard/docker-cdi.test.ts` expects `sudo rm -f '/etc/cdi/nvidia.yaml'`; `src/lib/onboard/preflight-cdi.test.ts` still checks `command.includes("sudo rm -f /etc/cdi/nvidia.yaml")`.

🔎 Worth checking

  • Source-of-truth review needed: NVIDIA CDI missing/stale spec repair in installer and host preflight: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: Recovery paths include `repair_installer_stale_nvidia_cdi_spec`, `buildNvidiaCdiRepairCommands`, `buildStaleCdiWarnCommands`, `buildStaleCdiManualWarnCommands`, and direct-generation fallback guidance.
  • Preserve stale CDI mismatch file paths that contain whitespace (src/lib/onboard/docker-cdi.ts:288): The displayed command builders now quote paths safely, but stale mismatch extraction still treats the first whitespace-delimited token as the file path. Since CDI spec directories can be host filesystem paths, a path containing spaces is truncated before it reaches the quoted remediation guidance or installer stale repair messaging.
    • Recommendation: Return structured mismatch data instead of encoding it as a whitespace-delimited string, or encode the file path in a reversible way. Add an end-to-end test that creates a stale mismatch from a CDI path containing spaces and verifies the full path reaches remediation output.
    • Evidence: `findCdiDeviceNodeMismatch()` returns `${node.filePath} ${node.path}=...`; `extractCdiMismatchFilePath()` slices at the first whitespace, and the installer probe uses `mismatch.trim().split(/\s+/, 1)[0]`. Builder-only tests cover quoting `/tmp/cdi dir/...`, but not extraction from the mismatch string.
  • Document the removal condition for the CDI refresh workaround (src/lib/onboard/docker-cdi.ts:274): The PR adds recovery behavior around stale or missing NVIDIA CDI specs and refresh-service health. The invalid state, source boundary, and regression tests are now clear, but the code still does not state when this localized workaround can be removed. Without a removal condition, host-workaround logic can become permanent source-of-truth drift.
    • Recommendation: Add a short comment or tracking note near the refresh/direct-generation remediation path that states the removal condition, for example when OpenShell/Docker no longer rely on CDI for GPU injection or NVIDIA Container Toolkit guarantees service-managed CDI specs are refreshed before Docker/OpenShell use them.
    • Evidence: Recovery paths include `repair_installer_stale_nvidia_cdi_spec`, `buildNvidiaCdiRepairCommands`, `buildStaleCdiWarnCommands`, and `buildStaleCdiManualWarnCommands`; tests cover stale effective specs, refresh warnings, missing-spec fallback, toolkit absence, and WSL Docker Desktop exemption, but no removal condition is documented near the workaround.

🌱 Nice ideas

  • None.
Since last review details

Current findings:

  • Source-of-truth review needed: NVIDIA CDI missing/stale spec repair in installer and host preflight: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: Recovery paths include `repair_installer_stale_nvidia_cdi_spec`, `buildNvidiaCdiRepairCommands`, `buildStaleCdiWarnCommands`, `buildStaleCdiManualWarnCommands`, and direct-generation fallback guidance.
  • Add runtime validation for the linked DGX Spark vLLM startup path: The linked issue's expected result is end-to-end runtime behavior: the vLLM container starts, serves the target model on port 8000, onboarding reaches inference verification, and status shows a running provider. This PR adds strong unit coverage for CDI detection and installer repair, but the changed installer/onboarding/sandbox infrastructure does not include or identify targeted runtime validation proving that the documented DGX Spark vLLM path is unblocked.
    • Recommendation: Add or identify targeted runtime/integration validation for the changed behavior on the affected GPU/CDI path, such as confirming stale CDI preflight repair clears and the documented vLLM onboarding path reaches inference verification and a running provider status. Keep this separate from reporting external CI/E2E status.
    • Evidence: Issue [DGX Spark][Inference] PR #4619 vLLM/NVFP4 path crashes on container start with "CUDA unknown error" #4658 Expected Result: "Container `nemoclaw-vllm` starts cleanly, vLLM serves `nvidia/Qwen3.6-35B-A3B-NVFP4` on port 8000, onboard proceeds to `[4/8] Verifying inference` and beyond. `nemoclaw <name> status` shows running provider." Deterministic testDepth verdict: runtime_validation_recommended for `scripts/install.sh`, `src/lib/onboard.ts`, `src/lib/onboard/docker-cdi.ts`, and `src/lib/onboard/preflight.ts`.
  • Offset preflight monolith growth before merge (src/lib/onboard/preflight.ts:1): The PR extracts CDI helpers into `docker-cdi.ts`, but the current large `preflight.ts` hotspot still grows by more than the repository's monolith threshold. This increases maintenance risk in an already high-churn onboarding/preflight surface.
    • Recommendation: Move enough of the new CDI assessment/remediation planning out of `preflight.ts` or otherwise offset the growth so this large-file hotspot does not grow by 20+ lines.
    • Evidence: Deterministic monolith evidence: `src/lib/onboard/preflight.ts` baseLines 2074, headLines 2116, delta +42, severity blocker: "Current monolith grew by 20 or more lines; extract or offset the growth before merge."
  • Update stale CDI remediation assertion for quoted command output (src/lib/onboard/preflight-cdi.test.ts:280): The stale CDI remediation test still searches for an unquoted `sudo rm -f /etc/cdi/nvidia.yaml` substring, while the implementation now intentionally shell-quotes the path as `sudo rm -f '/etc/cdi/nvidia.yaml'`. That makes the changed test expectation inconsistent with the security fix and likely fails when run against the current implementation.
    • Recommendation: Update the assertion to expect the quoted command, or assert via the command-builder tests already checking quoted stale paths. Keep the test aligned with the new shell-safe display format.
    • Evidence: `buildStaleCdiWarnCommands()` emits `sudo rm -f ${shellQuote(flaggedFilePath)}` in `src/lib/onboard/docker-cdi.ts`, and `src/lib/onboard/docker-cdi.test.ts` expects `sudo rm -f '/etc/cdi/nvidia.yaml'`; `src/lib/onboard/preflight-cdi.test.ts` still checks `command.includes("sudo rm -f /etc/cdi/nvidia.yaml")`.
  • Preserve stale CDI mismatch file paths that contain whitespace (src/lib/onboard/docker-cdi.ts:288): The displayed command builders now quote paths safely, but stale mismatch extraction still treats the first whitespace-delimited token as the file path. Since CDI spec directories can be host filesystem paths, a path containing spaces is truncated before it reaches the quoted remediation guidance or installer stale repair messaging.
    • Recommendation: Return structured mismatch data instead of encoding it as a whitespace-delimited string, or encode the file path in a reversible way. Add an end-to-end test that creates a stale mismatch from a CDI path containing spaces and verifies the full path reaches remediation output.
    • Evidence: `findCdiDeviceNodeMismatch()` returns `${node.filePath} ${node.path}=...`; `extractCdiMismatchFilePath()` slices at the first whitespace, and the installer probe uses `mismatch.trim().split(/\s+/, 1)[0]`. Builder-only tests cover quoting `/tmp/cdi dir/...`, but not extraction from the mismatch string.
  • Document the removal condition for the CDI refresh workaround (src/lib/onboard/docker-cdi.ts:274): The PR adds recovery behavior around stale or missing NVIDIA CDI specs and refresh-service health. The invalid state, source boundary, and regression tests are now clear, but the code still does not state when this localized workaround can be removed. Without a removal condition, host-workaround logic can become permanent source-of-truth drift.
    • Recommendation: Add a short comment or tracking note near the refresh/direct-generation remediation path that states the removal condition, for example when OpenShell/Docker no longer rely on CDI for GPU injection or NVIDIA Container Toolkit guarantees service-managed CDI specs are refreshed before Docker/OpenShell use them.
    • Evidence: Recovery paths include `repair_installer_stale_nvidia_cdi_spec`, `buildNvidiaCdiRepairCommands`, `buildStaleCdiWarnCommands`, and `buildStaleCdiManualWarnCommands`; tests cover stale effective specs, refresh warnings, missing-spec fallback, toolkit absence, and WSL Docker Desktop exemption, but no removal condition is documented near the workaround.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

E2E Advisor Recommendation

Required E2E: gpu-e2e, cloud-onboard-e2e
Optional E2E: onboard-negative-paths-e2e, gpu-double-onboard-e2e

Dispatch hint: cloud-onboard-e2e,gpu-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-e2e (high): Required because the core runtime risk is real NVIDIA GPU passthrough via Docker CDI into OpenShell. This is the existing GPU runner E2E that runs install/onboard with the Ollama local GPU provider and verifies sandbox inference on an NVIDIA GPU host.
  • cloud-onboard-e2e (high): Required because the installer and onboard preflight path changed. This existing E2E exercises the public install flow, non-interactive onboarding, gateway/sandbox creation, sandbox health, and inference.local/security checks on the standard Ubuntu path, catching non-GPU regressions introduced by the new host preflight plumbing.

Optional E2E

  • onboard-negative-paths-e2e (medium): Useful adjacent coverage for onboarding preflight/error handling and non-interactive validation paths. It is not a direct CDI stale-spec test, so it is optional rather than merge-blocking for this PR.
  • gpu-double-onboard-e2e (high): Optional confidence for repeat GPU onboarding on a real GPU host. The PR changes GPU preflight and installer repair behavior that could affect re-onboard flows, but gpu-e2e is the primary required GPU coverage.

New E2E recommendations

  • gpu-cdi-stale-spec-repair (high): No existing E2E appears to force Docker CDISpecDirs with a deliberately stale effective nvidia.com/gpu CDI spec, then verify that install.sh/onboard refreshes nvidia-cdi-refresh.service or blocks with the intended remediation before OpenShell gateway start. The new unit tests cover parsing and planning, but not the real host/service/sudo/gateway interaction.
    • Suggested test: Add a GPU-runner E2E that seeds a mismatched nvidia.yaml CDI device-node major/minor, runs install.sh or nemoclaw onboard with GPU enabled, and asserts the stale CDI repair path refreshes the service-managed /var/run/cdi/nvidia.yaml or exits with actionable refresh guidance without starting a broken OpenShell GPU gateway.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: cloud-onboard-e2e,gpu-e2e

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw, gpu-repo-local-ollama-openclaw
Optional scenario E2E: wsl-repo-cloud-openclaw

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • ubuntu-repo-cloud-openclaw: Core Linux repo-current onboarding and installer/preflight code changed. This scenario exercises the standard Ubuntu Docker onboarding path and catches broad regressions from the new preflight/docker-cdi imports and onboarding guard behavior.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw
  • gpu-repo-local-ollama-openclaw: The PR specifically changes NVIDIA Docker CDI detection, stale-spec remediation, and GPU preflight blocking behavior. This is the only routed scenario with a gpu-docker-cdi runtime, so it is required despite using a special GPU runner.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Optional scenario E2E

  • wsl-repo-cloud-openclaw: Optional adjacent coverage for preflight/onboarding behavior under WSL, including the WSL Docker Desktop CDI exemption paths touched by the change. Kept optional because it uses a special Windows/WSL runner and the primary CDI surface is covered by the GPU scenario.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw

Relevant changed files

  • scripts/install.sh
  • src/lib/onboard.ts
  • src/lib/onboard/docker-cdi.ts
  • src/lib/onboard/preflight.ts

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/install.sh`:
- Around line 1903-1912: The stale-spec auto-repair branch currently runs when
host.cdiNvidiaGpuSpecStale && host.cdiNvidiaGpuSpecNeedsRepair &&
!host.cdiNvidiaGpuSpecMissing && !isWslDockerDesktopRuntime(host), which can
prompt sudo even when nvidia-container-toolkit is absent; add a requirement that
host.nvidiaContainerToolkitInstalled is true to that condition (i.e., gate the
stale branch on host.nvidiaContainerToolkitInstalled) so the installer defers to
the later "install toolkit first" remediation instead of attempting auto-repair.

In `@src/lib/onboard.ts`:
- Around line 1844-1849: The guard that checks host.cdiNvidiaGpuSpecNeedsRepair,
host.cdiNvidiaGpuSpecMissing and optedOutGpuPassthrough is verbose and inflates
the file; collapse it into a single compact conditional and keep behavior
identical (early return when not needing repair or when opted out). Then
compress the subsequent error logging/throwing block (the code that runs when
the check fails) by formatting the error message as a single-line template
string or using a short helper to build the message so you remove extra blank
lines and multi-line concatenation; update references to the same symbols
(host.cdiNvidiaGpuSpecNeedsRepair, host.cdiNvidiaGpuSpecMissing,
optedOutGpuPassthrough) and to the function that emits the error so behavior
remains unchanged.

In `@src/lib/onboard/preflight.ts`:
- Around line 1189-1191: The stale-spec branch must mirror the missing-spec
branch by checking assessment.systemctlAvailable before producing systemd-only
commands: change the repairCommands assignment so when missingSpec is false it
uses buildStaleCdiWarnCommands(flaggedFilePath) only if
assessment.systemctlAvailable is true, otherwise call a new or existing
non-systemd fallback (e.g., buildStaleCdiManualWarnCommands or return a
non-blocking warning action) that emits manual remediation steps instead of
systemctl commands; update references to repairCommands,
buildStaleCdiWarnCommands, buildNvidiaCdiRepairCommands,
assessment.systemctlAvailable, specPath and flaggedFilePath accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f820c13a-4139-4ddd-be6a-1414d00efb93

📥 Commits

Reviewing files that changed from the base of the PR and between 11fcbf7 and 366f996.

📒 Files selected for processing (6)
  • scripts/install.sh
  • src/lib/onboard.ts
  • src/lib/onboard/preflight-cdi.test.ts
  • src/lib/onboard/preflight.test.ts
  • src/lib/onboard/preflight.ts
  • test/install-preflight.test.ts
💤 Files with no reviewable changes (1)
  • src/lib/onboard/preflight.test.ts

Comment thread scripts/install.sh
Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard/preflight.ts Outdated
zyang-dev added 2 commits June 3, 2026 16:19
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@zyang-dev zyang-dev added the v0.0.58 Release target label Jun 3, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard/preflight-cdi.test.ts (1)

51-64: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make these assessHost cases fully hermetic.

These tests don't inject runCaptureImpl, so assessHost() still shells out for package-manager detection. That makes the suite depend on whatever tools are installed on the runner instead of just the fixture data. Add a shared no-op/stub runCaptureImpl for these cases so the unit tests stop touching the host environment. As per coding guidelines, **/*.test.{js,ts}: Mock external dependencies in unit tests; do not call real NVIDIA APIs in unit tests.

Also applies to: 71-86, 222-240

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/preflight-cdi.test.ts` around lines 51 - 64, The tests call
assessHost but don't inject runCaptureImpl, so package-manager detection still
shells out; add a hermetic stub by passing runCaptureImpl that returns a
deterministic capture object (e.g., { stdout: "", stderr: "", code: 0 } or
similar) to each assessHost invocation in this test file; update the assessHost
calls (the ones in this test and the other similar blocks) to include
runCaptureImpl: () => ({ stdout: "", stderr: "", code: 0 }) so the unit tests no
longer touch the real host or NVIDIA tools.
♻️ Duplicate comments (1)
src/lib/onboard/preflight.ts (1)

894-896: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep stale-spec remediation runnable without systemd.

When systemctlAvailable === false, the stale-spec path still emits only systemctl commands and then marks the action blocking. That leaves non-systemd Linux hosts with no executable repair path, even though the missing-spec path already has a fallback. Gate this branch the same way or downgrade it to a non-blocking/manual warning.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/preflight.ts` around lines 894 - 896, The stale-spec branch
currently always generates systemctl-only remediation via
buildStaleCdiWarnCommands and can be blocking on non-systemd hosts; modify the
logic that sets repairCommands so it checks systemctlAvailable the same way the
missingSpec path does (or make the stale-spec path non-blocking/manual when
systemctlAvailable === false). Specifically, when choosing between
buildNvidiaCdiRepairCommands and buildStaleCdiWarnCommands, gate the
buildStaleCdiWarnCommands call on systemctlAvailable (or replace it with a
non-systemd fallback/no-op warning and clear the blocking flag) using the
existing symbols repairCommands, systemctlAvailable, buildStaleCdiWarnCommands,
buildNvidiaCdiRepairCommands, missingSpec, and flaggedFilePath to locate and
implement the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/docker-cdi.ts`:
- Around line 257-272: The shell commands generated by
buildNvidiaCdiRepairCommands interpolate specDir and specPath raw into strings;
run specDir and specPath through the repo's shell-quoting helper before building
each command string so spaces or metacharacters are safely quoted, and do the
same for flaggedFilePath in the other remediation block referenced around lines
300-305—i.e., locate where flaggedFilePath is interpolated and replace direct
interpolation with the quoted value from the shell-quoting helper.

---

Outside diff comments:
In `@src/lib/onboard/preflight-cdi.test.ts`:
- Around line 51-64: The tests call assessHost but don't inject runCaptureImpl,
so package-manager detection still shells out; add a hermetic stub by passing
runCaptureImpl that returns a deterministic capture object (e.g., { stdout: "",
stderr: "", code: 0 } or similar) to each assessHost invocation in this test
file; update the assessHost calls (the ones in this test and the other similar
blocks) to include runCaptureImpl: () => ({ stdout: "", stderr: "", code: 0 })
so the unit tests no longer touch the real host or NVIDIA tools.

---

Duplicate comments:
In `@src/lib/onboard/preflight.ts`:
- Around line 894-896: The stale-spec branch currently always generates
systemctl-only remediation via buildStaleCdiWarnCommands and can be blocking on
non-systemd hosts; modify the logic that sets repairCommands so it checks
systemctlAvailable the same way the missingSpec path does (or make the
stale-spec path non-blocking/manual when systemctlAvailable === false).
Specifically, when choosing between buildNvidiaCdiRepairCommands and
buildStaleCdiWarnCommands, gate the buildStaleCdiWarnCommands call on
systemctlAvailable (or replace it with a non-systemd fallback/no-op warning and
clear the blocking flag) using the existing symbols repairCommands,
systemctlAvailable, buildStaleCdiWarnCommands, buildNvidiaCdiRepairCommands,
missingSpec, and flaggedFilePath to locate and implement the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a42d9078-1396-486d-a757-485dcbccdbbf

📥 Commits

Reviewing files that changed from the base of the PR and between 9f74eef and 34c7b5f.

📒 Files selected for processing (4)
  • src/lib/onboard/docker-cdi.test.ts
  • src/lib/onboard/docker-cdi.ts
  • src/lib/onboard/preflight-cdi.test.ts
  • src/lib/onboard/preflight.ts
✅ Files skipped from review due to trivial changes (1)
  • src/lib/onboard/docker-cdi.test.ts

Comment thread src/lib/onboard/docker-cdi.ts
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard/docker-cdi.ts (1)

337-379: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align user-facing explanation with remediation command (nemoclaw onboard) to avoid confusing OpenShell command references.
src/lib/onboard/docker-cdi.ts remediation tells users to rerun nemoclaw onboard (optionally --no-gpu), but the explanation strings attribute the failure to OpenShell’s gateway start --gpu, which is only an internal detail of the GPU passthrough path. Rewording to reference nemoclaw onboard GPU passthrough keeps the cause while matching the command users are actually told to run.

📝 Proposed wording alignment
-    "OpenShell's `gateway start --gpu` injects devices from the CDI spec, so a stale " +
+    "`nemoclaw onboard` (GPU passthrough) injects devices from the CDI spec, so a stale " +
   reasons.push(
-    "OpenShell's `gateway start --gpu` can fail until the CDI spec is refreshed and verified.",
+    "`nemoclaw onboard` GPU passthrough can fail until the CDI spec is refreshed and verified.",
   );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/docker-cdi.ts` around lines 337 - 379, Replace user-facing
references to "OpenShell's `gateway start --gpu`" with the user-facing
remediation command "nemoclaw onboard" (mentioning the optional --gpu flag where
appropriate) in the strings in this file: update the long explanatory return
string near the top of docker-cdi.ts and the final reasons.push in
explainNvidiaCdiRepairReason so they reference running "nemoclaw onboard" (or
"nemoclaw onboard --no-gpu"/"--gpu" as applicable) instead of the internal
OpenShell command; keep the rest of each message unchanged so the technical
detail and mitigation advice remain intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/lib/onboard/docker-cdi.ts`:
- Around line 337-379: Replace user-facing references to "OpenShell's `gateway
start --gpu`" with the user-facing remediation command "nemoclaw onboard"
(mentioning the optional --gpu flag where appropriate) in the strings in this
file: update the long explanatory return string near the top of docker-cdi.ts
and the final reasons.push in explainNvidiaCdiRepairReason so they reference
running "nemoclaw onboard" (or "nemoclaw onboard --no-gpu"/"--gpu" as
applicable) instead of the internal OpenShell command; keep the rest of each
message unchanged so the technical detail and mitigation advice remain intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 05a9f650-d750-47e3-93a9-56686c4d79b3

📥 Commits

Reviewing files that changed from the base of the PR and between ead83f7 and 529f1de.

📒 Files selected for processing (2)
  • src/lib/onboard/docker-cdi.test.ts
  • src/lib/onboard/docker-cdi.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/docker-cdi.test.ts

@prekshivyas prekshivyas self-assigned this Jun 3, 2026
@cv cv added v0.0.59 Release target and removed v0.0.58 Release target labels Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

v0.0.59 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DGX Spark][Inference] PR #4619 vLLM/NVFP4 path crashes on container start with "CUDA unknown error"

3 participants