fix(inference): prove WSL Docker Desktop GPUs and report sandbox CUDA proof state#4599
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (19)
🚧 Files skipped from review as they are similar to previous changes (18)
📝 WalkthroughWalkthroughAdds structured CUDA-usability proof results, runs bounded Docker/WSL proofs (ARM64), persists proof outcomes in the registry, integrates optional proof injection into GPU detection, captures proofs during onboarding, and renders proof status (verified/failed/unverified) in sandbox status output with tests and e2e update. ChangesSandbox GPU proof state and verification
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/inference/nim.ts (1)
96-100: ⚡ Quick winDon't swallow internal prover initialization failures.
This
catchturns bugs inside../onboard/wsl-docker-desktop-gpuinto a silent CPU fallback, which makes supported N1X hosts look like “no GPU” instead of surfacing the regression. Only the optional module-resolution case should returnnull; other failures should bubble.Based on learnings: "avoid adding defensive error handling (e.g., try/catch wrappers, fallbacks, or extra validation) around internal helper logic when there is no realistic throwing path or failure mode."Suggested change
function defaultArm64WslDockerDesktopGpuProver(): Arm64WslDockerDesktopGpuProver | null { try { return require("../onboard/wsl-docker-desktop-gpu").createArm64WslDockerDesktopGpuProver(); - } catch { - return null; + } catch (error) { + if ( + error && + typeof error === "object" && + "code" in error && + (error as NodeJS.ErrnoException).code === "MODULE_NOT_FOUND" + ) { + return null; + } + throw error; } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/inference/nim.ts` around lines 96 - 100, The function defaultArm64WslDockerDesktopGpuProver currently swallows all errors from requiring or initializing ../onboard/wsl-docker-desktop-gpu; change it so only the optional-module-resolution case returns null and any other initialization errors bubble up: call require("../onboard/wsl-docker-desktop-gpu") and attempt to createArm64WslDockerDesktopGpuProver inside a try/catch that only returns null when the caught error is a module-not-found resolution (e.g., error.code === 'MODULE_NOT_FOUND' or equivalent), otherwise rethrow the error so failures inside createArm64WslDockerDesktopGpuProver are not silenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/inference/nim.ts`:
- Around line 96-100: The function defaultArm64WslDockerDesktopGpuProver
currently swallows all errors from requiring or initializing
../onboard/wsl-docker-desktop-gpu; change it so only the
optional-module-resolution case returns null and any other initialization errors
bubble up: call require("../onboard/wsl-docker-desktop-gpu") and attempt to
createArm64WslDockerDesktopGpuProver inside a try/catch that only returns null
when the caught error is a module-not-found resolution (e.g., error.code ===
'MODULE_NOT_FOUND' or equivalent), otherwise rethrow the error so failures
inside createArm64WslDockerDesktopGpuProver are not silenced.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a97cb9ac-bff2-4a0c-b99c-99c28f76ae2b
📒 Files selected for processing (15)
src/lib/actions/sandbox/status.test.tssrc/lib/actions/sandbox/status.tssrc/lib/inference/gpu-trust.tssrc/lib/inference/nim.test.tssrc/lib/inference/nim.tssrc/lib/onboard.tssrc/lib/onboard/sandbox-gpu-mode.tssrc/lib/onboard/sandbox-gpu-preflight.test.tssrc/lib/onboard/sandbox-gpu-preflight.tssrc/lib/onboard/sandbox-registry-metadata.tssrc/lib/onboard/wsl-docker-desktop-gpu.test.tssrc/lib/onboard/wsl-docker-desktop-gpu.tssrc/lib/runner.tssrc/lib/state/registry.tstest/e2e/test-gpu-e2e.sh
|
@coderabbitai review The incremental commit addresses your nitpick on |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
# Check the current state of wsl-docker-desktop-gpu.ts for MODULE_NOT_FOUND handling
rg -n "MODULE_NOT_FOUND\|defaultArm64Wsl\|require.*runCaptureEx\|catch" src/lib/onboard/wsl-docker-desktop-gpu.ts -A3 -B3
echo "---nim.ts defaultArm64WslDockerDesktopGpuProver---"
rg -n "defaultArm64Wsl\|MODULE_NOT_FOUND" src/lib/inference/nim.ts -A5 -B3Length of output: 646 🏁 Script executed: #!/bin/bash
# Find the relevant files
fd "wsl-docker-desktop-gpu.ts" --type f
fd "nim.ts" --type f | grep -v test | grep -v node_modules
echo "---"
# Try broader search
rg -rn "MODULE_NOT_FOUND" --type ts -C3Length of output: 1760 🏁 Script executed: #!/bin/bash
echo "=== nim.ts defaultArm64WslDockerDesktopGpuProver / lazy require section ==="
cat src/lib/inference/nim.ts | grep -n "MODULE_NOT_FOUND\|require\|catch\|proveArm64\|defaultArm64\|lazy" | head -60
echo ""
echo "=== Full nim.ts catch blocks ==="
rg -n "catch" src/lib/inference/nim.ts -A 8 -B 2
echo ""
echo "=== wsl-docker-desktop-gpu.ts full file ==="
cat src/lib/onboard/wsl-docker-desktop-gpu.tsLength of output: 16786 🏁 Script executed: #!/bin/bash
# Read the exact lines around the defaultArm64WslDockerDesktopGpuProver function to confirm rethrow
awk 'NR>=92 && NR<=125' src/lib/inference/nim.tsLength of output: 1920
function defaultArm64WslDockerDesktopGpuProver(): Arm64WslDockerDesktopGpuProver | null {
try {
return require("../onboard/wsl-docker-desktop-gpu").createArm64WslDockerDesktopGpuProver();
} catch (error) {
// Only the optional module-resolution case should degrade to "no prover";
// a real bug inside the prover module must bubble up rather than masquerade
// as a missing GPU on an otherwise-supported N1X host.
if (
error &&
typeof error === "object" &&
(error as NodeJS.ErrnoException).code === "MODULE_NOT_FOUND"
) {
return null;
}
throw error; // ← all other errors rethrown
}
}The narrowed catch is exactly right:
The previous nitpick is fully addressed. The rest of the PR also looks solid — the [approve] |
|
@yimoj this one's showing |
|
@yimoj Did you see this error when you tested it on Windows-ARM with WSL? |
cbdf60f to
eef7250
Compare
|
@zyang-dev Thank you — that To be upfront: I do not have Windows-ARM + WSL hardware, so I did not reproduce your run live. I diagnosed it from the image itself, which is conclusive here: So on an aarch64 host Docker pulls the "arm64" variant whose binary is x86-64, and execve fails with Fix (pushed):
If you're able to re-run |
|
@coderabbitai review Rebased onto current |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
echo "=== wsl-docker-desktop-gpu.ts: proof command + timeout + isExecFormatError ==="
cat src/lib/onboard/wsl-docker-desktop-gpu.ts
echo ""
echo "=== sandbox-gpu-preflight.ts: cuInit(0)=0 requirement + jetson remediation ==="
cat src/lib/onboard/sandbox-gpu-preflight.ts
echo ""
echo "=== snapshot.ts: sandboxGpuProof cleared on clone ==="
rg -n "sandboxGpuProof" src/lib/actions/sandbox/snapshot.ts -C5
echo ""
echo "=== status.ts: proof suffix rendering ==="
cat src/lib/actions/sandbox/status.ts
echo ""
echo "=== registry.ts: SandboxGpuProofResult + SandboxEntry ==="
rg -n "SandboxGpuProof\|sandboxGpuProof" src/lib/state/registry.ts -C3
echo ""
echo "=== nim.ts: MODULE_NOT_FOUND narrowed catch (confirm still in place) ==="
awk 'NR>=88 && NR<=125' src/lib/inference/nim.tsLength of output: 50273
Snapshot clone
MODULE_NOT_FOUND narrowed catch (from previous review)
One cosmetic nit — not blocking: Line ~188 in Everything is correct end-to-end. Approving. [approve] |
… proof state Two grouped GPU trust/proof/status fixes, rebased onto current main. NVIDIA#4565 — accept real Windows-ARM N1X (WSL2 + Docker Desktop) GPUs without reopening the Snapdragon false positive (NVIDIA#3988/NVIDIA#4424). detectGpu() still rejects a denylisted JMJWOA-Generic-* name by default; the only escape is the ARM64 WSL Docker Desktop prover, which runs one bounded Docker --gpus CUDA workload. The proof image is now the arch-correct cuda-sample:vectoradd-cuda12.5.0 (a genuine aarch64 binary running a real CUDA kernel) instead of cuda-sample:nbody, whose arm64 manifest entry actually ships an x86-64 ELF and therefore fails with "exec format error" on the very N1X target this feature accepts. An explicit exec-format-error diagnostic now distinguishes an image-architecture problem from a missing GPU. A real GPU passes; the Snapdragon nvidia-smi shim (no usable CUDA device) stays fail-closed. NVIDIA#4231 — nemoclaw status reflects CUDA proof, not just config. The direct sandbox GPU verifier returns a SandboxGpuProofResult (verified/unverified/failed) keyed on cuInit(0)=0, persisted to the registry and rendered by status as "(CUDA verified)" / "(CUDA unverified)" / "(last CUDA proof failed: …)". A zero exit that printed a non-zero cuInit code (swallowed exit) is treated as failed, not verified. The proof is captured by the verifyGpuSandboxAfterReady wrapper (net-zero onboard.ts) and cleared on snapshot clone so a restored sandbox cannot inherit another sandbox's "CUDA verified" state. CUDA failures print Jetson /dev/nvmap + video/render group remediation. Fail-closed CPU fallback with explicit --no-gpu guidance is preserved on every proof-failure path. Captured stderr in runCaptureEx so Docker/CUDA diagnostics are no longer dropped. The default ARM64 prover only swallows MODULE_NOT_FOUND and rethrows internal initialization errors. Fixes NVIDIA#4565 Fixes NVIDIA#4231 Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
eef7250 to
a56e438
Compare
|
@coderabbitai review Latest head only updates one stale code comment (the |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
echo "=== Comment around line 188 in wsl-docker-desktop-gpu.ts ==="
awk 'NR>=183 && NR<=200' src/lib/onboard/wsl-docker-desktop-gpu.tsLength of output: 1101 🏁 Script executed: #!/bin/bash
echo "=== Search for vectorAdd / nbody / diagnostic comment in wsl-docker-desktop-gpu.ts ==="
rg -n "vectorAdd\|nbody\|prefer.*diagnostic\|fall back.*stdout" src/lib/onboard/wsl-docker-desktop-gpu.ts -C2Length of output: 545 🏁 Script executed: #!/bin/bash
echo "=== Lines 160-192 in wsl-docker-desktop-gpu.ts ==="
awk 'NR>=160 && NR<=192' src/lib/onboard/wsl-docker-desktop-gpu.tsLength of output: 1747
// prefer it for the diagnostic and fall back to stdout (vectorAdd output).The [approve] |
Summary
Two grouped GPU trust/proof/status fixes.
nemoclawnow accepts real Windows-ARM N1X (WSL2 + Docker Desktop) GPUs when a bounded Docker--gpusCUDA proof succeeds (#4565), andnemoclaw statusreports proven CUDA usability instead of treating any configured GPU as healthy (#4231).Related Issue
Fixes #4565
Fixes #4231
Changes
detectGpu()still rejects a denylistedJMJWOA-Generic-*name by default; the only escape iscreateArm64WslDockerDesktopGpuProver, which runs one boundeddocker run --gpus all …CUDA workload on ARM64 Docker Desktop WSL hosts. The proof image isnvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0(a genuine aarch64 binary running a real CUDA kernel — device alloc + add + result verification). The previouscuda-sample:nbodyimage was wrong for this ARM64-only path: its arm64 manifest entry actually ships an x86-64 ELF, so it fails withexec format erroron the exact N1X hardware this feature targets (reported in-thread). Only a real GPU passes, so N1X is accepted while the Snapdragon nvidia-smi shim (no usable CUDA device) stays fail-closed. The proof timeout is bounded (default 180s,NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MSoverride) and failures keep the CPU fallback with--no-gpuguidance. An explicitexec format errordiagnostic now distinguishes an image-architecture problem from a missing GPU.SandboxGpuProofResult(verified/unverified/failed) keyed on thecuInit(0)=0usability proof instead of silently swallowing optional-proof failures. A zero exit that still printed a non-zerocuInit(0)code (a wrapper that swallowed the real exit) is treated as failed, not verified. The result is persisted to the sandbox registry and rendered bynemoclaw statusas(CUDA verified)/(CUDA unverified)/(last CUDA proof failed: …). CUDA failures print Jetson/dev/nvmap+video/rendergroup remediation. The proof is captured by the existingverifyGpuSandboxAfterReadywrapper (sosrc/lib/onboard.tsis unchanged / net-zero), and cleared on snapshot clone so a restored sandbox cannot inherit another sandbox'sCUDA verifiedstate.--no-gpuguidance preserved on every proof-failure path.runCaptureExso Docker/CUDA failure diagnostics are no longer dropped.MODULE_NOT_FOUNDand rethrows internal initialization errors (earlier CodeRabbit nit).Type of Change
Verification
upstream/main; resolved conflicts instatus.ts/status-snapshot.ts/status.test.ts(upstream extracted the snapshot/report code intostatus-snapshot.ts) and threaded the proof result through the#4509verifyGpuSandboxAfterReadywrapper.wsl-docker-desktop-gpu,nim,sandbox-gpu-preflight,docker-gpu-local-inference,status,registry,snapshot*).npm test(cli project): only pre-existing, environment-only failures remain (test/cli.test.ts,test/ssrf-parity.test.ts,config-sync/nemoclaw-startroot-ownership tests — file-mode/ownership/network checks unrelated to this change; none touch the modified files).codex review --base upstream/mainclean after addressing two P2 findings (stale proof on snapshot clone; requirecuInit(0)=0before verifying).npx prekpre-commit/pre-push hooks pass (format, lint, typecheck).Notes
fileon the extracted binary (thenbodyarm64 tag contains an x86-64 ELF; thevectoradd-cuda12.5.0arm64 tag contains a real aarch64 binary). No live Windows-ARM/WSL GPU hardware was available on the triage host, so the N1X run was not reproduced live — see the in-thread reply for the exact commands and evidence.detectGpuproof gating via injected prover, and the verifier/status proof classification via fixtures, confirming the pre-fix reject ([Windows ARM][Onboard] NemoClaw preflight reports 'no NVIDIA GPU detected' despite nvidia-smi and Docker --gpus confirming 65GB GPU #4565) and misleading "enabled" ([Jetson Orin][CLI&UX] nemoclaw status shows "Sandbox GPU: enabled" but CUDA is unusable inside sandbox — misleading status #4231) before fixing.Signed-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit