
fix(inference): prove WSL Docker Desktop GPUs and report sandbox CUDA proof state#4599

Merged
cv merged 1 commit into NVIDIA:main from yimoj:fix/4565-gpu-proof-trust-status
Jun 3, 2026

Conversation

@yimoj
Contributor

@yimoj yimoj commented Jun 1, 2026

Summary

Two grouped GPU trust/proof/status fixes. nemoclaw now accepts real Windows-ARM N1X (WSL2 + Docker Desktop) GPUs when a bounded Docker --gpus CUDA proof succeeds (#4565), and nemoclaw status reports proven CUDA usability instead of treating any configured GPU as healthy (#4231).

Related Issue

Fixes #4565
Fixes #4231

Changes

  • #4565 ([Windows ARM][Onboard] NemoClaw preflight reports 'no NVIDIA GPU detected' despite nvidia-smi and Docker --gpus confirming 65GB GPU): accept N1X without reopening the Snapdragon false positive (#3988, [WSL2][Onboard] preflight false-positive: Snapdragon iGPU reported as "NVIDIA GPU detected" on Windows ARM; #4424, fix(preflight): gate NVIDIA detection on JMJWOA denylist + ARM64 kernel-interface check).
    • detectGpu() still rejects a denylisted JMJWOA-Generic-* name by default; the only escape is createArm64WslDockerDesktopGpuProver, which runs one bounded docker run --gpus all … CUDA workload on ARM64 Docker Desktop WSL hosts.
    • The proof image is nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 (a genuine aarch64 binary running a real CUDA kernel: device alloc + add + result verification). The previous cuda-sample:nbody image was wrong for this ARM64-only path: its arm64 manifest entry actually ships an x86-64 ELF, so it fails with exec format error on the exact N1X hardware this feature targets (reported in-thread).
    • Only a real GPU passes, so N1X is accepted while the Snapdragon nvidia-smi shim (no usable CUDA device) stays fail-closed.
    • The proof timeout is bounded (default 180s, NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MS override) and failures keep the CPU fallback with --no-gpu guidance.
    • An explicit exec format error diagnostic now distinguishes an image-architecture problem from a missing GPU.
  • #4231 ([Jetson Orin][CLI&UX] nemoclaw status shows "Sandbox GPU: enabled" but CUDA is unusable inside sandbox - misleading status): status reflects CUDA proof, not just config.
    • The direct sandbox GPU verifier returns a SandboxGpuProofResult (verified/unverified/failed) keyed on the cuInit(0)=0 usability proof instead of silently swallowing optional-proof failures.
    • A zero exit that still printed a non-zero cuInit(0) code (a wrapper that swallowed the real exit) is treated as failed, not verified.
    • The result is persisted to the sandbox registry and rendered by nemoclaw status as (CUDA verified) / (CUDA unverified) / (last CUDA proof failed: …).
    • CUDA failures print Jetson /dev/nvmap + video/render group remediation.
    • The proof is captured by the existing verifyGpuSandboxAfterReady wrapper (so src/lib/onboard.ts is unchanged / net-zero), and cleared on snapshot clone so a restored sandbox cannot inherit another sandbox's CUDA verified state.
  • Fail-closed CPU fallback and explicit --no-gpu guidance preserved on every proof-failure path.
  • Captured stderr in runCaptureEx so Docker/CUDA failure diagnostics are no longer dropped.
  • The default ARM64 prover only swallows MODULE_NOT_FOUND and rethrows internal initialization errors (earlier CodeRabbit nit).
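
The fail-closed gating described in the #4565 bullet can be sketched roughly as follows. This is a minimal illustration, not the code in nim.ts: the result shape, the `decideGpuTrust` helper, and the exact denylist regex are assumptions made for the example; only the JMJWOA-Generic-* prefix, the optional prover, and the wslDockerDesktopGpuProofPassed flag come from the description.

```typescript
// Hypothetical shapes for illustration; the real types live in gpu-trust.ts and may differ.
type DockerGpuProofResult = { passed: boolean; diagnostic?: string };
type Arm64WslDockerDesktopGpuProver = () => DockerGpuProofResult;

interface GpuDecision {
  accepted: boolean;
  wslDockerDesktopGpuProofPassed: boolean;
  reason: string;
}

// Assumed pattern, derived from the "JMJWOA-Generic-*" name in the PR description.
const DENYLISTED_NAME = /^JMJWOA-Generic-/;

// Hypothetical helper: a denylisted name is only accepted after a passing proof.
function decideGpuTrust(
  gpuName: string,
  prover: Arm64WslDockerDesktopGpuProver | null,
): GpuDecision {
  if (!DENYLISTED_NAME.test(gpuName)) {
    return { accepted: true, wslDockerDesktopGpuProofPassed: false, reason: "name not denylisted" };
  }
  if (prover === null) {
    // No prover available: stay fail-closed and keep the CPU fallback.
    return { accepted: false, wslDockerDesktopGpuProofPassed: false, reason: "no prover available" };
  }
  const proof = prover();
  if (!proof.passed) {
    const detail = proof.diagnostic ?? "no diagnostic";
    return { accepted: false, wslDockerDesktopGpuProofPassed: false, reason: `proof failed: ${detail}` };
  }
  return { accepted: true, wslDockerDesktopGpuProofPassed: true, reason: "bounded CUDA proof passed" };
}
```

Passing the prover in as a parameter mirrors the dependency-injection style the walkthrough attributes to detectGpu(deps), which is what lets the tests cover the accept, fail-closed, and null-prover cases without Docker.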

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • Rebased onto current upstream/main; resolved conflicts in status.ts/status-snapshot.ts/status.test.ts (upstream extracted the snapshot/report code into status-snapshot.ts) and threaded the proof result through the #4509 verifyGpuSandboxAfterReady wrapper.
  • Targeted GPU/status/registry/snapshot suites green (wsl-docker-desktop-gpu, nim, sandbox-gpu-preflight, docker-gpu-local-inference, status, registry, snapshot*).
  • npm test (cli project): only pre-existing, environment-only failures remain (test/cli.test.ts, test/ssrf-parity.test.ts, config-sync/nemoclaw-start root-ownership tests — file-mode/ownership/network checks unrelated to this change; none touch the modified files).
  • codex review --base upstream/main clean after addressing two P2 findings (stale proof on snapshot clone; require cuInit(0)=0 before verifying).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • npx prek pre-commit/pre-push hooks pass (format, lint, typecheck).

Notes


Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

Summary by CodeRabbit

  • New Features
    • Persistent per-sandbox CUDA proof tracking and reporting (verified / unverified / failed) with human-readable status lines and platform-specific remediation guidance.
    • ARM64 WSL Docker Desktop GPU verification path with configurable timeout and clearer diagnostics.
  • Bug Fixes
    • Snapshot restore no longer inherits a source sandbox’s GPU proof status.
  • Tests
    • Updated unit and E2E GPU tests to validate CUDA usability states instead of a generic GPU-enabled marker.

@coderabbitai
Contributor

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f9f87825-a65f-4a85-983f-75aed6836e6f

📥 Commits

Reviewing files that changed from the base of the PR and between eef7250 and a56e438.

📒 Files selected for processing (19)
  • src/lib/actions/sandbox/snapshot.ts
  • src/lib/actions/sandbox/status-snapshot.ts
  • src/lib/actions/sandbox/status.test.ts
  • src/lib/actions/sandbox/status.ts
  • src/lib/inference/gpu-trust.ts
  • src/lib/inference/nim.test.ts
  • src/lib/inference/nim.ts
  • src/lib/onboard/docker-gpu-local-inference.test.ts
  • src/lib/onboard/docker-gpu-local-inference.ts
  • src/lib/onboard/docker-gpu-sandbox-create.ts
  • src/lib/onboard/sandbox-gpu-mode.ts
  • src/lib/onboard/sandbox-gpu-preflight.test.ts
  • src/lib/onboard/sandbox-gpu-preflight.ts
  • src/lib/onboard/sandbox-registry-metadata.ts
  • src/lib/onboard/wsl-docker-desktop-gpu.test.ts
  • src/lib/onboard/wsl-docker-desktop-gpu.ts
  • src/lib/runner.ts
  • src/lib/state/registry.ts
  • test/e2e/test-gpu-e2e.sh
🚧 Files skipped from review as they are similar to previous changes (18)
  • src/lib/actions/sandbox/snapshot.ts
  • src/lib/actions/sandbox/status.test.ts
  • src/lib/inference/gpu-trust.ts
  • test/e2e/test-gpu-e2e.sh
  • src/lib/runner.ts
  • src/lib/actions/sandbox/status-snapshot.ts
  • src/lib/onboard/sandbox-gpu-mode.ts
  • src/lib/onboard/sandbox-registry-metadata.ts
  • src/lib/onboard/docker-gpu-sandbox-create.ts
  • src/lib/onboard/docker-gpu-local-inference.test.ts
  • src/lib/state/registry.ts
  • src/lib/inference/nim.ts
  • src/lib/onboard/wsl-docker-desktop-gpu.ts
  • src/lib/onboard/docker-gpu-local-inference.ts
  • src/lib/onboard/wsl-docker-desktop-gpu.test.ts
  • src/lib/inference/nim.test.ts
  • src/lib/onboard/sandbox-gpu-preflight.ts
  • src/lib/onboard/sandbox-gpu-preflight.test.ts

📝 Walkthrough

Walkthrough

Adds structured CUDA-usability proof results, runs bounded Docker/WSL proofs (ARM64), persists proof outcomes in the registry, integrates optional proof injection into GPU detection, captures proofs during onboarding, and renders proof status (verified/failed/unverified) in sandbox status output with tests and e2e update.

Changes

Sandbox GPU proof state and verification

| Layer | File(s) | Summary |
| --- | --- | --- |
| GPU proof types and registry persistence | src/lib/state/registry.ts | SandboxGpuProofStatus and SandboxGpuProofResult added; SandboxEntry extended with sandboxGpuProof; registerSandbox persists proof as entry.sandboxGpuProof ?? null. |
| Direct sandbox GPU verifier refactoring | src/lib/onboard/sandbox-gpu-preflight.ts, src/lib/onboard/sandbox-gpu-preflight.test.ts | createDirectSandboxGpuVerifier now returns SandboxGpuProofResult, classifies CUDA usability via cuInit(0)= marker, accumulates optional failures, throws on required failures, and logs platform-specific remediation (Jetson vs WSL). Tests cover verified/failed/unverified and optional handling. |
| Runner stderr capture extension | src/lib/runner.ts | CaptureResult now includes optional stderr; runCaptureEx returns trimmed stderr alongside stdout for diagnostic use in proof runners. |
| Docker GPU proof result contract | src/lib/inference/gpu-trust.ts | Adds DockerGpuProofResult and Arm64WslDockerDesktopGpuProver types describing bounded Docker --gpus proof outcomes and prover contract. |
| ARM64 WSL Docker Desktop prover and tests | src/lib/onboard/wsl-docker-desktop-gpu.ts, src/lib/onboard/wsl-docker-desktop-gpu.test.ts | Introduces WSL_DOCKER_DESKTOP_GPU_PROOF_COMMAND update, timeout helper wslDockerDesktopGpuProofTimeoutMs, isExecFormatErrorDiagnostic, an ARM64 prover factory mapping runner output to DockerGpuProofResult, and tests validating platform gating, argv/image selection, timeout, and exec-format diagnostics. |
| GPU detection integration with proof injection | src/lib/inference/nim.ts, src/lib/inference/nim.test.ts | detectGpu(deps) accepts DetectGpuDeps with optional ARM64 WSL prover; on denylisted placeholder rows runs bounded proof, rejects on missing/failed proof, or accepts and sets wslDockerDesktopGpuProofPassed when proof passes. Tests for accept-path, fail-closed, and null-prover cases added. |
| Onboarding sandbox GPU configuration | src/lib/onboard/sandbox-gpu-mode.ts | Adds optional `sandboxGpuProof?: SandboxGpuProofResult |
| Local inference verification and proof persistence | src/lib/onboard/docker-gpu-local-inference.ts, src/lib/onboard/docker-gpu-local-inference.test.ts | verifyGpuSandboxAfterReady captures SandboxGpuProofResult from verification hooks and assigns to config.sandboxGpuProof; option types updated and test added to verify persistence. |
| Docker GPU sandbox creation proof wiring | src/lib/onboard/docker-gpu-sandbox-create.ts | verifyGpuOrExit now returns SandboxGpuProofResult and forwards verifyDirectSandboxGpu result through the create-sandbox flow. |
| Registry metadata proof persistence | src/lib/onboard/sandbox-registry-metadata.ts | getSandboxRuntimeRegistryFields includes sandboxGpuProof in returned fields and conditionally adds it to metadata only when config.sandboxGpuProof is present. |
| Sandbox status reporting and tests | src/lib/actions/sandbox/status-snapshot.ts, src/lib/actions/sandbox/status.ts, src/lib/actions/sandbox/status.test.ts | Adds sandboxGpuProof to SandboxStatusReport, populates it from registry, adds sandboxGpuProofUnverified() and sandboxGpuProofStatusSuffix() helpers, appends proof suffix to "Sandbox GPU" line, and prints guidance on failed/unverified. Tests verify suffix rendering and legacy null behavior. |
| Snapshot restore sandbox isolation | src/lib/actions/sandbox/snapshot.ts | Auto-created destination sandbox for snapshot restore explicitly clears sandboxGpuProof to null to avoid inheriting GPU proof when no GPU flags are passed. |
| End-to-end GPU status validation | test/e2e/test-gpu-e2e.sh | E2E script updated to assert CUDA proof state: pass on "CUDA verified", fail on "CUDA unverified" or "last CUDA proof failed", skip if absent. |
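
As a rough illustration of the status rendering summarized above, the suffix helper might look something like the sketch below. The field names (`status`, `detail`), the `proofStatusSuffix` name, and the empty suffix for legacy entries are assumptions for the example; only the three rendered strings are taken from the PR description.

```typescript
type SandboxGpuProofStatus = "verified" | "unverified" | "failed";

// Hypothetical result shape; the real SandboxGpuProofResult may carry other fields.
interface SandboxGpuProofResult {
  status: SandboxGpuProofStatus;
  detail?: string;
}

// Returns the text appended to the "Sandbox GPU" status line.
// Legacy registry entries without a proof (null/undefined) get no suffix (assumed).
function proofStatusSuffix(proof: SandboxGpuProofResult | null | undefined): string {
  if (!proof) {
    return "";
  }
  switch (proof.status) {
    case "verified":
      return " (CUDA verified)";
    case "unverified":
      return " (CUDA unverified)";
    case "failed":
      return ` (last CUDA proof failed: ${proof.detail ?? "unknown error"})`;
  }
}

// Example: build the status line for a sandbox whose last proof failed.
const line = "Sandbox GPU: enabled" + proofStatusSuffix({ status: "failed", detail: "cuInit(0)=999" });
console.log(line);
```

Keeping the suffix a pure function of the persisted proof makes the three states easy to assert in unit tests and in the E2E script's string checks.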

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4407: Related changes to Docker GPU onboarding verification flow and proof-result handling.
  • NVIDIA/NemoClaw#4600: Related showSandboxStatus() rendering changes affecting sandbox status output.
  • NVIDIA/NemoClaw#4609: Related modifications to verifyGpuSandboxAfterReady onboarding gating.

Suggested labels

Docker, Sandbox, NemoClaw CLI, Integration: OpenClaw, v0.0.57

Suggested reviewers

  • prekshivyas

Poem

🐰 I hopped through proofs in midnight code,
cuInit traces on a winding road.
ARM64 provers bounded in time,
status shows truth in succinct rhyme—
CUDA verified, cocoa and code.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.83% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title clearly summarizes the main changes: implementing Docker Desktop GPU proof for Windows-ARM and reporting sandbox CUDA proof state in the status output. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/lib/inference/nim.ts (1)

96-100: ⚡ Quick win

Don't swallow internal prover initialization failures.

This catch turns bugs inside ../onboard/wsl-docker-desktop-gpu into a silent CPU fallback, which makes supported N1X hosts look like “no GPU” instead of surfacing the regression. Only the optional module-resolution case should return null; other failures should bubble.

Suggested change
 function defaultArm64WslDockerDesktopGpuProver(): Arm64WslDockerDesktopGpuProver | null {
   try {
     return require("../onboard/wsl-docker-desktop-gpu").createArm64WslDockerDesktopGpuProver();
-  } catch {
-    return null;
+  } catch (error) {
+    if (
+      error &&
+      typeof error === "object" &&
+      "code" in error &&
+      (error as NodeJS.ErrnoException).code === "MODULE_NOT_FOUND"
+    ) {
+      return null;
+    }
+    throw error;
   }
 }
Based on learnings: "avoid adding defensive error handling (e.g., try/catch wrappers, fallbacks, or extra validation) around internal helper logic when there is no realistic throwing path or failure mode."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/inference/nim.ts` around lines 96 - 100, The function
defaultArm64WslDockerDesktopGpuProver currently swallows all errors from
requiring or initializing ../onboard/wsl-docker-desktop-gpu; change it so only
the optional-module-resolution case returns null and any other initialization
errors bubble up: call require("../onboard/wsl-docker-desktop-gpu") and attempt
to createArm64WslDockerDesktopGpuProver inside a try/catch that only returns
null when the caught error is a module-not-found resolution (e.g., error.code
=== 'MODULE_NOT_FOUND' or equivalent), otherwise rethrow the error so failures
inside createArm64WslDockerDesktopGpuProver are not silenced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a97cb9ac-bff2-4a0c-b99c-99c28f76ae2b

📥 Commits

Reviewing files that changed from the base of the PR and between df7d054 and 559211e.

📒 Files selected for processing (15)
  • src/lib/actions/sandbox/status.test.ts
  • src/lib/actions/sandbox/status.ts
  • src/lib/inference/gpu-trust.ts
  • src/lib/inference/nim.test.ts
  • src/lib/inference/nim.ts
  • src/lib/onboard.ts
  • src/lib/onboard/sandbox-gpu-mode.ts
  • src/lib/onboard/sandbox-gpu-preflight.test.ts
  • src/lib/onboard/sandbox-gpu-preflight.ts
  • src/lib/onboard/sandbox-registry-metadata.ts
  • src/lib/onboard/wsl-docker-desktop-gpu.test.ts
  • src/lib/onboard/wsl-docker-desktop-gpu.ts
  • src/lib/runner.ts
  • src/lib/state/registry.ts
  • test/e2e/test-gpu-e2e.sh

@yimoj
Contributor Author

yimoj commented Jun 1, 2026

@coderabbitai review

The incremental commit addresses your nitpick on defaultArm64WslDockerDesktopGpuProver (now only swallows MODULE_NOT_FOUND and rethrows other errors). Please re-review to confirm.

@coderabbitai
Contributor

coderabbitai Bot commented Jun 1, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current state of wsl-docker-desktop-gpu.ts for MODULE_NOT_FOUND handling
rg -n "MODULE_NOT_FOUND\|defaultArm64Wsl\|require.*runCaptureEx\|catch" src/lib/onboard/wsl-docker-desktop-gpu.ts -A3 -B3

echo "---nim.ts defaultArm64WslDockerDesktopGpuProver---"
rg -n "defaultArm64Wsl\|MODULE_NOT_FOUND" src/lib/inference/nim.ts -A5 -B3

Length of output: 646


🏁 Script executed:

#!/bin/bash
# Find the relevant files
fd "wsl-docker-desktop-gpu.ts" --type f
fd "nim.ts" --type f | grep -v test | grep -v node_modules

echo "---"
# Try broader search
rg -rn "MODULE_NOT_FOUND" --type ts -C3

Length of output: 1760


🏁 Script executed:

#!/bin/bash
echo "=== nim.ts defaultArm64WslDockerDesktopGpuProver / lazy require section ==="
cat src/lib/inference/nim.ts | grep -n "MODULE_NOT_FOUND\|require\|catch\|proveArm64\|defaultArm64\|lazy" | head -60

echo ""
echo "=== Full nim.ts catch blocks ==="
rg -n "catch" src/lib/inference/nim.ts -A 8 -B 2

echo ""
echo "=== wsl-docker-desktop-gpu.ts full file ==="
cat src/lib/onboard/wsl-docker-desktop-gpu.ts

Length of output: 16786


🏁 Script executed:

#!/bin/bash
# Read the exact lines around the defaultArm64WslDockerDesktopGpuProver function to confirm rethrow
awk 'NR>=92 && NR<=125' src/lib/inference/nim.ts

Length of output: 1920



@yimoj The fix is confirmed correct. Here's what I verified in src/lib/inference/nim.ts (Lines 96–113):

function defaultArm64WslDockerDesktopGpuProver(): Arm64WslDockerDesktopGpuProver | null {
  try {
    return require("../onboard/wsl-docker-desktop-gpu").createArm64WslDockerDesktopGpuProver();
  } catch (error) {
    // Only the optional module-resolution case should degrade to "no prover";
    // a real bug inside the prover module must bubble up rather than masquerade
    // as a missing GPU on an otherwise-supported N1X host.
    if (
      error &&
      typeof error === "object" &&
      (error as NodeJS.ErrnoException).code === "MODULE_NOT_FOUND"
    ) {
      return null;
    }
    throw error;  // ← all other errors rethrown
  }
}

The narrowed catch is exactly right:

  • MODULE_NOT_FOUND → return null (graceful fallback when the onboard module isn't present in a slim bundle)
  • Any other error → throw error (genuine prover failures bubble up and are no longer silently converted to "no GPU detected" on a real N1X host)

The previous nitpick is fully addressed. The rest of the PR also looks solid — the wsl-docker-desktop-gpu.ts prover correctly splits the proof command into argv (no shell), captures stderr for diagnostics, and applies the configurable timeout. The sandbox CUDA proof result flow (verified/unverified/failed → registry → nemoclaw status) is coherent end-to-end.
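
For readers unfamiliar with the "argv, no shell" point, a minimal sketch of the idea is below. The `splitProofCommand` helper is hypothetical, and the command string is the final proof command quoted later in this thread; the actual prover in wsl-docker-desktop-gpu.ts may split and run the command differently.

```typescript
// Proof command as quoted in this thread; it contains no quoting or shell metacharacters.
const PROOF_COMMAND =
  "docker run --rm --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0";

interface ProofArgv {
  file: string;
  args: string[];
}

// Hypothetical helper: plain whitespace split, so it is only safe for commands
// that need no quoting. Nothing is ever interpreted by a shell.
function splitProofCommand(command: string): ProofArgv {
  const parts = command.trim().split(/\s+/).filter((part) => part.length > 0);
  const [file, ...args] = parts;
  if (file === undefined) {
    throw new Error("proof command is empty");
  }
  return { file, args };
}

const argv = splitProofCommand(PROOF_COMMAND);
console.log(argv.file, argv.args);
```

The resulting file/args pair could then be handed to something like Node's child_process.execFile together with its timeout option, which runs the binary directly rather than through a shell.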

[approve]

@yimoj yimoj added v0.0.60 Release target v0.0.56 Release target and removed v0.0.60 Release target labels Jun 1, 2026
@cv cv added v0.0.57 Release target and removed v0.0.56 Release target labels Jun 1, 2026
@prekshivyas
Contributor

@yimoj this one's showing CONFLICTING against main — could you re-sync (merge/rebase main) so it can go through review? CI is otherwise green and there are no open threads. Thanks!

@zyang-dev
Contributor

@yimoj
I ran this command and got the error "exec /cuda-samples/sample: exec format error".

export const WSL_DOCKER_DESKTOP_GPU_PROOF_COMMAND =
  "docker run --rm --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark";

Did you see this error when you tested it on Windows-ARM with WSL?

@zyang-dev zyang-dev self-requested a review June 2, 2026 20:57
@cv cv added v0.0.58 Release target and removed v0.0.57 Release target labels Jun 3, 2026
@wscurran wscurran added area: inference Inference routing, serving, model selection, or outputs bug-fix PR fixes a bug or regression feature PR adds or expands user-visible functionality labels Jun 3, 2026
@yimoj yimoj force-pushed the fix/4565-gpu-proof-trust-status branch from cbdf60f to eef7250 Compare June 3, 2026 03:25
@yimoj
Contributor Author

yimoj commented Jun 3, 2026

@zyang-dev Thank you — that exec /cuda-samples/sample: exec format error is a real bug, not just a docs question, and it would have defeated the whole point of this PR on the exact N1X hardware #4565 is about. I've fixed it.

To be upfront: I do not have Windows-ARM + WSL hardware, so I did not reproduce your run live. I diagnosed it from the image itself, which is conclusive here:

# The nbody image's manifest list does advertise an arm64 entry...
docker manifest inspect nvcr.io/nvidia/k8s/cuda-sample:nbody
#   -> manifests: [ {amd64}, {arm64} ]

# ...but the binary inside that arm64 entry is actually x86-64:
docker pull --platform linux/arm64 nvcr.io/nvidia/k8s/cuda-sample@sha256:3652af0e06f16b41eaae6f1b3134b3dd07144156549923ca8a0bccb432bf0a1b
file /cuda-samples/nbody
#   -> ELF 64-bit LSB pie executable, x86-64, ... (/cuda-samples/sample -> /cuda-samples/nbody)

So on an aarch64 host Docker pulls the "arm64" variant whose binary is x86-64, and execve fails with exec format error. Since this proof path only ever runs on ARM64, that image could never pass — exactly what you hit.

Fix (pushed):

  • Switched the proof to nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0, whose arm64 variant ships a genuine ARM aarch64 binary (verified the same way: file reports ELF 64-bit LSB pie executable, ARM aarch64). vectorAdd runs a real CUDA kernel (device alloc + add + result verification), so it's still a strong usability proof that stays fail-closed on the Snapdragon nvidia-smi shim.
  • Added explicit exec format error detection so any future image/arch mismatch is reported as a proof-image problem rather than silently reading as "no GPU".
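
The exec-format detection in the second bullet could be sketched as below. The regex matches what the later review describes (/exec format error/i); the `describeProofFailure` helper and its message wording are illustrative assumptions, not the strings the CLI actually prints.

```typescript
// Matches the kernel/runtime message seen in-thread:
// "exec /cuda-samples/sample: exec format error"
function isExecFormatErrorDiagnostic(diagnostic: string): boolean {
  return /exec format error/i.test(diagnostic);
}

// Hypothetical classifier: message text is made up for this example.
function describeProofFailure(diagnostic: string): string {
  if (isExecFormatErrorDiagnostic(diagnostic)) {
    // The image binary does not match the host architecture; the GPU may be fine.
    return "proof-image architecture problem, not a missing GPU";
  }
  // Any other failure stays fail-closed: the GPU is treated as unproven.
  return "GPU unproven; keeping CPU fallback";
}

console.log(describeProofFailure("exec /cuda-samples/sample: exec format error"));
```

Separating the two cases matters because both produce a non-zero exit from docker run, but only one of them says anything about the GPU.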

If you're able to re-run docker run --rm --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 on your N1X WSL setup, I'd be grateful for confirmation that it now reports a passing kernel — that's the one piece I can't verify without the hardware.

@yimoj
Contributor Author

yimoj commented Jun 3, 2026

@coderabbitai review

Rebased onto current main (resolves the prior CONFLICTING state) and addressed the Windows-ARM exec format error reported in-thread: the proof image is now the arch-correct cuda-sample:vectoradd-cuda12.5.0 (the old nbody arm64 tag shipped an x86-64 binary). Also hardened the #4231 path: cuInit(0)=0 is now required before reporting CUDA verified, and the proof is cleared on snapshot clone. Please re-review the latest head.

@coderabbitai
Contributor

coderabbitai Bot commented Jun 3, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== wsl-docker-desktop-gpu.ts: proof command + timeout + isExecFormatError ==="
cat src/lib/onboard/wsl-docker-desktop-gpu.ts

echo ""
echo "=== sandbox-gpu-preflight.ts: cuInit(0)=0 requirement + jetson remediation ==="
cat src/lib/onboard/sandbox-gpu-preflight.ts

echo ""
echo "=== snapshot.ts: sandboxGpuProof cleared on clone ==="
rg -n "sandboxGpuProof" src/lib/actions/sandbox/snapshot.ts -C5

echo ""
echo "=== status.ts: proof suffix rendering ==="
cat src/lib/actions/sandbox/status.ts

echo ""
echo "=== registry.ts: SandboxGpuProofResult + SandboxEntry ==="
rg -n "SandboxGpuProof\|sandboxGpuProof" src/lib/state/registry.ts -C3

echo ""
echo "=== nim.ts: MODULE_NOT_FOUND narrowed catch (confirm still in place) ==="
awk 'NR>=88 && NR<=125' src/lib/inference/nim.ts

Length of output: 50273



@yimoj All three change groups verified against the latest head.

#4565: vectoradd-cuda12.5.0 proof image + exec-format-error hardening

  • WSL_DOCKER_DESKTOP_GPU_PROOF_COMMAND now uses the arch-correct vectoradd-cuda12.5.0 image, with inline documentation explaining exactly why nbody failed (x86-64 ELF inside the arm64 manifest entry). ✅
  • wslDockerDesktopGpuProofTimeoutMs validates the env override with Number.isFinite(raw) && raw > 0. ✅
  • isExecFormatErrorDiagnostic is correctly gated by /exec format error/i and the createArm64WslDockerDesktopGpuProver closure logs a distinct "proof-image issue, not a missing GPU" message for that branch — it does not fall through to "treating GPU as unproven", so the failure mode is actionable rather than silent. ✅
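
A minimal sketch of the timeout override validation mentioned above, assuming the environment is passed in as a plain record so the function stays pure. The 180s default and the NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MS name come from the PR description; the `proofTimeoutMs` name and signature are hypothetical and need not match wslDockerDesktopGpuProofTimeoutMs.

```typescript
// Default from the PR description: 180 seconds.
const DEFAULT_PROOF_TIMEOUT_MS = 180 * 1000;

// Hypothetical signature: takes the environment as an argument instead of reading process.env.
function proofTimeoutMs(env: Record<string, string | undefined>): number {
  // Number(undefined) and Number("abc") are NaN; Number("") is 0.
  // All three fail the check below and fall back to the default.
  const raw = Number(env.NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MS);
  return Number.isFinite(raw) && raw > 0 ? raw : DEFAULT_PROOF_TIMEOUT_MS;
}

console.log(proofTimeoutMs({ NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MS: "60000" }));
```

Rejecting zero, negative, and non-numeric values keeps a typo in the override from turning the bounded proof into an unbounded or instantly failing one.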

#4231: cuInit(0)=0 required for CUDA verified

  • CUDA_INIT_RESULT_PATTERN captures the return code; only cudaInitMatch?.[1] === "0" promotes to cudaVerified = true. ✅
  • Symmetry on the success path is present: exit 0 + cuInit ran + non-zero return code → cudaFailure (not silently verified), which correctly handles the "wrapper swallows exit code but prints cuInit(0)=<err>" scenario. ✅
  • Jetson remediation path wired via resolvedPlatform === "jetson" branch in the failed case. ✅
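
The classification rules in this list can be sketched as follows. The regex shape and the `classifyCudaProof` helper are assumptions for illustration (the real CUDA_INIT_RESULT_PATTERN may differ), and treating a missing marker as unverified is a guess at the intended behaviour rather than something the thread states.

```typescript
type SandboxGpuProofStatus = "verified" | "unverified" | "failed";

// Assumed marker shape: the probe prints "cuInit(0)=<return code>".
const CUDA_INIT_RESULT_PATTERN = /cuInit\(0\)=(-?\d+)/;

// Hypothetical classifier over the probe's exit code and captured output.
function classifyCudaProof(exitCode: number, output: string): SandboxGpuProofStatus {
  const match = CUDA_INIT_RESULT_PATTERN.exec(output);
  if (match === null) {
    // cuInit never reported a result, so nothing was proven either way (assumption).
    return "unverified";
  }
  if (match[1] === "0" && exitCode === 0) {
    return "verified";
  }
  // Covers the swallowed-exit case: exit 0 but a non-zero cuInit return code.
  return "failed";
}

console.log(classifyCudaProof(0, "cuInit(0)=999"));
```

The key property is the asymmetry: a zero exit code alone is never enough for verified, which is what closes the "wrapper swallowed the real exit" hole.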

Snapshot clone

  • sandboxGpuProof: null explicitly set for snapshot restore --to <dst> auto-create with a comment explaining why inheritance is wrong. ✅
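
To illustrate why the explicit null matters, here is a small sketch assuming the destination entry is derived by copying the source entry. The trimmed-down `SandboxEntry` shape and the `entryForRestoredSandbox` helper are hypothetical; snapshot.ts may build the entry differently.

```typescript
type SandboxGpuProofStatus = "verified" | "unverified" | "failed";

// Hypothetical, trimmed-down shapes; the real SandboxEntry has more fields.
interface SandboxGpuProofResult {
  status: SandboxGpuProofStatus;
}
interface SandboxEntry {
  name: string;
  gpuEnabled: boolean;
  sandboxGpuProof?: SandboxGpuProofResult | null;
}

// Copy the source entry for the auto-created destination, but drop the proof:
// it describes the source sandbox's runtime, not the restored one.
function entryForRestoredSandbox(source: SandboxEntry, destName: string): SandboxEntry {
  return { ...source, name: destName, sandboxGpuProof: null };
}

const source: SandboxEntry = {
  name: "src-sandbox",
  gpuEnabled: true,
  sandboxGpuProof: { status: "verified" },
};
const restored = entryForRestoredSandbox(source, "dst-sandbox");
console.log(restored.sandboxGpuProof);
```

Without the override, a spread-style copy would carry the source's verified state into a sandbox where CUDA was never actually exercised.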

MODULE_NOT_FOUND narrowed catch (from previous review)

  • Still in place in nim.ts Lines 96–113. ✅

One cosmetic nit — not blocking: Line ~188 in wsl-docker-desktop-gpu.ts has a stale comment // prefer it for the diagnostic and fall back to stdout (nbody banner). The "nbody banner" phrase is a leftover from the old proof command and could be updated to reference vectorAdd output, but it has no functional impact.

Everything is correct end-to-end. Approving.

[approve]

… proof state

Two grouped GPU trust/proof/status fixes, rebased onto current main.

NVIDIA#4565 — accept real Windows-ARM N1X (WSL2 + Docker Desktop) GPUs without
reopening the Snapdragon false positive (NVIDIA#3988/NVIDIA#4424). detectGpu() still
rejects a denylisted JMJWOA-Generic-* name by default; the only escape is the
ARM64 WSL Docker Desktop prover, which runs one bounded Docker --gpus CUDA
workload. The proof image is now the arch-correct cuda-sample:vectoradd-cuda12.5.0
(a genuine aarch64 binary running a real CUDA kernel) instead of cuda-sample:nbody,
whose arm64 manifest entry actually ships an x86-64 ELF and therefore fails with
"exec format error" on the very N1X target this feature accepts. An explicit
exec-format-error diagnostic now distinguishes an image-architecture problem
from a missing GPU. A real GPU passes; the Snapdragon nvidia-smi shim (no usable
CUDA device) stays fail-closed.

NVIDIA#4231 — nemoclaw status reflects CUDA proof, not just config. The direct
sandbox GPU verifier returns a SandboxGpuProofResult (verified/unverified/failed)
keyed on cuInit(0)=0, persisted to the registry and rendered by status as
"(CUDA verified)" / "(CUDA unverified)" / "(last CUDA proof failed: …)". A zero
exit that printed a non-zero cuInit code (swallowed exit) is treated as failed,
not verified. The proof is captured by the verifyGpuSandboxAfterReady wrapper
(net-zero onboard.ts) and cleared on snapshot clone so a restored sandbox cannot
inherit another sandbox's "CUDA verified" state. CUDA failures print Jetson
/dev/nvmap + video/render group remediation.

Fail-closed CPU fallback with explicit --no-gpu guidance is preserved on every
proof-failure path. Captured stderr in runCaptureEx so Docker/CUDA diagnostics
are no longer dropped. The default ARM64 prover only swallows MODULE_NOT_FOUND
and rethrows internal initialization errors.

Fixes NVIDIA#4565
Fixes NVIDIA#4231

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj force-pushed the fix/4565-gpu-proof-trust-status branch from eef7250 to a56e438 Compare June 3, 2026 04:03
@yimoj
Contributor Author

yimoj commented Jun 3, 2026

@coderabbitai review

Latest head only updates one stale code comment (the nbody banner → vectorAdd output nit you flagged); no functional change from the approved revision.

@coderabbitai
Contributor

coderabbitai Bot commented Jun 3, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== Comment around line 188 in wsl-docker-desktop-gpu.ts ==="
awk 'NR>=183 && NR<=200' src/lib/onboard/wsl-docker-desktop-gpu.ts

Length of output: 1101


🏁 Script executed:

#!/bin/bash
echo "=== Search for vectorAdd / nbody / diagnostic comment in wsl-docker-desktop-gpu.ts ==="
rg -n "vectorAdd\|nbody\|prefer.*diagnostic\|fall back.*stdout" src/lib/onboard/wsl-docker-desktop-gpu.ts -C2

Length of output: 545


🏁 Script executed:

#!/bin/bash
echo "=== Lines 160-192 in wsl-docker-desktop-gpu.ts ==="
awk 'NR>=160 && NR<=192' src/lib/onboard/wsl-docker-desktop-gpu.ts

Length of output: 1747



@yimoj Confirmed — Line 176 in wsl-docker-desktop-gpu.ts now reads:

// prefer it for the diagnostic and fall back to stdout (vectorAdd output).

The nbody banner reference is gone, replaced with the accurate vectorAdd output. That's the only change from the approved revision, and it's exactly right.

[approve]

@wscurran wscurran removed the fix label Jun 3, 2026
@cv cv merged commit b8e9531 into NVIDIA:main Jun 3, 2026
21 checks passed

Labels

area: inference Inference routing, serving, model selection, or outputs bug-fix PR fixes a bug or regression feature PR adds or expands user-visible functionality v0.0.58 Release target

Projects

None yet

5 participants