fix(onboard): classify Docker GPU patch Error-phase failure (#4316) by yimoj · Pull Request #4407 · NVIDIA/NemoClaw

yimoj · 2026-05-28T05:20:24Z

Summary

The Docker GPU patch path can leave the patched container dead/exited with the sandbox in Error phase, but NemoClaw previously reported a generic "did not become ready" + "GPU proof failed" combo without enough container-level signal to identify which patched create option broke the sandbox. This PR distinguishes Error-phase / patched-container failure from real proof failures, short-circuits the readiness and supervisor-reconnect waits when the sandbox enters a terminal phase, and captures actionable diagnostics on disk so users can see the patched container's exit code and error directly.

Related Issue

Fixes #4316

Changes

Add isSandboxInErrorPhase / getSandboxFailurePhase in src/lib/state/gateway.ts to recognize Error / Failed / CrashLoopBackOff rows from openshell sandbox list.
Short-circuit waitForCreatedSandboxReadyWithTrace and waitForOpenShellSupervisorReconnect when the sandbox enters a terminal phase instead of burning the full timeout window.
Add captureDockerGpuPatchSandboxSnapshot and classifyDockerGpuPatchFailure in src/lib/onboard/docker-gpu-patch.ts. The classifier distinguishes patched_container_failed, sandbox_error_phase, supervisor_unreachable, and proof_failure based on the sandbox phase + patched container State.
Wire the new snapshot/classification into printDockerGpuPatchFailureAndExit, printDockerGpuReadinessFailure, and printDockerGpuProofFailure; write patched-container-state.json alongside existing diagnostics and surface a failure_kind= line in summary.txt.
Skip the GPU proof step in onboard.ts when the sandbox is already in a terminal phase so users see the real lifecycle failure instead of an openshell sandbox exec-refused error.
Plumb dockerCapture through docker-gpu-sandbox-create and onboard.ts so diagnostics work in every patch entry point.
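
To make the four failure kinds concrete, here is a minimal sketch of how such a classifier might branch. The `SnapshotSketch` shape, its field names, and the precedence order are assumptions for illustration only; the actual `classifyDockerGpuPatchFailure` in `src/lib/onboard/docker-gpu-patch.ts` may weigh these signals differently.

```typescript
// Hypothetical, simplified shapes; the real types in docker-gpu-patch.ts may differ.
type FailureKindSketch =
  | "patched_container_failed"
  | "sandbox_error_phase"
  | "supervisor_unreachable"
  | "proof_failure";

interface ContainerStateSketch {
  status: string; // Docker's State.Status, e.g. "running", "exited", "dead"
  exitCode: number | null;
  error: string | null;
}

interface SnapshotSketch {
  sandboxPhase: string | null; // phase reported by the gateway, if any
  containerState: ContainerStateSketch | null; // null when docker inspect was unavailable
  supervisorReachable: boolean; // assumed signal: did sandbox exec reconnect?
}

const TERMINAL_PHASES_SKETCH = ["Error", "Failed", "CrashLoopBackOff"];

function classifyFailureSketch(s: SnapshotSketch): FailureKindSketch {
  const status = s.containerState?.status ?? null;
  // A dead or exited patched container is the most specific signal, so check it first.
  if (status === "dead" || status === "exited") return "patched_container_failed";
  // Next, a terminal sandbox phase without a clearly failed container.
  if (s.sandboxPhase !== null && TERMINAL_PHASES_SKETCH.indexOf(s.sandboxPhase) !== -1) {
    return "sandbox_error_phase";
  }
  // The sandbox looks alive but the supervisor never came back.
  if (!s.supervisorReachable) return "supervisor_unreachable";
  // Everything else is treated as a genuine GPU proof failure.
  return "proof_failure";
}
```

Checking the container state first reflects the PR's goal of surfacing the patched container's exit code and error ahead of more generic sandbox-level symptoms.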

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npm test (CLI + plugin) — relevant suites pass; pre-existing flakes (config-sync chmod, some cli.test.ts 5s timeouts on overloaded hosts) reproduce on upstream/main and are unrelated.
./node_modules/.bin/tsc -p tsconfig.cli.json passes.
Tests added or updated for new or changed behavior (src/lib/onboard/docker-gpu-patch.test.ts).
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes — no doc surface changed; diagnostics are written to ~/.nemoclaw/onboard-failures/... as before, just with more fields.

Notes

Host environment for this fix had Docker but no NVIDIA GPU, so the regression was reproduced and validated hermetically (mocked openshell + docker adapters with the reporter's failure signatures). The hermetic test exercises the new short-circuit and classification, and the build output matches the live-system flow.
Codex review (5 rounds) is now clean on the code; only the local triage scratch note (issue-4316.md) was flagged and is intentionally not committed.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

Summary by CodeRabbit

New Features
- Enhanced GPU sandbox diagnostics with standardized failure summaries and state files for clearer post-mortems.
- Allow caller-controlled GPU proof verification so callers can run custom verification and failure handling.
Bug Fixes
- Faster failure detection — readiness and reconnect waits abort early when sandboxes enter terminal error phases.
- More precise, classified readiness/failure messages distinguishing terminal phases from timeouts.
Tests
- Added regression tests for phase detection, diagnostics capture/classification, and GPU-proof flows.

The Docker GPU patch can leave the patched container in a dead/exited state with the sandbox in Error phase. Onboarding previously reported a generic "did not become ready" + "GPU proof failed" combo without enough container-level signal to identify which patched create option broke the sandbox; the readiness and supervisor-reconnect waits also burned their full timeout windows even when the sandbox had already entered a terminal phase (NVIDIA#4316).

Add an Error/Failed/CrashLoopBackOff phase classifier in state/gateway, short-circuit `waitForCreatedSandboxReadyWithTrace` and `waitForOpenShellSupervisorReconnect` when the sandbox enters a terminal failure phase, and introduce a snapshot + classifier in docker-gpu-patch that distinguishes patched_container_failed, sandbox_error_phase, supervisor_unreachable, and proof_failure. The print helpers surface the new classification plus a patched-container-state.json artifact alongside the existing diagnostics. Skip the GPU proof entirely when the sandbox is already in a terminal phase so users see the actual lifecycle failure instead of an exec-refused error.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

coderabbitai · 2026-05-28T05:20:34Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 197198e2-0a0a-4cb7-af58-497c1e6ad9eb

📥 Commits

Reviewing files that changed from the base of the PR and between 254f578 and 460704d.

📒 Files selected for processing (5)

src/lib/onboard.ts
src/lib/onboard/docker-gpu-patch.test.ts
src/lib/onboard/docker-gpu-sandbox-create.ts
src/lib/onboard/sandbox-readiness-tracing.ts
src/lib/state/gateway.ts

💤 Files with no reviewable changes (1)

src/lib/state/gateway.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/onboard/docker-gpu-patch.test.ts

📝 Walkthrough

Detects terminal sandbox error phases during Docker-GPU onboarding, short‑circuits readiness and supervisor‑reconnect waits, captures sandbox/container snapshots, classifies failures, and writes enriched diagnostics; includes regression tests for parsing, short‑circuits, classification, and output files.

Changes

GPU Patch Error-Phase Detection and Diagnostics

| Layer / File(s) | Summary |
|---|---|
| Sandbox error-phase detection helpers `src/lib/state/gateway.ts` | Adds `TERMINAL_SANDBOX_FAILURE_PHASES` and exports `getSandboxFailurePhase(output, sandboxName)` to extract terminal failure tokens from `openshell sandbox` output. |
| Readiness tracer integration `src/lib/onboard/sandbox-readiness-tracing.ts` | `waitForCreatedSandboxReadyWithTrace` now returns `CreatedSandboxReadinessResult`, accepts a sandbox-failure extractor, emits `terminal_failure_phase`, and early-returns when a terminal phase is observed; adds `formatCreatedSandboxReadinessFailureMessage` and `printReadinessFailure`. |
| Diagnostics types and parsing `src/lib/onboard/docker-gpu-patch.ts` | Adds exported types (`DockerContainerState`, `DockerGpuPatchSandboxSnapshot`, `DockerGpuPatchFailureClassification`) and parsers for `sandbox list`/`get` and `docker inspect` to support structured snapshots. |
| Patch error handling, classification, and diagnostics `src/lib/onboard/docker-gpu-patch.ts` | Adds `captureDockerGpuPatchSandboxSnapshot` and `classifyDockerGpuPatchFailure`; failure printers capture snapshot/classification, accept `dockerCapture`, and `collectDockerGpuPatchDiagnostics` writes enriched `summary.txt` and `patched-container-state.json`. |
| Supervisor reconnect short-circuit `src/lib/onboard/docker-gpu-patch.ts` | `waitForOpenShellSandboxExec` and reconnect waiters short-circuit when `sandbox list` shows a terminal error phase. |
| Sandbox create: deps wiring and accessor `src/lib/onboard/docker-gpu-sandbox-create.ts` | Requires and threads `dockerCapture` through diagnostics paths, exposes `printReadinessFailureIfEnabled()` and `verifyGpuOrExit()` on the `DockerGpuSandboxCreatePatch` surface, and builds `DockerGpuPatchFailureContext` for diagnostics. |
| Onboarding flow integration `src/lib/onboard.ts` | Passes `gatewayState.getSandboxFailurePhase` into readiness tracing, uses `sandboxReadinessTracing.printReadinessFailure` on readiness failure, routes readiness proof failures to `dockerGpuCreatePatch.printReadinessFailureIfEnabled()`, and passes `dockerGpuCreatePatch.verifyGpuOrExit` into `verifyGpuSandboxAfterReady`. |
| Local GPU verification wrapper and tests `src/lib/onboard/docker-gpu-local-inference.ts`, `src/lib/onboard/docker-gpu-local-inference.test.ts` | Adds optional `verifyGpuOrExit` parameter to `verifyGpuSandboxAfterReady` and tests verifying delegation and suppression of duplicate diagnostics when the wrapper handles failures. |
| Regression tests for `#4316` `src/lib/onboard/docker-gpu-patch.test.ts` | New tests validate terminal-phase detection from `sandbox list`, readiness/reconnect short-circuits, sandbox-list vs get precedence, snapshot contents, multiple classification scenarios, diagnostics behavior with/without `dockerCapture`, and filesystem outputs. |
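
As a rough illustration of the first row, terminal-phase detection could look something like the sketch below. It assumes `openshell sandbox list` prints one whitespace-separated row per sandbox with the sandbox name in the first column; that layout is a guess, not something this PR page shows, and the function name is a stand-in for the real `getSandboxFailurePhase`.

```typescript
// Phases named in the PR description as terminal failures.
const TERMINAL_FAILURE_PHASES_SKETCH = ["Error", "Failed", "CrashLoopBackOff"];

// Returns the terminal failure phase for `sandboxName`, or null when the sandbox
// is absent from the output or is not in a terminal failure phase.
// Assumption: one row per sandbox, columns separated by whitespace, name first.
function getSandboxFailurePhaseSketch(output: string, sandboxName: string): string | null {
  for (const line of output.split(/\r?\n/)) {
    const columns = line.trim().split(/\s+/);
    // Exact match on the name column so "beta" does not match "beta-2".
    if (columns[0] !== sandboxName) continue;
    for (const token of columns.slice(1)) {
      if (TERMINAL_FAILURE_PHASES_SKETCH.indexOf(token) !== -1) return token;
    }
    return null;
  }
  return null;
}
```

Matching on the whole name column rather than a substring keeps a failing sibling sandbox from being mistaken for the one being onboarded.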

Sequence Diagrams

sequenceDiagram
  participant OnboardFlow as Onboard Flow
  participant ReadinessTracer as Readiness Tracer
  participant GatewayHelpers as Gateway Helpers
  participant DockerGpuPatch as Docker-GPU Patch
  OnboardFlow->>ReadinessTracer: waitForCreatedSandboxReadyWithTrace(getSandboxFailurePhase)
  ReadinessTracer->>GatewayHelpers: parse sandbox list/get for sandboxName
  alt Terminal Error Phase
    GatewayHelpers-->>ReadinessTracer: failurePhase
    ReadinessTracer-->>OnboardFlow: terminal_failure_phase (failurePhase)
    OnboardFlow->>DockerGpuPatch: printReadinessFailureIfEnabled() / collect diagnostics
  else Ready
    GatewayHelpers-->>ReadinessTracer: ready
    ReadinessTracer-->>OnboardFlow: ready
    OnboardFlow->>DockerGpuPatch: verifyGpuOrExit(verifyDirectSandboxGpu)
  end
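
The short-circuit in the diagram can be sketched as a polling loop that stops as soon as a terminal phase is seen. Everything here is a simplified assumption: the dependency names, the one-second poll interval, and the result shape are illustrative and are not the actual `waitForCreatedSandboxReadyWithTrace` signature or `CreatedSandboxReadinessResult` type.

```typescript
// Hypothetical result shape with three distinguishable outcomes.
type ReadinessResultSketch =
  | { status: "ready" }
  | { status: "terminal_failure_phase"; phase: string }
  | { status: "timeout" };

// Injected dependencies keep the loop testable without a live gateway.
interface WaitDepsSketch {
  timeoutSecs: number;
  listSandboxes: () => string; // stand-in for capturing `openshell sandbox list`
  isReady: (listOutput: string) => boolean;
  getFailurePhase: (listOutput: string) => string | null;
  sleep: (seconds: number) => void;
}

function waitForSandboxReadySketch(deps: WaitDepsSketch): ReadinessResultSketch {
  for (let elapsed = 0; elapsed < deps.timeoutSecs; elapsed += 1) {
    const output = deps.listSandboxes();
    if (deps.isReady(output)) return { status: "ready" };
    const phase = deps.getFailurePhase(output);
    // A terminal phase will not recover by waiting, so stop polling immediately.
    if (phase !== null) return { status: "terminal_failure_phase", phase };
    deps.sleep(1);
  }
  return { status: "timeout" };
}
```

Returning a tagged result instead of a boolean is what lets the caller report a terminal phase differently from a timeout.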

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly Related PRs

NVIDIA/NemoClaw#4609: Modifies GPU proof/verification flow and verifyGpuSandboxAfterReady behavior in related areas.

Suggested Reviewers

ericksoa

🐰 I hopped through logs where phases flick and fade,
I sniffed for "Error" tokens the list display made,
I captured a snapshot, put reasons in a file,
I stopped waiting when Error came and logged the while,
Hop, patch, report — then onboarding may smile!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.59% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(onboard): classify Docker GPU patch Error-phase failure (`#4316`)' clearly identifies the main change: classifying Docker GPU patch Error-phase failures to address issue `#4316`, which is the core objective of the pull request. |
| Linked Issues check | ✅ Passed | The PR addresses all primary coding objectives from `#4316`: it detects sandbox Error phases via getSandboxFailurePhase, short-circuits readiness waits on terminal phases, classifies failures into distinct kinds (patched_container_failed, sandbox_error_phase, supervisor_unreachable, proof_failure), captures actionable diagnostics (patched-container-state.json, failure_kind in summary.txt), and integrates the classification pipeline into failure reporters. |
| Out of Scope Changes check | ✅ Passed | All code changes align with the stated objectives: failure classification and diagnostics capture for Docker GPU patch flows, readiness-tracing improvements, gateway phase detection, and sandbox-create integration. No unrelated refactoring or feature additions are present. |

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lib/onboard.ts (1)

3643-3648: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Differentiate terminal failure from timeout.

Once isSandboxInErrorPhase is wired into this wait, ready === false no longer means "timed out". The fallback message at Line 3662 still says the sandbox "did not become ready within ...", so an early Error/Failed/CrashLoopBackOff exit is still reported as a timeout.

Suggested fix

   const ready = sandboxReadinessTracing.waitForCreatedSandboxReadyWithTrace({
     sandboxName,
     timeoutSecs: sandboxReadyTimeoutSecs,
     runCaptureOpenshell,
     isSandboxReady,
     isSandboxInErrorPhase,
     sleep: sleepSeconds,
   });
+  const failurePhase = !ready
+    ? getSandboxFailurePhase(
+        runCaptureOpenshell(["sandbox", "list"], { ignoreError: true }),
+        sandboxName,
+      )
+    : null;
 
   const restoreBackupPath =
     pendingStateRestore?.manifest?.backupPath ?? pendingStateRestoreBackupPath;
 
   if (!ready) {
@@
-    console.error(
-      `  Sandbox '${sandboxName}' was created but did not become ready within ${sandboxReadyTimeoutSecs}s.`,
-    );
+    console.error(
+      failurePhase
+        ? `  Sandbox '${sandboxName}' entered ${failurePhase} before it became ready.`
+        : `  Sandbox '${sandboxName}' was created but did not become ready within ${sandboxReadyTimeoutSecs}s.`,
+    );

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 3643 - 3648, The current call to
sandboxReadinessTracing.waitForCreatedSandboxReadyWithTrace (assigned to ready)
treats any non-true result as a timeout; update the wait function to return a
richer result (e.g., { ready: boolean, reason: 'timeout'|'error'|'other',
details?: any }) or otherwise surface why it failed (use isSandboxInErrorPhase
internally), then change the caller logic that inspects ready to branch: when
ready === true proceed, when reason === 'error' log/throw a clear "sandbox
entered error phase" message including details, and when reason === 'timeout'
keep the existing timeout fallback message. Ensure references to ready,
sandboxReadinessTracing.waitForCreatedSandboxReadyWithTrace, and
isSandboxInErrorPhase are used so the caller can distinguish terminal failures
from timeouts.
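
One way to picture the caller-side branching requested above is a small formatter that picks the message from the failure reason. The two message strings are taken from the suggested fix in this comment; the `ReadinessFailureSketch` type and the function name are hypothetical and are not the signature of the `formatCreatedSandboxReadinessFailureMessage` helper that the PR eventually added.

```typescript
// Hypothetical failure shape: either a terminal phase was observed or the wait timed out.
type ReadinessFailureSketch =
  | { reason: "terminal_failure_phase"; phase: string }
  | { reason: "timeout" };

function formatReadinessFailureSketch(
  sandboxName: string,
  timeoutSecs: number,
  failure: ReadinessFailureSketch,
): string {
  // Name the terminal phase when there is one; only blame the timeout otherwise.
  return failure.reason === "terminal_failure_phase"
    ? `  Sandbox '${sandboxName}' entered ${failure.phase} before it became ready.`
    : `  Sandbox '${sandboxName}' was created but did not become ready within ${timeoutSecs}s.`;
}
```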

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3680-3688: The code repeatedly constructs the same Docker GPU
diagnostics payload object ({ runCaptureOpenshell, dockerCapture:
docker.dockerCapture, context: { sandboxName, newContainerId:
dockerGpuCreatePatch.patchedContainerId(), selectedMode:
dockerGpuCreatePatch.selectedMode() } }) in three places; extract that into a
small local helper/closure (e.g., buildDockerGpuDiagPayload or dockerGpuDiag())
and replace each inline object with a call to that helper so the assembly uses
the single function and reduces file growth and duplication around
runCaptureOpenshell, docker.dockerCapture, sandboxName,
dockerGpuCreatePatch.patchedContainerId(), and
dockerGpuCreatePatch.selectedMode().

In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 1361-1376: The current logic sets sandboxPhase from
parseSandboxPhaseFromGetOutput(getOutput) and only uses
parseSandboxPhaseFromListOutput(listOutput, sandboxName) as a fallback; change
it so the list result takes precedence: after obtaining listOutput and
sandboxListLine, if findSandboxListLine(listOutput, sandboxName) found the
sandbox then overwrite sandboxPhase with
parseSandboxPhaseFromListOutput(listOutput, sandboxName) regardless of whether
sandboxPhase was previously set; update the block around
deps.runCaptureOpenshell([... "sandbox", "list" ...]) to prefer the list-derived
phase (affecting sandboxPhase, listOutput, sandboxListLine) so
classifyDockerGpuPatchFailure(...) sees the up-to-date phase.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 349a50ef-a1dc-4a4e-bdb7-a1afae5f6794

📥 Commits

Reviewing files that changed from the base of the PR and between 78909ec and 8174f66.

📒 Files selected for processing (6)

src/lib/onboard.ts
src/lib/onboard/docker-gpu-patch.test.ts
src/lib/onboard/docker-gpu-patch.ts
src/lib/onboard/docker-gpu-sandbox-create.ts
src/lib/onboard/sandbox-readiness-tracing.ts
src/lib/state/gateway.ts

The previous commit added the Error-phase short-circuit and proof-vs-readiness distinction inline in onboard.ts, but `src/lib/onboard.ts` is under a codebase-growth guardrail that blocks net growth in the top-level entrypoint. Move the GPU readiness-failure print block and the "skip proof on terminal phase" check into `DockerGpuSandboxCreatePatch` helpers so onboard.ts shrinks while the diagnostics surface stays the same.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

coderabbitai

🧹 Nitpick comments (1)

src/lib/onboard/docker-gpu-sandbox-create.ts (1)
230-240: ⚡ Quick win

Consider including oldContainerId in the failure context for diagnostic completeness.

The buildFailureContext helper constructs a context object for readiness and proof failure diagnostics but does not include oldContainerId, whereas the inline context built for supervisor reconnect failures (line 156) does include it. This inconsistency means readiness/proof failure diagnostics will be missing the old container ID, which could be valuable for comparing before/after container state in diagnostic outputs like patched-container-state.json.
📋 Proposed fix to include oldContainerId
 function buildFailureContext(
   sandboxName: string,
   result: DockerGpuPatchResult | null,
 ): DockerGpuPatchFailureContext {
   return {
     sandboxName,
+    oldContainerId: result?.oldContainerId ?? null,
     newContainerId: result?.newContainerId ?? null,
     backupContainerName: result?.backupContainerName ?? null,
     selectedMode: result?.mode ?? null,
   };
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/docker-gpu-sandbox-create.ts` around lines 230 - 240, The
failure context returned by buildFailureContext is missing oldContainerId;
update the buildFailureContext function to include oldContainerId (e.g., set
oldContainerId: result?.oldContainerId ?? null) so it matches the inline
supervisor reconnect context and ensures DockerGpuPatchFailureContext
(type/interface) includes oldContainerId for diagnostics such as
patched-container-state.json.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d58e524f-f855-4156-83bf-ebecfa0ae37f

📥 Commits

Reviewing files that changed from the base of the PR and between 8174f66 and f307b0a.

📒 Files selected for processing (2)

src/lib/onboard.ts
src/lib/onboard/docker-gpu-sandbox-create.ts

…taleness

CodeRabbit and Codex feedback on NVIDIA#4316:

- The readiness-failure message in onboard.ts still reported "did not become ready within Xs" even after `waitForCreatedSandboxReadyWithTrace` learned to short-circuit on Error/Failed/CrashLoopBackOff. When that short-circuit fires, the message should call out the terminal phase instead of blaming the timeout.
- `captureDockerGpuPatchSandboxSnapshot` initially preferred the `sandbox get` phase and only fell back to `sandbox list`. The list view can be the fresher signal when a transition just happened, but blindly overriding the get phase would mask a terminal `get` behind a stale live `list` row. Rank phases as terminal > live > intermediate/none and let the higher-ranked signal win, with ties going to the list output as the broader gateway view.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
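
The ranking rule in this commit message (terminal > live > intermediate/none, ties to the list output) can be sketched as follows. Which phase names count as "live" is not stated on this page, so `Ready` and `Running` below are placeholders, and both function names are hypothetical.

```typescript
const TERMINAL_PHASE_NAMES_SKETCH = ["Error", "Failed", "CrashLoopBackOff"];
// Assumption: the PR does not list the "live" phases; these two are placeholders.
const LIVE_PHASE_NAMES_SKETCH = ["Ready", "Running"];

// terminal = 2, live = 1, intermediate/unknown/none = 0
function phaseRankSketch(phase: string | null): number {
  if (phase === null) return 0;
  if (TERMINAL_PHASE_NAMES_SKETCH.indexOf(phase) !== -1) return 2;
  if (LIVE_PHASE_NAMES_SKETCH.indexOf(phase) !== -1) return 1;
  return 0;
}

// Picks between the phase parsed from `sandbox get` and the one from `sandbox list`.
function pickSandboxPhaseSketch(getPhase: string | null, listPhase: string | null): string | null {
  // Only a strictly higher rank lets the `get` phase win; ties go to the list output.
  if (phaseRankSketch(getPhase) > phaseRankSketch(listPhase)) return getPhase;
  // On a tie prefer the list phase, falling back to `get` when the list had no row.
  return listPhase ?? getPhase;
}
```

With this rule a terminal phase from either source always survives, which is the property the commit message is protecting: a stale live `list` row cannot hide a terminal `get`, and a stale live `get` cannot hide a terminal `list`.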

`buildFailureContext` was missing `oldContainerId`, but the supervisor-reconnect failure path already passes the full before/after pair. Add `oldContainerId` so the readiness and proof failure diagnostics (patched-container-state.json, docker-network-summary.txt) get the same original-container reference for comparison.

CodeRabbit nitpick on NVIDIA#4316.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

wscurran · 2026-05-28T17:56:25Z

✨
Related open issues:

#4316 nemoclaw onboarding with gpu fails because sandbox transitions to Error phase after Docker GPU patch

yimoj · 2026-05-30T01:16:03Z

@coderabbitai resume

coderabbitai · 2026-05-30T01:16:10Z

✅ Actions performed

Reviews resumed.

… guardrail

The post-merge conflict resolution expanded the GPU patch deps literal to multi-line form and inlined the readiness-failure message branching, putting onboard.ts at +7 net lines and tripping codebase-growth-guardrails.

Compress the deps literal back to one line and move the readiness-failure message formatter into sandbox-readiness-tracing.ts where it sits next to the CreatedSandboxReadinessResult contract it formats.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

coderabbitai Bot reviewed May 28, 2026

Comment thread src/lib/onboard.ts Outdated

Comment thread src/lib/onboard/docker-gpu-patch.ts Outdated

coderabbitai Bot reviewed May 28, 2026

yimoj added 2 commits May 28, 2026 05:55

yimoj added the v0.0.55 label May 28, 2026

MasakiYamagishiSCSK mentioned this pull request May 28, 2026

nemoclaw onboarding with gpu fails because sandbox transitions to Error phase after Docker GPU patch #4316

Closed

2 tasks

wscurran added Docker labels May 28, 2026

jyaunches added R1 v0.0.56 Release target and removed v0.0.55 labels May 29, 2026

Merge branch 'main' into fix/4316-docker-gpu-error-diagnostics

5f5444e

cv enabled auto-merge (squash) May 30, 2026 01:04

Merge branch 'main' into fix/4316-docker-gpu-error-diagnostics

f1bf673

yimoj added v0.0.60 Release target and removed v0.0.60 Release target labels Jun 1, 2026

cv added v0.0.57 Release target and removed v0.0.56 Release target labels Jun 1, 2026

cv and others added 2 commits June 1, 2026 15:35

merge(main): resolve docker GPU diagnostics conflicts

0cbcfe5

auto-merge was automatically disabled June 1, 2026 23:21
Head branch was pushed to by a user without write access

refactor(onboard): rely on Docker GPU patch defaults

405be6d

cv added 3 commits June 1, 2026 16:29

refactor(onboard): use a single sandbox failure phase classifier

257c605

refactor(onboard): move readiness failure formatting

d2ea957

refactor(onboard): remove unused Docker GPU patch accessor

460704d

cv approved these changes Jun 2, 2026

cv merged commit 451f26f into NVIDIA:main Jun 2, 2026
29 checks passed

laitingsheng mentioned this pull request Jun 2, 2026

fix(onboard): debounce Docker GPU patch supervisor reconnect Error-phase short-circuit #4668

Merged

12 tasks

coderabbitai Bot mentioned this pull request Jun 3, 2026

fix(inference): prove WSL Docker Desktop GPUs and report sandbox CUDA proof state #4599

Merged

8 tasks

wscurran removed Docker labels Jun 3, 2026

cv merged 12 commits into NVIDIA:main from yimoj:fix/4316-docker-gpu-error-diagnostics

4 participants
