fix(onboard): classify Docker GPU patch Error-phase failure (#4316)#4407
Conversation
The Docker GPU patch can leave the patched container in a dead/exited state with the sandbox in Error phase. Onboarding previously reported a generic "did not become ready" + "GPU proof failed" combo without enough container-level signal to identify which patched create option broke the sandbox; the readiness and supervisor-reconnect waits also burned their full timeout windows even when the sandbox had already entered a terminal phase (NVIDIA#4316). Add an Error/Failed/CrashLoopBackOff phase classifier in state/gateway, short-circuit `waitForCreatedSandboxReadyWithTrace` and `waitForOpenShellSupervisorReconnect` when the sandbox enters a terminal failure phase, and introduce a snapshot + classifier in docker-gpu-patch that distinguishes patched_container_failed, sandbox_error_phase, supervisor_unreachable, and proof_failure. The print helpers surface the new classification plus a patched-container-state.json artifact alongside the existing diagnostics. Skip the GPU proof entirely when the sandbox is already in a terminal phase so users see the actual lifecycle failure instead of an exec-refused error. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (5)
💤 Files with no reviewable changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughDetects terminal sandbox error phases during Docker-GPU onboarding, short‑circuits readiness and supervisor‑reconnect waits, captures sandbox/container snapshots, classifies failures, and writes enriched diagnostics; includes regression tests for parsing, short‑circuits, classification, and output files. ChangesGPU Patch Error-Phase Detection and Diagnostics
Sequence DiagramssequenceDiagram
participant OnboardFlow as Onboard Flow
participant ReadinessTracer as Readiness Tracer
participant GatewayHelpers as Gateway Helpers
participant DockerGpuPatch as Docker-GPU Patch
OnboardFlow->>ReadinessTracer: waitForCreatedSandboxReadyWithTrace(getSandboxFailurePhase)
ReadinessTracer->>GatewayHelpers: parse sandbox list/get for sandboxName
alt Terminal Error Phase
GatewayHelpers-->>ReadinessTracer: failurePhase
ReadinessTracer-->>OnboardFlow: terminal_failure_phase (failurePhase)
OnboardFlow->>DockerGpuPatch: printReadinessFailureIfEnabled() / collect diagnostics
else Ready
GatewayHelpers-->>ReadinessTracer: ready
ReadinessTracer-->>OnboardFlow: ready
OnboardFlow->>DockerGpuPatch: verifyGpuOrExit(verifyDirectSandboxGpu)
end
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly Related PRs
Suggested Reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
3643-3648:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDifferentiate terminal failure from timeout.
Once
isSandboxInErrorPhaseis wired into this wait,ready === falseno longer means "timed out". The fallback message at Line 3662 still says the sandbox "did not become ready within ...", so an earlyError/Failed/CrashLoopBackOffexit is still reported as a timeout.Suggested fix
const ready = sandboxReadinessTracing.waitForCreatedSandboxReadyWithTrace({ sandboxName, timeoutSecs: sandboxReadyTimeoutSecs, runCaptureOpenshell, isSandboxReady, isSandboxInErrorPhase, sleep: sleepSeconds, }); + const failurePhase = !ready + ? getSandboxFailurePhase( + runCaptureOpenshell(["sandbox", "list"], { ignoreError: true }), + sandboxName, + ) + : null; const restoreBackupPath = pendingStateRestore?.manifest?.backupPath ?? pendingStateRestoreBackupPath; if (!ready) { @@ - console.error( - ` Sandbox '${sandboxName}' was created but did not become ready within ${sandboxReadyTimeoutSecs}s.`, - ); + console.error( + failurePhase + ? ` Sandbox '${sandboxName}' entered ${failurePhase} before it became ready.` + : ` Sandbox '${sandboxName}' was created but did not become ready within ${sandboxReadyTimeoutSecs}s.`, + );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 3643 - 3648, The current call to sandboxReadinessTracing.waitForCreatedSandboxReadyWithTrace (assigned to ready) treats any non-true result as a timeout; update the wait function to return a richer result (e.g., { ready: boolean, reason: 'timeout'|'error'|'other', details?: any }) or otherwise surface why it failed (use isSandboxInErrorPhase internally), then change the caller logic that inspects ready to branch: when ready === true proceed, when reason === 'error' log/throw a clear "sandbox entered error phase" message including details, and when reason === 'timeout' keep the existing timeout fallback message. Ensure references to ready, sandboxReadinessTracing.waitForCreatedSandboxReadyWithTrace, and isSandboxInErrorPhase are used so the caller can distinguish terminal failures from timeouts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3680-3688: The code repeatedly constructs the same Docker GPU
diagnostics payload object ({ runCaptureOpenshell, dockerCapture:
docker.dockerCapture, context: { sandboxName, newContainerId:
dockerGpuCreatePatch.patchedContainerId(), selectedMode:
dockerGpuCreatePatch.selectedMode() } }) in three places; extract that into a
small local helper/closure (e.g., buildDockerGpuDiagPayload or dockerGpuDiag())
and replace each inline object with a call to that helper so the assembly uses
the single function and reduces file growth and duplication around
runCaptureOpenshell, docker.dockerCapture, sandboxName,
dockerGpuCreatePatch.patchedContainerId(), and
dockerGpuCreatePatch.selectedMode().
In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 1361-1376: The current logic sets sandboxPhase from
parseSandboxPhaseFromGetOutput(getOutput) and only uses
parseSandboxPhaseFromListOutput(listOutput, sandboxName) as a fallback; change
it so the list result takes precedence: after obtaining listOutput and
sandboxListLine, if findSandboxListLine(listOutput, sandboxName) found the
sandbox then overwrite sandboxPhase with
parseSandboxPhaseFromListOutput(listOutput, sandboxName) regardless of whether
sandboxPhase was previously set; update the block around
deps.runCaptureOpenshell([... "sandbox", "list" ...]) to prefer the list-derived
phase (affecting sandboxPhase, listOutput, sandboxListLine) so
classifyDockerGpuPatchFailure(...) sees the up-to-date phase.
---
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 3643-3648: The current call to
sandboxReadinessTracing.waitForCreatedSandboxReadyWithTrace (assigned to ready)
treats any non-true result as a timeout; update the wait function to return a
richer result (e.g., { ready: boolean, reason: 'timeout'|'error'|'other',
details?: any }) or otherwise surface why it failed (use isSandboxInErrorPhase
internally), then change the caller logic that inspects ready to branch: when
ready === true proceed, when reason === 'error' log/throw a clear "sandbox
entered error phase" message including details, and when reason === 'timeout'
keep the existing timeout fallback message. Ensure references to ready,
sandboxReadinessTracing.waitForCreatedSandboxReadyWithTrace, and
isSandboxInErrorPhase are used so the caller can distinguish terminal failures
from timeouts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 349a50ef-a1dc-4a4e-bdb7-a1afae5f6794
📒 Files selected for processing (6)
src/lib/onboard.tssrc/lib/onboard/docker-gpu-patch.test.tssrc/lib/onboard/docker-gpu-patch.tssrc/lib/onboard/docker-gpu-sandbox-create.tssrc/lib/onboard/sandbox-readiness-tracing.tssrc/lib/state/gateway.ts
The previous commit added the Error-phase short-circuit and proof-vs- readiness distinction inline in onboard.ts, but `src/lib/onboard.ts` is under a codebase-growth guardrail that blocks net growth in the top- level entrypoint. Move the GPU readiness-failure print block and the "skip proof on terminal phase" check into `DockerGpuSandboxCreatePatch` helpers so onboard.ts shrinks while the diagnostics surface stays the same. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard/docker-gpu-sandbox-create.ts (1)
230-240: ⚡ Quick winConsider including
oldContainerIdin the failure context for diagnostic completeness.The
buildFailureContexthelper constructs a context object for readiness and proof failure diagnostics but does not includeoldContainerId, whereas the inline context built for supervisor reconnect failures (line 156) does include it. This inconsistency means readiness/proof failure diagnostics will be missing the old container ID, which could be valuable for comparing before/after container state in diagnostic outputs likepatched-container-state.json.📋 Proposed fix to include oldContainerId
function buildFailureContext( sandboxName: string, result: DockerGpuPatchResult | null, ): DockerGpuPatchFailureContext { return { sandboxName, + oldContainerId: result?.oldContainerId ?? null, newContainerId: result?.newContainerId ?? null, backupContainerName: result?.backupContainerName ?? null, selectedMode: result?.mode ?? null, }; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/docker-gpu-sandbox-create.ts` around lines 230 - 240, The failure context returned by buildFailureContext is missing oldContainerId; update the buildFailureContext function to include oldContainerId (e.g., set oldContainerId: result?.oldContainerId ?? null) so it matches the inline supervisor reconnect context and ensures DockerGpuPatchFailureContext (type/interface) includes oldContainerId for diagnostics such as patched-container-state.json.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard/docker-gpu-sandbox-create.ts`:
- Around line 230-240: The failure context returned by buildFailureContext is
missing oldContainerId; update the buildFailureContext function to include
oldContainerId (e.g., set oldContainerId: result?.oldContainerId ?? null) so it
matches the inline supervisor reconnect context and ensures
DockerGpuPatchFailureContext (type/interface) includes oldContainerId for
diagnostics such as patched-container-state.json.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d58e524f-f855-4156-83bf-ebecfa0ae37f
📒 Files selected for processing (2)
src/lib/onboard.tssrc/lib/onboard/docker-gpu-sandbox-create.ts
…taleness CodeRabbit and Codex feedback on NVIDIA#4316: - The readiness-failure message in onboard.ts still reported "did not become ready within Xs" even after `waitForCreatedSandboxReadyWithTrace` learned to short-circuit on Error/Failed/CrashLoopBackOff. When that short-circuit fires, the message should call out the terminal phase instead of blaming the timeout. - `captureDockerGpuPatchSandboxSnapshot` initially preferred the `sandbox get` phase and only fell back to `sandbox list`. The list view can be the fresher signal when a transition just happened, but blindly overriding the get phase would mask a terminal `get` behind a stale live `list` row. Rank phases as terminal > live > intermediate/none and let the higher-ranked signal win, with ties going to the list output as the broader gateway view. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
`buildFailureContext` was missing `oldContainerId`, but the supervisor- reconnect failure path already passes the full before/after pair. Add `oldContainerId` so the readiness and proof failure diagnostics (patched-container-state.json, docker-network-summary.txt) get the same original-container reference for comparison. CodeRabbit nitpick on NVIDIA#4316. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
|
@coderabbitai resume |
✅ Actions performedReviews resumed. |
… guardrail The post-merge conflict resolution expanded the GPU patch deps literal to multi-line form and inlined the readiness-failure message branching, putting onboard.ts at +7 net lines and tripping codebase-growth- guardrails. Compress the deps literal back to one line and move the readiness- failure message formatter into sandbox-readiness-tracing.ts where it sits next to the CreatedSandboxReadinessResult contract it formats. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Head branch was pushed to by a user without write access
Summary
The Docker GPU patch path can leave the patched container dead/exited with the sandbox in Error phase, but NemoClaw previously reported a generic "did not become ready" + "GPU proof failed" combo without enough container-level signal to identify which patched create option broke the sandbox. This PR distinguishes Error-phase / patched-container failure from real proof failures, short-circuits the readiness and supervisor-reconnect waits when the sandbox enters a terminal phase, and captures actionable diagnostics on disk so users can see the patched container's exit code and error directly.
Related Issue
Fixes #4316
Changes
isSandboxInErrorPhase/getSandboxFailurePhaseinsrc/lib/state/gateway.tsto recognizeError/Failed/CrashLoopBackOffrows fromopenshell sandbox list.waitForCreatedSandboxReadyWithTraceandwaitForOpenShellSupervisorReconnectwhen the sandbox enters a terminal phase instead of burning the full timeout window.captureDockerGpuPatchSandboxSnapshotandclassifyDockerGpuPatchFailureinsrc/lib/onboard/docker-gpu-patch.ts. The classifier distinguishespatched_container_failed,sandbox_error_phase,supervisor_unreachable, andproof_failurebased on the sandbox phase + patched container State.printDockerGpuPatchFailureAndExit,printDockerGpuReadinessFailure, andprintDockerGpuProofFailure; writepatched-container-state.jsonalongside existing diagnostics and surface afailure_kind=line insummary.txt.onboard.tswhen the sandbox is already in a terminal phase so users see the real lifecycle failure instead of anopenshell sandbox exec-refused error.dockerCapturethroughdocker-gpu-sandbox-createandonboard.tsso diagnostics work in every patch entry point.Type of Change
Verification
npm test(CLI + plugin) — relevant suites pass; pre-existing flakes (config-syncchmod, somecli.test.ts5s timeouts on overloaded hosts) reproduce onupstream/mainand are unrelated../node_modules/.bin/tsc -p tsconfig.cli.jsonpasses.src/lib/onboard/docker-gpu-patch.test.ts).~/.nemoclaw/onboard-failures/...as before, just with more fields.Notes
openshell+dockeradapters with the reporter's failure signatures). The hermetic test exercises the new short-circuit and classification, and the build output matches the live-system flow.issue-4316.md) was flagged and is intentionally not committed.Signed-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes
Tests