fix(onboard): debounce Docker GPU patch supervisor reconnect Error-phase short-circuit#4668
Conversation
…ase short-circuit Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
💤 Files with no reviewable changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds a bounded supervisor-reconnect wait that inspects OpenShell sandbox phases and requires a configurable number of consecutive terminal Error-phase polls before failing fast. Exposes env/config getters and a deps override, wires the wait into Docker GPU sandbox recreation, adds tests for debounce behavior, and documents the new env toggle. ChangesSupervisor-reconnect error-phase debounce
Sequence DiagramsequenceDiagram
participant Recreate as recreateOpenShellDockerSandboxWithGpu
participant Wait as waitForOpenShellSupervisorReconnect
participant Exec as runOpenshell
participant Capture as runCaptureOpenshell
Recreate->>Wait: invoke(timeoutSecs, { errorPhaseDebouncePolls })
loop poll until deadline or success
Wait->>Exec: sandbox exec ... -- true
alt Exec fails
Wait->>Capture: openshell sandbox list
Capture-->>Wait: sandbox phase (e.g., Error, Provisioning)
Wait->>Wait: increment/reset consecutive-Error counter
end
end
Wait-->>Recreate: return success|failure
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
…odule + document env Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
🌿 Preview your docs: https://nvidia-preview-pr-4668.docs.buildwithfern.com/nemoclaw |
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/docker-gpu-supervisor-reconnect.ts`:
- Around line 111-112: The injected override deps.errorPhaseDebouncePolls must
be clamped to the same minimum as the env-backed path; change the assignment for
errorPhaseDebouncePolls to normalize the injected value (e.g. use Math.max with
minimum 1) so that errorPhaseDebouncePolls = Math.max(1,
deps.errorPhaseDebouncePolls ??
getDockerGpuSupervisorReconnectErrorDebouncePolls()); this ensures both the deps
override and getDockerGpuSupervisorReconnectErrorDebouncePolls() honor the same
minimum contract.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b87059ca-56cd-4a48-830c-3297761637c2
📒 Files selected for processing (6)
docs/reference/troubleshooting.mdxskills/nemoclaw-user-reference/references/troubleshooting.mdsrc/lib/onboard/docker-gpu-patch.test.tssrc/lib/onboard/docker-gpu-patch.tssrc/lib/onboard/docker-gpu-supervisor-reconnect.test.tssrc/lib/onboard/docker-gpu-supervisor-reconnect.ts
…o minimum 1 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…es + trim EOF Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
prekshivyas
left a comment
There was a problem hiding this comment.
APPROVE.
The trailing-edge debounce with reset-on-recovery in waitForOpenShellSupervisorReconnect is the right model for #4664 — a transient sandbox list Error during container re-registration is absorbed, while a genuinely crashed container still fast-fails (~8s) instead of burning the full timeout. Tests inject a mocked sleep and the poll source (no real wall-clock dependence) and cover the window boundary, flapping-reset, clamp, and non-finite-override cases with exact poll/sleep counts. The CodeRabbit override-clamp gap is fixed and regression-tested. CI green on 1b70103, thread resolved in head.
Non-blocking cleanup: TERMINAL_SANDBOX_FAILURE_PHASES and parseSandboxListFailurePhase in the new module duplicate SANDBOX_FAILURE_PHASE_TOKENS / parseSandboxPhaseFromListOutput still in docker-gpu-patch.ts — values match today but can silently drift; consider exporting one canonical set. Doc nit: default K=5 is ~8s of sleeps (4×2s), not 10s.
Signed-off-by: Prekshi Vyas prekshiv@nvidia.com
Summary
When OpenShell's Docker GPU patch recreates the sandbox container with
--gpus all, the brief container churn (stop old → rename → run new) leaves the host'ssandbox listcache reporting phase Error for a few seconds before the host re-registers the new container. The previous fast-fail short-circuit treats that transient Error as fatal on the very first poll, so a perfectly healthy GPU sandbox dies withOpenShell supervisor did not reconnect to the GPU-enabled container.within ~12 s — even though the new container is running, healthy, and the OCSF supervisor has already loggedLIFECYCLE:INSTALL OpenShell Sandbox Supervisor success.This PR debounces the Error-phase short-circuit: require K consecutive Error polls (default 5 ≈ 10 s sustained Error) before fast-failing. A patched container that actually crashes still fast-fails (~10 s instead of the original ~4 s); a transient teardown-Error during recreation no longer aborts the wait.
The supervisor-reconnect code path (constants, helpers, the reconnect wait, and its tests) is extracted into a focused
docker-gpu-supervisor-reconnect.tsmodule with a source-of-truth boundary and removal condition documented in the module header.Related Issue
Fixes #4664
Changes
src/lib/onboard/docker-gpu-supervisor-reconnect.tsowns the supervisor-reconnect wait, the timeout helper, and the new debounce helper. Header records the source-of-truth boundary (transient Error is an OpenShell sandbox-list cache artifact during recreation) and the removal condition (drop the debounce once OpenShell guaranteessandbox listskips Error during a known recreate, validated by a real-Docker GPU E2E that observes transient Error recovering to Ready).src/lib/onboard/docker-gpu-patch.ts: replace inline reconnect helpers with imports + re-exports from the new module.recreateOpenShellDockerSandboxWithGpunow callswaitForOpenShellSupervisorReconnectdirectly. Net file delta vsmain: −49 lines.NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE(clamped ≥ 1, default 5) tunes the debounce window.DockerGpuSupervisorReconnectDeps.errorPhaseDebouncePollslets tests inject a small K without touching env.waitForOpenShellSupervisorReconnecttracks consecutive Error-phase polls and short-circuits only aftererrorPhaseDebouncePollsconsecutive Error reads. Counter resets on any non-Error poll so flapping does not accumulate.src/lib/onboard/docker-gpu-patch.test.ts: existing fast-fail test now asserts the explicit K=1 (no-debounce) behavior so the original intent is preserved when an operator opts out of the debounce. Net file delta vsmain: +5 lines.src/lib/onboard/docker-gpu-supervisor-reconnect.test.tscovers the debounce state machine: transient Error window shorter than K → reconnect succeeds; sustained Error for K polls → still fast-fails; flapping phase resets the counter; env override + lower-bound clamp ongetDockerGpuSupervisorReconnectErrorDebouncePolls.docs/reference/troubleshooting.mdxandskills/nemoclaw-user-reference/references/troubleshooting.md: short troubleshooting note under "Docker GPU patch failed during sandbox create" describing the default debounce and when to tune the env var.Type of Change
Verification
npx prek run --all-filespassesnpm testpasses — rannpx vitest run --project cli src/lib/onboard/docker-gpu-patch.test.ts src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts(54/54 pass) andnpm run typecheck:cli(clean). Fullnpx vitest run --project clishows 5 pre-existing failures onmainHEAD in unrelated files (src/lib/cli/command-registry.test.ts,test/cli.test.ts,test/whatsapp-qr-compact.test.ts). Required runtime validation dispatched:e2e-branch-validation:gpu(run 26819864315) andgpu-repo-local-ollama-openclaw(run 26819868701).npm run docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Documentation
Tests