fix(onboard): debounce Docker GPU patch supervisor reconnect Error-phase short-circuit by laitingsheng · Pull Request #4668 · NVIDIA/NemoClaw

laitingsheng · 2026-06-02T12:05:01Z

Summary

When OpenShell's Docker GPU patch recreates the sandbox container with `--gpus all`, the brief container churn (stop old → rename → run new) leaves the host's `sandbox list` cache reporting phase `Error` for a few seconds before the host re-registers the new container. The previous fast-fail short-circuit treats that transient `Error` as fatal on the very first poll, so a perfectly healthy GPU sandbox dies with "OpenShell supervisor did not reconnect to the GPU-enabled container." within ~12 s, even though the new container is running and healthy and the OCSF supervisor has already logged `LIFECYCLE:INSTALL OpenShell Sandbox Supervisor success`.

This PR debounces the Error-phase short-circuit: require K consecutive Error polls (default 5 ≈ 10 s sustained Error) before fast-failing. A patched container that actually crashes still fast-fails (~10 s instead of the original ~4 s); a transient teardown-Error during recreation no longer aborts the wait.
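As a rough illustration of the debounce described above, the consecutive-Error rule can be written as a small pure function. This is a minimal sketch, not the module's code: `SandboxPhase`, `observePhase`, and `firstFastFailPoll` are hypothetical names, and the phase names other than `Error` are only examples.

```typescript
// Hypothetical types and names, for illustration only.
type SandboxPhase = "Ready" | "Provisioning" | "Error" | "Unknown";

interface DebounceState {
  consecutiveErrors: number;
}

// Folds one poll result into the state and reports whether to fast-fail now.
function observePhase(
  state: DebounceState,
  phase: SandboxPhase,
  debouncePolls: number,
): { state: DebounceState; fastFail: boolean } {
  // Any non-Error read resets the streak, so a flapping phase never accumulates.
  const consecutiveErrors = phase === "Error" ? state.consecutiveErrors + 1 : 0;
  return {
    state: { consecutiveErrors },
    fastFail: consecutiveErrors >= debouncePolls,
  };
}

// Replays a sequence of observed phases and returns the 1-based poll number
// at which the wait would fast-fail, or null if it never does.
function firstFastFailPoll(
  phases: SandboxPhase[],
  debouncePolls: number,
): number | null {
  let state: DebounceState = { consecutiveErrors: 0 };
  let poll = 0;
  for (const phase of phases) {
    poll += 1;
    const result = observePhase(state, phase, debouncePolls);
    if (result.fastFail) return poll;
    state = result.state;
  }
  return null;
}
```

With K = 5, a sequence such as `Error, Error, Provisioning, Ready` never trips the short-circuit, while five `Error` reads in a row trip it on the fifth poll. With K = 1 the first `Error` read is fatal, which is the pre-debounce behavior.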

The supervisor-reconnect code path (constants, helpers, the reconnect wait, and its tests) is extracted into a focused `docker-gpu-supervisor-reconnect.ts` module with a source-of-truth boundary and removal condition documented in the module header.

Related Issue

Fixes #4664

Changes

- New module `src/lib/onboard/docker-gpu-supervisor-reconnect.ts` owns the supervisor-reconnect wait, the timeout helper, and the new debounce helper. Header records the source-of-truth boundary (transient `Error` is an OpenShell sandbox-list cache artifact during recreation) and the removal condition (drop the debounce once OpenShell guarantees `sandbox list` skips `Error` during a known recreate, validated by a real-Docker GPU E2E that observes transient `Error` recovering to `Ready`).
- `src/lib/onboard/docker-gpu-patch.ts`: replace inline reconnect helpers with imports + re-exports from the new module. `recreateOpenShellDockerSandboxWithGpu` now calls `waitForOpenShellSupervisorReconnect` directly. Net file delta vs main: −49 lines.
- New env var `NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE` (clamped ≥ 1, default 5) tunes the debounce window. `DockerGpuSupervisorReconnectDeps.errorPhaseDebouncePolls` lets tests inject a small K without touching env.
- `waitForOpenShellSupervisorReconnect` tracks consecutive Error-phase polls and short-circuits only after `errorPhaseDebouncePolls` consecutive `Error` reads. Counter resets on any non-Error poll so flapping does not accumulate.
- `src/lib/onboard/docker-gpu-patch.test.ts`: existing fast-fail test now asserts the explicit K=1 (no-debounce) behavior so the original intent is preserved when an operator opts out of the debounce. Net file delta vs main: +5 lines.
- New `src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts` covers the debounce state machine: transient `Error` window shorter than K → reconnect succeeds; sustained `Error` for K polls → still fast-fails; flapping phase resets the counter; env override + lower-bound clamp on `getDockerGpuSupervisorReconnectErrorDebouncePolls`.
- `docs/reference/troubleshooting.mdx` and `skills/nemoclaw-user-reference/references/troubleshooting.md`: short troubleshooting note under "Docker GPU patch failed during sandbox create" describing the default debounce and when to tune the env var.
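For readers tuning the env var, the "clamped ≥ 1, default 5" contract from the list above might look roughly like the following. This is a sketch under assumptions: `readErrorDebouncePolls` is a hypothetical stand-in for `getDockerGpuSupervisorReconnectErrorDebouncePolls`, the environment is passed in as a plain record, and falling back to the default for unset or unparsable values is a guess at the behavior rather than something stated in this PR.

```typescript
const ERROR_DEBOUNCE_ENV = "NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE";
const DEFAULT_ERROR_DEBOUNCE_POLLS = 5;

// Hypothetical reader: takes the environment as an argument so it is easy to test.
function readErrorDebouncePolls(
  env: Record<string, string | undefined>,
): number {
  const raw = env[ERROR_DEBOUNCE_ENV];
  // Assumption: unset or blank means "use the default".
  if (raw === undefined || raw.trim() === "") return DEFAULT_ERROR_DEBOUNCE_POLLS;
  const parsed = Number(raw);
  // Assumption: values that do not parse to a finite number also use the default.
  if (!Number.isFinite(parsed)) return DEFAULT_ERROR_DEBOUNCE_POLLS;
  // Lower-bound clamp: a poll count below 1 would make no sense.
  return Math.max(1, Math.floor(parsed));
}
```

Under this sketch an operator would opt out of the debounce by setting the variable to `1`, and values such as `0` or `-3` are raised to `1`.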

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes — ran npx vitest run --project cli src/lib/onboard/docker-gpu-patch.test.ts src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts (54/54 pass) and npm run typecheck:cli (clean). Full npx vitest run --project cli shows 5 pre-existing failures on main HEAD in unrelated files (src/lib/cli/command-registry.test.ts, test/cli.test.ts, test/whatsapp-qr-compact.test.ts). Required runtime validation dispatched: e2e-branch-validation:gpu (run 26819864315) and gpu-repo-local-ollama-openclaw (run 26819868701).
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
npm run docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

New Features
- Supervisor reconnect now debounces transient Error-phase detections to reduce false onboarding failures; debounce count and timeout are configurable (with sensible defaults and minimum clamping), and a no-debounce fast-fail behavior can be asserted via configuration.
Documentation
- Added troubleshooting guidance describing reconnect behavior, default debounce, and how to adjust the debounce window.
Tests
- Expanded tests covering debounce behavior, fast-fail vs. absorb scenarios, counter reset, and env-var handling.


coderabbitai · 2026-06-02T12:05:31Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f4d272c-b3c1-40ac-b58f-9cef1731e511

📥 Commits

Reviewing files that changed from the base of the PR and between 21013fc and 1b70103.

📒 Files selected for processing (3)

src/lib/onboard/docker-gpu-patch.test.ts
src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts
src/lib/onboard/docker-gpu-supervisor-reconnect.ts

💤 Files with no reviewable changes (1)

src/lib/onboard/docker-gpu-patch.test.ts

🚧 Files skipped from review as they are similar to previous changes (2)

src/lib/onboard/docker-gpu-supervisor-reconnect.ts
src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts

📝 Walkthrough

Walkthrough

Adds a bounded supervisor-reconnect wait that inspects OpenShell sandbox phases and requires a configurable number of consecutive terminal Error-phase polls before failing fast. Exposes env/config getters and a deps override, wires the wait into Docker GPU sandbox recreation, adds tests for debounce behavior, and documents the new env toggle.

Changes

Supervisor-reconnect error-phase debounce

| Layer / File(s) | Summary |
| --- | --- |
| Configuration and public API `src/lib/onboard/docker-gpu-supervisor-reconnect.ts`, `src/lib/onboard/docker-gpu-patch.ts` | Adds exported env constants and getters for reconnect timeout and error-phase debounce, re-exports the supervisor-reconnect entrypoint, removes local reconnect timeout constants from docker-gpu-patch, and extends `DockerGpuPatchDeps` with optional `errorPhaseDebouncePolls` to forward into the reconnect wait. |
| Supervisor-reconnect implementation & wiring `src/lib/onboard/docker-gpu-supervisor-reconnect.ts`, `src/lib/onboard/docker-gpu-patch.ts` | Implements ANSI-aware parsing of `openshell sandbox list`, blocking sleep helper, and `waitForOpenShellSupervisorReconnect` which polls `sandbox exec ... -- true` until success, deadline, or a configurable number of consecutive detected terminal Error-phase polls; replaces the former local reconnect short-circuit by calling the new wait from sandbox recreation. |
| Tests and troubleshooting docs `src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts`, `src/lib/onboard/docker-gpu-patch.test.ts`, `docs/reference/troubleshooting.mdx`, `skills/nemoclaw-user-reference/references/troubleshooting.md` | Adds comprehensive Vitest coverage for transient Error absorption, sustained-Error fast-fail, counter reset on recovery, and env/default/clamping behavior; updates an existing fast-fail test to pass `errorPhaseDebouncePolls: 1`; documents the new `NEMOCLAW_DOCKER_GPU_SUPERVISOR_RECONNECT_ERROR_DEBOUNCE` override and its clamping rule. |
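To make "ANSI-aware parsing" concrete, here is one way such a parser could be sketched. Everything about the output layout is an assumption: this PR does not show what `openshell sandbox list` prints, so the sketch simply strips ANSI CSI escape sequences, finds the row whose first whitespace-separated column equals the sandbox name, and looks for a known failure token in the remaining columns. `stripAnsi`, `findFailurePhase`, and the one-element `FAILURE_PHASES` list are hypothetical; the real module may recognize more tokens.

```typescript
// Matches CSI escape sequences such as "\u001b[31m" (colour) or "\u001b[0m" (reset).
const ANSI_CSI = /\u001b\[[0-9;?]*[ -\/]*[@-~]/g;

// Assumption: "Error" is the only failure token shown here.
const FAILURE_PHASES = ["Error"];

function stripAnsi(text: string): string {
  return text.replace(ANSI_CSI, "");
}

// Returns the failure phase reported for `sandboxName`, or null if the sandbox
// is absent from the listing or is not in a failure phase.
function findFailurePhase(listOutput: string, sandboxName: string): string | null {
  for (const line of stripAnsi(listOutput).split(/\r?\n/)) {
    const columns = line.trim().split(/\s+/);
    // Assumption: the sandbox name is the first column of its row.
    if (columns[0] !== sandboxName) continue;
    const hit = columns.slice(1).find((col) => FAILURE_PHASES.indexOf(col) !== -1);
    return hit ?? null;
  }
  return null;
}
```

Stripping colour codes before tokenizing matters because a coloured cell would otherwise compare unequal to the plain phase name.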

Sequence Diagram

sequenceDiagram
  participant Recreate as recreateOpenShellDockerSandboxWithGpu
  participant Wait as waitForOpenShellSupervisorReconnect
  participant Exec as runOpenshell
  participant Capture as runCaptureOpenshell

  Recreate->>Wait: invoke(timeoutSecs, { errorPhaseDebouncePolls })
  loop poll until deadline or success
    Wait->>Exec: sandbox exec ... -- true
    alt Exec fails
      Wait->>Capture: openshell sandbox list
      Capture-->>Wait: sandbox phase (e.g., Error, Provisioning)
      Wait->>Wait: increment/reset consecutive-Error counter
    end
  end
  Wait-->>Recreate: return success|failure
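The loop in the diagram above can be sketched with injected dependencies, which is also what makes it testable without a real clock. This is an illustrative outline only: `ReconnectDeps`, `waitForReconnectSketch`, and the three-way result type are hypothetical, and the real `waitForOpenShellSupervisorReconnect` may differ in signature, return value, and deadline handling.

```typescript
// Hypothetical dependency bundle; each member stands in for a real side effect.
interface ReconnectDeps {
  execTrue: () => boolean; // stands in for `sandbox exec ... -- true`
  readPhase: () => string | null; // stands in for parsing `openshell sandbox list`
  sleep: (ms: number) => void; // blocking sleep, replaceable in tests
  now: () => number; // clock in milliseconds
}

type ReconnectResult = "reconnected" | "error-phase" | "timeout";

function waitForReconnectSketch(
  deps: ReconnectDeps,
  timeoutMs: number,
  pollIntervalMs: number,
  errorPhaseDebouncePolls: number,
): ReconnectResult {
  const deadline = deps.now() + timeoutMs;
  let consecutiveErrors = 0;
  while (true) {
    // Success path: the supervisor answered, so the sandbox is usable.
    if (deps.execTrue()) return "reconnected";
    // Only consult the sandbox list when exec failed, as in the diagram.
    consecutiveErrors = deps.readPhase() === "Error" ? consecutiveErrors + 1 : 0;
    if (consecutiveErrors >= errorPhaseDebouncePolls) return "error-phase";
    // Give up if another sleep would overshoot the deadline.
    if (deps.now() + pollIntervalMs > deadline) return "timeout";
    deps.sleep(pollIntervalMs);
  }
}
```

Because the Error check happens before the sleep, K sustained Error polls cost K − 1 sleeps in this sketch.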

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

nemoclaw onboarding with gpu fails because sandbox transitions to Error phase after Docker GPU patch #4316: Vector-match describing the same supervisor reconnect Error-phase fast-fail behavior this PR mitigates.

Possibly related PRs

NVIDIA/NemoClaw#4407: Earlier PR that modified Error-phase short-circuit behavior in the supervisor reconnect flow; this PR introduces debouncing and rewiring.

Suggested labels

onboarding, Docker, Sandbox

Suggested reviewers

ericksoa

Poem

🐰 I hopped in to watch the reconnect race,
Tiny Errors now take a gentler pace.
We count a few polls before sounding alarm,
So transient glitches won't do much harm.
A patient rabbit cheers the steady calm.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main change: debouncing the Docker GPU patch supervisor reconnect Error-phase short-circuit to prevent false failures. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue `#4664`: implements debouncing of transient Error-phase polls (K=5 by default), preserves fast-fail for sustained errors, provides env-var tuning, and includes comprehensive tests and documentation. |
| Out of Scope Changes check | ✅ Passed | All changes are in scope: new supervisor-reconnect module, refactored patch logic to use it, environment variable configuration, comprehensive test coverage, and troubleshooting documentation directly addressing the issue. |


github-actions · 2026-06-02T12:06:32Z

E2E Advisor Recommendation

Required E2E: gpu-e2e
Optional E2E: gpu-double-onboard-e2e, gpu-repo-local-ollama-openclaw

Dispatch hint: gpu-e2e

Auto-dispatched E2E: gpu-e2e via nightly-e2e.yaml at 1b70103e57963b21ea495b6488bba94f7866e654 — nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

gpu-e2e (high): This PR changes the real Docker GPU patch supervisor reconnect path used by GPU onboarding. The nightly gpu-e2e job runs install/onboard with NEMOCLAW_PROVIDER=ollama on an NVIDIA GPU runner, validates Docker/GPU availability, verifies onboard GPU proofs, and exercises inference through the patched sandbox.

Optional E2E

gpu-double-onboard-e2e (high): Useful adjacent confidence because it performs a second GPU/Ollama onboard on a fresh GPU runner and would exercise the same reconnect behavior during re-onboard, but the core changed path is already covered by gpu-e2e.
gpu-repo-local-ollama-openclaw (high): Typed scenario coverage for repo checkout + local Ollama OpenClaw on a Docker CDI GPU runner. This is complementary to nightly gpu-e2e and can catch scenario-registry/user-flow drift, but is not the primary merge-blocking check for this patch.

New E2E recommendations

docker-gpu-supervisor-reconnect (high): Existing GPU E2Es validate the happy path but do not deterministically force the transient OpenShell sandbox list Error phase that this debounce is designed to absorb. Add a targeted real-Docker/GPU E2E or scenario assertion that recreates the sandbox container, observes transient Error polls, then verifies reconnect and inference succeed before the debounce window expires.
- Suggested test: docker-gpu-supervisor-reconnect-transient-error-e2e

Dispatch hint

Workflow: nightly-e2e.yaml
jobs input: gpu-e2e

github-actions · 2026-06-02T12:06:34Z

E2E Scenario Advisor Recommendation

Required scenario E2E: gpu-repo-local-ollama-openclaw
Optional scenario E2E: None

Dispatch required scenario E2E:

gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

gpu-repo-local-ollama-openclaw: Changes affect the Docker GPU patch supervisor-reconnect path used during GPU sandbox onboarding. This is the only dispatchable scenario routed to a GPU runner with a GPU Docker runtime, so it is required despite using a special runner.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Optional scenario E2E

None.

Relevant changed files

src/lib/onboard/docker-gpu-patch.ts
src/lib/onboard/docker-gpu-supervisor-reconnect.ts

github-actions · 2026-06-02T12:08:30Z

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Since last review: 2 prior items resolved, 1 still applies, 0 new items found

Review findings

🛠️ Needs attention

None.

🔎 Worth checking

Add targeted runtime validation for the Docker/OpenShell reconnect race (src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts:18): The new unit tests cover the debounce state machine with mocked `sandbox exec` and `sandbox list` sequences, but the linked bug is an infrastructure timing race between Docker container recreation, OpenShell supervisor reconnect, and the sandbox-list cache. The changed files do not add or identify runtime/integration coverage proving that a real patched Docker GPU sandbox recovers from a transient Error phase, or that a genuinely crashed patched container still fails fast with diagnostics in the actual runtime path.
- Recommendation: Add or identify targeted runtime/integration validation that recreates a Docker GPU sandbox and observes both paths: a transient Error phase recovers to Ready, and a truly failed patched container still surfaces Error-phase diagnostics without burning the full reconnect timeout. Do not rely only on mocked CLI output for this sandbox lifecycle path.
- Evidence: Deterministic test-depth verdict is `runtime_validation_recommended`. The changed tests simulate `Error, Error, Provisioning, Ready`, sustained Error, counter reset, env override, clamp, and non-finite override cases, but no changed file adds runtime/integration validation for Docker/OpenShell supervisor reconnect.

🌱 Nice ideas

None.


Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.


github-actions · 2026-06-02T12:31:52Z

🌿 Preview your docs: https://nvidia-preview-pr-4668.docs.buildwithfern.com/nemoclaw

github-actions · 2026-06-02T12:33:42Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26819949625
Target ref: c8bc1c44cbbb4c9ce8dca86310e4d375c9b21d7e
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

| Job | Result |
| --- | --- |
| gpu-e2e | ⏭️ skipped |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/docker-gpu-supervisor-reconnect.ts`:
- Around line 111-112: The injected override deps.errorPhaseDebouncePolls must
be clamped to the same minimum as the env-backed path; change the assignment for
errorPhaseDebouncePolls to normalize the injected value (e.g. use Math.max with
minimum 1) so that errorPhaseDebouncePolls = Math.max(1,
deps.errorPhaseDebouncePolls ??
getDockerGpuSupervisorReconnectErrorDebouncePolls()); this ensures both the deps
override and getDockerGpuSupervisorReconnectErrorDebouncePolls() honor the same
minimum contract.
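A sketch of the normalization this comment asks for, extended with the non-finite check that a later commit title in this PR mentions. `resolveDebouncePolls` is a hypothetical helper, and falling back to the env-backed value for a non-finite override is an assumption about how "reject" is implemented. One reason the extra check matters: `Math.max(1, NaN)` evaluates to `NaN`, and a comparison such as `count >= NaN` is always false, so a `NaN` override would silently disable the fast-fail.

```typescript
// Hypothetical helper: picks the injected override when it is usable,
// otherwise the env-backed value, then applies the shared minimum of 1.
function resolveDebouncePolls(
  override: number | undefined,
  envBacked: () => number,
): number {
  const candidate =
    override !== undefined && Number.isFinite(override) ? override : envBacked();
  return Math.max(1, Math.floor(candidate));
}
```

With this shape, a test that injects `0` gets the same floor as an operator who sets the env var to `0`.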


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b87059ca-56cd-4a48-830c-3297761637c2

📥 Commits

Reviewing files that changed from the base of the PR and between 13abdba and c8bc1c4.

📒 Files selected for processing (6)

docs/reference/troubleshooting.mdx
skills/nemoclaw-user-reference/references/troubleshooting.md
src/lib/onboard/docker-gpu-patch.test.ts
src/lib/onboard/docker-gpu-patch.ts
src/lib/onboard/docker-gpu-supervisor-reconnect.test.ts
src/lib/onboard/docker-gpu-supervisor-reconnect.ts


github-actions · 2026-06-02T13:09:53Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26821866927
Target ref: 1b70103e57963b21ea495b6488bba94f7866e654
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

| Job | Result |
| --- | --- |
| gpu-e2e | ⏭️ skipped |

prekshivyas

APPROVE.

The trailing-edge debounce with reset-on-recovery in waitForOpenShellSupervisorReconnect is the right model for #4664 — a transient sandbox list Error during container re-registration is absorbed, while a genuinely crashed container still fast-fails (~8s) instead of burning the full timeout. Tests inject a mocked sleep and the poll source (no real wall-clock dependence) and cover the window boundary, flapping-reset, clamp, and non-finite-override cases with exact poll/sleep counts. The CodeRabbit override-clamp gap is fixed and regression-tested. CI green on 1b70103, thread resolved in head.

Non-blocking cleanup: TERMINAL_SANDBOX_FAILURE_PHASES and parseSandboxListFailurePhase in the new module duplicate SANDBOX_FAILURE_PHASE_TOKENS / parseSandboxPhaseFromListOutput still in docker-gpu-patch.ts — values match today but can silently drift; consider exporting one canonical set. Doc nit: default K=5 is ~8s of sleeps (4×2s), not 10s.
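The arithmetic behind the doc nit can be checked with a few lines. The sketch assumes a 2 s poll interval (taken from the "4×2s" in the comment above), assumes the `exec` and `list` calls themselves take negligible time, and assumes the Error check runs before each sleep so that K polls are separated by K − 1 sleeps. `sustainedErrorFastFailSeconds` is a hypothetical name.

```typescript
// Time spent sleeping before K consecutive Error polls trigger the fast-fail.
// K polls are separated by K - 1 sleeps.
function sustainedErrorFastFailSeconds(
  debouncePolls: number,
  pollIntervalSecs: number,
): number {
  return (Math.max(1, debouncePolls) - 1) * pollIntervalSecs;
}

const defaultWindow = sustainedErrorFastFailSeconds(5, 2); // 8
const noDebounce = sustainedErrorFastFailSeconds(1, 2); // 0
```

So the default K = 5 gives (5 − 1) × 2 s = 8 s of sleeping, and K = 1 fails on the first Error read with no sleep at all.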

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

fix(onboard): debounce Docker GPU patch supervisor reconnect Error-ph…

13abdba

…ase short-circuit Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng added the fix label Jun 2, 2026

refactor(onboard): extract Docker GPU supervisor-reconnect debounce m…

c8bc1c4

…odule + document env Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai Bot reviewed Jun 2, 2026


Comment thread src/lib/onboard/docker-gpu-supervisor-reconnect.ts Outdated

laitingsheng added 2 commits June 2, 2026 12:46

fix(onboard): clamp injected supervisor-reconnect debounce override t…

21013fc

…o minimum 1 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

fix(onboard): reject non-finite supervisor-reconnect debounce overrid…

1b70103

…es + trim EOF Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng added the v0.0.57 Release target label Jun 2, 2026

prekshivyas approved these changes Jun 2, 2026


cv merged commit a2c020d into main Jun 2, 2026
34 of 35 checks passed

cv deleted the fix-gpu-patch-reconnect-debounce branch June 2, 2026 16:07

prekshivyas self-assigned this Jun 2, 2026

wscurran added the bug-fix label (PR fixes a bug or regression) and removed the fix label Jun 3, 2026

prekshivyas mentioned this pull request Jun 3, 2026

[WSL2 x86_64][Sandbox] OpenShell supervisor fails to reconnect to GPU-patched sandbox container; sandbox enters Error phase #4664

Open
