feat: sandbox survival across gateway restarts by ericksoa · Pull Request #1466 · NVIDIA/NemoClaw

ericksoa · 2026-04-04T01:05:42Z

Summary

- Bump minimum OpenShell to v0.0.22: enables sandbox persistence across gateway restarts (deterministic k3s node name + workspace PVC + gateway resume from volume)
- Auto-recover OpenClaw processes: when the sandbox pod survives a restart but the OpenClaw gateway didn't re-run, nemoclaw connect and nemoclaw status detect it and transparently restart via SSH
- E2E test: proves the full survival scenario with real NVIDIA inference: onboard → baseline inference → plant marker file → stop gateway → restart gateway → verify sandbox survived → verify marker persisted → verify inference works post-restart (24/24 tests passed)

How it works (joint OpenShell + NemoClaw solution)

OpenShell v0.0.22 persists the infrastructure layer:

- Gateway resumes from Docker volume state (PR feat(bootstrap): resume gateway from existing state and persist SSH handshake secret OpenShell#488)
- SSH handshake secrets survive as K8s Secrets (PR feat(bootstrap): resume gateway from existing state and persist SSH handshake secret OpenShell#488)
- Deterministic k3s node name prevents PVC orphaning (PR fix(bootstrap,server): persist sandbox state across gateway stop/start cycles OpenShell#739)
- Default 1Gi workspace PVC at /sandbox (PR fix(bootstrap,server): persist sandbox state across gateway stop/start cycles OpenShell#739)

NemoClaw restores the application layer:

- Detects "sandbox alive, OpenClaw dead" via HTTP probe (curl localhost:18789)
- Cleans stale lock/temp files, restarts gateway via SSH
- Re-establishes dashboard port forward (18789)
- nemoclaw status shows OpenClaw: running | recovered | not running with guidance
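
The detect-then-recover flow in the list above can be sketched as a small state function. This is a minimal illustration and not the code in `bin/nemoclaw.js`: `classifyOpenClaw`, `probe`, and `recover` are hypothetical names, and both callbacks are injected so the logic runs without a sandbox.

```javascript
// Hypothetical sketch: map probe/recovery outcomes to the three states
// that `nemoclaw status` reports. `probe` and `recover` are injected
// callbacks (assumptions), each returning true on success.
function classifyOpenClaw(probe, recover) {
  if (probe()) {
    return "running"; // gateway answered the HTTP probe
  }
  // Gateway is down: attempt recovery, then probe again before trusting it.
  if (recover() && probe()) {
    return "recovered";
  }
  return "not running";
}

// Example with stubbed callbacks: the first probe fails, recovery succeeds.
let attempts = 0;
const state = classifyOpenClaw(
  () => attempts++ > 0, // false on the first call, true afterwards
  () => true
);
console.log(state); // prints "recovered"
```

Probing a second time after recovery is a design choice of this sketch: it keeps "recovered" from being reported when the process started but never began listening.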

User experience after this PR

laptop closes → Docker stops → laptop opens
→ Docker auto-restarts container
→ OpenShell gateway resumes, sandbox pod reschedules with workspace intact
→ user runs: nemoclaw my-assistant connect
→ NemoClaw detects OpenClaw not running, auto-restarts, reconnects port forward
→ user is back where they left off

Context

Reported by @SenthilKumar-Ravichandran after testing OpenShell v0.0.22 on Brev VMs — PVC persistence works, but the user-defined ENTRYPOINT (nemoclaw-start) does not re-run after pod restart on some platforms.

Test plan

- Unit tests pass (826/826 in main working directory)
- E2E sandbox survival test passes with real NVIDIA inference (24/24)
- nemoclaw status shows OpenClaw: running when gateway is alive
- nemoclaw status shows OpenClaw: recovered after auto-restart
- ShellCheck passes on new E2E test
- Validate on Brev VM (where ENTRYPOINT doesn't re-run)

Summary by CodeRabbit

New Features
- Adds automatic gateway health monitoring with in-sandbox recovery attempts, re-establishes dashboard port-forwarding, and provides clearer status output with actionable recovery instructions.
Tests
- Adds an end-to-end test validating sandbox persistence and continuity across gateway stop/start cycles, including live inference and marker-file persistence checks.
Chores
- Bumped minimum OpenShell version requirement to 0.0.22.
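
A minimum-version gate such as the 0.0.22 bump needs a numeric comparison, not a string comparison. The sketch below illustrates that in JavaScript. The real check lives in `scripts/install-openshell.sh` and may work differently. `versionAtLeast` is a hypothetical helper that assumes purely numeric dotted versions.

```javascript
// Hypothetical helper: compare dotted numeric versions component by component.
// A plain string comparison would wrongly rank "0.0.7" above "0.0.22".
function versionAtLeast(actual, minimum) {
  const a = actual.replace(/^v/, "").split(".").map(Number);
  const b = minimum.replace(/^v/, "").split(".").map(Number);
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    const x = a[i] || 0; // missing or non-numeric components count as 0
    const y = b[i] || 0;
    if (x !== y) return x > y;
  }
  return true; // equal versions satisfy the minimum
}

console.log(versionAtLeast("0.0.7", "0.0.22"));   // false
console.log(versionAtLeast("v0.0.22", "0.0.22")); // true
console.log("0.0.7" >= "0.0.22");                 // true (the string-compare pitfall)
```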

OpenShell v0.0.22 adds sandbox persistence across gateway restarts:

- Deterministic k3s node name (NVIDIA/OpenShell#739)
- Default workspace PVC at /sandbox (NVIDIA/OpenShell#739)
- Gateway resume from Docker volume state (NVIDIA/OpenShell#488)
- SSH handshake secret persistence (NVIDIA/OpenShell#488)

This unblocks sandbox survival when Docker restarts (e.g., laptop close/open): workspace data, SSH keys, and sandbox pods all survive.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

When a sandbox survives a gateway restart (OpenShell >= 0.0.22), the pod comes back but the OpenClaw gateway process inside it may not re-run. This adds transparent detection and recovery:

- `nemoclaw <name> connect` auto-detects and restarts the OpenClaw gateway via SSH before opening the shell
- `nemoclaw <name> status` shows OpenClaw process health with three states: running, recovered, or not running (with guidance)
- Stale lock/temp files are cleaned before restarting
- Dashboard port forward (18789) is re-established after recovery

Detection uses the gateway's HTTP endpoint (curl localhost:18789) rather than pgrep, since the gateway runs as a separate user with restricted process visibility.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
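
The HTTP-based detection described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: `isGatewayAlive` and `runInSandbox` are hypothetical names, and `runInSandbox` stands for whatever executes a shell command inside the sandbox (for example over SSH) and returns `{ status, stdout }`.

```javascript
// Hypothetical sketch of HTTP-based liveness detection. `runInSandbox` is an
// assumed callback that runs a shell command inside the sandbox and returns
// { status, stdout }.
const PROBE_CMD =
  "curl -s -o /dev/null -w '%{http_code}' --max-time 3 http://127.0.0.1:18789/";

function isGatewayAlive(runInSandbox) {
  const { status, stdout } = runInSandbox(PROBE_CMD);
  const code = String(stdout || "").trim();
  // curl prints 000 when it never received an HTTP response.
  // Treating any real HTTP status as "alive" is an assumption of this sketch.
  return status === 0 && /^[1-5]\d\d$/.test(code);
}

// Stubbed runners stand in for a real SSH call.
console.log(isGatewayAlive(() => ({ status: 0, stdout: "200" }))); // true
console.log(isGatewayAlive(() => ({ status: 7, stdout: "000" }))); // false
```

Because the probe only needs network access to the loopback port, it works even when the calling user cannot see the gateway's process entry, which is the limitation of pgrep noted above.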

Proves the complete sandbox survival scenario end-to-end with real NVIDIA inference (no mocks):

1. Onboard sandbox with real NVIDIA Endpoints
2. Verify live inference works (baseline PONG test)
3. Plant a marker file in /sandbox workspace
4. Stop gateway (simulates laptop close)
5. Restart gateway (simulates laptop open)
6. Verify sandbox pod survived and is Ready
7. Verify marker file persisted (workspace PVC works)
8. Verify live inference works after restart

Requires OpenShell >= 0.0.22 and NVIDIA_API_KEY.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai · 2026-04-04T01:05:58Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 42a54f7b-cf95-4096-b455-e7f1692893bc

📥 Commits

Reviewing files that changed from the base of the PR and between 1b877d4 and 4f22004.

📒 Files selected for processing (2)

bin/nemoclaw.js
test/e2e/test-sandbox-survival.sh

✅ Files skipped from review due to trivial changes (1)

bin/nemoclaw.js

🚧 Files skipped from review as they are similar to previous changes (1)

test/e2e/test-sandbox-survival.sh

📝 Walkthrough

Adds in-sandbox OpenClaw gateway liveness checks and automated recovery invoked from connect/status flows, raises minimum OpenShell version to 0.0.22, and adds an end-to-end test that verifies sandbox persistence across a gateway restart while exercising real NVIDIA endpoints.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Gateway Health & Recovery `bin/nemoclaw.js` | Adds helpers to fetch sandbox SSH config, run SSH commands, curl `http://127.0.0.1:18789/` for gateway liveness, perform in-sandbox recovery (source `~/.bashrc`, clean stale lock/temp files, launch `nohup openclaw gateway run`, verify PID/endpoint), and re-establish host→sandbox port forward. Invokes recovery from `sandboxConnect()` and reports enhanced status in `sandboxStatus()`. |
| Version Requirements `scripts/install-openshell.sh` | Bumps minimum OpenShell version check from `0.0.7` to `0.0.22` and updates related comments about sandbox persistence across gateway restarts. |
| End-to-End Persistence Test `test/e2e/test-sandbox-survival.sh` | New comprehensive E2E Bash script that validates sandbox continuity across gateway stop/start: prerequisites, onboarding, baseline inference (expects “PONG”), writes/reads a marker file in `/sandbox/.survival-marker`, simulates gateway restart, waits for reconnection, rechecks SSH and inference, and then destroys resources. Includes colored reporting and timeout guard support. |

Sequence Diagram

sequenceDiagram
    actor User
    participant NemoClaw as NemoClaw\n(bin/nemoclaw.js)
    participant OpenShell as OpenShell\n(CLI / forward)
    participant Sandbox as Sandbox\n(container / SSH)
    participant Gateway as OpenClaw\n(HTTP :18789)

    User->>NemoClaw: sandboxConnect()
    activate NemoClaw
    NemoClaw->>NemoClaw: Check gateway liveness (curl 127.0.0.1:18789)
    NemoClaw->>OpenShell: Retrieve sandbox SSH config
    activate OpenShell
    OpenShell-->>NemoClaw: SSH config
    deactivate OpenShell

    NemoClaw->>Sandbox: SSH -> curl 127.0.0.1:18789
    activate Sandbox
    alt Gateway unhealthy
        Sandbox-->>NemoClaw: no response / error
        NemoClaw->>Sandbox: SSH -> source ~/.bashrc
        NemoClaw->>Sandbox: SSH -> cleanup stale locks/temp
        NemoClaw->>Sandbox: SSH -> nohup openclaw gateway run ...
        NemoClaw->>Sandbox: SSH -> verify PID and HTTP readiness
        Sandbox-->>NemoClaw: PID & ready
    else Gateway healthy
        Sandbox-->>NemoClaw: HTTP 200 OK
    end
    deactivate Sandbox

    NemoClaw->>OpenShell: Re-establish port forward for 18789
    activate OpenShell
    OpenShell-->>NemoClaw: forward established
    deactivate OpenShell

    NemoClaw-->>User: Connected / status printed
    deactivate NemoClaw

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I brushed the sandbox, stirred the gate awake,
I chased the stale locks, and mended every break,
I hopped through SSH tunnels, nudged the port to sing,
Now data naps in peace — hooray for everything! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 47.37% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: sandbox survival across gateway restarts' accurately summarizes the main objective and primary change of the PR, which is to enable sandbox persistence and transparent recovery when the OpenShell/OpenClaw gateway restarts. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 249-273: The recoverSandboxProcesses function can race when called
concurrently (status and connect funnel here); add serialization by acquiring a
per-sandbox recovery lock (e.g., create/obtain a lock file like
/tmp/openclaw-recover-<sandboxName>.lock or use flock) before touching gateway
lock files, or alternatively re-check the gateway process liveness inside the
remote script just after sourcing ~/.bashrc and before running "rm -rf
/tmp/openclaw-*/gateway.*.lock" to ensure another agent didn't start the
gateway; implement the lock acquisition/release around the critical section in
recoverSandboxProcesses (the code that builds and executes the remote script via
executeSandboxCommand) and/or add a liveness check in the script portion (verify
an existing gateway PID/process and exit without deleting locks if found) to
prevent TOCTOU clobbering.
- Around line 271-273: The current check using executeSandboxCommand returning
status 0 and stdout.includes("GATEWAY_PID=") only verifies the process started
briefly; update both occurrences (the block around executeSandboxCommand at the
shown snippet and the similar block at 307-315) to perform an actual HTTP/port
probe against 127.0.0.1:18789 (or the configured gateway address) after
confirming GATEWAY_PID, retrying briefly (e.g., a few attempts with short
delays) until the probe succeeds or timeout, and only then return true; keep
executeSandboxCommand to start the gateway and use its PID string (GATEWAY_PID)
to correlate, but require the successful TCP/HTTP probe before marking recovery
successful.

In `@test/e2e/test-sandbox-survival.sh`:
- Around line 311-317: Replace the brittle grep-based health check that parses
`openshell status` with the production gateway-health predicate: invoke the same
CLI predicate used in production (the `bin/nemoclaw.js` gateway-health check)
instead of grepping for "running|connected|✓"; ensure the test loop calls that
predicate (and respects its success/failure exit code) so the loop only breaks
when the predicate confirms Status: Connected, active gateway is `nemoclaw`, and
the named gateway exists, rather than on any substring match from `openshell
status`.
- Around line 94-97: The registry_has function currently uses grep which matches
substrings/regex; update registry_has(name) to parse the JSON at $REGISTRY and
check for an exact match against .sandboxes[].name (e.g. using jq -e
'.sandboxes[] | select(.name==env.NAME)' or similar) so it returns success only
when a sandbox name exactly equals the provided $name; keep the function
signature registry_has and retain use of $REGISTRY and the local variable
name="$1".
- Around line 30-34: The outer wrapper unconditionally calls `exec timeout -s
TERM "$TIMEOUT_SECONDS" "$0" "$@"` which fails on systems without GNU `timeout`;
change the logic around `NEMOCLAW_E2E_NO_TIMEOUT`, `TIMEOUT_SECONDS`, and the
`exec timeout` call so you first check for the presence of the `timeout` command
(e.g., `command -v timeout`), and only perform the `exec timeout ...` when it
exists; if `timeout` is missing, set `NEMOCLAW_E2E_NO_TIMEOUT` to avoid
recursion and continue without wrapping, ensuring the `TIMEOUT_SECONDS` logic
remains intact and referenced symbols (`NEMOCLAW_E2E_NO_TIMEOUT`,
`TIMEOUT_SECONDS`, and the `exec timeout -s TERM ...` invocation) are updated
accordingly.
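
The second `bin/nemoclaw.js` finding above asks for a short retry loop around the HTTP probe before recovery is reported as successful. A minimal sketch of such a loop follows. `waitForGateway`, `probe`, and `sleep` are hypothetical names, the default attempt count and delay are arbitrary, and the sketch is synchronous only to keep the example small.

```javascript
// Hypothetical sketch: after starting the gateway, require a successful probe
// before reporting recovery. `probe` and `sleep` are injected (assumptions) so
// the retry policy can be exercised without a sandbox.
function waitForGateway(probe, { attempts = 5, delayMs = 1000, sleep = () => {} } = {}) {
  for (let i = 0; i < attempts; i++) {
    if (probe()) return true;
    if (i < attempts - 1) sleep(delayMs); // no sleep after the final attempt
  }
  return false;
}

// Example: the gateway starts answering on the third probe.
let calls = 0;
const ok = waitForGateway(() => ++calls >= 3, { attempts: 5 });
console.log(ok, calls); // true 3
```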

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 96b5f334-0d8a-4403-ad03-4cbdc6e1bfd5

📥 Commits

Reviewing files that changed from the base of the PR and between c3ee651 and 1b877d4.

📒 Files selected for processing (3)

bin/nemoclaw.js
scripts/install-openshell.sh
test/e2e/test-sandbox-survival.sh

- Add TOCTOU guard: re-check gateway liveness inside the remote recovery script before deleting lock files (prevents concurrent callers from clobbering each other)
- Verify HTTP probe after recovery: don't declare success until curl localhost:18789 confirms the gateway is actually listening
- Guard timeout wrapper: check for timeout/gtimeout before exec'ing (macOS doesn't have GNU timeout by default)
- Use exact JSON match in registry_has: parse sandboxes.json with python3 instead of grep substring match
- Tighten gateway health check: verify both "Connected" status and "nemoclaw" gateway name instead of loose substring match

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
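
The TOCTOU guard from this commit can be pictured as a liveness re-check placed inside the remote script, ahead of the lock cleanup. The sketch below assembles such a script as a string. It is an assumption-laden illustration: `buildRecoveryScript`, the log path, and the `ALREADY_RUNNING` marker are hypothetical. The lock glob, the launch command, and the `GATEWAY_PID=` marker are taken from the review discussion.

```javascript
// Hypothetical sketch: build the remote recovery script with a liveness
// re-check placed before the lock cleanup. The log path and the
// ALREADY_RUNNING marker are assumptions of this example.
function buildRecoveryScript(port = 18789) {
  const probe = `curl -s -o /dev/null --max-time 3 http://127.0.0.1:${port}/`;
  return [
    "source ~/.bashrc",
    // TOCTOU guard: another caller may have started the gateway already.
    `if ${probe}; then echo ALREADY_RUNNING; exit 0; fi`,
    "rm -rf /tmp/openclaw-*/gateway.*.lock",
    "nohup openclaw gateway run > /tmp/gateway-recovery.log 2>&1 &",
    'echo "GATEWAY_PID=$!"',
  ].join("\n");
}

const script = buildRecoveryScript();
console.log(script);
```

A re-check like this narrows the race window but does not close it. Two callers can still both pass the probe before either starts the gateway, which is why the review also suggested a per-sandbox lock as an alternative.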

ericksoa added 4 commits April 3, 2026 18:01

coderabbitai bot reviewed Apr 4, 2026

cv approved these changes Apr 4, 2026

ericksoa merged commit 2d29a02 into main Apr 4, 2026
8 checks passed

ericksoa mentioned this pull request Apr 4, 2026

[NemoClaw][All Platform] sandbox disappears after host reboots #1154

Closed

ericksoa merged 5 commits into main from feat/sandbox-survival-recovery-v2
