feat: sandbox survival across gateway restarts #1466

Merged
ericksoa merged 5 commits into main from feat/sandbox-survival-recovery-v2
Apr 4, 2026
@ericksoa (Contributor) commented Apr 4, 2026

Summary

  • Bump minimum OpenShell to v0.0.22 — enables sandbox persistence across gateway restarts (deterministic k3s node name + workspace PVC + gateway resume from volume)
  • Auto-recover OpenClaw processes — when the sandbox pod survives a restart but the OpenClaw gateway didn't re-run, nemoclaw connect and nemoclaw status detect it and transparently restart via SSH
  • E2E test — proves the full survival scenario with real NVIDIA inference: onboard → baseline inference → plant marker file → stop gateway → restart gateway → verify sandbox survived → verify marker persisted → verify inference works post-restart (24/24 tests passed)

How it works (joint OpenShell + NemoClaw solution)

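OpenShell v0.0.22 persists the infrastructure layer:

  • Deterministic k3s node name
  • Default workspace PVC at /sandbox
  • Gateway resume from Docker volume state
  • SSH handshake secret persistence

NemoClaw restores the application layer: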

  • Detects "sandbox alive, OpenClaw dead" via HTTP probe (curl localhost:18789)
  • Cleans stale lock/temp files, restarts gateway via SSH
  • Re-establishes dashboard port forward (18789)
  • nemoclaw status shows OpenClaw: running | recovered | not running with guidance
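
The three states in the last bullet can be read as a small decision procedure: probe, attempt recovery only when the probe fails, then probe again before claiming success. The following is a minimal sketch under stated assumptions, not the code in bin/nemoclaw.js: `classifyOpenClaw`, `probe`, and `recover` are hypothetical names that stand in for the HTTP check and the SSH restart.

```javascript
// Hypothetical sketch: `classifyOpenClaw`, `probe`, and `recover` are
// illustrative names, not the functions in bin/nemoclaw.js.
function classifyOpenClaw({ probe, recover }) {
  // Healthy gateway: nothing to do.
  if (probe()) return "running";
  // Gateway is down: try one restart, then confirm with a second probe.
  if (recover() && probe()) return "recovered";
  // Restart failed or the endpoint never answered: caller prints guidance.
  return "not running";
}

// Example: the first probe fails, recovery succeeds, the second probe passes.
const answers = [false, true];
const state = classifyOpenClaw({
  probe: () => answers.shift(),
  recover: () => true,
});
console.log(`OpenClaw: ${state}`); // prints "OpenClaw: recovered"
```

Passing the probe and recovery steps in as callbacks keeps the state logic testable without a live sandbox.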

User experience after this PR

laptop closes → Docker stops → laptop opens
→ Docker auto-restarts container
→ OpenShell gateway resumes, sandbox pod reschedules with workspace intact
→ user runs: nemoclaw my-assistant connect
→ NemoClaw detects OpenClaw not running, auto-restarts, reconnects port forward
→ user is back where they left off

Context

Reported by @SenthilKumar-Ravichandran after testing OpenShell v0.0.22 on Brev VMs — PVC persistence works, but the user-defined ENTRYPOINT (nemoclaw-start) does not re-run after pod restart on some platforms.

Test plan

  • Unit tests pass (826/826 in main working directory)
  • E2E sandbox survival test passes with real NVIDIA inference (24/24)
  • nemoclaw status shows OpenClaw: running when gateway is alive
  • nemoclaw status shows OpenClaw: recovered after auto-restart
  • ShellCheck passes on new E2E test
  • Validate on Brev VM (where ENTRYPOINT doesn't re-run)

Summary by CodeRabbit

  • New Features

    • Adds automatic gateway health monitoring with in-sandbox recovery attempts, re-establishes dashboard port-forwarding, and provides clearer status output with actionable recovery instructions.
  • Tests

    • Adds an end-to-end test validating sandbox persistence and continuity across gateway stop/start cycles, including live inference and marker-file persistence checks.
  • Chores

    • Bumped minimum OpenShell version requirement to 0.0.22.
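
A minimum-version bump from 0.0.7 to 0.0.22 is a case where comparing version strings directly gives the wrong answer, so the check has to compare numeric components. The sketch below illustrates that idea only; the real check lives in scripts/install-openshell.sh, and `versionAtLeast` is a hypothetical helper name.

```javascript
// Hypothetical helper; the actual check is a shell script in
// scripts/install-openshell.sh and may work differently.
function versionAtLeast(installed, minimum) {
  // Strip an optional leading "v" and compare each dotted component as a number.
  const parse = (v) =>
    v.replace(/^v/, "").split(".").map((part) => parseInt(part, 10) || 0);
  const a = parse(installed);
  const b = parse(minimum);
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    if (x !== y) return x > y;
  }
  return true; // equal versions satisfy the minimum
}

console.log(versionAtLeast("0.0.7", "0.0.22"));   // false
console.log(versionAtLeast("v0.0.22", "0.0.22")); // true
console.log("0.0.7" >= "0.0.22");                 // true: plain string comparison is wrong here
```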

ericksoa added 4 commits April 3, 2026 18:01
OpenShell v0.0.22 adds sandbox persistence across gateway restarts:
- Deterministic k3s node name (NVIDIA/OpenShell#739)
- Default workspace PVC at /sandbox (NVIDIA/OpenShell#739)
- Gateway resume from Docker volume state (NVIDIA/OpenShell#488)
- SSH handshake secret persistence (NVIDIA/OpenShell#488)

This unblocks sandbox survival when Docker restarts (e.g., laptop
close/open) — workspace data, SSH keys, and sandbox pods all survive.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
When a sandbox survives a gateway restart (OpenShell >= 0.0.22), the
pod comes back but the OpenClaw gateway process inside it may not
re-run. This adds transparent detection and recovery:

- `nemoclaw <name> connect` auto-detects and restarts the OpenClaw
  gateway via SSH before opening the shell
- `nemoclaw <name> status` shows OpenClaw process health with three
  states: running, recovered, or not running (with guidance)
- Stale lock/temp files are cleaned before restarting
- Dashboard port forward (18789) is re-established after recovery

Detection uses the gateway's HTTP endpoint (curl localhost:18789)
rather than pgrep, since the gateway runs as a separate user with
restricted process visibility.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
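
Because the gateway runs as a separate user, a process listing from the SSH user may not see it, while an HTTP request to the port works regardless of who owns the process. One way to express that probe is sketched below. It is an illustration, not the helper in bin/nemoclaw.js: `buildProbeArgv`, `isGatewayAlive`, `sshConfigPath`, and `host` are hypothetical names, and the curl timeout is an assumed value.

```javascript
const { spawnSync } = require("node:child_process");

// The port comes from the PR text; the 3-second timeout is an assumption.
const GATEWAY_URL = "http://127.0.0.1:18789/";

// Build an ssh argv that runs curl inside the sandbox.
// `sshConfigPath` and `host` are hypothetical parameters.
function buildProbeArgv(sshConfigPath, host) {
  // -s: no progress output, -o /dev/null: discard the body,
  // --max-time: bound the wait. Any HTTP response counts as "alive";
  // add -f to also require a non-error status code.
  const remote = `curl -s -o /dev/null --max-time 3 ${GATEWAY_URL}`;
  return ["ssh", "-F", sshConfigPath, host, remote];
}

// Exit status 0 means curl got an HTTP response. Anything else, including a
// failed spawn (status is null), is treated as "not running".
function isGatewayAlive(sshConfigPath, host, run = spawnSync) {
  const [cmd, ...args] = buildProbeArgv(sshConfigPath, host);
  const result = run(cmd, args, { encoding: "utf8", timeout: 10000 });
  return result.status === 0;
}
```

The `run` parameter defaults to `spawnSync` and exists only so the function can be exercised with a stub instead of a real SSH connection.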
Proves the complete sandbox survival scenario end-to-end with real
NVIDIA inference (no mocks):

1. Onboard sandbox with real NVIDIA Endpoints
2. Verify live inference works (baseline PONG test)
3. Plant a marker file in /sandbox workspace
4. Stop gateway (simulates laptop close)
5. Restart gateway (simulates laptop open)
6. Verify sandbox pod survived and is Ready
7. Verify marker file persisted (workspace PVC works)
8. Verify live inference works after restart

Requires OpenShell >= 0.0.22 and NVIDIA_API_KEY.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
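
Steps 3 and 7 amount to a write-then-verify check on the workspace volume. The sketch below shows the idea against a local temporary directory standing in for /sandbox. It is not the test itself, which is a Bash script that runs these checks inside the sandbox; the unique token and the function names are assumptions added for illustration.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");
const crypto = require("node:crypto");

// Step 3: write a unique token so a stale file from an earlier run cannot pass.
// (The unique token is an assumption; the real test may write fixed content.)
function plantMarker(workspaceDir) {
  const token = crypto.randomUUID();
  fs.writeFileSync(path.join(workspaceDir, ".survival-marker"), token + "\n");
  return token;
}

// Step 7: the file must exist and contain exactly the token planted earlier.
function markerSurvived(workspaceDir, token) {
  try {
    const content = fs.readFileSync(path.join(workspaceDir, ".survival-marker"), "utf8");
    return content.trim() === token;
  } catch {
    return false; // a missing file means the workspace did not persist
  }
}

// Local demonstration: a temp directory stands in for the /sandbox workspace.
const workspace = fs.mkdtempSync(path.join(os.tmpdir(), "sandbox-"));
const token = plantMarker(workspace);
console.log(markerSurvived(workspace, token)); // true
```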
@coderabbitai bot (Contributor) commented Apr 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 42a54f7b-cf95-4096-b455-e7f1692893bc

📥 Commits

Reviewing files that changed from the base of the PR and between 1b877d4 and 4f22004.

📒 Files selected for processing (2)
  • bin/nemoclaw.js
  • test/e2e/test-sandbox-survival.sh
✅ Files skipped from review due to trivial changes (1)
  • bin/nemoclaw.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/test-sandbox-survival.sh

Walkthrough

Adds in-sandbox OpenClaw gateway liveness checks and automated recovery invoked from connect/status flows, raises minimum OpenShell version to 0.0.22, and adds an end-to-end test that verifies sandbox persistence across a gateway restart while exercising real NVIDIA endpoints.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Gateway Health & Recovery (bin/nemoclaw.js) | Adds helpers to fetch sandbox SSH config, run SSH commands, curl http://127.0.0.1:18789/ for gateway liveness, perform in-sandbox recovery (source ~/.bashrc, clean stale lock/temp files, launch nohup openclaw gateway run, verify PID/endpoint), and re-establish host→sandbox port forward. Invokes recovery from sandboxConnect() and reports enhanced status in sandboxStatus(). |
| Version Requirements (scripts/install-openshell.sh) | Bumps minimum OpenShell version check from 0.0.7 to 0.0.22 and updates related comments about sandbox persistence across gateway restarts. |
| End-to-End Persistence Test (test/e2e/test-sandbox-survival.sh) | New comprehensive E2E Bash script that validates sandbox continuity across gateway stop/start: prerequisites, onboarding, baseline inference (expects “PONG”), writes/reads a marker file in /sandbox/.survival-marker, simulates gateway restart, waits for reconnection, rechecks SSH and inference, and then destroys resources. Includes colored reporting and timeout guard support. |

Sequence Diagram

sequenceDiagram
    actor User
    participant NemoClaw as NemoClaw\n(bin/nemoclaw.js)
    participant OpenShell as OpenShell\n(CLI / forward)
    participant Sandbox as Sandbox\n(container / SSH)
    participant Gateway as OpenClaw\n(HTTP :18789)

    User->>NemoClaw: sandboxConnect()
    activate NemoClaw
    NemoClaw->>NemoClaw: Check gateway liveness (curl 127.0.0.1:18789)
    NemoClaw->>OpenShell: Retrieve sandbox SSH config
    activate OpenShell
    OpenShell-->>NemoClaw: SSH config
    deactivate OpenShell

    NemoClaw->>Sandbox: SSH -> curl 127.0.0.1:18789
    activate Sandbox
    alt Gateway unhealthy
        Sandbox-->>NemoClaw: no response / error
        NemoClaw->>Sandbox: SSH -> source ~/.bashrc
        NemoClaw->>Sandbox: SSH -> cleanup stale locks/temp
        NemoClaw->>Sandbox: SSH -> nohup openclaw gateway run ...
        NemoClaw->>Sandbox: SSH -> verify PID and HTTP readiness
        Sandbox-->>NemoClaw: PID & ready
    else Gateway healthy
        Sandbox-->>NemoClaw: HTTP 200 OK
    end
    deactivate Sandbox

    NemoClaw->>OpenShell: Re-establish port forward for 18789
    activate OpenShell
    OpenShell-->>NemoClaw: forward established
    deactivate OpenShell

    NemoClaw-->>User: Connected / status printed
    deactivate NemoClaw

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I brushed the sandbox, stirred the gate awake,
I chased the stale locks, and mended every break,
I hopped through SSH tunnels, nudged the port to sing,
Now data naps in peace — hooray for everything! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 47.37% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: sandbox survival across gateway restarts' accurately summarizes the main objective and primary change of the PR, which is to enable sandbox persistence and transparent recovery when the OpenShell/OpenClaw gateway restarts. |

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 249-273: The recoverSandboxProcesses function can race when called
concurrently (status and connect funnel here); add serialization by acquiring a
per-sandbox recovery lock (e.g., create/obtain a lock file like
/tmp/openclaw-recover-<sandboxName>.lock or use flock) before touching gateway
lock files, or alternatively re-check the gateway process liveness inside the
remote script just after sourcing ~/.bashrc and before running "rm -rf
/tmp/openclaw-*/gateway.*.lock" to ensure another agent didn't start the
gateway; implement the lock acquisition/release around the critical section in
recoverSandboxProcesses (the code that builds and executes the remote script via
executeSandboxCommand) and/or add a liveness check in the script portion (verify
an existing gateway PID/process and exit without deleting locks if found) to
prevent TOCTOU clobbering.
- Around line 271-273: The current check using executeSandboxCommand returning
status 0 and stdout.includes("GATEWAY_PID=") only verifies the process started
briefly; update both occurrences (the block around executeSandboxCommand at the
shown snippet and the similar block at 307-315) to perform an actual HTTP/port
probe against 127.0.0.1:18789 (or the configured gateway address) after
confirming GATEWAY_PID, retrying briefly (e.g., a few attempts with short
delays) until the probe succeeds or timeout, and only then return true; keep
executeSandboxCommand to start the gateway and use its PID string (GATEWAY_PID)
to correlate, but require the successful TCP/HTTP probe before marking recovery
successful.

In `@test/e2e/test-sandbox-survival.sh`:
- Around line 311-317: Replace the brittle grep-based health check that parses
`openshell status` with the production gateway-health predicate: invoke the same
CLI predicate used in production (the `bin/nemoclaw.js` gateway-health check)
instead of grepping for "running|connected|✓"; ensure the test loop calls that
predicate (and respects its success/failure exit code) so the loop only breaks
when the predicate confirms Status: Connected, active gateway is `nemoclaw`, and
the named gateway exists, rather than on any substring match from `openshell
status`.
- Around line 94-97: The registry_has function currently uses grep which matches
substrings/regex; update registry_has(name) to parse the JSON at $REGISTRY and
check for an exact match against .sandboxes[].name (e.g. using jq -e
'.sandboxes[] | select(.name==env.NAME)' or similar) so it returns success only
when a sandbox name exactly equals the provided $name; keep the function
signature registry_has and retain use of $REGISTRY and the local variable
name="$1".
- Around line 30-34: The outer wrapper unconditionally calls `exec timeout -s
TERM "$TIMEOUT_SECONDS" "$0" "$@"` which fails on systems without GNU `timeout`;
change the logic around `NEMOCLAW_E2E_NO_TIMEOUT`, `TIMEOUT_SECONDS`, and the
`exec timeout` call so you first check for the presence of the `timeout` command
(e.g., `command -v timeout`), and only perform the `exec timeout ...` when it
exists; if `timeout` is missing, set `NEMOCLAW_E2E_NO_TIMEOUT` to avoid
recursion and continue without wrapping, ensuring the `TIMEOUT_SECONDS` logic
remains intact and referenced symbols (`NEMOCLAW_E2E_NO_TIMEOUT`,
`TIMEOUT_SECONDS`, and the `exec timeout -s TERM ...` invocation) are updated
accordingly.
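
The two bin/nemoclaw.js findings above can be addressed in the remote script itself: re-check liveness before deleting lock files, and do not exit successfully until the HTTP endpoint answers. The sketch below shows one possible shape. The lock path, the `openclaw gateway run` command, and the `GATEWAY_PID=` marker are taken from the review text; the function names, log path, and retry counts are assumptions, and this is not the code that was merged.

```javascript
// Hypothetical sketch of a guarded recovery script. The log path, the retry
// count, and the sleep interval are assumptions made for illustration.
function buildRecoveryScript(port = 18789) {
  const probe = `curl -s -o /dev/null --max-time 3 http://127.0.0.1:${port}/`;
  return [
    "source ~/.bashrc 2>/dev/null || true",
    "# TOCTOU guard: a concurrent caller may already have restarted the gateway.",
    `if ${probe}; then echo ALREADY_RUNNING; exit 0; fi`,
    "rm -rf /tmp/openclaw-*/gateway.*.lock",
    "nohup openclaw gateway run >/tmp/openclaw-gateway.log 2>&1 &",
    'echo "GATEWAY_PID=$!"',
    "# A PID only shows the process started; wait for the endpoint to answer.",
    `for attempt in 1 2 3 4 5; do ${probe} && exit 0; sleep 2; done`,
    "exit 1",
  ].join("\n");
}

// With the script above, exit status 0 means "the HTTP probe passed",
// not merely "a PID was printed".
function recoverySucceeded(result) {
  return result.status === 0;
}

console.log(buildRecoveryScript());
```

The guard narrows the race window but does not remove it; a per-sandbox lock (for example via flock, as the review suggests) would be needed for strict serialization.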
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 96b5f334-0d8a-4403-ad03-4cbdc6e1bfd5

📥 Commits

Reviewing files that changed from the base of the PR and between c3ee651 and 1b877d4.

📒 Files selected for processing (3)
  • bin/nemoclaw.js
  • scripts/install-openshell.sh
  • test/e2e/test-sandbox-survival.sh

- Add TOCTOU guard: re-check gateway liveness inside the remote
  recovery script before deleting lock files (prevents concurrent
  callers from clobbering each other)
- Verify HTTP probe after recovery: don't declare success until
  curl localhost:18789 confirms the gateway is actually listening
- Guard timeout wrapper: check for timeout/gtimeout before exec'ing
  (macOS doesn't have GNU timeout by default)
- Use exact JSON match in registry_has: parse sandboxes.json with
  python3 instead of grep substring match
- Tighten gateway health check: verify both "Connected" status and
  "nemoclaw" gateway name instead of loose substring match

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
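
The `registry_has` fix replaces a substring match with an exact match on parsed JSON. The commit does this with python3 inside the Bash test; the sketch below expresses the same idea in JavaScript for illustration. The registry shape is an assumption based on the `.sandboxes[].name` path in the review comment, and `registryHas` is a hypothetical name.

```javascript
// Hypothetical sketch: assumes the registry looks like
// {"sandboxes": [{"name": "..."}]} or {"sandboxes": {"<key>": {"name": "..."}}}.
function registryHas(registryJson, name) {
  let registry;
  try {
    registry = JSON.parse(registryJson);
  } catch {
    return false; // unreadable registry: treat as "not registered"
  }
  // Object.values handles both an array and an object of sandbox entries.
  const sandboxes = Object.values(registry?.sandboxes ?? {});
  return sandboxes.some((sandbox) => sandbox?.name === name);
}

const registry = JSON.stringify({ sandboxes: [{ name: "my-assistant-2" }] });
console.log(registry.includes("my-assistant"));       // true: substring false positive
console.log(registryHas(registry, "my-assistant"));   // false: exact match required
console.log(registryHas(registry, "my-assistant-2")); // true
```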
@ericksoa merged commit 2d29a02 into main on Apr 4, 2026
8 checks passed