feat: sandbox survival across gateway restarts #1466
OpenShell v0.0.22 adds sandbox persistence across gateway restarts:

- Deterministic k3s node name (NVIDIA/OpenShell#739)
- Default workspace PVC at /sandbox (NVIDIA/OpenShell#739)
- Gateway resume from Docker volume state (NVIDIA/OpenShell#488)
- SSH handshake secret persistence (NVIDIA/OpenShell#488)

This unblocks sandbox survival when Docker restarts (e.g., laptop close/open): workspace data, SSH keys, and sandbox pods all survive.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
When a sandbox survives a gateway restart (OpenShell >= 0.0.22), the pod comes back but the OpenClaw gateway process inside it may not re-run. This adds transparent detection and recovery:

- `nemoclaw <name> connect` auto-detects and restarts the OpenClaw gateway via SSH before opening the shell
- `nemoclaw <name> status` shows OpenClaw process health with three states: running, recovered, or not running (with guidance)
- Stale lock/temp files are cleaned before restarting
- Dashboard port forward (18789) is re-established after recovery

Detection uses the gateway's HTTP endpoint (curl localhost:18789) rather than pgrep, since the gateway runs as a separate user with restricted process visibility.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
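The HTTP-based detection described in this commit can be sketched as a small shell function. This is a minimal sketch, not the code in `bin/nemoclaw.js`: the function name `gateway_alive` and the two-second timeout are assumptions for illustration; only the use of curl and port 18789 come from the commit message.

```shell
# Hypothetical helper (name and timeout are assumptions): succeeds when
# something answers HTTP on host:port. An HTTP probe does not depend on
# process visibility, unlike pgrep, so it still works when the gateway
# runs as a different user.
gateway_alive() {
  local host="${1:-127.0.0.1}" port="${2:-18789}"
  # --silent hides progress output; --max-time bounds the whole request.
  # Any HTTP response counts as alive; a refused or timed-out connection
  # makes curl exit non-zero.
  curl --silent --output /dev/null --max-time 2 "http://${host}:${port}/"
}
```

A caller would typically branch on the exit code, for example `gateway_alive || echo "OpenClaw: not running"`.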
Proves the complete sandbox survival scenario end-to-end with real NVIDIA inference (no mocks):

1. Onboard sandbox with real NVIDIA Endpoints
2. Verify live inference works (baseline PONG test)
3. Plant a marker file in /sandbox workspace
4. Stop gateway (simulates laptop close)
5. Restart gateway (simulates laptop open)
6. Verify sandbox pod survived and is Ready
7. Verify marker file persisted (workspace PVC works)
8. Verify live inference works after restart

Requires OpenShell >= 0.0.22 and NVIDIA_API_KEY.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds in-sandbox OpenClaw gateway liveness checks and automated recovery invoked from connect/status flows, raises minimum OpenShell version to 0.0.22, and adds an end-to-end test that verifies sandbox persistence across a gateway restart while exercising real NVIDIA endpoints.
Sequence Diagram
sequenceDiagram
actor User
participant NemoClaw as NemoClaw\n(bin/nemoclaw.js)
participant OpenShell as OpenShell\n(CLI / forward)
participant Sandbox as Sandbox\n(container / SSH)
participant Gateway as OpenClaw\n(HTTP :18789)
User->>NemoClaw: sandboxConnect()
activate NemoClaw
NemoClaw->>NemoClaw: Check gateway liveness (curl 127.0.0.1:18789)
NemoClaw->>OpenShell: Retrieve sandbox SSH config
activate OpenShell
OpenShell-->>NemoClaw: SSH config
deactivate OpenShell
NemoClaw->>Sandbox: SSH -> curl 127.0.0.1:18789
activate Sandbox
alt Gateway unhealthy
Sandbox-->>NemoClaw: no response / error
NemoClaw->>Sandbox: SSH -> source ~/.bashrc
NemoClaw->>Sandbox: SSH -> cleanup stale locks/temp
NemoClaw->>Sandbox: SSH -> nohup openclaw gateway run ...
NemoClaw->>Sandbox: SSH -> verify PID and HTTP readiness
Sandbox-->>NemoClaw: PID & ready
else Gateway healthy
Sandbox-->>NemoClaw: HTTP 200 OK
end
deactivate Sandbox
NemoClaw->>OpenShell: Re-establish port forward for 18789
activate OpenShell
OpenShell-->>NemoClaw: forward established
deactivate OpenShell
NemoClaw-->>User: Connected / status printed
deactivate NemoClaw
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/nemoclaw.js`:
- Around line 249-273: The recoverSandboxProcesses function can race when called
concurrently (status and connect funnel here); add serialization by acquiring a
per-sandbox recovery lock (e.g., create/obtain a lock file like
/tmp/openclaw-recover-<sandboxName>.lock or use flock) before touching gateway
lock files, or alternatively re-check the gateway process liveness inside the
remote script just after sourcing ~/.bashrc and before running "rm -rf
/tmp/openclaw-*/gateway.*.lock" to ensure another agent didn't start the
gateway; implement the lock acquisition/release around the critical section in
recoverSandboxProcesses (the code that builds and executes the remote script via
executeSandboxCommand) and/or add a liveness check in the script portion (verify
an existing gateway PID/process and exit without deleting locks if found) to
prevent TOCTOU clobbering.
- Around line 271-273: The current check using executeSandboxCommand returning
status 0 and stdout.includes("GATEWAY_PID=") only verifies the process started
briefly; update both occurrences (the block around executeSandboxCommand at the
shown snippet and the similar block at 307-315) to perform an actual HTTP/port
probe against 127.0.0.1:18789 (or the configured gateway address) after
confirming GATEWAY_PID, retrying briefly (e.g., a few attempts with short
delays) until the probe succeeds or timeout, and only then return true; keep
executeSandboxCommand to start the gateway and use its PID string (GATEWAY_PID)
to correlate, but require the successful TCP/HTTP probe before marking recovery
successful.
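The two `bin/nemoclaw.js` findings above (serialize recovery, then require a real probe before reporting success) can be sketched together in shell. This is a sketch under stated assumptions, not the PR's implementation: `wait_until_ready` and `recover_gateway` are hypothetical names, the probe and start commands are passed in as strings rather than run over SSH, the extra `busy` state is an illustration, and an atomic `mkdir` lock stands in for the lock file or `flock` the review suggests.

```shell
# Retry a command up to ATTEMPTS times, sleeping DELAY seconds between tries.
wait_until_ready() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# recover_gateway LOCK_DIR PROBE_CMD START_CMD   (all hypothetical)
# Prints "running" if the gateway was already up, "recovered" after a
# restart that passed the probe, "not running" if the probe never
# succeeded, or "busy" if another caller holds the lock.
recover_gateway() {
  local lock_dir="$1" probe="$2" start="$3"
  # mkdir is atomic, so only one concurrent caller gets past this line.
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "busy"
    return 2
  fi
  # TOCTOU guard: re-check liveness inside the lock, before any cleanup.
  if sh -c "$probe"; then
    rmdir "$lock_dir"
    echo "running"
    return 0
  fi
  # Stale lock/temp cleanup would go here, only after the re-check failed.
  sh -c "$start"
  # A started process is not enough: require the probe to succeed.
  if wait_until_ready 5 1 sh -c "$probe"; then
    rmdir "$lock_dir"
    echo "recovered"
    return 0
  fi
  rmdir "$lock_dir"
  echo "not running"
  return 1
}
```

In a real setup the probe string would be something like a curl against 127.0.0.1:18789 executed inside the sandbox, and the start string would launch the gateway. Both are left to the caller here so the control flow stays visible.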
In `@test/e2e/test-sandbox-survival.sh`:
- Around line 311-317: Replace the brittle grep-based health check that parses
`openshell status` with the production gateway-health predicate: invoke the same
CLI predicate used in production (the `bin/nemoclaw.js` gateway-health check)
instead of grepping for "running|connected|✓"; ensure the test loop calls that
predicate (and respects its success/failure exit code) so the loop only breaks
when the predicate confirms Status: Connected, active gateway is `nemoclaw`, and
the named gateway exists, rather than on any substring match from `openshell
status`.
- Around line 94-97: The registry_has function currently uses grep which matches
substrings/regex; update registry_has(name) to parse the JSON at $REGISTRY and
check for an exact match against .sandboxes[].name (e.g. using jq -e
'.sandboxes[] | select(.name==env.NAME)' or similar) so it returns success only
when a sandbox name exactly equals the provided $name; keep the function
signature registry_has and retain use of $REGISTRY and the local variable
name="$1".
- Around line 30-34: The outer wrapper unconditionally calls `exec timeout -s
TERM "$TIMEOUT_SECONDS" "$0" "$@"` which fails on systems without GNU `timeout`;
change the logic around `NEMOCLAW_E2E_NO_TIMEOUT`, `TIMEOUT_SECONDS`, and the
`exec timeout` call so you first check for the presence of the `timeout` command
(e.g., `command -v timeout`), and only perform the `exec timeout ...` when it
exists; if `timeout` is missing, set `NEMOCLAW_E2E_NO_TIMEOUT` to avoid
recursion and continue without wrapping, ensuring the `TIMEOUT_SECONDS` logic
remains intact and referenced symbols (`NEMOCLAW_E2E_NO_TIMEOUT`,
`TIMEOUT_SECONDS`, and the `exec timeout -s TERM ...` invocation) are updated
accordingly.
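The timeout-wrapper guard from the last finding can be sketched as follows. The helper names `pick_timeout_cmd` and `maybe_wrap_with_timeout` are hypothetical, and setting `NEMOCLAW_E2E_NO_TIMEOUT` before the `exec` is an assumption about how the script avoids re-wrapping itself. Falling back to `gtimeout` (the name GNU coreutils typically installs under on macOS) is likewise an illustration rather than something the finding requires.

```shell
# Print the first available timeout command, or fail if none is installed.
pick_timeout_cmd() {
  local candidate
  for candidate in timeout gtimeout; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Re-exec the current script under a timeout when one is available.
# Without a timeout command the script simply continues unwrapped.
maybe_wrap_with_timeout() {
  # Already wrapped (or wrapping disabled): nothing to do.
  [ -z "${NEMOCLAW_E2E_NO_TIMEOUT:-}" ] || return 0
  # Set the guard first so the re-exec'd copy does not wrap again.
  export NEMOCLAW_E2E_NO_TIMEOUT=1
  local tcmd
  if tcmd="$(pick_timeout_cmd)"; then
    exec "$tcmd" -s TERM "$TIMEOUT_SECONDS" "$0" "$@"
  fi
}
```

A script would call `maybe_wrap_with_timeout "$@"` near the top, after `TIMEOUT_SECONDS` has been set.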
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 96b5f334-0d8a-4403-ad03-4cbdc6e1bfd5
📒 Files selected for processing (3)
- bin/nemoclaw.js
- scripts/install-openshell.sh
- test/e2e/test-sandbox-survival.sh
- Add TOCTOU guard: re-check gateway liveness inside the remote recovery script before deleting lock files (prevents concurrent callers from clobbering each other)
- Verify HTTP probe after recovery: don't declare success until curl localhost:18789 confirms the gateway is actually listening
- Guard timeout wrapper: check for timeout/gtimeout before exec'ing (macOS doesn't have GNU timeout by default)
- Use exact JSON match in registry_has: parse sandboxes.json with python3 instead of grep substring match
- Tighten gateway health check: verify both "Connected" status and "nemoclaw" gateway name instead of loose substring match

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
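Two of the test-script fixes listed in this commit, the exact-match `registry_has` and the tightened health check, can be sketched as below. Both are illustrations under assumptions, not the committed code. The JSON layout (a top-level `sandboxes` list of objects with a `name` field, or an object keyed by name) is inferred from the review comment. The `Status:` / `Gateway:` line format and the helper name `gateway_healthy` are hypothetical, and real `openshell status` output may differ.

```shell
# registry_has NAME: succeed only if a sandbox is named exactly NAME.
# Assumes $REGISTRY points at the registry JSON file.
registry_has() {
  local name="$1"
  [ -f "$REGISTRY" ] || return 1
  python3 - "$REGISTRY" "$name" <<'PY'
import json
import sys

path, wanted = sys.argv[1], sys.argv[2]
with open(path) as fh:
    sandboxes = json.load(fh).get("sandboxes", [])
# Accept either a list of {"name": ...} entries or an object keyed by name.
if isinstance(sandboxes, dict):
    names = set(sandboxes.keys())
else:
    names = {s.get("name") for s in sandboxes if isinstance(s, dict)}
sys.exit(0 if wanted in names else 1)
PY
}

# gateway_healthy STATUS_TEXT [GATEWAY_NAME]
# Succeeds only when the status field is exactly "Connected" and the
# gateway field is exactly the expected name (assumed line format).
gateway_healthy() {
  local status_text="$1" want="${2:-nemoclaw}" status active
  status="$(printf '%s\n' "$status_text" |
    awk -F': *' '$1 ~ /^ *Status$/ {print $2; exit}')"
  active="$(printf '%s\n' "$status_text" |
    awk -F': *' '$1 ~ /^ *Gateway$/ {print $2; exit}')"
  [ "$status" = "Connected" ] && [ "$active" = "$want" ]
}
```

Comparing whole field values is what rules out the false positives a substring match allows: `Disconnected` contains `connected`, and a sandbox named `demo-long` contains `demo`.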
Summary
- `nemoclaw connect` and `nemoclaw status` detect it and transparently restart via SSH

How it works (joint OpenShell + NemoClaw solution)
OpenShell v0.0.22 persists the infrastructure layer:
- /sandbox (PR fix(bootstrap,server): persist sandbox state across gateway stop/start cycles OpenShell#739)

NemoClaw restores the application layer:
- `nemoclaw status` shows `OpenClaw: running | recovered | not running` with guidance

User experience after this PR
Context
Reported by @SenthilKumar-Ravichandran after testing OpenShell v0.0.22 on Brev VMs: PVC persistence works, but the user-defined ENTRYPOINT (`nemoclaw-start`) does not re-run after pod restart on some platforms.

Test plan
- `nemoclaw status` shows `OpenClaw: running` when gateway is alive
- `nemoclaw status` shows `OpenClaw: recovered` after auto-restart

Summary by CodeRabbit
New Features
Tests
Chores