
feat(bootstrap): resume gateway from existing state and persist SSH handshake secret #488

Merged

drew merged 11 commits into main from 487-gateway-resume-ssh-secret/drew on Apr 2, 2026

Conversation

@drew
Collaborator

@drew drew commented Mar 19, 2026

Summary

Add gateway resume from existing Docker volume state and persist the SSH handshake HMAC secret as a Kubernetes Secret, so `openshell gateway start` recovers gracefully after Docker restarts without losing sandboxes or breaking SSH sessions.

Related Issue

Closes #487

Changes

Gateway Resume

  • Add DeployOptions.resume flag with a resume branch in deploy_gateway_with_logs that falls through to idempotent ensure_* calls instead of erroring or destroying
  • gateway_admin_deploy auto-resumes for stopped/volume-only states; already-running returns immediately; --recreate still destroys
  • Auto-bootstrap (sandbox create) tries resume first, falls back to recreate on failure (logged at warn)
  • Add cleanup_gateway_container for volume-preserving cleanup on resume failure
  • Add unless-stopped Docker restart policy so the container auto-restarts on Docker daemon restart
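
The resume decision above can be sketched as a small decision table (a hypothetical, simplified sketch — `GatewayState`, `DeployAction`, and `choose_action` are stand-ins for the real openshell-bootstrap types and the logic inside `gateway_admin_deploy`):

```rust
// Hypothetical sketch of the resume decision described above.
// Real openshell types and signatures will differ.
#[derive(Debug, PartialEq)]
enum GatewayState {
    Running,    // container up: nothing to do
    Stopped,    // container exists but is stopped
    VolumeOnly, // container gone, Docker volume remains (e.g. after Docker restart)
    Fresh,      // no prior state
}

#[derive(Debug, PartialEq)]
enum DeployAction {
    NoOp,     // already running: return immediately
    Resume,   // fall through to idempotent ensure_* calls
    Recreate, // destroy and redeploy from scratch
}

/// Auto-resume for stopped/volume-only states, no-op when already
/// running, and honor --recreate by destroying regardless of state.
fn choose_action(state: &GatewayState, recreate: bool) -> DeployAction {
    if recreate {
        return DeployAction::Recreate;
    }
    match state {
        GatewayState::Running => DeployAction::NoOp,
        GatewayState::Stopped | GatewayState::VolumeOnly => DeployAction::Resume,
        GatewayState::Fresh => DeployAction::Recreate,
    }
}

fn main() {
    assert_eq!(choose_action(&GatewayState::Running, false), DeployAction::NoOp);
    assert_eq!(choose_action(&GatewayState::VolumeOnly, false), DeployAction::Resume);
    assert_eq!(choose_action(&GatewayState::Stopped, true), DeployAction::Recreate);
    println!("resume decision table ok");
}
```

On the auto-bootstrap path, a `Resume` outcome that subsequently fails falls back to `Recreate` (logged at warn), per the bullets above.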

SSH Handshake Secret Persistence

  • Add reconcile_ssh_handshake_secret in bootstrap — checks if K8s secret exists, reuses if present, generates new if missing (same pattern as TLS PKI reconciliation)
  • Update Helm chart StatefulSet to read OPENSHELL_SSH_HANDSHAKE_SECRET via secretKeyRef instead of plain value
  • Remove secret generation and sed injection from cluster-entrypoint.sh
  • Remove sshHandshakeSecret from HelmChart CR values; add sshHandshakeSecretName to values.yaml
  • Update cluster-deploy-fast.sh to create K8s secret directly via kubectl
  • Add SSH handshake secret existence to cluster health check
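
The reuse-or-generate pattern behind `reconcile_ssh_handshake_secret` can be sketched as follows (hypothetical: the in-memory map stands in for the Kubernetes Secret API, and the generator closure for the real secret material):

```rust
// Hypothetical sketch of the reconcile pattern described above; the real
// implementation talks to the Kubernetes API, not an in-memory map.
use std::collections::HashMap;

struct SecretStore {
    secrets: HashMap<String, String>, // name -> value (stand-in for K8s Secrets)
}

impl SecretStore {
    fn get(&self, name: &str) -> Option<String> {
        self.secrets.get(name).cloned()
    }
    fn create(&mut self, name: &str, value: String) {
        self.secrets.insert(name.to_string(), value);
    }
}

/// Reuse the existing secret if present; otherwise generate and persist a
/// new one — the same reconcile pattern the PR borrows from TLS PKI.
fn reconcile_secret(
    store: &mut SecretStore,
    name: &str,
    generate: impl Fn() -> String,
) -> String {
    match store.get(name) {
        Some(existing) => existing, // persisted in etcd: survives restarts
        None => {
            let fresh = generate();
            store.create(name, fresh.clone());
            fresh
        }
    }
}

fn main() {
    let mut store = SecretStore { secrets: HashMap::new() };
    let first = reconcile_secret(&mut store, "ssh-handshake", || "s3cret".to_string());
    // A second reconcile (e.g. after a Docker restart) reuses the same value,
    // so existing sandbox SSH sessions keep verifying.
    let second = reconcile_secret(&mut store, "ssh-handshake", || "other".to_string());
    assert_eq!(first, second);
    println!("secret reconciled: {}", first);
}
```

Because the value now lives in etcd on the Docker volume rather than being regenerated by the entrypoint, the HMAC stays stable across container restarts.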

Testing

  • mise run pre-commit passes (format, lint, license headers)
  • cargo test --package openshell-bootstrap --package openshell-cli — all 163 tests pass
  • E2E tests (mise run e2e) — requires running cluster; these changes affect sandbox lifecycle and should be validated with a running gateway

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@drew drew requested a review from a team as a code owner March 19, 2026 21:59
@drew drew added area:gateway Gateway server and control-plane work area:cluster Related to running OpenShell on k3s/docker labels Mar 19, 2026
@drew drew self-assigned this Mar 19, 2026
@drew drew added the test:e2e Requires end-to-end coverage label Mar 19, 2026
johntmyers previously approved these changes Mar 19, 2026
@ohgeebtw

Thankfully someone already has a PR for this. I came across the exact same problem today after restarting Docker and was lost for a second.

Hopefully it gets merged soon; until then I'll use my locally patched version for further NemoClaw testing 👍

@rossmorey

I'm hitting this exact issue running OpenShell v0.0.19. Every VM stop/start cycle (including nightly backups that stop/restart VMs) breaks sandbox SSH with "handshake verification failed." The only recovery is deleting and recreating the sandbox.

I've worked around it by switching all automation to kubectl exec (gateway startup, health checks, etc.), but IDE access via openshell sandbox connect --editor is completely blocked until the sandbox is recreated. The persistent SSH secret + gateway resume in this PR would fix both issues. Looking forward to this shipping.

drew added 7 commits April 1, 2026 09:45
…andshake secret

Add a resume code path to gateway start so existing Docker volume state
(k3s, etcd, sandboxes, secrets) is reused instead of requiring a full
destroy/recreate cycle. When the container is gone but the volume remains
(e.g. Docker restart), the CLI automatically creates a new container with
the existing volume and reconciles PKI and secrets.

Move the SSH handshake HMAC secret from ephemeral generation in the
cluster entrypoint (regenerated on every container start) to a Kubernetes
Secret that persists in etcd on the Docker volume. This ensures sandbox
SSH sessions survive container restarts.

Key changes:
- Add DeployOptions.resume flag with resume branch in deploy flow
- Add cleanup_gateway_container for volume-preserving failure cleanup
- Auto-resume in gateway_admin_deploy (stopped/volume-only states)
- Auto-bootstrap tries resume first, falls back to recreate
- Add unless-stopped Docker restart policy to gateway container
- Reconcile SSH handshake secret as K8s Secret alongside TLS PKI
- Update Helm chart to read secret via secretKeyRef
- Add SSH handshake secret to cluster health check

Closes #487
On resume after container kill, ensure_network destroys and recreates
the Docker network with a new ID. The stopped container still referenced
the old network ID, causing 'network not found' on start. Fix by
reconciling the container's network attachment in ensure_container.

Also, reconcile_pki was attempting to load K8s secrets before k3s had
booted, failing transiently, and regenerating PKI unnecessarily. This
triggered a server rollout restart causing TLS errors. Fix by waiting
for the openshell namespace before attempting to read existing secrets.

Add gRPC readiness check to gateway_admin_deploy so the CLI waits for
the server to accept connections before declaring the gateway ready.

Add e2e test covering container kill, stale network, sandbox persistence,
and sandbox create after resume.
The wait_for_healthy helper checked for 'healthy', 'running', or '✓'
but openshell status outputs 'Connected'. All five gateway_resume tests
were failing because the health check never matched.
…ternally

The deploy flow now auto-detects whether to resume by checking for
existing gateway state inside deploy_gateway_with_logs. Callers no
longer need to compute and pass a resume flag. The explicit gateway
start path still short-circuits for already-running gateways to avoid
redundant work.
The gateway returns HTTP 412 (Precondition Failed) when the sandbox pod
exists but hasn't reached Ready phase yet. This is a transient state
after allocation. Instead of failing immediately, retry with exponential
backoff (1s to 8s) for up to 60 seconds.
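The retry schedule described above (double the delay each attempt, cap at 8s, stop at the 60s budget) can be sketched like this — a hypothetical helper, not the PR's actual code; a real client would sleep for each delay and re-issue the request while the gateway keeps returning 412:

```rust
// Hypothetical sketch of the exponential-backoff schedule described above:
// 1s -> 2s -> 4s -> 8s, then capped at 8s, within a 60-second budget.
fn backoff_schedule(base_secs: u64, cap_secs: u64, budget_secs: u64) -> Vec<u64> {
    let mut delays = Vec::new();
    let mut delay = base_secs;
    let mut elapsed = 0;
    // Stop scheduling once the next sleep would exceed the total budget.
    while elapsed + delay <= budget_secs {
        delays.push(delay);
        elapsed += delay;
        delay = (delay * 2).min(cap_secs);
    }
    delays
}

fn main() {
    let schedule = backoff_schedule(1, 8, 60);
    assert_eq!(&schedule[..4], &[1, 2, 4, 8]);
    assert!(schedule.iter().sum::<u64>() <= 60);
    println!("{:?}", schedule);
}
```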
- Remove duplicate Duration import and use unqualified Duration in ssh.rs
- Prefix unused default_image parameter with underscore in sandbox/mod.rs
- Make SecretResolver pub to match its use in pub function signature
@drew drew force-pushed the 487-gateway-resume-ssh-secret/drew branch from 146ef3c to e1bea6d on April 1, 2026 19:49
drew added 2 commits April 1, 2026 13:04
…ation

When a gateway is stopped and restarted with a different container image,
ensure_container() removes the old container and creates a new one. The
new container gets a different hostname (Docker default: container ID
prefix), which k3s registers as a new node. Pods on the old node remain
stuck in Terminating until the eviction timeout expires, causing the 30s
health check to fail with 'connection reset by peer'.

Preserve the old container's hostname before removal and set it on the
replacement container so k3s sees the same node identity. For fresh
containers, set the hostname to the container name for a stable default
that survives future recreations.
johntmyers previously approved these changes Apr 1, 2026
@drew drew force-pushed the 487-gateway-resume-ssh-secret/drew branch 6 times, most recently from e21e78f to 8c38234 on April 2, 2026 03:01
@drew drew force-pushed the 487-gateway-resume-ssh-secret/drew branch 11 times, most recently from f967eb8 to 1b881d8 on April 2, 2026 06:30
…deletion

Reverts the hostname preservation approach which caused k3s node password
validation failures. Instead, makes clean_stale_nodes() reliable by:

1. Retrying with 3s backoff (up to ~45s) until kubectl becomes available
   after a container restart, instead of firing once and silently giving up.
2. Force-deleting pods stuck in Terminating on removed stale nodes so
   StatefulSets can immediately reschedule replacements.

This fixes gateway resume failures after stop/start when the container
image has changed (common in development), where the new container gets a
different k3s node identity and pods on the old node never reschedule.
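
The retry loop described in point 1 can be sketched as follows (hypothetical: the probe closure stands in for a real `kubectl` availability check, and sleeping is elided):

```rust
// Hypothetical sketch of the clean_stale_nodes retry loop: poll an
// availability probe with a fixed 3s backoff until it succeeds or the
// ~45s budget is exhausted, instead of firing once and giving up.
fn retry_until_available(
    mut probe: impl FnMut() -> bool,
    backoff_secs: u64,
    budget_secs: u64,
) -> Result<u32, String> {
    let mut attempts = 0;
    let mut elapsed = 0;
    loop {
        attempts += 1;
        if probe() {
            return Ok(attempts);
        }
        elapsed += backoff_secs;
        if elapsed > budget_secs {
            return Err(format!("gave up after {attempts} attempts"));
        }
        // A real implementation would sleep for backoff_secs here.
    }
}

fn main() {
    // Simulate kubectl becoming reachable on the 4th attempt after a restart.
    let mut calls = 0;
    let result = retry_until_available(
        || {
            calls += 1;
            calls >= 4
        },
        3,
        45,
    );
    assert_eq!(result, Ok(4));
    println!("kubectl available after {} attempts", result.unwrap());
}
```

Once the probe succeeds, stale nodes are removed and their Terminating pods force-deleted (point 2) so StatefulSets can reschedule immediately.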
@drew drew force-pushed the 487-gateway-resume-ssh-secret/drew branch from 1b881d8 to 4ba1b61 on April 2, 2026 06:46
@drew drew merged commit e837849 into main Apr 2, 2026
19 of 21 checks passed
@drew drew deleted the 487-gateway-resume-ssh-secret/drew branch April 2, 2026 16:50
ericksoa added a commit to NVIDIA/NemoClaw that referenced this pull request Apr 4, 2026
OpenShell v0.0.22 adds sandbox persistence across gateway restarts:
- Deterministic k3s node name (NVIDIA/OpenShell#739)
- Default workspace PVC at /sandbox (NVIDIA/OpenShell#739)
- Gateway resume from Docker volume state (NVIDIA/OpenShell#488)
- SSH handshake secret persistence (NVIDIA/OpenShell#488)

This unblocks sandbox survival when Docker restarts (e.g., laptop
close/open) — workspace data, SSH keys, and sandbox pods all survive.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
ericksoa added a commit to NVIDIA/NemoClaw that referenced this pull request Apr 4, 2026
## Summary

- **Bump minimum OpenShell to v0.0.22** — enables sandbox persistence
across gateway restarts (deterministic k3s node name + workspace PVC +
gateway resume from volume)
- **Auto-recover OpenClaw processes** — when the sandbox pod survives a
restart but the OpenClaw gateway didn't re-run, `nemoclaw connect` and
`nemoclaw status` detect it and transparently restart via SSH
- **E2E test** — proves the full survival scenario with real NVIDIA
inference: onboard → baseline inference → plant marker file → stop
gateway → restart gateway → verify sandbox survived → verify marker
persisted → verify inference works post-restart (24/24 tests passed)

### How it works (joint OpenShell + NemoClaw solution)

OpenShell v0.0.22 persists the infrastructure layer:
- Gateway resumes from Docker volume state (PR NVIDIA/OpenShell#488)
- SSH handshake secrets survive as K8s Secrets (PR NVIDIA/OpenShell#488)
- Deterministic k3s node name prevents PVC orphaning (PR
NVIDIA/OpenShell#739)
- Default 1Gi workspace PVC at `/sandbox` (PR NVIDIA/OpenShell#739)

NemoClaw restores the application layer:
- Detects "sandbox alive, OpenClaw dead" via HTTP probe (curl
localhost:18789)
- Cleans stale lock/temp files, restarts gateway via SSH
- Re-establishes dashboard port forward (18789)
- `nemoclaw status` shows `OpenClaw: running | recovered | not running`
with guidance

### User experience after this PR

```
laptop closes → Docker stops → laptop opens
→ Docker auto-restarts container
→ OpenShell gateway resumes, sandbox pod reschedules with workspace intact
→ user runs: nemoclaw my-assistant connect
→ NemoClaw detects OpenClaw not running, auto-restarts, reconnects port forward
→ user is back where they left off
```

### Context

Reported by @SenthilKumar-Ravichandran after testing OpenShell v0.0.22
on Brev VMs — PVC persistence works, but the user-defined ENTRYPOINT
(`nemoclaw-start`) does not re-run after pod restart on some platforms.

## Test plan

- [x] Unit tests pass (826/826 in main working directory)
- [x] E2E sandbox survival test passes with real NVIDIA inference
(24/24)
- [x] `nemoclaw status` shows `OpenClaw: running` when gateway is alive
- [x] `nemoclaw status` shows `OpenClaw: recovered` after auto-restart
- [x] ShellCheck passes on new E2E test
- [ ] Validate on Brev VM (where ENTRYPOINT doesn't re-run)


---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>