
feat(bootstrap): resume gateway from existing state and persist SSH handshake secret #488

Merged

drew merged 11 commits into main from 487-gateway-resume-ssh-secret/drew on Apr 2, 2026

Conversation

@drew
Collaborator

@drew drew commented Mar 19, 2026

Summary

Add gateway resume from existing Docker volume state and persist the SSH handshake HMAC secret as a Kubernetes Secret, so `openshell gateway start` recovers gracefully after Docker restarts without losing sandboxes or breaking SSH sessions.

Related Issue

Closes #487

Changes

Gateway Resume

  • Add DeployOptions.resume flag with a resume branch in deploy_gateway_with_logs that falls through to idempotent ensure_* calls instead of erroring or destroying
  • gateway_admin_deploy auto-resumes for stopped/volume-only states; already-running returns immediately; --recreate still destroys
  • Auto-bootstrap (sandbox create) tries resume first, falls back to recreate on failure (logged at warn)
  • Add cleanup_gateway_container for volume-preserving cleanup on resume failure
  • Add unless-stopped Docker restart policy so the container auto-restarts on Docker daemon restart
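
The resume decision above can be sketched as a small decision table (a hypothetical, simplified sketch — `GatewayState`, `DeployAction`, and `choose_action` are stand-ins for the real openshell-bootstrap types and the logic inside `gateway_admin_deploy`):

```rust
// Hypothetical sketch of the resume decision described above.
// Real openshell types and signatures will differ.
#[derive(Debug, PartialEq)]
enum GatewayState {
    Running,    // container up: nothing to do
    Stopped,    // container exists but is stopped
    VolumeOnly, // container gone, Docker volume remains (e.g. after Docker restart)
    Fresh,      // no prior state
}

#[derive(Debug, PartialEq)]
enum DeployAction {
    NoOp,     // already running: return immediately
    Resume,   // fall through to idempotent ensure_* calls
    Recreate, // destroy and redeploy from scratch
}

/// Auto-resume for stopped/volume-only states, no-op when already
/// running, and honor --recreate by destroying regardless of state.
fn choose_action(state: &GatewayState, recreate: bool) -> DeployAction {
    if recreate {
        return DeployAction::Recreate;
    }
    match state {
        GatewayState::Running => DeployAction::NoOp,
        GatewayState::Stopped | GatewayState::VolumeOnly => DeployAction::Resume,
        GatewayState::Fresh => DeployAction::Recreate,
    }
}

fn main() {
    assert_eq!(choose_action(&GatewayState::Running, false), DeployAction::NoOp);
    assert_eq!(choose_action(&GatewayState::VolumeOnly, false), DeployAction::Resume);
    assert_eq!(choose_action(&GatewayState::Stopped, true), DeployAction::Recreate);
    println!("resume decision table ok");
}
```

On the auto-bootstrap path, a `Resume` outcome that subsequently fails falls back to `Recreate` (logged at warn), per the bullets above.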

SSH Handshake Secret Persistence

  • Add reconcile_ssh_handshake_secret in bootstrap — checks if K8s secret exists, reuses if present, generates new if missing (same pattern as TLS PKI reconciliation)
  • Update Helm chart StatefulSet to read OPENSHELL_SSH_HANDSHAKE_SECRET via secretKeyRef instead of plain value
  • Remove secret generation and sed injection from cluster-entrypoint.sh
  • Remove sshHandshakeSecret from HelmChart CR values; add sshHandshakeSecretName to values.yaml
  • Update cluster-deploy-fast.sh to create K8s secret directly via kubectl
  • Add SSH handshake secret existence to cluster health check
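
The reuse-or-generate pattern behind `reconcile_ssh_handshake_secret` can be sketched as follows (hypothetical: the in-memory map stands in for the Kubernetes Secret API, and the generator closure for the real secret material):

```rust
// Hypothetical sketch of the reconcile pattern described above; the real
// implementation talks to the Kubernetes API, not an in-memory map.
use std::collections::HashMap;

struct SecretStore {
    secrets: HashMap<String, String>, // name -> value (stand-in for K8s Secrets)
}

impl SecretStore {
    fn get(&self, name: &str) -> Option<String> {
        self.secrets.get(name).cloned()
    }
    fn create(&mut self, name: &str, value: String) {
        self.secrets.insert(name.to_string(), value);
    }
}

/// Reuse the existing secret if present; otherwise generate and persist a
/// new one — the same reconcile pattern the PR borrows from TLS PKI.
fn reconcile_secret(
    store: &mut SecretStore,
    name: &str,
    generate: impl Fn() -> String,
) -> String {
    match store.get(name) {
        Some(existing) => existing, // persisted in etcd: survives restarts
        None => {
            let fresh = generate();
            store.create(name, fresh.clone());
            fresh
        }
    }
}

fn main() {
    let mut store = SecretStore { secrets: HashMap::new() };
    let first = reconcile_secret(&mut store, "ssh-handshake", || "s3cret".to_string());
    // A second reconcile (e.g. after a Docker restart) reuses the same value,
    // so existing sandbox SSH sessions keep verifying.
    let second = reconcile_secret(&mut store, "ssh-handshake", || "other".to_string());
    assert_eq!(first, second);
    println!("secret reconciled: {}", first);
}
```

Because the value now lives in etcd on the Docker volume rather than being regenerated by the entrypoint, the HMAC stays stable across container restarts.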

Testing

  • mise run pre-commit passes (format, lint, license headers)
  • cargo test --package openshell-bootstrap --package openshell-cli — all 163 tests pass
  • E2E tests (mise run e2e) — requires running cluster; these changes affect sandbox lifecycle and should be validated with a running gateway

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@drew drew requested a review from a team as a code owner March 19, 2026 21:59
@drew drew added area:gateway Gateway server and control-plane work area:cluster Related to running OpenShell on k3s/docker labels Mar 19, 2026
@drew drew self-assigned this Mar 19, 2026
@drew drew added the test:e2e Requires end-to-end coverage label Mar 19, 2026
johntmyers previously approved these changes Mar 19, 2026
@ohgeebtw

Thankfully someone already has a PR for this. I came across the exact same problem today after restarting Docker and was lost for a second.

Hopefully it gets merged soon; until then I'll use my locally patched version for further NemoClaw testing 👍

@rossmorey

I'm hitting this exact issue running OpenShell v0.0.19. Every VM stop/start cycle (including nightly backups that stop/restart VMs) breaks sandbox SSH with "handshake verification failed." The only recovery is deleting and recreating the sandbox.

I've worked around it by switching all automation to kubectl exec (gateway startup, health checks, etc.), but IDE access via openshell sandbox connect --editor is completely blocked until the sandbox is recreated. The persistent SSH secret + gateway resume in this PR would fix both issues. Looking forward to this shipping.

drew added 7 commits April 1, 2026 09:45
…andshake secret

Add a resume code path to gateway start so existing Docker volume state
(k3s, etcd, sandboxes, secrets) is reused instead of requiring a full
destroy/recreate cycle. When the container is gone but the volume remains
(e.g. Docker restart), the CLI automatically creates a new container with
the existing volume and reconciles PKI and secrets.

Move the SSH handshake HMAC secret from ephemeral generation in the
cluster entrypoint (regenerated on every container start) to a Kubernetes
Secret that persists in etcd on the Docker volume. This ensures sandbox
SSH sessions survive container restarts.

Key changes:
- Add DeployOptions.resume flag with resume branch in deploy flow
- Add cleanup_gateway_container for volume-preserving failure cleanup
- Auto-resume in gateway_admin_deploy (stopped/volume-only states)
- Auto-bootstrap tries resume first, falls back to recreate
- Add unless-stopped Docker restart policy to gateway container
- Reconcile SSH handshake secret as K8s Secret alongside TLS PKI
- Update Helm chart to read secret via secretKeyRef
- Add SSH handshake secret to cluster health check

Closes #487
On resume after container kill, ensure_network destroys and recreates
the Docker network with a new ID. The stopped container still referenced
the old network ID, causing 'network not found' on start. Fix by
reconciling the container's network attachment in ensure_container.

Also, reconcile_pki was attempting to load K8s secrets before k3s had
booted, failing transiently, and regenerating PKI unnecessarily. This
triggered a server rollout restart causing TLS errors. Fix by waiting
for the openshell namespace before attempting to read existing secrets.

Add gRPC readiness check to gateway_admin_deploy so the CLI waits for
the server to accept connections before declaring the gateway ready.

Add e2e test covering container kill, stale network, sandbox persistence,
and sandbox create after resume.
The wait_for_healthy helper checked for 'healthy', 'running', or '✓'
but openshell status outputs 'Connected'. All five gateway_resume tests
were failing because the health check never matched.
…ternally

The deploy flow now auto-detects whether to resume by checking for
existing gateway state inside deploy_gateway_with_logs. Callers no
longer need to compute and pass a resume flag. The explicit gateway
start path still short-circuits for already-running gateways to avoid
redundant work.
The gateway returns HTTP 412 (Precondition Failed) when the sandbox pod
exists but hasn't reached Ready phase yet. This is a transient state
after allocation. Instead of failing immediately, retry with exponential
backoff (1s to 8s) for up to 60 seconds.
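The retry schedule described above (double the delay each attempt, cap at 8s, stop at the 60s budget) can be sketched like this — a hypothetical helper, not the PR's actual code; a real client would sleep for each delay and re-issue the request while the gateway keeps returning 412:

```rust
// Hypothetical sketch of the exponential-backoff schedule described above:
// 1s -> 2s -> 4s -> 8s, then capped at 8s, within a 60-second budget.
fn backoff_schedule(base_secs: u64, cap_secs: u64, budget_secs: u64) -> Vec<u64> {
    let mut delays = Vec::new();
    let mut delay = base_secs;
    let mut elapsed = 0;
    // Stop scheduling once the next sleep would exceed the total budget.
    while elapsed + delay <= budget_secs {
        delays.push(delay);
        elapsed += delay;
        delay = (delay * 2).min(cap_secs);
    }
    delays
}

fn main() {
    let schedule = backoff_schedule(1, 8, 60);
    assert_eq!(&schedule[..4], &[1, 2, 4, 8]);
    assert!(schedule.iter().sum::<u64>() <= 60);
    println!("{:?}", schedule);
}
```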
- Remove duplicate Duration import and use unqualified Duration in ssh.rs
- Prefix unused default_image parameter with underscore in sandbox/mod.rs
- Make SecretResolver pub to match its use in pub function signature
@drew drew force-pushed the 487-gateway-resume-ssh-secret/drew branch from 146ef3c to e1bea6d on April 1, 2026 19:49
drew added 2 commits April 1, 2026 13:04
…ation

When a gateway is stopped and restarted with a different container image,
ensure_container() removes the old container and creates a new one. The
new container gets a different hostname (Docker default: container ID
prefix), which k3s registers as a new node. Pods on the old node remain
stuck in Terminating until the eviction timeout expires, causing the 30s
health check to fail with 'connection reset by peer'.

Preserve the old container's hostname before removal and set it on the
replacement container so k3s sees the same node identity. For fresh
containers, set the hostname to the container name for a stable default
that survives future recreations.
johntmyers previously approved these changes Apr 1, 2026
@drew drew force-pushed the 487-gateway-resume-ssh-secret/drew branch 6 times, most recently from e21e78f to 8c38234 on April 2, 2026 03:01
@drew drew force-pushed the 487-gateway-resume-ssh-secret/drew branch 11 times, most recently from f967eb8 to 1b881d8 on April 2, 2026 06:30
…deletion

Reverts the hostname preservation approach which caused k3s node password
validation failures. Instead, makes clean_stale_nodes() reliable by:

1. Retrying with 3s backoff (up to ~45s) until kubectl becomes available
   after a container restart, instead of firing once and silently giving up.
2. Force-deleting pods stuck in Terminating on removed stale nodes so
   StatefulSets can immediately reschedule replacements.

This fixes gateway resume failures after stop/start when the container
image has changed (common in development), where the new container gets a
different k3s node identity and pods on the old node never reschedule.
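
The retry loop described in point 1 can be sketched as follows (hypothetical: the probe closure stands in for a real `kubectl` availability check, and sleeping is elided):

```rust
// Hypothetical sketch of the clean_stale_nodes retry loop: poll an
// availability probe with a fixed 3s backoff until it succeeds or the
// ~45s budget is exhausted, instead of firing once and giving up.
fn retry_until_available(
    mut probe: impl FnMut() -> bool,
    backoff_secs: u64,
    budget_secs: u64,
) -> Result<u32, String> {
    let mut attempts = 0;
    let mut elapsed = 0;
    loop {
        attempts += 1;
        if probe() {
            return Ok(attempts);
        }
        elapsed += backoff_secs;
        if elapsed > budget_secs {
            return Err(format!("gave up after {attempts} attempts"));
        }
        // A real implementation would sleep for backoff_secs here.
    }
}

fn main() {
    // Simulate kubectl becoming reachable on the 4th attempt after a restart.
    let mut calls = 0;
    let result = retry_until_available(
        || {
            calls += 1;
            calls >= 4
        },
        3,
        45,
    );
    assert_eq!(result, Ok(4));
    println!("kubectl available after {} attempts", result.unwrap());
}
```

Once the probe succeeds, stale nodes are removed and their Terminating pods force-deleted (point 2) so StatefulSets can reschedule immediately.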
@drew drew force-pushed the 487-gateway-resume-ssh-secret/drew branch from 1b881d8 to 4ba1b61 on April 2, 2026 06:46
@drew drew merged commit e837849 into main Apr 2, 2026
19 of 21 checks passed
@drew drew deleted the 487-gateway-resume-ssh-secret/drew branch April 2, 2026 16:50
ericksoa added a commit to NVIDIA/NemoClaw that referenced this pull request Apr 4, 2026
OpenShell v0.0.22 adds sandbox persistence across gateway restarts:
- Deterministic k3s node name (NVIDIA/OpenShell#739)
- Default workspace PVC at /sandbox (NVIDIA/OpenShell#739)
- Gateway resume from Docker volume state (NVIDIA/OpenShell#488)
- SSH handshake secret persistence (NVIDIA/OpenShell#488)

This unblocks sandbox survival when Docker restarts (e.g., laptop
close/open) — workspace data, SSH keys, and sandbox pods all survive.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
ericksoa added a commit to NVIDIA/NemoClaw that referenced this pull request Apr 4, 2026
## Summary

- **Bump minimum OpenShell to v0.0.22** — enables sandbox persistence
across gateway restarts (deterministic k3s node name + workspace PVC +
gateway resume from volume)
- **Auto-recover OpenClaw processes** — when the sandbox pod survives a
restart but the OpenClaw gateway didn't re-run, `nemoclaw connect` and
`nemoclaw status` detect it and transparently restart via SSH
- **E2E test** — proves the full survival scenario with real NVIDIA
inference: onboard → baseline inference → plant marker file → stop
gateway → restart gateway → verify sandbox survived → verify marker
persisted → verify inference works post-restart (24/24 tests passed)

### How it works (joint OpenShell + NemoClaw solution)

OpenShell v0.0.22 persists the infrastructure layer:
- Gateway resumes from Docker volume state (PR NVIDIA/OpenShell#488)
- SSH handshake secrets survive as K8s Secrets (PR NVIDIA/OpenShell#488)
- Deterministic k3s node name prevents PVC orphaning (PR
NVIDIA/OpenShell#739)
- Default 1Gi workspace PVC at `/sandbox` (PR NVIDIA/OpenShell#739)

NemoClaw restores the application layer:
- Detects "sandbox alive, OpenClaw dead" via HTTP probe (curl
localhost:18789)
- Cleans stale lock/temp files, restarts gateway via SSH
- Re-establishes dashboard port forward (18789)
- `nemoclaw status` shows `OpenClaw: running | recovered | not running`
with guidance

### User experience after this PR

```
laptop closes → Docker stops → laptop opens
→ Docker auto-restarts container
→ OpenShell gateway resumes, sandbox pod reschedules with workspace intact
→ user runs: nemoclaw my-assistant connect
→ NemoClaw detects OpenClaw not running, auto-restarts, reconnects port forward
→ user is back where they left off
```

### Context

Reported by @SenthilKumar-Ravichandran after testing OpenShell v0.0.22
on Brev VMs — PVC persistence works, but the user-defined ENTRYPOINT
(`nemoclaw-start`) does not re-run after pod restart on some platforms.

## Test plan

- [x] Unit tests pass (826/826 in main working directory)
- [x] E2E sandbox survival test passes with real NVIDIA inference
(24/24)
- [x] `nemoclaw status` shows `OpenClaw: running` when gateway is alive
- [x] `nemoclaw status` shows `OpenClaw: recovered` after auto-restart
- [x] ShellCheck passes on new E2E test
- [ ] Validate on Brev VM (where ENTRYPOINT doesn't re-run)


---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>