Skip to content

feat(sandbox): expose Stop/Start RPCs in gateway API + CLI#3

Merged
vessux merged 1 commit into
mainfrom
feat/sandbox-stop-start
May 26, 2026
Merged

feat(sandbox): expose Stop/Start RPCs in gateway API + CLI#3
vessux merged 1 commit into
mainfrom
feat/sandbox-stop-start

Conversation

@vessux
Copy link
Copy Markdown
Owner

@vessux vessux commented May 26, 2026

Summary

  • New StopSandbox / StartSandbox gateway RPCs that halt a sandbox container without destroying its workspace volume, then bring it back live.
  • Backed by the existing driver-layer stop_sandbox (podman/docker/kubernetes/vm) and StartupResume::resume_sandbox (Docker today; this PR adds Podman).
  • openshell sandbox stop|start CLI verbs with --all parity to delete.

Why

Driver-layer Stop and the internal compute_driver.proto RPC are already in place, but the gateway-facing openshell.proto only exposes DeleteSandbox. That path tears down the workspace volume along with the container, so anyone wanting to halt a sandbox for resource reasons has to lose state.

Changes

  • proto/openshell.proto: add StopSandbox + StartSandbox RPCs and Request/Response messages.
  • openshell-server:
    • compute/mod.rs: pub async fn stop_sandbox(name) (forwards to driver) and pub async fn start_sandbox(name) (forwards to StartupResume).
    • grpc/sandbox.rs: handle_stop_sandbox / handle_start_sandbox with empty-name guard and not-found mapping.
    • grpc/mod.rs: trait wires.
    • auth/authz.rs: both RPCs gated on sandbox:write.
  • openshell-driver-podman:
    • driver.rs: new resume_sandbox(sandbox_id, sandbox_name) mirroring the Docker implementation (inspect → start if not running; return Ok(false) when missing).
    • Wired as StartupResume for new_podman so the Start verb works on podman without a gateway restart.
  • openshell-cli:
    • SandboxCommands::Stop { names, all } / SandboxCommands::Start { names } enum variants and match arms.
    • run::sandbox_stop / run::sandbox_start runners with sandbox-name completion and the same forward-cleanup behavior as delete.

Phase is left to the existing watch loop to update as containers transition; no new SandboxPhase variant is required. Restart-from-stopped works mid-session via the new Start verb, and at gateway boot via the existing resume_persisted_sandboxes sweep.

Testing

  • cargo test -p openshell-driver-podman -p openshell-server -p openshell-cli (all green).
  • cargo clippy --tests --no-deps clean.
  • cargo fmt --check clean.

New unit tests:

  • Podman driver: resume_sandbox not-found / already-running / stopped-starts paths.
  • Compute wrappers: stop_sandbox not-found / ok, start_sandbox not-found / unimplemented-without-hook / forwards-to-hook / backend-missing-reports-false.
  • gRPC handlers: empty-name InvalidArgument and missing-sandbox NotFound for both verbs.

The seven mock OpenShell impls in integration tests are updated to satisfy the expanded trait surface.

Checklist

  • Conventional Commits.
  • cargo fmt, cargo clippy --tests, cargo test all clean for affected crates.
  • No proto wire-format breakage: new RPCs only, no field renumbering.
  • No SPDX header changes.

The driver-layer stop_sandbox already exists for every compute backend
(podman, docker, kubernetes, vm) and the internal compute_driver.proto
exposes a StopSandbox RPC. But the gateway-facing openshell.proto only
surfaces DeleteSandbox, which destroys the workspace volume alongside
the container. Operators who want to halt a sandbox for resource reasons
have to delete it and lose state.

This patch adds:

- StopSandbox and StartSandbox RPCs to proto/openshell.proto with
  matching Request/Response messages.
- Compute-runtime wrappers (stop_sandbox, start_sandbox) that look up
  the sandbox by name, forward stop to the driver, and forward start
  to the StartupResume hook so the existing Docker resume path is
  reused.
- A resume_sandbox method on PodmanComputeDriver mirroring the Docker
  implementation, plus a StartupResume impl wired in new_podman so the
  Start verb works on podman too.
- gRPC handlers (handle_stop_sandbox, handle_start_sandbox), trait
  wires, and matching authz entries (both gated on sandbox:write).
- CLI verbs: openshell sandbox stop [--all] and openshell sandbox start
  with completion for sandbox names and the same forward-cleanup
  behavior as delete.

Phase is left to the existing watch loop to update as containers
transition; no new SandboxPhase variant is required. Restart-from-
stopped works mid-session via Start and at gateway boot via the
existing resume_persisted_sandboxes sweep.

Tests:

- Podman driver: resume_sandbox returns Ok(false) when container is
  missing, is a no-op when already running, and issues POST /start
  when stopped.
- Compute wrappers: NotFound for missing sandboxes, Unimplemented when
  no StartupResume hook is configured, forwards to the hook with the
  correct id/name, surfaces the hook's Ok(false) verdict.
- gRPC handlers: empty-name InvalidArgument and missing-sandbox
  NotFound for both Stop and Start.

The seven mock OpenShell impls in integration tests are updated to
satisfy the expanded trait surface.
@vessux vessux merged commit 45f537a into main May 26, 2026
9 of 11 checks passed
@vessux vessux deleted the feat/sandbox-stop-start branch May 26, 2026 09:17
vessux added a commit to vessux/openlock that referenced this pull request May 26, 2026
Fork v0.5.0 ships the `openshell sandbox stop` / `openshell sandbox start`
verbs that this branch's new `stopSandbox`/`startSandbox` wrappers call.
Production installs need the binary tarballs that v0.5.0 publishes;
dev-mode picks up the local `openshell-fork/target/debug` checkout
regardless and ran fine without this bump.

See vessux/OpenShell#3 + release notes for v0.5.0.
vessux added a commit to vessux/openlock that referenced this pull request May 26, 2026
… openlock-27e) (#40)

* fix(sandbox): non-destructive stop + reap; auto-start on reattach (openlock-27e)

Bug: `openlock stop <name>` and the 30-minute idle-reaper both routed
through `deleteSandbox`, which tears down the container, the workspace
volume, and the session-scoped handshake secret in one call. Help text
claimed "preserves state" and "no removal" — both lies. A workspace
that survived a `Ctrl-C` would silently disappear after half an hour
of being idle.

Fix:

- container.ts: add buildSandboxStopArgv / buildSandboxStartArgv and
  stopSandbox / startSandbox wrappers around the new openshell sandbox
  stop / start verbs (vessux/OpenShell#3).
- session-ops.ts: stopSession and reapIdleStaleSessions now call
  stopSandbox instead of deleteSandbox. cleanSession keeps the
  destructive deleteSandbox call (that is the explicit-teardown path).
- session.ts: when reattaching to a session whose container is in
  "exited" state, call startSandbox before waitForSandboxReady. The
  supervisor lives inside the container, so without an explicit start
  the existing wait path would time out.

The existing _descriptions.ts already advertises "preserves state" and
"no removal" — those statements are now finally true.

Other:

- package.json: scope `bun test` to `./src/ ./tests/` so the test
  runner does not pick up the openshell-fork checkout's z3-sys build
  artifacts (a TypeScript test file shipped inside the vendored z3
  source).

The wired-up TS layer talks to the fork's new Stop/Start RPCs via
the openshell CLI. Dev-mode builds (this repo plus a sibling
openshell-fork checkout) pick up the binaries from
openshell-fork/target/debug; production installs will need a fork
release tag bump once the upstream fork PR merges.

* chore(deps): bump OPENSHELL_FORK_TAG to v0.5.0

Fork v0.5.0 ships the `openshell sandbox stop` / `openshell sandbox start`
verbs that this branch's new `stopSandbox`/`startSandbox` wrappers call.
Production installs need the binary tarballs that v0.5.0 publishes;
dev-mode picks up the local `openshell-fork/target/debug` checkout
regardless and ran fine without this bump.

See vessux/OpenShell#3 + release notes for v0.5.0.

* fix(sandbox): name openshell sandboxes without podman prefix

`openshell sandbox <verb>` operates on the gateway-registered sandbox
name (the one passed to `--name` at create), not the underlying podman
container name `openshell-sandbox-<name>`. openlock was prefixing every
name with `SANDBOX_PREFIX` before invoking the binary, so the gateway
always answered NotFound:

  $ openshell sandbox get openshell-sandbox-foo  → NotFound
  $ openshell sandbox get foo                    → Ready

`deleteSandbox` masked this by silently swallowing stderr/exit code
(`{stdout:"ignore", stderr:"ignore"}`), so `openlock clean` reported
success while leaving the podman container orphaned — discovered when
trying to smoke-test PR-#40's auto-start path which depends on
`getSandboxState` returning a non-`missing` value.

Drop the prefix from every callsite that talks to the openshell binary
(session.ts / session-ops.ts / cli/exec / cli/shell), drop the unused
`SANDBOX_PREFIX` constant + file (integration tests hardcode the
container name where they hit podman directly), and surface
non-NotFound delete failures instead of swallowing them.

Also fixes `getSandboxState`:
- `openshell sandbox get` has never accepted `-o json` (only `--policy-only`
  on top of `[name]`), so the JSON.parse path always exit-coded to
  `"missing"`. Drop the flag, parse the human "Phase: X" line instead.
- Add ANSI-stripping + a `parseSandboxGetPhase` unit test.

Without this, PR-#40's `if (state === "exited") await startSandbox(...)`
in `reattachSession` is dead code: `getSandboxState` returns `"missing"`
first and we exit before reaching the start.

* fix(sandbox): map openshell Phase=Error to "exited" (no Stopped variant)

Fork v0.5.0 has no SandboxPhase::Stopped — an explicit `openshell sandbox
stop` transitions phase Ready -> Error because the watch loop treats any
ContainerExited as a terminal failure. Before this, parseSandboxGetPhase
mapped Error to "other"; reattachSession then printed "Attaching to
running" but did not invoke startSandbox, and waitForSandboxReady hung
probing a stopped container.

Mapping Error to "exited" lets reattach trigger startSandbox; genuine
provisioning failures still surface when startSandbox/waitForSandboxReady
fail. Tracked as bd openlock-z9i (fork-side fix: add Stopped variant or
intentional-stop flag).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant