
fix(onboard): use OpenShell gateway user service#4580

Open
ericksoa wants to merge 9 commits into main from fix/4423-openshell-service-lifecycle-v60

Conversation

@ericksoa
Contributor

@ericksoa ericksoa commented May 31, 2026

Summary

Post-Computex / v0.0.60 follow-up for the gateway lifecycle half of #4423. This is intentionally separate from #4578, which remains the v0.0.56 safety hotfix for non-destructive status behavior.

  • Use OpenShell's package-managed openshell-gateway user service when its vendor/package unit is present, writing Docker-driver gateway env before restart.
  • Ignore per-user/stale gateway unit files so standalone recovery remains available when there is no package-managed OpenShell service.
  • Keep the existing standalone gateway launch path as an explicit compatibility fallback when the upstream service is unavailable.
  • Update docs/tests to describe package-service ownership vs. standalone fallback.
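The ownership rule in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `planGatewayLaunch`, `PACKAGE_UNIT_DIRS`, and the specific directories are assumptions. The only behavior taken from the summary is that a vendor/package unit selects the service path, while per-user unit files are ignored so that standalone recovery stays available.

```typescript
// Hypothetical sketch: names and paths are illustrative, not taken from the PR.
type ExistsSync = (path: string) => boolean;

const UNIT_NAME = "openshell-gateway.service";

// Directories where distro packages conventionally install systemd user units.
// Which directories the PR actually trusts is an assumption here.
const PACKAGE_UNIT_DIRS = ["/usr/lib/systemd/user", "/lib/systemd/user"];

type GatewayLaunchPlan =
  | { mode: "package-service"; unitPath: string }
  | { mode: "standalone" };

function planGatewayLaunch(platform: string, existsSync: ExistsSync): GatewayLaunchPlan {
  // The package-managed service path only applies on Linux.
  if (platform !== "linux") return { mode: "standalone" };
  for (const dir of PACKAGE_UNIT_DIRS) {
    const unitPath = `${dir}/${UNIT_NAME}`;
    if (existsSync(unitPath)) return { mode: "package-service", unitPath };
  }
  // Per-user locations such as ~/.config/systemd/user are never consulted,
  // so a stale per-user unit cannot block the standalone fallback.
  return { mode: "standalone" };
}
```

Injecting `existsSync` keeps the decision testable without touching the real filesystem, which matches the "injectable existsSync" note in the walkthrough.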

Validation

  • npm run build:cli
  • npm run typecheck:cli
  • npm run checks
  • npm run source-shape:check
  • npm run check:installer-hash
  • bash -n scripts/install-openshell.sh
  • bash test/e2e/test-openshell-version-pin.sh
  • npx vitest run src/lib/onboard/docker-driver-gateway-env.test.ts src/lib/onboard/docker-driver-gateway-service.test.ts test/install-openshell-version-check.test.ts test/runner.test.ts test/onboard-gateway-runtime.test.ts test/gateway-state-reconcile-2276.test.ts

Refs #4423
Follow-up to #4578

Summary by CodeRabbit

  • Documentation
    • Clarified Deployment Topology, uninstall/state-dir contents, Apple Silicon sandbox behavior, and environment-variable guidance for the standalone fallback.
  • New Features
    • On Linux, the installer prefers starting a package-managed OpenShell gateway and falls back to the standalone gateway when appropriate.
  • Tests
    • Expanded unit and e2e tests to cover package-managed vs standalone gateway scenarios and realistic checksum-driven installer flows.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa added labels on May 31, 2026: bug (Something fails against expected or documented behavior); Platform: DGX Spark (Support for DGX Spark); priority: high (Important issue that should be resolved in the next release); NV QA (Bugs found by the NVIDIA QA Team); UAT (Issues flagged for User Acceptance Testing); Sandbox (Use this label to identify issues related to the NemoClaw isolated environment based on OpenShell); v0.0.60 (Release target)
@ericksoa ericksoa self-assigned this May 31, 2026
@coderabbitai
Contributor

coderabbitai Bot commented May 31, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: efbf3696-8a99-4b6b-86e1-e583b7859459

📥 Commits

Reviewing files that changed from the base of the PR and between 59b38da and 7523a1e.

📒 Files selected for processing (4)
  • docs/reference/architecture.mdx
  • src/lib/onboard/docker-driver-gateway-service.test.ts
  • src/lib/onboard/docker-driver-gateway-service.ts
  • src/lib/onboard/gateway-tcp-readiness.ts
✅ Files skipped from review due to trivial changes (1)
  • src/lib/onboard/gateway-tcp-readiness.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/reference/architecture.mdx

📝 Walkthrough


Adds Linux systemd user-service startup for the OpenShell Docker-driver gateway, gates Debian env-override writes on service presence, integrates package-managed startup into onboarding, expands unit and e2e tests, and updates architecture/CLI docs and TCP-readiness comments.

Changes

Linux Gateway User Service Feature

| Layer | File(s) | Summary |
|---|---|---|
| Service detection types and unit path helpers | src/lib/onboard/docker-driver-gateway-service.ts | Exports constants/types and helpers to compute systemd user unit paths and check unit-file existence on Linux with injectable existsSync. |
| Service startup systemctl orchestration | src/lib/onboard/docker-driver-gateway-service.ts | Command-availability probing, spawn-sync abstraction, and startOpenShellGatewayUserService() implementing gated systemctl --user daemon-reload → enable → restart with structured result and fallback allowance. |
| Package-managed gateway startup wrapper | src/lib/onboard/docker-driver-gateway-service.ts | startPackageManagedDockerDriverGateway() attempts package-managed startup (or returns false if unit missing), polls for endpoint registration, validates health/TCP readiness, clears runtime files, and verifies sandbox-bridge reachability or exits/throws on failure. |
| Env override conditional write logic | src/lib/onboard/docker-driver-gateway-env.ts | Imports hasOpenShellGatewayUserService, re-exports startPackageManagedDockerDriverGateway, and changes writeDockerGatewayDebEnvOverride() to accept opts and return boolean (false when service absent; true after write + perms). |
| Env override conditional write tests | src/lib/onboard/docker-driver-gateway-env.test.ts | Updates existing test to assert wrote === true when systemd unit present; adds test asserting wrote === false and no gateway.env when only standalone gateway binary exists. |
| Onboard package-managed startup integration | src/lib/onboard.ts | Imports reportDockerDriverGatewayStartFailure; startDockerDriverGateway() computes gatewayEnv, writes DEB env override, delegates to startPackageManagedDockerDriverGateway(), and returns early on success. |
| Service module unit tests | src/lib/onboard/docker-driver-gateway-service.test.ts | Adds helpers and Vitest cases verifying platform detection, systemctl call sequence, fallback allowance on bus errors, non-fallback on restart failures, package-managed orchestration timing, and health-failure behavior. |
| Architecture and CLI documentation updates | docs/reference/architecture.mdx, docs/reference/commands.mdx, src/lib/onboard/gateway-tcp-readiness.ts | Deployment topology documents Linux package-managed service restart with standalone fallback and Apple Silicon Docker sandbox behavior; uninstall state dir contents and env-var descriptions updated for standalone-fallback gateway; TCP-readiness comment clarified. |
| E2E fake asset & checksum generation | test/e2e/test-openshell-version-pin.sh | Fake gh/curl handlers now create fake OpenShell assets, compute SHA-256 digests, and emit correct *-checksums-sha256.txt entries for openshell, gateway, and sandbox archives. |
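For readers unfamiliar with the checksum-driven flow in the last row, here is a small sketch of the idea. The real test is a bash script; this TypeScript version is a stand-in, and `FakeAsset`, `sha256Hex`, and `renderChecksums` are hypothetical names. It also assumes the checksums file uses the two-space `sha256sum` line layout, which this page does not confirm.

```typescript
import { createHash } from "node:crypto";

// Hypothetical asset shape; real release asset names are not shown in this PR.
interface FakeAsset {
  name: string;
  content: string | Uint8Array;
}

// Hex-encoded SHA-256 digest of the asset bytes.
function sha256Hex(content: string | Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

// Renders one "<hex digest>  <file name>" line per asset (assumed sha256sum layout).
function renderChecksums(assets: FakeAsset[]): string {
  return assets.map((a) => `${sha256Hex(a.content)}  ${a.name}`).join("\n") + "\n";
}
```

Deriving the digests from the fake assets themselves, rather than hard-coding them, is what lets the installer's verification step run against real hashes in the hermetic test.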

Sequence Diagrams

sequenceDiagram
  participant Onboard as startDockerDriverGateway
  participant EnvOverride as writeDockerGatewayDebEnvOverride
  participant PkgMgd as startPackageManagedDockerDriverGateway
  participant Service as startOpenShellGatewayUserService
  participant Systemctl as systemctl --user
  participant Gateway as openshell-gateway

  Onboard->>Onboard: Compute gatewayEnv from OpenShell --version
  Onboard->>EnvOverride: Write DEB gateway.env override
  EnvOverride-->>Onboard: wrote boolean

  Onboard->>PkgMgd: startPackageManagedDockerDriverGateway(...)
  PkgMgd->>PkgMgd: Check unit file exists

  alt Unit exists
    PkgMgd->>Service: startOpenShellGatewayUserService()
    Service->>Systemctl: daemon-reload
    Systemctl-->>Service: result
    Service->>Systemctl: enable openshell-gateway
    Systemctl-->>Service: result
    Service->>Systemctl: restart openshell-gateway
    Systemctl-->>Gateway: restart signal
    Gateway-->>Service: started / failure

    alt Service started
      Service-->>PkgMgd: started true
      PkgMgd-->>Onboard: success
    else Service failed
      Service-->>PkgMgd: fallbackAllowed / reason
      PkgMgd-->>Onboard: false or exit/throw
    end
  else Unit not found
    PkgMgd-->>Onboard: false
  end

  Onboard->>Onboard: Standalone gateway fallback path
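
The systemctl portion of the diagram could look something like the sketch below. It is an illustration under assumptions: `Runner`, `StartResult`, and `startGatewayUserService` are hypothetical names, and the PR's actual result shape (including its fallback-allowed flag) is not reproduced here. The sequence daemon-reload → enable → restart and the stop-on-first-failure behavior follow the walkthrough.

```typescript
// Hypothetical types; the PR's real result shape is not shown on this page.
type RunResult = { status: number; stderr: string };
type Runner = (cmd: string, args: string[]) => RunResult;

type StartResult =
  | { started: true }
  | { started: false; failedStep: string; reason: string };

const UNIT = "openshell-gateway";

function startGatewayUserService(run: Runner): StartResult {
  // Steps run in order; the first non-zero exit aborts the sequence.
  const steps: string[][] = [
    ["--user", "daemon-reload"],
    ["--user", "enable", UNIT],
    ["--user", "restart", UNIT],
  ];
  for (const args of steps) {
    const result = run("systemctl", args);
    if (result.status !== 0) {
      return {
        started: false,
        failedStep: args.slice(1).join(" "),
        reason: result.stderr.trim(),
      };
    }
  }
  return { started: true };
}
```

In production the runner would presumably wrap a synchronous process spawn; in tests it can be a recording fake, which is how the call sequence can be asserted without a real user manager.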

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4227: Related changes to sandbox-bridge reachability probe classification used by the package-managed startup flow.

Suggested labels

OpenShell, Docker, NemoClaw CLI

Suggested reviewers

  • jyaunches
  • cv

Poem

🐇 I hopped through systemd gates at dawn,
Wrote envs, checked units, then moved on.
Docker beds warmed, the gateway spun,
Fallback kept ready till checks were done.
Carrots for checksums — neat and bouncy!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.70% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title precisely describes the main change: implementing package-managed OpenShell gateway user service integration, which is the primary objective across multiple files in the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions Bot commented May 31, 2026

E2E Advisor Recommendation

Required E2E: cloud-e2e, openshell-gateway-upgrade-e2e, gateway-health-honest-e2e, openshell-version-pin-e2e
Optional E2E: sandbox-survival-e2e, cloud-onboard-e2e

Auto-dispatched E2E: cloud-e2e, openshell-gateway-upgrade-e2e via nightly-e2e.yaml at 7523a1ebf586a3749bd3b091f2d22c64add9b021 (nightly run)

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • cloud-e2e (high; live NVIDIA inference and Docker sandbox build): Runs the full install → onboard → sandbox verification → live inference user journey from the PR ref. This is the broadest required guard for changes in onboarding and Docker-driver gateway startup/fallback.
  • openshell-gateway-upgrade-e2e (high; real install/upgrade and sandbox/gateway lifecycle): Validates OpenShell gateway upgrade/compatibility behavior with older gateway installs and current supported OpenShell. The PR changes gateway ownership handoff, env override/state handling, and fallback semantics in exactly this area.
  • gateway-health-honest-e2e (medium; hermetic gateway sabotage without live inference): Covers the standalone Docker-driver gateway startup health boundary and ensures NemoClaw does not declare a crashed gateway healthy. The PR changes startDockerDriverGateway ordering, service fallback, and readiness semantics, so this regression should block merge.
  • openshell-version-pin-e2e (low; hermetic installer/version-pin regression): The PR edits this E2E script and touches OpenShell deployment compatibility. Run it to ensure the modified checksum/fake-asset path still validates pinned OpenShell reinstall behavior.

Optional E2E

  • sandbox-survival-e2e (high; live sandbox and inference): Useful adjacent confidence for gateway restart/recovery and sandbox survival across gateway lifecycle disruptions. Recommended if runner budget allows, but less directly targeted than the required upgrade and health-honesty guards.
  • cloud-onboard-e2e (high; live NVIDIA inference and Docker sandbox build): Additional live onboarding/security-boundary coverage with custom policy presets and inference.local checks. Helpful because gateway env/state changes can affect provider routing and sandbox health, but cloud-e2e already covers the main user journey.

New E2E recommendations

  • package-managed-openshell-gateway-service (high): Existing E2Es appear to cover standalone Docker-driver gateway startup, upgrade, and version pinning, but not a real Linux host with an installed package/vendor openshell-gateway systemd user unit. This PR explicitly adds that handoff and docs note direct nightly coverage is still missing.
    • Suggested test: Add a Linux E2E that installs or hermetically stages a package-managed openshell-gateway user service, runs nemoclaw onboard, verifies ~/.config/openshell/gateway.env is written with Docker-driver env, verifies systemctl --user enable/restart is used only for trusted package unit paths and openshell-gateway ExecStart, waits for endpoint/gRPC/bridge readiness, and verifies standalone PID/runtime breadcrumbs are cleared only after service health succeeds.
  • package-service-fallback-negative-paths (medium): Unit tests cover user-manager/bus outage and untrusted per-user units, but a workflow-level regression would protect real CLI behavior and logs when systemctl --user is unavailable or a stale per-user unit exists.
    • Suggested test: Add an E2E regression script that shims systemctl and service unit paths to prove NemoClaw falls back to the standalone gateway for bus outages/untrusted units, but fails closed for trusted service restart failures without silently starting a second gateway.
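
The fallback rule that this suggested regression would protect can be sketched as a small classifier. This is a hedged illustration: `fallbackAllowed` and `BUS_OUTAGE_PATTERNS` are hypothetical names, and the message fragment is an assumption about what `systemctl --user` prints when it cannot reach the user bus. The intent is that only a clear user-manager or bus outage permits the standalone fallback, and everything else fails closed.

```typescript
// Assumed message fragments for a missing user bus; not taken from the PR.
const BUS_OUTAGE_PATTERNS: RegExp[] = [/Failed to connect to bus/i];

// Returns true only for a clear user-manager/bus outage.
// Any other failure of an installed package unit should fail closed.
function fallbackAllowed(stderr: string): boolean {
  return BUS_OUTAGE_PATTERNS.some((pattern) => pattern.test(stderr));
}
```

Note that the classifier keys on the bus-connection prefix rather than a bare `No such file or directory`, so a unit whose executable is missing would not be mistaken for a bus outage.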

@github-actions
Contributor

github-actions Bot commented May 31, 2026

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw
Optional scenario E2E: wsl-repo-cloud-openclaw, gpu-repo-local-ollama-openclaw

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • ubuntu-repo-cloud-openclaw: Linux Docker-driver onboarding/gateway startup code changed, including package-managed OpenShell gateway service handoff and standalone fallback. The Ubuntu repo cloud OpenClaw scenario is the smallest routed scenario that exercises standard Linux repo onboarding, gateway health, sandbox creation, and baseline OpenShell route readiness.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional scenario E2E

  • wsl-repo-cloud-openclaw: Optional special-runner coverage for the same Linux Docker-driver onboarding surface under WSL, where systemd/user-service availability and fallback behavior can differ from native Ubuntu.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw
  • gpu-repo-local-ollama-openclaw: Optional special-runner coverage for Docker-driver gateway startup on a GPU/local-Ollama onboarding path; useful if the package-managed gateway handoff may interact with local inference or GPU sandbox setup, but not the primary changed surface.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Relevant changed files

  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-env.ts
  • src/lib/onboard/docker-driver-gateway-service.ts
  • src/lib/onboard/gateway-tcp-readiness.ts

@github-actions
Contributor

github-actions Bot commented May 31, 2026

PR Review Advisor

Findings: 0 needs attention, 4 worth checking, 0 nice ideas
Since last review: 2 prior items resolved, 2 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Source-of-truth review needed: Linux Docker-driver package-managed gateway service with standalone fallback: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: The broad source-of-truth questions are mostly answered in docs and tests, but follow-up remains because env-file mutation is based on fixed unit-path existence before active identity validation, and fallback classification may allow `No such file or directory` package-service failures to proceed to standalone fallback.
  • Tighten package-managed gateway identity before handoff (src/lib/onboard/docker-driver-gateway-service.ts:137): The new handoff validates the active user unit's FragmentPath and requires ExecStart to contain the token `openshell-gateway`, which is a substantial improvement over fixed path existence plus TCP readiness. However, this gateway is a credential-boundary service and the local Docker-driver endpoint runs with gateway auth disabled. A token match does not prove that the executed binary is the expected package-owned OpenShell gateway, and `startDockerDriverGateway()` writes `~/.config/openshell/gateway.env` based on fixed service-path existence before the stronger `systemctl --user show` identity check runs.
    • Recommendation: Strengthen the trust proof before the package-managed path becomes authoritative: validate `ExecStart` against expected absolute package-owned paths or an installer-owned marker/package ownership check, add a negative test for wrapper or `/tmp/*openshell-gateway*` ExecStart values, and consider deferring the env-file write until after the active unit identity is validated.
    • Evidence: `isTrustedOpenShellGatewayUserServiceIdentity()` accepts a trusted FragmentPath plus `/\bopenshell-gateway\b/` in ExecStart; `src/lib/onboard.ts` calls `writeDockerGatewayDebEnvOverride(() => gatewayEnv)` before `startPackageManagedDockerDriverGateway(...)` performs `systemctl --user show` validation.
  • Narrow fallback classification for package-service failures (src/lib/onboard/docker-driver-gateway-service.ts:93): The compatibility fallback is intentional, but the unavailable-user-manager classifier includes a broad `No such file or directory` match. That can make installed-but-broken package service failures look like acceptable fallback cases, for example a unit whose ExecStart binary is missing or a malformed service setup. This can mask source-of-truth regressions in the package-managed service path.
    • Recommendation: Classify only clear user-manager or bus outages as fallback-allowed, and fail closed for an installed package unit that cannot restart because its service definition or executable is broken. Add a regression test for a missing ExecStart binary or restart failure that returns `No such file or directory` but should not silently fall back.
    • Evidence: `userManagerLooksUnavailable()` matches `/No such file or directory/i`; fallback-allowed failures cause `startPackageManagedDockerDriverGateway()` to warn and return `false`, after which `startDockerDriverGateway()` continues to the standalone launch path.
  • Add runtime validation for the real package-managed gateway handoff (src/lib/onboard/docker-driver-gateway-service.ts:252): The unit tests now cover the package-managed handoff directly and close the prior direct-test gap, but the changed behavior depends on real `systemctl --user`, distro-specific user unit contents, OpenShell's HTTP/2 gRPC health endpoint, and Docker bridge reachability. Mocked unit tests cannot prove those integration boundaries behave together on target hosts.
    • Recommendation: Add or identify a targeted runtime/integration validation that installs or simulates the real OpenShell package-managed user service, writes the Docker-driver env, restarts through `systemctl --user`, verifies the gRPC health probe, and confirms fallback behavior for absent user-manager and stale per-user units.
    • Evidence: `startPackageManagedDockerDriverGateway()` now injects mocks for service start, endpoint registration, CLI metadata, readiness, sleep, and bridge verification; deterministic test-depth analysis marked runtime validation recommended for the changed onboard/host glue surfaces.
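
One way to read the "validate `ExecStart` against expected absolute package-owned paths" recommendation is sketched below. Everything here is an assumption for illustration: the allowlisted path, the helper names, and the `path=` field layout of the `systemctl show` value are not confirmed by the PR. The sketch only shows how an exact-path check differs from a token match.

```typescript
// Assumed allowlist; the real package install path is not stated in this PR.
const TRUSTED_EXEC_PATHS = new Set<string>(["/usr/bin/openshell-gateway"]);

// Extracts the executable from a value shaped like
// "{ path=/usr/bin/foo ; argv[]=/usr/bin/foo --flag ; ... }" (assumed layout).
function execPathFromShowValue(execStart: string): string | null {
  const match = /(?:^|[\s{;])path=(\S+)/.exec(execStart);
  return match?.[1] ?? null;
}

// Exact absolute-path comparison instead of a substring/token match.
function isTrustedExecStart(execStart: string): boolean {
  const path = execPathFromShowValue(execStart);
  return path !== null && TRUSTED_EXEC_PATHS.has(path);
}
```

With this shape, a wrapper script or a `/tmp` binary whose name or arguments merely contain `openshell-gateway` is rejected, which is the negative case the advisor asks to be tested.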

🌱 Nice ideas

  • None.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

ericksoa added 2 commits May 31, 2026 08:44
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26717014158
Target ref: 04b7523a4b3e0692374bc8229b4eab11be5ee0df
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e,sandbox-survival-e2e,runtime-overrides-e2e
Summary: 0 passed, 3 failed, 0 skipped

Job Result
cloud-onboard-e2e ❌ failure
openshell-gateway-upgrade-e2e ⚠️ cancelled
runtime-overrides-e2e ❌ failure
sandbox-survival-e2e ❌ failure

Failed jobs: cloud-onboard-e2e, runtime-overrides-e2e, sandbox-survival-e2e. Check run artifacts for logs.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2414-2471: The helper function
tryStartPackageManagedDockerDriverGateway should be moved out of
src/lib/onboard.ts into a new sibling module (e.g.,
src/lib/onboard-package-managed.ts) so the big block no longer inflates
onboard.ts; keep startDockerDriverGateway() in onboard.ts as the coordinator and
have it import and call the relocated tryStartPackageManagedDockerDriverGateway.
Ensure you preserve the function signature and all referenced symbols
(dockerDriverGatewayService, clearDockerDriverGatewayRuntimeFiles,
registerDockerDriverGatewayEndpoint, runCaptureOpenshell, isGatewayHealthy,
isGatewayTcpReady, verifySandboxBridgeGatewayReachableOrExit, envInt,
sleepSeconds, GATEWAY_NAME) and their imports/exports so behavior is unchanged,
update module imports in onboard.ts, and export the helper from the new module
for testing/consumption.
- Around line 2443-2461: The call to clearDockerDriverGatewayRuntimeFiles
currently runs before the health-poll loop and removes runtime PID/marker files
prematurely; move that cleanup so it only runs after the gateway is confirmed
healthy (i.e., after isGatewayHealthy(status, namedInfo, currentInfo) && await
isGatewayTcpReady() succeeds) — update the code so
clearDockerDriverGatewayRuntimeFiles is invoked after the successful health
checks (for functions/variables involved: clearDockerDriverGatewayRuntimeFiles,
registerDockerDriverGatewayEndpoint, isGatewayHealthy, isGatewayTcpReady,
verifySandboxBridgeGatewayReachableOrExit) so recovery/fallback logic retains
runtime breadcrumbs until the service is truly up.
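
The ordering requested in the second inline comment can be sketched as follows. This is a simplified illustration, not the PR's code: `HandoffDeps` and `waitForGatewayThenCleanup` are hypothetical names, and the probes are synchronous here for brevity even though the real readiness check is awaited. The point is only that runtime files are cleared after a successful health check and never on the failure path.

```typescript
// Hypothetical dependency bundle so the loop can be exercised with fakes.
interface HandoffDeps {
  isHealthy: () => boolean;
  clearRuntimeFiles: () => void;
  sleep: (seconds: number) => void;
}

// Polls until healthy; cleanup runs only once health is confirmed.
function waitForGatewayThenCleanup(
  deps: HandoffDeps,
  maxAttempts: number,
  delaySeconds: number,
): boolean {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (deps.isHealthy()) {
      deps.clearRuntimeFiles();
      return true;
    }
    // No sleep after the final attempt; the caller decides what happens next.
    if (attempt < maxAttempts) deps.sleep(delaySeconds);
  }
  // PID/marker files are left in place for recovery or fallback logic.
  return false;
}
```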

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ad5de779-9431-49cf-972d-2394b910a053

📥 Commits

Reviewing files that changed from the base of the PR and between 9641ce0 and 04b7523.

📒 Files selected for processing (9)
  • docs/reference/architecture.mdx
  • docs/reference/commands.mdx
  • scripts/install-openshell.sh
  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-env.test.ts
  • src/lib/onboard/docker-driver-gateway-env.ts
  • src/lib/onboard/docker-driver-gateway-service.test.ts
  • src/lib/onboard/docker-driver-gateway-service.ts
  • test/install-openshell-version-check.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard.ts Outdated
@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26717115183
Target ref: 90a5959cb9b39e7058ef96abd1468cbd46bd8b83
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26717180037
Target ref: 795b3c376c96cd021ac9091c6160b37ab9ae2f51
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e,sandbox-survival-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled
sandbox-survival-e2e ⚠️ cancelled

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

2424-2426: Please run one service-path E2E before merge.

This branch now short-circuits the legacy startup flow when the package-managed gateway takes ownership, so openshell-gateway-upgrade-e2e plus one happy-path onboard flow such as cloud-e2e or sandbox-operations-e2e would give good coverage of the new handoff.

As per coding guidelines, src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 2424 - 2426, Run a service-path E2E before
merging to validate the new short-circuit in src/lib/onboard.ts: execute the
openshell-gateway-upgrade-e2e test plus one happy-path onboarding E2E (cloud-e2e
or sandbox-operations-e2e) to confirm
dockerDriverGatewayEnv.startPackageManagedDockerDriverGateway correctly takes
ownership and that the legacy startup path still behaves when it should; verify
the flow around verifySandboxBridgeGatewayReachableOrExit and GATEWAY_NAME,
observe logs/errors and ensure no regressions in gateway handoff or sandbox
creation before merging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e9cba808-c0cf-473b-950a-5efbb2171a45

📥 Commits

Reviewing files that changed from the base of the PR and between 04b7523 and 795b3c3.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-env.ts
  • src/lib/onboard/docker-driver-gateway-service.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/docker-driver-gateway-env.ts

@github-actions
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26717266019
Target ref: 098007d653201720acea8b9f796c267c9feff776
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e,openshell-gateway-upgrade-e2e
Summary: 1 passed, 2 failed, 0 skipped

Job Result
cloud-onboard-e2e ❌ failure
openshell-gateway-upgrade-e2e ✅ success
sandbox-survival-e2e ❌ failure

Failed jobs: cloud-onboard-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@github-actions
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26718114347
Target ref: c24aa4e2f9f2f1d765c4d68ad3d19b78bf33f12ba
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 3 failed, 0 skipped

Job Result
cloud-onboard-e2e ❌ failure
openshell-gateway-upgrade-e2e ❌ failure
sandbox-survival-e2e ❌ failure

Failed jobs: cloud-onboard-e2e, openshell-gateway-upgrade-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26718192181
Target ref: c24aa4e2f798b44cedc7665ffc5750b610d2e28a
Workflow ref: main
Requested jobs: cloud-e2e,double-onboard-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-e2e ⚠️ cancelled
double-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/reference/commands.mdx`:
- Line 1422: Update the description for the environment variable
NEMOCLAW_OPENSHELL_SANDBOX_BIN to use active voice to match the parallel entry
for NEMOCLAW_OPENSHELL_GATEWAY_BIN: replace the passive phrase "passed to the
Linux Docker-driver standalone fallback" with the active construction "used by
the Linux Docker-driver standalone fallback" while keeping the rest of the
sentence unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b6c6b2a5-0d6f-4db2-a49c-f202dd4ad0d1

📥 Commits

Reviewing files that changed from the base of the PR and between 098007d and c24aa4e.

📒 Files selected for processing (2)
  • docs/reference/commands.mdx
  • src/lib/onboard/docker-driver-gateway-service.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/docker-driver-gateway-service.ts

Comment thread docs/reference/commands.mdx Outdated
@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26718200252
Target ref: 3f68c539819164ab6324c26c69a524181c2c3451
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled
sandbox-survival-e2e ⚠️ cancelled

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26718263958
Target ref: 3f68c539819164ab6324c26c69a524181c2c3451
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e,openshell-gateway-upgrade-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
openshell-gateway-upgrade-e2e ⚠️ cancelled
sandbox-survival-e2e ✅ success

@github-actions
Contributor

Brev E2E (full): PASSED on branch fix/4423-openshell-service-lifecycle-v60 (see logs)

@ericksoa ericksoa requested a review from cv May 31, 2026 17:04
Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard/docker-driver-gateway-service.ts (1)

145-150: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Allow standalone fallback for bus/user-manager errors on restart too.

Limiting fallbackAllowed to daemon-reload failures turns a bus/user-manager outage during systemctl --user restart into a hard stop. Because the caller in src/lib/onboard.ts (lines 2411-2427) routes through this helper before the standalone path, this blocks the documented fallback even though the package-managed service is simply unavailable.

🔧 Proposed fix
       return {
         attempted: true,
-        fallbackAllowed: args[0] === "daemon-reload" && userManagerLooksUnavailable(result.reason ?? ""),
+        fallbackAllowed: userManagerLooksUnavailable(result.reason ?? ""),
         reason,
         started: false,
       };
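To make the effect of the proposed change concrete, here is a minimal standalone sketch. It assumes a simple pattern match for `userManagerLooksUnavailable`; the real implementation is not shown in this thread, and the string it matches ("Failed to connect to bus") is only one commonly seen `systemctl --user` error, used here as an assumption. `SystemctlFailure` and `describeSystemctlFailure` are hypothetical names for illustration.

```typescript
// Assumed matcher: the real userManagerLooksUnavailable is not shown in this thread.
function userManagerLooksUnavailable(reason: string): boolean {
  return /failed to connect to bus/i.test(reason);
}

// Hypothetical failure shape, using the fields from the proposed fix above.
interface SystemctlFailure {
  attempted: true;
  fallbackAllowed: boolean;
  reason: string;
  started: false;
}

// Build the failure result for any failed `systemctl --user` subcommand.
// With the proposed change, the subcommand (args[0]) no longer gates fallback;
// only the kind of failure does.
function describeSystemctlFailure(args: string[], reason?: string): SystemctlFailure {
  return {
    attempted: true,
    fallbackAllowed: userManagerLooksUnavailable(reason ?? ""),
    reason: `systemctl --user ${args.join(" ")} failed: ${reason ?? "unknown error"}`,
    started: false,
  };
}
```

Under this sketch, a `restart` that fails because the user bus is unreachable permits the standalone fallback, while a `restart` that fails for an unrelated reason (for example a unit that exits with an error) remains a hard stop.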
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/docker-driver-gateway-service.ts` around lines 145 - 150, The
fallbackAllowed check in the runSystemctlUser error handling currently only
permits fallback for daemon-reload failures; update that condition to also allow
fallback when the invoked command is "restart" and the failure matches the user
manager/bus outage detector. In the error branch where runSystemctlUser
result.ok is false (inside docker-driver-gateway-service.ts), change the
fallbackAllowed expression that references args[0] and
userManagerLooksUnavailable(result.reason ?? "") so it returns true for args[0]
=== "daemon-reload" OR args[0] === "restart" when
userManagerLooksUnavailable(...) is true, ensuring restart failures due to
bus/user-manager outages permit the standalone fallback.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fc5291fc-ff9e-4407-9444-a734dc2c9a01

📥 Commits

Reviewing files that changed from the base of the PR and between c24aa4e and ce9cb66.

📒 Files selected for processing (3)
  • src/lib/onboard/docker-driver-gateway-service.test.ts
  • src/lib/onboard/docker-driver-gateway-service.ts
  • test/e2e/test-openshell-version-pin.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/onboard/docker-driver-gateway-service.test.ts
  • test/e2e/test-openshell-version-pin.sh

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26719041225
Target ref: 59b38da051a840236ef0d5de60bd900f64ac4008
Workflow ref: main
Requested jobs: all (no filter)
Summary: 5 passed, 0 failed, 2 skipped

Job Result
bedrock-runtime-compatible-anthropic-e2e ⚠️ cancelled
brave-search-e2e ✅ success
channels-add-remove-e2e ⚠️ cancelled
channels-stop-start-e2e ⚠️ cancelled
cloud-e2e ⚠️ cancelled
cloud-inference-e2e ⚠️ cancelled
cloud-onboard-e2e ⚠️ cancelled
credential-migration-e2e ⚠️ cancelled
credential-sanitization-e2e ⚠️ cancelled
device-auth-health-e2e ⚠️ cancelled
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ⚠️ cancelled
double-onboard-e2e ⚠️ cancelled
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-dashboard-e2e ⚠️ cancelled
hermes-discord-e2e ⚠️ cancelled
hermes-e2e ⚠️ cancelled
hermes-inference-switch-e2e ⚠️ cancelled
hermes-onboard-security-posture-e2e ⚠️ cancelled
hermes-root-entrypoint-smoke-e2e ✅ success
hermes-slack-e2e ⚠️ cancelled
inference-routing-e2e ⚠️ cancelled
issue-2478-crash-loop-recovery-e2e ⚠️ cancelled
issue-3600-gpu-proof-optional-e2e ✅ success
issue-4462-gateway-pinned-approval-characterization-e2e ⚠️ cancelled
issue-4462-scope-upgrade-approval-e2e ⚠️ cancelled
kimi-inference-compat-e2e ⚠️ cancelled
launchable-smoke-e2e ⚠️ cancelled
messaging-compatible-endpoint-e2e ⚠️ cancelled
messaging-providers-e2e ⚠️ cancelled
network-policy-e2e ⚠️ cancelled
onboard-negative-paths-e2e ⚠️ cancelled
onboard-repair-e2e ⚠️ cancelled
onboard-resume-e2e ⚠️ cancelled
openclaw-discord-pairing-e2e ⚠️ cancelled
openclaw-inference-switch-e2e ⚠️ cancelled
openclaw-onboard-security-posture-e2e ⚠️ cancelled
openclaw-slack-pairing-e2e ⚠️ cancelled
openclaw-tui-chat-correlation-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ⚠️ cancelled
rebuild-hermes-stale-base-e2e ⚠️ cancelled
rebuild-openclaw-e2e ⚠️ cancelled
runtime-overrides-e2e ⚠️ cancelled
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ⚠️ cancelled
shields-config-e2e ⚠️ cancelled
skill-agent-e2e ⚠️ cancelled
snapshot-commands-e2e ⚠️ cancelled
state-backup-restore-e2e ⚠️ cancelled
telegram-injection-e2e ⚠️ cancelled
token-rotation-e2e ⚠️ cancelled
tunnel-lifecycle-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled
vm-driver-privileged-exec-routing-e2e ✅ success

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26719097375
Target ref: 59b38da051a840236ef0d5de60bd900f64ac4008
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26719158501
Target ref: 59b38da051a840236ef0d5de60bd900f64ac4008
Workflow ref: main
Requested jobs: all (no filter)
Summary: 55 passed, 0 failed, 2 skipped

Job Result
bedrock-runtime-compatible-anthropic-e2e ✅ success
brave-search-e2e ✅ success
channels-add-remove-e2e ✅ success
channels-stop-start-e2e ✅ success
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-dashboard-e2e ✅ success
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-inference-switch-e2e ✅ success
hermes-onboard-security-posture-e2e ✅ success
hermes-root-entrypoint-smoke-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
issue-3600-gpu-proof-optional-e2e ✅ success
issue-4462-gateway-pinned-approval-characterization-e2e ✅ success
issue-4462-scope-upgrade-approval-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-negative-paths-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openclaw-discord-pairing-e2e ✅ success
openclaw-inference-switch-e2e ✅ success
openclaw-onboard-security-posture-e2e ✅ success
openclaw-slack-pairing-e2e ✅ success
openclaw-tui-chat-correlation-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
state-backup-restore-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
tunnel-lifecycle-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success
vm-driver-privileged-exec-routing-e2e ✅ success

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 31, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26720532591
Target ref: 7523a1ebf586a3749bd3b091f2d22c64add9b021
Workflow ref: main
Requested jobs: cloud-e2e,openshell-gateway-upgrade-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
cloud-e2e ✅ success
openshell-gateway-upgrade-e2e ⚠️ cancelled

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26720778107
Target ref: 7523a1ebf586a3749bd3b091f2d22c64add9b021
Workflow ref: main
Requested jobs: openshell-gateway-upgrade-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
openshell-gateway-upgrade-e2e ✅ success

Labels

  • bug: Something fails against expected or documented behavior
  • NV QA: Bugs found by the NVIDIA QA Team
  • Platform: DGX Spark: Support for DGX Spark
  • priority: high: Important issue that should be resolved in the next release
  • Sandbox: Use this label to identify issues related to the NemoClaw isolated environment based on OpenShell.
  • UAT: Issues flagged for User Acceptance Testing.
  • v0.0.60: Release target
