feat: migrate sandbox operations to OpenShell gRPC#4329
Conversation
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
🌿 Preview your docs: https://nvidia-preview-pr-4329.docs.buildwithfern.com/nemoclaw |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 6 needs attention, 12 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
Selective E2E Results — ❌ Some jobs failedRun: 26523535919
|
Selective E2E Results — ❌ Some jobs failedRun: 26527630225
|
Selective E2E Results — ❌ Some jobs failedRun: 26527923535
|
Selective E2E Results — ❌ Some jobs failedRun: 26528706357
|
Selective E2E Results — ❌ Some jobs failedRun: 26531854119
|
Selective E2E Results — ❌ Some jobs failedRun: 26537255180
|
Selective E2E Results — ❌ Some jobs failedRun: 26541403310
|
Selective E2E Results — ❌ Some jobs failedRun: 26545970853
|
Selective E2E Results — ✅ All requested jobs passedRun: 26547205887
|
Selective E2E Results — ✅ All requested jobs passedRun: 26547328935
|
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/state/sandbox.ts (1)
865-875:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winDon’t let
catmask a failed SQLite backup.Line 873 runs the SQLite backup and then unconditionally cats the temp file. Because this command list is joined with
;, a Python failure can still end withcat -- "$tmp"returning0, which makes the snapshot look successful while persisting an empty/corrupt state-file backup.🐛 Proposed fix
return [ `src=${quotedRemotePath}`, "[ ! -e \"$src\" ] && exit 2", '[ -f "$src" ] && [ ! -L "$src" ] || { echo "unsafe sqlite state file: $src" >&2; exit 10; }', 'hardlink_count="$(find "$src" -maxdepth 0 -type f -links +1 -print 2>/dev/null | wc -l | tr -d " ")"', '[ "${hardlink_count:-0}" = "0" ] || { echo "hard-linked sqlite state file rejected: $src" >&2; exit 11; }', 'tmp="$(mktemp /tmp/nemoclaw-sqlite-backup.XXXXXX)"', 'trap \'rm -f "$tmp"\' EXIT', - `${pythonCommand(SQLITE_BACKUP_PY)} "$src" "$tmp"`, - 'cat -- "$tmp"', + `${pythonCommand(SQLITE_BACKUP_PY)} "$src" "$tmp" && cat -- "$tmp"`, ].join("; ");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/state/sandbox.ts` around lines 865 - 875, The current command array in the function that builds the shell snippet runs `${pythonCommand(SQLITE_BACKUP_PY)} "$src" "$tmp"` followed by an unconditional 'cat -- "$tmp"', so a failing Python backup can be masked; change the sequence so the cat only runs on successful backup (for example, chain the backup and the cat with a conditional AND or explicitly check the backup exit status and exit on failure) while keeping the existing safety checks (the quotedRemotePath check, hardlink check, mktemp/trap). Ensure you modify the element containing `${pythonCommand(SQLITE_BACKUP_PY)} "$src" "$tmp"` and the 'cat -- "$tmp"' element so that 'cat' is executed only when the backup command succeeded.
🧹 Nitpick comments (1)
src/lib/state/sandbox.ts (1)
1005-1641: Please run the state lifecycle E2Es on this migration path.This file now routes backup/restore through gRPC exec, tar streaming, and marker parsing, so it is worth exercising
state-backup-restore-e2e,snapshot-commands-e2e, andrebuild-openclaw-e2ebefore merge.As per coding guidelines,
src/lib/state/sandbox.ts: "This file manages sandbox state (backup, restore, rebuild, snapshot). Changes affect data persistence across sandbox lifecycle operations."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/state/sandbox.ts` around lines 1005 - 1641, The review asks you to execute the full state lifecycle end-to-end tests against the new gRPC/tar/marker backup and restore flow: run state-backup-restore-e2e, snapshot-commands-e2e, and rebuild-openclaw-e2e and verify backupSandboxState and restoreSandboxState behavior (including marker parsing, tar streaming, pre-backup audit, sanitizeBackupDirectory, and workspace-* discovery) on the intended migration path; if failures appear, iterate on backupSandboxState, restoreSandboxState, safeTarExtract, statusFromMarker, execBinaryStreamSync/execTextSync wrappers, and related helper functions until the E2E suites pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/adapters/openshell/forward-bridge-state.ts`:
- Around line 39-41: The statePath function is vulnerable because String(port)
can contain path separators or traversal sequences; update statePath to sanitize
the port similarly to safeSandbox (or validate as a numeric port) before
composing the filename: convert port to string, strip or replace any characters
that are not digits/letters/._- (e.g. replace /[^A-Za-z0-9._-]/g with "_") or
enforce a numeric check (parseInt/Number and throw on NaN) so the returned path
(stateDir() + `${safeSandbox}-${safePort}.json`) cannot escape the intended
directory; modify the function statePath to produce safePort and use it in the
path join.
In `@src/lib/onboard.ts`:
- Around line 2795-2796: forwardStatesAsListOutput() only returns
NemoClaw-managed forwards so findAvailableDashboardPort(...) can pick a port
already taken by legacy/manual forwards; update the pre-resolution to include
legacy/live forwards or probe the candidate port before finalizing
effectivePort: call or implement a legacy forward listing (e.g. merge results of
forwardStatesAsListOutput() with a new listLegacyForwards() or
openshellForwardProbe()) and pass the combined set into
findAvailableDashboardPort(sandboxName, preferredPort, combinedForwards), or
after findAvailableDashboardPort returns, perform an explicit port probe on
effectivePort and, if occupied, retry selection; ensure this change happens
before ensureDashboardForward() (so CHAT_UI_URL / NEMOCLAW_DASHBOARD_PORT baked
into the staged Dockerfile/sandbox env reflect a truly available port).
In `@test/helpers/grpc-fake-ssh.cjs`:
- Around line 37-39: The current hermetic binary check uses
path.resolve(openshell) which doesn't resolve symlinks; replace that check by
calling fs.realpathSync or fs.promises.realpath on both openshell and home
(e.g., realShell = fs.realpathSync(openshell); realHome = fs.realpathSync(home))
and then verify realShell.startsWith(`${realHome}${path.sep}`); also wrap
realpath in a try/catch so failures log the same stderr message and call
process.exit(127) as before, keeping the existing variables (openshell, home)
and the failure branch that writes the error and exits.
---
Outside diff comments:
In `@src/lib/state/sandbox.ts`:
- Around line 865-875: The current command array in the function that builds the
shell snippet runs `${pythonCommand(SQLITE_BACKUP_PY)} "$src" "$tmp"` followed
by an unconditional 'cat -- "$tmp"', so a failing Python backup can be masked;
change the sequence so the cat only runs on successful backup (for example,
chain the backup and the cat with a conditional AND or explicitly check the
backup exit status and exit on failure) while keeping the existing safety checks
(the quotedRemotePath check, hardlink check, mktemp/trap). Ensure you modify the
element containing `${pythonCommand(SQLITE_BACKUP_PY)} "$src" "$tmp"` and the
'cat -- "$tmp"' element so that 'cat' is executed only when the backup command
succeeded.
---
Nitpick comments:
In `@src/lib/state/sandbox.ts`:
- Around line 1005-1641: The review asks you to execute the full state lifecycle
end-to-end tests against the new gRPC/tar/marker backup and restore flow: run
state-backup-restore-e2e, snapshot-commands-e2e, and rebuild-openclaw-e2e and
verify backupSandboxState and restoreSandboxState behavior (including marker
parsing, tar streaming, pre-backup audit, sanitizeBackupDirectory, and
workspace-* discovery) on the intended migration path; if failures appear,
iterate on backupSandboxState, restoreSandboxState, safeTarExtract,
statusFromMarker, execBinaryStreamSync/execTextSync wrappers, and related helper
functions until the E2E suites pass.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 15ce826e-3c02-479c-b326-9b51e2257183
⛔ Files ignored due to path filters (1)
package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (65)
.github/workflows/e2e-script.yamlci/env-var-doc-allowlist.jsonpackage.jsonscripts/copy-openshell-protos.mjsscripts/install.shsrc/lib/actions/sandbox/connect.tssrc/lib/actions/sandbox/destroy.tssrc/lib/actions/sandbox/policy-channel.tssrc/lib/actions/sandbox/process-recovery.tssrc/lib/actions/sandbox/skill-install.tssrc/lib/adapters/openshell/forward-bridge-runner.tssrc/lib/adapters/openshell/forward-bridge-state.test.tssrc/lib/adapters/openshell/forward-bridge-state.tssrc/lib/adapters/openshell/gateway-metadata.test.tssrc/lib/adapters/openshell/gateway-metadata.tssrc/lib/adapters/openshell/grpc-migration-guard.test.tssrc/lib/adapters/openshell/grpc.test.tssrc/lib/adapters/openshell/grpc.tssrc/lib/adapters/openshell/proto/datamodel.protosrc/lib/adapters/openshell/proto/openshell.protosrc/lib/adapters/openshell/sync-runner.tssrc/lib/adapters/openshell/timeouts.tssrc/lib/onboard.tssrc/lib/onboard/dashboard-access.test.tssrc/lib/onboard/dashboard-access.tssrc/lib/onboard/dashboard.tssrc/lib/onboard/forward-cleanup.test.tssrc/lib/onboard/forward-cleanup.tssrc/lib/onboard/forward-start.test.tssrc/lib/onboard/forward-start.tssrc/lib/onboard/preflight-port-recovery.test.tssrc/lib/onboard/preflight-port-recovery.tssrc/lib/onboard/sandbox-verification-exec.tssrc/lib/onboard/stale-gateway-cleanup.test.tssrc/lib/onboard/stale-gateway-cleanup.tssrc/lib/sandbox/version.test.tssrc/lib/sandbox/version.tssrc/lib/share-command-deps.tssrc/lib/share-command.test.tssrc/lib/share-command.tssrc/lib/skill-install.tssrc/lib/state/sandbox.tssrc/lib/status-command-deps.test.tssrc/lib/status-command-deps.tssrc/lib/tunnel/services-sandbox.test.tssrc/lib/tunnel/services.tssrc/lib/verify-deployment.tstest/cli.test.tstest/destroy-cleanup-sandbox-services.test.tstest/e2e/test-double-onboard.shtest/helpers/grpc-fake-ssh.cjstest/install-preflight.test.tstest/nemoclaw-cli-recovery.test.tstest/onboard-dashboard.test.tstest/onboard.test.tstest/rebuild-credential-preflight.test.tstest/rebuild-shields-auto-unlock.test.tstest/recover-port-forward.test.tstest/repro-2201.test.tstest/sandbox-connect-inference.test.tstest/share-command-deps-probe-argv.test.tstest/share-command-remote-path.test.tstest/shellquote-sandbox.test.tstest/snapshot-gateway-guard.test.tstest/snapshot.test.ts
💤 Files with no reviewable changes (6)
- test/share-command-remote-path.test.ts
- src/lib/onboard/forward-start.test.ts
- src/lib/onboard/dashboard-access.test.ts
- test/e2e/test-double-onboard.sh
- src/lib/onboard/forward-start.ts
- src/lib/onboard/dashboard-access.ts
| function statePath(sandboxName: string, port: number | string): string { | ||
| const safeSandbox = sandboxName.replace(/[^A-Za-z0-9._-]/g, "_"); | ||
| return path.join(stateDir(), `${safeSandbox}-${String(port)}.json`); |
There was a problem hiding this comment.
Sanitize port before composing the state-file path.
At Line 41, String(port) is injected directly into a path segment. A string with / or .. can escape the intended ~/.nemoclaw/forwards scope for read/write/unlink operations.
Suggested fix
+function normalizePortSegment(port: number | string): string {
+ const parsed =
+ typeof port === "number" ? port : Number.parseInt(String(port).trim(), 10);
+ if (!Number.isInteger(parsed) || parsed < 1 || parsed > 65_535) {
+ throw new Error(`Invalid forward port '${String(port)}'`);
+ }
+ return String(parsed);
+}
+
function statePath(sandboxName: string, port: number | string): string {
const safeSandbox = sandboxName.replace(/[^A-Za-z0-9._-]/g, "_");
- return path.join(stateDir(), `${safeSandbox}-${String(port)}.json`);
+ return path.join(stateDir(), `${safeSandbox}-${normalizePortSegment(port)}.json`);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| function statePath(sandboxName: string, port: number | string): string { | |
| const safeSandbox = sandboxName.replace(/[^A-Za-z0-9._-]/g, "_"); | |
| return path.join(stateDir(), `${safeSandbox}-${String(port)}.json`); | |
| function normalizePortSegment(port: number | string): string { | |
| const parsed = | |
| typeof port === "number" ? port : Number.parseInt(String(port).trim(), 10); | |
| if (!Number.isInteger(parsed) || parsed < 1 || parsed > 65_535) { | |
| throw new Error(`Invalid forward port '${String(port)}'`); | |
| } | |
| return String(parsed); | |
| } | |
| function statePath(sandboxName: string, port: number | string): string { | |
| const safeSandbox = sandboxName.replace(/[^A-Za-z0-9._-]/g, "_"); | |
| return path.join(stateDir(), `${safeSandbox}-${normalizePortSegment(port)}.json`); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/adapters/openshell/forward-bridge-state.ts` around lines 39 - 41, The
statePath function is vulnerable because String(port) can contain path
separators or traversal sequences; update statePath to sanitize the port
similarly to safeSandbox (or validate as a numeric port) before composing the
filename: convert port to string, strip or replace any characters that are not
digits/letters/._- (e.g. replace /[^A-Za-z0-9._-]/g with "_") or enforce a
numeric check (parseInt/Number and throw on NaN) so the returned path
(stateDir() + `${safeSandbox}-${safePort}.json`) cannot escape the intended
directory; modify the function statePath to produce safePort and use it in the
path join.
| export function stopForwardBridge(sandboxName: string, port: number | string): boolean { | ||
| const state = readStateFile(statePath(sandboxName, port)); | ||
| if (state && isPidAlive(state.pid) && !isTestForwardPid(state.pid)) { | ||
| try { | ||
| process.kill(state.pid, "SIGTERM"); | ||
| } catch { | ||
| /* ignore */ | ||
| } | ||
| if (!waitForPidExit(state.pid)) { | ||
| try { | ||
| process.kill(state.pid, "SIGKILL"); | ||
| } catch { | ||
| /* ignore */ | ||
| } | ||
| waitForPidExit(state.pid, 1_000); | ||
| } | ||
| waitForPortRelease(state.bind, state.port); |
There was a problem hiding this comment.
Avoid PID-only termination for forward bridge cleanup.
At Lines 147-156, termination is driven by PID liveness only. With stale state + PID reuse, this can kill an unrelated process.
Suggested hardening
export function stopForwardBridge(sandboxName: string, port: number | string): boolean {
const state = readStateFile(statePath(sandboxName, port));
- if (state && isPidAlive(state.pid) && !isTestForwardPid(state.pid)) {
+ if (
+ state &&
+ isPidAlive(state.pid) &&
+ !isTestForwardPid(state.pid) &&
+ !canBindForwardPort(state.bind, state.port)
+ ) {
try {
process.kill(state.pid, "SIGTERM");
} catch {
/* ignore */
}| const earlyForwards = forwardStatesAsListOutput(); | ||
| const effectivePort = findAvailableDashboardPort(sandboxName, preferredPort, earlyForwards); |
There was a problem hiding this comment.
Keep legacy/manual forwards visible to the early port allocator.
forwardStatesAsListOutput() only reflects NemoClaw-managed forward-bridge state. That drops any still-live legacy openshell forward entry or manually-started forward, so effectivePort can be chosen against a port that's already occupied. ensureDashboardForward() can recover later, but by then CHAT_UI_URL / NEMOCLAW_DASHBOARD_PORT have already been baked into the staged Dockerfile and sandbox env, which defeats the purpose of this pre-resolution block. Please merge legacy/live forward data into this input, or probe the candidate port directly before finalizing effectivePort.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/onboard.ts` around lines 2795 - 2796, forwardStatesAsListOutput()
only returns NemoClaw-managed forwards so findAvailableDashboardPort(...) can
pick a port already taken by legacy/manual forwards; update the pre-resolution
to include legacy/live forwards or probe the candidate port before finalizing
effectivePort: call or implement a legacy forward listing (e.g. merge results of
forwardStatesAsListOutput() with a new listLegacyForwards() or
openshellForwardProbe()) and pass the combined set into
findAvailableDashboardPort(sandboxName, preferredPort, combinedForwards), or
after findAvailableDashboardPort returns, perform an explicit port probe on
effectivePort and, if occupied, retry selection; ensure this change happens
before ensureDashboardForward() (so CHAT_UI_URL / NEMOCLAW_DASHBOARD_PORT baked
into the staged Dockerfile/sandbox env reflect a truly available port).
| if (!openshell || !home || !path.resolve(openshell).startsWith(`${home}${path.sep}`)) { | ||
| process.stderr.write("grpc fake transport could not find the hermetic fake openshell under HOME\n"); | ||
| process.exit(127); |
There was a problem hiding this comment.
Harden the hermetic binary check against symlink escapes.
path.resolve() only normalizes the path string; it does not resolve symlinks. A symlink like $HOME/bin/openshell -> /usr/bin/openshell passes this check and can run a non-hermetic binary.
🔧 Proposed fix
const openshell = findFakeOpenshell();
const home = process.env.HOME ? path.resolve(process.env.HOME) : "";
-if (!openshell || !home || !path.resolve(openshell).startsWith(`${home}${path.sep}`)) {
+let realHome = "";
+let realOpenshell = "";
+try {
+ realHome = home ? fs.realpathSync(home) : "";
+ realOpenshell = openshell ? fs.realpathSync(openshell) : "";
+} catch {}
+if (!realOpenshell || !realHome || !realOpenshell.startsWith(`${realHome}${path.sep}`)) {
process.stderr.write("grpc fake transport could not find the hermetic fake openshell under HOME\n");
process.exit(127);
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/helpers/grpc-fake-ssh.cjs` around lines 37 - 39, The current hermetic
binary check uses path.resolve(openshell) which doesn't resolve symlinks;
replace that check by calling fs.realpathSync or fs.promises.realpath on both
openshell and home (e.g., realShell = fs.realpathSync(openshell); realHome =
fs.realpathSync(home)) and then verify
realShell.startsWith(`${realHome}${path.sep}`); also wrap realpath in a
try/catch so failures log the same stderr message and call process.exit(127) as
before, keeping the existing variables (openshell, home) and the failure branch
that writes the error and exits.
Selective E2E Results — ❌ Some jobs failedRun: 26721427994
|
Selective E2E Results — ❌ Some jobs failedRun: 26724160646
|
Use raw OpenShell gRPC for ForwardTcp and log/watch endpoints that are not exposed by the current TypeScript SDK binding. Add a detached dashboard forward runner and raw ExecSandbox fallback so SDK-preview builds can validate runtime flows without the placeholder implementation. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 26724916520
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 26725221329
|
Selective E2E Results — ❌ Some jobs failedRun: 26725411482
|
Selective E2E Results — ✅ All requested jobs passedRun: 26725411482
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 26726398782
|
Selective E2E Results — ✅ All requested jobs passedRun: 26726398782
|
Summary
Migrates NemoClaw-managed sandbox lifecycle operations from SSH/SFTP/SSHFS-style paths to direct OpenShell gRPC transport. This adds a Node gRPC adapter for OpenShell gateway metadata, sandbox exec streams, sandbox logs/watch, and dashboard ForwardTcp bridges while preserving the privileged Docker exec path for root/shields mutations.
Related Issue
None.
Changes
share mountunder the gRPC-only model while retaining status/unmount cleanup for legacy mounts.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Ran locally:
npm run build:clinpm run typecheck:clinpx vitest run src/lib/adapters/openshell/gateway-metadata.test.ts src/lib/adapters/openshell/grpc-migration-guard.test.ts src/lib/share-command.test.ts src/lib/sandbox/version.test.ts test/share-command-deps-probe-argv.test.ts src/lib/actions/sandbox/process-recovery.test.ts src/lib/skill-install.test.ts test/process-recovery.test.ts test/snapshot.test.ts test/snapshot-restore-existing-dest.test.ts test/rebuild-credential-preflight.test.ts test/rebuild-shields-auto-unlock.test.ts test/rebuild-credential-hydration.test.ts test/rebuild-policy-presets.test.ts test/share-command-remote-path.test.ts test/share-command-writable.test.tsgit diff --checkNot run locally:
openshell sandbox listreturns connection refused).Signed-off-by: Aaron Erickson aerickson@nvidia.com
Summary by CodeRabbit