Skip to content

fix(onboard): fail fast in preflight when all dashboard ports are occupied (#3953)#3980

Merged
cv merged 7 commits into
mainfrom
fix/dashboard-port-preflight-3953
May 25, 2026
Merged

fix(onboard): fail fast in preflight when all dashboard ports are occupied (#3953)#3980
cv merged 7 commits into
mainfrom
fix/dashboard-port-preflight-3953

Conversation

@nvshaxie
Copy link
Copy Markdown
Contributor

@nvshaxie nvshaxie commented May 21, 2026

Summary

  • findAvailableDashboardPort already raises the right "All dashboard ports in range 18789-18799 are occupied …" error when every port is held, but it only fires in late onboarding (after preflight + gateway start + inference selection). Reporters expect the failure to be upfront.
  • Add a narrow preflightDashboardPortRangeAvailability() helper in src/lib/onboard/dashboard-port.ts that only checks host bindings (safe to run before OpenShell is reachable) and call it at the end of preflight(). Exits 1 with the same canonical message when every port in DASHBOARD_PORT_RANGE_START..END is bound.
  • Drop the unused findDashboardForwardOwner import/re-export from src/lib/onboard.ts to keep the entrypoint net-neutral per the onboard-entrypoint-budget workflow (the only remaining caller is dashboard-port.test.ts, which imports it directly from the module).

Bug reproduction (pre-fix)

```
$ for p in $(seq 18789 18799); do python3 -m http.server "$p" &>/dev/null & done
$ nemoclaw onboard --non-interactive --yes-i-accept-third-party-software --name overflow-test ...
[1/8] Preflight checks ← passes
[2/8] Starting OpenShell gateway ← runs anyway
[3/8] Configuring inference (NIM) ← runs anyway
Inference options menu … ← reaches the wizard
```

Behavior post-fix (same setup)

```
[1/8] Preflight checks
✓ Docker is running … ✓ Memory OK …
All dashboard ports in range 18789-18799 are occupied:
18789 → non-OpenShell host listener
18790 → non-OpenShell host listener
...
18799 → non-OpenShell host listener
Free a sandbox or use --control-ui-port with a port outside this range.
```

Exit 1. `[2/8] Starting OpenShell gateway` is never printed.

Why the helper is sound at this stage

  • findAvailableDashboardPort needs openshell forward list (so it can distinguish "this port is bound by an OpenShell forward that belongs to this sandbox" from "host listener"). The preflight helper deliberately skips that distinction and treats every bound port as a non-OpenShell listener.
  • That is correct here: if every port in the range is bound (either by external processes OR by other sandboxes), the new sandbox cannot use any of them anyway, and operator intervention is required either way (--control-ui-port <N> or freeing a sandbox).

Test plan

  • New `describe("preflightDashboardPortRangeAvailability ([NemoClaw][All Platforms][Onboard] Onboard does not fail fast when all dashboard ports 18789-18799 are occupied by external processes #3953)")` block in `src/lib/onboard/dashboard-port.test.ts` with 3 cases (every port bound → exit 1 with canonical message; 10/11 bound + 1 free → no exit; no ports bound → no exit). Uses injected stubs for `isPortBoundOnHost` and `process.exit`.
  • `npx vitest run src/lib/onboard/dashboard-port.test.ts` — 11/11 pass.
  • Manual on Ubuntu 24.04 x86_64: occupied 18789-18799 with `python3 -m http.server`, ran `node bin/nemoclaw.js onboard --non-interactive ...`. Exits at the end of `[1/8] Preflight checks` with the canonical message and the per-port list; `[2/8] Starting OpenShell gateway` never prints.
  • `src/lib/onboard.ts` diff stat: `+2/-2` (net 0), within the entrypoint budget.

Fixes #3953.

Signed-off-by: Shawn Xie shaxie@nvidia.com

Summary by CodeRabbit

  • New Features

    • Preflight now detects when all dashboard ports are exhausted and blocks startup with a clear error showing per-port details and guidance to specify an alternate control UI port. This check runs on normal start and on resume, but is skipped when a control UI port is explicitly provided (via flag or environment).
  • Tests

    • Added tests for fully occupied, partially available, and completely free port-range scenarios.

Review Change Stack

…upied (#3953)

`findAvailableDashboardPort` already raises the right error
("All dashboard ports in range 18789-18799 are occupied …") when every
port is held, but it only runs late in onboarding — after preflight,
gateway start, and inference selection have already had side effects.
A user with all 11 dashboard ports held by external processes
(python3, etc.) saw onboard proceed all the way to the inference menu
before any port-related failure surfaced, contradicting the contract
in the docs / test plan.

Add a narrow `preflightDashboardPortRangeAvailability()` helper to
`src/lib/onboard/dashboard-port.ts` and call it from `preflight()`
right before it returns the GPU detection. The helper only checks
host bindings (it does not need OpenShell forward state, so it is
safe before gateway start). When every port in
`DASHBOARD_PORT_RANGE_START..END` is bound, it prints the same
canonical message and exits 1.

To keep `src/lib/onboard.ts` net-neutral per the
`onboard-entrypoint-budget` workflow, the unused
`findDashboardForwardOwner` import / re-export is dropped (no
callers outside the colocated unit test, which imports it directly
from `./onboard/dashboard-port`). Net delta on `onboard.ts`: +2/-2.

Test plan
- Added 3 unit tests in `src/lib/onboard/dashboard-port.test.ts`
  with injected `isPortBoundOnHost`/`process.exit` stubs:
  * exits 1 with the canonical message when every port in the range
    is bound; stderr lists each port → non-OpenShell host listener.
  * returns without exiting when 10 of 11 are bound and 1 is free.
  * returns without exiting when no port is bound.
- `npx vitest run src/lib/onboard/dashboard-port.test.ts` — 11/11 pass.
- Manual end-to-end on Ubuntu 24.04 x86_64:
  * Occupy 18789-18799 with `python3 -m http.server` placeholders.
  * `nemoclaw onboard --non-interactive ...` now exits inside
    `[1/8] Preflight checks` with
    `All dashboard ports in range 18789-18799 are occupied:`
    followed by one line per occupied port. `[2/8] Starting OpenShell
    gateway` and `[3/8] Configuring inference` are never printed.

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
@nvshaxie nvshaxie requested review from cv and ericksoa May 21, 2026 07:09
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b4bf4634-4f29-4611-af2a-4d41989a2267

📥 Commits

Reviewing files that changed from the base of the PR and between a0cc430 and 1298197.

📒 Files selected for processing (1)
  • src/lib/onboard.ts
💤 Files with no reviewable changes (1)
  • src/lib/onboard.ts

📝 Walkthrough

Walkthrough

Adds preflightDashboardPortRangeAvailability to scan dashboard ports 18789–18799; it returns early if any port is free, otherwise prints per-port diagnostics and exits(1). Onboard imports and invokes this check only when no explicit control UI port is set (including NEMOCLAW_DASHBOARD_PORT), and the test suite adds cases for full, partial, and empty occupancy.

Changes

Dashboard Port Exhaustion Preflight

Layer / File(s) Summary
Dashboard port exhaustion check implementation and tests
src/lib/onboard/dashboard-port.ts, src/lib/onboard/dashboard-port.test.ts
preflightDashboardPortRangeAvailability iterates the dashboard port range with an injectable bound check, returns early if any port is free, and emits a per-port error + non-zero exit when all ports are occupied. Tests cover full exhaustion (exit+error), partial availability, and no-bound scenarios.
Onboard preflight integration
src/lib/onboard.ts
Imports and conditionally invokes preflightDashboardPortRangeAvailability() during preflight when no explicit --control-ui-port or NEMOCLAW_DASHBOARD_PORT is set, updates _preflightDashboardPort initialization, and removes findDashboardForwardOwner from module.exports.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

OpenShell, Networking

Suggested reviewers

  • cv
  • jyaunches
  • ericksoa

Poem

🐰
I hopped the ports from nine to nine,
I sniffed and tapped each hollow line.
If every gate returns a lock,
I call the halt — I stop the clock.
Now onboarding hears my sign.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding a fail-fast check for when all dashboard ports are occupied during preflight.
Linked Issues check ✅ Passed The PR implements all requirements from #3953: detects when all dashboard ports 18789-18799 are occupied, prints a clear error message listing each port, exits with non-zero status, and prevents further onboarding steps.
Out of Scope Changes check ✅ Passed All changes are directly related to the dashboard port preflight check objective. Removal of unused findDashboardForwardOwner is justified as keeping the entrypoint net-neutral.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/dashboard-port-preflight-3953

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 21, 2026

E2E Advisor Recommendation

Required E2E: None
Optional E2E: None

Workflow run

Full advisor summary

E2E Recommendation Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-raw-output.txt

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 21, 2026

PR Review Advisor

Findings: 0 needs attention, 2 worth checking, 0 nice ideas
Since last review: 2 prior items resolved, 2 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Unit tests do not prove the full onboard fail-fast boundary (src/lib/onboard/dashboard-port.test.ts:99): The new tests validate preflightDashboardPortRangeAvailability in isolation using injected port-bound and exit stubs. They do not run the full onboard path, so they do not prove that gateway startup, inference selection, or sandbox creation are skipped when all ports are occupied, nor do they cover the resume path or explicit dashboard-port override path in integrated execution.
    • Recommendation: Add or identify targeted integration/runtime validation that binds 18789-18799, runs onboard without an explicit dashboard port, asserts non-zero exit before gateway/sandbox side effects, and covers resume plus NEMOCLAW_DASHBOARD_PORT/--control-ui-port bypass behavior.
    • Evidence: dashboard-port.test.ts directly invokes preflightDashboardPortRangeAvailability(allBound, exitFn), partial-bound, and free-port stubs. The production call sites are in src/lib/onboard.ts preflight() and the resume branch after handlePreflightState().
  • Active onboard.ts overlap increases drift risk (src/lib/onboard.ts:2442): The patched code still exists and the diff is small, but src/lib/onboard.ts remains a large active onboard entrypoint with recent and open overlapping work in preflight, gateway, sandbox, policy, and provider areas. This raises sequencing risk for where this pre-gateway fail-fast check belongs after nearby refactors land.
    • Recommendation: Before merge, re-check the latest main and overlapping onboard refactors to ensure both normal and resume guarded calls remain before gateway/inference/sandbox side effects and still respect explicit dashboard-port overrides.
    • Evidence: Trusted drift evidence shows many recent onboard.ts changes and open PR overlaps on src/lib/onboard.ts. The current PR modifies src/lib/onboard.ts with net-zero line delta but integrates behavior into a high-churn monolith.

🌱 Nice ideas

  • None.
Since last review details

Current findings:

  • Unit tests do not prove the full onboard fail-fast boundary (src/lib/onboard/dashboard-port.test.ts:99): The new tests validate preflightDashboardPortRangeAvailability in isolation using injected port-bound and exit stubs. They do not run the full onboard path, so they do not prove that gateway startup, inference selection, or sandbox creation are skipped when all ports are occupied, nor do they cover the resume path or explicit dashboard-port override path in integrated execution.
    • Recommendation: Add or identify targeted integration/runtime validation that binds 18789-18799, runs onboard without an explicit dashboard port, asserts non-zero exit before gateway/sandbox side effects, and covers resume plus NEMOCLAW_DASHBOARD_PORT/--control-ui-port bypass behavior.
    • Evidence: dashboard-port.test.ts directly invokes preflightDashboardPortRangeAvailability(allBound, exitFn), partial-bound, and free-port stubs. The production call sites are in src/lib/onboard.ts preflight() and the resume branch after handlePreflightState().
  • Active onboard.ts overlap increases drift risk (src/lib/onboard.ts:2442): The patched code still exists and the diff is small, but src/lib/onboard.ts remains a large active onboard entrypoint with recent and open overlapping work in preflight, gateway, sandbox, policy, and provider areas. This raises sequencing risk for where this pre-gateway fail-fast check belongs after nearby refactors land.
    • Recommendation: Before merge, re-check the latest main and overlapping onboard refactors to ensure both normal and resume guarded calls remain before gateway/inference/sandbox side effects and still respect explicit dashboard-port overrides.
    • Evidence: Trusted drift evidence shows many recent onboard.ts changes and open PR overlaps on src/lib/onboard.ts. The current PR modifies src/lib/onboard.ts with net-zero line delta but integrates behavior into a high-churn monolith.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Line 3844: preflightDashboardPortRangeAvailability() is being called
unconditionally which causes failures even when the user provided an explicit
--control-ui-port; update the logic so the range-exhaustion check runs only when
no explicit control UI port override is present — detect the override (the CLI
option or config value that represents the explicit port, e.g. controlUiPort /
controlUiPortOverride / options.controlUiPort or the corresponding env var used
earlier in the file) and skip calling preflightDashboardPortRangeAvailability()
when that override is set, leaving the existing explicit-port path intact (the
explicit-port handling you preserved earlier) so the onboarding respects
--control-ui-port inputs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a75a1aed-7ee0-41b2-be45-486dcc29bbfe

📥 Commits

Reviewing files that changed from the base of the PR and between 18c7265 and a916031.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/dashboard-port.test.ts
  • src/lib/onboard/dashboard-port.ts

Comment thread src/lib/onboard.ts Outdated
@wscurran wscurran added bug Something fails against expected or documented behavior fix labels May 21, 2026
@wscurran
Copy link
Copy Markdown
Contributor

…ort is set

Addresses CodeRabbit on #3980. The previous commit called
preflightDashboardPortRangeAvailability() unconditionally at the end
of preflight(), which broke the explicit --control-ui-port path: a
user who passes `--control-ui-port 3000` to escape the default range
would still hit the "All dashboard ports in range 18789-18799 are
occupied" exit, even though that range is no longer the one we'll
use.

Guard the call on `_preflightDashboardPort === null` so the new
fail-fast only applies to the auto-allocation path. The explicit-port
path is already validated separately above (lines ~3670-3700, behind
the same `dashboardPortToCheck` flag).

Verified end-to-end: with 18789-18799 held by python3 placeholders,
`nemoclaw onboard --control-ui-port 23000` no longer prints the
range-exhaustion error in preflight; without the flag, the existing
fail-fast behavior is unchanged.

onboard.ts diff stays at +2/-2 net 0, within the entrypoint budget.

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
@nvshaxie
Copy link
Copy Markdown
Contributor Author

Thanks @coderabbitai — applied your suggested guard in the followup commit. preflightDashboardPortRangeAvailability() is now only called when _preflightDashboardPort === null (i.e. no --control-ui-port override). The explicit-port path on lines ~3670-3700 already validates the chosen port separately, so no double-checking is needed.

Verified end-to-end on Ubuntu 24.04 with 18789-18799 held by python3 -m http.server placeholders:

```
$ nemoclaw onboard --control-ui-port 23000 --non-interactive ...
[1/8] Preflight checks
✓ Memory OK …
[2/8] Starting OpenShell gateway ← proceeds past preflight (would have exited before)

```

Without the flag, the existing fail-fast behavior is unchanged (still prints All dashboard ports in range 18789-18799 are occupied: and exits 1 before step 2).

onboard.ts diff stays at +2/-2, net 0, within the entrypoint budget.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 22, 2026

Tip

For best results, initiate chat on the files or code changes.

@nvshaxie: The guard makes perfect sense — skipping the range check when an explicit --control-ui-port is provided avoids redundant validation and keeps the explicit-port path self-contained. The end-to-end verification on Ubuntu 24.04 with all 18789–18799 ports occupied confirms both the fail-fast path and the bypass path behave correctly. Looks good! 🐇 ✨


✏️ Learnings added
Learnt from: nvshaxie
URL: https://github.com/NVIDIA/NemoClaw/pull/3980

Timestamp: 2026-05-22T01:44:56.363Z
Learning: In `src/lib/onboard.ts`, `preflightDashboardPortRangeAvailability()` (imported from `./onboard/dashboard-port`) is guarded to only be called when `_preflightDashboardPort === null` (i.e., no `--control-ui-port` override was provided). The explicit-port validation path (around lines 3670–3700 of `onboard.ts`) handles its own port availability check separately, so calling `preflightDashboardPortRangeAvailability()` in that branch would be a double-check.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/lib/onboard.ts (3)

3545-3551: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t only release the default dashboard port during gateway cleanup.

This cleanup path tears down the gateway but only stops DASHBOARD_PORT. NemoClaw can be forwarding on an auto-allocated or explicit port, so those SSH forwards survive the cleanup and can later show up as bogus host listeners that block onboarding. The same default-port assumption is repeated in the other new cleanup call sites below.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 3545 - 3551, The cleanup currently only
stops DASHBOARD_PORT (calling bestEffortForwardStop(runOpenshell,
DASHBOARD_PORT)), which leaves other NemoClaw forwards (auto-allocated or
explicit ports) active and can create stale host listeners; update the gateway
teardown path in the block using destroyGatewayForReuse and
bestEffortForwardStop to stop all active forwards instead of just DASHBOARD_PORT
— either by retrieving the gateway's list of forwarded ports and calling
bestEffortForwardStop for each, or by adding/using a helper (e.g.,
stopAllForwards or similar) that accepts runOpenshell/gatewayForwardState and
stops every forwarded port before calling destroyGatewayForReuse; ensure you
update the other new cleanup call sites that mirror this pattern so all
forwarded ports are cleaned up, referencing bestEffortForwardStop, runOpenshell,
DASHBOARD_PORT, and destroyGatewayForReuse to locate the changes.

8981-8981: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat NEMOCLAW_DASHBOARD_PORT as an explicit preflight override.

Line 8981 only seeds _preflightDashboardPort from opts.controlUiPort. If the operator sets NEMOCLAW_DASHBOARD_PORT=23000 and the default 18789-18799 range is full, this still takes the auto-allocation branch, runs the new range-exhaustion check, and exits even though 23000 is the intended dashboard port. The explicit-port path and the fail-fast skip need to use the resolved env override too.

💡 Suggested fix
-  _preflightDashboardPort = opts.controlUiPort || null;
+  const envDashboardPort =
+    process.env.NEMOCLAW_DASHBOARD_PORT != null ? DASHBOARD_PORT : null;
+  _preflightDashboardPort = opts.controlUiPort ?? envDashboardPort ?? null;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` at line 8981, The code only initializes
_preflightDashboardPort from opts.controlUiPort; treat the environment override
NEMOCLAW_DASHBOARD_PORT as an explicit preflight override by reading
process.env.NEMOCLAW_DASHBOARD_PORT (if present) when setting
_preflightDashboardPort so that the explicit-port path is taken; update the
dashboard-preflight logic that runs the range-exhaustion check (the
auto-allocation branch) to skip that check when _preflightDashboardPort is
resolved from either opts.controlUiPort or process.env.NEMOCLAW_DASHBOARD_PORT
so the operator-provided port (23000) is honored instead of triggering
range-exhaustion exit.

9328-9350: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resume still bypasses the new dashboard-range fail-fast.

When cached preflight is reused here, the new guard at Line 3827 never runs. If the default range becomes fully occupied after the initial attempt, --resume can still progress into later steps and only fail in dashboard allocation after sandbox work has started, which is exactly the late failure this PR is trying to move forward.

💡 Suggested fix
       assertCdiNvidiaGpuSpecPresent(assessHost(), resumeOptedOutGpuPassthrough);
       validateSandboxGpuPreflight(resumeSandboxGpuConfig);
+      if (_preflightDashboardPort === null) {
+        preflightDashboardPortRangeAvailability();
+      }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 9328 - 9350, Resume path skips the new
dashboard-range fail-fast guard; ensure the same dashboard allocation check used
in the normal preflight path is invoked during resume so we fail fast on range
exhaustion. Modify the resumePreflight branch (the if (resumePreflight) block
that calls skippedStepMessage, resolveSandboxGpuConfig,
exitOnSandboxGpuConfigErrors, assertCdiNvidiaGpuSpecPresent, and
validateSandboxGpuPreflight) to call the dashboard-range fail-fast function used
by the non-resume flow (the same guard invoked outside this block) before
proceeding to sandbox GPU validation so resume cannot bypass dashboard
allocation checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 3545-3551: The cleanup currently only stops DASHBOARD_PORT
(calling bestEffortForwardStop(runOpenshell, DASHBOARD_PORT)), which leaves
other NemoClaw forwards (auto-allocated or explicit ports) active and can create
stale host listeners; update the gateway teardown path in the block using
destroyGatewayForReuse and bestEffortForwardStop to stop all active forwards
instead of just DASHBOARD_PORT — either by retrieving the gateway's list of
forwarded ports and calling bestEffortForwardStop for each, or by adding/using a
helper (e.g., stopAllForwards or similar) that accepts
runOpenshell/gatewayForwardState and stops every forwarded port before calling
destroyGatewayForReuse; ensure you update the other new cleanup call sites that
mirror this pattern so all forwarded ports are cleaned up, referencing
bestEffortForwardStop, runOpenshell, DASHBOARD_PORT, and destroyGatewayForReuse
to locate the changes.
- Line 8981: The code only initializes _preflightDashboardPort from
opts.controlUiPort; treat the environment override NEMOCLAW_DASHBOARD_PORT as an
explicit preflight override by reading process.env.NEMOCLAW_DASHBOARD_PORT (if
present) when setting _preflightDashboardPort so that the explicit-port path is
taken; update the dashboard-preflight logic that runs the range-exhaustion check
(the auto-allocation branch) to skip that check when _preflightDashboardPort is
resolved from either opts.controlUiPort or process.env.NEMOCLAW_DASHBOARD_PORT
so the operator-provided port (23000) is honored instead of triggering
range-exhaustion exit.
- Around line 9328-9350: Resume path skips the new dashboard-range fail-fast
guard; ensure the same dashboard allocation check used in the normal preflight
path is invoked during resume so we fail fast on range exhaustion. Modify the
resumePreflight branch (the if (resumePreflight) block that calls
skippedStepMessage, resolveSandboxGpuConfig, exitOnSandboxGpuConfigErrors,
assertCdiNvidiaGpuSpecPresent, and validateSandboxGpuPreflight) to call the
dashboard-range fail-fast function used by the non-resume flow (the same guard
invoked outside this block) before proceeding to sandbox GPU validation so
resume cannot bypass dashboard allocation checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fefc6d8c-a930-484a-b4e7-033502152492

📥 Commits

Reviewing files that changed from the base of the PR and between a916031 and 37d5531.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

… gate resume path

CodeRabbit pointed out two follow-ups on PR #3980's fail-fast preflight:

1. The env override NEMOCLAW_DASHBOARD_PORT pins an explicit dashboard
   port the same way --control-ui-port does, but only --control-ui-port
   was disabling the new range-exhaustion fail-fast. Operators pinning
   18900 with the default 18789-18799 range full would still exit on
   preflight even though their port was outside the range. Seed
   _preflightDashboardPort from the resolved env-override too.

2. The new fail-fast guard ran in the normal preflight path only; the
   --resume path reuses cached preflight and would walk past an
   exhausted range, only failing later in dashboard allocation after
   sandbox work has started — exactly the late failure this PR is
   trying to prevent. Mirror the guard in the resume branch.

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nvshaxie
Copy link
Copy Markdown
Contributor Author

Thanks @coderabbitai — addressed the two findings that scope to this PR's preflight change in commit ec7e315:

  • 8981NEMOCLAW_DASHBOARD_PORT env override now disables the range-exhaustion fail-fast the same way --control-ui-port does. Implemented as suggested via the resolved DASHBOARD_PORT constant from ./core/ports.
  • 9328-9350 — added the same if (_preflightDashboardPort === null) preflightDashboardPortRangeAvailability() guard in the --resume branch so cached preflight no longer skips it.

Skipping 3545-3551 (and the mirrored cleanup call sites) — only releasing DASHBOARD_PORT in cleanup is pre-existing behavior unrelated to the fail-fast added here. Happy to file it as a separate issue if that's the right move.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 22, 2026

Tip

For best results, initiate chat on the files or code changes.

@nvshaxie: All three points make sense — great follow-through on the two in-scope fixes:

  • NEMOCLAW_DASHBOARD_PORT guard (line 8981): Using the resolved DASHBOARD_PORT constant from ./core/ports as the env-override signal is the right approach — it keeps the guard symmetric with the --control-ui-port path and avoids a raw string comparison.
  • --resume branch guard (lines 9328–9350): Adding the same if (_preflightDashboardPort === null) check there ensures cached-preflight runs are protected too. Good catch closing that gap.

On lines 3545–3551 — filing it as a separate issue is exactly the right call. Would you like me to open a follow-up issue in this repo to track the cleanup of those DASHBOARD_PORT release call sites?

🐇 ✨


✏️ Learnings added
Learnt from: nvshaxie
URL: https://github.com/NVIDIA/NemoClaw/pull/3980

Timestamp: 2026-05-22T02:05:42.745Z
Learning: In `src/lib/onboard.ts`, the `preflightDashboardPortRangeAvailability()` range-exhaustion fail-fast is disabled not only when `--control-ui-port` is provided (`_preflightDashboardPort !== null`) but also when `NEMOCLAW_DASHBOARD_PORT` env override is set (detected via the resolved `DASHBOARD_PORT` constant from `./core/ports`). Both explicit-port paths skip the default-range check.

Learnt from: nvshaxie
URL: https://github.com/NVIDIA/NemoClaw/pull/3980

Timestamp: 2026-05-22T02:05:42.745Z
Learning: In `src/lib/onboard.ts`, the `if (_preflightDashboardPort === null) preflightDashboardPortRangeAvailability()` guard is present in both the standard preflight branch and the `--resume` (cached-preflight) branch (around lines 9328–9350) so that range-exhaustion detection is not bypassed when onboarding resumes from a cached state.

Learnt from: nvshaxie
URL: https://github.com/NVIDIA/NemoClaw/pull/3980

Timestamp: 2026-05-22T02:05:42.745Z
Learning: The DASHBOARD_PORT release/cleanup call sites around lines 3545–3551 of `src/lib/onboard.ts` represent pre-existing behavior unrelated to the fail-fast preflight change introduced in PR `#3980`. Any cleanup or refactor of those call sites should be tracked as a separate issue.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 8981-8985: The new const envDashboardPort is both wrong (uses
DASHBOARD_PORT instead of the env value) and adds lines; replace the two-line
block with a single expression that uses the env var directly and preserves the
precedence: set _preflightDashboardPort = opts.controlUiPort ??
(process.env.NEMOCLAW_DASHBOARD_PORT ?? null); do the same collapse for the
other occurrence that mirrors this logic (the block around the symbols at the
second location).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a30a6e9f-573d-4e98-a9a2-bf83c2497dc5

📥 Commits

Reviewing files that changed from the base of the PR and between 37d5531 and ec7e315.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

Comment thread src/lib/onboard.ts Outdated
cv and others added 2 commits May 21, 2026 19:09
Replace the verbose 4-line env-override comment block with a single
inline ternary on the assignment line, and pack the resume guard onto
the existing validateSandboxGpuPreflight call. Semantically identical
to ec7e315; just brings src/lib/onboard.ts back under the
onboard-entrypoint-budget gate (additions ≤ deletions).

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
…eflight-3953

# Conflicts:
#	src/lib/onboard.ts
@github-actions
Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • None. No scenario workflow, scenario metadata, scenario runtime, or validation-suite files changed.

Optional scenario E2E

  • None.

Relevant changed files

  • None.

@cv cv added the v0.0.51 Release target label May 25, 2026
@cv cv merged commit 3e28848 into main May 25, 2026
22 of 23 checks passed
cv pushed a commit that referenced this pull request May 26, 2026
## Summary

- keep `~/.nemoclaw` directory hardening for user-owned config dirs
- skip directory `chmod` when the config dir is owned by another UID,
which preserves OpenClaw's root-owned `/sandbox/.nemoclaw` layout
- keep `config.json` restricted to `0600`
- add shell-level regression coverage for owned and non-owned config
directories

## Root Cause

PR #4054 changed the step-7 sandbox config sync script from writing
`~/.nemoclaw/config.json` to unconditionally running `chmod 700
~/.nemoclaw` first. In the OpenClaw sandbox image, `/sandbox/.nemoclaw`
is intentionally root-owned with mode `1755`, while
`/sandbox/.nemoclaw/config.json` is sandbox-owned and writable. The
directory chmod fails under `set -euo pipefail`, so `openshell sandbox
connect <sandbox>` exits 1 during:

`[7/8] Setting up OpenClaw inside sandbox`

This restores compatibility with that image contract without reverting
the file-permission hardening from #4054.

## Evidence

- Last known good public install: run `26413471038`, installed ref
`8ed9a2c04c80627deed519a7c5ffe8f95ebb5ddd`, `cloud-onboard-e2e` reached
`✓ OpenClaw gateway launched inside sandbox`.
- First observed failing public install: run `26414889786`, installed
ref `94fefa6fd08e33d48eede651e33770a0712da854`, `cloud-onboard-e2e`
failed at `openshell sandbox connect e2e-cloud-onboard`.
- The commit window contains #4054, #3980, and #3675; #4054 is the only
PR that changed the config-sync script executed at the failing boundary.

## Validation

- `npm ci --include=dev --ignore-scripts`
- `npx vitest run src/lib/onboard/config-sync.test.ts
--reporter=verbose`
- `npx @biomejs/biome check src/lib/onboard/config-sync.ts
src/lib/onboard/config-sync.test.ts`
- `git diff --check`
- `npm run build:cli`
- `npm run typecheck:cli`


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Improved NemoClaw configuration directory permission handling to be
more robust in different user ownership scenarios.
* Enhanced sandbox configuration script to conditionally apply
permissions based on directory ownership.

* **Tests**
* Added comprehensive integration tests for configuration directory
permission management and ownership verification.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4199?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression and removed NemoClaw CLI labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output bug Something fails against expected or documented behavior bug-fix PR fixes a bug or regression v0.0.51 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[NemoClaw][All Platforms][Onboard] Onboard does not fail fast when all dashboard ports 18789-18799 are occupied by external processes

3 participants