
feat(setup-ui): wizard-first first-run + operator admin console#145

Open
ndee wants to merge 34 commits into main from feat/setup-ui


@ndee ndee commented Apr 25, 2026

Summary

Replaces the JSON-textarea install screen and the admin-shell-first navigation with a guided, wizard-first first-run experience and a task-oriented operator admin console. The technical foundation from the original PR is preserved: Fastify-served static SPA, no-build-step Preact + htm, vendored deps, CSRF + cookie sessions, two-stage auth (bootstrap-token → Matrix-password), and all install/preflight/test-connection/install-job/status/reconfigure/onboarding routes.

The UI is now split into two distinct modes inside the same SPA, with mode detected client-side from /api/status (no new backend endpoint needed):

  • Mode A — first-run wizard (default when not initialized): 9 sequential steps under #/setup/* — Welcome, Preflight, Matrix, Mailbox, Provider, Modules, Review, Progress, Done.
  • Mode B — admin / reconfiguration console (default when initialized): task-oriented pages under #/admin/* — Status (operator overview), Mailbox / Matrix / Provider (Test → Apply), Onboarding, Recovery.

After a successful install, the wizard auto-issues a Matrix onboarding code, surfaces the Element homeserver URL and alert room as the primary CTA, and points the operator at Matrix as the day-2 control plane. The flow ends in Matrix, not in the web UI.

Backend

  • No new HTTP routes. The wizard reuses existing POST /api/install/preflight, POST /api/install/test-imap, POST /api/install/test-matrix, POST /api/install/run, GET /api/install/jobs/:jobId, POST /api/onboarding/issue, GET /api/onboarding/state. The reconfigure pages reuse POST /api/reconfigure/{imap,matrix,openrouter} and the same test-imap / test-matrix endpoints.
  • One contract test added in src/contracts/install.test.ts: the exact JSON shape the wizard emits is asserted to round-trip through installRequestSchema. This is the single guard against frontend/backend drift.

Frontend

New mode-aware host (public/setup-ui/app.js) detects mode from /api/status (installationId && version.provenance ⇒ admin; otherwise first-run), enforces redirects between modes, redirects legacy hash routes (#/install, #/status, #/reconfigure/*, #/onboarding) to the new IA for one release cycle, and surfaces a "Could not load status / Retry" splash on 5xx instead of falling back to first-run.
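
The mode rule above can be sketched as follows. This is a minimal illustration, not the code in app.js: the `StatusSnapshot` type, the `resolveView` helper, and the choice to treat a missing body as an error are assumptions; only the `installationId && version.provenance` rule and the 5xx splash behaviour come from the description.

```typescript
// Hypothetical subset of the /api/status payload; only the two fields named
// in the PR description are modelled.
interface StatusSnapshot {
  installationId?: string | null;
  version?: { provenance?: string | null } | null;
}

type UiMode = "first-run" | "admin";
type HostView = UiMode | "status-error";

// Stated rule: installationId && version.provenance => admin, else first-run.
function detectMode(status: StatusSnapshot): UiMode {
  const initialized =
    Boolean(status.installationId) && Boolean(status.version?.provenance);
  return initialized ? "admin" : "first-run";
}

// A 5xx must not be read as "not installed": show the retry splash instead.
// Treating a missing body the same way is an assumption of this sketch.
function resolveView(httpStatus: number, body: StatusSnapshot | null): HostView {
  if (httpStatus >= 500 || body === null) return "status-error";
  return detectMode(body);
}
```

Keeping the 5xx branch separate from `detectMode` is what stops a temporarily failing backend from dropping an installed node back into the first-run wizard.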

Wizard (public/setup-ui/screens/wizard/):

  • 9 step components plus a shared WizardShell and Stepper.
  • Form state (NOT secrets) persists to localStorage under sov:setup-ui:wizard:v1; secrets re-prompted on rehydrate.
  • Resume mid-install: if a jobId is in localStorage, the wizard reads the job state on load and jumps to Progress / Success / Review-with-error as appropriate.
  • Matrix and Mailbox steps each have a "Test connection" button using the existing test endpoints.
  • Progress step groups raw job step IDs into operator-readable phases (Preparing runtime, Connecting Matrix, Connecting mailbox, Activating modules, Activating Mail Sentinel, Finalizing node) with the raw step list available behind a <details>.
  • Success step auto-issues an onboarding code (unless one is already outstanding), renders nextSteps.elementHomeserverUrl as the primary CTA, and surfaces operator username + alert room name + onboarding code with copy buttons.
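
A minimal sketch of the "form state, not secrets" persistence described in the list above. The storage key is the one named in the description; the field names, the `SECRET_FIELDS` list, and the `KeyValueStore` interface (a stand-in for `localStorage`) are hypothetical.

```typescript
const STORAGE_KEY = "sov:setup-ui:wizard:v1"; // key named in the PR description

// Hypothetical field names; the real wizard state shape is not shown in the PR.
interface WizardState {
  homeserverUrl: string;
  imapHost: string;
  imapPort: number;
  imapPassword: string; // secret
  providerApiKey: string; // secret
}

const SECRET_FIELDS = ["imapPassword", "providerApiKey"] as const;
type SecretField = (typeof SECRET_FIELDS)[number];
type PersistedState = Omit<WizardState, SecretField>;

// Structural stand-in for window.localStorage so the sketch runs anywhere.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Writes everything except the secret fields.
function persist(store: KeyValueStore, state: WizardState): void {
  const copy: Partial<WizardState> = { ...state };
  for (const field of SECRET_FIELDS) delete copy[field];
  store.setItem(STORAGE_KEY, JSON.stringify(copy));
}

// Secrets come back empty so the operator is prompted for them again.
function rehydrate(store: KeyValueStore): WizardState | null {
  const raw = store.getItem(STORAGE_KEY);
  if (raw === null) return null;
  const saved = JSON.parse(raw) as PersistedState;
  return { ...saved, imapPassword: "", providerApiKey: "" };
}
```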

Admin (public/setup-ui/screens/admin/):

  • Status.js — operator-overview rollup. Worst-of all health values for the headline; critical-action banner with explicit CTAs ("Reconfigure mailbox", etc.); healthy / warning / error groups; existing flat services table preserved inside a collapsible <details>.
  • Recovery.js — sign-post page only. Shows current auth stage, links to onboarding-code re-issue, prints the exact sudo sovereign-node setup-ui issue-bootstrap-token command for terminal recovery, and points at sovereign-node doctor / logs / reconfigure for break-glass.
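
The "worst-of" headline and the healthy / warning / error grouping on the Status page could look roughly like this. The three health values come from the description; the `ServiceHealth` shape, the severity ordering, and the choice to report an empty list as healthy are assumptions of this sketch.

```typescript
type Health = "healthy" | "warning" | "error";

// Assumed ordering: error outranks warning, warning outranks healthy.
const SEVERITY: Record<Health, number> = { healthy: 0, warning: 1, error: 2 };

// Hypothetical per-service record.
interface ServiceHealth {
  name: string;
  health: Health;
}

// Headline value: the most severe health among all inputs.
function worstOf(values: Health[]): Health {
  let worst: Health = "healthy";
  for (const v of values) {
    if (SEVERITY[v] > SEVERITY[worst]) worst = v;
  }
  return worst;
}

// Buckets service names by health for the grouped view.
function groupByHealth(services: ServiceHealth[]): Record<Health, string[]> {
  const groups: Record<Health, string[]> = { healthy: [], warning: [], error: [] };
  for (const s of services) groups[s.health].push(s.name);
  return groups;
}
```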

Reconfigure (public/setup-ui/screens/reconfigure.js rewritten): each form (mailbox, Matrix, provider) drives an editing → testing → tested-ok | tested-fail → applying → applied-ok | applied-fail state machine. Apply is disabled until a passing test (where applicable). Editing any field after a passing test resets the form to editing. After Apply, restartRequiredServices is rendered in plain language ("OpenClaw gateway will restart to pick up the change…") instead of raw service names.
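
The Test → Apply state machine can be sketched as a pure transition function. The state names are taken from the paragraph above; the event names, the `requiresTest` flag (false for the provider form, which applies without a test), and the rule that events are ignored while a request is in flight are assumptions.

```typescript
type FormState =
  | "editing" | "testing" | "tested-ok" | "tested-fail"
  | "applying" | "applied-ok" | "applied-fail";

// Hypothetical event names.
type FormEvent =
  | { type: "edit" }
  | { type: "test" }
  | { type: "test-result"; ok: boolean }
  | { type: "apply" }
  | { type: "apply-result"; ok: boolean };

// Apply is enabled only after a passing test, unless the form needs no test.
function canApply(state: FormState, requiresTest: boolean): boolean {
  if (state === "testing" || state === "applying") return false;
  return requiresTest ? state === "tested-ok" : true;
}

function transition(state: FormState, event: FormEvent, requiresTest: boolean): FormState {
  const inFlight = state === "testing" || state === "applying";
  switch (event.type) {
    case "edit":
      // Any edit invalidates an earlier test result.
      return inFlight ? state : "editing";
    case "test":
      return inFlight ? state : "testing";
    case "test-result":
      return state === "testing" ? (event.ok ? "tested-ok" : "tested-fail") : state;
    case "apply":
      return canApply(state, requiresTest) ? "applying" : state;
    case "apply-result":
      return state === "applying" ? (event.ok ? "applied-ok" : "applied-fail") : state;
  }
}
```

Keeping `canApply` separate lets the same predicate drive both the button's disabled attribute and the transition guard.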

CSS — no design-token changes. The existing palette (#050816 background, #f97317 orange CTA, Space Grotesk + Inter Tight) and component classes (.card, .field, .btn, .alert, .badge, .steps, .kv, .table, .code-block) are reused. New classes added for wizard layout (.wizard-shell, .stepper, .module-list, .btn--xl, .bullet-list, etc.) using the same tokens.

Removed: public/setup-ui/screens/install.js (JSON textarea) and public/setup-ui/screens/status.js (flat table).

Tests & checks

  • pnpm typecheck, pnpm lint, pnpm test all green (388 tests passing, was 327 — same set, no regressions).
  • pnpm test:coverage:integration and :unit — green; coverage thresholds untouched.
  • pnpm build + postbuild copies all new wizard / admin / components files into dist/public/setup-ui/.

Out of scope (deferred follow-ups)

  • Persistent rollback on reconfigure failure. Today reconfigure surfaces restartRequiredServices and validation checks but the backend does not auto-revert on partial failure. Out of scope here.
  • POST /api/reconfigure/{imap,matrix}/test against currently-persisted secrets. The current reconfigure forms require the operator to retype the password to test — honest behaviour for "secret-bearing reconfiguration", but worth a dedicated test endpoint later.
  • Browser tests / Playwright. The repo has no JS test runner today; introducing one is its own PR.
  • Advanced/JSON mode toggle on the wizard's Review step (the original install JSON-paste UX, demoted to opt-in). Removed cleanly here; reintroduce via #/setup?json=1 later if there is demand.
  • Real module picker. Today the Modules step is read-only ("mail-sentinel + node-operator on, customise coming soon") because there are no other first-party bots to pick. Becomes a real picker once that changes.

Test plan

  • pnpm typecheck, pnpm lint, pnpm test all green
  • pnpm build produces dist/public/setup-ui/screens/wizard/ and …/admin/ and …/components/
  • pnpm test:coverage:integration and :unit green
  • Local smoke: pnpm dev:api → http://127.0.0.1:8787/ → redirects to /setup-ui/ → wizard appears (because no install) → walk all 9 steps → arrive at Done with onboarding code rendered
  • After a successful install, refresh → default lands on #/admin/status with operator-overview rollup. #/setup redirects to #/admin/status. #/setup?force=1 re-enters the wizard.
  • Old hash routes (#/install, #/status, #/reconfigure/imap, #/reconfigure/matrix, #/reconfigure/openrouter, #/onboarding) redirect to their new equivalents.
  • Reconfigure pages: Apply is disabled until a passing Test (mailbox, Matrix); editing a field after a passing test resets to editing. Apply on provider works without a Test.
  • Ephemeral Proxmox dev VM: proxmox-pool-vm dev create, install via scripts/install.sh, tunnel via ssh -L 8787:127.0.0.1:8787, walk the wizard end-to-end on a fresh node, then validate the upgrade scenario, then proxmox-pool-vm dev delete.

ndee added 5 commits April 25, 2026 05:32
Adds a Preact + htm setup UI served by the existing Fastify API at
127.0.0.1:8787, wrapping install, status, reconfigure, and Matrix
onboarding-code re-issue. No build step — vendored Preact 10.24.3 and htm
3.1.1 ESM bundles are committed under public/setup-ui/vendor/.

Backend
- @fastify/static mounts the UI at /setup-ui/, with `/` redirecting to it.
- Per-path Cache-Control: no-cache for app code, immutable for vendor.
- New POST /api/onboarding/issue and GET /api/onboarding/state, with the
  state endpoint stripping codeHash/codeSalt/passwordSecretRef before
  returning. Re-issue is the same primitive the CLI uses; the UI confirms
  before overwriting an outstanding code.
- New `getMatrixOnboardingState` on the InstallerService interface,
  implemented by RealInstallerService and stubbed for dev.
- `tsup` postbuild copies public/ into dist/public/; runtime path
  resolution handles dev (tsx) and built (dist) layouts.

UI
- Visual identity matches sovereign-ai-node.com (dark navy #050816
  background, orange #f97317 CTA, Space Grotesk + Inter Tight).
- Hash router with four screens (Install, Status, Reconfigure × 3,
  Onboarding). Install posts the request JSON and polls
  /api/install/jobs/:jobId; Status polls /api/status every 5 s.

Out of scope (deferred)
- Auth / token gating: API stays open on 127.0.0.1.
- Browser-level UI tests: project has no JS test runner yet; backend
  routes are at 100 % coverage. Playwright is queued for a follow-up.
…ord auth

Layers password auth on top of the localhost setup UI from the previous
commit and adds an opt-in LAN bind via the existing --api-host flag.
Auth is **always on**, regardless of bind address.

## Two-phase auth

- **Pre-install** (readRuntimeConfig throws CONFIG_NOT_FOUND): one-time
  **bootstrap token** redeemable on /api/auth/login. Generated and
  printed by the installer at end of post-install diagnostics; reissue
  via `sudo sovereign-node setup-ui issue-bootstrap-token`.
- **Post-install** (runtime config + operator password file present):
  operator's **Matrix password**, validated via Synapse
  /_matrix/client/v3/login. Same credential as Element.

Synapse-down vs wrong password: 503 HOMESERVER_UNREACHABLE vs 401
INVALID_CREDENTIALS, distinguishable in the UI.

## Backend

- `src/api/auth/sessions.ts` — in-memory store, 8h TTL, no refresh.
- `src/api/auth/rate-limit.ts` — per-IP sliding window, 5 fails / 15min.
- `src/api/auth/middleware.ts` — Fastify preHandler. Allow-lists
  `/`, `/healthz`, `/setup-ui/*`, `GET /api/auth/state`,
  `POST /api/auth/login`. Everything else requires `sov_session`
  cookie. POST/PUT/DELETE/PATCH require X-CSRF-Token == sov_csrf
  cookie (double-submit).
- `src/api/routes/auth.ts` — GET /api/auth/state (also reports csrf
  when authenticated), POST /api/auth/login, POST /api/auth/logout.
  Login route handles rate-limit, schema validation, stage routing.
- @fastify/cookie added; cookies HttpOnly (sov_session) +
  SameSite=Strict + Path=/ + Max-Age=28800. No Secure (LAN HTTP).
- New service interface methods on InstallerService:
  getAuthStage, issueSetupUiBootstrapToken, getSetupUiBootstrapState,
  consumeSetupUiBootstrapToken, verifyOperatorPassword.
- Bootstrap-state file at `{paths.stateDir}/setup-ui/bootstrap-state.json`
  (mode 0o600), written via writeInstallerJsonFile.
- Token primitives in `src/onboarding/setup-ui-bootstrap.ts` mirror
  the Matrix onboarding-code shape (24h TTL, max 5 attempts,
  timing-safe compare).
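
As an illustration of the per-IP sliding window mentioned for `src/api/auth/rate-limit.ts` (5 failures per 15 minutes), here is a minimal sketch. The class name, method names, and the explicit `now` parameter are hypothetical; the real module may be structured differently.

```typescript
// Sliding-window failure counter keyed by client IP (hypothetical API).
class FailureLimiter {
  private readonly failures = new Map<string, number[]>();
  private readonly maxFailures: number;
  private readonly windowMs: number;

  constructor(maxFailures: number = 5, windowMs: number = 15 * 60 * 1000) {
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
  }

  // Drops timestamps that have aged out of the window and returns the rest.
  private recent(ip: string, now: number): number[] {
    const kept = (this.failures.get(ip) ?? []).filter((t) => now - t < this.windowMs);
    this.failures.set(ip, kept);
    return kept;
  }

  recordFailure(ip: string, now: number): void {
    this.recent(ip, now).push(now);
  }

  // Blocked once the number of failures inside the window reaches the limit.
  isBlocked(ip: string, now: number): boolean {
    return this.recent(ip, now).length >= this.maxFailures;
  }
}
```

Passing `now` in explicitly, rather than calling `Date.now()` inside, keeps the window logic deterministic under test.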

## Frontend

- `public/setup-ui/screens/login.js` — bootstrap-token form +
  password form, mode driven by /api/auth/state.
- `public/setup-ui/api.js` — tracks csrf in module scope, sends
  X-CSRF-Token on POSTs, dispatches `sov:unauth` on 401.
- `public/setup-ui/app.js` — useAuthState hook gates the app shell;
  Sign out button in nav.

## Installer

- `scripts/install/lib-matrix-urls.sh` — print_setup_ui_access_guidance
  prints URL + bootstrap token + LAN-bind hint at end of install.
- `scripts/install.sh` — calls the printer after Matrix onboarding.
- `scripts/install/lib-log.sh` — --api-host help clarifies LAN expose.
- New CLI subcommand `sovereign-node setup-ui issue-bootstrap-token`.

## Tests

- 100% coverage on every new/changed file (integration + unit suites).
- 387 vitest tests pass, lint + typecheck green.

## Out of scope (still)

- HTTPS / TLS termination; LAN HTTP only.
- Internet exposure; CSRF posture and rate limit assume LAN trust.
- Multi-user; one operator credential.
- Browser-level UI tests; Playwright is queued for a follow-up PR.
The installer creates `{stateDir}/setup-ui/bootstrap-state.json` while
running as root, so the parent directory ends up owned by root:root.
At runtime the API service runs as `sovereign-node`; its attempt to
write the redeem-result temp file fails with EACCES because the
directory itself isn't writable by that user.

Add `ensureSetupUiStateDirOwnership` (mkdir + applyRuntimeOwnership)
and call it from both `issueSetupUiBootstrapToken` and
`consumeSetupUiBootstrapToken` so the directory is chown'd to the
runtime ownership when running as root and is a no-op otherwise.

Discovered while bringing the dev VM online for end-to-end UI
verification: the login POST returned 500 with EACCES on the
bootstrap-state temp file path.
The install screen seeded the JSON editor with `openai/gpt-4o-mini`
and the reconfigure-OpenRouter form's placeholder mentioned the same
model. Both diverged from the canonical default in
scripts/install.sh:31 (`RECOMMENDED_OPENROUTER_MODEL=qwen/qwen3.5-9b`).
Align both UI references with the installer's recommendation so the
defaults match across CLI and UI.
Replace the JSON-textarea install screen with a guided 9-step wizard
(Welcome → Preflight → Matrix → Mailbox → Provider → Modules → Review →
Progress → Done) and split the SPA into two distinct modes: first-run
(default when the node is not initialized) and admin (default once it
is). Mode is detected client-side from /api/status (no new endpoint).

Status becomes operator-overview-first: an overall health rollup,
critical-action banner, healthy/warning/error groups, with the existing
flat services table demoted into a collapsible <details>. Reconfigure
forms gain a Test → Apply state machine using the existing
/api/install/test-imap and /api/install/test-matrix endpoints; Apply
stays disabled until a passing test, and restartRequiredServices is
rendered in plain language.

Wizard form state (no secrets) persists to localStorage so an operator
can close and reopen the laptop mid-setup. Secrets are only kept in
component state and submitted once at install time. On success, the
wizard auto-issues a Matrix onboarding code, surfaces the Element
homeserver URL and alert room as the primary CTA, and points the
operator at Matrix as the day-2 control plane.

Old hash routes (#/install, #/status, #/reconfigure/*, #/onboarding)
redirect to the new IA for one release cycle. Adds a contract test
asserting the wizard-generated payload round-trips through
installRequestSchema, guarding against frontend/backend drift.
@ndee ndee changed the title from "feat(api): localhost setup & admin web UI" to "feat(setup-ui): wizard-first first-run + operator admin console" Apr 25, 2026
ndee added 24 commits April 26, 2026 17:21
The wizard-first refactor accidentally rewrote screens/reconfigure.js
with `./` import specifiers (as if the file lived under screens/admin/),
breaking the entire module graph: app.js imports Reconfigure at module
load, ES modules fail the whole graph when one specifier 404s, and the
result was a black page on every route — including the wizard.

Restore `../` paths so the file resolves to setup-ui/{vendor,api,forms,
components}/* the way every other screen in the tree does.
Add a three-card picker at the top of the Matrix wizard step — Public
site, Local LAN, Local dev — so the operator chooses how their node
will be reached before filling in URLs.

- Public: empty defaults (operator owns the real domain). Surfaces the
  router/firewall requirements (DNS, :80 for ACME, :443 for Element,
  :8448 if federation) so an operator can decide whether a public
  install is realistic before they get there.
- Local LAN: prefills https://matrix.lan.local with tlsMode "internal"
  (Caddy local CA) and explains the "trust this CA on each device"
  expectation.
- Local dev: prefills http://127.0.0.1:8008 with tlsMode "local-dev"
  for a try-it-out smoke that needs no DNS, no certs, no router.

Wires tlsMode through the wizard state hook and into the generated
installRequest. Adds two contract tests covering the local-dev and
local-LAN payload shapes so a frontend/backend drift on tlsMode breaks
the build first. Federation toggle hidden on Local dev (it's pointless
without TLS).
Pre-install, the bundled Synapse doesn't exist yet, so testing a
homeserver URL on Local LAN or Local dev can only ever fail with a red
"Matrix homeserver could not be reached" banner — confusing for an
operator who is doing exactly what the wizard told them to do.

On Public site keep the button (you may already have a homeserver
running, and verifying it before install is genuinely useful), but
clarify it's optional. On Local LAN and Local dev replace the button
with a dim one-liner explaining the bundled installer will create the
homeserver during install.
POST /api/install/run is synchronous on the server: it runs every step
in order and only returns the response after the job reaches a terminal
state. The wizard previously rendered "Starting install…" for the entire
duration of that POST (often 50+ s) with no other feedback, then on
failure silently navigated back to Review with no indication of *why*
the install failed. Operators retried, creating multiple ghost jobs.

Three fixes:

1. While the POST is in flight, surface a "Submitting install request…"
   banner with a live elapsed-seconds counter, plus a one-line note that
   explains why the request is taking a while. Replaces the static
   "Starting install…" string.

2. On a terminal failure, stay on the Progress step instead of
   navigating away. Render a clear "Install failed at <step>" alert with
   the error code and message, the per-phase progress list (so the
   operator sees how far it got), and a collapsible per-step detail
   pane that includes each step's error.

3. Add explicit "Back to review" and "Try again" buttons on the failure
   screen. "Try again" reuses the same request and bumps a retry
   counter so the effect re-fires; "Back to review" lets the operator
   change inputs first.

Wizard host wiring updated to pass onBackToReview instead of onFailed.
The bundled-Matrix install fails at the "Configure OpenClaw runtime"
step on a fresh node with:

  LOBSTER_INSTALL_FAILED: npm install for Lobster CLI exited with non-zero status
  EACCES: permission denied, mkdir '/usr/lib/node_modules/@clawdbot'

Root cause: real-service.ts calls setManagedOpenClawEnv during install,
which mutates process.env.HOME globally to /var/lib/sovereign-node/openclaw-home
so OpenClaw subprocesses see the right home directory. That mutation
then leaks into every later child process, including the unrelated
`npm install -g @clawdbot/lobster` that ensureLobsterCliInstalled
spawns. With HOME pointed at openclaw-home (which has no .npmrc), npm
falls back to the system default global prefix /usr/lib/node_modules,
which is root-owned, and fails with EACCES.

Fix: capture the API service's original HOME at module load time,
before any other installer code can touch process.env, and pass an
explicit env to every npm/lobster invocation that anchors HOME and
npm_config_prefix to the service user's npm-global. This bypasses the
process.env mutation without requiring the wider refactor of
real-service.ts's env handling.
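
A rough sketch of the idea: snapshot HOME before anything mutates the environment, then build an explicit env for npm. The function names and the `<home>/.npm-global` prefix layout are assumptions here; in the real service the environment would be `process.env`, captured at module load.

```typescript
type Env = Record<string, string | undefined>;

// Called once, before any installer code can change HOME.
function captureOriginalHome(env: Env): string | undefined {
  return env["HOME"];
}

// Builds the env passed explicitly to npm / lobster child processes.
// The ".npm-global" suffix is an assumed layout for the service user's prefix.
function buildNpmEnv(currentEnv: Env, originalHome: string): Env {
  return {
    ...currentEnv,
    HOME: originalHome,
    npm_config_prefix: `${originalHome}/.npm-global`,
  };
}
```

Because the child env is built per invocation, a later mutation of the shared environment no longer changes where npm installs global packages.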

The structural problem in real-service.ts (mutating process.env across
the entire process lifetime) is left as a separate follow-up.

Adds src/installer/real-service-lobster.test.ts (8 tests) covering the
env shape and the install-success / install-failure paths.
Two small fixes the laptop-side smoke run flagged:

1. Copy bug: the sub-headline rendered the literal `&amp;` because
   htm doesn't decode HTML entities in text nodes. Use the raw `&`
   character instead, matching `Local setup` / `Admin console` in
   WizardShell and AdminNav.

2. Recovery affordance: when sign-in fails with any
   BOOTSTRAP_TOKEN_* error code (CONSUMED, EXPIRED, INVALID, LOCKED,
   NOT_ISSUED), the failure was a dead-end banner. The instruction
   for re-issuing a token via CLI was only in the form's intro copy,
   far from the failure context. Surface a contextual info card with
   the exact `sudo sovereign-node setup-ui issue-bootstrap-token`
   command directly under the error banner so an operator sees
   recovery in the failure context.
Three small UX fixes the laptop browser walk surfaced:

1. Mailbox port spin-button advertised valuemax=0 / valuemin=0 to
   assistive tech because NumberInput took no min/max props. Add
   optional min, max, step props to NumberInput; have MailboxStep
   pass min=1, max=65535 so the constraint is announced correctly.

2. Install-progress phase rollup ignored the "warned" and "canceled"
   step states. A phase whose only sub-step ended in `warned` (e.g.
   imap_validate ending in IMAP_TEST_FAILED→WARNED) read as PENDING
   in the summary even after execution had passed it. Add `warned`
   and `canceled` to both the rollup logic and the tone map. The
   summary now resolves to `warned` (yellow) when any sub-step warned
   and nothing failed, and to `canceled` (red) when canceled,
   matching the tone of the per-step badges below.

3. Drive-by: the "mixed terminal + pending" branch is now explicit
   instead of falling out of an "every-or-nothing" check, which fixes
   a real but rare drift where some sub-steps had terminated but
   others hadn't yet, leaving the bucket stuck on `pending`.
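
The rollup rules from fixes 2 and 3 can be sketched as below. Only `warned`, `canceled`, and `pending` are named in the commit message; the other state names, the precedence of `failed` over `canceled`, and reporting a mixed terminal + pending phase as `running` are assumptions of this sketch.

```typescript
// "warned", "canceled" and "pending" come from the commit message;
// the remaining state names are assumed.
type StepState = "pending" | "running" | "succeeded" | "warned" | "failed" | "canceled";

const TERMINAL: ReadonlySet<StepState> = new Set<StepState>([
  "succeeded", "warned", "failed", "canceled",
]);

function rollupPhase(steps: StepState[]): StepState {
  if (steps.length === 0) return "pending";
  if (steps.includes("failed")) return "failed";
  if (steps.includes("canceled")) return "canceled";
  const terminalCount = steps.filter((s) => TERMINAL.has(s)).length;
  if (terminalCount === steps.length) {
    // Everything finished: any warning downgrades the phase to "warned".
    return steps.includes("warned") ? "warned" : "succeeded";
  }
  // Explicit mixed branch: some sub-steps done, others not yet.
  if (terminalCount > 0 || steps.includes("running")) return "running";
  return "pending";
}

// Yellow for warned and red for canceled follow the commit message;
// the other tone names are placeholders.
const TONE: Record<StepState, string> = {
  pending: "neutral",
  running: "info",
  succeeded: "green",
  warned: "yellow",
  failed: "red",
  canceled: "red",
};
```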
…onfig

The bundled-Matrix install wrote a top-level
channels.matrix.homeserver in openclaw.json5 alongside
channels.matrix.accounts.<id>.homeserver. Modern OpenClaw treats the
top-level field as the single-account legacy shape, and the doctor
flags it on every CLI invocation:

  Doctor changes:
    Moved channels.matrix single-account top-level values into
    channels.matrix.accounts.default.

The doctor surfaces this as an interactive clack-style prompt on stdout
of every `openclaw …` invocation, including `openclaw cron list --json`.
The JSON parse fails, the gateway WebSocket closes, and the install
fails at bots_configure with MANAGED_AGENT_REGISTER_FAILED.

Drop the top-level homeserver from the channels.matrix block. The
modern shape keeps homeserver under accounts.<id>.homeserver, which
the same writer already produces. No other change needed: the matrix
plugin still loads because plugins.entries.matrix.enabled stays true,
and per-account access tokens, group routing, etc. were already on
the accounts.<id> objects.

Test asserts the field is absent.
…w config

Iter-1 of the wizard E2E narrowed the install failure at bots_configure
to OpenClaw's doctor migration prompt bleeding onto stdout of `openclaw
cron list --json`:

  Doctor changes:
    Moved channels.matrix single-account top-level values into
    channels.matrix.accounts.default.

The first fix dropped channels.matrix.homeserver but the migration
still triggered. Iter-2 inspected the live config and found the
remaining trigger fields: top-level dm / groupPolicy / groupAllowFrom
/ groups. Doctor treats any of those as "single-account top-level
values" and migrates them into accounts.default. Drop them all.

Each entry in matrixAccounts already carries its own per-account dm
/ groupPolicy / groupAllowFrom / groups, so behavior is preserved.
Tests updated to assert top-level absence and read federation policy
from accounts.<id>.{dm,groupPolicy,groups}.
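
To make the shape change concrete, the sketch below strips the top-level fields named in this and the previous commit while leaving the per-account objects untouched. The `MatrixChannelConfig` type and the helper name are hypothetical; the actual writer presumably never emits these keys rather than deleting them afterwards.

```typescript
// Top-level keys that the doctor treats as the legacy single-account shape,
// per the two commit messages.
const LEGACY_TOP_LEVEL_KEYS = [
  "homeserver", "dm", "groupPolicy", "groupAllowFrom", "groups",
] as const;

// Hypothetical, loosely typed view of the channels.matrix block.
interface MatrixChannelConfig {
  accounts: Record<string, Record<string, unknown>>;
  [key: string]: unknown;
}

// Returns a copy without the legacy keys; per-account settings are kept.
function stripLegacyTopLevel(matrix: MatrixChannelConfig): MatrixChannelConfig {
  const cleaned: MatrixChannelConfig = { ...matrix };
  for (const key of LEGACY_TOP_LEVEL_KEYS) delete cleaned[key];
  return cleaned;
}
```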
OpenClaw doctor flags `gateway.mode is unset; gateway start will be
blocked` whenever the runtime config omits gateway.mode. The doctor
note is just a warning, but it correlates with the daemon refusing to
start, which makes every `openclaw cron list` call fail with
`gateway closed (1006 abnormal closure)`.

Set gateway.mode = local. We always run the gateway in loopback mode;
relay deployments still bind loopback and tunnel via the relay-tunnel
systemd service.
The runtime sovereign-node-api service runs as User=sovereign-node and
can't write /etc/systemd/system/sovereign-openclaw-gateway.service or
run systemctl, so the bundled-Matrix install fails after the
matrix_bootstrap_room step: the gateway never starts, and every
`openclaw cron list` call from bots_configure dies with
`gateway closed (1006 abnormal closure)`.

Two changes:

1. Drop a narrow sudoers fragment at bootstrap time
   (/etc/sudoers.d/sovereign-node-gateway) that lets the service user
   tee the gateway unit file and run systemctl daemon-reload, restart,
   enable --now, is-active, and status — all *only* against the
   sovereign-openclaw-gateway unit. Validated with `visudo -cf`; if
   invalid for any reason, the file is removed so we don't break sudo.

2. ensureSystemGatewayServiceFallback now tries plain writeFile first,
   falls back to `sudo -n tee` on EACCES/EPERM, and probes whether
   systemctl needs sudo before issuing the daemon-reload / restart /
   enable / is-active commands. Plain systemctl still works in the
   root install context (scripts/install.sh); sudo -n is used only
   when the runtime API is the caller.
NoNewPrivileges=true on the sovereign-node-api unit blocks sudo from
elevating to root, which prevents the runtime API service from using
the scoped sudoers fragment introduced in 7d22f1a to install the
OpenClaw gateway systemd unit. Result: install fails at bots_configure
because the gateway never starts.

Drop NoNewPrivileges. Hardening is preserved via the narrow sudoers
fragment, which limits root capabilities to a fixed allowlist of
systemctl commands against a single unit name.
Earlier hardcoded "if non-root use sudo" broke the test suite because
test processes run as a non-zero uid but own the unit dir directly.

Switch to a try-direct, fallback-on-error pattern:

- Unit-file write: try writeFile first; on EACCES/EPERM fall back to
  `sudo -n tee` against the scoped sudoers fragment. Tests own /etc in
  their tempdir, so writeFile succeeds; production runs as
  sovereign-node and falls back to sudo.

- systemctl commands: try plain `systemctl` first; fall back to
  `sudo -n systemctl` when stderr reports polkit's "Interactive
  authentication required" or systemd's "must be root". The bootstrap
  install path runs as root and never needs the fallback; the runtime
  API service does, via the scoped sudoers fragment.
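
The try-direct, fallback-on-error pattern might be sketched like this. The `Runner` and `Writer` function types are hypothetical injection points so the logic can be exercised without touching systemd; the two stderr markers and the EACCES/EPERM codes are the ones quoted above.

```typescript
interface CommandResult {
  exitCode: number;
  stderr: string;
}

// Hypothetical injection points; production code would spawn real processes.
type Runner = (argv: string[]) => Promise<CommandResult>;
type Writer = (path: string, content: string) => Promise<void>;

const NEEDS_ROOT_MARKERS = ["Interactive authentication required", "must be root"];

function needsSudo(result: CommandResult): boolean {
  return result.exitCode !== 0 &&
    NEEDS_ROOT_MARKERS.some((marker) => result.stderr.includes(marker));
}

// Plain systemctl first; retry through sudo -n only on a privilege error.
async function runSystemctl(run: Runner, args: string[]): Promise<CommandResult> {
  const direct = await run(["systemctl", ...args]);
  if (!needsSudo(direct)) return direct;
  return run(["sudo", "-n", "systemctl", ...args]);
}

function isPermissionError(err: unknown): boolean {
  const code = (err as { code?: unknown } | null)?.code;
  return code === "EACCES" || code === "EPERM";
}

// Direct write first; fall back to the sudo-backed writer on EACCES/EPERM.
async function writeWithFallback(
  direct: Writer, viaSudoTee: Writer, path: string, content: string,
): Promise<void> {
  try {
    await direct(path, content);
  } catch (err) {
    if (!isPermissionError(err)) throw err;
    await viaSudoTee(path, content);
  }
}
```

Because the decision is made from the outcome of the direct attempt rather than from the caller's uid, the same code path works for root installs, the unprivileged service user, and test processes that own their own temp directories.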
After a partially-completed install, docker-compose's volume bind
leaves matrix-lan-local/postgres-data owned by root, which means the
runtime API service (running as sovereign-node) can't mkdir siblings
under it on the next install attempt — fails at matrix_provision with
`EACCES: permission denied, mkdir '…/synapse'`.

Add an EACCES fallback to mkdir(synapseDir): if the direct mkdir
fails because of bad parent ownership, escalate via the scoped
sudoers fragment to chown -R the project dir back to the calling uid:gid,
then retry the mkdir.

Extend the sudoers fragment dropped at install time to allow that
narrowly-scoped chown:

  sovereign-node ALL=(root) NOPASSWD: /bin/chown -R [0-9]*\\:[0-9]*
    /var/lib/sovereign-node/bundled-matrix/*

The pattern restricts both the target path and the form of the
ownership argument, so the rule can only re-claim numeric ownership
within the bundled-matrix tree.
resetBundledPostgresState calls backupPostgresDataBeforeReset and then
rm -r on postgres-data. Both walk the dir tree, and both fail EACCES
when running as a non-root service user against a postgres-data
written by the postgres container (uid 70 inside container).

Add a reclaimOwnership helper that uses the scoped sudoers fragment
to chown -R the path to the calling uid:gid before any walking
operation. Apply it at the start of resetBundledPostgresState and
again after `compose down` (Docker may have re-touched the dir).

Trips iter-7 of the wizard E2E loop when the recoverable-credentials
recovery path triggers — first install attempt fails, recovery tries
to reset postgres state, can't scan the dir.
iter-8 of the wizard E2E loop fails at matrix_bootstrap_accounts with
`EPERM: chmod '/etc/sovereign-node/secrets'` after the dir somehow
ends up root-owned mid-install (likely a leftover from a prior run, a
docker-compose bind-mount creating the dir as root, or a partial
chown sequence in the bootstrap install path).

Add a sudoReclaimOwnership helper that uses the scoped sudoers
fragment to chown -R the dir to the calling uid:gid, and call it from
ensureSecretsDir on EPERM/EACCES from the chmod attempt before
falling back to the cwd path.

Extend the sudoers fragment dropped at install time to allow
`/bin/chown -R [0-9]*:[0-9]* /etc/sovereign-node/secrets` and
`/etc/sovereign-node/secrets/*`, scoped exactly to the path where
runtime might need to reclaim ownership.
DEFAULT_SERVICE_USER/GROUP were "root", which meant the runtime API
service (running as sovereign-node) wrote a system-gateway unit with
User=root Group=root, then tried to read jiti/tmp cache files that
the now-root gateway had written. EACCES on the matrix extension
load → bots_configure fails.

Default to sovereign-node:sovereign-node, matching the User= on the
sovereign-node-api unit. Bootstrap install context is unaffected
because it sets SOVEREIGN_NODE_SERVICE_USER from the script.
The sovereign-node-api unit had no Environment=PATH=, so systemd
inherited a minimal default PATH (/usr/local/sbin:/usr/local/bin:
/usr/sbin:/usr/bin:/snap/bin). When the install ran openclaw via the
API service, the verification step's `which openclaw` (or equivalent
PATH lookup) failed because openclaw lives at
/var/lib/sovereign-node/.npm-global/bin/openclaw — never on the
default PATH.

Result: a fresh install on a fresh VM fails immediately at
openclaw_bootstrap_cli with OPENCLAW_INSTALL_FAILED ("OpenClaw
installer completed but the openclaw CLI was not detected"), even
though the binary is installed correctly.

Iter-1 of the wizard E2E loop hot-patched this on a single VM with
`sed -i` to add the Environment=PATH line, but the unit *template*
in the repo was never updated. Every fresh bootstrap regenerates a
broken unit. This commit fixes the template so the next bootstrap
produces a working unit on first try.
The first manual end-to-end walk surfaced three real bugs on the
"Your node is ready" step that the automated playwright loop never
"Your node is ready" step that the automated Playwright loop never

1. Copy code / Copy link silently no-op on insecure HTTP origins.
   navigator.clipboard.writeText is undefined when the wizard runs on
   plain HTTP from a LAN IP (not localhost), the call throws, the
   try/catch swallows it, the user gets no feedback. Wraps both buttons
   in a new CopyButton component that tries the modern API first then
   falls back to a hidden textarea + document.execCommand("copy") —
   the legacy path still works in insecure contexts. Visible
   "Copied" / "Copy failed — select manually" feedback for 2.5 s.

2. result.nextSteps is undefined in practice. The schema declares it
   but no installer code populates it; the SuccessStep's Open Element
   button + kv list silently render empty. Read fallback values from
   wizardState (publicBaseUrl, alertRoomName, operator + homeserver
   domain) so the page renders the operator/homeserver/alert-room
   triple even with no backend support. (A real backend fix would
   thread the data through the install pipeline; tracked as task #29
   for a separate PR.)

3. Local LAN mode lacks operator-facing preconditions. New
   LanPreconditionsCard renders only when deployMode === "lan" and
   walks through the three things needed on each LAN device: DNS
   resolution (router rewrite or /etc/hosts), Caddy local CA trust
   (with the exact path on the node and OS-specific import steps),
   port 443 reachability. The same applies to Local dev: a hint card
   showing the SSH-tunnel command since the homeserver only binds
   loopback.

CopyButton lives in forms.js so screens/onboarding.js (admin
Onboarding page) can use it too; that page has the same class of bug
in the same shape.

Local LAN was complex: each device on the operator's network needed
DNS resolution for matrix.lan.local AND the Caddy CA imported AND
port 443 reachable. The DNS step is the one that breaks people
(macOS .local mDNS quirks, /etc/hosts on every device, router DNS
overrides).

Caddy supports IPs as site addresses for `tls internal`; one cert
covers all listed names. Detect this host's RFC1918 IPv4 addresses at
install time and add them to the Caddyfile's site directive so the
generated cert covers both the hostname and each LAN IP. Operators
can now reach the homeserver at https://<lan-ip>/ from any device on
the LAN with only CA trust + port 443 — no DNS required.
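
As a rough illustration of the site directive this produces, the sketch below joins a primary host and extra names into one comma-separated Caddyfile address list. `buildSiteDirective` is a hypothetical helper, not the repo's renderCaddyfile; the only behavior taken from this commit is the joined output and the duplicate filtering.

```typescript
// Hypothetical helper: builds the opening line of a Caddyfile site block.
// The primary host comes first, then each extra name not already listed.
function buildSiteDirective(primaryHost: string, extraSiteNames: string[]): string {
  const names: string[] = [primaryHost];
  for (const candidate of extraSiteNames) {
    if (names.indexOf(candidate) === -1) {
      names.push(candidate);
    }
  }
  return names.join(", ") + " {";
}
```

With `"matrix.lan.local"` and `["192.168.0.181", "10.0.0.5"]` this yields `matrix.lan.local, 192.168.0.181, 10.0.0.5 {`.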

- New src/system/lan-ips.ts: detectLanIPv4() walks os.networkInterfaces
  filtering loopback/internal/link-local/IPv6, sorts by RFC1918
  preference (10/8 → 192.168/16 → 172.16/12 → public), de-dups. Pure;
  takes networkInterfaces() output as a parameter for testability.
- DockerComposeBundledMatrixProvisioner gains an optional `lanIpProvider`
  constructor arg (defaults to detectLanIPv4) so tests can stub it.
- renderCaddyfile gains an `extraSiteNames` parameter; for `tls internal`
  it joins them into the comma-separated site directive
  (`matrix.lan.local, 192.168.0.181, 10.0.0.5 {`).
- The publicBaseUrl host is filtered out of the LAN list before merging
  to avoid duplicating it in the site directive.
- LanGuidance copy updated: "DNS is optional" — the cert covers the IP.
- LanPreconditionsCard rewritten: CA trust + port 443 are the two real
  preconditions; DNS demoted to "Optional" with a note about
  .local/mDNS edge cases on Apple devices.
- Two new matrix.test.ts cases assert (a) the site directive includes
  hostname + LAN IPs, (b) duplicates are filtered when the publicBaseUrl
  host is also in the detected LAN list.
- New lan-ips.test.ts (6 cases) covers sorting, exclusion of
  loopback/IPv6/link-local, dedup, and the empty case.
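
The detection described in the first bullet can be approximated as below. This is a sketch under assumptions, not the contents of src/system/lan-ips.ts: the `IfaceAddress` and `IfaceMap` types are simplified stand-ins for the shape returned by Node's `os.networkInterfaces()`, and `detectLanIPv4Sketch` and `rfc1918Rank` are illustrative names.

```typescript
// Simplified stand-in for one entry of os.networkInterfaces() output.
type IfaceAddress = { address: string; family: string; internal: boolean };
type IfaceMap = Record<string, IfaceAddress[] | undefined>;

// Lower rank sorts first: 10/8, then 192.168/16, then 172.16/12, then public.
function rfc1918Rank(address: string): number {
  const parts = address.split(".").map(Number);
  const a = parts[0] ?? -1;
  const b = parts[1] ?? -1;
  if (a === 10) return 0;
  if (a === 192 && b === 168) return 1;
  if (a === 172 && b >= 16 && b <= 31) return 2;
  return 3;
}

// Pure: takes the interface map as a parameter so tests can pass fixtures.
function detectLanIPv4Sketch(ifaces: IfaceMap): string[] {
  const found: string[] = [];
  for (const ifaceName of Object.keys(ifaces)) {
    for (const entry of ifaces[ifaceName] ?? []) {
      if (entry.family !== "IPv4" || entry.internal) continue; // IPv6, loopback
      if (entry.address.indexOf("169.254.") === 0) continue; // link-local
      if (found.indexOf(entry.address) === -1) found.push(entry.address);
    }
  }
  return found.sort((x, y) => rfc1918Rank(x) - rfc1918Rank(y));
}
```

Passing the interface map in, rather than calling the OS inside the function, is what keeps the sorting and filtering rules testable with plain fixtures.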

Closes the user-reported "Local LAN is too complex" bug from the first
manual end-to-end install on VM 181.
ndee added 5 commits May 4, 2026 19:49
When the wizard's localStorage points to a jobId that no longer exists
on the node (most common cause: a state wipe between manual installs),
the wizard surfaces "API_ERROR: Install job not found: job_…" forever.
Real bug: the catch arms in the host's resume effect and the Progress
step's poll loop didn't distinguish "this job is gone, start fresh"
from a transient API error, and never cleared the stale jobId.

Three layers of fix:

1. Server: getInstallJob throws a typed object
   { code: "INSTALL_JOB_NOT_FOUND", retryable: false, details: { jobId } }
   instead of a plain Error("Install job not found"). The install/jobs
   route maps this code to HTTP 404 (was 400/generic).

2. Wizard host (index.js): on rehydrate, if the persisted jobId 404s
   with INSTALL_JOB_NOT_FOUND, clear it from localStorage so the
   wizard starts at Welcome on next mount instead of looping. Other
   errors (network blip, auth expired) leave the jobId in place so a
   refresh can retry.

3. ProgressStep poll: same shape — on INSTALL_JOB_NOT_FOUND mid-poll,
   clear jobId and surface a friendly explanation ("This install job
   no longer exists on the node, likely because the node state was
   reset. Go back to Review and start a new install."). The "Try
   again" / "Back to review" buttons already work because terminal
   state is cleared.
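
The three layers can be sketched together as below. This is an illustration under assumptions rather than the code in this commit: `installJobNotFound`, `statusForError`, `handleResumeError`, and the `JOB_ID_KEY` storage key are hypothetical names, and the 400 default only stands in for the previous generic mapping.

```typescript
// Error shape as described in the commit message.
type ApiError = {
  code: string;
  retryable: boolean;
  details?: Record<string, unknown>;
};

// Server side: typed object instead of a plain Error.
function installJobNotFound(jobId: string): ApiError {
  return { code: "INSTALL_JOB_NOT_FOUND", retryable: false, details: { jobId } };
}

// Route side: only the not-found code maps to 404 (400 default is assumed).
function statusForError(err: ApiError): number {
  return err.code === "INSTALL_JOB_NOT_FOUND" ? 404 : 400;
}

// Client side: anything with removeItem works, e.g. window.localStorage.
type JobIdStore = { removeItem(key: string): void };
const JOB_ID_KEY = "wizard.jobId"; // hypothetical storage key

function handleResumeError(err: unknown, store: JobIdStore): "start-fresh" | "retry-later" {
  const code =
    typeof err === "object" && err !== null ? (err as { code?: unknown }).code : undefined;
  if (code === "INSTALL_JOB_NOT_FOUND") {
    store.removeItem(JOB_ID_KEY); // the job is gone, so stop resuming it
    return "start-fresh";
  }
  return "retry-later"; // network blip, expired auth: keep the jobId
}
```

Branching on the error code rather than the message text is what lets the wizard tell a permanently missing job apart from a transient failure.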

Test: real-service.test.ts asserts the typed error shape on unknown
jobIds, alongside the existing successful-getInstallJob assertion.

The Caddy IP-cert change (56f2b7b) made any LAN IP reachable directly
from a fresh laptop without DNS or /etc/hosts edits. The wizard's
Local LAN preset still seeded publicBaseUrl with https://matrix.lan.local,
which surfaced as the "Open Element →" link on the Done page — and
that hostname is unreachable until the operator wires up DNS.

Add GET /api/setup-ui/host-info exposing the host's RFC1918 IPv4
addresses (reuses detectLanIPv4). MatrixStep fetches it on mount and
seeds the Local LAN preset's publicBaseUrl with the first IP. The
Matrix server name (homeserverDomain → MXIDs) stays matrix.lan.local
so federation/MXIDs don't depend on the LAN IP.

If host-info fails or returns no IPs, the wizard falls back to the old
hostname URL so it remains usable. Stale persisted hostname URLs are
upgraded once host-info arrives.
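
The seeding rule can be sketched as a small pure function. The name `seedLanPublicBaseUrl` and its parameters are hypothetical, and leaving an operator-edited URL untouched is an assumption not stated in this commit.

```typescript
const FALLBACK_LAN_URL = "https://matrix.lan.local";

// currentUrl: the persisted publicBaseUrl, if any.
// lanIps: the host-info result, or null when the request failed.
function seedLanPublicBaseUrl(currentUrl: string | undefined, lanIps: string[] | null): string {
  const firstIp = lanIps !== null && lanIps.length > 0 ? lanIps[0] : undefined;
  if (firstIp === undefined) {
    // host-info failed or returned nothing: keep the old hostname URL.
    return currentUrl ?? FALLBACK_LAN_URL;
  }
  if (currentUrl === undefined || currentUrl === FALLBACK_LAN_URL) {
    // Fresh state or a stale persisted hostname URL: upgrade to the IP.
    return "https://" + firstIp;
  }
  // Assumption: anything else was typed by the operator and is left alone.
  return currentUrl;
}
```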
…no &lt;

Three issues surfaced on the Done page during a manual install in
Local LAN mode:

1. No "Open Element →" button for LAN. The button was gated on
   deployMode === "public", but LAN with the IP-cert is just as
   reachable from the operator's browser. Show it for both modes,
   pointing at https://app.element.io/#/login?hs_url=...&login_hint=...
   so Element opens with the homeserver and operator prefilled
   (matches buildElementWebLoginLink in matrix-onboarding-page.ts).

2. Operators couldn't tell where the Caddy CA actually lives. The
   preconditions card printed the literal "<node-LAN-IP>" placeholder
   instead of the homeserver IP. Substitute the real host parsed from
   wizardState.matrix.publicBaseUrl, expose the CA URL as a clickable
   link, and keep the curl command for headless devices.

3. HTML entities like &lt; rendered verbatim because htm passes
   template-literal text as-is. Move every "<placeholder>" through a
   ${"..."} interpolation so they print as < and > in the rendered DOM.
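
The two helpers behind items 1 and 2 might look roughly like this. It is a sketch, not the inline code from this commit: the regex-based host parsing is a simplification that ignores userinfo and bracketed IPv6, and the percent-encoding of the query values and the MXID-style login hint are assumptions.

```typescript
// Extracts the host from an absolute URL, or null if it does not parse.
// Simplified: no userinfo or bracketed IPv6 support.
function hostnameFromUrl(url: string): string | null {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^\/:?#]+)/i.exec(url);
  return match && match[1] ? match[1] : null;
}

// Builds an Element Web login link with homeserver and user prefilled.
function buildElementWebLoginLink(homeserverUrl: string, loginHint: string): string {
  return (
    "https://app.element.io/#/login?hs_url=" +
    encodeURIComponent(homeserverUrl) +
    "&login_hint=" +
    encodeURIComponent(loginHint)
  );
}
```

For example, `hostnameFromUrl("https://192.168.0.181:8443/path")` returns `"192.168.0.181"`, which is the value the preconditions card would substitute for the placeholder.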

Stays self-contained on the Done step — no backend changes, no schema
changes. The hostnameFromUrl + buildElementWebLoginLink helpers are
small enough to live inline; if they grow more callers later we can
extract them.

This is the open-core installer. It should feel like technical
self-hosted setup for operators, not a managed-SaaS onboarding. This
pass adjusts copy, success/error consistency, and a few targeted state
behaviors without redesigning the visual language.

Highlights:

- Header chip now reads "Open-core setup · DIY self-hosted" so the
  open-core path is visible up front.
- Welcome: honest 5–15 minute estimate that depends on network mode;
  explicit external-connections framing instead of the previous
  "nothing leaves until Matrix" line.
- Login: open-core access framing, plain-language bootstrap-token copy,
  "Continue" button. Token-recovery hint always visible.
- Preflight: friendly labels for raw check IDs (host-os → Host OS,
  openclaw-dns → Runtime backend DNS, etc.), and reordered by
  practical importance (privileges, ports, disk, Docker, DNS, time).
- Matrix: title becomes "Matrix control plane"; field renamed
  "Public base URL" → "Matrix URL" with mode-aware helper. LAN
  guidance is rewritten as 3 short bullets and now substitutes the
  real LAN IP into the example URL (was rendering &lt;node-LAN-IP&gt;
  literally because the htm template embedded HTML entities). Public
  is marked as the advanced path. Local dev guidance is honest
  ("only this machine can reach the homeserver directly").
- Mailbox: tighter subtitle, clearer password helper, success message
  reads "Mailbox connection looks good", folder hint added.
- Provider: explicit "OpenRouter as the supported LLM provider in
  open core" framing; model field reframed as "Initial default";
  validation copy clarifies error path.
- Modules: reframed as informational "Installed components" — both
  components are always installed in open core today, so the step
  no longer pretends to be a real choice.
- Review: grouped into Matrix / Mailbox / Provider / Components /
  Secrets sections; secrets shown as set/missing only; primary
  button reads "Install locally"; secret-handling note rewritten.
- Progress: calm subtitle that adapts to deploy mode; phase labels
  use product-level words ("Installing runtime backend",
  "Setting up Matrix", "Activating components") and the relay
  phase is suppressed entirely unless that step actually ran or
  failed (open core defaults to no relay).
- Install failed: title is "Install stopped before completion"; a
  humanized failure summary explains *why* the step stopped and
  suggests the next action (sudo, IMAP creds, OpenRouter key,
  Matrix prerequisites). Raw error code/message stays visible
  beneath, and a Copy raw error button bundles the failed step +
  step list for paste-back. Step detail opens by default on failure.
- Validation error path: contract errors that arrive before a job
  exists (e.g. missing OpenRouter key) now render as a calm
  "Provider configuration incomplete" page with Back to provider
  and Back to review actions, instead of a raw API_ERROR banner.
- Done: title is mode-aware ("Local dev install completed" vs
  "Your node is installed") and the page no longer says ready while
  also showing a red CONFIG_NOT_FOUND banner — that specific case
  now renders as a neutral "Node finalizing" warn block instead.
  LAN preconditions card is restructured into three numbered
  sub-headings (trust the CA / verify port 443 / optional hostname).
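
The friendly-label lookup mentioned under Preflight can be sketched as a plain map with a fallback. Only the two mappings named above are included; `PREFLIGHT_LABELS` and `preflightLabel` are hypothetical names, and falling back to the raw ID for unknown checks is an assumption.

```typescript
// Only the two mappings named in this commit message; others are omitted.
const PREFLIGHT_LABELS: Record<string, string> = {
  "host-os": "Host OS",
  "openclaw-dns": "Runtime backend DNS",
};

// Assumption: unknown check IDs are shown as-is rather than hidden.
function preflightLabel(checkId: string): string {
  if (Object.prototype.hasOwnProperty.call(PREFLIGHT_LABELS, checkId)) {
    return PREFLIGHT_LABELS[checkId] as string;
  }
  return checkId;
}
```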

No backend changes. No schema changes. Visual language unchanged.

Items intentionally left out:
- Earlier provider-key validation: would require a new API endpoint
  and an OpenRouter round-trip from the wizard; deferred to a
  follow-up PR.
- Real module picker: still no real choice; reframed informational.
- Done-page nextSteps still derived from wizardState; backend
  population of result.nextSteps remains tracked separately.