feat(setup-ui): wizard-first first-run + operator admin console#145
Open
Adds a Preact + htm setup UI served by the existing Fastify API at 127.0.0.1:8787, wrapping install, status, reconfigure, and Matrix onboarding-code re-issue. No build step: vendored Preact 10.24.3 and htm 3.1.1 ESM bundles are committed under public/setup-ui/vendor/.
## Backend
- @fastify/static mounts the UI at /setup-ui/, with `/` redirecting to it.
- Per-path Cache-Control: no-cache for app code, immutable for vendor.
- New POST /api/onboarding/issue and GET /api/onboarding/state, with the state endpoint stripping codeHash/codeSalt/passwordSecretRef before returning. Re-issue is the same primitive the CLI uses; the UI confirms before overwriting an outstanding code.
- New `getMatrixOnboardingState` on the InstallerService interface, implemented by RealInstallerService and stubbed for dev.
- `tsup` postbuild copies public/ into dist/public/; runtime path resolution handles dev (tsx) and built (dist) layouts.
## UI
- Visual identity matches sovereign-ai-node.com (dark navy #050816 background, orange #f97317 CTA, Space Grotesk + Inter Tight).
- Hash router with four screens (Install, Status, Reconfigure × 3, Onboarding). Install posts the request JSON and polls /api/install/jobs/:jobId; Status polls /api/status every 5 s.
## Out of scope (deferred)
- Auth / token gating: API stays open on 127.0.0.1.
- Browser-level UI tests: project has no JS test runner yet; backend routes are at 100 % coverage. Playwright is queued for a follow-up.
…ord auth
Layers password auth on top of the localhost setup UI from the previous
commit and adds an opt-in LAN bind via the existing --api-host flag.
Auth is **always on**, regardless of bind address.
## Two-phase auth
- **Pre-install** (readRuntimeConfig throws CONFIG_NOT_FOUND): one-time
**bootstrap token** redeemable on /api/auth/login. Generated and
printed by the installer at end of post-install diagnostics; reissue
via `sudo sovereign-node setup-ui issue-bootstrap-token`.
- **Post-install** (runtime config + operator password file present):
operator's **Matrix password**, validated via Synapse
/_matrix/client/v3/login. Same credential as Element.
Synapse-down vs wrong password: 503 HOMESERVER_UNREACHABLE vs 401
INVALID_CREDENTIALS, distinguishable in the UI.
## Backend
- `src/api/auth/sessions.ts` — in-memory store, 8h TTL, no refresh.
- `src/api/auth/rate-limit.ts` — per-IP sliding window, 5 fails / 15min.
- `src/api/auth/middleware.ts` — Fastify preHandler. Allow-lists
`/`, `/healthz`, `/setup-ui/*`, `GET /api/auth/state`,
`POST /api/auth/login`. Everything else requires `sov_session`
cookie. POST/PUT/DELETE/PATCH require X-CSRF-Token == sov_csrf
cookie (double-submit).
- `src/api/routes/auth.ts` — GET /api/auth/state (also reports csrf
when authenticated), POST /api/auth/login, POST /api/auth/logout.
Login route handles rate-limit, schema validation, stage routing.
- @fastify/cookie added; cookies HttpOnly (sov_session) +
SameSite=Strict + Path=/ + Max-Age=28800. No Secure (LAN HTTP).
- New service interface methods on InstallerService:
getAuthStage, issueSetupUiBootstrapToken, getSetupUiBootstrapState,
consumeSetupUiBootstrapToken, verifyOperatorPassword.
- Bootstrap-state file at `{paths.stateDir}/setup-ui/bootstrap-state.json`
(mode 0o600), written via writeInstallerJsonFile.
- Token primitives in `src/onboarding/setup-ui-bootstrap.ts` mirror
the Matrix onboarding-code shape (24h TTL, max 5 attempts,
timing-safe compare).
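The per-IP sliding window described for `src/api/auth/rate-limit.ts` could look roughly like the sketch below. This is a minimal illustration under assumptions, not the file's actual contents: the class name `FailedLoginLimiter`, its method names, and the explicit `now` parameter (passed in to keep the logic testable) are all hypothetical. Only the 5 failures / 15 minutes numbers come from the description above.

```typescript
// Hypothetical sliding-window limiter; names and shape are illustrative,
// not the contents of src/api/auth/rate-limit.ts.
const WINDOW_MS = 15 * 60 * 1000; // 15 minutes
const MAX_FAILS = 5;

class FailedLoginLimiter {
  // Failure timestamps (ms) per client IP.
  private readonly failures = new Map<string, number[]>();

  // Drop timestamps that have slid out of the window and return the rest.
  private recent(ip: string, now: number): number[] {
    const kept = (this.failures.get(ip) ?? []).filter((t) => now - t < WINDOW_MS);
    if (kept.length > 0) {
      this.failures.set(ip, kept);
    } else {
      this.failures.delete(ip);
    }
    return kept;
  }

  // True when the IP has used up its failures inside the current window.
  isBlocked(ip: string, now: number): boolean {
    return this.recent(ip, now).length >= MAX_FAILS;
  }

  // Record one failed login attempt for the IP.
  recordFailure(ip: string, now: number): void {
    const kept = this.recent(ip, now);
    kept.push(now);
    this.failures.set(ip, kept);
  }
}
```

A login route would presumably call `isBlocked` before validating credentials and `recordFailure` only on a rejected attempt, so successful logins do not consume the budget.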
## Frontend
- `public/setup-ui/screens/login.js` — bootstrap-token form +
password form, mode driven by /api/auth/state.
- `public/setup-ui/api.js` — tracks csrf in module scope, sends
X-CSRF-Token on POSTs, dispatches `sov:unauth` on 401.
- `public/setup-ui/app.js` — useAuthState hook gates the app shell;
Sign out button in nav.
## Installer
- `scripts/install/lib-matrix-urls.sh` — print_setup_ui_access_guidance
prints URL + bootstrap token + LAN-bind hint at end of install.
- `scripts/install.sh` — calls the printer after Matrix onboarding.
- `scripts/install/lib-log.sh` — --api-host help clarifies LAN expose.
- New CLI subcommand `sovereign-node setup-ui issue-bootstrap-token`.
## Tests
- 100% coverage on every new/changed file (integration + unit suites).
- 387 vitest tests pass, lint + typecheck green.
## Out of scope (still)
- HTTPS / TLS termination; LAN HTTP only.
- Internet exposure; CSRF posture and rate limit assume LAN trust.
- Multi-user; one operator credential.
- Browser-level UI tests; Playwright is queued for a follow-up PR.
The installer creates `{stateDir}/setup-ui/bootstrap-state.json` while
running as root, so the parent directory ends up owned by root:root.
At runtime the API service runs as `sovereign-node`; its attempt to
write the redeem-result temp file fails with EACCES because the
directory itself isn't writable by that user.
Add `ensureSetupUiStateDirOwnership` (mkdir + applyRuntimeOwnership)
and call it from both `issueSetupUiBootstrapToken` and
`consumeSetupUiBootstrapToken` so the directory is chown'd to the
runtime ownership when running as root and is a no-op otherwise.
Discovered while bringing the dev VM online for end-to-end UI
verification: the login POST returned 500 with EACCES on the
bootstrap-state temp file path.
The install screen seeded the JSON editor with `openai/gpt-4o-mini` and the reconfigure-OpenRouter form's placeholder mentioned the same model. Both diverged from the canonical default in scripts/install.sh:31 (`RECOMMENDED_OPENROUTER_MODEL=qwen/qwen3.5-9b`). Align both UI references with the installer's recommendation so the defaults match across CLI and UI.
Replace the JSON-textarea install screen with a guided 9-step wizard (Welcome → Preflight → Matrix → Mailbox → Provider → Modules → Review → Progress → Done) and split the SPA into two distinct modes: first-run (default when the node is not initialized) and admin (default once it is). Mode is detected client-side from /api/status (no new endpoint).

Status becomes operator-overview-first: an overall health rollup, critical-action banner, healthy/warning/error groups, with the existing flat services table demoted into a collapsible <details>.

Reconfigure forms gain a Test → Apply state machine using the existing /api/install/test-imap and /api/install/test-matrix endpoints; Apply stays disabled until a passing test, and restartRequiredServices is rendered in plain language.

Wizard form state (no secrets) persists to localStorage so an operator can close and reopen the laptop mid-setup. Secrets are only kept in component state and submitted once at install time.

On success, the wizard auto-issues a Matrix onboarding code, surfaces the Element homeserver URL and alert room as the primary CTA, and points the operator at Matrix as the day-2 control plane.

Old hash routes (#/install, #/status, #/reconfigure/*, #/onboarding) redirect to the new IA for one release cycle.

Adds a contract test asserting the wizard-generated payload round-trips through installRequestSchema, guarding against frontend/backend drift.
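The "form state persists, secrets do not" rule above can be sketched as a small pure function. Everything here is an assumption for illustration: the `WizardState` shape, the `password` and `apiKey` field names, and the `toPersistable` helper are hypothetical and not taken from the wizard's real state hook.

```typescript
// Hypothetical wizard state shape; field names are illustrative only.
interface WizardState {
  matrix: { publicBaseUrl: string; tlsMode: string };
  mailbox: { host: string; port: number; username: string; password?: string };
  provider: { model: string; apiKey?: string };
}

// Return a copy that is safe to write to localStorage: secrets are dropped,
// everything else is kept so the wizard can resume mid-setup.
function toPersistable(state: WizardState): WizardState {
  const { password: _password, ...mailbox } = state.mailbox;
  const { apiKey: _apiKey, ...provider } = state.provider;
  return { matrix: { ...state.matrix }, mailbox, provider };
}
```

In a browser the result would be serialized with `JSON.stringify` before being stored; the secrets stay in component state and are only merged back into the payload at submit time.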
The wizard-first refactor accidentally rewrote screens/reconfigure.js
with `./` import specifiers (as if the file lived under screens/admin/),
breaking the entire module graph: app.js imports Reconfigure at module
load, ES modules fail the whole graph when one specifier 404s, and the
result was a black page on every route — including the wizard.
Restore `../` paths so the file resolves to setup-ui/{vendor,api,forms,
components}/* the way every other screen in the tree does.
Add a three-card picker at the top of the Matrix wizard step (Public site, Local LAN, Local dev) so the operator chooses how their node will be reached before filling in URLs.
- Public: empty defaults (operator owns the real domain). Surfaces the router/firewall requirements (DNS, :80 for ACME, :443 for Element, :8448 if federation) so an operator can decide whether a public install is realistic before they get there.
- Local LAN: prefills https://matrix.lan.local with tlsMode "internal" (Caddy local CA) and explains the "trust this CA on each device" expectation.
- Local dev: prefills http://127.0.0.1:8008 with tlsMode "local-dev" for a try-it-out smoke that needs no DNS, no certs, no router.

Wires tlsMode through the wizard state hook and into the generated installRequest. Adds two contract tests covering the local-dev and local-LAN payload shapes so a frontend/backend drift on tlsMode breaks the build first. Federation toggle hidden on Local dev (it's pointless without TLS).
Pre-install, the bundled Synapse doesn't exist yet, so testing a homeserver URL on Local LAN or Local dev can only ever fail with a red "Matrix homeserver could not be reached" banner, which is confusing for an operator who is doing exactly what the wizard told them to do.

On Public site keep the button (you may already have a homeserver running, and verifying it before install is genuinely useful), but clarify it's optional. On Local LAN and Local dev replace the button with a dim one-liner explaining the bundled installer will create the homeserver during install.
POST /api/install/run is synchronous on the server: it runs every step in order and only returns the response after the job reaches a terminal state. The wizard previously rendered "Starting install…" for the entire duration of that POST (often 50+ s) with no other feedback, then on failure silently navigated back to Review with no indication of *why* the install failed. Operators retried, creating multiple ghost jobs.

Three fixes:
1. While the POST is in flight, surface a "Submitting install request…" banner with a live elapsed-seconds counter, plus a one-line note that explains why the request is taking a while. Replaces the static "Starting install…" string.
2. On a terminal failure, stay on the Progress step instead of navigating away. Render a clear "Install failed at <step>" alert with the error code and message, the per-phase progress list (so the operator sees how far it got), and a collapsible per-step detail pane that includes each step's error.
3. Add explicit "Back to review" and "Try again" buttons on the failure screen. "Try again" reuses the same request and bumps a retry counter so the effect re-fires; "Back to review" lets the operator change inputs first.

Wizard host wiring updated to pass onBackToReview instead of onFailed.
The bundled-Matrix install fails at the "Configure OpenClaw runtime" step on a fresh node with:

    LOBSTER_INSTALL_FAILED: npm install for Lobster CLI exited with non-zero status
    EACCES: permission denied, mkdir '/usr/lib/node_modules/@clawdbot'

Root cause: real-service.ts calls setManagedOpenClawEnv during install, which mutates process.env.HOME globally to /var/lib/sovereign-node/openclaw-home so OpenClaw subprocesses see the right home directory. That mutation then leaks into every later child process, including the unrelated `npm install -g @clawdbot/lobster` that ensureLobsterCliInstalled spawns. With HOME pointed at openclaw-home (which has no .npmrc), npm falls back to the system default global prefix /usr/lib/node_modules, which is root-owned, and fails with EACCES.

Fix: capture the API service's original HOME at module load time, before any other installer code can touch process.env, and pass an explicit env to every npm/lobster invocation that anchors HOME and npm_config_prefix to the service user's npm-global. This bypasses the process.env mutation without requiring the wider refactor of real-service.ts's env handling. The structural problem in real-service.ts (mutating process.env across the entire process lifetime) is left as a separate follow-up.

Adds src/installer/real-service-lobster.test.ts (8 tests) covering the env shape and the install-success / install-failure paths.
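The fix amounts to building an explicit child-process env instead of inheriting the mutated one. A minimal sketch, assuming the service user's global prefix is `<HOME>/.npm-global` (consistent with the openclaw path quoted elsewhere in this PR, but still an assumption here); `buildNpmEnv` and the `Env` type are hypothetical names, and the env is passed in rather than read from `process.env` so the function stays pure.

```typescript
type Env = Record<string, string | undefined>;

// Build the env handed to `npm install -g ...`. `originalHome` is the HOME
// captured at module load, before anything mutated the process environment.
// Assumption: the service user's npm prefix lives at <HOME>/.npm-global.
function buildNpmEnv(currentEnv: Env, originalHome: string): Env {
  return {
    ...currentEnv,
    // Override whatever HOME was later mutated to.
    HOME: originalHome,
    // Pin the global prefix so npm never falls back to /usr/lib/node_modules.
    npm_config_prefix: `${originalHome}/.npm-global`,
  };
}
```

The resulting object would be passed as the `env` option of whatever spawn helper runs npm, leaving the rest of the process environment (PATH and so on) untouched.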
Two small fixes the laptop-side smoke run flagged:
1. Copy bug: the sub-headline rendered the literal `&amp;` because htm doesn't decode HTML entities in text nodes. Use the raw `&` character instead, matching `Local setup` / `Admin console` in WizardShell and AdminNav.
2. Recovery affordance: when sign-in fails with any BOOTSTRAP_TOKEN_* error code (CONSUMED, EXPIRED, INVALID, LOCKED, NOT_ISSUED), the failure was a dead-end banner. The instruction for re-issuing a token via CLI was only in the form's intro copy, far from the failure context. Surface a contextual info card with the exact `sudo sovereign-node setup-ui issue-bootstrap-token` command directly under the error banner so an operator sees recovery in the failure context.
Three small UX fixes the laptop browser walk surfaced:
1. Mailbox port spin-button advertised valuemax=0 / valuemin=0 to assistive tech because NumberInput took no min/max props. Add optional min, max, step props to NumberInput; have MailboxStep pass min=1, max=65535 so the constraint is announced correctly.
2. Install-progress phase rollup ignored the "warned" and "canceled" step states. A phase whose only sub-step ended in `warned` (e.g. imap_validate ending in IMAP_TEST_FAILED→WARNED) read as PENDING in the summary even after execution had passed it. Add `warned` and `canceled` to both the rollup logic and the tone map. The summary now resolves to `warned` (yellow) when any sub-step warned and nothing failed, and to `canceled` (red) when canceled, matching the tone of the per-step badges below.
3. Drive-by: the "mixed terminal + pending" branch is now explicit instead of falling out of an "every-or-nothing" check, which fixes a real but rare drift where some sub-steps had terminated but others hadn't yet, leaving the bucket stuck on `pending`.
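The rollup rules in items 2 and 3 can be sketched as one pure function. This is an illustration under assumptions rather than the shipped code: the `StepState` union, the precedence of `failed` over `canceled`, the choice of `running` for the mixed terminal + pending case, and every tone other than yellow for `warned` and red for `canceled` are guesses.

```typescript
type StepState = "pending" | "running" | "succeeded" | "warned" | "failed" | "canceled";

// A step that finished without stopping the install.
const isDone = (s: StepState): boolean => s === "succeeded" || s === "warned";

// Collapse the sub-step states of one phase into a single summary state.
function rollupPhase(steps: StepState[]): StepState {
  if (steps.length === 0) return "pending";
  if (steps.includes("failed")) return "failed"; // assumed to outrank canceled
  if (steps.includes("canceled")) return "canceled";
  if (steps.every(isDone)) return steps.includes("warned") ? "warned" : "succeeded";
  // Explicit "mixed terminal + pending" branch: some sub-steps have finished
  // and others have not, so the phase is in progress rather than pending.
  if (steps.includes("running") || steps.some(isDone)) return "running";
  return "pending";
}

// Tone map; only warned -> yellow and canceled -> red come from the text above.
const TONE: Record<StepState, string> = {
  pending: "neutral",
  running: "info",
  succeeded: "green",
  warned: "yellow",
  failed: "red",
  canceled: "red",
};
```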
…onfig
The bundled-Matrix install wrote a top-level
channels.matrix.homeserver in openclaw.json5 alongside
channels.matrix.accounts.<id>.homeserver. Modern OpenClaw treats the
top-level field as the single-account legacy shape, and the doctor
flags it on every CLI invocation:
Doctor changes:
Moved channels.matrix single-account top-level values into
channels.matrix.accounts.default.
The doctor surfaces this as an interactive clack-style prompt on stdout
of every `openclaw …` invocation, including `openclaw cron list --json`.
The JSON parse fails, the gateway WebSocket closes, and the install
fails at bots_configure with MANAGED_AGENT_REGISTER_FAILED.
Drop the top-level homeserver from the channels.matrix block. The
modern shape keeps homeserver under accounts.<id>.homeserver, which
the same writer already produces. No other change needed: the matrix
plugin still loads because plugins.entries.matrix.enabled stays true,
and per-account access tokens, group routing, etc. were already on
the accounts.<id> objects.
Test asserts the field is absent.
…w config
Iter-1 of the wizard E2E narrowed the install failure at bots_configure
to OpenClaw's doctor migration prompt bleeding onto stdout of `openclaw
cron list --json`:
Doctor changes:
Moved channels.matrix single-account top-level values into
channels.matrix.accounts.default.
The first fix dropped channels.matrix.homeserver but the migration
still triggered. Iter-2 inspected the live config and found the
remaining trigger fields: top-level dm / groupPolicy / groupAllowFrom
/ groups. Doctor treats any of those as "single-account top-level
values" and migrates them into accounts.default. Drop them all.
Each entry in matrixAccounts already carries its own per-account dm
/ groupPolicy / groupAllowFrom / groups, so behavior is preserved.
Tests updated to assert top-level absence and read federation policy
from accounts.<id>.{dm,groupPolicy,groups}.
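Taken together, the two config fixes strip every top-level field the doctor treats as single-account legacy shape while leaving `accounts.<id>` untouched. A minimal sketch, assuming the config is handled as a plain object; `stripLegacyTopLevel` is a hypothetical helper name and not necessarily how the real writer is structured (it may simply never emit these keys).

```typescript
// Top-level channels.matrix keys named in the commit messages as triggers
// for the doctor's single-account migration.
const LEGACY_TOP_LEVEL_KEYS = new Set<string>([
  "homeserver",
  "dm",
  "groupPolicy",
  "groupAllowFrom",
  "groups",
]);

// Return a copy of the channels.matrix block without the legacy top-level
// fields; per-account settings under `accounts` are carried over as-is.
function stripLegacyTopLevel(matrix: Record<string, unknown>): Record<string, unknown> {
  const cleaned: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(matrix)) {
    if (!LEGACY_TOP_LEVEL_KEYS.has(key)) cleaned[key] = value;
  }
  return cleaned;
}
```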
OpenClaw doctor flags `gateway.mode is unset; gateway start will be blocked` whenever the runtime config omits gateway.mode. The doctor note is just a warning, but it correlates with the daemon refusing to start, which makes every `openclaw cron list` call fail with `gateway closed (1006 abnormal closure)`. Set gateway.mode = local. We always run the gateway in loopback mode; relay deployments still bind loopback and tunnel via the relay-tunnel systemd service.
The runtime sovereign-node-api service runs as User=sovereign-node and can't write /etc/systemd/system/sovereign-openclaw-gateway.service or run systemctl, so the bundled-Matrix install fails after the matrix_bootstrap_room step: the gateway never starts, and every `openclaw cron list` call from bots_configure dies with `gateway closed (1006 abnormal closure)`.

Two changes:
1. Drop a narrow sudoers fragment at bootstrap time (/etc/sudoers.d/sovereign-node-gateway) that lets the service user tee the gateway unit file and run systemctl daemon-reload, restart, enable --now, is-active, and status, all *only* against the sovereign-openclaw-gateway unit. Validated with `visudo -cf`; if invalid for any reason, the file is removed so we don't break sudo.
2. ensureSystemGatewayServiceFallback now tries plain writeFile first, falls back to `sudo -n tee` on EACCES/EPERM, and probes whether systemctl needs sudo before issuing the daemon-reload / restart / enable / is-active commands. Plain systemctl still works in the root install context (scripts/install.sh); sudo -n is used only when the runtime API is the caller.
NoNewPrivileges=true on the sovereign-node-api unit blocks sudo from elevating to root, which prevents the runtime API service from using the scoped sudoers fragment introduced in 7d22f1a to install the OpenClaw gateway systemd unit. Result: install fails at bots_configure because the gateway never starts. Drop NoNewPrivileges. Hardening is preserved via the narrow sudoers fragment, which limits root capabilities to a fixed allowlist of systemctl commands against a single unit name.
Earlier hardcoded "if non-root use sudo" broke the test suite because test processes run as a non-zero uid but own the unit dir directly. Switch to a try-direct, fallback-on-error pattern:
- Unit-file write: try writeFile first; on EACCES/EPERM fall back to `sudo -n tee` against the scoped sudoers fragment. Tests own /etc in their tempdir, so writeFile succeeds; production runs as sovereign-node and falls back to sudo.
- systemctl commands: try plain `systemctl` first; fall back to `sudo -n systemctl` when stderr reports polkit's "Interactive authentication required" or systemd's "must be root".

The bootstrap install path runs as root and never needs the fallback; the runtime API service does, via the scoped sudoers fragment.
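The try-direct, fallback-on-error pattern can be sketched with the side effects injected, which is also what makes it testable without root. The `Runner` and `Writer` types and both function names are hypothetical; the real code presumably spawns processes asynchronously, whereas this sketch is synchronous for brevity.

```typescript
interface CommandResult { ok: boolean; stderr: string }
type Runner = (argv: string[]) => CommandResult;
type Writer = (path: string, contents: string) => void;

// stderr fragments that signal "this needed root" (polkit and systemd).
const NEEDS_ROOT = [/Interactive authentication required/i, /must be root/i];

// Try plain systemctl first; retry through `sudo -n` only when stderr says
// the failure was a privilege problem. Other failures are returned as-is.
function runSystemctl(args: string[], run: Runner): CommandResult {
  const direct = run(["systemctl", ...args]);
  if (direct.ok) return direct;
  if (!NEEDS_ROOT.some((pattern) => pattern.test(direct.stderr))) return direct;
  return run(["sudo", "-n", "systemctl", ...args]);
}

// Try a direct write first; fall back to the sudo-backed writer only on
// EACCES/EPERM. Any other error is rethrown unchanged.
function writeUnitFile(path: string, contents: string, direct: Writer, viaSudoTee: Writer): void {
  try {
    direct(path, contents);
  } catch (err) {
    const code = (err as { code?: string }).code;
    if (code !== "EACCES" && code !== "EPERM") throw err;
    viaSudoTee(path, contents);
  }
}
```

Because the decision is driven by the observed error rather than by the caller's uid, the same code path behaves correctly as root, as the service user, and inside a test tempdir.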
After a partially-completed install, docker-compose's volume bind
leaves matrix-lan-local/postgres-data owned by root, which means the
runtime API service (running as sovereign-node) can't mkdir siblings
under it on the next install attempt — fails at matrix_provision with
`EACCES: permission denied, mkdir '…/synapse'`.
Add an EACCES fallback to mkdir(synapseDir): if the direct mkdir
fails because of bad parent ownership, escalate via the scoped
sudoers fragment to chown -R the project dir back to the calling uid:gid,
then retry the mkdir.
Extend the sudoers fragment dropped at install time to allow that
narrowly-scoped chown:
sovereign-node ALL=(root) NOPASSWD: /bin/chown -R [0-9]*\\:[0-9]*
/var/lib/sovereign-node/bundled-matrix/*
The pattern restricts both the target path and the form of the
ownership argument, so the rule can only re-claim numeric ownership
within the bundled-matrix tree.
resetBundledPostgresState calls backupPostgresDataBeforeReset and then rm -r on postgres-data. Both walk the dir tree, and both fail EACCES when running as a non-root service user against a postgres-data written by the postgres container (uid 70 inside container).

Add a reclaimOwnership helper that uses the scoped sudoers fragment to chown -R the path to the calling uid:gid before any walking operation. Apply it at the start of resetBundledPostgresState and again after `compose down` (Docker may have re-touched the dir).

Trips iter-7 of the wizard E2E loop when the recoverable-credentials recovery path triggers: first install attempt fails, recovery tries to reset postgres state, can't scan the dir.
iter-8 of the wizard E2E loop fails at matrix_bootstrap_accounts with `EPERM: chmod '/etc/sovereign-node/secrets'` after the dir somehow ends up root-owned mid-install (likely a leftover from a prior run, a docker-compose bind-mount creating the dir as root, or a partial chown sequence in the bootstrap install path).

Add a sudoReclaimOwnership helper that uses the scoped sudoers fragment to chown -R the dir to the calling uid:gid, and call it from ensureSecretsDir on EPERM/EACCES from the chmod attempt before falling back to the cwd path.

Extend the sudoers fragment dropped at install time to allow `/bin/chown -R [0-9]*:[0-9]* /etc/sovereign-node/secrets` and `/etc/sovereign-node/secrets/*`, scoped exactly to the path where runtime might need to reclaim ownership.
DEFAULT_SERVICE_USER/GROUP were "root", which meant the runtime API service (running as sovereign-node) wrote a system-gateway unit with User=root Group=root, then tried to read jiti/tmp cache files that the now-root gateway had written. EACCES on the matrix extension load → bots_configure fails. Default to sovereign-node:sovereign-node, matching the User= on the sovereign-node-api unit. Bootstrap install context is unaffected because it sets SOVEREIGN_NODE_SERVICE_USER from the script.
The sovereign-node-api unit had no Environment=PATH=, so systemd
inherited a minimal default PATH (/usr/local/sbin:/usr/local/bin:
/usr/sbin:/usr/bin:/snap/bin). When the install ran openclaw via the
API service, the verification step's `which openclaw` (or equivalent
PATH lookup) failed because openclaw lives at
/var/lib/sovereign-node/.npm-global/bin/openclaw — never on the
default PATH.
Result: a fresh install on a fresh VM fails immediately at
openclaw_bootstrap_cli with OPENCLAW_INSTALL_FAILED ("OpenClaw
installer completed but the openclaw CLI was not detected"), even
though the binary is installed correctly.
Iter-1 of the wizard E2E loop hot-patched this on a single VM with
`sed -i` to add the Environment=PATH line, but the unit *template*
in the repo was never updated. Every fresh bootstrap regenerates a
broken unit. This commit fixes the template so the next bootstrap
produces a working unit on first try.
The first manual end-to-end walk surfaced three real bugs on the
"Your node is ready" step that the automated playwright loop never
hit (network state, not screen state):
1. Copy code / Copy link silently no-op on insecure HTTP origins.
navigator.clipboard.writeText is undefined when the wizard runs on
plain HTTP from a LAN IP (not localhost), the call throws, the
try/catch swallows it, the user gets no feedback. Wraps both buttons
in a new CopyButton component that tries the modern API first then
falls back to a hidden textarea + document.execCommand("copy") —
the legacy path is exactly designed for insecure contexts. Visible
"Copied" / "Copy failed — select manually" feedback for 2.5 s.
2. result.nextSteps is undefined in practice. The schema declares it
but no installer code populates it; the SuccessStep's Open Element
button + kv list silently render empty. Read fallback values from
wizardState (publicBaseUrl, alertRoomName, operator + homeserver
domain) so the page renders the operator/homeserver/alert-room
triple even with no backend support. (A real backend fix would
thread the data through the install pipeline; tracked as task #29
for a separate PR.)
3. Local LAN mode lacks operator-facing preconditions. New
LanPreconditionsCard renders only when deployMode === "lan" and
walks through the three things needed on each LAN device: DNS
resolution (router rewrite or /etc/hosts), Caddy local CA trust
(with the exact path on the node and OS-specific import steps),
port 443 reachability. Same applied for Local dev: a hint card
showing the SSH-tunnel command since the homeserver only binds
loopback.
CopyButton lives in forms.js so screens/onboarding.js (admin
Onboarding page) can use it too — same bug class on the same shape.
Local LAN was complex: each device on the operator's network needed
DNS resolution for matrix.lan.local AND the Caddy CA imported AND
port 443 reachable. The DNS step is the one that breaks people
(macOS .local mDNS quirks, /etc/hosts on every device, router DNS
overrides).
Caddy supports IPs as site addresses for `tls internal`; one cert
covers all listed names. Detect this host's RFC1918 IPv4 addresses at
install time and add them to the Caddyfile's site directive so the
generated cert covers both the hostname and each LAN IP. Operators
can now reach the homeserver at https://<lan-ip>/ from any device on
the LAN with only CA trust + port 443 — no DNS required.
- New src/system/lan-ips.ts: detectLanIPv4() walks os.networkInterfaces
filtering loopback/internal/link-local/IPv6, sorts by RFC1918
preference (10/8 → 192.168/16 → 172.16/12 → public), de-dups. Pure;
takes networkInterfaces() output as a parameter for testability.
- DockerComposeBundledMatrixProvisioner gains an optional `lanIpProvider`
constructor arg (defaults to detectLanIPv4) so tests can stub it.
- renderCaddyfile gains an `extraSiteNames` parameter; for `tls internal`
it joins them into the comma-separated site directive
(`matrix.lan.local, 192.168.0.181, 10.0.0.5 {`).
- The publicBaseUrl host is filtered out of the LAN list before merging
to avoid duplicating it in the site directive.
- LanGuidance copy updated: "DNS is optional" — the cert covers the IP.
- LanPreconditionsCard rewritten: CA trust + port 443 are the two real
preconditions; DNS demoted to "Optional" with a note about
.local/mDNS edge cases on Apple devices.
- Two new matrix.test.ts cases assert (a) the site directive includes
hostname + LAN IPs, (b) duplicates are filtered when the publicBaseUrl
host is also in the detected LAN list.
- New lan-ips.test.ts (6 cases) covers sorting, exclusion of
loopback/IPv6/link-local, dedup, and the empty case.
Closes the user-reported "Local LAN is too complex" bug from the first
manual end-to-end install on VM 181.
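The detection and merge steps described above can be sketched as two pure functions. This is a minimal illustration, not the contents of src/system/lan-ips.ts: the `IfaceAddress` type is a trimmed-down stand-in for what `os.networkInterfaces()` returns, and `siteDirective` is a hypothetical helper standing in for the relevant part of renderCaddyfile.

```typescript
// Minimal stand-in for the entries returned by os.networkInterfaces().
interface IfaceAddress { address: string; family: string | number; internal: boolean }
type Interfaces = Record<string, IfaceAddress[] | undefined>;

// Lower rank sorts first: 10/8, then 192.168/16, then 172.16/12, then public.
function rank(ip: string): number {
  const parts = ip.split(".").map(Number);
  const a = parts[0] ?? -1;
  const b = parts[1] ?? -1;
  if (a === 10) return 0;
  if (a === 192 && b === 168) return 1;
  if (a === 172 && b >= 16 && b <= 31) return 2;
  return 3;
}

// Collect non-internal, non-link-local IPv4 addresses, de-duplicated and
// sorted by RFC1918 preference. Pure: the interface map is passed in.
function detectLanIPv4(interfaces: Interfaces): string[] {
  const found = new Set<string>();
  for (const addrs of Object.values(interfaces)) {
    for (const addr of addrs ?? []) {
      const isV4 = addr.family === "IPv4" || addr.family === 4;
      if (!isV4 || addr.internal) continue;
      if (addr.address.startsWith("127.") || addr.address.startsWith("169.254.")) continue;
      found.add(addr.address);
    }
  }
  return Array.from(found).sort((x, y) => rank(x) - rank(y));
}

// Build the Caddy site-address line for `tls internal`, dropping the public
// host from the LAN list so it is not listed twice.
function siteDirective(publicHost: string, lanIps: string[]): string {
  const extras = lanIps.filter((ip) => ip !== publicHost);
  return `${[publicHost, ...extras].join(", ")} {`;
}
```

Taking the interface map as a parameter is what lets the tests feed in fixed fixtures instead of depending on the host's real network configuration.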
When the wizard's localStorage points to a jobId that no longer exists
on the node (most common cause: a state wipe between manual installs),
the wizard surfaces "API_ERROR: Install job not found: job_…" forever.
Real bug: the catch arms in the host's resume effect and the Progress
step's poll loop didn't distinguish "this job is gone, start fresh"
from a transient API error, and never cleared the stale jobId.
Three layers of fix:
1. Server: getInstallJob throws a typed object
{ code: "INSTALL_JOB_NOT_FOUND", retryable: false, details: { jobId } }
instead of a plain Error("Install job not found"). The install/jobs
route maps this code to HTTP 404 (was 400/generic).
2. Wizard host (index.js): on rehydrate, if the persisted jobId 404s
with INSTALL_JOB_NOT_FOUND, clear it from localStorage so the
wizard starts at Welcome on next mount instead of looping. Other
errors (network blip, auth expired) leave the jobId in place so a
refresh can retry.
3. ProgressStep poll: same shape — on INSTALL_JOB_NOT_FOUND mid-poll,
clear jobId and surface a friendly explanation ("This install job
no longer exists on the node, likely because the node state was
reset. Go back to Review and start a new install."). The "Try
again" / "Back to review" buttons already work because terminal
state is cleared.
Test: real-service.test.ts asserts the typed error shape on unknown
jobIds, alongside the existing successful-getInstallJob assertion.
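The three layers share one contract: a typed error code that the server maps to 404 and the client treats as "drop the stale jobId". A minimal sketch of that contract follows. The `message` field, the 400 fallback status, and all function names are assumptions for illustration; only the `code`, `retryable`, and `details.jobId` fields and the 404 mapping come from the description above.

```typescript
interface InstallerError {
  code: string;
  message: string; // assumed field, not stated in the commit message
  retryable: boolean;
  details?: Record<string, unknown>;
}

// Server side: the typed error for an unknown job.
function installJobNotFound(jobId: string): InstallerError {
  return {
    code: "INSTALL_JOB_NOT_FOUND",
    message: `Install job not found: ${jobId}`,
    retryable: false,
    details: { jobId },
  };
}

// Narrow an unknown thrown value to something carrying a string code.
function errorCode(err: unknown): string | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const code = (err as { code?: unknown }).code;
  return typeof code === "string" ? code : undefined;
}

// Route side: map the typed code to an HTTP status (400 fallback is assumed).
function statusForError(err: unknown): number {
  return errorCode(err) === "INSTALL_JOB_NOT_FOUND" ? 404 : 400;
}

// Client side: only a definitive "job is gone" clears the persisted jobId;
// transient errors (network blip, expired auth) leave it in place.
function shouldClearPersistedJobId(err: unknown): boolean {
  return errorCode(err) === "INSTALL_JOB_NOT_FOUND";
}
```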
The Caddy IP-cert change (56f2b7b) made any LAN IP reachable directly from a fresh laptop without DNS or /etc/hosts edits. The wizard's Local LAN preset still seeded publicBaseUrl with https://matrix.lan.local, which surfaced as the "Open Element →" link on the Done page, and that hostname is unreachable until the operator wires up DNS.

Add GET /api/setup-ui/host-info exposing the host's RFC1918 IPv4 addresses (reuses detectLanIPv4). MatrixStep fetches it on mount and seeds the Local LAN preset's publicBaseUrl with the first IP. The Matrix server name (homeserverDomain → MXIDs) stays matrix.lan.local so federation/MXIDs don't depend on the LAN IP.

If host-info fails or returns no IPs, falls back to the old hostname URL so the wizard remains usable. Stale persisted hostname URLs are upgraded once host-info arrives.
…no <
Three issues surfaced on the Done page during a manual install in Local LAN mode:
1. No "Open Element →" button for LAN. The button was gated on deployMode === "public", but LAN with the IP-cert is just as reachable from the operator's browser. Show it for both modes, pointing at https://app.element.io/#/login?hs_url=...&login_hint=... so Element opens with the homeserver and operator prefilled (matches buildElementWebLoginLink in matrix-onboarding-page.ts).
2. Operators couldn't tell where the Caddy CA actually lives. The preconditions card printed the literal "<node-LAN-IP>" placeholder instead of the homeserver IP. Substitute the real host parsed from wizardState.matrix.publicBaseUrl, expose the CA URL as a clickable link, and keep the curl command for headless devices.
3. HTML entities like `&lt;` rendered verbatim because htm passes template-literal text as-is. Move every "<placeholder>" through a ${"..."} interpolation so they print as `<` and `>` in the rendered DOM.

Stays self-contained on the Done step: no backend changes, no schema changes. The hostnameFromUrl + buildElementWebLoginLink helpers are small enough to live inline; if they grow more callers later we can extract them.
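The two inline helpers could look roughly like this. It is a sketch under assumptions, not the code in matrix-onboarding-page.ts: the parameter names and the choice to pass the operator's full MXID as `login_hint` are guesses; only the base URL and the `hs_url` / `login_hint` parameter names come from the text above.

```typescript
// Extract the host part of a URL, or undefined when the input does not parse.
function hostnameFromUrl(url: string): string | undefined {
  try {
    return new URL(url).hostname;
  } catch {
    return undefined;
  }
}

// Build an Element Web login link with the homeserver and operator prefilled.
// Assumption: login_hint receives the operator's Matrix ID.
function buildElementWebLoginLink(homeserverUrl: string, operatorMxid: string): string {
  const params = new URLSearchParams({ hs_url: homeserverUrl, login_hint: operatorMxid });
  return `https://app.element.io/#/login?${params.toString()}`;
}
```

Using `URLSearchParams` keeps the percent-encoding of the homeserver URL and the MXID out of hand-written string concatenation.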
This is the open-core installer. It should feel like technical
self-hosted setup for operators, not a managed-SaaS onboarding. This
pass adjusts copy, success/error consistency, and a few targeted state
behaviors without redesigning the visual language.
Highlights:
- Header chip now reads "Open-core setup · DIY self-hosted" so the
open-core path is visible up front.
- Welcome: honest 5–15 minute estimate that depends on network mode;
explicit external-connections framing instead of the previous
"nothing leaves until Matrix" line.
- Login: open-core access framing, plain-language bootstrap-token copy,
"Continue" button. Token-recovery hint always visible.
- Preflight: friendly labels for raw check IDs (host-os → Host OS,
openclaw-dns → Runtime backend DNS, etc.), and reordered by
practical importance (privileges, ports, disk, Docker, DNS, time).
- Matrix: title becomes "Matrix control plane"; field renamed
"Public base URL" → "Matrix URL" with mode-aware helper. LAN
guidance is rewritten as 3 short bullets and now substitutes the
real LAN IP into the example URL (was rendering `&lt;node-LAN-IP&gt;`
literally because the htm template embedded HTML entities). Public
is marked as the advanced path. Local dev guidance is honest
("only this machine can reach the homeserver directly").
- Mailbox: tighter subtitle, clearer password helper, success message
reads "Mailbox connection looks good", folder hint added.
- Provider: explicit "OpenRouter as the supported LLM provider in
open core" framing; model field reframed as "Initial default";
validation copy clarifies error path.
- Modules: reframed as informational "Installed components" — both
components are always installed in open core today, so the step
no longer pretends to be a real choice.
- Review: grouped into Matrix / Mailbox / Provider / Components /
Secrets sections; secrets shown as set/missing only; primary
button reads "Install locally"; secret-handling note rewritten.
- Progress: calm subtitle that adapts to deploy mode; phase labels
use product-level words ("Installing runtime backend",
"Setting up Matrix", "Activating components") and the relay
phase is suppressed entirely unless that step actually ran or
failed (open core defaults to no relay).
- Install failed: title is "Install stopped before completion"; a
humanized failure summary explains *why* the step stopped and
suggests the next action (sudo, IMAP creds, OpenRouter key,
Matrix prerequisites). Raw error code/message stays visible
beneath, and a Copy raw error button bundles the failed step +
step list for paste-back. Step detail opens by default on failure.
- Validation error path: contract errors that arrive before a job
exists (e.g. missing OpenRouter key) now render as a calm
"Provider configuration incomplete" page with Back to provider
and Back to review actions, instead of a raw API_ERROR banner.
- Done: title is mode-aware ("Local dev install completed" vs
"Your node is installed") and the page no longer says ready while
also showing a red CONFIG_NOT_FOUND banner — that specific case
now renders as a neutral "Node finalizing" warn block instead.
LAN preconditions card is restructured into three numbered
sub-headings (trust the CA / verify port 443 / optional hostname).
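
The Preflight relabeling and reordering listed above could look roughly like the sketch below. Only the `host-os` and `openclaw-dns` labels and the category order (privileges, ports, disk, Docker, DNS, time) come from this commit message; every other check ID, label, and helper name is a hypothetical placeholder, not the actual implementation.

```typescript
// Labels for "host-os" and "openclaw-dns" are quoted from the commit message.
// All other IDs and labels below are hypothetical placeholders.
const CHECK_LABELS: Record<string, string> = {
  "host-os": "Host OS",
  "openclaw-dns": "Runtime backend DNS",
  "root-privileges": "Administrator privileges", // hypothetical ID
  "docker": "Docker",                            // hypothetical ID
};

// Display order following "privileges, ports, disk, Docker, DNS, time".
// Every ID except "openclaw-dns" is an assumption.
const DISPLAY_ORDER: string[] = [
  "root-privileges",
  "ports",
  "disk-space",
  "docker",
  "openclaw-dns",
  "clock-sync",
];

// Fall back to the raw check ID when no friendly label is known.
function labelFor(id: string): string {
  return CHECK_LABELS[id] ?? id;
}

// Sort known checks by display order; unknown checks go last and keep
// their relative order (Array.prototype.sort is stable).
function sortChecks<T extends { id: string }>(checks: T[]): T[] {
  const rank = (id: string): number => {
    const i = DISPLAY_ORDER.indexOf(id);
    return i === -1 ? DISPLAY_ORDER.length : i;
  };
  return [...checks].sort((a, b) => rank(a.id) - rank(b.id));
}
```

Falling back to the raw ID keeps a newly added backend check visible even before the UI has a label for it.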
No backend changes. No schema changes. Visual language unchanged.
Items intentionally left out:
- Earlier provider-key validation: would require a new API endpoint
and an OpenRouter round-trip from the wizard; deferred to a
follow-up PR.
- Real module picker: still no real choice; reframed informational.
- Done-page nextSteps still derived from wizardState; backend
population of result.nextSteps remains tracked separately.
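
As an illustration of the Progress change above (the relay phase is hidden unless that step actually ran or failed), a minimal sketch follows. The phase IDs, status values, and function name are assumptions made for this example; the commit message does not specify them.

```typescript
// Hypothetical status values; the commit message only distinguishes
// "ran or failed" from "did not run".
type StepStatus = "pending" | "running" | "succeeded" | "failed" | "skipped";

interface Phase {
  id: string;     // e.g. "relay" (hypothetical ID)
  label: string;  // product-level wording shown to the operator
  status: StepStatus;
}

// Keep every phase except a relay phase that never started.
// Treating "running" as "ran" is an assumption.
function visiblePhases(phases: Phase[]): Phase[] {
  return phases.filter((p) => {
    if (p.id !== "relay") return true;
    return p.status === "running" || p.status === "succeeded" || p.status === "failed";
  });
}
```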
Summary
Replaces the JSON-textarea install screen and the admin-shell-first navigation with a guided, wizard-first first-run experience and a task-oriented operator admin console. The technical foundation from the original PR is preserved: Fastify-served static SPA, no-build-step Preact + htm, vendored deps, CSRF + cookie sessions, two-stage auth (bootstrap-token → Matrix-password), and all install/preflight/test-connection/install-job/status/reconfigure/onboarding routes.
The UI is now split into two distinct modes inside the same SPA, with mode detected client-side from `/api/status` (no new backend endpoint needed):

- `#/setup/*`: Welcome, Preflight, Matrix, Mailbox, Provider, Modules, Review, Progress, Done.
- `#/admin/*`: Status (operator overview), Mailbox / Matrix / Provider (Test → Apply), Onboarding, Recovery.

After a successful install, the wizard auto-issues a Matrix onboarding code, surfaces the Element homeserver URL and alert room as the primary CTA, and points the operator at Matrix as the day-2 control plane. The flow ends in Matrix, not in the web UI.
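
A minimal sketch of the client-side mode detection, assuming the rule this PR gives for `public/setup-ui/app.js` (`installationId && version.provenance` ⇒ admin; otherwise first-run; a 5xx shows a retry splash instead of falling back to first-run). The type names, function names, and the first-run redirect target are hypothetical; only the rule itself and the `#/setup` / `#/setup?force=1` behaviour are taken from the PR text.

```typescript
// Only installationId and version.provenance are named in the PR;
// the rest of this shape is an assumption.
interface StatusResponse {
  installationId?: string | null;
  version?: { provenance?: string | null };
}

type UiMode = "first-run" | "admin" | "error";

// A 5xx yields "error" (the "Could not load status / Retry" splash)
// rather than silently falling back to first-run.
function detectMode(httpStatus: number, body: StatusResponse): UiMode {
  if (httpStatus >= 500) return "error";
  return body.installationId && body.version?.provenance ? "admin" : "first-run";
}

// Decide where a hash route should land for the detected mode.
function resolveRoute(mode: UiMode, hash: string): string {
  const [path, query = ""] = hash.split("?");
  const force = query.split("&").includes("force=1");
  // In admin mode, #/setup redirects to #/admin/status unless ?force=1.
  if (mode === "admin" && path.startsWith("#/setup") && !force) return "#/admin/status";
  // Assumption: first-run sends admin routes back to the wizard.
  if (mode === "first-run" && path.startsWith("#/admin")) return "#/setup";
  return hash;
}
```

Keeping detection as a pure function of the status payload makes the redirect rules easy to check without a browser.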
Backend

- `POST /api/install/preflight`, `POST /api/install/test-imap`, `POST /api/install/test-matrix`, `POST /api/install/run`, `GET /api/install/jobs/:jobId`, `POST /api/onboarding/issue`, `GET /api/onboarding/state`. The reconfigure pages reuse `POST /api/reconfigure/{imap,matrix,openrouter}` and the same `test-imap` / `test-matrix` endpoints.
- `src/contracts/install.test.ts`: the exact JSON shape the wizard emits is asserted to round-trip through `installRequestSchema`. This is the single guard against frontend/backend drift.

Frontend
New mode-aware host (`public/setup-ui/app.js`) detects mode from `/api/status` (`installationId && version.provenance` ⇒ admin; otherwise first-run), enforces redirects between modes, redirects legacy hash routes (`#/install`, `#/status`, `#/reconfigure/*`, `#/onboarding`) to the new IA for one release cycle, and surfaces a "Could not load status / Retry" splash on 5xx instead of falling back to first-run.

Wizard (`public/setup-ui/screens/wizard/`):

- `WizardShell` and `Stepper`.
- […] `localStorage` under `sov:setup-ui:wizard:v1`; secrets re-prompted on rehydrate.
- If a `jobId` is in localStorage, the wizard reads the job state on load and jumps to Progress / Success / Review-with-error as appropriate.
- […] `<details>`.
- […] `nextSteps.elementHomeserverUrl` as the primary CTA, and surfaces operator username + alert room name + onboarding code with copy buttons.

Admin (`public/setup-ui/screens/admin/`):

- `Status.js`: operator-overview rollup. Worst-of all health values for the headline; critical-action banner with explicit CTAs ("Reconfigure mailbox", etc.); healthy / warning / error groups; existing flat services table preserved inside a collapsible `<details>`.
- `Recovery.js`: sign-post page only. Shows current auth stage, links to onboarding-code re-issue, prints the exact `sudo sovereign-node setup-ui issue-bootstrap-token` command for terminal recovery, and points at `sovereign-node doctor / logs / reconfigure` for break-glass.

Reconfigure (`public/setup-ui/screens/reconfigure.js` rewritten): each form (mailbox, Matrix, provider) drives an `editing → testing → tested-ok | tested-fail → applying → applied-ok | applied-fail` state machine. Apply is disabled until a passing test (where applicable). Editing any field after a passing test resets the form to `editing`. After Apply, `restartRequiredServices` is rendered in plain language ("OpenClaw gateway will restart to pick up the change…") instead of raw service names.

CSS: no design-token changes. The existing palette (`#050816` background, `#f97317` orange CTA, Space Grotesk + Inter Tight) and component classes (`.card`, `.field`, `.btn`, `.alert`, `.badge`, `.steps`, `.kv`, `.table`, `.code-block`) are reused. New classes added for wizard layout (`.wizard-shell`, `.stepper`, `.module-list`, `.btn--xl`, `.bullet-list`, etc.) using the same tokens.

Removed: `public/setup-ui/screens/install.js` (JSON textarea) and `public/setup-ui/screens/status.js` (flat table).

Tests & checks

- `pnpm typecheck`, `pnpm lint`, `pnpm test` all green (388 tests passing, was 327; same set, no regressions).
- `pnpm test:coverage:integration` and `:unit`: green; coverage thresholds untouched.
- `pnpm build` + postbuild copies all new wizard / admin / components files into `dist/public/setup-ui/`.

Out of scope (deferred follow-ups)

- […] `restartRequiredServices` and `validation` checks but the backend does not auto-revert on partial failure. Out of scope here.
- […] `POST /api/reconfigure/{imap,matrix}/test` against currently-persisted secrets. The current reconfigure forms require the operator to retype the password to test. That is honest behaviour for "secret-bearing reconfiguration", but worth a dedicated test endpoint later.
- […] `#/setup?json=1` later if there is demand.

Test plan

- `pnpm typecheck`, `pnpm lint`, `pnpm test` all green
- `pnpm build` produces `dist/public/setup-ui/screens/wizard/` and `…/admin/` and `…/components/`
- `pnpm test:coverage:integration` and `:unit` green
- `pnpm dev:api` → http://127.0.0.1:8787/ → redirects to `/setup-ui/` → wizard appears (because no install) → walk all 9 steps → arrive at Done with onboarding code rendered
- […] `#/admin/status` with operator-overview rollup.
- `#/setup` redirects to `#/admin/status`. `#/setup?force=1` re-enters the wizard.
- Legacy hash routes (`#/install`, `#/status`, `#/reconfigure/imap`, `#/reconfigure/matrix`, `#/reconfigure/openrouter`, `#/onboarding`) redirect to their new equivalents.
- […] `editing`. Apply on provider works without a Test.
- […] `proxmox-pool-vm dev create`, install via `scripts/install.sh`, tunnel via `ssh -L 8787:127.0.0.1:8787`, walk the wizard end-to-end on a fresh node, then validate the upgrade scenario, then `proxmox-pool-vm dev delete`.
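
The reconfigure form flow described under Frontend (`editing → testing → tested-ok | tested-fail → applying → applied-ok | applied-fail`) can be sketched as a small transition function. The state names come from this PR; the event names, the `requiresTest` flag, and the choice to ignore edits while a test or apply is in flight are assumptions made for illustration.

```typescript
type FormState =
  | "editing" | "testing" | "tested-ok" | "tested-fail"
  | "applying" | "applied-ok" | "applied-fail";

// Hypothetical event names; the PR only names the states.
type FormEvent =
  | "edit" | "test" | "test-passed" | "test-failed"
  | "apply" | "apply-succeeded" | "apply-failed";

// Apply is allowed after a passing test, or straight from "editing" for
// forms that need no test (the test plan notes this for the provider form).
function canApply(state: FormState, requiresTest: boolean): boolean {
  return state === "tested-ok" || (!requiresTest && state === "editing");
}

function nextState(state: FormState, event: FormEvent, requiresTest = true): FormState {
  const busy = state === "testing" || state === "applying";
  switch (event) {
    case "edit":
      // Editing any field resets the form, including after a passing test.
      return busy ? state : "editing";
    case "test":
      return busy ? state : "testing";
    case "test-passed":
      return state === "testing" ? "tested-ok" : state;
    case "test-failed":
      return state === "testing" ? "tested-fail" : state;
    case "apply":
      return canApply(state, requiresTest) ? "applying" : state;
    case "apply-succeeded":
      return state === "applying" ? "applied-ok" : state;
    case "apply-failed":
      return state === "applying" ? "applied-fail" : state;
    default:
      return state;
  }
}
```

In this sketch the Apply button's disabled state would simply be `!canApply(state, requiresTest)`, so the gating rule lives in one place.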