m1: parent-control web UI (closes #110) #136

Open
hanwencheng wants to merge 10 commits into main from claude/pensive-poincare-1a631e

Summary

Phase 1 mobile-responsive parent-control web UI for the M1 demo. Resolves #110.

Implements the Claude Design handoff (iii.dev-inspired aesthetic — IBM Plex Mono + Serif, cream/ink palette, hairline rules, per-section accent hues) as a Next.js 14 app under apps/parent-control/.

Pages

| Page | Surface |
| --- | --- |
| actors | HDKD tree + devices/agents table + stats strip |
| actor detail | per-namespace scope toggles (deny/read/read+write), payment-cap inputs, live cap-tokens with per-cap revoke |
| audit feed | SSE-simulated stream filterable by worker, click any row for event detail modal |
| anchor status | countdown to next tier-2 batch + recent Merkle roots with explorer links |
| workers | five worker cards (memory/credentials/audit/email/payment) with per-actor usage share + trust profile |
| logo | six Bedlington Terrier variants (profile/front-cute/cloud/monogram/seal/icon) for brand exploration |

Demo Act 3 (revocation)

Open a device → "revoke device" → K11 WebAuthn modal renders the intent context (per arch.md §10.1) with mock Touch ID scan → on confirm, actor flips to revoked and a device.revoked event appears at the top of the audit feed within ~200ms. The K11 modal also wraps every per-cap revoke.

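Stack

Next.js + thin client (no backend in this project), matching issue #110. Mock data is inlined for M1.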

What landed

  • apps/parent-control/ Next.js scaffold (package.json, tsconfig, next.config, .gitignore, README, layout, page entry)
  • apps/parent-control/app/globals.css — port of design styles.css verbatim (CSS vars, section accents, layout grid, mobile breakpoints at 820/640px)
  • apps/parent-control/app/_components/types.ts — Actor, AuditEvent, Worker, PendingAction, Route, ChipKind
  • apps/parent-control/app/_components/data.ts — INITIAL_ACTORS (Sara master + 4 agents), INITIAL_EVENTS, SIM_EVENTS, NAMESPACES, CHIP_STYLES
  • apps/parent-control/app/_components/shared.tsx — Chip, Dot, Panel, PageHead, TripleToggle, Modal, WebAuthnModal, ActorTree
  • apps/parent-control/app/_components/pages.tsx — ActorsPage, ActorDetailPage, MasterDetail, AuditPage, AnchorPage
  • apps/parent-control/app/_components/workers.tsx — WorkersPage + WorkerDetail with per-worker hue mapping
  • apps/parent-control/app/_components/logos.tsx — 6 Bedlington variants + LogoPage gallery
  • apps/parent-control/app/_components/App.tsx — main App with sidebar nav, SSE event simulation (~4.2s tick), revoke device/scope flows, toast, event detail modal

What did NOT land

All plan steps shipped. No deferrals.

Verified

  • npm run build — ✓ 4 static pages, 15.1 kB route, 102 kB First Load JS
  • npm run typecheck — clean (tsc --noEmit)
  • npm run dev — page renders SSR HTML containing the brand header, HDKD sidebar (master + 4 agents), stats grid, actor tree with branch glyphs, devices table — all data shapes from arch.md (actor_omni, K6/K10/K11, cap-token, tier-1/tier-2)

Test plan

  • cd apps/parent-control && npm install && npm run dev — confirm http://localhost:3113 loads
  • Navigate every sidebar item: actors / audit feed / anchor status / workers / logo / each agent in the actor tree
  • Open FoloToy bear → "revoke device" → K11 modal → tap "authorize · Touch ID" → toast + jump to audit feed → "device.revoked" event at top
  • Open Pluto → revoke a memory:read cap → K11 modal → confirm → "cap.revoked" event appears
  • Audit feed: pause/resume, filter buttons, click any row → event detail modal
  • Anchor status: countdown updates every second, recent batches table populated
  • Workers: click a worker card → detail view with usage table and trust profile → click an actor row → jumps to actor detail
  • Logo: try every variant + every background swatch (cream/ink/amber/sage/indigo)
  • Resize to ≤820px: sidebar collapses to hamburger, feed reflows
  • Resize to ≤640px: feed-row stacks, stats grid stacks
  • Lighthouse mobile audit ≥ 90 on Performance + Accessibility

Next.js 14 app under apps/parent-control/ implementing the Phase 1
mobile-responsive parent dashboard for the M1 demo.

Six pages, iii.dev-styled (IBM Plex Mono + Serif, cream/ink palette,
hairline rules, per-section accent hues):

- actors        — HDKD tree + devices/agents table + stats strip
- actor detail  — per-namespace scope toggles (deny/read/read+write),
                  payment-cap inputs, live cap-tokens with per-cap revoke
- audit feed    — SSE-simulated stream filterable by worker
- anchor status — countdown to next tier-2 batch + recent Merkle roots
- workers       — five worker cards (memory/credentials/audit/email/payment)
                  with per-actor usage share + trust profile
- logo          — six Bedlington Terrier variants for brand exploration

Demo Act 3 path is wired end-to-end: revoke device → K11 WebAuthn modal
with intent context (per arch.md §10.1) and mock Touch ID scan → on
confirm, actor flips to revoked status and a device.revoked event
appears at the top of the audit feed within ~200ms.

Stack matches issue #110: Next.js + thin client (no backend in this
project). Mock data is inlined for M1; M2 wires to the broker session
JWT + audit-service SSE feed (per #109).

Port 3113 aligns with arch.md §22c.1 (canonical web-UI surface). When
this UI is later folded into agentkeys daemon's `web` subcommand, the
URL stays identical.

Source: design handoff from claude.ai/design — port preserves visuals
1:1 while splitting the single-file React+Babel prototype into typed
TSX modules (types/data/shared/pages/workers/logos/App).

Foundation for issue #110 follow-up. Removes all inline mock data from
the parent-control UI and introduces a single AgentKeysClient interface
that every read + write call now flows through. Adds cargo-llvm-cov to
CI as a non-blocking artifact (threshold gating arrives in PR-C).

# What changed

apps/parent-control/lib/client/types.ts
  AgentKeysClient interface: listActors, getActor, listCapTokens,
  listRecentAuditEvents, streamAudit, listWorkers, getWorker,
  getAnchorStatus, updateScope, updatePaymentCap, revokeDevice,
  revokeCap, enrollK11Begin, enrollK11Finish. Discriminated Result<T>
  forces every consumer to handle the disconnected variant explicitly.

apps/parent-control/lib/client/empty.ts
  EmptyBackend — default implementation. Every method returns
  { ok: false, status: { kind: 'disconnected', reason: 'no-backend-configured' } }.
  No mock data. Operator sees explicit empty states.
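A minimal sketch of how a discriminated Result<T> can force callers to handle the disconnected variant. Only the `ok` / `status.kind` / `status.reason` shape and the 'no-backend-configured' literal come from the description above; `ActorSummary`, `disconnected()` and `describeActors()` are hypothetical names, and the methods are synchronous here for brevity where the real ones are presumably async.

```typescript
// Sketch only: shapes beyond `ok`, `status.kind` and `status.reason` are assumptions.
type Disconnected = { kind: "disconnected"; reason: string };

type Result<T> =
  | { ok: true; value: T }
  | { ok: false; status: Disconnected };

// Hypothetical actor shape, just enough to have something to list.
interface ActorSummary {
  id: string;
  label: string;
}

// Builds the disconnected variant quoted in the description.
function disconnected<T>(reason: string): Result<T> {
  return { ok: false, status: { kind: "disconnected", reason } };
}

// A backend with no data source: every call reports "disconnected".
class EmptyBackend {
  listActors(): Result<ActorSummary[]> {
    return disconnected("no-backend-configured");
  }
}

// `value` is only reachable after the `ok` check, so the compiler
// rejects code that forgets the disconnected branch.
function describeActors(result: Result<ActorSummary[]>): string {
  if (!result.ok) {
    return `disconnected: ${result.status.reason}`;
  }
  return result.value.length === 0
    ? "no actors enrolled"
    : `${result.value.length} actor(s)`;
}
```

Because `value` does not exist on the `ok: false` branch, a list page has to decide what to render when no backend is configured, which is what produces the explicit empty states.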

apps/parent-control/lib/client/index.ts
  selectBackend() factory. Reads NEXT_PUBLIC_AGENTKEYS_BACKEND;
  defaults to 'empty'. 'daemon' falls back to 'empty' with a console
  warning until DaemonBackend lands in PR-C.
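The selection rule can be sketched as a pure function. This is an illustration under assumptions, not the actual selectBackend(): `selectBackendKind`, the `daemonAvailable` flag, the injected `warn` callback, and the handling of unknown values are all hypothetical.

```typescript
type BackendKind = "empty" | "daemon";

// `envValue` stands in for NEXT_PUBLIC_AGENTKEYS_BACKEND.
// `warn` is injected so the sketch does not depend on a console global.
function selectBackendKind(
  envValue: string | undefined,
  daemonAvailable: boolean,
  warn: (message: string) => void,
): BackendKind {
  // Unset or explicitly 'empty': the default backend.
  if (envValue === undefined || envValue === "" || envValue === "empty") {
    return "empty";
  }
  if (envValue === "daemon") {
    if (daemonAvailable) return "daemon";
    // Requested but not implemented yet: warn and fall back.
    warn("daemon backend requested but not available; falling back to empty");
    return "empty";
  }
  // Assumption: unknown values are treated like a missing backend.
  warn(`unknown backend "${envValue}"; falling back to empty`);
  return "empty";
}
```

In a Next.js app the value would typically be read from `process.env.NEXT_PUBLIC_AGENTKEYS_BACKEND`, since variables with the NEXT_PUBLIC_ prefix are inlined into the client bundle at build time.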

apps/parent-control/lib/ClientProvider.tsx
  React context + useClient() / useConnectionStatus() hooks.
  Wraps the whole app in app/layout.tsx.

apps/parent-control/lib/constants.ts
  NAMESPACES, CHIP_STYLES (config, not mock data).

apps/parent-control/app/_components/data.ts
  DELETED. Was the home of INITIAL_ACTORS, INITIAL_EVENTS, SIM_EVENTS.

apps/parent-control/app/_components/App.tsx
  Rewritten to fetch via useClient() on mount. Subscribes to
  client.streamAudit. Revoke flows now call client.revokeDevice +
  client.revokeCap; scope/payment updates call client.updateScope +
  client.updatePaymentCap with optimistic rollback on rejection.
  New sidebar section 'onboarding' with two stub pages (full wizard +
  WebAuthn ceremony land in PR-B).
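Optimistic rollback on rejection can be sketched as follows. The names `ScopeState`, `WriteResult` and `setScopeOptimistically` are hypothetical, and the sketch mutates a plain object where the real component presumably goes through React state.

```typescript
type Scope = "deny" | "read" | "read+write";
type WriteResult = { ok: true } | { ok: false; reason: string };

// Hypothetical local state: one scope per namespace.
interface ScopeState {
  scopes: Record<string, Scope>;
}

async function setScopeOptimistically(
  state: ScopeState,
  namespace: string,
  next: Scope,
  commit: (namespace: string, scope: Scope) => Promise<WriteResult>,
): Promise<WriteResult> {
  // Remember the old value, then show the new one immediately.
  const previous = state.scopes[namespace];
  state.scopes[namespace] = next;

  let outcome: WriteResult;
  try {
    outcome = await commit(namespace, next);
  } catch (err) {
    // A thrown error is treated the same as an explicit rejection.
    outcome = { ok: false, reason: String(err) };
  }

  if (!outcome.ok) {
    // Roll back to what the user saw before the toggle.
    if (previous === undefined) {
      delete state.scopes[namespace];
    } else {
      state.scopes[namespace] = previous;
    }
  }
  return outcome;
}
```

The toggle reacts at once, and the previous value is restored only if the backend rejects the write or the call throws.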

apps/parent-control/app/_components/pages.tsx
apps/parent-control/app/_components/workers.tsx
  Empty-state rendering everywhere a list was previously inlined.
  ActorsPage, AuditPage take ConnectionStatus prop; WorkersPage owns
  its own fetch via useClient(). Every empty state explains which
  daemon endpoint will populate it.

apps/parent-control/app/_components/shared.tsx
  Adds <EmptyState status={...}> component used by every list page.

.github/workflows/coverage.yml
  cargo-llvm-cov via taiki-e/install-action. Runs on every PR that
  touches crates/**, generates lcov + html, attaches both as
  artifacts, prints summary to job summary. Non-blocking. Threshold
  gating lands in PR-C.

# Verified

- npm run typecheck — clean
- npm run build — 4 static pages, 16.5 kB route, 104 kB First Load JS
- npm run dev   — HTTP 200, empty state renders 'no actors enrolled'
                   + 'No daemon backend configured.' + harness hint;
                   no 'Sara' / 'FoloToy' / mock data in the SSR HTML.

# What did NOT land (intentional, per PR-A scope)

- DaemonBackend implementation (PR-C)
- Real WebAuthn ceremony (PR-B)
- Coverage threshold gate (PR-C)
- Harness v2-stage1 onboarding wizard (PR-B)
- Daemon HTTP endpoints for actors/audit/anchor/workers (PR-C)

Issue #110 follow-up. Replaces the simulated 'Touch ID scan' modal
with a real browser-driven K11 WebAuthn ceremony backed by a new
daemon mode and HTTP surface.

# Daemon — new ui-bridge mode

crates/agentkeys-daemon/src/ui_bridge.rs (new)
  Dedicated HTTP surface for the parent-control web UI. Binds
  127.0.0.1:3114 by default, CORS-allows http://localhost:3113.
  Routes:
    GET  /healthz
    POST /v1/k11/enroll/begin   → returns PublicKeyCredentialCreationOptions
    POST /v1/k11/enroll/finish  → verifies attestation with webauthn-rs,
                                  returns credentialId + chain stub
  State is in-memory (pending HashMap keyed by user_id). On-chain
  SidecarRegistry.register_master_device() submission stubbed for M1
  (chain_tx_hash returns null); lands in PR-C.

crates/agentkeys-daemon/src/main.rs
  New --ui-bridge mode + 4 args (--ui-bridge-bind / --ui-bridge-origin /
  --ui-bridge-rp-id / --ui-bridge-rp-name). Independent of --proxy and
  --master-companion.

crates/agentkeys-daemon/Cargo.toml
  Adds webauthn-rs 0.5, tower-http 0.5 (cors feature), url 2.

# Daemon — unit tests (cargo llvm-cov visible)

crates/agentkeys-daemon/src/ui_bridge.rs::tests (6 tests, all green)
  - begin_returns_user_id_and_creation_options
  - begin_rejects_empty_username
  - finish_with_unknown_user_id_returns_no_pending
  - finish_with_malformed_credential_returns_malformed
  - replay_after_consume_returns_no_pending (verifies pending entry is
    only consumed once the credential parses; parse-stage failure leaves
    pending intact so the user can retry)
  - healthz_returns_ok

  Run: cargo test -p agentkeys-daemon --bin agentkeys-daemon ui_bridge

# UI — DaemonBackend

apps/parent-control/lib/client/daemon.ts (new)
  DaemonBackend implements AgentKeysClient. status() pings /healthz.
  enrollK11Begin / enrollK11Finish wire to the new daemon endpoints.
  All other methods return a 'not yet wired' disconnected variant
  until PR-C lands the read endpoints (actors, audit-SSE, anchor,
  workers).

apps/parent-control/lib/client/index.ts
  selectBackend() now actually constructs DaemonBackend when
  NEXT_PUBLIC_AGENTKEYS_BACKEND=daemon.

# UI — real browser WebAuthn

apps/parent-control/lib/webauthn.ts (new)
  Helpers: base64url encode/decode, jsonToCreationOptions (server
  options → navigator.credentials.create() args), credentialToFinishPayload
  (PublicKeyCredential → daemon /finish JSON), webauthnAvailable +
  platformAuthenticatorAvailable feature detection.
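The base64url helpers are the piece everything else depends on, because WebAuthn options carry binary fields such as `challenge` and `user.id` that travel as text in JSON. A dependency-free sketch follows; the function names and this hand-rolled implementation are assumptions, not the contents of webauthn.ts.

```typescript
// URL-safe base64 alphabet (RFC 4648 section 5), used without padding.
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

function base64urlEncode(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const b0 = bytes[i] ?? 0;
    const b1 = bytes[i + 1] ?? 0;
    const b2 = bytes[i + 2] ?? 0;
    const triple = (b0 << 16) | (b1 << 8) | b2;
    out += ALPHABET.charAt((triple >> 18) & 63);
    out += ALPHABET.charAt((triple >> 12) & 63);
    // Emit the third and fourth characters only when real bytes back them.
    if (i + 1 < bytes.length) out += ALPHABET.charAt((triple >> 6) & 63);
    if (i + 2 < bytes.length) out += ALPHABET.charAt(triple & 63);
  }
  return out;
}

function base64urlDecode(text: string): Uint8Array {
  const unpadded = text.replace(/=+$/, ""); // tolerate padded input
  const out: number[] = [];
  let buffer = 0;
  let bits = 0;
  for (let i = 0; i < unpadded.length; i += 1) {
    const value = ALPHABET.indexOf(unpadded.charAt(i));
    if (value < 0) {
      throw new Error(`invalid base64url character at index ${i}`);
    }
    buffer = (buffer << 6) | value;
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out.push((buffer >> bits) & 0xff);
      buffer &= (1 << bits) - 1; // keep only the leftover low bits
    }
  }
  return new Uint8Array(out);
}
```

For example, the bytes 0xFB 0xFF encode to `-_8`, where standard base64 would give `+/8=`.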

apps/parent-control/app/_components/onboarding.tsx (new)
  Onboarding wizard mirroring harness/v2-stage1-demo.sh as 8 numbered
  steps. Step 3 (K11 WebAuthn) is LIVE: clicking 'run' fetches options
  from daemon /v1/k11/enroll/begin, invokes a real
  navigator.credentials.create(), and ships the attestation to
  /v1/k11/enroll/finish. The other 7 steps are honestly labeled
  'stubbed; lands in PR-C'.

apps/parent-control/app/_components/App.tsx
  Routes /onboarding to the live OnboardingPage (replaces the PR-A
  stub list).

# To exercise the real ceremony

  $ cargo run -p agentkeys-daemon -- --ui-bridge &
  $ cd apps/parent-control
  $ echo 'NEXT_PUBLIC_AGENTKEYS_BACKEND=daemon' > .env.local
  $ npm run dev
  # open http://localhost:3113 → 'add device' → step 3 'run'
  # browser triggers Touch ID / Windows Hello / passkey UI for real

# Verified

- cargo build -p agentkeys-daemon — clean
- cargo test  -p agentkeys-daemon --bin agentkeys-daemon ui_bridge — 6/6 green
- npx tsc --noEmit                — clean
- npm run build                   — 4 static pages, 19.4 kB route, 107 kB First Load

# What did NOT land (intentional, per PR-B scope)

- Daemon read endpoints (/v1/actors, /v1/audit/stream, etc.)  → PR-C
- Identity ceremony, K10 gen, SIWE, STS, provision, chain bring-up,
  on-chain register-master-device wiring                      → PR-C
- Coverage threshold gate (blocking)                          → PR-C

Issue #110 follow-up. Wires the parent-control UI to a real daemon
backend end-to-end: read endpoints for actors / audit-SSE / anchor /
workers, write endpoints for scope / payment-cap / device-revoke /
cap-revoke, and a harness page mirroring the v2-stage2 + v2-stage3
shell scripts.

# Daemon — ui-bridge expansion

crates/agentkeys-daemon/src/ui_bridge.rs
  New ApiActor / ApiAuditEvent / ApiCapToken / ApiWorker / ApiAnchorStatus
  serializable types. New state: actors HashMap, caps HashMap,
  audit VecDeque (ring buffer, AUDIT_BUFFER_CAP=200), audit_tx
  broadcast::Sender for SSE, workers HashMap, anchor RwLock.

  New routes:
    GET  /v1/actors                       list_actors (sorted master-first)
    GET  /v1/actors/:id                   get_actor
    GET  /v1/actors/:id/caps              list_caps
    POST /v1/actors/:id/scope             update_scope + audit emit
    POST /v1/actors/:id/payment-cap       update_payment_cap + audit emit
    POST /v1/actors/:id/revoke            revoke_device + audit emit + cap clear
    POST /v1/actors/:id/caps/revoke       revoke_cap + audit emit
    GET  /v1/audit/recent?actor_id&limit  list_recent_audit (filterable)
    GET  /v1/audit/stream                 audit_stream (SSE via tokio broadcast)
    GET  /v1/anchor/status                anchor_status (dynamic next_anchor_in)
    GET  /v1/workers                      list_workers
    GET  /v1/workers/:id                  get_worker
    POST /v1/dev/seed                     dev_seed (operator-only data injection)
    POST /v1/dev/event                    dev_emit_event (manual audit emit)

  push_audit() helper ring-buffers + broadcasts in one place.
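The "ring-buffer + broadcast in one place" behaviour can be modelled in a few lines. This is a TypeScript model of the idea only, not the Rust code: the daemon uses a VecDeque and a tokio broadcast channel, while `AuditLog`, `AuditRecord` and `subscribe()` below are hypothetical stand-ins. Only the cap of 200 comes from the description.

```typescript
const AUDIT_BUFFER_CAP = 200; // value taken from the description above

// Hypothetical record shape for the model.
interface AuditRecord {
  seq: number;
  kind: string;
}

type AuditListener = (record: AuditRecord) => void;

class AuditLog {
  private readonly cap: number;
  private readonly recent: AuditRecord[] = [];
  private listeners: AuditListener[] = [];

  constructor(cap: number = AUDIT_BUFFER_CAP) {
    this.cap = cap;
  }

  // Single entry point: store the record, evict the oldest when over
  // the cap, then fan the record out to every live subscriber.
  push(record: AuditRecord): void {
    this.recent.push(record);
    if (this.recent.length > this.cap) {
      this.recent.shift();
    }
    for (const listener of this.listeners) {
      listener(record);
    }
  }

  // Returns an unsubscribe function.
  subscribe(listener: AuditListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Copy of the buffered records, oldest first.
  snapshot(): AuditRecord[] {
    return this.recent.slice();
  }
}
```

Keeping both effects in one method means a handler cannot record an event without also streaming it, or the reverse.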

crates/agentkeys-daemon/Cargo.toml
  Adds futures-util 0.3 + tokio-stream 0.1 (sync feature) for SSE
  stream wrapping of the broadcast receiver.

# Daemon — tests (20 total, all green; previous 6 plus 14 new)

  list_actors_returns_empty_when_nothing_registered
  list_actors_returns_master_first
  get_actor_unknown_returns_404
  get_actor_known_returns_payload
  update_scope_writes_and_emits_audit
  update_scope_unknown_actor_404
  update_payment_cap_writes_and_emits_audit
  revoke_device_flips_status_and_clears_caps
  revoke_cap_removes_only_matching_cap_and_emits_audit
  dev_seed_populates_all_collections
  list_workers_empty_by_default
  get_worker_unknown_returns_404
  audit_buffer_caps_at_buffer_cap
  audit_stream_subscribes_before_emit_and_receives

  Run: cargo test -p agentkeys-daemon --bin agentkeys-daemon ui_bridge

# UI — DaemonBackend full wiring

apps/parent-control/lib/client/daemon.ts
  Every AgentKeysClient method now hits a real daemon endpoint:
  listActors, getActor (404 → null), listCapTokens, listRecentAuditEvents,
  streamAudit (EventSource on /v1/audit/stream listening for 'audit'
  events), listWorkers, getWorker, getAnchorStatus, updateScope,
  updatePaymentCap, revokeDevice, revokeCap, enrollK11Begin, enrollK11Finish.
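A sketch of the streamAudit subscription, written against a minimal structural interface so it does not depend on DOM typings. `AuditSource`, `AuditEventView`, `subscribeAudit` and the `id` / `kind` payload fields are hypothetical; only the 'audit' event name comes from the description.

```typescript
interface AuditMessage {
  data: string;
}
type AuditMessageListener = (message: AuditMessage) => void;

// The three methods this sketch needs from an event source.
interface AuditSource {
  addEventListener(type: string, listener: AuditMessageListener): void;
  removeEventListener(type: string, listener: AuditMessageListener): void;
  close(): void;
}

// Hypothetical payload shape.
interface AuditEventView {
  id: string;
  kind: string;
}

function isAuditEventView(value: unknown): value is AuditEventView {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record["id"] === "string" && typeof record["kind"] === "string";
}

// Listens for named 'audit' events and returns an unsubscribe function.
function subscribeAudit(
  source: AuditSource,
  onEvent: (event: AuditEventView) => void,
): () => void {
  const listener: AuditMessageListener = (message) => {
    let parsed: unknown;
    try {
      parsed = JSON.parse(message.data);
    } catch {
      return; // ignore frames that are not valid JSON
    }
    if (isAuditEventView(parsed)) onEvent(parsed);
  };
  source.addEventListener("audit", listener);
  return () => {
    source.removeEventListener("audit", listener);
    source.close();
  };
}
```

A browser `EventSource` exposes these same three methods, so in the UI the source would presumably be created with `new EventSource(...)` pointed at /v1/audit/stream. Named SSE events are not delivered to `onmessage`, which is why the listener is registered for 'audit' explicitly.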

  Wire-type translation (snake_case daemon JSON ↔ camelCase UI types)
  lives in apiToActor / apiToAuditEvent / apiToWorker helpers.
  normalizeStatus + normalizeChip clamp daemon strings to the UI's
  StatusKind + ChipKind unions.
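The translation and clamping step might look like the sketch below. The helper names apiToActor and normalizeStatus come from the description; every field name, the members of `StatusKind`, and the 'unknown' fallback are assumptions made for illustration.

```typescript
// Wire shape as the daemon might send it (snake_case). Hypothetical fields.
interface ApiActorWire {
  actor_id: string;
  display_name: string;
  status: string;
  payment_cap_cents: number | null;
}

// Hypothetical UI-side union; the real StatusKind may differ.
type StatusKind = "active" | "revoked" | "unknown";

interface ActorView {
  actorId: string;
  displayName: string;
  status: StatusKind;
  paymentCapCents: number | null;
}

const KNOWN_STATUSES: readonly StatusKind[] = ["active", "revoked"];

// Clamp an arbitrary daemon string to the closed UI union.
function normalizeStatus(raw: string): StatusKind {
  const lowered = raw.trim().toLowerCase();
  for (const candidate of KNOWN_STATUSES) {
    if (candidate === lowered) return candidate;
  }
  return "unknown";
}

// Translate snake_case wire JSON into the camelCase UI type.
function apiToActor(wire: ApiActorWire): ActorView {
  return {
    actorId: wire.actor_id,
    displayName: wire.display_name,
    status: normalizeStatus(wire.status),
    paymentCapCents: wire.payment_cap_cents,
  };
}
```

Clamping at the boundary keeps the rest of the UI exhaustive over a closed union even if the daemon later introduces a status string the UI does not know.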

# UI — harness mirror

apps/parent-control/app/_components/harness.tsx (new)
  New /harness route. Lists every step of v2-stage2-demo.sh (8 steps)
  and v2-stage3-demo.sh (15 steps) with file:line source pointers and
  the invariant each step protects (when applicable). Includes the
  operator runbook (`AGENTKEYS_CHAIN=heima bash harness/v2-stage{1,2,3}-demo.sh`).

apps/parent-control/app/_components/App.tsx
  Sidebar gains 'stage 2 + 3' under 'onboarding'. Routes /harness to
  HarnessPage. Adds 'harness' to the data-section accent set.

# CI — coverage gate now blocking

.github/workflows/coverage.yml
  Removes continue-on-error: true. Adds
  `cargo llvm-cov report --workspace --fail-under-lines 60`.
  60% is a conservative floor — the new ui_bridge.rs module is well
  above it (20 unit tests covering every handler) so it carries the
  workspace. Bump in follow-up PRs as other crates' coverage catches up.

# Verified

- cargo build -p agentkeys-daemon                — clean
- cargo test  -p agentkeys-daemon ui_bridge      — 20/20 green
- npx tsc --noEmit (apps/parent-control)         — clean
- npm run build                                  — 4 static pages, 19.6 kB route, 110 kB First Load

# To exercise end-to-end

  $ cargo run -p agentkeys-daemon -- --ui-bridge &
  $ curl -X POST http://localhost:3114/v1/dev/seed \
      -H 'Content-Type: application/json' \
      -d @docs/dev-fixtures/parent-control-seed.json   # (operator can author)
  $ cd apps/parent-control
  $ echo 'NEXT_PUBLIC_AGENTKEYS_BACKEND=daemon' > .env.local
  $ npm run dev
  # browse http://localhost:3113 — actors, audit-stream, revoke flows
  # are all live; no mock data anywhere in the codebase

# What did NOT land (called out explicitly per CLAUDE.md plan-completion-policy)

- Daemon-side wiring of stage-2 + stage-3 harness steps into a live
  status feed (clickable 'run' per step) — the harness page is a
  read-only mirror today. Live execution from the UI is a follow-up.
- On-chain SidecarRegistry.register_master_device() submission from
  the K11 enroll/finish handler — still stubbed (chain_tx_hash=null).
- Mobile-device cross-device WebAuthn (M5).
- Coverage threshold above 60%: bump once non-daemon crates add tests.

Plan-only commit. No implementation. Maps every harness v2-stage{1,2,3}
step into a natural operator user flow with real inputs, and locks the
Phase 1 scope to overview-Act-1 steps 1–7 (identity → cloud → chain
master register). Everything past step 7 is in an explicit TODO list.

# What's here

docs/plan/web-flow/README.md
  Index + how to read.

docs/plan/web-flow/overview.md
  End-to-end narrative · 4-screen Phase 1 state machine sketch ·
  Phase 1 endpoint inventory (12 new + 3 shipped) · TODO list for
  deferred work.

docs/plan/web-flow/stage1-first-run.md
  Harness v2-stage1 steps 6–11 → 4 UI screens A–D.
  Includes "Part B" on screen C: master vault + memory listings
  (per user feedback — the operator's own slice of the cloud is
  visible immediately after provisioning succeeds, separate from
  any agent inbox).
  Screens E, F (first agent, done) explicitly deferred to Phase 2.

docs/plan/web-flow/stage2-second-master.md
  Harness v2-stage2 → 6 screens G–L (pair, companion enroll, confirm,
  quorum, recovery drill, done). Entire stage marked deferred (Phase 3).

docs/plan/web-flow/stage3-agent-usage.md
  Agent bootstrap paths · live ops dashboard · on-demand isolation
  health check (16-step v2-stage3 against operator's real cloud).
  Entire stage marked deferred.

docs/plan/web-flow/input-discipline.md
  Real / Derived / Auto-generated triage. §1 resolves the
  operator-login-email vs agent-inbox-sub-address distinction
  explicitly (operator types sara@example.com; agent inbox is
  derived agent-folotoy@bots.litentry.org, system-derived, never
  operator-typed; email-service worker per arch.md §15.4 routes
  the agent's mail without touching the operator's inbox).

docs/plan/web-flow/data-model.md
  Daemon HTTP contract. Every endpoint tagged shipped / Phase 1 /
  deferred. Phase 1 surface is exactly 12 new endpoints + 3 shipped;
  everything else is called out as deferred to Phase 2 or Phase 3.

docs/plan/web-flow/deferred-and-followups.md
  What stays shell-only · operator-power-user escape hatches ·
  6 open questions for review (Q3 cross-browser passkey is the only
  one that blocks Phase 1) · 7-phase implementation sequencing
  (~9 days estimated).

docs/plan/README.md
  Adds an "Active plans" section pointing at agentkeys-memory-design
  and web-flow/.

# Phase 1 endpoint inventory (the only new endpoints to build)

  GET   /v1/onboarding/state         — umbrella state machine
  POST  /v1/auth/email/start         — broker-proxy: email magic link
  POST  /v1/auth/email/verify        — broker-proxy: magic-token verify
  GET   /v1/auth/email/status        — polled by the original tab
  POST  /v1/onboarding/cloud/provision    — dispatches 6 existing scripts
  GET   /v1/onboarding/cloud/stream  (SSE)  — per-script progress
  POST  /v1/onboarding/cloud/smoke   — envelope round-trip
  GET   /v1/master/credentials       — metadata listing (no plaintext)
  GET   /v1/master/memory            — metadata listing (no plaintext)
  POST  /v1/onboarding/chain/deploy  — 4 contracts: deploy or detect
  POST  /v1/onboarding/chain/register-master  — register_master_device
  POST  /v1/k11/assert/begin         — uniform K11 mutation pattern
  POST  /v1/k11/assert/finish

Three shipped endpoints (PR-B) used by Phase 1 without changes:
  GET   /healthz
  POST  /v1/k11/enroll/{begin,finish}

# Ready for review

Verify:
- stage docs match the harness scripts (spot-check any step against
  harness/v2-stage1-demo.sh's `# ─── Step N` headers).
- the email distinction in input-discipline.md §1 is correct
  (arch.md §15.4 email worker routes the agent's mail).
- the data-model.md daemon contract doesn't require rewriting any
  PR-C endpoint; it only adds net-new endpoints and tags existing ones.

Adds a dev script that runs agentkeys-daemon --ui-bridge + Next.js dev server in one terminal
with color-prefixed multiplexed logs. Replaces the manual two-terminal
setup ("start the daemon in tab A, npm run dev in tab B, env-var the
backend kind by hand") with one command.

# What it does

apps/parent-control/scripts/dev.sh
  - Bash 3.2 compatible (macOS default /bin/bash).
  - Kills stale processes on UI_PORT (3113) + DAEMON_PORT (3114).
  - Auto-rebuilds agentkeys-daemon iff any .rs source is newer than the
    debug binary (cargo build -p agentkeys-daemon).
  - Starts the daemon in --ui-bridge mode, streams its stdout/stderr
    through a magenta [daemon] prefix.
  - Waits up to 5s for GET /healthz before launching the UI; fails
    fast with a clear error if the daemon dies during startup.
  - Pre-sets NEXT_PUBLIC_AGENTKEYS_BACKEND=daemon +
    NEXT_PUBLIC_AGENTKEYS_DAEMON_URL=http://127.0.0.1:3114 for the
    Next.js child so the UI talks to the real daemon out of the box.
  - Starts npx next dev, streams its output through a cyan [ui] prefix.
  - Polls both PIDs; when either exits, sends SIGTERM to the other.
    Ctrl-C cleanly tears down both via a single trap.
  - All script-side status lines wear a bold-yellow [dev] prefix.

apps/parent-control/package.json
  Adds `npm run dev:stack` → `bash scripts/dev.sh`. The plain
  `npm run dev` remains the UI-only EmptyBackend path.

apps/parent-control/README.md
  New "dev:stack" subsection documenting the color scheme, what the
  script does, and the env overrides (UI_PORT, DAEMON_PORT, etc.).

# Verified

$ UI_PORT=3115 DAEMON_PORT=3116 bash apps/parent-control/scripts/dev.sh
  [dev]    starting daemon on http://127.0.0.1:3116 (rp_id=localhost)
  [daemon] ui-bridge serving bind=127.0.0.1:3116 origin=http://localhost:3115
  [dev]    daemon ready.
  [dev]    starting Next.js dev server on http://localhost:3115
  [ui]     ▲ Next.js 14.2.34
  [ui]      ✓ Ready in 1498ms
  [ui]     GET / 200 in 1118ms                 ← HTML title: "agentKeys · parent control"
  curl http://127.0.0.1:3116/healthz → {"ok":true,"surface":"ui-bridge"}

Ctrl-C tears both down. Re-run on stale port detects + kills the
previous run's leftovers automatically.

# Compatibility note

Initial draft used `wait -n` which requires bash 4.3+; macOS ships
bash 3.2 by default. Replaced with a `kill -0` polling loop so the
script runs on `/bin/bash` everywhere.

dev.sh was previously at apps/parent-control/scripts/dev.sh; it is now
at the repo root so the entry point is one path away on a fresh clone.

Invocation:
  bash dev.sh                       # from the repo root
  ./dev.sh                          # from the repo root, same
  cd apps/parent-control && npm run dev:stack   # via npm wrapper

apps/parent-control/package.json
  dev:stack now calls `bash ../../dev.sh`.

apps/parent-control/README.md
  Updated to reference the new root location with all three invocation
  forms documented.

dev.sh (moved + path fixes)
  REPO_ROOT now resolves from the script's own dirname (which is the
  repo root). APP_DIR = "$REPO_ROOT/apps/parent-control". Added a
  preflight check that fails fast with a clear error if the script is
  copied somewhere that isn't the agentkeys repo root.

Verified:
  $ UI_PORT=3115 DAEMON_PORT=3116 bash dev.sh
    [dev]    starting daemon on http://127.0.0.1:3116
    [daemon] ui-bridge serving bind=127.0.0.1:3116
    [dev]    daemon ready.
    [ui]      ✓ Ready in 1526ms
  curl http://127.0.0.1:3116/healthz → {"ok":true,"surface":"ui-bridge"}
  curl http://localhost:3115/        → 200, <title>agentKeys · parent control</title>

Two changes addressing direct operator feedback.

# 1. Harden free_port — fix "Address already in use" after kill

Before: free_port sent a single SIGTERM and slept 0.4 s, which was short
enough to race the kernel's port release, especially when sockets from a
previous dev.sh run were still in TIME_WAIT or its children hadn't
actually shut down.
A second dev.sh would crash on:

  [daemon] Error: ui-bridge: bind TCP 127.0.0.1:3114
  [daemon] Caused by: Address already in use (os error 48)

After: free_port now does graceful → forceful → verify:
  1. Send SIGTERM.
  2. Poll up to 3 s waiting for the pid to exit (`kill -0`).
  3. If still alive, send SIGKILL.
  4. Re-check the port with lsof; abort the script with a clear error
     if it's still occupied (so the operator can investigate manually
     rather than hit the same cryptic bind error).

# 2. Bring up agentkeys-mcp-server as part of the stack

A third process joins the dev stack, with its own line-prefix color:

  [daemon]  magenta   agentkeys-daemon --ui-bridge       (port 3114)
  [mcp]     green     agentkeys-mcp-server               (port 8088)
  [ui]      cyan      npx next dev                       (port 3113)
  [dev]     yellow    this script's own status lines

Defaults: `--backend in-memory` (zero external dependencies — the MCP
server auto-seeds the three-act demo fixtures per
crates/agentkeys-mcp-server/README.md). `--listen 127.0.0.1:8088`.
Overridable via MCP_PORT + MCP_BACKEND.

The Next.js child also receives NEXT_PUBLIC_AGENTKEYS_MCP_URL so the
UI can call the MCP server once the stage-3 §1 agent-bootstrap flow
lands (Phase 2 — see docs/plan/web-flow/stage3-agent-usage.md).

# Refactored

- build_daemon_if_needed → generic build_if_needed taking a binary
  path + cargo package name + watched crate dirs. Reused for both
  the daemon and the mcp-server.
- cleanup() now iterates over all three pids.
- the "wait for either to exit" poll loop now watches all three.

# Verified

$ UI_PORT=3115 DAEMON_PORT=3116 MCP_PORT=8089 bash dev.sh
  [dev]    building agentkeys-mcp-server (debug)...
  [dev]    starting daemon on http://127.0.0.1:3116
  [daemon] ui-bridge serving bind=127.0.0.1:3116
  [dev]    daemon ready.
  [dev]    starting mcp-server on http://127.0.0.1:8089 (backend=in-memory)
  [mcp]    agentkeys-mcp-server listening (HTTP)
  [dev]    mcp-server ready.
  [dev]    starting Next.js dev server on http://localhost:3115
  [ui]     Ready in 1.5 s

  curl http://127.0.0.1:3116/healthz → {"ok":true,"surface":"ui-bridge"}
  curl http://127.0.0.1:8089/        → 404 (server listening; bare / unmapped)
  curl http://localhost:3115/        → 200

Operator hit:

  [dev] port :3114 held by pid 57667
  90638 — sending SIGTERM
  [dev] port :3114 is still occupied after SIGKILL — investigate manually
  [dev]   lsof -i tcp:3114

The first two lines reveal the bug: `lsof -ti tcp:3114` prints one pid
per line, and here two processes held the port (for example a listener
plus a child sharing the socket), so it printed two lines. The previous
code captured the multiline string into one variable and then did:

  kill "$pid"   # $pid == "57667\n90638"

which passes both pids as a single malformed argument. `kill` fails, the
`|| true` suppresses the error, and nothing dies. The verification
re-checks lsof, sees the pids still there, and aborts the script.

Fix: free_port now iterates over each pid individually for both SIGTERM
and SIGKILL. Added a second cleanup pass — if any new pid grabbed the
port between the kill and the check (rare but possible during daemon
restarts), the second pass kills it too. Only after the second pass
fails does free_port abort.

Verified:

$ ./target/debug/agentkeys-daemon --ui-bridge ... &  # plant squatter
$ lsof -ti tcp:3114
57667
90638                                                # two pids — reproduces bug
$ UI_PORT=3115 DAEMON_PORT=3114 bash dev.sh
  [dev] port :3114 held by pid 57667 — sending SIGTERM (pass 1)
  [dev] port :3114 held by pid 90638 — sending SIGTERM (pass 1)
  [daemon] ui-bridge serving bind=127.0.0.1:3114
  [dev] all three processes running. Ctrl-C to stop.
$ curl http://127.0.0.1:3114/healthz
{"ok":true,"surface":"ui-bridge"}                    # new daemon, not the squatter

The script is now idempotent against any number of stale processes
holding the dev ports (3113, 3114, 8088 — or whatever the operator
overrides via env). Re-running after a hard kill / lost terminal
cleans up the prior run and starts fresh.

Operator hit:
  ^C
  [dev] shutting down…           ← trap fired, but then script hung
  (no further output, no prompt, ports still bound)

Plus a cosmetic glitch:
  [dev] \033[2magentkeys-daemon binary is current — skipping build

# Root causes + fixes

## 1. Literal "\033[2m" leaked into the build-skip line

The printf format used %s for $C_DIM. %s prints the literal string;
%b is what interprets backslash escapes. Single-quoted bash strings
don't process \033, so $C_DIM stays as the 7-char literal until %b
unfolds it. Fixed by reordering the format specifiers so $C_DIM is
consumed by %b.

## 2. Ctrl-C hung the script — process substitution kept fds open

Previous attempt used `> >(prefix ...)` so $! would resolve to the
real binary pid. That fixed pid tracking but introduced a subtler bug:
process substitution opens an fd in the parent shell pointed at the
reader's stdin. Even after the daemon binary exits, the script still
holds that fd open, so the prefix reader never sees EOF, and `wait`
blocks forever in cleanup.

Fix: switch to named FIFOs in a per-run temp dir
($TMPDIR/agentkeys-dev-stack-$$/). For each process:

  prefix "$C_X" "x" < "$FIFO_X" &       # reader, blocks on FIFO read
  PREFIX_X_PID=$!
  disown "$PREFIX_X_PID"
  "$X_BIN" ... > "$FIFO_X" 2>&1 &       # writer; $! = real binary pid
  X_PID=$!
  disown "$X_PID"

The script itself never opens the FIFO, so kill -> binary exit ->
writer fd closes -> reader sees EOF -> reader exits. Clean.

## 3. "Terminated: 15" job-control noise

When `wait` reaped the SIGTERM'd children, bash printed termination
notices ("dev.sh: line 198: 34855 Terminated: 15  $DAEMON_BIN ...").
These appeared between [dev] shutting down… and [dev] stopped.,
muddying the operator's view of what happened.

Fix: `disown` each backgrounded pid right after capture. Bash drops
the job from its job table, so SIGCHLD reaping is silent. Replaced
the cleanup's `wait` with a polling loop (`kill -0` in a tight loop)
since `wait` doesn't accept disowned pids.

## 4. False "one of the children exited" warning after clean shutdown

After cleanup() returned, control fell back to the polling-loop's
post-condition where `warn` printed about an unexpected child exit
— misleading after an operator-initiated shutdown.

Fix: `exit 0` at the end of cleanup() so the script terminates
immediately without re-entering the polling loop.

## 5. set +m

Added `set +m` at the top to disable job-control monitor mode. With
disown this is belt-and-braces, but it removes the last possible
source of "[N]+ Done" / "[N]+ Terminated" announcements.

# Verified

$ bash dev.sh   # then SIGTERM the script pid 1 s later
  ...
  [dev] all three processes running. Ctrl-C to stop.
  [ui]  ✓ Starting...
  ^TERM
  [dev] shutting down…
  [dev] stopped.

  'Terminated' hits in log:            0
  'dev.sh: line' hits in log:          0
  'one of the children' hits in log:   0
  pids on :3113, :3114, :8088:         empty after shutdown
  total shutdown duration:             1 second

Development

Successfully merging this pull request may close these issues.

Phase 1: Parent-control web UI (mobile-responsive) for v0 demo
