fix(atlasd,launcher): dedicated liveness listener so MCP fan-out can't trip readiness restarts by ljagiello · Pull Request #403 · friday-platform/friday-studio

ljagiello · 2026-05-19T23:10:07Z

Summary

The launcher's readiness probe was hitting /health on the same Deno.serve that handles agent, MCP, and NATS traffic. When MCP fan-out saturates the V8 microtask queue (one bad stdio server hanging on initialize, concurrent stale uvx subprocesses, JetStream consumer pulls backing up), the trivial /health handler couldn't get scheduled inside the launcher's 2 s probe deadline. 30 consecutive failures × 2 s = SIGTERM → forced restart, even though the daemon was alive enough to acknowledge the signal and run shutdown handlers.

This change binds a second Deno.serve on port + 1 with a static () => new Response("ok") handler — no Hono, no middleware, no shared-state access. The two listeners share the V8 isolate (not full isolation), but Deno's Tokio-side accept loop runs them as independent tasks and the static-Response fast path wins the microtask race that a Hono-routed handler would lose. The main /health route stays for atlas-cli + UI; only the launcher's probe moves.

Changes

apps/atlasd/src/atlas-daemon.ts — AtlasDaemonOptions.healthPort. When set, startHealthListener() binds a second Deno.serve with () => new Response("ok"). Drained in phase 1 alongside the main server (1 s ceiling, own abort controller). Bind failures are non-fatal — main /health still works.
apps/atlas-cli/src/commands/daemon/start.tsx — --health-port flag, default <port>+1, 0 opts out. Plumbed through buildDaemonArgs() so re-exec carries it.
tools/friday-launcher/project.go — friday spec probes 8081/ (default) or <override>+1/ when FRIDAY_PORT_FRIDAY is set. Launcher passes both --port and --health-port to the daemon CLI so they stay in lockstep.

Background

Original incident: friday@192.168.64.17 2026-05-19 14:15:45 PDT. User connected Notion in chat, two MCP servers were auto-wired (com-notion-mcp HTTP — worked, and com-mcparmory-notion stdio — hung on FastMCP initialize because the package required OAUTH2_CLIENT_ID/OAUTH2_CLIENT_SECRET that nothing provided). The 20 s MCP timeout cascaded into a 5 s NATS JetStream consumer timeout in workspace-chat, then a 50 s log silence (event-loop starvation), then SIGTERM from launcher. Full forensic trace is in conversation history.

Test plan

Verified locally with the daemon running on --port 38080 --health-port 38081:

Not verified — VM deploy blocked on Gatekeeper:

Replay original Notion-connect cascade with fix in place

Production binaries are Developer-ID signed; locally compiled binaries are adhoc-signed and get SIGKILL'd by macOS Gatekeeper on the target VM. Attempting codesign --force --sign - on the 1 GB binary OOM'd the VM (~65 MB free RAM). To validate the in-prod cascade is actually prevented, we need a properly-signed build on the VM. The design is locally sound — the gap is reproduction infrastructure, not a code defect.

Follow-ups (not in this PR)

If this fix alone doesn't fully prevent the cascade, the remaining lever is killing leaked uvx subprocesses on MCP stdio timeout (separate change in packages/mcp-client/transports/stdio.ts). The dedicated liveness listener buys readiness even when the subprocess leak is happening; killing the subprocesses removes the leak itself.

…t can't trip readiness restarts The launcher's readiness probe was hitting `/health` on the same listener that serves agent / MCP / NATS traffic. When MCP fan-out saturates the V8 microtask queue (one bad stdio server stalling on initialize, three concurrent stale subprocesses, JetStream consumer pulls backing up), the trivial /health handler couldn't get scheduled inside the launcher's 2 s probe deadline. 30 consecutive failures × 2 s → SIGTERM → forced restart, even though the daemon was alive enough to acknowledge the signal and run shutdown handlers. Atlas daemon now binds a second `Deno.serve` on `port+1` with a static `() => new Response("ok")` handler — no Hono, no middleware, no shared-state access. Two listeners share the V8 isolate, but Deno's Tokio-side accept loop runs them as independent tasks and the static- Response fast path wins the microtask race that a Hono-routed handler would lose. The main `/health` route is preserved for atlas-cli and the UI; only the launcher's probe moves. Launcher's friday spec now probes `:8081/` (default) or `<override>+1/` when `FRIDAY_PORT_FRIDAY` is set, and passes `--health-port` to the daemon CLI so the two stay in lockstep. `--health-port 0` opts out (single-listener mode). ## Locally verified - Both listeners bind HTTPS with the same s2s cert - Liveness returns 200 "ok" to any path - Main `/health` route still serves the existing JSON shape - Sub-3 ms liveness latency baseline AND under main-port load - Clean SIGTERM shutdown — both ports drain, no warnings - `--health-port` matrix: omitted (default port+1), 0 (disabled), equal-to-port (no-op), occupied (logs bind-failed, main continues) - Launcher unit tests + new TestFridayLivenessListener_DefaultPort pass ## Not verified VM deploy attempted but blocked on Gatekeeper: production binaries are Developer-ID signed, locally compiled binaries are adhoc-signed and get SIGKILL'd. Re-signing on the VM exhausted its memory. The exact failure cascade from 2026-05-19 14:15 PDT (Notion-connect → workspace-chat NatsError TIMEOUT → 60 s probe failure → SIGTERM) hasn't been replayed with the fix in place; the design is locally sound but the in-prod reproduction is pending a properly-signed build. Context: docs/reviews are in the conversation; the original incident is captured by global.log lines 11500–12200 on friday@192.168.64.17 (now rotated). See ScheduleWakeup-style follow-up: if the fix doesn't fully resolve the cascade, the remaining lever is killing leaked uvx subprocesses on MCP stdio timeout (separate change in packages/mcp-client/transports/stdio.ts).

Nine findings from docs/reviews/2026-05-19-fix-daemon-liveness-listener.md: **Doc-comment (#3)** — rewrite startHealthListener's JSDoc. Drop the "static-Response wins the microtask race" claim (both listeners share one V8 isolate, one event loop, one microtask queue — under genuine starvation neither dispatches). Replace with the actually-true wins: bypasses Hono routing + request-logger middleware, independent TCP accept queue, isolated from per-route regressions. **Bind-failure path (#4)** — sync try/catch around Deno.serve was dead code; bind errors (EADDRINUSE) surface asynchronously via the `finished` promise. Attach `.finished.catch` so failures log and null out `healthServer` so the drain step skips a dead handle. **Dead onError (#6)** — `() => new Response("ok")` cannot throw. Remove the onError handler entirely. **Speculative `--health-port 0` (#5)** — drop the `> 0` opt-out branch in the CLI. Disabling is now via `--health-port == --port`, which the daemon's existing equal-port guard short-circuits. One source of truth. **Symmetric launcher wiring (#7)** — the no-override branch was passing `--health-port 8081` "in case the daemon's default drifts," but the launcher already hardcodes healthPort:"8081" in the spec. Three coupled defaults reduced to two: the launcher's spec and the daemon's own deriveHealthPort agree by construction. The override branch still passes both flags explicitly because there's no default to fall back on. **Atoi error path (#8)** — the override branch silently swallowed `strconv.Atoi` errors. Non-numeric FRIDAY_PORT_FRIDAY="abc" used to set healthPort to "abc" and the readiness probe would target https://127.0.0.1:abc/ forever. Now: validate at spec-build, log loudly, keep the default. Also bounded to 1..65535; 65536 wrap on `port+1` is handled by skipping --health-port and letting the daemon's equal-port guard run. **Tests (#1, #2, #8)** — three new files of automated coverage: - `apps/atlas-cli/src/commands/daemon/start.test.ts` (vitest, 15 tests): deriveHealthPort + buildDaemonArgs branch matrix. Tests the CLI → AtlasDaemonOptions transform end-to-end, including the round-trip pin (CLI must pass the same number it derives). - `apps/atlasd/src/health-listener.test.ts` (Deno test, 3 tests): saturate a mock main listener with 64 in-flight 20-second handlers, time the liveness probe. Asserts <500ms response under 1500ms budget. This is the synthetic replay of the production cascade — proves the accept-queue isolation works, not just that the wiring compiles. Actual run: liveness completes in 110ms while main has 64 stuck. - New Go test `TestSupervisedProcesses_BadPortOverride_IsIgnored`: four sub-cases (non-numeric, whitespace-only, 70000, 0) confirm the override is rejected at spec-build and the friday default holds. Updated `TestFridayLivenessListener_DefaultPort` to assert the no-override path now does NOT pass --health-port (it used to; daemon's deriveHealthPort default covers it). **Doc gap (#9)** — added a "Liveness listener" section to `apps/atlasd/CLAUDE.md` documenting the dual-listener architecture, the actually-true wins (not the microtask race myth), and the disable path. Future maintainers reach for CLAUDE.md before grep.

…dary Two critical findings from the round-2 review: 1. The round-2 swap of sync try/catch for `.finished.catch` was wrong. `Deno.serve` throws SYNCHRONOUSLY on EADDRINUSE (verified against Deno 2.7.4): try { Deno.serve({port: <busy>}, ...); } catch (e) { /* fires */ } // → AddrInUse: Address already in use (os error 48) // server.finished — never rejects with the bind error The v1 review's claim that "bind errors surface asynchronously" was incorrect, and round 2 rewrote working code based on it. On any real port collision the new daemon would crash with an uncaught exception instead of degrading gracefully. Restored: synchronous try/catch is the primary handler. The `.finished.catch` stays as defense-in-depth for runtime listener errors that surface after a successful bind (rare, but observed in past Deno releases when a TLS handshake fails the listener itself). Both paths null out healthServer so the drain step skips a dead handle. 2. port=65535 chained into the same crash: - Launcher skipped passing --health-port "to let the daemon's default skip too" - But atlas-cli's deriveHealthPort({port: 65535}) returns 65536 - 65536 ≠ 65535, so the equal-port guard didn't fire - Deno.serve({port: 65536}) throws sync → uncaught (fix #1 above would catch it, but the daemon would still degrade unnecessarily) Fix: at port=65535 the launcher now passes --health-port equal to --port (65535). The daemon's equal-port guard short-circuits without binding, healthPort stays at main so the probe lands on the bound socket, no second listener attempts. Tests added/changed: - TestSupervisedProcesses_PortOverride_65535 pins the 65535 chain - start.test.ts comment on deriveHealthPort(65535)→65536 rewritten to reference the launcher short-circuit and the sync-catch fallback Verified empirically: - Sync try/catch fires on EADDRINUSE (reproducer in commit notes) - healthServer is null after bind failure - Main daemon continues serving

…l probe loop Three findings from the round-3 review: 1. Cap FRIDAY_PORT_FRIDAY at 65500 (not 65535). The 65535 chain was broken anyway: launcher passed --health-port equal to --port to short-circuit the daemon's equal-port guard, but then the launcher's probe landed on https://127.0.0.1:65535/ — and the main daemon's Hono app has no route at "/", only at "/health". Probe 404 → 30 failures → SIGTERM. Exactly the loop this PR exists to fix. Rather than threading healthPath updates through the boundary branch, cap the input range so the case can never arise. 65500 leaves 35 ports of headroom — more than enough for the launcher's +1 math without any 16-bit-boundary special case in either the daemon or the launcher. Tests updated: - Renamed TestSupervisedProcesses_PortOverride_65535 → TestSupervisedProcesses_PortOverride_UpperBound with two subcases (65500 accepted, 65501 rejected). - Added "just-above-cap": "65501" to the bad-override sweep. 2. Equal-port guard extracted as a testable helper. `shouldBindHealthListener(port, healthPort)` moved to its own file (`health-listener-policy.ts`) so it can be unit-tested without triggering atlas-daemon.ts's transitive Deno imports (which break the vitest harness — pre-existing issue documented in CLAUDE.md). 10 vitest cases cover the meaningful branches: undefined, zero/negative, equal-port (deliberate disable), typical default, override case, port=undefined edge. 3. Deleted the sustained-probe loop in the starvation test. Five back-to-back probes hit a warm liveness handler with no state change between them — the first probe proves everything probes 2-6 would. v2 review flagged as borderline, v3 said delete. Done. CLAUDE.md updated: documents the 65500 cap and the policy-helper extraction. Pre-existing "broken daemon vitest harness" note already covered why the policy lives in its own file.

Drop the `as number` cast in startHealthListener by promoting shouldBindHealthListener's return type from `boolean` to `healthPort is number`. The guard narrows `this.options.healthPort` through to `number` at the call site; no cast needed. Runtime behavior unchanged — all 10 policy tests pass without modification. Removes the only `as` cast added by this PR.

apps/atlasd/src/health-listener.test.ts is a Deno test (uses `Deno.serve` and `jsr:@std/assert@1` imports). It's meant to run via `deno test --allow-net`, not vitest. CI's `deno task test` invokes vitest, whose default `.test.ts` glob picks the file up and chokes: Error: Cannot find package 'jsr:@std/assert@1' imported from apps/atlasd/src/health-listener.test.ts Same pattern as the pre-existing exclude for `tools/agent-playground/.../mcp-credentials-panel.test.ts` — vitest can't load it for environment-specific reasons. Comment in the exclude block names the failure mode for the next maintainer. The (unrelated) pre-existing flake in `packages/jetstream/src/connect.test.ts` ("throws when URL file is stale and spawnFallback=false" — expects /spawnFallback=false/ regex but receives 'CONNECTION_REFUSED') is present on `main` and not caused by this PR.

The vitest for buildDaemonArgs / deriveHealthPort imported them from start.tsx, which transitively pulls in `@atlas/client/v2`, credential helpers, and other heavy modules. CI runs tests in parallel and the load order can interact poorly with NATS-using tests in the suite (`packages/jetstream/src/connect.test.ts` started failing after we added vitest files that import start.tsx, even though the jetstream file itself is byte-identical to main). Move `buildDaemonArgs`, `deriveHealthPort`, and `StartArgs` to a new `start-args.ts` — pure functions, no daemon-runtime imports. start.tsx re-exports them so the rest of the CLI keeps its previously-stable surface. The vitest now imports from start-args.ts directly. Same pattern as the existing `health-listener-policy.ts` split that sidestepped the broken daemon vitest harness.

knip flagged `jsr:@std/assert@1` as an "unlisted dependency" because the JSR URL-specifier import isn't recognized as a package. Same story as the vitest exclude — `health-listener.test.ts` is a Deno test, not part of the JS/TS dependency graph. Added to `apps/atlasd.ignore` (already contained `src/tool-worker-launcher.ts` for a similar non-JS reason).

ljagiello requested a review from Vpr99 as a code owner May 19, 2026 23:10

ljagiello added 7 commits May 19, 2026 16:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(atlasd,launcher): dedicated liveness listener so MCP fan-out can't trip readiness restarts#403

fix(atlasd,launcher): dedicated liveness listener so MCP fan-out can't trip readiness restarts#403
ljagiello wants to merge 8 commits into
mainfrom
fix-daemon-liveness-listener

ljagiello commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ljagiello commented May 19, 2026

Summary

Changes

Background

Test plan

Follow-ups (not in this PR)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant