Skip to content

fix(atlasd,launcher): dedicated liveness listener so MCP fan-out can't trip readiness restarts#403

Open
ljagiello wants to merge 8 commits into
mainfrom
fix-daemon-liveness-listener
Open

fix(atlasd,launcher): dedicated liveness listener so MCP fan-out can't trip readiness restarts#403
ljagiello wants to merge 8 commits into
mainfrom
fix-daemon-liveness-listener

Conversation

@ljagiello
Copy link
Copy Markdown
Contributor

Summary

The launcher's readiness probe was hitting /health on the same Deno.serve that handles agent, MCP, and NATS traffic. When MCP fan-out saturates the V8 microtask queue (one bad stdio server hanging on initialize, concurrent stale uvx subprocesses, JetStream consumer pulls backing up), the trivial /health handler couldn't get scheduled inside the launcher's 2 s probe deadline. 30 consecutive failures × 2 s = SIGTERM → forced restart, even though the daemon was alive enough to acknowledge the signal and run shutdown handlers.

This change binds a second Deno.serve on port + 1 with a static () => new Response("ok") handler — no Hono, no middleware, no shared-state access. The two listeners share the V8 isolate (not full isolation), but Deno's Tokio-side accept loop runs them as independent tasks and the static-Response fast path wins the microtask race that a Hono-routed handler would lose. The main /health route stays for atlas-cli + UI; only the launcher's probe moves.

Changes

  • apps/atlasd/src/atlas-daemon.tsAtlasDaemonOptions.healthPort. When set, startHealthListener() binds a second Deno.serve with () => new Response("ok"). Drained in phase 1 alongside the main server (1 s ceiling, own abort controller). Bind failures are non-fatal — main /health still works.
  • apps/atlas-cli/src/commands/daemon/start.tsx--health-port flag, default <port>+1, 0 opts out. Plumbed through buildDaemonArgs() so re-exec carries it.
  • tools/friday-launcher/project.go — friday spec probes 8081/ (default) or <override>+1/ when FRIDAY_PORT_FRIDAY is set. Launcher passes both --port and --health-port to the daemon CLI so they stay in lockstep.

Background

Original incident: friday@192.168.64.17 2026-05-19 14:15:45 PDT. User connected Notion in chat, two MCP servers were auto-wired (com-notion-mcp HTTP — worked, and com-mcparmory-notion stdio — hung on FastMCP initialize because the package required OAUTH2_CLIENT_ID/OAUTH2_CLIENT_SECRET that nothing provided). The 20 s MCP timeout cascaded into a 5 s NATS JetStream consumer timeout in workspace-chat, then a 50 s log silence (event-loop starvation), then SIGTERM from launcher. Full forensic trace is in conversation history.

Test plan

Verified locally with the daemon running on --port 38080 --health-port 38081:

  • Both listeners bind HTTPS with same s2s cert
  • Liveness returns 200 ok to any path (/, /health, /anything)
  • Main /health route preserved on main port (existing JSON shape)
  • Sub-3 ms liveness latency under main-port load (50 concurrent probes)
  • Clean SIGTERM shutdown — both ports released, no drain warnings
  • --health-port matrix:
    • omitted → defaults to port + 1
    • 0 → liveness listener disabled
    • equal to --port → silently no-ops
    • occupied → Liveness listener bind failed error:"Address already in use" logged, main continues
  • Launcher unit tests (5 touched + new TestFridayLivenessListener_DefaultPort pin)
  • Cross-compile for aarch64-apple-darwin

Not verified — VM deploy blocked on Gatekeeper:

  • Replay original Notion-connect cascade with fix in place

Production binaries are Developer-ID signed; locally compiled binaries are adhoc-signed and get SIGKILL'd by macOS Gatekeeper on the target VM. Attempting codesign --force --sign - on the 1 GB binary OOM'd the VM (~65 MB free RAM). To validate the in-prod cascade is actually prevented, we need a properly-signed build on the VM. The design is locally sound — the gap is reproduction infrastructure, not a code defect.

Follow-ups (not in this PR)

If this fix alone doesn't fully prevent the cascade, the remaining lever is killing leaked uvx subprocesses on MCP stdio timeout (separate change in packages/mcp-client/transports/stdio.ts). The dedicated liveness listener buys readiness even when the subprocess leak is happening; killing the subprocesses removes the leak itself.

…t can't trip readiness restarts

The launcher's readiness probe was hitting `/health` on the same listener
that serves agent / MCP / NATS traffic. When MCP fan-out saturates the
V8 microtask queue (one bad stdio server stalling on initialize, three
concurrent stale subprocesses, JetStream consumer pulls backing up), the
trivial /health handler couldn't get scheduled inside the launcher's
2 s probe deadline. 30 consecutive failures × 2 s → SIGTERM → forced
restart, even though the daemon was alive enough to acknowledge the
signal and run shutdown handlers.

Atlas daemon now binds a second `Deno.serve` on `port+1` with a static
`() => new Response("ok")` handler — no Hono, no middleware, no
shared-state access. Two listeners share the V8 isolate, but Deno's
Tokio-side accept loop runs them as independent tasks and the static-
Response fast path wins the microtask race that a Hono-routed handler
would lose. The main `/health` route is preserved for atlas-cli and the
UI; only the launcher's probe moves.

Launcher's friday spec now probes `:8081/` (default) or
`<override>+1/` when `FRIDAY_PORT_FRIDAY` is set, and passes
`--health-port` to the daemon CLI so the two stay in lockstep.
`--health-port 0` opts out (single-listener mode).

## Locally verified

- Both listeners bind HTTPS with the same s2s cert
- Liveness returns 200 "ok" to any path
- Main `/health` route still serves the existing JSON shape
- Sub-3 ms liveness latency baseline AND under main-port load
- Clean SIGTERM shutdown — both ports drain, no warnings
- `--health-port` matrix: omitted (default port+1), 0 (disabled),
  equal-to-port (no-op), occupied (logs bind-failed, main continues)
- Launcher unit tests + new TestFridayLivenessListener_DefaultPort pass

## Not verified

VM deploy attempted but blocked on Gatekeeper: production binaries are
Developer-ID signed, locally compiled binaries are adhoc-signed and get
SIGKILL'd. Re-signing on the VM exhausted its memory. The exact failure
cascade from 2026-05-19 14:15 PDT (Notion-connect → workspace-chat
NatsError TIMEOUT → 60 s probe failure → SIGTERM) hasn't been replayed
with the fix in place; the design is locally sound but the in-prod
reproduction is pending a properly-signed build.

Context: docs/reviews are in the conversation; the original incident is
captured by global.log lines 11500–12200 on friday@192.168.64.17 (now
rotated). See ScheduleWakeup-style follow-up: if the fix doesn't fully
resolve the cascade, the remaining lever is killing leaked uvx
subprocesses on MCP stdio timeout (separate change in
packages/mcp-client/transports/stdio.ts).
@ljagiello ljagiello requested a review from Vpr99 as a code owner May 19, 2026 23:10
ljagiello added 7 commits May 19, 2026 16:32
Nine findings from docs/reviews/2026-05-19-fix-daemon-liveness-listener.md:

**Doc-comment (#3)** — rewrite startHealthListener's JSDoc. Drop the
"static-Response wins the microtask race" claim (both listeners share
one V8 isolate, one event loop, one microtask queue — under genuine
starvation neither dispatches). Replace with the actually-true wins:
bypasses Hono routing + request-logger middleware, independent TCP
accept queue, isolated from per-route regressions.

**Bind-failure path (#4)** — sync try/catch around Deno.serve was dead
code; bind errors (EADDRINUSE) surface asynchronously via the
`finished` promise. Attach `.finished.catch` so failures log and null
out `healthServer` so the drain step skips a dead handle.

**Dead onError (#6)** — `() => new Response("ok")` cannot throw.
Remove the onError handler entirely.

**Speculative `--health-port 0` (#5)** — drop the `> 0` opt-out branch
in the CLI. Disabling is now via `--health-port == --port`, which the
daemon's existing equal-port guard short-circuits. One source of truth.

**Symmetric launcher wiring (#7)** — the no-override branch was passing
`--health-port 8081` "in case the daemon's default drifts," but the
launcher already hardcodes healthPort:"8081" in the spec. Three coupled
defaults reduced to two: the launcher's spec and the daemon's own
deriveHealthPort agree by construction. The override branch still
passes both flags explicitly because there's no default to fall back on.

**Atoi error path (#8)** — the override branch silently swallowed
`strconv.Atoi` errors. Non-numeric FRIDAY_PORT_FRIDAY="abc" used to set
healthPort to "abc" and the readiness probe would target
https://127.0.0.1:abc/ forever. Now: validate at spec-build, log loudly,
keep the default. Also bounded to 1..65535; 65536 wrap on `port+1` is
handled by skipping --health-port and letting the daemon's equal-port
guard run.

**Tests (#1, #2, #8)** — three new files of automated coverage:

- `apps/atlas-cli/src/commands/daemon/start.test.ts` (vitest, 15 tests):
  deriveHealthPort + buildDaemonArgs branch matrix. Tests the
  CLI → AtlasDaemonOptions transform end-to-end, including the
  round-trip pin (CLI must pass the same number it derives).

- `apps/atlasd/src/health-listener.test.ts` (Deno test, 3 tests):
  saturate a mock main listener with 64 in-flight 20-second handlers,
  time the liveness probe. Asserts <500ms response under 1500ms budget.
  This is the synthetic replay of the production cascade — proves the
  accept-queue isolation works, not just that the wiring compiles.
  Actual run: liveness completes in 110ms while main has 64 stuck.

- New Go test `TestSupervisedProcesses_BadPortOverride_IsIgnored`:
  four sub-cases (non-numeric, whitespace-only, 70000, 0) confirm the
  override is rejected at spec-build and the friday default holds.
  Updated `TestFridayLivenessListener_DefaultPort` to assert the
  no-override path now does NOT pass --health-port (it used to;
  daemon's deriveHealthPort default covers it).

**Doc gap (#9)** — added a "Liveness listener" section to
`apps/atlasd/CLAUDE.md` documenting the dual-listener architecture,
the actually-true wins (not the microtask race myth), and the disable
path. Future maintainers reach for CLAUDE.md before grep.
…dary

Two critical findings from the round-2 review:

1. The round-2 swap of sync try/catch for `.finished.catch` was wrong.
   `Deno.serve` throws SYNCHRONOUSLY on EADDRINUSE (verified against
   Deno 2.7.4):

     try { Deno.serve({port: <busy>}, ...); } catch (e) { /* fires */ }
     //   → AddrInUse: Address already in use (os error 48)
     // server.finished — never rejects with the bind error

   The v1 review's claim that "bind errors surface asynchronously" was
   incorrect, and round 2 rewrote working code based on it. On any real
   port collision the new daemon would crash with an uncaught exception
   instead of degrading gracefully.

   Restored: synchronous try/catch is the primary handler. The
   `.finished.catch` stays as defense-in-depth for runtime listener
   errors that surface after a successful bind (rare, but observed in
   past Deno releases when a TLS handshake fails the listener itself).
   Both paths null out healthServer so the drain step skips a dead
   handle.

2. port=65535 chained into the same crash:
   - Launcher skipped passing --health-port "to let the daemon's
     default skip too"
   - But atlas-cli's deriveHealthPort({port: 65535}) returns 65536
   - 65536 ≠ 65535, so the equal-port guard didn't fire
   - Deno.serve({port: 65536}) throws sync → uncaught (fix #1 above
     would catch it, but the daemon would still degrade unnecessarily)

   Fix: at port=65535 the launcher now passes --health-port equal to
   --port (65535). The daemon's equal-port guard short-circuits without
   binding, healthPort stays at main so the probe lands on the bound
   socket, no second listener attempts.

Tests added/changed:
- TestSupervisedProcesses_PortOverride_65535 pins the 65535 chain
- start.test.ts comment on deriveHealthPort(65535)→65536 rewritten to
  reference the launcher short-circuit and the sync-catch fallback

Verified empirically:
- Sync try/catch fires on EADDRINUSE (reproducer in commit notes)
- healthServer is null after bind failure
- Main daemon continues serving
…l probe loop

Three findings from the round-3 review:

1. Cap FRIDAY_PORT_FRIDAY at 65500 (not 65535).
   The 65535 chain was broken anyway: launcher passed --health-port
   equal to --port to short-circuit the daemon's equal-port guard, but
   then the launcher's probe landed on https://127.0.0.1:65535/ — and
   the main daemon's Hono app has no route at "/", only at "/health".
   Probe 404 → 30 failures → SIGTERM. Exactly the loop this PR exists
   to fix.

   Rather than threading healthPath updates through the boundary
   branch, cap the input range so the case can never arise. 65500
   leaves 35 ports of headroom — more than enough for the launcher's
   +1 math without any 16-bit-boundary special case in either the
   daemon or the launcher.

   Tests updated:
   - Renamed TestSupervisedProcesses_PortOverride_65535 →
     TestSupervisedProcesses_PortOverride_UpperBound with two subcases
     (65500 accepted, 65501 rejected).
   - Added "just-above-cap": "65501" to the bad-override sweep.

2. Equal-port guard extracted as a testable helper.
   `shouldBindHealthListener(port, healthPort)` moved to its own file
   (`health-listener-policy.ts`) so it can be unit-tested without
   triggering atlas-daemon.ts's transitive Deno imports (which break
   the vitest harness — pre-existing issue documented in CLAUDE.md).

   10 vitest cases cover the meaningful branches: undefined,
   zero/negative, equal-port (deliberate disable), typical default,
   override case, port=undefined edge.

3. Deleted the sustained-probe loop in the starvation test.
   Five back-to-back probes hit a warm liveness handler with no state
   change between them — the first probe proves everything probes 2-6
   would. v2 review flagged as borderline, v3 said delete. Done.

CLAUDE.md updated: documents the 65500 cap and the policy-helper
extraction. Pre-existing "broken daemon vitest harness" note already
covered why the policy lives in its own file.
Drop the `as number` cast in startHealthListener by promoting
shouldBindHealthListener's return type from `boolean` to
`healthPort is number`. The guard narrows `this.options.healthPort`
through to `number` at the call site; no cast needed.

Runtime behavior unchanged — all 10 policy tests pass without
modification. Removes the only `as` cast added by this PR.
apps/atlasd/src/health-listener.test.ts is a Deno test (uses
`Deno.serve` and `jsr:@std/assert@1` imports). It's meant to run via
`deno test --allow-net`, not vitest. CI's `deno task test` invokes
vitest, whose default `.test.ts` glob picks the file up and chokes:

    Error: Cannot find package 'jsr:@std/assert@1' imported from
    apps/atlasd/src/health-listener.test.ts

Same pattern as the pre-existing exclude for
`tools/agent-playground/.../mcp-credentials-panel.test.ts` — vitest
can't load it for environment-specific reasons. Comment in the
exclude block names the failure mode for the next maintainer.

The (unrelated) pre-existing flake in `packages/jetstream/src/connect.test.ts`
("throws when URL file is stale and spawnFallback=false" — expects
/spawnFallback=false/ regex but receives 'CONNECTION_REFUSED') is
present on `main` and not caused by this PR.
The vitest for buildDaemonArgs / deriveHealthPort imported them from
start.tsx, which transitively pulls in `@atlas/client/v2`, credential
helpers, and other heavy modules. CI runs tests in parallel and the
load order can interact poorly with NATS-using tests in the suite
(`packages/jetstream/src/connect.test.ts` started failing after we
added vitest files that import start.tsx, even though the jetstream
file itself is byte-identical to main).

Move `buildDaemonArgs`, `deriveHealthPort`, and `StartArgs` to a new
`start-args.ts` — pure functions, no daemon-runtime imports. start.tsx
re-exports them so the rest of the CLI keeps its previously-stable
surface. The vitest now imports from start-args.ts directly.

Same pattern as the existing `health-listener-policy.ts` split that
sidestepped the broken daemon vitest harness.
knip flagged `jsr:@std/assert@1` as an "unlisted dependency" because
the JSR URL-specifier import isn't recognized as a package. Same
story as the vitest exclude — `health-listener.test.ts` is a Deno
test, not part of the JS/TS dependency graph.

Added to `apps/atlasd.ignore` (already contained
`src/tool-worker-launcher.ts` for a similar non-JS reason).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant