fix(atlasd,launcher): dedicated liveness listener so MCP fan-out can't trip readiness restarts#403
Open
ljagiello wants to merge 8 commits into
Open
fix(atlasd,launcher): dedicated liveness listener so MCP fan-out can't trip readiness restarts#403ljagiello wants to merge 8 commits into
ljagiello wants to merge 8 commits into
Conversation
…t can't trip readiness restarts
The launcher's readiness probe was hitting `/health` on the same listener
that serves agent / MCP / NATS traffic. When MCP fan-out saturates the
V8 microtask queue (one bad stdio server stalling on initialize, three
concurrent stale subprocesses, JetStream consumer pulls backing up), the
trivial /health handler couldn't get scheduled inside the launcher's
2 s probe deadline. 30 consecutive failures × 2 s → SIGTERM → forced
restart, even though the daemon was alive enough to acknowledge the
signal and run shutdown handlers.
Atlas daemon now binds a second `Deno.serve` on `port+1` with a static
`() => new Response("ok")` handler — no Hono, no middleware, no
shared-state access. Two listeners share the V8 isolate, but Deno's
Tokio-side accept loop runs them as independent tasks and the static-
Response fast path wins the microtask race that a Hono-routed handler
would lose. The main `/health` route is preserved for atlas-cli and the
UI; only the launcher's probe moves.
Launcher's friday spec now probes `:8081/` (default) or
`<override>+1/` when `FRIDAY_PORT_FRIDAY` is set, and passes
`--health-port` to the daemon CLI so the two stay in lockstep.
`--health-port 0` opts out (single-listener mode).
## Locally verified
- Both listeners bind HTTPS with the same s2s cert
- Liveness returns 200 "ok" to any path
- Main `/health` route still serves the existing JSON shape
- Sub-3 ms liveness latency baseline AND under main-port load
- Clean SIGTERM shutdown — both ports drain, no warnings
- `--health-port` matrix: omitted (default port+1), 0 (disabled),
equal-to-port (no-op), occupied (logs bind-failed, main continues)
- Launcher unit tests + new TestFridayLivenessListener_DefaultPort pass
## Not verified
VM deploy attempted but blocked on Gatekeeper: production binaries are
Developer-ID signed, locally compiled binaries are adhoc-signed and get
SIGKILL'd. Re-signing on the VM exhausted its memory. The exact failure
cascade from 2026-05-19 14:15 PDT (Notion-connect → workspace-chat
NatsError TIMEOUT → 60 s probe failure → SIGTERM) hasn't been replayed
with the fix in place; the design is locally sound but the in-prod
reproduction is pending a properly-signed build.
Context: docs/reviews are in the conversation; the original incident is
captured by global.log lines 11500–12200 on friday@192.168.64.17 (now
rotated). See ScheduleWakeup-style follow-up: if the fix doesn't fully
resolve the cascade, the remaining lever is killing leaked uvx
subprocesses on MCP stdio timeout (separate change in
packages/mcp-client/transports/stdio.ts).
Nine findings from docs/reviews/2026-05-19-fix-daemon-liveness-listener.md: **Doc-comment (#3)** — rewrite startHealthListener's JSDoc. Drop the "static-Response wins the microtask race" claim (both listeners share one V8 isolate, one event loop, one microtask queue — under genuine starvation neither dispatches). Replace with the actually-true wins: bypasses Hono routing + request-logger middleware, independent TCP accept queue, isolated from per-route regressions. **Bind-failure path (#4)** — sync try/catch around Deno.serve was dead code; bind errors (EADDRINUSE) surface asynchronously via the `finished` promise. Attach `.finished.catch` so failures log and null out `healthServer` so the drain step skips a dead handle. **Dead onError (#6)** — `() => new Response("ok")` cannot throw. Remove the onError handler entirely. **Speculative `--health-port 0` (#5)** — drop the `> 0` opt-out branch in the CLI. Disabling is now via `--health-port == --port`, which the daemon's existing equal-port guard short-circuits. One source of truth. **Symmetric launcher wiring (#7)** — the no-override branch was passing `--health-port 8081` "in case the daemon's default drifts," but the launcher already hardcodes healthPort:"8081" in the spec. Three coupled defaults reduced to two: the launcher's spec and the daemon's own deriveHealthPort agree by construction. The override branch still passes both flags explicitly because there's no default to fall back on. **Atoi error path (#8)** — the override branch silently swallowed `strconv.Atoi` errors. Non-numeric FRIDAY_PORT_FRIDAY="abc" used to set healthPort to "abc" and the readiness probe would target https://127.0.0.1:abc/ forever. Now: validate at spec-build, log loudly, keep the default. Also bounded to 1..65535; 65536 wrap on `port+1` is handled by skipping --health-port and letting the daemon's equal-port guard run. **Tests (#1, #2, #8)** — three new files of automated coverage: - `apps/atlas-cli/src/commands/daemon/start.test.ts` (vitest, 15 tests): deriveHealthPort + buildDaemonArgs branch matrix. Tests the CLI → AtlasDaemonOptions transform end-to-end, including the round-trip pin (CLI must pass the same number it derives). - `apps/atlasd/src/health-listener.test.ts` (Deno test, 3 tests): saturate a mock main listener with 64 in-flight 20-second handlers, time the liveness probe. Asserts <500ms response under 1500ms budget. This is the synthetic replay of the production cascade — proves the accept-queue isolation works, not just that the wiring compiles. Actual run: liveness completes in 110ms while main has 64 stuck. - New Go test `TestSupervisedProcesses_BadPortOverride_IsIgnored`: four sub-cases (non-numeric, whitespace-only, 70000, 0) confirm the override is rejected at spec-build and the friday default holds. Updated `TestFridayLivenessListener_DefaultPort` to assert the no-override path now does NOT pass --health-port (it used to; daemon's deriveHealthPort default covers it). **Doc gap (#9)** — added a "Liveness listener" section to `apps/atlasd/CLAUDE.md` documenting the dual-listener architecture, the actually-true wins (not the microtask race myth), and the disable path. Future maintainers reach for CLAUDE.md before grep.
…dary
Two critical findings from the round-2 review:
1. The round-2 swap of sync try/catch for `.finished.catch` was wrong.
`Deno.serve` throws SYNCHRONOUSLY on EADDRINUSE (verified against
Deno 2.7.4):
try { Deno.serve({port: <busy>}, ...); } catch (e) { /* fires */ }
// → AddrInUse: Address already in use (os error 48)
// server.finished — never rejects with the bind error
The v1 review's claim that "bind errors surface asynchronously" was
incorrect, and round 2 rewrote working code based on it. On any real
port collision the new daemon would crash with an uncaught exception
instead of degrading gracefully.
Restored: synchronous try/catch is the primary handler. The
`.finished.catch` stays as defense-in-depth for runtime listener
errors that surface after a successful bind (rare, but observed in
past Deno releases when a TLS handshake fails the listener itself).
Both paths null out healthServer so the drain step skips a dead
handle.
2. port=65535 chained into the same crash:
- Launcher skipped passing --health-port "to let the daemon's
default skip too"
- But atlas-cli's deriveHealthPort({port: 65535}) returns 65536
- 65536 ≠ 65535, so the equal-port guard didn't fire
- Deno.serve({port: 65536}) throws sync → uncaught (fix #1 above
would catch it, but the daemon would still degrade unnecessarily)
Fix: at port=65535 the launcher now passes --health-port equal to
--port (65535). The daemon's equal-port guard short-circuits without
binding, healthPort stays at main so the probe lands on the bound
socket, no second listener attempts.
Tests added/changed:
- TestSupervisedProcesses_PortOverride_65535 pins the 65535 chain
- start.test.ts comment on deriveHealthPort(65535)→65536 rewritten to
reference the launcher short-circuit and the sync-catch fallback
Verified empirically:
- Sync try/catch fires on EADDRINUSE (reproducer in commit notes)
- healthServer is null after bind failure
- Main daemon continues serving
…l probe loop Three findings from the round-3 review: 1. Cap FRIDAY_PORT_FRIDAY at 65500 (not 65535). The 65535 chain was broken anyway: launcher passed --health-port equal to --port to short-circuit the daemon's equal-port guard, but then the launcher's probe landed on https://127.0.0.1:65535/ — and the main daemon's Hono app has no route at "/", only at "/health". Probe 404 → 30 failures → SIGTERM. Exactly the loop this PR exists to fix. Rather than threading healthPath updates through the boundary branch, cap the input range so the case can never arise. 65500 leaves 35 ports of headroom — more than enough for the launcher's +1 math without any 16-bit-boundary special case in either the daemon or the launcher. Tests updated: - Renamed TestSupervisedProcesses_PortOverride_65535 → TestSupervisedProcesses_PortOverride_UpperBound with two subcases (65500 accepted, 65501 rejected). - Added "just-above-cap": "65501" to the bad-override sweep. 2. Equal-port guard extracted as a testable helper. `shouldBindHealthListener(port, healthPort)` moved to its own file (`health-listener-policy.ts`) so it can be unit-tested without triggering atlas-daemon.ts's transitive Deno imports (which break the vitest harness — pre-existing issue documented in CLAUDE.md). 10 vitest cases cover the meaningful branches: undefined, zero/negative, equal-port (deliberate disable), typical default, override case, port=undefined edge. 3. Deleted the sustained-probe loop in the starvation test. Five back-to-back probes hit a warm liveness handler with no state change between them — the first probe proves everything probes 2-6 would. v2 review flagged as borderline, v3 said delete. Done. CLAUDE.md updated: documents the 65500 cap and the policy-helper extraction. Pre-existing "broken daemon vitest harness" note already covered why the policy lives in its own file.
Drop the `as number` cast in startHealthListener by promoting shouldBindHealthListener's return type from `boolean` to `healthPort is number`. The guard narrows `this.options.healthPort` through to `number` at the call site; no cast needed. Runtime behavior unchanged — all 10 policy tests pass without modification. Removes the only `as` cast added by this PR.
apps/atlasd/src/health-listener.test.ts is a Deno test (uses
`Deno.serve` and `jsr:@std/assert@1` imports). It's meant to run via
`deno test --allow-net`, not vitest. CI's `deno task test` invokes
vitest, whose default `.test.ts` glob picks the file up and chokes:
Error: Cannot find package 'jsr:@std/assert@1' imported from
apps/atlasd/src/health-listener.test.ts
Same pattern as the pre-existing exclude for
`tools/agent-playground/.../mcp-credentials-panel.test.ts` — vitest
can't load it for environment-specific reasons. Comment in the
exclude block names the failure mode for the next maintainer.
The (unrelated) pre-existing flake in `packages/jetstream/src/connect.test.ts`
("throws when URL file is stale and spawnFallback=false" — expects
/spawnFallback=false/ regex but receives 'CONNECTION_REFUSED') is
present on `main` and not caused by this PR.
The vitest for buildDaemonArgs / deriveHealthPort imported them from start.tsx, which transitively pulls in `@atlas/client/v2`, credential helpers, and other heavy modules. CI runs tests in parallel and the load order can interact poorly with NATS-using tests in the suite (`packages/jetstream/src/connect.test.ts` started failing after we added vitest files that import start.tsx, even though the jetstream file itself is byte-identical to main). Move `buildDaemonArgs`, `deriveHealthPort`, and `StartArgs` to a new `start-args.ts` — pure functions, no daemon-runtime imports. start.tsx re-exports them so the rest of the CLI keeps its previously-stable surface. The vitest now imports from start-args.ts directly. Same pattern as the existing `health-listener-policy.ts` split that sidestepped the broken daemon vitest harness.
knip flagged `jsr:@std/assert@1` as an "unlisted dependency" because the JSR URL-specifier import isn't recognized as a package. Same story as the vitest exclude — `health-listener.test.ts` is a Deno test, not part of the JS/TS dependency graph. Added to `apps/atlasd.ignore` (already contained `src/tool-worker-launcher.ts` for a similar non-JS reason).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The launcher's readiness probe was hitting
/healthon the sameDeno.servethat handles agent, MCP, and NATS traffic. When MCP fan-out saturates the V8 microtask queue (one bad stdio server hanging oninitialize, concurrent stale uvx subprocesses, JetStream consumer pulls backing up), the trivial/healthhandler couldn't get scheduled inside the launcher's 2 s probe deadline. 30 consecutive failures × 2 s = SIGTERM → forced restart, even though the daemon was alive enough to acknowledge the signal and run shutdown handlers.This change binds a second
Deno.serveonport + 1with a static() => new Response("ok")handler — no Hono, no middleware, no shared-state access. The two listeners share the V8 isolate (not full isolation), but Deno's Tokio-side accept loop runs them as independent tasks and the static-Response fast path wins the microtask race that a Hono-routed handler would lose. The main/healthroute stays for atlas-cli + UI; only the launcher's probe moves.Changes
apps/atlasd/src/atlas-daemon.ts—AtlasDaemonOptions.healthPort. When set,startHealthListener()binds a secondDeno.servewith() => new Response("ok"). Drained in phase 1 alongside the main server (1 s ceiling, own abort controller). Bind failures are non-fatal — main/healthstill works.apps/atlas-cli/src/commands/daemon/start.tsx—--health-portflag, default<port>+1,0opts out. Plumbed throughbuildDaemonArgs()so re-exec carries it.tools/friday-launcher/project.go— friday spec probes8081/(default) or<override>+1/whenFRIDAY_PORT_FRIDAYis set. Launcher passes both--portand--health-portto the daemon CLI so they stay in lockstep.Background
Original incident:
friday@192.168.64.172026-05-19 14:15:45 PDT. User connected Notion in chat, two MCP servers were auto-wired (com-notion-mcpHTTP — worked, andcom-mcparmory-notionstdio — hung on FastMCPinitializebecause the package requiredOAUTH2_CLIENT_ID/OAUTH2_CLIENT_SECRETthat nothing provided). The 20 s MCP timeout cascaded into a 5 s NATS JetStream consumer timeout in workspace-chat, then a 50 s log silence (event-loop starvation), then SIGTERM from launcher. Full forensic trace is in conversation history.Test plan
Verified locally with the daemon running on
--port 38080 --health-port 38081:200 okto any path (/,/health,/anything)/healthroute preserved on main port (existing JSON shape)--health-portmatrix:port + 10→ liveness listener disabled--port→ silently no-opsLiveness listener bind failed error:"Address already in use"logged, main continuesTestFridayLivenessListener_DefaultPortpin)aarch64-apple-darwinNot verified — VM deploy blocked on Gatekeeper:
Production binaries are Developer-ID signed; locally compiled binaries are adhoc-signed and get SIGKILL'd by macOS Gatekeeper on the target VM. Attempting
codesign --force --sign -on the 1 GB binary OOM'd the VM (~65 MB free RAM). To validate the in-prod cascade is actually prevented, we need a properly-signed build on the VM. The design is locally sound — the gap is reproduction infrastructure, not a code defect.Follow-ups (not in this PR)
If this fix alone doesn't fully prevent the cascade, the remaining lever is killing leaked
uvxsubprocesses on MCP stdio timeout (separate change inpackages/mcp-client/transports/stdio.ts). The dedicated liveness listener buys readiness even when the subprocess leak is happening; killing the subprocesses removes the leak itself.