Skip to content

fix(backend): route fastify listen/ready/close errors through logger instead of unhandled rejection#17401

Open
calebcgates wants to merge 1 commit into
misskey-dev:developfrom
calebcgates:fix/server-listen-and-shutdown-error-handling
Open

fix(backend): route fastify listen/ready/close errors through logger instead of unhandled rejection#17401
calebcgates wants to merge 1 commit into
misskey-dev:developfrom
calebcgates:fix/server-listen-and-shutdown-error-handling

Conversation

@calebcgates
Copy link
Copy Markdown

What

Closes #17399.

packages/backend/src/server/ServerService.ts has three error-handling regressions around fastify.listen(), fastify.ready(), and fastify.close() that cause startup and shutdown failures to be reported as uncaught Node exceptions instead of through the misskey logger the code clearly intended to use.

The four changes:

  1. await fastify.listen() inside a try/catch instead of fire-and-forget on the TCP path and callback-with-ignored-err on the socket path.
  2. Run fs.chmodSync on the socket only after listen() resolves (today the chmod runs even when listen failed, throwing ENOENT and masking the original error).
  3. Wrap await fastify.ready() in the same try/catch so plugin-registration timeouts (ERR_AVVIO_PLUGIN_TIMEOUT) are reported through this.logger.error instead of NestJS's bootstrap stack trace.
  4. Cap fastify.close() at 5s in dispose() so leaked WebSocket upgrades from streamingApiServerService.attach(fastify.server) don't hang OnApplicationShutdown until PM2/systemd SIGKILLs the process.

The intent of the existing fastify.server.on('error', ...) block is preserved — moved into a handleListenError() helper invoked from the new try/catch. The reason this matters: fastify v5's listen() rejects its returned Promise when the underlying http.Server hits EADDRINUSE/EACCES, before (or in a race with) the http.Server emitting 'error'. On Node >= 15 with default --unhandled-rejections=throw, the listen-promise rejection terminates the process with a stack trace before server.on('error') can run. The handler is reachable today only by accident of timing.

Why

Concrete reproductions:

Repro 1 — TCP port in use:

# Terminal 1
nc -l 3000

# Terminal 2 (Misskey configured with `port: 3000`)
pnpm start
# Today: UnhandledPromiseRejectionWarning + stack trace, no friendly message
# After this PR: "Port 3000 is already in use by another process." + clean exit(1)

Repro 2 — Unix socket without permission:

# default.yml
socket: /var/run/misskey.sock
chmodSocket: '777'

Run as a non-root user. fastify.listen({ path }) rejects with EACCES, the callback's err is ignored, then fs.chmodSync('/var/run/misskey.sock', ...) throws ENOENT: no such file or directory because no socket was ever created. User sees the chmod error, not the actual permission error. After this PR: "You do not have permission to listen on /var/run/misskey.sock." + clean exit(1).

Repro 3 — OnApplicationShutdown hangs:

A WebSocket client opens a connection via streamingApiServerService.attach(fastify.server), then disconnects abruptly without a clean close frame. PM2 / systemd / k8s sends SIGTERM to misskey. dispose() calls await this.#fastify.close() which waits indefinitely for the upgraded socket to be reaped. PM2's default kill_timeout then SIGKILLs the process; logs show "process killed" with no diagnostic. After this PR: 5-second cap, then OnApplicationShutdown resolves cleanly so PM2 records a graceful exit.

These are the only four real bugs in this area. The other fastify call sites in the backend (route handlers throwing in ClientServerService / ApiServerService / WellKnownServerService, register() calls deferred to await fastify.ready(), addHook() callbacks) are correctly handled by the existing setErrorHandler (ClientServerService.ts:924) plus the now-properly-try-wrapped await fastify.ready() — they're not part of this PR's scope.

Additional info (optional)

Behavior preservation

Scenario Old New
Healthy TCP startup works works
Healthy socket startup works works (chmod still runs after listen)
EADDRINUSE unhandled rejection trace logger.error + exit(1)
EACCES unhandled rejection trace logger.error + exit(1)
Cluster-worker listen failure process.send('listenFailed') only when server.on('error') wins the race process.send('listenFailed') always
Plugin timeout in ready() NestJS bootstrap trace logger.error + exit(1)
Graceful shutdown w/ idle conns works works
Graceful shutdown w/ leaked WS hangs until SIGKILL resolves at 5s

On tests

packages/backend/test/unit/ has no ServerService spec today. Adding one requires mocking the NestJS DI graph for ~18 injected services and stubbing 11 sub-fastify-services. I scoped this PR to the bug-fix and the truth table above; happy to add a vitest unit test for handleListenError in a follow-up if useful — please flag if you want it bundled here instead.

Local verification

  • pnpm install --frozen-lockfile — no lockfile churn
  • pnpm --filter backend typecheck (tsgo) — 0 errors
  • pnpm --filter backend exec eslint src/server/ServerService.ts — 0 errors; warning count net-reduced from 9 to 5 (removed (err as any) cast, two unused-param warnings, and the this.config.socket! non-null assertion)

Checklist

  • Read the contribution guide
  • Test working in a local environment
  • Update CHANGELOG.md
  • (If possible) Add tests — see "On tests" above; offered as follow-up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

packages/backend Server side specific issue/PR size:M This PR changes 30-99 lines, ignoring generated files.

Projects

Development

Successfully merging this pull request may close these issues.

ServerService: startup and shutdown failures surface as UnhandledPromiseRejection / SIGKILL instead of via logger

1 participant