Description
Problem
All 8 production Electric Cloud instances produce recurring `Req.TransportError: socket closed` errors from `CallHomeReporter.report_home/2`: ~900 errors/day, ~10,200 in the last 7 days. Sentry issue ELECTRIC-5CH has been firing continuously since 2025-12-22 (18,641 total occurrences).
Root cause
`report_home/2` uses `Req.post!`, which defaults to HTTP/2 via ALPN. Finch's HTTP/2 pool holds a single long-lived connection per `{scheme, host, port}` and keeps it open after the request completes. Since `CallHomeReporter` only fires every 30 minutes, the pooled connection sits idle until the proxy/CDN in front of checkpoint.electric-sql.com closes it.
On the next report, Finch discovers the dead connection only when it tries to write to the socket, which surfaces as `Req.TransportError: socket closed`. Retries (1s, 2s, 4s) go through the same pool, which may still be reconnecting, so all attempts fail. `Req.post!` raises, crashing the Task and producing the Sentry alert.
Each stack has its own `CallHomeReporter` and its own connection pool; there is no connection sharing between stacks. Connection pooling provides zero benefit for a reporter that fires every 30 minutes.
Secondary issue: stats are always cleared regardless of HTTP success
`report_home/2` starts a `Task.async` and immediately returns `:ok`, so the GenServer clears its accumulated stats before the async HTTP request completes. If the request fails, the telemetry stats for that 30-minute period are silently lost.
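A simplified sketch of the current flow described above (illustrative only; the handler shape, state fields, and `:report` message name are assumptions, not the exact source):

```elixir
# Hypothetical sketch of today's behavior in CallHomeReporter:
def handle_info(:report, state) do
  report_home(state.endpoint, state.stats)
  # Stats are cleared here, before the async HTTP request has completed,
  # so a failed request silently drops this 30-minute window of telemetry.
  {:noreply, %{state | stats: %{}}}
end

defp report_home(endpoint, stats) do
  Task.async(fn ->
    # Req.post! raises Req.TransportError when the pooled HTTP/2 socket
    # has been closed by the proxy/CDN, crashing the Task and firing Sentry.
    Req.post!(endpoint, json: stats)
  end)

  :ok
end
```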
Proposed fix
In `packages/electric-telemetry/lib/electric/telemetry/call_home_reporter.ex`:
- **Force HTTP/1.1 and close the connection after each request.** Replace `Req.post!` with `Req.post`, passing `connect_options: [protocols: [:http1]]` and `headers: [{"connection", "close"}]`. The `Connection: close` header causes Mint to mark the connection as non-reusable, so Finch discards it after the response; no stale connection ever sits in the pool.
- **Make the call synchronous.** Remove `Task.async` and call `Req.post` directly in `report_home/2`. The GenServer does nothing between 30-minute cycles, so blocking is fine, and Finch's HTTP/1 NimblePool operates synchronously, so there are no async pool messages to worry about.
- **Return success/failure.** `report_home/2` returns `:ok` or `{:error, reason}`. The GenServer clears stats only on success; failures are logged as warnings.
- **Remove the catch-all `handle_info` clauses.** The three clauses that swallowed `Task.async` messages are no longer needed.
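The four changes above could be sketched roughly as follows (a hypothetical shape, not the exact patch; helper names and the surrounding GenServer plumbing are assumptions):

```elixir
# Proposed synchronous reporter, sketched:
defp report_home(endpoint, stats) do
  case Req.post(endpoint,
         json: stats,
         # Force HTTP/1.1 so Finch uses its synchronous per-request pool
         # instead of a single long-lived HTTP/2 connection...
         connect_options: [protocols: [:http1]],
         # ...and mark the connection non-reusable, so Finch discards it
         # after the response rather than caching a socket that goes stale.
         headers: [{"connection", "close"}]
       ) do
    {:ok, %Req.Response{status: status}} when status in 200..299 ->
      :ok

    {:ok, %Req.Response{status: status}} ->
      {:error, {:unexpected_status, status}}

    {:error, reason} ->
      {:error, reason}
  end
end

# In the GenServer, stats are cleared only when the report succeeds:
#
#   case report_home(endpoint, stats) do
#     :ok ->
#       {:noreply, %{state | stats: %{}}}
#     {:error, reason} ->
#       Logger.warning("call-home report failed: #{inspect(reason)}")
#       {:noreply, state}
#   end
```

Because `Req.post` returns a tagged tuple instead of raising, a transport error on a dead socket becomes a logged warning rather than a crashed Task, and the unsent stats stay in state for the next cycle.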
Impact
- Eliminates ~900 error logs and Sentry alerts per day
- Telemetry data is no longer silently lost on HTTP failures
- No functional impact — only affects the telemetry call-home path