Pull-wake: stream reconnect path does not fire when agents-server becomes unreachable (Ctrl-C and kill -9 both verified) #4343

@kevin-dp

Summary

When the agents-server becomes unreachable (verified for both Ctrl-C graceful shutdown and kill -9 hard kill), the pull-wake runner's stream reconnect path never fires. The runner reports stream_connected: true, reconnect_count: 0, client_status: streaming and stream_connected_since unchanged for the entire duration of the outage. The state machine never transitions to running.reconnecting, exponential backoff never runs, and reconnect_count is never incremented.

Heartbeats (which are separate, short-lived HTTP POSTs) correctly observe the outage and fail with ECONNREFUSED every 30s — so the runner is aware on one path that the server is unreachable. But the stream reader sees no error and concludes the stream is fine.

Surfaced while testing #4339 (pull-wake runner lifecycle hardening). The PR adds the state machine, exponential backoff (1s → 30s cap), reconnect_count tracking, and running.reconnecting state — but none of these engage under any tested failure mode.

Updated: original framing was specific to graceful shutdown. A follow-up test with kill -9 (which generates an RST instead of FIN) produced the same broken behavior — the bug is not about TCP shutdown signal type, it is about the stream reader not surfacing any server failure as a stream error, regardless of how the server dies.

Reproductions

Case A — graceful shutdown (Ctrl-C)

Setup:

  • Local agents-server running via pnpm --filter @electric-ax/agents-server dev (tsx watch on src/entrypoint.ts)
  • Desktop app connected, pull-wake runner registered and healthy

Steps:

  1. Confirm baseline via the new health endpoint:
    curl -sS -H "electric-principal: system:dev-local" \
      "http://127.0.0.1:4437/_electric/runners/<RUNNER_ID>/health" \
      | jq '{rc:.client.reconnect_count, status:.client.status, stream:.client.stream_connected, since:.client.stream_connected_since}'
    Expected: rc:0, status:"streaming", stream:true, since:"<some prior time>"
  2. Stop the server gracefully: Ctrl-C the pnpm dev process (or kill -SIGINT <pid>).
  3. Watch the desktop log at ~/Library/Application Support/Electric Agents/logs/builtin-agents-*.jsonl. Heartbeat failures appear every 30s:
    [builtin-agents] pull-wake runner failed  TypeError: fetch failed: connect ECONNREFUSED 127.0.0.1:4437
    
  4. Leave the server down for several minutes (verified for 17 minutes in original repro).
  5. Restart the server.
  6. Re-query the health endpoint after the next heartbeat succeeds.

Case B — hard kill (kill -9)

Same setup and procedure as Case A, but with kill -9 <pid> instead of Ctrl-C, so an RST should be sent immediately. The observed behavior is identical:

kill at 10:32:45.3Z
first log entry at 10:33:01.7Z  (the regularly-scheduled 30s heartbeat — NOT a reconnect attempt)
+30s   next heartbeat failure
+30s   next heartbeat failure
+30s   next heartbeat failure

All entries arrive on the 30s heartbeat schedule. No additional log entries at exponential-backoff intervals (1s, 2s, 4s, 8s, 16s) — which is what we'd expect to see if the stream-reconnect catch branch was firing.
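
For reference, the cadence a firing reconnect loop would leave in the log can be sketched as follows. This is a minimal illustration, assuming plain doubling from 1s with a 30s cap as described for #4339. The name `backoffDelayMs` and the absence of jitter are assumptions, not the PR's actual code.

```typescript
// Hypothetical helper: delay before the Nth reconnect attempt (0-based).
// Assumes plain doubling from 1s with a 30s cap and no jitter.
const BASE_DELAY_MS = 1000
const MAX_DELAY_MS = 30000

function backoffDelayMs(attempt: number): number {
  if (!Number.isInteger(attempt) || attempt < 0) {
    throw new RangeError(`attempt must be a non-negative integer, got ${attempt}`)
  }
  // 2 ** attempt grows without bound, so clamp the result to the cap.
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS)
}

// Delays for the first six attempts: 1s, 2s, 4s, 8s, 16s, then the 30s cap.
const firstDelays: number[] = [0, 1, 2, 3, 4, 5].map(backoffDelayMs)
```

Log entries spaced like `firstDelays` would indicate the reconnect branch is running. Entries spaced at a flat 30s, as in the timeline above, indicate only the heartbeat timer is.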

After recovery, reconnect_count is still 0 and stream_connected_since is unchanged from before the kill. The runner thinks nothing happened.

Observed payload (post-recovery, after both Case A and Case B)

{
  "client": {
    "status": "streaming",
    "reconnect_count": 0,
    "stream_connected": true,
    "stream_connected_since": "2026-05-18T08:11:00.913Z",   // unchanged — was the value pre-outage
    "last_heartbeat_at": "2026-05-18T10:35:01.684Z",
    "last_heartbeat_ok": true
  },
  "health": { "status": "healthy", "issues": [] }
}

stream_connected_since survived BOTH outages (Ctrl-C followed by hard kill) without changing. The runner has held the same conceptual stream object since before either outage.
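
The comparison being made here can be written down as a small check. This is a sketch only: the `ClientHealth` interface mirrors a subset of the `client` fields in the payload above, while `streamSurvivedOutage` and its `heartbeatFailuresObserved` argument are hypothetical names introduced for illustration.

```typescript
// Subset of the `client` object in the health payload shown above.
interface ClientHealth {
  status: string
  reconnect_count: number
  stream_connected: boolean
  stream_connected_since: string
}

// Hypothetical check: true when the runner claims the same stream
// survived an outage that the heartbeat path demonstrably observed.
function streamSurvivedOutage(
  before: ClientHealth,
  after: ClientHealth,
  heartbeatFailuresObserved: number
): boolean {
  return (
    heartbeatFailuresObserved > 0 &&
    after.stream_connected &&
    after.reconnect_count === before.reconnect_count &&
    after.stream_connected_since === before.stream_connected_since
  )
}

// Values taken from the observed payload; the pre-outage snapshot is
// assumed to be identical, which is what the report describes.
const preOutage: ClientHealth = {
  status: "streaming",
  reconnect_count: 0,
  stream_connected: true,
  stream_connected_since: "2026-05-18T08:11:00.913Z",
}
const postRecovery: ClientHealth = { ...preOutage }

const suspicious = streamSurvivedOutage(preOutage, postRecovery, 3)
```

With the observed values, `suspicious` evaluates to true: heartbeats failed, yet nothing on the stream side changed.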

Root cause analysis

The stream consumption loop in packages/agents-runtime/src/pull-wake-runner.ts opens a long-lived HTTP fetch and reads chunks until it ends:

await consumeWakeStream(signal, runGeneration)
if (!signal.aborted) {
  state = `running.reconnecting`
  // ... backoff/sleep ...
}

When the agents-server dies (regardless of shutdown mechanism):

  • The OS should send a FIN (Ctrl-C) or RST (kill -9) on the established TCP connection
  • Node's fetch streaming reader does not surface this as a read error when the stream is idle (no in-flight wakes) — the reader just sits on the dead socket
  • The connection effectively enters a half-open state from the application's perspective
  • Wakes published by the server during downtime (if any) cannot be received

Plausible cause: the stream uses long-polling / Server-Sent-Events-style chunked transfer over fetch. When the underlying socket is closed by the peer but no data is in flight, the AsyncIterator returned by response.body may not yield or error promptly. It depends on:

  • Whether Node observes the FIN/RST before the next read attempt is made
  • Whether the AsyncIterator emits an end-of-stream that the code then handles in the success branch (no reportError, no reconnectCount++)

So consumeWakeStream either doesn't return at all (most likely) or returns "cleanly" without an error. Either way:

  • The state machine never transitions to running.reconnecting
  • reconnect_count is never incremented
  • reportError is never called from the stream path
  • stream_connected and stream_connected_since are never updated
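
The difference between the outcomes can be made explicit with a loop shaped like the one below. This is a sketch, not the code in pull-wake-runner.ts: `drain` and `DrainOutcome` are hypothetical names, and a generic `AsyncIterable<Uint8Array>` stands in for `response.body`.

```typescript
// Outcome of draining a chunk stream. "ended" is the branch that skips
// error reporting; "errored" is the only one a catch block would see.
type DrainOutcome =
  | { kind: "ended"; bytes: number }
  | { kind: "errored"; bytes: number; error: unknown }

async function drain(chunks: AsyncIterable<Uint8Array>): Promise<DrainOutcome> {
  let bytes = 0
  try {
    for await (const chunk of chunks) {
      bytes += chunk.byteLength
    }
    // Reached on a clean end-of-stream: no exception was thrown, so any
    // reconnect logic that lives only in a catch block is bypassed.
    return { kind: "ended", bytes }
  } catch (error) {
    // Reached only if the iterator itself rejects (for example on a reset
    // that the runtime surfaces as a read error).
    return { kind: "errored", bytes, error }
  }
  // A source that neither yields, ends, nor throws keeps the returned
  // promise pending forever, matching the "doesn't return at all" case.
}
```

Treating `ended` and `errored` alike as "stream lost" would cover the clean-return case. The never-settles case still needs an external signal such as a timeout or heartbeat failure.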

Heartbeats run on a separate timer and use individual fetch calls — those fail with ECONNREFUSED immediately because each one initiates a new TCP connection. That's why we see clean 30s-cadence heartbeat failures in the log but no stream-level activity. Heartbeat failures call reportError (via onError in agents/src/server.ts:324) and set last_heartbeat_ok: false, but they do not invalidate the stream connection state on the client side.

Expected behavior

The runner should recognize that its stream is dead — at minimum, this should be visible in the diagnostics. Ideal end state:

  1. State machine transitions running.streaming → running.reconnecting when the stream stops delivering OR when N consecutive heartbeats fail (whichever comes first).
  2. reconnect_count increments each time a reconnect is attempted.
  3. stream_connected_since is reset on successful reconnect (the PR's "reset on success" claim depends on this firing).
  4. Exponential backoff path is exercised (and observable via reconnect_count growing faster than the heartbeat cadence).
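
The transitions listed above can be summarized as a small reducer. This is an illustrative sketch under stated assumptions: the state names come from this report, but the event names, the `RunnerSnapshot` shape, and the `step` function are hypothetical and do not describe the state machine in #4339.

```typescript
type RunnerState = "running.streaming" | "running.reconnecting"

interface RunnerSnapshot {
  state: RunnerState
  reconnectCount: number
  streamConnectedSince: string
}

// Hypothetical events. "stream_lost" covers both triggers in item 1:
// the stream stopping, or N consecutive heartbeat failures.
type RunnerEvent =
  | { type: "stream_lost" }
  | { type: "reconnect_attempted" }
  | { type: "reconnect_succeeded"; at: string }

function step(s: RunnerSnapshot, e: RunnerEvent): RunnerSnapshot {
  switch (e.type) {
    case "stream_lost":
      return { ...s, state: "running.reconnecting" }
    case "reconnect_attempted":
      // Item 2: count every attempt, but only while reconnecting.
      return s.state === "running.reconnecting"
        ? { ...s, reconnectCount: s.reconnectCount + 1 }
        : s
    case "reconnect_succeeded":
      // Item 3: a successful reconnect resets the connected-since timestamp.
      return s.state === "running.reconnecting"
        ? { ...s, state: "running.streaming", streamConnectedSince: e.at }
        : s
  }
}
```

Whether `reconnect_count` should persist or reset after a successful reconnect is left open here; the sketch keeps it so that the counter stays observable after recovery.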

Suggested investigation / fixes

Pick one or combine; (a) is the cheapest:

(a) Tie stream liveness to heartbeat failures (cheap)

After K consecutive heartbeat failures (e.g. K=2), abort the current stream connection and re-enter consumeWakeStream. This bridges the two signal paths and uses the already-working heartbeat detection to drive the (already-implemented) reconnect machinery. This is now the strongest recommendation given that the bug is universal, not specific to graceful shutdown — a stream-side detector won't help if the stream never sees the failure.
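
A minimal sketch of that bridge, assuming the heartbeat path can report each result to a small counter. `HeartbeatStreamWatchdog` and `abortStream` are hypothetical names, not existing runner APIs. Note that the loop shown in the root cause analysis skips the reconnect branch when `signal.aborted` is true, so `abortStream` would presumably need to cancel a per-connection signal rather than the runner's outer one.

```typescript
// Hypothetical bridge between the heartbeat path and the stream path.
// `abortStream` stands in for whatever cancels the in-flight wake stream
// (for example a per-connection AbortController's abort()).
class HeartbeatStreamWatchdog {
  private consecutiveFailures = 0
  private tripped = false
  private readonly threshold: number
  private readonly abortStream: () => void

  constructor(threshold: number, abortStream: () => void) {
    this.threshold = threshold
    this.abortStream = abortStream
  }

  // Call once per heartbeat result.
  recordHeartbeat(ok: boolean): void {
    if (ok) {
      // A success ends the outage and re-arms the watchdog.
      this.consecutiveFailures = 0
      this.tripped = false
      return
    }
    this.consecutiveFailures++
    // Abort once per outage; the reconnect loop owns retries after that.
    if (!this.tripped && this.consecutiveFailures >= this.threshold) {
      this.tripped = true
      this.abortStream()
    }
  }
}
```

With K=2 and a 30s heartbeat, this would hand control to the reconnect path roughly one minute into an outage.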

(b) Application-level keepalive on the stream

Have the server periodically push a no-op ping event over the wake stream (e.g. every 30s). On the client, treat absence-of-pings (>2× ping interval) as a stream failure and reconnect.
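
On the client, the absence-of-pings rule could look roughly like this. It is a sketch under assumptions: `PingStalenessTracker` is a hypothetical name, no such ping event exists in the wake stream today, and time is passed in explicitly so the logic can be exercised without real timers.

```typescript
// Hypothetical staleness tracker for option (b).
class PingStalenessTracker {
  private lastSeenMs: number
  private readonly maxSilenceMs: number

  constructor(pingIntervalMs: number, nowMs: number) {
    // "More than 2x the ping interval" counts as a dead stream.
    this.maxSilenceMs = 2 * pingIntervalMs
    this.lastSeenMs = nowMs
  }

  // Call for every chunk received, whether a ping or a real wake.
  noteActivity(nowMs: number): void {
    this.lastSeenMs = nowMs
  }

  // True once the stream has been silent for longer than the allowance.
  isStale(nowMs: number): boolean {
    return nowMs - this.lastSeenMs > this.maxSilenceMs
  }
}
```

A runner would poll `isStale(Date.now())` from a timer and abort the connection when it returns true, which then feeds the existing reconnect path.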

(c) TCP-level keepalive

Set SO_KEEPALIVE on the underlying socket with aggressive timing (idle=10s, interval=5s, count=3). Standard fetch in Node doesn't expose this directly. Node's fetch is built on undici and takes a dispatcher rather than an http/http2 agent, so this would likely require a custom undici Dispatcher.

(d) Write-side heartbeat on the stream connection itself

If the stream is bidirectional or supports a client→server keepalive, send periodic empty writes. Failure to write surfaces a socket error → triggers reconnect.

Given the Case B result, (a) is now clearly the highest-leverage fix: heartbeat failures are the only observable signal the runner has when the server is gone, so heartbeat-driven stream reset is the only path that doesn't depend on the stream layer somehow noticing what it currently doesn't notice.

Severity

Reliability — the PR advertises a feature (graceful reconnect with backoff) that does not engage under any tested failure mode (server Ctrl-C or kill -9 — verified). The runner eventually recovers because heartbeats start working again — but during the outage there is no visibility, no proactive recovery attempt, and wakes published during the outage cannot be received until the stream actually gets reset by some other mechanism.

The PR's running.reconnecting state, reconnect_count counter, and exponential backoff implementation appear to be unreachable code in practice. This is not a corner case — it's the default behavior on every observed server-down scenario.
