Pull-wake: stream reconnect path does not fire when agents-server becomes unreachable (Ctrl-C and kill -9 both verified)

## Summary

When the agents-server becomes unreachable (verified for both `Ctrl-C` graceful shutdown **and** `kill -9` hard kill), the pull-wake runner's **stream reconnect path never fires**. The runner reports `stream_connected: true`, `reconnect_count: 0`, `client_status: streaming` and `stream_connected_since` unchanged for the entire duration of the outage. The state machine never transitions to `running.reconnecting`, exponential backoff never runs, and `reconnect_count` is never incremented.

Heartbeats (which are separate, short-lived HTTP POSTs) correctly observe the outage and fail with `ECONNREFUSED` every 30s — so the runner *is* aware on one path that the server is unreachable. But the stream reader sees no error and concludes the stream is fine.

Surfaced while testing #4339 (pull-wake runner lifecycle hardening). The PR adds the state machine, exponential backoff (1s → 30s cap), `reconnect_count` tracking, and `running.reconnecting` state — but **none of these engage under any tested failure mode**.

> **Updated**: the original framing was specific to graceful shutdown. A follow-up test with `kill -9` (where the kernel, not the process, tears down the socket and may send an RST rather than a FIN) produced **the same broken behavior**. The bug is not about how the TCP connection is torn down; the stream reader does not surface any server failure as a stream error, regardless of how the server dies.

## Reproductions

### Case A — graceful shutdown (`Ctrl-C`)

Setup:
- Local agents-server running via `pnpm --filter @electric-ax/agents-server dev` (tsx watch on `src/entrypoint.ts`)
- Desktop app connected, pull-wake runner registered and healthy

Steps:
1. Confirm baseline via the new health endpoint:
   ```bash
   curl -sS -H "electric-principal: system:dev-local" \
     "http://127.0.0.1:4437/_electric/runners/<RUNNER_ID>/health" \
     | jq '{rc:.client.reconnect_count, status:.client.status, stream:.client.stream_connected, since:.client.stream_connected_since}'
   ```
   Expected: `rc:0, status:"streaming", stream:true, since:"<some prior time>"`
2. **Stop the server gracefully** — `Ctrl-C` the `pnpm dev` process (or `kill -SIGINT <pid>`).
3. Watch the desktop log at `~/Library/Application Support/Electric Agents/logs/builtin-agents-*.jsonl`. Heartbeat failures appear every 30s:
   ```
   [builtin-agents] pull-wake runner failed  TypeError: fetch failed: connect ECONNREFUSED 127.0.0.1:4437
   ```
4. Leave the server down for several minutes (verified for 17 minutes in the original repro).
5. Restart the server.
6. Re-query the health endpoint after the next heartbeat succeeds.

### Case B — hard kill (`kill -9`)

Same setup and procedure, but with `kill -9 <pid>` instead of `Ctrl-C`. The kernel should tear down the connection immediately (FIN, or RST if unread data is pending on the socket). The observed behavior is identical:

```
kill at 10:32:45.3Z
first log entry at 10:33:01.7Z  (the regularly-scheduled 30s heartbeat — NOT a reconnect attempt)
+30s   next heartbeat failure
+30s   next heartbeat failure
+30s   next heartbeat failure
```

All entries arrive on the 30s heartbeat schedule. There are no additional log entries at exponential-backoff intervals (1s, 2s, 4s, 8s, 16s), which is what we'd expect to see if the stream-reconnect catch branch were firing.
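
For reference, here is a minimal sketch of when reconnect attempts would show up in the log if backoff were running. It assumes plain doubling from 1s with a 30s cap and no jitter; `backoffDelayMs` and `attemptOffsetsMs` are hypothetical names, and the actual implementation in #4339 may differ.

```typescript
// Hypothetical helper: delay before reconnect attempt number `attempt` (0-based).
// Assumes plain doubling with a cap; the real code in #4339 may add jitter.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt)
}

// Offsets (ms after the stream failure) at which each attempt would be logged.
function attemptOffsetsMs(attempts: number): number[] {
  const offsets: number[] = []
  let elapsed = 0
  for (let i = 0; i < attempts; i++) {
    elapsed += backoffDelayMs(i)
    offsets.push(elapsed)
  }
  return offsets
}

console.log(attemptOffsetsMs(6)) // [1000, 3000, 7000, 15000, 31000, 61000]
```

Under these assumptions, four attempts would land within the first 15 seconds after the kill, well before the first scheduled heartbeat failure. The Case B log shows none.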

After recovery, `reconnect_count` is still `0` and `stream_connected_since` is unchanged from before the kill. The runner thinks nothing happened.

## Observed payload (post-recovery, after both Case A and Case B)

```jsonc
{
  "client": {
    "status": "streaming",
    "reconnect_count": 0,
    "stream_connected": true,
    "stream_connected_since": "2026-05-18T08:11:00.913Z",   // unchanged — was the value pre-outage
    "last_heartbeat_at": "2026-05-18T10:35:01.684Z",
    "last_heartbeat_ok": true
  },
  "health": { "status": "healthy", "issues": [] }
}
```

`stream_connected_since` survived BOTH outages (Ctrl-C followed by hard kill) without changing. The runner has held the same conceptual stream object since before either outage.

## Root cause analysis

The stream consumption loop in `packages/agents-runtime/src/pull-wake-runner.ts` opens a long-lived HTTP fetch and reads chunks until it ends. The reconnect path shown here is only reached once `consumeWakeStream` returns:

```ts
await consumeWakeStream(signal, runGeneration)
if (!signal.aborted) {
  state = `running.reconnecting`
  // ... backoff/sleep ...
}
```

When the agents-server dies (regardless of shutdown mechanism):
- The OS *should* send a FIN or RST on the established TCP connection in both cases
- Node's fetch streaming reader **does not surface this as a read error** when the stream is idle (no in-flight wakes) — the reader just sits on the dead socket
- The connection effectively enters a half-open state from the application's perspective
- Wakes published by the server during downtime (if any) cannot be received

Plausible cause: the stream uses long-polling / Server-Sent-Events-style chunked transfer over fetch. When the underlying socket is closed by the peer but no data is in flight, the async iterator over `response.body` may not yield or error promptly. It depends on:
- Whether Node observes the FIN/RST before the next read attempt is made
- Whether the iterator emits an end-of-stream that the code then handles in the success branch (no `reportError`, no `reconnectCount++`)

So `consumeWakeStream` either doesn't return at all (most likely) or returns "cleanly" without an error. Either way:
- The state machine never transitions to `running.reconnecting`
- `reconnect_count` is never incremented
- `reportError` is never called from the stream path
- `stream_connected` and `stream_connected_since` are never updated

Heartbeats run on a *separate* timer and use individual fetch calls — those fail with `ECONNREFUSED` immediately because each one initiates a new TCP connection. That's why we see clean 30s-cadence heartbeat failures in the log but no stream-level activity. Heartbeat failures call `reportError` (via `onError` in `agents/src/server.ts:324`) and set `last_heartbeat_ok: false`, but they do **not** invalidate the stream connection state on the client side.

## Expected behavior

The runner should recognize that its stream is dead — at minimum, this should be visible in the diagnostics. Ideal end state:

1. State machine transitions `running.streaming → running.reconnecting` when the stream stops delivering OR when N consecutive heartbeats fail (whichever comes first).
2. `reconnect_count` increments each time a reconnect is attempted.
3. `stream_connected_since` is reset on successful reconnect (the PR's "reset on success" claim depends on this firing).
4. Exponential backoff path is exercised (and observable via `reconnect_count` growing faster than the heartbeat cadence).
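
The transitions in points 1 to 3 can be sketched as a small reducer over the diagnostics fields. This is an illustration only: the event names and the `applyStreamEvent` function are hypothetical, and only the field names and the two state names are taken from this issue.

```typescript
type RunnerState = "running.streaming" | "running.reconnecting"

interface ClientDiagnostics {
  state: RunnerState
  reconnect_count: number
  stream_connected: boolean
  stream_connected_since: string | null
}

// Hypothetical events; the real runner may model these differently.
type StreamEvent =
  | { type: "stream_lost" } // stream ended, errored, or was aborted by a liveness check
  | { type: "reconnect_attempt" }
  | { type: "stream_opened"; at: string } // ISO timestamp of the successful reconnect

function applyStreamEvent(d: ClientDiagnostics, e: StreamEvent): ClientDiagnostics {
  switch (e.type) {
    case "stream_lost":
      // Point 1: leave running.streaming as soon as the stream is known dead.
      return { ...d, state: "running.reconnecting", stream_connected: false }
    case "reconnect_attempt":
      // Point 2: every attempt is counted, successful or not.
      return { ...d, reconnect_count: d.reconnect_count + 1 }
    case "stream_opened":
      // Point 3: stream_connected_since is reset only on success.
      return {
        ...d,
        state: "running.streaming",
        stream_connected: true,
        stream_connected_since: e.at,
      }
  }
}
```

With something like this in place, the post-recovery payload above would show a non-zero `reconnect_count` and a `stream_connected_since` later than the outage.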

## Suggested investigation / fixes

Pick one or combine; (a) is the cheapest:

### (a) Tie stream liveness to heartbeat failures (cheap)

After K consecutive heartbeat failures (e.g. K=2), abort the current stream connection and re-enter `consumeWakeStream`. This bridges the two signal paths and uses the already-working heartbeat detection to drive the (already-implemented) reconnect machinery. This is now the strongest recommendation given that **the bug is universal, not specific to graceful shutdown** — a stream-side detector won't help if the stream never sees the failure.
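
A rough sketch of the wiring, assuming the heartbeat loop can report each result to a small guard object. `HeartbeatStreamGuard` and its `abortStream` callback are hypothetical names, not existing code in `pull-wake-runner.ts`.

```typescript
// Hypothetical helper: aborts the current stream after K consecutive heartbeat failures.
class HeartbeatStreamGuard {
  private consecutiveFailures = 0
  private readonly threshold: number
  private readonly abortStream: () => void

  constructor(threshold: number, abortStream: () => void) {
    this.threshold = threshold // K in the text, e.g. 2
    this.abortStream = abortStream // e.g. () => perConnectionController.abort()
  }

  // Call once after every heartbeat POST settles.
  record(ok: boolean): void {
    if (ok) {
      this.consecutiveFailures = 0
      return
    }
    this.consecutiveFailures += 1
    // Fire exactly once per failure streak; the reconnect loop owns retries after that.
    if (this.consecutiveFailures === this.threshold) {
      this.abortStream()
    }
  }
}
```

One caveat: the excerpt in the root cause section only enters `running.reconnecting` when `!signal.aborted`, so the abort used here would need to be a per-connection signal, separate from the runner's shutdown `signal`. Otherwise the abort would be treated as a shutdown and skip the reconnect branch.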

### (b) Application-level keepalive on the stream

Have the server periodically push a no-op `ping` event over the wake stream (e.g. every 30s). On the client, treat absence-of-pings (>2× ping interval) as a stream failure and reconnect.
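
On the client side this amounts to a silence watchdog. A minimal sketch follows, with the clock passed in explicitly so it is easy to test; `PingWatchdog` is a hypothetical name, and the 2× threshold follows the suggestion above.

```typescript
// Hypothetical helper: flags the stream as stale after more than 2x the ping interval of silence.
class PingWatchdog {
  private lastSeenMs: number
  private readonly maxSilenceMs: number

  constructor(pingIntervalMs: number, nowMs: number) {
    this.maxSilenceMs = 2 * pingIntervalMs
    this.lastSeenMs = nowMs
  }

  // Call for every chunk read from the stream, pings and real wakes alike.
  touch(nowMs: number): void {
    this.lastSeenMs = nowMs
  }

  // Poll from a timer; true means the stream has been silent for too long.
  isStale(nowMs: number): boolean {
    return nowMs - this.lastSeenMs > this.maxSilenceMs
  }
}
```

In the runner, `touch(Date.now())` would be called inside the read loop and `isStale(Date.now())` polled from a timer; a stale result would abort the current connection so that `consumeWakeStream` returns and the existing reconnect path takes over.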

### (c) TCP-level keepalive

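Set `SO_KEEPALIVE` on the underlying socket with aggressive timing (idle=10s, interval=5s, count=3). Node's built-in `fetch` doesn't expose socket options directly; since it is built on `undici` rather than the `http`/`http2` agents, this would require a custom `undici` dispatcher. Also note that Node's `socket.setKeepAlive()` only takes the idle delay, so the interval and count may not be tunable from JavaScript.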

### (d) Write-side heartbeat on the stream connection itself

If the stream is bidirectional or supports a client→server keepalive, send periodic empty writes. Failure to write surfaces a socket error → triggers reconnect.

Given the Case B result, (a) is now clearly the highest-leverage fix: heartbeat failures are the **only** observable signal the runner has when the server is gone, so heartbeat-driven stream reset is the only path that doesn't depend on the stream layer somehow noticing what it currently doesn't notice.

## Severity

Reliability — the PR advertises a feature (graceful reconnect with backoff) that does not engage under **any tested failure mode** (server `Ctrl-C` *or* `kill -9` — verified). The runner *eventually* recovers because heartbeats start working again — but during the outage there is no visibility, no proactive recovery attempt, and wakes published during the outage cannot be received until the stream actually gets reset by some other mechanism.

The PR's `running.reconnecting` state, `reconnect_count` counter, and exponential backoff implementation appear to be unreachable code in practice. This is not a corner case — it's the default behavior on every observed server-down scenario.

## Related

- #4339 — PR that introduced the state machine and reconnect logic
- #4340 — claim not released after dispatch
- #4341 — `lease_expires_at: null` on materialized claims
- #4342 — Local Runtime UI shows stale state during outage (compounded by this bug, since the UI's stream indicator reads `client.stream_connected: true` which is exactly the wrong value)

