Local Runtime page shows stale/contradictory state when agents-server is unreachable #4342

@kevin-dp

Description

Summary

When the agents-server becomes unreachable, the desktop app's Local Runtime settings page shows stale and internally contradictory state. The UI does not communicate to the user that the source of truth (the agents-server) is unreachable, and most badges continue to show last-known values as if they were live. As a result:

  • A user can stare at "Runtime: Running" + "Stream: Connected" while the runtime has actually been disconnected for minutes
  • The first visual signal that something is wrong (Health: Unhealthy / Lease expired) takes ~3 minutes to appear (lease lifetime)
  • Even after the Health badge flips, the page shows internally contradictory state — "Unhealthy" and "Connected" at the same time

Surfaced while testing #4339 (pull-wake health diagnostics).

Reproduction

  1. Start the desktop app with a working connection to a local agents-server (localhost:4437).
  2. Open Settings → Local Runtime. All rows should show healthy state.
  3. In another terminal, stop the agents-server process (e.g. press Ctrl-C in the terminal running pnpm --filter @electric-ax/agents-server dev).
  4. Watch the Local Runtime page.

Observed timeline

  • 0s — agents-server killed
  • 0s – ~3min — page continues to show:
    • Runtime: Running (green)
    • Connection: Pull-wake (no status qualifier)
    • Health: Healthy (green)
    • Stream: Connected (green)
    • Last heartbeat: 0–30s ago (the only growing field — but easy to miss)
  • ~3min — Health flips to Unhealthy / Lease expired (red), but every other field keeps its pre-kill value:
    • Runtime: Running ← still green
    • Stream: Connected ← still green
    • Last heartbeat: 3m ago, Lease expires 1m ago ← phrasing is broken (should be past tense once expired)

Screenshots

Two screenshots illustrate the two states:

  1. ~30s after server kill — every row green, no indication anything is wrong (only Last heartbeat is starting to age).
  2. ~3min after server kill — Health: Unhealthy / Lease expired, shown alongside Stream: Connected and Runtime: Running.

Expected behavior

When the agents-server is unreachable, the page should clearly and quickly communicate that the runtime cannot reach its server. Concretely:

| Row | Currently | Should be |
|---|---|---|
| Top of page | (no banner) | Banner: "Cannot reach agents-server at localhost:4437 — data may be stale (last contact Xs ago)" |
| Runtime | Running | Running (cannot reach agents-server) — or split into two rows (process status + server reachability) |
| Connection | Pull-wake | Pull-wake — reconnecting / disconnected |
| Stream | Connected | Disconnected or Reconnecting |
| Last heartbeat | "3m ago, Lease expires 1m ago" | "3m ago, Lease expired 1m ago" (past tense) |
| Health | Unhealthy / Lease expired | ✓ (correct, but fires ~3min late — see below) |

The Health badge transition is correct in logic but the 3-minute lag is too long for an interactive UI. A user actively using the app would notice the contradictory state long before Health flips. The lease lifetime is the right safety net but should not be the primary signal.
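The tense fix for the Last heartbeat row could look roughly like the sketch below. The function names (formatAge, formatLease) and the future-tense wording "Lease expires in …" are assumptions made for illustration; the actual formatter in LocalRuntimePage.tsx may be structured differently.

```typescript
// Hypothetical helper: whole seconds below one minute, whole minutes above.
function formatAge(ms: number): string {
  const seconds = Math.floor(ms / 1000)
  return seconds < 60 ? `${seconds}s` : `${Math.floor(seconds / 60)}m`
}

// Hypothetical helper: choose the tense based on whether the lease
// expiry is still in the future relative to the local clock.
function formatLease(expiresAtMs: number, nowMs: number): string {
  const delta = expiresAtMs - nowMs
  return delta > 0
    ? `Lease expires in ${formatAge(delta)}`
    : `Lease expired ${formatAge(-delta)} ago`
}
```

With an expiry one minute in the past, formatLease returns "Lease expired 1m ago" instead of the present-tense phrasing seen in the screenshot.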

Root cause

packages/agents-server-ui/src/components/settings/pages/LocalRuntimePage.tsx reads runner state exclusively from an Electric Shape synced from the agents-server's runners table:

const runnerData = useLiveQuery((q) =>
  q.from({ runner: runnersCollection })
    .where(({ runner }) => eq(runner.id, runnerId))
)

When the agents-server goes down:

  • The Shape stream cannot deliver fresh data, but it does not signal "I have not received an update in N seconds" to the consumer — the last cached row just sits there.
  • runner.liveness_lease_expires_at is a timestamp, so runnerHealth() can compute "expired" client-side as time advances — this is the ONE field that works correctly without a fresh write.
  • All other fields (diagnostics.stream_connected, last_seen_at, etc.) are server-stamped values that the UI cannot invalidate independently.

The Health-via-lease-expiry path is a deliberate design choice in #4339: the lease is the safety net. But it is effectively the page's only live signal when the server is unreachable, which makes the page misleading for the first ~3 minutes.
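To illustrate why the lease field keeps working without a fresh write, here is a minimal sketch of a client-side expiry check. It is not the actual runnerHealth() implementation; the leaseHealth name, the RunnerLeaseRow type, and the assumption that the timestamp arrives as an ISO-8601 string are all hypothetical.

```typescript
// Hypothetical subset of the runner row; the real row has more fields.
interface RunnerLeaseRow {
  liveness_lease_expires_at: string // assumed: ISO-8601 timestamp from the server
}

type LeaseHealth = "healthy" | "lease_expired"

// Derive health from the cached lease timestamp and the local clock only.
// The result changes as time advances, even if the row is never updated.
function leaseHealth(row: RunnerLeaseRow, nowMs: number): LeaseHealth {
  const expiresMs = Date.parse(row.liveness_lease_expires_at)
  // Treat an unparseable timestamp as expired rather than silently healthy.
  if (Number.isNaN(expiresMs)) return "lease_expired"
  return nowMs >= expiresMs ? "lease_expired" : "healthy"
}
```

Fields such as diagnostics.stream_connected have no comparable timestamp to evaluate against the local clock, which is why they stay frozen at their last synced value.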

Architectural note

The pull-wake runner runs in the same Electron process as the UI. The runner has live in-memory state via PullWakeRunner.getHealth() (packages/agents-runtime/src/pull-wake-runner.ts) that includes accurate status, stream_connected, last_heartbeat_ok, reconnect_count, etc. The Local Runtime page just doesn't read it — it goes through the Shape, which routes through the dead server.

For the local runner specifically, the source of truth should be the in-process runner state, not the server-synced Shape. The Shape design makes sense for managing other runners (e.g. inspecting a cloud worker), but it's the wrong primary source for the runner that lives in this very process.

Suggested fixes

Two complementary directions, in order of increasing scope:

1. Stale-data detection on the Shape (cheap, works generically)

When the Shape stream has not delivered an update in N × the expected interval (e.g. 2× the heartbeat interval = ~60s with default 30s heartbeat), gray out the synced fields and annotate them with "last updated Xs ago". A top-level banner ("Cannot reach agents-server — data may be stale") would tie this together.

This addresses the contradiction but doesn't give correct live state — fields just become visibly unreliable.
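A rough sketch of the staleness check, assuming the consumer records a local timestamp each time the Shape delivers an update (how to hook into that is not shown here). The names checkStaleness, StalenessOptions, and Staleness are hypothetical; the defaults mirror the 30s heartbeat and 2× multiplier mentioned above.

```typescript
interface StalenessOptions {
  heartbeatIntervalMs: number // expected interval between updates
  multiplier: number // N in "N x the expected interval"
}

interface Staleness {
  stale: boolean
  ageMs: number
  annotation: string | null // e.g. "last updated 75s ago" when stale
}

// Compare the time since the last delivered update against the threshold.
function checkStaleness(
  lastUpdateAtMs: number,
  nowMs: number,
  opts: StalenessOptions = { heartbeatIntervalMs: 30_000, multiplier: 2 }
): Staleness {
  const ageMs = Math.max(0, nowMs - lastUpdateAtMs)
  const stale = ageMs >= opts.heartbeatIntervalMs * opts.multiplier
  const annotation = stale
    ? `last updated ${Math.floor(ageMs / 1000)}s ago`
    : null
  return { stale, ageMs, annotation }
}
```

The page could call this on a timer and gray out the synced fields whenever stale is true, which would surface the problem after about 60s rather than at lease expiry.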

2. Local IPC override for the in-process runner (architecturally correct)

For the local runner, surface getHealth() over IPC from the main process to the renderer, and prefer it over the Shape-sourced fields. This gives the UI real-time, accurate state regardless of server reachability — the Shape becomes redundant for the local runner.

The Shape pathway remains the right answer for remote runners that this UI may also show in the future.
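The source-selection logic might be sketched as below. The field names come from the getHealth() description above, but their types, the RunnerHealthSnapshot interface, and the resolveRunnerState function are assumptions. The IPC transport itself (for example Electron's ipcMain.handle in the main process and ipcRenderer.invoke in the renderer) is deliberately left out.

```typescript
// Hypothetical subset of what getHealth() returns; types are assumed.
interface RunnerHealthSnapshot {
  status: string
  stream_connected: boolean
  last_heartbeat_ok: boolean
  reconnect_count: number
}

type SourcedSnapshot = RunnerHealthSnapshot & { source: "local" | "shape" }

// Prefer in-process state for the local runner; fall back to the
// Shape-synced row for remote runners or when no local snapshot exists.
function resolveRunnerState(
  isLocalRunner: boolean,
  local: RunnerHealthSnapshot | null,
  synced: RunnerHealthSnapshot | null
): SourcedSnapshot | null {
  if (isLocalRunner && local) return { ...local, source: "local" }
  if (synced) return { ...synced, source: "shape" }
  return null
}
```

Tagging the result with its source lets the page show where each value came from, so a Shape-sourced fallback can still be marked as potentially stale.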

Severity

UX bug — does not affect correctness of the runtime itself, but actively misleads operators about whether their local runtime is working. Particularly painful during onboarding or first-time setup, where the user has no baseline for what "looks right".
