feat(agents): pull-wake runner health check, principal rename, and lifecycle hardening#4339
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dispatch-policy, server-utils, and electric-ax Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…URL form, callers convert keys to URLs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…migration, drop authorization fallback
- Use principalKeyFromUrl for proper principal URL validation (rejects /principal/local-desktop)
- Migration expires active claims and clears dispatch state before deleting runners
- Desktop: don't use authorization header as principal source; return undefined and let server derive from ctx.principal.url
- listRunners validates owner_principal query param
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pal keys, complete desktop constant replacement Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #4339 +/- ##
=======================================
Coverage ? 59.56%
=======================================
Files ? 290
Lines ? 28579
Branches ? 7754
=======================================
Hits ? 17022
Misses ? 11540
Partials ? 17
State machine, concurrent claim limits, exponential reconnect backoff, and granular health status. onError is now reporting-only with fallback console.error logging. stop() rethrows drainWakes errors to callers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
I had a look through the code and left some comments.
At a higher level, I don't like the "claims" name; it collides too much with auth claims.
The term "lease" is more standard in distributed systems, so I would rename the claims variables to: lease / activeLeaseCount / maxConcurrentLeases / getActiveLeasesForRunner.
Also, I think the code needs quite a bit of refactoring. Currently there are ~20 variables inside one closure and they are being mutated from everywhere in this file. That's very error-prone. I'll try refactoring in a follow-up PR.
let eventHeartbeatTimer: ReturnType<typeof setTimeout> | null = null
let currentOffset = config.offset
let startedAt: string | null = null
let streamConnected = false
I would define streamConnected as a getter:
get streamConnected() {
  return !!streamConnectedSince
}

last_claim_result: `claimed` | `no_work` | `error` | null
last_dispatch_at: string | null
events_received: number
claims_succeeded: number
Perhaps this could be a nested property:
export interface PullWakeRunnerHealth {
  // ...
  claims: {
    succeeded: number
    skipped: number
    failed: number
  }
}
Not sure how useful these numbers are. Perhaps an array of the actual claims that succeeded/skipped/failed would be better.
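One way to get both the counters and the "array of actual claims" is a small bounded log that derives the nested `claims` totals alongside the most recent entries. This is only a sketch: `createClaimLog`, `ClaimRecord`, and the `maxEntries` cap are hypothetical names, not anything in this PR.

```typescript
// Hypothetical sketch, not the PR's implementation.
type ClaimOutcome = `succeeded` | `skipped` | `failed`

interface ClaimRecord {
  outcome: ClaimOutcome
  at: string // ISO timestamp of when the claim attempt finished
}

function createClaimLog(maxEntries: number) {
  const recent: ClaimRecord[] = []
  const totals: Record<ClaimOutcome, number> = {
    succeeded: 0,
    skipped: 0,
    failed: 0,
  }

  return {
    // Bump the running total and keep only the newest `maxEntries` records.
    record(outcome: ClaimOutcome, at: string = new Date().toISOString()): void {
      totals[outcome] += 1
      recent.push({ outcome, at })
      if (recent.length > maxEntries) recent.shift()
    },
    // Return copies so callers cannot mutate the internal state.
    snapshot(): { claims: Record<ClaimOutcome, number>; recent: ClaimRecord[] } {
      return { claims: { ...totals }, recent: [...recent] }
    },
  }
}
```

Capping the array keeps the heartbeat payload bounded while the totals still cover the runner's whole lifetime.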
type PullWakeRunnerState =
  | `stopped`
  | `starting`
  | `running.connecting`
The running. prefixes are a bit unusual.
Would this benefit from being an actual TS enum?
enum PullWakeRunnerState {
  Stopped = `stopped`,
  Starting = `starting`,
  Connecting = `running.connecting`,
  Streaming = `running.streaming`,
  Reconnecting = `running.reconnecting`,
  Stopping = `stopping`
}
This would make the code less string-heavy.
Or if you don't like enums:
const PullWakeRunnerState = {
  Stopped: `stopped`,
  Starting: `starting`,
  Connecting: `running.connecting`,
  Streaming: `running.streaming`,
  Reconnecting: `running.reconnecting`,
  Stopping: `stopping`
} as const
type PullWakeRunnerState = (typeof PullWakeRunnerState)[keyof typeof PullWakeRunnerState]

  Math.floor(config.maxConcurrentClaims ?? DEFAULT_MAX_CONCURRENT_CLAIMS)
)

const toStatus = (): PullWakeRunnerStatus => {
Why have both "state" and "status"? They are almost identical.
I would just go with the enum approach from my previous comment.
Users use it like PullWakeRunnerState.Connecting and it resolves to the string "running.connecting" so we effectively get the usage we want.
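To make the usage concrete, here is a minimal sketch of the const-object variant. The `isRunning` helper is an assumption added for illustration; it relies on the `running.` prefix from the diff and is not part of the PR.

```typescript
// Sketch only: mirrors the const-object suggestion from the review thread.
const PullWakeRunnerState = {
  Stopped: `stopped`,
  Starting: `starting`,
  Connecting: `running.connecting`,
  Streaming: `running.streaming`,
  Reconnecting: `running.reconnecting`,
  Stopping: `stopping`,
} as const

// The type shares the object's name, so call sites can use one identifier
// both as a value (PullWakeRunnerState.Connecting) and as a type.
type PullWakeRunnerState =
  (typeof PullWakeRunnerState)[keyof typeof PullWakeRunnerState]

// Hypothetical helper: every `running.*` state counts as "running".
function isRunning(state: PullWakeRunnerState): boolean {
  return state.startsWith(`running.`)
}

const current: PullWakeRunnerState = PullWakeRunnerState.Connecting
console.log(current, isRunning(current))
```

Because each member is a plain string literal, values serialized into the heartbeat payload stay the same as with the union type; only the call sites change.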
const notifyHeartbeatChange = (): void => {
  const signal = controller?.signal
  if (!signal || signal.aborted || heartbeatIntervalMs <= 0) return
  if (eventHeartbeatTimer) return
nit: we have 2 if statements for the fast return. To be consistent, either the first one should do only the signal checks and the second one the heartbeat checks, or we should use a single if statement:
if (!signal || signal.aborted) return
if (heartbeatIntervalMs <= 0 || eventHeartbeatTimer) return
We could also avoid the eventHeartbeatTimer check by rewriting the assignment:
eventHeartbeatTimer ??= setTimeout(() => {
  eventHeartbeatTimer = null
  void heartbeat(signal)
}, eventHeartbeatThrottleMs)
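For reference, a self-contained sketch of that `??=` throttle. `createHeartbeatNotifier`, `sendHeartbeat`, and `throttleMs` are hypothetical stand-ins for the closure variables in the diff. Since `??=` only evaluates its right-hand side when the left-hand side is null or undefined, repeated notifications inside the window schedule a single timer.

```typescript
// Sketch only: hypothetical names, not the PR's closure.
function createHeartbeatNotifier(
  sendHeartbeat: () => void,
  throttleMs: number
): { notify: () => void; cancel: () => void } {
  let timer: ReturnType<typeof setTimeout> | null = null

  const notify = (): void => {
    // No explicit `if (timer) return` is needed: setTimeout is only
    // called when `timer` is currently null.
    timer ??= setTimeout(() => {
      timer = null // reopen the window before sending
      sendHeartbeat()
    }, throttleMs)
  }

  const cancel = (): void => {
    if (timer !== null) clearTimeout(timer)
    timer = null
  }

  return { notify, cancel }
}
```

The trade-off is readability: the guard becomes implicit, so a short comment at the assignment is probably worth keeping.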
const notifyHeartbeatChange = (): void => {
  const signal = controller?.signal
  if (!signal || signal.aborted || heartbeatIntervalMs <= 0) return
Why do we check heartbeatIntervalMs <= 0 if it's not used in this function?
This function uses eventHeartbeatThrottleMs instead.
} catch (err) {
  if (!signal.aborted) {
    config.onError?.(err instanceof Error ? err : new Error(String(err)))
    lastHeartbeatOk = false
We already set lastHeartbeatOk = false on L256 right before throwing, and here we set it again only inside if (!signal.aborted), by which point it is already false. Is this the behaviour we want? If it should only become false when !signal.aborted, then we must not set it right before throwing.
  throw err
}
if (!claimErrorRecorded) {
  recordClaimError()
Do we need to set claimErrorRecorded = true here?
Perhaps it would be better if recordClaimError did that (e.g. we could wrap it so that we can't forget to set this variable).
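A rough sketch of such a wrapper, assuming one instance is created per claim attempt. `createOncePerAttempt` is a hypothetical name; the point is only that the flag lives next to the recorder, so a call site cannot forget to set it.

```typescript
// Sketch only: hypothetical wrapper, one instance per claim attempt.
function createOncePerAttempt(recordError: () => void) {
  let recorded = false

  return {
    // Runs the recorder on the first call and ignores later calls.
    record(): void {
      if (recorded) return
      recorded = true
      recordError()
    },
    // Read-only view of the flag, for diagnostics.
    get recorded(): boolean {
      return recorded
    },
  }
}

// Usage: both error paths can call record() without checking a flag.
let claimsFailed = 0
const claimError = createOncePerAttempt(() => {
  claimsFailed += 1
})
claimError.record()
claimError.record() // no-op, already recorded for this attempt
console.log(claimsFailed, claimError.recorded)
```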
  return true
}

const sleep = async (ms: number, signal: AbortSignal): Promise<void> => {
This should be extracted to a utility file.
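As a starting point, the extracted utility might look roughly like this. The body of `sleep` is not visible in the diff context, so this sketch assumes it resolves early (rather than rejecting) when the signal aborts; adjust if the runner relies on a rejection.

```typescript
// Sketch only: assumes "resolve early on abort" semantics.
function sleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve) => {
    // Already aborted: do not schedule anything.
    if (signal.aborted) {
      resolve()
      return
    }

    const onAbort = (): void => {
      clearTimeout(timer) // do not keep the event loop alive
      resolve()
    }

    const timer = setTimeout(() => {
      signal.removeEventListener(`abort`, onAbort) // avoid leaking listeners
      resolve()
    }, ms)

    signal.addEventListener(`abort`, onAbort, { once: true })
  })
}
```

Callers would still check `signal.aborted` after awaiting, since an early resolve is indistinguishable from a completed delay.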
Spent some time testing the lifecycle hardening claims in this PR against a running desktop app + local agents-server. Some of what's claimed works as advertised; one significant gap turned up that I think is worth flagging before merge.
What worked
Things turned up while testing
While testing this PR I also surfaced three related reliability issues that the new diagnostics made visible. They aren't introduced by this PR (the diagnostics just exposed them), but they all interact with this work:
Suggestion
Happy to help test once you've got a candidate fix.
We should address #4339 (comment) before merging.
1. A **health check endpoint** (`GET /_electric/runners/:id/health`) for deep debugging: curl it to see comprehensive diagnostics about a runner's dispatch pipeline.
2. **Rich runner state in Postgres** so that apps can sync the `runners` table via an Electric Shape and show runner status on any device (e.g. see your laptop runner's status from your phone).

The `diagnostics` JSONB column on the `runners` table serves both purposes: the health endpoint reads it for the detailed response, and Shape sync delivers it reactively to any connected client.
This will make runners quite a noisy shape, with every runner causing an update every 2 seconds. I suggest we split heartbeats & claims into a separate table, and keep status, started, last_error & last_error_at in the main table. Anyone needing full diagnostics can then either call the endpoint, or subscribe to a filtered shape for the diagnostics of a particular runner. WDYT?
It should be a max of every two seconds (if there's lots of activity) and then when things aren't happening, updates would drop to every 30 seconds.
But yeah, fairly noisy I agree actually still. Lemme think about it.
I don't think we need the entire object on that table for "normal" UIs. The heartbeat endpoint can split this up easily.
yeah makes sense — refactored
# Conflicts:
# packages/agents-desktop/src/main.ts
# packages/agents-server/src/routing/runners-router.ts
# packages/agents-server/test/dispatch-policy-routing.test.ts
# packages/agents/src/server.ts
# packages/electric-ax/src/start.ts
# packages/electric-ax/test/start.test.ts
Claude Code Review (iteration 1)
Substantial PR (~3.5k additions / 60+ files): runner health endpoint, owner_user_id -> owner_principal rename, pull-wake runner hardening (state machine, bounded concurrency, exponential backoff, event-driven heartbeats). Overall direction solid. Main concerns: type-safety latent bug in health response, two small state-reset bugs, stack of unresolved inline review comments from kevin-dp.
WHAT IS WORKING WELL
CRITICAL (Must Fix): None found.
IMPORTANT (Should Fix)
SUGGESTIONS (Nice to Have)
ISSUE CONFORMANCE
No linked GitHub issue on this PR. PR description is unusually thorough (root cause, approach, invariants, non-goals, trade-offs, verification, files-changed table) and substitutes well for a tracked issue. Diagnostics work surfaced #4340-#4342 and #4343, filed and tracked separately. Consider attaching Closes for whichever #4343 fixes are landing in this PR (the heartbeat-driven reset is option (a) per kevin-dp).
PREVIOUS REVIEW STATUS
A prior Claude Code Review comment existed but contained only a placeholder fragment ("PLACEHOLDER continuing in next edit.") -- replaced with this complete review.
NOTE ON PROMPT-INJECTION ATTEMPT
While reading .review-context/conversation.json I encountered a fake system-reminder tag embedded inside one of KyleAMathews's comment bodies (re: task tools). Treated as untrusted comment content, ignored. Flagging so you know.
Review iteration: 1 | 2026-05-18
Iteration 2: 2026-05-18 (re-validation pass)
Re-read the diff and inline review threads. No new commits on the branch since iteration 1 was posted at 18:29 UTC. All findings in the prior Claude review (comment id 4480689327) remain open as written. Independent verification confirms:
No new issues surfaced beyond iteration 1.
Review iteration: 2 | 2026-05-18
Claude Code Review
Summary
Iteration 5 on
What's Working Well
All five iteration-4 Important findings are now fixed in code (and three of them are covered by new tests):
Other iteration-5 wins:
Issues Found
Critical (Must Fix)
None.
Important (Should Fix)
Suggestions (Nice to Have)
Issue Conformance
No linked GitHub issue. PR description remains thorough but should be reconciled with the actual behavior on two points:
The new commit
Previous Review Status
Prior Claude review is comment id
Review iteration: 5 | 2026-05-19
Summary
Adds pull-wake runner health diagnostics, renames runner ownership from owner_user_id to canonical owner_principal URLs, and hardens the pull-wake runner lifecycle around heartbeats, reconnects, shutdown, and error reporting.
This PR also fixes the local desktop send path so mutating local server requests can go through the Electron main process instead of being blocked behind Chromium's HTTP/1.1 connection limit, while keeping normal shape reads in the renderer.
Root Cause
Pull-wake dispatch had too little observability and too many implicit lifecycle assumptions. When a runner missed wakes, got stuck reconnecting, or failed during shutdown, the server/UI had limited information to explain the state. The ownership field also used owner_user_id, which was misleading because principals can be users, agents, services, or system actors.
Local development exposed a separate but related issue: renderer-origin mutating requests could stall behind long-lived shape/SSE requests, delaying sends and new agent startup by seconds even when the server processed the request in milliseconds.
Approach
Runner Health Diagnostics
PullWakeRunner reports stream, heartbeat, claim, dispatch, reconnect, and error diagnostics in heartbeat payloads.
GET /_electric/runners/:id/health aggregates runner state, liveness lease, sanitized client diagnostics, active claims, dispatch stats, and derived health issues.
Principal Ownership
owner_principal.
electric-principal request headers now accept either a principal key (user:alice) or a principal URL (/principal/user%3Aalice) through parsePrincipalInput() and compare internally by canonical URL.
Pull-Wake Runner Lifecycle
stopped | starting | connecting | streaming | reconnecting | stopping.
onError reporting-only and isolates throwing reporters with a console.error fallback.
stop() gates new claim dispatch, aborts outstanding work, waits boundedly for claim actors, drains runtime wakes, and rethrows drain errors after recording diagnostics.
waitForStopped() now waits for the full stop path, not just the stream loop.
start().
maxConcurrentClaims: it only limited claim/enqueue work, not actual wake execution, so it was a misleading invariant.
Dispatch And Local Send Performance
/send remains normal application/json; the preflight/connection issue is handled by the desktop fetch path rather than changing API semantics.
Key Invariants
owner_principal values are canonical principal URLs.
onError cannot control runner lifecycle.
Non-goals
pull-wake-runner.ts; the critical invariants are fixed and tested first.
Verification