Summary
The pull-wake claim release path in callback-forward (packages/agents-server/src/routing/internal-router.ts) gates materializeReleasedClaim on the in-memory ClaimWriteTokenStore. When the in-memory token is missing or has been evicted, the consumer_claims row is never marked released, the entity status is never transitioned back from running to idle, and the row leaks indefinitely.
Surfaced while testing #4339 (pull-wake health diagnostics). The new GET /_electric/runners/:id/health endpoint exposes per-claim active rows, which made the leak observable.
The steady-state "send one message, wait" path is not affected by this bug: it releases correctly via the runtime's sendDone after the idleTimeout window (default 5 min via packages/agents/src/bootstrap.ts:146). The bug only fires in the failure modes documented below.
Failure modes
The release path that genuinely doesn't run on the unfixed code is the one where the in-memory ClaimWriteTokenStore no longer holds the consumer's token at the time done arrives. Three scenarios reproduce this deterministically (covered by unit tests in packages/agents-server/test/webhook-forward-routing.test.ts):
- Server restart between mint and done. The runtime calls callback-forward to mint a token, then runs the agent. If the agents-server restarts during the run, the in-memory store loses all tokens. When sendDone arrives, the lookup returns stillOwnsClaim: false, and the entire release block is skipped: no materializeReleasedClaim, no updateStatus(idle). The DB row stays active forever.
- Newer wake evicts the in-memory token. Two wakes for the same entity stream arrive close together. The second wake's mint() call evicts the first wake's token from the in-memory map (this is ClaimWriteTokenStore's designed eviction behavior for stream→consumer mapping). When the first wake later calls sendDone, stillOwnsClaim is false for it, so it takes the same skip path.
- Retry after updateStatus failure. First done attempt successfully releases the row but updateStatus(entity, 'idle') throws (transient DB error, etc.). The runtime retries the done call. On retry, the row is already released, so the entityCleared flag is false; on the unfixed code the path then skips updateStatus(idle) again, leaving the entity stuck at running.
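The first two scenarios can be reproduced against a toy model of the token store. This is a minimal sketch, not the real ClaimWriteTokenStore: the class name InMemoryTokenStore, its method signatures, and the stream path are assumptions made for illustration. It only models the two properties described above: tokens live in process memory, and minting for a stream replaces that stream's previous token.

```typescript
// Hypothetical stand-in for ClaimWriteTokenStore; the real class's API may differ.
type TokenEntry = { consumerId: string; token: string };

class InMemoryTokenStore {
  private byStream = new Map<string, TokenEntry>();
  private counter = 0;

  // Minting for a stream replaces whatever token that stream held before.
  mint(stream: string, consumerId: string): string {
    const token = `tok-${++this.counter}`;
    this.byStream.set(stream, { consumerId, token });
    return token;
  }

  // True only if this exact consumer and token are still the stream's current entry.
  stillOwnsClaim(stream: string, consumerId: string, token: string): boolean {
    const entry = this.byStream.get(stream);
    return entry !== undefined && entry.consumerId === consumerId && entry.token === token;
  }
}

// Scenario 1: server restart between mint and done.
let store = new InMemoryTokenStore();
const tokenA = store.mint("/entity/1/main", "consumer-a");
store = new InMemoryTokenStore(); // a restart discards all in-memory tokens
const ownsAfterRestart = store.stillOwnsClaim("/entity/1/main", "consumer-a", tokenA);

// Scenario 2: a newer wake evicts the older wake's token.
const store2 = new InMemoryTokenStore();
const tokenFirst = store2.mint("/entity/1/main", "consumer-a");
const tokenSecond = store2.mint("/entity/1/main", "consumer-b");
const firstOwns = store2.stillOwnsClaim("/entity/1/main", "consumer-a", tokenFirst);
const secondOwns = store2.stillOwnsClaim("/entity/1/main", "consumer-b", tokenSecond);

console.log({ ownsAfterRestart, firstOwns, secondOwns });
```

In both scenarios the original consumer's ownership check comes back false even though its consumer_claims row is still active, which is the state the single gate cannot recover from.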
Root cause
internal-router.ts:611-665 in callbackForward has all three release actions behind a single gate:
if (entity && stillOwnsClaim) {
  await materializeReleasedClaim(...)
  await updateStatus(entity.url, 'idle')
  clearStream(...)
  await onEntityChanged(entity.url)
} else if (stillOwnsClaim) {
  clearStream(...)
} else if (entity) {
  log.info('done ignored for stale claim ...')
}
The stillOwnsClaim check is appropriate for write authorization (you should only be able to write to the entity's main stream if you hold the active claim), but it's the wrong gate for claim release, which is a DB-keyed operation on (consumerId, epoch). Releasing the DB row should not require in-memory state — the DB primary key is authoritative identity.
Suggested fix
Decouple the three concerns. Each gets its own gate:
- materializeReleasedClaim: runs whenever epoch is defined. The release is keyed on DB identity, (consumerId, epoch), not on in-memory state.
- updateStatus(entity, 'idle') + onEntityChanged: runs when either our release just cleared the entity's active dispatch row or we still hold the in-memory token. The first handles server-restart + retry; the second handles the legacy "no active dispatch row exists" case in server-claim-write-token.test.ts.
- clearStream: remains gated by stillOwnsClaim so we never clear a newer consumer's in-memory token.
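The three gates can be sketched as one handler. This is an illustration under assumptions, not the code in internal-router.ts: handleDone, DoneDeps, and DoneInput are hypothetical names, the dependency signatures are guesses, and the sketch is synchronous for brevity where the real calls are awaited.

```typescript
// All names below are illustrative; they are not the actual agents-server signatures.
interface DoneDeps {
  materializeReleasedClaim(consumerId: string, epoch: number): { entityCleared: boolean };
  updateStatus(entityUrl: string, status: "idle"): void;
  onEntityChanged(entityUrl: string): void;
  clearStream(stream: string): void;
}

interface DoneInput {
  consumerId: string;
  epoch: number | undefined;
  stream: string;
  entityUrl: string | undefined;
  stillOwnsClaim: boolean;
}

function handleDone(input: DoneInput, deps: DoneDeps): void {
  // Gate 1: the DB release depends only on DB identity (consumerId, epoch).
  let entityCleared = false;
  if (input.epoch !== undefined) {
    entityCleared = deps.materializeReleasedClaim(input.consumerId, input.epoch).entityCleared;
  }
  // Gate 2: the entity goes idle if this release cleared its active dispatch row,
  // or if we still hold the in-memory token.
  if (input.entityUrl !== undefined && (entityCleared || input.stillOwnsClaim)) {
    deps.updateStatus(input.entityUrl, "idle");
    deps.onEntityChanged(input.entityUrl);
  }
  // Gate 3: only the current token holder may clear the in-memory stream mapping.
  if (input.stillOwnsClaim) {
    deps.clearStream(input.stream);
  }
}

// Server-restart case: the token is gone, but the DB row is still ours.
const calls: string[] = [];
const deps: DoneDeps = {
  materializeReleasedClaim: () => { calls.push("release"); return { entityCleared: true }; },
  updateStatus: () => { calls.push("idle"); },
  onEntityChanged: () => { calls.push("changed"); },
  clearStream: () => { calls.push("clearStream"); },
};
handleDone(
  { consumerId: "c1", epoch: 7, stream: "/entity/1/main", entityUrl: "/entity/1", stillOwnsClaim: false },
  deps,
);
console.log(calls);
```

With the token missing, the row is still released and the entity still returns to idle, while clearStream is left alone so a newer consumer's token is never touched.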
The "either" in the entity-status gate requires materializeReleasedClaim to report a new entityCleared flag indicating whether the active dispatch row was just cleared (versus a no-op because a newer wake had already taken over). The flag is added in the registry method.
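One way to picture the entityCleared contract is an in-memory model of the registry. Everything here is an assumption made for illustration: FakeRegistry, the "consumerId:epoch" key format, and the activeDispatch map are hypothetical, and the real registry is DB-backed with a schema this sketch does not try to reproduce.

```typescript
// Hypothetical in-memory model; the real registry is DB-backed and its schema may differ.
type ClaimRow = { status: "active" | "released"; entityUrl: string };

class FakeRegistry {
  // Keyed by "consumerId:epoch".
  claims = new Map<string, ClaimRow>();
  // entityUrl -> key of the claim currently dispatched for that entity.
  activeDispatch = new Map<string, string>();

  materializeReleasedClaim(consumerId: string, epoch: number): { entityCleared: boolean } {
    const key = `${consumerId}:${epoch}`;
    const row = this.claims.get(key);
    if (row === undefined) return { entityCleared: false };
    row.status = "released"; // releasing an already released row is harmless
    // Clear the entity's dispatch pointer only if it still points at this claim.
    if (this.activeDispatch.get(row.entityUrl) === key) {
      this.activeDispatch.delete(row.entityUrl);
      return { entityCleared: true };
    }
    return { entityCleared: false };
  }
}

const registry = new FakeRegistry();
registry.claims.set("c1:1", { status: "active", entityUrl: "/entity/1" });
registry.claims.set("c2:2", { status: "active", entityUrl: "/entity/1" });
registry.activeDispatch.set("/entity/1", "c2:2"); // the newer wake has taken over

const stale = registry.materializeReleasedClaim("c1", 1); // older claim: row released, entity untouched
const current = registry.materializeReleasedClaim("c2", 2); // current claim: entity's dispatch row cleared
const repeat = registry.materializeReleasedClaim("c2", 2); // repeat call: nothing left to clear
console.log(stale.entityCleared, current.entityCleared, repeat.entityCleared);
```

In this model the older claim's release marks its own row released without disturbing the entity that the newer wake now owns, which is the distinction the flag exists to report.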
Severity
Low-to-medium for the happy path (idle timeout naturally clears state). Higher for:
- Long-running servers that get restarted under load (claim rows accumulate and eventually exhaust maxConcurrentClaims for the runner)
- Entities receiving frequent wakes where the token-eviction race is likely
- Any deployment where transient DB errors during updateStatus aren't rare
Visibility is poor regardless — without #4339's diagnostics endpoint, operators have no way to see the orphaned rows.
PR
Implemented in #4346 (against fix-pull-wake). Unit tests demonstrate all three failure modes pre-fix and pass after the fix.
Related
- lease_expires_at: null on materialized claims removes the safety net that would otherwise auto-expire abandoned rows.