Pull-wake: active claim leaks when in-memory write-token is missing (server restart, newer-wake eviction, retry) #4340

@kevin-dp

Summary

The pull-wake claim release path in callback-forward (packages/agents-server/src/routing/internal-router.ts) gates materializeReleasedClaim on the in-memory ClaimWriteTokenStore. When the in-memory token is missing or has been evicted, the consumer_claims row is never marked released, the entity status is never transitioned back from running to idle, and the row leaks indefinitely.

Surfaced while testing #4339 (pull-wake health diagnostics). The new GET /_electric/runners/:id/health endpoint exposes per-claim active rows, which made the leak observable.

The steady-state "send one message, wait" path is not affected by this bug: it releases correctly via the runtime's sendDone after the idleTimeout window (default 5 min via packages/agents/src/bootstrap.ts:146). The bug only fires in the failure modes documented below.

Failure modes

The release path that genuinely doesn't run on the unfixed code is the one where the in-memory ClaimWriteTokenStore no longer holds the consumer's token at the time done arrives. Three scenarios reproduce this deterministically (covered by unit tests in packages/agents-server/test/webhook-forward-routing.test.ts):

  1. Server restart between mint and done. The runtime calls callback-forward to mint a token, then runs the agent. If the agents-server restarts during the run, the in-memory store loses all tokens. When sendDone arrives, the lookup returns stillOwnsClaim: false, and the entire release block is skipped: no materializeReleasedClaim, no updateStatus(idle). The DB row stays active forever.
  2. Newer wake evicts the in-memory token. Two wakes for the same entity stream arrive close together. The second wake's mint() call evicts the first wake's token from the in-memory map (this is ClaimWriteTokenStore's designed eviction behavior for stream→consumer mapping). When the first wake later calls sendDone, stillOwnsClaim is false for it, so the same release block is skipped.
  3. Retry after updateStatus failure. The first done attempt successfully releases the row, but updateStatus(entity, 'idle') throws (for example, on a transient DB error). The runtime retries the done call. On retry, the row is already released, so the entityCleared flag is false; on the unfixed code the path then skips updateStatus(idle) again, leaving the entity stuck at running.
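Scenarios 1 and 2 can be illustrated with a toy model of an in-memory token store. This is only a sketch: InMemoryTokenStore, its method signatures, and the stream path are hypothetical stand-ins, not the real ClaimWriteTokenStore API. The one property assumed from the description above is that the store keeps a single entry per stream, so a newer mint() replaces the older one.

```typescript
// Hypothetical, simplified stand-in for an in-memory write-token store.
// Names and shapes are assumptions for illustration only.
type TokenEntry = { consumerId: string; epoch: number };

class InMemoryTokenStore {
  // One entry per stream: minting for a stream replaces any earlier entry.
  private byStream = new Map<string, TokenEntry>();

  mint(streamPath: string, consumerId: string, epoch: number): void {
    this.byStream.set(streamPath, { consumerId, epoch });
  }

  stillOwnsClaim(streamPath: string, consumerId: string, epoch: number): boolean {
    const current = this.byStream.get(streamPath);
    return (
      current !== undefined &&
      current.consumerId === consumerId &&
      current.epoch === epoch
    );
  }
}

const stream = "/entity/a/main"; // hypothetical stream path

// Scenario 1: the process restarts between mint and done.
let store = new InMemoryTokenStore();
store.mint(stream, "consumer-1", 1);
store = new InMemoryTokenStore(); // restart: the map starts empty
const afterRestart = store.stillOwnsClaim(stream, "consumer-1", 1);

// Scenario 2: a newer wake mints for the same stream.
store.mint(stream, "consumer-1", 1);
store.mint(stream, "consumer-2", 2); // replaces consumer-1's entry
const firstWakeOwns = store.stillOwnsClaim(stream, "consumer-1", 1);
const secondWakeOwns = store.stillOwnsClaim(stream, "consumer-2", 2);
```

In this model both afterRestart and firstWakeOwns come out false, which corresponds to the case where the unfixed release block is skipped, while secondWakeOwns stays true.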

Root cause

internal-router.ts:611-665 in callbackForward has all three release actions behind a single gate:

```
if (entity && stillOwnsClaim) {
  await materializeReleasedClaim(...)
  await updateStatus(entity.url, 'idle')
  clearStream(...)
  await onEntityChanged(entity.url)
} else if (stillOwnsClaim) {
  clearStream(...)
} else if (entity) {
  log.info('done ignored for stale claim ...')
}
```

The stillOwnsClaim check is appropriate for write authorization (you should only be able to write to the entity's main stream if you hold the active claim), but it's the wrong gate for claim release, which is a DB-keyed operation on (consumerId, epoch). Releasing the DB row should not require in-memory state — the DB primary key is authoritative identity.
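To make the "DB-keyed" point concrete, here is a minimal sketch of a release that depends only on (consumerId, epoch). The ClaimTable class, its row shape, and the Map standing in for the consumer_claims table are all assumptions for illustration; the real schema and registry method are not shown in this issue.

```typescript
// Hypothetical model of a claims table keyed on (consumerId, epoch).
// A Map stands in for the database; the row shape is an assumption.
type ClaimRow = { consumerId: string; epoch: number; status: "active" | "released" };

class ClaimTable {
  private rows = new Map<string, ClaimRow>();

  private key(consumerId: string, epoch: number): string {
    return `${consumerId}#${epoch}`;
  }

  insertActive(consumerId: string, epoch: number): void {
    this.rows.set(this.key(consumerId, epoch), { consumerId, epoch, status: "active" });
  }

  // Returns true only when this call moved the row from active to released.
  // A missing or already-released row is a no-op, so retries are safe.
  release(consumerId: string, epoch: number): boolean {
    const row = this.rows.get(this.key(consumerId, epoch));
    if (row === undefined || row.status === "released") return false;
    row.status = "released";
    return true;
  }

  activeCount(): number {
    let count = 0;
    for (const row of this.rows.values()) {
      if (row.status === "active") count += 1;
    }
    return count;
  }
}

const table = new ClaimTable();
table.insertActive("consumer-1", 1);

// No in-memory token is consulted: the key alone identifies the row.
const firstRelease = table.release("consumer-1", 1);
const secondRelease = table.release("consumer-1", 1); // retry is a no-op
const remainingActive = table.activeCount();
```

Because the release needs nothing beyond the key, it behaves the same after a restart or an eviction, and a repeated done call cannot release anything twice.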

Suggested fix

Decouple the three concerns. Each gets its own gate:

  • materializeReleasedClaim: runs whenever epoch is defined, because the release is keyed on DB identity (consumerId, epoch) rather than on in-memory state.
  • updateStatus(entity, 'idle') + onEntityChanged: runs when either our release just cleared the entity's active dispatch row or we still hold the in-memory token. The first handles server-restart + retry; the second handles the legacy "no active dispatch row exists" case in server-claim-write-token.test.ts.
  • clearStream: remains gated by stillOwnsClaim so we never clear a newer consumer's in-memory token.

The "either" in the entity-status gate requires materializeReleasedClaim to report a new entityCleared flag indicating whether the active dispatch row was just cleared (versus a no-op because a newer wake had already taken over). The fix adds this flag to the registry method.
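The three gates can be sketched as a pure decision function. This is an illustration of the gating logic described above, not the code in #4346: DoneContext, ReleasePlan, and planRelease are hypothetical names, and entityCleared is passed in as the value the registry method would report after attempting the release.

```typescript
// Hypothetical types; the real handler works with richer objects.
interface DoneContext {
  hasEntity: boolean;
  epoch: number | undefined;
  stillOwnsClaim: boolean; // in-memory token check
  entityCleared: boolean;  // what the release would report; ignored if no release runs
}

interface ReleasePlan {
  releaseClaimRow: boolean;  // materializeReleasedClaim
  markEntityIdle: boolean;   // updateStatus(entity, 'idle') + onEntityChanged
  clearStreamToken: boolean; // clearStream
}

function planRelease(ctx: DoneContext): ReleasePlan {
  // Gate 1: DB identity only.
  const releaseClaimRow = ctx.epoch !== undefined;
  // entityCleared is only meaningful if the release actually ran.
  const cleared = releaseClaimRow && ctx.entityCleared;
  return {
    releaseClaimRow,
    // Gate 2: either the release just cleared the row, or we still hold the token.
    markEntityIdle: ctx.hasEntity && (cleared || ctx.stillOwnsClaim),
    // Gate 3: never touch a newer consumer's in-memory token.
    clearStreamToken: ctx.stillOwnsClaim,
  };
}

// Server restart: token lost, but our row was the active one.
const restartPlan = planRelease({
  hasEntity: true, epoch: 1, stillOwnsClaim: false, entityCleared: true,
});

// Newer wake took over: the row is released, the entity stays with the newer claim.
const evictedPlan = planRelease({
  hasEntity: true, epoch: 1, stillOwnsClaim: false, entityCleared: false,
});

// Legacy case: no dispatch row, token still held.
const legacyPlan = planRelease({
  hasEntity: true, epoch: undefined, stillOwnsClaim: true, entityCleared: false,
});
```

Keeping the decision separate from the side effects makes each gate testable in isolation, which is one reasonable way to cover the failure modes without a live registry.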

Severity

Low-to-medium for the happy path (idle timeout naturally clears state). Higher for:

  • Long-running servers that get restarted under load (claim rows accumulate, eventually exhaust maxConcurrentClaims for the runner)
  • Entities receiving frequent wakes where the token-eviction race is likely
  • Any deployment where transient DB errors during updateStatus aren't rare

Visibility is poor regardless — without #4339's diagnostics endpoint, operators have no way to see the orphaned rows.

PR

Implemented in #4346 (against fix-pull-wake). Unit tests demonstrate all three failure modes pre-fix and pass after the fix.
