Pull-wake: active claim leaks when in-memory write-token is missing (server restart, newer-wake eviction, retry)

## Summary

The pull-wake claim release path in `callback-forward` (`packages/agents-server/src/routing/internal-router.ts`) gates `materializeReleasedClaim` on the **in-memory** `ClaimWriteTokenStore`. When the in-memory token is missing or has been evicted, the `consumer_claims` row is never marked `released`, the entity status is never transitioned back from `running` to `idle`, and the row leaks indefinitely.

Surfaced while testing #4339 (pull-wake health diagnostics). The new `GET /_electric/runners/:id/health` endpoint exposes per-claim `active` rows, which made the leak observable.

The steady-state "send one message, wait" path is not affected by this bug: it releases correctly via the runtime's `sendDone` after the `idleTimeout` window (default 5 min via `packages/agents/src/bootstrap.ts:146`). The bug only fires in the failure modes documented below.

## Failure modes

The release path that genuinely *doesn't run* on the unfixed code is the one where the in-memory `ClaimWriteTokenStore` no longer holds the consumer's token at the time `done` arrives. Three scenarios reproduce this deterministically (covered by unit tests in `packages/agents-server/test/webhook-forward-routing.test.ts`):

1. **Server restart between mint and done.** The runtime calls `callback-forward` to mint a token, then runs the agent. If the agents-server restarts during the run, the in-memory store loses all tokens. When `sendDone` arrives, the lookup returns `stillOwnsClaim: false`, and the entire release block is skipped: no `materializeReleasedClaim`, no `updateStatus(idle)`. The DB row stays `active` forever.
2. **Newer wake evicts the in-memory token.** Two wakes for the same entity stream arrive close together. The second wake's `mint()` call evicts the first wake's token from the in-memory map (this is `ClaimWriteTokenStore`'s designed eviction behavior for stream→consumer mapping). When the *first* wake later calls `sendDone`, `stillOwnsClaim` is `false` for it — same skip path.
3. **Retry after `updateStatus` failure.** The first `done` attempt successfully releases the row, but `updateStatus(entity, 'idle')` throws (for example, on a transient DB error). The runtime retries the `done` call. On retry the row is already released, so the `entityCleared` flag is `false`; on the unfixed code the path then skips `updateStatus(idle)` again, leaving the entity stuck at `running`.
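
The first two scenarios can be illustrated with a minimal sketch. `SketchTokenStore` is a hypothetical stand-in, not the real `ClaimWriteTokenStore`; it assumes only what is described above: one token per stream, held in process memory, overwritten by a newer `mint()`.

```ts
// Hypothetical stand-in for ClaimWriteTokenStore (names and shape are assumptions).
class SketchTokenStore {
  // One holder per stream; lives only in process memory.
  private byStream = new Map<string, { consumerId: string; token: string }>()
  private next = 0

  mint(streamId: string, consumerId: string): string {
    const token = `tok-${++this.next}`
    // A newer mint for the same stream overwrites (evicts) the earlier holder.
    this.byStream.set(streamId, { consumerId, token })
    return token
  }

  stillOwnsClaim(streamId: string, consumerId: string, token: string): boolean {
    const held = this.byStream.get(streamId)
    return held !== undefined && held.consumerId === consumerId && held.token === token
  }
}

// Scenario 2: a newer wake evicts the first wake's token.
const store = new SketchTokenStore()
const firstToken = store.mint('entity-stream', 'consumer-a')
const secondToken = store.mint('entity-stream', 'consumer-b')
const firstStillOwns = store.stillOwnsClaim('entity-stream', 'consumer-a', firstToken)
const secondStillOwns = store.stillOwnsClaim('entity-stream', 'consumer-b', secondToken)

// Scenario 1: a restart is modeled as a fresh, empty store.
const restarted = new SketchTokenStore()
const ownsAfterRestart = restarted.stillOwnsClaim('entity-stream', 'consumer-b', secondToken)
```

In both cases the lookup comes back `false` even though the corresponding DB row is still `active`, which is the condition that sends `done` down the skip path.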

## Root cause

`internal-router.ts:611-665` in `callbackForward` has all three release actions behind a single gate:

```ts
if (entity && stillOwnsClaim) {
  await materializeReleasedClaim(...)
  await updateStatus(entity.url, 'idle')
  clearStream(...)
  await onEntityChanged(entity.url)
} else if (stillOwnsClaim) {
  clearStream(...)
} else if (entity) {
  log.info('done ignored for stale claim ...')
}
```

The `stillOwnsClaim` check is appropriate for **write authorization** (you should only be able to write to the entity's main stream if you hold the active claim), but it's the wrong gate for **claim release**, which is a DB-keyed operation on `(consumerId, epoch)`. Releasing the DB row should not require in-memory state — the DB primary key is authoritative identity.
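
To make the "DB-keyed" point concrete, here is a rough sketch of a release that needs nothing beyond `(consumerId, epoch)`. `SketchClaimTable` is a hypothetical in-memory model of the `consumer_claims` table, not the actual registry code; it only illustrates that the operation is addressable by key and safe to repeat.

```ts
type ClaimStatus = 'active' | 'released'

interface ClaimRow {
  consumerId: string
  epoch: number
  status: ClaimStatus
}

// Hypothetical model of consumer_claims, keyed by (consumerId, epoch).
class SketchClaimTable {
  private rows = new Map<string, ClaimRow>()

  private key(consumerId: string, epoch: number): string {
    return `${consumerId}:${epoch}`
  }

  insertActive(consumerId: string, epoch: number): void {
    this.rows.set(this.key(consumerId, epoch), { consumerId, epoch, status: 'active' })
  }

  // Returns true only when this call moved the row from active to released.
  release(consumerId: string, epoch: number): boolean {
    const row = this.rows.get(this.key(consumerId, epoch))
    if (row === undefined || row.status === 'released') return false
    row.status = 'released'
    return true
  }

  statusOf(consumerId: string, epoch: number): ClaimStatus | undefined {
    return this.rows.get(this.key(consumerId, epoch))?.status
  }
}

const table = new SketchClaimTable()
table.insertActive('consumer-a', 7)
const firstRelease = table.release('consumer-a', 7)
const secondRelease = table.release('consumer-a', 7)
```

No write token appears anywhere in `release`, so a caller that lost its in-memory token can still release its own row, and a repeated call is a no-op.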

## Suggested fix

Decouple the three concerns. Each gets its own gate:

- `materializeReleasedClaim` — runs whenever `epoch` is defined. DB identity.
- `updateStatus(entity, 'idle')` + `onEntityChanged` — runs when *either* our release just cleared the entity's active dispatch row *or* we still hold the in-memory token. The first handles server-restart + retry; the second handles the legacy "no active dispatch row exists" case in `server-claim-write-token.test.ts`.
- `clearStream` — remains gated by `stillOwnsClaim` so we never clear a newer consumer's in-memory token.

The "either" in the entity-status gate requires `materializeReleasedClaim` to report a new `entityCleared` flag indicating whether the active dispatch row was just cleared (as opposed to a no-op because a newer wake had already taken over). The flag is added in the registry method.
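
The three gates could be arranged roughly as follows. This is a sketch under stated assumptions, not the code in the PR: `DoneContext`, `ReleaseDeps`, and `releaseOnDone` are hypothetical names, the real calls are `async` and take arguments that are omitted here, and the dependencies are injected synchronously only to keep the example short.

```ts
// Hypothetical shapes; only the gate conditions are taken from the description above.
interface DoneContext {
  epoch: number | undefined
  hasEntity: boolean
  stillOwnsClaim: boolean
}

interface ReleaseDeps {
  materializeReleasedClaim(epoch: number): { entityCleared: boolean }
  updateStatusIdle(): void
  onEntityChanged(): void
  clearStream(): void
}

function releaseOnDone(ctx: DoneContext, deps: ReleaseDeps): void {
  // Gate 1: DB identity only. Runs whenever epoch is defined.
  let entityCleared = false
  if (ctx.epoch !== undefined) {
    entityCleared = deps.materializeReleasedClaim(ctx.epoch).entityCleared
  }

  // Gate 2: entity status. Either our release just cleared the active
  // dispatch row, or we still hold the in-memory token.
  const touchEntity = ctx.hasEntity && (entityCleared || ctx.stillOwnsClaim)
  if (touchEntity) deps.updateStatusIdle()

  // Gate 3: in-memory token. Never clear a newer consumer's token.
  if (ctx.stillOwnsClaim) deps.clearStream()

  if (touchEntity) deps.onEntityChanged()
}
```

With this arrangement, a `done` that arrives after a restart (`stillOwnsClaim` false, `entityCleared` true) still releases the row and resets the entity, while a `done` from a wake that a newer wake has superseded (`stillOwnsClaim` false, `entityCleared` false) releases only its own row and leaves the entity and the newer token alone.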

## Severity

Low-to-medium for the happy path (idle timeout naturally clears state). Higher for:
- Long-running servers that get restarted under load (claim rows accumulate, eventually exhaust `maxConcurrentClaims` for the runner)
- Entities receiving frequent wakes where the token-eviction race is likely
- Any deployment where transient DB errors during `updateStatus` aren't rare

Visibility is poor regardless — without #4339's diagnostics endpoint, operators have no way to see the orphaned rows.

## PR

Implemented in #4346 (against `fix-pull-wake`). Unit tests demonstrate all three failure modes pre-fix and pass after the fix.

## Related

- #4341 — `lease_expires_at: null` on materialized claims removes the safety net that would otherwise auto-expire abandoned rows.
- #4339 — PR whose diagnostics surfaced this issue.
