fix(agents-server): release pull-wake claim row even when in-memory token is missing by kevin-dp · Pull Request #4346 · electric-sql/electric

kevin-dp · 2026-05-18T13:09:08Z

Summary

Fixes #4340 — pull-wake claims leaking in consumer_claims when the in-memory ClaimWriteTokenStore no longer holds the consumer's token at the time sendDone arrives.

The release path in callback-forward was gated by stillOwnsClaim, an in-memory check. When that check fails — server restart between mint and done, parallel wakes evicting each other's tokens, or a retry after a transient updateStatus failure — the entire release block is skipped: materializeReleasedClaim never runs, the entity stays stuck at status='running', and the consumer_claims row stays active indefinitely.

The steady-state "send one message, wait" path is not affected by this bug: it releases correctly via the runtime's sendDone after the idleTimeout window (default 5 min via packages/agents/src/bootstrap.ts:146). The bug only fires in the failure modes documented in Test scenarios below.

Root cause

packages/agents-server/src/routing/internal-router.ts had all three release actions behind the same in-memory gate:

if (entity && stillOwnsClaim) {
  await materializeReleasedClaim(...)   // DB row release
  await updateStatus(entity.url, 'idle') // entity status
  clearStream(...)                       // in-memory token cleanup
  await onEntityChanged(entity.url)
} else if (stillOwnsClaim) {
  clearStream(...)
} else if (entity) {
  log.info('done ignored for stale claim ...')
}

stillOwnsClaim is the right gate for write authorization during the agent run, but it's the wrong gate for releasing the DB row, which is keyed by (consumerId, epoch). The DB primary key is authoritative identity — the in-memory token state is orthogonal.
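
To make the mismatch concrete, here is a minimal sketch of how an in-memory ownership check can fail while the DB row is still active. `TokenStoreSketch` is a hypothetical stand-in, not the real ClaimWriteTokenStore API, and keying tokens by stream id is an assumption made for illustration.

```typescript
// Hypothetical stand-in for the in-memory token store (NOT the real ClaimWriteTokenStore API).
class TokenStoreSketch {
  // Assumption: one current token per stream.
  private tokens = new Map<string, string>()

  mint(streamId: string, consumerId: string, epoch: number): string {
    const token = `${consumerId}:${epoch}`
    // A newer wake on the same stream overwrites (evicts) the older token.
    this.tokens.set(streamId, token)
    return token
  }

  stillOwnsClaim(streamId: string, token: string): boolean {
    return this.tokens.get(streamId) === token
  }
}

// Stand-in for active consumer_claims rows, keyed by (consumerId, epoch).
const activeClaimRows = new Set<string>()

// Failure mode 1: server restart between mint and done.
let store = new TokenStoreSketch()
const wake1Token = store.mint('stream-1', 'consumer-1', 1)
activeClaimRows.add('consumer-1:1')
store = new TokenStoreSketch() // process memory is lost, the DB row is not
const ownsAfterRestart = store.stillOwnsClaim('stream-1', wake1Token)
const rowStillActive = activeClaimRows.has('consumer-1:1')

// Failure mode 2: a newer wake evicts the older token.
const evictingStore = new TokenStoreSketch()
const firstToken = evictingStore.mint('stream-1', 'consumer-1', 1)
evictingStore.mint('stream-1', 'consumer-2', 2)
const ownsAfterEviction = evictingStore.stillOwnsClaim('stream-1', firstToken)
```

In both cases the ownership check is false even though the row identified by (consumerId, epoch) still exists, which is why gating the row release on the token leaks the claim.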

The fix

Three concerns, three gates:

- DB row release (materializeReleasedClaim): runs whenever epoch is defined. (consumerId, epoch) is the DB primary key; that's enough.
- Entity status → idle + onEntityChanged: runs when entityCleared || stillOwnsClaim. entityCleared is a new return field from materializeReleasedClaim, set to true only when our (consumerId, epoch) was the active dispatch row and we just nulled it out. The || handles two non-trivial cases: server restart (DB has us active, token is gone) and retry after a failed updateStatus (state cleared on first attempt, token still held).
- In-memory token cleanup (clearStream): remains gated by stillOwnsClaim so a newer consumer's token is never cleared out from under it.
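
The three gates can be sketched roughly as follows. This is an illustration, not the code in internal-router.ts: the function names come from the PR text, but `handleDone`, `ReleaseDeps`, `DoneInput`, the signatures, and the call order are assumptions, and the real calls are awaited rather than synchronous.

```typescript
// Sketch only: signatures are assumed; the real functions are async and awaited.
type ReleaseResult = {
  claim: { consumerId: string; epoch: number } | null
  entityCleared: boolean
}

interface ReleaseDeps {
  materializeReleasedClaim(consumerId: string, epoch: number): ReleaseResult
  updateStatus(entityUrl: string, status: 'idle'): void
  clearStream(consumerId: string): void
  onEntityChanged(entityUrl: string): void
}

interface DoneInput {
  consumerId: string
  epoch: number | undefined
  entityUrl: string | undefined
  stillOwnsClaim: boolean
}

function handleDone(input: DoneInput, deps: ReleaseDeps): void {
  // Gate 1: releasing the DB row needs only the (consumerId, epoch) identity.
  let entityCleared = false
  if (input.epoch !== undefined) {
    entityCleared = deps.materializeReleasedClaim(input.consumerId, input.epoch).entityCleared
  }

  // Gate 2: the entity goes idle if we cleared the active dispatch or still hold the token.
  const idleUrl = entityCleared || input.stillOwnsClaim ? input.entityUrl : undefined
  if (idleUrl !== undefined) deps.updateStatus(idleUrl, 'idle')

  // Gate 3: only the token holder may clear the in-memory token.
  if (input.stillOwnsClaim) deps.clearStream(input.consumerId)

  if (idleUrl !== undefined) deps.onEntityChanged(idleUrl)
}

// Test double that records which actions ran.
function recordingDeps(entityCleared: boolean, calls: string[]): ReleaseDeps {
  return {
    materializeReleasedClaim: (consumerId, epoch) => {
      calls.push('release')
      return { claim: { consumerId, epoch }, entityCleared }
    },
    updateStatus: () => { calls.push('idle') },
    clearStream: () => { calls.push('clearStream') },
    onEntityChanged: () => { calls.push('changed') },
  }
}

// Server restart: token gone, but our row was the active dispatch.
const restartCalls: string[] = []
handleDone(
  { consumerId: 'c1', epoch: 1, entityUrl: '/entity/a', stillOwnsClaim: false },
  recordingDeps(true, restartCalls),
)
// restartCalls is ['release', 'idle', 'changed']; clearStream is skipped.
```

Keeping updateStatus ahead of clearStream in this sketch means a throw from updateStatus leaves the token in place, so a retried done still passes the second gate.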

`materializeReleasedClaim` API change

- ): Promise<ConsumerClaim | null> {
+ ): Promise<{ claim: ConsumerClaim | null; entityCleared: boolean }> {

Only one production caller (internal-router.ts); both that caller and the test mock are updated. The .returning() on the entityDispatchState UPDATE now reports whether our row was actually cleared (vs. a no-op because a newer claim is active).
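
The meaning of entityCleared can be modeled in memory as below. This is a sketch under assumptions: `ClaimTablesSketch`, the `ConsumerClaim` shape, and the single-entity dispatch slot are hypothetical, and the real function performs SQL updates rather than map operations.

```typescript
// Hypothetical shapes; the real ConsumerClaim type has more fields.
type ConsumerClaim = { consumerId: string; epoch: number; status: 'active' | 'released' }
type ActiveDispatch = { consumerId: string; epoch: number } | null

class ClaimTablesSketch {
  // Stand-in for consumer_claims, keyed by "consumerId:epoch".
  claims = new Map<string, ConsumerClaim>()
  // Stand-in for one entity's active dispatch state.
  activeDispatch: ActiveDispatch = null

  materializeReleasedClaim(
    consumerId: string,
    epoch: number,
  ): { claim: ConsumerClaim | null; entityCleared: boolean } {
    // Release the claim row if it exists (assumption: null means no such row).
    const claim = this.claims.get(`${consumerId}:${epoch}`) ?? null
    if (claim) claim.status = 'released'

    // Clear the entity's dispatch state only if it still points at this claim.
    const active = this.activeDispatch
    const entityCleared =
      active !== null && active.consumerId === consumerId && active.epoch === epoch
    if (entityCleared) this.activeDispatch = null

    return { claim, entityCleared }
  }
}

const tables = new ClaimTablesSketch()
tables.claims.set('consumer-1:1', { consumerId: 'consumer-1', epoch: 1, status: 'active' })
tables.claims.set('consumer-2:2', { consumerId: 'consumer-2', epoch: 2, status: 'active' })
tables.activeDispatch = { consumerId: 'consumer-2', epoch: 2 } // a newer wake took over

const staleRelease = tables.materializeReleasedClaim('consumer-1', 1)
// staleRelease.entityCleared is false: the row is released, the entity is left alone.
const currentRelease = tables.materializeReleasedClaim('consumer-2', 2)
// currentRelease.entityCleared is true: this claim was the active dispatch.
const retriedRelease = tables.materializeReleasedClaim('consumer-2', 2)
// retriedRelease.entityCleared is false: nothing is left to clear on a retry.
```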

Test scenarios

The fix decouples the three concerns. In the table below, ✓ means the action happens and × means it does not (and correctly so).

| Scenario | `entityCleared` | `stillOwnsClaim` | DB row released | Entity → idle |
| --- | --- | --- | --- | --- |
| A. Happy path (mint + done) | true | true | ✓ released | ✓ goes idle |
| B. Server restart (no in-memory token, DB row still active) | true | false | ✓ released | ✓ goes idle |
| C. Newer wake (wake-1 done after wake-2 takes over the stream) | false | false | ✓ wake-1's row released | × stays running (wake-2 is in flight) |
| D. Retry (first done's `updateStatus` threw; same done retried) | false | true | ✓ no-op (already released) | ✓ goes idle |
| E. Legacy stale-done test (test setup never materialized active claim; token evicted) | false | false | no row to release | × stays running (newer claim conceptually in flight) |
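
As a quick consistency check, the "Entity → idle" column should follow from the two flag columns via `entityCleared || stillOwnsClaim`. The sketch below encodes the five rows as data; the `Scenario` type and its field names are made up for illustration.

```typescript
// Rows of the scenario table; field names are illustrative only.
type Scenario = {
  label: string
  entityCleared: boolean
  stillOwnsClaim: boolean
  goesIdle: boolean
}

const scenarios: Scenario[] = [
  { label: 'A happy path', entityCleared: true, stillOwnsClaim: true, goesIdle: true },
  { label: 'B server restart', entityCleared: true, stillOwnsClaim: false, goesIdle: true },
  { label: 'C newer wake', entityCleared: false, stillOwnsClaim: false, goesIdle: false },
  { label: 'D retry', entityCleared: false, stillOwnsClaim: true, goesIdle: true },
  { label: 'E legacy stale done', entityCleared: false, stillOwnsClaim: false, goesIdle: false },
]

// Collect any row whose expected outcome disagrees with the gate expression.
const mismatches = scenarios
  .filter((s) => (s.entityCleared || s.stillOwnsClaim) !== s.goesIdle)
  .map((s) => s.label)
// mismatches is empty: every row agrees with the gate.
```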

New tests in packages/agents-server/test/webhook-forward-routing.test.ts > claim release on done callback (regression for #4340) cover scenarios A–C. Existing tests in server-claim-write-token.test.ts cover D and E and continue to pass after the fix.

Verified

Unit tests: deterministic. Pre-fix, B and C fail (zero invocations of materializeReleasedClaim); D fails (updateStatus skipped on retry). Post-fix, all five scenarios produce the documented behavior.
Manual run-through against a local desktop + agents-server: send one message, dispatch claims (active_count: 0 → 1), agent completes, runtime calls sendDone after idleTimeout, server fires the new release path, active_count: 1 → 0, entity status transitions back to idle.

Not addressed in this PR

Pre-existing orphan rows: rows that already leaked from prior runs of the unfixed code can't be released because no fresh done callback is coming. Releasing them would need a reaper job or an admin command.
The lease_expires_at: null issue (#4341, "Pull-wake: heartbeat path nulls out lease_expires_at on consumer_claims") is independent of this PR. Without a lease, even a reaper job can't time out claims safely.

Base branch note

This PR targets fix-pull-wake (#4339), not main, because materializeReleasedClaim was introduced in #4308 which is part of the fix-pull-wake lineage but not yet in main. Merge order: this → fix-pull-wake → main.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…migration, drop authorization fallback

- Use principalKeyFromUrl for proper principal URL validation (rejects /principal/local-desktop)
- Migration expires active claims and clears dispatch state before deleting runners
- Desktop: don't use authorization header as principal source; return undefined and let server derive from ctx.principal.url
- listRunners validates owner_principal query param

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

State machine, concurrent claim limits, exponential reconnect backoff, and granular health status. onError is now reporting-only with fallback console.error logging. stop() rethrows drainWakes errors to callers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ver dev fallback

The desktop's default owner_principal was `system:local-desktop`, but the agents-server falls back to `system:dev-local` when no auth header is present. The principal mismatch caused runner registration to fail with 403 UNAUTHORIZED for unauthenticated local development.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…oken is missing

The release path in callback-forward was gated by `stillOwnsClaim`, an in-memory check that fails after server restart or when a newer wake on the same stream evicts the token. When that happened, the consumer_claims row stayed at status=active indefinitely and the entity remained stuck at status=running long after `done` arrived.

Decouple the concerns:

- materializeReleasedClaim runs whenever epoch is defined (DB identity is sufficient to release the row). Now returns `{ claim, entityCleared }` where entityCleared is true iff our (consumerId, epoch) was the active dispatch and we just cleared it.
- updateStatus(idle) and onEntityChanged fire when `entityCleared || stillOwnsClaim`: covers happy path, server restart, retry-after-failed-updateStatus, while still leaving status=running when a newer wake holds the entity's active dispatch.
- clearStream remains gated by stillOwnsClaim so we never clear another consumer's token from under it.

Regression tests in test/webhook-forward-routing.test.ts cover the three failure modes (lost token, evicted token, retry).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-18T13:11:55Z

❌ 1 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 223 | 1 | 222 | 2 |

View the full list of 1 ❄️ flaky test(s)

test/horton-pull-wake-e2e.test.ts > pull-wake Horton e2e with mocked LLM > dispatches explicit runner-policy wakes and Horton writes mocked responses

Flake rate in main: 100.00% (Passed 0 times, Failed 8 times)

Stack Traces | 0.0775s run time

AssertionError: expected 500 to be 204 // Object.is equality

- Expected
+ Received

- 204
+ 500

 ❯ test/horton-pull-wake-e2e.test.ts:183:28


KyleAMathews and others added 18 commits May 16, 2026 11:31

docs: add pull-wake health check design spec

d14e848

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: update health check spec with principal rename and shape sync

afe87f2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: add pull-wake health check implementation plan

6c9979e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(plan): address code review findings — add canonicalizePrincipal, …

e01b4d7

…dispatch-policy, server-utils, and electric-ax Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(plan): strict no-compat — remove canonicalizePrincipal, validate …

83e2c41

…URL form, callers convert keys to URLs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(plan): scope migration to runner-owned claims, fix default princi…

454ea9b

…pal keys, complete desktop constant replacement Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(plan): store principal URLs directly in constants, not keys

6530be3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(agents): add pull-wake runner health diagnostics

6c49982

fix(agents): address pull-wake health review findings

15aef19

chore: add changeset for pull-wake health diagnostics

83fb039

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(agents): surface pull-wake runtime diagnostics

7c3a0fb

chore: add changeset for pull-wake runner hardening

d6c5d4d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(agents): avoid delayed pull-wake session startup

e625468

chore: add changeset for pull-wake startup UI

897b7d4

This was referenced May 19, 2026

Pull-wake: active claim leaks when in-memory write-token is missing (server restart, newer-wake eviction, retry) #4340

Open

fix(agents-server): preserve consumer_claims.lease_expires_at across heartbeats #4353

Open

Base automatically changed from fix-pull-wake to main May 19, 2026 12:20

kevin-dp wants to merge 18 commits into main from fix-claim-release-after-dispatch
