fix(agents-server): release pull-wake claim row even when in-memory token is missing#4346
Open
kevin-dp wants to merge 18 commits into
Open
fix(agents-server): release pull-wake claim row even when in-memory token is missing#4346kevin-dp wants to merge 18 commits into
kevin-dp wants to merge 18 commits into
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dispatch-policy, server-utils, and electric-ax Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…URL form, callers convert keys to URLs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…migration, drop authorization fallback - Use principalKeyFromUrl for proper principal URL validation (rejects /principal/local-desktop) - Migration expires active claims and clears dispatch state before deleting runners - Desktop: don't use authorization header as principal source — return undefined and let server derive from ctx.principal.url - listRunners validates owner_principal query param Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pal keys, complete desktop constant replacement Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
State machine, concurrent claim limits, exponential reconnect backoff, and granular health status. onError is now reporting-only with fallback console.error logging. stop() rethrows drainWakes errors to callers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ver dev fallback The desktop's default owner_principal was `system:local-desktop`, but the agents-server falls back to `system:dev-local` when no auth header is present. The principal mismatch caused runner registration to fail with 403 UNAUTHORIZED for unauthenticated local development. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oken is missing
The release path in callback-forward was gated by `stillOwnsClaim`, an
in-memory check that fails after server restart or when a newer wake on
the same stream evicts the token. When that happened, the consumer_claims
row stayed at status=active indefinitely and the entity remained stuck at
status=running long after `done` arrived.
Decouple the concerns:
- materializeReleasedClaim runs whenever epoch is defined (DB identity is
sufficient to release the row). Now returns `{ claim, entityCleared }`
where entityCleared is true iff our (consumerId, epoch) was the active
dispatch and we just cleared it.
- updateStatus(idle) and onEntityChanged fire when `entityCleared ||
stillOwnsClaim` — covers happy path, server restart, retry-after-failed-
updateStatus, while still leaving status=running when a newer wake holds
the entity's active dispatch.
- clearStream remains gated by stillOwnsClaim so we never clear another
consumer's token from under it.
Regression tests in test/webhook-forward-routing.test.ts cover the three
failure modes (lost token, evicted token, retry).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
❌ 1 Tests Failed:
View the full list of 1 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #4340 — pull-wake claims leaking in
consumer_claimswhen the in-memoryClaimWriteTokenStoreno longer holds the consumer's token at the timesendDonearrives.The release path in
callback-forwardwas gated bystillOwnsClaim, an in-memory check. When that check fails — server restart between mint and done, parallel wakes evicting each other's tokens, or a retry after a transientupdateStatusfailure — the entire release block is skipped:materializeReleasedClaimnever runs, the entity stays stuck atstatus='running', and theconsumer_claimsrow staysactiveindefinitely.The steady-state "send one message, wait" path is not affected by this bug: it releases correctly via the runtime's
sendDoneafter theidleTimeoutwindow (default 5 min viapackages/agents/src/bootstrap.ts:146). The bug only fires in the failure modes documented in Test scenarios below.Root cause
packages/agents-server/src/routing/internal-router.tshad all three release actions behind the same in-memory gate:stillOwnsClaimis the right gate for write authorization during the agent run, but it's the wrong gate for releasing the DB row, which is keyed by(consumerId, epoch). The DB primary key is authoritative identity — the in-memory token state is orthogonal.The fix
Three concerns, three gates:
materializeReleasedClaim) — runs wheneverepochis defined.(consumerId, epoch)is the DB primary key; that's enough.onEntityChanged— runs whenentityCleared || stillOwnsClaim.entityClearedis a new return field frommaterializeReleasedClaim, set totrueonly when our(consumerId, epoch)was the active dispatch row and we just nulled it out. The||handles two non-trivial cases: server restart (DB has us active, token is gone) and retry after a failedupdateStatus(state cleared on first attempt, token still held).clearStream) — remains gated bystillOwnsClaimso a newer consumer's token is never cleared out from under it.materializeReleasedClaimAPI changeOnly one production caller (
internal-router.ts); both that caller and the test mock are updated. The.returning()on theentityDispatchStateUPDATE now reports whether our row was actually cleared (vs. a no-op because a newer claim is active).Test scenarios
The fix decouples the three concerns. Below: ✓ = action happens, × = action does NOT happen (and is correct that it doesn't).
entityClearedstillOwnsClaimupdateStatusthrew; same done retried)New tests in
packages/agents-server/test/webhook-forward-routing.test.ts > claim release on done callback (regression for #4340)cover scenarios A–C. Existing tests inserver-claim-write-token.test.tscover D and E and continue to pass after the fix.Verified
materializeReleasedClaim); D fails (updateStatusskipped on retry). Post-fix, all five scenarios produce the documented behavior.active_count: 0 → 1), agent completes, runtime callssendDoneafteridleTimeout, server fires the new release path,active_count: 1 → 0, entity status transitions back toidle.Not addressed in this PR
donecallback is coming. Would need a reaper job or admin command.lease_expires_at: nullissue (Pull-wake: heartbeat path nulls out lease_expires_at on consumer_claims #4341): independent. Without a lease, even a reaper job can't time-out claims safely.Base branch note
This PR targets
fix-pull-wake(#4339), notmain, becausematerializeReleasedClaimwas introduced in #4308 which is part of thefix-pull-wakelineage but not yet inmain. Merge order: this → fix-pull-wake → main.🤖 Generated with Claude Code