
fix(agents-server): release pull-wake claim row even when in-memory token is missing #4346

Open
kevin-dp wants to merge 18 commits into main from fix-claim-release-after-dispatch

kevin-dp commented May 18, 2026

Summary

Fixes #4340 — pull-wake claims leaking in consumer_claims when the in-memory ClaimWriteTokenStore no longer holds the consumer's token at the time sendDone arrives.

The release path in callback-forward was gated by stillOwnsClaim, an in-memory check. When that check fails — server restart between mint and done, parallel wakes evicting each other's tokens, or a retry after a transient updateStatus failure — the entire release block is skipped: materializeReleasedClaim never runs, the entity stays stuck at status='running', and the consumer_claims row stays active indefinitely.
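The failure modes above can be illustrated with a minimal sketch. It assumes the token store keeps one token per stream in process memory; `TokenStoreSketch`, its method names, and the stream and token values are hypothetical stand-ins, not the real `ClaimWriteTokenStore` API.

```typescript
// Hypothetical stand-in for ClaimWriteTokenStore: one token per stream, in memory only.
class TokenStoreSketch {
  private readonly tokens = new Map<string, string>()

  // Minting for a stream overwrites whatever token an earlier wake held.
  mint(streamId: string, token: string): void {
    this.tokens.set(streamId, token)
  }

  // The in-memory ownership check that the release path used as its only gate.
  stillOwnsClaim(streamId: string, token: string): boolean {
    return this.tokens.get(streamId) === token
  }
}

const store = new TokenStoreSketch()
store.mint('stream-1', 'wake-1-token')
const ownsBeforeEviction = store.stillOwnsClaim('stream-1', 'wake-1-token')

// A parallel wake on the same stream evicts wake-1's token.
store.mint('stream-1', 'wake-2-token')
const ownsAfterEviction = store.stillOwnsClaim('stream-1', 'wake-1-token')

// A server restart is modelled as a fresh, empty store.
const storeAfterRestart = new TokenStoreSketch()
const ownsAfterRestart = storeAfterRestart.stillOwnsClaim('stream-1', 'wake-2-token')
```

In this model the check fails after either eviction or restart, even though nothing about the database row has changed.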

The steady-state "send one message, wait" path is not affected by this bug: it releases correctly via the runtime's sendDone after the idleTimeout window (default 5 min via packages/agents/src/bootstrap.ts:146). The bug only fires in the failure modes documented in Test scenarios below.

Root cause

packages/agents-server/src/routing/internal-router.ts had all three release actions behind the same in-memory gate:

if (entity && stillOwnsClaim) {
  await materializeReleasedClaim(...)   // DB row release
  await updateStatus(entity.url, 'idle') // entity status
  clearStream(...)                       // in-memory token cleanup
  await onEntityChanged(entity.url)
} else if (stillOwnsClaim) {
  clearStream(...)
} else if (entity) {
  log.info('done ignored for stale claim ...')
}

stillOwnsClaim is the right gate for write authorization during the agent run, but it's the wrong gate for releasing the DB row, which is keyed by (consumerId, epoch). The DB primary key is authoritative identity — the in-memory token state is orthogonal.

The fix

Three concerns, three gates:

  1. DB row release (materializeReleasedClaim) — runs whenever epoch is defined. (consumerId, epoch) is the DB primary key; that's enough.
  2. Entity status → idle + onEntityChanged — runs when entityCleared || stillOwnsClaim. entityCleared is a new return field from materializeReleasedClaim, set to true only when our (consumerId, epoch) was the active dispatch row and we just nulled it out. The || handles two non-trivial cases: server restart (DB has us active, token is gone) and retry after a failed updateStatus (state cleared on first attempt, token still held).
  3. In-memory token cleanup (clearStream) — remains gated by stillOwnsClaim so a newer consumer's token is never cleared out from under it.
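The three gates can be sketched roughly as follows. This is a simplified, synchronous illustration rather than the router code itself: the `ReleaseDeps` and `DoneCallback` shapes, the argument lists, the call order, and the entity-existence check (carried over from the pre-fix snippet) are all assumptions, and the real calls are awaited.

```typescript
// All names on ReleaseDeps are stand-ins for the real router dependencies.
interface ReleaseResult {
  entityCleared: boolean
}

interface ReleaseDeps {
  materializeReleasedClaim(consumerId: string, epoch: number): ReleaseResult
  updateStatus(entityUrl: string, status: 'idle'): void
  onEntityChanged(entityUrl: string): void
  clearStream(consumerId: string): void
}

interface DoneCallback {
  consumerId: string
  epoch: number | undefined
  entityUrl: string | undefined
  stillOwnsClaim: boolean
}

function handleDone(done: DoneCallback, deps: ReleaseDeps): void {
  // Gate 1: the DB row is keyed by (consumerId, epoch), so a defined epoch is enough.
  let entityCleared = false
  if (done.epoch !== undefined) {
    entityCleared = deps.materializeReleasedClaim(done.consumerId, done.epoch).entityCleared
  }

  // Gate 2: go idle if we just cleared the active dispatch, or still hold the token.
  if (done.entityUrl !== undefined && (entityCleared || done.stillOwnsClaim)) {
    deps.updateStatus(done.entityUrl, 'idle')
    deps.onEntityChanged(done.entityUrl)
  }

  // Gate 3: only the current token holder may clear the in-memory token.
  if (done.stillOwnsClaim) {
    deps.clearStream(done.consumerId)
  }
}

// Server-restart case: the DB still has us active, but the token is gone.
const calls: string[] = []
const recordingDeps: ReleaseDeps = {
  materializeReleasedClaim: () => {
    calls.push('materializeReleasedClaim')
    return { entityCleared: true }
  },
  updateStatus: () => { calls.push('updateStatus') },
  onEntityChanged: () => { calls.push('onEntityChanged') },
  clearStream: () => { calls.push('clearStream') },
}

handleDone(
  { consumerId: 'consumer-1', epoch: 1, entityUrl: '/entity/1', stillOwnsClaim: false },
  recordingDeps
)
```

With these inputs the row release and the idle transition both run, while `clearStream` is skipped because the caller no longer holds the token.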

materializeReleasedClaim API change

- ): Promise<ConsumerClaim | null> {
+ ): Promise<{ claim: ConsumerClaim | null; entityCleared: boolean }> {

Only one production caller (internal-router.ts); both that caller and the test mock are updated. The .returning() on the entityDispatchState UPDATE now reports whether our row was actually cleared (vs. a no-op because a newer claim is active).
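To make the new return shape concrete, here is a small in-memory model of the behavior. It is only a sketch: the two `Map`s stand in for the `consumer_claims` and `entityDispatchState` tables, and the `ConsumerClaim` fields, `claimKey`, and `materializeReleasedClaimSketch` are hypothetical names, not the real query code.

```typescript
// Field names are assumptions made for illustration.
interface ConsumerClaim {
  consumerId: string
  epoch: number
  status: 'active' | 'released'
}

// Hypothetical in-memory stand-ins for the two tables.
const claims = new Map<string, ConsumerClaim>()
const activeDispatch = new Map<string, string>() // entityUrl -> key of the active claim

const claimKey = (consumerId: string, epoch: number): string => `${consumerId}:${epoch}`

function materializeReleasedClaimSketch(
  entityUrl: string,
  consumerId: string,
  epoch: number
): { claim: ConsumerClaim | null; entityCleared: boolean } {
  const key = claimKey(consumerId, epoch)
  const claim = claims.get(key) ?? null
  if (claim) claim.status = 'released'

  // Mirrors a conditional UPDATE: clear the dispatch row only if it still points at us.
  const entityCleared = activeDispatch.get(entityUrl) === key
  if (entityCleared) activeDispatch.delete(entityUrl)

  return { claim, entityCleared }
}

// wake-2 has taken over the entity; wake-1's done arrives late.
claims.set(claimKey('wake-1', 1), { consumerId: 'wake-1', epoch: 1, status: 'active' })
claims.set(claimKey('wake-2', 2), { consumerId: 'wake-2', epoch: 2, status: 'active' })
activeDispatch.set('/entity/1', claimKey('wake-2', 2))

const stale = materializeReleasedClaimSketch('/entity/1', 'wake-1', 1)
const current = materializeReleasedClaimSketch('/entity/1', 'wake-2', 2)
```

In this model the late release of wake-1 still marks its own claim released but reports `entityCleared: false`, so the caller can tell that a newer claim owns the entity.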

Test scenarios

The fix decouples the three concerns. Below, ✓ means the action happens and × means it does not happen (and correctly so).

| Scenario | entityCleared | stillOwnsClaim | DB row released | Entity → idle |
| --- | --- | --- | --- | --- |
| A. Happy path (mint + done) | true | true | ✓ released | ✓ goes idle |
| B. Server restart (no in-memory token, DB row still active) | true | false | ✓ released | ✓ goes idle |
| C. Newer wake (wake-1 done after wake-2 takes over the stream) | false | false | ✓ wake-1's row released | × stays running (wake-2 is in flight) |
| D. Retry (first done's updateStatus threw; same done retried) | false | true | ✓ no-op (already released) | ✓ goes idle |
| E. Legacy stale-done test (test setup never materialized active claim; token evicted) | false | false | no row to release | × stays running (newer claim conceptually in flight) |
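The scenario rows can be checked mechanically against the two flag-based gates. The snippet below is a sketch, not part of the test suite: the `Scenario` type and the helper names are hypothetical, and `expectIdle` is copied from the "Entity → idle" column.

```typescript
interface Scenario {
  name: string
  entityCleared: boolean
  stillOwnsClaim: boolean
  expectIdle: boolean
}

// Rows copied from the scenario table.
const scenarios: Scenario[] = [
  { name: 'A. Happy path', entityCleared: true, stillOwnsClaim: true, expectIdle: true },
  { name: 'B. Server restart', entityCleared: true, stillOwnsClaim: false, expectIdle: true },
  { name: 'C. Newer wake', entityCleared: false, stillOwnsClaim: false, expectIdle: false },
  { name: 'D. Retry', entityCleared: false, stillOwnsClaim: true, expectIdle: true },
  { name: 'E. Legacy stale-done', entityCleared: false, stillOwnsClaim: false, expectIdle: false },
]

// The entity-status gate described in the fix.
const shouldGoIdle = (s: Scenario): boolean => s.entityCleared || s.stillOwnsClaim

// The in-memory token is only cleared by its current holder.
const shouldClearStream = (s: Scenario): boolean => s.stillOwnsClaim

// Names of scenarios whose expected status disagrees with the gate.
const mismatches = scenarios.filter((s) => shouldGoIdle(s) !== s.expectIdle).map((s) => s.name)

// Scenarios in which the in-memory token would be cleared.
const clearsToken = scenarios.filter(shouldClearStream).map((s) => s.name)
```

Under this reading only A and D clear the in-memory token, since those are the only rows where `stillOwnsClaim` is true.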

New tests in packages/agents-server/test/webhook-forward-routing.test.ts > claim release on done callback (regression for #4340) cover scenarios A–C. Existing tests in server-claim-write-token.test.ts cover D and E and continue to pass after the fix.

Verified

  • Unit tests: deterministic. Pre-fix, B and C fail (zero invocations of materializeReleasedClaim); D fails (updateStatus skipped on retry). Post-fix, all five scenarios produce the documented behavior.
  • Manual run-through against a local desktop + agents-server: send one message, dispatch claims (active_count: 0 → 1), agent completes, runtime calls sendDone after idleTimeout, server fires the new release path, active_count: 1 → 0, entity status transitions back to idle.

Not addressed in this PR

Base branch note

This PR targets fix-pull-wake (#4339), not main, because materializeReleasedClaim was introduced in #4308 which is part of the fix-pull-wake lineage but not yet in main. Merge order: this → fix-pull-wake → main.

🤖 Generated with Claude Code

KyleAMathews and others added 18 commits May 16, 2026 11:31
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dispatch-policy, server-utils, and electric-ax

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…URL form, callers convert keys to URLs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…migration, drop authorization fallback

- Use principalKeyFromUrl for proper principal URL validation (rejects /principal/local-desktop)
- Migration expires active claims and clears dispatch state before deleting runners
- Desktop: don't use authorization header as principal source — return undefined and let server derive from ctx.principal.url
- listRunners validates owner_principal query param

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pal keys, complete desktop constant replacement

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
State machine, concurrent claim limits, exponential reconnect backoff,
and granular health status. onError is now reporting-only with fallback
console.error logging. stop() rethrows drainWakes errors to callers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ver dev fallback

The desktop's default owner_principal was `system:local-desktop`, but the
agents-server falls back to `system:dev-local` when no auth header is
present. The principal mismatch caused runner registration to fail with
403 UNAUTHORIZED for unauthenticated local development.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oken is missing

The release path in callback-forward was gated by `stillOwnsClaim`, an
in-memory check that fails after server restart or when a newer wake on
the same stream evicts the token. When that happened, the consumer_claims
row stayed at status=active indefinitely and the entity remained stuck at
status=running long after `done` arrived.

Decouple the concerns:
- materializeReleasedClaim runs whenever epoch is defined (DB identity is
  sufficient to release the row). Now returns `{ claim, entityCleared }`
  where entityCleared is true iff our (consumerId, epoch) was the active
  dispatch and we just cleared it.
- updateStatus(idle) and onEntityChanged fire when `entityCleared ||
  stillOwnsClaim` — covers happy path, server restart, retry-after-failed-
  updateStatus, while still leaving status=running when a newer wake holds
  the entity's active dispatch.
- clearStream remains gated by stillOwnsClaim so we never clear another
  consumer's token from under it.

Regression tests in test/webhook-forward-routing.test.ts cover the three
failure modes (lost token, evicted token, retry).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codecov Bot commented May 18, 2026

❌ 1 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 223 | 1 | 222 | 2 |
View the full list of 1 ❄️ flaky test(s)
test/horton-pull-wake-e2e.test.ts > pull-wake Horton e2e with mocked LLM > dispatches explicit runner-policy wakes and Horton writes mocked responses

Flake rate in main: 100.00% (Passed 0 times, Failed 8 times)

Stack Traces | 0.0775s run time
AssertionError: expected 500 to be 204 // Object.is equality

- Expected
+ Received

- 204
+ 500

 ❯ test/horton-pull-wake-e2e.test.ts:183:28

