Summary
The lease_expires_at column on consumer_claims is nulled out by the per-wake heartbeat path, leaving claims without any lease for the remainder of their lifetime. After the first heartbeat (~10s after dispatch), lease_expires_at becomes null and stays that way — meaning any lease-based reaper or expiry check cannot recover the row if done never arrives.
Reproduction
Send a wake to a pull-wake runner and inspect the active claim via the health endpoint.
At claim materialization (materializeActiveClaim in entity-registry.ts):
{
  "consumer_id": "w_ccc537e97ee8110f165930d8",
  "epoch": 1,
  "claimed_at": "2026-05-19T11:01:11.631Z",
  "last_heartbeat_at": null,
  "lease_expires_at": "2026-05-19T11:01:41.631Z"
}
lease_expires_at = claimed_at + 30s. Lease is set correctly from the upstream durable-streams lease_ttl_ms.
~20 seconds later, after the runtime's first per-wake heartbeat:
{
  "consumer_id": "w_ccc537e97ee8110f165930d8",
  "epoch": 1,
  "claimed_at": "2026-05-19T11:01:11.631Z",
  "last_heartbeat_at": "2026-05-19T11:01:31.684Z",
  "lease_expires_at": null
}
lease_expires_at is now null. From this point on, no further updates ever restore the lease.
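To make the effect on the two snapshots above concrete, here is a minimal sketch of how a remaining-lease value could be derived from a claim row. The field names follow the health endpoint JSON; the leaseRemainingMs helper and its signature are assumptions for illustration, not the server's actual implementation.

```typescript
// Field names follow the health endpoint JSON above; the helper itself is hypothetical.
interface ClaimSnapshot {
  claimed_at: string
  last_heartbeat_at: string | null
  lease_expires_at: string | null
}

// Returns the milliseconds left on the lease, or null when the row carries no lease.
function leaseRemainingMs(claim: ClaimSnapshot, now: Date): number | null {
  if (claim.lease_expires_at === null) return null
  return Date.parse(claim.lease_expires_at) - now.getTime()
}

const atClaim: ClaimSnapshot = {
  claimed_at: "2026-05-19T11:01:11.631Z",
  last_heartbeat_at: null,
  lease_expires_at: "2026-05-19T11:01:41.631Z",
}
const afterFirstHeartbeat: ClaimSnapshot = {
  claimed_at: "2026-05-19T11:01:11.631Z",
  last_heartbeat_at: "2026-05-19T11:01:31.684Z",
  lease_expires_at: null,
}

// Evaluate both snapshots at the moment of the first heartbeat.
const observedAt = new Date("2026-05-19T11:01:31.684Z")
const remainingAtClaim = leaseRemainingMs(atClaim, observedAt) // 9947
const remainingAfterHeartbeat = leaseRemainingMs(afterFirstHeartbeat, observedAt) // null
```

With the original lease intact, roughly 9.9 seconds would still be left at the first heartbeat; once the column is null there is nothing to compute.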
Root cause
packages/agents-server/src/entity-registry.ts — materializeHeartbeatClaim:
async materializeHeartbeatClaim(input: MaterializeHeartbeatClaimInput): Promise<void> {
  const heartbeatAt = input.heartbeatAt ?? new Date()
  await this.db
    .update(consumerClaims)
    .set({
      lastHeartbeatAt: heartbeatAt,
      leaseExpiresAt: input.leaseExpiresAt ?? null, // ← writes null whenever leaseExpiresAt is not provided
      updatedAt: heartbeatAt,
    })
    .where(...)
}
packages/agents-server/src/routing/internal-router.ts:606-609 — the only production caller:
await ctx.entityManager.registry.materializeHeartbeatClaim?.({
  consumerId,
  epoch,
  // no leaseExpiresAt is passed
})
So every heartbeat writes lease_expires_at = null. The first heartbeat (within ~10s of dispatch) is enough to clear the lease for good.
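The mechanism can be reproduced without a database. The sketch below is a simplified in-memory stand-in: ClaimRow, applyUpdate and buggyHeartbeatPatch are hypothetical names, and applyUpdate assumes that a key present in the patch (even with null) is written while an absent key is left alone.

```typescript
// Hypothetical in-memory stand-in for one consumer_claims row.
interface ClaimRow {
  lastHeartbeatAt: Date | null
  leaseExpiresAt: Date | null
}

// Assumed update semantics: keys present in the patch (including null) are written,
// keys that are absent or undefined are skipped.
function applyUpdate(row: ClaimRow, patch: Partial<ClaimRow>): ClaimRow {
  const next: ClaimRow = { ...row }
  if (patch.lastHeartbeatAt !== undefined) next.lastHeartbeatAt = patch.lastHeartbeatAt
  if (patch.leaseExpiresAt !== undefined) next.leaseExpiresAt = patch.leaseExpiresAt
  return next
}

// Mirrors the `input.leaseExpiresAt ?? null` expression from the excerpt above.
function buggyHeartbeatPatch(heartbeatAt: Date, leaseExpiresAt?: Date): Partial<ClaimRow> {
  return { lastHeartbeatAt: heartbeatAt, leaseExpiresAt: leaseExpiresAt ?? null }
}

const claimed: ClaimRow = {
  lastHeartbeatAt: null,
  leaseExpiresAt: new Date("2026-05-19T11:01:41.631Z"),
}

// The caller passes no leaseExpiresAt, exactly like the production call site.
const afterHeartbeat = applyUpdate(
  claimed,
  buggyHeartbeatPatch(new Date("2026-05-19T11:01:31.684Z")),
)
console.log(afterHeartbeat.leaseExpiresAt) // null
```

Because the patch always contains a leaseExpiresAt key, a missing argument is indistinguishable from an explicit request to clear the lease.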
Why it matters
A claim with lease_expires_at: null has no expiry the system can act on. Concretely:
- A reaper job that prunes claims by lease_expires_at < now cannot touch these rows, because they are never "expired."
- The lease as a safety net is effectively absent during the active phase of every wake. The only thing that releases the claim is the done callback. If anything prevents done (server restart, runtime crash, network partition, etc.), the row leaks until something else cleans it up.
- Visibility is also affected: the lease_remaining_ms field in the health endpoint relies on this column.
The bug doesn't cause claim leaks on its own, since done still releases claims on the happy path, but it removes the safety net that would otherwise time out abandoned claims when done doesn't arrive.
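As an illustration of the reaper point, here is a minimal sketch of a lease-based expiry check. The Claim shape and the expiredClaims function are hypothetical; the filter imitates a SQL predicate of the form lease_expires_at < now, which a NULL column never satisfies.

```typescript
// Hypothetical, trimmed-down view of a claim row.
interface Claim {
  consumerId: string
  leaseExpiresAt: Date | null
}

// Selects claims whose lease has run out. A null lease never compares as "less than now",
// so such rows are skipped, just as NULL < now is not true in SQL.
function expiredClaims(claims: Claim[], now: Date): Claim[] {
  return claims.filter(
    (c) => c.leaseExpiresAt !== null && c.leaseExpiresAt.getTime() < now.getTime(),
  )
}

const now = new Date("2026-05-19T12:00:00.000Z")
const claims: Claim[] = [
  { consumerId: "lease-intact", leaseExpiresAt: new Date("2026-05-19T11:01:41.631Z") },
  { consumerId: "lease-nulled", leaseExpiresAt: null },
]

const reapable = expiredClaims(claims, now).map((c) => c.consumerId)
console.log(reapable) // [ 'lease-intact' ]
```

Both claims were abandoned about an hour earlier, yet only the one whose lease survived is eligible for cleanup.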
Suggested fix
Two options:
- Don't touch lease_expires_at in materializeHeartbeatClaim. Heartbeats become pure alive-pings: they update last_heartbeat_at and nothing else. The lease was set at materializeActiveClaim time from the upstream lease_ttl_ms and stays valid for that window. This is the simpler model and matches what the upstream lease ostensibly means.
- Extend the lease on heartbeat. The caller computes a fresh leaseExpiresAt = now + TTL (using the upstream lease_ttl_ms, the per-wake heartbeat interval, or a fixed default) and passes it. This keeps the "lease is renewed by heartbeats" model but requires picking a TTL.
(1) is the cheapest and most defensible. The runtime's heartbeats don't currently carry any TTL signal, so making them refresh the lease would require either inferring one from the heartbeat interval (brittle) or wiring through a new parameter (more invasive).
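The two options differ only in what the heartbeat update contains. The sketch below shows both as plain patch builders; HeartbeatPatch, aliveOnlyPatch, renewingPatch and the leaseTtlMs parameter are hypothetical names, and the 30 second TTL is just the value seen in the reproduction, not a recommendation.

```typescript
// Hypothetical shape of the values a heartbeat would write.
interface HeartbeatPatch {
  lastHeartbeatAt: Date
  updatedAt: Date
  leaseExpiresAt?: Date
}

// Option 1: pure alive-ping. leaseExpiresAt is left out entirely,
// so the lease written at claim time is never touched.
function aliveOnlyPatch(heartbeatAt: Date): HeartbeatPatch {
  return { lastHeartbeatAt: heartbeatAt, updatedAt: heartbeatAt }
}

// Option 2: renew the lease. leaseTtlMs is an assumed parameter the caller would have to supply.
function renewingPatch(heartbeatAt: Date, leaseTtlMs: number): HeartbeatPatch {
  return {
    lastHeartbeatAt: heartbeatAt,
    updatedAt: heartbeatAt,
    leaseExpiresAt: new Date(heartbeatAt.getTime() + leaseTtlMs),
  }
}

const heartbeatAt = new Date("2026-05-19T11:01:31.684Z")
const option1 = aliveOnlyPatch(heartbeatAt)
const option2 = renewingPatch(heartbeatAt, 30_000)

console.log("leaseExpiresAt" in option1) // false
console.log(option2.leaseExpiresAt?.toISOString()) // 2026-05-19T11:02:01.684Z
```

Either shape avoids writing null; the first needs no new inputs, while the second depends on the caller knowing a TTL.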
Severity
Low on the happy path (most claims are released by done long before any lease-based reaping would matter). Becomes relevant if:
- The health endpoint is used for diagnostics: lease_remaining_ms: null everywhere makes the diagnostic less informative.
- Something prevents done from arriving (which #4340, "Pull-wake: active claim leaks when in-memory write-token is missing (server restart, newer-wake eviction, retry)", covers separately); without a lease, there is no automatic recovery.
Related