Pull-wake: heartbeat path nulls out lease_expires_at on consumer_claims #4341

@kevin-dp

Summary

The lease_expires_at column on consumer_claims is nulled out by the per-wake heartbeat path, leaving claims without any lease for the remainder of their lifetime. After the first heartbeat (~10s after dispatch), lease_expires_at becomes null and stays that way — meaning any lease-based reaper or expiry check cannot recover the row if done never arrives.

Reproduction

Send a wake to a pull-wake runner and inspect the active claim via the health endpoint.

At claim materialization (materializeActiveClaim in entity-registry.ts):

{
  "consumer_id": "w_ccc537e97ee8110f165930d8",
  "epoch": 1,
  "claimed_at":         "2026-05-19T11:01:11.631Z",
  "last_heartbeat_at":  null,
  "lease_expires_at":   "2026-05-19T11:01:41.631Z"
}

lease_expires_at = claimed_at + 30s. Lease is set correctly from the upstream durable-streams lease_ttl_ms.

~20 seconds later, after the runtime's first per-wake heartbeat:

{
  "consumer_id": "w_ccc537e97ee8110f165930d8",
  "epoch": 1,
  "claimed_at":         "2026-05-19T11:01:11.631Z",
  "last_heartbeat_at":  "2026-05-19T11:01:31.684Z",
  "lease_expires_at":   null
}

lease_expires_at is now null. From this point on, no further updates ever restore the lease.

Root cause

packages/agents-server/src/entity-registry.ts, in materializeHeartbeatClaim:

async materializeHeartbeatClaim(input: MaterializeHeartbeatClaimInput): Promise<void> {
  const heartbeatAt = input.heartbeatAt ?? new Date()
  await this.db
    .update(consumerClaims)
    .set({
      lastHeartbeatAt: heartbeatAt,
      leaseExpiresAt: input.leaseExpiresAt ?? null,  // ← writes null whenever leaseExpiresAt is omitted
      updatedAt: heartbeatAt,
    })
    .where(...)
}

packages/agents-server/src/routing/internal-router.ts:606-609 — the only production caller:

await ctx.entityManager.registry.materializeHeartbeatClaim?.({
  consumerId,
  epoch,
  // no leaseExpiresAt is passed
})

So every heartbeat writes lease_expires_at = null. The first heartbeat (within ~10s of dispatch) is enough to clear the lease for good.
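
The effect can be shown without a database. The following is a minimal sketch, not the real code: ClaimRow, HeartbeatInput and applyHeartbeat are hypothetical stand-ins that apply the same `?? null` expression to a plain object, using the timestamps from the reproduction above.

```typescript
// Hypothetical in-memory stand-in for a consumer_claims row (not the real schema).
interface ClaimRow {
  claimedAt: Date
  lastHeartbeatAt: Date | null
  leaseExpiresAt: Date | null
}

// Hypothetical input shape: both fields optional, as in the quoted method.
interface HeartbeatInput {
  heartbeatAt?: Date
  leaseExpiresAt?: Date
}

// Applies the same field assignments as the quoted .set({...}) payload.
function applyHeartbeat(row: ClaimRow, input: HeartbeatInput): ClaimRow {
  const heartbeatAt = input.heartbeatAt ?? new Date()
  return {
    ...row,
    lastHeartbeatAt: heartbeatAt,
    leaseExpiresAt: input.leaseExpiresAt ?? null, // an omitted value collapses to null
  }
}

const claimedAt = new Date("2026-05-19T11:01:11.631Z")
const claimed: ClaimRow = {
  claimedAt,
  lastHeartbeatAt: null,
  leaseExpiresAt: new Date(claimedAt.getTime() + 30_000), // claimed_at + 30s
}

// Like the production caller, pass no leaseExpiresAt.
const afterHeartbeat = applyHeartbeat(claimed, {
  heartbeatAt: new Date("2026-05-19T11:01:31.684Z"),
})

console.log(claimed.leaseExpiresAt?.toISOString()) // 2026-05-19T11:01:41.631Z
console.log(afterHeartbeat.leaseExpiresAt)         // null
```

Because `undefined ?? null` evaluates to null, the existing lease is overwritten rather than preserved.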

Why it matters

A claim with lease_expires_at: null has no expiry the system can act on. Concretely:

  • A reaper job that prunes claims by lease_expires_at < now cannot touch these rows — they're never "expired."
  • The lease as a safety net is effectively absent during the active phase of every wake. The only thing that releases the claim is the done callback. If anything prevents done (server restart, runtime crash, network partition, etc.), the row leaks until something else cleans it up.
  • Visibility is also affected: the lease_remaining_ms field in the health endpoint relies on this column.

The bug doesn't cause claim leaks on its own (done still releases claims on the happy path), but it removes the safety net that would otherwise time out abandoned claims when done doesn't arrive.
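
To illustrate why a nulled lease is invisible to expiry checks, here is a small sketch of a reaper predicate. ClaimLease and isReapable are hypothetical names, and the predicate only models a SQL filter of the form lease_expires_at < now, where a NULL column never satisfies the comparison.

```typescript
// Hypothetical projection of a claim row: just the fields a reaper would need.
interface ClaimLease {
  consumerId: string
  leaseExpiresAt: Date | null
}

// Models `lease_expires_at < now`: a null lease never compares as expired.
function isReapable(claim: ClaimLease, now: Date): boolean {
  return claim.leaseExpiresAt !== null && claim.leaseExpiresAt.getTime() < now.getTime()
}

const now = new Date("2026-05-19T12:00:00.000Z")
const claims: ClaimLease[] = [
  { consumerId: "lease-intact", leaseExpiresAt: new Date("2026-05-19T11:01:41.631Z") },
  { consumerId: "lease-nulled", leaseExpiresAt: null }, // state after the first heartbeat
]

const reapable = claims.filter((c) => isReapable(c, now)).map((c) => c.consumerId)
console.log(reapable) // [ 'lease-intact' ]
```

Both claims are equally stale in this example, but only the one that kept its lease is selected.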

Suggested fix

Two options:

  1. Don't touch lease_expires_at in materializeHeartbeatClaim. Heartbeats become pure alive-pings: they update last_heartbeat_at and nothing else. The lease was set at materializeActiveClaim time from the upstream lease_ttl_ms and stays valid for that window. This is the simpler model and matches what the upstream lease ostensibly means.
  2. Extend the lease on heartbeat. The caller computes a fresh leaseExpiresAt = now + TTL (using either the upstream lease_ttl_ms, the per-wake heartbeat interval, or a fixed default) and passes it. Keeps the "lease is renewed by heartbeats" model but requires picking a TTL.

(1) is the cheapest and most defensible. The runtime's heartbeats don't currently carry any TTL signal, so making them refresh the lease would require either inferring one from the heartbeat interval (brittle) or wiring through a new parameter (more invasive).
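
As a rough sketch of the two options, the functions below build the update payload only; they are not the actual patch. HeartbeatUpdate, heartbeatOnly and heartbeatWithRenewal are hypothetical names, the 30_000 ms TTL is taken from the reproduction purely for illustration, and option 1 assumes the query builder leaves a column untouched when its key is absent from the payload.

```typescript
// Hypothetical payload shape for the heartbeat update.
interface HeartbeatUpdate {
  lastHeartbeatAt: Date
  updatedAt: Date
  leaseExpiresAt?: Date // key absent => column not written (assumption about the query builder)
}

// Option 1: heartbeats are pure alive-pings; the lease set at claim time is kept.
function heartbeatOnly(heartbeatAt: Date): HeartbeatUpdate {
  return { lastHeartbeatAt: heartbeatAt, updatedAt: heartbeatAt }
}

// Option 2: each heartbeat renews the lease to heartbeatAt + TTL.
function heartbeatWithRenewal(heartbeatAt: Date, leaseTtlMs: number): HeartbeatUpdate {
  return {
    lastHeartbeatAt: heartbeatAt,
    updatedAt: heartbeatAt,
    leaseExpiresAt: new Date(heartbeatAt.getTime() + leaseTtlMs),
  }
}

const heartbeatAt = new Date("2026-05-19T11:01:31.684Z")
const option1 = heartbeatOnly(heartbeatAt)
const option2 = heartbeatWithRenewal(heartbeatAt, 30_000) // illustrative TTL

console.log("leaseExpiresAt" in option1)           // false
console.log(option2.leaseExpiresAt?.toISOString()) // 2026-05-19T11:02:01.684Z
```

In either variant the payload never carries an explicit null, which is the property the current code lacks.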

Severity

Low on the happy path (most claims are released by done long before any lease-based reaping would matter). Becomes relevant if:

Related
