Reject polls from recently-shutdown workers to prevent task theft #9545

Merged: rkannan82 merged 4 commits into main from kannan/shutdown-worker-poll-rejection on Mar 23, 2026

rkannan82 commented Mar 17, 2026

What changed?

Add a TTL cache of recently-shutdown WorkerInstanceKeys to the matching engine. When CancelOutstandingWorkerPolls is called during ShutdownWorker, the worker's key is recorded in this cache. Subsequent polls carrying that key are rejected immediately with an empty response.

Why?

When ShutdownWorker cancels a worker's polls, the SDK's graceful shutdown path may re-poll before fully stopping. This zombie re-poll can sync-match with retry tasks (e.g., activity retries dispatched by the timer queue). The dying worker silently drops such a task, which then sits undelivered until it times out. The cache prevents these zombie polls from being matched with real tasks.

How did you test it?

  • built
  • added new unit test(s)
  • verified using sdk test

Potential risks

  • The cache is per matching node and is populated via the cancelOutstandingWorkerPolls fan-out, which covers all partitions. If the partition count changes between cancellation and re-poll, a node that was not included in the fan-out will not have the cache entry. This is an unlikely edge case during a shutdown sequence.
  • Only affects polls that carry WorkerInstanceKey (new SDK versions). No impact on existing SDK versions.
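
The second risk can be pictured with a small sketch of the poll-side check. Everything here is hypothetical (`pollRequest`, `pollResponse`, `handlePoll`, and the plain map standing in for the TTL cache). It only shows the behavior described above: a poll with a recently-shutdown key gets an empty response, while polls with an unknown key or no key at all proceed to matching as before.

```go
package main

import "fmt"

// pollRequest and pollResponse are hypothetical stand-ins for the real
// matching-service request and response types.
type pollRequest struct {
	WorkerInstanceKey string // empty for SDK versions that do not send it
}

type pollResponse struct {
	TaskToken []byte // an empty token means "no task"
}

// handlePoll rejects polls from recently-shutdown workers before any
// matching happens. recentlyShutdown stands in for the TTL cache.
func handlePoll(
	req pollRequest,
	recentlyShutdown map[string]bool,
	match func(pollRequest) pollResponse,
) pollResponse {
	if req.WorkerInstanceKey != "" && recentlyShutdown[req.WorkerInstanceKey] {
		return pollResponse{} // empty response; the task stays available for a live worker
	}
	return match(req)
}

func main() {
	shutdown := map[string]bool{"worker-a": true}
	match := func(pollRequest) pollResponse {
		return pollResponse{TaskToken: []byte("task-1")}
	}

	for _, key := range []string{"worker-a", "worker-b", ""} {
		resp := handlePoll(pollRequest{WorkerInstanceKey: key}, shutdown, match)
		fmt.Printf("key=%q got task: %v\n", key, len(resp.TaskToken) > 0)
	}
}
```

Because the check depends only on the key, it can run before, and independently of, any other per-poller bookkeeping.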

Made with Cursor

rkannan82 force-pushed the kannan/shutdown-worker-poll-rejection branch 3 times, most recently from 4c2c331 to e23223e, March 17, 2026 17:14
When ShutdownWorker cancels a worker's polls via CancelOutstandingWorkerPolls,
the SDK's graceful shutdown path may re-poll before fully stopping. This zombie
re-poll can sync-match with retry tasks (e.g., activity retries dispatched by
the timer queue), which the dying worker silently drops — causing the task to
sit until timeout.

Add a TTL cache of recently-shutdown WorkerInstanceKeys to the matching engine.
Polls arriving from workers in this cache are rejected immediately with an empty
response, preventing zombie re-polls from stealing tasks.

Made-with: Cursor
rkannan82 force-pushed the kannan/shutdown-worker-poll-rejection branch from e23223e to 097785c, March 17, 2026 17:14
rkannan82 requested a review from dnr March 17, 2026 18:52
rkannan82 marked this pull request as ready for review March 17, 2026 18:52
rkannan82 requested review from a team as code owners March 17, 2026 18:52
The shutdown rejection is independent of pollerID tracking and should
run unconditionally based on workerInstanceKey.

Made-with: Cursor
rkannan82 requested a review from dnr March 19, 2026 04:38
rkannan82 merged commit 7e6be5c into main Mar 23, 2026
46 checks passed
rkannan82 deleted the kannan/shutdown-worker-poll-rejection branch March 23, 2026 17:17

2 participants