Skip to content

fix: replace only the dead activity worker instead of restarting the whole pool#770

Merged
rustatian merged 5 commits into
masterfrom
fix/2335-activity-pool-single-worker-replace
Jun 5, 2026
Merged

fix: replace only the dead activity worker instead of restarting the whole pool#770
rustatian merged 5 commits into
masterfrom
fix/2335-activity-pool-single-worker-replace

Conversation

@rustatian

Copy link
Copy Markdown
Collaborator

When a single activity worker dies, the pool's worker-watcher already re-allocates a replacement before emitting EventWorkerStopped. The plugin then reset the entire activity pool (ResetAPactP.Reset), which was redundant and tore down the healthy survivors (dropping their in-flight activities). Activity tasks are stateless and retried by the Temporal server, so the single-worker replacement is enough — only the stateful workflow worker still needs a full reset.

  • activity-worker death now logs and lets the pool's watcher replacement stand (no whole-pool restart)
  • remove the now-unused ResetAP
  • add Test_SingleActivityWorker_Replaced_NoPoolRestart (kills one activity worker mid-workflow; asserts the survivors keep their PIDs, only the dead worker is replaced, and the workflow completes)

closes roadrunner-server/roadrunner#2335

…whole pool

When a single activity worker dies, the pool's worker-watcher already re-allocates a replacement before emitting EventWorkerStopped, so resetting the entire activity pool (ResetAP -> actP.Reset) was redundant and disruptive to the surviving workers. Activity tasks are stateless and retried by the Temporal server, so the single-worker replacement is sufficient; only the stateful workflow worker still needs a full reset.

- activity-worker death: log only, let the pool's watcher replacement stand
- remove the now-unused ResetAP
- add Test_SingleActivityWorker_Replaced_NoPoolRestart
Copilot AI review requested due to automatic review settings June 4, 2026 16:05

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts the Temporal RoadRunner plugin’s worker-stop handling so that a single activity-worker failure no longer triggers a full activity pool restart (which can interrupt healthy workers and in-flight activities). It relies on the pool’s worker-watcher to replace only the dead activity worker, while preserving the existing full-reset behavior for workflow-worker failures.

Changes:

  • Stop resetting the entire activity pool on activity-worker death; only reset on workflow-worker death.
  • Remove the now-unused activity-pool-only reset method.
  • Add an integration test to ensure one killed activity worker is replaced without restarting the whole pool; bump github.com/prometheus/common to v0.68.1.

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
plugin.go Updates worker-stopped event handling to avoid whole activity pool resets on activity-worker death.
tests/general/disaster_test.go Adds regression test validating single activity-worker replacement without pool restart.
go.mod Bumps github.com/prometheus/common to v0.68.1 (indirect).
go.sum Updates module checksums (including Prometheus common + related entries).
tests/go.mod Bumps github.com/prometheus/common to v0.68.1 (indirect) in test module.
tests/go.sum Updates test module checksums for Prometheus common.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread plugin.go
Comment thread tests/general/disaster_test.go Outdated
rustatian added 3 commits June 4, 2026 18:22
The fixed 2s wait in Test_SingleActivityWorker_Replaced_NoPoolRestart could be too short on slow CI. Replace it with a require.Eventually poll, backed by a non-asserting listWorkers helper (getWorkers now delegates to it).
…ker count

The require.Eventually poll regressed the general CI job: repeatedly fetching the worker list (without closing the RPC connection) and observing the replacement before it was fully ready stalled pool teardown at Stop(), tripping the 60s deadline. A fixed wait is sufficient and safer here - the workflow completion already gates on real recovery.
ResetAP was removed; an activity-worker death no longer resets the pool.
@rustatian rustatian self-assigned this Jun 4, 2026
…wn fix

The test reliably exposed a pre-existing pool teardown hang (an idle pool that lost a worker hung Destroy for ~60s), fixed separately in roadrunner-server/pool#48. Re-add the (reliable, no-workflow) version once that fix is released and bumped here.
@rustatian rustatian merged commit cb0a375 into master Jun 5, 2026
12 checks passed
@rustatian rustatian deleted the fix/2335-activity-pool-single-worker-replace branch June 5, 2026 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[🐛 BUG]: SIGSEGV in PHP worker causes full activity pool restart instead of single worker replacement

3 participants