fix: replace only the dead activity worker instead of restarting the whole pool#770
Merged
Merged
Conversation
…whole pool When a single activity worker dies, the pool's worker-watcher already re-allocates a replacement before emitting EventWorkerStopped, so resetting the entire activity pool (ResetAP -> actP.Reset) was redundant and disruptive to the surviving workers. Activity tasks are stateless and retried by the Temporal server, so the single-worker replacement is sufficient; only the stateful workflow worker still needs a full reset. - activity-worker death: log only, let the pool's watcher replacement stand - remove the now-unused ResetAP - add Test_SingleActivityWorker_Replaced_NoPoolRestart
There was a problem hiding this comment.
Pull request overview
This PR adjusts the Temporal RoadRunner plugin’s worker-stop handling so that a single activity-worker failure no longer triggers a full activity pool restart (which can interrupt healthy workers and in-flight activities). It relies on the pool’s worker-watcher to replace only the dead activity worker, while preserving the existing full-reset behavior for workflow-worker failures.
Changes:
- Stop resetting the entire activity pool on activity-worker death; only reset on workflow-worker death.
- Remove the now-unused activity-pool-only reset method.
- Add an integration test to ensure one killed activity worker is replaced without restarting the whole pool; bump
github.com/prometheus/commontov0.68.1.
Reviewed changes
Copilot reviewed 4 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
plugin.go |
Updates worker-stopped event handling to avoid whole activity pool resets on activity-worker death. |
tests/general/disaster_test.go |
Adds regression test validating single activity-worker replacement without pool restart. |
go.mod |
Bumps github.com/prometheus/common to v0.68.1 (indirect). |
go.sum |
Updates module checksums (including Prometheus common + related entries). |
tests/go.mod |
Bumps github.com/prometheus/common to v0.68.1 (indirect) in test module. |
tests/go.sum |
Updates test module checksums for Prometheus common. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The fixed 2s wait in Test_SingleActivityWorker_Replaced_NoPoolRestart could be too short on slow CI. Replace it with a require.Eventually poll, backed by a non-asserting listWorkers helper (getWorkers now delegates to it).
…ker count The require.Eventually poll regressed the general CI job: repeatedly fetching the worker list (without closing the RPC connection) and observing the replacement before it was fully ready stalled pool teardown at Stop(), tripping the 60s deadline. A fixed wait is sufficient and safer here - the workflow completion already gates on real recovery.
ResetAP was removed; an activity-worker death no longer resets the pool.
…wn fix The test reliably exposed a pre-existing pool teardown hang (an idle pool that lost a worker hung Destroy for ~60s), fixed separately in roadrunner-server/pool#48. Re-add the (reliable, no-workflow) version once that fix is released and bumped here.
wolfy-j
approved these changes
Jun 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When a single activity worker dies, the pool's worker-watcher already re-allocates a replacement before emitting
EventWorkerStopped. The plugin then reset the entire activity pool (ResetAP→actP.Reset), which was redundant and tore down the healthy survivors (dropping their in-flight activities). Activity tasks are stateless and retried by the Temporal server, so the single-worker replacement is enough — only the stateful workflow worker still needs a full reset.ResetAPTest_SingleActivityWorker_Replaced_NoPoolRestart(kills one activity worker mid-workflow; asserts the survivors keep their PIDs, only the dead worker is replaced, and the workflow completes)closes roadrunner-server/roadrunner#2335