fix: replace only the dead activity worker instead of restarting the whole pool by rustatian · Pull Request #770 · temporalio/roadrunner-temporal

rustatian · 2026-06-04T16:05:01Z

When a single activity worker dies, the pool's worker-watcher already re-allocates a replacement before emitting EventWorkerStopped. The plugin then reset the entire activity pool (ResetAP → actP.Reset), which was redundant and tore down the healthy survivors (dropping their in-flight activities). Activity tasks are stateless and retried by the Temporal server, so the single-worker replacement is enough — only the stateful workflow worker still needs a full reset.

activity-worker death now logs and lets the pool's watcher replacement stand (no whole-pool restart)
remove the now-unused ResetAP
add Test_SingleActivityWorker_Replaced_NoPoolRestart (kills one activity worker mid-workflow; asserts the survivors keep their PIDs, only the dead worker is replaced, and the workflow completes)

closes roadrunner-server/roadrunner#2335

…whole pool When a single activity worker dies, the pool's worker-watcher already re-allocates a replacement before emitting EventWorkerStopped, so resetting the entire activity pool (ResetAP -> actP.Reset) was redundant and disruptive to the surviving workers. Activity tasks are stateless and retried by the Temporal server, so the single-worker replacement is sufficient; only the stateful workflow worker still needs a full reset. - activity-worker death: log only, let the pool's watcher replacement stand - remove the now-unused ResetAP - add Test_SingleActivityWorker_Replaced_NoPoolRestart

Copilot

Pull request overview

This PR adjusts the Temporal RoadRunner plugin’s worker-stop handling so that a single activity-worker failure no longer triggers a full activity pool restart (which can interrupt healthy workers and in-flight activities). It relies on the pool’s worker-watcher to replace only the dead activity worker, while preserving the existing full-reset behavior for workflow-worker failures.

Changes:

Stop resetting the entire activity pool on activity-worker death; only reset on workflow-worker death.
Remove the now-unused activity-pool-only reset method.
Add an integration test to ensure one killed activity worker is replaced without restarting the whole pool; bump github.com/prometheus/common to v0.68.1.

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`plugin.go`	Updates worker-stopped event handling to avoid whole activity pool resets on activity-worker death.
`tests/general/disaster_test.go`	Adds regression test validating single activity-worker replacement without pool restart.
`go.mod`	Bumps `github.com/prometheus/common` to `v0.68.1` (indirect).
`go.sum`	Updates module checksums (including Prometheus common + related entries).
`tests/go.mod`	Bumps `github.com/prometheus/common` to `v0.68.1` (indirect) in test module.
`tests/go.sum`	Updates test module checksums for Prometheus common.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The fixed 2s wait in Test_SingleActivityWorker_Replaced_NoPoolRestart could be too short on slow CI. Replace it with a require.Eventually poll, backed by a non-asserting listWorkers helper (getWorkers now delegates to it).

…ker count The require.Eventually poll regressed the general CI job: repeatedly fetching the worker list (without closing the RPC connection) and observing the replacement before it was fully ready stalled pool teardown at Stop(), tripping the 60s deadline. A fixed wait is sufficient and safer here - the workflow completion already gates on real recovery.

ResetAP was removed; an activity-worker death no longer resets the pool.

…wn fix The test reliably exposed a pre-existing pool teardown hang (an idle pool that lost a worker hung Destroy for ~60s), fixed separately in roadrunner-server/pool#48. Re-add the (reliable, no-workflow) version once that fix is released and bumped here.

Copilot AI review requested due to automatic review settings June 4, 2026 16:05

Copilot started reviewing on behalf of rustatian June 4, 2026 16:05 View session

Copilot AI reviewed Jun 4, 2026

View reviewed changes

Comment thread plugin.go

Comment thread tests/general/disaster_test.go Outdated

rustatian added 3 commits June 4, 2026 18:22

test: poll for the replacement worker instead of a fixed sleep

e210e1d

The fixed 2s wait in Test_SingleActivityWorker_Replaced_NoPoolRestart could be too short on slow CI. Replace it with a require.Eventually poll, backed by a non-asserting listWorkers helper (getWorkers now delegates to it).

docs: drop stale ResetAP references from the plugin guide

09f7448

ResetAP was removed; an activity-worker death no longer resets the pool.

rustatian self-assigned this Jun 4, 2026

wolfy-j approved these changes Jun 5, 2026

View reviewed changes

rustatian merged commit cb0a375 into master Jun 5, 2026
12 checks passed

rustatian deleted the fix/2335-activity-pool-single-worker-replace branch June 5, 2026 13:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: replace only the dead activity worker instead of restarting the whole pool#770

fix: replace only the dead activity worker instead of restarting the whole pool#770
rustatian merged 5 commits into
masterfrom
fix/2335-activity-pool-single-worker-replace

rustatian commented Jun 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rustatian commented Jun 4, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants