[train][tests] Reduce test_worker_group flake by TimothySeah · Pull Request #61536 · ray-project/ray

TimothySeah · 2026-03-06T00:55:02Z

Summary

test_zombie_actor_termination has been flaky, failing intermittently with the following error:

[2026-03-04T06:42:34Z] python/ray/train/v2/tests/test_worker_group.py::test_zombie_actor_termination FAILED [ 32%]
...
[2026-03-04T06:42:34Z]         worker_group_context = _default_worker_group_context(
[2026-03-04T06:42:34Z]             train_fn_ref=train_fn_ref,
[2026-03-04T06:42:34Z]             num_workers=NUM_WORKERS,
[2026-03-04T06:42:34Z]         )
[2026-03-04T06:42:34Z]
[2026-03-04T06:42:34Z]         # Starts the worker group and runs the train function
[2026-03-04T06:42:34Z] >       worker_group = WorkerGroup.create(
[2026-03-04T06:42:34Z]             train_run_context=train_run_context,
[2026-03-04T06:42:34Z]             worker_group_context=worker_group_context,
[2026-03-04T06:42:34Z]             callbacks=[],
[2026-03-04T06:42:34Z]         )
...
[2026-03-04T06:42:34Z]             if not pg_handle.wait(self._worker_group_start_timeout_s):
[2026-03-04T06:42:34Z]                 pg_handle.shutdown()
[2026-03-04T06:42:34Z] >               raise WorkerGroupStartupTimeoutError(
[2026-03-04T06:42:34Z]                     num_workers=worker_group_context.num_workers
[2026-03-04T06:42:34Z]                 )
[2026-03-04T06:42:34Z] E               ray.train.v2._internal.exceptions.WorkerGroupStartupTimeoutError: The worker group startup timed out after 60.0 seconds waiting for 4 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.

This could be due to the mock_runtime_context fixture unintentionally taking up Ray cluster resources.

Testing

I was unable to reproduce the issue locally, but this change should be safe.

Signed-off-by: Timothy Seah <tseah@anyscale.com>

gemini-code-assist

Code Review

This pull request addresses a flaky test, test_zombie_actor_termination, which was failing due to a WorkerGroupStartupTimeoutError. The root cause was identified as the mock_runtime_context fixture's DummyActor consuming a CPU resource by default, leaving insufficient resources for the test's worker group. The fix correctly specifies num_cpus=0 for this dummy actor, ensuring it doesn't consume cluster resources. This change is logical, safe, and should resolve the flakiness.
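
To make the resource argument above concrete, here is a minimal pure-Python sketch of the accounting it describes. It is a toy ledger, not Ray's scheduler: `ToyCluster`, `can_start_worker_group`, the 4-CPU cluster size, and the one-CPU-per-worker request are all hypothetical choices made for illustration. Only the worker count of 4 comes from the log above. It also assumes, as the review states, that the dummy actor holds one CPU unless `num_cpus=0` is set.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy CPU ledger for illustration only; this is not Ray's scheduler."""

    total_cpus: float
    reserved: dict = field(default_factory=dict)

    @property
    def available_cpus(self) -> float:
        return self.total_cpus - sum(self.reserved.values())

    def reserve(self, name: str, num_cpus: float) -> bool:
        """Reserve CPUs for `name`; return False if the request does not fit."""
        if num_cpus > self.available_cpus:
            return False
        self.reserved[name] = num_cpus
        return True


def can_start_worker_group(
    dummy_actor_cpus: float,
    num_workers: int = 4,      # matches "waiting for 4 workers" in the log
    cluster_cpus: float = 4,   # assumption: the test cluster has 4 CPUs
) -> bool:
    """Return True if every worker fits after the fixture's actor is placed."""
    cluster = ToyCluster(total_cpus=cluster_cpus)
    # The fixture's dummy actor is created before the worker group.
    cluster.reserve("dummy_actor", dummy_actor_cpus)
    # Assumption: each worker requests exactly one CPU.
    return all(cluster.reserve(f"worker_{i}", 1) for i in range(num_workers))


if __name__ == "__main__":
    print(can_start_worker_group(dummy_actor_cpus=1))  # False: only 3 CPUs remain
    print(can_start_worker_group(dummy_actor_cpus=0))  # True: all 4 CPUs remain
```

Under these assumptions, a dummy actor that reserves one CPU leaves room for only three of the four workers, which in the real test would surface as a startup timeout rather than an immediate failure. With zero CPUs reserved, all four workers fit.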

[train][tests] Reduce test_worker_group flake

79fbb9f

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested a review from a team as a code owner March 6, 2026 00:55

gemini-code-assist bot reviewed Mar 6, 2026


ray-gardener bot added the "train" (Ray Train Related Issue) label Mar 6, 2026

TimothySeah wants to merge 1 commit into ray-project:master from TimothySeah:tseah/fix-more-train-flaky
