[train][tests] Reduce test_worker_group flake #61536

Open
TimothySeah wants to merge 1 commit into ray-project:master from TimothySeah:tseah/fix-more-train-flaky

@TimothySeah
Contributor

Summary

test_zombie_actor_termination has been failing intermittently with the following error:

[2026-03-04T06:42:34Z] python/ray/train/v2/tests/test_worker_group.py::test_zombie_actor_termination FAILED [ 32%]
...
[2026-03-04T06:42:34Z]         worker_group_context = _default_worker_group_context(
[2026-03-04T06:42:34Z]             train_fn_ref=train_fn_ref,
[2026-03-04T06:42:34Z]             num_workers=NUM_WORKERS,
[2026-03-04T06:42:34Z]         )
[2026-03-04T06:42:34Z]
[2026-03-04T06:42:34Z]         # Starts the worker group and runs the train function
[2026-03-04T06:42:34Z] >       worker_group = WorkerGroup.create(
[2026-03-04T06:42:34Z]             train_run_context=train_run_context,
[2026-03-04T06:42:34Z]             worker_group_context=worker_group_context,
[2026-03-04T06:42:34Z]             callbacks=[],
[2026-03-04T06:42:34Z]         )
...
[2026-03-04T06:42:34Z]             if not pg_handle.wait(self._worker_group_start_timeout_s):
[2026-03-04T06:42:34Z]                 pg_handle.shutdown()
[2026-03-04T06:42:34Z] >               raise WorkerGroupStartupTimeoutError(
[2026-03-04T06:42:34Z]                     num_workers=worker_group_context.num_workers
[2026-03-04T06:42:34Z]                 )
[2026-03-04T06:42:34Z] E               ray.train.v2._internal.exceptions.WorkerGroupStartupTimeoutError: The worker group startup timed out after 60.0 seconds waiting for 4 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.

A likely cause is that the mock_runtime_context fixture unintentionally reserves Ray cluster resources, leaving too few for the worker group to start before the timeout.
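To illustrate the suspected resource accounting, here is a minimal sketch in plain Python. It is a toy model, not Ray code: `ToyCluster`, `can_start_worker_group`, and the 4-CPU cluster size are hypothetical names and values chosen for illustration, and the real failure shows up as a startup timeout rather than an immediate rejection. It only shows why a fixture actor that holds one CPU could starve a 4-worker group, and why a zero-CPU actor would not.

```python
from dataclasses import dataclass


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a cluster's CPU bookkeeping (not a Ray API)."""

    total_cpus: float
    used_cpus: float = 0.0

    def reserve(self, num_cpus: float) -> bool:
        """Reserve CPUs if they fit; return False when the request cannot fit."""
        if self.used_cpus + num_cpus > self.total_cpus:
            return False
        self.used_cpus += num_cpus
        return True


def can_start_worker_group(
    cluster_cpus: float,
    dummy_actor_cpus: float,
    num_workers: int,
    cpus_per_worker: float = 1.0,
) -> bool:
    """Return True if every worker fits after the fixture's dummy actor is placed."""
    cluster = ToyCluster(total_cpus=cluster_cpus)
    # The fixture's dummy actor is placed first and keeps its CPUs for the test.
    if not cluster.reserve(dummy_actor_cpus):
        return False
    # The worker group is all-or-nothing: every worker must get its CPU.
    return all(cluster.reserve(cpus_per_worker) for _ in range(num_workers))


NUM_WORKERS = 4            # matches "waiting for 4 workers" in the log above
ASSUMED_CLUSTER_CPUS = 4   # assumption: the test cluster size is not stated in this PR

# Dummy actor holding 1 CPU: only 3 CPUs remain for 4 workers.
before_fix = can_start_worker_group(ASSUMED_CLUSTER_CPUS, 1, NUM_WORKERS)
# Dummy actor holding 0 CPUs: all 4 CPUs remain for the workers.
after_fix = can_start_worker_group(ASSUMED_CLUSTER_CPUS, 0, NUM_WORKERS)
print(before_fix, after_fix)
```

The model is deterministic, whereas the CI failure is flaky, so it should be read as an explanation of the worst case rather than a reproduction.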

Testing

I was unable to reproduce the issue locally, but this change should be safe.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner March 6, 2026 00:55
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a flaky test, test_zombie_actor_termination, which was failing due to a WorkerGroupStartupTimeoutError. The root cause was identified as the mock_runtime_context fixture's DummyActor consuming a CPU resource by default, leaving insufficient resources for the test's worker group. The fix correctly specifies num_cpus=0 for this dummy actor, ensuring it doesn't consume cluster resources. This change is logical, safe, and should resolve the flakiness.

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Mar 6, 2026
Labels

train Ray Train Related Issue