Skip to content

[train][tests] Abort cancels validation tasks and has deterministic behavior#61510

Open
TimothySeah wants to merge 2 commits intoray-project:masterfrom
TimothySeah:tseah/fix-ray-train-flaky-tests
Open

[train][tests] Abort cancels validation tasks and has deterministic behavior#61510
TimothySeah wants to merge 2 commits intoray-project:masterfrom
TimothySeah:tseah/fix-ray-train-flaky-tests

Conversation

@TimothySeah
Copy link
Contributor

@TimothySeah TimothySeah commented Mar 5, 2026

Summary

We observed the following flaky behavior in test_report_validation_fn_resumption

  [2026-03-04T02:25:25Z] (TrainController pid=6108) Launched async validation task for checkpoint      
  Checkpoint(filesystem=local, path=/tmp/pytest-of-root/pytest-0/test_report_validation_fn_resu0/valid 
  ation_fn_resumption/checkpoint_2026-03-04_02-21-22.230987)                                           
  [2026-03-04T02:25:25Z] (run_trainer pid=6028) Received SIGINT. Gracefully aborting the training run  
  — this may take a few seconds. To forcefully abort immediately, you can send a different signal,     
  such as SIGKILL.                                                                                     
  [2026-03-04T02:25:25Z] (TrainController pid=6108) Finished async validation task(s) for              
  checkpoint(s): [Checkpoint(filesystem=local, path=/tmp/pytest-of-root/pytest-0/test_report_validatio 
  n_fn_resu0/validation_fn_resumption/checkpoint_2026-03-04_02-21-22.230987)].                         
 ...                                
  [2026-03-04T02:25:25Z] (TrainController pid=6108) ray.exceptions.TaskCancelledError: Task:           
  TaskID(4e6f2dd62797ace7ffffffffffffffffffffffff01000000) was cancelled.                              
  [2026-03-04T02:25:25Z] (TrainController pid=6362) A run snapshot was found in storage folder at:     
  '/tmp/pytest-of-root/pytest-0/test_report_validation_fn_resu0/validation_fn_resumption'              
  [2026-03-04T02:25:25Z] (TrainController pid=6362) This snapshot contains a list of checkpoints       
  reported via `ray.train.report` and will be loaded. This allows the latest checkpoint found in the   
  snapshot to be accessible within your training function via `ray.train.get_checkpoint`.              

Essentially what happened was

  1. The unit test called ray.cancel
  2. This sent a SIGINT to the trainer and recursively cancelled the validation task
  3. SIGINT made Ray Train go through the abortion code, which includes transitioning to the ABORTED state
  4. after_controller_state_update sometimes (I think there's a race between when after_controller_state_update is called and when the validation task is cancelled) saw the cancelled validation task and updated the metrics accordingly.
  5. The second train run had no validations to resume.

This PR avoids this race by:

  1. Changing after_controller_state_update to do nothing if the state is terminal. We will process validation tasks from FINISHED and ERRORED train runs in before_controller_shutdown and cancel validation tasks in ABORTED train runs in before_controller_abort.
  2. Changing test_report_validation_fn_resumption to SIGINT a process (the same pattern as test_sigint_abort) rather than ray.cancel a task (the old pattern) to deterministically test the graceful abortion path.

Testing

Unit tests.

…ehavior

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner March 5, 2026 03:31
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix a race condition during validation task cancellation upon run abortion. The changes introduce a before_controller_abort hook to explicitly cancel pending validation tasks and make the corresponding test more deterministic by using SIGINT instead of ray.cancel. The overall approach is sound, but I've found a critical issue where the new before_controller_abort hook is not actually called, which means the fix for cancelling validation tasks is incomplete. I've provided a detailed comment with a suggested fix.

Note: Security Review did not run due to the size of the PR.

@ray-gardener ray-gardener bot added train Ray Train Related Issue stability labels Mar 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stability train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant