
[Update] Always restart clustermgtd on update failure #3104

Closed
gmarciani wants to merge 1 commit into aws:develop from gmarciani:wip/mgiacomo/3150/clustermgtd-restart-on-failure-0128-1

Conversation

@gmarciani
Contributor

Description of changes

Always restart clustermgtd on update failure, regardless of the point of failure.

Before this change, clustermgtd was restarted only if the update recipe failed the cluster readiness check.

This change is meant to reduce the chances of having clustermgtd stopped after a failed update.

UX

clustermgtd gets restarted on update recipe failure (both update and rollback):

Running handlers:
[2026-01-28T17:50:42+00:00] ERROR: Running exception handlers
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Started
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Update failed on HeadNode due to: ruby_block[synthetic failure] (aws-parallelcluster-slurm::update_head_node line 258) had an error: RuntimeError: SYNTHETIC ERROR INJECTED BY MGIACOMO
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Resources that have been successfully executed before the failure:
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - ruby_block[Configure environment variable for recipes context: PATH]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - fetch_dna_files[Fetch ComputeFleet's Dna files]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - fetch_config[Fetch and load cluster configs]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - file[/opt/parallelcluster/shared/update_trigger]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - ruby_block[replace slurm queue nodes]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - execute[generate_pcluster_slurm_configs]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - execute[generate_pcluster_fleet_config]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - execute[stop clustermgtd]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Running recovery commands
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Executing: cleanup DNA files
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Running command (attempt 1/11): /opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/python /opt/parallelcluster/scripts/share_compute_fleet_dna.py --region us-east-1 --cleanup
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Command stdout:
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Command stderr: INFO:__main__:Cleaning up /opt/parallelcluster/shared/dna/LaunchTemplateB1b67670b4d707d5-dna.json
INFO:__main__:Cleaning up /opt/parallelcluster/shared/dna/extra.json
INFO:__main__:All dna.json files have been shared!

[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Successfully executed: cleanup DNA files
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Executing: start clustermgtd
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Running command (attempt 1/11): /opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/supervisorctl start clustermgtd
[2026-01-28T17:50:44+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Command stdout: clustermgtd: started

[2026-01-28T17:50:44+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Command stderr:
[2026-01-28T17:50:44+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Successfully executed: start clustermgtd
[2026-01-28T17:50:44+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Completed successfully
  - ErrorHandlers::UpdateFailureHandler
Running handlers complete
[2026-01-28T17:50:44+00:00] ERROR: Exception handlers complete
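
The recovery flow visible in the log above (run each recovery command, retry on failure, keep going so that clustermgtd is started regardless) can be sketched in plain Ruby. This is a minimal illustration, not the code in this PR: `run_with_retries` and `recover_from_update_failure` are hypothetical names, the retry delay is an assumption, and the 11-attempt default is taken only from the "attempt 1/11" lines in the log.

```ruby
require 'open3'

# Hypothetical helper: runs a shell command and retries while it exits non-zero.
# max_attempts defaults to 11 to mirror the "attempt 1/11" log lines;
# the delay between attempts is an assumption.
def run_with_retries(description, command, max_attempts: 11, delay: 0,
                     logger: ->(msg) { puts msg })
  1.upto(max_attempts) do |attempt|
    logger.call("Running command (attempt #{attempt}/#{max_attempts}): #{command}")
    stdout, stderr, status = Open3.capture3(command)
    logger.call("Command stdout: #{stdout}")
    logger.call("Command stderr: #{stderr}")
    if status.success?
      logger.call("Successfully executed: #{description}")
      return true
    end
    sleep(delay) if attempt < max_attempts
  end
  logger.call("Failed to execute: #{description}")
  false
end

# Hypothetical driver: runs every recovery command in order.
# `map` (rather than a short-circuiting `all?` with a block) ensures a failed
# cleanup does not prevent the later "start clustermgtd" step from running.
def recover_from_update_failure(commands, **opts)
  results = commands.map do |description, command|
    run_with_retries(description, command, **opts)
  end
  results.all?
end

# Example invocation (not executed here), using the commands shown in the log:
# recover_from_update_failure(
#   'cleanup DNA files' => '/opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/python ' \
#                          '/opt/parallelcluster/scripts/share_compute_fleet_dna.py --region us-east-1 --cleanup',
#   'start clustermgtd' => '/opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/supervisorctl start clustermgtd'
# )
```

Running every command independently of earlier failures matches the intent stated in the description: clustermgtd should not be left stopped just because an earlier recovery step failed.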

Tests

  • The risks have been deeply assessed during design.
  • Manually verified that clustermgtd is restarted on any update failure. In particular, I simulated a failure by injecting synthetic exceptions into the update recipe while clustermgtd was stopped. The alarm introduced in [Observability] Alarm on clustermgtd not running aws-parallelcluster#7209 never goes red because the overall downtime of clustermgtd is less than a minute from update start to rollback completion.
  • Unit tests (updated to cover the current changes)
  • [SUCCEEDED] Integ tests: test_update_rollback_failure

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani added the 3.x label Jan 28, 2026
@gmarciani gmarciani marked this pull request as ready for review January 28, 2026 20:48
@gmarciani gmarciani requested review from a team as code owners January 28, 2026 20:48
@gmarciani gmarciani enabled auto-merge (rebase) January 28, 2026 20:50
@gmarciani gmarciani closed this Jan 28, 2026
auto-merge was automatically disabled January 28, 2026 22:39

Pull request was closed
