[Update] Always restart clustermgtd on update failure by gmarciani · Pull Request #3104 · aws/aws-parallelcluster-cookbook

gmarciani · 2026-01-28T20:48:30Z

Description of changes

Always restart clustermgtd on update failure, regardless of the point of failure.

Before this change, clustermgtd was restarted only if the update recipe failed the cluster readiness check.

This change is meant to reduce the chances of having clustermgtd stopped after a failed update.
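The behavioral difference described above can be pictured with a minimal sketch. It is not the cookbook's code (the actual handler is a Chef exception handler written in Ruby); `run_update`, `steps`, and `recover` are hypothetical names used only to show recovery being tied to any failure rather than to one specific step.

```python
from typing import Callable, List


def run_update(steps: List[Callable[[], None]], recover: Callable[[], None]) -> None:
    """Run update steps in order; on any failure run recovery, then re-raise.

    Hypothetical illustration: `recover` stands for "start clustermgtd".
    """
    try:
        for step in steps:
            step()
    except Exception:
        # Recovery is not conditioned on which step failed (for example the
        # cluster readiness check): it runs wherever the failure happened.
        recover()
        raise
```

In this sketch a failure in any step triggers `recover()` before the error propagates, which mirrors the intent of restarting clustermgtd regardless of the point of failure.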

UX

clustermgtd gets restarted on update recipe failure (both update and rollback):

Running handlers:
[2026-01-28T17:50:42+00:00] ERROR: Running exception handlers
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Started
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Update failed on HeadNode due to: ruby_block[synthetic failure] (aws-parallelcluster-slurm::update_head_node line 258) had an error: RuntimeError: SYNTHETIC ERROR INJECTED BY MGIACOMO
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Resources that have been successfully executed before the failure:
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - ruby_block[Configure environment variable for recipes context: PATH]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - fetch_dna_files[Fetch ComputeFleet's Dna files]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - fetch_config[Fetch and load cluster configs]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - file[/opt/parallelcluster/shared/update_trigger]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - ruby_block[replace slurm queue nodes]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - execute[generate_pcluster_slurm_configs]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - execute[generate_pcluster_fleet_config]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - execute[stop clustermgtd]
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Running recovery commands
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Executing: cleanup DNA files
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Running command (attempt 1/11): /opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/python /opt/parallelcluster/scripts/share_compute_fleet_dna.py --region us-east-1 --cleanup
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Command stdout:
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Command stderr: INFO:__main__:Cleaning up /opt/parallelcluster/shared/dna/LaunchTemplateB1b67670b4d707d5-dna.json
INFO:__main__:Cleaning up /opt/parallelcluster/shared/dna/extra.json
INFO:__main__:All dna.json files have been shared!

[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Successfully executed: cleanup DNA files
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Executing: start clustermgtd
[2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Running command (attempt 1/11): /opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/supervisorctl start clustermgtd
[2026-01-28T17:50:44+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Command stdout: clustermgtd: started

[2026-01-28T17:50:44+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Command stderr:
[2026-01-28T17:50:44+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Successfully executed: start clustermgtd
[2026-01-28T17:50:44+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Completed successfully
  - ErrorHandlers::UpdateFailureHandler
Running handlers complete
[2026-01-28T17:50:44+00:00] ERROR: Exception handlers complete
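The log above shows each recovery command being run with an attempt counter ("attempt 1/11"). A rough sketch of that retry pattern follows. It is an illustration, not the handler's implementation: the function names, the delay between attempts, and the choice to continue with the next command after one fails are all assumptions; only the 11-attempt limit and the log message wording are taken from the output above.

```python
import subprocess
import sys
import time
from typing import Callable, List, Sequence, Tuple


def run_with_retries(
    command: List[str],
    max_attempts: int = 11,          # matches "attempt 1/11" in the log
    delay_seconds: float = 0.0,      # assumption: the real delay is not shown
    log: Callable[[str], None] = print,
) -> bool:
    """Run `command` until it exits with 0 or attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        log(f"Running command (attempt {attempt}/{max_attempts}): {' '.join(command)}")
        result = subprocess.run(command, capture_output=True, text=True)
        log(f"Command stdout: {result.stdout}")
        log(f"Command stderr: {result.stderr}")
        if result.returncode == 0:
            return True
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return False


def run_recovery(
    commands: Sequence[Tuple[str, List[str]]],
    max_attempts: int = 11,
    delay_seconds: float = 0.0,
    log: Callable[[str], None] = print,
) -> bool:
    """Run every (description, command) pair; return True only if all succeed."""
    all_ok = True
    for description, command in commands:
        log(f"Executing: {description}")
        if run_with_retries(command, max_attempts, delay_seconds, log):
            log(f"Successfully executed: {description}")
        else:
            # Assumption: keep going so a later command still gets its chance.
            all_ok = False
    return all_ok


if __name__ == "__main__":
    # Harmless stand-in command; the real handler runs the commands shown in the log.
    run_recovery([("no-op example", [sys.executable, "-c", "pass"])])
```

With this shape, the two recovery steps from the log ("cleanup DNA files" and "start clustermgtd") would be passed as two `(description, command)` pairs.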

Tests

The risks have been deeply assessed during design.
Manually verified that clustermgtd is always restarted on any update failure. In particular, I simulated failures by injecting synthetic exceptions into the update recipe while clustermgtd was stopped. The alarm introduced in [Observability] Alarm on clustermgtd not running aws-parallelcluster#7209 never goes red because the overall downtime of clustermgtd is less than a minute from update start to rollback completion.
Unit tests (updated to cover the current changes)
[SUCCEEDED] Integ tests: test_update_rollback_failure
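As an aid to the manual verification, the handler output can be checked mechanically. The sketch below is a hypothetical helper, not part of the PR: it assumes the log line format shown in the UX section and looks for the "Successfully executed: start clustermgtd" message. It also reports how long the handler ran, which is only the handler's own duration and not the overall clustermgtd downtime.

```python
import re
from datetime import datetime
from typing import List, Optional, Tuple

HANDLER = "ErrorHandlers::UpdateFailureHandler"
# Matches lines such as:
# [2026-01-28T17:50:42+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Started
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]+): " + re.escape(HANDLER) + r": (?P<msg>.*)$"
)


def handler_events(log_text: str) -> List[Tuple[datetime, str]]:
    """Return (timestamp, message) pairs emitted by the failure handler."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            events.append((datetime.fromisoformat(match.group("ts")), match.group("msg")))
    return events


def clustermgtd_restarted(log_text: str) -> bool:
    """True if the handler reported a successful clustermgtd start."""
    return any(
        msg == "Successfully executed: start clustermgtd"
        for _, msg in handler_events(log_text)
    )


def handler_duration_seconds(log_text: str) -> Optional[float]:
    """Seconds between the handler's 'Started' and 'Completed successfully' lines."""
    events = handler_events(log_text)
    start = next((ts for ts, msg in events if msg == "Started"), None)
    end = next((ts for ts, msg in events if msg == "Completed successfully"), None)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()
```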

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

gmarciani added the 3.x label Jan 28, 2026

gmarciani marked this pull request as ready for review January 28, 2026 20:48

gmarciani requested review from a team as code owners January 28, 2026 20:48

gmarciani enabled auto-merge (rebase) January 28, 2026 20:50

gmarciani closed this Jan 28, 2026

auto-merge was automatically disabled January 28, 2026 22:39
Pull request was closed

gmarciani wants to merge 1 commit into aws:develop from gmarciani:wip/mgiacomo/3150/clustermgtd-restart-on-failure-0128-1