[Develop][cookbook] Cherry-pick changes from release-3.14 branch to develop branch#3081
Open
hehe7318 wants to merge 12 commits into aws:develop from
…abled` to disable in-place updates on compute and login nodes by disabling cfn-hup on those nodes. As a consequence, it also disables the cluster readiness checks executed by the head node on cluster update. Disabling cfn-hup mitigates a significant performance degradation that may occur with tightly coupled workloads at scale.
…ion of NVIDIA driver, if the module is available on the kernel. Starting with kernel `5.14.0-611`, some DRM symbols required by the NVIDIA driver are exported by new client modules.
…mmon rather than sssd.
This reverts commit bef143f.
) * Fix DCV on Ubuntu 22.04+ on DLAMI by disabling Wayland
Disable the Wayland protocol in GDM3 for Ubuntu 22.04+ to force the use of Xorg on GPU instances running without a display.
Ubuntu 22.04+ defaults to Wayland, which causes GDM startup issues with NVIDIA drivers and NICE DCV.
Force Xorg by setting `WaylandEnable=false` in `/etc/gdm3/custom.conf`.
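The `WaylandEnable=false` change described above can be sketched as a small text transformation. This is a minimal illustration, not the cookbook's actual recipe. It assumes the key belongs in the `[daemon]` section of `custom.conf` (the stock GDM layout), and the function name `disable_wayland` is hypothetical.

```python
def disable_wayland(conf_text: str) -> str:
    """Return custom.conf content with WaylandEnable=false under [daemon].

    Assumption: the key lives in the [daemon] section, as in stock GDM configs.
    Existing WaylandEnable lines in that section (commented or not) are replaced.
    """
    out = []
    in_daemon = False    # currently inside the [daemon] section
    seen_daemon = False  # a [daemon] section exists somewhere in the file
    written = False      # WaylandEnable=false has already been emitted
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving [daemon] without having written the key: add it first.
            if in_daemon and not written:
                out.append("WaylandEnable=false")
                written = True
            in_daemon = stripped == "[daemon]"
            seen_daemon = seen_daemon or in_daemon
            out.append(line)
            continue
        if in_daemon and stripped.lstrip("#").strip().startswith("WaylandEnable"):
            # Replace the existing (possibly commented-out) setting exactly once.
            if not written:
                out.append("WaylandEnable=false")
                written = True
            continue
        out.append(line)
    if in_daemon and not written:
        # File ended while still inside [daemon].
        out.append("WaylandEnable=false")
    if not seen_daemon:
        # No [daemon] section at all: append one.
        out.extend(["[daemon]", "WaylandEnable=false"])
    return "\n".join(out) + "\n"
```

Applying the function to its own output leaves the content unchanged, which suits a configuration step that may be run more than once.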
…eck (aws#3062)
* Do not consider missing records as a cluster readiness check failure
* Update CHANGELOG
* Add note that missing records don't cause failure
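The rule in the first bullet, that a missing record is not a readiness check failure, can be sketched as follows. The data layout is an assumption made for illustration (a mapping from node id to the config version that node reported). The source does not say how records are stored, and `readiness_check_failures` is a hypothetical name.

```python
from typing import Dict, List


def readiness_check_failures(
    expected_version: str,
    node_ids: List[str],
    records: Dict[str, str],
) -> List[str]:
    """Return the node ids whose record reports a wrong config version.

    Hypothetical layout: `records` maps node id -> reported config version.
    Nodes with no record are skipped instead of being counted as failures.
    """
    failures = []
    for node_id in node_ids:
        reported = records.get(node_id)
        if reported is None:
            # Missing record: not treated as a failure.
            continue
        if reported != expected_version:
            failures.append(node_id)
    return failures
```

For example, with nodes `a`, `b`, `c`, where `a` reports the expected version, `b` reports an older one, and `c` has no record, only `b` is returned.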
…oup (installed for DCV) now pulls in ImageMagick which requires this package
and fix a race condition that made compute nodes deploy the wrong cluster config version on update failure.
Ensure clustermgtd is running after an update completes, regardless of
whether the update succeeded or failed.
On success, restart clustermgtd unconditionally at the end of the update recipe,
regardless of whether the update includes queue changes.
On failure on the head node, execute recovery actions:
- Clean up DNA files shared with compute nodes to prevent them from
deploying a config version that is about to be rolled back
- Restart clustermgtd if scontrol reconfigure succeeded, ensuring
cluster management resumes after update/rollback failures
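The success and failure paths described in this commit message can be summarized as a small decision function for the head node. This is only a sketch of the control flow as worded above. The action names are hypothetical labels, not the cookbook's actual resources or commands.

```python
from typing import List


def post_update_actions(update_succeeded: bool, reconfigure_succeeded: bool) -> List[str]:
    """List the head node's follow-up actions after a cluster update attempt.

    The returned strings are illustrative labels only (assumption).
    """
    if update_succeeded:
        # Success: restart clustermgtd unconditionally, queue changes or not.
        return ["restart_clustermgtd"]
    # Failure: first remove the DNA files shared with compute nodes, so they
    # do not deploy a config version that is about to be rolled back.
    actions = ["cleanup_shared_dna_files"]
    if reconfigure_succeeded:
        # Restart clustermgtd only if `scontrol reconfigure` went through.
        actions.append("restart_clustermgtd")
    return actions
```

In this sketch, the only case where clustermgtd is not restarted is a failed update in which `scontrol reconfigure` also failed.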
Description of changes
Checklist
…`develop` add the branch name as prefix in the PR title (e.g. `[release-3.6]`).
Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.