Updated NCCL postinstall for pcluster v3.13 and above #53
Open
nghtm wants to merge 1 commit into aws-samples:main
Author
Unable to create a fork on /aws-samples due to a permissions error.
KeitaW (Contributor) suggested changes on Sep 5, 2025 and left a comment:
Shouldn't we maintain a way to install a custom aws-ofi-nccl? Ideally, the script would skip installation when the specified aws-ofi-nccl is already preinstalled, and compile it otherwise.
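A minimal sketch of the skip-or-compile check suggested here, not the script in this PR. The default prefix `/opt/amazon/ofi-nccl`, the variables `OFI_NCCL_PREFIX` and `FORCE_OFI_NCCL_BUILD`, and both function names are assumptions made for illustration. It only tests whether a `libnccl-net` shared library is present. It does not detect which aws-ofi-nccl version is installed, so the override variable is the only way to force a custom build.

```shell
#!/usr/bin/env bash
# Sketch only. OFI_NCCL_PREFIX and FORCE_OFI_NCCL_BUILD are hypothetical names,
# and the default prefix is an assumption about where the EFA installer puts the plugin.
OFI_NCCL_PREFIX="${OFI_NCCL_PREFIX:-/opt/amazon/ofi-nccl}"

# Return 0 if a libnccl-net shared library exists under <prefix>/lib or <prefix>/lib64.
ofi_nccl_preinstalled() {
    local prefix="$1"
    local dir f
    for dir in "$prefix/lib" "$prefix/lib64"; do
        for f in "$dir"/libnccl-net*.so*; do
            # An unmatched glob stays literal, so confirm the file really exists.
            [ -e "$f" ] && return 0
        done
    done
    return 1
}

# Print "build" when a build is forced or no plugin is found, "skip" otherwise.
ofi_nccl_action() {
    local prefix="$1"
    if [ "${FORCE_OFI_NCCL_BUILD:-0}" = "1" ]; then
        echo "build"
    elif ofi_nccl_preinstalled "$prefix"; then
        echo "skip"
    else
        echo "build"
    fi
}

echo "aws-ofi-nccl action for ${OFI_NCCL_PREFIX}: $(ofi_nccl_action "$OFI_NCCL_PREFIX")"
```

The postinstall script could branch on the printed value and run the existing compile steps only when it is `build`, for example with `FORCE_OFI_NCCL_BUILD=1` set when a custom plugin is wanted.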
Issue #, if available:
Description of changes: Remove the OFI-NCCL installation, since the plugin ships precompiled with the EFA installer. Update NCCL to the latest version compatible with the OFI-NCCL installed in the pcluster AMI as of 09/03/2025 (https://github.com/aws/aws-ofi-nccl/releases/tag/v1.14.2).
This reference branch will be used for the AI/ML pcluster workshop, which specifies pcluster 3.13 and does not require installing OFI-NCCL.
https://catalog.workshops.aws/ml-on-aws-parallelcluster/en-US/03-cluster/02-setup-cluster
Confirmed successful NCCL tests both on the AMI and in a container using these post-install scripts.
Test on 2x g5.8xlarge (4x A10 per node) below:
nccl-all-reduce-val.log
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.