[ci] Re-enabling rccl tests with 4gpu machine setup #2831

Merged
geomin12 merged 29 commits into main from users/geomin12/4-gpu on Jan 12, 2026

geomin12 (Contributor) commented on Jan 9, 2026

This PR re-enables the RCCL tests on our 4-GPU machine setup.

I tested locally here:

rocrand correctly retrieving the runner: https://github.com/ROCm/TheRock/actions/runs/20836486058/job/59862300962
rccl correctly retrieving the runner: https://github.com/ROCm/TheRock/actions/runs/20836486058/job/59862300974
other archs: https://github.com/ROCm/TheRock/actions/runs/20836486058/job/59862412259

Closes #2264
Progress on #2616

araravik-psd (Contributor) commented:

I verified our changes by re-running the workflow below 10 times; the tests passed consistently on the 4-GPU runner every time.

https://github.com/ROCm/TheRock/actions/runs/20830811494

CI runs are also going through fine. Approving this PR.

@araravik-psd araravik-psd left a comment

Comment thread build_tools/github_actions/test_executable_scripts/test_rccl.py Outdated
@geomin12 geomin12 requested a review from idass1990 January 12, 2026 19:55

@idass1990 idass1990 left a comment

LGTM!

@geomin12 geomin12 merged commit ff46daa into main Jan 12, 2026
47 of 49 checks passed
@geomin12 geomin12 deleted the users/geomin12/4-gpu branch January 12, 2026 22:05
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Jan 12, 2026
@eble-amd eble-amd mentioned this pull request Jan 13, 2026
72 tasks

Development

Successfully merging this pull request may close these issues.

rccl-tests are consistently exceeding the 15m timeout set
