@jlamypoirier
Collaborator

✨ Description

Make Fast-LLM work with the gloo distributed backend. Context: I was hoping gloo would fix some random NCCL crashes/timeouts in distributed tests with many processes, which would let us run more tests in parallel (and without needing multiple physical devices). However, the problem still happens with gloo, so I'm keeping the test on NCCL for now.
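For readers unfamiliar with the backend switch, here is a minimal sketch of a gloo process group in plain PyTorch. It is not Fast-LLM's code: `gloo_env` and `run_worker` are hypothetical names, and the address and port are assumed values for a single-machine run.

```python
import os


def gloo_env(rank: int, world_size: int, port: int = 29500) -> dict[str, str]:
    """Environment variables read by PyTorch's default env:// rendezvous."""
    return {
        "MASTER_ADDR": "127.0.0.1",  # assumption: all ranks on one machine
        "MASTER_PORT": str(port),    # assumption: this port is free
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def run_worker(rank: int, world_size: int) -> float:
    """Hypothetical worker: sum every rank's index across ranks on CPU."""
    # Imported lazily so gloo_env stays usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    os.environ.update(gloo_env(rank, world_size))
    dist.init_process_group(backend="gloo")
    try:
        value = torch.tensor([float(rank)])  # CPU tensor, no GPU required
        dist.all_reduce(value)  # default reduction op is SUM
        return value.item()
    finally:
        dist.destroy_process_group()
```

Each rank would call `run_worker` in its own process, for example through `torch.multiprocessing.spawn(run_worker, args=(world_size,), nprocs=world_size)`. Because gloo works on CPU tensors, many ranks can share one machine.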

@jlamypoirier jlamypoirier marked this pull request as ready for review December 15, 2025 22:44
Collaborator

@tscholak tscholak left a comment

LGTM!

Base automatically changed from jlp_dataset_metadata to main December 22, 2025 17:51
@jlamypoirier jlamypoirier merged commit d278945 into main Dec 22, 2025
4 checks passed
@jlamypoirier jlamypoirier deleted the jlp_gloo branch December 22, 2025 19:08

3 participants