Fix allgather op_id reuse better #918

Open
wence- wants to merge 1 commit into rapidsai:main from wence-:wence/fix/allgather-again

wence- commented Mar 17, 2026

We can't fix this by splitting metadata tags into finish and normal metadata, because in that scheme we can still accidentally post metadata receives that get matched by sends from the next collective.

Instead, change the event loop to satisfy the correct invariant for metadata receive posting: we post at most one receive per event loop iteration and act on it appropriately. If it is a finish chunk, we update the number of expected messages (and hence the number of additional metadata receives we expect to post); otherwise it is a data chunk and we update the number of received messages. This way the metadata receive is never a greedy loop that can eat sends from a subsequent collective.
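
To illustrate the invariant, here is a minimal Python sketch. It is not the code in this PR. The names `Meta`, `Gather`, `greedy_progress`, and `make_mailbox` are hypothetical, and a single FIFO `deque` stands in for tagged message transport where two back-to-back collectives reuse the same op_id (and therefore the same tag). The `owner` field exists only so the result can be checked; the receiver never reads it.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Meta:
    kind: str        # "data" or "finish"
    count: int = 0   # finish only: number of data chunks that peer sent
    owner: int = 0   # which collective sent it (for checking only)


class Gather:
    """Receive-side bookkeeping for one collective (hypothetical)."""

    def __init__(self, n_peers):
        self.finishes_left = n_peers  # one finish chunk expected per peer
        self.expected = 0             # data chunks announced by finish chunks
        self.received = 0             # data chunks seen so far
        self.consumed = []            # every message this collective took

    def done(self):
        return self.finishes_left == 0 and self.received == self.expected

    def progress(self, mailbox):
        """One event loop iteration: post at most one metadata receive."""
        if self.done() or not mailbox:
            return
        msg = mailbox.popleft()
        self.consumed.append(msg)
        if msg.kind == "finish":
            # A finish chunk tells us how many more messages to expect.
            self.finishes_left -= 1
            self.expected += msg.count
        else:
            # A data chunk counts towards the received total.
            self.received += 1


def greedy_progress(gather, mailbox):
    """The failure mode: keep receiving while anything matches the tag."""
    while mailbox:
        gather.consumed.append(mailbox.popleft())


def make_mailbox():
    """Messages of collective 1 followed by collective 2 on the same tag."""
    first = [Meta("data", owner=1), Meta("data", owner=1),
             Meta("finish", 2, owner=1)]
    second = [Meta("data", owner=2), Meta("finish", 1, owner=2)]
    return deque(first + second)
```

Calling `Gather(n_peers=1).progress(mailbox)` repeatedly on `make_mailbox()` consumes the three messages of the first collective and then stops, leaving the second collective's two messages in the mailbox, however many further iterations run. `greedy_progress` on the same input drains all five messages, including the two that belong to the next collective.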

wence- requested a review from a team as a code owner March 17, 2026 12:38
wence- added the bug (Something isn't working) and non-breaking (Introduces a non-breaking change) labels Mar 17, 2026