@ruizhang1230 ruizhang1230 commented Jan 7, 2026

This MR fixes an issue introduced by #28.

MR #28 changed the trigger condition of the P2P broadcast operation so that it requires the H2D buffer. When free GPU memory was insufficient for H2D operations, the H2D buffer was set to None, so no communication buffer was registered with the P2P store. Subsequent synchronization of the weight buffer then followed the wrong path, resulting in GPU memory synchronization errors.

With this fix, when P2P operations are used to update weights, the H2D buffer is registered with the P2P store if H2D operations are available; otherwise, the receive buffer is registered with the P2P store directly. The corresponding copy_to_buffer operation is then executed through the correct path.
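
For readers skimming the thread, the buffer selection described above can be sketched roughly as follows. This is a minimal illustration and not the code from this MR: `P2PStore`, `select_p2p_buffer`, and the `"h2d"` / `"recv"` labels are hypothetical stand-ins, and plain `bytearray` objects take the place of GPU buffers.

```python
from typing import Optional


class P2PStore:
    """Hypothetical stand-in for the P2P store; it only records registered buffers."""

    def __init__(self) -> None:
        # Maps an illustrative buffer label to the registered buffer object.
        self.registered = {}

    def register(self, name: str, buffer: bytearray) -> None:
        self.registered[name] = buffer


def select_p2p_buffer(
    store: P2PStore,
    h2d_buffer: Optional[bytearray],
    recv_buffer: bytearray,
) -> str:
    """Register exactly one communication buffer and report which path was taken.

    Assumption: h2d_buffer is None when free GPU memory is too low for H2D.
    """
    if h2d_buffer is not None:
        # H2D is available: register the H2D buffer with the P2P store.
        store.register("h2d", h2d_buffer)
        return "h2d"
    # H2D is unavailable: fall back to registering the receive buffer directly,
    # so the store is never left without a communication buffer.
    store.register("recv", recv_buffer)
    return "recv"
```

The returned label is what a caller would use to pick the matching copy_to_buffer path; the point of the sketch is only that one buffer is always registered, whichever branch is taken.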

@blahgeek blahgeek merged commit 84e99ad into MoonshotAI:main Jan 9, 2026
2 checks passed
@ruizhang1230 ruizhang1230 deleted the fix_p2p_error branch January 10, 2026 15:10
4 participants