Fix potential hang issues when encountering illegal external connections #1880
Open
qingyunlei-tencent wants to merge 1 commit into NVIDIA:master from
Skylion007 reviewed on Oct 28, 2025
| // Added printing of local listening address and link type information | ||
| ncclSocketAddress listenAddr; | ||
| struct sockaddr_in addr; | ||
| socklen_t len = sizeof(addr); |
Nit: shouldn't this be a constexpr? Or does getsockname not like that?
Author
> Nit: shouldn't this be a constexpr? Or does getsockname not like that?
I checked some online examples, and they declare the length this way. Could you elaborate on your concern?
Suggested change
| socklen_t len = sizeof(addr); | |
| constexpr socklen_t len = sizeof(addr); |
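For reference on the question in this thread: POSIX declares the length parameter of getsockname as a pointer to a non-const socklen_t. It is a value-result argument, read as the buffer size on entry and overwritten with the actual address length on return. A constexpr variable is implicitly const, so passing its address would not compile. The sketch below illustrates the usual pattern. The helper name boundLoopbackPort is hypothetical and is not taken from this PR or from NCCL.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>

// Hypothetical helper (not part of NCCL): bind a TCP socket to an ephemeral
// loopback port and return the port the kernel chose, or -1 on failure.
int boundLoopbackPort() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;

  sockaddr_in bindAddr{};
  bindAddr.sin_family = AF_INET;
  bindAddr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  bindAddr.sin_port = 0;  // let the kernel pick a free port
  if (bind(fd, reinterpret_cast<sockaddr*>(&bindAddr), sizeof(bindAddr)) != 0) {
    close(fd);
    return -1;
  }

  sockaddr_in addr{};
  // Must be a mutable lvalue: getsockname reads the buffer size from it and
  // writes the actual address length back into it.
  socklen_t len = sizeof(addr);
  int rc = getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len);
  close(fd);
  if (rc != 0) return -1;

  assert(len <= sizeof(addr));  // the returned address was not truncated
  return ntohs(addr.sin_port);
}
```

Declaring len as constexpr would make &len a pointer to const, which cannot be converted to the socklen_t* that getsockname expects.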
This PR addresses the issue raised in #1808 by introducing a new socket state that handles invalid magic values, which prevents the program from entering an infinite loop. It builds on #1834 and extends it to cover a key edge case: an external connection is established successfully, but the peer sends no magic value or a magic field of invalid length. The state transitions and handling mechanism are visualized in the diagram below:


Test method:
Write a shell script that runs the nccl-tests alltoall_perf benchmark in a loop. While it is running, use naabu to simulate external connections. When an external connection reaches the port used by the proxy thread, NCCL logs the event and the program no longer hangs.
Additionally, we have extended the solution to cover an important edge case.
This case arises when an external connection is successfully established but the peer either sends no magic value or sends a magic field of invalid length.
A timeout mechanism has been implemented to prevent the program from hanging indefinitely in such cases, so incomplete or malformed connection handshakes are handled gracefully.
The timeout duration defaults to a value long enough to tolerate legitimate network delays, while still bounding how long a failed handshake can block.
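The timeout idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the names MagicResult, recvMagicWithTimeout, and kExampleMagic are hypothetical, the magic value is made up, and the 8-byte magic width is assumed for the example. The function waits for the full magic with poll and a monotonic deadline, so a peer that connects and then sends nothing, or sends too few bytes, cannot block the caller forever.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical magic value for illustration only; not NCCL's real constant.
constexpr std::uint64_t kExampleMagic = 0x0123456789abcdefULL;

enum class MagicResult { Ok, BadMagic, TimedOut, Closed, Error };

// Hypothetical helper: read an 8-byte magic from fd within timeoutMs.
MagicResult recvMagicWithTimeout(int fd, std::uint64_t expected, int timeoutMs) {
  assert(timeoutMs >= 0);
  using Clock = std::chrono::steady_clock;
  const auto deadline = Clock::now() + std::chrono::milliseconds(timeoutMs);

  unsigned char buf[sizeof(std::uint64_t)];
  std::size_t got = 0;
  while (got < sizeof(buf)) {
    const auto remaining = std::chrono::duration_cast<std::chrono::milliseconds>(
                               deadline - Clock::now()).count();
    if (remaining <= 0) return MagicResult::TimedOut;

    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;
    int rc = poll(&pfd, 1, static_cast<int>(remaining));
    if (rc < 0) {
      if (errno == EINTR) continue;  // interrupted: re-check the deadline
      return MagicResult::Error;
    }
    if (rc == 0) return MagicResult::TimedOut;  // peer sent nothing in time

    ssize_t n = recv(fd, buf + got, sizeof(buf) - got, MSG_DONTWAIT);
    if (n == 0) return MagicResult::Closed;  // peer closed before a full magic
    if (n < 0) {
      if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK) continue;
      return MagicResult::Error;
    }
    got += static_cast<std::size_t>(n);
  }

  std::uint64_t magic = 0;
  std::memcpy(&magic, buf, sizeof(magic));
  return magic == expected ? MagicResult::Ok : MagicResult::BadMagic;
}
```

In a sketch like this, the caller would log and close the socket on any result other than Ok and then continue accepting connections, rather than retrying the read on the same socket.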