Bugfix: SQ overflow #3145
Conversation
We encountered the same problem as #3132. This PR's solution is similar to #3132 (comment) and can solve this problem. PTAL @sunce4t @yanglimingcn
Pull Request Overview
This PR refactors the RDMA window management by splitting the single _window_size variable into two separate variables: _remote_rq_window_size (tracking remote receive queue capacity) and _sq_window_size (tracking local send queue capacity). This provides more granular control over flow control in RDMA operations.
- Split window size tracking into separate remote RQ and local SQ windows
- Added a SendType enum to distinguish between different types of send operations
- Enhanced logging to differentiate between client and server handshake completion
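To make the split concrete, here is a minimal sketch of the idea, not the actual brpc implementation: the class and method names are invented for illustration, and only the two counter names mirror the variables this PR introduces. A send consumes one credit from each window, and the two windows are refilled by different events.

```cpp
#include <atomic>

// Illustrative sketch only: one credit counter for the peer's receive queue
// and one for the local send queue.  A message may be posted only when both
// have credit.
class WindowPairSketch {
public:
    WindowPairSketch(int sq_capacity, int remote_rq_capacity)
        : _sq_window_size(sq_capacity),
          _remote_rq_window_size(remote_rq_capacity) {}

    // Try to take one credit from each window before calling ibv_post_send().
    bool TryAcquireSendCredit() {
        if (_sq_window_size.fetch_sub(1, std::memory_order_acquire) <= 0) {
            _sq_window_size.fetch_add(1, std::memory_order_relaxed);
            return false;  // local SQ is full
        }
        if (_remote_rq_window_size.fetch_sub(1, std::memory_order_acquire) <= 0) {
            _remote_rq_window_size.fetch_add(1, std::memory_order_relaxed);
            _sq_window_size.fetch_add(1, std::memory_order_relaxed);
            return false;  // remote RQ has no spare buffer
        }
        return true;
    }

    // Refill the local SQ window when an IBV_WC_SEND completion is polled.
    void OnSendCompleted(int n) {
        _sq_window_size.fetch_add(n, std::memory_order_release);
    }

    // Refill the remote RQ window when the peer's application-level ack
    // reports that it has re-posted receive buffers.
    void OnRemoteAck(int n) {
        _remote_rq_window_size.fetch_add(n, std::memory_order_release);
    }

private:
    std::atomic<int> _sq_window_size;
    std::atomic<int> _remote_rq_window_size;
};
```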
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/brpc/rdma/rdma_endpoint.h | Declared two new atomic variables _remote_rq_window_size and _sq_window_size to replace the single _window_size variable |
| src/brpc/rdma/rdma_endpoint.cpp | Refactored window management logic to independently track local SQ and remote RQ capacity; added SendType enum; improved handshake logging; optimized zero-copy receive path |
Force-pushed from f17d597 to f9e09bd.
Force-pushed from 46350e7 to 45251a9.
Why is CQ divided into send_cq and recv_cq?
We encountered an error: requests were not being sent, causing a large number of client timeouts.
[E1008]Reached timeout=60000ms @Socket{id=13 fd=1160 addr=xxx:xx} (0x0x7f957c964ec0) rdma info={rdma_state=ON, handshake_state=ESTABLISHED, rdma_remote_rq_window_size=63, rdma_sq_window_size=0, rdma_local_window_capacity=125, rdma_remote_window_capacity=125, rdma_sbuf_head=57, rdma_sbuf_tail=120, rdma_rbuf_head=36, rdma_unacked_rq_wr=0, rdma_received_ack=0, rdma_unsolicited_sent=0, rdma_unsignaled_sq_wr=1, rdma_new_rq_wrs=0, }
From the RDMA connection information, we found that because Using
Force-pushed from 45251a9 to db3673a.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
Force-pushed from db3673a to 2cdf06d.
With this modification, send_cq will generate one CQE for every 1/4 of the window?
Yes.
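For context, this kind of "one CQE per quarter window" behaviour would come from selective signaling: only every (window/4)-th send WR sets IBV_SEND_SIGNALED, with the QP created with sq_sig_all = 0. The sketch below is a generic illustration, not the PR's actual code; the function and parameter names are invented.

```cpp
#include <infiniband/verbs.h>

// Illustrative only: signal one send WR out of every window/4, so the send CQ
// produces roughly one CQE per quarter window.  When the signaled WR completes,
// the preceding unsignaled WRs on the same QP are retired along with it.
static int PostSendQuarterSignaled(ibv_qp* qp, ibv_sge* sge,
                                   uint32_t window_capacity,
                                   uint32_t* unsignaled_cnt) {
    ibv_send_wr wr = {};
    ibv_send_wr* bad_wr = nullptr;
    wr.sg_list = sge;
    wr.num_sge = 1;
    wr.opcode = IBV_WR_SEND;
    if (++*unsignaled_cnt >= window_capacity / 4) {
        wr.send_flags |= IBV_SEND_SIGNALED;  // this WR will produce a CQE
        *unsignaled_cnt = 0;
    }
    return ibv_post_send(qp, &wr, &bad_wr);
}
```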
It is unnecessary to split the CQ into send_cq and recv_cq. The sliding window goes wrong because the precondition it relies on is not guaranteed in a RoCE environment. The sliding-window mechanism presumes that when the local application receives the ack from the remote application, the underlying SQ has already been released; in a lossless IB environment that may be true, but in a RoCE environment it is not guaranteed. The local device releases the SQ only when it receives the device-level ack for a message from the remote device, and that ack might be lost even though the remote side has already processed the message, i.e. the remote application has received the message and sent an application-layer ack back. Unfortunately, the local application then receives that application-layer ack and increases the window, while the device layer is still waiting for the device-level ack before it releases the SQ. So the window gets increased wrongly once or twice, until it is completely wrong.
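A minimal sketch of the safer direction this analysis points to, assuming an atomic SQ counter like the one introduced in this PR (the function name is invented): the local SQ credit is returned only when a send completion is actually polled from the CQ, so a lost device-level ACK on RoCE can no longer inflate the window.

```cpp
#include <atomic>
#include <infiniband/verbs.h>

// Illustrative sketch: return local SQ credits only on polled IBV_WC_SEND
// completions, never on the remote application's ack.
static int DrainSendCompletions(ibv_cq* cq, std::atomic<int>* sq_window) {
    ibv_wc wc[16];
    int credits = 0;
    int n = 0;
    while ((n = ibv_poll_cq(cq, 16, wc)) > 0) {
        for (int i = 0; i < n; ++i) {
            if (wc[i].status != IBV_WC_SUCCESS) {
                return -1;  // the QP has gone to an error state
            }
            if (wc[i].opcode == IBV_WC_SEND) {
                // The device has really finished with this WR, so the SQ slot
                // is free again.  (In a design with unsignaled WRs, one signaled
                // completion would return credits for all the WRs it retires.)
                sq_window->fetch_add(1, std::memory_order_release);
                ++credits;
            }
        }
    }
    return n < 0 ? -1 : credits;
}
```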
@legionxiong If the device-level ACK is lost, is it impossible to release the SQ?
Is there a better solution?
The local side will resend the message whose ACK was not received within the retransmission timeout.
In this case, it seems we cannot poll any IBV_WC_SEND work completion for a long time. With solicited_only=1, if the device-layer ACKs arrive after all the application-layer ACKs, we may be unable to poll IBV_WC_SEND after polling IBV_WC_RECV because there are no CQEs.
How can we check for the IBV_WC_SEND event?
This solution for updating the SQ window is better.
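On "how can we check for the IBV_WC_SEND event": the usual verbs pattern is to arm the CQ with solicited_only = 0, so send completions also wake the completion channel, and to poll once more right after arming to close the race with a CQE that landed in between. A generic, hedged sketch follows; it is not brpc's actual event loop and the function name is invented.

```cpp
#include <infiniband/verbs.h>

// Generic CQ-event pattern, shown only to illustrate the discussion above.
static int WaitAndDrainCq(ibv_comp_channel* channel, ibv_cq* cq) {
    ibv_wc wc;
    // Arm with solicited_only = 0: any new CQE (IBV_WC_SEND included)
    // raises an event on the completion channel.
    if (ibv_req_notify_cq(cq, 0) != 0) {
        return -1;
    }
    // Poll once more after arming, otherwise a CQE that landed between the
    // last poll and the arm would never trigger an event.
    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        // handle wc.opcode (IBV_WC_SEND / IBV_WC_RECV / ...)
    }
    ibv_cq* ev_cq = nullptr;
    void* ev_ctx = nullptr;
    if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx) != 0) {  // blocks
        return -1;
    }
    ibv_ack_cq_events(ev_cq, 1);
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
        // handle wc
    }
    return 0;
}
```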
The brpc version I used is 1.14.1; does it matter?
I don't think it matters; just focus on the modifications to the RDMA section.
I ran three tasks; they lasted 1742, 1551, and 1852 minutes, and the "Fail to ibv_post_send: Cannot allocate memory" error did not happen again. @yanglimingcn @chenBright
LGTM |
@randomkang Is this using the latest commit?
Yes, I am using commit 0794a476ee88e19f91cb9ce43166b4d2f90077b3.
In fact, I still don't understand the specific function of the 5/4 window in the latest PR.
I think it would be more reasonable to add a sliding window to limit the number of IMM. @randomkang Please try the latest PR again.
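A rough sketch of what "a sliding window to limit the number of IMM" could look like (class and method names invented; the PR may implement it differently): acks carrying imm data draw from a small separate credit pool, capped at RESERVED_WR_NUM, so they can never exhaust the reserved SQ slots.

```cpp
#include <atomic>

// Illustrative imm-window sketch: a small credit pool just for acks that
// carry imm data.
class ImmWindowSketch {
public:
    explicit ImmWindowSketch(int reserved_wr_num) : _imm_window(reserved_wr_num) {}

    // Call before posting an ack that carries imm data.
    bool TryAcquire() {
        int cur = _imm_window.load(std::memory_order_acquire);
        while (cur > 0) {
            if (_imm_window.compare_exchange_weak(cur, cur - 1,
                                                  std::memory_order_acq_rel)) {
                return true;
            }
            // on failure, cur is reloaded and the loop re-checks cur > 0
        }
        return false;  // too many imm acks in flight; defer or piggyback
    }

    // Call when the corresponding send completion is polled.
    void Release() { _imm_window.fetch_add(1, std::memory_order_release); }

private:
    std::atomic<int> _imm_window;
};
```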
So how should this window be set reasonably? It seems the number of imm WRs generated is not entirely deterministic?
The window size is related to SendAck. When _remote_window_capacity is divisible by 2, a maximum of 2 imm are required. When _remote_window_capacity is not divisible by 2, a maximum of 3 imm are required. Therefore, RESERVED_WR_NUM is a reasonable value.
brpc/src/brpc/rdma/rdma_endpoint.cpp, lines 518 to 519 in 7cd0c25
brpc/src/brpc/rdma/rdma_endpoint.cpp, lines 911 to 916 in 7cd0c25
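Reading that rule back-of-the-envelope (an assumption here, since the referenced lines are not reproduced in this thread): if an ack carrying imm data is emitted roughly once per half window, an even capacity such as 124 yields acks at 62 and 124, i.e. at most 2 imm in flight per window turn, while an odd capacity such as 125 leaves one message over that still needs its own ack, so up to 3 imm can be outstanding. RESERVED_WR_NUM only has to cover that small constant.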
Since RESERVED_WR_NUM is already present, is the 5/4 no longer needed?
Yes. The capacity of RESERVED_WR_NUM is sufficient; by using the imm window to limit the number of imm elements, the 5/4 factor is no longer needed.
I ran 2 tasks, and the "Fail to ibv_post_send: Cannot allocate memory" error did not happen again. One lasted 1790 minutes and failed due to "[wk-15] Thread E1217 16:32:30.900705 106575 12820477405150 external/brpc/src/brpc/rdma/rdma_helper.cpp:220 BlockAllocate] Fail to get block from memory pool". This error is not related to communication. The other lasted 1787 minutes and failed due to "[tf-ps-7] W1217 04:18:39.676858 24875 78292958891373 tensorflow/contrib/pms/rpc/brpc_remote_worker.cc:383 operator()] Fail to recv tensorflow.RecvTensorResponse from 10.28.132.10:3642 : Aborted: [E10010][10.28.132.10:3642][E10010]Step 60851325530873895". I don't see a brpc error message; maybe this error was caused by TensorFlow. @chenBright
By the way, those three earlier tasks also failed in the end. The first task failed due to "[wk-8] E1204 21:01:41.101131 72310 62642098032795 external/brpc/src/brpc/rdma/block_pool.cpp:362 AllocBlockFrom] Fail to extend new region. You can set the size of memory pool larger. Refer to the help message of these flags: rdma_memory_pool_initial_size_mb, rdma_memory_pool_increase_size_mb, rdma_memory_pool_max_regions." This error is not related to communication. The second task failed, but I don't see any error message; maybe it was killed by a machine failure. The third task failed due to "tf op is stuck".
@randomkang In summary, the "Fail to ibv_post_send: Cannot allocate memory" error has been resolved, right?
Yes, I think so.
@yanglimingcn Please review it again. |
Force-pushed from 0b849d1 to 985c3b3.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (1)
src/brpc/rdma/rdma_endpoint.cpp:1520
- After setting notified = true and requesting notifications for both CQs at lines 1497-1500, the code continues with the next iteration of the loop. However, at line 1520, notified is unconditionally reset to false when any completion events are found (cnt > 0). This means that if events arrive after notification but are found in the first poll, the notification state is incorrectly reset. This could lead to missing the re-notification step in subsequent iterations when no events are found.
notified = true;
continue;
}
if (!m->MoreReadEvents(&progress)) {
break;
}
if (0 != ep->GetAndAckEvents(s)) {
return;
}
// Restart polling from `recv_cq'.
send = false;
cq = ep->_resource->recv_cq;
notified = false;
continue;
}
notified = false;
Force-pushed from 985c3b3 to b02d13a.
LGTM |
What problem does this PR solve?
Issue Number: resolve #3132
Problem Summary:
What is changed and the side effects?
Changed:
Side effects:
Performance effects:
Breaking backward compatibility:
Check List: