Feat[io]: add SQ-depth admission control and robust batch posting/rollback handling #188
Pull request overview
This PR improves RDMA I/O robustness under send-queue (SQ) pressure by adding endpoint-level SQ depth admission control and making batch posting/accounting more resilient to partial posting and error paths.
Changes:
- Add per-endpoint SQ depth tracking (`sqDepth`, `maxSqDepth`) and plumb it through endpoint connection + CQ completion handling.
- Introduce bounded/backoff-based SQ admission control for notify and batch read/write posting, with explicit release on failure/completion paths.
- Extend CQ callback payload with `wrCount` to release the exact SQ depth for signaled batches.
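The completion-side release described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `EpPairSketch`, `CqCallbackMessageSketch`, `ReleaseSqDepth`, and `OnSendCompletion` are hypothetical stand-ins, and the clamp-at-zero behavior is an assumption made here to keep the counter from underflowing.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for EpPair: copies share one counter via shared_ptr.
struct EpPairSketch {
  std::shared_ptr<std::atomic<uint32_t>> sqDepth =
      std::make_shared<std::atomic<uint32_t>>(0);
  uint32_t maxSqDepth = 0;
};

// Hypothetical stand-in for CqCallbackMessage.
struct CqCallbackMessageSketch {
  uint32_t wrCount = 1;  // number of WRs covered by one signaled completion
};

// Release exactly `wrCount` slots, clamping at zero (assumption) so a
// duplicate release cannot wrap the unsigned counter.
inline uint32_t ReleaseSqDepth(EpPairSketch& ep, uint32_t wrCount) {
  assert(ep.sqDepth != nullptr);
  uint32_t cur = ep.sqDepth->load(std::memory_order_acquire);
  while (true) {
    uint32_t next = cur >= wrCount ? cur - wrCount : 0;
    if (ep.sqDepth->compare_exchange_weak(cur, next,
                                          std::memory_order_acq_rel,
                                          std::memory_order_acquire)) {
      return next;  // depth remaining after the release
    }
    // On failure `cur` was reloaded by compare_exchange_weak; retry.
  }
}

// Called from the CQ polling path for a send/data completion.
inline uint32_t OnSendCompletion(EpPairSketch& ep,
                                 const CqCallbackMessageSketch& msg) {
  return ReleaseSqDepth(ep, msg.wrCount);
}
```

Because the counter lives behind a `shared_ptr`, a release through any copy of the endpoint is visible to every other copy, which is the property the PR description attributes to the shared atomic counter.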
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/pybind/pybind_io.cpp` | Exposes additional RDMA backend configuration knobs (`max_send_wr`, `max_cqe_num`, `max_msg_sge`) to Python. |
| `src/io/rdma/common.hpp` | Adds SQ tracking fields to `EpPair` and extends `CqCallbackMessage` with `wrCount`. |
| `src/io/rdma/common.cpp` | Implements SQ admission control with backoff + integrates SQ reserve/release into notify and batched RDMA posting. |
| `src/io/rdma/backend_impl.cpp` | Initializes SQ tracking on connect and releases SQ depth on CQ completion based on opcode / `wrCount`. |
Add per-QP send queue depth tracking with bounded backoff retry to prevent `ibv_post_send` failures caused by SQ overflow.
- Add `sqDepth` (shared atomic counter) and `maxSqDepth` to `EpPair` so all copies of an endpoint share the same depth tracker
- Add `wrCount` to `CqCallbackMessage` for accurate depth release on CQE
- Pre-reserve SQ capacity with CAS before `ibv_post_send` in both `RdmaBatchReadWrite` (all-or-nothing across EPs) and `RdmaNotifyTransfer`
- On SQ full, yield and retry up to `MORI_IO_SQ_MAX_BACKOFF` (default 100000) iterations before returning error, allowing CQ polling to drain outstanding WRs
- Release depth in `NotifManager::ProcessOneCqe` on send/data completions
- Initialize depth tracker in `RdmaManager::ConnectEndpoint` from the endpoint config's `maxMsgsNum` (`max_send_wr`)
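The reserve-before-post pattern in this commit message might look roughly like the sketch below. All names (`SqTracker`, `TryReserve`, `ReserveWithBackoff`, `ReserveAll`, `Release`) are hypothetical, and the retry loop only approximates the iteration-bounded backoff described above; the actual code in `src/io/rdma/common.cpp` may differ.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical depth tracker; the PR keeps equivalent fields on EpPair.
struct SqTracker {
  std::shared_ptr<std::atomic<uint32_t>> depth =
      std::make_shared<std::atomic<uint32_t>>(0);
  uint32_t maxDepth = 0;
};

// One CAS-based attempt to reserve `n` slots; leaves depth untouched on failure.
inline bool TryReserve(SqTracker& sq, uint32_t n) {
  uint32_t cur = sq.depth->load(std::memory_order_acquire);
  while (cur <= sq.maxDepth && n <= sq.maxDepth - cur) {
    if (sq.depth->compare_exchange_weak(cur, cur + n,
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire)) {
      return true;
    }
  }
  return false;
}

inline void Release(SqTracker& sq, uint32_t n) {
  uint32_t prev = sq.depth->fetch_sub(n, std::memory_order_acq_rel);
  assert(prev >= n);  // releasing more than was reserved is a bug
  (void)prev;
}

// Retry up to `maxBackoff` extra times, yielding so a CQ poller can drain WRs.
inline bool ReserveWithBackoff(SqTracker& sq, uint32_t n, uint64_t maxBackoff) {
  if (n > sq.maxDepth) return false;  // can never fit
  for (uint64_t i = 0; i <= maxBackoff; ++i) {
    if (TryReserve(sq, n)) return true;
    std::this_thread::yield();
  }
  return false;
}

// All-or-nothing across endpoints: undo earlier reservations if one fails.
inline bool ReserveAll(const std::vector<SqTracker*>& eps,
                       const std::vector<uint32_t>& counts,
                       uint64_t maxBackoff) {
  for (std::size_t i = 0; i < eps.size(); ++i) {
    if (!ReserveWithBackoff(*eps[i], counts[i], maxBackoff)) {
      for (std::size_t j = 0; j < i; ++j) Release(*eps[j], counts[j]);
      return false;
    }
  }
  return true;
}
```

Reserving before calling `ibv_post_send` means a full SQ surfaces as a bounded wait in the caller rather than a post failure, and the rollback in `ReserveAll` keeps a failed multi-endpoint batch from leaving depth reserved on endpoints that never posted.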
Motivation
This PR improves the robustness of RDMA transfer submission under strict queue pressure and complex error conditions.
The main goals are:
Technical Details
This PR introduces endpoint-level SQ depth tracking and integrates it deeply into the remote IO notify + batch read/write paths.
Core Changes
Endpoint SQ Capacity Tracking:
- [...] (`sqDepth` as a shared atomic counter, `maxSqDepth` as the limit).

SQ Admission Control & Backoff:
- [...] `MORI_IO_SQ_BACKOFF_TIMEOUT_US` (default: 10000 us).

Submission Ledger for Precise CQE Release:
- [...] `wrCount` tracking in callbacks with a SubmissionLedger.
- [...] `recordId`. The CQ completion path uses this `recordId` to precisely map and release the corresponding `sqDepth` for both normal (`Posted`) and partially failed (`Orphaned`) requests.

Robust Multi-Endpoint Error Handling & Degraded State:
- [...] `degraded` atomic flag to `EpPair`.
- [...] `ibv_post_send` fails in the middle of a large multi-endpoint batch, the current endpoint and all other endpoints with pending unsignaled WRs will correctly roll back.
- [...] `Orphaned` states in their respective ledgers, and the endpoints are marked as `degraded` to reject new WRs until the NotifManager recovers them or receives CQEs. This completely fixes former SQ depth leak deadlocks.

Thread-Safe Status Updates:
- [...] `TransferStatus* status` to `std::atomic<TransferStatus*> status` in `CqCallbackMeta`. This prevents data races and dangling pointer access when error paths defensively nullify the status concurrently across multiple CQ processing threads.

Test Plan
- [...] `test_engine` covers `xgmi`, `rdma`, and timeout edge cases.
- [...] `ibv_post_send` failures in batched read/write after partial posting and verify that no SQ-depth leak happens and `degraded` endpoints gracefully block and recover.
- [...] `MORI_IO_SQ_BACKOFF_TIMEOUT_US`. Confirm the logic limits overcommit without dropping valid requests or suffering from permanent `SQ full` lockouts.
- [...] `max_send_wr`) and verify the split submission signaling behavior works properly on multiple EPs.

Test Result