Skip to content

Feat[io]: add SQ-depth admission control and robust batch posting/rollback handling#188

Merged
maning00 merged 11 commits intomainfrom
feat/sq-depth-tracking
Mar 10, 2026
Merged

Feat[io]: add SQ-depth admission control and robust batch posting/rollback handling#188
maning00 merged 11 commits intomainfrom
feat/sq-depth-tracking

Conversation

@maning00
Copy link
Copy Markdown
Contributor

@maning00 maning00 commented Mar 6, 2026

Motivation

This PR improves the robustness of RDMA transfer submission under strict queue pressure and complex error conditions.

The main goals are:

  • Prevent overfilling send queues (SQ) by introducing software-side SQ depth admission control.
  • Ensure SQ depth accounting is always released correctly on both completion and multi-endpoint failure paths to prevent resource leaks.
  • Make large, multi-endpoint transfer batches behave correctly when posting is split into multiple WR chains.
  • Reduce CPU burn under SQ contention with bounded, configurable wait-and-yield backoff.
  • Eliminate data races corresponding to asynchronous CQE callback invocations on failed batches.

Technical Details

This PR introduces endpoint-level SQ depth tracking and integrates it deeply into the remote IO notify + batch read/write paths.

Core Changes

  • Endpoint SQ Capacity Tracking:

    • Added per-endpoint tracking fields in EpPair (sqDepth as a shared atomic counter, maxSqDepth as the limit).
    • Initialized capacity bounds during endpoint connection based on reliable device attributes (src/io/rdma/backend_impl.cpp).
  • SQ Admission Control & Backoff:

    • TryReserveSqDepth(...) provides a bounded reservation with a two-phase backoff (yield -> micro-sleep) and timeouts.
    • Configurable via MORI_IO_SQ_BACKOFF_TIMEOUT_US (default: 10000us).
    • Centralized SQ clearance via ReleaseSqDepth(...).
  • Submission Ledger for Precise CQE Release:

    • Replaced the naive wrCount tracking in callbacks with a SubmissionLedger.
    • Submissions are assigned a recordId. The CQ completion path uses this recordId to precisely map and release the corresponding sqDepth for both normal (Posted) and partially failed (Orphaned) requests.
  • Robust Multi-Endpoint Error Handling & Degraded State:

    • Added a degraded atomic flag to EpPair.
    • If ibv_post_send fails in the middle of a large multi-endpoint batch, the current endpoint and all other endpoints with pending unsignaled WRs will correctly roll back.
    • Successfully posted but unsignaled WRs on all endpoints are moved to Orphaned states in their respective ledgers, and the endpoints are marked as degraded to reject new WRs until the NotifManager recovers them or receives CQEs. This completely fixes former SQ depth leak deadlocks.
  • Thread-Safe Status Updates:

    • Changed TransferStatus* status to std::atomic<TransferStatus*> status in CqCallbackMeta. This prevents data races and dangling pointer access when error paths defensively nullify the status concurrently across multiple CQ processing threads.

Test Plan

  • Unit/Integration suites: Build and run existing tests for RDMA I/O paths.test_engine covers xgmi, rdma, and timeout edge cases.
  • Fail-path recovery: Force ibv_post_send failures in batched read/write after partial posting and verify that no SQ-depth leak happens and degraded endpoints gracefully block and recover.
  • Contention validation: Run high-concurrency RDMA read/write + notify traffic with low MORI_IO_SQ_BACKOFF_TIMEOUT_US. Confirm the logic limits overcommit without dropping valid requests or suffering from permanent SQ full lockouts.
  • Large-batch validation: Run transfer batch sizes significantly larger than the underlying SQ depth (max_send_wr) and verify the split submission signaling behavior works properly on multiple EPs.

Test Result

  • No compile regressions observed.
  • SQ-depth leaks from multi-endpoint batch post failure paths are fixed; reserved depths and orphan records are correctly rolled back or drained.
  • Under heavy load, SQ admission control prevents persistent overcommit behavior and completely avoids permanent throttling caused by permanently leaked counters.
  • Dangling pointer issues causing segfaults during concurrent teardowns/errors have been mitigated.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves RDMA I/O robustness under send-queue (SQ) pressure by adding endpoint-level SQ depth admission control and making batch posting/accounting more resilient to partial posting and error paths.

Changes:

  • Add per-endpoint SQ depth tracking (sqDepth, maxSqDepth) and plumb it through endpoint connection + CQ completion handling.
  • Introduce bounded/backoff-based SQ admission control for notify and batch read/write posting, with explicit release on failure/completion paths.
  • Extend CQ callback payload with wrCount to release the exact SQ depth for signaled batches.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/pybind/pybind_io.cpp Exposes additional RDMA backend configuration knobs (max_send_wr, max_cqe_num, max_msg_sge) to Python.
src/io/rdma/common.hpp Adds SQ tracking fields to EpPair and extends CqCallbackMessage with wrCount.
src/io/rdma/common.cpp Implements SQ admission control with backoff + integrates SQ reserve/release into notify and batched RDMA posting.
src/io/rdma/backend_impl.cpp Initializes SQ tracking on connect and releases SQ depth on CQ completion based on opcode / wrCount.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/io/rdma/common.cpp Outdated
Comment thread src/io/rdma/common.cpp Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/io/rdma/common.cpp
Comment thread src/io/rdma/common.cpp Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/io/rdma/common.cpp
Comment thread src/io/rdma/common.cpp
Comment thread src/io/rdma/common.cpp
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/io/rdma/common.cpp
Comment thread src/io/rdma/common.cpp Outdated
Comment thread src/io/rdma/common.cpp
Comment thread src/io/rdma/common.cpp
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/io/rdma/common.cpp
Comment thread src/io/rdma/common.cpp
Add per-QP send queue depth tracking with bounded backoff retry to
prevent ibv_post_send failures caused by SQ overflow.

- Add sqDepth (shared atomic counter) and maxSqDepth to EpPair so all
  copies of an endpoint share the same depth tracker
- Add wrCount to CqCallbackMessage for accurate depth release on CQE
- Pre-reserve SQ capacity with CAS before ibv_post_send in both
  RdmaBatchReadWrite (all-or-nothing across EPs) and RdmaNotifyTransfer
- On SQ full, yield and retry up to MORI_IO_SQ_MAX_BACKOFF (default
  100000) iterations before returning error, allowing CQ polling to
  drain outstanding WRs
- Release depth in NotifManager::ProcessOneCqe on send/data completions
- Initialize depth tracker in RdmaManager::ConnectEndpoint from the
  endpoint config's maxMsgsNum (max_send_wr)
@maning00 maning00 force-pushed the feat/sq-depth-tracking branch 2 times, most recently from 9cc0089 to 2eb152b Compare March 10, 2026 08:13
@maning00 maning00 force-pushed the feat/sq-depth-tracking branch from 2eb152b to 60fd2df Compare March 10, 2026 08:18
@maning00 maning00 force-pushed the feat/sq-depth-tracking branch 2 times, most recently from f20d261 to 4b0423b Compare March 10, 2026 09:34
@maning00 maning00 force-pushed the feat/sq-depth-tracking branch from 4b0423b to 11dcbaf Compare March 10, 2026 09:40
@maning00 maning00 merged commit f0167ab into main Mar 10, 2026
2 of 3 checks passed
@maning00 maning00 deleted the feat/sq-depth-tracking branch March 10, 2026 09:47
maning00 added a commit that referenced this pull request Mar 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants