fabtests/efa/multi_ep_stress: Refactor completion handling with dedicated CQ threads #11938

Open
alekswn wants to merge 1 commit into ofiwg:main from alekswn:multi-ep-stress-concurent-cq-read
alekswn (Contributor) commented Mar 2, 2026

Refactor the multi-endpoint stress test to use dedicated completion queue (CQ) polling threads instead of inline polling. This change improves scalability and reduces contention when handling completions.

Key changes:

  • Introduce cq_context structure to encapsulate CQ state with atomic completion counters and dedicated polling thread
  • Replace blocking wait_for_comp() with lock-free atomic-based completion tracking that runs in separate threads
  • Implement ft_backoff() with progressive backoff strategy (CPU pause, yield, exponential sleep) to reduce busy-waiting overhead
  • Remove shared CQ mutex locking in favor of lock-free atomic operations
  • Move CQ thread lifecycle management into setup_endpoint() and cleanup_endpoint()
  • Refactor sender/receiver workers to poll atomic completion counters instead of directly calling fi_cq_read()
  • Fix timeout handling to distinguish between -FI_ETIMEDOUT and other errors
  • Move shared resource setup/cleanup to run_test() for better lifecycle management
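The progressive backoff described in the list above can be sketched roughly as follows. This is not the PR's `ft_backoff()`: the function name `backoff_step_sketch`, the thresholds, and the sleep cap are all assumptions chosen for illustration. The idea is to spin with a CPU pause hint for the first few attempts, then yield the core, then sleep for exponentially growing intervals up to a cap.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <time.h>

/* All thresholds below are illustrative assumptions, not values from the PR. */
#define SKETCH_SPIN_LIMIT   16u       /* attempts that only issue a CPU pause */
#define SKETCH_YIELD_LIMIT  32u       /* attempts below this call sched_yield() */
#define SKETCH_BASE_NS      1000L     /* first sleep interval: 1 microsecond */
#define SKETCH_MAX_SLEEP_NS 1000000L  /* sleep cap: 1 millisecond */

/* Architecture-specific "relax" hint; a no-op on other targets. */
static inline void cpu_relax_sketch(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	__asm__ __volatile__("yield");
#endif
}

/*
 * Hypothetical backoff step. 'attempt' is the number of consecutive empty
 * polls so far. Returns the number of nanoseconds slept (0 while the
 * function is still in the pause or yield phase).
 */
static long backoff_step_sketch(unsigned attempt)
{
	struct timespec ts;
	unsigned shift;
	long ns;

	if (attempt < SKETCH_SPIN_LIMIT) {
		cpu_relax_sketch();
		return 0;
	}
	if (attempt < SKETCH_YIELD_LIMIT) {
		sched_yield();
		return 0;
	}

	/* Exponential phase: double the interval per attempt, then clamp. */
	shift = attempt - SKETCH_YIELD_LIMIT;
	ns = (shift >= 10) ? SKETCH_MAX_SLEEP_NS : (SKETCH_BASE_NS << shift);
	if (ns > SKETCH_MAX_SLEEP_NS)
		ns = SKETCH_MAX_SLEEP_NS;

	assert(ns < 1000000000L); /* nanosleep() requires tv_nsec < 1e9 */
	ts.tv_sec = 0;
	ts.tv_nsec = ns;
	nanosleep(&ts, NULL);
	return ns;
}
```

A caller would typically reset its attempt counter to zero whenever a poll returns work, so the cheap pause phase is used while completions are arriving and the sleep phase only kicks in when the queue stays idle.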

This design eliminates lock contention on shared CQs and allows worker threads to continue posting operations while a dedicated thread handles completions asynchronously.
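To make the division of labor concrete, here is a minimal sketch of a dedicated polling thread that publishes completions through an atomic counter, with workers waiting on that counter instead of reading the queue themselves. It does not use libfabric and is not the PR's `cq_context`: the struct `cq_ctx_sketch`, the callback type `poll_fn_t` (a stand-in for a CQ read such as `fi_cq_read()`), and `fake_poll()` are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a CQ read: returns the number of completions
 * reaped, 0 if none are available, or a negative value on error. */
typedef int (*poll_fn_t)(void *arg);

struct cq_ctx_sketch {
	poll_fn_t poll;
	void *poll_arg;
	atomic_uint_fast64_t completed; /* total completions seen so far */
	atomic_int error;               /* first negative poll result, if any */
	atomic_bool stop;               /* set by the owner to end the thread */
	pthread_t thread;
};

/* Body of the dedicated polling thread: the only reader of the queue. */
static void *cq_poll_loop(void *arg)
{
	struct cq_ctx_sketch *ctx = arg;

	while (!atomic_load_explicit(&ctx->stop, memory_order_acquire)) {
		int n = ctx->poll(ctx->poll_arg);

		if (n < 0) {
			atomic_store(&ctx->error, n);
			break;
		}
		if (n > 0)
			atomic_fetch_add_explicit(&ctx->completed,
						  (uint_fast64_t)n,
						  memory_order_release);
		else
			sched_yield(); /* a real loop would back off here */
	}
	return NULL;
}

static int cq_ctx_start(struct cq_ctx_sketch *ctx, poll_fn_t poll, void *arg)
{
	assert(ctx && poll);
	ctx->poll = poll;
	ctx->poll_arg = arg;
	atomic_init(&ctx->completed, 0);
	atomic_init(&ctx->error, 0);
	atomic_init(&ctx->stop, false);
	return pthread_create(&ctx->thread, NULL, cq_poll_loop, ctx);
}

static void cq_ctx_stop(struct cq_ctx_sketch *ctx)
{
	atomic_store_explicit(&ctx->stop, true, memory_order_release);
	pthread_join(ctx->thread, NULL);
}

/* Worker side: wait until 'target' completions have been counted.
 * Returns 0 on success or the negative error recorded by the poller. */
static int cq_ctx_wait(struct cq_ctx_sketch *ctx, uint64_t target)
{
	while (atomic_load_explicit(&ctx->completed,
				    memory_order_acquire) < target) {
		int err = atomic_load(&ctx->error);

		if (err)
			return err;
		sched_yield();
	}
	return 0;
}

/* Fake completion source for demonstration: hands out one completion per
 * call until the counter it points to reaches zero. */
static int fake_poll(void *arg)
{
	atomic_int *remaining = arg;

	if (atomic_load(remaining) <= 0)
		return 0;
	atomic_fetch_sub(remaining, 1);
	return 1;
}
```

In this arrangement the worker never touches the queue, so no mutex is needed around it; the only shared state is the counter, the error code, and the stop flag. A worker would call `cq_ctx_start()` once per endpoint, post its operations, call `cq_ctx_wait()` with the number of completions it expects, and finish with `cq_ctx_stop()`.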

Signed-off-by: Alexey Novikov <nalexey@amazon.com>
alekswn requested a review from a team on March 2, 2026, 18:05