Skip to content

hdnax/MPiSC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Non-Blocking Distributed MPSC Queues

MPiSC Status Thesis

Table of Contents

Abstract

Distributed applications such as the actor model and fan-out/fan-in pattern require MPSC queues that are both performant and fault-tolerant. We address the absence of non-blocking distributed MPSC queues by adapting LTQueue — a wait-free shared-memory MPSC queue — to distributed environments using MPI-3 RMA. We introduce three novel wait-free distributed MPSC queues: dLTQueue, Slotqueue, and dLTQueueV2. Evaluation on SuperMUC-NG and CoolMUC-4 shows ~2x better enqueue throughput than the existing AMQueue while providing stronger fault tolerance.

Motivation and Methodology

The Problem

MPSC queues are essential for irregular applications — programs with unpredictable, data-dependent memory access patterns:

  • Actor model: Each actor maintains a mailbox (MPSC queue) receiving messages from other actors
  • Fan-out/fan-in: Worker nodes enqueue results to an aggregation node for processing

These patterns demand queues that are both performant and fault-tolerant. A slow or crashed producer should not block the entire system.

Gap in the Literature

Shared-memory has several non-blocking MPSC queues: LTQueue, DQueue, WRLQueue, and Jiffy. However, our analysis reveals critical flaws in most:

Queue Issue
DQueue Incorrect ABA solution and unsafe memory reclamation
WRLQueue Actually blocking — dequeuer waits for all enqueuers
Jiffy Insufficient memory reclamation, not truly wait-free
LTQueue Correct — uses LL/SC for ABA, proper memory reclamation

Distributed has only one MPSC queue: AMQueue. Despite claiming lock-freedom, it is actually blocking — the dequeuer must wait for all enqueuers to finish. A single slow enqueuer halts the entire system. (Confirmed by the original author)

Our Approach

We adapt LTQueue — the only correct shared-memory MPSC queue — to distributed environments using MPI-3 RMA one-sided communication.

Key challenge: LTQueue relies on LL/SC (Load-Link/Store-Conditional) to solve the ABA problem, but LL/SC is unavailable in MPI.

Our solution: Replace LL/SC with CAS + unique timestamps. Each value is tagged with a monotonically increasing version number, preventing ABA without LL/SC.

Key techniques:

  • SPSC-per-enqueuer: Each producer maintains a local queue, eliminating producer contention
  • Unique timestamps: Solves ABA via monotonic version numbers
  • Double-refresh: Bounds retries to two per node, ensuring wait-freedom

Contributions

Findings

  • 3 of 4 shared-memory MPSC queues (DQueue, WRLQueue, Jiffy) have correctness or progress issues
  • AMQueue, the only distributed MPSC queue, is blocking despite claims of lock-freedom
  • LTQueue is the only correct candidate for distributed adaptation

Novel Algorithms

Algorithm Progress Enqueue Dequeue
dLTQueue Wait-free O(log n) remote O(log n) remote
Slotqueue Wait-free O(1) remote O(1) remote, O(n) local
dLTQueueV2 Wait-free O(1) remote O(1) remote, O(log n) local

All algorithms are linearizable with no dynamic memory allocation.

Results

Benchmarked on SuperMUC-NG (6000+ nodes) and CoolMUC-4 (100+ nodes):

Metric Our Queues vs AMQueue
Enqueue throughput ~2x better
Dequeue throughput 3-10x worse
Fault tolerance Wait-free (vs blocking)

Trade-off: Stronger fault tolerance at the cost of dequeue performance.

Related

  1. dLTQueue - FDSE 2025 (ResearchGate)
  2. Slotqueue - NPC 2025 (ResearchGate)

Full thesis

About

Investigation and porting of shared-memory MPSCs to distributed context using MPI-3

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published