Skip to content

Fix: reset signal slots after Phase 2 barrier for reentrant collectiv…#1

Open
georgebisbas wants to merge 2 commits into
feat/allgather-reduce-scatter-examplesfrom
feat/reentrant-barrier-signal-reset
Open

Fix: reset signal slots after Phase 2 barrier for reentrant collectiv…#1
georgebisbas wants to merge 2 commits into
feat/allgather-reduce-scatter-examplesfrom
feat/reentrant-barrier-signal-reset

Conversation

@georgebisbas
Copy link
Copy Markdown
Owner

…e kernels

The TNOTIFY(AtomicAdd, 1) + TWAIT(GE, 1) barrier pattern leaves each signal slot permanently at 1 after the first kernel invocation. A second call on the same HCCL window (e.g. inside a training loop) would find signal_base[peer] >= 1 already, causing every TWAIT to return immediately without actually synchronising — a silent correctness bug.

Fix: after all TWAIT calls complete and pipe_barrier(PIPE_ALL) drains in-flight MTE2 reads, zero signal_base[0..nranks-1] in the local window and flush with a second pipe_barrier(PIPE_ALL).

Safety of the reset:
signal_base[i] in OUR window is where peer i writes its arrival
notification. TWAIT(signal_base[i], GE 1) returning guarantees that
peer i's AtomicAdd DMA has already committed to our local memory; no
further write to signal_base[i] can arrive from the current invocation.
We are the exclusive writer of our own signal_base, so zeroing it here
races with nothing.

Applied uniformly to all three collective kernels:

  • allreduce_distributed
  • allgather_distributed (new, added in parent commit)
  • reduce_scatter_distributed (new, added in parent commit)

Addresses the HIGH-priority barrier comment from the Gemini review on hw-native-sys#842

…e kernels

The TNOTIFY(AtomicAdd, 1) + TWAIT(GE, 1) barrier pattern leaves each
signal slot permanently at 1 after the first kernel invocation.  A second
call on the same HCCL window (e.g. inside a training loop) would find
signal_base[peer] >= 1 already, causing every TWAIT to return immediately
without actually synchronising — a silent correctness bug.

Fix: after all TWAIT calls complete and pipe_barrier(PIPE_ALL) drains
in-flight MTE2 reads, zero signal_base[0..nranks-1] in the local window
and flush with a second pipe_barrier(PIPE_ALL).

Safety of the reset:
  signal_base[i] in OUR window is where peer i writes its arrival
  notification.  TWAIT(signal_base[i], GE 1) returning guarantees that
  peer i's AtomicAdd DMA has already committed to our local memory; no
  further write to signal_base[i] can arrive from the current invocation.
  We are the exclusive writer of our own signal_base, so zeroing it here
  races with nothing.

Applied uniformly to all three collective kernels:
  - allreduce_distributed
  - allgather_distributed   (new, added in parent commit)
  - reduce_scatter_distributed  (new, added in parent commit)

Addresses the HIGH-priority barrier comment from the Gemini review on
hw-native-sys#842
Keep kernel source concise while preserving full safety rationale in prior
commit message on this branch (60db4df). Behavior is unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant