
fix(mooncake): serialize ibv_reg_mr to avoid nvidia-peermem segfault under concurrent registration#14

Open
DavidBellamy wants to merge 19 commits into main from fix/mooncake-ibv-reg-mr-serialize-llm360-fork

Conversation

@DavidBellamy
Collaborator

Summary

Add a process-wide threading lock around `register_memory` / `batch_register_memory` / engine init in `MooncakeTransferEngine` to defend against a race in nvidia-peermem's page-callback path that can segfault when multiple threads register GPU memory concurrently against multiple IB contexts.

Why

Under concurrent GPU memory registration from multiple IB contexts (common in SR-IOV VF environments where each HCA presents a separate `ibv_context`), nvidia-peermem's page-callback can race and segfault inside the kernel. Sequential registration across the same set of HCAs is fine; the race window only opens with concurrency.

Reproducible with a stress harness that issues `ibv_reg_mr` from N threads against N contexts simultaneously. With a process-wide mutex around the offending call, the segfault disappears with no observed throughput cost on realistic workloads (registration runs at session setup, not per request).
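The shape of such a stress harness can be sketched in Python (names here are hypothetical stand-ins; the real harness issues `ibv_reg_mr` against N distinct `ibv_context`s through the verbs API):

```python
import threading

# Hypothetical stand-in for a per-context registration call; in the real
# harness this is ibv_reg_mr() against the i-th ibv_context.
def register_on_context(ctx_id: int, results: list) -> None:
    results[ctx_id] = f"mr-{ctx_id}"  # placeholder for the returned MR handle

def stress_register(num_contexts: int) -> list:
    """Issue one registration per context from N threads simultaneously."""
    results = [None] * num_contexts
    barrier = threading.Barrier(num_contexts)  # release all threads at once

    def worker(i: int) -> None:
        barrier.wait()  # maximize overlap across registrations
        register_on_context(i, results)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_contexts)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The barrier is what opens the race window: without it, thread startup skew can serialize the registrations by accident and hide the bug.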

Changes (`python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py`)

  • Module-level `_ibv_reg_lock` and `_init_lock` (`threading.Lock`).
  • `register` and `batch_register` acquire `_ibv_reg_lock` around the underlying `engine.register_memory` / `engine.batch_register_memory` call.
  • `init_mooncake_transfer_engine` wrapped in `_init_lock` to also defend against concurrent first-init from multiple worker threads.

Behavior

  • No API change.
  • No throughput regression observed (registration is per-session setup, not per-request).

Provenance

One of five focused PRs that supersede #3.

mickqian and others added 19 commits April 4, 2026 23:37
…alistic perf and auto-discover ut (sgl-project#22086)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
…gl-project#21649)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
…under concurrent registration

