
fix(mooncake): serialize ibv_reg_mr to avoid nvidia-peermem segfault under concurrent registration#14

Open
DavidBellamy wants to merge 19 commits into main from fix/mooncake-ibv-reg-mr-serialize-llm360-fork

Conversation

@DavidBellamy
Collaborator

Summary

Add a process-wide threading lock around `register_memory` / `batch_register_memory` / engine init in `MooncakeTransferEngine` to defend against a race in nvidia-peermem's page-callback path that can segfault when multiple threads register GPU memory concurrently against multiple IB contexts.

Why

Under concurrent GPU memory registration from multiple IB contexts (common in SR-IOV VF environments where each HCA presents a separate `ibv_context`), nvidia-peermem's page-callback can race and segfault inside the kernel. Sequential registration across the same set of HCAs is fine; the race window only opens with concurrency.

Reproducible with a stress harness that issues `ibv_reg_mr` from N threads against N contexts simultaneously. With a process-wide mutex around the offending call, the segfault disappears with no observed throughput cost on realistic workloads (registration runs at session setup, not per request).
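The shape of such a stress harness can be sketched in Python (names here are hypothetical stand-ins; the real harness issues `ibv_reg_mr` against N distinct `ibv_context`s through the verbs API):

```python
import threading

# Hypothetical stand-in for a per-context registration call; in the real
# harness this is ibv_reg_mr() against the i-th ibv_context.
def register_on_context(ctx_id: int, results: list) -> None:
    results[ctx_id] = f"mr-{ctx_id}"  # placeholder for the returned MR handle

def stress_register(num_contexts: int) -> list:
    """Issue one registration per context from N threads simultaneously."""
    results = [None] * num_contexts
    barrier = threading.Barrier(num_contexts)  # release all threads at once

    def worker(i: int) -> None:
        barrier.wait()  # maximize overlap across registrations
        register_on_context(i, results)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_contexts)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The barrier is what opens the race window: without it, thread startup skew can serialize the registrations by accident and hide the bug.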

Changes (`python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py`)

  • Module-level `_ibv_reg_lock` and `_init_lock` (`threading.Lock`).
  • `register` and `batch_register` acquire `_ibv_reg_lock` around the underlying `engine.register_memory` / `engine.batch_register_memory` call.
  • `init_mooncake_transfer_engine` wrapped in `_init_lock` to also defend against concurrent first-init from multiple worker threads.

Behavior

  • No API change.
  • No throughput regression observed (registration is per-session setup, not per-request).

Provenance

One of five focused PRs that supersede #3.

mickqian and others added 19 commits April 4, 2026 23:37
…alistic perf and auto-discover ut (sgl-project#22086)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
…gl-project#21649)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
…under concurrent registration

