fix(mooncake): retry transient transfer failures and make session-kill threshold configurable #4
Closed
DavidBellamy wants to merge 52 commits into main
Wrap engine.batch_transfer_sync in a small retry loop so a single
transient transfer failure does not permakill a PD session.
Background: the disaggregation polling loop increments session_failures
on any non-zero return from batch_transfer_sync, and the default kill
threshold is effectively 1. Mooncake has internal per-slice retries for WC
errors but not for handshake RPC failures or endpoint resets under
load, so a single transient error from those paths takes the session
out for the rest of the run. Observed under heavy concurrent transfer
on an H200 cluster running PD disagg with multiple decode pools.
Behavior:
- Retry up to N attempts with linear backoff between them.
- Return 0 on first success and log a recovery warning if it took
more than one attempt.
- On persistent failure, log session id, slice count, total bytes,
first src/dst/len, then return the last non-zero ret unchanged so
the existing failure-accounting path handles it.
Knobs (env vars; setting retries to 1 restores the existing single-attempt behavior):
SGLANG_DISAGG_TRANSFER_RETRIES default 3
SGLANG_DISAGG_TRANSFER_RETRY_BACKOFF_MS default 50
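The retry behavior described above could look roughly like the following. This is a minimal sketch, not the code in the PR: the function name transfer_with_retry, its parameters, the injectable sleep, and the exact log wording are hypothetical, and the linear schedule (backoff, then 2 x backoff, and so on) is an assumption about what "linear backoff" means here. Only the two environment variable names and their defaults come from the commit message. The transfer callable stands in for a call to engine.batch_transfer_sync.

```python
import logging
import os
import time
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


def transfer_with_retry(
    transfer: Callable[[], int],  # stand-in for engine.batch_transfer_sync(...)
    session_id: str,
    src_addrs: Sequence[int],
    dst_addrs: Sequence[int],
    lengths: Sequence[int],
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> int:
    """Call `transfer` until it returns 0 or the attempt budget is spent."""
    # Knob names and defaults are taken from the commit message; at least 1 attempt.
    attempts = max(1, int(os.environ.get("SGLANG_DISAGG_TRANSFER_RETRIES", "3")))
    backoff_ms = int(os.environ.get("SGLANG_DISAGG_TRANSFER_RETRY_BACKOFF_MS", "50"))

    ret = -1
    for attempt in range(1, attempts + 1):
        ret = transfer()
        if ret == 0:
            if attempt > 1:
                logger.warning(
                    "transfer to %s recovered on attempt %d/%d",
                    session_id, attempt, attempts,
                )
            return 0
        if attempt < attempts:
            # Assumed linear schedule: backoff_ms, 2 * backoff_ms, ...
            sleep(backoff_ms * attempt / 1000.0)

    # Persistent failure: log enough context to debug, then hand the last
    # non-zero return code back unchanged to the caller's failure accounting.
    logger.error(
        "transfer to %s failed after %d attempts: ret=%d slices=%d bytes=%d "
        "first=(src=%s dst=%s len=%s)",
        session_id, attempts, ret, len(lengths), sum(lengths),
        src_addrs[0] if src_addrs else None,
        dst_addrs[0] if dst_addrs else None,
        lengths[0] if lengths else None,
    )
    return ret
```

With SGLANG_DISAGG_TRANSFER_RETRIES=1 the loop body runs once and never sleeps, which matches the single-attempt behavior that existed before this change.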
The disaggregation polling loop currently marks a session as failed the
first time session_failures hits 1. Combined with the batch_transfer_sync
retry loop in the previous commit, this is too aggressive: even after the
transfer-level retries recover, a future transient on the same session can
still tip it into the failed set because the threshold is 1.

Make the threshold configurable via SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD
and raise the default to 10. This trades a small amount of latency on truly
dead sessions for resilience against intermittent fabric hiccups, which in
our experience are by far the more common case at scale on PD disagg with
mooncake.

Setting the env var to 1 restores the prior fail-fast behavior for anyone
who wants it.
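To illustrate the threshold logic, here is a small sketch of the failure accounting. The class SessionFailureTracker and its record method are hypothetical; session_failures, failed_sessions, the environment variable name, and the default of 10 are the only details taken from this PR. The sketch also assumes the counter is cumulative per session and is not reset on success, which the commit message does not state either way.

```python
import os
from collections import defaultdict
from typing import Optional


class SessionFailureTracker:
    """Counts failed transfers per session and marks a session failed at a threshold."""

    def __init__(self, threshold: Optional[int] = None) -> None:
        if threshold is None:
            # Variable name and default come from the commit message.
            threshold = int(
                os.environ.get("SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD", "10")
            )
        self.threshold = max(1, threshold)
        self.session_failures = defaultdict(int)
        self.failed_sessions = set()

    def record(self, session_id: str, ret: int) -> bool:
        """Record one transfer result; return True if the session is marked failed."""
        if ret != 0:
            self.session_failures[session_id] += 1
            if self.session_failures[session_id] >= self.threshold:
                self.failed_sessions.add(session_id)
        return session_id in self.failed_sessions


tracker = SessionFailureTracker(threshold=3)
tracker.record("decode-0", 1)   # one transient failure: still healthy
tracker.record("decode-0", 0)   # success does not reset the count in this sketch
tracker.record("decode-0", 1)   # second failure: still below the threshold
dead = tracker.record("decode-0", 1)  # third failure: session is now marked failed
```

Constructing the tracker with threshold=1 reproduces the fail-fast behavior, where the first non-zero return puts the session into failed_sessions.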
Superseded by #16 (same two commits). Closing as superseded.
Summary
Two small, additive changes to mooncake disaggregation that together make PD sessions resilient to transient transfer errors.
- Retry engine.batch_transfer_sync on non-zero return before reporting failure to the polling loop. Mooncake itself retries WC errors per slice, but handshake RPC failures and endpoint resets under load currently bubble straight up. Knobs:
  - SGLANG_DISAGG_TRANSFER_RETRIES (default 3)
  - SGLANG_DISAGG_TRANSFER_RETRY_BACKOFF_MS (default 50)
- Make the session-failure kill threshold configurable via SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD (default 10, was effectively 1). Without this, a single transient hiccup outside the retry window still permakills the session.
Why
Observed under sustained PD disaggregation load on an H200 cluster: transient handshake / endpoint-reset failures take a session out for the rest of the run, even though the underlying connection recovers within milliseconds. The two changes give the session a chance to recover both within a single transfer (loop) and across transfers (threshold).
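As a rough worked example of the recovery window these defaults buy, the sketch below adds up the sleep time only. It assumes the linear schedule sleeps backoff x k after the k-th failed attempt and that session failures accumulate without reset; neither detail is confirmed by the PR text, and time spent inside the transfers themselves is not counted. The helper name retry_sleep_ms is hypothetical.

```python
def retry_sleep_ms(attempts: int, backoff_ms: int) -> int:
    """Total sleep for one transfer whose every attempt fails.

    Assumes a sleep of backoff_ms * k after failed attempt k, and no sleep
    after the final attempt.
    """
    return backoff_ms * sum(range(1, attempts))


# Defaults from the PR: 3 attempts, 50 ms backoff, threshold 10.
per_transfer = retry_sleep_ms(3, 50)  # 50 + 100 = 150 ms
to_kill = 10 * per_transfer           # 1500 ms of sleep before a dead session is dropped
```

Under these assumptions a connection that recovers within milliseconds is caught by the first or second retry, while a truly dead session costs about 1.5 seconds of added backoff before it is marked failed.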
Behavior
Defaults change behavior, intentionally:
SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD=1.
No new dependencies, no API changes, no changes to call sites outside mooncake/conn.py.
Test plan
failed_sessions.