fix(mooncake): retry transient transfer failures and make session-kill threshold configurable#4

Closed
DavidBellamy wants to merge 52 commits into main from fix/mooncake-transfer-retry-and-threshold

@DavidBellamy
Collaborator

Summary

Two small, additive changes to mooncake disaggregation that together make PD sessions resilient to transient transfer errors.

  1. Retry engine.batch_transfer_sync on non-zero return before reporting failure to the polling loop. Mooncake itself retries WC errors per slice, but handshake RPC failures and endpoint resets under load currently bubble straight up. Knobs:

    • SGLANG_DISAGG_TRANSFER_RETRIES (default 3)
    • SGLANG_DISAGG_TRANSFER_RETRY_BACKOFF_MS (default 50)
    • On recovery, logs a warning naming the session and attempt count.
    • On persistent failure, logs session id, slice count, total bytes, and first src/dst/len before returning the original non-zero ret to the caller.
  2. Make the session-failure kill threshold configurable via SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD (default 10, was effectively 1). Without this, a single transient hiccup outside the retry window still permakills the session.
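The retry behavior in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in mooncake/conn.py: `transfer_with_retry`, `transfer_once`, and the injectable `sleep` parameter are hypothetical names, `transfer_once` stands in for a single `engine.batch_transfer_sync` call, and the linear backoff formula (attempt number times the base delay) is an assumption based on the commit message's mention of linear backoff. The detailed failure diagnostics (slice count, total bytes, first src/dst/len) are omitted.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


def transfer_with_retry(
    transfer_once: Callable[[], int],  # hypothetical stand-in for one batch_transfer_sync call
    session_id: str,
    retries: int = 3,       # mirrors SGLANG_DISAGG_TRANSFER_RETRIES
    backoff_ms: int = 50,   # mirrors SGLANG_DISAGG_TRANSFER_RETRY_BACKOFF_MS
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return 0 on the first successful attempt, else the last non-zero ret."""
    attempts = max(1, retries)
    ret = -1
    for attempt in range(1, attempts + 1):
        ret = transfer_once()
        if ret == 0:
            if attempt > 1:
                # Recovery is only worth a warning if a retry was needed.
                logger.warning(
                    "session %s recovered after %d attempts", session_id, attempt
                )
            return 0
        if attempt < attempts:
            # Assumed linear backoff: 50 ms, 100 ms, ... with the default base.
            sleep(attempt * backoff_ms / 1000.0)
    logger.error(
        "session %s still failing after %d attempts (ret=%d)",
        session_id, attempts, ret,
    )
    return ret
```

Because the function returns as soon as `ret == 0`, a transfer that succeeds on the first attempt never sleeps, which matches the zero-added-latency claim for the success path.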

Why

Observed under sustained PD disaggregation load on an H200 cluster: transient handshake / endpoint-reset failures take a session out for the rest of the run, even though the underlying connection recovers within milliseconds. The two changes give the session a chance to recover both within a single transfer (loop) and across transfers (threshold).

Behavior

Defaults change behavior, intentionally:

  • Retries: 1 → 3. Latency cost on a successful transfer is zero (early return on ret==0).
  • Threshold: 1 → 10. Restoring the prior fail-fast mode is one env var: SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD=1.

No new dependencies, no API changes, no changes to call sites outside mooncake/conn.py.
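As an illustration of how the three knobs and their defaults fit together, here is a small sketch that reads them from an environment mapping. `DisaggKnobs` and `load_knobs` are hypothetical names used only for this example; only the environment variable names and default values come from the description above, and the actual parsing in mooncake/conn.py may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DisaggKnobs:  # hypothetical container, not part of the PR
    transfer_retries: int = 3
    retry_backoff_ms: int = 50
    session_failure_threshold: int = 10


def load_knobs(env: Mapping[str, str] = os.environ) -> DisaggKnobs:
    """Read the three knobs, falling back to the defaults proposed in this PR."""
    return DisaggKnobs(
        transfer_retries=int(env.get("SGLANG_DISAGG_TRANSFER_RETRIES", "3")),
        retry_backoff_ms=int(env.get("SGLANG_DISAGG_TRANSFER_RETRY_BACKOFF_MS", "50")),
        session_failure_threshold=int(
            env.get("SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD", "10")
        ),
    )


# Restoring the prior fail-fast threshold only needs one variable set.
fail_fast = load_knobs({"SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD": "1"})
```

Passing the mapping explicitly keeps the sketch easy to exercise without mutating the real process environment.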

Test plan

  • Verified across multiple long-running PD disagg jobs on an H200 cluster (R3-off agentic RL stack); recovered-on-retry warnings observed in the wild without any session ending up in failed_sessions.
  • Reviewer to confirm the env-var defaults are acceptable for upstream-style behavior, or request that defaults be reverted to (1, 1) with opt-in via env var.

klshuster and others added 30 commits April 12, 2026 16:25
…seek on-demand (sgl-project#21864)

Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
…#21206)

Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: xiezhq-hermann <xiezhq@stanford.edu>
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
…roject#19225)

Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: Kangrui Du <kangruidu@gmail.com>
Co-authored-by: Xiaole Guo <gxlvera@gmail.com>
Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
…mentation (sgl-project#22698)

Co-authored-by: xdtbynd <supercluster@vip.qq.com>
Co-authored-by: zhsurpass <zhsurpass@users.noreply.github.com>
…22700)

Co-authored-by: h30064329 <hanbing45@h-partners.com>
…r docs update (sgl-project#22704)

Co-authored-by: Jianzhao Xu <xujianchao@huawei.com>
Fridge003 and others added 16 commits April 13, 2026 14:39
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…gl-project#21259)

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Wrap engine.batch_transfer_sync in a small retry loop so a single
transient transfer failure does not permakill a PD session.

Background: the disaggregation polling loop increments session_failures
on any non-zero return from batch_transfer_sync, and the default kill
threshold is low. Mooncake has internal per-slice retries for WC
errors but not for handshake RPC failures or endpoint resets under
load, so a single transient error from those paths takes the session
out for the rest of the run. Observed under heavy concurrent transfer
on an H200 cluster running PD disagg with multiple decode pools.

Behavior:
  - Retry up to N attempts with linear backoff between them.
  - Return 0 on first success and log a recovery warning if it took
    more than one attempt.
  - On persistent failure, log session id, slice count, total bytes,
    first src/dst/len, then return the last non-zero ret unchanged so
    the existing failure-accounting path handles it.

Knobs (env vars; setting SGLANG_DISAGG_TRANSFER_RETRIES=1 preserves the existing single-attempt behavior):
  SGLANG_DISAGG_TRANSFER_RETRIES          default 3
  SGLANG_DISAGG_TRANSFER_RETRY_BACKOFF_MS default 50

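To put a rough number on the extra delay a persistently failing transfer would incur, the arithmetic can be sketched as below. It assumes that "linear backoff" means the k-th failed attempt is followed by a sleep of k times the base delay, and that no sleep follows the final attempt; neither detail is confirmed by the commit message, and `worst_case_backoff_ms` is a hypothetical helper. Time spent inside the transfer calls themselves is not counted.

```python
def worst_case_backoff_ms(retries: int = 3, backoff_ms: int = 50) -> int:
    """Total sleep time, in ms, when every attempt fails (assumed linear backoff)."""
    attempts = max(1, retries)
    # Sleeps occur only between attempts, i.e. after attempts 1 .. attempts-1.
    return sum(attempt * backoff_ms for attempt in range(1, attempts))


default_delay = worst_case_backoff_ms()      # 50 + 100 with the defaults
single_attempt = worst_case_backoff_ms(1)    # no retries, so no sleeping
```

Under these assumptions the default settings add at most 150 ms of sleeping to a transfer that ultimately fails, and nothing at all when retries are set to 1.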
The disaggregation polling loop currently marks a session as failed
the first time session_failures hits 1. Combined with the
batch_transfer_sync retry loop in the previous commit, this is too
aggressive: even after the transfer-level retries recover, a future
transient on the same session can still tip it into the failed set
because the threshold is 1.

Make the threshold configurable via SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD
and raise the default to 10. This trades a small amount of latency on
truly dead sessions for resilience against intermittent fabric
hiccups, which in our experience are by far the more common case at
scale on PD disagg with mooncake. Setting the env var to 1 restores
the prior fail-fast behavior for anyone who wants it.
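The threshold logic described in this commit message might look roughly like the sketch below. `SessionFailureTracker` and `record_failure` are hypothetical names; `session_failures` and `failed_sessions` echo the identifiers mentioned in this PR, but the surrounding polling loop is not reproduced, and whether a successful transfer resets the counter is not stated in the source, so it is not modeled here.

```python
from collections import defaultdict


class SessionFailureTracker:  # hypothetical, for illustration only
    def __init__(self, threshold: int = 10):
        # Mirrors SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD; 1 means fail-fast.
        self.threshold = max(1, threshold)
        self.session_failures = defaultdict(int)
        self.failed_sessions = set()

    def record_failure(self, session_id: str) -> bool:
        """Count one failed transfer; return True once the session is marked failed."""
        self.session_failures[session_id] += 1
        if self.session_failures[session_id] >= self.threshold:
            self.failed_sessions.add(session_id)
        return session_id in self.failed_sessions


tracker = SessionFailureTracker(threshold=10)
results = [tracker.record_failure("decode-0") for _ in range(10)]
```

With a threshold of 10, the first nine failures leave the session usable and only the tenth moves it into `failed_sessions`; with a threshold of 1, the first failure does.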
@DavidBellamy
Collaborator Author

Superseded by #16 — same two commits (fix(mooncake): retry transient batch_transfer_sync failures + feat(mooncake): make session-failure kill threshold configurable) on a clean branch off the production pin (0ca02195), without the rebase noise that this PR has accumulated.

Closing as superseded.
