fix(disaggregation): retry transient mooncake transfer failures and make session-kill threshold configurable#16
Open
DavidBellamy wants to merge 44 commits into main from
Conversation
…alistic perf and auto-discover ut (sgl-project#22086) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
…gl-project#21649) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
…uest Adds a [patch.crates-io] override pointing openai-protocol at a fork that adds serde(flatten) to ChatCompletionRequest, matching the existing pattern on CompletionRequest. This allows engine-specific parameters like return_routed_experts to pass through the gateway to workers. Upstream PR: lightseekorg/smg#1119 Remove this patch once the upstream crate publishes with the fix.
…ions Adds completion_token_ids: Optional[List[int]] to ChatCompletionResponseChoice and return_completion_token_ids: bool flag to ChatCompletionRequest. Guarded by the flag; no behavior change when false. Required for token-in-token-out (TITO) on the agentic /v1/chat/completions path used by Harbor external agents during RL rollouts. Without this, the agent loop would have to re-tokenize string content each turn, which drifts from the training forward pass and corrupts the training signal.
Updates sgl-model-gateway's [patch.crates-io] override from the request-side flatten branch (already merged upstream via lightseekorg/smg#1130) to the response-side flatten branch (DavidBellamy/smg:add-flatten-chat-completion-response). The response-side branch is based off upstream/main, which now carries the merged sgl-project#1130 request-side flatten; so a single branch covers both request and response flatten without needing two separate patch entries. Needed so routed_experts and completion_token_ids survive the round-trip through sgl-model-gateway without being dropped by serde's default unknown-field handling on response structs.
Adds two axum routes on the gateway that mirror the Miles rollout
orchestrator's existing worker-management API:
- POST /add_worker?url=... (query param, not JSON body)
- GET /list_workers (returns {"urls": [...]} shape)
The native /workers routes remain the long-term API. The shim lets
Miles switch its base_url from the Miles router to sgl-model-gateway
without client-side changes in W5.
POST /add_worker constructs a WorkerConfigRequest from just the URL and
delegates to worker_service.create_worker(); GET /list_workers reads
the worker registry directly and flattens to URL list form.
Adds five unit tests in TestCompletionTokenIds asserting:
- ChatCompletionRequest accepts return_completion_token_ids=True.
- The flag defaults to False (no behavior change for existing clients).
- ChatCompletionResponseChoice accepts completion_token_ids as a list.
- _serialize drops completion_token_ids when None (mirrors the
existing hidden_states / prompt_token_ids handling).
- _serialize keeps completion_token_ids when populated.
Verified against the actual protocol module (runs standalone; no GPU).
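The None-dropping behavior those tests assert can be sketched as follows. This is a minimal illustrative model, not the actual sglang `_serialize` code; the dataclass and helper are hypothetical, but the field names mirror the ones the PR adds.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

# Optional response fields that should vanish when unset (per the tests:
# completion_token_ids mirrors the existing hidden_states / prompt_token_ids
# handling). Names match the PR; the surrounding types are illustrative.
_OPTIONAL_FIELDS = ("completion_token_ids", "hidden_states", "prompt_token_ids")

@dataclass
class ChoiceLike:
    text: str
    completion_token_ids: Optional[List[int]] = None
    hidden_states: Optional[List[float]] = None
    prompt_token_ids: Optional[List[int]] = None

def serialize(choice: ChoiceLike) -> dict:
    """Drop the optional fields when None, keep them when populated."""
    out = asdict(choice)
    for field in _OPTIONAL_FIELDS:
        if out.get(field) is None:
            out.pop(field, None)
    return out

print(serialize(ChoiceLike(text="hi")))  # no completion_token_ids key
print(serialize(ChoiceLike(text="hi", completion_token_ids=[1, 2])))
```

Existing clients that never set the flag therefore see byte-identical responses, which is what the default-False test pins down.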
WorkerConfigRequest does not implement Default on the current smg branch,
so the prior `..Default::default()` spread fails to compile on a proper
Linux build env. Mac-side cargo check missed this (stale resolution).
Switch to `serde_json::from_value(json!({"url": ...}))` which relies on
the struct's `#[serde(default)]` field annotations to fill in defaults.
More resilient to upstream field additions; fails loud with a 400 if the
shape ever changes.
…let 5) Adds a tokenizer_sha256 field to /model_info so agent-side consistency checks (harbor.utils.tokenizer_identity.assert_tokenizer_matches in the Harbor tito/agentic-path-infra branch) can abort when the trainer and rollout-worker tokenizers have drifted. The hash is SHA256 of the HF fast tokenizer's canonical JSON form (backend_tokenizer.to_str()), which is stable across loading methods (local path, HF Hub, programmatic construction) and changes iff tokenization output could change. Computed once and cached; returns None for slow tokenizers (no stable canonical form). A silent tokenizer mismatch between trainer and rollout is a known failure mode for TITO (PRD W-TITO bullet 5, risk R11); this patch is the worker side of that consistency check.
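The hashing step can be sketched as below. The helper is generic and runnable on any string; in the worker, the input would come from a HF fast tokenizer via `tokenizer.backend_tokenizer.to_str()` as the commit describes, and callers would return None for slow tokenizers. The function name and the sample input here are illustrative.

```python
import hashlib
import json

def tokenizer_sha256(canonical_json: str) -> str:
    """SHA256 over a tokenizer's canonical JSON form.

    Stable across loading methods as long as the canonical form matches,
    so trainer and rollout worker can compare hashes to detect drift.
    """
    return hashlib.sha256(canonical_json.encode("utf-8")).hexdigest()

# Stand-in for backend_tokenizer.to_str() output (illustrative only).
fake_canonical = json.dumps({"model": {"type": "BPE"}}, sort_keys=True)
print(tokenizer_sha256(fake_canonical))
```

Computing this once at startup and caching it keeps /model_info cheap while still letting the agent-side `assert_tokenizer_matches` check abort on mismatch.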
…e (tito_debug) Temporary DEBUG-level logging (target: smg::tito_debug) to localize where completion_token_ids is dropped on the /v1/chat/completions pass-through. Enable with RUST_LOG=smg::tito_debug=debug. Revert once localized.
…=1.0.0 pin) The prior [patch.crates-io] override never applied: crates.io pins openai-protocol at 1.0.0, our smg fork is at 1.7.0, so cargo silently fell back to the frozen crates.io version without the serde(flatten) catch-all. See docs/gateway-smg-version-drift.md in LLM360/RL360#76 for the full localization writeup. Switching to a direct git dependency on the fork branch so cargo takes our smg at its declared version and the flatten actually lands in the built binary. Will drive gateway source updates for the upstream smg API drift in follow-up commits.
… fork) Gateway byte-streams responses (see http/router.rs:584 — res.bytes() -> Body::from(body), no ChatCompletionResponse deserialization), so the response-side serde(flatten) I added in DavidBellamy/smg:add-flatten-chat-completion-response is NOT needed by the gateway. It only adds value for gRPC paths and for any future deserialize-response callers. The ONLY flatten the gateway needs is request-side on ChatCompletionRequest, which upstream smg main already carries via sgl-project#1130 (merged). Pointing at upstream directly lets the request-side flatten apply and skips the struct-literal maintenance burden in smg builders that my response-side additions introduced. Upstream smg PR lightseekorg/smg#1160 (response flatten) remains open as an independent improvement; it is not on this gateway's critical path anymore.
The prior direct-git dep only affected sgl-model-gateway's top-level dependency; transitive deps (smg-grpc-client, reasoning-parser, llm-tokenizer, tool-parser) still pulled openai-protocol 1.0.0 from crates.io, so the workspace had TWO incompatible openai-protocol versions linked simultaneously, giving 111 type errors. Restoring =1.0.0 and patching crates.io to a new fork branch (DavidBellamy/smg:gateway-patch-1.0.0) that carries upstream main's code but declares itself version 1.0.0 makes the patch apply to every openai-protocol consumer in the workspace. One version in the graph, and it's the one that has the ChatCompletionRequest flatten.
Expands the workspace patch so every smg-family transitive dep (smg-auth, smg-mcp, smg-mesh, smg-wasm, smg-grpc-client, reasoning-parser, llm-tokenizer, tool-parser) resolves to the same fork branch (gateway-patch-1.0.0), not crates.io. Prevents old-shape smg crates from colliding with the new openai-protocol shape.
…mg 1.7.0) Replaces the 9-crate smg workspace patch with a single minimal patch: openai-protocol 1.0.0 source verbatim + serde(flatten) on ChatCompletionRequest. Everything else stays as published crates.io 1.0.0, so gateway source compiles without the 168-error smg API drift cascade that full upstream main triggered. Scope: fixes request-side unknown-field round-trip (return_completion_token_ids, return_routed_experts, input_ids, extra_body). Response side stays byte-streamed; no response flatten needed at the gateway.
…gfault Adds threading.Lock around register/batch_register and engine init to prevent concurrent ibv_reg_mr on GPU memory from multiple DP worker threads. nvidia-peermem's page callback segfaults when handling concurrent GPU memory registration from multiple IB VF contexts on M2's SR-IOV devices. Sequential registration works on all 8 HCAs. Reproducer: /mnt/weka/shrd/k2pta/rl360/ibv_reg_mr_stress.py (segfaults without this patch, passes with it). Tracked in #1 and LLM360/posttraining_knowledge#109.
…r shim The W2 Miles-API compatibility shim previously accepted only `?url=` and registered every worker as Regular. After miles#990 relaxed the client-side assert on /add_worker, PD-disagg rollouts using Miles forward worker_type=prefill|decode (and bootstrap_port for prefill) as query params on the same endpoint. This patch reads those params and plumbs them into the JSON passed to serde_json::from_value<WorkerConfigRequest> so the gateway's existing PD routing layers (routers/http/pd_router.rs etc.) receive properly-typed workers. Matches the shape that src/service_discovery.rs:433 already sends for PD pods discovered in-cluster: worker_type is a flat Option<String> alongside bootstrap_port: Option<u16> -- not nested. Explicit 'regular' and missing worker_type both preserve the original regular-only behavior. Unblocks PD disaggregation in the Miles agentic rollout loop without switching away from sgl-model-gateway. Keeps cache_aware + serde(flatten) response passthrough intact (unlike the alternative of swapping in sglang_router.launch_router per-variant).
#5 inadvertently switched the workflow's UPSTREAM from LLM360/sglang:llm360-main (the 19-commit accumulator of our patches) to sgl-project/sglang:main (pure upstream). The next scheduled deploy rebuild would have dropped our 19 accumulated patch commits from deploy, keeping only the currently-open PRs. Fix: point UPSTREAM at LLM360/sglang itself and change the base branch reference to llm360-main. This restores the pre-#5 behavior where deploy = llm360-main + open PRs, while keeping the #5 improvements intact (daily 02:00 UTC cron + preserve fork-local .github/workflows). Preserve step already reads origin/llm360-main so no further change needed.
Port of upstream PR sgl-project#23003 applied directly on llm360-main so the octopus-merge into deploy picks it up for tonight's agentic RL pilots. Blocking fix: sglang PD disaggregation currently rejects the launcher's rail-mapping JSON path. Adds a 2-line early return in _validate_ib_devices() that passes JSON content ({...}) and .json file paths through unchanged; get_ib_devices_for_gpu() in mooncake_transfer_engine.py already handles those formats downstream.
…roject/sglang (#10) The other 4 LLM360 forks (miles, harbor, smg, Megatron-LM) use their stabilized LLM360-friendly upstream (radixark, harbor-framework, lightseekorg) as BOTH the deploy reset base AND the source of open PRs with LLM360-owned heads. That pattern breaks for sglang: sgl-project/sglang moves very fast and does NOT contain our patch stack, so resetting deploy to it would wipe sglang-miles/TITO/MTP/rail-map every build. The old self-referential workaround (UPSTREAM=LLM360/sglang) kept the reset correct but left a gap: upstream PRs the team authors against sgl-project/sglang (e.g. rail-map JSON passthrough in sgl-project#23003) don't land in our nightly image until they merge upstream. That blocks tonight's agentic RL pilots. Decouple the two concerns: - DEPLOY_BASE + DEPLOY_BASE_BRANCH = LLM360/sglang:llm360-main (what we reset to; contains our patch stack) - REAL_UPSTREAM + REAL_UPSTREAM_BRANCH = sgl-project/sglang:main (where we also query for open PRs with LLM360-owned heads) Deploy = DEPLOY_BASE + cross-fork PRs into llm360-main + fork-local PRs + LLM360-head PRs open against sgl-project/sglang:main. Dedup by branch name. Branches are all fetchable via origin=LLM360/sglang since the head branch always lives there regardless of which base the PR points at.
…project#23003 (#11) Upstream PR sgl-project#23003 (fix/json-ib-device-passthrough) is superseded by #9 (fix/json-ib-device-passthrough-llm360-main), which mirrors the same 7-line patch on top of llm360-main. The upstream branch was based on sgl-project/sglang:main and line-drifts when merged into llm360-main — was triggering a merge conflict in every octopus run after the PR #10 decouple landed. Port harbor's SKIP_UPSTREAM_BRANCHES pattern: env var + jq `NOT_SKIPPED` filter applied to both UPSTREAM_PR_STATE/BRANCHES (cross-fork PRs on LLM360/sglang:llm360-main) and REAL_UPSTREAM_PR_STATE/BRANCHES (upstream sgl-project/sglang PRs). Removes the skipped branch cleanly from the state fingerprint too, so re-running doesn't see it as a pending change.
…ds-response=8182368a8f543c6b3c7ed0f4a1847042a01d2b79,fix/json-ib-device-passthrough-llm360-main=18a70f030015310ccb08b0093a3cbe0bb57de371,patch-openai-protocol=df4a3b4588ae95a18c0c5512f326ccddf558e132
Wrap engine.batch_transfer_sync in a small retry loop so a single
transient transfer failure does not permakill a PD session.
Background: the disaggregation polling loop increments session_failures
on any non-zero return from batch_transfer_sync, and the default kill
threshold is low. Mooncake has internal per-slice retries for WC
errors but not for handshake RPC failures or endpoint resets under
load, so a single transient error from those paths takes the session
out for the rest of the run. Observed under heavy concurrent transfer
on an H200 cluster running PD disagg with multiple decode pools.
Behavior:
- Retry up to N attempts with linear backoff between them.
- Return 0 on first success and log a recovery warning if it took
more than one attempt.
- On persistent failure, log session id, slice count, total bytes,
first src/dst/len, then return the last non-zero ret unchanged so
the existing failure-accounting path handles it.
Knobs (env; setting retries to 1 restores the existing single-attempt behavior):
SGLANG_DISAGG_TRANSFER_RETRIES default 3
SGLANG_DISAGG_TRANSFER_RETRY_BACKOFF_MS default 50
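The retry wrapper described above can be sketched as follows. The `engine.batch_transfer_sync` call signature and the argument names are assumptions for illustration; the return-value contract (0 on success, last non-zero ret passed through on persistent failure) follows the text.

```python
import logging
import os
import time

logger = logging.getLogger(__name__)

# Env knobs from the commit message, with its defaults.
_RETRIES = int(os.environ.get("SGLANG_DISAGG_TRANSFER_RETRIES", "3"))
_BACKOFF_MS = int(os.environ.get("SGLANG_DISAGG_TRANSFER_RETRY_BACKOFF_MS", "50"))

def batch_transfer_with_retry(engine, session_id, src_addrs, dst_addrs, lengths):
    """Retry transient batch_transfer_sync failures before reporting them."""
    ret = -1
    for attempt in range(1, _RETRIES + 1):
        ret = engine.batch_transfer_sync(session_id, src_addrs, dst_addrs, lengths)
        if ret == 0:
            if attempt > 1:
                # Recovery warning so transient hiccups stay visible in logs.
                logger.warning("transfer recovered on attempt %d, session=%s",
                               attempt, session_id)
            return 0
        if attempt < _RETRIES:
            time.sleep(_BACKOFF_MS * attempt / 1000.0)  # linear backoff
    # Persistent failure: log context, then hand the last non-zero ret back
    # unchanged so the existing failure-accounting path handles it.
    logger.error("transfer failed after %d attempts: session=%s slices=%d "
                 "bytes=%d first src/dst/len=%s/%s/%s",
                 _RETRIES, session_id, len(lengths), sum(lengths),
                 src_addrs[0], dst_addrs[0], lengths[0])
    return ret
```

With the defaults, a transfer that fails twice and then succeeds returns 0 after ~150 ms instead of incrementing session_failures.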
The disaggregation polling loop currently marks a session as failed the first time session_failures hits 1. Combined with the batch_transfer_sync retry loop in the previous commit, this is too aggressive: even after the transfer-level retries recover, a future transient on the same session can still tip it into the failed set because the threshold is 1. Make the threshold configurable via SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD and raise the default to 10. This trades a small amount of latency on truly dead sessions for resilience against intermittent fabric hiccups, which in our experience are by far the more common case at scale on PD disagg with mooncake. Setting the env var to 1 restores the prior fail-fast behavior for anyone who wants it.
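The threshold check can be sketched as below. The helper names are illustrative; only the env var, its default of 10, and the `env=1` fail-fast escape hatch come from the commit message.

```python
import os

def session_failure_threshold() -> int:
    # Default 10 per this commit; set to 1 to restore the prior
    # fail-fast behavior.
    return int(os.environ.get("SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD", "10"))

def should_kill_session(session_failures: int) -> bool:
    """Sketch of the polling-loop check that tips a session into the failed set."""
    return session_failures >= session_failure_threshold()
```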
Summary
Two small additive changes to PD disaggregation that together make sessions resilient to transient transfer failures:
Retry `engine.batch_transfer_sync` on non-zero return before reporting failure to the polling loop. Knobs: `SGLANG_DISAGG_TRANSFER_RETRIES` (default 3) and `SGLANG_DISAGG_TRANSFER_RETRY_BACKOFF_MS` (default 50).
Make the session-failure kill threshold configurable via `SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD` (default 10, was effectively 1). Without this, a single transient hiccup outside the retry window still permakills the session.
Why
Observed under sustained PD disaggregation load with multiple decode pools: transient handshake / endpoint-reset failures take a session out for the rest of the run, even though the underlying connection recovers within milliseconds. Mooncake's internal per-slice retries cover WC errors but not handshake-RPC failures or endpoint resets under load.
The two changes give the session a chance to recover both within a single transfer (loop) and across transfers (threshold).
Behavior change
Defaults change behavior, intentionally: transfers now retry up to 3 times (`SGLANG_DISAGG_TRANSFER_RETRIES=1` restores single-attempt behavior) and the session-failure threshold is 10 (`SGLANG_DISAGG_SESSION_FAILURE_THRESHOLD=1` restores the prior fail-fast behavior).
No new dependencies, no API changes, no changes to call sites outside `mooncake/conn.py`.
Provenance
Branch is based on the production pin (`0ca02195`), so this is directly applicable on top of what's running today. Supersedes #4 (which had rebase noise).