Add DSV4 FP4 GB300 dynamo-sglang MTP disagg benchmarks #1297
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25513269866

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25513378863
```yaml
model:
  path: "deepseek-v4-pro"
  container: "lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev"
  precision: "mxfp4"
```
🟡 All 6 new MTP recipes set model.precision: "mxfp4", but every existing sibling dsv4 SGLang recipe in benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/ uses precision: "fp4" — even though they share the same moe-runner-backend: flashinfer_mxfp4 — and the matrix entry dsv4-fp4-gb300-dynamo-sglang-mtp itself has precision: fp4. Nit: align all 6 MTP recipes to precision: "fp4" to match the established convention; this is metadata-only (InferenceX aggregation keys off the matrix-level precision, not the recipe yaml), so runtime impact is minimal.
Extended reasoning...
What the inconsistency is
Each of the 6 new files at benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/mtp/*.yaml has:
```yaml
model:
  path: "deepseek-v4-pro"
  container: "lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev"
  precision: "mxfp4"
```

Whereas all 6 pre-existing sibling recipes at `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-*.yaml` use `precision: "fp4"` (line 37 of each), despite carrying the same `moe-runner-backend: "flashinfer_mxfp4"` setting in their `sglang_config`. The matrix entry added in `.github/configs/nvidia-master.yaml` for these MTP recipes also uses `precision: fp4`, and AGENTS.md lists only `fp4` and `fp8` as recognized precisions in the project.
Step-by-step proof of the divergence
- Open `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/mtp/disagg-low-latency-1p1d-tp4-tp4.yaml` line 15: `precision: "mxfp4"`.
- Open `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-dep4-dep8-3-c256.yaml` (or any of the 6 sibling recipes added in "Update DeepSeek V4 Pro FP4 GB300 disaggregated SGLang benchmarks" #1295) around line 37: `precision: "fp4"`.
- Both files set `moe-runner-backend: "flashinfer_mxfp4"` in their `sglang_config.decode` blocks.
- Open `.github/configs/nvidia-master.yaml` at the new `dsv4-fp4-gb300-dynamo-sglang-mtp:` block: `precision: fp4`.
So within the same PR, the matrix says fp4 and the recipe yamls say mxfp4, while the equivalent non-MTP sibling recipes that share the same MoE backend say fp4 at the recipe level too. That is a copy-paste inconsistency with the established convention.
Addressing the refutation: what the runtime impact actually is
The refutation correctly notes that InferenceX's own aggregation pipelines (utils/summarize.py, utils/collect_eval_results.py, utils/matrix_logic/generate_sweep_configs.py, launch_gb300-cw.sh) key off the matrix-level precision field from nvidia-master.yaml, not the recipe yaml's model.precision. Since the matrix entry is correctly fp4, in-repo aggregation/labeling is unaffected — the original framing of "confusing labels in eval/result aggregation pipelines" overstates the impact. The recipe-level field is consumed externally by srt-slurm/srtctl, and the upstream source (elvischenv/srt-slurm@dsv4-gb300-disagg-8k1k-mtp) presumably accepts mxfp4. So this is not a runtime breakage.
Why it's still worth fixing
It is purely a cross-recipe metadata uniformity nit: every sibling dsv4 SGLang recipe in the same directory tree, even ones using the identical flashinfer_mxfp4 MoE backend, declares precision: "fp4" at the recipe level. The mxfp4 label here will trip up future grep-based audits and contradicts the project-wide enum in AGENTS.md. The fix is to replace precision: "mxfp4" with precision: "fp4" on line 15 of all 6 new MTP recipes — no other change required.
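A minimal sketch of the corrected `model` block, reusing the exact fields quoted above (the only change is the `precision` value):

```yaml
model:
  path: "deepseek-v4-pro"
  container: "lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev"
  precision: "fp4"  # was "mxfp4"; now matches the matrix entry and the non-MTP siblings
```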
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25539877826
Force-pushed from ea35b7b to ce53cf1
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25539890483
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25539917178
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25540254945
/sweep
@ch-wan Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25540780423
/sweep
@ch-wan Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25541720592
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25541004960
/sweep
@ch-wan Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25542023314
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25541718378
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25542020807
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25542257974
Force-pushed from 06ff413 to 876b595
Without `cpus-per-task: 144` and `mem: 0`, slurm hands out 1 CPU and ~4 MB per task, and the dynamo cold source build (~500 rust crates) is OOM-killed before any worker comes up. Manifests as `Sweep failed (exit code: 137)` ~30 s after orchestrator start. Mirrors the block already present in the working main 8k1k recipes (e.g. disagg-gb300-1p1d-tp4-tp4-2-c1.yaml). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
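As a rough sketch of the block the commit describes, assuming these keys sit in the recipe's top-level `resources` section like the sibling 8k1k recipes (the exact nesting is an assumption, not verified against the repo):

```yaml
resources:
  cpus-per-task: 144  # Slurm's default of 1 CPU per task starves the ~500-crate dynamo source build
  mem: 0              # 0 = request all node memory; the tiny default gets the build OOM-killed (exit 137)
```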
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25542490551
Mirrors the convention used elsewhere in the repo: per-config files at the same depth as their non-MTP siblings, distinguished only by the -mtp suffix. CONFIG_FILE references in nvidia-master.yaml updated accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
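A hypothetical sketch of how one updated `CONFIG_FILE` reference in `nvidia-master.yaml` might look after the move; the `CONFIG_FILE` key is named in the commit message, but the surrounding keys and the exact filename here are illustrative assumptions:

```yaml
# .github/configs/nvidia-master.yaml (sketch)
dsv4-fp4-gb300-dynamo-sglang-mtp:
  precision: fp4
  spec-decoding: "mtp"
  # recipe sits at the same depth as its non-MTP sibling, with only the -mtp suffix added
  CONFIG_FILE: benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-mtp.yaml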
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25545495883
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25545748138
Additional findings (outside current diff — PR may have been updated during review):
- 🟡 `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:141-146` — This PR removes `disable-radix-cache: true` from both the prefill (line 124) and decode (line 145) blocks of the existing non-MTP recipe `disagg-gb300-1p1d-tp4-tp4-2-c1.yaml`, but the change is out of scope for an "Add ... MTP disagg benchmarks" PR and is not tracked in the new perf-changelog entry (which lists only `dsv4-fp4-gb300-dynamo-sglang-mtp` under `config-keys`, not the affected `dsv4-fp4-gb300-dynamo-sglang`). Runtime impact at conc=1 is essentially zero (no cross-request prefix reuse), but please either revert the two deletions to keep the PR scoped to MTP additions, or add `dsv4-fp4-gb300-dynamo-sglang` to the changelog `config-keys` so the perf delta is tracked.

Extended reasoning...
What's happening
The diff against `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml` removes two lines that this PR otherwise has no business touching:

```diff
@@ backend.sglang_config.prefill (around line 124) @@
  trust-remote-code: true
- disable-radix-cache: true
@@ backend.sglang_config.decode (around line 145) @@
  trust-remote-code: true
- disable-radix-cache: true
```
That recipe powers the already-merged `dsv4-fp4-gb300-dynamo-sglang` matrix entry in `.github/configs/nvidia-master.yaml`. Removing `disable-radix-cache: true` re-enables SGLang's radix prefix cache (the default) on a baseline that previously ran with it disabled.

Why this is out of scope
The PR is titled "Add DSV4 FP4 GB300 dynamo-sglang MTP disagg benchmarks" and the description scopes the change to: 6 new MTP recipes under `.../mtp/`, the new `dsv4-fp4-gb300-dynamo-sglang-mtp` matrix entry, and the corresponding `perf-changelog.yaml` entry. The non-MTP `disagg-gb300-1p1d-tp4-tp4-2-c1.yaml` recipe is not mentioned anywhere — so a behavioral flip on a sibling baseline ships invisibly.

The new `perf-changelog.yaml` entry (lines 2303–2309) only lists:

```yaml
- config-keys:
    - dsv4-fp4-gb300-dynamo-sglang-mtp
```
It does not list `dsv4-fp4-gb300-dynamo-sglang` (the config-key the modified recipe actually serves), so per AGENTS.md guidance the radix-cache flip won't trigger benchmark re-runs on the non-MTP baseline and won't be tracked in the historical record.

Step-by-step proof
- Pre-PR file at `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml` had `disable-radix-cache: true` in both `backend.sglang_config.prefill` and `backend.sglang_config.decode`.
- Diff hunks at lines 121 (prefill) and 141 (decode in the new file numbering) delete both lines.
- Recipe ships under matrix key `dsv4-fp4-gb300-dynamo-sglang` (see the `.github/configs/nvidia-master.yaml` block above the new `-mtp` section), which is unrelated to this PR's stated scope.
- The new changelog entry's `config-keys` list contains only `dsv4-fp4-gb300-dynamo-sglang-mtp` — so any tooling that walks the changelog to figure out "what configs changed in this PR" will miss the radix-cache flip.
Impact
Genuinely minimal at runtime: this recipe runs at `concurrencies: "1"` only, so there is no cross-request prefix overlap for the radix cache to exploit; TTFT/throughput numbers won't visibly shift. That's why this is a nit, not a normal severity.

The concern is hygiene: an undocumented behavioral change on an existing baseline is exactly the kind of edit that makes future perf regressions hard to bisect.
How to fix
Either:
- Keep the PR scoped — restore the two `disable-radix-cache: true` lines in `disagg-gb300-1p1d-tp4-tp4-2-c1.yaml` (prefill at ~line 124, decode at ~line 145), or
- Document the change — add `dsv4-fp4-gb300-dynamo-sglang` to the new changelog entry's `config-keys` and a description line explaining the radix-cache re-enablement, so the perf delta is tracked (a sketch follows below).
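A sketch of option 2's changelog edit, extending the entry quoted above; the `description` field name and wording are illustrative, not copied from `perf-changelog.yaml`:

```yaml
- config-keys:
    - dsv4-fp4-gb300-dynamo-sglang-mtp
    - dsv4-fp4-gb300-dynamo-sglang   # added: the non-MTP key the modified recipe serves
  description: >-
    Re-enable radix cache on disagg-gb300-1p1d-tp4-tp4-2-c1
    (removed disable-radix-cache: true from prefill and decode)
```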
- 🟡 `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/mtp/disagg-low-latency-1p1d-tp4-tp4.yaml:67-68` — Nit: the trailing comment in the `decode_environment` block (lines 67-68) — "SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2 is single-node only and corrupts results in 2-node decode setups" — is a copy-paste from the multi-node mid-curve siblings and is inverted here. This recipe's decode is single-node (`decode_nodes: 1`, `decode_workers: 1`, TP=4 on `gpus_per_node: 4`), exactly the topology the comment says CAR_V2 supports. The same stale comment also appears in `disagg-low-latency-1p6d-dep4-tp4.yaml`, where each decode worker is single-node. Either set `SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1"` to match the `prefill_environment` in 1p6d-dep4-tp4 (which DOES set it for single-node prefill), or just drop the misleading comment. Doc-only — no runtime impact.

Extended reasoning...
What the bug is
The `decode_environment` block in `disagg-low-latency-1p1d-tp4-tp4.yaml` ends with:

```yaml
# SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2
# is single-node only and corrupts results in 2-node decode setups.
```
The comment's premise ("single-node only") is fine, but its conclusion ("so we omit it here") is inverted relative to the topology of this file. The decode here is single-node, so by the comment's own rationale CAR_V2 would be the applicable option, not the dangerous one.
How the comment got there
Looking at the sibling files added in the same PR:

- `disagg-mid-curve-1p1d-dep4-dep8.yaml`: `decode_nodes: 2`, `gpus_per_decode: 8`, decode TP=8 — multi-node, comment is correct.
- `disagg-mid-curve-1p1d-dep4-dep16.yaml`: `decode_nodes: 4`, `gpus_per_decode: 16`, decode TP=16 — multi-node, comment is correct.
- `disagg-mid-curve-2p1d-dep4-dep8.yaml` / `disagg-mid-curve-4p1d-dep4-dep8.yaml`: same multi-node decode, comment is correct.
- `disagg-low-latency-1p1d-tp4-tp4.yaml` (this file): `decode_nodes: 1`, `decode_workers: 1`, decode TP=4 on `gpus_per_node: 4` — single-node, comment's rationale does not apply.
- `disagg-low-latency-1p6d-dep4-tp4.yaml`: `decode_nodes: 6`, `decode_workers: 6` — i.e. each of 6 decode workers is single-node (TP=4 on a single 4-GPU node), so the comment's rationale also does not apply per-worker.
The comment is verbatim across all six recipes, so it was clearly stamped from the multi-node template and not re-evaluated for the low-latency single-node files.
Step-by-step proof for 1p1d-tp4-tp4
- The `resources:` block declares `gpus_per_node: 4`, `decode_nodes: 1`, `decode_workers: 1` — one decode worker, on one node, with 4 GPUs available.
- `sglang_config.decode` declares `tensor-parallel-size: 4`, `data-parallel-size: 1`, `expert-parallel-size: 1` — the worker uses all 4 local GPUs via TP only, no inter-node comm.
- By the comment's own claim ("CAR_V2 is single-node only"), this is precisely the supported topology (condensed in the sketch after this list).
- The matching `prefill_environment` in this same file is also single-node, but does not set CAR_V2 either — and the prefill environment in `disagg-low-latency-1p6d-dep4-tp4.yaml` does set `SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1"` for its (also) single-node prefill worker. So elsewhere in the PR, single-node + CAR_V2 is the established choice.
- The comment as written gives a future maintainer a false reason for the omission, obscuring whether CAR_V2 was deliberately skipped here for some other reason or just left off by inertia.
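Condensing the proof into the recipe fields it cites — values are taken from the list above, while the exact nesting is an assumption based on how the blocks are referenced in this thread:

```yaml
resources:
  gpus_per_node: 4
  decode_nodes: 1     # one node...
  decode_workers: 1   # ...running one decode worker
sglang_config:
  decode:
    tensor-parallel-size: 4   # all 4 local GPUs via TP — no inter-node traffic
    data-parallel-size: 1
    expert-parallel-size: 1
```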
Why existing review didn't catch it
The other inline comments on this PR call out cross-file metadata divergences (`mxfp4` vs `fp4`, missing `gpus_per_prefill`/`gpus_per_decode`, the `TBD` PR link). This one is more subtle: the comment looks correct in 4 of 6 files, and only the topology context flips its meaning in the 2 low-latency files.

Impact
Doc-only — no runtime impact. The env var is unset (default behavior), so the runtime is whatever `SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2` defaults to. The harm is to future readers/auditors and to anyone copying this recipe to a similar single-node decode setup who would (correctly) read the comment as a recommendation against CAR_V2 in their topology.

How to fix
Two acceptable options:
- Drop the comment from `disagg-low-latency-1p1d-tp4-tp4.yaml` lines 67-68 and from `disagg-low-latency-1p6d-dep4-tp4.yaml` lines 79-80. This is the minimum viable fix — the env is left unset, matching the current behavior, but without the misleading rationale.
- Enable CAR_V2 on decode: add `SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1"` to the `decode_environment` of both low-latency files, mirroring the `prefill_environment` of `1p6d-dep4-tp4.yaml`. This actually exercises CAR_V2 in the topology where it is supposed to be safe (see the sketch below).
Either way, the multi-node mid-curve siblings should keep the comment as-is — it is correct for them.
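A minimal sketch of option 2 for the two low-latency files, assuming `decode_environment` takes plain string-valued env entries like the `prefill_environment` cited above:

```yaml
decode_environment:
  SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1"  # decode is single-node here — the topology CAR_V2 targets
```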
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25545848356
Summary
- 6 new MTP recipes under `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/mtp/` (low-latency 1p1d-tp4 / 1p6d-dep4-tp4 + mid-curve dep4-dep8/dep16 with 1p, 2p, 4p prefill)
- New matrix entry `dsv4-fp4-gb300-dynamo-sglang-mtp` in `.github/configs/nvidia-master.yaml`, each entry carrying `spec-decoding: "mtp"` and the corresponding topology
- Based on `elvischenv/srt-slurm@dsv4-gb300-disagg-8k1k-mtp`, repointed at the public `lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev` container and the `deepseek-v4-pro` model alias

Test plan

- `/sweep` on this PR — verify the matrix dispatches the 6 new MTP entries
- Confirm `dsv4-fp4-gb300-dynamo-sglang-mtp` rows appear in the sweep matrix listing

🤖 Generated with Claude Code