Add DSV4 B200 Dynamo vLLM disagg #1303
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Force-pushed d2d12d5 to 2191510 (Compare)
Claude finished @Oseltamivir's task in 4m 2s — View job

PR Review: Add DSV4 B200 Dynamo vLLM disagg

Summary: No blocking issues found. Two warnings posted as inline comments on the launch script. Validations passed.
```bash
elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "dsv4" ]]; then
    git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
    cd "$SRT_REPO_DIR" || exit 1
    git checkout aflowers/vllm-gb200-v0.20.0
```
🟡 WARNING: This checks out a feature branch (aflowers/vllm-gb200-v0.20.0) on NVIDIA/srt-slurm. If that branch is deleted, force-pushed, or renamed upstream, all dsv4-fp4-b200-dynamo-vllm jobs will silently break.
Why it matters: Feature branches on external repos are ephemeral — this creates a fragile dependency.
Fix: Consider pinning to a specific commit SHA instead of a branch name for reproducibility:
```diff
- git checkout aflowers/vllm-gb200-v0.20.0
+ git checkout <specific-commit-sha>  # aflowers/vllm-gb200-v0.20.0 as of YYYY-MM-DD
```
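One way the pin could be enforced is sketched below. This is a suggestion only, not the launcher's actual code: the `checkout_pinned` helper and the `SRT_PIN_SHA` variable are hypothetical names, and the actual commit SHA must be recorded from the branch at merge time.

```shell
# checkout_pinned REPO_URL DEST_DIR COMMIT_SHA
# Clone REPO_URL into DEST_DIR and check out the exact commit, detached,
# failing loudly if the commit no longer exists upstream.
checkout_pinned() {
  local repo_url="$1" dest_dir="$2" sha="$3"
  git clone --quiet "$repo_url" "$dest_dir" || return 1
  git -C "$dest_dir" checkout --quiet --detach "$sha" \
    || { echo "pinned commit $sha not found in $repo_url" >&2; return 1; }
}

# In the launcher this would replace the branch checkout, e.g.:
#   checkout_pinned https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" "$SRT_PIN_SHA"
# where SRT_PIN_SHA holds the recorded SHA of aflowers/vllm-gb200-v0.20.0.
```

A detached checkout of a SHA is immutable with respect to upstream branch deletions or force-pushes, which is the failure mode the warning describes.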
```bash
if [[ "$IS_AGENTIC" == "1" ]]; then
    git clone --branch cam/sa-submission-q2-2026 --single-branch https://github.com/cquil11/srt-slurm-nv.git "$SRT_REPO_DIR"
    cd "$SRT_REPO_DIR" || exit 1
elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "dsv4" ]]; then
```
🟡 WARNING: The existing sed at line 146 (`sed -i 's/^ max_attempts: [0-9]*/ max_attempts: 720/'`) will override all 5 new B200 recipes' `max_attempts: 1440` down to 720. The recipes intentionally set 1440 (≈ 4 h of health-check polling) with comments about absorbing cold-cache model loads on B200, but the blanket sed will halve that to ≈ 2 h.
Why it matters: If the DSV4-Pro model load on B200 genuinely needs the longer timeout, jobs could fail with spurious health-check timeouts before the model finishes loading.
Fix: Guard the sed so it only bumps recipes that ship with the old 360 default, or skip it for dynamo-vllm recipes:
```bash
if ! grep -q 'max_attempts: 1440' "${CONFIG_FILE%%:*}"; then
    sed -i 's/^ max_attempts: [0-9]*/ max_attempts: 720/' "${CONFIG_FILE%%:*}"
fi
```

```bash
elif [[ $MODEL_PREFIX == "dsv4" && $PRECISION == "fp4" && $FRAMEWORK == "dynamo-vllm" ]]; then
    SELECTED_MODEL_PATH=""
    if [[ -n "${MODEL_PATH:-}" && -d "${MODEL_PATH}" ]]; then
        SELECTED_MODEL_PATH="$MODEL_PATH"
    else
        for candidate in /lustre/fsw/models/deepseek-v4-pro /lustre/fsw/models/dsv4-pro /lustre/fsw/models/DeepSeek-V4-Pro; do
            if [[ -d "$candidate" ]]; then
                SELECTED_MODEL_PATH="$candidate"
                break
            fi
        done
    fi
    export MODEL_PATH="${SELECTED_MODEL_PATH:-/lustre/fsw/models/deepseek-v4-pro}"
    export SRT_SLURM_MODEL_PREFIX="deepseek-v4-pro"
```
🔴 The shared sed at `runners/launch_b200-dgxc.sh:146` (`sed -i 's/^ max_attempts: [0-9]*/ max_attempts: 720/'`) matches any value, so the new B200 vLLM recipes' deliberate `max_attempts: 1440` (documented in-file as needed for cold-cache DSV4-Pro loads) gets silently rewritten to 720, halving the intended 14400 s / 4 h health-check window to 7200 s / 2 h. Make the rewrite a floor (only set 720 when the current value is lower) so it stops downgrading recipes that already specify a higher value, e.g. `awk -i inplace '/^ max_attempts:/ {n=$2+0; if (n<720) $0=" max_attempts: 720"} 1' "${CONFIG_FILE%%:*}"`.
Extended reasoning...
What the bug is

The pre-existing line at `runners/launch_b200-dgxc.sh:146`:

```bash
sed -i 's/^ max_attempts: [0-9]*/ max_attempts: 720/' "${CONFIG_FILE%%:*}"
```

was introduced as a bump: the comment immediately above (lines 143–145) explains it raises DSR1-FP8's default `max_attempts: 360` to 720 (3600 s → 7200 s) so large-model loads off shared FS finish in time. The substitution pattern `[0-9]*`, however, matches any numeric value, so it is really a force-set, not a floor.
How this PR triggers it

This PR adds a new elif branch at lines 57–62 that routes `FRAMEWORK=dynamo-vllm` + `MODEL_PREFIX=dsv4` into the same code path that later runs the sed. The 5 new recipes (`disagg-b200-{low-latency,low-middle-curve,mid-curve-megamoe,high-tpt-megamoe,max-tpt-megamoe}.yaml`) all explicitly ship:

```yaml
slurm:
  time_limit: "8:00:00"
health_check:
  max_attempts: 1440
  interval_seconds: 10
```

with an in-file rationale: "slurm.time_limit + health_check set to 8h / 1440 attempts to absorb cold-cache model loads." DeepSeek-V4-Pro is a large MoE model and the recipe author deliberately picked the higher value to cover the 4 h cold-cache window. The launcher silently negates that choice.
Step-by-step proof

1. CI runs `dsv4-fp4-b200-dynamo-vllm` (see `.github/configs/nvidia-master.yaml`).
2. `runners/launch_b200-dgxc.sh` matches the new `dsv4`/`fp4`/`dynamo-vllm` branch (line 26) and clones srt-slurm + overlays the new recipes (lines 57–62).
3. Execution falls through to line 146 with `CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-b200-low-latency.yaml`.
4. The recipe currently contains `max_attempts: 1440`.
5. The sed pattern `^ max_attempts: [0-9]*` matches that line and replaces it with `max_attempts: 720`.
6. srtctl now polls health for 720 × 10 s = 7200 s (2 h) instead of the 1440 × 10 s = 14400 s (4 h) the author specified.
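The force-set behaviour is easy to reproduce in isolation. The snippet below stands in for a recipe with a throwaway file (GNU sed is assumed for `-i`; the two-space indentation is illustrative, not the actual recipe layout):

```shell
# Throwaway two-line "recipe" standing in for a real disagg YAML file.
recipe=$(mktemp)
printf '  max_attempts: 1440\n  interval_seconds: 10\n' > "$recipe"

# The launcher's pre-existing rewrite, adjusted to this indentation:
sed -i 's/^  max_attempts: [0-9]*/  max_attempts: 720/' "$recipe"

grep max_attempts "$recipe"   # prints "  max_attempts: 720" -- the 1440 is gone
```

Because `[0-9]*` matches any digits, the substitution fires regardless of whether the current value is below or above 720.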
Verified there is no later code that resets max_attempts after this sed (only launch_h200-dgxc-slurm.sh references it, and it appends to a fresh file, not the same path).
Why existing safeguards don't prevent it

`bash -n` only checks syntax; the `generate_sweep_configs.py` validation in the PR description doesn't exercise the launcher at all; and there is no test that asserts the recipe's `max_attempts` is preserved.
Impact
Cold-start runs of DSV4-Pro on the shared filesystem may report spurious health_check failures somewhere in the 2h–4h window where the recipe's 1440-attempt setting would have succeeded. Failures here cost a full multi-node B200 allocation per occurrence.
Fix

Make the rewrite a floor instead of a force-set so it stops downgrading recipes that already specify a higher value, for example:

```bash
awk -i inplace '/^ max_attempts:/ {n=$2+0; if (n<720) $0=" max_attempts: 720"} 1' "${CONFIG_FILE%%:*}"
```

or scope the sed to only the older paths that actually need the bump.
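The floor semantics can be sanity-checked without gawk's `-i inplace` extension by using a temp-file variant. The `floor_attempts` helper name is mine, and the two-space indentation is assumed for illustration:

```shell
# floor_attempts FILE: raise max_attempts to at least 720, never lower it.
# Portable plain-awk equivalent of the suggested gawk -i inplace one-liner.
floor_attempts() {
  awk '/^  max_attempts:/ { n = $2 + 0; if (n < 720) $0 = "  max_attempts: 720" } 1' \
    "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

low=$(mktemp);  printf '  max_attempts: 360\n'  > "$low"
high=$(mktemp); printf '  max_attempts: 1440\n' > "$high"

floor_attempts "$low"    # 360  -> raised to the 720 floor
floor_attempts "$high"   # 1440 -> left untouched
```

Unlike the sed, this only rewrites values below the floor, so the recipes' deliberate 1440 survives.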
Unofficial run visualizers:

- https://inferencex.semianalysis.com/inference?unofficialRun=25579129066
- https://inferencex.semianalysis.com/inference?unofficialRun=25579927968
- https://inferencex.semianalysis.com/inference?unofficialRun=25581086579
- https://inferencex.semianalysis.com/inference?unofficialRun=25583036657
- https://inferencex.semianalysis.com/inference?unofficialRun=25589968098
Summary
- `dsv4-fp4-b200-dynamo-vllm` multi-node disaggregated config.
- `dynamo-vllm` for `dsv4`/`fp4`.

Validation

```bash
python utils/matrix_logic/generate_sweep_configs.py test-config --config-files .github/configs/nvidia-master.yaml --runner-config .github/configs/runners.yaml --config-keys dsv4-fp4-b200-dynamo-vllm
python utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml --runner-config .github/configs/runners.yaml --model-prefix dsv4 --framework dynamo-vllm --runner-type b200-multinode --multi-node
python -m pytest utils/matrix_logic/ -v
bash -n runners/launch_b200-dgxc.sh
git diff --check
```