Add DSV4 B200 Dynamo vLLM disagg#1303

Open
Oseltamivir wants to merge 6 commits into main from dsv4-b200-dynamo-vllm-disagg
Conversation

@Oseltamivir
Collaborator

Summary

  • Add dsv4-fp4-b200-dynamo-vllm multi-node disaggregated config.
  • Add B200-adapted DSV4 Dynamo vLLM recipes derived from the existing GB200 recipe set.
  • Update the B200 DGXC Slurm launcher to support dynamo-vllm for dsv4/fp4.

Validation

  • python utils/matrix_logic/generate_sweep_configs.py test-config --config-files .github/configs/nvidia-master.yaml --runner-config .github/configs/runners.yaml --config-keys dsv4-fp4-b200-dynamo-vllm
  • python utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml --runner-config .github/configs/runners.yaml --model-prefix dsv4 --framework dynamo-vllm --runner-type b200-multinode --multi-node
  • python -m pytest utils/matrix_logic/ -v
  • bash -n runners/launch_b200-dgxc.sh
  • git diff --check

@github-actions
Contributor

github-actions Bot commented May 8, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@Oseltamivir Oseltamivir force-pushed the dsv4-b200-dynamo-vllm-disagg branch from d2d12d5 to 2191510 Compare May 8, 2026 20:53
@Oseltamivir Oseltamivir marked this pull request as ready for review May 8, 2026 20:56
@Oseltamivir Oseltamivir requested a review from a team May 8, 2026 20:56
@claude
Contributor

claude Bot commented May 8, 2026

Claude finished @Oseltamivir's task in 4m 2s.


PR Review: Add DSV4 B200 Dynamo vLLM disagg

  • Read and analyze all changed files
  • Validate master config and perf-changelog
  • Check container image accessibility
  • Validate model prefixes
  • Check launch script (enroot import, expert parallelism)
  • Review recipe YAML files for correctness
  • Post review findings

Summary

No blocking issues found. Two warnings posted as inline comments on the launch script:

  1. max_attempts sed override — The existing blanket sed at line 146 will downgrade all 5 new recipes from max_attempts: 1440 to 720, contradicting the recipes' intentional higher timeout for cold-cache B200 model loads.

  2. Fragile branch pin: git checkout aflowers/vllm-gb200-v0.20.0 depends on a feature branch on NVIDIA/srt-slurm that could disappear. Pinning a commit SHA would be more robust.

Validations passed:

  • perf-changelog.yaml updated and appended at end of file
  • Container image vllm/vllm-openai:v0.20.0-ubuntu2404 is publicly accessible on Docker Hub
  • Model prefix dsv4 is valid and already used across many existing configs
  • Enroot import pattern present in launch script (line 89)
  • Master config worker counts (prefill/decode num-worker) match all 5 recipe YAML prefill_workers/decode_workers values
  • Recipe vllm_config EP/DP settings are consistent with the master config ep/dp-attn flags

elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "dsv4" ]]; then
    git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
    cd "$SRT_REPO_DIR" || exit 1
    git checkout aflowers/vllm-gb200-v0.20.0
Contributor

🟡 WARNING: This checks out a feature branch (aflowers/vllm-gb200-v0.20.0) on NVIDIA/srt-slurm. If that branch is deleted, force-pushed, or renamed upstream, all dsv4-fp4-b200-dynamo-vllm jobs will silently break.

Why it matters: Feature branches on external repos are ephemeral — this creates a fragile dependency.

Fix: Consider pinning to a specific commit SHA instead of a branch name for reproducibility:

Suggested change
- git checkout aflowers/vllm-gb200-v0.20.0
+ git checkout <specific-commit-sha> # aflowers/vllm-gb200-v0.20.0 as of YYYY-MM-DD
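
A minimal sketch of such a pin, assuming the launcher keeps the SHA in a variable. The names `checkout_pinned` and `SRT_SLURM_SHA` are hypothetical and not part of this PR, and the actual SHA is left for the author to fill in.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check out an exact commit and verify that HEAD landed on it.
checkout_pinned() {
    local repo_dir="$1" sha="$2" head
    git -C "$repo_dir" checkout --quiet --detach "$sha" || return 1
    # rev-parse resolves HEAD to a full SHA; compare as a prefix so a short pin also works.
    head="$(git -C "$repo_dir" rev-parse HEAD)" || return 1
    if [[ "$head" != "$sha"* ]]; then
        echo "HEAD is $head, expected $sha" >&2
        return 1
    fi
}
```

In the launcher this could replace the branch checkout with something like `checkout_pinned "$SRT_REPO_DIR" "$SRT_SLURM_SHA" || exit 1`. A pin only protects against force-pushes and renames while the commit stays reachable from some upstream ref; if the branch is deleted outright, a fresh clone may no longer contain the commit.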

if [[ "$IS_AGENTIC" == "1" ]]; then
    git clone --branch cam/sa-submission-q2-2026 --single-branch https://github.com/cquil11/srt-slurm-nv.git "$SRT_REPO_DIR"
    cd "$SRT_REPO_DIR" || exit 1
elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "dsv4" ]]; then
Contributor

🟡 WARNING: The existing sed at line 146 (sed -i 's/^  max_attempts: [0-9]*/  max_attempts: 720/') will override all 5 new B200 recipes' max_attempts: 1440 down to 720. The recipes intentionally set 1440 (≈ 4 h of health-check polling) with comments about absorbing cold-cache model loads on B200, but the blanket sed will halve that to ≈ 2 h.

Why it matters: If the DSV4-Pro model load on B200 genuinely needs the longer timeout, jobs could fail with spurious health-check timeouts before the model finishes loading.

Fix: Guard the sed so it only bumps recipes that ship with the old 360 default, or skip it for dynamo-vllm recipes:

if ! grep -q 'max_attempts: 1440' "${CONFIG_FILE%%:*}"; then
    sed -i 's/^  max_attempts: [0-9]*/  max_attempts: 720/' "${CONFIG_FILE%%:*}"
fi

Comment on lines +26 to +39
elif [[ $MODEL_PREFIX == "dsv4" && $PRECISION == "fp4" && $FRAMEWORK == "dynamo-vllm" ]]; then
    SELECTED_MODEL_PATH=""
    if [[ -n "${MODEL_PATH:-}" && -d "${MODEL_PATH}" ]]; then
        SELECTED_MODEL_PATH="$MODEL_PATH"
    else
        for candidate in /lustre/fsw/models/deepseek-v4-pro /lustre/fsw/models/dsv4-pro /lustre/fsw/models/DeepSeek-V4-Pro; do
            if [[ -d "$candidate" ]]; then
                SELECTED_MODEL_PATH="$candidate"
                break
            fi
        done
    fi
    export MODEL_PATH="${SELECTED_MODEL_PATH:-/lustre/fsw/models/deepseek-v4-pro}"
    export SRT_SLURM_MODEL_PREFIX="deepseek-v4-pro"
Contributor

🔴 The shared sed at runners/launch_b200-dgxc.sh:146 (sed -i 's/^  max_attempts: [0-9]*/  max_attempts: 720/') matches any value, so the new B200 vLLM recipes' deliberate max_attempts: 1440 (documented in-file as needed for cold-cache DSV4-Pro loads) gets silently rewritten to 720, halving the intended 14400s/4h health-check window to 7200s/2h. Make the rewrite a floor (only set 720 when the current value is lower) so it stops downgrading recipes that already specify a higher value, e.g. awk -i inplace '/^  max_attempts:/ {n=$2+0; if (n<720) $0="  max_attempts: 720"} 1' "${CONFIG_FILE%%:*}".

Extended reasoning...

What the bug is

The pre-existing line at runners/launch_b200-dgxc.sh:146:

sed -i 's/^  max_attempts: [0-9]*/  max_attempts: 720/' "${CONFIG_FILE%%:*}"

was introduced as a bump — the comment immediately above (lines 143–145) explains it raises DSR1-FP8's default max_attempts: 360 to 720 (3600s → 7200s) so large-model loads off shared FS finish in time. The substitution pattern [0-9]*, however, matches any numeric value, so it is really a force-set, not a floor.

How this PR triggers it

This PR adds a new elif branch at lines 57–62 that routes FRAMEWORK=dynamo-vllm + MODEL_PREFIX=dsv4 into the same code path that later runs the sed. The 5 new recipes (disagg-b200-{low-latency,low-middle-curve,mid-curve-megamoe,high-tpt-megamoe,max-tpt-megamoe}.yaml) all explicitly ship:

slurm:
  time_limit: "8:00:00"

health_check:
  max_attempts: 1440
  interval_seconds: 10

with an in-file rationale: "slurm.time_limit + health_check set to 8h / 1440 attempts to absorb cold-cache model loads." DeepSeek-V4-Pro is a large MoE model and the recipe author deliberately picked the higher value to cover the 4h cold-cache window. The launcher silently negates that choice.

Step-by-step proof

  1. CI runs dsv4-fp4-b200-dynamo-vllm (see .github/configs/nvidia-master.yaml).
  2. runners/launch_b200-dgxc.sh matches the new dsv4/fp4/dynamo-vllm branch (line 26) and clones srt-slurm + overlays the new recipes (lines 57–62).
  3. Execution falls through to line 146 with CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-b200-low-latency.yaml.
  4. The recipe currently contains max_attempts: 1440.
  5. The sed pattern ^  max_attempts: [0-9]* matches that line and replaces it with max_attempts: 720.
  6. srtctl now polls health for 720 × 10s = 7200s (2h) instead of the 1440 × 10s = 14400s (4h) the author specified.
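
Steps 4 to 6 can be reproduced without a cluster. The sketch below runs the launcher's substitution against a stand-in for the recipe's health_check section (the `recipe` string is an illustrative excerpt, not the full file) and prints to stdout rather than editing in place.

```shell
#!/usr/bin/env bash
# Stand-in for the health_check section of one of the new recipes.
recipe=$'health_check:\n  max_attempts: 1440\n  interval_seconds: 10'

# Same substitution as the launcher, writing to stdout instead of editing in place.
rewritten="$(printf '%s\n' "$recipe" | sed 's/^  max_attempts: [0-9]*/  max_attempts: 720/')"
printf '%s\n' "$rewritten"

# Health-check window in seconds = max_attempts * interval_seconds.
attempts="$(printf '%s\n' "$rewritten" | awk '/^  max_attempts:/ {print $2}')"
echo "window: $(( attempts * 10 ))s"   # prints "window: 7200s"
```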

Verified there is no later code that resets max_attempts after this sed (only launch_h200-dgxc-slurm.sh references it, and it appends to a fresh file, not the same path).

Why existing safeguards don't prevent it

bash -n only checks syntax; the generate_sweep_configs.py validation in the PR description doesn't exercise the launcher at all; and there is no test that asserts the recipe's max_attempts is preserved.
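
One possible shape for such a test, as a sketch only: apply a rewrite command to a copy of the recipe and fail if max_attempts went down. `read_max_attempts` and `check_not_lowered` are hypothetical helpers, not existing functions in this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first two-space-indented max_attempts value in a recipe.
read_max_attempts() {
    awk '/^  max_attempts:/ {print $2 + 0; exit}' "$1"
}

# Hypothetical check: run a rewrite command on a copy of the recipe and fail
# if it lowers max_attempts. The copy's path is appended as the last argument.
# Usage: check_not_lowered <recipe.yaml> <command> [args...]
check_not_lowered() {
    local recipe="$1"; shift
    local copy before after
    copy="$(mktemp)" || return 1
    cp "$recipe" "$copy"
    before="$(read_max_attempts "$copy")"
    "$@" "$copy"
    after="$(read_max_attempts "$copy")"
    rm -f "$copy"
    # Treat a missing value as 0 so the arithmetic comparison stays well-formed.
    if (( ${after:-0} < ${before:-0} )); then
        echo "max_attempts lowered: ${before:-0} -> ${after:-0}" >&2
        return 1
    fi
}
```

Pointed at the launcher's current substitution, for example `check_not_lowered recipe.yaml sed -i 's/^  max_attempts: [0-9]*/  max_attempts: 720/'` with GNU sed, this would fail for the 1440-attempt recipes and pass for the older 360-attempt ones.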

Impact

Cold-start runs of DSV4-Pro on the shared filesystem may report spurious health_check failures somewhere in the 2h–4h window where the recipe's 1440-attempt setting would have succeeded. Failures here cost a full multi-node B200 allocation per occurrence.

Fix

Make the rewrite a floor instead of a force-set so it stops downgrading recipes that already specify a higher value, for example:

awk -i inplace '/^  max_attempts:/ {n=$2+0; if (n<720) $0="  max_attempts: 720"} 1' "${CONFIG_FILE%%:*}"

or scope the sed to only the older paths that actually need the bump.
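
Note that `-i inplace` is a GNU awk extension, so the one-liner above depends on gawk being the awk on the runner. A portable sketch of the same floor logic, assuming only POSIX awk and mktemp, could look like this; `floor_max_attempts` is a hypothetical name, not an existing function in the launcher.

```shell
#!/usr/bin/env bash
# Hypothetical helper: raise max_attempts to at least $2 (default 720), never lower it.
floor_max_attempts() {
    local file="$1" minval="${2:-720}" tmp rc
    tmp="$(mktemp)" || return 1
    # Plain awk into a temp file, then copy back, instead of gawk-only `-i inplace`.
    awk -v minval="$minval" '
        /^  max_attempts:/ && ($2 + 0) < (minval + 0) { print "  max_attempts: " minval; next }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

Called as `floor_max_attempts "${CONFIG_FILE%%:*}"`, this would bump a 360-attempt recipe to 720 and leave the new 1440-attempt recipes untouched. Writing back with `cat` rather than `mv` keeps the original file's permissions.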
