MAF-19524: feat(preset): add vLLM v0.17.0 E2E presets for AI& April launch models #101

Merged
bongwoobak merged 4 commits into main from MAF-19524/add-h200-presets-april-launch-models on Apr 3, 2026
Conversation

Contributor

@bongwoobak bongwoobak commented Mar 31, 2026

Summary

  • Replace untested v0.15.1 presets with vLLM v0.17.0 presets validated on aiand-rke2 cluster
  • All 7 models deployed and serving via Odin + Heimdall BBR routing
  • v0.15.1 was incompatible with cluster CUDA driver 580 (CUDA runtime 13.0)

New Presets (7 files, all tested)

| Model | GPU | Parallelism | Key Config |
| --- | --- | --- | --- |
| Qwen3.5-9B | L40S | tp1 | `DEEP_GEMM=0`, `--reasoning-parser qwen3` |
| Qwen3.5-27B-FP8 | L40S | tp1 | `DEEP_GEMM=0`, `--enforce-eager`, `--reasoning-parser qwen3` |
| GPT-OSS-120B | H100-NVL | tp2-moe-tp2 | `--max-num-seqs 128` (warmup OOM fix) |
| Nemotron Super 120B FP8 | H200 | tp2-moe-tp2 | `--mamba-ssm-cache-dtype float16` (Mamba-2 hybrid) |
| DeepSeek V3.2 | H200 | dp8-moe-ep8 | `--tokenizer-mode deepseek_v32` (tp8 hangs on H200) |
| Qwen3.5-397B-A17B-FP8 | H200 | dp8-moe-ep8 | `DEEP_GEMM=0` (kv_heads=2, TP impossible) |
| GLM-5-FP8 | H200 | tp8-moe-tp8 | `dev.local/vllm-glm5:final`, `--gpu-memory-utilization 0.85` |
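For a concrete sense of what one of these presets encodes, here is a minimal hypothetical values fragment for the Qwen3.5-9B row. The field names and layout are illustrative assumptions, not the repository's actual InferenceServiceTemplate schema; only the GPU, parallelism, env var, and flag come from the table above.

```yaml
# Hypothetical preset sketch -- field names are illustrative only.
model: Qwen/Qwen3.5-9B
gpu: nvidia-l40s
tensorParallelSize: 1
env:
  VLLM_USE_DEEP_GEMM: "0"     # DeepGEMM disabled per the table above
extraArgs:
  - --reasoning-parser
  - qwen3
```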

Issues Found During Deployment

  • vLLM v0.15.1 + driver 580: Error 803: unsupported display driver / cuda driver combination
  • DeepGEMM bug: Qwen3.5 FP8 MoE (32 heads < 52 hardcoded) → VLLM_USE_DEEP_GEMM=0
  • DeepSeek V3.2 tp8 hang: FlashMLA-Sparse head padding + multiprocessing deadlock → dp8-moe-ep8
  • DeepSeek V3.2 tokenizer: no Jinja template → --tokenizer-mode deepseek_v32 required
  • GLM-5 arch: glm_moe_dsa not in official vLLM → custom image required
  • L40S OOM: CUDA graph capture exceeds 48GB → --enforce-eager
  • H100 warmup OOM: max_num_seqs default 1024 too high → 128
  • Nemotron Mamba-2: SSM cache dtype must be specified explicitly
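The Qwen3.5-397B constraint above (kv_heads=2, so wide tensor parallelism is out and dp8-moe-ep8 is used instead) boils down to a divisibility rule: TP shards attention across ranks, so each rank needs KV heads to split evenly. A minimal sketch of that arithmetic, assuming a strict one-whole-KV-head-per-rank rule (vLLM's actual validation logic may differ; the function name is illustrative, not vLLM's API):

```python
def tp_fits_kv_heads(num_kv_heads: int, tp_size: int) -> bool:
    """Illustrative check: tensor parallelism of degree tp_size requires
    the KV-head count to split evenly across ranks, with at least one
    whole KV head per rank. Not vLLM's actual validation code."""
    return num_kv_heads >= tp_size and num_kv_heads % tp_size == 0


# Qwen3.5-397B-A17B has kv_heads=2 (per the preset table), so tp8 fails
# this check; the preset falls back to data parallelism (dp8) plus
# expert parallelism (ep8) for the MoE layers.
print(tp_fits_kv_heads(2, 2))  # True
print(tp_fits_kv_heads(2, 8))  # False
```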

Test plan

  • helm template renders all 7 presets without errors
  • Deploy on aiand-rke2 cluster via Odin InferenceService
  • All 7 models reach Ready state
  • BBR multi-model routing test via Gateway port-forward
  • Heimdall EPP logs confirm per-model routing
  • Load testing under concurrent requests
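The first test-plan item can be reproduced locally with a render-only check; the chart path below is taken from the file list in the review, and the `grep` filter is just one way to confirm a preset made it into the output:

```shell
# Render the chart without touching a cluster; helm exits non-zero on
# template errors, which is what the first test-plan item verifies.
helm template moai-inference-preset deploy/helm/moai-inference-preset \
  | grep -c "vllm"
```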

Add vLLM v0.15.1 E2E presets for H200-SXM targeting the AI& April
launch model scope. All presets include ISVC_USE_KV_EVENTS for
precise-prefix-cache-aware Heimdall scheduling.

New models:
- Qwen3.5-9B (tp1), Qwen3.5-27B (tp1), Qwen3.5-27B-FP8 (tp1)
- Qwen3.5-397B (tp8, expert parallel)
- DeepSeek V3.2 (tp8, expert parallel)
- Nemotron Super 120B BF16/FP8 (tp2)
- Nemotron Nano 30B BF16/FP8 (tp1)
- GLM-5 BF16/FP8 (tp8, expert parallel)
Replace untested v0.15.1 H200 presets with v0.17.0 presets validated on
aiand-rke2 cluster. All 7 models serving and BBR routing confirmed.

Removed: 11 v0.15.1 preset files (incompatible with CUDA driver 580)

Added 7 tested presets:
- Qwen3.5-9B (L40S tp1, DEEP_GEMM=0, reasoning-parser qwen3)
- Qwen3.5-27B-FP8 (L40S tp1, enforce-eager, DEEP_GEMM=0)
- GPT-OSS-120B (H100-NVL tp2, max-num-seqs 128)
- Nemotron Super 120B FP8 (H200 tp2, mamba-ssm-cache-dtype float16)
- DeepSeek V3.2 (H200 dp8-moe-ep8, tokenizer-mode deepseek_v32)
- Qwen3.5-397B-A17B-FP8 (H200 dp8-moe-ep8, DEEP_GEMM=0)
- GLM-5-FP8 (H200 tp8, dev.local/vllm-glm5:final, gpu-mem-util 0.85)
@bongwoobak bongwoobak changed the title MAF-19524: feat(preset): add H200 presets for AI& April launch models MAF-19524: feat(preset): add vLLM v0.17.0 E2E presets for AI& April launch models Apr 1, 2026
@bongwoobak bongwoobak marked this pull request as ready for review April 1, 2026 16:31
@bongwoobak bongwoobak requested a review from a team as a code owner April 1, 2026 16:31
Copilot AI review requested due to automatic review settings April 1, 2026 16:31

Copilot AI left a comment


Pull request overview

This PR introduces new Odin InferenceServiceTemplate Helm preset templates for running a set of April launch models on vLLM v0.17.0 (plus a GLM-5 preset using a custom image), intended to replace/upgrade prior presets that were incompatible with the target CUDA driver environment.

Changes:

  • Added 6 new vLLM v0.17.0 E2E presets (Qwen 9B/27B/397B, GPT-OSS-120B, Nemotron 120B, DeepSeek V3.2) under templates/presets/vllm/v0.17.0/.
  • Added a GLM-5 FP8 E2E preset under templates/presets/vllm/glm5/ using a custom image and model path override.
  • Standardized per-preset resource requests/limits and node selection for the target GPU SKUs.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/qwen-qwen3.5-9b-nvidia-l40s-1.helm.yaml` | New Qwen3.5-9B L40S single-GPU vLLM v0.17.0 preset |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/qwen-qwen3.5-27b-fp8-nvidia-l40s-1.helm.yaml` | New Qwen3.5-27B-FP8 L40S single-GPU vLLM v0.17.0 preset |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/qwen-qwen3.5-397b-a17b-fp8-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml` | New Qwen3.5-397B-A17B-FP8 H200 DP8+EP vLLM v0.17.0 preset |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/openai-gpt-oss-120b-nvidia-h100-nvl-tp2-moe-tp2.helm.yaml` | New GPT-OSS-120B H100-NVL TP2 vLLM v0.17.0 preset |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/nvidia-nvidia-nemotron-3-super-120b-a12b-fp8-nvidia-h200-sxm-tp2-moe-tp2.helm.yaml` | New Nemotron 120B FP8 H200 TP2 vLLM v0.17.0 preset |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/deepseek-ai-deepseek-v3.2-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml` | New DeepSeek V3.2 H200 DP8+EP vLLM v0.17.0 preset |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-5-fp8-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml` | New GLM-5 FP8 H200 TP8 preset using a custom image/model path |

- Add --no-enable-log-requests to all 7 presets (runtime base override)
- Add --reasoning-parser qwen3 to Qwen3.5 9B, 27B, 397B
- Add --mamba-ssm-cache-dtype float16 --enable-chunked-prefill to Nemotron
- Add --gpu-memory-utilization 0.85 to GLM-5
Copilot AI review requested due to automatic review settings April 2, 2026 18:56
ISVC_MODEL_PATH should be overridden in InferenceService, not in the
preset template, to keep the preset general-purpose.
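The override described in that commit might look like the following on the InferenceService side. The schema details (predictor/container layout) and the model path are illustrative assumptions; only the `ISVC_MODEL_PATH` variable name comes from the commit message.

```yaml
# Hypothetical InferenceService fragment -- schema details are assumptions.
spec:
  predictor:
    containers:
      - name: kserve-container
        env:
          - name: ISVC_MODEL_PATH       # deployment-specific override,
            value: /models/glm-5-fp8    # kept out of the shared preset
```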
@bongwoobak bongwoobak force-pushed the MAF-19524/add-h200-presets-april-launch-models branch from aebe6e8 to a3c1ee7 Compare April 2, 2026 18:57

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

@bongwoobak bongwoobak merged commit 982ac94 into main Apr 3, 2026
3 checks passed
@bongwoobak bongwoobak deleted the MAF-19524/add-h200-presets-april-launch-models branch April 3, 2026 02:56