MAF-19524: feat(preset): add vLLM v0.17.0 E2E presets for AI& April launch models #101

Merged
bongwoobak merged 4 commits into main from MAF-19524/add-h200-presets-april-launch-models on Apr 3, 2026
Conversation

Contributor

@bongwoobak bongwoobak commented Mar 31, 2026

Summary

  • Replace untested v0.15.1 presets with vLLM v0.17.0 presets validated on aiand-rke2 cluster
  • All 7 models deployed and serving via Odin + Heimdall BBR routing
  • v0.15.1 was incompatible with cluster CUDA driver 580 (CUDA runtime 13.0)

New Presets (7 files, all tested)

| Model | GPU | Parallelism | Key Config |
| --- | --- | --- | --- |
| Qwen3.5-9B | L40S | tp1 | `DEEP_GEMM=0`, `--reasoning-parser qwen3` |
| Qwen3.5-27B-FP8 | L40S | tp1 | `DEEP_GEMM=0`, `--enforce-eager`, `--reasoning-parser qwen3` |
| GPT-OSS-120B | H100-NVL | tp2-moe-tp2 | `--max-num-seqs 128` (warmup OOM fix) |
| Nemotron Super 120B FP8 | H200 | tp2-moe-tp2 | `--mamba-ssm-cache-dtype float16` (Mamba-2 hybrid) |
| DeepSeek V3.2 | H200 | dp8-moe-ep8 | `--tokenizer-mode deepseek_v32` (tp8 hangs on H200) |
| Qwen3.5-397B-A17B-FP8 | H200 | dp8-moe-ep8 | `DEEP_GEMM=0` (kv_heads=2, TP impossible) |
| GLM-5-FP8 | H200 | tp8-moe-tp8 | `dev.local/vllm-glm5:final`, `--gpu-memory-utilization 0.85` |
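For a concrete sense of what one of these presets encodes, here is a minimal hypothetical values fragment for the Qwen3.5-9B row. The field names and layout are illustrative assumptions, not the repository's actual InferenceServiceTemplate schema; only the GPU, parallelism, env var, and flag come from the table above.

```yaml
# Hypothetical preset sketch -- field names are illustrative only.
model: Qwen/Qwen3.5-9B
gpu: nvidia-l40s
tensorParallelSize: 1
env:
  VLLM_USE_DEEP_GEMM: "0"     # DeepGEMM disabled per the table above
extraArgs:
  - --reasoning-parser
  - qwen3
```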

Issues Found During Deployment

  • vLLM v0.15.1 + driver 580: Error 803: unsupported display driver / cuda driver combination
  • DeepGEMM bug: Qwen3.5 FP8 MoE (32 heads < 52 hardcoded) → VLLM_USE_DEEP_GEMM=0
  • DeepSeek V3.2 tp8 hang: FlashMLA-Sparse head padding + multiprocessing deadlock → dp8-moe-ep8
  • DeepSeek V3.2 tokenizer: no Jinja template → --tokenizer-mode deepseek_v32 required
  • GLM-5 arch: glm_moe_dsa not in official vLLM → custom image required
  • L40S OOM: CUDA graph capture exceeds 48GB → --enforce-eager
  • H100 warmup OOM: max_num_seqs default 1024 too high → 128
  • Nemotron Mamba-2: SSM cache dtype must be specified explicitly
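The Qwen3.5-397B constraint above (kv_heads=2, so wide tensor parallelism is out and dp8-moe-ep8 is used instead) boils down to a divisibility rule: TP shards attention across ranks, so each rank needs KV heads to split evenly. A minimal sketch of that arithmetic, assuming a strict one-whole-KV-head-per-rank rule (vLLM's actual validation logic may differ; the function name is illustrative, not vLLM's API):

```python
def tp_fits_kv_heads(num_kv_heads: int, tp_size: int) -> bool:
    """Illustrative check: tensor parallelism of degree tp_size requires
    the KV-head count to split evenly across ranks, with at least one
    whole KV head per rank. Not vLLM's actual validation code."""
    return num_kv_heads >= tp_size and num_kv_heads % tp_size == 0


# Qwen3.5-397B-A17B has kv_heads=2 (per the preset table), so tp8 fails
# this check; the preset falls back to data parallelism (dp8) plus
# expert parallelism (ep8) for the MoE layers.
print(tp_fits_kv_heads(2, 2))  # True
print(tp_fits_kv_heads(2, 8))  # False
```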

Test plan

  • helm template renders all 7 presets without errors
  • Deploy on aiand-rke2 cluster via Odin InferenceService
  • All 7 models reach Ready state
  • BBR multi-model routing test via Gateway port-forward
  • Heimdall EPP logs confirm per-model routing
  • Load testing under concurrent requests
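The first test-plan item can be reproduced locally with a render-only check; the chart path below is taken from the file list in the review, and the `grep` filter is just one way to confirm a preset made it into the output:

```shell
# Render the chart without touching a cluster; helm exits non-zero on
# template errors, which is what the first test-plan item verifies.
helm template moai-inference-preset deploy/helm/moai-inference-preset \
  | grep -c "vllm"
```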

Add vLLM v0.15.1 E2E presets for H200-SXM targeting the AI& April
launch model scope. All presets include ISVC_USE_KV_EVENTS for
precise-prefix-cache-aware Heimdall scheduling.

New models:
- Qwen3.5-9B (tp1), Qwen3.5-27B (tp1), Qwen3.5-27B-FP8 (tp1)
- Qwen3.5-397B (tp8, expert parallel)
- DeepSeek V3.2 (tp8, expert parallel)
- Nemotron Super 120B BF16/FP8 (tp2)
- Nemotron Nano 30B BF16/FP8 (tp1)
- GLM-5 BF16/FP8 (tp8, expert parallel)
Replace untested v0.15.1 H200 presets with v0.17.0 presets validated on
aiand-rke2 cluster. All 7 models serving and BBR routing confirmed.

Removed: 11 v0.15.1 preset files (incompatible with CUDA driver 580)

Added 7 tested presets:
- Qwen3.5-9B (L40S tp1, DEEP_GEMM=0, reasoning-parser qwen3)
- Qwen3.5-27B-FP8 (L40S tp1, enforce-eager, DEEP_GEMM=0)
- GPT-OSS-120B (H100-NVL tp2, max-num-seqs 128)
- Nemotron Super 120B FP8 (H200 tp2, mamba-ssm-cache-dtype float16)
- DeepSeek V3.2 (H200 dp8-moe-ep8, tokenizer-mode deepseek_v32)
- Qwen3.5-397B-A17B-FP8 (H200 dp8-moe-ep8, DEEP_GEMM=0)
- GLM-5-FP8 (H200 tp8, dev.local/vllm-glm5:final, gpu-mem-util 0.85)
@bongwoobak bongwoobak changed the title MAF-19524: feat(preset): add H200 presets for AI& April launch models MAF-19524: feat(preset): add vLLM v0.17.0 E2E presets for AI& April launch models Apr 1, 2026
@bongwoobak bongwoobak marked this pull request as ready for review April 1, 2026 16:31
@bongwoobak bongwoobak requested a review from a team as a code owner April 1, 2026 16:31
Copilot AI review requested due to automatic review settings April 1, 2026 16:31

Copilot AI left a comment


Pull request overview

This PR introduces new Odin InferenceServiceTemplate Helm preset templates for running a set of April launch models on vLLM v0.17.0 (plus a GLM-5 preset using a custom image), intended to replace/upgrade prior presets that were incompatible with the target CUDA driver environment.

Changes:

  • Added 6 new vLLM v0.17.0 E2E presets (Qwen 9B/27B/397B, GPT-OSS-120B, Nemotron 120B, DeepSeek V3.2) under templates/presets/vllm/v0.17.0/.
  • Added a GLM-5 FP8 E2E preset under templates/presets/vllm/glm5/ using a custom image and model path override.
  • Standardized per-preset resource requests/limits and node selection for the target GPU SKUs.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/qwen-qwen3.5-9b-nvidia-l40s-1.helm.yaml` | New Qwen3.5-9B L40S single-GPU vLLM v0.17.0 preset |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/qwen-qwen3.5-27b-fp8-nvidia-l40s-1.helm.yaml` | New Qwen3.5-27B-FP8 L40S single-GPU vLLM v0.17.0 preset |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/qwen-qwen3.5-397b-a17b-fp8-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml` | New Qwen3.5-397B-A17B-FP8 H200 DP8+EP vLLM v0.17.0 preset |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/openai-gpt-oss-120b-nvidia-h100-nvl-tp2-moe-tp2.helm.yaml` | New GPT-OSS-120B H100-NVL TP2 vLLM v0.17.0 preset |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/nvidia-nvidia-nemotron-3-super-120b-a12b-fp8-nvidia-h200-sxm-tp2-moe-tp2.helm.yaml` | New Nemotron 120B FP8 H200 TP2 vLLM v0.17.0 preset |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/deepseek-ai-deepseek-v3.2-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml` | New DeepSeek V3.2 H200 DP8+EP vLLM v0.17.0 preset |
| `deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-5-fp8-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml` | New GLM-5 FP8 H200 TP8 preset using a custom image/model path |

- Add --no-enable-log-requests to all 7 presets (runtime base override)
- Add --reasoning-parser qwen3 to Qwen3.5 9B, 27B, 397B
- Add --mamba-ssm-cache-dtype float16 --enable-chunked-prefill to Nemotron
- Add --gpu-memory-utilization 0.85 to GLM-5
Copilot AI review requested due to automatic review settings April 2, 2026 18:56
ISVC_MODEL_PATH should be overridden in InferenceService, not in the
preset template, to keep the preset general-purpose.
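The override described in that commit might look like the following on the InferenceService side. The schema details (predictor/container layout) and the model path are illustrative assumptions; only the `ISVC_MODEL_PATH` variable name comes from the commit message.

```yaml
# Hypothetical InferenceService fragment -- schema details are assumptions.
spec:
  predictor:
    containers:
      - name: kserve-container
        env:
          - name: ISVC_MODEL_PATH       # deployment-specific override,
            value: /models/glm-5-fp8    # kept out of the shared preset
```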
@bongwoobak bongwoobak force-pushed the MAF-19524/add-h200-presets-april-launch-models branch from aebe6e8 to a3c1ee7 Compare April 2, 2026 18:57

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

@bongwoobak bongwoobak merged commit 982ac94 into main Apr 3, 2026
3 checks passed
@bongwoobak bongwoobak deleted the MAF-19524/add-h200-presets-april-launch-models branch April 3, 2026 02:56