MAF-19524: feat(preset): add vLLM v0.17.0 E2E presets for AI& April launch models #101
Merged
bongwoobak merged 4 commits into main on Apr 3, 2026
Conversation
Add vLLM v0.15.1 E2E presets for H200-SXM targeting the AI& April launch model scope. All presets include ISVC_USE_KV_EVENTS for precise-prefix-cache-aware Heimdall scheduling.

New models:
- Qwen3.5-9B (tp1), Qwen3.5-27B (tp1), Qwen3.5-27B-FP8 (tp1)
- Qwen3.5-397B (tp8, expert parallel)
- DeepSeek V3.2 (tp8, expert parallel)
- Nemotron Super 120B BF16/FP8 (tp2)
- Nemotron Nano 30B BF16/FP8 (tp1)
- GLM-5 BF16/FP8 (tp8, expert parallel)
…presets

Replace untested v0.15.1 H200 presets with v0.17.0 presets validated on the aiand-rke2 cluster. All 7 models serving and BBR routing confirmed.

Removed: 11 v0.15.1 preset files (incompatible with CUDA driver 580)

Added 7 tested presets:
- Qwen3.5-9B (L40S tp1, DEEP_GEMM=0, reasoning-parser qwen3)
- Qwen3.5-27B-FP8 (L40S tp1, enforce-eager, DEEP_GEMM=0)
- GPT-OSS-120B (H100-NVL tp2, max-num-seqs 128)
- Nemotron Super 120B FP8 (H200 tp2, mamba-ssm-cache-dtype float16)
- DeepSeek V3.2 (H200 dp8-moe-ep8, tokenizer-mode deepseek_v32)
- Qwen3.5-397B-A17B-FP8 (H200 dp8-moe-ep8, DEEP_GEMM=0)
- GLM-5-FP8 (H200 tp8, dev.local/vllm-glm5:final, gpu-mem-util 0.85)
Contributor
Pull request overview
This PR introduces new Odin InferenceServiceTemplate Helm preset templates for running a set of April launch models on vLLM v0.17.0 (plus a GLM-5 preset using a custom image), intended to replace/upgrade prior presets that were incompatible with the target CUDA driver environment.
Changes:
- Added 6 new vLLM v0.17.0 E2E presets (Qwen 9B/27B/397B, GPT-OSS-120B, Nemotron 120B, DeepSeek V3.2) under `templates/presets/vllm/v0.17.0/`.
- Added a GLM-5 FP8 E2E preset under `templates/presets/vllm/glm5/` using a custom image and model path override.
- Standardized per-preset resource requests/limits and node selection for the target GPU SKUs.
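For orientation, a single preset of the kind described above might look roughly like this. The field layout is an assumption (the actual InferenceServiceTemplate schema is not part of this excerpt); only the flags and env vars come from the PR description, and the image name and node-selector label key are placeholders:

```yaml
# Hypothetical sketch of a preset such as qwen-qwen3.5-9b-nvidia-l40s-1.helm.yaml.
# Field layout is assumed, not copied from the real template schema.
image: vllm/vllm-openai:v0.17.0        # assumed image reference
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-L40S  # assumed label key for the target GPU SKU
resources:
  limits:
    nvidia.com/gpu: 1                  # single-GPU (tp1) preset
env:
  - name: VLLM_USE_DEEP_GEMM           # DEEP_GEMM=0 per the PR description
    value: "0"
args:
  - --tensor-parallel-size=1
  - --reasoning-parser=qwen3
  - --no-enable-log-requests
```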
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/qwen-qwen3.5-9b-nvidia-l40s-1.helm.yaml | New Qwen3.5-9B L40S single-GPU vLLM v0.17.0 preset |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/qwen-qwen3.5-27b-fp8-nvidia-l40s-1.helm.yaml | New Qwen3.5-27B-FP8 L40S single-GPU vLLM v0.17.0 preset |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/qwen-qwen3.5-397b-a17b-fp8-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml | New Qwen3.5-397B-A17B-FP8 H200 DP8+EP vLLM v0.17.0 preset |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/openai-gpt-oss-120b-nvidia-h100-nvl-tp2-moe-tp2.helm.yaml | New GPT-OSS-120B H100-NVL TP2 vLLM v0.17.0 preset |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/nvidia-nvidia-nemotron-3-super-120b-a12b-fp8-nvidia-h200-sxm-tp2-moe-tp2.helm.yaml | New Nemotron 120B FP8 H200 TP2 vLLM v0.17.0 preset |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.17.0/deepseek-ai-deepseek-v3.2-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml | New DeepSeek V3.2 H200 DP8+EP vLLM v0.17.0 preset |
| deploy/helm/moai-inference-preset/templates/presets/vllm/glm5/zai-org-glm-5-fp8-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml | New GLM-5 FP8 H200 TP8 preset using a custom image/model path |
- Add --no-enable-log-requests to all 7 presets (runtime base override)
- Add --reasoning-parser qwen3 to Qwen3.5 9B, 27B, 397B
- Add --mamba-ssm-cache-dtype float16 --enable-chunked-prefill to Nemotron
- Add --gpu-memory-utilization 0.85 to GLM-5
hhk7734 requested changes on Apr 2, 2026
ISVC_MODEL_PATH should be overridden in InferenceService, not in the preset template, to keep the preset general-purpose.
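The override the reviewer is asking for could be sketched like this on the InferenceService side, assuming a KServe-style spec whose predictor container env supplies values the preset leaves unset (the resource name and model path here are hypothetical):

```yaml
# Sketch only — deployment name and model path are hypothetical.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: glm-5-fp8                      # hypothetical deployment name
spec:
  predictor:
    containers:
      - name: kserve-container
        env:
          - name: ISVC_MODEL_PATH      # set per deployment, not in the preset
            value: /models/glm-5-fp8   # hypothetical path
```

This keeps the preset template general-purpose: the same preset can serve any checkout of the model by varying only the InferenceService.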
Force-pushed from aebe6e8 to a3c1ee7
hhk7734 approved these changes on Apr 3, 2026
Summary
New Presets (7 files, all tested)
| Model | Key settings |
|---|---|
| Qwen3.5-9B | DEEP_GEMM=0, --reasoning-parser qwen3 |
| Qwen3.5-27B-FP8 | DEEP_GEMM=0, --enforce-eager, --reasoning-parser qwen3 |
| GPT-OSS-120B | --max-num-seqs 128 (warmup OOM fix) |
| Nemotron Super 120B FP8 | --mamba-ssm-cache-dtype float16 (Mamba-2 hybrid) |
| DeepSeek V3.2 | --tokenizer-mode deepseek_v32 (tp8 hangs on H200) |
| Qwen3.5-397B-A17B-FP8 | DEEP_GEMM=0 (kv_heads=2, TP impossible) |
| GLM-5-FP8 | dev.local/vllm-glm5:final, --gpu-memory-utilization 0.85 |

Issues Found During Deployment
- Error 803: unsupported display driver / cuda driver combination
- VLLM_USE_DEEP_GEMM=0
- --tokenizer-mode deepseek_v32 required
- glm_moe_dsa not in official vLLM → custom image required
- --enforce-eager
- max_num_seqs default 1024 too high → 128

Test plan
- `helm template` renders all 7 presets without errors
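The VLLM_USE_DEEP_GEMM=0 workaround noted above would live in a preset's container env; as a sketch (where this env list sits inside the template is assumed, not shown in this excerpt):

```yaml
# Sketch: disabling DeepGEMM kernels via vLLM's env var, as several presets above do.
env:
  - name: VLLM_USE_DEEP_GEMM
    value: "0"
```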