22 changes: 22 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -2064,6 +2064,28 @@ qwen3.5-fp8-h200-sglang:
search-space:
- { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }

qwen3.5-fp8-h200-sglang-mtp:
image: lmsysorg/sglang:v0.5.9-cu129-amd64
model: Qwen/Qwen3.5-397B-A17B-FP8
model-prefix: qwen3.5
runner: h200
precision: fp8
framework: sglang
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, ep: 8, conc-start: 4, conc-end: 128, spec-decoding: mtp }
- isl: 1024
osl: 8192
search-space:
- { tp: 8, ep: 8, conc-start: 4, conc-end: 128, spec-decoding: mtp }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, ep: 8, conc-start: 4, conc-end: 128, spec-decoding: mtp }

glm5-fp8-h200-sglang:
image: lmsysorg/sglang:glm5-hopper
model: zai-org/GLM-5-FP8
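
Each search-space entry above sweeps concurrency from conc-start to conc-end for a fixed tp/ep. A minimal sketch of how such a sweep might expand into per-run concurrency values (the doubling step and helper name are assumptions, not the repo's actual harness):

```shell
#!/usr/bin/env sh
# Hypothetical expansion of one search-space entry: emit each concurrency
# level from START to END, doubling at each step (step rule is assumed).
expand_conc() {  # usage: expand_conc START END
  c="$1"
  while [ "$c" -le "$2" ]; do
    echo "$c"
    c=$((c * 2))
  done
}

# Example: the { conc-start: 4, conc-end: 64 } entry above
expand_conc 4 64
```

Each emitted value would then be passed to the benchmark script as `CONC`, alongside `TP`, `EP_SIZE`, `ISL`, and `OSL` from the same entry.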
90 changes: 90 additions & 0 deletions benchmarks/single_node/qwen3.5_fp8_h200_mtp.sh
@@ -0,0 +1,90 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME \
EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
MAX_SEQ_LEN=$((ISL + OSL + 20))

echo "CONC: $CONC, ISL: $ISL, OSL: $OSL, MAX_SEQ_LEN: $MAX_SEQ_LEN"

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
python3 -m sglang.launch_server \
--model "$MODEL" \
--host 0.0.0.0 \
--port "$PORT" \
--tp "$TP" \
--expert-parallel-size "$EP_SIZE" \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-flashinfer-allreduce-fusion \
--max-running-requests 128 \
--chunked-prefill-size 16384 \
--decode-log-interval 1 \
--mem-fraction-static 0.8 \
--cuda-graph-max-bs "$CONC" \
--context-length "$MAX_SEQ_LEN" \
--kv-cache-dtype fp8_e4m3 \
--quantization fp8 \
--attention-backend flashinfer \
--stream-interval 50 \
--tokenizer-worker-num 6 \
--mamba-ssm-dtype bfloat16 \
--disable-radix-cache \
--trust-remote-code \
--speculative-algorithm EAGLE \
--speculative-num-steps 2 \
--speculative-num-draft-tokens 3 \
--speculative-eagle-topk 1 \
> "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
Comment on lines +67 to +77
🔴 The run_benchmark_serving call is missing --use-chat-template, which every other MTP benchmark script in the repo (6 out of 6) includes. Without this flag, MTP acceptance rates are artificially high because raw text without chat formatting special tokens is easier for the draft model to predict, producing misleading benchmark results. Add --use-chat-template after the --result-dir line to match the established pattern.

Extended reasoning...

What the bug is

The new qwen3.5_fp8_h200_mtp.sh benchmark script omits --use-chat-template from its run_benchmark_serving call (lines 69-79). This flag is present in every other MTP benchmark script in the repository.

Evidence of the pattern

All 6 existing single-node MTP benchmark scripts include --use-chat-template:

  • dsr1_fp8_b200_mtp.sh (line 108)
  • dsr1_fp4_b200_trt_mtp.sh (line 133)
  • dsr1_fp8_b200_trt_mtp.sh (line 143)
  • dsr1_fp8_h200_trt_mtp.sh (line 115)
  • dsr1_fp4_mi355x_atom_mtp.sh (line 71)
  • dsr1_fp8_mi355x_atom_mtp.sh (line 70)

Additionally, the multi-node AMD utility (bench.sh:60) adds this flag generically for ALL MTP benchmarks via [ "$IS_MTP" = "true" ] && echo "--use-chat-template", confirming this is a model-agnostic requirement, not DeepSeek-specific.
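
The model-agnostic gating described in bench.sh can be sketched as follows (the wrapper function name is an assumption; the repo uses the inline test shown above):

```shell
#!/usr/bin/env sh
# Sketch of the generic MTP flag pattern: emit --use-chat-template only
# when the benchmark is an MTP (speculative-decoding) run.
mtp_extra_flags() {  # usage: mtp_extra_flags IS_MTP
  [ "$1" = "true" ] && echo "--use-chat-template"
  return 0
}

# The extra flag would then be appended to the serving-benchmark invocation:
EXTRA_FLAGS=$(mtp_extra_flags "${IS_MTP:-false}")
echo "run_benchmark_serving ... $EXTRA_FLAGS"
```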

Root cause

The script was likely copied from the non-MTP qwen3.5_fp8_h200.sh (which correctly omits the flag since MTP acceptance rates are irrelevant without speculative decoding) but failed to add --use-chat-template as all other MTP scripts do.

Step-by-step proof of impact

  1. The benchmark runs with EAGLE speculative decoding enabled (--speculative-algorithm EAGLE, lines 55-58).
  2. run_benchmark_serving sends prompts to the server. Without --use-chat-template, raw text is sent without chat formatting special tokens.
  3. The draft model finds raw text easier to predict than properly formatted chat messages (which contain special tokens like <|im_start|>, <|im_end|>, etc.).
  4. This results in artificially higher MTP acceptance rates.
  5. The benchmark reports misleadingly optimistic throughput numbers that won't reflect real-world chat serving performance.
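
The formatting difference in steps 2-3 can be illustrated with a minimal ChatML-style wrapper (the real template ships with the model's tokenizer; the helper and exact token layout here are illustrative assumptions for Qwen-style models):

```shell
#!/usr/bin/env sh
# Illustrative chat-template wrapping: the same prompt gains role markers
# and special tokens, which the draft model must also predict correctly.
format_chat() {  # usage: format_chat USER_PROMPT
  printf '<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n' "$1"
}

format_chat "What is 2+2?"
```

Without the flag, only the bare prompt text is sent; with it, the draft model is scored against the full formatted sequence, which is the realistic chat-serving workload.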

How to fix

Add --use-chat-template to the run_benchmark_serving call, e.g. after --result-dir /workspace/. This is a one-line addition that aligns the script with every other MTP benchmark in the repository.

--use-chat-template \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT" --concurrent-requests "$CONC"
append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
17 changes: 8 additions & 9 deletions perf-changelog.yaml
@@ -984,15 +984,6 @@
- "14 variants: STP/MTP x low-latency/max-throughput with updated concurrencies and scale points"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/907

- config-keys:
- glm5-fp8-h200-sglang
description:
- "Add GLM-5 FP8 SGLang H200 single-node benchmark"
- "Model: zai-org/GLM-5-FP8, image: lmsysorg/sglang:glm5-hopper"
- "Benchmark script: benchmarks/single_node/glm5_fp8_h200.sh"
- "Tool-call-parser glm47, reasoning-parser glm45, mem-fraction-static 0.85"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/914

- config-keys:
- glm5-fp8-b200-sglang
description:
@@ -1042,3 +1033,11 @@
- "Only non-TP8 configs listed; TP8 already uses all GPUs on the node"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/934

- config-keys:
- glm5-fp8-h200-sglang
description:
- "Add GLM-5 FP8 SGLang H200 single-node benchmark"
- "Model: zai-org/GLM-5-FP8, image: lmsysorg/sglang:glm5-hopper"
- "Benchmark script: benchmarks/single_node/glm5_fp8_h200.sh"
- "Tool-call-parser glm47, reasoning-parser glm45, mem-fraction-static 0.85"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/914