Problem Description
When serving Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 on ROCm with AITER enabled, vLLM crashes with GPU memory access faults during startup (during cudagraph capture):
INFO 03-04 11:35:04 [fp8.py:259] Using AITER Fp8 MoE backend out of potential backends
INFO fused_moe.py:757: [fused_moe] using 1stage default for (256, 256, 4096, 1536, 16, 8, 'ActivationType.Silu', 'torch.bfloat16', 'torch.float8_e4m3fn', 'torch.float8_e4m3fn', 'QuantType.per_1x128', True, False)
...
Memory access fault by GPU node-8 ... Reason: Unknown.
Memory access fault by GPU node-3 ... Reason: Unknown.
Memory access fault by GPU node-6 ... Reason: Unknown.
Memory access fault by GPU node-2 ... Reason: Unknown.
Memory access fault by GPU node-9 ... Reason: Unknown.
Memory access fault by GPU node-5 ... Reason: Unknown.
Memory access fault by GPU node-7 ... Reason: Unknown.
The issue appears tied to the AITER MoE path and to specific capture shapes above size 64. Explicitly disabling AITER MoE via its environment variable avoids the problem.
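The workaround mentioned above can be sketched as follows. This assumes `VLLM_ROCM_USE_AITER_MOE` is the relevant vLLM flag for the AITER MoE path; confirm the exact variable name against `vllm/envs.py` in the build you are running.

```shell
# Workaround sketch: keep AITER enabled but disable its MoE path.
# VLLM_ROCM_USE_AITER_MOE is an assumption here; verify against vllm/envs.py
# in your build before relying on it.
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=0 \
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 -tp 8 --enable-expert-parallel
```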
Operating System
Ubuntu 22.04.2 LTS
CPU
AMD EPYC 9575F 64-Core Processor
GPU
8 x AMD Instinct MI350X
ROCm Version
ROCm 7.1
ROCm Component
No response
Steps to Reproduce
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1 \
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
--dtype auto \
-tp 8 \
--trust-remote-code \
--swap-space 16 \
--disable-uvicorn-access-log \
--enable-expert-parallel \
--kv_cache_dtype fp8 \
--compilation-config '{"cudagraph_mode":"PIECEWISE","max_cudagraph_capture_size":256}'
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
Cudagraph size sensitivity:
- max_cudagraph_capture_size=256 - crashes with a memory access fault
- max_cudagraph_capture_size=128 - crashes with a memory access fault
- max_cudagraph_capture_size=64 - startup/capture completes, but a memory access fault still occurs later during inference
This suggests a problematic capture shape in the range (64, 128] (likely one of the piecewise capture sizes above 64).
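The suspect shapes in (64, 128] can be enumerated with a small sketch. This assumes vLLM's default capture-size schedule of [1, 2, 4] followed by multiples of 8 up to `max_cudagraph_capture_size`; verify against the `CompilationConfig` of the running build.

```python
# Enumerate piecewise cudagraph capture sizes in the suspect range (64, 128].
# Assumption: vLLM's default schedule is [1, 2, 4] + multiples of 8 up to the
# configured maximum; check CompilationConfig in your build.
def capture_sizes(max_size: int) -> list[int]:
    return [1, 2, 4] + list(range(8, max_size + 1, 8))

suspects = [s for s in capture_sizes(128) if 64 < s <= 128]
print(suspects)  # -> [72, 80, 88, 96, 104, 112, 120, 128]
```

If the build exposes `cudagraph_capture_sizes` in `--compilation-config`, each candidate could be pinned individually to bisect which shape triggers the fault.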
Environment:
Docker image: rocm/vllm-dev:nightly (sha256:064be2c40583a5a3937ff976a79ada8b49865d829c73c4bfc0985b721dc9fb97)
vLLM version: v0.16.1rc1.dev202+ge37939616
AITER version: amd-aiter 0.1.10.post2
Model: Qwen/Qwen3-VL-235B-A22B-Instruct-FP8
Parallelism: TP=8, --enable-expert-parallel
KV cache dtype: fp8
Compilation mode: PIECEWISE cudagraph capture