[Issue]: GPU memory access fault with AITER FP8 MoE (Qwen3-VL-235B, TP=8, EP) #2187

@sshlyapn

Problem Description

When serving Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 on ROCm with AITER enabled, vLLM crashes with GPU memory access faults during startup (during cudagraph capture):

INFO 03-04 11:35:04 [fp8.py:259] Using AITER Fp8 MoE backend out of potential backends
INFO fused_moe.py:757: [fused_moe] using 1stage default for (256, 256, 4096, 1536, 16, 8, 'ActivationType.Silu', 'torch.bfloat16', 'torch.float8_e4m3fn', 'torch.float8_e4m3fn', 'QuantType.per_1x128', True, False) 

...

Memory access fault by GPU node-8 ... Reason: Unknown.
Memory access fault by GPU node-3 ... Reason: Unknown.
Memory access fault by GPU node-6 ... Reason: Unknown.
Memory access fault by GPU node-2 ... Reason: Unknown.
Memory access fault by GPU node-9 ... Reason: Unknown.
Memory access fault by GPU node-5 ... Reason: Unknown.
Memory access fault by GPU node-7 ... Reason: Unknown.

The issue appears to be tied to the AITER FP8 MoE path and to capture shapes above size 64. Explicitly disabling AITER MoE via its environment variable avoids the problem.
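For completeness, the workaround looks like the sketch below. The flag name used here (VLLM_ROCM_USE_AITER_MOE) is an assumption and should be verified against vllm/envs.py for this build:

```shell
# Assumed workaround: keep AITER enabled overall but disable only its MoE path.
# VLLM_ROCM_USE_AITER_MOE is assumed to be the relevant toggle; verify in vllm/envs.py.
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=0 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1 \
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 -tp 8 --enable-expert-parallel
```

With this set, vLLM falls back to the non-AITER FP8 MoE kernels and the memory access fault no longer reproduces.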

Operating System

Ubuntu 22.04.2 LTS

CPU

AMD EPYC 9575F 64-Core Processor

GPU

8 x AMD Instinct MI350X

ROCm Version

ROCm 7.1

ROCm Component

No response

Steps to Reproduce

VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1 \
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --dtype auto \
  -tp 8 \
  --trust-remote-code \
  --swap-space 16 \
  --disable-uvicorn-access-log \
  --enable-expert-parallel \
  --kv_cache_dtype fp8 \
  --compilation-config '{"cudagraph_mode":"PIECEWISE","max_cudagraph_capture_size":256}'

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

Cudagraph size sensitivity:
max_cudagraph_capture_size=256 - crashes with a memory access fault during capture
max_cudagraph_capture_size=128 - crashes with a memory access fault during capture
max_cudagraph_capture_size=64 - startup/capture completes, but the memory access fault still occurs later during inference

This suggests a problematic capture shape in the range (64, 128] (likely one of the piecewise capture sizes above 64).
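To narrow down candidates, the sketch below enumerates the capture sizes that would fall in (64, 128], assuming vLLM's usual scheme of [1, 2, 4] followed by multiples of 8 up to the configured maximum (an assumption about this build's defaults, not confirmed from the source):

```python
# Assumed reconstruction of vLLM's default piecewise cudagraph capture sizes:
# [1, 2, 4], then multiples of 8 up to max_cudagraph_capture_size.
def capture_sizes(max_size: int) -> list[int]:
    sizes = [1, 2, 4] + [8 * i for i in range(1, max_size // 8 + 1)]
    return [s for s in sizes if s <= max_size]

# Candidate problematic shapes between 64 (exclusive) and 128 (inclusive).
suspect = [s for s in capture_sizes(128) if 64 < s <= 128]
print(suspect)  # [72, 80, 88, 96, 104, 112, 120, 128]
```

Bisecting max_cudagraph_capture_size over these values (or capturing them one at a time) should isolate the first shape that faults.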

Environment:

Docker image: rocm/vllm-dev:nightly (sha256:064be2c40583a5a3937ff976a79ada8b49865d829c73c4bfc0985b721dc9fb97)
vLLM version: v0.16.1rc1.dev202+ge37939616
AITER version: amd-aiter 0.1.10.post2
Model: Qwen/Qwen3-VL-235B-A22B-Instruct-FP8
Parallelism: TP=8, --enable-expert-parallel
KV cache dtype: fp8
Compilation mode: PIECEWISE cudagraph capture
