[Issue]: GPU memory access fault with AITER FP8 MoE (Qwen3-VL-235B, TP=8, EP) #2187

@sshlyapn

Problem Description

When serving Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 on ROCm with AITER enabled, vLLM crashes with GPU memory access faults during startup (during cudagraph capture):

INFO 03-04 11:35:04 [fp8.py:259] Using AITER Fp8 MoE backend out of potential backends
INFO fused_moe.py:757: [fused_moe] using 1stage default for (256, 256, 4096, 1536, 16, 8, 'ActivationType.Silu', 'torch.bfloat16', 'torch.float8_e4m3fn', 'torch.float8_e4m3fn', 'QuantType.per_1x128', True, False) 

...

Memory access fault by GPU node-8 ... Reason: Unknown.
Memory access fault by GPU node-3 ... Reason: Unknown.
Memory access fault by GPU node-6 ... Reason: Unknown.
Memory access fault by GPU node-2 ... Reason: Unknown.
Memory access fault by GPU node-9 ... Reason: Unknown.
Memory access fault by GPU node-5 ... Reason: Unknown.
Memory access fault by GPU node-7 ... Reason: Unknown.

The issue appears to be tied to the AITER FP8 MoE path and to capture shapes above size 64. Explicitly disabling AITER MoE via its environment variable avoids the problem.
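For completeness, the workaround looks like the sketch below. The flag name used here (VLLM_ROCM_USE_AITER_MOE) is an assumption and should be verified against vllm/envs.py for this build:

```shell
# Assumed workaround: keep AITER enabled overall but disable only its MoE path.
# VLLM_ROCM_USE_AITER_MOE is assumed to be the relevant toggle; verify in vllm/envs.py.
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MOE=0 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1 \
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 -tp 8 --enable-expert-parallel
```

With this set, vLLM falls back to the non-AITER FP8 MoE kernels and the memory access fault no longer reproduces.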

Operating System

Ubuntu 22.04.2 LTS

CPU

AMD EPYC 9575F 64-Core Processor

GPU

8 x AMD Instinct MI350X

ROCm Version

ROCm 7.1

ROCm Component

No response

Steps to Reproduce

VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1 \
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --dtype auto \
  -tp 8 \
  --trust-remote-code \
  --swap-space 16 \
  --disable-uvicorn-access-log \
  --enable-expert-parallel \
  --kv_cache_dtype fp8 \
  --compilation-config '{"cudagraph_mode":"PIECEWISE","max_cudagraph_capture_size":256}'

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

Cudagraph size sensitivity:
max_cudagraph_capture_size=256 - crashes with a memory access fault during capture
max_cudagraph_capture_size=128 - crashes with a memory access fault during capture
max_cudagraph_capture_size=64 - startup/capture completes, but the memory access fault still occurs later during inference

This suggests a problematic capture shape in the range (64, 128] (likely one of the piecewise capture sizes above 64).
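To narrow down candidates, the sketch below enumerates the capture sizes that would fall in (64, 128], assuming vLLM's usual scheme of [1, 2, 4] followed by multiples of 8 up to the configured maximum (an assumption about this build's defaults, not confirmed from the source):

```python
# Assumed reconstruction of vLLM's default piecewise cudagraph capture sizes:
# [1, 2, 4], then multiples of 8 up to max_cudagraph_capture_size.
def capture_sizes(max_size: int) -> list[int]:
    sizes = [1, 2, 4] + [8 * i for i in range(1, max_size // 8 + 1)]
    return [s for s in sizes if s <= max_size]

# Candidate problematic shapes between 64 (exclusive) and 128 (inclusive).
suspect = [s for s in capture_sizes(128) if 64 < s <= 128]
print(suspect)  # [72, 80, 88, 96, 104, 112, 120, 128]
```

Bisecting max_cudagraph_capture_size over these values (or capturing them one at a time) should isolate the first shape that faults.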

Environment:

Docker image: rocm/vllm-dev:nightly (sha256:064be2c40583a5a3937ff976a79ada8b49865d829c73c4bfc0985b721dc9fb97)
vLLM version: v0.16.1rc1.dev202+ge37939616
AITER version: amd-aiter 0.1.10.post2
Model: Qwen/Qwen3-VL-235B-A22B-Instruct-FP8
Parallelism: TP=8, --enable-expert-parallel
KV cache dtype: fp8
Compilation mode: PIECEWISE cudagraph capture
