Optimize Kimi-K2.5-FP4 on AMD MI355X: Enable AITER and Expert Parallel#922

Open
ChuanLi1101 wants to merge 4 commits into SemiAnalysisAI:main from ChuanLi1101:dev/rocm

Conversation

@ChuanLi1101 (Contributor)

Summary

  • Enable AITER acceleration for Kimi-K2.5-FP4 on MI355X, including MLA, MoE, Triton RoPE, and INT8 quick-reduce quantization (VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MLA=1, VLLM_ROCM_USE_AITER_MOE=1, etc.)
  • Add expert parallel support with --enable-expert-parallel when EP_SIZE > 1, and add a corresponding tp: 4, ep: 4 search-space entry in the benchmark config for ISL=1024/OSL=8192
  • Reduce tensor parallelism from 8 to 4 across all benchmark search spaces, allowing better per-GPU utilization
  • Tune serving parameters: lower --gpu-memory-utilization from 0.95 to 0.90 for stability, change --block-size from 64 to 1
  • Add MEC firmware version check to conditionally disable scratch reclaim (HSA_NO_SCRATCH_RECLAIM=1) on older firmware (< 177) to avoid crashes
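The MEC firmware guard described above can be sketched as follows. This is a minimal sketch, not the verbatim script: it assumes the MEC firmware version has already been read as an integer (e.g. parsed from `rocm-smi --showfwinfo`, whose exact output format varies by ROCm release), and only shows the conditional export.

```shell
#!/usr/bin/env bash
# Hedged sketch of the MEC firmware guard from the PR summary.
# Assumption: $1 is the MEC firmware version as a plain integer,
# obtained elsewhere (e.g. by parsing `rocm-smi --showfwinfo`).

mec_fw_guard() {
  local mec_fw="$1"
  if [ "$mec_fw" -lt 177 ]; then
    # Older MEC firmware (< 177) can crash with scratch reclaim enabled,
    # so disable it per the PR summary.
    export HSA_NO_SCRATCH_RECLAIM=1
  fi
}

mec_fw_guard 150
echo "${HSA_NO_SCRATCH_RECLAIM:-unset}"   # prints 1 (FW 150 < 177)
```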

Changed Files

  • .github/configs/amd-master.yaml: Update kimik2.5-fp4-mi355x-vllm search spaces: tp 8->4, add ep=4 config
  • benchmarks/single_node/kimik2.5_fp4_mi355x.sh: Enable AITER env vars, add expert-parallel flag, MEC FW check, tune vLLM args
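The launch-side changes to the benchmark script can be sketched as below. This is not the verbatim content of kimik2.5_fp4_mi355x.sh: the model path placeholder and argument layout are assumptions, and the command is echoed rather than executed so the logic can be inspected; only the env var, the 0.90/1 values, and the `EP_SIZE > 1` condition come from the PR.

```shell
#!/usr/bin/env bash
# Sketch of the serve-launch logic described in the PR summary.
# VLLM_ROCM_USE_AITER is the master switch; per the commit message,
# the MLA/MoE sub-switches default to on in vLLM's envs.py.
export VLLM_ROCM_USE_AITER=1

MODEL=${MODEL:-"<kimi-k2.5-fp4-model-path>"}   # placeholder, not the real path
TP_SIZE=${TP_SIZE:-4}
EP_SIZE=${EP_SIZE:-4}

args=(serve "$MODEL"
  --tensor-parallel-size "$TP_SIZE"
  --gpu-memory-utilization 0.90
  --block-size 1)

# Expert parallel is only requested when EP_SIZE > 1.
if [ "$EP_SIZE" -gt 1 ]; then
  args+=(--enable-expert-parallel)
fi

echo vllm "${args[@]}"   # echoed instead of executed in this sketch
```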

Test Plan

  • Run kimik2.5-fp4-mi355x-vllm benchmark suite on MI355X with the updated config
  • Verify expert parallel mode (EP_SIZE > 1) launches correctly with --enable-expert-parallel
  • Validate AITER MLA/MoE kernels are active via vLLM logs
  • Confirm no OOM or crash with --gpu-memory-utilization 0.90 and --block-size 1
  • Test on systems with MEC FW < 177 to verify the HSA_NO_SCRATCH_RECLAIM guard works
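The AITER log-validation step above can be sketched as a broad grep, since the exact messages vLLM emits are version-dependent and not quoted in the PR. The log path and the sample line written here are stand-ins, not real vLLM output.

```shell
#!/usr/bin/env bash
# Sketch of the test-plan step "validate AITER MLA/MoE kernels are active
# via vLLM logs". Both the log path and the sample line are assumptions;
# we grep case-insensitively for "aiter" rather than match exact messages.
LOG=${LOG:-/tmp/vllm_server.log}
printf 'INFO Using AITER MLA backend\n' > "$LOG"   # stand-in log line

if grep -qi "aiter" "$LOG"; then
  echo "AITER active (per log)"
fi
```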


@chunfangamd chunfangamd left a comment


lgtm

…pace

- Remove redundant VLLM_ROCM_USE_AITER_MLA=1 and VLLM_ROCM_USE_AITER_MOE=1
  (both default to True in vllm envs.py, only master switch needed)
- Remove VLLM_ROCM_USE_AITER_TRITON_ROPE=1 (a no-op without
  --compilation-config custom_ops+=+rotary_embedding)
- Switch VLLM_ROCM_QUICK_REDUCE_QUANTIZATION from INT8 to INT4
  for better TTFT/TPOT (2.2x vs 1.17x per quickreduce benchmarks)
- Add TP8EP1 back to all search spaces alongside TP4EP1 and TP4EP4
  so InferenceX can sweep and determine optimal config empirically

Made-with: Cursor
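The resulting search space in amd-master.yaml might look like the following fragment. This is a hedged sketch: the key names and nesting are assumptions, and only the tp/ep combinations (TP8EP1, TP4EP1, TP4EP4) come from the commit message above.

```yaml
# Hypothetical shape of the kimik2.5-fp4-mi355x-vllm search space;
# actual key names in amd-master.yaml may differ.
search_space:
  - tp: 8
    ep: 1   # restored so InferenceX can sweep it empirically
  - tp: 4
    ep: 1
  - tp: 4
    ep: 4   # ISL=1024/OSL=8192 entry added by this PR
```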
@seungrokj (Collaborator)

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys kimik2.5-fp4-mi355x-vllm

@github-actions

@seungrokj Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23416123849
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys kimik2.5-fp4-mi355x-vllm
Pinned ref: 5410ce5
Approval: not required (trusted collaborator).

```diff
     --max-model-len $MAX_MODEL_LEN \
-    --block-size=64 \
     --disable-log-requests \
+    --block-size=1 \
```

```diff
+    --no-enable-prefix-caching \
```

@ChuanLi1101 now we need this

6 participants