Optimize Kimi-K2.5-FP4 on AMD MI355X: Enable AITER and Expert Parallel#922

Open
ChuanLi1101 wants to merge 4 commits into SemiAnalysisAI:main from ChuanLi1101:dev/rocm

Conversation

@ChuanLi1101 (Contributor)

Summary

  • Enable AITER acceleration for Kimi-K2.5-FP4 on MI355X, including MLA, MoE, Triton RoPE, and INT8 quick-reduce quantization (VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MLA=1, VLLM_ROCM_USE_AITER_MOE=1, etc.)
  • Add expert parallel support with --enable-expert-parallel when EP_SIZE > 1, and add a corresponding tp: 4, ep: 4 search-space entry in the benchmark config for ISL=1024/OSL=8192
  • Reduce tensor parallelism from 8 to 4 across all benchmark search spaces, allowing better per-GPU utilization
  • Tune serving parameters: lower --gpu-memory-utilization from 0.95 to 0.90 for stability, change --block-size from 64 to 1
  • Add MEC firmware version check to conditionally disable scratch reclaim (HSA_NO_SCRATCH_RECLAIM=1) on older firmware (< 177) to avoid crashes
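The MEC firmware guard described above can be sketched as follows. This is a minimal sketch, not the verbatim script: it assumes the MEC firmware version has already been read as an integer (e.g. parsed from `rocm-smi --showfwinfo`, whose exact output format varies by ROCm release), and only shows the conditional export.

```shell
#!/usr/bin/env bash
# Hedged sketch of the MEC firmware guard from the PR summary.
# Assumption: $1 is the MEC firmware version as a plain integer,
# obtained elsewhere (e.g. by parsing `rocm-smi --showfwinfo`).

mec_fw_guard() {
  local mec_fw="$1"
  if [ "$mec_fw" -lt 177 ]; then
    # Older MEC firmware (< 177) can crash with scratch reclaim enabled,
    # so disable it per the PR summary.
    export HSA_NO_SCRATCH_RECLAIM=1
  fi
}

mec_fw_guard 150
echo "${HSA_NO_SCRATCH_RECLAIM:-unset}"   # prints 1 (FW 150 < 177)
```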

Changed Files

  • .github/configs/amd-master.yaml: Update kimik2.5-fp4-mi355x-vllm search spaces: tp 8->4, add ep=4 config
  • benchmarks/single_node/kimik2.5_fp4_mi355x.sh: Enable AITER env vars, add expert-parallel flag, MEC FW check, tune vLLM args
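The launch-side changes to the benchmark script can be sketched as below. This is not the verbatim content of kimik2.5_fp4_mi355x.sh: the model path placeholder and argument layout are assumptions, and the command is echoed rather than executed so the logic can be inspected; only the env var, the 0.90/1 values, and the `EP_SIZE > 1` condition come from the PR.

```shell
#!/usr/bin/env bash
# Sketch of the serve-launch logic described in the PR summary.
# VLLM_ROCM_USE_AITER is the master switch; per the commit message,
# the MLA/MoE sub-switches default to on in vLLM's envs.py.
export VLLM_ROCM_USE_AITER=1

MODEL=${MODEL:-"<kimi-k2.5-fp4-model-path>"}   # placeholder, not the real path
TP_SIZE=${TP_SIZE:-4}
EP_SIZE=${EP_SIZE:-4}

args=(serve "$MODEL"
  --tensor-parallel-size "$TP_SIZE"
  --gpu-memory-utilization 0.90
  --block-size 1)

# Expert parallel is only requested when EP_SIZE > 1.
if [ "$EP_SIZE" -gt 1 ]; then
  args+=(--enable-expert-parallel)
fi

echo vllm "${args[@]}"   # echoed instead of executed in this sketch
```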

Test Plan

  • Run kimik2.5-fp4-mi355x-vllm benchmark suite on MI355X with the updated config
  • Verify expert parallel mode (EP_SIZE > 1) launches correctly with --enable-expert-parallel
  • Validate AITER MLA/MoE kernels are active via vLLM logs
  • Confirm no OOM or crash with --gpu-memory-utilization 0.90 and --block-size 1
  • Test on systems with MEC FW < 177 to verify the HSA_NO_SCRATCH_RECLAIM guard works
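The AITER log-validation step above can be sketched as a broad grep, since the exact messages vLLM emits are version-dependent and not quoted in the PR. The log path and the sample line written here are stand-ins, not real vLLM output.

```shell
#!/usr/bin/env bash
# Sketch of the test-plan step "validate AITER MLA/MoE kernels are active
# via vLLM logs". Both the log path and the sample line are assumptions;
# we grep case-insensitively for "aiter" rather than match exact messages.
LOG=${LOG:-/tmp/vllm_server.log}
printf 'INFO Using AITER MLA backend\n' > "$LOG"   # stand-in log line

if grep -qi "aiter" "$LOG"; then
  echo "AITER active (per log)"
fi
```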


@chunfangamd chunfangamd left a comment


lgtm

…pace

- Remove redundant VLLM_ROCM_USE_AITER_MLA=1 and VLLM_ROCM_USE_AITER_MOE=1
  (both default to True in vllm envs.py, only master switch needed)
- Remove VLLM_ROCM_USE_AITER_TRITON_ROPE=1 (a no-op without
  --compilation-config custom_ops+=+rotary_embedding)
- Switch VLLM_ROCM_QUICK_REDUCE_QUANTIZATION from INT8 to INT4
  for better TTFT/TPOT (2.2x vs 1.17x per quickreduce benchmarks)
- Add TP8EP1 back to all search spaces alongside TP4EP1 and TP4EP4
  so InferenceX can sweep and determine optimal config empirically

Made-with: Cursor
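The resulting search space in amd-master.yaml might look like the following fragment. This is a hedged sketch: the key names and nesting are assumptions, and only the tp/ep combinations (TP8EP1, TP4EP1, TP4EP4) come from the commit message above.

```yaml
# Hypothetical shape of the kimik2.5-fp4-mi355x-vllm search space;
# actual key names in amd-master.yaml may differ.
search_space:
  - tp: 8
    ep: 1   # restored so InferenceX can sweep it empirically
  - tp: 4
    ep: 1
  - tp: 4
    ep: 4   # ISL=1024/OSL=8192 entry added by this PR
```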
@seungrokj (Collaborator)

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys kimik2.5-fp4-mi355x-vllm

@github-actions

@seungrokj Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23416123849
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys kimik2.5-fp4-mi355x-vllm
Pinned ref: 5410ce5
Approval: not required (trusted collaborator).

```diff
     --max-model-len $MAX_MODEL_LEN \
-    --block-size=64 \
     --disable-log-requests \
+    --block-size=1 \
```

```diff
+    --no-enable-prefix-caching \
```

@ChuanLi1101 now we need this

6 participants