Add AMD ROCm (MI355X) deployment guide for Kimi-K2.5-MXFP4 #296
ChuanLi1101 wants to merge 2 commits into vllm-project:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request expands the Kimi-K2.5 usage guide with comprehensive deployment instructions for AMD ROCm (MI355X) GPUs. It provides detailed guidance for running the MXFP4-quantized model, covering both Dockerized and bare-metal setups.
Code Review
This pull request adds a comprehensive guide for deploying the Kimi-K2.5 model on AMD ROCm hardware (MI355X). The instructions cover both Docker and bare-metal setups for 4-GPU and 8-GPU configurations, including necessary environment variables for AITER acceleration. The documentation is clear and well-structured. My review includes suggestions to add missing arguments to the ROCm examples to enable the model's agentic capabilities, aligning them with the existing NVIDIA examples.
```bash
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 \
    --block-size 1 \
    --mm-encoder-tp-mode data \
    --trust-remote-code
```
The ROCm Docker example is missing arguments to enable the agentic features of the Kimi-K2.5 model, such as --tool-call-parser, --reasoning-parser, and --enable-auto-tool-choice. These are present in the NVIDIA examples. Including them will ensure users can leverage the model's full capabilities.
This change should also be applied to the 8-GPU Docker example.
Suggested change:

```diff
     --tensor-parallel-size 4 \
     --gpu-memory-utilization 0.90 \
     --block-size 1 \
     --mm-encoder-tp-mode data \
+    --tool-call-parser kimi_k2 \
+    --reasoning-parser kimi_k2 \
+    --enable-auto-tool-choice \
     --trust-remote-code
```
```bash
vllm serve amd/Kimi-K2.5-MXFP4 -tp 4 \
    --gpu-memory-utilization 0.90 \
    --block-size 1 \
    --mm-encoder-tp-mode data \
    --trust-remote-code
```
Similar to the Docker example, the bare-metal vllm serve command for ROCm is missing the agentic feature flags (--tool-call-parser, --reasoning-parser, --enable-auto-tool-choice). Adding them ensures feature parity with the NVIDIA deployment and allows users to access the model's full capabilities.
This change should also be applied to the 8-GPU vllm serve example.
Suggested change:

```diff
 vllm serve amd/Kimi-K2.5-MXFP4 -tp 4 \
     --gpu-memory-utilization 0.90 \
     --block-size 1 \
     --mm-encoder-tp-mode data \
+    --tool-call-parser kimi_k2 \
+    --reasoning-parser kimi_k2 \
+    --enable-auto-tool-choice \
     --trust-remote-code
```
cc @ywang96 @andyluo7 @ChangLiu0709 - This PR adds AMD ROCm (MI355X) deployment instructions for Kimi-K2.5-MXFP4, based on the optimized configurations from SemiAnalysisAI/InferenceX. Would appreciate a review when you have a chance!
## Use vLLM with Docker

Pull the vLLM release image from [Docker Hub](https://hub.docker.com/r/vllm/vllm-openai/tags?name=17.0):
For ROCm, we pull from https://hub.docker.com/r/vllm/vllm-openai-rocm/tags
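Following this suggestion, the ROCm pull would look something like the sketch below. The exact tag is an assumption; check the linked `vllm/vllm-openai-rocm` Docker Hub page for the tags that actually exist before pulling.

```shell
# Pull the ROCm variant of the vLLM OpenAI-compatible server image.
# Tag name is illustrative -- verify available tags on Docker Hub first.
docker pull vllm/vllm-openai-rocm:v0.17.0
```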
```bash
docker pull vllm/vllm-openai:v0.17.0-cu130  # CUDA 13.0
docker pull vllm/vllm-openai:v0.17.0        # Other CUDA versions
```
We should also add ROCm versions here.
```bash
uv pip install vllm --torch-backend auto
```

## Running Kimi-K2.5 with vLLM
Should we insert a subsection title for NVIDIA here, parallel to the next subsection "### Running on AMD ROCm"?
Thanks for the review @chunfangamd and @gemini-code-assist! I've pushed a follow-up commit addressing all feedback:
Thanks for the update, @ChuanLi1101! It's great to see the improvements based on the feedback. I'll take another look at the changes.
@mgoin can we get a review on this PR?
For more usage examples, check out the [vLLM user guide for multimodal models](https://docs.vllm.ai/en/latest/features/multimodal_inputs.html) and the [official Kimi-K2.5 Hugging Face page](https://huggingface.co/moonshotai/Kimi-K2.5)!

# moonshotai/Kimi-K2.5 Usage Guide

[Kimi K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.
@ChuanLi1101 please fix the Git history; there shouldn't be this many line changes.
```bash
    -e VLLM_ROCM_USE_AITER_MOE=1 \
    -e VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT8 \
    -e VLLM_ROCM_USE_AITER_TRITON_ROPE=1 \
    vllm/vllm-openai-rocm:v0.16.0 \
```
We shouldn't pin to a specific version unless necessary. Is this intended?
```bash
    --shm-size=16G \
    -p 8000:8000 \
    -e VLLM_ROCM_USE_AITER=1 \
    -e VLLM_ROCM_USE_AITER_MLA=1 \
```
This is enabled by default, we can remove it.
```bash
    -p 8000:8000 \
    -e VLLM_ROCM_USE_AITER=1 \
    -e VLLM_ROCM_USE_AITER_MLA=1 \
    -e VLLM_ROCM_USE_AITER_MOE=1 \
```
This is enabled by default, we can remove it.
```bash
    --shm-size=16G \
    -p 8000:8000 \
    -e VLLM_ROCM_USE_AITER=1 \
    -e VLLM_ROCM_USE_AITER_MLA=1 \
```
This is enabled by default, we can remove it.
```bash
    -p 8000:8000 \
    -e VLLM_ROCM_USE_AITER=1 \
    -e VLLM_ROCM_USE_AITER_MLA=1 \
    -e VLLM_ROCM_USE_AITER_MOE=1 \
```
This is enabled by default, we can remove it.
```bash
    -e VLLM_ROCM_USE_AITER_MOE=1 \
    -e VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT8 \
    -e VLLM_ROCM_USE_AITER_TRITON_ROPE=1 \
    vllm/vllm-openai-rocm:v0.16.0 \
```
We shouldn't pin to a specific version unless necessary. Is this intended?
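If pinning isn't needed, the image reference could float instead, as in the sketch below. The tag name is an assumption about the `vllm/vllm-openai-rocm` repository's tagging scheme; verify it on Docker Hub before relying on it.

```shell
# Use a floating tag instead of pinning v0.16.0, so the example
# tracks the current release (tag name is illustrative).
docker pull vllm/vllm-openai-rocm:latest
```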
```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=1
```
This is enabled by default, we can remove it.
```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=1
export VLLM_ROCM_USE_AITER_MOE=1
```
This is enabled by default, we can remove it.
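Taking these comments together, the bare-metal export block would shrink to something like the sketch below. Which flags are on by default can vary by vLLM version, so this is an assumption to verify against the docs for the version in use.

```shell
# Flags reported as enabled by default (VLLM_ROCM_USE_AITER_MLA,
# VLLM_ROCM_USE_AITER_MOE) are dropped; only non-default tuning remains.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT8
export VLLM_ROCM_USE_AITER_TRITON_ROPE=1
```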
Summary

Adds AMD ROCm (MI355X) deployment instructions for Kimi-K2.5-MXFP4, including Docker and bare-metal `vllm serve` commands and an AITER env var reference table. Based on the optimized configurations from InferenceX.

Changes

The recipe previously only covered NVIDIA GPUs (Hopper H200, Blackwell GB200). This PR adds:

- Docker deployment using the `vllm/vllm-openai-rocm:v0.16.0` image with `amd/Kimi-K2.5-MXFP4`, AITER env vars, and ROCm-specific device flags
- Bare-metal `vllm serve` commands with TP=4 and TP=4+EP configurations
- AITER env var reference: `VLLM_ROCM_USE_AITER`, `VLLM_ROCM_USE_AITER_MLA`, `VLLM_ROCM_USE_AITER_MOE`, `VLLM_ROCM_QUICK_REDUCE_QUANTIZATION`, `VLLM_ROCM_USE_AITER_TRITON_ROPE`
- `HSA_NO_SCRATCH_RECLAIM=1` workaround for older AMD drivers

Test Plan

- Serve `amd/Kimi-K2.5-MXFP4` using the documented Docker command
- Verify the TP=4+EP configuration with `--enable-expert-parallel`
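As a smoke test for the plan above, a request against the OpenAI-compatible endpoint could look like this sketch. It assumes the server is already up on `localhost:8000` with the default port mapping from the Docker command.

```shell
# Send a minimal chat completion to the vllm serve endpoint
# (requires a running server; model name matches the deployed checkpoint).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "amd/Kimi-K2.5-MXFP4",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```

A successful deployment returns a JSON body with a `choices` array containing the model's reply.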