Add AMD ROCm (MI355X) deployment guide for Kimi-K2.5-MXFP4 #296
ChuanLi1101 wants to merge 2 commits into vllm-project:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request expands the Kimi-K2.5 usage guide with comprehensive deployment instructions for AMD ROCm (MI355X) GPUs. It provides detailed guidance for running the MXFP4-quantized model, covering both Dockerized and bare-metal setups.
Code Review
This pull request adds a comprehensive guide for deploying the Kimi-K2.5 model on AMD ROCm hardware (MI355X). The instructions cover both Docker and bare-metal setups for 4-GPU and 8-GPU configurations, including necessary environment variables for AITER acceleration. The documentation is clear and well-structured. My review includes suggestions to add missing arguments to the ROCm examples to enable the model's agentic capabilities, aligning them with the existing NVIDIA examples.
```bash
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 \
    --block-size 1 \
    --mm-encoder-tp-mode data \
    --trust-remote-code
```
The ROCm Docker example is missing arguments to enable the agentic features of the Kimi-K2.5 model, such as --tool-call-parser, --reasoning-parser, and --enable-auto-tool-choice. These are present in the NVIDIA examples. Including them will ensure users can leverage the model's full capabilities.
This change should also be applied to the 8-GPU Docker example.
Suggested change:

```diff
     --tensor-parallel-size 4 \
     --gpu-memory-utilization 0.90 \
     --block-size 1 \
     --mm-encoder-tp-mode data \
+    --tool-call-parser kimi_k2 \
+    --reasoning-parser kimi_k2 \
+    --enable-auto-tool-choice \
     --trust-remote-code
```
```bash
vllm serve amd/Kimi-K2.5-MXFP4 -tp 4 \
    --gpu-memory-utilization 0.90 \
    --block-size 1 \
    --mm-encoder-tp-mode data \
    --trust-remote-code
```
Similar to the Docker example, the bare-metal vllm serve command for ROCm is missing the agentic feature flags (--tool-call-parser, --reasoning-parser, --enable-auto-tool-choice). Adding them ensures feature parity with the NVIDIA deployment and allows users to access the model's full capabilities.
This change should also be applied to the 8-GPU vllm serve example.
Suggested change:

```diff
 vllm serve amd/Kimi-K2.5-MXFP4 -tp 4 \
     --gpu-memory-utilization 0.90 \
     --block-size 1 \
     --mm-encoder-tp-mode data \
+    --tool-call-parser kimi_k2 \
+    --reasoning-parser kimi_k2 \
+    --enable-auto-tool-choice \
     --trust-remote-code
```
cc @ywang96 @andyluo7 @ChangLiu0709 - This PR adds AMD ROCm (MI355X) deployment instructions for Kimi-K2.5-MXFP4, based on the optimized configurations from SemiAnalysisAI/InferenceX. Would appreciate a review when you have a chance!
## Use vLLM with Docker

Pull the vLLM release image from [Docker Hub](https://hub.docker.com/r/vllm/vllm-openai/tags?name=17.0):
For ROCm, we pull from https://hub.docker.com/r/vllm/vllm-openai-rocm/tags
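Following this suggestion, the ROCm pull would look something like the sketch below. The exact tag is an assumption; check the linked `vllm/vllm-openai-rocm` Docker Hub page for the tags that actually exist before pulling.

```shell
# Pull the ROCm variant of the vLLM OpenAI-compatible server image.
# Tag name is illustrative -- verify available tags on Docker Hub first.
docker pull vllm/vllm-openai-rocm:v0.17.0
```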
```bash
docker pull vllm/vllm-openai:v0.17.0-cu130  # CUDA 13.0
docker pull vllm/vllm-openai:v0.17.0        # Other CUDA versions
```
We should also add ROCm versions here.
```bash
uv pip install vllm --torch-backend auto
```

## Running Kimi-K2.5 with vLLM
Should we insert a subsection title for NVIDIA here, parallel to the next subsection "### Running on AMD ROCm"?
Thanks for the review @chunfangamd and @gemini-code-assist! I've pushed a follow-up commit addressing all feedback:
Thanks for the update, @ChuanLi1101! It's great to see the improvements based on the feedback. I'll take another look at the changes.
@mgoin can we get a review on this PR?
For more usage examples, check out the [vLLM user guide for multimodal models](https://docs.vllm.ai/en/latest/features/multimodal_inputs.html) and the [official Kimi-K2.5 Hugging Face page](https://huggingface.co/moonshotai/Kimi-K2.5)!

# moonshotai/Kimi-K2.5 Usage Guide

[Kimi K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.
@ChuanLi1101 please fix the Git history; there shouldn't be this many line changes.
```bash
    -e VLLM_ROCM_USE_AITER_MOE=1 \
    -e VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT8 \
    -e VLLM_ROCM_USE_AITER_TRITON_ROPE=1 \
    vllm/vllm-openai-rocm:v0.16.0 \
```
We shouldn't pin to a specific version unless necessary. Is this intended?
```bash
    --shm-size=16G \
    -p 8000:8000 \
    -e VLLM_ROCM_USE_AITER=1 \
    -e VLLM_ROCM_USE_AITER_MLA=1 \
```
This is enabled by default, we can remove it.
```bash
    -p 8000:8000 \
    -e VLLM_ROCM_USE_AITER=1 \
    -e VLLM_ROCM_USE_AITER_MLA=1 \
    -e VLLM_ROCM_USE_AITER_MOE=1 \
```
This is enabled by default, we can remove it.
```bash
    --shm-size=16G \
    -p 8000:8000 \
    -e VLLM_ROCM_USE_AITER=1 \
    -e VLLM_ROCM_USE_AITER_MLA=1 \
```
This is enabled by default, we can remove it.
```bash
    -p 8000:8000 \
    -e VLLM_ROCM_USE_AITER=1 \
    -e VLLM_ROCM_USE_AITER_MLA=1 \
    -e VLLM_ROCM_USE_AITER_MOE=1 \
```
This is enabled by default, we can remove it.
```bash
    -e VLLM_ROCM_USE_AITER_MOE=1 \
    -e VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT8 \
    -e VLLM_ROCM_USE_AITER_TRITON_ROPE=1 \
    vllm/vllm-openai-rocm:v0.16.0 \
```
We shouldn't pin to a specific version unless necessary. Is this intended?
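If pinning isn't needed, the image reference could float instead, as in the sketch below. The tag name is an assumption about the `vllm/vllm-openai-rocm` repository's tagging scheme; verify it on Docker Hub before relying on it.

```shell
# Use a floating tag instead of pinning v0.16.0, so the example
# tracks the current release (tag name is illustrative).
docker pull vllm/vllm-openai-rocm:latest
```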
```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=1
```
This is enabled by default, we can remove it.
```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=1
export VLLM_ROCM_USE_AITER_MOE=1
```
This is enabled by default, we can remove it.
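Taking these comments together, the bare-metal export block would shrink to something like the sketch below. Which flags are on by default can vary by vLLM version, so this is an assumption to verify against the docs for the version in use.

```shell
# Flags reported as enabled by default (VLLM_ROCM_USE_AITER_MLA,
# VLLM_ROCM_USE_AITER_MOE) are dropped; only non-default tuning remains.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT8
export VLLM_ROCM_USE_AITER_TRITON_ROPE=1
```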
Summary

Adds AMD ROCm (MI355X) deployment instructions for Kimi-K2.5-MXFP4, including Docker and bare-metal `vllm serve` commands and an AITER env var reference table. Based on the optimized configurations from InferenceX.

Changes

The recipe previously only covered NVIDIA GPUs (Hopper H200, Blackwell GB200). This PR adds:

- Docker deployment using the `vllm/vllm-openai-rocm:v0.16.0` image with `amd/Kimi-K2.5-MXFP4`, AITER env vars, and ROCm-specific device flags
- Bare-metal `vllm serve` commands with TP=4 and TP=4+EP configurations
- AITER env var reference: `VLLM_ROCM_USE_AITER`, `VLLM_ROCM_USE_AITER_MLA`, `VLLM_ROCM_USE_AITER_MOE`, `VLLM_ROCM_QUICK_REDUCE_QUANTIZATION`, `VLLM_ROCM_USE_AITER_TRITON_ROPE`
- `HSA_NO_SCRATCH_RECLAIM=1` workaround for older AMD drivers

Test Plan

- Serve `amd/Kimi-K2.5-MXFP4` using the documented Docker command
- Verify the TP=4+EP configuration with `--enable-expert-parallel`
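As a smoke test for the plan above, a request against the OpenAI-compatible endpoint could look like this sketch. It assumes the server is already up on `localhost:8000` with the default port mapping from the Docker command.

```shell
# Send a minimal chat completion to the vllm serve endpoint
# (requires a running server; model name matches the deployed checkpoint).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "amd/Kimi-K2.5-MXFP4",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```

A successful deployment returns a JSON body with a `choices` array containing the model's reply.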