Add AMD ROCm (MI355X) deployment guide for Kimi-K2.5-MXFP4#296

Open
ChuanLi1101 wants to merge 2 commits into vllm-project:main from ChuanLi1101:add-rocm-kimi-k2.5

Conversation

@ChuanLi1101

Summary

  • Add AMD ROCm (MI355X) deployment instructions for Kimi-K2.5 using the MXFP4-quantized model
  • Include Docker examples for both 4×MI355X (TP=4) and 8×MI355X (TP=4 + Expert Parallel) configurations
  • Document AITER acceleration environment variables for optimized MLA, MoE, and RoPE kernels on ROCm
  • Add non-Docker "Running on AMD ROCm" section with vllm serve commands and AITER env var reference table

Based on the optimized configurations from InferenceX.

Changes

The recipe previously only covered NVIDIA GPUs (Hopper H200, Blackwell GB200). This PR adds:

  1. Docker section "AMD ROCm (MI355X)" — uses the vllm/vllm-openai-rocm:v0.16.0 image with amd/Kimi-K2.5-MXFP4, AITER env vars, and ROCm-specific device flags
  2. "Running on AMD ROCm" section — bare-metal vllm serve commands with TP=4 and TP=4+EP configurations
  3. AITER env var reference table — documents VLLM_ROCM_USE_AITER, VLLM_ROCM_USE_AITER_MLA, VLLM_ROCM_USE_AITER_MOE, VLLM_ROCM_QUICK_REDUCE_QUANTIZATION, VLLM_ROCM_USE_AITER_TRITON_ROPE
  4. MEC firmware compatibility note — documents the HSA_NO_SCRATCH_RECLAIM=1 workaround for older AMD drivers
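
The firmware workaround in item 4 is just an environment variable exported before launching the server. A minimal sketch, with the variable name and purpose taken from the PR description:

```shell
# Workaround for MEC firmware incompatibility on older AMD drivers
# (per this PR's compatibility note): disable scratch-memory reclaim.
export HSA_NO_SCRATCH_RECLAIM=1
```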

Test Plan

  • Verify 4×MI355X TP=4 deployment with amd/Kimi-K2.5-MXFP4 using the documented Docker command
  • Verify 8×MI355X TP=4+EP deployment with --enable-expert-parallel
  • Confirm all AITER env vars are recognized by vLLM ROCm v0.16.0
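
Putting the pieces above together, the 4×MI355X Docker invocation would look roughly like the following sketch. The image tag, model name, AITER env vars, and serve flags are quoted from this PR; the `--device`/`--group-add` lines are the usual ROCm container passthrough flags and are an assumption here, not a verbatim quote from the recipe:

```shell
# Sketch of the 4x MI355X (TP=4) Docker deployment described in this PR.
# The --device/--group-add flags stand in for the recipe's
# "ROCm-specific device flags" (assumed standard ROCm passthrough).
docker run -it --rm \
  --network=host \
  --ipc=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --shm-size=16G \
  -p 8000:8000 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e VLLM_ROCM_USE_AITER_MLA=1 \
  -e VLLM_ROCM_USE_AITER_MOE=1 \
  -e VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT8 \
  -e VLLM_ROCM_USE_AITER_TRITON_ROPE=1 \
  vllm/vllm-openai-rocm:v0.16.0 \
  vllm serve amd/Kimi-K2.5-MXFP4 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 \
    --block-size 1 \
    --mm-encoder-tp-mode data \
    --trust-remote-code
```

For the 8×MI355X variant, the PR adds `--enable-expert-parallel` while keeping TP=4.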

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the Kimi-K2.5 usage guide by integrating comprehensive deployment instructions for AMD ROCm (MI355X) GPUs. It provides detailed guidance for running the MXFP4-quantized model, covering both Dockerized and bare-metal vllm serve setups, and introduces optimized AITER acceleration environment variables to enhance performance on AMD hardware.

Highlights

  • AMD ROCm (MI355X) Deployment: Added comprehensive deployment instructions for the Kimi-K2.5 MXFP4-quantized model on AMD ROCm (MI355X) GPUs.
  • Docker Examples: Included Docker examples for both 4xMI355X (TP=4) and 8xMI355X (TP=4 + Expert Parallel) configurations.
  • AITER Acceleration Environment Variables: Documented AITER acceleration environment variables for optimized MLA, MoE, and RoPE kernels on ROCm.
  • Non-Docker Deployment: Introduced a non-Docker 'Running on AMD ROCm' section with vllm serve commands and an AITER environment variable reference table.
  • MEC Firmware Compatibility: Added a note regarding MEC firmware compatibility and a workaround for older AMD drivers to prevent memory reclaim crashes.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request adds a comprehensive guide for deploying the Kimi-K2.5 model on AMD ROCm hardware (MI355X). The instructions cover both Docker and bare-metal setups for 4-GPU and 8-GPU configurations, including necessary environment variables for AITER acceleration. The documentation is clear and well-structured. My review includes suggestions to add missing arguments to the ROCm examples to enable the model's agentic capabilities, aligning them with the existing NVIDIA examples.

Comment thread moonshotai/Kimi-K2.5.md
Comment on lines +75 to +79
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90 \
--block-size 1 \
--mm-encoder-tp-mode data \
--trust-remote-code
Contributor


medium

The ROCm Docker example is missing arguments to enable the agentic features of the Kimi-K2.5 model, such as --tool-call-parser, --reasoning-parser, and --enable-auto-tool-choice. These are present in the NVIDIA examples. Including them will ensure users can leverage the model's full capabilities.

This change should also be applied to the 8-GPU Docker example.

Suggested change
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90 \
--block-size 1 \
--mm-encoder-tp-mode data \
--trust-remote-code
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90 \
--block-size 1 \
--mm-encoder-tp-mode data \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--enable-auto-tool-choice \
--trust-remote-code

Comment thread moonshotai/Kimi-K2.5.md
Comment on lines +142 to +146
vllm serve amd/Kimi-K2.5-MXFP4 -tp 4 \
--gpu-memory-utilization 0.90 \
--block-size 1 \
--mm-encoder-tp-mode data \
--trust-remote-code
Contributor


medium

Similar to the Docker example, the bare-metal vllm serve command for ROCm is missing the agentic feature flags (--tool-call-parser, --reasoning-parser, --enable-auto-tool-choice). Adding them ensures feature parity with the NVIDIA deployment and allows users to access the model's full capabilities.

This change should also be applied to the 8-GPU vllm serve example.

Suggested change
vllm serve amd/Kimi-K2.5-MXFP4 -tp 4 \
--gpu-memory-utilization 0.90 \
--block-size 1 \
--mm-encoder-tp-mode data \
--trust-remote-code
vllm serve amd/Kimi-K2.5-MXFP4 -tp 4 \
--gpu-memory-utilization 0.90 \
--block-size 1 \
--mm-encoder-tp-mode data \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--enable-auto-tool-choice \
--trust-remote-code

@ChuanLi1101
Author

cc @ywang96 @andyluo7 @ChangLiu0709 - This PR adds AMD ROCm (MI355X) deployment instructions for Kimi-K2.5-MXFP4, based on the optimized configurations from SemiAnalysisAI/InferenceX. Would appreciate a review when you have a chance!

Comment thread moonshotai/Kimi-K2.5.md Outdated

## Use vLLM with Docker

Pull the vLLM release image from [Docker Hub](https://hub.docker.com/r/vllm/vllm-openai/tags?name=17.0):


Comment thread moonshotai/Kimi-K2.5.md Outdated

```bash
docker pull vllm/vllm-openai:v0.17.0-cu130 # CUDA 13.0
docker pull vllm/vllm-openai:v0.17.0 # Other CUDA versions
```


We should also add ROCm versions here.

Comment thread moonshotai/Kimi-K2.5.md
```bash
uv pip install vllm --torch-backend auto
```

## Running Kimi-K2.5 with vLLM


Should we insert a subsection title for NV, parallel to the next subsection "### Running on AMD ROCm"?

@ChuanLi1101
Author

Thanks for the review @chunfangamd and @gemini-code-assist! I've pushed a follow-up commit addressing all feedback:

  1. Docker Hub link — Updated to include both CUDA and ROCm links, and added docker pull vllm/vllm-openai-rocm:v0.16.0 to the initial pull section.
  2. NVIDIA subsection title — Added ### Running on NVIDIA parallel to ### Running on AMD ROCm for consistent structure.
  3. Agentic flags — Added --tool-call-parser kimi_k2, --reasoning-parser kimi_k2, and --enable-auto-tool-choice to all four ROCm examples (2 Docker + 2 bare-metal) for feature parity with the NVIDIA examples.
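
For reference, after that follow-up commit the 4-GPU bare-metal example would read roughly as follows — a sketch assembled from the snippets and review suggestions quoted in this thread, not a verbatim copy of the final file:

```shell
# AITER acceleration env vars from the recipe. Note: reviewers point out
# elsewhere in this thread that the first three are enabled by default
# on ROCm builds and could be dropped.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT8
export VLLM_ROCM_USE_AITER_TRITON_ROPE=1

# TP=4 serve command with the agentic flags added for NVIDIA parity.
vllm serve amd/Kimi-K2.5-MXFP4 -tp 4 \
  --gpu-memory-utilization 0.90 \
  --block-size 1 \
  --mm-encoder-tp-mode data \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice \
  --trust-remote-code
```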

@gemini-code-assist
Contributor

Thanks for the update, @ChuanLi1101! It's great to see the improvements based on the feedback. I'll take another look at the changes.

@functionstackx
Contributor

@mgoin can we get a review on this PR?

@ChuanLi1101
Author

Hi @tjtanaa @gshtras — to ensure that the community can benefit from the hard work of documenting the optimal recipe for AMD ROCm (MI355X) deployment, it would be great if you could help review and merge this PR when you get a chance. Thanks!

Comment thread moonshotai/Kimi-K2.5.md

For more usage examples, check out the [vLLM user guide for multimodal models](https://docs.vllm.ai/en/latest/features/multimodal_inputs.html) and the [official Kimi-K2.5 Hugging Face page](https://huggingface.co/moonshotai/Kimi-K2.5)!
# moonshotai/Kimi-K2.5 Usage Guide
[Kimi K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.


@ChuanLi1101 please fix this github history, there shouldn't be this many line changes.

Comment thread moonshotai/Kimi-K2.5.md
-e VLLM_ROCM_USE_AITER_MOE=1 \
-e VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT8 \
-e VLLM_ROCM_USE_AITER_TRITON_ROPE=1 \
vllm/vllm-openai-rocm:v0.16.0 \


We shouldn't pin to a specific version unless necessary. Is this intended?

Comment thread moonshotai/Kimi-K2.5.md
--shm-size=16G \
-p 8000:8000 \
-e VLLM_ROCM_USE_AITER=1 \
-e VLLM_ROCM_USE_AITER_MLA=1 \


This is enabled by default, we can remove it.

Comment thread moonshotai/Kimi-K2.5.md
-p 8000:8000 \
-e VLLM_ROCM_USE_AITER=1 \
-e VLLM_ROCM_USE_AITER_MLA=1 \
-e VLLM_ROCM_USE_AITER_MOE=1 \


This is enabled by default, we can remove it.

Comment thread moonshotai/Kimi-K2.5.md
--shm-size=16G \
-p 8000:8000 \
-e VLLM_ROCM_USE_AITER=1 \
-e VLLM_ROCM_USE_AITER_MLA=1 \


This is enabled by default, we can remove it.

Comment thread moonshotai/Kimi-K2.5.md
-p 8000:8000 \
-e VLLM_ROCM_USE_AITER=1 \
-e VLLM_ROCM_USE_AITER_MLA=1 \
-e VLLM_ROCM_USE_AITER_MOE=1 \


This is enabled by default, we can remove it.

Comment thread moonshotai/Kimi-K2.5.md
-e VLLM_ROCM_USE_AITER_MOE=1 \
-e VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT8 \
-e VLLM_ROCM_USE_AITER_TRITON_ROPE=1 \
vllm/vllm-openai-rocm:v0.16.0 \


We shouldn't pin to a specific version unless necessary. Is this intended?

Comment thread moonshotai/Kimi-K2.5.md

```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=1
```


This is enabled by default, we can remove it.

Comment thread moonshotai/Kimi-K2.5.md
```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=1
export VLLM_ROCM_USE_AITER_MOE=1
```


This is enabled by default, we can remove it.
