[plugin][MLA] optimize MLA metadata build and remove D2D copy #387
zejunchen-zejun wants to merge 7 commits into main from
Conversation
Pull request overview
This PR optimizes MLA (Multi-head Latent Attention) execution in vLLM plugin mode by reducing per-step metadata overhead and avoiding unnecessary device-to-device copies, with a new env toggle to control persistent decode metadata behavior.
Changes:
- Add `ATOM_USE_PERSISTENT_MLA_DECODE_METADATA` env flag to enable/disable persistent MLA decode metadata buffers.
- Thread runtime `positions` through `forward_context` so MLA decode can consume them without extra D2D copies.
- Optimize MLA decode scheduling by generating `paged_kv_indices` via a Triton kernel, using in-place `cumsum`/`arange`, and reusing decode buffers/imports.
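For reference, the work the Triton kernel does can be sketched on the CPU: for each request, copy its valid block-table entries into a flat index array at the offset recorded in the indptr. The function and argument names below are illustrative, not taken from the PR, and this plain-Python version only mirrors the semantics, not the parallel implementation.

```python
def build_paged_kv_indices(block_table, paged_kv_indptr):
    """CPU sketch of paged-KV index generation.

    block_table: per-request rows of physical KV block ids (padded).
    paged_kv_indptr: cumulative count of valid blocks per request,
    e.g. [0, 2, 3] means request 0 uses 2 blocks, request 1 uses 1.
    """
    total = paged_kv_indptr[-1]
    paged_kv_indices = [0] * total
    for req, row in enumerate(block_table):
        start = paged_kv_indptr[req]
        end = paged_kv_indptr[req + 1]
        # Keep only the valid prefix of this request's block-table row.
        paged_kv_indices[start:end] = row[: end - start]
    return paged_kv_indices

# Two requests: the first uses 2 KV blocks, the second uses 1.
print(build_paged_kv_indices([[10, 11, 0, 0], [20, 0, 0, 0]], [0, 2, 3]))
# → [10, 11, 20]
```

Doing this in a single kernel launch on device avoids building the index list on the host and transferring it every decode step.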
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `atom/utils/envs.py` | Adds env toggle for persistent MLA decode metadata. |
| `atom/plugin/vllm/model_wrapper.py` | Writes runtime positions into the forward context (fallback to static context). |
| `atom/plugin/attention_mla.py` | Reuses decode buffers, caches vLLM imports, and switches decode path selection based on persistent metadata enablement. |
| `atom/plugin/attention.py` | Optimizes MLA decode metadata build (in-place indptr, Triton kv index generation, optional persistent worker buffers). |
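The positions change in `model_wrapper.py` can be sketched as follows: the model wrapper stores a reference to the live positions tensor in the shared forward context, and the MLA decode path reads that reference back instead of issuing a device-to-device copy into its own buffer. All class and function names here are assumptions for illustration; a plain list stands in for a device tensor.

```python
class ForwardContext:
    """Minimal stand-in for the per-step forward context object."""

    def __init__(self):
        self.positions = None  # populated each step by the model wrapper


def set_forward_context_positions(ctx, positions):
    # Store a reference to the runtime positions; no device copy happens.
    ctx.positions = positions


def get_decode_positions(ctx):
    # MLA decode consumes the same object directly, avoiding a D2D copy.
    return ctx.positions


ctx = ForwardContext()
positions = [0, 1, 2, 3]  # stands in for a device tensor
set_forward_context_positions(ctx, positions)
assert get_decode_positions(ctx) is positions  # same object, nothing copied
```

The fallback to a static context mentioned in the table would apply when no per-step forward context is available.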
D2D copy and use kv_indices_generate_triton Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
avoid a D2D copy Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Hi @ZhangLirong-amd and @XiaobingSuper, could you please take a look at this PR? It has performance benefits on DS and Kimi. Even though the code changes are small, they sit in the MLA critical path, so a careful review would be much appreciated.
for kimi-k2 Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
```python
max_qo_len = (
    (query_start_loc_cpu[-1] - query_start_loc_cpu[-2]).item()
    if query_start_loc_cpu.numel() > 1
    else 1
)
```
`max_qo_len` is now computed using only the last segment length (`query_start_loc_cpu[-1] - query_start_loc_cpu[-2]`), which underestimates the true max when per-request decode query lengths vary (i.e., when `num_decode_tokens != num_reqs`). This can lead to an incorrect `max_qo_len` passed into `mla_decode_fwd` and potential workspace/shape issues. Compute `max_qo_len` as the maximum of all per-request query lengths (e.g., based on `query_start_loc_cpu[1:] - query_start_loc_cpu[:-1]`) instead of just the last one.
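The suggested fix can be sketched in plain Python, taking the maximum over all per-request query lengths rather than only the last segment. Here `query_start_loc` is illustrative CPU data standing in for `query_start_loc_cpu`:

```python
def max_qo_len(query_start_loc):
    """query_start_loc holds cumulative query offsets, one entry per
    request boundary, e.g. [0, 4, 5, 7] for three requests."""
    if len(query_start_loc) <= 1:
        return 1
    # Max over ALL per-request lengths, not just the last segment.
    return max(b - a for a, b in zip(query_start_loc, query_start_loc[1:]))


print(max_qo_len([0, 4, 5, 7]))  # → 4 (last-segment-only would return 2)
```

With torch tensors, the equivalent would be the `.max()` of `query_start_loc_cpu[1:] - query_start_loc_cpu[:-1]`, which matches the single-segment version whenever every decode request contributes exactly one token.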
This PR improves MLA decode performance in ATOM's vLLM plugin mode by reducing per-step metadata work, using direct KV index generation, and removing a redundant positions device-to-device copy.
For performance:
For accuracy: