
[Feat][Plugin] Enable Sparse MLA and GLM-5 for vLLM OOT Plugin#399

Open
kliuae-amd wants to merge 10 commits into main from plugin_sparse_mla

Conversation

@kliuae-amd
Contributor

Motivation

Following #126, this PR enables sparse MLA in ATOM's vLLM plugin mode, adding support for GLM-5 models, which use index-based top-k sparse attention.

Technical Details

  • Add Indexer and Sparse MLA backends for vLLM OOT plugin
  • Register GLM-5 models as supported
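
The index-based top-k sparse attention mentioned above can be sketched with a minimal NumPy example (all names here are hypothetical; the real indexer and attention kernels live in the ATOM/vLLM plugin code): an indexer scores every cached KV position per query, and only the top-k indices are handed to the sparse MLA kernel.

```python
import numpy as np

def topk_sparse_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Select, per query row, the indices of the k highest-scoring
    KV-cache positions (hypothetical stand-in for the model's indexer)."""
    k = min(k, scores.shape[-1])
    # argpartition finds an unordered top-k in O(n); then sort the k winners
    # so the highest-scoring position comes first.
    part = np.argpartition(scores, -k, axis=-1)[..., -k:]
    order = np.argsort(np.take_along_axis(scores, part, axis=-1), axis=-1)[..., ::-1]
    return np.take_along_axis(part, order, axis=-1)

scores = np.array([[0.1, 0.9, 0.3, 0.7]])  # one query, four KV positions
print(topk_sparse_indices(scores, 2))      # -> [[1 3]]
```

The attention kernel then only reads the selected KV-cache blocks, which is why the serving command below uses `--block-size 1`.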

Test Plan

Accuracy test with lm_eval

Server command:

ATOM_DISABLE_VLLM_PLUGIN=0 \
ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=0 \
VLLM_LOGGING_LEVEL=DEBUG \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
vllm serve /path/to/GLM-5-FP8/ \
  -tp 8 \
  --max-num-seqs 1024 \
  --gpu-memory-utilization 0.9 \
  --no-enable-prefix-caching \
  --disable-uvicorn-access-log \
  --trust-remote-code \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --kv-cache-dtype {auto,fp8} \
  --block-size 1

Test Result

lm_eval command

lm_eval --model local-completions \
  --model_args model=/path/to/GLM-5-FP8/,base_url=http://localhost:8000/v1/completions \
  --batch_size 100 \
  --tasks gsm8k \
  --num_fewshot 20

Model: zai-org/GLM-5-FP8

ATOM

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.9515|±  |0.0059|

vLLM Plugin (bf16 kv cache)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.953|±  |0.0058|
|     |       |strict-match    |    20|exact_match|↑  |0.953|±  |0.0058|

vLLM Plugin (fp8 kv cache)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9439|±  |0.0063|
|     |       |strict-match    |    20|exact_match|↑  |0.9439|±  |0.0063|

Performance on MI300X, TP8

|ISL/OSL|Concurrency|KV Cache|vLLM Plugin Req/s|ATOM Req/s|vLLM Plugin over ATOM (Req/s)|vLLM Plugin Total tok/s|ATOM Total tok/s|vLLM Plugin over ATOM (tok/s)|
|-------|----------:|--------|----------------:|---------:|----------------------------:|----------------------:|---------------:|----------------------------:|
|1k/1k|128|bf16|2.06|2.02|+1.98%|4224.31|4137.09|+2.11%|
|1k/1k|64|bf16|1.40|1.36|+2.94%|2874.85|2784.97|+3.23%|
|1k/1k|128|fp8|2.14|2.23|-4.04%|4383.43|4568.58|-4.05%|
|1k/1k|64|fp8|1.44|1.43|+0.70%|2938.97|2935.14|+0.13%|

Submission Checklist

kliuae added 8 commits March 20, 2026 07:01
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@wuhuikx
Contributor

wuhuikx commented Mar 25, 2026

@kliuae could you please help fix the CI issue?

We also register each instance in vLLM's static_forward_context using
the same prefix convention as other attention layers (the prefix
parameter passed at construction, e.g. 'model.layers.0...k_cache').
"""
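
The registration convention described in that docstring can be sketched as a stand-alone mock (simplified and hypothetical: the real registry is vLLM's `static_forward_context` on the compilation config, and both the class name and the prefix string below are illustrative, not the actual ones from this PR):

```python
# Simplified mock of vLLM's static_forward_context registration:
# each attention-like layer stores itself under the prefix it was
# constructed with, so the model runner can later look layers up by name.
static_forward_context: dict[str, object] = {}

class IndexerKCache:  # hypothetical layer class
    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        # Same convention as other attention layers: the key is the
        # prefix parameter passed at construction time.
        static_forward_context[prefix] = self

# Hypothetical example prefix, following the per-layer naming convention.
layer = IndexerKCache("model.layers.0.example.k_cache")
print(static_forward_context["model.layers.0.example.k_cache"] is layer)  # -> True
```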
Contributor

Thanks for the detailed comments here. I think they're very useful for developers diving deep into the code.

"Glm4MoeForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER,
"Qwen3_5ForConditionalGeneration": "atom.models.qwen3_5:Qwen3_5ForConditionalGeneration",
"Qwen3_5MoeForConditionalGeneration": "atom.models.qwen3_5:Qwen3_5MoeForConditionalGeneration",
"GlmMoeDsaForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER,
Contributor

Can we put `"GlmMoeDsaForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER,` right after `"Glm4MoeForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER,`?

@wuhuikx wuhuikx requested a review from ganyi1996ppo March 25, 2026 05:48
# The kernel operates on non-padded inputs, so we pre-compile the
# Triton kernel to avoid runtime compilation for unseen batch sizes.
# Pre-compile for batch sizes 1 to 1024 to cover most use cases.
# On DS-R1, this step adds roughly 50s to the model loading time.
Contributor

I think it's a good idea here. How about the other Triton kernels?

cc @valarLip @ZhangLirong-amd @ganyi1996ppo @zejunchen-zejun Can you help comment on this feature?

Contributor

vLLM mainline pre-compiles the BMM kernel with different M values before executing the model; I think we can leverage Kuanfu's code here.

kliuae added 2 commits March 25, 2026 09:05
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@zejunchen-zejun
Contributor

Hi @kliuae-amd,
Wonderful work! Could we have a recipe for GLM-5 OOT?
