
[Feat][Plugin] Enable Sparse MLA and GLM-5 for vLLM OOT Plugin#399

Open
kliuae-amd wants to merge 10 commits into main from plugin_sparse_mla

Conversation

@kliuae-amd
Contributor

Motivation

Following #126, this PR enables sparse MLA in ATOM's vLLM plugin mode, adding support for GLM-5 models, which use index-based top-k sparse attention.

Technical Details

  • Add Indexer and Sparse MLA backends for vLLM OOT plugin
  • Register GLM-5 models as supported
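
The index-based top-k sparse attention mentioned above can be sketched with a minimal NumPy example (all names here are hypothetical; the real indexer and attention kernels live in the ATOM/vLLM plugin code): an indexer scores every cached KV position per query, and only the top-k indices are handed to the sparse MLA kernel.

```python
import numpy as np

def topk_sparse_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Select, per query row, the indices of the k highest-scoring
    KV-cache positions (hypothetical stand-in for the model's indexer)."""
    k = min(k, scores.shape[-1])
    # argpartition finds an unordered top-k in O(n); then sort the k winners
    # so the highest-scoring position comes first.
    part = np.argpartition(scores, -k, axis=-1)[..., -k:]
    order = np.argsort(np.take_along_axis(scores, part, axis=-1), axis=-1)[..., ::-1]
    return np.take_along_axis(part, order, axis=-1)

scores = np.array([[0.1, 0.9, 0.3, 0.7]])  # one query, four KV positions
print(topk_sparse_indices(scores, 2))      # -> [[1 3]]
```

The attention kernel then only reads the selected KV-cache blocks, which is why the serving command below uses `--block-size 1`.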

Test Plan

Accuracy test with lm_eval

Server command:

ATOM_DISABLE_VLLM_PLUGIN=0 \
ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=0 \
VLLM_LOGGING_LEVEL=DEBUG \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
vllm serve /path/to/GLM-5-FP8/ \
  -tp 8 \
  --max-num-seqs 1024 \
  --gpu-memory-utilization 0.9 \
  --no-enable-prefix-caching \
  --disable-uvicorn-access-log \
  --trust-remote-code \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --kv-cache-dtype {auto,fp8} \
  --block-size 1

Test Result

lm_eval command

lm_eval --model local-completions \
  --model_args model=/path/to/GLM-5-FP8/,base_url=http://localhost:8000/v1/completions \
  --batch_size 100 \
  --tasks gsm8k \
  --num_fewshot 20

Model: zai-org/GLM-5-FP8

ATOM

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.9515|±  |0.0059|

vLLM Plugin (bf16 kv cache)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.953|±  |0.0058|
|     |       |strict-match    |    20|exact_match|↑  |0.953|±  |0.0058|

vLLM Plugin (fp8 kv cache)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9439|±  |0.0063|
|     |       |strict-match    |    20|exact_match|↑  |0.9439|±  |0.0063|

Performance on MI300X, TP8

|ISL/OSL|Concurrency|KV Cache|vLLM Plugin Req/s|ATOM Req/s|vLLM Plugin over ATOM (Req/s)|vLLM Plugin Total tok/s|ATOM Total tok/s|vLLM Plugin over ATOM (tok/s)|
|-------|----------:|--------|----------------:|---------:|----------------------------:|----------------------:|---------------:|----------------------------:|
|1k/1k|128|bf16|2.06|2.02|+1.98%|4224.31|4137.09|+2.11%|
|1k/1k|64|bf16|1.40|1.36|+2.94%|2874.85|2784.97|+3.23%|
|1k/1k|128|fp8|2.14|2.23|-4.04%|4383.43|4568.58|-4.05%|
|1k/1k|64|fp8|1.44|1.43|+0.70%|2938.97|2935.14|+0.13%|

Submission Checklist

kliuae added 8 commits March 20, 2026 07:01
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@wuhuikx
Contributor

wuhuikx commented Mar 25, 2026

@kliuae could you please help fix the CI issue?

We also register each instance in vLLM's static_forward_context using
the same prefix convention as other attention layers (the prefix
parameter passed at construction, e.g. 'model.layers.0...k_cache').
"""
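
The registration convention described in that docstring can be sketched as a stand-alone mock (simplified and hypothetical: the real registry is vLLM's `static_forward_context` on the compilation config, and both the class name and the prefix string below are illustrative, not the actual ones from this PR):

```python
# Simplified mock of vLLM's static_forward_context registration:
# each attention-like layer stores itself under the prefix it was
# constructed with, so the model runner can later look layers up by name.
static_forward_context: dict[str, object] = {}

class IndexerKCache:  # hypothetical layer class
    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        # Same convention as other attention layers: the key is the
        # prefix parameter passed at construction time.
        static_forward_context[prefix] = self

# Hypothetical example prefix, following the per-layer naming convention.
layer = IndexerKCache("model.layers.0.example.k_cache")
print(static_forward_context["model.layers.0.example.k_cache"] is layer)  # -> True
```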
Contributor

Thanks for the detailed comments here. I think they're very useful for developers diving deep into the code.

"Glm4MoeForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER,
"Qwen3_5ForConditionalGeneration": "atom.models.qwen3_5:Qwen3_5ForConditionalGeneration",
"Qwen3_5MoeForConditionalGeneration": "atom.models.qwen3_5:Qwen3_5MoeForConditionalGeneration",
"GlmMoeDsaForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER,
Contributor

Can we put `"GlmMoeDsaForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER,` right after `"Glm4MoeForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER,`?

@wuhuikx wuhuikx requested a review from ganyi1996ppo March 25, 2026 05:48
# The kernel operates on non-padded inputs, so we pre-compile the
# Triton kernel to avoid runtime compilation for unseen batch sizes.
# Pre-compile for batch sizes 1 to 1024 to cover most use cases.
# On DS-R1, this step adds roughly 50s to the model loading time.
Contributor

I think it's a good idea here. How about the other Triton kernels?

cc @valarLip @ZhangLirong-amd @ganyi1996ppo @zejunchen-zejun Can you help comment on this feature?

Contributor

vLLM mainline pre-compiles the BMM kernel with different M values before executing the model; I think we can leverage Kuanfu's code here.

kliuae added 2 commits March 25, 2026 09:05
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@zejunchen-zejun
Contributor

Hi @kliuae-amd,
Wonderful work! Could we have a recipe for GLM-5 OOT?
