[Feat][Plugin] Enable Sparse MLA and GLM-5 for vLLM OOT Plugin #399
kliuae-amd wants to merge 10 commits into main from
Conversation
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@kliuae could you please help fix the CI issue?
We also register each instance in vLLM's static_forward_context using
the same prefix convention as other attention layers (the prefix
parameter passed at construction, e.g. 'model.layers.0...k_cache').
"""
Thanks for the detailed comments here. I think they're very useful for developers diving deep into the code.
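For context on the docstring being discussed, here is a hedged sketch of what per-layer registration into a shared static forward context could look like. All names here (`ForwardContext`, `register_layer`, `SparseMLAAttention`) are illustrative stand-ins, not vLLM's actual API.

```python
# Illustrative sketch only: how attention layers might register themselves
# under a prefix convention into a shared static forward context.
class ForwardContext:
    """Maps layer prefixes (e.g. 'model.layers.0.self_attn') to layer objects."""

    def __init__(self):
        self.static_forward_context = {}

    def register_layer(self, prefix, layer):
        # Each prefix must be unique; the context is the lookup table the
        # attention backend uses to find per-layer state at forward time.
        if prefix in self.static_forward_context:
            raise ValueError(f"duplicate layer prefix: {prefix}")
        self.static_forward_context[prefix] = layer


class SparseMLAAttention:
    def __init__(self, prefix, ctx):
        self.prefix = prefix
        # Register under the same prefix convention as other attention layers.
        ctx.register_layer(prefix, self)


ctx = ForwardContext()
layers = [SparseMLAAttention(f"model.layers.{i}.self_attn", ctx) for i in range(2)]
print(sorted(ctx.static_forward_context))
```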
atom/plugin/vllm/register.py
"Glm4MoeForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER,
"Qwen3_5ForConditionalGeneration": "atom.models.qwen3_5:Qwen3_5ForConditionalGeneration",
"Qwen3_5MoeForConditionalGeneration": "atom.models.qwen3_5:Qwen3_5MoeForConditionalGeneration",
"GlmMoeDsaForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER,
Can we put the `"GlmMoeDsaForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER,` entry right after `"Glm4MoeForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER,`?
# The kernel operates on non-padded inputs. Hence, we pre-compile the
# triton kernel to avoid runtime compilation for unseen batch sizes.
# Pre-compile for batch sizes 1 to 1024 to cover most use cases.
# On DS-R1, this step adds roughly 50s to the model loading time.
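The warm-up idea in the comment above can be sketched as follows. `compile_for_batch` is a stand-in for a real Triton JIT compilation call, not the PR's actual code; the shape of the loop is the point.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def compile_for_batch(batch_size: int):
    # In the real plugin this would trigger a Triton kernel compilation
    # specialized to `batch_size`; here we just return a marker string.
    return f"kernel<bs={batch_size}>"


def warm_up(max_batch: int = 1024):
    # Pre-compile for batch sizes 1..max_batch at load time, trading some
    # extra loading time (~50s on DS-R1 per the comment above) for no
    # compilation stalls at serving time.
    for bs in range(1, max_batch + 1):
        compile_for_batch(bs)


warm_up(1024)
# Serving-time lookups for any batch size in range are now cache hits.
print(compile_for_batch.cache_info().currsize)
```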
I think it's a good idea here. How about the other triton kernels?
cc @valarLip @ZhangLirong-amd @ganyi1996ppo @zejunchen-zejun Can you help comment on this feature?
vLLM mainline precompiles the BMM kernel with different M values before executing the model; I think we can leverage Kuanfu's code here.
Hi, @kliuae-amd
Motivation
Following #126, this PR enables sparse MLA in ATOM's vLLM plugin mode, adding support for GLM-5 models that use index-based top-k sparse attention.
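For readers unfamiliar with the mechanism, here is a hedged NumPy sketch of index-based top-k sparse attention: an indexer scores all keys, keeps only the top-k indices per query, and full attention runs on just that subset. Shapes, the scoring function, and all names are illustrative, not GLM-5's actual design.

```python
import numpy as np


def topk_sparse_attention(q, k, v, topk):
    # q: (d,); k, v: (seq, d). The indexer scores every key cheaply...
    scores = k @ q
    # ...then only the top-k key indices are kept for attention.
    idx = np.argsort(scores)[-topk:]
    k_sel, v_sel = k[idx], v[idx]          # gather the sparse subset
    # Standard scaled-dot-product attention over the selected keys only.
    logits = k_sel @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v_sel, np.sort(idx)


rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal((32, 8))
v = rng.standard_normal((32, 8))
out, idx = topk_sparse_attention(q, k, v, topk=4)
print(out.shape, len(idx))
```

The payoff is that attention cost scales with `topk` rather than the full sequence length, which is what makes long-context serving of such models tractable.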
Technical Details
Test Plan
Accuracy test with lm_eval
Server command:
Test Result
lm_eval command
Model: zai-org/GLM-5-FP8
ATOM
vLLM Plugin (bf16 kv cache)
vLLM Plugin (fp8 kv cache)
Performance on MI300X, TP8
Submission Checklist