UPSTREAM PR #21315: ggml-zendnn : add MUL_MAT_ID op support for MoE models#1329
- Add `MUL_MAT_ID` op acceleration for Mixture-of-Experts models
- Fall back to the CPU backend for `MUL_MAT_ID` when total experts > 32
- Point the ZenDNN lib to the latest bits (ZenDNN-2026-WW13)
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
No meaningful performance changes were detected across 123,999 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libllama.so, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.libggml-cpu.so, build.bin.libggml-base.so, build.bin.libggml.so. 💬 Questions? Tag @loci-dev
Note
Source pull request: ggml-org/llama.cpp#21315
This PR adds support for the `MUL_MAT_ID` op in the ZenDNN backend. The `MUL_MAT_ID` op is used in Mixture-of-Experts (MoE) models to perform matrix multiplication with expert selection based on input IDs.

- Accelerates the `MUL_MAT_ID` op for MoE models using ZenDNN.
- Updates the ZenDNN library to the latest bits (ZenDNN-2026-WW13).

### Test Configuration
- `ZENDNNL_MATMUL_ALGO=1` (Blocked AOCL BLIS)

### Benchmark Results
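As context for the results below, what a `MUL_MAT_ID`-style op computes can be sketched in a few lines. This is an illustrative NumPy model only, not the ggml or ZenDNN implementation; the function name, shapes, and the toy routing data are all assumptions.

```python
# Minimal sketch of a MUL_MAT_ID-style op: each token's activation is
# multiplied by the weight matrices of the experts its router selected.
# Illustrative only -- not the ggml/ZenDNN implementation.
import numpy as np

def mul_mat_id(experts, ids, x):
    """Per-token matmul against router-selected expert weights.

    experts: (n_expert, n_out, n_in)  stacked expert weight matrices
    ids:     (n_tokens, n_used)       expert indices chosen by the router
    x:       (n_tokens, n_in)         token activations
    returns: (n_tokens, n_used, n_out)
    """
    n_tokens, n_used = ids.shape
    n_out = experts.shape[1]
    out = np.zeros((n_tokens, n_used, n_out), dtype=x.dtype)
    for t in range(n_tokens):
        for u in range(n_used):
            # Select the expert's weight matrix by ID, then matmul.
            out[t, u] = experts[ids[t, u]] @ x[t]
    return out

rng = np.random.default_rng(0)
experts = rng.standard_normal((4, 3, 5)).astype(np.float32)  # 4 experts
x = rng.standard_normal((2, 5)).astype(np.float32)           # 2 tokens
ids = np.array([[0, 2], [1, 3]])                             # top-2 routing
y = mul_mat_id(experts, ids, x)
print(y.shape)  # (2, 2, 3)
```

Per the change list above, the ZenDNN path handles this op only up to a total expert count of 32; larger expert counts fall back to the CPU backend.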
#### Optimal `ubatch` Selection - CPU Backend (ggml-cpu)

The heatmaps below show prompt-processing speed (t/s) across all valid (prompt length, ubatch) combinations for the CPU backend. The optimal ubatch for the CPU backend varies by model: `512` for gpt-oss-20B, `768` for Mixtral-8x7B and Phi-3.5-MoE. For ZenDNN, `4096` is consistently the best ubatch across all three models.

gpt-oss-20B · Best CPU ubatch: 512

Mixtral-8x7B · Best CPU ubatch: 768

Phi-3.5-MoE · Best CPU ubatch: 768

#### `llama-bench` - Prompt Processing Throughput (tokens/sec)

Mixtral-8x7B-Instruct-v0.1 - 86.99 GiB · 46.70B params
Phi-3.5-MoE-instruct - 78.00 GiB · 41.87B params
unsloth/gpt-oss-20b - 38.97 GiB · 20.91B params
#### `llama-batched-bench` - PP + TG Performance

Mixtral-8x7B-Instruct-v0.1
Phi-3.5-MoE-instruct
unsloth/gpt-oss-20b
cc: @amukho @avinashcpandey @taronaeo @danbev
AI usage disclosure: AI assistance was used only for the graph-generation scripts and for text refactoring.