
UPSTREAM PR #21315: ggml-zendnn : add MUL_MAT_ID op support for MoE models#1329

Open
loci-dev wants to merge 2 commits into main from loci/pr-21315-ggml-zendnn-mul-mat-id-support

Conversation

@loci-dev loci-dev commented Apr 3, 2026

Note

Source pull request: ggml-org/llama.cpp#21315

This PR adds support for the MUL_MAT_ID op in the ZenDNN backend. The MUL_MAT_ID op is used in Mixture-of-Experts (MoE) models to perform matrix multiplication with expert selection based on input IDs.
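As a rough illustration of the op's semantics, here is a minimal C sketch (the function and parameter names are ours for illustration, not ggml's actual tensor API): for each token, the expert ID selects which expert weight matrix multiplies that token's activation.

```c
#include <assert.h>
#include <stddef.h>

/* Reference semantics of MUL_MAT_ID (illustrative only):
 * for each token t, dst[t] = W[ids[t]] * src[t], where each
 * expert weight W[e] is an n_out x n_in row-major matrix. */
void mul_mat_id_ref(const float *W,   /* [n_expert][n_out][n_in] */
                    const float *src, /* [n_tok][n_in]           */
                    const int   *ids, /* [n_tok] expert per token */
                    float       *dst, /* [n_tok][n_out]          */
                    int n_expert, int n_out, int n_in, int n_tok) {
    (void)n_expert; /* bound implied by valid ids[] */
    for (int t = 0; t < n_tok; ++t) {
        /* pick the weight matrix of the selected expert */
        const float *We = W + (size_t)ids[t] * n_out * n_in;
        for (int o = 0; o < n_out; ++o) {
            float acc = 0.0f;
            for (int i = 0; i < n_in; ++i)
                acc += We[(size_t)o * n_in + i] * src[(size_t)t * n_in + i];
            dst[(size_t)t * n_out + o] = acc;
        }
    }
}
```

In an MoE layer this runs once per selected expert slot, which is why accelerating it dominates prompt-processing throughput for these models.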

  • Accelerates the MUL_MAT_ID op for MoE models using ZenDNN.
  • Falls back to the CPU backend when the total number of experts exceeds 32, as ZenDNN may not (for now) be efficient in these scenarios.
  • Updates the ZenDNN library to the latest release (ZenDNN-2026-WW13).
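The fallback condition can be pictured as a simple op-support predicate (a hypothetical sketch; the actual check lives in the ggml-zendnn backend's op-support callback and the names below are ours, not the PR's):

```c
/* Hypothetical sketch of the >32-expert fallback described in this PR:
 * when the predicate returns 0, ggml routes MUL_MAT_ID to the CPU
 * backend instead of ZenDNN. Threshold taken from the PR description. */
#define ZENDNN_MAX_EXPERTS 32

int zendnn_supports_mul_mat_id(int n_expert) {
    return n_expert <= ZENDNN_MAX_EXPERTS;
}
```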

Test Configuration

  • Hardware: AMD EPYC 9004 Series (Zen 4)
  • Threads: 96
  • Tools: llama-bench and llama-batched-bench
  • Precision: BF16
  • k/v cache: bf16
  • llama.cpp version: 8690
  • ZenDNN version: 1.0.0 (ZenDNN-2026-WW13)
  • Environment: ZENDNNL_MATMUL_ALGO=1 (Blocked AOCL BLIS)

Benchmark Results

Optimal ubatch Selection - CPU Backend (ggml-cpu)

The heatmaps below show prompt-processing speed (t/s) across all valid (prompt length, ubatch) combinations for the CPU backend.
The optimal ubatch for the CPU backend varies by model: 512 for gpt-oss-20B, and 768 for both Mixtral-8x7B and Phi-3.5-MoE. For ZenDNN, 4096 is consistently the best ubatch across all three models.

gpt-oss-20B · Best CPU ubatch: 512
[heatmap image: heatmap_CPU_gpt-oss-20B-BF16 gguf_96t]

Mixtral-8x7B · Best CPU ubatch: 768
[heatmap image: heatmap_CPU_Mixtral-8x7B-Instruct-v0 1-BF16 gguf_96t]

Phi-3.5-MoE · Best CPU ubatch: 768
[heatmap image: heatmap_CPU_Phi-3 5-MoE-16x4 1B-instruct-BF16 gguf_96t]

llama-bench - Prompt Processing Throughput (tokens/sec)

Config: 96 threads · BF16 · bf16 k/v cache
CPU numbers use the best ubatch per model (512 / 768 / 768). ZenDNN numbers use ubatch=4096.

Mixtral-8x7B-Instruct-v0.1 - 86.99 GiB · 46.70B params

| Test | CPU (t/s) | ZenDNN (t/s) | Speedup |
|---|---|---|---|
| pp128 | 171.30 ± 0.36 | 162.33 ± 0.19 | 0.95× |
| pp256 | 181.34 ± 0.10 | 235.79 ± 0.88 | 1.30× |
| pp384 | 193.57 ± 0.06 | 276.02 ± 0.84 | 1.43× |
| pp512 | 184.60 ± 0.06 | 308.42 ± 0.22 | 1.67× |
| pp768 | 185.62 ± 0.12 | 360.08 ± 0.43 | 1.94× |
| pp1024 | 181.51 ± 0.08 | 374.11 ± 0.48 | 2.06× |
| pp1536 | 180.46 ± 0.13 | 392.31 ± 0.53 | 2.17× |
| pp2048 | 177.43 ± 0.07 | 411.17 ± 0.28 | 2.32× |
| pp3072 | 171.63 ± 0.02 | 413.27 ± 0.07 | 2.41× |
| pp4096 | 166.41 ± 0.03 | 393.64 ± 0.48 | 2.37× |

Phi-3.5-MoE-instruct - 78.00 GiB · 41.87B params

| Test | CPU (t/s) | ZenDNN (t/s) | Speedup |
|---|---|---|---|
| pp128 | 282.71 ± 0.82 | 210.67 ± 11.31 | 0.74× |
| pp256 | 347.71 ± 0.40 | 298.33 ± 3.35 | 0.86× |
| pp384 | 359.16 ± 0.63 | 396.00 ± 4.39 | 1.10× |
| pp512 | 371.69 ± 0.26 | 436.85 ± 2.89 | 1.18× |
| pp768 | 375.51 ± 0.68 | 512.55 ± 4.00 | 1.37× |
| pp1024 | 360.19 ± 0.39 | 536.58 ± 7.34 | 1.49× |
| pp1536 | 356.10 ± 0.48 | 561.33 ± 3.22 | 1.58× |
| pp2048 | 344.96 ± 0.62 | 583.05 ± 1.22 | 1.69× |
| pp3072 | 323.80 ± 1.30 | 600.95 ± 3.63 | 1.86× |
| pp4096 | 307.02 ± 0.08 | 556.82 ± 9.53 | 1.81× |

unsloth/gpt-oss-20b - 38.97 GiB · 20.91B params

| Test | CPU (t/s) | ZenDNN (t/s) | Speedup |
|---|---|---|---|
| pp128 | 471.56 ± 2.90 | 284.14 ± 6.81 | 0.60× |
| pp256 | 589.39 ± 1.80 | 426.90 ± 7.60 | 0.72× |
| pp384 | 599.90 ± 1.32 | 512.07 ± 5.06 | 0.85× |
| pp512 | 618.43 ± 0.38 | 563.97 ± 1.11 | 0.91× |
| pp768 | 587.94 ± 0.81 | 633.89 ± 0.87 | 1.08× |
| pp1024 | 577.87 ± 0.54 | 609.45 ± 1.67 | 1.05× |
| pp1536 | 553.02 ± 0.35 | 692.40 ± 0.63 | 1.25× |
| pp2048 | 533.92 ± 0.38 | 764.32 ± 1.61 | 1.43× |
| pp3072 | 498.19 ± 0.50 | 672.33 ± 37.24 | 1.35× |
| pp4096 | 467.38 ± 5.60 | 579.32 ± 31.03 | 1.24× |

llama-batched-bench - PP + TG Performance

Configuration: 96 threads · BF16 compute · BF16 KV cache
Prompt length: 512 tokens
Generation length: 128 tokens
Context length per request: 512 + 128 = 640 tokens

Total context size (batch-wise):

  • Batch 16 -> 640 × 16 = 10,240 tokens
  • Batch 32 -> 640 × 32 = 20,480 tokens
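The batch-wise totals above follow from a per-request product; a trivial C helper (our own, not a llama.cpp API) makes the arithmetic explicit:

```c
/* Total KV-context tokens for a batch of identical requests:
 * each request holds n_prompt prompt tokens plus n_gen generated
 * tokens, and the batch multiplies that per-request context. */
int total_ctx_tokens(int n_prompt, int n_gen, int n_batch) {
    return (n_prompt + n_gen) * n_batch;
}
```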

Mixtral-8x7B-Instruct-v0.1

| Batch | CPU uBatch | ZenDNN uBatch | CPU PP (t/s) | ZenDNN PP (t/s) | CPU TG (t/s) | ZenDNN TG (t/s) | PP Speedup |
|---|---|---|---|---|---|---|---|
| 16 | 4096 | 4096 | 194.48 | 468.08 | 56.48 | 42.87 | 2.41× |
| 32 | 4096 | 8192 | 199.08 | 515.11 | 92.38 | 69.00 | 2.59× |

Phi-3.5-MoE-instruct

| Batch | CPU uBatch | ZenDNN uBatch | CPU PP (t/s) | ZenDNN PP (t/s) | CPU TG (t/s) | ZenDNN TG (t/s) | PP Speedup |
|---|---|---|---|---|---|---|---|
| 16 | 1024 | 8192 | 412.05 | 689.85 | 108.46 | 65.02 | 1.67× |
| 32 | 1024 | 8192 | 415.51 | 825.44 | 163.35 | 98.43 | 1.99× |

unsloth/gpt-oss-20b

| Batch | CPU uBatch | ZenDNN uBatch | CPU PP (t/s) | ZenDNN PP (t/s) | CPU TG (t/s) | ZenDNN TG (t/s) | PP Speedup |
|---|---|---|---|---|---|---|---|
| 16 | 512 | 16384 | 800.39 | 1338.16 | 225.68 | 96.59 | 1.67× |
| 32 | 512 | 16384 | 810.13 | 1629.12 | 325.10 | 152.93 | 2.01× |

Note: TG (token generation) throughput may be lower with ZenDNN in some configurations with small batch sizes. The ZenDNN team is actively working on optimizations for smaller batch sizes, and improvements are expected in upcoming releases.

cc: @amukho @avinashcpandey @taronaeo @danbev

AI usage disclosure: AI assistance was used only for the graph-generation scripts and for text refactoring.

z-vishal and others added 2 commits April 2, 2026 16:39
- Add MUL_MAT_ID op acceleration for Mixture-of-Experts models
- MUL_MAT_ID op fallback to CPU backend if total experts > 32
- Point ZenDNN lib to latest bits ZenDNN-2026-WW13
Co-authored-by: Aaron Teo <taronaeo@gmail.com>

loci-review bot commented Apr 3, 2026

No meaningful performance changes were detected across 123999 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libllama.so, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.libggml-cpu.so, build.bin.libggml-base.so, build.bin.libggml.so.

💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 8 times, most recently from a8215be to 34734bc on April 9, 2026.
loci-dev force-pushed the main branch 9 times, most recently from 245e873 to d101579 on April 17, 2026.
loci-dev force-pushed the main branch 2 times, most recently from 63ab8d1 to 7638ab4 on April 19, 2026.