
UPSTREAM PR #21315: ggml-zendnn : add MUL_MAT_ID op support for MoE models#1329

Open
loci-dev wants to merge 2 commits into main from loci/pr-21315-ggml-zendnn-mul-mat-id-support

Conversation

@loci-dev loci-dev commented Apr 3, 2026

Note

Source pull request: ggml-org/llama.cpp#21315

This PR adds support for the MUL_MAT_ID op in the ZenDNN backend. The MUL_MAT_ID op is used in Mixture-of-Experts (MoE) models to perform matrix multiplication with expert selection based on input IDs.
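As a rough illustration of the op's semantics, here is a minimal C sketch (the function and parameter names are ours for illustration, not ggml's actual tensor API): for each token, the expert ID selects which expert weight matrix multiplies that token's activation.

```c
#include <assert.h>
#include <stddef.h>

/* Reference semantics of MUL_MAT_ID (illustrative only):
 * for each token t, dst[t] = W[ids[t]] * src[t], where each
 * expert weight W[e] is an n_out x n_in row-major matrix. */
void mul_mat_id_ref(const float *W,   /* [n_expert][n_out][n_in] */
                    const float *src, /* [n_tok][n_in]           */
                    const int   *ids, /* [n_tok] expert per token */
                    float       *dst, /* [n_tok][n_out]          */
                    int n_expert, int n_out, int n_in, int n_tok) {
    (void)n_expert; /* bound implied by valid ids[] */
    for (int t = 0; t < n_tok; ++t) {
        /* pick the weight matrix of the selected expert */
        const float *We = W + (size_t)ids[t] * n_out * n_in;
        for (int o = 0; o < n_out; ++o) {
            float acc = 0.0f;
            for (int i = 0; i < n_in; ++i)
                acc += We[(size_t)o * n_in + i] * src[(size_t)t * n_in + i];
            dst[(size_t)t * n_out + o] = acc;
        }
    }
}
```

In an MoE layer this runs once per selected expert slot, which is why accelerating it dominates prompt-processing throughput for these models.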

  • Accelerates the MUL_MAT_ID op for MoE models using ZenDNN.
  • Falls back to the CPU backend when the total number of experts exceeds 32, as ZenDNN may not (for now) be efficient in these scenarios.
  • Updates the ZenDNN library to the latest release (ZenDNN-2026-WW13).
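The fallback condition can be pictured as a simple op-support predicate (a hypothetical sketch; the actual check lives in the ggml-zendnn backend's op-support callback and the names below are ours, not the PR's):

```c
/* Hypothetical sketch of the >32-expert fallback described in this PR:
 * when the predicate returns 0, ggml routes MUL_MAT_ID to the CPU
 * backend instead of ZenDNN. Threshold taken from the PR description. */
#define ZENDNN_MAX_EXPERTS 32

int zendnn_supports_mul_mat_id(int n_expert) {
    return n_expert <= ZENDNN_MAX_EXPERTS;
}
```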

Test Configuration

  • Hardware: AMD EPYC 9004 Series (Zen 4)
  • Threads: 96
  • Tools: llama-bench and llama-batched-bench
  • Precision: BF16
  • k/v cache: bf16
  • llama.cpp version: 8690
  • ZenDNN version: 1.0.0 (ZenDNN-2026-WW13)
  • Environment: ZENDNNL_MATMUL_ALGO=1 (Blocked AOCL BLIS)

Benchmark Results

Optimal ubatch Selection - CPU Backend (ggml-cpu)

The heatmaps below show prompt-processing speed (t/s) across all valid (prompt length, ubatch) combinations for the CPU backend.
The optimal ubatch for the CPU backend varies by model: 512 for gpt-oss-20B, and 768 for both Mixtral-8x7B and Phi-3.5-MoE. For ZenDNN, 4096 is consistently the best ubatch across all three models.

gpt-oss-20B · Best CPU ubatch: 512
[heatmap image: heatmap_CPU_gpt-oss-20B-BF16 gguf_96t]

Mixtral-8x7B · Best CPU ubatch: 768
[heatmap image: heatmap_CPU_Mixtral-8x7B-Instruct-v0 1-BF16 gguf_96t]

Phi-3.5-MoE · Best CPU ubatch: 768
[heatmap image: heatmap_CPU_Phi-3 5-MoE-16x4 1B-instruct-BF16 gguf_96t]

llama-bench - Prompt Processing Throughput (tokens/sec)

Config: 96 threads · BF16 · bf16 k/v cache
CPU numbers use the best ubatch per model (512 / 768 / 768). ZenDNN numbers use ubatch=4096.

Mixtral-8x7B-Instruct-v0.1 - 86.99 GiB · 46.70B params

| Test | CPU (t/s) | ZenDNN (t/s) | Speedup |
|---|---|---|---|
| pp128 | 171.30 ± 0.36 | 162.33 ± 0.19 | 0.95× |
| pp256 | 181.34 ± 0.10 | 235.79 ± 0.88 | 1.30× |
| pp384 | 193.57 ± 0.06 | 276.02 ± 0.84 | 1.43× |
| pp512 | 184.60 ± 0.06 | 308.42 ± 0.22 | 1.67× |
| pp768 | 185.62 ± 0.12 | 360.08 ± 0.43 | 1.94× |
| pp1024 | 181.51 ± 0.08 | 374.11 ± 0.48 | 2.06× |
| pp1536 | 180.46 ± 0.13 | 392.31 ± 0.53 | 2.17× |
| pp2048 | 177.43 ± 0.07 | 411.17 ± 0.28 | 2.32× |
| pp3072 | 171.63 ± 0.02 | 413.27 ± 0.07 | 2.41× |
| pp4096 | 166.41 ± 0.03 | 393.64 ± 0.48 | 2.37× |

Phi-3.5-MoE-instruct - 78.00 GiB · 41.87B params

| Test | CPU (t/s) | ZenDNN (t/s) | Speedup |
|---|---|---|---|
| pp128 | 282.71 ± 0.82 | 210.67 ± 11.31 | 0.74× |
| pp256 | 347.71 ± 0.40 | 298.33 ± 3.35 | 0.86× |
| pp384 | 359.16 ± 0.63 | 396.00 ± 4.39 | 1.10× |
| pp512 | 371.69 ± 0.26 | 436.85 ± 2.89 | 1.18× |
| pp768 | 375.51 ± 0.68 | 512.55 ± 4.00 | 1.37× |
| pp1024 | 360.19 ± 0.39 | 536.58 ± 7.34 | 1.49× |
| pp1536 | 356.10 ± 0.48 | 561.33 ± 3.22 | 1.58× |
| pp2048 | 344.96 ± 0.62 | 583.05 ± 1.22 | 1.69× |
| pp3072 | 323.80 ± 1.30 | 600.95 ± 3.63 | 1.86× |
| pp4096 | 307.02 ± 0.08 | 556.82 ± 9.53 | 1.81× |

unsloth/gpt-oss-20b - 38.97 GiB · 20.91B params

| Test | CPU (t/s) | ZenDNN (t/s) | Speedup |
|---|---|---|---|
| pp128 | 471.56 ± 2.90 | 284.14 ± 6.81 | 0.60× |
| pp256 | 589.39 ± 1.80 | 426.90 ± 7.60 | 0.72× |
| pp384 | 599.90 ± 1.32 | 512.07 ± 5.06 | 0.85× |
| pp512 | 618.43 ± 0.38 | 563.97 ± 1.11 | 0.91× |
| pp768 | 587.94 ± 0.81 | 633.89 ± 0.87 | 1.08× |
| pp1024 | 577.87 ± 0.54 | 609.45 ± 1.67 | 1.05× |
| pp1536 | 553.02 ± 0.35 | 692.40 ± 0.63 | 1.25× |
| pp2048 | 533.92 ± 0.38 | 764.32 ± 1.61 | 1.43× |
| pp3072 | 498.19 ± 0.50 | 672.33 ± 37.24 | 1.35× |
| pp4096 | 467.38 ± 5.60 | 579.32 ± 31.03 | 1.24× |

llama-batched-bench - PP + TG Performance

Configuration: 96 threads · BF16 compute · BF16 KV cache
Prompt length: 512 tokens
Generation length: 128 tokens
Context length per request: 512 + 128 = 640 tokens

Total context size (batch-wise):

  • Batch 16 -> 640 × 16 = 10,240 tokens
  • Batch 32 -> 640 × 32 = 20,480 tokens
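The batch-wise totals above follow from a per-request product; a trivial C helper (our own, not a llama.cpp API) makes the arithmetic explicit:

```c
/* Total KV-context tokens for a batch of identical requests:
 * each request holds n_prompt prompt tokens plus n_gen generated
 * tokens, and the batch multiplies that per-request context. */
int total_ctx_tokens(int n_prompt, int n_gen, int n_batch) {
    return (n_prompt + n_gen) * n_batch;
}
```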

Mixtral-8x7B-Instruct-v0.1

| Batch | CPU uBatch | ZenDNN uBatch | CPU PP (t/s) | ZenDNN PP (t/s) | CPU TG (t/s) | ZenDNN TG (t/s) | PP Speedup |
|---|---|---|---|---|---|---|---|
| 16 | 4096 | 4096 | 194.48 | 468.08 | 56.48 | 42.87 | 2.41× |
| 32 | 4096 | 8192 | 199.08 | 515.11 | 92.38 | 69.00 | 2.59× |

Phi-3.5-MoE-instruct

| Batch | CPU uBatch | ZenDNN uBatch | CPU PP (t/s) | ZenDNN PP (t/s) | CPU TG (t/s) | ZenDNN TG (t/s) | PP Speedup |
|---|---|---|---|---|---|---|---|
| 16 | 1024 | 8192 | 412.05 | 689.85 | 108.46 | 65.02 | 1.67× |
| 32 | 1024 | 8192 | 415.51 | 825.44 | 163.35 | 98.43 | 1.99× |

unsloth/gpt-oss-20b

| Batch | CPU uBatch | ZenDNN uBatch | CPU PP (t/s) | ZenDNN PP (t/s) | CPU TG (t/s) | ZenDNN TG (t/s) | PP Speedup |
|---|---|---|---|---|---|---|---|
| 16 | 512 | 16384 | 800.39 | 1338.16 | 225.68 | 96.59 | 1.67× |
| 32 | 512 | 16384 | 810.13 | 1629.12 | 325.10 | 152.93 | 2.01× |

Note: TG (token generation) throughput may be lower with ZenDNN in some configurations with small batch sizes. The ZenDNN team is actively working on optimizations for smaller batch sizes, and improvements are expected in upcoming releases.

cc: @amukho @avinashcpandey @taronaeo @danbev

AI usage disclosure: AI assistance was used only for the graph-generation scripts and for text refactoring.

z-vishal and others added 2 commits April 2, 2026 16:39
- Add MUL_MAT_ID op acceleration for Mixture-of-Experts models
- MUL_MAT_ID op fallback to CPU backend if total experts > 32
- Point ZenDNN lib to latest bits ZenDNN-2026-WW13
Co-authored-by: Aaron Teo <taronaeo@gmail.com>

loci-review bot commented Apr 3, 2026

No meaningful performance changes were detected across 123999 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-bench, build.bin.libllama.so, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.libggml-cpu.so, build.bin.libggml-base.so, build.bin.libggml.so.

💬 Questions? Tag @loci-dev

loci-dev force-pushed the main branch 8 times, most recently from a8215be to 34734bc on April 9, 2026.
loci-dev force-pushed the main branch 9 times, most recently from 245e873 to d101579 on April 17, 2026.
loci-dev force-pushed the main branch 2 times, most recently from 63ab8d1 to 7638ab4 on April 19, 2026.