
CUDA: fuse muls #21665

Merged
am17an merged 1 commit into ggml-org:master from am17an:cuda_mul_fused2 on Apr 10, 2026

Conversation

@am17an
Contributor

@am17an am17an commented Apr 9, 2026

Overview

Add fusion for the mul operator, the same as for adds. This is useful for gemma4 models, which have a down-expert scale that can be fused with mul; this saves a full roundtrip of used_experts x expert_dims in f32 through global memory, so surprisingly it seems to help PP more than TG. Additionally, we could fuse mul-mat + (epilogue), which would benefit all MoE models, but that is not a simple change, since we would have to account for all the different mul-mat-id paths we take.

on a 4090

| Model | Test | t/s cuda_fast_hash | t/s cuda_mul_fused | Speedup |
| --- | --- | --- | --- | --- |
| gemma4 ?B Q4_0 | pp2048 | 9525.77 | 10065.06 | 1.06 |
| gemma4 ?B Q4_0 | pp2048@d16384 | 7084.46 | 7335.44 | 1.04 |
| gemma4 ?B Q4_0 | pp2048@d32768 | 5819.71 | 6010.60 | 1.03 |
| gemma4 ?B Q4_0 | tg128 | 193.45 | 194.73 | 1.01 |
| gemma4 ?B Q4_0 | tg128@d16384 | 173.40 | 174.48 | 1.01 |
| gemma4 ?B Q4_0 | tg128@d32768 | 156.48 | 157.38 | 1.01 |

Additional information

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. Continuing with letting AI run in a loop, it found this one for gemma 4; it found some other ones as well, but those don't work. I wrote the code.

@am17an am17an requested a review from a team as a code owner April 9, 2026 09:07
@am17an
Contributor Author

am17an commented Apr 9, 2026

BTW @ggerganov, this model seems to have down_exps_scale even without nvfp4, how would it work if someone made an nvfp4 version of gemma4?

@CISC
Member

CISC commented Apr 9, 2026

> BTW @ggerganov, this model seems to have down_exps_scale even without nvfp4, how would it work if someone made an nvfp4 version of gemma4?

I think we will just have to fix that on conversion.

@github-actions github-actions bot added the labels Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Apr 9, 2026
@am17an am17an merged commit e34f042 into ggml-org:master Apr 10, 2026
45 of 48 checks passed
@am17an am17an deleted the cuda_mul_fused2 branch April 10, 2026 02:24
spiritbuun pushed a commit to spiritbuun/buun-llama-cpp that referenced this pull request Apr 10, 2026
