
CUDA: fuse muls #21665

Merged
am17an merged 1 commit into ggml-org:master from am17an:cuda_mul_fused2 on Apr 10, 2026

Conversation

@am17an
Contributor

@am17an am17an commented Apr 9, 2026

Overview

Add fusion for the mul operator, the same as for adds. This is useful for gemma4 models, which have a down-expert scale that can be fused with mul; this saves a full roundtrip of used_experts x expert_dims in f32 through global memory, so surprisingly it seems to help PP more than TG. Additionally, we could fuse mul-mat + (epilogue), which would benefit all MoE models, but that is not a simple change, since we would have to account for all the different mul-mat-id paths we take.

on a 4090

| Model | Test | t/s cuda_fast_hash | t/s cuda_mul_fused | Speedup |
| --- | --- | --- | --- | --- |
| gemma4 ?B Q4_0 | pp2048 | 9525.77 | 10065.06 | 1.06 |
| gemma4 ?B Q4_0 | pp2048@d16384 | 7084.46 | 7335.44 | 1.04 |
| gemma4 ?B Q4_0 | pp2048@d32768 | 5819.71 | 6010.60 | 1.03 |
| gemma4 ?B Q4_0 | tg128 | 193.45 | 194.73 | 1.01 |
| gemma4 ?B Q4_0 | tg128@d16384 | 173.40 | 174.48 | 1.01 |
| gemma4 ?B Q4_0 | tg128@d32768 | 156.48 | 157.38 | 1.01 |

Additional information

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. Continuing with letting AI run in a loop, it found this one for gemma 4; it found some other ones as well, but those don't work. I wrote the code.

@am17an am17an requested a review from a team as a code owner April 9, 2026 09:07
@am17an
Contributor Author

am17an commented Apr 9, 2026

BTW @ggerganov, this model seems to have down_exps_scale even without nvfp4, how would it work if someone made an nvfp4 version of gemma4?

@CISC
Member

CISC commented Apr 9, 2026

> BTW @ggerganov, this model seems to have down_exps_scale even without nvfp4, how would it work if someone made an nvfp4 version of gemma4?

I think we will just have to fix that on conversion.

@github-actions github-actions bot added the labels Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Apr 9, 2026
@am17an am17an merged commit e34f042 into ggml-org:master Apr 10, 2026
45 of 48 checks passed
@am17an am17an deleted the cuda_mul_fused2 branch April 10, 2026 02:24
spiritbuun pushed a commit to spiritbuun/buun-llama-cpp that referenced this pull request Apr 10, 2026
