sync : ggml#3803

Open
ggerganov wants to merge 73 commits into master from sync-ggml-26-05-10

Conversation

@ggerganov
Member

No description provided.

shawngu-quic and others added 30 commits May 10, 2026 17:26
* MoE Mxfp4 CLC kernel added, router reorder on GPU

* Pass test-backend-ops for MoE mxfp4 Adreno CLC

* remove putenv in llama-model.cpp

* fix indent style and whitespace

* opencl: remove unnecessary headers

* opencl: do not save cl_program objects

* opencl: remove unnecessary assert

* fix precision issue

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
…irely) (llama/22533)

* fix: CUDA device PCI bus ID detection for multi-GPU de-dupe

* HIP, MUSA macros

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* shader(norm): add layer norm ops

* shader(norm): stabilize floating-point computation with Kahan summation and handle mixed types

* shader(norm): remove the non-contiguous strides

* shader(norm): use the original implementation rather than the Kahan summation
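For context, the Kahan (compensated) summation referenced in the commits above — later reverted in favor of the original implementation — can be sketched on the CPU as follows. The function name `kahan_sum` is illustrative, not the shader's actual symbol:

```cpp
#include <cstddef>

// Kahan summation: carry a running error term `c` so that small addends
// are not lost when accumulated into a much larger running total.
float kahan_sum(const float* x, size_t n) {
    float sum = 0.0f;
    float c   = 0.0f;             // compensation for lost low-order bits
    for (size_t i = 0; i < n; ++i) {
        float y = x[i] - c;       // re-inject previously lost bits
        float t = sum + y;        // low-order bits of y may be lost here
        c   = (t - sum) - y;      // recover what was lost this step
        sum = t;
    }
    return sum;
}
```

Note that aggressive compiler flags (e.g. fast-math) can optimize the compensation away, which is one reason shader implementations sometimes prefer a plain sum.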
* llama : add option to save memory in device buffers

* tests : extend llama-save-load-state
Store the last graph uid and compare against it to determine if the same
graph is being computed.
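The uid-based reuse check described above can be sketched as a small host-side helper; the struct and field names here are illustrative, not the actual ggml types:

```cpp
#include <cstdint>

// Track the uid of the last computed graph; if the next graph carries the
// same uid, the backend can assume the same graph is being computed and
// reuse the previously prepared state instead of rebuilding it.
struct graph_reuse_tracker {
    uint64_t last_uid = 0;

    // Returns true when `uid` matches the previously seen graph.
    bool same_graph(uint64_t uid) {
        const bool same = (uid != 0 && uid == last_uid);
        last_uid = uid;
        return same;
    }
};
```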
* hex-mm: process m-tail rows on HMX instead of HVX

* hmx-mm: unroll and optimize padded activation loop

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
…--fit (llama/22688)

* ggml : report estimated OpenCL memory for --fit

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* ggml : estimated OpenCL memory backend integrated

Signed-off-by: Florian Reinle <f.reinle@otec.de>

---------

Signed-off-by: Florian Reinle <f.reinle@otec.de>
…lama/22149)

* sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET

Signed-off-by: Chun Tao <chun.tao@intel.com>

* Fix abort during test-backend-ops

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* Regenerate ops.md

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* Add scope_dbg_print to newly added SYCL ops.

Also add scope_dbg_print to existing ssm_conv op.

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

---------

Signed-off-by: Chun Tao <chun.tao@intel.com>
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>
…/22651)

* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched

* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched

* CUDA: add cublasSgemmStridedBatched mapping for HIP and MUSA backends
* Q4_0 MoE CLC pass sanity check

* release program

* opencl: fix whitespace

* opencl: remove unused cl_program

* opencl: break #if block to make it more clear

* opencl: adjust format

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
* cuda: fuse snake activation (mul, sin, sqr, mul, add)

Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The
matcher recognizes the naive 5 op decomposition emitted by audio
decoders (BigVGAN, Vocos) for snake activation
y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise
kernel.

Add test_snake_fuse comparing CPU naive vs CUDA fused across
F32 / F16 / BF16.
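For reference, the fused formula y = x + sin(a*x)^2 * inv_b can be checked against a minimal CPU sketch like the one below; `snake_ref` is an illustrative name, not the kernel's actual entry point:

```cpp
#include <cmath>
#include <cstddef>

// CPU reference for the snake activation that the fused CUDA kernel
// computes in one pass: y = x + sin(a*x)^2 * inv_b, elementwise.
void snake_ref(const float* x, float a, float inv_b, float* y, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        const float s = sinf(a * x[i]);
        y[i] = x[i] + s * s * inv_b;
    }
}
```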

* cuda: address review feedback from @am17an

Use ggml_cuda_cast for F32/F16/BF16 conversions and rename
kernel_snake to snake_kernel to match upstream conventions.

* cuda: snake fusion fastdiv on T_len, Suggested-by: @am17an

* Update tests/test-backend-ops.cpp

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cuda: snake fusion check add->type matches x->type

Address review feedback from @am17an

* cuda: snake fusion check add->type matches x->type

Moved for readability (equivalent)
Address review feedback from @am17an

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* L2_NORM Updates

* Addressed PR Comments

* ggml-hexagon: add L2_NORM HVX kernel for Hexagon backend

* hex-unary: remove supported_unary_nc since the outer loop is the same for all unary ops

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>
Implement the Gated Delta Net recurrence on HVX with:
- 4-row fused kernels for PP (prompt processing) path
- 8-row fused kernels for TG (token generation) path, reducing
  K/Q/gate vector reload overhead by 2x
- Separate PP/TG thread functions for I-cache isolation
- VTCM state scratchpad with DMA in/out for TG single-cycle access
- Vectorized gate exp via hvx_exp_f32
* mimo-v2.5: add flash attention mma/tiles for d_kq=192 d_v=128

* mimo-v2.5: follow (256, 256) fattn templates

* mimo-v2.5: cleanup comments

* mimo-v2.5: further comment cleanup

* mimo-v2.5: address PR feedback
fix GQA handling
check for other dangling 320/576 carveouts and mirror them for 192
Add to backend ops test so new paths are covered
…(llama/22147)

* sycl: Battlemage AOT build via spir64_gen + MMQ subgroup annotations

Signed-off-by: Chun Tao <chun.tao@intel.com>

* Remove unneeded/unnecessary comments and annotations

The MMQ subgroup annotations added are on functions gated behind
ggml_sycl_supports_mmq(). Revisit the need for these annotations
when that function changes.

---------

Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>
@ggerganov
Member Author

@danbev Could you extract the 2 CUDA commits and PR them in llama.cpp? Also link to the failing CI runs for information. If we get them merged there, I'll sync them and update here.

danbev added 2 commits May 14, 2026 05:27
…f/bfloat16 in CUDA 11.8"

This reverts commit 5cd2284.

Reverting in favor of:
ggml-org/llama.cpp#22994
…f2 types"

This reverts commit a2839b4.

Reverting this as, after closer inspection, these are only warnings and not
errors.
@danbev
Member

danbev commented May 14, 2026

@ggerganov I've reverted the two ggml commits now. The first one is covered by 22994, and after closer inspection the second one was only generating warnings, not errors.

So perhaps we can merge this, and then do another sync and we should have a green CI after that.

robUx4 and others added 25 commits May 14, 2026 11:53
…/1477)

For a given output position j on the time axis, only input positions
i such that i*s0 <= j < i*s0 + K contribute -- i.e.
i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1].
That's at most ceil(K/s0) values (typically 2 for stride==K/2
transposed convs).

The current kernel iterates the full IL range and filters with an
`if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320,
K=10, s0=5 -- a representative codec-decoder shape). On Apple M1
the wasted work trips the macOS GPU watchdog
(kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long
graphs.

Compute i_min, i_max analytically before the inner loop and iterate
only [i_min, i_max]. Output is bit-identical (same multiplies and
adds in the same order); loop bound shrinks by IL/ceil(K/s0).

Tested on M1 with a downstream consumer running a TTS codec at full
T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits
across long synthesis runs vs ~30% pre-patch.
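The analytic bound computation described above can be sketched as follows; the function name is illustrative, and the ceiling is taken only when the lower bound is positive since negative values clamp to 0 anyway:

```cpp
#include <algorithm>

// For transposed-conv output position j, input i contributes iff
//   i*s0 <= j < i*s0 + K,
// i.e. i in [ceil((j - K + 1)/s0), floor(j/s0)], clamped to [0, IL-1].
void conv_transpose_bounds(int j, int K, int s0, int IL,
                           int* i_min, int* i_max) {
    const int p = j - K + 1;
    *i_min = p > 0 ? (p + s0 - 1) / s0 : 0;   // ceil(p/s0), clamped at 0
    *i_max = std::min(IL - 1, j / s0);        // floor(j/s0), clamped at IL-1
}
```

With the representative shape K=10, s0=5, each output position touches at most ceil(K/s0) = 2 inputs instead of all IL of them.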
Add missing `#include <mutex>` in ggml-backend-device.cpp.

Fixes: #22809

Signed-off-by: Oliver Walsh <owalsh@redhat.com>
* add im2col_3d

* format code

* update the ops.md
Before, we relied on a transitive include from `cub/cub.cuh`, which is
bad practice, as cub may not always expose cuda/iterator
* cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan)

* cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review)

* cuda: merge type_ok and types_ok into a single types_ok (address am17an review)

* cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16

bin_bcast only dispatches F32/F16 type triplets, mirror the
vulkan filter so unsupported types fall back through cpy
instead of aborting.

* test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases
`im2col_cuda` and `im2col_3d_cuda` both dispatch with
`block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on
raw 16 kHz audio with T > 65535 (~ 4 s) trip the limit -- e.g. SEANet
at 11 s lands at OW = 176000 -- and the launch returns
`invalid configuration argument`.

Clamp `block_nums.y` to `MIN(OW, MAX_GRIDDIM_Y)` and loop inside the
kernel with stride `MAX_GRIDDIM_Y`. Same in-kernel stride pattern
already used for the z axis (`MAX_GRIDDIM_Z`). Both 2D `im2col_kernel`
and 3D `im2col_3d_kernel` need the same fix. Bit-identical for
OW <= 65535 (single iteration of the new outer loop).

Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s /
16 kHz audio (im2col reaching OW ~ 176000); pre-fix launch returns
`invalid configuration argument`, post-fix runs to completion.
Existing test-backend-ops im2col cases unchanged.
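The clamp-and-stride pattern described above can be emulated on the host to check coverage; `visit_counts` is an illustrative name, and MAX_GRIDDIM_Y is the 65535 grid-Y cap from the description:

```cpp
#include <algorithm>
#include <vector>

// Host-side emulation of the clamped launch: grid_y is capped at
// MAX_GRIDDIM_Y and each "block row" walks the output width with that
// stride, so every ow in [0, OW) is still visited exactly once.
constexpr int MAX_GRIDDIM_Y = 65535;

std::vector<int> visit_counts(int OW) {
    const int grid_y = std::min(OW, MAX_GRIDDIM_Y); // clamped grid dim
    std::vector<int> count(OW, 0);
    for (int by = 0; by < grid_y; ++by) {           // one pass per block row
        for (int ow = by; ow < OW; ow += grid_y) {  // in-kernel stride loop
            count[ow]++;
        }
    }
    return count;
}
```

For OW <= 65535 the inner loop runs a single iteration, matching the bit-identical claim above.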
* Q4_1 MoE CLC pass sanity check

* remove unnecessary code

* opencl: remove unnecessary asserts and reformat

* opencl: fix supports_op for q4_1 moe

* q4_1 moe is supported by Adreno with certain shapes

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
…lama/22711)

* metal : promote mul_mv/mul_mm batch divisors to function constants

* metal : take op directly in get_pipeline_mul_mv_ext
…s for Xe2 and newer (llama/22461)

* refactor

* Use l_warptile only when coopmat is available for BF16
* fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32

* fix(unary): correct the gelu, gelu quick and gelu erf functions

* fix(flash-attn-tile): fix the hardcode v type

* fix(flash_attn): fix tile path

* fix: pass editorconfig and address the type conflicts

* fix: remove redundant pipeline keys

* fix: remove inline min/max group size functions and revert the flash attn path order

* fix: use clamp to avoid NaN for GELU

* fix: use the right range for exp, 80 is safer for f32 exp
* Enable to run gpt-oss-20b and refactor mulmat-q

* disable test-backend-ops in ubuntu-24-webgpu
* ggml-opencl: add Adreno xmem F16xF32 GEMM for prefill

* ggml-opencl: address Adreno xmem review comments

* ggml-opencl: align xmem gemm kernel naming

---------

Co-authored-by: Your Name <your@email.com>
* hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase

* hmx-mm: optimize per-group scale handling

* hmx-fa: optimize slope load from vtcm

* hmx-fa: use aligned access where possible in hmx-utils

* hexagon: add hvx_vec_repl_2x_f16 helper and consolidate repl helpers

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
…(llama/22681)

* ggml-zendnn : add runtime env var GGML_ZENDNN_ADAPTIVE_FALLBACK to control adaptive fallback (default: enabled)

* ggml-zendnn : restore original fallback logic when adaptive fallback is disabled
* opencl: add q5_0 moe support

* opencl: add q5_1 moe support

* opencl: avoid potential leak

* opencl: suppress unused var warning when building for non-Adreno

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
…g and casting the result to the destination type. Avoids half+half operator ambiguity. (llama/22994)
@ggerganov
Member Author

@danbev Updated this PR. Let's see if the CI passes
