UPSTREAM PR #21046: ggml webgpu: move quantized buffers to u32 types and some other changes for wider browser/device support #1322

Open
loci-dev wants to merge 8 commits into main from loci/pr-21046-remove_bitcast
Conversation


@loci-dev loci-dev commented Apr 1, 2026

Note

Source pull request: ggml-org/llama.cpp#21046

Overview

Some changes to provide wider compatibility across browsers and platforms:

  • Unfortunately, the change in ggml webgpu: Move to no timeout for WaitAny in graph submission to avoid deadlocks (ggml-org/llama.cpp#20618) broke wasm builds and led to hangs, so this adds back a non-zero timeout for waits on submissions. I was then (finally) able to root-cause the deadlock: a lock inversion between the lock in the parameter buffer pool and the global device lock within Dawn.
  • Removes synchronous waits in ggml_set_tensor/memset_tensor, since these also cause issues in some browsers. I believe some browser WebGPU implementations do not handle OnSubmittedWorkDone callbacks robustly when no work has been submitted. In my testing, making these operations asynchronous hasn't led to any issues, because the WriteBuffer and memset operations are still pushed to the global submission queue in the correct order.
  • Moves quantization types in matrix and flash attention operations to be passed as flat u32 buffers, and updates the dequantization routines to handle the unpacking. We've found that Firefox doesn't support the bitcast operations we were previously using for the f16 -> u32 conversion, and that there are also issues getting this to work on DirectX backends across all browsers. Unfortunately, this slows down decode speed by about 10% in some cases, due to extra memory loads when quantization blocks are not aligned to 4-byte boundaries, but I think the wider support is more important right now; we can try to optimize the loads later if it becomes a larger issue.
    • In terms of browser speed on my M3: Chrome > Safari > Firefox, showing that WebGPU implementations still have lots of room for improvement, and/or that there are optimizations we can make to submission/synchronization that are causing slowdowns on some implementations.
  • If you're interested, try out llama.cpp on any machine/browser here and let me know if you run into issues!
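The lock-inversion fix mentioned above follows a standard pattern: never call into the device (which takes Dawn's global device lock) while still holding the buffer-pool lock, so the two locks are never held at once and no thread can acquire them in the opposite order. A minimal Python sketch of that pattern, with hypothetical names standing in for the actual ggml/Dawn locks (this is not the real code):

```python
import threading

pool_lock = threading.Lock()    # stands in for the parameter-buffer-pool lock
device_lock = threading.Lock()  # stands in for Dawn's global device lock
free_buffers = ["buf0", "buf1"]

def alloc_buffer():
    # Hold the pool lock only long enough to check the free list,
    # and release it before any device interaction.
    with pool_lock:
        if free_buffers:
            return free_buffers.pop()
    # The device call happens with pool_lock already released, so the
    # two locks are never nested and a lock inversion cannot occur.
    with device_lock:
        return "new_buf"
```

The deadlock-prone variant would perform the device call inside the `with pool_lock:` block while another thread holds the device lock and waits on the pool.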
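The extra loads for unaligned blocks can be illustrated with a host-side sketch: when only 4-byte u32 loads are available (as in a WGSL shader that cannot bitcast f16 data), a 2-byte f16 scale stored at a non-4-byte-aligned offset has to be extracted by shifting and masking the containing word. The helper names and the toy block layout below are illustrative, not the actual ggml shader code:

```python
import struct

def load_u16(words, byte_offset):
    # Extract a little-endian u16 from a buffer viewed as u32 words.
    # byte_offset is assumed even, as in the quantized block layouts.
    word = words[byte_offset // 4]
    shift = (byte_offset % 4) * 8
    return (word >> shift) & 0xFFFF

def load_u8(words, byte_offset):
    word = words[byte_offset // 4]
    shift = (byte_offset % 4) * 8
    return (word >> shift) & 0xFF

# Toy "block": an f16 scale followed by 4 quantized bytes, placed at
# byte offset 2 so the block is NOT 4-byte aligned (padding around it).
raw = b"\x00\x00" + struct.pack("<e", 1.5) + bytes([1, 2, 3, 4]) + b"\x00\x00\x00\x00"
words = struct.unpack("<%dI" % (len(raw) // 4), raw)

scale_bits = load_u16(words, 2)                                  # shift/mask, no bitcast
scale = struct.unpack("<e", struct.pack("<H", scale_bits))[0]    # 1.5
```

In a real shader the shift amount comes from the block's byte offset modulo 4; fields of a misaligned block land in different u32 words, which is where the extra memory loads (and the ~10% decode slowdown) come from.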

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: yes, I used an agent to help with the refactor of the dequantization routines to load/convert u32 values.


loci-review bot commented Apr 1, 2026

No meaningful performance changes were detected across 124017 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-bench, build.bin.libmtmd.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.libggml-base.so.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 126cd1f to a8215be Compare April 8, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 7 times, most recently from e800934 to a024d9c Compare April 15, 2026 02:19
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from d101579 to 63ab8d1 Compare April 18, 2026 02:17
