UPSTREAM PR #21046: ggml webgpu: move quantized buffers to u32 types and some other changes for wider browser/device support #1322

Open
loci-dev wants to merge 8 commits into main from loci/pr-21046-remove_bitcast
Conversation


@loci-dev loci-dev commented Apr 1, 2026

Note

Source pull request: ggml-org/llama.cpp#21046

Overview

Some changes to provide wider compatibility across browsers and platforms:

  • Unfortunately, the change in ggml webgpu: Move to no timeout for WaitAny in graph submission to avoid deadlocks (ggml-org/llama.cpp#20618) broke wasm builds and led to hangs, so this adds back a non-zero timeout for waits on submissions. I was then (finally) able to root-cause the deadlock: a lock inversion between the lock in the parameter buffer pool and the global device lock within Dawn.
  • Removes synchronous waits in ggml_set_tensor/memset_tensor, since these also cause issues in some browsers. I believe some browser WebGPU implementations do not handle OnSubmittedWorkDone callbacks robustly when no work has been submitted. In my testing, making these operations asynchronous hasn't led to any issues, because the WriteBuffer and memset operations are still pushed to the global submission queue in the correct order.
  • Moves quantization types in matrix and flash attention operations to be passed as flat u32 buffers, and updates the dequantization routines to handle the unpacking. We've found that Firefox doesn't support the bitcast operations we were previously using for the f16 -> u32 conversion, and that there are also issues getting this to work on DirectX backends across all browsers. Unfortunately, this slows down decode speed by about 10% in some cases, due to extra memory loads when quantization blocks are not aligned to 4-byte boundaries, but I think the wider support is more important right now; we can try to optimize the loads later if it becomes a larger issue.
    • In terms of browser speed on my M3: Chrome > Safari > Firefox, showing that WebGPU implementations still have lots of room for improvement, and/or that there are optimizations we can make to submission/synchronization that are causing slowdowns on some implementations.
  • If you're interested, try out llama.cpp on any machine/browser here and let me know if you run into issues!
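The lock-inversion fix mentioned above follows a standard pattern: never call into the device (which takes Dawn's global device lock) while still holding the buffer-pool lock, so the two locks are never held at once and no thread can acquire them in the opposite order. A minimal Python sketch of that pattern, with hypothetical names standing in for the actual ggml/Dawn locks (this is not the real code):

```python
import threading

pool_lock = threading.Lock()    # stands in for the parameter-buffer-pool lock
device_lock = threading.Lock()  # stands in for Dawn's global device lock
free_buffers = ["buf0", "buf1"]

def alloc_buffer():
    # Hold the pool lock only long enough to check the free list,
    # and release it before any device interaction.
    with pool_lock:
        if free_buffers:
            return free_buffers.pop()
    # The device call happens with pool_lock already released, so the
    # two locks are never nested and a lock inversion cannot occur.
    with device_lock:
        return "new_buf"
```

The deadlock-prone variant would perform the device call inside the `with pool_lock:` block while another thread holds the device lock and waits on the pool.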
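The extra loads for unaligned blocks can be illustrated with a host-side sketch: when only 4-byte u32 loads are available (as in a WGSL shader that cannot bitcast f16 data), a 2-byte f16 scale stored at a non-4-byte-aligned offset has to be extracted by shifting and masking the containing word. The helper names and the toy block layout below are illustrative, not the actual ggml shader code:

```python
import struct

def load_u16(words, byte_offset):
    # Extract a little-endian u16 from a buffer viewed as u32 words.
    # byte_offset is assumed even, as in the quantized block layouts.
    word = words[byte_offset // 4]
    shift = (byte_offset % 4) * 8
    return (word >> shift) & 0xFFFF

def load_u8(words, byte_offset):
    word = words[byte_offset // 4]
    shift = (byte_offset % 4) * 8
    return (word >> shift) & 0xFF

# Toy "block": an f16 scale followed by 4 quantized bytes, placed at
# byte offset 2 so the block is NOT 4-byte aligned (padding around it).
raw = b"\x00\x00" + struct.pack("<e", 1.5) + bytes([1, 2, 3, 4]) + b"\x00\x00\x00\x00"
words = struct.unpack("<%dI" % (len(raw) // 4), raw)

scale_bits = load_u16(words, 2)                                  # shift/mask, no bitcast
scale = struct.unpack("<e", struct.pack("<H", scale_bits))[0]    # 1.5
```

In a real shader the shift amount comes from the block's byte offset modulo 4; fields of a misaligned block land in different u32 words, which is where the extra memory loads (and the ~10% decode slowdown) come from.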

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: yes, I used an agent to help with the refactor of the dequantization routines to load/convert u32 values.


loci-review bot commented Apr 1, 2026

No meaningful performance changes were detected across 124017 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-bench, build.bin.libmtmd.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.libggml-base.so.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 126cd1f to a8215be Compare April 8, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 7 times, most recently from e800934 to a024d9c Compare April 15, 2026 02:19
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from d101579 to 63ab8d1 Compare April 18, 2026 02:17
