No meaningful performance changes were detected across 124,017 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.libllama.so, build.bin.llama-tts, build.bin.llama-bench, build.bin.libmtmd.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.libggml-base.so.
Note: source pull request ggml-org/llama.cpp#21046
Overview
Some changes to provide wider compatibility with browsers/platforms:

- Makes `ggml_set_tensor`/`memset_tensor` asynchronous, as synchronizing on them also causes issues in browsers in some cases. I believe some of the browser WebGPU implementations are not robust to handling `OnSubmittedWorkDone` callbacks if no work has been submitted. In my testing, making these operations asynchronous hasn't led to any issues, because the `WriteBuffer` and `memset` operations are still pushed to the global submission queue in the right order.
- Uses `u32` buffers for tensor data, and updates the dequantization routines to handle the unpacking. We've found that Firefox doesn't support the `bitcast` operations needed to handle the conversion from `f16 -> u32` like we were doing before, and that there are some issues getting this to work on DirectX backends across browsers as well. Unfortunately, this slows down decode speed by about 10% in some cases, due to extra memory loads when quantization blocks are not aligned to 4-byte boundaries, but I think the extra support is more important right now, and we can try to optimize the loads later if it becomes a larger issue.

Requirements
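The ordering argument for the asynchronous writes can be illustrated with a small model. This is a hedged sketch, not the PR's actual code: all names (`op`, `q_write`, `q_memset`, `q_flush`) are hypothetical, but it shows why skipping the completion wait is safe when both copy and fill operations go through a single global FIFO, so a later flush applies them in submission order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical model of a global submission queue: WriteBuffer-style copies
// and memset-style fills are enqueued without waiting on any completion
// callback, and a flush applies them strictly in submission (FIFO) order.

enum op_kind { OP_WRITE, OP_MEMSET };

struct op {
    enum op_kind kind;
    size_t  offset, size;
    uint8_t data[16];   // payload for OP_WRITE (small for the sketch)
    uint8_t value;      // fill byte for OP_MEMSET
};

struct queue { struct op ops[32]; int n; };

// Enqueue a buffer write; returns immediately (asynchronous from the caller).
static void q_write(struct queue *q, size_t off, const void *src, size_t n) {
    struct op *o = &q->ops[q->n++];
    o->kind = OP_WRITE; o->offset = off; o->size = n;
    memcpy(o->data, src, n);
}

// Enqueue a fill; also returns immediately.
static void q_memset(struct queue *q, size_t off, uint8_t v, size_t n) {
    struct op *o = &q->ops[q->n++];
    o->kind = OP_MEMSET; o->offset = off; o->size = n; o->value = v;
}

// "Submit": apply every pending op to the backing buffer in FIFO order.
static void q_flush(struct queue *q, uint8_t *buf) {
    for (int i = 0; i < q->n; i++) {
        struct op *o = &q->ops[i];
        if (o->kind == OP_WRITE) memcpy(buf + o->offset, o->data, o->size);
        else                     memset(buf + o->offset, o->value, o->size);
    }
    q->n = 0;
}
```

Because overlapping operations are resolved by queue position rather than by completion callbacks, the result is the same as if each call had been synchronous.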
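The alignment cost behind the ~10% decode slowdown can also be sketched. This is an illustrative C version (not the PR's shader code; `load_u32_from_words` and the helper names are hypothetical): when a buffer is bound as an array of `u32` words, reading 4 bytes at an offset that is not 4-byte aligned takes two word loads plus shifts instead of one load.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: fetch the 4 bytes starting at `byte_off` from a buffer
// exposed as u32 words. Aligned reads cost one load; misaligned reads cost
// two loads plus shifts -- the extra memory traffic when quantization blocks
// do not start on 4-byte boundaries. Assumes a little-endian byte layout,
// matching how bytes are packed into the words below.
static uint32_t load_u32_from_words(const uint32_t *words, size_t byte_off) {
    size_t   idx = byte_off / 4;
    unsigned sh  = (unsigned)(byte_off % 4) * 8;
    if (sh == 0) {
        return words[idx];                       // aligned: single load
    }
    // misaligned: combine the tail of one word with the head of the next
    return (words[idx] >> sh) | (words[idx + 1] << (32 - sh));
}

// The 16-bit halves can then be unpacked with plain shifts and masks,
// avoiding the f16 bitcast that Firefox rejects:
static uint16_t lo16(uint32_t w) { return (uint16_t)(w & 0xFFFFu); }
static uint16_t hi16(uint32_t w) { return (uint16_t)(w >> 16); }
```

An aligned block start hits the single-load path; shifting a block start by even 2 bytes forces the two-load path for every word read, which is where the extra cost comes from.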