merge from upstream by l3utterfly · Pull Request #91 · l3utterfly/llama.cpp

l3utterfly · 2026-03-22T12:06:41Z

No description provided.


* hexagon: add ssm_conv op * hexagon: hvx kernel is functional * hexagon: improvements to ssm-conv hvx kernel * hexagon: added dma to ssm-conv hvx kernel * hexagon: ssm-conv dynamically compute gather scratchpad * hex-ssm-conv: add local context and fix various issues (spad indexing, etc) --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>


* vulkan: Fix data races in coopmat1 mul_mat(_id)
Add barriers between coopmat store and regular loads. We sort of got away with this because it was the same subgroup accessing the values, but it's still a race and may not work.
* switch to subgroup control barriers


* common : handle incomplete UTF-8 at end of input in PEG parser * cont : if reached end prematurely, emit needs_more_input to propagate partial output * cont: refactor peg parse context to add lenient flag * cont : remove partial flag, keep lenient flag
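The incomplete-UTF-8 case described in this commit can be illustrated with a minimal sketch. It is not the PEG parser code from the commit: `incomplete_utf8_tail`, `scan_status`, and `scan_complete_prefix` are hypothetical names, and the only idea borrowed from the commit message is that a truncated trailing sequence should be reported as "needs more input" rather than as an error.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of trailing bytes that start a UTF-8 sequence
// but do not finish it. Returns 0 when the input ends on a complete code
// point, or when the tail is malformed rather than merely truncated.
inline std::size_t incomplete_utf8_tail(std::string_view s) {
    const std::size_t n = s.size();
    // A sequence is at most 4 bytes long, so a truncated one has at most 3.
    for (std::size_t back = 1; back <= 3 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(s[n - back]);
        if ((c & 0xC0) == 0x80) {
            continue; // continuation byte: keep scanning backwards for the lead byte
        }
        std::size_t need = 0; // total length announced by the lead byte
        if      ((c & 0x80) == 0x00) need = 1;
        else if ((c & 0xE0) == 0xC0) need = 2;
        else if ((c & 0xF0) == 0xE0) need = 3;
        else if ((c & 0xF8) == 0xF0) need = 4;
        else return 0; // invalid lead byte: not an "incomplete" case
        return need > back ? back : 0;
    }
    return 0;
}

// Hypothetical status type mirroring the "needs_more_input" idea.
enum class scan_status { complete, needs_more_input };

struct scan_result {
    scan_status status;
    std::size_t consumed; // bytes that are safe to hand to the parser now
};

// Split the input into a prefix of whole code points and a pending tail.
inline scan_result scan_complete_prefix(std::string_view s) {
    const std::size_t tail = incomplete_utf8_tail(s);
    return { tail != 0 ? scan_status::needs_more_input : scan_status::complete,
             s.size() - tail };
}
```

A streaming caller would parse only the first `consumed` bytes and keep the tail buffered until the next chunk arrives.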


…of BF16 (ggml-org#20730) * Corrected convert script for NVFP4 naming and updated gguf constants * Add mostly_MXFP4 to FileType Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * simplify * set initial value [no ci] --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

) rpc : prevent division by zero in deserialize_tensor
When receiving an RPC message with a deprecated tensor type (e.g., type 4 or 5, where `blck_size == 0`), `ggml_row_size()` triggers a division by zero (SIGFPE) and crashes the rpc-server. This patch adds a simple validation check in `deserialize_tensor` that returns `nullptr` if the requested tensor type has a block size of 0.
(Note: this was originally reported via a Security Advisory, and the maintainer suggested submitting a patch here.)
* style: remove trailing whitespace
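A minimal sketch of the kind of guard this commit describes, under stated assumptions: the traits table, `tensor_sketch`, and `deserialize_tensor_sketch` are hypothetical stand-ins and do not reproduce the ggml type table or the rpc-server code. The table only imitates the situation from the commit message, where type ids 4 and 5 have a block size of 0.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical per-type traits; the values are illustrative only.
struct type_traits_sketch {
    std::size_t blck_size; // elements per block (0 marks a deprecated type)
    std::size_t type_size; // bytes per block
};

inline constexpr type_traits_sketch k_traits_sketch[] = {
    { 1,  4},  // 0: 32-bit float
    { 1,  2},  // 1: 16-bit float
    {32, 18},  // 2: block-quantized
    {32, 20},  // 3: block-quantized
    { 0,  0},  // 4: deprecated, no block size
    { 0,  0},  // 5: deprecated, no block size
};

struct tensor_sketch {
    std::uint32_t type;
    std::int64_t  ne0;       // elements per row
    std::size_t   row_bytes; // bytes per row
};

// Validate untrusted input before any size arithmetic; nullptr means "reject".
inline std::unique_ptr<tensor_sketch> deserialize_tensor_sketch(std::uint32_t type, std::int64_t ne0) {
    constexpr std::size_t n_types = sizeof(k_traits_sketch) / sizeof(k_traits_sketch[0]);
    if (type >= n_types) {
        return nullptr; // unknown type id
    }
    const type_traits_sketch & tr = k_traits_sketch[type];
    if (tr.blck_size == 0) {
        return nullptr; // the guard: dividing by blck_size below would trap
    }
    if (ne0 < 0) {
        return nullptr; // negative dimensions are never valid
    }
    const std::size_t row_bytes = tr.type_size * static_cast<std::size_t>(ne0) / tr.blck_size;
    return std::make_unique<tensor_sketch>(tensor_sketch{type, ne0, row_bytes});
}
```

The check sits before the division, so a malformed message is rejected instead of reaching the row-size computation.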


) The MEAN/CLS/LAST pooling paths in encode() and decode() used n_embd_inp() (16384 for qwen3vl with deepstack) to read from the pooled embedding tensor, which only has n_embd_out() (4096) floats per sequence. This caused a tensor read out of bounds assertion. Fixes embedding mode for Qwen3-VL-Embedding models.
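The size mismatch described above can be made concrete with a small bounds check. This is only a sketch: `pooled_read_fits` and `read_pooled_row` are hypothetical helpers, not functions from llama.cpp, and the numbers in the usage note are the ones quoted in the commit message (4096 floats per sequence in the pooled tensor, 16384 as the input width).

```cpp
#include <cstddef>
#include <vector>

// True if reading `n_embd` floats for sequence `seq_idx` stays inside a
// buffer holding `n_floats` floats. Written with a division so the check
// itself cannot overflow.
inline bool pooled_read_fits(std::size_t n_floats, std::size_t seq_idx, std::size_t n_embd) {
    return n_embd != 0 && seq_idx < n_floats / n_embd;
}

// Copy one sequence's pooled embedding, or return an empty vector if the
// requested width does not match the buffer layout.
inline std::vector<float> read_pooled_row(const std::vector<float> & pooled,
                                          std::size_t seq_idx, std::size_t n_embd) {
    if (!pooled_read_fits(pooled.size(), seq_idx, n_embd)) {
        return {};
    }
    const auto first = pooled.begin() + static_cast<std::ptrdiff_t>(seq_idx * n_embd);
    return std::vector<float>(first, first + static_cast<std::ptrdiff_t>(n_embd));
}
```

With a pooled buffer for two sequences (2 × 4096 floats), a read of width 4096 fits for both sequences, while a read of width 16384 is rejected even for sequence 0.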

…and hangs (ggml-org#18604)
* grammar: add test case for nullable symbol loop
Reproduce stack overflow (or OOM) with ( [x]* )* found while adding GBNF support to ripgrep-edit.
llama-server reproducer:
    curl \
      -X POST \
      -d '{ "messages": [{ "role": "user", "content": "write yes" }], "grammar": "root ::= ( [x]* )*" }' \
      -H "Content-Type: application/json" \
      http://localhost:8811/v1/chat/completions
* grammar: prevent stack overflow with nullable symbol loop
Fix a potential stack overflow in llama_grammar_advance_stack that could occur when processing grammars with nullable symbols that lead to infinite derivations of empty strings. The fix introduces cycle detection by tracking visited stacks to prevent infinite recursion.
rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A20
rg-edit directive: """Rewrite: fix the following segfault: [..] ⚫ Testing segfault. Grammar: root ::= ( [x]* )* root ::= ( [x]* )* Segmentation fault build/bin/test-grammar-integration"""
gptel-context: (("~/llama.cpp/src/llama-grammar.cpp") ("~/llama.cpp/tests/test-grammar-integration.cpp") ("~/llama.cpp/grammars/./list.gbnf") ("~/llama.cpp/grammars/./json_arr.gbnf") ("~/llama.cpp/grammars/./json.gbnf") ("~/llama.cpp/grammars/./japanese.gbnf") ("~/llama.cpp/grammars/./english.gbnf") ("~/llama.cpp/grammars/./chess.gbnf") ("~/llama.cpp/grammars/./c.gbnf") ("~/llama.cpp/grammars/./arithmetic.gbnf") ("~/llama.cpp/grammars/./README.md"))
* grammar: convert recursive llama_grammar_advance_stack to iterative
This change converts the function to an iterative approach using explicit stacks, which prevents deep recursion and eliminates the risk of stack overflow.
rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A30
rg-edit directive: """Rewrite: fix the following segfault: [..] ⚫ Testing segfault. Grammar: root ::= ( [x]* )* root ::= ( [x]* )* Segmentation fault build/bin/test-grammar-integration convert from recursive to interactive"""
gptel-context: (("~/llama.cpp/src/llama-grammar.cpp") ("~/llama.cpp/tests/test-grammar-integration.cpp") ("~/llama.cpp/grammars/./list.gbnf") ("~/llama.cpp/grammars/./json_arr.gbnf") ("~/llama.cpp/grammars/./json.gbnf") ("~/llama.cpp/grammars/./japanese.gbnf") ("~/llama.cpp/grammars/./english.gbnf") ("~/llama.cpp/grammars/./chess.gbnf") ("~/llama.cpp/grammars/./c.gbnf") ("~/llama.cpp/grammars/./arithmetic.gbnf") ("~/llama.cpp/grammars/./README.md"))
v2: Added a `std::set` to perform tree-based lookups with O(N log N) complexity. Testing with a parallel run of `test-grammar-integration` shows a double-digit percentage increase in runtime. An `unordered_set` with O(1) hashing was also evaluated, but the overhead of constructing hash keys from pointers made it significantly slower than the rbtree implementation that only requires an ordering operator. The performance regression in the test suite appears justified by the overall reduction in algorithmic complexity.
Co-developed-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* grammar: add test case for hang in repetition grammar processing
This commit adds a new test case to the grammar integration tests that specifically targets a hang scenario in the repetition grammar parser found while adding GBNF support to ripgrep-edit.
llama-server reproducer:
    curl \
      -X POST \
      -d '{ "messages": [{ "role": "user", "content": "write yes" }], "grammar": "root ::= (([^x]*){0,99}){0,99}" }' \
      -H "Content-Type: application/json" \
      http://localhost:8811/v1/chat/completions
* grammar: add repetition threshold check
The change introduces a maximum repetition threshold to avoid excessive rule expansion during grammar parsing. When parsing repetition patterns like {m,n}, the parser now calculates the potential number of rules that would be generated and throws an error if the product of previous rules and new rules exceeds the threshold. A test case was added to verify the threshold is properly enforced for deeply nested repetition patterns that would otherwise cause hangs.
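The cycle-detection idea from the commits above (an explicit work list plus a `std::set` of already-visited stacks) can be sketched on a toy grammar model. Everything here is an assumption made for illustration: the `*_sketch` types, `advance_stack_sketch`, and the hand-lowered rules in `nullable_loop_grammar` are hypothetical and do not mirror the data structures in `llama-grammar.cpp`.

```cpp
#include <cstddef>
#include <set>
#include <tuple>
#include <utility>
#include <vector>

// Toy grammar model (hypothetical).
struct elem_sketch {
    bool is_rule; // true: reference to another rule, false: terminal
    int  value;   // rule index or character
};
using alt_sketch     = std::vector<elem_sketch>; // one alternative: a sequence
using rule_sketch    = std::vector<alt_sketch>;  // a rule: its alternatives
using grammar_sketch = std::vector<rule_sketch>;

// Position of one element inside the grammar; ordered so stacks fit in std::set.
struct pos_sketch {
    std::size_t rule, alt, idx;
    bool operator<(const pos_sketch & o) const {
        return std::tie(rule, alt, idx) < std::tie(o.rule, o.alt, o.idx);
    }
};
using stack_sketch = std::vector<pos_sketch>; // pending elements, top at back()

// Expand rule references on top of `start` until every resulting stack is
// empty or has a terminal on top. Iterative, with visited-stack tracking.
inline std::vector<stack_sketch> advance_stack_sketch(const grammar_sketch & g, stack_sketch start) {
    std::vector<stack_sketch> done;
    std::vector<stack_sketch> todo;
    std::set<stack_sketch>    seen;
    todo.push_back(std::move(start));
    while (!todo.empty()) {
        stack_sketch cur = std::move(todo.back());
        todo.pop_back();
        if (!seen.insert(cur).second) {
            continue; // already expanded: this is where a nullable loop is cut
        }
        if (cur.empty()) {
            done.push_back(cur);
            continue;
        }
        const pos_sketch   top = cur.back();
        const alt_sketch & seq = g[top.rule][top.alt];
        const elem_sketch  e   = seq[top.idx];
        if (!e.is_rule) {
            done.push_back(cur); // terminal on top: nothing left to expand
            continue;
        }
        cur.pop_back();
        if (top.idx + 1 < seq.size()) {
            cur.push_back({top.rule, top.alt, top.idx + 1}); // resume here afterwards
        }
        const std::size_t   r      = static_cast<std::size_t>(e.value);
        const rule_sketch & target = g[r];
        for (std::size_t a = 0; a < target.size(); ++a) {
            stack_sketch next = cur;
            if (!target[a].empty()) {
                next.push_back({r, a, 0});
            }
            todo.push_back(std::move(next));
        }
    }
    return done;
}

// root ::= ( [x]* )* lowered by hand into three rules for this sketch.
inline grammar_sketch nullable_loop_grammar() {
    return {
        { { {true, 1} } },                   // rule 0: root ::= r1
        { { {true, 2}, {true, 1} }, {} },    // rule 1: r1 ::= r2 r1 | (empty)
        { { {false, 'x'}, {true, 2} }, {} }, // rule 2: r2 ::= 'x' r2 | (empty)
    };
}
```

Starting from the stack that holds only the first element of rule 0, the expansion reaches the same stack again after r2 derives the empty string; the `seen` set stops the work list there, whereas a version without it would keep re-queuing that stack forever.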


…imension is small (ggml-org#20635)
* Increase per-thread work if the K-dimension is small
With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536. The current heuristic uses a group of 4 warps irrespective of the K-dimension size, which leaves some threads idle and results in poor performance for these matrices. This change increases the number of output elements per block in such cases.
* Limit this change to ncols_dst = 1
* tab to space

* ggml-cuda: native bf16 flash attention for vec and tile kernels mma kernel still converts bf16 to fp16 before launch, native mma bf16 todo * ggml-cuda: address code owner review feedback reverted tile kernel changes to avoid larger refactor * fix ci failures on turing and hip * fix bf16 vec kernel compile on hip v_dot2 platforms * add comments --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

taronaeo and others added 30 commits March 6, 2026 23:24

ggml: update comments for backends which have no memory to report (gg…

ba2ff79

…ml-org#20157) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-cuda: add mem check for fusion (ggml-org#19916)

d48e876

* ggml-cuda: add mem check for fusion * Replace NaNs with -FLT_MAX * fix typo Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

cpu: skip redundant ROPE cache updates (ggml-org#20149)

ba2fd11

server : preserve anthropic thinking blocks in conversion (ggml-org#2…

e68f2fb

…0120) * server : preserve anthropic thinking blocks in conversion (ggml-org#20090) * server : add tests for anthropic thinking block conversion --------- Co-authored-by: root <root@llamacpp.home>

Autoparser - complete refactoring of parser architecture (ggml-org#18675

566059a

) * Autoparser - full single commit squish * Final pre-merge changes: minor fixes, Kimi 2.5 model parser

Add @pwilkin to CODEOWNERS for autoparser code (ggml-org#20174)

7463687

quants : Add memsets and other fixes for IQ quants (ggml-org#19861)

649f064

* Add memsets and other fixes for IQ quants * Make memset unconditional, change Laux back to L * Move another memset

Autoparser: add optional argument reshuffle capability (ggml-org#20171)

2f2923f

* Allow reshuffled arguments in tagged argument parser format tool calls. * Remove shuffle just keep the optional parsers in any order * Remove unnecessary import

Autoparser: True streaming (ggml-org#20177)

c024d85

* Relax atomicity constraint for nicer, more pleasant, True Streaming parsing * Whitespace * Remove redundant atomics

opencl: add l2_norm (ggml-org#20160)

6fce5c6

ggml: add GATED_DELTA_NET op (ggml-org#19504)

c5a7788

* ggml: add GATED_DELTA_NET op * remove the transpose * add KDA * add qwen35 dense * llama : check for fused gated delta net backend support --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

[SYCL] support Flash Attention for fp32/fp16/Q4/Q5/Q8 (ggml-org#20190)

213c4a0

* support flash-attention for fp32/fp16/Q4/Q5/Q8 * rm warning * update for JIT

server : correct index on finish in OAI completion streams (ggml-org#…

ff52ee9

…20226)

Revert to OAI-compatible args (ggml-org#20213)

b283f6d

* Revert to OAI-compatible args * Apply workaround::func_args_not_string

readme : update infra list (ggml-org#20212)

a950479

llama: end-to-end tests (ggml-org#19802)

a976ff0

* tests: add end-to-end tests per model architecture * fixup for rebase * fix use-after-free in llama-model-loader.cpp * fix CI * fix WebGPU * fix CI * disable CI for macOS-latest-cmake-arm64 * use expert_weights_scale only if != 0.0f * comments

ggml-vulkan: Add ELU op support (ggml-org#20183)

d088d5b

* ggml-Vulkan: add ELU support * ggml-Vulkan: remove extra spaces and variables * ggml-Vulkan: fix format issue * ggml-Vulkan: fix format issue * fix whitespace issue * Update Vulkan.csv and ops.md

Fix structured outputs (ggml-org#20223)

62b8143

* Fix structured outputs * Update common/chat-auto-parser-generator.cpp Co-authored-by: Aldehir Rojas <hello@alde.dev> --------- Co-authored-by: Aldehir Rojas <hello@alde.dev>

Fix compile bug (ggml-org#20203)

9b24886

* Fix compile bug * Update common/chat-auto-parser-helpers.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

graph : remove redundant scale_w parameter (ggml-org#20235)

35bee03

server : do not create checkpoints right after mtmd chunks (ggml-org#…

d417bc4

…20232)

PEG parser for LFM2 (ggml-org#20251)

97c64fb

* PEG parser for LFM2 * Simplify using python_value()

llama-bench: introduce -hf and -hff flags & use --mmap 1 by def…

ae87863

…ault (ggml-org#20211)

cuda : display total and free VRAM capacity during device initializat…

5f4cdac

…ion (ggml-org#20185)

vulkan: skip zero size tensors in backend copies (ggml-org#20233)

b2f460b

ggml-vulkan: add SGN operator, auto-generate Vulkan.csv and ops.md (g…

0beb8db

…gml-org#20219)

contributing: limit open PRs for new contributors to 1 (ggml-org#20036)

e2763a6

michaelw9999 and others added 9 commits March 21, 2026 13:35

docs : explicit about banning accounts that violates policy (ggml-org…

568aec8

…#19593)

misc : prefer ggml-org models in docs and examples (ggml-org#20827)

3306dba

* misc : prefer ggml-org models in docs and examples Prefer referring to known-good quantizations under ggml-org rather than 3rd-party uploaders. * remove accidentally committed file

Merge branch 'layla-build' into merge

7770e70

l3utterfly merged commit e242a2d into layla-build Mar 22, 2026
22 of 77 checks passed

l3utterfly deleted the merge branch March 22, 2026 12:09

github-actions bot added the documentation, SYCL, Nvidia GPU, Vulkan, testing, examples, devops, python, server, ggml, Apple Metal, script, Ascend NPU, OpenCL, model, jinja parser, Hexagon, WebGPU, and OpenVINO labels Mar 22, 2026

l3utterfly merged 256 commits into layla-build from merge


20 participants
