It should include the port when it's not the default.
* refactor token advancement
* exercise sub-expressions
…#20849)
* server: allow router to report child instances' sleep status
* refactor
* move sleeping to state
* nits
…del (ggml-org#20847)
* added support for internvl's dynamic high-resolution (Qianfan-OCR needed)
* add min/max dynamic patch to gguf meta
* clean up
* simplified handling of min/max dynamic patch
* reuse llava_uhd logic for slice images
* provide default values for older models
* flake8
* prevent writing 0 value to gguf
* remove duplicated resolution candidates with a better algorithm
* fix indentation
* format
* add protection from divide by zero
* change to 0 to be safe
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
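Two of the steps above (removing duplicated resolution candidates and guarding the aspect-ratio math against divide-by-zero) can be illustrated with a minimal sketch. This is not the actual conversion code; the function names are hypothetical and chosen only to mirror the commit descriptions:

```python
def dedup_candidates(candidates):
    """Drop duplicate (width, height) resolution candidates,
    keeping the first occurrence so candidate ordering is preserved."""
    seen = set()
    unique = []
    for wh in candidates:
        if wh not in seen:
            seen.add(wh)
            unique.append(wh)
    return unique

def safe_aspect_ratio(w, h):
    """Aspect ratio with divide-by-zero protection, mirroring the
    'add protection from divide by zero' change."""
    return w / h if h > 0 else 0.0
```

Order-preserving dedup matters here because candidate grids are typically ranked before selection.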
…#20857)
* fix(openvino): explicit memset in buffer_context allocation
* minor
---------
Co-authored-by: Dan Hoffman <dhoffman@cyket.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ACL graph capture disallows host-to-device memcpy and device memory malloc/free on the captured stream. Pre-load the RoPE cache before capture so that:
- Host-to-device copies and allocations run on the non-captured stream
- Cache metadata is populated and the memory pool is warmed up
- During capture, only on-device computations are recorded; host-side and allocation branches are skipped
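The warm-up-before-capture pattern described above can be modeled with a small sketch. This is a toy simulation, not the ACL/CANN API; all names here are illustrative:

```python
class CaptureStream:
    """Toy model of a stream where host-to-device copies are forbidden
    while graph capture is active (as ACL graph capture requires)."""
    def __init__(self):
        self.capturing = False
        self.cache_loaded = False
        self.recorded = []

    def h2d_copy(self, what):
        # The real backend rejects this on a captured stream.
        if self.capturing:
            raise RuntimeError("host-to-device copy during capture: " + what)
        self.cache_loaded = True

    def device_op(self, name):
        if self.capturing:
            self.recorded.append(name)

def rope_with_preload(stream):
    # Warm-up phase: populate the RoPE cache on the non-captured stream.
    stream.h2d_copy("rope sin/cos cache")
    stream.capturing = True
    # Inside capture, the host/allocation branch is skipped because the
    # cache is already populated; only the device op is recorded.
    if not stream.cache_loaded:
        stream.h2d_copy("rope sin/cos cache")  # would raise during capture
    stream.device_op("rope")
    stream.capturing = False
    return stream.recorded
```

Without the warm-up call, the cache-population branch would execute inside capture and fail; the pre-load makes the captured graph purely device-side.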
…nges system prompt (ggml-org#20859)
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal: add conv_3d backend
Rebased with master and resolved conflicts.
* Resolved issues related to changes in variable names
* kernel void kernel_upscale_bilinear_f32 was missing in my branch; added back, should pass all tests now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…#20823)
* webui: fix --webui-config-file settings not applied on load
* chore: update webui build output
* server: use httplib dynamic threads
* change to n_threads_http + 1024
Tested to verify: the typo is just in the docs, not the actual flag.
* contrib: add "Requirements" section to PR template
* typo [no ci]
* use h2, add "Additional information"
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* opencl: add q6_K noshuffle kernels, initial q6_K gemv, some host code
* opencl: add q6_K transpose
* opencl: fix cvt kernel name
* opencl: add call to q6_K gemv
* opencl: fix q6_K scale transpose
* opencl: fix loading for gemv q6_K, refactor
* opencl: fix transpose_8_buf kernel assignment, refactor
* opencl: refactor q6_K transpose
* opencl: add gemm_noshuffle_q6_k_f32
* opencl: fix qh loading
* opencl: refactor q6_K gemv host side, release bufs and imgs
* opencl: refactor
* opencl: fix q6_K dequant and scale selection
* opencl: workaround compiler bug, fix dump_tensor
* opencl: refactor q6_K convert kernels
* opencl: unpack transformed q6_K in get_tensor
* opencl: refactor, handle non-uniform workgroups
* opencl: support non-vector subgroup bcast
…0915)
* Add codeowners for scripts/snapdragon
* Also add docs/backends/snapdragon
…20918)
* hex-dma: make chained dma the default to handle newer models
This also includes some new instrumentation that we can remove later.
* hexagon: add uint32 dump helper
* hexagon: use single-page VTCM allocation to avoid issues with large gather ops in ssm-conv
ssm-conv uses the HVX gather instruction, which cannot handle cases where base+offset spans page boundaries.
* hexagon: update ssm-conv to make base-addr compute a bit easier to read
* hex-dma: use 1d mode for reshaping; it supports sizes up to 24 bits (>16MB)
* hex-bin: fix incorrect stride logic
* hexagon: make sure repack buffs are dumped for verbose > 2
* hex-bin: consistently use dma_queue_push even for dummy dst transactions
* hex-dma: start using 2d-wide mode on v75 and up
This removes the need to deal with the 16-bit limitation on the strides.
* hex-bin: cleanup kernel selection logic
* hex-bin: cleanup binary op core and fix transposed tensor handling
* snapdragon: update run-bench to use larger ubatch and fa-on
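The page-boundary constraint that motivates the single-page VTCM allocation can be sketched as a simple check. The page size here is illustrative, not a Hexagon-specific constant:

```python
PAGE_SIZE = 4096  # illustrative; the real VTCM page size is target-specific

def spans_page_boundary(base, size, page_size=PAGE_SIZE):
    """Return True when the byte range [base, base + size) crosses a page
    boundary, i.e. the case the HVX gather instruction cannot handle."""
    if size <= 0:
        return False
    return (base // page_size) != ((base + size - 1) // page_size)
```

Allocating the gather buffer inside a single page guarantees this predicate is always False for any base+offset the gather computes within the buffer.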
…on and fix gpt-oss (ggml-org#20912)
* llama-fit: fix regex pattern for gate_up tensors
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* common : add standard Hugging Face cache support
- Use HF API to find all files
- Migrate all manifests to the Hugging Face cache at startup
* Check with the quant tag
* Cleanup
* Improve error handling and report API errors
* Restore common_cached_model_info and align mmproj filtering
* Prefer main when getting cached ref
* Use cached files when HF API fails
* Use final_path..
* Check all inputs
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* naive vectorized version
* add vectorized flash attention
* update vec version
* remove unused path and shader
* remove unused helper functions
* add comments
* remove pad path
* ggml-webgpu: fix flash-attn vec nwg=1 path and tighten vec specialization
* change back to vec4
* enable multi split
* enable vec path when:
  - Q->ne[1] < 20
  - Q->ne[0] % 32 == 0
  - V->ne[0] % 4 == 0
  - K->type == f16
* update flash_attn_vec_split.wgsl to reduce redundant workgroup barrier usage and use select
* enable vec path for q4 and q8
* flash-attn vec nwg=1 fast path (skip tmp/reduce staging)
* use packed f16 K loads in flash-attn vec split
* use packed f16 K loads in flash-attn vec split on host side
* tune flash-attn vec f16 VEC_NE by head dim
* cleanup
* cleanup
* keep host side clean
* cleanup host side
* change back to original host wait/submit behavior
* formatting
* reverted param-buffer pool refactor
* add helper functions
* ggml-webgpu: move flash-attn vec pipeline caching back into shader lib
* ggml-webgpu: remove duplicate functions
* ggml-webgpu: reserve flash-attn vec scratch in dst buffer allocation
* ggml-webgpu: revert unrelated change
* ggml-webgpu: revert deleted comment
* disable uniformity check
* remove unnecessary change
* Update ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
* Update ggml/src/ggml-webgpu/ggml-webgpu.cpp
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
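The vec-path enable conditions listed in the log can be restated as a predicate. This is a conceptual sketch, not the actual C++ dispatch code; the function name and tuple arguments are hypothetical:

```python
def use_flash_attn_vec(q_ne, v_ne0, k_type):
    """Mirror of the enable conditions above:
    Q->ne[1] < 20, Q->ne[0] % 32 == 0, V->ne[0] % 4 == 0, K type f16.
    (The log notes the path was later extended to quantized K as well.)"""
    return (q_ne[1] < 20          # small batch of query rows (decode-like)
            and q_ne[0] % 32 == 0  # head dim divisible by 32
            and v_ne0 % 4 == 0     # V head dim divisible by 4 (vec4 loads)
            and k_type == "f16")
```

The Q->ne[1] < 20 bound keeps the vec path on small-batch decode shapes, where a per-row vectorized kernel beats the tiled path.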
)
* Add unit test coverage for llama_tensor_get_type
* Fix merge conflicts, add more schemas
* clang formatter changes
* Trailing whitespace
* Update name
* Start rebase
* Updating files with upstream changes prior to rebase
* Changes needed from rebase
* Update attn_qkv schema, change throw behaviour
* Fix merge conflicts
* White space
* Update with latest changes to state counters
* Revert accidental personal CLAUDE.md changes
* Change quotation mark
* Reuse metadata.name since we have it
* Move test-only stuff out of llama-quant.cpp
* Hide the regex functionality back in llama-quant.cpp; use a unique pointer to a new struct 'compiled_tensor_type_patterns' which contains the patterns
* cont : initial deslop guidelines
* Cleanup based on review comments
* Continue cleanup
* Small cleanup
* Manually set proper ordering of tensors, mostly applies to gemma
* Formatting
* Update tests/test-quant-type-selection.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix merge conflicts
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Bump ROCm version on Linux from 7.2 to 7.2.1
Add gfx1102 target
Delete LLVM workaround, since ROCm 7.2.1 has the fix for the ROCm 7.2 perf regression ROCm/rocm-systems#2865
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ci : add AMD CPU label to PR labeler
Add automatic labeling for PRs that modify AMD CPU (ZenDNN) backend files
* ci : rename label AMD CPU to AMD ZenDNN in labeler config
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>