
@SamuelOliveirads SamuelOliveirads commented Dec 10, 2025

Note: Please be aware that this PR shows a high commit count because it was rebased against the latest master after a long period of inactivity.

Recursive MTP Drafting & Graph Optimization for GLM-4.5

This is a follow-up PR to F1LM1/llama.cpp#5, implementing major architectural improvements for Multi-Token Prediction (MTP).

Summary of Changes

  1. Multi-Token Support: Enabled recursive drafting loop (Eagle-style) to predict N tokens per step.
  2. Graph Optimization: Activated graph reuse and fixed callback handling for MTP operations, significantly reducing CPU/scheduler overhead.
  3. Rebase: Updated the branch to the latest master (including server-loop optimizations from server: improve speed of speculative decoding ggml-org/llama.cpp#17808).

Detailed Changes

1. Recursive MTP Drafting

Previously, MTP was limited to a single draft token per loop, capping the theoretical maximum performance. This PR implements a recursive generation loop based on the architecture found in sglang#8224.

Architecture:
The MTP draft is not a parallel tensor operation but a recursive loop: each draft input requires the embedding of the current token plus the fixed hidden state from the main model's last forward pass.

  1. Main Model: Processes Prompt ($t_0...t_n$) $\rightarrow$ Outputs Hidden State ($H_{main}$) and Token ($t_{n+1}$).
  2. MTP Loop (Draft 1): Input ($Emb(t_{n+1})$) + Context ($H_{main}$) $\rightarrow$ MTP Layer $\rightarrow$ Token ($t_{n+2}$).
  3. MTP Loop (Draft 2): Input ($Emb(t_{n+2})$) + Context ($H_{main}$) $\rightarrow$ MTP Layer $\rightarrow$ Token ($t_{n+3}$).
  4. Verification: Validates the sequence against the main model.

Controls:
I reused common_speculative_params:

  • --draft-max N: Sets the maximum number of recursive drafts.
  • --draft-p-min P: Confidence threshold. The loop exits early if the MTP model is not confident, saving compute resources.
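To make the control flow concrete, here is a minimal, self-contained C++ sketch of the recursive draft loop with the `--draft-max` / `--draft-p-min` controls. The `mtp_forward` stub and its confidence schedule are purely illustrative (the real loop also feeds the frozen hidden state $H_{main}$ into each MTP pass); this is a sketch of the idea, not the PR's actual implementation.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Illustrative stand-in for one MTP forward pass: takes the previously
// accepted/drafted token and the recursion depth, returns the next drafted
// token id with its confidence. The toy schedule decays 0.9, 0.7, 0.5, ...
// In the real code this would run the MTP layer on Emb(prev) + H_main.
std::pair<int, float> mtp_forward(int prev_token, int step) {
    float conf = 0.9f - 0.2f * (float) step;
    return { prev_token + 1, conf };
}

// Recursive drafting (Eagle-style): keep feeding the MTP head its own output
// until draft_max tokens are drafted or confidence drops below draft_p_min.
std::vector<int> draft_tokens(int last_token, int draft_max, float draft_p_min) {
    std::vector<int> drafts;
    int cur = last_token;
    for (int i = 0; i < draft_max; ++i) {
        auto [tok, p] = mtp_forward(cur, i);
        if (p < draft_p_min) {
            break; // MTP head is not confident: stop early, save compute
        }
        drafts.push_back(tok);
        cur = tok; // the next draft conditions on this draft's embedding
    }
    return drafts;
}
```

With the toy confidence schedule, a higher `draft_p_min` cuts the draft chain shorter, which mirrors why `--draft-p-min 0.85` avoided the low-quality third draft in testing.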

Recommended Test Command:

-mtp --draft-max 3 --draft-p-min 0.85

(Note: I found better results by limiting draft-max and increasing p-min to avoid low-quality speculation).

2. Rebase & Modernization

I updated the codebase to include the latest commits from master (approx. 5 months of updates). This includes the recent server-loop refactor (ggml-org#17808). While the server refactor itself didn't yield massive gains in my specific MTP tests, the overall codebase improvements combined with my MTP fixes have boosted baseline performance.

3. Graph Reuse Optimization

A major bottleneck was the lack of graph reuse during MTP switching.

  • Problem: Switching between Main Model computation (~10ms) and MTP computation (~9.6ms) forced a full graph rebuild and allocation every step.
  • Fix: Added proper cb() callbacks to the MTP tensors in build_mtp_tail. This allows the scheduler to correctly offload these tensors and reuse the compute graph structure.
  • Result: Subsequent drafts in the loop are significantly faster (e.g., ~2.4ms vs 9.6ms) because the graph overhead is amortized.
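The amortization effect can be illustrated with a toy graph cache. In llama.cpp the scheduler performs this kind of match internally, and naming tensors via `cb()` is what lets it recognize the MTP tensors across steps; everything below (types, names) is illustrative, not the actual scheduler code.

```cpp
#include <cassert>
#include <map>
#include <string>

struct graph {
    int nodes = 0; // placeholder for the compute graph structure
};

// Toy cache keyed by graph "topology": a miss pays the full rebuild cost
// (the ~10ms path in the PR's measurements), a hit only updates inputs
// (the ~2.4ms path).
struct graph_cache {
    std::map<std::string, graph> cache;
    int rebuilds = 0;

    graph & get(const std::string & key) {
        auto it = cache.find(key);
        if (it == cache.end()) {
            ++rebuilds; // cache miss: full rebuild + allocation
            it = cache.emplace(key, graph{}).first;
        }
        return it->second; // cache hit: reuse the existing structure
    }
};
```

The point of the fix is that before it, every main-model/MTP switch behaved like a cache miss; after it, only the first MTP step per shape pays the rebuild cost.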

Performance

Tested on a Windows workstation (Threadripper 5965WX + 2x RTX 3090).
Settings: Small prompt (8 tokens), average of 5 interactions.

Old Branch:

  • Baseline (No MTP): ~12.19 t/s
  • MTP (1 draft): ~10.52 t/s (Regression)
  • MTP (2 drafts): ~10.63 t/s

Updated Branch (This PR):

  • Baseline (No MTP): ~12.89 t/s
  • MTP (1 draft): ~13.14 t/s (Gain)
  • MTP (2 drafts): ~13.54 t/s (Gain)
  • MTP (3 drafts): ~12.56 t/s (Diminishing returns due to rejection)

Update (12/13):

After further testing prompted by recent reports, I re-evaluated the performance. I was unable to replicate the reported regression, but my results differ from the initial run.

It is worth noting that this branch (even without MTP enabled) performs similarly to master in my tests.

GLM-4.5 Air:

  • Master: ~17.60 t/s
  • Branch (No MTP): ~18.10 t/s
  • MTP (1 Draft): ~15.64 t/s
  • MTP (2 Drafts): ~15.63 t/s
  • MTP (3 Drafts): ~15.97 t/s

GLM-4.6:

  • Master: ~5.18 t/s
  • Branch (No MTP): ~5.47 t/s
  • MTP (1 Draft): ~5.91 t/s
  • MTP (2 Drafts): ~5.78 t/s
  • MTP (3 Drafts): ~5.41 t/s

Note: I updated the launch command to explicitly offload the MTP layer (layer 46 for Air, 92 for 4.6) to the GPU, which provided a small performance boost.

Commands Used:

.\build\bin\release\llama-server.exe ^
    --model "F:\llm_models\glm-4.5-air_Q4_general\GLM-4.5-Air-IQ4_XS-00001-of-00002.gguf" ^
    --alias GLM-4.5-Air ^
    --ctx-size 36864 ^
    -ctk q8_0 -ctv q8_0 ^
    -fa 1 --verbose ^
    --n-gpu-layers 99 ^
    -b 2048 -ub 1500 ^
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18)\.ffn_.*=CUDA0" ^
    -ot "blk\.(20|21|22|23|24|25|26|27|28|29|30|31|32|33|46)\.ffn_.*=CUDA1" ^
    --override-tensor exps=CPU ^
    --no-mmap  ^
    --numa distribute ^
    -mtp --draft-max 3 --draft-p-min 0.85 ^
    --threads 24 --threads-batch 36 ^
    --host 127.0.0.1 --port 8080
.\build\bin\release\llama-server.exe ^
    --model "F:\llm_models\glm-4.6\GLM-4.6-UD-IQ1_S-00001-of-00002.gguf" ^
    --alias GLM-4.6 ^
    --ctx-size 36864 ^
    -ctk q8_0 -ctv q8_0 ^
    -fa 1 --verbose ^
    --n-gpu-layers 99 ^
    -b 2048 -ub 1500 ^
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19)\.ffn_.*=CUDA0" ^
    -ot "blk\.(20|21|22|23|24|25|26|27|28|29|30|31|32|33|92)\.ffn_.*=CUDA1" ^
    --override-tensor exps=CPU ^
    --no-mmap  ^
    --numa distribute ^
    -mtp --draft-max 3 --draft-p-min 0.85 ^
    --threads 24 --threads-batch 36 ^
    --host 127.0.0.1 --port 8080

Conclusion

This update finally achieves net positive performance over the baseline. I am personally eager to try this on GLM-4.6 with better quantization. I invite others to test this configuration and report findings, as MTP logic is complex and may behave differently across hardware setups.

Also tagging @InfernalDread, as you previously mentioned interest in trying MTP once the performance improved. This update finally yields positive gains over the baseline, so I would love to hear your feedback if you have time to test it.

Transparency Note: I used an LLM to help organize and polish the description of this PR to ensure the architectural changes were explained clearly.

allozaur and others added 30 commits November 19, 2025 14:39
* feat: Add "Continue" action for assistant messages

* feat: Continuation logic & prompt improvements

* chore: update webui build output

* feat: Improve logic for continuing the assistant message

* chore: update webui build output

* chore: Linting

* chore: update webui build output

* fix: Remove synthetic prompt logic, use the prefill feature by sending the conversation payload ending with assistant message

* chore: update webui build output

* feat: Enable "Continue" button based on config & non-reasoning model type

* chore: update webui build output

* chore: Update packages with `npm audit fix`

* fix: Remove redundant error

* chore: update webui build output

* chore: Update `.gitignore`

* fix: Add missing change

* feat: Add auto-resizing for Edit Assistant/User Message textareas

* chore: update webui build output
* vulkan: support larger argsort

This is an extension of the original bitonic sorting shader that puts the
temporary values in global memory and when more than 1024 threads are needed
it runs multiple workgroups and synchronizes through a pipelinebarrier.

To improve the memory access pattern, a copy of the float value is kept with
the index value. I've applied this same change to the original shared memory
version of the shader, which is still used when ncols <= 1024.

* Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost

* reduce loop overhead

* run multiple cols per invocation, to reduce barrier overhead
…OOR, TRUNC (ggml-org#17319)

* vulkan: initialize array

* vulkan: implement ADD1

* vulkan: implement ARANGE

* vulkan: implement FILL

* vulkan: implement SOFTPLUS

* vulkan: implement STEP

* vulkan: implement ROUND

* vulkan: implement CEIL

* vulkan: implement FLOOR

* vulkan: implement TRUNC

* docs: update Vulkan ops

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
…gml-org#17314)

* ggml-cpu:add RISC-V RVV (Zvfh) optimization for FP16 vector scaling

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>

* fix comment

* fix comment 2

---------

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* DGX Spark: UMA support

* Updates from PR feedback

* More PR feedback cleanup

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Remove trailing whitespace

* Update ggml/src/ggml-cuda/ggml-cuda.cu

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Did someone transpose the SOLVE_TRI result matrix? Perhaps...

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
…ora_to_gguf (ggml-org#17385)

* fix: TypeError when loading base model remotely in convert_lora_to_gguf

* refactor: simplify base model loading using cache_dir from HuggingFace

* Update convert_lora_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* feat: add remote_hf_model_id to trigger lazy mode in LoRA converter

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common : more accurate sampling timing

* eval-callback : minor fixes

* cont : add time_meas impl

* cont : fix log msg [no ci]

* cont : fix multiple definitions of time_meas

* llama-cli : exclude chat template init from time measurement

* cont : print percentage of unaccounted time

* cont : do not reset timings
* Fix DoS / integer overflow

* Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :)

* White space

* Actually, since it's unsigned, use UINT64_MAX
* refactor: Component iles naming & structure

* chore: update webui build output

* refactor: Dialog titles + components namig

* chore: update webui build output

* refactor: Imports

* chore: update webui build output
* grammar: fix regression caused by ggml-org#17381

* more readable
* refactor: use hvx_vec_exp_fp32_guard_inf for overflow handling in hvx_exp_f32

* feat: add fast sigmoid function with overflow guard for fp32

* refactor: replace hvx_vec_inverse_fp32 with hvx_vec_inverse_fp32_guard_inf for improved overflow handling

* feat: enhance hvx_add_scalar_f32 with overflow handling using infinity guard

* wip

* add HVX_Vector_Alias

wip

* wip

* fix: improve handling of src1 tensor in glu_swiglu_fp32_per_thread function

* fix nc

* wip

* wip

* handle nan at inverse

* wip

* fix neg

* wip

* rename

* fix hvx_vec_inverse_fp32_guard_inf to handle infinity and NaN cases correctly

* wip

* fix hvx_vec_inverse_fp32_guard_inf to handle NaN cases correctly

* wip

* wip

* wip

* fix output sign
* CANN: Refactor `evaluate_and_capture_cann_graph`

**Description of the problem**

* `matched_graph` is obtained even if graph mode is disabled.
* End of graph capture and graph replay are unnecessarily placed in different `if` blocks.

**Proposed solution**

* Obtain `matched_graph` only if graph mode is enabled.
* Place end of graph capture and graph reply inside the same `if` block.
* Unify graph related comments.

* Remove trailing whitespace
* vulkan: disable async for older Intel devices

* update detection logic

* use name string for detection
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* cmake: add option to build and link BoringSSL

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* cmake : fix typo

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* cmake : disable boringssl test and asm by default

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* cmake : skip bssl

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* cmake : disable fips

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* cmake : fix cmake --install

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ci : use boringssl for windows and mac

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Detect GigaChat3-10-A1.8B as deepseek lite

Hardcodes checking number of layers to detect if lite version of deepseek.

* Add commnent identifying deepseek lite variants

deepseek lite variants include DeepSeek-V2-Lite, GigaChat3-10B-A1.8B
* mmf for rdna4

* align the padding for rdna4

* forbit mul_mat_f for rdna4

* fix as comment

* remove device kernels

* add constexpr for early return

* update based on review comment

* change based on the review comment

* pass compile error

* keep code consistency

---------

Co-authored-by: zhang hui <you@example.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…gml-org#17439)

26.04 provides these

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
* support non-contiguous i32 to i32 copy

* add tests

* rename cpy_flt to cpy_scalar and reindent params
ggerganov and others added 16 commits December 9, 2025 15:25
…ath (ggml-org#17869)

* fix: Provide macos-specific backtrace printing to avoid terminal death

Branch: MacOSSafeBacktrace

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add GGML_BACKTRACE_LLDB env var to enable using lldb for backtrace

Branch: MacOSSafeBacktrace

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Add DIAG for CUDA

* Refactor parameters
* feat: Add a batched version of ssm_conv

This was done using Claude Code. It found a number of optimizations around
how the threads were organized, resulting in a huge performance boost!

Branch: Mamba2SSD

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Optimized SSM_SCAN kernel for metal

This used Claude Code and resulted in a modest performance improvement
while maintaining correctness.

Branch: Mamba2SSD

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: Add test-backend-ops perf tests for SSM_CONV

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: Real representitive tests for SSM_CONV

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use function constant for ssm_conv batch size

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: backend op tests for ssm_scan from granite4 1b-h

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: remove commented out templates

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: float4 version of ssm_conv_batched

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add missing ggml_metal_cv_free

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update cuda ops

* update CPU as well
* convert: allow using quantized Mistral weight

* data_torch.ndim

* update dequant fn

Co-authored-by: compilade <compilade@users.noreply.github.com>

---------

Co-authored-by: compilade <compilade@users.noreply.github.com>
* model : Qwen3-Next-80B-A3B has 48 layers

* model : Add 80B-A3B type name
* wip

* wip

* fix logging, add display info

* handle commands

* add args

* wip

* move old cli to llama-completion

* rm deprecation notice

* move server to a shared library

* move ci to llama-completion

* add loading animation

* add --show-timings arg

* add /read command, improve LOG_ERR

* add args for speculative decoding, enable show timings by default

* add arg --image and --audio

* fix windows build

* support reasoning_content

* fix llama2c workflow

* color default is auto

* fix merge conflicts

* properly fix color problem

Co-authored-by: bandoti <bandoti@users.noreply.github.com>

* better loading spinner

* make sure to clean color on force-exit

* also clear input files on "/clear"

* simplify common_log_flush

* add warning in mtmd-cli

* implement console writter

* fix data race

* add attribute

* fix llama-completion and mtmd-cli

* add some notes about console::log

* fix compilation

---------

Co-authored-by: bandoti <bandoti@users.noreply.github.com>
commit 912ed2cd9339d1b2875d98744ca5b51fa62e581e
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Dec 7 23:00:29 2025 -0300

    speculative (feat): implement recursive MTP drafting for GLM-4.5

commit bdf72d9
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Dec 6 16:10:16 2025 -0300

    sampling (feat): optimize speculative drafting with fast-path selection

commit a91980a
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Dec 6 15:18:19 2025 -0300

    mtp (chore): clean old code

commit 6de0ecf
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Dec 6 14:40:13 2025 -0300

    mtp (feat): add mtp arg

commit ea77394
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Dec 6 13:47:54 2025 -0300

    mtp-graph (fix): move llama_get_logits_ith outside the loop

commit 15dff20
Merge: 171346c cae85fe
Author: samuel <samueloliveira32df@gmail.com>
Date:   Thu Oct 16 13:44:41 2025 -0300

    Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/llama.cpp into glm4-mtp-graph-cache

commit cae85fe
Author: samuel <samueloliveira32df@gmail.com>
Date:   Thu Oct 16 13:42:31 2025 -0300

    mtp-batch(fix): avoid logits for mtp kv cache operations

commit 171346c
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Oct 12 16:33:01 2025 -0300

    mtp-graph(feat): Reactivate graph reuse only for main model path

commit 0127c6b
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Oct 11 22:20:54 2025 -0300

    mtp-batch(chore): Remove final MTP debug logs and dead code

commit 4bcc9e2
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Oct 11 18:51:22 2025 -0300

    mtp-batch(fix): Correctly advance cache head and add MTP documentation

commit b4cbe03
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Oct 11 18:37:40 2025 -0300

    mtp-batch(chore): Fix logit flags for speculative sampling and remove debug logs

commit a99709d
Author: samuel <samueloliveira32df@gmail.com>
Date:   Fri Oct 10 17:24:34 2025 -0300

    mtp-batch(refactor): Extract decode context and MTP input logic into helper methods

commit 913af8f
Author: samuel <samueloliveira32df@gmail.com>
Date:   Fri Oct 10 16:44:28 2025 -0300

    mtp-batch(refactor): Replace MTP boolean flags with an explicit operation enum

commit 6f74ba3
Author: samuel <samueloliveira32df@gmail.com>
Date:   Thu Oct 9 22:27:18 2025 -0300

    mtp-batch (fix): prevent mtp draft from polluting the cache

commit 5e1d719
Author: samuel <samueloliveira32df@gmail.com>
Date:   Thu Oct 9 15:21:23 2025 -0300

    mtp-batch (feat): Create and manage sinfo for MTP

commit febd823
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Oct 5 14:43:40 2025 -0300

    mtp-batch (wip): fix how to warmup kv cache for MTP

commit 67c6c06
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Sep 27 19:42:32 2025 -0300

    mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption

commit 75dc25e
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Sep 27 17:17:00 2025 -0300

    mtp-batch (wip): organize batch for mtp cache

commit 3da7e7f
Author: samuel <samueloliveira32df@gmail.com>
Date:   Tue Sep 23 22:45:11 2025 -0300

    mtp-batch (fix): warm mtp cache for small batch size

commit df64508
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Sep 21 21:55:41 2025 -0300

    mtp-batch (wip): merge glm graphs

commit 042eb8a
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Sep 21 21:29:00 2025 -0300

    mtp-batch (wip): merge mtp and model graph

commit 1318b2d
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Sep 14 10:22:59 2025 -0300

    mtp-batch (wip): move mtp execution to batch format

commit c6237c7
Merge: 9fab53e 8742ce0
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Sat Sep 13 02:57:01 2025 -0400

    Merge pull request F1LM1#1 from SamuelOliveirads/glm4-moe-mtp

    feat: implemented sampling for MTP

commit 8742ce0
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Sep 6 00:21:18 2025 -0300

    feat: apply logits + greedy sampler

commit 5a5bce8
Author: samuel <samueloliveira32df@gmail.com>
Date:   Wed Sep 3 17:56:14 2025 -0300

    fix: add sample acceptance

commit 07670a2
Author: samuel <samueloliveira32df@gmail.com>
Date:   Wed Sep 3 13:25:21 2025 -0300

    feat: implemented sampling for MTP

commit 9fab53e
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Sep 2 17:14:09 2025 -0400

    fixed mtp kv cache update step in cases where prompt size > n_batch and n_ubatch

commit 98bc0c6
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Aug 26 01:26:51 2025 -0400

    replace standard sampler with greedy sampler for mtp draft

commit 471e026
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Aug 19 23:10:56 2025 -0400

    fixed vram leak

commit d72f9d5
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Aug 19 01:50:34 2025 -0400

    kludge-y kv cache management of mtp layer

commit 382135a
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Sun Aug 17 21:54:45 2025 -0400

    fixed mtp kv cache update sequencing after prompt processing

commit 6870f97
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Sun Aug 17 04:59:36 2025 -0400

    added proper KV cache management for MTP layers and slightly refactored

commit 6e9bafc
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Fri Aug 15 23:13:56 2025 -0400

    failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable

commit cf0f7c0
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Wed Aug 13 02:21:17 2025 -0400

    broad thrust of the mtp implementation

commit 03231da
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Aug 12 01:03:59 2025 -0400

    add model member function to build mtp graph, to be called from speculative.cpp

commit 1f477b3
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Mon Aug 11 20:54:45 2025 -0400

    make nextn weights loadable without a crash

commit e434f87
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Mon Aug 11 01:21:47 2025 -0400

    some work towards building mtp layer graph

commit db60623
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Sun Aug 10 23:52:54 2025 -0400

    added getter for nextn layer count and server slot has_mtp property
@Stealt91

Awesome! Thank you for the effort to make this actually worth using! Can't wait for further improvements!

@InfernalDread

Hello again! Thank you for the mention! I will definitely be trying out these new improvements, can't wait to see what's to come!

@timkhronos

timkhronos commented Dec 10, 2025

Hello! I have been meaning to try out these new improvements, however I get a segmentation fault during the MTP pass.
(screenshot: segmentation fault backtrace)
Any idea what could be causing this? I am using a single gpu + cpu linux system, and a q4 quant of 4.6.

I also get the following warnings when building.
(screenshot: compiler warnings during build)

@SamuelOliveirads
Author

Any idea what could be causing this? I am using a single gpu + cpu linux system, and a q4 quant of 4.6.

@timkhronos I was able to replicate the problem. It occurs because you are using GLM-4.6, which wasn't supported in this PR (most of the work was done on an older branch before 4.6 existed).

I have downloaded a GLM-4.6 quant to test and will look deeper to understand and fix the incompatibility. In theory, the architecture should be the same (or very similar) to the 4.5 variant, so I should be able to patch it.

By the way, if anyone wants to test MTP right now, please use the GLM-4.5 variant; at least Air works fine, and the full model probably does too.

GLM-4.6 models exclude specific MTP tensors (`embed_tokens` and `shared_head_head`), implying weight tying with the main model. Previously, this caused a crash when building the graph.

This commit adds a fallback mechanism to use the main model's token embeddings and output head when the MTP-specific tensors are missing.
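A minimal sketch of the fallback described above, using toy types (the real code resolves actual GGUF tensors at load time; the struct and function names here are illustrative only):

```cpp
#include <cassert>
#include <cstddef>

// Toy tensor stand-in.
struct tensor { int id; };

struct model_weights {
    tensor * tok_embd;     // main model token embeddings
    tensor * output_head;  // main model output head
    tensor * mtp_embd;     // MTP-specific embeddings (absent in GLM-4.6)
    tensor * mtp_head;     // MTP-specific head (absent in GLM-4.6)
};

// GLM-4.6 ties the MTP embeddings/head to the main model, so when the
// MTP-specific tensors are missing we fall back to the shared ones instead
// of dereferencing a null tensor at graph-build time.
tensor * mtp_embeddings(const model_weights & m) {
    return m.mtp_embd ? m.mtp_embd : m.tok_embd;
}
tensor * mtp_output_head(const model_weights & m) {
    return m.mtp_head ? m.mtp_head : m.output_head;
}
```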
@SamuelOliveirads
Author

GLM-4.6 has a small architectural difference compared to the 4.5 variants, specifically in the MTP layer (shared weights vs. separate). I just pushed a fix for that, and it's working on my end now. Anyone who wants to try MTP with GLM-4.6 should be able to do so without issues.

@timkhronos

First of all, thank you.

After testing, this PR seems to suffer large performance regressions in both TG and PP, even when '-mtp' is not in the launch arguments. On the latest mainline build I get around 6.2 tok/s TG and 100~120 PP. On this PR without '-mtp', I get around 2.2 tok/s TG and 30 PP. With '-mtp' I get around 2.7 tok/s TG, if I offload the MTP tensor{92} to the GPU.

@SamuelOliveirads
Author

After testing, this PR seems to suffer large performance regressions in both TG and PP, even when '-mtp' is not in the launch arguments. On the latest mainline build I get around 6.2 tok/s TG and 100~120 PP. On this PR without '-mtp', I get around 2.2 tok/s TG and 30 PP. With '-mtp' I get around 2.7 tok/s TG, if I offload the MTP tensor{92} to the GPU.

That's curious. I ran a couple of tests using main (2fbe3b7) and this branch across different commits to see if I had introduced changes that degrade performance, but I couldn't replicate the problem.

In my tests using GLM-4.5-Air and GLM-4.6, both main and this branch (without MTP enabled) gave statistically identical performance. I have updated the PR description to show my latest results.

I did find some interesting behavior with the Air model: the speed varies quite a bit between runs, even on the same commit. I don't know exactly what causes this (perhaps it's related to how the graph is constructed each time the model loads, or simply because I'm testing too few samples), but I didn't have this issue with the 4.6 version, which was very consistent.

I also want to thank you for the idea of fully loading the MTP layer on the GPU. It gave me a small but measurable performance boost (jumping from around 14.40 t/s to between 15.6 and 16.0 t/s).

Adds a new `mtp` boolean to `llama_model_params`. When set to false (default):
1. The loader skips loading MTP-specific tensors (NextN layers) using `TENSOR_SKIP`.
2. The KV cache size calculation excludes the MTP layer (`n_layer_kv_from_start`).

This reduces VRAM usage and load time for users running GLM-4.5/4.6 in standard generation mode.
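The gating described above can be sketched as follows. The layer layout and field names are illustrative assumptions, not the actual loader code; the point is simply that the new flag controls both tensor loading and KV-cache sizing.

```cpp
#include <cassert>

struct load_params {
    int  n_layer; // main transformer layers
    int  n_nextn; // trailing MTP (NextN) layers
    bool mtp;     // the new llama_model_params flag
};

// How many layers get KV cache allocated: with mtp=false the NextN layers
// are excluded (the n_layer_kv_from_start behavior mentioned above).
int n_layer_kv(const load_params & p) {
    return p.mtp ? p.n_layer + p.n_nextn : p.n_layer;
}

// Whether a given layer's tensors should be loaded; a false result would
// correspond to marking the tensors with TENSOR_SKIP in the loader.
bool load_layer(const load_params & p, int il) {
    return il < p.n_layer || p.mtp;
}
```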
Removes heavy penalty checks (repetition, frequency, presence, DRY) from
`common_sampler_sample_speculative`.

The specialized speculative sampler now uses a pure ArgMax (Greedy) approach.
This significantly reduces CPU overhead during the drafting phase, which
improves overall tokens per second.
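The resulting hot path is essentially a single linear scan over the logits. A minimal sketch (illustrative, not the PR's exact code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pure greedy (ArgMax) selection over raw logits: no repetition, frequency,
// presence, or DRY penalty passes, so drafting costs one O(n_vocab) scan.
int sample_greedy(const std::vector<float> & logits) {
    assert(!logits.empty());
    size_t best = 0;
    for (size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return (int) best;
}
```

Since drafted tokens are verified against the main model anyway, dropping the penalty samplers in the draft path trades nothing in final output quality for less CPU work per draft.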
@F1LM1 F1LM1 merged commit 1c5633f into F1LM1:glm4-moe-mtp Dec 21, 2025
F1LM1 pushed a commit that referenced this pull request Dec 21, 2025
…gml-org#16038)

Initalizing RESERVED_NAME in is_reserved_name() is not thread
safe and leads to corrupted memory when used from multiple threads
as can be seen in the asan trace below. This fixes the initialization
to make it thread-safe.

    #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
    #1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
    #2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    #3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    #4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
    #5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
    #6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
    #7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
    #8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
    #9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
    #10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
    ...

==45482==Register values:
 x[0] = 0x00006020004147f8   x[1] = 0x00006080000013c8   x[2] = 0x0000000000000000   x[3] = 0x0000604006289738
 x[4] = 0x0000000000000002   x[5] = 0x0000000000000001   x[6] = 0x04034000004b4000   x[7] = 0x0000000000000001
 x[8] = 0xbebebebebebebebe   x[9] = 0x17d7d7d7d7d7d7d7  x[10] = 0x00000c04000828ff  x[11] = 0x0000000000000001
x[12] = 0x000000002018d383  x[13] = 0x0000000000000000  x[14] = 0xfa0000000000fafa  x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0  x[17] = 0x00000001021284f8  x[18] = 0x0000000000000000  x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002  x[21] = 0x000000002018d384  x[22] = 0x16dd16fd2e731151  x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08  x[25] = 0x0000000100c69c20  x[26] = 0x00006080000013c7  x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60     fp = 0x00000001700aceb0     lr = 0x0000000100abce30     sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)
Thread T5 created by T0 here:
    #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
    #1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
    #2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
    #3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
    #4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
    #5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
    ...

==45482==ABORTING
F1LM1 added a commit that referenced this pull request Dec 21, 2025
merge PR to sync latest additions (polished UX and recursive drafting) and rebase upstream