Glm4 mtp optimizations #4

SamuelOliveirads · 2025-10-24T02:26:56Z

I've created this draft to share my findings on what to fix or improve to make MTP usable. Currently, MTP's output quality is good, but its performance is worse than not using it at all. Therefore, it's not enough to be on par with the baseline; we need to be faster.

My initial plan is to find areas for improvement. It's not necessary to implement everything at once, but some of these should be on our radar for the future. They are:

- Graph reuse
- llama_context::decode calls
- Multi-token drafts

There are likely more things to improve, but for now, I find these to be the most impactful. Below are my thoughts on each:

1) Graph Reuse: The baseline implementation always reuses the graph. The process is simple: it stores the graph, and in the next call to llama_context::process_ubatch, it checks if the stored graph can be reused. If not, it's deleted and the new one is stored. This works well after the first token is generated, as subsequent graphs are identical. The main bottleneck isn't calling llama_model::build_graph constantly, but rather ggml_backend_sched_alloc_graph, which has to allocate and compute resources for the backend.
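The store-and-compare flow described above can be sketched in a few lines. This is a minimal illustration, not the llama.cpp code: `GraphParams`, `Graph`, and `GraphCache` are hypothetical stand-ins, and the `op` field is an assumed way to tell a main-model graph from an MTP one. The counters stand in for the expensive build and allocation work.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-ins for illustration; these are NOT the llama.cpp types.
struct GraphParams {
    std::uint32_t n_tokens;
    std::uint32_t n_outputs;
    int           op;  // assumed tag: 0 = main model, 1 = MTP

    bool operator==(const GraphParams & o) const {
        return n_tokens == o.n_tokens && n_outputs == o.n_outputs && op == o.op;
    }
};

struct Graph {
    GraphParams params;
};

// Single-slot cache: keeps only the most recently built graph.
struct GraphCache {
    std::unique_ptr<Graph> prev;
    int n_builds = 0;  // times we paid for build + allocation
    int n_reused = 0;  // times the stored graph was reused

    const Graph & get(const GraphParams & p) {
        assert(p.n_tokens > 0);
        if (prev && prev->params == p) {
            ++n_reused;
            return *prev;
        }
        // Mismatch: drop the stored graph and build a new one in its place.
        prev.reset(new Graph{p});
        ++n_builds;
        return *prev;
    }
};
```

With a single slot, a repeated token-generation graph is built once and reused afterwards, while alternating between two graph shapes (for example main model and MTP) forces a rebuild on every call.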

The first fix was simple: just store one graph. In this case, the main model's token generation graph, which is one of the most expensive, will always be reused. On my machine, this gave an uplift of 13.8% for small prompts.

Current state: Halted.

After that, I tried to store the graph for every operation, or at least the ones that didn't involve the KV cache. By applying llm_graph_context::cb to certain layers, I could store and reuse the graph, and I was able to compile and test this using only the CPU backend. However, I was unable to get it working with the offload policy. In theory, the cb function should handle that, but something else seems to be blocking the allocation and computation specifically. Is it mixing the offload policies of the main model and the MTP? This needs a deeper investigation, and I lack the proper knowledge in this area, so I'm setting it aside for now.

2) decode calls: MTP was successfully implemented inside decode, but it uses the old logic where each operation requires an expensive function call. Here is a comparison of how many calls we make in different scenarios:

LLM - Normal:
- Loop 1: Prompt + Generation = 1 call
- Loop 2: Token generation = 1 call
Draft Model:
- Loop 1: Prompt + Generation -> Draft generation -> Main model validation = 3 calls
- Loop 2: Token generation -> Draft generation -> Main model validation = 3 calls
MTP (Current Slow Implementation):
- Loop 1: Prompt + Generation -> MTP warmup -> MTP draft -> Main model validation -> MTP KV update = 5 calls
- Loop 2: Token generation -> MTP draft -> Main model validation -> MTP KV update = 4 calls

One way to make MTP more usable is to match the number of calls of a typical draft model. To do that, it's necessary to combine the KV cache update and the draft generation into a single call.
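To see how the per-loop counts add up over a whole generation, here is a small arithmetic sketch. The struct and constant names are made up for illustration; the numbers are the ones listed in the comparison above.

```cpp
#include <cassert>

// Decode calls in the first loop and in every later loop.
struct CallProfile {
    int first_loop;
    int later_loops;
};

// Per-loop counts taken from the comparison above.
constexpr CallProfile NORMAL      = {1, 1};
constexpr CallProfile DRAFT_MODEL = {3, 3};
constexpr CallProfile MTP_CURRENT = {5, 4};

// Total decode calls after n_loops generation loops.
constexpr long total_calls(CallProfile p, long n_loops) {
    return n_loops <= 0 ? 0 : p.first_loop + p.later_loops * (n_loops - 1);
}
```

Over 100 loops this gives 100 calls for plain decoding, 300 for a draft model, and 401 for the current MTP flow, which is why merging the KV update into the draft call is the first target.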

Current state: In progress.

I successfully merged the KV cache update with the draft generation. This required creating a custom batch and sinfo, and changing some logic regarding the embeddings and hidden states necessary for the MTP to work. The version in this branch works in terms of output, meaning it's not breaking quality. However, the draft acceptance rate has dropped to around 25%. I believe this happens because while the first step (KV update) works using the correct hidden state from the main model, the subsequent operation (draft) is using a new hidden state generated by the MTP itself during the update. I still need to confirm this theory and apply a fix to hopefully see the acceptance rate rise back to its previous level.

One last thing: this change will still require a separate warmup call on the first interaction, but this is less impactful than merging the update and draft steps. To merge the warmup step, it would be necessary to track the sinfo to know when the prompt processing has finished its last batch, and then insert a new slot for the draft token.

3) Multi-token drafts: We discussed this in another PR. The problem was that for each new draft token, the MTP's KV cache needed to be updated, which was painful to do before. Now that we are using the decode function, it's more feasible. If the unified update/draft implementation works, we could simply increase the batch and sinfo size to make the model draft more tokens.

These are some of my ideas. I'd appreciate any insights you might have on how to better handle some of these things, or even new ideas for improvements that I haven't spotted here.


SamuelOliveirads · 2025-11-23T00:16:44Z

Hi @F1LM1,

I've implemented the logic mentioned regarding the llama_context::decode calls. It makes the flow much simpler and easier to understand. We now essentially call decode three times: once for the main model (prompt + token), once for the MTP draft, and finally for validation. The KV cache updates have also been merged into these calls to minimize overhead.

One issue I'm currently investigating is a potential bug with position handling in large prompts. The model sometimes ignores the stop token for "thinking" or repeats the thought process in the final reply.

In terms of results, I didn't observe a major boost in tokens/s yet, but the code is now much cleaner to maintain and sets a better foundation for applying optimizations like graph reuse and compute management.

I also wanted to touch base regarding the future of this MTP implementation. I shared my roadmap in the original PR, but since you are the owner, I'd like to know your plans/availability.

Thanks for your time!

F1LM1 · 2025-11-23T01:49:54Z

Hey @SamuelOliveirads,

I've been following along with the PRs even though I've been quiet these past few weeks. Been busier + I feel that I have little experience on the optimization front so haven't been able to contribute much, sadly. Outside of random bugs/lower-level optimizations like was just found in ggml-org#15225 (comment), the two things that stand out to me:

In the original PR I noted an issue where we alternate between main model calls with draft and main model calls without drafting. This is not an MTP-specific issue but it'll especially hurt MTP performance versus typical draft models because we're only drafting one token at once without multi-token. I'm referring to this:
```
Loop 2: Token generation -> Draft generation -> Main model validation = 3 calls
```
which could really be simplified to two calls: Draft generation -> Main model validation -> Draft generation -> Main model validation, etc. after the first "standard" token is generated, since model validation itself produces a previous main-model-validated token that can be plugged into the MTP head. Think this is low-hanging fruit, it could be as simple as using a boolean to make sure the main decode loop (most of lines 3474 to 3587) only runs once. With MTP draft acceptance rates as high as they are, this should provide a decent speedup during generation, and this will scale as we continue to optimize MTP overhead.
If we want to hit the ambitious performance speedups claimed by MTP elsewhere we will have to get started on multi-token.
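The boolean-flag idea from the first suggestion can be sketched as a call schedule. This is a toy model with hypothetical names (`Step`, `schedule`), not the server loop itself; it only shows which calls remain once validation has produced a token that can feed the next draft.

```cpp
#include <cassert>
#include <vector>

enum class Step { MainDecode, Draft, Validate };

// Builds the sequence of calls for n_loops generation loops.
// skip_main_after_first == false reproduces the 3-call loop quoted above.
std::vector<Step> schedule(int n_loops, bool skip_main_after_first) {
    std::vector<Step> steps;
    bool have_validated_token = false;  // true once a validation pass has run
    for (int i = 0; i < n_loops; ++i) {
        // The standalone main decode is only needed until validation has
        // produced a main-model token to seed the next draft.
        if (!(skip_main_after_first && have_validated_token)) {
            steps.push_back(Step::MainDecode);
        }
        steps.push_back(Step::Draft);
        steps.push_back(Step::Validate);
        have_validated_token = true;
    }
    return steps;
}
```

For four loops the schedule shrinks from 12 calls to 9: three calls for the first loop and two for each loop after it.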

As for high-level questions about this PR's future, I'm not sure what standard we would have to hit to get it pushed through to the main llama.cpp branch. Noticed you asked ggerganov the same in the original PR. It's a question I would love to know the answer to as well. But I'd guess that if we can get this to the point where it's providing meaningful speedup over no speculative draft, it will be worth merging the original PR, at which point further optimizations can go in their own PRs.

I'm hoping 7c4b2c1 gets overhead low enough that combined with my first suggestion here and maybe some more minor optimizations, it wouldn't be a reach to hit something like 30-40% speedup over no MTP, which I think would be a reasonable checkpoint to push to get the original PR done with. Does that seem reasonable to you?

SamuelOliveirads · 2025-11-24T01:23:28Z

> I've been following along with the PRs even though I've been quiet these past few weeks. Been busier + I feel that I have little experience on the optimization front so haven't been able to contribute much, sadly.

No worries! I'm also digging deeper into this area to identify what can be fixed and improved.

> In the original PR I noted an issue where we alternate between main model calls with draft and main model calls without drafting. This is not an MTP-specific issue but it'll especially hurt MTP performance versus typical draft models because we're only drafting one token at once without multi-token. I'm referring to this:
> Loop 2: Token generation -> Draft generation -> Main model validation = 3 calls
> which could really be simplified to two calls: Draft generation -> Main model validation -> Draft generation -> Main model validation, etc. after the first "standard" token is generated, since model validation itself produces a previous main-model-validated token that can be plugged into the MTP head. Think this is low-hanging fruit, it could be as simple as using a boolean to make sure the main decode loop (most of lines 3474 to 3587) only runs once. With MTP draft acceptance rates as high as they are, this should provide a decent speedup during generation, and this will scale as we continue to optimize MTP overhead.

I was initially planning to address this in a future PR, but it feels like it is necessary now to make MTP truly usable. I believe that, apart from a skip flag, we need to track tokens and potentially other metrics. I'll take some time to better understand the consequences of running without the distinct main decode step.

> If we want to hit the ambitious performance speedups claimed by MTP elsewhere we will have to get started on multi-token.

I have a plan for that. For each new drafted token, I'll need to collect and store the hidden state of the previous MTP drafted token. The MTP graph will also need to output the hidden state for storage. Once that's in place, we can implement a loop inside mtp_speculative_gen_draft to handle each token in the batch.
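A rough sketch of that loop, under stated assumptions: `MtpHead` is a hypothetical callable standing in for the MTP graph, returning both the draft token and the hidden state to store for the next step. The real `mtp_speculative_gen_draft` signature is not shown in this thread and may look quite different.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

using Token  = int;
using Hidden = std::vector<float>;

// Hypothetical MTP head: (last token, hidden state) -> (draft token, new hidden state).
using MtpHead = std::function<std::pair<Token, Hidden>(Token, const Hidden &)>;

// Drafts n_draft tokens, feeding each step's output hidden state into the next step.
std::vector<Token> draft_tokens(const MtpHead & head, Token last_token, Hidden hidden, int n_draft) {
    std::vector<Token> drafts;
    for (int i = 0; i < n_draft; ++i) {
        std::pair<Token, Hidden> out = head(last_token, hidden);
        drafts.push_back(out.first);
        last_token = out.first;           // the new draft becomes the next input token
        hidden = std::move(out.second);   // store the MTP hidden state for the next step
    }
    return drafts;
}
```

The first step uses the hidden state from the main model; every later step uses the one the MTP head produced itself, which is the state the graph would have to expose.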

> I'm hoping 7c4b2c1 gets overhead low enough that combined with my first suggestion here and maybe some more minor optimizations, it wouldn't be a reach to hit something like 30-40% speedup over no MTP, which I think would be a reasonable checkpoint to push to get the original PR done with. Does that seem reasonable to you?

I hope so too. It's good to have a concrete goal. I believe your two suggestions, combined with saving one of the model graphs (preferably the larger validation one), will yield that speedup. Even if MTP doesn't quite hit that target, we can open it up to testers to see if different backends, configurations, and use cases provide better performance.

wishstudio · 2025-11-24T04:26:41Z

I think point 1 is probably the best next step. A hack should be relatively easy (I'm still wrapping my head around this), but a proper implementation may involve redesigning the relevant code block.

Point 2 looks good on paper, but longer drafts result in worse acceptance rates, so I'm a bit skeptical of how much speedup this can get. There is a lot of research on optimizing this, like the techniques used in the EAGLE papers. They are, however, quite complicated to implement.

Speaking of overheads, I can still spot some small ones, but profiling shows the total room for overhead optimization is around 15%, so not a big deal.

Right now I'm getting around 10% speedup over running without MTP with --cpu-moe. My napkin math shows 30-40% speedup is probably the ceiling, with all these goals implemented and overheads fixed. I believe the claimed big speedups can only be achieved in pure GPU workloads because batch size 1 is very inefficient on GPUs. But testing this is out of reach for me.
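One way to reproduce this kind of napkin math is the standard expected-acceptance estimate for speculative decoding. The sketch below assumes each draft token is accepted independently with the same probability, which is a simplification (as noted above, acceptance tends to drop for later draft positions). The function names and the `loop_cost` parameter are illustrative, not taken from the PR.

```cpp
#include <cassert>
#include <cmath>

// Expected tokens produced per validation pass with n_draft draft tokens,
// assuming independent acceptance with probability `accept`:
// 1 + a + a^2 + ... + a^n_draft (the leading 1 is the token the main model
// always contributes).
double expected_tokens(double accept, int n_draft) {
    double total = 0.0;
    double p = 1.0;
    for (int i = 0; i <= n_draft; ++i) {
        total += p;
        p *= accept;
    }
    return total;
}

// Throughput relative to plain decoding (1 token per unit of cost).
// loop_cost: assumed total cost of one draft + validate loop, measured in
// units of one plain single-token decode.
double estimated_speedup(double accept, int n_draft, double loop_cost) {
    assert(loop_cost > 0.0);
    return expected_tokens(accept, n_draft) / loop_cost;
}
```

For example, an acceptance probability of 0.8 with one draft token gives 1.8 expected tokens per loop, so the loop must cost less than 1.8 plain decodes to be a net win. These inputs are placeholders, not measurements from this PR.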

SamuelOliveirads · 2025-11-26T00:54:55Z

> Point 2 looks good on paper, but longer drafts result in worse acceptance rates, so I'm a bit skeptical of how much speedup this can get. There is a lot of research on optimizing this, like the techniques used in the EAGLE papers. They are, however, quite complicated to implement.

I don't know too much about other research on MTP performance, but I would love to read some. Looking at the example from the original PR regarding Nvidia's implementation, a single-token draft yielded a speedup of roughly 78%, going up to 128% for 3 drafts. We cannot expect the same gains, as they probably used only GPUs, and we don't know what the acceptance rate was. Most importantly, this is a first implementation that definitely will need more polish, but the potential is good enough to try.

wishstudio · 2025-11-26T10:09:34Z

I've now read most of this implementation and am feeling it's better to "revert" to a separate speculative model (regarding @F1LM1's comments in #2).

Conceptually, although the main model and MTP heads are presented as a single unified model, I see no problem in thinking of them as different models. The MTP model only needs the hidden state and token outputs of the main model and works on its own. Compared to conventional speculative decoding, the only addition is the hidden state input, which needs much the same handling as the token input. It also has its own KV cache management that does not depend on anything from the main model.

Down to implementation, I feel the underlying codebase is much more compatible with defining MTP as a separate model. That's how current speculative decoding works, after all. We should also be able to reuse some common facilities. From an optimization perspective, the current unified approach in this PR is also suboptimal: the MTP KV update graph is merged into the main model, but it still results in separate kernel calls. build_mtp_draft_graph and build_mtp_update_graph are essentially building the same graph, just working on different parts of the inputs. Kernel-wise, the entire process is currently still a trio: main model -> update draft KV -> draft model. But the update-draft-KV and draft-model series of kernel calls can be merged into one. It simply needs to roll back speculations and "catch up" to the main model's outputs in a single batch, similar to what current speculative decoding does. Separating them into two models will also eliminate the need for MTP_OP_* handling and make the code even cleaner.
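The "roll back and catch up in a single batch" step can be sketched with plain token sequences. This is a simplified model with hypothetical names (`CatchUp`, `plan_catch_up`): it tracks tokens only, whereas an MTP head would also need the matching main-model hidden states for each catch-up token.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Token = int;

struct CatchUp {
    std::size_t        n_keep;  // draft-cache entries that remain valid
    std::vector<Token> batch;   // tokens the draft model processes in one call
};

// draft_cache: tokens whose KV entries the draft model already holds,
//              including speculated ones that may have been rejected.
// validated:   tokens confirmed by the main model so far.
CatchUp plan_catch_up(const std::vector<Token> & draft_cache, const std::vector<Token> & validated) {
    std::size_t n = 0;
    while (n < draft_cache.size() && n < validated.size() && draft_cache[n] == validated[n]) {
        ++n;
    }
    CatchUp plan;
    plan.n_keep = n;  // everything past this point is rolled back
    plan.batch.assign(validated.begin() + static_cast<std::ptrdiff_t>(n), validated.end());
    return plan;
}
```

The draft model would truncate its cache to `n_keep` entries and then decode `batch` in one call, with the last position of that call producing the next draft.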

The only drawback to this approach is that we lose the ability to easily batch speculative decoding across multiple slots, because different slots could now have different numbers of tokens to run through the draft model due to different validation results. But since a) the current speculative decoding does not support slot batching at all and b) we can forcibly batch them by using masks, and I think the extra wasted computation is negligible, I don't think this is a problem.

Regarding reducing the model calls to 2, to my current understanding of the relevant code, I believe there are no major roadblocks, and it should be doable with a few additional ifs to skip the main model call as @F1LM1 suggested. But I guess this should better be another PR after we get this one merged, because this is quite independent of the MTP implementation and could also help conventional speculative decoding.

SamuelOliveirads · 2025-11-28T00:42:47Z

> build_mtp_draft_graph and build_mtp_update_graph are essentially building the same graph, just working on different parts of the inputs.

Indeed, the idea was for it to remain the same; after all, the MTP still needs to update its cache to generate new drafts. My objective in merging the graphs was simply to avoid many llama_decode calls. Those were hard on the scheduler and on graph_reuse, as we often need to generate around 4 graphs per interaction.

The current implementation also has two possible problems. The first is graph_reuse, which was previously partly activated and which I have now fully deactivated; in previous tests, partial reuse gave me a boost of around 13.8% over the original PR. The other is the possibility of bugs, although I didn't find anything wrong in my tests.

> But I guess this should better be another PR after we get this one merged, because this is quite independent of the MTP implementation and could also help conventional speculative decoding.

This is kind of the main issue for now. I can simply create a PR with the llama_get_logits_ith fix, the reactivated partial graph_reuse, and better usability so the user can choose whether to use MTP. Then @F1LM1 will need to open his original PR and we hope for someone to review it. Otherwise, we will probably only find enough traction if we actually get better token speed with MTP than without (which is the idea of this heavy PR, and I agree that it's a lot to change here).


F1LM1 and others added 30 commits August 10, 2025 23:52

added getter for nextn layer count and server slot has_mtp property

db60623

some work towards building mtp layer graph

e434f87

make nextn weights loadable without a crash

1f477b3

add model member function to build mtp graph, to be called from speculative.cpp

03231da

broad thrust of the mtp implementation

cf0f7c0

failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable

6e9bafc

added proper KV cache management for MTP layers and slightly refactored

6870f97

fixed mtp kv cache update sequencing after prompt processing

382135a

kludge-y kv cache management of mtp layer

d72f9d5

fixed vram leak

471e026

replace standard sampler with greedy sampler for mtp draft

98bc0c6

fixed mtp kv cache update step in cases where prompt size > n_batch and n_ubatch

9fab53e

feat: implemented sampling for MTP

07670a2

fix: add sample acceptance

5a5bce8

feat: apply logits + greedy sampler

8742ce0

Merge pull request F1LM1#1 from SamuelOliveirads/glm4-moe-mtp

c6237c7

feat: implemented sampling for MTP

mtp-batch (wip): move mtp execution to batch format

1318b2d

mtp-batch (wip): merge mtp and model graph

042eb8a

mtp-batch (wip): merge glm graphs

df64508

mtp-batch (fix): warm mtp cache for small batch size

3da7e7f

mtp-batch (wip): organize batch for mtp cache

75dc25e

mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption

67c6c06

mtp-batch (wip): fix how to warmup kv cache for MTP

febd823

mtp-batch (feat): Create and manage sinfo for MTP

5e1d719

mtp-batch (fix): prevent mtp draft from polluting the cache

6f74ba3

mtp-batch(refactor): Replace MTP boolean flags with an explicit operation enum

913af8f

mtp-batch(refactor): Extract decode context and MTP input logic into helper methods

a99709d

mtp-batch(chore): Fix logit flags for speculative sampling and remove debug logs

b4cbe03

mtp-batch(fix): Correctly advance cache head and add MTP documentation

4bcc9e2

mtp-batch(chore): Remove final MTP debug logs and dead code

0127c6b

SamuelOliveirads added 4 commits October 12, 2025 16:33

mtp-graph(feat): Reactivate graph reuse only for main model path

171346c

mtp-batch(fix): avoid logits for mtp kv cache operations

cae85fe

Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/llama.cpp into glm4-mtp-graph-cache

15dff20

mtp-graph (wip): testing different ways to allow graph reuse

5859cb9

SamuelOliveirads mentioned this pull request Nov 2, 2025

server: implement GLM-style MTP ggml-org/llama.cpp#15225

Open

mtp-graph (feat): simplify graph logic

3bfa5d3

SamuelOliveirads force-pushed the glm4-mtp-optimizations branch from b229c6a to 3bfa5d3 Compare November 22, 2025 22:12

mtp-graph (fix): move llama_get_logits_ith outside the loop

7c4b2c1

SamuelOliveirads mentioned this pull request Dec 6, 2025

Fix/improve mtp performance #5

Closed

F1LM1 force-pushed the glm4-moe-mtp branch from 79b0fea to d10a5a4 Compare December 21, 2025 22:59
