Glm4 mtp optimizations #4
base: glm4-moe-mtp
…nt is unreasonable
feat: implemented sampling for MTP
…llama.cpp into glm4-mtp-graph-cache
b229c6a to 3bfa5d3
Hi @F1LM1, I've implemented the logic mentioned regarding the …

One issue I'm currently investigating is a potential bug with position handling in large prompts. The model sometimes ignores the stop token for "thinking" or repeats the thought process in the final reply.

In terms of results, I didn't observe a major boost in tokens/s yet, but the code is now much cleaner to maintain and sets a better foundation for applying optimizations like graph reuse and compute management.

I also wanted to touch base regarding the future of this MTP implementation. I shared my roadmap in the original PR, but since you are the owner, I'd like to know your plans/availability. Thanks for your time!
Hey @SamuelOliveirads, I've been following along with the PRs even though I've been quiet these past few weeks. I've been busier, and I feel that I have little experience on the optimization front, so sadly I haven't been able to contribute much. Outside of random bugs/lower-level optimizations like the one just found in ggml-org#15225 (comment), two things stand out to me:
As for high-level questions about this PR's future, I'm not sure what standard we would have to hit to get it pushed through to the main llama.cpp branch. I noticed you asked ggerganov the same in the original PR. It's a question I would love to know the answer to as well. But I'd guess that if we can get this to the point where it provides a meaningful speedup over no speculative draft, it will be worth merging the original PR, at which point further optimizations can go in their own PRs. I'm hoping 7c4b2c1 gets overhead low enough that, combined with my first suggestion here and maybe some more minor optimizations, it wouldn't be a reach to hit something like a 30-40% speedup over no MTP, which I think would be a reasonable checkpoint to push to get the original PR done with. Does that seem reasonable to you?
No worries! I'm also digging deeper into this area to identify what can be fixed and improved.
I was initially planning to address this in a future PR, but it feels like it is necessary now to make MTP truly usable. I believe that, apart from a skip flag, we need to track tokens and potentially other metrics. I'll take some time to better understand the consequences of running without the distinct main decode step.
I have a plan for that. For each new drafted token, I'll need to collect and store the hidden state of the previous MTP drafted token. The MTP graph will also need to output the hidden state for storage. Once that's in place, we can implement a loop inside …
I hope so too. It's good to have a concrete goal. I believe your two suggestions, combined with saving one of the model graphs (preferably the larger validation one), will yield that speedup. Even if MTP doesn't quite hit that target, we can open it up to testers to see if different backends, configurations, and use cases provide better performance.
I think point 1 is probably the best next step. A hack should be relatively easy (I'm still wrapping my head around this), but a proper implementation may involve redesigning the relevant code block. Point 2 looks good on paper, but longer drafts result in worse acceptance rates, so I'm a bit skeptical of how much speedup this can get. There is a lot of research on optimizing this, like the techniques used in the EAGLE papers. They are, however, quite complicated to implement. Speaking of overheads, I can still spot some small ones, but profiling shows the total room for overhead optimization is about 15%, so it's not a big deal. Right now I'm getting about a 10% speedup over running without MTP with …
I don't know too much about other research on MTP performance, but I would love to read some. Looking at the example from the original PR regarding Nvidia's implementation, a single-token draft yielded a speedup of roughly 78%, going up to 128% for 3 drafts. We cannot expect the same gains, as they probably used only GPUs, and we don't know what the acceptance rate was. Most importantly, this is a first implementation that will definitely need more polish, but the potential is good enough to try.
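To put the acceptance-rate concern in numbers, here is a rough sketch of how many tokens one validation pass can emit for a given draft length. It assumes every drafted token is accepted independently with the same probability `alpha`, which is a simplification and not something measured in this PR. The function name `expected_tokens_per_pass` is made up for this example. It also ignores the cost of running the draft itself, so it is an upper bound on the benefit, not a speedup figure.

```cpp
#include <cassert>

// Expected number of tokens emitted per validation pass of the main model.
// ASSUMPTION: each drafted token is accepted independently with probability
// `alpha`; real acceptance rates usually fall as the draft gets longer.
// The result is 1 + alpha + alpha^2 + ... + alpha^n_draft: one token always
// comes from the validation pass itself, and the k-th draft only counts if
// all k drafts up to and including it were accepted.
double expected_tokens_per_pass(double alpha, int n_draft) {
    assert(alpha >= 0.0 && alpha <= 1.0 && n_draft >= 0);
    double total = 0.0;
    double p_all_accepted = 1.0; // probability that the first k drafts all passed
    for (int k = 0; k <= n_draft; ++k) {
        total += p_all_accepted;
        p_all_accepted *= alpha;
    }
    return total;
}
```

Under this model, a single draft at `alpha = 0.25` yields 1.25 tokens per pass, and three drafts at `alpha = 0.5` yield 1.875, so each extra draft position adds less than the one before it.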
I've now read most of this implementation and am feeling it's better to "revert" to a separate speculative model (regarding @F1LM1's comments in #2).

Conceptually, although the main model and MTP heads are presented as a single unified model, I found no problem thinking of them as different models. The MTP model only needs the hidden state and token outputs of the main model and works on its own. Compared to conventional speculative decoding, the only extra input, the hidden state, does not need much different handling than the token input. It also has its own KV cache management that does not depend on anything from the main model.

Down to implementation, I feel the underlying codebase is much more compatible with defining MTP as a separate model. That's how current speculative decoding works, after all. We should also be able to reuse some common facilities. From an optimization perspective, the current unified approach in this PR is also suboptimal: the MTP KV update graph is merged into the main model, but it still runs as separate kernel calls.

The only drawback to this approach is that we lose the ability to easily batch speculative decoding across multiple slots, because different slots could now have different numbers of tokens to run through the draft model due to different validation results. But since a) the current speculative decoding does not support slot batching at all, and b) we can forcibly batch them by using masks, and I think the extra wasted computation is negligible, I don't think this is a problem.

Regarding reducing the model calls to 2, to my current understanding of the relevant code I believe there are no major roadblocks and it should be doable with a few additional …
Indeed, the idea was to remain the same; after all, the MTP still needs to update its cache to generate new drafts. My objective in merging the graphs was simply to avoid many …

The current implementation also has two possible problems: the …
This is kind of the main issue for now. I can simply create a PR for the fix with …
I've created this draft to share my findings on what to fix or improve to make MTP usable. Currently, MTP's output quality is good, but its performance is worse than not using it at all. Therefore, it's not enough to be on par with the baseline; we need to be faster.
My initial plan is to find areas for improvement. It's not necessary to implement everything at once, but some of these should be on our radar for the future. They are:
1) Graph reuse
2) `llama_context::decode` calls
3) Multi-token drafts

There are likely more things to improve, but for now, I find these to be the most impactful. Below are my thoughts on each:

1) Graph Reuse: The baseline implementation always reuses the graph. The process is simple: it stores the graph, and in the next call to `llama_context::process_ubatch`, it checks if the stored graph can be reused. If not, it's deleted and the new one is stored. This works well after the first token is generated, as subsequent graphs are identical. The main bottleneck isn't calling `llama_model::build_graph` constantly, but rather `ggml_backend_sched_alloc_graph`, which has to allocate and compute resources for the backend.

The first fix was simple: just store one graph. In this case, the main model's token generation graph, which is one of the most expensive, will always be reused. On my machine, this gave an uplift of 13.8% for small prompts.
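The "store one graph" idea can be pictured with a toy cache like the one below. This is only a sketch under stated assumptions: `GraphParams`, `Graph`, and `PinnedGraphCache` are hypothetical stand-ins invented for the example, not llama.cpp types, and the "build" step here is a counter where the real code would pay for graph construction and backend allocation.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-ins; these are NOT llama.cpp types.
struct GraphParams {
    int32_t n_tokens;   // tokens in the micro-batch
    int32_t n_outputs;  // rows that produce logits
    bool    is_mtp;     // true for an MTP graph, false for the main model

    bool operator==(const GraphParams & o) const {
        return n_tokens == o.n_tokens && n_outputs == o.n_outputs && is_mtp == o.is_mtp;
    }
};

struct Graph {
    GraphParams params;
};

// Keeps a single main-model graph pinned. MTP graphs are rebuilt on every
// call and never evict the pinned one.
class PinnedGraphCache {
public:
    // The returned reference is valid until the next call to get().
    const Graph & get(const GraphParams & params) {
        if (pinned_ && pinned_->params == params) {
            ++n_reused_;            // cheap path: nothing to build or allocate
            return *pinned_;
        }
        ++n_built_;                 // stand-in for build + backend allocation
        if (!params.is_mtp) {
            pinned_ = std::make_unique<Graph>(Graph{params});
            return *pinned_;
        }
        scratch_ = std::make_unique<Graph>(Graph{params});
        return *scratch_;
    }

    int n_built()  const { return n_built_; }
    int n_reused() const { return n_reused_; }

private:
    std::unique_ptr<Graph> pinned_;   // main-model graph kept across calls
    std::unique_ptr<Graph> scratch_;  // most recent MTP graph, not reused
    int n_built_  = 0;
    int n_reused_ = 0;
};
```

With an alternating main/MTP call pattern, a single unpinned slot would be evicted on every call; pinning the main-model graph keeps at least the most expensive graph on the cheap path.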
Current state: Halted.
After that, I tried to store the graph for every operation, or at least the ones that didn't involve the KV cache. By applying `llm_graph_context::cb` to certain layers, I could store and reuse the graph, and I was able to compile and test this using only the CPU backend. However, I was unable to get it working with the offload policy. In theory, the `cb` function should handle that, but something else seems to be preventing specifically the allocation and computation. Is it mixing the offload policies of the main model and the MTP? This needs a deeper investigation, and I lack the proper knowledge in this area, so I'm setting it aside for now.

2) `decode` calls: MTP was successfully implemented inside `decode`, but it uses the old logic where each operation requires an expensive function call. Here is a comparison of how many calls we make in different scenarios:

LLM - Normal:
Draft Model:
MTP (Current Slow Implementation):
One way to make MTP more usable is to match the number of calls of a typical draft model. To do that, it's necessary to combine the KV cache update and the draft generation into a single call.
Current state: In progress.
I successfully merged the KV cache update with the draft generation. This required creating a custom batch and sinfo, and changing some logic regarding the embeddings and hidden states necessary for the MTP to work. The version in this branch works in terms of output, meaning it's not breaking quality. However, the draft acceptance rate has dropped to around 25%. I believe this happens because while the first step (KV update) works using the correct hidden state from the main model, the subsequent operation (draft) is using a new hidden state generated by the MTP itself during the update. I still need to confirm this theory and apply a fix to hopefully see the acceptance rate rise back to its previous level.
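One way to picture the merged update/draft call, and the hidden-state theory above, is a batch where every row carries the main model's hidden state for its own position and only the last row asks for logits. The sketch below is an assumption-heavy illustration: `MtpRow` and `build_update_and_draft_batch` are hypothetical names, and this is not the custom batch or sinfo layout used in this branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row layout; NOT the batch/sinfo structures from this branch.
struct MtpRow {
    int32_t            token;
    int32_t            pos;
    std::vector<float> hidden;      // hidden state from the MAIN model at this position
    bool               want_logits; // true only for the row that produces the draft
};

// Builds one batch that refreshes the MTP KV cache for the accepted tokens
// and drafts from the last of them, instead of issuing two separate calls.
// Every row, including the draft row, is paired with the main model's hidden
// state, so the draft does not fall back to a state produced by the MTP head.
std::vector<MtpRow> build_update_and_draft_batch(
        const std::vector<int32_t> & tokens,
        const std::vector<std::vector<float>> & main_hidden,
        int32_t pos0) {
    assert(tokens.size() == main_hidden.size());
    std::vector<MtpRow> rows;
    rows.reserve(tokens.size());
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        const bool is_last = (i + 1 == tokens.size());
        rows.push_back({tokens[i], pos0 + static_cast<int32_t>(i), main_hidden[i], is_last});
    }
    return rows;
}
```

If the acceptance drop does come from the draft row reading an MTP-generated state, a layout like this would make the intended source explicit per row. Whether that is the actual cause is still unconfirmed.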
One last thing: this change will still require a separate warmup call on the first interaction, but this is less impactful than merging the update and draft steps. To merge the warmup step, it would be necessary to track the sinfo to know when the prompt processing has finished its last batch, and then insert a new slot for the draft token.
3) Multi-token drafts: We discussed this in another PR. The problem was that for each new draft token, the MTP's KV cache needed to be updated, which was painful to do before. Now that we are using the `decode` function, it's more feasible. If the unified update/draft implementation works, we could simply increase the batch and sinfo size to make the model draft more tokens.

These are some of my ideas. I'd appreciate any insights you might have on how to better handle some of these things, or even new ideas for improvements that I haven't spotted here.
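For the multi-token draft idea in point 3, the control flow could look roughly like the loop below: each step feeds the previously drafted token together with the hidden state that the previous MTP step produced. This is a sketch, not the planned implementation. `MtpStep`, `MtpFn`, and `draft_tokens` are hypothetical names, and the MTP forward pass is abstracted behind a callback so the loop can be exercised with a toy function.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Hidden = std::vector<float>;

// Hypothetical result of one MTP forward step.
struct MtpStep {
    int32_t token;  // drafted token
    Hidden  hidden; // hidden state to feed into the next draft step
};

// Stand-in for "run the MTP head once"; in practice this would be a decode call.
using MtpFn = std::function<MtpStep(int32_t token, const Hidden & hidden)>;

// Drafts `n_draft` tokens autoregressively. The first step uses the main
// model's last token and hidden state; later steps reuse the MTP's own output.
std::vector<int32_t> draft_tokens(const MtpFn & mtp_step, int32_t last_token,
                                  Hidden hidden, int n_draft) {
    assert(n_draft >= 0);
    std::vector<int32_t> drafts;
    int32_t token = last_token;
    for (int i = 0; i < n_draft; ++i) {
        MtpStep out = mtp_step(token, hidden);
        drafts.push_back(out.token);
        token  = out.token;
        hidden = std::move(out.hidden); // stored for the next iteration
    }
    return drafts;
}
```

Calling it with a toy step function that returns `token + 1` and `n_draft = 3` starting from token 7 produces the drafts 8, 9, 10. A real callback would also have to update the MTP KV cache for each drafted position.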