[Misc][Quantization] Clarify the intent of GGUF FusedMoE weight materialization
#30310
Conversation
Code Review
This pull request is a refactoring that clarifies the weight materialization process for GGUF FusedMoE layers. The changes introduce an assertion to ensure that GGUF weights for FusedMoE are loaded as 3D tensors, which makes an implicit assumption explicit and improves robustness. Additionally, comments have been added to explain the logic behind handling merged weights (w1 and w3), and a minor style improvement was made by using a set for membership testing. These changes improve the code's clarity and maintainability without altering the core functionality. The implementation is correct and I have no further recommendations.
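The set-based membership test the review mentions can be sketched as follows (illustrative only; the function name and surrounding logic are hypothetical, not the actual vLLM code):

```python
def is_merged_gate_or_up(weight_name: str) -> bool:
    # A set literal makes the membership intent explicit and is the
    # idiomatic way to test "is this one of these values?" in Python,
    # compared to chained `== "w1" or == "w3"` comparisons.
    return weight_name in {"w1", "w3"}

print(is_merged_gate_or_up("w1"))  # True
print(is_merged_gate_or_up("w2"))  # False
```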
    # Materialize GGUF UninitializedParameter accounting merged weights
    if is_gguf_weight and isinstance(param, UninitializedParameter):
        # GGUF currently requires full load (3D tensors).
        assert full_load
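A standalone sketch of the guard above (simplified and hypothetical: the real code materializes a torch `UninitializedParameter`, while this stand-in only computes the final shape):

```python
def gguf_moe_final_shape(full_load: bool, num_experts: int,
                         per_expert_shape: tuple) -> tuple:
    # To materialize the parameter we need its full 3D shape,
    # including the number of experts -- hence full_load is required.
    assert full_load, (
        "GGUF FusedMoE weights must be loaded whole (3D tensors); "
        "a partial load cannot determine the final shape"
    )
    return (num_experts, *per_expert_shape)

print(gguf_moe_final_shape(True, 8, (14336, 4096)))  # (8, 14336, 4096)
```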
Perhaps add a message to clarify what's happening when the assertion fails?
Hmm, rather, it would be better to improve the comment (i.e. to state what is truly necessary). I'll consider changing the code around this line.
I changed the comment to note why full_load is required.
Before:
Before:

    # GGUF currently requires full load (3D tensors).

After:

    # To materialize a tensor, we must have full shape including
    # number of experts, making this portion to require `full_load`.

In the process of FusedMoE weight data materialization from GGUF files,
there is a magic number and some intents are not clear enough.
This commit clarifies some of them:
1. GGUF (currently) requires 3D tensor(s) for FusedMoE layer weights
as we have to know full tensor shape to materialize the parameter
(including number of experts).
2. w1 and w3 are merged per expert, i.e. the next dimension after
the expert ID is to be doubled to store both w1 and w3.
... and makes some minor adjustments.
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
Force-pushed 0fac684 to 7524b0f
…erialization (vllm-project#30310) Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
…erialization (vllm-project#30310) Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com> Signed-off-by: Joachim Studnia <joachim@mistral.ai>
Purpose
This is a refactoring PR with no functional changes (unless the GGUF file is broken).
In the process of `FusedMoE` weight data materialization from GGUF files, there is a magic number and some intents are not clear enough. This commit clarifies some of them:

1. GGUF (currently) requires 3D tensors (hence `full_load`) for `FusedMoE` layer weights.
2. `w1` and `w3` are merged per expert, i.e. the next dimension after the expert ID is to be doubled to store both `w1` and `w3`.

Item 1 is now asserted with a clarifying comment (in the `if is_gguf_weight...` block). For item 2, a comment now explains why the second dimension (`final_shape[1]`) should be doubled for `w1` and `w3`, improving the clarity.

... and makes some minor adjustments.
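The doubling of `final_shape[1]` for the merged `w1`/`w3` weights can be illustrated with a small NumPy sketch (the shapes below are made up for illustration; this is not the vLLM implementation):

```python
import numpy as np

num_experts, intermediate_size, hidden_size = 4, 6, 8

# Per-expert gate (w1) and up (w3) projections, stored separately.
w1 = np.zeros((num_experts, intermediate_size, hidden_size))
w3 = np.ones((num_experts, intermediate_size, hidden_size))

# Merged layout: the dimension right after the expert ID holds both
# w1 and w3, so final_shape[1] == 2 * intermediate_size.
w13 = np.concatenate([w1, w3], axis=1)
print(w13.shape)  # (4, 12, 8)
```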