MXFP4/MXFP8/int4 weights support in CuTe interface MoE GEMM example #640
sanchitintel wants to merge 4 commits into intel:main
Conversation
Please note that when weight=int4, there is a default zero point, which is 8.
Another question: the mxfp4 scale data type is ue8m0, but the storage data type is uint8.
That would depend on the quantization type.
No
I reinterpret-cast only because IGC loads more data than necessary (and discards the rest) when I used ue8m0, so I reinterpret-cast the scales to int8 for loads.
This PR has been inactive for more than 90 days and the code base has changed a lot; please reopen it if you still need it.
Will reopen, thanks! It's still needed |
Summary
Adds a MoE GEMM implementation for MXFP4/MXFP8 (FP4/FP8 weights & E8M0 scales, with group-wise quantization) to the CuTe interface.
If users don't select copy atoms for loading activations and weights and for storing the output, they are chosen automatically: users can pass void as the corresponding copy-atom template parameters. The automatically chosen copy atoms may not always attain the best performance, so users can also specify custom copy atoms.
Support for int4 weights with BF16/FP16 scales has also been added.
Weights are in plain format, and have not been prepacked.
Details
BMG doesn't support MXFP4/MXFP8/int4 natively, so the weights are converted to either FP16 or BF16, depending on the activation data type.
Currently, the implementation assumes WG_K and SG_K are both equal to 32.
Performance
Largely depends upon scaledMM performance in #633
cc @CaoZhongZ @mayuyuace @pengzhao-intel