MooreThreads MUTLASS Changelog

0.3.0 (2025-12-19)

New Features:
- Tensor Memory Engine (TME) im2col primitives.
- New warp-specialized GEMM mainloop targeting MP31 architecture.
- New instances of FP8 (e4m3, e5m2) GEMM in Library targeting MP31 architecture.
- New persistent tile scheduler.
- New warp-specialized FMHA and Paged FMHA implementations for MP31 architecture.
- New warp-specialized MLA implementation for MP31 architecture.

Bug fixes and improvements:
- Refine FP8 scale GEMM implementation for MP31 architecture.
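
For readers unfamiliar with the two FP8 encodings named above, the following is a minimal reference sketch of how an e4m3 or e5m2 bit pattern maps to a real value. It assumes the common OCP-style conventions (e4m3 with bias 7, no infinities, and a single NaN pattern; e5m2 with bias 15 and IEEE-like infinities and NaNs). Whether MP31 hardware follows these conventions in every detail is an assumption, and the function names are hypothetical, not part of the MUTLASS API.

```python
import math


def decode_fp8(byte, exp_bits, man_bits, bias, ieee_like):
    """Decode one FP8 bit pattern (0..255) into a Python float.

    ieee_like=True  : top exponent encodes inf/NaN (assumed for e5m2).
    ieee_like=False : only all-ones exponent and mantissa is NaN (assumed for e4m3).
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if ieee_like and exp == exp_max:
        return sign * math.inf if man == 0 else math.nan
    if not ieee_like and exp == exp_max and man == man_max:
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one, fixed exponent of (1 - bias).
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    # Normal: implicit leading one.
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_like=False)


def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_like=True)
```

Under these assumptions e4m3 trades range for precision (largest finite value 448), while e5m2 trades precision for range (largest finite value 57344). This difference in range is one reason FP8 GEMMs are commonly paired with scale factors.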

MP31 Features:
- Squad-level MMA (SQMMA) and Warp-level MMA primitives with rich data types (TF32/FP16/BF16/FP8/S8 etc.).
- Tensor Memory Engine (TME) and RobustBufferAccess primitives.
New GEMM mainloop and epilogue targeting MP31 architecture that achieve high performance with TME and SQMMA.
New tile scheduler to support CTA swizzle for MP31 kernels.
New experimental directory housing implementations that are not yet stable and may change significantly in the future.
- Prototype of Flash Attention Forward targeting MP31 architecture with TME, RobustBufferAccess and SQMMA.
New FP8 GEMM with groupwise scaling.
Upgrade the backend from CUTLASS/CuTe 3.5.0 to CUTLASS/CuTe 3.6.0.
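
As background for the groupwise-scaled FP8 GEMM entry above, the sketch below shows one way to write the reference semantics of such a GEMM in plain Python. It is an illustration under stated assumptions, not the MUTLASS kernel: the function name, the list-of-lists layout, and the scale granularity (one scale per row of A and per column of B for each group of `group_k` elements along K) are all hypothetical simplifications. Real kernels typically use coarser scale blocks along M and N.

```python
def groupwise_scaled_gemm(A, B, scale_a, scale_b, group_k):
    """Reference C = A @ B with per-K-group scale factors.

    A        : M x K matrix (list of rows), the quantized operand values.
    B        : K x N matrix (list of rows), the quantized operand values.
    scale_a  : M x G matrix, scale for row i of A within K-group g.
    scale_b  : G x N matrix, scale for column j of B within K-group g.
    group_k  : number of consecutive K elements sharing one scale (assumed
               to divide K evenly in this sketch).
    """
    M, K, N = len(A), len(B), len(B[0])
    if K % group_k != 0:
        raise ValueError("K must be a multiple of group_k")
    groups = K // group_k

    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for g in range(groups):
                # Accumulate the unscaled partial product for this K-group.
                partial = 0.0
                for k in range(g * group_k, (g + 1) * group_k):
                    partial += A[i][k] * B[k][j]
                # Apply both scale factors once per group, then accumulate.
                acc += scale_a[i][g] * scale_b[g][j] * partial
            C[i][j] = acc
    return C
```

Applying the scales once per group, rather than once per output element, lets each K-group use its own dynamic range, which is the main motivation for groupwise over per-tensor scaling.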

MuTe, a core library and backend adapted from CUTLASS CuTe
Quyuan Features:
- MMA primitives: TensorFloat32, BFloat16, Float16, INT8
FMA/MMA GEMM Kernels targeting the Quyuan architecture
- Note: this is a beta release. Further updates to MUTLASS will include performance improvements, feature enablement, and possible breaking changes to the API.
MUTLASS Profiler, Library, and Utilities
Two examples that demonstrate the usage of the low-level API and the collective builders to build GEMM kernels