MooreThreads MUTLASS Changelog

0.3.0 (2025-12-19)

  • New Features:
    • Tensor Memory Engine (TME) im2col primitives.
    • New warp-specialized GEMM mainloop targeting MP31 architecture.
    • New instances of FP8 (e4m3, e5m2) GEMM in Library targeting MP31 architecture.
    • New persistent tile schedule.
    • New warp-specialized FMHA and Paged FMHA implementations for MP31 architecture.
    • New warp-specialized MLA implementation for MP31 architecture.
  • Bug fixes and improvements:
    • Refine FP8 scale GEMM implementation for MP31 architecture.
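
For readers unfamiliar with the two FP8 encodings named above, the following is a minimal decoding sketch. It assumes the widely used OCP-style conventions (e4m3 with bias 7, no infinities, and a single NaN mantissa pattern; e5m2 with bias 15 and IEEE-like inf/NaN). The changelog does not state which variant MUTLASS implements, and `decode_fp8` is a hypothetical helper, not a MUTLASS API.

```python
import math


def decode_fp8(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode one FP8 byte laid out as sign | exponent | mantissa.

    Assumption: OCP-style e4m3 (exp_bits=4, man_bits=3) and
    e5m2 (exp_bits=5, man_bits=2). Hypothetical helper for illustration.
    """
    assert 0 <= byte <= 0xFF and 1 + exp_bits + man_bits == 8
    bias = (1 << (exp_bits - 1)) - 1          # 7 for e4m3, 15 for e5m2
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if exp_bits == 5 and exp == exp_max:
        # e5m2: all-ones exponent encodes infinity (mantissa 0) or NaN.
        return sign * math.inf if man == 0 else math.nan
    if exp_bits == 4 and exp == exp_max and man == man_max:
        # e4m3: no infinities; only the all-ones pattern is NaN.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    # Normal: implicit leading one.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


print(decode_fp8(0x7E, 4, 3))  # largest finite e4m3 value
print(decode_fp8(0x7B, 5, 2))  # largest finite e5m2 value
```

Under these assumptions e4m3 trades range for precision (one more mantissa bit), while e5m2 covers a wider range with less precision.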

0.2.0 (2025-02-26)

  • MP31 Features:
    • Squad-level MMA (SQMMA) and warp-level MMA primitives with rich data types (TF32/FP16/BF16/FP8/S8, etc.).
    • Tensor Memory Engine (TME) and RobustBufferAccess primitives.
  • New GEMM mainloop and epilogue targeting MP31 architecture that achieve high performance with TME and SQMMA.
  • New tile scheduler to support CTA swizzle for MP31 kernels.
  • New experimental directory for implementations that are not yet stable and may change significantly in the future.
  • New FP8 GEMM with groupwise scaling.
  • Upgrade the backend from CUTLASS/CuTe 3.5.0 to CUTLASS/CuTe 3.6.0.
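
As an illustration of what "groupwise scaling" generally means for a quantized GEMM, here is a small pure-Python reference sketch. It assumes one scale per row of A per K-group and one scale per column of B per K-group. That layout, the function name `gemm_groupwise_scaled`, and its parameters are assumptions for illustration; the changelog does not describe MUTLASS's actual scale layout or interface.

```python
def gemm_groupwise_scaled(a, b, scale_a, scale_b, group_k):
    """Reference D = A @ B with per-K-group dequantization scales.

    a:       M x K nested list of quantized values
    b:       K x N nested list of quantized values
    scale_a: M x (K // group_k) scales   (assumed layout)
    scale_b: (K // group_k) x N scales   (assumed layout)
    """
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0])
    assert k_dim % group_k == 0, "K must be a multiple of the group size"
    d = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            acc = 0.0
            for g in range(k_dim // group_k):
                # Accumulate the unscaled partial product within one K-group.
                partial = 0.0
                for k in range(g * group_k, (g + 1) * group_k):
                    partial += a[m][k] * b[k][n]
                # Apply both operands' scales once per group.
                acc += scale_a[m][g] * scale_b[g][n] * partial
            d[m][n] = acc
    return d


a = [[1.0, 2.0, 3.0, 4.0]]
b = [[1.0], [1.0], [1.0], [1.0]]
print(gemm_groupwise_scaled(a, b, [[0.5, 2.0]], [[1.0], [0.25]], group_k=2))
```

Applying the scales once per group, rather than per element, keeps the inner accumulation in the narrow quantized domain and confines the rescaling cost to group boundaries.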

0.1.1 (2024-09-30)

  • MuTe, a core library and backend adapted from CUTLASS CuTe
  • Quyuan Features
    • MMA primitives: TensorFloat32, BFloat16, Float16, INT8
  • FMA/MMA GEMM Kernels targeting the Quyuan architecture
    • Note: this is a beta release. Further updates to MUTLASS will include performance improvements, feature enablement, and possible breaking changes to the API
  • MUTLASS Profiler, Library, and Utilities
  • Two examples that demonstrate the usage of the low-level API and the collective builders to build GEMM kernels