Hello! We are considering support for `f16` (and `bf16`) via the `half` crate in `ndarray` (rust-ndarray/ndarray#1551), but we are seeing rather dismal performance on matrix multiplication for the new types: `f16` appears to be ~3 orders of magnitude slower than `f32`. After some debugging, I believe this is a testament to `matrixmultiply`'s performance rather than a problem with `f16` itself: the code on my Apple M2 chip is hitting native `f16` assembly instructions, so I think most of the gap is simply `matrixmultiply`'s very fast `sgemm` versus a scalar fallback.
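For context, my understanding (an assumption on my part, not a quote of the actual code path) is that without `matrixmultiply` support the `f16` case ends up in a scalar triple loop, which no amount of native `f16` hardware support can rescue against a blocked, SIMD-friendly `sgemm`. A minimal sketch of that fallback shape, written in `f32` so it stays self-contained (a real `f16` version would pull in the `half` crate):

```rust
// Hypothetical sketch of a scalar-fallback matmul: C (m x n) = A (m x k) * B (k x n),
// all matrices stored row-major in flat slices. This is the kind of loop that
// loses ~3 orders of magnitude to a blocked, vectorized sgemm kernel.
fn naive_matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

fn main() {
    // Sanity check: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let mut c = [0.0f32; 4];
    naive_matmul(&a, &b, &mut c, 2, 2, 2);
    println!("{:?}", c); // [19.0, 22.0, 43.0, 50.0]
}
```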
In light of this, I was wondering what the appetite would be for supporting f16 here in the matrixmultiply crate.
cc: @swfsql, who has been the champion for f16 in ndarray.