Hello! We are considering support for `f16` (and `bf16`) via the `half` crate in `ndarray` (rust-ndarray/ndarray#1551), but we are seeing rather dismal performance on matrix multiplication for the new types: `f16` appears to be ~3 orders of magnitude slower than `f32`. After some debugging, I believe this is a testament to `matrixmultiply`'s performance rather than a problem with `f16` itself: the code on my Apple M2 chip is hitting native `f16` assembly instructions, so I think most of the gap is simply `matrixmultiply`'s very fast `sgemm` versus a scalar fallback.
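For context, my understanding (an assumption on my part, not a quote of the actual code path) is that without `matrixmultiply` support the `f16` case ends up in a scalar triple loop, which no amount of native `f16` hardware support can rescue against a blocked, SIMD-friendly `sgemm`. A minimal sketch of that fallback shape, written in `f32` so it stays self-contained (a real `f16` version would pull in the `half` crate):

```rust
// Hypothetical sketch of a scalar-fallback matmul: C (m x n) = A (m x k) * B (k x n),
// all matrices stored row-major in flat slices. This is the kind of loop that
// loses ~3 orders of magnitude to a blocked, vectorized sgemm kernel.
fn naive_matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

fn main() {
    // Sanity check: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let mut c = [0.0f32; 4];
    naive_matmul(&a, &b, &mut c, 2, 2, 2);
    println!("{:?}", c); // [19.0, 22.0, 43.0, 50.0]
}
```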
In light of this, I was wondering what the appetite would be for supporting f16 here in the matrixmultiply crate.
cc: @swfsql, who has been the champion for f16 in ndarray.