Increasing the tile size when a smaller data type is used often gives a significant performance gain, and sometimes simply setting the entry hint occupancy=2 has a big impact. It would be nice to automatically find a good combination of parameters for the current array sizes, hardware, and arch.
See _autotuner.py, with example usage in AttentionFMHA.py.
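
For readers who want the general idea before opening the files, here is a minimal sketch of a grid-search autotuner. It is not the interface of _autotuner.py: the `autotune` function, its arguments, and the `_cache` dictionary are hypothetical names chosen for illustration. It assumes `launch` blocks until the kernel has finished, so wall-clock timing is meaningful.

```python
import itertools
import time

# Hypothetical cache: best config per (kernel name, problem key).
_cache = {}


def autotune(name, key, search_space, launch, repeats=3,
             timer=time.perf_counter):
    """Time `launch` for every combination in `search_space` and
    return the fastest config, memoized per (name, key).

    `key` should capture whatever the best config depends on,
    e.g. dtype, array sizes, and device arch.
    """
    cache_key = (name, key)
    if cache_key in _cache:
        return _cache[cache_key]

    names = sorted(search_space)
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        cfg = dict(zip(names, values))
        try:
            launch(**cfg)  # warm-up run; also surfaces invalid configs
            start = timer()
            for _ in range(repeats):
                launch(**cfg)
            elapsed = (timer() - start) / repeats
        except Exception:
            continue  # skip combinations the kernel rejects
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed

    if best_cfg is None:
        raise RuntimeError(f"no valid config found for {name}")
    _cache[cache_key] = best_cfg
    return best_cfg
```

A call might look like `autotune("fmha", (dtype, seq_len, arch), {"tile_size": [64, 128], "occupancy": [1, 2]}, run_kernel)`, where `run_kernel` is a placeholder for a function that launches the kernel with the given parameters. Keying the cache on dtype, sizes, and arch means the search runs once per distinct problem rather than on every call.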