FlashAttentionKernel2 has been implemented for CUDA. Is there an example of how to use it?

Hello, I see that FlashAttentionKernel2 has been implemented. Are there test cases and benchmarks for it? How much faster is it than regular attention on CUDA? Thanks.