I gave the SLA code a try and plugged it into the self-attention code for Wan 2.1, but compared to FlashAttention-3, the speedup on Hopper architecture is negligible at best, and in some benchmarks SLA is actually slower than FA3. Are there specific hyperparameters or a particular setup needed to get the speedup, or is this expected?