Okay, so I tried to run it, and I can't even tell when it's going to end because the stdout is so vague. What do you mean triton took 6000? 6000 what? And how do I know I'm making any progress when there are a million benchmarks and nothing is printed to say where I am?

On top of that, I tried setting TORCH_CUDA_ARCH_LIST in os.environ so I wouldn't get stupid warnings on every single iteration (those warnings were my only semblance of knowing that progress was happening). Once I added that line, everything broke and it couldn't even make it to the inner loop. Even after I removed the line, reset my terminal, and recloned your repo, it's still broken. How is it possible for it to be this dysfunctional? Is it supposed to be Windows-only? Now it doesn't work at all.
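To be concrete about the kind of output I'm asking for, here is a rough sketch of a progress line that shows the position in the run and an explicit unit. It is not taken from your repo: `format_progress`, `run_with_progress`, and the placeholder workloads are hypothetical names I made up, and I'm assuming milliseconds is the unit you meant.

```python
import time


def format_progress(index, total, name, elapsed_s):
    """Return one progress line with the position and an explicit unit (ms)."""
    return f"[{index}/{total}] {name}: {elapsed_s * 1e3:.3f} ms"


def run_with_progress(benchmarks):
    """Run each (name, callable) pair and print a progress line after each one."""
    total = len(benchmarks)
    lines = []
    for index, (name, fn) in enumerate(benchmarks, start=1):
        start = time.perf_counter()
        fn()  # for GPU kernels you would also need to synchronize before stopping the clock
        elapsed_s = time.perf_counter() - start
        line = format_progress(index, total, name, elapsed_s)
        print(line, flush=True)  # flush so progress shows up immediately
        lines.append(line)
    return lines


if __name__ == "__main__":
    # Placeholder workloads standing in for the real benchmarks (hypothetical).
    run_with_progress([
        ("sum_1k", lambda: sum(range(1_000))),
        ("sum_10k", lambda: sum(range(10_000))),
    ])
```

With something like this I could at least see "[3/10]" and know how far along the run is, and "ms" would answer the "6000 what?" question.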