Fix Evoformer's multi-arch dispatch root cause #7881

tohtana merged 4 commits into deepspeedai:master
Conversation
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
If I understand this correctly, it will now always build all kernel template specializations. So even with
Hi @Flamefire, thank you for checking! Yes, it definitely generates many instances of the template.
Ok, then it only needs to be verified that all kernels compile with the lowest supported TORCH_CUDA_ARCH, i.e. that using something like
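The verification described above could be sketched as a build that targets only the lowest supported architecture, so every template instantiation must compile for that generation. This is a sketch under assumptions: the `DS_BUILD_EVOFORMER_ATTN` flag follows DeepSpeed's `DS_BUILD_*` op-builder convention, and the command is run from a DeepSpeed source checkout.

```shell
# Sketch: force-compile the Evoformer attention op for the lowest supported
# arch only, so any kernel that fails to build pre-Ampere surfaces at install
# time rather than at runtime. Flag names assume DeepSpeed's op-builder
# conventions; adjust to your checkout.
TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_EVOFORMER_ATTN=1 \
    pip install . --no-build-isolation
```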
Good point,
Great, then I'm also sure this works fine now. Thanks a lot!
Fixes deepspeedai#7863
Replaces deepspeedai#7872

@Flamefire

Issue deepspeedai#7863 reports order-dependent failures in Evoformer when building for mixed CUDA architectures. The guard-only approach prevents some bad outputs but does not solve multi-generation packaging requirements.

This PR takes the root-cause approach: produce a correct multi-arch binary that can run on both pre-Ampere and Ampere+ GPUs and select the right kernel family at runtime.

With TORCH_CUDA_ARCH_LIST='7.0;8.0':

1. The build is no longer pinned by -DGPU_ARCH; it uses runtime arch dispatch (evoformer_attn.py:33, gemm_kernel_utils.h:53).
2. The runtime chooses the implementation by device compute capability (CC):
   - CC >= 80 -> Sm80 (Ampere+ path)
   - CC >= 75 -> Sm75
   - CC >= 70 -> Sm70
3. As a result, pre-Ampere GPUs run the pre-Ampere kernels, and Ampere+ GPUs run the Ampere-family kernel path.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: nathon-lee <leejianwoo@gmail.com>