
Fix Evoformer's multi-arch dispatch root cause#7881

Merged
tohtana merged 4 commits into deepspeedai:master from tohtana:tohtana/evoformer-multi-arch-root-cause
Mar 13, 2026
Conversation

@tohtana
Collaborator

@tohtana tohtana commented Mar 2, 2026

Fixes #7863
Replaces #7872

@Flamefire
Issue #7863 reports order-dependent failures in Evoformer when building for mixed CUDA architectures. The guard-only approach prevents some bad outputs but does not solve multi-generation packaging requirements.

This PR takes the root-cause direction: produce a correct multi-arch binary that can run on pre-Ampere and Ampere+ and select the right kernel family at runtime.

With TORCH_CUDA_ARCH_LIST='7.0;8.0':

  1. Build is no longer pinned by -DGPU_ARCH; it uses runtime arch dispatch (evoformer_attn.py:33, gemm_kernel_utils.h:53).
  2. Runtime chooses implementation by device CC:
    • CC >= 80 -> Sm80 (Ampere+ path)
    • CC >= 75 -> Sm75
    • CC >= 70 -> Sm70
  3. So pre-Ampere uses pre-Ampere kernels, and Ampere+ uses the Ampere-family kernel path.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@Flamefire
Contributor

If I understand this correctly, it will now always build all kernel template specializations. So even with TORCH_CUDA_ARCH_LIST=8.0 we will get the kernel optimized for SM 7.0, but it won't be selected on Ampere at runtime.

@tohtana
Collaborator Author

tohtana commented Mar 3, 2026

Hi @Flamefire, thank you for checking!

Yes, it definitely generates many instances of the template.
But DISPATCH_ARCHTAG(CC, ...) now selects ArchTag at runtime using the actual device compute capability. So on Ampere (CC >= 80), we pick Sm80 and use the Ampere-optimized kernel policy.

@Flamefire
Contributor

Ok, then it only needs to be verified that all kernels compile with the lowest supported arch in TORCH_CUDA_ARCH_LIST, i.e. that something like TORCH_CUDA_ARCH_LIST=7.0 doesn't break because an SM 9.0 kernel uses features/types not available on 7.0.

@tohtana
Collaborator Author

tohtana commented Mar 3, 2026

Good point. To address the compilation concern (whether higher-arch template paths break sm_70 compilation), the following command built successfully:

CUTLASS_PATH=.../cutlass DS_BUILD_OPS=0 DS_BUILD_EVOFORMER_ATTN=1 TORCH_CUDA_ARCH_LIST=7.0 python setup.py build_ext 

@Flamefire
Contributor

Great, then I'm also sure this works fine now. Thanks a lot!

@tohtana tohtana enabled auto-merge (squash) March 13, 2026 00:21
@tohtana tohtana merged commit 784cc26 into deepspeedai:master Mar 13, 2026
1 check passed
nathon-lee pushed a commit to nathon-lee/DeepSpeed_woo that referenced this pull request Mar 28, 2026


Development

Successfully merging this pull request may close these issues.

Multi-GPU-Arch pre-compilation of operators not supported
