
Fix Evoformer's multi-arch dispatch root cause#7881

Merged
tohtana merged 4 commits into deepspeedai:master from tohtana:tohtana/evoformer-multi-arch-root-cause
Mar 13, 2026
Conversation

@tohtana
Collaborator

@tohtana tohtana commented Mar 2, 2026

Fixes #7863
Replaces #7872

@Flamefire
Issue #7863 reports order-dependent failures in Evoformer when building for mixed CUDA architectures. The guard-only approach prevents some bad outputs but does not solve multi-generation packaging requirements.

This PR takes the root-cause direction: produce a correct multi-arch binary that can run on pre-Ampere and Ampere+ and select the right kernel family at runtime.

With TORCH_CUDA_ARCH_LIST='7.0;8.0':

  1. Build is no longer pinned by -DGPU_ARCH; it uses runtime arch dispatch (evoformer_attn.py:33, gemm_kernel_utils.h:53).
  2. Runtime chooses implementation by device CC:
    • CC >= 80 -> Sm80 (Ampere+ path)
    • CC >= 75 -> Sm75
    • CC >= 70 -> Sm70
  3. So pre-Ampere uses pre-Ampere kernels, and Ampere+ uses the Ampere-family kernel path.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@Flamefire
Contributor

If I understand this correctly, it will now always build all kernel template specializations. So even with TORCH_CUDA_ARCH_LIST=8.0 we will get the kernel optimized for SM 7.0, but it won't be selected on Ampere at runtime.

@tohtana
Collaborator Author

tohtana commented Mar 3, 2026

Hi @Flamefire, thank you for checking!

Yes, it definitely generates many instances of the template.
But DISPATCH_ARCHTAG(CC, ...) now selects ArchTag at runtime using the actual device compute capability. So on Ampere (CC >= 80), we pick Sm80 and use the Ampere-optimized kernel policy.

@Flamefire
Contributor

Ok, then it only needs to be verified that all kernels compile with the lowest supported arch in TORCH_CUDA_ARCH_LIST, i.e. that something like TORCH_CUDA_ARCH_LIST=7.0 doesn't break because an SM 9.0 kernel uses features/types not available on 7.0.

@tohtana
Collaborator Author

tohtana commented Mar 3, 2026

Good point. To address the compilation concern (whether higher-arch template paths break sm_70 compilation), the following command built successfully:

CUTLASS_PATH=.../cutlass DS_BUILD_OPS=0 DS_BUILD_EVOFORMER_ATTN=1 TORCH_CUDA_ARCH_LIST=7.0 python setup.py build_ext 

@Flamefire
Contributor

Great, then I'm also sure this works fine now. Thanks a lot!

@tohtana tohtana enabled auto-merge (squash) March 13, 2026 00:21
@tohtana tohtana merged commit 784cc26 into deepspeedai:master Mar 13, 2026
1 check passed
nathon-lee pushed a commit to nathon-lee/DeepSpeed_woo that referenced this pull request Mar 28, 2026


Development

Successfully merging this pull request may close these issues.

Multi-GPU-Arch pre-compilation of operators not supported
