[GPUHeuristics] Improve MMA intrinsic selection and tiling for small-channel group backward convolutions#23834
Conversation
Force-pushed from 7788e0d to 7432a3f
…channel group backward convolutions
Three changes to improve performance of grouped convolutions with small
per-group channel counts (e.g., 32 groups with 8 input/output channels):
1. MMA intrinsic selection:
Add M*N utilization check as the highest-priority rule. When one
intrinsic has >= 2x better M*N utilization, prefer it regardless of
K-alignment. This avoids selecting 32x32 intrinsics (6.25% util) over
16x16 intrinsics (25% util) for problems with small M/N dimensions.
For same-M*N intrinsics differing only in K (e.g., 16x16x16 vs
16x16x32), use a 10% utilization threshold: at moderate utilization
prefer the smaller-K intrinsic (less wasted compute); at very low
utilization let later rules pick the larger-K intrinsic for better K
throughput.
2. Degenerate schedule fallback:
When GCD-based distribution produces all-1 subgroup/tile counts (because
tile counts are small odd numbers like 3x3 with GCD=1 against power-of-2
seeds), fall back to min-based distribution. This ensures the schedule
assigns meaningful work to subgroups for small filter spatial dimensions.
3. Batch dimension tiling:
When the inner M and N dimensions both require padding (smaller than the
intrinsic tile), increase the batch tile size from 1 to as much as 4. This
gives each workgroup more useful work to amortize dispatch and memory
access overhead, which is critical for grouped convolutions where the
per-group computation is very small.
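The intrinsic-selection rule in (1) can be sketched as follows. This is an illustrative standalone model, not IREE's actual code: the `utilization` and `prefer` helpers and the `(M, N, K)` tuple representation of intrinsic shapes are assumptions made for the example.

```python
# Hypothetical sketch of rule (1); intrinsic shapes are (M, N, K) tuples and
# both helper names are made up for illustration -- not IREE's actual API.

def utilization(m, n, intr_m, intr_n):
    """Fraction of the intrinsic's M*N tile covered by the problem."""
    return (min(m, intr_m) * min(n, intr_n)) / (intr_m * intr_n)

def prefer(a, b, m, n):
    """Pick between two (M, N, K) intrinsics for an m x n problem,
    or return None to defer to later (K-alignment) rules."""
    ua = utilization(m, n, a[0], a[1])
    ub = utilization(m, n, b[0], b[1])
    # Highest-priority rule: >= 2x better M*N utilization wins outright,
    # regardless of K-alignment.
    if ua >= 2 * ub:
        return a
    if ub >= 2 * ua:
        return b
    # Same M*N, differing only in K: at moderate utilization (>= 10%)
    # prefer the smaller-K intrinsic; below that, defer so later rules
    # can pick the larger-K intrinsic for better K throughput.
    if (a[0], a[1]) == (b[0], b[1]):
        if max(ua, ub) >= 0.10:
            return a if a[2] <= b[2] else b
    return None

# 8 channels per group: 16x16 covers 64/256 = 25%, 32x32 only 64/1024 = 6.25%.
print(prefer((32, 32, 8), (16, 16, 16), 8, 8))  # -> (16, 16, 16)
```

Running the example on the 8-channel-per-group case from the description picks the 16x16 intrinsic, since its 25% utilization is more than 2x the 6.25% of 32x32.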
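The fallback in (2) amounts to detecting when GCD-based distribution degenerates and switching to a min-based split. The sketch below assumes a simplified per-dimension model (`tile_counts` against power-of-2 `seeds`); it is not IREE's actual implementation.

```python
# Hypothetical sketch of rule (2). Distributing tile counts across power-of-2
# seed counts via GCD fails for small odd tile counts (gcd(3, 4) == 1), so we
# fall back to a min-based split when every dimension would get 1.
from math import gcd

def distribute(tile_counts, seeds):
    """Per-dimension subgroup/tile counts, GCD-based with min-based fallback."""
    counts = [gcd(t, s) for t, s in zip(tile_counts, seeds)]
    # Degenerate: all-1 counts mean no meaningful work was assigned.
    if all(c == 1 for c in counts):
        # Min-based fallback: give each dimension as many units as it can
        # use, capped by the seed.
        counts = [min(t, s) for t, s in zip(tile_counts, seeds)]
    return counts

# 3x3 filter spatial tile counts against power-of-2 seeds: GCD alone yields
# the degenerate [1, 1]; the fallback assigns [3, 3].
print(distribute([3, 3], [4, 4]))  # -> [3, 3]
```

When the GCD result is non-degenerate (e.g., power-of-2 tile counts), the fallback never triggers and the GCD-based distribution is kept unchanged.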
Benchmark results on MI355X:
- Top improvements (group backward weight convolutions, g=32):
3.94x (2466 -> 627 us) g=32, 256, 200x200, 3x3, stride 2
3.51x (1929 -> 550 us) g=32, 256, 100x100, 3x3
1.94x (370 -> 191 us) g=32, 512, 50x50, 3x3
- No significant regressions.
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: yzhang93 <zhyuhang88@gmail.com>
Force-pushed from 7432a3f to 8842f7a
Signed-off-by: yzhang93 <zhyuhang88@gmail.com>
nirvedhmeshram left a comment:
LGTM. I do have one question, but the answer is not a blocker, just good to know.
// Therefore we disallow padding only when LHS is transposed.
// Include all batch dims (static and dynamic) so the heuristic can compute
// per-dimension batch tile sizes aligned with contractionB indices.
auto allBatchBounds = llvm::map_to_vector(
This diverges from what we do for other dimensions, which is fine; I just want to make sure we are doing it for a good reason. For other dimensions we always tile the outer dimensions to 1 and only the inner dimensions get something meaningful, which is why we didn't care about skipping the dynamic dims. But the logic here is set up so that you can have multiple non-unit batch dims — is that something you have found necessary?
No, I haven't found cases where multiple non-unit batch dims are needed (multiple batch dims don't appear in real applications). However, to keep consistent with the other dimensions, I updated the code to pass only static batch dims (using the existing getDimBoundsNoPad(batchDims)) and to tile only the innermost dim if possible.
Signed-off-by: yzhang93 <zhyuhang88@gmail.com>
Three changes to improve performance of grouped convolutions with small per-group channel counts (e.g., 32 groups with 8 input/output channels):
1. MMA intrinsic selection: Add an M*N utilization check as the highest-priority rule. When one intrinsic has >= 2x better M*N utilization, prefer it regardless of K-alignment. This avoids selecting 32x32 intrinsics (6.25% util) over 16x16 intrinsics (25% util) for problems with small M/N dimensions. For same-M*N intrinsics differing only in K (e.g., 16x16x16 vs 16x16x32), use a 10% utilization threshold: at moderate utilization prefer the smaller-K intrinsic; at very low utilization let later rules pick the larger-K intrinsic for better K throughput.
2. Degenerate schedule fallback: When GCD-based distribution produces all-1 subgroup/tile counts (because tile counts are small odd numbers like 3x3 with GCD=1 against power-of-2 seeds), fall back to min-based distribution. This ensures the schedule assigns meaningful work to subgroups for small filter spatial dimensions.
3. Batch dimension tiling: When the inner M and N dimensions both require padding (smaller than the intrinsic tile), increase the batch tile size from 1 to as much as 4. This gives each workgroup more useful work to amortize dispatch and memory access overhead, which is critical for grouped convolutions where the per-group computation is very small.
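The batch-tiling condition in (3) can be sketched as a single predicate. The function and parameter names below are hypothetical, chosen for illustration rather than taken from IREE's code.

```python
# Hypothetical sketch of the batch-tiling rule; names are illustrative only.

def batch_tile_size(inner_m, inner_n, intr_m, intr_n, batch_bound):
    """Tile the batch dimension more aggressively only when both inner
    M and N are smaller than the intrinsic tile (i.e., both need padding)."""
    if inner_m < intr_m and inner_n < intr_n:
        # Up to 4 batch elements per workgroup to amortize dispatch and
        # memory access overhead; never exceed the batch dimension itself.
        return min(4, batch_bound)
    # Otherwise keep the default batch tile of 1.
    return 1

# 8 channels per group against a 16x16 intrinsic: both inner dims need
# padding, so each workgroup takes 4 batch elements.
print(batch_tile_size(8, 8, 16, 16, 32))  # -> 4
```

If either inner dimension already fills the intrinsic tile, the batch tile stays at 1 and the workgroup gets its work from the M/N dimensions as usual.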
Benchmark results on MI355X:
3.94x (2466 -> 627 us) g=32, 256, 200x200, 3x3, stride 2
3.51x (1929 -> 550 us) g=32, 256, 100x100, 3x3
1.94x (370 -> 191 us) g=32, 512, 50x50, 3x3
Similar performance improvements are also observed on RDNA4 and MI300X.