
[GPUHeuristics] Improve MMA intrinsic selection and tiling for small-channel group backward convolutions#23834

Merged
yzhang93 merged 3 commits into iree-org:main from yzhang93:intrinsic_selection_padding
Mar 23, 2026
Conversation

@yzhang93
Contributor

@yzhang93 yzhang93 commented Mar 18, 2026

Three changes to improve performance of grouped convolutions with small per-group channel counts (e.g., 32 groups with 8 input/output channels):

  1. MMA intrinsic selection: Add MN utilization check as the highest-priority rule. When one intrinsic has >= 2x better MN utilization, prefer it regardless of K-alignment. This avoids selecting 32x32 intrinsics (6.25% util) over 16x16 intrinsics (25% util) for problems with small M/N dimensions. For same MN intrinsics differing only in K (e.g., 16x16x16 vs 16x16x32), use a 10% utilization threshold: at moderate util prefer smaller compute; at very low util let later rules pick the larger-K intrinsic for better K throughput.

  2. Degenerate schedule fallback: When GCD-based distribution produces all-1 subgroup/tile counts (because tile counts are small odd numbers like 3x3 with GCD=1 against power-of-2 seeds), fall back to min-based distribution. This ensures the schedule assigns meaningful work to subgroups for small filter spatial dimensions.

  3. Batch dimension tiling: When the inner M and N dimensions both require padding (smaller than the intrinsic tile), increase the batch tile size from 1 to up to 4. This gives each workgroup more useful work to amortize dispatch and memory access overhead, which is critical for grouped convolutions where the per-group computation is very small.

Benchmark results on MI355X:

  • Top improvements (group backward weight convolutions, g=32):
    3.94x (2466 -> 627 us) g=32, 256, 200x200, 3x3, stride 2
    3.51x (1929 -> 550 us) g=32, 256, 100x100, 3x3
    1.94x (370 -> 191 us) g=32, 512, 50x50, 3x3
  • No significant regressions.

Similar performance improvements were also observed on RDNA4 and MI300X.

@yzhang93 yzhang93 force-pushed the intrinsic_selection_padding branch 2 times, most recently from 7788e0d to 7432a3f on March 18, 2026 at 23:00
@yzhang93 yzhang93 force-pushed the intrinsic_selection_padding branch from 7432a3f to 8842f7a on March 19, 2026 at 18:10
Comment thread on compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.cpp (outdated)
Signed-off-by: yzhang93 <zhyuhang88@gmail.com>
Contributor

@nirvedhmeshram nirvedhmeshram left a comment


LGTM. I do have one question, but the answer is not a blocker, just good to know.

// Therefore we disallow padding only when LHS is transposed.
// Include all batch dims (static and dynamic) so the heuristic can compute
// per-dimension batch tile sizes aligned with contractionB indices.
auto allBatchBounds = llvm::map_to_vector(
Contributor


This diverges from what we do for the other dimensions, which is fine; I just want to make sure we are doing it for a good reason. For the other dimensions we always tile the outer dimensions to 1 and only the inner dimensions get something meaningful, which is why we didn't care about skipping the dynamic dims. The logic here, however, is set up so that you can have multiple non-unit batch dims. Is that something you have found necessary?

Contributor Author


No, I haven't found cases where multiple non-unit batch dims are needed (real applications don't have multiple batch dims). However, to stay consistent with the other dimensions, I updated the code to pass only the static batch dims (using the existing getDimBoundsNoPad(batchDims)) and to tile only the innermost dim where possible.

Signed-off-by: yzhang93 <zhyuhang88@gmail.com>
@yzhang93 yzhang93 merged commit 6b4795c into iree-org:main Mar 23, 2026
53 of 57 checks passed
@yzhang93 yzhang93 deleted the intrinsic_selection_padding branch March 30, 2026 22:07