[GPUHeuristics] Improve MMA intrinsic selection and tiling for small-channel group backward convolutions#23834
Conversation
Force-pushed from 7788e0d to 7432a3f
…channel group backward convolutions
Three changes to improve performance of grouped convolutions with small
per-group channel counts (e.g., 32 groups with 8 input/output channels):
1. MMA intrinsic selection:
Add M*N utilization check as the highest-priority rule. When one
intrinsic has >= 2x better M*N utilization, prefer it regardless of
K-alignment. This avoids selecting 32x32 intrinsics (6.25% util) over
16x16 intrinsics (25% util) for problems with small M/N dimensions.
For same-M*N intrinsics differing only in K (e.g., 16x16x16 vs
16x16x32), use a 10% utilization threshold: at moderate utilization
prefer the smaller-K intrinsic (less wasted compute); at very low
utilization let later rules pick the larger-K intrinsic for better K
throughput.
2. Degenerate schedule fallback:
When GCD-based distribution produces all-1 subgroup/tile counts (because
tile counts are small odd numbers like 3x3 with GCD=1 against power-of-2
seeds), fall back to min-based distribution. This ensures the schedule
assigns meaningful work to subgroups for small filter spatial dimensions.
3. Batch dimension tiling:
When the inner M and N dimensions both require padding (smaller than the
intrinsic tile), increase the batch tile size from 1 to as much as 4. This
gives each workgroup more useful work to amortize dispatch and memory
access overhead, which is critical for grouped convolutions where the
per-group computation is very small.
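The intrinsic-selection rule in (1) can be sketched as follows. This is an illustrative standalone model, not IREE's actual code: the `utilization` and `prefer` helpers and the `(M, N, K)` tuple representation of intrinsic shapes are assumptions made for the example.

```python
# Hypothetical sketch of rule (1); intrinsic shapes are (M, N, K) tuples and
# both helper names are made up for illustration -- not IREE's actual API.

def utilization(m, n, intr_m, intr_n):
    """Fraction of the intrinsic's M*N tile covered by the problem."""
    return (min(m, intr_m) * min(n, intr_n)) / (intr_m * intr_n)

def prefer(a, b, m, n):
    """Pick between two (M, N, K) intrinsics for an m x n problem,
    or return None to defer to later (K-alignment) rules."""
    ua = utilization(m, n, a[0], a[1])
    ub = utilization(m, n, b[0], b[1])
    # Highest-priority rule: >= 2x better M*N utilization wins outright,
    # regardless of K-alignment.
    if ua >= 2 * ub:
        return a
    if ub >= 2 * ua:
        return b
    # Same M*N, differing only in K: at moderate utilization (>= 10%)
    # prefer the smaller-K intrinsic; below that, defer so later rules
    # can pick the larger-K intrinsic for better K throughput.
    if (a[0], a[1]) == (b[0], b[1]):
        if max(ua, ub) >= 0.10:
            return a if a[2] <= b[2] else b
    return None

# 8 channels per group: 16x16 covers 64/256 = 25%, 32x32 only 64/1024 = 6.25%.
print(prefer((32, 32, 8), (16, 16, 16), 8, 8))  # -> (16, 16, 16)
```

Running the example on the 8-channel-per-group case from the description picks the 16x16 intrinsic, since its 25% utilization is more than 2x the 6.25% of 32x32.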
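The fallback in (2) amounts to detecting when GCD-based distribution degenerates and switching to a min-based split. The sketch below assumes a simplified per-dimension model (`tile_counts` against power-of-2 `seeds`); it is not IREE's actual implementation.

```python
# Hypothetical sketch of rule (2). Distributing tile counts across power-of-2
# seed counts via GCD fails for small odd tile counts (gcd(3, 4) == 1), so we
# fall back to a min-based split when every dimension would get 1.
from math import gcd

def distribute(tile_counts, seeds):
    """Per-dimension subgroup/tile counts, GCD-based with min-based fallback."""
    counts = [gcd(t, s) for t, s in zip(tile_counts, seeds)]
    # Degenerate: all-1 counts mean no meaningful work was assigned.
    if all(c == 1 for c in counts):
        # Min-based fallback: give each dimension as many units as it can
        # use, capped by the seed.
        counts = [min(t, s) for t, s in zip(tile_counts, seeds)]
    return counts

# 3x3 filter spatial tile counts against power-of-2 seeds: GCD alone yields
# the degenerate [1, 1]; the fallback assigns [3, 3].
print(distribute([3, 3], [4, 4]))  # -> [3, 3]
```

When the GCD result is non-degenerate (e.g., power-of-2 tile counts), the fallback never triggers and the GCD-based distribution is kept unchanged.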
Benchmark results on MI355X:
- Top improvements (group backward weight convolutions, g=32):
3.94x (2466 -> 627 us) g=32, 256, 200x200, 3x3, stride 2
3.51x (1929 -> 550 us) g=32, 256, 100x100, 3x3
1.94x (370 -> 191 us) g=32, 512, 50x50, 3x3
- No significant regressions.
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: yzhang93 <zhyuhang88@gmail.com>
Force-pushed from 7432a3f to 8842f7a
Signed-off-by: yzhang93 <zhyuhang88@gmail.com>
nirvedhmeshram left a comment:
LGTM. I do have one question, but the answer is not a blocker, just good to know.
// Therefore we disallow padding only when LHS is transposed.
// Include all batch dims (static and dynamic) so the heuristic can compute
// per-dimension batch tile sizes aligned with contractionB indices.
auto allBatchBounds = llvm::map_to_vector(
This diverges from what we do for other dimensions, which is fine; I just want to make sure we are doing it for a good reason. For other dimensions we always tile the outer dimensions to 1 and only the inner dimensions get something meaningful, which is why we didn't care about skipping the dynamic dims. But the logic here is set up so that you can have multiple non-unit batch dims — is that something you have found necessary?
No, I haven't found cases where multiple non-unit batch dims are needed (multiple batch dims don't appear in real applications). However, to keep consistent with the other dimensions, I updated the code to pass only static batch dims (using the existing getDimBoundsNoPad(batchDims)) and to tile only the innermost dim if possible.
Signed-off-by: yzhang93 <zhyuhang88@gmail.com>
Three changes to improve performance of grouped convolutions with small per-group channel counts (e.g., 32 groups with 8 input/output channels):
1. MMA intrinsic selection: Add an M*N utilization check as the highest-priority rule. When one intrinsic has >= 2x better M*N utilization, prefer it regardless of K-alignment. This avoids selecting 32x32 intrinsics (6.25% util) over 16x16 intrinsics (25% util) for problems with small M/N dimensions. For same-M*N intrinsics differing only in K (e.g., 16x16x16 vs 16x16x32), use a 10% utilization threshold: at moderate utilization prefer the smaller-K intrinsic; at very low utilization let later rules pick the larger-K intrinsic for better K throughput.
2. Degenerate schedule fallback: When GCD-based distribution produces all-1 subgroup/tile counts (because tile counts are small odd numbers like 3x3 with GCD=1 against power-of-2 seeds), fall back to min-based distribution. This ensures the schedule assigns meaningful work to subgroups for small filter spatial dimensions.
3. Batch dimension tiling: When the inner M and N dimensions both require padding (smaller than the intrinsic tile), increase the batch tile size from 1 to as much as 4. This gives each workgroup more useful work to amortize dispatch and memory access overhead, which is critical for grouped convolutions where the per-group computation is very small.
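The batch-tiling condition in (3) can be sketched as a single predicate. The function and parameter names below are hypothetical, chosen for illustration rather than taken from IREE's code.

```python
# Hypothetical sketch of the batch-tiling rule; names are illustrative only.

def batch_tile_size(inner_m, inner_n, intr_m, intr_n, batch_bound):
    """Tile the batch dimension more aggressively only when both inner
    M and N are smaller than the intrinsic tile (i.e., both need padding)."""
    if inner_m < intr_m and inner_n < intr_n:
        # Up to 4 batch elements per workgroup to amortize dispatch and
        # memory access overhead; never exceed the batch dimension itself.
        return min(4, batch_bound)
    # Otherwise keep the default batch tile of 1.
    return 1

# 8 channels per group against a 16x16 intrinsic: both inner dims need
# padding, so each workgroup takes 4 batch elements.
print(batch_tile_size(8, 8, 16, 16, 32))  # -> 4
```

If either inner dimension already fills the intrinsic tile, the batch tile stays at 1 and the workgroup gets its work from the M/N dimensions as usual.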
Benchmark results on MI355X:
3.94x (2466 -> 627 us) g=32, 256, 200x200, 3x3, stride 2
3.51x (1929 -> 550 us) g=32, 256, 100x100, 3x3
1.94x (370 -> 191 us) g=32, 512, 50x50, 3x3
Similar performance improvements are also observed on RDNA4 and MI300X.