[hipblaslt][CMS] Add CMS for TF32 128x128x32 TN and NN with PLR=1#3978
Conversation
…o users/emezh/cms_tf32_128x128x64_nn
sebvince
left a comment
There was a problem hiding this comment.
LGTM. Approving if it passes the tests and validators. I left a couple of optimization ideas I got looking at the traces.
|
|
||
| elif isNN(kernel) and TLDS==1: | ||
| lra0 = [0,0,0,0, | ||
| 1,1,1,1, |
There was a problem hiding this comment.
Opt idea : I think we could save some cycles by interleaving lra0 and GrIncA (right now, the 4 ds_read stalls a bit because we can't issue them at the same time on SIMD pairs)
| lrb0 = [ 4,5,6,7] | ||
| # wait then read | ||
| syncs.add( 4, dscnt=8, comment="wait for the first 2x4 LRAs before packing") | ||
| syncs.add( 5, dscnt=1, comment="wait for the rest of LRAs before packing them") |
There was a problem hiding this comment.
Opt idea : how about we do this instead :
- dscnt=(1+4) instead dscnt=1
- move last CVT pair to the next mfma index and replace them with some GRIncB that are after
- add a dscnt=(1) at the next mfma before the last CVT pair
There was a problem hiding this comment.
after setting VectorWidthA=1 the timing became better, but after applying this suggestion the iteration took 17cyc more (on average).
Also, note that this CMS is using mfma 32x32x16, so the gap between mfmas is ~28cyc, so 6 GRINCs can fit there.
| num_gr = len(gra) + len(grb) | ||
| syncs.add( 12, vlcnt=8, barrier=True, comment="wait for the previous GRAs") | ||
|
|
||
| lra1 = [ 12,12,12,12, |
There was a problem hiding this comment.
To avoid some of the stalls on LRA1, have you tried to interleave them with PACK1 a bit ? Instead of doing all LRA1 in a single block, we could do :
- LRA1 (0-7) - CVT1 0-3 (talking about pack instructions before MFMA_4x4x4).
- LRA1 (8-15) - CVT1 4-7
By doing this, we could interleave a bit more ds_reads, g_mem & VALU with 2 codepaths I think, this could maybe help the the stalls we see on CVT instructions as well. To be tested.
There was a problem hiding this comment.
after VectorWidthA=1, timing changed and i've optimized it a bit more, including more pipelining.
| lrsb = [10] | ||
|
|
||
| gra = [ 10,10,11,11] # one index for two instructions | ||
| grb = [ 13,13,14,14] # one index for two instructions |
…o users/emezh/cms_tf32_128x128x32_plr1
…th=2,4 (#4639) ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> Now that codegen supports wider LDS reads for non-TN macrotiles (see #4174): - redo 128x128x32 NN with VWA=2 (improvement from PR #3978) and - add 128x128x64 NN with VWA=4 ## Technical Details Note, validation is disabled for NN schedules because validator doesn't support `swap` instructions in PACK schedule. ### 128x128x32 VWA=4 can't be used with DepthU=32 in this MT, so used VWA=2 #### Tensile, no CMS vs CMS MNK = 2048,2048,8192 - Time: 2.3% improvement - Efficiency: 64% --> 71.8 #### Bench, Baseline vs CMS MNK = 2048,2048,8192 - Time: no improvement (-11.5%) - Efficiency: n/a - different kernels are used MNK = 2048,2048,4096 - Time: 7.4% improvement - Efficiency: n/a - different kernels are used ### 128x128x64 #### Tensile, no CMS vs CMS MNK = 2048,2048,8192 - Time: 8.3% improvement - Efficiency: 9.4% --> 70.7% #### Bench, Baseline vs CMS MNK = 2048,2048,8192 - Time: 0% improvement - Efficiency: n/a - different kernels MNK = 2048,2048,4096 - Time: 14.6% improvement - Efficiency: n/a - different kernels <!-- Explain the changes along with any relevant GitHub links. --> ## Test Result <!-- Briefly summarize test outcomes. --> Tensile ranges tested: ``` - Exact: [2048, 2048, 1, 8192] - Exact: [2048, 2048, 1, 1024] - Exact: [5640, 8192, 1, 128] - Range: [[128], [128], [1], [64, 64, 256]] - Range: [[128], [128], [1], [1,1,64]] - Range: [[128], [128], [1], [32, 64, 256]] ``` hipblaslt-test: ``` [----------] Global test environment tear-down [==========] 22050 tests from 12 test suites ran. (1367351 ms total) [ PASSED ] 22050 tests. hipBLASLt version: 100202 ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AIGECORE-77
…th=2,4 (ROCm#4639) ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> Now that codegen supports wider LDS reads for non-TN macrotiles (see ROCm#4174): - redo 128x128x32 NN with VWA=2 (improvement from PR ROCm#3978) and - add 128x128x64 NN with VWA=4 ## Technical Details Note, validation is disabled for NN schedules because validator doesn't support `swap` instructions in PACK schedule. ### 128x128x32 VWA=4 can't be used with DepthU=32 in this MT, so used VWA=2 #### Tensile, no CMS vs CMS MNK = 2048,2048,8192 - Time: 2.3% improvement - Efficiency: 64% --> 71.8 #### Bench, Baseline vs CMS MNK = 2048,2048,8192 - Time: no improvement (-11.5%) - Efficiency: n/a - different kernels are used MNK = 2048,2048,4096 - Time: 7.4% improvement - Efficiency: n/a - different kernels are used ### 128x128x64 #### Tensile, no CMS vs CMS MNK = 2048,2048,8192 - Time: 8.3% improvement - Efficiency: 9.4% --> 70.7% #### Bench, Baseline vs CMS MNK = 2048,2048,8192 - Time: 0% improvement - Efficiency: n/a - different kernels MNK = 2048,2048,4096 - Time: 14.6% improvement - Efficiency: n/a - different kernels <!-- Explain the changes along with any relevant GitHub links. --> ## Test Result <!-- Briefly summarize test outcomes. --> Tensile ranges tested: ``` - Exact: [2048, 2048, 1, 8192] - Exact: [2048, 2048, 1, 1024] - Exact: [5640, 8192, 1, 128] - Range: [[128], [128], [1], [64, 64, 256]] - Range: [[128], [128], [1], [1,1,64]] - Range: [[128], [128], [1], [32, 64, 256]] ``` hipblaslt-test: ``` [----------] Global test environment tear-down [==========] 22050 tests from 12 test suites ran. (1367351 ms total) [ PASSED ] 22050 tests. hipBLASLt version: 100202 ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AIGECORE-77

Motivation
Add TF32 128x128x32 TN and NN CMS with PLR=1, MI=32x32x16.
The NN version uses VectorWidthA=1 to reduce bank conflicts (thanks to @sebvince for finding this out).
Technical Details
Improvement:
TN:
NN:
Test Result
Tested with tensilelite:
hipblaslt-test:Submission Checklist
AIGECORE-78