[hipblaslt][CMS] Add CMS for TF32 128x128x32 TN and NN with PLR=1#3978

Merged
emezh merged 14 commits intohipblaslt_common_cms_phase2from
users/emezh/cms_tf32_128x128x32_plr1
Jan 24, 2026
Conversation

@emezh
Contributor

@emezh emezh commented Jan 21, 2026

Motivation

Add TF32 128x128x32 TN and NN CMS with PLR=1, MI=32x32x16.
The NN version uses VectorWidthA=1 to reduce bank conflicts (thanks to @sebvince for spotting this).

Technical Details

Improvement:
TN:

  • Tensile, vs. non-CMS: +15%
  • Bench, vs. current default solution: no improvement, but adding it anyway since it may be better for smaller tiles.

NN:

  • Tensile, vs. non-CMS: +17.6%
  • Bench, vs. current default solution:
    • K=8192: no improvement
    • K=4096: +3.4%
    • K=544: +54%

Test Result

Tested with tensilelite:

```
          - Exact: [2048, 2048, 1, 8192]
          - Exact: [2048, 2048, 1, 1024]
          - Exact: [5640, 8192, 1, 128]
          - Range: [[128], [128], [1], [64, 64, 256]]
          - Range: [[128], [128], [1], [1,1,64]]
          - Range: [[128], [128], [1], [32, 64, 256]]
```

hipblaslt-test:

```
[==========] 21891 tests from 12 test suites ran. (1612050 ms total)
[  PASSED  ] 21891 tests.
hipBLASLt version: 100200
hipBLASLt git version: 81be8207065b
```

Submission Checklist

AIGECORE-78

@emezh emezh marked this pull request as ready for review January 22, 2026 03:33
@emezh emezh requested a review from a team as a code owner January 22, 2026 03:33
@emezh emezh changed the title [hipblaslt][CMS] Add 128x128x32 TN and NN with PLR=1 [hipblaslt][CMS] Add CMS for TF32 128x128x32 TN and NN with PLR=1 Jan 22, 2026
Comment thread on projects/hipblaslt/tensilelite/Tensile/Tests/unit/test_CustomSchedule.py (Outdated)
Contributor

@sebvince sebvince left a comment


LGTM. Approving if it passes the tests and validators. I left a couple of optimization ideas from looking at the traces.


```python
elif isNN(kernel) and TLDS==1:
    lra0 = [0,0,0,0,
            1,1,1,1,
```
Contributor


Opt idea: I think we could save some cycles by interleaving lra0 and GrIncA (right now, the 4 ds_reads stall a bit because we can't issue them at the same time on SIMD pairs).
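The interleaving idea can be sketched in plain Python. The instruction names and counts below are made up for illustration; this is not the actual ISA stream, just a round-robin merge of two instruction lists:

```python
# Hypothetical sketch: instead of issuing all ds_read (lra0) instructions
# back to back and then the GRIncA block, alternate the two streams so
# the SIMD pair can issue from both. Names are illustrative only.

def interleave(stream_a, stream_b):
    """Round-robin merge of two instruction streams."""
    merged = []
    for a, b in zip(stream_a, stream_b):
        merged += [a, b]
    # append the tail of the longer stream
    longer = stream_a if len(stream_a) > len(stream_b) else stream_b
    merged += longer[min(len(stream_a), len(stream_b)):]
    return merged

lra0_reads = ["ds_read_0", "ds_read_1", "ds_read_2", "ds_read_3"]
gr_inc_a   = ["gr_inc_a_0", "gr_inc_a_1"]

print(interleave(lra0_reads, gr_inc_a))
# -> ['ds_read_0', 'gr_inc_a_0', 'ds_read_1', 'gr_inc_a_1', 'ds_read_2', 'ds_read_3']
```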

Contributor Author


See the updated schedule.

```python
lrb0 = [ 4,5,6,7]
# wait then read
syncs.add( 4, dscnt=8, comment="wait for the first 2x4 LRAs before packing")
syncs.add( 5, dscnt=1, comment="wait for the rest of LRAs before packing them")
```
Contributor


Opt idea: how about we do this instead:

  • dscnt=(1+4) instead of dscnt=1
  • move the last CVT pair to the next mfma index and replace them with some GRIncB that come after
  • add a dscnt=(1) at the next mfma before the last CVT pair
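For readers unfamiliar with the `dscnt` waits being discussed, here is a toy model of the semantics I'm assuming (s_waitcnt-style: block until at most N LDS ops are still outstanding). The counts are illustrative, not taken from the actual schedule:

```python
class DsCounter:
    """Toy model of ds_read completion tracking (assumed s_waitcnt-like)."""
    def __init__(self):
        self.outstanding = 0

    def issue(self, n=1):
        # issue n ds_read instructions; each increments the counter
        self.outstanding += n

    def wait(self, dscnt):
        # block until at most `dscnt` ops remain outstanding;
        # returns how many ops are newly guaranteed complete
        completed = max(0, self.outstanding - dscnt)
        self.outstanding -= completed
        return completed

ctr = DsCounter()
ctr.issue(12)                 # e.g. 12 ds_reads in flight
print(ctr.wait(dscnt=8))      # -> 4: the 4 oldest reads guaranteed done
print(ctr.wait(dscnt=1))      # -> 7: all but the newest read done
```

Under this model a larger count like dscnt=(1+4) waits for fewer completions than dscnt=1, leaving more reads in flight when the pack instructions start.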

Contributor Author


After setting VectorWidthA=1 the timing improved, but after applying this suggestion each iteration took ~17 cycles more on average.
Also note that this CMS uses mfma 32x32x16, so the gap between mfmas is ~28 cycles and 6 GRINCs can fit there.

```python
num_gr = len(gra) + len(grb)
syncs.add( 12, vlcnt=8, barrier=True, comment="wait for the previous GRAs")

lra1 = [ 12,12,12,12,
```
Contributor


To avoid some of the stalls on LRA1, have you tried interleaving them with PACK1 a bit? Instead of doing all LRA1 in a single block, we could do:

  • LRA1 (0-7) - CVT1 0-3 (talking about pack instructions before MFMA_4x4x4)
  • LRA1 (8-15) - CVT1 4-7

By doing this we could interleave ds_reads, g_mem & VALU a bit more across the two codepaths, which could maybe help the stalls we see on CVT instructions as well. To be tested.
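The two-chunk split suggested above can be sketched like this (names and counts are illustrative, not the real schedule):

```python
# Rough sketch of chunked pipelining: instead of one big LRA1 block
# followed by all the pack converts, split both streams into halves
# and alternate them. Instruction names below are hypothetical.

def pipeline(loads, converts, chunks=2):
    """Alternate equal chunks of the load and convert streams."""
    ln, cn = len(loads) // chunks, len(converts) // chunks
    sched = []
    for i in range(chunks):
        sched += loads[i * ln:(i + 1) * ln]
        sched += converts[i * cn:(i + 1) * cn]
    return sched

lra1 = [f"lra1_{i}" for i in range(16)]
cvt1 = [f"cvt1_{i}" for i in range(8)]

sched = pipeline(lra1, cvt1)
# -> lra1_0..lra1_7, cvt1_0..cvt1_3, lra1_8..lra1_15, cvt1_4..cvt1_7
```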

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After VectorWidthA=1 the timing changed, and I've optimized it a bit more, including more pipelining.

```python
lrsb = [10]

gra = [ 10,10,11,11] # one index for two instructions
grb = [ 13,13,14,14] # one index for two instructions
```
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we could spread out grb a bit (like you did on gra with the mfma4x4x4).
[trace screenshot]
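As a rough sketch, "spreading out" grb would mean placing one global read per mfma slot instead of two sharing an index. The slot numbers below are hypothetical, not the final schedule:

```python
# Illustrative only: the clustered indices [13,13,14,14] issue two
# global reads per mfma slot; spacing them one per slot leaves room
# between them.

def spread(indices, start=None):
    """Place one instruction per consecutive mfma slot."""
    start = indices[0] if start is None else start
    return [start + i for i in range(len(indices))]

grb = [13, 13, 14, 14]      # two instructions share each mfma index
print(spread(grb))           # -> [13, 14, 15, 16]
```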

Contributor Author


done

@emezh emezh merged commit f5125fb into hipblaslt_common_cms_phase2 Jan 24, 2026
13 of 14 checks passed
@emezh emezh deleted the users/emezh/cms_tf32_128x128x32_plr1 branch January 24, 2026 02:29
talumbau pushed a commit that referenced this pull request Feb 20, 2026
…th=2,4 (#4639)

## Motivation

Now that codegen supports wider LDS reads for non-TN macrotiles (see
#4174):
- redo 128x128x32 NN with VWA=2 (improvement from PR #3978) and 
- add 128x128x64 NN with VWA=4

## Technical Details

Note: validation is disabled for the NN schedules because the validator doesn't
support `swap` instructions in the PACK schedule.

### 128x128x32
VWA=4 can't be used with DepthU=32 in this MT, so VWA=2 is used.

#### Tensile, no CMS vs CMS
MNK = 2048,2048,8192
- Time: 2.3% improvement
- Efficiency:  64% --> 71.8%

#### Bench, Baseline vs CMS
MNK = 2048,2048,8192
- Time: no improvement (-11.5%)
- Efficiency:  n/a - different kernels are used

MNK = 2048,2048,4096
- Time: 7.4% improvement
- Efficiency:  n/a - different kernels are used

### 128x128x64

#### Tensile, no CMS vs CMS
MNK = 2048,2048,8192
- Time: 8.3% improvement
- Efficiency:  9.4% --> 70.7%

#### Bench, Baseline vs CMS
MNK = 2048,2048,8192
- Time: 0% improvement
- Efficiency:  n/a - different kernels

MNK = 2048,2048,4096
- Time: 14.6% improvement
- Efficiency:  n/a - different kernels
 

## Test Result


Tensile ranges tested:
```
          - Exact: [2048, 2048, 1, 8192]
          - Exact: [2048, 2048, 1, 1024]
          - Exact: [5640, 8192, 1, 128]
          - Range: [[128], [128], [1], [64, 64, 256]]
          - Range: [[128], [128], [1], [1,1,64]]
          - Range: [[128], [128], [1], [32, 64, 256]]
```

hipblaslt-test:
```
[----------] Global test environment tear-down
[==========] 22050 tests from 12 test suites ran. (1367351 ms total)
[  PASSED  ] 22050 tests.
hipBLASLt version: 100202
```
## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AIGECORE-77
jovanau pushed a commit to jovanau/rocm-libraries that referenced this pull request Mar 19, 2026
…th=2,4 (ROCm#4639)
