[hipblaslt][CMS] Add CMS for TF32 128x128x32 TN and NN with PLR=1#3978

Merged
emezh merged 14 commits intohipblaslt_common_cms_phase2from
users/emezh/cms_tf32_128x128x32_plr1
Jan 24, 2026
Conversation

@emezh
Contributor

@emezh emezh commented Jan 21, 2026

Motivation

Add TF32 128x128x32 TN and NN CMS with PLR=1, MI=32x32x16.
The NN version uses VectorWidthA=1 to reduce bank conflicts (thanks to @sebvince for spotting this).

Technical Details

Improvement:
TN:

  • Tensile, vs. non-CMS: +15%
  • Bench, vs. current default solution: no improvement, but adding it anyway since it may be better for smaller tiles.

NN:

  • Tensile, vs. non-CMS: +17.6%
  • Bench, vs. current default solution:
    • K=8192: no improvement
    • K=4096: +3.4%
    • K=544: +54%

Test Result

Tested with tensilelite:

```
          - Exact: [2048, 2048, 1, 8192]
          - Exact: [2048, 2048, 1, 1024]
          - Exact: [5640, 8192, 1, 128]
          - Range: [[128], [128], [1], [64, 64, 256]]
          - Range: [[128], [128], [1], [1,1,64]]
          - Range: [[128], [128], [1], [32, 64, 256]]
```

hipblaslt-test:

```
[==========] 21891 tests from 12 test suites ran. (1612050 ms total)
[  PASSED  ] 21891 tests.
hipBLASLt version: 100200
hipBLASLt git version: 81be8207065b
```

Submission Checklist

AIGECORE-78

@emezh emezh marked this pull request as ready for review January 22, 2026 03:33
@emezh emezh requested a review from a team as a code owner January 22, 2026 03:33
@emezh emezh changed the title [hipblaslt][CMS] Add 128x128x32 TN and NN with PLR=1 [hipblaslt][CMS] Add CMS for TF32 128x128x32 TN and NN with PLR=1 Jan 22, 2026
Comment thread on projects/hipblaslt/tensilelite/Tensile/Tests/unit/test_CustomSchedule.py (Outdated)
Contributor

@sebvince sebvince left a comment


LGTM. Approving if it passes the tests and validators. I left a couple of optimization ideas from looking at the traces.


```python
elif isNN(kernel) and TLDS==1:
    lra0 = [0,0,0,0,
            1,1,1,1,
```
Contributor


Opt idea: I think we could save some cycles by interleaving lra0 and GrIncA (right now, the 4 ds_reads stall a bit because we can't issue them at the same time on SIMD pairs).
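The interleaving idea can be sketched in plain Python. The instruction names and counts below are made up for illustration; this is not the actual ISA stream, just a round-robin merge of two instruction lists:

```python
# Hypothetical sketch: instead of issuing all ds_read (lra0) instructions
# back to back and then the GRIncA block, alternate the two streams so
# the SIMD pair can issue from both. Names are illustrative only.

def interleave(stream_a, stream_b):
    """Round-robin merge of two instruction streams."""
    merged = []
    for a, b in zip(stream_a, stream_b):
        merged += [a, b]
    # append the tail of the longer stream
    longer = stream_a if len(stream_a) > len(stream_b) else stream_b
    merged += longer[min(len(stream_a), len(stream_b)):]
    return merged

lra0_reads = ["ds_read_0", "ds_read_1", "ds_read_2", "ds_read_3"]
gr_inc_a   = ["gr_inc_a_0", "gr_inc_a_1"]

print(interleave(lra0_reads, gr_inc_a))
# -> ['ds_read_0', 'gr_inc_a_0', 'ds_read_1', 'gr_inc_a_1', 'ds_read_2', 'ds_read_3']
```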

Contributor Author


See the updated schedule.

```python
lrb0 = [ 4,5,6,7]
# wait then read
syncs.add( 4, dscnt=8, comment="wait for the first 2x4 LRAs before packing")
syncs.add( 5, dscnt=1, comment="wait for the rest of LRAs before packing them")
```
Contributor


Opt idea: how about we do this instead:

  • dscnt=(1+4) instead of dscnt=1
  • move the last CVT pair to the next mfma index and replace them with some GRIncB that come after
  • add a dscnt=(1) at the next mfma before the last CVT pair
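For readers unfamiliar with the `dscnt` waits being discussed, here is a toy model of the semantics I'm assuming (s_waitcnt-style: block until at most N LDS ops are still outstanding). The counts are illustrative, not taken from the actual schedule:

```python
class DsCounter:
    """Toy model of ds_read completion tracking (assumed s_waitcnt-like)."""
    def __init__(self):
        self.outstanding = 0

    def issue(self, n=1):
        # issue n ds_read instructions; each increments the counter
        self.outstanding += n

    def wait(self, dscnt):
        # block until at most `dscnt` ops remain outstanding;
        # returns how many ops are newly guaranteed complete
        completed = max(0, self.outstanding - dscnt)
        self.outstanding -= completed
        return completed

ctr = DsCounter()
ctr.issue(12)                 # e.g. 12 ds_reads in flight
print(ctr.wait(dscnt=8))      # -> 4: the 4 oldest reads guaranteed done
print(ctr.wait(dscnt=1))      # -> 7: all but the newest read done
```

Under this model a larger count like dscnt=(1+4) waits for fewer completions than dscnt=1, leaving more reads in flight when the pack instructions start.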

Contributor Author


After setting VectorWidthA=1 the timing improved, but after applying this suggestion each iteration took ~17 cycles more on average.
Also note that this CMS uses mfma 32x32x16, so the gap between mfmas is ~28 cycles and 6 GRINCs can fit there.

```python
num_gr = len(gra) + len(grb)
syncs.add( 12, vlcnt=8, barrier=True, comment="wait for the previous GRAs")

lra1 = [ 12,12,12,12,
```
Contributor


To avoid some of the stalls on LRA1, have you tried interleaving them with PACK1 a bit? Instead of doing all LRA1 in a single block, we could do:

  • LRA1 (0-7) - CVT1 0-3 (talking about pack instructions before MFMA_4x4x4)
  • LRA1 (8-15) - CVT1 4-7

By doing this we could interleave ds_reads, g_mem & VALU a bit more across the two codepaths, which could maybe help the stalls we see on CVT instructions as well. To be tested.
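The two-chunk split suggested above can be sketched like this (names and counts are illustrative, not the real schedule):

```python
# Rough sketch of chunked pipelining: instead of one big LRA1 block
# followed by all the pack converts, split both streams into halves
# and alternate them. Instruction names below are hypothetical.

def pipeline(loads, converts, chunks=2):
    """Alternate equal chunks of the load and convert streams."""
    ln, cn = len(loads) // chunks, len(converts) // chunks
    sched = []
    for i in range(chunks):
        sched += loads[i * ln:(i + 1) * ln]
        sched += converts[i * cn:(i + 1) * cn]
    return sched

lra1 = [f"lra1_{i}" for i in range(16)]
cvt1 = [f"cvt1_{i}" for i in range(8)]

sched = pipeline(lra1, cvt1)
# -> lra1_0..lra1_7, cvt1_0..cvt1_3, lra1_8..lra1_15, cvt1_4..cvt1_7
```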

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After VectorWidthA=1 the timing changed, and I've optimized it a bit more, including more pipelining.

```python
lrsb = [10]

gra = [ 10,10,11,11] # one index for two instructions
grb = [ 13,13,14,14] # one index for two instructions
```
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we could spread out grb a bit (like you did on gra with the mfma4x4x4).
[trace screenshot]
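As a rough sketch, "spreading out" grb would mean placing one global read per mfma slot instead of two sharing an index. The slot numbers below are hypothetical, not the final schedule:

```python
# Illustrative only: the clustered indices [13,13,14,14] issue two
# global reads per mfma slot; spacing them one per slot leaves room
# between them.

def spread(indices, start=None):
    """Place one instruction per consecutive mfma slot."""
    start = indices[0] if start is None else start
    return [start + i for i in range(len(indices))]

grb = [13, 13, 14, 14]      # two instructions share each mfma index
print(spread(grb))           # -> [13, 14, 15, 16]
```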

Contributor Author


done

@emezh emezh merged commit f5125fb into hipblaslt_common_cms_phase2 Jan 24, 2026
13 of 14 checks passed
@emezh emezh deleted the users/emezh/cms_tf32_128x128x32_plr1 branch January 24, 2026 02:29
talumbau pushed a commit that referenced this pull request Feb 20, 2026
…th=2,4 (#4639)

## Motivation

Now that codegen supports wider LDS reads for non-TN macrotiles (see
#4174):
- redo 128x128x32 NN with VWA=2 (improvement from PR #3978) and 
- add 128x128x64 NN with VWA=4

## Technical Details

Note: validation is disabled for the NN schedules because the validator doesn't
support `swap` instructions in the PACK schedule.

### 128x128x32
VWA=4 can't be used with DepthU=32 in this MT, so VWA=2 is used.

#### Tensile, no CMS vs CMS
MNK = 2048,2048,8192
- Time: 2.3% improvement
- Efficiency:  64% --> 71.8%

#### Bench, Baseline vs CMS
MNK = 2048,2048,8192
- Time: no improvement (-11.5%)
- Efficiency:  n/a - different kernels are used

MNK = 2048,2048,4096
- Time: 7.4% improvement
- Efficiency:  n/a - different kernels are used

### 128x128x64

#### Tensile, no CMS vs CMS
MNK = 2048,2048,8192
- Time: 8.3% improvement
- Efficiency:  9.4% --> 70.7%

#### Bench, Baseline vs CMS
MNK = 2048,2048,8192
- Time: 0% improvement
- Efficiency:  n/a - different kernels

MNK = 2048,2048,4096
- Time: 14.6% improvement
- Efficiency:  n/a - different kernels
 

## Test Result


Tensile ranges tested:
```
          - Exact: [2048, 2048, 1, 8192]
          - Exact: [2048, 2048, 1, 1024]
          - Exact: [5640, 8192, 1, 128]
          - Range: [[128], [128], [1], [64, 64, 256]]
          - Range: [[128], [128], [1], [1,1,64]]
          - Range: [[128], [128], [1], [32, 64, 256]]
```

hipblaslt-test:
```
[----------] Global test environment tear-down
[==========] 22050 tests from 12 test suites ran. (1367351 ms total)
[  PASSED  ] 22050 tests.
hipBLASLt version: 100202
```
## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AIGECORE-77
jovanau pushed a commit to jovanau/rocm-libraries that referenced this pull request Mar 19, 2026
…th=2,4 (ROCm#4639)
