[LLVMGPU] Handle decomposed masks in ROCDLBufferInstructionsOptimization #23859
Draft
Max191 wants to merge 1 commit into iree-org:main from
Conversation
Contributor (Author)
I still need to properly review this myself, so probably don't review yet, but I'm making this a draft to provide context in another PR.
krzysz00 reviewed Mar 19, 2026
Force-pushed from 187d635 to 4fefc05
The tile-and-fuse pipeline now decomposes masks (`decomposeMasks=true`), producing step+broadcast+cmpi+andi IR instead of create_mask ops. This commit adapts the ROCDL buffer optimization passes to work with the new IR shape.

**ROCDLBufferInstructionsOptimization** is simplified to pattern-match `vector.broadcast(%scalar_i1)` as the mask on `vector.transfer_read` and `vector.maskedload`, replacing each with an unmasked operation + `arith.select` (or just the unmasked operation if the mask is constant true). The pass is moved from post-bufferization to after vector lowering in the pipeline.

**OptimizeComparisonOps** is a new pass that simplifies a vector `arith.cmpi` where one operand is a broadcast of a scalar with known divisibility. It uses IREE's `IntegerDivisibilityAnalysis` to determine whether the comparison result is uniform across all vector lanes, and rewrites it to a scalar comparison + broadcast. It handles all signed ordered predicates (slt, sle, sgt, sge) via a single-bucket condition (`floor(vecMin/sdiv) == floor(vecMax/sdiv)`, with an adjustment for non-strict predicates), and folds eq/ne to constants when no multiple of sdiv falls within the vector range.

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
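The single-bucket condition can be sanity-checked with plain integer arithmetic. Below is a minimal Python sketch of the idea (the function name, the `pred` encoding, and the exact bucket shift are assumptions for illustration, not the pass's actual code), assuming the vector lanes span the signed range `[vec_min, vec_max]` and the broadcast scalar is only known to be a multiple of a positive `sdiv`:

```python
def cmpi_uniform_over_lanes(vec_min, vec_max, sdiv, pred):
    """Return True iff `x <pred> t` has the same answer for every lane value
    x in [vec_min, vec_max], for every t that is a multiple of sdiv.

    Key identity (t = k*sdiv, sdiv > 0):
        x slt t  <=>  floor(x / sdiv) < k        (sge is its negation)
        x sle t  <=>  floor((x - 1) / sdiv) < k  (sgt is its negation)
    So each lane's result depends only on the lane's floor-bucket, and the
    comparison is lane-uniform exactly when vec_min and vec_max land in the
    same bucket. (Names and the `pred` encoding are illustrative only.)
    """
    shift = 0 if pred in ("slt", "sge") else 1  # "sle"/"sgt" shift the bucket
    # Python's // is floor division, which is what the bucket test needs for
    # negative values as well.
    return (vec_min - shift) // sdiv == (vec_max - shift) // sdiv
```

For example, with lanes spanning 4..7 and `sdiv = 8`, an `slt` against any multiple of 8 gives the same answer on every lane (`cmpi_uniform_over_lanes(4, 7, 8, "slt")` is `True`), so the vector compare can be rewritten as a scalar compare plus broadcast; widening the range to 4..8 crosses the bucket boundary at 8 and the rewrite no longer applies.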
Force-pushed from 4fefc05 to 8ee02af
Contributor (Author)
Converted to draft because this is causing some performance regressions that are difficult to deal with. I will see about coming back to this in the near future.
Max191 added a commit that referenced this pull request on Mar 27, 2026
…uctionsOptimization (#23947)

The pass was bailing out on `vector.transfer_read` ops with non-identity permutation maps (e.g., 1D reads from a 4D memref). After #23855, we will frequently see 1D reads, which need to be supported here. Ideally, we will do something like what is done in #23859, but that approach is causing performance regressions that are difficult to deal with. For now, this provides a solution for the new mask types we will be seeing.

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>