[LLVMGPU] Handle decomposed masks in ROCDLBufferInstructionsOptimization#23859

Draft
Max191 wants to merge 1 commit into iree-org:main from Max191-agents:buffer-instr-opt-decomposed-mask

@Max191 Max191 commented Mar 19, 2026

The tile-and-fuse pipeline now decomposes masks (decomposeMasks=true), producing step+broadcast+cmpi+andi IR instead of create_mask ops. This commit adapts the ROCDL buffer optimization passes to work with the new IR shape.

ROCDLBufferInstructionsOptimization is simplified to pattern-match vector.broadcast(%scalar_i1) as the mask on vector.transfer_read and vector.maskedload, replacing with an unmasked operation + arith.select (or just the unmasked operation if the mask is constant true). Moved from post-bufferize to after vector lowering in the pipeline.
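
As a rough illustration of why this rewrite preserves results, the sketch below models a masked load and the unmasked load plus select in plain Python lists. It is not the pass's code: `masked_load`, `unmasked_load_then_select`, and the `passthru` vector are hypothetical names, and the sketch assumes every lane's address is in bounds so the unmasked load is always safe to execute.

```python
def masked_load(memory, offset, mask, passthru):
    """Per-lane masked load: take memory where the mask is set, else passthru."""
    return [
        memory[offset + lane] if lane_mask else fallback
        for lane, (lane_mask, fallback) in enumerate(zip(mask, passthru))
    ]


def unmasked_load_then_select(memory, offset, scalar_mask, passthru):
    """Load every lane unconditionally, then pick a whole vector with one i1.

    Assumption: all lanes are in bounds, so the unconditional load is safe.
    """
    loaded = [memory[offset + lane] for lane in range(len(passthru))]
    return loaded if scalar_mask else list(passthru)


memory = [10, 11, 12, 13, 14, 15, 16, 17]
passthru = [0, 0, 0, 0]

for scalar_mask in (True, False):
    # A broadcast mask has the same i1 value on every lane.
    broadcast_mask = [scalar_mask] * len(passthru)
    assert masked_load(memory, 2, broadcast_mask, passthru) == \
        unmasked_load_then_select(memory, 2, scalar_mask, passthru)
```

When the scalar mask is a constant true, the select always returns the loaded vector, which matches the case above where only the unmasked operation is kept.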

OptimizeComparisonOps is a new pass that simplifies vector arith.cmpi where one operand is a broadcast of a scalar with known divisibility. Uses IREE's IntegerDivisibilityAnalysis to determine if the comparison result is uniform across all vector lanes, and rewrites to a scalar comparison + broadcast.


Max191 commented Mar 19, 2026

I still need to review this properly myself, so it is probably best not to review it yet. I am opening it as a draft to provide context for another PR.

@Max191 Max191 force-pushed the buffer-instr-opt-decomposed-mask branch 3 times, most recently from 187d635 to 4fefc05 Compare March 23, 2026 20:52
@Max191 Max191 requested a review from krzysz00 March 23, 2026 20:52
@Max191 Max191 marked this pull request as ready for review March 23, 2026 20:53
The tile-and-fuse pipeline now decomposes masks (`decomposeMasks=true`),
producing step+broadcast+cmpi+andi IR instead of create_mask ops. This
commit adapts the ROCDL buffer optimization passes to work with the new
IR shape.

**ROCDLBufferInstructionsOptimization** is simplified to pattern-match
`vector.broadcast(%scalar_i1)` as the mask on `vector.transfer_read` and
`vector.maskedload`, replacing with an unmasked operation + `arith.select`
(or just the unmasked operation if the mask is constant true). Moved from
post-bufferize to after vector lowering in the pipeline.

**OptimizeComparisonOps** is a new pass that simplifies vector `arith.cmpi`
where one operand is a broadcast of a scalar with known divisibility. Uses
IREE's `IntegerDivisibilityAnalysis` to determine if the comparison result
is uniform across all vector lanes, and rewrites to a scalar comparison +
broadcast. Handles all signed ordered predicates (slt, sle, sgt, sge) via
a single-bucket condition (`floor(vecMin/sdiv) == floor(vecMax/sdiv)` with
adjustment for non-strict predicates), and folds eq/ne to constants when
no multiple of sdiv falls within the vector range.
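
The single-bucket condition can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pass's implementation: it assumes the vector operand is on the left of the comparison, that its lanes cover the integer range `[vec_min, vec_max]`, and that the broadcast scalar is some unknown multiple of a positive `sdiv`. The function names are hypothetical, and only `slt` and `sle` are modeled.

```python
def lanes_uniform(vec_min, vec_max, sdiv, predicate):
    """True if `lane <pred> s` agrees on all lanes for every multiple s of sdiv."""
    if sdiv <= 0 or vec_min > vec_max:
        raise ValueError("expected sdiv > 0 and vec_min <= vec_max")
    if predicate == "slt":
        # Lanes disagree only if vec_min < s <= vec_max, so require that no
        # multiple of sdiv lies in (vec_min, vec_max]: both ends in one bucket.
        return vec_min // sdiv == vec_max // sdiv
    if predicate == "sle":
        # Lanes disagree only if vec_min <= s < vec_max: same test shifted by 1.
        return (vec_min - 1) // sdiv == (vec_max - 1) // sdiv
    raise ValueError(f"unsupported predicate: {predicate}")


def eq_ne_foldable(vec_min, vec_max, sdiv):
    """True if no multiple of sdiv lies in [vec_min, vec_max].

    In that case `eq` is false on every lane and `ne` is true on every lane.
    """
    return (vec_min - 1) // sdiv == vec_max // sdiv


def brute_force_uniform(vec_min, vec_max, sdiv, predicate):
    """Reference check: try every nearby multiple of sdiv against every lane."""
    compare = {"slt": lambda v, s: v < s, "sle": lambda v, s: v <= s}[predicate]
    first_multiple = (vec_min // sdiv - 1) * sdiv
    for s in range(first_multiple, vec_max + 2 * sdiv, sdiv):
        results = {compare(v, s) for v in range(vec_min, vec_max + 1)}
        if len(results) > 1:
            return False
    return True
```

Python's `//` is floor division, so the bucket test also holds for negative lane values. For example, with `sdiv = 4`, lanes 8..11 are uniform for `slt` but not for `sle`, because the scalar 8 splits lane 8 from lanes 9..11 under `<=`.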

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>

Max191 commented Mar 27, 2026

Converted to draft because this is causing some performance regressions that are difficult to deal with. I will see about coming back to this in the near future.

Max191 added a commit that referenced this pull request Mar 27, 2026
…uctionsOptimization (#23947)

The pass was bailing out on vector.transfer_read ops with non-identity
permutation maps (e.g., 1D reads from a 4D memref). After
#23855, we will frequently see 1D
reads, which need to be supported here. Ideally, we will do something
like what is done in #23859, but
that approach is causing performance regressions that are difficult to
deal with. For now, this provides a solution for the new mask types we
will be seeing.

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>