[DMA][Swizzle] Enable LDS DMA with swizzling #23807

Open
lialan wants to merge 13 commits into users/lialan/lower_dma_when_scaled from users/lialan/swizzle_dma

@lialan
Contributor

@lialan lialan commented Mar 17, 2026

  • Add swizzle detection helper that traces destination memref through expand_shape/collapse_shape/subview to find SwizzleHintOp.
  • Apply the swizzle attribute's offset transformation to source linear offsets in the gather_to_lds lowering.
  • XOR swizzle is self-inverse, so applying it to source addresses produces the correct swizzled layout in LDS without violating gather_to_lds's uniform-destination constraint.
  • Add pipeline tests and E2E tests.
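The self-inverse argument in the third bullet can be checked with a toy model. The sketch below is only an illustration: `xor_swizzle`, `row_width`, and `access_width` are hypothetical names and the formula is a generic XOR swizzle, not necessarily the exact transformation the swizzle attribute implements. It shows that reading from swizzled source offsets while writing linearly gives the same LDS contents as reading linearly and writing to swizzled destinations.

```python
def xor_swizzle(offset, row_width=64, access_width=8):
    """Generic XOR swizzle (hypothetical formula, for illustration only).

    Elements are grouped into chunks of `access_width`; within each row the
    chunk index is XORed with the row index. `row_width // access_width`
    is assumed to be a power of two so the result stays inside the row.
    """
    row, col = divmod(offset, row_width)
    group, lane = divmod(col, access_width)
    groups_per_row = row_width // access_width
    swizzled_group = group ^ (row % groups_per_row)
    return row * row_width + swizzled_group * access_width + lane


def scatter_to_swizzled_dst(src):
    # Reference layout: read linearly, write to swizzled destinations.
    lds = [None] * len(src)
    for i, value in enumerate(src):
        lds[xor_swizzle(i)] = value
    return lds


def gather_from_swizzled_src(src):
    # DMA-friendly variant: destinations stay linear, sources are swizzled.
    return [src[xor_swizzle(d)] for d in range(len(src))]


if __name__ == "__main__":
    data = list(range(64 * 16))  # 16 rows of 64 elements
    # Applying the swizzle twice returns the original offset.
    assert all(xor_swizzle(xor_swizzle(i)) == i for i in range(len(data)))
    # Both strategies produce the same LDS contents.
    assert scatter_to_swizzled_dst(data) == gather_from_swizzled_src(data)
    print("source-side swizzle matches destination-side swizzle")
```

The equivalence relies only on the swizzle being its own inverse: slot `d` must hold the element whose swizzled offset is `d`, and for an involution that element sits at source offset `xor_swizzle(d)`.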

@lialan lialan changed the title from "[DMA][Swizzle] Apply inverse source swizzle in DMA lowering for swizzled destinations" to "[DMA][Swizzle] Enable LDS DMA with swizzling" Mar 17, 2026
@lialan lialan force-pushed the users/lialan/swizzle_dma branch from c42cd2c to aed9f22 Compare March 17, 2026 15:08
@lialan
Contributor Author

lialan commented Mar 17, 2026

Initial benchmark results on MI355X without tuning:

| Shape | Branch | Time (mean) | TFLOPS | vs main |
|---|---|---|---|---|
| 8192³ | PR (XOR swizzle + DMA) | 0.522 ms | 2106 | -17% |
| 8192³ | main (XOR swizzle only) | 0.434 ms | 2533 | baseline |
| 16384³ | PR (XOR swizzle + DMA) | 4.02 ms | 2188 | -23% |
| 16384³ | main (XOR swizzle only) | 3.09 ms | 2847 | baseline |
| 32768³ | PR (XOR swizzle + DMA) | 33.3 ms | 2113 | -22% |
| 32768³ | main (XOR swizzle only) | 27.3 ms | 2578 | baseline |
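
The TFLOPS column appears to follow from the mean time under the usual assumption of 2·N³ floating-point operations for an N×N×N matmul. A small sketch of that arithmetic, with made-up helper names, in case anyone wants to recheck the numbers:

```python
def matmul_tflops(n, time_ms):
    """TFLOPS for an n x n x n matmul, assuming 2 * n**3 FLOPs."""
    flops = 2 * n**3
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12


def relative_change_percent(pr_value, main_value):
    """Signed change of the PR value relative to main, in percent."""
    return (pr_value / main_value - 1.0) * 100.0


if __name__ == "__main__":
    # (shape, PR mean time in ms, main mean time in ms) from the results above.
    rows = [(8192, 0.522, 0.434), (16384, 4.02, 3.09), (32768, 33.3, 27.3)]
    for n, pr_ms, main_ms in rows:
        pr = matmul_tflops(n, pr_ms)
        main = matmul_tflops(n, main_ms)
        delta = relative_change_percent(pr, main)
        print(f"{n}^3: PR {pr:.0f} TFLOPS, main {main:.0f} TFLOPS, {delta:+.0f}%")
```

By this arithmetic the 32768³ row comes out near -18% in TFLOPS; the reported -22% matches the increase in mean time (33.3 ms vs 27.3 ms) instead.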

@lialan
Contributor Author

lialan commented Mar 17, 2026

Looking into the reason for the regression.

@lialan lialan changed the base branch from main to users/lialan/lower_dma_when_scaled March 17, 2026 20:04
@lialan lialan force-pushed the users/lialan/lower_dma_when_scaled branch from 6ab1f22 to 5c175cf Compare March 17, 2026 20:30
@lialan lialan force-pushed the users/lialan/swizzle_dma branch 2 times, most recently from 3f861f5 to cefbf84 Compare March 18, 2026 14:37
lialan and others added 9 commits March 18, 2026 14:25
* For now, remove the blanket guard that disabled DMA for all scaled matmuls.
* When DMA is manually enabled, XOR swizzle is disabled (for now).
* Use DMA (UseGlobalLoadDMAAttr) for LHS/RHS operands.
* Fix lowering of DMA copy.
Revert destination indices from divergent (srcLinearOffset) back to
subgroup-uniform (linearOffsetVal). The gather_to_lds op contract
specifies that only lane 0's dstIndices are used, so the dst base
must be uniform. Also add a TODO in the scaled matmul DMA pipeline
test noting that gather_to_lds is not yet produced for scaled operands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MX tests require static shapes. The "small" shape set includes dynamic
shapes by default, so explicitly pass --mnk_dynamicities=static,static,static.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a SwizzleHintOp is used as the destination of a gather_to_lds via
reshape or subview ops (expand_shape/collapse_shape/subview), the swizzle
is applied on the source side in the DMA lowering pass. These reshape ops
just pass through the swizzled allocation and should be treated as
transparent users rather than unsupported ones.

This fixes a compiler crash in the scaled matmul DMA path where:
  alloc -> swizzle_hint -> expand_shape -> gather_to_lds.dst

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace LLVMGPUTileAndFuse with #iree_gpu.pipeline<TileAndFuse> to
match the migration in #23816.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lialan lialan force-pushed the users/lialan/lower_dma_when_scaled branch from 4518445 to b027b1e Compare March 18, 2026 21:46
lialan and others added 2 commits March 19, 2026 08:41
…led destinations

* Add swizzle detection helper that traces destination memref through
expand_shape/collapse_shape/subview to find SwizzleHintOp.
* Apply the swizzle attribute's offset transformation to source linear offsets
in the `gather_to_lds` lowering.
* XOR swizzle is self-inverse, so applying
it to source addresses produces the correct swizzled layout in LDS
without violating gather_to_lds's uniform-destination constraint.
* Add pipeline tests and E2E tests to make sure it works.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The base branch had plain DMA CHECK-DIRECT-LOAD checks that were superseded
by the swizzle_dma branch's swizzled format checks. Remove the duplicate stale
checks that fail with the new swizzle-enabled config.
@lialan lialan force-pushed the users/lialan/swizzle_dma branch from 56b7c30 to c4f0eee Compare March 19, 2026 15:42
@lialan lialan marked this pull request as ready for review March 19, 2026 15:58
@lialan lialan force-pushed the users/lialan/swizzle_dma branch from ae31340 to 3d21fd2 Compare March 19, 2026 16:09
@Yu-Zhewen
Contributor

> Looking into the reason for the regression.

For BF16, I noticed that multi-buffering does not work with swizzling (#23919). This might be the same issue here.

Contributor

@krzysz00 krzysz00 left a comment

I don't have correctness concerns, but I don't think I'd want to land this with those perf regressions. Maybe we should investigate that multibuffering issue?

@Yu-Zhewen
Contributor

> I don't have correctness concerns, but I don't think I'd want to land this with those perf regressions. Maybe we should investigate that multibuffering issue?

Agree. I’m currently looking into it.

Comment thread compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.cpp Outdated
@lialan
Contributor Author

lialan commented Apr 13, 2026

The plan is to split this PR into smaller ones.

@lialan lialan force-pushed the users/lialan/lower_dma_when_scaled branch from b027b1e to 11e2260 Compare April 13, 2026 18:29
@lialan lialan force-pushed the users/lialan/lower_dma_when_scaled branch 3 times, most recently from cb09fe7 to 4c416a0 Compare April 21, 2026 18:11