[DMA][Swizzle] Enable LDS DMA with swizzling by lialan · Pull Request #23807 · iree-org/iree

lialan · 2026-03-17T14:23:33Z

Add swizzle detection helper that traces destination memref through expand_shape/collapse_shape/subview to find SwizzleHintOp.
Apply the swizzle attribute's offset transformation to source linear offsets in the gather_to_lds lowering.
XOR swizzle is self-inverse, so applying it to source addresses produces the correct swizzled layout in LDS without violating gather_to_lds's uniform-destination constraint.
Add pipeline tests and E2E tests.
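
A minimal sketch of why the self-inverse property matters, under stated assumptions: the `xor_swizzle` formula, `ROW_ELEMS`, and `ACCESS_ELEMS` below are illustrative stand-ins, not the actual IREE swizzle attribute. It checks that scattering to swizzled LDS slots and gathering from swizzled source offsets into linear LDS slots give the same layout.

```python
# Toy XOR swizzle over a row-major tile. The formula and sizes are
# illustrative assumptions, not IREE's actual swizzle attribute.
ROW_ELEMS = 64      # elements per row (assumed)
ACCESS_ELEMS = 8    # elements per swizzled group (assumed)
GROUPS = ROW_ELEMS // ACCESS_ELEMS


def xor_swizzle(offset: int) -> int:
    """Permute the group index within a row by XOR-ing it with the row index."""
    row, col = divmod(offset, ROW_ELEMS)
    group, lane = divmod(col, ACCESS_ELEMS)
    swizzled_group = group ^ (row % GROUPS)
    return row * ROW_ELEMS + swizzled_group * ACCESS_ELEMS + lane


n = ROW_ELEMS * GROUPS
src = list(range(1000, 1000 + n))

# Destination-side swizzle: element i is scattered to LDS slot swizzle(i).
lds_dst_side = [None] * n
for i in range(n):
    lds_dst_side[xor_swizzle(i)] = src[i]

# Source-side swizzle: LDS slot j is filled linearly from src[swizzle(j)].
lds_src_side = [src[xor_swizzle(j)] for j in range(n)]

is_involution = all(xor_swizzle(xor_swizzle(i)) == i for i in range(n))
same_layout = lds_dst_side == lds_src_side
print(is_involution, same_layout)
```

Because the permutation is its own inverse, the writes can stay linear on the destination side while only the source offsets are permuted.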

lialan · 2026-03-17T16:14:01Z

Initial benchmark results on MI355X without tuning:

Shape	Branch	Time (mean)	TFLOPS	vs main
8192³	PR (XOR swizzle + DMA)	0.522 ms	2106	-17%
8192³	main (XOR swizzle only)	0.434 ms	2533	baseline
16384³	PR (XOR swizzle + DMA)	4.02 ms	2188	-23%
16384³	main (XOR swizzle only)	3.09 ms	2847	baseline
32768³	PR (XOR swizzle + DMA)	33.3 ms	2113	-18%
32768³	main (XOR swizzle only)	27.3 ms	2578	baseline
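
The TFLOPS column appears to follow from counting 2·N³ floating-point operations for an N×N×N matmul; that FLOP count is an assumption, since the thread does not state how the figure was derived. A short sketch that recomputes throughput and the relative change from the reported mean times:

```python
def matmul_tflops(n: int, time_ms: float) -> float:
    """Throughput of an n x n x n matmul, assuming 2 * n^3 FLOPs."""
    flops = 2 * n**3
    return flops / (time_ms * 1e-3) / 1e12


# (N, PR mean time in ms, main mean time in ms), copied from the table above.
rows = [
    (8192, 0.522, 0.434),
    (16384, 4.02, 3.09),
    (32768, 33.3, 27.3),
]

for n, pr_ms, main_ms in rows:
    pr = matmul_tflops(n, pr_ms)
    main = matmul_tflops(n, main_ms)
    delta = pr / main - 1.0  # relative throughput change vs main
    print(f"{n}^3: PR {pr:.0f} TFLOPS, main {main:.0f} TFLOPS, {delta:+.0%}")
```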

lialan · 2026-03-17T17:05:38Z

looking into the reason of regression.

Revert destination indices from divergent (srcLinearOffset) back to subgroup-uniform (linearOffsetVal). The gather_to_lds op contract specifies that only lane 0's dstIndices are used, so the dst base must be uniform.

Also add a TODO in the scaled matmul DMA pipeline test noting that gather_to_lds is not yet produced for scaled operands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
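
A toy model of the uniform-destination constraint described in this commit message. `gather_to_lds_model` is a hypothetical simplification (lane i stores at lane 0's destination index plus i), not the real op semantics in full; it only illustrates why divergent destination indices are silently ignored while divergent source indices take effect.

```python
SUBGROUP_SIZE = 4  # kept small for readability (assumed)


def gather_to_lds_model(src, lds, src_indices, dst_indices):
    """Toy model: lane i loads src[src_indices[i]]; the store base comes
    from lane 0's dst index only, and lane i lands at base + i."""
    base = dst_indices[0]
    for lane in range(SUBGROUP_SIZE):
        lds[base + lane] = src[src_indices[lane]]


src = [10, 11, 12, 13]
permuted = [1, 0, 3, 2]           # per-lane divergent offsets, e.g. a swizzle
intended = [11, 10, 13, 12]       # layout we want in LDS slots 0..3

# Divergent destinations: only lane 0's index (1) is honored, so the data
# lands unpermuted and shifted by one slot.
lds_divergent_dst = [None] * 8
gather_to_lds_model(src, lds_divergent_dst, [0, 1, 2, 3], permuted)

# Uniform destination, divergent sources: the permutation takes effect.
lds_divergent_src = [None] * 8
gather_to_lds_model(src, lds_divergent_src, permuted, [0, 0, 0, 0])

print(lds_divergent_dst[:5], lds_divergent_src[:4])
```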

When a SwizzleHintOp is used as the destination of a gather_to_lds via reshape or subview ops (expand_shape/collapse_shape/subview), the swizzle is applied on the source side in the DMA lowering pass. These reshape ops just pass through the swizzled allocation and should be treated as transparent users rather than unsupported ones.

This fixes a compiler crash in the scaled matmul DMA path where:

    alloc -> swizzle_hint -> expand_shape -> gather_to_lds.dst

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
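
The "transparent user" idea can be sketched as a walk up the producer chain. The `Op` class and `find_swizzle_hint` function below are hypothetical illustrations in Python, not the C++ helper added by this PR; only the set of pass-through op names is taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Ops that only reshape or slice the underlying allocation.
TRANSPARENT_OPS = {"expand_shape", "collapse_shape", "subview"}


@dataclass
class Op:
    """Hypothetical stand-in for an IR op with a single memref source."""
    name: str
    source: Optional["Op"] = None


def find_swizzle_hint(dst: Optional[Op]) -> Optional[Op]:
    """Walk producers through transparent ops until a swizzle_hint is found."""
    op = dst
    while op is not None:
        if op.name == "swizzle_hint":
            return op
        if op.name not in TRANSPARENT_OPS:
            return None  # any other op ends the search
        op = op.source
    return None


alloc = Op("alloc")
hint = Op("swizzle_hint", alloc)
expanded = Op("expand_shape", hint)
view = Op("subview", expanded)

found = find_swizzle_hint(view)
not_found = find_swizzle_hint(Op("expand_shape", alloc))
print(found is hint, not_found)
```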

…led destinations

* Add swizzle detection helper that traces destination memref through expand_shape/collapse_shape/subview to find SwizzleHintOp.
* Apply the swizzle attribute's offset transformation to source linear offsets in the `gather_to_lds` lowering.
* XOR swizzle is self-inverse, so applying it to source addresses produces the correct swizzled layout in LDS without violating gather_to_lds's uniform-destination constraint.
* Add pipeline tests and E2E tests to make sure it works.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Yu-Zhewen · 2026-03-24T12:05:52Z

> looking into the reason of regression.

For BF16, I noticed that multi-buffering does not work with swizzling (#23919). This might be the same issue here.

krzysz00

I don't have correctness concerns, but I don't think I'd want to land this with those perf regressions. Maybe we should investigate that multibuffering issue?

Yu-Zhewen · 2026-03-24T22:50:15Z

> I don't have correctness concerns, but I don't think I'd want to land this with those perf regressions. Maybe we should investigate that multibuffering issue?

Agree. I’m currently looking into it.

lialan · 2026-04-13T16:32:19Z

The plan is to split this PR into smaller ones.

#24094 ([DMA][Swizzle] Apply inverse source swizzle in DMA lowering), which only handles the inverse swizzle, will go in first.

lialan changed the title ~~[DMA][Swizzle] Apply inverse source swizzle in DMA lowering for swizzled destinations~~ [DMA][Swizzle] Enable LDS DMA with swizzling Mar 17, 2026

lialan force-pushed the users/lialan/swizzle_dma branch from c42cd2c to aed9f22 Compare March 17, 2026 15:08

lialan changed the base branch from main to users/lialan/lower_dma_when_scaled March 17, 2026 20:04

lialan force-pushed the users/lialan/lower_dma_when_scaled branch from 6ab1f22 to 5c175cf Compare March 17, 2026 20:30

lialan force-pushed the users/lialan/swizzle_dma branch 2 times, most recently from 3f861f5 to cefbf84 Compare March 18, 2026 14:37

lialan and others added 9 commits March 18, 2026 14:25

[Codegen] Use DMA for LHS/RHS only in scaled matmul

904b2fa

* For now, remove the blanket guard that disabled DMA for all scaled matmuls.
* When DMA is manually enabled, XOR swizzle is disabled (for now).
* Use DMA (UseGlobalLoadDMAAttr) for LHS/RHS operands.
* Fix lowering of DMA copy.

Fix DMA pre-check incorrectly upgrading scale operand copies

6dc7236

Change to small test.

8a8c396

[e2e] Fix MX DMA test to use static-only dynamicity

eb751c7

MX tests require static shapes. The "small" shape set includes dynamic dynamicities by default, so explicitly pass --mnk_dynamicities=static,static,static.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix mi355 issue.

5fd8517

Fix clang-format

a13a4ee

[Codegen] Update pipeline keyword after rebase

b027b1e

Replace LLVMGPUTileAndFuse with #iree_gpu.pipeline<TileAndFuse> to match the migration in #23816.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

lialan force-pushed the users/lialan/lower_dma_when_scaled branch from 4518445 to b027b1e Compare March 18, 2026 21:46

lialan and others added 2 commits March 19, 2026 08:41

Fix rebase conflict: remove stale CHECK-DIRECT-LOAD checks.

c4f0eee

The base branch had plain DMA CHECK-DIRECT-LOAD checks that were superseded by the swizzle_dma branch's swizzled format checks. Remove the duplicate stale checks that fail with the new swizzle-enabled config.

lialan force-pushed the users/lialan/swizzle_dma branch from 56b7c30 to c4f0eee Compare March 19, 2026 15:42

lialan marked this pull request as ready for review March 19, 2026 15:58

lialan requested review from Groverkss, Max191, krzysz00, kuhar, nirvedhmeshram and qedawkins as code owners March 19, 2026 15:58

Fix pipeline keyword format intest after rebase.

3d21fd2

lialan force-pushed the users/lialan/swizzle_dma branch from ae31340 to 3d21fd2 Compare March 19, 2026 16:09

Yu-Zhewen reviewed Mar 21, 2026

Comment thread compiler/src/iree/compiler/Codegen/Common/GPU/AMDGPULowerCoalescedDMAToGatherLDS.cpp Outdated

Yu-Zhewen mentioned this pull request Mar 23, 2026

[GPU] Increased LDS bank conflicts when DMA enabled #23901

Closed

krzysz00 reviewed Mar 24, 2026

Comment thread compiler/src/iree/compiler/Codegen/Common/GPU/AMDGPULowerCoalescedDMAToGatherLDS.cpp Outdated

Comment thread compiler/src/iree/compiler/Codegen/Common/GPU/AMDGPULowerCoalescedDMAToGatherLDS.cpp Outdated

Muzammiluddin-Syed-ECE reviewed Mar 24, 2026

Yu-Zhewen mentioned this pull request Mar 25, 2026

[Codegen] Add XOR swizzle for BF16 matmul with DMA #23932

Merged

Address comments.

90d9ed4

Yu-Zhewen mentioned this pull request Apr 2, 2026

[Codegen] Absorb SwizzleHintOp into alloc attribute for pipelining #23945

Closed

Yu-Zhewen mentioned this pull request Apr 10, 2026

[GPU] Tracking issue for GEMM DMA enablement #24078

Open

16 tasks

lialan force-pushed the users/lialan/lower_dma_when_scaled branch from b027b1e to 11e2260 Compare April 13, 2026 18:29

lialan requested review from MaheshRavishankar, bjacob and hanhanW as code owners April 13, 2026 18:29

lialan force-pushed the users/lialan/lower_dma_when_scaled branch 3 times, most recently from cb09fe7 to 4c416a0 Compare April 21, 2026 18:11

lialan wants to merge 13 commits into users/lialan/lower_dma_when_scaled from users/lialan/swizzle_dma
