8Wave PP race-free kernel w/o double barrier by adedespirlet · Pull Request #1123 · iree-org/wave

adedespirlet · 2026-03-12T20:10:32Z

This PR:

Optimizes the MXFP 8-wave schedule with respect to counters and memory ops.
Moves the merge_contiguous_reads pass earlier in the pipeline so that the schedule can reason about merged read widths and counts. Previously this pass ran after the scheduling pass.
Unrolls the kernel twice to remove the forced vmcnt(0) + v_mov copies at the end of the loop.
Without unrolling, scale loads and scale consumption overlap within the same iteration, forcing the compiler to load into temporary VGPRs and copy them back to the registers of the loop's iter_args at the end of the loop (vmcnt(0) + v_mov). With 2x unrolling, odd and even iterations alternate between scale register sets, so loads target already "dead" registers directly. This avoids both the copies and the vmcnt(0), which hurt performance. Scale waits now happen right before the MFMAs, maximizing latency hiding.
Fixes costly division emission for dynamic kernels.
For dynamic divisors, the LLVM backend cannot emit efficient bit shifts or Barrett reduction as it does for static constants. Instead it falls back to expensive integer division (~150 extra VALU ops per division). To prevent this, this PR replaces the address-computation divisions inside each loop iteration with the "Barrett reduction / magic number" trick.
A "magic number" = ceil(2^32 / d) is precomputed once per dynamic divisor before the loop, and each in-loop division is replaced by a single v_mul_hi_u32 plus a few ALU ops, effectively eliminating expensive divisions from the loop body.
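The magic-number division described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code emitted by this PR: the helper names (magic_number, mul_hi_u32, div_by_magic) are hypothetical, the operands are assumed to be unsigned 32-bit values, and the single correction step is one possible choice for the "few ALU ops".

```python
MASK32 = (1 << 32) - 1


def magic_number(d: int) -> int:
    """Precompute ceil(2^32 / d) once per divisor (hypothetical helper).

    Note: for d == 1 the result is 2^32, which does not fit in a 32-bit
    register; a real kernel would have to handle that case separately.
    """
    if d < 1 or d > MASK32:
        raise ValueError("divisor must be in [1, 2^32 - 1]")
    return -(-(1 << 32) // d)  # ceiling division on integers


def mul_hi_u32(a: int, b: int) -> int:
    """Model of a 32x32 -> high-32-bit multiply (what v_mul_hi_u32 computes)."""
    return (a * b) >> 32


def div_by_magic(n: int, d: int, magic: int) -> int:
    """Compute n // d for 0 <= n < 2^32 without a division instruction.

    Because magic >= 2^32 / d, the estimate is either the exact quotient
    or one too large, so a single compare-and-subtract fixes it up.
    """
    q = mul_hi_u32(n, magic)
    if q * d > n:  # estimate overshot by one
        q -= 1
    return q


if __name__ == "__main__":
    d = 192                   # dynamic divisor, known only at run time
    magic = magic_number(d)   # computed once, before the loop
    for n in (0, 191, 192, 12345, MASK32):
        print(n, div_by_magic(n, d, magic), n // d)
```

The precomputation still performs one real division, but it sits outside the loop, so the loop body only pays for the multiply and the correction.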
Deletes the bounds expression attached to gather_to_lds ops when dealing with dynamic values, based on an Assumption given to the compiler. This prevents the insertion of costly masking logic.
When the compiler emitted IR for expressions containing multiple fraction additions, it defaulted to cross-multiplication even when the denominator was the same across all fractions. This is now fixed: the compiler identifies when the denominators are identical and directly adds the corresponding numerators instead.
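As a rough illustration of the same-denominator shortcut, the sketch below adds fractions represented as (numerator, denominator) pairs of plain integers. The function name add_fractions and this representation are assumptions made for the example; the compiler works on symbolic expressions, and its actual code is not shown here.

```python
from typing import List, Tuple

Fraction = Tuple[int, int]  # (numerator, denominator), illustrative only


def add_fractions(fracs: List[Fraction]) -> Fraction:
    """Add fractions, skipping cross-multiplication when it is unnecessary."""
    if not fracs:
        raise ValueError("need at least one fraction")

    first_den = fracs[0][1]
    if all(den == first_den for _, den in fracs):
        # Fast path: identical denominators, so just add the numerators.
        return (sum(num for num, _ in fracs), first_den)

    # Fallback: pairwise cross-multiplication, a/b + c/d = (a*d + c*b) / (b*d).
    num, den = fracs[0]
    for next_num, next_den in fracs[1:]:
        num = num * next_den + next_num * den
        den = den * next_den
    return (num, den)


if __name__ == "__main__":
    print(add_fractions([(1, 4), (2, 4)]))  # fast path keeps denominator 4
    print(add_fractions([(1, 2), (1, 3)]))  # fallback cross-multiplies
```

With the fast path, the denominator stays as it was instead of growing into a product, which keeps the emitted arithmetic smaller.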

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adedespirlet force-pushed the 8wavepingpong branch 2 times, most recently from a75b940 to 063261e on March 13, 2026 22:31

adedespirlet added 4 commits March 16, 2026 18:07

add race free optimization wo double barrier

46a8be2

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

enable minimize shared allocs when conditional

357d98f

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

optimize schedule

2d36272

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

cleaning

2dea448

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adedespirlet force-pushed the 8wavepingpong branch from ee7cd80 to 73de1db on March 17, 2026 23:31

harsh-nod requested review from panditsa and xintin March 24, 2026 22:06

adedespirlet added 12 commits March 26, 2026 23:45

Revert unintended compile.py changes from 063261e

30ef3ed

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

cleaning

36f11ac

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

disable wave runtime

e77a66d

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adding assumption to prevent masking in gathertoshared

c6d9e84

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

unroll to prevent early counters + magic number logic for dynamic kernel

1a635de

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

cleaning

181872d

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

optimization for 256x192

718970a

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

detects when denomitor is the same when adding fractions

d598a01

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

opt

c27079d

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

cast to bf16

4969282

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

shuffle for 256x192

a475f6a

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

hardcode values instead of moving passs aorund

de0069f

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adedespirlet force-pushed the 8wavepingpong branch from aa1e909 to de0069f on March 26, 2026 23:46

adedespirlet and others added 3 commits March 28, 2026 00:06

add transposed kernel

ca6fdbd

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

add transposed dwordx4kernel

ac5ae80

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

Merge branch 'main' into 8wavepingpong

d3c7564

adedespirlet wants to merge 19 commits into iree-org:main from adedespirlet:8wavepingpong
