
8Wave PP race-free kernel w/o double barrier #1123

Open
adedespirlet wants to merge 19 commits into iree-org:main from adedespirlet:8wavepingpong

@adedespirlet adedespirlet commented Mar 12, 2026

This PR:

  • Optimizes the MXFP 8-wave (8w) schedule with respect to counters and memory ops.

  • Moves the merge_contiguous_reads pass earlier in the pipeline so that the schedule can reason about merged read widths and counts. Previously, this pass ran after the scheduling pass.

  • Unrolls the kernel twice to remove the forced vmcnt(0) + v_mov copies at the end of the loop.
    Without unrolling, scale loads and scale consumption overlap within the same iteration, forcing the compiler to load into temporary VGPRs and copy them back to the registers of the loop's iter_args at the end of the loop (vmcnt(0) + v_mov). With 2x unrolling, odd and even iterations alternate between two scale register sets, so loads directly target registers that are already "dead". This avoids the copies and the vmcnt(0), which hurt performance. Scale waits now happen right before the MFMAs, maximizing latency hiding.

  • Fixes the emission of costly divisions in dynamic kernels.
    When the divisor is dynamic, the LLVM backend cannot emit the efficient bit shifts or Barrett reduction that it uses for static constants. Instead, it falls back to expensive integer division (~150 extra VALU ops per division). To prevent this, this PR replaces the address-computation divisions inside each loop iteration with the "Barrett reduction / magic number" trick.
    A "magic number" = ceil(2^32 / d) is precomputed once per dynamic divisor before the loop, and each in-loop division is replaced by a single v_mul_hi_u32 plus a few ALU ops, effectively eliminating expensive divisions from the loop body.

  • Deletes the bounds expressions attached to gather_to_lds ops when dealing with dynamic values, based on the assumptions given to the compiler. This prevents the insertion of costly masking logic.

  • When the compiler emitted IR for expressions containing additions of multiple fractions, it defaulted to cross-multiplication even when the denominator was the same across all fractions. This is now fixed: it identifies when the denominators are identical and directly adds the corresponding numerators instead.
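
For readers unfamiliar with the magic number division mentioned above, here is a minimal Python sketch of the arithmetic. It is an illustration only, not the code this PR emits: the helper names (`magic_for`, `mul_hi_u32`, `div_by_magic`) are hypothetical, the divisor is assumed to lie in [2, 2^32 - 1] so that the magic number fits in 32 bits, and the single correction step is just one possible reading of "a few ALU ops".

```python
MASK32 = 0xFFFFFFFF


def magic_for(d: int) -> int:
    """Precompute ceil(2**32 / d) once per divisor (hypothetical helper).

    Assumes 2 <= d <= 2**32 - 1; for d == 1 the value would not fit in 32 bits.
    """
    if not 2 <= d <= MASK32:
        raise ValueError("divisor must be in [2, 2**32 - 1]")
    return (2**32 + d - 1) // d


def mul_hi_u32(a: int, b: int) -> int:
    """Upper 32 bits of the 64-bit product of two unsigned 32-bit values."""
    return ((a & MASK32) * (b & MASK32)) >> 32


def div_by_magic(n: int, d: int, magic: int) -> int:
    """Compute n // d for an unsigned 32-bit n without a division."""
    q = mul_hi_u32(n, magic)
    # Because magic is rounded up, the estimate is either exact or one too
    # large, so a single compare-and-decrement is enough to correct it.
    # (Python integers do not overflow; a 32-bit implementation would have
    # to make sure the q * d comparison does not wrap.)
    if q * d > n:
        q -= 1
    return q


# Usage: the magic number is computed once, outside the loop.
d = 96
magic = magic_for(d)
quotients = [div_by_magic(n, d, magic) for n in range(0, 1000, 50)]
```

The point of the trick is that `magic_for` runs once per dynamic divisor, while the loop body only needs a high multiply, a multiply, a compare, and a subtract.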

@adedespirlet adedespirlet force-pushed the 8wavepingpong branch 2 times, most recently from a75b940 to 063261e on March 13, 2026 22:31
Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>
adedespirlet and others added 3 commits March 28, 2026 00:06
Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>