8Wave PP race-free kernel w/o double barrier #1123
Open
adedespirlet wants to merge 19 commits into iree-org:main from
Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>
This PR:
Optimizes the MXFP 8w schedule with respect to counters and memory ops.
Moves the merge_contiguous_reads pass earlier in the pipeline so that the schedule can reason about merged read widths and counts. Previously, this pass ran after the scheduling pass.
Unrolls the kernel loop 2x to remove the forced vmcnt(0) + v_mov copies at the end of the loop.
Without unrolling, scale loads and scale consumption overlap within the same iteration, forcing the compiler to load into temporary VGPRs and copy them back to the registers of the loop's iter_args at the end of the loop (vmcnt(0) + v_mov). With 2x unrolling, odd and even iterations alternate between two scale register sets, so loads target already "dead" registers directly. This avoids both the copies and the vmcnt(0), which hurt performance. Scale waits now happen right before the MFMAs, maximizing latency hiding.
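The alternation described above can be modeled in a few lines of Python. This is only a conceptual sketch, not the kernel's schedule: `consume_scales_pingpong`, `scale_tiles`, and the two-slot `regs` list are hypothetical names standing in for the scale loads and the two scale register sets.

```python
def consume_scales_pingpong(scale_tiles):
    """Model a 2x-unrolled loop that alternates two scale register sets.

    Returns the tiles in the order the (modeled) MFMAs consume them.
    """
    if not scale_tiles:
        return []
    regs = [None, None]        # two hypothetical scale register sets
    regs[0] = scale_tiles[0]   # prologue: first load before the loop
    consumed = []
    n = len(scale_tiles)
    for i in range(0, n, 2):   # one trip = one even + one odd iteration
        # Even half: prefetch into set 1, which the previous odd half
        # has already consumed (it is "dead"), then consume set 0.
        if i + 1 < n:
            regs[1] = scale_tiles[i + 1]
        consumed.append(regs[0])
        # Odd half: set 0 is dead now, so the next load targets it
        # directly; no temporary registers or copies are needed.
        if i + 1 < n:
            if i + 2 < n:
                regs[0] = scale_tiles[i + 2]
            consumed.append(regs[1])
    return consumed
```

In this model every load writes straight into the set that will be read one half-iteration later, which is the property that removes the end-of-loop copies.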
Fixes the emission of costly divisions in dynamic kernels.
When the divisor is dynamic, the LLVM backend cannot emit efficient bit shifts or a Barrett reduction as it does for static constants. Instead, it falls back to expensive integer division (~150 extra VALU ops per division). To prevent this, this PR replaces the address-computation divisions inside each loop iteration with the "Barrett reduction / magic number trick".
A "magic number" = ceil(2^32 / d) is precomputed once per dynamic divisor before the loop, and each in-loop division is replaced by a single v_mul_hi_u32 plus a few ALU ops, effectively eliminating expensive divisions from the loop body.
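As an illustration of the magic number trick, here is a minimal Python sketch. It assumes unsigned 32-bit dividends and divisors d >= 2 (so the magic number fits in 32 bits); the function names are hypothetical, and the single correction step shown is one possible form of the "few ALU ops", not necessarily the sequence this PR emits.

```python
MASK32 = 0xFFFFFFFF

def precompute_magic(d):
    """Compute ceil(2^32 / d) once, outside the loop."""
    if not 2 <= d <= MASK32:
        raise ValueError("sketch assumes 2 <= d < 2^32")
    return ((1 << 32) + d - 1) // d

def mul_hi_u32(a, b):
    """High 32 bits of the 64-bit product, like v_mul_hi_u32."""
    return ((a * b) >> 32) & MASK32

def magic_div(n, d, magic):
    """Return n // d for 0 <= n < 2^32 without a division."""
    q = mul_hi_u32(n, magic)
    # Because magic is rounded up, the estimate is either exact or
    # one too large; a compare and subtract corrects it. (Python ints
    # do not overflow; hardware would need care with q * d here.)
    if q * d > n:
        q -= 1
    return q
```

A caller would compute `magic = precompute_magic(d)` once before the loop and then call `magic_div(n, d, magic)` for each address computation inside it.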
Deletes the bounds expression attached to gather_to_lds ops when dealing with dynamic values, based on the Assumption given to the compiler. This prevents the insertion of costly masking logic.
Fixes the addition of fractions with a common denominator. When the compiler emitted IR for expressions containing multiple fraction additions, it defaulted to cross-multiplication even when the denominator was the same across all fractions. It now identifies when the denominators are identical and directly adds the corresponding numerators instead.
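The common-denominator shortcut can be sketched as follows. This is an illustrative model on plain integers, not the compiler's code: `add_fractions` and the `(numerator, denominator)` tuple representation are assumptions made for the example.

```python
def add_fractions(terms):
    """Add (numerator, denominator) pairs without reducing the result."""
    if not terms:
        raise ValueError("need at least one fraction")
    denominators = {den for _, den in terms}
    if len(denominators) == 1:
        # Fast path: identical denominators, so just add numerators.
        return sum(num for num, _ in terms), terms[0][1]
    # General path: cross-multiply pairwise, which grows the expression.
    num, den = terms[0]
    for next_num, next_den in terms[1:]:
        num = num * next_den + next_num * den
        den = den * next_den
    return num, den
```

For three fractions over 8, the fast path yields a denominator of 8, whereas cross-multiplication would produce 8 * 8 * 8 = 512 along with correspondingly larger numerator terms.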