Sched group #2186
Conversation
Pull request overview
Adds a new Rock transform pass to insert AMDGPU scheduling barriers (“sched group” barriers) based on analysis of scf.for loop bodies, and wires it into the default kernel pipeline with accompanying integration/unit tests.
Changes:
- Introduce `rock-add-sched-group-barriers` pass that analyzes memory ops + MFMA/WMMA counts and injects `amdgpu.sched_barrier` and `rocdl.sched.group.barrier`.
- Integrate the new pass into the ROCmLIR kernel pipeline immediately after `rock-buffer-load-merge`.
- Add MLIR tests covering both pipeline-level behavior and pre-lowered IR edge cases.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| mlir/lib/Dialect/Rock/Transforms/AddSchedGroupBarriers.cpp | Implements the new analysis + barrier insertion transform. |
| mlir/include/mlir/Dialect/Rock/Passes.td | Declares the new pass and its documentation/dependent dialects. |
| mlir/include/mlir/Dialect/Rock/Passes.h | Exposes the generated pass declaration macro. |
| mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt | Adds the new source file to the Rock transforms library build. |
| mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp | Inserts the pass into the kernel pipeline sequence. |
| mlir/test/rocmlir-driver/pipelines.mlir | Updates expected pipeline listing to include the new pass. |
| mlir/test/Dialect/Rock/add_sched_group_barriers.mlir | End-to-end pipeline test validating barrier insertion/skips. |
| mlir/test/Dialect/Rock/add_sched_group_barriers_unit.mlir | Unit tests for edge cases on pre-lowered IR. |
```cpp
/// Get the trip count of an affine.for loop, returns 1 if unknown
static uint64_t getAffineForTripCount(affine::AffineForOp affineFor) {
  std::optional<uint64_t> tripCount = affine::getConstantTripCount(affineFor);
  if (tripCount.has_value()) {
    return tripCount.value();
  }
  LLVM_DEBUG(llvm::dbgs()
             << "WARNING: could not determine trip count for affine.for at "
             << affineFor.getLoc() << ", defaulting to 1\n");
  return 1;
}
```
getAffineForTripCount() defaults unknown affine trip counts to 1, which can significantly undercount ops inside affine loops. That can cause this pass to insert sched-group barriers even when the real per-iteration instruction count is much larger (including violating the >25 MFMA cutoff), potentially increasing backend compile time unexpectedly. Consider treating unknown trip counts as a reason to skip barrier insertion (or conservatively assume a large trip count), and/or add overflow-safe multiplication in computeAffineLoopMultiplier().
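The overflow-safe multiplication suggested above can be sketched without any MLIR dependencies. `saturatingMul` is a hypothetical helper name; in-tree, `llvm::SaturatingMultiply` from `llvm/Support/MathExtras.h` provides equivalent behavior.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical overflow-safe multiply for accumulating nested trip counts.
// Saturates at UINT64_MAX instead of wrapping, so a huge trip count cannot
// silently wrap back under the >25 MFMA cutoff.
static uint64_t saturatingMul(uint64_t a, uint64_t b) {
  uint64_t result;
  if (__builtin_mul_overflow(a, b, &result)) // GCC/Clang builtin
    return std::numeric_limits<uint64_t>::max();
  return result;
}
```

Saturating (rather than skipping) keeps the pass conservative: an implausibly large product still disqualifies the loop via the MFMA cutoff instead of silently re-qualifying it.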
```cpp
// For the remaining single-loop functions, barriers are only inserted if:
// - The loop uses double buffering (LDS reads/writes use arith.select).
// - The loop does not use direct-to-LDS loads (amdgpu.gather_to_lds or
//   amdgpu.async_load_to_lds).
// - The loop has at most one rock.lds_barrier (excludes attention kernels).
// - The loop contains at least one global load and one matrix multiply op.
// - The number of matrix multiply ops per iteration does not exceed 25.
```
The header comment says barriers are only inserted for “remaining single-loop functions”, but the implementation processes every scf.for in any function that merely lacks nested scf.for loops. Either enforce the single-loop restriction or update the comment/Passes.td description to match actual behavior.
Suggested change:

```diff
-// For the remaining single-loop functions, barriers are only inserted if:
-// - The loop uses double buffering (LDS reads/writes use arith.select).
-// - The loop does not use direct-to-LDS loads (amdgpu.gather_to_lds or
-//   amdgpu.async_load_to_lds).
-// - The loop has at most one rock.lds_barrier (excludes attention kernels).
-// - The loop contains at least one global load and one matrix multiply op.
-// - The number of matrix multiply ops per iteration does not exceed 25.
+// For the remaining functions, each scf.for loop is a candidate for barrier
+// insertion only if:
+// - The loop uses double buffering (LDS reads/writes use arith.select).
+// - The loop does not use direct-to-LDS loads (amdgpu.gather_to_lds or
+//   amdgpu.async_load_to_lds).
+// - The loop has at most one rock.lds_barrier (excludes attention kernels).
+// - The loop contains at least one global load and one matrix multiply op.
```
```cpp
/// Check if a value is defined by an arith.select operation, which indicates
/// double buffering (selecting between two different LDS buffers)
static bool isDefinedBySelect(Value val) {
  return val.getDefiningOp<arith::SelectOp>() != nullptr;
```
Double-buffering detection currently treats any LDS load/store whose base is defined by an arith.select as “double buffered”, without verifying that the select toggles across iterations or that it selects between two distinct workgroup buffers. This can create false positives (inserting barriers into non-double-buffered loops) and diverges from the stated “alternate between two buffers” requirement. Consider tightening the check (e.g., ensure both select operands are different LDS buffers and the condition depends on the loop iteration/parity).
Suggested change:

```diff
-/// Check if a value is defined by an arith.select operation, which indicates
-/// double buffering (selecting between two different LDS buffers)
-static bool isDefinedBySelect(Value val) {
-  return val.getDefiningOp<arith::SelectOp>() != nullptr;
+/// Check if a value is defined by an arith.select operation that selects
+/// between two *distinct* LDS (workgroup) buffers. This is used as a proxy
+/// for detecting double buffering.
+static bool isDefinedBySelect(Value val) {
+  auto selectOp = val.getDefiningOp<arith::SelectOp>();
+  if (!selectOp)
+    return false;
+  // Require that the selected value itself is an LDS memref.
+  auto resultMemRefTy = dyn_cast<MemRefType>(val.getType());
+  if (!resultMemRefTy || !hasWorkgroupAddressSpace(resultMemRefTy))
+    return false;
+  Value trueVal = selectOp.getTrueValue();
+  Value falseVal = selectOp.getFalseValue();
+  // The two buffers must be distinct to represent real double buffering.
+  if (trueVal == falseVal)
+    return false;
+  auto trueMemRefTy = dyn_cast<MemRefType>(trueVal.getType());
+  auto falseMemRefTy = dyn_cast<MemRefType>(falseVal.getType());
+  if (!trueMemRefTy || !falseMemRefTy)
+    return false;
+  // Both operands must be LDS memrefs.
+  if (!hasWorkgroupAddressSpace(trueMemRefTy) ||
+      !hasWorkgroupAddressSpace(falseMemRefTy))
+    return false;
+  return true;
```
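For context, the double-buffering pattern such a tightened check targets is roughly this shape of IR (hypothetical and simplified; the buffer names, element type, and size are invented):

```mlir
// %parity flips each iteration, so the select alternates between two
// distinct LDS (workgroup) allocations.
%buf = arith.select %parity, %ldsA, %ldsB
    : memref<4096xf16, #gpu.address_space<workgroup>>
```

A false positive would be a select whose two operands are the same buffer, or whose condition is loop-invariant; neither actually alternates buffers across iterations.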
```cpp
  uint64_t dsReadsPerMFMA = llvm::divideCeil(numDSReads, numMatrixMultiplyOps);
  uint64_t dsWritesPerMFMA =
      llvm::divideCeil(numDSWrites, numMatrixMultiplyOps);
  uint64_t bufferLoadsPerMFMA =
      llvm::divideCeil(numBufferLoads, numMatrixMultiplyOps);
  for (uint64_t i = 0; i < numMatrixMultiplyOps; i++) {
    ROCDL::SchedGroupBarrier::create(builder, loc, 0x008, 1, 0);
    if (numDSWrites > 0) {
      uint64_t count = std::min(dsWritesPerMFMA, numDSWrites);
      ROCDL::SchedGroupBarrier::create(builder, loc, 0x200, count, 0);
      numDSWrites -= count;
    }
    if (numBufferLoads > 0) {
      uint64_t count = std::min(bufferLoadsPerMFMA, numBufferLoads);
      ROCDL::SchedGroupBarrier::create(builder, loc, 0x020, count, 0);
      numBufferLoads -= count;
    }
    if (numDSReads > 0) {
      uint64_t count = std::min(dsReadsPerMFMA, numDSReads);
      ROCDL::SchedGroupBarrier::create(builder, loc, 0x100, count, 0);
      numDSReads -= count;
```
The sched-group barrier masks (0x008/0x200/0x020/0x100) are hard-coded magic numbers. Since they correspond to specific AMDGPU scheduling-class bits, it would be safer/clearer to define named constants (or reuse the AMDGPU sched-barrier enum bit positions) to avoid mistakes and make future changes easier.
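To make that concrete, here is a dependency-free sketch: hypothetical named constants for the four mask bits (values copied from the code above, corresponding to the MFMA/WMMA, VMEM-read, DS-read, and DS-write scheduling classes), plus a pure simulation of the round-robin distribution loop. `planSchedGroups` is an invented name; the real pass emits `rocdl.sched.group.barrier` ops instead of recording pairs.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical named constants for the mask bits used above.
constexpr int32_t kMaskMFMA = 0x008;
constexpr int32_t kMaskVMEMRead = 0x020;
constexpr int32_t kMaskDSRead = 0x100;
constexpr int32_t kMaskDSWrite = 0x200;

// Simulate the round-robin distribution: spread the memory ops evenly
// across the MFMA groups and record the (mask, count) pairs in emission
// order. Requires numMFMA > 0, which the pass guarantees (at least one
// matrix multiply op per candidate loop).
static std::vector<std::pair<int32_t, uint64_t>>
planSchedGroups(uint64_t numMFMA, uint64_t dsReads, uint64_t dsWrites,
                uint64_t bufferLoads) {
  auto divideCeil = [](uint64_t a, uint64_t b) { return (a + b - 1) / b; };
  uint64_t readsPer = divideCeil(dsReads, numMFMA);
  uint64_t writesPer = divideCeil(dsWrites, numMFMA);
  uint64_t loadsPer = divideCeil(bufferLoads, numMFMA);
  std::vector<std::pair<int32_t, uint64_t>> plan;
  for (uint64_t i = 0; i < numMFMA; ++i) {
    plan.emplace_back(kMaskMFMA, 1);
    if (dsWrites > 0) {
      uint64_t n = std::min(writesPer, dsWrites);
      plan.emplace_back(kMaskDSWrite, n);
      dsWrites -= n;
    }
    if (bufferLoads > 0) {
      uint64_t n = std::min(loadsPer, bufferLoads);
      plan.emplace_back(kMaskVMEMRead, n);
      bufferLoads -= n;
    }
    if (dsReads > 0) {
      uint64_t n = std::min(readsPer, dsReads);
      plan.emplace_back(kMaskDSRead, n);
      dsReads -= n;
    }
  }
  return plan;
}
```

One invariant worth asserting in a unit test: every counted memory op lands in exactly one group, so the per-mask totals equal the input counts.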
```
(ROCDL::SchedGroupBarrier) to optimize instruction scheduling on AMD GPUs.

The pass skips functions that contain nested scf.for loops.
For the remaining single-loop functions, barriers are only inserted if:
```
The description says the pass applies to “remaining single-loop functions”, but the implementation only skips functions with nested scf.for and will process multiple top-level scf.for loops in the same function. Please align the description with the implementation (or enforce the single-loop restriction).
Suggested change:

```diff
-For the remaining single-loop functions, barriers are only inserted if:
+For the remaining functions, barriers are only inserted if:
```
```cpp
  for (uint64_t i = 0; i < numMatrixMultiplyOps; i++) {
    ROCDL::SchedGroupBarrier::create(builder, loc, 0x008, 1, 0);
    if (numDSWrites > 0) {
      uint64_t count = std::min(dsWritesPerMFMA, numDSWrites);
      ROCDL::SchedGroupBarrier::create(builder, loc, 0x200, count, 0);
      numDSWrites -= count;
    }
    if (numBufferLoads > 0) {
      uint64_t count = std::min(bufferLoadsPerMFMA, numBufferLoads);
      ROCDL::SchedGroupBarrier::create(builder, loc, 0x020, count, 0);
      numBufferLoads -= count;
    }
    if (numDSReads > 0) {
      uint64_t count = std::min(dsReadsPerMFMA, numDSReads);
      ROCDL::SchedGroupBarrier::create(builder, loc, 0x100, count, 0);
```
Use the existing `sched_barrier_opt_enum` instead of raw numbers; you can use the enum directly:

```cpp
static int32_t schedGroupMask(amdgpu::sched_barrier_opt_enum opt) {
  return static_cast<int32_t>(opt);
}
```

Then replace the raw numbers:

```cpp
using SchedMask = amdgpu::sched_barrier_opt_enum;
ROCDL::SchedGroupBarrier::create(builder, loc,
                                 schedGroupMask(SchedMask::mfma_wmma), 1, 0);
...
```
stefankoncarevic
left a comment
Overall the implementation looks clean and well-guarded. It would be helpful to see benchmark results for both CDNA and RDNA architectures — specifically which GEMM configurations benefit from the scheduling barriers and whether there are any regressions.
Closing this PR for now.