Sched group #2186

Closed
umangyadav wants to merge 30 commits into develop from schedGroup
Conversation

@umangyadav
Member

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@umangyadav umangyadav marked this pull request as ready for review March 30, 2026 23:25
Contributor

Copilot AI left a comment


Pull request overview

Adds a new Rock transform pass to insert AMDGPU scheduling barriers (“sched group” barriers) based on analysis of scf.for loop bodies, and wires it into the default kernel pipeline with accompanying integration/unit tests.

Changes:

  • Introduce rock-add-sched-group-barriers pass that analyzes memory ops + MFMA/WMMA counts and injects amdgpu.sched_barrier and rocdl.sched.group.barrier.
  • Integrate the new pass into the ROCmLIR kernel pipeline immediately after rock-buffer-load-merge.
  • Add MLIR tests covering both pipeline-level behavior and pre-lowered IR edge cases.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Summary per file:

  • mlir/lib/Dialect/Rock/Transforms/AddSchedGroupBarriers.cpp: Implements the new analysis + barrier insertion transform.
  • mlir/include/mlir/Dialect/Rock/Passes.td: Declares the new pass and its documentation/dependent dialects.
  • mlir/include/mlir/Dialect/Rock/Passes.h: Exposes the generated pass declaration macro.
  • mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt: Adds the new source file to the Rock transforms library build.
  • mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp: Inserts the pass into the kernel pipeline sequence.
  • mlir/test/rocmlir-driver/pipelines.mlir: Updates the expected pipeline listing to include the new pass.
  • mlir/test/Dialect/Rock/add_sched_group_barriers.mlir: End-to-end pipeline test validating barrier insertion/skips.
  • mlir/test/Dialect/Rock/add_sched_group_barriers_unit.mlir: Unit tests for edge cases on pre-lowered IR.


Comment on lines +96 to +106
/// Get the trip count of an affine.for loop; returns 1 if unknown.
static uint64_t getAffineForTripCount(affine::AffineForOp affineFor) {
  std::optional<uint64_t> tripCount = affine::getConstantTripCount(affineFor);
  if (tripCount.has_value()) {
    return tripCount.value();
  }
  LLVM_DEBUG(llvm::dbgs()
             << "WARNING: could not determine trip count for affine.for at "
             << affineFor.getLoc() << ", defaulting to 1\n");
  return 1;
}

Copilot AI Mar 30, 2026


getAffineForTripCount() defaults unknown affine trip counts to 1, which can significantly undercount ops inside affine loops. That can cause this pass to insert sched-group barriers even when the real per-iteration instruction count is much larger (including violating the >25 MFMA cutoff), potentially increasing backend compile time unexpectedly. Consider treating unknown trip counts as a reason to skip barrier insertion (or conservatively assume a large trip count), and/or add overflow-safe multiplication in computeAffineLoopMultiplier().

Comment on lines +14 to +20
// For the remaining single-loop functions, barriers are only inserted if:
// - The loop uses double buffering (LDS reads/writes use arith.select).
// - The loop does not use direct-to-LDS loads (amdgpu.gather_to_lds or
// amdgpu.async_load_to_lds).
// - The loop has at most one rock.lds_barrier (excludes attention kernels).
// - The loop contains at least one global load and one matrix multiply op.
// - The number of matrix multiply ops per iteration does not exceed 25.

Copilot AI Mar 30, 2026


The header comment says barriers are only inserted for “remaining single-loop functions”, but the implementation processes every scf.for in any function that merely lacks nested scf.for loops. Either enforce the single-loop restriction or update the comment/Passes.td description to match actual behavior.

Suggested change
// For the remaining single-loop functions, barriers are only inserted if:
// - The loop uses double buffering (LDS reads/writes use arith.select).
// - The loop does not use direct-to-LDS loads (amdgpu.gather_to_lds or
// amdgpu.async_load_to_lds).
// - The loop has at most one rock.lds_barrier (excludes attention kernels).
// - The loop contains at least one global load and one matrix multiply op.
// - The number of matrix multiply ops per iteration does not exceed 25.
// For the remaining functions, each scf.for loop is a candidate for barrier
// insertion only if:
// - The loop uses double buffering (LDS reads/writes use arith.select).
// - The loop does not use direct-to-LDS loads (amdgpu.gather_to_lds or
// amdgpu.async_load_to_lds).
// - The loop has at most one rock.lds_barrier (excludes attention kernels).
// - The loop contains at least one global load and one matrix multiply op.

Comment on lines +90 to +93
/// Check if a value is defined by an arith.select operation, which indicates
/// double buffering (selecting between two different LDS buffers)
static bool isDefinedBySelect(Value val) {
  return val.getDefiningOp<arith::SelectOp>() != nullptr;

Copilot AI Mar 30, 2026


Double-buffering detection currently treats any LDS load/store whose base is defined by an arith.select as “double buffered”, without verifying that the select toggles across iterations or that it selects between two distinct workgroup buffers. This can create false positives (inserting barriers into non-double-buffered loops) and diverges from the stated “alternate between two buffers” requirement. Consider tightening the check (e.g., ensure both select operands are different LDS buffers and the condition depends on the loop iteration/parity).

Suggested change
/// Check if a value is defined by an arith.select operation, which indicates
/// double buffering (selecting between two different LDS buffers)
static bool isDefinedBySelect(Value val) {
  return val.getDefiningOp<arith::SelectOp>() != nullptr;
/// Check if a value is defined by an arith.select operation that selects
/// between two *distinct* LDS (workgroup) buffers. This is used as a proxy
/// for detecting double buffering.
static bool isDefinedBySelect(Value val) {
  auto selectOp = val.getDefiningOp<arith::SelectOp>();
  if (!selectOp)
    return false;
  // Require that the selected value itself is an LDS memref.
  auto resultMemRefTy = dyn_cast<MemRefType>(val.getType());
  if (!resultMemRefTy || !hasWorkgroupAddressSpace(resultMemRefTy))
    return false;
  Value trueVal = selectOp.getTrueValue();
  Value falseVal = selectOp.getFalseValue();
  // The two buffers must be distinct to represent real double buffering.
  if (trueVal == falseVal)
    return false;
  auto trueMemRefTy = dyn_cast<MemRefType>(trueVal.getType());
  auto falseMemRefTy = dyn_cast<MemRefType>(falseVal.getType());
  if (!trueMemRefTy || !falseMemRefTy)
    return false;
  // Both operands must be LDS memrefs.
  if (!hasWorkgroupAddressSpace(trueMemRefTy) ||
      !hasWorkgroupAddressSpace(falseMemRefTy))
    return false;
  return true;

Comment on lines +346 to +366
uint64_t dsReadsPerMFMA = llvm::divideCeil(numDSReads, numMatrixMultiplyOps);
uint64_t dsWritesPerMFMA =
    llvm::divideCeil(numDSWrites, numMatrixMultiplyOps);
uint64_t bufferLoadsPerMFMA =
    llvm::divideCeil(numBufferLoads, numMatrixMultiplyOps);
for (uint64_t i = 0; i < numMatrixMultiplyOps; i++) {
  ROCDL::SchedGroupBarrier::create(builder, loc, 0x008, 1, 0);
  if (numDSWrites > 0) {
    uint64_t count = std::min(dsWritesPerMFMA, numDSWrites);
    ROCDL::SchedGroupBarrier::create(builder, loc, 0x200, count, 0);
    numDSWrites -= count;
  }
  if (numBufferLoads > 0) {
    uint64_t count = std::min(bufferLoadsPerMFMA, numBufferLoads);
    ROCDL::SchedGroupBarrier::create(builder, loc, 0x020, count, 0);
    numBufferLoads -= count;
  }
  if (numDSReads > 0) {
    uint64_t count = std::min(dsReadsPerMFMA, numDSReads);
    ROCDL::SchedGroupBarrier::create(builder, loc, 0x100, count, 0);
    numDSReads -= count;

Copilot AI Mar 30, 2026


The sched-group barrier masks (0x008/0x200/0x020/0x100) are hard-coded magic numbers. Since they correspond to specific AMDGPU scheduling-class bits, it would be safer/clearer to define named constants (or reuse the AMDGPU sched-barrier enum bit positions) to avoid mistakes and make future changes easier.

(ROCDL::SchedGroupBarrier) to optimize instruction scheduling on AMD GPUs.

The pass skips functions that contain nested scf.for loops.
For the remaining single-loop functions, barriers are only inserted if:

Copilot AI Mar 30, 2026


The description says the pass applies to “remaining single-loop functions”, but the implementation only skips functions with nested scf.for and will process multiple top-level scf.for loops in the same function. Please align the description with the implementation (or enforce the single-loop restriction).

Suggested change
For the remaining single-loop functions, barriers are only inserted if:
For the remaining functions, barriers are only inserted if:

Comment on lines +351 to +365
for (uint64_t i = 0; i < numMatrixMultiplyOps; i++) {
  ROCDL::SchedGroupBarrier::create(builder, loc, 0x008, 1, 0);
  if (numDSWrites > 0) {
    uint64_t count = std::min(dsWritesPerMFMA, numDSWrites);
    ROCDL::SchedGroupBarrier::create(builder, loc, 0x200, count, 0);
    numDSWrites -= count;
  }
  if (numBufferLoads > 0) {
    uint64_t count = std::min(bufferLoadsPerMFMA, numBufferLoads);
    ROCDL::SchedGroupBarrier::create(builder, loc, 0x020, count, 0);
    numBufferLoads -= count;
  }
  if (numDSReads > 0) {
    uint64_t count = std::min(dsReadsPerMFMA, numDSReads);
    ROCDL::SchedGroupBarrier::create(builder, loc, 0x100, count, 0);
Contributor


Use the existing sched_barrier_opt_enum instead of raw numbers; you can use the enum directly:

static int32_t schedGroupMask(amdgpu::sched_barrier_opt_enum opt) {
  return static_cast<int32_t>(opt);
}

Then replace the raw numbers:

using SchedMask = amdgpu::sched_barrier_opt_enum;
ROCDL::SchedGroupBarrier::create(builder, loc,
                                 schedGroupMask(SchedMask::mfma_wmma), 1, 0);
.......

Contributor

@stefankoncarevic stefankoncarevic left a comment


Overall the implementation looks clean and well-guarded. It would be helpful to see benchmark results for both CDNA and RDNA architectures — specifically which GEMM configurations benefit from the scheduling barriers and whether there are any regressions.

@umangyadav
Member Author

Closing this PR for now.

@umangyadav umangyadav closed this Apr 28, 2026