Sched group #2186
Conversation
Pull request overview
Adds a new Rock transform pass to insert AMDGPU scheduling barriers (“sched group” barriers) based on analysis of scf.for loop bodies, and wires it into the default kernel pipeline with accompanying integration/unit tests.
Changes:
- Introduce `rock-add-sched-group-barriers` pass that analyzes memory ops + MFMA/WMMA counts and injects `amdgpu.sched_barrier` and `rocdl.sched.group.barrier`.
- Integrate the new pass into the ROCmLIR kernel pipeline immediately after `rock-buffer-load-merge`.
- Add MLIR tests covering both pipeline-level behavior and pre-lowered IR edge cases.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| mlir/lib/Dialect/Rock/Transforms/AddSchedGroupBarriers.cpp | Implements the new analysis + barrier insertion transform. |
| mlir/include/mlir/Dialect/Rock/Passes.td | Declares the new pass and its documentation/dependent dialects. |
| mlir/include/mlir/Dialect/Rock/Passes.h | Exposes the generated pass declaration macro. |
| mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt | Adds the new source file to the Rock transforms library build. |
| mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp | Inserts the pass into the kernel pipeline sequence. |
| mlir/test/rocmlir-driver/pipelines.mlir | Updates expected pipeline listing to include the new pass. |
| mlir/test/Dialect/Rock/add_sched_group_barriers.mlir | End-to-end pipeline test validating barrier insertion/skips. |
| mlir/test/Dialect/Rock/add_sched_group_barriers_unit.mlir | Unit tests for edge cases on pre-lowered IR. |
```cpp
/// Get the trip count of an affine.for loop, returns 1 if unknown
static uint64_t getAffineForTripCount(affine::AffineForOp affineFor) {
  std::optional<uint64_t> tripCount = affine::getConstantTripCount(affineFor);
  if (tripCount.has_value()) {
    return tripCount.value();
  }
  LLVM_DEBUG(llvm::dbgs()
             << "WARNING: could not determine trip count for affine.for at "
             << affineFor.getLoc() << ", defaulting to 1\n");
  return 1;
}
```
getAffineForTripCount() defaults unknown affine trip counts to 1, which can significantly undercount ops inside affine loops. That can cause this pass to insert sched-group barriers even when the real per-iteration instruction count is much larger (including violating the >25 MFMA cutoff), potentially increasing backend compile time unexpectedly. Consider treating unknown trip counts as a reason to skip barrier insertion (or conservatively assume a large trip count), and/or add overflow-safe multiplication in computeAffineLoopMultiplier().
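The overflow-safe multiplication suggested above can be sketched without any MLIR dependencies. `saturatingMul` is a hypothetical helper name; in-tree, `llvm::SaturatingMultiply` from `llvm/Support/MathExtras.h` provides equivalent behavior.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical overflow-safe multiply for accumulating nested trip counts.
// Saturates at UINT64_MAX instead of wrapping, so a huge trip count cannot
// silently wrap back under the >25 MFMA cutoff.
static uint64_t saturatingMul(uint64_t a, uint64_t b) {
  uint64_t result;
  if (__builtin_mul_overflow(a, b, &result)) // GCC/Clang builtin
    return std::numeric_limits<uint64_t>::max();
  return result;
}
```

Saturating (rather than skipping) keeps the pass conservative: an implausibly large product still disqualifies the loop via the MFMA cutoff instead of silently re-qualifying it.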
```cpp
// For the remaining single-loop functions, barriers are only inserted if:
// - The loop uses double buffering (LDS reads/writes use arith.select).
// - The loop does not use direct-to-LDS loads (amdgpu.gather_to_lds or
//   amdgpu.async_load_to_lds).
// - The loop has at most one rock.lds_barrier (excludes attention kernels).
// - The loop contains at least one global load and one matrix multiply op.
// - The number of matrix multiply ops per iteration does not exceed 25.
```
The header comment says barriers are only inserted for “remaining single-loop functions”, but the implementation processes every scf.for in any function that merely lacks nested scf.for loops. Either enforce the single-loop restriction or update the comment/Passes.td description to match actual behavior.
Suggested change:

```diff
-// For the remaining single-loop functions, barriers are only inserted if:
-// - The loop uses double buffering (LDS reads/writes use arith.select).
-// - The loop does not use direct-to-LDS loads (amdgpu.gather_to_lds or
-//   amdgpu.async_load_to_lds).
-// - The loop has at most one rock.lds_barrier (excludes attention kernels).
-// - The loop contains at least one global load and one matrix multiply op.
-// - The number of matrix multiply ops per iteration does not exceed 25.
+// For the remaining functions, each scf.for loop is a candidate for barrier
+// insertion only if:
+// - The loop uses double buffering (LDS reads/writes use arith.select).
+// - The loop does not use direct-to-LDS loads (amdgpu.gather_to_lds or
+//   amdgpu.async_load_to_lds).
+// - The loop has at most one rock.lds_barrier (excludes attention kernels).
+// - The loop contains at least one global load and one matrix multiply op.
```
```cpp
/// Check if a value is defined by an arith.select operation, which indicates
/// double buffering (selecting between two different LDS buffers)
static bool isDefinedBySelect(Value val) {
  return val.getDefiningOp<arith::SelectOp>() != nullptr;
```
Double-buffering detection currently treats any LDS load/store whose base is defined by an arith.select as “double buffered”, without verifying that the select toggles across iterations or that it selects between two distinct workgroup buffers. This can create false positives (inserting barriers into non-double-buffered loops) and diverges from the stated “alternate between two buffers” requirement. Consider tightening the check (e.g., ensure both select operands are different LDS buffers and the condition depends on the loop iteration/parity).
Suggested change:

```diff
-/// Check if a value is defined by an arith.select operation, which indicates
-/// double buffering (selecting between two different LDS buffers)
-static bool isDefinedBySelect(Value val) {
-  return val.getDefiningOp<arith::SelectOp>() != nullptr;
+/// Check if a value is defined by an arith.select operation that selects
+/// between two *distinct* LDS (workgroup) buffers. This is used as a proxy
+/// for detecting double buffering.
+static bool isDefinedBySelect(Value val) {
+  auto selectOp = val.getDefiningOp<arith::SelectOp>();
+  if (!selectOp)
+    return false;
+  // Require that the selected value itself is an LDS memref.
+  auto resultMemRefTy = dyn_cast<MemRefType>(val.getType());
+  if (!resultMemRefTy || !hasWorkgroupAddressSpace(resultMemRefTy))
+    return false;
+  Value trueVal = selectOp.getTrueValue();
+  Value falseVal = selectOp.getFalseValue();
+  // The two buffers must be distinct to represent real double buffering.
+  if (trueVal == falseVal)
+    return false;
+  auto trueMemRefTy = dyn_cast<MemRefType>(trueVal.getType());
+  auto falseMemRefTy = dyn_cast<MemRefType>(falseVal.getType());
+  if (!trueMemRefTy || !falseMemRefTy)
+    return false;
+  // Both operands must be LDS memrefs.
+  if (!hasWorkgroupAddressSpace(trueMemRefTy) ||
+      !hasWorkgroupAddressSpace(falseMemRefTy))
+    return false;
+  return true;
```
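For context, the double-buffering pattern such a tightened check targets is roughly this shape of IR (hypothetical and simplified; the buffer names, element type, and size are invented):

```mlir
// %parity flips each iteration, so the select alternates between two
// distinct LDS (workgroup) allocations.
%buf = arith.select %parity, %ldsA, %ldsB
    : memref<4096xf16, #gpu.address_space<workgroup>>
```

A false positive would be a select whose two operands are the same buffer, or whose condition is loop-invariant; neither actually alternates buffers across iterations.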
```cpp
  uint64_t dsReadsPerMFMA = llvm::divideCeil(numDSReads, numMatrixMultiplyOps);
  uint64_t dsWritesPerMFMA =
      llvm::divideCeil(numDSWrites, numMatrixMultiplyOps);
  uint64_t bufferLoadsPerMFMA =
      llvm::divideCeil(numBufferLoads, numMatrixMultiplyOps);
  for (uint64_t i = 0; i < numMatrixMultiplyOps; i++) {
    ROCDL::SchedGroupBarrier::create(builder, loc, 0x008, 1, 0);
    if (numDSWrites > 0) {
      uint64_t count = std::min(dsWritesPerMFMA, numDSWrites);
      ROCDL::SchedGroupBarrier::create(builder, loc, 0x200, count, 0);
      numDSWrites -= count;
    }
    if (numBufferLoads > 0) {
      uint64_t count = std::min(bufferLoadsPerMFMA, numBufferLoads);
      ROCDL::SchedGroupBarrier::create(builder, loc, 0x020, count, 0);
      numBufferLoads -= count;
    }
    if (numDSReads > 0) {
      uint64_t count = std::min(dsReadsPerMFMA, numDSReads);
      ROCDL::SchedGroupBarrier::create(builder, loc, 0x100, count, 0);
      numDSReads -= count;
```
The sched-group barrier masks (0x008/0x200/0x020/0x100) are hard-coded magic numbers. Since they correspond to specific AMDGPU scheduling-class bits, it would be safer/clearer to define named constants (or reuse the AMDGPU sched-barrier enum bit positions) to avoid mistakes and make future changes easier.
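To make that concrete, here is a dependency-free sketch: hypothetical named constants for the four mask bits (values copied from the code above, corresponding to the MFMA/WMMA, VMEM-read, DS-read, and DS-write scheduling classes), plus a pure simulation of the round-robin distribution loop. `planSchedGroups` is an invented name; the real pass emits `rocdl.sched.group.barrier` ops instead of recording pairs.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical named constants for the mask bits used above.
constexpr int32_t kMaskMFMA = 0x008;
constexpr int32_t kMaskVMEMRead = 0x020;
constexpr int32_t kMaskDSRead = 0x100;
constexpr int32_t kMaskDSWrite = 0x200;

// Simulate the round-robin distribution: spread the memory ops evenly
// across the MFMA groups and record the (mask, count) pairs in emission
// order. Requires numMFMA > 0, which the pass guarantees (at least one
// matrix multiply op per candidate loop).
static std::vector<std::pair<int32_t, uint64_t>>
planSchedGroups(uint64_t numMFMA, uint64_t dsReads, uint64_t dsWrites,
                uint64_t bufferLoads) {
  auto divideCeil = [](uint64_t a, uint64_t b) { return (a + b - 1) / b; };
  uint64_t readsPer = divideCeil(dsReads, numMFMA);
  uint64_t writesPer = divideCeil(dsWrites, numMFMA);
  uint64_t loadsPer = divideCeil(bufferLoads, numMFMA);
  std::vector<std::pair<int32_t, uint64_t>> plan;
  for (uint64_t i = 0; i < numMFMA; ++i) {
    plan.emplace_back(kMaskMFMA, 1);
    if (dsWrites > 0) {
      uint64_t n = std::min(writesPer, dsWrites);
      plan.emplace_back(kMaskDSWrite, n);
      dsWrites -= n;
    }
    if (bufferLoads > 0) {
      uint64_t n = std::min(loadsPer, bufferLoads);
      plan.emplace_back(kMaskVMEMRead, n);
      bufferLoads -= n;
    }
    if (dsReads > 0) {
      uint64_t n = std::min(readsPer, dsReads);
      plan.emplace_back(kMaskDSRead, n);
      dsReads -= n;
    }
  }
  return plan;
}
```

One invariant worth asserting in a unit test: every counted memory op lands in exactly one group, so the per-mask totals equal the input counts.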
```
(ROCDL::SchedGroupBarrier) to optimize instruction scheduling on AMD GPUs.

The pass skips functions that contain nested scf.for loops.
For the remaining single-loop functions, barriers are only inserted if:
```
The description says the pass applies to “remaining single-loop functions”, but the implementation only skips functions with nested scf.for and will process multiple top-level scf.for loops in the same function. Please align the description with the implementation (or enforce the single-loop restriction).
Suggested change:

```diff
-For the remaining single-loop functions, barriers are only inserted if:
+For the remaining functions, barriers are only inserted if:
```
```cpp
  for (uint64_t i = 0; i < numMatrixMultiplyOps; i++) {
    ROCDL::SchedGroupBarrier::create(builder, loc, 0x008, 1, 0);
    if (numDSWrites > 0) {
      uint64_t count = std::min(dsWritesPerMFMA, numDSWrites);
      ROCDL::SchedGroupBarrier::create(builder, loc, 0x200, count, 0);
      numDSWrites -= count;
    }
    if (numBufferLoads > 0) {
      uint64_t count = std::min(bufferLoadsPerMFMA, numBufferLoads);
      ROCDL::SchedGroupBarrier::create(builder, loc, 0x020, count, 0);
      numBufferLoads -= count;
    }
    if (numDSReads > 0) {
      uint64_t count = std::min(dsReadsPerMFMA, numDSReads);
      ROCDL::SchedGroupBarrier::create(builder, loc, 0x100, count, 0);
```
Use the existing `sched_barrier_opt_enum` instead of raw numbers; you can use the enum directly:

```cpp
static int32_t schedGroupMask(amdgpu::sched_barrier_opt_enum opt) {
  return static_cast<int32_t>(opt);
}
```

Then replace the raw numbers:

```cpp
using SchedMask = amdgpu::sched_barrier_opt_enum;
ROCDL::SchedGroupBarrier::create(builder, loc,
                                 schedGroupMask(SchedMask::mfma_wmma), 1, 0);
...
```
stefankoncarevic
left a comment
Overall the implementation looks clean and well-guarded. It would be helpful to see benchmark results for both CDNA and RDNA architectures — specifically which GEMM configurations benefit from the scheduling barriers and whether there are any regressions.
Closing this PR for now.