[AIEX] Improve prologue scheduling by using AIERegMemEventTracker #765

Open
andcarminati wants to merge 3 commits into aie-public from andreu.prologue.improve

Conversation

@andcarminati
Collaborator

The goal is to reduce pessimism through precise register event tracking. Locks and events are handled in the same way as in the epilogue.

; CHECK-LABEL: zol_elongated_preheader_plus_NumOfLoopBundles_ge112bytes:
; CHECK: .p2align 4
; CHECK-NEXT: // %bb.0: // %entry
; CHECK-NEXT: vldb x6, [p0], #0
Collaborator Author

This is relevant. The old approach detected a dependency through p0, but by using maxLatency we extended the distance to 7 (the load latency), which is quite pessimistic.

@andcarminati
Collaborator Author

The next step is to refactor the code a bit.

@andcarminati force-pushed the andreu.prologue.improve branch from eced1ea to d3620c9 on January 30, 2026 13:05
/*InSeparateRegion=*/false);

// Find the loop successor - must exist if we have bot-fixed bundles in a
// prologue
Collaborator

What is the difference between regular interblock scheduling and the pipelining interblock scheduling?

Collaborator Author

They are different in the sense that without pipelining:

  • We push a safety margin onto the epilogue block based on what we see in the loop.
  • We initialize the bot scoreboard of the prologue based on the loop state.

With a pipelined loop, we have the top and bot insert regions that are already scheduled. We carefully dump those regions to a place where the normal DAG construction will not see them (a sequence of bundles), because we need full control to avoid rearranging them or changing their timing properties. After this timing-region placement, we carefully create DAG nodes for the bundles (normal DAG construction is also not prepared for bundles, but that is not the reason), tying them together into a proper in-sequence schedule. Afterwards, we construct a timeline of events relative to the EntrySU and ExitSU, so we can create dependencies for the "free" instructions. Register dependencies are the easy part of this timeline construction; the problem arises when we also consider memory cycles and alias relations (most of the code). As a result, for each "free" instruction, we have a single edge to Entry/Exit modeling all memory and register latencies. For example, a memory instruction in the prologue can execute a register write inside the loop, and this must also be tracked; we do this here using events with negative cycles.

Also, this is the starting point for prologue/epilogue merging as a whole; the most complicated part is done by the scheduler: releasing each fixed node in sequence before any other node. This involves, for example, switching from top-down to bottom-up scheduling and delta cycles in both directions.
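A much-simplified sketch of the single-edge idea (the names and types below are illustrative, not the actual AIERegMemEventTracker API):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// A register event in the already-scheduled region, at a cycle measured as
// a distance from ExitSU. Negative cycles model effects that only land
// inside the loop (e.g. a late register write of a prologue load).
struct RegEvent {
  std::string Reg; // register written by the fixed instruction
  int Cycle;       // distance from ExitSU; may be negative
  int Latency;     // operand latency of the producer
};

// Fold every event a "free" instruction depends on into one edge latency
// towards ExitSU: the instruction must issue at least Latency cycles before
// the event, and the event is already Cycle cycles before ExitSU.
int singleEdgeLatency(const std::vector<RegEvent> &Events,
                      const std::vector<std::string> &UsedRegs) {
  int Edge = 0;
  for (const RegEvent &E : Events)
    for (const std::string &R : UsedRegs)
      if (E.Reg == R)
        Edge = std::max(Edge, E.Latency - E.Cycle);
  return Edge;
}
```

For instance, a load of latency 7 whose register def sits 2 cycles before ExitSU gives a free user of that register a single edge of latency 5, rather than the full maxLatency of 7.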

const BlockState &LoopBS =
Scheduler->getInterBlock().getBlockState(LoopSucc);
ArrayRef<AIE::MachineBundle> LoopTimedBundles = LoopBS.getTop().Bundles;
BackwardRET.computeUseDefBackward(LoopTimedBundles,
Collaborator

How do we handle the wrap-around nature of the loops?

Collaborator Author

Once we execute the first iteration of the loop (all instructions will be there), this first iteration is responsible for the first events in the timeline that may influence the prologue. Once we iterate again, the events are no longer interesting, because we already saw them previously, closer to the prologue. Happily, pipelined loops execute at least once and contain all instructions.
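The wrap-around argument can be illustrated with a toy model (purely illustrative, not tracker code): an event at relative cycle c inside a loop body of length II repeats at distance II*k + c for iteration k, so the occurrence closest to the prologue always comes from the first iteration.

```cpp
#include <algorithm>
#include <climits>

// Distance from the end of the prologue to the closest occurrence of an
// event that fires at relative cycle EventCycle in every iteration of a
// loop with initiation interval II.
int closestEventDistance(int EventCycle, int II, int Iterations) {
  int Closest = INT_MAX;
  for (int K = 0; K < Iterations; ++K)
    Closest = std::min(Closest, II * K + EventCycle); // minimized at K == 0
  return Closest;
}
```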

}

void AIERegMemEventTracker::updateLastStoreCycle(int StoreCycle) {
LastStoreCycle = std::max(LastStoreCycle, StoreCycle);
Collaborator

What is the difference between this and the scoreboard we use in the Postpipeliner? Can't we reuse it here?

Collaborator Author

A scoreboard can track resources (functional units/slots etc.) in the pipeline without any notion of latency. An instruction may reserve/require resources in the scoreboard, but there is no notion of operand latency or data dependency there. Here, we are not interested in resources, only in latencies that can accommodate data dependencies, and also in memory relations between "free" instructions and already-scheduled instructions.
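A toy contrast between the two, under the simplifying assumption of named slots and registers (neither struct reflects the real classes):

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// A scoreboard answers "is this slot free at this cycle?": pure resource
// occupancy, with no notion of operand latency.
struct ToyScoreboard {
  std::set<std::pair<int, std::string>> Occupied; // (cycle, slot)
  bool canIssue(int Cycle, const std::string &Slot) const {
    return !Occupied.count({Cycle, Slot});
  }
  void issue(int Cycle, const std::string &Slot) {
    Occupied.insert({Cycle, Slot});
  }
};

// An event tracker instead answers "when is this value available?":
// latencies and data dependencies, not resources.
struct ToyEventTracker {
  std::map<std::string, int> ReadyCycle; // reg -> def cycle + latency
  void recordDef(const std::string &Reg, int Cycle, int Latency) {
    ReadyCycle[Reg] = Cycle + Latency;
  }
  int earliestUse(const std::string &Reg) const {
    auto It = ReadyCycle.find(Reg);
    return It == ReadyCycle.end() ? 0 : It->second;
  }
};
```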

SDep Dep(FixedDepSU, SDep::Artificial);
Dep.setLatency(1);
FreeSU.addPred(Dep, /*Required=*/true);
// Handle bot-fixed instructions (prologue) with backward tracking
Collaborator

Looks really good. I think that we can make this function even more understandable by factoring out the intermediate steps into helper functions. Maybe it's even possible to share some code between the prologue and epilogue handling.

Collaborator Author

Done!

FreeSU.addPred(Dep, /*Required=*/true);
// Handle bot-fixed instructions (prologue) with backward tracking
// Only handle prologues (preheader blocks), not epilogues here
if (!BotFixedBundles.empty() && BS.Kind != BlockType::Epilogue) {
Collaborator

I think this check is redundant. I suggest moving the block type check into an assert (as we do in the prologue case) or removing it altogether.

Collaborator Author

New checks are available. We can have a 1-stage pipelined loop with BotFixedBundles.

Collaborator

Typo: iterate

int Cycle = 0;
const int TotalCycles = Bundles.size();

// Track top-fixed region size (only for the first call, not separate region)
Collaborator

Nit: This comment explains the what and not the why. Could you add an explanation of why this tracking is only needed in the first call? Also, why not track it in both cases and move the check to the place where the total cycles is actually used, i.e. if the size is only needed in a computation in certain scenarios, then I think it makes sense to keep that check closer to the computation rather than here.


// Count progressively from 0 as we iterate backward through bundles
// Cycle represents the distance from ExitSU for each bundle
int Cycle = 0;
Collaborator

Nit: AFAIK we usually call this the height. Please rename for consistency.

Collaborator Author

Definitely not. AIERegMemEventTracker is a register and memory event tracker, so we register events in their respective cycles.

int Cycle = 0;

for (const auto &Bundle : reverse(Bundles)) {
// Cycle is the distance from ExitSU for this bundle
Collaborator

Nit: Repeated comment.

if (getFirstMemoryAccessCycle() > INT_MIN) {
const int CoreResumeCycle = TII->getCoreResumeCycleAfterLock();
const int MIResumeCycle = CoreResumeCycle - 1;
const int MemDep = getFirstMemoryAccessCycle() - MIResumeCycle + 2;
Collaborator

Why +2?

Collaborator Author

I rewrote this part for clarity.

}
// Check stores in top-fixed (RAW for loads, WAR for stores)
const int StoreCycle =
AA ? getMaxAliasingMemCycle(MI, /*IsStore=*/true) : getLastStoreCycle();
Collaborator

Shouldn't this be getLastMemoryAccessCycle, since we are unifying store and load?

Collaborator Author

This seems correct to me, we care about stores here.

Collaborator

Shouldn't we then only consider stores in the if condition?

Collaborator Author

You are right! I extended it to cover the load case in a cleaner way.

@andcarminati
Collaborator Author

andcarminati commented Feb 4, 2026

QoR:
(QoR comparison images omitted)

This helps to improve the outermost loop of GEMM int8.

@andcarminati force-pushed the andreu.prologue.improve branch 4 times, most recently from 94f81d8 to 627374c on February 10, 2026 11:42
@andcarminati marked this pull request as ready for review on February 10, 2026 11:42
const std::map<MCRegister, int> &RegToCycle =
IsDef ? RegisterToCycleDef : RegisterToCycleUse;
int CurrMaxLatency = 0;
for (MCRegAliasIterator Ali(MO.getReg(), TRI, true); Ali.isValid();
Collaborator

we could also directly use RegUnits instead of going through all aliases.

Collaborator Author

In this way it is simple: we just keep one register in the timeline. RegUnits could also work, but we would need to create events for each RegUnit (add all regunits, then check all regunits afterwards), which I see as a complexity increase.
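A sketch of the alias-based lookup (with made-up register names and a hard-coded alias table standing in for MCRegAliasIterator): recording stays a single map entry per written register, and the alias expansion happens only on the query side.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical alias sets: a subregister and the register containing it
// alias each other, as MCRegAliasIterator would report.
static const std::map<std::string, std::vector<std::string>> Aliases = {
    {"r0_lo", {"r0_lo", "r0"}},
    {"r0", {"r0", "r0_lo", "r0_hi"}},
    {"r0_hi", {"r0_hi", "r0"}}};

// One event per register name: the cycle of the last def.
std::map<std::string, int> RegisterToCycleDef;

void recordDef(const std::string &Reg, int Cycle) {
  // Single entry keyed by the register actually written.
  int &C = RegisterToCycleDef[Reg];
  C = std::max(C, Cycle);
}

// The query walks all aliases of Reg, so a write to r0 is seen by a read
// of r0_lo. Returns -1 if no aliasing def was recorded.
int lastDefCycle(const std::string &Reg) {
  int Last = -1;
  for (const std::string &A : Aliases.at(Reg))
    if (auto It = RegisterToCycleDef.find(A); It != RegisterToCycleDef.end())
      Last = std::max(Last, It->second);
  return Last;
}
```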

if (RegCycle != RegToCycle.end()) {
const int RegCycleValue = RegCycle->second;
int ThisOperandLatency = IsBackward
? RegCycleValue + OperandCycle + 1
Collaborator

Could you add a comment explaining why we have +1 here?
Is this the latency after which the register is free to be used again?

Collaborator

nit: we could also call checkMemoryDependency

MO.isDef() ? SafeDistance(MO, /*IsDef*/ false) : 0;

// Only use positive distances
if (DistFromLastWrite > 0)
Collaborator

nit: std::max will filter out the case of DistFromLastWrite < 0

Collaborator Author

Good catch: I replaced it with MaxLatency = std::max({MaxLatency, DistFromLastWrite, DistFromLastRead});

return CurrMaxLatency;
};

const int DistFromLastWrite = SafeDistance(MO, /*IsDef*/ true);
Collaborator

nit: If we unify RegisterToCycleDef and RegisterToCycleUse, we only need to execute the lambda once, since the difference between this call and the one below is just a change in the data structure we use throughout the lambda.

Collaborator
@F-Stuckmann Feb 16, 2026

Can we also refactor this to call checkMemoryDependency?
Can we harmonize the cycles? Why don't we go to the next free cycle, like in checkMemoryDependency?

Collaborator
@F-Stuckmann Feb 16, 2026

If I understand the code correctly, the ints in RegisterToCycleDef and MemoryCycleToStoreInstrs are completely different, right? Could we harmonize this so that a cycle always means the same thing:
a) the cycle before the last time it is blocked (as used here and in MemoryCycleToStoreInstrs/MemoryCycleToLoadInstrs)
b) the cycle in which the instruction's Def/Use is last used (as in updateUseDefMaxCycle(MO.getReg(), EventCycle, IsDef);)

@@ -73,7 +219,7 @@
int MemAccessCycle = Cycle + *OptLastMemCycle - 1;
Collaborator

Up until this point the code is identical to the if (BundledMI->mayStore()) { case. We could harmonize these two conditions.

assert(OptMOCycle);
const int OperandCycle = *OptMOCycle;
if (InSeparateRegion) {
const int EventCycle = Cycle + OperandCycle;
Collaborator
@F-Stuckmann Feb 16, 2026

nit: we can hoist const int EventCycle = Cycle + OperandCycle; and use the variable in the conditional branches

auto OptFirstMemCycle = TII->getFirstMemoryCycle(SchedClass);
assert(OptFirstMemCycle && "Load instruction without MemoryCycles");
const int FirstMemCycle = *OptFirstMemCycle;
int FirstLoadFromEnd = Cycle - FirstMemCycle + 1;
Collaborator
@F-Stuckmann Feb 16, 2026

Why do I need a +1 here?
Shouldn't the current variable be called CycleAfterFirstLoadFromEnd?

Collaborator Author

Discussed offline.

}

// Track stores for AA-based dependencies
if (BundledMI->mayStore()) {
Collaborator

nit: unify with the if (BundledMI->mayLoad()) { block


unsigned
AIERegMemEventTracker::getSafeOperandsDistance(const MachineInstr &MI) const {
void AIERegMemEventTracker::computeUseDefBackward(
Collaborator

I think we could harmonize this method and computeUseDefForward, since they compute similar results (except for a backward forward update of cycles)

Collaborator Author

We have some differences that could lead to hard-to-maintain conditional code.

// For example, if we have LastStoreCycle = 21 (absolute) and this
// instruction FirstMemoryCycle = 5 (relative to the start cycle of this
// instruction), then this instruction should start earliest at cycle 18,
// to be able to start the FirstMemoryCycle (absolute) at cycle 22. In
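The arithmetic in this comment can be sketched as follows, assuming 1-based relative memory cycles and that the access must land strictly after the store (my reading of the example, not stated in the patch):

```cpp
// Earliest absolute start cycle for an instruction whose first memory
// access happens FirstMemoryCycle cycles into its execution (1-based,
// relative), given the last fixed-region store at cycle LastStoreCycle.
// The access lands at start + FirstMemoryCycle - 1, which must be strictly
// greater than LastStoreCycle; the two off-by-ones combine into the +2.
int earliestStartAfterStore(int LastStoreCycle, int FirstMemoryCycle) {
  return LastStoreCycle - FirstMemoryCycle + 2;
}
```

With the values from the comment (21 and 5), the earliest start is cycle 18, placing the first memory access at absolute cycle 22.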
Collaborator

nit: could you keep the comments that explain magic numbers?

}

// Free load/store: check stores in bot-fixed (RAW dependency)
const int LoadStoreCycle = AA ? getMaxAliasingMemCycle(MI, /*IsStore=*/true)
Collaborator
@F-Stuckmann Feb 16, 2026

nit: we could remove the above if (MI.mayStore()) { code if we use const bool IsStore = MI.mayStore() in these method calls

Collaborator Author

But this flag is related to the events, not to MI.

@andcarminati force-pushed the andreu.prologue.improve branch 3 times, most recently from 862fb17 to 4604156 on February 23, 2026 15:16
@andcarminati
Collaborator Author

andcarminati commented Feb 23, 2026

Hi @F-Stuckmann, thank you for the review and for catching some inconsistencies! I tried to improve the code with some refactoring and harmonization, but I kept the backward and forward mechanisms separate.

@andcarminati force-pushed the andreu.prologue.improve branch from 4604156 to 74adf86 on March 4, 2026 10:19