[AIEX] Improve prologue scheduling by using AIERegMemEventTracker#765
[AIEX] Improve prologue scheduling by using AIERegMemEventTracker#765andcarminati wants to merge 3 commits intoaie-publicfrom
Conversation
| ; CHECK-LABEL: zol_elongated_preheader_plus_NumOfLoopBundles_ge112bytes: | ||
| ; CHECK: .p2align 4 | ||
| ; CHECK-NEXT: // %bb.0: // %entry | ||
| ; CHECK-NEXT: vldb x6, [p0], #0 |
There was a problem hiding this comment.
This is something relevant. The old approach detected a dependency through p0, but by using maxLatency, we extend the distance to 7 (load latency), what is quite pessimistic.
|
Next step is to refactor a bit the code. |
eced1ea to
d3620c9
Compare
| /*InSeparateRegion=*/false); | ||
|
|
||
| // Find the loop successor - must exist if we have bot-fixed bundles in a | ||
| // prologue |
There was a problem hiding this comment.
What is the difference between regular interblock scheduling and the pipelining interblock scheduling?
There was a problem hiding this comment.
They are different in the sense that without pipelining:
- We push a safety margin on the epilogue block based on what we see on the loop.
- We initialize the bot scoreboard of the prologue based on the loop state.
With a pipelined loop, we have the top and bot insert regions that are already scheduled. We carefully dump those regions to a place where the normal DAG construction will not see them (sequence of bundles), because we need full control to not rearrange/change timing properties. After this timing region placement, we carefully create DAG nodes for the bundles (Also, normal DAG construction is not prepared for bundles- but this is not the reason), tying them together to a proper in-sequence scheduling. After, we construct a timeline of events relative the EntrySU and ExitSU, so we can create dependencies for the "free instructions". Register dependencies are the easy part on this timeline construction, the problem arises when we consider also memory cycles and alias relations (most of the code). As result, for each "free" instruction, we have a single edge to Entry/Exit modeling all memory and register latencies. For example, a memory instruction in the prologue can execute a register write inside de loop, and this must be also tracked - we do this here using events with negative cycles.
Also, this is the starting point for prologue/epilogue merging as a whole, the most complicated part is done by the scheduler: release each fixed node in sequence before any other node. This involves, for example, the switching the from top to bottom-um scheduling and delta cycles in both directions.
| const BlockState &LoopBS = | ||
| Scheduler->getInterBlock().getBlockState(LoopSucc); | ||
| ArrayRef<AIE::MachineBundle> LoopTimedBundles = LoopBS.getTop().Bundles; | ||
| BackwardRET.computeUseDefBackward(LoopTimedBundles, |
There was a problem hiding this comment.
How do we handle the wrap around nature of the loops?
There was a problem hiding this comment.
Once we execute the first iteration of the loop (all instructions will be there), this first iteration will be responsible for the first events in the timeline that may influence the prologue. Once we iterate again, events will not be interesting anymore because we already saw them previously, closer to the prologue. Happily, pipelined loops execute at least once and they contain all instructions.
| } | ||
|
|
||
| void AIERegMemEventTracker::updateLastStoreCycle(int StoreCycle) { | ||
| LastStoreCycle = std::max(LastStoreCycle, StoreCycle); |
There was a problem hiding this comment.
What is the difference between this and the scoreboard we use in the Postpipeliner? Can't we reuse it here?
There was a problem hiding this comment.
A scoreboard can track resources (functional units/slots etc.) in the pipeline without the notion of latency. An instruction may reserve/require resources the the scoreboard, but there in no notion of operand latency/data dependency there. Here, we are not interested any resources, only latencies that can accommodate data dependencies and also memory relations between "free" instructions and already scheduled instructions.
| SDep Dep(FixedDepSU, SDep::Artificial); | ||
| Dep.setLatency(1); | ||
| FreeSU.addPred(Dep, /*Required=*/true); | ||
| // Handle bot-fixed instructions (prologue) with backward tracking |
There was a problem hiding this comment.
Looks really good. I think that we can make this function even more understandable by factoring out the intermediate steps into helper functions. Maybe it's even possible to share some code between the prologue and epilogue handling.
| FreeSU.addPred(Dep, /*Required=*/true); | ||
| // Handle bot-fixed instructions (prologue) with backward tracking | ||
| // Only handle prologues (preheader blocks), not epilogues here | ||
| if (!BotFixedBundles.empty() && BS.Kind != BlockType::Epilogue) { |
There was a problem hiding this comment.
I think this check is redundant. I suggest moving the block type check into an assert (as we do in the prologue case) or removing it altogether.
There was a problem hiding this comment.
New checks available. We can have 1-stage pipelined loop with BotFixedBundles.
| int Cycle = 0; | ||
| const int TotalCycles = Bundles.size(); | ||
|
|
||
| // Track top-fixed region size (only for the first call, not separate region) |
There was a problem hiding this comment.
Nit: This comment explains the what and not the why. Could you add an explanation of why this tracking is only needed in the first call? Also, why not track it in both cases and move the check to the place where the total cycles is actually used, i.e. if the size is only needed in a computation in certain scenarios, then I think it makes sense to keep that check closer to the computation rather than here.
|
|
||
| // Count progressively from 0 as we iterate backward through bundles | ||
| // Cycle represents the distance from ExitSU for each bundle | ||
| int Cycle = 0; |
There was a problem hiding this comment.
Nit: AFAIK we usually call this the height. Please rename for consistency.
There was a problem hiding this comment.
Definitely not. AIERegMemEventTracker is a register and memory event tracker so we register events in their respective cycles.
| int Cycle = 0; | ||
|
|
||
| for (const auto &Bundle : reverse(Bundles)) { | ||
| // Cycle is the distance from ExitSU for this bundle |
| if (getFirstMemoryAccessCycle() > INT_MIN) { | ||
| const int CoreResumeCycle = TII->getCoreResumeCycleAfterLock(); | ||
| const int MIResumeCycle = CoreResumeCycle - 1; | ||
| const int MemDep = getFirstMemoryAccessCycle() - MIResumeCycle + 2; |
There was a problem hiding this comment.
I rewrote this part for clarity.
| } | ||
| // Check stores in top-fixed (RAW for loads, WAR for stores) | ||
| const int StoreCycle = | ||
| AA ? getMaxAliasingMemCycle(MI, /*IsStore=*/true) : getLastStoreCycle(); |
There was a problem hiding this comment.
shouldn't this be getLastMemoryAccessCycle since we are unifying store and load?
There was a problem hiding this comment.
This seems correct to me, we care about stores here.
There was a problem hiding this comment.
shouldn't we then only consider stores in the if condition?
There was a problem hiding this comment.
You are right! I extended to cover the load case in a cleaner way!
94f81d8 to
627374c
Compare
| const std::map<MCRegister, int> &RegToCycle = | ||
| IsDef ? RegisterToCycleDef : RegisterToCycleUse; | ||
| int CurrMaxLatency = 0; | ||
| for (MCRegAliasIterator Ali(MO.getReg(), TRI, true); Ali.isValid(); |
There was a problem hiding this comment.
we could also directly use RegUnits instead of going through all aliases.
There was a problem hiding this comment.
In this way it is simple, we just keep one register in the timeline. RegUnit can also work, but we need to create events for each RegUnit - add all reguinits and check all regunits after (I see a complexity increase.)
| if (RegCycle != RegToCycle.end()) { | ||
| const int RegCycleValue = RegCycle->second; | ||
| int ThisOperandLatency = IsBackward | ||
| ? RegCycleValue + OperandCycle + 1 |
There was a problem hiding this comment.
Could you add a comment why we have +1 here?
Is this the latency after which the register is free to be used again?
There was a problem hiding this comment.
nit: we could also call checkMemoryDependency
| MO.isDef() ? SafeDistance(MO, /*IsDef*/ false) : 0; | ||
|
|
||
| // Only use positive distances | ||
| if (DistFromLastWrite > 0) |
There was a problem hiding this comment.
nit: std::max will filter out the case of DistFromLastWrite < 0
There was a problem hiding this comment.
Good catch: I replaced by MaxLatency = std::max({MaxLatency, DistFromLastWrite, DistFromLastRead});
| return CurrMaxLatency; | ||
| }; | ||
|
|
||
| const int DistFromLastWrite = SafeDistance(MO, /*IsDef*/ true); |
There was a problem hiding this comment.
nit: If we unify RegisterToCycleDef and RegisterToCycleUse we only need to execute the lambda once, since the difference between this call and the one below is just a change in the datastructure we use throught the lambda.
There was a problem hiding this comment.
can we also refactor this to call checkMemoryDependency ?
Can we harmonize the Cycles. Why dont we go to the next cycle that is free like in checkMemoryDependency ?
There was a problem hiding this comment.
If i understand the code correctly, the ints in RegisterToCycleDef and MemoryCycleToStoreInstrs are complelty different, right? Could we harmonize this so that a cycle always means the same thing:
a) Cycle before the last time it is blocked (as is used here and MemoryCycleToStoreInstrs/MemoryCycleToLoadInstrs)
b) Cycle the instruction Def/Use is last used (as in updateUseDefMaxCycle(MO.getReg(), EventCycle, IsDef);)
| @@ -73,7 +219,7 @@ | |||
| int MemAccessCycle = Cycle + *OptLastMemCycle - 1; | |||
There was a problem hiding this comment.
up until this point the code is identical to the if (BundledMI->mayStore()) { case. We could harmonize these two conditions.
| assert(OptMOCycle); | ||
| const int OperandCycle = *OptMOCycle; | ||
| if (InSeparateRegion) { | ||
| const int EventCycle = Cycle + OperandCycle; |
There was a problem hiding this comment.
nit: we can hoist const int EventCycle = Cycle + OperandCycle; and use the variable in the conditional branches
| auto OptFirstMemCycle = TII->getFirstMemoryCycle(SchedClass); | ||
| assert(OptFirstMemCycle && "Load instruction without MemoryCycles"); | ||
| const int FirstMemCycle = *OptFirstMemCycle; | ||
| int FirstLoadFromEnd = Cycle - FirstMemCycle + 1; |
There was a problem hiding this comment.
Why do I need a +1 here?
Shouldn't the current variable be called CycleAfterFirstLoadFromEnd?
There was a problem hiding this comment.
Discussed offline.
| } | ||
|
|
||
| // Track stores for AA-based dependencies | ||
| if (BundledMI->mayStore()) { |
There was a problem hiding this comment.
nit: unify with the if (BundledMI->mayLoad()) { block
|
|
||
| unsigned | ||
| AIERegMemEventTracker::getSafeOperandsDistance(const MachineInstr &MI) const { | ||
| void AIERegMemEventTracker::computeUseDefBackward( |
There was a problem hiding this comment.
I think we could harmonize this method and computeUseDefForward, since they compute similar results (except for a backward forward update of cycles)
There was a problem hiding this comment.
We have some differences that could lead to hard to maintain conditional code.
| // For example, if we have LastStoreCycle = 21 (absolute) and this | ||
| // instruction FirstMemoryCycle = 5 (relative to the start cycle of this | ||
| // instruction), then this instruction should start earliest at cycle 18, | ||
| // to be able to start the FirstMemoryCycle (absolute) at cycle 22. In |
There was a problem hiding this comment.
nit: could you keep the comments that explain magic numbers?
| } | ||
|
|
||
| // Free load/store: check stores in bot-fixed (RAW dependency) | ||
| const int LoadStoreCycle = AA ? getMaxAliasingMemCycle(MI, /*IsStore=*/true) |
There was a problem hiding this comment.
nit we could remove the above if (MI.mayStore()) { code, if we use const bool IsStore= MI.mayStore() in these method calls
There was a problem hiding this comment.
But this flag is related to the events, not to MI.
862fb17 to
4604156
Compare
|
Hi @F-Stuckmann, thank you for the review and for catching some inconsistencies! I tried to improve the code with some refactor and harmonization, but I kept backward and forward mechanisms separated. |
The goal is to reduce pessimism by precise register event tracking. Locks and events are handled in the same way as in the epilogue.
4604156 to
74adf86
Compare


The goal is to reduce pessimism by precise register event tracking. Locks and events are handled in the same way as in the epilogue.