Determining if we should fuse (or split) loops for GPUs #2881
My view is that this is hard to do in the general case: it requires heuristics and hardware/performance models that we don't have, and it would need a full-fledged research project to do properly. I also don't think the "language-level PSyIR" is the most appropriate level of abstraction for this.

That said, I am sure there are some easy wins ("obviously too small -> fuse", "obviously too big -> split") that we could add to PSyclone and that would bring big performance improvements, but I wouldn't go much deeper than that using PSyIR.

For more complex cases there are two approaches. One is to concentrate on PSyclone having the right capabilities for transforming the code safely, while putting the decision making / heuristics into the Socrates/CASIM scripts. The problem then becomes much more concrete: we may only need to identify a few patterns present in the application, plus a few filename exceptions where these patterns fail. This is typically our approach.

The other is what @schreiberx's group is doing: finding an appropriate level of abstraction where only the problem domain space is defined, but not exactly where each explicit loop / device launch is, so that this can be decided top-down with a performance model of the hardware/memory. This still only works for a subset of Fortran (one that can be raised to the desired abstraction, or written as a DSL), but it could make better decisions for the general case than our approach. However, it will take time to have all the components ready and to do all the research needed. (Autotuning is another strategy that we have discussed a few times.) The good thing is that the second approach can be built on top of the first one, so it complements what we do.

So, for NG-Arch I would add simple meta-transformations for fusing/splitting (e.g. single-statement loops that use the same references should be fused; split once a threshold number of references is exceeded) when this is beneficial to one of our codes, but I suggest not duplicating work with @schreiberx, and talking with him if you are thinking of something more complex involving performance models.
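
To make the two example rules concrete, here is a minimal sketch of what such script-level heuristics might look like. It is only an illustration: `Loop`, `should_fuse` and `split_by_threshold` are hypothetical names working on a toy model (each statement is just the set of names it references), not PSyclone or PSyIR API. Reading "use the same references" as "identical reference sets" is my assumption, and no dependence analysis is done here; that safety check is assumed to stay with the actual transformations.

```python
from dataclasses import dataclass


@dataclass
class Loop:
    """Toy stand-in for a loop (hypothetical, not a PSyIR node)."""
    bounds: tuple      # e.g. (1, "n")
    statements: list   # each statement is the set of names it references


def should_fuse(first, second):
    """Rule 1: adjacent single-statement loops over the same range whose
    statements reference exactly the same names are fusion candidates."""
    return (first.bounds == second.bounds
            and len(first.statements) == 1
            and len(second.statements) == 1
            and first.statements[0] == second.statements[0])


def split_by_threshold(loop, threshold):
    """Rule 2: cut the body into consecutive chunks, starting a new loop
    whenever adding the next statement would push the number of distinct
    references in the current chunk past `threshold`. Statement order is
    preserved; legality of the split is NOT checked."""
    pieces, current, seen = [], [], set()
    for stmt in loop.statements:
        if current and len(seen | stmt) > threshold:
            pieces.append(Loop(loop.bounds, current))
            current, seen = [], set()
        current.append(stmt)
        seen |= stmt
    if current:
        pieces.append(Loop(loop.bounds, current))
    return pieces


same_a = Loop((1, "n"), [{"a", "b"}])
same_b = Loop((1, "n"), [{"a", "b"}])
other_range = Loop((1, "m"), [{"a", "b"}])
big = Loop((1, "n"), [{"a", "b"}, {"b", "c"}, {"d", "e"}, {"e", "f"}])
pieces = split_by_threshold(big, threshold=3)
```

In this toy run `should_fuse(same_a, same_b)` holds, `should_fuse(same_a, other_range)` does not, and `big` is cut into two loops of two statements each. The threshold itself would have to be tuned per code and per target, which is exactly the kind of decision that fits in the application scripts.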
One thing that came up when talking to US labs about Socrates/CASIM is that, for performance, we should fuse loops where possible as long as we don't exceed some register count, and that some loops on GPUs may need to be split into multiple loops (if possible) to reduce register usage.
Is this something we could teach PSyclone (or have a utility function that tries to do this)? Nowait at least helps with loops that could have been fused, but splitting is a more complicated beast (it isn't easy to do for a lot of loops, but we can at least see what we can do).
Does this seem like something we should be able to do? Fitting it into the NG-ARCH timeframe might be difficult though. @sergisiso @arporter
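
As a rough illustration of the "fuse as long as we stay under some register count" idea, the sketch below greedily merges adjacent loops with identical bounds while a crude budget holds. Everything in it is an assumption for the sake of the example: `Loop` and `fuse_under_budget` are hypothetical names rather than PSyclone API, the number of distinct referenced names is only a stand-in for real register usage (which depends on the compiler and the GPU), and it does not check that fusing is actually legal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Loop:
    """Toy stand-in for a loop: its bounds plus the names its body touches."""
    bounds: tuple      # e.g. (1, "n")
    refs: frozenset    # variables/arrays referenced in the body


def fuse_under_budget(loops, budget):
    """Greedily merge each loop into the previous one when both have the
    same bounds and the merged body references at most `budget` distinct
    names (a crude proxy for register pressure)."""
    fused = []
    for loop in loops:
        if fused:
            last = fused[-1]
            merged = last.refs | loop.refs
            if last.bounds == loop.bounds and len(merged) <= budget:
                fused[-1] = Loop(last.bounds, merged)
                continue
        fused.append(loop)
    return fused


loops = [
    Loop((1, "n"), frozenset({"a", "b"})),
    Loop((1, "n"), frozenset({"b", "c"})),
    Loop((1, "n"), frozenset({"d", "e", "f"})),
    Loop((1, "m"), frozenset({"a"})),
]
tight = fuse_under_budget(loops, budget=4)
loose = fuse_under_budget(loops, budget=6)
```

With a budget of 4 only the first two loops are merged, leaving three loops; with a budget of 6 the first three are merged, leaving two. The loop over `m` is never fused because its bounds differ.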