
Conversation

@davschneller
Contributor

In this PR, we:

  • add GPU block-level kernel support, i.e. kernels that are called from GPU code and run within a single GPU thread block (for now under the kernel target igpu, as opposed to gpu or cpu)
    • This contrasts with the current kernel-level GPU kernel generation, which is called from host/CPU code.
    • Temporary memory is currently placed in shared memory and needs to be supplied from outside; effectively, only a pointer is passed. (TODO for the future: maybe also keep some intermediate results in registers.)
  • make Yateto kernels compile, run, and be callable within CUDA device code; also add support (still untested) for HIP, SYCL, and OpenMP target.

All of this is still very much WIP and probably pretty slow, hence a draft. (The only "performance" benefits at the moment are that multiple threads are used for some computations and that shared memory is used for intermediate results.) A rough sketch of what the caller's side looks like is below.
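As a rough illustration of the caller's side (all names here, such as `kernel::device::volume`, are made up for the example and are not the actual generated interface): the block-level kernel is just a `__device__` function that the whole thread block calls cooperatively, with its scratch memory handed in from outside, e.g. as dynamic shared memory.

```cuda
#include <cuda_runtime.h>

namespace kernel { namespace device {
// Stand-in for a Yateto-generated block-level kernel; the real generated code
// performs a full tensor contraction, here only the shape of the interface
// matters: all threads of the block participate, and the temporary buffer
// 'tmp' is supplied from outside instead of being allocated inside.
__device__ void volume(const float* A, const float* B, float* C, float* tmp) {
  for (int i = threadIdx.x; i < 64; i += blockDim.x) {
    tmp[i] = A[i] * B[i];   // intermediate result lives in shared memory
  }
  __syncthreads();
  for (int i = threadIdx.x; i < 64; i += blockDim.x) {
    C[i] += tmp[i];
  }
}
}} // namespace kernel::device

__global__ void userKernel(const float* A, const float* B, float* C) {
  extern __shared__ float scratch[];  // dynamic shared memory, sized at launch
  // ... other per-block work ...
  kernel::device::volume(A, B, C, scratch);  // block-level call, no extra launch
  __syncthreads();
  // ... more per-block work ...
}
```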

@uphoffc
Contributor

uphoffc commented Feb 27, 2025

How is this going to interact with the existing gemm / fused_gemm code?

Are you going to create a calling convention and expect the existing generators to provide GEMM / fused GEMM kernels that only work on a single matrix chain multiplication at a time instead of a batch of matrix chains?

You need to keep in mind that chainforge and tinytc have restrictions on the work group size (thread block size) that might depend on architecture and problem size. So when fusing those kernels into a single one, combined with generic code from yateto, you need to ensure that you use the same work group size for all kernels.
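Just to make the constraint concrete (the block sizes and names below are made up): once several generated routines end up in the same `__global__` kernel, they all share one `blockDim`, so the launch has to pick a work-group size that every generator accepts.

```cuda
#include <cuda_runtime.h>

// Hypothetical requirements of the individual generators.
constexpr int kGemmGeneratorBlockSize = 64;   // e.g. fixed by chainforge/tinytc
constexpr int kGenericYatetoBlockSize = 64;   // the generic code must adapt to it

__global__ void fusedKernel(float* data) {
  // Both the generated GEMM part and the generic Yateto part run here with
  // the same blockDim; neither can pick its own work-group size anymore.
  data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

void launchFused(float* data, int numBlocks, cudaStream_t stream) {
  static_assert(kGemmGeneratorBlockSize == kGenericYatetoBlockSize,
                "all fused parts must agree on the work-group size");
  fusedKernel<<<numBlocks, kGemmGeneratorBlockSize, 0, stream>>>(data);
}
```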

@davschneller
Contributor Author

First, to answer the points raised (note that the branch is still WIP):

  • Yes, of course the block-level setting should be forwarded. It's not done so far, since I didn't know an elegant way to do it (Python/C++ interop...), but yes, it is pretty much necessary.
  • The interface towards TensorForge/chainforge on the Yateto side should already be pretty much done in conjunction with Add support for external kernel code generators #81, by just setting a routine generator for the igpu target. Since it already generates CUDA/HIP/SYCL/OMPT code, we would only need to remove the launcher there and pass the given block sizes. The only thing left on the Yateto side would be the subroutines (saying that without having tested it, haha...); they should be wired up in a usable way, probably resulting in one more codegen/subroutine file. Also, I can't say anything about any other code generator, as they're outside my jurisdiction. :)

Having the kernel handle only one tensor contraction, instead of a whole batch, also feels more in line with the original "spirit" of Yateto (at least as I interpret it): as a tool besides other code in a kernel, and not a full kernel next to other kernels. Maybe debatable, but there's a difference depending on whether you include the launch code (the "parallel for") or not; roughly as sketched below.
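Simplified, and with made-up names: the current GPU target emits the batch loop (the "parallel for") itself, while the block-level variant handles a single contraction and leaves the batch mapping to the surrounding kernel.

```cuda
// Current kernel-level style: Yateto owns the launch and the batch mapping.
__global__ void batchedContraction(float** A, float** B, float** C) {
  const int element = blockIdx.x;   // one thread block per batch element
  // ... contraction on A[element], B[element], C[element] ...
  C[element][threadIdx.x] = A[element][threadIdx.x] * B[element][threadIdx.x];
}

// Block-level style: a single contraction, callable from someone else's kernel;
// the caller decides how batch elements map to blocks (the "parallel for").
__device__ void singleContraction(const float* A, const float* B, float* C) {
  C[threadIdx.x] = A[threadIdx.x] * B[threadIdx.x];
}
```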

Anyway, if someone really needs more of a "why" for this PR:

(a) an intermediate way to harness fast small GETTs/block-level kernels in kernels which can or will be fused later (but aren't yet, due to time constraints), and

(b) a way to harness fast small GETTs/block-level kernels in more complex kernels than we currently even plan for. E.g. when, together with some grid syncs and some nv/roc/intel shmem implementation, maybe even going for one big permanent kernel per cluster (as a faint idea, very roughly sketched below). I'd be really surprised if we could get Yateto to cover everything that far.
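A very hand-wavy sketch of what (b) could mean on the CUDA side (the grid sync via cooperative groups is real CUDA API; everything else here is made up):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void persistentClusterKernel(float* state, int numTimesteps) {
  cg::grid_group grid = cg::this_grid();
  for (int step = 0; step < numTimesteps; ++step) {
    // ... call block-level Yateto kernels for the elements owned by this block ...
    state[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
    // Grid-wide barrier between timesteps; requires launching the kernel
    // cooperatively (cudaLaunchCooperativeKernel) so all blocks are resident.
    grid.sync();
  }
}
```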

(side note to self: kernel-in-kernel/dynamic parallelism launch code might be another small interesting thing to implement)
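(For reference, a minimal sketch of what that would look like with CUDA dynamic parallelism; the kernel names are made up, and it needs relocatable device code, i.e. -rdc=true:)

```cuda
__global__ void generatedChildKernel(float* data) {
  data[threadIdx.x] *= 2.0f;
}

__global__ void parentKernel(float* data) {
  if (threadIdx.x == 0) {
    // Launch a (generated) kernel from device code: CUDA dynamic parallelism.
    generatedChildKernel<<<1, 64>>>(data);
  }
}
```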

Also, the current "GPU" kernels (which will probably be horribly slow) are just meant to provide a bare baseline, so that we get e.g. SeisSol to at least compile without any additional codegen. That alone can already help when a codegen is broken.

[Anyway, replacing all kernels by Yateto/batch calls would be absolutely great; cf. the TensorForge-develop branch in SeisSol, where the Imposed-Slip-Rates DR could already be generated from a Python description and yields (mostly) the correct results.
I'm still working to get some elementwise nonlinear functions into mainline Yateto soon-ish (i.e. to "port" them from the TensorForge branch), but progress has been slower than intended, at least until recently. A datatype extension is now finally in the works on davschneller/datatyping.]

In principle, the whole thing here is partially inspired by cuBLASDx, which strives for such an interface for GEMMs (in an almost LIBXSMM-esque fashion) in an in-kernel environment for NVIDIA GPUs.

And, surprisingly, it hasn't been too hard to implement a similar mechanism in Yateto right now; minus the performance, probably.

@uphoffc
Contributor

uphoffc commented Feb 28, 2025

Having the kernel handle only one tensor contraction, instead of a whole batch, also feels more in line with the original "spirit" of Yateto

Yes, that is how it is intended, and that is also a major necessity to avoid data movement and get a good AI (arithmetic intensity, not the other thing :-D). The problem on GPUs is that it isn't as easy as on CPUs...
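To make the AI point concrete with a toy calculation (numbers purely illustrative): fusing two chained GEMMs, D = (A·B)·E with square n×n single-precision matrices, keeps the intermediate A·B on chip instead of round-tripping it through global memory. Counting 2n³ flops per GEMM, 4 bytes per element, and each matrix moved through global memory exactly once (the unfused version moves 6 matrices, since the intermediate is written and read again; the fused version moves 4):

$$
\mathrm{AI}_{\text{unfused}} = \frac{4n^3}{4\,(3n^2 + 3n^2)} = \frac{n}{6},
\qquad
\mathrm{AI}_{\text{fused}} = \frac{4n^3}{4\,(4n^2)} = \frac{n}{4}
$$

So even this idealized two-GEMM case gains a factor of 1.5 in AI, and with longer chains the gap grows further.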

Anyway, if someone really needs more of a "why" for this PR:

I see the point of the PR; I just wanted to point out some difficulties that one is going to encounter, in particular w.r.t. work-group size constraints.
