
Conversation

@davschneller
Contributor

In this PR, we:

  • add GPU block-level kernel support, i.e. kernels that are called from GPU code and run within a single GPU thread block (for now under the kernel target igpu, as opposed to gpu or cpu)
    • This contrasts with the current kernel-level GPU kernel generation, which is called from host/CPU code.
    • Temporary memory is currently placed in shared memory and needs to be supplied from outside; effectively, only a pointer is passed. (TODO for the future: maybe also keep some intermediate results in registers.)
  • make Yateto kernels compile, run, and be callable within CUDA device code; also add support (still untested) for HIP, SYCL, and OpenMP target.

All of this is still very much WIP and probably pretty slow, hence a draft. (The only "performance" benefits at the moment are that multiple threads are used for some computations and that shared memory is used for intermediate results.) A rough sketch of what the caller's side looks like is below.
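As a rough illustration of the caller's side (all names here, such as `kernel::device::volume`, are made up for the example and are not the actual generated interface): the block-level kernel is just a `__device__` function that the whole thread block calls cooperatively, with its scratch memory handed in from outside, e.g. as dynamic shared memory.

```cuda
#include <cuda_runtime.h>

namespace kernel { namespace device {
// Stand-in for a Yateto-generated block-level kernel; the real generated code
// performs a full tensor contraction, here only the shape of the interface
// matters: all threads of the block participate, and the temporary buffer
// 'tmp' is supplied from outside instead of being allocated inside.
__device__ void volume(const float* A, const float* B, float* C, float* tmp) {
  for (int i = threadIdx.x; i < 64; i += blockDim.x) {
    tmp[i] = A[i] * B[i];   // intermediate result lives in shared memory
  }
  __syncthreads();
  for (int i = threadIdx.x; i < 64; i += blockDim.x) {
    C[i] += tmp[i];
  }
}
}} // namespace kernel::device

__global__ void userKernel(const float* A, const float* B, float* C) {
  extern __shared__ float scratch[];  // dynamic shared memory, sized at launch
  // ... other per-block work ...
  kernel::device::volume(A, B, C, scratch);  // block-level call, no extra launch
  __syncthreads();
  // ... more per-block work ...
}
```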

@uphoffc
Contributor

uphoffc commented Feb 27, 2025

How is this going to interact with the existing gemm / fused_gemm code?

Are you going to create a calling convention and expect the existing generators to provide GEMM / fused GEMM kernels that only work on a single matrix chain multiplication at a time instead of a batch of matrix chains?

You need to keep in mind that chainforge and tinytc have restrictions on the work group size (thread block size) that might depend on architecture and problem size. So when fusing those kernels into a single one, combined with generic code from yateto, you need to ensure that you use the same work group size for all kernels.
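Just to make the constraint concrete (the block sizes and names below are made up): once several generated routines end up in the same `__global__` kernel, they all share one `blockDim`, so the launch has to pick a work-group size that every generator accepts.

```cuda
#include <cuda_runtime.h>

// Hypothetical requirements of the individual generators.
constexpr int kGemmGeneratorBlockSize = 64;   // e.g. fixed by chainforge/tinytc
constexpr int kGenericYatetoBlockSize = 64;   // the generic code must adapt to it

__global__ void fusedKernel(float* data) {
  // Both the generated GEMM part and the generic Yateto part run here with
  // the same blockDim; neither can pick its own work-group size anymore.
  data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

void launchFused(float* data, int numBlocks, cudaStream_t stream) {
  static_assert(kGemmGeneratorBlockSize == kGenericYatetoBlockSize,
                "all fused parts must agree on the work-group size");
  fusedKernel<<<numBlocks, kGemmGeneratorBlockSize, 0, stream>>>(data);
}
```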

@davschneller
Contributor Author

First, to answer the points raised (note that the branch is still WIP):

  • Yes, of course the block-level setting should be forwarded. It's not done so far, since I didn't know an elegant way to do it (Python/C++ interop...), but yes, it is pretty much necessary.
  • The interface towards TensorForge/chainforge on the Yateto side should already be pretty much done in conjunction with Add support for external kernel code generators #81, by just setting a routine generator for the igpu target. Since it already generates CUDA/HIP/SYCL/OMPT code, we would only need to remove the launcher there and pass the given block sizes. The only thing left on the Yateto side would be the subroutines (saying that without having tested it, haha...); they should be wired up in a usable way, probably resulting in one more codegen/subroutine file. Also, I can't say anything about any other code generator, as they're outside my jurisdiction. :)

Having the kernel handle only one tensor contraction, instead of a whole batch, also feels more in line with the original "spirit" of Yateto (at least as I interpret it): as a tool besides other code in a kernel, and not a full kernel next to other kernels. Maybe debatable, but there's a difference depending on whether you include the launch code (the "parallel for") or not; roughly as sketched below.
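Simplified, and with made-up names: the current GPU target emits the batch loop (the "parallel for") itself, while the block-level variant handles a single contraction and leaves the batch mapping to the surrounding kernel.

```cuda
// Current kernel-level style: Yateto owns the launch and the batch mapping.
__global__ void batchedContraction(float** A, float** B, float** C) {
  const int element = blockIdx.x;   // one thread block per batch element
  // ... contraction on A[element], B[element], C[element] ...
  C[element][threadIdx.x] = A[element][threadIdx.x] * B[element][threadIdx.x];
}

// Block-level style: a single contraction, callable from someone else's kernel;
// the caller decides how batch elements map to blocks (the "parallel for").
__device__ void singleContraction(const float* A, const float* B, float* C) {
  C[threadIdx.x] = A[threadIdx.x] * B[threadIdx.x];
}
```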

Anyway, if someone really needs more of a "why" for this PR:

(a) an intermediate way to harness fast small GETTs/block-level kernels in kernels which can or will be fused later (but aren't yet, due to time constraints), and

(b) a way to harness fast small GETTs/block-level kernels in more complex kernels than we currently even plan for. E.g. when, together with some grid syncs and some nv/roc/intel shmem implementation, maybe even going for one big permanent kernel per cluster (as a faint idea, very roughly sketched below). I'd be really surprised if we could get Yateto to cover everything that far.
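A very hand-wavy sketch of what (b) could mean on the CUDA side (the grid sync via cooperative groups is real CUDA API; everything else here is made up):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void persistentClusterKernel(float* state, int numTimesteps) {
  cg::grid_group grid = cg::this_grid();
  for (int step = 0; step < numTimesteps; ++step) {
    // ... call block-level Yateto kernels for the elements owned by this block ...
    state[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
    // Grid-wide barrier between timesteps; requires launching the kernel
    // cooperatively (cudaLaunchCooperativeKernel) so all blocks are resident.
    grid.sync();
  }
}
```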

(side note to self: kernel-in-kernel/dynamic parallelism launch code might be another small interesting thing to implement)
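(For reference, a minimal sketch of what that would look like with CUDA dynamic parallelism; the kernel names are made up, and it needs relocatable device code, i.e. -rdc=true:)

```cuda
__global__ void generatedChildKernel(float* data) {
  data[threadIdx.x] *= 2.0f;
}

__global__ void parentKernel(float* data) {
  if (threadIdx.x == 0) {
    // Launch a (generated) kernel from device code: CUDA dynamic parallelism.
    generatedChildKernel<<<1, 64>>>(data);
  }
}
```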

Also, the current "GPU" kernels (which will probably be horribly slow) are just meant to provide a bare baseline, so that we get e.g. SeisSol to at least compile without any additional codegen. That alone can already help when a codegen is broken.

[Anyway, replacing all kernels by Yateto/batch calls would be absolutely great; cf. the TensorForge-develop branch in SeisSol, where the Imposed-Slip-Rates DR could already be generated from a Python description and yields (mostly) the correct results.
I'm still working to get some elementwise nonlinear functions into mainline Yateto soon-ish (i.e. to "port" them from the TensorForge branch), but progress has been slower than intended, at least until recently. A datatype extension is now finally in the works on davschneller/datatyping.]

In principle, the whole thing here is partially inspired by cuBLASDx, which strives for such an interface for GEMMs (in an almost LIBXSMM-esque fashion) in an in-kernel environment for NVIDIA GPUs.

And, surprisingly, it hasn't been too hard to implement a similar mechanism in Yateto right now; minus the performance, probably.

@uphoffc
Contributor

uphoffc commented Feb 28, 2025

Having the kernel handle only one tensor contraction, instead of a whole batch, also feels more in line with the original "spirit" of Yateto

Yes, that is how it is intended, and that is also a major necessity to avoid data movement and get a good AI (arithmetic intensity, not the other thing :-D). The problem on GPUs is that it isn't as easy as on CPUs...
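To make the AI point concrete with a toy calculation (numbers purely illustrative): fusing two chained GEMMs, D = (A·B)·E with square n×n single-precision matrices, keeps the intermediate A·B on chip instead of round-tripping it through global memory. Counting 2n³ flops per GEMM, 4 bytes per element, and each matrix moved through global memory exactly once (the unfused version moves 6 matrices, since the intermediate is written and read again; the fused version moves 4):

$$
\mathrm{AI}_{\text{unfused}} = \frac{4n^3}{4\,(3n^2 + 3n^2)} = \frac{n}{6},
\qquad
\mathrm{AI}_{\text{fused}} = \frac{4n^3}{4\,(4n^2)} = \frac{n}{4}
$$

So even this idealized two-GEMM case gains a factor of 1.5 in AI, and with longer chains the gap grows further.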

Anyway, if someone really needs more of a "why" for this PR:

I see the point of the PR; I just wanted to point out some difficulties that one is going to encounter, in particular w.r.t. work-group size constraints.
