
Conversation

@Kewen12 Kewen12 commented Dec 10, 2025

This PR fixes the issue tracked in https://ontrack-internal.amd.com/browse/SWDEV-566712.

Launch __ockl_dm_init_v1 in a special kernel, before any user kernels, to initialize device memory. In this implementation we launch the kernel after the image is loaded, so the image load/unload logic does not have to be repeated. The special kernel is built into the device image (there will likely be duplication when multiple images are present).

Ideally, this kernel could live in a separate image (or in a binary blob embedded in the host runtime) so that it is more self-contained. I will consider that as future work.

Smoke tests all passed.

@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR with a comment such as "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@Kewen12 Kewen12 changed the title from "[OpenMP][AMDGPU] Introduce memory initialization" to "[OpenMP][AMDGPU] Introduce device memory initialization" Dec 10, 2025
@ronlieb ronlieb (Collaborator) left a comment


LGTM, JP ?

@jplehr jplehr left a comment


Do we know what the perf implications are?


// Launch the special kernel for device memory initialization
if (Error Err = launchDMInitKernel(*AMDImage))
return std::move(Err);

Why can't we simply return Err here?

Author

Yes, we can directly return Err here; the compiler will perform the implicit move for us. I used std::move just to match the style of the existing code.

@ronlieb ronlieb merged commit 7ea86bb into amd-staging Dec 10, 2025
9 checks passed
@ronlieb ronlieb deleted the omp-enable-mem-init branch December 10, 2025 18:49
void *DMHeapPtr = nullptr;
void *DMSlabPtr = nullptr;
bool DMInitialized = false;
static constexpr uint32_t DMNumSlabs = 256;
Collaborator

This seems really large, especially if malloc is unused. HIP uses 4 here (8MB).

Author

Thanks for the review, Brian! For my own learning: if the slabs run out for malloc, will that cause substantial performance overhead (I guess there would be dynamic allocation)?

Collaborator

Once the initial DMNumSlabs = 256 slabs are exhausted, the implementation starts asking for new slabs via hostcall/hostrpc, and that will be slower. On the other hand, every process running on the GPU reserving 1/2 GB each, and sometimes using none of it, seems wasteful, and it could be problematic for applications that want to use most of the device memory for their own purposes and deliberately avoid device malloc. The OpenMP team will have to decide on the best tradeoff, of course, but I think 256 is pretty large.

Collaborator

Thanks, Brian. Kewen will try the smaller threshold later today in a follow-on PR.

Author

Thanks, Brian! I will put up another PR following your suggestion, lowering the preallocation (to 8 MB) to see how it performs.

for (AMDGPUMemoryPoolTy *MemoryPool : AllMemoryPools) {
if (!MemoryPool->isGlobal())
continue;

if (MemoryPool->isCoarseGrained()) {
DevPtr = nullptr;
size_t PreAllocSize = hsa_utils::PER_DEVICE_PREALLOC_SIZE;

Error Err = MemoryPool->allocate(PreAllocSize, &DevPtr);
Collaborator

I'm a little concerned about doing this as a single allocation. The slabs must be 2 MB aligned, and with the prealloc region at the beginning it seems unlikely that the slabs will end up 2 MB aligned. Putting the prealloc at the end, or allocating it separately, would improve the odds.

We might want an assert or some other guard to ensure that DMSlabPtr is 2MB aligned.

Author

Thanks for the comment! Looking into this now.
