
Conversation

@Kewen12 Kewen12 commented Dec 10, 2025

This PR fixes the issue tracked in https://ontrack-internal.amd.com/browse/SWDEV-566712.

Launch __ockl_dm_init_v1 in a special kernel, before any user kernels, to initialize device memory. In this implementation we launch the kernel after the image is loaded, so the image load/unload logic does not have to be repeated. The special kernel is built into the device image (there will likely be duplication when multiple images are present).

Ideally, this kernel could live in a separate image (or in a binary blob embedded in the host runtime) so that it is more self-contained. I will consider that as future work.

Smoke tests all passed.

@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR with a comment such as "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@Kewen12 Kewen12 changed the title from "[OpenMP][AMDGPU] Introduce memory initialization" to "[OpenMP][AMDGPU] Introduce device memory initialization" Dec 10, 2025
@ronlieb ronlieb (Collaborator) left a comment


LGTM, JP ?

@jplehr jplehr left a comment


Do we know what the perf implications are?


// Launch the special kernel for device memory initialization
if (Error Err = launchDMInitKernel(*AMDImage))
return std::move(Err);

Why can't we simply return Err here?

Author

Yes, we can directly return Err here; the compiler will perform the implicit move for us. I used std::move just to match the style of the existing code.

@ronlieb ronlieb merged commit 7ea86bb into amd-staging Dec 10, 2025
9 checks passed
@ronlieb ronlieb deleted the omp-enable-mem-init branch December 10, 2025 18:49
void *DMHeapPtr = nullptr;
void *DMSlabPtr = nullptr;
bool DMInitialized = false;
static constexpr uint32_t DMNumSlabs = 256;
Collaborator

This seems really large, especially if malloc is unused. HIP uses 4 here (8MB).

Author

Thanks for the review, Brian! For my own learning: if the slabs run out for malloc, will that cause substantial performance overhead (I guess there would be dynamic allocation)?

Collaborator

Once the initial DMNumSlabs = 256 slabs are exhausted, the implementation starts asking for new slabs via hostcall/hostrpc, and that will be slower. On the other hand, every process running on the GPU reserving 1/2 GB each, and sometimes using none of it, seems wasteful, and it could be problematic for applications that want to use most of the device memory for their own purposes and deliberately avoid device malloc. The OpenMP team will have to decide on the best tradeoff, of course, but I think 256 is pretty large.

Collaborator

Thanks, Brian. Kewen will try the smaller threshold later today in a follow-on PR.

Author

Thanks, Brian! I will put up another PR following your suggestion, lowering the preallocation (to 8 MB) to see how it performs.

for (AMDGPUMemoryPoolTy *MemoryPool : AllMemoryPools) {
if (!MemoryPool->isGlobal())
continue;

if (MemoryPool->isCoarseGrained()) {
DevPtr = nullptr;
size_t PreAllocSize = hsa_utils::PER_DEVICE_PREALLOC_SIZE;

Error Err = MemoryPool->allocate(PreAllocSize, &DevPtr);
Collaborator

I'm a little concerned about doing this as a single allocation. The slabs must be 2 MB aligned, and with the prealloc region at the beginning it seems unlikely that the slabs will end up 2 MB aligned. Putting the prealloc at the end, or allocating it separately, would improve the odds.

We might want an assert or some other guard to ensure that DMSlabPtr is 2MB aligned.

Author

Thanks for the comment! Looking into this now.
