[OpenMP][AMDGPU] Introduce device memory initialization #807
Conversation
ronlieb left a comment:
LGTM, JP ?
jplehr left a comment:
Do we know what the perf implications are?
Quoted diff:

    // Launch the special kernel for device memory initialization
    if (Error Err = launchDMInitKernel(*AMDImage))
      return std::move(Err);
Why can't we simply return Err here?
Yes, we can directly return Err here; I think the compiler will do the implicit move for us. I used std::move just to match the coding style of the existing code.
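For reference, a minimal standalone sketch (not the plugin code; MoveOnly stands in for llvm::Error) showing that both return forms compile for a move-only type, because a local returned by value is treated as an rvalue:

    #include <utility>

    // Stand-in for a move-only type such as llvm::Error.
    struct MoveOnly {
      MoveOnly() = default;
      MoveOnly(const MoveOnly &) = delete;
      MoveOnly(MoveOnly &&) = default;
    };

    MoveOnly implicitMove() {
      MoveOnly M;
      return M;            // OK: the local is implicitly moved on return.
    }

    MoveOnly explicitMove() {
      MoveOnly M;
      return std::move(M); // Also OK; matches the style of the surrounding code.
    }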
Quoted diff:

    void *DMHeapPtr = nullptr;
    void *DMSlabPtr = nullptr;
    bool DMInitialized = false;
    static constexpr uint32_t DMNumSlabs = 256;
This seems really large, especially if malloc is unused. HIP uses 4 here (8MB).
Thanks for the review, Brian! For my learning: if the slabs run out for malloc, will that cause substantial performance overhead (I guess there will be dynamic allocation)?
Once the initial DMNumSlabs=256 slabs are exhausted, the implementation will start asking for new slabs via hostcall/hostrpc, and that will be slower. On the other hand, every process running on the GPU taking away 1/2 GB of space, and sometimes using none of it, seems wasteful, and could be problematic for applications that want to use most of the device memory for their own purposes and are avoiding device malloc. The OpenMP team will have to decide on the best tradeoff, of course, but I think 256 is pretty large.
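For context, a back-of-the-envelope sizing sketch, assuming the 2 MiB slab size implied by "HIP uses 4 here (8MB)"; the constant names below are illustrative, not the real ones:

    #include <cstdint>

    // 256 slabs of 2 MiB each preallocate 512 MiB per process, versus 8 MiB
    // with HIP's default of 4 slabs.
    constexpr uint64_t SlabBytes    = 2ull * 1024 * 1024;  // assumed 2 MiB per slab
    constexpr uint64_t PreallocHere = 256 * SlabBytes;     // 512 MiB (1/2 GiB)
    constexpr uint64_t PreallocHIP  = 4 * SlabBytes;       // 8 MiB
    static_assert(PreallocHere == 512ull * 1024 * 1024, "1/2 GiB per process");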
Thanks, Brian. Kewen will try the smaller threshold later today in a follow-on PR.
Thanks, Brian! I will put up another PR following your suggestion, lowering the number of slabs (to 8 MB of preallocation) to see how it performs.
Quoted diff:

    for (AMDGPUMemoryPoolTy *MemoryPool : AllMemoryPools) {
      if (!MemoryPool->isGlobal())
        continue;

      if (MemoryPool->isCoarseGrained()) {
        DevPtr = nullptr;
        size_t PreAllocSize = hsa_utils::PER_DEVICE_PREALLOC_SIZE;

        Error Err = MemoryPool->allocate(PreAllocSize, &DevPtr);
I'm a little concerned about doing this in a single allocation. The slabs must be 2MB aligned, and with the prealloc at the beginning, it seems unlikely that the slab region will end up 2MB aligned. Putting it at the end might improve the odds, or it could be allocated separately.
We might want an assert or some other guard to ensure that DMSlabPtr is 2MB aligned.
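One way such a guard could look, as a hedged sketch (DMSlabPtr mirrors the PR's member name; assertSlabAligned and alignUpTo2MB are hypothetical helpers, not part of the PR):

    #include <cassert>
    #include <cstdint>

    constexpr uintptr_t SlabAlign = 2 * 1024 * 1024; // slabs must be 2 MiB aligned

    // Assert-style guard on the slab base handed to the device allocator.
    inline void assertSlabAligned(void *DMSlabPtr) {
      assert(reinterpret_cast<uintptr_t>(DMSlabPtr) % SlabAlign == 0 &&
             "device-malloc slab region must be 2MB aligned");
    }

    // Alternative: over-allocate and round the slab base up to the next
    // 2 MiB boundary (hypothetical helper, not part of the PR).
    inline void *alignUpTo2MB(void *Ptr) {
      uintptr_t P = reinterpret_cast<uintptr_t>(Ptr);
      return reinterpret_cast<void *>((P + SlabAlign - 1) & ~(SlabAlign - 1));
    }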
Thanks for the comment! Looking into this now
This PR fixes the issue tracked in https://ontrack-internal.amd.com/browse/SWDEV-566712.
Launch __ockl_dm_init_v1 in a special kernel before any user kernels to initialize device memory. In this implementation, we launch the kernel after the image is loaded so that we don't have to repeat the image load/unload logic. The special kernel is built into the device image (there will probably be duplication across multiple images). Ideally this kernel could live in a separate image (or in a binary blob embedded in the host runtime) so that it is more self-contained; I will consider that as future work.
Smoke tests all passed.