[ROCM] Hipify Monarch #2073
base: main
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
monarch_rdma/build.rs
Outdated
    println!("cargo:rustc-link-lib=amdhip64");
    println!("cargo:rustc-link-lib=hsa-runtime64");
} else {
    println!("cargo:rustc-link-lib=cuda");
Our build process has changed, and we no longer dynamically link libcuda.so or libcudart.so. This PR will need to match the new approach before we can land it.
Running ldd on _rust_bindings.so should not show dynamic links to any RDMA libraries, NCCL, libcudart, libcuda, or torch. This lets us ship a single distribution of Monarch with a small binary footprint.
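To make the requirement above concrete, here is a minimal sketch of a check that scans `ldd` output text for libraries that should not be linked dynamically. The function name `forbidden_links`, the prefix list, and the specific RDMA library names (`libibverbs`, `libmlx5`) are assumptions for illustration, not part of Monarch's build.

```rust
/// Library name prefixes that should not appear as dynamic dependencies.
/// The RDMA entries (libibverbs, libmlx5) are assumed examples.
const FORBIDDEN_PREFIXES: &[&str] = &[
    "libcuda", // also matches libcudart
    "libnccl",
    "libtorch",
    "libibverbs",
    "libmlx5",
];

/// Returns the names of forbidden libraries found in `ldd` output text.
/// Each `ldd` line starts with the library name (or a path to it).
fn forbidden_links(ldd_output: &str) -> Vec<String> {
    ldd_output
        .lines()
        // The first whitespace-separated token is the library name or path.
        .filter_map(|line| line.split_whitespace().next())
        // Reduce a path such as /lib64/ld-linux-x86-64.so.2 to its file name.
        .map(|token| token.rsplit('/').next().unwrap_or(token))
        .filter(|name| FORBIDDEN_PREFIXES.iter().any(|p| name.starts_with(p)))
        .map(str::to_string)
        .collect()
}

fn main() {
    // Hand-written sample in the usual `ldd` layout, not real output.
    let sample = "\tlinux-vdso.so.1 (0x00007ffd)\n\
                  \tlibcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007f00)\n\
                  \tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f01)\n";
    println!("{:?}", forbidden_links(sample));
}
```

In practice the input would be the captured output of running ldd on _rust_bindings.so, and a non-empty result would fail the check.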
@zdevito do you have a pointer/PR on how the build process should be structured now?
This is a fast moving target!
Appreciate the link and the comment. I'll work on this tomorrow!
Refactors the build system to properly support ROCm by using dlopen for HIP/HSA driver API functions, matching CUDA's approach where possible.
New build_utils/src/rocm.rs - Modular patching for hipified sources (ROCm 6.x vs 7+ handling)
Why libamdhip64.so still appears in ldd
All passed
Also all CI unit tests passed.
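As a rough illustration of the dlopen approach described above, the sketch below resolves a symbol from a shared library at runtime instead of linking it at build time. It is not the code from this PR: the helper names `resolve_symbol` and `last_dl_error` are hypothetical, the choice of `hipInit` as the symbol to look up is an assumption, and `RTLD_NOW = 2` is the value used on Linux with glibc.

```rust
use std::ffi::{c_char, c_int, c_void, CStr, CString};

// POSIX dynamic loader entry points, declared by hand so that no extra crate is needed.
unsafe extern "C" {
    fn dlopen(filename: *const c_char, flags: c_int) -> *mut c_void;
    fn dlsym(handle: *mut c_void, symbol: *const c_char) -> *mut c_void;
    fn dlerror() -> *mut c_char;
}

const RTLD_NOW: c_int = 2; // value on Linux/glibc (assumption for other platforms)

/// Fetches the loader's most recent error message, if any.
fn last_dl_error() -> String {
    unsafe {
        let msg = dlerror();
        if msg.is_null() {
            "unknown dynamic loader error".to_string()
        } else {
            CStr::from_ptr(msg).to_string_lossy().into_owned()
        }
    }
}

/// Opens `library` at runtime and returns the address of `symbol`.
/// The handle is intentionally never closed so the address stays valid.
fn resolve_symbol(library: &str, symbol: &str) -> Result<usize, String> {
    let lib = CString::new(library).map_err(|e| e.to_string())?;
    let sym = CString::new(symbol).map_err(|e| e.to_string())?;
    unsafe {
        let handle = dlopen(lib.as_ptr(), RTLD_NOW);
        if handle.is_null() {
            return Err(last_dl_error());
        }
        let addr = dlsym(handle, sym.as_ptr());
        if addr.is_null() {
            return Err(last_dl_error());
        }
        Ok(addr as usize)
    }
}

fn main() {
    // On a machine without ROCm this reports an error instead of failing to start.
    match resolve_symbol("libamdhip64.so", "hipInit") {
        Ok(addr) => println!("hipInit resolved at {:#x}", addr),
        Err(e) => println!("HIP runtime not available: {e}"),
    }
}
```

The benefit of this pattern is that a missing library becomes a recoverable runtime error, so one binary can be shipped to machines with or without the GPU runtime installed.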
The initial draft PR has the following unit test status on ROCm 6.4 (ROCm 7.0 still needs to be tested). This shows that the basic functionality is there. The next step is to work on the failing unit tests, which may require modifications to higher-level Rust code in Monarch.