Skip to content

Conversation

@zstreet87
Copy link

Initial draft PR has the following unit test status on ROCm6.4 (ROCm7.0 still needs to be tested). This shows the basic functionality is there. Now need to work on failing unit tests but this work may include modifications to higher level rust code in monarch.

running 38 tests
test ibverbs_primitives::tests::test_format_gid ... ok
test device_selection::tests::test_parse_device_string ... ok
test ibverbs_primitives::tests::test_ibv_wc ... ok
test ibverbs_primitives::tests::test_rdma_operation_conversion ... ok
test ibverbs_primitives::tests::test_rdma_endpoint ... ok
test device_selection::tests::test_gt20_hardware ... ok
test rdma_manager_actor_tests::tests::test_rdma_read_separate_devices_db_check ... ignored, This test needed to be run in isolation
test rdma_manager_actor_tests::tests::test_rdma_read_separate_devices_db_device_trigger ... ignored, This test needed to be run in isolation
test rdma_manager_actor_tests::tests::test_rdma_write_recv_separate_devices_db_trigger ... ignored, This test needed to be run in isolation
test rdma_manager_actor_tests::tests::test_rdma_write_separate_devices_db ... ignored, This test needed to be run in isolation
test ibverbs_primitives::tests::test_mlx5dv_supported_basic ... ok
test ibverbs_primitives::tests::test_device_display ... ok
test ibverbs_primitives::tests::test_first_available ... ok
test ibverbs_primitives::tests::test_port_display ... ok
test ibverbs_primitives::tests::test_get_all_devices ... ok
failed to create queue pair (QP): No space left on device
failed to create queue pair (QP): No space left on device
test rdma_components::tests::test_loopback_connection ... FAILED
test rdma_components::tests::test_create_connection ... FAILED
test rdma_manager_actor_tests::tests::test_rdma_write_separate_devices_db_device_trigger ... ok
failed to create queue pair (QP): Device or resource busy
failed to create queue pair (QP): Device or resource busy
failed to create queue pair (QP): Device or resource busy
failed to create queue pair (QP): Device or resource busy
failed to create queue pair (QP): Device or resource busy
failed to create queue pair (QP): Device or resource busy
failed to create queue pair (QP): Device or resource busy
failed to create queue pair (QP): Device or resource busy
failed to create queue pair (QP): Device or resource busy
error: test failed, to rerun pass `-p monarch_rdma --lib`

Caused by:
  process didn't exit successfully: `/home/zstreete/monarch/target/debug/deps/monarch_rdma-95e7dfb1ab756acb` (signal: 11, SIGSEGV: invalid memory reference)

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 5, 2025
@zstreet87 zstreet87 changed the title Hipify Monarch [ROCM] Hipify Monarch Dec 5, 2025
@zstreet87 zstreet87 marked this pull request as ready for review December 9, 2025 22:53
@meta-codesync
Copy link

meta-codesync bot commented Dec 9, 2025

@mreso has imported this pull request. If you are a Meta employee, you can view this in D88788781.

println!("cargo:rustc-link-lib=amdhip64");
println!("cargo:rustc-link-lib=hsa-runtime64");
} else {
println!("cargo:rustc-link-lib=cuda");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our build process has changed, an we no longer dynamically link libcuda.so or libcudart.so. This PR will need to continue to match the new approach before we can land it.
ldd _rust_bindings.so should not link: any rdma libraries dynamically, nccl, libcudart, libcuda, or torch. This lets us have a single disttribution of monarch with a small binary footprint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zdevito do you have a pointer/PR on how the build process should be structured now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is the right PR: #2045 ? @zdevito

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fast moving target!

Appreciate the link and the comment. I'll work on this tomorrow!

@zstreet87
Copy link
Author

zstreet87 commented Dec 18, 2025

Refactors the build system to properly support ROCm by using dlopen for HIP/HSA driver API functions, matching CUDA's approach where possible.
Key Changes

New build_utils/src/rocm.rs - Modular patching for hipified sources (ROCm 6.x vs 7+ handling)
Simplified rdmaxcel-sys/build.rs - Platform abstraction for cleaner CUDA/ROCm paths
dlopen for driver API - hipFree and HSA functions loaded dynamically instead of linked

Why libamdhip64.so still appears in ldd
Unlike CUDA (which has libcudart_static.a), ROCm has no static runtime library. Code compiled by hipcc generates device registration calls (__hipRegisterFatBinary, hipLaunchKernel) that must resolve at load time. This is an AMD architectural limitation, not something we can work around.
Testing

All passed

  • cargo build --features tensor_engine -p monarch_extension
  • cargo test -p rdmaxcel-sys / cargo test -p nccl-sys
  • Python bindings load successfully

Also all CI unit tests passed.
Summary [ 33.872s] 779 tests run: 779 passed, 28 skipped

cc: @zdevito @mreso

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. module: rocm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants