[ROCm] Rework ROCm build to use ROCm version instead of HIP version #1888
sstamenk wants to merge 5 commits into bitsandbytes-foundation:main
Providing a table with various scenarios to better understand the logic. The library is named at build time and looked up by name at runtime. If the names don't match, loading fails.
When both … In most cases we will hit the first scenario, and everything will be fine. Going forward, the ROCm version will become even more common. If for some reason there is a version mismatch, the user can always override them with either … Going forward, the HIP version fallback can be removed entirely once a few versions pass and it is no longer needed for compatibility.
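To make the build-time naming and runtime lookup concrete, here is a minimal sketch, not the actual `cextension.py` logic. It assumes the `libbitsandbytes_rocm<shortcode>.so` naming pattern seen in this PR and that a non-empty `BNB_ROCM_VERSION` replaces the detected shortcode; the helper names `rocm_library_name` and `resolve_library` are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def rocm_library_name(detected_shortcode: str,
                      env: Optional[Mapping[str, str]] = None) -> str:
    """Return the file name the runtime would look for (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: a non-empty BNB_ROCM_VERSION replaces the detected shortcode.
    shortcode = env.get("BNB_ROCM_VERSION") or detected_shortcode
    return f"libbitsandbytes_rocm{shortcode}.so"


def resolve_library(lib_dir: Path, detected_shortcode: str,
                    env: Optional[Mapping[str, str]] = None) -> Path:
    """Look the library up by name; fail if the build used a different name."""
    candidate = Path(lib_dir) / rocm_library_name(detected_shortcode, env)
    if not candidate.is_file():
        raise FileNotFoundError(
            f"{candidate.name} not found in {lib_dir}; "
            "set BNB_ROCM_VERSION to the shortcode the library was built with"
        )
    return candidate
```

If the build produced `libbitsandbytes_rocm71.so` but the runtime detects a different shortcode, `resolve_library` raises; setting `BNB_ROCM_VERSION=71` makes the names match again.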
Starting with ROCm 7.x, the HIP SDK version diverged from the ROCm version, causing misnamed libraries when relying on `hipconfig --version`. This commit:
- Renames BUILD_HIP to BUILD_ROCM and COMPUTE_BACKEND "hip" to "rocm" throughout CMake, C++, CI scripts, and documentation.
- Detects the ROCm version from <rocm_root>/.info/version (the canonical source) and falls back to hipconfig only for older installs, with a warning about the version mismatch risk.
- Emits FATAL_ERROR if neither .info/version nor hipconfig can detect a ROCm version, with clear instructions to pass -DROCM_VERSION.
- Simplifies HIP architecture target selection: BNB_ROCM_ARCH overrides everything, CMAKE_HIP_ARCHITECTURES is respected if already set, otherwise the full default target list is used. AMDGPU_TARGETS is no longer consulted (it often narrows the list unintentionally).
- Adds -DROCM_VERSION=<shortcode> for explicit build-time override and BNB_ROCM_VERSION=<shortcode> for runtime library selection.
- Adds hip/hip_runtime.h include under BUILD_ROCM in pythonInterface.cpp to provide HIP type definitions needed by the interface layer.
- Handles single-digit -DROCM_VERSION input gracefully.
- Prefers torch.version.rocm over torch.version.hip in Python for accurate version reporting.
- Improves BNB_ROCM_VERSION / BNB_CUDA_VERSION override validation and error messages.
- Updates tests to match the new override logic.
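The detection order described in the commit message can be sketched in Python for illustration; the real logic lives in `CMakeLists.txt` and may differ. This sketch assumes `.info/version` holds a string such as `7.1.0-38` and that the shortcode is the major and minor numbers concatenated (as in `-DROCM_VERSION=71`). The function names are hypothetical, and the `hipconfig` output is passed in as a string rather than obtained by running the tool.

```python
import re
import warnings
from pathlib import Path
from typing import Optional


def shortcode_from_version(version: str) -> str:
    """Turn a version string such as '7.1.0-38' into the shortcode '71'."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return match.group(1) + match.group(2)


def detect_rocm_shortcode(rocm_root: Path,
                          override: Optional[str] = None,
                          hipconfig_version: Optional[str] = None) -> str:
    # 1. An explicit override (the -DROCM_VERSION analogue) wins.
    if override:
        return override
    # 2. Preferred source: <rocm_root>/.info/version.
    info_file = Path(rocm_root) / ".info" / "version"
    if info_file.is_file():
        return shortcode_from_version(info_file.read_text())
    # 3. Fallback for older installs: the hipconfig version, with a warning.
    if hipconfig_version:
        warnings.warn("Using the hipconfig version; it may not match the ROCm version")
        return shortcode_from_version(hipconfig_version)
    # 4. Nothing worked: stop, mirroring the FATAL_ERROR in the build.
    raise RuntimeError("Could not detect the ROCm version; pass -DROCM_VERSION=<shortcode>")
```

The override is checked first so that a user can always force a shortcode, even when `.info/version` exists but reports something unexpected.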
91c27f2 to eb72c84
#1889 should address the issues seen when using TheRock builds (712 -> 82); this is something I missed. @matthewdouglas We might want to hold off on the CMake changes until a later date. The PR changes to the backend naming from …
Rework ROCm build to use ROCm version instead of HIP version
Problem
Starting with ROCm 7.x, the HIP SDK version diverged from the ROCm version. The build system relied on `hipconfig --version` to name the shared library (e.g. `libbitsandbytes_rocm71.so`), but `hipconfig` now reports the HIP SDK version rather than the ROCm version, producing incorrectly named libraries that fail to load at runtime.
Additionally, the internal `BUILD_HIP` macro and the user-facing `COMPUTE_BACKEND=hip` naming were inconsistent with the project's ROCm terminology.
Changes
CMake build system (`CMakeLists.txt`, `.github/scripts/build-rocm.sh`):
- Rename `BUILD_HIP` → `BUILD_ROCM` and the `COMPUTE_BACKEND` value from `"hip"` to `"rocm"` throughout.
- Detect the ROCm version from `<rocm_root>/.info/version` (the canonical source) instead of `hipconfig --version`.
- Fall back to `hipconfig` for older installs with a warning about the version mismatch risk.
- If the `hipconfig` fallback fails to obtain a version, stop the build.
- Add `-DROCM_VERSION=<shortcode>` for explicit build-time override (e.g. `-DROCM_VERSION=71`).
- Handle single-digit `-DROCM_VERSION` input gracefully with a warning.
Python runtime (`cextension.py`, `cuda_specs.py`, `diagnostics/`):
- Prefer `torch.version.rocm` (when available) over `torch.version.hip` for accurate version reporting.
- Add the `BNB_ROCM_VERSION` environment variable for runtime library override.
- Validate `BNB_ROCM_VERSION` on CUDA and `BNB_CUDA_VERSION` on ROCm with clear error messages.
- Update `show_environment()` diagnostics.
C++ interface (`pythonInterface.cpp`):
- Rename `BUILD_HIP` guards to `BUILD_ROCM`.
- Add `#include <hip/hip_runtime.h>` under `BUILD_ROCM`.
Docs & CI (`installation.mdx`, `build-rocm.sh`):
- Update `-DCOMPUTE_BACKEND=hip` to `-DCOMPUTE_BACKEND=rocm`.
Tests (`test_cuda_setup_evaluator.py`):
- Update tests to match the new override logic, including `BNB_ROCM_VERSION` on CUDA.
Breaking change
`-DCOMPUTE_BACKEND=hip` is replaced by `-DCOMPUTE_BACKEND=rocm`. Downstream build scripts that pass the old value will need to update.
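For the Python runtime items, a rough sketch of the two behaviours might look like the following. It is an illustration under assumptions, not the code in this PR: a `SimpleNamespace` stands in for `torch.version` so the snippet runs without PyTorch, the version strings are made up, the function names are hypothetical, and raising `RuntimeError` on a mismatched override is an assumed way of producing a "clear error message".

```python
from types import SimpleNamespace
from typing import Mapping, Optional


def reported_rocm_version(torch_version) -> Optional[str]:
    """Prefer torch.version.rocm when present, else fall back to torch.version.hip."""
    rocm = getattr(torch_version, "rocm", None)
    if rocm:
        return rocm
    return getattr(torch_version, "hip", None)


def check_version_overrides(backend: str, env: Mapping[str, str]) -> None:
    """Reject an override that targets the other backend (assumed behaviour)."""
    if backend == "cuda" and env.get("BNB_ROCM_VERSION"):
        raise RuntimeError("BNB_ROCM_VERSION is set, but the active backend is CUDA")
    if backend == "rocm" and env.get("BNB_CUDA_VERSION"):
        raise RuntimeError("BNB_CUDA_VERSION is set, but the active backend is ROCm")


# Stand-ins for torch.version on two hypothetical PyTorch builds.
newer_build = SimpleNamespace(rocm="7.1.0", hip="7.1.25424")
older_build = SimpleNamespace(hip="6.4.43482")

print(reported_rocm_version(newer_build))  # uses the rocm attribute
print(reported_rocm_version(older_build))  # falls back to the hip attribute
```

Using `getattr` with a default keeps the lookup safe on builds where the `rocm` attribute does not exist.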