[ROCm] Rework ROCm build to use ROCm version instead of HIP version #1888

Open
sstamenk wants to merge 5 commits into bitsandbytes-foundation:main from sstamenk:fix/rocm-build-rework

sstamenk commented Mar 4, 2026

Rework ROCm build to use ROCm version instead of HIP version

Problem

Starting with ROCm 7.x, the HIP SDK version diverged from the ROCm version. The build system relied on hipconfig --version to name the shared library (e.g. libbitsandbytes_rocm71.so), but hipconfig now reports the HIP SDK version rather than the ROCm version, producing incorrectly named libraries that fail to load at runtime.

Additionally, the internal BUILD_HIP macro and the user-facing COMPUTE_BACKEND=hip naming were inconsistent with the project's ROCm terminology.

Changes

CMake build system (CMakeLists.txt, .github/scripts/build-rocm.sh):

  • Rename BUILD_HIP → BUILD_ROCM and the COMPUTE_BACKEND value from "hip" to "rocm" throughout.
  • Detect the ROCm version from <rocm_root>/.info/version (the canonical source) instead of hipconfig --version.
  • Fall back to hipconfig for older installs with a warning about the version mismatch risk.
  • If the hipconfig fallback also fails to obtain a version, stop the build.
  • Add -DROCM_VERSION=<shortcode> for explicit build-time override (e.g. -DROCM_VERSION=71).
  • Handle single-digit -DROCM_VERSION input gracefully with a warning.
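
The detection order in the list above can be paraphrased as a rough Python sketch. The actual logic lives in CMakeLists.txt; the function names here are hypothetical, and the sketch assumes that `.info/version` holds a string such as `7.1.0-42` and that the shortcode is the major and minor digits concatenated (as in the `71` example). Falling through to hipconfig when the file is unparsable is also an assumption of this sketch.

```python
import re
import warnings
from pathlib import Path
from typing import Callable, Optional


def to_shortcode(version: str) -> Optional[str]:
    """Turn a string like '7.1.0-42' into '71' (assumed shortcode format)."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    return f"{match.group(1)}{match.group(2)}" if match else None


def detect_rocm_shortcode(
    rocm_root: Path,
    override: Optional[str] = None,
    hipconfig_version: Callable[[], Optional[str]] = lambda: None,
) -> str:
    """Hypothetical helper mirroring the order: override, .info/version, hipconfig."""
    # 1. An explicit override (the analogue of -DROCM_VERSION) wins.
    if override:
        return override

    # 2. Prefer <rocm_root>/.info/version when it exists and parses.
    info_file = rocm_root / ".info" / "version"
    if info_file.is_file():
        code = to_shortcode(info_file.read_text())
        if code:
            return code

    # 3. Fall back to whatever hipconfig reports, and warn about the risk.
    raw = hipconfig_version()
    code = to_shortcode(raw) if raw else None
    if code:
        warnings.warn(
            "Using the hipconfig version; it may not match the ROCm version."
        )
        return code

    # 4. Nothing worked: stop, as the build does.
    raise RuntimeError("Could not detect a ROCm version; pass an explicit override.")
```

The `hipconfig_version` callable is injected so the sketch does not need a ROCm install to run; a real caller would wrap `hipconfig --version`.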

Python runtime (cextension.py, cuda_specs.py, diagnostics/):

  • Prefer torch.version.rocm (when available) over torch.version.hip for accurate version reporting.
  • Add BNB_ROCM_VERSION environment variable for runtime library override.
  • Improve validation: reject BNB_ROCM_VERSION on CUDA and BNB_CUDA_VERSION on ROCm with clear error messages.
  • Print ROCm version in show_environment() diagnostics.
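
As a minimal sketch of the runtime side, assuming the behavior listed above: the function below is hypothetical and is not the code in cextension.py. It takes any object shaped like `torch.version` plus an environment mapping, so it can be exercised without PyTorch. The `major.minor...` string format and the concatenated shortcode are assumptions.

```python
from types import SimpleNamespace
from typing import Mapping, Optional


def runtime_rocm_shortcode(version_info, env: Mapping[str, str]) -> Optional[str]:
    """Hypothetical helper: pick the ROCm shortcode used to look up the library.

    Returns None on a non-ROCm (e.g. CUDA) build of PyTorch.
    """
    # torch.version.hip is None on CUDA builds and a string on ROCm builds.
    is_rocm = getattr(version_info, "hip", None) is not None

    # Reject overrides that target the wrong backend.
    if is_rocm and env.get("BNB_CUDA_VERSION"):
        raise RuntimeError("BNB_CUDA_VERSION is set, but this is a ROCm build.")
    if not is_rocm and env.get("BNB_ROCM_VERSION"):
        raise RuntimeError("BNB_ROCM_VERSION is set, but this is not a ROCm build.")
    if not is_rocm:
        return None

    # An explicit runtime override wins.
    if env.get("BNB_ROCM_VERSION"):
        return env["BNB_ROCM_VERSION"]

    # Prefer the ROCm version when the attribute exists; otherwise use HIP.
    raw = getattr(version_info, "rocm", None) or version_info.hip
    major, minor = raw.split(".")[:2]  # assumes a 'major.minor...' string
    return f"{major}{minor}"


# Stand-in for torch.version on a ROCm build (the values are made up).
fake_version = SimpleNamespace(hip="6.4.43482", rocm="7.1.0")
print(runtime_rocm_shortcode(fake_version, {}))  # prints 71
```

With PyTorch installed, the call would look like `runtime_rocm_shortcode(torch.version, os.environ)`.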

C++ interface (pythonInterface.cpp):

  • Update all BUILD_HIP guards to BUILD_ROCM.
  • Add missing #include <hip/hip_runtime.h> under BUILD_ROCM.

Docs & CI (installation.mdx, build-rocm.sh):

  • Update build instructions from -DCOMPUTE_BACKEND=hip to -DCOMPUTE_BACKEND=rocm.

Tests (test_cuda_setup_evaluator.py):

  • Add test for rejecting BNB_ROCM_VERSION on CUDA.
  • Update existing tests to match revised override logic and error messages.

Breaking change

-DCOMPUTE_BACKEND=hip is replaced by -DCOMPUTE_BACKEND=rocm. Downstream build scripts that pass the old value will need to update.

sstamenk commented Mar 4, 2026

Providing a table with various scenarios to better understand the logic. The library is named at build time and looked up by name at runtime. If the names don't match, loading fails.

| | Runtime: torch.version.rocm available (new PyTorch) → looks for _rocm71 | Runtime: only torch.version.hip (older PyTorch) → looks for _rocm64 | Runtime: BNB_ROCM_VERSION=71 → looks for _rocm71 |
| --- | --- | --- | --- |
| Build: .info/version exists → builds _rocm71 | ✅ Loads | ❌ Fails - rebuild with -DROCM_VERSION=64 or BNB_ROCM_VERSION=71 | ✅ Loads |
| Build: -DROCM_VERSION=71 → builds _rocm71 | ✅ Loads | ❌ Fails - rebuild with -DROCM_VERSION=64 or BNB_ROCM_VERSION=71 | ✅ Loads |
| Build: hipconfig fallback → builds _rocm64 | ❌ Fails - rebuild with -DROCM_VERSION=71 or run with BNB_ROCM_VERSION=64 | ✅ Loads (both wrong but agree) | ❌ Fails - override points to 71 but lib is 64 |

When both -DROCM_VERSION and BNB_ROCM_VERSION are listed as fixes, BNB_ROCM_VERSION is preferred as it doesn't require a rebuild.
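
The matching rule behind these scenarios is just string equality between the name chosen at build time and the name looked up at runtime. The sketch below is illustrative only: the helper names are hypothetical, the file name pattern follows the `libbitsandbytes_rocm71.so` example from the PR description, and the `71`/`64` values are the example versions used in the scenarios.

```python
def library_name(shortcode: str) -> str:
    """File name for a given ROCm shortcode (pattern taken from the PR example)."""
    return f"libbitsandbytes_rocm{shortcode}.so"


def loads(build_shortcode: str, runtime_shortcode: str) -> bool:
    """Loading succeeds only when the built name equals the looked-up name."""
    return library_name(build_shortcode) == library_name(runtime_shortcode)


# Example shortcodes for each build-time and runtime scenario.
BUILD = {
    ".info/version exists": "71",
    "-DROCM_VERSION=71": "71",
    "hipconfig fallback": "64",
}
RUNTIME = {
    "torch.version.rocm available": "71",
    "only torch.version.hip": "64",
    "BNB_ROCM_VERSION=71": "71",
}

for build_label, built in BUILD.items():
    for runtime_label, wanted in RUNTIME.items():
        status = "loads" if loads(built, wanted) else "fails"
        print(f"{build_label} / {runtime_label}: {status}")
```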

In most cases we will hit the first scenario, and everything will be fine. Going forward, the ROCm version will become even more common. If for some reason there is a version mismatch, the user can always override it with either -DROCM_VERSION or BNB_ROCM_VERSION.

The HIP version fallback can be removed entirely once a few versions have passed and it is no longer needed for compatibility.

Starting with ROCm 7.x the HIP SDK version diverged from the ROCm
version, causing misnamed libraries when relying on `hipconfig --version`.

This commit:
- Renames BUILD_HIP to BUILD_ROCM and COMPUTE_BACKEND "hip" to "rocm"
  throughout CMake, C++, CI scripts, and documentation.
- Detects the ROCm version from <rocm_root>/.info/version (the canonical
  source) and falls back to hipconfig only for older installs, with a
  warning about the version mismatch risk.
- Emits FATAL_ERROR if neither .info/version nor hipconfig can detect
  a ROCm version, with clear instructions to pass -DROCM_VERSION.
- Simplifies HIP architecture target selection: BNB_ROCM_ARCH overrides
  everything, CMAKE_HIP_ARCHITECTURES is respected if already set,
  otherwise the full default target list is used. AMDGPU_TARGETS is no
  longer consulted (it often narrows the list unintentionally).
- Adds -DROCM_VERSION=<shortcode> for explicit build-time override and
  BNB_ROCM_VERSION=<shortcode> for runtime library selection.
- Adds hip/hip_runtime.h include under BUILD_ROCM in pythonInterface.cpp
  to provide HIP type definitions needed by the interface layer.
- Handles single-digit -DROCM_VERSION input gracefully.
- Prefers torch.version.rocm over torch.version.hip in Python for
  accurate version reporting.
- Improves BNB_ROCM_VERSION / BNB_CUDA_VERSION override validation and
  error messages.
- Updates tests to match the new override logic.
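
The architecture selection precedence from the commit message can be sketched in Python as follows. This is a paraphrase of the CMake behavior, not the PR's code: the function name is hypothetical, the default target tuple is a placeholder (the real list is not given here), and treating BNB_ROCM_ARCH as a semicolon-separated list like a CMake list is an assumption.

```python
from typing import List, Optional, Sequence

# Placeholder only; the real default target list is not shown in this PR text.
DEFAULT_TARGETS = ("<default-target-1>", "<default-target-2>")


def select_hip_architectures(
    bnb_rocm_arch: Optional[str],
    cmake_hip_architectures: Optional[str],
    default_targets: Sequence[str] = DEFAULT_TARGETS,
) -> List[str]:
    """Hypothetical helper: BNB_ROCM_ARCH, then CMAKE_HIP_ARCHITECTURES, then defaults."""
    # The first non-empty setting wins; AMDGPU_TARGETS is deliberately not consulted.
    for candidate in (bnb_rocm_arch, cmake_hip_architectures):
        if candidate:
            # CMake lists are semicolon-separated strings.
            return [target for target in candidate.split(";") if target]
    return list(default_targets)
```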
sstamenk force-pushed the fix/rocm-build-rework branch from 91c27f2 to eb72c84 on March 4, 2026 20:50

github-actions bot commented Mar 5, 2026

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@matthewdouglas matthewdouglas added this to the v0.50.0 milestone Mar 5, 2026

sstamenk commented Mar 6, 2026

#1889 should address the issues seen when using TheRock builds (712 -> 82); this is something I had missed.

@matthewdouglas We might want to hold off on the CMake changes until a later date. The PR changes the backend naming from hip to rocm, which requires people to switch to rocm when building from source, and the fallback logic adds a lot of code for not much gain. As long as both the runtime lookup and the build-time lookup check the HIP version, there shouldn't be a naming issue. If you want, I can make a separate PR with only the unit test changes, or reduce this one to just those changes.
