
[ROCm] Replace compile-time warp size with runtime query in host code #1885

Merged
matthewdouglas merged 8 commits into bitsandbytes-foundation:main from sstamenk:fix/rocm-runtime-warp-size
Mar 5, 2026
Conversation

@sstamenk
Contributor

@sstamenk sstamenk commented Mar 4, 2026

Summary

  • Replace the compile-time BNB_WARP_SIZE macro with a runtime bnb_host_warp_size() query in all host-side dispatch code. The macro is only correctly defined during device-code compilation passes and was silently wrong in host code on ROCm 7.x (which removed __AMDGCN_WAVEFRONT_SIZE). On RDNA, which uses a warp size of 32, this logic was defaulting to 64.
  • Fix the blocksize=64 4-bit quantization dispatch on HIP to select the correct kernel based on the actual device warp size, instead of unconditionally using kQuantizeBlockwiseSmall with a hardcoded 64-thread launch.
  • Improve the BNB_WARP_SIZE fallback chain in common.cuh for ROCm 7.0+ where __AMDGCN_WAVEFRONT_SIZE is no longer emitted: fall back to __GFX9__ (CDNA = 64), then default to 32 (RDNA).

Problem

On RDNA GPUs (gfx10xx/gfx11xx/gfx12xx, wavefront size 32), the old default of BNB_WARP_SIZE = 64 caused two classes of failures:

  1. blocksize=32 crash (Fatal Python error: Aborted): kQuantizeBlockwiseSmall was compiled with THREADS=64 but launched with only 32 threads, causing CUB cooperative operations to abort.
  2. blocksize=64 garbage output (mean error ~0.73 vs threshold 0.11): After fixing the macro default to 32, kQuantizeBlockwiseSmall now compiled with THREADS=32 but was still launched with 64 threads (hardcoded in the blocksize=64 HIP dispatch path), producing corrupted quantization output. Unit tests test_4bit_compressed_stats and test_4bit_quant with blocksize=64 were broken on RDNA as a result.

The root issue is that BNB_WARP_SIZE is architecture-specific and only valid inside device code, but was being used in host-side kernel launch configuration.

Changes

csrc/common.cuh — Improved warp size fallback for device code:

  • __AMDGCN_WAVEFRONT_SIZE (preferred, but removed in ROCm 7.0)
  • __GFX9__ → 64 (CDNA)
  • Default → 32 (RDNA and others)

csrc/ops.cu — Runtime warp size for host code:

  • Added bnb_host_warp_size(): queries device 0 with hipDeviceAttributeWarpSize. Returns 32 on CUDA.
  • gemm_4bit_inference_naive: replaced BNB_WARP_SIZE == 64 with bnb_host_warp_size() == 64.
  • quantizeBlockwise for blocksize=64 on HIP: dispatch to kQuantizeBlockwiseSmall only when runtime warp size is 64 (CDNA); fall through to kQuantizeBlockwise<64, 2> otherwise (RDNA, same as CUDA path).

Test plan

  • Verify all relevant unit tests pass on RDNA
  • Confirm no unit test regression on gfx9xx

@sstamenk sstamenk changed the title from "Replace compile-time warp size with runtime query in host code" to "[ROCM] Replace compile-time warp size with runtime query in host code" on Mar 4, 2026
@sstamenk sstamenk changed the title from "[ROCM] Replace compile-time warp size with runtime query in host code" to "[ROCm] Replace compile-time warp size with runtime query in host code" on Mar 4, 2026
sstamenk added 2 commits March 4, 2026 14:34
Add bnb_host_warp_size() that queries hipDeviceGetAttribute at runtime
with per-device caching (up to 32 GPUs), replacing the compile-time
BNB_WARP_SIZE macro in host-side dispatch. This fixes the incorrect
default of warp size 64 on RDNA and ensures kernels are dispatched
with the proper launch parameters.
@sstamenk sstamenk force-pushed the fix/rocm-runtime-warp-size branch from b6d47ab to 83892a5 on March 4, 2026 13:38

@matthewdouglas matthewdouglas added this to the v0.50.0 milestone Mar 4, 2026
@matthewdouglas
Member

Thanks! Looks good apart from minor lint issue. @Abdennacer-Badaoui do you have any feedback?

matthewdouglas
matthewdouglas previously approved these changes Mar 4, 2026
@sstamenk
Contributor Author

sstamenk commented Mar 4, 2026

@matthewdouglas I have a follow up to this at #1887
Would highly appreciate feedback from @Abdennacer-Badaoui there as well considering he originally implemented the kernel.

@Abdennacer-Badaoui
Member

Looks good to me. Thanks @sstamenk !
I will look at your follow up PR now.

@Abdennacer-Badaoui
Member

LGTM! The separation of compile-time BNB_WARP_SIZE (for device code) and runtime bnb_host_warp_size() (for host-side launch config) is the right approach for correct RDNA/CDNA dispatch.
Minor nit: the static warp_size cache has a benign data race (could use std::atomic) but not a blocker.

@Abdennacer-Badaoui
Member

can you fix the linting @sstamenk , thanks!

@sstamenk
Contributor Author

sstamenk commented Mar 5, 2026

Good catch, I've changed it to use atomics.
The worst case I can see is that the HIP query runs more than once, which shouldn't really cause problems since we always query device 0, but I'm not opposed to guarding it with atomic reads/writes.

@matthewdouglas matthewdouglas merged commit 373f23b into bitsandbytes-foundation:main Mar 5, 2026
91 checks passed