
[ROCm] Replace compile-time warp size with runtime query in host code #1885

Merged
matthewdouglas merged 8 commits into bitsandbytes-foundation:main from sstamenk:fix/rocm-runtime-warp-size
Mar 5, 2026
Conversation

@sstamenk
Contributor

@sstamenk sstamenk commented Mar 4, 2026

Summary

  • Replace the compile-time BNB_WARP_SIZE macro with a runtime bnb_host_warp_size() query in all host-side dispatch code. The macro is only correctly defined during device-code compilation passes and was silently wrong in host code on ROCm 7.x (which removed __AMDGCN_WAVEFRONT_SIZE). On RDNA, which uses a warp size of 32, this logic was defaulting to 64.
  • Fix the blocksize=64 4-bit quantization dispatch on HIP to select the correct kernel based on the actual device warp size, instead of unconditionally using kQuantizeBlockwiseSmall with a hardcoded 64-thread launch.
  • Improve the BNB_WARP_SIZE fallback chain in common.cuh for ROCm 7.0+ where __AMDGCN_WAVEFRONT_SIZE is no longer emitted: fall back to __GFX9__ (CDNA = 64), then default to 32 (RDNA).

Problem

On RDNA GPUs (gfx10xx/gfx11xx/gfx12xx, wavefront size 32), the old default of BNB_WARP_SIZE = 64 caused two classes of failures:

  1. blocksize=32 crash (Fatal Python error: Aborted): kQuantizeBlockwiseSmall was compiled with THREADS=64 but launched with only 32 threads, causing CUB cooperative operations to abort.
  2. blocksize=64 garbage output (mean error ~0.73 vs threshold 0.11): After fixing the macro default to 32, kQuantizeBlockwiseSmall now compiled with THREADS=32 but was still launched with 64 threads (hardcoded in the blocksize=64 HIP dispatch path), producing corrupted quantization output. Unit tests test_4bit_compressed_stats and test_4bit_quant with blocksize=64 were broken on RDNA as a result.

The root issue is that BNB_WARP_SIZE is architecture-specific and only valid inside device code, but was being used in host-side kernel launch configuration.

Changes

csrc/common.cuh — Improved warp size fallback for device code:

  • __AMDGCN_WAVEFRONT_SIZE (preferred, but removed in ROCm 7.0)
  • __GFX9__ → 64 (CDNA)
  • Default → 32 (RDNA and others)

csrc/ops.cu — Runtime warp size for host code:

  • Added bnb_host_warp_size(): queries device 0 with hipDeviceAttributeWarpSize. Returns 32 on CUDA.
  • gemm_4bit_inference_naive: replaced BNB_WARP_SIZE == 64 with bnb_host_warp_size() == 64.
  • quantizeBlockwise for blocksize=64 on HIP: dispatch to kQuantizeBlockwiseSmall only when runtime warp size is 64 (CDNA); fall through to kQuantizeBlockwise<64, 2> otherwise (RDNA, same as CUDA path).

Test plan

  • Verify all relevant unit tests pass on RDNA
  • Confirm no unit test regression on gfx9xx

@sstamenk sstamenk changed the title from "Replace compile-time warp size with runtime query in host code" to "[ROCM] Replace compile-time warp size with runtime query in host code" on Mar 4, 2026
@sstamenk sstamenk changed the title from "[ROCM] Replace compile-time warp size with runtime query in host code" to "[ROCm] Replace compile-time warp size with runtime query in host code" on Mar 4, 2026
sstamenk added 2 commits March 4, 2026 14:34
Add bnb_host_warp_size() that queries hipDeviceGetAttribute at runtime
with per-device caching (up to 32 GPUs), replacing the compile-time
BNB_WARP_SIZE macro in host-side dispatch. This fixes the incorrect
default of warp size 64 on RDNA and ensures kernels are dispatched
with the proper launch parameters.
@sstamenk sstamenk force-pushed the fix/rocm-runtime-warp-size branch from b6d47ab to 83892a5 on March 4, 2026 13:38

@matthewdouglas matthewdouglas added this to the v0.50.0 milestone Mar 4, 2026
@matthewdouglas
Member

Thanks! Looks good apart from minor lint issue. @Abdennacer-Badaoui do you have any feedback?

matthewdouglas
matthewdouglas previously approved these changes Mar 4, 2026
@sstamenk
Contributor Author

sstamenk commented Mar 4, 2026

@matthewdouglas I have a follow up to this at #1887
Would highly appreciate feedback from @Abdennacer-Badaoui there as well considering he originally implemented the kernel.

@Abdennacer-Badaoui
Member

Looks good to me. Thanks @sstamenk !
I will look at your follow up PR now.

@Abdennacer-Badaoui
Member

LGTM! The separation of compile-time BNB_WARP_SIZE (for device code) and runtime bnb_host_warp_size() (for host-side launch config) is the right approach for correct RDNA/CDNA dispatch.
Minor nit: the static warp_size cache has a benign data race (could use std::atomic) but not a blocker.

@Abdennacer-Badaoui
Member

can you fix the linting @sstamenk , thanks!

@sstamenk
Contributor Author

sstamenk commented Mar 5, 2026

Good catch, I've changed it to use atomics.
The worst case I can see is that the HIP query runs more than once, which shouldn't really cause problems since we always query device 0, but I'm not opposed to guarding it with atomic reads/writes.

@matthewdouglas matthewdouglas merged commit 373f23b into bitsandbytes-foundation:main Mar 5, 2026
91 checks passed