[ROCm] Enable blocksize 32 4-bit quantization and GEMV kernels on AMD CDNA #1887
Changes from all commits: 3c19b4c, d4ed073, e3b9c46, 44fad6c, 755d1af, c718184, 4fb8635, 6812b96
```diff
@@ -54,30 +54,20 @@ void quantizeBlockwise(
     else if (blocksize == 128)
         kQuantizeBlockwise<T, 128, 2, 0, DATA_TYPE><<<num_blocks, 64>>>(code, A, absmax, out, rand, rand_offset, n);
     else if (blocksize == 64) {
 #if BNB_HIP
         if constexpr (DATA_TYPE > 0) {
-            if (bnb_host_warp_size() == 64) {
-                // CDNA: kQuantizeBlockwiseSmall is compiled with THREADS=64
-                kQuantizeBlockwiseSmall<T, DATA_TYPE>
-                    <<<(num_blocks + 1) / 2, 64>>>(code, A, absmax, out, rand, rand_offset, n);
-            } else {
-                // RDNA: standard kernel (same as CUDA path)
-                kQuantizeBlockwise<T, 64, 2, 0, DATA_TYPE>
-                    <<<num_blocks, 32>>>(code, A, absmax, out, rand, rand_offset, n);
-            }
+            const int ws = bnb_host_warp_size();
+            const int num_qb = ws / (64 / 2);
+            int grid = (num_blocks + num_qb - 1) / num_qb;
+            kQuantizeBlockwiseSmall<T, 64, DATA_TYPE><<<grid, ws>>>(code, A, absmax, out, rand, rand_offset, n);
         } else {
             kQuantizeBlockwise<T, 64, 2, 0, DATA_TYPE><<<num_blocks, 32>>>(code, A, absmax, out, rand, rand_offset, n);
         }
 #else
         kQuantizeBlockwise<T, 64, 2, 0, DATA_TYPE><<<num_blocks, 32>>>(code, A, absmax, out, rand, rand_offset, n);
 #endif
     } else if (blocksize == 32) {
-        // For 4-bit: use specialized kernel that processes 2 blocks per warp
-        // Each CUDA block handles 2 quantization blocks, so divide num_blocks by 2
         if constexpr (DATA_TYPE > 0) {
-            int num_blocks_adjusted = (num_blocks + 1) / 2;
-            kQuantizeBlockwiseSmall<T, DATA_TYPE>
-                <<<num_blocks_adjusted, 32>>>(code, A, absmax, out, rand, rand_offset, n);
+            const int ws = bnb_host_warp_size();
+            const int num_qb = ws / (32 / 2);
+            int grid = (num_blocks + num_qb - 1) / num_qb;
+            kQuantizeBlockwiseSmall<T, 32, DATA_TYPE><<<grid, ws>>>(code, A, absmax, out, rand, rand_offset, n);
         }
     }
```

**Contributor (Author):** Observation: with the General8bit data type, `kQuantizeBlockwise` only utilizes half of the wavefront on CDNA, since it launches 32 threads and CDNA wavefronts are 64 wide.

**Member:** Yes, we can discuss this further in another PR. For now, let's focus on merging this one.
This would mean that for CUDA with blocksize = 64 and 4-bit dtypes, we would use `kQuantizeBlockwiseSmall`, while it should use `kQuantizeBlockwise` directly.
IMO, on CUDA, blocksize = 64 should still use the regular `kQuantizeBlockwise` kernel. Wdyt @matthewdouglas
Yes, I did this intentionally to minimize the kernel selection logic, since I believe there shouldn't be much of a performance difference between the two, but I haven't tested this explicitly. Alternatively, for blocksize = 64 we can use kQuantizeBlockwiseSmall on CDNA (ws = 64) and the standard kQuantizeBlockwise on CUDA/RDNA (ws = 32).
Yeah, me too! Your way of writing it is definitely better for code readability. But I'm not sure how much impact there'd be going from kQuantizeBlockwise to kQuantizeBlockwiseSmall. I think it's negligible, but I'm not entirely sure.
We can do some benchmarking/profiling of the overhead between kQuantizeBlockwise and kQuantizeBlockwiseSmall on RDNA and CUDA with blocksize == 64 to confirm.
Yes, can you do that please?
They are about the same here.
@matthewdouglas It basically does 200 iterations of `quantize_4bit` and captures the time taken with `torch.cuda.Event`.
@Abdennacer-Badaoui Same here, on an RTX 5090 the results are within the margin of error:
@matthewdouglas Data with larger N:
Quantize
Dequantize (same kernel on both branches)
Nice!
LGTM! I think we can keep your changes. Thanks for working on this!