[gfx906] Collection of fixes for MI50/MI60 (non-MFMA) GPUs #3593
Conversation
Problem: DeviceGemmDl crashes on gfx906 when K >= 1472 with small M (M=1 decode case).

Root cause: CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK was disabled by default. Without it, invalid buffer loads execute and crash before bounds checking can prevent them.

Solution:
1. Enable the OOB offset trick (0x80000000) so invalid coordinates safely return zero instead of accessing unmapped memory.
2. Use the full coordinate_has_valid_offset() check instead of the _assuming_visible_index_is_valid variant for proper K bounds validation.

Verified with INT8 GEMM tests: M=1 decode, K=14336, FFN projections.
Problem: When FloatAcc differs from FloatC (e.g., INT8×INT8→INT32 accumulator with FP32 output scaling), the CDE element op is invoked with the wrong storage types.

The element op contract is (E& e, const C& c, const D& d...), where:
- E = FloatC (final output type, e.g., float)
- C = FloatAcc (accumulator type, e.g., int32_t)

The original code used generate_tie(), returning the same c_thread_buf for both E& and C&, which:
1. Violates the element op signature when the types differ
2. Causes compile errors with strictly-typed element ops
3. Results in undefined behavior during ThreadwiseTensorSliceTransfer

Solution: Introduce a separate e_thread_buf<FloatC> for the element op output, pass (E& e) from e_thread_buf and (const C& c) from c_thread_buf, then transfer e_thread_buf to global memory.

The bug has existed since the file was created in December 2022 (PR ROCm#517).
Pull request overview
This PR addresses critical bugs in ComposableKernel's DeviceGemmDl implementation for gfx906 (MI50/MI60) GPUs that lack MFMA instructions. The fixes resolve out-of-bounds memory access crashes and type safety issues affecting INT8 GEMM operations with mixed accumulator and output types.
Changes:
- Enable hardware-assisted OOB protection by default for buffer loads on gfx906
- Fix bounds checking logic to validate all coordinates, preventing invalid memory accesses
- Correct type mismatches in element operations when accumulator type differs from output type
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| include/ck/ck.hpp | Enables offset trick for OOB buffer load protection by default |
| include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v5r1.hpp | Strengthens bounds validation to check all coordinates including visible indices |
| include/ck/tensor_operation/gpu/grid/gridwise_gemm_dl_multiple_d.hpp | Introduces separate buffer for element op output with correct type (FloatC) to fix type mismatch when FloatAcc differs from FloatC |
```cpp
// - c (input): comes from c_thread_buf (FloatAcc type)
// Use tie() to create a tuple of references to different buffers
auto dst_data_refs =
    tie(e_thread_buf(
            Number<c_offset>{}), // E& e (output to FloatC buffer)
        c_thread_buf(
            Number<c_offset>{}) // C& c (input from FloatAcc buffer)
```
Copilot (AI) commented on Jan 19, 2026:
The inline comments identify 'e' as output and 'c' as input, but the element operation signature is '(E& e, const C& c, const D& d...)' where 'c' should be marked as const. The current comment doesn't indicate that 'c' is const, which may confuse readers about whether the element op modifies the accumulator buffer.
Suggested change:

```diff
-// - c (input): comes from c_thread_buf (FloatAcc type)
+// - c (const input): comes from c_thread_buf (FloatAcc type)
 // Use tie() to create a tuple of references to different buffers
 auto dst_data_refs =
     tie(e_thread_buf(
             Number<c_offset>{}), // E& e (output to FloatC buffer)
         c_thread_buf(
-            Number<c_offset>{}) // C& c (input from FloatAcc buffer)
+            Number<c_offset>{}) // const C& c (input from FloatAcc buffer)
```
```cpp
// Initialize C
c_thread_buf.Clear();
```
Copilot (AI) commented on Jan 19, 2026:
With the introduction of e_thread_buf for element op output, consider whether e_thread_buf also needs initialization. If the element op always writes to all elements before the transfer, initialization may not be necessary, but this assumption should be documented or e_thread_buf should be cleared for consistency.
Suggested change:

```diff
-// Initialize C
+// Initialize C and element-op output buffer
 c_thread_buf.Clear();
+e_thread_buf.Clear();
```
Summary
This PR is an aggregation of fixes discovered while working with ComposableKernel on gfx906 (MI50/MI60) GPUs. These GPUs don't have MFMA instructions, so they rely on the `DeviceGemmDl` path, which has some edge cases that aren't well-tested.

Note: This is a draft PR that will be updated as we discover more issues.
Fix 1: Buffer Load OOB Crash with Large K and Small M
Problem
`DeviceGemmDl` crashes on gfx906 when K >= 1472 with small M (e.g., the M=1 decode case in LLM inference). The crash occurs in `gridwise_gemm_dl_v1r3.hpp` during `block_sync_lds()` after an invalid buffer load.

Root Cause

`CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK` was disabled by default (set to 0).

Without the offset trick:

- invalid buffer loads execute and crash before bounds checking can prevent them

With the offset trick enabled:

- `0x80000000` is added to the offset of invalid loads, so they safely return zero instead of accessing unmapped memory
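For intuition, here is a minimal sketch of what the trick does (the function and variable names are invented for illustration; the real logic lives in CK's buffer-load helpers): when a coordinate is out of bounds, the offset handed to the hardware buffer load is pushed past the range described by the buffer resource descriptor, so the load returns zero rather than touching unmapped memory.

```cpp
#include <cstdint>

// Sketch only, not CK's implementation: for an invalid coordinate, 0x80000000
// is added to the byte offset, which places it outside the buffer descriptor's
// range, so the hardware buffer_load returns 0 instead of faulting.
inline uint32_t effective_load_offset(bool coord_is_valid, uint32_t byte_offset)
{
    constexpr uint32_t oob_offset_trick = 0x80000000u;
    return coord_is_valid ? byte_offset : byte_offset + oob_offset_trick;
}
```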
Solution

- `include/ck/ck.hpp`: Enable `CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK` by default
- `include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer_v5r1.hpp`: Use `coordinate_has_valid_offset()` instead of `coordinate_has_valid_offset_assuming_visible_index_is_valid()` for full bounds validation
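A sketch of the `ck.hpp` side of the change (the exact guard and surrounding context in the real header may differ):

```cpp
// ck.hpp (sketch): enable the offset trick by default so out-of-bounds buffer
// loads are redirected to an out-of-range offset instead of crashing.
#ifndef CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK
#define CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK 1
#endif
```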
Verification

INT8 GEMM tests pass for M=1 decode, K=14336, and FFN projections.
Fix 2: GridwiseGemmDlMultipleD Element Op Type Mismatch (FloatAcc != FloatC)
Problem
When `FloatAcc` differs from `FloatC` (e.g., INT8×INT8→INT32 accumulator with FP32 output scaling), the CDE element op is invoked with the wrong storage types.

The element op contract is `(E& e, const C& c, const D& d...)`, where:

- `E` = `FloatC` (final output type, e.g., `float`)
- `C` = `FloatAcc` (accumulator type, e.g., `int32_t`)
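For illustration, a hypothetical strictly-typed element op that satisfies this contract for the INT8×INT8→INT32 case could look like the following (the struct name and scaling logic are invented for the example; real CK ops are additionally marked `__host__ __device__`):

```cpp
#include <cstdint>

// Hypothetical CDE element op matching (E& e, const C& c, const D& d...):
// reads the int32_t accumulator (C = FloatAcc) and a float scale (D),
// and writes the float result (E = FloatC).
struct ScaleInt32ToFloat
{
    void operator()(float& e, const int32_t& c, const float& d) const
    {
        e = static_cast<float>(c) * d;
    }
};
```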
Root Cause

The original code at lines 615-618 used `generate_tie()`, returning the same `c_thread_buf` for both `E&` and `C&`. This causes:

- compile errors when `FloatAcc != FloatC` (the element op expects `float&` for `e` but gets `int32_t&`)
- undefined behavior during `ThreadwiseTensorSliceTransfer`, which type-puns `FloatAcc` bits as `FloatC`

This bug has existed since the file was created in December 2022 (PR #517).
Solution
In `include/ck/tensor_operation/gpu/grid/gridwise_gemm_dl_multiple_d.hpp`:

- Introduce a separate `e_thread_buf<FloatC>` for the element op output
- Pass `(E& e)` from `e_thread_buf` and `(const C& c)` from `c_thread_buf` using `tie()`
- Transfer `e_thread_buf` (not `c_thread_buf`) to global memory
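A reduced, host-side sketch of the fixed data flow (not the kernel code: CK's thread buffers are replaced by plain arrays, and the element op is invented for the example):

```cpp
#include <cstdint>

// Strictly-typed element op: expects float& for e, so the old wiring, which
// passed the int32_t accumulator buffer for both references, cannot compile.
struct Scale
{
    float scale;
    void operator()(float& e, const int32_t& c) const { e = static_cast<float>(c) * scale; }
};

int main()
{
    int32_t c_thread_buf[4] = {10, 20, 30, 40}; // accumulator values (FloatAcc)
    float   e_thread_buf[4];                    // element-op output (FloatC)

    Scale cde_element_op{0.5f};
    for(int i = 0; i < 4; ++i)
        cde_element_op(e_thread_buf[i], c_thread_buf[i]); // (E& e, const C& c)

    // Only e_thread_buf (not c_thread_buf) would then be transferred to global memory.
    return 0;
}
```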
Minimal Repro

See the original PR #3565 for a compile-time repro that demonstrates the type mismatch.
Environment