Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion include/ck/ck.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -186,8 +186,10 @@
// This (ifndef) is a hack to use customized behavior for buffer load rather than using default
// setting. Don't use this hack unless absolutely necessary!
// FIXME: make the behavior of buffer load a configurable (template) parameter for each usage
// FIX: Enable offset trick to prevent invalid loads from crashing on gfx906/MI50
// Without this, invalid loads still execute and crash despite bounds checking
#ifndef CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK
#define CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK 0
#define CK_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK 1
#endif
#define CK_EXPERIMENTAL_USE_BUFFER_STORE_OOB_CHECK_OFFSET_TRICK 1
#define CK_EXPERIMENTAL_USE_BUFFER_ATOMIC_ADD_OOB_CHECK_OFFSET_TRICK 1
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -418,6 +418,12 @@ struct GridwiseGemmDlMultipleD_km_kn_mn
auto c_thread_buf = make_static_buffer<AddressSpaceEnum::Vgpr, FloatAcc>(
c_thread_desc_m10_m11_n10_n11.GetElementSpaceSize());

// FIX: Separate buffer for element op output with proper type (FloatC)
// This is needed when FloatAcc (e.g., int32) differs from FloatC (e.g., float)
// The element op e = static_cast<E>(c) * d expects to write to FloatC, not FloatAcc
auto e_thread_buf = make_static_buffer<AddressSpaceEnum::Vgpr, FloatC>(
c_thread_desc_m10_m11_n10_n11.GetElementSpaceSize());

// Initialize C
c_thread_buf.Clear();
Comment on lines 427 to 428
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the introduction of e_thread_buf for element op output, consider whether e_thread_buf also needs initialization. If the element op always writes to all elements before the transfer, initialization may not be necessary, but this assumption should be documented or e_thread_buf should be cleared for consistency.

Suggested change
// Initialize C
c_thread_buf.Clear();
// Initialize C and element-op output buffer
c_thread_buf.Clear();
e_thread_buf.Clear();

Copilot uses AI. Check for mistakes.

Expand Down Expand Up @@ -621,13 +627,22 @@ struct GridwiseGemmDlMultipleD_km_kn_mn
Number<NumDTensor>{});

// get reference to dst data
// FIX: Use e_thread_buf (FloatC) for output, c_thread_buf
// (FloatAcc) for input. This fixes type mismatch when FloatAcc
// (int32) != FloatC (float)
constexpr index_t c_offset =
c_thread_desc_m0_m10_m11_n0_n10_n11.CalculateOffset(
make_tuple(0, m10, m11, 0, n10, i));
auto dst_data_refs = generate_tie(
// return type should be lvalue
[&](auto) -> auto& { return c_thread_buf(Number<c_offset>{}); },
Number<2>{});
// Element op signature: (E& e, const C& c, const D& d)
// - e (output): goes to e_thread_buf (FloatC type)
// - c (input): comes from c_thread_buf (FloatAcc type)
// Use tie() to create a tuple of references to different buffers
auto dst_data_refs =
tie(e_thread_buf(
Number<c_offset>{}), // E& e (output to FloatC buffer)
c_thread_buf(
Number<c_offset>{}) // C& c (input from FloatAcc buffer)
Comment on lines +638 to +644
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inline comments identify 'e' as output and 'c' as input, but the element operation signature is '(E& e, const C& c, const D& d...)' where 'c' should be marked as const. The current comment doesn't indicate that 'c' is const, which may confuse readers about whether the element op modifies the accumulator buffer.

Suggested change
// - c (input): comes from c_thread_buf (FloatAcc type)
// Use tie() to create a tuple of references to different buffers
auto dst_data_refs =
tie(e_thread_buf(
Number<c_offset>{}), // E& e (output to FloatC buffer)
c_thread_buf(
Number<c_offset>{}) // C& c (input from FloatAcc buffer)
// - c (const input): comes from c_thread_buf (FloatAcc type)
// Use tie() to create a tuple of references to different buffers
auto dst_data_refs =
tie(e_thread_buf(
Number<c_offset>{}), // E& e (output to FloatC buffer)
c_thread_buf(
Number<c_offset>{}) // const C& c (input from FloatAcc buffer)

Copilot uses AI. Check for mistakes.
);

unpack2(cde_element_op, dst_data_refs, src_data_refs);
});
Expand All @@ -653,8 +668,10 @@ struct GridwiseGemmDlMultipleD_km_kn_mn
});
});

// FIX: Transfer from e_thread_buf (FloatC) instead of c_thread_buf (FloatAcc)
// since element op output is now stored in e_thread_buf with proper type
ThreadwiseTensorSliceTransfer_v1r3<
FloatAcc,
FloatC, // FIX: Source is now FloatC (e_thread_buf)
FloatC,
decltype(c_thread_desc_m0_m10_m11_n0_n10_n11),
decltype(c_grid_desc_m0_m10_m11_n0_n10_n11),
Expand All @@ -680,7 +697,7 @@ struct GridwiseGemmDlMultipleD_km_kn_mn
ck::tensor_operation::element_wise::PassThrough{}}
.Run(c_thread_desc_m0_m10_m11_n0_n10_n11,
make_tuple(I0, I0, I0, I0, I0, I0),
c_thread_buf,
e_thread_buf, // FIX: Use e_thread_buf instead of c_thread_buf
c_grid_desc_m0_m10_m11_n0_n10_n11,
c_grid_buf);
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -177,8 +177,10 @@ struct ThreadwiseTensorSliceTransfer_v5r1

using src_vector_t = typename decltype(src_vector)::type;

// FIX: Use full bounds check including visible index to prevent OOB access
// when K0 coordinate exceeds tensor bounds
const bool is_src_valid =
coordinate_has_valid_offset_assuming_visible_index_is_valid(src_desc, src_coord_);
coordinate_has_valid_offset(src_desc, src_coord_);

// copy data from src_buf to src_vector
src_vector.template AsType<src_vector_t>()(I0) =
Expand Down