@sspintel Thanks for the example. Could you please explain why we need this? Can you share the improvement achieved on the BMG/Xe2 platform?
43a06f2 to 3578f8e
Pull Request Overview
This PR adds a comprehensive CuTe Matrix Transpose tutorial, based on Colfax's article, demonstrating various transpose implementation strategies for Intel GPUs using SYCL and CuTe abstractions.
Key changes:
- Implements multiple transpose kernels (naive, SMEM-based, block 2D) with performance benchmarking
- Adds utility functions for random data generation and validation
- Fixes SYCL compatibility issues in platform headers
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| include/cutlass/platform/platform.h | Adds lowest() method to numeric_limits<float> for proper float range support |
| include/cute/util/compat/traits.hpp | Fixes SYCL item template to use correct non-offset variant |
| examples/cute/tutorial/transpose/util.h | Provides benchmarking and validation utilities for transpose operations |
| examples/cute/tutorial/transpose/transpose_sycl.cpp | Implements reference SYCL transpose kernels for comparison |
| examples/cute/tutorial/transpose/transpose_smem.h | Implements CuTe SMEM-based transpose with optional swizzling |
| examples/cute/tutorial/transpose/transpose_naive.h | Implements naive CuTe transpose without SMEM |
| examples/cute/tutorial/transpose/main.cpp | Entry point that benchmarks all transpose implementations |
| examples/cute/tutorial/transpose/copy_smem.h | Implements SMEM copy kernel as baseline |
| examples/cute/tutorial/transpose/copy_direct.h | Implements direct GMEM copy kernel as baseline |
| examples/cute/tutorial/transpose/block_2d_transposed_copy.h | Implements Intel Xe block 2D transposed load operations |
| examples/cute/tutorial/CMakeLists.txt | Registers new transpose tutorial executable |
| examples/common/sycl_cute_common.hpp | Adds std::vector overloads for random_fill and zero_fill |
CUTLASS_HOST_DEVICE
static constexpr float infinity() noexcept { return bit_cast<float, int32_t>(0x7f800000);}
CUTLASS_HOST_DEVICE
static constexpr float lowest() noexcept { return -bit_cast<float, int32_t>(0x7f7fffff) - 1;}
The computation of lowest() is incorrect. The expression -bit_cast<float, int32_t>(0x7f7fffff) - 1 applies negation and subtraction to the bit-casted float, not to the integer before casting. It should be bit_cast<float, int32_t>(0xff7fffff) to represent the IEEE 754 bit pattern for the most negative finite float.
Suggested change:
- static constexpr float lowest() noexcept { return -bit_cast<float, int32_t>(0x7f7fffff) - 1;}
+ static constexpr float lowest() noexcept { return bit_cast<float, int32_t>(0xff7fffff);}
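The bit pattern in the suggestion can be checked on the host against the standard library. The following is a minimal sketch, assuming a 32-bit IEEE 754 `float`; `float_from_bits`, `bits_from_float`, and `lowest_pattern_matches_std` are hypothetical helper names for illustration and are not part of CUTLASS.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper (not a CUTLASS API): reinterpret a 32-bit pattern as a float.
// std::memcpy is used instead of std::bit_cast so the sketch also builds before C++20.
inline float float_from_bits(std::uint32_t bits) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");
  float value;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}

// Hypothetical helper: recover the raw bit pattern of a float.
inline std::uint32_t bits_from_float(float value) {
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}

// Returns true when 0xff7fffff decodes to the most negative finite float.
// Assumption: float is IEEE 754 binary32 on the target platform.
inline bool lowest_pattern_matches_std() {
  return float_from_bits(0xff7fffffu) == std::numeric_limits<float>::lowest();
}
```

The pattern 0xff7fffff is the pattern of the largest finite float, 0x7f7fffff, with the sign bit set, which is why the two values differ only in sign.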
auto transpose_function = make_layout(tensor_shape_S, LayoutRight{});
for (size_t i = 0; i < h_D.size(); ++i)
  if (h_D[i] != h_S[transpose_function(i)])
    bad++;
The transpose validation logic is incorrect. transpose_function(i) with a LayoutRight layout on tensor_shape_S will compute the same linear index i, not the transposed index. For proper transpose validation, you need to convert linear index i to (row, col) in the destination, swap to (col, row), then compute the linear index in the source.
Suggested change:
- auto transpose_function = make_layout(tensor_shape_S, LayoutRight{});
- for (size_t i = 0; i < h_D.size(); ++i)
-   if (h_D[i] != h_S[transpose_function(i)])
-     bad++;
+ // Correct validation: map destination index to source index via transpose
+ for (size_t i = 0; i < h_D.size(); ++i) {
+   // Destination shape: N x M, so row = i / M, col = i % M
+   size_t row = i / M;
+   size_t col = i % M;
+   // Source shape: M x N, so source index = col * N + row
+   size_t source_index = col * N + row;
+   if (h_D[i] != h_S[source_index])
+     bad++;
+ }
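The index arithmetic in the suggestion can be exercised on the host without CuTe. The sketch below assumes both matrices are stored row-major (source `M x N`, destination `N x M`); `reference_transpose` and `count_transpose_mismatches` are hypothetical names introduced for illustration and do not appear in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: transpose a row-major M x N matrix into a row-major N x M matrix.
template <class T>
std::vector<T> reference_transpose(const std::vector<T>& src, std::size_t M, std::size_t N) {
  assert(src.size() == M * N);
  std::vector<T> dst(src.size());
  for (std::size_t r = 0; r < M; ++r)
    for (std::size_t c = 0; c < N; ++c)
      dst[c * M + r] = src[r * N + c];
  return dst;
}

// Hypothetical helper: count elements of dst (N x M, row-major) that differ from
// the transpose of src (M x N, row-major), using the mapping from the suggestion.
template <class T>
std::size_t count_transpose_mismatches(const std::vector<T>& src, const std::vector<T>& dst,
                                       std::size_t M, std::size_t N) {
  assert(src.size() == M * N && dst.size() == M * N);
  std::size_t bad = 0;
  for (std::size_t i = 0; i < dst.size(); ++i) {
    std::size_t row = i / M;                 // destination row, in [0, N)
    std::size_t col = i % M;                 // destination column, in [0, M)
    std::size_t source_index = col * N + row; // element (col, row) of the source
    if (dst[i] != src[source_index]) ++bad;
  }
  return bad;
}
```

For a 2 x 3 source holding 0 through 5, the transposed 3 x 2 buffer is `{0, 3, 1, 4, 2, 5}` and the mismatch count is zero; corrupting one destination element raises the count to one.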
constexpr size_t numIters = 100;

typedef unsigned int uint;
Using typedef for uint is outdated C++ style. Replace with using uint = unsigned int; for consistency with modern C++ conventions.
Suggested change:
- typedef unsigned int uint;
+ using uint = unsigned int;
Hi @Antonyvance, the PR is still a WIP and is meant to be an intro to memory access patterns in CuTe (similar to the Colfax article in the description). The tutorial has samples for gmem copy, naive transpose, smem copy and transpose, etc., that demonstrate how one can avoid strided accesses to and from gmem and use smem for the strided accesses instead. The main code runs these separate kernels and benchmarks the effective bandwidth each one achieves. I will also explore other CuTe concepts such as swizzling to avoid bank conflicts, tv-layouts, sgv-layout, and how to use block 2D copy atoms to achieve copy/transpose. Overall, this is a learning exercise for me and might also serve as an introduction for others learning CuTe. I will add a detailed README at the end when I am done with the implementation.
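As a rough host-side analogy for the smem approach described above, the sketch below stages each tile in a small local buffer so that both the source reads and the destination writes walk contiguous memory, and only the staging buffer is accessed with a stride. It is not the PR's kernel: `kTile`, `tiled_transpose`, and `effective_bandwidth_gbs` are hypothetical names, and the bandwidth formula (each element read once and written once) is a common definition assumed here rather than taken from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 32;  // hypothetical tile edge, chosen for illustration

// Transposes a row-major M x N matrix into a row-major N x M matrix tile by tile.
inline void tiled_transpose(const std::vector<float>& src, std::vector<float>& dst,
                            std::size_t M, std::size_t N) {
  assert(src.size() == M * N && dst.size() == M * N);
  float tile[kTile][kTile];  // host stand-in for shared local memory
  for (std::size_t r0 = 0; r0 < M; r0 += kTile) {
    for (std::size_t c0 = 0; c0 < N; c0 += kTile) {
      const std::size_t rows = std::min(kTile, M - r0);
      const std::size_t cols = std::min(kTile, N - c0);
      // Load: contiguous reads along each source row.
      for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
          tile[r][c] = src[(r0 + r) * N + (c0 + c)];
      // Store: contiguous writes along each destination row;
      // the strided access pattern is confined to the staging tile.
      for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
          dst[(c0 + c) * M + (r0 + r)] = tile[r][c];
    }
  }
}

// Effective bandwidth in GB/s, assuming every element is read once and written once.
inline double effective_bandwidth_gbs(std::size_t M, std::size_t N,
                                      std::size_t elem_bytes, double seconds) {
  return 2.0 * static_cast<double>(M) * static_cast<double>(N) *
         static_cast<double>(elem_bytes) / seconds / 1e9;
}
```

Under this definition, a 1024 x 1024 `float` transpose that takes one millisecond moves about 8.39 GB/s.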
We have refined the CuTe API implementation. If you still need this PR, please update it based on the latest source code and submit a new PR for review.
Based on Colfax's article: https://research.colfax-intl.com/tutorial-matrix-transpose-in-cutlass/