Add CuTe Matrix Transpose tutorial by sspintel · Pull Request #562 · intel/sycl-tla

sspintel · 2025-10-15T14:44:38Z

Based on Colfax's article: https://research.colfax-intl.com/tutorial-matrix-transpose-in-cutlass/

Antonyvance · 2025-10-17T21:16:44Z

@sspintel Thanks for the example. Could you please explain why we need this? Can you share the improvement achieved on the BMG/Xe2 platform?

Copilot

Pull Request Overview

This PR adds a comprehensive CuTe Matrix Transpose tutorial, based on Colfax's article, demonstrating various transpose implementation strategies for Intel GPUs using SYCL and CuTe abstractions.

Key changes:

- Implements multiple transpose kernels (naive, SMEM-based, block 2D) with performance benchmarking
- Adds utility functions for random data generation and validation
- Fixes SYCL compatibility issues in platform headers

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| include/cutlass/platform/platform.h | Adds `lowest()` method to `numeric_limits<float>` for proper float range support |
| include/cute/util/compat/traits.hpp | Fixes SYCL item template to use correct non-offset variant |
| examples/cute/tutorial/transpose/util.h | Provides benchmarking and validation utilities for transpose operations |
| examples/cute/tutorial/transpose/transpose_sycl.cpp | Implements reference SYCL transpose kernels for comparison |
| examples/cute/tutorial/transpose/transpose_smem.h | Implements CuTe SMEM-based transpose with optional swizzling |
| examples/cute/tutorial/transpose/transpose_naive.h | Implements naive CuTe transpose without SMEM |
| examples/cute/tutorial/transpose/main.cpp | Entry point that benchmarks all transpose implementations |
| examples/cute/tutorial/transpose/copy_smem.h | Implements SMEM copy kernel as baseline |
| examples/cute/tutorial/transpose/copy_direct.h | Implements direct GMEM copy kernel as baseline |
| examples/cute/tutorial/transpose/block_2d_transposed_copy.h | Implements Intel Xe block 2D transposed load operations |
| examples/cute/tutorial/CMakeLists.txt | Registers new transpose tutorial executable |
| examples/common/sycl_cute_common.hpp | Adds `std::vector` overloads for `random_fill` and `zero_fill` |

Copilot · 2025-11-05T07:02:32Z

  CUTLASS_HOST_DEVICE
  static constexpr float infinity() noexcept { return bit_cast<float, int32_t>(0x7f800000);}
  CUTLASS_HOST_DEVICE
+  static constexpr float lowest() noexcept { return -bit_cast<float, int32_t>(0x7f7fffff) - 1;}


The computation of `lowest()` is incorrect. The expression `-bit_cast<float, int32_t>(0x7f7fffff) - 1` applies negation and subtraction to the bit-casted float, not to the integer before casting. It should be `bit_cast<float, int32_t>(0xff7fffff)` to represent the IEEE 754 bit pattern for the most negative finite float.

Suggested change

-  static constexpr float lowest() noexcept { return -bit_cast<float, int32_t>(0x7f7fffff) - 1;}
+  static constexpr float lowest() noexcept { return bit_cast<float, int32_t>(0xff7fffff);}
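
The bit patterns named in this comment can be checked on the host with a minimal sketch like the one below. It assumes a 32-bit IEEE 754 `float`; `float_from_bits` and `check_lowest_bit_pattern` are hypothetical helpers written for illustration and are not part of CUTLASS. `std::memcpy` stands in for `bit_cast` so the sketch does not depend on the library's own implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper (not a CUTLASS function): reinterpret a 32-bit
// pattern as a float without relying on C++20 std::bit_cast.
inline float float_from_bits(std::uint32_t bits) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
  static_assert(std::numeric_limits<float>::is_iec559, "sketch assumes IEEE 754");
  float value;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}

// Asserts that the two patterns from the review decode as described.
inline void check_lowest_bit_pattern() {
  const float max_finite = float_from_bits(0x7f7fffffu);  // largest finite float
  const float lowest     = float_from_bits(0xff7fffffu);  // same magnitude, sign bit set
  assert(max_finite == std::numeric_limits<float>::max());
  assert(lowest == std::numeric_limits<float>::lowest());
  assert(lowest == -max_finite);
}
```

Calling `check_lowest_bit_pattern()` in a debug build exercises all three assertions. Writing the constant as a sign-bit pattern states the intent directly, instead of relying on how `-max - 1` rounds in single precision.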

Copilot · 2025-11-05T07:02:32Z

+      auto transpose_function = make_layout(tensor_shape_S, LayoutRight{});
+      for (size_t i = 0; i < h_D.size(); ++i)
+        if (h_D[i] != h_S[transpose_function(i)])
+          bad++;


The transpose validation logic is incorrect. `transpose_function(i)` with a `LayoutRight` layout on `tensor_shape_S` will compute the same linear index `i`, not the transposed index. For proper transpose validation, you need to convert linear index `i` to (row, col) in the destination, swap to (col, row), then compute the linear index in the source.

Suggested change

-      auto transpose_function = make_layout(tensor_shape_S, LayoutRight{});
-      for (size_t i = 0; i < h_D.size(); ++i)
-        if (h_D[i] != h_S[transpose_function(i)])
-          bad++;
+      // Correct validation: map destination index to source index via transpose
+      for (size_t i = 0; i < h_D.size(); ++i) {
+        // Destination shape: N x M, so row = i / M, col = i % M
+        size_t row = i / M;
+        size_t col = i % M;
+        // Source shape: M x N, so source index = col * N + row
+        size_t source_index = col * N + row;
+        if (h_D[i] != h_S[source_index])
+          bad++;
+      }
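
The explicit index mapping in this suggestion can be exercised without CuTe. The sketch below assumes row-major storage for both the M x N source and the N x M destination; `count_transpose_mismatches` and `reference_transpose` are hypothetical helpers, not functions from the PR's `util.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical checker: counts elements of a row-major N x M destination
// that differ from the transpose of a row-major M x N source.
inline std::size_t count_transpose_mismatches(const std::vector<float>& src,
                                              const std::vector<float>& dst,
                                              std::size_t M, std::size_t N) {
  assert(src.size() == M * N && dst.size() == M * N);
  std::size_t bad = 0;
  for (std::size_t i = 0; i < dst.size(); ++i) {
    const std::size_t row = i / M;  // destination row, in [0, N)
    const std::size_t col = i % M;  // destination column, in [0, M)
    if (dst[i] != src[col * N + row]) ++bad;  // source element (col, row)
  }
  return bad;
}

// Straightforward host transpose, used only to exercise the checker.
inline std::vector<float> reference_transpose(const std::vector<float>& src,
                                              std::size_t M, std::size_t N) {
  std::vector<float> dst(src.size());
  for (std::size_t r = 0; r < M; ++r)
    for (std::size_t c = 0; c < N; ++c)
      dst[c * M + r] = src[r * N + c];
  return dst;
}
```

The mapped index is `(i % M) * N + i / M`. This is also the index obtained by unpacking `i` in column-major order over an (M, N) shape and applying row-major strides, so it may be worth checking against the CuTe layout documentation whether the original `LayoutRight` expression already evaluates to the same value.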

Copilot · 2025-11-05T07:02:33Z

+
+constexpr size_t numIters = 100;
+
+typedef unsigned int uint;


Using `typedef` for `uint` is outdated C++ style. Replace with `using uint = unsigned int;` for consistency with modern C++ conventions.

Suggested change

-typedef unsigned int uint;
+using uint = unsigned int;

sspintel · 2025-11-06T08:23:52Z

> @sspintel Thanks for the example. Could you please explain why we need this? Can you share the improvement achieved on the BMG/Xe2 platform?

Hi @Antonyvance, the PR is still a WIP and is meant to be an introduction to memory access patterns in CuTe (similar to the Colfax article in the description). The tutorial has samples for gmem copy, naive transpose, and smem copy and transpose, which demonstrate how to avoid strided accesses to and from gmem by routing the strided accesses through smem instead. The main code runs these separate kernels and compares the effective bandwidth each one achieves.
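
For readers unfamiliar with the metric, effective bandwidth for a copy or transpose is usually computed as bytes read plus bytes written, divided by elapsed time. The sketch below follows that convention; `effective_bandwidth_gbps` is a hypothetical helper, and whether the PR's benchmarking code uses exactly this formula is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: effective bandwidth in GB/s for moving an M x N
// matrix of T. Each element is read once and written once, so the kernel
// moves 2 * M * N * sizeof(T) bytes in total.
template <typename T>
double effective_bandwidth_gbps(std::size_t M, std::size_t N, double seconds) {
  assert(seconds > 0.0);
  const double bytes_moved =
      2.0 * static_cast<double>(M) * static_cast<double>(N) * sizeof(T);
  return bytes_moved / seconds / 1e9;
}
```

For example, `effective_bandwidth_gbps<float>(1024, 1024, 1e-3)` corresponds to 8,388,608 bytes moved in one millisecond. Comparing this figure for the transpose kernels against the plain copy kernels shows how much of the achievable bandwidth each access pattern retains.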

I will also explore other CuTe concepts like using swizzling to avoid bank conflicts, tv-layouts, sgv-layout, and how to use block 2d copy atoms to achieve copy/transpose. Overall, this is a learning exercise for me and might also serve as an introduction to others learning CuTe. I will add a detailed README at the end when I am done with the implementation.
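
As a rough illustration of why swizzling helps with bank conflicts, the host-side sketch below models a 32 x 32 tile of 4-byte words stored row-major in a memory with 32 banks. The tile size, the bank count, and the XOR scheme are all assumptions chosen for illustration; this is neither CuTe's swizzle functor nor a description of the Xe SLM bank layout.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

constexpr std::size_t kTile  = 32;  // assumed tile edge, in 4-byte words
constexpr std::size_t kBanks = 32;  // assumed number of 4-byte-wide banks

// Bank touched by element (row, col) of a row-major kTile x kTile tile.
inline std::size_t bank_plain(std::size_t row, std::size_t col) {
  assert(row < kTile && col < kTile);
  return (row * kTile + col) % kBanks;
}

// Same tile, but the column is XORed with the row before storing.
inline std::size_t bank_swizzled(std::size_t row, std::size_t col) {
  assert(row < kTile && col < kTile);
  return (row * kTile + (col ^ row)) % kBanks;
}

// Number of distinct banks hit when one logical column is read across all rows.
template <typename BankFn>
std::size_t distinct_banks_in_column(BankFn bank, std::size_t col) {
  std::set<std::size_t> banks;
  for (std::size_t row = 0; row < kTile; ++row) banks.insert(bank(row, col));
  return banks.size();
}
```

In this model a column read of the plain tile lands in a single bank for every row, while the XOR-swizzled tile spreads the same read across all 32 banks, which is the effect a swizzled smem layout aims for.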

tdeng5 · 2026-04-01T06:46:11Z

We have refined the CuTe API implementation. If you still need this PR, please update it based on the latest source code and submit a new PR for review.

Antonyvance added the "examples" label (Label for adding examples, complex kernels development using cutlass or cute APIs) on Oct 17, 2025

Antonyvance added the "information required" label (The PR requires more information to review them properly) on Oct 17, 2025

sspintel added 3 commits October 28, 2025 22:37

Add CuTe Matrix Transpose tutorial

71829b7

Add the native sycl kernel code

1f53721

Add transposed load kernel through the new Xe Copy Atoms

3578f8e

sspintel force-pushed the dev/matrix-transpose branch from 43a06f2 to 3578f8e on October 29, 2025 05:39

Antonyvance requested a review from Copilot November 5, 2025 07:01

Copilot AI reviewed Nov 5, 2025

Add block2dcopy test code

35b1308

Merge branch 'intel:main' into dev/matrix-transpose

f732325

tdeng5 closed this Apr 1, 2026

sspintel wants to merge 5 commits into intel:main from sspintel:dev/matrix-transpose


4 participants
