This document describes the six GEMM (General Matrix Multiply) kernel versions implemented in FastCuda. Each version builds on the previous one and targets a specific level of the GPU memory / compute hierarchy.
The operation computed is:

$$C \leftarrow \alpha \, A B + \beta \, C$$

where $A$ is $M \times K$, $B$ is $K \times N$, $C$ is $M \times N$, and $\alpha$, $\beta$ are scalars.
## V1: Naive

Source: `src/gemm/gemm_v1_naive.cu`
Each CUDA thread computes exactly one element of $C$:

```text
accum = 0
for i in 0..K:
    accum += A[row][i] * B[i][col]
C[row][col] = alpha * accum + beta * C[row][col]
```
| Parameter | Value |
|---|---|
| Block | 16 × 16 |
| Grid | ⌈N/16⌉ × ⌈M/16⌉ |
- Arithmetic intensity is $O(1)$: every FMA reads two floats from global memory.
- Many redundant global loads: each row of $A$ is loaded $N$ times; each column of $B$ is loaded $M$ times.
- The only advantage is simplicity and correctness; it serves as the baseline.
Correctness reference and first debugging target.
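The pseudocode above maps directly to a kernel along these lines. This is a sketch; the function name, argument order, and bounds handling are illustrative and may differ from the actual `gemm_v1_naive.cu`:

```cuda
// Naive GEMM: one thread per output element (illustrative sketch).
__global__ void gemm_v1_naive(int M, int N, int K, float alpha,
                              const float *A, const float *B,
                              float beta, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // index into M
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // index into N
    if (row < M && col < N) {
        float accum = 0.0f;
        for (int i = 0; i < K; ++i)
            accum += A[row * K + i] * B[i * N + col];
        C[row * N + col] = alpha * accum + beta * C[row * N + col];
    }
}

// Launch configuration matching the table above:
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (M + 15) / 16);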
## V2: Shared-memory tiling

Source: `src/gemm/gemm_v2_shared.cu`
The output matrix is partitioned into TILE × TILE blocks. For each output block, the threads of a block cooperatively stage tiles of $A$ and $B$ in shared memory, then accumulate partial products out of shared memory:

```text
for each K-tile:
    cooperatively load A[block_row..][tile_k..] → As
    cooperatively load B[tile_k..][block_col..] → Bs
    __syncthreads()
    for i in 0..TILE:
        accum += As[ty][i] * Bs[i][tx]
    __syncthreads()
```
- TILE = 32, with +1 padding on shared arrays to eliminate bank conflicts.
- Each global-memory element is loaded once per tile step, then reused TILE times from shared memory.
- Arithmetic intensity rises to $O(\text{TILE})$.
| Parameter | Value |
|---|---|
| Block | 32 × 32 (1024 threads) |
| Grid | ⌈N/32⌉ × ⌈M/32⌉ |
Shared-memory arrays are declared as `float As[TILE][TILE+1]`. The extra
column shifts each row by one bank, preventing stride-32 accesses from hitting
the same bank.
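Putting the pieces together, the tiled kernel looks roughly like this. It is a sketch under the parameters above (TILE = 32, +1 padding); indexing and edge handling in `gemm_v2_shared.cu` may differ:

```cuda
#define TILE 32

// Shared-memory tiled GEMM (illustrative sketch).
__global__ void gemm_v2_shared(int M, int N, int K, float alpha,
                               const float *A, const float *B,
                               float beta, float *C) {
    __shared__ float As[TILE][TILE + 1];  // +1 column avoids bank conflicts
    __shared__ float Bs[TILE][TILE + 1];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    float accum = 0.0f;

    for (int tk = 0; tk < K; tk += TILE) {
        // Cooperative loads: one element of each tile per thread.
        As[ty][tx] = (row < M && tk + tx < K) ? A[row * K + tk + tx] : 0.0f;
        Bs[ty][tx] = (tk + ty < K && col < N) ? B[(tk + ty) * N + col] : 0.0f;
        __syncthreads();

        for (int i = 0; i < TILE; ++i)
            accum += As[ty][i] * Bs[i][tx];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = alpha * accum + beta * C[row * N + col];
}
```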
## V3: Register blocking

Source: `src/gemm/gemm_v3_register.cu`
Each thread now computes a TM × TN (4 × 4) sub-tile of the output, using a
set of register-resident accumulators. The block tile grows to
BM × BN × BK = 64 × 64 × 16, and global loads use float4 for 128-bit
transactions.
```text
float accum[TM][TN] = {0};
for each K-tile:
    cooperative float4 load → As, Bs
    __syncthreads()
    for bk in 0..BK:
        load a_frag[TM], b_frag[TN] from shared
        outer product: accum += a_frag * b_frag^T
    __syncthreads()
```
- Block size = 16 × 16 = 256 threads.
- Each thread owns 16 FMAs per inner-loop step (4 × 4), reaching higher instruction-level parallelism.
- `float4` vectorized loads issue 128-bit transactions, making full use of the memory bus width.
- Register blocking increases the compute-to-memory ratio to $O(\text{TM} \times \text{TN})$ FMAs per shared-memory read.
| Param | Value | Meaning |
|---|---|---|
| BM | 64 | Block rows |
| BN | 64 | Block cols |
| BK | 16 | K-tile width |
| TM | 4 | Thread rows |
| TN | 4 | Thread cols |
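The inner loop can be sketched as follows. This fragment assumes the surrounding kernel has already staged `As`/`Bs` in shared memory as in V2 and that `tx`, `ty`, `TM`, `TN`, `BK` are in scope; the exact shared-memory layout in `gemm_v3_register.cu` may differ:

```cuda
// Register-blocked inner loop: each thread computes a TM × TN sub-tile.
float accum[TM][TN] = {0.0f};
float a_frag[TM], b_frag[TN];

for (int bk = 0; bk < BK; ++bk) {
    // Pull this thread's TM row values and TN column values from shared
    // memory once per bk step...
    for (int i = 0; i < TM; ++i) a_frag[i] = As[ty * TM + i][bk];
    for (int j = 0; j < TN; ++j) b_frag[j] = Bs[bk][tx * TN + j];

    // ...then perform a TM × TN outer product entirely in registers:
    // TM + TN shared reads feed TM * TN FMAs.
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            accum[i][j] += a_frag[i] * b_frag[j];
}
```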
## V4: Warp tiling and double buffering

Source: `src/gemm/gemm_v4_warp.cu`
This version is the culmination of the CUDA-core SGEMM path. Three techniques are combined:
- Warp-level tiling — each warp in the block owns a clearly defined sub-region of the output tile.
- Double-buffered shared memory — two sets of shared arrays (`As[2]`, `Bs[2]`) alternate between loading and computing roles.
- Software pipelining / prefetch — while the current tile is being computed from one buffer, the next tile is being loaded into the other buffer.
```text
load tile 0 → As[0], Bs[0]
__syncthreads()
for tk = BK .. K+BK step BK:
    if tk < K:
        load tile tk → As[1-buf], Bs[1-buf]   -- prefetch next tile
    compute from As[buf], Bs[buf]             -- current tile
    swap buf
    __syncthreads()
```
| Param | Value | Meaning |
|---|---|---|
| BM | 128 | Block rows |
| BN | 128 | Block cols |
| BK | 8 | K-tile width |
| TM | 8 | Thread rows |
| TN | 8 | Thread cols |
- The overlap between memory loading and computation hides most of the global memory latency.
- Larger tile (128 × 128) amortises the block scheduling overhead and maximises data reuse.
- Each thread computes 8 × 8 = 64 output elements, giving very high compute density.
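The pipelined main loop can be sketched as below. `load_tile` and `compute_tile` are hypothetical helpers standing in for the cooperative-load and register-blocked-compute code (which the real kernel inlines); buffer indexing is illustrative:

```cuda
// Hypothetical helpers; the real kernel inlines this work.
__device__ void load_tile(float (*As)[BM], float (*Bs)[BN], int tk);
__device__ void compute_tile(const float (*As)[BM], const float (*Bs)[BN]);

// Double-buffered main loop (sketch).
__shared__ float As[2][BK][BM];
__shared__ float Bs[2][BK][BN];

int buf = 0;
load_tile(As[0], Bs[0], /*tk=*/0);               // prologue: first tile
__syncthreads();

for (int tk = BK; tk < K + BK; tk += BK) {
    if (tk < K)
        load_tile(As[1 - buf], Bs[1 - buf], tk); // prefetch next tile
    compute_tile(As[buf], Bs[buf]);              // consume current tile
    buf = 1 - buf;
    __syncthreads();                             // next buffer now ready
}
```

Because the prefetch targets the *other* buffer, no `__syncthreads()` is needed between the load and the compute within an iteration; only one barrier per K-step remains.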
## V5: TF32 Tensor Cores

Source: `src/gemm/gemm_v5_tf32.cu`
Input and output remain FP32, but the inner matrix multiply-accumulate is performed using the TF32 data path on Tensor Cores. TF32 is a 19-bit format (1 sign + 8 exponent + 10 mantissa) that retains the FP32 exponent range while reducing mantissa precision.
The kernel uses CUDA's `nvcuda::wmma` API with fragment type `wmma::precision::tf32` and shape M=16, N=16, K=8.
- Requires SM 80+ (Ampere and later).
- Supported on both RTX 4090 (SM 89) and RTX 5060 (SM 120).
| Param | Value |
|---|---|
| BM | 64 |
| BN | 64 |
| BK | 8 |
| Threads | 128 (4 warps) |
Each warp owns one 16-row slice of the output and iterates over 4 column tiles (4 × 16 = 64 columns).
- Tensor Core operations deliver significantly higher throughput than CUDA cores for the same matrix size.
- The numerical result will differ slightly from IEEE FP32 due to the shorter mantissa. For deep-learning workloads this is generally acceptable.
- NVIDIA's cuBLAS exposes an equivalent mode via `CUBLAS_COMPUTE_32F_FAST_TF32`.
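One warp's TF32 WMMA step follows the standard pattern below. This is a sketch: the tile pointers, leading dimensions, and function wrapper are illustrative, not the exact structure of `gemm_v5_tf32.cu`:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp accumulates a 16×16 output tile over K using TF32 MMA
// (shape m16n16k8). A_tile/B_tile/C_tile point at this warp's tiles.
__device__ void tf32_tile_mma(const float *A_tile, int lda,
                              const float *B_tile, int ldb,
                              float *C_tile, int ldc, int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32,
                   wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32,
                   wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    for (int tk = 0; tk < K; tk += 8) {
        wmma::load_matrix_sync(a_frag, A_tile + tk, lda);
        wmma::load_matrix_sync(b_frag, B_tile + tk * ldb, ldb);
        // FP32 inputs must be explicitly rounded to TF32 before the MMA.
        for (int i = 0; i < a_frag.num_elements; ++i)
            a_frag.x[i] = wmma::__float_to_tf32(a_frag.x[i]);
        for (int i = 0; i < b_frag.num_elements; ++i)
            b_frag.x[i] = wmma::__float_to_tf32(b_frag.x[i]);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C_tile, c_frag, ldc, wmma::mem_row_major);
}
```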
## V6: FP16 Tensor Cores (HGEMM)

Source: `src/gemm/gemm_v6_hgemm.cu`
The most natural Tensor Core GEMM path. Input matrices are FP16 (half),
accumulation and output use FP32. This follows the canonical mixed-precision
pattern described in NVIDIA's Tensor Core programming guides since Volta.
Uses `nvcuda::wmma` with FP16 fragments and shape M=16, N=16, K=16.
- Requires SM 70+ (Volta and later).
- Covers RTX 4090 (SM 89) and RTX 5060 (SM 120).
| Param | Value |
|---|---|
| BM | 64 |
| BN | 64 |
| BK | 16 |
| Threads | 128 (4 warps) |
Shared-memory arrays are padded by 8 elements to avoid bank conflicts on half-precision loads.
- FP16 Tensor Core GEMM typically achieves the highest throughput on modern NVIDIA GPUs (2× to 4× over FP32 Tensor Core paths, architecture-dependent).
- Output is FP32 to avoid accumulation error. Downstream code can convert to FP16 if needed.
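The per-warp step mirrors the TF32 version, with `half` input fragments, K=16, and a `float` accumulator — and no explicit rounding step, since the inputs are already FP16. As before, pointers, leading dimensions, and the wrapper function are illustrative:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp accumulates a 16×16 output tile: half inputs, FP32 accumulate
// (shape m16n16k16). Sketch only; names are not from gemm_v6_hgemm.cu.
__device__ void fp16_tile_mma(const half *A_tile, int lda,
                              const half *B_tile, int ldb,
                              float *C_tile, int ldc, int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    for (int tk = 0; tk < K; tk += 16) {
        wmma::load_matrix_sync(a_frag, A_tile + tk, lda);       // half source
        wmma::load_matrix_sync(b_frag, B_tile + tk * ldb, ldb);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);         // FP32 accumulate
    }
    wmma::store_matrix_sync(C_tile, c_frag, ldc, wmma::mem_row_major);  // float out
}
```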
## Summary

| Version | Precision | Core type | Shared mem | Register blocking | Double buffer | WMMA |
|---|---|---|---|---|---|---|
| V1 | FP32 | CUDA | — | — | — | — |
| V2 | FP32 | CUDA | ✓ | — | — | — |
| V3 | FP32 | CUDA | ✓ | ✓ (4×4) | — | — |
| V4 | FP32 | CUDA | ✓ (2×) | ✓ (8×8) | ✓ | — |
| V5 | TF32 | Tensor | ✓ | via WMMA | — | ✓ |
| V6 | FP16→FP32 | Tensor | ✓ | via WMMA | — | ✓ |
## Build and run

```bash
# Build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Example (runs v1-v5)
./build/fastcuda_gemm_example 1024 1024 1024

# Benchmark (all versions + cuBLAS)
./build/fastcuda_bench gemm all m=1024,n=1024,k=1024
```

## References

- Mark Harris, *Optimizing Parallel Reduction in CUDA* (NVIDIA, 2007)
- NVIDIA CUDA C++ Programming Guide — Warp Matrix Functions (WMMA)
- NVIDIA cuBLAS documentation — `cublasGemmEx`, `CUBLAS_COMPUTE_32F_FAST_TF32`
- How to optimize in GPU
- Optimizing SGEMM on NVIDIA Turing GPUs