- 03_v0:
- uses 512 threads.
- simple load-and-compute structure.
- registers pre-allocated.
- 204 VGPRs;
- should have bank-conflict and load-efficiency problems.
- uses float4 loads.
- v0: 137 TF:
- v1: 136 TF: use float2 loads for shared memory.
- no improvement, so the problem is probably mainly the bank conflicts?
- v2:
165 TF: remove the shared-memory clearing.
162 TF: use float2 loads for shared memory.
382 TF: add swizzling for shared memory (+136%).
396 TF: fully unroll all parameter loops.
- v3:
400 TF: add XCD remap and L2 cache swizzle. why is it still so slow...
- v4:
some register spills now? how to reduce them.
first write a tiling scheme the way HipKittens does.
491 TF: use float4 for g2s and s2.
- v5:
497 TF: add fast float->bfloat16 conversion... ignore for now... maybe leave it for last.
535 TF: with the barrier open... ping-pong should have a much bigger effect; study it more.
602 TF: with the barrier closed...
so shared-memory interleaving can gain much more...
- v6:
simple refactor to reduce register pressure, as groundwork for later optimizations.
merge A_smem & B_smem to save one register.
small-tile MMA; add 8-wave ping-pong interleaving.
gau's HIP GEMM: https://github.com/gau-nernst/learn-cuda/tree/main/02c_matmul_hip
shark-ai's AMD GPU kernel optimization guide: https://github.com/nod-ai/shark-ai/blob/main/docs/amdgpu_kernel_optimization_guide.md#mi300
592 TFLOPS on NTN shapes and 654 TFLOPS on TNT shapes, from the sweep script at https://github.com/ROCm/tritonBLAS/blob/main/tools/sweep_grid.py
Problem size: 4864x4096x4096 (transA=T, transB=N)
Best tile: (256, 256, 32) → 654.101 TFLOPS ⭐️⭐️⭐️
Heuristic tile: (256, 256, 64) → 621.040 TFLOPS
Problem size: 4864x4096x4096 (transA=N, transB=T)
Best tile: (256, 256, 64) → 592.116 TFLOPS ⭐️⭐️⭐️
Heuristic tile: (256, 256, 64) → 590.764 TFLOPS

- add some simple asm code here? add some __builtin_assume hints here to skip the checks in the original add kernel's assembly.
__launch_bounds__(256)
__builtin_assume(N >= 2048);
__builtin_assume(N <= 4096);
__builtin_assume(N % 1024 == 0); // if it is a multiple of 256

- code from https://github.com/amd/amd-lab-notes/blob/release/matrix-cores/src/mfma_fp32_32x32x8fp16.cpp
- simple code showing how to use HIP matrix-core instructions.
- it uses the builtin function, not the inline-assembly version, though.
d = __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, d, 0, 0, 0);
//  ^                                     ^  ^  ^
//  D(=C)                                 |  |  C(=D)
//                  16 columns of A ------+  +------ 16 rows of B

- also add fp16_gemm_16x16x16_NTN below so B is transposed; then we can load B with 4 x fp16 -> b64 load instructions.
- should give higher load throughput? try to find a microbenchmark to test it.
- TODO: move the NTN version to another file? and also add a microbenchmark to test it.
  as our shapes: 32 * 32 * 64 & 32 * 32 * 32
