benenzhu/learn-hip

Learn HIP asm and matmul through hipRTC.

BF16 GEMM OPT LOG:

  • 03_v0:
    • uses 512 threads
    • simple load and compute.
    • registers allocated up front.
    • 204 VGPRs;
    • likely has bank-conflict and load-efficiency problems.
  • use float4 loads
  • v0: 137 TF
  • v1: 136 TF: use float2 loads for shared memory.
    • no improvement, so the bottleneck is probably the bank conflicts?
  • v2:
    • 165 TF: remove the shared-memory clear.
    • 162 TF: use float2 loads for shared memory.
    • 382 TF: add swizzling for shared memory. (+136%)
    • 396 TF: fully unroll over the params.
  • v3:
    • 400 TF: add XCD remap and L2 cache swizzle. why is it still so slow?
  • v4:
    • some register spills now? how to reduce them.
    • first, write a tile abstraction like HipKittens does.
    • 491 TF: use float4 for g2s (global-to-shared) and s2.
  • v5:
    • 497 TF: add fast float-to-bfloat16 conversion... set aside for now; maybe revisit at the end.
    • 535 TF: with the barrier open... ping-pong should have a much bigger effect; study it more.
    • 602 TF: with the barrier closed...
    • so shared-memory interleaving can do much more...
  • v6:
    • simple refactor to reduce register pressure, as groundwork for later optimizations.
  • merge A_smem & B_smem to save one register.
  • small-tile MMA, add 8-wave ping-pong interleave.
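The shared-memory swizzle that gave the +136% jump in v2 can be sketched as an XOR of the column index with the row index, so that threads reading down a column of the tile hit different LDS banks. This is a minimal sketch of the idea, not the kernel's actual layout; `kTileWidth` and the per-element XOR pattern (real kernels often apply it per 16-byte vector) are assumptions.

```cpp
#include <cassert>

// Hypothetical XOR swizzle for a square shared-memory tile.
// kTileWidth is an assumption, not this kernel's real tile size.
constexpr int kTileWidth = 32;

// Map a logical (row, col) to a swizzled column. XOR is its own
// inverse, so the same function is used for both the store into
// LDS and the load out of it.
inline int swizzle_col(int row, int col) {
    return col ^ (row % kTileWidth);
}
```

With a plain row-major layout, column 0 of all 32 rows lands on the same bank; after the XOR, column 0 of rows 0..31 maps to 32 distinct slots, so the accesses spread across banks.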
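The "fast float2bfloat16" mentioned in v5 is presumably the standard round-to-nearest-even bit trick; a CPU sketch of that conversion (NaN payloads are not handled specially here, which the real kernel may need to care about):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Round-to-nearest-even float -> bfloat16 via integer arithmetic.
// NaNs are not handled specially in this sketch.
uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    // Add 0x7FFF, plus 1 more when the kept LSB is set: this rounds
    // to nearest, ties to even.
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>(bits >> 16);
}
```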

ref:

gau's hip gemm: https://github.com/gau-nernst/learn-cuda/tree/main/02c_matmul_hip

shark-ai's amdgpu kernel optimization guide: https://github.com/nod-ai/shark-ai/blob/main/docs/amdgpu_kernel_optimization_guide.md#mi300

gemm target:

592 TFLOPS on NTN shapes and 654 TFLOPS on TNT shapes, from the sweep script at https://github.com/ROCm/tritonBLAS/blob/main/tools/sweep_grid.py

Problem size: 4864x4096x4096 (transA=T, transB=N)
Best tile: (256, 256, 32) → 654.101 TFLOPS ⭐️⭐️⭐️
Heuristic tile: (256, 256, 64) → 621.040 TFLOPS

Problem size: 4864x4096x4096 (transA=N, transB=T)
Best tile: (256, 256, 64) → 592.116 TFLOPS ⭐️⭐️⭐️
Heuristic tile: (256, 256, 64) → 590.764 TFLOPS
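The TFLOPS figures above follow from the usual 2·M·N·K FLOP count for GEMM; a small helper to convert an elapsed time into the same units (the function name is just for illustration):

```cpp
#include <cassert>

// Achieved TFLOPS for an M x N x K GEMM given elapsed milliseconds:
// 2*M*N*K floating-point ops, divided by seconds, scaled to 1e12.
double gemm_tflops(long long M, long long N, long long K, double ms) {
    double flops = 2.0 * double(M) * double(N) * double(K);
    return flops / (ms * 1e-3) / 1e12;
}
```

At 654 TFLOPS the 4864×4096×4096 problem takes roughly 0.25 ms, which is the scale a timing harness has to resolve.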

code:

00_add.hip: add kernel, for testing HIP asm and learning:

  • add some simple asm code here?

00_add_2.hip: add kernel 2

add some __builtin_assume hints here to skip the bounds checks in the original add kernel's asm.

__launch_bounds__(256)

__builtin_assume(N >= 2048);
__builtin_assume(N <= 4096);
__builtin_assume(N % 1024 == 0);  // N is a multiple of 1024

01_mfma.hip: test_bf16_matmul_NNN

d = __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, d, 0, 0, 0);
//                                        ^  ^  ^
//D(=C)                                   |  |  C(=D)
//                      16 columns of A---|  |--- 16 rows of B
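Per wavefront, the intrinsic accumulates a 16×16×16 matrix product; a plain-float CPU reference of the math it performs (a sketch of the semantics only — the per-lane register layout is a separate matter):

```cpp
#include <array>
#include <cassert>

using Mat16 = std::array<std::array<float, 16>, 16>;

// Reference for mfma_f32_16x16x16f16: D = A * B + C, with fp16
// inputs and fp32 accumulation (plain float stands in for fp16 here).
Mat16 mfma_ref(const Mat16& A, const Mat16& B, const Mat16& C) {
    Mat16 D = C;
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j)
            for (int k = 0; k < 16; ++k)
                D[i][j] += A[i][k] * B[k][j];
    return D;
}
```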

also fp16_gemm_16x16x16_NTN below, which transposes B so that B can be loaded with 4 × fp16 → b64 load instructions.

should give more throughput? try to find a microbenchmark to test it.

mfma_f32_16x16x16f16

  • TODO: move NTN to another file? and also add a microbenchmark to test it.

TODO: change it to a version with shared memory?

our shapes: 32 × 32 × 64 & 32 × 32 × 32
