- 03_v0:
- uses 512 threads.
- simple load-and-compute structure.
- registers pre-allocated.
- 204 VGPRs;
- should have bank-conflict and load-efficiency problems.
- uses float4 loads.
- v0: 137 TF:
- v1: 136 TF: use float2 loads for shared memory.
- no improvement, so the problem is probably mainly the bank conflicts?
- v2:
165 TF: remove the shared-memory clearing.
162 TF: use float2 loads for shared memory.
382 TF: add swizzling for shared memory (+136%).
396 TF: fully unroll all parameter loops.
- v3:
400 TF: add XCD remap and L2 cache swizzle. why is it still so slow...
- v4:
some register spills now? how to reduce them.
first write a tiling scheme the way HipKittens does.
491 TF: use float4 for g2s and s2.
- v5:
497 TF: add fast float->bfloat16 conversion... ignore for now... maybe leave it for last.
535 TF: with the barrier open... ping-pong should have a much bigger effect; study it more.
602 TF: with the barrier closed...
so shared-memory interleaving can gain much more...
- v6:
simple refactor to reduce register pressure, as groundwork for later optimizations.
merge A_smem & B_smem to save one register.
small-tile MMA; add 8-wave ping-pong interleaving.
gau's HIP GEMM: https://github.com/gau-nernst/learn-cuda/tree/main/02c_matmul_hip
shark-ai's AMD GPU kernel optimization guide: https://github.com/nod-ai/shark-ai/blob/main/docs/amdgpu_kernel_optimization_guide.md#mi300
592 TFLOPS on NTN shapes and 654 TFLOPS on TNT shapes, from the sweep script at https://github.com/ROCm/tritonBLAS/blob/main/tools/sweep_grid.py
Problem size: 4864x4096x4096 (transA=T, transB=N)
Best tile: (256, 256, 32) → 654.101 TFLOPS ⭐️⭐️⭐️
Heuristic tile: (256, 256, 64) → 621.040 TFLOPS
Problem size: 4864x4096x4096 (transA=N, transB=T)
Best tile: (256, 256, 64) → 592.116 TFLOPS ⭐️⭐️⭐️
Heuristic tile: (256, 256, 64) → 590.764 TFLOPS

- add some simple asm code here? add some __builtin_assume hints here to skip the checks in the original add kernel's assembly.
__launch_bounds__(256)
__builtin_assume(N >= 2048);
__builtin_assume(N <= 4096);
__builtin_assume(N % 1024 == 0); // if it is a multiple of 256

- code from https://github.com/amd/amd-lab-notes/blob/release/matrix-cores/src/mfma_fp32_32x32x8fp16.cpp
- simple code showing how to use HIP matrix-core instructions.
- it uses the builtin function, not the inline-assembly version, though.
d = __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, d, 0, 0, 0);
//  ^                                     ^  ^  ^
//  D(=C)                                 |  |  C(=D)
//                  16 columns of A ------+  +------ 16 rows of B

- also add fp16_gemm_16x16x16_NTN below so B is transposed; then we can load B with 4 x fp16 -> b64 load instructions.
- should give higher load throughput? try to find a microbenchmark to test it.
- TODO: move the NTN version to another file? and also add a microbenchmark to test it.
  as our shapes: 32 * 32 * 64 & 32 * 32 * 32
