Skip to content

[WIP] Use asm developed externally in hipblaslt#3960

Closed
newling wants to merge 3 commits intoROCm:developfrom
newling:asm_as_third_path
Closed

[WIP] Use asm developed externally in hipblaslt#3960
newling wants to merge 3 commits intoROCm:developfrom
newling:asm_as_third_path

Conversation

@newling
Copy link
Copy Markdown
Contributor

@newling newling commented Jan 19, 2026

Goal of this PR

We have been provided with an AITER .co file that we want to use instead of RocRoller's generated kernel, but only when Origami predicts that the best macro-tile size is 256x256 (see task description here).

Summary of PR

Currently (before this PR), the approach hipblaslt takes is, at a high-level

if canRunRocRoller: 
  return useRocroller()
return useTensilelite()

This PR changes this to

if canUseAssemblyDirect:
   return useAssemblyDirect()
if canRunRocRoller: 
  return useRocroller()
return useTensilelite()

Another PR being worked on by @awhittle3 and others uses a cleaner approach for packaging external kernels, and it will use the approach:

if canRunRocRoller: 
  # origami logic runs here
  if canUseAssemblyDirect:
     return useAssemblyDirect()
  return useRocroller()
return useTensilelite()

Where the difference above is that the Origami logic can be used more cleanly is deciding whether to use the externel kernel ('assembly direct').

This current PR has 2 environment variables, set as:

 export HIPBLASLT_CUSTOM_ASM_DIR=/path/to/custom/file.co
 export HIPBLASLT_ENABLE_DIRECT_ASSEMBLY=1

The first of these is hacky -- these .co files should be packaged correctly within the install of hipblaslt in production.

I have generated a toy .co file (see poc_co.cpp in this PR) for doing single precision gemm. To use

  1. Compile it to a .co file, and put it in the directory ${HIPBLASLT_CUSTOM_ASM_DIR}, and then
  2. Add the glue to register it in hipblaslt (already in this PR)

Running hipblaslt-bench, I see the following logging which shows that the custom kernel ran:

./clients/hipblaslt-bench  -m 128 -n 64 -k 256 -r f32_r --verify --alpha 1 --beta 1
[...]
[DirectAssembly] Match found: SimpleGemm_PoC
[DirectAssembly] Loading module: /home/jnewling/workspace/amd-experiments/hip_hipblaslt_standalone/poc_co.co
[DirectAssembly] Launching SimpleGemm...
[DirectAssembly] Match found: SimpleGemm_PoC
[DirectAssembly] Launching SimpleGemm...
[...]
[0]:transA,transB,grouped_gemm,batch_count,m,n,k,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,a_type,b_type,c_type,d_type,compute_type,scaleA,scaleB,scaleC,scaleD,amaxD,swizzle_a,swizzle_b,activation_type,bias_vector,bias_type,aux_type,rotating_buffer,flush,use_gpu_timer,hipblaslt-Gflops,hipblaslt-GB/s,us,CPU-Gflops,CPU-us,norm_error,atol,rtol
   N,N,0,1,128,64,256,1,128,32768,1,256,16384,128,8192,128,8192,f32_r,f32_r,f32_r,f32_r,f32_r,0,0,0,0,0,0,0,none,0,f32_r,f32_r,0,0,0,119.156,6.93581,35.2,0.220278,19041,2.94527e-07,1e-06,1e-05

If I disable the environment variable with

export HIPBLASLT_ENABLE_DIRECT_ASSEMBLY=0, the custom kernel is not used.

Integrating the AITER kernel

For the production .co that we want to use, we need to get information from running the kernel through AITER. Below I describe this (this step will be dependent on the source of the kernel).

Building AITER: My approach was

  1. Get AITER at https://github.com/ROCm/aiter
  2. create a venv
  3. wget https://download.pytorch.org/whl/rocm7.1/torch-2.10.0%2Brocm7.1-cp310-cp310-manylinux_2_28_x86_64.whl
  4. pip3 install ./torch-2.10.0+rocm7.1-cp310-cp310-manylinux_2_28_x86_64.whl torchvision --index-url https://download.pytorch.org/whl/rocm7.1
  5. pip install the 2 mentioned deps on https://github.com/ROCm/aiter

Run the test that exercises the kernel:

python3 op_tests/test_gemm_a4w4.py

Kernels are cached after compilation. To remove the cache (important if the printing / kernel changes) I did:

rm -rf aiter/jit
git checkout main aiter/jit

For figuring out kernel arguments, the branch is https://github.com/newling/aiter/tree/printing_for_kernel_arg_debug. I dumped the kernel arguments as bytes to help verify that intergration glue into hipblaslt is correct.

Offset | 00 01 02 03 04 05 06 07 | 08 09 0A 0B 0C 0D 0E 0F
-------|-------------------------|-------------------------
0000   | 00 00 20 6d ab 7f 00 00 | 00 00 00 00 00 00 00 00 
0016   | 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00 
[...]
0336   | 80 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00 
0352   | 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00 
0368   | 00 00 00 00             |                         

Important to note: Getting the kernel arguments matching was not enough for this specific kernel, the scale values needed to have a special layout (tiling) in HBM. That was the trickiest part of getting the kernel to give numerically correct results in hipblaslt (thanks to @bethune-bryant and @bnemanich for decoding the AITER logic).

@math-ci-webhook
Copy link
Copy Markdown

perfci run on commit a0f93bc

math-ci run

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 8.67052% with 158 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...rary/src/amd_detail/rocblaslt/src/tensile_host.cpp 8.67% 155 Missing and 3 partials ⚠️

❌ Your project status has failed because the head coverage (49.21%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3960      +/-   ##
===========================================
- Coverage    61.41%   61.33%   -0.08%     
===========================================
  Files          627      627              
  Lines       113687   113854     +167     
  Branches     20034    20053      +19     
===========================================
+ Hits         69813    69825      +12     
- Misses       36016    36168     +152     
- Partials      7858     7861       +3     
Flag Coverage Δ *Carryforward flag
hipBLASLt 43.10% <8.67%> (-0.52%) ⬇️
rocFFT 49.21% <ø> (ø) Carriedforward from 6992cd2
rocSPARSE 69.19% <ø> (ø) Carriedforward from 6992cd2

*This pull request uses carry forward flags. Click here to find out more.

Files with missing lines Coverage Δ
...rary/src/amd_detail/rocblaslt/src/tensile_host.cpp 39.25% <8.67%> (-1.60%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@math-ci-webhook
Copy link
Copy Markdown

perfci run on commit 91fdf22

math-ci run

@awhittle3
Copy link
Copy Markdown
Contributor

NOTE: Experimental, do not merge

@awhittle3
Copy link
Copy Markdown
Contributor

PR #4384 is, in part, based off of the experimental work done here. We can probably close this PR.

@newling newling closed this Feb 10, 2026
@newling newling deleted the asm_as_third_path branch April 3, 2026 21:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants