RutanshS/amdgpu-kernel-analyzer

AMDGPU Kernel Analyzer

A custom out-of-tree LLVM pass plugin that statically analyzes GPU kernels targeting the AMDGPU backend. The pass identifies uncoalesced memory accesses, a common source of memory bandwidth bottlenecks in GPU programming.

Overview

The analyzer operates as a plugin for LLVM's opt tool, inspecting AMDGPU kernel functions for global memory access patterns. For each load and store to addrspace(1) (global memory), it performs the following:

  1. Traces the address computation backward through GetElementPtr instructions, arithmetic operations, and casts.
  2. Identifies thread-varying components by searching for workitem intrinsics (llvm.amdgcn.workitem.id.*).
  3. Classifies the access pattern into one of four categories:
    • [Coalesced]: Threads access contiguous memory (stride equals element size).
    • [Uncoalesced]: Threads skip memory locations (stride exceeds element size).
    • [Indirect]: The address depends on a loaded value (gather pattern, e.g., A[B[tid]]).
    • [Unknown]: The address computation is too complex for static resolution.
  4. Reports severity relative to RDNA 2 hardware parameters (128B cache lines, Wave32).
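
The classification in step 3 can be sketched in standalone C++. This is a minimal illustration rather than the plugin's actual code: AddressSummary, Pattern, and classify are hypothetical names, and both the order of the checks and the treatment of strides smaller than the element size are assumptions that the list above does not specify.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of one traced address (not the plugin's real data structure).
struct AddressSummary {
    bool resolved;            // false if the walker could not resolve the expression
    bool dependsOnLoad;       // true if a loaded value feeds the address
    std::int64_t strideBytes; // byte distance between addresses of adjacent threads
    std::int64_t elemBytes;   // size of the accessed element in bytes
};

enum class Pattern { Coalesced, Uncoalesced, Indirect, Unknown };

Pattern classify(const AddressSummary &a) {
    assert(a.elemBytes > 0 && "element size must be positive");
    if (a.dependsOnLoad) return Pattern::Indirect;   // gather, e.g. A[B[tid]]
    if (!a.resolved) return Pattern::Unknown;        // too complex to resolve
    if (a.strideBytes == a.elemBytes) return Pattern::Coalesced;
    if (a.strideBytes > a.elemBytes) return Pattern::Uncoalesced;
    // Assumption: strides below the element size (including 0) fall outside
    // the four rules above, so this sketch reports them as Unknown.
    return Pattern::Unknown;
}
```

For the strided_access example, a 4-byte load with a 4096-byte stride classifies as Uncoalesced, and a 4-byte store with a 4-byte stride classifies as Coalesced.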

Example Output

--- AMDGPU Kernel Analyzer: strided_access ---
  [Uncoalesced] (stride=4096, severity=high)  -  load 4B
      -> thread ID (x) with multiplier 1024
  [Coalesced]  -  store 4B
      -> thread ID (x) with multiplier 1

  Summary: 1 coalesced, 1 uncoalesced, 0 indirect, 0 unknown

Requirements

  • Ubuntu 24.04 (or compatible)
  • LLVM 18 development package (llvm-18-dev)
  • ROCm 6.3+ (for hipcc and AMDGPU target support)
  • CMake 3.20+
  • Python 3 + lit (for running tests)

Quick Start

# Build
cmake -B build -DLLVM_DIR=/usr/lib/llvm-18/lib/cmake/llvm/
cmake --build build

# Run on LLVM IR directly
opt-18 --load-pass-plugin=build/AMDGPUKernelAnalyzer.so \
    -passes=amdgpu-kernel-analyzer \
    -amdgpu-analyzer-verbose \
    -disable-output your_kernel.ll

# Run lit tests
pip install lit
lit test/ -v

Analyzing HIP Kernels

# Compile HIP -> LLVM IR (device-only)
hipcc -c -emit-llvm -S --offload-arch=gfx1031 -O2 \
      --cuda-device-only kernel.hip -o kernel.ll

# Analyze
opt-18 --load-pass-plugin=build/AMDGPUKernelAnalyzer.so \
    -passes=amdgpu-kernel-analyzer \
    -amdgpu-analyzer-verbose \
    -disable-output kernel.ll

# Or use the examples Makefile:
cd examples && make analyze-matmul_bad

Output Formats

LLVM Remarks (default, machine-readable)

The pass emits standard LLVM optimization remarks, viewable with:

opt-18 --load-pass-plugin=build/AMDGPUKernelAnalyzer.so \
    -passes=amdgpu-kernel-analyzer \
    -pass-remarks=amdgpu-kernel-analyzer \
    -pass-remarks-missed=amdgpu-kernel-analyzer \
    -disable-output your_kernel.ll

Verbose stderr (human-readable)

Add -amdgpu-analyzer-verbose to print the formatted report shown under Example Output.

Project Structure

├── CMakeLists.txt               # Build system (finds LLVM 18, builds .so plugin)
├── src/
│   ├── AMDGPUKernelAnalyzer.h   # Pass class declaration
│   ├── AMDGPUKernelAnalyzer.cpp # Plugin registration + orchestration
│   ├── CoalescingAnalysis.h     # Coalescing analysis interface
│   └── CoalescingAnalysis.cpp   # Address chain walker + stride classifier
├── test/
│   ├── lit.cfg.py               # LLVM lit test configuration
│   ├── coalesced_access.ll      # Test: A[tid] patterns (should pass clean)
│   ├── uncoalesced_access.ll    # Test: A[tid*N] + gather patterns (should warn)
│   └── mixed_access.ll          # Test: mix of good + bad patterns
└── examples/
    ├── matmul_bad.hip           # Naive matmul (uncoalesced B reads)
    ├── matmul_good.hip          # Tiled matmul (all accesses coalesced)
    └── Makefile                 # HIP -> IR -> analysis pipeline

Implementation Details

Def-Use Chain Traversal

To determine the effective stride of an access, the pass traverses the def-use chain of the pointer backward from the memory instruction:

load float, ptr addrspace(1) %ptr
  ^
  |-- getelementptr float, ptr addrspace(1) %base, i64 %idx
        ^
        |-- mul i64 %tid64, 1024    (multiplier = 1024)
              ^
              |-- zext i32 %tid to i64
                    ^
                    |-- call i32 @llvm.amdgcn.workitem.id.x()    (thread-varying!)

Result: stride = 1024 * sizeof(float) = 4096B -> UNCOALESCED (high severity)

The traversal handles GetElementPtrInst, BinaryOperator (mul, add, shl, or), CastInst (zext, sext, trunc, inttoptr, bitcast), LoadInst (which signals an indirect gather access), and standard constants.
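
The backward walk can be illustrated on a toy expression tree without any LLVM dependency. The sketch below is an assumption-laden model, not the code in CoalescingAnalysis.cpp: Expr, Walk, node, constant, scale, and walk are hypothetical names, the or operator is omitted, and every non-constant leaf other than the thread ID is treated as wave-uniform.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>

// Toy stand-in for an llvm::Value; every name in this block is hypothetical.
struct Expr;
using ExprPtr = std::shared_ptr<Expr>;

struct Expr {
    enum Kind { ThreadId, Const, Uniform, Add, Mul, Shl, Cast, Load } kind;
    std::int64_t value; // payload for Const, ignored otherwise
    ExprPtr lhs, rhs;   // operands; Cast and Load use lhs only
};

ExprPtr node(Expr::Kind k, ExprPtr l = nullptr, ExprPtr r = nullptr) {
    return std::make_shared<Expr>(Expr{k, 0, std::move(l), std::move(r)});
}
ExprPtr constant(std::int64_t v) {
    return std::make_shared<Expr>(Expr{Expr::Const, v, nullptr, nullptr});
}

// Result of walking one expression: status plus the thread-ID multiplier.
struct Walk {
    enum Status { Ok, Indirect, Unknown } status;
    std::int64_t tidCoeff;                // multiplier applied to the thread ID
    std::optional<std::int64_t> constant; // set only for compile-time constants
};

// Multiply a resolved term by a known constant factor.
Walk scale(const Walk &w, std::int64_t k) {
    std::optional<std::int64_t> c;
    if (w.constant) c = *w.constant * k;
    return {Walk::Ok, w.tidCoeff * k, c};
}

Walk walk(const ExprPtr &e) {
    switch (e->kind) {
    case Expr::ThreadId: return {Walk::Ok, 1, std::nullopt};
    case Expr::Const:    return {Walk::Ok, 0, e->value};
    case Expr::Uniform:  return {Walk::Ok, 0, std::nullopt};
    case Expr::Cast:     return walk(e->lhs);                 // casts are looked through
    case Expr::Load:     return {Walk::Indirect, 0, std::nullopt};
    default: break;                                           // binary operators below
    }
    Walk a = walk(e->lhs), b = walk(e->rhs);
    if (a.status == Walk::Indirect || b.status == Walk::Indirect)
        return {Walk::Indirect, 0, std::nullopt};
    if (a.status != Walk::Ok || b.status != Walk::Ok)
        return {Walk::Unknown, 0, std::nullopt};
    switch (e->kind) {
    case Expr::Add: {
        std::optional<std::int64_t> c;
        if (a.constant && b.constant) c = *a.constant + *b.constant;
        return {Walk::Ok, a.tidCoeff + b.tidCoeff, c};
    }
    case Expr::Mul:
        if (b.constant) return scale(a, *b.constant);
        if (a.constant) return scale(b, *a.constant);
        if (a.tidCoeff == 0 && b.tidCoeff == 0) return {Walk::Ok, 0, std::nullopt};
        break; // thread-varying times non-constant: not linear in the thread ID
    case Expr::Shl:
        if (b.constant && *b.constant >= 0 && *b.constant < 63)
            return scale(a, std::int64_t{1} << *b.constant);
        break;
    default: break;
    }
    return {Walk::Unknown, 0, std::nullopt};
}
```

Building the index from the diagram as node(Expr::Mul, node(Expr::Cast, node(Expr::ThreadId)), constant(1024)) yields a thread-ID multiplier of 1024. Multiplying by the 4-byte element size of the getelementptr gives the 4096B stride reported in the result line.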

Severity Calculation

Based on RDNA 2 cache line size (128B) and wavefront width (32 threads):

Stride            Severity   Impact
= element size    (none)     Perfect coalescing
≤ 64B             low        Some cache line reuse
≤ 128B            medium     ~1 cache line per thread
> 128B            high       Each thread hits a different cache line
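
The thresholds above map directly onto a small function. The following is a sketch under stated assumptions, not the plugin's implementation: severity and linesPerWave are hypothetical names, "none" is a label chosen here for the coalesced row, and the cache line estimate assumes a 128B-aligned base address.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

constexpr std::int64_t kCacheLineBytes = 128; // RDNA 2 cache line size
constexpr std::int64_t kWaveSize = 32;        // Wave32

// Map a per-thread byte stride to the severity levels in the table.
std::string severity(std::int64_t strideBytes, std::int64_t elemBytes) {
    assert(elemBytes > 0 && strideBytes >= elemBytes);
    if (strideBytes == elemBytes) return "none";              // perfect coalescing
    if (strideBytes <= kCacheLineBytes / 2) return "low";     // <= 64B
    if (strideBytes <= kCacheLineBytes) return "medium";      // <= 128B
    return "high";                                            // > 128B
}

// Estimate how many cache lines one wavefront touches, assuming the first
// thread's address is 128B-aligned (an assumption made for this sketch).
std::int64_t linesPerWave(std::int64_t strideBytes, std::int64_t elemBytes) {
    assert(elemBytes > 0 && strideBytes >= elemBytes);
    std::int64_t spanBytes = (kWaveSize - 1) * strideBytes + elemBytes;
    std::int64_t lines = (spanBytes + kCacheLineBytes - 1) / kCacheLineBytes;
    return std::min(lines, kWaveSize); // at most one line per thread
}
```

With 4-byte elements, a 4-byte stride keeps the whole wavefront inside a single cache line, while the 4096-byte stride from the example touches 32 lines, one per thread.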

Future Work & Extensions

I am considering a few extensions to this tool. The current def-use chain walker handles straightforward indexing, but it cannot reason about addresses that change across loop iterations.

  • Handling Complex Loops: The analyzer currently traces backward through simple arithmetic only. When a memory address is computed inside a for or while loop, for example from a loop induction variable, the walker may misclassify the access or report it as [Unknown]. Integrating LLVM's ScalarEvolution analysis, which models loop-varying values mathematically, would allow access patterns to be analyzed across loop iterations.
  • Divergence Analysis: The analyzer assumes that all threads in a wavefront (32 threads under Wave32) execute a given memory instruction together. If the instruction sits behind an if/else and half the threads take the other path, only a subset of the wavefront issues the access, so the pattern the hardware sees is no longer the contiguous one the stride suggests. Plugging into LLVM's divergence analysis would let the pass detect this.
  • Shared Memory (LDS) Bank Conflicts: The pass currently analyzes coalescing for global memory only. GPUs also have fast on-chip shared memory (addrspace(3)). When multiple threads access different addresses in the same shared-memory bank at once, the accesses are serialized (a bank conflict). The analyzer could be extended to track strides in shared memory and warn developers about these conflicts.
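
As a rough illustration of the bank conflict idea, the sketch below estimates the conflict degree for a constant stride. It is not part of the analyzer: bankConflictDegree is a hypothetical name, and the layout of 32 banks that are each 4 bytes wide, accessed by 32 threads, is an assumption that should be checked against the target hardware.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>

constexpr std::int64_t kBanks = 32; // assumed bank count; each bank 4 bytes wide

// Number of threads (out of 32) that land on the same bank when thread t
// accesses 4-byte word t * wordStride. A result of 1 means no conflict.
std::int64_t bankConflictDegree(std::int64_t wordStride) {
    assert(wordStride >= 0);
    if (wordStride == 0) return 1; // all threads read one address: no conflict assumed
    // Thread t maps to bank (t * wordStride) % kBanks, so the threads spread
    // over kBanks / gcd banks and gcd threads share each of those banks.
    return std::gcd(wordStride, kBanks);
}
```

Under these assumptions, a word stride of 1 or 3 is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 32 sends all 32 threads to the same bank.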
