A custom out-of-tree LLVM pass plugin that statically analyzes GPU kernels targeting the AMDGPU backend. The pass identifies uncoalesced memory accesses, a common source of memory bandwidth bottlenecks in GPU programming.
The analyzer operates as a plugin for LLVM's opt tool, inspecting AMDGPU kernel functions for global memory access patterns. For each load and store to addrspace(1) (global memory), it performs the following:
- Traces the address computation backward through `GetElementPtr` instructions, arithmetic operations, and casts.
- Identifies thread-varying components by searching for workitem intrinsics (`llvm.amdgcn.workitem.id.*`).
- Classifies the access pattern into one of four categories:
  - [Coalesced]: Threads access contiguous memory (stride equals element size).
  - [Uncoalesced]: Threads skip memory locations (stride exceeds element size).
  - [Indirect]: The address depends on a loaded value (gather pattern, e.g., `A[B[tid]]`).
  - [Unknown]: The address computation is too complex for static resolution.
- Reports severity relative to RDNA 2 hardware parameters (128B cache lines, Wave32).
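To see why the stride matters under these hardware parameters, the following sketch counts how many distinct 128B cache lines a single Wave32 touches for a given stride. It is a simplified model and not code from the plugin: `cacheLinesPerWave` is a hypothetical helper, the base address is assumed to be cache-line aligned, and `elemBytes` is assumed to be at least 1.

```cpp
#include <cstdint>

constexpr std::uint64_t kCacheLineBytes = 128; // RDNA 2 cache line size quoted above
constexpr unsigned kWaveSize = 32;             // Wave32

// Hypothetical helper (not part of the plugin). Lane t accesses elemBytes bytes
// starting at byte offset t * strideBytes from a cache-line-aligned base.
// Because offsets grow with the lane index, lines are visited in
// non-decreasing order and can be counted without a set.
constexpr unsigned cacheLinesPerWave(std::uint64_t strideBytes,
                                     std::uint64_t elemBytes) {
    unsigned count = 0;
    std::uint64_t nextUncounted = 0; // lowest line index not counted yet
    for (unsigned lane = 0; lane < kWaveSize; ++lane) {
        std::uint64_t firstLine = (lane * strideBytes) / kCacheLineBytes;
        std::uint64_t lastLine =
            (lane * strideBytes + elemBytes - 1) / kCacheLineBytes;
        if (firstLine < nextUncounted) firstLine = nextUncounted;
        if (lastLine >= firstLine) {
            count += static_cast<unsigned>(lastLine - firstLine + 1);
            nextUncounted = lastLine + 1;
        }
    }
    return count;
}

// A[tid] with 4B elements: the whole wave fits in one cache line.
static_assert(cacheLinesPerWave(4, 4) == 1, "coalesced");
// A[tid * 1024] with 4B elements: every lane lands on its own line.
static_assert(cacheLinesPerWave(4096, 4) == 32, "uncoalesced");
```

In this model the strided case pulls in 32 cache lines (4096 bytes) to deliver 128 useful bytes, which is the kind of waste the analyzer is meant to flag.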
--- AMDGPU Kernel Analyzer: strided_access ---
[Uncoalesced] (stride=4096, severity=high) - load 4B
-> thread ID (x) with multiplier 1024
[Coalesced] - store 4B
-> thread ID (x) with multiplier 1
Summary: 1 coalesced, 1 uncoalesced, 0 indirect, 0 unknown
- Ubuntu 24.04 (or compatible)
- LLVM 18 development package (`llvm-18-dev`)
- ROCm 6.3+ (for `hipcc` and AMDGPU target support)
- CMake 3.20+
- Python 3 + `lit` (for running tests)
# Build
cmake -B build -DLLVM_DIR=/usr/lib/llvm-18/lib/cmake/llvm/
cmake --build build
# Run on LLVM IR directly
opt-18 --load-pass-plugin=build/AMDGPUKernelAnalyzer.so \
-passes=amdgpu-kernel-analyzer \
-amdgpu-analyzer-verbose \
-disable-output your_kernel.ll
# Run lit tests
pip install lit
lit test/ -v
# Compile HIP -> LLVM IR (device-only)
hipcc -c -emit-llvm -S --offload-arch=gfx1031 -O2 \
--cuda-device-only kernel.hip -o kernel.ll
# Analyze
opt-18 --load-pass-plugin=build/AMDGPUKernelAnalyzer.so \
-passes=amdgpu-kernel-analyzer \
-amdgpu-analyzer-verbose \
-disable-output kernel.ll
# Or use the examples Makefile:
cd examples && make analyze-matmul_bad
The pass emits standard LLVM optimization remarks, viewable with:
opt-18 --load-pass-plugin=build/AMDGPUKernelAnalyzer.so \
-passes=amdgpu-kernel-analyzer \
-pass-remarks=amdgpu-kernel-analyzer \
-pass-remarks-missed=amdgpu-kernel-analyzer \
-disable-output your_kernel.ll
Add `-amdgpu-analyzer-verbose` for the formatted output shown above.
├── CMakeLists.txt # Build system (finds LLVM 18, builds .so plugin)
├── src/
│ ├── AMDGPUKernelAnalyzer.h # Pass class declaration
│ ├── AMDGPUKernelAnalyzer.cpp # Plugin registration + orchestration
│ ├── CoalescingAnalysis.h # Coalescing analysis interface
│ └── CoalescingAnalysis.cpp # Address chain walker + stride classifier
├── test/
│ ├── lit.cfg.py # LLVM lit test configuration
│ ├── coalesced_access.ll # Test: A[tid] patterns (should pass clean)
│ ├── uncoalesced_access.ll # Test: A[tid*N] + gather patterns (should warn)
│ └── mixed_access.ll # Test: mix of good + bad patterns
└── examples/
├── matmul_bad.hip # Naive matmul (uncoalesced B reads)
├── matmul_good.hip # Tiled matmul (all accesses coalesced)
└── Makefile # HIP -> IR -> analysis pipeline
To determine the effective stride of an access, the pass traverses the def-use chain of the pointer backward from the memory instruction:
load float, ptr addrspace(1) %ptr
^
|-- getelementptr float, ptr %base, i64 %idx
^
|-- mul i64 %tid64, 1024 (multiplier = 1024)
^
|-- zext i32 %tid to i64
^
|-- call i32 @llvm.amdgcn.workitem.id.x() (thread-varying!)
Result: stride = 1024 * sizeof(float) = 4096B -> UNCOALESCED (high severity)
The traversal handles GetElementPtrInst, BinaryOperator (mul, add, shl, or), CastInst (zext, sext, trunc, inttoptr, bitcast), LoadInst (which signals an indirect gather access), and standard constants.
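The backward walk can be illustrated on a toy expression tree. This is a minimal sketch under stated assumptions, not the code in `CoalescingAnalysis.cpp`: `Node`, `Op`, `Index`, and `walk` are hypothetical names, the tree stands in for LLVM IR values, and `or` handling and GEP element-size scaling are left out for brevity.

```cpp
#include <cstdint>

// Toy stand-in for IR values; the real pass walks LLVM instructions instead.
enum class Op { Const, ThreadId, Add, Mul, Shl, Cast, Load, Other };

struct Node {
    Op op;
    std::int64_t value; // used only by Op::Const
    const Node* lhs;    // first operand (the only one for Cast and Load)
    const Node* rhs;    // second operand, if any
};

enum class Kind { Affine, Indirect, Unknown };

// When kind == Affine, the index equals multiplier * tid + offset.
struct Index {
    Kind kind;
    std::int64_t multiplier;
    std::int64_t offset;
};

constexpr Index walk(const Node* n) {
    if (n == nullptr) return {Kind::Unknown, 0, 0};
    switch (n->op) {
    case Op::Const:    return {Kind::Affine, 0, n->value};
    case Op::ThreadId: return {Kind::Affine, 1, 0};  // thread-varying leaf
    case Op::Cast:     return walk(n->lhs);          // look through zext/sext/...
    case Op::Load:     return {Kind::Indirect, 0, 0}; // gather, e.g. A[B[tid]]
    case Op::Add: {
        Index a = walk(n->lhs), b = walk(n->rhs);
        if (a.kind != Kind::Affine) return a; // propagate first non-affine operand
        if (b.kind != Kind::Affine) return b;
        return {Kind::Affine, a.multiplier + b.multiplier, a.offset + b.offset};
    }
    case Op::Mul: {
        Index a = walk(n->lhs), b = walk(n->rhs);
        if (a.kind != Kind::Affine) return a;
        if (b.kind != Kind::Affine) return b;
        if (b.multiplier == 0) // right side is a plain constant
            return {Kind::Affine, a.multiplier * b.offset, a.offset * b.offset};
        if (a.multiplier == 0) // left side is a plain constant
            return {Kind::Affine, b.multiplier * a.offset, b.offset * a.offset};
        return {Kind::Unknown, 0, 0}; // tid * tid is not affine
    }
    case Op::Shl: {
        Index a = walk(n->lhs), b = walk(n->rhs);
        if (a.kind != Kind::Affine) return a;
        if (b.kind != Kind::Affine) return b;
        if (b.multiplier != 0 || b.offset < 0 || b.offset > 62)
            return {Kind::Unknown, 0, 0}; // only constant, in-range shifts
        std::int64_t scale = std::int64_t{1} << b.offset;
        return {Kind::Affine, a.multiplier * scale, a.offset * scale};
    }
    default:
        return {Kind::Unknown, 0, 0};
    }
}

// The chain from the diagram above: zext(tid) * 1024, indexing 4-byte floats.
constexpr Node kTid{Op::ThreadId, 0, nullptr, nullptr};
constexpr Node kZext{Op::Cast, 0, &kTid, nullptr};
constexpr Node kC1024{Op::Const, 1024, nullptr, nullptr};
constexpr Node kIdx{Op::Mul, 0, &kZext, &kC1024};
static_assert(walk(&kIdx).multiplier == 1024, "multiplier recovered");
static_assert(walk(&kIdx).multiplier * 4 == 4096, "stride in bytes");
```

The functions are `constexpr` only so the example can check itself at compile time; a real pass would do the same walk at run time over `llvm::Value` operands.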
Severity is assigned from the stride, based on the RDNA 2 cache line size (128B) and wavefront width (32 threads):
| Stride | Severity | Impact |
|---|---|---|
| = element size | — | Perfect coalescing |
| ≤ 64B | low | Some cache line reuse |
| ≤ 128B | medium | ~1 cache line per thread |
| > 128B | high | Each thread hits a different cache line |
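The table above maps directly onto a small lookup. The sketch below is one way to write it, assuming strides are given in bytes; `Severity` and `classifySeverity` are illustrative names and may not match the plugin's own types.

```cpp
#include <cstdint>

enum class Severity { None, Low, Medium, High };

// Thresholds follow the table above. Strides the table does not mention
// (for example 0, where every lane reads the same address) simply fall
// through to the first threshold that matches; that choice is an assumption.
constexpr Severity classifySeverity(std::uint64_t strideBytes,
                                    std::uint64_t elemBytes) {
    if (strideBytes == elemBytes) return Severity::None; // perfect coalescing
    if (strideBytes <= 64) return Severity::Low;         // some line reuse
    if (strideBytes <= 128) return Severity::Medium;     // ~1 line per thread
    return Severity::High;                               // a new line per thread
}

// The strided load from the sample output: multiplier 1024 on 4B elements.
static_assert(classifySeverity(1024 * 4, 4) == Severity::High, "stride=4096");
```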
There are a few directions I'm considering for extending this tool. The current def-use chain walker is effective for straightforward indexing, but it cannot reason about addresses that change across loop iterations.
- Handling Complex Loops: Right now, the analyzer traces backward through simple arithmetic. If a memory address is calculated inside a `for` or `while` loop, the walker may be unable to resolve the stride. A natural next step is to integrate LLVM's `ScalarEvolution` analysis, which models how loop variables evolve, so we can analyze access patterns across loop iterations.
- Divergence Analysis: The analyzer assumes all threads in a wavefront (a group of 32 threads) execute the same memory instruction at the same time. But if an `if`/`else` statement sends half the threads down a different path, the memory access isn't truly contiguous anymore. Plugging into LLVM's divergence analysis would let us detect this.
- Shared Memory (LDS) Bank Conflicts: We currently analyze only global memory coalescing. However, GPUs also have fast on-chip shared memory (`addrspace(3)`). If multiple threads access the same shared memory "bank" at once, the accesses are serialized (a bank conflict). We could extend the analyzer to track strides in shared memory and warn developers about these conflicts.
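As a rough illustration of the bank-conflict idea, the sketch below counts the worst-case number of lanes that land on the same bank for a given stride. Everything here is an assumption for illustration: the bank count of 32, one 4-byte word per bank slot, and the helper name `maxLanesPerBank` are not taken from the plugin and should be checked against the ISA documentation for the target GPU.

```cpp
constexpr unsigned kNumBanks = 32; // ASSUMPTION: verify for the target GPU
constexpr unsigned kLanes = 32;    // Wave32

// Hypothetical helper. Lane t accesses word index t * strideWords, and a word
// maps to bank (index % kNumBanks). strideWords must be at least 1 so that
// every lane touches a different word; same-address accesses are out of scope.
constexpr unsigned maxLanesPerBank(unsigned strideWords) {
    unsigned hits[kNumBanks] = {};
    unsigned worst = 0;
    for (unsigned lane = 0; lane < kLanes; ++lane) {
        unsigned bank = (lane * strideWords) % kNumBanks;
        ++hits[bank];
        if (hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}

static_assert(maxLanesPerBank(1) == 1, "unit stride: no conflict");
static_assert(maxLanesPerBank(32) == 32, "stride of 32 words: all lanes collide");
```

Under this model an odd stride such as 33 words spreads the lanes across all banks again, which is why padding a shared array by one element is a common fix.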