LLVM dialect -> WaveASM GEMM e2e #1079

Closed
Hardcode84 wants to merge 22 commits into iree-org:main from Hardcode84:llvm-asm-backend-gemm

@Hardcode84

Add GEMM op handlers to the LLVM→WaveASM translation and fix a register allocator bug.

  • Implement ~15 new LLVM dialect op handlers needed to translate a basic GEMM kernel (64x64x64, F16→F32, MFMA 16x16x16) end-to-end through the water+waveasm pipeline: LDS addressing (addressof, multi-index GEP, ds_read/ds_write), integer division (sdiv/srem for power-of-2), barriers, MFMA intrinsics, scf.for → waveasm.loop structured control flow, dense vector constants, and shufflevector element extraction.
  • Extend water.py pipeline: preserve structured control flow (remove convert-scf-to-cf), add LDS alloca-to-global transform, add cleanup passes.
  • Fix register allocator loop init-arg coalescing: when an init arg has post-loop uses, skip coalescing with the block arg to prevent the loop body from corrupting the shared register. Emit entry copies for non-coalesced init args.
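
The coalescing rule in the last bullet can be sketched as a small decision function. This is a simplified Python model, not the allocator's actual code: `LoopInitArg`, `plan_init_args`, and the SSA names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoopInitArg:
    """Hypothetical model of one loop init arg and the block arg it feeds."""
    init_value: str           # SSA name of the value passed into the loop
    block_arg: str            # SSA name of the loop-carried block argument
    has_post_loop_uses: bool  # True if init_value is read after the loop


def plan_init_args(args):
    """Return (coalesced pairs, entry copies) for a loop's init args.

    An init arg that is still read after the loop keeps its own register:
    it is not coalesced with the block arg and gets an entry copy instead.
    """
    coalesced, entry_copies = [], []
    for arg in args:
        pair = (arg.block_arg, arg.init_value)
        if arg.has_post_loop_uses:
            entry_copies.append(pair)  # copy init -> block arg at loop entry
        else:
            coalesced.append(pair)     # safe to share one register
    return coalesced, entry_copies
```

In this model, an accumulator that is dead after the loop shares a register with its block arg, while a value read again after the loop stays in its own register and is copied in on entry.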

Depends on #1078

@Hardcode84 Hardcode84 requested a review from harsh-nod March 8, 2026 21:27
Hardcode84 and others added 22 commits April 8, 2026 03:56
Tests the copy kernel through the new water_waveasm_lowering_pipeline
path (use_water_backend=True, backend="asm").  Currently compile_to_mlir
only — verifies the lowering to LLVM dialect succeeds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Introduce --waveasm-translate-from-llvm pass that creates a ProgramOp
from llvm.func kernels and strictly rejects any unhandled op. This is
the entry point for incremental LLVM dialect support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Handle the 12 LLVM/ROCDL ops needed for the copy kernel:
- llvm.mlir.constant → v_mov_b32
- rocdl.workitem.id.x → precolored v0
- llvm.sext/zext/trunc (i32↔i64) → identity on 32-bit GPU
- llvm.icmp → v_cmp_* (VCC implicit)
- llvm.select → v_cndmask_b32
- llvm.mul → v_mul_lo_u32
- rocdl.make.buffer.rsrc → map to SRD from prologue
- llvm.getelementptr (ptr<7>) → decompose into (SRD, voffset)
- llvm.load/store (ptr<7>) → buffer_load_ushort/buffer_store_short

The lit test exercises the full copy kernel path end-to-end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Implement Step 2 of water_waveasm_lowering_pipeline: invoke
waveasm-translate with --waveasm-translate-from-llvm, optimization
passes, regalloc, and --emit-assembly. Step 3 assembles and links
to HSACO via clang (skipped when compile_to_mlir=True).

Also add use_buffer_ops=True to test_copy_water_waveasm (required
for the buffer ops path).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
… final step

- TranslateFromLLVM now inserts waveasm.program inside the gpu.module
  that contains the kernel, instead of at module top-level. The gpu.module
  is preserved (not erased).
- GPUModuleToBinary is now a pure final step: expects already-optimized
  waveasm.program ops inside gpu.module, emits assembly, assembles + links
  to HSACO, and replaces each gpu.module with gpu.binary.
- The full pipeline is now externally composable as individual passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
…gram

Keep the original llvm.func in gpu.module so gpu.launch_func verification
passes, while placing waveasm.program alongside it with a mangled name
(e.g. test__waveasm). The kernel_name attribute preserves the original
name for assembly emission. GPUModuleToBinary erases everything when
replacing gpu.module with gpu.binary. Also wires up the 3-step
water-opt → waveasm-translate → water-opt pipeline and restricts the
waveasm e2e test to CDNA4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Map workgroup ID intrinsics to precolored system SGPRs, mirroring the
existing gpu.block_id handler from the MLIR-level translation path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Materialize poison values as zero immediates since they represent
undefined values with no meaningful content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When a GEP's base is another GEP result (not a direct buffer resource),
add the offsets together using v_add_u32 and propagate the SRD.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
When GEPs operate on bare pointers (!llvm.ptr) before make.buffer.rsrc,
propagate the mapper entry and accumulate the byte offset. The offset
is then added to the voffset when creating buffer GEPs on the resulting
ptr<7>.
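
The offset bookkeeping described in this commit and the previous one can be sketched as below. This is a simplified model under stated assumptions: offsets are plain Python integers in bytes rather than registers, and `BufferAddr`, `gep`, and `make_buffer_rsrc` are hypothetical names, not functions from the pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferAddr:
    """Hypothetical (SRD, voffset) pair; voffset is in bytes."""
    srd: str
    voffset: int


def gep(base, index, elem_bytes):
    """Chain a GEP onto an existing address: keep the SRD, add the offsets."""
    return BufferAddr(base.srd, base.voffset + index * elem_bytes)


def make_buffer_rsrc(srd, pending_bytes):
    """Carry the byte offset accumulated on the bare pointer into the voffset."""
    return BufferAddr(srd, pending_bytes)
```

For example, a bare-pointer offset of 16 bytes followed by two chained GEPs over 2-byte elements (indices 3 and 4) ends up as a single voffset of 16 + 6 + 8 bytes on the same SRD.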

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Add handleAdd (v_add_u32) and extract handlePoison into a proper
function matching the handleMul/handleConstant pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Use properly sized vreg types (e.g. vreg<2,2> for dwordx2) so the
assembly emitter produces register pairs like v[2:3] instead of
single v2.
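
A minimal sketch of the operand formatting this implies, assuming the emitter knows a base register index and a size in dwords. `format_vreg` is a hypothetical helper, not the emitter's API, and it does not model the `vreg<...>` type parameters themselves.

```python
def format_vreg(base, size=1):
    """Format a VGPR operand: a single register or a contiguous range."""
    if size < 1:
        raise ValueError("size must be at least 1")
    if size == 1:
        return f"v{base}"
    # Ranges are inclusive on both ends, so a 2-dword value at v2 is v[2:3].
    return f"v[{base}:{base + size - 1}]"
```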

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The SRD prologue uses s_mov_b64 to copy base pointers into SRD[0:1],
which corrupts word 1 bits [31:16] (stride/swizzle) with pointer bits.
Additionally, SRD[3] was hardcoded to 0x20000 instead of using the
actual flags from make.buffer.rsrc. Now handleMakeBufferRsrc patches
the SRD: clears stride bits in word 1 and overwrites flags in word 3.
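
The patch amounts to two bit operations on the four-word descriptor. The sketch below models them on plain integers. `patch_srd` and `STRIDE_MASK` are hypothetical names, and the flag value used in any call is arbitrary; the real flags come from make.buffer.rsrc.

```python
STRIDE_MASK = 0xFFFF0000  # word 1 bits [31:16], per the commit message


def patch_srd(srd, flags):
    """Return a copy of a 4-word SRD with stride bits cleared and flags set."""
    if len(srd) != 4:
        raise ValueError("an SRD has four 32-bit words")
    patched = list(srd)
    patched[1] &= ~STRIDE_MASK & 0xFFFFFFFF  # keep only the low 16 bits of word 1
    patched[3] = flags & 0xFFFFFFFF          # overwrite the previously hardcoded word 3
    return patched
```

Words 0 and 2 pass through unchanged, so only the two fields named in the commit are touched.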

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
AMDGPU only allocates system SGPRs for enabled workgroup IDs, so if
only workgroup_id_y is used (not x), the hardware puts y at the first
system SGPR slot — not the second. Our getWorkgroupIdSgprIndex was
unconditionally using base+dimension, reading garbage from the wrong
SGPR. Fix by always enabling all three workgroup IDs when any is used,
matching the real LLVM backend behavior.
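
The slot assignment described above can be modeled as follows. This is a sketch of the packing rule as the commit states it, not code from the pass; `workgroup_id_sgpr_index` here is a hypothetical Python stand-in and `base` is an assumed index of the first system SGPR.

```python
def workgroup_id_sgpr_index(base, dim, enabled):
    """Index of the system SGPR holding workgroup ID `dim` (0=x, 1=y, 2=z).

    Only enabled IDs occupy a slot, so the slot is the number of enabled
    dimensions that come before `dim`.
    """
    if not enabled[dim]:
        raise ValueError("workgroup ID dimension is not enabled")
    return base + sum(1 for d in range(dim) if enabled[d])
```

With only y enabled, y sits at `base`; with all three enabled, the result equals `base + dim`, which is why enabling all three makes the simple indexing correct.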

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
This shape generates vector constants for bounds checking which
require a scalarization pass not yet implemented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Non-pointer kernel arguments (e.g., dynamic dimension sizes) are now
mapped to their preloaded SGPR positions instead of being treated as
buffer pointers. Adds i64→i32 truncation for 32-bit VALU ops and
constant bus violation handling (SGPR→VGPR moves). Also adds
test_dynamic_copy_water_waveasm e2e test with proper xfail markers.
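
One way to picture the constant bus handling is the sketch below. It is an illustration under assumptions, not the pass's algorithm: operands are plain strings where a leading `s` marks an SGPR, `enforce_constant_bus` and the `vtmp` names are hypothetical, and the limit is a parameter because it depends on the target (the default of 1 is an assumption).

```python
def enforce_constant_bus(operands, limit=1):
    """Return (moves, operands) so at most `limit` distinct SGPRs are read.

    Extra SGPR operands are replaced by temporaries, and `moves` lists the
    (vgpr, sgpr) copies to emit before the instruction.
    """
    kept = []    # distinct SGPRs allowed to stay as direct operands
    moved = {}   # sgpr -> temporary VGPR it was copied to
    moves = []
    result = []
    for op in operands:
        if not op.startswith("s") or op in kept:
            result.append(op)
        elif len(kept) < limit:
            kept.append(op)
            result.append(op)
        else:
            if op not in moved:
                moved[op] = f"vtmp{len(moved)}"
                moves.append((moved[op], op))
            result.append(moved[op])
    return moves, result
```

Reusing the same SGPR twice counts once against the limit in this model, so only a second distinct SGPR triggers a move.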

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Documents the planned approach for type legalization (i64→i32),
register placement (SGPR/VGPR), and constant bus enforcement
using generic pseudo-ops and dedicated post-translation passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Replace direct SALU/VALU emission in TranslateFromLLVM with generic
arith pseudo-ops (arith.add, arith.mul, arith.cmp, arith.select,
arith.trunc, arith.sext, arith.zext). The arith legalization pass
handles register file selection, constant bus enforcement, and i64
narrowing as a separate pipeline stage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Implement the missing LLVM dialect op handlers needed to translate a
basic GEMM kernel (64×64×64, F16→F32, MFMA 16×16×16) through the
water+waveasm pipeline. The kernel now compiles end-to-end to a GPU
binary (runtime correctness is a follow-up).

New handlers:
- addressof: LDS global → VGPR zero base offset, records LDS size
- sdiv/srem: power-of-2 via v_ashrrev_i32 / v_and_b32
- fence: no-op (implicit via barrier/waitcnt)
- barrier: both rocdl.barrier and rocdl.s.barrier → s_barrier
- shufflevector: single-element extract via waveasm.extract
- mfma_f32_16x16x16_f16: emit v_mfma with 4-wide VReg accumulator
- scf.for/yield: translate to waveasm.loop + condition (do-while)
- LDS load/store: ds_read/ds_write dispatch by access width
- dense vector constant: splat init for MFMA accumulators
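
The sdiv/srem lowering in the list above relies on a shift and a mask. The sketch below shows the arithmetic on Python integers; `sdiv_pow2` and `srem_pow2` are hypothetical names. The commit does not say whether negative dividends get a correction step, so the sketch only claims agreement with truncating division for non-negative dividends.

```python
def _check_pow2(divisor):
    """Reject anything that is not a positive power of two."""
    if divisor <= 0 or divisor & (divisor - 1):
        raise ValueError("divisor must be a positive power of two")


def sdiv_pow2(x, divisor):
    """Divide by a power of two with an arithmetic right shift."""
    _check_pow2(divisor)
    shift = divisor.bit_length() - 1
    return x >> shift  # Python's >> on int is an arithmetic shift


def srem_pow2(x, divisor):
    """Remainder by a power of two with a bitwise AND."""
    _check_pow2(divisor)
    return x & (divisor - 1)
```

For a negative dividend the shift rounds toward negative infinity, so `sdiv_pow2(-1, 16)` gives -1 where LLVM's `sdiv` would give 0. A lowering that must handle negative values needs an extra adjustment, which this sketch omits.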

GEP handling extended for LDS (ptr<3>): multi-index GEPs with
constant indices, constant-attr byte offsets, all-zero passthrough.

Pipeline changes (water.py):
- Remove convert-scf-to-cf (preserve structured control flow)
- Add water-alloc-to-alloca + alloca_to_global transform for LDS
- Add water-drop-transform-ops + symbol-dce cleanup

Includes a pytest e2e test (test_gemm_waveasm) that exercises the
full pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Cover the new op handlers added for GEMM translation:
- LDS: addressof, multi-index GEP, ds_read_b32, ds_write_b32
- sdiv/srem: power-of-2 via v_ashrrev_i32 / v_and_b32
- barrier: rocdl.barrier, rocdl.s.barrier, llvm.fence (no-op)
- MFMA: dense vector constant, v_mfma_f32_16x16x16_f16, extract
- SCF for: waveasm.loop + condition with IV increment
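
The last item, a loop plus condition with an IV increment, corresponds to a do-while shape. The sketch below compares that shape with ordinary for-loop semantics. Both functions are hypothetical models, assume a positive step, and say nothing about whether the real translation guards the zero-trip case.

```python
def run_for(lb, ub, step, init, body):
    """Reference for-loop semantics: the bound test precedes every iteration."""
    acc = init
    for iv in range(lb, ub, step):
        acc = body(iv, acc)
    return acc


def run_do_while(lb, ub, step, init, body):
    """Do-while form: run the body, increment the IV, then test the bound."""
    iv, acc = lb, init
    while True:
        acc = body(iv, acc)
        iv += step
        if not iv < ub:
            break
    return acc
```

The two agree whenever the loop runs at least once. They differ when `lb >= ub`: the for-loop form skips the body while the do-while form still executes it once, so a do-while lowering depends on a known non-zero trip count or a guard.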

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
The rebase introduced TranslateFromLLVM.cpp (branch stub) alongside
TranslateFromLLVMDialect.cpp (main's evolved implementation). Merge
the missing handlers from the stub into the canonical file:

- Dense vector constants and vector poison (MFMA accumulator init)
- LDS: addressof, ptr<3> GEPs, ds_read/ds_write routing
- scf.for -> waveasm.loop translation
- rocdl.barrier / llvm.fence
- llvm.sdiv/srem (power-of-2 constant)
- llvm.shufflevector (single-element extract)
- rocdl.mfma.f32.16x16x16f16
- Multi-index and all-zero GEP support

Update lit tests for arith pseudo-op output (s_mov_b32 proper ops
instead of raw strings, arith.add instead of v_add_u32).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
@Hardcode84 Hardcode84 force-pushed the llvm-asm-backend-gemm branch from 8f8ee32 to 11befd5 Compare April 8, 2026 12:26
@Hardcode84 Hardcode84 closed this Apr 9, 2026