
GCN: use byref instead of byval+lower_byval for kernel arguments#772

Open
gbaraldi wants to merge 14 commits into JuliaGPU:master from gbaraldi:gcn-byref-kernel-args

Conversation

@gbaraldi
Member

Summary

  • Replace byval + lower_byval (FCA expansion) with byref for GCN kernel arguments
  • Add pass_by_ref interface method (defaults to false, GCN overrides to true)
  • Rewrite byref pointer parameters to addrspace(4) (kernarg) with addrspacecasts back to generic for the function body
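
A minimal sketch of the interface method described above (assumed names and signatures, for illustration only; the PR's actual code may differ):

```julia
# sketch: a per-target trait, defaulting to false and overridden for GCN
pass_by_ref(@nospecialize(job::CompilerJob)) = false                 # default
pass_by_ref(job::CompilerJob{GCNCompilerTarget}) = true              # GCN override
```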

Motivation

On AMDGPU, kernel arguments reside in the read-only kernarg segment. The current pipeline:

  1. irgen adds byval attributes to aggregate parameters
  2. lower_byval replaces ptr byval(T) with the FCA type directly
  3. The entry block extractvalues every field and stores the entire struct into an alloca in scratch memory (addrspace 5)

For large structs (e.g. Oceananigans' ImmersedBoundaryGrid — hundreds of bytes with ~28 array descriptors), this produces 78 scratch stores but only 4 scratch loads. ~95% of the stores are dead, but LLVM's DSE/SROA can't eliminate them from the massive nested aggregate.

With byref, the pointer semantics are preserved. LLVM generates s_load directly from the kernarg segment on demand, only loading fields that are actually used. The invariant.load and TBAA metadata Julia emits remain valid since kernarg memory is immutable.
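
To illustrate with a hypothetical two-field struct (simplified sketch, not actual compiler output):

```llvm
; before: lower_byval expanded the argument into a first-class aggregate,
; which gets stored field-by-field into scratch (addrspace 5)
define amdgpu_kernel void @kernel({ double, double } %s) {
  %buf = alloca { double, double }, align 8, addrspace(5)
  %f0 = extractvalue { double, double } %s, 0
  ; ... one store per field, most of them dead for large structs ...
  ret void
}

; after: byref keeps the pointer; only used fields are loaded from kernarg
define amdgpu_kernel void @kernel(ptr addrspace(4) byref({ double, double }) %s) {
  %x = load double, ptr addrspace(4) %s, align 8
  ret void
}
```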

Test plan

  • Run existing GPUCompiler GCN tests
  • Test with AMDGPU.jl on actual hardware
  • Verify the generated IR uses direct loads from kernarg instead of FCA + scratch stores
  • Test with Oceananigans mask_immersed_field! kernel to confirm reduced scratch usage

🤖 Generated with Claude Code

On AMDGPU, kernel arguments already reside in the read-only kernarg
segment. The current pipeline adds `byval` attributes and then
`lower_byval` expands them into first-class aggregates (FCAs), which
forces LLVM to extractvalue every field and store the entire struct
into scratch memory via alloca — even when only a few fields are used.
For large structs (e.g. Oceananigans' ImmersedBoundaryGrid), this
produces dozens of dead scratch stores.

Using `byref` instead keeps the pointer semantics, allowing LLVM to
generate scalar loads directly from the kernarg segment on demand.
The invariant.load and TBAA metadata that Julia emits remain valid
since the kernarg memory is immutable.

The byref pointer parameters are rewritten to addrspace(4) (AMDGPU
constant/kernarg address space), with addrspacecasts inserted so the
function body can continue using generic pointers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions bot commented Mar 18, 2026

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/gcn.jl b/src/gcn.jl
index e310b5c..2903354 100644
--- a/src/gcn.jl
+++ b/src/gcn.jl
@@ -64,7 +64,7 @@ function finish_ir!(
         # optimize after address space rewriting: propagate addrspace(4) through
         # the addrspacecast chains, then clean up newly-exposed opportunities
         tm = llvm_machine(job.config.target)
-        @dispose pb=NewPMPassBuilder() begin
+        @dispose pb = NewPMPassBuilder() begin
             add!(pb, NewPMFunctionPassManager()) do fpm
                 add!(fpm, InferAddressSpacesPass())
                 add!(fpm, SROAPass())
@@ -139,7 +139,7 @@ function add_kernarg_address_spaces!(
     # (which expects flat pointers) continues to work. The AMDGPU backend's
     # AMDGPULowerKernelArguments traces these casts and produces s_load.
     new_args = LLVM.Value[]
-    @dispose builder=IRBuilder() begin
+    @dispose builder = IRBuilder() begin
         entry_bb = BasicBlock(new_f, "conversion")
         position!(builder, entry_bb)
 
@@ -185,7 +185,7 @@ function add_kernarg_address_spaces!(
     LLVM.name!(new_f, fn)
 
     # clean up the extra conversion block
-    @dispose pb=NewPMPassBuilder() begin
+    @dispose pb = NewPMPassBuilder() begin
         add!(pb, NewPMFunctionPassManager()) do fpm
             add!(fpm, SimplifyCFGPass())
         end
diff --git a/test/gcn.jl b/test/gcn.jl
index 5b49cf5..c667cb8 100644
--- a/test/gcn.jl
+++ b/test/gcn.jl
@@ -37,121 +37,127 @@ end
     end
 end
 
-@testset "kernarg address space for byref parameters" begin
-    mod = @eval module $(gensym())
-        struct MyStruct
-            x::Float64
-            y::Float64
-        end
-
-        function kernel(s::MyStruct)
-            s.x + s.y
-            return
-        end
-    end
+        @testset "kernarg address space for byref parameters" begin
+            mod = @eval module $(gensym())
+            struct MyStruct
+                x::Float64
+                y::Float64
+            end
 
-    # byref struct params should be ptr addrspace(4) in kernel IR
-    @test @filecheck begin
-        check"CHECK: define amdgpu_kernel void @_Z6kernel8MyStruct(ptr addrspace(4)"
-        GCN.code_llvm(mod.kernel, Tuple{mod.MyStruct}; dump_module=true, kernel=true)
-    end
+            function kernel(s::MyStruct)
+                s.x + s.y
+                return
+            end
+            end
 
-    # non-kernel should NOT have addrspace(4)
-    @test @filecheck begin
-        check"CHECK-NOT: addrspace(4)"
-        GCN.code_llvm(mod.kernel, Tuple{mod.MyStruct}; dump_module=true, kernel=false)
-    end
-end
+            # byref struct params should be ptr addrspace(4) in kernel IR
+            @test @filecheck begin
+                check"CHECK: define amdgpu_kernel void @_Z6kernel8MyStruct(ptr addrspace(4)"
+                GCN.code_llvm(mod.kernel, Tuple{mod.MyStruct}; dump_module = true, kernel = true)
+            end
 
-@testset "byref attribute preserved on kernarg parameters" begin
-    mod = @eval module $(gensym())
-        struct LargeStruct
-            a::Float64
-            b::Float64
-            c::Float64
-            d::Float64
+            # non-kernel should NOT have addrspace(4)
+            @test @filecheck begin
+                check"CHECK-NOT: addrspace(4)"
+                GCN.code_llvm(mod.kernel, Tuple{mod.MyStruct}; dump_module = true, kernel = false)
+            end
         end
 
-        function kernel(s::LargeStruct, out::Ptr{Float64})
-            unsafe_store!(out, s.a + s.b + s.c + s.d)
-            return
-        end
-    end
+        @testset "byref attribute preserved on kernarg parameters" begin
+            mod = @eval module $(gensym())
+            struct LargeStruct
+                a::Float64
+                b::Float64
+                c::Float64
+                d::Float64
+            end
 
-    # the byref attribute must survive the addrspace rewrite (clone_into! can drop it)
-    @test @filecheck begin
-        check"CHECK: byref"
-        check"CHECK: addrspace(4)"
-        GCN.code_llvm(mod.kernel, Tuple{mod.LargeStruct, Ptr{Float64}};
-                       dump_module=true, kernel=true)
-    end
-end
+            function kernel(s::LargeStruct, out::Ptr{Float64})
+                unsafe_store!(out, s.a + s.b + s.c + s.d)
+                return
+            end
+            end
 
-@testset "mixed byref and scalar kernel parameters" begin
-    mod = @eval module $(gensym())
-        struct Params
-            x::Float64
-            y::Float64
+            # the byref attribute must survive the addrspace rewrite (clone_into! can drop it)
+            @test @filecheck begin
+                check"CHECK: byref"
+                check"CHECK: addrspace(4)"
+                GCN.code_llvm(
+                    mod.kernel, Tuple{mod.LargeStruct, Ptr{Float64}};
+                    dump_module = true, kernel = true
+                )
+            end
         end
 
-        function kernel(a::Float64, s::Params, out::Ptr{Float64})
-            unsafe_store!(out, a + s.x + s.y)
-            return
-        end
-    end
+        @testset "mixed byref and scalar kernel parameters" begin
+            mod = @eval module $(gensym())
+            struct Params
+                x::Float64
+                y::Float64
+            end
 
-    # scalar Float64 should NOT be in addrspace(4),
-    # only the struct byref param should be.
-    # NOTE: Ptr{Float64} is lowered to i64 on Julia ≤1.11 and ptr on Julia 1.12+.
-    @test @filecheck begin
-        check"CHECK: define amdgpu_kernel void"
-        check"CHECK-SAME: double"
-        check"CHECK-SAME: ptr addrspace(4)"
-        check"CHECK-SAME: {{(i64|ptr)}}"
-        GCN.code_llvm(mod.kernel, Tuple{Float64, mod.Params, Ptr{Float64}};
-                       dump_module=true, kernel=true)
-    end
-end
+            function kernel(a::Float64, s::Params, out::Ptr{Float64})
+                unsafe_store!(out, a + s.x + s.y)
+                return
+            end
+            end
 
-@testset "add_kernarg_address_spaces! rewrites IR correctly" begin
-    mod = @eval module $(gensym())
-        struct KernelArgs
-            x::Float64
-            y::Float64
-            z::Float64
+            # scalar Float64 should NOT be in addrspace(4),
+            # only the struct byref param should be.
+            # NOTE: Ptr{Float64} is lowered to i64 on Julia ≤1.11 and ptr on Julia 1.12+.
+            @test @filecheck begin
+                check"CHECK: define amdgpu_kernel void"
+                check"CHECK-SAME: double"
+                check"CHECK-SAME: ptr addrspace(4)"
+                check"CHECK-SAME: {{(i64|ptr)}}"
+                GCN.code_llvm(
+                    mod.kernel, Tuple{Float64, mod.Params, Ptr{Float64}};
+                    dump_module = true, kernel = true
+                )
+            end
         end
 
-        function kernel(s::KernelArgs, scale::Float64, out::Ptr{Float64})
-            unsafe_store!(out, (s.x + s.y + s.z) * scale)
-            return
-        end
-    end
+        @testset "add_kernarg_address_spaces! rewrites IR correctly" begin
+            mod = @eval module $(gensym())
+            struct KernelArgs
+                x::Float64
+                y::Float64
+                z::Float64
+            end
 
-    job, _ = GCN.create_job(mod.kernel, Tuple{mod.KernelArgs, Float64, Ptr{Float64}};
-                             kernel=true)
-    JuliaContext() do ctx
-        ir, meta = GPUCompiler.compile(:llvm, job)
+            function kernel(s::KernelArgs, scale::Float64, out::Ptr{Float64})
+                unsafe_store!(out, (s.x + s.y + s.z) * scale)
+                return
+            end
+            end
 
-        entry = meta.entry
-        ft = function_type(entry)
-        params = parameters(ft)
+            job, _ = GCN.create_job(
+                mod.kernel, Tuple{mod.KernelArgs, Float64, Ptr{Float64}};
+                kernel = true
+            )
+            JuliaContext() do ctx
+                ir, meta = GPUCompiler.compile(:llvm, job)
 
-        # the struct byref param should be ptr addrspace(4)
-        has_as4 = any(p -> p isa LLVM.PointerType && addrspace(p) == 4, params)
-        @test has_as4
+                entry = meta.entry
+                ft = function_type(entry)
+                params = parameters(ft)
 
-        # non-struct params (double, and i64/ptr for Ptr{Float64}) should NOT
-        # be in addrspace(4). Ptr{Float64} is i64 on Julia ≤1.11, ptr on 1.12+.
-        non_byref = filter(p -> !(p isa LLVM.PointerType && addrspace(p) == 4), params)
-        @test !isempty(non_byref)  # double (and i64 or ptr) params
+                # the struct byref param should be ptr addrspace(4)
+                has_as4 = any(p -> p isa LLVM.PointerType && addrspace(p) == 4, params)
+                @test has_as4
 
-        # byref attribute must be present
-        ir_str = string(ir)
-        @test occursin("byref", ir_str)
+                # non-struct params (double, and i64/ptr for Ptr{Float64}) should NOT
+                # be in addrspace(4). Ptr{Float64} is i64 on Julia ≤1.11, ptr on 1.12+.
+                non_byref = filter(p -> !(p isa LLVM.PointerType && addrspace(p) == 4), params)
+                @test !isempty(non_byref)  # double (and i64 or ptr) params
 
-        dispose(ir)
-    end
-end
+                # byref attribute must be present
+                ir_str = string(ir)
+                @test occursin("byref", ir_str)
+
+                dispose(ir)
+            end
+        end
 
 @testset "https://github.com/JuliaGPU/AMDGPU.jl/issues/846" begin
     ir, rt = GCN.code_typed((Tuple{Tuple{Val{4}}, Tuple{Float32}},); always_inline=true) do t
@@ -165,47 +171,49 @@ end
 ############################################################################################
 @testset "assembly" begin
 
-@testset "s_load for kernarg struct access" begin
-    mod = @eval module $(gensym())
-        struct MyStruct
-            x::Float64
-            y::Float64
-        end
+        @testset "s_load for kernarg struct access" begin
+            mod = @eval module $(gensym())
+            struct MyStruct
+                x::Float64
+                y::Float64
+            end
 
-        function kernel(s::MyStruct, out::Ptr{Float64})
-            unsafe_store!(out, s.x + s.y)
-            return
+            function kernel(s::MyStruct, out::Ptr{Float64})
+                unsafe_store!(out, s.x + s.y)
+                return
+            end
+            end
+
+            # struct field loads from kernarg should use s_load, not flat_load
+            @test @filecheck begin
+                check"CHECK: s_load_dwordx"
+                check"CHECK-NOT: flat_load"
+                GCN.code_native(mod.kernel, Tuple{mod.MyStruct, Ptr{Float64}}; kernel = true)
+            end
         end
-    end
 
-    # struct field loads from kernarg should use s_load, not flat_load
-    @test @filecheck begin
-        check"CHECK: s_load_dwordx"
-        check"CHECK-NOT: flat_load"
-        GCN.code_native(mod.kernel, Tuple{mod.MyStruct, Ptr{Float64}}; kernel=true)
-    end
-end
+        @testset "no scratch spills for small struct kernarg" begin
+            mod = @eval module $(gensym())
+            struct SmallStruct
+                x::Float64
+                y::Float64
+            end
 
-@testset "no scratch spills for small struct kernarg" begin
-    mod = @eval module $(gensym())
-        struct SmallStruct
-            x::Float64
-            y::Float64
-        end
+            function kernel(s::SmallStruct, out::Ptr{Float64})
+                unsafe_store!(out, s.x + s.y)
+                return
+            end
+            end
 
-        function kernel(s::SmallStruct, out::Ptr{Float64})
-            unsafe_store!(out, s.x + s.y)
-            return
+            # a small struct kernel should not need scratch memory
+            @test @filecheck begin
+                check"CHECK: .private_segment_fixed_size: 0"
+                GCN.code_native(
+                    mod.kernel, Tuple{mod.SmallStruct, Ptr{Float64}};
+                    dump_module = true, kernel = true
+                )
+            end
         end
-    end
-
-    # a small struct kernel should not need scratch memory
-    @test @filecheck begin
-        check"CHECK: .private_segment_fixed_size: 0"
-        GCN.code_native(mod.kernel, Tuple{mod.SmallStruct, Ptr{Float64}};
-                         dump_module=true, kernel=true)
-    end
-end
 
 @testset "skip scalar trap" begin
     mod = @eval module $(gensym())

gbaraldi and others added 7 commits March 18, 2026 12:02
…spacecast

The addrspacecast from addrspace(4) to addrspace(0) caused "illegal
VGPR to SGPR copy" errors because LLVM couldn't properly lower
generic pointer accesses back to the constant address space.

Instead, follow Metal's approach: load the struct from the
addrspace(4) kernarg pointer into a local alloca (addrspace 5),
and let SROA decompose it during the optimization pipeline. This
avoids the address space mismatch while still benefiting from byref
semantics — the load from addrspace(4) is a scalar load from the
kernarg segment, and SROA will eliminate dead fields.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The byref TypeAttribute may be dropped when copying attributes to the
new function with changed parameter types (ptr -> ptr addrspace(4)).
Explicitly re-add it to ensure the AMDGPU backend knows the kernarg
contains the struct data inline, not a pointer to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove manual rewrite_byref_addrspaces!. The AMDGPU backend's
AMDGPULowerKernelArguments pass already knows how to handle
ptr byref(T) on amdgpu_kernel functions — it rewrites the pointer
to load from the kernarg segment (addrspace 4) automatically.

The previous manual approaches (addrspacecast, load→alloca→store)
conflicted with the backend's own lowering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add finish_ir! for GCN that rewrites byref kernel parameters from
flat (addrspace 0) to kernarg (addrspace 4) after optimization.

Clang emits byref params as ptr addrspace(4) from the frontend,
but Julia's RemoveJuliaAddrspacesPass strips them to flat. This
causes struct field loads to use flat_load instead of s_load.

The pass creates a new function with ptr addrspace(4) parameters,
inserts addrspacecasts back to flat for the cloned IR, then runs
InferAddressSpaces to propagate addrspace(4) through all GEPs and
loads. The result is that all kernel argument struct field accesses
become s_load (scalar, cached, one per wavefront) instead of
flat_load (per-lane, address disambiguation overhead).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InferAddressSpaces needs TargetTransformInfo to determine the flat
address space (0 on AMDGPU). Without passing a TargetMachine, the
pass has no TTI and skips all promotions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CloneFunctionInto rebuilds the AttributeList from scratch using VMap.
For byref params, VMap maps old args to addrspacecast instructions
(not Arguments), so dyn_cast<Argument> fails and byref(T) is silently
dropped. The subsequent setAttributes() overwrites any attrs we set
before clone_into!.

Without byref, the backend emits global_buffer(8) metadata instead of
by_value(sizeof(T)), causing HIP to copy only 8 bytes of struct data
as a "pointer" — leading to illegal address errors at runtime.

Also remove InferAddressSpaces (rely on AMDGPULowerKernelArguments
in codegen to trace addrspacecast chains for s_load).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verify that byref struct kernel parameters are rewritten to ptr addrspace(4)
in IR and that the backend emits s_load (not flat_load) for struct field access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gbaraldi
Member Author

@vchuravy after a small session of being nerd sniped. This + JuliaGPU/AMDGPU.jl#894 does:

VERY AI AHEAD (I couldn't be bothered)

GCN Kernarg Address Space Rewriting: Performance Summary

What changed

Byref kernel parameters are now rewritten from flat (addrspace(0)) to kernarg (addrspace(4)) in finish_ir!. This allows the AMDGPU backend to emit scalar s_load instructions instead of vector flat_load for kernel argument struct field accesses.

The fix lives in add_kernarg_address_spaces! in src/gcn.jl, called from finish_ir! for kernel functions.

Why this matters

On AMDGPU, flat_load is a per-lane vector memory operation that must resolve address space at runtime and consumes VGPRs for both the address and result. s_load is a scalar operation — one load per wavefront from the read-only kernarg segment, using SGPRs instead of VGPRs. Since kernel arguments are uniform across all lanes, scalar loads are strictly better.
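
Schematically (illustrative operands only, not actual compiler output):

```asm
; scalar path: one load per wavefront, address and result in SGPRs
s_load_dwordx2 s[0:1], s[4:5], 0x10

; vector path: per-lane load, address and result both consume VGPRs
flat_load_dwordx2 v[2:3], v[0:1]
```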

Julia's RemoveJuliaAddrspacesPass strips all non-integral address spaces (including the original addrspace(11) on tracked pointers) to flat. This pass restores addrspace(4) on byref parameters after optimization so that the backend's AMDGPULowerKernelArguments can trace the provenance and emit s_load.
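
Concretely, the rewrite produces IR of roughly this shape (sketch with a hypothetical struct type; the `conversion` block name follows the suggested diff earlier in the thread):

```llvm
%MyStruct = type { double, double }

define amdgpu_kernel void @kernel(ptr addrspace(4) byref(%MyStruct) %s) {
conversion:
  ; cast back to a generic pointer so the cloned body is unchanged;
  ; AMDGPULowerKernelArguments traces this cast and emits s_load
  %s.flat = addrspacecast ptr addrspace(4) %s to ptr
  br label %entry

entry:
  %x = load double, ptr %s.flat, align 8
  ret void
}
```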

Results: 48 Oceananigans kernels on gfx942

Aggregate

| Metric | Before | After | Change |
|---|---|---|---|
| Total VGPRs | 2507 | 1435 | -42.8% |
| Total SGPRs | 1402 | 2111 | +50.6% (expected) |
| Total Scratch | 2228 B | 1644 B | -26.2% |
| Total flat_load | 899 | 0 | -100% |
| Total s_load | 39 | 919 | +880 |
| Avg Occupancy | 6.08 | 8.21 | +35% |
| Kernels with occupancy gain | — | 29/48 | 0 regressions |

Per-kernel breakdown (sorted by VGPR reduction)

| Kernel | VGPRs | Occupancy | Scratch | flat_load | s_load |
|---|---|---|---|---|---|
| compute_transport_velocities!_39 | 124→32 | 2→8 | 0→0 | 41→0 | 0→33 |
| compute_integrated_ab2_tendencies!_34 | 124→54 | 2→4 | 0→0 | 36→0 | 2→32 |
| partial_mapreduce_device_30 | 89→42 | 2→6 | 0→0 | 13→0 | 5→14 |
| partial_mapreduce_device_31 | 105→58 | 2→4 | 0→0 | 14→0 | 5→15 |
| update_split_explicit_state!_38 | 58→14 | 4→10 | 0→0 | 25→0 | 0→21 |
| mask_immersed_field!_11 | 55→17 | 4→10 | 0→0 | 13→0 | 1→12 |
| compute_barotropic_mode!_1 | 80→44 | 3→5 | 0→0 | 28→0 | 1→25 |
| mask_immersed_field!_10 | 57→21 | 4→10 | 0→0 | 13→0 | 1→12 |
| mask_immersed_field!_8 | 55→19 | 4→10 | 0→0 | 13→0 | 1→12 |
| mask_immersed_field!_9 | 55→19 | 4→10 | 0→0 | 13→0 | 1→12 |
| partial_mapreduce_device_29 | 72→37 | 3→6 | 0→0 | 10→0 | 5→12 |
| ab2_step_field!_43 | 45→13 | 5→10 | 0→0 | 12→0 | 1→11 |
| ab2_step_tracer_field!_45 | 45→13 | 5→10 | 0→0 | 12→0 | 1→11 |
| split_explicit_free_surface!_37 | 78→46 | 3→5 | 0→0 | 35→0 | 2→31 |
| partial_mapreduce_device_28 | 59→31 | 4→8 | 0→0 | 9→0 | 5→14 |
| barotropic_split_explicit_corrector!_44 | 59→32 | 4→8 | 0→0 | 32→0 | 0→26 |
| update_hydrostatic_pressure!_22 | 96→72 | 2→3 | 0→0 | 18→0 | 0→15 |
| cache_field_tendencies!_46 | 33→11 | 7→10 | 0→0 | 8→0 | 0→7 |
| cache_field_tendencies!_47 | 32→10 | 8→10 | 0→0 | 8→0 | 0→7 |
| cache_field_tendencies!_48 | 32→10 | 8→10 | 0→0 | 8→0 | 0→7 |
| split_explicit_barotropic_velocity!_36 | 66→44 | 3→5 | 0→0 | 44→0 | 3→39 |
| fill_south_and_north_halo!_24 | 37→19 | 6→10 | 0→0 | 10→0 | 0→10 |
| fill_south_and_north_halo!_6 | 37→19 | 6→10 | 0→0 | 10→0 | 0→10 |
| fill_south_and_north_halo!_13 | 41→24 | 6→10 | 0→0 | 10→0 | 0→10 |
| fill_south_and_north_halo!_19 | 37→20 | 6→10 | 0→0 | 10→0 | 0→10 |
| fill_south_and_north_halo!_2 | 41→24 | 6→10 | 0→0 | 10→0 | 0→10 |
| compute_w_from_continuity!_21 | 80→64 | 3→4 | 0→0 | 30→0 | 0→27 |
| fill_bottom_and_top_halo!_12 | 23→7 | 10→10 | 0→0 | 5→0 | 0→5 |
| fill_bottom_and_top_halo!_15 | 23→7 | 10→10 | 0→0 | 5→0 | 0→5 |
| fill_bottom_and_top_halo!_18 | 23→7 | 10→10 | 0→0 | 5→0 | 0→5 |
| fill_bottom_and_top_halo!_23 | 23→7 | 10→10 | 0→0 | 5→0 | 0→5 |
| fill_south_and_north_halo!_16 | 26→12 | 9→10 | 0→0 | 7→0 | 0→7 |
| fill_south_and_north_halo!_4 | 26→12 | 9→10 | 0→0 | 7→0 | 0→7 |
| fill_periodic_west_and_east_halo!_14 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_periodic_west_and_east_halo!_17 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_periodic_west_and_east_halo!_20 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_periodic_west_and_east_halo!_25 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_periodic_west_and_east_halo!_3 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_periodic_west_and_east_halo!_5 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_periodic_west_and_east_halo!_7 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_kernel!_35 | 12→6 | 10→10 | 0→0 | 3→0 | 1→4 |
| compute_hydrostatic_free_surface_Gc!_40 | 128→128 | 2→2 | 196→112 | 77→0 | 1→66 |
| compute_hydrostatic_free_surface_Gc!_41 | 128→128 | 2→2 | 196→112 | 77→0 | 1→66 |
| compute_hydrostatic_free_surface_Gu!_26 | 128→128 | 2→2 | 868→692 | 100→0 | 1→155 |
| compute_hydrostatic_free_surface_Gv!_27 | 128→128 | 2→2 | 968→728 | 95→0 | 1→121 |
| compute_y_bcs!_32 | 0→0 | 10→10 | 0→0 | 0→0 | 0→0 |
| compute_y_bcs!_33 | 0→0 | 10→10 | 0→0 | 0→0 | 0→0 |
| compute_y_bcs!_42 | 0→0 | 10→10 | 0→0 | 0→0 | 0→0 |

Notes on remaining register-capped kernels

The 4 compute_hydrostatic_free_surface_G{u,v,c} kernels remain at 128 VGPRs (occupancy 2). These are massive compute kernels (~18K lines IR, ~7000 FP ops, ~350 global loads each). The register pressure is intrinsic to the computation (Oceananigans turbulence closure physics), not kernarg access. They still benefit from scratch reduction (up to -240 bytes) and flat_load elimination.

SGPR increase is expected and beneficial

SGPRs increased by ~50% because scalar registers now hold uniform kernel argument data that previously consumed vector registers via flat_load. This is the intended tradeoff: SGPRs are abundant (108 limit on gfx942) and rarely limit occupancy compared to VGPRs.

Implementation details

Key bug fix: byref attribute preservation

LLVM 18's CloneFunctionInto rebuilds AttributeList from scratch. For each old argument, it looks up VMap[&OldArg] and does dyn_cast<Argument>(). For byref parameters, VMap maps to addrspacecast instructions (not Argument objects), so the cast fails and byref(T) is silently dropped. The final setAttributes() overwrites any manually-set attributes.

Without byref, the backend emits global_buffer(8) metadata instead of by_value(sizeof(T)). HIP then copies only 8 bytes (a pointer) instead of the full struct into kernarg memory, causing illegal address errors at runtime.

Fix: Parameter attributes are copied AFTER clone_into!, so they persist on top of whatever CloneFunctionInto produced.
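
Sketched with LLVM.jl-style calls (assumed API and names, for illustration; the actual implementation may differ):

```julia
# sketch: clone first, then re-apply parameter attributes on top of the
# AttributeList that CloneFunctionInto rebuilt (which dropped byref)
clone_into!(new_f, f; value_map)
for i in 1:length(parameters(f))
    for attr in collect(parameter_attributes(f, i))
        push!(parameter_attributes(new_f, i), attr)
    end
end
```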

Design decisions

  • No InferAddressSpaces pass: Running it with a TargetMachine can over-propagate addrspace(4) into pointer values loaded from structs (which should remain flat/global). Instead, we rely on the backend's AMDGPULowerKernelArguments pass, which traces addrspacecast chains during codegen.
  • SimplifyCFG cleanup only: After cloning, we run SimplifyCFGPass() to merge the conversion block into the entry block. No other optimization passes are needed.

Run InferAddressSpaces (with TargetMachine) after add_kernarg_address_spaces!
to propagate addrspace(4) through addrspacecast chains. Follow up with SROA,
InstCombine, EarlyCSE, and SimplifyCFG to clean up newly-exposed opportunities.

The earlier illegal address errors were caused by byref attribute loss in
clone_into!, not by InferAddressSpaces itself.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
src/gcn.jl Outdated
return entry
end

# Rewrite byref kernel parameters from flat (addrspace 0) to kernarg (addrspace 4).
Member

Small nit: addrspace 4 is the constant address space, not kernarg.

https://llvm.org/docs/AMDGPUUsage.html#id132

src/gcn.jl Outdated
Comment on lines +68 to +69
# Clang emits byref parameters as `ptr addrspace(4)` from the frontend, but Julia's
# RemoveJuliaAddrspacesPass strips all address spaces to flat. This pass restores the
Member

This part of the comment makes no sense: it seems to imply that Julia somehow emits addrspace(4) and then strips it. Julia doesn't know that it ought to emit addrspace(4) (which we could fix), since we already fixed alloca addrspace emission.

Member Author

The comment is half wrong: the argument is first emitted as addrspace(11), which gets stripped to 0, and then we need to rewrite it to 4.

@vchuravy left a comment (Member)

Looks generally good, but we should add more tests :)

gbaraldi and others added 5 commits March 19, 2026 09:43
- Fix addrspace(4) comment: it's the "constant" address space, not "kernarg"
- Rewrite doc comment to accurately describe the Julia → AS(11) → AS(0) → AS(4) flow
- Add InferAddressSpaces + SROA + InstCombine + EarlyCSE after kernarg rewrite
- Add tests: byref attribute preservation, mixed byref/scalar params,
  programmatic IR inspection via compile(:llvm), zero scratch spills
- Apply Runic formatting to new code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The @dispose macro on older LLVM.jl requires `pb=expr()` without spaces,
not `pb = expr()`. The Runic-suggested formatting breaks precompilation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On LLVM 16 (Julia ≤1.11) with typed pointers, add_kernarg_address_spaces!
was creating opaque `ptr addrspace(4)` via `LLVM.PointerType(4)`, which
introduced opaque pointers into a typed-pointer module. When
InferAddressSpaces then propagated these through memcpy intrinsics
(which use typed `i8 addrspace(N)*`), LLVM's verifier rejected the
type mismatch.

Fix by following Metal's pattern: use `LLVM.PointerType(eltype, 4)`
on typed-pointer contexts and the original param type as the
addrspacecast target.

Also fix tests: Ptr{Float64} is lowered to i64 on Julia ≤1.11 and
ptr on 1.12+, so use {{(i64|ptr)}} in filecheck and check for
non-byref params instead of non-AS4 pointer params.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
classify_arguments can fail post-optimization on typed-pointer LLVM
because convert(LLVMType, source_typ) may produce a different element
type than the post-optimization codegen type, hitting the assertion
at irgen.jl:322. This was triggered by the Symbols test in AMDGPU.jl.

Instead of re-classifying arguments, just check for the byref
attribute directly on each parameter — we know it's present because
irgen.jl added it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>