
GCN: use byref instead of byval+lower_byval for kernel arguments#772

Open
gbaraldi wants to merge 14 commits into JuliaGPU:master from gbaraldi:gcn-byref-kernel-args

Conversation

@gbaraldi
Member

Summary

  • Replace byval + lower_byval (FCA expansion) with byref for GCN kernel arguments
  • Add pass_by_ref interface method (defaults to false, GCN overrides to true)
  • Rewrite byref pointer parameters to addrspace(4) (kernarg) with addrspacecasts back to generic for the function body
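
A minimal sketch of the interface method described above (assumed names and signatures, for illustration only; the PR's actual code may differ):

```julia
# sketch: a per-target trait, defaulting to false and overridden for GCN
pass_by_ref(@nospecialize(job::CompilerJob)) = false                 # default
pass_by_ref(job::CompilerJob{GCNCompilerTarget}) = true              # GCN override
```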

Motivation

On AMDGPU, kernel arguments reside in the read-only kernarg segment. The current pipeline:

  1. irgen adds byval attributes to aggregate parameters
  2. lower_byval replaces ptr byval(T) with the FCA type directly
  3. The entry block extractvalues every field and stores the entire struct into an alloca in scratch memory (addrspace 5)

For large structs (e.g. Oceananigans' ImmersedBoundaryGrid — hundreds of bytes with ~28 array descriptors), this produces 78 scratch stores but only 4 scratch loads. ~95% of the stores are dead, but LLVM's DSE/SROA can't eliminate them from the massive nested aggregate.

With byref, the pointer semantics are preserved. LLVM generates s_load directly from the kernarg segment on demand, only loading fields that are actually used. The invariant.load and TBAA metadata Julia emits remain valid since kernarg memory is immutable.
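
To illustrate with a hypothetical two-field struct (simplified sketch, not actual compiler output):

```llvm
; before: lower_byval expanded the argument into a first-class aggregate,
; which gets stored field-by-field into scratch (addrspace 5)
define amdgpu_kernel void @kernel({ double, double } %s) {
  %buf = alloca { double, double }, align 8, addrspace(5)
  %f0 = extractvalue { double, double } %s, 0
  ; ... one store per field, most of them dead for large structs ...
  ret void
}

; after: byref keeps the pointer; only used fields are loaded from kernarg
define amdgpu_kernel void @kernel(ptr addrspace(4) byref({ double, double }) %s) {
  %x = load double, ptr addrspace(4) %s, align 8
  ret void
}
```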

Test plan

  • Run existing GPUCompiler GCN tests
  • Test with AMDGPU.jl on actual hardware
  • Verify the generated IR uses direct loads from kernarg instead of FCA + scratch stores
  • Test with Oceananigans mask_immersed_field! kernel to confirm reduced scratch usage

🤖 Generated with Claude Code

On AMDGPU, kernel arguments already reside in the read-only kernarg
segment. The current pipeline adds `byval` attributes and then
`lower_byval` expands them into first-class aggregates (FCAs), which
forces LLVM to extractvalue every field and store the entire struct
into scratch memory via alloca — even when only a few fields are used.
For large structs (e.g. Oceananigans' ImmersedBoundaryGrid), this
produces dozens of dead scratch stores.

Using `byref` instead keeps the pointer semantics, allowing LLVM to
generate scalar loads directly from the kernarg segment on demand.
The invariant.load and TBAA metadata that Julia emits remain valid
since the kernarg memory is immutable.

The byref pointer parameters are rewritten to addrspace(4) (AMDGPU
constant/kernarg address space), with addrspacecasts inserted so the
function body can continue using generic pointers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions bot commented Mar 18, 2026

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/gcn.jl b/src/gcn.jl
index e310b5c..2903354 100644
--- a/src/gcn.jl
+++ b/src/gcn.jl
@@ -64,7 +64,7 @@ function finish_ir!(
         # optimize after address space rewriting: propagate addrspace(4) through
         # the addrspacecast chains, then clean up newly-exposed opportunities
         tm = llvm_machine(job.config.target)
-        @dispose pb=NewPMPassBuilder() begin
+        @dispose pb = NewPMPassBuilder() begin
             add!(pb, NewPMFunctionPassManager()) do fpm
                 add!(fpm, InferAddressSpacesPass())
                 add!(fpm, SROAPass())
@@ -139,7 +139,7 @@ function add_kernarg_address_spaces!(
     # (which expects flat pointers) continues to work. The AMDGPU backend's
     # AMDGPULowerKernelArguments traces these casts and produces s_load.
     new_args = LLVM.Value[]
-    @dispose builder=IRBuilder() begin
+    @dispose builder = IRBuilder() begin
         entry_bb = BasicBlock(new_f, "conversion")
         position!(builder, entry_bb)
 
@@ -185,7 +185,7 @@ function add_kernarg_address_spaces!(
     LLVM.name!(new_f, fn)
 
     # clean up the extra conversion block
-    @dispose pb=NewPMPassBuilder() begin
+    @dispose pb = NewPMPassBuilder() begin
         add!(pb, NewPMFunctionPassManager()) do fpm
             add!(fpm, SimplifyCFGPass())
         end
diff --git a/test/gcn.jl b/test/gcn.jl
index 5b49cf5..c667cb8 100644
--- a/test/gcn.jl
+++ b/test/gcn.jl
@@ -37,121 +37,127 @@ end
     end
 end
 
-@testset "kernarg address space for byref parameters" begin
-    mod = @eval module $(gensym())
-        struct MyStruct
-            x::Float64
-            y::Float64
-        end
-
-        function kernel(s::MyStruct)
-            s.x + s.y
-            return
-        end
-    end
+        @testset "kernarg address space for byref parameters" begin
+            mod = @eval module $(gensym())
+            struct MyStruct
+                x::Float64
+                y::Float64
+            end
 
-    # byref struct params should be ptr addrspace(4) in kernel IR
-    @test @filecheck begin
-        check"CHECK: define amdgpu_kernel void @_Z6kernel8MyStruct(ptr addrspace(4)"
-        GCN.code_llvm(mod.kernel, Tuple{mod.MyStruct}; dump_module=true, kernel=true)
-    end
+            function kernel(s::MyStruct)
+                s.x + s.y
+                return
+            end
+            end
 
-    # non-kernel should NOT have addrspace(4)
-    @test @filecheck begin
-        check"CHECK-NOT: addrspace(4)"
-        GCN.code_llvm(mod.kernel, Tuple{mod.MyStruct}; dump_module=true, kernel=false)
-    end
-end
+            # byref struct params should be ptr addrspace(4) in kernel IR
+            @test @filecheck begin
+                check"CHECK: define amdgpu_kernel void @_Z6kernel8MyStruct(ptr addrspace(4)"
+                GCN.code_llvm(mod.kernel, Tuple{mod.MyStruct}; dump_module = true, kernel = true)
+            end
 
-@testset "byref attribute preserved on kernarg parameters" begin
-    mod = @eval module $(gensym())
-        struct LargeStruct
-            a::Float64
-            b::Float64
-            c::Float64
-            d::Float64
+            # non-kernel should NOT have addrspace(4)
+            @test @filecheck begin
+                check"CHECK-NOT: addrspace(4)"
+                GCN.code_llvm(mod.kernel, Tuple{mod.MyStruct}; dump_module = true, kernel = false)
+            end
         end
 
-        function kernel(s::LargeStruct, out::Ptr{Float64})
-            unsafe_store!(out, s.a + s.b + s.c + s.d)
-            return
-        end
-    end
+        @testset "byref attribute preserved on kernarg parameters" begin
+            mod = @eval module $(gensym())
+            struct LargeStruct
+                a::Float64
+                b::Float64
+                c::Float64
+                d::Float64
+            end
 
-    # the byref attribute must survive the addrspace rewrite (clone_into! can drop it)
-    @test @filecheck begin
-        check"CHECK: byref"
-        check"CHECK: addrspace(4)"
-        GCN.code_llvm(mod.kernel, Tuple{mod.LargeStruct, Ptr{Float64}};
-                       dump_module=true, kernel=true)
-    end
-end
+            function kernel(s::LargeStruct, out::Ptr{Float64})
+                unsafe_store!(out, s.a + s.b + s.c + s.d)
+                return
+            end
+            end
 
-@testset "mixed byref and scalar kernel parameters" begin
-    mod = @eval module $(gensym())
-        struct Params
-            x::Float64
-            y::Float64
+            # the byref attribute must survive the addrspace rewrite (clone_into! can drop it)
+            @test @filecheck begin
+                check"CHECK: byref"
+                check"CHECK: addrspace(4)"
+                GCN.code_llvm(
+                    mod.kernel, Tuple{mod.LargeStruct, Ptr{Float64}};
+                    dump_module = true, kernel = true
+                )
+            end
         end
 
-        function kernel(a::Float64, s::Params, out::Ptr{Float64})
-            unsafe_store!(out, a + s.x + s.y)
-            return
-        end
-    end
+        @testset "mixed byref and scalar kernel parameters" begin
+            mod = @eval module $(gensym())
+            struct Params
+                x::Float64
+                y::Float64
+            end
 
-    # scalar Float64 should NOT be in addrspace(4),
-    # only the struct byref param should be.
-    # NOTE: Ptr{Float64} is lowered to i64 on Julia ≤1.11 and ptr on Julia 1.12+.
-    @test @filecheck begin
-        check"CHECK: define amdgpu_kernel void"
-        check"CHECK-SAME: double"
-        check"CHECK-SAME: ptr addrspace(4)"
-        check"CHECK-SAME: {{(i64|ptr)}}"
-        GCN.code_llvm(mod.kernel, Tuple{Float64, mod.Params, Ptr{Float64}};
-                       dump_module=true, kernel=true)
-    end
-end
+            function kernel(a::Float64, s::Params, out::Ptr{Float64})
+                unsafe_store!(out, a + s.x + s.y)
+                return
+            end
+            end
 
-@testset "add_kernarg_address_spaces! rewrites IR correctly" begin
-    mod = @eval module $(gensym())
-        struct KernelArgs
-            x::Float64
-            y::Float64
-            z::Float64
+            # scalar Float64 should NOT be in addrspace(4),
+            # only the struct byref param should be.
+            # NOTE: Ptr{Float64} is lowered to i64 on Julia ≤1.11 and ptr on Julia 1.12+.
+            @test @filecheck begin
+                check"CHECK: define amdgpu_kernel void"
+                check"CHECK-SAME: double"
+                check"CHECK-SAME: ptr addrspace(4)"
+                check"CHECK-SAME: {{(i64|ptr)}}"
+                GCN.code_llvm(
+                    mod.kernel, Tuple{Float64, mod.Params, Ptr{Float64}};
+                    dump_module = true, kernel = true
+                )
+            end
         end
 
-        function kernel(s::KernelArgs, scale::Float64, out::Ptr{Float64})
-            unsafe_store!(out, (s.x + s.y + s.z) * scale)
-            return
-        end
-    end
+        @testset "add_kernarg_address_spaces! rewrites IR correctly" begin
+            mod = @eval module $(gensym())
+            struct KernelArgs
+                x::Float64
+                y::Float64
+                z::Float64
+            end
 
-    job, _ = GCN.create_job(mod.kernel, Tuple{mod.KernelArgs, Float64, Ptr{Float64}};
-                             kernel=true)
-    JuliaContext() do ctx
-        ir, meta = GPUCompiler.compile(:llvm, job)
+            function kernel(s::KernelArgs, scale::Float64, out::Ptr{Float64})
+                unsafe_store!(out, (s.x + s.y + s.z) * scale)
+                return
+            end
+            end
 
-        entry = meta.entry
-        ft = function_type(entry)
-        params = parameters(ft)
+            job, _ = GCN.create_job(
+                mod.kernel, Tuple{mod.KernelArgs, Float64, Ptr{Float64}};
+                kernel = true
+            )
+            JuliaContext() do ctx
+                ir, meta = GPUCompiler.compile(:llvm, job)
 
-        # the struct byref param should be ptr addrspace(4)
-        has_as4 = any(p -> p isa LLVM.PointerType && addrspace(p) == 4, params)
-        @test has_as4
+                entry = meta.entry
+                ft = function_type(entry)
+                params = parameters(ft)
 
-        # non-struct params (double, and i64/ptr for Ptr{Float64}) should NOT
-        # be in addrspace(4). Ptr{Float64} is i64 on Julia ≤1.11, ptr on 1.12+.
-        non_byref = filter(p -> !(p isa LLVM.PointerType && addrspace(p) == 4), params)
-        @test !isempty(non_byref)  # double (and i64 or ptr) params
+                # the struct byref param should be ptr addrspace(4)
+                has_as4 = any(p -> p isa LLVM.PointerType && addrspace(p) == 4, params)
+                @test has_as4
 
-        # byref attribute must be present
-        ir_str = string(ir)
-        @test occursin("byref", ir_str)
+                # non-struct params (double, and i64/ptr for Ptr{Float64}) should NOT
+                # be in addrspace(4). Ptr{Float64} is i64 on Julia ≤1.11, ptr on 1.12+.
+                non_byref = filter(p -> !(p isa LLVM.PointerType && addrspace(p) == 4), params)
+                @test !isempty(non_byref)  # double (and i64 or ptr) params
 
-        dispose(ir)
-    end
-end
+                # byref attribute must be present
+                ir_str = string(ir)
+                @test occursin("byref", ir_str)
+
+                dispose(ir)
+            end
+        end
 
 @testset "https://github.com/JuliaGPU/AMDGPU.jl/issues/846" begin
     ir, rt = GCN.code_typed((Tuple{Tuple{Val{4}}, Tuple{Float32}},); always_inline=true) do t
@@ -165,47 +171,49 @@ end
 ############################################################################################
 @testset "assembly" begin
 
-@testset "s_load for kernarg struct access" begin
-    mod = @eval module $(gensym())
-        struct MyStruct
-            x::Float64
-            y::Float64
-        end
+        @testset "s_load for kernarg struct access" begin
+            mod = @eval module $(gensym())
+            struct MyStruct
+                x::Float64
+                y::Float64
+            end
 
-        function kernel(s::MyStruct, out::Ptr{Float64})
-            unsafe_store!(out, s.x + s.y)
-            return
+            function kernel(s::MyStruct, out::Ptr{Float64})
+                unsafe_store!(out, s.x + s.y)
+                return
+            end
+            end
+
+            # struct field loads from kernarg should use s_load, not flat_load
+            @test @filecheck begin
+                check"CHECK: s_load_dwordx"
+                check"CHECK-NOT: flat_load"
+                GCN.code_native(mod.kernel, Tuple{mod.MyStruct, Ptr{Float64}}; kernel = true)
+            end
         end
-    end
 
-    # struct field loads from kernarg should use s_load, not flat_load
-    @test @filecheck begin
-        check"CHECK: s_load_dwordx"
-        check"CHECK-NOT: flat_load"
-        GCN.code_native(mod.kernel, Tuple{mod.MyStruct, Ptr{Float64}}; kernel=true)
-    end
-end
+        @testset "no scratch spills for small struct kernarg" begin
+            mod = @eval module $(gensym())
+            struct SmallStruct
+                x::Float64
+                y::Float64
+            end
 
-@testset "no scratch spills for small struct kernarg" begin
-    mod = @eval module $(gensym())
-        struct SmallStruct
-            x::Float64
-            y::Float64
-        end
+            function kernel(s::SmallStruct, out::Ptr{Float64})
+                unsafe_store!(out, s.x + s.y)
+                return
+            end
+            end
 
-        function kernel(s::SmallStruct, out::Ptr{Float64})
-            unsafe_store!(out, s.x + s.y)
-            return
+            # a small struct kernel should not need scratch memory
+            @test @filecheck begin
+                check"CHECK: .private_segment_fixed_size: 0"
+                GCN.code_native(
+                    mod.kernel, Tuple{mod.SmallStruct, Ptr{Float64}};
+                    dump_module = true, kernel = true
+                )
+            end
         end
-    end
-
-    # a small struct kernel should not need scratch memory
-    @test @filecheck begin
-        check"CHECK: .private_segment_fixed_size: 0"
-        GCN.code_native(mod.kernel, Tuple{mod.SmallStruct, Ptr{Float64}};
-                         dump_module=true, kernel=true)
-    end
-end
 
 @testset "skip scalar trap" begin
     mod = @eval module $(gensym())

gbaraldi and others added 7 commits March 18, 2026 12:02
…spacecast

The addrspacecast from addrspace(4) to addrspace(0) caused "illegal
VGPR to SGPR copy" errors because LLVM couldn't properly lower
generic pointer accesses back to the constant address space.

Instead, follow Metal's approach: load the struct from the
addrspace(4) kernarg pointer into a local alloca (addrspace 5),
and let SROA decompose it during the optimization pipeline. This
avoids the address space mismatch while still benefiting from byref
semantics — the load from addrspace(4) is a scalar load from the
kernarg segment, and SROA will eliminate dead fields.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The byref TypeAttribute may be dropped when copying attributes to the
new function with changed parameter types (ptr -> ptr addrspace(4)).
Explicitly re-add it to ensure the AMDGPU backend knows the kernarg
contains the struct data inline, not a pointer to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove manual rewrite_byref_addrspaces!. The AMDGPU backend's
AMDGPULowerKernelArguments pass already knows how to handle
ptr byref(T) on amdgpu_kernel functions — it rewrites the pointer
to load from the kernarg segment (addrspace 4) automatically.

The previous manual approaches (addrspacecast, load→alloca→store)
conflicted with the backend's own lowering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add finish_ir! for GCN that rewrites byref kernel parameters from
flat (addrspace 0) to kernarg (addrspace 4) after optimization.

Clang emits byref params as ptr addrspace(4) from the frontend,
but Julia's RemoveJuliaAddrspacesPass strips them to flat. This
causes struct field loads to use flat_load instead of s_load.

The pass creates a new function with ptr addrspace(4) parameters,
inserts addrspacecasts back to flat for the cloned IR, then runs
InferAddressSpaces to propagate addrspace(4) through all GEPs and
loads. The result is that all kernel argument struct field accesses
become s_load (scalar, cached, one per wavefront) instead of
flat_load (per-lane, address disambiguation overhead).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InferAddressSpaces needs TargetTransformInfo to determine the flat
address space (0 on AMDGPU). Without passing a TargetMachine, the
pass has no TTI and skips all promotions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CloneFunctionInto rebuilds the AttributeList from scratch using VMap.
For byref params, VMap maps old args to addrspacecast instructions
(not Arguments), so dyn_cast<Argument> fails and byref(T) is silently
dropped. The subsequent setAttributes() overwrites any attrs we set
before clone_into!.

Without byref, the backend emits global_buffer(8) metadata instead of
by_value(sizeof(T)), causing HIP to copy only 8 bytes of struct data
as a "pointer" — leading to illegal address errors at runtime.

Also remove InferAddressSpaces (rely on AMDGPULowerKernelArguments
in codegen to trace addrspacecast chains for s_load).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verify that byref struct kernel parameters are rewritten to ptr addrspace(4)
in IR and that the backend emits s_load (not flat_load) for struct field access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gbaraldi
Member Author

@vchuravy after a small session of being nerd sniped. This + JuliaGPU/AMDGPU.jl#894 does:

VERY AI AHEAD (I couldn't be bothered)

GCN Kernarg Address Space Rewriting: Performance Summary

What changed

Byref kernel parameters are now rewritten from flat (addrspace(0)) to kernarg (addrspace(4)) in finish_ir!. This allows the AMDGPU backend to emit scalar s_load instructions instead of vector flat_load for kernel argument struct field accesses.

The fix lives in add_kernarg_address_spaces! in src/gcn.jl, called from finish_ir! for kernel functions.

Why this matters

On AMDGPU, flat_load is a per-lane vector memory operation that must resolve address space at runtime and consumes VGPRs for both the address and result. s_load is a scalar operation — one load per wavefront from the read-only kernarg segment, using SGPRs instead of VGPRs. Since kernel arguments are uniform across all lanes, scalar loads are strictly better.
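
Schematically (illustrative operands only, not actual compiler output):

```asm
; scalar path: one load per wavefront, address and result in SGPRs
s_load_dwordx2 s[0:1], s[4:5], 0x10

; vector path: per-lane load, address and result both consume VGPRs
flat_load_dwordx2 v[2:3], v[0:1]
```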

Julia's RemoveJuliaAddrspacesPass strips all non-integral address spaces (including the original addrspace(11) on tracked pointers) to flat. This pass restores addrspace(4) on byref parameters after optimization so that the backend's AMDGPULowerKernelArguments can trace the provenance and emit s_load.
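
Concretely, the rewrite produces IR of roughly this shape (sketch with a hypothetical struct type; the `conversion` block name follows the suggested diff earlier in the thread):

```llvm
%MyStruct = type { double, double }

define amdgpu_kernel void @kernel(ptr addrspace(4) byref(%MyStruct) %s) {
conversion:
  ; cast back to a generic pointer so the cloned body is unchanged;
  ; AMDGPULowerKernelArguments traces this cast and emits s_load
  %s.flat = addrspacecast ptr addrspace(4) %s to ptr
  br label %entry

entry:
  %x = load double, ptr %s.flat, align 8
  ret void
}
```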

Results: 48 Oceananigans kernels on gfx942

Aggregate

| Metric | Before | After | Change |
|---|---|---|---|
| Total VGPRs | 2507 | 1435 | -42.8% |
| Total SGPRs | 1402 | 2111 | +50.6% (expected) |
| Total Scratch | 2228 B | 1644 B | -26.2% |
| Total flat_load | 899 | 0 | -100% |
| Total s_load | 39 | 919 | +880 |
| Avg Occupancy | 6.08 | 8.21 | +35% |
| Kernels with occupancy gain | — | 29/48 | 0 regressions |

Per-kernel breakdown (sorted by VGPR reduction)

| Kernel | VGPRs | Occupancy | Scratch | flat_load | s_load |
|---|---|---|---|---|---|
| compute_transport_velocities!_39 | 124→32 | 2→8 | 0→0 | 41→0 | 0→33 |
| compute_integrated_ab2_tendencies!_34 | 124→54 | 2→4 | 0→0 | 36→0 | 2→32 |
| partial_mapreduce_device_30 | 89→42 | 2→6 | 0→0 | 13→0 | 5→14 |
| partial_mapreduce_device_31 | 105→58 | 2→4 | 0→0 | 14→0 | 5→15 |
| update_split_explicit_state!_38 | 58→14 | 4→10 | 0→0 | 25→0 | 0→21 |
| mask_immersed_field!_11 | 55→17 | 4→10 | 0→0 | 13→0 | 1→12 |
| compute_barotropic_mode!_1 | 80→44 | 3→5 | 0→0 | 28→0 | 1→25 |
| mask_immersed_field!_10 | 57→21 | 4→10 | 0→0 | 13→0 | 1→12 |
| mask_immersed_field!_8 | 55→19 | 4→10 | 0→0 | 13→0 | 1→12 |
| mask_immersed_field!_9 | 55→19 | 4→10 | 0→0 | 13→0 | 1→12 |
| partial_mapreduce_device_29 | 72→37 | 3→6 | 0→0 | 10→0 | 5→12 |
| ab2_step_field!_43 | 45→13 | 5→10 | 0→0 | 12→0 | 1→11 |
| ab2_step_tracer_field!_45 | 45→13 | 5→10 | 0→0 | 12→0 | 1→11 |
| split_explicit_free_surface!_37 | 78→46 | 3→5 | 0→0 | 35→0 | 2→31 |
| partial_mapreduce_device_28 | 59→31 | 4→8 | 0→0 | 9→0 | 5→14 |
| barotropic_split_explicit_corrector!_44 | 59→32 | 4→8 | 0→0 | 32→0 | 0→26 |
| update_hydrostatic_pressure!_22 | 96→72 | 2→3 | 0→0 | 18→0 | 0→15 |
| cache_field_tendencies!_46 | 33→11 | 7→10 | 0→0 | 8→0 | 0→7 |
| cache_field_tendencies!_47 | 32→10 | 8→10 | 0→0 | 8→0 | 0→7 |
| cache_field_tendencies!_48 | 32→10 | 8→10 | 0→0 | 8→0 | 0→7 |
| split_explicit_barotropic_velocity!_36 | 66→44 | 3→5 | 0→0 | 44→0 | 3→39 |
| fill_south_and_north_halo!_24 | 37→19 | 6→10 | 0→0 | 10→0 | 0→10 |
| fill_south_and_north_halo!_6 | 37→19 | 6→10 | 0→0 | 10→0 | 0→10 |
| fill_south_and_north_halo!_13 | 41→24 | 6→10 | 0→0 | 10→0 | 0→10 |
| fill_south_and_north_halo!_19 | 37→20 | 6→10 | 0→0 | 10→0 | 0→10 |
| fill_south_and_north_halo!_2 | 41→24 | 6→10 | 0→0 | 10→0 | 0→10 |
| compute_w_from_continuity!_21 | 80→64 | 3→4 | 0→0 | 30→0 | 0→27 |
| fill_bottom_and_top_halo!_12 | 23→7 | 10→10 | 0→0 | 5→0 | 0→5 |
| fill_bottom_and_top_halo!_15 | 23→7 | 10→10 | 0→0 | 5→0 | 0→5 |
| fill_bottom_and_top_halo!_18 | 23→7 | 10→10 | 0→0 | 5→0 | 0→5 |
| fill_bottom_and_top_halo!_23 | 23→7 | 10→10 | 0→0 | 5→0 | 0→5 |
| fill_south_and_north_halo!_16 | 26→12 | 9→10 | 0→0 | 7→0 | 0→7 |
| fill_south_and_north_halo!_4 | 26→12 | 9→10 | 0→0 | 7→0 | 0→7 |
| fill_periodic_west_and_east_halo!_14 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_periodic_west_and_east_halo!_17 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_periodic_west_and_east_halo!_20 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_periodic_west_and_east_halo!_25 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_periodic_west_and_east_halo!_3 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_periodic_west_and_east_halo!_5 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_periodic_west_and_east_halo!_7 | 21→8 | 10→10 | 0→0 | 4→0 | 0→4 |
| fill_kernel!_35 | 12→6 | 10→10 | 0→0 | 3→0 | 1→4 |
| compute_hydrostatic_free_surface_Gc!_40 | 128→128 | 2→2 | 196→112 | 77→0 | 1→66 |
| compute_hydrostatic_free_surface_Gc!_41 | 128→128 | 2→2 | 196→112 | 77→0 | 1→66 |
| compute_hydrostatic_free_surface_Gu!_26 | 128→128 | 2→2 | 868→692 | 100→0 | 1→155 |
| compute_hydrostatic_free_surface_Gv!_27 | 128→128 | 2→2 | 968→728 | 95→0 | 1→121 |
| compute_y_bcs!_32 | 0→0 | 10→10 | 0→0 | 0→0 | 0→0 |
| compute_y_bcs!_33 | 0→0 | 10→10 | 0→0 | 0→0 | 0→0 |
| compute_y_bcs!_42 | 0→0 | 10→10 | 0→0 | 0→0 | 0→0 |

Notes on remaining register-capped kernels

The 4 compute_hydrostatic_free_surface_G{u,v,c} kernels remain at 128 VGPRs (occupancy 2). These are massive compute kernels (~18K lines IR, ~7000 FP ops, ~350 global loads each). The register pressure is intrinsic to the computation (Oceananigans turbulence closure physics), not kernarg access. They still benefit from scratch reduction (up to -240 bytes) and flat_load elimination.

SGPR increase is expected and beneficial

SGPRs increased by ~50% because scalar registers now hold uniform kernel argument data that previously consumed vector registers via flat_load. This is the intended tradeoff: SGPRs are abundant (108 limit on gfx942) and rarely limit occupancy compared to VGPRs.

Implementation details

Key bug fix: byref attribute preservation

LLVM 18's CloneFunctionInto rebuilds AttributeList from scratch. For each old argument, it looks up VMap[&OldArg] and does dyn_cast<Argument>(). For byref parameters, VMap maps to addrspacecast instructions (not Argument objects), so the cast fails and byref(T) is silently dropped. The final setAttributes() overwrites any manually-set attributes.

Without byref, the backend emits global_buffer(8) metadata instead of by_value(sizeof(T)). HIP then copies only 8 bytes (a pointer) instead of the full struct into kernarg memory, causing illegal address errors at runtime.

Fix: Parameter attributes are copied AFTER clone_into!, so they persist on top of whatever CloneFunctionInto produced.
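
Sketched with LLVM.jl-style calls (assumed API and names, for illustration; the actual implementation may differ):

```julia
# sketch: clone first, then re-apply parameter attributes on top of the
# AttributeList that CloneFunctionInto rebuilt (which dropped byref)
clone_into!(new_f, f; value_map)
for i in 1:length(parameters(f))
    for attr in collect(parameter_attributes(f, i))
        push!(parameter_attributes(new_f, i), attr)
    end
end
```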

Design decisions

  • No InferAddressSpaces pass: Running it with a TargetMachine can over-propagate addrspace(4) into pointer values loaded from structs (which should remain flat/global). Instead, we rely on the backend's AMDGPULowerKernelArguments pass, which traces addrspacecast chains during codegen.
  • SimplifyCFG cleanup only: After cloning, we run SimplifyCFGPass() to merge the conversion block into the entry block. No other optimization passes are needed.

Run InferAddressSpaces (with TargetMachine) after add_kernarg_address_spaces!
to propagate addrspace(4) through addrspacecast chains. Follow up with SROA,
InstCombine, EarlyCSE, and SimplifyCFG to clean up newly-exposed opportunities.

The earlier illegal address errors were caused by byref attribute loss in
clone_into!, not by InferAddressSpaces itself.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
src/gcn.jl Outdated
return entry
end

# Rewrite byref kernel parameters from flat (addrspace 0) to kernarg (addrspace 4).
Member

Small nit: addrspace 4 is the constant address space, not kernarg.

https://llvm.org/docs/AMDGPUUsage.html#id132

src/gcn.jl Outdated
Comment on lines +68 to +69
# Clang emits byref parameters as `ptr addrspace(4)` from the frontend, but Julia's
# RemoveJuliaAddrspacesPass strips all address spaces to flat. This pass restores the
Member

This part of the comment makes no sense: it seems to imply that Julia somehow emits addrspace(4) and then strips it. Julia doesn't know that it ought to emit addrspace(4) (which we could fix), since we already fixed alloca addrspace emission.

Member Author

The comment is half wrong: the argument is first emitted as addrspace(11), which gets stripped to 0, and then we need to rewrite it to 4.

@vchuravy left a comment (Member)

Looks generally good, but we should add more tests :)

gbaraldi and others added 5 commits March 19, 2026 09:43
- Fix addrspace(4) comment: it's the "constant" address space, not "kernarg"
- Rewrite doc comment to accurately describe the Julia → AS(11) → AS(0) → AS(4) flow
- Add InferAddressSpaces + SROA + InstCombine + EarlyCSE after kernarg rewrite
- Add tests: byref attribute preservation, mixed byref/scalar params,
  programmatic IR inspection via compile(:llvm), zero scratch spills
- Apply Runic formatting to new code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The @dispose macro on older LLVM.jl requires `pb=expr()` without spaces,
not `pb = expr()`. The Runic-suggested formatting breaks precompilation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On LLVM 16 (Julia ≤1.11) with typed pointers, add_kernarg_address_spaces!
was creating opaque `ptr addrspace(4)` via `LLVM.PointerType(4)`, which
introduced opaque pointers into a typed-pointer module. When
InferAddressSpaces then propagated these through memcpy intrinsics
(which use typed `i8 addrspace(N)*`), LLVM's verifier rejected the
type mismatch.

Fix by following Metal's pattern: use `LLVM.PointerType(eltype, 4)`
on typed-pointer contexts and the original param type as the
addrspacecast target.

Also fix tests: Ptr{Float64} is lowered to i64 on Julia ≤1.11 and
ptr on 1.12+, so use {{(i64|ptr)}} in filecheck and check for
non-byref params instead of non-AS4 pointer params.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
classify_arguments can fail post-optimization on typed-pointer LLVM
because convert(LLVMType, source_typ) may produce a different element
type than the post-optimization codegen type, hitting the assertion
at irgen.jl:322. This was triggered by the Symbols test in AMDGPU.jl.

Instead of re-classifying arguments, just check for the byref
attribute directly on each parameter — we know it's present because
irgen.jl added it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>