
Unify unreachable control flow lowering across back-ends #812

Merged: maleadt merged 5 commits into main from tb/lower_unreachable on May 28, 2026


@maleadt (Member) commented May 27, 2026

Julia often emits control flow like this:

if oob:
    signal_exception(); trap; unreachable    # the abort
load arr[i]                                  # the guarded op
threadgroup_barrier()

On Metal and SPIR-V we can't actually emit the trap: Metal hangs on M1/M2, and SPIR-V doesn't have a trap. However, we can't just drop the trap either, since it's there to prevent the out-of-bounds load. Nor can we simply have the lane exit, since that would skip the threadgroup_barrier and potentially deadlock. Basically, implementing this pattern correctly needs hardware support, like PTX and GCN have.

Analyzing the actual divergence to decide whether we could jump straight from the failed bounds check to the barrier is generally impossible: the behavior can be data-dependent, and the barrier may be in another function.

Since this can't be fully fixed, I'm opting to have the lane exit the kernel and possibly deadlock at the barrier. The reasoning is that barrier deadlocks are a common issue and should be caught by the driver's watchdog, unlike the hang we're currently seeing on Metal or the crash in PoCL when loading OOB data.
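
To make the trade-off concrete, here is a minimal toy model in plain Julia (host code only, not GPU code and not part of this PR). The `Strategy` enum and the `simulate` function are hypothetical names made up for illustration, and the model assumes a barrier only completes once every lane in the group arrives, which real drivers may handle differently.

```julia
# Two ways of handling a failed bounds check when no hardware trap is available.
@enum Strategy DropTrap ExitLane

# `oob[lane]` is true when that lane's bounds check fails.
function simulate(oob::Vector{Bool}, strategy::Strategy)
    oob_loads = Int[]    # lanes that perform the out-of-bounds load
    at_barrier = Int[]   # lanes that arrive at the barrier
    for (lane, fails) in enumerate(oob)
        if fails
            # ExitLane: the lane returns from the kernel right away.
            strategy == ExitLane && continue
            # DropTrap: nothing stops the lane, so the guarded load runs anyway.
            push!(oob_loads, lane)
        end
        push!(at_barrier, lane)
    end
    # Assumption of this model: the barrier needs every lane to show up.
    barrier_completes = length(at_barrier) == length(oob)
    return (; oob_loads, barrier_completes)
end

oob = [false, true, false]
simulate(oob, DropTrap)   # lane 2 loads out of bounds, barrier completes
simulate(oob, ExitLane)   # no OOB load, but lane 2 never reaches the barrier
```

In this model, dropping the trap trades the deadlock for an out-of-bounds access, while exiting the lane does the reverse, which is the option chosen here.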

This PR unifies that handling into a single pass for both Metal and SPIR-V.

@vchuravy This may unblock PoCL since we're essentially getting rid of unreachable control flow.

maleadt and others added 5 commits May 27, 2026 22:12
Replace Metal's `replace_unreachable!` and SPIR-V's `rm_trap!` with one
target-agnostic `lower_unreachable_control_flow!` (run from `finish_ir!`):
strip device-side `llvm.trap`s and lower `unreachable` to a clean `ret`,
so a throwing lane exits without faulting the GPU (JuliaGPU/Metal.jl#433)
or emitting `OpUnreachable`. The trap stays through `optimize!` as a
`noreturn` guard (keeping the `signal_exception` mailbox write) and is
stripped post-opt. PTX/GCN unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
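
As a rough illustration of the rewrite described in the commit message above, the sketch below applies the same two rules to a toy basic block represented as a vector of symbols. This is an assumption-laden simplification: the actual `lower_unreachable_control_flow!` pass works on LLVM IR, and the function name `lower_unreachable_toy` and the symbol-based block representation are hypothetical.

```julia
# Toy model: a basic block is just a list of instruction names.
function lower_unreachable_toy(block::Vector{Symbol})
    out = Symbol[]
    for inst in block
        if inst === :trap
            continue                # strip the device-side trap
        elseif inst === :unreachable
            push!(out, :ret)        # end the block with a clean return instead
        else
            push!(out, inst)        # keep everything else, e.g. signal_exception
        end
    end
    return out
end

abort_block = [:signal_exception, :trap, :unreachable]
lower_unreachable_toy(abort_block)   # [:signal_exception, :ret]
```

Note that the exception signalling survives the rewrite; only the trap and the `unreachable` terminator change.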
maleadt added a commit to JuliaGPU/OpenCL.jl that referenced this pull request May 28, 2026
`intel_reqd_sub_group_size` is currently unconditionally set to the device's
reported subgroup size (or a heuristic default). However, the spec mentions:

> Note that there is no guarantee for the value of get_sub_group_size()
> even when this attribute is present, particularly when the work-group size
> is not evenly divisible by the required sub-group size.

Specifically, PoCL reports a subgroup count of 0 when using a work-group size
that's smaller than the chosen subgroup size:

```
julia> @opencl kernel();
get_num_sub_groups() = 1

julia> @opencl sub_group_size=32 kernel();
get_num_sub_groups() = 0
```

The output above already includes the fix from this PR, which only sets the
attribute when a subgroup size is explicitly requested. Normally, PoCL
determines an appropriate subgroup size per launch, so revert to that
by not setting the attribute by default.

This bug broke the RNG, which queries the subgroup count. FWIW, this only
surfaced on JuliaGPU/GPUCompiler.jl#812 because PoCL previously just removed
the exception trap, so the subsequent memory access happened as if there
were no OOB.
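
The "only set the attribute when explicitly requested" behavior described in this commit message could be sketched as follows. This is not the actual OpenCL.jl code: the helper `kernel_attributes` and its return type are hypothetical, and only the attribute name `intel_reqd_sub_group_size` and the `sub_group_size` keyword come from the text above.

```julia
# Hypothetical helper: collect kernel attributes as name => value pairs.
# With the default `nothing`, no subgroup attribute is emitted and the
# runtime is left to pick a subgroup size per launch.
function kernel_attributes(; sub_group_size::Union{Nothing,Int}=nothing)
    attrs = Pair{String,Int}[]
    if sub_group_size !== nothing
        push!(attrs, "intel_reqd_sub_group_size" => sub_group_size)
    end
    return attrs
end

kernel_attributes()                    # no attributes
kernel_attributes(sub_group_size=32)   # ["intel_reqd_sub_group_size" => 32]
```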
@maleadt maleadt merged commit bbc528c into main May 28, 2026
35 of 37 checks passed
@maleadt maleadt deleted the tb/lower_unreachable branch May 28, 2026 08:18