Unify unreachable control flow lowering across back-ends #812
Merged
Replace Metal's `replace_unreachable!` and SPIR-V's `rm_trap!` with one target-agnostic `lower_unreachable_control_flow!` (run from `finish_ir!`): strip device-side `llvm.trap`s and lower `unreachable` to a clean `ret`, so a throwing lane exits without faulting the GPU (JuliaGPU/Metal.jl#433) or emitting `OpUnreachable`. The trap stays through `optimize!` as a `noreturn` guard (keeping the `signal_exception` mailbox write) and is stripped post-opt. PTX/GCN unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
maleadt added a commit to JuliaGPU/OpenCL.jl that referenced this pull request May 28, 2026
`intel_reqd_sub_group_size` is currently unconditionally set to the device's reported subgroup size (or a heuristic default). However, the spec mentions:

> Note that there is no guarantee for the value of get_sub_group_size()
> even when this attribute is present, particularly when the work-group size
> is not evenly divisible by the required sub-group size.

Specifically, PoCL reports a subgroup count of 0 when using a work-group size that's smaller than the chosen subgroup size:

```
julia> @OpenCL kernel();
get_num_sub_groups() = 1

julia> @OpenCL sub_group_size=32 kernel();
get_num_sub_groups() = 0
```

The above is with the fix from this PR already, which only sets the attribute when explicitly requesting a subgroup size. Normally, PoCL determines an appropriate subgroup size per launch, so revert to that by not setting the attribute by default. This bug broke the RNG, which queries the sub group count.

FWIW, this only surfaced on JuliaGPU/GPUCompiler.jl#812, because previously the exception trap was simply removed by PoCL, resulting in the subsequent memory access simply happening as if there was no OOB.
Julia often emits control flow like this:
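A minimal sketch of such a pattern, written as plain Julia that runs on the CPU. It is an illustration under assumptions, not code taken from GPUCompiler.jl: `kernel_sketch` and its `barrier` argument are hypothetical names, with `barrier` standing in for a back-end barrier such as Metal's `threadgroup_barrier`.

```julia
# Hypothetical kernel body (illustration only, not from GPUCompiler.jl).
# `barrier` is a stand-in for a device-side barrier intrinsic.
function kernel_sketch(a::Vector{Float32}, i::Int, barrier::Function)
    if !(1 <= i <= length(a))
        # When compiled for a GPU target, this failure path reports the
        # exception, calls `trap`, and ends the block with `unreachable`.
        throw(BoundsError(a, i))
    end
    x = @inbounds a[i]   # the load that the bounds check guards
    barrier()            # every lane of the group is expected to arrive here
    return x
end
```

With an in-bounds index, the lane performs the load and reaches the barrier. With an out-of-bounds index, it takes the failure path and never reaches either.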
On Metal and SPIR-V we can't actually emit the `trap` (Metal hangs on M1/M2, SPIR-V doesn't have `trap`). However, we can't just drop the `trap` either, since it's there to avoid the out-of-bounds load. At the same time, we can't have the lane exit either, since that would skip the `threadgroup_barrier` and potentially deadlock. Basically, this pattern needs hardware support (like PTX and GCN have) to be implemented correctly.

Analyzing the actual divergence to decide whether we could jump straight from the OOB fail to the barrier is generally impossible: the behavior can be data-dependent, and the barrier may be in another function.
Given the inability to fully fix this, I'm opting for having the lane exit the kernel and possibly deadlock during the barrier. The reasoning here is that barrier deadlocks are a common issue, and should be caught by the driver's watchdog, as opposed to the current hang we're seeing in Metal, or the crash in PoCL when loading OOB data.
This PR unifies that handling into a single pass for both Metal and SPIR-V.
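To make the transformation concrete, here is a toy model of it on plain strings. This is only a sketch under assumptions: the real pass works on LLVM IR rather than strings, and `lower_unreachable_sketch` as well as the IR lines in `blocks` (including `@signal_exception` and `@barrier`) are made up for illustration.

```julia
# Toy model (hypothetical, not the real pass): a function is a vector of
# basic blocks, and each block is a vector of instruction strings.
function lower_unreachable_sketch(blocks::Vector{Vector{String}})
    map(blocks) do block
        out = String[]
        for inst in block
            if occursin("@llvm.trap", inst)
                continue                  # strip the device-side trap
            elseif strip(inst) == "unreachable"
                push!(out, "ret void")    # the throwing lane exits cleanly
            else
                push!(out, inst)          # everything else is kept as-is
            end
        end
        out
    end
end

# Illustrative input: bounds check, failure block, in-bounds block.
blocks = [
    ["%ok = icmp ult i64 %i, %len", "br i1 %ok, label %load, label %oob"],
    ["call void @signal_exception()", "call void @llvm.trap()", "unreachable"],
    ["%x = load float, ptr %p", "call void @barrier()", "ret void"],
]
lowered = lower_unreachable_sketch(blocks)
```

In this model only the failure block changes: the exception report is kept, the trap disappears, and the block ends in `ret void`, so the out-of-bounds load is still never executed by the throwing lane.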
@vchuravy This may unblock PoCL since we're essentially getting rid of unreachable control flow.