Adaptations to GPUCompiler could follow the approach taken in JuliaGPU/CUDA.jl#3131.
Currently, AMDGPU requires a two-pass approach: the first pass feeds functions into the optimizer, and the second pass catches symbols that the runtime introduced after the optimizer ran.
Maybe CUDA's runtime doesn't introduce new references to libdevice symbols after link_libraries! runs, which would be what allows the single-pass, lazy-linking approach?
In any case, this could be reworked eventually.
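
To make the two-pass problem concrete, here is a toy sketch in plain Julia. It is not GPUCompiler or AMDGPU code: `DEVICE_LIB`, `link_pass!`, `lower_runtime!`, and the `lib_*` symbol names are all hypothetical, and a "module" is modeled as nothing more than the set of symbols it still needs resolved.

```julia
# Toy model only: none of these names are real GPUCompiler/AMDGPU APIs.
# Symbols that the (hypothetical) device library can provide.
const DEVICE_LIB = Set(["lib_sin", "lib_cos", "lib_sqrt"])

# Resolve every undefined symbol that the device library provides.
function link_pass!(undefined::Set{String}, linked::Set{String})
    for sym in collect(undefined)   # iterate over a copy, since we mutate the set
        if sym in DEVICE_LIB
            delete!(undefined, sym)
            push!(linked, sym)
        end
    end
    return undefined
end

# Stand-in for runtime lowering that happens after optimization and
# introduces a reference that did not exist during the first pass.
function lower_runtime!(undefined::Set{String})
    push!(undefined, "lib_sqrt")
    return undefined
end

undefined = Set(["lib_sin", "lib_cos"])
linked = Set{String}()

link_pass!(undefined, linked)                 # pass 1: before the optimizer runs
lower_runtime!(undefined)                     # the runtime adds a late reference
unresolved_after_one_pass = copy(undefined)   # what a single eager pass would miss
link_pass!(undefined, linked)                 # pass 2: picks up the late reference

@show unresolved_after_one_pass isempty(undefined)
```

In this model a single eager pass leaves the late reference unresolved. A single pass would only be enough if nothing like `lower_runtime!` ran after linking, or if resolution were deferred until all references exist, which is the open question about CUDA above.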