
Add PTXFDivFastPass to lower fdiv fast to NVPTX approximate division #800

Merged: maleadt merged 2 commits into main from vc/ptx_fast_div on May 20, 2026

@vchuravy
Member

The overarching goal is to move fast-math handling from CUDA.jl into the GPUCompiler backend.

The LLVM NVPTX backend handles fdiv fast for Float32 (→ div.approx.ftz.f32)
but has no fast path for Float64. This IR-level pass covers both:

  • Float32: replaces fdiv with __nv_fast_fdividef (libdevice)
  • Float64: replaces fdiv with rcp.approx.ftz.d + Newton refinement,
    matching CUDA.jl's inv_fast(::Float64) algorithm

The pass fires when the instruction carries the afn fast-math flag (set by
@fastmath) or when target.fastmath=true. It follows the NVVMReflectPass
pattern already in ptx.jl.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
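
For readers unfamiliar with the Float64 strategy described above, here is a minimal numerical sketch of "approximate reciprocal plus Newton-style refinement". It is not the pass itself and not CUDA.jl's code. `approx_rcp` is a hypothetical stand-in that imitates a low-precision hardware reciprocal by discarding the low 32 bits of `1/x`. The refinement shape (one cubic step) and plain arithmetic in place of fused multiply-adds are likewise assumptions made for illustration.

```python
import struct


def approx_rcp(x: float) -> float:
    """Hypothetical stand-in for a hardware approximate reciprocal.

    Computes 1/x and zeroes the low 32 bits of the result, leaving roughly
    20 mantissa bits. Real hardware differs in detail, for example by
    flushing subnormals to zero.
    """
    bits = struct.unpack("<Q", struct.pack("<d", 1.0 / x))[0]
    bits &= 0xFFFFFFFF00000000  # keep sign, exponent, top 20 mantissa bits
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def fast_div_f64(a: float, b: float) -> float:
    """Approximate a / b as a * refined_reciprocal(b)."""
    r = approx_rcp(b)   # r = (1/b) * (1 - d), with |d| below about 1e-6
    e = 1.0 - b * r     # residual e ~ d (a GPU would use fma here)
    e = e + e * e       # e + e^2
    r = r + e * r       # r * (1 + e + e^2) = (1/b) * (1 - d^3)
    return a * r


if __name__ == "__main__":
    print(approx_rcp(3.0), fast_div_f64(1.0, 3.0), 1.0 / 3.0)
```

Because the relative error drops from d to d^3 in one step, a single refinement is enough to take a ~20-bit estimate to roughly full double precision in this sketch, up to a few rounding errors.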

@codecov

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 96.22642% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.69%. Comparing base (ea44b77) to head (8507a1c).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/ptx.jl | 96.22% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #800      +/-   ##
==========================================
+ Coverage   75.41%   75.69%   +0.27%     
==========================================
  Files          25       25              
  Lines        3930     3983      +53     
==========================================
+ Hits         2964     3015      +51     
- Misses        966      968       +2     


@maleadt maleadt marked this pull request as ready for review May 20, 2026 09:45
@maleadt
Member

maleadt commented May 20, 2026

Rebased and added a Float32 path that plays nicely with #804, so that we can avoid these overrides in CUDA.jl.

Drop the `target.fastmath` check (`apply_fastmath!` stamps `afn`
already), and emit NVPTX intrinsics directly so the f32 rewrite doesn't
depend on libdevice being linked. f32 picks the FTZ variant from the
function's `denormal-fp-math-f32` attribute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
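
As an illustration of the FTZ selection described in the commit message above, the sketch below maps a `denormal-fp-math-f32` attribute string to an intrinsic name. It is a guess at the shape of the logic, not the code from `ptx.jl`. The helper names are hypothetical. The rule that only the output mode matters, and that `preserve-sign` selects FTZ, is an assumption. The intrinsic names `llvm.nvvm.div.approx.ftz.f` and `llvm.nvvm.div.approx.f` are what the sketch assumes the pass emits.

```python
def denormal_output_mode(attr: str) -> str:
    """Return the output half of an "output,input" denormal-mode string.

    A single value applies to both halves; an empty or missing attribute
    is treated as "ieee" (assumption for this sketch).
    """
    if not attr:
        return "ieee"
    return attr.split(",")[0].strip()


def pick_f32_div_intrinsic(attr: str) -> str:
    """Choose the approximate f32 division intrinsic for a function.

    Hypothetical helper: uses the flush-to-zero variant only when the
    function's output denormal mode is "preserve-sign".
    """
    ftz = denormal_output_mode(attr) == "preserve-sign"
    return "llvm.nvvm.div.approx.ftz.f" if ftz else "llvm.nvvm.div.approx.f"


if __name__ == "__main__":
    for attr in ("preserve-sign,preserve-sign", "ieee,ieee", ""):
        print(repr(attr), "->", pick_f32_div_intrinsic(attr))
```

Keying the choice on a per-function attribute rather than a global target flag fits the commit's stated aim of dropping the `target.fastmath` check.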
@maleadt maleadt force-pushed the vc/ptx_fast_div branch from ca00b0f to 8507a1c Compare May 20, 2026 10:00
@maleadt maleadt merged commit 6c8303e into main May 20, 2026
71 of 73 checks passed
@maleadt maleadt deleted the vc/ptx_fast_div branch May 20, 2026 11:09
maleadt added a commit to JuliaGPU/CUDA.jl that referenced this pull request May 21, 2026
Building on JuliaGPU/GPUCompiler.jl#805, JuliaGPU/GPUCompiler.jl#804, JuliaGPU/GPUCompiler.jl#800, avoid some of the uses of `libdevice`'s intrinsics, instead emitting vanilla LLVM IR and having GPUCompiler.jl post-process it into what we need in PTX. This has many advantages, including (potentially) better optimization, compatibility with LLVM tools like Enzyme, etc.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>