Add PTXFDivFastPass to lower fdiv fast to NVPTX approximate division by vchuravy · Pull Request #800 · JuliaGPU/GPUCompiler.jl

vchuravy · 2026-05-19T13:20:36Z

The overarching goal is to move fast-math handling from CUDA.jl into the GPUCompiler backend.

The LLVM NVPTX backend handles fdiv fast for Float32 (→ div.approx.ftz.f32)
but has no fast path for Float64. This IR-level pass covers both:

- Float32: replaces fdiv with __nv_fast_fdividef (libdevice)
- Float64: replaces fdiv with rcp.approx.ftz.d + Newton refinement, matching CUDA.jl's inv_fast(::Float64) algorithm
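
The Float64 idea can be tried on the CPU. The sketch below is not the pass itself: `approx_rcp` is a hypothetical stand-in for `rcp.approx.ftz.d` that assumes the instruction yields a reciprocal with the low 32 bits cleared, and `refined_rcp` applies one correction step built from fused multiply-adds. Whether the pass emits exactly this sequence is an assumption.

```julia
# Hypothetical CPU stand-in for rcp.approx.ftz.d (assumption): compute 1/x and
# clear the low 32 bits, leaving roughly 20 bits of mantissa.
function approx_rcp(x::Float64)
    bits = reinterpret(UInt64, 1.0 / x)
    return reinterpret(Float64, bits & 0xffffffff00000000)
end

# One refinement step. With r = (1/x)(1 - d), e = 1 - x*r equals d, and
# r * (1 + e + e^2) = (1/x)(1 - d^3), so the error drops from d to d^3.
function refined_rcp(x::Float64)
    r = approx_rcp(x)
    e = fma(r, -x, 1.0)   # e = 1 - x*r
    e = fma(e, e, e)      # e = e + e^2
    return fma(e, r, r)   # r + r*(e + e^2)
end

# Division expressed as multiplication by the refined reciprocal.
fast_div(a::Float64, b::Float64) = a * refined_rcp(b)
```

Calling `fast_div(1.0, 3.0)` and comparing it with `1.0 / 3.0` shows how much of the lost precision the single step recovers; special values (zero, infinities, subnormals) are not handled in this sketch.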

The pass fires when the instruction carries the afn fast-math flag (set by
@fastmath) or when target.fastmath=true. It follows the NVVMReflectPass
pattern already in ptx.jl.
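
To see where the fast-math flags come from on the Julia side, a minimal host-side sketch follows. It only inspects CPU IR with `code_llvm`; the function name `fdiv_fast` is made up for illustration, and the device pipeline that this pass runs in is not exercised here.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Division written under @fastmath; the macro rewrites `/` to
# Base.FastMath.div_fast.
fdiv_fast(a::Float32, b::Float32) = @fastmath a / b

# Print the host LLVM IR and look for fast-math flags on the fdiv instruction.
code_llvm(stdout, fdiv_fast, (Float32, Float32))
```

The flags on the printed `fdiv` are what an IR-level pass like this one would key on; the exact text of the IR varies across Julia and LLVM versions.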

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>



codecov · 2026-05-20T09:44:47Z

Codecov Report

❌ Patch coverage is 96.22642% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.69%. Comparing base (ea44b77) to head (8507a1c).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/ptx.jl | 96.22% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #800      +/-   ##
==========================================
+ Coverage   75.41%   75.69%   +0.27%     
==========================================
  Files          25       25              
  Lines        3930     3983      +53     
==========================================
+ Hits         2964     3015      +51     
- Misses        966      968       +2


maleadt · 2026-05-20T09:47:23Z

Rebased and added a Float32 path that plays nicely with #804, so that we can avoid these overrides in CUDA.jl.

Drop the `target.fastmath` check (`apply_fastmath!` stamps `afn` already), and emit NVPTX intrinsics directly so the f32 rewrite doesn't depend on libdevice being linked. f32 picks the FTZ variant from the function's `denormal-fp-math-f32` attribute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
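
The FTZ selection described in this commit could look roughly like the sketch below. `div_intrinsic_name` is a hypothetical helper, not GPUCompiler.jl API; the intrinsic names and the rule that only `preserve-sign` selects flushing are assumptions made for illustration.

```julia
# Hypothetical helper (not GPUCompiler.jl API): choose an approximate-division
# intrinsic from the value of a function's "denormal-fp-math-f32" attribute.
function div_intrinsic_name(denormal_mode::Union{Nothing,AbstractString})
    # The attribute value is "output" or "output,input"; a missing attribute
    # is treated as IEEE behavior here.
    output_mode = denormal_mode === nothing ? "ieee" : first(split(denormal_mode, ','))
    # Assumption: only "preserve-sign" selects the flush-to-zero variant.
    return output_mode == "preserve-sign" ? "llvm.nvvm.div.approx.ftz.f" :
                                            "llvm.nvvm.div.approx.f"
end
```

For example, `div_intrinsic_name("preserve-sign,preserve-sign")` picks the FTZ name, while `div_intrinsic_name(nothing)` falls back to the non-FTZ one.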

Building on JuliaGPU/GPUCompiler.jl#805, JuliaGPU/GPUCompiler.jl#804, JuliaGPU/GPUCompiler.jl#800, avoid some of the uses of `libdevice`'s intrinsics, instead emitting vanilla LLVM IR and having GPUCompiler.jl post-process it into what we need in PTX. This has many advantages, including (potentially) better optimization, compatibility with LLVM tools like Enzyme, etc.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vchuravy mentioned this pull request May 20, 2026

Add a pass to apply fastmath attributes. #804

Merged

maleadt force-pushed the vc/ptx_fast_div branch from 5904860 to ca00b0f Compare May 20, 2026 09:43

maleadt marked this pull request as ready for review May 20, 2026 09:45

maleadt force-pushed the vc/ptx_fast_div branch from ca00b0f to 8507a1c Compare May 20, 2026 10:00

maleadt merged commit 6c8303e into main May 20, 2026
71 of 73 checks passed

maleadt deleted the vc/ptx_fast_div branch May 20, 2026 11:09

This was referenced May 20, 2026

Reduce usage of libdevice, relying more on LLVM JuliaGPU/CUDA.jl#3149

Merged

PTX: add PTXRSqrtFastPass to fold afn 1/sqrt(x) to nvvm.rsqrt.approx #807

Merged

maleadt merged 2 commits into main from vc/ptx_fast_div

2 participants