[FEA] JIT LTO Pairwise Distances #2099
…e_distance_epilog_kernel.cu.in Co-authored-by: Kyle Edwards <kyedwards@nvidia.com>
          typename IdxT>
__global__ void pairwise_matrix_arch_probe_kernel()
{
}
Added an empty kernel here because the arch check needs a pointer to a non-JIT kernel.
Walkthrough (CodeRabbit summary)
Moves pairwise-matrix distance computation into a JIT-LTO pipeline: new fragment tags, device shims and compute/epilog kernels (including RBF), planner and kernel templates, type-to-tag dispatch, runtime arch probe, base integration toggle, include fixes, and CMake kernel generation.
Changes: JIT-LTO Pairwise Distance Migration
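The "type-to-tag dispatch" item in the walkthrough can be illustrated with a small host-side sketch. This is not the PR's code: the tag structs, the `type_to_tag` trait, and the `fragment_key` naming scheme are all hypothetical names chosen for illustration. Only `compute_distance` appears in the PR text. The idea shown is the general one: map each C++ type to an empty tag type at compile time, then use the tag to select a precompiled fragment by name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical tag types; the tags actually used in the PR are not shown in this page.
struct tag_float {};
struct tag_double {};
struct tag_int32 {};
struct tag_int64 {};

// Primary template is declared but not defined, so an unsupported type
// fails at compile time instead of silently picking a wrong fragment.
template <typename T>
struct type_to_tag;

template <> struct type_to_tag<float>        { using type = tag_float; };
template <> struct type_to_tag<double>       { using type = tag_double; };
template <> struct type_to_tag<std::int32_t> { using type = tag_int32; };
template <> struct type_to_tag<std::int64_t> { using type = tag_int64; };

template <typename T>
using tag_of_t = typename type_to_tag<T>::type;

// Overloads turn a tag into a short suffix (assumed naming scheme).
inline std::string tag_name(tag_float)  { return "f32"; }
inline std::string tag_name(tag_double) { return "f64"; }
inline std::string tag_name(tag_int32)  { return "i32"; }
inline std::string tag_name(tag_int64)  { return "i64"; }

// Builds a lookup key for a fragment from the kernel name and its type tags.
template <typename DataT, typename IdxT>
std::string fragment_key(const std::string& kernel)
{
  assert(!kernel.empty());
  return kernel + "_" + tag_name(tag_of_t<DataT>{}) + "_" + tag_name(tag_of_t<IdxT>{});
}
```

Under these assumptions, `fragment_key<float, std::int64_t>("compute_distance")` yields a key that identifies the `float`/`int64_t` variant of the fragment, while a call with an unmapped type such as `char` does not compile.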
divyegala left a comment:
Two minor questions, great PR!
Refactor the sm60 dispatch to use new fragments for:
[Note 05/19]: the fused path calls PairwiseDistances directly (which in turn substitutes fragments for compute_distance). This leads to symbol lookup errors, so we need to keep the non-JIT path around for the fused reductions (discussed with @divyegala).

libcuvs.so size (CUDA 13.2): 255.92 MB -> 238.41 MB
libcuvs.so size (CUDA 12.9): 487.15 MB -> 448.81 MB
Benchmarks to check for regressions:
Hardware: H100
cold_before_ms: benchmark on main, without warmup runs
cold_after_ms: benchmark on this PR, without warmup runs
warm_before_ms: benchmark on main, after warmup runs (median over 20 runs)
warm_after_ms: benchmark on this PR, after warmup runs (median over 20 runs)
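The cold and warm columns defined above can be reproduced with a harness along these lines. This is a minimal sketch, not the script used for the PR: `median_ms`, `time_once_ms`, `BenchResult`, and `run_benchmark` are hypothetical names, the warmup count of 3 is an assumption (the PR only states that the warm numbers are a median over 20 runs), and the cold number is taken here as the very first call.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of a non-empty sample; averages the two middle values for even sizes.
inline double median_ms(std::vector<double> samples)
{
  assert(!samples.empty());
  std::sort(samples.begin(), samples.end());
  const std::size_t mid = samples.size() / 2;
  return samples.size() % 2 == 1 ? samples[mid] : 0.5 * (samples[mid - 1] + samples[mid]);
}

// Wall-clock time of a single call, in milliseconds.
template <typename Fn>
double time_once_ms(Fn&& fn)
{
  const auto start = std::chrono::steady_clock::now();
  fn();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

struct BenchResult {
  double cold_ms;  // first call, no warmup
  double warm_ms;  // median of the timed runs after warmup
};

// warmup_runs = 3 is an assumed default; timed_runs = 20 matches the PR text.
template <typename Fn>
BenchResult run_benchmark(Fn&& fn, int warmup_runs = 3, int timed_runs = 20)
{
  BenchResult result{};
  result.cold_ms = time_once_ms(fn);
  for (int i = 0; i < warmup_runs; ++i) { fn(); }
  std::vector<double> samples;
  samples.reserve(static_cast<std::size_t>(timed_runs));
  for (int i = 0; i < timed_runs; ++i) { samples.push_back(time_once_ms(fn)); }
  result.warm_ms = median_ms(samples);
  return result;
}
```

When the callable launches GPU work, it has to synchronize before returning, otherwise only the launch overhead is timed. For a JIT-LTO path the cold number is the one most likely to move, since any one-time work done on first use is included there and excluded from the warm median.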