
Add @nospecialize to mapreduce dispatch chain to cut compile time#66

Merged
lkdvos merged 2 commits into main from ld-nospecialize
Jun 19, 2026

@lkdvos lkdvos commented Jun 18, 2026

Member

This PR adds @nospecialize to f/op/initop throughout the strided mapreduce dispatch chain (map/map!/mapreducedim!/_mapreduce and the bookkeeping and threading helpers), so that this upstream code compiles once per (M, N, eltype) regardless of the function types.
The actual @generated _mapreduce_kernel! stays fully specialized.
Net effect: far fewer redundant compilations, identical steady-state runtime.

Validation

All measured on Julia 1.12.6.

TensorOperations compile time — lower

Fresh process, benchmark env with Strided developed to this branch vs. main.
Cold inference+codegen summed over a representative @tensor workload
(rank-2…5 contractions, permutations, tensoradd, over Float64 and ComplexF64)
via sum(Base.@timed(...).compile_time):

| metric | baseline (main) | this branch | change |
|---|---|---|---|
| @tensor workload compile | 30.31 s | 25.57 s | −15.6 % |
| many-op synthetic compile (8 distinct map fns × 8 reduce ops × N=2..5) | 14.87 s | 11.18 s | −24.8 % |

The "many-op" workload mimics the real combinatorial driver (many distinct
closure types), where the win is largest; the end-to-end @tensor number is the
fraction of that a TensorOperations user actually pays.
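
For readers who want to reproduce this kind of measurement, the following is a minimal sketch of summing `compile_time` over a set of cold calls. The `workload` closures are hypothetical stand-ins, not the benchmark used for the table, and the `compile_time` field of `@timed` is assumed to be available (Julia 1.11 or later).

```julia
# Hypothetical workload: each anonymous function has its own type, so each
# call to `map` triggers fresh inference and codegen.
workload = [
    () -> map(x -> x + 1, [1.0, 2.0]),
    () -> map(x -> 2x, [1.0, 2.0]),
    () -> map(x -> x^2, [1.0, 2.0]),
]

# `@timed` returns a NamedTuple; `compile_time` is reported in seconds.
stats = [Base.@timed(job()) for job in workload]
total_compile = sum(s.compile_time for s in stats)

println("total compile time: ", round(total_compile; digits = 3), " s")
```

The numbers are only meaningful in a fresh process, since anything compiled earlier in the session no longer contributes.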

Runtime — no regression (small and large arrays)

This is the key risk: @nospecialize makes the kernel call a dynamic dispatch,
so per-call overhead would show up most on tiny arrays. Back-to-back
single-threaded BenchmarkTools runs (@belapsed, 10 000 samples), baseline vs.
branch on the same machine:

| case | baseline | branch | ratio |
|---|---|---|---|
| permutedims! 4×4 | 145 ns | 150 ns | 1.034 |
| permutedims! 4×4×4 | 251 ns | 239 ns | 0.95 |
| permutedims! 8×8×8 | 480 ns | 483 ns | 1.01 |
| permutedims! 8×8×8 ComplexF64 | 438 ns | 441 ns | 1.01 |
| map!(+) 4×4 | 200 ns | 203 ns | 1.02 |
| map!(+) 4×4×4 | 445 ns | 449 ns | 1.01 |
| map!(+) 8×8×8 | 2116 ns | 2076 ns | 0.98 |
| map!(conj) 8×8×8 ComplexF64 | 417 ns | 413 ns | 0.99 |
| mapreducedim! 4×4 | 381 ns | 256 ns | 0.67 |
| mapreducedim! 4×4×4 | 629 ns | 375 ns | 0.60 |
| mapreducedim! 8×8×8 | 825 ns | 612 ns | 0.74 |
| sum 4×4 | 984 ns | 859 ns | 0.87 |
| sum 4×4×4 | 1190 ns | 909 ns | 0.76 |
| sum 8×8×8 | 1928 ns | 1636 ns | 0.85 |
| permutedims! large 256×256×16 | 2.759 ms | 2.664 ms | 0.97 |
| sum large 256×256×16 | 1.679 ms | 1.674 ms | 1.00 |

The worst small-array result is permutedims! 4×4 at +3.4 %, a 5 ns absolute
difference on a 145 ns op. A focused back-to-back re-measurement put baseline at
140 ns and branch at 145 ns: the baseline itself moved by 5 ns between runs, so
the gap is within run-to-run noise rather than a systematic per-call cost. No
other case is more than 2 % slower, large arrays are unaffected, and the small
reduction paths (mapreducedim!, sum) are 13 to 40 % faster. No meaningful
regression.

🤖 Generated with Claude Code

The strided mapreduce machinery (map/map!/mapreducedim!/_mapreduce and the
inner bookkeeping + threading helpers) was specialized on the map/reduce
function types f/op/initop. Combined with the (M, N, eltype) axes this caused
a combinatorial explosion of specializations for downstream packages such as
TensorOperations, which generate many distinct closures.

Annotate the outer entry points and the deeper bookkeeping/threading helpers
with @nospecialize so they compile once per (M, N, eltype) regardless of the
function types. The expensive @generated `_mapreduce_kernel!` stays fully
specialized and is reached via a function barrier (one dynamic dispatch per
coarse call), so steady-state runtime is unchanged.
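
A minimal sketch of the pattern described above, using hypothetical names (`rowreduce!` and `kernel!` are illustrations, not Strided's functions): the outer method is marked `@nospecialize` on the function arguments, while the inner kernel keeps explicit type parameters and is reached through a single dynamic dispatch.

```julia
# Hypothetical names; not Strided's actual functions.

# Inner kernel: specialized on the concrete types of `f` and `op`.
function kernel!(f::F, op::O, out::Vector{T}, src::Matrix{T}) where {F, O, T}
    @inbounds for j in axes(src, 2), i in axes(src, 1)
        out[i] = op(out[i], f(src[i, j]))
    end
    return out
end

# Outer bookkeeping: compiled once per array type, whatever `f` and `op` are.
function rowreduce!(@nospecialize(f), @nospecialize(op),
                    out::Vector{T}, src::Matrix{T}) where {T}
    size(src, 1) == length(out) ||
        throw(DimensionMismatch("out must have one entry per row"))
    # Function barrier: one dynamic dispatch into the specialized kernel.
    return kernel!(f, op, out, src)
end

A = [1.0 2.0; 3.0 4.0]
rowreduce!(abs2, +, zeros(2), A)   # per-row sum of squares
```

The cost of the barrier is paid once per outer call, which is why it matters mostly for very small arrays.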

Also split the body of the @generated `_mapreduce_kernel!` into a sibling
plain function `_mapreduce_kernel_expr(f, op, initop, N, M)` that returns the
Expr, for clarity; the generated kernel itself is otherwise unchanged.
`_mapreduce_block!` is preserved as the GPU extension override point.
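
The split between a `@generated` function and a plain Expr-building helper can be sketched as follows. This is a toy example with hypothetical names (`sum_tuple`, `sum_tuple_expr`), not the Strided kernel; it only shows why the helper makes the generated code easy to inspect and benchmark.

```julia
# Hypothetical example; not Strided's kernel.

# Plain function: builds the expression from the static size only.
function sum_tuple_expr(N::Int)
    ex = :(zero(T))                 # `T` is resolved inside the generated function
    for i in 1:N
        ex = :($ex + t[$i])         # unrolled accumulation
    end
    return ex
end

# The generated function simply delegates to the helper.
@generated function sum_tuple(t::NTuple{N, T}) where {N, T}
    return sum_tuple_expr(N)
end

sum_tuple((1, 2, 3))        # runs the unrolled code
sum_tuple_expr(3)           # inspect the Expr without calling the kernel
```

Because the helper is an ordinary function, it can be called, printed, and timed directly without working around the generated function.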

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov Bot commented Jun 18, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

| Files with missing lines | Coverage Δ |
|---|---|
| src/mapreduce.jl | 80.61% <100.00%> (+0.10%) ⬆️ |

Jutho commented Jun 18, 2026

Member

What is strange is that these @nospecializes were once there, see e.g.:
https://github.com/QuantumKitHub/Strided.jl/blob/v2.0.3/src/mapreduce.jl

At some point they got removed, because I guess we felt it didn't really matter all that much. I was never able to properly measure a difference between having and not having the @nospecialize annotations, but that is probably due to my own failure to measure things properly.

Jutho commented Jun 18, 2026

Member

Also makes you wonder to what extent Claude has really just learned older versions of this code 😄

Jutho commented Jun 18, 2026

Member

This is the commit that removed them:
b5bc609#diff-9ead82f1e0eabb1bf827e1a45d45cbd38bc95082400529082e40b457004fa0a8

Unfortunately, there is no explanation of what the motivation was. This happened after a long time of no updates:

[Image: Screenshot 2026-06-19 at 00 33 01]

lkdvos commented Jun 19, 2026

Member Author

To be fair, I actually spent the entire day trying out every single thing I could think of, and mostly used Claude to do the benchmarking, so maybe it's just me remembering the old versions of this code... Anyways, I guess this means you are okay with me merging this?

Jutho commented Jun 19, 2026

Member

Yes, one thing I find confusing is that I thought that Julia anyway doesn't specialize on arguments which are functions, unless you explicitly write function higher_order_function(f::F) where {F}, i.e. if you add an explicit type parameter.
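
For context, the Julia manual's performance tips describe this heuristic as applying only when a function argument is passed through to another call; when the argument is called inside the method, Julia specializes anyway. A small sketch with hypothetical function names, modelled on the manual's example:

```julia
# Hypothetical function names, modelled on the example in the Julia manual.

# `f` is only passed through to `ntuple`, so by default this method is not
# specialized on the concrete `typeof(f)`.
passthrough(f, n) = ntuple(f, n)

# An explicit type parameter forces specialization on `typeof(f)`.
forced(f::F, n) where {F} = ntuple(f, n)

# `f` is called in the body, so Julia specializes without the type parameter.
called(f, n) = f(n)

passthrough(abs2, 3)
forced(abs2, 3)
called(abs2, 3)
```

All three return the same values; they differ only in how many method instances get compiled as the set of distinct `f` grows.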

Comment thread src/mapreduce.jl
return _mapreduce_kernel_expr(f, op, initop, N, M)
end

function _mapreduce_kernel_expr(f, op, initop, N::Int, M::Int)

Member


Is there a benefit to this approach? Maybe this is actually where some of the compilation time gain comes from? That the code that generates the code no longer needs to be recompiled for different types?

Member Author


I measured this at some point, including the actual runtime of this function, and this seemed somewhat negligible. The main reason I did this is specifically for that, where I was benchmarking the expression generation :). I'm happy to revert this, but it is also just convenient to actually inspect the generated code without having to work around the generated function.

@lkdvos lkdvos merged commit ac118bb into main Jun 19, 2026
10 of 13 checks passed
@lkdvos lkdvos deleted the ld-nospecialize branch June 19, 2026 15:19
lkdvos commented Jun 19, 2026

Member Author

Re the specialization on functions - this is definitely true, but we bypassed this in TensorOperations by making the callable structs Scaler and Adder. The point is that we do actually want to specialize the kernels, since that gives measurable runtime benefits, but not the stuff on top of it (I think)?
