Add @nospecialize to mapreduce dispatch chain to cut compile time #66
The strided mapreduce machinery (map/map!/mapreducedim!/_mapreduce and the inner bookkeeping + threading helpers) was specialized on the map/reduce function types f/op/initop. Combined with the (M, N, eltype) axes this caused a combinatorial explosion of specializations for downstream packages such as TensorOperations, which generate many distinct closures.

Annotate the outer entry points and the deeper bookkeeping/threading helpers with @nospecialize so they compile once per (M, N, eltype) regardless of the function types. The expensive @generated `_mapreduce_kernel!` stays fully specialized and is reached via a function barrier (one dynamic dispatch per coarse call), so steady-state runtime is unchanged.

Also split the body of the @generated `_mapreduce_kernel!` into a sibling plain function `_mapreduce_kernel_expr(f, op, initop, N, M)` that returns the Expr, for clarity; the generated kernel itself is otherwise unchanged.

`_mapreduce_block!` is preserved as the GPU extension override point.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
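The pattern described in the commit message can be sketched in a few lines. This is a minimal illustration, not the actual Strided.jl code: `apply!` and `kernel!` are hypothetical names, and the array types are fixed to `Vector{Float64}` for brevity. The outer function is annotated with `@nospecialize` so that it is not recompiled per function type, while the inner kernel keeps a type parameter and stays specialized.

```julia
# Hypothetical names; a sketch of the pattern, not the Strided.jl internals.

# Hot inner loop: specialized on the function type F via the type parameter.
function kernel!(f::F, dst::Vector{Float64}, src::Vector{Float64}) where {F}
    @inbounds for i in eachindex(dst, src)
        dst[i] = f(src[i])
    end
    return dst
end

# Outer bookkeeping: `@nospecialize` asks the compiler not to generate a
# separate specialization of this method for every distinct `typeof(f)`.
function apply!(@nospecialize(f), dst::Vector{Float64}, src::Vector{Float64})
    length(dst) == length(src) || throw(DimensionMismatch("length mismatch"))
    # Function barrier: one dynamic dispatch into the specialized kernel.
    return kernel!(f, dst, src)
end

src = [1.0, 2.0, 3.0]
dst = similar(src)
apply!(x -> 2x, dst, src)
```

The cost of the barrier is paid once per outer call rather than once per element, which is why it is expected to matter only for very small arrays.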
Codecov Report: ✅ All modified and coverable lines are covered by tests.
What is strange is that these At some point they got removed, because I guess we felt it didn't really matter all that much. I was never able to really properly measure a difference between with and without any
Also makes you wonder to what extent Claude has really just learned older versions of this code 😄
This is the commit that removed them: Unfortunately, there is no explanation of what the motivation was. This happened after a long time of no updates:
To be fair, I actually spent the entire day trying out every single thing I could think of, and mostly used Claude to do the benchmarking, so maybe it's just me remembering the old versions of this code... Anyways, I guess this means you are okay with me merging this?
Yes, one thing I find confusing is that I thought that Julia anyway doesn't specialize on arguments which are functions, unless you explicitly write
        return _mapreduce_kernel_expr(f, op, initop, N, M)
    end

    function _mapreduce_kernel_expr(f, op, initop, N::Int, M::Int)
Is there a benefit to this approach? Maybe this is actually where some of the compilation time gain comes from? That the code that generates the code no longer needs to be recompiled for different types?
I measured this at some point, including the actual runtime of this function, and this seemed somewhat negligible. The main reason I did this is specifically for that, where I was benchmarking the expression generation :). I'm happy to revert this, but it is also just convenient to actually inspect the generated code without having to work around the generated function.
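For readers unfamiliar with this split, here is a small sketch of the general idea: the expression is built by an ordinary function, and the `@generated` function only forwards to it. The names `unrolled_sum_expr` and `unrolled_sum` are hypothetical and much simpler than `_mapreduce_kernel_expr`; the point is only that the plain function can be called, inspected, and timed directly.

```julia
# Hypothetical helper: builds the expression for an unrolled sum over N entries.
# It must be defined before the generated function that calls it.
function unrolled_sum_expr(N::Int)
    ex = :(zero(T))              # `T` resolves inside the generated function
    for i in 1:N
        ex = :($ex + t[$i])      # append one term per tuple entry
    end
    return ex
end

# The generated function just returns the expression built by the helper.
@generated function unrolled_sum(t::NTuple{N,T}) where {N,T}
    return unrolled_sum_expr(N)
end

# The generated code can be inspected without going through @generated:
ex = unrolled_sum_expr(3)
total = unrolled_sum((1, 2, 3))
```

Calling `unrolled_sum_expr(3)` at the REPL shows the expression the generated function would splice in for a 3-tuple, which is the convenience mentioned in the comment above.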
Re the specialization on functions - this is definitely true, but we bypassed this in TensorOperations by making the callable structs
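As background for this exchange, the sketch below shows the three situations being discussed, under the assumption that the heuristic in question is the documented one: Julia may skip specializing on an argument of type `Function` when the method only passes it through, an explicit type parameter forces specialization, and a callable struct that does not subtype `Function` falls outside the heuristic. All names here (`passthrough`, `forced`, `Scale`) are hypothetical and unrelated to the TensorOperations types.

```julia
# `f` is only passed along, not called here: Julia may avoid specializing
# this method on typeof(f) when f is a Function.
passthrough(f, xs) = map(f, xs)

# An explicit type parameter forces one specialization per function type.
forced(f::F, xs) where {F} = map(f, xs)

# A callable struct that is NOT declared `<: Function` (hypothetical type).
# The Function heuristic does not apply to it, so methods receiving it are
# specialized on its concrete type as for any other argument.
struct Scale
    a::Float64
end
(s::Scale)(x) = s.a * x

xs = [1.0, 2.0]
ys_passthrough = passthrough(Scale(2.0), xs)
ys_forced = forced(x -> x + 1, xs)
```

All three forms compute the same results; they differ only in how many compiled specializations the compiler is expected to produce.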

This PR adds `@nospecialize` to `f`/`op`/`initop` throughout that chain, so the upstream bookkeeping and threading code compiles once per `(M, N, eltype)` regardless of the function types.

The actual `@generated` `_mapreduce_kernel!` stays fully specialized.

Net effect: far fewer redundant compilations, identical steady-state runtime.
Validation
All measured on Julia 1.12.6.
TensorOperations compile time — lower
Fresh process, benchmark env with Strided developed to this branch vs. `main`. Cold inference+codegen summed over a representative `@tensor` workload (rank-2…5 contractions, permutations, `tensoradd`, over Float64 and ComplexF64) via `sum(Base.@timed(...).compile_time)`:

[Table: compile time on `main` vs. this branch, including the `@tensor` workload compile row. The values were not preserved.]

The "many-op" workload mimics the real combinatorial driver (many distinct closure types), where the win is largest; the end-to-end `@tensor` number is the fraction of that a TensorOperations user actually pays.
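A measurement of this kind can be sketched as follows. This assumes Julia 1.11 or later, where the result of `@timed` carries a `compile_time` field in seconds; the two closures are stand-ins chosen for illustration, not the `@tensor` workload used for the numbers above.

```julia
# Stand-in workload (hypothetical); each entry is run once, cold.
workload = [
    () -> sum(abs2, rand(4, 4)),
    () -> map(x -> x + 1, rand(ComplexF64, 4, 4)),
]

# Sum the compilation time reported by @timed for each first call.
# Assumes Julia >= 1.11, where `compile_time` is part of the returned stats.
compile_seconds = sum((@timed w()).compile_time for w in workload)
```

Because compilation is cached within a session, the figure is only meaningful in a fresh process, which matches the "fresh process" setup described above.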
Runtime — no regression (small and large arrays)
This is the key risk: `@nospecialize` makes the kernel call a dynamic dispatch, so per-call overhead would show up most on tiny arrays. Back-to-back single-threaded `BenchmarkTools` runs (`@belapsed`, 10 000 samples), baseline vs. branch on the same machine:

[Table: per-case timings, values not preserved. Cases: `permutedims!` 4×4, 4×4×4, 8×8×8, 8×8×8 ComplexF64; `map!(+)` 4×4, 4×4×4, 8×8×8; `map!(conj)` 8×8×8 ComplexF64; `mapreducedim!` 4×4, 4×4×4, 8×8×8; `sum` 4×4, 4×4×4, 8×8×8; `permutedims!` large 256×256×16; `sum` large 256×256×16.]

The worst small-array result is `permutedims!` 4×4 at +3.4 %, a 5 ns absolute difference on a 145 ns op. A focused back-to-back re-measurement put baseline at 140 ns and branch at 145 ns, i.e. the spread is run-to-run noise, not a systematic per-call cost. Everything else is within ±2 %, large arrays are unaffected, and several small reduction paths (`mapreducedim!`, `sum`) are faster. No meaningful regression.
🤖 Generated with Claude Code