Add @nospecialize to mapreduce dispatch chain to cut compile time by lkdvos · Pull Request #66 · QuantumKitHub/Strided.jl

lkdvos · 2026-06-18T21:43:22Z

This PR adds @nospecialize to f/op/initop throughout the mapreduce dispatch chain, so the upstream bookkeeping and threading code compiles once per (M, N, eltype) regardless of the function types.
The actual @generated _mapreduce_kernel! stays fully specialized.
Net effect: far fewer redundant compilations, with no measurable change in steady-state runtime.
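
The split between unspecialized bookkeeping and a specialized kernel can be sketched as follows. This is a minimal illustration under assumed names (`outer_map!` and `map_kernel!` are hypothetical and are not the Strided.jl functions); it only shows the pattern of an `@nospecialize`d outer method handing off to a kernel through a function barrier.

```julia
# Hypothetical names; this is not the Strided.jl source.

# Outer layer: `f` is marked @nospecialize, so this method is compiled once
# per array type rather than once per function type.
function outer_map!(@nospecialize(f), dst::Vector{Float64}, src::Vector{Float64})
    length(dst) == length(src) || throw(DimensionMismatch("dst and src differ in length"))
    # ... bookkeeping (blocking, threading decisions) would live here ...
    return map_kernel!(f, dst, src)  # function barrier: one dynamic dispatch
end

# Inner kernel: the type parameter F forces specialization on the function
# type, so the hot loop can inline `f`.
function map_kernel!(f::F, dst, src) where {F}
    @inbounds for i in eachindex(dst, src)
        dst[i] = f(src[i])
    end
    return dst
end

src = [1.0, 2.0, 3.0]
dst = similar(src)
outer_map!(x -> 2x, dst, src)   # compiles a new kernel for this closure only
outer_map!(abs2, dst, src)      # reuses the already compiled outer method
```

The dynamic dispatch happens once per outer call, not once per element, which is why its cost is expected to matter only for very small arrays.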

Validation

All measured on Julia 1.12.6.

TensorOperations compile time — lower

Fresh process, benchmark env with Strided developed to this branch vs. main.
Cold inference+codegen summed over a representative @tensor workload
(rank-2…5 contractions, permutations, tensoradd, over Float64 and ComplexF64)
via sum(Base.@timed(...).compile_time):

| metric | baseline (`main`) | this branch | change |
| --- | --- | --- | --- |
| `@tensor` workload compile | 30.31 s | 25.57 s | −15.6 % |
| many-op synthetic compile (8 distinct map fns × 8 reduce ops × N=2..5) | 14.87 s | 11.18 s | −24.8 % |

The "many-op" workload mimics the real combinatorial driver (many distinct
closure types), where the win is largest; the end-to-end @tensor number is the
fraction of that a TensorOperations user actually pays.
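
The measurement described above can be approximated with a few lines of Base Julia. This is a minimal sketch, not the benchmark script behind the table: the `workload` closures are hypothetical stand-ins for the @tensor expressions, and it assumes Julia 1.11 or newer, where the result of `@timed` carries a `compile_time` field (in seconds).

```julia
# Hypothetical stand-in workload: each closure triggers compilation for a
# different function/eltype combination on its first call.
workload = [
    () -> map(x -> x + 1, rand(Float64, 4, 4)),
    () -> map(x -> 2x, rand(ComplexF64, 4, 4)),
    () -> sum(abs2, rand(ComplexF64, 4, 4, 4)),
]

# Sum the compilation time reported by `@timed` over all thunks.
function total_compile_time(thunks)
    total = 0.0
    for thunk in thunks
        stats = @timed thunk()
        total += stats.compile_time
    end
    return total
end

cold = total_compile_time(workload)  # first pass: includes inference + codegen
warm = total_compile_time(workload)  # second pass: code is already compiled
println("cold: ", cold, " s, warm: ", warm, " s")
```

For numbers comparable to the table, the cold pass has to run in a fresh process for each branch, since compiled code is cached for the lifetime of the session.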

Runtime — no regression (small and large arrays)

This is the key risk: @nospecialize makes the kernel call a dynamic dispatch,
so per-call overhead would show up most on tiny arrays. Back-to-back
single-threaded BenchmarkTools runs (@belapsed, 10 000 samples), baseline vs.
branch on the same machine:

| case | baseline | branch | ratio |
| --- | --- | --- | --- |
| `permutedims!` 4×4 | 145 ns | 150 ns | 1.034 |
| `permutedims!` 4×4×4 | 251 ns | 239 ns | 0.95 |
| `permutedims!` 8×8×8 | 480 ns | 483 ns | 1.01 |
| `permutedims!` 8×8×8 ComplexF64 | 438 ns | 441 ns | 1.01 |
| `map!(+)` 4×4 | 200 ns | 203 ns | 1.02 |
| `map!(+)` 4×4×4 | 445 ns | 449 ns | 1.01 |
| `map!(+)` 8×8×8 | 2116 ns | 2076 ns | 0.98 |
| `map!(conj)` 8×8×8 ComplexF64 | 417 ns | 413 ns | 0.99 |
| `mapreducedim!` 4×4 | 381 ns | 256 ns | 0.67 |
| `mapreducedim!` 4×4×4 | 629 ns | 375 ns | 0.60 |
| `mapreducedim!` 8×8×8 | 825 ns | 612 ns | 0.74 |
| `sum` 4×4 | 984 ns | 859 ns | 0.87 |
| `sum` 4×4×4 | 1190 ns | 909 ns | 0.76 |
| `sum` 8×8×8 | 1928 ns | 1636 ns | 0.85 |
| `permutedims!` large 256×256×16 | 2.759 ms | 2.664 ms | 0.97 |
| `sum` large 256×256×16 | 1.679 ms | 1.674 ms | 1.00 |

The worst small-array result is permutedims! 4×4 at +3.4 %, a 5 ns absolute
difference on a 145 ns op. A focused back-to-back re-measurement put baseline at
140 ns and branch at 145 ns: the baseline alone moved by 5 ns between runs, so
the gap is consistent with run-to-run noise rather than a systematic per-call
cost. No other case is more than 2 % slower, large arrays are unaffected, and
several small reduction paths (mapreducedim!, sum) are faster. No meaningful
regression.

🤖 Generated with Claude Code

The strided mapreduce machinery (map/map!/mapreducedim!/_mapreduce and the inner bookkeeping + threading helpers) was specialized on the map/reduce function types f/op/initop. Combined with the (M, N, eltype) axes this caused a combinatorial explosion of specializations for downstream packages such as TensorOperations, which generate many distinct closures.

Annotate the outer entry points and the deeper bookkeeping/threading helpers with @nospecialize so they compile once per (M, N, eltype) regardless of the function types. The expensive @generated `_mapreduce_kernel!` stays fully specialized and is reached via a function barrier (one dynamic dispatch per coarse call), so steady-state runtime is unchanged.

Also split the body of the @generated `_mapreduce_kernel!` into a sibling plain function `_mapreduce_kernel_expr(f, op, initop, N, M)` that returns the Expr, for clarity; the generated kernel itself is otherwise unchanged. `_mapreduce_block!` is preserved as the GPU extension override point.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-18T21:53:47Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines	Coverage Δ
src/mapreduce.jl	`80.61% <100.00%> (+0.10%)`	⬆️

Jutho · 2026-06-18T22:25:26Z

What is strange is that these @nospecializes were once there, see e.g.:
https://github.com/QuantumKitHub/Strided.jl/blob/v2.0.3/src/mapreduce.jl

At some point they got removed, because I guess we felt it didn't really matter all that much. I was never able to really properly measure a difference between with and without any @nospecialize call, but that is probably due to my own failure in properly measuring things.

Jutho · 2026-06-18T22:27:41Z

Also makes you wonder to what extent Claude has really just learned older versions of this code 😄

Jutho · 2026-06-18T22:34:16Z

This is the commit that removed them:
b5bc609#diff-9ead82f1e0eabb1bf827e1a45d45cbd38bc95082400529082e40b457004fa0a8

Unfortunately, there is no explanation of what the motivation was. This happened after a long time of no updates.

lkdvos · 2026-06-19T00:02:42Z

To be fair, I actually spent the entire day trying out every single thing I could think of, and mostly used Claude to do the benchmarking, so maybe it's just me remembering the old versions of this code... Anyways, I guess this means you are okay with me merging this?

Jutho · 2026-06-19T08:56:11Z

Yes. One thing I find confusing: I thought Julia doesn't specialize on function-valued arguments anyway, unless you add an explicit type parameter, i.e. write function higher_order_function(f::F) where {F}.

Jutho · 2026-06-19T08:59:01Z

+    return _mapreduce_kernel_expr(f, op, initop, N, M)
+end
+
+function _mapreduce_kernel_expr(f, op, initop, N::Int, M::Int)


Is there a benefit to this approach? Maybe this is actually where some of the compilation time gain comes from? That the code that generates the code no longer needs to be recompiled for different types?

lkdvos · 2026-06-19

I measured this at some point, including the actual runtime of this function, and it seemed negligible. The split exists precisely because I was benchmarking the expression generation :). I'm happy to revert it, but it is also convenient for inspecting the generated code without having to work around the generated function.
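
The pattern under discussion, a plain function that builds the expression and a thin @generated wrapper around it, can be sketched on a toy example. The names `unrolled_sum` and `unrolled_sum_expr` are hypothetical and much simpler than `_mapreduce_kernel!`; the sketch only illustrates why the split makes the generated code easy to inspect.

```julia
# Plain function: builds the expression for an unrolled sum over N elements.
# It takes only the static information it needs, so it can be called,
# inspected and benchmarked like any other function.
function unrolled_sum_expr(N::Int)
    ex = :(zero(T))
    for i in 1:N
        ex = :($ex + x[$i])
    end
    return ex
end

# Thin generated wrapper: inside the generator, N and T are the static
# parameters; the returned expression refers to the runtime value `x`.
@generated function unrolled_sum(x::NTuple{N,T}) where {N,T}
    return unrolled_sum_expr(N)
end

println(unrolled_sum_expr(3))    # inspect the code without calling the kernel
println(unrolled_sum((1, 2, 3)))
```

Because the expression builder no longer receives the argument types themselves, it is compiled once, while the generated wrapper is still instantiated per signature.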

lkdvos · 2026-06-19T15:20:38Z

Re the specialization on functions - this is definitely true, but we bypassed this in TensorOperations by making the callable structs Scaler and Adder. The point is that we do actually want to specialize the kernels, since that gives measurable runtime benefits, but not the stuff on top of it (I think)?
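
The distinction raised here can be illustrated with a small sketch. Julia's documented heuristic may skip specialization for an argument of type `Function` that is only passed through to another call, whereas a callable struct is not a subtype of `Function` and is specialized like any other value. The `Scale` type below is a hypothetical stand-in, not the TensorOperations definition of Scaler or Adder, and the printed list of specializations is left for the reader to inspect since it depends on the Julia version (`Base.specializations` needs Julia 1.10 or newer).

```julia
# Kernel-like function: the type parameter forces specialization on `f`.
apply_all(f::F, xs) where {F} = [f(x) for x in xs]

# Only forwards `f` without calling it. For `Function`-typed arguments the
# compiler's heuristic may decline to specialize this method.
forward(f, xs) = apply_all(f, xs)

# Hypothetical callable struct; note that it is not a subtype of `Function`.
struct Scale{T}
    factor::T
end
(s::Scale)(x) = s.factor * x

forward(abs, [-1.0, 2.0])
forward(Scale(2.0), [1.0, 2.0])

# List the method instances that were actually created for `forward`.
for mi in Base.specializations(only(methods(forward)))
    println(mi)
end
```

Marking the forwarding layer with @nospecialize makes the behavior for callable structs match what the heuristic already does for plain functions, while the kernel keeps its explicit type parameter.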

Comment thread src/mapreduce.jl Outdated

Apply suggestion from @lkdvos

25b35b4

Jutho approved these changes Jun 19, 2026

Jutho reviewed Jun 19, 2026

lkdvos merged commit ac118bb into main Jun 19, 2026
10 of 13 checks passed

lkdvos deleted the ld-nospecialize branch June 19, 2026 15:19

lkdvos merged 2 commits into main from ld-nospecialize