Add unit-stride fast path to _mapreduce_kernel!#70
Conversation
The innermost (vectorized) loop dimension steps the parent indices by the arrays' strides, which are runtime values. Even when the data is contiguous, the compiler cannot prove unit stride and auto-vectorizes the loop with gather/scatter instructions, which do not stream memory. For a contiguous 400x400 Float64 `copy!` this runs at ~8.5 GB/s (~300 us) instead of the ~33 GB/s a contiguous SIMD loop achieves. Add a runtime branch: when every array is contiguous along loop dimension 1 (all innermost strides == 1), step the indices by the literal `1` so the compiler emits streaming SIMD loads/stores. The post-loop index correction reuses the existing return-stride expressions, which are numerically identical because the stride equals 1. Measured (contiguous 400x400 Float64, single thread): `copy!` 300 us -> 91 us (~3.3x), matching the compile-time-constant-stride ideal. Non-contiguous (e.g. transposed) inputs are unaffected and keep the existing path. Full test suite passes (single- and multi-threaded). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refactor comments for clarity and conciseness.
|
TLDR here: if a kernel hits a case where all accesses are secretly stride 1, this is not detected since these are runtime values, and (at least on my machine) instead of using contiguous loads, uses As a sidenote, the reason I found this is that I was experimenting with |
Codecov Report❌ Patch coverage is
🚀 New features to boost your workflow:
|
| unitstridecond = reduce( | ||
| (a, b) -> :($a && $b), | ||
| [:($(stridevars[1, j]) == 1) for j in 1:M] | ||
| ) |
There was a problem hiding this comment.
The standard && expression with more than 2 arguments seems to be always nested, i.e.
:(cond1 && cond2 && cond3) yields Expr(:&&, cond1, Expr(:&&, cond2, cond3)).
It does however see that we can manually build an && Expr with more than 2 arguments: Expr(:&&, cond1, cond2, cond3). While this expression object is not printed nicely, it is accepted as valid code (as I've checked by running eval on it). So we could do
| unitstridecond = reduce( | |
| (a, b) -> :($a && $b), | |
| [:($(stridevars[1, j]) == 1) for j in 1:M] | |
| ) | |
| unitstridecond = Expr(:&&, [:($(stridevars[1, j]) == 1) for j in 1:M]...) |
but the macro-expanded code would look a bit weird.
There was a problem hiding this comment.
I'm not sure why I went for the shortcircuit in the firstplace now that I look at it, this is probably completely unnecessary and it might just be faster to simply check all of them. I replaced this now with all(==(1), (stride_1_1, stride_2_1, ...)), which presumably just gives the exact same machine code but looks a bit cleaner.
Let me know what you think.
Problem
_mapreduce_kernel!steps the parent indices of the innermost (vectorized) loop dimension by the arrays' strides, which are runtime values. Even when the data is contiguous, the compiler cannot prove unit stride, so LLVM auto-vectorizes the inner loop with gather/scatter instructions (vgatherqpd/vscatterqpd). Gather/scatter address each lane individually and do not stream memory.The effect is severe for memory-bound contiguous ops. A contiguous
400×400Float64copy!(map!(identity, …)) runs at ~8.5 GB/s (~300 µs) instead of the ~33 GB/s a contiguous SIMD loop achieves. Minimal reproduction — a hand-written@simdcopy loop:C[i]=A[i], runtime stride (=1)C[i]=A[i], compile-time stride1Pure-copy /
map!bodies are hit hardest because LLVM's cost model judges gather/scatter SIMD "profitable" for them, whereas heavier accumulate bodies are left as (faster) scalar loops.Change
Add a runtime branch in the innermost loop: when every array is contiguous along loop dimension 1 (all innermost strides
== 1), step the parent indices by the literal1instead of the runtime stride. This lets the compiler emit streaming SIMD loads/stores. The post-loop index correction reuses the existing return-stride expressions, which are numerically identical because the stride equals1.The change is confined to the generated expression for the non-reduction innermost loop; non-contiguous (e.g. transposed) inputs are unaffected and take the existing path.
Results
Contiguous
400×400Float64, single thread:copy!(contiguous)copy!(transposed input)The contiguous result now matches the compile-time-constant-stride ideal.
Testing
JULIA_NUM_THREADS=4):map/scale!/axpy!/axpby!,copy,broadcasting,mapreduce,reduce,mul!.copy!/conj!/permuted copies/scaledmap!/binarymap!/reductions/axpby): max error0.0.Notes / possible follow-ups
iszero(stride)hoist branch) still gather contiguous inputs (e.g.sumover a contiguous array). An analogous unit-stride branch there would help; left out to keep this change focused.🤖 Generated with Claude Code