Add unit-stride fast path to _mapreduce_kernel! by lkdvos · Pull Request #70 · QuantumKitHub/Strided.jl

lkdvos · 2026-06-25T00:41:33Z

Problem

_mapreduce_kernel! steps the parent indices of the innermost (vectorized) loop dimension by the arrays' strides, which are runtime values. Even when the data is contiguous, the compiler cannot prove unit stride, so LLVM auto-vectorizes the inner loop with gather/scatter instructions (vgatherqpd/vscatterqpd). Gather/scatter address each lane individually and do not stream memory.

The effect is severe for memory-bound contiguous ops. A contiguous 400×400 Float64 copy! (map!(identity, …)) runs at ~8.5 GB/s (~300 µs) instead of the ~33 GB/s a contiguous SIMD loop achieves. Minimal reproduction — a hand-written @simd copy loop:

| inner loop (160 000 elems) | time | asm |
| --- | --- | --- |
| `C[i]=A[i]`, runtime stride (=1) | 298 µs | gather/scatter |
| `C[i]=A[i]`, compile-time stride `1` | 91 µs | contiguous |
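A minimal sketch of what such a reproduction could look like. The function names are hypothetical and are not part of Strided.jl; the only difference between the two loops is whether the index increment is a runtime argument or the literal `1`.

```julia
# Hypothetical reproduction; function names are illustrative, not Strided.jl API.

# Strides arrive as ordinary runtime arguments, so the compiler cannot assume
# they equal 1, even when every caller passes 1.
# The caller must make sure the strides keep all indices in bounds.
function copy_runtime_stride!(C::Vector{Float64}, A::Vector{Float64}, sC::Int, sA::Int)
    iC, iA = 1, 1
    @inbounds @simd for i in 1:length(C)
        C[iC] = A[iA]
        iC += sC   # runtime step
        iA += sA
    end
    return C
end

# Same loop, but the step is the literal 1, so unit stride is visible at compile time.
function copy_unit_stride!(C::Vector{Float64}, A::Vector{Float64})
    iC, iA = 1, 1
    @inbounds @simd for i in 1:length(C)
        C[iC] = A[iA]
        iC += 1    # compile-time step
        iA += 1
    end
    return C
end

A = rand(400 * 400)
C1 = copy_runtime_stride!(similar(A), A, 1, 1)
C2 = copy_unit_stride!(similar(A), A)
```

Both calls produce the same result; the difference would only show up in timings and in the generated code, for example when inspecting each function with `@code_native`.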

Pure-copy / map! bodies are hit hardest because LLVM's cost model judges gather/scatter SIMD "profitable" for them, whereas heavier accumulate bodies are left as (faster) scalar loops.

Change

Add a runtime branch in the innermost loop: when every array is contiguous along loop dimension 1 (all innermost strides == 1), step the parent indices by the literal 1 instead of the runtime stride. This lets the compiler emit streaming SIMD loads/stores. The post-loop index correction reuses the existing return-stride expressions, which are numerically identical because the stride equals 1.

The change is confined to the generated expression for the non-reduction innermost loop; non-contiguous (e.g. transposed) inputs are unaffected and take the existing path.
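The branch described above can be sketched on a hand-written two-array copy. This is an illustration under assumptions, not the generated kernel code: the function name is hypothetical, and the rewind after the loop only stands in for the kernel's post-loop index correction.

```julia
# Hypothetical sketch of the unit-stride branch; not the actual generated kernel.
function copy_with_fast_path!(C::Vector{Float64}, A::Vector{Float64}, n::Int, sC::Int, sA::Int)
    iC, iA = 1, 1
    if sC == 1 && sA == 1
        # Fast path: step by the literal 1 so contiguous loads/stores can be emitted.
        @inbounds @simd for i in 1:n
            C[iC] = A[iA]
            iC += 1
            iA += 1
        end
    else
        # Existing path: step by the runtime strides.
        @inbounds @simd for i in 1:n
            C[iC] = A[iA]
            iC += sC
            iA += sA
        end
    end
    # Shared post-loop correction: rewinding by n * stride is valid for both
    # branches, because in the fast path the stride equals 1.
    iC -= n * sC
    iA -= n * sA
    return iC, iA
end

A = collect(1.0:20.0)
Ccontig = zeros(20)
idx_contig = copy_with_fast_path!(Ccontig, A, 20, 1, 1)    # takes the fast path
Cstrided = zeros(10)
idx_strided = copy_with_fast_path!(Cstrided, A, 10, 1, 2)  # takes the existing path
```

Sharing the correction between the two branches keeps the added code confined to the loop itself, which is the property the description relies on.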

Results

Contiguous 400×400 Float64, single thread:

| op | before | after |
| --- | --- | --- |
| `copy!` (contiguous) | ~300 µs | ~91 µs (~3.3×) |
| `copy!` (transposed input) | ~293 µs | ~259 µs (unchanged path) |

The contiguous result now matches the compile-time-constant-stride ideal.
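For reference, the timings can be converted to bandwidth with a few lines of arithmetic. The sketch assumes that one read and one write per element are counted, which is an inference from the ~8.5 GB/s figure quoted in the description rather than something the description states.

```julia
# Assumption: bandwidth counts one read plus one write per element.
n = 400 * 400                              # 160_000 elements
bytes_moved = 2 * n * sizeof(Float64)      # 2_560_000 bytes

# Bandwidth in GB/s for a runtime given in microseconds.
bandwidth_gbs(t_us) = bytes_moved / (t_us * 1e-6) / 1e9

before = bandwidth_gbs(300)   # about 8.5 GB/s
after = bandwidth_gbs(91)     # about 28 GB/s
speedup = 300 / 91            # about 3.3
```

Under this accounting the 91 µs result corresponds to roughly 28 GB/s, somewhat below the ~33 GB/s quoted for a contiguous SIMD loop, so that figure presumably comes from a separate measurement or a different accounting.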

Testing

- Full Strided test suite passes, single- and multi-threaded (JULIA_NUM_THREADS=4): map/scale!/axpy!/axpby!, copy, broadcasting, mapreduce, reduce, mul!.
- Additional correctness sweep (81 cases over Float32/Float64/ComplexF64 × ndims 1–4, covering copy!/conj!/permuted copies/scaled map!/binary map!/reductions/axpby): max error 0.0.

Notes / possible follow-ups

- Reduction loops (the iszero(stride) hoist branch) still gather contiguous inputs (e.g. sum over a contiguous array). An analogous unit-stride branch there would help; left out to keep this change focused.
- The gather/scatter code path remains in the binary as the fallback for the genuinely strided case; only the runtime branch taken for contiguous data changes.

🤖 Generated with Claude Code

The innermost (vectorized) loop dimension steps the parent indices by the arrays' strides, which are runtime values. Even when the data is contiguous, the compiler cannot prove unit stride and auto-vectorizes the loop with gather/scatter instructions, which do not stream memory. For a contiguous 400x400 Float64 `copy!` this runs at ~8.5 GB/s (~300 us) instead of the ~33 GB/s a contiguous SIMD loop achieves.

Add a runtime branch: when every array is contiguous along loop dimension 1 (all innermost strides == 1), step the indices by the literal `1` so the compiler emits streaming SIMD loads/stores. The post-loop index correction reuses the existing return-stride expressions, which are numerically identical because the stride equals 1.

Measured (contiguous 400x400 Float64, single thread): `copy!` 300 us -> 91 us (~3.3x), matching the compile-time-constant-stride ideal. Non-contiguous (e.g. transposed) inputs are unaffected and keep the existing path. Full test suite passes (single- and multi-threaded).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Refactor comments for clarity and conciseness.

lkdvos · 2026-06-25T01:41:11Z

TL;DR: when every access in a kernel happens to be stride 1, the compiler cannot detect this because the strides are runtime values, so (at least on my machine) it emits gather/scatter instructions instead of contiguous loads and stores. These are significantly slower, and for many of our use cases we deliberately lay out our tensors to hit this contiguous case as often as possible.

As a side note, I found this while comparing map! against _mapreducedim! with an init-op, where C[I1] = A[I2] was somehow slower than C[I1] = C[I1] + A[I2], which really didn't make sense to me.
Inspecting the machine code, it turns out that in the former case the compiler emitted SIMD instructions (requiring gather/scatter because the stride is not known to be 1), while in the latter it did not, because it judged scalar instructions to be more efficient than two gathers plus a single scatter.
That cost model is slightly inaccurate for the copy case: the scalar instructions actually end up faster on my machine, which made copy! slower than mapreducedim! with a beta.

codecov · 2026-06-25T02:29:48Z

Codecov Report

❌ Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/mapreduce.jl | 93.33% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| src/mapreduce.jl | `81.13% <93.33%> (+0.51%)` | ⬆️ |

Jutho · 2026-06-25T23:23:24Z

+    unitstridecond = reduce(
+        (a, b) -> :($a && $b),
+        [:($(stridevars[1, j]) == 1) for j in 1:M]
+    )


A standard && expression with more than two operands always seems to be parsed as nested, i.e.

:(cond1 && cond2 && cond3) yields Expr(:&&, cond1, Expr(:&&, cond2, cond3)).

It does, however, seem that we can manually build an && Expr with more than two arguments: Expr(:&&, cond1, cond2, cond3). While this expression object is not printed nicely, it is accepted as valid code (as I've checked by running eval on it). So we could do

Suggested change

-    unitstridecond = reduce(
-        (a, b) -> :($a && $b),
-        [:($(stridevars[1, j]) == 1) for j in 1:M]
-    )
+    unitstridecond = Expr(:&&, [:($(stridevars[1, j]) == 1) for j in 1:M]...)

but the macro-expanded code would look a bit weird.
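The three shapes discussed in this comment can be compared structurally. A small sketch, assuming three conditions on hypothetical symbols `s1`, `s2`, `s3` in place of the real stride variables:

```julia
# Hypothetical stand-ins for the per-array stride conditions.
conds = [:(s1 == 1), :(s2 == 1), :(s3 == 1)]

# What the diff builds: reduce folds from the left, giving (c1 && c2) && c3.
nested_left = reduce((a, b) -> :($a && $b), conds)

# What the parser builds for a chained &&: right-nested, c1 && (c2 && c3).
parsed = :(s1 == 1 && s2 == 1 && s3 == 1)

# The manually built flat form from the suggestion: a single && node with three arguments.
flat = Expr(:&&, conds...)

left_is_nested_first = nested_left.args[1] isa Expr && nested_left.args[1].head == :&&
parsed_is_nested_last = parsed.args[2] isa Expr && parsed.args[2].head == :&&
```

All three forms short-circuit in the same order; they differ only in how the expression tree is shaped and therefore in how the macro-expanded code prints.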

lkdvos · Jun 26, 2026

I'm not sure why I went for the short-circuit in the first place, now that I look at it. It is probably unnecessary, and it might even be faster to simply check all of them. I have now replaced this with all(==(1), (stride_1_1, stride_2_1, ...)), which presumably gives exactly the same machine code but looks a bit cleaner.
Let me know what you think.
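A sketch of how such a condition could be assembled as an expression. The symbol-naming scheme and the use of three arrays are assumptions for illustration; the actual generator takes its variables from `stridevars`.

```julia
# Hypothetical stride variable names, one per array, for loop dimension 1.
stridesyms = [Symbol(:stride_, j, :_1) for j in 1:3]

# Interpolate the symbols into a tuple and test them all against 1.
unitstridecond = :(all(==(1), ($(stridesyms...),)))

# The same predicate applied to concrete stride values.
contiguous = all(==(1), (1, 1, 1))
strided = all(==(1), (1, 400, 1))
```

Because the strides sit in a tuple of known length, `all` over it can be unrolled by the compiler, which is consistent with the expectation that the machine code matches the chained `&&` version.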

lkdvos and others added 2 commits June 24, 2026 20:41

Improve comments in mapreduce.jl

a302147

Refactor comments for clarity and conciseness.

lkdvos requested a review from Jutho June 25, 2026 15:29

Jutho reviewed Jun 25, 2026

simplify firststride condition

3d605d1

lkdvos wants to merge 3 commits into main from ld-unit-stride-kernel
