Add unit-stride fast path to _mapreduce_kernel!#70

Open
lkdvos wants to merge 3 commits into main from ld-unit-stride-kernel

@lkdvos lkdvos commented Jun 25, 2026

Member

Problem

_mapreduce_kernel! steps the parent indices of the innermost (vectorized) loop dimension by the arrays' strides, which are runtime values. Even when the data is contiguous, the compiler cannot prove unit stride, so LLVM auto-vectorizes the inner loop with gather/scatter instructions (vgatherqpd/vscatterqpd). Gather/scatter address each lane individually and do not stream memory.

The effect is severe for memory-bound contiguous ops. A contiguous 400×400 Float64 copy! (map!(identity, …)) runs at ~8.5 GB/s (~300 µs) instead of the ~33 GB/s a contiguous SIMD loop achieves. Minimal reproduction — a hand-written @simd copy loop:

| inner loop (160 000 elems) | time | asm |
|---|---|---|
| C[i]=A[i], runtime stride (=1) | 298 µs | gather/scatter |
| C[i]=A[i], compile-time stride 1 | 91 µs | contiguous |

Pure-copy / map! bodies are hit hardest because LLVM's cost model judges gather/scatter SIMD "profitable" for them, whereas heavier accumulate bodies are left as (faster) scalar loops.
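
For reference, the two rows of the reproduction table correspond to loops along the following lines. This is a minimal sketch, not the benchmark script used for the numbers above: the function names `copy_runtime_stride!` and `copy_unit_stride!` are hypothetical, and the actual timings and generated instructions depend on the CPU and Julia/LLVM version.

```julia
# Hypothetical reproduction kernels; names are illustrative, not from Strided.jl.

# The stride `s` is a runtime value: the compiler cannot assume s == 1,
# even if every caller passes 1.
function copy_runtime_stride!(C::Vector{Float64}, A::Vector{Float64}, n::Int, s::Int)
    @inbounds @simd for i in 0:(n - 1)
        C[1 + i * s] = A[1 + i * s]
    end
    return C
end

# The stride is the literal 1: consecutive iterations touch adjacent memory,
# and this is visible at compile time.
function copy_unit_stride!(C::Vector{Float64}, A::Vector{Float64}, n::Int)
    @inbounds @simd for i in 1:n
        C[i] = A[i]
    end
    return C
end

n = 400 * 400
A = rand(n)
C1 = copy_runtime_stride!(zeros(n), A, n, 1)  # same data, stride known only at runtime
C2 = copy_unit_stride!(zeros(n), A, n)        # same data, stride known at compile time
```

Both calls produce the same result; only the code the compiler can generate for the loop differs.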

Change

Add a runtime branch in the innermost loop: when every array is contiguous along loop dimension 1 (all innermost strides == 1), step the parent indices by the literal 1 instead of the runtime stride. This lets the compiler emit streaming SIMD loads/stores. The post-loop index correction reuses the existing return-stride expressions, which are numerically identical because the stride equals 1.

The change is confined to the generated expression for the non-reduction innermost loop; non-contiguous (e.g. transposed) inputs are unaffected and take the existing path.
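
The shape of the branch can be sketched as follows. This is a hand-written stand-in for the generated innermost loop under simplifying assumptions (two arrays, plain vectors as parents, a copy body); `inner_copy!` and its arguments are hypothetical names, and the exact form of the index correction in the real kernel may differ.

```julia
# Hypothetical stand-in for the generated innermost loop; not the actual
# Strided.jl code. IC and IA are linear parent indices, sC and sA are the
# runtime strides along loop dimension 1, n is the loop length.
function inner_copy!(C::Vector{Float64}, A::Vector{Float64},
                     IC::Int, IA::Int, n::Int, sC::Int, sA::Int)
    if sC == 1 && sA == 1
        # Fast path: step by the literal 1, so the accesses are provably contiguous.
        @inbounds @simd for _ in 1:n
            C[IC] = A[IA]
            IC += 1
            IA += 1
        end
    else
        # Existing path: step by the runtime strides.
        @inbounds @simd for _ in 1:n
            C[IC] = A[IA]
            IC += sC
            IA += sA
        end
    end
    # Post-loop correction: the same stride-based expressions serve both
    # branches, because on the fast path n * sC == n * 1.
    IC -= n * sC
    IA -= n * sA
    return IC, IA
end

A = collect(1.0:12.0)
Ccontig = zeros(12)
idx_contig = inner_copy!(Ccontig, A, 1, 1, 12, 1, 1)   # takes the fast path
Cstrided = zeros(12)
idx_strided = inner_copy!(Cstrided, A, 1, 1, 6, 2, 2)  # takes the existing path
```

In both calls the returned indices are back at their starting values, which is why the correction does not need a separate fast-path variant.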

Results

Contiguous 400×400 Float64, single thread:

| op | before | after |
|---|---|---|
| copy! (contiguous) | ~300 µs | ~91 µs (~3.3×) |
| copy! (transposed input) | ~293 µs | ~259 µs (unchanged path) |

The contiguous result now matches the compile-time-constant-stride ideal.

Testing

  • Full Strided test suite passes, single- and multi-threaded (JULIA_NUM_THREADS=4): map/scale!/axpy!/axpby!, copy, broadcasting, mapreduce, reduce, mul!.
  • Additional correctness sweep (81 cases over Float32/Float64/ComplexF64 × ndims 1–4, covering copy!/conj!/permuted copies/scaled map!/binary map!/reductions/axpby): max error 0.0.

Notes / possible follow-ups

  • Reduction loops (the iszero(stride) hoist branch) still gather contiguous inputs (e.g. sum over a contiguous array). An analogous unit-stride branch there would help; left out to keep this change focused.
  • The gather/scatter code path remains in the binary as the fallback for the genuinely strided case; only the runtime branch taken for contiguous data changes.
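
A possible shape for the reduction follow-up mentioned in the first note, as a rough sketch only. It is not part of this PR, `strided_sum` is a hypothetical name, and whether the fast branch actually vectorizes better would need to be measured.

```julia
# Hypothetical sketch of a unit-stride branch for a reduction loop.
# IA is the starting parent index, sA the runtime stride, n the loop length.
function strided_sum(A::Vector{Float64}, IA::Int, n::Int, sA::Int)
    acc = 0.0
    if sA == 1
        # Contiguous input: index with a literal step of 1.
        @inbounds @simd for i in 0:(n - 1)
            acc += A[IA + i]
        end
    else
        # Genuinely strided input: keep the runtime stride.
        @inbounds @simd for i in 0:(n - 1)
            acc += A[IA + i * sA]
        end
    end
    return acc
end

A = collect(1.0:10.0)
total_contig = strided_sum(A, 1, 10, 1)   # sums all ten elements
total_strided = strided_sum(A, 1, 5, 2)   # sums elements 1, 3, 5, 7, 9
```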

🤖 Generated with Claude Code

lkdvos and others added 2 commits June 24, 2026 20:41
The innermost (vectorized) loop dimension steps the parent indices by the
arrays' strides, which are runtime values. Even when the data is contiguous,
the compiler cannot prove unit stride and auto-vectorizes the loop with
gather/scatter instructions, which do not stream memory. For a contiguous
400x400 Float64 `copy!` this runs at ~8.5 GB/s (~300 us) instead of the
~33 GB/s a contiguous SIMD loop achieves.

Add a runtime branch: when every array is contiguous along loop dimension 1
(all innermost strides == 1), step the indices by the literal `1` so the
compiler emits streaming SIMD loads/stores. The post-loop index correction
reuses the existing return-stride expressions, which are numerically identical
because the stride equals 1.

Measured (contiguous 400x400 Float64, single thread): `copy!` 300 us -> 91 us
(~3.3x), matching the compile-time-constant-stride ideal. Non-contiguous
(e.g. transposed) inputs are unaffected and keep the existing path. Full test
suite passes (single- and multi-threaded).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refactor comments for clarity and conciseness.

lkdvos commented Jun 25, 2026

Member Author

TL;DR: if a kernel hits a case where all accesses happen to be stride 1, the compiler cannot detect this because the strides are runtime values. Instead of contiguous loads it then emits gather/scatter machine instructions (at least on my machine). These are significantly slower, and for many of our use cases we deliberately lay out our tensors so that we hit this contiguous case as often as possible.

As a side note, I found this while experimenting with map! vs _mapreducedim! with an init-op, where somehow C[I1] = A[I2] was slower than C[I1] = C[I1] + A[I2], which really didn't make sense to me.
Inspecting the machine code, it turns out that in the former case the compiler emitted SIMD instructions (requiring gather/scatter because the stride is not known to be 1), while in the latter it did not, because its cost model judged scalar instructions to be more efficient than two gathers plus a scatter.
This cost model is slightly inaccurate: on my machine the scalar instructions actually end up faster for the plain copy as well, which made copy! slower than _mapreducedim! with a beta.
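
One way to repeat this kind of inspection is to dump the native code for both loop bodies and search it for gather/scatter instructions. The sketch below assumes simplified stand-in kernels (`copy_kernel!`, `accumulate_kernel!`) and a hypothetical helper `uses_gather_scatter`; the answer it prints is specific to the CPU and the Julia/LLVM version, so no particular output is implied.

```julia
using InteractiveUtils  # provides code_native

# Hypothetical kernels mirroring the two loop bodies; strides are runtime values.
function copy_kernel!(C, A, n, sC, sA)
    @inbounds @simd for i in 0:(n - 1)
        C[1 + i * sC] = A[1 + i * sA]
    end
    return C
end

function accumulate_kernel!(C, A, n, sC, sA)
    @inbounds @simd for i in 0:(n - 1)
        C[1 + i * sC] = C[1 + i * sC] + A[1 + i * sA]
    end
    return C
end

# Return true if the native code generated for `f` mentions a gather or
# scatter instruction (for example vgatherqpd / vscatterqpd on x86).
function uses_gather_scatter(f)
    types = (Vector{Float64}, Vector{Float64}, Int, Int, Int)
    asm = sprint(io -> code_native(io, f, types; debuginfo = :none))
    return occursin("gather", asm) || occursin("scatter", asm)
end

println("copy body:       ", uses_gather_scatter(copy_kernel!))
println("accumulate body: ", uses_gather_scatter(accumulate_kernel!))
```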

codecov Bot commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/mapreduce.jl | 93.33% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/mapreduce.jl | 81.13% <93.33%> (+0.51%) ⬆️ |

@lkdvos lkdvos requested a review from Jutho June 25, 2026 15:29
Comment thread src/mapreduce.jl Outdated
Comment on lines +351 to +354
unitstridecond = reduce(
    (a, b) -> :($a && $b),
    [:($(stridevars[1, j]) == 1) for j in 1:M]
)

Member

A standard && expression with more than 2 arguments always seems to be nested, i.e.

:(cond1 && cond2 && cond3) yields Expr(:&&, cond1, Expr(:&&, cond2, cond3)).

It does however seem that we can manually build an && Expr with more than 2 arguments: Expr(:&&, cond1, cond2, cond3). While this expression object is not printed nicely, it is accepted as valid code (as I've checked by running eval on it). So we could do

Suggested change
- unitstridecond = reduce(
-     (a, b) -> :($a && $b),
-     [:($(stridevars[1, j]) == 1) for j in 1:M]
- )
+ unitstridecond = Expr(:&&, [:($(stridevars[1, j]) == 1) for j in 1:M]...)

but the macro-expanded code would look a bit weird.
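
The three expression shapes discussed here can be compared directly. In this sketch the placeholder symbols `c1`, `c2`, `c3` stand in for the stride comparisons, and `evaluate` is a hypothetical helper that binds them to values and evaluates the expression.

```julia
# Placeholder condition symbols; in the PR these are stride comparisons.
conds = [:c1, :c2, :c3]

# Parsing a chained && gives a right-nested binary expression:
# two arguments, the second of which is itself an && expression.
parsed = :(c1 && c2 && c3)

# Folding with reduce, as in the original code, nests to the left instead.
folded = reduce((a, b) -> :($a && $b), conds)

# A manually built n-ary && holds all conditions as direct arguments.
flat = Expr(:&&, conds...)

# Hypothetical helper: bind the placeholders to `vals` in a let block and
# evaluate the expression.
evaluate(ex, vals) =
    eval(Expr(:let, Expr(:block, (:($c = $v) for (c, v) in zip(conds, vals))...), ex))

all_true = evaluate(flat, (true, true, true))
one_false = evaluate(flat, (true, false, true))
```

All three forms are meant to be logically equivalent; they differ only in the structure of the `Expr` and therefore in how the expanded code prints.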

Member Author

Now that I look at it, I'm not sure why I went for the short-circuit in the first place; it is probably unnecessary, and it might even be faster to simply check all of them. I have replaced this with all(==(1), (stride_1_1, stride_2_1, ...)), which presumably gives the exact same machine code but looks a bit cleaner.
Let me know what you think.
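
A condition of that form could be generated roughly as follows. This is a sketch under assumptions: the variable naming scheme `stride_<array>_<dim>` is inferred from the names quoted in the comment, `M = 3` is an arbitrary example, and `check` is a hypothetical helper used only to exercise the expression; the code actually committed may build it differently.

```julia
M = 3  # number of arrays taking part in the kernel (example value)

# Assumed naming scheme: stride_<array>_<dim>, here for loop dimension 1.
stridevars = [Symbol(:stride_, j, :_, 1) for j in 1:M]

# Build the expression all(==(1), (stride_1_1, stride_2_1, stride_3_1)).
unitstridecond = :(all(==(1), ($(stridevars...),)))

# Hypothetical helper: bind the stride variables to `vals` and evaluate.
check(vals) =
    eval(Expr(:let, Expr(:block, (:($s = $v) for (s, v) in zip(stridevars, vals))...),
              unitstridecond))

contiguous = check((1, 1, 1))
transposed = check((1, 400, 1))
```

Because the strides are collected in a tuple, `all` is applied to a fixed-length argument and no nested `&&` chain appears in the expanded code.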
