Skip to content

Seed dense arrays of isbits duals without scalar indexing (fixes GPU jacobians)#816

Draft
ChrisRackauckas-Claude wants to merge 1 commit into
JuliaDiff:masterfrom
ChrisRackauckas-Claude:gpuarrayscore-seed-ext
Draft

Seed dense arrays of isbits duals without scalar indexing (fixes GPU jacobians)#816
ChrisRackauckas-Claude wants to merge 1 commit into
JuliaDiff:masterfrom
ChrisRackauckas-Claude:gpuarrayscore-seed-ext

Conversation

@ChrisRackauckas-Claude

@ChrisRackauckas-Claude ChrisRackauckas-Claude commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Summary

ForwardDiff 1.x rewrote the four seed! methods in src/apiutils.jl to write each
dual with a scalar setindex! loop over structural_eachindex. On GPU arrays this
triggers ERROR: Scalar indexing is disallowed, so every ForwardDiff.jacobian /
jacobian! call on a GPU array (e.g. CuArray) errors in seed!. ForwardDiff 0.10
seeded with broadcast and worked on GPU arrays, so this is a regression for GPU usage
introduced in the 1.0 rewrite. First hit downstream in SciML/ComplementaritySolve.jl#65.

Fix

Following @devmotion's suggestion,
this now special-cases dense arrays in the main package instead of adding a
GPUArraysCore extension (the first iteration of this PR). Key fact:
GPUArraysCore.AbstractGPUArray <: DenseArray, so a DenseArray fast path covers all
GPU array types without depending on anything about the (undocumented)
AbstractGPUArray interface — and without a weak dependency.

Each of the four seed! methods takes a fast path when

duals isa DenseArray && isbitstype(V) && !Base.has_offset_axes(duals, x)

using broadcast for the full-array write and map! over contiguous views for the
indexed writes (index:end and N-at-offset chunks). Everything else
(structural wrappers like UpperTriangular/Diagonal, non-isbits values with unset
elements, offset axes) falls through to the existing structural path, unchanged.

This turns out to be faster than the scalar loop on the CPU as well, because the
structural path pays Iterators.drop(structural_eachindex(duals, x), offset) — an
O(index) walk per chunk, i.e. O(n²/N) per full chunked jacobian/gradient sweep — while
the dense path indexes the chunk directly. map! (not broadcast) is used for the
chunk writes because slicing the seeds tuple at runtime (seeds[1:chunksize], as 0.10
did) allocates, and for the index:end write because the broadcast dotview allocates
under --check-bounds=yes on Julia 1.10; the map! forms are allocation-free on 1.10
and 1.12 with and without bounds checks, and work on GPU arrays.

Benchmarks

Julia 1.12.4 (1.10.11 agrees on all trends), x = rand(n), defaults (chunk = 12 for n ≥ 1000):

benchmark master this PR
seed!(duals, x, seeds) n=1000 64 ns 50 ns
seed!(duals, x, 501, seeds) n=1000 435 ns 65 ns
gradient! n=10 (vector mode) 141 ns 158 ns
gradient! n=1000 464 μs 344 μs
gradient! n=100000 (median of 7) 4.80 s 3.96 s
jacobian! n=100 63 μs 46 μs
jacobian! n=1000 5.92 ms 4.84 ms

The isolated seed! comparison per method (loop vs broadcast vs map!) is in
the review discussion below.

The only measured regression is vector-mode gradient! at n=10 (+15 ns, ~10%);
seed! itself measures identical there (46 ns both), so it is an
inlining/codegen side effect of the added branch.

Scope

This covers the jacobian paths (vector and chunk mode), which is what seed!
feeds. ForwardDiff.gradient on GPU arrays has a second scalar-indexing site in the
gradient extraction path (extract_gradient_chunk!) that this PR does not touch;
fixing seed! is the necessary first step and resolves the jacobian regression.

Separately, the 1.x rewrite made the chunk-mode "unseed" call seed!(xdual, x, i)
write from i to the end of the array (0.10 wrote exactly the N chunk elements),
which is O(n²) dual writes per sweep. Restoring the narrow unseed on top of this PR
takes gradient! at n=100000 from ~3.9 s to ~2.4 s. That is left for a follow-up PR
to keep this one reviewable.

Tests

test/GPUArraysTest.jl exercises all four seed! methods through jacobian,
jacobian!, chunked configs, matrix inputs, and view inputs using JLArrays, which
emulates GPU array semantics (including the scalar-indexing ban) on the CPU — so the
GPU code path is covered in CI without a physical GPU.


Note

Opened as a draft by an agent on behalf of @ChrisRackauckas. Please ignore
until reviewed by @ChrisRackauckas.

🤖 Generated with Claude Code

@ChrisRackauckas

Copy link
Copy Markdown
Member

Because the broadcasts were purposefully removed due to performance things, we should at least recover the GPU functionality by handling AbstractGPUArray.

@codecov

codecov Bot commented Jun 13, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.90%. Comparing base (090ddbb) to head (35d4f92).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #816      +/-   ##
==========================================
+ Coverage   90.74%   90.90%   +0.16%     
==========================================
  Files          11       11              
  Lines        1070     1089      +19     
==========================================
+ Hits          971      990      +19     
  Misses         99       99              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@devmotion

Copy link
Copy Markdown
Member

GPU arrays are always
dense, one-based, and carry isbits element types, so the
structural_eachindex / unset-element handling of the generic methods is
unnecessary on this path and plain broadcast suffices.

It's difficult to verify this claim given that the AbstractGPUArray interface is not documented: https://juliagpu.github.io/GPUArrays.jl/dev/functionality/host/

As an alternative to an extension, I wonder whether special-casing DenseArrays with isbits types to use e.g. map! (broadcasting caused performance regressions IIRC but maybe that would perform better? should be compared with eachindex though) could fix GPU support as well.

ForwardDiff 1.x seeds duals with scalar setindex! loops over
structural_eachindex, which errors on GPU arrays ("scalar indexing is
disallowed") — a regression vs 0.10's broadcast seeding. Add a fast path
to the four seed! methods gated on

    duals isa DenseArray && isbitstype(V) && !Base.has_offset_axes(duals, x)

using broadcast for the bulk writes and map! over contiguous views for
the chunk writes. AbstractGPUArray <: DenseArray, so this restores GPU
jacobians without a weak dependency, and it is faster than the scalar
loop on the CPU as well: the structural path pays an O(index)
Iterators.drop walk per chunk, i.e. O(n^2/N) per chunked sweep.

map! (not broadcast) is used for the chunk writes because slicing the
seeds tuple at runtime allocates, and for the index:end write because
the broadcast dotview allocates under --check-bounds=yes on Julia 1.10.

Structural wrappers, non-isbits values (unset-element handling), and
offset axes keep using the structural path unchanged.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vx7zQ96NYk4VV4ML2s3kAC
@ChrisRackauckas-Claude ChrisRackauckas-Claude changed the title Add GPUArraysCore extension: broadcast-based seed! for GPU dual arrays Seed dense arrays of isbits duals without scalar indexing (fixes GPU jacobians) Jul 3, 2026
@ChrisRackauckas-Claude

Copy link
Copy Markdown
Contributor Author

Good suggestion — I benchmarked it, and the DenseArray special-case is both viable and faster than the current scalar loops on the CPU, so I've reworked the PR to that approach and dropped the extension entirely.

On the interface concern: dispatching on DenseArray sidesteps it. GPUArraysCore.AbstractGPUArray <: DenseArray (it's declared that way in GPUArraysCore itself), so a dense fast path in the main package catches all GPU array types without relying on any undocumented AbstractGPUArray property. The gate is

duals isa DenseArray && isbitstype(V) && !Base.has_offset_axes(duals, x)

so nothing needs to be assumed about GPU element types either — non-isbits V (which needs the unset-element handling) and offset axes fall through to the existing structural path.

map! vs broadcast vs the eachindex loop, per seed! method

Isolated seed! benchmarks, Julia 1.12.4 (1.10.11 agrees on all trends), Vector{Float64} input, chunk N = 12:

method writes loop (master) broadcast map!
seed!(duals, x, seed), n=1000 all n 4.1 μs 2.7 μs 4.1 μs
seed!(duals, x, seeds), n=1000 first N 56 ns 56 ns 52 ns
seed!(duals, x, 500, seed), n=1000 index:end 3.0 μs 1.4 μs 2.0 μs
seed!(duals, x, 500, seeds), n=1000 N at index 458 ns 1.6 μs (7 allocs) 62 ns
seed!(duals, x, 50000, seeds), n=100000 N at index 40 μs 1.6 μs (7 allocs) 68 ns

So the answer to "map! or broadcast?" is: both, per method (bold = what the PR now uses).

  • For the full-array write (method 1), broadcast wins; on 1.10 map! is actually ~25% slower than the loop there, so the PR uses broadcast.
  • For method 3 (index:end), broadcast is fastest but its dotview allocates 112 bytes under --check-bounds=yes on Julia 1.10 (which Pkg.test uses, and test/AllocationsTest.jl rightly rejects), so the PR uses map! over views — allocation-free on both versions and still 1.5× faster than the loop.
  • For the chunk writes (methods 2 and 4), map! wins — but the seeds NTuple can't be a map! source (there's no map!(f, dest, ::Tuple) method), and slicing it at runtime (seeds[1:chunksize], what both 0.10 and the extension in the first iteration of this PR did) is what costs the 1.6 μs + allocations in the broadcast column. map! over the index range with a closure over the tuple avoids both, and GPUArrays' map! handles the range argument fine.
  • The last row is the real story for chunk mode: the structural path pays Iterators.drop(structural_eachindex(duals, x), offset) — O(index) per chunk, so O(n²/N) per full chunked jacobian/gradient sweep. The dense path indexes the chunk directly.

On the 2018 revert (#354) that removed broadcast for performance: the regression jrevels measured there was pre-Julia-1.0 broadcast plus an allocating x[dual_inds] slice; with views and Ref none of these forms allocate today.

End-to-end (master → this PR, Julia 1.12.4)

  • gradient! n=1000 (chunk 12): 464 μs → 344 μs
  • gradient! n=100000 (chunk 12, median of 7 runs): 4.80 s → 3.96 s
  • jacobian! n=100: 63 μs → 46 μs
  • jacobian! n=1000: 5.92 ms → 4.84 ms
  • gradient! n=10 (vector mode): 141 ns → 158 ns — the one regression; seed! itself measures identical (46 ns both), so it's an inlining side effect of the added branch.

One more thing that turned up while measuring: the 1.x rewrite also changed the chunk-mode "unseed" call seed!(xdual, x, i) to write from i to the end of the array each chunk — 0.10 wrote exactly the N chunk elements — which is O(n²) dual writes per sweep (~40 GB of memory traffic for the n=100000 gradient above). Restoring the narrow unseed on top of this PR takes the n=100000 gradient! from ~3.9 s to ~2.4 s. I've kept that out of this PR to keep it focused; follow-up PR coming.

JLArray tests (vector/chunk mode, jacobian!, f!, matrix and view inputs) now exercise the main-package code path directly; the extension and weak dep are gone.

@ChrisRackauckas-Claude

Copy link
Copy Markdown
Contributor Author

CI status: all 19 Julia 1 / lts / min-patch jobs (3 OSes × NaN-safe on/off) plus Documentation are green. The 6 "Julia pre" jobs fail, which needs a note:

  • Master's pre jobs already fail (every run since ≥2025-12): the JET.@test_opt tests in test/QATest.jl report runtime dispatches on Julia 1.13.0-rc1 whose leaf frames are all in Base internals (broadcast preprocess, simd_index/CartesianIndex, SubArray construction, mightalias/_in_tuple). Reproduced locally on a clean master clone with 1.13.0-rc1: same failures, byte-identical reports. JET's own Julia 1.13 support is still in progress (Add Julia 1.13 support while maintaining Julia 1.12 compatibility aviatesk/JET.jl#815, Declare documented API public (for downstream ExplicitImports) #819), so these look like upstream Julia/JET issues, not ForwardDiff regressions.
  • This PR adds one more instance of that same class: the gradient QA test (QATest.jl:11) now also reports a Base-broadcast preprocess dispatch on 1.13-rc1, coming from the new duals .= Dual{T,V,N}.(x, Ref(seed)) in seed!. It is clean on 1.10/1.12 (those QA jobs pass).
  • I checked the obvious workarounds on 1.13-rc1: both a closure broadcast ((xi -> Dual{T,V,N}(xi, seed)).(x)) and map! are JET-clean — but both measure 35–50% slower than the Type broadcast for the full-array seed on 1.10 and 1.12 (the Type constructor broadcast SIMD-vectorizes better). Trading permanent seeding performance for a prerelease-only JET false positive seemed like the wrong call, so I left the broadcast in. If the reports persist once JET's 1.13 support settles, that decision can be revisited.

The chunk-unseed O(n²) follow-up mentioned above is now open as #821 (stacked on this branch).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants