Seed dense arrays of isbits duals without scalar indexing (fixes GPU jacobians) by ChrisRackauckas-Claude · Pull Request #816 · JuliaDiff/ForwardDiff.jl

ChrisRackauckas-Claude · 2026-06-13T08:23:27Z

Summary

ForwardDiff 1.x rewrote the four seed! methods in src/apiutils.jl to write each
dual with a scalar setindex! loop over structural_eachindex. On GPU arrays this
triggers ERROR: Scalar indexing is disallowed, so every ForwardDiff.jacobian /
jacobian! call on a GPU array (e.g. CuArray) errors in seed!. ForwardDiff 0.10
seeded with broadcast and worked on GPU arrays, so this is a regression for GPU usage
introduced in the 1.0 rewrite. First hit downstream in SciML/ComplementaritySolve.jl#65.

Fix

Following @devmotion's suggestion,
this now special-cases dense arrays in the main package instead of adding a
GPUArraysCore extension (the first iteration of this PR). Key fact:
GPUArraysCore.AbstractGPUArray <: DenseArray, so a DenseArray fast path covers all
GPU array types without depending on anything about the (undocumented)
AbstractGPUArray interface — and without a weak dependency.

Each of the four seed! methods takes a fast path when

duals isa DenseArray && isbitstype(V) && !Base.has_offset_axes(duals, x)

using broadcast for the full-array write and map! over contiguous views for the
indexed writes (index:end and N-at-offset chunks). Everything else
(structural wrappers like UpperTriangular/Diagonal, non-isbits values with unset
elements, offset axes) falls through to the existing structural path, unchanged.

This turns out to be faster than the scalar loop on the CPU as well, because the
structural path pays Iterators.drop(structural_eachindex(duals, x), offset) — an
O(index) walk per chunk, i.e. O(n²/N) per full chunked jacobian/gradient sweep — while
the dense path indexes the chunk directly. map! (not broadcast) is used for the
chunk writes because slicing the seeds tuple at runtime (seeds[1:chunksize], as 0.10
did) allocates, and for the index:end write because the broadcast dotview allocates
under --check-bounds=yes on Julia 1.10; the map! forms are allocation-free on 1.10
and 1.12 with and without bounds checks, and work on GPU arrays.

Benchmarks

Julia 1.12.4 (1.10.11 agrees on all trends), x = rand(n), defaults (chunk = 12 for n ≥ 1000):

benchmark	master	this PR
`seed!(duals, x, seeds)` n=1000	64 ns	50 ns
`seed!(duals, x, 501, seeds)` n=1000	435 ns	65 ns
`gradient!` n=10 (vector mode)	141 ns	158 ns
`gradient!` n=1000	464 μs	344 μs
`gradient!` n=100000 (median of 7)	4.80 s	3.96 s
`jacobian!` n=100	63 μs	46 μs
`jacobian!` n=1000	5.92 ms	4.84 ms

The isolated seed! comparison per method (loop vs broadcast vs map!) is in
the review discussion below.

The only measured regression is vector-mode gradient! at n=10 (+15 ns, ~10%);
seed! itself measures identical there (46 ns both), so it is an
inlining/codegen side effect of the added branch.

Scope

This covers the jacobian paths (vector and chunk mode), which is what seed!
feeds. ForwardDiff.gradient on GPU arrays has a second scalar-indexing site in the
gradient extraction path (extract_gradient_chunk!) that this PR does not touch;
fixing seed! is the necessary first step and resolves the jacobian regression.

Separately, the 1.x rewrite made the chunk-mode "unseed" call seed!(xdual, x, i)
write from i to the end of the array (0.10 wrote exactly the N chunk elements),
which is O(n²) dual writes per sweep. Restoring the narrow unseed on top of this PR
takes gradient! at n=100000 from ~3.9 s to ~2.4 s. That is left for a follow-up PR
to keep this one reviewable.

Tests

test/GPUArraysTest.jl exercises all four seed! methods through jacobian,
jacobian!, chunked configs, matrix inputs, and view inputs using JLArrays, which
emulates GPU array semantics (including the scalar-indexing ban) on the CPU — so the
GPU code path is covered in CI without a physical GPU.

Note

Opened as a draft by an agent on behalf of @ChrisRackauckas. Please ignore
until reviewed by @ChrisRackauckas.

🤖 Generated with Claude Code

ChrisRackauckas · 2026-06-13T08:25:19Z

Because the broadcasts were purposefully removed due to performance things, we should at least recover the GPU functionality by handling AbstractGPUArray.

codecov · 2026-06-13T08:29:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.90%. Comparing base (090ddbb) to head (35d4f92).

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #816      +/-   ##
==========================================
+ Coverage   90.74%   90.90%   +0.16%     
==========================================
  Files          11       11              
  Lines        1070     1089      +19     
==========================================
+ Hits          971      990      +19     
  Misses         99       99

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

devmotion · 2026-06-15T22:16:20Z

GPU arrays are always
dense, one-based, and carry isbits element types, so the
structural_eachindex / unset-element handling of the generic methods is
unnecessary on this path and plain broadcast suffices.

It's difficult to verify this claim given that the AbstractGPUArray interface is not documented: https://juliagpu.github.io/GPUArrays.jl/dev/functionality/host/

As an alternative to an extension, I wonder whether special-casing DenseArrays with isbits types to use e.g. map! (broadcasting caused performance regressions IIRC but maybe that would perform better? should be compared with eachindex though) could fix GPU support as well.

ForwardDiff 1.x seeds duals with scalar setindex! loops over structural_eachindex, which errors on GPU arrays ("scalar indexing is disallowed") — a regression vs 0.10's broadcast seeding. Add a fast path to the four seed! methods gated on duals isa DenseArray && isbitstype(V) && !Base.has_offset_axes(duals, x) using broadcast for the bulk writes and map! over contiguous views for the chunk writes. AbstractGPUArray <: DenseArray, so this restores GPU jacobians without a weak dependency, and it is faster than the scalar loop on the CPU as well: the structural path pays an O(index) Iterators.drop walk per chunk, i.e. O(n^2/N) per chunked sweep. map! (not broadcast) is used for the chunk writes because slicing the seeds tuple at runtime allocates, and for the index:end write because the broadcast dotview allocates under --check-bounds=yes on Julia 1.10. Structural wrappers, non-isbits values (unset-element handling), and offset axes keep using the structural path unchanged. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Vx7zQ96NYk4VV4ML2s3kAC

ChrisRackauckas-Claude · 2026-07-03T16:19:36Z

Good suggestion — I benchmarked it, and the DenseArray special-case is both viable and faster than the current scalar loops on the CPU, so I've reworked the PR to that approach and dropped the extension entirely.

On the interface concern: dispatching on DenseArray sidesteps it. GPUArraysCore.AbstractGPUArray <: DenseArray (it's declared that way in GPUArraysCore itself), so a dense fast path in the main package catches all GPU array types without relying on any undocumented AbstractGPUArray property. The gate is

duals isa DenseArray && isbitstype(V) && !Base.has_offset_axes(duals, x)

so nothing needs to be assumed about GPU element types either — non-isbits V (which needs the unset-element handling) and offset axes fall through to the existing structural path.

`map!` vs broadcast vs the `eachindex` loop, per `seed!` method

Isolated seed! benchmarks, Julia 1.12.4 (1.10.11 agrees on all trends), Vector{Float64} input, chunk N = 12:

method	writes	loop (master)	broadcast	`map!`
`seed!(duals, x, seed)`, n=1000	all n	4.1 μs	2.7 μs	4.1 μs
`seed!(duals, x, seeds)`, n=1000	first N	56 ns	56 ns	52 ns
`seed!(duals, x, 500, seed)`, n=1000	index:end	3.0 μs	1.4 μs	2.0 μs
`seed!(duals, x, 500, seeds)`, n=1000	N at index	458 ns	1.6 μs (7 allocs)	62 ns
`seed!(duals, x, 50000, seeds)`, n=100000	N at index	40 μs	1.6 μs (7 allocs)	68 ns

So the answer to "map! or broadcast?" is: both, per method (bold = what the PR now uses).

For the full-array write (method 1), broadcast wins; on 1.10 map! is actually ~25% slower than the loop there, so the PR uses broadcast.
For method 3 (index:end), broadcast is fastest but its dotview allocates 112 bytes under --check-bounds=yes on Julia 1.10 (which Pkg.test uses, and test/AllocationsTest.jl rightly rejects), so the PR uses map! over views — allocation-free on both versions and still 1.5× faster than the loop.
For the chunk writes (methods 2 and 4), map! wins — but the seeds NTuple can't be a map! source (there's no map!(f, dest, ::Tuple) method), and slicing it at runtime (seeds[1:chunksize], what both 0.10 and the extension in the first iteration of this PR did) is what costs the 1.6 μs + allocations in the broadcast column. map! over the index range with a closure over the tuple avoids both, and GPUArrays' map! handles the range argument fine.
The last row is the real story for chunk mode: the structural path pays Iterators.drop(structural_eachindex(duals, x), offset) — O(index) per chunk, so O(n²/N) per full chunked jacobian/gradient sweep. The dense path indexes the chunk directly.

On the 2018 revert (#354) that removed broadcast for performance: the regression jrevels measured there was pre-Julia-1.0 broadcast plus an allocating x[dual_inds] slice; with views and Ref none of these forms allocate today.

End-to-end (master → this PR, Julia 1.12.4)

gradient! n=1000 (chunk 12): 464 μs → 344 μs
gradient! n=100000 (chunk 12, median of 7 runs): 4.80 s → 3.96 s
jacobian! n=100: 63 μs → 46 μs
jacobian! n=1000: 5.92 ms → 4.84 ms
gradient! n=10 (vector mode): 141 ns → 158 ns — the one regression; seed! itself measures identical (46 ns both), so it's an inlining side effect of the added branch.

One more thing that turned up while measuring: the 1.x rewrite also changed the chunk-mode "unseed" call seed!(xdual, x, i) to write from i to the end of the array each chunk — 0.10 wrote exactly the N chunk elements — which is O(n²) dual writes per sweep (~40 GB of memory traffic for the n=100000 gradient above). Restoring the narrow unseed on top of this PR takes the n=100000 gradient! from ~3.9 s to ~2.4 s. I've kept that out of this PR to keep it focused; follow-up PR coming.

JLArray tests (vector/chunk mode, jacobian!, f!, matrix and view inputs) now exercise the main-package code path directly; the extension and weak dep are gone.

ChrisRackauckas-Claude · 2026-07-03T16:45:55Z

CI status: all 19 Julia 1 / lts / min-patch jobs (3 OSes × NaN-safe on/off) plus Documentation are green. The 6 "Julia pre" jobs fail, which needs a note:

Master's pre jobs already fail (every run since ≥2025-12): the JET.@test_opt tests in test/QATest.jl report runtime dispatches on Julia 1.13.0-rc1 whose leaf frames are all in Base internals (broadcast preprocess, simd_index/CartesianIndex, SubArray construction, mightalias/_in_tuple). Reproduced locally on a clean master clone with 1.13.0-rc1: same failures, byte-identical reports. JET's own Julia 1.13 support is still in progress (Add Julia 1.13 support while maintaining Julia 1.12 compatibility aviatesk/JET.jl#815, Declare documented API public (for downstream ExplicitImports) #819), so these look like upstream Julia/JET issues, not ForwardDiff regressions.
This PR adds one more instance of that same class: the gradient QA test (QATest.jl:11) now also reports a Base-broadcast preprocess dispatch on 1.13-rc1, coming from the new duals .= Dual{T,V,N}.(x, Ref(seed)) in seed!. It is clean on 1.10/1.12 (those QA jobs pass).
I checked the obvious workarounds on 1.13-rc1: both a closure broadcast ((xi -> Dual{T,V,N}(xi, seed)).(x)) and map! are JET-clean — but both measure 35–50% slower than the Type broadcast for the full-array seed on 1.10 and 1.12 (the Type constructor broadcast SIMD-vectorizes better). Trading permanent seeding performance for a prerelease-only JET false positive seemed like the wrong call, so I left the broadcast in. If the reports persist once JET's 1.13 support settles, that decision can be revisited.

The chunk-unseed O(n²) follow-up mentioned above is now open as #821 (stacked on this branch).

ChrisRackauckas-Claude mentioned this pull request Jun 13, 2026

CUDA solver paths error with ForwardDiff >= 1: seed! scalar-indexes GPU dual arrays SciML/ComplementaritySolve.jl#65

Open

ChrisRackauckas requested a review from devmotion June 13, 2026 08:24

ChrisRackauckas-Claude force-pushed the gpuarrayscore-seed-ext branch from 18a692a to 35d4f92 Compare July 3, 2026 16:18

ChrisRackauckas-Claude changed the title ~~Add GPUArraysCore extension: broadcast-based seed! for GPU dual arrays~~ Seed dense arrays of isbits duals without scalar indexing (fixes GPU jacobians) Jul 3, 2026

ChrisRackauckas-Claude mentioned this pull request Jul 3, 2026

Unseed only the chunk in 3-arg seed! #821

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Seed dense arrays of isbits duals without scalar indexing (fixes GPU jacobians)#816

Seed dense arrays of isbits duals without scalar indexing (fixes GPU jacobians)#816
ChrisRackauckas-Claude wants to merge 1 commit into
JuliaDiff:masterfrom
ChrisRackauckas-Claude:gpuarrayscore-seed-ext

ChrisRackauckas-Claude commented Jun 13, 2026 •

edited

Loading

Uh oh!

ChrisRackauckas commented Jun 13, 2026

Uh oh!

codecov Bot commented Jun 13, 2026 •

edited

Loading

Uh oh!

devmotion commented Jun 15, 2026

Uh oh!

ChrisRackauckas-Claude commented Jul 3, 2026

Uh oh!

ChrisRackauckas-Claude commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

ChrisRackauckas-Claude commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fix

Benchmarks

Scope

Tests

Uh oh!

ChrisRackauckas commented Jun 13, 2026

Uh oh!

codecov Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

devmotion commented Jun 15, 2026

Uh oh!

ChrisRackauckas-Claude commented Jul 3, 2026

map! vs broadcast vs the eachindex loop, per seed! method

End-to-end (master → this PR, Julia 1.12.4)

Uh oh!

ChrisRackauckas-Claude commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ChrisRackauckas-Claude commented Jun 13, 2026 •

edited

Loading

codecov Bot commented Jun 13, 2026 •

edited

Loading

`map!` vs broadcast vs the `eachindex` loop, per `seed!` method