Seed dense arrays of isbits duals without scalar indexing (fixes GPU jacobians) #816
|
Because the broadcasts were purposefully removed due to performance things, we should at least recover the GPU functionality by handling AbstractGPUArray. |
Codecov Report
✅ All modified and coverable lines are covered by tests.

| | master | #816 | +/- |
|---|---|---|---|
| Coverage | 90.74% | 90.90% | +0.16% |
| Files | 11 | 11 | |
| Lines | 1070 | 1089 | +19 |
| Hits | 971 | 990 | +19 |
| Misses | 99 | 99 | |
|
It's difficult to verify this claim given that the … As an alternative to an extension, I wonder whether special-casing … |
ForwardDiff 1.x seeds duals with scalar setindex! loops over
structural_eachindex, which errors on GPU arrays ("scalar indexing is
disallowed") — a regression vs 0.10's broadcast seeding. Add a fast path
to the four seed! methods gated on
duals isa DenseArray && isbitstype(V) && !Base.has_offset_axes(duals, x)
using broadcast for the bulk writes and map! over contiguous views for
the chunk writes. AbstractGPUArray <: DenseArray, so this restores GPU
jacobians without a weak dependency, and it is faster than the scalar
loop on the CPU as well: the structural path pays an O(index)
Iterators.drop walk per chunk, i.e. O(n^2/N) per chunked sweep.
map! (not broadcast) is used for the chunk writes because slicing the
seeds tuple at runtime allocates, and for the index:end write because
the broadcast dotview allocates under --check-bounds=yes on Julia 1.10.
Structural wrappers, non-isbits values (unset-element handling), and
offset axes keep using the structural path unchanged.
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vx7zQ96NYk4VV4ML2s3kAC
18a692a to
35d4f92
|
Good suggestion: I benchmarked it, and the … On the interface concern: dispatching on … `duals isa DenseArray && isbitstype(V) && !Base.has_offset_axes(duals, x)` so nothing needs to be assumed about GPU element types either; non-isbits …
|
| method | writes | loop (master) | broadcast | map! |
|---|---|---|---|---|
| `seed!(duals, x, seed)`, n=1000 | all n | 4.1 μs | **2.7 μs** | 4.1 μs |
| `seed!(duals, x, seeds)`, n=1000 | first N | 56 ns | 56 ns | **52 ns** |
| `seed!(duals, x, 500, seed)`, n=1000 | index:end | 3.0 μs | 1.4 μs | **2.0 μs** |
| `seed!(duals, x, 500, seeds)`, n=1000 | N at index | 458 ns | 1.6 μs (7 allocs) | **62 ns** |
| `seed!(duals, x, 50000, seeds)`, n=100000 | N at index | 40 μs | 1.6 μs (7 allocs) | **68 ns** |
So the answer to "map! or broadcast?" is: both, per method (bold = what the PR now uses).
- For the full-array write (method 1), broadcast wins; on 1.10 `map!` is actually ~25% slower than the loop there, so the PR uses broadcast.
- For method 3 (`index:end`), broadcast is fastest but its dotview allocates 112 bytes under `--check-bounds=yes` on Julia 1.10 (which `Pkg.test` uses, and `test/AllocationsTest.jl` rightly rejects), so the PR uses `map!` over views: allocation-free on both versions and still 1.5× faster than the loop.
- For the chunk writes (methods 2 and 4), `map!` wins, but the seeds `NTuple` can't be a `map!` source (there's no `map!(f, dest, ::Tuple)` method), and slicing it at runtime (`seeds[1:chunksize]`, what both 0.10 and the extension in the first iteration of this PR did) is what costs the 1.6 μs + allocations in the broadcast column. `map!` over the index range with a closure over the tuple avoids both, and GPUArrays' `map!` handles the range argument fine.
- The last row is the real story for chunk mode: the structural path pays `Iterators.drop(structural_eachindex(duals, x), offset)`, which is O(index) per chunk, so O(n²/N) per full chunked jacobian/gradient sweep. The dense path indexes the chunk directly.
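The chunk-write idea from the bullets above (a `map!` over the index range, with a closure that reads the seeds tuple, writing into a contiguous view) can be illustrated with a minimal standalone sketch. `ToyDual` and `seed_chunk!` are hypothetical names invented for this illustration; they are not ForwardDiff's types or the PR's actual code.

```julia
# Toy stand-in for a dual number: a value plus an N-tuple of partials.
# This is NOT ForwardDiff.Dual; it only illustrates the write pattern.
struct ToyDual{N}
    value::Float64
    partials::NTuple{N,Float64}
end

# Hypothetical chunk seeding: write up to M seeded duals starting at `index`.
function seed_chunk!(duals::Vector{ToyDual{N}}, x::Vector{Float64}, index::Int,
                     seeds::NTuple{M,NTuple{N,Float64}}) where {N,M}
    offset = index - 1
    chunksize = min(M, length(x) - offset)           # last chunk may be short
    dest = view(duals, index:(offset + chunksize))   # contiguous destination
    # The closure indexes seeds[i] directly, so the tuple is never sliced.
    map!(i -> ToyDual{N}(x[offset + i], seeds[i]), dest, 1:chunksize)
    return duals
end

x = [10.0, 20.0, 30.0, 40.0, 50.0]
duals = [ToyDual{2}(xi, (0.0, 0.0)) for xi in x]
seeds = ((1.0, 0.0), (0.0, 1.0))
seed_chunk!(duals, x, 3, seeds)   # seeds elements 3 and 4 only
```

Only the elements inside the chunk are rewritten; elements before and after keep their previous partials.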
On the 2018 revert (#354) that removed broadcast for performance: the regression jrevels measured there was pre-Julia-1.0 broadcast plus an allocating x[dual_inds] slice; with views and Ref none of these forms allocate today.
End-to-end (master → this PR, Julia 1.12.4)
- `gradient!` n=1000 (chunk 12): 464 μs → 344 μs
- `gradient!` n=100000 (chunk 12, median of 7 runs): 4.80 s → 3.96 s
- `jacobian!` n=100: 63 μs → 46 μs
- `jacobian!` n=1000: 5.92 ms → 4.84 ms
- `gradient!` n=10 (vector mode): 141 ns → 158 ns, the one regression; `seed!` itself measures identical (46 ns both), so it's an inlining side effect of the added branch.
One more thing that turned up while measuring: the 1.x rewrite also changed the chunk-mode "unseed" call seed!(xdual, x, i) to write from i to the end of the array each chunk — 0.10 wrote exactly the N chunk elements — which is O(n²) dual writes per sweep (~40 GB of memory traffic for the n=100000 gradient above). Restoring the narrow unseed on top of this PR takes the n=100000 gradient! from ~3.9 s to ~2.4 s. I've kept that out of this PR to keep it focused; follow-up PR coming.
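As a rough check on the O(n²) figure, the write counts can be tallied directly. This is a back-of-the-envelope sketch under stated assumptions: `Float64` values, chunk size 12 (so one dual occupies 13 `Float64`s), and exactly one unseed call per chunk. None of these details are confirmed by the text beyond the chunk size and n.

```julia
# Assumptions: Float64 values, chunk size N = 12, one unseed call per chunk.
n, N = 100_000, 12
bytes_per_dual = (N + 1) * sizeof(Float64)   # value + N partials

chunk_starts = 1:N:n
# Wide unseed: each call writes from the chunk start to the end of the array.
wide_writes = sum(n - i + 1 for i in chunk_starts)
# Narrow unseed: each call writes only its own chunk, so n writes in total.
narrow_writes = n

wide_gb = wide_writes * bytes_per_dual / 1e9
```

Under these assumptions the wide unseed performs roughly 4.2 × 10⁸ dual writes (about n²/(2N)), on the order of 43 GB, which is in line with the ~40 GB quoted above, while the narrow form writes each dual once.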
JLArray tests (vector/chunk mode, jacobian!, f!, matrix and view inputs) now exercise the main-package code path directly; the extension and weak dep are gone.
|
CI status: all 19 Julia 1 / lts / min-patch jobs (3 OSes × NaN-safe on/off) plus Documentation are green. The 6 "Julia pre" jobs fail, which needs a note:
The chunk-unseed O(n²) follow-up mentioned above is now open as #821 (stacked on this branch). |
Summary
ForwardDiff 1.x rewrote the four `seed!` methods in `src/apiutils.jl` to write each dual with a scalar `setindex!` loop over `structural_eachindex`. On GPU arrays this triggers `ERROR: Scalar indexing is disallowed`, so every `ForwardDiff.jacobian`/`jacobian!` call on a GPU array (e.g. `CuArray`) errors in `seed!`. ForwardDiff 0.10 seeded with broadcast and worked on GPU arrays, so this is a regression for GPU usage introduced in the 1.0 rewrite. First hit downstream in SciML/ComplementaritySolve.jl#65.
Fix
Following @devmotion's suggestion, this now special-cases dense arrays in the main package instead of adding a `GPUArraysCore` extension (the first iteration of this PR). Key fact: `GPUArraysCore.AbstractGPUArray <: DenseArray`, so a `DenseArray` fast path covers all GPU array types without depending on anything about the (undocumented) `AbstractGPUArray` interface, and without a weak dependency.

Each of the four `seed!` methods takes a fast path when `duals isa DenseArray && isbitstype(V) && !Base.has_offset_axes(duals, x)`, using broadcast for the full-array write and `map!` over contiguous views for the indexed writes (`index:end` and `N`-at-offset chunks). Everything else (structural wrappers like `UpperTriangular`/`Diagonal`, non-isbits values with unset elements, offset axes) falls through to the existing structural path, unchanged.

This turns out to be faster than the scalar loop on the CPU as well, because the structural path pays `Iterators.drop(structural_eachindex(duals, x), offset)`, an O(index) walk per chunk, i.e. O(n²/N) per full chunked jacobian/gradient sweep, while the dense path indexes the chunk directly.

`map!` (not broadcast) is used for the chunk writes because slicing the seeds tuple at runtime (`seeds[1:chunksize]`, as 0.10 did) allocates, and for the `index:end` write because the broadcast dotview allocates under `--check-bounds=yes` on Julia 1.10; the `map!` forms are allocation-free on 1.10 and 1.12 with and without bounds checks, and work on GPU arrays.
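To see which inputs the gate quoted in the commit message would admit, here is a small standalone sketch of the predicate. `use_dense_path` is a hypothetical helper name, and taking `V` to be `eltype(x)` is an assumption made for this illustration; the PR's `seed!` methods obtain `V` from their own signatures.

```julia
using LinearAlgebra: UpperTriangular

# Hypothetical helper mirroring the gate; V is assumed to be eltype(x) here.
use_dense_path(duals, x) =
    duals isa DenseArray && isbitstype(eltype(x)) && !Base.has_offset_axes(duals, x)

x = rand(4)
dense_ok = use_dense_path(similar(x), x)               # plain Vector{Float64}

m = rand(2, 2)
wrapped = use_dense_path(UpperTriangular(m), m)        # structural wrapper, not a DenseArray

bigx = BigFloat.(x)
nonisbits = use_dense_path(similar(bigx), bigx)        # BigFloat is not an isbits type
```

In this sketch only the plain `Vector{Float64}` case passes; the wrapper and the non-isbits case are the kinds of input that would stay on the structural path.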
Benchmarks
Julia 1.12.4 (1.10.11 agrees on all trends), `x = rand(n)`, defaults (`chunk = 12` for n ≥ 1000):
- `seed!(duals, x, seeds)` n=1000
- `seed!(duals, x, 501, seeds)` n=1000
- `gradient!` n=10 (vector mode)
- `gradient!` n=1000
- `gradient!` n=100000 (median of 7)
- `jacobian!` n=100
- `jacobian!` n=1000

The isolated `seed!` comparison per method (loop vs broadcast vs `map!`) is in the review discussion below.
The only measured regression is vector-mode `gradient!` at n=10 (+15 ns, ~10%); `seed!` itself measures identical there (46 ns both), so it is an inlining/codegen side effect of the added branch.
Scope
This covers the jacobian paths (vector and chunk mode), which is what `seed!` feeds. `ForwardDiff.gradient` on GPU arrays has a second scalar-indexing site in the gradient extraction path (`extract_gradient_chunk!`) that this PR does not touch; fixing `seed!` is the necessary first step and resolves the jacobian regression.

Separately, the 1.x rewrite made the chunk-mode "unseed" call `seed!(xdual, x, i)` write from `i` to the end of the array (0.10 wrote exactly the `N` chunk elements), which is O(n²) dual writes per sweep. Restoring the narrow unseed on top of this PR takes `gradient!` at n=100000 from ~3.9 s to ~2.4 s. That is left for a follow-up PR to keep this one reviewable.
Tests
`test/GPUArraysTest.jl` exercises all four `seed!` methods through `jacobian`, `jacobian!`, chunked configs, matrix inputs, and view inputs using `JLArrays`, which emulates GPU array semantics (including the scalar-indexing ban) on the CPU, so the GPU code path is covered in CI without a physical GPU.
Note
Opened as a draft by an agent on behalf of @ChrisRackauckas. Please ignore
until reviewed by @ChrisRackauckas.
🤖 Generated with Claude Code