Unseed only the chunk in 3-arg seed! #821
Draft
ChrisRackauckas-Claude wants to merge 2 commits into
ForwardDiff 1.x seeds duals with scalar setindex! loops over
structural_eachindex, which errors on GPU arrays ("scalar indexing is
disallowed") — a regression vs 0.10's broadcast seeding. Add a fast path
to the four seed! methods gated on
duals isa DenseArray && isbitstype(V) && !Base.has_offset_axes(duals, x)
using broadcast for the bulk writes and map! over contiguous views for
the chunk writes. AbstractGPUArray <: DenseArray, so this restores GPU
jacobians without a weak dependency, and it is faster than the scalar
loop on the CPU as well: the structural path pays an O(index)
Iterators.drop walk per chunk, i.e. O(n^2/N) per chunked sweep.
map! (not broadcast) is used for the chunk writes because slicing the
seeds tuple at runtime allocates, and for the index:end write because
the broadcast dotview allocates under --check-bounds=yes on Julia 1.10.
Structural wrappers, non-isbits values (unset-element handling), and
offset axes keep using the structural path unchanged.
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vx7zQ96NYk4VV4ML2s3kAC
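The gate quoted in the commit message above can be tried in isolation. The following is a minimal sketch, not the package's code: `use_dense_path` is a hypothetical helper name, and plain `Float64` arrays stand in for arrays of duals. Only the condition itself is taken from the commit message.

```julia
using LinearAlgebra  # only needed for the UpperTriangular example

# Hypothetical helper; the condition is the one quoted in the commit message.
# V stands for the value type carried by the duals.
use_dense_path(duals, x, ::Type{V}) where {V} =
    duals isa DenseArray && isbitstype(V) && !Base.has_offset_axes(duals, x)

x = rand(4)

# Plain Vector with isbits values.
println(use_dense_path(similar(x), x, Float64))
# Non-isbits value type (BigFloat).
println(use_dense_path(similar(x), x, BigFloat))
# Structural wrapper: UpperTriangular is not a DenseArray.
println(use_dense_path(UpperTriangular(zeros(2, 2)), zeros(2, 2), Float64))
```

Under these assumptions only the first call qualifies for the dense path; the other two would stay on the structural path, matching the fallback cases listed in the commit message.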
The chunk-mode unseed call seed!(xdual, x, i) only needs to clear the
N-wide chunk seeded at i: the rest of the array is zeroed up front and
every other chunk clears itself. ForwardDiff 0.10 wrote exactly N
elements here; the 1.x rewrite made it write from i to the end of the
array, i.e. O(n^2/2N) redundant dual writes per chunked
gradient/jacobian sweep (~40 GB of memory traffic for gradient! of
100000 elements at chunk 12).
Write at most N elements starting at index: Iterators.take(_, N) on the
structural path, a clamped range on the dense path. The 3-arg seed! form
has no other callers.
gradient! n=1000 (chunk 12): 344 -> 222 us; n=100000: 3.96 -> 2.61 s.
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Vx7zQ96NYk4VV4ML2s3kAC
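The write-count estimate in this commit message can be checked with a small cost model. This is a sketch of the arithmetic only, not ForwardDiff code: the function names are made up for illustration, and the 104-byte dual size assumes one Float64 value plus 12 Float64 partials.

```julia
# Toy cost model: count the dual writes done by the unseed step over one
# chunked sweep of an n-element input with chunk size N.

# 1.x behavior described above: each unseed writes from `i` to the end.
writes_to_end(n, N) = sum(n - i + 1 for i in 1:N:n)

# 0.10 / restored behavior: each unseed writes at most N elements.
writes_chunk_only(n, N) = sum(min(N, n - i + 1) for i in 1:N:n)

n, N = 100_000, 12
bytes_per_dual = 8 * (N + 1)   # assumption: Float64 value + N Float64 partials

to_end = writes_to_end(n, N)
chunk_only = writes_chunk_only(n, N)

println("writes (to end):     ", to_end, "  vs n^2/2N = ", n^2 / (2N))
println("writes (chunk only): ", chunk_only)
println("traffic (to end):     ", to_end * bytes_per_dual / 1e9, " GB")
println("traffic (chunk only): ", chunk_only * bytes_per_dual / 1e6, " MB")
```

With these inputs the to-the-end variant performs roughly 4.2e8 writes (about 43 GB at 104 bytes each), while the chunk-only variant performs exactly n writes, which is consistent with the ~40 GB figure quoted above.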
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #821 +/- ##
==========================================
+ Coverage 90.74% 90.81% +0.06%
==========================================
Files 11 11
Lines 1070 1089 +19
==========================================
+ Hits 971 989 +18
- Misses 99 100 +1
CI: all 19 Julia 1 / lts / min-patch jobs green. The 6 "Julia pre" failures are the pre-existing Julia 1.13-rc1 + JET issue that also fails on master; root-cause analysis in #816 (comment). This commit introduces no new failures relative to the #816 branch it stacks on.
Summary
Stacked on #816 (first commit is that PR; review only the last commit). Depends on #816 landing first.
In ForwardDiff 0.10, the chunk-mode "unseed" call seed!(duals, x, index, seed) wrote exactly the N chunk elements starting at index. The 1.x rewrite changed it to write from index to the end of the array.
Chunk mode calls this after every chunk (seed!(xdual, x, i) in chunk_mode_gradient_expr / chunk_mode_jacobian_expr) purely to clear the chunk it just seeded: the rest of the array is already unseeded (everything is zeroed once up front, and every other chunk clears itself). So writing to the end does Σ(n − i) ≈ n²/2N redundant dual writes per gradient/jacobian sweep. At n = 100000 with chunk 12 and Dual{…,Float64,12} (104 bytes each) that is ~40 GB of memory traffic.
This restores the 0.10 behavior: write at most N elements starting at index (Iterators.take(…, N) on the structural path, a clamped range on the dense path).
The 3-arg seed! form has no other callers in the package.
Benchmarks
Julia 1.12.4, gradient! with default chunk 12, median of 7 runs at n=100000:
gradient! n=1000: 344 -> 222 us
gradient! n=100000: 3.96 -> 2.61 s
jacobian! is extraction-dominated at these sizes, so its unseed saving (~2%) is within run-to-run noise.
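The "at most N elements starting at index" rule from the summary can be illustrated with a toy model. This is a sketch under stated assumptions, not the PR's implementation: `unseed_dense!`, `unseed_structural!` and `sweep!` are hypothetical names, plain Float64 entries stand in for duals, and `eachindex` stands in for the package's structural index iterator.

```julia
# Dense path stand-in: clamp the chunk range so a short final chunk stays in bounds.
function unseed_dense!(duals::Vector{Float64}, index::Int, N::Int)
    r = index:min(index + N - 1, lastindex(duals))
    fill!(view(duals, r), 0.0)
    return length(r)              # number of elements written
end

# Structural path stand-in: skip to `index`, then take at most N indices.
function unseed_structural!(duals::AbstractVector{Float64}, index::Int, N::Int)
    written = 0
    for i in Iterators.take(Iterators.drop(eachindex(duals), index - 1), N)
        duals[i] = 0.0
        written += 1
    end
    return written
end

# Simulate a chunked sweep: seed one chunk, then clear only that chunk.
function sweep!(unseed!, n::Int, N::Int)
    duals = zeros(n)              # everything starts unseeded
    total = 0
    for i in 1:N:n
        duals[i:min(i + N - 1, n)] .= 1.0   # seed the chunk
        total += unseed!(duals, i, N)       # clear it again
    end
    return duals, total
end

for n in (5, 12, 13, 24, 25, 100), N in (1, 3, 5, 12)
    for unseed! in (unseed_dense!, unseed_structural!)
        duals, total = sweep!(unseed!, n, N)
        @assert all(iszero, duals)   # buffer is fully cleared after the sweep
        @assert total == n           # one unseed write per element, no more
    end
end
```

In this model both variants leave the buffer fully cleared while writing each element exactly once per sweep, including when N does not divide n.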
Tests
No new tests: chunk-vs-vector-mode consistency (including non-divisible chunk sizes and
UpperTriangular/Diagonal inputs) is covered by the existing suite, which passes on Julia 1.10 and 1.12 locally. Verified additionally that chunked results match
full-chunk references for n ∈ {5, 12, 13, 24, 25, 100} × chunk ∈ {1, 3, 5, 12}.
Note
Opened as a draft by an agent on behalf of @ChrisRackauckas. Please ignore
until reviewed by @ChrisRackauckas.
🤖 Generated with Claude Code