Use CUDA atomics in GPU rasterize kernels (#2167) by brendancol · Pull Request #2198 · xarray-contrib/xarray-spatial

brendancol · 2026-05-20T15:35:16Z

Summary

- The GPU point and line rasterize kernels used non-atomic
  read-modify-write on per-pixel state, so overlapping geometries
  returned values that disagreed with the numpy backend and varied
  between runs. Switch them to CUDA atomics for the six built-in
  aggregators (last, first, sum, count, min, max).
- first / last use a two-pass scheme: pass 1 resolves the winning
  input index with atomic min/max on an index buffer; pass 2 stamps
  the value in.
- Add regression tests covering coincident points, crossing line
  segments, and duplicate segments, with both cross-backend parity
  and per-run determinism checks.
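
To illustrate why a plain read-modify-write loses updates when geometries overlap, here is a minimal pure-Python sketch that replays one fixed worst-case interleaving for a single pixel. It is a sequential model, not GPU code, and the names `racy_sum` and `atomic_sum` are hypothetical rather than taken from the PR.

```python
def racy_sum(pixel, contributions):
    """Model a non-atomic read-modify-write in which every thread reads
    the pixel before any thread writes it back (worst-case interleaving)."""
    # Phase 1: all threads read the same stale value.
    reads = [pixel for _ in contributions]
    # Phase 2: each thread writes stale + its own value; the last write wins.
    for stale, value in zip(reads, contributions):
        pixel = stale + value
    return pixel


def atomic_sum(pixel, contributions):
    """Model an atomic add: each read-modify-write is indivisible, so
    every contribution lands regardless of scheduling order."""
    for value in contributions:
        pixel = pixel + value
    return pixel


overlapping = [1.0, 2.0, 4.0]  # three geometries covering one pixel
print(racy_sum(0.0, overlapping))    # 4.0: two updates were lost
print(atomic_sum(0.0, overlapping))  # 7.0: all updates applied
```

On real hardware the racy result depends on thread scheduling, which is why it can also differ from run to run.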

Backends touched: cupy, dask+cupy. numpy and dask+numpy paths are
unchanged.

Test plan

New tests in xrspatial/tests/test_rasterize_gpu_race_2167.py
pass on the local GPU.
Existing rasterize tests still pass on the local GPU.

The GPU point and line burn kernels did non-atomic read-modify-write on per-pixel state, so overlapping geometries produced nondeterministic results that disagreed with the numpy backend. Switch the per-pixel write strategy to CUDA atomics for the six built-in aggregators:

- sum, count: cuda.atomic.add on the output buffer.
- min, max: cuda.atomic.min / cuda.atomic.max on the output buffer.
- first, last: two-pass. Pass 1 resolves the per-pixel winning input index with cuda.atomic.min / cuda.atomic.max on an index buffer; pass 2 stamps the winner's value into the output.

The per-merge buffers are initialised up front (zero / +inf / -inf / fill) and a host-side post-pass blends fill back into untouched pixels. User-supplied merge callables keep the previous non-atomic closure path, since the public merge_fn signature has no atomic equivalent. The numpy and dask+numpy backends are unchanged.
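
The two-pass first / last scheme can be sketched as a sequential reference model in plain Python. This is an illustration under assumptions, not the PR's kernel code: `resolve_first_last`, `NO_WINNER`, and the `(pixel, input_index, value)` hit tuples are hypothetical, and the per-pixel `min` / `max` reduction stands in for the atomic min/max on the index buffer. Because min and max are commutative, the result does not depend on the order in which hits are processed.

```python
import random

NO_WINNER = -1  # hypothetical sentinel meaning "no input touched this pixel"


def resolve_first_last(pixel_hits, n_pixels, mode):
    """pixel_hits: iterable of (pixel, input_index, value) in ANY order.
    Returns the per-pixel value of the first or last input to touch it."""
    hits = list(pixel_hits)
    # Pass 1: reduce input indices per pixel. min -> 'first', max -> 'last'.
    # On the GPU this reduction would be an atomic min/max on an index buffer.
    pick = min if mode == "first" else max
    winner = [NO_WINNER] * n_pixels
    for pixel, idx, _ in hits:
        if winner[pixel] == NO_WINNER:
            winner[pixel] = idx
        else:
            winner[pixel] = pick(winner[pixel], idx)
    # Pass 2: only the hit whose index won pass 1 stamps its value.
    out = [None] * n_pixels
    for pixel, idx, value in hits:
        if winner[pixel] == idx:
            out[pixel] = value
    return out


hits = [(0, 0, 10.0), (0, 1, 20.0), (1, 1, 20.0), (0, 2, 30.0)]
shuffled = hits[:]
random.Random(0).shuffle(shuffled)  # simulate a different thread schedule
assert resolve_first_last(hits, 3, "first") == resolve_first_last(shuffled, 3, "first")
print(resolve_first_last(shuffled, 3, "first"))  # [10.0, 20.0, None]
print(resolve_first_last(shuffled, 3, "last"))   # [30.0, 20.0, None]
```

Splitting the work into two passes means the value write in pass 2 has exactly one eligible writer per pixel, so it needs no further synchronisation in this model.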

Cover the six built-in aggregators (last, first, sum, count, min, max) across three overlap scenarios: coincident points, crossing line segments, and duplicate segments. Assert that the cupy output matches the numpy output exactly and that repeated cupy runs produce identical arrays.

brendancol

PR Review: Use CUDA atomics in GPU rasterize kernels (#2167)

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

xrspatial/rasterize.py:1569-1577 -- In _run_cupy the boundary-burn block uses poly_geoms and poly_ids from the polygon-fill block above. The boundary path runs whenever all_touched and poly_geoms is truthy, and poly_geoms comes from _classify_geometries earlier, so it's defined even if no edges crossed the raster. The same shape appears in _rasterize_tile_cupy:2008-2031, where poly_geoms is bound inside if poly_wkb: and only used when poly_wkb and all_touched. Safe today, fragile to future refactors. Worth nesting the boundary block inside the polygon block, or adding a comment, so the dependency is explicit.
xrspatial/rasterize.py:1083-1109 -- _apply_merge_gpu now branches on atomic_mode at compile time, so each merge produces a separate compiled kernel. Fine. But written is passed into every kernel and only read in the legacy user-callable branch. For atomic modes it's an (H, W) int8 allocation that the kernels never touch. Worth skipping the written allocation for atomic modes. Memory is small but it's a clean win.
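
One way to act on the `written` suggestion is to pick the buffer shape from the merge mode before allocating. The sketch below only computes shapes and byte counts; `ATOMIC_MERGES`, `written_shape`, and `written_nbytes` are hypothetical names, and treating any other merge name as a user callable is an assumption of this example.

```python
# Assumed set of merges that take the atomic kernel path.
ATOMIC_MERGES = frozenset({"sum", "count", "min", "max", "first", "last"})


def written_shape(merge_name, height, width):
    """Shape of the int8 `written` buffer to allocate for a kernel launch."""
    if merge_name in ATOMIC_MERGES:
        # Atomic kernels never read or write `written`; a 1x1 placeholder
        # only satisfies the kernel signature.
        return (1, 1)
    # The user-callable path still needs one flag per pixel.
    return (height, width)


def written_nbytes(merge_name, height, width):
    rows, cols = written_shape(merge_name, height, width)
    return rows * cols  # int8: one byte per element


print(written_nbytes("sum", 4096, 4096))     # 1
print(written_nbytes("custom", 4096, 4096))  # 16777216
```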

Nits (optional improvements)

xrspatial/rasterize.py:1079-1080 -- The comment block says `pass_id` selects which branch executes (kernels are compiled once per pass). The implementation has no pass_id parameter; pass 1 and pass 2 are separate kernels named ..._gpu and ..._gpu_pass2. Update the comment to match the code.
xrspatial/tests/test_rasterize_gpu_race_2167.py -- An all_touched=True scenario with overlapping polygon boundaries would be a useful addition. The cross-backend parity test would already catch a regression there, but an explicit scenario makes the coverage clearer for the next reader.

What looks good

The atomic strategy follows the fix sketched in issue #2167: atomic add for sum/count, atomic min/max on the value buffer for min/max, two-pass index-resolution for first/last.
The numpy and dask+numpy paths are untouched.
The kernel cache key now includes merge_name, so atomic and non-atomic kernels coexist without collision.
Tests cover all six aggregators across three overlap scenarios, with both cross-backend parity and per-run determinism.
The legacy non-atomic closure path is kept for user callables, which is the right call given the public merge_fn signature.
_gpu_finalize_buffers handles the +inf / -inf initialisers for min/max so the user-visible fill semantics stay consistent with numpy.
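
The last point can be pictured with a small model of sentinel initialisation followed by a finalize step. This is a sketch, not the body of `_gpu_finalize_buffers`: `SENTINELS`, `init_buffer`, and `finalize` are hypothetical names, and plain `min` stands in for the atomic min. One limitation of the sketch is that a genuine input equal to the sentinel would be indistinguishable from an untouched pixel.

```python
import math

# Assumed initial values: +inf for min, -inf for max.
SENTINELS = {"min": math.inf, "max": -math.inf}


def init_buffer(merge_name, n_pixels):
    """Fill the output with a value that any real input will beat."""
    return [SENTINELS[merge_name]] * n_pixels


def finalize(buffer, merge_name, fill):
    """Replace pixels still holding the sentinel with the user's fill."""
    sentinel = SENTINELS[merge_name]
    return [fill if v == sentinel else v for v in buffer]


buf = init_buffer("min", 4)
for pixel, value in [(1, 5.0), (1, 2.0), (3, 9.0)]:
    buf[pixel] = min(buf[pixel], value)  # stands in for an atomic min
print(finalize(buf, "min", fill=-1.0))  # [-1.0, 2.0, -1.0, 9.0]
```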

Checklist

Algorithm matches the strategy described in issue #2167.
All six built-in aggregators produce results consistent with the numpy backend (verified by tests).
NaN / fill handling is correct (finalize blends fill into untouched pixels).
Edge cases are covered (coincident points, crossing lines, duplicate segments).
[n/a] Dask chunk boundaries: tiles partition the output grid, so per-tile atomicity is sufficient. dask+cupy uses the same kernel set via _rasterize_tile_cupy.
No premature materialization. Device transfers are staged once and reused for pass 2.
[n/a] Benchmark: this is a correctness fix, not a perf change.
[n/a] README feature matrix: no public API changed.
Docstrings updated (_ensure_gpu_kernels and _run_cupy describe the atomic path).
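
The Dask chunk-boundary item rests on tiles partitioning the output grid, so no two tiles ever write the same pixel. A minimal check of that property, using hypothetical helpers `tile_bounds` and `owners` and a regular tiling as an assumption, might look like this:

```python
def tile_bounds(height, width, tile_h, tile_w):
    """Yield (row0, row1, col0, col1) half-open tile extents."""
    for r0 in range(0, height, tile_h):
        for c0 in range(0, width, tile_w):
            yield r0, min(r0 + tile_h, height), c0, min(c0 + tile_w, width)


def owners(height, width, tile_h, tile_w):
    """Count how many tiles own each pixel; a partition gives exactly 1."""
    count = [[0] * width for _ in range(height)]
    for r0, r1, c0, c1 in tile_bounds(height, width, tile_h, tile_w):
        for r in range(r0, r1):
            for c in range(c0, c1):
                count[r][c] += 1
    return count


counts = owners(5, 7, 2, 3)
print(all(n == 1 for row in counts for n in row))  # True
```

When every pixel has exactly one owning tile, atomics only need to order writes within a tile's own kernel launches.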

- Skip the (H, W) ``written`` allocation when the atomic kernel path is selected. Atomic modes never read or write ``written``; a (1, 1) placeholder satisfies the kernel signature without spending H*W bytes on dead storage.
- Nest the all_touched boundary launch inside the ``if poly_wkb:`` block in ``_rasterize_tile_cupy`` so the dependency on ``poly_geoms`` / ``poly_ids`` is local. Add a parallel comment to ``_run_cupy`` explaining why the same shape is safe there (``poly_geoms`` comes from ``_classify_geometries`` unconditionally).
- Fix the stale ``pass_id`` comment in ``_ensure_gpu_kernels``; pass 1 and pass 2 are separate compiled kernels, not branches selected by a runtime ``pass_id``.
- Add a shared-boundary polygon scenario (cross-backend parity plus per-run determinism) to the regression tests. This is the polygon analogue of the line-overlap case and exercises the all_touched scanline-plus-boundary write pattern.
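
The nesting change for the boundary launch can be pictured with a small control-flow sketch. It only models which launches are planned for a tile; `plan_tile_launches` and the `("fill", ...)` / `("boundary", ...)` tuples are hypothetical, and the list comprehension is a stand-in for decoding the polygon WKB.

```python
def plan_tile_launches(poly_wkb, all_touched):
    """Return the ordered list of launches for one tile.
    `poly_wkb` is a list of encoded polygons (possibly empty)."""
    launches = []
    if poly_wkb:
        # poly_geoms / poly_ids are bound only inside this block.
        poly_geoms = [bytes(g) for g in poly_wkb]  # stand-in for decoding
        poly_ids = list(range(len(poly_geoms)))
        launches.append(("fill", poly_ids))
        if all_touched:
            # Nested, so it cannot run without the names bound above.
            launches.append(("boundary", poly_ids))
    return launches


print(plan_tile_launches([b"a", b"b"], True))   # [('fill', [0, 1]), ('boundary', [0, 1])]
print(plan_tile_launches([b"a", b"b"], False))  # [('fill', [0, 1])]
print(plan_tile_launches([], True))             # []
```

With the boundary step nested, a later refactor of the outer condition cannot leave it referring to unbound names.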

brendancol

Follow-up review (after `2842709`)

All four findings from the prior review are addressed.

Disposition

Suggestion 1 (fragile poly_geoms dependency in boundary block) -- fixed. _rasterize_tile_cupy now stages the boundary launch inside the if poly_wkb: block so the dependency is local. _run_cupy got an explicit comment noting that poly_geoms comes from _classify_geometries unconditionally and the same shape is safe there.
Suggestion 2 (written allocation wasted for atomic modes) -- fixed. Atomic modes now get a (1, 1) int8 placeholder; the kernels never touch it for those modes, and the legacy user-callable path still gets a full (H, W) allocation.
Nit 1 (stale pass_id comment) -- fixed. Comment now describes the actual two-kernel layout.
Nit 2 (no all_touched=True polygon scenario) -- fixed. Added shared-boundary polygon tests covering cross-backend parity and per-run determinism for all six aggregators.

Verification

50 tests pass in xrspatial/tests/test_rasterize_gpu_race_2167.py (up from 38 in the first pass).
325 tests pass across the full rasterize suite (test_rasterize.py, test_rasterize_accuracy.py, test_rasterize_coverage_2026_05_17.py, test_rasterize_tile_props_slice_2020.py, test_rasterize_gpu_race_2167.py).

brendancol · 2026-05-20T16:01:33Z

Heads up on CI: the three `run (..., 3.14)` jobs fail on test_polygonize_dask_multi_chunk_default_tolerance and test_polygonize_dask_multi_chunk_strict_float in xrspatial/tests/test_polygonize.py (assert 2 == 3).

These two failures are pre-existing on main, not introduced by the atomics work in this PR. The same two failures show up on the unrelated PR #2196 against the same main head. Confirmed by checking recent main runs: the polygonize tests have been red there independently.

Everything in this PR's surface area (the new GPU rasterize race tests, the existing rasterize suite) passes locally on the merged branch. Happy to address the polygonize failures in a separate change if useful, but they did not regress here.

# Conflicts: # xrspatial/rasterize.py

brendancol added 2 commits May 20, 2026 08:34

github-actions Bot added the performance PR touches performance-sensitive code label May 20, 2026

Merge remote-tracking branch 'origin/main' into issue-2167

684099f

Merge remote-tracking branch 'origin/main' into issue-2167

aa2e3f1

brendancol merged commit bd6583a into main May 21, 2026
4 of 5 checks passed

brendancol merged 5 commits into main from issue-2167
