Use CUDA atomics in GPU rasterize kernels (#2167)#2198
Conversation
The GPU point and line burn kernels did non-atomic read-modify-write on per-pixel state, so overlapping geometries produced nondeterministic results that disagreed with the numpy backend. Switch the per-pixel write strategy to CUDA atomics for the six built-in aggregators: - sum, count: cuda.atomic.add on the output buffer. - min, max: cuda.atomic.min / cuda.atomic.max on the output buffer. - first, last: two-pass. Pass 1 resolves the per-pixel winning input index with cuda.atomic.min / cuda.atomic.max on an index buffer; pass 2 stamps the winner's value into the output. The per-merge buffers are initialised up front (zero / +inf / -inf / fill) and a host-side post-pass blends fill back into untouched pixels. User-supplied merge callables keep the previous non-atomic closure path, since the public merge_fn signature has no atomic equivalent. The numpy and dask+numpy backends are unchanged.
Cover the six built-in aggregators (last, first, sum, count, min, max) across three overlap scenarios: coincident points, crossing line segments, and duplicate segments. Assert that the cupy output matches the numpy output exactly and that repeated cupy runs produce identical arrays.
brendancol
left a comment
There was a problem hiding this comment.
PR Review: Use CUDA atomics in GPU rasterize kernels (#2167)
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
xrspatial/rasterize.py:1569-1577-- In_run_cupythe boundary-burn block usespoly_geomsandpoly_idsfrom the polygon-fill block above. The boundary path runs wheneverall_touched and poly_geomsis truthy, andpoly_geomscomes from_classify_geometriesearlier, so it's defined even if no edges crossed the raster. The same shape appears in_rasterize_tile_cupy:2008-2031, wherepoly_geomsis bound insideif poly_wkb:and only used whenpoly_wkb and all_touched. Safe today, fragile to future refactors. Worth nesting the boundary block inside the polygon block, or adding a comment, so the dependency is explicit.xrspatial/rasterize.py:1083-1109--_apply_merge_gpunow branches onatomic_modeat compile time, so each merge produces a separate compiled kernel. Fine. Butwrittenis passed into every kernel and only read in the legacy user-callable branch. For atomic modes it's an(H, W) int8allocation that the kernels never touch. Worth skipping thewrittenallocation for atomic modes. Memory is small but it's a clean win.
Nits (optional improvements)
xrspatial/rasterize.py:1079-1080-- The comment block says`pass_id` selects which branch executes (kernels are compiled once per pass). The implementation has nopass_idparameter; pass 1 and pass 2 are separate kernels named..._gpuand..._gpu_pass2. Update the comment to match the code.xrspatial/tests/test_rasterize_gpu_race_2167.py-- Anall_touched=Truescenario with overlapping polygon boundaries would be a useful addition. The cross-backend parity test would already catch a regression there, but an explicit scenario makes the coverage clearer for the next reader.
What looks good
- The atomic strategy follows the fix sketched in issue #2167: atomic add for sum/count, atomic min/max on the value buffer for min/max, two-pass index-resolution for first/last.
- The numpy and dask+numpy paths are untouched.
- The kernel cache key now includes
merge_name, so atomic and non-atomic kernels coexist without collision. - Tests cover all six aggregators across three overlap scenarios, with both cross-backend parity and per-run determinism.
- The legacy non-atomic closure path is kept for user callables, which is the right call given the public
merge_fnsignature. _gpu_finalize_buffershandles the +inf / -inf initialisers for min/max so the user-visible fill semantics stay consistent with numpy.
Checklist
- Algorithm matches the strategy described in issue #2167.
- All six built-in aggregators produce results consistent with the numpy backend (verified by tests).
- NaN / fill handling is correct (finalize blends fill into untouched pixels).
- Edge cases are covered (coincident points, crossing lines, duplicate segments).
- [n/a] Dask chunk boundaries: tiles partition the output grid, so per-tile atomicity is sufficient. dask+cupy uses the same kernel set via
_rasterize_tile_cupy. - No premature materialization. Device transfers are staged once and reused for pass 2.
- [n/a] Benchmark: this is a correctness fix, not a perf change.
- [n/a] README feature matrix: no public API changed.
- Docstrings updated (
_ensure_gpu_kernelsand_run_cupydescribe the atomic path).
- Skip the (H, W) ``written`` allocation when the atomic kernel path is selected. Atomic modes never read or write ``written``; a (1, 1) placeholder satisfies the kernel signature without spending H*W bytes on dead storage. - Nest the all_touched boundary launch inside the ``if poly_wkb:`` block in ``_rasterize_tile_cupy`` so the dependency on ``poly_geoms`` / ``poly_ids`` is local. Add a parallel comment to ``_run_cupy`` explaining why the same shape is safe there (``poly_geoms`` comes from ``_classify_geometries`` unconditionally). - Fix the stale ``pass_id`` comment in ``_ensure_gpu_kernels``; pass 1 and pass 2 are separate compiled kernels, not branches selected by a runtime ``pass_id``. - Add a shared-boundary polygon scenario (cross-backend parity plus per-run determinism) to the regression tests. This is the polygon analogue of the line-overlap case and exercises the all_touched scanline-plus-boundary write pattern.
brendancol
left a comment
There was a problem hiding this comment.
Follow-up review (after 2842709)
All four findings from the prior review are addressed.
Disposition
- Suggestion 1 (fragile
poly_geomsdependency in boundary block) -- fixed._rasterize_tile_cupynow stages the boundary launch inside theif poly_wkb:block so the dependency is local._run_cupygot an explicit comment noting thatpoly_geomscomes from_classify_geometriesunconditionally and the same shape is safe there. - Suggestion 2 (
writtenallocation wasted for atomic modes) -- fixed. Atomic modes now get a(1, 1) int8placeholder; the kernels never touch it for those modes, and the legacy user-callable path still gets a full(H, W)allocation. - Nit 1 (stale
pass_idcomment) -- fixed. Comment now describes the actual two-kernel layout. - Nit 2 (no
all_touched=Truepolygon scenario) -- fixed. Added shared-boundary polygon tests covering cross-backend parity and per-run determinism for all six aggregators.
Verification
- 50 tests pass in
xrspatial/tests/test_rasterize_gpu_race_2167.py(up from 38 in the first pass). - 325 tests pass across the full rasterize suite (
test_rasterize.py,test_rasterize_accuracy.py,test_rasterize_coverage_2026_05_17.py,test_rasterize_tile_props_slice_2020.py,test_rasterize_gpu_race_2167.py).
|
Heads up on CI: the three failing These two failures are pre-existing on Everything in this PR's surface area (the new GPU rasterize race tests, the existing rasterize suite) passes locally on the merged branch. Happy to address the polygonize failures in a separate change if useful, but they did not regress here. |
# Conflicts: # xrspatial/rasterize.py
Summary
read-modify-write on per-pixel state, so overlapping geometries
returned values that disagreed with the numpy backend and varied
between runs. Switch them to CUDA atomics for the six built-in
aggregators (last, first, sum, count, min, max).
input index with atomic min/max on an index buffer; pass 2 stamps
the value in.
segments, and duplicate segments, with both cross-backend parity
and per-run determinism checks.
Backends touched: cupy, dask+cupy. numpy and dask+numpy paths are
unchanged.
Test plan
xrspatial/tests/test_rasterize_gpu_race_2167.pypass on the local GPU.
Closes #2167